llamacpp_on_dragon_wing_q6a.../README.md

# Q6A Hexagon v68 + llama.cpp — Complete Guide

This repo documents how to get llama.cpp running with the **Qualcomm Hexagon CDSP v68** (NPU/DSP) backend on a **Radxa Dragon Q6A** board (SA8775P).

## Overview

The Q6A has a Qualcomm QCS6490 SoC with a Hexagon CDSP v68 that can accelerate matrix operations in llama.cpp via FastRPC. The key insight from weeks of debugging: **let libcdsprpc handle `FASTRPC_IOCTL_INIT_CREATE` internally** — do NOT attempt it manually. Use the system's `libcdsprpc.so`, not the SDK's cross-compiled version.

## Prerequisites

### Build Machine (x86_64)
- Ubuntu 24.04 (or similar with cross-compilation packages)
- Packages:
  ```bash
  sudo apt install gcc-aarch64-linux-gnu g++-aarch64-linux-gnu ninja-build cmake
  sudo apt install libc6-arm64-cross libc6-dev-arm64-cross
  ```
- Qualcomm Hexagon SDK 5.5.6.0 (with Tools 8.7.06) at `/local/mnt/workspace/Qualcomm/Hexagon_SDK/5.5.6.0/`
  - Must include `hexagon-clang` at `tools/HEXAGON_Tools/8.7.06/Tools/bin/`
  - Must include `qaic` IDL compiler at `tools/qaic/bin/qaic`
  - Must include `incs/` with SDK headers
  - Must include `ipc/fastrpc/` with libcdsprpc and rpcmem headers

### Target Machine (Q6A — aarch64)
- Radxa Dragon Q6A (SA8775P) running Ubuntu 24.04
- `fastrpc` package installed: `sudo apt install fastrpc fastrpc-test`
- User `radxa` in `render` group (for `/dev/fastrpc-cdsp-secure` access)
- CDSP firmware running: `cat /sys/class/remoteproc/remoteproc1/state` → `running`

## Quick Start

### 1. Build llama.cpp with Hexagon backend

```bash
cd ~/llama.cpp
bash scripts/build-hexagon.sh
```

This cross-compiles llama.cpp for aarch64 with `-DGGML_HEXAGON=ON`. Output goes to `build-hexagon/bin/`.

### 2. Deploy to Q6A

```bash
# Deploy ARM64 binaries
scp build-hexagon/bin/llama-cli radxa@192.168.1.11:~/llama/bin/
scp build-hexagon/bin/libggml*.so* radxa@192.168.1.11:~/llama/bin/
scp build-hexagon/bin/libllama.so* radxa@192.168.1.11:~/llama/bin/

# Deploy DSP skel
scp build-hexagon/ggml/src/ggml-hexagon/libggml-htp-v68.so radxa@192.168.1.11:/tmp/
ssh radxa@192.168.1.11 "sudo cp /tmp/libggml-htp-v68.so /usr/lib/dsp/cdsp/"
```

### 3. Run inference test

```bash
ssh radxa@192.168.1.11
cd ~/llama/bin
GGML_HEXAGON=1 LD_LIBRARY_PATH=. ./llama-cli \
    -m ~/models/llama-3.2-1b-q4km.gguf \
    -n 32 -p "Hello, what is your name?" -ngl 0
```

Expected output:
```
ggml-hex: Loading driver libcdsprpc.so
ggml-hex: Hexagon Arch version v68
ggml-hex: new session: HTP0 : session-id 0 domain-id 3 ...
[ Prompt: 32.8 t/s | Generation: 4.5 t/s ]
```

## Build Script Details

The `scripts/build-hexagon.sh` script:

1. **CMake configure** with:
   - `-DCMAKE_BUILD_TYPE=Release`
   - `-DBUILD_SHARED_LIBS=ON` (required for HTP plugin .so)
   - `-DCMAKE_INSTALL_RPATH='$ORIGIN'` (libraries alongside binary)
   - `-DGGML_HEXAGON=ON`
   - `-DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_SERVER=OFF`
   - `-DMAX_DOMAIN_NAMELEN=64` on both C and CXX flags

2. **Do NOT set `CMAKE_SYSROOT`** — the cross-compiler's own linker scripts conflict with `--sysroot` on Ubuntu's `gcc-aarch64-linux-gnu` packages.

3. **Do NOT set explicit OpenSSL paths** — they're unnecessary when `LLAMA_BUILD_SERVER=OFF`.

## Critical Lessons Learned

### Root Cause of `remote_handle64_open` Error 0xe

The error occurs because **the SDK's cross-compiled `libcdsprpc.so` does NOT handle `FASTRPC_IOCTL_INIT_CREATE` internally** for unsigned PDs. The Q6A system `/usr/lib/libcdsprpc.so.1` does. The fix: compile and link natively on the Q6A (or link against the system library).

### Do NOT Call INIT_CREATE Manually

Attempting `FASTRPC_IOCTL_INIT_CREATE` via ioctl on `/dev/fastrpc-cdsp-secure` always returns EINVAL because the kernel expects the struct to be set up by libcdsprpc's internal state machine. The correct approach:

```c
/* ONLY these two calls are needed — libcdsprpc handles INIT_CREATE */
remote_session_control(DSPRPC_CONTROL_UNSIGNED_MODULE, ...);
remote_handle64_open(uri, &handle);
```

### Verified Q6A Constants

| Item | Value |
|------|-------|
| CDSP device node | `/dev/fastrpc-cdsp-secure` |
| Shell path | `/usr/lib/dsp/cdsp/fastrpc_shell_unsigned_3` |
| Domain ID | `CDSP_DOMAIN_ID` = 3 |
| Unsigned module flag | `FASTRPC_MODE_UNSIGNED_MODULE` = `(1 << 3)` = 0x8 |
| DSP .so path | `/usr/lib/dsp/cdsp/` |
| System libcdsprpc | `/usr/lib/libcdsprpc.so.1` (symlink at `/usr/lib/libcdsprpc.so` already exists) |
| Kernel header | `/usr/src/linux-headers-6.18.2-3-qcom/include/uapi/misc/fastrpc.h` |

### dspqueue Symbols

All required `dspqueue_*` symbols are present in the SA8775P system `libcdsprpc.so.1`:
`dspqueue_create`, `dspqueue_close`, `dspqueue_export`, `dspqueue_write`, `dspqueue_read`, etc.

### Cross-Compile Pitfalls

1. **`CMAKE_SYSROOT` breaks the linker** — Ubuntu's `gcc-aarch64-linux-gnu` packages install linker scripts with absolute paths (e.g., `/usr/aarch64-linux-gnu/lib/libm.so` contains `GROUP( /usr/aarch64-linux-gnu/lib/libm.so.6 ... )`). When `--sysroot` is set to `/usr/aarch64-linux-gnu`, these absolute paths double up to `/usr/aarch64-linux-gnu/usr/aarch64-linux-gnu/lib/...` and fail. Solution: let the compiler use its built-in sysroot.

2. **`rpcmem_init` is optional** when linked against `libcdsprpc.so`. The SDK's cross-compiled `libcdsprpc` only exports `rpcmem_alloc`/`free` but not `rpcmem_init`/`deinit`. The Q6A system libcdsprpc has them all.

## Why the NPU Isn't Accelerating Inference

After extensive testing, the Hexagon backend loads and initializes successfully but **never actually offloads any computation**. Every test shows:

```
llama_memory_breakdown_print: | - HTP0 (Hexagon) | 2048 = 2048 + ( 0 = 0 + 0 + 0) + 0 |
```

The NPU reports 2048 MiB but uses 0 MiB for model, context, and compute. Source code analysis reveals three reasons:

### 1. `offload_op` callback is NULL

The backend device struct registers `/* .offload_op = */ NULL`, so the scheduler never proactively moves tensors to the NPU. Even when `-ngl N` is specified, without this callback no layers get claimed.

```c
// ggml/src/ggml-hexagon/ggml-hexagon.cpp
const ggml_backend_device_t ggml_backend_hexagon_device_registration = {
    /* .name                  = */ GGML_HEXAGON_DEVICE_NAME,
    /* .description           = */ "Hexagon NPU",
    /* .get_memory            = */ ggml_backend_hexagon_device_get_memory,
    /* .get_version           = */ NULL,
    /* .get_best_device       = */ NULL,
    /* .get_device_for_tensor */ NULL,
    /* .offload_op            = */ NULL,     // <--- STUBBED
    /* .supports_op           = */ ggml_backend_hexagon_device_supports_op,
    ...
};
```

### 2. 2048 MiB limit is hardcoded, not queried

The 2GB "free memory" reported is a hardcoded constant, not a hardware query:

```c
// ggml/src/ggml-hexagon/ggml-hexagon.cpp
static void ggml_backend_hexagon_device_get_memory(ggml_backend_dev_t dev, size_t * free, size_t * total) {
    // ~2GB per session for now
    *free  = 2ULL * 1024 * 1024 * 1024;
    *total = *free;
    GGML_UNUSED(dev);
}
```

It never calls `rpcmem_alloc2()` or checks kernel ION/DMA-BUF heap sizes. The 2GB is a rough placeholder. On QCS6490 with 11GB system RAM, the CDSP carveout is typically 1-4 GB depending on firmware config, but the code never checks.

### 3. Q4_K_M is not a supported quantization for HTP

The Hexagon HTP kernels only support these quantization types:
- `GGML_TYPE_Q4_0`
- `GGML_TYPE_Q8_0`
- `GGML_TYPE_IQ4_NL`
- `GGML_TYPE_MXFP4`

**Q4_K_M is NOT in this list.** Every MUL_MAT operation with Q4_K_M weights fails the `supports_op` type check, regardless of buffer placement. This means even if you fix the buffer allocation path, the 7B Q4_K_M model still won't offload.

### Summary of Issues

| Issue | Root cause |
|-------|-----------|
| 0 MiB used for model/context/compute | `offload_op = NULL` in device registration |
| 2048 MiB cap | Hardcoded constant, not a FastRPC/ION query |
| Q4_K_M tensors don't offload | Q4_K_M not in HTP supported type list |
| Ops always rejected by supports_op | Chicken-and-egg: tensors never in Hexagon buffers |

## Performance Benchmarks

All tests on Radxa Dragon Q6A (QCS6490, 11GB RAM, 8x ARM Cortex cores, Ubuntu 24.04).

### 1B Model (Llama 3.2, Q4_K_M)

| Metric | CPU-only | With Hexagon backend |
|--------|----------|---------------------|
| Prompt processing | 32.3 t/s | 32.0 t/s |
| Generation | 4.5 t/s | 4.5 t/s |

No difference — CPU handles the 1B model natively.

### 7B Model (DeepSeek R1 Distill Qwen, Q4_K_M)

Need `-c 2048` or smaller to fit in 11GB RAM (model alone is 4460 MiB, KV cache at default 128K context adds 7GB).

| Context | Test | Prompt t/s | Gen t/s | Model | Context | Compute | Total |
|---------|------|-----------|---------|-------|---------|---------|-------|
| 2K | CPU | 2.7 | 1.9 | 4460 | 112 | 311 | 4883 |
| 2K | NPU | 2.8 | 1.8 | 4460 | 112 | 311 | 4883 |
| 32K | CPU | 2.7 | 1.9 | 4460 | 1792 | 396 | 6648 |
| 32K | NPU | 2.7 | 1.9 | 4460 | 1792 | 396 | 6648 |
| 64K | CPU | 2.5 | 1.8 | 4460 | 3584 | 429 | 8473 |
| 64K | NPU | 2.5 | 1.8 | 4460 | 3584 | 429 | 8473 |

**Every NPU vs CPU comparison is identical.** The Hexagon backend never offloads any tensors, so all computation runs on CPU in both cases.

Memory at 64K context (8.5 GiB total) approaches the 11 GiB ceiling but fits with swap support.

### What would need to change to get actual NPU offload

1. Implement `offload_op` callback in `ggml_backend_hexagon_device_registration`
2. Wire the repack buffer type for weight tensors so quantized weights land in Hexagon-accessible memory
3. Query actual rpcmem capacity instead of hardcoding 2GB
4. Use a Q4_0 or Q8_0 quantized model (Q4_K_M not supported by HTP kernels)

## Known Issues

- **Minimal stub library** (`htp_minimal_impl.c`) fails to load on the DSP with error `0x80000442` — the full `libggml-htp-v68.so` (generated by the cmake build from `ggml-hexagon/main.c`) works correctly.
- **DSP library is rebuilt every time** the cmake build runs.
- The `htp_iface.idl` declares `dst` as `in sequence<uint8>` (input-only) but it's an output buffer. Should be `rout`.

## Files in This Repo

| File | Purpose |
|------|---------|
| `src/test_fastrpc_fixed.c` | Corrected test harness with proper init sequence |
| `src/htp_minimal_impl.c` | Minimal DSP stub (for experimentation) |
| `scripts/build-hexagon.sh` | Cross-compile script for llama.cpp with GGML_HEXAGON=ON |
| `scripts/deploy-to-q6a.sh` | Deploy built binaries + DSP .so to Q6A |
| `scripts/test-on-q6a.sh` | Run full inference test on Q6A |
| `scripts/test-7b.sh` | Run 7B model benchmarks at various context sizes |
| `references/fastrpc.h` | Q6A kernel header (ioctl struct definitions) |
| `AGENTS.md` | Context for AI coding agents working with this codebase |
Initial commit: Q6A Hexagon v68 + llama.cpp guide Complete documentation for running llama.cpp with the Qualcomm Hexagon CDSP v68 NPU backend on a Radxa Dragon Q6A (SA8775P) board. Includes: - Corrected FastRPC test harness (libcdsprpc handles INIT_CREATE) - Minimal DSP stub library - Cross-compile build script for llama.cpp - Deploy and test scripts for Q6A - Kernel FastRPC header for reference - Comprehensive README with lessons learned Key findings: - Do NOT call FASTRPC_IOCTL_INIT_CREATE manually - Must link against Q6A system libcdsprpc (not SDK cross-compiled) - Build verified: 32 t/s prompt, 4.5 t/s generation on 1B model 2026-05-02 08:28:51 +00:00			`# Q6A Hexagon v68 + llama.cpp — Complete Guide`

			`This repo documents how to get llama.cpp running with the Qualcomm Hexagon CDSP v68 (NPU/DSP) backend on a Radxa Dragon Q6A board (SA8775P).`

			`## Overview`

			The Q6A has a Qualcomm QCS6490 SoC with a Hexagon CDSP v68 that can accelerate matrix operations in llama.cpp via FastRPC. The key insight from weeks of debugging: let libcdsprpc handle `FASTRPC_IOCTL_INIT_CREATE` internally — do NOT attempt it manually. Use the system's `libcdsprpc.so`, not the SDK's cross-compiled version.

			`## Prerequisites`

			`### Build Machine (x86_64)`
			`- Ubuntu 24.04 (or similar with cross-compilation packages)`
			`- Packages:`
			```bash
			`sudo apt install gcc-aarch64-linux-gnu g++-aarch64-linux-gnu ninja-build cmake`
			`sudo apt install libc6-arm64-cross libc6-dev-arm64-cross`
			```
			- Qualcomm Hexagon SDK 5.5.6.0 (with Tools 8.7.06) at `/local/mnt/workspace/Qualcomm/Hexagon_SDK/5.5.6.0/`
			- Must include `hexagon-clang` at `tools/HEXAGON_Tools/8.7.06/Tools/bin/`
			- Must include `qaic` IDL compiler at `tools/qaic/bin/qaic`
			- Must include `incs/` with SDK headers
			- Must include `ipc/fastrpc/` with libcdsprpc and rpcmem headers

			`### Target Machine (Q6A — aarch64)`
			`- Radxa Dragon Q6A (SA8775P) running Ubuntu 24.04`
			- `fastrpc` package installed: `sudo apt install fastrpc fastrpc-test`
			- User `radxa` in `render` group (for `/dev/fastrpc-cdsp-secure` access)
			- CDSP firmware running: `cat /sys/class/remoteproc/remoteproc1/state` → `running`

			`## Quick Start`

			`### 1. Build llama.cpp with Hexagon backend`

			```bash
			`cd ~/llama.cpp`
			`bash scripts/build-hexagon.sh`
			```

			This cross-compiles llama.cpp for aarch64 with `-DGGML_HEXAGON=ON`. Output goes to `build-hexagon/bin/`.

			`### 2. Deploy to Q6A`

			```bash
			`# Deploy ARM64 binaries`
			`scp build-hexagon/bin/llama-cli radxa@192.168.1.11:~/llama/bin/`
			`scp build-hexagon/bin/libggml.so radxa@192.168.1.11:~/llama/bin/`
			`scp build-hexagon/bin/libllama.so* radxa@192.168.1.11:~/llama/bin/`

			`# Deploy DSP skel`
			`scp build-hexagon/ggml/src/ggml-hexagon/libggml-htp-v68.so radxa@192.168.1.11:/tmp/`
			`ssh radxa@192.168.1.11 "sudo cp /tmp/libggml-htp-v68.so /usr/lib/dsp/cdsp/"`
			```

			`### 3. Run inference test`

			```bash
			`ssh radxa@192.168.1.11`
			`cd ~/llama/bin`
			`GGML_HEXAGON=1 LD_LIBRARY_PATH=. ./llama-cli \`
			`-m ~/models/llama-3.2-1b-q4km.gguf \`
			`-n 32 -p "Hello, what is your name?" -ngl 0`
			```

			`Expected output:`
			```
			`ggml-hex: Loading driver libcdsprpc.so`
			`ggml-hex: Hexagon Arch version v68`
			`ggml-hex: new session: HTP0 : session-id 0 domain-id 3 ...`
			`[ Prompt: 32.8 t/s \| Generation: 4.5 t/s ]`
			```

			`## Build Script Details`

			The `scripts/build-hexagon.sh` script:

			`1. CMake configure with:`
			- `-DCMAKE_BUILD_TYPE=Release`
			- `-DBUILD_SHARED_LIBS=ON` (required for HTP plugin .so)
			- `-DCMAKE_INSTALL_RPATH='$ORIGIN'` (libraries alongside binary)
			- `-DGGML_HEXAGON=ON`
			- `-DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_SERVER=OFF`
			- `-DMAX_DOMAIN_NAMELEN=64` on both C and CXX flags

			2. Do NOT set `CMAKE_SYSROOT` — the cross-compiler's own linker scripts conflict with `--sysroot` on Ubuntu's `gcc-aarch64-linux-gnu` packages.

			3. Do NOT set explicit OpenSSL paths — they're unnecessary when `LLAMA_BUILD_SERVER=OFF`.

			`## Critical Lessons Learned`

			### Root Cause of `remote_handle64_open` Error 0xe

Update with full NPU analysis and benchmarks Adds: - Detailed explanation of why Hexagon NPU doesn't accelerate inference - offload_op callback is NULL in ggml-hexagon.cpp - 2048 MiB limit is hardcoded, not hardware-queried - Q4_K_M not supported by HTP kernels (only Q4_0, Q8_0, IQ4_NL, MXFP4) - Full benchmark table: 1B and 7B models, 2K/32K/64K context, CPU vs NPU - All results show CPU and NPU identical within margin of error - 7B test script (test-7b.sh) - Updated deploy script with password handling for DSP .so - Performance baseline in AGENTS.md - Cross-compile pitfalls (CMAKE_SYSROOT, rpcmem_init) 2026-05-02 10:42:42 +00:00			The error occurs because the SDK's cross-compiled `libcdsprpc.so` does NOT handle `FASTRPC_IOCTL_INIT_CREATE` internally for unsigned PDs. The Q6A system `/usr/lib/libcdsprpc.so.1` does. The fix: compile and link natively on the Q6A (or link against the system library).
Initial commit: Q6A Hexagon v68 + llama.cpp guide Complete documentation for running llama.cpp with the Qualcomm Hexagon CDSP v68 NPU backend on a Radxa Dragon Q6A (SA8775P) board. Includes: - Corrected FastRPC test harness (libcdsprpc handles INIT_CREATE) - Minimal DSP stub library - Cross-compile build script for llama.cpp - Deploy and test scripts for Q6A - Kernel FastRPC header for reference - Comprehensive README with lessons learned Key findings: - Do NOT call FASTRPC_IOCTL_INIT_CREATE manually - Must link against Q6A system libcdsprpc (not SDK cross-compiled) - Build verified: 32 t/s prompt, 4.5 t/s generation on 1B model 2026-05-02 08:28:51 +00:00
			`### Do NOT Call INIT_CREATE Manually`

			Attempting `FASTRPC_IOCTL_INIT_CREATE` via ioctl on `/dev/fastrpc-cdsp-secure` always returns EINVAL because the kernel expects the struct to be set up by libcdsprpc's internal state machine. The correct approach:

			```c
			`/* ONLY these two calls are needed — libcdsprpc handles INIT_CREATE */`
			`remote_session_control(DSPRPC_CONTROL_UNSIGNED_MODULE, ...);`
			`remote_handle64_open(uri, &handle);`
			```

			`### Verified Q6A Constants`

			`\| Item \| Value \|`
			`\|------\|-------\|`
			\| CDSP device node \| `/dev/fastrpc-cdsp-secure` \|
			\| Shell path \| `/usr/lib/dsp/cdsp/fastrpc_shell_unsigned_3` \|
			\| Domain ID \| `CDSP_DOMAIN_ID` = 3 \|
			\| Unsigned module flag \| `FASTRPC_MODE_UNSIGNED_MODULE` = `(1 << 3)` = 0x8 \|
			\| DSP .so path \| `/usr/lib/dsp/cdsp/` \|
			\| System libcdsprpc \| `/usr/lib/libcdsprpc.so.1` (symlink at `/usr/lib/libcdsprpc.so` already exists) \|
			\| Kernel header \| `/usr/src/linux-headers-6.18.2-3-qcom/include/uapi/misc/fastrpc.h` \|

			`### dspqueue Symbols`

			All required `dspqueue_*` symbols are present in the SA8775P system `libcdsprpc.so.1`:
			`dspqueue_create`, `dspqueue_close`, `dspqueue_export`, `dspqueue_write`, `dspqueue_read`, etc.

Update with full NPU analysis and benchmarks Adds: - Detailed explanation of why Hexagon NPU doesn't accelerate inference - offload_op callback is NULL in ggml-hexagon.cpp - 2048 MiB limit is hardcoded, not hardware-queried - Q4_K_M not supported by HTP kernels (only Q4_0, Q8_0, IQ4_NL, MXFP4) - Full benchmark table: 1B and 7B models, 2K/32K/64K context, CPU vs NPU - All results show CPU and NPU identical within margin of error - 7B test script (test-7b.sh) - Updated deploy script with password handling for DSP .so - Performance baseline in AGENTS.md - Cross-compile pitfalls (CMAKE_SYSROOT, rpcmem_init) 2026-05-02 10:42:42 +00:00			`### Cross-Compile Pitfalls`
Initial commit: Q6A Hexagon v68 + llama.cpp guide Complete documentation for running llama.cpp with the Qualcomm Hexagon CDSP v68 NPU backend on a Radxa Dragon Q6A (SA8775P) board. Includes: - Corrected FastRPC test harness (libcdsprpc handles INIT_CREATE) - Minimal DSP stub library - Cross-compile build script for llama.cpp - Deploy and test scripts for Q6A - Kernel FastRPC header for reference - Comprehensive README with lessons learned Key findings: - Do NOT call FASTRPC_IOCTL_INIT_CREATE manually - Must link against Q6A system libcdsprpc (not SDK cross-compiled) - Build verified: 32 t/s prompt, 4.5 t/s generation on 1B model 2026-05-02 08:28:51 +00:00
Update with full NPU analysis and benchmarks Adds: - Detailed explanation of why Hexagon NPU doesn't accelerate inference - offload_op callback is NULL in ggml-hexagon.cpp - 2048 MiB limit is hardcoded, not hardware-queried - Q4_K_M not supported by HTP kernels (only Q4_0, Q8_0, IQ4_NL, MXFP4) - Full benchmark table: 1B and 7B models, 2K/32K/64K context, CPU vs NPU - All results show CPU and NPU identical within margin of error - 7B test script (test-7b.sh) - Updated deploy script with password handling for DSP .so - Performance baseline in AGENTS.md - Cross-compile pitfalls (CMAKE_SYSROOT, rpcmem_init) 2026-05-02 10:42:42 +00:00			1. `CMAKE_SYSROOT` breaks the linker — Ubuntu's `gcc-aarch64-linux-gnu` packages install linker scripts with absolute paths (e.g., `/usr/aarch64-linux-gnu/lib/libm.so` contains `GROUP( /usr/aarch64-linux-gnu/lib/libm.so.6 ... )`). When `--sysroot` is set to `/usr/aarch64-linux-gnu`, these absolute paths double up to `/usr/aarch64-linux-gnu/usr/aarch64-linux-gnu/lib/...` and fail. Solution: let the compiler use its built-in sysroot.

			2. `rpcmem_init` is optional when linked against `libcdsprpc.so`. The SDK's cross-compiled `libcdsprpc` only exports `rpcmem_alloc`/`free` but not `rpcmem_init`/`deinit`. The Q6A system libcdsprpc has them all.

			`## Why the NPU Isn't Accelerating Inference`

			`After extensive testing, the Hexagon backend loads and initializes successfully but never actually offloads any computation. Every test shows:`

			```
			`llama_memory_breakdown_print: \| - HTP0 (Hexagon) \| 2048 = 2048 + ( 0 = 0 + 0 + 0) + 0 \|`
			```

			`The NPU reports 2048 MiB but uses 0 MiB for model, context, and compute. Source code analysis reveals three reasons:`

			### 1. `offload_op` callback is NULL

			The backend device struct registers `/* .offload_op = */ NULL`, so the scheduler never proactively moves tensors to the NPU. Even when `-ngl N` is specified, without this callback no layers get claimed.

			```c
			`// ggml/src/ggml-hexagon/ggml-hexagon.cpp`
			`const ggml_backend_device_t ggml_backend_hexagon_device_registration = {`
			`/* .name = */ GGML_HEXAGON_DEVICE_NAME,`
			`/* .description = */ "Hexagon NPU",`
			`/* .get_memory = */ ggml_backend_hexagon_device_get_memory,`
			`/* .get_version = */ NULL,`
			`/* .get_best_device = */ NULL,`
			`/* .get_device_for_tensor */ NULL,`
			`/* .offload_op = */ NULL, // <--- STUBBED`
			`/* .supports_op = */ ggml_backend_hexagon_device_supports_op,`
			`...`
			`};`
			```

			`### 2. 2048 MiB limit is hardcoded, not queried`

			`The 2GB "free memory" reported is a hardcoded constant, not a hardware query:`

			```c
			`// ggml/src/ggml-hexagon/ggml-hexagon.cpp`
			`static void ggml_backend_hexagon_device_get_memory(ggml_backend_dev_t dev, size_t * free, size_t * total) {`
			`// ~2GB per session for now`
			`free = 2ULL 1024 * 1024 * 1024;`
			`total = free;`
			`GGML_UNUSED(dev);`
			`}`
			```

			It never calls `rpcmem_alloc2()` or checks kernel ION/DMA-BUF heap sizes. The 2GB is a rough placeholder. On QCS6490 with 11GB system RAM, the CDSP carveout is typically 1-4 GB depending on firmware config, but the code never checks.

			`### 3. Q4_K_M is not a supported quantization for HTP`

			`The Hexagon HTP kernels only support these quantization types:`
			- `GGML_TYPE_Q4_0`
			- `GGML_TYPE_Q8_0`
			- `GGML_TYPE_IQ4_NL`
			- `GGML_TYPE_MXFP4`

			Q4_K_M is NOT in this list. Every MUL_MAT operation with Q4_K_M weights fails the `supports_op` type check, regardless of buffer placement. This means even if you fix the buffer allocation path, the 7B Q4_K_M model still won't offload.

			`### Summary of Issues`

			`\| Issue \| Root cause \|`
			`\|-------\|-----------\|`
			\| 0 MiB used for model/context/compute \| `offload_op = NULL` in device registration \|
			`\| 2048 MiB cap \| Hardcoded constant, not a FastRPC/ION query \|`
			`\| Q4_K_M tensors don't offload \| Q4_K_M not in HTP supported type list \|`
			`\| Ops always rejected by supports_op \| Chicken-and-egg: tensors never in Hexagon buffers \|`

			`## Performance Benchmarks`

			`All tests on Radxa Dragon Q6A (QCS6490, 11GB RAM, 8x ARM Cortex cores, Ubuntu 24.04).`

			`### 1B Model (Llama 3.2, Q4_K_M)`

			`\| Metric \| CPU-only \| With Hexagon backend \|`
			`\|--------\|----------\|---------------------\|`
			`\| Prompt processing \| 32.3 t/s \| 32.0 t/s \|`
			`\| Generation \| 4.5 t/s \| 4.5 t/s \|`

			`No difference — CPU handles the 1B model natively.`

			`### 7B Model (DeepSeek R1 Distill Qwen, Q4_K_M)`

			Need `-c 2048` or smaller to fit in 11GB RAM (model alone is 4460 MiB, KV cache at default 128K context adds 7GB).

			`\| Context \| Test \| Prompt t/s \| Gen t/s \| Model \| Context \| Compute \| Total \|`
			`\|---------\|------\|-----------\|---------\|-------\|---------\|---------\|-------\|`
			`\| 2K \| CPU \| 2.7 \| 1.9 \| 4460 \| 112 \| 311 \| 4883 \|`
			`\| 2K \| NPU \| 2.8 \| 1.8 \| 4460 \| 112 \| 311 \| 4883 \|`
			`\| 32K \| CPU \| 2.7 \| 1.9 \| 4460 \| 1792 \| 396 \| 6648 \|`
			`\| 32K \| NPU \| 2.7 \| 1.9 \| 4460 \| 1792 \| 396 \| 6648 \|`
			`\| 64K \| CPU \| 2.5 \| 1.8 \| 4460 \| 3584 \| 429 \| 8473 \|`
			`\| 64K \| NPU \| 2.5 \| 1.8 \| 4460 \| 3584 \| 429 \| 8473 \|`

			`Every NPU vs CPU comparison is identical. The Hexagon backend never offloads any tensors, so all computation runs on CPU in both cases.`

			`Memory at 64K context (8.5 GiB total) approaches the 11 GiB ceiling but fits with swap support.`

			`### What would need to change to get actual NPU offload`

			1. Implement `offload_op` callback in `ggml_backend_hexagon_device_registration`
			`2. Wire the repack buffer type for weight tensors so quantized weights land in Hexagon-accessible memory`
			`3. Query actual rpcmem capacity instead of hardcoding 2GB`
			`4. Use a Q4_0 or Q8_0 quantized model (Q4_K_M not supported by HTP kernels)`

			`## Known Issues`

			- Minimal stub library (`htp_minimal_impl.c`) fails to load on the DSP with error `0x80000442` — the full `libggml-htp-v68.so` (generated by the cmake build from `ggml-hexagon/main.c`) works correctly.
			`- DSP library is rebuilt every time the cmake build runs.`
			- The `htp_iface.idl` declares `dst` as `in sequence<uint8>` (input-only) but it's an output buffer. Should be `rout`.
Initial commit: Q6A Hexagon v68 + llama.cpp guide Complete documentation for running llama.cpp with the Qualcomm Hexagon CDSP v68 NPU backend on a Radxa Dragon Q6A (SA8775P) board. Includes: - Corrected FastRPC test harness (libcdsprpc handles INIT_CREATE) - Minimal DSP stub library - Cross-compile build script for llama.cpp - Deploy and test scripts for Q6A - Kernel FastRPC header for reference - Comprehensive README with lessons learned Key findings: - Do NOT call FASTRPC_IOCTL_INIT_CREATE manually - Must link against Q6A system libcdsprpc (not SDK cross-compiled) - Build verified: 32 t/s prompt, 4.5 t/s generation on 1B model 2026-05-02 08:28:51 +00:00
			`## Files in This Repo`

			`\| File \| Purpose \|`
			`\|------\|---------\|`
			\| `src/test_fastrpc_fixed.c` \| Corrected test harness with proper init sequence \|
			\| `src/htp_minimal_impl.c` \| Minimal DSP stub (for experimentation) \|
			\| `scripts/build-hexagon.sh` \| Cross-compile script for llama.cpp with GGML_HEXAGON=ON \|
			\| `scripts/deploy-to-q6a.sh` \| Deploy built binaries + DSP .so to Q6A \|
			\| `scripts/test-on-q6a.sh` \| Run full inference test on Q6A \|
Update with full NPU analysis and benchmarks Adds: - Detailed explanation of why Hexagon NPU doesn't accelerate inference - offload_op callback is NULL in ggml-hexagon.cpp - 2048 MiB limit is hardcoded, not hardware-queried - Q4_K_M not supported by HTP kernels (only Q4_0, Q8_0, IQ4_NL, MXFP4) - Full benchmark table: 1B and 7B models, 2K/32K/64K context, CPU vs NPU - All results show CPU and NPU identical within margin of error - 7B test script (test-7b.sh) - Updated deploy script with password handling for DSP .so - Performance baseline in AGENTS.md - Cross-compile pitfalls (CMAKE_SYSROOT, rpcmem_init) 2026-05-02 10:42:42 +00:00			\| `scripts/test-7b.sh` \| Run 7B model benchmarks at various context sizes \|
Initial commit: Q6A Hexagon v68 + llama.cpp guide Complete documentation for running llama.cpp with the Qualcomm Hexagon CDSP v68 NPU backend on a Radxa Dragon Q6A (SA8775P) board. Includes: - Corrected FastRPC test harness (libcdsprpc handles INIT_CREATE) - Minimal DSP stub library - Cross-compile build script for llama.cpp - Deploy and test scripts for Q6A - Kernel FastRPC header for reference - Comprehensive README with lessons learned Key findings: - Do NOT call FASTRPC_IOCTL_INIT_CREATE manually - Must link against Q6A system libcdsprpc (not SDK cross-compiled) - Build verified: 32 t/s prompt, 4.5 t/s generation on 1B model 2026-05-02 08:28:51 +00:00			\| `references/fastrpc.h` \| Q6A kernel header (ioctl struct definitions) \|
			\| `AGENTS.md` \| Context for AI coding agents working with this codebase \|