Update with full NPU analysis and benchmarks

Adds:
- Detailed explanation of why Hexagon NPU doesn't accelerate inference
  - offload_op callback is NULL in ggml-hexagon.cpp
  - 2048 MiB limit is hardcoded, not hardware-queried
  - Q4_K_M not supported by HTP kernels (only Q4_0, Q8_0, IQ4_NL, MXFP4)
- Full benchmark table: 1B and 7B models, 2K/32K/64K context, CPU vs NPU
  - All results show CPU and NPU identical within margin of error
- 7B test script (test-7b.sh)
- Updated deploy script with password handling for DSP .so
- Performance baseline in AGENTS.md
- Cross-compile pitfalls (CMAKE_SYSROOT, rpcmem_init)
Jimmy Devine 2026-05-02 12:42:42 +02:00
parent 18970e3258
commit 627236a505
4 changed files with 179 additions and 30 deletions

AGENTS.md

@@ -9,6 +9,14 @@ This repo documents how to get llama.cpp running with the Qualcomm Hexagon CDSP
3. **Do NOT set CMAKE_SYSROOT** in the cross-compile — it conflicts with Ubuntu's cross-compiler linker scripts.
4. **Use rpcmem_alloc for DSP compute buffers** — stack arrays only work for tiny buffers (~4KB fragile slow path).
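As a concrete sketch of item 4 (the heap and flag macros are the usual ones from the SDK's `rpcmem.h`; exact flags may differ per target):
```c
// Sketch: allocate a DSP-visible compute buffer via rpcmem instead of the stack.
// RPCMEM_HEAP_ID_SYSTEM and RPCMEM_DEFAULT_FLAGS come from the SDK's rpcmem.h.
#include "rpcmem.h"

void *dsp_buf = rpcmem_alloc(RPCMEM_HEAP_ID_SYSTEM, RPCMEM_DEFAULT_FLAGS, 1 << 20 /* 1 MiB */);
if (!dsp_buf) {
    // allocation failed: fall back or abort
}
// ... hand dsp_buf to FastRPC calls ...
rpcmem_free(dsp_buf);
```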
## Why NPU Doesn't Actually Accelerate
Source code analysis of `ggml-hexagon.cpp` revealed:
- **`offload_op` callback is NULL** — scheduler never moves tensors to NPU
- **2048 MiB limit is hardcoded** (`2ULL * 1024 * 1024 * 1024`), not a hardware query
- **Q4_K_M not supported** — HTP kernels only support Q4_0, Q8_0, IQ4_NL, MXFP4
- Performance identical CPU vs NPU in all benchmarks (1B and 7B models)
## Build Command
```bash
@@ -22,10 +30,14 @@ bash scripts/build-hexagon.sh
Q6A=radxa@192.168.1.11 bash scripts/deploy-to-q6a.sh
```
## Test Commands
```bash
# Quick 1B test
bash scripts/test-on-q6a.sh
# 7B benchmarks at various context sizes
Q6A=radxa@192.168.1.11 bash scripts/test-7b.sh
```
## File Reference
@@ -34,12 +46,18 @@ bash scripts/test-on-q6a.sh
- `src/htp_minimal_impl.c` — Minimal DSP stub (for testing, full library works instead)
- `scripts/build-hexagon.sh` — llama.cpp cmake build for aarch64 + Hexagon
- `scripts/deploy-to-q6a.sh` — Deploy to Q6A
- `scripts/test-on-q6a.sh` — Quick 1B inference test
- `scripts/test-7b.sh` — 7B model benchmarks
- `references/fastrpc.h` — FastRPC ioctl definitions from Q6A kernel
- `README.md` — Full guide with all findings
## Performance Baseline
| Model | Config | Prompt t/s | Gen t/s |
|-------|--------|-----------|---------|
| 1B Q4_K_M | any | 32 | 4.5 |
| 7B Q4_K_M | 2K ctx | 2.7 | 1.9 |
| 7B Q4_K_M | 32K ctx | 2.7 | 1.9 |
| 7B Q4_K_M | 64K ctx | 2.5 | 1.8 |
NPU and CPU are identical in all configurations. The 7B model needs `-c N` to limit KV-cache size (the default 128K context consumes 7 GB+ for the KV cache alone).

README.md

@@ -89,7 +89,7 @@ The `scripts/build-hexagon.sh` script:
### Root Cause of `remote_handle64_open` Error 0xe
The error occurs because **the SDK's cross-compiled `libcdsprpc.so` does NOT handle `FASTRPC_IOCTL_INIT_CREATE` internally** for unsigned PDs. The Q6A system `/usr/lib/libcdsprpc.so.1` does. The fix: compile and link natively on the Q6A (or link against the system library).
### Do NOT Call INIT_CREATE Manually
@@ -118,12 +118,118 @@ remote_handle64_open(uri, &handle);
All required `dspqueue_*` symbols are present in the SA8775P system `libcdsprpc.so.1`:
`dspqueue_create`, `dspqueue_close`, `dspqueue_export`, `dspqueue_write`, `dspqueue_read`, etc.
### Cross-Compile Pitfalls
1. **`CMAKE_SYSROOT` breaks the linker** — Ubuntu's `gcc-aarch64-linux-gnu` packages install linker scripts with absolute paths (e.g., `/usr/aarch64-linux-gnu/lib/libm.so` contains `GROUP( /usr/aarch64-linux-gnu/lib/libm.so.6 ... )`). When `--sysroot` is set to `/usr/aarch64-linux-gnu`, these absolute paths double up to `/usr/aarch64-linux-gnu/usr/aarch64-linux-gnu/lib/...` and fail. Solution: let the compiler use its built-in sysroot.
2. **`rpcmem_init` is optional** when linked against `libcdsprpc.so`. The SDK's cross-compiled `libcdsprpc` only exports `rpcmem_alloc`/`rpcmem_free`, not `rpcmem_init`/`rpcmem_deinit`; the Q6A system libcdsprpc has them all (see the runtime-lookup sketch below).
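A minimal sketch of the runtime lookup mentioned in item 2, so one binary works against both libraries (this approach is an assumption; it relies on glibc's `RTLD_DEFAULT`, and older toolchains need `-ldl`):
```c
// Sketch: call rpcmem_init only if the loaded libcdsprpc actually exports it.
// The SDK's cross-compiled library lacks the symbol; the Q6A system
// /usr/lib/libcdsprpc.so.1 has it.
#define _GNU_SOURCE
#include <dlfcn.h>

static void maybe_rpcmem_init(void) {
    void (*init_fn)(void) = (void (*)(void)) dlsym(RTLD_DEFAULT, "rpcmem_init");
    if (init_fn) {
        init_fn();
    }
    // if absent, skip it: rpcmem_alloc/rpcmem_free still work without it
}
```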
## Why the NPU Isn't Accelerating Inference
After extensive testing, the Hexagon backend loads and initializes successfully but **never actually offloads any computation**. Every test shows:
```
llama_memory_breakdown_print: | - HTP0 (Hexagon) | 2048 = 2048 + ( 0 = 0 + 0 + 0) + 0 |
```
The NPU reports 2048 MiB but uses 0 MiB for model, context, and compute. Source code analysis reveals three reasons:
### 1. `offload_op` callback is NULL
The backend device struct registers `/* .offload_op = */ NULL`, so the scheduler never proactively moves tensors to the NPU. Even when `-ngl N` is specified, without this callback no layers get claimed.
```c
// ggml/src/ggml-hexagon/ggml-hexagon.cpp
const ggml_backend_device_t ggml_backend_hexagon_device_registration = {
    /* .name                  = */ GGML_HEXAGON_DEVICE_NAME,
    /* .description           = */ "Hexagon NPU",
    /* .get_memory            = */ ggml_backend_hexagon_device_get_memory,
    /* .get_version           = */ NULL,
    /* .get_best_device       = */ NULL,
    /* .get_device_for_tensor = */ NULL,
    /* .offload_op            = */ NULL, // <--- STUBBED
    /* .supports_op           = */ ggml_backend_hexagon_device_supports_op,
...
};
```
### 2. 2048 MiB limit is hardcoded, not queried
The 2GB "free memory" reported is a hardcoded constant, not a hardware query:
```c
// ggml/src/ggml-hexagon/ggml-hexagon.cpp
static void ggml_backend_hexagon_device_get_memory(ggml_backend_dev_t dev, size_t * free, size_t * total) {
    // ~2GB per session for now
    *free  = 2ULL * 1024 * 1024 * 1024;
    *total = *free;
    GGML_UNUSED(dev);
}
```
It never calls `rpcmem_alloc2()` or checks kernel ION/DMA-BUF heap sizes. The 2GB is a rough placeholder. On QCS6490 with 11GB system RAM, the CDSP carveout is typically 1-4 GB depending on firmware config, but the code never checks.
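As a sketch of what a real query could look like, assuming `rpcmem_alloc2` is exported (it takes a `size_t` size, unlike `rpcmem_alloc`'s `int`), one option is to probe the largest allocation the heap will actually grant:
```c
// Sketch only: estimate usable rpcmem capacity by binary-searching allocation
// sizes instead of returning a hardcoded 2 GiB. Probing a shared heap like
// this is crude and illustrative, not production code.
static size_t hexagon_probe_free_memory(void) {
    size_t lo = 0;
    size_t hi = 8ULL * 1024 * 1024 * 1024;   // search up to 8 GiB
    while (hi - lo > 64ULL * 1024 * 1024) {  // stop at 64 MiB resolution
        size_t mid = lo + (hi - lo) / 2;
        void *p = rpcmem_alloc2(RPCMEM_HEAP_ID_SYSTEM, RPCMEM_DEFAULT_FLAGS, mid);
        if (p) {
            rpcmem_free(p);
            lo = mid;  // mid bytes were allocatable
        } else {
            hi = mid;  // mid bytes were not
        }
    }
    return lo;
}
```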
### 3. Q4_K_M is not a supported quantization for HTP
The Hexagon HTP kernels only support these quantization types:
- `GGML_TYPE_Q4_0`
- `GGML_TYPE_Q8_0`
- `GGML_TYPE_IQ4_NL`
- `GGML_TYPE_MXFP4`
**Q4_K_M is NOT in this list.** Every MUL_MAT operation with Q4_K_M weights fails the `supports_op` type check, regardless of buffer placement. This means even if you fix the buffer allocation path, the 7B Q4_K_M model still won't offload.
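The type gate involved is straightforward; a minimal sketch of the check (the helper name is hypothetical, the real logic lives in the backend's `supports_op` path). Note that a "Q4_K_M" file stores its weight tensors as `GGML_TYPE_Q4_K`, which is what actually gets rejected:
```c
// Sketch (hypothetical helper): the HTP-supported quantization gate.
// A Q4_K_M model's weight tensors are GGML_TYPE_Q4_K, so they fail this check.
#include <stdbool.h>
#include "ggml.h"

static bool htp_type_supported(enum ggml_type type) {
    switch (type) {
        case GGML_TYPE_Q4_0:
        case GGML_TYPE_Q8_0:
        case GGML_TYPE_IQ4_NL:
        case GGML_TYPE_MXFP4:
            return true;   // the only quantized types the HTP kernels implement
        default:
            return false;  // GGML_TYPE_Q4_K (Q4_K_M weights) lands here
    }
}
```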
### Summary of Issues
| Issue | Root cause |
|-------|-----------|
| 0 MiB used for model/context/compute | `offload_op = NULL` in device registration |
| 2048 MiB cap | Hardcoded constant, not a FastRPC/ION query |
| Q4_K_M tensors don't offload | Q4_K_M not in HTP supported type list |
| Ops always rejected by supports_op | Chicken-and-egg: tensors never in Hexagon buffers |
## Performance Benchmarks
All tests on Radxa Dragon Q6A (QCS6490, 11GB RAM, 8x ARM Cortex cores, Ubuntu 24.04).
### 1B Model (Llama 3.2, Q4_K_M)
| Metric | CPU-only | With Hexagon backend |
|--------|----------|---------------------|
| Prompt processing | 32.3 t/s | 32.0 t/s |
| Generation | 4.5 t/s | 4.5 t/s |
No difference — CPU handles the 1B model natively.
### 7B Model (DeepSeek R1 Distill Qwen, Q4_K_M)
The 7B model needs `-c 2048` or smaller to fit in 11 GB RAM (the model alone is 4460 MiB; the KV cache at the default 128K context adds 7 GB+).
| Context | Backend | Prompt t/s | Gen t/s | Model (MiB) | KV cache (MiB) | Compute (MiB) | Total (MiB) |
|---------|---------|------------|---------|-------------|----------------|---------------|-------------|
| 2K | CPU | 2.7 | 1.9 | 4460 | 112 | 311 | 4883 |
| 2K | NPU | 2.8 | 1.8 | 4460 | 112 | 311 | 4883 |
| 32K | CPU | 2.7 | 1.9 | 4460 | 1792 | 396 | 6648 |
| 32K | NPU | 2.7 | 1.9 | 4460 | 1792 | 396 | 6648 |
| 64K | CPU | 2.5 | 1.8 | 4460 | 3584 | 429 | 8473 |
| 64K | NPU | 2.5 | 1.8 | 4460 | 3584 | 429 | 8473 |
**Every NPU vs CPU comparison is identical.** The Hexagon backend never offloads any tensors, so all computation runs on CPU in both cases.
Memory at 64K context (8473 MiB total) approaches the 11 GiB ceiling but fits with swap enabled.
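As a cross-check, the KV cache column matches the standard formula, assuming the usual GQA shape for this model family (28 layers, 4 KV heads, head dim 128, f16 KV; these values are an assumption, not read from the GGUF):

2 (K+V) × 28 layers × 4 KV heads × 128 dims × 2 bytes = 56 KiB per token

so 2048 tokens → 112 MiB, 32768 → 1792 MiB, 65536 → 3584 MiB, and the 128K default → 7 GiB, consistent with the table and the note above.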
### What would need to change to get actual NPU offload
1. Implement the `offload_op` callback in `ggml_backend_hexagon_device_registration` (a sketch follows this list)
2. Wire the repack buffer type for weight tensors so quantized weights land in Hexagon-accessible memory
3. Query actual rpcmem capacity instead of hardcoding 2GB
4. Use a Q4_0 or Q8_0 quantized model (Q4_K_M not supported by HTP kernels)
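A hedged sketch of step 1, using the `offload_op` signature from ggml's backend device interface (the batch threshold is an illustrative assumption, not a tuned value):
```c
// Sketch: claim large matrix multiplications for the NPU so the scheduler
// actually offloads them, reusing the existing supports_op type gate.
static bool ggml_backend_hexagon_device_offload_op(ggml_backend_dev_t dev,
                                                   const struct ggml_tensor * op) {
    const int64_t min_batch = 32;  // only offload past some batch size (assumption)
    return op->op == GGML_OP_MUL_MAT
        && op->ne[1] >= min_batch
        && ggml_backend_hexagon_device_supports_op(dev, op);
}
// then in the device registration:
//   /* .offload_op = */ ggml_backend_hexagon_device_offload_op,
```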
## Known Issues
- **Minimal stub library** (`htp_minimal_impl.c`) fails to load on the DSP with error `0x80000442` — the full `libggml-htp-v68.so` (generated by the cmake build from `ggml-hexagon/main.c`) works correctly.
- **DSP library is rebuilt every time** the cmake build runs.
- The `htp_iface.idl` declares `dst` as `in sequence<uint8>` (input-only) but it's an output buffer. Should be `rout`.
## Files in This Repo
@@ -134,5 +240,6 @@ All required `dspqueue_*` symbols are present in the SA8775P system `libcdsprpc.
| `scripts/build-hexagon.sh` | Cross-compile script for llama.cpp with GGML_HEXAGON=ON |
| `scripts/deploy-to-q6a.sh` | Deploy built binaries + DSP .so to Q6A |
| `scripts/test-on-q6a.sh` | Run full inference test on Q6A |
| `scripts/test-7b.sh` | Run 7B model benchmarks at various context sizes |
| `references/fastrpc.h` | Q6A kernel header (ioctl struct definitions) |
| `AGENTS.md` | Context for AI coding agents working with this codebase |

scripts/deploy-to-q6a.sh

@@ -3,41 +3,38 @@
set -euo pipefail
Q6A="${Q6A:-radxa@192.168.1.11}"
Q6A_PASS="${Q6A_PASS:-radxa}"
BUILD_DIR="${BUILD_DIR:-$HOME/llama.cpp/build-hexagon}"
DEPLOY_DIR="${DEPLOY_DIR:-llama/bin}"
echo "=== Deploying to ${Q6A}:${DEPLOY_DIR} ==="
# Create deploy dir
ssh "${Q6A}" "mkdir -p ~/${DEPLOY_DIR}"

# Deploy ARM binaries
echo "--- ARM binaries ---"
for f in llama-cli libggml-hexagon.so libggml-hexagon.so.0 libggml-hexagon.so.0.9.11 \
         libggml-base.so libggml-base.so.0 libggml-base.so.0.9.11 \
         libggml-cpu.so libggml-cpu.so.0 libggml-cpu.so.0.9.11 \
         libggml.so libggml.so.0 libggml.so.0.9.11 \
         libllama.so libllama.so.0; do
    src="${BUILD_DIR}/bin/${f}"
    if [ -f "$src" ]; then
        scp "$src" "${Q6A}:~/${DEPLOY_DIR}/" 2>/dev/null
    fi
done
echo " done"

# Deploy DSP skel
echo "--- DSP .so ---"
DSP_SO="${BUILD_DIR}/ggml/src/ggml-hexagon/libggml-htp-v68.so"
if [ -f "$DSP_SO" ]; then
    scp "$DSP_SO" "${Q6A}:/tmp/"
    ssh "${Q6A}" "echo '${Q6A_PASS}' | sudo -S cp /tmp/libggml-htp-v68.so /usr/lib/dsp/cdsp/libggml-htp-v68.so 2>&1"
    echo " deployed to /usr/lib/dsp/cdsp/"
else
    echo " WARNING: DSP .so not found at $DSP_SO"
fi
echo "=== Deploy complete ==="

scripts/test-7b.sh (new file)

@@ -0,0 +1,27 @@
#!/usr/bin/env bash
# test-7b.sh — Run 7B model benchmarks on Q6A at various context sizes
set -euo pipefail
Q6A="${Q6A:-radxa@192.168.1.11}"
MODEL="${MODEL:-/home/radxa/models/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf}"
DEPLOY_DIR="${DEPLOY_DIR:-llama/bin}"
CONTEXTS=("2048" "8192" "32768" "65536")
echo "=== 7B Model Benchmarks ==="
echo "Model: ${MODEL}"
echo ""
for ctx in "${CONTEXTS[@]}"; do
    echo "--- Context ${ctx} (NPU) ---"
    ssh "${Q6A}" "cd ~/${DEPLOY_DIR} && GGML_HEXAGON=1 LD_LIBRARY_PATH=. timeout 120 ./llama-cli -m '${MODEL}' -n 8 -p Hello -ngl 0 -c ${ctx} --no-display-prompt 2>&1" | grep -E 'Prompt:|Generation:|memory'
    echo ""

    echo "--- Context ${ctx} (CPU-only) ---"
    ssh "${Q6A}" "cd ~/${DEPLOY_DIR} && mv libggml-hexagon.so libggml-hexagon.so.disabled 2>/dev/null; mv libggml-hexagon.so.0 libggml-hexagon.so.0.disabled 2>/dev/null; mv libggml-hexagon.so.0.9.11 libggml-hexagon.so.0.9.11.disabled 2>/dev/null; LD_LIBRARY_PATH=. timeout 120 ./llama-cli -m '${MODEL}' -n 8 -p Hello -ngl 0 -c ${ctx} --no-display-prompt 2>&1 | grep -E 'Prompt:|Generation:'; mv libggml-hexagon.so.disabled libggml-hexagon.so 2>/dev/null; mv libggml-hexagon.so.0.disabled libggml-hexagon.so.0 2>/dev/null; mv libggml-hexagon.so.0.9.11.disabled libggml-hexagon.so.0.9.11 2>/dev/null"
    echo ""
done
echo "=== Done ==="