Update with full NPU analysis and benchmarks

Adds:
- Detailed explanation of why Hexagon NPU doesn't accelerate inference
  - offload_op callback is NULL in ggml-hexagon.cpp
  - 2048 MiB limit is hardcoded, not hardware-queried
  - Q4_K_M not supported by HTP kernels (only Q4_0, Q8_0, IQ4_NL, MXFP4)
- Full benchmark table: 1B and 7B models, 2K/32K/64K context, CPU vs NPU
  - All results show CPU and NPU identical within margin of error
- 7B test script (test-7b.sh)
- Updated deploy script with password handling for DSP .so
- Performance baseline in AGENTS.md
- Cross-compile pitfalls (CMAKE_SYSROOT, rpcmem_init)
Jimmy Devine 2026-05-02 12:42:42 +02:00
parent 18970e3258
commit 627236a505
4 changed files with 179 additions and 30 deletions

AGENTS.md

@@ -9,6 +9,14 @@ This repo documents how to get llama.cpp running with the Qualcomm Hexagon CDSP
3. **Do NOT set CMAKE_SYSROOT** in the cross-compile — it conflicts with Ubuntu's cross-compiler linker scripts.
4. **Use rpcmem_alloc for DSP compute buffers** — stack arrays only work for tiny buffers (~4KB fragile slow path).
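As a concrete sketch of item 4 (the heap and flag macros are the usual ones from the SDK's `rpcmem.h`; exact flags may differ per target):
```c
// Sketch: allocate a DSP-visible compute buffer via rpcmem instead of the stack.
// RPCMEM_HEAP_ID_SYSTEM and RPCMEM_DEFAULT_FLAGS come from the SDK's rpcmem.h.
#include "rpcmem.h"

void *dsp_buf = rpcmem_alloc(RPCMEM_HEAP_ID_SYSTEM, RPCMEM_DEFAULT_FLAGS, 1 << 20 /* 1 MiB */);
if (!dsp_buf) {
    // allocation failed: fall back or abort
}
// ... hand dsp_buf to FastRPC calls ...
rpcmem_free(dsp_buf);
```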
## Why NPU Doesn't Actually Accelerate
Source code analysis of `ggml-hexagon.cpp` revealed:
- **`offload_op` callback is NULL** — scheduler never moves tensors to NPU
- **2048 MiB limit is hardcoded** (`2ULL * 1024 * 1024 * 1024`), not a hardware query
- **Q4_K_M not supported** — HTP kernels only support Q4_0, Q8_0, IQ4_NL, MXFP4
- Performance identical CPU vs NPU in all benchmarks (1B and 7B models)
## Build Command
```bash
@@ -22,10 +30,14 @@ bash scripts/build-hexagon.sh
Q6A=radxa@192.168.1.11 bash scripts/deploy-to-q6a.sh
```
## Test Commands
```bash
# Quick 1B test
bash scripts/test-on-q6a.sh
# 7B benchmarks at various context sizes
Q6A=radxa@192.168.1.11 bash scripts/test-7b.sh
```
## File Reference
@@ -34,12 +46,18 @@ bash scripts/test-on-q6a.sh
- `src/htp_minimal_impl.c` — Minimal DSP stub (for testing, full library works instead)
- `scripts/build-hexagon.sh` — llama.cpp cmake build for aarch64 + Hexagon
- `scripts/deploy-to-q6a.sh` — Deploy to Q6A
- `scripts/test-on-q6a.sh` — Quick 1B inference test
- `scripts/test-7b.sh` — 7B model benchmarks
- `references/fastrpc.h` — FastRPC ioctl definitions from Q6A kernel
- `README.md` — Full guide with all findings
## Performance Baseline
| Model | Config | Prompt t/s | Gen t/s |
|-------|--------|-----------|---------|
| 1B Q4_K_M | any | 32 | 4.5 |
| 7B Q4_K_M | 2K ctx | 2.7 | 1.9 |
| 7B Q4_K_M | 32K ctx | 2.7 | 1.9 |
| 7B Q4_K_M | 64K ctx | 2.5 | 1.8 |
NPU and CPU are identical in all configurations. The 7B model needs `-c N` to limit KV-cache size (the default 128K context consumes 7 GB+ for the KV cache alone).

README.md

@@ -89,7 +89,7 @@ The `scripts/build-hexagon.sh` script:
### Root Cause of `remote_handle64_open` Error 0xe
The error occurs because **the SDK's cross-compiled `libcdsprpc.so` does NOT handle `FASTRPC_IOCTL_INIT_CREATE` internally** for unsigned PDs. The Q6A system `/usr/lib/libcdsprpc.so.1` does. The fix: compile and link natively on the Q6A (or link against the system library).
### Do NOT Call INIT_CREATE Manually
@@ -118,12 +118,118 @@ remote_handle64_open(uri, &handle);
All required `dspqueue_*` symbols are present in the SA8775P system `libcdsprpc.so.1`:
`dspqueue_create`, `dspqueue_close`, `dspqueue_export`, `dspqueue_write`, `dspqueue_read`, etc.
### Cross-Compile Pitfalls
1. **`CMAKE_SYSROOT` breaks the linker** — Ubuntu's `gcc-aarch64-linux-gnu` packages install linker scripts with absolute paths (e.g., `/usr/aarch64-linux-gnu/lib/libm.so` contains `GROUP( /usr/aarch64-linux-gnu/lib/libm.so.6 ... )`). When `--sysroot` is set to `/usr/aarch64-linux-gnu`, these absolute paths double up to `/usr/aarch64-linux-gnu/usr/aarch64-linux-gnu/lib/...` and fail. Solution: let the compiler use its built-in sysroot.
2. **`rpcmem_init` is optional** when linked against `libcdsprpc.so`. The SDK's cross-compiled `libcdsprpc` only exports `rpcmem_alloc`/`rpcmem_free`, not `rpcmem_init`/`rpcmem_deinit`; the Q6A system libcdsprpc has them all (see the runtime-lookup sketch below).
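A minimal sketch of the runtime lookup mentioned in item 2, so one binary works against both libraries (this approach is an assumption; it relies on glibc's `RTLD_DEFAULT`, and older toolchains need `-ldl`):
```c
// Sketch: call rpcmem_init only if the loaded libcdsprpc actually exports it.
// The SDK's cross-compiled library lacks the symbol; the Q6A system
// /usr/lib/libcdsprpc.so.1 has it.
#define _GNU_SOURCE
#include <dlfcn.h>

static void maybe_rpcmem_init(void) {
    void (*init_fn)(void) = (void (*)(void)) dlsym(RTLD_DEFAULT, "rpcmem_init");
    if (init_fn) {
        init_fn();
    }
    // if absent, skip it: rpcmem_alloc/rpcmem_free still work without it
}
```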
## Why the NPU Isn't Accelerating Inference
After extensive testing, the Hexagon backend loads and initializes successfully but **never actually offloads any computation**. Every test shows:
```
llama_memory_breakdown_print: | - HTP0 (Hexagon) | 2048 = 2048 + ( 0 = 0 + 0 + 0) + 0 |
```
The NPU reports 2048 MiB but uses 0 MiB for model, context, and compute. Source code analysis reveals three reasons:
### 1. `offload_op` callback is NULL
The backend device struct registers `/* .offload_op = */ NULL`, so the scheduler never proactively moves tensors to the NPU. Even when `-ngl N` is specified, without this callback no layers get claimed.
```c
// ggml/src/ggml-hexagon/ggml-hexagon.cpp
const ggml_backend_device_t ggml_backend_hexagon_device_registration = {
    /* .name                  = */ GGML_HEXAGON_DEVICE_NAME,
    /* .description           = */ "Hexagon NPU",
    /* .get_memory            = */ ggml_backend_hexagon_device_get_memory,
    /* .get_version           = */ NULL,
    /* .get_best_device       = */ NULL,
    /* .get_device_for_tensor = */ NULL,
    /* .offload_op            = */ NULL, // <--- STUBBED
    /* .supports_op           = */ ggml_backend_hexagon_device_supports_op,
...
};
```
### 2. 2048 MiB limit is hardcoded, not queried
The 2GB "free memory" reported is a hardcoded constant, not a hardware query:
```c
// ggml/src/ggml-hexagon/ggml-hexagon.cpp
static void ggml_backend_hexagon_device_get_memory(ggml_backend_dev_t dev, size_t * free, size_t * total) {
    // ~2GB per session for now
    *free  = 2ULL * 1024 * 1024 * 1024;
    *total = *free;
    GGML_UNUSED(dev);
}
```
It never calls `rpcmem_alloc2()` or checks kernel ION/DMA-BUF heap sizes. The 2GB is a rough placeholder. On QCS6490 with 11GB system RAM, the CDSP carveout is typically 1-4 GB depending on firmware config, but the code never checks.
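As a sketch of what a real query could look like, assuming `rpcmem_alloc2` is exported (it takes a `size_t` size, unlike `rpcmem_alloc`'s `int`), one option is to probe the largest allocation the heap will actually grant:
```c
// Sketch only: estimate usable rpcmem capacity by binary-searching allocation
// sizes instead of returning a hardcoded 2 GiB. Probing a shared heap like
// this is crude and illustrative, not production code.
static size_t hexagon_probe_free_memory(void) {
    size_t lo = 0;
    size_t hi = 8ULL * 1024 * 1024 * 1024;   // search up to 8 GiB
    while (hi - lo > 64ULL * 1024 * 1024) {  // stop at 64 MiB resolution
        size_t mid = lo + (hi - lo) / 2;
        void *p = rpcmem_alloc2(RPCMEM_HEAP_ID_SYSTEM, RPCMEM_DEFAULT_FLAGS, mid);
        if (p) {
            rpcmem_free(p);
            lo = mid;  // mid bytes were allocatable
        } else {
            hi = mid;  // mid bytes were not
        }
    }
    return lo;
}
```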
### 3. Q4_K_M is not a supported quantization for HTP
The Hexagon HTP kernels only support these quantization types:
- `GGML_TYPE_Q4_0`
- `GGML_TYPE_Q8_0`
- `GGML_TYPE_IQ4_NL`
- `GGML_TYPE_MXFP4`
**Q4_K_M is NOT in this list.** Every MUL_MAT operation with Q4_K_M weights fails the `supports_op` type check, regardless of buffer placement. This means even if you fix the buffer allocation path, the 7B Q4_K_M model still won't offload.
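The type gate involved is straightforward; a minimal sketch of the check (the helper name is hypothetical, the real logic lives in the backend's `supports_op` path). Note that a "Q4_K_M" file stores its weight tensors as `GGML_TYPE_Q4_K`, which is what actually gets rejected:
```c
// Sketch (hypothetical helper): the HTP-supported quantization gate.
// A Q4_K_M model's weight tensors are GGML_TYPE_Q4_K, so they fail this check.
#include <stdbool.h>
#include "ggml.h"

static bool htp_type_supported(enum ggml_type type) {
    switch (type) {
        case GGML_TYPE_Q4_0:
        case GGML_TYPE_Q8_0:
        case GGML_TYPE_IQ4_NL:
        case GGML_TYPE_MXFP4:
            return true;   // the only quantized types the HTP kernels implement
        default:
            return false;  // GGML_TYPE_Q4_K (Q4_K_M weights) lands here
    }
}
```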
### Summary of Issues
| Issue | Root cause |
|-------|-----------|
| 0 MiB used for model/context/compute | `offload_op = NULL` in device registration |
| 2048 MiB cap | Hardcoded constant, not a FastRPC/ION query |
| Q4_K_M tensors don't offload | Q4_K_M not in HTP supported type list |
| Ops always rejected by supports_op | Chicken-and-egg: tensors never in Hexagon buffers |
## Performance Benchmarks
All tests on Radxa Dragon Q6A (QCS6490, 11GB RAM, 8x ARM Cortex cores, Ubuntu 24.04).
### 1B Model (Llama 3.2, Q4_K_M)
| Metric | CPU-only | With Hexagon backend |
|--------|----------|---------------------|
| Prompt processing | 32.3 t/s | 32.0 t/s |
| Generation | 4.5 t/s | 4.5 t/s |
No difference — CPU handles the 1B model natively.
### 7B Model (DeepSeek R1 Distill Qwen, Q4_K_M)
The 7B model needs `-c 2048` or smaller to fit in 11 GB RAM (the model alone is 4460 MiB; the KV cache at the default 128K context adds 7 GB+).
| Context | Backend | Prompt t/s | Gen t/s | Model (MiB) | KV cache (MiB) | Compute (MiB) | Total (MiB) |
|---------|---------|------------|---------|-------------|----------------|---------------|-------------|
| 2K | CPU | 2.7 | 1.9 | 4460 | 112 | 311 | 4883 |
| 2K | NPU | 2.8 | 1.8 | 4460 | 112 | 311 | 4883 |
| 32K | CPU | 2.7 | 1.9 | 4460 | 1792 | 396 | 6648 |
| 32K | NPU | 2.7 | 1.9 | 4460 | 1792 | 396 | 6648 |
| 64K | CPU | 2.5 | 1.8 | 4460 | 3584 | 429 | 8473 |
| 64K | NPU | 2.5 | 1.8 | 4460 | 3584 | 429 | 8473 |
**Every NPU vs CPU comparison is identical.** The Hexagon backend never offloads any tensors, so all computation runs on CPU in both cases.
Memory at 64K context (8473 MiB total) approaches the 11 GiB ceiling but fits with swap enabled.
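As a cross-check, the KV cache column matches the standard formula, assuming the usual GQA shape for this model family (28 layers, 4 KV heads, head dim 128, f16 KV; these values are an assumption, not read from the GGUF):

2 (K+V) × 28 layers × 4 KV heads × 128 dims × 2 bytes = 56 KiB per token

so 2048 tokens → 112 MiB, 32768 → 1792 MiB, 65536 → 3584 MiB, and the 128K default → 7 GiB, consistent with the table and the note above.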
### What would need to change to get actual NPU offload
1. Implement the `offload_op` callback in `ggml_backend_hexagon_device_registration` (a sketch follows this list)
2. Wire the repack buffer type for weight tensors so quantized weights land in Hexagon-accessible memory
3. Query actual rpcmem capacity instead of hardcoding 2GB
4. Use a Q4_0 or Q8_0 quantized model (Q4_K_M not supported by HTP kernels)
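A hedged sketch of step 1, using the `offload_op` signature from ggml's backend device interface (the batch threshold is an illustrative assumption, not a tuned value):
```c
// Sketch: claim large matrix multiplications for the NPU so the scheduler
// actually offloads them, reusing the existing supports_op type gate.
static bool ggml_backend_hexagon_device_offload_op(ggml_backend_dev_t dev,
                                                   const struct ggml_tensor * op) {
    const int64_t min_batch = 32;  // only offload past some batch size (assumption)
    return op->op == GGML_OP_MUL_MAT
        && op->ne[1] >= min_batch
        && ggml_backend_hexagon_device_supports_op(dev, op);
}
// then in the device registration:
//   /* .offload_op = */ ggml_backend_hexagon_device_offload_op,
```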
## Known Issues
- **Minimal stub library** (`htp_minimal_impl.c`) fails to load on the DSP with error `0x80000442` — the full `libggml-htp-v68.so` (generated by the cmake build from `ggml-hexagon/main.c`) works correctly.
- **DSP library is rebuilt every time** the cmake build runs.
- The `htp_iface.idl` declares `dst` as `in sequence<uint8>` (input-only) but it's an output buffer. Should be `rout`.
## Files in This Repo
@@ -134,5 +240,6 @@ All required `dspqueue_*` symbols are present in the SA8775P system `libcdsprpc.
| `scripts/build-hexagon.sh` | Cross-compile script for llama.cpp with GGML_HEXAGON=ON |
| `scripts/deploy-to-q6a.sh` | Deploy built binaries + DSP .so to Q6A |
| `scripts/test-on-q6a.sh` | Run full inference test on Q6A |
| `scripts/test-7b.sh` | Run 7B model benchmarks at various context sizes |
| `references/fastrpc.h` | Q6A kernel header (ioctl struct definitions) |
| `AGENTS.md` | Context for AI coding agents working with this codebase |

scripts/deploy-to-q6a.sh

@@ -3,41 +3,38 @@
set -euo pipefail
Q6A="${Q6A:-radxa@192.168.1.11}"
Q6A_PASS="${Q6A_PASS:-radxa}"
BUILD_DIR="${BUILD_DIR:-$HOME/llama.cpp/build-hexagon}"
DEPLOY_DIR="${DEPLOY_DIR:-llama/bin}"
echo "=== Deploying to ${Q6A}:${DEPLOY_DIR} ==="
# Create deploy dir
ssh "${Q6A}" "mkdir -p ~/${DEPLOY_DIR}"

# Deploy ARM binaries
echo "--- ARM binaries ---"
for f in llama-cli libggml-hexagon.so libggml-hexagon.so.0 libggml-hexagon.so.0.9.11 \
         libggml-base.so libggml-base.so.0 libggml-base.so.0.9.11 \
         libggml-cpu.so libggml-cpu.so.0 libggml-cpu.so.0.9.11 \
         libggml.so libggml.so.0 libggml.so.0.9.11 \
         libllama.so libllama.so.0; do
    src="${BUILD_DIR}/bin/${f}"
    if [ -f "$src" ]; then
        scp "$src" "${Q6A}:~/${DEPLOY_DIR}/" 2>/dev/null
    fi
done
echo " done"

# Deploy DSP skel
echo "--- DSP .so ---"
DSP_SO="${BUILD_DIR}/ggml/src/ggml-hexagon/libggml-htp-v68.so"
if [ -f "$DSP_SO" ]; then
    scp "$DSP_SO" "${Q6A}:/tmp/"
    ssh "${Q6A}" "echo '${Q6A_PASS}' | sudo -S cp /tmp/libggml-htp-v68.so /usr/lib/dsp/cdsp/libggml-htp-v68.so 2>&1"
    echo " deployed to /usr/lib/dsp/cdsp/"
else
    echo " WARNING: DSP .so not found at $DSP_SO"
fi
echo "=== Deploy complete ==="

scripts/test-7b.sh (new file)

@@ -0,0 +1,27 @@
#!/usr/bin/env bash
# test-7b.sh — Run 7B model benchmarks on Q6A at various context sizes
set -euo pipefail
Q6A="${Q6A:-radxa@192.168.1.11}"
MODEL="${MODEL:-/home/radxa/models/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf}"
DEPLOY_DIR="${DEPLOY_DIR:-llama/bin}"
CONTEXTS=("2048" "8192" "32768" "65536")
echo "=== 7B Model Benchmarks ==="
echo "Model: ${MODEL}"
echo ""
for ctx in "${CONTEXTS[@]}"; do
    echo "--- Context ${ctx} (NPU) ---"
    ssh "${Q6A}" "cd ~/${DEPLOY_DIR} && GGML_HEXAGON=1 LD_LIBRARY_PATH=. timeout 120 ./llama-cli -m '${MODEL}' -n 8 -p Hello -ngl 0 -c ${ctx} --no-display-prompt 2>&1" | grep -E 'Prompt:|Generation:|memory'
    echo ""

    echo "--- Context ${ctx} (CPU-only) ---"
    ssh "${Q6A}" "cd ~/${DEPLOY_DIR} && mv libggml-hexagon.so libggml-hexagon.so.disabled 2>/dev/null; mv libggml-hexagon.so.0 libggml-hexagon.so.0.disabled 2>/dev/null; mv libggml-hexagon.so.0.9.11 libggml-hexagon.so.0.9.11.disabled 2>/dev/null; LD_LIBRARY_PATH=. timeout 120 ./llama-cli -m '${MODEL}' -n 8 -p Hello -ngl 0 -c ${ctx} --no-display-prompt 2>&1 | grep -E 'Prompt:|Generation:'; mv libggml-hexagon.so.disabled libggml-hexagon.so 2>/dev/null; mv libggml-hexagon.so.0.disabled libggml-hexagon.so.0 2>/dev/null; mv libggml-hexagon.so.0.9.11.disabled libggml-hexagon.so.0.9.11 2>/dev/null"
    echo ""
done
echo "=== Done ==="