diff --git a/AGENTS.md b/AGENTS.md
index f95c299..2ab6014 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -9,6 +9,14 @@ This repo documents how to get llama.cpp running with the Qualcomm Hexagon CDSP
 3. **Do NOT set CMAKE_SYSROOT** in the cross-compile — it conflicts with Ubuntu's cross-compiler linker scripts.
 4. **Use rpcmem_alloc for DSP compute buffers** — stack arrays only work for tiny buffers (~4KB fragile slow path).
 
+## Why the NPU Doesn't Actually Accelerate
+
+Source code analysis of `ggml-hexagon.cpp` revealed:
+- **`offload_op` callback is NULL** — the scheduler never moves tensors to the NPU
+- **2048 MiB limit is hardcoded** (`2ULL * 1024 * 1024 * 1024`), not a hardware query
+- **Q4_K_M not supported** — the HTP kernels only support Q4_0, Q8_0, IQ4_NL, MXFP4
+- Performance is identical for CPU vs NPU in all benchmarks (1B and 7B models)
+
 ## Build Command
 
 ```bash
@@ -22,10 +30,14 @@ bash scripts/build-hexagon.sh
 Q6A=radxa@192.168.1.11 bash scripts/deploy-to-q6a.sh
 ```
 
-## Test Command
+## Test Commands
 
 ```bash
+# Quick 1B test
 bash scripts/test-on-q6a.sh
+
+# 7B benchmarks at various context sizes
+Q6A=radxa@192.168.1.11 bash scripts/test-7b.sh
 ```
 
 ## File Reference
@@ -34,12 +46,18 @@ bash scripts/test-on-q6a.sh
 - `src/htp_minimal_impl.c` — Minimal DSP stub (for testing, full library works instead)
 - `scripts/build-hexagon.sh` — llama.cpp cmake build for aarch64 + Hexagon
 - `scripts/deploy-to-q6a.sh` — Deploy to Q6A
-- `scripts/test-on-q6a.sh` — Run inference test on Q6A
+- `scripts/test-on-q6a.sh` — Quick 1B inference test
+- `scripts/test-7b.sh` — 7B model benchmarks
 - `references/fastrpc.h` — FastRPC ioctl definitions from Q6A kernel
-- `README.md` — Full guide with troubleshooting
+- `README.md` — Full guide with all findings
 
 ## Performance Baseline
 
-- Prompt processing: ~32 t/s (on 8 CPU cores)
-- Generation: ~4.5 t/s
-- Model: llama-3.2-1b-q4km.gguf (1B params, Q4_K_M)
+| Model | Config | Prompt t/s | Gen t/s |
+|-------|--------|------------|---------|
+| 1B Q4_K_M | any | 32 | 4.5 |
+| 7B Q4_K_M | 2K ctx | 2.7 | 1.9 |
+| 7B Q4_K_M | 32K ctx | 2.7 | 1.9 |
+| 7B Q4_K_M | 64K ctx | 2.5 | 1.8 |
+
+NPU and CPU results are identical in all configurations. The 7B model needs `-c N` to cap the KV cache (the default 128K context consumes 7GB+ for the KV cache alone).
diff --git a/README.md b/README.md
index 94a2fa3..d3d2c59 100644
--- a/README.md
+++ b/README.md
@@ -89,7 +89,7 @@ The `scripts/build-hexagon.sh` script:
 
 ### Root Cause of `remote_handle64_open` Error 0xe
 
-The error occurs because **the SDK's cross-compiled `libcdsprpc.so` does NOT handle `FASTRPC_IOCTL_INIT_CREATE` internally** for unsigned PDs. The Q6A system `/usr/lib/libcdsprpc.so.1` does. The fix is always compile and link natively on the Q6A (or link against the system library).
+The error occurs because **the SDK's cross-compiled `libcdsprpc.so` does NOT handle `FASTRPC_IOCTL_INIT_CREATE` internally** for unsigned PDs. The Q6A system `/usr/lib/libcdsprpc.so.1` does. The fix: compile and link natively on the Q6A (or link against the system library).
 
 ### Do NOT Call INIT_CREATE Manually
 
@@ -118,12 +118,118 @@ remote_handle64_open(uri, &handle);
 
 All required `dspqueue_*` symbols are present in the SA8775P system `libcdsprpc.so.1`: `dspqueue_create`, `dspqueue_close`, `dspqueue_export`, `dspqueue_write`, `dspqueue_read`, etc.
 
-### Known Issues / Future Work
-
-- **Minimal stub library** (`htp_minimal_impl.c`) still fails to load on the DSP with error `0x80000442` (likely missing initialization that the full library does in its `main.c`). The full `libggml-htp-v68.so` (generated by the cmake build from `ggml-hexagon/main.c`) works correctly.
-- **4.5 tok/s generation speed** is CPU-bound with partial DSP offload. More aggressive offloading of matrix ops to the NPU could improve this.
-- **DSP library is rebuilt every time** the cmake build runs. You don't need to touch it unless you modify the Hexagon backend C code.
-- The `htp_iface.idl` declares `dst` as `in sequence` (input-only) but it's actually an output. Fix upstream to `rout` for correctness.
+### Cross-Compile Pitfalls
+
+1. **`CMAKE_SYSROOT` breaks the linker** — Ubuntu's `gcc-aarch64-linux-gnu` packages install linker scripts with absolute paths (e.g., `/usr/aarch64-linux-gnu/lib/libm.so` contains `GROUP( /usr/aarch64-linux-gnu/lib/libm.so.6 ... )`). When `--sysroot` is set to `/usr/aarch64-linux-gnu`, these absolute paths double up to `/usr/aarch64-linux-gnu/usr/aarch64-linux-gnu/lib/...` and fail. Solution: let the compiler use its built-in sysroot.
+
+2. **`rpcmem_init` is optional** when linking against `libcdsprpc.so`. The SDK's cross-compiled `libcdsprpc` exports only `rpcmem_alloc`/`rpcmem_free`, not `rpcmem_init`/`rpcmem_deinit`. The Q6A system libcdsprpc has them all.
+
+## Why the NPU Isn't Accelerating Inference
+
+After extensive testing, the Hexagon backend loads and initializes successfully but **never actually offloads any computation**. Every test shows:
+
+```
+llama_memory_breakdown_print: | - HTP0 (Hexagon) | 2048 = 2048 + ( 0 = 0 + 0 + 0) + 0 |
+```
+
+The NPU reports 2048 MiB but uses 0 MiB for model, context, and compute. Source code analysis reveals three reasons:
+
+### 1. `offload_op` callback is NULL
+
+The backend device struct registers `/* .offload_op = */ NULL`, so the scheduler never proactively moves tensors to the NPU. Even when `-ngl N` is specified, no layers get claimed without this callback.
+
+```c
+// ggml/src/ggml-hexagon/ggml-hexagon.cpp
+const ggml_backend_device_t ggml_backend_hexagon_device_registration = {
+    /* .name                  = */ GGML_HEXAGON_DEVICE_NAME,
+    /* .description           = */ "Hexagon NPU",
+    /* .get_memory            = */ ggml_backend_hexagon_device_get_memory,
+    /* .get_version           = */ NULL,
+    /* .get_best_device       = */ NULL,
+    /* .get_device_for_tensor = */ NULL,
+    /* .offload_op            = */ NULL, // <--- STUBBED
+    /* .supports_op           = */ ggml_backend_hexagon_device_supports_op,
+    ...
+};
+```
+
+### 2. 2048 MiB limit is hardcoded, not queried
+
+The 2GB of "free memory" reported is a hardcoded constant, not a hardware query:
+
+```c
+// ggml/src/ggml-hexagon/ggml-hexagon.cpp
+static void ggml_backend_hexagon_device_get_memory(ggml_backend_dev_t dev, size_t * free, size_t * total) {
+    // ~2GB per session for now
+    *free  = 2ULL * 1024 * 1024 * 1024;
+    *total = *free;
+    GGML_UNUSED(dev);
+}
+```
+
+It never calls `rpcmem_alloc2()` or checks the kernel ION/DMA-BUF heap sizes; the 2GB is a rough placeholder. On QCS6490 with 11GB system RAM, the CDSP carveout is typically 1-4 GB depending on firmware config, but the code never queries it.
+
+### 3. Q4_K_M is not a supported quantization for HTP
+
+The Hexagon HTP kernels only support these quantization types:
+- `GGML_TYPE_Q4_0`
+- `GGML_TYPE_Q8_0`
+- `GGML_TYPE_IQ4_NL`
+- `GGML_TYPE_MXFP4`
+
+**Q4_K_M is NOT in this list.** Every MUL_MAT operation with Q4_K_M weights fails the `supports_op` type check, regardless of buffer placement. This means that even if you fix the buffer allocation path, the 7B Q4_K_M model still won't offload.
+
+### Summary of Issues
+
+| Issue | Root cause |
+|-------|------------|
+| 0 MiB used for model/context/compute | `offload_op = NULL` in device registration |
+| 2048 MiB cap | Hardcoded constant, not a FastRPC/ION query |
+| Q4_K_M tensors don't offload | Q4_K_M not in HTP supported type list |
+| Ops always rejected by `supports_op` | Chicken-and-egg: tensors never land in Hexagon buffers |
+
+## Performance Benchmarks
+
+All tests on a Radxa Dragon Q6A (QCS6490, 11GB RAM, 8x ARM Cortex cores, Ubuntu 24.04).
+
+### 1B Model (Llama 3.2, Q4_K_M)
+
+| Metric | CPU-only | With Hexagon backend |
+|--------|----------|----------------------|
+| Prompt processing | 32.3 t/s | 32.0 t/s |
+| Generation | 4.5 t/s | 4.5 t/s |
+
+No difference — the CPU handles the 1B model natively in both cases.
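+The `llama_memory_breakdown_print` line quoted above is the quickest ground truth for whether anything actually landed on the device. A small parsing sketch over that exact sample line (the field meaning follows the interpretation given above; this is a convenience snippet, not part of the repo):
+
+```sh
+# Extract the parenthesized device-side total (model + context + compute)
+# from llama.cpp's memory-breakdown line for the HTP0 device.
+line='llama_memory_breakdown_print: | - HTP0 (Hexagon) | 2048 = 2048 + ( 0 = 0 + 0 + 0) + 0 |'
+used=$(printf '%s\n' "$line" | sed -E 's/.*\( *([0-9]+) =.*/\1/')
+echo "HTP0 used: ${used} MiB"   # prints "HTP0 used: 0 MiB" for the sample above
+```
+
+In a live run you would pipe `./llama-cli ... 2>&1 | grep HTP0` into the same `sed`; a nonzero figure is the first sign that offload is actually happening.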
+
+### 7B Model (DeepSeek R1 Distill Qwen, Q4_K_M)
+
+The 7B model needs an explicit `-c N`: the model alone is 4460 MiB, and the KV cache at the default 128K context would add another ~7GB, overflowing 11GB RAM.
+
+| Context | Backend | Prompt t/s | Gen t/s | Model (MiB) | Context (MiB) | Compute (MiB) | Total (MiB) |
+|---------|---------|------------|---------|-------------|---------------|---------------|-------------|
+| 2K | CPU | 2.7 | 1.9 | 4460 | 112 | 311 | 4883 |
+| 2K | NPU | 2.8 | 1.8 | 4460 | 112 | 311 | 4883 |
+| 32K | CPU | 2.7 | 1.9 | 4460 | 1792 | 396 | 6648 |
+| 32K | NPU | 2.7 | 1.9 | 4460 | 1792 | 396 | 6648 |
+| 64K | CPU | 2.5 | 1.8 | 4460 | 3584 | 429 | 8473 |
+| 64K | NPU | 2.5 | 1.8 | 4460 | 3584 | 429 | 8473 |
+
+**Every NPU vs CPU comparison is identical.** The Hexagon backend never offloads any tensors, so all computation runs on the CPU in both cases.
+
+Memory use at 64K context (~8.3 GiB total) approaches the 11 GiB ceiling but still fits, with swap as headroom.
+
+### What would need to change to get actual NPU offload
+
+1. Implement the `offload_op` callback in `ggml_backend_hexagon_device_registration`
+2. Wire up the repack buffer type for weight tensors so quantized weights land in Hexagon-accessible memory
+3. Query the actual rpcmem capacity instead of hardcoding 2GB
+4. Use a Q4_0 or Q8_0 quantized model (Q4_K_M is not supported by the HTP kernels)
+
+## Known Issues
+
+- **Minimal stub library** (`htp_minimal_impl.c`) fails to load on the DSP with error `0x80000442` — the full `libggml-htp-v68.so` (generated by the cmake build from `ggml-hexagon/main.c`) works correctly.
+- **DSP library is rebuilt every time** the cmake build runs, even when the Hexagon backend C code is unchanged.
+- The `htp_iface.idl` declares `dst` as `in sequence` (input-only) but it's an output buffer. It should be `rout`.
 
 ## Files in This Repo
 
@@ -134,5 +240,6 @@ All required `dspqueue_*` symbols are present in the SA8775P system `libcdsprpc.
 | `scripts/build-hexagon.sh` | Cross-compile script for llama.cpp with GGML_HEXAGON=ON |
 | `scripts/deploy-to-q6a.sh` | Deploy built binaries + DSP .so to Q6A |
 | `scripts/test-on-q6a.sh` | Run full inference test on Q6A |
+| `scripts/test-7b.sh` | Run 7B model benchmarks at various context sizes |
 | `references/fastrpc.h` | Q6A kernel header (ioctl struct definitions) |
 | `AGENTS.md` | Context for AI coding agents working with this codebase |
diff --git a/scripts/deploy-to-q6a.sh b/scripts/deploy-to-q6a.sh
index 8fdc9a1..f1976c7 100755
--- a/scripts/deploy-to-q6a.sh
+++ b/scripts/deploy-to-q6a.sh
@@ -3,41 +3,38 @@ set -euo pipefail
 
 Q6A="${Q6A:-radxa@192.168.1.11}"
+Q6A_PASS="${Q6A_PASS:-radxa}"
 BUILD_DIR="${BUILD_DIR:-$HOME/llama.cpp/build-hexagon}"
 DEPLOY_DIR="${DEPLOY_DIR:-llama/bin}"
 
 echo "=== Deploying to ${Q6A}:${DEPLOY_DIR} ==="
 
-# Check build artifacts exist
+# Create deploy dir
+ssh "${Q6A}" "mkdir -p ~/${DEPLOY_DIR}"
+
+# Deploy ARM binaries
+echo "--- ARM binaries ---"
 for f in llama-cli libggml-hexagon.so libggml-hexagon.so.0 libggml-hexagon.so.0.9.11 \
          libggml-base.so libggml-base.so.0 libggml-base.so.0.9.11 \
          libggml-cpu.so libggml-cpu.so.0 libggml-cpu.so.0.9.11 \
          libggml.so libggml.so.0 libggml.so.0.9.11 \
          libllama.so libllama.so.0; do
-  if [ ! -f "${BUILD_DIR}/bin/${f}" ]; then
-    echo "WARNING: ${f} not found — build may be incomplete"
+  src="${BUILD_DIR}/bin/${f}"
+  if [ -f "$src" ]; then
+    scp "$src" "${Q6A}:~/${DEPLOY_DIR}/" 2>/dev/null
   fi
 done
-
-# Create deploy dir
-ssh "${Q6A}" "mkdir -p ~/${DEPLOY_DIR}"
-
-# Deploy ARM binaries
-scp "${BUILD_DIR}/bin/llama-cli" "${Q6A}:~/${DEPLOY_DIR}/"
-scp "${BUILD_DIR}/bin/libggml-hexagon.so" "${BUILD_DIR}/bin/libggml-hexagon.so".* "${Q6A}:~/${DEPLOY_DIR}/" 2>/dev/null || true
-scp "${BUILD_DIR}/bin/libggml-base.so" "${BUILD_DIR}/bin/libggml-base.so".* "${Q6A}:~/${DEPLOY_DIR}/" 2>/dev/null || true
-scp "${BUILD_DIR}/bin/libggml-cpu.so" "${BUILD_DIR}/bin/libggml-cpu.so".* "${Q6A}:~/${DEPLOY_DIR}/" 2>/dev/null || true
-scp "${BUILD_DIR}/bin/libggml.so" "${BUILD_DIR}/bin/libggml.so".* "${Q6A}:~/${DEPLOY_DIR}/" 2>/dev/null || true
-scp "${BUILD_DIR}/bin/libllama.so" "${BUILD_DIR}/bin/libllama.so".* "${Q6A}:~/${DEPLOY_DIR}/" 2>/dev/null || true
+echo " done"
 
 # Deploy DSP skel
-DSP_SO="${BUILD_DIR}/ggml/src/ggml-hexagonal/libggml-htp-v68.so"
+echo "--- DSP .so ---"
+DSP_SO="${BUILD_DIR}/ggml/src/ggml-hexagon/libggml-htp-v68.so"
 if [ -f "$DSP_SO" ]; then
   scp "$DSP_SO" "${Q6A}:/tmp/"
-  ssh "${Q6A}" "echo radxa | sudo -S cp /tmp/libggml-htp-v68.so /usr/lib/dsp/cdsp/libggml-htp-v68.so"
-  echo "DSP .so deployed"
+  ssh "${Q6A}" "echo '${Q6A_PASS}' | sudo -S cp /tmp/libggml-htp-v68.so /usr/lib/dsp/cdsp/libggml-htp-v68.so 2>&1"
+  echo " deployed to /usr/lib/dsp/cdsp/"
 else
-  echo "WARNING: DSP .so not found at $DSP_SO"
+  echo " WARNING: DSP .so not found at $DSP_SO"
 fi
 
 echo "=== Deploy complete ==="
diff --git a/scripts/test-7b.sh b/scripts/test-7b.sh
new file mode 100644
index 0000000..edb06f5
--- /dev/null
+++ b/scripts/test-7b.sh
@@ -0,0 +1,27 @@
+#!/usr/bin/env bash
+# test-7b.sh — Run 7B model benchmarks on Q6A at various context sizes
+set -euo pipefail
+
+Q6A="${Q6A:-radxa@192.168.1.11}"
+MODEL="${MODEL:-/home/radxa/models/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf}"
+DEPLOY_DIR="${DEPLOY_DIR:-llama/bin}"
+
+CONTEXTS=("2048" "8192" "32768" "65536")
+
+echo "=== 7B Model Benchmarks ==="
+echo "Model: ${MODEL}"
+echo ""
+
+for ctx in "${CONTEXTS[@]}"; do
+  echo "--- Context ${ctx} (NPU) ---"
+  ssh "${Q6A}" "cd ~/${DEPLOY_DIR} && GGML_HEXAGON=1 LD_LIBRARY_PATH=. timeout 120 ./llama-cli -m '${MODEL}' -n 8 -p Hello -ngl 0 -c ${ctx} --no-display-prompt 2>&1" | grep -E 'Prompt:|Generation:|memory' || true
+
+  echo ""
+
+  echo "--- Context ${ctx} (CPU-only) ---"
+  ssh "${Q6A}" "cd ~/${DEPLOY_DIR} && mv libggml-hexagon.so libggml-hexagon.so.disabled 2>/dev/null; mv libggml-hexagon.so.0 libggml-hexagon.so.0.disabled 2>/dev/null; mv libggml-hexagon.so.0.9.11 libggml-hexagon.so.0.9.11.disabled 2>/dev/null; LD_LIBRARY_PATH=. timeout 120 ./llama-cli -m '${MODEL}' -n 8 -p Hello -ngl 0 -c ${ctx} --no-display-prompt 2>&1 | grep -E 'Prompt:|Generation:'; mv libggml-hexagon.so.disabled libggml-hexagon.so 2>/dev/null; mv libggml-hexagon.so.0.disabled libggml-hexagon.so.0 2>/dev/null; mv libggml-hexagon.so.0.9.11.disabled libggml-hexagon.so.0.9.11 2>/dev/null" || true
+
+  echo ""
+done
+
+echo "=== Done ==="
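As a closing sanity check, the KV-cache figures in the benchmark tables above can be reproduced from first principles. A back-of-envelope sketch (the 28-layer, 4-KV-head, 128-head-dim GQA shape is an assumption about this Qwen-7B-class model, not something recorded in the repo):

```sh
# Estimate f16 KV-cache size, assuming a GQA model with 28 layers,
# 4 KV heads, and head dim 128 (typical of a Qwen2-7B-class model).
kv_cache_mib() {
  local n_ctx=$1 n_layers=28 n_kv_heads=4 head_dim=128 bytes_per_val=2
  # K and V each store n_kv_heads * head_dim values per layer, per token.
  local per_token=$((2 * n_layers * n_kv_heads * head_dim * bytes_per_val))
  echo $(( n_ctx * per_token / 1024 / 1024 ))
}

for n_ctx in 32768 65536 131072; do
  echo "${n_ctx} ctx -> $(kv_cache_mib "$n_ctx") MiB"
done
```

With these assumed parameters the estimate lands on the measured context-memory column (1792 MiB at 32K, 3584 MiB at 64K) and on the ~7GB quoted for the default 128K context, which supports the advice to always pass `-c N` on this board.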