Update with full NPU analysis and benchmarks
Adds:
- Detailed explanation of why the Hexagon NPU doesn't accelerate inference
- `offload_op` callback is NULL in ggml-hexagon.cpp
- 2048 MiB limit is hardcoded, not hardware-queried
- Q4_K_M not supported by HTP kernels (only Q4_0, Q8_0, IQ4_NL, MXFP4)
- Full benchmark table: 1B and 7B models, 2K/32K/64K context, CPU vs NPU
- All results show CPU and NPU identical within margin of error
- 7B test script (test-7b.sh)
- Updated deploy script with password handling for the DSP .so
- Performance baseline in AGENTS.md
- Cross-compile pitfalls (CMAKE_SYSROOT, rpcmem_init)
This commit is contained in:
parent 18970e3258
commit 627236a505

4 changed files with 179 additions and 30 deletions

AGENTS.md (30 lines changed)

@@ -9,6 +9,14 @@ This repo documents how to get llama.cpp running with the Qualcomm Hexagon CDSP
3. **Do NOT set CMAKE_SYSROOT** in the cross-compile — it conflicts with Ubuntu's cross-compiler linker scripts.
4. **Use rpcmem_alloc for DSP compute buffers** — stack arrays only work for tiny buffers (~4 KB, fragile slow path).

## Why NPU Doesn't Actually Accelerate

Source code analysis of `ggml-hexagon.cpp` revealed:

- **`offload_op` callback is NULL** — the scheduler never moves tensors to the NPU
- **2048 MiB limit is hardcoded** (`2ULL * 1024 * 1024 * 1024`), not a hardware query
- **Q4_K_M not supported** — HTP kernels only support Q4_0, Q8_0, IQ4_NL, MXFP4
- Performance is identical between CPU and NPU in all benchmarks (1B and 7B models)

## Build Command

```bash
@@ -22,10 +30,14 @@
bash scripts/build-hexagon.sh
Q6A=radxa@192.168.1.11 bash scripts/deploy-to-q6a.sh
```

## Test Commands

```bash
# Quick 1B test
bash scripts/test-on-q6a.sh

# 7B benchmarks at various context sizes
Q6A=radxa@192.168.1.11 bash scripts/test-7b.sh
```

## File Reference
@@ -34,12 +46,18 @@ bash scripts/test-on-q6a.sh
- `src/htp_minimal_impl.c` — Minimal DSP stub (for testing; the full library works instead)
- `scripts/build-hexagon.sh` — llama.cpp cmake build for aarch64 + Hexagon
- `scripts/deploy-to-q6a.sh` — Deploy to Q6A
- `scripts/test-on-q6a.sh` — Quick 1B inference test
- `scripts/test-7b.sh` — 7B model benchmarks
- `references/fastrpc.h` — FastRPC ioctl definitions from Q6A kernel
- `README.md` — Full guide with all findings

## Performance Baseline

| Model     | Config  | Prompt t/s | Gen t/s |
|-----------|---------|------------|---------|
| 1B Q4_K_M | any     | 32         | 4.5     |
| 7B Q4_K_M | 2K ctx  | 2.7        | 1.9     |
| 7B Q4_K_M | 32K ctx | 2.7        | 1.9     |
| 7B Q4_K_M | 64K ctx | 2.5        | 1.8     |

NPU and CPU are identical in all configurations. The 7B model needs `-c N` to limit the KV cache size (the default 128K context consumes 7 GB+ for KV alone).
README.md (119 lines changed)
@@ -89,7 +89,7 @@ The `scripts/build-hexagon.sh` script:

### Root Cause of `remote_handle64_open` Error 0xe

The error occurs because **the SDK's cross-compiled `libcdsprpc.so` does NOT handle `FASTRPC_IOCTL_INIT_CREATE` internally** for unsigned PDs. The Q6A system `/usr/lib/libcdsprpc.so.1` does. The fix: compile and link natively on the Q6A (or link against the system library).

### Do NOT Call INIT_CREATE Manually
@@ -118,12 +118,118 @@ remote_handle64_open(uri, &handle);
All required `dspqueue_*` symbols are present in the SA8775P system `libcdsprpc.so.1`:
`dspqueue_create`, `dspqueue_close`, `dspqueue_export`, `dspqueue_write`, `dspqueue_read`, etc.

### Cross-Compile Pitfalls

1. **`CMAKE_SYSROOT` breaks the linker** — Ubuntu's `gcc-aarch64-linux-gnu` packages install linker scripts with absolute paths (e.g., `/usr/aarch64-linux-gnu/lib/libm.so` contains `GROUP( /usr/aarch64-linux-gnu/lib/libm.so.6 ... )`). When `--sysroot` is set to `/usr/aarch64-linux-gnu`, these absolute paths double up to `/usr/aarch64-linux-gnu/usr/aarch64-linux-gnu/lib/...` and fail. Solution: let the compiler use its built-in sysroot.

2. **`rpcmem_init` is optional** when linking against `libcdsprpc.so`. The SDK's cross-compiled `libcdsprpc` only exports `rpcmem_alloc`/`rpcmem_free` but not `rpcmem_init`/`rpcmem_deinit`. The Q6A system libcdsprpc has them all.
## Why the NPU Isn't Accelerating Inference

After extensive testing, the Hexagon backend loads and initializes successfully but **never actually offloads any computation**. Every test shows:

```
llama_memory_breakdown_print: | - HTP0 (Hexagon) | 2048 = 2048 + ( 0 = 0 + 0 + 0) + 0 |
```

The NPU reports 2048 MiB but uses 0 MiB for model, context, and compute. Source code analysis reveals three reasons:
### 1. `offload_op` callback is NULL

The backend device struct registers `/* .offload_op = */ NULL`, so the scheduler never proactively moves tensors to the NPU. Even when `-ngl N` is specified, without this callback no layers get claimed.

```c
// ggml/src/ggml-hexagon/ggml-hexagon.cpp
const ggml_backend_device_t ggml_backend_hexagon_device_registration = {
    /* .name                  = */ GGML_HEXAGON_DEVICE_NAME,
    /* .description           = */ "Hexagon NPU",
    /* .get_memory            = */ ggml_backend_hexagon_device_get_memory,
    /* .get_version           = */ NULL,
    /* .get_best_device       = */ NULL,
    /* .get_device_for_tensor = */ NULL,
    /* .offload_op            = */ NULL,  // <--- STUBBED
    /* .supports_op           = */ ggml_backend_hexagon_device_supports_op,
    ...
};
```
### 2. 2048 MiB limit is hardcoded, not queried

The 2 GB of "free memory" reported is a hardcoded constant, not a hardware query:

```c
// ggml/src/ggml-hexagon/ggml-hexagon.cpp
static void ggml_backend_hexagon_device_get_memory(ggml_backend_dev_t dev, size_t * free, size_t * total) {
    // ~2GB per session for now
    *free  = 2ULL * 1024 * 1024 * 1024;
    *total = *free;
    GGML_UNUSED(dev);
}
```

It never calls `rpcmem_alloc2()` or checks kernel ION/DMA-BUF heap sizes. The 2 GB is a rough placeholder. On QCS6490 with 11 GB system RAM, the CDSP carveout is typically 1-4 GB depending on firmware config, but the code never checks.
### 3. Q4_K_M is not a supported quantization for HTP

The Hexagon HTP kernels only support these quantization types:

- `GGML_TYPE_Q4_0`
- `GGML_TYPE_Q8_0`
- `GGML_TYPE_IQ4_NL`
- `GGML_TYPE_MXFP4`

**Q4_K_M is NOT in this list.** Every MUL_MAT operation with Q4_K_M weights fails the `supports_op` type check, regardless of buffer placement. This means that even if you fix the buffer allocation path, the 7B Q4_K_M model still won't offload.
### Summary of Issues

| Issue | Root cause |
|-------|------------|
| 0 MiB used for model/context/compute | `offload_op = NULL` in device registration |
| 2048 MiB cap | Hardcoded constant, not a FastRPC/ION query |
| Q4_K_M tensors don't offload | Q4_K_M not in HTP supported type list |
| Ops always rejected by `supports_op` | Chicken-and-egg: tensors never land in Hexagon buffers |
## Performance Benchmarks

All tests on a Radxa Dragon Q6A (QCS6490, 11 GB RAM, 8x ARM Cortex cores, Ubuntu 24.04).

### 1B Model (Llama 3.2, Q4_K_M)

| Metric | CPU-only | With Hexagon backend |
|--------|----------|----------------------|
| Prompt processing | 32.3 t/s | 32.0 t/s |
| Generation | 4.5 t/s | 4.5 t/s |

No difference — the CPU handles the 1B model natively.
### 7B Model (DeepSeek R1 Distill Qwen, Q4_K_M)

You need `-c 2048` or smaller to fit in 11 GB RAM (the model alone is 4460 MiB, and the KV cache at the default 128K context adds 7 GB+).

| Context | Test | Prompt t/s | Gen t/s | Model (MiB) | Context (MiB) | Compute (MiB) | Total (MiB) |
|---------|------|------------|---------|-------------|---------------|---------------|-------------|
| 2K  | CPU | 2.7 | 1.9 | 4460 | 112  | 311 | 4883 |
| 2K  | NPU | 2.8 | 1.8 | 4460 | 112  | 311 | 4883 |
| 32K | CPU | 2.7 | 1.9 | 4460 | 1792 | 396 | 6648 |
| 32K | NPU | 2.7 | 1.9 | 4460 | 1792 | 396 | 6648 |
| 64K | CPU | 2.5 | 1.8 | 4460 | 3584 | 429 | 8473 |
| 64K | NPU | 2.5 | 1.8 | 4460 | 3584 | 429 | 8473 |

**Every NPU vs CPU comparison is identical.** The Hexagon backend never offloads any tensors, so all computation runs on the CPU in both cases.

Memory at 64K context (8.5 GiB total) approaches the 11 GiB ceiling but fits with swap.
### What would need to change to get actual NPU offload

1. Implement the `offload_op` callback in `ggml_backend_hexagon_device_registration`
2. Wire the repack buffer type for weight tensors so quantized weights land in Hexagon-accessible memory
3. Query actual rpcmem capacity instead of hardcoding 2 GB
4. Use a Q4_0 or Q8_0 quantized model (Q4_K_M is not supported by the HTP kernels)
## Known Issues

- **Minimal stub library** (`htp_minimal_impl.c`) fails to load on the DSP with error `0x80000442` — the full `libggml-htp-v68.so` (generated by the cmake build from `ggml-hexagon/main.c`) works correctly.
- **DSP library is rebuilt every time** the cmake build runs.
- The `htp_iface.idl` declares `dst` as `in sequence<uint8>` (input-only) but it's an output buffer. It should be `rout`.
## Files in This Repo

@@ -134,5 +240,6 @@
| `scripts/build-hexagon.sh` | Cross-compile script for llama.cpp with GGML_HEXAGON=ON |
| `scripts/deploy-to-q6a.sh` | Deploy built binaries + DSP .so to Q6A |
| `scripts/test-on-q6a.sh` | Run full inference test on Q6A |
| `scripts/test-7b.sh` | Run 7B model benchmarks at various context sizes |
| `references/fastrpc.h` | Q6A kernel header (ioctl struct definitions) |
| `AGENTS.md` | Context for AI coding agents working with this codebase |

scripts/deploy-to-q6a.sh
@@ -3,41 +3,38 @@
set -euo pipefail

Q6A="${Q6A:-radxa@192.168.1.11}"
Q6A_PASS="${Q6A_PASS:-radxa}"
BUILD_DIR="${BUILD_DIR:-$HOME/llama.cpp/build-hexagon}"
DEPLOY_DIR="${DEPLOY_DIR:-llama/bin}"

echo "=== Deploying to ${Q6A}:${DEPLOY_DIR} ==="

# Create deploy dir
ssh "${Q6A}" "mkdir -p ~/${DEPLOY_DIR}"

# Deploy ARM binaries
echo "--- ARM binaries ---"
for f in llama-cli libggml-hexagon.so libggml-hexagon.so.0 libggml-hexagon.so.0.9.11 \
         libggml-base.so libggml-base.so.0 libggml-base.so.0.9.11 \
         libggml-cpu.so libggml-cpu.so.0 libggml-cpu.so.0.9.11 \
         libggml.so libggml.so.0 libggml.so.0.9.11 \
         libllama.so libllama.so.0; do
  src="${BUILD_DIR}/bin/${f}"
  if [ -f "$src" ]; then
    scp "$src" "${Q6A}:~/${DEPLOY_DIR}/" 2>/dev/null
  fi
done
echo " done"

# Deploy DSP skel
echo "--- DSP .so ---"
DSP_SO="${BUILD_DIR}/ggml/src/ggml-hexagon/libggml-htp-v68.so"
if [ -f "$DSP_SO" ]; then
  scp "$DSP_SO" "${Q6A}:/tmp/"
  ssh "${Q6A}" "echo '${Q6A_PASS}' | sudo -S cp /tmp/libggml-htp-v68.so /usr/lib/dsp/cdsp/libggml-htp-v68.so 2>&1"
  echo " deployed to /usr/lib/dsp/cdsp/"
else
  echo " WARNING: DSP .so not found at $DSP_SO"
fi

echo "=== Deploy complete ==="
scripts/test-7b.sh (new file, 27 lines)

@@ -0,0 +1,27 @@
#!/usr/bin/env bash
# test-7b.sh — Run 7B model benchmarks on Q6A at various context sizes
set -euo pipefail

Q6A="${Q6A:-radxa@192.168.1.11}"
MODEL="${MODEL:-/home/radxa/models/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf}"
DEPLOY_DIR="${DEPLOY_DIR:-llama/bin}"

CONTEXTS=("2048" "8192" "32768" "65536")

echo "=== 7B Model Benchmarks ==="
echo "Model: ${MODEL}"
echo ""

for ctx in "${CONTEXTS[@]}"; do
  echo "--- Context ${ctx} (NPU) ---"
  ssh "${Q6A}" "cd ~/${DEPLOY_DIR} && GGML_HEXAGON=1 LD_LIBRARY_PATH=. timeout 120 ./llama-cli -m '${MODEL}' -n 8 -p Hello -ngl 0 -c ${ctx} --no-display-prompt 2>&1" | grep -E 'Prompt:|Generation:|memory'
  echo ""

  echo "--- Context ${ctx} (CPU-only) ---"
  ssh "${Q6A}" "cd ~/${DEPLOY_DIR} && mv libggml-hexagon.so libggml-hexagon.so.disabled 2>/dev/null; mv libggml-hexagon.so.0 libggml-hexagon.so.0.disabled 2>/dev/null; mv libggml-hexagon.so.0.9.11 libggml-hexagon.so.0.9.11.disabled 2>/dev/null; LD_LIBRARY_PATH=. timeout 120 ./llama-cli -m '${MODEL}' -n 8 -p Hello -ngl 0 -c ${ctx} --no-display-prompt 2>&1 | grep -E 'Prompt:|Generation:'; mv libggml-hexagon.so.disabled libggml-hexagon.so 2>/dev/null; mv libggml-hexagon.so.0.disabled libggml-hexagon.so.0 2>/dev/null; mv libggml-hexagon.so.0.9.11.disabled libggml-hexagon.so.0.9.11 2>/dev/null"
  echo ""
done

echo "=== Done ==="