# Q6A Hexagon Guide — AGENTS.md
This repo documents how to get llama.cpp running with the Qualcomm Hexagon CDSP v68 backend on a Radxa Dragon Q6A board (SA8775P).
## Key Rules
- Do NOT call `FASTRPC_IOCTL_INIT_CREATE` manually. Let libcdsprpc handle it.
- Always link against the Q6A system libcdsprpc (`/usr/lib/libcdsprpc.so.1`), not the SDK's cross-compiled version.
- Do NOT set `CMAKE_SYSROOT` in the cross-compile — it conflicts with Ubuntu's cross-compiler linker scripts.
- Use `rpcmem_alloc` for DSP compute buffers — stack arrays only work for tiny buffers (~4 KB, a fragile slow path). See the sketch after this list.
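As a minimal illustration of the last rule, the sketch below allocates a DSP-shareable buffer through libcdsprpc's rpcmem API instead of a stack array. It assumes the Hexagon SDK's `rpcmem.h` is on the include path; the 1 MiB size, heap ID, and flags are illustrative choices, not values taken from this repo's scripts.

```c
/* Sketch: allocate a CDSP-mappable buffer via rpcmem instead of a stack array.
 * Assumes rpcmem.h from the Hexagon SDK; heap ID, flags, and size are
 * illustrative. Link against the board's /usr/lib/libcdsprpc.so.1 (see above). */
#include <stdio.h>
#include "rpcmem.h"

int main(void) {
    rpcmem_init();    /* required before the first allocation on some builds */

    /* 1 MiB buffer from the system heap, shareable with the CDSP */
    void *buf = rpcmem_alloc(RPCMEM_HEAP_ID_SYSTEM, RPCMEM_DEFAULT_FLAGS, 1 << 20);
    if (buf == NULL) {
        fprintf(stderr, "rpcmem_alloc failed\n");
        rpcmem_deinit();
        return 1;
    }

    /* ... hand buf to the FastRPC call / HTP kernel here ... */

    rpcmem_free(buf);
    rpcmem_deinit();
    return 0;
}
```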
## Why NPU Doesn't Actually Accelerate
Source code analysis of `ggml-hexagon.cpp` revealed:
- The `offload_op` callback is NULL — the scheduler never moves tensors to the NPU (see the sketch after this list)
- The 2048 MiB limit is hardcoded (`2ULL * 1024 * 1024 * 1024`), not queried from the hardware
- Q4_K_M is not supported — the HTP kernels only support Q4_0, Q8_0, IQ4_NL, MXFP4
- Performance is identical CPU vs NPU in all benchmarks (1B and 7B models)
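To make the `offload_op` finding concrete, here is a deliberately simplified sketch. The struct and function names are hypothetical stand-ins, not the real ggml scheduler interface, but the decision logic mirrors what a NULL callback implies: no op is ever routed to the NPU backend, so everything runs on the CPU.

```c
/* Hypothetical, simplified illustration — not the actual ggml API.
 * Shows why a NULL offload_op callback keeps every op on the CPU. */
#include <stdbool.h>
#include <stdio.h>

struct op { const char *kind; };                  /* stand-in for a tensor op */

struct backend_iface {
    const char *name;
    /* NULL means this backend never volunteers to take an op */
    bool (*offload_op)(const struct op *o);
};

/* hypothetical scheduler decision */
static bool should_offload(const struct backend_iface *b, const struct op *o) {
    return b->offload_op != NULL && b->offload_op(o);
}

int main(void) {
    struct backend_iface hexagon = { .name = "hexagon-htp", .offload_op = NULL };
    struct op matmul = { "matmul" };
    /* always "no": with the callback left NULL, nothing is offloaded */
    printf("offload matmul? %s\n", should_offload(&hexagon, &matmul) ? "yes" : "no");
    return 0;
}
```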
## Build Command
```bash
cd ~/llama.cpp
bash scripts/build-hexagon.sh
```
## Deploy Command
```bash
Q6A=radxa@192.168.1.11 bash scripts/deploy-to-q6a.sh
```
## Test Commands
```bash
# Quick 1B test
bash scripts/test-on-q6a.sh

# 7B benchmarks at various context sizes
Q6A=radxa@192.168.1.11 bash scripts/test-7b.sh
```
## File Reference
- `src/test_fastrpc_fixed.c` — Correct init sequence (reference for how to open HTP handles)
- `src/htp_minimal_impl.c` — Minimal DSP stub (for testing; the full library works instead)
- `scripts/build-hexagon.sh` — llama.cpp CMake build for aarch64 + Hexagon
- `scripts/deploy-to-q6a.sh` — Deploy to the Q6A
- `scripts/test-on-q6a.sh` — Quick 1B inference test
- `scripts/test-7b.sh` — 7B model benchmarks
- `references/fastrpc.h` — FastRPC ioctl definitions from the Q6A kernel
- `README.md` — Full guide with all findings
## Performance Baseline
| Model | Config | Prompt t/s | Gen t/s |
|---|---|---|---|
| 1B Q4_K_M | any | 32 | 4.5 |
| 7B Q4_K_M | 2K ctx | 2.7 | 1.9 |
| 7B Q4_K_M | 32K ctx | 2.7 | 1.9 |
| 7B Q4_K_M | 64K ctx | 2.5 | 1.8 |
NPU and CPU results are identical in all configurations. The 7B model needs `-c N` to cap the KV cache (the default 128K context consumes 7 GB+ for the KV cache alone).
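As a rough sanity check on that KV figure (assuming a typical grouped-query-attention 7B layout of about 28 layers, 4 KV heads of dimension 128, and an FP16 cache; these are illustrative numbers, not verified for the exact model used here): each token stores 2 × 28 × 4 × 128 × 2 bytes ≈ 56 KiB of keys and values, so a 131072-token context needs roughly 56 KiB × 131072 ≈ 7 GiB for the KV cache alone, consistent with the figure above.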