Last commit 627236a505 (Jimmy Devine, 2026-05-02): Update with full NPU analysis and benchmarks.

Q6A Hexagon Guide — AGENTS.md

This repo documents how to get llama.cpp running with the Qualcomm Hexagon CDSP v68 backend on a Radxa Dragon Q6A board (SA8775P).

Key Rules

  1. Do NOT call FASTRPC_IOCTL_INIT_CREATE manually. Let libcdsprpc handle it.
  2. Always link against Q6A system libcdsprpc (/usr/lib/libcdsprpc.so.1), not the SDK's cross-compiled version.
  3. Do NOT set CMAKE_SYSROOT in the cross-compile — it conflicts with Ubuntu's cross-compiler linker scripts.
  4. Use rpcmem_alloc for DSP compute buffers — stack arrays only work for tiny buffers (a fragile slow path limited to roughly 4 KB).

Why the NPU Doesn't Actually Accelerate

Source code analysis of ggml-hexagon.cpp revealed:

  • offload_op callback is NULL — scheduler never moves tensors to NPU
  • 2048 MiB limit is hardcoded (2ULL * 1024 * 1024 * 1024), not a hardware query
  • Q4_K_M not supported — HTP kernels only support Q4_0, Q8_0, IQ4_NL, MXFP4
  • Performance is identical between CPU and NPU in all benchmarks (1B and 7B models)

Build Command

cd ~/llama.cpp
bash scripts/build-hexagon.sh

Deploy Command

Q6A=radxa@192.168.1.11 bash scripts/deploy-to-q6a.sh

Test Commands

# Quick 1B test
bash scripts/test-on-q6a.sh

# 7B benchmarks at various context sizes
Q6A=radxa@192.168.1.11 bash scripts/test-7b.sh

File Reference

  • src/test_fastrpc_fixed.c — Correct init sequence (reference for how to open HTP handles)
  • src/htp_minimal_impl.c — Minimal DSP stub (for testing only; the full library is used in practice)
  • scripts/build-hexagon.sh — llama.cpp cmake build for aarch64 + Hexagon
  • scripts/deploy-to-q6a.sh — Deploy to Q6A
  • scripts/test-on-q6a.sh — Quick 1B inference test
  • scripts/test-7b.sh — 7B model benchmarks
  • references/fastrpc.h — FastRPC ioctl definitions from Q6A kernel
  • README.md — Full guide with all findings

Performance Baseline

Model      Config    Prompt t/s   Gen t/s
1B Q4_K_M  any       32           4.5
7B Q4_K_M  2K ctx    2.7          1.9
7B Q4_K_M  32K ctx   2.7          1.9
7B Q4_K_M  64K ctx   2.5          1.8

NPU and CPU throughput are identical in every configuration. Run the 7B model with -c N to limit the KV cache: the default 128K context consumes 7 GB+ for the KV cache alone.