# Q6A Hexagon Guide — AGENTS.md
This repo documents how to get llama.cpp running with the Qualcomm Hexagon CDSP v68 backend on a Radxa Dragon Q6A board (SA8775P).
## Key Rules
- Do NOT call `FASTRPC_IOCTL_INIT_CREATE` manually. Let libcdsprpc handle it.
- Always link against the Q6A system libcdsprpc (`/usr/lib/libcdsprpc.so.1`), not the SDK's cross-compiled version.
- Do NOT set `CMAKE_SYSROOT` in the cross-compile — it conflicts with Ubuntu's cross-compiler linker scripts.
- Use `rpcmem_alloc` for DSP compute buffers — stack arrays only work for tiny buffers (~4 KB, a fragile slow path). See the sketch after this list.
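As a minimal illustration of the last rule, the sketch below allocates a DSP-shareable buffer through libcdsprpc's rpcmem API instead of a stack array. It assumes the Hexagon SDK's `rpcmem.h` is on the include path; the 1 MiB size, heap ID, and flags are illustrative choices, not values taken from this repo's scripts.

```c
/* Sketch: allocate a CDSP-mappable buffer via rpcmem instead of a stack array.
 * Assumes rpcmem.h from the Hexagon SDK; heap ID, flags, and size are
 * illustrative. Link against the board's /usr/lib/libcdsprpc.so.1 (see above). */
#include <stdio.h>
#include "rpcmem.h"

int main(void) {
    rpcmem_init();    /* required before the first allocation on some builds */

    /* 1 MiB buffer from the system heap, shareable with the CDSP */
    void *buf = rpcmem_alloc(RPCMEM_HEAP_ID_SYSTEM, RPCMEM_DEFAULT_FLAGS, 1 << 20);
    if (buf == NULL) {
        fprintf(stderr, "rpcmem_alloc failed\n");
        rpcmem_deinit();
        return 1;
    }

    /* ... hand buf to the FastRPC call / HTP kernel here ... */

    rpcmem_free(buf);
    rpcmem_deinit();
    return 0;
}
```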
## Why NPU Doesn't Actually Accelerate
Source code analysis of `ggml-hexagon.cpp` revealed:
- The `offload_op` callback is NULL — the scheduler never moves tensors to the NPU (see the sketch after this list)
- The 2048 MiB limit is hardcoded (`2ULL * 1024 * 1024 * 1024`), not queried from the hardware
- Q4_K_M is not supported — the HTP kernels only support Q4_0, Q8_0, IQ4_NL, MXFP4
- Performance is identical CPU vs NPU in all benchmarks (1B and 7B models)
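To make the `offload_op` finding concrete, here is a deliberately simplified sketch. The struct and function names are hypothetical stand-ins, not the real ggml scheduler interface, but the decision logic mirrors what a NULL callback implies: no op is ever routed to the NPU backend, so everything runs on the CPU.

```c
/* Hypothetical, simplified illustration — not the actual ggml API.
 * Shows why a NULL offload_op callback keeps every op on the CPU. */
#include <stdbool.h>
#include <stdio.h>

struct op { const char *kind; };                  /* stand-in for a tensor op */

struct backend_iface {
    const char *name;
    /* NULL means this backend never volunteers to take an op */
    bool (*offload_op)(const struct op *o);
};

/* hypothetical scheduler decision */
static bool should_offload(const struct backend_iface *b, const struct op *o) {
    return b->offload_op != NULL && b->offload_op(o);
}

int main(void) {
    struct backend_iface hexagon = { .name = "hexagon-htp", .offload_op = NULL };
    struct op matmul = { "matmul" };
    /* always "no": with the callback left NULL, nothing is offloaded */
    printf("offload matmul? %s\n", should_offload(&hexagon, &matmul) ? "yes" : "no");
    return 0;
}
```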
## Build Command
```bash
cd ~/llama.cpp
bash scripts/build-hexagon.sh
```
## Deploy Command
```bash
Q6A=radxa@192.168.1.11 bash scripts/deploy-to-q6a.sh
```
## Test Commands
```bash
# Quick 1B test
bash scripts/test-on-q6a.sh

# 7B benchmarks at various context sizes
Q6A=radxa@192.168.1.11 bash scripts/test-7b.sh
```
## File Reference
- `src/test_fastrpc_fixed.c` — Correct init sequence (reference for how to open HTP handles)
- `src/htp_minimal_impl.c` — Minimal DSP stub (for testing; the full library works instead)
- `scripts/build-hexagon.sh` — llama.cpp CMake build for aarch64 + Hexagon
- `scripts/deploy-to-q6a.sh` — Deploy to the Q6A
- `scripts/test-on-q6a.sh` — Quick 1B inference test
- `scripts/test-7b.sh` — 7B model benchmarks
- `references/fastrpc.h` — FastRPC ioctl definitions from the Q6A kernel
- `README.md` — Full guide with all findings
## Performance Baseline
| Model | Config | Prompt t/s | Gen t/s |
|---|---|---|---|
| 1B Q4_K_M | any | 32 | 4.5 |
| 7B Q4_K_M | 2K ctx | 2.7 | 1.9 |
| 7B Q4_K_M | 32K ctx | 2.7 | 1.9 |
| 7B Q4_K_M | 64K ctx | 2.5 | 1.8 |
NPU and CPU results are identical in all configurations. The 7B model needs `-c N` to cap the KV cache (the default 128K context consumes 7 GB+ for the KV cache alone).
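As a rough sanity check on that KV figure (assuming a typical grouped-query-attention 7B layout of about 28 layers, 4 KV heads of dimension 128, and an FP16 cache; these are illustrative numbers, not verified for the exact model used here): each token stores 2 × 28 × 4 × 128 × 2 bytes ≈ 56 KiB of keys and values, so a 131072-token context needs roughly 56 KiB × 131072 ≈ 7 GiB for the KV cache alone, consistent with the figure above.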