Complete documentation for running llama.cpp with the Qualcomm Hexagon CDSP v68 NPU backend on a Radxa Dragon Q6A (SA8775P) board.

Includes:
- Corrected FastRPC test harness (libcdsprpc handles INIT_CREATE)
- Minimal DSP stub library
- Cross-compile build script for llama.cpp
- Deploy and test scripts for Q6A
- Kernel FastRPC header for reference
- Comprehensive README with lessons learned

Key findings:
- Do NOT call FASTRPC_IOCTL_INIT_CREATE manually
- Must link against Q6A system libcdsprpc (not SDK cross-compiled)
- Build verified: 32 t/s prompt, 4.5 t/s generation on 1B model
Q6A Hexagon Guide — AGENTS.md
This repo documents how to get llama.cpp running with the Qualcomm Hexagon CDSP v68 backend on a Radxa Dragon Q6A board (SA8775P).
Key Rules
- Do NOT call FASTRPC_IOCTL_INIT_CREATE manually. Let libcdsprpc handle it.
- Always link against the Q6A system libcdsprpc (/usr/lib/libcdsprpc.so.1), not the SDK's cross-compiled version.
- Do NOT set CMAKE_SYSROOT in the cross-compile; it conflicts with Ubuntu's cross-compiler linker scripts.
- Use rpcmem_alloc for DSP compute buffers. Stack arrays only work for tiny buffers (~4 KB) and go through a fragile, slow path.
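The linking rule above can be spot-checked on the board by looking at where the dynamic loader resolves libcdsprpc. The following is a minimal sketch, not part of this repo: the helper name check_cdsprpc is hypothetical, and the binary path in the usage comment is an assumption. It reads `ldd` output on stdin and compares the resolved path with the system library path named in the rule.

```shell
# Hypothetical helper (not shipped in this repo).
# Reads `ldd` output on stdin and reports where libcdsprpc resolves.
EXPECTED="/usr/lib/libcdsprpc.so.1"   # system library path from the rule above

check_cdsprpc() {
    # ldd lines look like: "libname => /resolved/path (0xaddress)"
    resolved=$(awk '$1 ~ /^libcdsprpc\.so/ && $2 == "=>" { print $3; exit }')
    if [ -z "$resolved" ]; then
        echo "missing"                # binary does not link libcdsprpc at all
    elif [ "$resolved" = "$EXPECTED" ]; then
        echo "ok"                     # resolves to the Q6A system library
    else
        echo "wrong: $resolved"       # resolves somewhere else, e.g. an SDK copy
    fi
}

# Usage on the Q6A (the binary location is an assumption):
#   ldd ./llama-cli | check_cdsprpc
```

A result of "wrong: ..." would suggest the binary is picking up a copy of the library other than the system one, which is the situation the rule warns against.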
Build Command
cd ~/llama.cpp
bash scripts/build-hexagon.sh
Deploy Command
Q6A=radxa@192.168.1.11 bash scripts/deploy-to-q6a.sh
Test Command
bash scripts/test-on-q6a.sh
File Reference
- src/test_fastrpc_fixed.c: Correct init sequence (reference for how to open HTP handles)
- src/htp_minimal_impl.c: Minimal DSP stub (for testing, full library works instead)
- scripts/build-hexagon.sh: llama.cpp cmake build for aarch64 + Hexagon
- scripts/deploy-to-q6a.sh: Deploy to Q6A
- scripts/test-on-q6a.sh: Run inference test on Q6A
- references/fastrpc.h: FastRPC ioctl definitions from Q6A kernel
- README.md: Full guide with troubleshooting
Performance Baseline
- Prompt processing: ~32 t/s (on 8 CPU cores)
- Generation: ~4.5 t/s
- Model: llama-3.2-1b-q4km.gguf (1B params, Q4_K_M)
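To put the baseline in wall-clock terms, divide a token count by the measured rate. The sketch below does that arithmetic; the helper name estimate_seconds and the token counts (a 512-token prompt, 128 generated tokens) are illustrative assumptions, not measurements from this repo.

```shell
# Hypothetical helper: estimate_seconds TOKENS RATE
# Prints TOKENS / RATE in seconds, rounded to one decimal place.
estimate_seconds() {
    awk -v n="$1" -v r="$2" 'BEGIN { printf "%.1f\n", n / r }'
}

estimate_seconds 512 32     # 512-token prompt at ~32 t/s   -> 16.0
estimate_seconds 128 4.5    # 128 generated tokens at ~4.5 t/s -> 28.4
```

At these rates, generation dominates the total time for short prompts with longer replies.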