Commit graph

2 commits

Jimmy Devine
627236a505 Update with full NPU analysis and benchmarks
Adds:
- Detailed explanation of why the Hexagon NPU doesn't accelerate inference
  - offload_op callback is NULL in ggml-hexagon.cpp
  - 2048 MiB limit is hardcoded, not hardware-queried
  - Q4_K_M not supported by HTP kernels (only Q4_0, Q8_0, IQ4_NL, MXFP4)
- Full benchmark table: 1B and 7B models, 2K/32K/64K context, CPU vs NPU
  - All results show CPU and NPU performance identical within the margin of error
- 7B test script (test-7b.sh)
- Updated deploy script with password handling for DSP .so
- Performance baseline in AGENTS.md
- Cross-compile pitfalls (CMAKE_SYSROOT, rpcmem_init)
2026-05-02 12:42:42 +02:00
Jimmy Devine
18970e3258 Initial commit: Q6A Hexagon v68 + llama.cpp guide
Complete documentation for running llama.cpp with the Qualcomm Hexagon
CDSP v68 NPU backend on a Radxa Dragon Q6A (SA8775P) board.

Includes:
- Corrected FastRPC test harness (libcdsprpc handles INIT_CREATE)
- Minimal DSP stub library
- Cross-compile build script for llama.cpp
- Deploy and test scripts for Q6A
- Kernel FastRPC header for reference
- Comprehensive README with lessons learned

Key findings:
- Do NOT call FASTRPC_IOCTL_INIT_CREATE manually
- Must link against the Q6A system libcdsprpc (not an SDK cross-compiled build)
- Build verified: 32 t/s prompt, 4.5 t/s generation on 1B model
2026-05-02 10:28:51 +02:00