Commit graph

3 commits

Author SHA1 Message Date
Jimmy Devine
e6fa9052b3 Add NPU offload results: offload_op, direct-compute, 10 GiB, Q8_0 4.3x prompt speedup
- offload_op callback now implemented (MUL_MAT/MUL_MAT_ID)
- Memory raised to 10 GiB
- Direct compute mode bypasses broken dspqueue on this board
- Q8_0 1B model: 115 t/s prompt (4.3x vs CPU 27 t/s)
- Generation 9.6 t/s (27% slower than CPU, expected)
- dspqueue path fails with error 0x0000002e
- llama-cli renamed to llama-simple in current build
- Updated scripts for direct-compute mode
- Docs updated with new findings and instructions
2026-05-02 14:17:27 +02:00
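The headline speedup above is easy to sanity-check from the quoted throughput numbers. A minimal sketch (the function name is illustrative, not from the repo):

```cpp
// Sanity-check for the quoted prompt-processing speedup:
// NPU 115 t/s vs CPU 27 t/s on the 1B Q8_0 model.
double speedup(double accel_tps, double baseline_tps) {
    return accel_tps / baseline_tps;
}
```

115 / 27 is roughly 4.26, which the commit message rounds to 4.3x.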
Jimmy Devine
627236a505 Update with full NPU analysis and benchmarks
Adds:
- Detailed explanation of why Hexagon NPU doesn't accelerate inference
  - offload_op callback is NULL in ggml-hexagon.cpp
  - 2048 MiB limit is hardcoded, not hardware-queried
  - Q4_K_M not supported by HTP kernels (only Q4_0, Q8_0, IQ4_NL, MXFP4)
- Full benchmark table: 1B and 7B models, 2K/32K/64K context, CPU vs NPU
  - All results show CPU and NPU identical within margin of error
- 7B test script (test-7b.sh)
- Updated deploy script with password handling for DSP .so
- Performance baseline in AGENTS.md
- Cross-compile pitfalls (CMAKE_SYSROOT, rpcmem_init)
2026-05-02 12:42:42 +02:00
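The offload gate these two commits describe (a previously NULL offload_op callback plus limited HTP quant coverage) amounts to a predicate like the following sketch; the names are illustrative, not the actual ggml-hexagon.cpp symbols:

```cpp
#include <string>
#include <unordered_set>

// Illustrative offload_op-style predicate: only matrix-multiply ops whose
// weights are in a format the HTP kernels cover can leave the CPU. Per the
// commit message that set is Q4_0, Q8_0, IQ4_NL and MXFP4, so a Q4_K_M model
// silently runs everything on the CPU.
bool should_offload(const std::string &op, const std::string &quant) {
    static const std::unordered_set<std::string> htp_quants = {
        "Q4_0", "Q8_0", "IQ4_NL", "MXFP4",
    };
    return (op == "MUL_MAT" || op == "MUL_MAT_ID") && htp_quants.count(quant) > 0;
}
```

With this shape of check in place, the benchmark result above (CPU and NPU identical for Q4_K_M) follows directly: none of the matmuls qualify for offload.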
Jimmy Devine
18970e3258 Initial commit: Q6A Hexagon v68 + llama.cpp guide
Complete documentation for running llama.cpp with the Qualcomm Hexagon
CDSP v68 NPU backend on a Radxa Dragon Q6A (SA8775P) board.

Includes:
- Corrected FastRPC test harness (libcdsprpc handles INIT_CREATE)
- Minimal DSP stub library
- Cross-compile build script for llama.cpp
- Deploy and test scripts for Q6A
- Kernel FastRPC header for reference
- Comprehensive README with lessons learned

Key findings:
- Do NOT call FASTRPC_IOCTL_INIT_CREATE manually
- Must link against the Q6A system libcdsprpc (not the SDK's cross-compiled copy)
- Build verified: 32 t/s prompt, 4.5 t/s generation on 1B model
2026-05-02 10:28:51 +02:00
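The libcdsprpc finding can also be probed at run time rather than link time. A hedged sketch (the helper is hypothetical; llama.cpp itself links the library normally rather than dlopen-ing it):

```cpp
#include <dlfcn.h>

// Hypothetical helper: check whether a shared library resolves at run time.
// The point of the finding is that the copy that must resolve on the Q6A is
// the board's own libcdsprpc.so, not an SDK cross-compiled build; on a dev
// host without the Hexagon runtime, dlopen("libcdsprpc.so") simply fails.
bool can_dlopen(const char *name) {
    void *handle = dlopen(name, RTLD_NOW | RTLD_LOCAL);
    if (handle == nullptr) {
        return false;  // library absent, or its dependencies did not resolve
    }
    dlclose(handle);
    return true;
}
```

Running `can_dlopen("libcdsprpc.so")` on the board before launching llama.cpp is a cheap way to confirm the FastRPC user library is actually reachable.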