# Q6A Hexagon Guide — AGENTS.md

This repo documents how to get llama.cpp running with the Qualcomm Hexagon CDSP v68 backend on a Radxa Dragon Q6A board (SA8775P).

## Key Rules

1. **Do NOT call FASTRPC_IOCTL_INIT_CREATE manually.** Let libcdsprpc handle it.
2. **Always link against Q6A system libcdsprpc** (`/usr/lib/libcdsprpc.so.1`), not the SDK's cross-compiled version.
3. **Do NOT set CMAKE_SYSROOT** in the cross-compile — it conflicts with Ubuntu's cross-compiler linker scripts.
4. **Use rpcmem_alloc for DSP compute buffers.** Stack arrays only work for tiny buffers (around 4 KB), and even then they go through a fragile, slow path.
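
Rules 2 and 3 can be checked mechanically before a test run. The sketch below is one way to do that, not something taken from this repo's scripts: the function names `check_no_sysroot` and `resolved_cdsprpc` are hypothetical, and it assumes the build directory contains a standard `CMakeCache.txt` and that `ldd` is run on the Q6A itself.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight checks for Key Rules 2 and 3 (not part of this repo's scripts).

# Rule 3: succeed only if the CMake cache has no non-empty CMAKE_SYSROOT entry.
# $1 is the path to a CMakeCache.txt file.
check_no_sysroot() {
    if grep -Eq '^CMAKE_SYSROOT(:[A-Z]+)?=.+' "$1"; then
        echo "FAIL: CMAKE_SYSROOT is set in $1"
        return 1
    fi
    echo "OK: CMAKE_SYSROOT is not set in $1"
}

# Rule 2: read `ldd` output on stdin and print the path that libcdsprpc resolves to.
resolved_cdsprpc() {
    awk '$1 ~ /^libcdsprpc\.so/ && $2 == "=>" { print $3 }'
}
```

On the board, something like `ldd <binary> | resolved_cdsprpc` should then print `/usr/lib/libcdsprpc.so.1` if the system library is the one being picked up; any other path suggests the SDK copy leaked into the link or the library search path.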

## Build Command

```bash
cd ~/llama.cpp
bash scripts/build-hexagon.sh
```

## Deploy Command

```bash
Q6A=radxa@192.168.1.11 bash scripts/deploy-to-q6a.sh
```
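
The deploy target is passed through the `Q6A` environment variable in `user@host` form. As a rough illustration only, a guard like the following could catch a malformed target before any copy is attempted; `valid_target` is a hypothetical helper and is not known to exist in `scripts/deploy-to-q6a.sh`.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not taken from scripts/deploy-to-q6a.sh:
# accept a target only if it has exactly one "@", a non-empty user,
# a non-empty host, and no whitespace.
valid_target() {
    case "$1" in
        *@*@*|@*|*@|*[[:space:]]*) return 1 ;;
        *@*) return 0 ;;
        *) return 1 ;;
    esac
}

# Fall back to the address used in the example above; adjust for your network.
target="${Q6A:-radxa@192.168.1.11}"
if valid_target "$target"; then
    echo "deploy target: $target"
else
    echo "Q6A must look like user@host, got: $target" >&2
fi
```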

## Test Command

```bash
bash scripts/test-on-q6a.sh
```

## File Reference

- `src/test_fastrpc_fixed.c` — Correct init sequence (reference for how to open HTP handles)
- `src/htp_minimal_impl.c` — Minimal DSP stub (for testing only; the full library works in its place)
- `scripts/build-hexagon.sh` — llama.cpp cmake build for aarch64 + Hexagon
- `scripts/deploy-to-q6a.sh` — Deploy to Q6A
- `scripts/test-on-q6a.sh` — Run inference test on Q6A
- `references/fastrpc.h` — FastRPC ioctl definitions from Q6A kernel
- `README.md` — Full guide with troubleshooting

## Performance Baseline

- Prompt processing: ~32 t/s (on 8 CPU cores)
- Generation: ~4.5 t/s
- Model: llama-3.2-1b-q4km.gguf (1B params, Q4_K_M)
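
To turn these rates into a rough wall-clock expectation, divide the token counts by the matching rate and add the two terms. The sketch below does that arithmetic; `estimate_seconds` is a hypothetical helper, the 512/128 token counts are illustrative, and the result ignores model load time and any warm-up.

```bash
#!/usr/bin/env bash
# Rough latency estimate from the baseline rates above.
# Usage: estimate_seconds PROMPT_TOKENS GEN_TOKENS [PROMPT_RATE] [GEN_RATE]
# Rates default to the baseline figures: 32 t/s prompt, 4.5 t/s generation.
estimate_seconds() {
    awk -v p="$1" -v g="$2" -v pr="${3:-32}" -v tr="${4:-4.5}" \
        'BEGIN { printf "%.1f\n", p / pr + g / tr }'
}

# 512 prompt tokens and 128 generated tokens:
# 512 / 32 = 16.0 s, 128 / 4.5 = 28.4 s, about 44.4 s in total.
estimate_seconds 512 128
```

Generation dominates even for a short reply, so the generation rate is the number to watch when comparing against this baseline.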

Initial commit (2026-05-02 08:28:51 +00:00): Q6A Hexagon v68 + llama.cpp guide

Complete documentation for running llama.cpp with the Qualcomm Hexagon CDSP v68 NPU backend on a Radxa Dragon Q6A (SA8775P) board.

Includes:

- Corrected FastRPC test harness (libcdsprpc handles INIT_CREATE)
- Minimal DSP stub library
- Cross-compile build script for llama.cpp
- Deploy and test scripts for Q6A
- Kernel FastRPC header for reference
- Comprehensive README with lessons learned

Key findings:

- Do NOT call FASTRPC_IOCTL_INIT_CREATE manually
- Must link against Q6A system libcdsprpc (not SDK cross-compiled)
- Build verified: 32 t/s prompt, 4.5 t/s generation on 1B model
