# Q6A Hexagon v68 + llama.cpp — Complete Guide

This repo documents how to get llama.cpp running with the **Qualcomm Hexagon CDSP v68** (NPU/DSP) backend on a **Radxa Dragon Q6A** board (QCS6490).

## Overview

The Q6A has a Qualcomm QCS6490 SoC with a Hexagon CDSP v68 that can accelerate matrix operations in llama.cpp via FastRPC. The key insight from weeks of debugging: **let libcdsprpc handle `FASTRPC_IOCTL_INIT_CREATE` internally** — do NOT attempt it manually. Use the system's `libcdsprpc.so`, not the SDK's cross-compiled version.

## Prerequisites

### Build Machine (x86_64)

- Ubuntu 24.04 (or similar with cross-compilation packages)
- Packages:

```bash
sudo apt install gcc-aarch64-linux-gnu g++-aarch64-linux-gnu ninja-build cmake
sudo apt install libc6-arm64-cross libc6-dev-arm64-cross
```

- Qualcomm Hexagon SDK 5.5.6.0 (with Tools 8.7.06) at `/local/mnt/workspace/Qualcomm/Hexagon_SDK/5.5.6.0/`
  - Must include `hexagon-clang` at `tools/HEXAGON_Tools/8.7.06/Tools/bin/`
  - Must include `qaic` IDL compiler at `tools/qaic/bin/qaic`
  - Must include `incs/` with SDK headers
  - Must include `ipc/fastrpc/` with libcdsprpc and rpcmem headers

### Target Machine (Q6A — aarch64)

- Radxa Dragon Q6A (QCS6490) running Ubuntu 24.04
- `fastrpc` package installed: `sudo apt install fastrpc fastrpc-test`
- User `radxa` in `render` group (for `/dev/fastrpc-cdsp-secure` access)
- CDSP firmware running: `cat /sys/class/remoteproc/remoteproc1/state` → `running`

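
The checks above can be scripted. A quick sanity check to run on the board (paths are the ones used throughout this guide; adjust the remoteproc index if your board enumerates it differently):

```shell
#!/bin/sh
# Sanity-check the FastRPC prerequisites described above.
for path in /dev/fastrpc-cdsp-secure /usr/lib/libcdsprpc.so.1 /usr/lib/dsp/cdsp; do
    if [ -e "$path" ]; then echo "ok:      $path"; else echo "MISSING: $path"; fi
done

# CDSP firmware state (expect "running")
cat /sys/class/remoteproc/remoteproc1/state 2>/dev/null || echo "no remoteproc1 on this machine"

# Group membership needed for the device node
id -nG | tr ' ' '\n' | grep -qx render && echo "ok:      render group" || echo "MISSING: render group"
```
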
## Quick Start

### 1. Build llama.cpp with Hexagon backend

```bash
cd ~/llama.cpp
bash scripts/build-hexagon.sh
```

This cross-compiles llama.cpp for aarch64 with `-DGGML_HEXAGON=ON`. Output goes to `build-hexagon/bin/`.

### 2. Deploy to Q6A

```bash
# Deploy ARM64 binaries
scp build-hexagon/bin/llama-cli radxa@192.168.1.11:~/llama/bin/
scp build-hexagon/bin/libggml*.so* radxa@192.168.1.11:~/llama/bin/
scp build-hexagon/bin/libllama.so* radxa@192.168.1.11:~/llama/bin/

# Deploy DSP skel
scp build-hexagon/ggml/src/ggml-hexagon/libggml-htp-v68.so radxa@192.168.1.11:/tmp/
ssh radxa@192.168.1.11 "sudo cp /tmp/libggml-htp-v68.so /usr/lib/dsp/cdsp/"
```

### 3. Run inference test

```bash
ssh radxa@192.168.1.11
cd ~/llama/bin
GGML_HEXAGON=1 LD_LIBRARY_PATH=. ./llama-cli \
  -m ~/models/llama-3.2-1b-q4km.gguf \
  -n 32 -p "Hello, what is your name?" -ngl 0
```

Expected output:

```
ggml-hex: Loading driver libcdsprpc.so
ggml-hex: Hexagon Arch version v68
ggml-hex: new session: HTP0 : session-id 0 domain-id 3 ...
[ Prompt: 32.8 t/s | Generation: 4.5 t/s ]
```

## Build Script Details

The `scripts/build-hexagon.sh` script:

1. **CMake configure** with:
   - `-DCMAKE_BUILD_TYPE=Release`
   - `-DBUILD_SHARED_LIBS=ON` (required for HTP plugin .so)
   - `-DCMAKE_INSTALL_RPATH='$ORIGIN'` (libraries alongside binary)
   - `-DGGML_HEXAGON=ON`
   - `-DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_SERVER=OFF`
   - `-DMAX_DOMAIN_NAMELEN=64` on both C and CXX flags

2. **Do NOT set `CMAKE_SYSROOT`** — the cross-compiler's own linker scripts conflict with `--sysroot` on Ubuntu's `gcc-aarch64-linux-gnu` packages.

3. **Do NOT set explicit OpenSSL paths** — they're unnecessary when `LLAMA_BUILD_SERVER=OFF`.

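
Assembled into a single configure call, the flags above look roughly like this (a sketch, not the literal script — the compiler names are the stock Ubuntu cross-toolchain ones, and the real `build-hexagon.sh` may order or name things differently):

```bash
# Note: deliberately no -DCMAKE_SYSROOT (see point 2 above)
cmake -G Ninja -B build-hexagon \
    -DCMAKE_SYSTEM_NAME=Linux \
    -DCMAKE_SYSTEM_PROCESSOR=aarch64 \
    -DCMAKE_C_COMPILER=aarch64-linux-gnu-gcc \
    -DCMAKE_CXX_COMPILER=aarch64-linux-gnu-g++ \
    -DCMAKE_BUILD_TYPE=Release \
    -DBUILD_SHARED_LIBS=ON \
    -DCMAKE_INSTALL_RPATH='$ORIGIN' \
    -DGGML_HEXAGON=ON \
    -DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_SERVER=OFF \
    -DCMAKE_C_FLAGS='-DMAX_DOMAIN_NAMELEN=64' \
    -DCMAKE_CXX_FLAGS='-DMAX_DOMAIN_NAMELEN=64'
cmake --build build-hexagon
```
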
## Critical Lessons Learned
### Root Cause of `remote_handle64_open` Error 0xe

The error occurs because **the SDK's cross-compiled `libcdsprpc.so` does NOT handle `FASTRPC_IOCTL_INIT_CREATE` internally** for unsigned PDs. The Q6A system `/usr/lib/libcdsprpc.so.1` does. The fix: compile and link natively on the Q6A (or link against the system library).

### Do NOT Call INIT_CREATE Manually

Attempting `FASTRPC_IOCTL_INIT_CREATE` via ioctl on `/dev/fastrpc-cdsp-secure` always returns EINVAL because the kernel expects the struct to be set up by libcdsprpc's internal state machine. The correct approach:

```c
/* ONLY these two calls are needed — libcdsprpc handles INIT_CREATE */
remote_session_control(DSPRPC_CONTROL_UNSIGNED_MODULE, ...);
remote_handle64_open(uri, &handle);
```

### Verified Q6A Constants

| Item | Value |
|------|-------|
| CDSP device node | `/dev/fastrpc-cdsp-secure` |
| Shell path | `/usr/lib/dsp/cdsp/fastrpc_shell_unsigned_3` |
| Domain ID | `CDSP_DOMAIN_ID` = 3 |
| Unsigned module flag | `FASTRPC_MODE_UNSIGNED_MODULE` = `(1 << 3)` = 0x8 |
| DSP .so path | `/usr/lib/dsp/cdsp/` |
| System libcdsprpc | `/usr/lib/libcdsprpc.so.1` (symlink at `/usr/lib/libcdsprpc.so` already exists) |
| Kernel header | `/usr/src/linux-headers-6.18.2-3-qcom/include/uapi/misc/fastrpc.h` |

### dspqueue Symbols

All required `dspqueue_*` symbols are present in the Q6A system `libcdsprpc.so.1`:
`dspqueue_create`, `dspqueue_close`, `dspqueue_export`, `dspqueue_write`, `dspqueue_read`, etc.

### Cross-Compile Pitfalls

1. **`CMAKE_SYSROOT` breaks the linker** — Ubuntu's `gcc-aarch64-linux-gnu` packages install linker scripts with absolute paths (e.g., `/usr/aarch64-linux-gnu/lib/libm.so` contains `GROUP( /usr/aarch64-linux-gnu/lib/libm.so.6 ... )`). When `--sysroot` is set to `/usr/aarch64-linux-gnu`, these absolute paths double up to `/usr/aarch64-linux-gnu/usr/aarch64-linux-gnu/lib/...` and fail. Solution: let the compiler use its built-in sysroot.

2. **`rpcmem_init` is optional** when linked against `libcdsprpc.so`. The SDK's cross-compiled `libcdsprpc` only exports `rpcmem_alloc`/`rpcmem_free`, not `rpcmem_init`/`rpcmem_deinit`. The Q6A system libcdsprpc exports them all.

## Why the NPU Isn't Accelerating Inference

After extensive testing, the Hexagon backend loads and initializes successfully but **never actually offloads any computation**. Every test shows:

```
llama_memory_breakdown_print: | - HTP0 (Hexagon) | 2048 = 2048 + ( 0 = 0 + 0 + 0) + 0 |
```

The NPU reports 2048 MiB but uses 0 MiB for model, context, and compute. Source code analysis reveals three reasons:

### 1. `offload_op` callback is NULL

The backend device struct registers `/* .offload_op = */ NULL`, so the scheduler never proactively moves tensors to the NPU. Even when `-ngl N` is specified, without this callback no layers get claimed.

```c
// ggml/src/ggml-hexagon/ggml-hexagon.cpp
const ggml_backend_device_t ggml_backend_hexagon_device_registration = {
    /* .name                  = */ GGML_HEXAGON_DEVICE_NAME,
    /* .description           = */ "Hexagon NPU",
    /* .get_memory            = */ ggml_backend_hexagon_device_get_memory,
    /* .get_version           = */ NULL,
    /* .get_best_device       = */ NULL,
    /* .get_device_for_tensor = */ NULL,
    /* .offload_op            = */ NULL,  // <--- STUBBED
    /* .supports_op           = */ ggml_backend_hexagon_device_supports_op,
    ...
};
```

### 2. 2048 MiB limit is hardcoded, not queried

The 2GB "free memory" reported is a hardcoded constant, not a hardware query:

```c
// ggml/src/ggml-hexagon/ggml-hexagon.cpp
static void ggml_backend_hexagon_device_get_memory(ggml_backend_dev_t dev, size_t * free, size_t * total) {
    // ~2GB per session for now
    *free  = 2ULL * 1024 * 1024 * 1024;
    *total = *free;
    GGML_UNUSED(dev);
}
```

It never calls `rpcmem_alloc2()` or checks kernel ION/DMA-BUF heap sizes. The 2GB figure is a rough placeholder. On a QCS6490 with 11GB system RAM, the CDSP carveout is typically 1-4 GB depending on firmware config, but the code never checks.

### 3. Q4_K_M is not a supported quantization for HTP

The Hexagon HTP kernels only support these quantization types:

- `GGML_TYPE_Q4_0`
- `GGML_TYPE_Q8_0`
- `GGML_TYPE_IQ4_NL`
- `GGML_TYPE_MXFP4`

**Q4_K_M is NOT in this list.** Every MUL_MAT operation with Q4_K_M weights fails the `supports_op` type check, regardless of buffer placement. This means even if you fix the buffer allocation path, the 7B Q4_K_M model still won't offload.

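
An NPU-eligible model therefore has to be requantized (llama.cpp's `llama-quantize` tool can produce Q4_0). Size-wise the switch is free: both formats spend 4.5 bits per weight, as a quick check against ggml's block layouts (`block_q4_0` is 18 bytes per 32 weights, `block_q4_K` is 144 bytes per 256 weights) shows:

```python
# Bits per weight for the two formats, from ggml's block layouts.
def bits_per_weight(block_bytes: int, weights_per_block: int) -> float:
    return block_bytes * 8 / weights_per_block

q4_0 = bits_per_weight(18, 32)    # 2-byte f16 scale + 16 bytes of nibbles
q4_k = bits_per_weight(144, 256)  # super-block: 2 f16 + 12 scale bytes + 128 bytes of nibbles

print(q4_0, q4_k)  # -> 4.5 4.5
```

The difference between the formats is quantization quality (Q4_K's per-sub-block scales), not size, so the trade for NPU eligibility is accuracy, not memory.
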
### Summary of Issues

| Issue | Root cause |
|-------|-----------|
| 0 MiB used for model/context/compute | `offload_op = NULL` in device registration |
| 2048 MiB cap | Hardcoded constant, not a FastRPC/ION query |
| Q4_K_M tensors don't offload | Q4_K_M not in HTP supported type list |
| Ops always rejected by `supports_op` | Chicken-and-egg: tensors never land in Hexagon buffers |

## Performance Benchmarks

All tests on a Radxa Dragon Q6A (QCS6490, 11GB RAM, 8x ARM Cortex cores, Ubuntu 24.04).

### 1B Model (Llama 3.2, Q4_K_M)

| Metric | CPU-only | With Hexagon backend |
|--------|----------|---------------------|
| Prompt processing | 32.3 t/s | 32.0 t/s |
| Generation | 4.5 t/s | 4.5 t/s |

No difference — CPU handles the 1B model natively.

### 7B Model (DeepSeek R1 Distill Qwen, Q4_K_M)

The context must be capped with `-c` (the model alone is 4460 MiB; the KV cache at the default 128K context adds another ~7GB, which does not fit in 11GB RAM).

| Context | Test | Prompt t/s | Gen t/s | Model (MiB) | Context (MiB) | Compute (MiB) | Total (MiB) |
|---------|------|-----------|---------|-------------|---------------|---------------|-------------|
| 2K  | CPU | 2.7 | 1.9 | 4460 | 112  | 311 | 4883 |
| 2K  | NPU | 2.8 | 1.8 | 4460 | 112  | 311 | 4883 |
| 32K | CPU | 2.7 | 1.9 | 4460 | 1792 | 396 | 6648 |
| 32K | NPU | 2.7 | 1.9 | 4460 | 1792 | 396 | 6648 |
| 64K | CPU | 2.5 | 1.8 | 4460 | 3584 | 429 | 8473 |
| 64K | NPU | 2.5 | 1.8 | 4460 | 3584 | 429 | 8473 |

**Every NPU vs CPU comparison is identical.** The Hexagon backend never offloads any tensors, so all computation runs on the CPU in both cases.

Memory use at 64K context (8.5 GiB total) approaches the 11 GiB ceiling but fits with swap enabled.

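
The Context column tracks the f16 KV-cache size exactly. Assuming the usual Qwen2-7B geometry (28 layers, 4 KV heads of dimension 128 under grouped-query attention — an assumption about this distill, not something stated above), a quick check reproduces the table:

```python
# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * 2 bytes (f16)
layers, kv_heads, head_dim, f16_bytes = 28, 4, 128, 2
bytes_per_token = 2 * layers * kv_heads * head_dim * f16_bytes  # 57344 B = 56 KiB

for ctx in (32 * 1024, 64 * 1024, 128 * 1024):
    mib = ctx * bytes_per_token / (1024 * 1024)
    print(f"{ctx // 1024}K context -> {mib:.0f} MiB")
# -> 32K: 1792 MiB and 64K: 3584 MiB match the table; 128K: 7168 MiB is the ~7GB noted above
```
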
### What would need to change to get actual NPU offload

1. Implement `offload_op` callback in `ggml_backend_hexagon_device_registration`
2. Wire the repack buffer type for weight tensors so quantized weights land in Hexagon-accessible memory
3. Query actual rpcmem capacity instead of hardcoding 2GB
4. Use a Q4_0 or Q8_0 quantized model (Q4_K_M not supported by HTP kernels)

## Known Issues

- **Minimal stub library** (`htp_minimal_impl.c`) fails to load on the DSP with error `0x80000442` — the full `libggml-htp-v68.so` (generated by the cmake build from `ggml-hexagon/main.c`) works correctly.
- **DSP library is rebuilt every time** the cmake build runs.
- The `htp_iface.idl` declares `dst` as `in sequence<uint8>` (input-only), but it is an output buffer and should be declared `rout`.

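
The last point would be fixed by flipping the parameter direction in the IDL. A sketch of the corrected shape — the interface and method names here are placeholders, not the real signatures from `htp_iface.idl`:

```
interface htp_iface {
    // dst declared rout so the generated stub copies results back to the AP
    long op(in sequence<uint8> src, rout sequence<uint8> dst);
};
```
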
## Files in This Repo

| File | Purpose |
|------|---------|
| `src/test_fastrpc_fixed.c` | Corrected test harness with proper init sequence |
| `src/htp_minimal_impl.c` | Minimal DSP stub (for experimentation) |
| `scripts/build-hexagon.sh` | Cross-compile script for llama.cpp with `GGML_HEXAGON=ON` |
| `scripts/deploy-to-q6a.sh` | Deploy built binaries + DSP .so to Q6A |
| `scripts/test-on-q6a.sh` | Run full inference test on Q6A |
| `scripts/test-7b.sh` | Run 7B model benchmarks at various context sizes |
| `references/fastrpc.h` | Q6A kernel header (ioctl struct definitions) |
| `AGENTS.md` | Context for AI coding agents working with this codebase |