Q6A Hexagon v68 + llama.cpp — Complete Guide
This repo documents how to get llama.cpp running with the Qualcomm Hexagon CDSP v68 (NPU/DSP) backend on a Radxa Dragon Q6A board (QCS6490), including the patches required to achieve real NPU acceleration.
Overview
The Q6A has a Qualcomm QCS6490 SoC with a Hexagon CDSP v68 that can accelerate matrix operations in llama.cpp via FastRPC. After implementing offload_op and using direct-compute mode, the NPU shows 4.3x faster prompt processing vs CPU for a 1B Q8_0 model.
Prerequisites
Build Machine (x86_64)
- Ubuntu 24.04 (or similar with cross-compilation packages)
- Packages:
  ```
  sudo apt install gcc-aarch64-linux-gnu g++-aarch64-linux-gnu ninja-build cmake
  sudo apt install libc6-arm64-cross libc6-dev-arm64-cross
  ```
- Qualcomm Hexagon SDK 5.5.6.0 (with Tools 8.7.06) at `/local/mnt/workspace/Qualcomm/Hexagon_SDK/5.5.6.0/`
  - Must include `hexagon-clang` at `tools/HEXAGON_Tools/8.7.06/Tools/bin/`
  - Must include the `qaic` IDL compiler at `tools/qaic/bin/qaic`
  - Must include `incs/` with SDK headers
  - Must include `ipc/fastrpc/` with libcdsprpc and rpcmem headers
Target Machine (Q6A — aarch64)
- Radxa Dragon Q6A (QCS6490) running Ubuntu 24.04
- `fastrpc` package installed: `sudo apt install fastrpc fastrpc-test`
- User `radxa` in the `render` group (for `/dev/fastrpc-cdsp-secure` access)
- CDSP firmware running: `cat /sys/class/remoteproc/remoteproc1/state` → `running`
Quick Start
1. Build llama.cpp with Hexagon backend
```
cd ~/llama.cpp
bash scripts/build-hexagon.sh
```
This cross-compiles llama.cpp for aarch64 with -DGGML_HEXAGON=ON. Output goes to build-hexagon/bin/.
2. Deploy to Q6A
```
# Deploy ARM64 binaries
scp build-hexagon/bin/llama-simple radxa@192.168.1.11:~/llama/bin/llama-cli
scp build-hexagon/bin/libggml*.so* radxa@192.168.1.11:~/llama/bin/
scp build-hexagon/bin/libllama.so* radxa@192.168.1.11:~/llama/bin/

# Deploy DSP skel
scp build-hexagon/ggml/src/ggml-hexagon/libggml-htp-v68.so radxa@192.168.1.11:/tmp/
ssh radxa@192.168.1.11 "sudo cp /tmp/libggml-htp-v68.so /usr/lib/dsp/cdsp/"
```
3. Run inference test
```
ssh radxa@192.168.1.11
cd ~/llama/bin
GGML_HEXAGON_DIRECT_COMPUTE=1 ./llama-cli \
    -m ~/models/llama-1b-q8_0.gguf \
    -n 32 -p "Hello, what is your name?"
```
Expected output:
```
ggml-hex: allocating new session: HTP0
ggml-hex: direct-compute mode enabled (dspqueue still created for session init)
[ Prompt: 105.9 t/s | Generation: 9.4 t/s ]
```
Build Script Details
The scripts/build-hexagon.sh script:
- CMake configure with:
  - `-DCMAKE_BUILD_TYPE=Release`
  - `-DBUILD_SHARED_LIBS=ON` (required for the HTP plugin .so)
  - `-DCMAKE_INSTALL_RPATH='$ORIGIN'` (libraries alongside the binary)
  - `-DGGML_HEXAGON=ON`
  - `-DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_SERVER=OFF`
  - `-DMAX_DOMAIN_NAMELEN=64` on both C and CXX flags
- Do NOT set `CMAKE_SYSROOT` — the cross-compiler's own linker scripts conflict with `--sysroot` on Ubuntu's `gcc-aarch64-linux-gnu` packages.
- Do NOT set explicit OpenSSL paths — they're unnecessary when `LLAMA_BUILD_SERVER=OFF`.
Critical Lessons Learned
Root Cause of remote_handle64_open Error 0xe
The error occurs because the SDK's cross-compiled libcdsprpc.so does NOT handle FASTRPC_IOCTL_INIT_CREATE internally for unsigned PDs. The Q6A system /usr/lib/libcdsprpc.so.1 does. The fix: compile and link natively on the Q6A (or link against the system library).
Do NOT Call INIT_CREATE Manually
Attempting FASTRPC_IOCTL_INIT_CREATE via ioctl on /dev/fastrpc-cdsp-secure always returns EINVAL because the kernel expects the struct to be set up by libcdsprpc's internal state machine. The correct approach:
```c
/* ONLY these two calls are needed — libcdsprpc handles INIT_CREATE */
remote_session_control(DSPRPC_CONTROL_UNSIGNED_MODULE, ...);
remote_handle64_open(uri, &handle);
```
Verified Q6A Constants
| Item | Value |
|---|---|
| CDSP device node | /dev/fastrpc-cdsp-secure |
| Shell path | /usr/lib/dsp/cdsp/fastrpc_shell_unsigned_3 |
| Domain ID | CDSP_DOMAIN_ID = 3 |
| Unsigned module flag | FASTRPC_MODE_UNSIGNED_MODULE = (1 << 3) = 0x8 |
| DSP .so path | /usr/lib/dsp/cdsp/ |
| System libcdsprpc | /usr/lib/libcdsprpc.so.1 (symlink at /usr/lib/libcdsprpc.so already exists) |
| Kernel header | /usr/src/linux-headers-6.18.2-3-qcom/include/uapi/misc/fastrpc.h |
dspqueue Symbols
All required dspqueue_* symbols are present in the Q6A system libcdsprpc.so.1:
dspqueue_create, dspqueue_close, dspqueue_export, dspqueue_write, dspqueue_read, etc.
Cross-Compile Pitfalls
- `CMAKE_SYSROOT` breaks the linker — Ubuntu's `gcc-aarch64-linux-gnu` packages install linker scripts with absolute paths (e.g., `/usr/aarch64-linux-gnu/lib/libm.so` contains `GROUP( /usr/aarch64-linux-gnu/lib/libm.so.6 ... )`). When `--sysroot` is set to `/usr/aarch64-linux-gnu`, these absolute paths double up to `/usr/aarch64-linux-gnu/usr/aarch64-linux-gnu/lib/...` and fail. Solution: let the compiler use its built-in sysroot.
- `rpcmem_init` is optional when linked against `libcdsprpc.so` — the SDK's cross-compiled libcdsprpc only exports `rpcmem_alloc`/`free`, not `rpcmem_init`/`deinit`. The Q6A system libcdsprpc has them all.
- The old `llama-cli` binary has Hexagon init code baked in — the llama.cpp build generates `llama-simple`, not `llama-cli`. Deploy `llama-simple` as `llama-cli` to get a binary that loads the Hexagon backend from the shared library.
NPU Acceleration
After implementing offload_op and enabling direct-compute mode, the Hexagon backend now actually offloads operations to the NPU. Here is everything that was done:
Required Changes to ggml-hexagon.cpp
- `offload_op` callback — added a function that returns `true` for `GGML_OP_MUL_MAT` and `GGML_OP_MUL_MAT_ID`, so the scheduler claims these ops for the NPU:

  ```c
  static bool ggml_backend_hexagon_device_offload_op(ggml_backend_dev_t dev, const struct ggml_tensor * op) {
      return (op->op == GGML_OP_MUL_MAT || op->op == GGML_OP_MUL_MAT_ID);
  }
  ```

- Memory raised to 10 GiB — changed from the hardcoded 2 GiB to 10 GiB (matching the QCS6490's 11 GB RAM):

  ```c
  *free = 10ULL * 1024 * 1024 * 1024;
  ```

- Direct compute mode — added a code path that bypasses the broken dspqueue and calls `htp_iface_compute()` via FastRPC directly. The dspqueue is still created for session initialization. Enable with `GGML_HEXAGON_DIRECT_COMPUTE=1`.
dspqueue Status
The default dspqueue path (GGML_HEXAGON_DIRECT_COMPUTE=0) fails on this Q6A board with error 0x0000002e (unknown DSP-side error). This affects both Q4_K_M and Q8_0 models. The direct compute path (GGML_HEXAGON_DIRECT_COMPUTE=1) works correctly and shows real NPU acceleration.
Possible causes for dspqueue failure:
- CDSP daemon in a bad state (a restart with `sudo systemctl restart cdsprpcd` may fix it)
- Firmware version mismatch with the dspqueue implementation
- Queue size or buffer allocation issue specific to this board
Supported Quantizations
The Hexagon HTP kernels support these quantization types:
- `GGML_TYPE_Q4_0`
- `GGML_TYPE_Q8_0`
- `GGML_TYPE_IQ4_NL`
- `GGML_TYPE_MXFP4`
Q4_K_M is NOT supported. Use Q8_0 or Q4_0 models for NPU offload.
Performance Benchmarks
All tests on Radxa Dragon Q6A (QCS6490, 11GB RAM, 8x ARM Cortex cores, Ubuntu 24.04).
1B Model (Llama 3.2, Q8_0) — NPU vs CPU
| Metric | CPU | NPU (direct compute) | Speedup |
|---|---|---|---|
| Prompt eval | 26.8 t/s | 114.9 t/s | 4.3x faster |
| Generation | 13.1 t/s | 9.6 t/s | 27% slower |
Analysis:
- Prompt eval is 4x faster on NPU — this is the expected win. Prompt eval processes many tokens in parallel, hitting the NPU's massive matrix multiplication throughput.
- Generation is 27% slower on NPU — also expected. Autoregressive generation is memory-bandwidth-bound (single token at a time). The NPU communication overhead (FastRPC call + DMA sync) outweighs the compute benefit for batch-1 operations.
- CPU-only runs go through `ggml-backend-cpu`. NPU runs go through the Hexagon backend with offloaded MUL_MAT ops.
Previous: 1B Model (Llama 3.2, Q4_K_M) — Baseline (no offload)
From before the offload_op fix, the NPU path was identical to CPU:
| Metric | CPU-only | With Hexagon backend (no offload) |
|---|---|---|
| Prompt processing | 32.3 t/s | 32.0 t/s |
| Generation | 4.5 t/s | 4.5 t/s |
Difference is within noise — no ops were actually offloaded.
7B Model (DeepSeek R1 Distill Qwen, Q4_K_M)
With -c 2048 to fit in 11GB RAM (model alone is 4460 MiB):
| Context | Test | Prompt t/s | Gen t/s |
|---|---|---|---|
| 2K | CPU | 2.7 | 1.9 |
| 2K | NPU | 2.8 | 1.8 |
| 32K | CPU | 2.7 | 1.9 |
| 32K | NPU | 2.7 | 1.9 |
| 64K | CPU | 2.5 | 1.8 |
| 64K | NPU | 2.5 | 1.8 |
NPU vs CPU identical — Q4_K_M is not a supported quantization for HTP kernels, so ops always fall back to CPU.
How the Direct Compute Path Works
When GGML_HEXAGON_DIRECT_COMPUTE=1:
- The dspqueue is still created and exported normally (required for DSP session init via `htp_iface_start`)
- `htp_iface_start()` is called with the real queue ID to set up the DSP session
- When `enqueue()` is called for an op, it calls `htp_iface_compute()` via FastRPC directly instead of writing to the dspqueue
- The op data (req struct) is sent as the first buffer, with DMA-BUF file descriptors for the weight and activation tensors
- After the direct compute call completes, ARM caches are synced for the output DMA-BUFs
This bypasses the broken dspqueue read path while still using the standard FastRPC session for initialization.
Known Issues
- dspqueue path fails with error `0x0000002e` — use `GGML_HEXAGON_DIRECT_COMPUTE=1` to bypass.
- CDSP daemon restart may fix dspqueue — `sudo systemctl restart cdsprpcd` may restore dspqueue functionality.
- Q4_K_M won't offload — HTP kernels only support Q4_0, Q8_0, IQ4_NL, MXFP4. Even with `offload_op` enabled, `supports_op` returns false for Q4_K_M.
- The memory number is still hardcoded — 10 GiB is a reasonable estimate for the QCS6490 with 11 GB RAM, but it is never queried from hardware.
- Direct compute uses stack memory for IDL params — the implementation passes tiny stack-allocated dummy buffers to avoid FastRPC EREMCHG errors from ION/CMA memory. This works but is fragile.
Files in This Repo
| File | Purpose |
|---|---|
| `src/test_fastrpc_fixed.c` | Corrected test harness with proper init sequence |
| `src/htp_minimal_impl.c` | Minimal DSP stub (for experimentation) |
| `scripts/build-hexagon.sh` | Cross-compile script for llama.cpp with GGML_HEXAGON=ON |
| `scripts/deploy-to-q6a.sh` | Deploy built binaries + DSP .so to Q6A |
| `scripts/test-on-q6a.sh` | Run full inference test on Q6A |
| `scripts/test-7b.sh` | Run 7B model benchmarks at various context sizes |
| `references/fastrpc.h` | Q6A kernel header (ioctl struct definitions) |
| `AGENTS.md` | Context for AI coding agents working with this codebase |