Jimmy Devine e6fa9052b3 Add NPU offload results: offload_op, direct-compute, 10GB, Q8_0 4.3x prompt speedup
- offload_op callback now implemented (MUL_MAT/MUL_MAT_ID)
- Memory raised to 10 GiB
- Direct compute mode bypasses broken dspqueue on this board
- Q8_0 1B model: 115 t/s prompt (4.3x vs CPU 27 t/s)
- Generation 9.6 t/s (27% slower than CPU, expected)
- dspqueue path fails with error 0x0000002e
- llama-cli renamed to llama-simple in current build
- Updated scripts for direct-compute mode
- Docs updated with new findings and instructions
2026-05-02 14:17:27 +02:00

Q6A Hexagon v68 + llama.cpp — Complete Guide

This repo documents how to get llama.cpp running with the Qualcomm Hexagon CDSP v68 (NPU/DSP) backend on a Radxa Dragon Q6A board (QCS6490), including the patches required to achieve real NPU acceleration.

Overview

The Q6A has a Qualcomm QCS6490 SoC with a Hexagon CDSP v68 that can accelerate matrix operations in llama.cpp via FastRPC. After implementing offload_op and using direct-compute mode, the NPU shows 4.3x faster prompt processing vs CPU for a 1B Q8_0 model.

Prerequisites

Build Machine (x86_64)

  • Ubuntu 24.04 (or similar with cross-compilation packages)
  • Packages:
    sudo apt install gcc-aarch64-linux-gnu g++-aarch64-linux-gnu ninja-build cmake
    sudo apt install libc6-arm64-cross libc6-dev-arm64-cross
    
  • Qualcomm Hexagon SDK 5.5.6.0 (with Tools 8.7.06) at /local/mnt/workspace/Qualcomm/Hexagon_SDK/5.5.6.0/
    • Must include hexagon-clang at tools/HEXAGON_Tools/8.7.06/Tools/bin/
    • Must include qaic IDL compiler at tools/qaic/bin/qaic
    • Must include incs/ with SDK headers
    • Must include ipc/fastrpc/ with libcdsprpc and rpcmem headers

Target Machine (Q6A — aarch64)

  • Radxa Dragon Q6A (QCS6490) running Ubuntu 24.04
  • fastrpc package installed: sudo apt install fastrpc fastrpc-test
  • User radxa in render group (for /dev/fastrpc-cdsp-secure access)
  • CDSP firmware running: cat /sys/class/remoteproc/remoteproc1/state should print running

Quick Start

1. Build llama.cpp with Hexagon backend

cd ~/llama.cpp
bash scripts/build-hexagon.sh

This cross-compiles llama.cpp for aarch64 with -DGGML_HEXAGON=ON. Output goes to build-hexagon/bin/.

2. Deploy to Q6A

# Deploy ARM64 binaries
scp build-hexagon/bin/llama-simple radxa@192.168.1.11:~/llama/bin/llama-cli
scp build-hexagon/bin/libggml*.so* radxa@192.168.1.11:~/llama/bin/
scp build-hexagon/bin/libllama.so* radxa@192.168.1.11:~/llama/bin/

# Deploy DSP skel
scp build-hexagon/ggml/src/ggml-hexagon/libggml-htp-v68.so radxa@192.168.1.11:/tmp/
ssh radxa@192.168.1.11 "sudo cp /tmp/libggml-htp-v68.so /usr/lib/dsp/cdsp/"

3. Run inference test

ssh radxa@192.168.1.11
cd ~/llama/bin
GGML_HEXAGON_DIRECT_COMPUTE=1 ./llama-cli \
    -m ~/models/llama-1b-q8_0.gguf \
    -n 32 -p "Hello, what is your name?"

Expected output:

ggml-hex: allocating new session: HTP0
ggml-hex: direct-compute mode enabled (dspqueue still created for session init)
[ Prompt: 105.9 t/s | Generation: 9.4 t/s ]

Build Script Details

The scripts/build-hexagon.sh script:

  1. CMake configure with:

    • -DCMAKE_BUILD_TYPE=Release
    • -DBUILD_SHARED_LIBS=ON (required for HTP plugin .so)
    • -DCMAKE_INSTALL_RPATH='$ORIGIN' (libraries alongside binary)
    • -DGGML_HEXAGON=ON
    • -DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_SERVER=OFF
    • -DMAX_DOMAIN_NAMELEN=64 on both C and CXX flags
  2. Do NOT set CMAKE_SYSROOT — the cross-compiler's own linker scripts conflict with --sysroot on Ubuntu's gcc-aarch64-linux-gnu packages.

  3. Do NOT set explicit OpenSSL paths — they're unnecessary when LLAMA_BUILD_SERVER=OFF.
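Put together, the configure step looks roughly like the following. This is a sketch, not the literal contents of scripts/build-hexagon.sh: the generator choice and the plain gcc-aarch64-linux-gnu compiler names are assumptions, and SDK-specific variables may be needed on top.

```shell
# Sketch of the configure + build step (toolchain names are assumptions):
cmake -G Ninja -B build-hexagon \
    -DCMAKE_SYSTEM_NAME=Linux \
    -DCMAKE_SYSTEM_PROCESSOR=aarch64 \
    -DCMAKE_C_COMPILER=aarch64-linux-gnu-gcc \
    -DCMAKE_CXX_COMPILER=aarch64-linux-gnu-g++ \
    -DCMAKE_BUILD_TYPE=Release \
    -DBUILD_SHARED_LIBS=ON \
    -DCMAKE_INSTALL_RPATH='$ORIGIN' \
    -DGGML_HEXAGON=ON \
    -DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_SERVER=OFF \
    -DCMAKE_C_FLAGS='-DMAX_DOMAIN_NAMELEN=64' \
    -DCMAKE_CXX_FLAGS='-DMAX_DOMAIN_NAMELEN=64'
    # note: no -DCMAKE_SYSROOT (see item 2 above)
cmake --build build-hexagon
```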

Critical Lessons Learned

Root Cause of remote_handle64_open Error 0xe

The error occurs because the SDK's cross-compiled libcdsprpc.so does NOT handle FASTRPC_IOCTL_INIT_CREATE internally for unsigned PDs. The Q6A system /usr/lib/libcdsprpc.so.1 does. The fix: compile and link natively on the Q6A (or link against the system library).

Do NOT Call INIT_CREATE Manually

Attempting FASTRPC_IOCTL_INIT_CREATE via ioctl on /dev/fastrpc-cdsp-secure always returns EINVAL because the kernel expects the struct to be set up by libcdsprpc's internal state machine. The correct approach:

/* ONLY these two calls are needed — libcdsprpc handles INIT_CREATE */
struct remote_rpc_control_unsigned_module data = { .domain = CDSP_DOMAIN_ID, .enable = 1 };
remote_session_control(DSPRPC_CONTROL_UNSIGNED_MODULE, (void *) &data, sizeof(data));
remote_handle64_open(uri, &handle);

Verified Q6A Constants

Item                   Value
CDSP device node       /dev/fastrpc-cdsp-secure
Shell path             /usr/lib/dsp/cdsp/fastrpc_shell_unsigned_3
Domain ID              CDSP_DOMAIN_ID = 3
Unsigned module flag   FASTRPC_MODE_UNSIGNED_MODULE = (1 << 3) = 0x8
DSP .so path           /usr/lib/dsp/cdsp/
System libcdsprpc      /usr/lib/libcdsprpc.so.1 (symlink at /usr/lib/libcdsprpc.so already exists)
Kernel header          /usr/src/linux-headers-6.18.2-3-qcom/include/uapi/misc/fastrpc.h

dspqueue Symbols

All required dspqueue_* symbols are present in the QCS6490 system libcdsprpc.so.1: dspqueue_create, dspqueue_close, dspqueue_export, dspqueue_write, dspqueue_read, etc.

Cross-Compile Pitfalls

  1. CMAKE_SYSROOT breaks the linker — Ubuntu's gcc-aarch64-linux-gnu packages install linker scripts with absolute paths (e.g., /usr/aarch64-linux-gnu/lib/libm.so contains GROUP( /usr/aarch64-linux-gnu/lib/libm.so.6 ... )). When --sysroot is set to /usr/aarch64-linux-gnu, these absolute paths double up to /usr/aarch64-linux-gnu/usr/aarch64-linux-gnu/lib/... and fail. Solution: let the compiler use its built-in sysroot.

  2. rpcmem_init is optional when linked against libcdsprpc.so. The SDK's cross-compiled libcdsprpc only exports rpcmem_alloc/free but not rpcmem_init/deinit. The Q6A system libcdsprpc has them all.

  3. The old llama-cli binary has hexagon init code baked in. The llama.cpp build generates llama-simple, not llama-cli. Deploy llama-simple as llama-cli to get a binary that loads the hexagon backend from the shared library.

NPU Acceleration

After implementing offload_op and enabling direct-compute mode, the Hexagon backend now actually offloads operations to the NPU. Here is everything that was done:

Required Changes to ggml-hexagon.cpp

  1. offload_op callback — Added a function that returns true for GGML_OP_MUL_MAT and GGML_OP_MUL_MAT_ID, so the scheduler claims these ops for the NPU:

    static bool ggml_backend_hexagon_device_offload_op(ggml_backend_dev_t dev, const struct ggml_tensor * op) {
        return (op->op == GGML_OP_MUL_MAT || op->op == GGML_OP_MUL_MAT_ID);
    }
    
  2. Memory increased to 10 GiB — Changed from the hardcoded 2 GiB to 10 GiB (matching QCS6490's 11 GB RAM):

    *free  = 10ULL * 1024 * 1024 * 1024;
    
  3. Direct compute mode — Added a code path that bypasses the broken dspqueue and calls htp_iface_compute() via FastRPC directly. The dspqueue is still created for session initialization. Enable with GGML_HEXAGON_DIRECT_COMPUTE=1.

dspqueue Status

The default dspqueue path (GGML_HEXAGON_DIRECT_COMPUTE=0) fails on this Q6A board with error 0x0000002e (unknown DSP-side error). This affects both Q4_K_M and Q8_0 models. The direct compute path (GGML_HEXAGON_DIRECT_COMPUTE=1) works correctly and shows real NPU acceleration.

Possible causes for dspqueue failure:

  • CDSP daemon in a bad state (sudo systemctl restart cdsprpcd may fix it)
  • Firmware version mismatch with dspqueue implementation
  • Queue size or buffer allocation issue specific to this board

Supported Quantizations

The Hexagon HTP kernels support these quantization types:

  • GGML_TYPE_Q4_0
  • GGML_TYPE_Q8_0
  • GGML_TYPE_IQ4_NL
  • GGML_TYPE_MXFP4

Q4_K_M is NOT supported. Use Q8_0 or Q4_0 models for NPU offload.

Performance Benchmarks

All tests on Radxa Dragon Q6A (QCS6490, 11 GB RAM, 8 ARM Cortex cores, Ubuntu 24.04).

1B Model (Llama 3.2, Q8_0) — NPU vs CPU

Metric        CPU        NPU (direct compute)   Speedup
Prompt eval   26.8 t/s   114.9 t/s              4.3x faster
Generation    13.1 t/s   9.6 t/s                27% slower

Analysis:

  • Prompt eval is 4x faster on NPU — this is the expected win. Prompt eval processes many tokens in parallel, hitting the NPU's massive matrix multiplication throughput.
  • Generation is 27% slower on NPU — also expected. Autoregressive generation is memory-bandwidth-bound (single token at a time). The NPU communication overhead (FastRPC call + DMA sync) outweighs the compute benefit for batch-1 operations.
  • CPU-only runs through ggml-backend-cpu. NPU runs through the Hexagon backend with offloaded MUL_MAT ops.

Previous: 1B Model (Llama 3.2, Q4_K_M) — Baseline (no offload)

From before the offload_op fix, the NPU path was identical to CPU:

Metric              CPU-only   With Hexagon backend (no offload)
Prompt processing   32.3 t/s   32.0 t/s
Generation          4.5 t/s    4.5 t/s

Difference is within noise — no ops were actually offloaded.

7B Model (DeepSeek R1 Distill Qwen, Q4_K_M)

Tested at several context sizes; the model alone is 4460 MiB, leaving limited headroom in 11 GB RAM:

Context   Test   Prompt t/s   Gen t/s
2K        CPU    2.7          1.9
2K        NPU    2.8          1.8
32K       CPU    2.7          1.9
32K       NPU    2.7          1.9
64K       CPU    2.5          1.8
64K       NPU    2.5          1.8

NPU vs CPU identical — Q4_K_M is not a supported quantization for HTP kernels, so ops always fall back to CPU.

How the Direct Compute Path Works

When GGML_HEXAGON_DIRECT_COMPUTE=1:

  1. The dspqueue is still created and exported normally (required for DSP session init via htp_iface_start)
  2. htp_iface_start() is called with the real queue ID to set up the DSP session
  3. When enqueue() is called for an op, it calls htp_iface_compute() via FastRPC directly instead of writing to the dspqueue
  4. The op data (req struct) is sent as the first buffer, with DMA-BUF file descriptors for weight and activation tensors
  5. After the direct compute call completes, ARM caches are synced for the output DMA-BUFs

This bypasses the broken dspqueue read path while still using the standard FastRPC session for initialization.

Known Issues

  • dspqueue path fails with error 0x0000002e — use GGML_HEXAGON_DIRECT_COMPUTE=1 to bypass.
  • CDSP daemon restart may fix dspqueue — sudo systemctl restart cdsprpcd may restore dspqueue functionality.
  • Q4_K_M won't offload — HTP kernels only support Q4_0, Q8_0, IQ4_NL, MXFP4. Even with offload_op enabled, supports_op returns false for Q4_K_M.
  • Memory number is still hardcoded — 10 GiB is a reasonable estimate for QCS6490 with 11 GB RAM, but never queried from hardware.
  • Direct compute uses stack memory for IDL params — the implementation passes tiny stack-allocated dummy buffers to avoid FastRPC EREMCHG errors from ION/CMA memory. This works but is fragile.

Files in This Repo

File                       Purpose
src/test_fastrpc_fixed.c   Corrected test harness with proper init sequence
src/htp_minimal_impl.c     Minimal DSP stub (for experimentation)
scripts/build-hexagon.sh   Cross-compile script for llama.cpp with GGML_HEXAGON=ON
scripts/deploy-to-q6a.sh   Deploy built binaries + DSP .so to Q6A
scripts/test-on-q6a.sh     Run full inference test on Q6A
scripts/test-7b.sh         Run 7B model benchmarks at various context sizes
references/fastrpc.h       Q6A kernel header (ioctl struct definitions)
AGENTS.md                  Context for AI coding agents working with this codebase