Jimmy Devine e6fa9052b3 Add NPU offload results: offload_op, direct-compute, 10GB, Q8_0 4.3x prompt speedup
- offload_op callback now implemented (MUL_MAT/MUL_MAT_ID)
- Memory raised to 10 GiB
- Direct compute mode bypasses broken dspqueue on this board
- Q8_0 1B model: 115 t/s prompt (4.3x vs CPU 27 t/s)
- Generation 9.6 t/s (27% slower than CPU, expected)
- dspqueue path fails with error 0x0000002e
- llama-cli renamed to llama-simple in current build
- Updated scripts for direct-compute mode
- Docs updated with new findings and instructions
2026-05-02 14:17:27 +02:00

Q6A Hexagon v68 + llama.cpp — Complete Guide

This repo documents how to get llama.cpp running with the Qualcomm Hexagon CDSP v68 (NPU/DSP) backend on a Radxa Dragon Q6A board (QCS6490), including the patches required to achieve real NPU acceleration.

Overview

The Q6A has a Qualcomm QCS6490 SoC with a Hexagon CDSP v68 that can accelerate matrix operations in llama.cpp via FastRPC. After implementing offload_op and using direct-compute mode, the NPU shows 4.3x faster prompt processing vs CPU for a 1B Q8_0 model.

Prerequisites

Build Machine (x86_64)

  • Ubuntu 24.04 (or similar with cross-compilation packages)
  • Packages:
    sudo apt install gcc-aarch64-linux-gnu g++-aarch64-linux-gnu ninja-build cmake
    sudo apt install libc6-arm64-cross libc6-dev-arm64-cross
    
  • Qualcomm Hexagon SDK 5.5.6.0 (with Tools 8.7.06) at /local/mnt/workspace/Qualcomm/Hexagon_SDK/5.5.6.0/
    • Must include hexagon-clang at tools/HEXAGON_Tools/8.7.06/Tools/bin/
    • Must include qaic IDL compiler at tools/qaic/bin/qaic
    • Must include incs/ with SDK headers
    • Must include ipc/fastrpc/ with libcdsprpc and rpcmem headers

Target Machine (Q6A — aarch64)

  • Radxa Dragon Q6A (QCS6490) running Ubuntu 24.04
  • fastrpc package installed: sudo apt install fastrpc fastrpc-test
  • User radxa in render group (for /dev/fastrpc-cdsp-secure access)
  • CDSP firmware running: cat /sys/class/remoteproc/remoteproc1/state should print running

Quick Start

1. Build llama.cpp with Hexagon backend

cd ~/llama.cpp
bash scripts/build-hexagon.sh

This cross-compiles llama.cpp for aarch64 with -DGGML_HEXAGON=ON. Output goes to build-hexagon/bin/.

2. Deploy to Q6A

# Deploy ARM64 binaries
scp build-hexagon/bin/llama-simple radxa@192.168.1.11:~/llama/bin/llama-cli
scp build-hexagon/bin/libggml*.so* radxa@192.168.1.11:~/llama/bin/
scp build-hexagon/bin/libllama.so* radxa@192.168.1.11:~/llama/bin/

# Deploy DSP skel
scp build-hexagon/ggml/src/ggml-hexagon/libggml-htp-v68.so radxa@192.168.1.11:/tmp/
ssh radxa@192.168.1.11 "sudo cp /tmp/libggml-htp-v68.so /usr/lib/dsp/cdsp/"

3. Run inference test

ssh radxa@192.168.1.11
cd ~/llama/bin
GGML_HEXAGON_DIRECT_COMPUTE=1 ./llama-cli \
    -m ~/models/llama-1b-q8_0.gguf \
    -n 32 -p "Hello, what is your name?"

Expected output:

ggml-hex: allocating new session: HTP0
ggml-hex: direct-compute mode enabled (dspqueue still created for session init)
[ Prompt: 105.9 t/s | Generation: 9.4 t/s ]

Build Script Details

The scripts/build-hexagon.sh script:

  1. CMake configure with:

    • -DCMAKE_BUILD_TYPE=Release
    • -DBUILD_SHARED_LIBS=ON (required for HTP plugin .so)
    • -DCMAKE_INSTALL_RPATH='$ORIGIN' (libraries alongside binary)
    • -DGGML_HEXAGON=ON
    • -DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_SERVER=OFF
    • -DMAX_DOMAIN_NAMELEN=64 on both C and CXX flags
  2. Do NOT set CMAKE_SYSROOT — the cross-compiler's own linker scripts conflict with --sysroot on Ubuntu's gcc-aarch64-linux-gnu packages.

  3. Do NOT set explicit OpenSSL paths — they're unnecessary when LLAMA_BUILD_SERVER=OFF.
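Put together, the configure step looks roughly like the following. This is a sketch, not the literal contents of scripts/build-hexagon.sh: the generator choice and the plain gcc-aarch64-linux-gnu compiler names are assumptions, and SDK-specific variables may be needed on top.

```shell
# Sketch of the configure + build step (toolchain names are assumptions):
cmake -G Ninja -B build-hexagon \
    -DCMAKE_SYSTEM_NAME=Linux \
    -DCMAKE_SYSTEM_PROCESSOR=aarch64 \
    -DCMAKE_C_COMPILER=aarch64-linux-gnu-gcc \
    -DCMAKE_CXX_COMPILER=aarch64-linux-gnu-g++ \
    -DCMAKE_BUILD_TYPE=Release \
    -DBUILD_SHARED_LIBS=ON \
    -DCMAKE_INSTALL_RPATH='$ORIGIN' \
    -DGGML_HEXAGON=ON \
    -DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_SERVER=OFF \
    -DCMAKE_C_FLAGS='-DMAX_DOMAIN_NAMELEN=64' \
    -DCMAKE_CXX_FLAGS='-DMAX_DOMAIN_NAMELEN=64'
    # note: no -DCMAKE_SYSROOT (see item 2 above)
cmake --build build-hexagon
```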

Critical Lessons Learned

Root Cause of remote_handle64_open Error 0xe

The error occurs because the SDK's cross-compiled libcdsprpc.so does NOT handle FASTRPC_IOCTL_INIT_CREATE internally for unsigned PDs. The Q6A system /usr/lib/libcdsprpc.so.1 does. The fix: compile and link natively on the Q6A (or link against the system library).

Do NOT Call INIT_CREATE Manually

Attempting FASTRPC_IOCTL_INIT_CREATE via ioctl on /dev/fastrpc-cdsp-secure always returns EINVAL because the kernel expects the struct to be set up by libcdsprpc's internal state machine. The correct approach:

/* ONLY these two calls are needed — libcdsprpc handles INIT_CREATE */
struct remote_rpc_control_unsigned_module data = { .domain = CDSP_DOMAIN_ID, .enable = 1 };
remote_session_control(DSPRPC_CONTROL_UNSIGNED_MODULE, (void *) &data, sizeof(data));
remote_handle64_open(uri, &handle);

Verified Q6A Constants

Item                   Value
CDSP device node       /dev/fastrpc-cdsp-secure
Shell path             /usr/lib/dsp/cdsp/fastrpc_shell_unsigned_3
Domain ID              CDSP_DOMAIN_ID = 3
Unsigned module flag   FASTRPC_MODE_UNSIGNED_MODULE = (1 << 3) = 0x8
DSP .so path           /usr/lib/dsp/cdsp/
System libcdsprpc      /usr/lib/libcdsprpc.so.1 (symlink at /usr/lib/libcdsprpc.so already exists)
Kernel header          /usr/src/linux-headers-6.18.2-3-qcom/include/uapi/misc/fastrpc.h

dspqueue Symbols

All required dspqueue_* symbols are present in the QCS6490 system libcdsprpc.so.1: dspqueue_create, dspqueue_close, dspqueue_export, dspqueue_write, dspqueue_read, etc.

Cross-Compile Pitfalls

  1. CMAKE_SYSROOT breaks the linker — Ubuntu's gcc-aarch64-linux-gnu packages install linker scripts with absolute paths (e.g., /usr/aarch64-linux-gnu/lib/libm.so contains GROUP( /usr/aarch64-linux-gnu/lib/libm.so.6 ... )). When --sysroot is set to /usr/aarch64-linux-gnu, these absolute paths double up to /usr/aarch64-linux-gnu/usr/aarch64-linux-gnu/lib/... and fail. Solution: let the compiler use its built-in sysroot.

  2. rpcmem_init is optional when linked against libcdsprpc.so. The SDK's cross-compiled libcdsprpc only exports rpcmem_alloc/free but not rpcmem_init/deinit. The Q6A system libcdsprpc has them all.

  3. The old llama-cli binary has hexagon init code baked in. The llama.cpp build generates llama-simple, not llama-cli. Deploy llama-simple as llama-cli to get a binary that loads the hexagon backend from the shared library.

NPU Acceleration

After implementing offload_op and enabling direct-compute mode, the Hexagon backend now actually offloads operations to the NPU. Here is everything that was done:

Required Changes to ggml-hexagon.cpp

  1. offload_op callback — Added a function that returns true for GGML_OP_MUL_MAT and GGML_OP_MUL_MAT_ID, so the scheduler claims these ops for the NPU:

    static bool ggml_backend_hexagon_device_offload_op(ggml_backend_dev_t dev, const struct ggml_tensor * op) {
        return (op->op == GGML_OP_MUL_MAT || op->op == GGML_OP_MUL_MAT_ID);
    }
    
  2. Memory increased to 10 GiB — Changed from the hardcoded 2 GiB to 10 GiB (matching QCS6490's 11 GB RAM):

    *free  = 10ULL * 1024 * 1024 * 1024;
    
  3. Direct compute mode — Added a code path that bypasses the broken dspqueue and calls htp_iface_compute() via FastRPC directly. The dspqueue is still created for session initialization. Enable with GGML_HEXAGON_DIRECT_COMPUTE=1.

dspqueue Status

The default dspqueue path (GGML_HEXAGON_DIRECT_COMPUTE=0) fails on this Q6A board with error 0x0000002e (unknown DSP-side error). This affects both Q4_K_M and Q8_0 models. The direct compute path (GGML_HEXAGON_DIRECT_COMPUTE=1) works correctly and shows real NPU acceleration.

Possible causes for dspqueue failure:

  • CDSP daemon in a bad state (sudo systemctl restart cdsprpcd may fix it)
  • Firmware version mismatch with dspqueue implementation
  • Queue size or buffer allocation issue specific to this board

Supported Quantizations

The Hexagon HTP kernels support these quantization types:

  • GGML_TYPE_Q4_0
  • GGML_TYPE_Q8_0
  • GGML_TYPE_IQ4_NL
  • GGML_TYPE_MXFP4

Q4_K_M is NOT supported. Use Q8_0 or Q4_0 models for NPU offload.

Performance Benchmarks

All tests on Radxa Dragon Q6A (QCS6490, 11 GB RAM, 8 ARM Cortex cores, Ubuntu 24.04).

1B Model (Llama 3.2, Q8_0) — NPU vs CPU

Metric        CPU        NPU (direct compute)   Speedup
Prompt eval   26.8 t/s   114.9 t/s              4.3x faster
Generation    13.1 t/s   9.6 t/s                27% slower

Analysis:

  • Prompt eval is 4x faster on NPU — this is the expected win. Prompt eval processes many tokens in parallel, hitting the NPU's massive matrix multiplication throughput.
  • Generation is 27% slower on NPU — also expected. Autoregressive generation is memory-bandwidth-bound (single token at a time). The NPU communication overhead (FastRPC call + DMA sync) outweighs the compute benefit for batch-1 operations.
  • CPU-only runs through ggml-backend-cpu. NPU runs through the Hexagon backend with offloaded MUL_MAT ops.

Previous: 1B Model (Llama 3.2, Q4_K_M) — Baseline (no offload)

From before the offload_op fix, the NPU path was identical to CPU:

Metric              CPU-only   With Hexagon backend (no offload)
Prompt processing   32.3 t/s   32.0 t/s
Generation          4.5 t/s    4.5 t/s

Difference is within noise — no ops were actually offloaded.

7B Model (DeepSeek R1 Distill Qwen, Q4_K_M)

Tested at several context sizes; the model alone is 4460 MiB, leaving limited headroom in 11 GB RAM:

Context   Test   Prompt t/s   Gen t/s
2K        CPU    2.7          1.9
2K        NPU    2.8          1.8
32K       CPU    2.7          1.9
32K       NPU    2.7          1.9
64K       CPU    2.5          1.8
64K       NPU    2.5          1.8

NPU vs CPU identical — Q4_K_M is not a supported quantization for HTP kernels, so ops always fall back to CPU.

How the Direct Compute Path Works

When GGML_HEXAGON_DIRECT_COMPUTE=1:

  1. The dspqueue is still created and exported normally (required for DSP session init via htp_iface_start)
  2. htp_iface_start() is called with the real queue ID to set up the DSP session
  3. When enqueue() is called for an op, it calls htp_iface_compute() via FastRPC directly instead of writing to the dspqueue
  4. The op data (req struct) is sent as the first buffer, with DMA-BUF file descriptors for weight and activation tensors
  5. After the direct compute call completes, ARM caches are synced for the output DMA-BUFs

This bypasses the broken dspqueue read path while still using the standard FastRPC session for initialization.

Known Issues

  • dspqueue path fails with error 0x0000002e — use GGML_HEXAGON_DIRECT_COMPUTE=1 to bypass.
  • CDSP daemon restart may fix dspqueue — sudo systemctl restart cdsprpcd may restore dspqueue functionality.
  • Q4_K_M won't offload — HTP kernels only support Q4_0, Q8_0, IQ4_NL, MXFP4. Even with offload_op enabled, supports_op returns false for Q4_K_M.
  • Memory number is still hardcoded — 10 GiB is a reasonable estimate for QCS6490 with 11 GB RAM, but never queried from hardware.
  • Direct compute uses stack memory for IDL params — the implementation passes tiny stack-allocated dummy buffers to avoid FastRPC EREMCHG errors from ION/CMA memory. This works but is fragile.

Files in This Repo

File                       Purpose
src/test_fastrpc_fixed.c   Corrected test harness with proper init sequence
src/htp_minimal_impl.c     Minimal DSP stub (for experimentation)
scripts/build-hexagon.sh   Cross-compile script for llama.cpp with GGML_HEXAGON=ON
scripts/deploy-to-q6a.sh   Deploy built binaries + DSP .so to Q6A
scripts/test-on-q6a.sh     Run full inference test on Q6A
scripts/test-7b.sh         Run 7B model benchmarks at various context sizes
references/fastrpc.h       Q6A kernel header (ioctl struct definitions)
AGENTS.md                  Context for AI coding agents working with this codebase