
Q6A llama.cpp Vulkan Patch + Benchmarks

Fixes llama.cpp's Vulkan GPU backend on the Adreno 643 under Mesa's Turnip driver (part of the Freedreno project), the GPU inside the Radxa Dragon Wing Q6A (Qualcomm QCS6490).

The Problem

With -ngl N where N >= 3, llama.cpp crashes at context creation:

pre-allocated tensor (cache_k_l3) in a buffer (Vulkan0) that cannot run the operation (NONE)

Root cause: for a pre-allocated tensor like the KV cache, the backend scheduler in ggml/src/ggml-backend.cpp only accepts a backend that supports both the tensor's buffer type and its op. KV cache tensors carry the identity op (GGML_OP_NONE), which the Turnip/Vulkan backend does not register as supported (there is no shader behind it), so no candidate passes both checks and the scheduler aborts.
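
For context, the pre-patch decision logic in ggml_backend_sched_backend_id_from_cur looks roughly like this (a simplified sketch; the function and helper names follow upstream ggml-backend.cpp, but the exact code varies across llama.cpp versions):

// simplified: choosing a backend for a pre-allocated tensor
ggml_backend_buffer_t buf = tensor->view_src ? tensor->view_src->buffer : tensor->buffer;
if (buf != NULL) {
    for (int i = 0; i < sched->n_backends; i++) {
        // a candidate must support BOTH the buffer type AND the op
        if (ggml_backend_supports_buft(sched->backends[i], buf->buft) &&
            ggml_backend_supports_op(sched->backends[i], tensor)) {
            SET_CAUSE(tensor, "1.dst");
            return i;
        }
    }
    // nothing passed both checks: on Turnip, supports_op() rejects the
    // identity op GGML_OP_NONE, so KV cache tensors always land here
    GGML_ABORT("pre-allocated tensor in a buffer that cannot run the operation");
}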

The Fix

A 12-line patch to ggml/src/ggml-backend.cpp. Before aborting, it tries a fallback match: find a backend that supports the tensor's buffer type (buft) even if it does not support the particular op. For KV cache tensors, which hold data and run no computation, this is the correct behavior.

+        {
+            ggml_backend_buffer_t buf = tensor->view_src ? tensor->view_src->buffer : tensor->buffer;
+            for (int i = 0; i < sched->n_backends; i++) {
+                if (ggml_backend_supports_buft(sched->backends[i], buf->buft)) {
+                    cur_backend_id = i;
+                    SET_CAUSE(tensor, "1.buft");
+                    return cur_backend_id;
+                }
+            }
+        }
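
Because the fallback runs only after the scheduler's normal op-aware matching has failed, backends that do support the op keep their priority; the buft-only match is reached only for tensors, like the KV cache, that would otherwise trigger the abort.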

Apply the Patch

cd /path/to/llama.cpp
git apply /path/to/ggml-backend.cpp.patch
cmake -B build -DGGML_VULKAN=ON
cmake --build build --target llama-cli -j$(nproc)

Then run with full GPU offload:

./build/bin/llama-cli -m model.gguf -p "Hello" -n 128 -ngl 99

The patch works with or without --no-warmup, and it requires neither -nkvo nor -fa.
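
Numbers like those below are what llama.cpp's bundled llama-bench tool reports; the exact invocation used for these runs isn't recorded, but something along these lines (the model path is a placeholder) covers the pp32/tg128 columns, with -n 64 for the tg64 table:

./build/bin/llama-bench -m model.gguf -p 32 -n 128 -ngl 0,99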

Benchmarks

Hardware: Radxa Dragon Wing Q6A — Qualcomm QCS6490, 12GB RAM, Adreno 643 (Turnip Mesa 25.0.7)

Qwen3.5-0.8B — Q4_K_M (521 MB)

Config         Prefill (pp32)   Gen (tg128)
CPU (ngl=0)    15.17 t/s        12.24 t/s
GPU (ngl=99)   21.18 t/s        8.24 t/s

Qwen3.5-0.8B — Q8_K_XL (1.09 GB)

Config         Prefill (pp32)   Gen (tg64)
CPU (ngl=0)    13.01 t/s        8.74 t/s
GPU (ngl=1)    9.4 t/s          3.3 t/s
GPU (ngl=99)   21.01 t/s        7.49 t/s

Observations

  • GPU prefill is 40-60% faster than CPU in both quantizations (21.18 vs 15.17 t/s for Q4_K_M, 21.01 vs 13.01 t/s for Q8_K_XL)
  • CPU generation is faster for Q4_K_M (12.24 vs 8.24 t/s) — Turnip lacks INT4 dot-product instructions, so the GPU dequantizes to fp16 internally (see the extension check after this list)
  • Q4_K_M on CPU is the overall sweet spot: 15/12 t/s with no GPU setup needed
  • The patch enables -ngl 99 for anyone who wants fast prefill or batch processing
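
The dot-product claim above can be sanity-checked from the device: VK_KHR_shader_integer_dot_product is the standard Vulkan extension for accelerated integer dot products, so its presence or absence in the driver's extension list is a quick first signal:

vulkaninfo | grep -i integer_dot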

What Didn't Work

Claude Code suggested three alternatives that all failed:

  • --flash-attn (-fa) — still hits the same scheduler abort
  • GGML_VK_DISABLE_F16=1 — no effect on the scheduler bug
  • -ub 64 — doesn't help when KV offload is on

The only workaround before the patch was -ngl 99 -nkvo (no KV offload), which limits performance to 8.9/4.0 t/s.
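
For reference, that pre-patch workaround invocation (-nkvo is the short form of llama.cpp's --no-kv-offload flag):

./build/bin/llama-cli -m model.gguf -p "Hello" -n 128 -ngl 99 -nkvo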

Other Findings

  • NPU (Hexagon v68): blocked — the kernel lacks CONFIG_QCOM_FASTRPC_UNSIGNED_MODULE=y (see the check after this list)
  • LiteRT-LM: works on CPU/GPU; see npu/litert-lm.md
  • llvmpipe: the software Vulkan renderer is detected, but ggml_vulkan intentionally skips CPU-type devices
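
The missing FastRPC option can be confirmed on-device, assuming the kernel exposes its configuration via /proc/config.gz (CONFIG_IKCONFIG_PROC=y):

zcat /proc/config.gz | grep FASTRPC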

Files

  • ggml-backend.cpp.patch — the actual patch
  • benchmarks.md — benchmark results
  • README.md — this file