From 00d65a9ef3e5696ef82eb72288f256e1dcd95527 Mon Sep 17 00:00:00 2001
From: Jimmy Devine
Date: Sun, 10 May 2026 20:31:16 +0200
Subject: [PATCH] Initial: ggml-backend Turnip Vulkan scheduler fix + benchmarks

---
 README.md              | 95 ++++++++++++++++++++++++++++++++++++++++++
 benchmarks.md          | 54 ++++++++++++++++++++++++
 ggml-backend.cpp.patch | 21 ++++++++++
 3 files changed, 170 insertions(+)
 create mode 100644 README.md
 create mode 100644 benchmarks.md
 create mode 100644 ggml-backend.cpp.patch

diff --git a/README.md b/README.md
new file mode 100644
index 0000000..1c78e62
--- /dev/null
+++ b/README.md
@@ -0,0 +1,95 @@
+# Q6A llama.cpp Vulkan Patch + Benchmarks
+
+Fixes llama.cpp's GPU backend on **Turnip Adreno 643** (Mesa Freedreno) — the GPU inside the **Radxa Dragon Wing Q6A** (Qualcomm QCS6490).
+
+## The Problem
+
+With `-ngl N` where N >= 3, llama.cpp crashes at context creation:
+
+```
+pre-allocated tensor (cache_k_l3) in a buffer (Vulkan0) that cannot run the operation (NONE)
+```
+
+**Root cause:** The backend scheduler in `ggml-backend.cpp` aborts when it finds a pre-allocated tensor (the KV cache) in a buffer whose backend supports the buffer type but does not report support for the tensor's NONE (no-op) operation. On Turnip / Vulkan, NONE has no shader implementation, so the scheduler gives up and aborts.
+
+## The Fix
+
+A 12-line patch to `ggml/src/ggml-backend.cpp`. Before aborting, it tries a fallback match: find a backend that supports the tensor's buffer type (`buft`) even if it does not support the particular op. For KV cache tensors, which hold data only, this is the correct behavior.
+
+```patch
++ {
++ ggml_backend_buffer_t buf = tensor->view_src ? tensor->view_src->buffer : tensor->buffer;
++ for (int i = 0; i < sched->n_backends; i++) {
++ if (ggml_backend_supports_buft(sched->backends[i], buf->buft)) {
++ cur_backend_id = i;
++ SET_CAUSE(tensor, "1.buft");
++ return cur_backend_id;
++ }
++ }
++ }
+```
+
+## Apply the Patch
+
+```bash
+cd /path/to/llama.cpp
+git apply /path/to/ggml-backend.cpp.patch
+mkdir -p build
+cmake -B build -DGGML_VULKAN=ON
+cmake --build build --target llama-cli -j$(nproc)
+```
+
+Then run with full GPU offload:
+
+```bash
+./build/bin/llama-cli -m model.gguf -p "Hello" -n 128 -ngl 99
+```
+
+The patch works with or without `--no-warmup` and does not require `-nkvo` or `-fa`.
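+
+The pp32/tg128 figures in the Benchmarks section below can be re-measured with `llama-bench` (a sketch, not the exact commands used for the tables; it assumes the `llama-bench` target is built from the same patched tree and that `model.gguf` stands in for the real GGUF path):
+
+```bash
+cmake --build build --target llama-bench -j$(nproc)
+./build/bin/llama-bench -m model.gguf -ngl 99 -p 32 -n 128
+```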
+
+## Benchmarks
+
+Hardware: **Radxa Dragon Wing Q6A** — Qualcomm QCS6490, 12GB RAM, Adreno 643 (Turnip Mesa 25.0.7)
+
+### Qwen3.5-0.8B — Q4_K_M (521 MB)
+
+| Config | Prefill pp32 | Gen tg128 |
+|--------|-------------|-----------|
+| CPU (ngl=0) | 15.17 t/s | 12.24 t/s |
+| **GPU (ngl=99)** | **21.18 t/s** | **8.24 t/s** |
+
+### Qwen3.5-0.8B — Q8_K_XL (1.09 GB)
+
+| Config | Prefill pp32 | Gen tg64 |
+|--------|-------------|----------|
+| CPU (ngl=0) | 13.01 t/s | 8.74 t/s |
+| GPU (ngl=1) | 9.4 t/s | 3.3 t/s |
+| **GPU (ngl=99)** | **21.01 t/s** | **7.49 t/s** |
+
+### Observations
+
+- **GPU prefill is 40-61% faster** than CPU in both quantizations
+- **CPU generation is faster** for Q4_K_M (12 vs 8 t/s) — Turnip lacks INT4 dot-product instructions, so the GPU dequantizes to fp16 internally
+- **Q4_K_M on CPU** is the overall sweet spot: 15/12 t/s, no GPU setup needed
+- The patch enables `-ngl 99` for anyone who wants fast prefill or batch processing
+
+## What Didn't Work
+
+Claude Code suggested three alternatives that all failed:
+
+- `--flash-attn` (`-fa`) — still hits the same scheduler abort
+- `GGML_VK_DISABLE_F16=1` — no effect on the scheduler bug
+- `-ub 64` — doesn't help when KV offload is on
+
+The only workaround before the patch was `-ngl 99 -nkvo` (no KV offload), which limits performance to 8.9/4.0 t/s.
+
+## Other Findings
+
+- **NPU (Hexagon v68)**: Blocked — kernel lacks `CONFIG_QCOM_FASTRPC_UNSIGNED_MODULE=y`
+- **LiteRT-LM**: Works on CPU/GPU, see [npu/litert-lm.md](https://github.com/pingud98/projects-wiki/blob/main/npu/litert-lm.md)
+- **llvmpipe**: The software Vulkan renderer is detected, but ggml_vulkan intentionally skips CPU-type devices
+
+## Files
+
+- `ggml-backend.cpp.patch` — the actual patch
+- `README.md` — this file
diff --git a/benchmarks.md b/benchmarks.md
new file mode 100644
index 0000000..77b4e36
--- /dev/null
+++ b/benchmarks.md
@@ -0,0 +1,54 @@
+# Q6A llama.cpp Vulkan Patch + Benchmarks
+
+Everything needed to enable full GPU offload on Turnip Adreno 643.
+
+## The Problem
+
+llama.cpp's GPU backend crashes at context creation when offloading 3+ layers to Vulkan on Mesa Turnip:
+
+```
+pre-allocated tensor (cache_k_l3) in a buffer (Vulkan0)
+that cannot run the operation (NONE)
+```
+
+## The Fix
+
+A 12-line patch to `ggml/src/ggml-backend.cpp` that adds a fallback path: before aborting, it finds a backend matching the tensor's buffer type, ignoring op support. KV cache tensors hold data only, so they don't need op-specific backends.
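+
+Before running the Quick Start below, it is worth confirming that the Turnip driver is the Vulkan device llama.cpp will see (a sketch; assumes the `vulkan-tools` package providing `vulkaninfo` is installed):
+
+```bash
+# List Vulkan devices and the driver behind each one
+vulkaninfo --summary | grep -iE 'deviceName|driverName'
+```
+
+If only llvmpipe is listed, ggml_vulkan will skip it (it ignores CPU-type devices, see the README) and nothing will be offloaded to the GPU.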
+ +## Quick Start + +```bash +git clone https://github.com/ggml-org/llama.cpp +cd llama.cpp +git apply /path/to/ggml-backend.cpp.patch +cmake -B build -DGGML_VULKAN=ON +cmake --build build --target llama-cli -j$(nproc) +./build/bin/llama-cli -m model.gguf -ngl 99 +``` + +## Benchmarks + +System: Q6A / Radxa Dragon Wing / Qualcomm QCS6490 / Adreno 643 / Mesa 25.0.7 / 12GB RAM + +### Qwen3.5-0.8B Q4_K_M (521 MB) + +| Config | Prefill (pp32) | Generation (tg128) | +|--------|---------------|-------------------| +| CPU | 15.2 tok/s | 12.2 tok/s | +| GPU (ngl=99) | 21.2 tok/s | 8.2 tok/s | +| *Delta* | *+40%* | *-33%* | + +### Qwen3.5-0.8B Q8_K_XL (1.09 GB) + +| Config | Prefill (pp32) | Generation (tg64) | +|--------|---------------|-------------------| +| CPU | 13.0 tok/s | 8.7 tok/s | +| GPU (ngl=99) | 21.0 tok/s | 7.5 tok/s | +| *Delta* | *+61%* | *-14%* | + +### Summary + +- GPU prefill is **40-61% faster** than CPU +- GPU generation is **14-33% slower** (Turnip lacks INT4 dot-product, dequantizes to fp16) +- **Q4_K_M on CPU is the sweet spot** for interactive use: 15/12 tok/s, zero setup +- Use GPU when you care about time-to-first-token (batch/search workloads) diff --git a/ggml-backend.cpp.patch b/ggml-backend.cpp.patch new file mode 100644 index 0000000..1db1d16 --- /dev/null +++ b/ggml-backend.cpp.patch @@ -0,0 +1,21 @@ +--- ggml/src/ggml-backend.cpp 2026-05-10 17:51:34 ++++ ggml/src/ggml-backend.cpp 2026-05-10 17:55:54 +@@ -894,6 +894,18 @@ + + if (tensor->buffer || (tensor->view_src && tensor->view_src->buffer)) { + // since the tensor is pre-allocated, it cannot be moved to another backend ++ // Try to find a backend matching this buffer, ignoring op support ++ // (some ops like NONE or data movement ops may not be registered on all backends) ++ { ++ ggml_backend_buffer_t buf = tensor->view_src ? tensor->view_src->buffer : tensor->buffer; ++ for (int i = 0; i < sched->n_backends; i++) { ++ if (ggml_backend_supports_buft(sched->backends[i], buf->buft)) { ++ cur_backend_id = i; ++ SET_CAUSE(tensor, "1.buft"); ++ return cur_backend_id; ++ } ++ } ++ } + ggml_backend_buffer_t buffer = tensor->view_src ? tensor->view_src->buffer : tensor->buffer; + GGML_ABORT("pre-allocated tensor (%s) in a buffer (%s) that cannot run the operation (%s)", tensor->name, ggml_backend_buffer_name(buffer), ggml_op_name(tensor->op)); + }