q6a-llama-vulkan-patch/README.md

# Intro 

I've been trying to get the "best" performance out of LLMs on my new Radxa Q6A SBC, and to be honest I'm not very impressed with the results (before or after the patch), but here's a way to get the GPU working with llama.cpp. I think my expectations of the 15 TOPS NPU were way too high: the Hexagon v68 is a pretty small and limited thing, and likewise the Adreno 643 is no RTX 5090. I got deepseek v4 flash to help me build this after giving up on LiteRT-LM with the NPU. Basically the results are a wash all round, and getting more than 10 tok/s isn't possible on this hardware with a decent-sized context. I'm moving on to trying Immich AI on it, with TTS/STT next. You can also check out my other repos, which have llama.cpp patched to run on the NPU and a guide to how that works ("working" being slower than CPU).

# Q6A llama.cpp Vulkan Patch + Benchmarks

This repo fixes llama.cpp's GPU backend on **Turnip Adreno 643** (Mesa Freedreno) — the GPU inside the **Radxa Dragon Wing Q6A** (Qualcomm QCS6490).

## The Problem

With `-ngl N` where N >= 3, llama.cpp crashes at context creation:

```
pre-allocated tensor (cache_k_l3) in a buffer (Vulkan0) that cannot run the operation (NONE)
```

**Root cause:** The backend scheduler in `ggml-backend.cpp` aborts when it finds a pre-allocated tensor (KV cache) whose backend supports the buffer type but doesn't register the NONE identity operation. On Turnip / Vulkan, NONE ops have no shader backend, so the scheduler gives up and aborts.

## The Fix

The fix is a 12-line patch to `ggml/src/ggml-backend.cpp`. Before aborting, the scheduler now tries a fallback match: it picks the first backend that supports the tensor's buffer type (`buft`), even if that backend doesn't report support for the particular op. For KV cache tensors, which only hold data, this is the correct behavior.

```patch
+        {
+            ggml_backend_buffer_t buf = tensor->view_src ? tensor->view_src->buffer : tensor->buffer;
+            for (int i = 0; i < sched->n_backends; i++) {
+                if (ggml_backend_supports_buft(sched->backends[i], buf->buft)) {
+                    cur_backend_id = i;
+                    SET_CAUSE(tensor, "1.buft");
+                    return cur_backend_id;
+                }
+            }
+        }
```
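The two-pass idea behind the patch can be sketched in isolation. This is a simplified model, not the actual ggml code: `Backend`, `Tensor`, `Op` and `pick_backend` are hypothetical stand-ins, and buffer types are plain strings here purely for illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins for ggml's types; every name here is illustrative.
enum class Op { None, MulMat };

struct Backend {
    std::string name;
    std::vector<std::string> buffer_types; // buffer types the backend can use
    std::vector<Op> ops;                    // ops the backend reports support for

    bool supports_buft(const std::string & buft) const {
        for (const auto & b : buffer_types) {
            if (b == buft) return true;
        }
        return false;
    }

    bool supports_op(Op op) const {
        for (Op o : ops) {
            if (o == op) return true;
        }
        return false;
    }
};

struct Tensor {
    std::string buft; // buffer type the tensor is already allocated in
    Op op;            // operation that produced the tensor (None for plain data)
};

// Returns the index of the chosen backend, or -1 if nothing matches.
int pick_backend(const std::vector<Backend> & backends, const Tensor & t) {
    // Strict pass: the backend must support both the buffer type and the op.
    for (std::size_t i = 0; i < backends.size(); i++) {
        if (backends[i].supports_buft(t.buft) && backends[i].supports_op(t.op)) {
            return static_cast<int>(i);
        }
    }
    // Fallback pass (the idea behind the patch): the buffer type alone is enough.
    for (std::size_t i = 0; i < backends.size(); i++) {
        if (backends[i].supports_buft(t.buft)) {
            return static_cast<int>(i);
        }
    }
    return -1; // the caller would abort here, as the unpatched scheduler does
}
```

In this model, a data-only tensor sitting in a Vulkan buffer resolves to the Vulkan backend through the fallback pass even though that backend does not list `Op::None`, which mirrors what the patch does for the KV cache tensors.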

## Apply the Patch

```bash
cd /path/to/llama.cpp
git apply /path/to/ggml-backend.cpp.patch
# Configure from the repository root; -B creates the build directory itself
cmake -B build -DGGML_VULKAN=ON
cmake --build build --target llama-cli -j$(nproc)
```

Then run with full GPU offload:

```bash
./build/bin/llama-cli -m model.gguf -p "Hello" -n 128 -ngl 99
```

The patch works with or without `--no-warmup`, and it does not require `-nkvo` or `-fa`.

## Benchmarks

Hardware: **Radxa Dragon Wing Q6A** — Qualcomm QCS6490, 12GB RAM, Adreno 643 (Turnip Mesa 25.0.7)

### Qwen3.5-0.8B — Q4_K_M (521 MB)

| Config | Prefill pp32 | Gen tg128 |
|--------|-------------|-----------|
| CPU (ngl=0) | 15.17 t/s | 12.24 t/s |
| **GPU (ngl=99)** | **21.18 t/s** | **8.24 t/s** |

### Qwen3.5-0.8B — Q8_K_XL (1.09 GB)

| Config | Prefill pp32 | Gen tg64 |
|--------|-------------|----------|
| CPU (ngl=0) | 13.01 t/s | 8.74 t/s |
| GPU (ngl=1) | 9.4 t/s | 3.3 t/s |
| **GPU (ngl=99)** | **21.01 t/s** | **7.49 t/s** |

### Observations

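- **GPU prefill is roughly 40-60% faster** than CPU: about 40% for Q4_K_M (21.18 vs 15.17 t/s) and about 60% for Q8_K_XL (21.01 vs 13.01 t/s)
- **CPU generation is faster** in both quantizations (12.24 vs 8.24 t/s for Q4_K_M, 8.74 vs 7.49 t/s for Q8_K_XL). Turnip lacks INT4 dot-product instructions, so the GPU dequantizes to fp16 internally
- **Q4_K_M on CPU** is the overall sweet spot: about 15 t/s prefill and 12 t/s generation, no GPU setup needed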
- The patch enables `-ngl 99` for anyone who wants fast prefill or batch processing

## What Didn't Work

Claude Code suggested three alternatives that all failed:

- `--flash-attn` (-fa) — still hits the same scheduler abort
- `GGML_VK_DISABLE_F16=1` — no effect on the scheduler bug
- `-ub 64` — doesn't help when KV offload is on

The only workaround before the patch was `-ngl 99 -nkvo` (no KV offload), which limits performance to 8.9/4.0 t/s (prefill/generation).

## Other Findings

- **NPU (Hexagon v68)**: Blocked — kernel lacks `CONFIG_QCOM_FASTRPC_UNSIGNED_MODULE=y`
- **LiteRT-LM**: Works on CPU/GPU, see [npu/litert-lm.md](https://github.com/pingud98/projects-wiki/blob/main/npu/litert-lm.md)
- **llvmpipe**: The software Vulkan renderer is detected, but ggml_vulkan intentionally skips CPU-type devices

## Files

- `ggml-backend.cpp.patch` — the actual patch
- `README.md` — this file