llama_cpp_for_radxa_dragon_wing_q6a

pingu_98/llama_cpp_for_radxa_dragon_wing_q6a

Georgi Gerganov d28961d81e llama : enable chunked fused GDN path (#20340)	2026-03-11 22:46:40 +02:00
* llama : enable chunked fused GDN path
* models : avoid Q and K repeats when using fused GDA
* cont : fix comment
  Co-authored-by: Aman Gupta <amangupta052@gmail.com>
* cont : fix the fix
  Co-authored-by: Aman Gupta <amangupta052@gmail.com>
* cont : fix
* metal : add GDN kernel (#20361)
* metal : add Metal backend for GGML_OP_GATED_DELTA_NET
  Add a fused Metal kernel for the gated delta net recurrence op (#19504), enabling GPU-accelerated inference for DeltaNet-based models (Qwen3.5, etc.) on Apple Silicon.
  Supports both GDA (scalar gate) and KDA (per-row gate) modes with head_size 64 and 128. Unsupported configurations (head_size 32, non-contiguous tensors) gracefully fall back to CPU.
  Performance: Qwen3.5-0.8B Q4_K_M on M4 Max tg128: 170 -> 213 t/s (+25%)
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* metal : validate contiguity of all input tensors in supports_op
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* metal : add algorithm equivalence comment for GDA decay path
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* cont : unslop + optimize
* cont : clean-up
---------
Co-authored-by: Paul Flynn <paul@arkavo.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* CUDA: AR gated delta net improvements (#20391)
* Add FastDiv to gated_delta_net_cuda
* Shard columns across warps
  This reduces register pressure (avoids spill for S_v = 128) and gives the warp-scheduler more CTAs to schedule (thus hiding data-access latencies).
* Remove unneded include in gated_delta_net.cu
* Improve comments
* Apply code-formating
* Make sharding HIP-compatible
  1. Use ggml_cuda_get_physical_warp_size() to determine warp size flexibly
  2. Add test with partial warp to test sum reduction on CUDA
* Remove fastdiv_s64, as we can treat neqk1 and rq3 as uint32_t
* Rename variables
* Enable GDN also for prefill, move TODO for chunked_GDN
* Actually remove the TODO from 206890897546bd16602c3b79394fd5ea09ef199f
* Get warp size at runtime
  warp_size is not known at compile time in hip host code.
* Don't expose ggml_cuda_get_physical_warp_size on host
---------
Co-authored-by: uvos <devnull@uvos.xyz>
* llama : refactor llm_build_delta_net_base API
---------
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Co-authored-by: Paul Flynn <paul@arkavo.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Oliver Simons <osimons@nvidia.com>
Co-authored-by: uvos <devnull@uvos.xyz>
peg-parser	common: consolidate PEG string parsers (#20263)	2026-03-10 00:29:21 +01:00
.gitignore
CMakeLists.txt	common/parser: handle reasoning budget (#20297)	2026-03-11 10:26:12 +01:00
get-model.cpp
get-model.h
gguf-model-data.cpp
gguf-model-data.h
run-json-schema-to-grammar.mjs
test-alloc.cpp
test-arg-parser.cpp
test-autorelease.cpp
test-backend-ops.cpp	llama : enable chunked fused GDN path (#20340)	2026-03-11 22:46:40 +02:00
test-backend-sampler.cpp
test-barrier.cpp
test-c.c
test-chat-auto-parser.cpp	common : gracefully handle incomplete output (#20191)	2026-03-08 17:17:02 +01:00
test-chat-peg-parser.cpp	common: consolidate PEG string parsers (#20263)	2026-03-10 00:29:21 +01:00
test-chat-template.cpp	Autoparser - complete refactoring of parser architecture (#18675)	2026-03-06 21:01:00 +01:00
test-chat.cpp	common: map developer role to system (#20215)	2026-03-09 14:25:11 +01:00
test-double-float.cpp
test-gbnf-validator.cpp
test-gguf-model-data.cpp
test-gguf.cpp
test-grammar-integration.cpp
test-grammar-llguidance.cpp
test-grammar-parser.cpp
test-jinja.cpp
test-json-partial.cpp
test-json-schema-to-grammar.cpp	examples : fix empty items in json_schema_to_grammar.py [no ci] (#19968)	2026-03-10 14:38:18 +01:00
test-llama-archs.cpp	llama: end-to-end tests (#19802)	2026-03-08 12:30:21 +01:00
test-llama-grammar.cpp
test-log.cpp
test-lora-conversion-inference.sh
test-model-load-cancel.cpp
test-mtmd-c-api.c
test-opt.cpp
test-peg-parser.cpp	Autoparser - complete refactoring of parser architecture (#18675)	2026-03-06 21:01:00 +01:00
test-quantize-fns.cpp	ggml : add NVFP4 quantization type support (#19769)	2026-03-11 21:02:54 +01:00
test-quantize-perf.cpp
test-quantize-stats.cpp
test-reasoning-budget.cpp	common/parser: handle reasoning budget (#20297)	2026-03-11 10:26:12 +01:00
test-regex-partial.cpp
test-rope.cpp
test-sampling.cpp
test-state-restore-fragmented.cpp
test-thread-safety.cpp
test-tokenizer-0.cpp
test-tokenizer-0.py
test-tokenizer-0.sh
test-tokenizer-1-bpe.cpp
test-tokenizer-1-spm.cpp
test-tokenizer-random.py
test-tokenizers-repo.sh
testing.h