pingu_98/llama_cpp_for_radxa_dragon_wing_q6a

Oliver Simons 36f0132464 CUDA: Factor out and re-use `block_reduce` function (#18785)	2026-01-15 10:44:54 +08:00

* CUDA: Refactor and expose two_stage_warp_reduce_* function
* Use `two_stage_warp_reduce` also in softmax kernel, move smem out of it
  Moving smem out of `__device__` function to `__global__` function allows for explicit smem reuse, as either compiler or cuda rt seem to not free it afterwards (`cudaFuncSetAttribute` fails when not accounting for it once for each call to two_stage_warp_reduce)
* Update ggml/src/ggml-cuda/common.cuh
  Co-authored-by: Aman Gupta <amangupta052@gmail.com>
* Use two_stage_warp_reduce in group_norm_f32
* Use two_stage_warp_reduce in rms_norm_f32
* Fix smem calculation which expects bytes
* Make `two_stage_warp_reduce` accept all values warp_reduce accepts
  Also integrate it into norm_f32 function
* Use two_stage_warp_reduce in l2_norm_f32
* Use type traits for block reduction for better legibility
  Also address other requests by @am17an such as variable renaming
* Make norm tests cover all cuda paths
* Mark columns % WARP_SIZE !=0 as supported for RMS_NORM_BACK
  Unit-tests passed locally, let's see if they pass in the CI as well
* Use `enum class` for `block_reduce_method`
  This is more type-safe than plain enum
* Rename variables as suggested in code review by @am17an
* Rename two_stage_warp_reduce -> block_reduce
* Fix trailing whitespace in common.cuh
* Make condition of static_assert type-dependent
  This delays evaluation until the template is actually instantiated. Otherwise, some compilers may evaluate the assert when parsing the template, resulting in build errors as observed here:
  https://github.com/ggml-org/llama.cpp/actions/runs/20960323123/job/60235530068?pr=18785
* Inline definitions

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
peg-parser
.gitignore
CMakeLists.txt	ci, tests : use cmake to download models and remove libcurl dependency (#18791)	2026-01-14 07:46:27 +01:00
get-model.cpp
get-model.h
run-json-schema-to-grammar.mjs
test-alloc.cpp
test-arg-parser.cpp	ci, tests : use cmake to download models and remove libcurl dependency (#18791)	2026-01-14 07:46:27 +01:00
test-autorelease.cpp
test-backend-ops.cpp	CUDA: Factor out and re-use `block_reduce` function (#18785)	2026-01-15 10:44:54 +08:00
test-backend-sampler.cpp	tests : refactor test-backend-sampler (#18753)	2026-01-11 17:31:03 +02:00
test-barrier.cpp
test-c.c
test-chat-parser.cpp
test-chat-peg-parser.cpp
test-chat-template.cpp
test-chat.cpp	chat: make tool description and parameters optional per OpenAI spec (#18478)	2025-12-31 17:21:37 -06:00
test-double-float.cpp
test-gbnf-validator.cpp
test-gguf.cpp
test-grammar-integration.cpp
test-grammar-llguidance.cpp	tool/ex/tests: consistently free ctx, then model (#18168)	2025-12-22 11:00:37 +01:00
test-grammar-parser.cpp
test-json-partial.cpp
test-json-schema-to-grammar.cpp
test-llama-grammar.cpp
test-log.cpp
test-lora-conversion-inference.sh
test-model-load-cancel.cpp
test-mtmd-c-api.c
test-opt.cpp
test-peg-parser.cpp
test-quantize-fns.cpp
test-quantize-perf.cpp
test-quantize-stats.cpp
test-regex-partial.cpp	common/grammar : replace problematic backtracking regex `[\s\S]*` (#18342)	2026-01-03 16:02:43 -06:00
test-rope.cpp
test-sampling.cpp
test-state-restore-fragmented.cpp
test-thread-safety.cpp
test-tokenizer-0.cpp	tool/ex/tests: consistently free ctx, then model (#18168)	2025-12-22 11:00:37 +01:00
test-tokenizer-0.py
test-tokenizer-0.sh
test-tokenizer-1-bpe.cpp	tool/ex/tests: consistently free ctx, then model (#18168)	2025-12-22 11:00:37 +01:00
test-tokenizer-1-spm.cpp	tool/ex/tests: consistently free ctx, then model (#18168)	2025-12-22 11:00:37 +01:00
test-tokenizer-random.py
test-tokenizers-repo.sh