llama_cpp_for_radxa_dragon_.../common
Berk Idem 56666fa607
common: skip reasoning budget sampler when no budget is requested (#21870)
* common: skip reasoning budget sampler when no budget is requested

After I added thinking_start_tag / thinking_end_tag for gemma4 in #21697, the reasoning budget sampler gets created unconditionally, even when no budget is configured (the default, -1). The same applies to kimi_k2, lfm2, lfm2_5, and ministral_3, which also set these tags. The budget gets converted to INT_MAX, so the sampler never actually forces any tokens but still runs per-token checks (start tag matching in IDLE state, token-to-piece conversion and UTF-8 checks in COUNTING state).

More importantly, the mere existence of the sampler (non-null rbudget) disables backend sampling. Backend sampling lets the GPU select tokens directly, avoiding a full logits transfer from GPU to CPU every token. This could explain the roughly 30% speed regression reported in #21784 (98 t/s to 70 t/s on Vulkan).

So I added a reasoning_budget_tokens >= 0 check to the sampler creation condition. When the budget is unlimited, the sampler is not created, backend sampling stays enabled, and no per-token overhead is added. When a budget is explicitly set (0, 128, 1024, etc.), the sampler is created and works as before.

* common: preserve rbudget when grammar is lazy

Following up on the review feedback on #21870: keep the reasoning budget sampler when grammar_lazy is true, so the thinking-block grammar suppression from #20970 still works when tools are in use. This way, the sampler is skipped only when no budget is set and the grammar is not lazy.
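
The resulting creation condition can be sketched roughly as follows. This is a minimal illustration, not the actual code in sampling.cpp: the function name `should_create_rbudget` and the `has_thinking_tags` parameter are hypothetical, while `reasoning_budget_tokens` and `grammar_lazy` follow the names used in the commit message above.

```cpp
// Sketch only: a hypothetical helper mirroring the condition described in the
// commit message. The real check in sampling.cpp may be structured differently.
//
// has_thinking_tags       - assumed flag: the chat template set
//                           thinking_start_tag / thinking_end_tag
// reasoning_budget_tokens - -1 means no budget requested (the default);
//                           0, 128, 1024, ... mean an explicit budget
// grammar_lazy            - true when a lazy grammar is in use (e.g. with tools)
static bool should_create_rbudget(bool has_thinking_tags,
                                  int  reasoning_budget_tokens,
                                  bool grammar_lazy) {
    // Without thinking tags there is nothing for the sampler to track.
    if (!has_thinking_tags) {
        return false;
    }
    // An explicit budget (including 0) needs the sampler to enforce it.
    const bool has_budget = reasoning_budget_tokens >= 0;
    // A lazy grammar needs the sampler so thinking-block grammar
    // suppression keeps working even with an unlimited budget.
    return has_budget || grammar_lazy;
}
```

Under these assumptions, a model with thinking tags, the default budget of -1, and a non-lazy grammar gets no sampler, so backend sampling stays enabled. Setting a budget of 0 or enabling a lazy grammar brings the sampler back.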
2026-04-14 12:43:06 +02:00
jinja
arg.cpp
arg.h
base64.hpp
build-info.cpp.in
chat-auto-parser-generator.cpp
chat-auto-parser-helpers.cpp
chat-auto-parser-helpers.h
chat-auto-parser.h
chat-diff-analyzer.cpp
chat-peg-parser.cpp
chat-peg-parser.h
chat.cpp common/gemma4 : handle parsing edge cases (#21760) 2026-04-13 18:18:18 -05:00
chat.h
CMakeLists.txt
common.cpp
common.h
console.cpp
console.h
debug.cpp
debug.h
download.cpp common : add download cancellation and temp file cleanup (#21813) 2026-04-13 11:18:23 +02:00
download.h common : add download cancellation and temp file cleanup (#21813) 2026-04-13 11:18:23 +02:00
hf-cache.cpp
hf-cache.h
http.h
json-partial.cpp
json-partial.h
json-schema-to-grammar.cpp
json-schema-to-grammar.h
llguidance.cpp
log.cpp
log.h
ngram-cache.cpp
ngram-cache.h
ngram-map.cpp
ngram-map.h
ngram-mod.cpp
ngram-mod.h
peg-parser.cpp common/gemma4 : handle parsing edge cases (#21760) 2026-04-13 18:18:18 -05:00
peg-parser.h common/gemma4 : handle parsing edge cases (#21760) 2026-04-13 18:18:18 -05:00
preset.cpp
preset.h
reasoning-budget.cpp
reasoning-budget.h
regex-partial.cpp
regex-partial.h
sampling.cpp common: skip reasoning budget sampler when no budget is requested (#21870) 2026-04-14 12:43:06 +02:00
sampling.h
speculative.cpp
speculative.h
unicode.cpp
unicode.h