llama_cpp_for_radxa_dragon_.../common
Georgi Gerganov 225e7a1438
llama : add high-throughput mode (#14363)
* kv-cache : prepare K/V buffers for separation

ggml-ci

* batched-bench : fix oob write

ggml-ci

* llama : add "virtual sequences"

ggml-ci

* llama : use "stream" vs "virtual sequence"

ggml-ci

* graph : fix stream splitting when KV cache is not used

ggml-ci

* kv-cache : add multi-stream save/load support

ggml-ci

* llama : add "--attn-streams" flag

ggml-ci

* kv-cache : fix handling when find_slot fails

ggml-ci

* kv-cache : restore find_slot impl

ggml-ci

* kv-cache : add comments

* kv-cache : add bounds checks for sequence id

ggml-ci

* cont : add n_seq_max to batch allocr

ggml-ci

* kv-cache : perform stream copies lazily after llama_synchronize

ggml-ci

* kv-cache : avoid throwing exceptions across the C boundary

ggml-ci

* CUDA: 4D FlashAttention support (#14628)

* CUDA: 4D FlashAttention support

* CUDA: fix WMMA FA kernel

* llama : rename attn_streams -> kv_unified

ggml-ci

* common : rename kv_split -> kv_unified

ggml-ci

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-07-16 16:35:42 +03:00
arg.cpp llama : add high-throughput mode (#14363) 2025-07-16 16:35:42 +03:00
arg.h
base64.hpp
build-info.cpp.in cmake: Add ability to pass in LLAMA_BUILD_NUMBER/COMMIT (#14167) 2025-06-13 10:38:52 +02:00
chat-parser.cpp llama-chat : Do not throw when tool parsing fails (#14012) 2025-06-14 17:25:15 +01:00
chat-parser.h llama-chat : Do not throw when tool parsing fails (#14012) 2025-06-14 17:25:15 +01:00
chat.cpp server : support jinja extra template kwargs (Qwen3 enable_thinking feature), from command line and from client (#13196) 2025-06-29 20:02:53 +02:00
chat.h server : support jinja extra template kwargs (Qwen3 enable_thinking feature), from command line and from client (#13196) 2025-06-29 20:02:53 +02:00
CMakeLists.txt cmake : do not search for curl libraries by ourselves (#14613) 2025-07-10 15:29:05 +03:00
common.cpp llama : add high-throughput mode (#14363) 2025-07-16 16:35:42 +03:00
common.h llama : add high-throughput mode (#14363) 2025-07-16 16:35:42 +03:00
console.cpp
console.h
json-partial.cpp sync : vendor (#13901) 2025-05-30 16:25:45 +03:00
json-partial.h sync : vendor (#13901) 2025-05-30 16:25:45 +03:00
json-schema-to-grammar.cpp common : use std::string_view now that we target c++17 (#14319) 2025-06-22 08:37:43 +03:00
json-schema-to-grammar.h sync : vendor (#13901) 2025-05-30 16:25:45 +03:00
llguidance.cpp
log.cpp
log.h
ngram-cache.cpp
ngram-cache.h
regex-partial.cpp
regex-partial.h
sampling.cpp server: streaming of tool calls and thoughts when --jinja is on (#12379) 2025-05-25 01:48:08 +01:00
sampling.h
speculative.cpp llama : deprecate llama_kv_self_ API (#14030) 2025-06-06 14:11:15 +03:00
speculative.h