pingu_98/llama_cpp_for_radxa_dragon_wing_q6a

Justine Tunney 8cc91dc63c ggml : add llamafile sgemm (#6414)

This change upstreams llamafile's CPU matrix multiplication kernels, which improve image and prompt evaluation speed. For starters, Q4_0 and Q8_0 weights should go ~40% faster on CPU. The biggest benefits are with data types like f16 / f32, which process prompts 2x faster, thus making them faster than quantized data types for prompt evals.

This change also introduces bona fide AVX512 support, since tinyBLAS is able to exploit the larger register file. For example, on my CPU llama.cpp llava-cli processes an image prompt at 305 tokens/second using the Q4_K and Q4_0 types, which has always been faster than if we used f16 LLaVA weights, which at HEAD go 188 tokens/second. With this change, f16 LLaVA performance leapfrogs to 464 tokens/second.

On Intel Core i9-14900K this change improves F16 prompt perf by 5x. For example, using llama.cpp at HEAD with Mistral 7b f16 to process a 215 token prompt will go 13 tok/sec. This change has fixes making it go 52 tok/sec. It's mostly thanks to my vectorized outer product kernels, but also because I added support for correctly counting the number of cores on Alder Lake, so the default thread count discounts Intel's new efficiency cores. Only Linux right now can count cores.

This work was sponsored by Mozilla, who has given permission to change the license of this code from Apache 2.0 to MIT. To read more about what's improved, and how it works, see: https://justine.lol/matmul/

2024-04-16 21:55:30 +03:00
baby-llama
batched
batched-bench	bench : make n_batch and n_ubatch configurable in Batched bench (#6500)	2024-04-05 21:34:53 +03:00
batched.swift
beam-search
benchmark
convert-llama2c-to-ggml
embedding	BERT tokenizer fixes (#6498)	2024-04-09 13:44:08 -04:00
eval-callback	model: support arch `DbrxForCausalLM` (#6515)	2024-04-13 11:33:52 +02:00
export-lora
finetune
gbnf-validator	grammars: 1.5x faster inference w/ complex grammars (vector reserves / reuses) (#6609)	2024-04-11 19:47:34 +01:00
gguf	gguf : add option to not check tensor data (#6582)	2024-04-10 21:16:48 +03:00
gguf-split	Fix --split-max-size (#6655)	2024-04-14 13:12:59 +02:00
gritlm	gritlm : add --outdir option to hf.sh script (#6699)	2024-04-16 09:34:06 +03:00
imatrix	imatrix : remove invalid assert (#6632)	2024-04-12 11:49:58 +03:00
infill	infill : add download instructions for model (#6626)	2024-04-12 15:11:46 +03:00
jeopardy
llama-bench	ggml : add llamafile sgemm (#6414)	2024-04-16 21:55:30 +03:00
llama.android
llama.swiftui
llava	chore: Fix markdown warnings (#6625)	2024-04-12 10:52:36 +02:00
lookahead	BERT tokenizer fixes (#6498)	2024-04-09 13:44:08 -04:00
lookup	BERT tokenizer fixes (#6498)	2024-04-09 13:44:08 -04:00
main	`main`: add --json-schema / -j flag (#6659)	2024-04-15 18:35:21 +01:00
main-cmake-pkg
parallel
passkey
perplexity	perplexity : require positive --ctx-size arg (#6695)	2024-04-16 09:28:33 +03:00
quantize	chore: Fix markdown warnings (#6625)	2024-04-12 10:52:36 +02:00
quantize-stats
retrieval
save-load-state	llama : save and restore kv cache for single seq id (#6341)	2024-04-08 15:43:30 +03:00
server	`main`: add --json-schema / -j flag (#6659)	2024-04-15 18:35:21 +01:00
simple
speculative	BERT tokenizer fixes (#6498)	2024-04-09 13:44:08 -04:00
sycl	fix memcpy() crash, add missed cmd in guide, fix softmax (#6622)	2024-04-14 10:42:29 +08:00
tokenize	BERT tokenizer fixes (#6498)	2024-04-09 13:44:08 -04:00
train-text-from-scratch
alpaca.sh
base-translate.sh
chat-13B.bat
chat-13B.sh
chat-persistent.sh
chat-vicuna.sh
chat.sh
CMakeLists.txt	eval-callback: Example how to use eval callback for debugging (#6576)	2024-04-11 14:51:07 +02:00
gpt4all.sh
json-schema-pydantic-example.py
json_schema_to_grammar.py	JSON schema conversion: ⚡️ faster repetitions, min/maxLength for strings, cap number length (#6555)	2024-04-12 19:43:38 +01:00
llama.vim
llama2-13b.sh
llama2.sh
llm.vim
make-ggml.py
Miku.sh
pydantic-models-to-grammar-examples.py
pydantic_models_to_grammar.py
reason-act.sh
regex-to-grammar.py	JSON schema conversion: ⚡️ faster repetitions, min/maxLength for strings, cap number length (#6555)	2024-04-12 19:43:38 +01:00
server-embd.py
server-llama2-13B.sh
ts-type-to-grammar.sh	JSON schema conversion: ⚡️ faster repetitions, min/maxLength for strings, cap number length (#6555)	2024-04-12 19:43:38 +01:00