# llama.cpp

![llama](https://user-images.githubusercontent.com/1991296/230134379-7181e485-c521-4d23-a0d6-f7b3b61ba524.png)

[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MIT)
[![Release](https://img.shields.io/github/v/release/ggml-org/llama.cpp)](https://github.com/ggml-org/llama.cpp/releases)
[![Server](https://github.com/ggml-org/llama.cpp/actions/workflows/server.yml/badge.svg)](https://github.com/ggml-org/llama.cpp/actions/workflows/server.yml)

[Manifesto](https://github.com/ggml-org/llama.cpp/discussions/205) / [ggml](https://github.com/ggml-org/ggml) / [ops](https://github.com/ggml-org/llama.cpp/blob/master/docs/ops.md)

LLM inference in C/C++

## Recent API changes

- [Changelog for `libllama` API](https://github.com/ggml-org/llama.cpp/issues/9289)
- [Changelog for `llama-server` REST API](https://github.com/ggml-org/llama.cpp/issues/9291)

## Hot topics

- **Hugging Face cache migration: models downloaded with `-hf` are now stored in the standard Hugging Face cache directory, enabling sharing with other HF tools.**
- **[guide : using the new WebUI of llama.cpp](https://github.com/ggml-org/llama.cpp/discussions/16938)**
- [guide : running gpt-oss with llama.cpp](https://github.com/ggml-org/llama.cpp/discussions/15396)
- [[FEEDBACK] Better packaging for llama.cpp to support downstream consumers 🤗](https://github.com/ggml-org/llama.cpp/discussions/15313)
- Support for the `gpt-oss` model with native MXFP4 format has been added | [PR](https://github.com/ggml-org/llama.cpp/pull/15091) | [Collaboration with NVIDIA](https://blogs.nvidia.com/blog/rtx-ai-garage-openai-oss) | [Comment](https://github.com/ggml-org/llama.cpp/discussions/15095)
- Multimodal support arrived in `llama-server`: [#12898](https://github.com/ggml-org/llama.cpp/pull/12898) | [documentation](./docs/multimodal.md)
- VS Code extension for FIM completions: https://github.com/ggml-org/llama.vscode
- Vim/Neovim plugin for FIM completions: https://github.com/ggml-org/llama.vim
- Hugging Face Inference Endpoints now support GGUF out of the box! https://github.com/ggml-org/llama.cpp/discussions/9669
- Hugging Face GGUF editor: [discussion](https://github.com/ggml-org/llama.cpp/discussions/9268) | [tool](https://huggingface.co/spaces/CISCai/gguf-editor)

----

## Quick start

Getting started with llama.cpp is straightforward. Here are several ways to install it on your machine:

- Install `llama.cpp` using [brew, nix or winget](docs/install.md)
- Run with Docker - see our [Docker documentation](docs/docker.md)
- Download pre-built binaries from the [releases page](https://github.com/ggml-org/llama.cpp/releases)
- Build from source by cloning this repository - check out [our build guide](docs/build.md)

Once installed, you'll need a model to work with. Head to the [Obtaining and quantizing models](#obtaining-and-quantizing-models) section to learn more.

Example command:

```sh
# Use a local model file
llama-cli -m my_model.gguf

# Or download and run a model directly from Hugging Face
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

# Launch OpenAI-compatible API server
llama-server -hf ggml-org/gemma-3-1b-it-GGUF
```
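
Once `llama-server` is running, any OpenAI-style client can talk to it. The following is a minimal sketch using only the Python standard library. It assumes the server is reachable at `http://localhost:8080` and that the request body needs nothing beyond a `messages` list; the helper names `build_chat_request` and `send_chat_request` are hypothetical and not part of `llama.cpp`.

```python
import json
import urllib.request

# Assumed address: adjust it to wherever your llama-server instance listens.
BASE_URL = "http://localhost:8080"


def build_chat_request(prompt, base_url=BASE_URL):
    """Build (but do not send) a POST request for the chat completion endpoint."""
    payload = {
        "messages": [
            {"role": "user", "content": prompt},
        ],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # OpenAI-style responses carry the reply in choices[0].message.content.
    return body["choices"][0]["message"]["content"]


request = build_chat_request("hi, who are you?")
print(request.get_method(), request.full_url)
```

With a server running, `send_chat_request(request)` should return the model's reply as a string. Building and sending are kept separate so the request can be inspected without a live server.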

## Description

The main goal of `llama.cpp` is to enable LLM inference with minimal setup and state-of-the-art performance on a wide
range of hardware - locally and in the cloud.

- Plain C/C++ implementation without any dependencies
- Apple silicon is a first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks
- AVX, AVX2, AVX512 and AMX support for x86 architectures
- RVV, ZVFH, ZFH, ZICBOP and ZIHINTPAUSE support for RISC-V architectures
- 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use
- Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP and Moore Threads GPUs via MUSA)
- Vulkan and SYCL backend support
- CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity

The `llama.cpp` project is the main playground for developing new features for the [ggml](https://github.com/ggml-org/ggml) library.

<details>
<summary>Models</summary>

Typically finetunes of the base models below are supported as well.

Instructions for adding support for new models: [HOWTO-add-model.md](docs/development/HOWTO-add-model.md)

#### Text-only

- [X] LLaMA 🦙
- [x] LLaMA 2 🦙🦙
- [x] LLaMA 3 🦙🦙🦙
- [X] [Mistral 7B](https://huggingface.co/mistralai/Mistral-7B-v0.1)
- [x] [Mixtral MoE](https://huggingface.co/models?search=mistral-ai/Mixtral)
- [x] [DBRX](https://huggingface.co/databricks/dbrx-instruct)
- [x] [Jamba](https://huggingface.co/ai21labs)
- [X] [Falcon](https://huggingface.co/models?search=tiiuae/falcon)
- [X] [Chinese LLaMA / Alpaca](https://github.com/ymcui/Chinese-LLaMA-Alpaca) and [Chinese LLaMA-2 / Alpaca-2](https://github.com/ymcui/Chinese-LLaMA-Alpaca-2)
- [X] [Vigogne (French)](https://github.com/bofenghuang/vigogne)
- [X] [BERT](https://github.com/ggml-org/llama.cpp/pull/5423)
- [X] [Koala](https://bair.berkeley.edu/blog/2023/04/03/koala/)
- [X] [Baichuan 1 & 2](https://huggingface.co/models?search=baichuan-inc/Baichuan) + [derivations](https://huggingface.co/hiyouga/baichuan-7b-sft)
- [X] [Aquila 1 & 2](https://huggingface.co/models?search=BAAI/Aquila)
- [X] [Starcoder models](https://github.com/ggml-org/llama.cpp/pull/3187)
- [X] [Refact](https://huggingface.co/smallcloudai/Refact-1_6B-fim)
- [X] [MPT](https://github.com/ggml-org/llama.cpp/pull/3417)
- [X] [Bloom](https://github.com/ggml-org/llama.cpp/pull/3553)
- [x] [Yi models](https://huggingface.co/models?search=01-ai/Yi)
- [X] [StableLM models](https://huggingface.co/stabilityai)
- [x] [Deepseek models](https://huggingface.co/models?search=deepseek-ai/deepseek)
- [x] [Qwen models](https://huggingface.co/models?search=Qwen/Qwen)
- [x] [PLaMo-13B](https://github.com/ggml-org/llama.cpp/pull/3557)
- [x] [Phi models](https://huggingface.co/models?search=microsoft/phi)
- [x] [PhiMoE](https://github.com/ggml-org/llama.cpp/pull/11003)
- [x] [GPT-2](https://huggingface.co/gpt2)
- [x] [Orion 14B](https://github.com/ggml-org/llama.cpp/pull/5118)
- [x] [InternLM2](https://huggingface.co/models?search=internlm2)
- [x] [CodeShell](https://github.com/WisdomShell/codeshell)
- [x] [Gemma](https://ai.google.dev/gemma)
- [x] [Mamba](https://github.com/state-spaces/mamba)
- [x] [Grok-1](https://huggingface.co/keyfan/grok-1-hf)
- [x] [Xverse](https://huggingface.co/models?search=xverse)
- [x] [Command-R models](https://huggingface.co/models?search=CohereForAI/c4ai-command-r)
- [x] [SEA-LION](https://huggingface.co/models?search=sea-lion)
- [x] [GritLM-7B](https://huggingface.co/GritLM/GritLM-7B) + [GritLM-8x7B](https://huggingface.co/GritLM/GritLM-8x7B)
- [x] [OLMo](https://allenai.org/olmo)
- [x] [OLMo 2](https://allenai.org/olmo)
- [x] [OLMoE](https://huggingface.co/allenai/OLMoE-1B-7B-0924)
- [x] [Granite models](https://huggingface.co/collections/ibm-granite/granite-code-models-6624c5cec322e4c148c8b330)
- [x] [GPT-NeoX](https://github.com/EleutherAI/gpt-neox) + [Pythia](https://github.com/EleutherAI/pythia)
- [x] [Snowflake-Arctic MoE](https://huggingface.co/collections/Snowflake/arctic-66290090abe542894a5ac520)
- [x] [Smaug](https://huggingface.co/models?search=Smaug)
- [x] [Poro 34B](https://huggingface.co/LumiOpen/Poro-34B)
- [x] [Bitnet b1.58 models](https://huggingface.co/1bitLLM)
- [x] [Flan T5](https://huggingface.co/models?search=flan-t5)
- [x] [Open Elm models](https://huggingface.co/collections/apple/openelm-instruct-models-6619ad295d7ae9f868b759ca)
- [x] [ChatGLM3-6b](https://huggingface.co/THUDM/chatglm3-6b) + [ChatGLM4-9b](https://huggingface.co/THUDM/glm-4-9b) + [GLMEdge-1.5b](https://huggingface.co/THUDM/glm-edge-1.5b-chat) + [GLMEdge-4b](https://huggingface.co/THUDM/glm-edge-4b-chat)
- [x] [GLM-4-0414](https://huggingface.co/collections/THUDM/glm-4-0414-67f3cbcb34dd9d252707cb2e)
- [x] [SmolLM](https://huggingface.co/collections/HuggingFaceTB/smollm-6695016cad7167254ce15966)
- [x] [EXAONE-3.0-7.8B-Instruct](https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct)
- [x] [FalconMamba Models](https://huggingface.co/collections/tiiuae/falconmamba-7b-66b9a580324dd1598b0f6d4a)
- [x] [Jais](https://huggingface.co/inceptionai/jais-13b-chat)
- [x] [Bielik-11B-v2.3](https://huggingface.co/collections/speakleash/bielik-11b-v23-66ee813238d9b526a072408a)
- [x] [RWKV-7](https://huggingface.co/collections/shoumenchougou/rwkv7-gxx-gguf)
- [x] [RWKV-6](https://github.com/BlinkDL/RWKV-LM)
- [x] [QRWKV-6](https://huggingface.co/recursal/QRWKV6-32B-Instruct-Preview-v0.1)
- [x] [GigaChat-20B-A3B](https://huggingface.co/ai-sage/GigaChat-20B-A3B-instruct)
- [X] [Trillion-7B-preview](https://huggingface.co/trillionlabs/Trillion-7B-preview)
- [x] [Ling models](https://huggingface.co/collections/inclusionAI/ling-67c51c85b34a7ea0aba94c32)
- [x] [LFM2 models](https://huggingface.co/collections/LiquidAI/lfm2-686d721927015b2ad73eaa38)
- [x] [Hunyuan models](https://huggingface.co/collections/tencent/hunyuan-dense-model-6890632cda26b19119c9c5e7)
- [x] [BailingMoeV2 (Ring/Ling 2.0) models](https://huggingface.co/collections/inclusionAI/ling-v2-68bf1dd2fc34c306c1fa6f86)

#### Multimodal

- [x] [LLaVA 1.5 models](https://huggingface.co/collections/liuhaotian/llava-15-653aac15d994e992e2677a7e), [LLaVA 1.6 models](https://huggingface.co/collections/liuhaotian/llava-16-65b9e40155f60fd046a5ccf2)
- [x] [BakLLaVA](https://huggingface.co/models?search=SkunkworksAI/Bakllava)
- [x] [Obsidian](https://huggingface.co/NousResearch/Obsidian-3B-V0.5)
- [x] [ShareGPT4V](https://huggingface.co/models?search=Lin-Chen/ShareGPT4V)
- [x] [MobileVLM 1.7B/3B models](https://huggingface.co/models?search=mobileVLM)
- [x] [Yi-VL](https://huggingface.co/models?search=Yi-VL)
- [x] [Mini CPM](https://huggingface.co/models?search=MiniCPM)
- [x] [Moondream](https://huggingface.co/vikhyatk/moondream2)
- [x] [Bunny](https://github.com/BAAI-DCAI/Bunny)
- [x] [GLM-EDGE](https://huggingface.co/models?search=glm-edge)
- [x] [Qwen2-VL](https://huggingface.co/collections/Qwen/qwen2-vl-66cee7455501d7126940800d)
- [x] [LFM2-VL](https://huggingface.co/collections/LiquidAI/lfm2-vl-68963bbc84a610f7638d5ffa)

</details>

<details>
<summary>Bindings</summary>

- Python: [ddh0/easy-llama](https://github.com/ddh0/easy-llama)
- Python: [abetlen/llama-cpp-python](https://github.com/abetlen/llama-cpp-python)
- Go: [go-skynet/go-llama.cpp](https://github.com/go-skynet/go-llama.cpp)
- Node.js: [withcatai/node-llama-cpp](https://github.com/withcatai/node-llama-cpp)
- JS/TS (llama.cpp server client): [lgrammel/modelfusion](https://modelfusion.dev/integration/model-provider/llamacpp)
- JS/TS (Programmable Prompt Engine CLI): [offline-ai/cli](https://github.com/offline-ai/cli)
- JavaScript/Wasm (works in browser): [tangledgroup/llama-cpp-wasm](https://github.com/tangledgroup/llama-cpp-wasm)
- Typescript/Wasm (nicer API, available on npm): [ngxson/wllama](https://github.com/ngxson/wllama)
- Ruby: [yoshoku/llama_cpp.rb](https://github.com/yoshoku/llama_cpp.rb)
- Rust (more features): [edgenai/llama_cpp-rs](https://github.com/edgenai/llama_cpp-rs)
- Rust (nicer API): [mdrokz/rust-llama.cpp](https://github.com/mdrokz/rust-llama.cpp)
- Rust (more direct bindings): [utilityai/llama-cpp-rs](https://github.com/utilityai/llama-cpp-rs)
- Rust (automated build from crates.io): [ShelbyJenkins/llm_client](https://github.com/ShelbyJenkins/llm_client)
- C#/.NET: [SciSharp/LLamaSharp](https://github.com/SciSharp/LLamaSharp)
- C#/VB.NET (more features - community license): [LM-Kit.NET](https://docs.lm-kit.com/lm-kit-net/index.html)
- Scala 3: [donderom/llm4s](https://github.com/donderom/llm4s)
- Clojure: [phronmophobic/llama.clj](https://github.com/phronmophobic/llama.clj)
- React Native: [mybigday/llama.rn](https://github.com/mybigday/llama.rn)
- Java: [kherud/java-llama.cpp](https://github.com/kherud/java-llama.cpp)
- Java: [QuasarByte/llama-cpp-jna](https://github.com/QuasarByte/llama-cpp-jna)
- Zig: [deins/llama.cpp.zig](https://github.com/Deins/llama.cpp.zig)
- Flutter/Dart: [netdur/llama_cpp_dart](https://github.com/netdur/llama_cpp_dart)
- Flutter: [xuegao-tzx/Fllama](https://github.com/xuegao-tzx/Fllama)
- PHP (API bindings and features built on top of llama.cpp): [distantmagic/resonance](https://github.com/distantmagic/resonance) [(more info)](https://github.com/ggml-org/llama.cpp/pull/6326)
- Guile Scheme: [guile_llama_cpp](https://savannah.nongnu.org/projects/guile-llama-cpp)
- Swift: [srgtuszy/llama-cpp-swift](https://github.com/srgtuszy/llama-cpp-swift)
- Swift: [ShenghaiWang/SwiftLlama](https://github.com/ShenghaiWang/SwiftLlama)
- Delphi: [Embarcadero/llama-cpp-delphi](https://github.com/Embarcadero/llama-cpp-delphi)
- Go (no CGo needed): [hybridgroup/yzma](https://github.com/hybridgroup/yzma)
- Android: [llama.android](/examples/llama.android)

</details>

<details>
<summary>UIs</summary>

*(to have a project listed here, it should clearly state that it depends on `llama.cpp`)*

- [AI Sublime Text plugin](https://github.com/yaroslavyaroslav/OpenAI-sublime-text) (MIT)
- [BonzAI App](https://apps.apple.com/us/app/bonzai-your-local-ai-agent/id6752847988) (proprietary)
- [cztomsik/ava](https://github.com/cztomsik/ava) (MIT)
- [Dot](https://github.com/alexpinel/Dot) (GPL)
- [eva](https://github.com/ylsdamxssjxxdd/eva) (MIT)
- [iohub/collama](https://github.com/iohub/coLLaMA) (Apache-2.0)
- [janhq/jan](https://github.com/janhq/jan) (AGPL)
- [johnbean393/Sidekick](https://github.com/johnbean393/Sidekick) (MIT)
- [KanTV](https://github.com/zhouwg/kantv?tab=readme-ov-file) (Apache-2.0)
- [KodiBot](https://github.com/firatkiral/kodibot) (GPL)
- [llama.vim](https://github.com/ggml-org/llama.vim) (MIT)
- [LARS](https://github.com/abgulati/LARS) (AGPL)
- [Llama Assistant](https://github.com/vietanhdev/llama-assistant) (GPL)
- [LlamaLib](https://github.com/undreamai/LlamaLib) (Apache-2.0)
- [LLMFarm](https://github.com/guinmoon/LLMFarm?tab=readme-ov-file) (MIT)
- [LLMUnity](https://github.com/undreamai/LLMUnity) (MIT)
- [LMStudio](https://lmstudio.ai/) (proprietary)
- [LocalAI](https://github.com/mudler/LocalAI) (MIT)
- [LostRuins/koboldcpp](https://github.com/LostRuins/koboldcpp) (AGPL)
- [MindMac](https://mindmac.app) (proprietary)
- [MindWorkAI/AI-Studio](https://github.com/MindWorkAI/AI-Studio) (FSL-1.1-MIT)
- [Mobile-Artificial-Intelligence/maid](https://github.com/Mobile-Artificial-Intelligence/maid) (MIT)
- [Mozilla-Ocho/llamafile](https://github.com/Mozilla-Ocho/llamafile) (Apache-2.0)
- [nat/openplayground](https://github.com/nat/openplayground) (MIT)
- [nomic-ai/gpt4all](https://github.com/nomic-ai/gpt4all) (MIT)
- [ollama/ollama](https://github.com/ollama/ollama) (MIT)
- [oobabooga/text-generation-webui](https://github.com/oobabooga/text-generation-webui) (AGPL)
- [PocketPal AI](https://github.com/a-ghorbani/pocketpal-ai) (MIT)
- [psugihara/FreeChat](https://github.com/psugihara/FreeChat) (MIT)
- [ptsochantaris/emeltal](https://github.com/ptsochantaris/emeltal) (MIT)
- [pythops/tenere](https://github.com/pythops/tenere) (AGPL)
- [ramalama](https://github.com/containers/ramalama) (MIT)
- [semperai/amica](https://github.com/semperai/amica) (MIT)
- [withcatai/catai](https://github.com/withcatai/catai) (MIT)
- [Autopen](https://github.com/blackhole89/autopen) (GPL)

</details>

<details>
<summary>Tools</summary>

- [akx/ggify](https://github.com/akx/ggify) – download PyTorch models from Hugging Face Hub and convert them to GGML
- [akx/ollama-dl](https://github.com/akx/ollama-dl) – download models from the Ollama library to be used directly with llama.cpp
- [crashr/gppm](https://github.com/crashr/gppm) – launch llama.cpp instances utilizing NVIDIA Tesla P40 or P100 GPUs with reduced idle power consumption
- [gpustack/gguf-parser](https://github.com/gpustack/gguf-parser-go/tree/main/cmd/gguf-parser) - review/check the GGUF file and estimate the memory usage
- [Styled Lines](https://marketplace.unity.com/packages/tools/generative-ai/styled-lines-llama-cpp-model-292902) (proprietary licensed, async wrapper of inference part for game development in Unity3d with pre-built Mobile and Web platform wrappers and a model example)
- [unslothai/unsloth](https://github.com/unslothai/unsloth) – 🦥 exports/saves fine-tuned and trained models to GGUF (Apache-2.0)

</details>

<details>
<summary>Infrastructure</summary>

- [Paddler](https://github.com/intentee/paddler) - Open-source LLMOps platform for hosting and scaling AI in your own infrastructure
- [GPUStack](https://github.com/gpustack/gpustack) - Manage GPU clusters for running LLMs
- [llama_cpp_canister](https://github.com/onicai/llama_cpp_canister) - llama.cpp as a smart contract on the Internet Computer, using WebAssembly
- [llama-swap](https://github.com/mostlygeek/llama-swap) - transparent proxy that adds automatic model switching with llama-server
- [Kalavai](https://github.com/kalavai-net/kalavai-client) - Crowdsource end to end LLM deployment at any scale
- [llmaz](https://github.com/InftyAI/llmaz) - ☸️ Easy, advanced inference platform for large language models on Kubernetes.
- [LLMKube](https://github.com/defilantech/llmkube) - Kubernetes operator for llama.cpp with multi-GPU and Apple Silicon Metal support

</details>

<details>
<summary>Games</summary>

- [Lucy's Labyrinth](https://github.com/MorganRO8/Lucys_Labyrinth) - A simple maze game where agents controlled by an AI model will try to trick you.

</details>


## Supported backends

| Backend | Target devices |
| --- | --- |
| [Metal](docs/build.md#metal-build) | Apple Silicon |
| [BLAS](docs/build.md#blas-build) | All |
| [BLIS](docs/backend/BLIS.md) | All |
| [SYCL](docs/backend/SYCL.md) | Intel and Nvidia GPU |
| [OpenVINO [In Progress]](docs/backend/OPENVINO.md) | Intel CPUs, GPUs, and NPUs |
| [MUSA](docs/build.md#musa) | Moore Threads GPU |
| [CUDA](docs/build.md#cuda) | Nvidia GPU |
| [HIP](docs/build.md#hip) | AMD GPU |
| [ZenDNN](docs/build.md#zendnn) | AMD CPU |
| [Vulkan](docs/build.md#vulkan) | GPU |
| [CANN](docs/build.md#cann) | Ascend NPU |
| [OpenCL](docs/backend/OPENCL.md) | Adreno GPU |
| [IBM zDNN](docs/backend/zDNN.md) | IBM Z & LinuxONE |
| [WebGPU [In Progress]](docs/build.md#webgpu) | All |
| [RPC](https://github.com/ggml-org/llama.cpp/tree/master/tools/rpc) | All |
| [Hexagon [In Progress]](docs/backend/snapdragon/README.md) | Snapdragon |
| [VirtGPU](docs/backend/VirtGPU.md) | VirtGPU APIR |

## Obtaining and quantizing models

The [Hugging Face](https://huggingface.co) platform hosts a [number of LLMs](https://huggingface.co/models?library=gguf&sort=trending) compatible with `llama.cpp`:

- [Trending](https://huggingface.co/models?library=gguf&sort=trending)
- [LLaMA](https://huggingface.co/models?sort=trending&search=llama+gguf)

You can either download a GGUF file manually or fetch any `llama.cpp`-compatible model directly from [Hugging Face](https://huggingface.co/) or another model hosting site with the CLI argument `-hf <user>/<model>[:quant]`. For example:

```sh
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF
```
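
To make the `<user>/<model>[:quant]` format concrete, here is a small Python sketch that splits such a string into its parts. It only illustrates the argument's shape and is not how `llama.cpp` parses it; `HfModelSpec`, `parse_hf_spec` and the `Q4_K_M` tag in the second call are illustrative assumptions.

```python
from typing import NamedTuple, Optional


class HfModelSpec(NamedTuple):
    user: str
    model: str
    quant: Optional[str]  # None when no ":quant" suffix is given


def parse_hf_spec(spec: str) -> HfModelSpec:
    """Split '<user>/<model>[:quant]' into its parts."""
    repo, _, quant = spec.partition(":")
    user, separator, model = repo.partition("/")
    if not separator or not user or not model:
        raise ValueError(f"expected <user>/<model>[:quant], got {spec!r}")
    return HfModelSpec(user, model, quant or None)


print(parse_hf_spec("ggml-org/gemma-3-1b-it-GGUF"))
print(parse_hf_spec("ggml-org/gemma-3-1b-it-GGUF:Q4_K_M"))
```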

By default, the CLI downloads from Hugging Face. To use a different source, set the `MODEL_ENDPOINT` environment variable, which must point to a Hugging Face-compatible API endpoint.

After downloading a model, use the CLI tools to run it locally - see below.

`llama.cpp` requires the model to be stored in the [GGUF](https://github.com/ggml-org/ggml/blob/master/docs/gguf.md) file format. Models in other data formats can be converted to GGUF using the `convert_*.py` Python scripts in this repo.

The Hugging Face platform provides a variety of online tools for converting, quantizing and hosting models with `llama.cpp`:

- Use the [GGUF-my-repo space](https://huggingface.co/spaces/ggml-org/gguf-my-repo) to convert to GGUF format and quantize model weights to smaller sizes
- Use the [GGUF-my-LoRA space](https://huggingface.co/spaces/ggml-org/gguf-my-lora) to convert LoRA adapters to GGUF format (more info: https://github.com/ggml-org/llama.cpp/discussions/10123)
- Use the [GGUF-editor space](https://huggingface.co/spaces/CISCai/gguf-editor) to edit GGUF metadata in the browser (more info: https://github.com/ggml-org/llama.cpp/discussions/9268)
- Use the [Inference Endpoints](https://ui.endpoints.huggingface.co/) to directly host `llama.cpp` in the cloud (more info: https://github.com/ggml-org/llama.cpp/discussions/9669)

To learn more about model quantization, [read this documentation](tools/quantize/README.md).
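
For intuition about what integer quantization does, the sketch below maps a block of floats to 8-bit integers that share one scale factor, then maps them back. It is a deliberately simplified illustration and does not reproduce the actual GGUF quantization formats; the function names and the sample weights are made up for the example.

```python
def quantize_block(values):
    """Map a block of floats to 8-bit integers sharing one scale factor."""
    amax = max(abs(v) for v in values)
    # An all-zero block needs no scaling; use 1.0 to avoid dividing by zero.
    scale = amax / 127.0 if amax > 0 else 1.0
    quants = [round(v / scale) for v in values]
    return scale, quants


def dequantize_block(scale, quants):
    """Recover approximate floats from the integers and the shared scale."""
    return [q * scale for q in quants]


# Made-up weights, used only to show the round trip.
weights = [0.12, -0.98, 0.45, 0.0, 0.77, -0.31]
scale, quants = quantize_block(weights)
restored = dequantize_block(scale, quants)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(quants)
print(max_error)
```

Each value is stored as a small integer instead of a float, at the cost of a rounding error of at most half the scale per value.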

## [`llama-cli`](tools/cli)

#### A CLI tool for accessing and experimenting with most of `llama.cpp`'s functionality.

- <details open>
    <summary>Run in conversation mode</summary>

    Models with a built-in chat template automatically activate conversation mode. If this doesn't happen, enable it manually by adding `-cnv` and specifying a suitable chat template with `--chat-template NAME`.

    ```bash
    llama-cli -m model.gguf

    # > hi, who are you?
    # Hi there! I'm your helpful assistant! I'm an AI-powered chatbot designed to assist and provide information to users like you. I'm here to help answer your questions, provide guidance, and offer support on a wide range of topics. I'm a friendly and knowledgeable AI, and I'm always happy to help with anything you need. What's on your mind, and how can I assist you today?
    #
    # > what is 1+1?
    # Easy peasy! The answer to 1+1 is... 2!
    ```

    </details>

- <details>
    <summary>Run in conversation mode with custom chat template</summary>

    ```bash
    # use the "chatml" template (use -h to see the list of supported templates)
    llama-cli -m model.gguf -cnv --chat-template chatml

    # use a custom template
    llama-cli -m model.gguf -cnv --in-prefix 'User: ' --reverse-prompt 'User:'
    ```

    </details>

- <details>
    <summary>Constrain the output with a custom grammar</summary>

    ```bash
    llama-cli -m model.gguf -n 256 --grammar-file grammars/json.gbnf -p 'Request: schedule a call at 8pm; Command:'

    # {"appointmentTime": "8pm", "appointmentDetails": "schedule a call"}
    ```

    The [grammars/](grammars/) folder contains a handful of sample grammars. To write your own, check out the [GBNF Guide](grammars/README.md).

    For authoring more complex JSON grammars, check out https://grammar.intrinsiclabs.ai/

    </details>
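
A grammar constrains the shape of the output, not its meaning, and generation can still stop early at the `-n` token limit. Code that consumes the result may therefore want to check it. The sketch below parses output shaped like the sample in the grammar example and verifies that the expected keys are present; `parse_constrained_output` is a hypothetical helper, and the key names are taken from that sample.

```python
import json


def parse_constrained_output(text, required_keys):
    """Parse model output as a JSON object and check that required keys exist."""
    data = json.loads(text)  # raises json.JSONDecodeError on malformed output
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in required_keys if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data


# Output shaped like the sample from the grammar example.
sample = '{"appointmentTime": "8pm", "appointmentDetails": "schedule a call"}'
command = parse_constrained_output(sample, ["appointmentTime", "appointmentDetails"])
print(command["appointmentTime"])
```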


## [`llama-server`](tools/server)

#### A lightweight, [OpenAI API](https://github.com/openai/openai-openapi) compatible, HTTP server for serving LLMs.

- <details open>
    <summary>Start a local HTTP server with default configuration on port 8080</summary>

    ```bash
    llama-server -m model.gguf --port 8080

    # Basic web UI can be accessed via browser: http://localhost:8080
    # Chat completion endpoint: http://localhost:8080/v1/chat/completions
    ```

    </details>

- <details>
    <summary>Support multiple users and parallel decoding</summary>

    ```bash
    # up to 4 concurrent requests, each with 4096 max context
    llama-server -m model.gguf -c 16384 -np 4
    ```

    </details>

- <details>
    <summary>Enable speculative decoding</summary>

    ```bash
    # the draft.gguf model should be a small variant of the target model.gguf
    llama-server -m model.gguf -md draft.gguf
    ```

    </details>

- <details>
    <summary>Serve an embedding model</summary>

    ```bash
    # use the /embedding endpoint
    llama-server -m model.gguf --embedding --pooling cls -ub 8192
    ```

    </details>

- <details>
    <summary>Serve a reranking model</summary>

    ```bash
    # use the /reranking endpoint
    llama-server -m model.gguf --reranking
    ```

    </details>

- <details>
    <summary>Constrain all outputs with a grammar</summary>

    ```bash
    # custom grammar
    llama-server -m model.gguf --grammar-file grammar.gbnf

    # JSON
    llama-server -m model.gguf --grammar-file grammars/json.gbnf
    ```

    </details>
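
The comment in the parallel decoding example implies that the context size given with `-c` is shared among the `-np` slots. Assuming an even split, which is inferred from that example and may not hold for every version or configuration, the arithmetic can be sketched in Python as follows. Both function names are hypothetical.

```python
def context_per_slot(total_context, parallel_slots):
    """Context window available to each slot when the total is split evenly."""
    if parallel_slots < 1:
        raise ValueError("parallel_slots must be at least 1")
    return total_context // parallel_slots


def total_context_for(per_slot_context, parallel_slots):
    """Value to pass as the total context for a desired per-slot window."""
    if parallel_slots < 1:
        raise ValueError("parallel_slots must be at least 1")
    return per_slot_context * parallel_slots


print(context_per_slot(16384, 4))   # prints 4096
print(total_context_for(4096, 4))   # prints 16384
```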


## [`llama-perplexity`](tools/perplexity)

#### A tool for measuring the [perplexity](tools/perplexity/README.md) [^1] (and other quality metrics) of a model over a given text.

- <details open>
    <summary>Measure the perplexity over a text file</summary>

    ```bash
    llama-perplexity -m model.gguf -f file.txt

    # [1]15.2701,[2]5.4007,[3]5.3073,[4]6.2965,[5]5.8940,[6]5.6096,[7]5.7942,[8]4.9297, ...
    # Final estimate: PPL = 5.4007 +/- 0.67339
    ```

    </details>

- <details>
    <summary>Measure KL divergence</summary>

    ```bash
    # TODO
    ```

    </details>
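
As a reminder of what the metric means, perplexity is the exponential of the average negative log-likelihood per token. The sketch below shows that textbook definition in Python; it is not the code `llama-perplexity` runs, and the log-probabilities are hypothetical values chosen so the result is easy to check by hand.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Hypothetical natural-log probabilities the model assigned to four tokens.
logprobs = [math.log(0.25), math.log(0.5), math.log(0.125), math.log(0.25)]
print(perplexity(logprobs))
```

The four probabilities have a geometric mean of 1/4, so the result is approximately 4: on average, the model was as uncertain as if it were choosing uniformly among four tokens. Lower values indicate a better fit to the text.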

[^1]: [https://huggingface.co/docs/transformers/perplexity](https://huggingface.co/docs/transformers/perplexity)

## [`llama-bench`](tools/llama-bench)

#### Benchmark the performance of the inference for various parameters.

- <details open>
    <summary>Run default benchmark</summary>

    ```bash
    llama-bench -m model.gguf

    # Output:
    # | model               |       size |     params | backend    | threads |          test |                  t/s |
    # | ------------------- | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
    # | qwen2 1.5B Q4_0     | 885.97 MiB |     1.54 B | Metal,BLAS |      16 |         pp512 |      5765.41 ± 20.55 |
    # | qwen2 1.5B Q4_0     | 885.97 MiB |     1.54 B | Metal,BLAS |      16 |         tg128 |        197.71 ± 0.81 |
    #
    # build: 3e0ba0e60 (4229)
    ```

    </details>
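
If you want to compare runs programmatically, the Markdown table can be parsed with a few lines of Python. This is a sketch that assumes the column layout shown in the sample output above, including a `t/s` column formatted as `mean ± spread`; `parse_bench_table` is a hypothetical helper.

```python
SAMPLE = """
| model               |       size |     params | backend    | threads |          test |                  t/s |
| ------------------- | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| qwen2 1.5B Q4_0     | 885.97 MiB |     1.54 B | Metal,BLAS |      16 |         pp512 |      5765.41 ± 20.55 |
| qwen2 1.5B Q4_0     | 885.97 MiB |     1.54 B | Metal,BLAS |      16 |         tg128 |        197.71 ± 0.81 |
"""


def parse_bench_table(text):
    """Parse a llama-bench Markdown table into a list of row dictionaries."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip blank lines and anything outside the table
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells
        elif all(set(cell) <= set("-:") for cell in cells):
            continue  # separator row between header and data
        else:
            row = dict(zip(header, cells))
            mean, _, spread = row["t/s"].partition("±")
            row["tokens_per_second"] = float(mean)
            row["spread"] = float(spread) if spread.strip() else None
            rows.append(row)
    return rows


for row in parse_bench_table(SAMPLE):
    print(row["test"], row["tokens_per_second"])
```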

## [`llama-simple`](examples/simple)

#### A minimal example for implementing apps with `llama.cpp`. Useful for developers.

- <details>
    <summary>Basic text completion</summary>

    ```bash
    llama-simple -m model.gguf

    # Hello my name is Kaitlyn and I am a 16 year old girl. I am a junior in high school and I am currently taking a class called "The Art of
    ```

    </details>


## Contributing

- Contributors can open PRs
- Collaborators will be invited based on contributions
- Maintainers can push to branches in the `llama.cpp` repo and merge PRs into the `master` branch
- Any help with managing issues, PRs and projects is greatly appreciated!
- See [good first issues](https://github.com/ggml-org/llama.cpp/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) for tasks suitable for first contributions
- Read the [CONTRIBUTING.md](CONTRIBUTING.md) for more information
- Make sure to read this: [Inference at the edge](https://github.com/ggml-org/llama.cpp/discussions/205)
- A bit of backstory for those who are interested: [Changelog podcast](https://changelog.com/podcast/532)

## Other documentation

- [cli](tools/cli/README.md)
- [completion](tools/completion/README.md)
- [server](tools/server/README.md)
- [GBNF grammars](grammars/README.md)

#### Development documentation

- [How to build](docs/build.md)
- [Running on Docker](docs/docker.md)
- [Build on Android](docs/android.md)
- [Performance troubleshooting](docs/development/token_generation_performance_tips.md)
- [GGML tips & tricks](https://github.com/ggml-org/llama.cpp/wiki/GGML-Tips-&-Tricks)

#### Seminal papers and background on the models

If your issue is with model generation quality, then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT:
- LLaMA:
    - [Introducing LLaMA: A foundational, 65-billion-parameter large language model](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/)
    - [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
- GPT-3
    - [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)
- GPT-3.5 / InstructGPT / ChatGPT:
    - [Aligning language models to follow instructions](https://openai.com/research/instruction-following)
    - [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)

## XCFramework
The XCFramework is a precompiled version of the library for iOS, visionOS, tvOS,
and macOS. It can be used in Swift projects without the need to compile the
library from source. For example:
```swift
// swift-tools-version: 5.10
// The swift-tools-version declares the minimum version of Swift required to build this package.

import PackageDescription

let package = Package(
    name: "MyLlamaPackage",
    targets: [
        .executableTarget(
            name: "MyLlamaPackage",
            dependencies: [
                "LlamaFramework"
            ]),
        .binaryTarget(
            name: "LlamaFramework",
            url: "https://github.com/ggml-org/llama.cpp/releases/download/b5046/llama-b5046-xcframework.zip",
            checksum: "c19be78b5f00d8d29a25da41042cb7afa094cbf6280a225abe614b03b20029ab"
        )
    ]
)
```
The example above uses an intermediate build, `b5046`, of the library. To use a different version, change the URL and the checksum.
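
The checksum of a Swift Package Manager binary target is the SHA-256 hash of the archive. One way to compute it for a downloaded release archive is sketched below in Python; `sha256_of_file` is a hypothetical helper, and the demo hashes a small temporary file so the sketch runs without downloading anything.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a small temporary file standing in for the downloaded archive.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example archive bytes")
demo_digest = sha256_of_file(tmp.name)
os.remove(tmp.name)
print(demo_digest)
```

For a real update, call `sha256_of_file` on the downloaded `.zip` and paste the result into the `checksum:` field.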

## Completions
Command-line completion is available for some environments.

#### Bash Completion
```bash
$ build/bin/llama-cli --completion-bash > ~/.llama-completion.bash
$ source ~/.llama-completion.bash
```
Optionally this can be added to your `.bashrc` or `.bash_profile` to load it
automatically. For example:
```console
$ echo "source ~/.llama-completion.bash" >> ~/.bashrc
```

## Dependencies

- [yhirose/cpp-httplib](https://github.com/yhirose/cpp-httplib) - Single-header HTTP server, used by `llama-server` - MIT License
- [stb-image](https://github.com/nothings/stb) - Single-header image format decoder, used by multimodal subsystem - Public domain
- [nlohmann/json](https://github.com/nlohmann/json) - Single-header JSON library, used by various tools/examples - MIT License
- [miniaudio.h](https://github.com/mackron/miniaudio) - Single-header audio format decoder, used by multimodal subsystem - Public domain
- [subprocess.h](https://github.com/sheredom/subprocess.h) - Single-header process launching solution for C and C++ - Public domain

## Quick start

Getting started with llama.cpp is straightforward. Here are several ways to install it on your machine:

- Install `llama.cpp` using [brew, nix or winget](docs/install.md)
- Run with Docker - see our [Docker documentation](docs/docker.md)
- Download pre-built binaries from the [releases page](https://github.com/ggml-org/llama.cpp/releases)
- Build from source by cloning this repository - check out [our build guide](docs/build.md)

Once installed, you'll need a model to work with. Head to the [Obtaining and quantizing models](#obtaining-and-quantizing-models) section to learn more.

Example command:

```sh
# Use a local model file
llama-cli -m my_model.gguf

# Or download and run a model directly from Hugging Face
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

# Launch OpenAI-compatible API server
llama-server -hf ggml-org/gemma-3-1b-it-GGUF
```
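
Once `llama-server` is running, it can be queried over HTTP. The sketch below assumes the server listens on its default address (`http://127.0.0.1:8080`) and serves the OpenAI-style `/v1/chat/completions` route. `LLAMA_URL`, `chat_payload` and `ask` are hypothetical names introduced here for illustration, and the payload builder does not escape quotes inside the prompt.

```sh
# Hypothetical helper: build a minimal chat request body for one user prompt.
# Sketch only: quotes or backslashes in the prompt are not escaped.
chat_payload() {
  printf '{"messages":[{"role":"user","content":"%s"}]}' "$1"
}

# Hypothetical helper: POST the request to a running llama-server.
# LLAMA_URL is an assumed variable; it falls back to the default local address.
ask() {
  curl -s "${LLAMA_URL:-http://127.0.0.1:8080}/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "$(chat_payload "$1")"
}

# Usage (requires the server from the command above to be running):
#   ask "Write a haiku about llamas"
```

The response is a JSON document in the OpenAI chat-completion shape, so existing OpenAI-compatible clients can usually be pointed at the same base URL instead of using `curl`.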

## Description

The main goal of `llama.cpp` is to enable LLM inference with minimal setup and state-of-the-art performance on a wide
range of hardware - locally and in the cloud.

- Plain C/C++ implementation without any dependencies
- Apple silicon is a first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks
- AVX, AVX2, AVX512 and AMX support for x86 architectures
- RVV, ZVFH, ZFH, ZICBOP and ZIHINTPAUSE support for RISC-V architectures
- 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use
- Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP and Moore Threads GPUs via MUSA)
- Vulkan and SYCL backend support
- CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity
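
To make the memory effect of quantization concrete, here is a back-of-the-envelope estimate of weight storage at different bit widths. It is a rough sketch, not how `llama.cpp` computes sizes: `weights_mib` is a hypothetical helper and the 7-billion-parameter count is an illustrative assumption. Real GGUF files are somewhat larger, because quantized blocks carry scale metadata and some tensors stay at higher precision.

```sh
# Hypothetical helper: approximate weight memory in MiB.
#   $1 = number of parameters, $2 = bits per weight
# Ignores KV cache, activations, and quantization metadata.
weights_mib() {
  echo $(( $1 * $2 / 8 / 1024 / 1024 ))
}

weights_mib 7000000000 16   # 16-bit weights    -> 13351
weights_mib 7000000000 4    # 4-bit quantized   -> 3337
```

Under these assumptions the 4-bit estimate is about a quarter of the 16-bit one (roughly 3.3 GiB versus 13 GiB), which is why a quantized model can fit on hardware where the full-precision weights would not.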

The `llama.cpp` project is the main playground for developing new features for the [ggml](https://github.com/ggml-org/ggml) library.

<details>
<summary>Models</summary>

Typically finetunes of the base models below are supported as well.

Instructions for adding support for new models: [HOWTO-add-model.md](docs/development/HOWTO-add-model.md)

#### Text-only

- [X] LLaMA 🦙
- [x] LLaMA 2 🦙🦙
- [x] LLaMA 3 🦙🦙🦙
- [X] [Mistral 7B](https://huggingface.co/mistralai/Mistral-7B-v0.1)
- [x] [Mixtral MoE](https://huggingface.co/models?search=mistral-ai/Mixtral)
- [x] [DBRX](https://huggingface.co/databricks/dbrx-instruct)
- [x] [Jamba](https://huggingface.co/ai21labs)
- [X] [Falcon](https://huggingface.co/models?search=tiiuae/falcon)
- [X] [Chinese LLaMA / Alpaca](https://github.com/ymcui/Chinese-LLaMA-Alpaca) and [Chinese LLaMA-2 / Alpaca-2](https://github.com/ymcui/Chinese-LLaMA-Alpaca-2)
- [X] [Vigogne (French)](https://github.com/bofenghuang/vigogne)
- [X] [BERT](https://github.com/ggml-org/llama.cpp/pull/5423)
- [X] [Koala](https://bair.berkeley.edu/blog/2023/04/03/koala/)
- [X] [Baichuan 1 & 2](https://huggingface.co/models?search=baichuan-inc/Baichuan) + [derivations](https://huggingface.co/hiyouga/baichuan-7b-sft)
- [X] [Aquila 1 & 2](https://huggingface.co/models?search=BAAI/Aquila)
- [X] [Starcoder models](https://github.com/ggml-org/llama.cpp/pull/3187)
- [X] [Refact](https://huggingface.co/smallcloudai/Refact-1_6B-fim)
- [X] [MPT](https://github.com/ggml-org/llama.cpp/pull/3417)
- [X] [Bloom](https://github.com/ggml-org/llama.cpp/pull/3553)
- [x] [Yi models](https://huggingface.co/models?search=01-ai/Yi)
- [X] [StableLM models](https://huggingface.co/stabilityai)
- [x] [Deepseek models](https://huggingface.co/models?search=deepseek-ai/deepseek)
- [x] [Qwen models](https://huggingface.co/models?search=Qwen/Qwen)
- [x] [PLaMo-13B](https://github.com/ggml-org/llama.cpp/pull/3557)
- [x] [Phi models](https://huggingface.co/models?search=microsoft/phi)
- [x] [PhiMoE](https://github.com/ggml-org/llama.cpp/pull/11003)
- [x] [GPT-2](https://huggingface.co/gpt2)
- [x] [Orion 14B](https://github.com/ggml-org/llama.cpp/pull/5118)
- [x] [InternLM2](https://huggingface.co/models?search=internlm2)
- [x] [CodeShell](https://github.com/WisdomShell/codeshell)
- [x] [Gemma](https://ai.google.dev/gemma)
A few tensors were also missing `struct` in front of `ggml_tensor`.
											
										
										
											2024-03-08 22:31:00 +00:00
+								- [x] [Mamba](https://github.com/state-spaces/mamba)
- [x] [Grok-1](https://huggingface.co/keyfan/grok-1-hf)
- [x] [Xverse](https://huggingface.co/models?search=xverse)
- [x] [Command-R models](https://huggingface.co/models?search=CohereForAI/c4ai-command-r)
- [x] [SEA-LION](https://huggingface.co/models?search=sea-lion)
- [x] [GritLM-7B](https://huggingface.co/GritLM/GritLM-7B) + [GritLM-8x7B](https://huggingface.co/GritLM/GritLM-8x7B)
- [x] [OLMo](https://allenai.org/olmo)
- [x] [OLMo 2](https://allenai.org/olmo)
- [x] [OLMoE](https://huggingface.co/allenai/OLMoE-1B-7B-0924)
- [x] [Granite models](https://huggingface.co/collections/ibm-granite/granite-code-models-6624c5cec322e4c148c8b330)
- [x] [GPT-NeoX](https://github.com/EleutherAI/gpt-neox) + [Pythia](https://github.com/EleutherAI/pythia)
- [x] [Snowflake-Arctic MoE](https://huggingface.co/collections/Snowflake/arctic-66290090abe542894a5ac520)
- [x] [Smaug](https://huggingface.co/models?search=Smaug)
- [x] [Poro 34B](https://huggingface.co/LumiOpen/Poro-34B)
- [x] [Bitnet b1.58 models](https://huggingface.co/1bitLLM)
- [x] [Flan T5](https://huggingface.co/models?search=flan-t5)
- [x] [Open Elm models](https://huggingface.co/collections/apple/openelm-instruct-models-6619ad295d7ae9f868b759ca)
- [x] [ChatGLM3-6b](https://huggingface.co/THUDM/chatglm3-6b) + [ChatGLM4-9b](https://huggingface.co/THUDM/glm-4-9b) + [GLMEdge-1.5b](https://huggingface.co/THUDM/glm-edge-1.5b-chat) + [GLMEdge-4b](https://huggingface.co/THUDM/glm-edge-4b-chat)
- [x] [GLM-4-0414](https://huggingface.co/collections/THUDM/glm-4-0414-67f3cbcb34dd9d252707cb2e)
- [x] [SmolLM](https://huggingface.co/collections/HuggingFaceTB/smollm-6695016cad7167254ce15966)
- [x] [EXAONE-3.0-7.8B-Instruct](https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct)
- [x] [FalconMamba Models](https://huggingface.co/collections/tiiuae/falconmamba-7b-66b9a580324dd1598b0f6d4a)
- [x] [Jais](https://huggingface.co/inceptionai/jais-13b-chat)
- [x] [Bielik-11B-v2.3](https://huggingface.co/collections/speakleash/bielik-11b-v23-66ee813238d9b526a072408a)
- [x] [RWKV-7](https://huggingface.co/collections/shoumenchougou/rwkv7-gxx-gguf)
- [x] [RWKV-6](https://github.com/BlinkDL/RWKV-LM)
- [x] [QRWKV-6](https://huggingface.co/recursal/QRWKV6-32B-Instruct-Preview-v0.1)
- [x] [GigaChat-20B-A3B](https://huggingface.co/ai-sage/GigaChat-20B-A3B-instruct)
- [X] [Trillion-7B-preview](https://huggingface.co/trillionlabs/Trillion-7B-preview)
- [x] [Ling models](https://huggingface.co/collections/inclusionAI/ling-67c51c85b34a7ea0aba94c32)
- [x] [LFM2 models](https://huggingface.co/collections/LiquidAI/lfm2-686d721927015b2ad73eaa38)
- [x] [Hunyuan models](https://huggingface.co/collections/tencent/hunyuan-dense-model-6890632cda26b19119c9c5e7)
- [x] [BailingMoeV2 (Ring/Ling 2.0) models](https://huggingface.co/collections/inclusionAI/ling-v2-68bf1dd2fc34c306c1fa6f86)

#### Multimodal

- [x] [LLaVA 1.5 models](https://huggingface.co/collections/liuhaotian/llava-15-653aac15d994e992e2677a7e), [LLaVA 1.6 models](https://huggingface.co/collections/liuhaotian/llava-16-65b9e40155f60fd046a5ccf2)
- [x] [BakLLaVA](https://huggingface.co/models?search=SkunkworksAI/Bakllava)
- [x] [Obsidian](https://huggingface.co/NousResearch/Obsidian-3B-V0.5)
- [x] [ShareGPT4V](https://huggingface.co/models?search=Lin-Chen/ShareGPT4V)
- [x] [MobileVLM 1.7B/3B models](https://huggingface.co/models?search=mobileVLM)
- [x] [Yi-VL](https://huggingface.co/models?search=Yi-VL)
- [x] [Mini CPM](https://huggingface.co/models?search=MiniCPM)
- [x] [Moondream](https://huggingface.co/vikhyatk/moondream2)
- [x] [Bunny](https://github.com/BAAI-DCAI/Bunny)
- [x] [GLM-EDGE](https://huggingface.co/models?search=glm-edge)
- [x] [Qwen2-VL](https://huggingface.co/collections/Qwen/qwen2-vl-66cee7455501d7126940800d)
- [x] [LFM2-VL](https://huggingface.co/collections/LiquidAI/lfm2-vl-68963bbc84a610f7638d5ffa)

</details>
<details>
<summary>Bindings</summary>

- Python: [ddh0/easy-llama](https://github.com/ddh0/easy-llama)
- Python: [abetlen/llama-cpp-python](https://github.com/abetlen/llama-cpp-python)
- Go: [go-skynet/go-llama.cpp](https://github.com/go-skynet/go-llama.cpp)
- Node.js: [withcatai/node-llama-cpp](https://github.com/withcatai/node-llama-cpp)
- JS/TS (llama.cpp server client): [lgrammel/modelfusion](https://modelfusion.dev/integration/model-provider/llamacpp)
- JS/TS (Programmable Prompt Engine CLI): [offline-ai/cli](https://github.com/offline-ai/cli)
- JavaScript/Wasm (works in browser): [tangledgroup/llama-cpp-wasm](https://github.com/tangledgroup/llama-cpp-wasm)
- TypeScript/Wasm (nicer API, available on npm): [ngxson/wllama](https://github.com/ngxson/wllama)
- Ruby: [yoshoku/llama_cpp.rb](https://github.com/yoshoku/llama_cpp.rb)
- Rust (more features): [edgenai/llama_cpp-rs](https://github.com/edgenai/llama_cpp-rs)
- Rust (nicer API): [mdrokz/rust-llama.cpp](https://github.com/mdrokz/rust-llama.cpp)
- Rust (more direct bindings): [utilityai/llama-cpp-rs](https://github.com/utilityai/llama-cpp-rs)
- Rust (automated build from crates.io): [ShelbyJenkins/llm_client](https://github.com/ShelbyJenkins/llm_client)
- C#/.NET: [SciSharp/LLamaSharp](https://github.com/SciSharp/LLamaSharp)
- C#/VB.NET (more features - community license): [LM-Kit.NET](https://docs.lm-kit.com/lm-kit-net/index.html)
- Scala 3: [donderom/llm4s](https://github.com/donderom/llm4s)
- Clojure: [phronmophobic/llama.clj](https://github.com/phronmophobic/llama.clj)
- React Native: [mybigday/llama.rn](https://github.com/mybigday/llama.rn)
- Java: [kherud/java-llama.cpp](https://github.com/kherud/java-llama.cpp)
- Java: [QuasarByte/llama-cpp-jna](https://github.com/QuasarByte/llama-cpp-jna)
- Zig: [deins/llama.cpp.zig](https://github.com/Deins/llama.cpp.zig)
- Flutter/Dart: [netdur/llama_cpp_dart](https://github.com/netdur/llama_cpp_dart)
- Flutter: [xuegao-tzx/Fllama](https://github.com/xuegao-tzx/Fllama)
- PHP (API bindings and features built on top of llama.cpp): [distantmagic/resonance](https://github.com/distantmagic/resonance) [(more info)](https://github.com/ggml-org/llama.cpp/pull/6326)
- Guile Scheme: [guile_llama_cpp](https://savannah.nongnu.org/projects/guile-llama-cpp)
- Swift: [srgtuszy/llama-cpp-swift](https://github.com/srgtuszy/llama-cpp-swift)
- Swift: [ShenghaiWang/SwiftLlama](https://github.com/ShenghaiWang/SwiftLlama)
- Delphi: [Embarcadero/llama-cpp-delphi](https://github.com/Embarcadero/llama-cpp-delphi)
- Go (no CGo needed): [hybridgroup/yzma](https://github.com/hybridgroup/yzma)
- Android: [llama.android](/examples/llama.android)
</details>

<details>
<summary>UIs</summary>

*(to have a project listed here, it should clearly state that it depends on `llama.cpp`)*

- [AI Sublime Text plugin](https://github.com/yaroslavyaroslav/OpenAI-sublime-text) (MIT)
- [BonzAI App](https://apps.apple.com/us/app/bonzai-your-local-ai-agent/id6752847988) (proprietary)
- [cztomsik/ava](https://github.com/cztomsik/ava) (MIT)
- [Dot](https://github.com/alexpinel/Dot) (GPL)
- [eva](https://github.com/ylsdamxssjxxdd/eva) (MIT)
- [iohub/collama](https://github.com/iohub/coLLaMA) (Apache-2.0)
- [janhq/jan](https://github.com/janhq/jan) (AGPL)
- [johnbean393/Sidekick](https://github.com/johnbean393/Sidekick) (MIT)
- [KanTV](https://github.com/zhouwg/kantv?tab=readme-ov-file) (Apache-2.0)
- [KodiBot](https://github.com/firatkiral/kodibot) (GPL)
- [llama.vim](https://github.com/ggml-org/llama.vim) (MIT)
- [LARS](https://github.com/abgulati/LARS) (AGPL)
- [Llama Assistant](https://github.com/vietanhdev/llama-assistant) (GPL)
- [LlamaLib](https://github.com/undreamai/LlamaLib) (Apache-2.0)
- [LLMFarm](https://github.com/guinmoon/LLMFarm?tab=readme-ov-file) (MIT)
- [LLMUnity](https://github.com/undreamai/LLMUnity) (MIT)
- [LMStudio](https://lmstudio.ai/) (proprietary)
- [LocalAI](https://github.com/mudler/LocalAI) (MIT)
- [LostRuins/koboldcpp](https://github.com/LostRuins/koboldcpp) (AGPL)
- [MindMac](https://mindmac.app) (proprietary)
- [MindWorkAI/AI-Studio](https://github.com/MindWorkAI/AI-Studio) (FSL-1.1-MIT)
- [Mobile-Artificial-Intelligence/maid](https://github.com/Mobile-Artificial-Intelligence/maid) (MIT)
- [Mozilla-Ocho/llamafile](https://github.com/Mozilla-Ocho/llamafile) (Apache-2.0)
- [nat/openplayground](https://github.com/nat/openplayground) (MIT)
- [nomic-ai/gpt4all](https://github.com/nomic-ai/gpt4all) (MIT)
- [ollama/ollama](https://github.com/ollama/ollama) (MIT)
- [oobabooga/text-generation-webui](https://github.com/oobabooga/text-generation-webui) (AGPL)
- [PocketPal AI](https://github.com/a-ghorbani/pocketpal-ai) (MIT)
- [psugihara/FreeChat](https://github.com/psugihara/FreeChat) (MIT)
- [ptsochantaris/emeltal](https://github.com/ptsochantaris/emeltal) (MIT)
- [pythops/tenere](https://github.com/pythops/tenere) (AGPL)
- [ramalama](https://github.com/containers/ramalama) (MIT)
- [semperai/amica](https://github.com/semperai/amica) (MIT)
- [withcatai/catai](https://github.com/withcatai/catai) (MIT)
- [Autopen](https://github.com/blackhole89/autopen) (GPL)
</details>

<details>
<summary>Tools</summary>

- [akx/ggify](https://github.com/akx/ggify) – download PyTorch models from Hugging Face Hub and convert them to GGML
- [akx/ollama-dl](https://github.com/akx/ollama-dl) – download models from the Ollama library to be used directly with llama.cpp
- [crashr/gppm](https://github.com/crashr/gppm) – launch llama.cpp instances utilizing NVIDIA Tesla P40 or P100 GPUs with reduced idle power consumption
- [gpustack/gguf-parser](https://github.com/gpustack/gguf-parser-go/tree/main/cmd/gguf-parser) - review/check the GGUF file and estimate the memory usage
- [Styled Lines](https://marketplace.unity.com/packages/tools/generative-ai/styled-lines-llama-cpp-model-292902) (proprietary licensed, async wrapper of inference part for game development in Unity3d with pre-built Mobile and Web platform wrappers and a model example)
- [unslothai/unsloth](https://github.com/unslothai/unsloth) – 🦥 exports/saves fine-tuned and trained models to GGUF (Apache-2.0)
</details>

<details>
<summary>Infrastructure</summary>

- [Paddler](https://github.com/intentee/paddler) - Open-source LLMOps platform for hosting and scaling AI in your own infrastructure
- [GPUStack](https://github.com/gpustack/gpustack) - Manage GPU clusters for running LLMs
- [llama_cpp_canister](https://github.com/onicai/llama_cpp_canister) - llama.cpp as a smart contract on the Internet Computer, using WebAssembly
- [llama-swap](https://github.com/mostlygeek/llama-swap) - transparent proxy that adds automatic model switching with llama-server
- [Kalavai](https://github.com/kalavai-net/kalavai-client) - Crowdsource end to end LLM deployment at any scale
- [llmaz](https://github.com/InftyAI/llmaz) - ☸️ Easy, advanced inference platform for large language models on Kubernetes.
- [LLMKube](https://github.com/defilantech/llmkube) - Kubernetes operator for llama.cpp with multi-GPU and Apple Silicon Metal support
</details>

<details>
<summary>Games</summary>

- [Lucy's Labyrinth](https://github.com/MorganRO8/Lucys_Labyrinth) - A simple maze game where agents controlled by an AI model will try to trick you.

</details>
## Supported backends

| Backend | Target devices |
| --- | --- |
| [Metal](docs/build.md#metal-build) | Apple Silicon |
| [BLAS](docs/build.md#blas-build) | All |
| [BLIS](docs/backend/BLIS.md) | All |
| [SYCL](docs/backend/SYCL.md) | Intel and Nvidia GPU |
| [OpenVINO [In Progress]](docs/backend/OPENVINO.md) | Intel CPUs, GPUs, and NPUs |
| [MUSA](docs/build.md#musa) | Moore Threads GPU |
| [CUDA](docs/build.md#cuda) | Nvidia GPU |
| [HIP](docs/build.md#hip) | AMD GPU |
| [ZenDNN](docs/build.md#zendnn) | AMD CPU |
| [Vulkan](docs/build.md#vulkan) | GPU |
| [CANN](docs/build.md#cann) | Ascend NPU |
| [OpenCL](docs/backend/OPENCL.md) | Adreno GPU |
| [IBM zDNN](docs/backend/zDNN.md) | IBM Z & LinuxONE |
| [WebGPU [In Progress]](docs/build.md#webgpu) | All |
| [RPC](https://github.com/ggml-org/llama.cpp/tree/master/tools/rpc) | All |
| [Hexagon [In Progress]](docs/backend/snapdragon/README.md) | Snapdragon |
| [VirtGPU](docs/backend/VirtGPU.md) | VirtGPU APIR |
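
Most of these backends are selected at build time through a CMake option. The sketch below only prints the configure and build commands for a chosen backend instead of running them. The `backend` variable is a hypothetical selector used for illustration, and the option names (`GGML_CUDA`, `GGML_VULKAN`, `GGML_HIP`) are assumed to match the linked build docs, so verify them there before use.

```sh
#!/bin/sh
# Illustrative only: choose a backend name and print the matching CMake commands.
backend="cuda"   # hypothetical selector, not a llama.cpp setting

case "$backend" in
  cuda)   flag="-DGGML_CUDA=ON" ;;
  vulkan) flag="-DGGML_VULKAN=ON" ;;
  hip)    flag="-DGGML_HIP=ON" ;;
  *)      flag="" ;;   # CPU-only build: no extra option
esac

# Assumed option names; confirm them in docs/build.md before use.
configure_cmd="cmake -B build $flag"
build_cmd="cmake --build build --config Release"

# Print the commands rather than executing them.
echo "$configure_cmd"
echo "$build_cmd"
```

To actually build, run the two printed commands from the repository root once the backend's toolchain is installed.
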
## Obtaining and quantizing models

The [Hugging Face](https://huggingface.co) platform hosts a [number of LLMs](https://huggingface.co/models?library=gguf&sort=trending) compatible with `llama.cpp`:

- [Trending](https://huggingface.co/models?library=gguf&sort=trending)
- [LLaMA](https://huggingface.co/models?sort=trending&search=llama+gguf)

You can either download the GGUF file manually or use any `llama.cpp`-compatible model directly from [Hugging Face](https://huggingface.co/) or another model hosting site by passing the CLI argument `-hf <user>/<model>[:quant]`. For example:

```sh
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF
```
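
The `[:quant]` suffix in `-hf <user>/<model>[:quant]` is optional. As a small sketch of how the reference is put together, the snippet below assembles it from its parts and prints the resulting command rather than running it. The `Q4_K_M` tag is an assumption chosen for illustration; whether a given repository offers it depends on the files it hosts.

```sh
# Build a model reference in the form <user>/<model>[:quant].
user="ggml-org"
model="gemma-3-1b-it-GGUF"
quant="Q4_K_M"   # assumption: illustrative tag; set to "" to omit the suffix

# ${quant:+:$quant} expands to ":<quant>" only when quant is non-empty.
ref="${user}/${model}${quant:+:$quant}"

# Print the resulting command instead of running it.
echo "llama-cli -hf $ref"
```

With `quant` left empty, the reference reduces to `ggml-org/gemma-3-1b-it-GGUF`, matching the example above.
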
By default, the CLI downloads from Hugging Face. You can switch to another source by setting the environment variable `MODEL_ENDPOINT`, which must point to a Hugging Face-compatible API endpoint.

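
A minimal sketch of setting the variable for the current shell session follows. The URL is a placeholder assumption, not a real service; substitute an endpoint you know to be Hugging Face-compatible.

```sh
# Assumption: placeholder URL; replace with a real Hugging Face-compatible endpoint.
export MODEL_ENDPOINT="https://models.example.com"

# Later `-hf` downloads started from this shell would use the endpoint above, e.g.:
#   llama-cli -hf <user>/<model>[:quant]
echo "Downloads will use: $MODEL_ENDPOINT"
```
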
After downloading a model, use the CLI tools to run it locally (see below).

`llama.cpp` requires the model to be stored in the [GGUF](https://github.com/ggml-org/ggml/blob/master/docs/gguf.md) file format. Models in other data formats can be converted to GGUF using the `convert_*.py` Python scripts in this repo.

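
As a rough sketch of such a conversion, the snippet below prints a possible command rather than executing it. `convert_hf_to_gguf.py` is assumed here to be the relevant `convert_*.py` script for Hugging Face checkpoints; the model directory, output file name, and the `--outfile`/`--outtype` values are illustrative assumptions, so check the script's `--help` output before relying on them.

```sh
# Hypothetical paths: adjust to your local checkout and model location.
model_dir="path/to/hf-model"
outfile="model-f16.gguf"

# Assumed invocation; run `python convert_hf_to_gguf.py --help` to confirm the options.
convert_cmd="python convert_hf_to_gguf.py $model_dir --outfile $outfile --outtype f16"

# Print the command rather than executing it.
echo "$convert_cmd"
```
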
The Hugging Face platform provides a variety of online tools for converting, quantizing and hosting models with `llama.cpp`:

- Use the [GGUF-my-repo space](https://huggingface.co/spaces/ggml-org/gguf-my-repo) to convert to GGUF format and quantize model weights to smaller sizes
- Use the [GGUF-my-LoRA space](https://huggingface.co/spaces/ggml-org/gguf-my-lora) to convert LoRA adapters to GGUF format (more info: https://github.com/ggml-org/llama.cpp/discussions/10123)
- Use the [GGUF-editor space](https://huggingface.co/spaces/CISCai/gguf-editor) to edit GGUF metadata in the browser (more info: https://github.com/ggml-org/llama.cpp/discussions/9268)
- Use the [Inference Endpoints](https://ui.endpoints.huggingface.co/) to directly host `llama.cpp` in the cloud (more info: https://github.com/ggml-org/llama.cpp/discussions/9669)

To learn more about model quantization, [read this documentation](tools/quantize/README.md).

## [`llama-cli`](tools/cli)

#### A CLI tool for accessing and experimenting with most of `llama.cpp`'s functionality.

- <details open>
    <summary>Run in conversation mode</summary>

    Models with a built-in chat template will automatically activate conversation mode. If this doesn't occur, you can manually enable it by adding `-cnv` and specifying a suitable chat template with `--chat-template NAME`.

    ```bash
    llama-cli -m model.gguf

    # > hi, who are you?
    # Hi there! I'm your helpful assistant! I'm an AI-powered chatbot designed to assist and provide information to users like you. I'm here to help answer your questions, provide guidance, and offer support on a wide range of topics. I'm a friendly and knowledgeable AI, and I'm always happy to help with anything you need. What's on your mind, and how can I assist you today?
    #
    # > what is 1+1?
    # Easy peasy! The answer to 1+1 is... 2!
    ```

    </details>

+								- <details>
-												cli : auto activate conversation mode if chat template is available (#11214)

* cli : auto activate conversation mode if chat template is detected

* add warn on bad template

* update readme (writing with the help of chatgpt)

* update readme (2)

* do not activate -cnv for non-instruct models
											
										
										
											2025-01-13 19:18:12 +00:00
+								    <summary>Run in conversation mode with custom chat template</summary>
    ```bash
    # use the "chatml" template (use -h to see the list of supported templates)
    llama-cli -m model.gguf -cnv --chat-template chatml
    # use a custom template
    llama-cli -m model.gguf -cnv --in-prefix 'User: ' --reverse-prompt 'User:'
    ```
    </details>
- <details>
    <summary>Constrain the output with a custom grammar</summary>
    ```bash
    llama-cli -m model.gguf -n 256 --grammar-file grammars/json.gbnf -p 'Request: schedule a call at 8pm; Command:'
    # {"appointmentTime": "8pm", "appointmentDetails": "schedule a call"}
    ```
    The [grammars/](grammars/) folder contains a handful of sample grammars. To write your own, check out the [GBNF Guide](grammars/README.md).
    For authoring more complex JSON grammars, check out https://grammar.intrinsiclabs.ai/
    </details>
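
As a starting point for writing your own grammar, here is a minimal sketch of a GBNF file that restricts the output to a single word. The file name `yes-no.gbnf` and the prompt are illustrative assumptions, not files shipped with the repository:

```bash
# Write a tiny grammar: the only valid outputs are "yes" or "no".
cat > yes-no.gbnf <<'EOF'
# root is the start rule of the grammar
root ::= "yes" | "no"
EOF

# Assuming llama-cli and model.gguf are available, pass it like any other grammar file:
# llama-cli -m model.gguf -n 8 --grammar-file yes-no.gbnf -p 'Is the sky blue? Answer:'
```

Keeping a grammar this small is a convenient way to check that grammar loading works before moving on to larger grammars such as `grammars/json.gbnf`.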
## [`llama-server`](tools/server)
#### A lightweight, [OpenAI API](https://github.com/openai/openai-openapi) compatible, HTTP server for serving LLMs.
- <details open>
    <summary>Start a local HTTP server with default configuration on port 8080</summary>
    ```bash
    llama-server -m model.gguf --port 8080
    # Basic web UI can be accessed via browser: http://localhost:8080
    # Chat completion endpoint: http://localhost:8080/v1/chat/completions
    ```
    </details>
- <details>
    <summary>Support multiple users and parallel decoding</summary>
    ```bash
    # up to 4 concurrent requests, each with 4096 max context
    llama-server -m model.gguf -c 16384 -np 4
    ```
    </details>
- <details>
    <summary>Enable speculative decoding</summary>
    ```bash
    # the draft.gguf model should be a small variant of the target model.gguf
    llama-server -m model.gguf -md draft.gguf
    ```
    </details>
- <details>
    <summary>Serve an embedding model</summary>
    ```bash
    # use the /embedding endpoint
    llama-server -m model.gguf --embedding --pooling cls -ub 8192
    ```
    </details>
- <details>
    <summary>Serve a reranking model</summary>
    ```bash
    # use the /reranking endpoint
    llama-server -m model.gguf --reranking
    ```
    </details>
- <details>
    <summary>Constrain all outputs with a grammar</summary>
    ```bash
    # custom grammar
    llama-server -m model.gguf --grammar-file grammar.gbnf
    # JSON
    llama-server -m model.gguf --grammar-file grammars/json.gbnf
    ```
    </details>
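
Once the server is running, the chat completion endpoint listed above accepts OpenAI-style JSON requests. The following is a minimal sketch, assuming the default port 8080 from the first example and `curl` on the client side; the helper name `chat_request` and the prompt text are illustrative only:

```bash
# Assumption: llama-server is already running locally on port 8080.
endpoint="http://localhost:8080/v1/chat/completions"

# OpenAI-style request body: a list of chat messages.
payload='{"messages": [{"role": "user", "content": "what is 1+1?"}]}'

# Hypothetical helper: POST the payload and print the raw JSON response.
chat_request() {
    curl -s "$endpoint" \
        -H "Content-Type: application/json" \
        -d "$payload"
}

# Call it once the server is up:
# chat_request
```

The request is wrapped in a function so the snippet can be sourced without contacting the server; uncomment the last line to send it.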
## [`llama-perplexity`](tools/perplexity)
#### A tool for measuring the [perplexity](tools/perplexity/README.md) [^1] (and other quality metrics) of a model over a given text.
- <details open>
    <summary>Measure the perplexity over a text file</summary>
    ```bash
    llama-perplexity -m model.gguf -f file.txt
    # [1]15.2701,[2]5.4007,[3]5.3073,[4]6.2965,[5]5.8940,[6]5.6096,[7]5.7942,[8]4.9297, ...
    # Final estimate: PPL = 5.4007 +/- 0.67339
    ```
    </details>
- <details>
    <summary>Measure KL divergence</summary>
    ```bash
    # TODO
    ```
    </details>
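
When scripting comparisons between models or quantizations, the final estimate can be pulled out of the output shown above. This is a minimal sketch, assuming the summary line keeps the `Final estimate: PPL = ...` shape from the example; the variable names are illustrative:

```bash
# Sample summary line, copied from the example output above.
output='Final estimate: PPL = 5.4007 +/- 0.67339'

# Keep only the number that follows "PPL = ".
ppl=$(printf '%s\n' "$output" | sed -n 's/^Final estimate: PPL = \([0-9.]*\).*/\1/p')

echo "$ppl"   # prints 5.4007
```

In practice the `output` variable would be filled from the captured output of a real `llama-perplexity` run instead of a literal string.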
[^1]: [https://huggingface.co/docs/transformers/perplexity](https://huggingface.co/docs/transformers/perplexity)
## [`llama-bench`](tools/llama-bench)
#### Benchmark inference performance for various parameters.
- <details open>
    <summary>Run default benchmark</summary>
    ```bash
    llama-bench -m model.gguf
    # Output:
    # | model               |       size |     params | backend    | threads |          test |                  t/s |
    # | ------------------- | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
    # | qwen2 1.5B Q4_0     | 885.97 MiB |     1.54 B | Metal,BLAS |      16 |         pp512 |      5765.41 ± 20.55 |
    # | qwen2 1.5B Q4_0     | 885.97 MiB |     1.54 B | Metal,BLAS |      16 |         tg128 |        197.71 ± 0.81 |
    #
    # build: 3e0ba0e60 (4229)
    ```
    </details>
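
To compare runs in a script, the test name and mean throughput can be extracted from the table rows. The sketch below works on the two sample rows from the output above and assumes the column layout shown there (test in the sixth column, `t/s` in the seventh); the variable names are illustrative:

```bash
# Two data rows copied from the sample output above.
bench_output='| qwen2 1.5B Q4_0     | 885.97 MiB |     1.54 B | Metal,BLAS |      16 |         pp512 |      5765.41 ± 20.55 |
| qwen2 1.5B Q4_0     | 885.97 MiB |     1.54 B | Metal,BLAS |      16 |         tg128 |        197.71 ± 0.81 |'

# Split each row on "|", trim the test and t/s cells, and keep the mean value.
summary=$(printf '%s\n' "$bench_output" | awk -F'|' '
    NF >= 8 {
        test = $7; rate = $8
        gsub(/^ +| +$/, "", test)
        gsub(/^ +| +$/, "", rate)
        split(rate, parts, " ")
        print test "=" parts[1]
    }')

echo "$summary"
# pp512=5765.41
# tg128=197.71
```

Here `pp512` is the prompt-processing test and `tg128` the text-generation test, both reported in tokens per second.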
## [`llama-simple`](examples/simple)
#### A minimal example for implementing apps with `llama.cpp`. Useful for developers.
- <details>
    <summary>Basic text completion</summary>
    ```bash
    llama-simple -m model.gguf
    # Hello my name is Kaitlyn and I am a 16 year old girl. I am a junior in high school and I am currently taking a class called "The Art of
    ```
    </details>
## Contributing
- Contributors can open PRs
- Collaborators will be invited based on contributions
- Maintainers can push to branches in the `llama.cpp` repo and merge PRs into the `master` branch
- Any help with managing issues, PRs and projects is very much appreciated!
- See [good first issues](https://github.com/ggml-org/llama.cpp/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) for tasks suitable for first contributions
- Read the [CONTRIBUTING.md](CONTRIBUTING.md) for more information
- Make sure to read this: [Inference at the edge](https://github.com/ggml-org/llama.cpp/discussions/205)
- A bit of backstory for those who are interested: [Changelog podcast](https://changelog.com/podcast/532)
## Other documentation
- [cli](tools/cli/README.md)
- [completion](tools/completion/README.md)
- [server](tools/server/README.md)
- [GBNF grammars](grammars/README.md)
#### Development documentation
- [How to build](docs/build.md)
- [Running on Docker](docs/docker.md)
- [Build on Android](docs/android.md)
- [Performance troubleshooting](docs/development/token_generation_performance_tips.md)
- [GGML tips & tricks](https://github.com/ggml-org/llama.cpp/wiki/GGML-Tips-&-Tricks)
#### Seminal papers and background on the models
If your issue is with model generation quality, then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT:
- LLaMA:
    - [Introducing LLaMA: A foundational, 65-billion-parameter large language model](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/)
    - [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
- GPT-3:
    - [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)
- GPT-3.5 / InstructGPT / ChatGPT:
    - [Aligning language models to follow instructions](https://openai.com/research/instruction-following)
    - [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)
## XCFramework
The XCFramework is a precompiled version of the library for iOS, visionOS, tvOS,
and macOS. It can be used in Swift projects without the need to compile the
library from source. For example:
```swift
// swift-tools-version: 5.10
// The swift-tools-version declares the minimum version of Swift required to build this package.
import PackageDescription
let package = Package(
    name: "MyLlamaPackage",
    targets: [
        .executableTarget(
            name: "MyLlamaPackage",
            dependencies: [
                "LlamaFramework"
            ]),
        .binaryTarget(
            name: "LlamaFramework",
            url: "https://github.com/ggml-org/llama.cpp/releases/download/b5046/llama-b5046-xcframework.zip",
            checksum: "c19be78b5f00d8d29a25da41042cb7afa094cbf6280a225abe614b03b20029ab"
        )
    ]
)
```
The above example uses an intermediate build `b5046` of the library. This can be modified
to use a different version by changing the URL and checksum.
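
As a rough sketch of that update step, the snippet below rebuilds the download URL from a build tag. It assumes that other releases follow the same naming pattern as `b5046`, which should be verified on the releases page; the variable names and the local file name are illustrative:

```bash
# Build tag to use; b5046 is the one from the example above.
build=b5046

# Assumption: other releases use the same asset naming scheme as b5046.
url="https://github.com/ggml-org/llama.cpp/releases/download/${build}/llama-${build}-xcframework.zip"
echo "$url"

# With network access and the Swift toolchain installed, the matching checksum
# can then be computed locally:
# curl -L -o llama-xcframework.zip "$url"
# swift package compute-checksum llama-xcframework.zip
```

The value printed by `swift package compute-checksum` is what goes into the `checksum:` field of the `.binaryTarget`.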
## Completions
Command-line completion is available for some environments.
#### Bash Completion
```bash
$ build/bin/llama-cli --completion-bash > ~/.llama-completion.bash
$ source ~/.llama-completion.bash
```
Optionally, this can be added to your `.bashrc` or `.bash_profile` to load it
automatically. For example:
```console
$ echo "source ~/.llama-completion.bash" >> ~/.bashrc
```
## Dependencies
- [yhirose/cpp-httplib](https://github.com/yhirose/cpp-httplib) - Single-header HTTP server, used by `llama-server` - MIT license
- [stb-image](https://github.com/nothings/stb) - Single-header image format decoder, used by multimodal subsystem - Public domain
- [nlohmann/json](https://github.com/nlohmann/json) - Single-header JSON library, used by various tools/examples - MIT license
- [miniaudio.h](https://github.com/mackron/miniaudio) - Single-header audio format decoder, used by multimodal subsystem - Public domain
- [subprocess.h](https://github.com/sheredom/subprocess.h) - Single-header process launching solution for C and C++ - Public domain