Commit graph

  • 92fc86582f
    Delete .github/workflows/build-cache.yml main main-b8992-92fc865 James Devine 2026-05-06 00:04:03 +0200
  • f43040117f
    Update README with NPU optimization details main-b8991-f430401 James Devine 2026-05-02 22:50:54 +0200
  • 094f7aaf18
    Add Q6A build artifacts: cross-compiled llama-cli + Hexagon NPU DSP libraries pingud98 2026-05-02 20:44:11 +0000
  • c20c44514a
    spec: fix argument typo (#22552) Ben Guidarelli 2026-04-30 10:32:32 -0400
  • 6118c043b1
    ci : bump ty to 0.0.33 (#22535) Sigbjørn Skjæret 2026-04-30 15:15:54 +0200
  • 5f0ab726f7
    vendor : update cpp-httplib to 0.43.2 (#22548) Adrien Gallouët 2026-04-30 15:04:39 +0200
  • e82aaf2587
    CUDA: fix tile FA kernel on Pascal (#22541) Johannes Gäßler 2026-04-30 13:04:50 +0200
  • 27aef3dd91
    scripts : add wc2wt.sh - create worktree from current HEAD (#22513) Georgi Gerganov 2026-04-30 09:20:26 +0300
  • 45155597aa
    add fast matmul iquants (#22504) Rithik Sharma 2026-04-29 22:58:32 -0700
  • 80afa33aad
    spec : fix draft model checkpoints (#22521) Georgi Gerganov 2026-04-30 08:32:18 +0300
  • b42c7fa5b8
    spec : fix vocab compat checks in spec example (#22426) Peter Sideris 2026-04-30 08:18:25 +0300
  • d77599234e
    common : do not pass prompt tokens to reasoning budget sampler (#22488) Aldehir Rojas 2026-04-29 14:10:58 -0500
  • 41a63be28e
    hexagon: make vmem and buffer-size configurable (#22487) Max Krasnyansky 2026-04-29 11:51:21 -0700
  • 098705a29e
    CUDA: fuse SSM_CONV + ADD(bias) + SILU (#22478) Anav Prasad 2026-04-29 11:39:56 -0700
  • 683c5acb90
    spec : disacard last drafted token with low prob (#22506) Georgi Gerganov 2026-04-29 17:00:00 +0300
  • b1d5f5b449
    sync : ggml Georgi Gerganov 2026-04-29 16:43:08 +0300
  • 4b221b7f1e
    ggml : bump version to 0.10.1 (ggml/1469) Georgi Gerganov 2026-04-29 16:41:45 +0300
  • 59237bfbbc
    webui: fix slow mic stop and WAV encode (#22480) Pascal 2026-04-29 12:58:35 +0200
  • 1cbc846eba
    ggml-cpu : disable tiled matmul on AIX to fix page boundary segfault (#22293) shalinib-ibm 2026-04-29 16:02:40 +0530
  • 3142f1dbb9
    ggml-cuda: refactor fusion code (#22468) Aman Gupta 2026-04-29 16:19:33 +0800
  • b5c4227dc6
    ggml-cpu: cmake: append xsmtvdotii march for SpacemiT IME (#22317) qiurui144 2026-04-29 15:59:21 +0800
  • d6a5094004
    ggml-webgpu: Fix bug in FlashAttention support check (#22492) Reese Levine 2026-04-29 00:59:00 -0700
  • 7b95ea5d11
    common: Intentionally leak logger instance to fix hanging on Windows (#22273) Masato Nakasaka 2026-04-29 16:58:43 +0900
  • bdc9c743a5
    ggml : add sve tuned code for gemm_q8_0_4x8_q8_0() kernel (#21916) hrushitfujitsu 2026-04-29 13:27:37 +0530
  • 739393beeb
    TP: fix delayed AllReduce + zero-sized slices (#22489) Johannes Gäßler 2026-04-29 08:55:07 +0200
  • fc2b0053ff
    ggml-cuda: Repost of 21896: Blackwell native NVFP4 support (#22196) Michael Wand 2026-04-28 15:47:42 -0700
  • 7b8443ac78
    ggml-cuda: add flash-attn support for DKQ=320/DV=256 with ncols2=32 (… (#22286) lnigam 2026-04-29 01:07:35 +0530
  • 5d56effdee
    convert : add support for Nemotron Nano 3 Omni (#22481) Daniel Bevenius 2026-04-28 19:17:57 +0200
  • 52e5f0a5c1
    common : re-arm reasoning budget after DONE on new <think> (#22323) Jillis ter Hove 2026-04-28 19:15:36 +0200
  • f9f33654a6
    vulkan: Coalesce Q4_K/Q5_K scale loads (#21751) Matt Corallo 2026-04-28 15:31:04 +0000
  • 98bb57916a
    ggml-webgpu: fix buffer aliasing for ssm_scan and refactor aliasing logic (#22456) Reese Levine 2026-04-28 07:27:17 -0700
  • f42e29fdf1
    webui: Server tools (#21237) Aleksander Grygier 2026-04-28 14:35:49 +0300
  • 19821178be
    vulkan: add barrier after writetimestamp (#21865) Jeff Bolz 2026-04-28 12:28:12 +0200
  • 698d19b93c
    ggml: improve SPIR-V headers detection with __has_include (#21918) Emil Askerov 2026-04-28 13:19:06 +0300
  • 50494a2800
    ggml : skip already registered backends and devices (#22296) Adrien Gallouët 2026-04-28 09:02:32 +0200
  • d530d6e7a2
    ggml : revert to -lm linking instead of find_library (#22355) Adrien Gallouët 2026-04-28 08:56:02 +0200
  • c3e08f4700
    CANN: add new ops, optimize existing ops (#21204) hipudding 2026-04-28 14:27:22 +0800
  • 14e733e36f
    spec : refactor params (#22397) Georgi Gerganov 2026-04-28 09:07:33 +0300
  • 516e8d7a8a
    server: use pos_next instead of n_tokens for m-rope (#22439) Aman Gupta 2026-04-28 13:41:00 +0800
  • 434b2a1ff6
    ggml-webgpu: add Q1_0 support (#22374) Rithik Sharma 2026-04-27 15:50:59 -0700
  • 983ca8992e
    server: (router) Forward form-data to model server (Fixes #22044) (#22118) tha80 2026-04-27 23:55:00 +0200
  • 665abc6097
    add fast mat-vec kernels for i-quants (#22344) Rithik Sharma 2026-04-27 08:25:45 -0700
  • 4414c04b9a
    Additional test for common/gemma4 : handle parsing edge cases (#22420) Igor Rudenko 2026-04-27 17:36:59 +0300
  • ceaf47c4b1
    fix: rpc-server cache may not work in Windows environments (#22394) unraido 2026-04-27 23:25:09 +0900
  • 42401c72b8
    Fix type casting for unaccounted memory calculation (#22424) rankaiyx 2026-04-27 20:31:13 +0800
  • e940b3d468
    download : prefer q8_0 when q4_k not available (#22428) Georgi Gerganov 2026-04-27 15:30:29 +0300
  • 0f1bb602dd
    model : remove duplicate wo_s scale after build_attn (Qwen3, LLaMA) (#22421) ynankani 2026-04-27 07:58:48 +0000
  • d13540becd
    convert : remove input_scale for dequantized fp8 modelopt (#22356) Sigbjørn Skjæret 2026-04-27 08:45:01 +0200
  • f84270ea10
    ggml : use 64 bytes aligned tile buffers (#21058) Adrien Gallouët 2026-04-27 08:30:55 +0200
  • 5594d13224
    common: fix missing exports in llama-common (#22340) Max Krasnyansky 2026-04-26 22:06:39 -0700
  • f535774325
    pr2wt : symlink .pi (#22386) Georgi Gerganov 2026-04-26 19:49:26 +0300
  • 06a811d085
    add performance-portable tuning for register-tile and subgroup matmul (#22241) Rithik Sharma 2026-04-26 09:26:28 -0700
  • 78433f606f
    Fix recurrent state serialization for partial reads and writes (#22362) Gaurav Garg 2026-04-26 17:04:40 +0530
  • 7ec36aa861
    Github: set meta backend code owner (#22388) Johannes Gäßler 2026-04-26 13:34:13 +0200
  • b1a5bd4e0c
    CUDA: better coalesce data-access for contiguous concat (#22330) Oliver Simons 2026-04-26 09:21:45 +0200
  • 0c6ee1cade
    ggml-cpu : re-enable fast gelu_quick_f16 (#22339) Sigbjørn Skjæret 2026-04-26 08:28:14 +0200
  • 2dd84169d1
    ggml-cpu: optimize avx2 q6_k (#22345) Eve 2026-04-26 06:27:50 +0000
  • f454bd7eb8
    opencl: add iq4_nl support (#22272) lhez 2026-04-25 21:21:58 -0700
  • b760272f1a
    hexagon: guard HMX clock request for v75+ platforms (#22377) Trivikram Reddy 2026-04-25 19:58:26 -0500
  • dcad77cc3b
    chat: fix handling of space in reasoning markers (#22353) Piotr Wilkin (ilintar) 2026-04-25 21:24:13 +0200
  • 98dc1418ea
    spec : fix vocab compat checks (#22358) Georgi Gerganov 2026-04-25 20:11:35 +0300
  • 9725a313be
    CUDA: reduce MMQ stream-k overhead (#22298) Johannes Gäßler 2026-04-25 14:15:03 +0200
  • d1649047a3
    metal : optimize Metal Tensor API usage for GGML_OP_MUL_MAT (#20962) Developer-Ecosystem-Engineering 2026-04-25 05:14:28 -0700
  • 9d34231bb8
    llama-quant : default ftype param Q5_1 --> Q8_0 (#20828) ddh0 2026-04-25 01:25:35 -0500
  • 8ea8fee966
    gitignore : add .pi + personal SYSTEM.md (#22316) Georgi Gerganov 2026-04-25 09:20:45 +0300
  • eddd7a13a5
    [SYCL] Optimize Q4_0 mul_mat for Arc770, add scripts (#22291) Neo Zhang 2026-04-25 14:20:14 +0800
  • dd2914dc81
    ggml-webgpu: support for SSM_SCAN and disable set_rows error checking (#22327) Reese Levine 2026-04-24 23:18:15 -0700
  • 0adede866d
    parser: fix structured output bug (#22302) Piotr Wilkin (ilintar) 2026-04-24 23:19:55 +0200
  • 361fe72acb
    Hexagon: Bump HMX Frequency to Max Corner (#22334) Trivikram Reddy 2026-04-24 15:55:17 -0500
  • a702f39597
    CI Snapdragon: Switch ubuntu-latest to ubuntu-slim runner (#22303) Shreya Jain 2026-04-24 12:21:36 -0700
  • 13d36cf891
    ggml-webgpu: enable FLASH_ATTN_EXT on browser without subgroup matrix (#22199) Zheyuan Chen 2026-04-24 10:39:09 -0700
  • f65bc34c68
    hexagon: use DIRID 13 in libggml-htp.inf for modern InfVerif (#22306) Mengsheng Wu 2026-04-25 00:21:33 +0800
  • 15fa3c493b
    metal : print GPU description (#22318) Georgi Gerganov 2026-04-24 13:56:03 +0300
  • dc80c5252a
    common : fix jinja warnings with clang 21 (#22313) Adrien Gallouët 2026-04-24 12:36:02 +0200
  • e583f3b4f5
    ggml : minor coding style (#22308) Georgi Gerganov 2026-04-24 11:02:00 +0300
  • 017f090442
    jinja : remove unused header (#22310) Georgi Gerganov 2026-04-24 11:01:46 +0300
  • ffdd983fb8
    server : fix swa-full logic (#22288) Georgi Gerganov 2026-04-24 10:17:37 +0300
  • 793d0a7931
    server: rename debug tags to match --cache-idle-slots naming (#22292) Yes You Can Have Your Own 2026-04-24 09:28:44 +0300
  • 8bc492ebb4
    hexagon: add SOLVE_TRI op (#21974) Mengsheng Wu 2026-04-24 09:39:13 +0800
  • e5f070a1dc
    fix(shader): handle the buffer aliasing for rms fuse (#22266) Chen Yuan 2026-04-23 19:32:59 -0400
  • fa0b8a70a8
    cli: Remove redundant local sampling variables (#20429) (#22264) Ethan Turner 2026-04-23 15:53:23 -0700
  • 5d2b52d80d
    hexagon: add support for basic and extended Op profiling (#22269) Max Krasnyansky 2026-04-23 14:17:21 -0700
  • 187a456370
    Enable testing on Snapdragon devices (#21051) Shreya Jain 2026-04-23 13:08:10 -0700
  • 185cbff6f1
    server : convert_anthropic_to_oai: also copy chat_template_kwargs (#22154) srkizer 2026-04-24 03:32:46 +0900
  • c78fb909b2
    server: fix heap-buffer-overflow from negative n_discard (CVE-2026-21869) (#22267) Song Li 2026-04-23 12:39:07 -0400
  • 12568ca8c8
    vendor : update LibreSSL to 4.3.1 (#22285) Adrien Gallouët 2026-04-23 17:45:56 +0200
  • c807c6e3b0
    server: (anthropic API) fix prefix caching (#21793) kvc0 2026-04-23 08:45:02 -0700
  • 0949beb5a3
    fix build number for sycl release (#22283) Sigbjørn Skjæret 2026-04-23 15:38:58 +0200
  • 9012c50fc8
    model-conversion : fix mmproj output file name [no ci] (#22274) Daniel Bevenius 2026-04-23 15:07:38 +0200
  • 0dd7f915fd
    cli : cleanup auto-completion code (#21745) Matthias Straka 2026-04-23 15:03:28 +0200
  • 550d684bd1
    server: Enable transcriptions API for LFM2-Audio (#22000) Tarek Dakhran 2026-04-23 10:47:26 +0200
  • 8635e221c8
    metal : fix event synchronization (#22260) Georgi Gerganov 2026-04-23 08:22:49 +0300
  • 930e0210d1
    gitignore: add AGENTS.local.md (#22246) Georgi Gerganov 2026-04-23 08:22:24 +0300
  • 96c1db26c4
    ggml-base: use MATH_LIBRARY variable instead of hardcoded 'm' (#22239) Georgi Gerganov 2026-04-23 08:22:08 +0300
  • 4ead6fd957
    [SYCL] Update oneapi 2025.3.3, Seperate SYCL build, release Ubuntu 24 package. (#22078) Neo Zhang Jianyu 2026-04-23 13:21:36 +0800
  • 5eaee65384
    convert : Handle ModelOpt produced mixed precision model during convert to GGUF (#22247) ynankani 2026-04-23 05:19:51 +0000
  • 60b68a6279
    sycl : fused MoE mul_mat_vec_q for TG (#21920) abotsis 2026-04-22 23:18:56 -0600
  • b76429a69c
    ggml-webgpu: add support for im2col (#22259) Chen Yuan 2026-04-22 23:17:41 -0400
  • 86db42e97f
    CUDA: fuse relu + sqr (#22249) Anav Prasad 2026-04-23 02:28:56 +0000
  • 6217b49583
    HIP: flip GGML_HIP_GRAPHS to default on (#22254) uvos 2026-04-23 02:34:31 +0200