llama_cpp_for_radxa_dragon_.../ggml
Akarshan Biswas 225088ea76
sycl: Improve mul_mat_id memory efficiency and add BF16 fast path (#22119)
* sycl: size mul_mat_id staging buffers by routed rows

Previously src1_contiguous/dst_contiguous in ggml_sycl_mul_mat_id were
sized to ggml_nelements(src1/dst), which over-allocates when ne12 > 1
and can fail with UR_RESULT_ERROR_OUT_OF_HOST_MEMORY on Level Zero for
MoE models (notably with --cpu-moe). Size them by the actual number of
routed rows (ids->ne[1] * n_ids) instead.

* sycl: add bf16 mul_mat fast path via DNNL

When src0 is BF16 (commonly the case for lm_head / output.weight), the
existing f16 path is skipped because bf16 isn't covered, and the f32
fallback dequantizes the entire src0 slab to f32 in a single pool alloc
(row_diff*ne00 floats). For large-vocab models this can reach several
GB and fail with UR_RESULT_ERROR_OUT_OF_HOST_MEMORY on Level Zero.

Add a bf16xbf16 -> f32 DNNL matmul fast path that uses the bf16 storage
in place and only materializes a small src1 bf16 conversion buffer. bf16
matmul accumulates in f32, so it's correct even when the op requests
GGML_PREC_F32 (as lm_head does).

- gemm.hpp: map bfloat16 to dnnl::memory::data_type::bf16.
- convert.{hpp,cpp}: expose ggml_get_to_bf16_sycl for f32/f16/bf16 -> bf16.
- ggml-sycl.cpp: take the bf16 path early in ggml_sycl_op_mul_mat_sycl
  when DNNL and GGML_SYCL_HAS_BF16 are both available.
2026-04-22 20:32:56 +08:00
..
cmake ggml: backend-agnostic tensor parallelism (experimental) (#19378) 2026-04-09 16:42:19 +02:00
include CUDA: manage NCCL communicators in context (#21891) 2026-04-15 15:58:40 +02:00
src sycl: Improve mul_mat_id memory efficiency and add BF16 fast path (#22119) 2026-04-22 20:32:56 +08:00
.gitignore
CMakeLists.txt ggml : bump version to 0.10.0 (ggml/1463) 2026-04-21 11:04:21 +03:00