Commit graph

8 commits

Author SHA1 Message Date
Diego Devesa
ae8de6d50a
ggml : build backends as libraries (#10256)
* ggml : build backends as libraries

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com>
2024-11-14 18:04:35 +01:00
amritahs-ibm
e89213492d
ggml : optimize llamafile cpu matrix multiplication for ppc64le (#10156)
This change upstreams llamafile's CPU matrix
multiplication kernels for ppc64le, using MMA
builtins for the FP32 data type.

This change results in a consistent 90%
improvement in input processing time and a
20% to 80% improvement in output processing
time across various batch sizes.

The patch was tested with the Meta-Llama-3-8B,
Mistral-7B, and Llama-2-7B-chat-hf models on an
IBM POWER10 machine.

Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com>
2024-11-09 09:17:50 +02:00
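
The MMA builtins mentioned in the commit above are the GCC/Clang POWER10 intrinsics (__builtin_mma_*) that accumulate 4x4 FP32 outer products in dedicated accumulator registers. The following is a minimal, hedged sketch of that idea, not the upstreamed llamafile kernel: it assumes A is packed as K columns of 4 floats and B as K rows of 4 floats, and it is built with -mcpu=power10.

```cpp
// Hedged sketch (not the llamafile ppc64le kernel): a single 4x4 FP32 tile
// accumulated with POWER10 MMA builtins. Build with -mcpu=power10.
#include <altivec.h>

typedef vector unsigned char vec_t;

// C[4][4] = A * B, with A packed as K columns of 4 floats and B packed as
// K rows of 4 floats, so each k step is a contiguous load and a rank-1
// update of the 4x4 accumulator.
static void gemm_4x4_f32_mma(int K, const float *A, const float *B, float *C) {
    __vector_quad acc;
    __builtin_mma_xxsetaccz(&acc);                // zero the 4x4 accumulator
    for (int k = 0; k < K; ++k) {
        vec_t a = (vec_t) vec_xl(0, A + 4*k);     // 4 floats: column k of A
        vec_t b = (vec_t) vec_xl(0, B + 4*k);     // 4 floats: row k of B
        __builtin_mma_xvf32gerpp(&acc, a, b);     // acc += a (4x1) * b (1x4)
    }
    vector float rows[4];
    __builtin_mma_disassemble_acc(rows, &acc);    // unpack accumulator rows
    for (int i = 0; i < 4; ++i) {
        vec_xst(rows[i], 0, C + 4*i);             // store one 4-float row of C
    }
}
```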
Srihari-mcw
2f8bd2b901
llamafile : extend sgemm.cpp support for Q5_0 models (#10010) 2024-10-25 10:27:41 +03:00
slaren
23e0d70bac
ggml : move common CPU backend impl to new header (#9509) 2024-09-16 16:22:07 +02:00
Eve
5c3d0f1824
ggml : IQ4_NL sgemm + Q4_0 AVX optimization (#9422)
* squashed

re-add my iq4_nl sgemm PR https://github.com/ggerganov/llama.cpp/pull/8049

have ggml_vec_dot_q4_0 do two blocks per loop for AVX

try out an F16C ggml_vec_dot_iq4_nl, but it's not really faster; as per https://github.com/ggerganov/llama.cpp/pull/8549 we can calculate several blocks at a time with no issue

* shuffle

* remove the F16C iq4_nl path, as I can't make it faster than before
2024-09-16 09:48:24 +03:00
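
"Two blocks per loop" in the commit above refers to unrolling the quantized dot product so each iteration consumes more than one Q4_0 block. Below is a hedged sketch of that pattern, not the ggml_vec_dot_q4_0 implementation: the block layout is simplified (a float scale instead of fp16), the block count is assumed even, and it targets AVX2 with FMA.

```cpp
// Hedged sketch (not the llama.cpp kernel): AVX2 dot product over simplified
// Q4_0/Q8_0-style blocks, processing two blocks per outer loop iteration.
// Build with -mavx2 -mfma.
#include <immintrin.h>
#include <cstdint>

struct blk_q4 { float d; uint8_t qs[16]; };  // 32 x 4-bit quants, value = (q - 8) * d
struct blk_q8 { float d; int8_t  qs[32]; };  // 32 x 8-bit quants, value = q * d

static float dot_q4_q8(int nblocks, const blk_q4 *x, const blk_q8 *y) {
    __m256 acc = _mm256_setzero_ps();
    const __m256i m4  = _mm256_set1_epi8(0x0F);
    const __m256i off = _mm256_set1_epi8(8);
    for (int i = 0; i < nblocks; i += 2) {      // two blocks per outer iteration
        for (int j = 0; j < 2; ++j) {           // j = 0,1; unrolled by the compiler
            // unpack 32 4-bit values into 32 signed bytes (low nibbles in the
            // low 128-bit lane, high nibbles in the high lane), subtract 8
            __m128i q4    = _mm_loadu_si128((const __m128i *)x[i+j].qs);
            __m256i lo_hi = _mm256_set_m128i(_mm_srli_epi16(q4, 4), q4);
            __m256i q     = _mm256_sub_epi8(_mm256_and_si256(lo_hi, m4), off);
            __m256i q8    = _mm256_loadu_si256((const __m256i *)y[i+j].qs);
            // |q| * sign-adjusted q8, summed to 16-bit pairs, then 32-bit lanes
            __m256i p16   = _mm256_maddubs_epi16(_mm256_sign_epi8(q, q),
                                                 _mm256_sign_epi8(q8, q));
            __m256i p32   = _mm256_madd_epi16(p16, _mm256_set1_epi16(1));
            // scale the block's partial sums by d4*d8 and accumulate
            acc = _mm256_fmadd_ps(_mm256_set1_ps(x[i+j].d * y[i+j].d),
                                  _mm256_cvtepi32_ps(p32), acc);
        }
    }
    // horizontal sum of the 8 float lanes
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc), _mm256_extractf128_ps(acc, 1));
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    return _mm_cvtss_f32(s);
}
```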
Eve
e536426ded
llamafile : disable sgemm for batch-size 1 (#9330) 2024-09-07 22:02:26 +03:00
Srihari-mcw
ea5d7478b1
sgemm : improved Q4_0 and Q8_0 performance via 4xN and Mx4 gemm (#8908) 2024-08-31 11:20:35 +03:00
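
The 4xN and Mx4 shapes in the commit above are register-blocked GEMM micro-kernel tiles: several output rows or columns are kept in accumulator registers so every loaded A or B value is reused across the tile. A minimal sketch of the idea in plain C++ follows (shown for float for brevity; the commit applies the same tiling to quantized Q4_0/Q8_0 blocks):

```cpp
// Hedged sketch (not the sgemm.cpp code): a register-blocked 4xN micro-kernel.
// C[4][N] += A[4][K] * B[K][N], row-major; N is small (e.g. 4) so the
// accumulators stay in registers and each load of A is reused N times.
#include <cstddef>

template <int N>
void gemm_4xN(std::size_t K, const float *A, std::size_t lda,
              const float *B, std::size_t ldb,
              float *C, std::size_t ldc) {
    float acc[4][N] = {};                         // per-tile accumulators
    for (std::size_t k = 0; k < K; ++k) {
        for (int i = 0; i < 4; ++i) {
            const float a = A[i * lda + k];       // one load of A, reused N times
            for (int j = 0; j < N; ++j) {
                acc[i][j] += a * B[k * ldb + j];
            }
        }
    }
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < N; ++j)
            C[i * ldc + j] += acc[i][j];          // write the finished tile
}
```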
Georgi Gerganov
6b2a849d1f
ggml : move sgemm sources to llamafile subfolder (#8394)
ggml-ci
2024-07-10 15:23:29 +03:00
Renamed from ggml/src/sgemm.cpp