Diego Devesa
ae8de6d50a
ggml : build backends as libraries ( #10256 )
...
* ggml : build backends as libraries
---------
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com>
2024-11-14 18:04:35 +01:00
amritahs-ibm
e89213492d
ggml : optimize llamafile cpu matrix multiplication for ppc64le ( #10156 )
...
This change upstreams llamafile's cpu matrix
multiplication kernels for ppc64le using MMA
builtins for FP32 datatype.
This change results in a consistent 90%
improvement in input processing time, and 20%
to 80% improvement in output processing time,
across various batch sizes.
The patch is tested with Meta-Llama-3-8B,
Mistral-7B, Llama-2-7B-chat-hf models on an
IBM POWER10 machine.
Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com>
2024-11-09 09:17:50 +02:00
Srihari-mcw
2f8bd2b901
llamafile : extend sgemm.cpp support for Q5_0 models ( #10010 )
2024-10-25 10:27:41 +03:00
slaren
23e0d70bac
ggml : move common CPU backend impl to new header ( #9509 )
2024-09-16 16:22:07 +02:00
Eve
5c3d0f1824
ggml : IQ4_NL sgemm + Q4_0 AVX optimization ( #9422 )
...
* squashed
re-add my iq4_nl sgemm PR https://github.com/ggerganov/llama.cpp/pull/8049
have ggml_vec_dot_q4_0 do two blocks per loop for avx
try out f16c ggml_vec_dot_iq4_nl, but it's not really faster. as per https://github.com/ggerganov/llama.cpp/pull/8549 we can calculate several blocks at a time with no issue
* shuffle
* remove f16c iq4_nl as i can't make it faster than before
2024-09-16 09:48:24 +03:00
Eve
e536426ded
llamafile : disable sgemm for batch-size 1 ( #9330 )
2024-09-07 22:02:26 +03:00
Srihari-mcw
ea5d7478b1
sgemm : improved Q4_0 and Q8_0 performance via 4xN and Mx4 gemm ( #8908 )
2024-08-31 11:20:35 +03:00
Georgi Gerganov
6b2a849d1f
ggml : move sgemm sources to llamafile subfolder ( #8394 )
...
ggml-ci
2024-07-10 15:23:29 +03:00