Commit graph

4111 commits

Author SHA1 Message Date
FirstTimeEZ
a3dc73df7a
Merge branch 'ggerganov:master' into patch-2 2024-11-16 12:15:58 +13:00
Srihari-mcw
74d73dc85c
Make updates to fix issues with clang-cl builds while using AVX512 flags (#10314) 2024-11-15 22:27:00 +01:00
Johannes Gäßler
4047be74da
scripts: update compare-llama-bench.py (#10319) 2024-11-15 21:19:03 +01:00
slaren
883d206fbd ggml : fix some build issues 2024-11-15 21:45:32 +02:00
FirstTimeEZ
224f70a6d7
Merge branch 'ggerganov:master' into patch-2 2024-11-16 03:17:30 +13:00
Georgi Gerganov
09ecbcb596 cmake : fix ppc64 check (whisper/0)
ggml-ci
2024-11-15 15:44:06 +02:00
thewh1teagle
3225008973 ggml : vulkan logs (whisper/2547) 2024-11-15 15:44:06 +02:00
Georgi Gerganov
cbf5541a82 sync : ggml 2024-11-15 15:44:06 +02:00
Eve
18429220bd
AVX BF16 and single scale quant optimizations (#10212)
* use 128 bit loads (i've tried 256->128 to death and its slower)

* double accumulator

* avx bf16 vec dot

* +3% q4_0 inference

* +7% tg +5% pp compared to master

* slower f16c version, kep for reference

* 256b version, also slow. i tried :)

* revert f16

* faster with madd

* split to functions

* Q8_0 and IQ4_NL, 5-7% faster

* fix potential overflow (performance reduced)

* 16 bit add for q4_0 only

* merge
2024-11-15 12:47:58 +01:00
R0CKSTAR
f0204a0ec7
ci: build test musa with cmake (#10298)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2024-11-15 12:47:25 +01:00
FirstTimeEZ
c2410e2b05
Merge branch 'ggerganov:master' into patch-2 2024-11-16 00:22:38 +13:00
Romain Biessy
57f8355b29
sycl: Update Intel docker images to use DPC++ 2025.0 (#10305) 2024-11-15 13:10:45 +02:00
FirstTimeEZ
fcf90a1d23
Merge branch 'ggerganov:master' into patch-2 2024-11-15 23:23:38 +13:00
Xuan Son Nguyen
9901068ac7
server : (web UI) add copy button for code block, fix api key (#10242)
* server : (web ui) add copy btn for code blocks

* fix problem with api key

* use settings-modal-short-input component

* always show copy btn for code snippet
2024-11-15 10:48:49 +01:00
FirstTimeEZ
469bb57456
Merge branch 'ggerganov:master' into patch-2 2024-11-15 20:15:58 +13:00
Chenguang Li
231f9360d9
cann: dockerfile and doc adjustment (#10302)
Co-authored-by: noemotiovon <noemotiovon@gmail.com>
2024-11-15 15:09:35 +08:00
Georgi Gerganov
4802ad350b
scripts : fix regex in sync [no ci] 2024-11-15 08:38:43 +02:00
FirstTimeEZ
e9b707ebee
Merge branch 'ggerganov:master' into patch-2 2024-11-15 16:47:21 +13:00
Romain Biessy
5a54af4d4f
sycl: Use syclcompat::dp4a (#10267)
* sycl: Use syclcompat::dp4a

* Using the syclcompat version allow the compiler to optimize the
  operation with native function

* Update news section

* Update CI Windows oneAPI version to 2025.0

* Reword doc

* Call syclcompat::dp4a inside dpct::dp4a

This reverts commit 90cb61d692.
2024-11-15 11:09:12 +08:00
FirstTimeEZ
5d3e7c3579
Merge branch 'ggerganov:master' into patch-2 2024-11-15 13:47:17 +13:00
Charles Xu
1607a5e5b0
backend cpu: add online flow for aarch64 Q4_0 GEMV/GEMM kernels (#9921)
* backend-cpu: add online flow for aarch64 Q4_0 GEMV/GEMM kernels

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2024-11-15 01:28:50 +01:00
FirstTimeEZ
e343f74167
Merge branch 'ggerganov:master' into patch-2 2024-11-15 12:20:17 +13:00
Diego Devesa
ae8de6d50a
ggml : build backends as libraries (#10256)
* ggml : build backends as libraries

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com>
2024-11-14 18:04:35 +01:00
FirstTimeEZ
4fc8409a71
Update llama.cpp 2024-11-15 03:36:03 +13:00
FirstTimeEZ
d205ee9273
llama_model_n_params 2024-11-15 03:11:35 +13:00
FirstTimeEZ
0a7fecc17c
llama_model_n_params 2024-11-15 03:11:14 +13:00
FirstTimeEZ
970f1515c9
total size of tensors is size_t
total size of tensors is size_t
total number of parameters is int64_t
2024-11-15 01:23:17 +13:00
FirstTimeEZ
aeb803200b
Merge branch 'ggerganov:master' into patch-2 2024-11-15 01:00:25 +13:00
Johannes Gäßler
4a8ccb37ad
CUDA: no -sm row for very small matrices (#10185) 2024-11-14 13:00:15 +01:00
FirstTimeEZ
229cd05f0d
removes the implicit cast 2024-11-15 00:58:53 +13:00
FirstTimeEZ
443f5cb5bd
Merge branch 'ggerganov:master' into patch-2 2024-11-15 00:48:38 +13:00
FirstTimeEZ
76355779d0
save number of parameters and the size in llama_model 2024-11-14 22:51:47 +13:00
Georgi Gerganov
2a82891a85
speculative : fix out-of-bounds access (#10289) 2024-11-14 11:44:15 +02:00
FirstTimeEZ
0a0f91df61
save number of parameters and the size in llama_model
https://github.com/ggerganov/llama.cpp/issues/10285
2024-11-14 20:40:14 +13:00
Jeff Bolz
af148c9386
vulkan: Optimize binary ops (#10270)
Reuse the index calculations across all of src0/src1/dst. Add a shader
variant for when src0/src1 are the same dimensions and additional modulus
for src1 aren't needed. Div/mod are slow, so add "fast" div/mod that
have a fast path when the calculation isn't needed or can be done more
cheaply.
2024-11-14 06:22:55 +01:00
Jeff Bolz
66798e42fb
vulkan: Use macros to make the mat mul pipeline creation more concise (#10259)
Also add vk_matmul_pipeline2 to hold f16/f32 accumulator versions of a
pipeline. This isn't really used yet.
2024-11-13 21:59:47 +01:00
Michael Podvitskiy
fb4a0ec083
llama : propagate the results of graph_compute (#9525)
* llama: propagating the results of `graph_compute` to the user interface

* llama: reverting kv_cache in case of failed compute

* llama: `llama_kv_cache_state` was removed, only the result of `llama_graph_compute` is returned

* llama: restore a kv_cache in case of failed computation

* llama: correct reverting of the entire batch.
also updates `llama_kv_cache_find_slot`, will correctly count the number of `used` cells for recurrent models

* llama: updated comments

* llama : add comments about KV cache state after error

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-11-13 20:00:35 +02:00
Georgi Gerganov
5ea926dad7
sync : ggml 2024-11-13 18:11:54 +02:00
Small Grass Forest
1ee9eea094
docs : update bindings list (#10261)
Signed-off-by: tianzixuan <tianzixuan335@hellobike.com>
2024-11-13 13:17:10 +02:00
Alexey Parfenov
ff7fb670d0
server : add missing docs (#10269) 2024-11-13 13:16:30 +02:00
Jhen-Jie Hong
0e712a5acb
server : fix incorrect res in validate_model_chat_template (#10272)
* server : fix validate_model_chat_template

* server : fix chat res
2024-11-13 13:15:23 +02:00
Brian
a0ec17b32e
metadata: Detailed Dataset Authorship Metadata (#8875)
Converter script can now read these two fields as a detailed base model and dataset source.
This was done so that it will be easier for Hugging Face to integrate detailed metadata as needed.

 -  base_model_sources (List[dict], optional)
 -  dataset_sources (List[dict], optional)

Dataset now represented as:

 - general.dataset.count
 - general.dataset.{id}.name
 - general.dataset.{id}.author
 - general.dataset.{id}.version
 - general.dataset.{id}.organization
 - general.dataset.{id}.description
 - general.dataset.{id}.url
 - general.dataset.{id}.doi
 - general.dataset.{id}.uuid
 - general.dataset.{id}.repo_url

This also adds to base model these metadata:

 - general.base_model.{id}.description
2024-11-13 21:10:38 +11:00
Alberto Cabrera Pérez
2e82ffa4af
sycl : Fixes to broken builds and test-backend-ops (#10257)
* Fixes broken build for the SYCL CUDA backend caused by non-explicit gemm call in outprod (merged in with RWKV6 in
Optimize RWKV6 Operator Naming and Implement Multi-core CPU/ SYCL Acceleration #10133)

* Marks permuted MUL_MAT as unsupported to be able to run test-backend-ops

* Fixes asserts in norm to fix debug builds.
2024-11-13 09:40:57 +00:00
Jeff Bolz
80dd7ff22f
vulkan: Optimize contiguous copies (#10254)
* tests: Fix memory bandwidth calculation for perf tests

Add a flops calculation for flash attention.

Add one GGML_OP_CPY perf test.

* vulkan: Optimize contiguous copies

Add a variant of the copy shader for when the tensors are contiguous. Avoid
the complex addressing calculations, and do four elements per invocation
to hide some other overhead.

Apply similar changes to the scale shader, since scale is always contiguous.

Add a "progress bar" for shader compiles.
2024-11-13 07:58:57 +01:00
Jeff Bolz
54ef9cfc72
vulkan: Throttle the number of shader compiles during the build step. (#10222)
Fixes #9582

Spawning too many concurrent copies of glslc leads to "Failed to create pipes"
errors on Linux. This change applies the same throttling we use for
multithreaded pipeline creation.
2024-11-11 18:13:51 +01:00
Georgi Gerganov
b0cefea58a
metal : more precise Q*K in FA vec kernel (#10247) 2024-11-11 08:39:13 +02:00
Georgi Gerganov
b141e5f6ef
server : enable KV cache defrag by default (#10233)
ggml-ci
2024-11-11 08:38:43 +02:00
Georgi Gerganov
4b3a9212b6
flake.lock: Update (#10243)
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/807e9154dcb16384b1b765ebe9cd2bba2ac287fd?narHash=sha256-l253w0XMT8nWHGXuXqyiIC/bMvh1VRszGXgdpQlfhvU%3D' (2024-10-29)
  → 'github:NixOS/nixpkgs/4aa36568d413aca0ea84a1684d2d46f55dbabad7?narHash=sha256-Zwl8YgTVJTEum%2BL%2B0zVAWvXAGbWAuXHax3KzuejaDyo%3D' (2024-11-05)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-11-10 11:45:25 -08:00
MaggotHATE
505f33274d
server : (web UI) Add back sampler settings (#10239)
* Add back samplers to server

* Added tooltips with basic information

* Fixed stretching of input fields.

* use component for settings input, move help msg to tooltips

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-11-10 15:42:25 -04:00
Jeff Bolz
160687b3ed
vulkan: Fix newly added tests for permuted mul_mat and 1D im2col (#10226) 2024-11-10 12:37:56 +01:00