Commit graph

2508 commits

Author SHA1 Message Date
Georgi Gerganov
dbc35acff0
llama : introduce some typedef helpers 2024-03-22 10:58:42 +02:00
Georgi Gerganov
8326607cfe
llama : minor
ggml-ci
2024-03-22 10:18:04 +02:00
Pierrick HYMBERT
e474e456eb llama_split_prefix: use a clearer version, not pass split path len but dest max len.
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-03-22 07:48:50 +01:00
Pierrick HYMBERT
4c04400969 llama_model_loader: fix map -> unordered map 2024-03-22 07:07:00 +01:00
Pierrick HYMBERT
b19af3643f llama_model_loader: be sure the model mappings has enough capacity before allocating backend buffer 2024-03-22 07:03:14 +01:00
Pierrick HYMBERT
a9e88c6e57 llama_model_loader: immediately add the backend buffer to the model buffers in order to free them if an error occurs in the next allocation. Reserve the expected size. 2024-03-22 06:59:04 +01:00
Pierrick HYMBERT
ec372c66a4 llama_model_loader: use at instead of operator[] if this should never add to the map. 2024-03-22 06:52:00 +01:00
Pierrick HYMBERT
9940df4f11 llama_model_loader: ensure mappings vector has the expected size 2024-03-22 06:51:21 +01:00
Pierrick HYMBERT
7cbe1eac78 llama_model_loader: if the declared n_tensors does not equal the number of tensors loaded in the split, throw an exception instead of asserting 2024-03-22 06:48:15 +01:00
Pierrick Hymbert
1a179bfc4e
fix loop over pointer
Co-authored-by: slaren <slarengh@gmail.com>
2024-03-22 00:38:23 +01:00
Pierrick Hymbert
0fd652eba7
spacing
Co-authored-by: slaren <slarengh@gmail.com>
2024-03-22 00:37:01 +01:00
Pierrick HYMBERT
f9a29735fc llama_model_loader: fail if any of backend buffer cannot be allocated 2024-03-22 00:25:11 +01:00
Pierrick HYMBERT
6df9757ad6 llama_model_loader: minor, use same variable name for consistency, fix spacing in types cast 2024-03-22 00:02:55 +01:00
Pierrick HYMBERT
69bdee939a llama_model_loader: only map tensors included in the context 2024-03-21 23:58:12 +01:00
Pierrick HYMBERT
078a1aca06 llama_model_loader: map file to backend buffer if the allocation succeeds only 2024-03-21 23:57:43 +01:00
slaren
02020b0463 fix mmap buffer management 2024-03-21 22:06:37 +01:00
Pierrick HYMBERT
d8b567d254 llama_model_loader: fail if backend cannot allocate buffer 2024-03-21 21:05:15 +01:00
Pierrick Hymbert
1c931f3d4f
Handle optional tensors
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-21 20:50:28 +01:00
Pierrick Hymbert
c34a5deee8
Simplify this by making these optional, switch some layer creation tensor optional
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-21 20:50:11 +01:00
Pierrick HYMBERT
00381b07bb avoid copying the entire vector 2024-03-21 19:18:39 +01:00
Pierrick HYMBERT
1892ae7eb1 llama_model_loader: PR feedback:
- use only one gguf_context for metadata only
- store all ggml_context in a vector as the files and mappings
- store all weights in a vector along with the source tensor
- rename ctx_gguf to meta
- rename ctx_meta to contexts
2024-03-21 19:11:37 +01:00
Pierrick HYMBERT
60a87ae051 Merge branch 'master' into hp/split/load-model 2024-03-21 11:48:58 +01:00
Vaibhav Srivastav
1943c01981
ci : fix indentation error (#6195) 2024-03-21 11:30:40 +02:00
Vaibhav Srivastav
5e43ba8742
build : add mac pre-build binaries (#6182)
* Initial commit - add mac prebuilds.

* forward contribution credits for building the workflow.

* minor : remove trailing whitespaces

---------

Co-authored-by: Nicolas Patry <Narsil@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-21 11:13:12 +02:00
Kawrakow
76aa30a263
Add ability to use Q5_0, Q5_1, and IQ4_NL for quantized K cache (#6183)
* k_cache: be able to use Q5_0

* k_cache: be able to use Q5_1 on CUDA

* k_cache: be able to use Q5_0 on Metal

* k_cache: be able to use Q5_1 on Metal

* k_cache: be able to use IQ4_NL - just CUDA for now

* k_cache: be able to use IQ4_NL on Metal

* k_cache: add newly added supported types to llama-bench and CUDA supports_op

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-03-21 08:27:57 +01:00
AidanBeltonS
c5b8595e3f
Add nvidia and amd backends (#6157) 2024-03-21 11:40:52 +05:30
Pierrick HYMBERT
18ff6ca847 split: move llama_tensor_offset to llama_model_loader 2024-03-21 07:06:14 +01:00
Pierrick Hymbert
b8feff411f
Avoid copying the entire vector
Co-authored-by: slaren <slarengh@gmail.com>
2024-03-21 04:36:06 +01:00
slaren
42e21c6882
cuda : fix conflict with std::swap (#6186) 2024-03-21 01:47:46 +01:00
Pierrick HYMBERT
7c64fef91b split: support in llama_model_loader 2024-03-20 22:30:20 +01:00
slaren
1c51f98adc
cuda : print the returned error when CUDA initialization fails (#6185) 2024-03-20 21:03:26 +01:00
Ziang Wu
f9c7ba3447
llava : update MobileVLM-README.md (#6180) 2024-03-20 17:29:51 +02:00
Ziang Wu
272935b281
llava : add MobileVLM_V2 backup (#6175)
* Add MobileVLM_V2 backup

* Update MobileVLM-README.md

* Update examples/llava/MobileVLM-README.md

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update examples/llava/convert-image-encoder-to-gguf.py

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* clip :  fix whitespace

* fix definition mistake in clip.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-20 17:02:32 +02:00
slaren
ccf58aa3ec
cuda : refactor to remove global resources (#6170)
* cuda : refactor to remove global resources
2024-03-20 14:42:59 +01:00
Xuan Son Nguyen
91f8ad167d
Server: version bump for httplib and json (#6169)
* server: version bump for httplib and json

* fix build

* bring back content_length
2024-03-20 13:30:36 +01:00
Georgi Gerganov
6b7e76d28c
gitignore : ignore curl-related files 2024-03-20 14:17:34 +02:00
Georgi Gerganov
bc0baab2ea
server : allow to override -ngl in tests (#6170) 2024-03-20 14:14:32 +02:00
Georgi Gerganov
d795988d9e
Revert "llava : add a MobileVLM_V2-1.7B backup (#6152)"
This reverts commit f8c4e745e1.
2024-03-20 13:29:49 +02:00
Ziang Wu
f8c4e745e1
llava : add a MobileVLM_V2-1.7B backup (#6152)
* Add MobileVLM_V2 backup

* Update MobileVLM-README.md

* Update examples/llava/MobileVLM-README.md

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update examples/llava/convert-image-encoder-to-gguf.py

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* clip :  fix whitespace

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-20 13:20:37 +02:00
Karthick
47cc7a7bf9
Server: Handle n_keep parameter in the request (#6174) 2024-03-20 12:02:34 +01:00
Jared Van Bortel
bd60d82d0c
server tests : more pythonic process management; fix bare except: (#6146)
* server tests : remove seemingly redundant newlines in print()

* server tests : use built-in subprocess features, not os.kill and psutil

* server tests : do not catch e.g. SystemExit; use print_exc

* server tests: handle TimeoutExpired exception

* server tests: fix connect on dual-stack systems

* server: tests: add new tokens regex on windows generated following new repeat penalties default changed in (#6127)

* server: tests: remove the hack on windows since now we get the good socket family

* server: tests: add new tokens regex following new repeat penalties default changed in (#6127)

* server: tests: add new tokens regex following new repeat penalties default changed in (#6127)

---------

Co-authored-by: Pierrick HYMBERT <pierrick.hymbert@gmail.com>
2024-03-20 06:33:49 +01:00
Neo Zhang Jianyu
6c0b287748
update readme sycl for new update (#6151)
* update readme sycl for new update

* Update README-sycl.md

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>

* Update README-sycl.md

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>

* Update README-sycl.md

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>

* Update README-sycl.md

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>

* Update README-sycl.md

Co-authored-by: AidanBeltonS <87009434+AidanBeltonS@users.noreply.github.com>

* Update README-sycl.md

Co-authored-by: AidanBeltonS <87009434+AidanBeltonS@users.noreply.github.com>

* update by review comments

* update w64devkit link

* update for verify device id part

* Update README-sycl.md

Co-authored-by: Meng, Hengyu <airdldl@163.com>

---------

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
Co-authored-by: AidanBeltonS <87009434+AidanBeltonS@users.noreply.github.com>
Co-authored-by: Meng, Hengyu <airdldl@163.com>
2024-03-20 11:21:41 +08:00
Abhilash Majumder
d26e8b669d
increase igpu cluster limit (#6159) 2024-03-20 08:28:49 +05:30
DAN™
d8b009a945
Remove unneeded header file. (#6158) 2024-03-19 17:16:09 +01:00
Pierrick Hymbert
d0d5de42e5
gguf-split: split and merge gguf per batch of tensors (#6135)
* gguf-split: split and merge gguf files per tensor

* gguf-split: build with make toolchain

* gguf-split: rename `--split-tensors-size` to `--split-max-tensors`. Set general.split_count KV to all split

* split : minor style + fix compile warnings

* gguf-split: remove --upload not implemented

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-19 12:05:44 +01:00
Georgi Gerganov
b80cf3b2d1
common : disable repeat penalties by default (#6127) 2024-03-19 10:21:54 +02:00
slaren
970a48060a
ci : exempt some labels from being tagged as stale (#6140) 2024-03-19 10:06:54 +02:00
DAN™
4c28b82529
common : print usage on '-h' and '--help' (#6145) 2024-03-19 07:59:36 +02:00
github-actions[bot]
2d15886bb0 flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/9df3e30ce24fd28c7b3e2de0d986769db5d6225d' (2024-03-06)
  → 'github:NixOS/nixpkgs/d691274a972b3165335d261cc4671335f5c67de9' (2024-03-14)
2024-03-18 18:51:30 +00:00
Jared Van Bortel
d199ca79f2
mpt : implement backwards compatibility with duped output tensor (#6139) 2024-03-18 12:49:02 -04:00