Commit graph

2904 commits

Author SHA1 Message Date
Julia Longtin
bb5eb95816 use better memory save operator. 2024-03-23 20:49:11 +00:00
Julia Longtin
9d7ca41703 expand mask, and align memory. 2024-03-23 20:48:43 +00:00
Julia Longtin
bd6d7e6238 try to use vectorized zeroing function. 2024-03-23 19:55:12 +00:00
Julia Longtin
f985372e3a add missing variable. 2024-03-23 19:49:16 +00:00
Julia Longtin
31d4f9312b copy right block. 2024-03-23 19:47:21 +00:00
Georgi Gerganov
95562175f8
gitignore : gguf-split 2024-03-23 21:35:23 +02:00
Pierrick Hymbert
f482bb2e49
common: llama_load_model_from_url split support (#6192)
* llama: llama_split_prefix fix strncpy does not include string termination
common: llama_load_model_from_url:
 - fix header name case sensitive
 - support downloading additional split in parallel
 - hide password in url

* common: EOL EOF

* common: remove redundant LLAMA_CURL_MAX_PATH_LENGTH definition

* common: change max url max length

* common: minor comment

* server: support HF URL options

* llama: llama_model_loader fix log

* common: use a constant for max url length

* common: clean up curl if file cannot be loaded in gguf

* server: tests: add split tests, and HF options params

* common: move llama_download_hide_password_in_url inside llama_download_file as a lambda

* server: tests: enable back Release test on PR

* spacing

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* spacing

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* spacing

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-23 18:07:00 +01:00
Pierrick Hymbert
1997577d5e
server: docs: --threads and --threads, --ubatch-size, --log-disable (#6254) 2024-03-23 18:00:38 +01:00
Julius Arkenberg
476b0251b2
llama : add grok-1 support (#6204)
* Add support for Grok model architecture

* Revert convert-hf-to-gguf to default options

* Fixed f_norm_rms_eps bug

* Fix whitespaces

* llama : fix grok rope type

* llama : minor

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-23 18:41:53 +02:00
Julia Longtin
e43a63e7c6 fix typo. 2024-03-23 16:29:30 +00:00
Julia Longtin
f092a10dc9 promote aux16 into a vector. (part three) 2024-03-23 16:27:11 +00:00
Julia Longtin
c72157a5a6 promote aux16 into a vector. 2024-03-23 16:24:11 +00:00
Julia Longtin
e3503c924a promote aux16 into a vector. 2024-03-23 16:21:20 +00:00
Julia Longtin
edb76ffddb formatting improvement. 2024-03-23 16:19:17 +00:00
Pierrick Hymbert
21cad01b6e
split: add gguf-split in the make build target (#6262) 2024-03-23 17:18:13 +01:00
Julia Longtin
6face8a0be first fixes. 2024-03-23 15:56:47 +00:00
Julia Longtin
0a2051aa88 attempt to speed up float clearing. 2024-03-23 15:55:00 +00:00
Julia Longtin
0b012c03ef allow using code from ggml-phi-knc-dot_q5_K_q8_K.c 2024-03-23 15:02:56 +00:00
Julia Longtin
0b3f17127f force to compile. 2024-03-23 14:58:33 +00:00
Julia Longtin
18f353987c tell ggml-common.h to export what we want. 2024-03-23 14:49:35 +00:00
Julia Longtin
cd20404250 pull in ggml specific types. 2024-03-23 14:38:15 +00:00
Julia Longtin
8f57803f58 import stdio.h for size_t. 2024-03-23 14:29:59 +00:00
Julia Longtin
9bcb8350d5 import stdint.h for sizeSt. 2024-03-23 14:28:29 +00:00
Julia Longtin
a7bd64c130 begin work on targeting dot_q5_K_q8_K. 2024-03-23 14:19:47 +00:00
Pierrick Hymbert
1b26aebe4d
server: flush stdout after logging in both text and json layout (#6253) 2024-03-23 13:18:45 +01:00
Johannes Gäßler
50ccaf5eac
lookup: complement data from context with general text statistics (#5479)
* lookup: evaluation tools, use corpus/previous gens

* fixup! lookup: evaluation tools, use corpus/previous gens

* fixup! lookup: evaluation tools, use corpus/previous gens

* fixup! lookup: evaluation tools, use corpus/previous gens

* fixup! lookup: evaluation tools, use corpus/previous gens
2024-03-23 01:24:36 +01:00
Georgi Gerganov
56a00f0a2f
common : default --hf-file to --model (#6234) 2024-03-22 21:10:39 +02:00
fraxy-v
92397d87a4
convert-llama2c-to-ggml : enable conversion of GQA models (#6237)
* convert-llama2c-to-ggml: enable conversion of multiqueries, #5608

* add test in build action

* Update build.yml

* Update build.yml

* Update build.yml

* gg patch
2024-03-22 20:49:06 +02:00
Kawrakow
1d0331c12a
quantize: options for output and token embedding tensors qtype (#6239)
* quantize: be able to specify the output tensor type

* quantize: be able to specify the token embedding tensor type

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-03-22 20:47:14 +02:00
Pierrick Hymbert
dba1af6129
llama_model_loader: support multiple split/shard GGUFs (#6187)
* split: support in llama_model_loader

* avoid copying the entire vector

Co-authored-by: slaren <slarengh@gmail.com>

* split: move llama_tensor_offset to llama_model_loader

* llama_model_loader: PR feedbacks:
 - use only one gguf_context for metadata only
 - store all ggml_context in a vector as the files and mappings
 - store all weights in a vector along with the source tensor
 - rename ctx_gguf to meta
 - rename ctx_meta to contexts

* avoid copying the entire vector

* Simplify this by making these optional, switch some layer creation tensor optional

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Handle optional tensors

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* llama_model_loader: fail if backend cannot allocate buffer

* fix mmap buffer management

* llama_model_loader: map file to backend buffer if the allocation succeeds only

* llama_model_loader: only map tensors included in the context

* llama_model_loader: minor, use same variable name for consistency, fix spacing in types cast

* llama_model_loader: fail if any of backend buffer cannot be allocated

* spacing

Co-authored-by: slaren <slarengh@gmail.com>

* fix loop over pointer

Co-authored-by: slaren <slarengh@gmail.com>

* llama_model_loader: if n_tensors declared not equals to loaded tensors in split, throw an exception instead of asserting

* llama_model_loader: ensure mappings vector has the expected size

* llama_model_loader:  use at instead of operator[] if this should never add to the map.

* llama_model_loader: immediately add the backend buffer to the model buffers in order to free them if an error occurs in the next allocation. Reserve the expected size.

* llama_model_loader: be sure the model mappings has enough capacity before allocating backend buffer

* llama_model_loader: fix map -> unordered map

* llama_split_prefix: use a clearer version, not pass split path len but dest max len.

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

* llama : minor

ggml-ci

* llama : introduce some typedef helpers

* docs: add model shard in hot topic

* llama_model_loader: put mapping in a unique_ptr from the moment it is allocated

Co-authored-by: slaren <slarengh@gmail.com>

* fix llama_split_prefix

---------

Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-03-22 19:00:01 +01:00
Minsoo Cheong
ee804f6223
ci: apply concurrency limit for github workflows (#6243) 2024-03-22 19:15:06 +02:00
Georgi Gerganov
80bd33bc2c
common : add HF arg helpers (#6234)
* common : add HF arg helpers

* common : remove defaults
2024-03-22 15:33:38 +02:00
Nexesenex
e80f06d2a1
llama : correction of the attn.v.weight quantization for IQ3_XS (#6209)
IQ3_XS was not mentioned, IQ3_S and IQ3_M were present twice.

That PR corrects this in the manner which was probably intended initially.
2024-03-22 15:32:02 +02:00
Olivier Chafik
f77a8ffd3b
tests : conditional python & node json schema tests (#6207)
* json: only attempt python & node schema conversion tests if their bins are present

Tests introduced in https://github.com/ggerganov/llama.cpp/pull/5978
disabled in https://github.com/ggerganov/llama.cpp/pull/6198

* json: orange warnings when tests skipped

* json: ensure py/js schema conv tested on ubuntu-focal-make

* json: print env vars in test
2024-03-22 15:09:07 +02:00
Olivier Chafik
72114edf06
json-schema-to-grammar : fix order of props + non-str const/enum (#6232)
* json: ordered json in server/schema converter to respect orig order

* json: ws nits

* json: support non-string const / enums
2024-03-22 15:07:44 +02:00
slaren
2f0e81e053
cuda : add LLAMA_CUDA_NO_PEER_COPY to workaround broken ROCm p2p copy (#6208)
* cuda : add LLAMA_CUDA_NO_PEER_COPY to workaround broken ROCm p2p copy

* add LLAMA_CUDA_NO_PEER_COPY to HIP build
2024-03-22 14:05:31 +01:00
Xiaoyi Chen
29ab270e65
readme : add RecurseChat to the list of UIs (#6219) 2024-03-22 13:29:49 +02:00
Jan Boon
6b8bb3a31d
server : fix n_keep always showing as 0 in response (#6211) 2024-03-22 13:12:05 +02:00
Georgi Gerganov
68e210b354
server : enable continuous batching by default (#6231) 2024-03-22 13:08:28 +02:00
Georgi Gerganov
b3e94f26ba
metal : proper assert for mat-mat memory alignment (#6225)
* metal : proper assert for mat-mat memory alignment

ggml-ci

* readme : add notice about the bug fix

* metal : fix the fix

ggml-ci
2024-03-22 11:35:53 +02:00
Vaibhav Srivastav
b2075fd6a5
ci : add CURL flag for the mac builds (#6214) 2024-03-22 09:53:43 +02:00
Georgi Gerganov
95d576b48e
metal : pad n_ctx by 32 (#6177)
* metal : require ne00 >= 128 for mat-mat kernels

ggml-ci

* llama : pad n_ctx by 32

ggml-ci
2024-03-22 09:36:03 +02:00
Neo Zhang Jianyu
59c17f02de
add blog link (#6222) 2024-03-22 15:19:37 +08:00
DAN™
fa046eafbc
Fix params underscore convert to dash. (#6203)
* Fix params underscore convert to dash.

* Update common/common.cpp

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-03-22 02:32:42 +01:00
Jan Boon
be07a03217
server : update readme doc from slot_id to id_slot (#6213) 2024-03-21 23:41:24 +01:00
Julia Longtin
9185e14922 be more specific about the length of our list of run amounts. 2024-03-21 20:38:49 +00:00
slaren
d0a71233fb
cuda : disable host register by default (#6206) 2024-03-21 20:54:28 +02:00
Julia Longtin
0979522fbe spacing changes. 2024-03-21 18:36:25 +00:00
semidark
f372c49ccd
Corrected typo to wrong file (#6199)
The stated file `./devops/main-server.Dockerfile` does not exist. I figure that `.devops/server-intel.Dockerfile` was meant.
2024-03-21 18:52:35 +01:00
Georgi Gerganov
924ce1dce7
tests : disable system() calls (#6198)
ggml-ci
2024-03-21 16:20:05 +02:00