Commit graph

2890 commits

Author SHA1 Message Date
Minsoo Cheong
deb7240100
embedding : adjust n_ubatch value (#6296)
* embedding: assign `n_ubatch` value, print error on `n_batch` overflow

* Update examples/embedding/embedding.cpp

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

* use %ld instead of %lld

* Revert "use %ld instead of %lld"

This reverts commit ea753ede90.

---------

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-03-26 11:11:46 +02:00
Jan Boon
3d032ece8e
server : add n_discard parameter (#6300) 2024-03-26 10:47:43 +02:00
Joseph Stahl
e190f1fca6
nix: make xcrun visible in Nix sandbox for precompiling Metal shaders (#6118)
* Symlink to /usr/bin/xcrun so that `xcrun` binary
is usable during build (used for compiling Metal shaders)

Fixes https://github.com/ggerganov/llama.cpp/issues/6117

* cmake - copy default.metallib to install directory

When Metal files are compiled to default.metallib, CMake needs to add this to the install directory so that it's visible to llama-cpp

Also, update package.nix to use absolute path for default.metallib (it's not finding the bundle)

* add `precompileMetalShaders` flag (defaults to false) to disable precompilation of Metal shaders

Precompilation requires Xcode to be installed and requires disabling the sandbox on nix-darwin
2024-03-25 17:51:46 -07:00
slaren
280345968d
cuda : rename build flag to LLAMA_CUDA (#6299) 2024-03-26 01:16:01 +01:00
Julia Longtin
12c9576aec fix vector sizes. 2024-03-25 19:43:37 +00:00
Christian Kögler
b06c16ef9f
nix: fix blas support (#6281)
Since no blas was provided to buildInputs, the executable is built without blas support.

This is a backport of NixOS/nixpkgs#298567
2024-03-25 10:52:45 -07:00
Kawrakow
1f2fd4e727
tests : include IQ2_XXS and IQ2_XS in test-quantize-fns (#6303)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-03-25 19:33:15 +02:00
Georgi Gerganov
43139cc528
flake.lock: Update (#6266)
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/d691274a972b3165335d261cc4671335f5c67de9' (2024-03-14)
  → 'github:NixOS/nixpkgs/44d0940ea560dee511026a53f0e2e2cde489b4d4' (2024-03-23)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-03-25 08:22:27 -07:00
slaren
2f34b865b6
cuda : fix LLAMA_CUDA_F16 build (#6298) 2024-03-25 16:43:22 +02:00
slaren
ae1f211ce2
cuda : refactor into multiple files (#6269) 2024-03-25 13:50:23 +01:00
Xuan Son Nguyen
ad3a0505e3
Server: clean up OAI params parsing function (#6284)
* server: clean up oai parsing function

* fix response_format

* fix empty response_format

* minor fixes

* add TODO for logprobs

* update docs
2024-03-25 09:42:17 +01:00
Neo Zhang Jianyu
95ad616cdd
[SYCL] fix SYCL backend build on Windows broken by LOG() error (#6290)
* fix LOG() error for SYCL, enhance error check by CI

* rollback to bash

* add newline at end of file
2024-03-25 15:52:41 +08:00
Minsoo Cheong
64e7b47c69
examples : add "retrieval" (#6193)
* add `retrieval` example

* add README

* minor fixes

* cast filepos on print

* remove use of variable sized array

* store similarities in separate vector

* print error on insufficient batch size

* fix error message printing

* assign n_batch value to n_ubatch

* fix param definitions

* define retrieval-only parameters in retrieval.cpp

* fix `--context-file` option to be provided multiple times for multiple files

* use vector for `query_emb`

* add usage description in README

* fix merge conflict

* fix usage printing

* remove seed setting

* fix lint

* increase file read buffer size

* retrieval : minor

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-25 09:38:22 +02:00
Justine Tunney
7733f0c760
ggml : support AVX512VNNI (#6280)
This change causes some quants (e.g. Q4_0, Q8_0) to go faster on some
architectures (e.g. AMD Zen 4).
2024-03-25 07:39:56 +02:00
Rick G
a32b77c4b2
Fix heap corruption from wmode out-of-bound writes on windows (#6272)
* would throw error on VS2022 on GGML_FREE(wmode)
* wchar_t is wider than one byte (2 bytes on Windows), but malloc takes its size in bytes
  * therefore `*wmode_p++ = (wchar_t)*mode;` could write off the end of the allocation
* Fixes error possibly introduced by https://github.com/ggerganov/llama.cpp/pull/6248
2024-03-24 22:45:56 +01:00
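
The allocation problem described in #6272 can be illustrated with a minimal sketch. This is not the code from the commit: `widen_mode` is a hypothetical helper, and the only point it shows is that an element count has to be multiplied by `sizeof(wchar_t)` before it is passed to `malloc`, otherwise the widening loop writes past the end of the buffer.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <cwchar>

// Hypothetical helper (an assumption for illustration, not the ggml function):
// widen an ASCII fopen-style mode string such as "rb" into a newly allocated
// wide string. The caller owns the result and must release it with free().
wchar_t * widen_mode(const char * mode) {
    const size_t n = std::strlen(mode);

    // malloc takes a size in BYTES. Passing only (n + 1) would reserve room
    // for n + 1 bytes, not n + 1 wide characters, so scale by sizeof(wchar_t).
    wchar_t * wmode = static_cast<wchar_t *>(std::malloc((n + 1) * sizeof(wchar_t)));
    if (wmode == nullptr) {
        return nullptr;
    }

    wchar_t * p = wmode;
    while (*mode != '\0') {
        *p++ = static_cast<wchar_t>(*mode++);  // one wide element per input char
    }
    *p = L'\0';  // the terminator also occupies a full wchar_t
    return wmode;
}
```

Calling `widen_mode("rb")` yields a two-character wide string that can be handed to a wide-character API and then freed.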
Georgi Gerganov
a0e584defd
imatrix : fix wname for mul_mat_id ops (#6271)
* imatrix : fix wname for mul_mat_id ops

* also filter tensor names in mul_mat_id ops

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-03-24 16:18:45 +02:00
Julia Longtin
bc3d6db862 separate filling aux16 from consuming aux16 by making it an array of vectors. 2024-03-24 14:18:08 +00:00
Julia Longtin
ca0dc26704 loosen alignment requirements for zeros, add missing function, and promote aux8 to an array of vectors. 2024-03-24 13:35:05 +00:00
Johannes Gäßler
7aed0ffe68
Fixed lookup compilation issues on Windows (#6273) 2024-03-24 14:21:17 +01:00
Julia Longtin
cf481cf901 promote aux8 into a vector. 2024-03-24 12:50:01 +00:00
Julia Longtin
169a145409 fix our reference to src in the second place, and use a more accurate comment. 2024-03-24 12:41:21 +00:00
Julia Longtin
c28bfe4552 spacing changes, eliminate dead references to k1 or zero, and use the right type when referring to src. 2024-03-24 12:37:47 +00:00
Julia Longtin
ba4f4129b3 better comments, and fix some small errors. 2024-03-24 12:17:06 +00:00
Julia Longtin
03a3e0eb7a perform 16 operations at a time. 2024-03-24 12:04:44 +00:00
Pierrick Hymbert
ea279d5609
ci : close inactive issue, increase operations per run (#6270) 2024-03-24 10:57:06 +02:00
Minsoo Cheong
586e7bc561
sampling : deduplicated code for probability distribution access (#6240)
* sampling: remove duplicated code for probability distribution access

* free original_logits

* fix original_logits allocation

* fixes based on review @cebtenzzre

* change function name to `llama_sampling_prepare`
2024-03-24 10:54:07 +02:00
Meng, Hengyu
ddf6568510
[SYCL] offload op (#6217)
* remove no USM methods

* leave the schedule to ggml_backend_sched entirely
2024-03-24 12:04:25 +08:00
Neo Zhang Jianyu
d03224ac98
Support build win release for SYCL (#6241)
* support release win

* fix value

* fix value

* fix value

* fix error

* fix error

* fix format
2024-03-24 09:44:01 +08:00
Julia Longtin
5935bb34f4 use proper mov operator, and pass addresses. 2024-03-23 23:46:36 +00:00
Jared Van Bortel
94d1b3b411
use _wfopen instead of fopen on Windows (#6248)
also fix missing #defines before windows.h, and BPE LF token on MSVC
2024-03-23 18:48:02 -04:00
Julia Longtin
a5132a1507 attempt our first FMA. 2024-03-23 22:16:57 +00:00
Julia Longtin
4477b8e123 add I32 vector memory clearing. 2024-03-23 21:16:23 +00:00
Julia Longtin
ea1edb0600 promote aux32 to a vector. 2024-03-23 21:12:35 +00:00
Julia Longtin
f967690a41 add missing address of operators. 2024-03-23 21:05:50 +00:00
Julia Longtin
2fdd11fe3a promote aux16 to a vector. 2024-03-23 21:00:51 +00:00
Julia Longtin
f09b3ed79e use quotes properly. 2024-03-23 20:53:16 +00:00
Julia Longtin
bb5eb95816 use better memory save operator. 2024-03-23 20:49:11 +00:00
Julia Longtin
9d7ca41703 expand mask, and align memory. 2024-03-23 20:48:43 +00:00
Julia Longtin
bd6d7e6238 try to use vectorized zeroing function. 2024-03-23 19:55:12 +00:00
Julia Longtin
f985372e3a add missing variable. 2024-03-23 19:49:16 +00:00
Julia Longtin
31d4f9312b copy right block. 2024-03-23 19:47:21 +00:00
Georgi Gerganov
95562175f8
gitignore : gguf-split 2024-03-23 21:35:23 +02:00
Pierrick Hymbert
f482bb2e49
common: llama_load_model_from_url split support (#6192)
* llama: llama_split_prefix fix strncpy does not include string termination
common: llama_load_model_from_url:
 - fix header name case sensitive
 - support downloading additional split in parallel
 - hide password in url

* common: EOL EOF

* common: remove redundant LLAMA_CURL_MAX_PATH_LENGTH definition

* common: change max url max length

* common: minor comment

* server: support HF URL options

* llama: llama_model_loader fix log

* common: use a constant for max url length

* common: clean up curl if file cannot be loaded in gguf

* server: tests: add split tests, and HF options params

* common: move llama_download_hide_password_in_url inside llama_download_file as a lambda

* server: tests: enable back Release test on PR

* spacing

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* spacing

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* spacing

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-23 18:07:00 +01:00
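
The first bullet of #6192 mentions that `strncpy` does not include string termination. The sketch below shows that pitfall in isolation. It is not the `llama_split_prefix` implementation: `copy_prefix` and its parameters are hypothetical names chosen for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper (an assumption for illustration): copy at most `len`
// characters of `src` into `dst`, which has room for `cap` bytes, and always
// leave `dst` terminated. Returns the length of the resulting string.
size_t copy_prefix(char * dst, size_t cap, const char * src, size_t len) {
    if (cap == 0) {
        return 0;  // no room even for the terminator
    }
    if (len > cap - 1) {
        len = cap - 1;  // keep one byte free for '\0'
    }

    // strncpy writes no terminator when src holds len or more characters,
    // so the terminator has to be written explicitly afterwards.
    std::strncpy(dst, src, len);
    dst[len] = '\0';

    return std::strlen(dst);
}
```

With an 8-byte buffer, copying the first 5 characters of a longer file name leaves a properly terminated 5-character string, and a request larger than the buffer is clamped to 7 characters.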
Pierrick Hymbert
1997577d5e
server: docs: --threads and --threads, --ubatch-size, --log-disable (#6254) 2024-03-23 18:00:38 +01:00
Julius Arkenberg
476b0251b2
llama : add grok-1 support (#6204)
* Add support for Grok model architecture

* Revert convert-hf-to-gguf to default options

* Fixed f_norm_rms_eps bug

* Fix whitespaces

* llama : fix grok rope type

* llama : minor

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-23 18:41:53 +02:00
Julia Longtin
e43a63e7c6 fix typo. 2024-03-23 16:29:30 +00:00
Julia Longtin
f092a10dc9 promote aux16 into a vector. (part three) 2024-03-23 16:27:11 +00:00
Julia Longtin
c72157a5a6 promote aux16 into a vector. 2024-03-23 16:24:11 +00:00
Julia Longtin
e3503c924a promote aux16 into a vector. 2024-03-23 16:21:20 +00:00
Julia Longtin
edb76ffddb formatting improvement. 2024-03-23 16:19:17 +00:00