Commit graph

2681 commits

Author SHA1 Message Date
Ashish
13c75c21eb Proper check for None type for new_name to avoid crash; formatting; revert change to base class write_tensors() 2024-04-14 14:28:12 -07:00
Ashish
96695fb96b refactor stablelm graph builder to support 1.6, 3b and 12b more efficiently 2024-04-14 14:21:25 -07:00
Ashish
e3f73604d5 Move QK norm stack to private function so it's easier to read 2024-04-13 23:56:32 -07:00
Ashish
f7b40d7650 Revert formatter 2024-04-13 23:37:46 -07:00
Ashish
0dc779bff9 Removed warnings 2024-04-13 21:27:18 -07:00
Ashish
8dcd9978d2 Fix accidental removal 2024-04-13 21:15:15 -07:00
Ashish
0ec53cfff7 Converge StableLM and StableLM2 code to simplify graph construction 2024-04-13 21:12:30 -07:00
Ashish
29d940b0d7 Do QK norm stacking in model conversion step 2024-04-13 19:27:34 -07:00
Ashish
91a3db9e7d Formatting 2024-04-13 19:27:32 -07:00
Ashish
15a5e7db4c Removed autoformatting; resolved bug where model_arch was not selecting StableLM2 2024-04-13 19:26:26 -07:00
Ashish
0eb8492ccb Added 12B support 2024-04-13 19:22:39 -07:00
Ashish
b5afc44704 fix 2024-04-13 19:22:39 -07:00
Ashish
b89fa9734d StableLM-2-12b model support 2024-04-13 19:22:36 -07:00
Ashish
13387d9c57 StableLM12 tensormapping and constants 2024-04-13 19:21:11 -07:00
Ashish
d383c0d818 StableLM2 12B support for huggingface -> GGUF 2024-04-13 19:14:33 -07:00
Johannes Gäßler
b5e7285baf
CUDA: fix matrix multiplication logic for tests (#6667) 2024-04-14 00:21:55 +02:00
Pierrick Hymbert
4bd0f93e4a
model: support arch DbrxForCausalLM (#6515)
* model: dbrx convert to gguf
#6344

* llama: support dbrx
#6344

* doc: dbrx: add the model as supported

* scripts: get-wikitext-2 add unzip

* llama: increase maximum experts allowed

* llama: factorize moe graph implementation between grok, mixtral and dbrx


---------

Co-authored-by: Megha Agarwal <16129366+megha95@users.noreply.github.com>
2024-04-13 11:33:52 +02:00
Olivier Chafik
ab9a3240a9
JSON schema conversion: ️ faster repetitions, min/maxLength for strings, cap number length (#6555)
* json: rename python schema converter to make import easier

* server: skip null json_schema / grammar fields

* json: deps management for primitive rules (+ allow null values)

* json: optimize repetitions for minItems/maxItems and regexps: `a{,3}` goes from `"a"? "a"? "a"?` (explosive combos) to `(a (a (a)?)?)?`

* grammars: add troubleshooting section to readme

* json: cap length of numbers to 15 digits before/after decimal point

(avoids infinite gen, e.g. "one third" -> `0.333333333333...`)

* json: unify all repetition code (w/ or w/o sep)

* json: support string minLength/maxLength

* server+json: update server/README w/ result_format

* nits

* json: fix type error w/ python 3.8

* json: fix server/README (json_schema in /completion vs. result_format in /v1/chat/completions)

* json: simplify DOT `{"type": "string", "pattern": "^.$"}`

* json: remove recursion in opt_repetitions (avoids Python stack overflow)

* json: rm dead code

* json: rm useless assert & ggml.h import
2024-04-12 19:43:38 +01:00
slaren
fbbc030ba9
metal : unify mul_mv_id kernels (#6556) 2024-04-12 18:13:20 +02:00
Daniel Bevenius
4cc120c744
infill : add download instructions for model (#6626)
* infill : add download instructions for model

This commit adds instructions on how to download a CodeLlama model
using the `hf.sh` script. This will download the model and place it
in the `models` directory which is the same model use later by the
infill example.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* squash! infill : add download instructions for model

Clarify the reason for using CodeLlama.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

---------

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-04-12 15:11:46 +03:00
Pierrick Hymbert
24ee66ed0d
server : coherent log output for KV cache full (#6637) 2024-04-12 14:49:21 +03:00
jiez
91c736015b
llama : add gguf_remove_key + remove split meta during quantize (#6591)
* Remove split metadata when quantize model shards

* Find metadata key by enum

* Correct loop range for gguf_remove_key and code format

* Free kv memory

---------

Co-authored-by: z5269887 <z5269887@unsw.edu.au>
2024-04-12 13:45:06 +03:00
Rene Leonhardt
5c4d767ac0
chore: Fix markdown warnings (#6625) 2024-04-12 10:52:36 +02:00
Georgi Gerganov
ef21ce4ccb
imatrix : remove invalid assert (#6632) 2024-04-12 11:49:58 +03:00
MasterYi1024
dee7f8d692
Correct free memory and total memory. (#6630)
Co-authored-by: MasterYi <zouxiaoyi@kylinos.cn>
2024-04-12 10:28:12 +02:00
Pierrick Hymbert
81da18e71c
eval-callback: use ggml_op_desc to pretty print unary operator name (#6631) 2024-04-12 10:26:47 +02:00
Georgi Gerganov
9ed2737acc
ci : disable Metal for macOS-latest-cmake-x64 (#6628) 2024-04-12 11:15:05 +03:00
Clint Herron
04a5ac211e
Optimization: eliminate addition of redundant stacks when advancing grammar. (#6616) 2024-04-11 21:44:50 -04:00
Clint Herron
f7001ccc5a
As suggested by @slaren, disabling Metal for test to fix CI build on OSX from #6576 (#6619) 2024-04-11 17:44:48 -04:00
Nikolas
a474f50ebb
Refactor Error Handling for CUDA (#6575)
* Refactor Error Handling for CUDA

Add guidance for setting CUDA_DOCKER_ARCH to match GPU compute capability for CUDA versions < 11.7. Include link to NVIDIA's CUDA GPUs documentation for compute capability reference.

* Update Makefile

Improved wording

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-04-11 21:56:29 +02:00
Olivier Chafik
cbaadc9294
grammars: 1.5x faster inference w/ complex grammars (vector reserves / reuses) (#6609)
* grammars: reserve rejects & next candidates

* grammars: reuse new_stacks

* grammars: fix missing sig change in llama.h

* grammars: fix test (api changed)

* grammars: update gbnf-validator.cpp

* grammars: simpler syntax (no swap)
2024-04-11 19:47:34 +01:00
Hugo Roussel
1bbdaf6ecd
ci: download artifacts to release directory (#6612)
When action download-artifact was updated to v4, the default download path changed.
This fix binaries not being uploaded to releases.
2024-04-11 19:52:21 +02:00
Daniel Bevenius
f4183afe6a
scripts : add --outdir option to hf.sh (#6600)
* scripts : add --outdir option to hf.sh

This commit adds an option to the hf.sh script that allows the user to
specify an output directory for the downloaded file.

The motivation for this changes is that examples that use the hf.sh
script to download models from huggingface can now specify the output
directory, perhaps to the `models` directory to keep them in one place
and not clutter the root directory.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* squash! scripts : add --outdir option to hf.sh

Fix format of the --outdir option in the usage message.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

---------

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-04-11 16:22:47 +03:00
Pierrick Hymbert
b804b1ef77
eval-callback: Example how to use eval callback for debugging (#6576)
* gguf-debug: Example how to use ggml callback for debugging

* gguf-debug: no mutex, verify type, fix stride.

* llama: cv eval: move cb eval field in common gpt_params

* ggml_debug: use common gpt_params to pass cb eval.
Fix get tensor SIGV random.

* ggml_debug: ci: add tests

* ggml_debug: EOL in CMakeLists.txt

* ggml_debug: Remove unused param n_batch, no batching here

* ggml_debug: fix trailing spaces

* ggml_debug: fix trailing spaces

* common: fix cb_eval and user data not initialized

* ci: build revert label

* ggml_debug: add main test label

* doc: add a model: add a link to ggml-debug

* ggml-debug: add to make toolchain

* ggml-debug: tests add the main label

* ggml-debug: ci add test curl label

* common: allow the warmup to be disabled in llama_init_from_gpt_params

* ci: add curl test

* ggml-debug: better tensor type support

* gitignore : ggml-debug

* ggml-debug: printing also the sum of each tensor

* ggml-debug: remove block size

* eval-callback: renamed from ggml-debug

* eval-callback: fix make toolchain

---------

Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-11 14:51:07 +02:00
Daniel Bevenius
8228b66dbc
gguf : add option to not check tensor data (#6582)
This commit adds an option to the gguf example to not check the tensor
data.

The motivation for this is that it can be nice to use the gguf tool to
read other .gguf files that were not created by the gguf tool.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-04-10 21:16:48 +03:00
Ralph Soika
b3a96f27f0
minor layout improvements (#6572)
* minor layout improvements

* added missing file, run deps.sh locally
2024-04-10 19:18:25 +02:00
slaren
4f407a0a35
llama : add model types for mixtral (#6589) 2024-04-10 17:24:14 +02:00
slaren
65c64dc36f
convert.py : add consolidated.safetensors for mixtral 8x22b (#6587) 2024-04-10 15:23:12 +02:00
Pierrick Hymbert
67fac4b95f
docs : how to add a model (#6565)
* docs: how to add a model

* docs: model: typo and docs

* docs: model: add prevision on RoPE

* docs: model: rephrasing README.md

* docs: model: rephrasing README.md

* docs: model: README.md fix trailing spaces

* docs : some fixes

* Update README.md

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-10 09:58:48 +03:00
Artem Zinnatullin
29122d32ac
readme : fix ROCm link (#6579) 2024-04-10 09:49:12 +03:00
sjxx
b231b37b09
readme : update UI list (#6560) 2024-04-10 09:34:00 +03:00
Jiří Sejkora
ba5e134e07
readme: fix typo in amdgpu target name (#6573) 2024-04-10 00:23:02 +02:00
Jared Van Bortel
1b67731e18
BERT tokenizer fixes (#6498)
Key changes:
* BERT conversion: fix abuse of LlamaHfVocab, do not set BOS or EOS
* Nomic Embed conversion: pad vocab instead of slicing embedding tensor
* llama_tokenize: handle added special tokens like HF does
2024-04-09 13:44:08 -04:00
Georgi Gerganov
c4a3a4ff47
sync : ggml 2024-04-09 20:29:06 +03:00
Ed Lee
400d5d722d
server : detect search query to start webchat (#6554) 2024-04-09 10:31:47 +02:00
Carolinabanana
5dc9dd7152
llama : add Command R Plus support (#6491)
* Add Command R Plus GGUF

* Add Command R Plus GGUF

* Loading works up to LayerNorm2D

* Export new tensors in 1D so they are not quantized.

* Fix embedding layer based on Noeda's example

* Whitespace

* Add line

* Fix unexpected tokens on MPS. Re-add F16 fix. ((Noeda)

* dranger003: Fix block index overflow in CUDA dequantizing.

* Reverted blocked multiplication code as it still has issues and could affect other Llama arches

* export norms as f32

* fix overflow issues during quant and other cleanup

* Type convention

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* dranger003: Fix more int overflow during quant.

---------

Co-authored-by: S <seast@Ss-Mac-Studio.local>
Co-authored-by: S <s@example.com>
Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-09 11:16:13 +03:00
Georgi Gerganov
e11a8999b5
license : update copyright notice + add AUTHORS (#6405)
* license : add AUTHORS

* authors : update

* scipts : add LICENSE and gen-authors.sh to sync
2024-04-09 09:23:19 +03:00
Georgi Gerganov
cc4a95426d
llama : fix attention layer count sanity check (#6550)
* llama : fix attention layer count sanity check

* llama : fix parentheses in attention layer count sanity check

There was otherwise a warning when compiling.

---------

Co-authored-by: Francis Couture-Harpin <git@compilade.net>
2024-04-08 22:25:49 +03:00
kunnis
cecd8d3c98
Comment explaining a decision (#6531) 2024-04-08 17:44:19 +02:00
Georgi Gerganov
b73e564b16
quantize : fix precedence of cli args (#6541) 2024-04-08 16:23:01 +03:00