Commit graph

3861 commits

Author SHA1 Message Date
Georgi Gerganov
aeac876864
Merge branch 'master' into gg/rerank 2024-09-28 15:15:29 +03:00
Georgi Gerganov
739842703e
llama : add comment about thread-safety [no ci] (#9449) 2024-09-28 15:13:42 +03:00
Zhenwei Jin
6102037bbb
vocab : refactor tokenizer to reduce init overhead (#9449)
* refactor tokenizer

* llama : make llm_tokenizer more private

ggml-ci

* refactor tokenizer

* refactor tokenizer

* llama : make llm_tokenizer more private

ggml-ci

* remove unused files

* remove unused fileds to avoid unused filed build error

* avoid symbol link error

* Update src/llama.cpp

* Update src/llama.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-09-28 15:10:58 +03:00
nopperl
9a913110cf
llama : add support for Chameleon (#8543)
* convert chameleon hf to gguf

* add chameleon tokenizer tests

* fix lint

* implement chameleon graph

* add swin norm param

* return qk norm weights and biases to original format

* implement swin norm

* suppress image token output

* rem tabs

* add comment to conversion

* fix ci

* check for k norm separately

* adapt to new lora implementation

* fix layer input for swin norm

* move swin_norm in gguf writer

* add comment regarding special token regex in chameleon pre-tokenizer

* Update src/llama.cpp

Co-authored-by: compilade <git@compilade.net>

* fix punctuation regex in chameleon pre-tokenizer (@compilade)

Co-authored-by: compilade <git@compilade.net>

* fix lint

* trigger ci

---------

Co-authored-by: compilade <git@compilade.net>
2024-09-28 15:08:43 +03:00
Aarni Koskela
43bcdd9703
readme : add tool (#9655) 2024-09-28 15:07:14 +03:00
Dan Johansson
6a0f779484
ggml : add run-time detection of neon, i8mm and sve (#9331)
* ggml: Added run-time detection of neon, i8mm and sve

Adds run-time detection of the Arm instructions set features
neon, i8mm and sve for Linux and Apple build targets.

* ggml: Extend feature detection to include non aarch64 Arm arch

* ggml: Move definition of ggml_arm_arch_features to the global data section
2024-09-28 15:06:16 +03:00
Georgi Gerganov
39167b69c0
llama : fix comment [no ci]
ggml-ci
2024-09-28 14:51:57 +03:00
Markus Tavenrath
89f9944981
Enable use to the rebar feature to upload buffers to the device. (#9251) 2024-09-28 12:05:05 +02:00
Georgi Gerganov
b5de3b74a5
readme : update hot topics 2024-09-27 20:57:51 +03:00
Xuan Son Nguyen
a4ac45f659 update server docs 2024-09-27 15:30:41 +02:00
Xuan Son Nguyen
0d6f6a799f add --reranking argument 2024-09-27 15:25:39 +02:00
Georgi Gerganov
84b0af8355
Update examples/server/server.cpp
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-09-27 10:46:37 +03:00
Borislav Stanimirov
44f59b4301
cmake : add option for common library (#9661) 2024-09-27 10:42:06 +03:00
Xuan Son Nguyen
1ae8376d40 change test data 2024-09-26 17:52:57 +02:00
Xuan Son Nguyen
f27dd6990d add reranking test 2024-09-26 17:43:02 +02:00
Georgi Gerganov
f19554f453
ci : add rerank tests
ggml-ci
2024-09-26 15:20:32 +03:00
Georgi Gerganov
ca99a6ce70
llama : fix uninitialized tensors 2024-09-26 15:20:11 +03:00
Georgi Gerganov
4d457755c0
llama : add comment [no ci] 2024-09-26 14:36:14 +03:00
Georgi Gerganov
877a04ccff
server : add docs 2024-09-26 14:32:40 +03:00
Georgi Gerganov
00b33760aa
server : initiate tests for later
ggml-ci
2024-09-26 13:17:22 +03:00
Neo Zhang Jianyu
95bc82fbc0
[SYCL] add missed dll file in package (#9577)
* update oneapi to 2024.2

* use 2024.1

---------

Co-authored-by: arthw <14088817+arthw@users.noreply.github.com>
2024-09-26 17:38:31 +08:00
R0CKSTAR
7691654c68
mtgpu: enable VMM (#9597)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2024-09-26 03:27:40 +02:00
Georgi Gerganov
84f56f3c45
vocab : minor style
ggml-ci
2024-09-25 20:39:43 +03:00
Georgi Gerganov
866c0113fb
jina : support v1 reranker 2024-09-25 20:39:25 +03:00
Georgi Gerganov
c62a39d91e
embedding : parse special tokens 2024-09-25 20:36:38 +03:00
Xuan Son Nguyen
ea9c32be71
ci : fix docker build number and tag name (#9638)
* ci : fix docker build number and tag name

* fine-grant permissions
2024-09-25 17:26:01 +02:00
Georgi Gerganov
7bde9a0452
server : accept /rerank endpoint in addition to /v1/rerank [no ci] 2024-09-25 17:12:32 +03:00
Georgi Gerganov
62a45d12ef
rerank : cleanup + comments 2024-09-25 17:01:43 +03:00
Georgi Gerganov
6916ed1606
llama : aboud ggml_repeat during classification 2024-09-25 17:00:37 +03:00
Georgi Gerganov
6235c62ac9
server : add rerank endpoint
ggml-ci
2024-09-25 17:00:35 +03:00
Georgi Gerganov
125a0671ab
llama : add "rank" pooling type
ggml-ci
2024-09-25 16:59:49 +03:00
Georgi Gerganov
d0a7bf9382
llama : add classigication head (wip) [no ci] 2024-09-25 16:59:48 +03:00
Georgi Gerganov
dc0cdd8760
llama : read new cls tensors [no ci] 2024-09-25 16:59:48 +03:00
Georgi Gerganov
49f90de363
py : fix position embeddings chop [no ci] 2024-09-25 16:59:48 +03:00
Georgi Gerganov
77723ed69e
py : fix scalar-tensor conversion [no ci] 2024-09-25 16:59:48 +03:00
Georgi Gerganov
3453e62bb9
py : add XLMRobertaForSequenceClassification [no ci] 2024-09-25 16:59:48 +03:00
Charles Xu
1e43630218
ggml : remove assert for AArch64 GEMV and GEMM Q4 kernels (#9217)
* ggml : remove assert for AArch64 GEMV and GEMM Q4 kernels

* added fallback mechanism when the offline re-quantized model is not
optimized for the underlying target.

* fix for build errors

* remove prints from the low-level code

* Rebase to the latest upstream
2024-09-25 16:12:20 +03:00
Xuan Son Nguyen
afbbfaa537
server : add more env vars, improve gen-docs (#9635)
* server : add more env vars, improve gen-docs

* update server docs

* LLAMA_ARG_NO_CONTEXT_SHIFT
2024-09-25 14:05:13 +02:00
Gabe Goodhart
3d6bf6919f
llama : add IBM Granite MoE architecture (#9438)
* feat(gguf-py): Add granitemoe architecture

This includes the addition of new tensor names for the new moe layers.
These may not be correct at this point due to the need for the hack in
gguf_writer.py to double-check the length of the shape for these layers.

Branch: GraniteMoE

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(convert_hf_to_gguf): Add GraniteMoeModel

GraniteMoe has the same configuration deltas as Granite

Branch: GraniteMoE

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(granitemoe convert): Split the double-sized input layer into gate and up

After a lot of staring and squinting, it's clear that the standard mixtral
expert implementation is equivalent to the vectorized parallel experts in
granite. The difference is that in granite, the w1 and w3 are concatenated
into a single tensor "input_linear." Rather than reimplementing all of the
math on the llama.cpp side, the much simpler route is to just split this
tensor during conversion and follow the standard mixtral route.

Branch: GraniteMoE

Co-Authored-By: alex.brooks@ibm.com

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(granitemoe): Implement granitemoe

GraniteMoE follows the mixtral architecture (once the input_linear layers
are split into gate_exps/up_exps). The main delta is the addition of the
same four multipliers used in Granite.

Branch: GraniteMoE

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* Typo fix in docstring

Co-Authored-By: ggerganov@gmail.com

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(conversion): Simplify tensor name mapping in conversion

Branch: GraniteMoE

Co-Authored-By: git@compilade.net
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(convert): Remove unused tensor name mappings

Branch: GraniteMoE

Co-Authored-By: git@compilade.net
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(convert): Sanity check on merged FFN tensor sizes

Branch: GraniteMoE

Co-Authored-By: git@compilade.net
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Allow "output" layer in granite moe architecture (convert and cpp)

Branch: GraniteMoE

Co-Authored-By: git@compilade.net
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(granite): Add missing 'output' tensor for Granite

This is a fix for the previous `granite` architecture PR. Recent snapshots
have included this (`lm_head.weights`) as part of the architecture

Branch: GraniteMoE

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-09-25 10:06:52 +03:00
Dou Xinpeng
904837e0cb
cann: fix crash when llama-bench is running on multiple cann devices (#9627) 2024-09-25 11:30:38 +08:00
Eric Zhang
70392f1f81
ggml : add AVX512DQ requirement for AVX512 builds (#9622) 2024-09-24 11:03:21 +03:00
Georgi Gerganov
bb5f819975
sync : ggml 2024-09-24 11:01:18 +03:00
Georgi Gerganov
c038931615
examples : adapt to ggml.h changes (ggml/0)
ggml-ci
2024-09-24 11:00:52 +03:00
Georgi Gerganov
31ac5834fe
llama : keep track of all EOG tokens in the vocab (#9609)
ggml-ci
2024-09-24 10:16:06 +03:00
Georgi Gerganov
cea1486ecf
log : add CONT level for continuing previous log entry (#9610) 2024-09-24 10:15:35 +03:00
StrangeBytesDev
0aa15011e3
server : add newline after chat example (#9616) 2024-09-24 09:04:39 +03:00
Georgi Gerganov
b0f27361f3
sampling : avoid expensive softmax during greedy sampling (#9605)
* sampling : avoid expensive softmax during greedy sampling

ggml-ci

* speculative : fix default RNG seed + set sparams.n_probs

* Update tests/test-sampling.cpp

Co-authored-by: slaren <slarengh@gmail.com>

* sampling : add clarifying comment [no ci]

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-09-24 09:03:17 +03:00
Max Krasnyansky
c087b6f11d
threads: fix msvc build without openmp (#9615)
We're missing atomic_thread_fence() in MSVC builds when openmp is disabled.
2024-09-23 21:18:48 -07:00
Ivan
116efee0ee
cuda: add q8_0->f32 cpy operation (#9571)
llama: enable K-shift for quantized KV cache
It will fail on unsupported backends or quant types.
2024-09-24 02:14:24 +02:00
Xuan Son Nguyen
0b3bf966f4
server : add --no-context-shift option (#9607)
* server : add --no-context-shift option

* small fix

* Update examples/server/tests/features/embeddings.feature

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* tests : minor fix

* revert usage of GGML_ASSERT

* update server documentation

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-09-23 22:23:54 +02:00