Commit graph

3642 commits

Author SHA1 Message Date
Nexesenex
a7f91643bb Fix mistake 2024-08-19 20:02:21 +02:00
Nexesenex
caeb839ae3 Boost embeddings and output weights for MOEs.
They are single, non-repeating tensors, so the boost is reasonable compared to the size of the 4 or more experts.
2024-08-18 22:20:58 +02:00
Nexesenex
503048a197 Correct IQ3_M 2024-08-18 22:14:05 +02:00
Nexesenex
ddb13732c4 IQ3_XXL and IQ3_XXXL
We now have a full range of quants between IQ3_M and IQ4_XS
2024-08-18 22:14:04 +02:00
Nexesenex
a79633b49e Merge branch 'master' into pr/8836 2024-08-18 22:12:39 +02:00
Nexesenex
b02eaf6803 Mass use of the few/some/more/many bits bump logic
Add few bits logic and rework the 4 settings for 25/37.5/50/75% quant bump when used.
2024-08-18 22:11:24 +02:00
Georgi Gerganov
554b049068
flake.lock: Update (#9068) 2024-08-18 07:43:32 -07:00
ltoniazzi
2339a0be1c
tests : add integration test for lora adapters (#8957)
* Add printing to check weights match torch version

* minor code style changes

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-08-18 11:58:04 +02:00
Nexesenex
4ba561808d Adapt token embeddings and output.weight to vocab size
Due to the large increase in the size of the embeddings and output weights for models with a huge vocab, they seem to quantize with less loss.
2024-08-18 04:13:28 +02:00
Nexesenex
17b71512a6 Update IQ3_M attn_k and IQ3_XL token_embd 2024-08-18 04:12:15 +02:00
Nexesenex
e4c506d794 Merge branch 'master' into pr/8836 2024-08-18 04:09:22 +02:00
Yoshi Suhara
2fb9267887
Fix incorrect use of ctx_split for bias tensors (#9063) 2024-08-17 15:34:21 +02:00
Xuan Son Nguyen
8b3befc0e2
server : refactor middleware and /health endpoint (#9056)
* server : refactor middleware and /health endpoint

* move "fail_on_no_slot" to /slots

* Update examples/server/server.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix server tests

* fix CI

* update server docs

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-08-16 17:19:05 +02:00
tc-mb
d565bb2fd5
llava : support MiniCPM-V-2.6 (#8967)
* init

* rename

* add run android for termux in readme

* add android readme

* add instructions in readme

* change name in readme

* Update README.md

* fixed line

* add result in readme

* random pos_embed

* add positions index

* change for ollama

* change for ollama

* better pos_embed in clip

* support ollama

* update cmakelist

* update cmakelist

* rename wrapper

* clear code

* replace and organize code

* add link

* sync master

* fix warnings

* fix warnings

* fix bug in bicubic resize when the image needs to be resized smaller

* receive review comments and modify

* receive review comments and modify

* put all code into llava dir

* fix quality problem in pr code

* change n_layer

* add space in "-1"

* imitate reshape bug of python code

* fix bug in clip

* fix issues for merging

* fix llama-minicpmv-cli in cmake file

* change pr readme

* fix code review

* remove in line 33 directory in the /cmakelists.txt (not in example, in the main dir)

* fix cmakefile

* add warn

* fix KEY_HAS_MINICPMV_PROJ

* remove load_image_size into clip_ctx

* remove the extern "C", MINICPMV_API

* fix uhd code for review comment

* delete minicpmv-wrapper in pr

* remove uhd_image_embed

* Modify 2 notes

* support minicpmv2.6

* modify convert script of minicpmv

* modify convert

* modify convert

* add readme

* add resampler of v2.6

* modify clip

* modify readme

* fix type-check

* fix type-check

* fix type-check

* fix type-check

* modify convert script and readme

* fix convert script and readme

* fix convert

* fix num in convert

* fix type-check

---------

Co-authored-by: Hongji Zhu <fireyoucan@gmail.com>
Co-authored-by: harvestingmoon <leewenyeong@gmail.com>
2024-08-16 16:34:41 +03:00
Farbod Bijary
ee2984bdaf
py : fix wrong input type for raw_dtype in ggml to gguf scripts (#8928)
Co-authored-by: farbod <farbod.bjary82@gmail.com>
2024-08-16 13:36:30 +03:00
Aisuko
c8ddce8560
Fix inference example lacks required parameters (#9035)
Signed-off-by: Aisuko <urakiny@gmail.com>
2024-08-16 11:08:59 +02:00
compilade
23fd453544
gguf-py : bump version from 0.9.1 to 0.10.0 (#9051) 2024-08-16 09:36:11 +03:00
Minsoo Cheong
c679e0cb5c
llama : add EXAONE model support (#9025)
* add exaone model support

* add chat template

* fix whitespace

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* add ftype

* add exaone pre-tokenizer in `llama-vocab.cpp`

Co-Authored-By: compilade <113953597+compilade@users.noreply.github.com>

* fix lint

Co-Authored-By: compilade <113953597+compilade@users.noreply.github.com>

* add `EXAONE` to supported models in `README.md`

* fix space

Co-authored-by: compilade <git@compilade.net>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: compilade <113953597+compilade@users.noreply.github.com>
Co-authored-by: compilade <git@compilade.net>
2024-08-16 09:35:18 +03:00
Liu Jia
fb487bb567
common : add support for cpu_get_num_physical_cores() on Windows (#8771)
* Add support for cpu_get_num_physical_cores() on Windows

* fix build bug on msys2-clang64 and ucrt64

* avoid adding new function

* add new macros to avoid windows+mingw64

* Add error checking to return default value
2024-08-16 09:23:12 +03:00
Yoshi Suhara
2a24c8caa6
Add Nemotron/Minitron GGUF Conversion & Inference Support (#8922)
* Add nemotron GGUF conversion & inference support

* Fix formatting issues

* Remove unnecessary write_tensors()

* Update convert_hf_to_gguf.py

Co-authored-by: compilade <git@compilade.net>

* Update src/llama.cpp

Co-authored-by: compilade <git@compilade.net>

* Address comments by @compilade

* Replace ggml_mul_mat()->llm_build_lora_mm()

* Remove mutable variable

* Use  for bias tensors

* Cover corner case for role_scaling not in config.json

---------

Co-authored-by: compilade <git@compilade.net>
2024-08-16 04:23:33 +02:00
Nico Bosshard
e3f6fd56b1
ggml : dynamic ggml_sched_max_splits based on graph_size (#9047)
* ggml : Dynamic ggml_sched_max_splits based on graph_size

* Fixed and readded debug code for causes
2024-08-16 04:22:55 +02:00
gtygo
4b9afbbe90
retrieval : fix memory leak in retrieval query handling (#8955)
* retrieval

* Reuse querybatch to reduce frequent memory allocation

* delete unused white space
2024-08-15 10:40:12 +03:00
Riceball LEE
37501d9c79
server : fix duplicated n_predict key in the generation_settings (#8994) 2024-08-15 10:28:05 +03:00
Zhenwei Jin
4af8420afb
common : remove duplicate function llama_should_add_bos_token (#8778) 2024-08-15 10:23:23 +03:00
Esko Toivonen
6bda7ce6c3
llama : add pre-tokenizer regexes for BLOOM and gpt3-finnish (#8850) 2024-08-15 10:17:12 +03:00
Georgi Gerganov
d5492f0525
ci : disable bench workflow (#9010) 2024-08-15 10:11:11 +03:00
Jiří Podivín
234b30676a
server : init stop and error fields of the result struct (#9026)
Signed-off-by: Jiri Podivin <jpodivin@redhat.com>
2024-08-15 09:21:57 +03:00
Nexesenex
eeccd31a9c Merge branch 'master' into pr/8836 2024-08-15 02:30:10 +02:00
0cc4m
5fd89a70ea
Vulkan Optimizations and Fixes (#8959)
* Optimize Vulkan REPEAT performance

* Use Vulkan GLSL fused multiply-add instruction where possible

* Add GGML_VULKAN_PERF option to output performance data per operator

* Rework and fix Vulkan descriptor set and descriptor pool handling

* Fix float32 concat f16 shader validation error

* Add Vulkan GROUP_NORM eps parameter

* Fix validation error with transfer queue memory barrier flags

* Remove trailing whitespaces
2024-08-14 18:32:53 +02:00
compilade
98a532d474
server : fix segfault on long system prompt (#8987)
* server : fix segfault on long system prompt

* server : fix parallel generation with very small batch sizes

* server : fix typo in comment
2024-08-14 09:51:02 +03:00
Georgi Gerganov
43bdd3ce18
cmake : remove unused option GGML_CURL (#9011) 2024-08-14 09:14:49 +03:00
Daniel Bevenius
06943a69f6
ggml : move rope type enum to ggml.h (#8949)
* ggml : move rope type enum to ggml.h

This commit moves the `llama_rope_type` enum from `llama.h` to
`ggml.h` and changes its name to `ggml_rope_type`.

The motivation for this change is to address the TODO in `llama.h` and
use the enum in ggml.

Note: This commit does not change the `mode` parameter to be of type
`enum ggml_rope_type`. The name `mode` and its usage suggest that it
might be more generic and possibly used as a bit field for multiple
flags. Further investigation/discussion may be needed to determine
if `mode` should be restricted to RoPE types.

* squash! ggml : move rope type enum to ggml.h

This commit removes GGML_ROPE_TYPE_NONE and GGML_ROPE_TYPE_GLM from
ggml.h, and brings back the llama_rope_type enum.

I've kept the assert for GGML_ROPE_TYPE_GLM as I'm not sure if it is
safe to remove it yet.

* squash! ggml : move rope type enum to ggml.h

This commit removes the enum ggml_rope_type from ggml.h and replaces it
with a define (GGML_ROPE_TYPE_NEOX). This define is used in the code to
check if the mode is set to GPT-NeoX. Also the enum llama_rope_type has
been updated to reflect this change.

* squash! ggml : move rope type enum to ggml.h

This commit contains a suggestion to enable the GGML_ROPE_TYPE_NEOX
macro/define to be passed to the shader compiler.

* squash! ggml : move rope type enum to ggml.h

This commit fixes the editorconfig-checker warnings.

* squash! ggml : move rope type enum to ggml.h

Update comment for ggml_rope function.

* Revert "squash! ggml : move rope type enum to ggml.h"

This reverts commit 6261222bd0.

* squash! ggml : move rope type enum to ggml.h

Add GGML_ROPE_TYPE_NEOX to rope_common.comp.

* remove extra line

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-08-13 21:13:15 +02:00
Xuan Son Nguyen
828d6ff7d7
export-lora : throw error if lora is quantized (#9002) 2024-08-13 11:41:14 +02:00
Nexesenex
8c9017bfbe Simplify IQ4_XSR
But leave in place as a "demo" the more complex template set by Ikawrakow to customize the layers quants, with the added attn_q, attn_k, and attn_output tensors.
2024-08-12 22:20:02 +02:00
Nexesenex
8c10533409 Merge branch 'master' into pr/8836 2024-08-12 20:28:38 +02:00
Nexesenex
cd92ba612f IQ4_XSR (test FTYPE) and attention_wv logic for all attn_*.weights
Also, advise iMatrix for IQ2_M and Q2_K FTypes
2024-08-12 20:27:36 +02:00
Diogo Teles Sant'Anna
fc4ca27b25
ci : fix github workflow vulnerable to script injection (#9008)
Signed-off-by: Diogo Teles Sant'Anna <diogoteles@google.com>
2024-08-12 19:28:23 +03:00
Radoslav Gerganov
1f67436c5e
ci : enable RPC in all of the released builds (#9006)
ref: #8912
2024-08-12 19:17:03 +03:00
Nico Bosshard
0fd93cdef5
llama : model-based max number of graph nodes calculation (#8970)
* llama : model-based max number of graph nodes calculation

* Update src/llama.cpp

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-08-12 17:13:59 +02:00
Frank Mai
84eb2f4fad
docs: introduce gpustack and gguf-parser (#8873)
* readme: introduce gpustack

GPUStack is an open-source GPU cluster manager for running large
language models, which uses llama.cpp as the backend.

Signed-off-by: thxCode <thxcode0824@gmail.com>

* readme: introduce gguf-parser

GGUF Parser is a tool to review/check the GGUF file and estimate the
memory usage without downloading the whole model.

Signed-off-by: thxCode <thxcode0824@gmail.com>

---------

Signed-off-by: thxCode <thxcode0824@gmail.com>
2024-08-12 14:45:50 +02:00
DavidKorczynski
1262e7ed13
grammar-parser : fix possible null-deref (#9004)
Fixes: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=70680

Signed-off-by: David Korczynski <david@adalogics.com>
2024-08-12 15:36:41 +03:00
Nexesenex
3e2eb6dc57 Merge branch 'master' into pr/8836 2024-08-12 14:25:23 +02:00
DavidKorczynski
df5478fbea
ggml: fix div-by-zero (#9003)
Fixes: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=70724

To access the above bug you need to log in using one of the
emails in
https://github.com/google/oss-fuzz/blob/master/projects/llamacpp/project.yaml#L3-L5

Signed-off-by: David Korczynski <david@adalogics.com>
2024-08-12 14:21:41 +02:00
Liu Jia
2589292cde
Fix a spelling mistake (#9001) 2024-08-12 11:46:03 +02:00
Georgi Gerganov
d3ae0ee8d7
py : fix requirements check '==' -> '~=' (#8982)
* py : fix requirements check '==' -> '~='

* cont : fix the fix

* ci : run on all requirements.txt
2024-08-12 11:02:01 +03:00
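For reference, the `~=` operator named in the commit title above is the PEP 440 "compatible release" clause: `~=X.Y.Z` accepts any version that is at least `X.Y.Z` and still matches `X.Y.*`, whereas `==` pins a single version. The following is a minimal sketch of that difference, assuming purely numeric dotted versions. The helper names are hypothetical, and this is not the repository's requirements check script.

```python
from __future__ import annotations


def parse(version: str) -> tuple[int, ...]:
    """Split a purely numeric dotted version such as '1.26.4' into ints."""
    return tuple(int(part) for part in version.split("."))


def _pad(a: tuple[int, ...], b: tuple[int, ...]):
    """Right-pad the shorter tuple with zeros so '1.26' compares like '1.26.0'."""
    n = max(len(a), len(b))
    return a + (0,) * (n - len(a)), b + (0,) * (n - len(b))


def exact_match(candidate: str, spec: str) -> bool:
    """'==spec': only the pinned release is accepted."""
    have, want = _pad(parse(candidate), parse(spec))
    return have == want


def compatible_release(candidate: str, spec: str) -> bool:
    """'~=spec': candidate >= spec, and equal to spec on all but its last component."""
    want = parse(spec)
    if len(want) < 2:
        raise ValueError("'~=' needs at least two version components")
    prefix = want[:-1]  # e.g. (1, 26) for spec '1.26.0'
    have, want_padded = _pad(parse(candidate), want)
    return have >= want_padded and have[: len(prefix)] == prefix
```

With these helpers, `compatible_release("1.26.4", "1.26.0")` holds while `exact_match("1.26.4", "1.26.0")` does not, which is the practical difference between the two operators in a requirements file.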
Georgi Gerganov
5ef07e25ac
server : handle models with missing EOS token (#8997)
ggml-ci
2024-08-12 10:21:50 +03:00
Nexesenex
df9e6fda50 Adjustments on output and embeddings 2024-08-11 21:49:23 +02:00
Nexesenex
1ad18f80e9 Adjustments on attn_k 2024-08-11 21:44:29 +02:00
compilade
4134999e01
gguf-py : Numpy dequantization for most types (#8939)
* gguf-py : Numpy dequantization for most types

* gguf-py : Numpy dequantization for grid-based i-quants
2024-08-11 14:45:41 -04:00
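As an illustration of what block-wise dequantization involves, here is a minimal sketch for a single type, Q8_0, written with the standard library only rather than NumPy. It assumes the ggml Q8_0 block layout of one little-endian float16 scale followed by 32 signed 8-bit quants (34 bytes per block). The function name is hypothetical, and this is not the gguf-py implementation from the commit above.

```python
import struct

QK8_0 = 32                                  # values per Q8_0 block
BLOCK_FORMAT = "<e32b"                      # float16 scale + 32 int8 quants, little-endian
BLOCK_SIZE = struct.calcsize(BLOCK_FORMAT)  # 2 + 32 = 34 bytes


def dequantize_q8_0(data: bytes) -> list[float]:
    """Expand raw Q8_0 blocks into floats: each value is scale * quant."""
    if len(data) % BLOCK_SIZE != 0:
        raise ValueError(f"data length must be a multiple of {BLOCK_SIZE} bytes")
    out: list[float] = []
    # iter_unpack walks the buffer one 34-byte block at a time.
    for scale, *quants in struct.iter_unpack(BLOCK_FORMAT, data):
        out.extend(scale * q for q in quants)
    return out


# Build one synthetic block: scale 0.5 and quants -16..15.
example_block = struct.pack(BLOCK_FORMAT, 0.5, *range(-16, 16))
example_values = dequantize_q8_0(example_block)
```

A NumPy version would do the same arithmetic on whole arrays of blocks at once instead of looping in Python, which is where the speed comes from.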
Nexes the Old
8c2c03f4a7
Merge b3569
b3569
2024-08-11 16:46:15 +02:00