Commit graph

2876 commits

Author SHA1 Message Date
Ebey Abraham
b9e74f9bca
llama : add phi-2 + fix NeoX rope + ggml_mul_mat_set_prec (#4490)
* phi2 implementation

* fix breaking change

* phi-2 : various fixes

* phi-2 : use layer norm eps

* py : whitespaces

* llama : fix meta KV override bug

* convert : phi don't add BOS token

* convert : revert "added_tokens_decoder" change

* phi-2 : scale Q instead of KQ for better precision

* ggml : fix NeoX rope to rotate just first n_dims

* cuda : less diff in the rope_neox kernel

* ggml : add ggml_mul_mat_set_prec

ggml-ci

* Update ggml-cuda.cu

Co-authored-by: slaren <slarengh@gmail.com>

* Update ggml-cuda.cu

Co-authored-by: slaren <slarengh@gmail.com>

* cuda : ggml_cuda_op_mul_mat_cublas support F32 precision

* cuda : remove oboslete comment

---------

Co-authored-by: Ebey Abraham <ebeyabraham@microsoft.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2023-12-18 19:27:47 +02:00
hankcs
3c04bf6da8
llama : fix try_override for bool_value which always return true (#4519) 2023-12-18 15:14:58 +02:00
Jared Van Bortel
2994f0c5a2
decode : fix logits_valid for legacy API (#4516) 2023-12-17 19:39:02 -05:00
Georgi Gerganov
b1306c4394
readme : update hot topics 2023-12-17 20:16:23 +02:00
Georgi Gerganov
800a489e4a
llama.swiftui : add bench functionality (#4483)
* llama.swiftui : add bench button

* llama.swiftui : initial bench functionality

* force to use n_gpu_layers on simulator

* add download buttons & expose llamaState.loadModel

* update project.pbxproj

* comment #Preview & fix editorconfig check

* gitignore : xcode stuff

* llama.swiftui : UX improvements

* llama.swiftui : avoid data copy via "downloadTask"

* llama.swiftui : remove model from project

* llama : remove "mostly" from model infos

* llama.swiftui : improve bench

---------

Co-authored-by: jhen <developer@jhen.me>
2023-12-17 19:38:41 +02:00
Jared Van Bortel
f7f468a97d
gguf-py : fail fast on nonsensical special token IDs (#4489) 2023-12-17 10:45:46 -05:00
Matheus Gabriel Alves Silva
919c40660f
build : Check the ROCm installation location (#4485)
* build : Check the ROCm installation location

* more generic approach

* fixup! It was returning the path instead of the command output

* fixup! Trailing whitespace
2023-12-17 17:23:33 +02:00
slaren
45668633fd
finetune : keep allocs alive until all allocations are done (#4486) 2023-12-17 16:05:56 +01:00
olexiyb
0ffc92d2d2
server : disable llm logs if SERVER_VERBOSE is off (#3792) 2023-12-17 17:02:16 +02:00
AdithyanI
8edd2b40fd
server : fix grammar being ignored (#4494)
Fix bug in identifying the grammar.
2023-12-17 16:57:56 +02:00
Alexey Parfenov
eb16dae7e7
server : fix possible ambiguity in content type charset (#4501) 2023-12-17 16:56:09 +02:00
mzcu
62bd52b7bf
server : allow requests larger than 8K (#4500) 2023-12-17 16:54:37 +02:00
Bach Le
5daa5f54fd
Link to cublas dynamically on Windows even with LLAMA_STATIC (#4506) 2023-12-17 11:57:33 +01:00
Concedo
ec05230703 updated lite, up ver 2023-12-17 14:38:39 +08:00
Concedo
e8cf7f6ed3 Merge remote-tracking branch 'origin/master' into concedo_experimental 2023-12-17 14:37:14 +08:00
slaren
c6c4fc081c
lora : add support for non-llama models (#3333)
* lora : add support for non-llama models

ggml-ci

* avoid leaking ggml_context on failure
cleanup

ggml-ci

* lora : allow 1d tensors

* lora : include embd and output layers in size calculation

* fix style
2023-12-16 18:58:46 +01:00
Concedo
76a3ba42eb Merge branch 'master' into concedo_experimental
# Conflicts:
#	ggml.c
#	ggml.h
#	requirements.txt
#	tests/test-quantize-perf.cpp
2023-12-16 22:58:53 +08:00
Jared Van Bortel
8a5be3bd58
llama : sanity checks for access to logits (#4274)
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-15 22:16:15 -05:00
ShadovvBeast
88ae8952b6
server : add optional API Key Authentication example (#4441)
* Add API key authentication for enhanced server-client security

* server : to snake_case

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-15 13:49:01 +02:00
slaren
ee4725a686
ggml : group mul_mat_id rows by matrix (cpu only) (#4480)
* ggml : group mul_mat_id rows by matrix (cpu only)

* remove mmid parameters from mm forward

* store row groups in wdata and calculate only once in GGML_TASK_INIT

ggml-ci
2023-12-15 12:45:50 +01:00
slaren
6744dbe924
ggml : use ggml_row_size where possible (#4472)
* ggml : use ggml_row_size where possible

ggml-ci

* ggml : move ggml_nbytes_split to ggml-cuda.cu
2023-12-14 20:05:21 +01:00
slaren
cafcd4f895
ggml : remove n_dims from ggml_tensor (#4469)
ggml-ci
2023-12-14 16:52:08 +01:00
wonjun Jang
c50e400163
py : add protobuf dependency (#4466) 2023-12-14 14:44:49 +02:00
LostRuins
20a68a7030
ggml : add ggml_row_size() (fixes llama out of space) (#4461)
* Fixes "Not enough space in the context's memory pool" encountered on certain models, which seems to be caused by some imprecision related to the automatic casting of floating point values

* do not cast to size_t, instead just use doubles

* ggml : add ggml_row_size(), deprecate ggml_type_sizef()

* ggml : fix row size compute to avoid overflows

* tests : fix sizey -> sizez

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-14 14:13:33 +02:00
Concedo
7798587990 Workflow Build from experimental branch 2023-12-14 19:17:19 +08:00
Concedo
ae3d829d0c manual workflow for generating builds instead 2023-12-14 19:00:58 +08:00
Concedo
aac7f0b944 Merge branch 'master' into concedo_experimental
# Conflicts:
#	ggml.c
2023-12-14 17:24:42 +08:00
Concedo
f0de4953ae fixed length exceeding max ctx 2023-12-14 16:58:41 +08:00
Concedo
04bd895311 Revert "Fixes "Not enough space in the context's memory pool" encountered on certain models, which seems to be caused by some imprecision related to the automatic casting of floating point values"
This reverts commit 34b3dac66d.
2023-12-14 16:46:29 +08:00
Concedo
53bbd1ee43 Merge branch 'pr_fix_buf_resize_type' into concedo_experimental 2023-12-14 16:45:18 +08:00
Concedo
05f7db4b29 do not cast to size_t, instead just use doubles 2023-12-14 16:43:34 +08:00
Georgi Gerganov
55e87c3749
ggml : fix OpenCL broadcast requirement for ggml_mul (close #4453) 2023-12-14 10:35:29 +02:00
Concedo
c88fc19d59 Merge branch 'master' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	Makefile
#	README.md
2023-12-14 16:32:42 +08:00
Concedo
34b3dac66d Fixes "Not enough space in the context's memory pool" encountered on certain models, which seems to be caused by some imprecision related to the automatic casting of floating point values
(cherry picked from commit 1ad8f0d80e)
2023-12-14 16:16:25 +08:00
wonjun Jang
873637afc7
convert : support loading vocab from fast tokenizer config (#3633)
* Add HFVocab into convert.py

* Update convert.py

* Update convert.py

* add bytes_to_unicode function

* change add_meta_vocab fucntion

* remove debug code

* remove byte_encoder

* Add newline between classes

* Check tokenizer.json when tokenizer.model is not exist.

* Move transformers dependency to local code

* Add error context with 'raise from'

* Add fast tokenizer option to BpeVocab

* Update convert.py

* Add VocabLoader and remove *Vocab class

* Add transformers dependency

* remove added tokens and check newline token to decide spm or bpe

* Update convert.py

* Add special token type

* Update convert.py

* Update convert.py

* Update convert.py

* Fix typo in convert.py

* Fix when params.n_vocab < tokenizer vocab size

* update vocab class

* change funtion name

* Remove unused variable/functions, add types to class variable and methods, delete blank liens

* fix flake8 warnings

* code style cleanup

* make mypy happy

* change exception

---------

Co-authored-by: Jared Van Bortel <jared@nomic.ai>
2023-12-14 10:09:34 +02:00
Concedo
1ad8f0d80e Fixes "Not enough space in the context's memory pool" encountered on certain models, which seems to be caused by some imprecision related to the automatic casting of floating point values 2023-12-14 16:00:44 +08:00
BarfingLemurs
0353a18401
readme : update supported model list (#4457) 2023-12-14 09:38:49 +02:00
Concedo
0e31f53422 Revert "lowvram var defaults"
This reverts commit 7a691522a6.
2023-12-14 15:14:11 +08:00
henk717
146e3bb3aa
Automatically generate Linux Binaries (#564)
* .sh script V1

* koboldcpp.sh polish

* koboldcpp.sh dist generator

* Include html's in dist

* RWKV in Linux Dist

* Lower dependency requirements

* Eliminate wget dependency

* More distinct binary name

I know its technically amd64, but I don't want to cause confusion among nvidia users.

* Use System OpenCL

Unsure how this will behave in the pyinstaller build, but pocl ended up CPU only. With a bit of luck the pyinstaller uses the one from the actual system if compiled in a system without opencl, while conda now includes it for that specific system.

* Add cblas dependency

Missing this causes compile failures on some system's

* ICD workaround

Ideally we find a better solution, but conda forces ICD and needs this for the successful compile. However, pyinstaller then embeds the ICD causing it to be limited to the system it was compiled for. By temporarily removing the ICD pyinstaller can't find it and everything remains functional. Ideally we do this on a pyinstaller level, but I could not find any good options to do so yet.

* Fix & Nocuda

* Automatically build Linux Binary

* Auto build on v tag

* Better on release

* Fix missing jobs:

* More distinct name

* I am to retro...

* Fix release upload

* Another upload attempt

* Another upload attempt

* Also rebuild on release edit

* Placebo commit to maybe fix CI

---------

Co-authored-by: root <root@DESKTOP-DQ1QRAG>
2023-12-14 14:48:22 +08:00
Concedo
8dd975653d removing existing yml files 2023-12-14 14:47:03 +08:00
Concedo
ec2cf6ce7a Merge branch 'concedo' into concedo_experimental 2023-12-14 14:42:00 +08:00
shibe2
948ff137ec
server : fix handling of characters that span multiple tokens when streaming (#4446) 2023-12-13 21:57:15 +02:00
Georgi Gerganov
4d98d9a656
sync : ggml (SD ops, tests, kernels) (#4444)
* sync : ggml (SD ops, tests, kernels)

ggml-ci

* cuda : restore im2col

ggml-ci

* metal : fix accuracy of dequantization kernels

ggml-ci

* cuda : restore correct im2col

ggml-ci

* metal : try to fix moe test by reducing expert size

ggml-ci

* cuda : fix bin bcast when src1 and dst have different types

ggml-ci

---------

Co-authored-by: slaren <slarengh@gmail.com>
2023-12-13 21:54:54 +02:00
Jared Van Bortel
70f806b821
build : detect host compiler and cuda compiler separately (#4414) 2023-12-13 12:10:10 -05:00
henk717
306754791f
Koboldcpp.sh Fix & Nocuda (#562)
* .sh script V1

* koboldcpp.sh polish

* koboldcpp.sh dist generator

* Include html's in dist

* RWKV in Linux Dist

* Lower dependency requirements

* Eliminate wget dependency

* More distinct binary name

I know its technically amd64, but I don't want to cause confusion among nvidia users.

* Use System OpenCL

Unsure how this will behave in the pyinstaller build, but pocl ended up CPU only. With a bit of luck the pyinstaller uses the one from the actual system if compiled in a system without opencl, while conda now includes it for that specific system.

* Add cblas dependency

Missing this causes compile failures on some system's

* ICD workaround

Ideally we find a better solution, but conda forces ICD and needs this for the successful compile. However, pyinstaller then embeds the ICD causing it to be limited to the system it was compiled for. By temporarily removing the ICD pyinstaller can't find it and everything remains functional. Ideally we do this on a pyinstaller level, but I could not find any good options to do so yet.

* Fix & Nocuda

---------

Co-authored-by: root <root@DESKTOP-DQ1QRAG>
2023-12-14 00:06:58 +08:00
Concedo
2810151b98 update docs 2023-12-13 22:48:29 +08:00
Concedo
e447af669c Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
#	README.md
2023-12-13 21:09:47 +08:00
Siwen Yu
9fb13f9584
common : add --version option to show build info in CLI (#4433) 2023-12-13 14:50:14 +02:00
Georgi Gerganov
113f9942fc
readme : update hot topics 2023-12-13 14:05:38 +02:00
slaren
799a1cb13b
llama : add Mixtral support (#4406)
* convert : support Mixtral as LLAMA arch

* convert : fix n_ff typo

* llama : model loading

* ggml : sync latest ggml_mul_mat_id

* llama : update graph to support MoE

* llama : fix cur -> cur_expert

* llama : first working version

* llama : fix expert weighting in the FFN

* ggml : ggml_get_rows support 2D indexing [n_tokens, n_experts] (cpu only)

* ggml : add n_as argument to ggml_mul_mat_id

* ggml : fix ggml_get_rows to take into account ne02 / ne11

* metal : add more general support for ggml_get_rows + tests

* llama : add basic support for offloading moe with CUDA

* metal : add/mul/div use general kernel when src1 not cont

* metal : reduce the kernel launches for ggml_mul_mat_id

* ggml : get_rows : support non-contiguos tensors with gaps, generalize up to 3D

* ggml : update get_rows f16 and q

* cuda : support non-contiguous src1 in get_rows

* llama : offload missing ffn_moe_silu

* metal : fix ggml_get_rows to work with non-cont src1

* metal : add indirect mat-vec kernels for all quantization types

* llama : do not quantize expert gating tensors

* llama : add n_expert and n_expert_used to hparams + change quants

* test-backend-ops : add moe test

* cuda : fix get_rows when ncols is odd

* convert : determine n_ctx correctly

* metal : fix ggml_mul_mat_id for F32

* test-backend-ops : make experts more evenly probable (test_moe)

* test-backend-ops : cleanup, add moe test for batches

* test-backend-ops : add cpy from f32 -> all types test

* test-backend-ops : fix dequantize block offset

* llama : fix hard-coded number of experts

* test-backend-ops : simplify and disable slow tests to avoid CI timeout

* test-backend-ops : disable MOE test with thread sanitizer

* cuda : fix mul_mat_id with multi gpu

* convert : use 1e6 rope_freq_base for mixtral

* convert : fix style

* convert : support safetensors format

* gguf-py : bump version

* metal : add cpy f16 -> f32 kernel

* metal : fix binary ops for ne10 % 4 != 0

* test-backend-ops : add one more sum_rows test

* ggml : do not use BLAS with ggml_mul_mat_id

* convert-hf : support for mixtral-instruct (#4428)

* convert : typo fix, add additional hyperparameters, use LLaMA arch for Mixtral-instruct

* convert : use sentencepiece tokenizer for Mixtral-instruct

* convert : make flake8 happy

* metal : fix soft_max kernels

ref: https://github.com/ggerganov/ggml/pull/621/commits/1914017863d2f9ab8ecc0281cc2a56d683668b92

* metal : limit kernels to not use more than the allowed threads

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Radek Pilar <github@mrkva.eu>
2023-12-13 14:04:25 +02:00