Commit graph

2888 commits

Author SHA1 Message Date
Eugene Palmoff
a787ebe7cf
Handle broken pipe error (#572) 2023-12-21 17:51:36 +08:00
Johannes Gäßler
799fc22689
CUDA: Faster Mixtral prompt processing (#4538)
* CUDA: make MoE tensors contiguous for batch size>1

* Update ggml-cuda.cu

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
2023-12-20 15:41:22 +01:00
Eric Sommerlade
328b83de23
ggml : fixed check for _MSC_VER (#4535)
Co-authored-by: Eric Sommerlade <ersomme@microsoft.com>
2023-12-19 18:17:01 +02:00
Concedo
3f863eed72 add presence penalty 2023-12-19 23:18:56 +08:00
Concedo
da2db0302c Added support for ssl cert and key 2023-12-19 22:23:19 +08:00
Concedo
49a5dfc604 Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
#	README.md
2023-12-19 16:07:48 +08:00
Concedo
1f77d2ad73 move multiprocessing import into function scope 2023-12-19 15:56:58 +08:00
ebolam
6948da5a0d
Fix for windows model unloading not releasing memory (#569)
* Add in model processes as a separate process so it can be killed when unloading to release memory on windows

* Fix from Henky
2023-12-19 15:55:41 +08:00
Concedo
4c274dc2fd fix tools compilation 2023-12-19 15:53:22 +08:00
arlo-phoenix
a7aee47b98
ggml-cuda: Fix HIP build (#4528)
regression of #4490
Adds defines for two new datatypes
cublasComputeType_t, cudaDataType_t.

Currently using deprecated hipblasDatatype_t since newer ones very recent.
2023-12-18 22:33:45 +01:00
Georgi Gerganov
0e18b2e7d0
llama.swiftui : add tinyllama 1.1B F16 2023-12-18 20:17:43 +02:00
Georgi Gerganov
6ff39b129d
llama.swiftui : add more models 2023-12-18 20:05:12 +02:00
Ebey Abraham
b9e74f9bca
llama : add phi-2 + fix NeoX rope + ggml_mul_mat_set_prec (#4490)
* phi2 implementation

* fix breaking change

* phi-2 : various fixes

* phi-2 : use layer norm eps

* py : whitespaces

* llama : fix meta KV override bug

* convert : phi don't add BOS token

* convert : revert "added_tokens_decoder" change

* phi-2 : scale Q instead of KQ for better precision

* ggml : fix NeoX rope to rotate just first n_dims

* cuda : less diff in the rope_neox kernel

* ggml : add ggml_mul_mat_set_prec

ggml-ci

* Update ggml-cuda.cu

Co-authored-by: slaren <slarengh@gmail.com>

* Update ggml-cuda.cu

Co-authored-by: slaren <slarengh@gmail.com>

* cuda : ggml_cuda_op_mul_mat_cublas support F32 precision

* cuda : remove oboslete comment

---------

Co-authored-by: Ebey Abraham <ebeyabraham@microsoft.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2023-12-18 19:27:47 +02:00
hankcs
3c04bf6da8
llama : fix try_override for bool_value which always return true (#4519) 2023-12-18 15:14:58 +02:00
Jared Van Bortel
2994f0c5a2
decode : fix logits_valid for legacy API (#4516) 2023-12-17 19:39:02 -05:00
Georgi Gerganov
b1306c4394
readme : update hot topics 2023-12-17 20:16:23 +02:00
Georgi Gerganov
800a489e4a
llama.swiftui : add bench functionality (#4483)
* llama.swiftui : add bench button

* llama.swiftui : initial bench functionality

* force to use n_gpu_layers on simulator

* add download buttons & expose llamaState.loadModel

* update project.pbxproj

* comment #Preview & fix editorconfig check

* gitignore : xcode stuff

* llama.swiftui : UX improvements

* llama.swiftui : avoid data copy via "downloadTask"

* llama.swiftui : remove model from project

* llama : remove "mostly" from model infos

* llama.swiftui : improve bench

---------

Co-authored-by: jhen <developer@jhen.me>
2023-12-17 19:38:41 +02:00
Jared Van Bortel
f7f468a97d
gguf-py : fail fast on nonsensical special token IDs (#4489) 2023-12-17 10:45:46 -05:00
Matheus Gabriel Alves Silva
919c40660f
build : Check the ROCm installation location (#4485)
* build : Check the ROCm installation location

* more generic approach

* fixup! It was returning the path instead of the command output

* fixup! Trailing whitespace
2023-12-17 17:23:33 +02:00
slaren
45668633fd
finetune : keep allocs alive until all allocations are done (#4486) 2023-12-17 16:05:56 +01:00
olexiyb
0ffc92d2d2
server : disable llm logs if SERVER_VERBOSE is off (#3792) 2023-12-17 17:02:16 +02:00
AdithyanI
8edd2b40fd
server : fix grammar being ignored (#4494)
Fix bug in identifying the grammar.
2023-12-17 16:57:56 +02:00
Alexey Parfenov
eb16dae7e7
server : fix possible ambiguity in content type charset (#4501) 2023-12-17 16:56:09 +02:00
mzcu
62bd52b7bf
server : allow requests larger than 8K (#4500) 2023-12-17 16:54:37 +02:00
Bach Le
5daa5f54fd
Link to cublas dynamically on Windows even with LLAMA_STATIC (#4506) 2023-12-17 11:57:33 +01:00
Concedo
ec05230703 updated lite, up ver 2023-12-17 14:38:39 +08:00
Concedo
e8cf7f6ed3 Merge remote-tracking branch 'origin/master' into concedo_experimental 2023-12-17 14:37:14 +08:00
slaren
c6c4fc081c
lora : add support for non-llama models (#3333)
* lora : add support for non-llama models

ggml-ci

* avoid leaking ggml_context on failure
cleanup

ggml-ci

* lora : allow 1d tensors

* lora : include embd and output layers in size calculation

* fix style
2023-12-16 18:58:46 +01:00
Concedo
76a3ba42eb Merge branch 'master' into concedo_experimental
# Conflicts:
#	ggml.c
#	ggml.h
#	requirements.txt
#	tests/test-quantize-perf.cpp
2023-12-16 22:58:53 +08:00
Jared Van Bortel
8a5be3bd58
llama : sanity checks for access to logits (#4274)
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-15 22:16:15 -05:00
ShadovvBeast
88ae8952b6
server : add optional API Key Authentication example (#4441)
* Add API key authentication for enhanced server-client security

* server : to snake_case

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-15 13:49:01 +02:00
slaren
ee4725a686
ggml : group mul_mat_id rows by matrix (cpu only) (#4480)
* ggml : group mul_mat_id rows by matrix (cpu only)

* remove mmid parameters from mm forward

* store row groups in wdata and calculate only once in GGML_TASK_INIT

ggml-ci
2023-12-15 12:45:50 +01:00
slaren
6744dbe924
ggml : use ggml_row_size where possible (#4472)
* ggml : use ggml_row_size where possible

ggml-ci

* ggml : move ggml_nbytes_split to ggml-cuda.cu
2023-12-14 20:05:21 +01:00
slaren
cafcd4f895
ggml : remove n_dims from ggml_tensor (#4469)
ggml-ci
2023-12-14 16:52:08 +01:00
wonjun Jang
c50e400163
py : add protobuf dependency (#4466) 2023-12-14 14:44:49 +02:00
LostRuins
20a68a7030
ggml : add ggml_row_size() (fixes llama out of space) (#4461)
* Fixes "Not enough space in the context's memory pool" encountered on certain models, which seems to be caused by some imprecision related to the automatic casting of floating point values

* do not cast to size_t, instead just use doubles

* ggml : add ggml_row_size(), deprecate ggml_type_sizef()

* ggml : fix row size compute to avoid overflows

* tests : fix sizey -> sizez

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-14 14:13:33 +02:00
Concedo
7798587990 Workflow Build from experimental branch 2023-12-14 19:17:19 +08:00
Concedo
ae3d829d0c manual workflow for generating builds instead 2023-12-14 19:00:58 +08:00
Concedo
aac7f0b944 Merge branch 'master' into concedo_experimental
# Conflicts:
#	ggml.c
2023-12-14 17:24:42 +08:00
Concedo
f0de4953ae fixed length exceeding max ctx 2023-12-14 16:58:41 +08:00
Concedo
04bd895311 Revert "Fixes "Not enough space in the context's memory pool" encountered on certain models, which seems to be caused by some imprecision related to the automatic casting of floating point values"
This reverts commit 34b3dac66d.
2023-12-14 16:46:29 +08:00
Concedo
53bbd1ee43 Merge branch 'pr_fix_buf_resize_type' into concedo_experimental 2023-12-14 16:45:18 +08:00
Concedo
05f7db4b29 do not cast to size_t, instead just use doubles 2023-12-14 16:43:34 +08:00
Georgi Gerganov
55e87c3749
ggml : fix OpenCL broadcast requirement for ggml_mul (close #4453) 2023-12-14 10:35:29 +02:00
Concedo
c88fc19d59 Merge branch 'master' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	Makefile
#	README.md
2023-12-14 16:32:42 +08:00
Concedo
34b3dac66d Fixes "Not enough space in the context's memory pool" encountered on certain models, which seems to be caused by some imprecision related to the automatic casting of floating point values
(cherry picked from commit 1ad8f0d80e)
2023-12-14 16:16:25 +08:00
wonjun Jang
873637afc7
convert : support loading vocab from fast tokenizer config (#3633)
* Add HFVocab into convert.py

* Update convert.py

* Update convert.py

* add bytes_to_unicode function

* change add_meta_vocab fucntion

* remove debug code

* remove byte_encoder

* Add newline between classes

* Check tokenizer.json when tokenizer.model is not exist.

* Move transformers dependency to local code

* Add error context with 'raise from'

* Add fast tokenizer option to BpeVocab

* Update convert.py

* Add VocabLoader and remove *Vocab class

* Add transformers dependency

* remove added tokens and check newline token to decide spm or bpe

* Update convert.py

* Add special token type

* Update convert.py

* Update convert.py

* Update convert.py

* Fix typo in convert.py

* Fix when params.n_vocab < tokenizer vocab size

* update vocab class

* change funtion name

* Remove unused variable/functions, add types to class variable and methods, delete blank liens

* fix flake8 warnings

* code style cleanup

* make mypy happy

* change exception

---------

Co-authored-by: Jared Van Bortel <jared@nomic.ai>
2023-12-14 10:09:34 +02:00
Concedo
1ad8f0d80e Fixes "Not enough space in the context's memory pool" encountered on certain models, which seems to be caused by some imprecision related to the automatic casting of floating point values 2023-12-14 16:00:44 +08:00
BarfingLemurs
0353a18401
readme : update supported model list (#4457) 2023-12-14 09:38:49 +02:00
Concedo
0e31f53422 Revert "lowvram var defaults"
This reverts commit 7a691522a6.
2023-12-14 15:14:11 +08:00