Sang-Kil Park
f68664ac24
convert : fix TypeError on GPT-2 vocab.json ( #5288 )
2024-02-06 23:28:00 -05:00
hsnmkls
425ae7401f
Merge branch 'ggerganov:master' into master
2024-02-07 07:31:47 +08:00
Hasan Mukhlis
0c02642d03
build.zig add macos
2024-02-07 07:26:17 +08:00
Alexey Parfenov
213d1439fa
server : remove model.json endpoint ( #5371 )
2024-02-06 20:08:38 +02:00
Johannes Gäßler
17c97fb062
CUDA: mul_mat_vec_q max. batch size 8 -> 4 ( #5370 )
2024-02-06 19:43:06 +02:00
Kawrakow
b08f22c882
Update README.md ( #5366 )
Add some links to quantization related PRs
2024-02-06 19:00:16 +02:00
Kawrakow
f57fadc009
Slight quantization improvement for Q4_K and Q5_K ( #5361 )
* Q4_K: slightly better quantization
* Q5_K: slightly better quantization
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-06 17:28:02 +02:00
BarfingLemurs
2e9c0bd6b3
readme : add phi, orion 14b, internlm2, and yi-VL to readme ( #5362 )
2024-02-06 16:06:48 +02:00
Johannes Gäßler
2c516611f1
CUDA: mul_mat_vec_q for batch sizes > 1 ( #5351 )
2024-02-06 14:44:06 +01:00
hazelnutcloud
279a1d7448
working vulkan zig build
2024-02-06 20:36:50 +08:00
hsnmkls
16ecbc9a02
Merge branch 'ggerganov:master' into master
2024-02-06 17:24:00 +08:00
Justin Parker
8a79c591de
server : include total "num_slots" in props endpoint ( #5349 )
2024-02-06 11:20:59 +02:00
Michael Coppola
31e7903221
server : add `dynatemp_range` and `dynatemp_exponent` ( #5352 )
* server: added `dynatemp_range` and `dynatemp_exponent`
* Update README.md
---------
Co-authored-by: Michael Coppola <info@michaeljcoppola.com>
2024-02-06 11:20:00 +02:00
Niall Coates
4ffc7a17d4
server : various fixes for the prompt field in /completion ( #5300 )
server : fix deadlock when prompt array contains strings and numbers
server : removed an unnecessary generation when generating multi-prompts
server : removed an unnecessary assert
2024-02-06 10:16:23 +02:00
Georgi Gerganov
906cff55c2
py : handle byte tokens in `get_token_type` ( #5341 )
* py : handle byte tokens in `get_token_type`
* py : fix empty bytes arg
2024-02-06 07:47:22 +02:00
hsnmkls
40b9ba117c
Merge branch 'ggerganov:master' into master
2024-02-06 13:32:24 +08:00
Johannes Gäßler
098f6d737b
make: Use ccache for faster compilation ( #5318 )
* make: Use ccache for faster compilation
2024-02-05 19:33:00 +01:00
Johannes Gäßler
78b00dda6c
README: updated introduction ( #5343 )
* README: updated introduction
* readme : update
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-05 15:55:10 +01:00
Kawrakow
c6b395535a
ggml : make use of ggml-quants.h possible in C++ code ( #5338 )
* Make use of ggml-quants.h possible in C++ code
* One cannot possibly be defining static_assert in a C++ compilation
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-05 14:09:47 +02:00
Dr. Tom Murphy VII Ph.D
abb61944a5
ggml : avoid duplicating function calls using MIN/MAX macros ( #5325 )
* Avoid duplicating function calls when using MIN/MAX macros.
Since these macros substitute "a" and "b" textually, they ask the compiler to evaluate one of them twice. The compiler doesn't have a problem with removing the duplication in something like MAX(0, x + 2), but in some cases we're calling functions, and those calls just happen twice.
By explicitly evaluating the expression once beforehand, we get smaller and faster code without duplicate calls. See ggml_rope_yarn_corr_dims in Compiler Explorer:
https://godbolt.org/z/Ee4KMrvKh
Code behaves exactly the same.
* Update ggml.c
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-05 13:13:57 +02:00
Kawrakow
89503dcb5f
iq3_xxs: guards for the no-imatrix situation ( #5334 )
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-05 12:32:27 +02:00
Guoteng
7e1ae372f3
py : fix internlm2-hf convert to gguf ( #5305 )
* py : fix internlm2-hf convert to gguf
* ggml-ci
2024-02-05 11:04:06 +02:00
Kawrakow
6fdfa2ecc6
iq2_xxs: tune quantization ( #5320 )
We get slightly better PPL, and we cut quantization time nearly in half.
The trick is to first quantize without forcing points onto the E8-lattice.
We can then use a narrower search range around the block scale that we got that way.
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-05 10:46:06 +02:00
Alexey Parfenov
a2d60c9158
server : allow to get default generation settings for completion ( #5307 )
2024-02-05 10:10:22 +02:00
l3utterfly
e6f8177532
common : add dynamic temperature parameters to main example cli ( #5295 )
* added dynamic temp params in main
* added help text
2024-02-05 10:00:47 +02:00
Georgi Gerganov
30679d438d
scripts : fix typos, cleanup ( #5303 )
2024-02-05 09:48:03 +02:00
Нияз Гарифзянов
4be04c8965
scripts : add non-interactive server-llm.sh ( #5303 )
* Update server-llm.sh
Add a --non-interactive flag that allows running the script without asking for permission
* Update scripts/server-llm.sh
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-05 09:43:57 +02:00
chiranko
5d55b0cd82
readme : add CodeShell models to the supported models list ( #5330 )
2024-02-05 09:41:38 +02:00
AidanBeltonS
4833ac209d
[SYCL] Fix cpy with dims of 3 ( #5289 )
* Fix cpy with dims of 3
* rm asserts
---------
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
2024-02-05 12:38:24 +05:30
hsnmkls
770e435e1c
Merge branch 'ggerganov:master' into master
2024-02-05 01:02:28 +08:00
github-actions[bot]
9392ebd49e
flake.lock: Update
Flake lock file updates:
• Updated input 'flake-parts':
'github:hercules-ci/flake-parts/07f6395285469419cf9d078f59b5b49993198c00' (2024-01-11)
→ 'github:hercules-ci/flake-parts/b253292d9c0a5ead9bc98c4e9a26c6312e27d69f' (2024-02-01)
• Updated input 'flake-parts/nixpkgs-lib':
'github:NixOS/nixpkgs/b0d36bd0a420ecee3bc916c91886caca87c894e9?dir=lib' (2023-12-30)
→ 'github:NixOS/nixpkgs/97b17f32362e475016f942bbdfda4a4a72a8a652?dir=lib' (2024-01-29)
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/ae5c332cbb5827f6b1f02572496b141021de335f' (2024-01-25)
→ 'github:NixOS/nixpkgs/b8b232ae7b8b144397fdb12d20f592e5e7c1a64d' (2024-01-31)
2024-02-04 08:45:35 -08:00
hsnmkls
edf46a38ff
Merge branch 'ggerganov:master' into master
2024-02-04 18:56:19 +08:00
Kawrakow
5ed26e1fc9
Adding some imatrix tools ( #5302 )
* imatrix: adding --combine and --continue-from
* imatrix: be able to start from a specific chunk
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-04 10:39:58 +02:00
Welby Seely
277fad30c6
cmake : use set() for LLAMA_WIN_VER ( #5298 )
option() is specifically for booleans.
Fixes #5158
2024-02-03 23:18:51 -05:00
Johannes Gäßler
3c0d25c475
make: add nvcc info print ( #5310 )
2024-02-03 20:15:13 +01:00
Johannes Gäßler
3cc5ed353c
make: fix nvcc optimization flags for host code ( #5309 )
2024-02-03 20:14:59 +01:00
Martin Schwaighofer
60ecf099ed
add Vulkan support to Nix flake
2024-02-03 13:13:07 -06:00
Hasan Mukhlis
1582e4e373
Merge branch 'build.zig'
2024-02-04 01:22:45 +08:00
0cc4m
5444f24893
Vulkan Intel Fixes, Optimizations and Debugging Flags ( #5301 )
* Fix Vulkan on Intel ARC
Optimize matmul for Intel ARC
Add Vulkan dequant test
* Add Vulkan debug and validate flags to Make and CMakeLists.txt
* Enable asynchronous transfers in Vulkan backend
* Fix flake8
* Disable Vulkan async backend functions for now
* Also add Vulkan run tests command to Makefile and CMakeLists.txt
2024-02-04 01:21:28 +08:00
Hasan Mukhlis
a19a4b2a2f
update build.zig to master build
2024-02-04 01:19:45 +08:00
0cc4m
e920ed393d
Vulkan Intel Fixes, Optimizations and Debugging Flags ( #5301 )
* Fix Vulkan on Intel ARC
Optimize matmul for Intel ARC
Add Vulkan dequant test
* Add Vulkan debug and validate flags to Make and CMakeLists.txt
* Enable asynchronous transfers in Vulkan backend
* Fix flake8
* Disable Vulkan async backend functions for now
* Also add Vulkan run tests command to Makefile and CMakeLists.txt
2024-02-03 18:15:00 +01:00
Michael Klimenko
52bb63c708
refactor : switch to emplace_back to avoid extra object ( #5291 )
2024-02-03 13:23:37 +02:00
Jared Van Bortel
1ec3332ade
YaRN : store rope scaling type as int32_t in memory ( #5285 )
* YaRN : store rope scaling type as int32_t in memory
* llama : store mapped names as const char *
2024-02-03 13:22:06 +02:00
BADR
6a66c5071a
readme : add tenere in the ui tools list ( #5284 )
2024-02-03 13:20:26 +02:00
AidanBeltonS
a305dba8ff
Fix im2col with fp32 ( #5286 )
2024-02-03 16:11:37 +08:00
kalomaze
191221178f
perplexity : fix KL divergence calculations on Windows ( #5273 )
2024-02-02 16:15:30 +02:00
Georgi Gerganov
e437b37fd0
scripts : parse wtype in server-llm.sh ( #5167 )
* scripts : parse wtype in server-llm.sh
* scripts : fix check for wfile
2024-02-02 14:23:40 +02:00
Mirror Azure
2d40085c26
py : add check for '.attn.masked_bias' layers to GPT2model ( #5281 )
2024-02-02 13:39:09 +02:00
AidanBeltonS
b05102fe8c
Tidy ggml-sycl ( #5261 )
* Tidy some code in ggml-sycl
* Remove blank space
* Remove std::printf comments
---------
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
2024-02-02 16:39:48 +08:00
Xuan Son Nguyen
6b91b1e0a9
docker : add build for SYCL, Vulkan + update readme ( #5228 )
* add vulkan dockerfile
* intel dockerfile: compile sycl by default
* fix vulkan dockerfile
* add docs for vulkan
* docs: sycl build in docker
* docs: remove trailing spaces
* docs: sycl: add docker section
* docs: clarify install vulkan SDK outside docker
* sycl: use intel/oneapi-basekit docker image
* docs: correct TOC
* docs: correct docker image for Intel oneMKL
2024-02-02 09:56:31 +02:00