Commit graph

2548 commits

Gilad S
ecab1c75de
cmake : fix subdir for LLAMA_METAL_EMBED_LIBRARY (#5985) 2024-03-11 10:00:08 +02:00
Georgi Gerganov
ee35600b90
llama : fix F16/F32 downcast + improve names (#5980) 2024-03-11 09:56:47 +02:00
Kawrakow
be858f6205
Better 1.5 bit quantization (#5971)
* Trying blocks of 16 for IQ1_S - seems slightly better

* iq1s_blocks16: Adjust scale fudge factor to 1.125

* iq1s_blocks16: going to blocks of 32

with 2048 lattice points, so same bpw.
This is even better than blocks of 16.
Should I try blocks of 64? But to keep the same
bpw, when I go to 4096 lattice points, I need to
remove blocks altogether and just have superblocks of
256 weights.

* iq1s_blocks16: Use 2*<x^2> as sigma2 in weight adjustment

* iq1s_blocks16: scalar and AVX2 dot products

* iq1s_blocks16: CUDA dot product

* iq1s_blocks16: Metal works, Neon does not

Metal works but TG is dog slow (35 t/s). PP is OKish (493 t/s).
Not seeing the bug in the Neon implementation for now.

* iq1s_blocks16: fixed Neon

* iq1s_blocks16: very slightly faster TG on Metal

Still pathetic at 37 t/s

* iq1s_blocks16: speedup Metal by packing codebook into uint32_t's

* Formatting

* iq1s_blocks16: uint32_t codebook is also better in CUDA

TG-128 is now 204 t/s up from 194 t/s.
PP-512 is 5890 t/s, so significantly better than other quants

* iq1s_blocks16: slightly faster Neon dot product

* iq1s_blocks16: faster AVX2 dot product

* iq1s_blocks16: adjust to ggml-common.h

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-03-11 07:51:49 +01:00
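A rough sketch of the bits-per-weight arithmetic behind the block-size trade-off described in the commit above. The per-block scale widths and codebook sizes below are illustrative assumptions, not the exact IQ1_S layout:

```python
import math

# Assumed encoding for illustration: each group of 8 weights stores one
# index into a lattice codebook, and each block carries a fixed-width scale.
def bpw(block_size, lattice_points, scale_bits_per_block):
    index_bits = math.log2(lattice_points) * (block_size // 8)
    return (index_bits + scale_bits_per_block) / block_size

# Doubling the block size halves the per-weight cost of the scale,
# which pays for doubling the codebook at unchanged total bpw:
print(bpw(16, 1024, 8))   # 1.75 bpw with 2^10 lattice points
print(bpw(32, 2048, 12))  # 1.75 bpw with 2^11 lattice points
```

Going to 4096 lattice points would add another bit per group of 8 (0.125 bpw) for the indices, which is why the commit notes that blocks would have to give way to 256-weight superblocks to stay at the same bpw.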
Abhilash Majumder
ef3ced26a3
[SYCL] Add q3_s and q1_s (#5886)
* Add q3_s and q1_s

* fix compilation

* fix build

* fix build

* fix build

* enable ops

* rm macro

* increase grid space
2024-03-11 10:27:56 +05:30
ochafik
e1ed7a04d6 json: add date, time, date-time formats 2024-03-11 04:03:05 +00:00
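Illustrative only: the kind of grammar rule a schema carrying "format": "date" could be lowered to. The rule name and emitted syntax here are assumptions, not the converter's actual output:

```python
# Build a GBNF-style rule for an ISO date (YYYY-MM-DD) by hand.
d = "[0-9]"
month = '( "0" [1-9] | "1" [0-2] )'
day = '( "0" [1-9] | [1-2] [0-9] | "3" [0-1] )'
print(f'date ::= {d} {d} {d} {d} "-" {month} "-" {day}')
```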
ochafik
9a61802a28 json: add date format + fix uuid 2024-03-11 02:58:14 +00:00
ochafik
d736e928d2 json: support prefixItems alongside array items 2024-03-11 02:32:58 +00:00
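For reference, prefixItems constrains array elements positionally while items covers whatever follows; a small schema showing the distinction (illustrative, not taken from the commit):

```python
# prefixItems fixes the first positions; items constrains the rest.
schema = {
    "type": "array",
    "prefixItems": [{"type": "integer"}, {"type": "string"}],
    "items": {"type": "boolean"},
}
# A valid instance under this schema: [1, "x", true, false]
```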
ochafik
56b8744158 Update ts-type-to-grammar.sh 2024-03-11 02:11:22 +00:00
ochafik
c8254e5f8a json: port fixes from mjs to python 2024-03-11 02:10:48 +00:00
ochafik
4e2d06c741 json: updated server & chat ( cd examples/server && ./deps.sh ) 2024-03-11 01:51:26 +00:00
ochafik
5389820453 Update json-schema-to-grammar.mjs 2024-03-11 01:47:22 +00:00
AidanBeltonS
3814a07392
[SYCL] Add support for SYCL Nvidia target (#5738)
* Add support for nvidia target in CMake

* Update sycl read-me for Nvidia target

* Fix errors
2024-03-11 09:13:57 +08:00
ochafik
11813a6b0a json: rm trailing spaces 2024-03-11 00:27:50 +00:00
ochafik
0e9494183b json: custom regex parser, adds dot support & JS-portable 2024-03-11 00:24:34 +00:00
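A minimal sketch of the hand-rolled pattern walk a JS-portable regex parser implies: no reliance on the host regex engine, explicit escape handling, and `.` treated as a wildcard. The function name and token shapes are assumptions for illustration:

```python
def tokenize(pattern: str):
    # Walk the pattern one char at a time so the same logic ports
    # between Python and JS without host-regex quirks.
    i, toks = 0, []
    while i < len(pattern):
        c = pattern[i]
        if c == "\\" and i + 1 < len(pattern):
            toks.append(("lit", pattern[i + 1]))  # escaped char is a literal
            i += 2
        elif c == ".":
            toks.append(("dot", c))               # wildcard metacharacter
            i += 1
        else:
            toks.append(("lit", c))
            i += 1
    return toks
```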
Georgi Gerganov
bb6d00bbf9
metal : move mm_id indices to shared mem (#5982) 2024-03-10 23:12:48 +02:00
Dean
7ab7b733bb
android : fix utf8 decoding error (#5935)
* examples: fix utf8 decoding error

some models have a tokenizer that decodes an id into an incomplete utf8 sequence, so the output needs to be validated and held back until the next token arrives
one example is https://huggingface.co/Qwen/Qwen1.5-1.8B-Chat-GGUF/resolve/main/qwen1_5-1_8b-chat-q4_0.gguf, and an example of such a token is 18137

* android : minor

---------

Co-authored-by: zhangfuwen <zhangfuwen@foxmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-10 22:03:17 +02:00
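The fix itself lives in the Android example, but the underlying idea is plain incremental decoding; in Python, the standard library's incremental UTF-8 decoder does exactly this hold-and-wait buffering:

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")()

def on_token_bytes(token_bytes: bytes) -> str:
    # Returns only fully decoded text; an incomplete trailing multi-byte
    # sequence is buffered until the next token completes it.
    return decoder.decode(token_bytes, final=False)

# A 3-byte character split across two tokens decodes cleanly:
assert on_token_bytes(b"\xe2\x82") == ""    # incomplete, held back
assert on_token_bytes(b"\xac") == "\u20ac"  # completed: the euro sign
```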
Georgi Gerganov
d9f65c97c3
readme : update hot topics 2024-03-10 20:58:26 +02:00
Georgi Gerganov
b838b53ad6
sync : ggml 2024-03-10 20:10:46 +02:00
Georgi Gerganov
df4dc3e7cb
ggml : try fix 32-bit arm compat (whisper/1938)
* ggml : try fix 32-bit arm compat

* ggml : fix cont
2024-03-10 20:10:39 +02:00
Georgi Gerganov
bf47a5eefc
ggml : remove __constant__ specifier for CUDA tables (#5940) 2024-03-10 20:09:24 +02:00
ochafik
27b1fefdf4 Delete commit.txt 2024-03-10 17:44:46 +00:00
ochafik
478f62ef5c json: support negative ranges in patterns 2024-03-10 17:35:32 +00:00
ochafik
d1fda6f450 json: simplify range escapes 2024-03-10 17:32:45 +00:00
ochafik
f57b467c74 json: add --allow-fetch 2024-03-10 17:20:05 +00:00
ochafik
54291e10d0 json: fix literal escapes 2024-03-10 17:19:27 +00:00
Pierrick Hymbert
fa8a809a91
server: ci: windows build and tests (#5968)
* server: ci: windows build and tests

* server: ci: remove tmp push branch

* server: ci: EOF EOL

* Use builtin

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* server: tests: server graceful shutdown, then kill, then hard kill

* server: tests: remove python2 unicode string

* server: tests: remove wrong comment on server startup; close_fds is always true

* server: tests: server kill, if pid exists

* server: tests: remove dependency on killall

* server: tests: ci windows: better handling when the pid exists

---------

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
2024-03-10 18:17:47 +01:00
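The shutdown ladder in the test commits (graceful shutdown, then kill, then hard kill) is a common pattern; a Python sketch under the assumption of a POSIX host — on Windows, SIGINT delivery to a child differs, which is what the pid-exists handling above works around:

```python
import signal
import subprocess

def stop_server(proc: subprocess.Popen, grace: float = 5.0) -> None:
    # Escalate only when the previous, gentler step times out.
    steps = (
        lambda: proc.send_signal(signal.SIGINT),  # graceful shutdown
        proc.terminate,                           # then kill
        proc.kill,                                # then hard kill
    )
    for stop in steps:
        stop()
        try:
            proc.wait(timeout=grace)
            return
        except subprocess.TimeoutExpired:
            continue
```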
ochafik
e8f25d6f0c json: handle uuid string format 2024-03-10 16:50:06 +00:00
ochafik
37b59d1d3b json: reuse regexp pattern subrules 2024-03-10 16:49:53 +00:00
ochafik
e8b78c28eb json: revert space to 1 at most 2024-03-10 16:49:15 +00:00
ochafik
ade339d55e json: accept duplicate identical rules 2024-03-10 16:48:56 +00:00
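A sketch of the dedup behavior this commit describes: identical rule bodies are accepted silently, while a name collision with a different body gets a numeric suffix. The real json-schema-to-grammar implementation may differ in detail:

```python
def add_rule(rules: dict, name: str, body: str) -> str:
    if rules.get(name) == body:
        return name          # duplicate identical rule: reuse silently
    if name not in rules:
        rules[name] = body
        return name
    # Same name, different body: find or mint a suffixed variant.
    i = 0
    while f"{name}{i}" in rules and rules[f"{name}{i}"] != body:
        i += 1
    rules[f"{name}{i}"] = body
    return f"{name}{i}"
```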
ochafik
dab2ea91a6 json: simplify nullable fields handling 2024-03-10 16:48:27 +00:00
DAN™
bcebd7dbf6
llama : add support for GritLM (#5959)
* add gritlm example

* gritlm results match

* tabs to spaces

* comment out debug printing

* rebase to new embed

* gritlm embeddings are back babeee

* add to gitignore

* allow to toggle embedding mode

* Clean-up GritLM sample code.

* Fix types.

* Flush stdout and output ending newline if streaming.

* mostly style fixes; correct KQ_mask comment

* add causal_attn flag to llama_cparams

* gritlm : minor

* llama : minor

---------

Co-authored-by: Douglas Hanley <thesecretaryofwar@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-10 17:56:30 +02:00
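GritLM runs one set of weights in two modes: bidirectional attention when producing embeddings and causal attention when generating, which is what the causal_attn flag toggles. A NumPy sketch of the mask difference (illustrative, not llama.cpp's internal representation):

```python
import numpy as np

def kq_mask(n_tokens: int, causal: bool) -> np.ndarray:
    # 0 where attention is allowed, -inf where it is masked out.
    mask = np.zeros((n_tokens, n_tokens), dtype=np.float32)
    if causal:
        # Generation: each position sees only itself and the past.
        mask[np.triu_indices(n_tokens, k=1)] = -np.inf
    return mask

embed_mask = kq_mask(4, causal=False)  # embedding mode: full attention
gen_mask = kq_mask(4, causal=True)     # generation mode: lower-triangular
```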
ochafik
8597caa685 Update ts-type-to-grammar.sh 2024-03-10 15:47:03 +00:00
ochafik
364bf9ec3d Update ts-type-to-grammar.sh 2024-03-10 15:44:51 +00:00
ochafik
5764d9ffbc Update json-schema-to-grammar.py 2024-03-10 15:33:59 +00:00
Clint Herron
2960eae847
grammar : verify parsed state (#5950) 2024-03-10 17:17:43 +02:00
ochafik
ee492c9e4d Merge remote-tracking branch 'origin/master' into json-fixes 2024-03-10 15:01:23 +00:00
ochafik
307110ad2c Update json-schema-to-grammar.py 2024-03-10 15:00:07 +00:00
ochafik
f37ad0a043 json: handle schema from pydantic Optional fields 2024-03-10 14:55:03 +00:00
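For context, pydantic (v2) renders an Optional field as an anyOf with a null branch, which is the schema shape this commit teaches the converter to handle:

```python
from typing import Optional

from pydantic import BaseModel

class Item(BaseModel):
    name: str
    note: Optional[str] = None

# The "note" property comes out roughly as:
#   {"anyOf": [{"type": "string"}, {"type": "null"}], "default": null}
print(Item.model_json_schema())
```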
Georgi Gerganov
c78541479c
nix: update flake.lock (#5969)
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/1536926ef5621b09bba54035ae2bb6d806d72ac8' (2024-02-29)
  → 'github:NixOS/nixpkgs/9df3e30ce24fd28c7b3e2de0d986769db5d6225d' (2024-03-06)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-03-10 16:43:08 +02:00
ochafik
ba57964f92 Update json-schema-to-grammar.py 2024-03-10 14:42:39 +00:00
ochafik
b061de52a7 Update json-schema-to-grammar.py 2024-03-10 13:49:27 +00:00
ochafik
259f3505bc Update json-schema-to-grammar.py 2024-03-10 13:38:40 +00:00
ochafik
1cde8ded7c json: extract repeated regexp patterns to subrule 2024-03-10 13:29:56 +00:00
ochafik
add8fee04a Create regex-to-grammar.py 2024-03-10 13:23:00 +00:00
Pierrick Hymbert
621e86b331
server: benchmark: chat/completions scenario and other llm servers comparison (#5941)
* server: bench: Init a bench scenario with K6
See #5827

* server: bench: EOL EOF

* server: bench: PR feedback and improved k6 script configuration

* server: bench: remove llamacpp_completions_tokens_seconds as it includes prompt processing time and is misleading

server: bench: add max_tokens from SERVER_BENCH_MAX_TOKENS

server: bench: increase truncated rate to 80% before failing

* server: bench: fix doc

* server: bench: change gauge custom metrics to trend

* server: bench: change gauge custom metrics to trend
server: bench: add trend custom metrics for total tokens per second average

* server: bench: doc add an option to debug http request

* server: bench: filter dataset too short and too long sequences

* server: bench: allow filtering out conversations in the dataset based on an env variable

* server: bench: fix assistant message sent instead of user message

* server : add defrag thold parameter

* server: bench: select prompts based on the current iteration id, not randomly, to make the bench more reproducible

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-09 23:41:49 +01:00
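The bench itself is a k6 (JavaScript) scenario, but two of the choices above sketch easily in a generic form: filtering out too-short and too-long sequences, and picking prompts by iteration id rather than randomly. Thresholds and names here are illustrative:

```python
def filter_dataset(conversations, count_tokens, n_min=4, n_max=1024):
    # Drop sequences that are too short (little signal) or too long
    # (they would be truncated) before the benchmark starts.
    return [c for c in conversations if n_min <= count_tokens(c) <= n_max]

def pick_prompt(prompts, iteration_id):
    # Indexing by iteration id, not random choice, makes runs reproducible.
    return prompts[iteration_id % len(prompts)]
```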
Georgi Gerganov
77d1ac7e00
server : print chat template info 2024-03-09 22:04:00 +02:00
slaren
d894f352bf
perplexity : support using multiple sequences to allow larger batch sizes (#5946)
* perplexity : support using multiple sequences to allow larger batch sizes

ggml-ci

* set cparams.n_parallel to the number of sequences

* print tested n_ctx, add assert
2024-03-09 19:55:54 +01:00
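The batching arithmetic behind the change, with illustrative numbers rather than defaults: a physical batch of n_batch tokens fits n_batch // n_ctx evaluation chunks, each given its own sequence id, which is what cparams.n_parallel is set to:

```python
# Illustrative numbers: how many n_ctx-sized chunks one batch can score.
n_batch, n_ctx = 8192, 2048
n_parallel = n_batch // n_ctx
print(n_parallel)  # 4 sequences evaluated per decode call
```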
Georgi Gerganov
098dbaab44
readme : update hot topics 2024-03-09 18:14:13 +02:00
Georgi Gerganov
8380ecfb21
ggml : fix unnecessary f32 -> f16 -> f32 casts (mmla) (#5951) 2024-03-09 17:36:20 +02:00