Commit graph

2548 commits

Gilad S
ecab1c75de
cmake : fix subdir for LLAMA_METAL_EMBED_LIBRARY (#5985) 2024-03-11 10:00:08 +02:00
Georgi Gerganov
ee35600b90
llama : fix F16/F32 downcast + improve names (#5980) 2024-03-11 09:56:47 +02:00
Kawrakow
be858f6205
Better 1.5 bit quantization (#5971)
* Trying blocks of 16 for IQ1_S - seems slightly better

* iq1s_blocks16: Adjust scale fudge factor to 1.125

* iq1s_blocks16: going to blocks of 32

with 2048 lattice points, so same bpw.
This is even better than blocks of 16.
Should I try blocks of 64? But to keep the same
bpw, when I go to 4096 lattice points, I need to
remove blocks altogether and just have superblocks of
256 weights.

* iq1s_blocks16: Use 2*<x^2> as sigma2 in weight adjustment

* iq1s_blocks16: scalar and AVX2 dot products

* iq1s_blocks16: CUDA dot product

* iq1s_blocks16: Metal works, Neon does not

Metal works but TG is dog slow (35 t/s). PP is OKish (493 t/s).
Not seeing the bug in the Neon implementation for now.

* iq1s_blocks16: fixed Neon

* iq1s_blocks16: very slightly faster TG on Metal

Still pathetic at 37 t/s

* iq1s_blocks16: speedup Metal by packing codebook into uint32_t's

* Formatting

* iq1s_blocks16: uint32_t codebook is also better in CUDA

TG-128 is now 204 t/s up from 194 t/s.
PP-512 is 5890 t/s, so significantly better than other quants

* iq1s_blocks16: slightly faster Neon dot product

* iq1s_blocks16: faster AVX2 dot product

* iq1s_blocks16: adjust to ggml-common.h

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-03-11 07:51:49 +01:00
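A rough sketch of the bits-per-weight arithmetic behind the block-size trade-off described in the commit above. The per-block scale widths and codebook sizes below are illustrative assumptions, not the exact IQ1_S layout:

```python
import math

# Assumed encoding for illustration: each group of 8 weights stores one
# index into a lattice codebook, and each block carries a fixed-width scale.
def bpw(block_size, lattice_points, scale_bits_per_block):
    index_bits = math.log2(lattice_points) * (block_size // 8)
    return (index_bits + scale_bits_per_block) / block_size

# Doubling the block size halves the per-weight cost of the scale,
# which pays for doubling the codebook at unchanged total bpw:
print(bpw(16, 1024, 8))   # 1.75 bpw with 2^10 lattice points
print(bpw(32, 2048, 12))  # 1.75 bpw with 2^11 lattice points
```

Going to 4096 lattice points would add another bit per group of 8 (0.125 bpw) for the indices, which is why the commit notes that blocks would have to give way to 256-weight superblocks to stay at the same bpw.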
Abhilash Majumder
ef3ced26a3
[SYCL] Add q3_s and q1_s (#5886)
* Add q3_s and q1_s

* fix compilation

* fix build

* fix build

* fix build

* enable ops

* rm macro

* increase grid space
2024-03-11 10:27:56 +05:30
ochafik
e1ed7a04d6 json: add date, time, date-time formats 2024-03-11 04:03:05 +00:00
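Illustrative only: the kind of grammar rule a schema carrying "format": "date" could be lowered to. The rule name and emitted syntax here are assumptions, not the converter's actual output:

```python
# Build a GBNF-style rule for an ISO date (YYYY-MM-DD) by hand.
d = "[0-9]"
month = '( "0" [1-9] | "1" [0-2] )'
day = '( "0" [1-9] | [1-2] [0-9] | "3" [0-1] )'
print(f'date ::= {d} {d} {d} {d} "-" {month} "-" {day}')
```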
ochafik
9a61802a28 json: add date format + fix uuid 2024-03-11 02:58:14 +00:00
ochafik
d736e928d2 json: support prefixItems alongside array items 2024-03-11 02:32:58 +00:00
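For reference, prefixItems constrains array elements positionally while items covers whatever follows; a small schema showing the distinction (illustrative, not taken from the commit):

```python
# prefixItems fixes the first positions; items constrains the rest.
schema = {
    "type": "array",
    "prefixItems": [{"type": "integer"}, {"type": "string"}],
    "items": {"type": "boolean"},
}
# A valid instance under this schema: [1, "x", true, false]
```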
ochafik
56b8744158 Update ts-type-to-grammar.sh 2024-03-11 02:11:22 +00:00
ochafik
c8254e5f8a json: port fixes from mjs to python 2024-03-11 02:10:48 +00:00
ochafik
4e2d06c741 json: updated server & chat ( cd examples/server && ./deps.sh ) 2024-03-11 01:51:26 +00:00
ochafik
5389820453 Update json-schema-to-grammar.mjs 2024-03-11 01:47:22 +00:00
AidanBeltonS
3814a07392
[SYCL] Add support for SYCL Nvidia target (#5738)
* Add support for nvidia target in CMake

* Update sycl read-me for Nvidia target

* Fix errors
2024-03-11 09:13:57 +08:00
ochafik
11813a6b0a json: rm trailing spaces 2024-03-11 00:27:50 +00:00
ochafik
0e9494183b json: custom regex parser, adds dot support & JS-portable 2024-03-11 00:24:34 +00:00
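A minimal sketch of the hand-rolled pattern walk a JS-portable regex parser implies: no reliance on the host regex engine, explicit escape handling, and `.` treated as a wildcard. The function name and token shapes are assumptions for illustration:

```python
def tokenize(pattern: str):
    # Walk the pattern one char at a time so the same logic ports
    # between Python and JS without host-regex quirks.
    i, toks = 0, []
    while i < len(pattern):
        c = pattern[i]
        if c == "\\" and i + 1 < len(pattern):
            toks.append(("lit", pattern[i + 1]))  # escaped char is a literal
            i += 2
        elif c == ".":
            toks.append(("dot", c))               # wildcard metacharacter
            i += 1
        else:
            toks.append(("lit", c))
            i += 1
    return toks
```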
Georgi Gerganov
bb6d00bbf9
metal : move mm_id indices to shared mem (#5982) 2024-03-10 23:12:48 +02:00
Dean
7ab7b733bb
android : fix utf8 decoding error (#5935)
* examples: fix utf8 decoding error

some models have a tokenizer that decodes an id into an incomplete utf8 sequence, so the output needs to be validated and held back until the next token arrives
one example is https://huggingface.co/Qwen/Qwen1.5-1.8B-Chat-GGUF/resolve/main/qwen1_5-1_8b-chat-q4_0.gguf, and an example of such a token is 18137

* android : minor

---------

Co-authored-by: zhangfuwen <zhangfuwen@foxmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-10 22:03:17 +02:00
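The fix itself lives in the Android example, but the underlying idea is plain incremental decoding; in Python, the standard library's incremental UTF-8 decoder does exactly this hold-and-wait buffering:

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")()

def on_token_bytes(token_bytes: bytes) -> str:
    # Returns only fully decoded text; an incomplete trailing multi-byte
    # sequence is buffered until the next token completes it.
    return decoder.decode(token_bytes, final=False)

# A 3-byte character split across two tokens decodes cleanly:
assert on_token_bytes(b"\xe2\x82") == ""    # incomplete, held back
assert on_token_bytes(b"\xac") == "\u20ac"  # completed: the euro sign
```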
Georgi Gerganov
d9f65c97c3
readme : update hot topics 2024-03-10 20:58:26 +02:00
Georgi Gerganov
b838b53ad6
sync : ggml 2024-03-10 20:10:46 +02:00
Georgi Gerganov
df4dc3e7cb
ggml : try fix 32-bit arm compat (whisper/1938)
* ggml : try fix 32-bit arm compat

* ggml : fix cont
2024-03-10 20:10:39 +02:00
Georgi Gerganov
bf47a5eefc
ggml : remove __constant__ specifier for CUDA tables (#5940) 2024-03-10 20:09:24 +02:00
ochafik
27b1fefdf4 Delete commit.txt 2024-03-10 17:44:46 +00:00
ochafik
478f62ef5c json: support negative ranges in patterns 2024-03-10 17:35:32 +00:00
ochafik
d1fda6f450 json: simplify range escapes 2024-03-10 17:32:45 +00:00
ochafik
f57b467c74 json: add --allow-fetch 2024-03-10 17:20:05 +00:00
ochafik
54291e10d0 json: fix literal escapes 2024-03-10 17:19:27 +00:00
Pierrick Hymbert
fa8a809a91
server: ci: windows build and tests (#5968)
* server: ci: windows build and tests

* server: ci: remove tmp push branch

* server: ci: EOF EOL

* Use builtin

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* server: tests: server graceful shutdown, then kill, then hard kill

* server: tests: remove python2 unicode string

* server: tests: remove wrong comment on server startup; close_fds is always true

* server: tests: server kill, if pid exists

* server: tests: remove dependency on killall

* server: tests: ci windows: better handling when the pid exists

---------

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
2024-03-10 18:17:47 +01:00
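The shutdown ladder in the test commits (graceful shutdown, then kill, then hard kill) is a common pattern; a Python sketch under the assumption of a POSIX host — on Windows, SIGINT delivery to a child differs, which is what the pid-exists handling above works around:

```python
import signal
import subprocess

def stop_server(proc: subprocess.Popen, grace: float = 5.0) -> None:
    # Escalate only when the previous, gentler step times out.
    steps = (
        lambda: proc.send_signal(signal.SIGINT),  # graceful shutdown
        proc.terminate,                           # then kill
        proc.kill,                                # then hard kill
    )
    for stop in steps:
        stop()
        try:
            proc.wait(timeout=grace)
            return
        except subprocess.TimeoutExpired:
            continue
```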
ochafik
e8f25d6f0c json: handle uuid string format 2024-03-10 16:50:06 +00:00
ochafik
37b59d1d3b json: reuse regexp pattern subrules 2024-03-10 16:49:53 +00:00
ochafik
e8b78c28eb json: revert space to 1 at most 2024-03-10 16:49:15 +00:00
ochafik
ade339d55e json: accept duplicate identical rules 2024-03-10 16:48:56 +00:00
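A sketch of the dedup behavior this commit describes: identical rule bodies are accepted silently, while a name collision with a different body gets a numeric suffix. The real json-schema-to-grammar implementation may differ in detail:

```python
def add_rule(rules: dict, name: str, body: str) -> str:
    if rules.get(name) == body:
        return name          # duplicate identical rule: reuse silently
    if name not in rules:
        rules[name] = body
        return name
    # Same name, different body: find or mint a suffixed variant.
    i = 0
    while f"{name}{i}" in rules and rules[f"{name}{i}"] != body:
        i += 1
    rules[f"{name}{i}"] = body
    return f"{name}{i}"
```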
ochafik
dab2ea91a6 json: simplify nullable fields handling 2024-03-10 16:48:27 +00:00
DAN™
bcebd7dbf6
llama : add support for GritLM (#5959)
* add gritlm example

* gritlm results match

* tabs to spaces

* comment out debug printing

* rebase to new embed

* gritlm embeddings are back babeee

* add to gitignore

* allow to toggle embedding mode

* Clean-up GritLM sample code.

* Fix types.

* Flush stdout and output ending newline if streaming.

* mostly style fixes; correct KQ_mask comment

* add causal_attn flag to llama_cparams

* gritlm : minor

* llama : minor

---------

Co-authored-by: Douglas Hanley <thesecretaryofwar@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-10 17:56:30 +02:00
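GritLM runs one set of weights in two modes: bidirectional attention when producing embeddings and causal attention when generating, which is what the causal_attn flag toggles. A NumPy sketch of the mask difference (illustrative, not llama.cpp's internal representation):

```python
import numpy as np

def kq_mask(n_tokens: int, causal: bool) -> np.ndarray:
    # 0 where attention is allowed, -inf where it is masked out.
    mask = np.zeros((n_tokens, n_tokens), dtype=np.float32)
    if causal:
        # Generation: each position sees only itself and the past.
        mask[np.triu_indices(n_tokens, k=1)] = -np.inf
    return mask

embed_mask = kq_mask(4, causal=False)  # embedding mode: full attention
gen_mask = kq_mask(4, causal=True)     # generation mode: lower-triangular
```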
ochafik
8597caa685 Update ts-type-to-grammar.sh 2024-03-10 15:47:03 +00:00
ochafik
364bf9ec3d Update ts-type-to-grammar.sh 2024-03-10 15:44:51 +00:00
ochafik
5764d9ffbc Update json-schema-to-grammar.py 2024-03-10 15:33:59 +00:00
Clint Herron
2960eae847
grammar : verify parsed state (#5950) 2024-03-10 17:17:43 +02:00
ochafik
ee492c9e4d Merge remote-tracking branch 'origin/master' into json-fixes 2024-03-10 15:01:23 +00:00
ochafik
307110ad2c Update json-schema-to-grammar.py 2024-03-10 15:00:07 +00:00
ochafik
f37ad0a043 json: handle schema from pydantic Optional fields 2024-03-10 14:55:03 +00:00
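For context, pydantic (v2) renders an Optional field as an anyOf with a null branch, which is the schema shape this commit teaches the converter to handle:

```python
from typing import Optional

from pydantic import BaseModel

class Item(BaseModel):
    name: str
    note: Optional[str] = None

# The "note" property comes out roughly as:
#   {"anyOf": [{"type": "string"}, {"type": "null"}], "default": null}
print(Item.model_json_schema())
```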
Georgi Gerganov
c78541479c
nix: update flake.lock (#5969)
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/1536926ef5621b09bba54035ae2bb6d806d72ac8' (2024-02-29)
  → 'github:NixOS/nixpkgs/9df3e30ce24fd28c7b3e2de0d986769db5d6225d' (2024-03-06)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-03-10 16:43:08 +02:00
ochafik
ba57964f92 Update json-schema-to-grammar.py 2024-03-10 14:42:39 +00:00
ochafik
b061de52a7 Update json-schema-to-grammar.py 2024-03-10 13:49:27 +00:00
ochafik
259f3505bc Update json-schema-to-grammar.py 2024-03-10 13:38:40 +00:00
ochafik
1cde8ded7c json: extract repeated regexp patterns to subrule 2024-03-10 13:29:56 +00:00
ochafik
add8fee04a Create regex-to-grammar.py 2024-03-10 13:23:00 +00:00
Pierrick Hymbert
621e86b331
server: benchmark: chat/completions scenario and other llm servers comparison (#5941)
* server: bench: Init a bench scenario with K6
See #5827

* server: bench: EOL EOF

* server: bench: PR feedback and improved k6 script configuration

* server: bench: remove llamacpp_completions_tokens_seconds as it includes prompt processing time and is misleading

server: bench: add max_tokens from SERVER_BENCH_MAX_TOKENS

server: bench: increase truncated rate to 80% before failing

* server: bench: fix doc

* server: bench: change gauge custom metrics to trend

* server: bench: change gauge custom metrics to trend
server: bench: add trend custom metrics for total tokens per second average

* server: bench: doc add an option to debug http request

* server: bench: filter dataset too short and too long sequences

* server: bench: allow filtering out conversations in the dataset based on an env variable

* server: bench: fix assistant message sent instead of user message

* server : add defrag thold parameter

* server: bench: select prompts based on the current iteration id, not randomly, to make the bench more reproducible

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-09 23:41:49 +01:00
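The bench itself is a k6 (JavaScript) scenario, but two of the choices above sketch easily in a generic form: filtering out too-short and too-long sequences, and picking prompts by iteration id rather than randomly. Thresholds and names here are illustrative:

```python
def filter_dataset(conversations, count_tokens, n_min=4, n_max=1024):
    # Drop sequences that are too short (little signal) or too long
    # (they would be truncated) before the benchmark starts.
    return [c for c in conversations if n_min <= count_tokens(c) <= n_max]

def pick_prompt(prompts, iteration_id):
    # Indexing by iteration id, not random choice, makes runs reproducible.
    return prompts[iteration_id % len(prompts)]
```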
Georgi Gerganov
77d1ac7e00
server : print chat template info 2024-03-09 22:04:00 +02:00
slaren
d894f352bf
perplexity : support using multiple sequences to allow larger batch sizes (#5946)
* perplexity : support using multiple sequences to allow larger batch sizes

ggml-ci

* set cparams.n_parallel to the number of sequences

* print tested n_ctx, add assert
2024-03-09 19:55:54 +01:00
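The batching arithmetic behind the change, with illustrative numbers rather than defaults: a physical batch of n_batch tokens fits n_batch // n_ctx evaluation chunks, each given its own sequence id, which is what cparams.n_parallel is set to:

```python
# Illustrative numbers: how many n_ctx-sized chunks one batch can score.
n_batch, n_ctx = 8192, 2048
n_parallel = n_batch // n_ctx
print(n_parallel)  # 4 sequences evaluated per decode call
```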
Georgi Gerganov
098dbaab44
readme : update hot topics 2024-03-09 18:14:13 +02:00
Georgi Gerganov
8380ecfb21
ggml : fix unnecessary f32 -> f16 -> f32 casts (mmla) (#5951) 2024-03-09 17:36:20 +02:00