Commit graph

1067 commits

Author · SHA1 · Message · Date
Georgi Gerganov
596e1094fb
common : remove obsolete BPE API + disable test-tokenizer-1 2023-08-23 20:31:03 +03:00
Georgi Gerganov
2424e1d08e
llama : remove obsolete comment
ggml-ci
2023-08-23 20:16:40 +03:00
Georgi Gerganov
3bfb720642
llama : advanced BPE tokenizer based on ggllm.cpp implementation 2023-08-23 20:11:45 +03:00
Georgi Gerganov
c3f8a6e49f
llama : prep new tokenizer support 2023-08-23 19:08:44 +03:00
Georgi Gerganov
6938c5f474
Merge branch 'master' into falcon 2023-08-23 17:08:14 +03:00
Georgi Gerganov
176ea716b3
llama : better model naming and size reporting 2023-08-23 15:53:57 +03:00
slaren
e7299656bd
falcon : add CUDA offloading (#2739) 2023-08-23 15:51:30 +03:00
Georgi Gerganov
854ae5d030
metal : temporary workaround for the concurrency optimization bug 2023-08-23 15:25:31 +03:00
Georgi Gerganov
0a85ae7397
metal : fix GELU kernel numerical stability by using precise::tanh 2023-08-23 15:05:34 +03:00
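For context on the fix above: the numerically sensitive part of the tanh-based GELU approximation that ggml's GELU kernels follow is the tanh term itself, which is why a precise tanh matters. For reference only (this is the approximation, not the kernel source):

$$\mathrm{GELU}(x) \approx \tfrac{1}{2}\,x\left(1 + \tanh\!\left(\sqrt{\tfrac{2}{\pi}}\left(x + 0.044715\,x^{3}\right)\right)\right)$$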
klosax
b693000c2e
llama.cpp : fix linefeed token 2023-08-23 13:22:41 +02:00
Kawrakow
8207214b6a
Fix values shown in the quantize tool help (#2735)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-08-23 12:57:12 +03:00
Kawrakow
62959e740e
Strided perplexity (#2714)
* Implementing strided computation of perplexity

* Alternative way to output PPL results

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-08-23 12:56:42 +03:00
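For reference, the perplexity reported by the tool is the standard definition below; striding only changes how the evaluation windows over the text overlap (details in the PR), not the formula itself:

$$\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p\!\left(x_i \mid x_{<i}\right)\right)$$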
IgnacioFDM
7f7ddd5002
Fix ggml to gguf conversion on Windows (#2733)
This fixes `RuntimeWarning: overflow encountered in long_scalars`

Credit: anon (not mine)
2023-08-23 03:31:09 -06:00
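The warning above is typical of NumPy on Windows, where the default integer scalar is 32-bit; multiplying large tensor dimensions then overflows. A minimal sketch of the failure mode and the usual remedy (illustrative only, not the converter's actual code):

```python
import numpy as np

# On Windows, NumPy's default integer is 32-bit (C long), so products of
# large tensor dimensions can overflow and emit
# "RuntimeWarning: overflow encountered in long_scalars".
rows, cols = np.int32(32000), np.int32(70000)   # hypothetical tensor shape
n_bytes_bad = rows * cols * 4                   # overflows a 32-bit integer

# Widening to 64 bits (or using plain Python ints) avoids the overflow.
n_bytes_ok = int(rows) * int(cols) * 4
print(n_bytes_ok)                               # 8960000000
```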
Georgi Gerganov
e2d23bed1b
falcon : minor changes (still chasing the Metal problem) 2023-08-23 12:25:49 +03:00
Georgi Gerganov
a0dc47a501
metal : print extra compute pipeline info 2023-08-23 11:25:26 +03:00
Georgi Gerganov
b34ab74094
falcon : copy-paste self-attention from LLaMA 2023-08-23 11:04:26 +03:00
Georgi Gerganov
af4bbcc873
ggml : ggml_repeat always creates new tensor 2023-08-23 10:42:02 +03:00
Georgi Gerganov
99bb26078f
metal : implement RoPE (mode = 2) + avoid ggml_repeat 2023-08-23 10:41:35 +03:00
Georgi Gerganov
e3c52bd990
ggml : pass eps to ggml_norm 2023-08-23 10:40:58 +03:00
Xiao-Yong Jin
b8ad1b66b2
server : allow json array in prompt or content for direct token input (#2306)
* server: allow json array in prompt or content

We accept an array of strings and numbers representing tokens,
in addition to the current string-valued prompt or content.

This allows direct token input, so any special tokens can be
processed and inserted on the frontend while constructing the JSON
data, before sending it to the server; the server then does not
need to know about or parse special tokens in textual input.

With this, we can use the BOS and EOS tokens used in llama-2-chat models.

* server: use tokenizePrompt(json) and default "" if empty prompt

* server: fix prompt check

* server: tokenize endpoint no longer adds BOS
2023-08-23 15:12:12 +08:00
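A minimal sketch of the mixed prompt format described above, assuming a llama.cpp server listening on localhost:8080; token ID 1 is LLaMA's BOS, and the fields besides `prompt` are assumptions for this example:

```python
import requests

# "prompt" may now be an array mixing raw token IDs (numbers) and text
# (strings); here ID 1 is LLaMA's BOS token. Host, port and n_predict
# are assumptions for this illustration.
payload = {
    "prompt": [1, "[INST] Write a haiku about autumn. [/INST]"],
    "n_predict": 64,
}
resp = requests.post("http://localhost:8080/completion", json=payload)
print(resp.json().get("content", ""))
```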
Evan Jones
f5fe98d11b
docs : add grammar docs (#2701)
* docs : add grammar docs

* tweaks to grammar guide

* rework GBNF example to be a commented grammar
2023-08-22 21:01:57 -04:00
Kerfuffle
777f42ba18
Improve handling of special tokens in GGML to GGUF converter (#2725)
* Improve UNK, BOS, EOS token handling when converting without metadata.

* Allow importing as a module.

* Remove some obsolete code and minor cleanups.

* Set default UNK token mapping from -1 to 0 in llama.cpp

* Try to handle overflow due to buggy Windows Python with a better error message
2023-08-22 17:39:39 -06:00
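A hypothetical sketch of the fallback behaviour described above (the field names and helper are illustrative, not the converter's actual code): when no tokenizer metadata is available, the special tokens fall back to LLaMA's conventional IDs, with UNK now defaulting to 0 rather than -1:

```python
# LLaMA's conventional special-token IDs; UNK defaults to 0 instead of -1.
DEFAULT_SPECIAL_TOKENS = {"unk": 0, "bos": 1, "eos": 2}

def resolve_special_tokens(metadata: dict | None) -> dict:
    """Return special-token IDs, falling back to the defaults above."""
    tokens = dict(DEFAULT_SPECIAL_TOKENS)
    if metadata:
        for name in tokens:
            token_id = metadata.get(f"{name}_token_id", -1)
            if token_id >= 0:
                tokens[name] = token_id
    return tokens

print(resolve_special_tokens(None))                    # {'unk': 0, 'bos': 1, 'eos': 2}
print(resolve_special_tokens({"eos_token_id": 32000})) # custom EOS kept
```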
klosax
d561b7f724
llama.cpp : fix the fix of bpe tokenizer 2023-08-23 00:06:53 +02:00
klosax
a95ae7526a
llama.cpp : fix bpe tokenizer 2023-08-23 00:02:13 +02:00
goerch
46ef5b5fcf
llama : fix whitespace escaping in tokenizer (#2724) 2023-08-23 00:10:42 +03:00
Johannes Gäßler
c63bb1d16a
CUDA: use mul_mat_q kernels by default (#2683) 2023-08-22 22:47:05 +02:00
klosax
ffa5099c6d
llama.cpp : llama default UNK token = id 0 2023-08-22 22:34:03 +02:00
klosax
9853f2cfb2
convert-falcon-hf-to-gguf.py : fix special token mapping 2023-08-22 22:29:11 +02:00
Georgi Gerganov
7bbbf38c32
llama : minor updates
ggml-ci
2023-08-22 23:26:16 +03:00
Georgi Gerganov
0ec27ad66c
falcon : minor 2023-08-22 23:11:41 +03:00
Georgi Gerganov
2d58444dae
falcon : support non-40B models 2023-08-22 22:52:14 +03:00
Georgi Gerganov
3c7c325b98
falcon : CPU inference working 2023-08-22 22:31:49 +03:00
Georgi Gerganov
085228e1f5
llama : add arch member to llama_model 2023-08-22 22:09:56 +03:00
Alex Petenchea
3b6cfe7c92
convert.py : clarifying error message (#2718) 2023-08-22 21:58:16 +03:00
Georgi Gerganov
5c5413dc14
llama : fix loading progress bar 2023-08-22 21:53:36 +03:00
Georgi Gerganov
2f3c80a845
falcon : load tensor data (CPU only) 2023-08-22 21:42:12 +03:00
Jiahao Li
800c9635b4
Fix CUDA softmax by subtracting max value before exp (#2665) 2023-08-22 20:27:06 +02:00
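The fix above applies the standard max-subtraction trick: subtracting the row maximum leaves softmax unchanged (the common factor exp(-max) cancels) while keeping every exponent non-positive, so exp() cannot overflow. A NumPy illustration of the technique (the actual fix lives in the CUDA kernel):

```python
import numpy as np

def softmax_naive(x):
    # exp() overflows to inf for large logits, giving nan after the division.
    e = np.exp(x)
    return e / e.sum()

def softmax_stable(x):
    # Subtracting the max keeps every exponent <= 0, so exp() cannot overflow;
    # the result is identical because the factor exp(-max) cancels out.
    e = np.exp(x - x.max())
    return e / e.sum()

x = np.array([1000.0, 1001.0, 1002.0])
print(softmax_naive(x))    # [nan nan nan] plus an overflow warning
print(softmax_stable(x))   # [0.09003057 0.24472847 0.66524096]
```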
Georgi Gerganov
d1b3b95dc4
convert : add dummy scores + types 2023-08-22 20:55:05 +03:00
Georgi Gerganov
9f28f73785
llm : read arch-specific KVs 2023-08-22 20:34:17 +03:00
Georgi Gerganov
b19c6e4640
Merge branch 'master' into falcon 2023-08-22 20:15:01 +03:00
Georgi Gerganov
3c025a6d07
gguf : add KV constant maps 2023-08-22 20:06:15 +03:00
Georgi Gerganov
deb7dfca4b
gguf : add ftype meta info to the model (#2710)
* llama : add ftype meta info to the model

ggml-ci

* convert.py : add ftype when converting (does not work)

* convert.py : fix Enum to IntEnum

ggml-ci
2023-08-22 20:05:59 +03:00
Georgi Gerganov
3057d6a687
llama : refactor llama_model_load_internal() 2023-08-22 19:30:02 +03:00
Kawrakow
bac66994cf
Quantization improvements for k_quants (#2707)
* Improve LLaMA-2 2-, 3- and 4-bit quantization

* Q3_K_S: use Q5_K for 1st 2 layers of attention.wv and feed_forward.w2
* Q4_K_S: use Q6_K for 1st 2 layers of attention.wv and feed_forward.w2
* Q2_K and Q3_K_M: use Q5_K instead of Q4_K for 1st 2 layers of
  attention.wv and feed_forward.w2

This leads to a slight model size increase as follows:
Q2_K  : 2.684G vs 2.670G
Q3_K_S: 2.775G vs 2.745G
Q3_K_M: 3.071G vs 3.057G
Q4_K_S: 3.592G vs 3.563G

LLaMA-2 PPL for context 512 changes as follows:
Q2_K  : 6.6691 vs 6.8201
Q3_K_S: 6.2129 vs 6.2584
Q3_K_M: 6.0387 vs 6.1371
Q4_K_S: 5.9138 vs 6.0041

There are improvements for LLaMA-1 as well, but they are much smaller than the above.

* Minor 4-bit quantization improvement

For the same model size as the previous commit, we get
PPL = 5.9069 vs 5.9138.

* Some more fine tuning

* Adding make_qkx2_quants

With it, we get PPL = 5.8828 for L2-7B Q4_K_S.

* Another minor improvement

* Q2_K improvement

Smaller model, lower perplexity.
 7B: file size = 2.632G, PPL = 6.3772 vs original 2.670G PPL = 6.8201
12B: file size = 5.056G, PPL = 5.4577 vs original 5.130G PPL = 5.7178

It is mostly Q3_K except for tok_embeddings, attention.wq, attention.wk,
which are Q2_K

* Iterating

* Revert Q5_K back to make_qkx1_quants

* Better Q6_K

* make_qkx2_quants is better for Q5_K after all

* Fix after rebasing on master

* Fix for changed tensor names

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-08-22 19:14:09 +03:00
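A hypothetical sketch of the mixing rule described in the commit message (names and structure are illustrative, not the llama.cpp implementation): the first two layers' attention.wv and feed_forward.w2 tensors are upgraded to a higher-precision k-quant than the rest of the model:

```python
# Illustrative only: map each quantization mix to the higher-precision type
# used for the first two layers of attention.wv and feed_forward.w2.
SENSITIVE_TENSORS = ("attention.wv", "feed_forward.w2")
UPGRADE = {"Q2_K": "Q5_K", "Q3_K_M": "Q5_K", "Q3_K_S": "Q5_K", "Q4_K_S": "Q6_K"}

def quant_type_for(tensor_name: str, layer: int, base_type: str) -> str:
    if tensor_name in SENSITIVE_TENSORS and layer < 2 and base_type in UPGRADE:
        return UPGRADE[base_type]
    return base_type

print(quant_type_for("attention.wv", 0, "Q4_K_S"))   # Q6_K
print(quant_type_for("attention.wv", 5, "Q4_K_S"))   # Q4_K_S
```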
Georgi Gerganov
8bd7f06b58
llama : check if model architecture is known 2023-08-22 19:03:08 +03:00
Georgi Gerganov
4ed3469c68
llama : refactor GGUF constants into static maps 2023-08-22 18:59:39 +03:00
slaren
519c981f8b
embedding : evaluate prompt in batches (#2713) 2023-08-22 16:03:12 +02:00
slaren
1123f7fbdf
ggml-cuda : use graph allocator (#2684)
use a different function for no_alloc to avoid breaking backwards compat, fixes lora

remove 512 n_batch limit

fixed 2048 batch size

cleanup

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2023-08-22 15:25:19 +02:00
Georgi Gerganov
ef3f333d37
ggml : sync latest (SAM + SD operators, CUDA alibi) (#2709)
* ggml : sync latest (SAM + SD operators, CUDA alibi)

ggml-ci

* ggml : fix tabs
2023-08-22 14:22:08 +03:00
slaren
8e4364f2af
llama-bench : minor fixes (#2695) 2023-08-22 10:56:03 +03:00