llama.cpp

Author	SHA1	Message	Date
Kawrakow	76aa30a263	Add ability to use Q5_0, Q5_1, and IQ4_NL for quantized K cache (#6183 ) * k_cache: be able to use Q5_0 * k_cache: be able to use Q5_1 on CUDA * k_cache: be able to use Q5_0 on Metal * k_cache: be able to use Q5_1 on Metal * k_cache: be able to use IQ4_NL - just CUDA for now * k_cache: be able to use IQ4_NL on Metal * k_cache: add newly added supported types to llama-bench and CUDA supports_op --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-03-21 08:27:57 +01:00
AidanBeltonS	c5b8595e3f	Add nvidia and amd backends (#6157 )	2024-03-21 11:40:52 +05:30
Francis Couture-Harpin	5f33a675ca	perplexity : make hellaswag and multiple-choice outputs identical to master Due to how the KV cache is updated, the logprobs for tokens in a batch are very slightly affected by the other tokens present in the batch, so to make hellaswag and multiple-choice return exactly the same results as on master, the last token of each sequence needs to be evaluated even though its output is not used at all. This will probably be changed back in the future to make these benchmarks a tiny bit faster. * perplexity : fix division by zero when using less than 100 multiple-choice tasks	2024-03-20 23:05:18 -04:00
Francis Couture-Harpin	7d8d6b589f	llama : handle errors from llama_output_reserve at call sites	2024-03-20 23:05:12 -04:00
slaren	42e21c6882	cuda : fix conflict with std::swap (#6186 )	2024-03-21 01:47:46 +01:00
slaren	1c51f98adc	cuda : print the returned error when CUDA initialization fails (#6185 )	2024-03-20 21:03:26 +01:00
Ziang Wu	f9c7ba3447	llava : update MobileVLM-README.md (#6180 )	2024-03-20 17:29:51 +02:00
Ziang Wu	272935b281	llava : add MobileVLM_V2 backup (#6175 ) * Add MobileVLM_V2 backup * Update MobileVLM-README.md * Update examples/llava/MobileVLM-README.md Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update examples/llava/convert-image-encoder-to-gguf.py Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * clip : fix whitespace * fix definition mistake in clip.cpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-20 17:02:32 +02:00
slaren	ccf58aa3ec	cuda : refactor to remove global resources (#6170 ) * cuda : refactor to remove global resources	2024-03-20 14:42:59 +01:00
Xuan Son Nguyen	91f8ad167d	Server: version bump for httplib and json (#6169 ) * server: version bump for httplib and json * fix build * bring back content_length	2024-03-20 13:30:36 +01:00
Georgi Gerganov	6b7e76d28c	gitignore : ignore curl-related files	2024-03-20 14:17:34 +02:00
Georgi Gerganov	bc0baab2ea	server : allow to override -ngl in tests (#6170 )	2024-03-20 14:14:32 +02:00
Georgi Gerganov	d795988d9e	Revert "llava : add a MobileVLM_V2-1.7B backup (#6152 )" This reverts commit `f8c4e745e1`.	2024-03-20 13:29:49 +02:00
Ziang Wu	f8c4e745e1	llava : add a MobileVLM_V2-1.7B backup (#6152 ) * Add MobileVLM_V2 backup * Update MobileVLM-README.md * Update examples/llava/MobileVLM-README.md Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update examples/llava/convert-image-encoder-to-gguf.py Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * clip : fix whitespace --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-20 13:20:37 +02:00
Karthick	47cc7a7bf9	Server: Handle n_keep parameter in the request (#6174 )	2024-03-20 12:02:34 +01:00
Jared Van Bortel	bd60d82d0c	server tests : more pythonic process management; fix bare `except:` (#6146 ) * server tests : remove seemingly redundant newlines in print() * server tests : use built-in subprocess features, not os.kill and psutil * server tests : do not catch e.g. SystemExit; use print_exc * server tests: handle TimeoutExpired exception * server tests: fix connect on dual-stack systems * server: tests: add new tokens regex on windows generated following new repeat penalties default changed in (#6127) * server: tests: remove the hack on windows since now we get the good socket family * server: tests: add new tokens regex following new repeat penalties default changed in (#6127) * server: tests: add new tokens regex following new repeat penalties default changed in (#6127) --------- Co-authored-by: Pierrick HYMBERT <pierrick.hymbert@gmail.com>	2024-03-20 06:33:49 +01:00
Neo Zhang Jianyu	6c0b287748	update readme sycl for new update (#6151 ) * update readme sycl for new update * Update README-sycl.md Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com> * Update README-sycl.md Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com> * Update README-sycl.md Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com> * Update README-sycl.md Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com> * Update README-sycl.md Co-authored-by: AidanBeltonS <87009434+AidanBeltonS@users.noreply.github.com> * Update README-sycl.md Co-authored-by: AidanBeltonS <87009434+AidanBeltonS@users.noreply.github.com> * update by review comments * update w64devkit link * update for verify device id part * Update README-sycl.md Co-authored-by: Meng, Hengyu <airdldl@163.com> --------- Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com> Co-authored-by: AidanBeltonS <87009434+AidanBeltonS@users.noreply.github.com> Co-authored-by: Meng, Hengyu <airdldl@163.com>	2024-03-20 11:21:41 +08:00
Abhilash Majumder	d26e8b669d	increase igpu cluster limit (#6159 )	2024-03-20 08:28:49 +05:30
Francis Couture-Harpin	615a3a4a50	llama : clearer error messages for invalid logits or embeddings ids * llama : assert all models that can have inp_out_ids Since the graph topology is now constant, this presence check can be done even when there are no outputs. * llama : assert logits and embd buffers exist before writing to them	2024-03-19 15:32:18 -04:00
Francis Couture-Harpin	8f70dcb0f3	perplexity : make Winogrande work as it does on master The problems with the Winogrande implementation will need to be fixed in a separate PR to ease review.	2024-03-19 14:07:48 -04:00
DAN™	d8b009a945	Remove unneeded header file. (#6158 )	2024-03-19 17:16:09 +01:00
Pierrick Hymbert	d0d5de42e5	gguf-split: split and merge gguf per batch of tensors (#6135 ) * gguf-split: split and merge gguf files per tensor * gguf-split: build with make toolchain * gguf-split: rename `--split-tensors-size` to `--split-max-tensors`. Set general.split_count KV to all split * split : minor style + fix compile warnings * gguf-split: remove --upload not implemented --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-19 12:05:44 +01:00
Georgi Gerganov	b80cf3b2d1	common : disable repeat penalties by default (#6127 )	2024-03-19 10:21:54 +02:00
slaren	970a48060a	ci : exempt some labels from being tagged as stale (#6140 )	2024-03-19 10:06:54 +02:00
DAN™	4c28b82529	common : print usage on '-h' and '--help' (#6145 )	2024-03-19 07:59:36 +02:00
Francis Couture-Harpin	d04cfaf2f5	llama : fix llama_output_reserve nullptr deref when new_size is 0	2024-03-18 21:26:08 -04:00
Francis Couture-Harpin	8b826c5b08	ggml : skip empty tensors in all backends	2024-03-18 21:15:00 -04:00
Francis Couture-Harpin	4551e7eba8	llama : use a vector for ctx->output_ids * llama : rework reallocation logic for llama_output_reserve Now comparing the actual size with the new total size of the output buffer to allow more efficient enabling and disabling of the embeddings and/or logits output in the future.	2024-03-18 20:51:32 -04:00
Francis Couture-Harpin	09bb15a66a	ggml : make ggml_is_empty public and work with views	2024-03-18 20:21:02 -04:00
github-actions[bot]	2d15886bb0	flake.lock: Update Flake lock file updates: • Updated input 'nixpkgs': 'github:NixOS/nixpkgs/9df3e30ce24fd28c7b3e2de0d986769db5d6225d' (2024-03-06) → 'github:NixOS/nixpkgs/d691274a972b3165335d261cc4671335f5c67de9' (2024-03-14)	2024-03-18 18:51:30 +00:00
Jared Van Bortel	d199ca79f2	mpt : implement backwards compatibility with duped output tensor (#6139 )	2024-03-18 12:49:02 -04:00
Felix	104f5e0fc1	clip : fix memory leak (#6138 )	2024-03-18 17:40:22 +02:00
slaren	5e1b7f94a0	backend : set max split inputs to GGML_MAX_SRC (#6137 )	2024-03-18 16:33:44 +01:00
Georgi Gerganov	ac9ee6a4ad	ci : disable stale issue messages (#6126 )	2024-03-18 13:45:38 +02:00
Georgi Gerganov	4f6d1337ca	ci : temporarily disable sanitizer builds (#6128 )	2024-03-18 13:45:27 +02:00
slaren	2bf8d0f7c4	backend : offload large batches to GPU (#6083 ) * backend : offload large batches to GPU * fix hip * code cleanup * fix CUDA split buffers * Update ggml-backend-impl.h Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * cuda : fix memset without set_device * imatrix : remove sched affix from weight names * sched : add a new split if the current one has too many inputs reduce max inputs per split more cleanup * update backends ggml-ci --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2024-03-18 11:03:04 +01:00
DAN™	496bc79bc2	common : tidy-up argument parsing (#6105 ) * Tidy-up argument parsing. * Missing ref. * common : minor * common : add static classifier --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-18 10:27:44 +02:00
Thérence	9b03719ad7	convert : add support for CamembertModel architecture (#6119 ) Adding support for CamembertModel architecture used by : https://huggingface.co/dangvantuan/sentence-camembert-large	2024-03-18 10:17:00 +02:00
Romain D	3a6efdd03c	convert : use f32 outtype for bf16 tensors (#6106 ) The old behaviour is to use f16, but bf16 to f16 is not a lossless conversion. Change the outtype to f32 to default to a lossless conversion.	2024-03-18 10:04:41 +02:00
Francis Couture-Harpin	6bf7f3f41c	ggml : do not multi-thread ops returning empty tensors	2024-03-18 00:35:03 -04:00
Francis Couture-Harpin	99c37ccb6b	ggml : saner ggml_can_repeat with empty tensors * ggml : future-proof ggml_is_empty by using GGML_MAX_DIMS - 1	2024-03-18 00:14:36 -04:00
Francis Couture-Harpin	d100502251	llama : keep same graph topology even when n_outputs == 0	2024-03-17 22:04:42 -04:00
Francis Couture-Harpin	711b0bcb11	llama : fix running a batch with n_outputs == 0 It previously worked because lctx.inp_out_ids was not initialized, so it pointed to some garbage address which was somehow still valid when I ran my tests.	2024-03-17 20:41:21 -04:00
Francis Couture-Harpin	a57fa7faa4	llama : fix not-skipping outputs of non-causal models	2024-03-17 20:19:25 -04:00
Francis Couture-Harpin	e19cb3aeb7	llama : fix wrong n_outputs in llama_set_inputs A mismatch happened when using a smaller n_ubatch than n_batch and then using llama_batch_get_one(). The decision of what n_outputs should be now almost fully depends on how lctx.n_outputs is set in llama_decode_internal. The conditions are simpler this way. * llama : when saving the state, recalculate n_outputs This ensures the correct number of outputs for the entire previous batch is stored in the session file, even when n_ubatch is smaller than n_batch.	2024-03-17 17:04:10 -04:00
Francis Couture-Harpin	408fcb0f91	llama : fix llama_get_embeddings_ith when the resulting id is 0	2024-03-17 15:36:41 -04:00
Francis Couture-Harpin	487f89ec2e	llama : fix embedding conditions	2024-03-17 15:36:41 -04:00
Francis Couture-Harpin	d0129e8e29	perplexity : normalize spaces and punctuation in Winogrande sentences	2024-03-17 15:36:41 -04:00
Francis Couture-Harpin	17b45c96ed	perplexity : fix Winogrande, use correct logits for second choice start The first logits used to evaluate the second choice were not from the end of the common prefix; instead, they were the logits from the end of the first choice. This has been corrected. The previous implementation sometimes had outliers in the scores of choices for some tasks, and the logic to skip choices words in the log-likelihood evaluation probably was an attempt to reduce those, but it was complex and didn't quite seem to be the right thing. This is simpler now, and the outlier scores aren't there anymore.	2024-03-17 15:36:41 -04:00
Francis Couture-Harpin	25981fca37	perplexity : adapt to the logits API changes	2024-03-17 15:36:41 -04:00


2553 commits