llama.cpp

Author	SHA1	Message	Date
Nexesenex	599a4b2cc6	Update llama.cpp - switch from IQ4_XS to Q4_K in related cases. - Q4_K offers a slight quality bonus over IQ4_XS at a cheap size cost, worth not missing, especially on the K & V attention tensors. - Obsessing over size matters little for the smallest models, which are small anyway and need an offset toward quality for the sake of logic, while the bigger models, which can actually be usable, will barely be impacted in size but will appreciate the slight quality bump offered by Q4_K vs IQ4_XS.	2024-03-26 13:41:16 +01:00
Nexesenex	eaf9571d9b	Update llama.cpp - exception for the IQ2_S token embedding error	2024-03-26 10:11:46 +01:00
Nexesenex	d1839362fc	Update llama.cpp - remove trailing space	2024-03-26 09:17:09 +01:00
Nexesenex	62c1f5b681	Update llama.cpp typo	2024-03-26 02:25:07 +01:00
Nexesenex	f162b2ef3f	Update llama.cpp - correction embd.weight GQA-4 & qkv.weight to K-Quants Q2_K embed for GQ4 because it helps Mistral 7b. I didn't test a model with attn.qkv weight, so better to be conservative with a K-Quant.	2024-03-26 02:22:04 +01:00
Nexesenex	9c27b0e6ea	Update quantize.cpp - mix label	2024-03-26 01:12:35 +01:00
Nexesenex	3031c01db0	Update llama.cpp - correction wrong case declaration	2024-03-26 00:06:41 +01:00
Nexesenex	066efbb18f	Update llama.cpp - adjustments non-FFN layer tensors	2024-03-25 23:08:19 +01:00
Nexesenex	b3553335a3	Update llama.h - change IQ1_XS enum number From 31 to 32, because IQ1_M will come with 31.	2024-03-25 21:06:46 +01:00
Nexesenex	ddc7701588	Update llama.cpp - Non-FFN layer-tensors strategy	2024-03-25 21:04:01 +01:00
Nexesenex	1c4da5ddac	Update llama.cpp - Embeddings and output tensors strategy.	2024-03-25 20:37:11 +01:00
Nexesenex	51ff04e77e	Update llama.cpp - Fix possible typo LLAMA_FTYPE should be GGML_TYPE there.	2024-03-25 19:31:51 +01:00
Nexesenex	8eff402498	Update llama.cpp - Case IQ1_XS	2024-03-25 19:30:19 +01:00
Nexesenex	3d88431113	Update llama.h - Enum IQ1_XS	2024-03-25 19:25:31 +01:00
Nexesenex	8f7a7ee370	Update quantize.cpp - Quant option IQ1_XS	2024-03-25 19:23:21 +01:00
Nexesenex	f4949bc1ca	b2532 b2532	2024-03-25 19:13:45 +01:00
Christian Kögler	b06c16ef9f	nix: fix blas support (#6281 ) Since no blas was provided to buildInputs, the executable is built without blas support. This is a backport of NixOS/nixpkgs#298567	2024-03-25 10:52:45 -07:00
Kawrakow	1f2fd4e727	tests : include IQ2_XXS and IQ2_XS in test-quantize-fns (#6303 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-03-25 19:33:15 +02:00
Georgi Gerganov	43139cc528	flake.lock: Update (#6266 ) Flake lock file updates: • Updated input 'nixpkgs': 'github:NixOS/nixpkgs/d691274a972b3165335d261cc4671335f5c67de9' (2024-03-14) → 'github:NixOS/nixpkgs/44d0940ea560dee511026a53f0e2e2cde489b4d4' (2024-03-23) Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>	2024-03-25 08:22:27 -07:00
slaren	2f34b865b6	cuda : fix LLAMA_CUDA_F16 build (#6298 )	2024-03-25 16:43:22 +02:00
slaren	ae1f211ce2	cuda : refactor into multiple files (#6269 )	2024-03-25 13:50:23 +01:00
Xuan Son Nguyen	ad3a0505e3	Server: clean up OAI params parsing function (#6284 ) * server: clean up oai parsing function * fix response_format * fix empty response_format * minor fixes * add TODO for logprobs * update docs	2024-03-25 09:42:17 +01:00
Neo Zhang Jianyu	95ad616cdd	[SYCL] fix SYCL backend build on windows is break by LOG() error (#6290 ) * fix LOG() error for SYCL, enhance error check by CI * rollback to bash * add newline at end of file	2024-03-25 15:52:41 +08:00
Minsoo Cheong	64e7b47c69	examples : add "retrieval" (#6193 ) * add `retrieval` example * add README * minor fixes * cast filepos on print * remove use of variable sized array * store similarities in separate vector * print error on insufficient batch size * fix error message printing * assign n_batch value to n_ubatch * fix param definitions * define retrieval-only parameters in retrieval.cpp * fix `--context-file` option to be provided multiple times for multiple files * use vector for `query_emb` * add usage description in README * fix merge conflict * fix usage printing * remove seed setting * fix lint * increase file read buffer size * retrieval : minor --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-25 09:38:22 +02:00
Justine Tunney	7733f0c760	ggml : support AVX512VNNI (#6280 ) This change causes some quants (e.g. Q4_0, Q8_0) to go faster on some architectures (e.g. AMD Zen 4).	2024-03-25 07:39:56 +02:00
Rick G	a32b77c4b2	Fix heap corruption from wmode out-of-bound writes on windows (#6272 ) * would throw error on VS2022 on GGML_FREE(wmode) * wchar_t is usually 2 bytes, but malloc wants bytes * therefore `wmode_p++ = (wchar_t)mode;` could write off the end of the allocation * Fixes error possibly introduced by https://github.com/ggerganov/llama.cpp/pull/6248	2024-03-24 22:45:56 +01:00
Georgi Gerganov	a0e584defd	imatrix : fix wname for mul_mat_id ops (#6271 ) * imatrix : fix wname for mul_mat_id ops * also filter tensor names in mul_mat_id ops --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-03-24 16:18:45 +02:00
Johannes Gäßler	7aed0ffe68	Fixed lookup compilation issues on Windows (#6273 )	2024-03-24 14:21:17 +01:00
Pierrick Hymbert	ea279d5609	ci : close inactive issue, increase operations per run (#6270 )	2024-03-24 10:57:06 +02:00
Minsoo Cheong	586e7bc561	sampling : deduplicated code for probability distribution access (#6240 ) * sampling: remove duplicated code for probability distribution access * free original_logits * fix original_logits allocation * fixes based on review @cebtenzzre * change function name to `llama_sampling_prepare`	2024-03-24 10:54:07 +02:00
Meng, Hengyu	ddf6568510	[SYCL] offload op (#6217 ) * remove no USM methods * leave the schedule to ggml_backend_sched entirely	2024-03-24 12:04:25 +08:00
Neo Zhang Jianyu	d03224ac98	Support build win release for SYCL (#6241 ) * support release win * fix value * fix value * fix value * fix error * fix error * fix format	2024-03-24 09:44:01 +08:00
Jared Van Bortel	94d1b3b411	use _wfopen instead of fopen on Windows (#6248 ) also fix missing #defines before windows.h, and BPE LF token on MSVC	2024-03-23 18:48:02 -04:00
Georgi Gerganov	95562175f8	gitignore : gguf-split	2024-03-23 21:35:23 +02:00
Pierrick Hymbert	f482bb2e49	common: llama_load_model_from_url split support (#6192 ) * llama: llama_split_prefix fix strncpy does not include string termination common: llama_load_model_from_url: - fix header name case sensitive - support downloading additional split in parallel - hide password in url * common: EOL EOF * common: remove redundant LLAMA_CURL_MAX_PATH_LENGTH definition * common: change max url max length * common: minor comment * server: support HF URL options * llama: llama_model_loader fix log * common: use a constant for max url length * common: clean up curl if file cannot be loaded in gguf * server: tests: add split tests, and HF options params * common: move llama_download_hide_password_in_url inside llama_download_file as a lambda * server: tests: enable back Release test on PR * spacing Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * spacing Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * spacing Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-23 18:07:00 +01:00
Pierrick Hymbert	1997577d5e	server: docs: `--threads` and `--threads`, `--ubatch-size`, `--log-disable` (#6254 )	2024-03-23 18:00:38 +01:00
Julius Arkenberg	476b0251b2	llama : add grok-1 support (#6204 ) * Add support for Grok model architecture * Revert convert-hf-to-gguf to default options * Fixed f_norm_rms_eps bug * Fix whitespaces * llama : fix grok rope type * llama : minor --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-23 18:41:53 +02:00
Pierrick Hymbert	21cad01b6e	split: add gguf-split in the make build target (#6262 )	2024-03-23 17:18:13 +01:00
Pierrick Hymbert	1b26aebe4d	server: flush stdout after logging in both text and json layout (#6253 )	2024-03-23 13:18:45 +01:00
Johannes Gäßler	50ccaf5eac	lookup: complement data from context with general text statistics (#5479 ) * lookup: evaluation tools, use corpus/previous gens * fixup! lookup: evaluation tools, use corpus/previous gens * fixup! lookup: evaluation tools, use corpus/previous gens * fixup! lookup: evaluation tools, use corpus/previous gens * fixup! lookup: evaluation tools, use corpus/previous gens	2024-03-23 01:24:36 +01:00
Georgi Gerganov	56a00f0a2f	common : default --hf-file to --model (#6234 )	2024-03-22 21:10:39 +02:00
fraxy-v	92397d87a4	convert-llama2c-to-ggml : enable conversion of GQA models (#6237 ) * convert-llama2c-to-ggml: enable conversion of multiqueries, #5608 * add test in build action * Update build.yml * Update build.yml * Update build.yml * gg patch	2024-03-22 20:49:06 +02:00
Kawrakow	1d0331c12a	quantize: options for output and token embedding tensors qtype (#6239 ) * quantize: be able to specify the output tensor type * quantize: be able to specify the token embedding tensor type --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-03-22 20:47:14 +02:00
Pierrick Hymbert	dba1af6129	llama_model_loader: support multiple split/shard GGUFs (#6187 ) * split: support in llama_model_loader * avoid copying the entire vector Co-authored-by: slaren <slarengh@gmail.com> * split: move llama_tensor_offset to llama_model_loader * llama_model_loader: PR feedbacks: - use only one gguf_context for metadata only - store all ggml_context in a vector as the files and mappings - store all weights in a vector along with the source tensor - rename ctx_gguf to meta - rename ctx_meta to contexts * avoid copying the entire vector * Simplify this by making these optional, switch some layer creation tensor optional Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Handle optional tensors Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * llama_model_loader: fail if backend cannot allocate buffer * fix mmap buffer management * llama_model_loader: map file to backend buffer if the allocation succeeds only * llama_model_loader: only map tensors included in the context * llama_model_loader: minor, use same variable name for consistency, fix spacing in types cast * llama_model_loader: fail if any of backend buffer cannot be allocated * spacing Co-authored-by: slaren <slarengh@gmail.com> * fix loop over pointer Co-authored-by: slaren <slarengh@gmail.com> * llama_model_loader: if n_tensors declared not equals to loaded tensors in split, throw an exception instead of asserting * llama_model_loader: ensure mappings vector has the expected size * llama_model_loader: use at instead of operator[] if this should never add to the map. * llama_model_loader: immediately add the backend buffer to the model buffers in order to free them if an error occurs in the next allocation. Reserve the expected size. * llama_model_loader: be sure the model mappings has enough capacity before allocating backend buffer * llama_model_loader: fix map -> unordered map * llama_split_prefix: use a clearer version, not pass split path len but dest max len. 
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> * llama : minor ggml-ci * llama : introduce some typedef helpers * docs: add model shard in hot topic * llama_model_loader: put mapping in a unique_ptr from the moment it is allocated Co-authored-by: slaren <slarengh@gmail.com> * fix llama_split_prefix --------- Co-authored-by: slaren <slarengh@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>	2024-03-22 19:00:01 +01:00
Minsoo Cheong	ee804f6223	ci: apply concurrency limit for github workflows (#6243 )	2024-03-22 19:15:06 +02:00
Georgi Gerganov	80bd33bc2c	common : add HF arg helpers (#6234 ) * common : add HF arg helpers * common : remove defaults	2024-03-22 15:33:38 +02:00
Nexesenex	e80f06d2a1	llama : correction of the attn.v.weight quantization for IQ3_XS (#6209 ) IQ3_XS was not mentioned, IQ3_S and IQ3_M were present twice. That PR corrects this in the manner which was probably intended initially.	2024-03-22 15:32:02 +02:00
Olivier Chafik	f77a8ffd3b	tests : conditional python & node json schema tests (#6207 ) * json: only attempt python & node schema conversion tests if their bins are present Tests introduced in https://github.com/ggerganov/llama.cpp/pull/5978 disabled in https://github.com/ggerganov/llama.cpp/pull/6198 * json: orange warnings when tests skipped * json: ensure py/js schema conv tested on ubuntu-focal-make * json: print env vars in test	2024-03-22 15:09:07 +02:00
Olivier Chafik	72114edf06	json-schema-to-grammar : fix order of props + non-str const/enum (#6232 ) * json: ordered json in server/schema converter to respect orig order * json: ws nits * json: support non-string const / enums	2024-03-22 15:07:44 +02:00
slaren	2f0e81e053	cuda : add LLAMA_CUDA_NO_PEER_COPY to workaround broken ROCm p2p copy (#6208 ) * cuda : add LLAMA_CUDA_NO_PEER_COPY to workaround broken ROCm p2p copy * add LLAMA_CUDA_NO_PEER_COPY to HIP build	2024-03-22 14:05:31 +01:00


2548 commits