llama.cpp

Author	SHA1	Message	Date
Nexesenex	91db53b645	IQ1_XL and some corrections notably on attn_q and parenthesis	2024-08-11 16:41:23 +02:00
Nexesenex	1268d58ca8	More adjustments	2024-08-11 03:05:52 +02:00
Nexesenex	ef83a87cfe	Revert of ffn gate and up on IQ3_M and indent	2024-08-11 01:30:18 +02:00
Nexesenex	e2e2d77e8e	misplaced file lol	2024-08-11 01:13:12 +02:00
Nexesenex	8ad71f4469	IQ1_XS and small adjustments.	2024-08-11 01:11:24 +02:00
Nexes the Old	14f4f404d5	Merge b3565 Merge b3565	2024-08-10 20:45:26 +02:00
Nexesenex	8bc7a9849e	2 forgotten files	2024-08-10 20:40:27 +02:00
Nexesenex	f0806ac943	IQ2_XL , IQ3_XL , Q2_K_L Plus some adjustments on the FFNs	2024-08-10 20:36:49 +02:00
Nexesenex	49617b1960	Advancing on several tensors - Progressivity for token embeddings and attn_qkv - FFN down for IQ1 and IQ2 quants - FFN gate and up for IQ2_S and IQ2_M, for progressivity in the IQ2 range.	2024-08-10 18:37:29 +02:00
Nexesenex	415d5e40e1	Refactor furthermore attn.v And also lower attn_q for IQ2_XS, in order to separate it more for the quite misnamed IQ2_S	2024-08-10 17:32:29 +02:00
Nexesenex	8c8e43ce20	Settings for MOE >= 8 experts applied to >= 4 experts	2024-08-10 16:38:11 +02:00
Nexesenex	aa4eb594ef	Further refactor attn_k With attn_k set for all quants bellow 3bpw except Q2_K_S.	2024-08-10 16:33:55 +02:00
slaren	6e02327e8b	metal : fix uninitialized abort_callback (#8968)	2024-08-10 15:42:10 +02:00
Nexesenex	8f1b99fee8	Shortening formatting	2024-08-10 13:09:11 +02:00
Xuan Son Nguyen	7eb23840ed	llama : default n_swa for phi-3 (#8931) * default n_swa for phi-3 * fix * double check swa	2024-08-10 13:04:40 +02:00
Nexesenex	7212098755	IQ1 and IQ2 refactor Attn_q in Q3_K for experts >= 8 Attn_k in Q5_K for experts >= 8 Attn_v in Q6_K for experts >= 8, in IQ3_XXS for IQ2_XXS and IQ2_XS Attn_output in Q4_K for experts >= 8	2024-08-10 12:52:57 +02:00
fairydreaming	7c3f55c100	Add support for encoder-only T5 models (#8900) * gguf-py : add T5ENCODER model architecture * common : call llama_decode() during warmup only if the model has decoder * convert-hf : add T5EncoderModel * llama : add llama_model_has_decoder() API function * llama : split build_t5() into build_t5_encoder() and build_t5_decoder() * llama : add support for LLM_ARCH_T5ENCODER * llama-embedding : add support for LLAMA_POOLING_TYPE_NONE * llama-embedding : add support for encoder-only models --------- Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>	2024-08-10 11:43:26 +02:00
Matteo Mortari	911b437f22	gguf-py : fix double call to add_architecture() (#8952) Signed-off-by: tarilabs <matteo.mortari@gmail.com>	2024-08-10 08:58:49 +03:00
Nexesenex	1bc4dc5c15	Bump IQ3_M attn.v in Q5_K attn.k in IQ4_XS	2024-08-09 22:49:42 +02:00
Georgi Gerganov	b72942fac9	Merge commit from fork	2024-08-09 23:03:21 +03:00
fairydreaming	6afd1a99dc	llama : add support for lora adapters in T5 model (#8938) Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>	2024-08-09 18:53:09 +02:00
Georgi Gerganov	272e3bd95e	make : fix llava obj file race (#8946) ggml-ci	2024-08-09 18:24:30 +03:00
Georgi Gerganov	45a55b91aa	llama : better replace_all (cont) (#8926) * llama : better replace_all (cont) ggml-ci * code : deduplicate replace_all ggml-ci	2024-08-09 18:23:52 +03:00
tc-mb	3071c0a5f2	llava : support MiniCPM-V-2.5 (#7599) * init * rename * add run android for termux in readme * add android readme * add instructions in readme * change name in readme * Update README.md * fixed line * add result in readme * random pos_embed * add positions index * change for ollama * change for ollama * better pos_embed in clip * support ollama * updata cmakelist * updata cmakelist * rename wrapper * clear code * replace and organize code * add link * sync master * fix warnings * fix warnings * fix bug in bicubic resize when need resize iamge smaller * receive review comments and modify * receive review comments and modify * put all code into llava dir * fix quality problem in pr code * change n_layer * add space in "-1" * imitate reshape bug of python code * fix bug in clip * fix issues for merging * fix llama-minicpmv-cli in cmake file * change pr readme * fix code review * remove in line 33 directory in the /cmakelists.txt (not in example, in the main dir * fix cmakefile * add warn * fix KEY_HAS_MINICPMV_PROJ * remove load_image_size into clip_ctx * remove the extern "C", MINICPMV_API * fix uhd code for review comment * delete minicpmv-wrapper in pr * remove uhd_image_embed * Modify 2 notes * clip : style changes * del common.h in clip * fix Type-Check error * fix Type-Check error * fix Type-Check error * fix Type-Check error * fix makefile error * fix ubuntu-make error * try fix clip * try fix 1 --------- Co-authored-by: Hongji Zhu <fireyoucan@gmail.com> Co-authored-by: harvestingmoon <leewenyeong@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-08-09 13:33:53 +03:00
Georgi Gerganov	4305b57c80	sync : ggml	2024-08-09 10:03:48 +03:00
Matt Stephenson	70c0ea3560	whisper : use vulkan as gpu backend when available (whisper/2302) * ggml: use vulkan as gpu backend when available Signed-off-by: Matt Stephenson <mstephenson6@users.noreply.github.com> * whisper: enable using vk as default buffer type Signed-off-by: Matt Stephenson <mstephenson6@users.noreply.github.com> --------- Signed-off-by: Matt Stephenson <mstephenson6@users.noreply.github.com>	2024-08-09 10:03:44 +03:00
Daniel Bevenius	5b2c04f492	embedding : add --pooling option to README.md [no ci] (#8934) This commit adds the `--pooling` option to the README.md file in the `examples/embedding` directory. The motivation for adding this options is that currently if the model used does not specify a pooling type the embedding example will fail with the following error message: ```console main: error: pooling type NONE not supported ``` This commit also updates the name of the executable in the examples section.	2024-08-09 09:33:30 +03:00
Daniel Bevenius	6f6496bb09	llama : fix typo in llama_tensor_get_type comment [no ci] (#8937)	2024-08-09 09:32:23 +03:00
Mathieu Geli	daef3ab233	server : add one level list nesting for embeddings (#8936)	2024-08-09 09:32:02 +03:00
compilade	345a686d82	llama : reduce useless copies when saving session (#8916) * llama : avoid useless copies in dummy session writer * llama : avoid double tensor copy when saving session to buffer	2024-08-08 23:54:00 -04:00
compilade	3a14e00366	gguf-py : simplify support for quant types (#8838) * gguf-py : use classes for quants * convert_hf : simplify internal quantization type selection * gguf-py : fix flake8 lint * gguf-py : fix BF16 numpy view type * gguf-py : remove LlamaFileTypeMap Too specific to 'llama.cpp', and would be a maintenance burden to keep up to date. * gguf-py : add generic quantize and dequantize functions The quant classes no longer need to be known, only the target or the source type, for 'quantize' and 'dequantize', respectively.	2024-08-08 13:33:09 -04:00
Nexes the Old	1118c046df	correct mistake in conditionality for attn.k	2024-08-08 18:56:20 +02:00
Nexes the Old	8006b15fd1	Avoid to shrink attn.k.weight for IQ3_XS and XXS when GQA or MOE	2024-08-08 18:50:48 +02:00
Georgi Gerganov	afd27f01fe	scripts : sync cann files (#0)	2024-08-08 14:56:52 +03:00
Georgi Gerganov	366d486c16	scripts : fix sync filenames (#0)	2024-08-08 14:40:12 +03:00
Georgi Gerganov	e44a561ab0	sync : ggml	2024-08-08 13:19:47 +03:00
Borislav Stanimirov	f93d49ab1e	ggml : ignore more msvc warnings (ggml/906)	2024-08-08 13:19:31 +03:00
Georgi Gerganov	5b33ea1ee7	metal : fix struct name (ggml/912) ggml-ci	2024-08-08 13:19:31 +03:00
Conrad Kramer	85fca8deb6	metal : add abort callback (ggml/905)	2024-08-08 13:19:30 +03:00
Pablo Duboue	ebd541a570	make : clean llamafile objects (#8923) `ggml/src/llamafile/sgemm.o` was not deleted on `make clean`	2024-08-08 11:44:51 +03:00
slaren	15fa07a5c5	make : use C compiler to build metal embed object (#8899) * make : use C compiler to build metal embed object * use rm + rmdir to avoid -r flag in rm	2024-08-07 18:24:05 +02:00
slaren	be55695eff	ggml-backend : fix async copy from CPU (#8897) * ggml-backend : fix async copy from CPU * cuda : more reliable async copy, fix stream used when the devices are the same	2024-08-07 13:29:02 +02:00
Ouadie EL FAROUKI	0478174d59	[SYCL] Updated SYCL device filtering (#8901) * Updated device filter to depend on default_selector (fixes non-intel device issues) * Small related update to example/sycl Readme	2024-08-07 11:25:36 +01:00
Johannes Gäßler	a8dbc6f753	CUDA/HIP: fix tests/test-backend-ops (#8896)	2024-08-07 09:07:52 +02:00
Zhenwei Jin	506122d854	llama-bench : add support for getting cpu info on Windows (#8824) * Add support for getting cpu info on Windows for llama_bench * refactor --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-08-07 03:01:06 +02:00
Daniel Bevenius	725e3d9437	quantize : update usage comment in quantize.cpp (#8889) This commit updates the usage comment in quantize.cpp to reflect the new name of the executable, which is llama-quantize.	2024-08-07 01:43:00 +02:00
Nexes the Old	31958546c3	typo correction (#8891)	2024-08-07 01:41:54 +02:00
Xuan Son Nguyen	1e6f6554aa	server : add lora hotswap endpoint (WIP) (#8857) * server : add lora hotswap endpoint * handle lora_no_apply * fix build * updae docs * clean up struct def * fix build * add LoRA test * fix style	2024-08-06 17:33:39 +02:00
Johannes Gäßler	641f5dd2a6	CUDA: fix padding logic for FP16/FP32 (#8884)	2024-08-06 17:13:55 +02:00
Daniel Bevenius	5f4dcb1e60	simple : update name of executable to llama-simple (#8885) This commit updates the name of the executable in README.md from `simple` to `llama-simple`.	2024-08-06 16:44:35 +02:00


3588 commits