llama.cpp

Author	SHA1	Message	Date
Georgi Gerganov	4305b57c80	sync : ggml	2024-08-09 10:03:48 +03:00
Matt Stephenson	70c0ea3560	whisper : use vulkan as gpu backend when available (whisper/2302) * ggml: use vulkan as gpu backend when available Signed-off-by: Matt Stephenson <mstephenson6@users.noreply.github.com> * whisper: enable using vk as default buffer type Signed-off-by: Matt Stephenson <mstephenson6@users.noreply.github.com> --------- Signed-off-by: Matt Stephenson <mstephenson6@users.noreply.github.com>	2024-08-09 10:03:44 +03:00
Daniel Bevenius	5b2c04f492	embedding : add --pooling option to README.md [no ci] (#8934 ) This commit adds the `--pooling` option to the README.md file in the `examples/embedding` directory. The motivation for adding this option is that, currently, if the model used does not specify a pooling type, the embedding example fails with the following error message: ```console main: error: pooling type NONE not supported ``` This commit also updates the name of the executable in the examples section.	2024-08-09 09:33:30 +03:00
Daniel Bevenius	6f6496bb09	llama : fix typo in llama_tensor_get_type comment [no ci] (#8937 )	2024-08-09 09:32:23 +03:00
Mathieu Geli	daef3ab233	server : add one level list nesting for embeddings (#8936 )	2024-08-09 09:32:02 +03:00
compilade	345a686d82	llama : reduce useless copies when saving session (#8916 ) * llama : avoid useless copies in dummy session writer * llama : avoid double tensor copy when saving session to buffer	2024-08-08 23:54:00 -04:00
compilade	3a14e00366	gguf-py : simplify support for quant types (#8838 ) * gguf-py : use classes for quants * convert_hf : simplify internal quantization type selection * gguf-py : fix flake8 lint * gguf-py : fix BF16 numpy view type * gguf-py : remove LlamaFileTypeMap Too specific to 'llama.cpp', and would be a maintenance burden to keep up to date. * gguf-py : add generic quantize and dequantize functions The quant classes no longer need to be known, only the target or the source type, for 'quantize' and 'dequantize', respectively.	2024-08-08 13:33:09 -04:00
Georgi Gerganov	afd27f01fe	scripts : sync cann files (#0 )	2024-08-08 14:56:52 +03:00
Georgi Gerganov	366d486c16	scripts : fix sync filenames (#0 )	2024-08-08 14:40:12 +03:00
Georgi Gerganov	e44a561ab0	sync : ggml	2024-08-08 13:19:47 +03:00
Borislav Stanimirov	f93d49ab1e	ggml : ignore more msvc warnings (ggml/906)	2024-08-08 13:19:31 +03:00
Georgi Gerganov	5b33ea1ee7	metal : fix struct name (ggml/912) ggml-ci	2024-08-08 13:19:31 +03:00
Conrad Kramer	85fca8deb6	metal : add abort callback (ggml/905)	2024-08-08 13:19:30 +03:00
Pablo Duboue	ebd541a570	make : clean llamafile objects (#8923 ) `ggml/src/llamafile/sgemm.o` was not deleted on `make clean`	2024-08-08 11:44:51 +03:00
Nexes the Old	fc4ed23673	correct a third party typo	2024-08-07 23:09:52 +02:00
Nexes the Old	60d11d0107	trailing whitespaces	2024-08-07 22:42:29 +02:00
Nexes the Old	259c5f3a92	correct ident and trailing whitespaces	2024-08-07 22:41:05 +02:00
Nexes the Old	867e3523f9	trailing whitespace	2024-08-07 22:39:39 +02:00
Nexes the Old	28a41e7bdd	Merge branch 'master' into lcpp_pr_specific_quants	2024-08-07 22:13:55 +02:00
Nexesenex	4a95bd5d7d	Quantize: specify each major tensor quant in CLI for common LLMs This PR simply replicates the tensor-by-tensor custom quantization CLI feature brought by Ikawrakow for the token embeddings and output tensors in #6239 to: - attn_q.weight - attn_k.weight - attn_v.weight - attn_qkv.weight - attn_output.weight - ffn_gate - ffn_down - ffn_up This allows LlamaCPP users to easily tailor their chosen quant strategy to their needs, and also lets GPU users easily requantize a quant that is "a bit too big" for their VRAM. For example, a nice Miqu 70b Q5_K_M (which has no FP16 weights available beyond dequants of Q5_K_M) is short of VRAM on a pair of 3090s. And if one is French, like me, Miqu is one of one's main local models. Requanting the Q5_K_M in... Q5_K_M, but with all the ffn_down and attn_v.weight tensors specified in Q5_K, and the attn_q.weight specified in Q4_K_M, might save approximately 1.5GB without degrading the quality too much. That means 1.3-1.4GB of additional context (yummy with FA and KV cache) and, let's say, 100-200MB of additional compute cache with a reasonable BLAS batch size in MMQ. Also, the unspecified tensors won't be requantized, because LlamaCPP just copies a tensor rather than requantizing it when the tensor quant specified by the chosen strategy is the same as the source's. So one can enjoy the original Miqu quant of these tensors rather than a dequant/requant. And that's just an example. I think that many LCPP users could enjoy this feature for their own needs, even if it remains quite basic: this PR doesn't support hybrid quantization of a tensor (for example, with a fraction of the layers in the upper quant, from layer 0 onwards), or the "more_bits" calculus devised by Ikawrakow to create intervals of different quants (e.g. 1 layer in every 3 quantized with the superior quant). 
CL example: `llama-quantize --allow-requantize --imatrix Q:\iMatrix\Sheared\princeton-nlp_Sheared-LLaMA-2.7B-AR-b1924-Q8_0.iMatrix_Wiki_c32_ch500.dat --output-tensor-type q4_0 --token-embedding-type q4_0 --attn-q-type q4_0 --attn-k-type q4_0 --attn-v-type q4_0 --attn-output-type q4_0 --ffn-gate-type q4_0 --ffn-down-type q4_0 --ffn-up-type q4_0 D:\text-generation-webui\models\Q8_0\princeton-nlp_Sheared-LLaMA-2.7B-AR-b1924-Q8_0.gguf D:\text-generation-webui\models\princeton-nlp_Sheared-LLaMA-2.7B-AR-b228N.iMatrix_Wiki_c32_ch500-Q5_K_M.gguf Q5_K_M` for a full q4_0 quant equivalent to a pure quant, but specified tensor by tensor.	2024-08-07 22:09:15 +02:00
slaren	15fa07a5c5	make : use C compiler to build metal embed object (#8899 ) * make : use C compiler to build metal embed object * use rm + rmdir to avoid -r flag in rm	2024-08-07 18:24:05 +02:00
slaren	be55695eff	ggml-backend : fix async copy from CPU (#8897 ) * ggml-backend : fix async copy from CPU * cuda : more reliable async copy, fix stream used when the devices are the same	2024-08-07 13:29:02 +02:00
Ouadie EL FAROUKI	0478174d59	[SYCL] Updated SYCL device filtering (#8901 ) * Updated device filter to depend on default_selector (fixes non-intel device issues) * Small related update to example/sycl Readme	2024-08-07 11:25:36 +01:00
Johannes Gäßler	a8dbc6f753	CUDA/HIP: fix tests/test-backend-ops (#8896 )	2024-08-07 09:07:52 +02:00
Zhenwei Jin	506122d854	llama-bench : add support for getting cpu info on Windows (#8824 ) * Add support for getting cpu info on Windows for llama_bench * refactor --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-08-07 03:01:06 +02:00
Daniel Bevenius	725e3d9437	quantize : update usage comment in quantize.cpp (#8889 ) This commit updates the usage comment in quantize.cpp to reflect the new name of the executable, which is llama-quantize.	2024-08-07 01:43:00 +02:00
Nexes the Old	31958546c3	typo correction (#8891 )	2024-08-07 01:41:54 +02:00
Xuan Son Nguyen	1e6f6554aa	server : add lora hotswap endpoint (WIP) (#8857 ) * server : add lora hotswap endpoint * handle lora_no_apply * fix build * update docs * clean up struct def * fix build * add LoRA test * fix style	2024-08-06 17:33:39 +02:00
Johannes Gäßler	641f5dd2a6	CUDA: fix padding logic for FP16/FP32 (#8884 )	2024-08-06 17:13:55 +02:00
Daniel Bevenius	5f4dcb1e60	simple : update name of executable to llama-simple (#8885 ) This commit updates the name of the executable in README.md from `simple` to `llama-simple`.	2024-08-06 16:44:35 +02:00
Jaeden Amero	db20f50cf4	cmake : Link vulkan-shaders-gen with pthreads (#8835 ) When using CMake to build with Vulkan support, compiling vulkan-shaders-gen fails due to missing a CMakeLists.txt specification to link vulkan-shaders-gen with the threading library, resulting in the following error. [5/172] Linking CXX executable bin/vulkan-shaders-gen FAILED: bin/vulkan-shaders-gen : && /usr/bin/c++ ggml/src/vulkan-shaders/CMakeFiles/vulkan-shaders-gen.dir/vulkan-shaders-gen.cpp.o -o bin/vulkan-shaders-gen && : ld: error: undefined symbol: pthread_create >>> referenced by vulkan-shaders-gen.cpp >>> ggml/src/vulkan-shaders/CMakeFiles/vulkan-shaders-gen.dir/vulkan-shaders-gen.cpp.o:(std::__1::__libcpp_thread_create[abi:se180100](pthread*, >>> void ()(void), void*)) c++: error: linker command failed with exit code 1 (use -v to see invocation) [6/172] Generating build details from Git -- Found Git: /usr/local/bin/git (found version "2.45.2") ninja: build stopped: subcommand failed. Add the CMakeLists.txt specification to link vulkan-shaders-gen with the threading library and fix the above error. Fixes #8834	2024-08-06 15:21:47 +02:00
MaggotHATE	efda90c93a	[Vulkan] Fix compilation of `vulkan-shaders-gen` on w64devkit after `e31a4f6` (#8880 ) * Fix compilation issue in `vulkan-shaders-gen` `e31a4f6797` broke compilation on w64devkit. Including `algorithm` seems to fix that. * Guard it under `#ifdef _WIN32`	2024-08-06 13:32:03 +02:00
Georgi Gerganov	0bf16de07b	contributing : add note about write access	2024-08-06 11:48:01 +03:00
Molly Sophia	2d5dd7bb3f	ggml : add epsilon as a parameter for group_norm (#8818 ) Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2024-08-06 10:26:46 +03:00
Douglas Hanley	cdd1889de6	convert : add support for XLMRoberta embedding models (#8658 ) * add conversion for bge-m3; small fix in unigram tokenizer * clean up and simplify XLMRoberta conversion	2024-08-06 10:20:54 +03:00
Mengqing Cao	c21a896405	[CANN]: Fix ggml_backend_cann_buffer_get_tensor (#8871 ) * cann: fix ggml_backend_cann_buffer_get_tensor 1. fix data ptr offset 2. enable the acquisition of incomplete tensors * fix backend cann set_tensor	2024-08-06 12:42:42 +08:00
Neo Zhang	d4ff847153	[SYCL] correct cmd name (#8877 )	2024-08-06 09:09:12 +08:00
Liu Jia	0a4ce78681	common : Changed tuple to struct (TODO fix) (#8823 ) * common : Changed tuple to struct (TODO fix) Use struct `llama_init_result` to replace the previous std::tuple<struct llama_model , struct llama_context > * delete llama_init_default_params() * delete the extra whitespace	2024-08-05 18:14:10 +02:00
wangshuai09	bc0f887e15	cann: fix buffer_num and slow runtime speed errors (#8865 )	2024-08-05 21:10:37 +08:00
Eric Curtin	b42978e7e4	readme : add ramalama to the available UIs (#8811 ) ramalama is a repo agnostic boring CLI tool that supports pulling from ollama, huggingface and oci registries. Signed-off-by: Eric Curtin <ecurtin@redhat.com>	2024-08-05 15:45:01 +03:00
Justine Tunney	b9dfc25ca3	ggml : fix overflows in elu function (#8866 ) It's helpful to use expm1f(x), because expf(x)-1 will result in overflow for 25% of single-precision floating point numbers.	2024-08-05 15:43:40 +03:00
Brian	1ef14b3007	py: Add more authorship metadata from model card (#8810 ) * py: add more authorship metadata from model card * fixup! py: add more authorship metadata from model card	2024-08-05 21:15:28 +10:00
fairydreaming	d3f0c7166a	Stop the generation when <\|eom_id\|> token is encountered - needed for Llama 3.1 tool call support (#8858 ) * gguf-py, llama : add constants and methods related to Llama-3.1 <\|eom_id\|> token * llama : find Llama-3.1 <\|eom_id\|> token id during vocab loading * llama-vocab : add Llama-3.1 <\|eom_id\|> token to the set of tokens stopping the generation --------- Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>	2024-08-05 09:38:01 +02:00
stduhpf	e31a4f6797	cmake: fix paths for vulkan shaders compilation on Windows (#8573 ) * Vulkan-shaders: attempt fix compilation on windows * fix mismatched parenthesis	2024-08-05 08:18:27 +02:00
BarfingLemurs	400ae6f65f	readme : update model list (#8851 )	2024-08-05 08:54:10 +03:00
Georgi Gerganov	f1ea5146d7	llama : better replace_all (#8852 )	2024-08-05 08:53:39 +03:00
0cc4m	064cdc265f	vulkan : fix Quantized Mat-Vec Mul on AMD GPUs for ncols < 64 (#8855 ) * Fix Vulkan mul mat vec invalid results when ncols < warp size * Only run backend ops mul mat vec block size test if block size not already covered	2024-08-05 08:52:55 +03:00
Georgi Gerganov	5587e57a76	sync : ggml ggml-ci	2024-08-05 08:50:57 +03:00
0cc4m	a3738b2fa7	vulkan : implement Stable Diffusion operators (ggml/904) * Fix Vulkan repeat op * Implement Vulkan concat op * Delete old Vulkan shader generator * Implement Vulkan im2col op * Implement Vulkan unary gelu_quick op * Implement Vulkan group_norm op * Implement Vulkan timestep_embedding op * Implement Vulkan upscale op * Fix Vulkan vk_context tensor extra index issue * Fix Vulkan matmul shader parameter bug * Properly fix Vulkan matmul shader parameter bug * Add Vulkan ADD f16 + f32 -> f16 operator support * Implement Vulkan tanh op * Fix Vulkan group count too large Validation error on non-Nvidia GPUs * Throw error when too much memory is requested * Fix another Vulkan group count too large Validation error on non-Nvidia GPUs * Fix matmul MMQ condition * Implement Vulkan pad op * Fix Vulkan crash when tensor is used multiple times in a compute graph * Add Vulkan CONCAT f16 + f16 -> f16 op * Add Vulkan LEAKY_RELU op	2024-08-05 08:50:57 +03:00
Daniel Bevenius	655858ace0	ggml : move c parameter comment to ggml_rope_ext (ggml/901) This commit moves the comment for the c parameter from ggml_rope to ggml_rope_ext. The comment is currently incorrect as ggml_rope does not have a c parameter (freq_factors tensor). Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-08-05 08:50:57 +03:00
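
The `ggml : fix overflows in elu function` entry above (b9dfc25ca3) prefers `expm1f(x)` over `expf(x)-1`. As a rough illustration only (this is not ggml's code), the Python sketch below emulates single-precision rounding with `struct` and compares the two forms for a small negative input. The helper names `to_f32`, `elu_naive`, and `elu_expm1` are hypothetical, and the emulation is an assumption about how a float32 implementation would round.

```python
import math
import struct


def to_f32(x: float) -> float:
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def elu_naive(x: float) -> float:
    """ELU negative branch as exp(x) - 1, rounding each step to float32."""
    if x > 0.0:
        return x
    return to_f32(to_f32(math.exp(x)) - 1.0)


def elu_expm1(x: float) -> float:
    """ELU negative branch via expm1, rounding only the final result to float32."""
    if x > 0.0:
        return x
    return to_f32(math.expm1(x))


# A small negative input: exp(x) rounds to exactly 1.0 in float32,
# so the subtraction in the naive form discards the whole result.
x = to_f32(-1e-8)
naive = elu_naive(x)
stable = elu_expm1(x)

print(f"x      = {x!r}")
print(f"naive  = {naive!r}")
print(f"expm1  = {stable!r}")
```

For inputs this close to zero the naive form returns exactly zero, while the `expm1` form keeps a result close to `x`, which is what ELU should give there.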


3912 commits