llama.cpp

Author	SHA1	Message	Date
lshzh-ww	807ef887b2	fix white spaces	2023-07-21 12:39:44 -04:00
lshzh-ww	6ee897a501	metal: issue operations concurrently if possible Using the new ggml functions.	2023-07-21 11:23:51 -04:00
lshzh-ww	1c3030ee41	ggml: try to issue operations concurrently on GPU This commit add a ggml_graph_find_concurrency function to find if some operations can be issued simultaneously by GPU. Before sending a graph to the GPU backend we can call the new function to find concurrency in the graph. This will sort all the nodes and insert memory barrier nodes if necessary. one can simply dismiss the barrier nodes and issue operations sequentially, or try to concuurrently issue all the operations between two barriers.	2023-07-21 11:23:18 -04:00
lshzh-ww	c8e6ef1846	metal: only encode in one command buffer It's advised a program should only have one command buffer. This slow inference by ~1 ms on 33B model, but we may avoid it by reusing previous command queue.	2023-07-21 11:17:48 -04:00
Kawrakow	d924522a46	Custom RoPE + bettter memory management for CUDA (#2295 ) * Custom RoPE + bettter memory management for CUDA * Adjusted look ahead in ggml_cuda_pool_malloc to 5% This is sufficient it seems. We end up using about 200 MB less VRAM that way when running the 13B model with context 8192. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2023-07-21 17:27:51 +03:00
Kawrakow	4d76a5f49b	Faster Q3_K implementation on Metal (#2307 ) * Faster Q3_K on Metal * Additional Q3_K speedup on Metal * Q3_K for QK_K = 64 * Better Q3_K for QK_K = 64 21.6 ms/t -> 21.1 ms/t --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2023-07-21 17:05:30 +03:00
Georgi Gerganov	0db14fef06	ggml : fix the rope fix (`513f861953`)	2023-07-21 15:16:55 +03:00
Ikko Eltociear Ashimine	03e566977b	examples : fix typo in minigpt4.py (#2298 ) promt -> prompt	2023-07-21 14:53:07 +03:00
Georgi Gerganov	513f861953	ggml : fix rope args order + assert (#2054 )	2023-07-21 14:51:34 +03:00
Georgi Gerganov	3973b25a64	gitignore : fix final newline	2023-07-21 14:42:41 +03:00
Guillaume "Vermeille" Sanchez	ab0e26bdfb	llama : remove cfg smooth factor as it is only a reparameterization of the guidance scale (#2280 )	2023-07-21 13:58:36 +03:00
Jose Maldonado	73643f5fb1	gitignore : changes for Poetry users + chat examples (#2284 ) A fix in Makefile for FreeBSD users. In the platfrom x86_64 is amd64. This fix resolve compilation using CFLAGS and CXXFLAGS with -march=native and -mtune=native Add two examples for interactive mode using Llama2 models (thx TheBloke for models) Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-07-21 13:53:27 +03:00
Georgi Gerganov	a814d04f81	make : fix indentation	2023-07-21 13:50:55 +03:00
Georgi Gerganov	4c013bb738	ci : fix MNT realpath usage (#2250 )	2023-07-21 13:49:18 +03:00
Sky Yan	42c7c2e2e9	make : support customized LLAMA_CUDA_NVCC and LLAMA_CUDA_CCBIN (#2275 ) Under certain environment, nvcc and gcc is installed under customized path but not standard path Co-authored-by: Yan Lin <yanlin@baidu.com>	2023-07-21 13:38:57 +03:00
wzy	78a3d13424	flake : remove intel mkl from flake.nix due to missing files (#2277 ) NixOS's mkl misses some libraries like mkl-sdl.pc. See #2261 Currently NixOS doesn't have intel C compiler (icx, icpx). See https://discourse.nixos.org/t/packaging-intel-math-kernel-libraries-mkl/975 So remove it from flake.nix Some minor changes: - Change pkgs.python310 to pkgs.python3 to keep latest - Add pkgconfig to devShells.default - Remove installPhase because we have `cmake --install` from #2256	2023-07-21 13:26:34 +03:00
Georgi Gerganov	ae178ab46b	llama : make tensor_split ptr instead of array (#2272 )	2023-07-21 13:10:51 +03:00
Jiří Podivín	54e3bc76fe	make : add new target for test binaries (#2244 ) Programs in the tests directory are now build with target tests and placed in the same location. * clean target was expanded to remove new binaries * test target binaries are listed in a variable * Locations of binaries were added to the .gitignore Signed-off-by: Jiri Podivin <jpodivin@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-07-21 13:09:16 +03:00
Hatsune Miku	019fe257bb	MIKU MAYHEM: Upgrading the Default Model for Maximum Fun 🎉 (#2287 ) * Miku.sh: Set default model to llama-2-7b-chat * Miku.sh: Set ctx_size to 4096 * Miku.sh: Add in-prefix/in-suffix opts * Miku.sh: Switch sampler to mirostat_v2 and tiny prompt improvements	2023-07-21 11:13:18 +03:00
Kawrakow	e68c96f7fe	Faster Q2_K on Metal (#2297 ) * Faster Q2_K on Metal * Deleting unnoticed and dangereous trailing white space * Fixed bug in new metal Q2_K implementation --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2023-07-21 10:44:40 +03:00
Przemysław Pawełczyk	9cf022a188	make : fix embdinput library and server examples building on MSYS2 (#2235 ) * make : fix embdinput library and server examples building on MSYS2 * cmake : fix server example building on MSYS2	2023-07-21 10:42:21 +03:00
Kawrakow	e782c9e735	Faster Q5_K and Q6_K on Metal (#2294 ) * Faster Q6_K on Metal * Faster Q5_K on Metal * Another Q5_K speedup --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2023-07-20 18:19:45 +03:00
Kawrakow	785829dfe8	Faster Q4_K on Metal (#2290 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2023-07-20 15:18:43 +03:00
Georgi Gerganov	fff0e0eafe	llama : fix regression from #2000 - could not load no-mmap models	2023-07-20 13:47:26 +03:00
Shouzheng Liu	417a85a001	metal: minor q4 optimization and reduce code size (#2248 ) * metal: use uint16_t instead of uint8_t. Apple GPU doesn't like uint8_t. For every operation on uint8_t the gpu need to copy the uint8_t to an empty 16 bit register, then it can issue other instructions. For the matrix-vector multiplication kernel only, we observed a 340~350 GB/s memory read speed on M1 Max after this commit, which is very close to the reported hardware limit. * metal: update rms_norm kernel This commit double the speed of rms_norm operations by using 512 threads per threadgroup, combining with SIMD primitives to minimize the need for thread group barriers. * metal: use template to reduce size Revert modifications on block_q4_0 and block_q4_1.	2023-07-20 13:32:22 +03:00
Rinne	294f424554	llama : extend API to get max devices at runtime (#2253 )	2023-07-19 10:06:40 +03:00
wzy	45a1b07e9b	flake : update flake.nix (#2270 ) When `isx86_32 \|\| isx86_64`, it will use mkl, else openblas According to https://discourse.nixos.org/t/rpath-of-binary-contains-a-forbidden-reference-to-build/12200/3, add -DCMAKE_SKIP_BUILD_RPATH=ON Fix #2261, Nix doesn't provide mkl-sdl.pc. When we build with -DBUILD_SHARED_LIBS=ON, -DLLAMA_BLAS_VENDOR=Intel10_lp64 replace mkl-sdl.pc by mkl-dynamic-lp64-iomp.pc	2023-07-19 10:01:55 +03:00
wzy	b1f4290953	cmake : install targets (#2256 ) fix #2252	2023-07-19 10:01:11 +03:00
Georgi Gerganov	d01bccde9f	ci : integrate with ggml-org/ci (#2250 ) * ci : run ctest ggml-ci * ci : add open llama 3B-v2 tests ggml-ci * ci : disable wget progress output ggml-ci * ci : add open llama 3B-v2 tg tests for q4 and q5 quantizations ggml-ci * tests : try to fix tail free sampling test ggml-ci * ci : add K-quants ggml-ci * ci : add short perplexity tests ggml-ci * ci : add README.md * ppl : add --chunks argument to limit max number of chunks ggml-ci * ci : update README	2023-07-18 14:24:43 +03:00
Georgi Gerganov	6cbf9dfb32	llama : shorten quantization descriptions	2023-07-18 11:50:49 +03:00
Jiahao Li	7568d1a2b2	Support dup & cont ops on CUDA (#2242 )	2023-07-17 20:39:29 +03:00
Alex Klinkhamer	b7647436cc	llama : fix t_start_sample_us initialization warning (#2238 )	2023-07-17 00:01:45 +03:00
Qingyou Meng	672dda10e4	ggml : fixed runtime bugs and compile errors related to GGML_PERF and GGML_DEBUG (#2219 ) * fixed runtime bugs and compile errors related to GGML_PERF and GGML_DEBUG * remove ifdef GGML_PERF; update fmt	2023-07-16 22:57:28 +03:00
Jiří Podivín	27ab66e437	py : turn verify-checksum-models.py into executable (#2245 ) README.md was adjusted to reflect the change. Signed-off-by: Jiri Podivin <jpodivin@gmail.com>	2023-07-16 22:54:47 +03:00
Xiao-Yong Jin	6e7cca4047	llama : add custom RoPE (#2054 ) * Implement customizable RoPE The original RoPE has pre-defined parameters theta_i = 10000^(−2(i−1)/d), for i in [1, 2, ..., d/2] Our customizable RoPE, ggml_rope_custom_inplace, uses theta_i = scale * base^(−2(i−1)/d), for i in [1, 2, ..., d/2] with the default matches the original scale = 1.0 base = 10000 The new command line arguments --rope-freq-base --rope-freq-scale set the two new RoPE parameter. Recent researches show changing these two parameters extends the context limit with minimal loss. 1. Extending Context to 8K kaiokendev https://kaiokendev.github.io/til#extending-context-to-8k 2. Extending Context Window of Large Language Models via Positional Interpolation Shouyuan Chen, Sherman Wong, Liangjian Chen, Yuandong Tian https://arxiv.org/abs/2306.15595 3. NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation. https://www.reddit.com/user/bloc97 https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/ For the bold, try adding the following command line parameters to your favorite model: -c 16384 --rope-freq-base 80000 --rope-freq-scale 0.5 * ggml-metal: fix custom rope * common: fix argument names in help * llama: increase MEM_REQ_EVAL for MODEL_3B It avoids crashing for quantized weights on CPU. Better ways to calculate the required buffer size would be better. * llama: make MEM_REQ_EVAL depend on n_ctx * server: use proper Content-Type in curl examples Without the header Content-Type: application/json, curl will POST with Content-Type: application/x-www-form-urlencoded Though our simple server doesn't care, the httplib.h used has a limit with CPPHTTPLIB_FORM_URL_ENCODED_PAYLOAD_MAX_LENGTH 8192 With Content-Type: application/json, we can send large json data. * style : minor fixes, mostly indentations * ggml : fix asserts --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-07-15 13:34:16 +03:00
Dave Della Costa	a6803cab94	flake : add runHook preInstall/postInstall to installPhase so hooks function (#2224 )	2023-07-14 22:13:38 +03:00
wzy	7dabc66f3c	make : use pkg-config for OpenBLAS (#2222 )	2023-07-14 22:05:08 +03:00
Bach Le	7cdd30bf1f	cuda : allocate all temporary ggml_tensor_extra_gpu from a fixed-size buffer (#2220 )	2023-07-14 22:00:58 +03:00
Evan Miller	e8035f141e	ggml : fix static_assert with older compilers #2024 (#2218 )	2023-07-14 21:55:56 +03:00
Bach Le	7513b7b0a1	llama : add functions that work directly on model (#2197 ) * Remove vocab reference from context * Add functions that works directly with model	2023-07-14 21:55:24 +03:00
Ali Chraghi	de8342423d	build.zig : install config header (#2216 )	2023-07-14 21:50:58 +03:00
Shangning Xu	c48c525f87	examples : fixed path typos in embd-input (#2214 )	2023-07-14 21:40:05 +03:00
Jiahao Li	206e01de11	cuda : support broadcast add & mul (#2192 ) Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-07-14 21:38:24 +03:00
Johannes Gäßler	4304bd3cde	CUDA: mul_mat_vec_q kernels for k-quants (#2203 )	2023-07-14 19:44:08 +02:00
James Reynolds	229aab351c	make : fix combination of LLAMA_METAL and LLAMA_MPI (#2208 ) Fixes https://github.com/ggerganov/llama.cpp/issues/2166 by moving commands after the CFLAGS are changed.	2023-07-14 20:34:40 +03:00
Georgi Gerganov	697966680b	ggml : sync (ggml_conv_2d, fix mul_mat bug, CUDA GLM rope)	2023-07-14 16:36:41 +03:00
Kawrakow	27ad57a69b	Metal: faster Q4_0 and Q4_1 matrix x vector kernels (#2212 ) * 3-5% faster Q4_0 on Metal * 7-25% faster Q4_1 on Metal * Oops, forgot to delete the original Q4_1 kernel --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2023-07-14 11:46:21 +02:00
Howard Su	32c5411631	Revert "Support using mmap when applying LoRA (#2095 )" (#2206 ) Has perf regression when mlock is used. This reverts commit `2347463201`.	2023-07-13 21:58:25 +08:00
Howard Su	ff5d58faec	Fix compile error on Windows CUDA (#2207 )	2023-07-13 21:58:09 +08:00
Bodo Graumann	b782422a3e	devops : add missing quotes to bash script (#2193 ) This prevents accidentally expanding arguments that contain spaces.	2023-07-13 16:49:14 +03:00

1 2 3 4 5 ...

877 commits