llama.cpp

Author	SHA1	Message	Date
Jared Van Bortel	df687b10ab	kompute : support mask parameter of softmax	2024-01-24 16:51:27 -05:00
Jared Van Bortel	8bd38fe32d	test-backend-ops : test mask parameter of ggml_soft_max_ext	2024-01-24 16:28:41 -05:00
Jared Van Bortel	308f279622	kompute : support scale parameter of softmax	2024-01-24 16:17:37 -05:00
Jared Van Bortel	1450966071	test-backend-ops : test scale parameter of ggml_soft_max_ext	2024-01-24 16:17:37 -05:00
Jared Van Bortel	2852902eda	test-backend-ops : add llama test	2024-01-24 16:17:29 -05:00
Jared Van Bortel	2b0f642fec	fix f16 mmv, 49 -> 41 failures	2024-01-24 13:43:49 -05:00
Jared Van Bortel	1a14099c43	fix q4_0/q4_1 mmv, 65 -> 49 failures	2024-01-24 13:43:48 -05:00
Jared Van Bortel	0787b80db8	kompute : remove broken mulrow kernel -> 1 less test failure	2024-01-24 13:43:48 -05:00
Jared Van Bortel	2755ae3d10	kompute : fix more dispatch ambiguity -> 12 less failures	2024-01-24 13:43:47 -05:00
Jared Van Bortel	08e23fd78c	kompute : fix op_mul kernel -> 13 less test failures	2024-01-24 13:43:47 -05:00
Jared Van Bortel	0899adf86e	kompute : fix get_rows dispatch -> 4 less failures	2024-01-24 13:43:47 -05:00
Jared Van Bortel	cb9ceff966	minor cleanup	2024-01-24 13:43:46 -05:00
Georgi Gerganov	33e8d6abe1	kompute : fix ggml_add kernel (#5027 )	2024-01-24 13:43:46 -05:00
Jared Van Bortel	2f6a279e29	fix supported ops for kompute backend	2024-01-24 13:43:45 -05:00
Jared Van Bortel	07530731ba	never try to evaluate an empty command buffer This fixes the immediate crashes with test-backend-ops - when evaluating individual no-ops like OP_VIEW, it tries to submit an empty command buffer, which crashes RADV and hangs AMDVLK.	2024-01-24 13:43:45 -05:00
Jared Van Bortel	729e1a4cc1	sync op_rope_f16 with recent op_rope_f32 changes	2024-01-24 13:43:45 -05:00
Jared Van Bortel	e9d5223da3	actually fix this assertion	2024-01-24 13:43:44 -05:00
Jared Van Bortel	9431026a84	clean up old backend code	2024-01-24 13:43:44 -05:00
Georgi Gerganov	d6bd471693	kompute : fix rope_f32 and scale ops (#5008 )	2024-01-24 13:43:44 -05:00
Jared Van Bortel	76474a7c0d	kompute : ignore exceptions in ggml_vk_available_devices (#12 ) Signed-off-by: Jared Van Bortel <jared@nomic.ai>	2024-01-24 13:43:43 -05:00
Jared Van Bortel	cad72e1252	add sanity check and fix kompute teardown order	2024-01-24 13:43:43 -05:00
Jared Van Bortel	070919dbf7	attempt to get test-backend-ops working	2024-01-24 13:43:43 -05:00
Jared Van Bortel	5f660dada8	fix assertion failure	2024-01-24 13:43:42 -05:00
Jared Van Bortel	298d6eec09	kompute : initial attempt at ggml-backend v2 support	2024-01-24 13:43:40 -05:00
Jared Van Bortel	7c527eb568	Merge commit 'e7e4df031b' into HEAD	2024-01-24 13:39:17 -05:00
slaren	e7e4df031b	llama : ggml-backend integration (#4766 ) * llama : ggml-backend integration * ggml-backend : add names to buffers * fix unmap after loading * batched-bench : add tensor_split param * llama : check for null tensor_split * ggml-backend : increase GGML_MAX_BACKENDS * improve graph splitting, partial fix for --no-kv-offload * cuda : add ggml-backend split buffer support * cuda : do not create buffer types for devices that don't exist (fixes usage without CUDA devices available) * ggml : fix null backend dereference (#4807) * ggml : fix null backend dereference * ggml : also check ggml_backend_is_cpu * test-backend-ops : check buffer allocation failures * llama : add cparam (split_mode) and command line argument (--split-mode, -sm) to configure the split mode (none, layer or row) * ggml : fix mul_mat_id work size * llama : rewrite session kv load/set without graphs * minor * llama : only initialize used backends, free backends on context free * llama : abort ctx if cuda backend init fails * llama : rewrite lora with ggml-backend and compute on CPU ggml-ci * llama : only map to a backend buffer the region of the file mapping containing the tensors used in the buffer * opencl : add ggml-backend buffer type * cuda : only use batched_cublas with batched mat muls (fixes fp16 tg perf) * llama : on Metal, by default offload the full model ggml-ci * metal : page align the data ptr (#4854) * Apply suggestions from code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * cuda : fix split buffer free * address review comments * llama-bench : add split-mode parameter * fix whitespace * opencl : fix double initialization * server : add --split-mode parameter * use async copy and compute to improve multi-gpu performance ggml-ci * use async memcpys to copy the graph outputs to the CPU * fix opencl * use a host buffer for the cpu compute buffer for faster copies to the gpu --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2024-01-12 20:07:38 +01:00
Georgi Gerganov	584d674be6	llama : remove redundant assert for StableLM (#4901 )	2024-01-12 20:54:12 +02:00
Daniel Bevenius	930f907d3e	export-lora : use LLAMA_FILE_MAGIC_GGLA (#4894 ) This commit replaces the magic number used in export-lora.cpp with the one defined in llama.h, which is indirectly included via common.h. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-01-12 19:54:53 +02:00
Zay	e790eef21c	llama.swiftui : update models layout (#4826 ) * Updated Models Layout - Added a models drawer - Added downloading directly from Hugging Face - Load custom models from local folder - Delete models by swiping left * trimmed trailing white space * Updated Models Layout	2024-01-12 14:48:00 +02:00
Georgi Gerganov	5537d9d36b	gitignore : imatrix	2024-01-12 14:33:21 +02:00
Johannes Gäßler	1b280c9fff	CUDA: fix softmax compile for old CUDA versions (#4862 )	2024-01-12 12:30:41 +01:00
Georgi Gerganov	3cabe80630	llama : fix typo "imp_embd" -> "inp_embd"	2024-01-12 13:11:15 +02:00
howlger	4315a94366	common : streamline the formatting of help (#4890 ) * common : streamline the formatting of help - Separate alternative parameters by a comma - Do not indent `--version` differently * Update common/common.cpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-01-12 13:05:32 +02:00
Georgi Gerganov	2d00741e12	py : fix lint (#4889 )	2024-01-12 13:03:38 +02:00
Georgi Gerganov	f445c0e68c	llama : fix llm_build_k_shift to use correct n_rot (#4889 ) * llama : fix llm_build_k_shift to use correct n_rot ggml-ci * llama : always use hparams.n_rot for ggml_rope_custom ggml-ci * convert : fix persimmon conversion to write correct n_rot	2024-01-12 13:01:56 +02:00
Kawrakow	326b418b59	Importance Matrix calculation (#4861 ) * imatrix: 1st version * imatrix: WIP * Cleanup * Update examples/imatrix/imatrix.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-01-12 06:59:57 +01:00
Georgi Gerganov	1d118386fe	server : fix infill when prompt is empty (#4833 )	2024-01-11 23:23:49 +02:00
Georgi Gerganov	7edefbd79c	main : better name for variable n_print (#4874 )	2024-01-11 22:46:26 +02:00
Georgi Gerganov	3ca63b4538	main : disable token count by default (#4874 )	2024-01-11 22:43:05 +02:00
Georgi Gerganov	b037787548	swift : track ggml release branch (#4867 )	2024-01-11 21:58:28 +02:00
Kawrakow	469e75d0a3	llama : restore intended k-quants mixes for MoE models (#4872 ) * Restore intended k-quants quantization mixes for MoE models * Update Q2_K_S values in the quantize tool Still using LLaMA-v1 PPL values in the quant description today does not make much sense. But let's leave this update for another PR. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-01-11 21:43:15 +02:00
Kawrakow	49662cbed3	ggml : SOTA 2-bit quants (add IQ2_XS) (#4856 ) * iq2_xs: basics * iq2_xs: this should have been in the basics * iq2_xs: CUDA and scalar CPU works * iq2_xs: WIP Metal * iq2_xs: Metal now works * iq2_xs: working, but dog slow, ARM_NEON dot product * iq2_xs: better ARM_NEON dot product We are now at 19.5 t/s for TG-128 and 61 t/s for PP-512 when running on the CPU. * iq2_xs: AVX2 dot product - 19.5 t/s * iq2_xs: faster AVX2 dot product 21.4 t/s for TG-128, 59.2 t/s for PP-512. The latter is 2x compared to the previous version. * iq2_xs: had forgotten to delete iq2-data.h * Add llama enum for IQ2_XS --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-11 21:39:39 +02:00
Georgi Gerganov	3ba5b8ca8e	swift : pin ggml commit + remove ggml.h from spm-headers (#4878 ) ggml-ci	2024-01-11 21:31:31 +02:00
Laura	4330bd83fe	server : implement credentialed CORS (#4514 ) * Implement credentialed CORS according to MDN * Fix syntax error * Move validate_api_key up so it is defined before its first usage	2024-01-11 20:02:48 +02:00
Michael Coppola	27379455c3	server : support for multiple api keys (#4864 ) * server: added support for multiple api keys, added loading api keys from file * minor: fix whitespace * added file error handling to --api-key-file, changed code to better reflect current style * server: update README.md for --api-key-file --------- Co-authored-by: Michael Coppola <info@michaeljcoppola.com>	2024-01-11 19:51:17 +02:00
Behnam M	eab6795006	server : add `LOG_INFO` when model is successfully loaded (#4881 ) * added /health endpoint to the server * added comments on the additional /health endpoint * Better handling of server state When the model is being loaded, the server state is `LOADING_MODEL`. If model-loading fails, the server state becomes `ERROR`, otherwise it becomes `READY`. The `/health` endpoint provides more granular messages now according to the server_state value. * initialized server_state * fixed a typo * starting http server before initializing the model * Update server.cpp * Update server.cpp * fixes * fixes * fixes * made ServerState atomic and turned two-line spaces into one-line * updated `server` readme to document the `/health` endpoint too * used LOG_INFO after successful model loading	2024-01-11 19:41:39 +02:00
Someone	d8d90aa343	ci: nix-flake-update: new token with pr permissions (#4879 ) * ci: nix-flake-update: new token with pr permissions --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-01-11 17:22:34 +00:00
pudepiedj	43f76bf1c3	main : print total token count and tokens consumed so far (#4874 ) * Token count changes * Add show token count * Updating before PR * Two requested changes * Move param def posn	2024-01-11 18:14:52 +02:00
Isaac McFadyen	2f043328e3	server : fix typo in model name (#4876 )	2024-01-11 16:33:26 +02:00
Paul Tsochantaris	2a7c94db5f	metal : put encoder debug group behind a define (#4873 )	2024-01-11 16:31:52 +02:00

1960 commits