Georgi Gerganov
523e49b111
llm : fix falcon norm after refactoring ( #3837 )
2023-11-01 23:00:50 +02:00
Georgi Gerganov
e16b9fa4ba
metal : multi-simd softmax ( #3710 )
...
ggml-ci
2023-11-01 21:25:00 +02:00
Georgi Gerganov
ff8f9a88da
common : minor ( #3715 )
2023-11-01 21:15:55 +02:00
Georgi Gerganov
50337961a6
llm : add llm_build_context ( #3881 )
...
* llm : add llm_build_context
* llm : deduce norm eps based on type + explicit max_alibi_bias, clamp_kqv
* llm : restore the non-graph llm_build_ functional API
ggml-ci
* llm : cleanup + comments
2023-11-01 20:11:02 +02:00
bandoti
0e40806c1c
common : allow caller to handle help/argument exceptions ( #3715 )
...
* Allow caller to handle help/argument exceptions
* Prepend newline to usage output
* Add new gpt_params_parse_ex function to hide arg-parse impl
* Fix issue blocking success case
* exit instead of returning false
* Update common/common.h
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update common/common.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-11-01 19:42:01 +02:00
Concedo
21588cefd4
tunnel code done (+1 squashed commits)
...
Squashed commits:
[b4bc7d20] wip integration of trycloudflare
2023-11-01 23:28:23 +08:00
staviq
a2758d08e4
log : make generating separate log files optional ( #3787 )
...
* impl --log-new, --log-append
* Update common/log.h
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
* Update common/log.h
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
* Apply suggestions from code review
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
---------
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
2023-11-01 16:18:27 +02:00
l3utterfly
e75dfdd31b
sampling : null grammar field after reset ( #3885 )
2023-11-01 15:40:43 +02:00
Concedo
3b227fc704
automatic gpu layer detection
2023-11-01 20:55:26 +08:00
Concedo
b395dbf6f5
wip layer calculator
2023-11-01 20:04:10 +08:00
Georgi Gerganov
9a3b4f6c86
ggml : fix UNUSED macro ( #3762 )
2023-11-01 13:50:45 +02:00
Andrew Godfrey
73bdcb395e
finetune : add -ngl parameter ( #3762 )
...
* Add '-ngl' support to finetune.cpp
* Add fprintf in ggml_cuda_op_add
When I tried CUDA offloading during finetuning following the readme, I got an assert here.
This probably isn't an important case because inference later gives a warning saying you should use f16 or f32 instead when using lora
* Add 'finetune.sh', which currently fails when using GPU
"error: operator (): Finetuning on tensors with type 'f16' is not yet supported"
* tweak finetune.sh
* Suppress some warnings in ggml.c
* Add f16 implementation to ggml_compute_forward_add_f16_f32
* Add an f16 case to ggml_add_cast_impl and llama_build_lora_finetune_graphs
* finetune.sh: Edit comments
* Add "add_f16_f32_f32_cuda"
* Tweak an error message
* finetune.sh: Add an optional LLAMA_MODEL_DIR variable
* finetune.sh: Add an optional LLAMA_TRAINING_DIR variable
* train : minor
* tabs to spaces
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
2023-11-01 13:49:04 +02:00
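Note: the '-ngl' flag added above is the usual way of requesting GPU layer offload. A rough sketch of wiring such a flag into an argument loop; the struct and helper below are placeholders for illustration, not the actual finetune.cpp code:

```cpp
#include <cstdlib>
#include <cstring>

struct finetune_params_sketch {
    int n_gpu_layers = 0;  // layers to offload to the GPU
};

// Scan argv for -ngl / --n-gpu-layers and record the value (illustrative only).
static bool parse_ngl_sketch(int argc, char ** argv, finetune_params_sketch & params) {
    for (int i = 1; i < argc; i++) {
        if (std::strcmp(argv[i], "-ngl") == 0 || std::strcmp(argv[i], "--n-gpu-layers") == 0) {
            if (++i >= argc) {
                return false;  // flag given without a value
            }
            params.n_gpu_layers = std::atoi(argv[i]);
        }
    }
    return true;
}
```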
Concedo
ae2cd56de8
kobold integration of min_p sampler (+1 squashed commits)
...
Squashed commits:
[8ad2e349] kobold integration for min_p sampler
2023-11-01 19:08:45 +08:00
Concedo
bcb397953f
Merge remote-tracking branch 'llama.cpp/try-fix-3869' into concedo_experimental
2023-11-01 18:29:08 +08:00
Concedo
92d80b94b3
bundle simpleclinfo into pyinstaller except for linux
2023-11-01 18:26:15 +08:00
Concedo
9342636408
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# flake.lock
# flake.nix
2023-11-01 18:24:36 +08:00
Concedo
df7e757d40
windows: added simpleclinfo, which helps determine clblast platform and device on windows
2023-11-01 18:10:35 +08:00
Georgi Gerganov
f0e209324a
scripts : add server-llm.sh ( #3868 )
...
* scripts : add deploy-server.sh
* scripts : rename to server-llm.sh
* scripts : working curl pipe
2023-11-01 11:29:07 +02:00
Adrian Hesketh
ca190bca8e
server : re-enable completion and embeddings at the same time ( #3876 )
2023-11-01 11:28:28 +02:00
Georgi Gerganov
71e3718abd
llama : refactor graph build code ( #3837 )
...
* llama : factor out ggml-alloc from graph build functions
ggml-ci
* metal : disable kernel load log
* llama : factor out tensor offloading outside the build call (wip)
ggml-ci
* llama : offload rest of the models
ggml-ci
* llama : update offload log messages to print node index
* llama : comments
* llama : support offloading result_norm + comments
* llama : factor graph input into a function
* llama : do tensor offload only with CUDA
* llama : fix res_norm offloading
* llama : try to optimize offloading code
* llama : fix non-CUDA build
* llama : try to fix build
* llama : move refact in correct place + optimize graph input
* llama : refactor tensor offloading as callback
* llama : add layer index to all tensor names
* llama : add functional header
* llama : comment
ggml-ci
* llama : remove obsolete map for layer counting
* llama : add llm_build helper functions (#3848 )
* llama : add llm_build_norm helper function
ggml-ci
* llama : add llm_build_ffn helper function (#3849 )
ggml-ci
* llama : add llm_build_k_shift helper
ggml-ci
* llama : fix offloading after recent changes
* llama : add llm_build_kv_store helper
ggml-ci
* llama : remove obsolete offload names
* llama : fix llm_build_k_shift to use n_head_kv instead of n_head
* llama : simplify falcon Q, K, V computation
* llama : remove obsolete comments in build graphs
* llama : add llm_build_kqv helper
ggml-ci
* llama : minor
* llama : add LLAMA_OFFLOAD_DEBUG + fix starcoder offloading
* llama : fix input allocation logic
* llama : update offload functions for KQ tensors
* llama : normalize tensor names
ggml-ci
* llama : enable warning about not offloaded tensors
* llama : remove extra ; + deduplicate gate_b logic
* llama : add llm_build_inp_embd helper
2023-11-01 08:04:02 +02:00
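Note: the llm_build_* helpers referenced above centralize repeated graph-building patterns. As a purely illustrative sketch of the idea, a shared RMS-norm builder might look like the following; the name and signature are hypothetical, only ggml_rms_norm and ggml_mul are real ggml API:

```cpp
#include "ggml.h"

// Hypothetical shared RMS-norm builder, in the spirit of llm_build_norm.
static struct ggml_tensor * build_rms_norm_sketch(
        struct ggml_context * ctx,
        struct ggml_tensor  * cur,     // activations to normalize
        struct ggml_tensor  * weight,  // learned per-channel scale
        float                 eps) {   // epsilon, deduced per model type in #3881
    cur = ggml_rms_norm(ctx, cur, eps);
    cur = ggml_mul(ctx, cur, weight);
    return cur;
}
```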
kalomaze
238657db23
samplers : Min-P sampler implementation [alternative to Top P/Top K] ( #3841 )
...
* Introduce the new Min-P sampler by @kalomaze
The Min-P sampling method was designed as an alternative to Top-P, and aims to ensure a balance of quality and variety. The parameter *p* represents the minimum probability for a token to be considered, relative to the probability of the most likely token.
* Min-P enabled and set to 0.05 default
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
2023-10-31 20:44:49 +01:00
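Note: as described above, Min-P keeps only tokens whose probability is at least p times that of the most likely token. A minimal, self-contained sketch of that filter (not the actual llama.cpp sampler), assuming a non-empty probability vector:

```cpp
#include <algorithm>
#include <vector>

// Keep tokens whose probability is at least min_p times the top probability,
// then renormalize the surviving mass. Purely illustrative.
static std::vector<float> min_p_filter(const std::vector<float> & probs, float min_p /* e.g. 0.05 */) {
    const float p_max     = *std::max_element(probs.begin(), probs.end());
    const float threshold = min_p * p_max;

    std::vector<float> kept;
    float sum = 0.0f;
    for (float p : probs) {
        if (p >= threshold) { kept.push_back(p); sum += p; }
    }
    for (float & p : kept) { p /= sum; }
    return kept;
}
```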
Georgi Gerganov
22cc9bef09
cuda : check if this fixes Pascal card regression
2023-10-31 20:01:47 +02:00
Tungsten842
07178c98e1
flake.nix: fix for rocm 5.7 ( #3853 )
2023-10-31 19:24:03 +02:00
Concedo
43a5143450
added clinfo binary, cleanup unused stuff
2023-10-31 22:25:25 +08:00
Concedo
f3690ba6d2
shifting enabled by default
2023-10-31 21:41:57 +08:00
Concedo
e62f38abd1
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# tests/test-double-float.cpp
# tests/test-quantize-fns.cpp
2023-10-31 21:09:49 +08:00
Concedo
cc5b282350
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# CMakeLists.txt
# Makefile
# build.zig
# flake.lock
# flake.nix
# ggml.c
2023-10-31 20:44:04 +08:00
Georgi Gerganov
207b51900e
ggml : move FP16 <-> FP32 code to ggml-impl.h ( #3861 )
...
* ggml : move FP16 <-> FP32 stuff to ggml-impl.h
ggml-ci
* tests : fix ARM build
* ggml : explicitly initialize deprecated type traits
* ggml : add math.h to ggml-impl.h
* ggml : remove duplicate static assert macros
* ggml : prefix lookup tables with ggml_
ggml-ci
* ggml-impl : move extern "C" to start of file
2023-10-30 19:19:15 +02:00
Concedo
9eba77c6a0
finally got something workable
2023-10-30 23:30:21 +08:00
Concedo
61c395833d
context shifting is still buggy
2023-10-30 16:25:01 +08:00
Kerfuffle
6e08281e58
Extend llama_kv_cache_seq_rm to allow matching any sequence ( #3843 )
...
* Extend llama_kv_cache_seq_rm to allow matching any sequence
* Replace llama_kv_cache_tokens_rm with llama_kv_cache_clear
Use llama_kv_cache_clear for cache clearing
Change calls to llama_kv_cache_tokens_rm that want to delete by position to use llama_kv_cache_seq_rm functionality
2023-10-29 11:31:40 -06:00
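Note: a usage sketch of the API change described above, assuming the (ctx, seq_id, p0, p1) shape implied by the commit; treating negative arguments as "any sequence" / "to the end" is an assumption here, not a verified contract of this revision:

```cpp
#include "llama.h"

// Drop everything from position keep_until onward, in every sequence:
// a negative seq_id is the "match any sequence" case added by this commit.
static void prune_cache(struct llama_context * ctx, llama_pos keep_until) {
    llama_kv_cache_seq_rm(ctx, /*seq_id =*/ -1, /*p0 =*/ keep_until, /*p1 =*/ -1);
}

// Full reset: replaces the old llama_kv_cache_tokens_rm pattern.
static void reset_cache(struct llama_context * ctx) {
    llama_kv_cache_clear(ctx);
}
```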
cebtenzzre
2046eb4345
make : remove unnecessary dependency on build-info.h ( #3842 )
2023-10-29 18:33:47 +02:00
Georgi Gerganov
71a09da301
llama : fix kv shift bug ( #3835 )
...
ggml-ci
2023-10-29 18:32:51 +02:00
Georgi Gerganov
d69d777c02
ggml : quantization refactoring ( #3833 )
...
* ggml : factor all quantization code in ggml-quants
ggml-ci
* ggml-quants : fix Zig and Swift builds + quantize tool
ggml-ci
* quantize : --pure option for disabling k-quant mixtures
---------
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
2023-10-29 18:32:28 +02:00
Concedo
7f5d1b2fc6
slider error
2023-10-30 00:02:38 +08:00
Concedo
7f050b5d16
tweak numbers
2023-10-29 22:46:19 +08:00
Concedo
7924592a83
context shift feature done
2023-10-29 18:21:39 +08:00
Concedo
338d6c265d
fixes to smartcontextpro
2023-10-29 10:42:37 +08:00
Erik Scholz
ff3bad83e2
flake : update flake.lock for newer transformers version + provide extra dev shell ( #3797 )
...
* flake : update flake.lock for newer transformers version + provide extra dev shell with torch and transformers (for most convert-xxx.py scripts)
2023-10-28 16:41:07 +02:00
Aarni Koskela
82a6646e02
metal : try cwd for ggml-metal.metal if bundle lookup fails ( #3793 )
...
* Try cwd for ggml-metal if bundle lookup fails
When building with `-DBUILD_SHARED_LIBS=ON -DLLAMA_METAL=ON -DLLAMA_BUILD_SERVER=ON`,
`server` would fail to load `ggml-metal.metal` because `[bundle pathForResource:...]`
returns `nil`. In that case, fall back to `ggml-metal.metal` in the cwd instead of
passing `null` as a path.
Follows up on #1782
* Update ggml-metal.m
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-10-28 15:43:01 +03:00
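Note: the fallback above lives in Objective-C (ggml-metal.m); restated in C++ purely to illustrate the path-resolution logic, with a placeholder function name:

```cpp
#include <string>

// If the bundle lookup produced no path, try ggml-metal.metal in the
// current working directory instead of passing a null path onward.
static std::string resolve_metal_source(const std::string & bundle_path) {
    if (!bundle_path.empty()) {
        return bundle_path;        // normal case: found in the app bundle
    }
    return "ggml-metal.metal";     // fallback: relative to the cwd
}
```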
Georgi Gerganov
ba231e8a6d
issues : change label from bug to bug-unconfirmed ( #3748 )
2023-10-28 15:35:26 +03:00
Georgi Gerganov
8a2f2fea29
convert : ignore tokens if their IDs are within [0, vocab_size) ( #3831 )
2023-10-28 06:25:15 -06:00
Kerfuffle
bd6d9e2059
llama : allow quantizing k-quants to fall back when tensor size incompatible ( #3747 )
...
* Allow quantizing k-quants to fall back when tensor size incompatible
* quantizing: Add warning when tensors were incompatible with k-quants
Clean up k-quants state passing a bit
2023-10-28 14:54:24 +03:00
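Note: k-quant formats pack weights in super-blocks of 256 values, so tensors whose row length is not a multiple of that cannot use them; the commit above falls back to a compatible format instead of aborting. An illustrative sketch of that decision, with placeholder names rather than the code from the commit:

```cpp
#include <cstdint>
#include <cstdio>

enum sketch_qtype { SKETCH_Q4_K, SKETCH_Q4_0 };  // stand-ins for real ggml_type values

// Fall back to a non-k-quant format when the row length is not a multiple of
// the k-quant super-block size, and warn about it.
static sketch_qtype pick_quant_type(int64_t row_size, sketch_qtype wanted) {
    const int64_t QK_K = 256;  // k-quant super-block size in ggml
    if (wanted == SKETCH_Q4_K && row_size % QK_K != 0) {
        std::fprintf(stderr,
                "warning: row size %lld is not a multiple of %lld, falling back to a non-k-quant type\n",
                (long long) row_size, (long long) QK_K);
        return SKETCH_Q4_0;
    }
    return wanted;
}
```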
Georgi Gerganov
ee1a0ec9cb
llama : add option for greedy sampling with probs ( #3813 )
...
* llama : add option for greedy sampling with probs
* llama : add comment about llama_sample_token_greedy() missing probs
* sampling : temp == 0.0 -> no probs, temp < 0.0 -> probs
2023-10-28 14:23:11 +03:00
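Note: the temperature convention spelled out above (temp == 0.0 means greedy without probabilities, temp < 0.0 means greedy with probabilities) can be sketched as the dispatch below; llama_sample_softmax and llama_sample_token_greedy are real llama.h calls of this era, everything else is a placeholder:

```cpp
#include "llama.h"

// Placeholder dispatch illustrating the temp convention from the commit above.
static llama_token sample_token_sketch(struct llama_context * ctx,
                                       llama_token_data_array * candidates,
                                       float temp) {
    if (temp < 0.0f) {
        // greedy, but run softmax first so the candidates carry probabilities
        llama_sample_softmax(ctx, candidates);
        return candidates->data[0].id;   // softmax sorts candidates, so [0] is the argmax
    }
    if (temp == 0.0f) {
        // plain greedy: most likely token, no probabilities computed
        return llama_sample_token_greedy(ctx, candidates);
    }
    // temp > 0.0f: the normal temperature-sampling path, omitted in this sketch
    return llama_sample_token_greedy(ctx, candidates);
}
```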
Concedo
20ef442c2a
fixed for smartcontext
2023-10-28 19:09:22 +08:00
Henk Poley
177461104b
common : print that one line of the syntax help *also* to standard output ( #3823 )
2023-10-28 13:16:33 +03:00
Concedo
6cf2b4c73b
MMQ optimizations (+1 squashed commits)
...
Squashed commits:
[d87de001] mmq optimization (+1 squashed commits)
Squashed commits:
[f1f67af8] still allow mmq
2023-10-28 17:57:46 +08:00
Georgi Gerganov
fdee152e4e
starcoder : add GPU offloading ( #3827 )
...
* starcoder : do not GPU split 1D bias tensors
* starcoder : offload layers to GPU
ggml-ci
2023-10-28 12:06:08 +03:00
Concedo
2ea3b567cf
Merge: Testing speed of tensor cores vs MMQ
2023-10-28 16:41:42 +08:00
Concedo
2fa1137890
updated lite
2023-10-28 14:43:15 +08:00