llama.cpp

Author	SHA1	Message	Date
brian khuu	3625a42061	convert-*.py: add heuristic to directory name fallback Also add source_url for huggingface url	2024-07-16 06:37:42 +10:00
brian khuu	39472a09da	convert-*.py: need to include self in per_model_weight_count_estimation()	2024-07-16 06:37:42 +10:00
brian khuu	54918ad14e	convert-*.py: refactor parameter weight class	2024-07-16 06:37:42 +10:00
brian khuu	32e80e094c	convert-*.py: base_model is actually in spec for model cards	2024-07-16 06:37:42 +10:00
brian khuu	4d5cd0670a	convert-*.py: use heuristics to parse _name_or_path	2024-07-16 06:37:42 +10:00
brian khuu	b0553f42da	convert-*.py: adjust help message	2024-07-16 06:37:42 +10:00
brian khuu	dd1571211e	convert-*.py: add quantized_by and enhance heuristics	2024-07-16 06:37:38 +10:00
brian khuu	5a86dfaa1c	convert-*.py: add general.organization to kv store	2024-07-16 06:36:03 +10:00
brian khuu	f7c20793b9	convert-*.py: enable --model-name direct metadata override	2024-07-16 06:36:03 +10:00
brian khuu	b1927eed82	convert-*.py: move per model weight estimation away from util back to main script plus some refactoring	2024-07-16 06:36:03 +10:00
brian khuu	684c604eca	convert-*.py: add datasets and language to KV store	2024-07-16 06:36:03 +10:00
brian khuu	0f1d50fab7	convert-*.py: add parameter size class	2024-07-16 06:36:03 +10:00
brian khuu	8f734083dd	convert-*.py: add base_version and add tags	2024-07-16 06:36:03 +10:00
brian khuu	b36e391b87	convert-*.py: parse model card in metadata util. Add license_link and license_name to kv store	2024-07-16 06:36:03 +10:00
brian khuu	5c263cb257	convert-*.py: encoding_scheme --> output_type	2024-07-16 06:36:03 +10:00
brian khuu	4d5f18a0e6	convert-*.py: metadata class moved to utility	2024-07-16 06:36:03 +10:00
brian khuu	916872f72f	convert-*.py: model card metadata	2024-07-16 06:36:03 +10:00
brian khuu	a42c2b7efc	convert-*.py: add basename and finetune metadata	2024-07-16 06:36:03 +10:00
brian khuu	dbb1b471e4	convert-*.py: add --get-outfile command and refactor	2024-07-16 06:36:03 +10:00
brian khuu	d3a936fd0e	convert-*.py: licence -> license	2024-07-16 06:36:03 +10:00
Xuan Son Nguyen	97bdd26eee	Refactor lora adapter support (#8332 ) * lora: load to device buft * add patch tensor function * correct tensor patch * llama_lora_adapter_apply * correct ggml_backend_tensor_copy * add llm_build_mm * fix auto merge * update based on review comments * add convert script * no more transpose A * add f16 convert * add metadata check * add sanity check * fix ftype * add requirements * fix requirements * fix outfile * conversion: only allow selected models * fix types * cuda : do not use dmmv if the tensor does not have enough cols * llama : lora fixes * do not disable mmap with lora Co-authored-by: slaren <slarengh@gmail.com> * llm_build_lora_mm_id * convert_lora : MoE LoRA conversion support * convert_lora : prefer safetensors, similarly to convert_hf * convert_hf : simplify modify_tensors for InternLM2 * convert_lora : lazy conversion * llama : load and use alpha from LoRA adapters * llama : use llm_build_lora_mm in most model graphs * auto scale * Revert "auto scale" This reverts commit `42415a4874`. * remove redundant params * Apply suggestions from code review Co-authored-by: slaren <slarengh@gmail.com> * change kv metadata * move add_type to __init__ * convert_hf : move add_type to main() * convert_lora : use the GGUFWriter from Model instead of overwriting it --------- Co-authored-by: slaren <slarengh@gmail.com> Co-authored-by: Francis Couture-Harpin <git@compilade.net>	2024-07-15 20:50:47 +02:00
Xuan Son Nguyen	4db8f60fe7	fix ci (#8494 )	2024-07-15 19:23:10 +02:00
Daniel Bevenius	8fac431b06	ggml : suppress unknown pragma 'GCC' on windows (#8460 ) This commit adds a macro guard to pragma GCC to avoid the following warning on windows: ```console C:\llama.cpp\ggml\src\ggml-aarch64.c(17,9): warning C4068: unknown pragma 'GCC' [C:\lama.cpp\build\ggml\src\ggml.vcxproj] ```	2024-07-15 15:48:17 +03:00
M-A	f17f39ff9c	server: update README.md with llama-server --help output [no ci] (#8472 ) The README.md had stale information. In particular, the --ctx-size "defaults to 512" confused me and I had to check the code to confirm this was false. Since the server is evolving rapidly, it's probably better to keep the source of truth in a single place (in the source) and generate the README.md based on that. Did: make llama-server ./llama-server --help > t.txt vimdiff t.txt examples/server/README.md I copied the content inside a backquote block. I would have preferred proper text but it would require a fair amount of surgery to make the current output compatible with markdown. A follow up could be to automate this process with a script. No functional change.	2024-07-15 15:04:56 +03:00
Georgi Gerganov	9104bc20ed	common : add --no-cont-batching arg (#6358 )	2024-07-15 14:54:58 +03:00
NikolaiLyssogor	fc690b018e	docs: fix links in development docs [no ci] (#8481 ) Fixes a few links within the repo that were broken in the reorganization of the documentation in #8325.	2024-07-15 14:46:39 +03:00
Meng, Hengyu	16bdfa42ac	[SYCL] add concat through dim 1/2 (#8483 ) * add concat through dim 1/2	2024-07-15 19:32:15 +08:00
Georgi Gerganov	3dfda05956	llama : de-duplicate deepseek2 norm	2024-07-15 14:10:39 +03:00
0cc4m	bda62d7999	Vulkan MMQ Fix (#8479 ) * Fix incoherence by adding missing LOAD_VEC_A parameter * Fix Vulkan op result checker build error	2024-07-15 09:38:52 +02:00
compilade	090fca7a07	pydantic : replace uses of __annotations__ with get_type_hints (#8474 ) * pydantic : replace uses of __annotations__ with get_type_hints * pydantic : fix Python 3.9 and 3.10 support	2024-07-14 19:51:21 -04:00
Georgi Gerganov	aaab2419ea	flake.lock: Update (#8475 ) Flake lock file updates: • Updated input 'nixpkgs': 'github:NixOS/nixpkgs/9f4128e00b0ae8ec65918efeba59db998750ead6?narHash=sha256-rwz8NJZV%2B387rnWpTYcXaRNvzUSnnF9aHONoJIYmiUQ%3D' (2024-07-03) → 'github:NixOS/nixpkgs/7e7c39ea35c5cdd002cd4588b03a3fb9ece6fad9?narHash=sha256-EYekUHJE2gxeo2pM/zM9Wlqw1Uw2XTJXOSAO79ksc4Y%3D' (2024-07-12) Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>	2024-07-14 08:54:02 -07:00
Georgi Gerganov	73cf442e7b	llama : fix Gemma-2 Query scaling factors (#8473 ) * 9B - query_pre_attn_scalar = 256 not 224 See `03e657582d` Gemma 9b should use 256 and not 224 (self.config.hidden_size // self.config.num_attention_heads) * llama : fix Gemma-2 Query scaling factor ggml-ci --------- Co-authored-by: Daniel Han <danielhanchen@gmail.com>	2024-07-14 14:05:09 +03:00
Brian	e236528e76	gguf_hash.py: Add sha256 (#8470 ) * gguf_hash.py: Add sha256 * gguf_hash.py: rename string UUIDv5 --> uuid * Apply suggestions from code review Co-authored-by: compilade <git@compilade.net> --------- Co-authored-by: compilade <git@compilade.net>	2024-07-14 16:47:14 +10:00
compilade	fa79495bb4	llama : fix pre-tokenization of non-special added tokens (#8228 ) * llama : fix mpt and olmo pre-tokenizer * llama : pre-tokenize non-special user-defined tokens first * llama : fix detection of control-like user-defined tokens * convert_hf : identify which user-defined tokens are control tokens Only used in _set_vocab_gpt2() for now. * convert_hf : identify more added control tokens for SPM tokenizers This makes Gemma and Gemma-2 tokenize pretty much EVERYTHING correctly, including HTML tags and consecutive spaces, but it unfortunately requires model re-conversion. There seems to be a weird behavior of the HF tokenizer for Gemma, which prefers to use the 16-space token over more lengthy space tokens, while using the SentencePiece tokenizer does not do this. (the implementation in llama.cpp has the same behavior as SentencePiece) * llama : fix wrong pre-tokenization of byte tokens * llama : fix Viking pre-tokenizer regex The order was previously wrong, which caused errors in some tests. * llama : fix command-r detokenization * convert_hf : reduce usages of the UNKNOWN token type * llama : add UNKNOWN tokens in the special tokens cache * convert_hf : reduce usages of UNKNOWN for InternLM2 This makes the changes from #8321 more consistent with the other changes made here. * test-tokenizer-random : reduce potential conflicts with #8379 * test-tokenizer-random : add a failing edge case for falcon	2024-07-13 23:35:10 -04:00
bandoti	17eb6aa8a9	vulkan : cmake integration (#8119 ) * Add Vulkan to CMake pkg * Add Sycl to CMake pkg * Add OpenMP to CMake pkg * Split generated shader file into separate translation unit * Add CMake target for Vulkan shaders * Update README.md * Add make target for Vulkan shaders * Use pkg-config to locate vulkan library * Add vulkan SDK dep to ubuntu-22-cmake-vulkan workflow * Clean up tabs * Move sudo to apt-key invocation * Forward GGML_EXTRA_LIBS to CMake config pkg * Update vulkan obj file paths * Add shaderc to nix pkg * Add python3 to Vulkan nix build * Link against ggml in cmake pkg * Remove Python dependency from Vulkan build * code review changes * Remove trailing newline * Add cflags from pkg-config to fix w64devkit build * Update README.md * Remove trailing whitespace * Update README.md * Remove trailing whitespace * Fix doc heading * Make glslc required Vulkan component * remove clblast from nix pkg	2024-07-13 18:12:39 +02:00
Georgi Gerganov	c917b67f06	metal : template-ify some of the kernels (#8447 ) ggml-ci	2024-07-13 18:32:33 +03:00
Georgi Gerganov	4e24cffd8c	server : handle content array in chat API (#8449 ) * server : handle content array in chat API * Update examples/server/utils.hpp Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> --------- Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>	2024-07-12 14:48:15 +03:00
Georgi Gerganov	6af51c0d96	main : print error on empty input (#8456 )	2024-07-12 14:48:04 +03:00
Daniel Bevenius	f53226245f	llama : suppress unary minus operator warning (#8448 ) This commit updates the _try_copy lambda and moves the unary minus operator to after the cast to int32_t. The motivation for this is that currently the following warning is generated on windows: ```console llama.cpp\src\llama.cpp(21147,30): warning C4146: unary minus operator applied to unsigned type, result still unsigned ```	2024-07-12 12:05:21 +03:00
Douglas Hanley	c3ebcfa148	server : ensure batches are either all embed or all completion (#8420 ) * make sure batches are all embed or all non-embed * non-embedding batch for sampled tokens; fix unused params warning	2024-07-12 11:14:12 +03:00
Armen Kaleshian	8a4441ea1a	docker : fix filename for convert-hf-to-gguf.py in tools.sh (#8441 ) Commit `b0a4699` changed the name of this script from convert-hf-to-gguf.py to convert_hf_to_gguf.py breaking how convert is called from within a Docker container.	2024-07-12 11:08:19 +03:00
Jiří Podivín	5aefbce27a	convert : remove fsep token from GPTRefactForCausalLM (#8237 ) The <filename> token used by Refact doesn't serve the same purpose as the <file_separator> from CodeGemma. Signed-off-by: Jiri Podivin <jpodivin@redhat.com>	2024-07-12 11:06:33 +03:00
Georgi Gerganov	71c1121d11	examples : sprintf -> snprintf (#8434 ) * examples : sprintf -> snprintf ggml-ci * examples : use sizeof() instead of hardcoded constants	2024-07-12 10:46:14 +03:00
Georgi Gerganov	370b1f7e7a	ggml : minor naming changes (#8433 ) * ggml : minor naming changes ggml-ci * ggml : use PRId64 [no ci] * ggml : revert FA K/Q names	2024-07-12 10:46:02 +03:00
Chen Xi	b549a1bbef	[SYCL] fix the mul_mat_id ut issues (#8427 ) * fix part of mul_mat_id * skip the bfloat 16 sycl ut Signed-off-by: Chen Xi <xi2chen@intel.com> --------- Signed-off-by: Chen Xi <xi2chen@intel.com> Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com> Co-authored-by: Chen Xi <xi2chen@intel.com>	2024-07-12 08:52:04 +08:00
Nicholai Tukanov	368645698a	ggml : add NVPL BLAS support (#8329 ) (#8425 ) * ggml : add NVPL BLAS support * ggml : replace `<BLASLIB>_ENABLE_CBLAS` with `GGML_BLAS_USE_<BLASLIB>` --------- Co-authored-by: ntukanov <ntukanov@nvidia.com>	2024-07-11 18:49:15 +02:00
Daniel Bevenius	b078c619aa	cuda : suppress 'noreturn' warn in no_device_code (#8414 ) * cuda : suppress 'noreturn' warn in no_device_code This commit adds a while(true) loop to the no_device_code function in common.cuh. This is done to suppress the warning: ```console /ggml/src/ggml-cuda/template-instances/../common.cuh:346:1: warning: function declared 'noreturn' should not return [-Winvalid-noreturn] 346 \| } \| ^ ``` The motivation for this is to reduce the number of warnings when compiling with GGML_HIPBLAS=ON. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> * squash! cuda : suppress 'noreturn' warn in no_device_code Update __trap macro instead of using a while loop to suppress the warning. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> --------- Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-07-11 17:53:42 +02:00
Johannes Gäßler	808aba3916	CUDA: optimize and refactor MMQ (#8416 ) * CUDA: optimize and refactor MMQ * explicit q8_1 memory layouts, add documentation	2024-07-11 16:47:47 +02:00
Georgi Gerganov	a977c11544	gitignore : deprecated binaries	2024-07-11 11:20:40 +03:00
compilade	9a55ffe6fb	tokenize : add --no-parse-special option (#8423 ) This should allow more easily explaining how parse_special affects tokenization.	2024-07-11 10:41:48 +03:00

3420 commits