llama.cpp

Author	SHA1	Message	Date
Cebtenzzre	389d2e6b9e	gguf : free tensors as they are written	2023-10-31 12:06:21 -04:00
Cebtenzzre	d97afcfc02	gguf : track writer state	2023-10-31 12:06:20 -04:00
Cebtenzzre	3fcdc9330a	gguf : cleanup tensor padding	2023-10-31 12:02:48 -04:00
Cebtenzzre	6df988d5f1	gguf : do not store defaults in class vars Making an assignment in a class outside of a method does not set the default value, it actually sets the attribute on the class itself. Instances of the class inherit these, but it's incorrect to expose these fields here.	2023-10-31 12:02:48 -04:00
Georgi Gerganov	207b51900e	ggml : move FP16 <-> FP32 code to ggml-impl.h (#3861 ) * ggml : move FP16 <-> FP32 stuff to ggml-impl.h ggml-ci * tests : fix ARM build * ggml : explicitly initialize deprecated type traits * ggml : add math.h to ggml-impl.h * ggml : remove duplicate static assert macros * ggml : prefix lookup tables with ggml_ ggml-ci * ggml-impl : move extern "C" to start of file	2023-10-30 19:19:15 +02:00
Kerfuffle	6e08281e58	Extend llama_kv_cache_seq_rm to allow matching any sequence (#3843 ) * Extend llama_kv_cache_seq_rm to allow matching any sequence * Replace llama_kv_cache_tokens_rm with llama_kv_cache_clear Use llama_kv_cache_clear for cache clearing Change calls to llama_kv_cache_tokens_rm that want to delete by position to use llama_kv_cache_seq_rm functionality	2023-10-29 11:31:40 -06:00
cebtenzzre	2046eb4345	make : remove unnecessary dependency on build-info.h (#3842 )	2023-10-29 18:33:47 +02:00
Georgi Gerganov	71a09da301	llama : fix kv shift bug (#3835 ) ggml-ci	2023-10-29 18:32:51 +02:00
Georgi Gerganov	d69d777c02	ggml : quantization refactoring (#3833 ) * ggml : factor all quantization code in ggml-quants ggml-ci * ggml-quants : fix Zig and Swift builds + quantize tool ggml-ci * quantize : --pure option for disabling k-quant mixtures --------- Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>	2023-10-29 18:32:28 +02:00
Erik Scholz	ff3bad83e2	flake : update flake.lock for newer transformers version + provide extra dev shell (#3797 ) * flake : update flake.lock for newer transformers version + provide extra dev shell with torch and transformers (for most convert-xxx.py scripts)	2023-10-28 16:41:07 +02:00
Aarni Koskela	82a6646e02	metal : try cwd for ggml-metal.metal if bundle lookup fails (#3793 ) * Try cwd for ggml-metal if bundle lookup fails When building with `-DBUILD_SHARED_LIBS=ON -DLLAMA_METAL=ON -DLLAMA_BUILD_SERVER=ON`, `server` would fail to load `ggml-metal.metal` because `[bundle pathForResource:...]` returns `nil`. In that case, fall back to `ggml-metal.metal` in the cwd instead of passing `null` as a path. Follows up on #1782 * Update ggml-metal.m --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-10-28 15:43:01 +03:00
Georgi Gerganov	ba231e8a6d	issues : change label from bug to bug-unconfirmed (#3748 )	2023-10-28 15:35:26 +03:00
Georgi Gerganov	8a2f2fea29	convert : ignore tokens if their IDs are within [0, vocab_size) (#3831 )	2023-10-28 06:25:15 -06:00
Kerfuffle	bd6d9e2059	llama : allow quantizing k-quants to fall back when tensor size incompatible (#3747 ) * Allow quantizing k-quants to fall back when tensor size incompatible * quantizing: Add warning when tensors were incompatible with k-quants Clean up k-quants state passing a bit	2023-10-28 14:54:24 +03:00
Georgi Gerganov	ee1a0ec9cb	llama : add option for greedy sampling with probs (#3813 ) * llama : add option for greedy sampling with probs * llama : add comment about llama_sample_token_greedy() missing probs * sampling : temp == 0.0 -> no probs, temp < 0.0 -> probs	2023-10-28 14:23:11 +03:00
Henk Poley	177461104b	common : print that one line of the syntax help also to standard output (#3823 )	2023-10-28 13:16:33 +03:00
Georgi Gerganov	fdee152e4e	starcoder : add GPU offloading (#3827 ) * starcoder : do not GPU split 1D bias tensors * starcoder : offload layers to GPU ggml-ci	2023-10-28 12:06:08 +03:00
Kerfuffle	41aee4df82	speculative : ensure draft and target model vocab matches (#3812 ) * speculative: Ensure draft and target model vocab matches * Tolerate small differences when checking dft vs tgt vocab	2023-10-28 00:40:07 +03:00
cebtenzzre	6d459cbfbe	llama : correctly report GGUFv3 format (#3818 )	2023-10-27 17:33:53 -04:00
Thibault Terrasson	c8d6a1f34a	simple : fix batch handling (#3803 )	2023-10-27 08:37:41 -06:00
Georgi Gerganov	2f9ec7e271	cuda : improve text-generation and batched decoding performance (#3776 ) * cuda : prints wip * cuda : new cublas gemm branch for multi-batch quantized src0 * cuda : add F32 sgemm branch * cuda : fine-tune >= VOLTA params + use MMQ only for small batches * cuda : remove duplicated cuBLAS GEMM code * cuda : add CUDA_USE_TENSOR_CORES and GGML_CUDA_FORCE_MMQ macros * build : add compile option to force use of MMQ kernels	2023-10-27 17:01:23 +03:00
Georgi Gerganov	34b2a5e1ee	server : do not release slot on image input (#3798 )	2023-10-26 22:54:17 +03:00
Georgi Gerganov	6961c4bd0b	batched-bench : print params at start	2023-10-25 10:26:27 +03:00
Georgi Gerganov	cc44877486	log : disable pid in log filenames	2023-10-25 10:09:16 +03:00
cebtenzzre	ad93962657	server : add parameter -tb N, --threads-batch N (#3584 ) (#3768 ) Co-authored-by: Michael Coppola <m18coppola@gmail.com> Co-authored-by: Michael Coppola <info@michaeljcoppola.com>	2023-10-24 23:10:43 +03:00
Georgi Gerganov	1717521cdb	server : do not block system prompt update (#3767 ) * server : do not block system prompt update * server : update state machine logic to process system prompts * server : minor	2023-10-24 23:08:20 +03:00
Georgi Gerganov	b2f7e04bd3	sync : ggml (conv ops + cuda MSVC fixes) (#3765 ) ggml-ci	2023-10-24 21:51:20 +03:00
John Smith	abd21fc99f	cmake : add missed dependencies (#3763 )	2023-10-24 20:48:45 +03:00
Georgi Gerganov	2b4ea35e56	cuda : add batched cuBLAS GEMM for faster attention (#3749 ) * cmake : add helper for faster CUDA builds * batched : add NGL arg * ggml : skip nops in compute_forward * cuda : minor indentation * cuda : batched cuBLAS GEMMs for src0 F16 and src1 F32 (attention ops) * Apply suggestions from code review These changes plus: ```c++ #define cublasGemmBatchedEx hipblasGemmBatchedEx ``` are needed to compile with ROCM. I haven't done performance testing, but it seems to work. I couldn't figure out how to propose a change for lines outside what the pull changed, also this is the first time trying to create a multi-part review so please forgive me if I mess something up. * cuda : add ROCm / hipBLAS cublasGemmBatchedEx define * cuda : add cublasGemmStridedBatchedEx for non-broadcasted cases * cuda : reduce mallocs in cublasGemmBatchedEx branch * cuda : add TODO for calling cublas from kernel + using mem pool --------- Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>	2023-10-24 16:48:37 +03:00
Galunid	daab3d7f45	Add more tokenizer tests (#3742 ) * Add more tokenizer tests * Add starcoder * Update test vocab files * Restrict bpe tokenizer tests to unicode planes * Update comment * Comment cosmetics * Remove bloom vocab/test	2023-10-24 09:17:17 +02:00
Georgi Gerganov	469c9addef	metal : handle ggml_scale for n%4 != 0 (close #3754 ) ggml-ci	2023-10-24 09:47:22 +03:00
Georgi Gerganov	e3932593d4	Revert "make : add optional CUDA_NATIVE_ARCH (#2482 )" This reverts commit `96981f37b1`. See: https://github.com/ggerganov/llama.cpp/pull/2482#issuecomment-1775975866	2023-10-23 23:46:05 +03:00
M. Yusuf Sarıgöz	9d02956443	issues : separate bug and enhancement template + no default title (#3748 )	2023-10-23 22:57:16 +03:00
Galunid	69a6735087	Update special token handling in conversion scripts for gpt2 derived tokenizers (#3746 ) We still have the heads up in `README.md` regarding `bpe` tokenizers and this patch is needed for - a couple of tokenizer tests - some more `special` and `non-special` added tokens handling (as far as I understand it) * Update special token handling * Add mpt	2023-10-23 21:46:00 +02:00
Marcus Dunn	5be6c803fa	llama : remove token functions with `context` args in favor of `model` (#3720 ) * added `llama_model_token_` variants to all the `llama_token_` functions. * added `LLAMA_API` * formatting Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * removed old `llama_token` functions * changed 3 more functions to take in model - `llama_token_get_text` - `llama_token_get_score` - `llama_token_get_type` * added back docs * fixed main.cpp * changed token functions to use new model variants * changed token functions to use new model variants --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-10-23 22:40:03 +03:00
Galunid	6336701c93	Fix baichuan convert script not detecting model (#3739 ) It seems nobody objects.	2023-10-23 17:47:03 +02:00
Alex	96981f37b1	make : add optional CUDA_NATIVE_ARCH (#2482 ) Use the environment variable `CUDA_NATIVE_ARCH` if present to set NVCC arch. Otherwise, use `native`.	2023-10-22 22:56:53 +03:00
Georgi Gerganov	438c2ca830	server : parallel decoding and multimodal (#3677 ) * implementing parallel decoding in server example * crash fixed * save dev progress * refactored sampling function * completion endpoint working * multiple client support * grammar + no stream completion * cached prompt support * chat.mjs support cached prompt + some fixes * server ui now support multiple clients * unused change reverted * fixed timings per slot * add context swap * add changes to README.md * llava multimodal integration * fixed tokens probs * add multimodal input - alfa * refactor code + remove unused comments + improved README.md * fix compilation errors with llvm * notify the user from server ui that multimodality is unavailable * some ci fixes * fix ci make build undefined ref errors * fix long prompt than ctx proposed in #3639 * fixed premature end due stop word * context shift fixed * fix llava implementation * sync README.md changes * readme change * update api like OpenAI * multimodal support enabled by default * fix make build errors * fix multiple clients * fix zig build * new sampling API * latest changes of sampling API * server : coding-style normalization * server : coding-style normalization (part 2) * server : remove beam-search functionality * server : bug fix in ingest_images n_tokens is incremented internally by llama_batch_add * server : use refs + use llama_batch_clear() * server : snake case * server : minor sync * added thread safe pipeline * server : batch has to be allocated for n_parallel sequences * server : no need for atomic int - already using mutex * server : logs + minor code style * server : fix multibyte handle in partial response (#3706) * fix image load + view image in chat * make : silence stb warnings * clip : link to ggml, not to llama * server : fix switch fallthrough * server : fix crash in Debug on macOS (I have no idea why this fixes it!?) * server : refactor ctx_sampling init + n_ctx + names * server : bug fix for prompt caching * Do not save/load image_data to localStorage * editorconfig : new line in index.html * server : completion requests remember slot_id * Update readme to document multimodal in server * server : minor style * Update readme to document multimodal in server * server : hide ctx_sampling->prev behind API (#3696) * server : apply fix from #3722 * server : fix slot reuse * server : add comment about changing slot_state to bool --------- Co-authored-by: FSSRepo <go778sgt@gmail.com> Co-authored-by: Damian Stewart <d@damianstewart.com> Co-authored-by: Steward Garcia <57494570+FSSRepo@users.noreply.github.com> Co-authored-by: Jhen-Jie Hong <iainst0409@gmail.com> Co-authored-by: M. Yusuf Sarıgöz <yusufsarigoz@gmail.com>	2023-10-22 22:53:08 +03:00
goerch	9e70cc0322	Add test for MPT tokenization (#3728 ) * Add test for MPT tokenization * Revert code motion * Remove unnecessary restriction in test case * Clarify logic in conversion	2023-10-22 21:21:42 +02:00
Ian Scrivener	5a42a5f8e8	readme : remove unsupported node.js library (#3703 ) - https://github.com/Atome-FE/llama-node is quite out of date - doesn't support recent/current llama.cpp functionality	2023-10-22 21:16:43 +03:00
Kerfuffle	a5e7dbd614	llama : validate special token ids are in range when loading GGUF model (#3635 ) * Add validation for special token ids to llama.cpp Small optimization for llama_byte_to_token SPM mode * Fix BPE newline check, only I could break something so simple * Killll meeeeee * Account for GGUF_KEY_KEY only setting when the key exists * Minor code cleanups. * Fix convert.py error msg when added tokens are out of range * Make gguf SpecialVocab vocab size-aware Update conversion scripts accordingly * Avoid a string copy Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-10-22 21:14:56 +03:00
vvhg1	d3956aea53	main : escape prompt for cfg_negative_prompt and consecutive inputs in main with interactive (#3623 ) * infill tokens correction * server infill tokens correction * removing any leading whitespace from infill suffix and removing leading space token from suffix when params.escape * removing any leading whitespace from infill suffix and removing leading space token from suffix when params.escape * only rm when params.escape, rm space if possible which is added back or rm added space token * only rm when params.escape, rm space if possible which is added back or rm added space token * Revert "only rm when params.escape, rm space if possible which is added back or rm added space token" This reverts commit `63ba0b621f`. * fix interactive prompt escaping and fix server infill leading space handling * rm unnecessary bool check * process escapes for neg prompt and interactive consec prompts * removed unnecessary static string escape	2023-10-22 21:09:51 +03:00
Georgi Gerganov	22c69a2794	batched : add len CLI argument	2023-10-22 08:37:20 +03:00
shibe2	465219b914	CLBlast: Add outer loops over src0 for broadcasting in mulmat Reduce repeated dequantization of the same data.	2023-10-20 22:30:52 +04:00
Georgi Gerganov	d1031cf49c	sampling : refactor init to use llama_sampling_params (#3696 ) * sampling : refactor init to use llama_sampling_params * llama : combine repetition, frequency and presence penalties in 1 call * examples : remove embd-input and gptneox-wip * sampling : rename penalty params + reduce size of "prev" vector * sampling : add llama_sampling_print helper * sampling : hide prev behind API and apply #3661 ggml-ci	2023-10-20 21:07:23 +03:00
Qin Yue Chen	8cf19d60dc	gguf : support big endian platform (#3552 ) * check whether platform is s390x, if yes -> do not import immintrin.h * support s390x big endian * support --bigendian option for s390x 1. verified with baichuan7b-chat with float 16 on s390x 2. verified with baichuan7b-chat 3. verified with chinese-alpaca-2-13b-f16 * update format based on editor-config checker result * Update convert-baichuan-hf-to-gguf.py * 1. check in ggml.c if endianness does not match 2. update GGUF version 3. change get_pack_prefix to property 4. update information log * always use "GGUF" as beginning of GGUF file * Compare "GGUF" with file header char by char 1. Set GGUF_MAGIC to "GGUF" string instead of int value 2. Compare "GGUF" char by char to ensure its byte order 3. Move bytes swap code from convert.py to gguf.py write_tensor_data --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-10-20 14:19:40 +03:00
Georgi Gerganov	a0edf73bda	server : fix uninitialized sampling context (close #3685 )	2023-10-20 13:06:10 +03:00
Herman Semenov	f439e506e8	ggml : fix rope + llama minor optimizations (#3560 ) * Minor fixes and fixed memleak * Using const auto references in range-based loop C++17	2023-10-20 13:02:12 +03:00
cebtenzzre	e78f3ef24a	convert : restore compat with old Falcon models (#3680 )	2023-10-20 08:32:08 +03:00
M. Yusuf Sarıgöz	f3b25e4043	multimodal : add BakLLaVA conversion support (#3682 )	2023-10-19 19:40:41 +03:00
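
Note on commit 6df988d5f1: its message points out that an assignment in a class body sets an attribute on the class itself, which instances then inherit, rather than giving each instance its own default. The Python sketch below illustrates that behavior in isolation. The class names `SharedDefaults` and `PerInstance` and the `tensors` field are hypothetical, chosen only for illustration; this is not the code from the gguf writer.

```python
class SharedDefaults:
    # Assigned in the class body: this list is an attribute of the class,
    # so every instance that does not override it sees the same object.
    tensors = []


class PerInstance:
    def __init__(self):
        # Assigned in __init__: each instance gets its own fresh list.
        self.tensors = []


a, b = SharedDefaults(), SharedDefaults()
a.tensors.append("tok_embd")   # mutates the list stored on the class
print(b.tensors)               # b sees the change made through a

c, d = PerInstance(), PerInstance()
c.tensors.append("tok_embd")   # mutates only c's own list
print(d.tensors)               # d is unaffected
```

Mutable values make the sharing most visible, but the same lookup rule applies to any class-level assignment: the name lives on the class until an instance assigns its own attribute of the same name.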


1450 commits