Commit graph

1461 commits

Author SHA1 Message Date
Jared Van Bortel
fefc3db527 address review comments 2023-11-05 16:24:48 -05:00
Galunid
781bc54986 Move everything to convert-hf-to-gguf.py 2023-11-05 08:42:11 +01:00
Galunid
f7de892ee5 Move util to gguf-py/gguf 2023-11-05 00:43:56 +01:00
Galunid
087f88cc15 Rename convert-generic -> convert-hf-to-gguf 2023-11-05 00:37:00 +01:00
Galunid
2120195bb1 Yarn rope for baichuan 2023-11-04 23:15:41 +01:00
Galunid
e64f4de189 Revert "Remove 'old' conversion scripts" - needed for testing
This reverts commit f4b9a7ea02.
2023-11-04 23:10:39 +01:00
Galunid
fd30850576 Add big endian support 2023-11-04 23:01:38 +01:00
Galunid
03c9683eb7 Restore support for RWForCausalLM 2023-11-04 20:43:29 +01:00
cebtenzzre
007be85087 model.py : add missing future import 2023-11-02 12:08:44 -04:00
cebtenzzre
e9abcc9c7c fix linter complaints 2023-11-02 00:06:32 -04:00
cebtenzzre
66ccd62102 sort imports 2023-11-01 23:26:28 -04:00
cebtenzzre
8f31dc54ec fix mypy errors 2023-11-01 23:24:46 -04:00
Galunid
4fdd7cdf2b Review fixes, persimmon fixes 2023-11-01 02:32:49 +01:00
Galunid
3ec89dcc69 Use 'IntEnum' instead of 'Enum' 2023-10-31 22:23:26 +01:00
Galunid
f4b9a7ea02 Remove 'old' conversion scripts 2023-10-31 16:27:06 +01:00
Galunid
235acc18cd Small refactor 2023-10-31 16:23:53 +01:00
Galunid
c94df09732 Rework tokenizer handling 2023-10-31 16:11:08 +01:00
Galunid
b2ba44eab2 Flake8 fixes 2023-10-31 15:38:24 +01:00
Galunid
dc3115f2a3 Add another alias to n_layers 2023-10-31 04:20:51 +01:00
Galunid
0743f7a900 Fix variable 2023-10-31 03:52:52 +01:00
Galunid
b9c664ab2f Woops 2023-10-31 03:42:55 +01:00
Galunid
6f6856c6ea [Untested] Initial Persimmon support 2023-10-31 03:27:04 +01:00
Galunid
94ba1db24a Add Starcoder and Refact 2023-10-31 03:12:25 +01:00
Galunid
0afa75a9a2 Add Falcon support 2023-10-31 02:57:37 +01:00
Galunid
3bb9844de9 Get rid of dumb print 2023-10-31 01:54:24 +01:00
Galunid
08918b700e MPT conversion fix 2023-10-31 01:52:55 +01:00
Galunid
443f7d586e Call add_tensor before write_* functions 2023-10-29 20:00:54 +01:00
Galunid
550b925af2 Missing variable 2023-10-29 02:06:41 +01:00
Galunid
989db34149 Missing variable 2023-10-29 02:05:28 +01:00
Galunid
8618b4e74c Add [UNTESTED] Baichuan support 2023-10-29 01:38:35 +02:00
Galunid
0ff237105d Make gguf_writer member of Model, rework tokenizer export 2023-10-29 00:33:05 +02:00
Galunid
22201248a0 Remove comments 2023-10-27 02:05:27 +02:00
Galunid
4823b9bdcb Initial generic convert script 2023-10-26 15:43:19 +02:00
Georgi Gerganov
6961c4bd0b batched-bench : print params at start 2023-10-25 10:26:27 +03:00
Georgi Gerganov
cc44877486 log : disable pid in log filenames 2023-10-25 10:09:16 +03:00
cebtenzzre
ad93962657 server : add parameter -tb N, --threads-batch N (#3584) (#3768)
Co-authored-by: Michael Coppola <m18coppola@gmail.com>
Co-authored-by: Michael Coppola <info@michaeljcoppola.com>
2023-10-24 23:10:43 +03:00
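As a rough illustration of the `-tb N, --threads-batch N` option added above, the sketch below shows how such a flag could be parsed; the struct and field names are assumptions for illustration, not the server's actual parameter code.

```c++
// Hedged sketch of parsing a -tb / --threads-batch option alongside -t / --threads.
// The struct name and fields below are illustrative assumptions.
#include <cstdlib>
#include <string>

struct server_params_sketch {
    int n_threads       = 4;  // threads used for generation (-t)
    int n_threads_batch = 4;  // threads used for batch / prompt processing (-tb)
};

static void parse_args_sketch(int argc, char ** argv, server_params_sketch & params) {
    for (int i = 1; i < argc; ++i) {
        const std::string arg = argv[i];
        if ((arg == "-t" || arg == "--threads") && i + 1 < argc) {
            params.n_threads = std::atoi(argv[++i]);
        } else if ((arg == "-tb" || arg == "--threads-batch") && i + 1 < argc) {
            params.n_threads_batch = std::atoi(argv[++i]);
        }
    }
}
```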
Georgi Gerganov
1717521cdb server : do not block system prompt update (#3767)
* server : do not block system prompt update

* server : update state machine logic to process system prompts

* server : minor
2023-10-24 23:08:20 +03:00
Georgi Gerganov
b2f7e04bd3 sync : ggml (conv ops + cuda MSVC fixes) (#3765)
ggml-ci
2023-10-24 21:51:20 +03:00
John Smith
abd21fc99f cmake : add missed dependencies (#3763) 2023-10-24 20:48:45 +03:00
Georgi Gerganov
2b4ea35e56 cuda : add batched cuBLAS GEMM for faster attention (#3749)
* cmake : add helper for faster CUDA builds

* batched : add NGL arg

* ggml : skip nops in compute_forward

* cuda : minor indentation

* cuda : batched cuBLAS GEMMs for src0 F16 and src1 F32 (attention ops)

* Apply suggestions from code review

These changes plus:

```c++
#define cublasGemmBatchedEx hipblasGemmBatchedEx
```

are needed to compile with ROCm. I haven't done performance testing, but it seems to work.

I couldn't figure out how to propose a change for lines outside what the pull changed; also, this is the first time I'm trying to create a multi-part review, so please forgive me if I mess something up.

* cuda : add ROCm / hipBLAS cublasGemmBatchedEx define

* cuda : add cublasGemmStridedBatchedEx for non-broadcasted cases

* cuda : reduce mallocs in cublasGemmBatchedEx branch

* cuda : add TODO for calling cublas from kernel + using mem pool

---------

Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
2023-10-24 16:48:37 +03:00
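For context on the ROCm note in the review comment above, a minimal compatibility shim mapping the batched cuBLAS entry points to their hipBLAS counterparts might look like the sketch below; the guard macro and include path are assumptions and may differ from the repository's actual headers.

```c++
// Sketch of a cuBLAS -> hipBLAS compatibility shim for ROCm builds, as described in the
// review comment above. Guard macro and include path are illustrative assumptions.
#if defined(GGML_USE_HIPBLAS)
#include <hipblas/hipblas.h>                 // include path varies across ROCm versions
#define cublasGemmBatchedEx        hipblasGemmBatchedEx
#define cublasGemmStridedBatchedEx hipblasGemmStridedBatchedEx
#else
#include <cublas_v2.h>
#endif
```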
Galunid
daab3d7f45 Add more tokenizer tests (#3742)
* Add more tokenizer tests

* Add starcoder

* Update test vocab files

* Restrict bpe tokenizer tests to unicode planes

* Update comment

* Comment cosmetics

* Remove bloom vocab/test
2023-10-24 09:17:17 +02:00
Georgi Gerganov
469c9addef metal : handle ggml_scale for n%4 != 0 (close #3754)
ggml-ci
2023-10-24 09:47:22 +03:00
Georgi Gerganov
e3932593d4 Revert "make : add optional CUDA_NATIVE_ARCH (#2482)"
This reverts commit 96981f37b1.

See:

https://github.com/ggerganov/llama.cpp/pull/2482#issuecomment-1775975866
2023-10-23 23:46:05 +03:00
M. Yusuf Sarıgöz
9d02956443 issues : separate bug and enhancement template + no default title (#3748) 2023-10-23 22:57:16 +03:00
Galunid
69a6735087 Update special token handling in conversion scripts for gpt2 derived tokenizers (#3746)
We still have the heads-up in `README.md` regarding `bpe` tokenizers, and this patch is needed for:

- a couple of tokenizer tests
- some more handling of `special` and `non-special` added tokens (as far as I understand it)

* Update special token handling

* Add mpt
2023-10-23 21:46:00 +02:00
Marcus Dunn
5be6c803fa llama : remove token functions with context args in favor of model (#3720)
* added `llama_model_token_*` variants to all the `llama_token_*` functions.

* added `LLAMA_API`

* formatting

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* removed old `llama_token` functions

* changed 3 more functions to take in model

- `llama_token_get_text`
- `llama_token_get_score`
- `llama_token_get_type`

* added back docs

* fixed main.cpp

* changed token functions to use new model variants

* changed token functions to use new model variants

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-10-23 22:40:03 +03:00
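As a small illustration of the API change described in this entry, the model-based token accessors mentioned above would be used roughly like this (a minimal sketch, assuming a `llama.h` that declares these functions with a `llama_model *` parameter and an already-loaded `model`):

```c++
// Minimal sketch of the model-based token accessors; assumes llama.h declares these
// functions taking a llama_model * instead of a llama_context *.
#include "llama.h"
#include <cstdio>

static void print_token_info(const llama_model * model, llama_token token) {
    const char * text  = llama_token_get_text(model, token);   // was: llama_token_get_text(ctx, token)
    const float  score = llama_token_get_score(model, token);  // was: llama_token_get_score(ctx, token)
    printf("token %d: '%s' (score %.2f)\n", token, text, score);
}
```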
Galunid
6336701c93 Fix baichuan convert script not detecting model (#3739)
It seems nobody objects.
2023-10-23 17:47:03 +02:00
Alex
96981f37b1 make : add optional CUDA_NATIVE_ARCH (#2482)
Use the environment variable `CUDA_NATIVE_ARCH` if present to set NVCC arch. Otherwise, use `native`.
2023-10-22 22:56:53 +03:00
Georgi Gerganov
438c2ca830 server : parallel decoding and multimodal (#3677)
* implementing parallel decoding in server example

* crash fixed

* save dev progress

* refactored sampling function

* completion endpoint working

* multiple client support

* grammar + no stream completion

* cached prompt support

* chat.mjs support cached prompt + some fixes

* server ui now support multiple clients

* unused change reverted

* fixed timings per slot

* add context swap

* add changes to README.md

* llava multimodal integration

* fixed tokens probs

* add multimodal input - alpha

* refactor code + remove unused comments + improved README.md

* fix compilation errors with llvm

* notify the user from the server UI that multimodality is unavailable

* some ci fixes

* fix ci make build undefined ref errors

* fix prompts longer than ctx, as proposed in #3639

* fixed premature end due to stop word

* context shift fixed

* fix llava implementation

* sync README.md changes

* readme change

* update api like OpenAI

* multimodal support enabled by default

* fix make build errors

* fix multiple clients

* fix zig build

* new sampling API

* latest changes of sampling API

* server : coding-style normalization

* server : coding-style normalization (part 2)

* server : remove beam-search functionality

* server : bug fix in ingest_images

n_tokens is incremented internally by llama_batch_add

* server : use refs + use llama_batch_clear()

* server : snake case

* server : minor sync

* added thread safe pipeline

* server : batch has to be allocated for n_parallel sequences

* server : no need for atomic int - already using mutex

* server : logs + minor code style

* server : fix multibyte handle in partial response (#3706)

* fix image load + view image in chat

* make : silence stb warnings

* clip : link to ggml, not to llama

* server : fix switch fallthrough

* server : fix crash in Debug on macOS (I have no idea why this fixes it!?)

* server : refactor ctx_sampling init + n_ctx + names

* server : bug fix for prompt caching

* Do not save/load image_data to localStorage

* editorconfig : new line in index.html

* server : completion requests remember slot_id

* Update readme to document multimodal in server

* server : minor style

* Update readme to document multimodal in server

* server : hide ctx_sampling->prev behind API (#3696)

* server : apply fix from #3722

* server : fix slot reuse

* server : add comment about changing slot_state to bool

---------

Co-authored-by: FSSRepo <go778sgt@gmail.com>
Co-authored-by: Damian Stewart <d@damianstewart.com>
Co-authored-by: Steward Garcia <57494570+FSSRepo@users.noreply.github.com>
Co-authored-by: Jhen-Jie Hong <iainst0409@gmail.com>
Co-authored-by: M. Yusuf Sarıgöz <yusufsarigoz@gmail.com>
2023-10-22 22:53:08 +03:00
goerch
9e70cc0322 Add test for MPT tokenization (#3728)
* Add test for MPT tokenization

* Revert code motion

* Remove unnecessary restriction in test case

* Clarify logic in conversion
2023-10-22 21:21:42 +02:00