llama.cpp

Author	SHA1	Message	Date
Cuong Trinh Manh	97bbca6e85	cmake : fix ld warning duplicate libraries libllama.a (#4671 ) * fix "ld: warning: ignoring duplicate libraries: '../libllama.a'" * fix warning in example.	2023-12-29 16:39:15 +02:00
Justine Tunney	4af4801566	llava-cli : refactor to use sampling library (#4669 ) This change makes it possible to use flags like `--grammar` when using the `llava-cli` program. The rest is just code cleanup deleting a long standing TODO comment. This change also ensures that logging information is emitted to stderr which helps the `llava-cli` command be more friendly to shell scripts. See Mozilla-Ocho/llamafile@1cd334f	2023-12-29 16:38:38 +02:00
Justine Tunney	db49ff8ed7	server : replace sleep with condition variables (#4673 ) The server currently schedules tasks using a sleep(5ms) busy loop. This adds unnecessary latency since most sleep implementations do a round up to the system scheduling quantum (usually 10ms). Other libc sleep impls spin for smaller time intervals which results in the server's busy loop consuming all available cpu. Having the explicit notify() / wait() code also helps aid in the readability of the server code. See mozilla-Ocho/llamafile@711344b	2023-12-29 16:24:12 +02:00
SakuraUmi	60f55e888c	server : fix OpenAI server sampling w.r.t. penalty. (#4675 )	2023-12-29 16:22:44 +02:00
Karthik Sethuraman	b93edd22f5	server : allow to generate multimodal embeddings (#4681 )	2023-12-29 16:22:10 +02:00
andrijdavid	82d6eab224	main-cmake-pkg : fix build issue (#4665 ) * Fix main-cmake-pkg compilation * Use glob to load common files * cmake : fix trailing whitespace --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-12-29 16:18:20 +02:00
Peter Sugihara	afd997ab60	llama.swiftui : fix infinite loop, output timings, buff UI (#4674 ) * fix infinite loop * slight UI simplification, clearer UX * clearer UI text, add timings to completion log	2023-12-29 15:58:56 +02:00
Georgi Gerganov	c8255f8a6b	scripts : print list of sync commits	2023-12-29 15:12:35 +02:00
Tamotsu Takahashi	441f51dca0	ci : build with CLBlast + ggml-opencl use GGML_API (whisper/1576) * Build with CLBlast * Declare GGML_API After rebasing, examples/talk-llama failed: "D:\a\whisper.cpp\whisper.cpp\build\ALL_BUILD.vcxproj" (build target) (1) -> "D:\a\whisper.cpp\whisper.cpp\build\examples\talk-llama\talk-llama.vcxproj" (default target) (14) -> (Link target) -> llama.obj : error LNK2019: unresolved external symbol ggml_cl_free_data referenced in function "public: __cdecl llama_model::~llama_model(void)" (??1llama_model@@QEAA@XZ) [D:\a\whisper.cpp\whisper.cpp\build\examples\talk-llama\talk-llama.vcxproj] llama.obj : error LNK2019: unresolved external symbol ggml_cl_transform_tensor referenced in function "public: void __cdecl llama_model_loader::load_all_data(struct ggml_context ,void (__cdecl)(float,void ),void ,struct llama_mlock *)" (?load_all_data@llama_model_loader@@QEAAXPEAUggml_context@@P6AXMPEAX@Z1PEAUllama_mlock@@@Z) [D:\a\whisper.cpp\whisper.cpp\build\examples\talk-llama\talk-llama.vcxproj] D:\a\whisper.cpp\whisper.cpp\build\bin\Release\talk-llama.exe : fatal error LNK1120: 2 unresolved externals [D:\a\whisper.cpp\whisper.cpp\build\examples\talk-llama\talk-llama.vcxproj]	2023-12-29 15:11:53 +02:00
Georgi Gerganov	38b3de4658	sync : ggml	2023-12-29 14:56:41 +02:00
bssrdf	afc8c19291	ggml : fix some mul mat cases + add tests for src1 F16 (ggml/669) * fixed mul-mat error for old GPUs * style fixes * add mul mat src1 f16 test cases, fix more cases ggml-ci --------- Co-authored-by: bssrdf <bssrdf@gmail.com> Co-authored-by: slaren <slarengh@gmail.com>	2023-12-29 14:54:19 +02:00
Georgi Gerganov	ca38b8d334	scripts : do not sync commits from this repo	2023-12-29 14:54:05 +02:00
Justine Tunney	65e5f6dadb	Fix OpenAI server sampling w.r.t. temp and seed (#4668 ) The default values for tfs_z and typical_p were being set to zero, which caused the token candidates array to get shrunk down to one element thus preventing any sampling. Note this only applies to OpenAI API compatible HTTP server requests. The solution is to use the default values that OpenAI documents, as well as ensuring we use the llama.cpp defaults for the rest. I've tested this change still ensures deterministic output by default. If a "temperature" greater than 0 is explicitly passed, then output is unique each time. If "seed" is specified in addition to "temperature" then the output becomes deterministic once more. See mozilla-Ocho/llamafile#117 See mozilla-Ocho/llamafile@9e4bf29	2023-12-28 15:20:00 -04:00
Concedo	63b65efb78	added tooltips for all items in the GUI launcher	2023-12-28 23:08:57 +08:00
manikbhandari	ea5497df5d	gpt2 : Add gpt2 architecture integration (#4555 )	2023-12-28 15:03:57 +01:00
Concedo	ec46661a32	wip adding tooltips	2023-12-28 15:54:22 +08:00
Nexesenex	cf360f3e62	Update expose.cpp '#include <cstdint> (#586 )	2023-12-28 15:01:22 +08:00
Concedo	ba77e916ef	added missing parameters for United class.py	2023-12-28 14:07:26 +08:00
Concedo	5e59112de8	prevent other calls when uninitialized	2023-12-28 12:04:53 +08:00
Concedo	2d5d82e915	allocate gpt_params on heap instead to avoid rare segfault	2023-12-28 11:48:21 +08:00
Nam D. Tran	f6793491b5	llama : add AWQ for llama, llama2, mpt, and mistral models (#4593 ) * update: awq support llama-7b model * update: change order * update: benchmark results for llama2-7b * update: mistral 7b v1 benchmark * update: support 4 models * fix: Readme * update: ready for PR * update: readme * fix: readme * update: change order import * black * format code * update: work for bot mpt and awqmpt * update: readme * Rename to llm_build_ffn_mpt_awq * Formatted other files * Fixed params count * fix: remove code * update: more detail for mpt * fix: readme * fix: readme * update: change folder architecture * fix: common.cpp * fix: readme * fix: remove ggml_repeat * update: cicd * update: cicd * uppdate: remove use_awq arg * update: readme * llama : adapt plamo to new ffn ggml-ci --------- Co-authored-by: Trần Đức Nam <v.namtd12@vinai.io> Co-authored-by: Le Hoang Anh <v.anhlh33@vinai.io> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-12-27 17:39:45 +02:00
Daniel Bevenius	879b690a9e	finetune : fix output formatting in print_params (#4653 ) This commit fixes the output formatting in the print_params function which currently looks like this: ```console print_params: n_vocab: 32000 print_params: n_ctx: 128 print_params: n_embd: 4096 print_params: n_ff: 11008 print_params: n_head: 32 print_params: n_head_kv: 32 print_params: n_layer: 32 print_params: norm_rms_eps : 0.000010 print_params: rope_freq_base : 10000.000000 print_params: rope_freq_scale : 1.000000 ``` With this commit the output will look like this: ```console print_params: n_vocab : 32000 print_params: n_ctx : 128 print_params: n_embd : 4096 print_params: n_ff : 11008 print_params: n_head : 32 print_params: n_head_kv : 32 print_params: n_layer : 32 print_params: norm_rms_eps : 0.000010 print_params: rope_freq_base : 10000.000000 print_params: rope_freq_scale : 1.000000 ``` Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2023-12-27 16:16:55 +02:00
Concedo	69ab1bf2f8	Merge branch 'master' into concedo_experimental # Conflicts: # README.md	2023-12-27 21:43:46 +08:00
Concedo	5b2d93a1f8	updated lite and colab, added logit bias support to lite	2023-12-27 21:32:18 +08:00
Concedo	4d6d967c10	silence autoplay for colab	2023-12-27 19:13:34 +08:00
Georgi Gerganov	b47879b0dd	scripts : add sync-ggml-am.sh	2023-12-27 11:44:22 +02:00
Georgi Gerganov	951010fa53	ggml : fix dot product for ARM (#4630 ) ggml-ci	2023-12-27 11:02:13 +02:00
wonjun Jang	f56d6077d0	Add byte token type when tokenizer.model is not exists (#4641 ) * Add byte token type to hf format * remove unused variable	2023-12-27 17:37:25 +09:00
slaren	dc68f0054c	cuda : fix vmm pool with multi GPU (#4620 ) * cuda : fix vmm pool with multi GPU * hip * use recommended granularity instead of minimum * better error checking * fix mixtral * use cudaMemcpy3DPeerAsync * use cuda_pool_alloc in ggml_cuda_op_mul_mat * consolidate error checking in ggml_cuda_set_device * remove unnecessary inlines ggml-ci * style fixes * only use vmm for the main device * fix scratch buffer size, re-enable vmm pool for all devices * remove unnecessary check id != g_main_device	2023-12-26 21:23:59 +01:00
DebuggingLife46	e733a9e425	Add logit_bias to the OpenAI api (#577 ) * Add logit_bias to the OpenAI api * Cleanup and refactor, test in swagger. --------- Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>	2023-12-27 00:26:19 +08:00
WillCorticesAI	de8e496437	Update comment for AdamW implementation reference. (#4604 ) Co-authored-by: Will Findley <findley@gmail.com>	2023-12-26 11:42:08 +01:00
FantasyGmm	77465dad48	Fix new CUDA10 compilation errors (#4635 )	2023-12-26 11:38:36 +01:00
henk717	5006b23099	CUDA 11.4 for Github CI (#582 ) * Downgrade CUDA to 11.4 This helps the binary be smaller and adds K80 support, the manual compiles we did already had this. * Update kcpp-build-release-win-cuda.yaml * Update kcpp-build-release-win-cuda.yaml * Update kcpp-build-release-win-cuda.yaml * Update kcpp-build-release-win-cuda.yaml * Update kcpp-build-release-win-cuda.yaml * Update kcpp-build-release-win-cuda.yaml * Restore concedo_experimental	2023-12-26 11:23:43 +08:00
Paul Tsochantaris	a206137f92	Adding Emeltal reference to UI list (#4629 )	2023-12-25 18:09:53 +02:00
Concedo	c2d87b6545	increase multiuser default	2023-12-25 23:49:45 +08:00
Concedo	78a9d206d3	randomize horde genkey	2023-12-25 22:47:21 +08:00
Concedo	cc64f2cad1	Merge branch 'master' into concedo_experimental # Conflicts: # .github/ISSUE_TEMPLATE/bug.md # Makefile # README.md # ggml-cuda.cu # tests/test-grad0.cpp	2023-12-25 18:47:21 +08:00
Concedo	293395e0f5	Merge commit '708e179e85' into concedo_experimental # Conflicts: # .github/workflows/docker.yml	2023-12-25 16:48:15 +08:00
slaren	b9f47952ff	simplify bug issue template (#4623 )	2023-12-24 22:01:12 +02:00
Shintarou Okada	753be377b6	llama : add PLaMo model (#3557 ) * add plamo mock * add tensor loading * plamo convert * update norm * able to compile * fix norm_rms_eps hparam * runnable * use inp_pos * seems ok * update kqv code * remove develop code * update README * shuffle attn_q.weight and attn_output.weight for broadcasting * remove plamo_llm_build_kqv and use llm_build_kqv * fix style * update * llama : remove obsolete KQ_scale * plamo : fix tensor names for correct GPU offload --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-12-24 15:35:49 +02:00
slaren	5bf3953d7e	cuda : improve cuda pool efficiency using virtual memory (#4606 ) * cuda : improve cuda pool efficiency using virtual memory * fix mixtral * fix cmake build * check for vmm support, disable for hip ggml-ci * fix hip build * clarify granularity * move all caps to g_device_caps * refactor error checking * add cuda_pool_alloc, refactor most pool allocations ggml-ci * fix hip build * CUBLAS_TF32_TENSOR_OP_MATH is not a macro * more hip crap * llama : fix msvc warnings * ggml : fix msvc warnings * minor * minor * cuda : fallback to CPU on host buffer alloc fail * Update ggml-cuda.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml-cuda.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * ensure allocations are always aligned * act_size -> actual_size --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2023-12-24 14:34:22 +01:00
Concedo	bd0d9039ec	better approach to multiuser check	2023-12-24 20:03:33 +08:00
Concedo	bc24c9334c	prevent prompt leakage during usage of check endpoint when genkey is provided in multiuser mode	2023-12-24 17:08:43 +08:00
slaren	708e179e85	fallback to CPU buffer if host buffer alloc fails (#4610 )	2023-12-23 16:10:51 +01:00
Samuel Maynard	925e5584a0	ci(docker): fix tags in "Build and push docker image (tagged)" (#4603 )	2023-12-23 11:35:55 +02:00
Alexey Parfenov	6123979952	server : allow to specify custom prompt for penalty calculation (#3727 )	2023-12-23 11:31:49 +02:00
kalomaze	b9ec82d262	grammar : check the full vocab only if necessary (opt) (#4306 ) * Check the full vocab for grammar only if necessary * Fix missing logit restoration step (?) Does this matter, actually? * Fix whitespace / formatting * Adjust comment * Didn't mean to push test gbnf * Split sampling into the helper function (?) And also revert the changes made to the header * common : fix final newline --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-12-23 11:27:07 +02:00
Johannes Gäßler	e0a4002273	CUDA: fixed row rounding for 0 tensor splits (#4594 )	2023-12-23 09:16:33 +01:00
Concedo	71a5afaab5	fixed incorrect localflag	2023-12-23 11:00:58 +08:00
Concedo	4a8308b1c8	Merge branch 'master' into concedo_experimental # Conflicts: # Makefile	2023-12-23 10:40:29 +08:00


3028 commits
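
Commit db49ff8ed7 above describes replacing a sleep(5ms) polling loop in the server with explicit notify() / wait() calls. The general technique can be sketched as follows. This is a minimal illustration only, not the server's actual code: the `TaskQueue` class, its method names, and the use of plain `int` task ids are all hypothetical and chosen for the example.

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>

// Hypothetical task queue (illustrative only, not taken from the server).
// The consumer blocks on a condition variable instead of polling in a
// sleep loop, so it wakes as soon as a task is posted and uses no CPU
// while the queue is empty.
class TaskQueue {
public:
    // Producer side: enqueue a task id and wake one waiting consumer.
    void post(int task_id) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push_back(task_id);
        }
        cv_.notify_one();
    }

    // Ask all waiting consumers to return once the queue is drained.
    void stop() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopped_ = true;
        }
        cv_.notify_all();
    }

    // Consumer side: block until a task is available or stop() was called.
    // Returns true and writes the task id to `out` if a task was popped,
    // false if the queue is stopped and empty.
    bool wait_and_pop(int & out) {
        std::unique_lock<std::mutex> lock(mutex_);
        // The predicate guards against spurious wakeups.
        cv_.wait(lock, [this] { return stopped_ || !tasks_.empty(); });
        if (tasks_.empty()) {
            return false;
        }
        out = tasks_.front();
        tasks_.pop_front();
        return true;
    }

private:
    std::mutex              mutex_;
    std::condition_variable cv_;
    std::deque<int>         tasks_;
    bool                    stopped_ = false;
};
```

In this sketch a worker thread would loop on `wait_and_pop` while request handlers call `post`. The mutex is released before `notify_one` so the woken thread does not immediately block on it again.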