Commit graph

3016 commits

Author SHA1 Message Date
Justine Tunney
65e5f6dadb
Fix OpenAI server sampling w.r.t. temp and seed (#4668)
The default values for tfs_z and typical_p were being set to zero, which
caused the token candidates array to get shrunk down to one element, thus
preventing any sampling. Note this only applies to OpenAI API-compatible
HTTP server requests.

The solution is to use the default values that OpenAI documents, as well
as ensuring we use the llama.cpp defaults for the rest. I've tested that this
change still ensures deterministic output by default. If a "temperature"
greater than 0 is explicitly passed, then output is unique each time. If
"seed" is specified in addition to "temperature" then the output becomes
deterministic once more.

See mozilla-Ocho/llamafile#117
See mozilla-Ocho/llamafile@9e4bf29
2023-12-28 15:20:00 -04:00
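The key detail here is that 0 over-prunes the candidate list, while 1.0 leaves tfs_z and typical_p effectively disabled. A minimal sketch of neutral request parsing, assuming nlohmann::json and illustrative struct/field names rather than the actual server code:
```cpp
#include <nlohmann/json.hpp>

using json = nlohmann::json;

// Illustrative defaults: 1.0f leaves tail-free and locally typical sampling
// effectively off instead of collapsing the candidates to a single token.
struct sampling_params {
    float temperature = 1.0f;   // OpenAI-documented default
    float top_p       = 1.0f;   // OpenAI-documented default
    float tfs_z       = 1.0f;   // neutral: tail-free sampling disabled
    float typical_p   = 1.0f;   // neutral: locally typical sampling disabled
};

// Take the request value when present, otherwise keep the default,
// rather than zero-initializing absent fields.
static float field_or_default(const json &body, const char *key, float def) {
    return body.contains(key) ? body.at(key).get<float>() : def;
}
```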
Concedo
63b65efb78 added tooltips for all items in the GUI launcher 2023-12-28 23:08:57 +08:00
manikbhandari
ea5497df5d
gpt2 : Add gpt2 architecture integration (#4555) 2023-12-28 15:03:57 +01:00
Concedo
ec46661a32 wip adding tooltips 2023-12-28 15:54:22 +08:00
Nexesenex
cf360f3e62
Update expose.cpp: add '#include <cstdint>' (#586) 2023-12-28 15:01:22 +08:00
Concedo
ba77e916ef added missing parameters for United class.py 2023-12-28 14:07:26 +08:00
Concedo
5e59112de8 prevent other calls when uninitialized 2023-12-28 12:04:53 +08:00
Concedo
2d5d82e915 allocate gpt_params on heap instead to avoid rare segfault 2023-12-28 11:48:21 +08:00
Nam D. Tran
f6793491b5
llama : add AWQ for llama, llama2, mpt, and mistral models (#4593)
* update: awq support llama-7b model

* update: change order

* update: benchmark results for llama2-7b

* update: mistral 7b v1 benchmark

* update: support 4 models

* fix: Readme

* update: ready for PR

* update: readme

* fix: readme

* update: change order import

* black

* format code

* update: work for both mpt and awqmpt

* update: readme

* Rename to llm_build_ffn_mpt_awq

* Formatted other files

* Fixed params count

* fix: remove code

* update: more detail for mpt

* fix: readme

* fix: readme

* update: change folder architecture

* fix: common.cpp

* fix: readme

* fix: remove ggml_repeat

* update: cicd

* update: cicd

* update: remove use_awq arg

* update: readme

* llama : adapt plamo to new ffn

ggml-ci

---------

Co-authored-by: Trần Đức Nam <v.namtd12@vinai.io>
Co-authored-by: Le Hoang Anh <v.anhlh33@vinai.io>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-27 17:39:45 +02:00
Daniel Bevenius
879b690a9e
finetune : fix output formatting in print_params (#4653)
This commit fixes the output formatting in the print_params function
which currently looks like this:
```console
print_params: n_vocab:   32000
print_params: n_ctx:     128
print_params: n_embd:    4096
print_params: n_ff:      11008
print_params: n_head:    32
print_params: n_head_kv: 32
print_params: n_layer:   32
print_params: norm_rms_eps          : 0.000010
print_params: rope_freq_base        : 10000.000000
print_params: rope_freq_scale       : 1.000000
```
With this commit the output will look like this:
```console
print_params: n_vocab               : 32000
print_params: n_ctx                 : 128
print_params: n_embd                : 4096
print_params: n_ff                  : 11008
print_params: n_head                : 32
print_params: n_head_kv             : 32
print_params: n_layer               : 32
print_params: norm_rms_eps          : 0.000010
print_params: rope_freq_base        : 10000.000000
print_params: rope_freq_scale       : 1.000000
```

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2023-12-27 16:16:55 +02:00
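The fix is simply a matter of padding every label to the same width. A small C++ sketch of the pattern, using a hypothetical stand-in struct rather than the real finetune params:
```cpp
#include <cstdio>
#include <cstdint>

struct my_params {                 // illustrative stand-in for the real struct
    uint32_t n_vocab      = 32000;
    uint32_t n_ctx        = 128;
    float    norm_rms_eps = 1e-5f;
};

// Pad every label to the same width so the values line up in one column.
static void print_params(const my_params *p) {
    printf("print_params: n_vocab               : %u\n", p->n_vocab);
    printf("print_params: n_ctx                 : %u\n", p->n_ctx);
    printf("print_params: norm_rms_eps          : %f\n", p->norm_rms_eps);
}
```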
Concedo
69ab1bf2f8 Merge branch 'master' into concedo_experimental
# Conflicts:
#	README.md
2023-12-27 21:43:46 +08:00
Concedo
5b2d93a1f8 updated lite and colab, added logit bias support to lite 2023-12-27 21:32:18 +08:00
Concedo
4d6d967c10 silence autoplay for colab 2023-12-27 19:13:34 +08:00
Georgi Gerganov
b47879b0dd
scripts : add sync-ggml-am.sh 2023-12-27 11:44:22 +02:00
Georgi Gerganov
951010fa53
ggml : fix dot product for ARM (#4630)
ggml-ci
2023-12-27 11:02:13 +02:00
wonjun Jang
f56d6077d0
Add byte token type when tokenizer.model does not exist (#4641)
* Add byte token type to hf format

* remove unused variable
2023-12-27 17:37:25 +09:00
slaren
dc68f0054c
cuda : fix vmm pool with multi GPU (#4620)
* cuda : fix vmm pool with multi GPU

* hip

* use recommended granularity instead of minimum

* better error checking

* fix mixtral

* use cudaMemcpy3DPeerAsync

* use cuda_pool_alloc in ggml_cuda_op_mul_mat

* consolidate error checking in ggml_cuda_set_device

* remove unnecessary inlines

ggml-ci

* style fixes

* only use vmm for the main device

* fix scratch buffer size, re-enable vmm pool for all devices

* remove unnecessary check id != g_main_device
2023-12-26 21:23:59 +01:00
DebuggingLife46
e733a9e425
Add logit_bias to the OpenAI api (#577)
* Add logit_bias to the OpenAI api

* Cleanup and refactor, test in swagger.

---------

Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
2023-12-27 00:26:19 +08:00
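In the OpenAI API, logit_bias maps token ids to additive adjustments applied to the raw logits before sampling. A minimal sketch of that application step, with illustrative types and names rather than the actual server code:
```cpp
#include <cstdint>
#include <map>
#include <vector>

// Add each requested bias to the corresponding token's logit before sampling;
// OpenAI documents values of roughly -100..100, where -100 effectively bans a
// token and +100 effectively forces it.
static void apply_logit_bias(std::vector<float> &logits,
                             const std::map<int32_t, float> &logit_bias) {
    for (const auto &kv : logit_bias) {
        if (kv.first >= 0 && kv.first < (int32_t) logits.size()) {
            logits[kv.first] += kv.second;
        }
    }
}
```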
WillCorticesAI
de8e496437
Update comment for AdamW implementation reference. (#4604)
Co-authored-by: Will Findley <findley@gmail.com>
2023-12-26 11:42:08 +01:00
FantasyGmm
77465dad48
Fix new CUDA10 compilation errors (#4635) 2023-12-26 11:38:36 +01:00
henk717
5006b23099
CUDA 11.4 for Github CI (#582)
* Downgrade CUDA to 11.4

This helps keep the binary smaller and adds K80 support; the manual compiles we did already included this.

* Update kcpp-build-release-win-cuda.yaml

* Update kcpp-build-release-win-cuda.yaml

* Update kcpp-build-release-win-cuda.yaml

* Update kcpp-build-release-win-cuda.yaml

* Update kcpp-build-release-win-cuda.yaml

* Update kcpp-build-release-win-cuda.yaml

* Restore concedo_experimental
2023-12-26 11:23:43 +08:00
Paul Tsochantaris
a206137f92
Adding Emeltal reference to UI list (#4629) 2023-12-25 18:09:53 +02:00
Concedo
c2d87b6545 increase multiuser default 2023-12-25 23:49:45 +08:00
Concedo
78a9d206d3 randomize horde genkey 2023-12-25 22:47:21 +08:00
Concedo
cc64f2cad1 Merge branch 'master' into concedo_experimental
# Conflicts:
#	.github/ISSUE_TEMPLATE/bug.md
#	Makefile
#	README.md
#	ggml-cuda.cu
#	tests/test-grad0.cpp
2023-12-25 18:47:21 +08:00
Concedo
293395e0f5 Merge commit '708e179e85' into concedo_experimental
# Conflicts:
#	.github/workflows/docker.yml
2023-12-25 16:48:15 +08:00
slaren
b9f47952ff
simplify bug issue template (#4623) 2023-12-24 22:01:12 +02:00
Shintarou Okada
753be377b6
llama : add PLaMo model (#3557)
* add plamo mock

* add tensor loading

* plamo convert

* update norm

* able to compile

* fix norm_rms_eps hparam

* runnable

* use inp_pos

* seems ok

* update kqv code

* remove develop code

* update README

* shuffle attn_q.weight and attn_output.weight for broadcasting

* remove plamo_llm_build_kqv and use llm_build_kqv

* fix style

* update

* llama : remove obsolete KQ_scale

* plamo : fix tensor names for correct GPU offload

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-24 15:35:49 +02:00
slaren
5bf3953d7e
cuda : improve cuda pool efficiency using virtual memory (#4606)
* cuda : improve cuda pool efficiency using virtual memory

* fix mixtral

* fix cmake build

* check for vmm support, disable for hip

ggml-ci

* fix hip build

* clarify granularity

* move all caps to g_device_caps

* refactor error checking

* add cuda_pool_alloc, refactor most pool allocations

ggml-ci

* fix hip build

* CUBLAS_TF32_TENSOR_OP_MATH is not a macro

* more hip crap

* llama : fix msvc warnings

* ggml : fix msvc warnings

* minor

* minor

* cuda : fallback to CPU on host buffer alloc fail

* Update ggml-cuda.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml-cuda.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* ensure allocations are always aligned

* act_size -> actual_size

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2023-12-24 14:34:22 +01:00
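The pool above grows without copying by reserving a large virtual address range once and mapping physical chunks into it on demand, so pointers handed out earlier stay valid. A condensed sketch of that technique with the CUDA driver API, with initialization and error handling omitted and details differing from the real pool in ggml-cuda.cu:
```cpp
#include <cuda.h>
#include <cstddef>

// Reserve a big virtual range up front, then back it with physical memory
// chunk by chunk as the pool grows. Existing pointers into the pool stay
// valid because the virtual addresses never move.
struct vmm_pool {
    CUdeviceptr base      = 0;   // start of the reserved virtual range
    size_t      committed = 0;   // bytes currently backed by physical memory
    size_t      reserved  = 0;   // size of the virtual reservation
    int         device    = 0;
};

static void vmm_pool_init(vmm_pool &p, int device, size_t max_size) {
    p.device   = device;
    p.reserved = max_size;
    cuMemAddressReserve(&p.base, max_size, 0, 0, 0);
}

static void vmm_pool_grow(vmm_pool &p, size_t extra) {
    CUmemAllocationProp prop = {};
    prop.type          = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id   = p.device;

    // round the request up to the recommended granularity (cf. the commit
    // note about using recommended instead of minimum granularity)
    size_t granularity = 0;
    cuMemGetAllocationGranularity(&granularity, &prop,
                                  CU_MEM_ALLOC_GRANULARITY_RECOMMENDED);
    extra = ((extra + granularity - 1) / granularity) * granularity;

    // create a physical allocation and map it at the end of the pool
    CUmemGenericAllocationHandle handle;
    cuMemCreate(&handle, extra, &prop, 0);
    cuMemMap(p.base + p.committed, extra, 0, handle, 0);

    CUmemAccessDesc access = {};
    access.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    access.location.id   = p.device;
    access.flags         = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(p.base + p.committed, extra, &access, 1);

    p.committed += extra;
}
```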
Concedo
bd0d9039ec better approach to multiuser check 2023-12-24 20:03:33 +08:00
Concedo
bc24c9334c prevent prompt leakage during usage of check endpoint when genkey is provided in multiuser mode 2023-12-24 17:08:43 +08:00
slaren
708e179e85
fallback to CPU buffer if host buffer alloc fails (#4610) 2023-12-23 16:10:51 +01:00
Samuel Maynard
925e5584a0
ci(docker): fix tags in "Build and push docker image (tagged)" (#4603) 2023-12-23 11:35:55 +02:00
Alexey Parfenov
6123979952
server : allow to specify custom prompt for penalty calculation (#3727) 2023-12-23 11:31:49 +02:00
kalomaze
b9ec82d262
grammar : check the full vocab only if necessary (opt) (#4306)
* Check the full vocab for grammar only if necessary

* Fix missing logit restoration step (?)

Does this matter, actually?

* Fix whitespace / formatting

* Adjust comment

* Didn't mean to push test gbnf

* Split sampling into the helper function (?)

And also revert the changes made to the header

* common : fix final newline

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-23 11:27:07 +02:00
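The optimization above can be read as: sample optimistically without the grammar, and only run the expensive full-vocab grammar pass (after restoring the logits) when the optimistic pick is rejected. A self-contained sketch of that idea, where every type and helper is an illustrative stand-in rather than llama.cpp API:
```cpp
#include <vector>

// All types and helpers below are illustrative stand-ins, not llama.cpp API.
struct candidates    { std::vector<float> logit; };
struct grammar_state { std::vector<bool>  allowed; };   // per-token mask

static int sample_token(const candidates &c) {
    // stand-in sampler: greedy pick of the highest logit
    int best = 0;
    for (size_t i = 1; i < c.logit.size(); ++i)
        if (c.logit[i] > c.logit[best]) best = (int) i;
    return best;
}

static bool grammar_accepts(const grammar_state &g, int tok) {
    return tok >= 0 && tok < (int) g.allowed.size() && g.allowed[tok];
}

static void constrain_to_grammar(candidates &c, const grammar_state &g) {
    // expensive path: mask out every token the grammar rejects
    for (size_t i = 0; i < c.logit.size(); ++i)
        if (!grammar_accepts(g, (int) i)) c.logit[i] = -1e9f;
}

// Sample optimistically first; only pay for the full-vocab grammar pass when
// the optimistic pick is rejected, restoring the logits before the retry.
static int sample_with_grammar(candidates &cur, const grammar_state &g) {
    const candidates backup = cur;            // keep logits for a possible retry
    int tok = sample_token(cur);              // cheap path: no grammar applied
    if (grammar_accepts(g, tok)) return tok;  // most tokens take this path
    cur = backup;                             // restore logits
    constrain_to_grammar(cur, g);             // check the full vocab
    return sample_token(cur);
}
```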
Johannes Gäßler
e0a4002273
CUDA: fixed row rounding for 0 tensor splits (#4594) 2023-12-23 09:16:33 +01:00
Concedo
71a5afaab5 fixed incorrect localflag 2023-12-23 11:00:58 +08:00
Concedo
4a8308b1c8 Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
2023-12-23 10:40:29 +08:00
Concedo
8823e8b06d added presence penalty into lite ui 2023-12-23 10:39:40 +08:00
LeonEricsson
7082d24cec
lookup : add prompt lookup decoding example (#4484)
* initial commit, going through initializations

* main loop finished, starting to debug

* BUG: generates gibberish/repeating tokens after a while

* kv_cache management

* Added colors to distinguish drafted tokens (--color). Updated README

* lookup : fix token positions in the draft batch

* lookup : use n_draft from CLI params

* lookup : final touches

---------

Co-authored-by: Leon Ericsson <leon.ericsson@icloud.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-22 18:05:56 +02:00
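Prompt lookup decoding drafts tokens without a draft model: it finds an earlier occurrence of the context's trailing n-gram and proposes the tokens that followed it as speculative candidates. A self-contained sketch of the lookup step, with illustrative parameter names rather than the example's CLI flags:
```cpp
#include <cstdint>
#include <vector>

// Find the most recent earlier occurrence of the last `ngram` tokens of the
// context, and return up to `n_draft` tokens that followed that occurrence
// as a cheap draft for speculative decoding.
static std::vector<int32_t> draft_from_prompt(const std::vector<int32_t> &ctx,
                                              size_t ngram, size_t n_draft) {
    std::vector<int32_t> draft;
    if (ctx.size() < ngram + 1) return draft;

    const size_t tail = ctx.size() - ngram;   // start of the trailing n-gram
    for (size_t i = tail; i-- > 0; ) {        // search backwards for a match
        bool match = true;
        for (size_t j = 0; j < ngram; ++j) {
            if (ctx[i + j] != ctx[tail + j]) { match = false; break; }
        }
        if (!match) continue;

        // copy the continuation after the matched n-gram as draft tokens
        for (size_t k = i + ngram; k < ctx.size() && draft.size() < n_draft; ++k) {
            draft.push_back(ctx[k]);
        }
        break;
    }
    return draft;
}
```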
Concedo
b814bb217d Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
#	README.md
2023-12-23 00:01:21 +08:00
Georgi Gerganov
ba66175132
sync : ggml (fix im2col) (#4591)
* cuda : fix im2col_f32_f16 (ggml/#658)

ggml-ci

* ggml-alloc : fix ggml_tallocr_is_own

---------

Co-authored-by: leejet <leejet714@gmail.com>
2023-12-22 17:53:43 +02:00
FantasyGmm
a55876955b
cuda : fix jetson compile error (#4560)
* fix old jetson compile error

* Update Makefile

* update jetson detect and cuda version detect

* update cuda macro define

* update makefile and cuda, fix some issues

* Update README.md

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update Makefile

* Update README.md

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-22 17:11:12 +02:00
Concedo
3bca03d26b Merge branch 'master' into concedo_experimental
# Conflicts:
#	.github/workflows/docker.yml
#	Makefile
#	README.md
#	llama.cpp
2023-12-22 21:39:23 +08:00
Henrik Forstén
6724ef1657
Fix CudaMemcpy direction (#4599) 2023-12-22 14:34:05 +01:00
Concedo
852ca780c9 cherrypicked the Hipblas fixes from PR #571 2023-12-22 21:29:20 +08:00
slaren
48b7ff193e
llama : fix platforms without mmap (#4578)
* llama : fix platforms without mmap

* win32 : limit prefetch size to the file size

* fix win32 error clobber, unnecessary std::string in std::runtime_error
2023-12-22 13:12:53 +02:00
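On platforms without mmap, the loader has to fall back to plain file reads into an owned buffer. A rough sketch of that fallback shape (not the actual llama.cpp loader, which also deals with the Win32 prefetch limit mentioned above):
```cpp
#include <cstdio>
#include <cstdint>
#include <stdexcept>
#include <vector>

// When mmap is unavailable, read the whole file into a heap buffer so the
// rest of the loader can keep treating the weights as one contiguous span.
static std::vector<uint8_t> load_file_no_mmap(const char *path) {
    FILE *f = std::fopen(path, "rb");
    if (!f) throw std::runtime_error("failed to open file");

    std::fseek(f, 0, SEEK_END);
    const long size = std::ftell(f);
    std::fseek(f, 0, SEEK_SET);

    std::vector<uint8_t> buf(size > 0 ? (size_t) size : 0);
    if (!buf.empty() && std::fread(buf.data(), 1, buf.size(), f) != buf.size()) {
        std::fclose(f);
        throw std::runtime_error("failed to read file");
    }
    std::fclose(f);
    return buf;
}
```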
Herman Semenov
48b24b170e
ggml : add comment about backward GGML_OP_DIAG_MASK_INF (#4203) 2023-12-22 11:26:49 +02:00
Michael Kesper
28cb35a0ec
make : add LLAMA_HIP_UMA option (#4587)
NB: LLAMA_HIP_UMA=1 (or any value) adds MK_CPPFLAG -DGGML_HIP_UMA
2023-12-22 10:03:25 +02:00
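LLAMA_HIP_UMA only toggles a preprocessor define; with GGML_HIP_UMA set, the HIP path can allocate unified (managed) memory so integrated GPUs without dedicated VRAM can share system RAM. A hedged sketch of what such a switch can look like; the exact call site in ggml-cuda.cu may differ:
```cpp
#include <hip/hip_runtime.h>

// With -DGGML_HIP_UMA, device buffers come from HIP unified (managed) memory
// instead of dedicated VRAM, which is what integrated GPUs/APUs need.
// This is a sketch of the pattern, not the exact ggml-cuda.cu code.
static hipError_t ggml_hip_alloc(void **ptr, size_t size) {
#ifdef GGML_HIP_UMA
    return hipMallocManaged(ptr, size);
#else
    return hipMalloc(ptr, size);
#endif
}
```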
Concedo
77463e0e9c batch size improvements 2023-12-22 15:27:40 +08:00