Commit graph

3016 commits

Author SHA1 Message Date
Justine Tunney
65e5f6dadb
Fix OpenAI server sampling w.r.t. temp and seed (#4668)
The default values for tfs_z and typical_p were being set to zero, which
caused the token candidates array to get shrunk down to one element, thus
preventing any sampling. Note this only applies to OpenAI API-compatible
HTTP server requests.

The solution is to use the default values that OpenAI documents, as well
as ensuring we use the llama.cpp defaults for the rest. I've tested that this
change still ensures deterministic output by default. If a "temperature"
greater than 0 is explicitly passed, then output is unique each time. If
"seed" is specified in addition to "temperature" then the output becomes
deterministic once more.

See mozilla-Ocho/llamafile#117
See mozilla-Ocho/llamafile@9e4bf29
2023-12-28 15:20:00 -04:00
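The key detail here is that 0 over-prunes the candidate list, while 1.0 leaves tfs_z and typical_p effectively disabled. A minimal sketch of neutral request parsing, assuming nlohmann::json and illustrative struct/field names rather than the actual server code:
```cpp
#include <nlohmann/json.hpp>

using json = nlohmann::json;

// Illustrative defaults: 1.0f leaves tail-free and locally typical sampling
// effectively off instead of collapsing the candidates to a single token.
struct sampling_params {
    float temperature = 1.0f;   // OpenAI-documented default
    float top_p       = 1.0f;   // OpenAI-documented default
    float tfs_z       = 1.0f;   // neutral: tail-free sampling disabled
    float typical_p   = 1.0f;   // neutral: locally typical sampling disabled
};

// Take the request value when present, otherwise keep the default,
// rather than zero-initializing absent fields.
static float field_or_default(const json &body, const char *key, float def) {
    return body.contains(key) ? body.at(key).get<float>() : def;
}
```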
Concedo
63b65efb78 added tooltips for all items in the GUI launcher 2023-12-28 23:08:57 +08:00
manikbhandari
ea5497df5d
gpt2 : Add gpt2 architecture integration (#4555) 2023-12-28 15:03:57 +01:00
Concedo
ec46661a32 wip adding tooltips 2023-12-28 15:54:22 +08:00
Nexesenex
cf360f3e62
Update expose.cpp: add '#include <cstdint>' (#586) 2023-12-28 15:01:22 +08:00
Concedo
ba77e916ef added missing parameters for United class.py 2023-12-28 14:07:26 +08:00
Concedo
5e59112de8 prevent other calls when uninitialized 2023-12-28 12:04:53 +08:00
Concedo
2d5d82e915 allocate gpt_params on heap instead to avoid rare segfault 2023-12-28 11:48:21 +08:00
Nam D. Tran
f6793491b5
llama : add AWQ for llama, llama2, mpt, and mistral models (#4593)
* update: awq support llama-7b model

* update: change order

* update: benchmark results for llama2-7b

* update: mistral 7b v1 benchmark

* update: support 4 models

* fix: Readme

* update: ready for PR

* update: readme

* fix: readme

* update: change order import

* black

* format code

* update: work for both mpt and awqmpt

* update: readme

* Rename to llm_build_ffn_mpt_awq

* Formatted other files

* Fixed params count

* fix: remove code

* update: more detail for mpt

* fix: readme

* fix: readme

* update: change folder architecture

* fix: common.cpp

* fix: readme

* fix: remove ggml_repeat

* update: cicd

* update: cicd

* update: remove use_awq arg

* update: readme

* llama : adapt plamo to new ffn

ggml-ci

---------

Co-authored-by: Trần Đức Nam <v.namtd12@vinai.io>
Co-authored-by: Le Hoang Anh <v.anhlh33@vinai.io>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-27 17:39:45 +02:00
Daniel Bevenius
879b690a9e
finetune : fix output formatting in print_params (#4653)
This commit fixes the output formatting in the print_params function
which currently looks like this:
```console
print_params: n_vocab:   32000
print_params: n_ctx:     128
print_params: n_embd:    4096
print_params: n_ff:      11008
print_params: n_head:    32
print_params: n_head_kv: 32
print_params: n_layer:   32
print_params: norm_rms_eps          : 0.000010
print_params: rope_freq_base        : 10000.000000
print_params: rope_freq_scale       : 1.000000
```
With this commit the output will look like this:
```console
print_params: n_vocab               : 32000
print_params: n_ctx                 : 128
print_params: n_embd                : 4096
print_params: n_ff                  : 11008
print_params: n_head                : 32
print_params: n_head_kv             : 32
print_params: n_layer               : 32
print_params: norm_rms_eps          : 0.000010
print_params: rope_freq_base        : 10000.000000
print_params: rope_freq_scale       : 1.000000
```

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2023-12-27 16:16:55 +02:00
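The fix is simply a matter of padding every label to the same width. A small C++ sketch of the pattern, using a hypothetical stand-in struct rather than the real finetune params:
```cpp
#include <cstdio>
#include <cstdint>

struct my_params {                 // illustrative stand-in for the real struct
    uint32_t n_vocab      = 32000;
    uint32_t n_ctx        = 128;
    float    norm_rms_eps = 1e-5f;
};

// Pad every label to the same width so the values line up in one column.
static void print_params(const my_params *p) {
    printf("print_params: n_vocab               : %u\n", p->n_vocab);
    printf("print_params: n_ctx                 : %u\n", p->n_ctx);
    printf("print_params: norm_rms_eps          : %f\n", p->norm_rms_eps);
}
```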
Concedo
69ab1bf2f8 Merge branch 'master' into concedo_experimental
# Conflicts:
#	README.md
2023-12-27 21:43:46 +08:00
Concedo
5b2d93a1f8 updated lite and colab, added logit bias support to lite 2023-12-27 21:32:18 +08:00
Concedo
4d6d967c10 silence autoplay for colab 2023-12-27 19:13:34 +08:00
Georgi Gerganov
b47879b0dd
scripts : add sync-ggml-am.sh 2023-12-27 11:44:22 +02:00
Georgi Gerganov
951010fa53
ggml : fix dot product for ARM (#4630)
ggml-ci
2023-12-27 11:02:13 +02:00
wonjun Jang
f56d6077d0
Add byte token type when tokenizer.model does not exist (#4641)
* Add byte token type to hf format

* remove unused variable
2023-12-27 17:37:25 +09:00
slaren
dc68f0054c
cuda : fix vmm pool with multi GPU (#4620)
* cuda : fix vmm pool with multi GPU

* hip

* use recommended granularity instead of minimum

* better error checking

* fix mixtral

* use cudaMemcpy3DPeerAsync

* use cuda_pool_alloc in ggml_cuda_op_mul_mat

* consolidate error checking in ggml_cuda_set_device

* remove unnecessary inlines

ggml-ci

* style fixes

* only use vmm for the main device

* fix scratch buffer size, re-enable vmm pool for all devices

* remove unnecessary check id != g_main_device
2023-12-26 21:23:59 +01:00
DebuggingLife46
e733a9e425
Add logit_bias to the OpenAI api (#577)
* Add logit_bias to the OpenAI api

* Cleanup and refactor, test in swagger.

---------

Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
2023-12-27 00:26:19 +08:00
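In the OpenAI API, logit_bias maps token ids to additive adjustments applied to the raw logits before sampling. A minimal sketch of that application step, with illustrative types and names rather than the actual server code:
```cpp
#include <cstdint>
#include <map>
#include <vector>

// Add each requested bias to the corresponding token's logit before sampling;
// OpenAI documents values of roughly -100..100, where -100 effectively bans a
// token and +100 effectively forces it.
static void apply_logit_bias(std::vector<float> &logits,
                             const std::map<int32_t, float> &logit_bias) {
    for (const auto &kv : logit_bias) {
        if (kv.first >= 0 && kv.first < (int32_t) logits.size()) {
            logits[kv.first] += kv.second;
        }
    }
}
```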
WillCorticesAI
de8e496437
Update comment for AdamW implementation reference. (#4604)
Co-authored-by: Will Findley <findley@gmail.com>
2023-12-26 11:42:08 +01:00
FantasyGmm
77465dad48
Fix new CUDA10 compilation errors (#4635) 2023-12-26 11:38:36 +01:00
henk717
5006b23099
CUDA 11.4 for Github CI (#582)
* Downgrade CUDA to 11.4

This helps keep the binary smaller and adds K80 support; the manual compiles we did already included this.

* Update kcpp-build-release-win-cuda.yaml

* Update kcpp-build-release-win-cuda.yaml

* Update kcpp-build-release-win-cuda.yaml

* Update kcpp-build-release-win-cuda.yaml

* Update kcpp-build-release-win-cuda.yaml

* Update kcpp-build-release-win-cuda.yaml

* Restore concedo_experimental
2023-12-26 11:23:43 +08:00
Paul Tsochantaris
a206137f92
Adding Emeltal reference to UI list (#4629) 2023-12-25 18:09:53 +02:00
Concedo
c2d87b6545 increase multiuser default 2023-12-25 23:49:45 +08:00
Concedo
78a9d206d3 randomize horde genkey 2023-12-25 22:47:21 +08:00
Concedo
cc64f2cad1 Merge branch 'master' into concedo_experimental
# Conflicts:
#	.github/ISSUE_TEMPLATE/bug.md
#	Makefile
#	README.md
#	ggml-cuda.cu
#	tests/test-grad0.cpp
2023-12-25 18:47:21 +08:00
Concedo
293395e0f5 Merge commit '708e179e85' into concedo_experimental
# Conflicts:
#	.github/workflows/docker.yml
2023-12-25 16:48:15 +08:00
slaren
b9f47952ff
simplify bug issue template (#4623) 2023-12-24 22:01:12 +02:00
Shintarou Okada
753be377b6
llama : add PLaMo model (#3557)
* add plamo mock

* add tensor loading

* plamo convert

* update norm

* able to compile

* fix norm_rms_eps hparam

* runnable

* use inp_pos

* seems ok

* update kqv code

* remove develop code

* update README

* shuffle attn_q.weight and attn_output.weight for broadcasting

* remove plamo_llm_build_kqv and use llm_build_kqv

* fix style

* update

* llama : remove obsolete KQ_scale

* plamo : fix tensor names for correct GPU offload

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-24 15:35:49 +02:00
slaren
5bf3953d7e
cuda : improve cuda pool efficiency using virtual memory (#4606)
* cuda : improve cuda pool efficiency using virtual memory

* fix mixtral

* fix cmake build

* check for vmm support, disable for hip

ggml-ci

* fix hip build

* clarify granularity

* move all caps to g_device_caps

* refactor error checking

* add cuda_pool_alloc, refactor most pool allocations

ggml-ci

* fix hip build

* CUBLAS_TF32_TENSOR_OP_MATH is not a macro

* more hip crap

* llama : fix msvc warnings

* ggml : fix msvc warnings

* minor

* minor

* cuda : fallback to CPU on host buffer alloc fail

* Update ggml-cuda.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml-cuda.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* ensure allocations are always aligned

* act_size -> actual_size

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2023-12-24 14:34:22 +01:00
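The pool above grows without copying by reserving a large virtual address range once and mapping physical chunks into it on demand, so pointers handed out earlier stay valid. A condensed sketch of that technique with the CUDA driver API, with initialization and error handling omitted and details differing from the real pool in ggml-cuda.cu:
```cpp
#include <cuda.h>
#include <cstddef>

// Reserve a big virtual range up front, then back it with physical memory
// chunk by chunk as the pool grows. Existing pointers into the pool stay
// valid because the virtual addresses never move.
struct vmm_pool {
    CUdeviceptr base      = 0;   // start of the reserved virtual range
    size_t      committed = 0;   // bytes currently backed by physical memory
    size_t      reserved  = 0;   // size of the virtual reservation
    int         device    = 0;
};

static void vmm_pool_init(vmm_pool &p, int device, size_t max_size) {
    p.device   = device;
    p.reserved = max_size;
    cuMemAddressReserve(&p.base, max_size, 0, 0, 0);
}

static void vmm_pool_grow(vmm_pool &p, size_t extra) {
    CUmemAllocationProp prop = {};
    prop.type          = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id   = p.device;

    // round the request up to the recommended granularity (cf. the commit
    // note about using recommended instead of minimum granularity)
    size_t granularity = 0;
    cuMemGetAllocationGranularity(&granularity, &prop,
                                  CU_MEM_ALLOC_GRANULARITY_RECOMMENDED);
    extra = ((extra + granularity - 1) / granularity) * granularity;

    // create a physical allocation and map it at the end of the pool
    CUmemGenericAllocationHandle handle;
    cuMemCreate(&handle, extra, &prop, 0);
    cuMemMap(p.base + p.committed, extra, 0, handle, 0);

    CUmemAccessDesc access = {};
    access.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    access.location.id   = p.device;
    access.flags         = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(p.base + p.committed, extra, &access, 1);

    p.committed += extra;
}
```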
Concedo
bd0d9039ec better approach to multiuser check 2023-12-24 20:03:33 +08:00
Concedo
bc24c9334c prevent prompt leakage during usage of check endpoint when genkey is provided in multiuser mode 2023-12-24 17:08:43 +08:00
slaren
708e179e85
fallback to CPU buffer if host buffer alloc fails (#4610) 2023-12-23 16:10:51 +01:00
Samuel Maynard
925e5584a0
ci(docker): fix tags in "Build and push docker image (tagged)" (#4603) 2023-12-23 11:35:55 +02:00
Alexey Parfenov
6123979952
server : allow to specify custom prompt for penalty calculation (#3727) 2023-12-23 11:31:49 +02:00
kalomaze
b9ec82d262
grammar : check the full vocab only if necessary (opt) (#4306)
* Check the full vocab for grammar only if necessary

* Fix missing logit restoration step (?)

Does this matter, actually?

* Fix whitespace / formatting

* Adjust comment

* Didn't mean to push test gbnf

* Split sampling into the helper function (?)

And also revert the changes made to the header

* common : fix final newline

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-23 11:27:07 +02:00
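The optimization above can be read as: sample optimistically without the grammar, and only run the expensive full-vocab grammar pass (after restoring the logits) when the optimistic pick is rejected. A self-contained sketch of that idea, where every type and helper is an illustrative stand-in rather than llama.cpp API:
```cpp
#include <vector>

// All types and helpers below are illustrative stand-ins, not llama.cpp API.
struct candidates    { std::vector<float> logit; };
struct grammar_state { std::vector<bool>  allowed; };   // per-token mask

static int sample_token(const candidates &c) {
    // stand-in sampler: greedy pick of the highest logit
    int best = 0;
    for (size_t i = 1; i < c.logit.size(); ++i)
        if (c.logit[i] > c.logit[best]) best = (int) i;
    return best;
}

static bool grammar_accepts(const grammar_state &g, int tok) {
    return tok >= 0 && tok < (int) g.allowed.size() && g.allowed[tok];
}

static void constrain_to_grammar(candidates &c, const grammar_state &g) {
    // expensive path: mask out every token the grammar rejects
    for (size_t i = 0; i < c.logit.size(); ++i)
        if (!grammar_accepts(g, (int) i)) c.logit[i] = -1e9f;
}

// Sample optimistically first; only pay for the full-vocab grammar pass when
// the optimistic pick is rejected, restoring the logits before the retry.
static int sample_with_grammar(candidates &cur, const grammar_state &g) {
    const candidates backup = cur;            // keep logits for a possible retry
    int tok = sample_token(cur);              // cheap path: no grammar applied
    if (grammar_accepts(g, tok)) return tok;  // most tokens take this path
    cur = backup;                             // restore logits
    constrain_to_grammar(cur, g);             // check the full vocab
    return sample_token(cur);
}
```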
Johannes Gäßler
e0a4002273
CUDA: fixed row rounding for 0 tensor splits (#4594) 2023-12-23 09:16:33 +01:00
Concedo
71a5afaab5 fixed incorrect localflag 2023-12-23 11:00:58 +08:00
Concedo
4a8308b1c8 Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
2023-12-23 10:40:29 +08:00
Concedo
8823e8b06d added presence penalty into lite ui 2023-12-23 10:39:40 +08:00
LeonEricsson
7082d24cec
lookup : add prompt lookup decoding example (#4484)
* initial commit, going through initializations

* main loop finished, starting to debug

* BUG: generates gibberish/repeating tokens after a while

* kv_cache management

* Added colors to distinguish drafted tokens (--color). Updated README

* lookup : fix token positions in the draft batch

* lookup : use n_draft from CLI params

* lookup : final touches

---------

Co-authored-by: Leon Ericsson <leon.ericsson@icloud.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-22 18:05:56 +02:00
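Prompt lookup decoding drafts tokens without a draft model: it finds an earlier occurrence of the context's trailing n-gram and proposes the tokens that followed it as speculative candidates. A self-contained sketch of the lookup step, with illustrative parameter names rather than the example's CLI flags:
```cpp
#include <cstdint>
#include <vector>

// Find the most recent earlier occurrence of the last `ngram` tokens of the
// context, and return up to `n_draft` tokens that followed that occurrence
// as a cheap draft for speculative decoding.
static std::vector<int32_t> draft_from_prompt(const std::vector<int32_t> &ctx,
                                              size_t ngram, size_t n_draft) {
    std::vector<int32_t> draft;
    if (ctx.size() < ngram + 1) return draft;

    const size_t tail = ctx.size() - ngram;   // start of the trailing n-gram
    for (size_t i = tail; i-- > 0; ) {        // search backwards for a match
        bool match = true;
        for (size_t j = 0; j < ngram; ++j) {
            if (ctx[i + j] != ctx[tail + j]) { match = false; break; }
        }
        if (!match) continue;

        // copy the continuation after the matched n-gram as draft tokens
        for (size_t k = i + ngram; k < ctx.size() && draft.size() < n_draft; ++k) {
            draft.push_back(ctx[k]);
        }
        break;
    }
    return draft;
}
```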
Concedo
b814bb217d Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
#	README.md
2023-12-23 00:01:21 +08:00
Georgi Gerganov
ba66175132
sync : ggml (fix im2col) (#4591)
* cuda : fix im2col_f32_f16 (ggml/#658)

ggml-ci

* ggml-alloc : fix ggml_tallocr_is_own

---------

Co-authored-by: leejet <leejet714@gmail.com>
2023-12-22 17:53:43 +02:00
FantasyGmm
a55876955b
cuda : fix jetson compile error (#4560)
* fix old jetson compile error

* Update Makefile

* update jetson detect and cuda version detect

* update cuda macro define

* update makefile and cuda, fix some issues

* Update README.md

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update Makefile

* Update README.md

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-22 17:11:12 +02:00
Concedo
3bca03d26b Merge branch 'master' into concedo_experimental
# Conflicts:
#	.github/workflows/docker.yml
#	Makefile
#	README.md
#	llama.cpp
2023-12-22 21:39:23 +08:00
Henrik Forstén
6724ef1657
Fix CudaMemcpy direction (#4599) 2023-12-22 14:34:05 +01:00
Concedo
852ca780c9 cherrypicked the Hipblas fixes from PR #571 2023-12-22 21:29:20 +08:00
slaren
48b7ff193e
llama : fix platforms without mmap (#4578)
* llama : fix platforms without mmap

* win32 : limit prefetch size to the file size

* fix win32 error clobber, unnecessary std::string in std::runtime_error
2023-12-22 13:12:53 +02:00
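On platforms without mmap, the loader has to fall back to plain file reads into an owned buffer. A rough sketch of that fallback shape (not the actual llama.cpp loader, which also deals with the Win32 prefetch limit mentioned above):
```cpp
#include <cstdio>
#include <cstdint>
#include <stdexcept>
#include <vector>

// When mmap is unavailable, read the whole file into a heap buffer so the
// rest of the loader can keep treating the weights as one contiguous span.
static std::vector<uint8_t> load_file_no_mmap(const char *path) {
    FILE *f = std::fopen(path, "rb");
    if (!f) throw std::runtime_error("failed to open file");

    std::fseek(f, 0, SEEK_END);
    const long size = std::ftell(f);
    std::fseek(f, 0, SEEK_SET);

    std::vector<uint8_t> buf(size > 0 ? (size_t) size : 0);
    if (!buf.empty() && std::fread(buf.data(), 1, buf.size(), f) != buf.size()) {
        std::fclose(f);
        throw std::runtime_error("failed to read file");
    }
    std::fclose(f);
    return buf;
}
```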
Herman Semenov
48b24b170e
ggml : add comment about backward GGML_OP_DIAG_MASK_INF (#4203) 2023-12-22 11:26:49 +02:00
Michael Kesper
28cb35a0ec
make : add LLAMA_HIP_UMA option (#4587)
NB: LLAMA_HIP_UMA=1 (or any value) adds MK_CPPFLAG -DGGML_HIP_UMA
2023-12-22 10:03:25 +02:00
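LLAMA_HIP_UMA only toggles a preprocessor define; with GGML_HIP_UMA set, the HIP path can allocate unified (managed) memory so integrated GPUs without dedicated VRAM can share system RAM. A hedged sketch of what such a switch can look like; the exact call site in ggml-cuda.cu may differ:
```cpp
#include <hip/hip_runtime.h>

// With -DGGML_HIP_UMA, device buffers come from HIP unified (managed) memory
// instead of dedicated VRAM, which is what integrated GPUs/APUs need.
// This is a sketch of the pattern, not the exact ggml-cuda.cu code.
static hipError_t ggml_hip_alloc(void **ptr, size_t size) {
#ifdef GGML_HIP_UMA
    return hipMallocManaged(ptr, size);
#else
    return hipMalloc(ptr, size);
#endif
}
```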
Concedo
77463e0e9c batch size improvements 2023-12-22 15:27:40 +08:00