Commit graph

1466 commits

Author SHA1 Message Date
Concedo
523fc3be52 fixed rwkv, standardized new ctx usage 2023-07-10 20:05:53 +08:00
Concedo
2827920044 fix compile errors, rwkv not working 2023-07-10 18:23:25 +08:00
Concedo
15576bc865 Merge branch 'kquant_vocab_fix' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	Makefile
#	README.md
#	llama.cpp
#	tests/CMakeLists.txt
#	tests/test-grad0.c
#	tests/test-opt.c
2023-07-08 20:43:20 +08:00
Concedo
1854168841 This allows LLAMA models that were previously incompatible with K quants to function mostly as normal. This happens when a model has a vocab != 32000, e.g. 32001, which means it's not divisible by 256 or 64. Since the problematic dimensions only apply for tok_embeddings.weight and output.weight (dimensions 4096 x n_vocab), we can simply quantize these layers to Q8_0, whereas the majority of the hidden layers are still K-quanted since they have compatible dimensions. 2023-07-08 20:38:03 +08:00
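The fallback rule described in the commit message above can be sketched roughly as follows. This is an illustration only, not the code from the commit: the function name `pick_quant_type`, the default requested type `Q4_K`, and the choice to test every dimension against a block size of 256 are assumptions made for the example.

```python
# Assumed K-quant block size; the commit message mentions both 256 and 64.
QK_K = 256

# The two tensors the commit message names as having an n_vocab dimension.
VOCAB_TENSORS = {"tok_embeddings.weight", "output.weight"}


def pick_quant_type(name, shape, requested="Q4_K", fallback="Q8_0", block=QK_K):
    """Pick a quantization type for one tensor (hypothetical helper).

    Returns the requested K-quant type when every dimension is a multiple
    of the block size, and the fallback type for the vocab-sized tensors
    when it is not.
    """
    compatible = all(dim % block == 0 for dim in shape)
    if compatible:
        return requested
    if name in VOCAB_TENSORS:
        return fallback
    # The commit message only covers the two vocab-sized tensors, so any
    # other incompatible tensor is reported rather than silently changed.
    raise ValueError(f"{name} has shape {shape}, not divisible by {block}")


if __name__ == "__main__":
    for tensor, dims in [
        ("tok_embeddings.weight", (4096, 32001)),
        ("layers.0.attention.wq.weight", (4096, 4096)),
        ("output.weight", (4096, 32000)),
    ]:
        print(tensor, dims, pick_quant_type(tensor, dims))
```

With a vocab of 32001 the two vocab-sized tensors get the fallback type, while a 4096 x 4096 hidden layer keeps the requested K-quant type.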
Johannes Gäßler
061f5f8d21
CUDA: add __restrict__ to mul mat vec kernels (#2140) 2023-07-08 00:25:15 +02:00
dylan
84525e7962
docker : add support for CUDA in docker (#1461)
Co-authored-by: canardleteer <eris.has.a.dad+github@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-07 21:25:25 +03:00
Georgi Gerganov
a7e20edf22
ci : switch threads to 1 (#2138) 2023-07-07 21:23:57 +03:00
Qingyou Meng
1d656d6360
ggml : change ggml_graph_compute() API to not require context (#1999)
* ggml_graph_compute: deprecate using ggml_context, try resolve issue #287

* rewrite: no longer consider backward compatibility; plan and make_plan

* minor: rename ctx as plan; const

* remove ggml_graph_compute from tests/test-grad0.c, but current change breaks backward

* add static ggml_graph_compute_sugar()

* minor: update comments

* reusable buffers

* ggml : more consistent naming + metal fixes

* ggml : fix docs

* tests : disable grad / opt + minor naming changes

* ggml : add ggml_graph_compute_with_ctx()

- backwards compatible API
- deduplicates a lot of copy-paste

* ci : enable test-grad0

* examples : factor out plan allocation into a helper function

* llama : factor out plan stuff into a helper function

* ci : fix env

* llama : fix duplicate symbols + refactor example benchmark

* ggml : remove obsolete assert + refactor n_tasks section

* ggml : fix indentation in switch

* llama : avoid unnecessary bool

* ggml : remove comments from source file and match order in header

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-07 19:24:01 +03:00
Concedo
8edcb337c6 added ability to select "all devices" 2023-07-07 23:37:55 +08:00
Georgi Gerganov
7242140283 ggml : remove sched_yield() call in ggml_graph_compute_thread() (#2134) 2023-07-07 18:37:10 +03:00
Concedo
ddaa4f2a26 fix cuda garbage results and gpu selection issues 2023-07-07 22:14:14 +08:00
Aarni Koskela
3e08ae99ce
convert.py: add mapping for safetensors bf16 (#1598)
Fixes #1473
2023-07-07 09:12:49 -04:00
Concedo
95eca51bef add gpu choice for GUI for cuda 2023-07-07 18:39:47 +08:00
Concedo
a689a66068 make it work with pyinstaller 2023-07-07 17:52:34 +08:00
Concedo
9ee9a77f12 warn outdated GUI (+1 squashed commits)
Squashed commits:

[15aec3d] spelling error
2023-07-07 16:39:17 +08:00
Concedo
32102c2064 Merge branch 'master' into concedo_experimental
# Conflicts:
#	README.md
2023-07-07 14:15:39 +08:00
Howard Su
481f793acc
Fix OpenCL by wrapping #if-else-endif with \n (#2086) 2023-07-07 05:34:18 +02:00
Georgi Gerganov
dfd9fce6d6
ggml : fix restrict usage 2023-07-06 19:41:31 +03:00
Judd
36680f6e40
convert : update for baichuan (#2081)
1. guess n_layers;
2. relax warnings on context size;
3. add a note that its derivations are also supported.

Co-authored-by: Judd <foldl@boxvest.com>
2023-07-06 19:23:49 +03:00
tslmy
a17a2683d8
alpaca.sh : update model file name (#2074)
The original file name, `ggml-alpaca-7b-q4.bin`, implied the first-generation GGML. After the breaking changes (mentioned in https://github.com/ggerganov/llama.cpp/issues/382), `llama.cpp` requires GGML V3 now. Those model files are named `*ggmlv3*.bin`. We should change the example to an actually working model file, so that this thing is more likely to run out-of-the-box for more people, and fewer people would waste time downloading the old Alpaca model.
2023-07-06 19:17:50 +03:00
Concedo
8424a35c62 added the ability to ban any substring tokens 2023-07-06 23:24:21 +08:00
Concedo
27a0907cfa backport MM256_SET_M128I to ggml_v2, updated lite, added support for selecting the GPU for cublas 2023-07-06 22:33:46 +08:00
Concedo
220aa707e6 Merge branch 'master' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	CMakeLists.txt
#	Makefile
#	README.md
#	pocs/vdot/q8dot.cpp
#	pocs/vdot/vdot.cpp
#	scripts/sync-ggml.sh
#	tests/test-grad0.c
#	tests/test-quantize-fns.cpp
#	tests/test-quantize-perf.cpp
2023-07-06 15:40:40 +08:00
Concedo
4d1700b172 adjust some ui sizing 2023-07-06 15:17:47 +08:00
Vali-98
1c80002310
New UI using customtkinter (#284)
* Initial conversion to customtkinter.

* Initial conversion to customtkinter.

* Additions to UI, still non-functional

* UI now functional, untested

* UI now functional, untested

* Added saving configs

* Saving and loading now functional

* Fixed sliders not loading

* Cleaned up duplicate arrays

* Cleaned up duplicate arrays

* Fixed loading bugs

* wip fixing all the broken parameters. PLEASE test before you commit

* further cleaning

* bugfix completed for gui. now evaluating save and load

* cleanup prepare to merge

---------

Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
2023-07-06 15:00:57 +08:00
Tobias Lütke
31cfbb1013
Expose generation timings from server & update completions.js (#2116)
* use javascript generators as much cleaner API

Also add ways to access completion as promise and EventSource

* export llama_timings as struct and expose them in server

* update readme, update baked includes

* llama : uniform variable names + struct init

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-05 16:51:13 -04:00
Jesse Jojo Johnson
983b555e9d
Update Server Instructions (#2113)
* Update server instructions for web front end
* Update server README
* Remove duplicate OAI instructions
* Fix duplicate text

---------

Co-authored-by: Jesse Johnson <thatguy@jessejojojohnson.com>
2023-07-05 21:03:19 +03:00
Georgi Gerganov
ec326d350c
ggml : fix bug introduced in #1237 2023-07-05 20:44:11 +03:00
Georgi Gerganov
1b6efeab82
tests : fix test-grad0 2023-07-05 20:20:25 +03:00
Stephan Walter
1b107b8550
ggml : generalize quantize_fns for simpler FP16 handling (#1237)
* Generalize quantize_fns for simpler FP16 handling

* Remove call to ggml_cuda_mul_mat_get_wsize

* ci : disable FMA for mac os actions

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-05 19:13:06 +03:00
Jesse Jojo Johnson
8567c76b53
Update server instructions for web front end (#2103)
Co-authored-by: Jesse Johnson <thatguy@jessejojojohnson.com>
2023-07-05 18:13:35 +03:00
Johannes Gäßler
924dd22fd3
Quantized dot products for CUDA mul mat vec (#2067) 2023-07-05 14:19:42 +02:00
Howard Su
051c70dcd5
llama: Don't double count the sampling time (#2107) 2023-07-05 18:31:23 +08:00
Concedo
ea79e549f0 fixed refusing to quantize some models 2023-07-05 17:29:35 +08:00
Johannes Gäßler
9e4475f5cf
Fixed OpenCL offloading prints (#2082) 2023-07-05 08:58:05 +02:00
Nigel Bosch
7f0e9a775e
embd-input: Fix input embedding example unsigned int seed (#2105) 2023-07-05 07:33:33 +08:00
Georgi Gerganov
b472f3fca5
readme : add link web chat PR 2023-07-04 22:25:22 +03:00
Georgi Gerganov
ed9a54e512
ggml : sync latest (new ops, macros, refactoring) (#2106)
- add ggml_argmax()
- add ggml_tanh()
- add ggml_elu()
- refactor ggml_conv_1d() and variants
- refactor ggml_conv_2d() and variants
- add helper macros to reduce code duplication in ggml.c
2023-07-04 21:54:11 +03:00
jwj7140
f257fd2550
Add an API example using server.cpp similar to OAI. (#2009)
* add api_like_OAI.py
* add evaluated token count to server
* add /v1/ endpoints binding
2023-07-04 21:06:12 +03:00
Tobias Lütke
7ee76e45af
Simple webchat for server (#1998)
* expose simple web interface on root domain

* embed index and add --path for choosing static dir

* allow server to multithread

because web browsers send a lot of garbage requests we want the server
to multithread when serving 404s for favicons etc. To avoid blowing up
llama we just take a mutex when it's invoked.


* let's try this with the xxd tool instead and see if msvc is happier with that

* enable server in Makefiles

* add /completion.js file to make it easy to use the server from js

* slightly nicer css

* rework state management into session, expose historyTemplate to settings

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-04 16:05:27 +02:00
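The "allow server to multithread" item in the commit above describes serving cheap requests on many threads while taking a mutex around the model call. A minimal sketch of that pattern, in Python rather than the server's own C++, might look like this. All names here (`model_lock`, `run_model`, `handle_request`) and the request paths are hypothetical, and the model call is a stand-in.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

model_lock = threading.Lock()  # guards the single, non-thread-safe model
active = 0       # how many threads are inside run_model right now
max_active = 0   # highest value of `active` ever observed


def run_model(prompt):
    """Stand-in for an inference call that must not run concurrently."""
    global active, max_active
    active += 1
    max_active = max(max_active, active)
    result = prompt.upper()
    active -= 1
    return result


def handle_request(path, prompt=""):
    """Serve one request; only completions touch the model."""
    if path != "/completion":
        # Stray browser requests are answered without taking the lock.
        return 404, "not found"
    with model_lock:
        return 200, run_model(prompt)


# Eight worker threads handle a mix of completion and stray requests.
with ThreadPoolExecutor(max_workers=8) as pool:
    paths = ["/completion", "/favicon.ico"] * 8
    responses = list(pool.map(lambda p: handle_request(p, "hi"), paths))
```

Because every call to `run_model` happens while holding `model_lock`, `max_active` never exceeds 1 even though the 404 responses are produced concurrently.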
Henri Vasserman
acc111caf9
Allow old Make to build server. (#2098)
Also make server build by default.

Tested with Make 3.82
2023-07-04 15:38:04 +03:00
ZhouYuChen
23c7c6fc91
Update Makefile: clean simple (#2097) 2023-07-04 14:15:16 +02:00
Concedo
69add28324 Merge branch 'master' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
2023-07-04 18:51:42 +08:00
Concedo
00e35d0bbf Merge branch 'concedo' into concedo_experimental 2023-07-04 18:46:40 +08:00
Michael Moon
f9108ba401
Make koboldcpp.py executable on Linux (#293) 2023-07-04 18:46:08 +08:00
Concedo
fff705d4f6 Merge remote-tracking branch 'ycros/improve-sampler-api-access' into concedo_experimental 2023-07-04 18:42:02 +08:00
Concedo
c6c0afdf18 refactor to avoid code duplication 2023-07-04 18:35:54 +08:00
Concedo
784628a2be Merge remote-tracking branch 'ycros/improve-sampler-api-access' into concedo_experimental 2023-07-04 16:38:32 +08:00
Erik Scholz
698efad5fb
CI: make the brew update temporarily optional. (#2092)
until they decide to fix the brew installation in the macos runners.
see the open issues. eg https://github.com/actions/runner-images/pull/7710
2023-07-04 01:50:12 +02:00
Govlzkoy
14a2cc71f6
[ggml] fix index for ne03 value in ggml_cl_mul_f32 (#2088) 2023-07-04 07:50:00 +08:00