Commit graph

827 commits

Author SHA1 Message Date
Georgi Gerganov
827f5eda91
readme : update hot topics 2023-06-04 23:38:19 +03:00
Georgi Gerganov
ecb217db4f
llama : Metal inference (#1642)
* mtl : export the LLaMA computation graph

* ci : disable temporary

* mtl : adapt the MNIST example as starter

* mtl : no need for mtl-export tool, add cli arg for main instead

* mtl : export just a small part of the graph for now to make it easier

* mtl : move MSL code into separate file for easy editing

* mtl : initial get_rows_q4_0 kernel

* mtl : confirmed get_rows_q4_0 is working correctly

* mtl : add rms_norm kernel + confirm working

* mtl : add mul kernel + confirm working

* mtl : initial mul_mat Q4 kernel (wrong results)

* mtl : mul_mat fixes (still wrong)

* mtl : another mul_mat Q4 (still does not work)

* mtl : working mul_mat q4

* ggml : fix handling of "view" ops in ggml_graph_import()

* mtl : add rope kernel

* mtl : add reshape and transpose handling

* ggml : store offset as opt arg for ggml_view_xd() operators

* mtl : add cpy kernel + handle view ops

* mtl : confirm f16 x f32 attention mul mat

* mtl : add scale kernel

* mtl : add diag_mask_inf kernel

* mtl : fix soft_max kernel

* ggml : update ggml_nbytes() to handle non-contiguous tensors

* mtl : verify V tensor contents

* mtl : add f32 -> f32 cpy kernel

* mtl : add silu kernel

* mtl : add non-broadcast mul kernel

* mtl : full GPU inference of the computation graph

* mtl : optimize rms_norm and soft_max kernels

* mtl : add f16 mat x f32 vec multiplication kernel

* mtl : fix bug in f16 x f32 mul mat + speed-up computation

* mtl : faster mul_mat_q4_0_f32 kernel

* mtl : fix kernel signature + roll inner loop

* mtl : more threads for rms_norm + better timing

* mtl : remove printfs from inner loop

* mtl : simplify implementation

* mtl : add save/load vocab to ggml file

* mtl : plug Metal inference into llama.cpp (very quick-n-dirty)

* mtl : make it work with main example

Lots of hacks but at least now it generates text

* mtl : preparing for merge

* mtl : clean-up ggml mtl interface + support scratch / inplace

* mtl : remove temp / debug code

* metal : final refactoring and simplification

* Revert "ci : disable temporary"

This reverts commit 98c267fc77.

* metal : add comments

* metal : clean-up stuff, fix typos

* readme : add Metal instructions

* readme : add example for main
2023-06-04 23:34:30 +03:00
0cc4m
dcb2ed4826
OpenCL: Fix duplication of layers in VRAM and RAM, add GPU mul kernel (#1653)
* Use events instead of clFinish, where possible

* OpenCL: Don't load gpu layers into RAM, add mul_f32 kernel

* Reduce queueing overhead for contiguous tensors by using single mul kernel call

* Adapt to #1612 cl_mem malloc changes

* Reduce code duplication between cuda and opencl branches

* Improve implementation
2023-06-04 08:12:05 +02:00
Henri Vasserman
d8bd0013e8
Add info about CUDA_VISIBLE_DEVICES (#1682) 2023-06-03 16:35:20 +03:00
Jiří Podivín
b5c85468a3
Docker: change to calling convert.py (#1641)
Deprecation disclaimer was added to convert-pth-to-ggml.py
2023-06-03 15:11:53 +03:00
Evan Jones
136476e898
Fix prompt cache saving and chat-persistent rollover (#1678)
* Fix prompt cache saving and chat-persistent rollover (fixes #1670)

* clang-tidy

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2023-06-03 07:28:45 -04:00
Randall Fitzgerald
df2ecc942a
Merge pull request #18 from anon998/update-readme
Update readme + parse --mlock and --no-mmap
2023-06-02 17:04:25 -04:00
anon
98ae2de017
parse --mlock and --no-mmap + format 2023-06-02 17:54:46 -03:00
anon
05a5a485b8
make help text load faster 2023-06-02 17:52:04 -03:00
anon
a6ed390cc6
update readme 2023-06-02 17:48:29 -03:00
anon
e1e2be2146
remove --keep from help text 2023-06-02 17:47:42 -03:00
Randall Fitzgerald
5758e9f09b
Removed embedding from flags. 2023-06-02 08:31:12 -07:00
Randall Fitzgerald
310bf61496
Merge pull request #17 from SlyEcho/server_refactor
improve docs and example
2023-06-02 11:25:01 -04:00
Randall Fitzgerald
de6df486e9
Removed embedding from README 2023-06-02 08:24:46 -07:00
Henri Vasserman
bcd616700e
improve docs and example 2023-06-02 18:06:02 +03:00
digiwombat
7cebe2eaf8
Merge branch 'master' of https://github.com/digiwombat/llama.cpp 2023-06-02 10:06:04 -04:00
digiwombat
16e1c9813a
Removed the embedding api endpoint and associated code. 2023-06-02 10:05:52 -04:00
Randall Fitzgerald
4dd72fc6e4
Merge pull request #16 from anon998/fix-log-json
Replace invalid characters instead of crashing.
2023-06-02 09:43:29 -04:00
anon
41bb71bde7
replace invalid characters instead of crashing
While logging the requests.
2023-06-02 10:37:13 -03:00
digiwombat
3ff27d30e3
Fixed up a few things in embedding mode. 2023-06-02 09:20:53 -04:00
Randall Fitzgerald
28cc0cdc50
Merge pull request #15 from SlyEcho/server_refactor
Improve long input truncation and add more verbose logging
2023-06-02 08:47:54 -04:00
Henri Vasserman
3df0192804
improve long input truncation
and add more verbose logging
2023-06-02 15:19:05 +03:00
Randall Fitzgerald
1bd52c8627
Merge branch 'ggerganov:master' into master 2023-06-02 07:31:55 -04:00
Randall Fitzgerald
f5d5e7020d
Merge pull request #14 from anon998/do-completion-update
Trim partial stopping strings when not streaming and move multibyte check.
2023-06-02 07:30:53 -04:00
anon
f820740dad
move multibyte check to doCompletion 2023-06-02 08:27:23 -03:00
anon
8f9e546b51
trim partial stopping strings when not streaming 2023-06-02 08:25:31 -03:00
Randall Fitzgerald
bebea657cb
Merge pull request #13 from anon998/small-fixes
Small fixes.
2023-06-02 06:53:10 -04:00
anon998
abb7782745
Merge branch 'master' into small-fixes 2023-06-02 10:35:06 +00:00
Henri Vasserman
88cc7bb6f7
Stuff with logits 2023-06-02 13:29:57 +03:00
anon
47efbb5cf3
use std::isinf to check if ignore_eos is active 2023-06-02 07:19:21 -03:00
anon
2932db15a3
avoid creating element in logit_bias accidentally 2023-06-02 06:59:11 -03:00
anon
a8a9f19689
small fixes 2023-06-02 06:01:10 -03:00
anon
49dce94885
make types match gpt_params exactly 2023-06-02 06:01:10 -03:00
anon
1488a0f528
make functions that never return false void 2023-06-02 06:00:48 -03:00
anon
ebfead6e5a
remove unused variables 2023-06-02 05:45:57 -03:00
anon
731ecc0d1b
fix typo 2023-06-02 05:45:16 -03:00
Henri Vasserman
0bc047730f
Apply suggestions from code review
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2023-06-02 10:29:09 +03:00
Randall Fitzgerald
d29b6d5f55
Merge pull request #12 from anon998/clear-logit-bias
Clear logit bias between requests.
2023-06-01 08:58:35 -04:00
anon
8cbc4be6c2
clear logit_bias between requests + print 2023-06-01 09:49:50 -03:00
anon
6025476e39
default penalize_nl back to true 2023-06-01 09:49:16 -03:00
anon
49a18bdd14
remove unused parameter warning 2023-06-01 09:41:35 -03:00
Randall Fitzgerald
af711263ae
Merge pull request #11 from SlyEcho/server_refactor
Server refactor
2023-06-01 08:10:55 -04:00
Randall Fitzgerald
797155a0d1
Merge pull request #10 from cirk2/master
Add Options endpoints and Access-Control-Allow-Headers to satisfy CORS
2023-06-01 08:10:26 -04:00
Henri Vasserman
9531ae60db
Add logit bias support 2023-06-01 13:57:47 +03:00
Henri Vasserman
8c6a5fc92b
last tokens fixes 2023-06-01 13:18:12 +03:00
Felix Hellmann
5bbc030338
Add Options endpoints and Access-Control-Allow-Headers to satisfy CORS rules 2023-06-01 10:47:53 +02:00
digiwombat
f7882e2d69
Fixed a crash caused by erasing from empty last_n_tokens 2023-05-31 20:35:28 -04:00
Randall Fitzgerald
5f6e16da36
Merge pull request #9 from anon998/stopping-strings
Fix stopping strings.
2023-05-31 20:05:18 -04:00
anon
e9b1f0bf5c
fix stopping strings 2023-05-31 21:00:21 -03:00
digiwombat
342604bb81
Added a super simple CORS header as default for all endpoints. 2023-05-31 19:54:05 -04:00