Concedo
0485fa65a2
wstring convert for mpt
2023-06-24 11:43:42 +08:00
Concedo
6d718525c4
Merge branch 'optimize_quants_upstream' into concedo_experimental
2023-06-23 23:56:31 +08:00
Concedo
f7b096374d
fixed string too long CI issue
2023-06-23 23:56:22 +08:00
Concedo
490cf395f8
better alloc error
2023-06-23 22:51:51 +08:00
Concedo
ece453ed09
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# CMakeLists.txt
# README.md
2023-06-23 22:46:54 +08:00
Concedo
f39a746089
bug fixes for openblas
2023-06-23 22:45:22 +08:00
Concedo
43c2891afa
option to not use scratch
2023-06-23 19:01:36 +08:00
Concedo
d5e4cf7ffe
handle ctx manip
2023-06-23 19:01:15 +08:00
Concedo
df9135e3a9
fixing memory bugs
2023-06-23 18:41:23 +08:00
eiery
d7b7484f74
Add OpenLLaMA instructions to the README (#1954)
...
* add openllama to readme
2023-06-23 10:38:01 +02:00
Erik Scholz
7487137227
rework convert.py to read hyper-parameters from config.json (#1958)
...
* Read hyper-parameters from the HuggingFace-transformers config.json, if it exists; otherwise fall back to guessing, as before.
This allows converting open_llama 3B and other non-standard model designs.
2023-06-22 14:20:47 +02:00
Concedo
0eedccaf06
Merge branch 'master' into optimize_quants_upstream
2023-06-22 17:59:58 +08:00
Concedo
e6ddb15c3a
cleanup
2023-06-22 10:38:27 +08:00
Johannes Gäßler
bbca06e269
cmake: revert CUDA arch default to 52, 61 if f16 (#1959)
2023-06-21 23:49:25 +02:00
Rahul Vivek Nair
fb98254f99
Fix typo in README.md (#1961)
2023-06-21 23:48:43 +02:00
Concedo
1b71752a9f
Implemented basic GPU offloading for MPT, GPT-2, GPT-J and GPT-NeoX
2023-06-22 00:43:25 +08:00
Ycros
b1f00fa9cc
Fix hordeconfig max context setting, and add Makefile flags for cuda F16/KQuants per iter. (#252)
...
* Fix hordeconfig maxcontext setting.
* cuda: Bring DMMV_F16 and KQUANTS_ITER Makefile flags over from llama.
2023-06-21 23:01:46 +08:00
Concedo
dfdd20240c
gpt j use scratch buffers
2023-06-21 16:10:31 +08:00
Georgi Gerganov
049aa16b8c
readme : add link to p1
2023-06-20 19:05:54 +03:00
Concedo
266d47a4b9
Merge branch 'optimize_quants_upstream' into concedo_experimental
2023-06-20 22:46:35 +08:00
Concedo
da668e685f
fixing address spaces
2023-06-20 22:46:11 +08:00
Concedo
cce6e67f44
fixing address spaces
2023-06-20 22:45:16 +08:00
Concedo
1f1735f5ad
Merge branch 'optimize_quants_upstream' into concedo_experimental
2023-06-20 21:39:35 +08:00
Concedo
6b75fc48b9
fixed global const struct types
2023-06-20 21:38:48 +08:00
Xiake Sun
2322ec223a
Fix typo (#1949)
2023-06-20 15:42:40 +03:00
Concedo
537ff22ec9
fixed a bug with token timings, updated lite
2023-06-20 20:41:42 +08:00
Concedo
c5ae3f50a7
Merge branch 'optimize_quants_upstream' into concedo_experimental
2023-06-20 18:41:13 +08:00
Concedo
a6e8b0216d
remove old dot kernels and template
2023-06-20 18:37:48 +08:00
Concedo
93247a11cd
ported q2k and q5k speedups
2023-06-20 18:37:41 +08:00
Concedo
029bed6446
ported q3k speedup successfully
2023-06-20 18:37:26 +08:00
Concedo
d754915269
Merge branch 'optimize_quants_upstream' into concedo_experimental
2023-06-20 17:26:39 +08:00
Concedo
b4c532e862
Merge branch 'master' into concedo_experimental
2023-06-20 17:26:27 +08:00
0cc4m
8d816d19d1
Add q6_k fast matmul kernel
2023-06-20 08:41:35 +02:00
0cc4m
34a4917984
Use preprocessor for QK_K
2023-06-20 08:04:16 +02:00
0cc4m
069cbe530d
Fix q2_k fast kernel
2023-06-20 08:01:40 +02:00
Ettore Di Giacinto
aacdbd4056
llama : fix params struct alignment (#1936)
...
* Workaround struct misalignment during value-copy
Signed-off-by: mudler <mudler@localai.io>
* Move booleans at the bottom of the structure
Signed-off-by: mudler <mudler@localai.io>
* Add comment
Signed-off-by: mudler <mudler@localai.io>
---------
Signed-off-by: mudler <mudler@localai.io>
2023-06-20 04:24:39 +03:00
Henri Vasserman
20568fe60f
[Fix] Reenable server embedding endpoint (#1937)
...
* Add back embedding feature
* Update README
2023-06-20 01:12:39 +03:00
Georgi Gerganov
18b35625c3
ggml : fix bug in LBFGS optimizer (found by ggml tests)
2023-06-19 20:43:30 +03:00
Concedo
69fd31d18c
Merge branch 'master' into optimize_quants_upstream
2023-06-19 23:38:59 +08:00
Concedo
5e8e99f206
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# CMakeLists.txt
2023-06-19 23:37:53 +08:00
l3utterfly
ba4e85a833
llama : use aligned memory during ggml_init call from loading saved sessions (#1934)
...
* fixed issue: memory is not guaranteed to be aligned properly during ggml_init call from loading saved sessions
* - removed commented out old code from fix
- updated another instance of same issue below original
2023-06-19 18:20:06 +03:00
Georgi Gerganov
23fc5c219a
cmake : fix trailing whitespaces
2023-06-19 18:18:34 +03:00
Kawrakow
cb40dfca69
llama : only use Q6_K for output weights if tensor size is multiple of 256 (#1932)
...
* Only use Q6_K for output weights if tensor size is multiple of 256
* Fixed copy/paste mistake
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-19 18:17:03 +03:00
Kawrakow
ca7c3f4da5
cuda : faster k-quants on older GPUs (#1930)
...
* k_quants: hopefully much faster Q4_K on older GPUs
On the GTX-1660 that I have available to represent
"old GPUs", token prediction drops from 65.5 ms/tok
to 41.5 ms/tok!
* k_quants: hopefully much faster Q3_K on older GPUs
On the GTX-1660 that I have available to represent
"old GPUs", token prediction drops from 60.3 ms/tok
to 41.0 ms/tok!
* k_quants: faster Q2_K on older GPUs
It looks like I didn't need to change anything
compared to what we already had, so this is just
adding clarifying comments. But I now measure
36.3 ms/tok on the GTX-1660, instead of the
47.2 ms/tok that I have written in the faster
k-quants PR.
* k_quants: faster Q5_K on older GPUs
68.5 ms/tok -> 62.0 ms/tok on GTX-1660.
For some reason the same access pattern that leads
to such resounding success for Q2_K to Q4_K did not
work at all for Q5_K.
It is also more difficult to measure because for Q5_K_S
we only have 32 layers on the GTX-1660, so output, tok embeddings
and kv cache are done on the CPU.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-19 18:14:09 +03:00
Georgi Gerganov
b97ca431db
ggml : sync latest ggml repo (#1924)
...
* ggml : sync latest ggml repo
* ggml : remove unused comments
* ggml : asserts
2023-06-19 18:12:33 +03:00
Howard Su
1e3abfcef0
cmake : fix build shared ggml when CUDA is enabled (#1929)
...
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-19 18:10:37 +03:00
Concedo
c94a438328
xx + ib0
2023-06-19 23:01:49 +08:00
Concedo
266d436746
Added broken new q4k quant
2023-06-19 22:41:35 +08:00
Concedo
51e834c27b
keep duplicate targets for now
2023-06-19 22:38:23 +08:00
Concedo
cf94340dfc
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# CMakeLists.txt
# Makefile
# README.md
2023-06-19 22:28:38 +08:00