Concedo
a6e8b0216d
remove old dot kernels and template
2023-06-20 18:37:48 +08:00
Concedo
93247a11cd
ported q2k and q5k speedups
2023-06-20 18:37:41 +08:00
Concedo
029bed6446
ported q3k speedup successfully
2023-06-20 18:37:26 +08:00
Concedo
d754915269
Merge branch 'optimize_quants_upstream' into concedo_experimental
2023-06-20 17:26:39 +08:00
Concedo
b4c532e862
Merge branch 'master' into concedo_experimental
2023-06-20 17:26:27 +08:00
0cc4m
8d816d19d1
Add q6_k fast matmul kernel
2023-06-20 08:41:35 +02:00
0cc4m
34a4917984
Use preprocessor for QK_K
2023-06-20 08:04:16 +02:00
0cc4m
069cbe530d
Fix q2_k fast kernel
2023-06-20 08:01:40 +02:00
Ettore Di Giacinto
aacdbd4056
llama : fix params struct alignment ( #1936 )
...
* Workaround struct misalignment during value-copy
Signed-off-by: mudler <mudler@localai.io>
* Move booleans at the bottom of the structure
Signed-off-by: mudler <mudler@localai.io>
* Add comment
Signed-off-by: mudler <mudler@localai.io>
---------
Signed-off-by: mudler <mudler@localai.io>
2023-06-20 04:24:39 +03:00
Henri Vasserman
20568fe60f
[Fix] Reenable server embedding endpoint ( #1937 )
...
* Add back embedding feature
* Update README
2023-06-20 01:12:39 +03:00
Georgi Gerganov
18b35625c3
ggml : fix bug in LBFGS optimizer (found by ggml tests)
2023-06-19 20:43:30 +03:00
Concedo
69fd31d18c
Merge branch 'master' into optimize_quants_upstream
2023-06-19 23:38:59 +08:00
Concedo
5e8e99f206
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# CMakeLists.txt
2023-06-19 23:37:53 +08:00
l3utterfly
ba4e85a833
llama : use aligned memory during ggml_init call from loading saved sessions ( #1934 )
...
* fixed issue: memory is not guaranteed to be aligned properly during ggml_init call from loading saved sessions
* - removed commented out old code from fix
- updated another instance of same issue below original
2023-06-19 18:20:06 +03:00
Georgi Gerganov
23fc5c219a
cmake : fix trailing whitespaces
2023-06-19 18:18:34 +03:00
Kawrakow
cb40dfca69
llama : only use Q6_K for output weights if tensor size is multiple of 256 ( #1932 )
...
* Only use Q6_K for output weights if tensor size is multiple of 256
* Fixed copy/paste mistake
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-19 18:17:03 +03:00
Kawrakow
ca7c3f4da5
cuda : faster k-quants on older GPUs ( #1930 )
...
* k_quants: hopefully much faster Q4_K on older GPUs
On the GTX-1660 that I have available to represent
"old GPUs", token prediction drops from 65.5 ms/tok
to 41.5 ms/tok!
* k_quants: hopefully much faster Q3_K on older GPUs
On the GTX-1660 that I have available to represent
"old GPUs", token prediction drops from 60.3 ms/tok
to 41.0 ms/tok!
* k_quants: faster Q2_K on older GPUs
It looks like I didn't need to change anything
compared to what we already had, so this is just
adding clarifying comments. But I now measure
36.3 ms/tok on the GTX-1660, instead of the
47.2 ms/tok that I have written in the faster
k-quants PR.
* k_quants: faster Q5_K on older GPUs
68.5 ms/tok -> 62.0 ms/tok on GTX-1660.
For some reason the same access pattern that leads
to such resounding success for Q2_K to Q4_K did not
work at all for Q5_K.
It is also more difficult to measure because for Q5_K_S
we only have 32 layers on the GTX-1660, so output, tok embeddings
and kv cache are done on the CPU.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-19 18:14:09 +03:00
Georgi Gerganov
b97ca431db
ggml : sync latest ggml repo ( #1924 )
...
* ggml : sync latest ggml repo
* ggml : remove unused comments
* ggml : asserts
2023-06-19 18:12:33 +03:00
Howard Su
1e3abfcef0
cmake : fix build shared ggml when CUDA is enabled ( #1929 )
...
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-19 18:10:37 +03:00
Concedo
c94a438328
xx + ib0
2023-06-19 23:01:49 +08:00
Concedo
266d436746
Added broken new q4k quant
2023-06-19 22:41:35 +08:00
Concedo
51e834c27b
keep duplicate targets for now
2023-06-19 22:38:23 +08:00
Concedo
cf94340dfc
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# CMakeLists.txt
# Makefile
# README.md
2023-06-19 22:28:38 +08:00
Concedo
8e2dc19dc6
updated tokenizer, added support for scratch buffers for neox and gpt2
2023-06-19 21:29:06 +08:00
Johannes Gäßler
16b9cd1939
Convert vector to f16 for dequantize mul mat vec ( #1913 )
...
* Convert vector to f16 for dmmv
* compile option
* Added compilation option description to README
* Changed cmake CUDA_ARCHITECTURES from "OFF" to "native"
2023-06-19 10:23:56 +02:00
Concedo
cb6daa3171
updated lite
2023-06-19 11:51:23 +08:00
Johannes Gäßler
b24c3049d9
Added tokens per second to info prints ( #1928 )
2023-06-18 17:41:26 +02:00
Concedo
d0d3c4f32b
Merge remote-tracking branch 'origin/master' into concedo_experimental
...
# Conflicts:
# README.md
2023-06-18 22:53:10 +08:00
Johannes Gäßler
0ede372a51
Fixed incorrectly applying RMS norm twice ( #1925 )
2023-06-18 16:07:09 +02:00
l3utterfly
8596af4277
ggml : fix bug in ggml_compute_forward_add_q_f32 ( #1918 )
2023-06-18 14:19:16 +03:00
Concedo
b08b371983
allow hordeconfig to set a max ctx length too.
2023-06-18 16:42:32 +08:00
Mike
e1886cf4fe
readme : update Android build instructions ( #1922 )
...
Add steps for using Termux on Android devices to prevent common errors.
2023-06-18 11:28:26 +03:00
Kawrakow
8ab8ba62eb
llama : prevent usage of k-quants when tensor size is not a multiple of 256 ( #1921 )
...
* Fix examples/metal
* k-quants: prevent usage when tensor size is not divisible by 256
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-18 11:13:43 +03:00
Kawrakow
90cc59d6ab
examples : fix examples/metal ( #1920 )
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-18 10:52:10 +03:00
Concedo
278427d9a4
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# CMakeLists.txt
# Makefile
# README.md
2023-06-18 15:29:44 +08:00
Concedo
8775dd99f4
various debug logging improvements
2023-06-18 15:24:58 +08:00
Georgi Gerganov
ce2c7d72e2
metal : handle buffers larger than device's maxBufferLength ( #1826 )
...
* metal : handle buffers larger than device's maxBufferLength
* metal : print more verbose device info + handle errors
* metal : fix prints for overlapping views
* metal : minimize view overlap to try to utilize device memory better
2023-06-18 09:09:47 +03:00
Howard Su
57cd69460f
cmake : add CUDA_ARCHITECTURES to new target ggml_static ( #1917 )
2023-06-18 07:29:47 +03:00
Georgi Gerganov
b2416493ab
make : do not print help for simple example
2023-06-17 20:55:03 +03:00
Georgi Gerganov
4f9c43e3bd
minor : warning fixes
2023-06-17 20:24:11 +03:00
Johannes Gäßler
2c9380dd2f
Only one CUDA stream per device for async compute ( #1898 )
2023-06-17 19:15:02 +02:00
Georgi Gerganov
051e1b0e6a
llama : fix kv_cache n init ( close #1903 )
2023-06-17 19:31:20 +03:00
DaniAndTheWeb
86c7571864
make : update for latest Arch ( #1701 )
...
With the upcoming change to the openblas package in Arch, the Makefile workaround is no longer needed.
2023-06-17 19:17:22 +03:00
Howard Su
3d59ec5935
ggml : fix warnings under MSVC ( #1908 )
2023-06-17 18:46:15 +03:00
Concedo
dc3472eb58
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# flake.nix
2023-06-17 23:10:05 +08:00
Concedo
dbd11ddd60
up ver
2023-06-17 23:08:14 +08:00
Aaron Miller
0711a5f6dc
metal : add norm, cpy f16->f16, alibi kernels ( #1823 )
2023-06-17 17:37:49 +03:00
Concedo
8bc4143e14
Merge branch 'concedo' into concedo_experimental
2023-06-17 22:29:38 +08:00
Faez Shakil
fc45a81bc6
exposed modules so that they can be invoked by nix run github:ggerganov/llama.cpp#server etc ( #1863 )
2023-06-17 14:13:05 +02:00
Concedo
9f8e2f8a18
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# CMakeLists.txt
# Makefile
# README.md
# pocs/vdot/vdot.cpp
# scripts/verify-checksum-models.py
# tests/test-quantize-fns.cpp
# tests/test-quantize-perf.cpp
# tests/test-sampling.cpp
# tests/test-tokenizer-0.cpp
2023-06-17 20:02:32 +08:00