Commit graph

428 commits

Author SHA1 Message Date
ajasibley
0f4c13fc43 Updated justfile, added script to set shell path, and fixed Cargo.toml version issue. 2023-05-25 01:19:57 -07:00
ajasibley
875b385d79
Update README.md 2023-05-24 17:37:37 -07:00
ajasibley
9580a547d9 Updated README. 2023-05-24 16:56:17 -07:00
ajasibley
5d66c80d99 Cleaned up shell.nix by removing bash commands and replacing them with just recipes. 2023-05-24 15:07:33 -07:00
ajasibley
c0e8da7912
Update README.md 2023-05-24 04:41:50 +00:00
ajasibley
e8ef92a738
Update README.md 2023-05-24 04:30:05 +00:00
ajasibley
988dd73d82
Merge pull request #7 from plurigrid/cosmonic
Removed Vespa project
2023-05-24 04:28:26 +00:00
Aja Sibley
2df76d9064 Removed Vespa project. 2023-05-24 04:26:59 +00:00
barton ⊛
74ac17d3ad
Merge pull request #5 from plurigrid/cosmonic
Cosmonic wasmCloud via nix shell, improving dev experience: tested on aarch64
2023-05-24 04:11:22 +00:00
ajasibley
025f974d68 Updated readme 2023-05-23 20:24:13 -07:00
ajasibley
5f727081bc Fixed env error 2023-05-23 19:15:11 -07:00
ajasibley
3d6d096ab8 Update shell.nix
Added Cosmonic to path
2023-05-23 23:10:25 +00:00
ajasibley
b45efa3239 Updated shell.nix
Added install and setup for Cosmonic and its dependencies for Apple Silicon Macs and Nvidia Jetsons.
2023-05-23 22:20:17 +00:00
ajasibley
9a565fc9a1
Update README.md 2023-05-18 12:39:44 +00:00
ajasibley
88b988e74b
Update README.md 2023-05-18 12:38:18 +00:00
ajasibley
2dd8d16948
Update README.md 2023-05-18 12:38:06 +00:00
ajasibley
3c0ab1f8b8
Update README.md 2023-05-18 12:37:35 +00:00
ajasibley
954dccfd22
Update README.md 2023-05-18 12:37:05 +00:00
Aja Sibley
4905d355b5 Updated shell.nix to install cosmo if not already installed. Added README and cleaned up repo. 2023-05-18 12:36:09 +00:00
Barton Rhodes
119bddaa32 authenticity is all you need 2023-05-18 03:22:13 +00:00
Aja Sibley
26e842a342 Added error log for cosmo. 2023-05-18 01:07:10 +00:00
Aja Sibley
985ab154ec Added openssl to nix for cosmonic. 2023-05-18 00:07:39 +00:00
Aja Sibley
c3ad27c5df Updated comments. 2023-05-17 21:19:50 +00:00
Aja Sibley
933bb643a7 Fixed the vespa overlay. 2023-05-17 21:18:23 +00:00
Aja Sibley
19e8d6e288 Added vespa overlay. 2023-05-17 20:52:11 +00:00
Barton Rhodes
7c36e03dfb too many hooved animals! 2023-04-21 05:07:57 +00:00
Barton Rhodes
6ffe4680ca fiat nexus ▦ 2023-04-20 22:38:56 +00:00
barton ⊛
041627284d
Merge branch 'ggerganov:master' into main 2023-04-20 22:21:31 +00:00
Stephan Walter
2510c1831f
Add ggml-model-*.bin checksums for 7B, 13B, 30B, 65B (#1088)
* Add ggml-model-*.bin checksums for 7B, 13B, 30B
* Add ggml-model-*.bin checksums for 65B

---------

Co-authored-by: Pavol Rusnak <pavol@rusnak.io>
2023-04-20 23:56:44 +02:00
Georgi Gerganov
12b5900dbc
ggml : sync ggml (add GPT-NeoX RoPE implementation) 2023-04-20 23:32:59 +03:00
Georgi Gerganov
9ff334f3c9
ggml : fix bug in ggml_compute_forward_dup_f32() 2023-04-20 21:58:38 +03:00
slaren
2005469ea1
Add Q4_3 support to cuBLAS (#1086) 2023-04-20 20:49:53 +02:00
Georgi Gerganov
8a1756abdf
ggml : do not break cuBLAS build (Q4_3 is not yet implemented) 2023-04-20 21:43:50 +03:00
Georgi Gerganov
66aab46079
ggml : fix Q4_3 quantization
Broke it during conflict resolution in last PR
2023-04-20 20:44:05 +03:00
Kawrakow
38de86a711
llama : multi-threaded quantization (#1075)
* Multi-threading quantization.

Not much gain for simple quantizations, but it will be important
for quantizations that require more CPU cycles.

* Multi-threading for quantize-stats

It now does the job in ~14 seconds on my Mac for
Q4_0, Q4_1 and Q4_2. Single-threaded it was taking
more than 2 minutes after adding the more elaborate
version of Q4_2.

* Reviewer comments

* Avoiding compiler confusion

After changing chunk_size to const int as suggested by
@ggerganov, clang and GCC started warning me that I don't
need to capture it in the lambda. So, I removed it from the
capture list. But that makes the MSVC build fail. So,
making it a constexpr to make every compiler happy.

* Still fighting with lambda captures in MSVC

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-04-20 20:42:27 +03:00
Georgi Gerganov
e0305ead3a
ggml : add Q4_3 quantization (#1082) 2023-04-20 20:35:53 +03:00
Ivan Komarov
6a9661ea5a
ci : remove the LLAMA_ACCELERATE matrix dimension from Ubuntu builds in the CI (#1074)
[Accelerate](https://developer.apple.com/documentation/accelerate) is an Apple framework which can only be used on macOS, and the CMake build [ignores](https://github.com/ggerganov/llama.cpp/blob/master/CMakeLists.txt#L102) the `LLAMA_ACCELERATE` variable when run on non-Apple platforms. This implies setting `LLAMA_ACCELERATE` is a no-op on Ubuntu and can be removed.

This will reduce visual noise in CI check results (in addition to reducing the number of checks we have to run for every PR). Right now every sanitized build is duplicated twice for no good reason (e.g., we have `CI / ubuntu-latest-cmake-sanitizer (ADDRESS, Debug, ON)` and `CI / ubuntu-latest-cmake-sanitizer (ADDRESS, Debug, OFF)`).
2023-04-20 18:15:18 +03:00
源文雨
5addcb120c
fix: LLAMA_CUBLAS=1 undefined reference 'shm_open' (#1080) 2023-04-20 15:28:43 +02:00
Stephan Walter
c8c2c52482
AVX2 optimization for vec_dot_q4_2_q8_0 (#1068) 2023-04-20 08:45:41 +02:00
slaren
02d6988121
Improve cuBLAS performance by dequantizing on the GPU (#1065) 2023-04-20 03:14:14 +02:00
CRD716
834695fe3a
Minor: Readme fixed grammar, spelling, and misc updates (#1071) 2023-04-19 19:52:14 +00:00
Kawrakow
f7d05095b4
Q4_2 quantization with rmse-optimized scale and quants (#1062)
* Q4_2 quantization with rmse-optimized scale and quants

For quantize-stats we get
q4_2: rmse 0.00159301, maxerr 0.17480469, 95pct<0.0030, median<0.0012

For 7B perplexity with BLAS enabled we get 6.2038 after 655 chunks.

Quantization is slow (~90 seconds on my Mac for 7B) as not
multi-threaded as in PR #896.

* ggml : satisfy the sanitizer builds

Not sure why this makes them fail

* Better follow ggml conventions for function names

* Fixed type as per reviewer comment

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-04-19 20:20:14 +02:00
Georgi Gerganov
884e7d7a2b
ggml : use 8-bit precision for Q4_1 intermediate results (#1047)
* ggml : use 8-bit precision for Q4_1 intermediate results (ARM)

* ggml : optimize ggml_vec_dot_q4_1_q8_0() via vmalq_n_f32

56 ms/token with Q4_1 !

* ggml : AVX2 implementation of ggml_vec_dot_q4_1_q8_0 (#1051)

* gitignore : ignore ppl-*.txt files

---------

Co-authored-by: slaren <2141330+slaren@users.noreply.github.com>
2023-04-19 20:10:08 +03:00
Georgi Gerganov
7cd5c4a3e9
readme : add warning about Q4_2 and Q4_3 2023-04-19 19:07:54 +03:00
Stephan Walter
f3d4edf504
ggml : Q4 cleanup - remove 4-bit dot product code (#1061)
* Q4 cleanup

* Remove unused AVX512 Q4_0 code
2023-04-19 19:06:37 +03:00
slaren
8944a13296
Add NVIDIA cuBLAS support (#1044) 2023-04-19 11:22:45 +02:00
slaren
6667401238
Multi-threaded ggml_cpy (#1035)
* Multi-threaded ggml_cpy

* Update ggml.c

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Also fix wdata offset in ggml_compute_forward_add_q_f32

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-04-19 00:53:24 +02:00
Georgi Gerganov
77a73403ca
ggml : add new Q4_2 quantization (ARM only) (#1046)
* ggml : Q4_2 ARM

* ggml : add ggml_is_quantized()

* llama : update llama_type_name() with Q4_2 entry

* ggml : speed-up q4_2

- 4 threads: ~100ms -> ~90ms
- 8 threads:  ~55ms -> ~50ms

* ggml : optimize q4_2 using vmlaq_n_f32 + vmulq_n_f32
2023-04-18 23:54:57 +03:00
Georgi Gerganov
50a8a2af97
ggml : scratch that - vmlaq_n_f32 is always better
Had a background process that was messing with the timings
2023-04-18 23:11:23 +03:00
Georgi Gerganov
4caebf6d40
gitignore : vdot 2023-04-18 23:00:08 +03:00