Commit graph

428 commits

Author SHA1 Message Date
ajasibley
0f4c13fc43 Updated justfile, added script to set shell path, and fixed Cargo.toml version issue. 2023-05-25 01:19:57 -07:00
ajasibley
875b385d79
Update README.md 2023-05-24 17:37:37 -07:00
ajasibley
9580a547d9 Updated README. 2023-05-24 16:56:17 -07:00
ajasibley
5d66c80d99 Cleaned up shell.nix by removing bash commands and replacing them with just recipes. 2023-05-24 15:07:33 -07:00
ajasibley
c0e8da7912
Update README.md 2023-05-24 04:41:50 +00:00
ajasibley
e8ef92a738
Update README.md 2023-05-24 04:30:05 +00:00
ajasibley
988dd73d82
Merge pull request #7 from plurigrid/cosmonic
Removed Vespa project
2023-05-24 04:28:26 +00:00
Aja Sibley
2df76d9064 Removed Vespa project. 2023-05-24 04:26:59 +00:00
barton ⊛
74ac17d3ad
Merge pull request #5 from plurigrid/cosmonic
Cosmonic wasmCloud via nix shell, improving dev experience: tested on aarch64
2023-05-24 04:11:22 +00:00
ajasibley
025f974d68 Updated readme 2023-05-23 20:24:13 -07:00
ajasibley
5f727081bc Fixed env error 2023-05-23 19:15:11 -07:00
ajasibley
3d6d096ab8 Update shell.nix
Added Cosmonic to path
2023-05-23 23:10:25 +00:00
ajasibley
b45efa3239 Updated shell.nix
Added install and setup for Cosmonic and its dependencies for Apple Silicon Macs and Nvidia Jetsons.
2023-05-23 22:20:17 +00:00
ajasibley
9a565fc9a1
Update README.md 2023-05-18 12:39:44 +00:00
ajasibley
88b988e74b
Update README.md 2023-05-18 12:38:18 +00:00
ajasibley
2dd8d16948
Update README.md 2023-05-18 12:38:06 +00:00
ajasibley
3c0ab1f8b8
Update README.md 2023-05-18 12:37:35 +00:00
ajasibley
954dccfd22
Update README.md 2023-05-18 12:37:05 +00:00
Aja Sibley
4905d355b5 Updated shell.nix to install cosmo if not already installed. Added README and cleaned up repo. 2023-05-18 12:36:09 +00:00
Barton Rhodes
119bddaa32 authenticity is all you need 2023-05-18 03:22:13 +00:00
Aja Sibley
26e842a342 Added error log for cosmo. 2023-05-18 01:07:10 +00:00
Aja Sibley
985ab154ec Added openssl to nix for cosmonic. 2023-05-18 00:07:39 +00:00
Aja Sibley
c3ad27c5df Updated comments. 2023-05-17 21:19:50 +00:00
Aja Sibley
933bb643a7 Fixed the vespa overlay. 2023-05-17 21:18:23 +00:00
Aja Sibley
19e8d6e288 Added vespa overlay. 2023-05-17 20:52:11 +00:00
Barton Rhodes
7c36e03dfb too many hooved animals! 2023-04-21 05:07:57 +00:00
Barton Rhodes
6ffe4680ca fiat nexus ▦ 2023-04-20 22:38:56 +00:00
barton ⊛
041627284d
Merge branch 'ggerganov:master' into main 2023-04-20 22:21:31 +00:00
Stephan Walter
2510c1831f
Add ggml-model-*.bin checksums for 7B, 13B, 30B, 65B (#1088)
* Add ggml-model-*.bin checksums for 7B, 13B, 30B
* Add ggml-model-*.bin checksums for 65B

---------

Co-authored-by: Pavol Rusnak <pavol@rusnak.io>
2023-04-20 23:56:44 +02:00
Georgi Gerganov
12b5900dbc
ggml : sync ggml (add GPT-NeoX RoPE implementation) 2023-04-20 23:32:59 +03:00
Georgi Gerganov
9ff334f3c9
ggml : fix bug in ggml_compute_forward_dup_f32() 2023-04-20 21:58:38 +03:00
slaren
2005469ea1
Add Q4_3 support to cuBLAS (#1086) 2023-04-20 20:49:53 +02:00
Georgi Gerganov
8a1756abdf
ggml : do not break cuBLAS build (Q4_3 is not yet implemented) 2023-04-20 21:43:50 +03:00
Georgi Gerganov
66aab46079
ggml : fix Q4_3 quantization
Broke it during conflict resolution in last PR
2023-04-20 20:44:05 +03:00
Kawrakow
38de86a711
llama : multi-threaded quantization (#1075)
* Multi-threading quantization.

Not much gain for simple quantizations, but it will be important
for quantizations that require more CPU cycles.

* Multi-threading for quantize-stats

It now does the job in ~14 seconds on my Mac for
Q4_0, Q4_1 and Q4_2. Single-threaded it was taking
more than 2 minutes after adding the more elaborate
version of Q4_2.

* Reviewer comments

* Avoiding compiler confusion

After changing chunk_size to const int as suggested by
@ggerganov, clang and GCC started warning me that I don't
need to capture it in the lambda. So, I removed it from the
capture list. But that makes the MSVC build fail. So,
making it a constexpr to make every compiler happy.

* Still fighting with lambda captures in MSVC

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-04-20 20:42:27 +03:00
Georgi Gerganov
e0305ead3a
ggml : add Q4_3 quantization (#1082) 2023-04-20 20:35:53 +03:00
Ivan Komarov
6a9661ea5a
ci : remove the LLAMA_ACCELERATE matrix dimension from Ubuntu builds in the CI (#1074)
[Accelerate](https://developer.apple.com/documentation/accelerate) is an Apple framework which can only be used on macOS, and the CMake build [ignores](https://github.com/ggerganov/llama.cpp/blob/master/CMakeLists.txt#L102) the `LLAMA_ACCELERATE` variable when run on non-Apple platforms. This implies setting `LLAMA_ACCELERATE` is a no-op on Ubuntu and can be removed.

This will reduce visual noise in CI check results (in addition to reducing the number of checks we have to run for every PR). Right now every sanitized build is duplicated twice for no good reason (e.g., we have `CI / ubuntu-latest-cmake-sanitizer (ADDRESS, Debug, ON)` and `CI / ubuntu-latest-cmake-sanitizer (ADDRESS, Debug, OFF)`).
2023-04-20 18:15:18 +03:00
源文雨
5addcb120c
fix: LLAMA_CUBLAS=1 undefined reference 'shm_open' (#1080) 2023-04-20 15:28:43 +02:00
Stephan Walter
c8c2c52482
AVX2 optimization for vec_dot_q4_2_q8_0 (#1068) 2023-04-20 08:45:41 +02:00
slaren
02d6988121
Improve cuBLAS performance by dequantizing on the GPU (#1065) 2023-04-20 03:14:14 +02:00
CRD716
834695fe3a
Minor: Readme fixed grammar, spelling, and misc updates (#1071) 2023-04-19 19:52:14 +00:00
Kawrakow
f7d05095b4
Q4_2 quantization with rmse-optimized scale and quants (#1062)
* Q4_2 quantization with rmse-optimized scale and quants

For quantize-stats we get
q4_2: rmse 0.00159301, maxerr 0.17480469, 95pct<0.0030, median<0.0012

For 7B perplexity with BLAS enabled we get 6.2038 after 655 chunks.

Quantization is slow (~90 seconds on my Mac for 7B) as not
multi-threaded as in PR #896.

* ggml : satisfy the sanitizer builds

Not sure why this makes them fail

* Better follow ggml conventions for function names

* Fixed type as per reviewer comment

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-04-19 20:20:14 +02:00
Georgi Gerganov
884e7d7a2b
ggml : use 8-bit precision for Q4_1 intermediate results (#1047)
* ggml : use 8-bit precision for Q4_1 intermediate results (ARM)

* ggml : optimize ggml_vec_dot_q4_1_q8_0() via vmalq_n_f32

56 ms/token with Q4_1 !

* ggml : AVX2 implementation of ggml_vec_dot_q4_1_q8_0 (#1051)

* gitignore : ignore ppl-*.txt files

---------

Co-authored-by: slaren <2141330+slaren@users.noreply.github.com>
2023-04-19 20:10:08 +03:00
Georgi Gerganov
7cd5c4a3e9
readme : add warning about Q4_2 and Q4_3 2023-04-19 19:07:54 +03:00
Stephan Walter
f3d4edf504
ggml : Q4 cleanup - remove 4-bit dot product code (#1061)
* Q4 cleanup

* Remove unused AVX512 Q4_0 code
2023-04-19 19:06:37 +03:00
slaren
8944a13296
Add NVIDIA cuBLAS support (#1044) 2023-04-19 11:22:45 +02:00
slaren
6667401238
Multi-threaded ggml_cpy (#1035)
* Multi-threaded ggml_cpy

* Update ggml.c

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Also fix wdata offset in ggml_compute_forward_add_q_f32

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-04-19 00:53:24 +02:00
Georgi Gerganov
77a73403ca
ggml : add new Q4_2 quantization (ARM only) (#1046)
* ggml : Q4_2 ARM

* ggml : add ggml_is_quantized()

* llama : update llama_type_name() with Q4_2 entry

* ggml : speed-up q4_2

- 4 threads: ~100ms -> ~90ms
- 8 threads:  ~55ms -> ~50ms

* ggml : optimize q4_2 using vmlaq_n_f32 + vmulq_n_f32
2023-04-18 23:54:57 +03:00
Georgi Gerganov
50a8a2af97
ggml : scratch that - vmlaq_n_f32 is always better
Had a background process that was messing with the timings
2023-04-18 23:11:23 +03:00
Georgi Gerganov
4caebf6d40
gitignore : vdot 2023-04-18 23:00:08 +03:00