llama.cpp

Author	SHA1	Message	Date
Concedo	794a38a2e8	Revert "cublas is not feasible at this time. removed for now" This reverts commit `3687db7cf7`.	2023-04-21 21:02:40 +08:00
Concedo	5160053e51	merged llama adapter into the rest of the gpt adapters	2023-04-21 17:47:48 +08:00
Concedo	82d74ca1a6	Merge branch 'master' into concedo # Conflicts: # .github/workflows/build.yml	2023-04-21 16:24:30 +08:00
Concedo	3687db7cf7	cublas is not feasible at this time. removed for now	2023-04-21 16:14:23 +08:00
Georgi Gerganov	d40fded93e	llama : fix comment for "output.weight" tensor	2023-04-21 10:24:02 +03:00
Stephan Walter	2510c1831f	Add ggml-model-*.bin checksums for 7B, 13B, 30B, 65B (#1088 ) * Add ggml-model-*.bin checksums for 7B, 13B, 30B * Add ggml-model-*.bin checksums for 65B --------- Co-authored-by: Pavol Rusnak <pavol@rusnak.io>	2023-04-20 23:56:44 +02:00
Georgi Gerganov	12b5900dbc	ggml : sync ggml (add GPT-NeoX RoPE implementation)	2023-04-20 23:32:59 +03:00
Georgi Gerganov	9ff334f3c9	ggml : fix bug in ggml_compute_forward_dup_f32()	2023-04-20 21:58:38 +03:00
slaren	2005469ea1	Add Q4_3 support to cuBLAS (#1086 )	2023-04-20 20:49:53 +02:00
Georgi Gerganov	8a1756abdf	ggml : do not break cuBLAS build (Q4_3 is not yet implemented)	2023-04-20 21:43:50 +03:00
Georgi Gerganov	66aab46079	ggml : fix Q4_3 quantization Broke it during conflict resolution in last PR	2023-04-20 20:44:05 +03:00
Kawrakow	38de86a711	llama : multi-threaded quantization (#1075 ) * Multi-threading quantization. Not much gain for simple quantizations, but it will be important for quantizations that require more CPU cycles. * Multi-threading for quantize-stats It now does the job in ~14 seconds on my Mac for Q4_0, Q4_1 and Q4_2. Single-threaded it was taking more than 2 minutes after adding the more elaborate version of Q4_2. * Reviewer comments * Avoiding compiler confusion After changing chunk_size to const int as suggested by @ggerganov, clang and GCC started to warn me that I don't need to capture it in the lambda. So, I removed it from the capture list. But that makes the MSVC build fail. So, making it a constexpr to make every compiler happy. * Still fighting with lambda captures in MSVC --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-04-20 20:42:27 +03:00
Georgi Gerganov	e0305ead3a	ggml : add Q4_3 quantization (#1082 )	2023-04-20 20:35:53 +03:00
Concedo	07bb31b034	wip dont use	2023-04-21 00:35:54 +08:00
Ivan Komarov	6a9661ea5a	ci : remove the LLAMA_ACCELERATE matrix dimension from Ubuntu builds in the CI (#1074 ) [Accelerate](https://developer.apple.com/documentation/accelerate) is an Apple framework which can only be used on macOS, and the CMake build [ignores](https://github.com/ggerganov/llama.cpp/blob/master/CMakeLists.txt#L102) the `LLAMA_ACCELERATE` variable when run on non-Apple platforms. This implies setting `LLAMA_ACCELERATE` is a no-op on Ubuntu and can be removed. This will reduce visual noise in CI check results (in addition to reducing the number of checks we have to run for every PR). Right now every sanitized build is duplicated twice for no good reason (e.g., we have `CI / ubuntu-latest-cmake-sanitizer (ADDRESS, Debug, ON)` and `CI / ubuntu-latest-cmake-sanitizer (ADDRESS, Debug, OFF)`).	2023-04-20 18:15:18 +03:00
Concedo	7ba36c2c6c	trying to put out penguin based fires. sorry for inconvenience	2023-04-20 23:15:07 +08:00
源文雨	5addcb120c	fix: LLAMA_CUBLAS=1 undefined reference 'shm_open' (#1080 )	2023-04-20 15:28:43 +02:00
Concedo	49697d86d8	adjusted down the buf memory allocation now that realloc seems to work	2023-04-20 17:51:13 +08:00
Concedo	4605074245	Merge branch 'master' into concedo_experimental # Conflicts: # CMakeLists.txt # Makefile # README.md # ggml.c	2023-04-20 17:30:54 +08:00
Concedo	3e88616439	fixed WONKY CODE	2023-04-20 16:41:32 +08:00
Concedo	0b08ec7c5d	forgot to remove this	2023-04-20 16:28:47 +08:00
Concedo	346cd68903	make linux and OSX build process equal to windows. Now it will build all applicable libraries, for a full build do `make LLAMA_OPENBLAS=1 LLAMA_CLBLAST=1`	2023-04-20 15:53:55 +08:00
Stephan Walter	c8c2c52482	AVX2 optimization for vec_dot_q4_2_q8_0 (#1068 )	2023-04-20 08:45:41 +02:00
Concedo	93761e7baf	slightly clarified the library replacement steps - replacing the dll is necessary in addition to replacing the library imports	2023-04-20 12:23:54 +08:00
Gustavo Rocha Dias	5ca2d774cc	doc - explanation of how to use a custom version of the windows libraries at the lib folder. (#92 ) the dynamic libraries also need to be updated if you replace the import libraries	2023-04-20 12:20:11 +08:00
slaren	02d6988121	Improve cuBLAS performance by dequantizing on the GPU (#1065 )	2023-04-20 03:14:14 +02:00
CRD716	834695fe3a	Minor: Readme fixed grammar, spelling, and misc updates (#1071 )	2023-04-19 19:52:14 +00:00
Kawrakow	f7d05095b4	Q4_2 quantization with rmse-optimized scale and quants (#1062 ) * Q4_2 quantization with rmse-optimized scale and quants For quantize-stats we get q4_2: rmse 0.00159301, maxerr 0.17480469, 95pct<0.0030, median<0.0012 For 7B perplexity with BLAS enabled we get 6.2038 after 655 chunks. Quantization is slow (~90 seconds on my Mac for 7B) as not multi-threaded as in PR #896. * ggml : satisfy the sanitizer builds Not sure why this makes them fail * Better follow ggml conventions for function names * Fixed type as per reviewer comment --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-04-19 20:20:14 +02:00
Georgi Gerganov	884e7d7a2b	ggml : use 8-bit precision for Q4_1 intermediate results (#1047 ) * ggml : use 8-bit precision for Q4_1 intermediate results (ARM) * ggml : optimize ggml_vec_dot_q4_1_q8_0() via vmlaq_n_f32 56 ms/token with Q4_1 ! * ggml : AVX2 implementation of ggml_vec_dot_q4_1_q8_0 (#1051) * gitignore : ignore ppl-*.txt files --------- Co-authored-by: slaren <2141330+slaren@users.noreply.github.com>	2023-04-19 20:10:08 +03:00
Georgi Gerganov	7cd5c4a3e9	readme : add warning about Q4_2 and Q4_3	2023-04-19 19:07:54 +03:00
Stephan Walter	f3d4edf504	ggml : Q4 cleanup - remove 4-bit dot product code (#1061 ) * Q4 cleanup * Remove unused AVX512 Q4_0 code	2023-04-19 19:06:37 +03:00
Concedo	be1222c36e	Merged the upstream cublas feature,	2023-04-19 20:45:37 +08:00
Concedo	cc407f283a	messing around with memory allocation to bandaid the random ooms with various gpt2 and gptj models	2023-04-19 20:18:55 +08:00
slaren	8944a13296	Add NVIDIA cuBLAS support (#1044 )	2023-04-19 11:22:45 +02:00
Concedo	f662a9a230	Merge branch 'master' into concedo # Conflicts: # .github/workflows/build.yml # .github/workflows/docker.yml # CMakeLists.txt # Makefile # README.md	2023-04-19 16:34:51 +08:00
Concedo	65bfcdb1cc	Merge branch 'concedo_experimental' into concedo	2023-04-19 15:35:48 +08:00
Concedo	45ec09d31b	fast forwarding for rwkv for unmodified contexts	2023-04-19 15:09:35 +08:00
AlpinDale	116488af66	Create make_pyinstaller.sh (#89 )	2023-04-19 10:57:07 +08:00
slaren	6667401238	Multi-threaded ggml_cpy (#1035 ) * Multi-threaded ggml_cpy * Update ggml.c Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Also fix wdata offset in ggml_compute_forward_add_q_f32 --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-04-19 00:53:24 +02:00
Georgi Gerganov	77a73403ca	ggml : add new Q4_2 quantization (ARM only) (#1046 ) * ggml : Q4_2 ARM * ggml : add ggml_is_quantized() * llama : update llama_type_name() with Q4_2 entry * ggml : speed-up q4_2 - 4 threads: ~100ms -> ~90ms - 8 threads: ~55ms -> ~50ms * ggml : optimize q4_2 using vmlaq_n_f32 + vmulq_n_f32	2023-04-18 23:54:57 +03:00
Georgi Gerganov	50a8a2af97	ggml : scratch that - vmlaq_n_f32 is always better Had a background process that was messing with the timings	2023-04-18 23:11:23 +03:00
Georgi Gerganov	4caebf6d40	gitignore : vdot	2023-04-18 23:00:08 +03:00
Georgi Gerganov	dcdd65e296	ggml : optimize ggml_vec_dot_q4_0_q8_0() using vectorized accumulators	2023-04-18 22:59:17 +03:00
Kawrakow	5ecff35151	Adding a simple program to measure speed of dot products (#1041 ) On my Mac, the direct Q4_1 product is marginally slower (~69 vs ~55 us for Q4_0). The SIMD-ified ggml version is now almost 2X slower (~121 us). On a Ryzen 7950X CPU, the direct product for Q4_1 quantization is faster than the AVX2 implementation (~60 vs ~62 us). --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2023-04-18 19:00:14 +00:00
Georgi Gerganov	7faa7460f0	readme : update hot topics about new LoRA functionality	2023-04-18 20:10:26 +03:00
Georgi Gerganov	5af8e32238	ci : do not run on drafts	2023-04-18 19:57:06 +03:00
Concedo	f39def81d4	Update readme with more info	2023-04-18 21:44:26 +08:00
Concedo	3614956bc7	update readme	2023-04-18 21:39:05 +08:00
Concedo	ea01771dd5	rwkv is done	2023-04-18 20:55:01 +08:00
Concedo	a76b15b581	Merge branch 'concedo' into concedo_experimental # Conflicts: # make_pyinstaller.bat	2023-04-18 17:42:43 +08:00

579 commits