llama.cpp

Author	SHA1	Message	Date
Julia Longtin	0b3f17127f	force to compile.	2024-03-23 14:58:33 +00:00
Julia Longtin	18f353987c	tell ggml-common.h to export what we want.	2024-03-23 14:49:35 +00:00
Julia Longtin	cd20404250	pull in ggml specific types.	2024-03-23 14:38:15 +00:00
Julia Longtin	8f57803f58	import stdio.h for size_t.	2024-03-23 14:29:59 +00:00
Julia Longtin	9bcb8350d5	import stdint.h for sizeSt.	2024-03-23 14:28:29 +00:00
Julia Longtin	a7bd64c130	begin work on targeting dot_q5_K_q8_K.	2024-03-23 14:19:47 +00:00
Julia Longtin	9185e14922	be more specific about the length of our list of run amounts.	2024-03-21 20:38:49 +00:00
Julia Longtin	0979522fbe	spacing changes.	2024-03-21 18:36:25 +00:00
Julia Longtin	ac3637142d	formatting changes.	2024-03-20 21:34:12 +00:00
Julia Longtin	76e66e77c2	use the same header as ggml.c, and remove some warnings.	2024-03-20 21:12:22 +00:00
Julia Longtin	ee27148629	remove intrinsics import, and use upConv to save 12 bytes of memory transit.	2024-03-20 20:15:30 +00:00
Julia Longtin	ab6f3a8a8d	Update ggml-phi-knc.c	2024-03-17 21:36:14 +00:00
Julia Longtin	f882673ba6	add a benchmark / test binary.	2024-03-17 21:20:14 +00:00
Julia Longtin	fe663c1b63	merge from upstream	2024-03-17 21:15:32 +00:00
Julia Longtin	eac00a72d5	Update ggml.c	2024-03-16 14:17:21 +00:00
Julia Longtin	e216a2f133	Update ggml.c	2024-03-16 14:15:51 +00:00
Julia Longtin	257ffd9955	Update ggml.c	2024-03-16 14:13:22 +00:00
Julia Longtin	717e164dd7	implement F32 dot products.	2024-03-16 14:05:03 +00:00
Julia Longtin	7a57feba0c	import intrinsics.	2024-03-13 19:26:54 +00:00
Julia Longtin	a1ae649662	use right type, and define GGML_F32_VEC_ZERO.	2024-03-13 19:23:53 +00:00
Julia Longtin	f346a41deb	try to implement one intrinsic	2024-03-13 19:18:10 +00:00
Julia Longtin	aec982eefd	try to detect the PHI cross compiler in make.	2024-03-12 21:54:38 +00:00
Julia Longtin	a31c936c5a	try to detect the PHI cross compiler in make.	2024-03-12 21:40:46 +00:00
Julia Longtin	5a2973af25	instead of checking on glibc, check on SYS_getcpu	2024-03-12 21:07:10 +00:00
Julia Longtin	7f3722beb6	handle the case that we have no glibc on the PHI.	2024-03-12 21:02:14 +00:00
Julia Longtin	868a2016ac	add detection of Xeon PHI: Knights Corner.	2024-03-12 20:57:43 +00:00
slaren	306d34be7a	ci : remove tidy-review (#6021)	2024-03-12 17:55:19 +02:00
Georgi Gerganov	8030da7afe	ggml : reuse quantum structs across backends (#5943) * ggml : reuse quant blocks across backends ggml-ci * ggml : define helper constants only for CUDA and SYCL ggml-ci * ggml : define helper quantum constants for SYCL ggml-ci	2024-03-12 14:27:20 +02:00
Georgi Gerganov	184215e783	ggml : fix UB in IQ2_S and IQ3_S (#6012)	2024-03-12 13:49:55 +02:00
Georgi Gerganov	48358b2e5b	sycl : update IQ1_S kernels (WIP - not working!) (#5995) * sycl : try to fix after IQ1_S changes * sycl : iq1s_grid -> iq1s_grid_gpu * sycl : fix grid type	2024-03-12 11:15:05 +02:00
gliptic	5cdb371731	grammar : fix unnecessarily retained pointer to rules (#6003)	2024-03-11 21:59:03 +02:00
Kawrakow	44ca159faf	1.5 bit: we can do even better (#5999) * iq1_s: we can do even better Spent one of the 4 scale bits on a signs of a 0.125 shift. I.e., quants are now -1 + delta, delta, 1 + delta, where delta is +/- 0.125. CUDA works, same performance as before. PPL(LLaMA-v2-7B) is now 11.85! * iq1_s: make scalar and AVX2 work with the new version * iq1_s: make Neon work with new version. ~10% drop in performance, so will need some more work. * iq1_s: make Metal work with new version * iq1_s: very slightly faster dequantize on Metal * iq1_s: fix dequantize on the CPU --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-03-11 17:53:15 +02:00
Georgi Gerganov	05b06210c9	llama : more consistent names of count variables (#5994) * llama : more consistent names of count variables ggml-ci * llama : n_parallel -> n_seq_max * common : fix param name * examples : fix param name	2024-03-11 17:49:47 +02:00
Georgi Gerganov	83796e62bc	llama : refactor unicode stuff (#5992) * llama : refactor unicode stuff ggml-ci * unicode : names * make : fix c++ compiler * unicode : names * unicode : straighten tables * zig : fix build * unicode : put nfd normalization behind API ggml-ci * swift : fix build * unicode : add BOM * unicode : add <cstdint> ggml-ci * unicode : pass as cpts as const ref	2024-03-11 17:47:47 +02:00
Jakub N	828defefb6	Update server docker image URLs (#5997)	2024-03-11 14:40:42 +01:00
Xuan Son Nguyen	caa106d4e0	Server: format error to json (#5961) * server: format error to json * server: do not crash on grammar error * fix api key test case * revert limit max n_predict * small fix * correct coding style * update completion.js * launch_slot_with_task * update docs * update_slots * update webui * update readme	2024-03-11 10:56:41 +01:00
Michael Podvitskiy	3202361c5b	ggml, ci : Windows ARM runner and build fixes (#5979) * windows arm ci * fix `error C2078: too many initializers` with ggml_vld1q_u32 macro for MSVC ARM64 * fix `warning C4146: unary minus operator applied to unsigned type, result still unsigned` * fix `error C2065: '__fp16': undeclared identifier`	2024-03-11 11:28:51 +02:00
Minsoo Cheong	332bdfd798	server : maintain chat completion id for streaming responses (#5988) * server: maintain chat completion id for streaming responses * Update examples/server/utils.hpp * Update examples/server/utils.hpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-11 10:09:32 +02:00
Gilad S	ecab1c75de	cmake : fix subdir for `LLAMA_METAL_EMBED_LIBRARY` (#5985)	2024-03-11 10:00:08 +02:00
Georgi Gerganov	ee35600b90	llama : fix F16/F32 downcast + improve names (#5980)	2024-03-11 09:56:47 +02:00
Kawrakow	be858f6205	Better 1.5 bit quantization (#5971) * Trying blocvks of 16 for IQ1_S - seems slightly better * iq1s_blocks16: Adjust scale fudge factor to 1.125 * iq1s_blocks16: going to blocks of 32 with 2048 lattice points, so same bpw. This is even better than blocks of 16. Should I try blocks of 64? But to keep the same bpw, when I go to 4096 lattice points, I need to remove blocks alltogether and just have superblocks of 256 weights. * iq1s_blocks16: Use 2<x^2> as sigma2 in weight adjustment iq1s_blocks16: scalar and AVX2 dot products * iq1s_blocks16: CUDA dot product * iq1s_blocks16: Metal works, Neon does not Metal works but TG is dog slow (35 t/s). PP is OKish (493 t/s). Not seeing the bug in the Neon implementation for now. * iq1s_blocks16: fixed Neon * iq1s_blocks16: very slightly faster TG on Metal Still pathetic at 37 t/s * iq1s_blocks16: speedup Metal by packing codebook into uint32_t's * Formatting * iq1s_blocks16: uint32_t codebook is also better in CUDA TG-128 is now 204 t/s up from 194 t/s. PP-512 is 5890 t/s, so significantly better than other quants * iq1s_blocks16: slightly faster Neon dot product * iq1s_blocks16: faster AVX2 dot product * iq1s_blocks16: adjust to ggml-common.h --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-03-11 07:51:49 +01:00
Abhilash Majumder	ef3ced26a3	[SYCL] Add q3_s and q1_s (#5886) * Add q3_s and q1_s * fix compilation * fix build * fix build * fix build * enable ops * rm macro * increase grid space	2024-03-11 10:27:56 +05:30
AidanBeltonS	3814a07392	[SYCL] Add support for SYCL Nvidia target (#5738) * Add support for nvidia target in CMake * Update sycl read-me for Nvidia target * Fix errors	2024-03-11 09:13:57 +08:00
Georgi Gerganov	bb6d00bbf9	metal : move mm_id indices to shared mem (#5982)	2024-03-10 23:12:48 +02:00
Dean	7ab7b733bb	android : fix utf8 decoding error (#5935) * examples: fix utf8 decoding error some models have a tokenizer that decodes an id into an incomplete utf8 sequence, need to validate and wait for next token one example would be: https://huggingface.co/Qwen/Qwen1.5-1.8B-Chat-GGUF/resolve/main/qwen1_5-1_8b-chat-q4_0.gguf and and an example of the token is 18137 * android : minor --------- Co-authored-by: zhangfuwen <zhangfuwen@foxmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-10 22:03:17 +02:00
Georgi Gerganov	d9f65c97c3	readme : update hot topics	2024-03-10 20:58:26 +02:00
Georgi Gerganov	b838b53ad6	sync : ggml	2024-03-10 20:10:46 +02:00
Georgi Gerganov	df4dc3e7cb	ggml : try fix 32-bit arm compat (whisper/1938) * ggml : try fix 32-bit arm compat * ggml : fix cont	2024-03-10 20:10:39 +02:00
Georgi Gerganov	bf47a5eefc	ggml : remove __constant__ specifier for CUDA tables (#5940)	2024-03-10 20:09:24 +02:00
Pierrick Hymbert	fa8a809a91	server: ci: windows build and tests (#5968) * server: ci: windows build and tests * server: ci: remove tmp push branch * server: ci: EOF EOL * Use builti Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> * server: tests: server graceful shutdown, then kill, then hard kill * server: tests: remove python2 unicode string * server: tests: remove wrong comment on server starting, close_fds is always true * server: tests: server kill, if pid exists * server: tests: remove dependency to killall * server: tests: ci windows: pid exists better handling --------- Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>	2024-03-10 18:17:47 +01:00

2485 commits