Commit graph

919 commits

Johannes Gäßler
924dd22fd3
Quantized dot products for CUDA mul mat vec (#2067) 2023-07-05 14:19:42 +02:00
Howard Su
051c70dcd5
llama: Don't double count the sampling time (#2107) 2023-07-05 18:31:23 +08:00
Johannes Gäßler
9e4475f5cf
Fixed OpenCL offloading prints (#2082) 2023-07-05 08:58:05 +02:00
Nigel Bosch
7f0e9a775e
embd-input: Fix input embedding example unsigned int seed (#2105) 2023-07-05 07:33:33 +08:00
0cc4m
80b17e2f66 Fix trailing whitespace in vk_mem_alloc.h 2023-07-04 23:01:32 +02:00
0cc4m
e35d28fec3 Fix queue selection for AMD RADV 2023-07-04 22:57:08 +02:00
0cc4m
ae7325fdff Fix 2d write 2023-07-04 22:42:07 +02:00
0cc4m
ade9555c48 Add 2d write operation, profiling code 2023-07-04 22:31:47 +02:00
Georgi Gerganov
b472f3fca5
readme : add link web chat PR 2023-07-04 22:25:22 +03:00
Georgi Gerganov
ed9a54e512
ggml : sync latest (new ops, macros, refactoring) (#2106)
- add ggml_argmax()
- add ggml_tanh()
- add ggml_elu()
- refactor ggml_conv_1d() and variants
- refactor ggml_conv_2d() and variants
- add helper macros to reduce code duplication in ggml.c
2023-07-04 21:54:11 +03:00
jwj7140
f257fd2550
Add an API example using server.cpp similar to OAI. (#2009)
* add api_like_OAI.py
* add evaluated token count to server
* add /v1/ endpoints binding
2023-07-04 21:06:12 +03:00
Tobias Lütke
7ee76e45af
Simple webchat for server (#1998)
* expose simple web interface on root domain

* embed index and add --path for choosing static dir

* allow server to multithread

Because web browsers send a lot of garbage requests, we want the server
to multithread when serving 404s for favicons etc. To avoid blowing up
llama we just take a mutex when it's invoked (a minimal sketch of this
pattern follows this entry).


* let's try this with the xxd tool instead and see if msvc is happier with that

* enable server in Makefiles

* add /completion.js file to make it easy to use the server from js

* slightly nicer css

* rework state management into session, expose historyTemplate to settings

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-04 16:05:27 +02:00
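To illustrate the multithreading note in the entry above: worker threads can serve cheap requests (404s, favicons, static files) in parallel, while a single mutex guards the llama invocation so inference never runs concurrently. The following C++ sketch shows that pattern with hypothetical names; it is not the actual server.cpp code.

```cpp
#include <iostream>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for the expensive llama.cpp inference call.
static std::string run_completion(const std::string & prompt) {
    return "completion for: " + prompt;
}

static std::mutex g_llama_mutex;  // serializes access to the single llama context

// Called concurrently by the HTTP server's worker threads.
static std::string handle_completion(const std::string & prompt) {
    // Cheap requests never reach this point and stay parallel; only inference
    // takes the lock, so llama is never invoked from two threads at once.
    std::lock_guard<std::mutex> lock(g_llama_mutex);
    return run_completion(prompt);
}

int main() {
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i) {
        workers.emplace_back([i] {
            std::cout << handle_completion("request " + std::to_string(i)) << "\n";
        });
    }
    for (auto & t : workers) t.join();
}
```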
Henri Vasserman
3d7d8d00a4
add cmake commands 2023-07-04 17:02:22 +03:00
Henri Vasserman
acc111caf9
Allow old Make to build server. (#2098)
Also make server build by default.

Tested with Make 3.82
2023-07-04 15:38:04 +03:00
ZhouYuChen
23c7c6fc91
Update Makefile: clean simple (#2097) 2023-07-04 14:15:16 +02:00
Erik Scholz
698efad5fb
CI: make the brew update temporarily optional. (#2092)
Until they decide to fix the brew installation in the macOS runners;
see the open issues, e.g. https://github.com/actions/runner-images/pull/7710
2023-07-04 01:50:12 +02:00
Govlzkoy
14a2cc71f6
[ggml] fix index for ne03 value in ggml_cl_mul_f32 (#2088) 2023-07-04 07:50:00 +08:00
Henri Vasserman
1cf14ccef1
fix server crashes (#2076) 2023-07-04 00:05:23 +03:00
Howard Su
cc45a7feb8
Fix crash of test-tokenizer-0 under Debug build (#2064)
* Fix crash of test-tokenizer-0 under Debug build

* Change per comment
2023-07-03 20:43:55 +02:00
Howard Su
55dbb915cc
[llama] No need to check file version when loading vocab score (#2079) 2023-07-03 19:58:58 +08:00
WangHaoranRobin
d7d2e6a0f0
server: add option to output probabilities for completion (#1962)
* server: add option to output probabilities for completion
* server: fix issue when handling probability output for incomplete tokens for multibyte character generation
* server: fix llama_sample_top_k order
* examples/common.h: put all bool variables in gpt_params together
2023-07-03 00:38:44 +03:00
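The probability output mentioned in the entry above amounts to attaching, for each generated token, the normalized probabilities of the most likely candidates. The sketch below shows the generic softmax-plus-top-k computation behind such an option; it is an illustration only, not the code added in #1962.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>

// Convert raw logits to probabilities and keep the k most likely token ids,
// the kind of per-token information a probability-output option can report.
static std::vector<std::pair<int, float>> top_k_probs(const std::vector<float> & logits, int k) {
    // softmax with max-subtraction for numerical stability
    float max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<std::pair<int, float>> probs(logits.size());
    double sum = 0.0;
    for (size_t i = 0; i < logits.size(); ++i) {
        float p = std::exp(logits[i] - max_logit);
        probs[i] = {(int) i, p};
        sum += p;
    }
    for (auto & pr : probs) pr.second = (float) (pr.second / sum);
    std::partial_sort(probs.begin(), probs.begin() + k, probs.end(),
                      [](const auto & a, const auto & b) { return a.second > b.second; });
    probs.resize(k);
    return probs;
}

int main() {
    std::vector<float> logits = {1.0f, 3.5f, 0.2f, 2.8f, -1.0f};
    for (const auto & [id, p] : top_k_probs(logits, 3)) {
        std::printf("token %d: %.3f\n", id, p);
    }
}
```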
0cc4m
24eeb97d13 Add bounds checking to matmul kernels, improve implementation, fix command buffers not freed properly 2023-07-02 22:11:58 +02:00
Georgi Gerganov
46088f7231 ggml : fix build with OpenBLAS (close #2066) 2023-07-02 09:46:46 +03:00
Johannes Gäßler
0bc2cdfc87
Better CUDA synchronization logic (#2057) 2023-07-01 21:49:44 +02:00
Johannes Gäßler
befb3a3562
Test-based VRAM scratch size + context adjustment (#2056) 2023-07-01 21:47:26 +02:00
Daniel Drake
b213227067
cmake : don't force -mcpu=native on aarch64 (#2063)
It's currently not possible to cross-compile llama.cpp for aarch64
because CMakeLists.txt forces -mcpu=native for that target.

-mcpu=native doesn't make sense if your build host is not the
target architecture, and clang rejects it for that reason, aborting the
build. This can be easily reproduced using the current Android NDK to build
for aarch64 on an x86_64 host.

If there is not a specific CPU-tuning target for aarch64, then -mcpu
should be omitted completely. I think that makes sense; there is not
enough variance in the aarch64 instruction set to warrant a fixed -mcpu
optimization at this point. And if someone is building natively and wishes
to enable any possible optimizations for the host device, there is
already the LLAMA_NATIVE option available.

Fixes #495.
2023-07-01 21:31:44 +03:00
Aaron Miller
2f8cd979ec
metal : release buffers when freeing metal context (#2062) 2023-07-01 21:14:59 +03:00
Judd
471aab6e4c
convert : add support of baichuan-7b (#2055)
Co-authored-by: Judd <foldl@boxvest.com>
2023-07-01 20:00:25 +03:00
Georgi Gerganov
463f2f4c4f
llama : fix return value of llama_load_session_file_internal (#2022) 2023-07-01 19:05:09 +03:00
Rand Xie
cb44dbc7de
llama : catch llama_load_session_file_internal exceptions (#2022)
* convert checks in llama_load_session_file to throw and handle them

* make llama_load_session_file_internal static

* address feedbacks to avoid using exceptions
2023-07-01 19:02:58 +03:00
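The bullets in the entry above describe converting the loader's error checks into throws and then handling them so the function still reports a plain status to its callers. A minimal, self-contained C++ sketch of that wrapper pattern follows; the names are illustrative, not the actual llama.cpp implementation.

```cpp
#include <cstdio>
#include <stdexcept>
#include <string>

// Internal loader: reports problems by throwing.
static bool load_session_file_internal(const std::string & path) {
    std::FILE * f = std::fopen(path.c_str(), "rb");
    if (!f) {
        throw std::runtime_error("failed to open session file: " + path);
    }
    // ... read and validate header, tokens and state here ...
    std::fclose(f);
    return true;
}

// Public entry point: converts any exception into a boolean so callers of a
// C-style API never see C++ exceptions.
bool load_session_file(const std::string & path) {
    try {
        return load_session_file_internal(path);
    } catch (const std::exception & err) {
        std::fprintf(stderr, "error loading session file: %s\n", err.what());
        return false;
    }
}

int main() {
    // A missing file triggers the throw/catch path and returns false.
    std::printf("loaded: %s\n", load_session_file("missing.bin") ? "yes" : "no");
}
```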
Georgi Gerganov
79f634a19d
embd-input : fix returning ptr to temporary 2023-07-01 18:46:00 +03:00
Georgi Gerganov
04606a1599
train : fix compile warning 2023-07-01 18:45:44 +03:00
Qingyou Meng
b1ca8f36a9
ggml : disable GGML_TASK_INIT and GGML_TASK_FINALIZE by default (#1995)
Will not be scheduled unless explicitly enabled.
2023-07-01 18:42:43 +03:00
0cc4m
36cd5d85e9 Avoid requesting dedicated memory, VMA can decide that by itself 2023-06-30 21:20:19 +02:00
0cc4m
4ea9b2fd4b Add VMA library 2023-06-30 21:15:06 +02:00
0cc4m
c8ff09bdc7 dequant_q4_0 kernel 2023-06-30 20:48:42 +02:00
0cc4m
cb5cb4d6e2 Fix f16_to_f32 kernel 2023-06-30 20:48:03 +02:00
0cc4m
df3cdbdac7 Output FP32 in fp16 matmul shader 2023-06-30 18:37:10 +02:00
0cc4m
40c8f843f2 Fix mulmat_f16 2023-06-30 18:37:10 +02:00
0cc4m
c31e14b2fd Enable device extensions properly, restore fp16 matmul op 2023-06-30 18:37:10 +02:00
0cc4m
fc5bb53b32 Code abstraction, FP16 implementation, fix kernel, add FP16 to FP32 kernel 2023-06-30 18:37:10 +02:00
0cc4m
3adc7b1d60 First FP16 attempt, disabled for now 2023-06-30 18:37:10 +02:00
0cc4m
2c70df985a Continue vulkan implementation and optimization 2023-06-30 18:36:42 +02:00
0cc4m
0c9cca00bd Write coalescing 2023-06-30 18:36:42 +02:00
0cc4m
7c6860b483 2D Blocktiling 2023-06-30 18:36:42 +02:00
0cc4m
1b4863c2b9 1D Blocktiling 2023-06-30 18:36:42 +02:00
0cc4m
baf9ff536b GEMM Kernel optimization 2023-06-30 18:36:42 +02:00
0cc4m
a42376e7ec First matmul success 2023-06-30 18:36:42 +02:00
0cc4m
8ce84c2747 Continue implementation 2023-06-30 18:36:42 +02:00
0cc4m
2471728a9d Add aligned malloc and free for VMA 2023-06-30 18:36:42 +02:00