llama.cpp

Author	SHA1	Message	Date
Yifan Gu	fa5b31a5ca	vulkan : add GGML_VK_FORCE_HEAP_INDEX env var Some vulkan devices (namely integrated graphics cards) have multiple memory heaps: a smaller dedicated memory and a larger shared memory. ggml uses the first usable memory type, which usually resides on the smaller dedicated memory heap. This can likely cause allocation failures. This patch adds an environment variable that forces allocation on a specific memory heap.	2024-10-03 23:14:31 +00:00
Georgi Gerganov	d5ed2b929d	metal : remove abort (skip) (ggml/0)	2024-10-03 21:18:19 +03:00
Georgi Gerganov	1bb8a64ebf	sync : ggml	2024-10-03 21:17:49 +03:00
Johannes Gäßler	fabdc3bda3	ggml/ex: calculate accuracy in graph, adapt MNIST (ggml/980)	2024-10-03 21:17:26 +03:00
Johannes Gäßler	eee39bdc96	ggml: refactor cross entropy loss CPU impl. (ggml/976)	2024-10-03 21:17:26 +03:00
Jack Mousseau	5d5ab1e5cc	metal : fix compute pass descriptor autorelease crash (#9718)	2024-10-03 21:01:46 +03:00
Diego Devesa	a7ad553513	ggml-backend : add device description to CPU backend (#9720)	2024-10-03 17:39:18 +02:00
bandoti	d6fe7abf04	ggml: unify backend logging mechanism (#9709) * Add scaffolding for ggml logging macros * Metal backend now uses GGML logging * Cuda backend now uses GGML logging * Cann backend now uses GGML logging * Add enum tag to parameters * Use C memory allocation funcs * Fix compile error * Use GGML_LOG instead of GGML_PRINT * Rename llama_state to llama_logger_state * Prevent null format string * Fix whitespace * Remove log callbacks from ggml backends * Remove cuda log statement	2024-10-03 17:39:03 +02:00
compilade	e3c355ba65	convert : handle tokenizer merges format from transformers 4.45 (#9696)	2024-10-03 17:22:15 +03:00
Radoslav Gerganov	841713e1e4	rpc : enable vulkan (#9714) closes #8536	2024-10-03 13:00:52 +03:00
Ouadie EL FAROUKI	5639971466	Fixed dequant precision issues in Q4_1 and Q5_1 (#9711)	2024-10-03 07:50:44 +01:00
Diego Devesa	c83ad6d01e	ggml-backend : add device and backend reg interfaces (#9707) Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2024-10-03 01:49:47 +02:00
Xuan Son Nguyen	a39ab216aa	llama : reduce compile time and binary size (#9712) * llama : speed up compile time * fix build * fix build (2)	2024-10-02 15:49:55 +02:00
Alberto Cabrera Pérez	f536f4c439	[SYCL] Initial cmake support of SYCL for AMD GPUs (#9658) sycl: initial cmake support of SYCL for AMD GPUs	2024-10-02 13:57:18 +01:00
Radoslav Gerganov	00b7317e63	vulkan : do not use tensor->extra (#9407) * vulkan : do not use tensor->extra This patch allows using the Vulkan backend with the RPC backend as tensor->extra is no longer used. Ref: #8536 * Adapt GGML_VULKAN_CHECK_RESULTS to extra removal (#2) --------- Co-authored-by: 0cc4m <picard12@live.de>	2024-10-02 13:49:16 +03:00
Zhenwei Jin	76b37d1541	gguf-split : improve --split and --merge logic (#9619) * make sure params --split and --merge are not specified at same time * update gguf-split params parse logic * Update examples/gguf-split/gguf-split.cpp Co-authored-by: slaren <slarengh@gmail.com> --------- Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> Co-authored-by: slaren <slarengh@gmail.com>	2024-10-02 10:21:57 +03:00
Georgi Gerganov	148844fe97	examples : remove benchmark (#9704) ggml-ci	2024-10-02 10:14:44 +03:00
Paweł Wodnicki	3f1ae2e32c	Update README.md (#9591) Add Bielik model.	2024-10-01 19:18:46 +02:00
Georgi Gerganov	f1b8c42711	sync : ggml	2024-10-01 16:09:42 +03:00
Johannes Gäßler	e98c1c188e	test: fix OPT_STEP_ADAMW for test-backend-ops (ggml/974)	2024-10-01 16:07:40 +03:00
Salvatore Mesoraca	cb00020504	vulkan : mul_mat: fix UB with small warps (ggml/952) When the device's warp size is less than 16, it is possible for loadstride_a (mul_mm.comp:114) and loadstride_b (mul_mm.comp:115) to be set to 0. Because they are calculated as: the workgroup size, multiplied by LOAD_VEC_* (which can be 1) and divided by 16. And the workgroup size is set to be the same as the warp/subgroup size. The loadstride_* variables are used as increments in the loops that populate the buffers used for the multiplication. When they are 0 they cause an infinite loop. But infinite loops without side-effects are UB and the values of loadstride_* are known at compile time. So, the compiler quietly optimizes all the loops away. As a consequence, the buffers are not populated and the multiplication result is just a matrix with all elements set to 0. We prevent the UB by making sure that the workgroup size will never be less than 16, even if our device has a smaller warp size (e.g. 8). Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>	2024-10-01 16:07:39 +03:00
Borislav Stanimirov	6c5322481a	ggml : fix ggml_cast (ggml/973)	2024-10-01 16:07:39 +03:00
Johannes Gäßler	7254cdf7e8	ggml: fix gradient allocation logic (ggml/966) * ggml: fix gradient allocation logic * gradient allocation in ggml_build_backward_expand * fixup * fix test-backend-ops grad * suggestions by slaren * fix test1.c * fix legacy opt API * fix test-grad0 * remove keep arg	2024-10-01 16:07:38 +03:00
Georgi Gerganov	cad341d889	metal : reduce command encoding overhead (#9698) * metal : reduce command encoding overhead ggml-ci * metal : add comments	2024-10-01 16:00:25 +03:00
Georgi Gerganov	a90484c6d9	llama : print correct model type for Llama 3.2 1B and 3B	2024-10-01 11:42:01 +03:00
compilade	1927378bcc	convert : refactor rope_freqs generation (#9396) * convert : refactor rope_freqs generation This should also fix vocab-only conversion for Phi-3. * convert : adapt MiniCPM3 to separate rope_freqs insertion MiniCPM3's tokenizer is treated as a SentencePiece tokenizer to avoid having to run its custom Python code which mixes tokenization in the same file as tool calls. gguf-py : add long and short RoPE factors to tensor mappings Empty, but the key names are used to populate the mappings.	2024-10-01 09:31:36 +03:00
serhii-nakon	6f1d9d71f4	Fix Docker ROCM builds, use AMDGPU_TARGETS instead of GPU_TARGETS (#9641) * Fix Docker ROCM builds, use AMDGPU_TARGETS instead of GPU_TARGETS * Set ROCM_DOCKER_ARCH as a string, since it otherwise builds incorrectly and causes an OOM exit code	2024-09-30 20:57:12 +02:00
compilade	511636df0c	ci : reduce severity of unused Pyright ignore comments (#9697)	2024-09-30 14:13:16 -04:00
vb	08a43d05b6	py : update transformers version (#9694) * update transformers version. * update hfh version.	2024-09-30 18:03:47 +03:00
Georgi Gerganov	ace4f4be37	flake.lock: Update (#9680) Flake lock file updates: • Updated input 'nixpkgs': 'github:NixOS/nixpkgs/c04d5652cfa9742b1d519688f65d1bbccea9eb7e?narHash=sha256-PmUr/2GQGvFTIJ6/Tvsins7Q43KTMvMFhvG6oaYK%2BWk%3D' (2024-09-19) → 'github:NixOS/nixpkgs/1925c603f17fc89f4c8f6bf6f631a802ad85d784?narHash=sha256-J%2BPeFKSDV%2BpHL7ukkfpVzCOO7mBSrrpJ3svwBFABbhI%3D' (2024-09-26) Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>	2024-09-30 07:48:49 -07:00
Ruchira Hasaranga	8277a817f1	console : utf-8 fix for windows stdin (#9690) * utf-8 fix for windows stdin * Update common/console.cpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-09-30 11:23:42 +03:00
Georgi Gerganov	c919d5db39	ggml : define missing HWCAP flags (#9684) ggml-ci Co-authored-by: Willy Tarreau <w@1wt.eu>	2024-09-29 21:18:23 +03:00
Georgi Gerganov	d0b1d663e4	sync : ggml	2024-09-29 21:16:07 +03:00
Johannes Gäßler	aaa4099925	CUDA: remove bad assert (ggml/972)	2024-09-29 21:15:37 +03:00
Jeff Bolz	641002fba8	vulkan : multithread pipeline creation (ggml/963)	2024-09-29 21:15:37 +03:00
Jeff Bolz	0de8b203f1	vulkan : fix build for GGML_VULKAN_RUN_TESTS, add TFLOPS to log (ggml/961)	2024-09-29 21:15:37 +03:00
Salvatore Mesoraca	544f409b4b	vulkan : argsort barriers must be under uniform control flow (ggml/951) a return before a barrier (that happens only in some threads in a workgroup) leads to UB. While the old code actually works on some devices, it fails on some others (i.e. "smaller" GPUs). BTW, I think it would be better to set specialization constants when the graph is built, in that way the local workgroup could be sized appropriately. But it would take a lot of work. Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>	2024-09-29 21:15:37 +03:00
Georgi Gerganov	6084bfb261	ggml : fix GGML_MAX_N_THREADS + improve formatting (ggml/969)	2024-09-29 21:15:35 +03:00
matiaslin	faac0bae26	common : ensure llama_batch size does not exceed max size (#9668) A crash was observed when the number of tokens added to a batch exceeds llama_batch size. An assertion in llama_batch_add was added to protect against llama_batch size overflow.	2024-09-29 15:25:00 +03:00
nopperl	f99d3f8367	py : add model class for Chameleon conversion (#9683)	2024-09-29 15:02:06 +03:00
Georgi Gerganov	589b48d41e	contrib : add Resources section (#9675)	2024-09-29 14:38:18 +03:00
Georgi Gerganov	f4d2b8846a	llama : add reranking support (#9510) * py : add XLMRobertaForSequenceClassification [no ci] * py : fix scalar-tensor conversion [no ci] * py : fix position embeddings chop [no ci] * llama : read new cls tensors [no ci] * llama : add classification head (wip) [no ci] * llama : add "rank" pooling type ggml-ci * server : add rerank endpoint ggml-ci * llama : avoid ggml_repeat during classification * rerank : cleanup + comments * server : accept /rerank endpoint in addition to /v1/rerank [no ci] * embedding : parse special tokens * jina : support v1 reranker * vocab : minor style ggml-ci * server : initiate tests for later ggml-ci * server : add docs * llama : add comment [no ci] * llama : fix uninitialized tensors * ci : add rerank tests ggml-ci * add reranking test * change test data * Update examples/server/server.cpp Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> * add `--reranking` argument * update server docs * llama : fix comment [no ci] ggml-ci --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>	2024-09-28 17:42:03 +03:00
slaren	1b2f992cd2	test-backend-ops : use flops for some performance tests (#9657) * test-backend-ops : use flops for some performance tests - parallelize tensor quantization - use a different set of cases for performance and correctness tests - run each test for at least one second	2024-09-28 14:32:46 +02:00
Georgi Gerganov	739842703e	llama : add comment about thread-safety [no ci] (#9449)	2024-09-28 15:13:42 +03:00
Zhenwei Jin	6102037bbb	vocab : refactor tokenizer to reduce init overhead (#9449) * refactor tokenizer * llama : make llm_tokenizer more private ggml-ci * refactor tokenizer * refactor tokenizer * llama : make llm_tokenizer more private ggml-ci * remove unused files * remove unused fields to avoid unused field build error * avoid symbol link error * Update src/llama.cpp * Update src/llama.cpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-09-28 15:10:58 +03:00
nopperl	9a913110cf	llama : add support for Chameleon (#8543) * convert chameleon hf to gguf * add chameleon tokenizer tests * fix lint * implement chameleon graph * add swin norm param * return qk norm weights and biases to original format * implement swin norm * suppress image token output * rem tabs * add comment to conversion * fix ci * check for k norm separately * adapt to new lora implementation * fix layer input for swin norm * move swin_norm in gguf writer * add comment regarding special token regex in chameleon pre-tokenizer * Update src/llama.cpp Co-authored-by: compilade <git@compilade.net> * fix punctuation regex in chameleon pre-tokenizer (@compilade) Co-authored-by: compilade <git@compilade.net> * fix lint * trigger ci --------- Co-authored-by: compilade <git@compilade.net>	2024-09-28 15:08:43 +03:00
Aarni Koskela	43bcdd9703	readme : add tool (#9655)	2024-09-28 15:07:14 +03:00
Dan Johansson	6a0f779484	ggml : add run-time detection of neon, i8mm and sve (#9331) * ggml: Added run-time detection of neon, i8mm and sve Adds run-time detection of the Arm instructions set features neon, i8mm and sve for Linux and Apple build targets. * ggml: Extend feature detection to include non aarch64 Arm arch * ggml: Move definition of ggml_arm_arch_features to the global data section	2024-09-28 15:06:16 +03:00
Markus Tavenrath	89f9944981	Enable use of the ReBAR feature to upload buffers to the device. (#9251)	2024-09-28 12:05:05 +02:00
Georgi Gerganov	b5de3b74a5	readme : update hot topics	2024-09-27 20:57:51 +03:00
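
The arithmetic described in commit cb00020504 (vulkan : mul_mat: fix UB with small warps) can be checked with a small sketch. It is not the shader code: the function names and the constant below are hypothetical, and only the formula (workgroup size, multiplied by LOAD_VEC_*, divided by 16) and the lower bound of 16 are taken from the commit message.

```python
# Illustrative model of the loadstride_* arithmetic from the commit message.
# All names here are hypothetical; the real values live in mul_mm.comp.

MIN_WORKGROUP_SIZE = 16  # lower bound stated in the commit message


def load_stride(workgroup_size: int, load_vec: int) -> int:
    """Workgroup size times LOAD_VEC_*, integer-divided by 16."""
    return workgroup_size * load_vec // 16


def clamped_workgroup_size(subgroup_size: int) -> int:
    """Never let the workgroup size drop below 16, as the fix describes."""
    return max(subgroup_size, MIN_WORKGROUP_SIZE)


if __name__ == "__main__":
    for subgroup_size in (8, 16, 32):
        before = load_stride(subgroup_size, 1)
        after = load_stride(clamped_workgroup_size(subgroup_size), 1)
        print(f"subgroup={subgroup_size} stride_before={before} stride_after={after}")
```

With a subgroup size of 8 and LOAD_VEC_* equal to 1, the unclamped stride is 0, which is the zero loop increment the commit message identifies as the source of the infinite loop. After clamping, the stride is at least 1.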

3879 commits