Commit graph

3118 commits

Author SHA1 Message Date
hongruichen
3c491a3263 remove reference of g_qnn_mgr in qnn_instance 2024-06-19 14:45:43 +08:00
hongruichen
99320620b0 split logger function, tensors and backend from main qnn source 2024-06-19 14:39:16 +08:00
hongruichen
dfe159ffff remove TODO 2024-06-19 11:16:12 +08:00
hongruichen
aeef0c68f4 make the constant condition first 2024-06-19 10:29:53 +08:00
hongruichen
65a14d9e9a fix todo 2024-06-18 23:09:04 +08:00
hongruichen
9456bba121 rename 2024-06-17 18:44:19 +08:00
hongruichen
5fe7b87ba1 use ggml_qnn_tensor_writer for all parameters 2024-06-17 11:17:46 +08:00
hongruichen
a5679ddd8e use ggml_qnn_tensor_reader for output tensor 2024-06-16 22:28:11 +08:00
hongruichen
36e41a1055 use tensor wrapper in matmul 2024-06-16 22:28:11 +08:00
hongruichen
37bb9263dd use tensor wrapper in add 2024-06-16 22:28:11 +08:00
hongruichen
6c68adc1d9 add ggml_qnn_tensor_binder 2024-06-16 22:28:10 +08:00
hongruichen
5e18cdc268 init the test array with const values 2024-06-16 22:28:10 +08:00
zhou.weiguo
5598fbd15d
review: make an MVP (Minimum Viable PR) style PR in upstream 2024-06-13 15:41:53 +08:00
zhou.weiguo
faaa86b7e4
ggml-qnn: refine ggml inference using QNN NPU 2024-06-12 16:30:50 +08:00
zhou.weiguo
5269e082aa
ggml-qnn: refine ggml inference using QNN NPU 2024-06-11 23:05:00 +08:00
zhou.weiguo
5f8cfe4a1e
ggml-qnn: refine source code of ggml-qnn.cpp to make reviewers happier 2024-06-10 20:07:26 +08:00
zhou.weiguo
d38d4a67d1
npu: probe htp info and capacity of rpc ion memory 2024-06-09 23:49:54 +08:00
zhou.weiguo
3e8b61f970
review: fix a memory leak introduced by a review modification, as explained in https://github.com/zhouwg/llama.cpp/pull/1 2024-06-09 09:06:44 +08:00
zhou.weiguo
fdf0272dfb
review: code format using clang-format + manual modification according to review comments 2024-06-08 17:56:32 +08:00
zhou.weiguo
5d691c6cd0
review: put qnn's internal log inside preprocessor directive 2024-06-08 09:22:39 +08:00
zhou.weiguo
94ee775058
review: remove static global vars to support multiple simultaneous instances and thread safety 2024-06-07 14:56:07 +08:00
zhou.weiguo
2fab33d825
ggml-qnn: remove static global vars to support multiple simultaneous instances 2024-06-07 12:51:04 +08:00
zhou.weiguo
f4c53037ab
review: remove unused QNN helper functions 2024-06-06 20:24:03 +08:00
zhou.weiguo
dd29834c11
add support for quantized data type Q8_0 2024-06-06 17:12:28 +08:00
zhou.weiguo
926a8661f3
review: replace external declaration with NDK header file 2024-06-05 21:10:59 +08:00
zhou.weiguo
9c872cbbce
refine ggml-qnn-ut program and script to make reviewers happy 2024-06-05 12:06:17 +08:00
zhou.weiguo
c75817b881
rebase 2024-06-05 10:57:08 +08:00
zhou.weiguo
d325088dbf
ggml: add Qualcomm QNN (Qualcomm Neural Network, aka Qualcomm AI Engine Direct) backend 2024-06-05 10:55:45 +08:00
jaime-m-p
c90dbe026b
Fix per token attributes bits (#7749) 2024-06-05 01:26:14 +02:00
agray3
b90dc566c1
Allow number of nodes in CUDA graph to change (#7738)
Previously the code would have failed to cope when the number of nodes
changes in an existing CUDA graph. This fixes the issue by removing an
unnecessary conditional.
2024-06-04 22:06:49 +02:00
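
For context, a rough sketch of the pattern such a fix enables (generic CUDA runtime code, not the actual llama.cpp implementation; capture_work is a hypothetical callback): when the number of captured nodes changes between evaluations, the cached executable graph is dropped and re-instantiated instead of being reused.

    // Generic illustration (not llama.cpp's code): re-instantiate a CUDA graph
    // when the number of captured nodes changes between evaluations.
    #include <cuda_runtime.h>
    #include <cstddef>

    struct graph_cache {
        cudaGraphExec_t exec    = nullptr;
        size_t          n_nodes = 0;
    };

    // capture_work is a hypothetical callback that enqueues the kernels to capture
    static void launch_with_graph(graph_cache & cache, cudaStream_t stream,
                                  void (*capture_work)(cudaStream_t)) {
        cudaGraph_t graph;
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        capture_work(stream);
        cudaStreamEndCapture(stream, &graph);

        size_t n_nodes = 0;
        cudaGraphGetNodes(graph, nullptr, &n_nodes); // query the node count only

        // if the topology changed (different node count), the old executable
        // graph cannot be reused as-is: drop it and instantiate a fresh one
        if (cache.exec == nullptr || n_nodes != cache.n_nodes) {
            if (cache.exec) {
                cudaGraphExecDestroy(cache.exec);
            }
            cudaGraphInstantiateWithFlags(&cache.exec, graph, 0);
            cache.n_nodes = n_nodes;
        }
        // (real code would also use cudaGraphExecUpdate to patch parameters
        //  when the topology is unchanged)

        cudaGraphDestroy(graph);
        cudaGraphLaunch(cache.exec, stream);
    }
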
Georgi Gerganov
1442677f92
common : refactor cli arg parsing (#7675)
* common : gpt_params_parse do not print usage

* common : rework usage print (wip)

* common : valign

* common : rework print_usage

* infill : remove cfg support

* common : reorder args

* server : deduplicate parameters

ggml-ci

* common : add missing header

ggml-ci

* common : remove --random-prompt usages

ggml-ci

* examples : migrate to gpt_params

ggml-ci

* batched-bench : migrate to gpt_params

* retrieval : migrate to gpt_params

* common : change defaults for escape and n_ctx

* common : remove chatml and instruct params

ggml-ci

* common : passkey use gpt_params
2024-06-04 21:23:39 +03:00
Georgi Gerganov
554c247caf
ggml : remove OpenCL (#7735)
ggml-ci
2024-06-04 21:23:20 +03:00
Georgi Gerganov
0cd6bd3483
llama : remove beam search (#7736) 2024-06-04 21:23:05 +03:00
Georgi Gerganov
5ca0944a15
readme : remove obsolete Zig instructions (#7471) 2024-06-04 19:43:01 +03:00
slaren
adc9ff3841
llama-bench : allow using a different printer for stderr with -oe (#7722)
compare-commits.sh : hide stdout, use -oe to print markdown
2024-06-04 14:32:42 +02:00
Daniele
987d743d6b
Improve hipBLAS support in CMake (#7696)
* Improve hipBLAS support in CMake

This improves the detection of the correct CMAKE_PREFIX_PATH when using different distributions or a self-built ROCm SDK.

* Set ROCM_PATH correctly
2024-06-04 14:09:15 +02:00
zhouwg
b226c1227b
refine .gitignore (#7688)
This adds tags and the Android NDK to the git ignore list
2024-06-04 21:21:26 +10:00
jaime-m-p
3b38d48609
Per token attributes (#7685)
* Add per token attributes enum
* Using phi-3 for testing 'rstrip'
* Using jina-v2 for testing 'lstrip'
* Brute force test for 'lstrip' and 'rstrip'
* Implement 'rstrip' and 'lstrip'
* Update phi-3 GGUF file (obsolete since 917dc8c)
* Replace llama_token_type with llama_token_attribs
2024-06-04 09:17:17 +02:00
Georgi Gerganov
6d1616944d
ggml : prevent builds with -ffinite-math-only (#7726)
This enforces a check that -fno-finite-math-only was set and that the compiler
is not operating in finite-math mode. This is because, during the rewriting of
silu and softmax for CPU in #7154, an issue emerged where the result observed
with >1 slot was nondeterministic, as found by @JohannesGaessler.

@LostRuins narrowed the problem down to -ffinite-math-only, which was theorised
to be because SiLU, instead of flushing small values to 0, returns NaN or some
other garbage. @jart proposed a fix that @ggerganov then implemented in this fix

ref https://github.com/ggerganov/llama.cpp/pull/7154#issuecomment-2145661825
2024-06-04 17:01:09 +10:00
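
To illustrate the SiLU concern described above, here is a minimal scalar sketch (not the actual ggml code) of why -ffinite-math-only is risky for this function:

    // silu(x) = x * sigmoid(x) = x / (1 + exp(-x))  -- illustrative sketch only
    #include <math.h>
    #include <stdio.h>

    static float silu_f32(float x) {
        // for very negative x, expf(-x) overflows to +INF and the division
        // correctly drives the result to 0; under -ffinite-math-only the
        // compiler may assume INF/NaN never occur and transform the expression,
        // which is one way NaN or garbage can appear in place of 0
        return x / (1.0f + expf(-x));
    }

    int main(void) {
        printf("%g %g %g\n", silu_f32(-100.0f), silu_f32(0.0f), silu_f32(3.0f));
        return 0;
    }
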
Radoslav Gerganov
bde7cd3cd9
llama : offload to RPC in addition to other backends (#7640)
* llama : offload to RPC in addition to other backends

* - fix copy_tensor being called on the src buffer instead of the dst buffer

- always initialize views in the view_src buffer

- add RPC backend to Makefile build

- add endpoint to all RPC object names

* add rpc-server to Makefile

* Update llama.cpp

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-06-03 20:03:26 +03:00
Masaya, Kato
a5735e4426
ggml : use OpenMP as a thread pool (#7606)
* ggml: Added OpenMP for multi-threaded processing

* ggml : Limit the number of threads used to avoid deadlock

* update shared state n_threads in parallel region

* clear numa affinity for main thread even with openmp

* enable openmp by default

* fix msvc build

* disable openmp on macos

* ci : disable openmp with thread sanitizer

* Update ggml.c

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-06-03 17:14:15 +02:00
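
As a rough picture of the approach above (a minimal sketch under the assumption that workers split the work by thread index; not the actual ggml scheduling code), an OpenMP parallel region can serve as the thread pool, with the thread count re-read inside the region:

    #include <omp.h>
    #include <stdio.h>

    // hypothetical per-worker routine: each worker would process its slice here
    static void compute_chunk(int ith, int nth) {
        printf("worker %d of %d\n", ith, nth);
    }

    int main(void) {
        int n_threads = 4; // shared state, as in the "update shared state" item above
        #pragma omp parallel num_threads(n_threads)
        {
            // the runtime may grant fewer threads than requested, so re-read
            // the actual count inside the parallel region
            int nth = omp_get_num_threads();
            int ith = omp_get_thread_num();
            compute_chunk(ith, nth);
        }
        return 0;
    }
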
Johannes Gäßler
0b832d53ba
make: fix debug options not being applied to NVCC (#7714) 2024-06-03 16:28:58 +02:00
0cc4m
3d7ebf6312
Vulkan Mixture of Experts (MoE) support (#7628)
* Finish Vulkan mul_mat_id implementation

* Add Vulkan sum_rows and div ops

* Fix MUL_MAT_ID matrix matrix shader

* Fix MUL_MAT_ID matrix vector shader dispatch size

* Fix MUL_MAT_ID matrix vector shader and dispatch code

* Update Vulkan CPU offload for MUL_MAT_ID

* Fix crash when using split mode none and setting a main GPU
2024-06-03 10:59:14 +02:00
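
For readers unfamiliar with the op above, this is the core idea behind a MUL_MAT_ID (mixture-of-experts) matmul, sketched on plain arrays rather than the Vulkan shaders or the ggml API; the names and memory layout here are illustrative assumptions:

    #include <stdint.h>

    // each input row is multiplied by the expert matrix selected for it by ids
    static void mul_mat_id_f32(const float * experts,  // [n_expert][n_out][n_in]
                               int64_t n_out, int64_t n_in,
                               const float * x,        // [n_rows][n_in]
                               const int32_t * ids,    // [n_rows], expert per row
                               int64_t n_rows,
                               float * y) {            // [n_rows][n_out]
        for (int64_t r = 0; r < n_rows; ++r) {
            const float * w = experts + (int64_t) ids[r] * n_out * n_in;
            for (int64_t o = 0; o < n_out; ++o) {
                float sum = 0.0f;
                for (int64_t i = 0; i < n_in; ++i) {
                    sum += w[o * n_in + i] * x[r * n_in + i];
                }
                y[r * n_out + o] = sum;
            }
        }
    }
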
Andy Tai
a10cda58d3
cmake : add pkg-config spec file for llama.cpp (#7702) 2024-06-03 11:06:24 +03:00
zhangkaihuo
6f28a333c1
llama : MiniCPM support tied embeddings (#7664)
* support lm_head

* remove the code block

---------

Co-authored-by: zhangkaihuo <zhangkaihuo@modelbest.cn>
2024-06-03 10:49:30 +03:00
Georgi Gerganov
549279d804
llama : avoid double token-to-piece cache (#7654)
ggml-ci
2024-06-03 08:34:43 +03:00
woachk
9e405b6e2e
kompute : implement op_getrows_f32 (#6403)
op_getrows_f32 is required since https://github.com/ggerganov/llama.cpp/pull/6122
for the Vulkan w/ Kompute backend to be functional.

As such, implement this op to make this backend functional again.
2024-06-03 08:32:16 +03:00
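
For context on what the op above does, a minimal sketch of get_rows semantics for f32 data on plain arrays (not the Kompute shader or the ggml API):

    #include <stdint.h>
    #include <string.h>

    // dst row i is a copy of src row ids[i]: a simple row gather,
    // e.g. the embedding lookup at the start of inference
    static void get_rows_f32(const float * src, int64_t n_cols,
                             const int32_t * ids, int64_t n_ids,
                             float * dst) {
        for (int64_t i = 0; i < n_ids; ++i) {
            memcpy(dst + i * n_cols, src + (int64_t) ids[i] * n_cols,
                   n_cols * sizeof(float));
        }
    }
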
Dave Airlie
3413ae2193
fix bug introduced in using calloc (#7701)
compilade pointed this out on the previous MR
2024-06-02 17:59:54 -04:00
Georgi Gerganov
1669810d7c
flake.lock: Update (#7686)
Flake lock file updates:

• Updated input 'flake-parts':
    'github:hercules-ci/flake-parts/8dc45382d5206bd292f9c2768b8058a8fd8311d9?narHash=sha256-/GJvTdTpuDjNn84j82cU6bXztE0MSkdnTWClUCRub78%3D' (2024-05-16)
  → 'github:hercules-ci/flake-parts/2a55567fcf15b1b1c7ed712a2c6fadaec7412ea8?narHash=sha256-iKzJcpdXih14qYVcZ9QC9XuZYnPc6T8YImb6dX166kw%3D' (2024-06-01)
• Updated input 'flake-parts/nixpkgs-lib':
    'https://github.com/NixOS/nixpkgs/archive/50eb7ecf4cd0a5756d7275c8ba36790e5bd53e33.tar.gz?narHash=sha256-QBx10%2Bk6JWz6u7VsohfSw8g8hjdBZEf8CFzXH1/1Z94%3D' (2024-05-02)
  → 'https://github.com/NixOS/nixpkgs/archive/eb9ceca17df2ea50a250b6b27f7bf6ab0186f198.tar.gz?narHash=sha256-lIbdfCsf8LMFloheeE6N31%2BBMIeixqyQWbSr2vk79EQ%3D' (2024-06-01)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/bfb7a882678e518398ce9a31a881538679f6f092?narHash=sha256-4zSIhSRRIoEBwjbPm3YiGtbd8HDWzFxJjw5DYSDy1n8%3D' (2024-05-24)
  → 'github:NixOS/nixpkgs/ad57eef4ef0659193044870c731987a6df5cf56b?narHash=sha256-SzDKxseEcHR5KzPXLwsemyTR/kaM9whxeiJohbL04rs%3D' (2024-05-29)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-06-02 14:13:12 -07:00
Austin
7c4e5b7eae
chore : add ignore rule for generated server themes (#7689) 2024-06-02 20:39:08 +03:00