Commit graph

3118 commits

Author SHA1 Message Date
hongruichen
3c491a3263 remove reference of g_qnn_mgr in qnn_instance 2024-06-19 14:45:43 +08:00
hongruichen
99320620b0 split logger function, tensors and backend from main qnn source 2024-06-19 14:39:16 +08:00
hongruichen
dfe159ffff remove TODO 2024-06-19 11:16:12 +08:00
hongruichen
aeef0c68f4 make the constant condition first 2024-06-19 10:29:53 +08:00
hongruichen
65a14d9e9a fix todo 2024-06-18 23:09:04 +08:00
hongruichen
9456bba121 rename 2024-06-17 18:44:19 +08:00
hongruichen
5fe7b87ba1 use ggml_qnn_tensor_writer for all parameters 2024-06-17 11:17:46 +08:00
hongruichen
a5679ddd8e use ggml_qnn_tensor_reader for output tensor 2024-06-16 22:28:11 +08:00
hongruichen
36e41a1055 use tensor wrapper in matmul 2024-06-16 22:28:11 +08:00
hongruichen
37bb9263dd use tensor wrapper in add 2024-06-16 22:28:11 +08:00
hongruichen
6c68adc1d9 add ggml_qnn_tensor_binder 2024-06-16 22:28:10 +08:00
hongruichen
5e18cdc268 init the test array with const values 2024-06-16 22:28:10 +08:00
zhou.weiguo
5598fbd15d
review: make an MVP (Minimum Viable PR) style PR in upstream 2024-06-13 15:41:53 +08:00
zhou.weiguo
faaa86b7e4
ggml-qnn: refine ggml inference using QNN NPU 2024-06-12 16:30:50 +08:00
zhou.weiguo
5269e082aa
ggml-qnn: refine ggml inference using QNN NPU 2024-06-11 23:05:00 +08:00
zhou.weiguo
5f8cfe4a1e
ggml-qnn: refine source code of ggml-qnn.cpp to make reviewers happier 2024-06-10 20:07:26 +08:00
zhou.weiguo
d38d4a67d1
npu: probe htp info and capacity of rpc ion memory 2024-06-09 23:49:54 +08:00
zhou.weiguo
3e8b61f970
review: fix a memory leak introduced by a review modification, as explained in https://github.com/zhouwg/llama.cpp/pull/1 2024-06-09 09:06:44 +08:00
zhou.weiguo
fdf0272dfb
review: code format using clang-format + manual modification according to review comments 2024-06-08 17:56:32 +08:00
zhou.weiguo
5d691c6cd0
review: put qnn's internal log inside preprocessor directive 2024-06-08 09:22:39 +08:00
zhou.weiguo
94ee775058
review: remove static global vars to support multiple simultaneous instances and thread safety 2024-06-07 14:56:07 +08:00
zhou.weiguo
2fab33d825
ggml-qnn: remove static global vars to support multiple simultaneous instances 2024-06-07 12:51:04 +08:00
zhou.weiguo
f4c53037ab
review: remove unused QNN helper functions 2024-06-06 20:24:03 +08:00
zhou.weiguo
dd29834c11
add support for quantized data type Q8_0 2024-06-06 17:12:28 +08:00
zhou.weiguo
926a8661f3
review: replace external declaration with NDK header file 2024-06-05 21:10:59 +08:00
zhou.weiguo
9c872cbbce
refine ggml-qnn-ut program and script to make reviewers happy 2024-06-05 12:06:17 +08:00
zhou.weiguo
c75817b881
rebase 2024-06-05 10:57:08 +08:00
zhou.weiguo
d325088dbf
ggml: add Qualcomm QNN (Qualcomm Neural Network, aka Qualcomm AI Engine Direct) backend 2024-06-05 10:55:45 +08:00
jaime-m-p
c90dbe026b
Fix per token attributes bits (#7749) 2024-06-05 01:26:14 +02:00
agray3
b90dc566c1
Allow number of nodes in CUDA graph to change (#7738)
Previously the code would have failed to cope when the number of nodes
changes in an existing CUDA graph. This fixes the issue by removing an
unnecessary conditional.
2024-06-04 22:06:49 +02:00
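
For context, a rough sketch of the pattern such a fix enables (generic CUDA runtime code, not the actual llama.cpp implementation; capture_work is a hypothetical callback): when the number of captured nodes changes between evaluations, the cached executable graph is dropped and re-instantiated instead of being reused.

    // Generic illustration (not llama.cpp's code): re-instantiate a CUDA graph
    // when the number of captured nodes changes between evaluations.
    #include <cuda_runtime.h>
    #include <cstddef>

    struct graph_cache {
        cudaGraphExec_t exec    = nullptr;
        size_t          n_nodes = 0;
    };

    // capture_work is a hypothetical callback that enqueues the kernels to capture
    static void launch_with_graph(graph_cache & cache, cudaStream_t stream,
                                  void (*capture_work)(cudaStream_t)) {
        cudaGraph_t graph;
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        capture_work(stream);
        cudaStreamEndCapture(stream, &graph);

        size_t n_nodes = 0;
        cudaGraphGetNodes(graph, nullptr, &n_nodes); // query the node count only

        // if the topology changed (different node count), the old executable
        // graph cannot be reused as-is: drop it and instantiate a fresh one
        if (cache.exec == nullptr || n_nodes != cache.n_nodes) {
            if (cache.exec) {
                cudaGraphExecDestroy(cache.exec);
            }
            cudaGraphInstantiateWithFlags(&cache.exec, graph, 0);
            cache.n_nodes = n_nodes;
        }
        // (real code would also use cudaGraphExecUpdate to patch parameters
        //  when the topology is unchanged)

        cudaGraphDestroy(graph);
        cudaGraphLaunch(cache.exec, stream);
    }
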
Georgi Gerganov
1442677f92
common : refactor cli arg parsing (#7675)
* common : gpt_params_parse do not print usage

* common : rework usage print (wip)

* common : valign

* common : rework print_usage

* infill : remove cfg support

* common : reorder args

* server : deduplicate parameters

ggml-ci

* common : add missing header

ggml-ci

* common : remove --random-prompt usages

ggml-ci

* examples : migrate to gpt_params

ggml-ci

* batched-bench : migrate to gpt_params

* retrieval : migrate to gpt_params

* common : change defaults for escape and n_ctx

* common : remove chatml and instruct params

ggml-ci

* common : passkey use gpt_params
2024-06-04 21:23:39 +03:00
Georgi Gerganov
554c247caf
ggml : remove OpenCL (#7735)
ggml-ci
2024-06-04 21:23:20 +03:00
Georgi Gerganov
0cd6bd3483
llama : remove beam search (#7736) 2024-06-04 21:23:05 +03:00
Georgi Gerganov
5ca0944a15
readme : remove obsolete Zig instructions (#7471) 2024-06-04 19:43:01 +03:00
slaren
adc9ff3841
llama-bench : allow using a different printer for stderr with -oe (#7722)
compare-commits.sh : hide stdout, use -oe to print markdown
2024-06-04 14:32:42 +02:00
Daniele
987d743d6b
Improve hipBLAS support in CMake (#7696)
* Improve hipBLAS support in CMake

This improves the detection of the correct CMAKE_PREFIX_PATH when using different distributions or a self-built ROCm SDK.

* Set ROCM_PATH correctly
2024-06-04 14:09:15 +02:00
zhouwg
b226c1227b
refine .gitignore (#7688)
This adds tags and the Android NDK to the git ignore list
2024-06-04 21:21:26 +10:00
jaime-m-p
3b38d48609
Per token attributes (#7685)
* Add per token attributes enum
* Using phi-3 for testing 'rstrip'
* Using jina-v2 for testing 'lstrip'
* Brute force test for 'lstrip' and 'rstrip'
* Implement 'rstrip' and 'lstrip'
* Update phi-3 GGUF file (obsolete since 917dc8c)
* Replace llama_token_type with llama_token_attribs
2024-06-04 09:17:17 +02:00
Georgi Gerganov
6d1616944d
ggml : prevent builds with -ffinite-math-only (#7726)
This enforces a check that -fno-finite-math-only was set and that the compiler
is not operating in finite-math mode. This is because, during the rewriting of
silu and softmax for CPU in #7154, an issue emerged where the result observed
with >1 slot was nondeterministic, as found by @JohannesGaessler.

@LostRuins narrowed the problem down to -ffinite-math-only, which was theorised
to be because SiLU, instead of flushing small values to 0, returns NaN or some
other garbage. @jart proposed a fix that @ggerganov then implemented in this fix

ref https://github.com/ggerganov/llama.cpp/pull/7154#issuecomment-2145661825
2024-06-04 17:01:09 +10:00
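
To illustrate the SiLU concern described above, here is a minimal scalar sketch (not the actual ggml code) of why -ffinite-math-only is risky for this function:

    // silu(x) = x * sigmoid(x) = x / (1 + exp(-x))  -- illustrative sketch only
    #include <math.h>
    #include <stdio.h>

    static float silu_f32(float x) {
        // for very negative x, expf(-x) overflows to +INF and the division
        // correctly drives the result to 0; under -ffinite-math-only the
        // compiler may assume INF/NaN never occur and transform the expression,
        // which is one way NaN or garbage can appear in place of 0
        return x / (1.0f + expf(-x));
    }

    int main(void) {
        printf("%g %g %g\n", silu_f32(-100.0f), silu_f32(0.0f), silu_f32(3.0f));
        return 0;
    }
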
Radoslav Gerganov
bde7cd3cd9
llama : offload to RPC in addition to other backends (#7640)
* llama : offload to RPC in addition to other backends

* - fix copy_tensor being called on the src buffer instead of the dst buffer

- always initialize views in the view_src buffer

- add RPC backend to Makefile build

- add endpoint to all RPC object names

* add rpc-server to Makefile

* Update llama.cpp

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-06-03 20:03:26 +03:00
Masaya, Kato
a5735e4426
ggml : use OpenMP as a thread pool (#7606)
* ggml: Added OpenMP for multi-threaded processing

* ggml : Limit the number of threads used to avoid deadlock

* update shared state n_threads in parallel region

* clear numa affinity for main thread even with openmp

* enable openmp by default

* fix msvc build

* disable openmp on macos

* ci : disable openmp with thread sanitizer

* Update ggml.c

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-06-03 17:14:15 +02:00
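
As a rough picture of the approach above (a minimal sketch under the assumption that workers split the work by thread index; not the actual ggml scheduling code), an OpenMP parallel region can serve as the thread pool, with the thread count re-read inside the region:

    #include <omp.h>
    #include <stdio.h>

    // hypothetical per-worker routine: each worker would process its slice here
    static void compute_chunk(int ith, int nth) {
        printf("worker %d of %d\n", ith, nth);
    }

    int main(void) {
        int n_threads = 4; // shared state, as in the "update shared state" item above
        #pragma omp parallel num_threads(n_threads)
        {
            // the runtime may grant fewer threads than requested, so re-read
            // the actual count inside the parallel region
            int nth = omp_get_num_threads();
            int ith = omp_get_thread_num();
            compute_chunk(ith, nth);
        }
        return 0;
    }
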
Johannes Gäßler
0b832d53ba
make: fix debug options not being applied to NVCC (#7714) 2024-06-03 16:28:58 +02:00
0cc4m
3d7ebf6312
Vulkan Mixture of Experts (MoE) support (#7628)
* Finish Vulkan mul_mat_id implementation

* Add Vulkan sum_rows and div ops

* Fix MUL_MAT_ID matrix matrix shader

* Fix MUL_MAT_ID matrix vector shader dispatch size

* Fix MUL_MAT_ID matrix vector shader and dispatch code

* Update Vulkan CPU offload for MUL_MAT_ID

* Fix crash when using split mode none and setting a main GPU
2024-06-03 10:59:14 +02:00
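
For readers unfamiliar with the op above, this is the core idea behind a MUL_MAT_ID (mixture-of-experts) matmul, sketched on plain arrays rather than the Vulkan shaders or the ggml API; the names and memory layout here are illustrative assumptions:

    #include <stdint.h>

    // each input row is multiplied by the expert matrix selected for it by ids
    static void mul_mat_id_f32(const float * experts,  // [n_expert][n_out][n_in]
                               int64_t n_out, int64_t n_in,
                               const float * x,        // [n_rows][n_in]
                               const int32_t * ids,    // [n_rows], expert per row
                               int64_t n_rows,
                               float * y) {            // [n_rows][n_out]
        for (int64_t r = 0; r < n_rows; ++r) {
            const float * w = experts + (int64_t) ids[r] * n_out * n_in;
            for (int64_t o = 0; o < n_out; ++o) {
                float sum = 0.0f;
                for (int64_t i = 0; i < n_in; ++i) {
                    sum += w[o * n_in + i] * x[r * n_in + i];
                }
                y[r * n_out + o] = sum;
            }
        }
    }
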
Andy Tai
a10cda58d3
cmake : add pkg-config spec file for llama.cpp (#7702) 2024-06-03 11:06:24 +03:00
zhangkaihuo
6f28a333c1
llama : MiniCPM support tied embeddings (#7664)
* support lm_head

* remove the code block

---------

Co-authored-by: zhangkaihuo <zhangkaihuo@modelbest.cn>
2024-06-03 10:49:30 +03:00
Georgi Gerganov
549279d804
llama : avoid double token-to-piece cache (#7654)
ggml-ci
2024-06-03 08:34:43 +03:00
woachk
9e405b6e2e
kompute : implement op_getrows_f32 (#6403)
op_getrows_f32 is required since https://github.com/ggerganov/llama.cpp/pull/6122
for the Vulkan w/ Kompute backend to be functional.

As such, implement this op to make this backend functional again.
2024-06-03 08:32:16 +03:00
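
For context on what the op above does, a minimal sketch of get_rows semantics for f32 data on plain arrays (not the Kompute shader or the ggml API):

    #include <stdint.h>
    #include <string.h>

    // dst row i is a copy of src row ids[i]: a simple row gather,
    // e.g. the embedding lookup at the start of inference
    static void get_rows_f32(const float * src, int64_t n_cols,
                             const int32_t * ids, int64_t n_ids,
                             float * dst) {
        for (int64_t i = 0; i < n_ids; ++i) {
            memcpy(dst + i * n_cols, src + (int64_t) ids[i] * n_cols,
                   n_cols * sizeof(float));
        }
    }
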
Dave Airlie
3413ae2193
fix bug introduced in using calloc (#7701)
compilade pointed this out on the previous MR
2024-06-02 17:59:54 -04:00
Georgi Gerganov
1669810d7c
flake.lock: Update (#7686)
Flake lock file updates:

• Updated input 'flake-parts':
    'github:hercules-ci/flake-parts/8dc45382d5206bd292f9c2768b8058a8fd8311d9?narHash=sha256-/GJvTdTpuDjNn84j82cU6bXztE0MSkdnTWClUCRub78%3D' (2024-05-16)
  → 'github:hercules-ci/flake-parts/2a55567fcf15b1b1c7ed712a2c6fadaec7412ea8?narHash=sha256-iKzJcpdXih14qYVcZ9QC9XuZYnPc6T8YImb6dX166kw%3D' (2024-06-01)
• Updated input 'flake-parts/nixpkgs-lib':
    'https://github.com/NixOS/nixpkgs/archive/50eb7ecf4cd0a5756d7275c8ba36790e5bd53e33.tar.gz?narHash=sha256-QBx10%2Bk6JWz6u7VsohfSw8g8hjdBZEf8CFzXH1/1Z94%3D' (2024-05-02)
  → 'https://github.com/NixOS/nixpkgs/archive/eb9ceca17df2ea50a250b6b27f7bf6ab0186f198.tar.gz?narHash=sha256-lIbdfCsf8LMFloheeE6N31%2BBMIeixqyQWbSr2vk79EQ%3D' (2024-06-01)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/bfb7a882678e518398ce9a31a881538679f6f092?narHash=sha256-4zSIhSRRIoEBwjbPm3YiGtbd8HDWzFxJjw5DYSDy1n8%3D' (2024-05-24)
  → 'github:NixOS/nixpkgs/ad57eef4ef0659193044870c731987a6df5cf56b?narHash=sha256-SzDKxseEcHR5KzPXLwsemyTR/kaM9whxeiJohbL04rs%3D' (2024-05-29)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-06-02 14:13:12 -07:00
Austin
7c4e5b7eae
chore : add ignore rule for generated server themes (#7689) 2024-06-02 20:39:08 +03:00