slaren
bc5e64b1bf
propagate buffer usage in multi buffers
2024-01-23 21:01:57 +01:00
0cc4m
3742b6c706
Fix single queue logic
2024-01-23 17:45:50 +01:00
0cc4m
566a178c8f
Handle devices with only a single queue
2024-01-23 17:34:45 +01:00
0cc4m
1c953c10a0
Check for maintenance4 support before using it
2024-01-23 08:00:39 +01:00
0cc4m
f2c364a574
Disable unsupported ops to fix tests
2024-01-22 23:52:13 +01:00
slaren
6b97c71834
refactor multi buf
2024-01-22 23:16:02 +01:00
0cc4m
bcf2a4488c
Use min of maxMemoryAllocationSize and maxBufferSize for device max allocation size
2024-01-22 21:57:17 +01:00
0cc4m
f652ebfd54
Implement max_size for backend buffer types to limit the size of a single allocation
2024-01-22 18:39:04 +01:00
0cc4m
7fa5ca9e62
Fix gcc warnings
2024-01-21 13:31:15 +01:00
0cc4m
1f55cd20a0
Simplify barrier synchronization calls
2024-01-21 13:27:09 +01:00
0cc4m
00f214c335
Fix oversized host staging buffers
2024-01-21 12:52:52 +01:00
0cc4m
6e6174206f
Properly implement Vulkan backend buffer handling
2024-01-21 10:28:46 +01:00
0cc4m
c0f3474ed5
Fix compiler warnings
2024-01-18 20:44:00 +01:00
0cc4m
f84c54fe23
Fix warning about empty C function parameters
2024-01-18 19:34:56 +01:00
0cc4m
1811c4ec9b
Replace uint64_t(-1) with UINT64_MAX, rename function for clarity
2024-01-18 19:34:21 +01:00
0cc4m
2d14b22a99
Merge upstream changes, implement basic vulkan backend
2024-01-18 16:54:12 +01:00
0cc4m
02d2e38949
Fix missing event cast
2024-01-17 06:27:40 +01:00
0cc4m
c3290d29e0
Switch from semaphore-synchronized multiple command buffers per op to single command buffer for multiple ops, whole graph if possible
2024-01-16 21:30:14 +01:00
Georgi Gerganov
5c99960901
py : remove unnecessary hasattr ( #4903 )
2024-01-16 20:59:31 +02:00
Philip Taron
bee938da74
nix: remove nixConfig from flake.nix ( #4984 )
2024-01-16 09:56:21 -08:00
Daniel Bevenius
cec8a48470
finetune : add training data file to log message ( #4979 )
...
This commit adds the name of the training data file to the log message
printed when the training data is tokenized.
The motivation for this change is that it can be useful to show which
file is being tokenized when running the finetune example.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-01-16 19:54:24 +02:00
Kawrakow
334a835a1c
ggml : importance matrix support for legacy quants ( #4969 )
...
* imatrix: adding support for legacy quants
* imatrix: guard Q4_0/Q5_0 against ffn_down craziness
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-16 19:51:26 +02:00
Maximilian Winter
4feb4b33ee
examples : add complete parallel function calling example ( #4974 )
2024-01-16 19:41:42 +02:00
Georgi Gerganov
959ef0c0df
perplexity : fix kv cache handling for hellaswag ( #4981 )
...
ggml-ci
2024-01-16 19:34:54 +02:00
Georgi Gerganov
c37b3474e6
flake.lock: update flake-parts, flake-parts/nixpkgs-lib, and nixpkgs ( #4920 )
...
Flake lock file updates:
• Updated input 'flake-parts':
'github:hercules-ci/flake-parts/34fed993f1674c8d06d58b37ce1e0fe5eebcb9f5' (2023-12-01)
→ 'github:hercules-ci/flake-parts/07f6395285469419cf9d078f59b5b49993198c00' (2024-01-11)
• Updated input 'flake-parts/nixpkgs-lib':
'github:NixOS/nixpkgs/e92039b55bcd58469325ded85d4f58dd5a4eaf58?dir=lib' (2023-11-29)
→ 'github:NixOS/nixpkgs/b0d36bd0a420ecee3bc916c91886caca87c894e9?dir=lib' (2023-12-30)
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/cfc3698c31b1fb9cdcf10f36c9643460264d0ca8' (2023-12-27)
→ 'github:NixOS/nixpkgs/317484b1ead87b9c1b8ac5261a8d2dd748a0492d' (2024-01-08)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-01-16 09:13:54 -08:00
Paul Tsochantaris
158f8c9e21
metal : localized logic in ggml_metal_graph_compute ( #4924 )
...
* Metal: Localized logic in `ggml_metal_graph_compute`, minor performance improvement
* Whitespace
* Collecting command buffer completions on single thread
* Whitespace
* Reduce diff noise
2024-01-16 19:05:19 +02:00
Neuman Vong
862f5e41ab
android : introduce starter project example ( #4926 )
...
* Introduce starter project for Android
Based on examples/llama.swiftui.
* Add github workflow
* Set NDK version
* Only build arm64-v8a in CI
* Sync bench code
* Rename CI prop to skip-armeabi-v7a
* Remove unused tests
2024-01-16 15:47:34 +02:00
Alex Azarov
3a48d558a6
metal : replace loop of dispatch_async with dispatch_apply ( #4934 )
...
* Replace loop of dispatch_async with dispatch_apply
* Update ggml-metal.m
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-16 15:41:27 +02:00
Alex Azarov
7c8d3abd1a
metal : log recommendedMaxWorkingSetSize on iOS 16+ ( #4936 )
...
* metal: Log `recommendedMaxWorkingSetSize` on iOS 16+
* Only log on iOS and macOS, ignoring tvOS and other platforms
* Check for Xcode version before using recommendedMaxWorkingSetSize
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-16 15:33:02 +02:00
Maximilian Winter
122ed4840c
examples : fix and improve docs for the grammar generator ( #4909 )
...
* Create pydantic-models-to-grammar.py
* Added some comments for usage
* Refactored Grammar Generator
Added example and usage instruction.
* Update pydantic_models_to_grammar.py
* Update pydantic-models-to-grammar-examples.py
* Renamed module and imported it.
* Update pydantic-models-to-grammar.py
* Renamed file and fixed grammar generator issue.
* Fixed some issues and bugs of the grammar generator. Improved Documentation
* Update pydantic_models_to_grammar.py
2024-01-16 14:10:48 +02:00
Justine Tunney
a0b3ac8c48
ggml : introduce GGML_CALL function annotation ( #4850 )
...
This change makes it possible to build ggml-cuda.cu and ggml-metal.m as
independent dynamic shared objects, that may be conditionally linked at
runtime in a multiplatform binary. It introduces a GGML_CALL annotation
that documents which functions have a cyclic call relationship, between
the application code and GPU modules.
This change does nothing unless the build defines -DGGML_MULTIPLATFORM,
which causes back-references and function pointers to conform to MS ABI,
which is supported by NVCC, ROCm, XCode, GCC and Clang across platforms.
2024-01-16 13:16:33 +02:00
Daniel Bevenius
d75c232e1d
finetune : use LLAMA_FILE_MAGIC_GGLA ( #4961 )
...
This commit replaces the magic number LLAMA_FILE_MAGIC_LORA used in
finetune.cpp with LLAMA_FILE_MAGIC_GGLA defined in llama.h.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-01-16 13:14:19 +02:00
stduhpf
e0324285a5
speculative : threading options ( #4959 )
...
* speculative: expose draft threading
* fix usage format
* accept -td and -tbd args
* speculative: revert default behavior when -td is unspecified
* fix trailing whitespace
2024-01-16 13:04:32 +02:00
ngc92
3e5ca7931c
pass cpu-architecture arguments only to host code (C;C++) ( #4943 )
2024-01-15 19:40:48 +01:00
David Friehs
4483396751
llama : apply classifier-free guidance to logits directly ( #4951 )
2024-01-15 15:06:52 +02:00
Victor Z. Peng
d9aa4ffa6e
awq-py : fix typo in awq-py/README.md ( #4947 )
2024-01-15 14:41:46 +02:00
Georgi Gerganov
ddb008d845
cuda : fix dequantize kernel names ( #4938 )
2024-01-15 13:27:00 +02:00
Kawrakow
2faaef3979
llama : check for 256 divisibility for IQ2_XS, IQ2_XXS ( #4950 )
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-15 10:09:38 +02:00
Kawrakow
4a3156de2f
CUDA: faster dequantize kernels for Q4_0 and Q4_1 ( #4938 )
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-15 07:48:06 +02:00
David Pflug
a836c8f534
llama : fix missing quotes ( #4937 )
2024-01-14 17:46:00 +02:00
Kawrakow
467a882fd2
Add ability to use importance matrix for all k-quants ( #4930 )
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-14 16:21:12 +02:00
Georgi Gerganov
bb0c139247
llama : check LLAMA_TRACE env for extra logging ( #4929 )
...
* llama : minor fix indent
* llama : check LLAMA_TRACE env for extra logging
ggml-ci
2024-01-14 13:26:53 +02:00
Georgi Gerganov
9408cfdad6
scripts : sync-ggml-am.sh option to skip commits
2024-01-14 11:08:41 +02:00
Georgi Gerganov
03c5267490
llama : use LLAMA_LOG_ macros for logging
2024-01-14 11:03:19 +02:00
Kawrakow
a128c38de8
Fix ffn_down quantization mix for MoE models ( #4927 )
...
* Fix ffn_down quantization mix for MoE models
In #4872 I did not consider the part where every third
tensor is quantized with more bits. For MoE this leads to tensors
of the same layer being quantized with different number of bits,
which is not considered as a possibility in the inference implementation
(it is assumed all experts use the same quantization).
* Fix the fix
* Review suggestion
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-14 10:53:39 +02:00
Alex Azarov
5f5fe1bd60
metal : correctly set SIMD support flags on iOS ( #4923 )
...
* Correctly set support_simdgroup_reduction and support_simdgroup_mm on iPhone/iPad
* log a little bit more info on iOS
2024-01-14 10:44:39 +02:00
Karthik Kumar Viswanathan
ac32902a87
llama : support WinXP build with MinGW 8.1.0 ( #3419 )
2024-01-14 10:41:44 +02:00
Kawrakow
147b17ac94
2-bit quantizations ( #4897 )
...
* imatrix: load
* imatrix: WIP
* imatrix: Add Q2_K quantization
* imatrix: also guard against Q2_K_S quantization without importance matrix
* imatrix: guard even more against low-bit quantization misuse
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-14 09:45:56 +02:00
Kawrakow
807179ec58
Make Q3_K_S be the same as old Q3_K_L for Mixtral-8x7B ( #4906 )
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-14 09:44:30 +02:00
Georgi Gerganov
76484fbfd3
sync : ggml
2024-01-14 00:14:46 +02:00