* Initial Vulkan multi-gpu implementation
Move most global variables into backend context
* Add names to backend device functions
* Add further missing cleanup code
* Reduce code duplication in tensor split layer assignment
* generalize LLAMA_SPLIT_LAYER for all backends, do not expose device count and memory in llama.h
* Only print device info at the beginning and initialize one backend for cpu assist
Add missing cleanup code
* Rework backend memory management to make sure devices and buffers get properly allocated and freed
* Rename cpu assist free function
---------
Co-authored-by: slaren <slarengh@gmail.com>
* support minicpm arch.
* fix tab/space typo.
* convert minicpm model via convert-hf-gguf.py
* try to make tokenizer work
* fix bug for quantize minicpm
* fix for flake8 lint
* remove convert-minicpm.py
* fix for editorconfig
* correct minicpm model type (size)
* constants expanded for minicpm
* Minor change of the constant names for minicpm
* include total "num_slots" in default_generation_settings_for_props
* cleanup total_slots return value in /props endpoint
* update /props endpoint docs with total_slots
* remove num_slots from default_generation_settings_for_props
* update /props endpoint section
server : fix deadlock when prompt array contains strings and numbers
server : removed an unnecessary generation when generating multi-prompts
server : removed an unnecessary assert
The content of the OBJ type is actually a list of all key names of the object.
* Python
* `gguf_writer.py`:
* Added `def add_kv(self, key: str, val: Any) -> None`: Automatically determines the appropriate value type based on `val`.
* Added `def add_dict(self, key: str, val: dict, excludes: Sequence[str] = []) -> None`: Adds object (dict) values, recursively adding all subkeys.
* Added `add_array_ex` to support nested and mixed-type arrays.
* `constants.py`:
* Added `GGUFValueType.get_type_ex(val)`: Supports numpy integer and floating-point values, selecting the bit width according to the size of the integer.
* `gguf_reader.py`:
* Added functionality to retrieve values from specific fields using the `ReaderField.get()` method.
* Unit test added
* CPP
* `ggml`:
* Added `GGUF_TYPE_OBJ` to the `gguf_type` enum type.
* Use `gguf_get_arr_n` and `gguf_get_arr_str` to get the subkey names of `GGUF_TYPE_OBJ`.
* Added `gguf_set_obj_str` function to set object subkey names
* Added `gguf_set_arr_obj` function to set object array count
* Added `gguf_set_arr_arr` function to set nested array count
* `llama`:
* Modified `gguf_kv_to_str`
* Added `LLAMA_API char * gguf_kv_to_c_str` function to get the value as a C string in JSON format.
* Maybe this API should be moved into `ggml` as `gguf_get_val_json`. (The problem is that ggml.c is written in C, while this code makes heavy use of C++ features.)
* Added basic support for `GGUF_TYPE_OBJ` and nested arrays
* Unit test added
feat: add basic support for GGUF_TYPE_OBJ in cpp
feat(gguf.py): add OBJ and mixed-type array support to GGUF ARRAY
feat: add OBJ and mixed-type array support to GGUF ARRAY (CPP)
feat: add nested array support
feat: Subkey name convention in OBJ types:
* If the first letter of the subkey name is "/", it means referencing the full name of other keys.
* If there is a ":" colon delimiter, it means that the string after the colon represents the subkey name in this object, otherwise the referencing subkey name is used.
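The subkey name convention above can be sketched as a small parser. This is only an illustration under stated assumptions: `struct subkey_entry` and `parse_subkey_entry` are hypothetical names that do not exist in ggml or llama.cpp, the fixed buffer sizes are arbitrary, and the sketch assumes the ":" delimiter is only interpreted for entries that start with "/". When no alias is given, the fallback to the referenced name is left to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result type for illustration; not part of ggml or llama.cpp. */
struct subkey_entry {
    bool is_reference;  /* true when the entry started with '/' */
    char target[128];   /* referenced full key name, or the plain subkey name */
    char alias[128];    /* text after ':', empty when no colon is present */
};

/* Parse one OBJ entry following the convention described above.
 * Assumption: ':' is only treated as a delimiter for '/' references.
 * Returns false if entry is NULL or a part does not fit its buffer. */
static bool parse_subkey_entry(const char * entry, struct subkey_entry * out) {
    assert(out != NULL);
    if (entry == NULL) {
        return false;
    }

    memset(out, 0, sizeof(*out));
    out->is_reference = entry[0] == '/';

    /* Skip the leading '/' of a reference. */
    const char * name  = out->is_reference ? entry + 1 : entry;
    const char * colon = out->is_reference ? strchr(name, ':') : NULL;

    const size_t target_len = colon ? (size_t) (colon - name) : strlen(name);
    const char * alias      = colon ? colon + 1 : "";

    if (target_len >= sizeof(out->target) || strlen(alias) >= sizeof(out->alias)) {
        return false;
    }

    memcpy(out->target, name, target_len);
    out->target[target_len] = '\0';
    strcpy(out->alias, alias); /* safe: length checked above */
    return true;
}
```

With this sketch, an entry such as "/general.name:name" yields a reference to `general.name` stored under the local subkey `name`, while a plain entry such as "size" is returned unchanged as a non-reference.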
feat: add LLAMA_API gguf_kv_to_c_str to llama.h
test: write test gguf file to tests folder directly (py)
test: add test-gguf-meta.cpp
feat: Key convention: "." indicates that the key is a subkey, not an independent key.
feat: add excludes argument to add_dict (gguf_writer.py)
feat: add add_array_ex to support nested and mixed-type arrays, and keep add_array unchanged
fix(constants.py): roll back the get_type function and add the new get_type_ex
test: add test compatibility
fix: use GGML_MALLOC instead of malloc
* Make use of ggml-quants.h possible in C++ code
* One cannot possibly be defining static_assert in a C++ compilation
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Avoid duplicating function calls when using MIN/MAX macros.
Since these copy "a" and "b" they ask the compiler to evaluate one of them twice. The compiler doesn't have a problem with removing the duplication in something like MAX(0, x + 2), but in some cases we're calling functions, and those calls just happen twice.
By explicitly evaluating the expression once, we get smaller and faster code without duplicate calls. See ggml_rope_yarn_corr_dims in Compiler Explorer:
https://godbolt.org/z/Ee4KMrvKh
Code behaves exactly the same.
* Update ggml.c
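
To illustrate the double evaluation described in the MIN/MAX change above, here is a minimal sketch. The `MAX` definition is the conventional ternary macro, and `expensive`, `max_with_macro`, `max_with_temp`, and `count_calls` are hypothetical names made up for this example; none of this is code from ggml.c.

```c
#include <assert.h>
#include <stddef.h>

/* Conventional ternary macro: each argument appears twice in the expansion. */
#define MAX(a, b) ((a) > (b) ? (a) : (b))

static int call_count = 0;

/* Hypothetical stand-in for a costly function call; counts its invocations. */
static int expensive(int x) {
    call_count++;
    return x * 2;
}

/* Macro form: expensive(x) is written once here but expanded twice,
 * so it runs a second time whenever it is the selected operand. */
static int max_with_macro(int x) {
    return MAX(0, expensive(x));
}

/* Explicit form: evaluate once into a local, then compare plain values. */
static int max_with_temp(int x) {
    const int v = expensive(x);
    return MAX(0, v);
}

/* Returns how many times expensive() ran during a single call to fn(x). */
static int count_calls(int (*fn)(int), int x) {
    assert(fn != NULL);
    call_count = 0;
    (void) fn(x);
    return call_count;
}
```

Both forms return the same value. The difference is only in how often `expensive` runs: for a positive input the macro form calls it twice (once for the comparison, once for the result), while the form with a temporary always calls it once.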
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
We get slightly better PPL, and we cut quantization time
nearly in half.
The trick is to first quantize without forcing points onto the E8-lattice.
We can then use a narrower search range around the block scale that we
got that way.
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Update server-llm.sh
Add a --non-interactive flag that allows running the script without asking for permission
* Update scripts/server-llm.sh
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* imatrix: adding --combine and --continue-from
* imatrix: be able to start from a specific chunk
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fix Vulkan on Intel ARC
Optimize matmul for Intel ARC
Add Vulkan dequant test
* Add Vulkan debug and validate flags to Make and CMakeLists.txt
* Enable asynchronous transfers in Vulkan backend
* Fix flake8
* Disable Vulkan async backend functions for now
* Also add Vulkan run tests command to Makefile and CMakeLists.txt