* cuda : fix vmm pool with multi GPU
* hip
* use recommended granularity instead of minimum
* better error checking
* fix mixtral
* use cudaMemcpy3DPeerAsync
* use cuda_pool_alloc in ggml_cuda_op_mul_mat
* consolidate error checking in ggml_cuda_set_device
* remove unnecessary inlines
ggml-ci
* style fixes
* only use vmm for the main device
* fix scratch buffer size, re-enable vmm pool for all devices
* remove unnecessary check id != g_main_device
* Add logit_bias to the OpenAI API
* Clean up and refactor, test in Swagger.
---------
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
* Downgrade CUDA to 11.4
This makes the binary smaller and adds K80 support; the manual builds we did already had this.
* Update kcpp-build-release-win-cuda.yaml
* Restore concedo_experimental
* cuda : improve cuda pool efficiency using virtual memory
* fix mixtral
* fix cmake build
* check for vmm support, disable for hip
ggml-ci
* fix hip build
* clarify granularity
* move all caps to g_device_caps
* refactor error checking
* add cuda_pool_alloc, refactor most pool allocations
ggml-ci
* fix hip build
* CUBLAS_TF32_TENSOR_OP_MATH is not a macro
* more hip crap
* llama : fix msvc warnings
* ggml : fix msvc warnings
* minor
* minor
* cuda : fallback to CPU on host buffer alloc fail
* Update ggml-cuda.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Update ggml-cuda.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* ensure allocations are always aligned
* act_size -> actual_size
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Check the full vocab for grammar only if necessary
* Fix missing logit restoration step (?)
Does this matter, actually?
* Fix whitespace / formatting
* Adjust comment
* Didn't mean to push test gbnf
* Split sampling into the helper function (?)
And also revert the changes made to the header
* common : fix final newline
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* initial commit, going through initializations
* main loop finished, starting to debug
* BUG: generates gibberish/repeating tokens after a while
* kv_cache management
* Added colors to distinguish drafted tokens (--color). Updated README
* lookup : fix token positions in the draft batch
* lookup : use n_draft from CLI params
* lookup : final touches
---------
Co-authored-by: Leon Ericsson <leon.ericsson@icloud.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* fix old jetson compile error
* Update Makefile
* update jetson detection and cuda version detection
* update cuda macro define
* update makefile and cuda, fix some issues
* Update README.md
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update Makefile
* Update README.md
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>