Commit graph

2517 commits

Author SHA1 Message Date
Kerfuffle
6e08281e58
Extend llama_kv_cache_seq_rm to allow matching any sequence (#3843)
* Extend llama_kv_cache_seq_rm to allow matching any sequence

* Replace llama_kv_cache_tokens_rm with llama_kv_cache_clear

Use llama_kv_cache_clear for cache clearing

Change calls to llama_kv_cache_tokens_rm that want to delete by position to use llama_kv_cache_seq_rm functionality
2023-10-29 11:31:40 -06:00
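A minimal sketch of the resulting cache API, assuming the llama.h signatures of this period; the prune_cache wrapper is hypothetical:

```c
#include "llama.h"

// hypothetical helper showing the extended semantics from #3843
void prune_cache(struct llama_context * ctx) {
    // remove positions [100, 200) from sequence 0 only
    llama_kv_cache_seq_rm(ctx, 0, 100, 200);

    // a negative seq_id now matches any sequence: drop [100, 200) everywhere
    llama_kv_cache_seq_rm(ctx, -1, 100, 200);

    // full clear, replacing the removed llama_kv_cache_tokens_rm
    llama_kv_cache_clear(ctx);
}
```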
cebtenzzre
2046eb4345
make : remove unnecessary dependency on build-info.h (#3842) 2023-10-29 18:33:47 +02:00
Georgi Gerganov
71a09da301
llama : fix kv shift bug (#3835)
ggml-ci
2023-10-29 18:32:51 +02:00
Georgi Gerganov
d69d777c02
ggml : quantization refactoring (#3833)
* ggml : factor all quantization code in ggml-quants

ggml-ci

* ggml-quants : fix Zig and Swift builds + quantize tool

ggml-ci

* quantize : --pure option for disabling k-quant mixtures

---------

Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
2023-10-29 18:32:28 +02:00
Concedo
7f5d1b2fc6 slider error 2023-10-30 00:02:38 +08:00
Concedo
7f050b5d16 tweak numbers 2023-10-29 22:46:19 +08:00
Concedo
7924592a83 context shift feature done 2023-10-29 18:21:39 +08:00
Concedo
338d6c265d fixes to smartcontextpro 2023-10-29 10:42:37 +08:00
Erik Scholz
ff3bad83e2
flake : update flake.lock for newer transformers version + provide extra dev shell (#3797)
* flake : update flake.lock for newer transformers version + provide extra dev shell with torch and transformers (for most convert-xxx.py scripts)
2023-10-28 16:41:07 +02:00
Aarni Koskela
82a6646e02
metal : try cwd for ggml-metal.metal if bundle lookup fails (#3793)
* Try cwd for ggml-metal if bundle lookup fails

When building with `-DBUILD_SHARED_LIBS=ON -DLLAMA_METAL=ON -DLLAMA_BUILD_SERVER=ON`,
`server` would fail to load `ggml-metal.metal` because `[bundle pathForResource:...]`
returns `nil`.  In that case, fall back to `ggml-metal.metal` in the cwd instead of
passing `null` as a path.

Follows up on #1782

* Update ggml-metal.m

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-10-28 15:43:01 +03:00
Georgi Gerganov
ba231e8a6d
issues : change label from bug to bug-unconfirmed (#3748) 2023-10-28 15:35:26 +03:00
Georgi Gerganov
8a2f2fea29
convert : ignore tokens if their IDs are within [0, vocab_size) (#3831) 2023-10-28 06:25:15 -06:00
Kerfuffle
bd6d9e2059
llama : allow quantizing k-quants to fall back when tensor size incompatible (#3747)
* Allow quantizing k-quants to fall back when tensor size incompatible

* quantizing: Add warning when tensors were incompatible with k-quants

Clean up k-quants state passing a bit
2023-10-28 14:54:24 +03:00
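A sketch of the constraint behind the fallback, assuming the default k-quant super-block size QK_K = 256; the helper and the fallback choices are illustrative, not the PR's actual code:

```c
#include "ggml.h"

#define QK_K 256 // assumed build default for k-quant super-blocks

// illustrative: rows whose width is not a multiple of QK_K cannot use
// k-quants, so fall back to a compatible non-k quantization type
static enum ggml_type k_quant_or_fallback(enum ggml_type wanted, int64_t ne0) {
    if (ne0 % QK_K != 0) {
        switch (wanted) {
            case GGML_TYPE_Q2_K:
            case GGML_TYPE_Q3_K:
            case GGML_TYPE_Q4_K: return GGML_TYPE_Q4_0; // illustrative choices
            case GGML_TYPE_Q5_K: return GGML_TYPE_Q5_0;
            case GGML_TYPE_Q6_K: return GGML_TYPE_Q8_0;
            default: break;
        }
    }
    return wanted;
}
```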
Georgi Gerganov
ee1a0ec9cb
llama : add option for greedy sampling with probs (#3813)
* llama : add option for greedy sampling with probs

* llama : add comment about llama_sample_token_greedy() missing probs

* sampling : temp == 0.0 -> no probs, temp < 0.0 -> probs
2023-10-28 14:23:11 +03:00
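The temperature convention from the last bullet, sketched against the sampling API of this period; the wrapper itself is hypothetical:

```c
#include "llama.h"

// hypothetical wrapper: temp == 0.0 -> greedy, no probs computed;
// temp < 0.0 -> still greedy, but candidates carry probabilities
llama_token sample_with_temp(struct llama_context * ctx,
                             llama_token_data_array * cands, float temp) {
    if (temp == 0.0f) {
        return llama_sample_token_greedy(ctx, cands);
    }
    if (temp < 0.0f) {
        llama_sample_softmax(ctx, cands); // sorts and fills cands->data[i].p
        return cands->data[0].id;         // argmax, with probs available
    }
    llama_sample_temp(ctx, cands, temp);
    return llama_sample_token(ctx, cands);
}
```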
Concedo
20ef442c2a fixed for smartcontext 2023-10-28 19:09:22 +08:00
Henk Poley
177461104b
common : print that one line of the syntax help *also* to standard output (#3823) 2023-10-28 13:16:33 +03:00
Concedo
6cf2b4c73b MMQ optimizations (+1 squashed commits)
Squashed commits:

[d87de001] mmq optimization (+1 squashed commits)

Squashed commits:

[f1f67af8] still allow mmq
2023-10-28 17:57:46 +08:00
Georgi Gerganov
fdee152e4e
starcoder : add GPU offloading (#3827)
* starcoder : do not GPU split 1D bias tensors

* starcoder : offload layers to GPU

ggml-ci
2023-10-28 12:06:08 +03:00
Concedo
2ea3b567cf Merge: Testing speed of tensor cores vs MMQ 2023-10-28 16:41:42 +08:00
Concedo
2fa1137890 updated lite 2023-10-28 14:43:15 +08:00
Concedo
09c74ea046 include content-length 2023-10-28 14:24:37 +08:00
Concedo
64f3bc5168 update model string (+1 squashed commits)
Squashed commits:

[a7c568ea] simplify colab
2023-10-28 14:07:52 +08:00
Concedo
879d1ba268 simplify colab dropdowns (+1 squashed commits)
Squashed commits:

[72aab0e8] simplify colab dropdown
2023-10-28 13:57:01 +08:00
Pyroserenus
eb9a93097b
Colab Improvements (#498)
* Update colab.ipynb
2023-10-28 13:26:59 +08:00
Concedo
15f525c580 revamped smart context for llama models 2023-10-28 12:59:08 +08:00
Kerfuffle
41aee4df82
speculative : ensure draft and target model vocab matches (#3812)
* speculative: Ensure draft and target model vocab matches

* Tolerate small differences when checking dft vs tgt vocab
2023-10-28 00:40:07 +03:00
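A sketch of the kind of check described here; the tolerance and structure are illustrative, and llama_n_vocab is assumed to take a model pointer as in the llama.h of this period:

```c
#include <stdbool.h>
#include <stdlib.h>
#include "llama.h"

// illustrative: refuse speculative decoding when the draft (dft) and
// target (tgt) vocab sizes differ by more than a small tolerance
static bool vocab_compatible(const struct llama_model * tgt,
                             const struct llama_model * dft) {
    const int n_tgt = llama_n_vocab(tgt);
    const int n_dft = llama_n_vocab(dft);
    return abs(n_tgt - n_dft) <= 100; // illustrative tolerance
}
```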
cebtenzzre
6d459cbfbe
llama : correctly report GGUFv3 format (#3818) 2023-10-27 17:33:53 -04:00
Thibault Terrasson
c8d6a1f34a
simple : fix batch handling (#3803) 2023-10-27 08:37:41 -06:00
Georgi Gerganov
2f9ec7e271
cuda : improve text-generation and batched decoding performance (#3776)
* cuda : prints wip

* cuda : new cublas gemm branch for multi-batch quantized src0

* cuda : add F32 sgemm branch

* cuda : fine-tune >= VOLTA params + use MMQ only for small batches

* cuda : remove duplicated cuBLAS GEMM code

* cuda : add CUDA_USE_TENSOR_CORES and GGML_CUDA_FORCE_MMQ macros

* build : add compile option to force use of MMQ kernels
2023-10-27 17:01:23 +03:00
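A sketch of the dispatch heuristic those macros control; the cutoff and the wiring are illustrative:

```c
#include <stdbool.h>

// illustrative: pick the quantized matmul kernels (MMQ) or the cuBLAS
// GEMM path, per the small-batch heuristic described in #3776
static bool use_mmq_kernel(int n_batch, bool have_tensor_cores) {
    bool force_mmq = false;
#ifdef GGML_CUDA_FORCE_MMQ
    force_mmq = true; // compile-time override added by the PR
#endif
    if (force_mmq || !have_tensor_cores) {
        return true;
    }
    return n_batch < 32; // illustrative cutoff: MMQ wins only for small batches
}
```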
Concedo
c2f675133d support for abort without crash on disconnect 2023-10-27 15:27:17 +08:00
Georgi Gerganov
34b2a5e1ee
server : do not release slot on image input (#3798) 2023-10-26 22:54:17 +03:00
Concedo
aed05e5565 todo: troubleshoot sse with multiuser 2023-10-27 00:21:52 +08:00
Concedo
f344a99425 causallm is not working well on clblast, running out of mem with blas. this helps a bit but doesn't fix the problem. 2023-10-26 23:36:35 +08:00
Concedo
0f46534866 wip 2023-10-26 21:58:51 +08:00
Concedo
5db89b90b7 Merge branch 'master' into concedo_experimental
# Conflicts:
#	.gitignore
#	CMakeLists.txt
#	Makefile
#	README.md
#	build.zig
#	ggml-opencl.cpp
#	tests/CMakeLists.txt
#	tests/test-double-float.cpp
#	tests/test-sampling.cpp
2023-10-25 23:58:15 +08:00
Concedo
98d1dba256 tighten timings 2023-10-25 20:44:20 +08:00
Georgi Gerganov
6961c4bd0b
batched-bench : print params at start 2023-10-25 10:26:27 +03:00
Concedo
c9983a72d6 prevent lora with clblast 2023-10-25 15:18:03 +08:00
Georgi Gerganov
cc44877486
log : disable pid in log filenames 2023-10-25 10:09:16 +03:00
Concedo
30d1017021 update readme and colab (+1 squashed commits)
Squashed commits:

[ec2a7c2a] improve colab (+1 squashed commits)

Squashed commits:

[404f81b2] shorter 302 redirect url for prebuilt binaries
2023-10-25 15:01:22 +08:00
cebtenzzre
ad93962657
server : add parameter -tb N, --threads-batch N (#3584) (#3768)
Co-authored-by: Michael Coppola <m18coppola@gmail.com>
Co-authored-by: Michael Coppola <info@michaeljcoppola.com>
2023-10-24 23:10:43 +03:00
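What the flag maps to, sketched against the llama_context_params fields assumed for this period:

```c
#include "llama.h"

// sketch: -tb/--threads-batch sets the thread count for batch (prompt)
// processing, separate from the generation threads set by -t
static struct llama_context_params make_params(void) {
    struct llama_context_params p = llama_context_default_params();
    p.n_threads       = 8;  // threads for single-token (generation) eval
    p.n_threads_batch = 16; // threads for batch (prompt) eval; what -tb N sets
    return p;
}
```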
Georgi Gerganov
1717521cdb
server : do not block system prompt update (#3767)
* server : do not block system prompt update

* server : update state machine logic to process system prompts

* server : minor
2023-10-24 23:08:20 +03:00
Georgi Gerganov
b2f7e04bd3
sync : ggml (conv ops + cuda MSVC fixes) (#3765)
ggml-ci
2023-10-24 21:51:20 +03:00
John Smith
abd21fc99f
cmake : add missed dependencies (#3763) 2023-10-24 20:48:45 +03:00
Concedo
839fc6dac8 handle freq_base_train 2023-10-24 23:44:22 +08:00
Georgi Gerganov
2b4ea35e56
cuda : add batched cuBLAS GEMM for faster attention (#3749)
* cmake : add helper for faster CUDA builds

* batched : add NGL arg

* ggml : skip nops in compute_forward

* cuda : minor indentation

* cuda : batched cuBLAS GEMMs for src0 F16 and src1 F32 (attention ops)

* Apply suggestions from code review

These changes plus:

```c++
#define cublasGemmBatchedEx hipblasGemmBatchedEx
```

are needed to compile with ROCm. I haven't done performance testing, but it seems to work.

I couldn't figure out how to propose a change for lines outside what the pull changed; also, this is the first time I've tried to create a multi-part review, so please forgive me if I mess something up.

* cuda : add ROCm / hipBLAS cublasGemmBatchedEx define

* cuda : add cublasGemmStridedBatchedEx for non-broadcasted cases

* cuda : reduce mallocs in cublasGemmBatchedEx branch

* cuda : add TODO for calling cublas from kernel + using mem pool

---------

Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
2023-10-24 16:48:37 +03:00
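A sketch of the shape of such a batched call, including the define quoted in the review; the buffer setup and parameter choices are illustrative:

```c
#if defined(GGML_USE_HIPBLAS)
#define cublasGemmBatchedEx hipblasGemmBatchedEx // the define from the review
#endif

#include <cublas_v2.h>

// illustrative: one F16 GEMM per attention head, issued in a single call;
// A, B, C are device arrays of per-batch device pointers
static void gemm_batched_f16(cublasHandle_t handle,
                             const void * const * A, const void * const * B,
                             void * const * C,
                             int m, int n, int k,
                             int lda, int ldb, int ldc, int n_batch) {
    const float alpha = 1.0f; // alpha/beta are float for CUBLAS_COMPUTE_32F
    const float beta  = 0.0f;
    cublasGemmBatchedEx(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                        m, n, k,
                        &alpha,
                        A, CUDA_R_16F, lda,
                        B, CUDA_R_16F, ldb,
                        &beta,
                        C, CUDA_R_16F, ldc,
                        n_batch,
                        CUBLAS_COMPUTE_32F,
                        CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}
```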
AlpinDale
6a4d9c26e1
readme: add AUR instructions and clean up preview (#494)
* readme: add AUR instructions and clean up preview

This PR adds the following:

- Instructions for the koboldcpp Arch User Repository (AUR) packages
- Clean up the preview images by placing them in a 2x2 table

* remove table
2023-10-24 17:13:56 +08:00
teddybear082
7d120f2794
Add context size parameter to google colab notebook (#489)
- add configurable context size to parameters, along with models and layers, for ease of use

- this can already be done with a simple edit by experienced LLM users, but new users may not know this is a parameter they should set.

Co-authored-by: LostRuins <39025047+LostRuins@users.noreply.github.com>
2023-10-24 17:13:01 +08:00
Concedo
7744aa6a9c updated colab 2023-10-24 15:37:47 +08:00
Galunid
daab3d7f45
Add more tokenizer tests (#3742)
* Add more tokenizer tests

* Add starcoder

* Update test vocab files

* Restrict bpe tokenizer tests to unicode planes

* Update comment

* Comment cosmetics

* Remove bloom vocab/test
2023-10-24 09:17:17 +02:00