Commit graph

2537 commits

Author SHA1 Message Date
Georgi Gerganov
9a3b4f6c86
ggml : fix UNUSED macro (#3762) 2023-11-01 13:50:45 +02:00
Andrew Godfrey
73bdcb395e
finetune : add -ngl parameter (#3762)
* Add '-ngl' support to finetune.cpp

* Add fprintf in ggml_cuda_op_add

When I tried CUDA offloading during finetuning following the readme, I got an assert here.
This probably isn't an important case because inference later gives a warning saying you should use f16 or f32 instead when using lora

* Add 'finetune.sh', which currently fails when using GPU

"error: operator (): Finetuning on tensors with type 'f16' is not yet supported"

* tweak finetune.sh

* Suppress some warnings in ggml.c

* Add f16 implementation to ggml_compute_forward_add_f16_f32

* Add an f16 case to ggml_add_cast_impl and llama_build_lora_finetune_graphs

* finetune.sh: Edit comments

* Add "add_f16_f32_f32_cuda"

* Tweak an error message

* finetune.sh: Add an optional LLAMA_MODEL_DIR variable

* finetune.sh: Add an optional LLAMA_TRAINING_DIR variable

* train : minor

* tabs to spaces

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
2023-11-01 13:49:04 +02:00
Concedo
ae2cd56de8 kobold integration of min_p sampler (+1 squashed commits)
Squashed commits:

[8ad2e349] kobold integration for min_p sampler
2023-11-01 19:08:45 +08:00
Concedo
bcb397953f Merge remote-tracking branch 'llama.cpp/try-fix-3869' into concedo_experimental 2023-11-01 18:29:08 +08:00
Concedo
92d80b94b3 bundle simpleclinfo into pyinstaller except for linux 2023-11-01 18:26:15 +08:00
Concedo
9342636408 Merge branch 'master' into concedo_experimental
# Conflicts:
#	flake.lock
#	flake.nix
2023-11-01 18:24:36 +08:00
Concedo
df7e757d40 windows: added simpleclinfo, which helps determine clblast platform and device on windows 2023-11-01 18:10:35 +08:00
Georgi Gerganov
f0e209324a
scripts : add server-llm.sh (#3868)
* scripts : add deploy-server.sh

* scripts : rename to server-llm.sh

* scripts : working curl pipe
2023-11-01 11:29:07 +02:00
Adrian Hesketh
ca190bca8e
server : re-enable completion and embedded at the same time (#3876) 2023-11-01 11:28:28 +02:00
Georgi Gerganov
71e3718abd
llama : refactor graph build code (#3837)
* llama : factor out ggml-alloc from graph graph build functions

ggml-ci

* metal : disable kernel load log

* llama : factor out tensor offloading outside the build call (wip)

ggml-ci

* llama : offload rest of the models

ggml-ci

* llama : update offload log messages to print node index

* llama : comments

* llama : support offloading result_norm + comments

* llama : factor graph input into a function

* llama : do tensor offload only with CUDA

* llama : fix res_norm offloading

* llama : try to optimize offloading code

* llama : fix non-CUDA build

* llama : try to fix build

* llama : move refact in correct place + optimize graph input

* llama : refactor tensor offloading as callback

* llama : add layer index to all tensor names

* llama : add functional header

* llama : comment

ggml-ci

* llama : remove obsolete map for layer counting

* llama : add llm_build helper functions (#3848)

* llama : add llm_build_norm helper function

ggml-ci

* llama : add llm_build_ffn helper function (#3849)

ggml-ci

* llama : add llm_build_k_shift helper

ggml-ci

* llama : fix offloading after recent changes

* llama : add llm_build_kv_store helper

ggml-ci

* llama : remove obsolete offload names

* llama : fix llm_build_k_shift to use n_head_kv instead of n_head

* llama : simplify falcon Q, K, V computation

* llama : remove obsolete comments in build graphs

* llama : add llm_build_kqv helper

ggml-ci

* llama : minor

* llama : add LLAMA_OFFLOAD_DEBUG + fix starcoder offloading

* llama : fix input allocation logic

* llama : update offload functions for KQ tensors

* llama : normalize tensor names

ggml-ci

* llama : enable warning about not offloaded tensors

* llama : remove extra ; + deduplicate gate_b logic

* llama : add llm_build_inp_embd helper
2023-11-01 08:04:02 +02:00
kalomaze
238657db23
samplers : Min-P sampler implementation [alternative to Top P/Top K] (#3841)
* Introduce the new Min-P sampler by @kalomaze
   The Min-P sampling method was designed as an alternative to Top-P, and aims to ensure a balance of quality and variety. The parameter *p* represents the minimum probability for a token to be considered, relative to the probability of the most likely token.

* Min-P enabled and set to 0.05 default

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
2023-10-31 20:44:49 +01:00
Georgi Gerganov
22cc9bef09
cuda : check if this fixes Pascal card regression 2023-10-31 20:01:47 +02:00
Tungsten842
07178c98e1
flake.nix: fix for rocm 5.7 (#3853) 2023-10-31 19:24:03 +02:00
Concedo
43a5143450 added clinfo binary, cleanup unused stuff 2023-10-31 22:25:25 +08:00
Concedo
f3690ba6d2 shifting enabled by default 2023-10-31 21:41:57 +08:00
Concedo
e62f38abd1 Merge branch 'master' into concedo_experimental
# Conflicts:
#	tests/test-double-float.cpp
#	tests/test-quantize-fns.cpp
2023-10-31 21:09:49 +08:00
Concedo
cc5b282350 Merge branch 'master' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	Makefile
#	build.zig
#	flake.lock
#	flake.nix
#	ggml.c
2023-10-31 20:44:04 +08:00
Georgi Gerganov
207b51900e
ggml : move FP16 <-> FP32 code to ggml-impl.h (#3861)
* ggml : move FP16 <-> FP32 stuff to ggml-impl.h

ggml-ci

* tests : fix ARM build

* ggml : explicitly initialize deprecated type traits

* ggml : add math.h to ggml-impl.h

* ggml : remove duplicate static assert macros

* ggml : prefix lookup tables with ggml_

ggml-ci

* ggml-impl : move extern "C" to start of file
2023-10-30 19:19:15 +02:00
Concedo
9eba77c6a0 finally got something workable 2023-10-30 23:30:21 +08:00
Concedo
61c395833d context shifting is still buggy 2023-10-30 16:25:01 +08:00
Kerfuffle
6e08281e58
Extend llama_kv_cache_seq_rm to allow matching any sequence (#3843)
* Extend llama_kv_cache_seq_rm to allow matichng any sequence

* Replace llama_kv_cache_tokens_rm with llama_kv_cache_clear

Use llama_kv_cache_clear for cache clearing

Change calls to llama_kv_cache_tokens_rm that want to delete by position to use llama_kv_cache_seq_rm functionality
2023-10-29 11:31:40 -06:00
cebtenzzre
2046eb4345
make : remove unnecessary dependency on build-info.h (#3842) 2023-10-29 18:33:47 +02:00
Georgi Gerganov
71a09da301
llama : fix kv shift bug (#3835)
ggml-ci
2023-10-29 18:32:51 +02:00
Georgi Gerganov
d69d777c02
ggml : quantization refactoring (#3833)
* ggml : factor all quantization code in ggml-quants

ggml-ci

* ggml-quants : fix Zig and Swift builds + quantize tool

ggml-ci

* quantize : --pure option for disabling k-quant mixtures

---------

Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
2023-10-29 18:32:28 +02:00
Concedo
7f5d1b2fc6 slider error 2023-10-30 00:02:38 +08:00
Concedo
7f050b5d16 tweak numbers 2023-10-29 22:46:19 +08:00
Concedo
7924592a83 context shift feature done 2023-10-29 18:21:39 +08:00
Concedo
338d6c265d fixes to smartcontextpro 2023-10-29 10:42:37 +08:00
Erik Scholz
ff3bad83e2
flake : update flake.lock for newer transformers version + provide extra dev shell (#3797)
* flake : update flake.lock for newer transformers version + provide extra dev shell with torch and transformers (for most convert-xxx.py scripts)
2023-10-28 16:41:07 +02:00
Aarni Koskela
82a6646e02
metal : try cwd for ggml-metal.metal if bundle lookup fails (#3793)
* Try cwd for ggml-metal if bundle lookup fails

When building with `-DBUILD_SHARED_LIBS=ON -DLLAMA_METAL=ON -DLLAMA_BUILD_SERVER=ON`,
`server` would fail to load `ggml-metal.metal` because `[bundle pathForResource:...]`
returns `nil`.  In that case, fall back to `ggml-metal.metal` in the cwd instead of
passing `null` as a path.

Follows up on #1782

* Update ggml-metal.m

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-10-28 15:43:01 +03:00
Georgi Gerganov
ba231e8a6d
issues : change label from bug to bug-unconfirmed (#3748) 2023-10-28 15:35:26 +03:00
Georgi Gerganov
8a2f2fea29
convert : ignore tokens if their IDs are within [0, vocab_size) (#3831) 2023-10-28 06:25:15 -06:00
Kerfuffle
bd6d9e2059
llama : allow quantizing k-quants to fall back when tensor size incompatible (#3747)
* Allow quantizing k-quants to fall back when tensor size incompatible

* quantizing: Add warning when tensors were incompatible with k-quants

Clean up k-quants state passing a bit
2023-10-28 14:54:24 +03:00
Georgi Gerganov
ee1a0ec9cb
llama : add option for greedy sampling with probs (#3813)
* llama : add option for greedy sampling with probs

* llama : add comment about llama_sample_token_greedy() missing probs

* sampling : temp == 0.0 -> no probs, temp < 0.0 -> probs
2023-10-28 14:23:11 +03:00
Concedo
20ef442c2a fixed for smartcontext 2023-10-28 19:09:22 +08:00
Henk Poley
177461104b
common : print that one line of the syntax help *also* to standard output (#3823) 2023-10-28 13:16:33 +03:00
Concedo
6cf2b4c73b MMQ optimizations (+1 squashed commits)
Squashed commits:

[d87de001] mmq optimization (+1 squashed commits)

Squashed commits:

[f1f67af8] still allow mmq
2023-10-28 17:57:46 +08:00
Georgi Gerganov
fdee152e4e
starcoder : add GPU offloading (#3827)
* starcoder : do not GPU split 1D bias tensors

* starcoder : offload layers to GPU

ggml-ci
2023-10-28 12:06:08 +03:00
Concedo
2ea3b567cf Merge: Testing speed of tensor cores vs MMQ 2023-10-28 16:41:42 +08:00
Concedo
2fa1137890 updated lite 2023-10-28 14:43:15 +08:00
Concedo
09c74ea046 include content-length 2023-10-28 14:24:37 +08:00
Concedo
64f3bc5168 update model string (+1 squashed commits)
Squashed commits:

[a7c568ea] simplify colab
2023-10-28 14:07:52 +08:00
Concedo
879d1ba268 simplify colab dropdowns (+1 squashed commits)
Squashed commits:

[72aab0e8] simplify colab dropdown
2023-10-28 13:57:01 +08:00
Pyroserenus
eb9a93097b
Colab Improvements (#498)
* Update colab.ipynb

* Update colab.ipynb

* Update colab.ipynb

* Update colab.ipynb

* Update colab.ipynb

* Update colab.ipynb

* Update colab.ipynb

* Update colab.ipynb

* Update colab.ipynb

* Update colab.ipynb

* Update colab.ipynb

* Update colab.ipynb

* Update colab.ipynb

* Update colab.ipynb

* Update colab.ipynb

* Update colab.ipynb

* Update colab.ipynb
2023-10-28 13:26:59 +08:00
Concedo
15f525c580 revamped smart context for llama models 2023-10-28 12:59:08 +08:00
Kerfuffle
41aee4df82
speculative : ensure draft and target model vocab matches (#3812)
* speculative: Ensure draft and target model vocab matches

* Tolerate small differences when checking dft vs tgt vocab
2023-10-28 00:40:07 +03:00
cebtenzzre
6d459cbfbe
llama : correctly report GGUFv3 format (#3818) 2023-10-27 17:33:53 -04:00
Thibault Terrasson
c8d6a1f34a
simple : fix batch handling (#3803) 2023-10-27 08:37:41 -06:00
Georgi Gerganov
2f9ec7e271
cuda : improve text-generation and batched decoding performance (#3776)
* cuda : prints wip

* cuda : new cublas gemm branch for multi-batch quantized src0

* cuda : add F32 sgemm branch

* cuda : fine-tune >= VOLTA params + use MMQ only for small batches

* cuda : remove duplicated cuBLAS GEMM code

* cuda : add CUDA_USE_TENSOR_CORES and GGML_CUDA_FORCE_MMQ macros

* build : add compile option to force use of MMQ kernels
2023-10-27 17:01:23 +03:00
Concedo
c2f675133d support for abort without crash on disconnect 2023-10-27 15:27:17 +08:00