cebtenzzre
2fffa0d61f
cuda : fix RoPE after #2268 ( #3897 )
2023-11-02 07:49:44 +02:00
cebtenzzre
e9abcc9c7c
fix linter complaints
2023-11-02 00:06:32 -04:00
cebtenzzre
66ccd62102
sort imports
2023-11-01 23:26:28 -04:00
cebtenzzre
8f31dc54ec
fix mypy errors
2023-11-01 23:24:46 -04:00
cebtenzzre
0eb332a10f
llama : fix llama_context_default_params after #2268 ( #3893 )
2023-11-01 19:29:14 -04:00
slaren
d02e98cde0
ggml-cuda : compute ptrs for cublasGemmBatchedEx in a kernel ( #3891 )
...
* ggml-cuda : compute ptrs for cublasGemmBatchedEx in a kernel
* fix warnings
2023-11-01 23:10:09 +01:00
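Context for the change above: cublasGemmBatchedEx takes arrays of per-matrix pointers, and building those arrays on the host costs an extra copy and synchronization before every batched GEMM; the commit fills them on the GPU instead. A minimal sketch of the idea in CUDA C++ (names and stride handling are illustrative, not the actual llama.cpp kernel):

    #include <cuda_fp16.h>

    // one thread per batch entry fills the pointer arrays on-device,
    // avoiding a host-side loop plus cudaMemcpy before cublasGemmBatchedEx
    __global__ void k_compute_batched_ptrs(
            const half * src0, const half * src1, half * dst,
            const void ** ptrs_src0, const void ** ptrs_src1, void ** ptrs_dst,
            size_t nb0, size_t nb1, size_t nbd, int n_batch) {
        const int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i >= n_batch) return;
        ptrs_src0[i] = (const char *) src0 + i*nb0;
        ptrs_src1[i] = (const char *) src1 + i*nb1;
        ptrs_dst [i] = (      char *) dst  + i*nbd;
    }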
cebtenzzre
898aeca90a
llama : implement YaRN RoPE scaling ( #2268 )
...
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Jeffrey Quesnelle <jquesnelle@gmail.com>
2023-11-01 18:04:33 -04:00
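For context on what YaRN does: plain position interpolation divides every rotary frequency by the scale factor, while YaRN interpolates only the low-frequency dimensions, leaves the high-frequency ones extrapolated, ramps between the two, and applies a small attention-magnitude correction. A simplified sketch of the per-dimension mixing (variable names and bounds are illustrative; see the PR and paper for the exact formulation):

    #include <algorithm>
    #include <cmath>

    // n_rot: full rotations this dimension completes over the training context;
    // beta_slow/beta_fast: ramp bounds (e.g. 1 and 32)
    float yarn_theta(float theta_extrap, float scale, float n_rot,
                     float beta_slow, float beta_fast) {
        const float theta_interp = theta_extrap / scale; // plain position interpolation
        const float ramp = std::clamp((n_rot - beta_slow)/(beta_fast - beta_slow), 0.0f, 1.0f);
        // few rotations (low frequency) -> interpolate; many -> extrapolate;
        // the matching attention correction is roughly mscale = 1.0f + 0.1f*logf(scale)
        return theta_interp*(1.0f - ramp) + theta_extrap*ramp;
    }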
Georgi Gerganov
c43c2da8af
llm : fix llm_build_kqv taking unused tensor (benign, #3837 )
2023-11-01 23:08:30 +02:00
Georgi Gerganov
523e49b111
llm : fix falcon norm after refactoring ( #3837 )
2023-11-01 23:00:50 +02:00
Georgi Gerganov
e16b9fa4ba
metal : multi-simd softmax ( #3710 )
...
ggml-ci
2023-11-01 21:25:00 +02:00
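The gist of a multi-simdgroup softmax: the row max and the exp-sum are computed cooperatively across a whole SIMD group instead of by one thread looping over the row. The Metal kernel is not reproduced here; as an illustration of the same reduction pattern in CUDA C++ terms (a warp being the CUDA analogue of a Metal simdgroup):

    // illustration only -- not the actual Metal kernel
    __device__ float warp_reduce_max(float v) {
        for (int mask = 16; mask > 0; mask >>= 1) {
            v = fmaxf(v, __shfl_xor_sync(0xffffffff, v, mask, 32));
        }
        return v;
    }
    __device__ float warp_reduce_sum(float v) {
        for (int mask = 16; mask > 0; mask >>= 1) {
            v += __shfl_xor_sync(0xffffffff, v, mask, 32);
        }
        return v;
    }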
Georgi Gerganov
ff8f9a88da
common : minor ( #3715 )
2023-11-01 21:15:55 +02:00
Georgi Gerganov
50337961a6
llm : add llm_build_context ( #3881 )
...
* llm : add llm_build_context
* llm : deduce norm eps based on type + explicit max_alibi_bias, clamp_kqv
* llm : restore the non-graph llm_build_ functional API
ggml-ci
* llm : cleanup + comments
2023-11-01 20:11:02 +02:00
bandoti
0e40806c1c
common : allow caller to handle help/argument exceptions ( #3715 )
...
* Allow caller to handle help/argument exceptions
* Prepend newline to usage output
* Add new gpt_params_parse_ex function to hide arg-parse impl
* Fix issue blocking success case
* exit instead of returning false
* Update common/common.h
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update common/common.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-11-01 19:42:01 +02:00
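The resulting pattern, roughly (a sketch based on the bullet points above, not the verbatim common.cpp code): gpt_params_parse_ex throws std::invalid_argument on a bad argument, and a thin wrapper keeps the old public interface:

    #include <cstdio>
    #include <cstdlib>
    #include <stdexcept>
    #include "common.h"

    bool gpt_params_parse(int argc, char ** argv, gpt_params & params) {
        try {
            return gpt_params_parse_ex(argc, argv, params); // may throw
        } catch (const std::invalid_argument & ex) {
            fprintf(stderr, "%s\n", ex.what());
            gpt_print_usage(argc, argv, params);
            exit(1); // per the PR: exit instead of returning false
        }
    }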
staviq
a2758d08e4
log : make generating separate log files optional ( #3787 )
...
* impl --log-new, --log-append
* Update common/log.h
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
* Update common/log.h
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
* Apply suggestions from code review
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
---------
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
2023-11-01 16:18:27 +02:00
l3utterfly
e75dfdd31b
sampling : null grammar field after reset ( #3885 )
2023-11-01 15:40:43 +02:00
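The bug being fixed is the classic free-without-null dangling pointer; a minimal sketch of the pattern (the struct and function names here are illustrative, llama_grammar_free is the real API):

    #include "llama.h"

    struct sampling_state { struct llama_grammar * grammar; /* ... */ };

    static void sampling_reset(struct sampling_state * s) {
        if (s->grammar != NULL) {
            llama_grammar_free(s->grammar);
            s->grammar = NULL; // the fix: otherwise a later reset double-frees
        }
    }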
Georgi Gerganov
9a3b4f6c86
ggml : fix UNUSED macro ( #3762 )
2023-11-01 13:50:45 +02:00
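For context, the usual C idiom behind such a macro is the cast-to-void trick for silencing unused-variable warnings (the exact ggml definition may differ):

    #define GGML_UNUSED(x) (void)(x)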
Andrew Godfrey
73bdcb395e
finetune : add -ngl parameter ( #3762 )
...
* Add '-ngl' support to finetune.cpp
* Add fprintf in ggml_cuda_op_add
When I tried CUDA offloading during finetuning following the readme, I got an assert here.
This probably isn't an important case, because inference later warns that you should use f16 or f32 instead when using LoRA.
* Add 'finetune.sh', which currently fails when using GPU
"error: operator (): Finetuning on tensors with type 'f16' is not yet supported"
* tweak finetune.sh
* Suppress some warnings in ggml.c
* Add f16 implementation to ggml_compute_forward_add_f16_f32
* Add an f16 case to ggml_add_cast_impl and llama_build_lora_finetune_graphs
* finetune.sh: Edit comments
* Add "add_f16_f32_f32_cuda"
* Tweak an error message
* finetune.sh: Add an optional LLAMA_MODEL_DIR variable
* finetune.sh: Add an optional LLAMA_TRAINING_DIR variable
* train : minor
* tabs to spaces
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
2023-11-01 13:49:04 +02:00
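The core of the f16 path described above, sketched (heavily simplified; the real ggml_compute_forward_add_f16_f32 also handles strides, broadcasting and multi-threading, while the conversion macros are ggml's own):

    #include "ggml.h"

    // add an f32 row into an f16 row, writing f16 -- the case LoRA finetuning
    // hits when the base model weights are stored as f16
    static void add_f16_f32_f16_row(const ggml_fp16_t * a, const float * b,
                                    ggml_fp16_t * dst, int n) {
        for (int i = 0; i < n; ++i) {
            dst[i] = GGML_FP32_TO_FP16(GGML_FP16_TO_FP32(a[i]) + b[i]);
        }
    }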
Georgi Gerganov
f0e209324a
scripts : add server-llm.sh ( #3868 )
...
* scripts : add deploy-server.sh
* scripts : rename to server-llm.sh
* scripts : working curl pipe
2023-11-01 11:29:07 +02:00
Adrian Hesketh
ca190bca8e
server : re-enable completion and embedding at the same time ( #3876 )
2023-11-01 11:28:28 +02:00
Georgi Gerganov
71e3718abd
llama : refactor graph build code ( #3837 )
...
* llama : factor out ggml-alloc from graph build functions
ggml-ci
* metal : disable kernel load log
* llama : factor out tensor offloading outside the build call (wip)
ggml-ci
* llama : offload rest of the models
ggml-ci
* llama : update offload log messages to print node index
* llama : comments
* llama : support offloading result_norm + comments
* llama : factor graph input into a function
* llama : do tensor offload only with CUDA
* llama : fix res_norm offloading
* llama : try to optimize offloading code
* llama : fix non-CUDA build
* llama : try to fix build
* llama : move refact in correct place + optimize graph input
* llama : refactor tensor offloading as callback
* llama : add layer index to all tensor names
* llama : add functional header
* llama : comment
ggml-ci
* llama : remove obsolete map for layer counting
* llama : add llm_build helper functions (#3848 )
* llama : add llm_build_norm helper function
ggml-ci
* llama : add llm_build_ffn helper function (#3849 )
ggml-ci
* llama : add llm_build_k_shift helper
ggml-ci
* llama : fix offloading after recent changes
* llama : add llm_build_kv_store helper
ggml-ci
* llama : remove obsolete offload names
* llama : fix llm_build_k_shift to use n_head_kv instead of n_head
* llama : simplify falcon Q, K, V computation
* llama : remove obsolete comments in build graphs
* llama : add llm_build_kqv helper
ggml-ci
* llama : minor
* llama : add LLAMA_OFFLOAD_DEBUG + fix starcoder offloading
* llama : fix input allocation logic
* llama : update offload functions for KQ tensors
* llama : normalize tensor names
ggml-ci
* llama : enable warning about not offloaded tensors
* llama : remove extra ; + deduplicate gate_b logic
* llama : add llm_build_inp_embd helper
2023-11-01 08:04:02 +02:00
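To give a sense of what these llm_build_ helpers factor out, here is a hedged sketch of the LLaMA-style gated FFN expressed with ggml ops (simplified signature; the real llm_build_ffn covers more gate and activation variants):

    #include "ggml.h"

    // w2 * (silu(w1*cur) * (w3*cur)) -- deduplicated across model graphs
    static struct ggml_tensor * build_ffn_silu(
            struct ggml_context * ctx, struct ggml_tensor * cur,
            struct ggml_tensor * w1,   // gate
            struct ggml_tensor * w2,   // down
            struct ggml_tensor * w3) { // up
        struct ggml_tensor * gate = ggml_silu(ctx, ggml_mul_mat(ctx, w1, cur));
        struct ggml_tensor * up   = ggml_mul_mat(ctx, w3, cur);
        return ggml_mul_mat(ctx, w2, ggml_mul(ctx, gate, up));
    }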
Galunid
4fdd7cdf2b
Review fixes, persimmon fixes
2023-11-01 02:32:49 +01:00
Galunid
3ec89dcc69
Use 'IntEnum' instead of 'Enum'
2023-10-31 22:23:26 +01:00
kalomaze
238657db23
samplers : Min-P sampler implementation [alternative to Top P/Top K] ( #3841 )
...
* Introduce the new Min-P sampler by @kalomaze
The Min-P sampling method was designed as an alternative to Top-P, and aims to ensure a balance of quality and variety. The parameter *p* represents the minimum probability for a token to be considered, relative to the probability of the most likely token.
* Min-P enabled and set to 0.05 default
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
2023-10-31 20:44:49 +01:00
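A minimal sketch of the rule described above, on a plain probability array (the real implementation operates on llama_token_data candidates and renormalizes afterwards):

    #include <algorithm>
    #include <vector>

    // keep token i iff probs[i] >= min_p * max(probs);
    // assumes probs is non-empty and already softmaxed
    static std::vector<int> min_p_keep(const std::vector<float> & probs,
                                       float min_p /* default 0.05f */) {
        const float cutoff = min_p * *std::max_element(probs.begin(), probs.end());
        std::vector<int> kept;
        for (int i = 0; i < (int) probs.size(); ++i) {
            if (probs[i] >= cutoff) kept.push_back(i);
        }
        return kept; // sample from these after renormalizing
    }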
Tungsten842
07178c98e1
flake.nix: fix for rocm 5.7 ( #3853 )
2023-10-31 19:24:03 +02:00
Galunid
f4b9a7ea02
Remove 'old' conversion scripts
2023-10-31 16:27:06 +01:00
Galunid
235acc18cd
Small refactor
2023-10-31 16:23:53 +01:00
Galunid
c94df09732
Rework tokenizer handling
2023-10-31 16:11:08 +01:00
Galunid
b2ba44eab2
Flake8 fixes
2023-10-31 15:38:24 +01:00
Galunid
dc3115f2a3
Add another alias to n_layers
2023-10-31 04:20:51 +01:00
Galunid
0743f7a900
Fix variable
2023-10-31 03:52:52 +01:00
Galunid
b9c664ab2f
Woops
2023-10-31 03:42:55 +01:00
Galunid
6f6856c6ea
[Untested] Initial Persimmon support
2023-10-31 03:27:04 +01:00
Galunid
94ba1db24a
Add Starcoder and Refact
2023-10-31 03:12:25 +01:00
Galunid
0afa75a9a2
Add Falcon support
2023-10-31 02:57:37 +01:00
Galunid
3bb9844de9
Get rid of dumb print
2023-10-31 01:54:24 +01:00
Galunid
08918b700e
MPT conversion fix
2023-10-31 01:52:55 +01:00
Georgi Gerganov
207b51900e
ggml : move FP16 <-> FP32 code to ggml-impl.h ( #3861 )
...
* ggml : move FP16 <-> FP32 stuff to ggml-impl.h
ggml-ci
* tests : fix ARM build
* ggml : explicitly initialize deprecated type traits
* ggml : add math.h to ggml-impl.h
* ggml : remove duplicate static assert macros
* ggml : prefix lookup tables with ggml_
ggml-ci
* ggml-impl : move extern "C" to start of file
2023-10-30 19:19:15 +02:00
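On the FP16 <-> FP32 code itself: since there are only 2^16 half-precision bit patterns, f16 -> f32 conversion on targets without native f16 support is typically done with a one-time 64K-entry lookup table (one of the tables that gains a ggml_ prefix above). A hedged sketch of the approach (names illustrative):

    #include <stdint.h>
    #include <string.h>
    #include "ggml.h"

    static float f16_to_f32_table[1 << 16];

    // precompute all 65536 conversions once; lookups are then a single load
    static void init_f16_table(void) {
        for (uint32_t i = 0; i < (1 << 16); ++i) {
            const uint16_t bits = (uint16_t) i;
            ggml_fp16_t h;
            memcpy(&h, &bits, sizeof(h)); // reinterpret the bit pattern as f16
            f16_to_f32_table[i] = GGML_FP16_TO_FP32(h);
        }
    }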
Galunid
443f7d586e
Call add_tensor before write_* functions
2023-10-29 20:00:54 +01:00
Kerfuffle
6e08281e58
Extend llama_kv_cache_seq_rm to allow matching any sequence ( #3843 )
...
* Extend llama_kv_cache_seq_rm to allow matching any sequence
* Replace llama_kv_cache_tokens_rm with llama_kv_cache_clear
Use llama_kv_cache_clear for cache clearing
Change calls to llama_kv_cache_tokens_rm that delete by position to use llama_kv_cache_seq_rm instead
2023-10-29 11:31:40 -06:00
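Usage after this change, roughly (a sketch; check llama.h at this revision for the exact semantics of the negative-value conventions):

    #include "llama.h"

    static void prune_cache(struct llama_context * ctx, llama_pos n_keep) {
        // full clear -- replaces the removed llama_kv_cache_tokens_rm
        llama_kv_cache_clear(ctx);

        // or, selectively: drop positions >= n_keep from every sequence;
        // after this PR seq_id < 0 matches any sequence, p1 < 0 means "to the end"
        llama_kv_cache_seq_rm(ctx, /*seq_id =*/ -1, /*p0 =*/ n_keep, /*p1 =*/ -1);
    }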
cebtenzzre
2046eb4345
make : remove unnecessary dependency on build-info.h ( #3842 )
2023-10-29 18:33:47 +02:00
Georgi Gerganov
71a09da301
llama : fix kv shift bug ( #3835 )
...
ggml-ci
2023-10-29 18:32:51 +02:00
Georgi Gerganov
d69d777c02
ggml : quantization refactoring ( #3833 )
...
* ggml : factor all quantization code in ggml-quants
ggml-ci
* ggml-quants : fix Zig and Swift builds + quantize tool
ggml-ci
* quantize : --pure option for disabling k-quant mixtures
---------
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
2023-10-29 18:32:28 +02:00
Galunid
550b925af2
Missing variable
2023-10-29 02:06:41 +01:00
Galunid
989db34149
Missing variable
2023-10-29 02:05:28 +01:00
Galunid
8618b4e74c
Add [UNTESTED] Baichuan support
2023-10-29 01:38:35 +02:00
Galunid
0ff237105d
Make gguf_writer member of Model, rework tokenizer export
2023-10-29 00:33:05 +02:00
Erik Scholz
ff3bad83e2
flake : update flake.lock for newer transformers version + provide extra dev shell ( #3797 )
...
* flake : update flake.lock for newer transformers version + provide extra dev shell with torch and transformers (for most convert-xxx.py scripts)
2023-10-28 16:41:07 +02:00
Aarni Koskela
82a6646e02
metal : try cwd for ggml-metal.metal if bundle lookup fails ( #3793 )
...
* Try cwd for ggml-metal if bundle lookup fails
When building with `-DBUILD_SHARED_LIBS=ON -DLLAMA_METAL=ON -DLLAMA_BUILD_SERVER=ON`,
`server` would fail to load `ggml-metal.metal` because `[bundle pathForResource:...]`
returns `nil`. In that case, fall back to `ggml-metal.metal` in the cwd instead of
passing `null` as a path.
Follows up on #1782
* Update ggml-metal.m
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-10-28 15:43:01 +03:00
Georgi Gerganov
ba231e8a6d
issues : change label from bug to bug-unconfirmed ( #3748 )
2023-10-28 15:35:26 +03:00
Georgi Gerganov
8a2f2fea29
convert : ignore tokens if their IDs are within [0, vocab_size) ( #3831 )
2023-10-28 06:25:15 -06:00