Commit graph

1156 commits

Author SHA1 Message Date
Xuan Son Nguyen
1e6f6554aa
server : add lora hotswap endpoint (WIP) (#8857)
* server : add lora hotswap endpoint

* handle lora_no_apply

* fix build

* updae docs

* clean up struct def

* fix build

* add LoRA test

* fix style
2024-08-06 17:33:39 +02:00
Daniel Bevenius
5f4dcb1e60
simple : update name of executable to llama-simple (#8885)
This commit updates the name of the executable in README.md from
`simple` to `llama-simple`.
2024-08-06 16:44:35 +02:00
Neo Zhang
d4ff847153
[SYCL] correct cmd name (#8877) 2024-08-06 09:09:12 +08:00
Liu Jia
0a4ce78681
common : Changed tuple to struct (TODO fix) (#8823)
* common : Changed tuple to struct (TODO fix)

Use struct `llama_init_result` to replace the previous
std::tuple<struct llama_model *, struct llama_context *>

* delete llama_init_default_params()

* delete the extra whitespace
2024-08-05 18:14:10 +02:00
ardfork
978ba3d83d
Server: Don't ignore llama.cpp params (#8754)
* Don't ignore llama.cpp params

* Add fallback for max_tokens
2024-08-04 20:16:23 +02:00
Brian Cunnie
ecf6b7f23e
batched-bench : handle empty -npl (#8839)
* [example] batched-bench "segmentation fault"

When `llama-batched-bench` is invoked _without_ setting `-npl`, "number
of parallel prompts", it segfaults.

The segfault is caused by invoking `max_element()` on a zero-length
vector, `n_pl`

This commit addresses that by first checking to see if the number of
parallel prompts is zero, and if so sets the maximum sequence size to 1;
otherwise, sets it to the original, the result of `max_element()`.

Fixes, when running `lldb build/bin/llama-batched-bench -- -m models/Meta-Llama-3-8B.gguf`

```
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x000000010000366c llama-batched-bench`main(argc=3, argv=0x000000016fdff268) at batched-bench.cpp:72:28
   69  	    llama_context_params ctx_params = llama_context_params_from_gpt_params(params);
   70
   71  	    // ensure enough sequences are available
-> 72  	    ctx_params.n_seq_max = *std::max_element(n_pl.begin(), n_pl.end());
```

* Update examples/batched-bench/batched-bench.cpp

Co-authored-by: compilade <git@compilade.net>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: compilade <git@compilade.net>
2024-08-04 13:55:03 +03:00
Daniel Bevenius
01aae2b497 baby-llama : remove duplicate vector include 2024-08-04 13:24:59 +03:00
Igor Okulist
afbbcf3c04
server : update llama-server embedding flag documentation (#8779)
Fixes #8763
2024-07-31 19:59:09 -04:00
compilade
4c676c85e5
llama : refactor session file management (#8699)
* llama : refactor session file management

* llama : saving and restoring state checks for overflow

The size of the buffers should now be given to the functions working
with them, otherwise a truncated file could cause out of bound reads.

* llama : stream from session file instead of copying into a big buffer

Loading session files should no longer cause a memory usage spike.

* llama : llama_state_get_size returns the actual size instead of max

This is a breaking change, but makes that function *much* easier
to keep up to date, and it also makes it reflect the behavior
of llama_state_seq_get_size.

* llama : share code between whole and seq_id-specific state saving

Both session file types now use a more similar format.

* llama : no longer store all hparams in session files

Instead, the model arch name is stored.
The layer count and the embedding dimensions of the KV cache
are still verified when loading.
Storing all the hparams is not necessary.

* llama : fix uint64_t format type

* llama : various integer type cast and format string fixes

Some platforms use "%lu" and others "%llu" for uint64_t.
Not sure how to handle that, so casting to size_t when displaying errors.

* llama : remove _context suffix for llama_data_context

* llama : fix session file loading

llama_state_get_size cannot be used to get the max size anymore.

* llama : more graceful error handling of invalid session files

* llama : remove LLAMA_MAX_RNG_STATE

It's no longer necessary to limit the size of the RNG state,
because the max size of session files is not estimated anymore.

* llama : cast seq_id in comparison with unsigned n_seq_max
2024-07-28 00:42:05 -04:00
slaren
2b1f616b20
ggml : reduce hash table reset cost (#8698)
* ggml : reduce hash table reset cost

* fix unreachable code warnings after GGML_ASSERT(false)

* GGML_ASSERT(false) -> GGML_ABORT("fatal error")

* GGML_ABORT use format string
2024-07-27 04:41:55 +02:00
Yaiko
01aec4a631
server : add Speech Recognition & Synthesis to UI (#8679)
* server : add Speech Recognition & Synthesis to UI

* server : add Speech Recognition & Synthesis to UI (fixes)
2024-07-26 00:10:16 +02:00
Xuan Son Nguyen
41cd47caab
examples : export-lora : fix issue with quantized base models (#8687) 2024-07-25 23:49:39 +02:00
Xuan Son Nguyen
be6d7c0791
examples : remove finetune and train-text-from-scratch (#8669)
* examples : remove finetune and train-text-from-scratch

* fix build

* update help message

* fix small typo for export-lora
2024-07-25 10:39:04 +02:00
Ujjawal Panchal
4b0eff3df5
docs : Quantum -> Quantized (#8666)
Some checks failed
flake8 Lint / Lint (push) Has been cancelled
* docfix: imatrix readme, quantum models -> quantized models.

* docfix: server readme: quantum models -> quantized models.
2024-07-25 11:13:27 +03:00
Xuan Son Nguyen
96952e7181
llama : fix llama_chat_format_single for mistral (#8657)
* fix `llama_chat_format_single` for mistral

* fix typo

* use printf
2024-07-24 13:48:46 +02:00
Xuan Son Nguyen
de280085e7
examples : Fix llama-export-lora example (#8607)
* fix export-lora example

* add more logging

* reject merging subset

* better check

* typo
2024-07-23 23:48:37 +02:00
Vali Malinoiu
b841d07408
server : fix URL.parse in the UI (#8646) 2024-07-23 17:37:42 +03:00
Georgi Gerganov
938943cdbf
llama : move vocab, grammar and sampling into separate files (#8508)
* llama : move sampling code into llama-sampling

ggml-ci

* llama : move grammar code into llama-grammar

ggml-ci

* cont

ggml-ci

* cont : pre-fetch rules

* cont

ggml-ci

* llama : deprecate llama_sample_grammar

* llama : move tokenizers into llama-vocab

ggml-ci

* make : update llama.cpp deps [no ci]

* llama : redirect external API to internal APIs

ggml-ci

* llama : suffix the internal APIs with "_impl"

ggml-ci

* llama : clean-up
2024-07-23 13:10:17 +03:00
Jan Boon
628154492a
server : update doc to clarify n_keep when there is bos token (#8619) 2024-07-22 11:02:09 +03:00
devojony
b7c11d36e6
examples: fix android example cannot be generated continuously (#8621)
When generation ends `completion_loop()` should return a NULL, not the empty string
2024-07-22 09:54:42 +03:00
M-A
22f281aa16
examples : Rewrite pydantic_models_to_grammar_examples.py (#8493)
Changes:

- Move each example into its own function. This makes the code much
  easier to read and understand.
- Make the program easy to only run one test by commenting out function
  calls in main().
- Make the output easy to parse by indenting the output for each example.
- Add shebang and +x bit to make it clear it's an executable.
- Make the host configurable via --host with a default 127.0.0.1:8080.
- Make the code look in the tools list to call the registered tool,
  instead of hardcoding the returned values. This makes the code more
  copy-pastable.
- Add error checking, so that the program exits 1 if the LLM didn't
  returned expected values. It's super useful to check for correctness.

Testing:

- Tested with Mistral-7B-Instruct-v0.3 in F16 and Q5_K_M and
  Meta-Llama-3-8B-Instruct in F16 and Q5_K_M.
  - I did not observe a failure even once in Mistral-7B-Instruct-v0.3.
  - Llama-3 failed about a third of the time in example_concurrent: it
    only returned one call instead of 3. Even for F16.

Potential follow ups:

- Do not fix the prompt encoding yet. Surprisingly it mostly works even
  if the prompt encoding is not model optimized.
- Add chained answer and response.

Test only change.
2024-07-20 22:09:17 -04:00
Georgi Gerganov
07283b1a90
gguf : handle null name during init (#8587) 2024-07-20 17:15:42 +03:00
Huifeng Ou
69b9945b44
llama.swiftui: fix end of generation bug (#8268)
* fix continuing generating blank lines after getting EOT token or EOS token from LLM

* change variable name to is_done (variable name suggested by ggerganov)

* minor : fix trailing whitespace

* minor : add space

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-07-20 16:09:37 +03:00
Eric Zhang
0d2c7321e9
server: use relative routes for static files in new UI (#8552)
* server: public: fix api_url on non-index pages

* server: public: use relative routes for static files in new UI
2024-07-18 12:43:49 +02:00
Brian
672a6f1018
convert-*.py: GGUF Naming Convention Refactor and Metadata Override Refactor (#7499)
Main thing is that the default output filename will take this form

{name}{parameters}{finetune}{version}{encoding}{kind}

In addition this add and remove some entries in the KV store and adds a metadata class with automatic heuristics capability to derive some values based on model card content

* No Change:
  - Internal GGUF Spec
    - `general.architecture`
    - `general.quantization_version`
    - `general.alignment`
    - `general.file_type`
  - General Model Details
    - `general.name`
    - `general.author`
    - `general.version`
    - `general.description`
  - Licensing details
    - `general.license`
  - Typically represents the converted GGUF repo (Unless made from scratch)
    - `general.url`
  - Model Source during conversion
    - `general.source.url`

* Removed:
  - Model Source during conversion
    - `general.source.huggingface.repository`

* Added:
  - General Model Details
    - `general.organization`
    - `general.finetune`
    - `general.basename`
    - `general.quantized_by`
    - `general.size_label`
  - Licensing details
    - `general.license.name`
    - `general.license.link`
  - Typically represents the converted GGUF repo (Unless made from scratch)
    - `general.doi`
    - `general.uuid`
    - `general.repo_url`
  - Model Source during conversion
    - `general.source.doi`
    - `general.source.uuid`
    - `general.source.repo_url`
  - Base Model Source
    - `general.base_model.count`
    - `general.base_model.{id}.name`
    - `general.base_model.{id}.author`
    - `general.base_model.{id}.version`
    - `general.base_model.{id}.organization`
    - `general.base_model.{id}.url` (Model Website/Paper)
    - `general.base_model.{id}.doi`
    - `general.base_model.{id}.uuid`
    - `general.base_model.{id}.repo_url` (Model Source Repository (git/svn/etc...))
  - Array based KV stores
    - `general.tags`
    - `general.languages`
    - `general.datasets`

---------

Co-authored-by: compilade <git@compilade.net>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-07-18 20:40:15 +10:00
RunningLeon
3807c3de04
server : respect --special cli arg (#8553) 2024-07-18 11:06:22 +03:00
Johannes Gäßler
e02b597be3
lookup: fibonacci hashing, fix crashes (#8548) 2024-07-17 23:35:44 +02:00
hipudding
1bdd8ae19f
[CANN] Add Ascend NPU backend (#6035)
* [CANN] Add Ascend NPU backend

Ascend is a full-stack AI computing infrastructure for industry
applications and services based on Huawei Ascend processors and
software.

CANN (Compute Architecture of Neural Networks), developped by
Huawei, is a heterogeneous computing architecture for AI.

Co-authored-by: wangshuai09 <391746016@qq.com>

* delete trailing whitespaces

* Modify the code based on review comment

* Rename LLAMA_CANN to GGML_CANN

* Make ggml-common.h private

* add ggml_cann prefix for acl funcs

* Add logging for CANN backend

* Delete Trailing whitespace

---------

Co-authored-by: wangshuai09 <391746016@qq.com>
2024-07-17 14:23:50 +03:00
Masaya, Kato
da3913d8f9
batched: fix n_predict parameter (#8527) 2024-07-17 10:34:28 +03:00
Brian
1666f92dcd
gguf-hash : update clib.json to point to original xxhash repo (#8491)
* Update clib.json to point to Cyan4973 original xxhash

Convinced Cyan4973 to add clib.json directly to his repo, so can now point the clib package directly to him now. Previously pointed to my fork with the clib.json package metadata

https://github.com/Cyan4973/xxHash/pull/954

* gguf-hash: readme update to point to Cyan4973 xxHash repo [no ci]
2024-07-16 10:14:16 +03:00
Steve Bonds
37b12f92ab
export-lora : handle help argument (#8497)
The --help option on export-lora isn't accepted as valid. The help still gets displayed by default, but the script exits with an error message and nonzero status.
2024-07-16 10:04:45 +03:00
Georgi Gerganov
0efec57787
llama : valign + remove unused ftype (#8502) 2024-07-16 10:00:30 +03:00
Xuan Son Nguyen
4db8f60fe7
fix ci (#8494) 2024-07-15 19:23:10 +02:00
M-A
f17f39ff9c
server: update README.md with llama-server --help output [no ci] (#8472)
The README.md had a stale information. In particular, the --ctx-size
"defaults to 512" confused me and I had to check the code to confirm
this was false. This the server is evolving rapidly, it's probably
better to keep the source of truth at a single place (in the source) and
generate the README.md based on that.

Did:

    make llama-server
    ./llama-server --help > t.txt
    vimdiff t.txt examples/server/README.md

I copied the content inside a backquote block. I would have preferred
proper text but it would require a fair amount of surgery to make the
current output compatible with markdown. A follow up could be to
automate this process with a script.

No functional change.
2024-07-15 15:04:56 +03:00
compilade
090fca7a07
pydantic : replace uses of __annotations__ with get_type_hints (#8474)
* pydantic : replace uses of __annotations__ with get_type_hints

* pydantic : fix Python 3.9 and 3.10 support
2024-07-14 19:51:21 -04:00
Georgi Gerganov
4e24cffd8c
server : handle content array in chat API (#8449)
* server : handle content array in chat API

* Update examples/server/utils.hpp

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

---------

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-07-12 14:48:15 +03:00
Georgi Gerganov
6af51c0d96
main : print error on empty input (#8456) 2024-07-12 14:48:04 +03:00
Douglas Hanley
c3ebcfa148
server : ensure batches are either all embed or all completion (#8420)
* make sure batches are all embed or all non-embed

* non-embedding batch for sampled tokens; fix unused params warning
2024-07-12 11:14:12 +03:00
Georgi Gerganov
71c1121d11
examples : sprintf -> snprintf (#8434)
* examples : sprintf -> snprintf

ggml-ci

* examples : use sizeof() instead of hardcoded constants
2024-07-12 10:46:14 +03:00
Georgi Gerganov
370b1f7e7a
ggml : minor naming changes (#8433)
* ggml : minor naming changes

ggml-ci

* ggml : use PRId64 [no ci]

* ggml : revert FA K/Q names
2024-07-12 10:46:02 +03:00
compilade
9a55ffe6fb
tokenize : add --no-parse-special option (#8423)
This should allow more easily explaining
how parse_special affects tokenization.
2024-07-11 10:41:48 +03:00
Clint Herron
278d0e1846
Initialize default slot sampling parameters from the global context. (#8418) 2024-07-10 20:08:17 -04:00
Dibakar Gope
0f1a39f343
ggml : add AArch64 optimized GEMV and GEMM Q4 kernels (#5780)
* Arm AArch64: optimized GEMV and GEMM kernels for q4_0_q8_0, and q8_0_q8_0 quantization

* Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions

* Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions

* Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions

* Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions

* Arm AArch64: add copyright claim only to ggml-aarch64.cpp and ggml-aarch64.h files

* Arm AArch64: minor code refactoring for rebase

* Arm AArch64: minor code refactoring for resolving a build issue with cmake

* Arm AArch64: minor code refactoring to split the Q4_0_AARC64 type into three separate types: Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8

* Arm AArch64: minor code change for resolving a build issue with server-windows

* retrigger checks

* Arm AArch64: minor code changes for rebase

* Arm AArch64: minor changes to skip the pr#7433 vec_dot code for arm cpus with SVE VL not equal to 256 bits

* Arm AArch64: remove stale LLAMA_QKK_64 from CMakeLists.txt and delete build.zig

* Arm AArch64: add reference scalar gemm and gemv, and avoid dynamic memory allocations during quantization for Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8

* Arm AArch64: add multithreaded quantization support for the new types: Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8

* Arm AArch64: minor code refactoring

* Arm AArch64: simplify logic for calling gemm and gemv functions in ggml_compute_forward_mul_mat

* Arm AArch64: minimize changes in ggml_compute_forward_mul_mat

* Arm AArch64: minor code refactoring, and add reference scalar code to quantize routines for new quant types

* Arm AArch64: minor code refactoring

* Arm AArch64: minor code refactoring

* Arm AArch64: minor code refactoring

* rebase on the latest master commit 3fd62a6 and adapt to the new directory structure

* Arm AArch64: remove a redundant comment

* Arm AArch64: add pragma in ggml-aarch64.c to turn -Woverlength-strings warning off

* Arm AArch64: use __aarch64__ check to guard 64-bit neon kernels

* Arm AArch64: update docs/build.md README to include compile time flags for buiilding the Q4_0_4_4 quant type
2024-07-10 15:14:51 +03:00
Clint Herron
a59f8fdc85
Server: Enable setting default sampling parameters via command-line (#8402)
* Load server sampling parameters from the server context by default.

* Wordsmithing comment
2024-07-09 18:26:40 -04:00
Clint Herron
e500d6135a
Deprecation warning to assist with migration to new binary names (#8283)
* Adding a simple program to provide a deprecation warning that can exist to help people notice the binary name change from #7809 and migrate to the new filenames.

* Build legacy replacement binaries only if they already exist. Check for their existence every time so that they are not ignored.
2024-07-09 11:54:43 -04:00
Georgi Gerganov
6f0dbf6ab0
infill : assert prefix/suffix tokens + remove old space logic (#8351) 2024-07-08 09:34:35 +03:00
compilade
3fd62a6b1c
py : type-check all Python scripts with Pyright (#8341)
* py : type-check all Python scripts with Pyright

* server-tests : use trailing slash in openai base_url

* server-tests : add more type annotations

* server-tests : strip "chat" from base_url in oai_chat_completions

* server-tests : model metadata is a dict

* ci : disable pip cache in type-check workflow

The cache is not shared between branches, and it's 250MB in size,
so it would become quite a big part of the 10GB cache limit of the repo.

* py : fix new type errors from master branch

* tests : fix test-tokenizer-random.py

Apparently, gcc applies optimisations even when pre-processing,
which confuses pycparser.

* ci : only show warnings and errors in python type-check

The "information" level otherwise has entries
from 'examples/pydantic_models_to_grammar.py',
which could be confusing for someone trying to figure out what failed,
considering that these messages can safely be ignored
even though they look like errors.
2024-07-07 15:04:39 -04:00
Denis Spasyuk
a8db2a9ce6
Update llama-cli documentation (#8315)
* Update README.md

* Update README.md

* Update README.md

fixed llama-cli/main, templates on some cmds added chat template sections and fixed typos in some areas

* Update README.md

* Update README.md

* Update README.md
2024-07-07 17:08:28 +02:00
Brian
f7cab35ef9
gguf-hash: model wide and per tensor hashing using xxhash and sha1 (#8048)
CLI to hash GGUF files to detect difference on a per model and per tensor level

The hash type we support is:

- `--xxh64`: use xhash 64bit hash mode (default)
- `--sha1`: use sha1
- `--uuid`: use uuid
- `--sha256`: use sha256

While most POSIX systems already have hash checking programs like sha256sum, it
is designed to check entire files. This is not ideal for our purpose if we want
to check for consistency of the tensor data even if the metadata content of the
gguf KV store has been updated.

This program is designed to hash a gguf tensor payload on a 'per tensor layer'
in addition to a 'entire tensor model' hash. The intent is that the entire
tensor layer can be checked first but if there is any detected inconsistencies,
then the per tensor hash can be used to narrow down the specific tensor layer
that has inconsistencies.

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-07-07 22:58:43 +10:00
compilade
d39130a398
py : use cpu-only torch in requirements.txt (#8335) 2024-07-07 14:23:38 +03:00