Commit graph

2351 commits

Author SHA1 Message Date
DAN™
50973c77ae common : use LLAMA_DEFAULT_SEED (#5855) 2024-03-10 15:38:20 +08:00
DAN™
d48c24273b main : support special tokens as reverse/anti prompt (#5847)
* Support special tokens as reverse/anti prompt.

* Tokenize antiprompts only once.

* main : minor

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-10 15:38:20 +08:00
slaren
a0a1ca04d8 cuda : fix data race in soft max (#5853) 2024-03-10 15:38:20 +08:00
Georgi Gerganov
5b2daffcde readme : add API changes section 2024-03-10 15:38:20 +08:00
Douglas Hanley
74a0202ff3 llama : allow for user specified embedding pooling type (#5849)
* allow for user specified pooling type

* llama : use enum types over int

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-10 15:38:20 +08:00
Nindaleth
c0a1a5de91 gguf-dump : support i-quants (#5841)
Co-authored-by: Black_Fox <radekliska@gmail.com>
2024-03-10 15:38:19 +08:00
compilade
abb8e001e5 llama : fix llama_copy_state_data with fragmented KV cache (#5840)
The row size of the saved states was based on kv_self.head while
it should be based on llama_kv_cache_cell_max.

Existing session files should still work.

* llama : fix llama_kv_cache_cell_max inability to return 1

I've also changed its return type to uint32_t,
because this function is always used to set the value of uint32_t variables,
and because the index already has this type.

* llama : fix state size calculation

Some bytes in the state were unaccounted for in llama_get_state_size.
Since the logits reserve so much space, it did not cause problems.
2024-03-10 15:38:19 +08:00
Pierrick Hymbert
7a5c8bd3b6 ci : schedule slow server tests only on Release or on demand (#5839) 2024-03-10 15:38:19 +08:00
Pierrick Hymbert
909f62ef71 server : init http requests thread pool with --parallel if set (#5836) 2024-03-10 15:38:19 +08:00
Georgi Gerganov
49dab82b48 flake.lock: Update (#5842)
Flake lock file updates:

• Updated input 'flake-parts':
    'github:hercules-ci/flake-parts/b253292d9c0a5ead9bc98c4e9a26c6312e27d69f' (2024-02-01)
  → 'github:hercules-ci/flake-parts/f7b3c975cf067e56e7cda6cb098ebe3fb4d74ca2' (2024-03-01)
• Updated input 'flake-parts/nixpkgs-lib':
    'github:NixOS/nixpkgs/97b17f32362e475016f942bbdfda4a4a72a8a652?dir=lib' (2024-01-29)
  → 'github:NixOS/nixpkgs/1536926ef5621b09bba54035ae2bb6d806d72ac8?dir=lib' (2024-02-29)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/cbc4211f0afffe6dfd2478a62615dd5175a13f9a' (2024-02-23)
  → 'github:NixOS/nixpkgs/1536926ef5621b09bba54035ae2bb6d806d72ac8' (2024-02-29)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-03-10 15:38:19 +08:00
Pierrick Hymbert
23f60349e3 server: tests: passkey challenge / self-extend with context shift demo (#5832)
* server: tests: add models endpoint scenario

* server: /v1/models add some metadata

* server: tests: add debug field in context before scenario

* server: tests: download model from HF, add batch size

* server: tests: add passkey test

* server: tests: add group attention params

* server: do not truncate prompt tokens if self-extend through group attention is enabled

* server: logs: do not truncate log values

* server: tests - passkey - first good working value of nga

* server: tests: fix server timeout

* server: tests: fix passkey, add doc, fix regex content matching, fix timeout

* server: tests: fix regex content matching

* server: tests: schedule slow tests on master

* server: metrics: fix when no prompt processed

* server: tests: self-extend add llama-2-7B and Mixtral-8x7B-v0.1

* server: tests: increase timeout for completion

* server: tests: keep only the PHI-2 test

* server: tests: passkey add a negative test
2024-03-10 15:38:19 +08:00
Michael Podvitskiy
352c2f375f llama : add abort_callback to interrupt computation (#5409)
* using abort_callback from ggml to stop llama computation

* format fix

* a brief explaining comment

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-10 15:38:19 +08:00
Georgi Gerganov
afe9525a70 ggml : fix IQ3_S AVX implementation (#5834)
ggml-ci
2024-03-10 15:38:19 +08:00
Jared Van Bortel
dcb8d4439a convert : automatically fall back to HfVocab if tokenizer.model doesn't exist (#5821) 2024-03-10 15:38:19 +08:00
Jared Van Bortel
5fd9d9e1ad convert-hf : make model class definitions self-contained (#5825) 2024-03-10 15:38:19 +08:00
Kawrakow
2ee066ad9e ggml : IQ3_S improvements (#5829)
* iq3_s: somewhat faster AVX2 dot product

On Ryzen a 7950X TG-128 increases to 16 t/s from 15.5 t/s using
16 threads. For 8 threads it is 13.85 t/s vs 11.75 t/s.
PP-512 increases to 28.5 t/s from 23.8 t/s.

* iq3_s: somewhat faster ARM_NEON dot product

Still dog slow - 10.7 t/s up from 9.9 t/s.

* iq3_s: another small ARM_NEON improvement

10.7 -> 11.0 t/s. Using vmulq_s8 is faster than the xor - sub trick
that works best on AVX2.

* iq3_s: minor improvement on Metal

49.4 t/s -> 50.3 t/s

* iq3_s: PPL improvement

E.g., for a context of 4096 LLaMA-v2-7B goes to 5.1340 from 5.1653.

* iq3_s: use new grid everywhere

* Fix ARM_NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-03-10 15:38:19 +08:00
Georgi Gerganov
28af53f508 scripts : add pod-llama.sh 2024-03-10 15:38:19 +08:00
Xuan Son Nguyen
a6ebb7be75 llama : refactor internal quantization functions (#5830) 2024-03-10 15:38:19 +08:00
compilade
316f837abd llama : fix segfault from unknown model arch name (#5820)
* llama : fix segfault from unknown model arch name

* llama : make all LLM maps const

This also requires using `std::map::at` instead of its `operator[]`
which does not exist for const maps.

* llama : name LLM_ARCH_UNKNOWN to "(unknown)"

This avoids errors from `std::map::at` when
getting the general name of the model architecture.
Using "(unknown)" instead of an empty string as per suggestion
https://github.com/ggerganov/llama.cpp/pull/5820#issuecomment-1973735284

* llama : remove redundant inner const for LLM_TENSOR_NAMES

The extra const won't do anything here as const maps
return const references to values.

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* llama : remove redundant nullptr check in llm_arch_from_string

Since LLM_ARCH_NAMES is a const map, no spurious elements
with a NULL name are inserted anymore, so this check is dead code.

---------

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
2024-03-10 15:38:19 +08:00
Neo Zhang Jianyu
e789d25713 Support multiple GPUs (split mode) on SYCL backend (#5806)
* suport multiple cards: split-mode - layer|row

* rm warning

* rebase with master, support tow new OPs, close feature for -sm=row, fix for unit test

* update news

* fix merge error

* update according to review comments
2024-03-10 15:38:19 +08:00
crasm
439467b4f6 workflows : remove nocleanup arg for check-requirements.sh (#5826)
Reduces peak tmpfs usage and should prevent the check from failing from
running out of space.

Fixes the 'No space left on device' issue mentioned in #5703.
2024-03-10 15:38:19 +08:00
Tushar
337b5df18c build(nix): Introduce flake.formatter for nix fmt (#5687)
* build(nix): Introduce flake.formatter for `nix fmt`
* chore: Switch to pkgs.nixfmt-rfc-style
2024-03-10 15:38:19 +08:00
nold
2d16323c7e convert-hf-to-gguf : require einops for InternLM2ForCausalLM (#5792) 2024-03-10 15:38:19 +08:00
Sourab Mangrulkar
81ee97614b llama : add StarCoder2 support (#5795)
* Add support for starcoder2

* handle rope type

* skip rope freq and rotary embeddings from being serialized

* resolve comments

* Update llama.cpp

* remove redundant changes

* handle `rope-theta`

* llama : change starcoder2 rope type

* address comment

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-10 15:38:19 +08:00
Georgi Gerganov
3851d6bebe server : remove api_like_OAI.py proxy script (#5808) 2024-03-10 15:38:19 +08:00
ddpasa
dba99b0778 ggml-vulkan: fix VULKAN_CHECK_RESULTS flag, which was previously broken (#5813) 2024-03-10 15:38:19 +08:00
kunal-vaishnavi
e54be67b1b gemma : fix bfloat16 -> float16 conversion issue (#5810) 2024-03-10 15:38:19 +08:00
Miwa / Ensan
d134b79c6a common : fix flag --logits-all to --all-logits (#5805) 2024-03-10 15:38:19 +08:00
Pierrick Hymbert
cd3dca791b llama : cleanup unused mmq flags (#5772)
* cleanup unused --no-mul-mat-q,-nommq, -mmq, --mul-mat-q, mul_mat_q

* remove: mul_mat_q in compare llama bench and usage

* update llama-bench

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-03-10 15:38:19 +08:00
Douglas Hanley
51a38fc0c8 unicode : switch to multimap based nfd_map (#5799)
* switch to multimap based nfd_map due to compile time issues

* simplify multimap keys

* dont construct new locale every time
2024-03-10 15:38:19 +08:00
Pierrick Hymbert
996a2ff344 server: allow to override threads server pool with --threads-http (#5794) 2024-03-10 15:38:19 +08:00
Eve
a4a260e57d ci : add Ubuntu 22 Vulkan CI run (#5789) 2024-03-10 15:38:19 +08:00
Georgi Gerganov
644d40a95f server : fix newlines in help (#5785) 2024-03-10 15:38:19 +08:00
AidanBeltonS
058951722f [SYCL] Use batched mul_mat pathway (#5591)
* Use batched mul_mat pathway

* rm extra line

* Explicitly state scaled data type

---------

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
2024-03-10 15:38:19 +08:00
Xuan Son Nguyen
3e0058dbfd Server: normalize naming (#5779)
* server: normalize naming

* fix spacing
2024-03-10 15:38:19 +08:00
hazelnutcloud
4b08df1f8b fix broken merge 2024-02-29 18:13:15 +08:00
hsnmkls
e063bce2b6
Merge branch 'ggerganov:master' into master 2024-02-29 16:21:59 +08:00
Marcus Dunn
d5ab29757e
llama : constified llama_set_state_data's src (#5774) 2024-02-29 10:17:23 +02:00
Georgi Gerganov
87c91c0766
ci : reduce 3b ppl chunks to 1 to avoid timeout (#5771)
ggml-ci
2024-02-28 21:44:21 +02:00
Eve
317709b2a8
make portability_enumeration_ext apple only (#5757) 2024-02-28 20:33:37 +01:00
Georgi Gerganov
08c5ee87e4
llama : remove deprecated API (#5770)
ggml-ci
2024-02-28 18:43:38 +02:00
Georgi Gerganov
78aacf3634
awq-py : remove (#5768) 2024-02-28 17:36:53 +02:00
Georgi Gerganov
8c0e8f4e73
sync : ggml 2024-02-28 11:17:32 +02:00
slaren
2774b0c974
add google magika inference example (ggml/748)
* add magika inference example

* ggml : fix unaligned accesses in custom ops

* ggml : fix FP32 GELU for values that exceed the FP16 range

* use ggml_pool_1d

* add README

* Update README.md

* pad inputs if the files are too small

* cleanup

ggml-ci
2024-02-28 11:17:06 +02:00
UEXTM.com
5f70671856
Introduce backend GUIDs (ggml/743)
* Introduce backend GUIDs

Initial proposed implementation of backend GUIDs
(Discussed in https://github.com/ggerganov/ggml/pull/741)

Hardcoded CPU backend GUID (for now)
Change ggml_backend_is_cpu logic to use GUID

* Remove redundant functions

Remove redundant functions `ggml_backend_i::get_name` and `ggml_backend_guid` which are not desired for future expansion

* Add spaces to match style

Co-authored-by: slaren <slarengh@gmail.com>

* Fix brace style to match

Co-authored-by: slaren <slarengh@gmail.com>

* Add void to () in function signature

Co-authored-by: slaren <slarengh@gmail.com>

* Add back ggml_backend_guid and make CPU_GUID a local static in ggml_backend_cpu_guid

* add guids to all backends

ggml-ci

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-02-28 11:17:05 +02:00
Xuan Son Nguyen
a693bea1e6
server : hit Ctrl+C twice to exit (#5734)
* server: twice ctrl+C to exit

* std::atomic_flag

* sigint: message

* sigint: stderr

* Update examples/server/server.cpp

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

---------

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
2024-02-28 10:55:37 +02:00
compilade
adcb12a9ba
llama : fix non-quantization of expert gating tensors (#5754)
This reverts a single line from #5475
2024-02-28 10:52:56 +02:00
Douglas Hanley
177628bfd8
llama : improve BERT tokenization (#5740)
* implement nfd for stripping accents in wpm tokenizer

* sort nfd map; reuse iterator

* use builtin tolower

* add locale include

* Simplify to_lower cases

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

---------

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
2024-02-28 10:51:11 +02:00
Daniel Bevenius
6c4416868d
readme : add link to LLaVA 1.6 models (#5758)
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-02-28 10:39:39 +02:00
Jorge A
efc72253f7
server : add "/chat/completions" alias for "/v1/...` (#5722)
* Add "/chat/completions" as alias for "/v1/chat/completions"

* merge to upstream master

* minor : fix trailing whitespace

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-28 10:39:15 +02:00