2399 commits

Author SHA1 Message Date
pudepiedj
f2f002d9af Correct whitespace/nl editor config 2024-03-04 08:02:28 +00:00
pudepiedj
4089657815 Remove extraneous files 2024-03-04 07:54:00 +00:00
pudepiedj
d532d5b1f7 Remove rtf files 2024-03-04 07:44:37 +00:00
pudepiedj
eb3da36e89 Delete rb and vca modules 2024-03-04 07:43:25 +00:00
pudepiedj
f44e9456a2 server update 2024-03-04 07:18:18 +00:00
pudepiedj
96ddeac1c6 Merge branch 'server_branch' of https://github.com/pudepiedj/llama.cpp into server_branch 2024-03-03 11:20:12 +00:00
pudepiedj
480089d00d improve Llamaserver.py 2024-03-03 11:20:10 +00:00
pudepiedj
54bea4428f
Merge branch 'ggerganov:master' into server_branch 2024-03-03 11:19:25 +00:00
Georgi Gerganov
231ae28f07
readme : add API changes section 2024-03-03 12:44:03 +02:00
Douglas Hanley
475df1d6cf
llama : allow for user specified embedding pooling type (#5849)
* allow for user specified pooling type

* llama : use enum types over int

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-03 12:40:27 +02:00
Nindaleth
87c2e8b279
gguf-dump : support i-quants (#5841)
Co-authored-by: Black_Fox <radekliska@gmail.com>
2024-03-03 10:43:42 +02:00
compilade
de9692a7d2
llama : fix llama_copy_state_data with fragmented KV cache (#5840)
The row size of the saved states was based on kv_self.head while
it should be based on llama_kv_cache_cell_max.

Existing session files should still work.

* llama : fix llama_kv_cache_cell_max inability to return 1

I've also changed its return type to uint32_t,
because this function is always used to set the value of uint32_t variables,
and because the index already has this type.

* llama : fix state size calculation

Some bytes in the state were unaccounted for in llama_get_state_size.
Since the logits reserve so much space, it did not cause problems.
2024-03-03 10:41:55 +02:00
Pierrick Hymbert
e6029348e8
ci : schedule slow server tests only on Release or on demand (#5839) 2024-03-03 10:35:23 +02:00
Pierrick Hymbert
8ef969afce
server : init http requests thread pool with --parallel if set (#5836) 2024-03-03 09:48:36 +02:00
pudepiedj
265741aa0f Merge remote-tracking branch 'origin/master' into server_branch 2024-03-03 06:56:31 +00:00
Georgi Gerganov
fa974646e1
flake.lock: Update (#5842)
Flake lock file updates:

• Updated input 'flake-parts':
    'github:hercules-ci/flake-parts/b253292d9c0a5ead9bc98c4e9a26c6312e27d69f' (2024-02-01)
  → 'github:hercules-ci/flake-parts/f7b3c975cf067e56e7cda6cb098ebe3fb4d74ca2' (2024-03-01)
• Updated input 'flake-parts/nixpkgs-lib':
    'github:NixOS/nixpkgs/97b17f32362e475016f942bbdfda4a4a72a8a652?dir=lib' (2024-01-29)
  → 'github:NixOS/nixpkgs/1536926ef5621b09bba54035ae2bb6d806d72ac8?dir=lib' (2024-02-29)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/cbc4211f0afffe6dfd2478a62615dd5175a13f9a' (2024-02-23)
  → 'github:NixOS/nixpkgs/1536926ef5621b09bba54035ae2bb6d806d72ac8' (2024-02-29)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-03-02 20:11:31 -08:00
pudepiedj
f3bb1e55c6 Merge branch 'server_branch' of https://github.com/pudepiedj/llama.cpp into server_branch 2024-03-02 22:10:30 +00:00
pudepiedj
bf366d2d9a add api key 2024-03-02 22:10:28 +00:00
Pierrick Hymbert
9731134296
server: tests: passkey challenge / self-extend with context shift demo (#5832)
* server: tests: add models endpoint scenario

* server: /v1/models add some metadata

* server: tests: add debug field in context before scenario

* server: tests: download model from HF, add batch size

* server: tests: add passkey test

* server: tests: add group attention params

* server: do not truncate prompt tokens if self-extend through group attention is enabled

* server: logs: do not truncate log values

* server: tests - passkey - first good working value of nga

* server: tests: fix server timeout

* server: tests: fix passkey, add doc, fix regex content matching, fix timeout

* server: tests: fix regex content matching

* server: tests: schedule slow tests on master

* server: metrics: fix when no prompt processed

* server: tests: self-extend add llama-2-7B and Mixtral-8x7B-v0.1

* server: tests: increase timeout for completion

* server: tests: keep only the PHI-2 test

* server: tests: passkey add a negative test
2024-03-02 22:00:14 +01:00
Michael Podvitskiy
4a6e2d6142
llama : add abort_callback to interrupt computation (#5409)
* using abort_callback from ggml to stop llama computation

* format fix

* a brief explaining comment

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-02 21:52:25 +02:00
Georgi Gerganov
494c870326
ggml : fix IQ3_S AVX implementation (#5834)
ggml-ci
2024-03-02 20:00:49 +02:00
Jared Van Bortel
4d4d2366fc
convert : automatically fall back to HfVocab if tokenizer.model doesn't exist (#5821) 2024-03-02 12:27:26 -05:00
Jared Van Bortel
c7a0ad8ec9
convert-hf : make model class definitions self-contained (#5825) 2024-03-02 12:21:47 -05:00
Kawrakow
bbde6eb256
ggml : IQ3_S improvements (#5829)
* iq3_s: somewhat faster AVX2 dot product

On a Ryzen 7950X, TG-128 increases to 16 t/s from 15.5 t/s using
16 threads. For 8 threads it is 13.85 t/s vs 11.75 t/s.
PP-512 increases to 28.5 t/s from 23.8 t/s.

* iq3_s: somewhat faster ARM_NEON dot product

Still dog slow - 10.7 t/s up from 9.9 t/s.

* iq3_s: another small ARM_NEON improvement

10.7 -> 11.0 t/s. Using vmulq_s8 is faster than the xor - sub trick
that works best on AVX2.

* iq3_s: minor improvement on Metal

49.4 t/s -> 50.3 t/s

* iq3_s: PPL improvement

E.g., for a context of 4096 LLaMA-v2-7B goes to 5.1340 from 5.1653.

* iq3_s: use new grid everywhere

* Fix ARM_NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-03-02 17:00:51 +02:00
Georgi Gerganov
ef2cd694c4
scripts : add pod-llama.sh 2024-03-02 16:54:20 +02:00
Xuan Son Nguyen
6c32d8c7ad
llama : refactor internal quantization functions (#5830) 2024-03-02 16:19:09 +02:00
compilade
802da0091b
llama : fix segfault from unknown model arch name (#5820)
* llama : fix segfault from unknown model arch name

* llama : make all LLM maps const

This also requires using `std::map::at` instead of its `operator[]`
which does not exist for const maps.

* llama : name LLM_ARCH_UNKNOWN to "(unknown)"

This avoids errors from `std::map::at` when
getting the general name of the model architecture.
Using "(unknown)" instead of an empty string as per suggestion
https://github.com/ggerganov/llama.cpp/pull/5820#issuecomment-1973735284

* llama : remove redundant inner const for LLM_TENSOR_NAMES

The extra const won't do anything here as const maps
return const references to values.

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* llama : remove redundant nullptr check in llm_arch_from_string

Since LLM_ARCH_NAMES is a const map, no spurious elements
with a NULL name are inserted anymore, so this check is dead code.

---------

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
2024-03-02 15:42:56 +02:00
pudepiedj
8bda1c1041
Merge branch 'ggerganov:master' into server_branch 2024-03-02 12:09:07 +00:00
Neo Zhang Jianyu
715641391d
Support multiple GPUs (split mode) on SYCL backend (#5806)
* support multiple cards: split-mode - layer|row

* rm warning

* rebase with master, support two new OPs, close feature for -sm=row, fix for unit test

* update news

* fix merge error

* update according to review comments
2024-03-02 19:49:30 +08:00
pudepiedj
68814783c5 Merge remote-tracking branch 'origin/master' into server_branch 2024-03-02 10:28:37 +00:00
pudepiedj
5d61ae8d2a Renaming some vars 2024-03-02 10:24:07 +00:00
crasm
9bf297a02b
workflows : remove nocleanup arg for check-requirements.sh (#5826)
Reduces peak tmpfs usage and should prevent the check from failing from
running out of space.

Fixes the 'No space left on device' issue mentioned in #5703.
2024-03-02 00:11:06 -05:00
Tushar
cb5e8f7fc4
build(nix): Introduce flake.formatter for nix fmt (#5687)
* build(nix): Introduce flake.formatter for `nix fmt`
* chore: Switch to pkgs.nixfmt-rfc-style
2024-03-01 15:18:26 -08:00
nold
da3b9ba2b7
convert-hf-to-gguf : require einops for InternLM2ForCausalLM (#5792) 2024-03-01 16:51:12 -05:00
Sourab Mangrulkar
c29af7e225
llama : add StarCoder2 support (#5795)
* Add support for starcoder2

* handle rope type

* skip rope freq and rotary embeddings from being serialized

* resolve comments

* Update llama.cpp

* remove redundant changes

* handle `rope-theta`

* llama : change starcoder2 rope type

* address comment

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-01 21:30:46 +02:00
Georgi Gerganov
38d16b1426
server : remove api_like_OAI.py proxy script (#5808) 2024-03-01 20:00:58 +02:00
pudepiedj
f51554180a Merge remote-tracking branch 'origin/master' into server_branch 2024-03-01 17:26:01 +00:00
ddpasa
c2224f003b
ggml-vulkan: fix VULKAN_CHECK_RESULTS flag, which was previously broken (#5813) 2024-03-01 18:00:00 +01:00
pudepiedj
b47525df0a server tweak 2024-03-01 15:53:56 +00:00
kunal-vaishnavi
e743386728
gemma : fix bfloat16 -> float16 conversion issue (#5810) 2024-03-01 16:08:08 +02:00
Miwa / Ensan
f49a535686
common : fix flag --logits-all to --all-logits (#5805) 2024-03-01 15:48:56 +02:00
Pierrick Hymbert
3ab8b3a92e
llama : cleanup unused mmq flags (#5772)
* cleanup unused --no-mul-mat-q,-nommq, -mmq, --mul-mat-q, mul_mat_q

* remove: mul_mat_q in compare llama bench and usage

* update llama-bench

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-03-01 13:39:06 +02:00
Douglas Hanley
9600d59e01
unicode : switch to multimap based nfd_map (#5799)
* switch to multimap based nfd_map due to compile time issues

* simplify multimap keys

* don't construct new locale every time
2024-03-01 11:15:36 +02:00
Pierrick Hymbert
5cb02b4a01
server: allow to override threads server pool with --threads-http (#5794) 2024-03-01 10:08:08 +01:00
Eve
6ea0f010ff
ci : add Ubuntu 22 Vulkan CI run (#5789) 2024-03-01 10:54:53 +02:00
Georgi Gerganov
f105471ef6
server : fix newlines in help (#5785) 2024-03-01 09:59:43 +02:00
AidanBeltonS
38d1521608
[SYCL] Use batched mul_mat pathway (#5591)
* Use batched mul_mat pathway

* rm extra line

* Explicitly state scaled data type

---------

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
2024-03-01 13:06:47 +05:30
Xuan Son Nguyen
052051d8ae
Server: normalize naming (#5779)
* server: normalize naming

* fix spacing
2024-02-29 21:42:11 +01:00
pudepiedj
13d0948fdc server tweak 2024-02-29 18:14:08 +00:00
pudepiedj
71f885f2d0 Llamaserver.py changes 2024-02-29 16:56:51 +00:00