Commit graph

3230 commits

Author SHA1 Message Date
AidanBeltonS
9a17ab914b
Add missing " (#7303) 2024-05-15 17:56:30 +05:30
dm4
ea3b0590ee
embedding : free the batch after execution (#7297) 2024-05-15 15:01:12 +03:00
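Note: the fix above is the standard ownership pattern for llama.cpp batches — anything created with llama_batch_init() must eventually be released with llama_batch_free(). A minimal sketch of that pattern (illustrative only, not the actual diff from #7297):

```cpp
#include "llama.h"

void run_embeddings(llama_context * ctx, const llama_token * tokens, int n) {
    // the batch owns heap-allocated arrays
    llama_batch batch = llama_batch_init(n, 0, 1);
    for (int i = 0; i < n; ++i) {
        batch.token[i]     = tokens[i];
        batch.pos[i]       = i;
        batch.n_seq_id[i]  = 1;
        batch.seq_id[i][0] = 0;
        batch.logits[i]    = i == n - 1; // output only for the last token
    }
    batch.n_tokens = n;

    llama_decode(ctx, batch);

    llama_batch_free(batch); // the fix: free the batch after execution
}
```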
Georgi Gerganov
29499bb593
sync : ggml 2024-05-15 13:23:41 +03:00
John Balis
48aa8fd1f2
ggml : add ggml_upscale_ext (ggml/814)
* initial commit with CPU implementation of upscale to shape and test, cuda implementation next

* experimental commit to see if dst shape is correct

* test version

* test

* removed unnecessary params

* refactor

* fixed tests

* ggml : metal impl + cleanup + sycl dev warnings

* patched ggml_upscale cuda op to handle non-contiguous tensors, added test for non-contiguous behavior

* metal : fix upscale op to support nb00 + style

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-05-15 13:23:33 +03:00
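For readers unfamiliar with the op: "upscale to shape" resamples a tensor to an explicitly given output size rather than by an integer factor. A toy 2-D nearest-neighbor version of the indexing (the real ggml_upscale_ext operates on 4-D ggml tensors):

```cpp
#include <cstdio>
#include <vector>

std::vector<float> upscale_nearest(const std::vector<float> & src,
                                   int w0, int h0, int w1, int h1) {
    std::vector<float> dst(w1 * h1);
    for (int y = 0; y < h1; ++y) {
        for (int x = 0; x < w1; ++x) {
            int sx = x * w0 / w1; // nearest source column
            int sy = y * h0 / h1; // nearest source row
            dst[y * w1 + x] = src[sy * w0 + sx];
        }
    }
    return dst;
}

int main() {
    std::vector<float> src = {1, 2, 3, 4}; // 2x2 input
    auto dst = upscale_nearest(src, 2, 2, 4, 4);
    for (int y = 0; y < 4; ++y) {
        for (int x = 0; x < 4; ++x) printf("%g ", dst[y * 4 + x]);
        printf("\n");
    }
}
```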
Johannes Gäßler
583fd6b000
server bench: fix bench not waiting for model load (#7284) 2024-05-15 08:44:16 +02:00
Georgi Gerganov
9f773486ab
script : sync ggml-rpc 2024-05-14 19:14:38 +03:00
Georgi Gerganov
e8a7fd4fb0
metal : support FA without mask + add asserts (#7278)
* ggml : fa without mask + add asserts

ggml-ci

* metal : support non-contiguous KV

ggml-ci
2024-05-14 19:09:30 +03:00
Georgi Gerganov
a5e3fde857 sync : ggml
ggml-ci
2024-05-14 19:08:09 +03:00
Georgi Gerganov
f308ea7059 metal : tune soft_max number of threads (whisper/0) 2024-05-14 19:08:09 +03:00
Georgi Gerganov
c3c88f296a ggml : try fix ppc64 (whisper/0) 2024-05-14 19:08:09 +03:00
Przemysław Pawełczyk
182adefcf3 ggml : expose SSE3 and SSSE3 for MSVC when AVX is available (whisper/2128) 2024-05-14 19:08:09 +03:00
Hong Bo PENG
0d26d8ccd8 ggml : optimize for ppc64le using VSX intrinsics (ggml/784)
* optimize for ppc64le using VSX intrinsics

* 1. code cleanup: remove comments about the overflow concern.
  2. fix typo in the suffix of scaling.

* Continue to fix typo in suffix of scaling for QK_K <> 256

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-05-14 19:08:09 +03:00
Steve Grubb
4f0263633b
server: free sampling contexts on exit (#7264)
* server: free sampling contexts on exit

This cleans up the last leak found by the address sanitizer.

* fix whitespace

* fix whitespace
2024-05-14 16:11:24 +02:00
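The leak class being fixed: per-slot sampling contexts allocated via common/sampling.h were never released on shutdown, which the address sanitizer reports as a leak. A hedged sketch of the cleanup pattern (the function names match the common library of this era; the actual server diff may differ):

```cpp
#include "common/sampling.h"
#include <vector>

struct server_slots {
    std::vector<llama_sampling_context *> ctxs; // one per slot

    ~server_slots() {
        for (auto * c : ctxs) {
            llama_sampling_free(c); // the fix: free each context on exit
        }
    }
};
```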
Brian
1265c670fd
Revert "move ndk code to a new library (#6951)" (#7282)
This reverts commit efc8f767c8.
2024-05-14 16:10:39 +03:00
Radoslav Gerganov
5e31828d3e
ggml : add RPC backend (#6829)
* ggml : add RPC backend

The RPC backend proxies all operations to a remote server which runs a
regular backend (CPU, CUDA, Metal, etc).

* set TCP_NODELAY

* add CI workflows

* Address review comments

* fix warning

* implement llama_max_devices() for RPC

* Address review comments

* Address review comments

* wrap sockfd into a struct

* implement get_alignment and get_max_size

* add get_device_memory

* fix warning

* win32 support

* add README

* readme : trim trailing whitespace

* Address review comments

* win32 fix

* Address review comments

* fix compile warnings on macos
2024-05-14 14:27:19 +03:00
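The key idea, per the commit message, is pure proxying: the client serializes each operation and a remote server executes it on an ordinary local backend. A minimal, illustrative connection helper showing the TCP_NODELAY detail mentioned above (not the actual ggml-rpc wire protocol):

```cpp
#include <cstdint>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

// host_be: IPv4 address already in network byte order
int rpc_connect(uint32_t host_be, uint16_t port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    // small request/response messages suffer under Nagle batching
    int flag = 1;
    setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag));

    sockaddr_in addr = {};
    addr.sin_family      = AF_INET;
    addr.sin_port        = htons(port);
    addr.sin_addr.s_addr = host_be;
    if (connect(fd, (sockaddr *) &addr, sizeof(addr)) != 0) {
        close(fd);
        return -1;
    }
    return fd; // caller frames and sends serialized ops over this socket
}
```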
slaren
541600201e
llama : disable pipeline parallelism with nkvo (#7265) 2024-05-14 17:33:42 +10:00
Elton Kola
efc8f767c8
move ndk code to a new library (#6951) 2024-05-14 17:30:30 +10:00
Haggai Nuchi
e0f556186b
Add left recursion check: quit early instead of going into an infinite loop (#7083)
* Add left recursion check: quit early instead of going into an infinite loop

* Remove custom enum, rename left recursion check and move to "grammar internal" section, add handling for edge case where a leftmost nonterminal may be empty

* Remove unnecessary declaration
2024-05-14 15:25:56 +10:00
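Context for the check: a grammar rule is left-recursive if it can reach itself while still at the leftmost position of an expansion, and the edge case fixed here is that a leading nonterminal which can derive the empty string keeps the following symbol leftmost too. A hedged sketch of such a detector over a simplified grammar representation (not the actual llama.cpp data structures):

```cpp
#include <functional>
#include <vector>

// rules[r] is a list of alternatives; each alternative is a list of
// symbols, where >= 0 is a nonterminal index and -1 stands for a terminal.
using Grammar = std::vector<std::vector<std::vector<int>>>;

// nullable[n]: whether nonterminal n can derive the empty string
bool detect_left_recursion(const Grammar & g, const std::vector<bool> & nullable) {
    std::vector<char> state(g.size(), 0); // 0=new, 1=in progress, 2=done

    std::function<bool(int)> dfs = [&](int rule) -> bool {
        if (state[rule] == 1) return true;  // back edge: left recursion
        if (state[rule] == 2) return false;
        state[rule] = 1;
        for (const auto & alt : g[rule]) {
            for (int sym : alt) {
                if (sym < 0) break;        // a terminal ends the leftmost run
                if (dfs(sym)) return true;
                if (!nullable[sym]) break; // only continue past nullable NTs
            }
        }
        state[rule] = 2;
        return false;
    };

    for (size_t r = 0; r < g.size(); ++r) {
        if (dfs((int) r)) return true;
    }
    return false;
}
```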
Ryuei
27f65d6267
docs: Fix typo and update description for --embeddings flag (#7026)
- Change '--embedding' to '--embeddings' in the README
- Update the description to match the latest --help output
- Add a caution about defining the physical batch size
2024-05-14 15:20:47 +10:00
compilade
ee52225067
convert-hf : support direct Q8_0 conversion (#7234)
* convert-hf : support q8_0 conversion

* convert-hf : add missing ftype

This was messing with the checksums otherwise.

* convert-hf : add missing ftype to Baichuan and Xverse

I didn't notice these on my first pass.
2024-05-13 14:10:51 -04:00
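For reference, Q8_0 is one of the simplest GGUF quantization formats: blocks of 32 weights share one scale d = max|x| / 127, and each weight is stored as round(x / d) in an int8. A sketch of the per-block math (the real block layout also stores d as fp16):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

constexpr int QK8_0 = 32; // block size

void quantize_q8_0_block(const float * x, float & d, int8_t * q) {
    float amax = 0.0f;
    for (int i = 0; i < QK8_0; ++i) {
        amax = std::max(amax, std::fabs(x[i]));
    }
    d = amax / 127.0f; // per-block scale
    const float id = d != 0.0f ? 1.0f / d : 0.0f;
    for (int i = 0; i < QK8_0; ++i) {
        q[i] = (int8_t) std::lround(x[i] * id); // round to nearest int8
    }
}
```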
Georgi Gerganov
614d3b914e
llama : less KV padding when FA is off (#7257)
ggml-ci
2024-05-13 17:15:15 +03:00
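The idea behind the change: the KV cache length is padded up to a multiple of a block size, and the large padding is only required by the FlashAttention kernels, so it can shrink when FA is off. A hedged sketch; the exact constants are assumptions, not taken from the diff:

```cpp
#include <cstdint>

uint32_t kv_pad(uint32_t n, bool flash_attn) {
    const uint32_t pad = flash_attn ? 256u : 32u; // assumed block sizes
    return ((n + pad - 1) / pad) * pad;           // round up to a multiple
}
```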
k.h.lai
30e70334f7
llava-cli: fix base64 prompt (#7248) 2024-05-14 00:02:36 +10:00
Johannes Gäßler
1c570d8bee
perplexity: add BF16 vs. FP16 results (#7150) 2024-05-13 13:03:27 +02:00
Neo Zhang
948f4ec7c5
[SYCL] rm wait() (#7233) 2024-05-13 18:11:26 +08:00
Joan Martinez
22b5f6b71f Merge branch 'master' of https://github.com/JoanFM/llama.cpp into feat-jina-embeddings-v2-zh 2024-05-13 10:41:48 +02:00
Joan Fontanals
9aa672490c
llama : rename jina tokenizers to v2 (#7249)
* refactor: rename jina tokenizers to v2

* refactor: keep refactoring non-breaking
2024-05-13 11:35:14 +03:00
Joan Martinez
ea0f7df2fb Merge branch 'refactor-jina-rename' of https://github.com/JoanFM/llama.cpp into feat-jina-embeddings-v2-zh 2024-05-13 10:31:28 +02:00
Joan Martinez
fb83012096 refactor: keep refactoring non-breaking 2024-05-13 10:28:26 +02:00
Joan Martinez
22a0113299 fix: fix alignment 2024-05-13 10:27:23 +02:00
Joan Martinez
0771b175aa Merge branch 'refactor-jina-rename' of https://github.com/JoanFM/llama.cpp into feat-jina-embeddings-v2-zh 2024-05-13 09:46:23 +02:00
Joan Martinez
8957cacd98 refactor: rename jina tokenizers to v2 2024-05-13 09:40:46 +02:00
Joan Martinez
d0a99aa424 Merge branch 'master' of https://github.com/JoanFM/llama.cpp into feat-jina-embeddings-v2-zh 2024-05-13 09:38:04 +02:00
Brian
b1f8af1886
convert.py: Outfile default name change and additional metadata support (#4858)
* convert.py: Outfile default name change and additional metadata support

* convert.py: don't stringify Metadata load method output

* convert.py: typo fix

* convert.py: fix metadata format to sync with LLM_KV_NAMES in llama.cpp
2024-05-13 12:56:47 +10:00
Benjamin Findley
e586ee4259
change default temperature of OAI compat API from 0 to 1 (#7226)
* change default temperature of OAI compat API from 0 to 1

* make tests explicitly send temperature to OAI API
2024-05-13 12:40:08 +10:00
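Why this default matters: temperature rescales logits before the softmax, so t = 1 samples from the model's unmodified distribution (matching OpenAI's documented default), while t → 0 collapses to greedy decoding. A scalar sketch of the mechanism:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// t must be > 0; t == 0 is conventionally treated as argmax (greedy)
std::vector<float> softmax_with_temperature(std::vector<float> logits, float t) {
    for (float & l : logits) l /= t;
    float max_l = logits[0];
    for (float l : logits) max_l = std::max(max_l, l);
    float sum = 0.0f;
    for (float & l : logits) { l = std::exp(l - max_l); sum += l; }
    for (float & l : logits) l /= sum;
    return logits; // now a probability distribution over tokens
}
```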
Neo Zhang
cbf75894d2
[SYCL] Add oneapi runtime dll files to win release package (#7241)
* add oneapi runtime dlls to release package

* fix path

* fix path

* fix path

* fix path

* fix path

---------

Co-authored-by: Zhang <jianyu.zhang@intel.com>
2024-05-13 08:04:29 +08:00
Neo Zhang
0d5cef78ae
[SYCL] update CI with oneapi 2024.1 (#7235)
Co-authored-by: Zhang <jianyu.zhang@intel.com>
2024-05-13 08:02:55 +08:00
Johannes Gäßler
dc685be466
CUDA: add FP32 FlashAttention vector kernel (#7188)
* CUDA: add FP32 FlashAttention vector kernel

* fixup! CUDA: add FP32 FlashAttention vector kernel

* fixup! fixup! CUDA: add FP32 FlashAttention vector kernel

* fixup! fixup! fixup! CUDA: add FP32 FlashAttention vector kernel
2024-05-12 19:40:45 +02:00
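Not the CUDA kernel itself, but the core trick such kernels are built on is the streaming ("online") softmax: keep a running max and running sum so attention can be accumulated in one pass without materializing the full score row. A scalar C++ sketch for one query against n keys:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// scores[i] = q . k_i (pre-softmax); V[i] is the value row for key i
std::vector<float> attend_one_query(const std::vector<float> & scores,
                                    const std::vector<std::vector<float>> & V) {
    const size_t d = V[0].size();
    std::vector<float> acc(d, 0.0f);
    float m = -INFINITY; // running max of scores
    float s = 0.0f;      // running sum of exp(score - m)

    for (size_t i = 0; i < scores.size(); ++i) {
        float m_new = std::max(m, scores[i]);
        float scale = std::exp(m - m_new); // rescale previous accumulator
        float p     = std::exp(scores[i] - m_new);
        s = s * scale + p;
        for (size_t j = 0; j < d; ++j) {
            acc[j] = acc[j] * scale + p * V[i][j];
        }
        m = m_new;
    }
    for (size_t j = 0; j < d; ++j) acc[j] /= s; // final softmax normalization
    return acc;
}
```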
Georgi Gerganov
6f1b63606f
cmake : fix version cmp (#7227) 2024-05-12 18:30:23 +03:00
slaren
b228aba91a
remove convert-lora-to-ggml.py (#7204) 2024-05-12 02:29:33 +02:00
Georgi Gerganov
7bd4ffb780
metal : fix warnings (skipme) (#0) 2024-05-11 21:38:13 +03:00
Georgi Gerganov
1622ac023f
sync : ggml 2024-05-11 21:35:05 +03:00
Georgi Gerganov
6aeff24f8b
metal : fix indent (ggml/0) 2024-05-11 21:34:21 +03:00
Georgi Gerganov
325756d28d
ggml : resolve merge (ggml/0)
ggml-ci
2024-05-11 21:33:08 +03:00
Josh Ramer
fed0108491
Scripting & documenting debugging one test without anything else in the loop. (#7096)
* A little documentation that shares my quick tips for working in the repository.

* Update startup-testing-debugging.md

* script that shows a menu of tests to pick from & run the debugger on

* debug-test.sh: Refactor CLI help message

* debug-test.sh: documentation update

* debug-test.sh: CLI Help output corrections

* debug-test.sh: minor doc fix

---------

Authored-by: Josh Ramer <ubuntu@ip-172-31-32-53.ec2.internal>
Assisted-by: brian khuu <mofosyne@gmail.com>
2024-05-12 03:26:35 +10:00
Xuan Son Nguyen
72c177c1f6
fix system prompt handling (#7153) 2024-05-11 17:28:10 +02:00
compilade
5a419926b0
convert-hf : support bfloat16 conversion (#7158)
* convert-hf : support bfloat16 conversion

* gguf-py : flake8 fixes

* convert-hf : add missing space after comma

* convert-hf : get bit-exact same output as ./quantize

The quantization version was missing.

* convert-hf : don't round bf16 NANs

* convert-hf : save some memory with np.int16 intermediate bf16 weights

* convert-hf : more closely match llama.cpp with which weights to keep in f32

* convert-hf : add --outtype auto-f16

This exists for model quantizers who want an initial GGUF with the
most fidelity to the original model while still using a 16-bit float
type instead of 32-bit floats.

* convert-hf : remove a semicolon because flake8 doesn't like it

It's a reflex from when programming in C/C++, I guess.

* convert-hf : support outtype templating in outfile name

* convert-hf : rename --outtype auto-f16 to --outtype auto
2024-05-11 11:06:26 -04:00
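The NaN detail called out above matters because bf16 is essentially the top 16 bits of fp32: rounding can mangle a NaN payload, and plain truncation can even turn a NaN into +/-inf when its upper mantissa bits are zero. A sketch of the conversion with the special case (closely mirroring the usual ggml-style implementation, though the exact code here is illustrative):

```cpp
#include <cstdint>
#include <cstring>

uint16_t fp32_to_bf16(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof(u));

    if ((u & 0x7fffffff) > 0x7f800000) {    // NaN: keep it a NaN, don't round
        return (uint16_t) ((u >> 16) | 64); // force the quiet bit
    }
    // round to nearest, ties to even
    u += 0x7fff + ((u >> 16) & 1);
    return (uint16_t) (u >> 16);
}
```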
Georgi Gerganov
fae9d234b6 sync : ggml
ggml-ci
2024-05-11 15:38:34 +03:00
Justina Cho
f5ef34e428 feat: implemented sigmoid function (ggml/806)
* added sigmoid function

* implemented metal kernel for sigmoid

* implemented cuda kernel for sigmoid

* added sigmoid unary op and incremented count
2024-05-11 15:38:34 +03:00
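The op itself is the elementwise logistic function; a minimal CPU reference of what the new Metal and CUDA kernels compute:

```cpp
#include <cmath>

void sigmoid_f32(const float * src, float * dst, int n) {
    for (int i = 0; i < n; ++i) {
        dst[i] = 1.0f / (1.0f + std::exp(-src[i])); // sigma(x) = 1 / (1 + e^-x)
    }
}
```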
Borislav Stanimirov
ef0d5e3ec9 build: fix and ignore msvc warnings (ggml/805) 2024-05-11 15:38:34 +03:00
Joan Martinez
3269efe70d Merge branch 'master' of https://github.com/JoanFM/llama.cpp into feat-jina-embeddings-v2-zh 2024-05-11 11:52:58 +02:00