Commit graph

3450 commits

Author SHA1 Message Date
John Balis
48aa8fd1f2
ggml : add ggml_upscale_ext (ggml/814)
* initial commit with CPU implementation of upscale to shape and test, cuda implementation next

* experimental commit to see if dst shape is correct

* test version

* test

* removed unnecessary params

* refactor

* fixed tests

* ggml : metal impl + cleanup + sycl dev warnings

* patched ggml_upscale cuda op to handle non-contiguous tensors, added test for non-contiguous behavior

* metal : fix upsacle op to support nb00 + style

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-05-15 13:23:33 +03:00
Johannes Gäßler
583fd6b000
server bench: fix bench not waiting for model load (#7284) 2024-05-15 08:44:16 +02:00
Georgi Gerganov
9f773486ab
script : sync ggml-rpc 2024-05-14 19:14:38 +03:00
Georgi Gerganov
e8a7fd4fb0
metal : support FA without mask + add asserts (#7278)
* ggml : fa without mask + add asserts

ggml-ci

* metal : support non-contiguous KV

ggml-ci
2024-05-14 19:09:30 +03:00
Georgi Gerganov
a5e3fde857 sync : ggml
ggml-ci
2024-05-14 19:08:09 +03:00
Georgi Gerganov
f308ea7059 metal : tune soft_max number of threads (whisper/0) 2024-05-14 19:08:09 +03:00
Georgi Gerganov
c3c88f296a ggml : try fix ppc64 (whisper/0) 2024-05-14 19:08:09 +03:00
Przemysław Pawełczyk
182adefcf3 ggml : expose SSE3 and SSSE3 for MSVC when AVX is available (whisper/2128) 2024-05-14 19:08:09 +03:00
Hong Bo PENG
0d26d8ccd8 ggml : optimize for ppc64le using VSX intrinsics (ggml/784)
* optimize for ppc64le using VSX intrinsics

* 1. code clean up by removing comments about overflow concern.

2. fix typo in suffix of scaling.

* Continue to fix typo in suffix of scaling for QK_K <> 256

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-05-14 19:08:09 +03:00
Steve Grubb
4f0263633b
server: free sampling contexts on exit (#7264)
* server: free sampling contexts on exit

This cleans up last leak found by the address sanitizer.

* fix whitespace

* fix whitespace
2024-05-14 16:11:24 +02:00
Brian
1265c670fd
Revert "move ndk code to a new library (#6951)" (#7282)
This reverts commit efc8f767c8.
2024-05-14 16:10:39 +03:00
Radoslav Gerganov
5e31828d3e
ggml : add RPC backend (#6829)
* ggml : add RPC backend

The RPC backend proxies all operations to a remote server which runs a
regular backend (CPU, CUDA, Metal, etc).

* set TCP_NODELAY

* add CI workflows

* Address review comments

* fix warning

* implement llama_max_devices() for RPC

* Address review comments

* Address review comments

* wrap sockfd into a struct

* implement get_alignment and get_max_size

* add get_device_memory

* fix warning

* win32 support

* add README

* readme : trim trailing whitespace

* Address review comments

* win32 fix

* Address review comments

* fix compile warnings on macos
2024-05-14 14:27:19 +03:00
slaren
541600201e
llama : disable pipeline parallelism with nkvo (#7265) 2024-05-14 17:33:42 +10:00
Elton Kola
efc8f767c8
move ndk code to a new library (#6951) 2024-05-14 17:30:30 +10:00
Haggai Nuchi
e0f556186b
Add left recursion check: quit early instead of going into an infinite loop (#7083)
* Add left recursion check: quit early instead of going into an infinite loop

* Remove custom enum, rename left recursion check and move to "grammar internal" section, add handling for edge case where a leftmost nonterminal may be empty

* Remove unnecessary declaration
2024-05-14 15:25:56 +10:00
Ryuei
27f65d6267
docs: Fix typo and update description for --embeddings flag (#7026)
- Change '--embedding' to '--embeddings' in the README
- Update the description to match the latest --help output
- Added a caution about defining physical batch size
2024-05-14 15:20:47 +10:00
compilade
ee52225067
convert-hf : support direct Q8_0 conversion (#7234)
* convert-hf : support q8_0 conversion

* convert-hf : add missing ftype

This was messing with the checksums otherwise.

* convert-hf : add missing ftype to Baichuan and Xverse

I didn't notice these on my first pass.
2024-05-13 14:10:51 -04:00
teleprint-me
cfe659d90a
feat: Add option for adding generation prompt
Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>
2024-05-13 13:12:38 -04:00
teleprint-me
da96fdd15f
patch: Apply patch for advisories/GHSA-56xg-wfcc-g829
Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>
2024-05-13 13:10:44 -04:00
Georgi Gerganov
614d3b914e
llama : less KV padding when FA is off (#7257)
ggml-ci
2024-05-13 17:15:15 +03:00
k.h.lai
30e70334f7
llava-cli: fix base64 prompt (#7248) 2024-05-14 00:02:36 +10:00
Johannes Gäßler
1c570d8bee
perplexity: add BF16 vs. FP16 results (#7150) 2024-05-13 13:03:27 +02:00
Neo Zhang
948f4ec7c5
[SYCL] rm wait() (#7233) 2024-05-13 18:11:26 +08:00
Joan Fontanals
9aa672490c
llama : rename jina tokenizers to v2 (#7249)
* refactor: rename jina tokenizers to v2

* refactor: keep refactoring non-breaking
2024-05-13 11:35:14 +03:00
Brian
b1f8af1886
convert.py: Outfile default name change and additional metadata support (#4858)
* convert.py: Outfile default name change and additional metadata support

* convert.py: don't stringify Metadata load method output

* convert.py: typo fix

* convert.py: fix metadata format to sync with LLM_KV_NAMES in llama.cpp
2024-05-13 12:56:47 +10:00
Benjamin Findley
e586ee4259
change default temperature of OAI compat API from 0 to 1 (#7226)
* change default temperature of OAI compat API from 0 to 1

* make tests explicitly send temperature to OAI API
2024-05-13 12:40:08 +10:00
teleprint-me
f8bb223924
refactor: Remove rename from display to render and return result instead of printing 2024-05-12 21:17:04 -04:00
teleprint-me
214e9e6f0b
refactor: Add logging debug and clean up logger implementation 2024-05-12 21:07:05 -04:00
teleprint-me
6be3576e01
feat: Add sane defaults and options for setting special tokens 2024-05-12 20:48:29 -04:00
teleprint-me
fa0b0b10cc
feat: Allow toggling verbosity 2024-05-12 20:39:55 -04:00
teleprint-me
668c7ee6c5
refactor: Use render template instead of format 2024-05-12 20:37:27 -04:00
teleprint-me
8b9ed888bc
patch: Handle how templates are rendered if no system prompt is allowed 2024-05-12 20:17:35 -04:00
teleprint-me
4a018e706f
feat: Add assistant turn 2024-05-12 20:08:37 -04:00
teleprint-me
bf5154f9bc
docs: Fix filename in docstring and remove return type from main 2024-05-12 20:07:30 -04:00
Neo Zhang
cbf75894d2
[SYCL] Add oneapi runtime dll files to win release package (#7241)
* add oneapi running time dlls to release package

* fix path

* fix path

* fix path

* fix path

* fix path

---------

Co-authored-by: Zhang <jianyu.zhang@intel.com>
2024-05-13 08:04:29 +08:00
Neo Zhang
0d5cef78ae
[SYCL] update CI with oneapi 2024.1 (#7235)
Co-authored-by: Zhang <jianyu.zhang@intel.com>
2024-05-13 08:02:55 +08:00
Austin
f572213c34
Merge branch 'ggerganov:master' into gguf-model-template 2024-05-12 19:54:44 -04:00
teleprint-me
eac2e83f9f
gguf: Add example script for extracting chat template 2024-05-12 19:51:55 -04:00
Johannes Gäßler
dc685be466
CUDA: add FP32 FlashAttention vector kernel (#7188)
* CUDA: add FP32 FlashAttention vector kernel

* fixup! CUDA: add FP32 FlashAttention vector kernel

* fixup! fixup! CUDA: add FP32 FlashAttention vector kernel

* fixup! fixup! fixup! CUDA: add FP32 FlashAttention vector kernel
2024-05-12 19:40:45 +02:00
Georgi Gerganov
6f1b63606f
cmake : fix version cmp (#7227) 2024-05-12 18:30:23 +03:00
slaren
b228aba91a
remove convert-lora-to-ggml.py (#7204) 2024-05-12 02:29:33 +02:00
Georgi Gerganov
7bd4ffb780
metal : fix warnings (skipme) (#0) 2024-05-11 21:38:13 +03:00
Georgi Gerganov
1622ac023f
sync : ggml 2024-05-11 21:35:05 +03:00
Georgi Gerganov
6aeff24f8b
metal : fix indent (ggml/0) 2024-05-11 21:34:21 +03:00
Georgi Gerganov
325756d28d
ggml : resolve merge (ggml/0)
ggml-ci
2024-05-11 21:33:08 +03:00
Josh Ramer
fed0108491
Scripting & documenting debugging one test without anything else in the loop. (#7096)
* A little documentation that shares my quick tips for working in the repository.

* Update startup-testing-debugging.md

* script that shows a menu of tests to pick from & run the debugger on

* debug-test.sh: Refactor CLI help message

* debug-test.sh: documentation update

* debug-test.sh: CLI Help output corrections

* debug-test.sh: minor doc fix

---------

authored-by: Josh Ramer <ubuntu@ip-172-31-32-53.ec2.internal>
Assisted-by: brian khuu <mofosyne@gmail.com>
2024-05-12 03:26:35 +10:00
Xuan Son Nguyen
72c177c1f6
fix system prompt handling (#7153) 2024-05-11 17:28:10 +02:00
compilade
5a419926b0
convert-hf : support bfloat16 conversion (#7158)
* convert-hf : support bfloat16 conversion

* gguf-py : flake8 fixes

* convert-hf : add missing space after comma

* convert-hf : get bit-exact same output as ./quantize

The quantization version was missing.

* convert-hf : don't round bf16 NANs

* convert-hf : save some memory with np.int16 intermediate bf16 weights

* convert-hf : more closely match llama.cpp with which weights to keep in f32

* convert-hf : add --outtype auto-f16

A reason for this to exist is for model quantizers who want an initial
GGUF with the most fidelity to the original model while still using
a 16-bit float type instead of 32-bit floats.

* convert-hf : remove a semicolon because flake8 doesn't like it

It's a reflex from when programming in C/C++, I guess.

* convert-hf : support outtype templating in outfile name

* convert-hf : rename --outtype auto-f16 to --outtype auto
2024-05-11 11:06:26 -04:00
Georgi Gerganov
fae9d234b6 sync : ggml
ggml-ci
2024-05-11 15:38:34 +03:00
Justina Cho
f5ef34e428 feat: implemented sigmoid function (ggml/806)
* added sigmoid function

* implemented metal kernel for sigmoid

* implemented cuda kernel for sigmoid

* added sigmoid unary op and incremented count
2024-05-11 15:38:34 +03:00