Commit graph

2488 commits

Author SHA1 Message Date
Julia Longtin
6face8a0be first fixes. 2024-03-23 15:56:47 +00:00
Julia Longtin
0a2051aa88 attempt to speed up float clearing. 2024-03-23 15:55:00 +00:00
Julia Longtin
0b012c03ef allow using code from ggml-phi-knc-dot_q5_K_q8_K.c 2024-03-23 15:02:56 +00:00
Julia Longtin
0b3f17127f force to compile. 2024-03-23 14:58:33 +00:00
Julia Longtin
18f353987c tell ggml-common.h to export what we want. 2024-03-23 14:49:35 +00:00
Julia Longtin
cd20404250 pull in ggml specific types. 2024-03-23 14:38:15 +00:00
Julia Longtin
8f57803f58 import stdio.h for size_t. 2024-03-23 14:29:59 +00:00
Julia Longtin
9bcb8350d5 import stdint.h for size_t. 2024-03-23 14:28:29 +00:00
Julia Longtin
a7bd64c130 begin work on targeting dot_q5_K_q8_K. 2024-03-23 14:19:47 +00:00
Julia Longtin
9185e14922 be more specific about the length of our list of run amounts. 2024-03-21 20:38:49 +00:00
Julia Longtin
0979522fbe spacing changes. 2024-03-21 18:36:25 +00:00
Julia Longtin
ac3637142d formatting changes. 2024-03-20 21:34:12 +00:00
Julia Longtin
76e66e77c2 use the same header as ggml.c, and remove some warnings. 2024-03-20 21:12:22 +00:00
Julia Longtin
ee27148629 remove intrinsics import, and use upConv to save 12 bytes of memory transit. 2024-03-20 20:15:30 +00:00
Julia Longtin
ab6f3a8a8d Update ggml-phi-knc.c 2024-03-17 21:36:14 +00:00
Julia Longtin
f882673ba6 add a benchmark / test binary. 2024-03-17 21:20:14 +00:00
Julia Longtin
fe663c1b63 merge from upstream 2024-03-17 21:15:32 +00:00
Julia Longtin
eac00a72d5 Update ggml.c 2024-03-16 14:17:21 +00:00
Julia Longtin
e216a2f133 Update ggml.c 2024-03-16 14:15:51 +00:00
Julia Longtin
257ffd9955 Update ggml.c 2024-03-16 14:13:22 +00:00
Julia Longtin
717e164dd7 implement F32 dot products. 2024-03-16 14:05:03 +00:00
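
For orientation, the routine these intrinsic commits build toward is just a multiply-accumulate over two float arrays. A plain scalar sketch of an F32 dot product (the function name and signature are illustrative, not the actual ggml internals, which this series replaces with 512-bit Knights Corner intrinsics):

```c
// Scalar reference for an F32 dot product; illustrative only.
#include <stdio.h>
#include <stddef.h>

static float dot_f32(size_t n, const float * x, const float * y) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += x[i] * y[i];   // the part the vector intrinsics accelerate
    }
    return sum;
}

int main(void) {
    const float a[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    const float b[4] = { 4.0f, 3.0f, 2.0f, 1.0f };
    printf("dot = %f\n", dot_f32(4, a, b)); // 1*4 + 2*3 + 3*2 + 4*1 = 20
    return 0;
}
```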
Julia Longtin
7a57feba0c import intrinsics. 2024-03-13 19:26:54 +00:00
Julia Longtin
a1ae649662 use right type, and define GGML_F32_VEC_ZERO. 2024-03-13 19:23:53 +00:00
Julia Longtin
f346a41deb try to implement one intrinsic 2024-03-13 19:18:10 +00:00
Julia Longtin
aec982eefd try to detect the PHI cross compiler in make. 2024-03-12 21:54:38 +00:00
Julia Longtin
a31c936c5a try to detect the PHI cross compiler in make. 2024-03-12 21:40:46 +00:00
Julia Longtin
5a2973af25 instead of checking on glibc, check on SYS_getcpu 2024-03-12 21:07:10 +00:00
Julia Longtin
7f3722beb6 handle the case that we have no glibc on the PHI. 2024-03-12 21:02:14 +00:00
Julia Longtin
868a2016ac add detection of Xeon PHI: Knights Corner. 2024-03-12 20:57:43 +00:00
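
The detection commits above suggest a compile-time check for the Knights Corner target. A minimal sketch of that idea, assuming the __MIC__ / __KNC__ macros commonly predefined by MIC toolchains (the exact macros, Makefile logic, and SYS_getcpu fallback used in these commits may differ):

```c
// Sketch only: detect a Xeon PHI (Knights Corner) build at compile time.
// Treating __MIC__ / __KNC__ as the detection signal is an assumption here,
// not necessarily the commit's code.
#include <stdio.h>

#if defined(__MIC__) || defined(__KNC__)
#define GGML_PHI_KNC 1
#else
#define GGML_PHI_KNC 0
#endif

int main(void) {
    printf("Knights Corner target: %s\n", GGML_PHI_KNC ? "yes" : "no");
    return 0;
}
```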
slaren
306d34be7a ci : remove tidy-review (#6021) 2024-03-12 17:55:19 +02:00
Georgi Gerganov
8030da7afe
ggml : reuse quantum structs across backends (#5943)
* ggml : reuse quant blocks across backends

ggml-ci

* ggml : define helper constants only for CUDA and SYCL

ggml-ci

* ggml : define helper quantum constants for SYCL

ggml-ci
2024-03-12 14:27:20 +02:00
Georgi Gerganov
184215e783 ggml : fix UB in IQ2_S and IQ3_S (#6012) 2024-03-12 13:49:55 +02:00
Georgi Gerganov
48358b2e5b
sycl : update IQ1_S kernels (WIP - not working!) (#5995)
* sycl : try to fix after IQ1_S changes

* sycl : iq1s_grid -> iq1s_grid_gpu

* sycl : fix grid type
2024-03-12 11:15:05 +02:00
gliptic
5cdb371731 grammar : fix unnecessarily retained pointer to rules (#6003) 2024-03-11 21:59:03 +02:00
Kawrakow
44ca159faf
1.5 bit: we can do even better (#5999)
* iq1_s: we can do even better

Spent one of the 4 scale bits on the sign of a 0.125 shift.
I.e., quants are now -1 + delta, delta, 1 + delta, where delta
is +/- 0.125.

CUDA works, same performance as before.
PPL(LLaMA-v2-7B) is now 11.85!

* iq1_s: make scalar and AVX2 work with the new version

* iq1_s: make Neon work with new version.

~10% drop in performance, so will need some more work.

* iq1_s: make Metal work with new version

* iq1_s: very slightly faster dequantize on Metal

* iq1_s: fix dequantize on the CPU

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-03-11 17:53:15 +02:00
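
The commit message above fully specifies the new reconstruction: each quant becomes -1 + delta, delta, or 1 + delta, with delta = +/- 0.125 chosen by the sign bit spent from the scale. A tiny sketch of that mapping (names are hypothetical, not the IQ1_S kernel code):

```c
// Illustrative reconstruction of the IQ1_S values described above.
// The sign-bit-to-delta polarity is an arbitrary choice for this sketch.
#include <stdio.h>

static float iq1s_value(int q /* -1, 0, or +1 */, int sign_bit /* 0 or 1 */) {
    const float delta = sign_bit ? -0.125f : 0.125f;
    return (float) q + delta;
}

int main(void) {
    for (int s = 0; s <= 1; ++s) {
        for (int q = -1; q <= 1; ++q) {
            printf("q=%+d sign=%d -> %+.3f\n", q, s, iq1s_value(q, s));
        }
    }
    return 0;
}
```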
Georgi Gerganov
05b06210c9
llama : more consistent names of count variables (#5994)
* llama : more consistent names of count variables

ggml-ci

* llama : n_parallel -> n_seq_max

* common : fix param name

* examples : fix param name
2024-03-11 17:49:47 +02:00
Georgi Gerganov
83796e62bc
llama : refactor unicode stuff (#5992)
* llama : refactor unicode stuff

ggml-ci

* unicode : names

* make : fix c++ compiler

* unicode : names

* unicode : straighten tables

* zig : fix build

* unicode : put nfd normalization behind API

ggml-ci

* swift : fix build

* unicode : add BOM

* unicode : add <cstdint>

ggml-ci

* unicode : pass cpts as const ref
2024-03-11 17:47:47 +02:00
Jakub N
828defefb6 Update server docker image URLs (#5997) 2024-03-11 14:40:42 +01:00
Xuan Son Nguyen
caa106d4e0
Server: format error to json (#5961)
* server: format error to json

* server: do not crash on grammar error

* fix api key test case

* revert limit max n_predict

* small fix

* correct coding style

* update completion.js

* launch_slot_with_task

* update docs

* update_slots

* update webui

* update readme
2024-03-11 10:56:41 +01:00
Michael Podvitskiy
3202361c5b
ggml, ci : Windows ARM runner and build fixes (#5979)
* windows arm ci

* fix `error C2078: too many initializers` with ggml_vld1q_u32 macro for MSVC ARM64

* fix `warning C4146: unary minus operator applied to unsigned type, result still unsigned`

* fix `error C2065: '__fp16': undeclared identifier`
2024-03-11 11:28:51 +02:00
Minsoo Cheong
332bdfd798
server : maintain chat completion id for streaming responses (#5988)
* server: maintain chat completion id for streaming responses

* Update examples/server/utils.hpp

* Update examples/server/utils.hpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-11 10:09:32 +02:00
Gilad S
ecab1c75de cmake : fix subdir for LLAMA_METAL_EMBED_LIBRARY (#5985) 2024-03-11 10:00:08 +02:00
Georgi Gerganov
ee35600b90 llama : fix F16/F32 downcast + improve names (#5980) 2024-03-11 09:56:47 +02:00
Kawrakow
be858f6205
Better 1.5 bit quantization (#5971)
* Trying blocks of 16 for IQ1_S - seems slightly better

* iq1s_blocks16: Adjust scale fudge factor to 1.125

* iq1s_blocks16: going to blocks of 32

with 2048 lattice points, so same bpw.
This is even better than blocks of 16.
Should I try blocks of 64? But to keep the same
bpw, when I go to 4096 lattice points, I need to
remove blocks altogether and just have superblocks of
256 weights.

* iq1s_blocks16: Use 2*<x^2> as sigma2 in weight adjustment

* iq1s_blocks16: scalar and AVX2 dot products

* iq1s_blocks16: CUDA dot product

* iq1s_blocks16: Metal works, Neon does not

Metal works but TG is dog slow (35 t/s). PP is OKish (493 t/s).
Not seeing the bug in the Neon implementation for now.

* iq1s_blocks16: fixed Neon

* iq1s_blocks16: very slightly faster TG on Metal

Still pathetic at 37 t/s

* iq1s_blocks16: speedup Metal by packing codebook into uint32_t's

* Formatting

* iq1s_blocks16: uint32_t codebook is also better in CUDA

TG-128 is now 204 t/s up from 194 t/s.
PP-512 is 5890 t/s, so significantly better than other quants

* iq1s_blocks16: slightly faster Neon dot product

* iq1s_blocks16: faster AVX2 dot product

* iq1s_blocks16: adjust to ggml-common.h

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-03-11 07:51:49 +01:00
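
One of the speedups above comes from packing each codebook (grid) entry into a single uint32_t so a whole grid point is one 32-bit load. The sketch below assumes a 4-bits-per-value layout for an 8-value grid point; the actual encoding landed in this commit may differ:

```c
// Hypothetical packing of an 8-value ternary grid point into one uint32_t,
// 4 bits per value, biased so -1,0,+1 map to 0,1,2. Layout is assumed for
// illustration, not taken from the commit.
#include <stdint.h>
#include <stdio.h>

static uint32_t pack_grid_point(const int8_t v[8]) {
    uint32_t packed = 0;
    for (int i = 0; i < 8; ++i) {
        packed |= (uint32_t)(v[i] + 1) << (4 * i); // nibble per value
    }
    return packed;
}

static void unpack_grid_point(uint32_t packed, int8_t v[8]) {
    for (int i = 0; i < 8; ++i) {
        v[i] = (int8_t)((packed >> (4 * i)) & 0xF) - 1;
    }
}

int main(void) {
    const int8_t point[8] = { -1, 0, 1, 1, 0, -1, 0, 1 };
    uint32_t w = pack_grid_point(point);
    int8_t out[8];
    unpack_grid_point(w, out);
    printf("packed = 0x%08X, first value back = %d\n", w, out[0]);
    return 0;
}
```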
Abhilash Majumder
ef3ced26a3
[SYCL] Add q3_s and q1_s (#5886)
* Add q3_s and q1_s

* fix compilation

* fix build

* fix build

* fix build

* enable ops

* rm macro

* increase grid space
2024-03-11 10:27:56 +05:30
AidanBeltonS
3814a07392
[SYCL] Add support for SYCL Nvidia target (#5738)
* Add support for nvidia target in CMake

* Update sycl read-me for Nvidia target

* Fix errors
2024-03-11 09:13:57 +08:00
Georgi Gerganov
bb6d00bbf9 metal : move mm_id indices to shared mem (#5982) 2024-03-10 23:12:48 +02:00
Dean
7ab7b733bb
android : fix utf8 decoding error (#5935)
* examples: fix utf8 decoding error

some models have a tokenizer that decodes an id into an incomplete UTF-8 sequence, so we need to validate and wait for the next token
one example would be: https://huggingface.co/Qwen/Qwen1.5-1.8B-Chat-GGUF/resolve/main/qwen1_5-1_8b-chat-q4_0.gguf and an example of such a token is 18137

* android : minor

---------

Co-authored-by: zhangfuwen <zhangfuwen@foxmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-10 22:03:17 +02:00
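
The fix described above amounts to checking whether the decoded bytes end in an incomplete UTF-8 sequence and, if so, holding them back until the next token arrives. A minimal standalone sketch of that check (not the example app's actual code):

```c
// Returns how many trailing bytes form an incomplete UTF-8 sequence
// (0 if the buffer ends on a complete code point). Sketch only.
#include <stddef.h>
#include <stdint.h>

static size_t utf8_incomplete_tail(const uint8_t * buf, size_t len) {
    if (len == 0) return 0;
    // Walk back over up to 3 continuation bytes (10xxxxxx).
    size_t i = len, cont = 0;
    while (i > 0 && cont < 3 && (buf[i - 1] & 0xC0) == 0x80) {
        --i; ++cont;
    }
    if (i == 0) return cont;          // nothing but continuation bytes
    const uint8_t lead = buf[i - 1];
    size_t need;
    if      ((lead & 0x80) == 0x00) need = 1;  // ASCII
    else if ((lead & 0xE0) == 0xC0) need = 2;
    else if ((lead & 0xF0) == 0xE0) need = 3;
    else if ((lead & 0xF8) == 0xF0) need = 4;
    else return 0;                    // invalid lead byte: do not hold back
    const size_t have = cont + 1;
    return have < need ? have : 0;    // incomplete: buffer these bytes
}
```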
Georgi Gerganov
d9f65c97c3 readme : update hot topics 2024-03-10 20:58:26 +02:00
Georgi Gerganov
b838b53ad6 sync : ggml 2024-03-10 20:10:46 +02:00