Commit graph

919 commits

Johannes Gäßler
924dd22fd3
Quantized dot products for CUDA mul mat vec (#2067) 2023-07-05 14:19:42 +02:00
Howard Su
051c70dcd5
llama: Don't double count the sampling time (#2107) 2023-07-05 18:31:23 +08:00
Johannes Gäßler
9e4475f5cf
Fixed OpenCL offloading prints (#2082) 2023-07-05 08:58:05 +02:00
Nigel Bosch
7f0e9a775e
embd-input: Fix input embedding example unsigned int seed (#2105) 2023-07-05 07:33:33 +08:00
0cc4m
80b17e2f66 Fix trailing whitespace in vk_mem_alloc.h 2023-07-04 23:01:32 +02:00
0cc4m
e35d28fec3 Fix queue selection for AMD RADV 2023-07-04 22:57:08 +02:00
0cc4m
ae7325fdff Fix 2d write 2023-07-04 22:42:07 +02:00
0cc4m
ade9555c48 Add 2d write operation, profiling code 2023-07-04 22:31:47 +02:00
Georgi Gerganov
b472f3fca5
readme : add link web chat PR 2023-07-04 22:25:22 +03:00
Georgi Gerganov
ed9a54e512
ggml : sync latest (new ops, macros, refactoring) (#2106)
- add ggml_argmax()
- add ggml_tanh()
- add ggml_elu()
- refactor ggml_conv_1d() and variants
- refactor ggml_conv_2d() and variants
- add helper macros to reduce code duplication in ggml.c
2023-07-04 21:54:11 +03:00
jwj7140
f257fd2550
Add an API example using server.cpp similar to OAI. (#2009)
* add api_like_OAI.py
* add evaluated token count to server
* add /v1/ endpoints binding
2023-07-04 21:06:12 +03:00
Tobias Lütke
7ee76e45af
Simple webchat for server (#1998)
* expose simple web interface on root domain

* embed index and add --path for choosing static dir

* allow server to multithread

Because web browsers send a lot of garbage requests, we want the server
to multithread when serving 404s for favicons etc. To avoid blowing up
llama we just take a mutex when it's invoked (a minimal sketch of this
pattern follows this entry).


* let's try this with the xxd tool instead and see if msvc is happier with that

* enable server in Makefiles

* add /completion.js file to make it easy to use the server from js

* slightly nicer css

* rework state management into session, expose historyTemplate to settings

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-04 16:05:27 +02:00
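To illustrate the multithreading note in the entry above: worker threads can serve cheap requests (404s, favicons, static files) in parallel, while a single mutex guards the llama invocation so inference never runs concurrently. The following C++ sketch shows that pattern with hypothetical names; it is not the actual server.cpp code.

```cpp
#include <iostream>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for the expensive llama.cpp inference call.
static std::string run_completion(const std::string & prompt) {
    return "completion for: " + prompt;
}

static std::mutex g_llama_mutex;  // serializes access to the single llama context

// Called concurrently by the HTTP server's worker threads.
static std::string handle_completion(const std::string & prompt) {
    // Cheap requests never reach this point and stay parallel; only inference
    // takes the lock, so llama is never invoked from two threads at once.
    std::lock_guard<std::mutex> lock(g_llama_mutex);
    return run_completion(prompt);
}

int main() {
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i) {
        workers.emplace_back([i] {
            std::cout << handle_completion("request " + std::to_string(i)) << "\n";
        });
    }
    for (auto & t : workers) t.join();
}
```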
Henri Vasserman
3d7d8d00a4
add cmake commands 2023-07-04 17:02:22 +03:00
Henri Vasserman
acc111caf9
Allow old Make to build server. (#2098)
Also make server build by default.

Tested with Make 3.82
2023-07-04 15:38:04 +03:00
ZhouYuChen
23c7c6fc91
Update Makefile: clean simple (#2097) 2023-07-04 14:15:16 +02:00
Erik Scholz
698efad5fb
CI: make the brew update temporarily optional. (#2092)
Until they decide to fix the brew installation in the macOS runners;
see the open issues, e.g. https://github.com/actions/runner-images/pull/7710
2023-07-04 01:50:12 +02:00
Govlzkoy
14a2cc71f6
[ggml] fix index for ne03 value in ggml_cl_mul_f32 (#2088) 2023-07-04 07:50:00 +08:00
Henri Vasserman
1cf14ccef1
fix server crashes (#2076) 2023-07-04 00:05:23 +03:00
Howard Su
cc45a7feb8
Fix crash of test-tokenizer-0 under Debug build (#2064)
* Fix crash of test-tokenizer-0 under Debug build

* Change per comment
2023-07-03 20:43:55 +02:00
Howard Su
55dbb915cc
[llama] No need to check file version when loading vocab score (#2079) 2023-07-03 19:58:58 +08:00
WangHaoranRobin
d7d2e6a0f0
server: add option to output probabilities for completion (#1962)
* server: add option to output probabilities for completion
* server: fix issue when handling probability output for incomplete tokens for multibyte character generation
* server: fix llama_sample_top_k order
* examples/common.h: put all bool variables in gpt_params together
2023-07-03 00:38:44 +03:00
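The probability output mentioned in the entry above amounts to attaching, for each generated token, the normalized probabilities of the most likely candidates. The sketch below shows the generic softmax-plus-top-k computation behind such an option; it is an illustration only, not the code added in #1962.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>

// Convert raw logits to probabilities and keep the k most likely token ids,
// the kind of per-token information a probability-output option can report.
static std::vector<std::pair<int, float>> top_k_probs(const std::vector<float> & logits, int k) {
    // softmax with max-subtraction for numerical stability
    float max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<std::pair<int, float>> probs(logits.size());
    double sum = 0.0;
    for (size_t i = 0; i < logits.size(); ++i) {
        float p = std::exp(logits[i] - max_logit);
        probs[i] = {(int) i, p};
        sum += p;
    }
    for (auto & pr : probs) pr.second = (float) (pr.second / sum);
    std::partial_sort(probs.begin(), probs.begin() + k, probs.end(),
                      [](const auto & a, const auto & b) { return a.second > b.second; });
    probs.resize(k);
    return probs;
}

int main() {
    std::vector<float> logits = {1.0f, 3.5f, 0.2f, 2.8f, -1.0f};
    for (const auto & [id, p] : top_k_probs(logits, 3)) {
        std::printf("token %d: %.3f\n", id, p);
    }
}
```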
0cc4m
24eeb97d13 Add bounds checking to matmul kernels, improve implementation, fix command buffers not freed properly 2023-07-02 22:11:58 +02:00
Georgi Gerganov
46088f7231 ggml : fix build with OpenBLAS (close #2066) 2023-07-02 09:46:46 +03:00
Johannes Gäßler
0bc2cdfc87
Better CUDA synchronization logic (#2057) 2023-07-01 21:49:44 +02:00
Johannes Gäßler
befb3a3562
Test-based VRAM scratch size + context adjustment (#2056) 2023-07-01 21:47:26 +02:00
Daniel Drake
b213227067
cmake : don't force -mcpu=native on aarch64 (#2063)
It's currently not possible to cross-compile llama.cpp for aarch64
because CMakeLists.txt forces -mcpu=native for that target.

-mcpu=native doesn't make sense if your build host is not the
target architecture, and clang rejects it for that reason, aborting the
build. This can be easily reproduced using the current Android NDK to build
for aarch64 on an x86_64 host.

If there is not a specific CPU-tuning target for aarch64, then -mcpu
should be omitted completely. I think that makes sense; there is not
enough variance in the aarch64 instruction set to warrant a fixed -mcpu
optimization at this point. And if someone is building natively and wishes
to enable any possible optimizations for the host device, there is
already the LLAMA_NATIVE option available.

Fixes #495.
2023-07-01 21:31:44 +03:00
Aaron Miller
2f8cd979ec
metal : release buffers when freeing metal context (#2062) 2023-07-01 21:14:59 +03:00
Judd
471aab6e4c
convert : add support of baichuan-7b (#2055)
Co-authored-by: Judd <foldl@boxvest.com>
2023-07-01 20:00:25 +03:00
Georgi Gerganov
463f2f4c4f
llama : fix return value of llama_load_session_file_internal (#2022) 2023-07-01 19:05:09 +03:00
Rand Xie
cb44dbc7de
llama : catch llama_load_session_file_internal exceptions (#2022)
* convert checks in llama_load_session_file to throw and handle them

* make llama_load_session_file_internal static

* address feedbacks to avoid using exceptions
2023-07-01 19:02:58 +03:00
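The bullets in the entry above describe converting the loader's error checks into throws and then handling them so the function still reports a plain status to its callers. A minimal, self-contained C++ sketch of that wrapper pattern follows; the names are illustrative, not the actual llama.cpp implementation.

```cpp
#include <cstdio>
#include <stdexcept>
#include <string>

// Internal loader: reports problems by throwing.
static bool load_session_file_internal(const std::string & path) {
    std::FILE * f = std::fopen(path.c_str(), "rb");
    if (!f) {
        throw std::runtime_error("failed to open session file: " + path);
    }
    // ... read and validate header, tokens and state here ...
    std::fclose(f);
    return true;
}

// Public entry point: converts any exception into a boolean so callers of a
// C-style API never see C++ exceptions.
bool load_session_file(const std::string & path) {
    try {
        return load_session_file_internal(path);
    } catch (const std::exception & err) {
        std::fprintf(stderr, "error loading session file: %s\n", err.what());
        return false;
    }
}

int main() {
    // A missing file triggers the throw/catch path and returns false.
    std::printf("loaded: %s\n", load_session_file("missing.bin") ? "yes" : "no");
}
```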
Georgi Gerganov
79f634a19d
embd-input : fix returning ptr to temporary 2023-07-01 18:46:00 +03:00
Georgi Gerganov
04606a1599
train : fix compile warning 2023-07-01 18:45:44 +03:00
Qingyou Meng
b1ca8f36a9
ggml : disable GGML_TASK_INIT and GGML_TASK_FINALIZE by default (#1995)
Will not be scheduled unless explicitly enabled.
2023-07-01 18:42:43 +03:00
0cc4m
36cd5d85e9 Avoid requesting dedicated memory, VMA can decide that by itself 2023-06-30 21:20:19 +02:00
0cc4m
4ea9b2fd4b Add VMA library 2023-06-30 21:15:06 +02:00
0cc4m
c8ff09bdc7 dequant_q4_0 kernel 2023-06-30 20:48:42 +02:00
0cc4m
cb5cb4d6e2 Fix f16_to_f32 kernel 2023-06-30 20:48:03 +02:00
0cc4m
df3cdbdac7 Output FP32 in fp16 matmul shader 2023-06-30 18:37:10 +02:00
0cc4m
40c8f843f2 Fix mulmat_f16 2023-06-30 18:37:10 +02:00
0cc4m
c31e14b2fd Enable device extensions properly, restore fp16 matmul op 2023-06-30 18:37:10 +02:00
0cc4m
fc5bb53b32 Code abstraction, FP16 implementation, fix kernel, add FP16 to FP32 kernel 2023-06-30 18:37:10 +02:00
0cc4m
3adc7b1d60 First FP16 attempt, disabled for now 2023-06-30 18:37:10 +02:00
0cc4m
2c70df985a Continue vulkan implementation and optimization 2023-06-30 18:36:42 +02:00
0cc4m
0c9cca00bd Write coalescing 2023-06-30 18:36:42 +02:00
0cc4m
7c6860b483 2D Blocktiling 2023-06-30 18:36:42 +02:00
0cc4m
1b4863c2b9 1D Blocktiling 2023-06-30 18:36:42 +02:00
0cc4m
baf9ff536b GEMM Kernel optimization 2023-06-30 18:36:42 +02:00
0cc4m
a42376e7ec First matmul success 2023-06-30 18:36:42 +02:00
0cc4m
8ce84c2747 Continue implementation 2023-06-30 18:36:42 +02:00
0cc4m
2471728a9d Add aligned malloc and free for VMA 2023-06-30 18:36:42 +02:00