Commit graph

746 commits

Tom Seneviratne
aaf3f2476d
Update Makefile - minor spelling error 2023-06-19 14:46:22 +10:00
John
f0165a5f18 Merge branch 'master' into cuda-integration 2023-06-19 05:31:46 +02:00
John
932f7f663a Merge branch 'master' of https://github.com/cmp-nct/ggllm.cpp 2023-06-19 05:30:41 +02:00
John
7c8249ff6b cuda malloc:
- added functionality to find the smallest fitting buffer instead of the first found buffer that is >= the requested size
-- this prevents two buffer allocations in sequence from taking a huge buffer for a small tensor and then requiring a new buffer for the 2nd tensor
-- in my tests it saved 1 GB of VRAM that is now free for more offloading

cuda free buffers:
- added a helper function that frees all unused buffers from a device to prevent huge F32 buffers from cuBLAS occupying VRAM needlessly after token ingestion

libfalcon:
- corrected vram_overhead calculation to account for the actual non-weight buffers needed during inference
- added vram_overhead for n_batch > 1, as this switches ingestion into a 32-bit dequantization mode for cuBLAS which needs almost 2 GB of VRAM buffers
- corrected the automated layer distribution to fill VRAM as much as possible with layers

From here on it's recommended to use --ngl 100 and -b 1 for CUDA processing.
In addition, setting -t to 1 or to one less than the number of CPU cores is recommended (depends on the CPU and GPU used)
2023-06-19 05:03:28 +02:00
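
For illustration, here is a minimal C++/CUDA sketch of the best-fit pool selection and the free-all helper described in this commit. The struct, pool, and function names (cuda_buffer, g_cuda_buffer_pool, cuda_pool_malloc, cuda_pool_free_all_buffers, MAX_CUDA_BUFFERS) are illustrative assumptions, not necessarily the identifiers used in ggml-cuda.cu.

```cpp
// Minimal sketch (assumed names, not the actual ggml-cuda.cu identifiers).
#include <cstddef>
#include <cstdint>
#include <cuda_runtime.h>

struct cuda_buffer {
    void * ptr  = nullptr;
    size_t size = 0;
};

static const int   MAX_CUDA_BUFFERS = 256;
static cuda_buffer g_cuda_buffer_pool[MAX_CUDA_BUFFERS];

static void * cuda_pool_malloc(size_t size, size_t * actual_size) {
    int    best_i    = -1;
    size_t best_size = SIZE_MAX;

    // Best fit: pick the smallest pooled buffer that still satisfies the
    // request instead of the first one found, so a small tensor cannot
    // claim a huge buffer and force a fresh allocation for the next tensor.
    for (int i = 0; i < MAX_CUDA_BUFFERS; ++i) {
        cuda_buffer & b = g_cuda_buffer_pool[i];
        if (b.ptr != nullptr && b.size >= size && b.size < best_size) {
            best_i    = i;
            best_size = b.size;
        }
    }
    if (best_i >= 0) {
        cuda_buffer & b = g_cuda_buffer_pool[best_i];
        void * ptr   = b.ptr;
        *actual_size = b.size;
        b.ptr  = nullptr;
        b.size = 0;
        return ptr;
    }

    // Nothing in the pool fits: allocate a new buffer.
    void * ptr = nullptr;
    cudaMalloc(&ptr, size);
    *actual_size = size;
    return ptr;
}

// Free every pooled (currently unused) buffer of the active device, so the
// large F32 buffers left over from cuBLAS ingestion stop occupying VRAM.
static void cuda_pool_free_all_buffers(void) {
    for (int i = 0; i < MAX_CUDA_BUFFERS; ++i) {
        cuda_buffer & b = g_cuda_buffer_pool[i];
        if (b.ptr != nullptr) {
            cudaFree(b.ptr);
            b.ptr  = nullptr;
            b.size = 0;
        }
    }
}
```
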
John
ec253a67bc
Merge pull request #8 from alepar/patch-1
Fixes typo
2023-06-18 22:46:26 +02:00
Alexey Parfenov
3984d36542
Fixes typo 2023-06-18 11:10:24 -07:00
John
72f358150c
Update README.md 2023-06-18 16:37:21 +02:00
John
b4028edb9a a debug line slipped in 2023-06-18 14:26:35 +02:00
John
80f654631e
Update README.md 2023-06-18 05:57:19 +02:00
John
76b41830b9 Added CUDA integration from JohannesGaessler's git
- disabled offload of non-layer tensors for now (not working yet)
- corrected tensor size calculation for vram
- added some more detailed vram reporting

- added an automated skip of tensors that would not fit in VRAM (significant slowdown if --ngl is too high, probably from temporary cuda buffer copies)
- added vram_overhead and vram_reserved - those are not pretty but currently needed to get the vram usage right
- moved vram scratch buffer allocation a bit up so the usage is available for the skip
2023-06-18 05:46:12 +02:00
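
As a rough illustration of the automated skip described in this commit, the following C++ sketch only offloads a tensor while it still fits into the remaining VRAM budget. The function and field names are assumptions for the example, not the actual libfalcon code.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Illustrative only: decide per tensor whether it still fits into VRAM.
struct tensor_info {
    const char * name;
    size_t       size;   // bytes the tensor would occupy in VRAM
    bool         on_gpu;
};

static size_t plan_offload(std::vector<tensor_info> & tensors,
                           size_t vram_total, size_t vram_reserved, size_t vram_overhead) {
    size_t vram_used = vram_reserved + vram_overhead;  // non-weight buffers counted up front
    for (auto & t : tensors) {
        if (vram_used + t.size > vram_total) {
            // Would not fit: skip GPU offload for this tensor (it stays on the CPU).
            fprintf(stderr, "skipping %s (%.1f MB): VRAM budget exhausted\n",
                    t.name, t.size / 1024.0 / 1024.0);
            t.on_gpu = false;
            continue;
        }
        t.on_gpu   = true;   // the real code would set the GPU backend and upload here
        vram_used += t.size;
    }
    return vram_used;
}
```

Counting vram_overhead and vram_reserved up front is what keeps the skip conservative enough that cuBLAS and scratch buffers still fit after the weights are placed.
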
John
5ecd645bce minor verbose messages 2023-06-18 02:10:26 +02:00
JohannesGaessler
a8bb0fe358 WIP full GPU acceleration 2023-06-18 00:11:20 +02:00
JohannesGaessler
8f81cab1bc WIP full GPU acceleration 2023-06-17 23:08:49 +02:00
JohannesGaessler
0c916d2357 Offload weights 2023-06-17 23:08:49 +02:00
John
f75125615a Update README.md 2023-06-17 23:08:49 +02:00
John
2797754843 Update README.md 2023-06-17 23:08:49 +02:00
John
f9118b0ca5 Update README.md 2023-06-17 23:08:49 +02:00
John
6ae8567a30 Update README.md 2023-06-17 23:08:49 +02:00
John
9d4d26554a Update README.md 2023-06-17 23:08:49 +02:00
John
d0c460629d Update README.md 2023-06-17 23:08:47 +02:00
John
ab509ad9e2 added the tensor size calculation routines 2023-06-17 23:06:21 +02:00
Jan Ploski
ea70881941 Made the --memory-f32 option enabled by default since ggml_repeat2 currently only has an F32 implementation. Improved memory allocation for ctx and kv memory to be accurate. Moved model.memory_k, model.memory_v to kv_self.k, kv_self.v and moved the initialization into kv_cache_init (to be more like llama.cpp). 2023-06-17 23:06:21 +02:00
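
For orientation, a hedged sketch of what a llama.cpp-style kv_cache_init looks like. The sizing below follows llama.cpp (n_embd * n_layer * n_ctx elements per tensor) and the names are simplified, so the actual libfalcon code, which also has to account for Falcon's attention layout, may differ.

```cpp
#include "ggml.h"
#include <cstdint>

// Simplified kv cache holding the k/v tensors in their own ggml context,
// roughly mirroring the kv_self.k / kv_self.v layout mentioned in the commit.
struct falcon_kv_cache {
    struct ggml_context * ctx = nullptr;
    struct ggml_tensor  * k   = nullptr;
    struct ggml_tensor  * v   = nullptr;
};

static bool kv_cache_init_sketch(falcon_kv_cache & cache, enum ggml_type wtype,
                                 int n_embd, int n_layer, int n_ctx) {
    const int64_t n_elements = (int64_t) n_embd * n_layer * n_ctx;

    // With --memory-f32 enabled by default, wtype is GGML_TYPE_F32 (4 bytes/element).
    struct ggml_init_params params = {};
    params.mem_size   = 2u * n_elements * ggml_type_size(wtype) + 2u * 1024u * 1024u;
    params.mem_buffer = nullptr;   // let ggml allocate the backing buffer
    params.no_alloc   = false;

    cache.ctx = ggml_init(params);
    if (cache.ctx == nullptr) {
        return false;
    }

    cache.k = ggml_new_tensor_1d(cache.ctx, wtype, n_elements);
    cache.v = ggml_new_tensor_1d(cache.ctx, wtype, n_elements);
    return true;
}
```
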
Jan Ploski
c3e9c88d71 Fixed segfault during context swap introduced by commit 3d6ed185 2023-06-17 23:06:21 +02:00
Jan Ploski
5ec0d12652 Correction to 4a37251a - since we did not insert the bos token, we do not need to attempt to rescue it during context swap 2023-06-17 23:06:21 +02:00
Jan Ploski
db0083f7b7 Fixed the bos/eos token (both are 11 according to the config.json of Falcon-7B/40B). Also: do not auto-insert a space or (b|e)os at the beginning of the prompt (this seems to be LLaMA-specific). 2023-06-17 23:06:21 +02:00
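
An illustrative sketch of the behavior this commit describes (the helper name and tokenizer call are assumptions): token id 11 serves as both bos and eos for Falcon, and the prompt is tokenized as-is, without the LLaMA-style leading space or automatic bos insertion.

```cpp
#include <string>
#include <vector>

// Per the commit: Falcon-7B/40B use token id 11 for both bos and eos.
static const int FALCON_BOS_TOKEN_ID = 11;
static const int FALCON_EOS_TOKEN_ID = 11;

// Hypothetical helper: tokenize the prompt without prepending a space or bos.
std::vector<int> falcon_tokenize_prompt(const std::string & prompt, bool add_bos = false) {
    std::vector<int> tokens;
    if (add_bos) {
        tokens.push_back(FALCON_BOS_TOKEN_ID);  // no longer inserted automatically
    }
    // The LLaMA-specific `" " + prompt` rewrite is intentionally absent here;
    // the text is passed to the BPE tokenizer unchanged.
    // ... run the tokenizer on `prompt` and append the resulting ids ...
    (void) prompt;
    (void) FALCON_EOS_TOKEN_ID;  // eos is checked by the generation loop, not here
    return tokens;
}
```
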
John
ed4ad057b2 Went back to the original size calculation for now.
Though it appears not to matter.
2023-06-17 23:06:21 +02:00
John
fee7da163b Work in progress.
Added falcon main and library based on llama.cpp
CPU inference works (getting ~260ms/token on 7B 16 bit falcon)
Tested with 7B 16-bit and the two Shakespeare models (both in 16-bit precision only)

TODO/WIP:
1) quantization runs and creates a ggjt 3 file, but something is wrong with the quantized model binary
- even quantization from 16 -> 16 fails; something is wrong in the tensors produced
2) mmap should work with quantized binaries once 1) is solved
3) CUDA support is mostly there; it's currently disabled (all CPU backend)
4) memory/context calculations are off, and the GPU memory calculations are wrong as well
5) the python conversion script is a pre-GGML-1 version (tokens without scores)
6) some stuff is still called "llama"; some of it should be renamed to a generic name as it works for both
7) the GGML produced by the current python script uses an old ftype method

Makefiles:
cmake on Windows with build tools works
the makefile for linux/msys was adjusted blind but not tested yet - possibly missed something

Changes to the codebase:
* repeat2 has been added to ggml (jploski - https://github.com/ggerganov/ggml/pull/231) including the backward variant (untested, probably fails)
* minor changes to work with falcon (name length)
* libfalcon is the previous "llama.cpp" and falcon_main is the previous main.cpp
2023-06-17 23:06:13 +02:00
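
For orientation, a toy example of calling the new op mentioned in this commit. It assumes ggml_repeat2 mirrors ggml_repeat's signature (broadcast tensor `a` to the shape of `b`, with a different interleaving of the copies), which is an assumption based on the linked PR rather than a verified detail, and it uses the graph API as it existed in ggml at the time.

```cpp
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = {};
    params.mem_size = 16u * 1024u * 1024u;   // 16 MiB scratch for this toy graph

    struct ggml_context * ctx = ggml_init(params);

    // a: a single 64-wide row; b: the target shape (8 rows of 64).
    struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 1);
    struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 8);
    ggml_set_f32(a, 1.0f);
    ggml_set_f32(b, 0.0f);

    // Assumed to mirror ggml_repeat: broadcast a to b's shape, but with the
    // interleaving pattern Falcon's attention needs (see ggml PR #231).
    struct ggml_tensor * r = ggml_repeat2(ctx, a, b);

    struct ggml_cgraph gf = ggml_build_forward(r);
    ggml_graph_compute(ctx, &gf);

    ggml_free(ctx);
    return 0;
}
```
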
John
cbb31807a3
Update README.md 2023-06-17 21:34:24 +02:00
Georgi Gerganov
b2416493ab
make : do not print help for simple example 2023-06-17 20:55:03 +03:00
Georgi Gerganov
4f9c43e3bd
minor : warning fixes 2023-06-17 20:24:11 +03:00
Johannes Gäßler
2c9380dd2f
Only one CUDA stream per device for async compute (#1898) 2023-06-17 19:15:02 +02:00
John
f89c7592eb
Update README.md 2023-06-17 18:57:40 +02:00
Georgi Gerganov
051e1b0e6a
llama : fix kv_cache n init (close #1903) 2023-06-17 19:31:20 +03:00
DaniAndTheWeb
86c7571864
make : update for latest Arch (#1701)
With the upcoming change to the openblas package in Arch, the Makefile workaround is no longer needed.
2023-06-17 19:17:22 +03:00
Howard Su
3d59ec5935
ggml : fix warnings under MSVC (#1908) 2023-06-17 18:46:15 +03:00
John
c72bc02695
Update README.md 2023-06-17 16:51:34 +02:00
John
6e137abe56
Update README.md 2023-06-17 16:42:23 +02:00
John
abc77a7496 Merge branch 'master' of https://github.com/cmp-nct/ggllm.cpp 2023-06-17 16:41:08 +02:00
John
588ca709fb added the tensor size calculation routines 2023-06-17 16:40:57 +02:00
Aaron Miller
0711a5f6dc
metal : add norm, cpy f16->f16, alibi kernels (#1823) 2023-06-17 17:37:49 +03:00
John
7c5f607287
Update README.md 2023-06-17 16:23:40 +02:00
John
d4b9423560
Update README.md 2023-06-17 16:23:01 +02:00
John
0ed97e529f
Update README.md 2023-06-17 16:20:02 +02:00
John
dd3d346f7a Merge branch 'master' of https://github.com/cmp-nct/ggllm.cpp 2023-06-17 14:39:28 +02:00
Faez Shakil
fc45a81bc6
exposed modules so that they can be invoked by nix run github:ggerganov/llama.cpp#server etc (#1863) 2023-06-17 14:13:05 +02:00
Randall Fitzgerald
794db3e7b9
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.

Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.

This took a lot of effort; there are 24 PRs closed in the submitter's repo alone, over 160 commits, and a lot of comments and testing.

Summary of the changes:

- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on the EOS token and at the specified token count with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stop-word handling when matching multiple tokens and while streaming, or when generation finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplifies the logic and removes a lot of variables

---------

Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 14:53:04 +03:00
Jiří Podivín
5ddf7ea1fb
hooks : setting up flake8 and pre-commit hooks (#1681)
Small, non-functional changes were made to non-compliant files.
These include breaking up long lines, whitespace sanitization and
unused import removal.

Maximum line length in Python files was set to a generous 125 chars,
in order to minimize the number of changes needed in scripts and general
annoyance. The "txt" prompts directory is excluded from the checks
as it may contain oddly formatted files and strings for a good reason.

Signed-off-by: Jiri Podivin <jpodivin@gmail.com>
2023-06-17 13:32:48 +03:00
Gustavo Rocha Dias
bac19927c3
readme : alternative way to build for Android with CLBlast. (#1828) 2023-06-17 12:01:06 +03:00
Kerfuffle
b4c6f46f17
Allow cmake to build ggml as a library (#1896)
* Allow cmake to build ggml as a library

* A ggml_static library will be created

* When BUILD_SHARED_LIBS is enabled, ggml_shared will also be built
2023-06-17 01:49:42 -06:00
David Yang
92f20d9942
train : get raw text instead of page with html (#1905)
We probably want to train using just the text of Shakespeare instead of the HTML of the page displaying his work.
2023-06-17 09:51:54 +03:00