Commit graph

720 commits

Author SHA1 Message Date
John
b4028edb9a a debug line slipped in 2023-06-18 14:26:35 +02:00
John
76b41830b9 Added CUDA integration from JohannesGaessler's git
- disabled offload of non-layer tensors for now (not working yet)
- corrected tensor size calculation for VRAM
- added some more detailed VRAM reporting

- added an automated skip of tensors that would not fit in VRAM (significant slowdown if --ngl is too high, probably from temporary CUDA buffer copies); see the sketch after this entry
- added vram_overhead and vram_reserved - those are not pretty but currently needed to get the VRAM usage right
- moved the VRAM scratch buffer allocation up a bit so the usage is available for the skip
2023-06-18 05:46:12 +02:00
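A minimal sketch, in C++, of the VRAM-fit skip described in this entry. The names vram_total, vram_overhead, vram_reserved and vram_used are taken from the message, but the helper itself is hypothetical, not the actual implementation:

```cpp
#include "ggml.h"

// Hypothetical helper: decide whether a tensor still fits into the VRAM budget
// after subtracting overhead and reserved space; if not, keep it on the CPU.
static bool try_offload_tensor(const struct ggml_tensor * t,
                               size_t   vram_total,
                               size_t   vram_overhead,
                               size_t   vram_reserved,
                               size_t & vram_used) {
    const size_t t_size = ggml_nbytes(t);                        // tensor size in bytes
    const size_t budget = vram_total - vram_overhead - vram_reserved;
    if (vram_used + t_size > budget) {
        return false;                                            // skip: would not fit
    }
    vram_used += t_size;                                         // account for the offload
    return true;
}
```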
JohannesGaessler
a8bb0fe358 WIP full GPU acceleration 2023-06-18 00:11:20 +02:00
JohannesGaessler
8f81cab1bc WIP full GPU acceleration 2023-06-17 23:08:49 +02:00
JohannesGaessler
0c916d2357 Offload weights 2023-06-17 23:08:49 +02:00
John
f75125615a Update README.md 2023-06-17 23:08:49 +02:00
John
2797754843 Update README.md 2023-06-17 23:08:49 +02:00
John
f9118b0ca5 Update README.md 2023-06-17 23:08:49 +02:00
John
6ae8567a30 Update README.md 2023-06-17 23:08:49 +02:00
John
9d4d26554a Update README.md 2023-06-17 23:08:49 +02:00
John
d0c460629d Update README.md 2023-06-17 23:08:47 +02:00
John
ab509ad9e2 added the tensor size calculation routines 2023-06-17 23:06:21 +02:00
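For illustration, a tensor's byte size can be derived from the public ggml helpers roughly as below; this is a sketch of the idea, not necessarily the routines added in this commit:

```cpp
#include "ggml.h"

// Illustrative only: element count times the (possibly block-quantized) type
// size gives the tensor's byte size; ggml_nbytes() computes much the same.
static size_t tensor_size_bytes(const struct ggml_tensor * t) {
    const int64_t n = ggml_nelements(t);
    return (size_t) n * ggml_type_size(t->type) / ggml_blck_size(t->type);
}
```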
Jan Ploski
ea70881941 Made option --memory-f32 enabled by default since ggml_repeat2 currently only has an F32 implementation. Improved memory allocation for ctx and kv memory to be accurate. Moved model.memory_k, model.memory_v to kv_self.k, kv_self.v and the initialization into kv_cache_init (to be more like llama.cpp). 2023-06-17 23:06:21 +02:00
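For context, the KV memory that kv_cache_init sizes is conceptually two tensors of n_layer * n_ctx * n_embd elements each, and --memory-f32 doubles the per-element cost from 2 to 4 bytes. A back-of-the-envelope sketch with assumed, Falcon-7B-like numbers:

```cpp
#include <cstddef>
#include <cstdint>

// Rough KV-cache size estimate: one K and one V value per layer, per context
// position, per embedding dimension; elem_size is 4 for F32, 2 for F16.
static size_t kv_cache_size_bytes(int64_t n_layer, int64_t n_ctx, int64_t n_embd, size_t elem_size) {
    return 2 /* K and V */ * (size_t)(n_layer * n_ctx * n_embd) * elem_size;
}
// Assumed example: n_layer=32, n_ctx=2048, n_embd=4544 gives ~2.2 GiB at F32,
// roughly half that at F16.
```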
Jan Ploski
c3e9c88d71 Fixed segfault during context swap introduced by commit 3d6ed185 2023-06-17 23:06:21 +02:00
Jan Ploski
5ec0d12652 Correction to 4a37251a - since we did not insert the bos token, we do not need to attempt to rescue it during context swap 2023-06-17 23:06:21 +02:00
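The context swap mentioned in these two entries follows the usual llama.cpp pattern: once n_ctx is reached, keep the first n_keep tokens and carry over roughly the last half of the rest for re-evaluation. A simplified sketch (assumed structure, not the exact code in this repo):

```cpp
#include <cstdint>
#include <vector>

// Simplified context swap: when the next batch would overflow n_ctx, keep the
// first n_keep tokens, drop the oldest middle part, and re-evaluate the rest.
// Since no bos token was auto-inserted, there is no special token to rescue.
static void context_swap(std::vector<int32_t> & tokens, int & n_past,
                         int n_ctx, int n_keep, int n_batch) {
    if (n_past + n_batch <= n_ctx) {
        return;                               // still fits, nothing to do
    }
    const int n_left  = n_past - n_keep;
    const int n_carry = n_left / 2;           // tokens carried over for re-evaluation
    tokens.erase(tokens.begin() + n_keep, tokens.end() - n_carry);
    n_past = n_keep;                          // evaluation resumes after the kept prefix
}
```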
Jan Ploski
db0083f7b7 Fixed bos/eos token (both of which are 11 according to config.json of Falcon-7B/40B). Also: do not auto-insert a space or (b|e)os at the beginning of the prompt (this seems to be LLaMA-specific). 2023-06-17 23:06:21 +02:00
John
ed4ad057b2 Went back to the original size calculation for now.
Though it appears not to matter.
2023-06-17 23:06:21 +02:00
John
fee7da163b Work in progress.
Added falcon main and library based on llama.cpp
CPU inference works (getting ~260ms/token on 7B 16 bit falcon)
Tested with 7B 16-bit and the two Shakespeare models (both in 16-bit precision only)

TODO/WIP:
1) quantization runs, creates a ggjt 3 file but something is wrong with the quantized model binary
- even quantization from 16 -> 16 fails; something is wrong in the tensors produced
2) mmap should work with quantized binaries once 1) is solved
3) CUDA support is mostly there, it's currently disabled (all CPU backend)
4) memory/context calculations are off; GPU memory calculations are wrong as well
5) the python conversion script is a pre-GGML-1 version (tokens without scores)
6) some stuff is still called "llama", some of it should be renamed to a generic name as it works for both
7) the GGML produced by the current python uses an old ftype method

Makefiles:
cmake on windows with build tools works
the makefile for linux/msys was blindly adjusted but not tested yet - possibly missed something

Changes to the codebase:
* repeat2 has been added to ggml (jploski - https://github.com/ggerganov/ggml/pull/231) including the backward variant (untested, probably fails); see the sketch after this entry
* minor changes to work with falcon (name length)
* libfalcon is the previous "llama.cpp" and falcon_main is the previous main.cpp
2023-06-17 23:06:13 +02:00
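For reference, the ggml_repeat2 operator from the linked ggml PR is assumed here to mirror ggml_repeat's signature, broadcasting tensor a to the shape of b with the element interleaving Falcon's attention needs; a hedged usage sketch:

```cpp
#include "ggml.h"

// Assumption: ggml_repeat2(ctx, a, b) mirrors ggml_repeat and differs only in
// how the repeated elements are interleaved (needed when broadcasting Falcon's
// shared K/V head across query heads).
static struct ggml_tensor * broadcast_like(struct ggml_context * ctx,
                                           struct ggml_tensor  * a,      // tensor to repeat
                                           struct ggml_tensor  * b) {    // target shape
    return ggml_repeat2(ctx, a, b);
}
```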
Georgi Gerganov
b2416493ab
make : do not print help for simple example 2023-06-17 20:55:03 +03:00
Georgi Gerganov
4f9c43e3bd
minor : warning fixes 2023-06-17 20:24:11 +03:00
Johannes Gäßler
2c9380dd2f
Only one CUDA stream per device for async compute (#1898) 2023-06-17 19:15:02 +02:00
Georgi Gerganov
051e1b0e6a
llama : fix kv_cache n init (close #1903) 2023-06-17 19:31:20 +03:00
DaniAndTheWeb
86c7571864
make : update for latest Arch (#1701)
With the upcoming change to the openblas package in arch the Makefile workaround is no longer needed.
2023-06-17 19:17:22 +03:00
Howard Su
3d59ec5935
ggml : fix warnings under MSVC (#1908) 2023-06-17 18:46:15 +03:00
Aaron Miller
0711a5f6dc
metal : add norm, cpy f16->f16, alibi kernels (#1823) 2023-06-17 17:37:49 +03:00
Faez Shakil
fc45a81bc6
exposed modules so that they can be invoked by nix run github:ggerganov/llama.cpp#server etc (#1863) 2023-06-17 14:13:05 +02:00
Randall Fitzgerald
794db3e7b9
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.

Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.

This took a lot of effort; there are 24 PRs closed in the submitter's repo alone, over 160 commits, and a lot of comments and testing.

Summary of the changes:

- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on the EOS token and at the specified token count with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stop-word handling when matching multiple tokens, while streaming, or when generation finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplifies the logic and removes a lot of variables

---------

Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 14:53:04 +03:00
Jiří Podivín
5ddf7ea1fb
hooks : setting up flake8 and pre-commit hooks (#1681)
Small, non-functional changes were made to non-compliant files.
These include breaking up long lines, whitespace sanitization, and
unused import removal.

The maximum line length in Python files was set to a generous 125 chars
to minimize the number of changes needed in scripts and general
annoyance. The "txt" prompts directory is excluded from the checks
as it may contain oddly formatted files and strings for a good reason.

Signed-off-by: Jiri Podivin <jpodivin@gmail.com>
2023-06-17 13:32:48 +03:00
Gustavo Rocha Dias
bac19927c3
readme : alternative way to build for Android with CLBlast. (#1828) 2023-06-17 12:01:06 +03:00
Kerfuffle
b4c6f46f17
Allow cmake to build ggml as a library (#1896)
* Allow cmake to build ggml as a library

* A ggml_static library will be created

* When BUILD_SHARED_LIBS is enabled, ggml_shared will also be built
2023-06-17 01:49:42 -06:00
David Yang
92f20d9942
train : get raw text instead of page with html (#1905)
We probably want to train using just the text of Shakespeare instead of the html of the page displaying his work.
2023-06-17 09:51:54 +03:00
0cc4m
d411968e99
opencl : support k-quants (#1836)
* Porting q2_k kernel to OpenCL

* Set global and local sizes for kernel calls for dequantizing k-quants

* Added q6_k kernel

* Fix q4_k opencl struct order

* Replace uchar with uint8_t

* Finish dequant kernels

* Added OpenCL DMMV kernels

* Fix q2_k, improve code

* Fix q3_k

* Shorten switch statements

* Improve code formatting

---------

Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
2023-06-16 21:59:49 +03:00
SuperUserNameMan
b41b4cad6f
examples : add "simple" (#1840)
* Create `simple.cpp`

* minimalist example `CMakeLists.txt`

* Update Makefile for minimalist example

* remove 273: Trailing whitespace

* removed trailing white spaces simple.cpp

* typo and comments simple.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-16 21:58:09 +03:00
Zenix
13fe9d2d84
cmake : add auto detection of BLAS_INCLUDE_DIRS (#1886) 2023-06-16 21:53:04 +03:00
Johannes Gäßler
ac3b886953
llama : fix embd when offloading non-repeating layers (#1891) 2023-06-16 21:25:51 +03:00
FrankHB
5b9ccaf104
Fixed possible macro redefinition (#1892)
MinGW libstdc++ may define `NOMINMAX` unconditionally. This fixes the case when it is already defined.
2023-06-16 21:25:01 +03:00
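The usual fix for this kind of clash is to guard the definition; a minimal illustration (the actual patch may differ):

```cpp
// Guard against MinGW libstdc++ (or anything else) having defined NOMINMAX already.
#if !defined(NOMINMAX)
#    define NOMINMAX
#endif
#include <windows.h>   // NOMINMAX keeps windows.h from defining min()/max() macros
```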
Borislav Stanimirov
9cbf50c041
build : fix and ignore MSVC warnings (#1889) 2023-06-16 21:23:53 +03:00
Kawrakow
3d01122610
CUDA : faster k-quant dot kernels (#1862)
* cuda : faster k-quant dot kernels

* Improve Q2_K dot kernel on older GPUs

We now have a K_QUANTS_PER_ITERATION macro, which should be
set to 1 on older and to 2 on newer GPUs.
With this, we preserve the performance of the original
PR on RTX-4080, and are faster compared to master on
GTX-1660.

* Improve Q6_K dot kernel on older GPUs

Using the same K_QUANTS_PER_ITERATION macro as last commit,
we preserve performance on RTX-4080 and speed up
Q6_K on a GTX-1660.

* Add LLAMA_CUDA_KQUANTS_ITER to CMakeLists.txt and Makefile

Allowed values are 1 or 2; 2 gives the best performance on modern GPUs
and is set as the default, while 1 may work better on older GPUs
(see the sketch after this entry).

* PR comments

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-16 20:08:44 +03:00
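A hedged illustration of the kind of loop a compile-time knob like K_QUANTS_PER_ITERATION controls, written as plain C++ rather than the actual CUDA kernel code; names and structure are illustrative only:

```cpp
// Illustrative only: the macro fixes how many values each loop step consumes.
// 2 is the default for modern GPUs; 1 may be faster on older ones, and can be
// selected at build time via LLAMA_CUDA_KQUANTS_ITER.
#ifndef K_QUANTS_PER_ITERATION
#define K_QUANTS_PER_ITERATION 2
#endif

static float dot_sketch(const float * x, const float * y, int n) {
    float sum = 0.0f;
    for (int i = 0; i + K_QUANTS_PER_ITERATION <= n; i += K_QUANTS_PER_ITERATION) {
        for (int k = 0; k < K_QUANTS_PER_ITERATION; ++k) {
            sum += x[i + k] * y[i + k];          // inner loop unrolled by the compiler
        }
    }
    return sum;
}
```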
Borislav Stanimirov
602c748863
gitignore : add several entries specific to Visual Studio (#1888) 2023-06-16 09:58:11 +03:00
Johannes Gäßler
a09f9195be
Fixed CUDA runtime version check (#1879) 2023-06-15 21:49:08 +02:00
Georgi Gerganov
bed9275617
cmake : remove whitespaces 2023-06-15 21:56:50 +03:00
yangli2
c36e81da62
examples : add chat-vicuna.sh (#1854)
Co-authored-by: Yang Li <yangliyl@google.com>
2023-06-15 21:05:53 +03:00
Igor Okulist
3559433fec
cmake : set include path for OpenBlas (#1830) 2023-06-15 20:51:26 +03:00
Frederik Vogel
69b34a0e80
swift : Package compile breaks due to ggml-metal.metal (#1831)
* Ignore metal file in spm

* Add ggml.h to spm public Headers

---------

Co-authored-by: Vogel Frederik <vogel.frederik@linecorp.com>
2023-06-15 20:47:04 +03:00
daboe01
cf267d1c71
make : add train-text-from-scratch (#1850)
* make finetuning example accessible

* fixed: target was in the wrong line

* fixed: name of executable was wrong

* fixed: naming of binary

* fixed: model path was wrong

* fixed clean target

* Update examples/train-text-from-scratch/README.md

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-15 20:42:48 +03:00
Srinivas Billa
9dda13e5e1
readme : server compile flag (#1874)
Explicitly include the server make instructions for C++ noobs like me ;)
2023-06-15 20:36:38 +03:00
sandyiscool
37e257c48e
make : clean *.so files (#1857) 2023-06-15 20:36:06 +03:00
Howard Su
64cc19b4fe
Fix the validation of main device (#1872) 2023-06-15 19:29:59 +02:00
Georgi Gerganov
4bfcc855ab
metal : parallel command buffer encoding (#1860)
* metal : parallel command buffer encoding

* metal : determine number of command buffers based on gf->n_threads
2023-06-15 20:29:48 +03:00
Johannes Gäßler
6b8312e797
Better error when using both LoRA + GPU layers (#1861) 2023-06-15 19:06:46 +02:00