- added functionality to find the smallest fitting buffer instead of the first found buffer that is >= the requested size
-- this prevents two buffer allocations in sequence from taking a huge buffer for a small tensor and then requiring a new buffer for the 2nd tensor
-- in my test this saved 1 GB of VRAM that is now free for more offloading
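A minimal sketch of the best-fit lookup, assuming a simple fixed-size pool; the names (cuda_buffer, g_pool, pool_malloc_best_fit) are illustrative and not the actual ggml-cuda implementation:

```cpp
// Illustrative best-fit pool lookup (not the actual ggml-cuda code):
// instead of returning the first pooled buffer that is large enough,
// scan the whole pool and return the smallest buffer that still fits,
// so a small tensor cannot grab a huge buffer that a later, larger
// tensor would then force to be re-allocated.
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdint>

#define POOL_SIZE 256

struct cuda_buffer {
    void * ptr  = nullptr; // nullptr = slot unused
    size_t size = 0;
};

static cuda_buffer g_pool[POOL_SIZE]; // hypothetical per-device pool

static void * pool_malloc_best_fit(size_t size, size_t * actual_size) {
    int    best_i    = -1;
    size_t best_size = SIZE_MAX;

    // pick the smallest free buffer whose size is >= the requested size
    for (int i = 0; i < POOL_SIZE; ++i) {
        cuda_buffer & b = g_pool[i];
        if (b.ptr != nullptr && b.size >= size && b.size < best_size) {
            best_i    = i;
            best_size = b.size;
        }
    }

    if (best_i >= 0) {
        cuda_buffer & b = g_pool[best_i];
        void * ptr   = b.ptr;
        *actual_size = b.size;
        b.ptr  = nullptr; // take the buffer out of the pool
        b.size = 0;
        return ptr;
    }

    // nothing in the pool fits: fall back to a fresh allocation
    void * ptr = nullptr;
    cudaMalloc(&ptr, size);
    *actual_size = size;
    return ptr;
}
```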
cuda free buffers:
- added a helper function that frees all unused buffers from a device to prevent huge F32 buffers from cuBLAS occupying VRAM needlessly after token ingestion
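A sketch of what such a helper could look like, reusing the illustrative pool from the sketch above; the real function name and pool layout may differ:

```cpp
// Illustrative: release every buffer that currently sits unused in the pool
// of the given device (e.g. after prompt ingestion), so the large F32
// buffers allocated for cuBLAS stop pinning VRAM.
#include <cuda_runtime.h>

static void pool_free_all_unused(int device) {
    cudaSetDevice(device);
    for (int i = 0; i < POOL_SIZE; ++i) {
        cuda_buffer & b = g_pool[i]; // pool from the sketch above
        if (b.ptr != nullptr) {      // buffers in the pool are by definition unused
            cudaFree(b.ptr);
            b.ptr  = nullptr;
            b.size = 0;
        }
    }
}
```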
libfalcon:
- corrected vram_overhead calculation to account for the actual non-weight buffers needed during inference
- added vram_overhead for n_batch > 1 as this switches the ingestion into a 32-bit dequantization mode for cuBLAS, which needs almost 2 GB of VRAM buffers
- corrected the automated layer distribution to fill VRAM as much as possible with layers (a sketch of this and the tensor skip follows after this list)
From here on it's recommended to use --ngl 100 and -b 1 for CUDA processing.
In addition, it's recommended to set -t to 1 or to one less than the number of CPU cores (depends on the CPU and GPU used)
- disabled offload of non-layer tensors for now (not working yet)
- corrected the tensor size calculation for VRAM
- added some more detailed VRAM reporting
- added an automated skip of tensors that would not fit in VRAM (significant slowdown if --ngl is too high, probably from temporary CUDA buffer copies)
- added vram_overhead and vram_reserved - those are not pretty but currently needed to get the VRAM usage right
- moved the VRAM scratch buffer allocation up a bit so its usage is available for the skip
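A rough sketch of the fill-as-much-as-possible layer distribution and the per-tensor skip mentioned above; the struct and function names as well as the exact budget terms are illustrative:

```cpp
// Illustrative VRAM budgeting: offload as many layers as fit into the VRAM
// that remains after reserving the non-weight buffers (scratch, overhead,
// reserve); tensors that would overflow the budget are skipped and stay on
// the CPU (with a significant slowdown if --ngl is set too high).
#include <cstddef>

struct vram_budget {
    size_t total;    // total VRAM of the device
    size_t scratch;  // VRAM scratch buffer
    size_t overhead; // vram_overhead: non-weight buffers needed during inference
    size_t reserved; // vram_reserved: safety margin
};

// how many of n_layer layers (bytes_per_layer each) can be offloaded
static int layers_that_fit(const vram_budget & b, size_t bytes_per_layer, int n_layer) {
    const size_t fixed = b.scratch + b.overhead + b.reserved;
    if (b.total <= fixed || bytes_per_layer == 0) {
        return 0;
    }
    const size_t free_for_layers = b.total - fixed;
    const int n = (int)(free_for_layers / bytes_per_layer);
    return n < n_layer ? n : n_layer;
}

// per-tensor check while loading: skip the tensor if it would not fit
static bool tensor_fits(size_t tensor_bytes, size_t vram_used, const vram_budget & b) {
    const size_t fixed = b.scratch + b.overhead + b.reserved;
    return vram_used + tensor_bytes + fixed <= b.total;
}
```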
Added falcon main and library based on llama.cpp
CPU inference works (getting ~260 ms/token on the 7B 16-bit Falcon)
Tested with the 7B 16-bit model and the two Shakespeare models (both in 16-bit precision only)
TODO/WIP:
1) quantization runs and creates a ggjt 3 file, but something is wrong with the quantized model binary
- even quantization from 16 bit -> 16 bit fails; something is wrong in the tensors produced
2) mmap should work with quantized binaries once 1) is solved
3) CUDA support is mostly there, it's currently disabled (all CPU backend)
4) memory/context calculations are off, and the GPU memory calculations are wrong as well
5) the Python conversion script is a pre-GGML-v1 version (tokens without scores)
6) some things are still called "llama"; some of them should be renamed to a generic name as the code works for both
7) the GGML file produced by the current Python script uses an old ftype method
Makefiles:
CMake on Windows with Build Tools works
the Makefile for Linux/MSYS was adjusted blindly but not tested yet - something may have been missed
Changes to the codebase:
* repeat2 has been added to ggml (jploski - https://github.com/ggerganov/ggml/pull/231) including the backward variant (untested, probably fails); a short usage sketch follows below
* minor changes to work with Falcon (name length)
* libfalcon is the previous "llama.cpp" and falcon_main is the previous main.cpp
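For illustration, a minimal (untested) usage sketch of the new op, assuming ggml_repeat2 follows the same calling convention as ggml_repeat (repeat a into the shape of b) and using the ggml graph API as it existed at the time; the exact semantics are defined in the linked PR:

```cpp
// Minimal sketch (assumption: ggml_repeat2 mirrors ggml_repeat's signature).
// Tensor data is left uninitialized; this only shows how the op is wired up.
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 1); // source
    struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 8); // target shape

    struct ggml_tensor * r = ggml_repeat2(ctx, a, b); // repeat a to b's shape

    struct ggml_cgraph gf = ggml_build_forward(r);
    ggml_graph_compute(ctx, &gf);

    ggml_free(ctx);
    return 0;
}
```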
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort: 24 PRs were closed in the submitter's repo alone, with over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on the EOS token and after the token count specified with n_predict
- adds server timeouts, host, and port settings
- adds an expanded generation-complete response: generation settings, stop reason, whether the prompt was truncated, the model used, and the final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stop-word handling when matching multiple tokens and while streaming, or when generation finishes on a partial stop string (see the sketch after this list)
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplifies the logic and removes a lot of variables
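For illustration, one way the partial stop-string handling mentioned above can work (a sketch, not the actual server code): before streaming a chunk, check whether a stop word appears in the generated text or whether the text ends in a prefix of a stop word, and hold back anything from that position onward until the match is confirmed or ruled out.

```cpp
// Illustrative helper: find the earliest position in `text` where a full
// stop word occurs, or where a suffix of `text` is a prefix of a stop word
// (a partial match that might still be completed by the next tokens).
#include <cstddef>
#include <string>
#include <vector>

struct stop_match {
    size_t pos;      // where the (partial) stop word starts in text
    bool   partial;  // true if only a prefix matched at the end of text
    bool   found;
};

static stop_match find_stop_word(const std::string & text,
                                 const std::vector<std::string> & stop_words) {
    stop_match best = { std::string::npos, false, false };

    for (const std::string & stop : stop_words) {
        if (stop.empty()) {
            continue;
        }
        // full match anywhere in the text
        const size_t pos = text.find(stop);
        if (pos != std::string::npos && (!best.found || pos < best.pos)) {
            best = { pos, false, true };
        }
        // partial match: a suffix of text equals a proper prefix of the stop word
        for (size_t len = stop.size() - 1; len > 0; --len) {
            if (text.size() >= len &&
                text.compare(text.size() - len, len, stop, 0, len) == 0) {
                const size_t p = text.size() - len;
                if (!best.found || p < best.pos) {
                    best = { p, true, true };
                }
                break;
            }
        }
    }
    return best;
}

// While streaming, only text before best.pos is safe to send; the rest is
// held back until later tokens either complete or break the partial match.
```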
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>