Michael Klimenko
52bb63c708
refactor : switch to emplace_back to avoid extra object (#5291)
2024-02-03 13:23:37 +02:00

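The entry above replaces a `push_back` of a temporary with `emplace_back`. A minimal sketch of the difference, using a hypothetical `Task` type (not a type from the codebase) that counts its move constructions: `push_back(Task(...))` builds a temporary and then moves it into the vector, while `emplace_back(...)` forwards the constructor arguments and builds the element in place.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical element type that counts how often it is move-constructed.
struct Task {
    static inline int moves = 0;   // C++17 inline static member
    std::string text;
    int label;

    Task(std::string t, int l) : text(std::move(t)), label(l) {}
    Task(Task && other) noexcept : text(std::move(other.text)), label(other.label) { ++moves; }
    Task(const Task &) = default;
};

// push_back with a temporary: construct a Task, then move it into the vector.
int moves_with_push_back() {
    Task::moves = 0;
    std::vector<Task> v;
    v.reserve(1);                      // no reallocation, so no extra moves
    v.push_back(Task("question", 1));
    return Task::moves;
}

// emplace_back forwards the arguments and constructs the element in place.
int moves_with_emplace_back() {
    Task::moves = 0;
    std::vector<Task> v;
    v.reserve(1);
    v.emplace_back("question", 1);
    return Task::moves;
}
```

`moves_with_push_back()` returns 1 and `moves_with_emplace_back()` returns 0; the `reserve(1)` calls keep vector growth from adding moves to either count.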
kalomaze
191221178f
perplexity : fix KL divergence calculations on Windows (#5273)
2024-02-02 16:15:30 +02:00

Kawrakow
44879ee885
Additional KL-divergence statistics (#5081)
* perplexity: add top-token probability
* perplexity: add additional KL-divergence statistics
* perplexity: a better organized KL-divergence statistics output
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-23 15:17:20 +02:00

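The message does not name the additional statistics. As an illustration only, a sketch that summarizes per-token KL-divergence values with a mean and its standard error, the median, and the maximum; the choice of statistics and the names `KLStats` and `summarize_kld` are assumptions, not the commit's output format.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical summary of per-token KL-divergence values.
struct KLStats {
    double mean;
    double mean_err;   // standard error of the mean
    double median;
    double max;
};

KLStats summarize_kld(std::vector<double> kld) {   // by value: sorted locally
    assert(kld.size() >= 2);
    const double n = static_cast<double>(kld.size());
    double sum = 0.0, sum2 = 0.0;
    for (double x : kld) { sum += x; sum2 += x * x; }
    const double mean = sum / n;
    // Unbiased sample variance, then the standard error of the mean.
    const double var = (sum2 - n * mean * mean) / (n - 1.0);
    const double err = std::sqrt(std::max(var, 0.0) / n);

    std::sort(kld.begin(), kld.end());
    const size_t m = kld.size() / 2;
    const double median = (kld.size() % 2) ? kld[m] : 0.5 * (kld[m - 1] + kld[m]);
    return {mean, err, median, kld.back()};
}
```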
Georgi Gerganov
89758723c7
minor : clean-up some warnings and style (#5094)
* minor : clean-up some warnings and style
ggml-ci
* ggml : add comment
2024-01-23 14:12:57 +02:00

Kawrakow
6f9939d119
KL-divergence (#5076)
* kl-divergence: be able to save all logits to a file
* Add ability to compute KL-divergence
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-22 16:10:14 +02:00

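The KL-divergence entry saves all logits of a reference run to a file and compares another run against them. A sketch of the per-token quantity only, assuming KL(P || Q) with P from the reference logits and Q from the model under test, both turned into log-probabilities with a numerically stable log-softmax; the direction, the omitted file handling, and the function names are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Stable log-softmax: log p_i = x_i - max - log(sum_j exp(x_j - max)).
std::vector<double> log_softmax(const std::vector<float> & logits) {
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    double sum = 0.0;
    for (float x : logits) sum += std::exp(double(x) - max_logit);
    const double log_sum = std::log(sum);
    std::vector<double> out(logits.size());
    for (size_t i = 0; i < logits.size(); ++i) out[i] = double(logits[i]) - max_logit - log_sum;
    return out;
}

// KL(P || Q) = sum_i p_i * (log p_i - log q_i), P from the reference logits.
double kl_divergence(const std::vector<float> & ref_logits, const std::vector<float> & logits) {
    assert(ref_logits.size() == logits.size() && !logits.empty());
    const std::vector<double> log_p = log_softmax(ref_logits);
    const std::vector<double> log_q = log_softmax(logits);
    double kld = 0.0;
    for (size_t i = 0; i < log_p.size(); ++i) kld += std::exp(log_p[i]) * (log_p[i] - log_q[i]);
    return kld;
}
```

For reference logits {0, 0} (P = 0.5, 0.5) and test logits {0, ln 3} (Q = 0.25, 0.75), the result is 0.5 ln(4/3), about 0.1438.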
Kawrakow
7dcbe39d36
Add ability to evaluate multiple choice tasks (#5047)
* TruthfulQA: 1st attempt, does not look like it is working
The same implementation can be used for HellaSwag as well,
so I converted a HellaSwag validation dataset to the binary
format used here and tested with that. The score is only
around 50, so something is not quite right.
* TruthfulQA: works but the result is bad
I know it works because if I convert the HellaSwag validation
data to the binary format used in the truthful_qa_score() function
I get the exact same result as from the hellaswag_score() function.
But I guess the questions are tricky and the way I have done
the combination of question + answer is very likely not the best.
The TruthfulQA validation dataset contains 817 questions, with
a random-chance result of around 19%. With this version I get
29.1% for Mistral-7B and 55.2% for Mistral-7B-Instruct-v0.2.
The HF leaderboard results for these two models are
42.2% and 68.3%, respectively.
* TruthfulQA: fix random sample
* TruthfulQA: prepare tasks in parallel for large test datasets
* Rename truthful_qa to multiple_choice
* Make MSVC happy
I had forgotten that MSVC does not make constexpr variables available
inside a lambda.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-21 14:42:44 +02:00

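The multiple-choice entry scores every question + answer combination and counts the answer the model finds most likely. A minimal sketch of that final step, assuming the per-answer log-likelihoods are already computed; `MCTask`, `best_answer`, and `score` are hypothetical names, and the message does not say whether token log-probabilities are summed or averaged. The random-chance baseline here is the mean of 1/(number of answers) over all questions, one plausible way to arrive at a figure like the ~19% quoted when questions have different numbers of answers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical task: model log-likelihood of each answer and the correct index.
struct MCTask {
    std::vector<double> answer_logprobs;
    size_t correct;
};

// Index of the answer with the highest log-likelihood.
size_t best_answer(const MCTask & task) {
    assert(!task.answer_logprobs.empty());
    size_t best = 0;
    for (size_t i = 1; i < task.answer_logprobs.size(); ++i)
        if (task.answer_logprobs[i] > task.answer_logprobs[best]) best = i;
    return best;
}

// Model accuracy and the expected accuracy of uniform random guessing.
void score(const std::vector<MCTask> & tasks, double & accuracy, double & random_chance) {
    assert(!tasks.empty());
    double correct = 0.0, chance = 0.0;
    for (const MCTask & t : tasks) {
        if (best_answer(t) == t.correct) correct += 1.0;
        chance += 1.0 / t.answer_logprobs.size();
    }
    accuracy      = correct / tasks.size();
    random_chance = chance  / tasks.size();
}
```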
Jared Van Bortel
97c1549808
perplexity : fix MSVC build after #5020 (#5043)
* perplexity : fix MSVC build after #5020
* try a different fix
2024-01-20 17:08:08 +02:00

Kawrakow
7051aacfac
winogrande: evaluate log-probs in parallel (#5036)
This is a relatively minor performance tweak resulting in
~10% speedup on my system.
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-19 11:39:11 +02:00

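The Winogrande entry (and the HellaSwag one further down) moves log-probability evaluation onto several threads. A sketch of the general idea, assuming a row-major logits buffer of `n_rows x n_vocab` and one target token per row; the interleaved assignment of rows to `std::thread` workers is an assumption, not the commit's scheduling, and `parallel_logprobs` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <thread>
#include <vector>

// Log-probability of `token` under one row of logits (stable log-softmax).
static double token_logprob(const float * logits, int n_vocab, int token) {
    const float max_logit = *std::max_element(logits, logits + n_vocab);
    double sum = 0.0;
    for (int i = 0; i < n_vocab; ++i) sum += std::exp(double(logits[i]) - max_logit);
    return double(logits[token]) - max_logit - std::log(sum);
}

// log p(tokens[row]) for every row; rows are interleaved across n_threads
// workers, and each worker writes only its own slots of `out`.
std::vector<double> parallel_logprobs(const std::vector<float> & logits, int n_vocab,
                                      const std::vector<int> & tokens, int n_threads) {
    const int n_rows = static_cast<int>(tokens.size());
    assert(n_threads > 0 && logits.size() == size_t(n_rows) * n_vocab);
    std::vector<double> out(n_rows);
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([&, t] {
            for (int row = t; row < n_rows; row += n_threads)
                out[row] = token_logprob(logits.data() + size_t(row) * n_vocab, n_vocab, tokens[row]);
        });
    }
    for (std::thread & w : workers) w.join();
    return out;
}
```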
Kawrakow
993fba8180
perplexity: avoid unnecessary allocations and logit copies (#5035)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-19 11:02:39 +02:00

Georgi Gerganov
8b20858e5e
perplexity : faster Winogrande via batching (#5024)
* perplexity : faster Winogrande via batching
ggml-ci
* perplexity : remove unused function
* perplexity : only tokenize selected tasks for Winogrande
2024-01-19 10:45:06 +02:00

Georgi Gerganov
d391ae9b49
perplexity : fix winogrande N tasks option
2024-01-18 20:49:00 +02:00

Kawrakow
3e945cc1e9
HellaSwag: speed up by parallelizing log-prob evaluation (#5020)
For Mistral-7B and fp16, time on my system goes down from 536 seconds
to 423 seconds for the full evaluation dataset (10042 tasks).
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-18 19:18:21 +02:00

Georgi Gerganov
ad19812cda
perplexity : faster HellaSwag via batching (#5017)
* perplexity : faster HellaSwag
ggml-ci
* perplexity : clean-up
ggml-ci
* perplexity : no need for decode_helper
ggml-ci
* perplexity : add comments
* perplexity : option to specify max batched tasks via `n_parallel`
* perplexity : remove HellaSwag restriction for n_batch
2024-01-18 15:33:01 +02:00

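Batching evaluates several HellaSwag tasks in one decode call, with `n_parallel` capping how many are batched. A hypothetical packing routine, assuming each task has a known cost in batch tokens and in sequences (for HellaSwag, one sequence per ending): tasks are added in order until either the `n_batch` token budget or a sequence budget `max_seq` would be exceeded. `TaskCost`, `pack_tasks`, and `max_seq` are illustrative; how `n_parallel` maps onto such a budget in the commit is not stated here.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical per-task cost: batch tokens needed and sequences used.
struct TaskCost { size_t n_tokens; size_t n_seq; };

// Greedy in-order packing. Returns [first, last) task ranges, one per batch.
std::vector<std::pair<size_t, size_t>> pack_tasks(const std::vector<TaskCost> & tasks,
                                                  size_t n_batch, size_t max_seq) {
    std::vector<std::pair<size_t, size_t>> batches;
    size_t first = 0, tokens = 0, seqs = 0;
    for (size_t i = 0; i < tasks.size(); ++i) {
        assert(tasks[i].n_tokens <= n_batch && tasks[i].n_seq <= max_seq);
        if (tokens + tasks[i].n_tokens > n_batch || seqs + tasks[i].n_seq > max_seq) {
            batches.emplace_back(first, i);   // close the current batch
            first = i; tokens = 0; seqs = 0;
        }
        tokens += tasks[i].n_tokens;
        seqs   += tasks[i].n_seq;
    }
    if (first < tasks.size()) batches.emplace_back(first, tasks.size());
    return batches;
}
```

With four tasks of 4 sequences each, `n_batch = 512` and `max_seq = 8`, the first two tasks share a batch and the last two share another.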
Kawrakow
682986a08e
Add Winogrande evaluation (#5015)
* winogrande: simple implementation
It doesn't look like it is working - why?
For Mistral-7B it is barely better than
random chance (score ~60% for 1267 tasks), while I see
Mistral-7B scoring 78.4% on the HF leaderboard.
1-sigma statistical uncertainty for 1267 tasks is ~1.4 percentage points,
so no way the difference is due to statistics.
* winogrande: somewhat better
Score for Mistral-7B is now 68.9 on the validation set of
winogrande_debiased. Still far from the reported 78.4, but
better than what I had before.
* winogrande: improving
Mistral-7B score is now 73.56.
Still not quite 78.4 but getting there.
We are also getting a lower score on HellaSwag
compared to HF leaderboard, so I'm not expecting
we will get up to 78.4 anyway.
It looks like it is better to skip the choice word(s)
when evaluating the average log-likelihood. This kind of
makes sense because a more common word (in Winogrande this is
often a name) will have a higher probability without knowing
about the follow-up context, and this will skew the log-likelihood
towards the more common word. We can only do this if the
choice words are not last in the sentence.
It also looks like it is better to skip the punctuation at the
end of the sentence, provided the choice words are not last.
* winogrande: add dataset instructions
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-18 13:46:27 +02:00

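Two numbers in the Winogrande entry can be reproduced with a few lines. The 1-sigma uncertainty of an accuracy p measured on n independent tasks is 100 * sqrt(p(1 - p)/n) percentage points, which for p = 0.6 and n = 1267 gives about 1.38, matching the "~1.4" above. The second function is a sketch of the partial scoring described in the message: average only the token log-probabilities after the choice word(s), optionally dropping the final punctuation token. Both function names are hypothetical, and the caller is assumed to know the token range that follows the choice.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// 1-sigma uncertainty (percentage points) of accuracy p over n binary tasks.
double accuracy_sigma_percent(double p, size_t n) {
    assert(n > 0 && p >= 0.0 && p <= 1.0);
    return 100.0 * std::sqrt(p * (1.0 - p) / static_cast<double>(n));
}

// Average log-likelihood over tokens [first, last): the part of the sentence
// after the choice word(s). Pass last = size - 1 to skip final punctuation.
double average_logprob(const std::vector<double> & token_logprobs, size_t first, size_t last) {
    assert(first < last && last <= token_logprobs.size());
    double sum = 0.0;
    for (size_t i = first; i < last; ++i) sum += token_logprobs[i];
    return sum / static_cast<double>(last - first);
}
```

The option whose continuation gets the higher `average_logprob` would be picked; since the choice word itself is excluded, a more common name cannot win just by being more probable on its own.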
Georgi Gerganov
959ef0c0df
perplexity : fix kv cache handling for hellaswag (#4981)
ggml-ci
2024-01-16 19:34:54 +02:00

Kerfuffle
91f6499393
Respect tokenizer.ggml.add_bos_token value when tokenizing (#4040)
* gguf-py: gguf-dump: Respect --no-tensor flag in JSON mode.
* Respect add_bos_token GGUF metadata value
* gguf-py: Try to fix SpecialVocab giving up too easily for the Nth time
2023-11-16 19:14:37 -07:00

cebtenzzre
b12fa0d1c1
build : link against build info instead of compiling against it (#3879)
* cmake : fix build when .git does not exist
* cmake : simplify BUILD_INFO target
* cmake : add missing dependencies on BUILD_INFO
* build : link against build info instead of compiling against it
* zig : make build info a .cpp source instead of a header
Co-authored-by: Matheus C. França <matheus-catarino@hotmail.com>
* cmake : revert change to CMP0115
---------
Co-authored-by: Matheus C. França <matheus-catarino@hotmail.com>
2023-11-02 08:50:16 +02:00

Kerfuffle
6e08281e58
Extend llama_kv_cache_seq_rm to allow matching any sequence (#3843)
* Extend llama_kv_cache_seq_rm to allow matching any sequence
* Replace llama_kv_cache_tokens_rm with llama_kv_cache_clear
Use llama_kv_cache_clear for cache clearing
Change calls to llama_kv_cache_tokens_rm that want to delete by position to use llama_kv_cache_seq_rm functionality
2023-10-29 11:31:40 -06:00

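The entry extends `llama_kv_cache_seq_rm` so that one call can match any sequence. The toy cache below is not the llama.cpp data structure; it assumes the convention described in `llama.h` of that period (a negative sequence id matches any sequence, a negative `p0` means 0, a negative `p1` means no upper bound, and positions in [p0, p1) are removed), which should be checked against the header. `Cell`, `toy_seq_rm`, and `count_used` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Toy KV-cache cell: token position and owning sequence (-1 = empty).
struct Cell { int32_t pos = -1; int32_t seq = -1; };

// Remove cells of sequence `seq` (any sequence if seq < 0) whose position
// lies in [p0, p1); negative p0/p1 make the range open-ended.
void toy_seq_rm(std::vector<Cell> & cells, int32_t seq, int32_t p0, int32_t p1) {
    if (p0 < 0) p0 = 0;
    if (p1 < 0) p1 = std::numeric_limits<int32_t>::max();
    for (Cell & c : cells) {
        const bool seq_match = seq < 0 || c.seq == seq;
        if (seq_match && c.pos >= p0 && c.pos < p1) c = Cell{};   // mark empty
    }
}

size_t count_used(const std::vector<Cell> & cells) {
    size_t n = 0;
    for (const Cell & c : cells) n += (c.pos >= 0);
    return n;
}
```

Under these assumed rules, deleting by position across all sequences is `toy_seq_rm(cells, -1, p0, p1)`, and `toy_seq_rm(cells, -1, -1, -1)` empties the whole toy cache.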
Marcus Dunn
5be6c803fa
llama : remove token functions with context args in favor of model (#3720)
* added `llama_model_token_*` variants to all the `llama_token_*` functions.
* added `LLAMA_API`
* formatting
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* removed old `llama_token` functions
* changed 3 more functions to take in model
- `llama_token_get_text`
- `llama_token_get_score`
- `llama_token_get_type`
* added back docs
* fixed main.cpp
* changed token functions to use new model variants
* changed token functions to use new model variants
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-10-23 22:40:03 +03:00

slaren
16bc66d947
llama.cpp : split llama_context_params into model and context params (#3301)
* llama.cpp : split llama_context_params into model and context params
ggml-ci
* fix metal build
* fix freq_base/scale default to model value
* llama-bench : keep the same model between tests when possible
* move n_threads to llama_context_params, add n_threads_batch
* fix mpi build
* remove kv_size(), cuda scratch fixes
* remove low-vram option
* add n_threads_batch to system info, refactor to get_system_info()
* add documentation about --threads-batch to the READMEs
* llama-bench fix
* main : fix rope freq/scale warning
* llama.cpp : add llama_get_model
common : add llama_tokenize from model
* remove duplicated ctx/model functions
ggml-ci
* cuda : print total VRAM used
2023-09-28 22:42:38 +03:00

Georgi Gerganov
ec893798b7
llama : custom attention mask + parallel decoding + no context swaps (#3228)
* tests : verify that RoPE is "additive"
* llama : replace ggml_diag_mask_inf with ggml_add (custom -inf mask)
* ggml : ggml_rope now takes a vector with positions instead of n_past
* metal : add rope_f16 kernel + optimize cpy kernels
* llama : unified KV cache + batch inference API
* llama : add new llama_decode() API that works with llama_batch
* llama : add cell_max heuristic for more efficient kv_cache
* llama : extend llama_kv_cache API
* llama : more robust cell_max heuristic + wip shift
* metal : disable concurrency optimization
* llama : add llama_kv_cache_shift_seq + no more context swaps
* llama : apply K-cache roping for Falcon and Baichuan
* speculative : fix KV cache management
* parallel : example for serving multiple users in parallel
* parallel : disable hot-plug to avoid cache fragmentation
* fixes : speculative KV cache + llama worst-case graph
* llama : extend batch API to select which logits to output
* llama : fix worst case graph build
* ggml-cuda : update rope implementation for parallel decoding (#3254)
* ggml-cuda : update rope implementation for parallel decoding
* better solution for p0 computation
* fix rope
* simpler rope implementation
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* make : add parallel to build + fix static functions in llama.cpp
* simple : fix token counting
* parallel : various improvements
* llama : fix cell_max logic + rename functions
* parallel : try smaller batches when the KV cache is fragmented
* parallel : fix sequence termination criteria
* llama : silence KV cache errors
* parallel : remove new line from prompt
* parallel : process system prompt once + configurable parameters + llama API
* parallel : remove question with short answers
* parallel : count cache misses
* parallel : print misses on each request
* parallel : minor
* llama : fix n_kv to never become 0
* parallel : rename hot-plug to continuous-batching
* llama : improve llama_batch API + simplify parallel example
* simple : add parallel decoding support
* simple : improve comments + free batch
* ggml-cuda : add rope f16, restore performance with parallel decoding (#3272)
* ggml-cuda : add rope f16, restore performance
* offload KQ_mask with all models
* fix rope shift
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* llama : disable MPI for now
ggml-ci
* train : make KQ_pos memory buffer permanent via dummy scale op
* ggml : revert change to ggml_cpy, add ggml_cont_Nd instead (#3275)
ggml-ci
* parallel : fix bug (extra BOS) + smaller token_prev array
* parallel : fix cases where the input prompts can overflow the batch
* parallel : add disabled experimental batch chunking in powers of two
* llama : llama.h formatting + comments
* simple : add README.md
* llama : fix kv cache heuristic when context is less than 32
* parallel : fix crash when `-n -1`
* llama : simplify returns if/else branches
* metal : use mm kernels for batch size > 2
* examples : utilize new llama_get_logits_ith()
* examples : add example for batched decoding
* examples : do not eval prompt 2 times (close #3348)
* server : clear the KV cache beyond n_past before llama_decode
* server : avoid context swaps by shifting the KV cache
---------
Co-authored-by: slaren <slarengh@gmail.com>
2023-09-28 19:04:36 +03:00

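The entry replaces `ggml_diag_mask_inf` with `ggml_add` of a custom -inf mask so that tokens from several sequences can share one KV cache. A toy CPU sketch of such an additive mask, assuming each cache cell holds exactly one position and one sequence id (the real cache can tag a cell with more than one sequence): a batch token may attend to a cell only if the cell belongs to its sequence and is not in its future. `KVCell`, `BatchToken`, and `build_kq_mask` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

struct KVCell     { int pos; int seq; };   // token already stored in the cache
struct BatchToken { int pos; int seq; };   // token being decoded now

// Row-major n_tokens x n_kv mask: 0 where attention is allowed, -inf elsewhere.
// Added to KQ before softmax, the -inf entries get zero attention weight.
std::vector<float> build_kq_mask(const std::vector<BatchToken> & batch,
                                 const std::vector<KVCell> & cells) {
    const float neg_inf = -std::numeric_limits<float>::infinity();
    std::vector<float> mask(batch.size() * cells.size(), neg_inf);
    for (size_t i = 0; i < batch.size(); ++i)
        for (size_t j = 0; j < cells.size(); ++j)
            if (cells[j].seq == batch[i].seq && cells[j].pos <= batch[i].pos)
                mask[i * cells.size() + j] = 0.0f;
    return mask;
}
```

With a single sequence this reduces to the usual causal mask; with several, each token sees only its own sequence's history.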
Cebtenzzre
8781013ef6
make : restore build-info.h dependency for several targets (#3205)
2023-09-18 10:03:53 -04:00

Cebtenzzre
e6616cf0db
examples : add compiler version and target to build info (#2998)
2023-09-15 16:59:49 -04:00

Cebtenzzre
3aefaab9e5
check C++ code with -Wmissing-declarations (#3184)
2023-09-15 15:38:27 -04:00

Cebtenzzre
e64f5b5578
examples : make n_ctx warning work again (#3066)
This was broken by commit e36ecdcc ("build : on Mac OS enable Metal by default (#2901)").
2023-09-08 11:43:35 -04:00

Cebtenzzre
00d62adb79
fix some warnings from gcc and clang-tidy (#3038)
Co-authored-by: xaedes <xaedes@gmail.com>
2023-09-07 13:22:29 -04:00

Georgi Gerganov
e36ecdccc8
build : on Mac OS enable Metal by default (#2901)
* build : on Mac OS enable Metal by default
* make : try to fix build on Linux
* make : move targets back to the top
* make : fix target clean
* llama : enable GPU inference by default with Metal
* llama : fix vocab_only logic when GPU is enabled
* common : better `n_gpu_layers` assignment
* readme : update Metal instructions
* make : fix merge conflict remnants
* gitignore : metal
2023-09-04 22:26:24 +03:00

Kawrakow
fa3582f509
Tell users attempting to run perplexity with too few tokens to use more (#2882)
Closes #2858
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-08-29 23:55:45 +03:00

Johannes Gäßler
6b73ef1201
YAML result logging + preset script (#2657)
2023-08-28 17:59:39 +02:00

Kawrakow
463173a6c0
llama : speedup tokenization (#2831)
* Speedup tokenization
On current master it takes ~3.2 seconds to tokenize
Wikitext. With this change it becomes ~525 ms.
* Fix: it was missing the piece after the last found occurrence
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-08-27 16:50:33 +03:00

Georgi Gerganov
edd4c14817
llama : more tokenizer fixes (#2810)
* tests : write a Python tokenizer test (wip)
* llama : prefix input text for tokenization with whitespace
* llama : distinguish pieces from decoded text + fix detokenization
* common : add comments
* examples : no longer manually add leading space when tokenizing
* tests : use Python to generate tokenizer tests for C++
* tests : add option to tokenize text files
ggml-ci
* tests : add test-tokenizer-1.py
* llama.cpp : fix LF token
* hellaswag : move the concat space for clarity
* tests : add falcon tests (py + cpp, currently do not pass Unicode)
ggml-ci
* common : temporary separate llama_detokenize calls for SPM and BPE
---------
Co-authored-by: klosax <131523366+klosax@users.noreply.github.com>
2023-08-27 14:19:19 +03:00

Kawrakow
771551a793
Fix HellaSwag (#2805)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-08-26 16:48:53 +03:00

Kawrakow
d046dcee08
Faster perplexity computation (#2786)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-08-25 19:05:02 +03:00

Georgi Gerganov
cf658adc83
llm : add Falcon support (#2717)
* llama : refactor GGUF constants into static maps
* llama : check if model architecture is known
* llama : refactor llama_model_load_internal()
* gguf : add KV constant maps
* llm : read arch-specific KVs
* convert : add dummy scores + types
* falcon : load tensor data (CPU only)
* llama : fix loading progress bar
* llama : add arch member to llama_model
* falcon : CPU inference working
* falcon : support non-40B models
* falcon : minor
* llama : minor updates
ggml-ci
* convert-falcon-hf-to-gguf.py : fix special token mapping
* llama.cpp : llama default UNK token = id 0
* llama.cpp : fix bpe tokenizer
* llama.cpp : fix the fix of bpe tokenizer
* ggml : pass eps to ggml_norm
* metal : implement RoPE (mode = 2) + avoid ggml_repeat
* ggml : ggml_repeat always creates new tensor
* falcon : copy-paste self-attention from LLaMA
* metal : print extra compute pipeline info
* falcon : minor changes (still chasing the Metal problem)
* llama.cpp : fix linefeed token
* metal : fix GELU kernel numerical stability by using precise::tanh
* metal : temporary workaround for the concurrency optimization bug
* falcon : add CUDA offloading (#2739)
* llama : better model naming and size reporting
* llama : prep new tokenizer support
* llama : advanced BPE tokenizer based on ggllm.cpp implementation
* llama : remove obsolete comment
ggml-ci
* common : remove obsolete BPE API + disable test-tokenizer-1
* llama : revert BPE special-case in llama_byte_to_token()
* cuda : add TODOs for RoPE NeoX implementation
* llama : default special tokens based on vocab type
* perplexity : add log for start of tokenization
---------
Co-authored-by: klosax <131523366+klosax@users.noreply.github.com>
Co-authored-by: slaren <slarengh@gmail.com>
2023-08-23 23:08:04 +03:00

Kawrakow
62959e740e
Strided perplexity (#2714)
* Implementing strided computation of perplexity
* Alternative way to output PPL results
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-08-23 12:56:42 +03:00

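Perplexity is exp(-(1/N) * sum of log p(token | context)) over the N scored tokens. The entry only says the computation became strided; the sketch below uses the common sliding-window formulation and is an assumption about what the commit does: windows of `n_ctx` tokens advance by `stride`, and each window scores only the tokens no earlier window has scored, so each scored token after the first window keeps at least `n_ctx - stride` tokens of context. `perplexity` and `strided_windows` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Perplexity from per-token log-probabilities.
double perplexity(const std::vector<double> & logprobs) {
    assert(!logprobs.empty());
    double sum = 0.0;
    for (double lp : logprobs) sum += lp;
    return std::exp(-sum / static_cast<double>(logprobs.size()));
}

// Windows {begin, score_from}: the model sees tokens [begin, begin + n_ctx)
// and only tokens [score_from, begin + n_ctx) are scored. Tokens are scored
// at most once; token 0 is never scored because it has no context.
std::vector<std::pair<size_t, size_t>> strided_windows(size_t n_tokens, size_t n_ctx, size_t stride) {
    assert(stride > 0 && stride <= n_ctx && n_tokens >= n_ctx);
    std::vector<std::pair<size_t, size_t>> windows;
    for (size_t begin = 0; begin + n_ctx <= n_tokens; begin += stride) {
        const size_t score_from = (begin == 0) ? 1 : begin + n_ctx - stride;
        windows.emplace_back(begin, score_from);
    }
    return windows;
}
```

A smaller `stride` gives each scored token more context at the cost of more forward passes; `stride == n_ctx` falls back to non-overlapping chunks.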
Georgi Gerganov
6381d4e110
gguf : new file format with flexible meta data (beta) (#2398)
* gguf : first API pass
* gguf : read header + meta data
* gguf : read tensor info
* gguf : initial model loading - not tested
* gguf : add gguf_get_tensor_name()
* gguf : do not support passing existing ggml_context to gguf_init
* gguf : simplify gguf_get_val
* gguf : gguf.c is now part of ggml.c
* gguf : read / write sample models
* gguf : add comments
* refactor : reduce code duplication and better API (#2415)
* gguf : expose the gguf_type enum through the API for now
* gguf : add array support
* gguf.py : some code style changes
* convert.py : start a new simplified implementation by removing old stuff
* convert.py : remove GGML vocab + other obsolete stuff
* GGUF : write tensor (#2426)
* WIP: Write tensor
* GGUF : Support writing tensors in Python
* refactor : rm unused import and upd todos
* fix : fix errors upd writing example
* rm example.gguf
* gitignore *.gguf
* undo formatting
* gguf : add gguf_find_key (#2438)
* gguf.cpp : find key example
* ggml.h : add gguf_find_key
* ggml.c : add gguf_find_key
* gguf : fix writing tensors
* gguf : do not hardcode tensor names to read
* gguf : write sample tensors to read
* gguf : add tokenization constants
* quick and dirty conversion example
* gguf : fix writing gguf arrays
* gguf : write tensors one by one and code reuse
* gguf : fix writing gguf arrays
* gguf : write tensors one by one
* gguf : write tensors one by one
* gguf : write tokenizer data
* gguf : upd gguf conversion script
* Update convert-llama-h5-to-gguf.py
* gguf : handle already encoded string
* ggml.h : get array str and f32
* ggml.c : get arr str and f32
* gguf.py : support any type
* Update convert-llama-h5-to-gguf.py
* gguf : fix set is not subscriptable
* gguf : update convert-llama-h5-to-gguf.py
* constants.py : add layer norm eps
* gguf.py : add layer norm eps and merges
* ggml.h : increase GGML_MAX_NAME to 64
* ggml.c : add gguf_get_arr_n
* Update convert-llama-h5-to-gguf.py
* add gptneox gguf example
* Makefile : add gptneox gguf example
* Update convert-llama-h5-to-gguf.py
* add gptneox gguf example
* Update convert-llama-h5-to-gguf.py
* Update convert-gptneox-h5-to-gguf.py
* Update convert-gptneox-h5-to-gguf.py
* Update convert-llama-h5-to-gguf.py
* gguf : support custom alignment value
* gguf : fix typo in function call
* gguf : mmap tensor data example
* fix : update convert-llama-h5-to-gguf.py
* Update convert-llama-h5-to-gguf.py
* convert-gptneox-h5-to-gguf.py : Special tokens
* gptneox-main.cpp : special tokens
* Update gptneox-main.cpp
* constants.py : special tokens
* gguf.py : accumulate kv and tensor info data + special tokens
* convert-gptneox-h5-to-gguf.py : accumulate kv and ti + special tokens
* gguf : gguf counterpart of llama-util.h
* gguf-util.h : update note
* convert-llama-h5-to-gguf.py : accumulate kv / ti + special tokens
* convert-llama-h5-to-gguf.py : special tokens
* Delete gptneox-common.cpp
* Delete gptneox-common.h
* convert-gptneox-h5-to-gguf.py : gpt2bpe tokenizer
* gptneox-main.cpp : gpt2 bpe tokenizer
* gpt2 bpe tokenizer (handles merges and unicode)
* Makefile : remove gptneox-common
* gguf.py : bytesarray for gpt2bpe tokenizer
* cmpnct_gpt2bpe.hpp : comments
* gguf.py : use custom alignment if present
* gguf : minor stuff
* Update gptneox-main.cpp
* map tensor names
* convert-gptneox-h5-to-gguf.py : map tensor names
* convert-llama-h5-to-gguf.py : map tensor names
* gptneox-main.cpp : map tensor names
* gguf : start implementing libllama in GGUF (WIP)
* gguf : start implementing libllama in GGUF (WIP)
* rm binary committed by mistake
* upd .gitignore
* gguf : calculate n_mult
* gguf : inference with 7B model working (WIP)
* gguf : rm deprecated function
* gguf : start implementing gguf_file_saver (WIP)
* gguf : start implementing gguf_file_saver (WIP)
* gguf : start implementing gguf_file_saver (WIP)
* gguf : add gguf_get_kv_type
* gguf : add gguf_get_kv_type
* gguf : write metadata in gguf_file_saver (WIP)
* gguf : write metadata in gguf_file_saver (WIP)
* gguf : write metadata in gguf_file_saver
* gguf : rm references to old file formats
* gguf : shorter name for member variable
* gguf : rm redundant method
* gguf : get rid of n_mult, read n_ff from file
* Update gguf_tensor_map.py
* Update gptneox-main.cpp
* gguf : rm references to old file magics
* gguf : start implementing quantization (WIP)
* gguf : start implementing quantization (WIP)
* gguf : start implementing quantization (WIP)
* gguf : start implementing quantization (WIP)
* gguf : start implementing quantization (WIP)
* gguf : start implementing quantization (WIP)
* gguf : quantization is working
* gguf : proper closing of file
* gguf.py : no need to convert tensors twice
* convert-gptneox-h5-to-gguf.py : no need to convert tensors twice
* convert-llama-h5-to-gguf.py : no need to convert tensors twice
* convert-gptneox-h5-to-gguf.py : simplify nbytes
* convert-llama-h5-to-gguf.py : simplify nbytes
* gptneox-main.cpp : n_layer --> n_block
* constants.py : n_layer --> n_block
* gguf.py : n_layer --> n_block
* convert-gptneox-h5-to-gguf.py : n_layer --> n_block
* convert-llama-h5-to-gguf.py : n_layer --> n_block
* gptneox-main.cpp : n_layer --> n_block
* Update gguf_tensor_map.py
* convert-gptneox-h5-to-gguf.py : load model in parts to save memory
* convert-llama-h5-to-gguf.py : load model in parts to save memory
* convert : write more metadata for LLaMA
* convert : rm quantization version
* convert-gptneox-h5-to-gguf.py : add file_type key
* gptneox-main.cpp : add file_type key
* fix conflicts
* gguf : add todos and comments
* convert-gptneox-h5-to-gguf.py : tensor name map changes
* Create gguf_namemap.py : tensor name map changes
* Delete gguf_tensor_map.py
* gptneox-main.cpp : tensor name map changes
* convert-llama-h5-to-gguf.py : fixes
* gguf.py : don't add empty strings
* simple : minor style changes
* gguf : use UNIX line ending
* Create convert-llama-7b-pth-to-gguf.py
* llama : sync gguf-llama.cpp with latest llama.cpp (#2608)
* llama : sync gguf-llama.cpp with latest llama.cpp
* minor : indentation + assert
* llama : refactor gguf_buffer and gguf_ctx_buffer
* llama : minor
* gitignore : add gptneox-main
* llama : tokenizer fixes (#2549)
* Merge tokenizer fixes into the gguf branch.
* Add test vocabularies
* convert : update convert-new.py with tokenizer fixes (#2614)
* Merge tokenizer fixes into the gguf branch.
* Add test vocabularies
* Adapt convert-new.py (and fix a clang-cl compiler error on windows)
* llama : sync gguf-llama with llama (#2613)
* llama : sync gguf-llama with llama
* tests : fix build + warnings (test-tokenizer-1 still fails)
* tests : fix wstring_convert
* convert : fix layer names
* llama : sync gguf-llama.cpp
* convert : update HF converter to new tokenizer voodoo magics
* llama : update tokenizer style
* convert-llama-h5-to-gguf.py : add token types
* constants.py : add token types
* gguf.py : add token types
* convert-llama-7b-pth-to-gguf.py : add token types
* gguf-llama.cpp : fix n_head_kv
* convert-llama-h5-to-gguf.py : add 70b gqa support
* gguf.py : add tensor data layout
* convert-llama-h5-to-gguf.py : add tensor data layout
* convert-llama-7b-pth-to-gguf.py : add tensor data layout
* gptneox-main.cpp : add tensor data layout
* convert-llama-h5-to-gguf.py : clarify the reverse permute
* llama : refactor model loading code (#2620)
* llama : style formatting + remove helper methods
* llama : fix quantization using gguf tool
* llama : simplify gguf_file_saver
* llama : fix method names
* llama : simplify write_header()
* llama : no need to pass full file loader to the file saver
just gguf_ctx
* llama : gguf_file_saver write I32
* llama : refactor tensor names (#2622)
* gguf: update tensor names searched in quantization
* gguf : define tensor names as constants
* gguf : initial write API (not tested yet)
* gguf : write to file API (not tested)
* gguf : initial write API ready + example
* gguf : fix header write
* gguf : fixes + simplify example + add ggml_nbytes_pad()
* gguf : minor
* llama : replace gguf_file_saver with new gguf write API
* gguf : streaming support when writing files
* gguf : remove obsolete write methods
* gguf : remove obsolete gguf_get_arr_xxx API
* llama : simplify gguf_file_loader
* llama : move hparams and vocab from gguf_file_loader to llama_model_loader
* llama : merge gguf-util.h in llama.cpp
* llama : reorder definitions in .cpp to match .h
* llama : minor simplifications
* llama : refactor llama_model_loader (WIP)
wip : remove ggml_ctx from llama_model_loader
wip : merge gguf_file_loader in llama_model_loader
* llama : fix shape prints
* llama : fix Windows build + fix norm_rms_eps key
* llama : throw error on missing KV pairs in model meta data
* llama : improve printing + log meta data
* llama : switch print order of meta data
---------
Co-authored-by: M. Yusuf Sarıgöz <yusufsarigoz@gmail.com>
* gguf : deduplicate (#2629)
* gguf : better type names
* dedup : CPU + Metal is working
* ggml : fix warnings about unused results
* llama.cpp : fix line feed and compiler warning
* llama : fix strncpy warning + note token_to_str does not write null
* llama : restore the original load/save session implementation
Will migrate this to GGUF in the future
* convert-llama-h5-to-gguf.py : support alt ctx param name
* ggml : assert when using ggml_mul with non-F32 src1
* examples : dedup simple
---------
Co-authored-by: klosax <131523366+klosax@users.noreply.github.com>
* gguf.py : merge all files in gguf.py
* convert-new.py : pick #2427 for HF 70B support
* examples/gguf : no need to keep q option for quantization any more
* llama.cpp : print actual model size
* llama.cpp : use ggml_elements()
* convert-new.py : output gguf (#2635)
* convert-new.py : output gguf (WIP)
* convert-new.py : add gguf key-value pairs
* llama : add hparams.ctx_train + no longer print ftype
* convert-new.py : minor fixes
* convert-new.py : vocab-only option should work now
* llama : fix tokenizer to use llama_char_to_byte
* tests : add new ggml-vocab-llama.gguf
* convert-new.py : tensor name mapping
* convert-new.py : add map for skipping tensor serialization
* convert-new.py : convert script now works
* gguf.py : pick some of the refactoring from #2644 
* convert-new.py : minor fixes
* convert.py : update to support GGUF output
* Revert "ci : disable CI temporary to not waste energy"
This reverts commit 7e82d25f40#2644 )
* gguf : single pass for writing tensors + refactoring writer
* gguf : single pass for writing tensors + refactoring writer
* gguf : single pass for writing tensors + refactoring writer
* gguf : style fixes in simple conversion script
* gguf : refactor gptneox conversion script
* gguf : rename h5 to hf (for HuggingFace)
* gguf : refactor pth to gguf conversion script
* gguf : rm file_type key and method
* gguf.py : fix vertical alignment
* gguf.py : indentation
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* convert-gptneox-hf-to-gguf.py : fixes
* gguf.py : gptneox mapping
* convert-llama-hf-to-gguf.py : fixes
* convert-llama-7b-pth-to-gguf.py : fixes
* ggml.h : reverse GGUF_MAGIC
* gguf.py : reverse GGUF_MAGIC
* test-tokenizer-0.cpp : fix warning
* llama.cpp : print kv general.name
* llama.cpp : get special token kv and linefeed token id
* llama : print number of tensors per type + print arch + style
* tests : update vocab file with new magic
* editorconfig : fix whitespaces
* llama : re-order functions
* llama : remove C++ API + reorganize common source in /common dir
* llama : minor API updates
* llama : avoid hardcoded special tokens
* llama : fix MPI build
ggml-ci
* llama : introduce enum llama_vocab_type + remove hardcoded string constants
* convert-falcon-hf-to-gguf.py : falcon HF --> gguf conversion, not tested
* falcon-main.cpp : falcon inference example
* convert-falcon-hf-to-gguf.py : remove extra kv
* convert-gptneox-hf-to-gguf.py : remove extra kv
* convert-llama-7b-pth-to-gguf.py : remove extra kv
* convert-llama-hf-to-gguf.py : remove extra kv
* gguf.py : fix for falcon 40b
* falcon-main.cpp : fix for falcon 40b
* convert-falcon-hf-to-gguf.py : update ref
* convert-falcon-hf-to-gguf.py : add tensor data layout
* cmpnct_gpt2bpe.hpp : fixes
* falcon-main.cpp : fixes
* gptneox-main.cpp : fixes
* cmpnct_gpt2bpe.hpp : remove non-general stuff
* Update examples/server/README.md
Co-authored-by: slaren <slarengh@gmail.com>
* cmpnct_gpt2bpe.hpp : cleanup
* convert-llama-hf-to-gguf.py : special tokens
* convert-llama-7b-pth-to-gguf.py : special tokens
* convert-permute-debug.py : permute debug print
* convert-permute-debug-master.py : permute debug for master
* convert-permute-debug.py : change permute type of attn_q
* convert.py : 70b model working (change attn_q permute)
* Delete convert-permute-debug-master.py
* Delete convert-permute-debug.py
* convert-llama-hf-to-gguf.py : fix attn_q permute
* gguf.py : fix rope scale kv
* convert-llama-hf-to-gguf.py : rope scale and added tokens
* convert-llama-7b-pth-to-gguf.py : rope scale and added tokens
* llama.cpp : use rope scale kv
* convert-llama-7b-pth-to-gguf.py : rope scale fix
* convert-llama-hf-to-gguf.py : rope scale fix
* py : fix whitespace
* gguf : add Python script to convert GGMLv3 LLaMA models to GGUF (#2682)
* First pass at converting GGMLv3 LLaMA models to GGUF
* Cleanups, better output during conversion
* Fix vocab space conversion logic
* More vocab conversion fixes
* Add description to converted GGUF files
* Improve help text, expand warning
* Allow specifying name and description for output GGUF
* Allow overriding vocab and hyperparams from original model metadata
* Use correct params override var name
* Fix wrong type size for Q8_K
Better handling of original style metadata
* Set default value for gguf add_tensor raw_shape KW arg
* llama : improve token type support (#2668)
* Merge tokenizer fixes into the gguf branch.
* Add test vocabularies
* Adapt convert-new.py (and fix a clang-cl compiler error on windows)
* Improved tokenizer test
But does it work on MacOS?
* Improve token type support
- Added @klosax code to convert.py
- Improved token type support in vocabulary
* Exclude platform dependent tests
* More sentencepiece compatibility by eliminating magic numbers
* Restored accidentally removed comment
* llama : add API for token type
ggml-ci
* tests : use new tokenizer type API (#2692)
* Merge tokenizer fixes into the gguf branch.
* Add test vocabularies
* Adapt convert-new.py (and fix a clang-cl compiler error on windows)
* Improved tokenizer test
But does it work on MacOS?
* Improve token type support
- Added @klosax code to convert.py
- Improved token type support in vocabulary
* Exclude platform dependent tests
* More sentencepiece compatibility by eliminating magic numbers
* Restored accidentally removed comment
* Improve commentary
* Use token type API in test-tokenizer-1.cpp
* py : cosmetics
* readme : add notice about new file format
ggml-ci
---------
Co-authored-by: M. Yusuf Sarıgöz <yusufsarigoz@gmail.com>
Co-authored-by: klosax <131523366+klosax@users.noreply.github.com>
Co-authored-by: goerch <jhr.walter@t-online.de>
Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>

2023-08-21 23:07:43 +03:00
							
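Two entries in the squashed log above, `gguf : support custom alignment value` and `add ggml_nbytes_pad()`, concern padding tensor data to an alignment boundary. A minimal sketch of that arithmetic, assuming the alignment is a power of two (32 is used as the example value); `pad_to_alignment` and `padding_bytes` are hypothetical helpers, not ggml functions.

```cpp
#include <cassert>
#include <cstddef>

// Round `offset` up to the next multiple of `alignment`.
// Assumes `alignment` is a non-zero power of two, which makes the mask trick valid.
inline std::size_t pad_to_alignment(std::size_t offset, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (offset + alignment - 1) & ~(alignment - 1);
}

// Number of filler bytes a writer would emit after `offset`
// so that the next block of tensor data starts on an aligned boundary.
inline std::size_t padding_bytes(std::size_t offset, std::size_t alignment) {
    return pad_to_alignment(offset, alignment) - offset;
}
```

For example, with an alignment of 32, data that would start at byte 100 is moved to byte 128, leaving 28 padding bytes.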

Kawrakow
cb1c0727bd
HellaSwag: split token evaluation into batches if needed (#2681)

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2023-08-21 11:11:31 +03:00
							
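The batching change can be pictured as cutting a task's token sequence into consecutive ranges no longer than the batch size and evaluating them in order. A minimal sketch under that assumption; `split_into_batches` and `n_batch` are illustrative names, and the HellaSwag code in perplexity.cpp may organise the loop differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Split `n_tokens` tokens into consecutive (start, len) ranges, each at most
// `n_batch` tokens long. When the whole sequence already fits in one batch,
// a single range covering it is returned.
std::vector<std::pair<std::size_t, std::size_t>>
split_into_batches(std::size_t n_tokens, std::size_t n_batch) {
    assert(n_batch > 0);
    std::vector<std::pair<std::size_t, std::size_t>> batches;
    for (std::size_t start = 0; start < n_tokens; start += n_batch) {
        const std::size_t len = std::min(n_batch, n_tokens - start);
        batches.emplace_back(start, len);
    }
    return batches;
}
```

In an evaluation loop, each `(start, len)` range would be fed to the model in order, with `start` acting as the number of tokens already processed.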

Kawrakow
5e9ff54a67
More efficient Hellaswag implementation (#2677)

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2023-08-20 16:44:46 +03:00
							
Georgi Gerganov
e9b12c332e
perplexity : more meaningful ETA number - 2 decimal points

2023-08-18 12:48:55 +03:00
							
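A two-decimal ETA amounts to scaling the time measured for one chunk by the number of chunks and printing the result with `%.2f`. A minimal sketch; `format_eta` is a hypothetical helper and the exact text printed by perplexity.cpp may differ.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Estimate the total run time from the time measured for one chunk and
// format it in minutes with two decimal places.
// Hypothetical helper: the wording printed by perplexity.cpp may differ.
std::string format_eta(double seconds_per_chunk, int n_chunks) {
    const double total_seconds = seconds_per_chunk * n_chunks;
    char buf[64];
    std::snprintf(buf, sizeof(buf), "%.2f minutes", total_seconds / 60.0);
    return std::string(buf);
}
```

With this formatting, an estimate of 45 seconds prints as `0.75 minutes`.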

Borislav Stanimirov
ff966e7ca6
build : fix several cast and printf warnings (#2499)

2023-08-04 13:07:21 +03:00
							
klosax
8a88e5855c
perplexity : add Hellaswag calculation (#2389)

* common.h : add hellaswag / remove perplexity-lines
* common.cpp : add hellaswag / remove perplexity-lines
* perplexity.cpp : add hellaswag scores / remove perplexity-lines
* perplexity.cpp : clean up
* common.h : change default param value
* common.cpp : Change default param
* perplexity.cpp : alter wording
* common.h : alter wording
* common.cpp : alter wording

2023-07-28 21:25:36 +03:00
							
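HellaSwag is commonly scored by picking, for each task, the candidate ending whose tokens have the highest average log-probability given the context, then reporting the percentage of tasks where that pick matches the label. A minimal sketch of that rule, assuming per-token log-probabilities have already been computed; `HellaswagTask`, `best_ending` and `hellaswag_accuracy` are hypothetical names, and this is not claimed to be the exact computation in perplexity.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// One task: per-token log-probabilities for each candidate ending
// (typically four) plus the index of the correct ending.
struct HellaswagTask {
    std::vector<std::vector<double>> ending_logprobs;
    std::size_t label;
};

// Index of the ending with the highest mean token log-probability.
std::size_t best_ending(const HellaswagTask& task) {
    assert(!task.ending_logprobs.empty());
    std::size_t best = 0;
    double best_score = -std::numeric_limits<double>::infinity();
    for (std::size_t i = 0; i < task.ending_logprobs.size(); ++i) {
        const auto& lp = task.ending_logprobs[i];
        assert(!lp.empty());
        double sum = 0.0;
        for (double v : lp) sum += v;
        const double mean = sum / static_cast<double>(lp.size());
        if (mean > best_score) {
            best_score = mean;
            best = i;
        }
    }
    return best;
}

// Percentage of tasks whose highest-scoring ending matches the label.
double hellaswag_accuracy(const std::vector<HellaswagTask>& tasks) {
    if (tasks.empty()) return 0.0;
    std::size_t correct = 0;
    for (const auto& t : tasks) {
        if (best_ending(t) == t.label) ++correct;
    }
    return 100.0 * static_cast<double>(correct) / static_cast<double>(tasks.size());
}
```

Averaging rather than summing keeps longer endings from being penalised simply for having more tokens.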

klosax
b5fe67f8c6
Perplexity: Compute scores correlated to HellaSwag (#2312)

* Add parameter --perplexity-lines to perplexity.cpp

2023-07-22 14:21:24 +02:00
							
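Perplexity is the exponential of the mean negative log-likelihood per token; a per-line option applies that formula to each line separately instead of to fixed-size chunks. A minimal sketch, assuming natural-log token probabilities are already available for every line; `perplexity` and `perplexity_per_line` are hypothetical helpers.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Perplexity of a token sequence from its per-token natural-log probabilities:
//   PPL = exp(-(1/N) * sum_i log p(x_i | x_<i))
double perplexity(const std::vector<double>& token_logprobs) {
    assert(!token_logprobs.empty());
    double nll = 0.0;
    for (double lp : token_logprobs) nll -= lp;
    return std::exp(nll / static_cast<double>(token_logprobs.size()));
}

// Score every line independently, one perplexity value per line.
std::vector<double> perplexity_per_line(const std::vector<std::vector<double>>& lines) {
    std::vector<double> out;
    out.reserve(lines.size());
    for (const auto& line : lines) out.push_back(perplexity(line));
    return out;
}
```

A model that assigns probability 0.5 to every token has perplexity 2; probability 0.25 everywhere gives 4.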

Georgi Gerganov
d01bccde9f
ci : integrate with ggml-org/ci (#2250)

* ci : run ctest
ggml-ci
* ci : add open llama 3B-v2 tests
ggml-ci
* ci : disable wget progress output
ggml-ci
* ci : add open llama 3B-v2 tg tests for q4 and q5 quantizations
ggml-ci
* tests : try to fix tail free sampling test
ggml-ci
* ci : add K-quants
ggml-ci
* ci : add short perplexity tests
ggml-ci
* ci : add README.md
* ppl : add --chunks argument to limit max number of chunks
ggml-ci
* ci : update README

2023-07-18 14:24:43 +03:00
							
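The `--chunks` argument caps how many chunks perplexity is computed over, which suits the short CI perplexity tests listed in the same commit. A minimal sketch, assuming the input is cut into chunks of `n_ctx` tokens, that only complete chunks are scored, and that a non-positive cap means "no limit"; `chunks_to_run` is a hypothetical helper.

```cpp
#include <algorithm>
#include <cassert>

// Number of perplexity chunks to evaluate.
// Assumptions of this sketch: the token stream is cut into chunks of `n_ctx`
// tokens, a trailing partial chunk is dropped, and max_chunks <= 0 means no cap.
int chunks_to_run(int n_tokens, int n_ctx, int max_chunks) {
    assert(n_ctx > 0);
    const int available = n_tokens / n_ctx;  // complete chunks only
    return max_chunks > 0 ? std::min(available, max_chunks) : available;
}
```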

Evan Miller
5656d10599
mpi : add support for distributed inference via MPI (#2099)

* MPI support, first cut
* fix warnings, update README
* fixes
* wrap includes
* PR comments
* Update CMakeLists.txt
* Add GH workflow, fix test
* Add info to README
* mpi : trying to move more MPI stuff into ggml-mpi (WIP) (#2099)
* mpi : add names for layer inputs + prep ggml_mpi_graph_compute()
* mpi : move all MPI logic into ggml-mpi
Not tested yet
* mpi : various fixes - communication now works but results are wrong
* mpi : fix output tensor after MPI compute (still not working)
* mpi : fix inference
* mpi : minor
* Add OpenMPI to GH action
* [mpi] continue-on-error: true
* mpi : fix after master merge
* [mpi] Link MPI C++ libraries to fix OpenMPI
* tests : fix new llama_backend API
* [mpi] use MPI_INT32_T
* mpi : factor out recv / send in functions and reuse
* mpi : extend API to allow usage with outer backends (e.g. Metal)
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

2023-07-10 18:49:56 +03:00
							
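Distributing inference across MPI processes requires deciding which layers each rank evaluates. The log does not say how ggml-mpi makes that choice; a simple option, assumed here, is to give each rank a contiguous block of layers of nearly equal size. The sketch leaves out MPI itself so it runs standalone; `layer_range` is a hypothetical helper.

```cpp
#include <cassert>
#include <utility>

// Contiguous [first, last) layer range owned by `rank` when `n_layers`
// layers are divided as evenly as possible among `n_ranks` processes.
// The first (n_layers % n_ranks) ranks each receive one extra layer.
std::pair<int, int> layer_range(int n_layers, int n_ranks, int rank) {
    assert(n_ranks > 0 && rank >= 0 && rank < n_ranks);
    const int base  = n_layers / n_ranks;
    const int extra = n_layers % n_ranks;
    const int first = rank * base + (rank < extra ? rank : extra);
    const int count = base + (rank < extra ? 1 : 0);
    return {first, first + count};
}
```

In a pipeline arrangement, each rank would evaluate only its range and hand the resulting activations to the next rank, for example with `MPI_Send` and `MPI_Recv`.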

Judd
36680f6e40
convert : update for baichuan (#2081)

1. guess n_layers;
2. relax warnings on context size;
3. add a note that its derivatives are also supported.
Co-authored-by: Judd <foldl@boxvest.com>

2023-07-06 19:23:49 +03:00
							
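The note "guess n_layers" suggests the converter infers the layer count from the checkpoint rather than reading it from a config. One plausible rule, assumed here, is to scan tensor names of the Hugging Face form `model.layers.<i>.` and take the highest index plus one. The sketch is in C++ for consistency with the rest of this page, although the converter itself is a Python script; `guess_n_layers` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Infer the number of transformer blocks from tensor names that follow the
// pattern "model.layers.<i>.<...>" (an assumption of this sketch).
// Returns 1 + the highest block index seen, or 0 if no name matches.
int guess_n_layers(const std::vector<std::string>& tensor_names) {
    const std::string prefix = "model.layers.";
    int max_index = -1;
    for (const auto& name : tensor_names) {
        if (name.compare(0, prefix.size(), prefix) != 0) continue;
        const std::size_t pos = prefix.size();
        std::size_t end = pos;
        while (end < name.size() && name[end] >= '0' && name[end] <= '9') ++end;
        // Require at least one digit followed by a '.' separator.
        if (end == pos || end >= name.size() || name[end] != '.') continue;
        const int index = std::stoi(name.substr(pos, end - pos));
        if (index > max_index) max_index = index;
    }
    return max_index + 1;
}
```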

Howard Su
b8c8dda75f
Use unsigned for random seed (#2006)

* Use unsigned for random seed. Keep -1 as the value to use a time-based seed.
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

2023-06-29 06:15:15 -07:00
							
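Keeping -1 as the "time-based seed" value still works with an unsigned type because converting -1 to an unsigned 32-bit integer yields 0xFFFFFFFF, which then acts as the sentinel. A minimal sketch, assuming a 32-bit unsigned seed; `kTimeSeed` and `resolve_seed` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <ctime>

// With an unsigned 32-bit seed, the old -1 value converts to 0xFFFFFFFF,
// so that value serves as the "use the current time" sentinel.
// kTimeSeed is a hypothetical name used only for this sketch.
constexpr std::uint32_t kTimeSeed = static_cast<std::uint32_t>(-1);

std::uint32_t resolve_seed(std::uint32_t requested) {
    if (requested == kTimeSeed) {
        return static_cast<std::uint32_t>(std::time(nullptr));
    }
    return requested;  // any other value is used as-is
}
```

Callers can still pass `-1`; after conversion to `uint32_t` it equals the sentinel.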

zrm
b853d45601
ggml : add NUMA support (#1556)

* detect NUMA systems and pin work threads to nodes (linux)
* disable mmap prefetch/readahead for NUMA systems
* avoid sending finalize op to thread pool if it does nothing
* silence robot
* fix args
* make --numa a param
* recommendation that n_nodes evenly divide n_threads did not warrant such aggressive enforcement
* lower synchronization overhead
* statically allocate
* move numa state to g_state
* add description for --numa
* ggml : minor style changes
* ggml : minor style + try fix sanitizer build
* llama : allow to initialize backend with NUMA support
* llama : avoid ggml include in llama-util.h
* ggml : style / formatting
* ggml : fix handling of ops with n_threads > n_tasks > 1
* server : utilize numa parameter
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

2023-06-26 20:57:59 +03:00
							
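The NUMA commit pins worker threads to nodes and drops the requirement that the node count evenly divide the thread count. One simple mapping that tolerates uneven counts, assumed here rather than taken from ggml, groups threads into contiguous blocks of ceil(n_threads / n_nodes). The sketch only computes the node index; actually pinning a thread (for example with `pthread_setaffinity_np` on Linux) is left out. `numa_node_for_thread` is hypothetical.

```cpp
#include <cassert>

// Map worker thread `thread_id` (0-based) to a NUMA node by grouping threads
// into contiguous blocks of ceil(n_threads / n_nodes). When n_nodes does not
// divide n_threads, trailing nodes receive fewer (possibly zero) threads.
int numa_node_for_thread(int thread_id, int n_threads, int n_nodes) {
    assert(n_nodes > 0 && n_threads > 0);
    assert(thread_id >= 0 && thread_id < n_threads);
    const int per_node = (n_threads + n_nodes - 1) / n_nodes;
    return thread_id / per_node;  // always < n_nodes
}
```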

Didzis Gosko
527b6fba1d
llama : make model stateless and context stateful (llama_state) (#1797)

* llama : make model stateless and context stateful
* llama : minor cleanup
* llama : update internal API declaration
* Apply suggestions from code review
fix style
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Missing model memory release
* Fix style
* Add deprecated warning for public API function llama_init_from_file
* Update public API use cases: move away from deprecated llama_init_from_file
* Deprecate public API function llama_apply_lora_from_file
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

2023-06-24 11:47:58 +03:00
							
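The commit separates the model (weights, loaded once and then read-only) from the context (per-session mutable state), so several contexts can share one model. A minimal sketch of that ownership split with hypothetical `Model` and `Context` types; these are not the llama.h structures or functions.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types illustrating the split: the model holds read-only data,
// each context holds its own mutable session state.
struct Model {
    std::string name;
    std::vector<float> weights;  // loaded once, never modified afterwards
};

struct Context {
    std::shared_ptr<const Model> model;  // shared by every context, read-only
    std::vector<float> kv_cache;         // stand-in for per-session state
    std::size_t n_past = 0;              // tokens processed in this session

    Context(std::shared_ptr<const Model> m, std::size_t kv_size)
        : model(std::move(m)), kv_cache(kv_size, 0.0f) {}

    // Stand-in for an evaluation step: only this context's state changes.
    void eval(std::size_t n_tokens) { n_past += n_tokens; }
};
```

Two contexts created from the same `Model` keep independent `n_past` counters while pointing at the same weights.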

Borislav Stanimirov
9cbf50c041
build : fix and ignore MSVC warnings (#1889)

2023-06-16 21:23:53 +03:00
							
Georgi Gerganov
ec2e10c444
llama : add llama_init_backend() API (close #1527)

2023-05-20 11:06:37 +03:00