Georgi Gerganov 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								ec893798b7 
								
							 
						 
						
							
							
								
								llama : custom attention mask + parallel decoding + no context swaps ( #3228 )  
							
							... 
							
							
							
							* tests : verify that RoPE is "additive"
* llama : replace ggml_diag_mask_inf with ggml_add (custom -inf mask)
* ggml : ggml_rope now takes a vector with positions instead of n_past
* metal : add rope_f16 kernel + optimize cpy kernels
* llama : unified KV cache + batch inference API
* llama : add new llama_decode() API that works with llama_batch
* llama : add cell_max heuristic for more efficient kv_cache
* llama : extend llama_kv_cache API
* llama : more robust cell_max heuristic + wip shift
* metal : disable concurrency optimization
* llama : add llama_kv_cache_shift_seq + no more context swaps
* llama : apply K-cache roping for Falcon and Baichuan
* speculative : fix KV cache management
* parallel : example for serving multiple users in parallel
* parallel : disable hot-plug to avoid cache fragmentation
* fixes : speculative KV cache + llama worst-case graph
* llama : extend batch API to select which logits to output
* llama : fix worst case graph build
* ggml-cuda : update rope implementation for parallel decoding (#3254 )
* ggml-cuda : update rope implementation for parallel decoding
* better solution for p0 computation
* fix rope
* simpler rope implementation
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* make : add parallel to build + fix static functions in llama.cpp
* simple : fix token counting
* parallel : various improvements
* llama : fix cell_max logic + rename functions
* parallel : try smaller batches when the KV cache is fragmented
* parallel : fix sequence termination criteria
* llama : silence errors KV cache errors
* parallel : remove new line from prompt
* parallel : process system prompt once + configurable paramters + llama API
* parallel : remove question with short answers
* parallel : count cache misses
* parallel : print misses on each request
* parallel : minor
* llama : fix n_kv to never become 0
* parallel : rename hot-plug to continuous-batching
* llama : improve llama_batch API + simplify parallel example
* simple : add parallel decoding support
* simple : improve comments + free batch
* ggml-cuda : add rope f16, restore performance with parallel decoding (#3272 )
* ggml-cuda : add rope f16, restore performance
* offload KQ_mask with all models
* fix rope shift
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* llama : disable MPI for now
ggml-ci
* train : make KQ_pos memory buffer permanent via dummy scale op
* ggml : revert change to ggml_cpy, add ggml_cont_Nd instead (#3275 )
ggml-ci
* parallel : fix bug (extra BOS) + smaller token_prev array
* parallel : fix cases where the input prompts can overflow the batch
* parallel : add disabled experimental batch chunking in powers of two
* llama : llama.h formatting + comments
* simple : add README.md
* llama : fix kv cache heuristic when context is less than 32
* parallel : fix crash when `-n -1`
* llama : simplify returns if/else branches
* metal : use mm kernels for batch size > 2
* examples : utilize new llama_get_logits_ith()
* examples : add example for batched decoding
* examples : do not eval prompt 2 times (close  #3348 )
* server : clear the KV cache beyond n_past before llama_decode
* server : avoid context swaps by shifting the KV cache
---------
Co-authored-by: slaren <slarengh@gmail.com> 
							
						 
						
							2023-09-28 19:04:36 +03:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									BarfingLemurs 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								ffe88a36a9 
								
							 
						 
						
							
							
								
								readme : add some recent perplexity and bpw measurements to READMES, link for k-quants ( #3340 )  
							
							... 
							
							
							
							* Update README.md
* Update README.md
* Update README.md with k-quants bpw measurements 
							
						 
						
							2023-09-27 18:30:36 +03:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Cebtenzzre 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								8781013ef6 
								
							 
						 
						
							
							
								
								make : restore build-info.h dependency for several targets ( #3205 )  
							
							
							
						 
						
							2023-09-18 10:03:53 -04:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Cebtenzzre 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								e6616cf0db 
								
							 
						 
						
							
							
								
								examples : add compiler version and target to build info ( #2998 )  
							
							
							
						 
						
							2023-09-15 16:59:49 -04:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Cebtenzzre 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								3aefaab9e5 
								
							 
						 
						
							
							
								
								check C++ code with -Wmissing-declarations ( #3184 )  
							
							
							
						 
						
							2023-09-15 15:38:27 -04:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Cebtenzzre 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								e64f5b5578 
								
							 
						 
						
							
							
								
								examples : make n_ctx warning work again ( #3066 )  
							
							... 
							
							
							
							This was broken by commit e36ecdcc#2901 )"). 
							
						 
						
							2023-09-08 11:43:35 -04:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Cebtenzzre 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								00d62adb79 
								
							 
						 
						
							
							
								
								fix some warnings from gcc and clang-tidy ( #3038 )  
							
							... 
							
							
							
							Co-authored-by: xaedes <xaedes@gmail.com> 
							
						 
						
							2023-09-07 13:22:29 -04:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Georgi Gerganov 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								e36ecdccc8 
								
							 
						 
						
							
							
								
								build : on Mac OS enable Metal by default ( #2901 )  
							
							... 
							
							
							
							* build : on Mac OS enable Metal by default
* make : try to fix build on Linux
* make : move targets back to the top
* make : fix target clean
* llama : enable GPU inference by default with Metal
* llama : fix vocab_only logic when GPU is enabled
* common : better `n_gpu_layers` assignment
* readme : update Metal instructions
* make : fix merge conflict remnants
* gitignore : metal 
							
						 
						
							2023-09-04 22:26:24 +03:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Kawrakow 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								fa3582f509 
								
							 
						 
						
							
							
								
								Tell users attmepting to run perplexity with too few tokens to use more ( #2882 )  
							
							... 
							
							
							
							Closes  #2858 
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> 
						
							2023-08-29 23:55:45 +03:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Johannes Gäßler 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								6b73ef1201 
								
							 
						 
						
							
							
								
								YAML result logging + preset script ( #2657 )  
							
							
							
						 
						
							2023-08-28 17:59:39 +02:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Kawrakow 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								463173a6c0 
								
							 
						 
						
							
							
								
								llama : speedup tokenization ( #2831 )  
							
							... 
							
							
							
							* Speedup tokenization
On current master it takes ~3.2 seconds to tokenize
Wikitext. With this change it becomes ~525 ms.
* Fixit: it was missing the piece after the last found occurence
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> 
							
						 
						
							2023-08-27 16:50:33 +03:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Georgi Gerganov 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								edd4c14817 
								
							 
						 
						
							
							
								
								llama : more tokenizer fixes ( #2810 )  
							
							... 
							
							
							
							* tests : write a Python tokenizer test (wip)
* llama : prefix input text for tokenization with whitespace
* llama : distinguish pieces from decoded text + fix detokenization
* common : add comments
* examples : no longer manually add leading space when tokenizing
* tests : use Python to generate tokenizer tests for C++
* tests : add option to tokenize text files
ggml-ci
* tests : add test-tokenizer-1.py
* llama.cpp : fix LF token
* hellaswag : move the concat space for clarity
* tests : add falcon tests (py + cpp, currently do not pass Unicode)
ggml-ci
* common : temporary separate llama_detokenize calls for SPM and BPE
---------
Co-authored-by: klosax <131523366+klosax@users.noreply.github.com> 
							
						 
						
							2023-08-27 14:19:19 +03:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Kawrakow 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								771551a793 
								
							 
						 
						
							
							
								
								Fix HellaSwag ( #2805 )  
							
							... 
							
							
							
							Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> 
							
						 
						
							2023-08-26 16:48:53 +03:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Kawrakow 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								d046dcee08 
								
							 
						 
						
							
							
								
								Faster perplexity computation ( #2786 )  
							
							... 
							
							
							
							Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> 
							
						 
						
							2023-08-25 19:05:02 +03:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Georgi Gerganov 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								cf658adc83 
								
							 
						 
						
							
							
								
								llm : add Falcon support ( #2717 )  
							
							... 
							
							
							
							* llama : refactor GGUF constants into static maps
* llama : check if model architecture is known
* llama : refactor llama_model_load_internal()
* gguf : add KV constant maps
* llm : read arch-specific KVs
* convert : add dummy scores + types
* falcon : load tensor data (CPU only)
* llama : fix loading progress bar
* llama : add arch member to llama_model
* falcon : CPU inference working
* falcon : support non-40B models
* falcon : minor
* llama : minor updates
ggml-ci
* convert-falcon-hf-to-gguf.py : fix special token mapping
* llama.cpp : llama default UNK token = id 0
* llama.cpp : fix bpe tokenizer
* llama.cpp : fix the fix of bpe tokenizer
* ggml : pass eps to ggml_norm
* metal : implement RoPE (mode = 2) + avoid ggml_repeat
* ggml : ggml_repeat always creates new tensor
* falcon : copy-paste self-attention from LLaMA
* metal : print extra compute pipeline info
* falcon : minor changes (still chasing the Metal problem)
* llama.cpp : fix linefeed token
* metal : fix GELU kernel numerical stability by using precise::tanh
* metal : temporary workaround for the concurrency optimization bug
* falcon : add CUDA offloading (#2739 )
* llama : better model naming and size reporting
* llama : prep new tokenizer support
* llama : advanced BPE tokenizer based on ggllm.cpp imlpementation
* llama : remove oboslete comment
ggml-ci
* common : remove obsolete BPE API + disable test-tokenizer-1
* llama : revert BPE special-case in llama_byte_to_token()
* cuda : add TODOs for RoPE NeoX implementation
* llama : default special tokens based on vocab type
* perplexity : add log for start of tokenization
---------
Co-authored-by: klosax <131523366+klosax@users.noreply.github.com>
Co-authored-by: slaren <slarengh@gmail.com> 
							
						 
						
							2023-08-23 23:08:04 +03:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Kawrakow 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								62959e740e 
								
							 
						 
						
							
							
								
								Strided perplexity ( #2714 )  
							
							... 
							
							
							
							* Implementing strided computation of perplexity
* Alternative way to output PPL results
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> 
							
						 
						
							2023-08-23 12:56:42 +03:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Georgi Gerganov 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								6381d4e110 
								
							 
						 
						
							
							
								
								gguf : new file format with flexible meta data (beta) ( #2398 )  
							
							... 
							
							
							
							* gguf : first API pass
* gguf : read header + meta data
* gguf : read tensor info
* gguf : initial model loading - not tested
* gguf : add gguf_get_tensor_name()
* gguf : do not support passing existing ggml_context to gguf_init
* gguf : simplify gguf_get_val
* gguf : gguf.c is now part of ggml.c
* gguf : read / write sample models
* gguf : add comments
* refactor : reduce code duplication and better API (#2415 )
* gguf : expose the gguf_type enum through the API for now
* gguf : add array support
* gguf.py : some code style changes
* convert.py : start a new simplified implementation by removing old stuff
* convert.py : remove GGML vocab + other obsolete stuff
* GGUF : write tensor (#2426 )
* WIP: Write tensor
* GGUF : Support writing tensors in Python
* refactor : rm unused import and upd todos
* fix : fix errors upd writing example
* rm example.gguf
* gitignore *.gguf
* undo formatting
* gguf : add gguf_find_key (#2438 )
* gguf.cpp : find key example
* ggml.h : add gguf_find_key
* ggml.c : add gguf_find_key
* gguf : fix writing tensors
* gguf : do not hardcode tensor names to read
* gguf : write sample tensors to read
* gguf : add tokenization constants
* quick and dirty conversion example
* gguf : fix writing gguf arrays
* gguf : write tensors one by one and code reuse
* gguf : fix writing gguf arrays
* gguf : write tensors one by one
* gguf : write tensors one by one
* gguf : write tokenizer data
* gguf : upd gguf conversion script
* Update convert-llama-h5-to-gguf.py
* gguf : handle already encoded string
* ggml.h : get array str and f32
* ggml.c : get arr str and f32
* gguf.py : support any type
* Update convert-llama-h5-to-gguf.py
* gguf : fix set is not subscriptable
* gguf : update convert-llama-h5-to-gguf.py
* constants.py : add layer norm eps
* gguf.py : add layer norm eps and merges
* ggml.h : increase GGML_MAX_NAME to 64
* ggml.c : add gguf_get_arr_n
* Update convert-llama-h5-to-gguf.py
* add gptneox gguf example
* Makefile : add gptneox gguf example
* Update convert-llama-h5-to-gguf.py
* add gptneox gguf example
* Update convert-llama-h5-to-gguf.py
* Update convert-gptneox-h5-to-gguf.py
* Update convert-gptneox-h5-to-gguf.py
* Update convert-llama-h5-to-gguf.py
* gguf : support custom alignment value
* gguf : fix typo in function call
* gguf : mmap tensor data example
* fix : update convert-llama-h5-to-gguf.py
* Update convert-llama-h5-to-gguf.py
* convert-gptneox-h5-to-gguf.py : Special tokens
* gptneox-main.cpp : special tokens
* Update gptneox-main.cpp
* constants.py : special tokens
* gguf.py : accumulate kv and tensor info data + special tokens
* convert-gptneox-h5-to-gguf.py : accumulate kv and ti + special tokens
* gguf : gguf counterpart of llama-util.h
* gguf-util.h : update note
* convert-llama-h5-to-gguf.py : accumulate kv / ti + special tokens
* convert-llama-h5-to-gguf.py : special tokens
* Delete gptneox-common.cpp
* Delete gptneox-common.h
* convert-gptneox-h5-to-gguf.py : gpt2bpe tokenizer
* gptneox-main.cpp : gpt2 bpe tokenizer
* gpt2 bpe tokenizer (handles merges and unicode)
* Makefile : remove gptneox-common
* gguf.py : bytesarray for gpt2bpe tokenizer
* cmpnct_gpt2bpe.hpp : comments
* gguf.py : use custom alignment if present
* gguf : minor stuff
* Update gptneox-main.cpp
* map tensor names
* convert-gptneox-h5-to-gguf.py : map tensor names
* convert-llama-h5-to-gguf.py : map tensor names
* gptneox-main.cpp : map tensor names
* gguf : start implementing libllama in GGUF (WIP)
* gguf : start implementing libllama in GGUF (WIP)
* rm binary commited by mistake
* upd .gitignore
* gguf : calculate n_mult
* gguf :  inference with 7B model working (WIP)
* gguf : rm deprecated function
* gguf : start implementing gguf_file_saver (WIP)
* gguf : start implementing gguf_file_saver (WIP)
* gguf : start implementing gguf_file_saver (WIP)
* gguf : add gguf_get_kv_type
* gguf : add gguf_get_kv_type
* gguf : write metadata in gguf_file_saver (WIP)
* gguf : write metadata in gguf_file_saver (WIP)
* gguf : write metadata in gguf_file_saver
* gguf : rm references to old file formats
* gguf : shorter name for member variable
* gguf : rm redundant method
* gguf : get rid of n_mult, read n_ff from file
* Update gguf_tensor_map.py
* Update gptneox-main.cpp
* gguf : rm references to old file magics
* gguf : start implementing quantization (WIP)
* gguf : start implementing quantization (WIP)
* gguf : start implementing quantization (WIP)
* gguf : start implementing quantization (WIP)
* gguf : start implementing quantization (WIP)
* gguf : start implementing quantization (WIP)
* gguf : quantization is working
* gguf : roper closing of file
* gguf.py : no need to convert tensors twice
* convert-gptneox-h5-to-gguf.py : no need to convert tensors twice
* convert-llama-h5-to-gguf.py : no need to convert tensors twice
* convert-gptneox-h5-to-gguf.py : simplify nbytes
* convert-llama-h5-to-gguf.py : simplify nbytes
* gptneox-main.cpp : n_layer --> n_block
* constants.py : n_layer --> n_block
* gguf.py : n_layer --> n_block
* convert-gptneox-h5-to-gguf.py : n_layer --> n_block
* convert-llama-h5-to-gguf.py : n_layer --> n_block
* gptneox-main.cpp : n_layer --> n_block
* Update gguf_tensor_map.py
* convert-gptneox-h5-to-gguf.py : load model in parts to save memory
* convert-llama-h5-to-gguf.py : load model in parts to save memory
* convert : write more metadata for LLaMA
* convert : rm quantization version
* convert-gptneox-h5-to-gguf.py : add file_type key
* gptneox-main.cpp : add file_type key
* fix conflicts
* gguf : add todos and comments
* convert-gptneox-h5-to-gguf.py : tensor name map changes
* Create gguf_namemap.py : tensor name map changes
* Delete gguf_tensor_map.py
* gptneox-main.cpp : tensor name map changes
* convert-llama-h5-to-gguf.py : fixes
* gguf.py : dont add empty strings
* simple : minor style changes
* gguf : use UNIX line ending
* Create convert-llama-7b-pth-to-gguf.py
* llama : sync gguf-llama.cpp with latest llama.cpp (#2608 )
* llama : sync gguf-llama.cpp with latest llama.cpp
* minor : indentation + assert
* llama : refactor gguf_buffer and gguf_ctx_buffer
* llama : minor
* gitignore : add gptneox-main
* llama : tokenizer fixes (#2549 )
* Merge tokenizer fixes into the gguf branch.
* Add test vocabularies
* convert : update convert-new.py with tokenizer fixes (#2614 )
* Merge tokenizer fixes into the gguf branch.
* Add test vocabularies
* Adapt convert-new.py (and fix a clang-cl compiler error on windows)
* llama : sync gguf-llama with llama (#2613 )
* llama : sync gguf-llama with llama
* tests : fix build + warnings (test-tokenizer-1 still fails)
* tests : fix wstring_convert
* convert : fix layer names
* llama : sync gguf-llama.cpp
* convert : update HF converter to new tokenizer voodoo magics
* llama : update tokenizer style
* convert-llama-h5-to-gguf.py : add token types
* constants.py : add token types
* gguf.py : add token types
* convert-llama-7b-pth-to-gguf.py : add token types
* gguf-llama.cpp :  fix n_head_kv
* convert-llama-h5-to-gguf.py : add 70b gqa support
* gguf.py : add tensor data layout
* convert-llama-h5-to-gguf.py : add tensor data layout
* convert-llama-7b-pth-to-gguf.py : add tensor data layout
* gptneox-main.cpp : add tensor data layout
* convert-llama-h5-to-gguf.py : clarify the reverse permute
* llama : refactor model loading code (#2620 )
* llama : style formatting + remove helper methods
* llama : fix quantization using gguf tool
* llama : simplify gguf_file_saver
* llama : fix method names
* llama : simplify write_header()
* llama : no need to pass full file loader to the file saver
just gguf_ctx
* llama : gguf_file_saver write I32
* llama : refactor tensor names (#2622 )
* gguf: update tensor names searched in quantization
* gguf : define tensor names as constants
* gguf : initial write API (not tested yet)
* gguf : write to file API (not tested)
* gguf : initial write API ready + example
* gguf : fix header write
* gguf : fixes + simplify example + add ggml_nbytes_pad()
* gguf : minor
* llama : replace gguf_file_saver with new gguf write API
* gguf : streaming support when writing files
* gguf : remove oboslete write methods
* gguf : remove obosolete gguf_get_arr_xxx API
* llama : simplify gguf_file_loader
* llama : move hparams and vocab from gguf_file_loader to llama_model_loader
* llama : merge gguf-util.h in llama.cpp
* llama : reorder definitions in .cpp to match .h
* llama : minor simplifications
* llama : refactor llama_model_loader (WIP)
wip : remove ggml_ctx from llama_model_loader
wip : merge gguf_file_loader in llama_model_loader
* llama : fix shape prints
* llama : fix Windows build + fix norm_rms_eps key
* llama : throw error on missing KV paris in model meta data
* llama : improve printing + log meta data
* llama : switch print order of meta data
---------
Co-authored-by: M. Yusuf Sarıgöz <yusufsarigoz@gmail.com>
* gguf : deduplicate (#2629 )
* gguf : better type names
* dedup : CPU + Metal is working
* ggml : fix warnings about unused results
* llama.cpp : fix line feed and compiler warning
* llama : fix strncpy warning + note token_to_str does not write null
* llama : restore the original load/save session implementation
Will migrate this to GGUF in the future
* convert-llama-h5-to-gguf.py : support alt ctx param name
* ggml : assert when using ggml_mul with non-F32 src1
* examples : dedup simple
---------
Co-authored-by: klosax <131523366+klosax@users.noreply.github.com>
* gguf.py : merge all files in gguf.py
* convert-new.py : pick #2427  for HF 70B support
* examples/gguf : no need to keep q option for quantization any more
* llama.cpp : print actual model size
* llama.cpp : use ggml_elements()
* convert-new.py : output gguf (#2635 )
* convert-new.py : output gguf (WIP)
* convert-new.py : add gguf key-value pairs
* llama : add hparams.ctx_train + no longer print ftype
* convert-new.py : minor fixes
* convert-new.py : vocab-only option should work now
* llama : fix tokenizer to use llama_char_to_byte
* tests : add new ggml-vocab-llama.gguf
* convert-new.py : tensor name mapping
* convert-new.py : add map for skipping tensor serialization
* convert-new.py : convert script now works
* gguf.py : pick some of the refactoring from #2644 
* convert-new.py : minor fixes
* convert.py : update to support GGUF output
* Revert "ci : disable CI temporary to not waste energy"
This reverts commit 7e82d25f40#2644 )
* gguf : single pass for writing tensors + refactoring writer
* gguf : single pass for writing tensors + refactoring writer
* gguf : single pass for writing tensors + refactoring writer
* gguf : style fixes in simple conversion script
* gguf : refactor gptneox conversion script
* gguf : rename h5 to hf (for HuggingFace)
* gguf : refactor pth to gguf conversion script
* gguf : rm file_type key and method
* gguf.py : fix vertical alignment
* gguf.py : indentation
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* convert-gptneox-hf-to-gguf.py : fixes
* gguf.py : gptneox mapping
* convert-llama-hf-to-gguf.py : fixes
* convert-llama-7b-pth-to-gguf.py : fixes
* ggml.h : reverse GGUF_MAGIC
* gguf.py : reverse GGUF_MAGIC
* test-tokenizer-0.cpp : fix warning
* llama.cpp : print kv general.name
* llama.cpp : get special token kv and linefeed token id
* llama : print number of tensors per type + print arch + style
* tests : update vocab file with new magic
* editorconfig : fix whitespaces
* llama : re-order functions
* llama : remove C++ API + reorganize common source in /common dir
* llama : minor API updates
* llama : avoid hardcoded special tokens
* llama : fix MPI build
ggml-ci
* llama : introduce enum llama_vocab_type + remove hardcoded string constants
* convert-falcon-hf-to-gguf.py : falcon HF --> gguf conversion, not tested
* falcon-main.cpp : falcon inference example
* convert-falcon-hf-to-gguf.py : remove extra kv
* convert-gptneox-hf-to-gguf.py : remove extra kv
* convert-llama-7b-pth-to-gguf.py : remove extra kv
* convert-llama-hf-to-gguf.py : remove extra kv
* gguf.py : fix for falcon 40b
* falcon-main.cpp : fix for falcon 40b
* convert-falcon-hf-to-gguf.py : update ref
* convert-falcon-hf-to-gguf.py : add tensor data layout
* cmpnct_gpt2bpe.hpp : fixes
* falcon-main.cpp : fixes
* gptneox-main.cpp : fixes
* cmpnct_gpt2bpe.hpp : remove non-general stuff
* Update examples/server/README.md
Co-authored-by: slaren <slarengh@gmail.com>
* cmpnct_gpt2bpe.hpp : cleanup
* convert-llama-hf-to-gguf.py : special tokens
* convert-llama-7b-pth-to-gguf.py : special tokens
* convert-permute-debug.py : permute debug print
* convert-permute-debug-master.py : permute debug for master
* convert-permute-debug.py : change permute type of attn_q
* convert.py : 70b model working (change attn_q permute)
* Delete convert-permute-debug-master.py
* Delete convert-permute-debug.py
* convert-llama-hf-to-gguf.py : fix attn_q permute
* gguf.py : fix rope scale kv
* convert-llama-hf-to-gguf.py : rope scale and added tokens
* convert-llama-7b-pth-to-gguf.py : rope scale and added tokens
* llama.cpp : use rope scale kv
* convert-llama-7b-pth-to-gguf.py : rope scale fix
* convert-llama-hf-to-gguf.py : rope scale fix
* py : fix whitespace
* gguf : add Python script to convert GGMLv3 LLaMA models to GGUF (#2682 )
* First pass at converting GGMLv3 LLaMA models to GGUF
* Cleanups, better output during conversion
* Fix vocab space conversion logic
* More vocab conversion fixes
* Add description to converted GGUF files
* Improve help text, expand warning
* Allow specifying name and description for output GGUF
* Allow overriding vocab and hyperparams from original model metadata
* Use correct params override var name
* Fix wrong type size for Q8_K
Better handling of original style metadata
* Set default value for gguf add_tensor raw_shape KW arg
* llama : improve token type support (#2668 )
* Merge tokenizer fixes into the gguf branch.
* Add test vocabularies
* Adapt convert-new.py (and fix a clang-cl compiler error on windows)
* Improved tokenizer test
But does it work on MacOS?
* Improve token type support
- Added @klosax code to convert.py
- Improved token type support in vocabulary
* Exclude platform dependent tests
* More sentencepiece compatibility by eliminating magic numbers
* Restored accidentally removed comment
* llama : add API for token type
ggml-ci
* tests : use new tokenizer type API (#2692 )
* Merge tokenizer fixes into the gguf branch.
* Add test vocabularies
* Adapt convert-new.py (and fix a clang-cl compiler error on windows)
* Improved tokenizer test
But does it work on MacOS?
* Improve token type support
- Added @klosax code to convert.py
- Improved token type support in vocabulary
* Exclude platform dependent tests
* More sentencepiece compatibility by eliminating magic numbers
* Restored accidentally removed comment
* Improve commentary
* Use token type API in test-tokenizer-1.cpp
* py : cosmetics
* readme : add notice about new file format
ggml-ci
---------
Co-authored-by: M. Yusuf Sarıgöz <yusufsarigoz@gmail.com>
Co-authored-by: klosax <131523366+klosax@users.noreply.github.com>
Co-authored-by: goerch <jhr.walter@t-online.de>
Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com> 
							
						 
						
							2023-08-21 23:07:43 +03:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Kawrakow 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								cb1c0727bd 
								
							 
						 
						
							
							
								
								HellaSwag: split token evaluation into batches if needed ( #2681 )  
							
							... 
							
							
							
							Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> 
							
						 
						
							2023-08-21 11:11:31 +03:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Kawrakow 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								5e9ff54a67 
								
							 
						 
						
							
							
								
								More efficient Hellaswag implementation ( #2677 )  
							
							... 
							
							
							
							Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> 
							
						 
						
							2023-08-20 16:44:46 +03:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Georgi Gerganov 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								e9b12c332e 
								
							 
						 
						
							
							
								
								perplexity : more meaningful ETA number - 2 decimal points  
							
							
							
						 
						
							2023-08-18 12:48:55 +03:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Borislav Stanimirov 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								ff966e7ca6 
								
							 
						 
						
							
							
								
								build : fix several cast and printf warnings ( #2499 )  
							
							
							
						 
						
							2023-08-04 13:07:21 +03:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									klosax 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								8a88e5855c 
								
							 
						 
						
							
							
								
								perplexity : add Hellaswag calculation ( #2389 )  
							
							... 
							
							
							
							* common.h : add hellaswag / remove perplexity-lines
* common.cpp : add hellaswag / remove perplexity-lines
* perplexity.cpp : add hellswag scores / remove perplexity-lines
* perplexity.cpp : clean up
* common.h : change default param value
* common.cpp : Change default param
* perplexity.cpp : alter wording
* common.h : alter wording
* common.cpp : alter wording 
							
						 
						
							2023-07-28 21:25:36 +03:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									klosax 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								b5fe67f8c6 
								
							 
						 
						
							
							
								
								Perplexity: Compute scores correlated to HellaSwag ( #2312 )  
							
							... 
							
							
							
							* Add parameter --perplexity-lines to perplexity.cpp 
							
						 
						
							2023-07-22 14:21:24 +02:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									wzy 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								b1f4290953 
								
							 
						 
						
							
							
								
								cmake : install targets ( #2256 )  
							
							... 
							
							
							
							fix  #2252  
						
							2023-07-19 10:01:11 +03:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Georgi Gerganov 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								d01bccde9f 
								
							 
						 
						
							
							
								
								ci : integrate with ggml-org/ci ( #2250 )  
							
							... 
							
							
							
							* ci : run ctest
ggml-ci
* ci : add open llama 3B-v2 tests
ggml-ci
* ci : disable wget progress output
ggml-ci
* ci : add open llama 3B-v2 tg tests for q4 and q5 quantizations
ggml-ci
* tests : try to fix tail free sampling test
ggml-ci
* ci : add K-quants
ggml-ci
* ci : add short perplexity tests
ggml-ci
* ci : add README.md
* ppl : add --chunks argument to limit max number of chunks
ggml-ci
* ci : update README 
							
						 
						
							2023-07-18 14:24:43 +03:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Evan Miller 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								5656d10599 
								
							 
						 
						
							
							
								
								mpi : add support for distributed inference via MPI ( #2099 )  
							
							... 
							
							
							
							* MPI support, first cut
* fix warnings, update README
* fixes
* wrap includes
* PR comments
* Update CMakeLists.txt
* Add GH workflow, fix test
* Add info to README
* mpi : trying to move more MPI stuff into ggml-mpi (WIP) (#2099 )
* mpi : add names for layer inputs + prep ggml_mpi_graph_compute()
* mpi : move all MPI logic into ggml-mpi
Not tested yet
* mpi : various fixes - communication now works but results are wrong
* mpi : fix output tensor after MPI compute (still not working)
* mpi : fix inference
* mpi : minor
* Add OpenMPI to GH action
* [mpi] continue-on-error: true
* mpi : fix after master merge
* [mpi] Link MPI C++ libraries to fix OpenMPI
* tests : fix new llama_backend API
* [mpi] use MPI_INT32_T
* mpi : factor out recv / send in functions and reuse
* mpi : extend API to allow usage with outer backends (e.g. Metal)
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> 
							
						 
						
							2023-07-10 18:49:56 +03:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Judd 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								36680f6e40 
								
							 
						 
						
							
							
								
								convert : update for baichuan ( #2081 )  
							
							... 
							
							
							
							1. guess n_layers;
2. relax warnings on context size;
3. add a note that its derivations are also supported.
Co-authored-by: Judd <foldl@boxvest.com> 
							
						 
						
							2023-07-06 19:23:49 +03:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Howard Su 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								b8c8dda75f 
								
							 
						 
						
							
							
								
								Use unsigned for random seed ( #2006 )  
							
							... 
							
							
							
							* Use unsigned for random seed. Keep -1 as the value to use a time based seed.
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> 
							
						 
						
							2023-06-29 06:15:15 -07:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									zrm 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								b853d45601 
								
							 
						 
						
							
							
								
								ggml : add NUMA support ( #1556 )  
							
							... 
							
							
							
							* detect NUMA systems and pin work threads to nodes (linux)
* disable mmap prefetch/readahead for NUMA systems
* avoid sending finalize op to thread pool if it does nothing
* silence robot
* fix args
* make --numa a param
* recommendation that n_nodes evenly divide n_threads did not warrant such aggressive enforcement
* lower synchronization overhead
* statically allocate
* move numa state to g_state
* add description for --numa
* ggml : minor style changes
* ggml : minor style + try fix sanitizer build
* llama : allow to initialize backend with NUMA support
* llama : avoid ggml include in llama-util.h
* ggml : style / formatting
* ggml : fix handling of ops with n_threads > n_tasks > 1
* server : utilize numa parameter
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> 
							
						 
						
							2023-06-26 20:57:59 +03:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Didzis Gosko 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								527b6fba1d 
								
							 
						 
						
							
							
								
								llama : make model stateless and context stateful (llama_state) ( #1797 )  
							
							... 
							
							
							
							* llama : make model stateless and context stateful
* llama : minor cleanup
* llama : update internal API declaration
* Apply suggestions from code review
fix style
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Missing model memory release
* Fix style
* Add deprecated warning for public API function llama_init_from_file
* Update public API use cases: move away from deprecated llama_init_from_file
* Deprecate public API function llama_apply_lora_from_file
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> 
							
						 
						
							2023-06-24 11:47:58 +03:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Borislav Stanimirov 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								9cbf50c041 
								
							 
						 
						
							
							
								
								build : fix and ignore MSVC warnings ( #1889 )  
							
							
							
						 
						
							2023-06-16 21:23:53 +03:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Georgi Gerganov 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								ec2e10c444 
								
							 
						 
						
							
							
								
								llama : add llama_init_backend() API ( close   #1527 )  
							
							
							
						 
						
							2023-05-20 11:06:37 +03:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									András Salamon 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								9560655409 
								
							 
						 
						
							
							
								
								define default model path once, sync path with readme ( #1366 )  
							
							
							
						 
						
							2023-05-16 17:46:34 +02:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Georgi Gerganov 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								f9a6364912 
								
							 
						 
						
							
							
								
								llama : require first token to be BOS ( #1303 )  
							
							... 
							
							
							
							* llama : require first token to be BOS
* scripts : add ppl-run-all.sh
* perplexity : add BOS for each chunk
* readme : update perplexity values after BOS fix
* perplexity : add clarifying comments 
							
						 
						
							2023-05-08 17:41:54 +03:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Ron Evans 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								67c77799e0 
								
							 
						 
						
							
							
								
								examples : add llama_init_from_gpt_params() common function ( #1290 )  
							
							... 
							
							
							
							Signed-off-by: deadprogram <ron@hybridgroup.com> 
							
						 
						
							2023-05-02 23:39:51 +03:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Robert Brisita 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								2bb992f034 
								
							 
						 
						
							
							
								
								llama : allow 0 as a seed number. ( #1275 )  
							
							
							
						 
						
							2023-05-02 19:23:44 +03:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									DannyDaemonic 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								f4cef87edf 
								
							 
						 
						
							
							
								
								Add git-based build information for better issue tracking ( #1232 )  
							
							... 
							
							
							
							* Add git-based build information for better issue tracking
* macOS fix
* "build (hash)" and "CMAKE_SOURCE_DIR" changes
* Redo "CMAKE_CURRENT_SOURCE_DIR" and clearer build messages
* Fix conditional dependency on missing target
* Broke out build-info.cmake, added find_package fallback, and added build into to all examples, added dependencies to Makefile
* 4 space indenting for cmake, attempt to clean up my mess in Makefile
* Short hash, less fancy Makefile, and don't modify build-info.h if it wouldn't change it 
							
						 
						
							2023-05-01 18:23:47 +02:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									slaren 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								3d59769c3b 
								
							 
						 
						
							
							
								
								Show perplexity ETA in hours and minutes ( #1096 )  
							
							
							
						 
						
							2023-04-21 14:57:57 +02:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									slaren 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								315a95a4d3 
								
							 
						 
						
							
							
								
								Add LoRA support ( #820 )  
							
							
							
						 
						
							2023-04-17 17:28:55 +02:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Pavol Rusnak 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								489537e6cf 
								
							 
						 
						
							
							
								
								examples: add missing <ctime> include for time() ( #1011 )  
							
							
							
						 
						
							2023-04-16 10:13:00 +00:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Gary Linscott 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								be87b6ed20 
								
							 
						 
						
							
							
								
								perplexity : add support for batch size to --perplexity ( #407 )  
							
							... 
							
							
							
							* Add support to batch size for perplexity
* Revert "Fix memory allocation issues and seg faults"
This reverts commit 4870e455b3 
							
						 
						
							2023-04-14 00:50:42 +03:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Pavol Rusnak 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								8b679987cd 
								
							 
						 
						
							
							
								
								Fix whitespace, add .editorconfig, add GitHub workflow ( #883 )  
							
							
							
						 
						
							2023-04-11 19:45:44 +00:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									comex 
								
							 
						 
						
							
							
							
							
								
							
							
								f963b63afa 
								
							 
						 
						
							
							
								
								Rewrite loading code to try to satisfy everyone:  
							
							... 
							
							
							
							- Support all three formats (ggml, ggmf, ggjt).  (However, I didn't
  include the hack needed to support GPT4All files without conversion.
  Those can still be used after converting them with convert.py from my
  other PR.)
- Support both mmap and read (mmap is used by default, but can be
  disabled with `--no-mmap`, and is automatically disabled for pre-ggjt
  files or on platforms where mmap is not supported).
- Support multi-file models like before, but automatically determine the
  number of parts rather than requiring `--n_parts`.
- Improve validation and error checking.
- Stop using the per-file type field (f16) entirely in favor of just
  relying on the per-tensor type/size fields.  This has no immediate
  benefit, but makes it easier to experiment with different formats, and
  should make it easier to support the new GPTQ-for-LLaMa models in the
  future (I have some work in progress on that front).
- Support VirtualLock on Windows (using the same `--mlock` option as on
  Unix).
    - Indicate loading progress when using mmap + mlock.  (Which led me
      to the interesting observation that on my Linux machine, with a
      warm file cache, mlock actually takes some time, whereas mmap
      without mlock starts almost instantly...)
      - To help implement this, move mlock support from ggml to the
        loading code.
- madvise/PrefetchVirtualMemory support (based on #740 )
- Switch from ifstream to the `fopen` family of functions to avoid
  unnecessary copying and, when mmap is enabled, allow reusing the same
  file descriptor for both metadata reads and mmap (whereas the existing
  implementation opens the file a second time to mmap).
- Quantization now produces a single-file output even with multi-file
  inputs (not really a feature as much as 'it was easier this way').
Implementation notes:
I tried to factor the code into more discrete pieces than before.
Regarding code style: I tried to follow the code style, but I'm naughty
and used a few advanced C++ features repeatedly:
- Destructors to make it easier to ensure everything gets cleaned up.
- Exceptions.  I don't even usually use exceptions when writing C++, and
  I can remove them if desired... but here they make the loading code
  much more succinct while still properly handling a variety of errors,
  ranging from API calls failing to integer overflow and allocation
  failure.  The exceptions are converted to error codes at the
  API boundary.)
Co-authored-by: Pavol Rusnak <pavol@rusnak.io> (for the bit I copied from #740 ) 
							
						 
						
							2023-04-10 01:10:46 +02:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									anzz1 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								7f4c5c6651 
								
							 
						 
						
							
							
								
								llama : fix linkage with mingw ( #551 )  
							
							... 
							
							
							
							* Revert 7e53955#542 )
Still needs to be fixed properly
* Fix linking on mingw32 
							
						 
						
							2023-03-28 21:23:09 +03:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Stephan Walter 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								436e561931 
								
							 
						 
						
							
							
								
								all : be more strict about converting float to double ( #458 )  
							
							... 
							
							
							
							* Be more strict about converting float to double
* Test equivalence of round, SILU implementations
Test module is commented out in CMakeLists.txt because the tests may
take a long time, depending on how much the compiler optimizes.
* Fix softmax in perplexity.cpp
* all : prefer float over double where appropriate
* perplexity : add <cmath>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> 
							
						 
						
							2023-03-28 19:48:20 +03:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Marco Matthies 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								7e5395575a 
								
							 
						 
						
							
							
								
								Fix missing ggml link in cmake for examples/* on w64-mingw32 ( #542 )  
							
							
							
						 
						
							2023-03-27 07:55:26 +03:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Stephan Walter 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								b391579db9 
								
							 
						 
						
							
							
								
								Update README and comments for standalone perplexity tool ( #525 )  
							
							
							
						 
						
							2023-03-26 16:14:01 +03:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Georgi Gerganov 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								03f7e33560 
								
							 
						 
						
							
							
								
								Cleanup STL headers + fix embedding examples + minor stuff  
							
							
							
						 
						
							2023-03-25 20:51:14 +02:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Georgi Gerganov 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								a316a425d0 
								
							 
						 
						
							
							
								
								Overhaul the examples structure  
							
							... 
							
							
							
							- main -> examples
- utils -> examples (renamed to "common")
- quantize -> examples
- separate tools for "perplexity" and "embedding"
Hope I didn't break something ! 
							
						 
						
							2023-03-25 20:26:40 +02:00