Commit graph

1072 commits

Author SHA1 Message Date
staviq
0b367255c1 fix msvc 2023-08-26 02:57:50 +02:00
staviq
9bace227c5 stub LOG_DUMP_CMDLINE for WIN32 for now 2023-08-25 18:24:55 +02:00
staviq
c8a1118308 fix LOG_TEELN and configchecker 2023-08-25 17:41:15 +02:00
staviq
4fdcede9ec LOG_DISABLE_LOGS compile flag, wrapped f in macros 2023-08-25 15:51:27 +02:00
staviq
54e81bac6f
Merge branch 'ggerganov:master' into betterlogs 2023-08-25 06:08:45 +02:00
staviq
181e8a9902 main: replaced fprintf/LOG_TEE, some trace logging 2023-08-25 05:22:23 +02:00
staviq
360b36c921 log tostring helpers, token vectors pretty prints 2023-08-25 05:20:47 +02:00
Marcus Dunn
2e5f70a25f
Added enum to llama_token_get_type return type (#2774) 2023-08-24 23:49:30 +02:00
slaren
d0f77b1353
convert.py : try to determine n_ctx automatically for CodeLlama (#2770) 2023-08-24 21:10:39 +02:00
staviq
45f1a43d86 mv include to common, params, help msg 2023-08-24 20:44:47 +02:00
slaren
0d3094f0c7
gguf : add rope_freq_base parameter for CodeLlama (#2769) 2023-08-24 21:04:05 +03:00
staviq
8054c32260
Merge branch 'master' into betterlogs 2023-08-24 19:52:31 +02:00
staviq
f5080da474 update .gitignore 2023-08-24 19:47:18 +02:00
Georgi Gerganov
01f2224682
falcon : write file type 2023-08-24 19:58:30 +03:00
Shouzheng Liu
38b16dfca6
metal : bug-fix when enable ggml-alloc (#2757)
* metal: better memory alloc w/ concurrency dispatch

The ggml-alloc should only free tensors at memory barriers.

* ggml-alloc: avoid return silently

In certain cases, the allocate_node() function may silently return
without performing any memory allocation.
2023-08-24 19:27:25 +03:00
Georgi Gerganov
8f8c28e89c
convert : auto-determine model name based on dir + scripts update 2023-08-24 19:26:47 +03:00
Kerfuffle
7694adda8d
Fix for main example getting stuck when -n -2 and --interactive (#2767)
* Fix for main example getting stuck when -n -2 and --interactive

* Add a comment so future generations may suffer less.
2023-08-24 10:11:13 -06:00
slaren
fea95c682d
fix convert.py for codellama, add llama 34B to the list of recognized models (#2768) 2023-08-24 17:44:11 +02:00
DannyDaemonic
ef955fbd23
Tag release with build number (#2732)
* Modified build.yml to use build number for release

* Add the short hash back into the tag

* Prefix the build number with b
2023-08-24 15:58:02 +02:00
Georgi Gerganov
d67777c202
metal : add Q8_0 support (#2763)
* metal : add dequantize_q8_0 kernel

* metal : add mul_mat_q8_0_f32 kernel

* metal : add Q8_0 mul_mm kernel
2023-08-24 16:19:57 +03:00
Georgi Gerganov
c3e53b421a
llama : escape all U+2581 in a string (#2750) 2023-08-24 12:26:01 +03:00
Evan Jones
6e91a1b070
llama : fix grammar sometimes generating null char (#2756) 2023-08-24 07:07:13 +03:00
staviq
47b9f2d36f log_enable/disable, LOG_TEE, basic usage doc 2023-08-24 02:00:16 +02:00
Georgi Gerganov
44d5462b5c
readme : fix link 2023-08-23 23:44:19 +03:00
Georgi Gerganov
c7868b0753
minor : fix trailing whitespace 2023-08-23 23:43:00 +03:00
Georgi Gerganov
79da24b58c
readme : update hot topics 2023-08-23 23:41:16 +03:00
Georgi Gerganov
cf658adc83
llm : add Falcon support (#2717)
* llama : refactor GGUF constants into static maps

* llama : check if model architecture is known

* llama : refactor llama_model_load_internal()

* gguf : add KV constant maps

* llm : read arch-specific KVs

* convert : add dummy scores + types

* falcon : load tensor data (CPU only)

* llama : fix loading progress bar

* llama : add arch member to llama_model

* falcon : CPU inference working

* falcon : support non-40B models

* falcon : minor

* llama : minor updates

ggml-ci

* convert-falcon-hf-to-gguf.py : fix special token mapping

* llama.cpp : llama default UNK token = id 0

* llama.cpp : fix bpe tokenizer

* llama.cpp : fix the fix of bpe tokenizer

* ggml : pass eps to ggml_norm

* metal : implement RoPE (mode = 2) + avoid ggml_repeat

* ggml : ggml_repeat always creates new tensor

* falcon : copy-paste self-attention from LLaMA

* metal : print extra compute pipeline info

* falcon : minor changes (still chasing the Metal problem)

* llama.cpp : fix linefeed token

* metal : fix GELU kernel numerical stability by using precise::tanh

* metal : temporary workaround for the concurrency optimization bug

* falcon : add CUDA offloading (#2739)

* llama : better model naming and size reporting

* llama : prep new tokenizer support

* llama : advanced BPE tokenizer based on ggllm.cpp imlpementation

* llama : remove oboslete comment

ggml-ci

* common : remove obsolete BPE API + disable test-tokenizer-1

* llama : revert BPE special-case in llama_byte_to_token()

* cuda : add TODOs for RoPE NeoX implementation

* llama : default special tokens based on vocab type

* perplexity : add log for start of tokenization

---------

Co-authored-by: klosax <131523366+klosax@users.noreply.github.com>
Co-authored-by: slaren <slarengh@gmail.com>
2023-08-23 23:08:04 +03:00
Georgi Gerganov
a192860cfe
minor : fix trailing whitespace 2023-08-23 22:37:39 +03:00
Olivier Chafik
95385241a9
examples : restore the functionality to import llama2.c models (#2685)
* Fix import of llama2.c models that don't share weights between embedding layers

* llama2c: reinstate ggmlv3 conversion output + update readme w/ gguf conv

* llama2.c: comment out legacy "load from ggml model" logic

* llama2.c: convert special-cased "<0xXX>" single byte tokens from tokenizer.bin
2023-08-23 22:33:05 +03:00
staviq
71d05b9ae4 remove atomics and add dynamic log target 2023-08-23 20:27:20 +02:00
slaren
335acd2ffd
fix convert-lora-to-ggml.py (#2738) 2023-08-23 16:46:54 +02:00
klosax
5290c38e6e
main : insert bos if no tokens (#2727)
* main.cpp : insert bos if no tokens

* Update examples/main/main.cpp

* Update examples/main/main.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-08-23 16:46:03 +02:00
akawrykow
cc34dbda96
gitignore : fix for windows (#2729) 2023-08-23 17:31:34 +03:00
Cebtenzzre
7c2227a197
chmod : make scripts executable (#2675) 2023-08-23 17:29:09 +03:00
JohnnyB
f19dca04ea
devops : RPM Specs (#2723)
* Create llama-cpp.srpm

* Rename llama-cpp.srpm to llama-cpp.srpm.spec

Correcting extension.

* Tested spec success.

* Update llama-cpp.srpm.spec

* Create lamma-cpp-cublas.srpm.spec

* Create lamma-cpp-clblast.srpm.spec

* Update lamma-cpp-cublas.srpm.spec

Added BuildRequires

* Moved to devops dir
2023-08-23 17:28:22 +03:00
staviq
356a166b19 reverted log auto endline to better mimic printf 2023-08-23 16:06:46 +02:00
staviq
d5156d3345 added basic log file handler 2023-08-23 16:03:30 +02:00
staviq
727af3ea16 add *.log to .gitignore 2023-08-23 16:02:43 +02:00
Kawrakow
8207214b6a
Fix values shown in the quantize tool help (#2735)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-08-23 12:57:12 +03:00
Kawrakow
62959e740e
Strided perplexity (#2714)
* Implementing strided computation of perplexity

* Alternative way to output PPL results

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-08-23 12:56:42 +03:00
IgnacioFDM
7f7ddd5002
Fix ggml to gguf conversion on Windows (#2733)
This fixes `RuntimeWarning: overflow encountered in long_scalars`

Credit: anon (not mine)
2023-08-23 03:31:09 -06:00
Xiao-Yong Jin
b8ad1b66b2
server : allow json array in prompt or content for direct token input (#2306)
* server: allow json array in prompt or content

We accept an array of strings and numbers representing tokens,
in addition to the current string valued prompt or content.

This allows direct token input, so that any special tokens
can be processed and used at the frontend during the construction
of the json data, before sending to the server. And the server
does not need to know or parse special tokens from textual input.

With this, we can use EOS and BOS used in llama-2-chat models.

* server: use tokenizePrompt(json) and default "" if empty prompt

* server: fix prompt check

* server: tokenize endpoint no longer adds BOS
2023-08-23 15:12:12 +08:00
Evan Jones
f5fe98d11b
docs : add grammar docs (#2701)
* docs : add grammar docs

* tweaks to grammar guide

* rework GBNF example to be a commented grammar
2023-08-22 21:01:57 -04:00
staviq
62de8a6224 initial, base LOG macro 2023-08-23 02:27:26 +02:00
Kerfuffle
777f42ba18
Improve handling of special tokens in GGML to GGUF converter (#2725)
* Improve UNK, BOS, EOS token handling when converting without metadata.

* Allow importing as a module.

* Remove some obsolete code and minor cleanups.

* Set default UNK token mapping from -1 to 0 in llama.cpp

* Try to handle overflow due to buggy Windows Python with a better error message
2023-08-22 17:39:39 -06:00
goerch
46ef5b5fcf
llama : fix whitespace escaping in tokenizer (#2724) 2023-08-23 00:10:42 +03:00
Johannes Gäßler
c63bb1d16a
CUDA: use mul_mat_q kernels by default (#2683) 2023-08-22 22:47:05 +02:00
Alex Petenchea
3b6cfe7c92
convert.py : clarifying error message (#2718) 2023-08-22 21:58:16 +03:00
Jiahao Li
800c9635b4
Fix CUDA softmax by subtracting max value before exp (#2665) 2023-08-22 20:27:06 +02:00
Georgi Gerganov
deb7dfca4b
gguf : add ftype meta info to the model (#2710)
* llama : add ftype meta info to the model

ggml-ci

* convert.py : add ftype when converting (does not work)

* convert.py : fix Enum to IntEnum

ggml-ci
2023-08-22 20:05:59 +03:00