Commit graph

3471 commits

Author SHA1 Message Date
slaren
e702f2ff11 ggml : reduce hash table reset cost (#8698)
* ggml : reduce hash table reset cost

* fix unreachable code warnings after GGML_ASSERT(false)

* GGML_ASSERT(false) -> GGML_ABORT("fatal error")

* GGML_ABORT use format string
2024-07-27 21:38:41 +08:00
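For context on the `GGML_ASSERT(false)` -> `GGML_ABORT("fatal error")` change above, here is a minimal sketch (not the actual ggml implementation) of an abort helper that takes a printf-style format string; marking the helper `[[noreturn]]` is also what removes the unreachable-code warnings mentioned in the commit.

```cpp
// Minimal sketch, not the actual ggml implementation: abort with a
// printf-style message and file/line info instead of assert(false).
#include <cstdarg>
#include <cstdio>
#include <cstdlib>

[[noreturn]] static void example_abort(const char * file, int line, const char * fmt, ...) {
    va_list args;
    va_start(args, fmt);
    fprintf(stderr, "%s:%d: ", file, line);
    vfprintf(stderr, fmt, args);
    fputc('\n', stderr);
    va_end(args);
    abort();
}

#define EXAMPLE_ABORT(...) example_abort(__FILE__, __LINE__, __VA_ARGS__)

int main() {
    int n_dims = 5;
    if (n_dims > 4) {
        // the format string carries context that a bare assert(false) could not
        EXAMPLE_ABORT("fatal error: unsupported n_dims = %d", n_dims);
    }
    return 0;
}
```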
Judd
a1cf044dd1 llama : fix order of parameters (#8706)
usage of `aclrtGetMemInfo` is correct:

https://www.hiascend.com/doc_center/source/zh/canncommercial/63RC2/inferapplicationdev/aclcppdevg/aclcppdevg_03_0103.html

Co-authored-by: Judd <foldl@boxvest.com>
2024-07-27 21:38:41 +08:00
Yaiko
3395a68a2d server : add Speech Recognition & Synthesis to UI (#8679)
* server : add Speech Recognition & Synthesis to UI

* server : add Speech Recognition & Synthesis to UI (fixes)
2024-07-27 21:38:41 +08:00
Xuan Son Nguyen
fbc71e9312 examples : export-lora : fix issue with quantized base models (#8687) 2024-07-27 21:38:41 +08:00
DavidKorczynski
df106e9211 ggml: handle ggml_init failure to fix NULL pointer deref (#8692)
`ggml_init` can fail if no unused context is found. In that case, a NULL-pointer deref will happen later in the code during a call to `ggml_set_no_alloc`.

This fixes it by bailing out if no context is found.
2024-07-27 21:38:41 +08:00
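A short sketch of the defensive pattern the fix describes, assuming the public `ggml.h` API (`ggml_init`, `ggml_set_no_alloc`, `ggml_free`): bail out when `ggml_init` returns NULL instead of dereferencing the context later.

```cpp
// Sketch of checking the result of ggml_init before touching the context.
#include "ggml.h"
#include <cstdio>

int main() {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024 * 1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };

    struct ggml_context * ctx = ggml_init(params);
    if (ctx == NULL) {
        // bail out instead of dereferencing a NULL context later
        fprintf(stderr, "ggml_init failed\n");
        return 1;
    }

    ggml_set_no_alloc(ctx, true); // safe: ctx is known to be non-NULL here
    // ... build graph ...
    ggml_free(ctx);
    return 0;
}
```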
Georgi Gerganov
1353a813cc llama : fix build + fix fabs compile warnings (#8683)
ggml-ci
2024-07-27 21:38:41 +08:00
Andreas (Andi) Kunar
21905dd445 ggml : fix build on Windows with Snapdragon X (#8531)
* Improvements for Windows with Snapdragon X

* Revert "Improvements for Windows with Snapdragon X"

This reverts commit bf21397ae5.

* Improvements for Windows with Snapdragon X

* WOA build clarifications

* Windows on ARM build clarifications

* cmake build for Windows clarifications

* Update docs/build.md

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: AndreasKunar <andreaskmsn.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-07-27 21:38:41 +08:00
Georgi Gerganov
fd905e8be1 tests : fix printfs (#8068) 2024-07-27 21:38:41 +08:00
Georgi Gerganov
ca87ca98a7 ggml : add and use ggml_cpu_has_llamafile() (#8664) 2024-07-27 21:37:37 +08:00
Xuan Son Nguyen
43d92892c3 examples : remove finetune and train-text-from-scratch (#8669)
* examples : remove finetune and train-text-from-scratch

* fix build

* update help message

* fix small typo for export-lora
2024-07-27 21:37:37 +08:00
Ujjawal Panchal
5a5d5d28f8 docs : Quantum -> Quantized (#8666)
* docfix: imatrix readme, quantum models -> quantized models.

* docfix: server readme: quantum models -> quantized models.
2024-07-27 21:37:37 +08:00
Fan Shupei
fc5b21bf10 llama: use sliding window for phi3 (#8627)
* use sliding window for phi3

* fix typo, "data_swa" -> "data"

* [convert_hf_to_gguf.py] add phi3 sliding window
2024-07-27 21:37:37 +08:00
MorganRO8
65e54b5db4 readme : update games list (#8673)
Added a link to a game I made that depends on llama
2024-07-27 21:37:37 +08:00
Joe Todd
0aeae29190 Build Llama SYCL Intel with static libs (#8668)
Ensure SYCL CI builds both static & dynamic libs for testing purposes

Signed-off-by: Joe Todd <joe.todd@codeplay.com>
2024-07-27 21:37:37 +08:00
Thorsten Sommer
791e3e0b27 readme : update UI list [no ci] (#8505) 2024-07-27 21:37:37 +08:00
Xuan Son Nguyen
dc7836c79e llama : fix llama_chat_format_single for mistral (#8657)
* fix `llama_chat_format_single` for mistral

* fix typo

* use printf
2024-07-27 21:37:37 +08:00
Joe Todd
146da8b6c6 Re-add erroneously removed -fsycl from GGML_EXTRA_LIBS (#8667) 2024-07-27 21:37:37 +08:00
Xuan Son Nguyen
a5f1f44b6a add llama_lora_adapter_clear (#8653) 2024-07-27 21:37:37 +08:00
Xuan Son Nguyen
a17bcdfd4c examples : Fix llama-export-lora example (#8607)
* fix export-lora example

* add more logging

* reject merging subset

* better check

* typo
2024-07-27 21:37:37 +08:00
Vali Malinoiu
b7e8bada5e server : fix URL.parse in the UI (#8646) 2024-07-27 21:37:36 +08:00
Joe Todd
e2d7ec46fc sycl : Add support for non-release DPC++ & oneMKL (#8644)
* Update cmake to support nvidia hardware & open-source compiler

---------

Signed-off-by: Joe Todd <joe.todd@codeplay.com>
2024-07-27 21:37:36 +08:00
Georgi Gerganov
96d14f4b58 llama : move vocab, grammar and sampling into separate files (#8508)
* llama : move sampling code into llama-sampling

ggml-ci

* llama : move grammar code into llama-grammar

ggml-ci

* cont

ggml-ci

* cont : pre-fetch rules

* cont

ggml-ci

* llama : deprecate llama_sample_grammar

* llama : move tokenizers into llama-vocab

ggml-ci

* make : update llama.cpp deps [no ci]

* llama : redirect external API to internal APIs

ggml-ci

* llama : suffix the internal APIs with "_impl"

ggml-ci

* llama : clean-up
2024-07-27 21:37:36 +08:00
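A minimal sketch of the "redirect external API to internal `_impl` APIs" step described above; the symbol names here are illustrative, not llama.cpp's actual ones. The public entry point keeps its signature and forwards to an internal function that now lives in its own module (llama-sampling, llama-grammar, llama-vocab).

```cpp
// Illustrative forwarding pattern: stable C entry point -> internal _impl.
#include <cstdio>

// internal module (e.g. llama-sampling.cpp)
static int llama_sample_greedy_impl(const float * logits, int n) {
    int best = 0;
    for (int i = 1; i < n; ++i) {
        if (logits[i] > logits[best]) best = i;
    }
    return best;
}

// public API surface (e.g. llama.cpp) just forwards
extern "C" int llama_sample_greedy(const float * logits, int n) {
    return llama_sample_greedy_impl(logits, n);
}

int main() {
    const float logits[] = {0.1f, 2.5f, 0.7f};
    printf("best token = %d\n", llama_sample_greedy(logits, 3));
    return 0;
}
```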
0cc4m
4b93675489 Vulkan IQ4_NL Support (#8613)
* Fix Vulkan matmul tests compile errors

* Add Vulkan IQ4_NL support

* Fix Vulkan DeepSeek-Coder-V2-Lite MoE support
2024-07-27 21:37:36 +08:00
Jeroen Mostert
f54ffa8f04 Allow all RDNA2 archs to use sdot4 intrinsic (#8629)
The check gating the use of `__builtin_amdgcn_sdot4` specifically checks for gfx1030. This causes a severe perf regression for anything gfx103? that's not gfx1030 and not using `HSA_OVERRIDE_GFX_VERSION` (if you've built ROCm to support it). We already have a generic RDNA2 define, let's use it.
2024-07-27 21:37:36 +08:00
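To illustrate the gating change (the macro names below are hypothetical, not the exact ggml HIP defines): the fast path is guarded by an architecture-family define covering all RDNA2 targets rather than `__gfx1030__` alone, with a portable fallback elsewhere.

```cpp
// Hypothetical sketch of family-level gating; EXAMPLE_RDNA2 stands in for the
// generic RDNA2 define mentioned in the commit message.
#if defined(__gfx1030__) || defined(__gfx1031__) || defined(__gfx1032__) || \
    defined(__gfx1033__) || defined(__gfx1034__) || defined(__gfx1035__) || \
    defined(__gfx1036__)
#define EXAMPLE_RDNA2
#endif

#include <cstdint>

static int dot4_i8(int a, int b, int c) {
#if defined(EXAMPLE_RDNA2)
    // single hardware instruction on any RDNA2 part, not just gfx1030
    return __builtin_amdgcn_sdot4(a, b, c, false);
#else
    // portable fallback: signed dot product of the four packed int8 lanes
    int sum = c;
    for (int i = 0; i < 4; ++i) {
        sum += int(int8_t(a >> (8 * i))) * int(int8_t(b >> (8 * i)));
    }
    return sum;
#endif
}

int main() {
    return dot4_i8(0x01010101, 0x02020202, 0) == 8 ? 0 : 1;
}
```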
Georgi Gerganov
6676362327 contrib : clarify PR squashing + module names (#8630)
* contrib : clarify PR squashing

* contrib : fix typo + add list of modules
2024-07-27 21:37:36 +08:00
luoyu-intel
a9c03e4827 [SYCL] fix scratch size of softmax (#8642) 2024-07-27 21:37:36 +08:00
Keke Han
549e1c7e41 llama : fix codeshell support (#8599)
* llama : fix codeshell support

* llama : move codeshell after smollm below to respect the enum order
2024-07-27 21:37:36 +08:00
Jason Stillerman
c70ddd889f llama : add support for SmolLm pre-tokenizer (#8609)
* Adding SmolLM Pre Tokenizer

* Update convert_hf_to_gguf_update.py

Co-authored-by: compilade <git@compilade.net>

* Update src/llama.cpp

Co-authored-by: compilade <git@compilade.net>

* handle regex

* removed .inp and .out ggufs

---------

Co-authored-by: compilade <git@compilade.net>
2024-07-27 21:37:36 +08:00
Jiří Podivín
b4c8a9c8a6 *.py: Stylistic adjustments for python (#8233)
* Superfluous parens in conditionals were removed.
* Unused args in function were removed.
* Replaced unused `idx` var with `_`
* Initializing file_format and format_version attributes
* Renaming constant to capitals
* Preventing redefinition of the `f` var

Signed-off-by: Jiri Podivin <jpodivin@redhat.com>
2024-07-27 21:37:36 +08:00
Georgi Gerganov
525b48a49c llama : allow overrides for tokenizer flags (#8614)
ggml-ci
2024-07-27 21:37:36 +08:00
Georgi Gerganov
52f90c6d3c tests : re-enable tokenizer tests (#8611)
* models : remove duplicated gpt-2 vocab

* models : remove old stablelm vocab

* tests : re-enable MPT tokenizer tests

* tests : re-enable DeepSeek tokenizer tests

* cmake : sort

ggml-ci
2024-07-27 21:37:36 +08:00
Douglas Hanley
af37c4bd7f llama : add Mistral Nemo inference support (#8604) 2024-07-27 21:37:36 +08:00
Jan Boon
9ffb78d54f server : update doc to clarify n_keep when there is bos token (#8619) 2024-07-27 21:37:36 +08:00
Mark Zhuang
6220d93595 ggml: fix compile error for RISC-V (#8623) 2024-07-27 21:37:36 +08:00
devojony
6f19b8c09d examples: fix android example not generating continuously (#8621)
When generation ends, `completion_loop()` should return NULL, not an empty string
2024-07-27 21:37:36 +08:00
Georgi Gerganov
24404ef9f3 flake.lock: Update (#8610) 2024-07-27 21:37:36 +08:00
M-A
52a7238985 examples : Rewrite pydantic_models_to_grammar_examples.py (#8493)
Changes:

- Move each example into its own function. This makes the code much
  easier to read and understand.
- Make it easy to run only one example by commenting out function calls
  in main().
- Make the output easy to parse by indenting the output for each example.
- Add shebang and +x bit to make it clear it's an executable.
- Make the host configurable via --host with a default 127.0.0.1:8080.
- Make the code look in the tools list to call the registered tool,
  instead of hardcoding the returned values. This makes the code more
  copy-pastable.
- Add error checking, so that the program exits 1 if the LLM didn't
  return the expected values. This is very useful for checking correctness.

Testing:

- Tested with Mistral-7B-Instruct-v0.3 in F16 and Q5_K_M and
  Meta-Llama-3-8B-Instruct in F16 and Q5_K_M.
  - I did not observe a failure even once in Mistral-7B-Instruct-v0.3.
  - Llama-3 failed about a third of the time in example_concurrent: it
    only returned one call instead of 3. Even for F16.

Potential follow ups:

- Do not fix the prompt encoding yet. Surprisingly, it mostly works even
  if the prompt encoding is not optimized for the model.
- Add chained answer and response.

Test only change.
2024-07-27 21:37:36 +08:00
compilade
82478f1934 gguf-py : fix some metadata name extraction edge cases (#8591)
* gguf-py : fix some metadata name extraction edge cases

* convert_lora : use the lora dir for the model card path

* gguf-py : more metadata edge cases fixes

Multiple finetune versions are now joined together,
and the removal of the basename annotation on trailing versions
is more robust.

* gguf-py : add more name metadata extraction tests

* convert_lora : fix default filename

The default filename was previously hardcoded.

* convert_hf : Model.fname_out can no longer be None

* gguf-py : do not use title case for naming convention

Some models use acronyms in lowercase,
which can't be title-cased like other words,
so it's best to simply use the same case
as in the original model name.

Note that the size label still has an uppercased suffix
to make it distinguishable from the context size of a finetune.
2024-07-27 21:37:36 +08:00
compilade
264c2830d8 convert_hf : fix Gemma v1 conversion (#8597)
* convert_hf : fix Gemma v1 conversion

* convert_hf : allow renaming tokens, but with a warning

* convert_hf : fix Gemma v1 not setting BOS and EOS tokens
2024-07-27 21:37:36 +08:00
Johannes Gäßler
6887f5f02a CUDA: MMQ code deduplication + iquant support (#8495)
* CUDA: MMQ code deduplication + iquant support

* 1 less parallel job for CI build
2024-07-27 21:37:36 +08:00
Georgi Gerganov
c3ca2aa58e gguf : handle null name during init (#8587) 2024-07-27 21:37:36 +08:00
Michael Coppola
4b895a57cf llama : add support for Tekken pre-tokenizer (#8579)
* llama : Added support for Tekken pre-tokenizer (#8577)

Removed unneeded `vocab.tokenizer_clean_spaces` assignment

* llama : fix order of pre-tokenizers

* Tekken pre-tokenizer no longer uses clean_up_tokenization_spaces
* Updated chkhsh for Tekken tokenizer

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-07-27 21:37:36 +08:00
Huifeng Ou
8b6f28ab31 llama.swiftui: fix end of generation bug (#8268)
* fix continuing to generate blank lines after getting an EOT or EOS token from the LLM

* change variable name to is_done (variable name suggested by ggerganov)

* minor : fix trailing whitespace

* minor : add space

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-07-27 21:37:36 +08:00
Brian
3a4d206f7b gguf_dump.py: fix markdown kv array print (#8588)
* gguf_dump.py: fix markdown kv array print

* Update gguf-py/scripts/gguf_dump.py

Co-authored-by: compilade <git@compilade.net>

* gguf_dump.py: refactor kv array string handling

* gguf_dump.py: escape backticks inside of strings

* gguf_dump.py: inline code markdown escape handler added

>>> escape_markdown_inline_code("hello world")
'`hello world`'
>>> escape_markdown_inline_code("hello ` world")
'``hello ` world``'

* gguf_dump.py: handle edge case about backticks on start or end of a string

---------

Co-authored-by: compilade <git@compilade.net>
2024-07-27 21:37:36 +08:00
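The escaping rule behind the doctest above (the actual change is in the Python script; this is just a C++ sketch of the same Markdown rule): wrap the value in a backtick run longer than any run it contains, and pad with spaces when it starts or ends with a backtick.

```cpp
// Sketch of Markdown inline-code escaping: choose a fence longer than any
// backtick run inside the value, pad when the value starts/ends with a backtick.
#include <algorithm>
#include <cstdio>
#include <string>

static std::string escape_markdown_inline_code(const std::string & s) {
    // longest run of backticks inside the value
    size_t longest = 0, run = 0;
    for (char c : s) {
        run = (c == '`') ? run + 1 : 0;
        longest = std::max(longest, run);
    }
    const std::string fence(longest + 1, '`');
    // pad with spaces if the value begins or ends with a backtick
    const bool pad = !s.empty() && (s.front() == '`' || s.back() == '`');
    const std::string inner = pad ? " " + s + " " : s;
    return fence + inner + fence;
}

int main() {
    printf("%s\n", escape_markdown_inline_code("hello world").c_str());   // `hello world`
    printf("%s\n", escape_markdown_inline_code("hello ` world").c_str()); // ``hello ` world``
    printf("%s\n", escape_markdown_inline_code("`ticked`").c_str());      // `` `ticked` ``
    return 0;
}
```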
slaren
384b0cd40e ggml : fix quant dot product with odd number of blocks (#8549)
* ggml : fix iq4_nl dot product with odd number of blocks

* ggml : fix odd blocks for ARM_NEON (#8556)

* ggml : fix iq4_nl dot product with odd number of blocks

* ggml : fix q4_1

* ggml : fix q5_0

* ggml : fix q5_1

* ggml : fix iq4_nl metal

ggml-ci

* ggml : fix q4_0

* ggml : fix q8_0

ggml-ci

* ggml : remove special Q4_0 code for first 2 blocks

* ggml : fix sumf redefinition

---------

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-07-27 21:37:36 +08:00
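As a rough illustration of the odd-block issue the series above fixes (this is not the actual quant kernels, just the loop shape): a dot product that consumes blocks two at a time needs a scalar epilogue for the leftover block when the block count is odd.

```cpp
// Illustrative loop shape only: paired "vector" body plus odd-block epilogue.
#include <cstdio>

static float dot_blocks(const float * a, const float * b, int n_blocks) {
    float sum = 0.0f;
    int i = 0;
    for (; i + 1 < n_blocks; i += 2) {  // body: two blocks per step
        sum += a[i] * b[i] + a[i + 1] * b[i + 1];
    }
    if (i < n_blocks) {                 // epilogue for the odd trailing block
        sum += a[i] * b[i];
    }
    return sum;
}

int main() {
    const float a[] = {1, 2, 3, 4, 5};
    const float b[] = {1, 1, 1, 1, 1};
    printf("%f\n", dot_blocks(a, b, 5)); // 15
    return 0;
}
```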
Brian
c70cd4fbd1 convert-*.py: remove add_name from ChatGLMModel class (#8590) 2024-07-27 21:37:36 +08:00
Georgi Gerganov
af831106a4 llama : bump max layers from 256 to 512 (#8530)
* llama : bump max layers from 256 to 512

* llama : replace asserts with exceptions
2024-07-27 21:37:36 +08:00
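A tiny sketch of the "replace asserts with exceptions" part of this change (names are illustrative, not llama.cpp's): a recoverable model-loading error throws and is caught by the caller instead of aborting the process.

```cpp
// Illustrative: throw on a bad model instead of asserting.
#include <cstdio>
#include <stdexcept>

static void check_n_layers(int n_layer, int max_layers) {
    if (n_layer > max_layers) {
        throw std::runtime_error("model has too many layers");
    }
}

int main() {
    try {
        check_n_layers(600, 512);
    } catch (const std::exception & e) {
        fprintf(stderr, "error loading model: %s\n", e.what());
        return 1;
    }
    return 0;
}
```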
Georgi Gerganov
093ee371b0 readme : fix server badge 2024-07-27 21:37:36 +08:00
Clint Herron
0ed406ebd2 ggml : add friendlier error message to fopen errors (#8575)
* Add additional error information when model files fail to load.

* Adding additional error information to most instances of fopen.
2024-07-27 21:37:36 +08:00
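A sketch of the friendlier fopen error reporting described above, assuming nothing about the exact llama.cpp message strings: include `strerror(errno)` so the user learns why the model file failed to open.

```cpp
// Report the errno description alongside the path when fopen fails.
#include <cerrno>
#include <cstdio>
#include <cstring>

static FILE * open_model(const char * path) {
    FILE * f = fopen(path, "rb");
    if (f == NULL) {
        // e.g. "failed to open 'model.gguf': No such file or directory"
        fprintf(stderr, "failed to open '%s': %s\n", path, strerror(errno));
    }
    return f;
}

int main() {
    FILE * f = open_model("does-not-exist.gguf");
    if (f != NULL) {
        fclose(f);
    }
    return 0;
}
```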
Frank Mai
e5cf8b43ec fix: typo of chatglm4 chat tmpl (#8586)
Signed-off-by: thxCode <thxcode0824@gmail.com>
2024-07-27 21:37:36 +08:00