Commit graph

3420 commits

Author SHA1 Message Date
brian khuu
3625a42061 convert-*.py: add heuristic to directory name fallback
Also add source_url for huggingface url
2024-07-16 06:37:42 +10:00
brian khuu
39472a09da convert-*.py: need to include self in per_model_weight_count_estimation() 2024-07-16 06:37:42 +10:00
brian khuu
54918ad14e convert-*.py: refactor parameter weight class 2024-07-16 06:37:42 +10:00
brian khuu
32e80e094c convert-*.py: base_model is actually in spec for model cards 2024-07-16 06:37:42 +10:00
brian khuu
4d5cd0670a convert-*.py: use heuristics to parse _name_or_path 2024-07-16 06:37:42 +10:00
brian khuu
b0553f42da convert-*.py: adjust help message 2024-07-16 06:37:42 +10:00
brian khuu
dd1571211e convert-*.py: add quantized_by and enhance heuristics 2024-07-16 06:37:38 +10:00
brian khuu
5a86dfaa1c convert-*.py: add general.organization to kv store 2024-07-16 06:36:03 +10:00
brian khuu
f7c20793b9 convert-*.py: enable --model-name direct metadata override 2024-07-16 06:36:03 +10:00
brian khuu
b1927eed82 convert-*.py: move per model weight estimation away from util back to main script
plus some refactoring
2024-07-16 06:36:03 +10:00
brian khuu
684c604eca convert-*.py: add datasets and language to KV store 2024-07-16 06:36:03 +10:00
brian khuu
0f1d50fab7 convert-*.py: add parameter size class 2024-07-16 06:36:03 +10:00
brian khuu
8f734083dd convert-*.py: add base_version and add tags 2024-07-16 06:36:03 +10:00
brian khuu
b36e391b87 convert-*.py: parse model card in metadata util. Add license_link and license_name to kv store 2024-07-16 06:36:03 +10:00
brian khuu
5c263cb257 convert-*.py: encoding_scheme --> output_type 2024-07-16 06:36:03 +10:00
brian khuu
4d5f18a0e6 convert-*.py: metadata class moved to utility 2024-07-16 06:36:03 +10:00
brian khuu
916872f72f convert-*.py: model card metadata 2024-07-16 06:36:03 +10:00
brian khuu
a42c2b7efc convert-*.py: add basename and finetune metadata 2024-07-16 06:36:03 +10:00
brian khuu
dbb1b471e4 convert-*.py: add --get-outfile command and refactor 2024-07-16 06:36:03 +10:00
brian khuu
d3a936fd0e convert-*.py: licence -> license 2024-07-16 06:36:03 +10:00
Xuan Son Nguyen
97bdd26eee
Refactor lora adapter support (#8332)
* lora: load to devide buft

* add patch tensor function

* correct tensor patch

* llama_lora_adapter_apply

* correct ggml_backend_tensor_copy

* add llm_build_mm

* fix auto merge

* update based on review comments

* add convert script

* no more transpose A

* add f16 convert

* add metadata check

* add sanity check

* fix ftype

* add requirements

* fix requirements

* fix outfile

* conversion: only allow selected models

* fix types

* cuda : do not use dmmv if the tensor does not have enough cols

* llama : lora fixes

* do not disable mmap with lora

Co-authored-by: slaren <slarengh@gmail.com>

* llm_build_lora_mm_id

* convert_lora : MoE LoRA conversion support

* convert_lora : prefer safetensors, similarly to convert_hf

* convert_hf : simplify modify_tensors for InternLM2

* convert_lora : lazy conversion

* llama : load and use alpha from LoRA adapters

* llama : use llm_build_lora_mm in most model graphs

* auto scale

* Revert "auto scale"

This reverts commit 42415a4874.

* remove redundant params

* Apply suggestions from code review

Co-authored-by: slaren <slarengh@gmail.com>

* change kv metadata

* move add_type to __init__

* convert_hf : move add_type to main()

* convert_lora : use the GGUFWriter from Model instead of overwriting it

---------

Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Francis Couture-Harpin <git@compilade.net>
2024-07-15 20:50:47 +02:00
Xuan Son Nguyen
4db8f60fe7
fix ci (#8494) 2024-07-15 19:23:10 +02:00
Daniel Bevenius
8fac431b06
ggml : suppress unknown pragma 'GCC' on windows (#8460)
This commit adds a macro guard to pragma GCC to avoid the following
warning on windows:

```console
C:\llama.cpp\ggml\src\ggml-aarch64.c(17,9): warning C4068:
unknown pragma 'GCC' [C:\lama.cpp\build\ggml\src\ggml.vcxproj]
```
2024-07-15 15:48:17 +03:00
M-A
f17f39ff9c
server: update README.md with llama-server --help output [no ci] (#8472)
The README.md had a stale information. In particular, the --ctx-size
"defaults to 512" confused me and I had to check the code to confirm
this was false. This the server is evolving rapidly, it's probably
better to keep the source of truth at a single place (in the source) and
generate the README.md based on that.

Did:

    make llama-server
    ./llama-server --help > t.txt
    vimdiff t.txt examples/server/README.md

I copied the content inside a backquote block. I would have preferred
proper text but it would require a fair amount of surgery to make the
current output compatible with markdown. A follow up could be to
automate this process with a script.

No functional change.
2024-07-15 15:04:56 +03:00
Georgi Gerganov
9104bc20ed
common : add --no-cont-batching arg (#6358) 2024-07-15 14:54:58 +03:00
NikolaiLyssogor
fc690b018e
docs: fix links in development docs [no ci] (#8481)
Fixes a few links to within the repo that were broken in the reorganization of the
documentation in #8325.
2024-07-15 14:46:39 +03:00
Meng, Hengyu
16bdfa42ac
[SYCL] add concat through dim 1/2 (#8483)
* add concat through dim 1/2
2024-07-15 19:32:15 +08:00
Georgi Gerganov
3dfda05956
llama : de-duplicate deepseek2 norm 2024-07-15 14:10:39 +03:00
0cc4m
bda62d7999
Vulkan MMQ Fix (#8479)
* Fix incoherence by adding missing LOAD_VEC_A parameter

* Fix Vulkan op result checker build error
2024-07-15 09:38:52 +02:00
compilade
090fca7a07
pydantic : replace uses of __annotations__ with get_type_hints (#8474)
* pydantic : replace uses of __annotations__ with get_type_hints

* pydantic : fix Python 3.9 and 3.10 support
2024-07-14 19:51:21 -04:00
Georgi Gerganov
aaab2419ea
flake.lock: Update (#8475)
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/9f4128e00b0ae8ec65918efeba59db998750ead6?narHash=sha256-rwz8NJZV%2B387rnWpTYcXaRNvzUSnnF9aHONoJIYmiUQ%3D' (2024-07-03)
  → 'github:NixOS/nixpkgs/7e7c39ea35c5cdd002cd4588b03a3fb9ece6fad9?narHash=sha256-EYekUHJE2gxeo2pM/zM9Wlqw1Uw2XTJXOSAO79ksc4Y%3D' (2024-07-12)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-07-14 08:54:02 -07:00
Georgi Gerganov
73cf442e7b
llama : fix Gemma-2 Query scaling factors (#8473)
* 9B - query_pre_attn_scalar = 256 not 224

See 03e657582d

Gemma 9b should use 256 and not 224 (self.config.hidden_size // self.config.num_attention_heads)

* llama : fix Gemma-2 Query scaling factor

ggml-ci

---------

Co-authored-by: Daniel Han <danielhanchen@gmail.com>
2024-07-14 14:05:09 +03:00
Brian
e236528e76
gguf_hash.py: Add sha256 (#8470)
* gguf_hash.py: Add sha256

* gguf_hash.py: rename string UUIDv5 --> uuid

* Apply suggestions from code review

Co-authored-by: compilade <git@compilade.net>

---------

Co-authored-by: compilade <git@compilade.net>
2024-07-14 16:47:14 +10:00
compilade
fa79495bb4
llama : fix pre-tokenization of non-special added tokens (#8228)
* llama : fix mpt and olmo pre-tokenizer

* llama : pre-tokenize non-special user-defined tokens first

* llama : fix detection of control-like user-defined tokens

* convert_hf : identify which user-defined tokens are control tokens

Only used in _set_vocab_gpt2() for now.

* convert_hf : identify more added control tokens for SPM tokenziers

This makes Gemma and Gemma-2 tokenize pretty much EVERYTHING correctly,
including HTML tags and consecutive spaces,
but it unfortunately requires model re-conversion.

There seems to be a weird behavior of the HF tokenizer for Gemma,
which prefers to use the 16-space token over more lengthy space tokens,
while using the SentencePiece tokenizer does not do this.
(the implementation in llama.cpp has the same behavior as SentencePiece)

* llama : fix wrong pre-tokenization of byte tokens

* llama : fix Viking pre-tokenizer regex

The order was previously wrong, which caused errors in some tests.

* llama : fix command-r detokenization

* convert_hf : reduce usages of the UNKNOWN token type

* llama : add UNKNOWN tokens in the special tokens cache

* convert_hf : reduce usages of UNKNOWN for InternLM2

This makes the changes from #8321 more consistent
with the other changes made here.

* test-tokenizer-random : reduce potential confilcts with #8379

* test-tokenizer-random : add a failing edge case for falcon
2024-07-13 23:35:10 -04:00
bandoti
17eb6aa8a9
vulkan : cmake integration (#8119)
* Add Vulkan to CMake pkg

* Add Sycl to CMake pkg

* Add OpenMP to CMake pkg

* Split generated shader file into separate translation unit

* Add CMake target for Vulkan shaders

* Update README.md

* Add make target for Vulkan shaders

* Use pkg-config to locate vulkan library

* Add vulkan SDK dep to ubuntu-22-cmake-vulkan workflow

* Clean up tabs

* Move sudo to apt-key invocation

* Forward GGML_EXTRA_LIBS to CMake config pkg

* Update vulkan obj file paths

* Add shaderc to nix pkg

* Add python3 to Vulkan nix build

* Link against ggml in cmake pkg

* Remove Python dependency from Vulkan build

* code review changes

* Remove trailing newline

* Add cflags from pkg-config to fix w64devkit build

* Update README.md

* Remove trailing whitespace

* Update README.md

* Remove trailing whitespace

* Fix doc heading

* Make glslc required Vulkan component

* remove clblast from nix pkg
2024-07-13 18:12:39 +02:00
Georgi Gerganov
c917b67f06
metal : template-ify some of the kernels (#8447)
ggml-ci
2024-07-13 18:32:33 +03:00
Georgi Gerganov
4e24cffd8c
server : handle content array in chat API (#8449)
* server : handle content array in chat API

* Update examples/server/utils.hpp

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

---------

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-07-12 14:48:15 +03:00
Georgi Gerganov
6af51c0d96
main : print error on empty input (#8456) 2024-07-12 14:48:04 +03:00
Daniel Bevenius
f53226245f
llama : suppress unary minus operator warning (#8448)
This commit updates the _try_copy lambda and moves the unary minus
operator to after the cast to int32_t.

The motivation for this that currently the following warning is
generated on windows:

```console
llama.cpp\src\llama.cpp(21147,30): warning C4146: unary minus operator
applied to unsigned type, result still unsigned
```
2024-07-12 12:05:21 +03:00
Douglas Hanley
c3ebcfa148
server : ensure batches are either all embed or all completion (#8420)
* make sure batches are all embed or all non-embed

* non-embedding batch for sampled tokens; fix unused params warning
2024-07-12 11:14:12 +03:00
Armen Kaleshian
8a4441ea1a
docker : fix filename for convert-hf-to-gguf.py in tools.sh (#8441)
Commit b0a4699 changed the name of this script from convert-hf-to-gguf.py to
convert_hf_to_gguf.py breaking how convert is called from within a Docker
container.
2024-07-12 11:08:19 +03:00
Jiří Podivín
5aefbce27a
convert : remove fsep token from GPTRefactForCausalLM (#8237)
The <filename> token used by Refact doesn't serve
the same purpose as the <file_separator> from CodeGemma.

Signed-off-by: Jiri Podivin <jpodivin@redhat.com>
2024-07-12 11:06:33 +03:00
Georgi Gerganov
71c1121d11
examples : sprintf -> snprintf (#8434)
* examples : sprintf -> snprintf

ggml-ci

* examples : use sizeof() instead of hardcoded constants
2024-07-12 10:46:14 +03:00
Georgi Gerganov
370b1f7e7a
ggml : minor naming changes (#8433)
* ggml : minor naming changes

ggml-ci

* ggml : use PRId64 [no ci]

* ggml : revert FA K/Q names
2024-07-12 10:46:02 +03:00
Chen Xi
b549a1bbef
[SYCL] fix the mul_mat_id ut issues (#8427)
* fix part of mul_mat_id

* skip the bfloat 16 sycl ut

Signed-off-by: Chen Xi <xi2chen@intel.com>

---------

Signed-off-by: Chen Xi <xi2chen@intel.com>
Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com>
Co-authored-by: Chen Xi <xi2chen@intel.com>
2024-07-12 08:52:04 +08:00
Nicholai Tukanov
368645698a
ggml : add NVPL BLAS support (#8329) (#8425)
* ggml : add NVPL BLAS support

* ggml : replace `<BLASLIB>_ENABLE_CBLAS` with `GGML_BLAS_USE_<BLASLIB>`

---------

Co-authored-by: ntukanov <ntukanov@nvidia.com>
2024-07-11 18:49:15 +02:00
Daniel Bevenius
b078c619aa
cuda : suppress 'noreturn' warn in no_device_code (#8414)
* cuda : suppress 'noreturn' warn in no_device_code

This commit adds a while(true) loop to the no_device_code function in
common.cuh. This is done to suppress the warning:

```console
/ggml/src/ggml-cuda/template-instances/../common.cuh:346:1: warning:
function declared 'noreturn' should not return [-Winvalid-noreturn]
  346 | }
      | ^
```

The motivation for this is to reduce the number of warnings when
compilng with GGML_HIPBLAS=ON.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* squash! cuda : suppress 'noreturn' warn in no_device_code

Update __trap macro instead of using a while loop to suppress the
warning.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

---------

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-07-11 17:53:42 +02:00
Johannes Gäßler
808aba3916
CUDA: optimize and refactor MMQ (#8416)
* CUDA: optimize and refactor MMQ

* explicit q8_1 memory layouts, add documentation
2024-07-11 16:47:47 +02:00
Georgi Gerganov
a977c11544
gitignore : deprecated binaries 2024-07-11 11:20:40 +03:00
compilade
9a55ffe6fb
tokenize : add --no-parse-special option (#8423)
This should allow more easily explaining
how parse_special affects tokenization.
2024-07-11 10:41:48 +03:00