Commit graph

1461 commits

Author SHA1 Message Date
Jared Van Bortel
fefc3db527 address review comments 2023-11-05 16:24:48 -05:00
Galunid
781bc54986 Move everything to convert-hf-to-gguf.py 2023-11-05 08:42:11 +01:00
Galunid
f7de892ee5 Move util to gguf-py/gguf 2023-11-05 00:43:56 +01:00
Galunid
087f88cc15 Rename convert-generic -> convert-hf-to-gguf 2023-11-05 00:37:00 +01:00
Galunid
2120195bb1 Yarn rope for baichuan 2023-11-04 23:15:41 +01:00
Galunid
e64f4de189 Revert "Remove 'old' conversion scripts" - needed for testing
This reverts commit f4b9a7ea02.
2023-11-04 23:10:39 +01:00
Galunid
fd30850576 Add big endian support 2023-11-04 23:01:38 +01:00
Galunid
03c9683eb7 Restore support for RWForCausalLM 2023-11-04 20:43:29 +01:00
cebtenzzre
007be85087 model.py : add missing future import 2023-11-02 12:08:44 -04:00
cebtenzzre
e9abcc9c7c fix linter complaints 2023-11-02 00:06:32 -04:00
cebtenzzre
66ccd62102 sort imports 2023-11-01 23:26:28 -04:00
cebtenzzre
8f31dc54ec fix mypy errors 2023-11-01 23:24:46 -04:00
Galunid
4fdd7cdf2b Review fixes, persimmon fixes 2023-11-01 02:32:49 +01:00
Galunid
3ec89dcc69 Use 'IntEnum' instead of 'Enum' 2023-10-31 22:23:26 +01:00
Galunid
f4b9a7ea02 Remove 'old' conversion scripts 2023-10-31 16:27:06 +01:00
Galunid
235acc18cd Small refactor 2023-10-31 16:23:53 +01:00
Galunid
c94df09732 Rework tokenizer handling 2023-10-31 16:11:08 +01:00
Galunid
b2ba44eab2 Flake8 fixes 2023-10-31 15:38:24 +01:00
Galunid
dc3115f2a3 Add another alias to n_layers 2023-10-31 04:20:51 +01:00
Galunid
0743f7a900 Fix variable 2023-10-31 03:52:52 +01:00
Galunid
b9c664ab2f Woops 2023-10-31 03:42:55 +01:00
Galunid
6f6856c6ea [Untested] Initial Persimmon support 2023-10-31 03:27:04 +01:00
Galunid
94ba1db24a Add Starcoder and Refact 2023-10-31 03:12:25 +01:00
Galunid
0afa75a9a2 Add Falcon support 2023-10-31 02:57:37 +01:00
Galunid
3bb9844de9 Get rid of dumb print 2023-10-31 01:54:24 +01:00
Galunid
08918b700e MPT conversion fix 2023-10-31 01:52:55 +01:00
Galunid
443f7d586e Call add_tensor before write_* functions 2023-10-29 20:00:54 +01:00
Galunid
550b925af2 Missing variable 2023-10-29 02:06:41 +01:00
Galunid
989db34149 Missing variable 2023-10-29 02:05:28 +01:00
Galunid
8618b4e74c Add [UNTESTED] Baichuan support 2023-10-29 01:38:35 +02:00
Galunid
0ff237105d Make gguf_writer member of Model, rework tokenizer export 2023-10-29 00:33:05 +02:00
Galunid
22201248a0 Remove comments 2023-10-27 02:05:27 +02:00
Galunid
4823b9bdcb Initial generic convert script 2023-10-26 15:43:19 +02:00
Georgi Gerganov
6961c4bd0b batched-bench : print params at start 2023-10-25 10:26:27 +03:00
Georgi Gerganov
cc44877486 log : disable pid in log filenames 2023-10-25 10:09:16 +03:00
cebtenzzre
ad93962657 server : add parameter -tb N, --threads-batch N (#3584) (#3768)
Co-authored-by: Michael Coppola <m18coppola@gmail.com>
Co-authored-by: Michael Coppola <info@michaeljcoppola.com>
2023-10-24 23:10:43 +03:00
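As a rough illustration of the `-tb N, --threads-batch N` option added above, the sketch below shows how such a flag could be parsed; the struct and field names are assumptions for illustration, not the server's actual parameter code.

```c++
// Hedged sketch of parsing a -tb / --threads-batch option alongside -t / --threads.
// The struct name and fields below are illustrative assumptions.
#include <cstdlib>
#include <string>

struct server_params_sketch {
    int n_threads       = 4;  // threads used for generation (-t)
    int n_threads_batch = 4;  // threads used for batch / prompt processing (-tb)
};

static void parse_args_sketch(int argc, char ** argv, server_params_sketch & params) {
    for (int i = 1; i < argc; ++i) {
        const std::string arg = argv[i];
        if ((arg == "-t" || arg == "--threads") && i + 1 < argc) {
            params.n_threads = std::atoi(argv[++i]);
        } else if ((arg == "-tb" || arg == "--threads-batch") && i + 1 < argc) {
            params.n_threads_batch = std::atoi(argv[++i]);
        }
    }
}
```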
Georgi Gerganov
1717521cdb server : do not block system prompt update (#3767)
* server : do not block system prompt update

* server : update state machine logic to process system prompts

* server : minor
2023-10-24 23:08:20 +03:00
Georgi Gerganov
b2f7e04bd3 sync : ggml (conv ops + cuda MSVC fixes) (#3765)
ggml-ci
2023-10-24 21:51:20 +03:00
John Smith
abd21fc99f cmake : add missed dependencies (#3763) 2023-10-24 20:48:45 +03:00
Georgi Gerganov
2b4ea35e56 cuda : add batched cuBLAS GEMM for faster attention (#3749)
* cmake : add helper for faster CUDA builds

* batched : add NGL arg

* ggml : skip nops in compute_forward

* cuda : minor indentation

* cuda : batched cuBLAS GEMMs for src0 F16 and src1 F32 (attention ops)

* Apply suggestions from code review

These changes plus:

```c++
#define cublasGemmBatchedEx hipblasGemmBatchedEx
```

are needed to compile with ROCm. I haven't done performance testing, but it seems to work.

I couldn't figure out how to propose a change for lines outside what the pull changed; also, this is the first time I'm trying to create a multi-part review, so please forgive me if I mess something up.

* cuda : add ROCm / hipBLAS cublasGemmBatchedEx define

* cuda : add cublasGemmStridedBatchedEx for non-broadcasted cases

* cuda : reduce mallocs in cublasGemmBatchedEx branch

* cuda : add TODO for calling cublas from kernel + using mem pool

---------

Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
2023-10-24 16:48:37 +03:00
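For context on the ROCm note in the review comment above, a minimal compatibility shim mapping the batched cuBLAS entry points to their hipBLAS counterparts might look like the sketch below; the guard macro and include path are assumptions and may differ from the repository's actual headers.

```c++
// Sketch of a cuBLAS -> hipBLAS compatibility shim for ROCm builds, as described in the
// review comment above. Guard macro and include path are illustrative assumptions.
#if defined(GGML_USE_HIPBLAS)
#include <hipblas/hipblas.h>                 // include path varies across ROCm versions
#define cublasGemmBatchedEx        hipblasGemmBatchedEx
#define cublasGemmStridedBatchedEx hipblasGemmStridedBatchedEx
#else
#include <cublas_v2.h>
#endif
```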
Galunid
daab3d7f45 Add more tokenizer tests (#3742)
* Add more tokenizer tests

* Add starcoder

* Update test vocab files

* Restrict bpe tokenizer tests to unicode planes

* Update comment

* Comment cosmetics

* Remove bloom vocab/test
2023-10-24 09:17:17 +02:00
Georgi Gerganov
469c9addef metal : handle ggml_scale for n%4 != 0 (close #3754)
ggml-ci
2023-10-24 09:47:22 +03:00
Georgi Gerganov
e3932593d4 Revert "make : add optional CUDA_NATIVE_ARCH (#2482)"
This reverts commit 96981f37b1.

See:

https://github.com/ggerganov/llama.cpp/pull/2482#issuecomment-1775975866
2023-10-23 23:46:05 +03:00
M. Yusuf Sarıgöz
9d02956443 issues : separate bug and enhancement template + no default title (#3748) 2023-10-23 22:57:16 +03:00
Galunid
69a6735087 Update special token handling in conversion scripts for gpt2 derived tokenizers (#3746)
We still have the heads-up in `README.md` regarding `bpe` tokenizers, and this patch is needed for:

- a couple of tokenizer tests
- some more handling of `special` and `non-special` added tokens (as far as I understand it)

* Update special token handling

* Add mpt
2023-10-23 21:46:00 +02:00
Marcus Dunn
5be6c803fa llama : remove token functions with context args in favor of model (#3720)
* added `llama_model_token_*` variants to all the `llama_token_*` functions.

* added `LLAMA_API`

* formatting

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* removed old `llama_token` functions

* changed 3 more functions to take in model

- `llama_token_get_text`
- `llama_token_get_score`
- `llama_token_get_type`

* added back docs

* fixed main.cpp

* changed token functions to use new model variants

* changed token functions to use new model variants

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-10-23 22:40:03 +03:00
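As a small illustration of the API change described in this entry, the model-based token accessors mentioned above would be used roughly like this (a minimal sketch, assuming a `llama.h` that declares these functions with a `llama_model *` parameter and an already-loaded `model`):

```c++
// Minimal sketch of the model-based token accessors; assumes llama.h declares these
// functions taking a llama_model * instead of a llama_context *.
#include "llama.h"
#include <cstdio>

static void print_token_info(const llama_model * model, llama_token token) {
    const char * text  = llama_token_get_text(model, token);   // was: llama_token_get_text(ctx, token)
    const float  score = llama_token_get_score(model, token);  // was: llama_token_get_score(ctx, token)
    printf("token %d: '%s' (score %.2f)\n", token, text, score);
}
```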
Galunid
6336701c93 Fix baichuan convert script not detecting model (#3739)
It seems nobody objects.
2023-10-23 17:47:03 +02:00
Alex
96981f37b1 make : add optional CUDA_NATIVE_ARCH (#2482)
Use the environment variable `CUDA_NATIVE_ARCH` if present to set NVCC arch. Otherwise, use `native`.
2023-10-22 22:56:53 +03:00
Georgi Gerganov
438c2ca830 server : parallel decoding and multimodal (#3677)
* implementing parallel decoding in server example

* crash fixed

* save dev progress

* refactored sampling function

* completion endpoint working

* multiple client support

* grammar + no stream completion

* cached prompt support

* chat.mjs support cached prompt + some fixes

* server ui now support multiple clients

* unused change reverted

* fixed timings per slot

* add context swap

* add changes to README.md

* llava multimodal integration

* fixed tokens probs

* add multimodal input - alpha

* refactor code + remove unused comments + improved README.md

* fix compilation errors with llvm

* notify the user from the server UI that multimodality is unavailable

* some ci fixes

* fix ci make build undefined ref errors

* fix prompts longer than ctx, as proposed in #3639

* fixed premature end due to stop word

* context shift fixed

* fix llava implementation

* sync README.md changes

* readme change

* update api like OpenAI

* multimodal support enabled by default

* fix make build errors

* fix multiple clients

* fix zig build

* new sampling API

* latest changes of sampling API

* server : coding-style normalization

* server : coding-style normalization (part 2)

* server : remove beam-search functionality

* server : bug fix in ingest_images

n_tokens is incremented internally by llama_batch_add

* server : use refs + use llama_batch_clear()

* server : snake case

* server : minor sync

* added thread safe pipeline

* server : batch has to be allocated for n_parallel sequences

* server : no need for atomic int - already using mutex

* server : logs + minor code style

* server : fix multibyte handle in partial response (#3706)

* fix image load + view image in chat

* make : silence stb warnings

* clip : link to ggml, not to llama

* server : fix switch fallthrough

* server : fix crash in Debug on macOS (I have no idea why this fixes it!?)

* server : refactor ctx_sampling init + n_ctx + names

* server : bug fix for prompt caching

* Do not save/load image_data to localStorage

* editorconfig : new line in index.html

* server : completion requests remember slot_id

* Update readme to document multimodal in server

* server : minor style

* Update readme to document multimodal in server

* server : hide ctx_sampling->prev behind API (#3696)

* server : apply fix from #3722

* server : fix slot reuse

* server : add comment about changing slot_state to bool

---------

Co-authored-by: FSSRepo <go778sgt@gmail.com>
Co-authored-by: Damian Stewart <d@damianstewart.com>
Co-authored-by: Steward Garcia <57494570+FSSRepo@users.noreply.github.com>
Co-authored-by: Jhen-Jie Hong <iainst0409@gmail.com>
Co-authored-by: M. Yusuf Sarıgöz <yusufsarigoz@gmail.com>
2023-10-22 22:53:08 +03:00
goerch
9e70cc0322 Add test for MPT tokenization (#3728)
* Add test for MPT tokenization

* Revert code motion

* Remove unnecessary restriction in test case

* Clarify logic in conversion
2023-10-22 21:21:42 +02:00