ochafik
cc50356470
minja: fix vigogne ( https://github.com/google/minja/pull/22 )
2025-01-18 17:55:04 +00:00
ochafik
e3c475cd12
Disable jinja test that has a cryptic windows failure
2025-01-18 14:55:27 +00:00
ochafik
33322e823e
Flush stdout in chat template before potential crash
2025-01-18 10:38:21 +00:00
ochafik
ee1e10e21e
Normalize newlines in test-chat-templates for windows tests
2025-01-18 02:52:40 +00:00
ochafik
d5fa351a24
Revert LLAMA_CHATML_TEMPLATE refactor
2025-01-18 01:04:12 +00:00
ochafik
40db78963b
Merge remote-tracking branch 'origin/master' into jinja
2025-01-18 00:44:37 +00:00
ochafik
b75d0622e4
Refactor common_chat_* functions to accept minja template + use_jinja option
2025-01-18 00:43:38 +00:00
Jeff Bolz
bd38ddea01
vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl ( #11166 )
...
* vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl
Shaders are based on cpy.cu.
* vulkan: support copy from q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl to f32
* ggml: copy q->f32 assumes some contiguity in the destination
2025-01-16 22:47:10 +01:00
Johannes Gäßler
9c8dcefe17
CUDA: backwards pass for misc. ops, add tests ( #11257 )
...
* CUDA: backwards pass for misc. ops, add tests
* remove restrict from pointers
2025-01-16 16:43:38 +01:00
Georgi Gerganov
f11cfdfd7f
ci : use -no-cnv in gguf-split tests ( #11254 )
...
* ci : use -no-cnv in gguf-split tests
ggml-ci
* ci : use -no-cnv in requantize tests
ggml-ci
* scripts : fix [no ci]
2025-01-15 18:28:35 +02:00
Johannes Gäßler
432df2d5f9
RoPE: fix back, CUDA support for back + noncont. ( #11240 )
...
* RoPE: fix back, CUDA support for back + noncont.
* fix comments reg. non-cont. RoPE support [no-ci]
2025-01-15 12:51:37 +01:00
Olivier Chafik
3ed670b6dd
Merge remote-tracking branch 'origin/master' into jinja
2025-01-14 12:17:07 +00:00
Olivier Chafik
091592d758
Refactor test-chat-template.cpp ( #11224 )
...
* Refactor test-chat-template
* Update test-chat-template.cpp
2025-01-14 10:16:41 +00:00
ochafik
7c84ebc231
Test templates w/ minja
2025-01-13 21:23:30 +00:00
ochafik
1aac99ad54
Refactor test-chat-template
2025-01-13 20:11:27 +00:00
Georgi Gerganov
afa8a9ec9b
llama : add llama_vocab
, functions -> methods, naming ( #11110 )
...
* llama : functions -> methods (#11110 )
* llama : add struct llama_vocab to the API (#11156 )
ggml-ci
* hparams : move vocab params to llama_vocab (#11159 )
ggml-ci
* vocab : more pimpl (#11165 )
ggml-ci
* vocab : minor tokenization optimizations (#11160 )
ggml-ci
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* lora : update API names (#11167 )
ggml-ci
* llama : update API names to use correct prefix (#11174 )
* llama : update API names to use correct prefix
ggml-ci
* cont
ggml-ci
* cont
ggml-ci
* minor [no ci]
* vocab : llama_vocab_add_[be]os -> llama_vocab_get_add_[be]os (#11174 )
ggml-ci
* vocab : llama_vocab_n_vocab -> llama_vocab_n_tokens (#11174 )
ggml-ci
---------
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-01-12 11:32:42 +02:00
Molly Sophia
ee7136c6d1
llama: add support for QRWKV6 model architecture ( #11001 )
...
llama: add support for QRWKV6 model architecture (#11001 )
* WIP: Add support for RWKV6Qwen2
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* RWKV: Some graph simplification
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Add support for RWKV6Qwen2 with cpu and cuda GLA
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* RWKV6[QWEN2]: Concat lerp weights together to reduce cpu overhead
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Fix some typos
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* code format changes
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Fix wkv test & add gla test
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Fix cuda warning
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Update README.md
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Update ggml/src/ggml-cuda/gla.cu
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Fix fused lerp weights loading with RWKV6
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* better sanity check skipping for QRWKV6 in llama-quant
thanks @compilade
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
Co-authored-by: compilade <git@compilade.net>
---------
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: compilade <git@compilade.net>
2025-01-10 09:58:08 +08:00
Xuan Son Nguyen
d9feae1c06
llama-chat : add phi 4 template ( #11148 )
2025-01-09 10:07:33 +01:00
Johannes Gäßler
53ff6b9b9f
GGUF: C++ refactor, backend support, misc fixes ( #11030 )
...
* GGUF: C++ refactor, backend support, misc fixes
remove ggml_tensor.backend
update CODEOWNERS [no ci]
remove gguf_get_data from API
revise GGUF API data types
2025-01-07 18:01:58 +01:00
Georgi Gerganov
47182dd03f
llama : update llama_model API names ( #11063 )
...
* llama : deprecate llama_free_model, add llama_model_free
ggml-ci
* llama : change `llama_load_model_from_file` -> `llama_model_load_from_file`
ggml-ci
2025-01-06 10:55:18 +02:00
Jeff Bolz
716bd6dec3
vulkan: optimize mul_mat for small values of N ( #10991 )
...
Make the mul_mat_vec shaders support N>1 (as a spec constant, NUM_COLS) where
the batch_strides are overloaded to hold the row strides. Put the loads from the
B matrix in the innermost loop because it should cache better.
Share some code for reducing the result values to memory in mul_mat_vec_base.
2024-12-30 18:27:11 +01:00
Jeff Bolz
a813badbbd
vulkan: im2col and matmul optimizations for stable diffusion ( #10942 )
...
* tests: Add im2col perf tests
* vulkan: optimize im2col, more elements per thread
* vulkan: increase small tile size for NV_coopmat2
* vulkan: change im2col to 512 elements per workgroup
2024-12-29 10:16:34 +01:00
Yun Dou
b92a14a841
llama : support InfiniAI Megrez 3b ( #10893 )
...
* Support InfiniAI Megrez 3b
* Fix tokenizer_clean_spaces for megrez
2024-12-23 01:35:44 +01:00
Georgi Gerganov
5cab3e4aaa
llama : minor grammar refactor ( #10897 )
...
ggml-ci
2024-12-19 17:42:13 +02:00
Johannes Gäßler
cd920d0ac3
tests: disable GGUF test for bad value size ( #10886 )
2024-12-19 08:53:58 +01:00
Johannes Gäßler
081b29bd2a
tests: add tests for GGUF ( #10830 )
2024-12-17 19:09:35 +01:00
Georgi Gerganov
0006f5a74a
ggml : update ggml_backend_cpu_device_supports_op ( #10867 )
...
* ggml : fix cpy op for IQ-quants to use reference impl
ggml-ci
* ggml : disable tests involving i-matrix quantization
* ggml : update ggml_backend_cpu_device_supports_op
ggml-ci
2024-12-17 18:35:42 +02:00
Georgi Gerganov
644fd71b44
sampling : refactor + optimize penalties sampler ( #10803 )
...
* sampling : refactor + optimize penalties sampler
ggml-ci
* common : apply ignore_eos as logit bias
ggml-ci
* batched : remove penalties sampler
* params : allow penalty_last_n == -1 to be equal to context size
ggml-ci
* common : by default, move the penalties at the end of the sampling chain
ggml-ci
* common : ignore all EOG tokens
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* common : move back the penalties at the front of the sampling chain
ggml-ci
* readme : restore hint about --ignore-eos flag [no ci]
* llama : minor
ggml-ci
* webui : update
---------
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2024-12-16 12:31:14 +02:00
Valentin Mamedov
a0974156f3
llama : add Deepseek MoE v1 & GigaChat models ( #10827 )
...
* Add deepseek v1 arch & gigachat template
* improve template code
* add readme
* delete comments
* remove comment
* fix format
* lint llama.cpp
* fix order of deepseek and deepseek2, move gigachat temlate to the end of func
* fix order of deepseek and deepseek2 in constants; mark shared exp as deepseek arch need
* remove comments
* move deepseek above deepseek2
* change placement of gigachat chat template
2024-12-15 19:02:46 +02:00
HimariO
ba1cb19cdd
llama : add Qwen2VL support + multimodal RoPE ( #10361 )
...
* Barebone Qwen2VL LLM convertor
* Add Qwen2VL cli entrypoint
* [WIP] add qwen2vl arch
* Verify m-rope output
* Add vl-rope/2d-rope support for qwen2vl ViT
* update qwen2vl cli tool
* update 5D tensor op workaround
* [WIP] qwen2vl vision model
* make batch and clip utils compatible with qwen2vl
* [WIP] create inference workflow, gguf convert script but fix
* correcting vision-rope behavior, add the missing last layer back to ViT
* add arg parser to qwen2vl_surgery
* replace variable size array with vector
* cuda-gdb cmake preset
* add fp32 mrope, vision rope kernel
* add fp16 support for qwen2vl and m-rope
* add `GGML_ROPE_TYPE_MROPE`, `GGML_ROPE_TYPE_VISION`
* fix rope op mode switching, out dated func args
* update `llama_hparams`
* update to keep up stream changes
* resolve linter, test errors
* add makefile entry, update speical image padding token
* add mrope unit test, fix few compiler warnings
* rename `mrope` related function, params
* minor updates on debug util, bug fixs
* add `m-rope` testcase to `test-backend-ops`
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* fix traililng whitespce
* store `llama_hparams.rope_sections` with fixed size array
* update position id tensor size check in GGML_OP_ROPE
* minor updates
* update `ggml_backend_*_supports_op` of unsupported backends
* remote old `rope_section` compare operator
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-12-14 14:43:46 +02:00
Diego Devesa
cb13ef85a4
remove CMAKE_WINDOWS_EXPORT_ALL_SYMBOLS ( #10797 )
...
other windows build fixes
2024-12-12 19:02:49 +01:00
PAB
a8cbab201d
ggml: add GGML_SET
Metal kernel + i32 CPU kernel (ggml/1037)
...
* implemented cpu kernel
* add i32 test cases in test-backend-ops
* typedef `ggml_metal_kargs_set`
* implemented `kernel_set`
* memcpy
2024-12-05 13:27:33 +02:00
PAB
c2082d93a8
ggml : add GGML_PAD_REFLECT_1D
operation (ggml/1034)
...
* ggml_pad_reflect_1d defined in header
* implemented on CPU
* called the forward pass
* impl Metal kernel
* added Metal kernel
* added OP_PAD_REFLECT_1D in test-backend-ops.cpp
* add test-pad-reflect-1d test case
* test case support multiple backend
2024-12-05 13:27:31 +02:00
ltoniazzi
253b7fde91
Fix HF repo commit to clone lora test models ( #10649 )
2024-12-04 10:45:48 +01:00
Jeff Bolz
2759916d86
vulkan: Implement "fast divide" (mul+shift) for unary ops like copy ( #10642 )
2024-12-04 08:28:59 +01:00
PAB
efb6ae9630
feat: add GGML_UNARY_OP_ARGMAX
Metal kernel (ggml/1019)
...
* implemented argmax kernel
* tpig -> tgpig
* change to strides
* contiguous assertions
* kernel working and tested
* argmax simd parallel implementation
* added 2 new tests for argmax in test-backend-ops
* cosmit
* added 3 tests cases for perf eval
* add test_argmax in make_test_cases_perf
* Update test-backend-ops.cpp
Co-authored-by: Diego Devesa <slarengh@gmail.com>
---------
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2024-12-03 20:04:49 +02:00
Georgi Gerganov
0115df2f65
metal : small-batch mat-mul kernels ( #10581 )
...
* metal : small-batch mat-mul kernels
ggml-ci
* metal : add rest of types
ggml-ci
* metal : final adjustments
ggml-ci
* metal : add comments
ggml-ci
2024-12-03 11:52:33 +02:00
Xuan Son Nguyen
642330ac7c
llama : add enum for built-in chat templates ( #10623 )
...
* llama : add enum for supported chat templates
* use "built-in" instead of "supported"
* arg: print list of built-in templates
* fix test
* update server README
2024-12-02 22:10:19 +01:00
Juk Armstrong
917786f43d
Add mistral-v1
, mistral-v3
, mistral-v3-tekken
and mistral-v7
chat template types ( #10572 )
...
* Templates: `mistral-v1`, `mistral-v2`, `mistral-v3`, `mistral-v3-tekken`
* Changed system message logic and added tests for all 4
* Invalid `system_message` instead of `content` fixed
* Removed tab-indented lines
* Added template code and test for `mistral-v7`
* Added all tests. Fixed bug with `tmpl == "llama2"` test.
* Replaced tabs with spaces.
* Removed `'mistral-v2'` option as no (open) models ever used it
* Removed all references to 'v2' template from comments
* Update llama.cpp
Fixed `trim_assistant_message` bug
2024-12-01 23:09:49 +01:00
Diego Devesa
7cc2d2c889
ggml : move AMX to the CPU backend ( #10570 )
...
* ggml : move AMX to the CPU backend
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-11-29 21:54:58 +01:00
Georgi Gerganov
f0678c5ff4
ggml : fix I8MM Q4_1 scaling factor conversion ( #10562 )
...
ggml-ci
2024-11-29 16:25:39 +02:00
Jeff Bolz
904109ed0d
vulkan: fix group_norm ( #10496 )
...
Fix bad calculation of the end of the range. Add a backend test that
covers the bad case (taken from stable diffusion).
Fixes https://github.com/leejet/stable-diffusion.cpp/issues/439 .
2024-11-26 16:45:05 +01:00
Diego Devesa
5931c1f233
ggml : add support for dynamic loading of backends ( #10469 )
...
* ggml : add support for dynamic loading of backends
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-11-25 15:13:39 +01:00
Georgi Gerganov
f6d12e7df8
tests : fix compile warning
2024-11-25 15:17:32 +02:00
Georgi Gerganov
d9d54e498d
speculative : refactor and add a simpler example ( #10362 )
...
* speculative : refactor and add a simpler example
ggml-ci
* speculative : clean-up and add comments and TODOs [no ci]
* speculative : manage context in common_speculative
ggml-ci
* speculative : simplify
ggml-ci
* speculative : simplify (cont)
ggml-ci
* speculative : add --draft-min CLI arg
* speculative : minor fixup
* make : build fixes
* speculative : do not redraft previous drafts
ggml-ci
* speculative : fix the draft sampling
ggml-ci
* speculative : fix compile warning
* common : refactor args
ggml-ci
* common : change defaults [no ci]
* common : final touches
ggml-ci
2024-11-25 09:58:41 +02:00
Diego Devesa
a5e47592b6
cuda : optimize argmax ( #10441 )
...
* cuda : optimize argmax
* remove unused parameter
ggml-ci
* fixup : use full warps
ggml-ci
* Apply suggestions from code review
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* fix ub
* ggml : check ne00 <= INT32_MAX in argmax and argsort
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-11-21 18:18:50 +01:00
Johannes Gäßler
02e4eaf22f
ggml-opt: fix data corruption (ggml/1022)
2024-11-21 09:22:02 +02:00
Jeff Bolz
b3e585988f
vulkan: Optimize soft_max ( #10301 )
...
* vulkan: Optimize soft_max
Large soft_max could already saturate memory, but small/medium sizes were
pretty slow. The bulk of the gains for them comes from using a smaller
workgroup size, and making the workgroup size match the subgroup size also
makes the barriers much cheaper.
Cache some values in locals to avoid refetching/recomputing. And stamp
out a few "template instantiations" so smaller cases will fully unroll.
Add a missing early return for OOB rows. This happens when there are more
than 512 rows and the dispatch is 512 x H.
* vulkan: Further soft_max optimizations
Restore the workgroup size of 512 case, use it for >1024.
Use unrollable loops for more iteration counts.
2024-11-19 08:25:17 +01:00
Georgi Gerganov
24203e9dd7
ggml : inttypes.h -> cinttypes ( #0 )
...
ggml-ci
2024-11-17 08:30:29 +02:00
Georgi Gerganov
84274a10c3
tests : remove test-grad0
2024-11-17 08:30:29 +02:00