Commit graph

4826 commits

Author SHA1 Message Date
Olivier Chafik
64287a328d tool-call: test Hermes-3-Llama-3.1-8B 2024-10-29 14:52:25 +00:00
Changyeon Kim
8f275a7c45
ggml: Add POOL2D OP for GPU acceleration to the Vulkan backend in the MobileVLM model. (#9763)
* ggml: Add POOL2D OP for GPU ACC to the Vulkan.

- The MobileVLM model now supports inference acceleration through GPU by utilizing the Vulkan backend.
- A GGML_OP_POOL_2D shader has been added. (Pooling)
- The encoding performance of the CLIP model improved from 2.8s on the CPU to 0.7s on the GPU.

Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>

* [fix] Correct the incorrect order of the parameters.

fix casting to int.

Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>

---------

Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>
2024-10-29 09:52:56 +01:00
Georgi Gerganov
8d8ff71536
llama : remove Tail-Free sampling (#10071)
ggml-ci
2024-10-29 10:42:05 +02:00
ochafik
aefac1e5cb tool-call: update scripts/fetch_server_test_models.py 2024-10-28 23:57:23 +00:00
ochafik
b825440c81 tool-call: use Q4_K_M models 2024-10-28 23:56:40 +00:00
ochafik
74d71a673e agent: simplify syntax (default tools to local w/ default port) 2024-10-28 23:54:01 +00:00
ochafik
b51c71c734 tool-call: remove duplicate script to fetch templates 2024-10-28 21:35:18 +00:00
arch-btw
61715d5cc8
llama : Add IBM granite template (#10013)
* Add granite template to llama.cpp

* Add granite template to test-chat-template.cpp

* Update src/llama.cpp

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

* Update tests/test-chat-template.cpp

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

* Added proper template and expected output

* Small change to \n

Small change to \n

* Add code space &

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

* Fix spacing

* Apply suggestions from code review

* Update src/llama.cpp

---------

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-10-28 18:45:33 +01:00
Georgi Gerganov
07028f9d74
flake.lock: Update (#10063)
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/4c2fcb090b1f3e5b47eaa7bd33913b574a11e0a0?narHash=sha256-/uilDXvCIEs3C9l73JTACm4quuHUsIHcns1c%2BcHUJwA%3D' (2024-10-18)
  → 'github:NixOS/nixpkgs/2768c7d042a37de65bb1b5b3268fc987e534c49d?narHash=sha256-AlcmCXJZPIlO5dmFzV3V2XF6x/OpNWUV8Y/FMPGd8Z4%3D' (2024-10-23)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-10-28 08:41:24 -07:00
ochafik
ec547e4137 tool-call: add tests: tool_call=none, parallel_tool_calls=true 2024-10-28 10:04:00 +00:00
R0CKSTAR
524afeec9d
musa: workaround for Guilty Lockup in cleaning src0 (#10042)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2024-10-28 10:02:48 +01:00
Georgi Gerganov
8125e6cbfc
server : don't overfill the batch during infill (#10018)
ggml-ci
2024-10-28 08:49:32 +02:00
ochafik
168add7ec8 Update tool_call.feature 2024-10-28 02:06:00 +00:00
ochafik
dd6d0241a7 tool-call: script to prefetch models used in server tests 2024-10-28 02:01:00 +00:00
ochafik
7fde6d0091 tool_call: test no tool call on a real model + rename scenarios 2024-10-28 02:00:09 +00:00
ochafik
c88095e3fc space nits 2024-10-28 00:27:04 +00:00
ochafik
9a86ea79a2 tool-call: slow tool call integration tests 2024-10-28 00:26:40 +00:00
Georgi Gerganov
8841ce3f43
llama : switch KQ multiplication to F32 precision by default (#10015)
ggml-ci
2024-10-27 20:59:58 +02:00
ochafik
ec9f3b101b nits 2024-10-27 16:44:54 +00:00
ochafik
080982ebf3 tool-call: test MistralNemo in forced tools server tests (w/ parallel tool calls disabled) 2024-10-27 16:39:51 +00:00
Georgi Gerganov
cc2983d375
sync : ggml 2024-10-26 10:34:08 +03:00
bssrdf
8c60a8a462
increase cuda_cpy block size (ggml/996)
Co-authored-by: bssrdf <bssrdf@gmail.com>
2024-10-26 10:33:56 +03:00
Georgi Gerganov
9e4a2563ea
scripts : fix amx sync [no ci] 2024-10-26 10:33:31 +03:00
Georgi Gerganov
668750357e
metal : support permuted matrix multiplicaions (#10033)
* metal : support permuted matrix multiplicaions

ggml-ci

* cont : use nb01 directly for row steps

ggml-ci

* cont : add comments [no ci]

* metal : minor refactor

* metal : minor
2024-10-25 22:26:15 +03:00
wwoodsTM
ff252ea48e
llama : add DRY sampler (#9702)
* sampling : add DRY sampler (post-refactor)

* DRY: Trying to fix coauthors, removed unneeded line

* DRY: Fixed redundant code

* DRY: Fixed crash issue due to DRY being in chain but uninitialized

---------

Co-authored-by: l3utterfly <gc.pthzfoldr@gmail.com>
Co-authored-by: pi6am <34464159+pi6am@users.noreply.github.com>
2024-10-25 19:07:34 +03:00
Michael Podvitskiy
d80fb71f8b
llama: string_split fix (#10022)
* llama: Refactor string_split to use template specialization,  fixes parsing strings with spaces

* llama: Add static_assert in the string_split template to ensure the correct template specialization is used for std::string
2024-10-25 17:57:54 +02:00
Srihari-mcw
2f8bd2b901
llamafile : extend sgemm.cpp support for Q5_0 models (#10010) 2024-10-25 10:27:41 +03:00
Georgi Gerganov
bc5ba007b2
server : check that the prompt fits in the slot's context (#10030)
ggml-ci
2024-10-25 10:13:46 +03:00
Olivier Chafik
30bd00bcf7 agent: fix tools setup 2024-10-25 02:00:47 +01:00
Olivier Chafik
5c414a3335 agent: simplify tools setup 2024-10-25 01:03:45 +01:00
Xuan Son Nguyen
958367bf53
server : refactor slot input data, move tokenizer to HTTP thread (#10023)
* server : refactor slot input data, move tokenizer to HTTP thread

* move prompt_tokens.empty() check

* fix incorrect if branch

* fix infinite generation loop

* bring back infill validation

* add infill test

* try fixing format_infill

* fix test

* remove redundant code

* rename completion to inference

* update docs

* use llama_tokens everywhere
2024-10-24 21:51:22 +02:00
Georgi Gerganov
40f2555797
ci : fix cmake flags for SYCL 2024-10-24 21:23:33 +03:00
Olivier Chafik
0f4fc8cb28 agent: fix no-cache issue in squid for brave tool 2024-10-24 18:59:37 +01:00
Johannes Gäßler
167a515651
CUDA: fix insufficient buffer clearing for MMQ (#10032) 2024-10-24 14:40:23 +02:00
Olivier Chafik
03b86416e1 agent: fix deps + make docker compose setup easier to debug 2024-10-24 12:30:27 +01:00
Johannes Gäßler
c39665f589
CUDA: fix MMQ for non-contiguous src0, add tests (#10021)
* CUDA: fix MMQ for non-contiguous src0, add tests

* revise test code
2024-10-24 11:09:36 +02:00
ochafik
c2926e4bd9 Update README.md 2024-10-24 06:40:16 +01:00
ochafik
d338bfb87f agent: ditch aiohttp & define REQUESTS_CA_BUNDLE to fix http proxying / trust the self-signed cert from python 2024-10-24 06:35:37 +01:00
ochafik
0f5d63943f agent: display http errors nicely 2024-10-24 05:40:58 +01:00
ochafik
f5320af02a tool-call: return tool_call.id (required by Nemo) 2024-10-24 05:40:15 +01:00
ochafik
267e630c14 agent: isolate tools container + log its outgoing HTTP & HTTPS traffic w/ docker compose + self-signed squid proxy 2024-10-24 05:38:54 +01:00
ochafik
414f6f1b30 Merge branch 'tool-call' of github.com:ochafik/llama.cpp into tool-call 2024-10-23 21:22:08 +01:00
ochafik
4394e1cd5e Update tool-call.cpp 2024-10-23 21:21:39 +01:00
wwoodsTM
0a1c750c80
server : samplers accept the prompt correctly (#10019) 2024-10-23 22:27:51 +03:00
Georgi Gerganov
190a37d797
sync : ggml 2024-10-23 17:23:55 +03:00
Georgi Gerganov
2d3aba9ee8
llama.vim : bump generation time limit to 3s [no ci] 2024-10-23 17:16:56 +03:00
Johannes Gäßler
80273a306d CUDA: fix 1D im2col, add tests (ggml/993) 2024-10-23 16:50:02 +03:00
Daniel Bevenius
c19af0acb1 ggml : remove redundant set of contexts used field (ggml/978)
This commit removes the setting of the `used` field of the contexts in
the global state (g_state) in `ggml_init`.

The motivation for this change is that I believe that this additional
initialization might not be required after the changes in Commit
45fc4fed0b9fb5b1af4a8525cbebb95e11208732 ("sync : latest changes from
whisper.cpp"), which changed the initialization of the contexts field
from `{ 0 }` to `{ { 0 } }`:

```console
             g_state = (struct ggml_state) {
-                /*.contexts =*/ { 0 },
+                /*.contexts =*/ { { 0 } },
             };
```
My understanding is that the `{0}` initialization might not have
zero-initialized all the nested fields in every array element because of
compiler differences, and might have been the reason for having the
explicit setting of the `used` fields to false.
2024-10-23 16:50:02 +03:00
Michael Coppola
ac113a0fee
llama.vim : add classic vim support (#9995)
* added classic vim support

* fixed ring update, removed blank line

* minor

* minor

* minor doc update

* removed uneeded var

* minor

* minor

* fixed job_start creating new scratch buffers

* fixed job_start creating new scratch buffers

* fixed ghost text indenting when expandtab is on

* removed unused code

* minor

* unified fim_on_exit

* minor

* vim ghost text rendering now uses pos_x and pos_y parameters

* renamed *_hlgroup to hlgroup_*

* renamed *_ghost_text to ghost_text_*, moved nvim/vim detection to llama#init()

* minor

---------

Co-authored-by: Michael Coppola <info@michaeljcoppola.com>
2024-10-23 14:09:26 +03:00
Jun Hee Yoo
4c9388fb96
metal : add POOL2D and fix IM2COL (#9943)
* add pool_2d

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* fix im2col and add unittest for N>=1024

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* add tests for N % 1024 != 0

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* remove trailing whitespaces

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* apply suggestions

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* apply more optimization

- original IM2COL kernel + _ext with MIN()

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* apply review: change kernel name of pool_2d

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* apply review

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* fix more formatting and enhance readability

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

---------

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>
2024-10-23 13:33:45 +03:00