Commit graph

  • 40dd64ae77 minor : clear whitespaces Georgi Gerganov 2024-04-03 21:04:47 +03:00
  • 9f62c0173d ci : update checkout, setup-python and upload-artifact to latest (#6456) Ewout ter Hoeven 2024-04-03 20:01:13 +02:00
  • 5d4f12e462 server: add cURL support to server.Dockerfile (#6461) Ed Lepedus 2024-04-03 18:56:37 +01:00
  • 154d4ee39c readme : add feature-rich rust bindings (#6465) Francisco Melo 2024-04-03 18:53:37 +01:00
  • e69945d953 security : create policy (#6354) Joyce 2024-04-03 14:48:07 -03:00
  • 0614a58439 fix Georgi Gerganov 2024-04-03 20:47:19 +03:00
  • 418d50d9f8 fix Georgi Gerganov 2024-04-03 20:45:59 +03:00
  • 6bbed521fa minor Georgi Gerganov 2024-04-03 20:44:46 +03:00
  • 4ea3efdd59 Fix: link on SECURITY.md Joyce 2024-04-03 14:37:36 -03:00
  • 18a3427d5e Fix: link on SECURITY.md Joyce 2024-04-03 14:36:29 -03:00
  • 92e728d570 Update README.md Fattire 2024-04-03 10:35:03 -07:00
  • afcb3eb9a6 Update README.md Francisco Melo 2024-04-03 17:52:58 +01:00
  • b996b00d6f ci: bench: add seed parameter in k6 script Pierrick HYMBERT 2024-04-03 17:18:54 +02:00
  • 22597a4848 ci: bench: add more file type for phi-2: q8_0 and f16. - do not show the comment by default Pierrick HYMBERT 2024-04-03 17:16:05 +02:00
  • 0fb7bfad5c ci: bench: change trigger path to not spawn on each PR Pierrick HYMBERT 2024-04-03 16:46:10 +02:00
  • a7c6758214 Merge branch 'master' into sycl_fix_non_intel_fp16 OuadiElfarouki 2024-04-03 16:43:32 +01:00
  • 3a0d8f6921 server: add cURL support to server.Dockerfile elepedus 2024-04-03 16:43:06 +01:00
  • db214fa578 Missing tokenizer.model error during gguf conversion (#6443) b2590 Abhishek Gopinath K 2024-04-03 21:12:52 +05:30
  • 1ff4d9f3d6 Add OpenChat, Alpaca, Vicuna chat templates (#6397) b2589 kaizau 2024-04-03 23:24:31 +08:00
  • 021c6f50e1 Merge branch 'ggerganov:master' into master kaizau 2024-04-03 21:57:00 +08:00
  • c511d6ac4c Apply suggestions from code review bryanSwk 2024-04-03 21:56:07 +08:00
  • 48850cff19 Remove BOS token from templates, unprefix openchat Kai Zau 2024-04-03 21:45:48 +08:00
  • 91f3db8aab Update llama.cpp Georgi Gerganov 2024-04-03 16:28:38 +03:00
  • 9769e9e083 Correct README link lmat 2024-04-03 09:25:18 -04:00
  • 076b08649e readme : update hot topics Georgi Gerganov 2024-04-03 16:11:15 +03:00
  • 417be61191 CI: Update actions/upload-artifact to v4 Ewout ter Hoeven 2024-04-03 15:10:41 +02:00
  • 1d5dc5441b CI: Update actions/setup-python to v5 Ewout ter Hoeven 2024-04-03 15:08:25 +02:00
  • 08a0c02060 ggml : mul_mat_id use the same tensor for all the experts (#6387) slaren 2024-04-03 15:07:05 +02:00
  • c1ca357a41 CI: Update actions/checkout to v4 Ewout ter Hoeven 2024-04-03 15:06:50 +02:00
  • 716e960a80 metal : pad shared memory to 16 bytes Georgi Gerganov 2024-04-03 15:25:47 +03:00
  • a054283c0f quantize : terminate on errors + trace logs Georgi Gerganov 2024-04-03 15:18:32 +03:00
  • 822caa46a1 llama : produce error if imatrix size does not match Georgi Gerganov 2024-04-03 15:17:56 +03:00
  • c0c4b309d7 Merge branch 'master' into sycl_fix_non_intel_fp16 OuadiElfarouki 2024-04-03 12:57:32 +01:00
  • fc719b68cf imatrix : fix ncall counters Georgi Gerganov 2024-04-03 14:15:40 +03:00
  • 84ef62e8d5 Update ggml-sycl.cpp Ouadie EL FAROUKI 2024-04-03 11:57:51 +01:00
  • e8108997f6 convert : fix handling of n_experts == None Georgi Gerganov 2024-04-03 11:21:01 +03:00
  • 3779b984ac llama : remove ffn tensor counting + add sanity check Georgi Gerganov 2024-04-03 11:03:27 +03:00
  • 84e779df7d Added exception handling for wrong argument type in http request Jonas Holzner 2024-04-03 09:42:18 +02:00
  • d52df2102a Update convert-hf-to-gguf.py Abhishek Gopinath K 2024-04-03 13:02:17 +05:30
  • f62b3d8f46 Merge 115f49a08a into 52604860f9 DisOOM 2024-04-03 15:25:47 +08:00
  • 115f49a08a Update constants.py DisOOM 2024-04-03 14:48:53 +08:00
  • 3b22eb7da5 Update tensor_mapping.py DisOOM 2024-04-03 14:47:40 +08:00
  • b3cf383f24 Update llama.cpp DisOOM 2024-04-03 14:44:11 +08:00
  • d9e48194e4 q/k ln and pos_embd only if required bryan 2024-04-03 10:50:59 +08:00
  • 52604860f9 [SYCL] Disable iqx on windows as WA (#6435) b2586 Meng, Hengyu 2024-04-03 10:34:40 +08:00
  • 6f7ab29a07 server: allow penalizing repetition of newlines on server webpage Shakhar Dasgupta 2024-04-02 21:01:22 -04:00
  • 19dafafd5f add review note slaren 2024-04-03 02:10:43 +02:00
  • a1343aeb8a llama : still use mmap for loading old models, but copy the data to a host buffer slaren 2024-04-03 01:57:33 +02:00
  • a4e54abe6f Cubic "smoothing curve" support kalomaze 2024-04-02 18:21:57 -05:00
  • 86f3666ab4 cuda : fix warning slaren 2024-04-03 00:46:56 +02:00
  • 31adc93486 llama : more loader cleanup, better error checking slaren 2024-04-03 00:46:15 +02:00
  • b5dbcf6f5e Smoothing factor backport kalomaze 2024-04-02 17:10:46 -05:00
  • 57ce61a307 Missing tokenizer.model error during gguf conversion Abhishek Gopinath Kovath 2024-04-03 03:31:17 +05:30
  • fe62909618 metal : add support for non-pow-2 argsort slaren 2024-04-02 20:31:01 +02:00
  • c704c778f6 convert : fix grok tensor names Georgi Gerganov 2024-04-02 21:35:13 +03:00
  • f421b32d5a cuda/argsort : use shared memory instead of pool memory slaren 2024-04-02 20:09:25 +02:00
  • cb4422625a Merge pull request #1 from julialongtin/k1om Julia Longtin 2024-04-02 17:07:46 +00:00
  • 47190a7fe2 formatting. Julia Longtin 2024-04-02 17:01:53 +00:00
  • 8c17353717 minor changes. Julia Longtin 2024-04-02 16:55:40 +00:00
  • 9530398013 make linter happy slaren 2024-04-02 18:21:45 +02:00
  • d08a1f4860 convert-hf-to-gguf.py : update grok (untested) slaren 2024-04-02 18:14:57 +02:00
  • 9f569ca50b massively rewrite assembly routines. Julia Longtin 2024-04-02 15:41:56 +00:00
  • ee19a4ab7e fix KV cache padding, NaN from INFINITY (#6438) Johannes Gäßler 2024-04-02 17:26:22 +02:00
  • f27cbf3610 fix quantizing of merged experts slaren 2024-04-02 17:07:14 +02:00
  • 755c62b525 fix KV cache padding, NaN from INFINITY Johannes Gäßler 2024-04-02 17:04:24 +02:00
  • 68d21debe4 gguf : bump version slaren 2024-04-02 16:38:05 +02:00
  • 6f33852f3d minor slaren 2024-04-02 16:08:55 +02:00
  • 6875369909 llama : add merged experts tensors to the grok tensor map slaren 2024-04-02 16:08:45 +02:00
  • c63dfdf765 fix cmake build Johannes Gäßler 2024-04-02 11:58:59 +02:00
  • bb0d51accd fix excessive KQ_b loads Johannes Gäßler 2024-04-02 11:13:46 +02:00
  • e1ecd3b129 fix compile warnings Johannes Gäßler 2024-04-02 10:27:34 +02:00
  • 3f777acf06 Multiple parallel blocks for batch size 1 Johannes Gäßler 2024-04-01 16:41:56 +02:00
  • 68d793bee8 no ncols == 64 Johannes Gäßler 2024-04-01 15:54:50 +02:00
  • cca6d027a3 4 warps, 256 stride for all D Johannes Gäßler 2024-03-31 18:39:02 +02:00
  • 269374ed81 adjust kernel selection logic Johannes Gäßler 2024-03-31 16:01:27 +02:00
  • 81da919864 no vec for hs, no hs==256 ncols==32 for Volta Johannes Gäßler 2024-03-30 10:34:09 +01:00
  • d59ac670bf 16 cols for Phi-2 Johannes Gäßler 2024-03-30 09:19:19 +01:00
  • 75aa7b4b18 CUDA: faster FlashAttention, kernel for bs == 1 Johannes Gäßler 2024-03-29 23:02:39 +01:00
  • b446893475 fix cmake build Johannes Gäßler 2024-04-02 11:58:59 +02:00
  • 46968c93dc fix excessive KQ_b loads Johannes Gäßler 2024-04-02 11:13:46 +02:00
  • ef7282a445 fix compile warnings Johannes Gäßler 2024-04-02 10:27:34 +02:00
  • d100b7511c array instead of global_memory Meng, Hengyu 2024-04-02 08:38:35 +00:00
  • b69e6c0a0f minor fix bryan 2024-04-02 16:05:53 +08:00
  • 56e8e63cb0 add sealion support bryan 2024-04-02 15:58:17 +08:00
  • d2ecac551d fix indent Meng, Hengyu 2024-04-02 07:57:17 +00:00
  • 5ead10ff90 fix typo Meng, Hengyu 2024-04-02 07:56:15 +00:00
  • 9d49a41410 disable iqx on windows as WA Meng, Hengyu 2024-04-02 07:07:46 +00:00
  • 936289a13f initial commit for sealion support bryan 2024-04-02 13:22:32 +08:00
  • 1eafdc95c8 Using ordered_map instead to make sure the function call observations are in the correct order. Yingbei 2024-04-01 18:20:49 -07:00
  • 5de4a5da07 update grok model loading slaren 2024-04-02 03:08:04 +02:00
  • 8f84ca3cd9 test-backend-ops : test qwen argsort slaren 2024-04-02 02:07:22 +02:00
  • b4a62062db update imatrix slaren 2024-04-02 02:05:38 +02:00
  • 9c5f784669 server readme grammar/style fixes. Fattire 2024-04-01 17:01:59 -07:00
  • deea2007b4 cleanup + disable mmap automatically with split tensors models slaren 2024-04-02 01:55:22 +02:00
  • 6886fdb887 allow quantize to work for split and merged experts models in the same way slaren 2024-04-02 01:35:19 +02:00
  • cebb79f004 Typo fix to server's README.md Fattire 2024-04-01 16:19:35 -07:00
  • 4531b029ee cuda : support non-pow-2 number of experts slaren 2024-04-02 01:11:59 +02:00
  • 0ccfbf2f61 update server doc Jan Boon 2024-04-02 05:13:55 +08:00
  • 822d338bf4 Multiple parallel blocks for batch size 1 Johannes Gäßler 2024-04-01 16:41:56 +02:00
  • be714a0fda check types for stricter restore Jan Boon 2024-04-02 04:17:15 +08:00