Commit graph

  • 1f8a592482 cuda : make loops use the same loop values Georgi Gerganov 2024-02-03 14:01:32 +02:00
  • 3195d8267e Update server-llm.sh Нияз Гарифзянов 2024-02-03 14:57:59 +03:00
  • 7c34655b36 cuda : use int instead of int64_t Georgi Gerganov 2024-02-03 13:39:46 +02:00
  • 4e0d6dd9c1 imatrix: be able to start from a specific chunk Iwan Kawrakow 2024-02-03 13:30:39 +02:00
  • 52bb63c708 refactor : switch to emplace_back to avoid extra object (#5291) b2054 Michael Klimenko 2024-02-03 12:23:37 +01:00
  • 1ec3332ade YaRN : store rope scaling type as int32_t in memory (#5285) b2053 Jared Van Bortel 2024-02-03 06:22:06 -05:00
  • 6a66c5071a readme : add tenere in the ui tools list (#5284) BADR 2024-02-03 12:20:26 +01:00
  • b150abe83e cuda : avoid warp_reduce for smax Georgi Gerganov 2024-02-03 13:17:47 +02:00
  • 935227bf32 imatrix: adding --combine and --continue-from Iwan Kawrakow 2024-02-03 13:14:37 +02:00
  • bdec4a2263 Fix flake8 0cc4m 2024-02-03 11:54:03 +01:00
  • f04dc725d6 Enable asynchronous transfers in Vulkan backend 0cc4m 2024-02-03 11:51:53 +01:00
  • 0dcf96d437 Add Vulkan debug and validate flags to Make and CMakeLists.txt 0cc4m 2024-02-03 11:27:03 +01:00
  • 4e9091f11e Fix Vulkan on Intel ARC 0cc4m 2024-02-03 11:21:58 +01:00
  • 64e2bc9184 Update README.md calvinweb 2024-02-03 17:06:40 +08:00
  • 3d873b8500 server : fix tokens in prompt array being treated as separate prompts Niall Coates 2024-02-03 09:00:38 +00:00
  • a305dba8ff Fix im2col with 32fp (#5286) b2051 AidanBeltonS 2024-02-03 08:11:37 +00:00
  • ac01456725 update install make by w64devkit jianyuzh 2024-02-03 15:17:13 +08:00
  • d6fa7876a0 added help text l3utterfly 2024-02-03 10:19:49 +09:00
  • 620d4c5493 added dynamic temp params in main l3utterfly 2024-02-03 10:17:58 +09:00
  • 3de7ac76d5 Switch to emplace_back to avoid extra object Michael Klimenko 2024-02-02 19:08:40 +01:00
  • 245a5498f2 Fix cpy with dims of 3 Aidan 2024-02-02 17:46:59 +00:00
  • 531b470546 Add TypeError exception for BPE vocab. Sang-Kil Park 2024-02-03 02:08:14 +09:00
  • 72af9abf5d Load Balancing Cluster Example JohnnyB 2024-02-02 16:47:23 +00:00
  • 339d64a6f7 Fix im2col with 32fp Aidan 2024-02-02 16:40:37 +00:00
  • 137971b2e1 llama : store mapped names as const char * Jared Van Bortel 2024-02-02 10:42:42 -05:00
  • 1dd7aa9b1c YaRN : store rope scaling type as int32_t in memory Jared Van Bortel 2024-02-02 10:27:59 -05:00
  • 5fbea121f3 docs: add tenere in the ui tools list pythops 2024-02-02 15:41:21 +01:00
  • 191221178f perplexity : fix KL divergence calculations on Windows (#5273) b2050 kalomaze 2024-02-02 08:15:30 -06:00
  • b68a112204 cuda : fix __hisinf() result check Georgi Gerganov 2024-02-02 15:12:28 +02:00
  • e437b37fd0 scripts : parse wtype in server-llm.sh (#5167) Georgi Gerganov 2024-02-02 14:23:40 +02:00
  • 784bad6379 scripts : fix check for wfile Georgi Gerganov 2024-02-02 14:23:12 +02:00
  • 2d40085c26 py : add check for '.attn.masked_bias' layers to GPT2model (#5281) Mirror Azure 2024-02-02 14:39:09 +03:00
  • 12eaa22628 tests : update dims Georgi Gerganov 2024-02-02 11:55:38 +02:00
  • 7a6601c7f1 Added check for '.attn.masked_bias' layers to GPT2model MirrorAzure 2024-02-02 12:34:20 +03:00
  • b05102fe8c Tidy ggml-sycl (#5261) b2047 AidanBeltonS 2024-02-02 08:39:48 +00:00
  • 6b91b1e0a9 docker : add build for SYCL, Vulkan + update readme (#5228) Xuan Son Nguyen 2024-02-02 08:56:31 +01:00
  • e805f0fa99 [SYCL] get MAX_MEM_ALLOC from device property (#5270) b2045 Meng, Hengyu 2024-02-02 15:54:14 +08:00
  • af3ba5d946 [SYCL] update guide of SYCL backend (#5254) Neo Zhang Jianyu 2024-02-02 15:53:27 +08:00
  • e1e721094d llama : fix memory leak in llama_batch_free (#5252) b2043 Ian Bull 2024-02-01 23:20:13 -08:00
  • 621e60e484 fix macro typo Meng, Hengyu 2024-02-02 14:53:59 +08:00
  • 8a682864f7 Fix Windows KL divergence calculations kalomaze 2024-02-01 23:55:53 -06:00
  • 150eac27d6 fix grammer issues jianyuzh 2024-02-02 11:43:11 +08:00
  • 796d91af35 update help of llama-bench jianyuzh 2024-02-02 11:34:41 +08:00
  • 3a59a81eab update for gpu device check jianyuzh 2024-02-02 10:07:56 +08:00
  • a28c5eff5b space instead of tab Meng, Hengyu 2024-02-02 01:46:01 +00:00
  • ef73e58ff0 fix indent Meng, Hengyu 2024-02-02 09:43:49 +08:00
  • 3f2e487f8c get max alloc size from device prop Meng, Hengyu 2024-02-02 01:40:23 +00:00
  • fb81c0c00e add vs install requirement jianyuzh 2024-02-02 09:26:48 +08:00
  • 35b7a7a183 Update llava-surgery-v2.py John 2024-02-02 02:07:42 +01:00
  • 440b2ae2b1 Update convert-image-encoder-to-gguf.py John 2024-02-02 02:07:29 +01:00
  • a27b9a45df Update convert-image-encoder-to-gguf.py John 2024-02-02 01:48:14 +01:00
  • 1f9367c134 Rename llava-survery-v2.py to llava-surgery-v2.py John 2024-02-02 00:26:05 +01:00
  • 8ebdaec761 Update convert-image-encoder-to-gguf.py John 2024-02-02 00:25:08 +01:00
  • 97dda1e098 Update convert-image-encoder-to-gguf.py John 2024-02-01 23:16:30 +01:00
  • 10c830cdfe Create llava-survery-v2.py John 2024-02-01 23:15:40 +01:00
  • 185333bbde bug: Free the allocated tokens in the batch Ian Bull 2024-01-31 23:59:20 -08:00
  • db1f3c482e cuda : avoid zeroing fragments Georgi Gerganov 2024-02-01 22:08:37 +02:00
  • 128dcbd3c9 add --no-mmap in llama-bench (#5257) b2042 Neo Zhang Jianyu 2024-02-02 03:48:53 +08:00
  • c6769b9422 tests : minor fix Georgi Gerganov 2024-02-01 21:24:26 +02:00
  • cda5a60a41 metal : optimize softmax Georgi Gerganov 2024-02-01 20:53:29 +02:00
  • 4d0924a890 Vulkan Phi Fix for AMD Proprietary Drivers (#5260) b2041 0cc4m 2024-02-01 19:25:24 +01:00
  • 56e45a239e metal : optimize softmax for C > 32 Georgi Gerganov 2024-02-01 20:16:32 +02:00
  • 41d136b602 Merge branch 'master' into gg/flash-attn Georgi Gerganov 2024-02-01 19:51:41 +02:00
  • 5a19a9f6d0 cuda : add flash_attn kernel (wip) Georgi Gerganov 2024-02-01 19:47:11 +02:00
  • b957b8f5f6 cuda : add flash_attn kernel (wip) gg/flash-attn-cuda Georgi Gerganov 2024-02-01 19:47:11 +02:00
  • 8ca511cade cuda : fix LLAMA_CUDA_F16 (#5262) b2040 slaren 2024-02-01 18:30:17 +01:00
  • 7a544d49ba Remove std::printf comments Abhilash Majumder 2024-02-01 22:59:15 +05:30
  • 3d2eb9bb9e cuda : fix LLAMA_CUDA_F16 slaren 2024-02-01 17:50:09 +01:00
  • 9463e6b067 Remove blank space Aidan 2024-02-01 16:39:04 +00:00
  • 04486e4076 Tidy some code in ggml-sycl Aidan 2024-02-01 16:31:57 +00:00
  • 23e35e9066 Fix another Vulkan CPY buffer size bug 0cc4m 2024-02-01 17:22:59 +01:00
  • 30c52f32b8 mv position to reduce model reload jianyuzh 2024-02-02 00:18:36 +08:00
  • e76d0013a8 Replace tanh to avoid NaN in gelu shader on AMD proprietary driver 0cc4m 2024-02-01 17:12:00 +01:00
  • da32e212a8 update guide for mmap jianyuzh 2024-02-02 00:01:35 +08:00
  • fb69ed8521 ren no_mmap to mmap, show mmap when not default value in printer jianyuzh 2024-02-01 23:47:41 +08:00
  • d71ac90985 make : generate .a library for static linking (#5205) b2039 Ali Nehzat 2024-02-02 02:18:53 +11:00
  • 47d0d4c6d4 docs: correct docker image for Intel oneMKL ngxson 2024-02-01 16:15:35 +01:00
  • ac26f27028 cuda : increase C to 128 for better performance flash-attn-cuda Georgi Gerganov 2024-02-01 16:12:56 +02:00
  • b2f6338d2f fix code format, change print for --no-mmap jianyuzh 2024-02-01 23:02:29 +08:00
  • 2e46013749 cuda : fix soft_max to use correct mask size Georgi Gerganov 2024-02-01 16:47:20 +02:00
  • 910b15bb40 ggml : fix ggml_soft_max mask requirement Georgi Gerganov 2024-02-01 16:41:02 +02:00
  • e4e28c1c46 fix conflict jianyuzh 2024-02-01 22:31:41 +08:00
  • 6a3fa7aab4 Merge branch 'ggerganov:master' into xsn/docs-sycl-vulkan Xuan Son Nguyen 2024-02-01 15:20:10 +01:00
  • 708a3221f8 docs: correct TOC ngxson 2024-02-01 15:17:43 +01:00
  • e1cddc6e5e sycl: use intel/oneapi-basekit docker image ngxson 2024-02-01 15:16:18 +01:00
  • 5ab650435b Merge branch 'master' into update_bench Neo Zhang Jianyu 2024-02-01 22:11:20 +08:00
  • c36ecbfd37 add --no-mmap, show sycl backend jianyuzh 2024-02-01 22:05:19 +08:00
  • 9a5c2a1681 cuda : switch to F16 scalars + tune warps for RTX 2060 Georgi Gerganov 2024-02-01 15:00:47 +02:00
  • 2c04beeb81 cuda : avoid extra QxQ matrix in shared memory Georgi Gerganov 2024-02-01 14:03:03 +02:00
  • 23d7148d6a update guide for make installation, memory, gguf model link, rm todo for windows build Zhang 2024-02-01 19:32:12 +08:00
  • ce32060198 llama : support InternLM2 (#5184) b2038 Guoteng 2024-02-01 17:19:51 +08:00
  • ca1350304e llama : support InternLM2 (support InternLM2 inference; add add_space_prefix KV pair) 877825076@qq.com 2024-01-29 12:20:39 +08:00
  • 71b69aa7fd cuda : fix flash_attn kernel to produce same results as CPU Georgi Gerganov 2024-02-01 09:40:56 +02:00
  • fd878f71ed cuda: mask as fp16 FSSRepo 2024-01-31 16:22:11 -05:00
  • 3df0b8d47c Merge branch 'gg/flash-attn' of https://github.com/ggerganov/llama.cpp into flash-attn-cuda FSSRepo 2024-01-31 16:09:34 -05:00
  • 0afe47fa5f fix naive implementation FSSRepo 2024-01-31 15:43:42 -05:00
  • 1cfb5372cf Fix broken Vulkan Cmake (properly) (#5230) b2037 Eve 2024-01-31 19:21:55 +00:00
  • 8ad92dc1ec ggml : switch to padded F16 mask for ggml_soft_max, ggml_flash_attn_ext Georgi Gerganov 2024-01-31 19:17:16 +02:00
  • 1ad42b1f1e ggml : ggml_soft_max uses F16 mask gg/flash-attn-mask-f16 Georgi Gerganov 2024-01-31 19:17:16 +02:00
  • b1479dfbc5 fix kernel FSSRepo 2024-01-31 12:28:48 -05:00