Commit graph

  • 054203ae69 metal : gemma2 flash attention support slaren 2024-08-24 22:22:50 +02:00
  • e11bd856d5 CPU/CUDA: Gemma 2 FlashAttention support (#8542) b3620 Johannes Gäßler 2024-08-24 21:34:59 +02:00
  • 6e4080450d remove metal check Johannes Gäßler 2024-08-24 19:10:12 +02:00
  • f5f4cdef4e llama : fix qs.n_attention_wv for DeepSeek-V2 Francis Couture-Harpin 2024-08-24 10:25:39 -04:00
  • 832c6ee394 disable logit softcapping tests on Metal Johannes Gäßler 2024-07-17 18:07:43 +02:00
  • 8043640ef0 apply logit_softcap to scale in kernel Johannes Gäßler 2024-07-17 18:03:59 +02:00
  • 8618413712 CPU/CUDA: Gemma 2 FlashAttention support Johannes Gäßler 2024-07-16 13:54:18 +02:00
  • fc6abde7aa rm comments ltoniazzi 2024-08-24 09:20:31 +01:00
  • 42f65b0289 clone repo commit ltoniazzi 2024-08-24 09:05:43 +01:00
  • 56f2a80f16 Address comments ltoniazzi 2024-08-18 13:35:25 +01:00
  • b8c86b1095 Update .github/workflows/server-convert-and-infer.yml ltoniazzi 2024-08-18 12:24:15 +01:00
  • d1304ec4bf Update .github/workflows/server-convert-and-infer.yml ltoniazzi 2024-08-18 12:23:38 +01:00
  • 156a718ad4 Add lora test workflow ltoniazzi 2024-08-16 13:38:38 +01:00
  • 8f824ffe8e quantize : fix typo in usage help of quantize.cpp (#9145) b3619 João Dinis Ferreira 2024-08-24 07:22:45 +01:00
  • fa3638c6a9 Correct typo run_llama2.sh > run-llama2.sh 蕭澧邦 2024-08-23 21:32:14 +08:00
  • 3ba780e2a8 lora : fix llama conversion script with ROPE_FREQS (#9117) b3618 Xuan Son Nguyen 2024-08-23 12:58:53 +02:00
  • f9352543f2 quantize : fix typo in usage help of quantize.cpp João Dinis Ferreira 2024-08-22 15:34:48 +02:00
  • b180cb352b backup Meng, Hengyu 2024-08-20 08:23:41 +00:00
  • 1a88919759 fix: use potential head_dim for Exaone Carsten Kragelund 2024-08-23 08:27:50 +00:00
  • f4fefeefdd modify makefile caitianchi 2024-08-23 15:44:49 +08:00
  • 73df26713b init caitianchi 2024-08-23 15:44:32 +08:00
  • a07c32ea54 llama : use F32 precision in GLM4 attention and no FA (#9130) gguf-v0.10.0 b3617 piDack 2024-08-23 15:27:17 +08:00
  • 2812b35ce4 remove trailing whitespace Markus Tavenrath 2024-08-23 09:01:46 +02:00
  • b77d7f6f0d fix: llama3.1 rope_freqs not respecting custom head_dim Carsten Kragelund 2024-08-23 05:03:46 +00:00
  • cb6d9962c4 Merge branch 'master' into compilade/bitnet-ternary Francis Couture-Harpin 2024-08-22 16:42:24 -04:00
  • 38913dc8dd convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present Francis Couture-Harpin 2024-08-22 14:31:12 -04:00
  • e353c037d8 vis the difference Yutong Dai 2024-08-22 17:33:53 +00:00
  • 11b84eb457 [SYCL] Add a space to supress a cmake warning (#9133) b3616 Akarshan Biswas 2024-08-22 19:39:47 +05:30
  • 5e2ce24710 Fix issues where the last submit wasn't executed or handled properly. Markus Tavenrath 2024-08-22 15:48:24 +02:00
  • e8cf51bd7c [SYCL] Add a space to supress a cmake warning Akarshan Biswas 2024-08-22 18:09:12 +05:30
  • fcb771eca8 llama : fix typo in xcda_array_view comment Daniel Bevenius 2024-08-22 14:21:02 +02:00
  • 81a37ca577 sync ggml-vocab-qwen2 Ren Xuancheng 2024-08-22 17:09:51 +08:00
  • 8a6ba03c54 fix glm GGG err pidack 2024-08-22 16:50:00 +08:00
  • 0fbed972aa Maybe fix ci Mathijs Henquet 2024-08-22 10:19:00 +02:00
  • 0d198bbf98 Fix trailing ws Mathijs Henquet 2024-08-22 09:49:26 +02:00
  • 395ae48cb0 Format modification Jia Liu 2024-08-22 14:39:43 +08:00
  • 0451b1f9ef re-use same llama_batch Jia Liu 2024-08-22 14:33:01 +08:00
  • fa358e7071 llama : add missing break Francis Couture-Harpin 2024-08-22 01:13:43 -04:00
  • 1731d4238f [SYCL] Add oneDNN primitive support (#9091) b3615 luoyu-intel 2024-08-22 12:50:10 +08:00
  • 3f5eaeab72 update doc luoyu-intel 2024-08-22 11:35:50 +08:00
  • 27ecb076cf simplify the code Jia Liu 2024-08-22 11:28:16 +08:00
  • e04910dc48 llama : remove unused variable Francis Couture-Harpin 2024-08-21 23:06:22 -04:00
  • c8ef556484 merge master to resolve conflicts Jia Liu 2024-08-22 10:33:51 +08:00
  • ba0861e384 the difference is from resize Yutong Dai 2024-08-22 00:04:54 +00:00
  • 0c5baa1cd1 Remove trailing ws Mathijs Henquet 2024-08-22 00:43:30 +02:00
  • 42fb6707e8 Add example of token splitting Mathijs Henquet 2024-08-22 00:41:44 +02:00
  • b11e63ce43 Handle case if tokenizer splits along utf8 continuation bytes Mathijs Henquet 2024-08-22 00:32:28 +02:00
  • 198daa4e34 server : Add tokenize with pieces tests to server.feature Mathijs Henquet 2024-08-22 00:04:21 +02:00
  • aff96920f9 llama : fix Mamba-2 conv state saving Francis Couture-Harpin 2024-08-21 16:28:07 -04:00
  • 2bfe9de6d3 llama : support running Mamba-Codestral-7B-v0.1 Francis Couture-Harpin 2024-08-18 22:43:39 -04:00
  • dceff23fae ggml : SIMD ggml_ssm_scan for Mamba-2 Francis Couture-Harpin 2024-08-18 21:49:39 -04:00
  • 1f0fea70fb llama : initial Mamba-2 support Francis Couture-Harpin 2024-08-01 10:43:42 -04:00
  • a1631e53f6 llama : simplify Mamba with advanced batch splits (#8526) b3614 compilade 2024-08-21 17:58:11 -04:00
  • ad1af06737 llama_load_model_from_buffers Xuan Son Nguyen 2024-08-21 21:22:48 +02:00
  • 8062650343 llama : fix simple splits when the batch contains embeddings compilade/batch-splits Francis Couture-Harpin 2024-08-21 15:09:03 -04:00
  • 5ee96aa7e0 Add to README: ConfiChat - a lightweight, standalone, multi-platform, and privacy focused LLM chat interface with optional encryption Rune Berg 2024-08-22 06:58:25 +12:00
  • 112b6647c4 llama : load model from buffer Xuan Son Nguyen 2024-08-21 20:37:59 +02:00
  • fb7befd045 fix compile issues Markus Tavenrath 2024-08-21 19:08:01 +02:00
  • 73beb8ddab Overlap cmdbuffer creation and cmdbuffer execution in Vulkan backend by submitting smaller cmdbuffers early. Markus Tavenrath 2024-08-21 14:31:59 +02:00
  • b9d6832239 lora : fix llama conversion script with ROPE_FREQS Xuan Son Nguyen 2024-08-21 14:53:54 +02:00
  • 644aa9fd41 Correction too small tensor embeddings to quantize Nexesenex 2024-08-21 13:07:32 +02:00
  • 32f6ead0d9 Improve IQ1 and IQ2 quants Nexesenex 2024-08-19 17:58:12 +02:00
  • d7b9d214fb Shrink a bit IQ3_XXS, bump a bit IQ3_M Nexesenex 2024-08-20 12:45:30 +02:00
  • 766792abd4 Merge 38f4863a24 into fc54ef0d1c agray3 2024-08-21 03:48:58 -07:00
  • dbadcdd5cf harmonize formatting of tensor type conditions Nexesenex 2024-08-20 11:59:41 +02:00
  • ce86019770 change function use_*_bits into difquant_*_tensors Nexesenex 2024-08-21 12:25:38 +02:00
  • cfe866e152 Merge branch 'master' into pr/8836 Nexesenex 2024-08-21 12:23:41 +02:00
  • ff0f3e8427 also add LLAMA_ARG_CONT_BATCHING Xuan Son Nguyen 2024-08-21 12:12:20 +02:00
  • b3eed89478 add LLAMA_ARG_HOST to server dockerfile Xuan Son Nguyen 2024-08-21 11:55:35 +02:00
  • 3748c734bf server : add some missing env variables Xuan Son Nguyen 2024-08-21 11:45:16 +02:00
  • fc54ef0d1c server : support reading arguments from environment variables (#9105) b3613 Xuan Son Nguyen 2024-08-21 11:04:34 +02:00
  • 80d9d2a551 Merge branch 'master' into compilade/batch-splits Francis Couture-Harpin 2024-08-21 04:17:29 -04:00
  • b40eb84895 llama : support for falcon-mamba architecture (#9074) b3612 Younes Belkada 2024-08-21 12:06:36 +04:00
  • 2ee02b290b use fp16fp16fp16 luoyu-intel 2024-08-20 16:01:29 +08:00
  • af1b276a34 use dnnl for intel only luoyu-intel 2024-08-19 17:02:53 +08:00
  • 267af4e75d format luoyu-intel 2024-08-19 16:58:47 +08:00
  • b830685378 fix luoyu-intel 2024-08-19 16:52:15 +08:00
  • c751e65d81 add engine map luoyu-intel 2024-08-19 07:29:43 +00:00
  • 4dc55156ee add dnnl stream luoyu-intel 2024-08-19 07:08:04 +00:00
  • 3d0a64f092 add sycl_f16 luoyu-intel 2024-08-19 06:12:39 +00:00
  • 4cffe910c3 add onednn luoyu-intel 2024-08-19 11:36:35 +08:00
  • f63f603c87 llava : zero-initialize clip_ctx structure fields with aggregate initialization (#8908) b3611 fairydreaming 2024-08-21 09:45:49 +02:00
  • 8455340b87 llama : std::move llm_bigram_bpe from work_queue (#9062) b3610 Daniel Bevenius 2024-08-21 09:32:58 +02:00
  • c2b5885502 Fix minicpm example directory Xie Yanbo 2024-08-21 14:02:40 +08:00
  • 1be5ea7d97 llama : add llama_model_is_recurrent to simplify figuring that out Francis Couture-Harpin 2024-08-20 23:55:14 -04:00
  • b264eddbb2 llama : fix Mamba pooled embeddings with multiple sequences Francis Couture-Harpin 2024-08-20 23:29:48 -04:00
  • 9373e2ba58 cann: merge get row operations of float type zhenweijin 2024-08-19 14:27:26 +08:00
  • 652e9b0d61 llama : fix T5 segfault again Francis Couture-Harpin 2024-08-20 21:37:43 -04:00
  • a2d4d1913c server : added with_pieces functionality to /tokenize endpoint Mathijs Henquet 2024-08-20 23:28:06 +02:00
  • 28670bfbc8 Add missing sentencepiece dependency to pyproject.yaml Jesse Noller 2024-08-20 13:48:43 -06:00
  • 347247a24e imatrix : fix segfault when using a single chunk per batch Francis Couture-Harpin 2024-08-20 15:35:56 -04:00
  • bce54642c8 imatrix : allow processing multiple chunks per batch Francis Couture-Harpin 2024-08-20 15:17:24 -04:00
  • 2f3c1466ff llava: Add ACC OP for GPU acceleration to the Vulkan backend in the LLAVA CLIP model. (#8984) b3609 Changyeon Kim 2024-08-21 04:00:00 +09:00
  • a2d1f44335 Fix check results ggml_acc call 0cc4m 2024-08-20 20:00:00 +02:00
  • 50addec9a5 [SYCL] fallback mmvq (#9088) b3608 Meng, Hengyu 2024-08-20 23:50:17 +08:00
  • 4f8d19ff17 [SYCL] Fix SYCL im2col and convert Overflow with Large Dims (#9052) b3607 zhentaoyu 2024-08-20 23:06:51 +08:00
  • 7937057412 readme : specify non-arg env var Xuan Son Nguyen 2024-08-20 16:49:59 +02:00
  • a857c211e0 add -fa and -dt Xuan Son Nguyen 2024-08-20 16:44:50 +02:00
  • 2746e35607 server : support reading arguments from environment variables Xuan Son Nguyen 2024-08-20 16:28:52 +02:00
  • f3a3033415 fix style Xuan Son Nguyen 2024-08-20 15:33:02 +02:00