Commit graph

  • fa498a28cf Merge branch 'master' into xsn/improve_server_works ngxson 2024-02-27 21:25:41 +01:00
  • 88f5ae3b39 sigint: stderr ngxson 2024-02-27 21:23:39 +01:00
  • ac699991a8 sigint: message ngxson 2024-02-27 21:20:55 +01:00
  • 66c7480a42 Merge branch 'master' into xsn/server_twice_ctrl_c ngxson 2024-02-27 21:17:31 +01:00
  • 2540a290ed Make CUDA compile with QK_K = 64 Iwan Kawrakow 2024-02-27 21:35:11 +02:00
  • 969be5d42b llama : fix non-quantization of expert gating tensors Francis Couture-Harpin 2024-02-27 14:24:45 -05:00
  • 801abe52a3 Simplify to_lower cases Douglas Hanley 2024-02-27 14:08:39 -05:00
  • 9242cf1421 add locale include Douglas Hanley 2024-02-27 12:46:01 -06:00
  • 14bf8965c2 readme : add feature matrix Romain “Artefact2” Dal Maso 2024-02-27 19:36:13 +01:00
  • de64e061da QK_K = 64 tests pass on ARM_NEON and Metal Iwan Kawrakow 2024-02-27 20:12:54 +02:00
  • 6b33a09462 use builtin tolower Douglas Hanley 2024-02-27 11:42:23 -06:00
  • 6afc1f60e1 add srand() in speculative.cpp Minsoo Cheong 2024-02-28 02:26:01 +09:00
  • cb49e0f8c9 Attempt to fix android build (#5752) b2282 Kawrakow 2024-02-27 19:16:49 +02:00
  • 28e6146c11 iq2_xs: attempt to fix AVX dot product for QK_K = 64 Iwan Kawrakow 2024-02-27 18:41:31 +02:00
  • 6bc2e4b854 Attempt to fix android build Iwan Kawrakow 2024-02-27 18:39:27 +02:00
  • fddedfb950 Merge branch 'ggerganov:master' into server_branch pudepiedj 2024-02-27 15:53:43 +00:00
  • 13ba37f1aa WIP: make i-quants work for QK_K = 64 Iwan Kawrakow 2024-02-27 17:30:11 +02:00
  • 0becb22ac0 IQ4_XS: a 4.25 bpw quantization (#5747) b2281 Kawrakow 2024-02-27 16:34:24 +02:00
  • 14d757066b llama : add llama_kv_cache_compress (EXPERIMENTAL) gg/kv-compress Georgi Gerganov 2024-02-25 22:16:13 +02:00
  • e8c37fd893 Merge branch 'server_branch' of https://github.com/pudepiedj/llama.cpp into server_branch pudepiedj 2024-02-27 14:01:24 +00:00
  • 6e9b6a18fb extra comments pudepiedj 2024-02-27 14:01:22 +00:00
  • ad4f567d8e Merge branch 'ggerganov:master' into server_branch pudepiedj 2024-02-27 13:39:30 +00:00
  • c24a2a6e60 cuda : replace remaining shfl_xor with calls to warp_reduce functions (#5744) b2280 Engininja2 2024-02-27 07:22:45 -06:00
  • 1f30b7a9f1 ggml-quants : fix avx2 iq1_s vec_dot when compiled with gcc (#5742) b2279 Engininja2 2024-02-27 06:50:18 -06:00
  • d17bba34bc Merge branch 'server_branch' of https://github.com/pudepiedj/llama.cpp into server_branch pudepiedj 2024-02-27 12:38:18 +00:00
  • 5854b0b86d Improved apikey code pudepiedj 2024-02-27 12:38:16 +00:00
  • 9d533a77d0 llama : fix defrag bugs + add parameter (#5735) b2278 Georgi Gerganov 2024-02-27 14:35:51 +02:00
  • 7824722c8c llama : fix graph size check during defrag Georgi Gerganov 2024-02-27 14:35:27 +02:00
  • 58eca1db04 Add more quantization types to unsupported Aidan 2024-02-27 12:34:38 +00:00
  • d7bb4b6d4d iq4_xs: Added forgotten check for 256 divisibility Iwan Kawrakow 2024-02-27 12:16:17 +02:00
  • 801f998bd9 Fix CI Iwan Kawrakow 2024-02-27 10:59:29 +02:00
  • f162fcaf81 iq4_xs: revert using IQ3_S for attn_k and attn_v Iwan Kawrakow 2024-02-27 10:14:12 +02:00
  • a402d3cf74 Update issues.feature GangT 2024-02-27 14:27:36 +07:00
  • 5c2b230552 iq4_xs: shrink by using IQ3_S for attn_k and attn_q Iwan Kawrakow 2024-02-27 09:10:19 +02:00
  • 2dd36d6ddd sort nfd map; reuse iterator Douglas Hanley 2024-02-27 00:59:44 -06:00
  • 6c2b233b08 iq3_xs: minor fix Iwan Kawrakow 2024-02-27 08:52:49 +02:00
  • 875319b323 remove unused variables Minsoo Cheong 2024-02-27 15:30:52 +09:00
  • 34b942a429 fix style Minsoo Cheong 2024-02-27 15:29:14 +09:00
  • fb18827b4e remove p_accept parameter Minsoo Cheong 2024-02-27 15:09:12 +09:00
  • cbbd1efa06 Makefile: use variables for cublas (#5689) b2277 le.chang 2024-02-27 10:03:06 +08:00
  • b8ca817690 cuda : replace remaining shfl_xor with calls to warp_reduce functions Engininja2 2024-02-25 14:09:11 -06:00
  • fcfb614990 ggml-quants : fix avx2 iq1_s vec_dot when compiled with gcc Engininja2 2024-02-26 16:16:25 -06:00
  • b11a93df41 fix server hangs on empty prompt (#5733) b2276 Xuan Son Nguyen 2024-02-26 23:15:48 +01:00
  • b4a70fc949 merge: missing output in help ngxson 2024-02-26 23:13:33 +01:00
  • df9fb7e7bf first working version ngxson 2024-02-26 22:31:25 +01:00
  • 9c996e3d35 implement nfd for stripping accents in wpm tokenizer Douglas Hanley 2024-02-26 12:38:03 -06:00
  • 4bd4ac931c Merge branch 'ggerganov:master' into server_branch pudepiedj 2024-02-26 17:10:40 +00:00
  • 02702d975d Server header and README.md pudepiedj 2024-02-26 17:09:04 +00:00
  • 35613271b1 llama : disable log message Georgi Gerganov 2024-02-26 18:45:16 +02:00
  • ad40ae635f iq4_nl: Metal implementation Iwan Kawrakow 2024-02-26 16:49:15 +02:00
  • a37980c3d0 iq4_xs: ARM_NEON dot product Iwan Kawrakow 2024-02-26 16:16:01 +02:00
  • 061a16f5a2 iq4_xs: AVX2 dot product Iwan Kawrakow 2024-02-26 15:55:08 +02:00
  • fddbfe839a iq4_xs: CUDA works - 133.2 t/s Iwan Kawrakow 2024-02-26 14:31:18 +02:00
  • 2b21d37a4b iq4_xs: go to super-blocks of 256 and 6-bit scales for blocks of 32 Iwan Kawrakow 2024-02-26 13:17:52 +02:00
  • 67264b3b30 Try IQ4_NL with blocks of 64 - does not look good Iwan Kawrakow 2024-02-26 10:37:25 +02:00
  • 273d985271 std::atomic_flag ngxson 2024-02-26 17:30:00 +01:00
  • a33e6a0d2a Adding IQ2_S and IQ2_M to complete coverage of the 2-3 bit quantization range (#5721) b2275 Kawrakow 2024-02-26 18:28:38 +02:00
  • 547ddae5f7 llama : cont Georgi Gerganov 2024-02-26 18:24:55 +02:00
  • 4e35db1a81 llama : add defrag_thold parameter Georgi Gerganov 2024-02-26 18:19:23 +02:00
  • 465ced3808 Server update pudepiedj 2024-02-26 16:16:23 +00:00
  • 30c29f44cc llama : fix defrag bugs + enable by default Georgi Gerganov 2024-02-26 17:25:08 +02:00
  • 4d04056874 Split device code per kernel Aidan 2024-02-26 15:26:52 +00:00
  • 54af80752c a brief explaining comment Michael Podvitskiy 2024-02-26 16:20:35 +01:00
  • 7547ef5c6d server: twice ctrl+C to exit ngxson 2024-02-26 15:41:55 +01:00
  • 47bb7b48c7 CUDA: fix DEBUG_CUDA_MALLOC (#5729) b2274 Johannes Gäßler 2024-02-26 15:36:38 +01:00
  • 20df113bfa fix server hangs on empty prompt ngxson 2024-02-26 15:31:54 +01:00
  • 6ff178fb92 CUDA: fix DEBUG_CUDA_MALLOC Johannes Gäßler 2024-02-26 13:01:12 +01:00
  • c4d7f81786 readme : update ui list (#5731) Artem 2024-02-26 17:15:28 +03:00
  • e849078c6e [SYCL] Add support for soft_max ALiBi (#5639) b2272 AidanBeltonS 2024-02-26 14:02:11 +00:00
  • de3d90dd3b format fix Michael Podvitskiy 2024-02-26 14:59:54 +01:00
  • 3c23413b8b Adjust print_timings pudepiedj 2024-02-26 13:49:35 +00:00
  • b4bcaa7f47 Update README.md Artem 2024-02-26 16:42:43 +03:00
  • e3ac833d3f using abort_callback from ggml to stop llama computation Michael Podvitskiy 2024-02-08 11:37:30 +01:00
  • 2768634743 Merge remote-tracking branch 'origin/master' into server_branch pudepiedj 2024-02-26 13:31:50 +00:00
  • 74d13ef335 Server updates pudepiedj 2024-02-26 12:09:06 +00:00
  • 67fd33132f unicode : reuse iterator (#5726) b2271 Georgi Gerganov 2024-02-26 14:02:12 +02:00
  • 0b0df95687 unicode : reuse iterator Georgi Gerganov 2024-02-26 14:01:34 +02:00
  • 0d09ea792a fix CI Abhilash Majumder 2024-02-26 15:50:46 +05:30
  • 9790023b07 fix format Abhilash Majumder 2024-02-26 14:11:17 +05:30
  • 20453f3361 rm commented code Abhilash Majumder 2024-02-26 13:15:34 +05:30
  • 683ac381d8 Update pre-processor Aidan 2024-02-21 16:17:02 +00:00
  • c07ccd59fa Add support for bias Aidan 2024-02-21 14:46:10 +00:00
  • d90875a8f1 Merge 1b8da8e0a6 into 4804215cb8 bobqianic 2024-02-26 11:07:34 +00:00
  • 4804215cb8 server: CI fix trailing space (#5728) b2270 Pierrick Hymbert 2024-02-26 11:41:34 +01:00
  • 9d843a2c77 server: CI fix trailing space Pierrick HYMBERT 2024-02-26 11:37:00 +01:00
  • ec0abd2da5 Update examples/quantize/quantize.cpp Kawrakow 2024-02-26 11:06:46 +02:00
  • 8a533f0d90 server: CI tests reduce build matrix (#5725) b2269 Pierrick Hymbert 2024-02-26 09:56:10 +01:00
  • d437a8065a server: CI tests reduce build matrix Pierrick HYMBERT 2024-02-26 09:43:33 +01:00
  • aa12e7333f merge to upstream master jorgealias 2024-02-26 00:43:06 -07:00
  • 00d5cdbdbe Merge branch 'master' into multiple-mount-points jorgealias 2024-02-26 00:33:24 -07:00
  • 269de86ba0 llama : fix Gemma rope type (#5691) b2268 Georgi Gerganov 2024-02-26 08:30:17 +02:00
  • a7812d9c6e fix build.zig hazelnutcloud 2024-02-26 14:26:15 +08:00
  • 43001ce077 Merge branch 'master' of https://github.com/hazelnutcloud/llama.cpp hazelnutcloud 2024-02-26 14:23:20 +08:00
  • 6f318cf76c Add "/chat/completions" as alias for "/v1/chat/completions" jorgealias 2024-02-25 22:58:47 -07:00
  • 88a18f9f6e Adding IQ2_S and IQ2_M as a single cumulative commit Iwan Kawrakow 2024-02-26 07:50:48 +02:00
  • c393733988 flake.lock: Update b2267 github-actions[bot] 2024-02-25 00:17:11 +00:00
  • e3965cf35a server: tests - slow inference causes timeout on the CI (#5715) b2266 Pierrick Hymbert 2024-02-25 22:48:33 +01:00
  • f6ef8ac45a server: tests: fix ci status ok not idle Pierrick HYMBERT 2024-02-25 22:41:44 +01:00
  • 0037c628ff server: tests - longer inference timeout for CI Pierrick HYMBERT 2024-02-25 21:55:20 +01:00
  • 92671d71c4 revert move server_log ngxson 2024-02-25 21:48:32 +01:00