Commit graph

  • eefb794bd7 mamba : support state saving and restoring Francis Couture-Harpin 2024-03-03 13:55:48 -05:00
  • 2c96afc272 add alias for chat template ngxson 2024-03-03 19:31:11 +01:00
  • 10c477b8a8 implement slerp ngxson 2024-03-03 18:58:42 +01:00
  • f2c2bd6b26 iq3_s_mult: also CUDA Iwan Kawrakow 2024-03-03 19:12:05 +02:00
  • a032bb6ca2 self merge ok ngxson 2024-03-03 18:04:48 +01:00
  • e5e72562c5 iq3_s_mult: back to blocks of 32 Iwan Kawrakow 2024-03-03 18:50:26 +02:00
  • 35d51a9595 Use defined default seed. DAN™ 2024-03-03 11:49:33 -05:00
  • b83fbc9287 convert : for Mamba, fallback to internal NeoX tokenizer Francis Couture-Harpin 2024-03-02 23:39:19 -05:00
  • d52dd501f0 ggml : in ggml_ssm_scan, use a threshold for soft_plus Francis Couture-Harpin 2024-03-02 21:39:28 -05:00
  • 1af1000f10 mamba : more correctly update the "used" field of the KV cache Francis Couture-Harpin 2024-03-02 11:12:30 -05:00
  • 206e8ee2b2 mamba : stop abusing attention metadata Francis Couture-Harpin 2024-02-28 10:58:17 -05:00
  • 8f605cfe0d mamba : adapt perplexity, batched, and batched-bench examples Francis Couture-Harpin 2024-02-26 20:25:23 -05:00
  • 79d636cc7e mamba : dedicate an input tensor for state copy indices Francis Couture-Harpin 2024-02-25 17:26:31 -05:00
  • 34e2fca8eb mamba : make the server and parallel examples work with whole sequences Francis Couture-Harpin 2024-02-25 09:59:53 -05:00
  • 3dcf79824d mamba : support llama_kv_cache_seq_cp copy chains Francis Couture-Harpin 2024-02-25 09:51:49 -05:00
  • 9473ec2147 mamba : simultaneous sequence processing Francis Couture-Harpin 2024-02-18 20:57:30 -05:00
  • de50c549c4 mamba : reduce memory usage of ggml_ssm_scan Francis Couture-Harpin 2024-02-17 20:30:29 -05:00
  • e73eaa7b4f mamba : in comments, properly refer to KV cells instead of slots Francis Couture-Harpin 2024-02-14 13:43:14 -05:00
  • 8a43ffcfa1 mamba : multiple sequences, but one at a time Francis Couture-Harpin 2024-02-13 19:06:18 -05:00
  • 6ff34da092 mamba : apply suggestions from code review Francis Couture-Harpin 2024-02-05 10:13:55 -05:00
  • c52fb3c2de convert : fix flake8 linter errors Francis Couture-Harpin 2024-02-04 20:41:07 -05:00
  • 766db753c8 mamba : remove some useless comments Francis Couture-Harpin 2024-02-04 18:25:14 -05:00
  • de92f15634 ggml : remove ggml_exp and ggml_soft_plus Francis Couture-Harpin 2024-02-04 17:08:54 -05:00
  • cd0f33f281 mamba : fix vocab size problems with official models Francis Couture-Harpin 2024-02-04 09:49:23 -05:00
  • 9f55809f72 convert : for Mamba, also consider the "MambaLMHeadModel" arch name Francis Couture-Harpin 2024-02-04 09:00:42 -05:00
  • a3f4a1c7dc mamba : fuse more steps of the SSM scan in the ggml_ssm_scan operator Francis Couture-Harpin 2024-02-03 17:49:36 -05:00
  • 5816ae687e mamba : very basic quantization support Francis Couture-Harpin 2024-02-01 21:22:28 -05:00
  • 78a853b788 ggml : in ggml_ssm_scan, merge multiple rows in the same vec operation Francis Couture-Harpin 2024-02-01 21:16:40 -05:00
  • ffc116f5ec mamba : handle batches of more than 1 token Francis Couture-Harpin 2024-01-31 20:45:04 -05:00
  • 81b57bb375 mamba : fix self-overlapping view depth stride Francis Couture-Harpin 2024-01-31 08:47:53 -05:00
  • e9cc45ecae mamba : simplify the conv step with a self-overlapping view Francis Couture-Harpin 2024-01-30 21:48:04 -05:00
  • 3f7233b62e ggml : parallelize ggml_exp Francis Couture-Harpin 2024-01-29 13:33:27 -05:00
  • 9e77061a3b mamba : refactor recurrent conv, resulting in 20% perf increase Francis Couture-Harpin 2024-01-29 10:21:19 -05:00
  • 74eea856bf convert : optionally use d_conv and d_state from config.json for Mamba Francis Couture-Harpin 2024-01-29 08:27:09 -05:00
  • 54d3e48601 mamba : recurrent inference WORKS!!! Francis Couture-Harpin 2024-01-28 16:20:03 -05:00
  • f680364bd8 mamba : recurrent inference almost works, but incoherent Francis Couture-Harpin 2024-01-28 15:36:42 -05:00
  • 5a69a262a1 mamba : begin figuring out how to (ab)use the kv cache for Mamba Francis Couture-Harpin 2024-01-27 11:41:20 -05:00
  • 8cd0a286b4 mamba : begin working on support for Mamba SSM Francis Couture-Harpin 2024-01-26 08:29:09 -05:00
  • 65730438aa wip: new format ngxson 2024-03-03 16:06:48 +01:00
  • f4cb4eac45 iq3_s_mult: play with blocks of 16 Iwan Kawrakow 2024-03-03 16:43:00 +02:00
  • 60325ec78e Tokenize antiprompts only once. DAN™ 2024-03-03 08:46:43 -05:00
  • 67be2ce101 cuda : fix data race in soft max (#5853) b2329 slaren 2024-03-03 14:26:18 +01:00
  • 6564fbab18 cuda : fix data race in soft max slaren 2024-03-03 13:19:03 +01:00
  • 6aefd11204 llama : adapt new models to F16 KQ_mask Georgi Gerganov 2024-03-03 13:50:54 +02:00
  • 02a645e7b7 Merge branch 'master' into gg/flash-attn Georgi Gerganov 2024-03-03 13:44:11 +02:00
  • 96ddeac1c6 Merge branch 'server_branch' of https://github.com/pudepiedj/llama.cpp into server_branch pudepiedj 2024-03-03 11:20:12 +00:00
  • 480089d00d improve Llamaserver.py pudepiedj 2024-03-03 11:20:10 +00:00
  • 54bea4428f Merge branch 'ggerganov:master' into server_branch pudepiedj 2024-03-03 11:19:25 +00:00
  • dbe98dfe70 iq3_s_mult: another alternative multiplier Iwan Kawrakow 2024-03-03 13:13:52 +02:00
  • 231ae28f07 readme : add API changes section Georgi Gerganov 2024-03-03 12:44:03 +02:00
  • 475df1d6cf llama : allow for user specified embedding pooling type (#5849) b2327 Douglas Hanley 2024-03-03 04:40:27 -06:00
  • 4661363b0f llama : use enum types over int Georgi Gerganov 2024-03-03 11:53:32 +02:00
  • 99b9edabca fix mul_mat fault in cpy_f32_f16 Jianyu Zhang 2024-03-03 17:35:00 +08:00
  • 8b713a987e iq3s_mult: quantization tuning Iwan Kawrakow 2024-03-03 11:32:53 +02:00
  • 5b9c8785fa iq3s_mult: ARM and Metal Iwan Kawrakow 2024-03-03 11:30:01 +02:00
  • b6402fa757 iq3_s_mult: ifdef'd slow / fast versions Iwan Kawrakow 2024-03-03 10:43:53 +02:00
  • 87c2e8b279 gguf-dump : support i-quants (#5841) Nindaleth 2024-03-03 09:43:42 +01:00
  • de9692a7d2 llama : fix llama_copy_state_data with fragmented KV cache (#5840) b2325 compilade 2024-03-03 03:41:55 -05:00
  • e6029348e8 ci : schedule slow server tests only on Release or on demand (#5839) b2324 Pierrick Hymbert 2024-03-03 09:35:23 +01:00
  • 8ef969afce server : init http requests thread pool with --parallel if set (#5836) b2323 Pierrick Hymbert 2024-03-03 08:48:36 +01:00
  • 265741aa0f Merge remote-tracking branch 'origin/master' into server_branch pudepiedj 2024-03-03 06:56:31 +00:00
  • 726aed307a iq3_s_mult: alternative multiplier / bit twidling Iwan Kawrakow 2024-03-03 08:51:28 +02:00
  • fe3c20b251 iq3_s_mult: quantization tuning Iwan Kawrakow 2024-03-03 07:51:20 +02:00
  • efecd060c9 allow for user specified pooling type Douglas Hanley 2024-03-02 22:54:52 -06:00
  • 3000e0ac9e iq3_s_mult: Metal works - slower than lookup Iwan Kawrakow 2024-03-03 06:41:58 +02:00
  • fa974646e1 flake.lock: Update (#5842) Georgi Gerganov 2024-03-03 06:11:31 +02:00
  • ef651247d0 Support special tokens as reverse/anti prompt. DAN™ 2024-03-02 21:56:45 -05:00
  • 1e6a2f12c6 Add --public-domain flag to server to enable CORS requests. StrangebytesDev 2024-03-02 17:31:57 -08:00
  • 0de8a548cd Update CMakeLists.txt Dane Madsen 2024-03-03 11:31:00 +10:00
  • 72bab141a9 flake.lock: Update github-actions[bot] 2024-03-03 00:17:03 +00:00
  • 0054e3309d gguf-dump: support i-quants Black_Fox 2024-03-03 00:28:13 +01:00
  • b448ccfe20 llama : fix llama_copy_state_data with fragmented KV cache Francis Couture-Harpin 2024-03-02 16:34:07 -05:00
  • eb0bf32caf server: tests: schedule slow dispatch only on release or on demand ci/server/fix-slow-test Pierrick HYMBERT 2024-03-02 23:18:31 +01:00
  • f3bb1e55c6 Merge branch 'server_branch' of https://github.com/pudepiedj/llama.cpp into server_branch pudepiedj 2024-03-02 22:10:30 +00:00
  • bf366d2d9a add api key pudepiedj 2024-03-02 22:10:28 +00:00
  • ab7a989293 Merge branch 'ggerganov:master' into master StrangeBytesDev 2024-03-02 13:13:17 -08:00
  • 9731134296 server: tests: passkey challenge / self-extend with context shift demo (#5832) b2321 Pierrick Hymbert 2024-03-02 22:00:14 +01:00
  • 17dfcde615 Added admin-key param, and added endpoints to api-key description. Robey Holderith 2024-03-02 12:43:10 -08:00
  • 550722061f Fixed spacing/removed errant tab Robey Holderith 2024-03-02 12:35:04 -08:00
  • 0c7f5b26cf server: tests: passkey add a negative test Pierrick HYMBERT 2024-03-02 21:24:28 +01:00
  • ebc1decb10 Add Admin key param and generalize key check Robey Holderith 2024-03-02 12:02:21 -08:00
  • a6ea72541f server: tests: keep only the PHI-2 test Pierrick HYMBERT 2024-03-02 20:53:00 +01:00
  • 4a6e2d6142 llama : add abort_callback to interrupt computation (#5409) b2320 Michael Podvitskiy 2024-03-02 20:52:25 +01:00
  • 418f2f9b23 Merge branch 'master' into abort_callback Georgi Gerganov 2024-03-02 21:51:15 +02:00
  • 65e013b669 server: init server http requests threads pool with max of hardware_concurrency -1 or n_slots + 2 Pierrick HYMBERT 2024-03-02 20:45:40 +01:00
  • 2cdd21e26b server: tests: increase timeout for completion Pierrick HYMBERT 2024-03-02 20:32:20 +01:00
  • c1f66f05f5 server: tests: self-extend add llama-2-7B and Mixtral-8x7B-v0.1 Pierrick HYMBERT 2024-03-02 20:19:21 +01:00
  • 1aa5ad9150 server: tests: fix re content Pierrick HYMBERT 2024-03-02 19:30:19 +01:00
  • 830d0efbd2 server: tests: CI workflow failed on first scenario failed Pierrick HYMBERT 2024-03-02 19:30:13 +01:00
  • 763ae0a1fd Merge remote-tracking branch 'origin/tests/server/passkey' into tests/server/passkey Pierrick HYMBERT 2024-03-02 19:13:33 +01:00
  • 61b97915b0 server: metrics: fix when no prompt processed Pierrick HYMBERT 2024-03-02 19:11:53 +01:00
  • 45465b21d1 check grammar in llama_sample_probability_distribution_impl Minsoo Cheong 2024-03-03 03:09:11 +09:00
  • 494c870326 ggml : fix IQ3_S AVX implementation (#5834) b2319 Georgi Gerganov 2024-03-02 20:00:49 +02:00
  • 9fcfa63a11 server: tests: schedule slow tests on master Pierrick HYMBERT 2024-03-02 18:58:21 +01:00
  • 9ab72d7ade server: tests: schedule slow tests on master Pierrick HYMBERT 2024-03-02 18:58:21 +01:00
  • 178b0c693d server: tests: fix regex content matching Pierrick HYMBERT 2024-03-02 18:57:57 +01:00
  • a6042049be Add environment variable GGML_VK_FORCE_MAX_ALLOCATION_SIZE to limit max buffer size 0cc4m 2024-03-02 18:53:35 +01:00
  • 407cc609d3 server: tests: fix passkey, add doc, fix regex content matching, fix timeout Pierrick HYMBERT 2024-03-02 18:53:01 +01:00
  • f4ec9a06ea Merge upstream changes, fix conflicts 0cc4m 2024-03-02 18:43:17 +01:00
  • 55ac610c7f ggml: fix IQ3_S AVX implementation Georgi Gerganov 2024-03-02 19:32:19 +02:00