Commit graph

  • e04dc51988 ggml-cuda : add rope f16, restore performance with parallel decoding (#3272) slaren 2023-09-20 13:00:28 +02:00
  • db0fc2da06 simple : improve comments + free batch Georgi Gerganov 2023-09-20 13:54:20 +03:00
  • d30ab79b18 fix rope shift slaren 2023-09-20 12:42:50 +02:00
  • b377bf2266 simple : add parallel decoding support Georgi Gerganov 2023-09-20 13:06:34 +03:00
  • 4a0c515da7 rename notepad to classic Concedo 2023-09-20 17:51:02 +08:00
  • addae65fd4 llama : improve llama_batch API + simplify parallel example Georgi Gerganov 2023-09-20 10:46:18 +03:00
  • 436cd474cd regex fix Concedo 2023-09-20 16:02:19 +08:00
  • d119c04c15 examples : fix benchmark-matmult (#1554) b1257 Georgi Gerganov 2023-09-20 10:02:39 +03:00
  • 2fc91d8727 updated lite Concedo 2023-09-20 14:28:55 +08:00
  • a1327c71c6 parallel : rename hot-plug to continuous-batching Georgi Gerganov 2023-09-20 09:24:02 +03:00
  • 2e92aefef3 Merge branch 'custom-attention-mask' into cam-cuda-2 Georgi Gerganov 2023-09-20 09:17:48 +03:00
  • e1067efbfa llama : fix n_kv to never become 0 Georgi Gerganov 2023-09-20 09:17:05 +03:00
  • 05adde4f1b build : use -Werror=implicit-function-declaration Cebtenzzre 2023-09-20 00:05:16 -04:00
  • 0465daaa1d baby-llama : fix -Wmaybe-uninitialized warning from gcc Cebtenzzre 2023-09-18 18:35:23 -04:00
  • 54e28be107 fix more -Wextra-semi-stmt warnings Cebtenzzre 2023-09-14 17:49:24 -04:00
  • df080fe7e8 ggml : do not put ';' after GGML_*_LOCALS (-Wextra-semi-stmt) Cebtenzzre 2023-09-14 17:31:35 -04:00
  • 90eb6653f3 examples : fix extra ';' after function definitions (-Wextra-semi) Cebtenzzre 2023-09-14 17:02:01 -04:00
  • ce6f1a0956 syntax Eve 2023-09-20 00:23:57 +00:00
  • c0583b2abc fix LLAMA_NATIVE Eve 2023-09-20 00:07:29 +00:00
  • 4c0f243787 offload KQ_mask with all models slaren 2023-09-20 00:53:28 +02:00
  • 488c1fc778 ggml-cuda : add rope f16, restore performance slaren 2023-09-20 00:21:09 +02:00
  • 7b7472ee26 parallel : minor Georgi Gerganov 2023-09-20 00:35:10 +03:00
  • 6028879f56 parallel : print misses on each request Georgi Gerganov 2023-09-19 23:50:05 +03:00
  • eed3fd4234 parallel : count cache misses Georgi Gerganov 2023-09-19 23:47:47 +03:00
  • 8a9aca37c1 parallel : remove question with short answers Georgi Gerganov 2023-09-19 23:34:30 +03:00
  • a6070b7c5a Fixed vocabulary guarding assertions goerch 2023-09-19 21:34:41 +02:00
  • 59a30b768a Adding another assertion goerch 2023-09-19 21:21:47 +02:00
  • 4abbfb51f9 Adding a comment goerch 2023-09-19 19:37:25 +02:00
  • bd1553a4b7 Add --n-probs to server CLI args Samuel Stevens 2023-09-19 13:03:41 -04:00
  • 17ca832717 Streamlining code and adding some more assertions goerch 2023-09-19 19:04:48 +02:00
  • 5285c5e19c flake:Restore default package's buildInputs(#3261) tpdns90321 2023-09-19 23:06:41 +09:00
  • 4b5f3cd6bf parallel : process system prompt once + configurable paramters + llama API Georgi Gerganov 2023-09-19 17:00:42 +03:00
  • a4e9448ef0 Guarding some unusable code pathes goerch 2023-09-19 14:17:34 +02:00
  • 1b7c3692af Moving byte decoding back to token_to_piece ... goerch 2023-09-19 13:24:04 +02:00
  • 4efad756b1 fix for updated c lib Edward Taylor 2023-09-19 23:07:02 +12:00
  • 82e20e9ba0 parallel : remove new line from prompt Georgi Gerganov 2023-09-19 13:54:41 +03:00
  • d37081ae5d llama : silence errors KV cache errors Georgi Gerganov 2023-09-19 13:39:52 +03:00
  • 16090a5dde parallel : fix sequence termination criteria Georgi Gerganov 2023-09-19 13:29:29 +03:00
  • 806d397c1a parallel : try smaller batches when the KV cache is fragmented Georgi Gerganov 2023-09-19 13:21:36 +03:00
  • ddad227782 llama : fix cell_max logic + rename functions Georgi Gerganov 2023-09-19 13:21:12 +03:00
  • 52fb04005f revert ubuntu focal make job Alon Faraj 2023-09-19 12:56:43 +03:00
  • a888b5a9a3 - freebsd ci: use qemu - change ubuntu make to latest Alon Faraj 2023-09-19 12:31:46 +03:00
  • 36714e16d0 parallel : various improvements Georgi Gerganov 2023-09-19 12:29:37 +03:00
  • 467e307931 simple : fix token counting Georgi Gerganov 2023-09-19 11:45:33 +03:00
  • 25bd254089 make : add parallel to build + fix static functions in llama.cpp Georgi Gerganov 2023-09-19 11:37:02 +03:00
  • 7e2b9974d1 ggml-cuda : update rope implementation for parallel decoding (#3254) slaren 2023-09-19 10:31:36 +02:00
  • c0990bb739 Testing Aquila goerch 2023-09-19 10:31:10 +02:00
  • 933527695b Merge branch 'custom-attention-mask' into cam-cuda Georgi Gerganov 2023-09-19 11:13:45 +03:00
  • daf4c6d360 llama : fix worst case graph build Georgi Gerganov 2023-09-19 11:05:08 +03:00
  • a9e6b46fb0 Fixed Typos Aayush Shah 2023-09-19 13:19:43 +05:30
  • 113e837582 Cleaned README installing orders for available options Aayush Shah 2023-09-19 13:17:44 +05:30
  • aa18b93980 simpler rope implementation slaren 2023-09-19 08:51:05 +02:00
  • 787bc4af60 CI Fix chooper1 2023-09-18 22:18:59 -07:00
  • 80f69694e5 CI Fix chooper1 2023-09-18 20:17:06 -07:00
  • cbe2bac281 fix rope slaren 2023-09-19 02:16:07 +02:00
  • fb92acdd6b better solution for p0 computation slaren 2023-09-19 00:17:55 +02:00
  • 048e659dae Another Makefile rule goerch 2023-09-19 00:04:10 +02:00
  • c85cb29b09 Update .gitignore and Makefile goerch 2023-09-19 00:00:10 +02:00
  • eec6b66ac9 ggml-cuda : update rope implementation for parallel decoding slaren 2023-09-18 23:48:34 +02:00
  • 311fcf113b Test doesn't work over the full range of Unicodes goerch 2023-09-18 23:24:54 +02:00
  • fa0e677820 llama : extend batch API to select which logits to output Georgi Gerganov 2023-09-19 00:24:13 +03:00
  • 407f76d9b8 Cleanup goerch 2023-09-18 23:22:17 +02:00
  • 208d3d7cda Fix compiler warning goerch 2023-09-18 23:01:32 +02:00
  • 91a527a0e0 Cleanup and an improvement goerch 2023-09-18 22:53:35 +02:00
  • 897caccdf4 fixes : speculative KV cache + llama worst-case graph Georgi Gerganov 2023-09-18 22:00:02 +03:00
  • 37cf135cb0 Fix MSVC Unicode BOM problem goerch 2023-09-18 21:15:01 +02:00
  • 77704232b2 Fix debug assertion failure goerch 2023-09-18 20:55:07 +02:00
  • 466b513851 parallel : disable hot-plug to avoid cache fragmentation Georgi Gerganov 2023-09-18 21:34:20 +03:00
  • 89e74c67e2 Try to fix build problem goerch 2023-09-18 20:15:43 +02:00
  • 0161372b9a parallel : example for serving multiple users in parallel Georgi Gerganov 2023-09-18 20:30:05 +03:00
  • bfaab6f4fa Work on the BPE tokenizer goerch 2023-09-18 19:18:15 +02:00
  • de8035af2c CUDA: use only 1 thread if fully offloaded JohannesGaessler 2023-09-18 18:14:53 +02:00
  • c03409c1f6 grammar sampling added for lite Concedo 2023-09-19 00:13:30 +08:00
  • 1f17ea631c speculative : fix KV cache management Georgi Gerganov 2023-09-18 19:01:20 +03:00
  • 7c1bdd0e8a llama : apply K-cache roping for Falcon and Baichuan Georgi Gerganov 2023-09-18 18:26:05 +03:00
  • 0142760fc3 Merge branch 'master' into concedo_experimental Concedo 2023-09-18 23:20:02 +08:00
  • 0cbf3bfef8 llama : add llama_kv_cache_shift_seq + no more context swaps Georgi Gerganov 2023-09-18 18:00:25 +03:00
  • 8c453d1e4e added grammar sampling Concedo 2023-09-18 23:02:00 +08:00
  • 86c90e34f5 metal : disable concurrency optimization Georgi Gerganov 2023-09-18 18:00:01 +03:00
  • f015b26689 llama : more robust cell_max heuristic + wip shift Georgi Gerganov 2023-09-18 17:15:25 +03:00
  • 8781013ef6 make : restore build-info.h dependency for several targets (#3205) b1256 Cebtenzzre 2023-09-18 10:03:53 -04:00
  • c620f4d677 KV cache quantized to q8_0 JohannesGaessler 2023-09-01 20:15:49 +02:00
  • 4d76d762ef llama : extend llama_kv_cache API Georgi Gerganov 2023-09-18 15:53:03 +03:00
  • 6952a460b9 llama : add cell_max heuristic for more efficient kv_cache Georgi Gerganov 2023-09-18 15:31:24 +03:00
  • 9f42e75489 llama : add new llama_decode() API that works with llama_batch Georgi Gerganov 2023-09-18 14:23:52 +03:00
  • 58bb5110ca Merge branch 'master' into custom-attention-mask Georgi Gerganov 2023-09-18 11:15:18 +03:00
  • d29e76937c llama : unified KV cache + batch inference API Georgi Gerganov 2023-09-18 10:08:22 +03:00
  • 951614bfc6 library unloading is working Concedo 2023-09-18 15:03:52 +08:00
  • e2d3353010 rocm: Automatically build externally Daniel Tang 2023-09-02 08:20:10 -04:00
  • e90734e072 examples : include build-info.h only where needed Cebtenzzre 2023-09-17 21:44:22 -04:00
  • 95e168a67f remove unused variable Cebtenzzre 2023-09-17 21:34:24 -04:00
  • 1191cc3769 fix unreachable 'break' and 'return' (-Wunreachable-code-*) Cebtenzzre 2023-09-14 17:19:24 -04:00
  • 141c645fc4 make : do not pass compiler-specific options to nvcc Cebtenzzre 2023-09-15 17:21:37 -04:00
  • 86170e0374 make : remove redundant -Wno-pedantic Cebtenzzre 2023-09-17 21:29:32 -04:00
  • 80926572f7 quantize : fix missing 'noreturn' (-Wmissing-noreturn) Cebtenzzre 2023-09-14 15:12:56 -04:00
  • a80cb4cf1b build : separate common warning flags Cebtenzzre 2023-09-14 15:41:28 -04:00
  • 724a0c2071 build : remove -Wno-multichar as it is no longer needed Cebtenzzre 2023-09-14 16:14:05 -04:00
  • e63254755c fix more missing 'static' specifiers (-Wmissing-declarations) Cebtenzzre 2023-09-15 16:03:45 -04:00
  • 5457b0c11d make : add some missing build targets Cebtenzzre 2023-09-15 16:10:18 -04:00
  • b52ce44afd cmake : make -Wmissing-prototypes etc. match the Makefile Cebtenzzre 2023-09-15 15:49:24 -04:00