Commit graph

  • 251e602d25 Fix non-intel device selection Aidan 2024-03-13 15:26:49 +00:00
  • 0d934ee517 server : construct batch with size of llama_n_batch Georgi Gerganov 2024-03-13 16:16:20 +02:00
  • 015e1bfe64 llama : do not limit n_batch to n_ctx with non-causal attn slaren 2024-03-13 14:59:32 +01:00
  • d8fd0ccf6a test-backend-ops : skip CPU backend by default (#6028) b2412 slaren 2024-03-13 14:58:30 +01:00
  • b3d978600f Update get version (#6025) b2411 AidanBeltonS 2024-03-13 13:17:54 +00:00
  • cda49d3828 check n_ubatch >= n_tokens with non-causal attention slaren 2024-03-13 13:59:08 +01:00
  • 54cdd478d7 ggml : allow ggml_get_rows to use multiple threads if they are available slaren 2024-03-13 13:33:10 +01:00
  • 529e749e3c swiftui : sync after decode Georgi Gerganov 2024-03-13 14:04:54 +02:00
  • b25a0f1965 batched-bench : sync after decode Georgi Gerganov 2024-03-13 14:04:32 +02:00
  • 7cfc6dfcc4 README.md: Update details about running llama in Termux on Android wanix1988 2024-03-13 19:37:54 +08:00
  • 46b3ccaa6f 1. Convert xverse models to gguf; 2. Add LLM_ARCH_XVERSE inference in llama.cpp; 3. Add xverse item in Supported models in README.md; willhe 2024-03-13 11:25:59 +00:00
  • 62629ebcac Support xverse model convert to gguf format. willhe 2024-03-12 03:18:11 +00:00
  • 9e7cecc1c8 llama : fix norm backend slaren 2024-03-13 12:18:09 +01:00
  • 94a1050e57 revert function renaming Michael Podvitskiy 2024-03-13 12:07:22 +01:00
  • 0a1322acbd pr review fixes Michael Podvitskiy 2024-03-13 11:44:08 +01:00
  • 99b71c068f Server: Use multi-task for embeddings endpoint (#6001) b2410 Xuan Son Nguyen 2024-03-13 11:39:11 +01:00
  • 56c6210831 Update get version Aidan 2024-03-12 17:42:14 +00:00
  • 4400153348 llama : limit n_batch and n_ubatch to n_ctx during context creation slaren 2024-03-13 02:55:29 +01:00
  • 48c02498f2 make changes to make sure it's an exact 1 to 1 mapping to our python rubra tool formatter Yingbei 2024-03-12 17:59:00 -07:00
  • 89e775aaa7 test-backend-ops : skip CPU backend by default slaren 2024-03-12 23:33:56 +01:00
  • aeefbbb4a3 modify llama2 template formatting logic Yingbei 2024-03-12 15:08:16 -07:00
  • 8902fd41f0 merge from main Yingbei 2024-03-11 13:57:02 -07:00
  • 9a8762532e edit oai.hpp to accept function calling usage in openai format. Yingbei 2024-03-05 13:47:31 -08:00
  • 255c1ec18e fix hip build slaren 2024-03-12 23:01:25 +01:00
  • aec982eefd try to detect the PHI cross compiler in make. Julia Longtin 2024-03-12 21:54:38 +00:00
  • ead5c8b895 fix sycl build (disable cpy_tensor_async) slaren 2024-03-12 22:45:59 +01:00
  • a31c936c5a try to detect the PHI cross compiler in make. Julia Longtin 2024-03-12 21:40:46 +00:00
  • deb3e245c2 Merge remote-tracking branch 'origin/master' into sl/pipeline-parallelism slaren 2024-03-12 22:34:02 +01:00
  • 5a2973af25 instead of checking on glibc, check on SYS_getcpu Julia Longtin 2024-03-12 21:07:10 +00:00
  • 7f3722beb6 handle the case that we have no glibc on the PHI. Julia Longtin 2024-03-12 21:02:14 +00:00
  • 868a2016ac add detection of Xeon PHI: Knights Corner. Julia Longtin 2024-03-12 20:57:43 +00:00
  • aa1e2f8b2f fix hip build slaren 2024-03-12 21:41:45 +01:00
  • 89bfa1f2ed add LLAMA_SCHED_MAX_COPIES to configure the number of input copies for pipeline parallelism default increase to 4 (from 2) slaren 2024-03-12 21:09:01 +01:00
  • 00a415d19b llama : limit max batch size to n_batch slaren 2024-03-12 20:44:40 +01:00
  • 35d5a02bef metal : fix embed build + update library load logic Georgi Gerganov 2024-03-12 21:29:56 +02:00
  • 937966d75e llama : fix Mamba inference for pipeline parallelism Francis Couture-Harpin 2024-03-12 14:54:35 -04:00
  • 4ddccc2852 fix server embedding test slaren 2024-03-12 17:59:59 +01:00
  • 9f805264dc Attempt 2 ik/try_fix_iq1s_sycl Iwan Kawrakow 2024-03-12 18:40:13 +02:00
  • dd0b2be713 remove redundant {"n_predict", 0} ngxson 2024-03-12 17:00:17 +01:00
  • 306d34be7a ci : remove tidy-review (#6021) b2409 slaren 2024-03-12 16:55:19 +01:00
  • 573a5dc205 update format Jianyu Zhang 2024-03-12 23:32:45 +08:00
  • 9d4a130275 fix set main gpu error, support single/mul gpu mode Jianyu Zhang 2024-03-12 23:30:17 +08:00
  • eb5ec38a05 ci : remove tidy-review slaren 2024-03-12 16:25:11 +01:00
  • 6b90566052 control vector api and implementation Theia Vogel 2024-03-09 20:22:37 -08:00
  • 10ac60b37b Resolves #3878 by enforcing existence of root node before returning valid grammar structure. This is an alternate fix location to put the required root node in sampling.cpp instead of grammar-parser.cpp. Clint Herron 2024-03-11 17:46:05 -04:00
  • 16c31f6b2a fix conflict Jianyu Zhang 2024-03-12 23:00:54 +08:00
  • 1ac668e4ec server : add -ub, --ubatch-size parameter slaren 2024-03-12 15:50:31 +01:00
  • 822121fbcd llama : add pipeline parallelism support for batch processing with multiple CUDA GPUs slaren 2024-02-13 01:46:26 +01:00
  • 34cdece33a metal : build metallib + fix embed path Georgi Gerganov 2024-03-12 15:54:02 +02:00
  • 9188523f70 iq1_s[SYCL]: remove unnecessary (unused) data Iwan Kawrakow 2024-03-12 15:20:04 +02:00
  • da5a6f05f6 iq1_s: attempt to fix SYCL Iwan Kawrakow 2024-03-12 15:09:07 +02:00
  • 164d6987fb fix conflict Jianyu Zhang 2024-03-12 21:02:30 +08:00
  • 9114338588 update format Jianyu Zhang 2024-03-12 20:44:30 +08:00
  • c34100e717 ggml : reuse quantum structs across backends (#5943) Georgi Gerganov 2024-03-12 14:27:20 +02:00
  • 9f68423f1e ggml : fix UB in IQ2_S and IQ3_S (#6012) Georgi Gerganov 2024-03-12 13:49:55 +02:00
  • 662211bcc6 sycl : update IQ1_S kernels (WIP - not working!) (#5995) Georgi Gerganov 2024-03-12 11:15:05 +02:00
  • 4c26dea734 grammar : fix unnecessarily retained pointer to rules (#6003) gliptic 2024-03-11 20:59:03 +01:00
  • fd2340dc6f 1.5 bit: we can do even better (#5999) Kawrakow 2024-03-11 16:53:15 +01:00
  • f6846a3071 llama : more consistent names of count variables (#5994) Georgi Gerganov 2024-03-11 17:49:47 +02:00
  • d0b01ef898 llama : refactor unicode stuff (#5992) Georgi Gerganov 2024-03-11 17:47:47 +02:00
  • 0a8db3ff55 Update server docker image URLs (#5997) Jakub N 2024-03-11 14:40:42 +01:00
  • f8aa86b973 Server: format error to json (#5961) Xuan Son Nguyen 2024-03-11 10:56:41 +01:00
  • 42aa0e6e26 ggml, ci : Windows ARM runner and build fixes (#5979) Michael Podvitskiy 2024-03-11 10:28:51 +01:00
  • bbc73d9196 server : maintain chat completion id for streaming responses (#5988) Minsoo Cheong 2024-03-11 17:09:32 +09:00
  • cd7144af95 cmake : fix subdir for LLAMA_METAL_EMBED_LIBRARY (#5985) Gilad S 2024-03-11 10:00:08 +02:00
  • 6a12b127e8 llama : fix F16/F32 downcast + improve names (#5980) Georgi Gerganov 2024-03-11 09:56:47 +02:00
  • 613fb83722 Better 1.5 bit quantization (#5971) Kawrakow 2024-03-11 07:51:49 +01:00
  • bec4dda41a [SYCL] Add q3_s and q1_s (#5886) Abhilash Majumder 2024-03-11 10:27:56 +05:30
  • 1c4007ca9a [SYCL] Add support for SYCL Nvidia target (#5738) AidanBeltonS 2024-03-11 01:13:57 +00:00
  • 98a4398471 metal : move mm_id indices to shared mem (#5982) Georgi Gerganov 2024-03-10 23:12:48 +02:00
  • c48be16af6 android : fix utf8 decoding error (#5935) Dean 2024-03-11 04:03:17 +08:00
  • afe90b7088 readme : update hot topics Georgi Gerganov 2024-03-10 20:58:26 +02:00
  • 89c1ec1ed1 sync : ggml Georgi Gerganov 2024-03-10 20:10:46 +02:00
  • 18f1d1f1bc ggml : try fix 32-bit arm compat (whisper/1938) Georgi Gerganov 2024-03-08 23:45:07 +02:00
  • 21638498a2 ggml : remove __constant__ specifier for CUDA tables (#5940) Georgi Gerganov 2024-03-10 20:09:24 +02:00
  • eddc89a206 server: ci: windows build and tests (#5968) Pierrick Hymbert 2024-03-10 18:17:47 +01:00
  • b0b95f7420 llama : add support for GritLM (#5959) DAN™ 2024-03-10 11:56:30 -04:00
  • 58ec96383d grammar : verify parsed state (#5950) Clint Herron 2024-03-10 11:17:43 -04:00
  • b85720e273 nix: update flake.lock (#5969) Georgi Gerganov 2024-03-10 16:43:08 +02:00
  • 74adda1015 server: benchmark: chat/completions scenario and other llm servers comparison (#5941) Pierrick Hymbert 2024-03-09 23:41:49 +01:00
  • f950aee817 server : print chat template info Georgi Gerganov 2024-03-09 22:04:00 +02:00
  • 116800303e perplexity : support using multiple sequences to allow larger batch sizes (#5946) slaren 2024-03-09 19:55:54 +01:00
  • 2c1bf7a1a4 readme : update hot topics Georgi Gerganov 2024-03-09 18:14:13 +02:00
  • 3548b1ccd6 ggml : fix unnecessary f32 -> f16 -> f32 casts (mmla) (#5951) Georgi Gerganov 2024-03-09 17:36:20 +02:00
  • 647235c2a9 server : fix metrics init (#5964) Georgi Gerganov 2024-03-09 17:34:15 +02:00
  • 099843b83e ggml : remove old quantization functions (#5942) Georgi Gerganov 2024-03-09 15:53:59 +02:00
  • 9a57e2f7cb server : clarify some items in the readme (#5957) Georgi Gerganov 2024-03-09 15:47:47 +02:00
  • ab0a466502 server : normalize embeddings (#5956) SeungWon Jeong 2024-03-09 21:27:58 +09:00
  • ea9403207d tests : gitignore ggml-common.h Georgi Gerganov 2024-03-09 14:17:11 +02:00
  • 50dff9d24e server : fix passing prompt as tokens (#5955) Alexey Parfenov 2024-03-09 11:16:53 +00:00
  • 21d2ca9141 ggml : add ggml-common.h to deduplicate shared code (#5940) Georgi Gerganov 2024-03-09 12:47:57 +02:00
  • 97fde80b37 server : simplify logic for empty prompts (#5953) Georgi Gerganov 2024-03-09 12:34:18 +02:00
  • 741051a4f2 Server: reorganize some http logic (#5939) Xuan Son Nguyen 2024-03-09 11:27:53 +01:00
  • d6dcca5567 server : add SSL support (#5926) Gabe Goodhart 2024-03-09 02:57:09 -07:00
  • 777be5aa5a server: tests: add truncated prompt tests, better kv cache size (#5933) Pierrick Hymbert 2024-03-09 10:30:04 +01:00
  • 852c7c5454 llama : support Mamba Selective State Space Models (#5328) compilade 2024-03-08 17:31:00 -05:00
  • 1885859913 llama : fix quantization of shared token_embd (#5944) compilade 2024-03-08 10:53:37 -05:00
  • 6932e1aa46 server: metrics: add llamacpp:prompt_seconds_total and llamacpp:tokens_predicted_seconds_total, reset bucket only on /metrics. Fix values cast to int. Add Process-Start-Time-Unix header. (#5937) Pierrick Hymbert 2024-03-08 12:25:04 +01:00
  • 39ccc17614 llama : assume tied weights if lm_head/output weights is missing (#5824) Don Mahurin 2024-03-08 02:41:50 -08:00
  • 8fb8716df4 server : fix EOS token detection with disabled cache (#5938) Georgi Gerganov 2024-03-08 12:40:02 +02:00