server: docs - refresh and tease a little bit more the http server

Pierrick HYMBERT 2024-02-25 19:30:15 +01:00
parent f1a98c5254
commit 18239fa7fb
2 changed files with 22 additions and 3 deletions

README.md

@@ -114,6 +114,11 @@ Typically finetunes of the base models below are supported as well.
 - [x] [MobileVLM 1.7B/3B models](https://huggingface.co/models?search=mobileVLM)
 - [x] [Yi-VL](https://huggingface.co/models?search=Yi-VL)
+**HTTP server**
+
+We offer a fast, lightweight [OpenAI API](https://github.com/openai/openai-openapi) compatible HTTP server.
+[LLaMA.cpp web server](./examples/server) can be used to serve local models and easily connect them to existing clients.
+
 **Bindings:**

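To illustrate the "connect them to existing clients" claim in the README text added above, here is a minimal sketch that points the official OpenAI Python client at a locally running llama.cpp server. It is not part of this commit: it assumes the server's default address of http://127.0.0.1:8080, that no API key was configured, and that the model name is a placeholder the local server does not enforce.

```python
# Sketch only: drive a local llama.cpp server through the official OpenAI Python client.
# Assumes the server is already running on http://127.0.0.1:8080 with no API key configured.
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8080/v1",  # local llama.cpp server instead of api.openai.com
    api_key="sk-no-key-required",         # dummy value; only checked if the server was started with an API key
)

response = client.chat.completions.create(
    model="local-model",  # placeholder; the server answers with whatever GGUF model it loaded
    messages=[{"role": "user", "content": "Say hello from llama.cpp."}],
)
print(response.choices[0].message.content)
```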
examples/server/README.md

@@ -1,8 +1,22 @@
-# llama.cpp/example/server
+# LLaMA.cpp HTTP Server
-This example demonstrates a simple HTTP API server and a simple web front end to interact with llama.cpp.
+Fast, lightweight, production-ready pure C/C++ HTTP server based on [httplib](https://github.com/yhirose/cpp-httplib), [nlohmann::json](https://github.com/nlohmann/json) and **llama.cpp**.
-Command line options:
+
+A set of LLM REST APIs and a simple web front end to interact with llama.cpp.
+
+**Features:**
+* SOTA LLM inference performance with GGUF quantized models on GPU and CPU
+* [OpenAI API](https://github.com/openai/openai-openapi) compatible chat completions and embeddings routes
+* Continuous batching
+* KV cache attention
+* Embedding
+* Multimodal
+* API key security
+* Production-ready monitoring endpoints
+
+The project is under active development, and we are [looking for feedback and contributors](https://github.com/ggerganov/llama.cpp/issues/4216).
+
+**Command line options:**
 - `--threads N`, `-t N`: Set the number of threads to use during generation.
 - `-tb N, --threads-batch N`: Set the number of threads to use during batch and prompt processing. If not specified, the number of threads will be set to the number of threads used for generation.
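As a hedged illustration of the thread options above, the following sketch launches the server with explicit `-t`/`-tb` values. The binary name, model path, and the `--api-key` flag are assumptions about a local setup, not something this commit defines.

```python
# Sketch only: start the llama.cpp HTTP server with explicit thread counts.
# Binary and model paths are placeholders; adjust them to your build and model.
import subprocess

server = subprocess.Popen([
    "./server",                         # server binary built from examples/server (assumed path)
    "-m", "models/7B/ggml-model.gguf",  # placeholder GGUF model path
    "-t", "8",                          # threads used during generation
    "-tb", "8",                         # threads used during batch and prompt processing
    "--api-key", "my-secret-key",       # assumed flag backing the "API key security" feature
])
# The process keeps serving requests until stopped, e.g. with server.terminate().
```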
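Once the server is running, the OpenAI-compatible chat completions route listed under **Features** can also be exercised with plain HTTP. This sketch is illustrative only: the port, API key, and payload must match how the server was actually started.

```python
# Sketch only: raw HTTP request to the OpenAI-compatible chat completions route.
import requests

response = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    headers={"Authorization": "Bearer my-secret-key"},  # only needed when an API key is configured
    json={
        "model": "local-model",  # placeholder; the server replies with the model it has loaded
        "messages": [{"role": "user", "content": "Give me one sentence about llamas."}],
    },
    timeout=120,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```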