added support for custom prompts and more functions
This commit is contained in:
parent
0cfbd1d7d7
commit
da7f370a94
4 changed files with 625 additions and 636 deletions
@@ -1,6 +1,6 @@
set(TARGET server)
include_directories(${CMAKE_CURRENT_SOURCE_DIR})
add_executable(${TARGET} server.cpp json.hpp httplib.h server.h)
add_executable(${TARGET} server.cpp json.hpp httplib.h)
target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_11)
if(TARGET BUILD_INFO)
@@ -7,8 +7,9 @@ This example allow you to have a llama.cpp http server to interact from a web pa
1. [Quick Start](#quick-start)
2. [Node JS Test](#node-js-test)
3. [API Endpoints](#api-endpoints)
4. [Common Options](#common-options)
5. [Performance Tuning and Memory Options](#performance-tuning-and-memory-options)
4. [More examples](#more-examples)
5. [Common Options](#common-options)
6. [Performance Tuning and Memory Options](#performance-tuning-and-memory-options)

## Quick Start
@@ -17,13 +18,13 @@ To get started right away, run the following command, making sure to use the cor
#### Unix-based systems (Linux, macOS, etc.):

```bash
./server -m models/7B/ggml-model.bin --keep -1 --ctx_size 2048
./server -m models/7B/ggml-model.bin --ctx_size 2048
```

#### Windows:

```powershell
server.exe -m models\7B\ggml-model.bin --keep -1 --ctx_size 2048
server.exe -m models\7B\ggml-model.bin --ctx_size 2048
```

That will start a server that by default listens on `127.0.0.1:8080`. You can consume the endpoints with Postman or NodeJS with the axios library.
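If you prefer to poke at the server without Node, a plain `curl` request along these lines should also work (a minimal, untested sketch; it assumes the default `127.0.0.1:8080` address and the `/completion` endpoint documented in the API Endpoints section):

```bash
# Request a short completion and print the JSON response.
# The field names follow the options documented in the API Endpoints section.
curl -X POST http://127.0.0.1:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Building a website can be done in 10 simple steps:", "n_predict": 128}'
```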
@@ -42,45 +43,22 @@ npm install axios
Create an index.js file and put the following inside:

```javascript
const axios = require('axios');
const axios = require("axios");

async function LLamaTest() {
let result = await axios.post("http://127.0.0.1:8080/setting-context", {
context: [
{ role: "system", content: "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions." },
{ role: "user", content: "Hello, Assistant." },
{ role: "assistant", content: "Hello. How may I help you today?" },
{ role: "user", content: "Please tell me the largest city in Europe." },
{ role: "assistant", content: "Sure. The largest city in Europe is Moscow, the capital of Russia." }
],
batch_size: 64,
temperature: 0.2,
top_k: 40,
top_p: 0.9,
n_predict: 2048,
threads: 5
const prompt = `Building a website can be done in 10 simple steps:`;

async function Test() {
let result = await axios.post("http://127.0.0.1:8080/completion", {
prompt,
batch_size: 128,
n_predict: 512,
});
result = await axios.post("http://127.0.0.1:8080/set-message", {
message: ' What is linux?'
});
if(result.data.can_inference) {
result = await axios.get("http://127.0.0.1:8080/completion?stream=true", { responseType: 'stream' });
result.data.on('data', (data) => {
let completion = JSON.parse(data.toString());
// token by token completion like Chat GPT
process.stdout.write(completion.content);
});

/*
Wait the entire completion (takes long time for response)

result = await axios.get("http://127.0.0.1:8080/completion");
console.log(result.data.content);
*/
}
// the response is received when the completion finishes
console.log(result.data.content);
}

LLamaTest();
Test();
```

And run it:
@@ -93,7 +71,7 @@ node .

You can interact with these API endpoints. This implementation just supports chat-style interaction.

- `POST hostname:port/setting-context`: Setting up the Llama Context to begin the completions tasks.
- **POST** `hostname:port/completion`: Set up the Llama context and begin the completion task.

Options:
`batch_size`: Set the batch size for prompt processing (default: 512).
@@ -108,38 +86,200 @@ Options:

`threads`: Set the number of threads to use during computation.

`context`: Set a short conversation as context.
`n_keep`: Specify the number of tokens from the initial prompt to retain when the model resets its internal context. By default, this value is set to 0 (meaning no tokens are kept). Use `-1` to retain all tokens from the initial prompt.

Insert items to an array of this form: `{ role: "user", content: "Hello, Assistant." }`, where:
`as_loop`: Receive each predicted token in real time instead of waiting for the completion to finish. To enable this, set it to `true`.

`role` can be `system`, `assistant` and `user`.
`interactive`: Allow interacting with the completion; the completion stops as soon as it encounters a `stop word`. To enable this, set it to `true`.

`content` the message content.
`prompt`: Provide a prompt. Internally, the prompt is compared with the previous one; any part that has already been evaluated is reused, and only the remaining part is evaluated.

- `POST hostname:port/set-message`: Set the message of the user to Llama.
`stop`: Specify the words or characters that indicate a stop. These words will not be included in the completion, so make sure to add them to the prompt for the next iteration.

`message`: Set the message content.
`exclude`: Specify the words or characters you do not want to appear in the completion. These words will not be included in the completion, so make sure to add them to the prompt for the next iteration.
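To see how these options fit together in a single request body, here is a rough, untested `curl` sketch (the prompt and values are only placeholders):

```bash
# Chat-style request: stop generating when the next "### Human:" tag appears,
# and keep the whole initial prompt when the context runs out (n_keep: -1).
curl -X POST http://127.0.0.1:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "### Human: Hello, Assistant.\n### Assistant:",
    "n_predict": 256,
    "n_keep": -1,
    "stop": ["### Human:"],
    "exclude": ["### Assistant:"],
    "threads": 4
  }'
```

The generated text should come back in the `content` field of the JSON response, as in the NodeJS examples below.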
- `GET hostname:port/completion`: Receive the response, it can be a stream or wait until finish the completion.
- **POST** `hostname:port/embedding`: Generate the embedding of a given text.

`stream`: Set `true` if you want to receive a stream response.
`content`: Set the text to generate the embedding from.

`threads`: Set the number of threads to use during computation.

To use this endpoint, you need to start the server with the `--embedding` option added.

- **POST** `hostname:port/tokenize`: Tokenize a given text.

`content`: Set the text to tokenize.

- **GET** `hostname:port/next-token`: Receive the next predicted token; execute this request in a loop. Make sure to set `as_loop` to `true` in the completion request.
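For reference, the same polling loop can be sketched outside of NodeJS. This is an untested sketch that assumes `jq` is installed and that the response carries the same `content` and `stop` fields used in the examples below:

```bash
# Start a completion in token-by-token mode, then poll /next-token until it stops.
curl -s -X POST http://127.0.0.1:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Building a website can be done in 10 simple steps:", "n_predict": 128, "as_loop": true}' > /dev/null

while :; do
  resp=$(curl -s http://127.0.0.1:8080/next-token)
  printf '%s' "$(echo "$resp" | jq -r '.content')"
  if [ "$(echo "$resp" | jq -r '.stop')" = "true" ]; then
    break
  fi
done
```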
## More examples

### Interactive mode

This mode allows interacting in a chat-like manner. It is recommended for models designed as assistants such as `Vicuna`, `WizardLM`, `Koala`, among others. Make sure to add the correct stop word for the corresponding model.

The prompt should be generated by you, according to the model's guidelines, and you should keep adding the model's completions to the context as well.

This example works well for `Vicuna - version 1`.
```javascript
const axios = require("axios");

let prompt = `A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.
### Human: Hello, Assistant.
### Assistant: Hello. How may I help you today?
### Human: Please tell me the largest city in Europe.
### Assistant: Sure. The largest city in Europe is Moscow, the capital of Russia.`;

async function ChatCompletion(answer) {
    // add the user's next question to the prompt
    prompt += `\n### Human: ${answer}\n`;

    let result = await axios.post("http://127.0.0.1:8080/completion", {
        prompt,
        batch_size: 128,
        temperature: 0.2,
        top_k: 40,
        top_p: 0.9,
        n_keep: -1,
        n_predict: 2048,
        stop: ["\n### Human:"], // stop the completion when this is detected
        exclude: ["### Assistant:"], // do not show this in the completion
        threads: 8,
        as_loop: true, // use this to request the completion token by token
        interactive: true, // enable the detection of a stop word
    });

    // create a loop to receive every predicted token
    // note: this operation is blocking, avoid using it in a UI thread

    let message = "";
    while (true) {
        result = await axios.get("http://127.0.0.1:8080/next-token");
        process.stdout.write(result.data.content);
        message += result.data.content;

        // to avoid an infinite loop
        if (result.data.stop) {
            console.log("Completed");
            // make sure to add the completion to the prompt
            prompt += `### Assistant: ${message}`;
            break;
        }
    }
}

// This function should be called every time a question to the model is needed.
async function Test() {
    // the server can't run inference in parallel
    await ChatCompletion("Write a long story about a time magician in a fantasy world");
    await ChatCompletion("Summarize the story");
}

Test();
```
### Alpaca example

**Temporary note:** not tested; if you have the model, please test it and report any issues.
```javascript
const axios = require("axios");

let prompt = `Below is an instruction that describes a task. Write a response that appropriately completes the request.
`;

async function DoInstruction(instruction) {
    prompt += `\n\n### Instruction:\n\n${instruction}\n\n### Response:\n\n`;
    let result = await axios.post("http://127.0.0.1:8080/completion", {
        prompt,
        batch_size: 128,
        temperature: 0.2,
        top_k: 40,
        top_p: 0.9,
        n_keep: -1,
        n_predict: 2048,
        stop: ["### Instruction:\n\n"], // stop the completion when this is detected
        exclude: [], // do not show these in the completion
        threads: 8,
        as_loop: true, // use this to request the completion token by token
        interactive: true, // enable the detection of a stop word
    });

    // create a loop to receive every predicted token
    // note: this operation is blocking, avoid using it in a UI thread

    let message = "";
    while (true) {
        result = await axios.get("http://127.0.0.1:8080/next-token");
        process.stdout.write(result.data.content);
        message += result.data.content;

        // to avoid an infinite loop
        if (result.data.stop) {
            console.log("Completed");
            // make sure to add the completion and the user's next question to the prompt
            prompt += message;
            break;
        }
    }
}

// This function should be called every time an instruction to the model is needed.
DoInstruction("Destroy the world");
```
### Embeddings

First, run the server with the `--embedding` option:

```bash
server -m models/7B/ggml-model.bin --ctx_size 2048 --embedding
```

Run this code in NodeJS:
```javascript
const axios = require('axios');

async function Test() {
    let result = await axios.post("http://127.0.0.1:8080/embedding", {
        content: `Hello`,
        threads: 5
    });
    // print the embedding array
    console.log(result.data.embedding);
}

Test();
```
### Tokenize

Run this code in NodeJS:
```javascript
const axios = require('axios');

async function Test() {
    let result = await axios.post("http://127.0.0.1:8080/tokenize", {
        content: `Hello`
    });
    // print the token array
    console.log(result.data.tokens);
}

Test();
```
## Common Options

- `-m FNAME, --model FNAME`: Specify the path to the LLaMA model file (e.g., `models/7B/ggml-model.bin`).
- `-c N, --ctx_size N`: Set the size of the prompt context. The default is 512, but LLaMA models were built with a context of 2048, which will provide better results for longer input/inference.
- `--embedding`: Enable the embedding mode. **The completion function doesn't work in this mode.**
- `--host`: Set the hostname or IP address to listen on. Default: `127.0.0.1`.
- `--port`: Set the port to listen on. Default: `8080`.
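For example, to make the server reachable from other machines, a launch command along these lines should work (a sketch; the address `0.0.0.0` and port `8888` are arbitrary choices):

```bash
# Listen on all network interfaces on port 8888 instead of the default 127.0.0.1:8080.
./server -m models/7B/ggml-model.bin --ctx_size 2048 --host 0.0.0.0 --port 8888
```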
### Keep Prompt

The `--keep` option allows users to retain the original prompt when the model runs out of context, ensuring a connection to the initial instruction or conversation topic is maintained.

- `--keep N`: Specify the number of tokens from the initial prompt to retain when the model resets its internal context. By default, this value is set to 0 (meaning no tokens are kept). Use `-1` to retain all tokens from the initial prompt.

By utilizing context management options like `--ctx_size` and `--keep`, you can maintain a more coherent and consistent interaction with the LLaMA models, ensuring that the generated text remains relevant to the original prompt or conversation.
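For instance, to retain the entire initial prompt whenever the context fills up (the same invocation as in the Quick Start, plus `--keep -1`):

```bash
# Keep all tokens from the initial prompt when the internal context is reset.
./server -m models/7B/ggml-model.bin --ctx_size 2048 --keep -1
```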
### RNG Seed

- `-s SEED, --seed SEED`: Set the random number generator (RNG) seed (default: -1, < 0 = random seed).
@@ -150,12 +290,12 @@ The RNG seed is used to initialize the random number generator that influences t

### No Memory Mapping

- `--no-mmap`: Do not memory-map the model. By default, models are mapped into memory, which allows the system to load only the necessary parts of the model as needed. However, if the model is larger than your total amount of RAM or if your system is low on available memory, using mmap might increase the risk of pageouts, negatively impacting performance. Disabling mmap results in slower load times but may reduce pageouts if you're not using `--mlock`. Note that if the model is larger than the total amount of RAM, turning off mmap would prevent the model from loading at all.
- `--no-mmap`: Do not memory-map the model. By default, models are mapped into memory, which allows the system to load only the necessary parts of the model as needed. However, if the model is larger than your total amount of RAM or if your system is low on available memory, using mmap might increase the risk of pageouts, negatively impacting performance.

### Memory Float 32

- `--memory_f32`: Use 32-bit floats instead of 16-bit floats for memory key+value, allowing higher quality inference at the cost of higher memory usage.

## Limitations:
* The actual implementation of llama.cpp need a `llama-state` for support multiple contexts and clients.
* The context can't be reset during runtime.

- The current implementation of llama.cpp needs a `llama-state` to handle multiple contexts and clients, but this could require more powerful hardware.
File diff suppressed because it is too large
@@ -1,50 +0,0 @@
#include <httplib.h>
#include <json.hpp>
#include <cstring>
#include "common.h"
#include "llama.h"

/*
  This isn't the best way to do this.

  Missing:
    - Clean context (insert new prompt for change the behavior,
      this implies clean kv cache and emb_inp in runtime)
    - Release context (free memory) after shutdown the server
*/

class Llama{
  public:
    Llama(gpt_params params_) : params(params_){};
    bool load_context();
    bool prompt_test();
    void setting_context();
    int set_message(std::string msg);
    void release();

    llama_token nextToken();
    std::string inference();

    bool context_config = false;
    bool is_antiprompt = false;
    int tokens_completion = 0;
    gpt_params params;
    std::string user_tag = "### Human:", assistant_tag = "### Assistant:";

  private:
    llama_context *ctx;
    int n_ctx;
    int n_past = 0;
    int n_consumed = 0;
    int n_session_consumed = 0;
    int n_remain = 0;
    std::vector<llama_token> embd;
    std::vector<llama_token> last_n_tokens;
    bool is_interacting = false;
    std::vector<int> llama_token_newline;
    std::vector<int> embd_inp;

    // to ignore this in the completion
    std::vector<int> user_tag_tokens;
    std::vector<int> assistant_tag_tokens;
};