From f647ce040ff06348d2ceaa5443a6a7a8b80c70c9 Mon Sep 17 00:00:00 2001 From: Tomas Date: Thu, 4 May 2023 17:02:30 +0700 Subject: [PATCH 01/11] fix #1224 reverse prompt and multi line (#1297) * fix reverse prompt and multi line * Code Formatting Co-authored-by: Georgi Gerganov --------- Co-authored-by: Georgi Gerganov --- examples/main/main.cpp | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/examples/main/main.cpp b/examples/main/main.cpp index 125c189a3..17a5a90d1 100644 --- a/examples/main/main.cpp +++ b/examples/main/main.cpp @@ -551,12 +551,14 @@ int main(int argc, char ** argv) { return 0; } #endif - if (line.empty() || line.back() != '\\') { - another_line = false; - } else { - line.pop_back(); // Remove the continue character + if (!line.empty()) { + if (line.back() == '\\') { + line.pop_back(); // Remove the continue character + } else { + another_line = false; + } + buffer += line + '\n'; // Append the line to the result } - buffer += line + '\n'; // Append the line to the result } while (another_line); // done taking input, reset color From c65a7fbfa9c736416a25369cc05d356789df4c15 Mon Sep 17 00:00:00 2001 From: DannyDaemonic Date: Thu, 4 May 2023 03:02:59 -0700 Subject: [PATCH 02/11] Update main's README.md with new features (#1296) --- examples/main/README.md | 141 ++++++++++++++++++++++++++++++++-------- 1 file changed, 113 insertions(+), 28 deletions(-) diff --git a/examples/main/README.md b/examples/main/README.md index ba210d14a..493a8c095 100644 --- a/examples/main/README.md +++ b/examples/main/README.md @@ -17,23 +17,45 @@ This example program allows you to use various LLaMA language models in an easy To get started right away, run the following command, making sure to use the correct path for the model you have: +#### Unix-based systems (Linux, macOS, etc.): + ```bash ./main -m models/7B/ggml-model.bin --prompt "Once upon a time" ``` -The following command generates "infinite" text from a starting prompt (you can use `Ctrl-C` to stop it): +#### Windows: -```bash -./main -m models/7B/ggml-model.bin --ignore-eos --n_predict -1 --keep -1 --prompt "Once upon a time" +```powershell +main.exe -m models\7B\ggml-model.bin --prompt "Once upon a time" ``` For an interactive experience, try this command: +#### Unix-based systems (Linux, macOS, etc.): + ```bash -./main -m models/7B/ggml-model.bin -n -1 --color -r "User:" --in-prefix " " --prompt $'User: Hi\nAI: Hello. I am an AI chatbot. Would you like to talk?\nUser: Sure!\nAI: What would you like to talk about?\nUser:' +./main -m models/7B/ggml-model.bin -n -1 --color -r "User:" --in-prefix " " --prompt 'User: Hi\nAI: Hello. I am an AI chatbot. Would you like to talk?\nUser: Sure!\nAI: What would you like to talk about?\nUser:' ``` -Note that the newline characters in the prompt string above only work on Linux. On Windows, you will have to use the ``--file`` option (see below) to load a multi-line prompt from file instead. +#### Windows: + +```powershell +main.exe -m models\7B\ggml-model.bin -n -1 --color -r "User:" --in-prefix " " --prompt "User: Hi\nAI: Hello. I am an AI chatbot. 
Would you like to talk?\nUser: Sure!\nAI: What would you like to talk about?\nUser:" +``` + +The following command generates "infinite" text from a starting prompt (you can use `Ctrl-C` to stop it): + +#### Unix-based systems (Linux, macOS, etc.): + +```bash +./main -m models/7B/ggml-model.bin --ignore-eos -n -1 --random-prompt +``` + +#### Windows: + +```powershell +main.exe -m models\7B\ggml-model.bin --ignore-eos -n -1 --random-prompt +``` ## Common Options @@ -42,7 +64,6 @@ In this section, we cover the most commonly used options for running the `main` - `-m FNAME, --model FNAME`: Specify the path to the LLaMA model file (e.g., `models/7B/ggml-model.bin`). - `-i, --interactive`: Run the program in interactive mode, allowing you to provide input directly and receive real-time responses. - `-ins, --instruct`: Run the program in instruction mode, which is particularly useful when working with Alpaca models. -- `-t N, --threads N`: Set the number of threads to use during computation. It is recommended to set this to the number of physical cores your CPU has. - `-n N, --n_predict N`: Set the number of tokens to predict when generating text. Adjusting this value can influence the length of the generated text. - `-c N, --ctx_size N`: Set the size of the prompt context. The default is 512, but LLaMA models were built with a context of 2048, which will provide better results for longer input/inference. @@ -92,7 +113,7 @@ Instruction mode is particularly useful when working with Alpaca models, which a - `-ins, --instruct`: Enable instruction mode to leverage the capabilities of Alpaca models in completing tasks based on user-provided instructions. -Technical detail: the user's input is internally prefixed with the reverse prompt (or ``### Instruction:`` as the default), and followed by ``### Response:`` (except if you just press Return without any input, to keep generating a longer response). +Technical detail: the user's input is internally prefixed with the reverse prompt (or `### Instruction:` as the default), and followed by `### Response:` (except if you just press Return without any input, to keep generating a longer response). By understanding and utilizing these interaction options, you can create engaging and dynamic experiences with the LLaMA models, tailoring the text generation process to your specific needs. @@ -116,7 +137,7 @@ By utilizing context management options like `--ctx_size` and `--keep`, you can ## Generation Flags -The following options are related to controlling the text generation process, influencing the diversity, creativity, and quality of the generated text. Understanding these options will help you fine-tune the output according to your needs: +The following options allow you to control the text generation process and fine-tune the diversity, creativity, and quality of the generated text according to your needs. By adjusting these options and experimenting with different combinations of values, you can find the best settings for your specific use case. ### Number of Tokens to Predict @@ -124,13 +145,7 @@ The following options are related to controlling the text generation process, in The `--n_predict` option controls the number of tokens the model generates in response to the input prompt. By adjusting this value, you can influence the length of the generated text. A higher value will result in longer text, while a lower value will produce shorter text. A value of -1 will cause text to be generated without limit. 
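For illustration, a hedged sketch of how `-n` changes run length (the model path and prompt below are placeholders, not part of this patch):

```bash
# Cap the response at 128 tokens (placeholder model path and prompt).
./main -m models/7B/ggml-model.bin -p "Once upon a time" -n 128

# With -1 there is no cap; generation stops only on EOS or a reverse prompt.
./main -m models/7B/ggml-model.bin -p "Once upon a time" -n -1
```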
-It is important to note that the generated text may be shorter than the specified number of tokens if an End-of-Sequence (EOS) token or a reverse prompt is encountered. In interactive mode text generation will pause and control will be returned to the user. In non-interactive mode, the program will end. In both cases, the text generation may stop before reaching the specified `n_predict` value. If you want the model to keep going without ever producing End-of-Sequence on its own, you can use the ``--ignore-eos`` parameter. - -### RNG Seed - -- `-s SEED, --seed SEED`: Set the random number generator (RNG) seed (default: -1). - -The RNG seed is used to initialize the random number generator that influences the text generation process. By setting a specific seed value, you can obtain consistent and reproducible results across multiple runs with the same input and settings. This can be helpful for testing, debugging, or comparing the effects of different options on the generated text to see when they diverge. If the seed is set to a value less than 0, a random seed will be used, which will result in different outputs on each run. +It is important to note that the generated text may be shorter than the specified number of tokens if an End-of-Sequence (EOS) token or a reverse prompt is encountered. In interactive mode text generation will pause and control will be returned to the user. In non-interactive mode, the program will end. In both cases, the text generation may stop before reaching the specified `n_predict` value. If you want the model to keep going without ever producing End-of-Sequence on its own, you can use the `--ignore-eos` parameter. ### Temperature @@ -138,15 +153,21 @@ The RNG seed is used to initialize the random number generator that influences t Temperature is a hyperparameter that controls the randomness of the generated text. It affects the probability distribution of the model's output tokens. A higher temperature (e.g., 1.5) makes the output more random and creative, while a lower temperature (e.g., 0.5) makes the output more focused, deterministic, and conservative. The default value is 0.8, which provides a balance between randomness and determinism. At the extreme, a temperature of 0 will always pick the most likely next token, leading to identical outputs in each run. -Example usage: `--temp 0.8` +Example usage: `--temp 0.5` ### Repeat Penalty - `--repeat_penalty N`: Control the repetition of token sequences in the generated text (default: 1.1). +- `--repeat_last_n N`: Last n tokens to consider for penalizing repetition (default: 64, 0 = disabled, -1 = ctx_size). +- `--no-penalize-nl`: Disable penalization for newline tokens when applying the repeat penalty. -Repeat penalty is a hyperparameter used to penalize the repetition of token sequences during text generation. It helps prevent the model from generating repetitive or monotonous text. A higher value (e.g., 1.5) will penalize repetitions more strongly, while a lower value (e.g., 0.9) will be more lenient. The default value is 1.1. +The `repeat_penalty` option helps prevent the model from generating repetitive or monotonous text. A higher value (e.g., 1.5) will penalize repetitions more strongly, while a lower value (e.g., 0.9) will be more lenient. The default value is 1.1. -Example usage: `--repeat_penalty 1.1` +The `repeat_last_n` option controls the number of tokens in the history to consider for penalizing repetition. 
A larger value will look further back in the generated text to prevent repetitions, while a smaller value will only consider recent tokens. A value of 0 disables the penalty, and a value of -1 sets the number of tokens considered equal to the context size (`ctx_size`). + +Use the `--no-penalize-nl` option to disable newline penalization when applying the repeat penalty. This option is particularly useful for generating chat conversations, dialogues, code, poetry, or any text where newline tokens play a significant role in structure and formatting. Disabling newline penalization helps maintain the natural flow and intended formatting in these specific use cases. + +Example usage: `--repeat_penalty 1.15 --repeat_last_n 128 --no-penalize-nl` ### Top-K Sampling @@ -154,7 +175,7 @@ Example usage: `--repeat_penalty 1.1` Top-k sampling is a text generation method that selects the next token only from the top k most likely tokens predicted by the model. It helps reduce the risk of generating low-probability or nonsensical tokens, but it may also limit the diversity of the output. A higher value for top_k (e.g., 100) will consider more tokens and lead to more diverse text, while a lower value (e.g., 10) will focus on the most probable tokens and generate more conservative text. The default value is 40. -Example usage: `--top_k 40` +Example usage: `--top_k 30` ### Top-P Sampling @@ -162,23 +183,87 @@ Example usage: `--top_k 40` Top-p sampling, also known as nucleus sampling, is another text generation method that selects the next token from a subset of tokens that together have a cumulative probability of at least p. This method provides a balance between diversity and quality by considering both the probabilities of tokens and the number of tokens to sample from. A higher value for top_p (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text. The default value is 0.9. -Example usage: `--top_p 0.9` +Example usage: `--top_p 0.95` -By adjusting these options, you can control the diversity, quality, and creativity of the generated text to better suit your needs. You can experiment with different combinations of values to find the best settings for your specific use case. +### Tail Free Sampling (TFS) + +- `--tfs N`: Enable tail free sampling with parameter z (default: 1.0, 1.0 = disabled). + +Tail free sampling (TFS) is a text generation technique that aims to reduce the impact of less likely tokens, which may be less relevant, less coherent, or nonsensical, on the output. The method adjusts the logits (token probabilities) by raising them to the power of the parameter z. A higher value of z (e.g., 2.0) will further suppress less likely tokens from the tail of the distribution, while a value of 1.0 disables the effect of TFS. By setting the parameter z, you can control how much the probabilities of less likely tokens are reduced. + +Example usage: `--tfs 2.0` + +### Locally Typical Sampling + +- `--typical N`: Enable locally typical sampling with parameter p (default: 1.0, 1.0 = disabled). + +Locally typical sampling promotes the generation of contextually coherent and diverse text by sampling tokens that are typical or expected based on the surrounding context. By setting the parameter p between 0 and 1, you can control the balance between producing text that is locally coherent and diverse. A value closer to 1 will promote more contextually coherent tokens, while a value closer to 0 will promote more diverse tokens. 
A value equal to 1 disables locally typical sampling. + +Example usage: `--typical 0.9` + +### Mirostat Sampling + +- `--mirostat N`: Enable Mirostat sampling, controlling perplexity during text generation (default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0). +- `--mirostat_lr N`: Set the Mirostat learning rate, parameter eta (default: 0.1). +- `--mirostat_ent N`: Set the Mirostat target entropy, parameter tau (default: 5.0). + +Mirostat is an algorithm that actively maintains the quality of generated text within a desired range during text generation. It aims to strike a balance between coherence and diversity, avoiding low-quality output caused by excessive repetition (boredom traps) or incoherence (confusion traps). + +The `--mirostat_lr` option sets the Mirostat learning rate (eta). The learning rate influences how quickly the algorithm responds to feedback from the generated text. A lower learning rate will result in slower adjustments, while a higher learning rate will make the algorithm more responsive. The default value is `0.1`. + +The `--mirostat_ent` option sets the Mirostat target entropy (tau), which represents the desired perplexity value for the generated text. Adjusting the target entropy allows you to control the balance between coherence and diversity in the generated text. A lower value will result in more focused and coherent text, while a higher value will lead to more diverse and potentially less coherent text. The default value is `5.0`. + +Example usage: `--mirostat 2 --mirostat_lr 0.05 --mirostat_ent 3.0` + +### Logit Bias + +- `-l TOKEN_ID(+/-)BIAS, --logit-bias TOKEN_ID(+/-)BIAS`: Modify the likelihood of a token appearing in the generated text completion. + +The logit bias option allows you to manually adjust the likelihood of specific tokens appearing in the generated text. By providing a token ID and a positive or negative bias value, you can increase or decrease the probability of that token being generated. + +For example, use `--logit-bias 15043+1` to increase the likelihood of the token 'Hello', or `--logit-bias 15043-1` to decrease its likelihood. Using a value of negative infinity, `--logit-bias 15043-inf` ensures that the token `Hello` is never produced. + +A more practical use case might be to prevent the generation of `\code{begin}` and `\code{end}` by setting the `\` token (29905) to negative infinity with `-l 29905-inf`. (This is due to the prevalence of LaTeX codes that show up in LLaMA model inference.) + +Example usage: `--logit-bias 29905-inf` + +### RNG Seed + +- `-s SEED, --seed SEED`: Set the random number generator (RNG) seed (default: -1, < 0 = random seed). + +The RNG seed is used to initialize the random number generator that influences the text generation process. By setting a specific seed value, you can obtain consistent and reproducible results across multiple runs with the same input and settings. This can be helpful for testing, debugging, or comparing the effects of different options on the generated text to see when they diverge. If the seed is set to a value less than 0, a random seed will be used, which will result in different outputs on each run. ## Performance Tuning and Memory Options -These options help improve the performance and memory usage of the LLaMA models: +These options help improve the performance and memory usage of the LLaMA models. By adjusting these settings, you can fine-tune the model's behavior to better suit your system's capabilities and achieve optimal performance for your specific use case. 
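As a rough, hedged sketch of how these options are typically combined (each flag is described in the subsections below; the paths and values are placeholders):

```bash
# Placeholder invocation combining the performance options covered below:
# 8 threads (one per physical core), locked memory, a larger batch size,
# and a session file that caches the evaluated prompt between runs.
./main -m models/7B/ggml-model.bin -t 8 --mlock -b 512 \
  --session session.bin -p "Once upon a time"
```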
+ +### Number of Threads + +- `-t N, --threads N`: Set the number of threads to use during computation. For optimal performance, it is recommended to set this value to the number of physical CPU cores your system has (as opposed to the logical number of cores). Using the correct number of threads can greatly improve performance. + +### Mlock + +- `--mlock`: Lock the model in memory, preventing it from being swapped out when memory-mapped. This can improve performance but trades away some of the advantages of memory-mapping by requiring more RAM to run and potentially slowing down load times as the model loads into RAM. + +### No Memory Mapping + +- `--no-mmap`: Do not memory-map the model. By default, models are mapped into memory, which allows the system to load only the necessary parts of the model as needed. However, if the model is larger than your total amount of RAM or if your system is low on available memory, using mmap might increase the risk of pageouts, negatively impacting performance. Disabling mmap results in slower load times but may reduce pageouts if you're not using `--mlock`. Note that if the model is larger than the total amount of RAM, turning off mmap would prevent the model from loading at all. + +### Memory Float 32 + +- `--memory_f32`: Use 32-bit floats instead of 16-bit floats for memory key+value, allowing higher quality inference at the cost of higher memory usage. + +### Batch Size -- `-t N, --threads N`: Set the number of threads to use during computation. Using the correct number of threads can greatly improve performance. It is recommended to set this value to the number of CPU cores. -- `--mlock`: Lock the model in memory, preventing it from being swapped out when mmaped. This can improve performance. -- `--no-mmap`: Do not memory-map the model. This results in a slower load time but may reduce pageouts if you're not using `mlock`. -- `--memory_f32`: Use 32 bit floats instead of 16 bit floats for memory key+value, allowing higher quality inference at the cost of memory. - `-b N, --batch_size N`: Set the batch size for prompt processing (default: 512). This large batch size benefits users who have BLAS installed and enabled it during the build. If you don't have BLAS enabled ("BLAS=0"), you can use a smaller number, such as 8, to see the prompt progress as it's evaluated in some situations. -For information about 4-bit quantization, which can significantly improve performance and reduce memory usage, please refer to llama.cpp's primary [README](../../README.md#prepare-data--run). +### Session Caching -By understanding and using these performance tuning settings, you can optimize the LLaMA model's behavior to achieve the best performance for your specific needs. +- `--session FNAME`: Specify a file to load/save the session, which caches the model state after the initial prompt. This can significantly speed up the startup time when you're using longer prompts. The session file is created during the first run and is reused in subsequent runs. If you change your prompt such that 75% or less of the session is reusable, the existing session file will be overwritten with a new, updated version to maintain optimal performance. + +### Quantization + +For information about 4-bit quantization, which can significantly improve performance and reduce memory usage, please refer to llama.cpp's primary [README](../../README.md#prepare-data--run). 
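As a brief, hedged sketch of that workflow (file names are placeholders; the `quantize` invocation follows the tool's `model-f32.bin model-quant.bin type` form, and the primary README remains the authoritative reference):

```bash
# Produce a 4-bit (q4_0) model from an f16/f32 GGML model (placeholder paths).
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin q4_0

# Then run main against the quantized file as usual.
./main -m ./models/7B/ggml-model-q4_0.bin -p "Once upon a time"
```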
## Additional Options From db1080876a62ec3bb4119d90b16e7dce7594b733 Mon Sep 17 00:00:00 2001 From: DannyDaemonic Date: Thu, 4 May 2023 05:08:25 -0700 Subject: [PATCH 03/11] Only escape prompts when used with `-e` (#1311) --- examples/common.cpp | 46 ++++++++++++++++++++++------------------- examples/main/README.md | 9 ++++++-- 2 files changed, 32 insertions(+), 23 deletions(-) diff --git a/examples/common.cpp b/examples/common.cpp index 1a2f4743a..cd6300041 100644 --- a/examples/common.cpp +++ b/examples/common.cpp @@ -66,35 +66,33 @@ int32_t get_num_physical_cores() { return n_threads > 0 ? (n_threads <= 4 ? n_threads : n_threads / 2) : 4; } -std::string process_escapes(const char* input) { - std::string output; +void process_escapes(std::string& input) { + std::size_t input_len = input.length(); + std::size_t output_idx = 0; - if (input != nullptr) { - std::size_t input_len = std::strlen(input); - output.reserve(input_len); - - for (std::size_t i = 0; i < input_len; ++i) { - if (input[i] == '\\' && i + 1 < input_len) { - switch (input[++i]) { - case 'n': output.push_back('\n'); break; - case 't': output.push_back('\t'); break; - case '\'': output.push_back('\''); break; - case '\"': output.push_back('\"'); break; - case '\\': output.push_back('\\'); break; - default: output.push_back('\\'); - output.push_back(input[i]); break; - } - } else { - output.push_back(input[i]); + for (std::size_t input_idx = 0; input_idx < input_len; ++input_idx) { + if (input[input_idx] == '\\' && input_idx + 1 < input_len) { + switch (input[++input_idx]) { + case 'n': input[output_idx++] = '\n'; break; + case 'r': input[output_idx++] = '\r'; break; + case 't': input[output_idx++] = '\t'; break; + case '\'': input[output_idx++] = '\''; break; + case '\"': input[output_idx++] = '\"'; break; + case '\\': input[output_idx++] = '\\'; break; + default: input[output_idx++] = '\\'; + input[output_idx++] = input[input_idx]; break; } + } else { + input[output_idx++] = input[input_idx]; } } - return output; + input.resize(output_idx); } bool gpt_params_parse(int argc, char ** argv, gpt_params & params) { bool invalid_param = false; + bool escape_prompt = false; std::string arg; gpt_params default_params; @@ -118,7 +116,9 @@ bool gpt_params_parse(int argc, char ** argv, gpt_params & params) { invalid_param = true; break; } - params.prompt = process_escapes(argv[i]); + params.prompt = argv[i]; + } else if (arg == "-e") { + escape_prompt = true; } else if (arg == "--session") { if (++i >= argc) { invalid_param = true; @@ -335,6 +335,9 @@ bool gpt_params_parse(int argc, char ** argv, gpt_params & params) { gpt_print_usage(argc, argv, default_params); exit(1); } + if (escape_prompt) { + process_escapes(params.prompt); + } return true; } @@ -355,6 +358,7 @@ void gpt_print_usage(int /*argc*/, char ** argv, const gpt_params & params) { fprintf(stderr, " -t N, --threads N number of threads to use during computation (default: %d)\n", params.n_threads); fprintf(stderr, " -p PROMPT, --prompt PROMPT\n"); fprintf(stderr, " prompt to start generation with (default: empty)\n"); + fprintf(stderr, " -e process prompt escapes sequences (\\n, \\r, \\t, \\', \\\", \\\\)\n"); fprintf(stderr, " --session FNAME file to cache model state in (may be large!) 
(default: none)\n"); fprintf(stderr, " --random-prompt start with a randomized prompt.\n"); fprintf(stderr, " --in-prefix STRING string to prefix user inputs with (default: empty)\n"); diff --git a/examples/main/README.md b/examples/main/README.md index 493a8c095..6b7facb3b 100644 --- a/examples/main/README.md +++ b/examples/main/README.md @@ -34,13 +34,18 @@ For an interactive experience, try this command: #### Unix-based systems (Linux, macOS, etc.): ```bash -./main -m models/7B/ggml-model.bin -n -1 --color -r "User:" --in-prefix " " --prompt 'User: Hi\nAI: Hello. I am an AI chatbot. Would you like to talk?\nUser: Sure!\nAI: What would you like to talk about?\nUser:' +./main -m models/7B/ggml-model.bin -n -1 --color -r "User:" --in-prefix " " \ +'User: Hi +AI: Hello. I am an AI chatbot. Would you like to talk? +User: Sure! +AI: What would you like to talk about? +User:' ``` #### Windows: ```powershell -main.exe -m models\7B\ggml-model.bin -n -1 --color -r "User:" --in-prefix " " --prompt "User: Hi\nAI: Hello. I am an AI chatbot. Would you like to talk?\nUser: Sure!\nAI: What would you like to talk about?\nUser:" +main.exe -m models\7B\ggml-model.bin -n -1 --color -r "User:" --in-prefix " " -e --prompt "User: Hi\nAI: Hello. I am an AI chatbot. Would you like to talk?\nUser: Sure!\nAI: What would you like to talk about?\nUser:" ``` The following command generates "infinite" text from a starting prompt (you can use `Ctrl-C` to stop it): From 20fbf2a2a08d8edefe9b3435fa86f8b2f63f8588 Mon Sep 17 00:00:00 2001 From: Ron Jailall Date: Thu, 4 May 2023 11:05:59 -0400 Subject: [PATCH 04/11] ggml : change immintrin.h to intrin.h for compatibility (#1307) * change immintrin.h to intrin.h for compatibility Building on windows11 arm throws an error on this line. Seems like using intrin.h covers x86 and and arm * conditional def of intrin.h * fix typo in ggml.c --- ggml.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/ggml.c b/ggml.c index 0bcb5f617..4d49242a4 100644 --- a/ggml.c +++ b/ggml.c @@ -180,9 +180,13 @@ typedef double ggml_float; #undef bool #define bool _Bool #else +#if defined(_MSC_VER) || defined(__MINGW32__) +#include +#else #include #endif #endif +#endif #ifdef __F16C__ From 2edbdb0f99336cb41f0995061c7602ed54beb863 Mon Sep 17 00:00:00 2001 From: 44670 <44670@users.noreply.github.com> Date: Thu, 4 May 2023 23:41:12 +0800 Subject: [PATCH 05/11] main : add --in-suffix option (#1318) * adding --in-suffix option * print input suffix before generation --- examples/common.cpp | 7 +++++++ examples/common.h | 1 + examples/main/README.md | 8 ++++++++ examples/main/main.cpp | 9 +++++++++ 4 files changed, 25 insertions(+) diff --git a/examples/common.cpp b/examples/common.cpp index cd6300041..97eded6ec 100644 --- a/examples/common.cpp +++ b/examples/common.cpp @@ -324,6 +324,12 @@ bool gpt_params_parse(int argc, char ** argv, gpt_params & params) { break; } params.input_prefix = argv[i]; + } else if (arg == "--in-suffix") { + if (++i >= argc) { + invalid_param = true; + break; + } + params.input_suffix = argv[i]; } else { fprintf(stderr, "error: unknown argument: %s\n", arg.c_str()); gpt_print_usage(argc, argv, default_params); @@ -362,6 +368,7 @@ void gpt_print_usage(int /*argc*/, char ** argv, const gpt_params & params) { fprintf(stderr, " --session FNAME file to cache model state in (may be large!) 
(default: none)\n"); fprintf(stderr, " --random-prompt start with a randomized prompt.\n"); fprintf(stderr, " --in-prefix STRING string to prefix user inputs with (default: empty)\n"); + fprintf(stderr, " --in-suffix STRING string to suffix after user inputs with (default: empty)\n"); fprintf(stderr, " -f FNAME, --file FNAME\n"); fprintf(stderr, " prompt file to start generation.\n"); fprintf(stderr, " -n N, --n_predict N number of tokens to predict (default: %d, -1 = infinity)\n", params.n_predict); diff --git a/examples/common.h b/examples/common.h index 138d0ded0..842e1516f 100644 --- a/examples/common.h +++ b/examples/common.h @@ -43,6 +43,7 @@ struct gpt_params { std::string prompt = ""; std::string path_session = ""; // path to file for saving/loading model eval state std::string input_prefix = ""; // string to prefix user inputs with + std::string input_suffix = ""; // string to suffix user inputs with std::vector antiprompt; // string upon seeing which more user input is prompted std::string lora_adapter = ""; // lora adapter path diff --git a/examples/main/README.md b/examples/main/README.md index 6b7facb3b..35f87bcd5 100644 --- a/examples/main/README.md +++ b/examples/main/README.md @@ -112,6 +112,14 @@ The `--in-prefix` flag is used to add a prefix to your input, primarily, this is ./main -r "User:" --in-prefix " " ``` +### In-Suffix + +The `--in-suffix` flag is used to add a suffix after your input. This is useful for adding an "Assistant:" prompt after the user's input. It's added after the new-line character (`\n`) that's automatically added to the end of the user's input. Here's an example of how to use the `--in-suffix` flag in conjunction with the `--reverse-prompt` flag: + +```sh +./main -r "User:" --in-prefix " " --in-suffix "Assistant:" +``` + ### Instruction Mode Instruction mode is particularly useful when working with Alpaca models, which are designed to follow user instructions for specific tasks: diff --git a/examples/main/main.cpp b/examples/main/main.cpp index 17a5a90d1..43dca8eb5 100644 --- a/examples/main/main.cpp +++ b/examples/main/main.cpp @@ -260,6 +260,10 @@ int main(int argc, char ** argv) { if (!params.input_prefix.empty()) { fprintf(stderr, "Input prefix: '%s'\n", params.input_prefix.c_str()); } + + if (!params.input_suffix.empty()) { + fprintf(stderr, "Input suffix: '%s'\n", params.input_suffix.c_str()); + } } fprintf(stderr, "sampling: repeat_last_n = %d, repeat_penalty = %f, presence_penalty = %f, frequency_penalty = %f, top_k = %d, tfs_z = %f, top_p = %f, typical_p = %f, temp = %f, mirostat = %d, mirostat_lr = %f, mirostat_ent = %f\n", params.repeat_last_n, params.repeat_penalty, params.presence_penalty, params.frequency_penalty, params.top_k, params.tfs_z, params.top_p, params.typical_p, params.temp, params.mirostat, params.mirostat_eta, params.mirostat_tau); @@ -567,6 +571,11 @@ int main(int argc, char ** argv) { // Add tokens to embd only if the input buffer is non-empty // Entering a empty line lets the user pass control back if (buffer.length() > 1) { + // append input suffix if any + if (!params.input_suffix.empty()) { + buffer += params.input_suffix; + printf("%s", params.input_suffix.c_str()); + } // instruct mode: insert instruction prefix if (params.instruct && !is_antiprompt) { From 360cfe5bec852805b84eec799102fc6f45df9fef Mon Sep 17 00:00:00 2001 From: 44670 <44670@users.noreply.github.com> Date: Fri, 5 May 2023 00:33:31 +0800 Subject: [PATCH 06/11] readme : add OpenBuddy link (#1321) --- README.md | 1 + 1 file changed, 1 insertion(+) 
diff --git a/README.md b/README.md index 0002f8cc1..f1fa63542 100644 --- a/README.md +++ b/README.md @@ -43,6 +43,7 @@ as the main playground for developing new features for the [ggml](https://github - [X] [Vigogne (French)](https://github.com/bofenghuang/vigogne) - [X] [Vicuna](https://github.com/ggerganov/llama.cpp/discussions/643#discussioncomment-5533894) - [X] [Koala](https://bair.berkeley.edu/blog/2023/04/03/koala/) +- [X] [OpenBuddy 🐶 (Multilingual)](https://github.com/OpenBuddy/OpenBuddy) **Bindings:** From d3e8093e9b5845514b049ede3b12728c8f013eba Mon Sep 17 00:00:00 2001 From: Ivan Stepanov Date: Thu, 4 May 2023 19:54:37 +0300 Subject: [PATCH 07/11] convert: support DT_BF16 tensors (#1309) Co-authored-by: Pavol Rusnak --- convert.py | 18 +++++++++++++++++- 1 file changed, 17 insertions(+), 1 deletion(-) diff --git a/convert.py b/convert.py index 7f7ae05fa..c817a343e 100644 --- a/convert.py +++ b/convert.py @@ -67,6 +67,7 @@ FTYPE_TO_DATA_TYPE: Dict[int, DataType] = \ {ftype: dtype for (dtype, ftype) in DATA_TYPE_TO_FTYPE.items()} DATA_TYPE_TO_NUMPY: Dict[DataType, 'np.dtype[Any]'] = { + DT_BF16: np.dtype(np.uint16), DT_F16: np.dtype(np.float16), DT_F32: np.dtype(np.float32), DT_I32: np.dtype(np.int32), @@ -276,6 +277,12 @@ class Tensor(metaclass=ABCMeta): def to_ggml(self) -> 'GGMLCompatibleTensor': ... +def bf16_to_fp32(bf16_arr: np.ndarray) -> np.ndarray: + assert bf16_arr.dtype == np.uint16, f"Input array should be of dtype uint16, but got {bf16_arr.dtype}" + fp32_arr = bf16_arr.astype(np.uint32) << 16 + return fp32_arr.view(np.float32) + + class UnquantizedTensor(Tensor): def __init__(self, ndarray: NDArray) -> None: assert isinstance(ndarray, np.ndarray) @@ -284,6 +291,8 @@ class UnquantizedTensor(Tensor): def astype(self, data_type: DataType) -> Tensor: dtype = DATA_TYPE_TO_NUMPY[data_type] + if self.data_type == DT_BF16: + self.ndarray = bf16_to_fp32(self.ndarray) return UnquantizedTensor(self.ndarray.astype(dtype)) def to_ggml(self) -> 'UnquantizedTensor': @@ -686,6 +695,7 @@ class LazyUnpickler(pickle.Unpickler): description = f'storage data_type={data_type} path-in-zip={filename} path={self.zip_file.filename}' return LazyStorage(load=load, kind=pid[1], description=description) + @staticmethod def lazy_rebuild_tensor_v2(storage: Any, storage_offset: Any, size: Any, stride: Any, # pyright: ignore[reportSelfClsParameterName] requires_grad: Any, backward_hooks: Any, metadata: Any = None) -> LazyTensor: assert isinstance(storage, LazyStorage) @@ -696,12 +706,18 @@ class LazyUnpickler(pickle.Unpickler): description = f'pickled storage_offset={storage_offset} in {storage.description}' return LazyTensor(load, list(size), storage.kind.data_type, description) + @staticmethod + def rebuild_from_type_v2(func, new_type, args, state): + return func(*args) + CLASSES: Dict[Any, Any] = { + ('torch._tensor', '_rebuild_from_type_v2'): rebuild_from_type_v2, ('torch._utils', '_rebuild_tensor_v2'): lazy_rebuild_tensor_v2, ('torch', 'BFloat16Storage'): LazyStorageKind(DT_BF16), ('torch', 'HalfStorage'): LazyStorageKind(DT_F16), ('torch', 'FloatStorage'): LazyStorageKind(DT_F32), ('torch', 'IntStorage'): LazyStorageKind(DT_I32), + ('torch', 'Tensor'): LazyTensor, } def find_class(self, module: str, name: str) -> Any: @@ -961,7 +977,7 @@ class OutputFile: def pick_output_type(model: LazyModel, output_type_str: Optional[str]) -> GGMLFileType: wq_type = model["layers.0.attention.wq.weight"].data_type - if output_type_str == "f32" or (output_type_str is None and wq_type == DT_F32): + if 
output_type_str == "f32" or (output_type_str is None and wq_type in (DT_F32, DT_BF16)): return GGMLFileType.AllF32 if output_type_str == "f16" or (output_type_str is None and wq_type == DT_F16): return GGMLFileType.MostlyF16 From 34d9f22f44c42d345cc72c8f3aa4cb71c5df0acb Mon Sep 17 00:00:00 2001 From: Ivan Stepanov Date: Thu, 4 May 2023 19:56:27 +0300 Subject: [PATCH 08/11] Wrap exceptions in std::exception to verbose output on exception. (#1316) --- llama-util.h | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/llama-util.h b/llama-util.h index d531588d5..88ec28dca 100644 --- a/llama-util.h +++ b/llama-util.h @@ -14,6 +14,7 @@ #include #include +#include #ifdef __has_include #if __has_include() @@ -74,7 +75,7 @@ struct llama_file { llama_file(const char * fname, const char * mode) { fp = std::fopen(fname, mode); if (fp == NULL) { - throw format("failed to open %s: %s", fname, std::strerror(errno)); + throw std::runtime_error(format("failed to open %s: %s", fname, strerror(errno))); } seek(0, SEEK_END); size = tell(); @@ -107,10 +108,10 @@ struct llama_file { errno = 0; std::size_t ret = std::fread(ptr, size, 1, fp); if (ferror(fp)) { - throw format("read error: %s", strerror(errno)); + throw std::runtime_error(format("read error: %s", strerror(errno))); } if (ret != 1) { - throw std::string("unexpectedly reached end of file"); + throw std::runtime_error(std::string("unexpectedly reached end of file")); } } @@ -133,7 +134,7 @@ struct llama_file { errno = 0; size_t ret = std::fwrite(ptr, size, 1, fp); if (ret != 1) { - throw format("write error: %s", strerror(errno)); + throw std::runtime_error(format("write error: %s", strerror(errno))); } } @@ -180,7 +181,7 @@ struct llama_mmap { #endif addr = mmap(NULL, file->size, PROT_READ, flags, fd, 0); if (addr == MAP_FAILED) { - throw format("mmap failed: %s", strerror(errno)); + throw std::runtime_error(format("mmap failed: %s", strerror(errno))); } if (prefetch) { @@ -207,7 +208,7 @@ struct llama_mmap { DWORD error = GetLastError(); if (hMapping == NULL) { - throw format("CreateFileMappingA failed: %s", llama_format_win_err(error).c_str()); + throw std::runtime_error(format("CreateFileMappingA failed: %s", llama_format_win_err(error).c_str())); } addr = MapViewOfFile(hMapping, FILE_MAP_READ, 0, 0, 0); @@ -215,7 +216,7 @@ struct llama_mmap { CloseHandle(hMapping); if (addr == NULL) { - throw format("MapViewOfFile failed: %s", llama_format_win_err(error).c_str()); + throw std::runtime_error(format("MapViewOfFile failed: %s", llama_format_win_err(error).c_str())); } #if _WIN32_WINNT >= _WIN32_WINNT_WIN8 @@ -245,7 +246,7 @@ struct llama_mmap { llama_mmap(struct llama_file *, bool prefetch = true) { (void)prefetch; - throw std::string("mmap not supported"); + throw std::runtime_error(std::string("mmap not supported")); } #endif }; From 94c5652fc0f4d04ac54412c4d81e2ebcdafb6ede Mon Sep 17 00:00:00 2001 From: slaren Date: Fri, 5 May 2023 00:58:56 +0200 Subject: [PATCH 09/11] quantize: make output filename optional, default to ggml-model-.bin (#1301) --- examples/quantize/quantize.cpp | 100 ++++++++++++++++++++++++++------- 1 file changed, 81 insertions(+), 19 deletions(-) diff --git a/examples/quantize/quantize.cpp b/examples/quantize/quantize.cpp index 198bd5fcb..7c77018da 100644 --- a/examples/quantize/quantize.cpp +++ b/examples/quantize/quantize.cpp @@ -6,23 +6,47 @@ #include #include -static const std::map LLAMA_FTYPE_MAP = { - {"q4_0", LLAMA_FTYPE_MOSTLY_Q4_0}, - {"q4_1", LLAMA_FTYPE_MOSTLY_Q4_1}, - {"q4_2", 
LLAMA_FTYPE_MOSTLY_Q4_2}, - {"q5_0", LLAMA_FTYPE_MOSTLY_Q5_0}, - {"q5_1", LLAMA_FTYPE_MOSTLY_Q5_1}, - {"q8_0", LLAMA_FTYPE_MOSTLY_Q8_0}, +static const std::map LLAMA_FTYPE_MAP = { + {"q4_0", LLAMA_FTYPE_MOSTLY_Q4_0}, + {"q4_1", LLAMA_FTYPE_MOSTLY_Q4_1}, + {"q4_2", LLAMA_FTYPE_MOSTLY_Q4_2}, + {"q5_0", LLAMA_FTYPE_MOSTLY_Q5_0}, + {"q5_1", LLAMA_FTYPE_MOSTLY_Q5_1}, + {"q8_0", LLAMA_FTYPE_MOSTLY_Q8_0}, }; +bool try_parse_ftype(const std::string & ftype_str, llama_ftype & ftype, std::string & ftype_str_out) { + auto it = LLAMA_FTYPE_MAP.find(ftype_str); + if (it != LLAMA_FTYPE_MAP.end()) { + ftype = it->second; + ftype_str_out = it->first; + return true; + } + // try to parse as an integer + try { + int ftype_int = std::stoi(ftype_str); + for (auto it = LLAMA_FTYPE_MAP.begin(); it != LLAMA_FTYPE_MAP.end(); it++) { + if (it->second == ftype_int) { + ftype = it->second; + ftype_str_out = it->first; + return true; + } + } + } + catch (...) { + // stoi failed + } + return false; +} + // usage: -// ./quantize models/llama/ggml-model.bin models/llama/ggml-model-quant.bin type +// ./quantize models/llama/ggml-model.bin [models/llama/ggml-model-quant.bin] type [nthreads] // int main(int argc, char ** argv) { ggml_time_init(); - if (argc < 4) { - fprintf(stderr, "usage: %s model-f32.bin model-quant.bin type [nthread]\n", argv[0]); + if (argc < 3) { + fprintf(stderr, "usage: %s model-f32.bin [model-quant.bin] type [nthreads]\n", argv[0]); for (auto it = LLAMA_FTYPE_MAP.begin(); it != LLAMA_FTYPE_MAP.end(); it++) { fprintf(stderr, " type = \"%s\" or %d\n", it->first.c_str(), it->second); } @@ -36,24 +60,62 @@ int main(int argc, char ** argv) { ggml_free(ctx); } + // parse command line arguments const std::string fname_inp = argv[1]; - const std::string fname_out = argv[2]; + std::string fname_out; + int nthread; + llama_ftype ftype; - enum llama_ftype ftype; - if (argv[3][0] == 'q') { - auto it = LLAMA_FTYPE_MAP.find(argv[3]); - if (it == LLAMA_FTYPE_MAP.end()) { - fprintf(stderr, "%s: unknown ftype '%s'\n", __func__, argv[3]); + int arg_idx = 2; + std::string ftype_str; + if (try_parse_ftype(argv[arg_idx], ftype, ftype_str)) { + // argv[2] is the ftype + std::string fpath; + const size_t pos = fname_inp.find_last_of('/'); + if (pos != std::string::npos) { + fpath = fname_inp.substr(0, pos + 1); + } + // export as [inp path]/ggml-model-[ftype].bin + fname_out = fpath + "ggml-model-" + ftype_str + ".bin"; + arg_idx++; + } + else { + // argv[2] is the output path + fname_out = argv[arg_idx]; + arg_idx++; + + if (argc <= arg_idx) { + fprintf(stderr, "%s: missing ftype\n", __func__); + return 1; + } + // argv[3] is the ftype + if (!try_parse_ftype(argv[arg_idx], ftype, ftype_str)) { + fprintf(stderr, "%s: invalid ftype '%s'\n", __func__, argv[3]); + return 1; + } + arg_idx++; + } + + // parse nthreads + if (argc > arg_idx) { + try { + nthread = std::stoi(argv[arg_idx]); + } + catch (const std::exception & e) { + fprintf(stderr, "%s: invalid nthread '%s' (%s)\n", __func__, argv[arg_idx], e.what()); return 1; } - ftype = it->second; } else { - ftype = (enum llama_ftype)atoi(argv[3]); + nthread = 0; } fprintf(stderr, "%s: build = %d (%s)\n", __func__, BUILD_NUMBER, BUILD_COMMIT); - int nthread = argc > 4 ? 
atoi(argv[4]) : 0; + fprintf(stderr, "%s: quantizing '%s' to '%s' as %s", __func__, fname_inp.c_str(), fname_out.c_str(), ftype_str.c_str()); + if (nthread > 0) { + fprintf(stderr, " using %d threads", nthread); + } + fprintf(stderr, "\n"); const int64_t t_main_start_us = ggml_time_us(); From a90e96b266873ebb5e947c9864b12193bdada0fb Mon Sep 17 00:00:00 2001 From: Benjamin Lecaillon <84293038+blecaillon@users.noreply.github.com> Date: Fri, 5 May 2023 02:17:07 +0200 Subject: [PATCH 10/11] Convert.py @staticmethod (#1327) * Line 698 has one #staticmethod and should not otherwise throw error at unpickle.load() as not callable * Update convert.py --------- Co-authored-by: Ivan Stepanov --- convert.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/convert.py b/convert.py index c817a343e..126beaabc 100644 --- a/convert.py +++ b/convert.py @@ -695,7 +695,7 @@ class LazyUnpickler(pickle.Unpickler): description = f'storage data_type={data_type} path-in-zip={filename} path={self.zip_file.filename}' return LazyStorage(load=load, kind=pid[1], description=description) - @staticmethod + # @staticmethod def lazy_rebuild_tensor_v2(storage: Any, storage_offset: Any, size: Any, stride: Any, # pyright: ignore[reportSelfClsParameterName] requires_grad: Any, backward_hooks: Any, metadata: Any = None) -> LazyTensor: assert isinstance(storage, LazyStorage) @@ -706,7 +706,7 @@ class LazyUnpickler(pickle.Unpickler): description = f'pickled storage_offset={storage_offset} in {storage.description}' return LazyTensor(load, list(size), storage.kind.data_type, description) - @staticmethod + # @staticmethod def rebuild_from_type_v2(func, new_type, args, state): return func(*args) From 2d13786e91ec9fd28ddf737053822042a824da78 Mon Sep 17 00:00:00 2001 From: Ionoclast Laboratories Date: Fri, 5 May 2023 08:18:21 -0400 Subject: [PATCH 11/11] Fix for OpenCL / clbast builds on macOS. (#1329) --- Makefile | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/Makefile b/Makefile index 94acefdde..260b2487f 100644 --- a/Makefile +++ b/Makefile @@ -121,7 +121,12 @@ ggml-cuda.o: ggml-cuda.cu ggml-cuda.h endif ifdef LLAMA_CLBLAST CFLAGS += -DGGML_USE_CLBLAST - LDFLAGS += -lclblast -lOpenCL + # Mac provides OpenCL as a framework + ifeq ($(UNAME_S),Darwin) + LDFLAGS += -lclblast -framework OpenCL + else + LDFLAGS += -lclblast -lOpenCL + endif OBJS += ggml-opencl.o ggml-opencl.o: ggml-opencl.c ggml-opencl.h $(CC) $(CFLAGS) -c $< -o $@
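
For context, a hedged sketch of the macOS build this Makefile change targets (it assumes CLBlast is installed separately, e.g. via a package manager; that prerequisite is an assumption, not part of the patch):

```bash
# Build with the CLBlast backend enabled; on Darwin the patched Makefile
# links OpenCL as a framework instead of using -lOpenCL.
make clean
make LLAMA_CLBLAST=1
```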