diff --git a/README.md b/README.md
index 432347df4..324d49f07 100644
--- a/README.md
+++ b/README.md
@@ -225,7 +225,7 @@ As the models are currently fully loaded into memory, you will need adequate dis
 ### Interactive mode
-If you want a more ChatGPT-like experience, you can run in interactive mode by passing `-i` as a parameter.
+If you want a more ChatGPT-like experience, you can run in interactive mode by passing `-i` as a parameter.
 In this mode, you can always interrupt generation by pressing Ctrl+C and entering one or more lines of text, which will be converted into tokens and appended to the current context. You can also specify a *reverse prompt* with the parameter `-r "reverse prompt string"`. This will result in user input being prompted whenever the exact tokens of the reverse prompt string are encountered in the generation. A typical use is to use a prompt that makes LLaMa emulate a chat between multiple users, say Alice and Bob, and pass `-r "Alice:"`.
 Here is an example of a few-shot interaction, invoked with the command
@@ -290,7 +290,7 @@ convert the model from the old format to the new format with [./migrate-ggml-202
 ### Obtaining and verifying the Facebook LLaMA original model and Stanford Alpaca model data
-- **Under no circumstances should IPFS, magnet links, or any other links to model downloads be shared anywhere in this repository, including in issues, discussions, or pull requests. They will be immediately deleted.**
+- **Under no circumstances should IPFS, magnet links, or any other links to model downloads be shared anywhere in this repository, including in issues, discussions, or pull requests. They will be immediately deleted.**
 - The LLaMA models are officially distributed by Facebook and will **never** be provided through this repository.
 - Refer to [Facebook's LLaMA repository](https://github.com/facebookresearch/llama/pull/73/files) if you need to request access to the model data.
 - Please verify the [sha256 checksums](SHA256SUMS) of all downloaded model files to confirm that you have the correct model data files before creating an issue relating to your model files.
@@ -302,14 +302,14 @@ convert the model from the old format to the new format with [./migrate-ggml-202
 `shasum -a 256 --ignore-missing -c SHA256SUMS` on macOS
-- If your issue is with model generation quality, then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT:
-- LLaMA:
-- [Introducing LLaMA: A foundational, 65-billion-parameter large language model](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/)
-- [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
-- GPT-3
-- [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)
-- GPT-3.5 / InstructGPT / ChatGPT:
-- [Aligning language models to follow instructions](https://openai.com/research/instruction-following)
+- If your issue is with model generation quality, then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT:
+- LLaMA:
+- [Introducing LLaMA: A foundational, 65-billion-parameter large language model](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/)
+- [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
+- GPT-3
+- [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)
+- GPT-3.5 / InstructGPT / ChatGPT:
+- [Aligning language models to follow instructions](https://openai.com/research/instruction-following)
 - [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)
 ### Perplexity (measuring model quality)
@@ -318,58 +318,58 @@ You can use the `perplexity` example to measure perplexity over the given prompt
 #### Latest measurements
-The latest perplexity scores for the various model sizes and quantizations are being tracked in [discussion #406](https://github.com/ggerganov/llama.cpp/discussions/406). `llama.cpp` is measuring very well compared to the baseline implementations. Quantization has a small negative impact on quality, but, as you can see, running
+The latest perplexity scores for the various model sizes and quantizations are being tracked in [discussion #406](https://github.com/ggerganov/llama.cpp/discussions/406). `llama.cpp` is measuring very well compared to the baseline implementations. Quantization has a small negative impact on quality, but, as you can see, running
 13B at q4_0 beats the 7B f16 model by a significant amount.
-All measurements are done against the wikitext2 test dataset (https://paperswithcode.com/dataset/wikitext-2), with default options (512 length context).
-Note that changing the context length will have a significant impact on perplexity (longer context = better perplexity).
-```
-Perplexity - model options
-5.5985 - 13B, q4_0
-5.9565 - 7B, f16
-6.3001 - 7B, q4_1
-6.5949 - 7B, q4_0
-6.5995 - 7B, q4_0, --memory_f16
+All measurements are done against the wikitext2 test dataset (https://paperswithcode.com/dataset/wikitext-2), with default options (512 length context).
+Note that changing the context length will have a significant impact on perplexity (longer context = better perplexity).
+```
+Perplexity - model options
+5.5985 - 13B, q4_0
+5.9565 - 7B, f16
+6.3001 - 7B, q4_1
+6.5949 - 7B, q4_0
+6.5995 - 7B, q4_0, --memory_f16
 ```
 #### How to run
-1. Download/extract: https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip?ref=salesforce-research
-2. Run `./perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw`
-3. Output:
-```
-perplexity : calculating perplexity over 655 chunks
-24.43 seconds per pass - ETA 4.45 hours
-[1]4.5970,[2]5.1807,[3]6.0382,...
-```
+1. Download/extract: https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip?ref=salesforce-research
+2. Run `./perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw`
+3. Output:
+```
+perplexity : calculating perplexity over 655 chunks
+24.43 seconds per pass - ETA 4.45 hours
+[1]4.5970,[2]5.1807,[3]6.0382,...
+```
 And after 4.45 hours, you will have the final perplexity.
 ### Android
-You can easily run `llama.cpp` on Android device with [termux](https://termux.dev/).
-First, obtain the [Android NDK](https://developer.android.com/ndk) and then build with CMake:
-```
-$ mkdir build-android
-$ cd build-android
-$ export NDK=
-$ cmake -DCMAKE_TOOLCHAIN_FILE=$NDK/build/cmake/android.toolchain.cmake -DANDROID_ABI=arm64-v8a -DANDROID_PLATFORM=android-23 -DCMAKE_C_FLAGS=-march=armv8.4a+dotprod ..
-$ make
-```
-Install [termux](https://termux.dev/) on your device and run `termux-setup-storage` to get access to your SD card.
+You can easily run `llama.cpp` on Android device with [termux](https://termux.dev/).
+First, obtain the [Android NDK](https://developer.android.com/ndk) and then build with CMake:
+```
+$ mkdir build-android
+$ cd build-android
+$ export NDK=
+$ cmake -DCMAKE_TOOLCHAIN_FILE=$NDK/build/cmake/android.toolchain.cmake -DANDROID_ABI=arm64-v8a -DANDROID_PLATFORM=android-23 -DCMAKE_C_FLAGS=-march=armv8.4a+dotprod ..
+$ make
+```
+Install [termux](https://termux.dev/) on your device and run `termux-setup-storage` to get access to your SD card.
 Finally, copy the `llama` binary and the model files to your device storage. Here is a demo of an interactive session running on Pixel 5 phone:
 https://user-images.githubusercontent.com/271616/225014776-1d567049-ad71-4ef2-b050-55b0b3b9274c.mp4
 ### Docker
-#### Prerequisites
-* Docker must be installed and running on your system.
+#### Prerequisites
+* Docker must be installed and running on your system.
 * Create a folder to store big models & intermediate files (ex. /llama/models)
-#### Images
+#### Images
 We have two Docker images available for this project:
-1. `ghcr.io/ggerganov/llama.cpp:full`: This image includes both the main executable file and the tools to convert LLaMA models into ggml and convert into 4-bit quantization.
+1. `ghcr.io/ggerganov/llama.cpp:full`: This image includes both the main executable file and the tools to convert LLaMA models into ggml and convert into 4-bit quantization.
 2. `ghcr.io/ggerganov/llama.cpp:light`: This image only includes the main executable file.
 #### Usage
@@ -378,38 +378,38 @@ The easiest way to download the models, convert them to ggml and optimize them i
 Replace `/path/to/models` below with the actual path where you downloaded the models.
-```bash
-docker run -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:full --all-in-one "/models/" 7B
+```bash
+docker run -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:full --all-in-one "/models/" 7B
 ```
 On completion, you are ready to play!
-```bash
-docker run -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:full --run -m /models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -n 512
+```bash
+docker run -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:full --run -m /models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -n 512
 ```
 or with a light image:
-```bash
-docker run -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:light -m /models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -n 512
+```bash
+docker run -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:light -m /models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -n 512
 ```
 ### Contributing
-- Contributors can open PRs
-- Collaborators can push to branches in the `llama.cpp` repo and merge PRs into the `master` branch
-- Collaborators will be invited based on contributions
-- Any help with managing issues and PRs is very appreciated!
-- Make sure to read this: [Inference at the edge](https://github.com/ggerganov/llama.cpp/discussions/205)
+- Contributors can open PRs
+- Collaborators can push to branches in the `llama.cpp` repo and merge PRs into the `master` branch
+- Collaborators will be invited based on contributions
+- Any help with managing issues and PRs is very appreciated!
+- Make sure to read this: [Inference at the edge](https://github.com/ggerganov/llama.cpp/discussions/205)
 - A bit of backstory for those who are interested: [Changelog podcast](https://changelog.com/podcast/532)
 ### Coding guidelines
-- Avoid adding third-party dependencies, extra files, extra headers, etc.
-- Always consider cross-compatibility with other operating systems and architectures
-- Avoid fancy looking modern STL constructs, use basic `for` loops, avoid templates, keep it simple
-- There are no strict rules for the code style, but try to follow the patterns in the code (indentation, spaces, etc.). Vertical alignment makes things more readable and easier to batch edit
-- Clean-up any trailing whitespaces, use 4 spaces for indentation, brackets on the same line, `void * ptr`, `int & a`
+- Avoid adding third-party dependencies, extra files, extra headers, etc.
+- Always consider cross-compatibility with other operating systems and architectures
+- Avoid fancy looking modern STL constructs, use basic `for` loops, avoid templates, keep it simple
+- There are no strict rules for the code style, but try to follow the patterns in the code (indentation, spaces, etc.). Vertical alignment makes things more readable and easier to batch edit
+- Clean-up any trailing whitespaces, use 4 spaces for indentation, brackets on the same line, `void * ptr`, `int & a`
 - See [good first issues](https://github.com/ggerganov/llama.cpp/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) for tasks suitable for first contributions
 ### Docs