readme : more examples
This commit is contained in:
parent
fad62ea17a
commit
ec1f6079f9
1 changed files with 151 additions and 146 deletions
297
README.md
297
README.md
|
@ -42,9 +42,9 @@ The `llama.cpp` project is the main playground for developing new features for t
|
|||
|
||||
Typically finetunes of the base models below are supported as well.
|
||||
|
||||
Instructions for adding support for new models: [HOWTO-add-model.md](./docs/development/HOWTO-add-model.md)
|
||||
Instructions for adding support for new models: [HOWTO-add-model.md](docs/development/HOWTO-add-model.md)
|
||||
|
||||
**Text-only:**
|
||||
#### Text-only
|
||||
|
||||
- [X] LLaMA 🦙
|
||||
- [x] LLaMA 2 🦙🦙
|
||||
|
@ -99,7 +99,7 @@ Instructions for adding support for new models: [HOWTO-add-model.md](./docs/deve
|
|||
- [x] [Bielik-11B-v2.3](https://huggingface.co/collections/speakleash/bielik-11b-v23-66ee813238d9b526a072408a)
|
||||
- [x] [RWKV-6](https://github.com/BlinkDL/RWKV-LM)
|
||||
|
||||
**Multimodal:**
|
||||
#### Multimodal
|
||||
|
||||
- [x] [LLaVA 1.5 models](https://huggingface.co/collections/liuhaotian/llava-15-653aac15d994e992e2677a7e), [LLaVA 1.6 models](https://huggingface.co/collections/liuhaotian/llava-16-65b9e40155f60fd046a5ccf2)
|
||||
- [x] [BakLLaVA](https://huggingface.co/models?search=SkunkworksAI/Bakllava)
|
||||
|
@ -213,24 +213,24 @@ Instructions for adding support for new models: [HOWTO-add-model.md](./docs/deve
|
|||
|
||||
| Backend | Target devices |
|
||||
| --- | --- |
|
||||
| [Metal](./docs/build.md#metal-build) | Apple Silicon |
|
||||
| [BLAS](./docs/build.md#blas-build) | All |
|
||||
| [BLIS](./docs/backend/BLIS.md) | All |
|
||||
| [SYCL](./docs/backend/SYCL.md) | Intel and Nvidia GPU |
|
||||
| [MUSA](./docs/build.md#musa) | Moore Threads MTT GPU |
|
||||
| [CUDA](./docs/build.md#cuda) | Nvidia GPU |
|
||||
| [hipBLAS](./docs/build.md#hipblas) | AMD GPU |
|
||||
| [Vulkan](./docs/build.md#vulkan) | GPU |
|
||||
| [CANN](./docs/build.md#cann) | Ascend NPU |
|
||||
| [Metal](docs/build.md#metal-build) | Apple Silicon |
|
||||
| [BLAS](docs/build.md#blas-build) | All |
|
||||
| [BLIS](docs/backend/BLIS.md) | All |
|
||||
| [SYCL](docs/backend/SYCL.md) | Intel and Nvidia GPU |
|
||||
| [MUSA](docs/build.md#musa) | Moore Threads MTT GPU |
|
||||
| [CUDA](docs/build.md#cuda) | Nvidia GPU |
|
||||
| [hipBLAS](docs/build.md#hipblas) | AMD GPU |
|
||||
| [Vulkan](docs/build.md#vulkan) | GPU |
|
||||
| [CANN](docs/build.md#cann) | Ascend NPU |
|
||||
|
||||
## Building the project
|
||||
|
||||
The main product of this project is the `llama` library. Its C-style interface can be found in [include/llama.h](include/llama.h).
|
||||
The project also includes many example programs and tools using the `llama` library. The examples range from simple, minimal code snippets to sophisticated sub-projects such as an OpenAI-compatible HTTP server. Possible methods for obtaining the binaries:
|
||||
|
||||
- Clone this repository and build locally, see [how to build](./docs/build.md)
|
||||
- On MacOS or Linux, install `llama.cpp` via [brew, flox or nix](./docs/install.md)
|
||||
- Use a Docker image, see [documentation for Docker](./docs/docker.md)
|
||||
- Clone this repository and build locally, see [how to build](docs/build.md)
|
||||
- On MacOS or Linux, install `llama.cpp` via [brew, flox or nix](docs/install.md)
|
||||
- Use a Docker image, see [documentation for Docker](docs/docker.md)
|
||||
- Download pre-built binaries from [releases](https://github.com/ggerganov/llama.cpp/releases)
|
||||
|
||||
## Obtaining and quantizing models
|
||||
|
@ -251,201 +251,203 @@ The Hugging Face platform provides a variety of online tools for converting, qua
|
|||
- Use the [GGUF-editor space](https://huggingface.co/spaces/CISCai/gguf-editor) to edit GGUF meta data in the browser (more info: https://github.com/ggerganov/llama.cpp/discussions/9268)
|
||||
- Use the [Inference Endpoints](https://ui.endpoints.huggingface.co/) to directly host `llama.cpp` in the cloud (more info: https://github.com/ggerganov/llama.cpp/discussions/9669)
|
||||
|
||||
To learn more about model quantization, [read this documentation](./examples/quantize/README.md)
|
||||
To learn more about model quantization, [read this documentation](examples/quantize/README.md)
|
||||
|
||||
## Sample usage
|
||||
## [`llama-cli`](examples/main)
|
||||
|
||||
### [`llama-cli`](./examples/main)
|
||||
#### A CLI tool for accessing and experimenting with most of `llama.cpp`'s functionality.
|
||||
|
||||
A CLI tool for accessing and experimenting with most of the available functionality in the terminal.
|
||||
|
||||
<details open>
|
||||
- <details open>
|
||||
<summary>Run simple text completion</summary>
|
||||
|
||||
```bash
|
||||
llama-cli -m model.gguf -p "I believe the meaning of life is" -n 128
|
||||
```bash
|
||||
llama-cli -m model.gguf -p "I believe the meaning of life is" -n 128
|
||||
|
||||
# Output:
|
||||
# I believe the meaning of life is to find your own truth and to live in accordance with it. For me, this means being true to myself and following my passions, even if they don't align with societal expectations. I think that's what I love about yoga – it's not just a physical practice, but a spiritual one too. It's about connecting with yourself, listening to your inner voice, and honoring your own unique journey.
|
||||
```
|
||||
# I believe the meaning of life is to find your own truth and to live in accordance with it. For me, this means being true to myself and following my passions, even if they don't align with societal expectations. I think that's what I love about yoga – it's not just a physical practice, but a spiritual one too. It's about connecting with yourself, listening to your inner voice, and honoring your own unique journey.
|
||||
```
|
||||
|
||||
---
|
||||
</details>
|
||||
</details>
|
||||
|
||||
<details>
|
||||
- <details>
|
||||
<summary>Run in conversation mode</summary>
|
||||
|
||||
```bash
|
||||
llama-cli -m model.gguf -p "You are a helpful assistant" -cnv
|
||||
```bash
|
||||
llama-cli -m model.gguf -p "You are a helpful assistant" -cnv
|
||||
|
||||
# Output:
|
||||
# > hi, who are you?
|
||||
# Hi there! I'm your helpful assistant! I'm an AI-powered chatbot designed to assist and provide information to users like you. I'm here to help answer your questions, provide guidance, and offer support on a wide range of topics. I'm a friendly and knowledgeable AI, and I'm always happy to help with anything you need. What's on your mind, and how can I assist you today?
|
||||
#
|
||||
# > what is 1+1?
|
||||
# Easy peasy! The answer to 1+1 is... 2!
|
||||
```
|
||||
# > hi, who are you?
|
||||
# Hi there! I'm your helpful assistant! I'm an AI-powered chatbot designed to assist and provide information to users like you. I'm here to help answer your questions, provide guidance, and offer support on a wide range of topics. I'm a friendly and knowledgeable AI, and I'm always happy to help with anything you need. What's on your mind, and how can I assist you today?
|
||||
#
|
||||
# > what is 1+1?
|
||||
# Easy peasy! The answer to 1+1 is... 2!
|
||||
```
|
||||
|
||||
---
|
||||
</details>
|
||||
</details>
|
||||
|
||||
<details>
|
||||
- <details>
|
||||
<summary>Run with custom chat template</summary>
|
||||
|
||||
[Supported templates](https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template)
|
||||
```bash
|
||||
# use the "chatml" template
|
||||
llama-cli -m model.gguf -p "You are a helpful assistant" -cnv --chat-template chatml
|
||||
|
||||
```bash
|
||||
# use the "chatml" template
|
||||
llama-cli -m model.gguf -p "You are a helpful assistant" -cnv --chat-template chatml
|
||||
# use a custom template
|
||||
llama-cli -m model.gguf -p "You are a helpful assistant" -cnv --in-prefix 'User: ' --reverse-prompt 'User:'
|
||||
```
|
||||
|
||||
# use a custom template
|
||||
llama-cli -m model.gguf -p "You are a helpful assistant" -cnv --in-prefix 'User: ' --reverse-prompt 'User:'
|
||||
```
|
||||
[Supported templates](https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template)
|
||||
|
||||
---
|
||||
</details>
|
||||
</details>
|
||||
|
||||
<details>
|
||||
- <details>
|
||||
<summary>Constrain the output with a custom grammar</summary>
|
||||
|
||||
```bash
|
||||
llama-cli -m model.gguf -n 256 --grammar-file grammars/json.gbnf -p 'Request: schedule a call at 8pm; Command:'
|
||||
```bash
|
||||
llama-cli -m model.gguf -n 256 --grammar-file grammars/json.gbnf -p 'Request: schedule a call at 8pm; Command:'
|
||||
|
||||
# Output:
|
||||
# {"appointmentTime": "8pm", "appointmentDetails": "schedule a a call"}
|
||||
```
|
||||
# {"appointmentTime": "8pm", "appointmentDetails": "schedule a a call"}
|
||||
```
|
||||
|
||||
The `grammars/` folder contains a handful of sample grammars. To write your own, check out the [GBNF Guide](./grammars/README.md).
|
||||
The [grammars/](grammars/) folder contains a handful of sample grammars. To write your own, check out the [GBNF Guide](grammars/README.md).
|
||||
|
||||
For authoring more complex JSON grammars, check out https://grammar.intrinsiclabs.ai/
|
||||
For authoring more complex JSON grammars, check out https://grammar.intrinsiclabs.ai/
|
||||
|
||||
---
|
||||
</details>
|
||||
</details>
|
||||
|
||||
|
||||
### [`llama-server`](./examples/server)
|
||||
## [`llama-server`](examples/server)
|
||||
|
||||
A lightweight, [OpenAI API](https://github.com/openai/openai-openapi) compatible, HTTP server for serving LLMs.
|
||||
#### A lightweight, [OpenAI API](https://github.com/openai/openai-openapi) compatible, HTTP server for serving LLMs.
|
||||
|
||||
<details open>
|
||||
- <details open>
|
||||
<summary>Start a local HTTP server with default configuration on port 8080</summary>
|
||||
|
||||
```bash
|
||||
llama-server -m model.gguf --port 8080
|
||||
```bash
|
||||
llama-server -m model.gguf --port 8080
|
||||
|
||||
# Basic web UI can be accessed via browser: http://localhost:8080
|
||||
# Chat completion endpoint: http://localhost:8080/v1/chat/completions
|
||||
```
|
||||
# Basic web UI can be accessed via browser: http://localhost:8080
|
||||
# Chat completion endpoint: http://localhost:8080/v1/chat/completions
|
||||
```
|
||||
|
||||
---
|
||||
</details>
|
||||
</details>
|
||||
|
||||
<details>
|
||||
- <details>
|
||||
<summary>Support multiple-users and parallel decoding</summary>
|
||||
|
||||
```bash
|
||||
# TODO
|
||||
```
|
||||
```bash
|
||||
# up to 4 concurrent requests, each with 4096 max context
|
||||
llama-server -m model.gguf -c 16384 -np 4
|
||||
```
|
||||
|
||||
---
|
||||
</details>
|
||||
</details>
|
||||
|
||||
<details>
|
||||
- <details>
|
||||
<summary>Enable speculative decoding</summary>
|
||||
|
||||
```bash
|
||||
# TODO
|
||||
```
|
||||
```bash
|
||||
# the draft.gguf model should be a small variant of the target model.gguf
|
||||
llama-server -m model.gguf -md draft.gguf
|
||||
```
|
||||
|
||||
---
|
||||
</details>
|
||||
</details>
|
||||
|
||||
<details>
|
||||
- <details>
|
||||
<summary>Serve an embedding model</summary>
|
||||
|
||||
```bash
|
||||
# TODO
|
||||
```
|
||||
```bash
|
||||
# use the /embedding endpoint
|
||||
llama-server -m model.gguf --embedding --pooling cls -ub 8192
|
||||
```
|
||||
|
||||
---
|
||||
</details>
|
||||
</details>
|
||||
|
||||
- <details>
|
||||
<summary>Serve a reranking model</summary>
|
||||
|
||||
```bash
|
||||
# use the /reranking endpoint
|
||||
llama-server -m model.gguf --reranking
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
- <details>
|
||||
<summary>Constrain all outputs with a grammar</summary>
|
||||
|
||||
```bash
|
||||
# custom grammar
|
||||
llama-server -m model.gguf --grammar-file grammar.gbnf
|
||||
|
||||
# JSON
|
||||
llama-server -m model.gguf --grammar-file grammars/json.gbnf
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
|
||||
### [`llama-perplexity`](./examples/perplexity)
|
||||
## [`llama-perplexity`](examples/perplexity)
|
||||
|
||||
The `llama-perplexity` tool can be used to measure perplexity (and other quality metrics) of a model over a given text.
|
||||
#### A tool for measuring the perplexity [^1][^2] (and other quality metrics) of a model over a given text.
|
||||
|
||||
For more information, see:
|
||||
- [./examples/perplexity/README.md](./examples/perplexity/README.md)
|
||||
- [https://huggingface.co/docs/transformers/perplexity](https://huggingface.co/docs/transformers/perplexity)
|
||||
|
||||
<details open>
|
||||
- <details open>
|
||||
<summary>Measure the perplexity over a text file</summary>
|
||||
|
||||
```bash
|
||||
llama-perplexity -m model.gguf -f file.txt
|
||||
```bash
|
||||
llama-perplexity -m model.gguf -f file.txt
|
||||
|
||||
# Output:
|
||||
# [1]15.2701,[2]5.4007,[3]5.3073,[4]6.2965,[5]5.8940,[6]5.6096,[7]5.7942,[8]4.9297,[9]4.8358,[10]4.5653,[11]4.4790,[12]4.1959, ...
|
||||
# Final estimate: PPL = 5.4007 +/- 0.67339
|
||||
```
|
||||
# [1]15.2701,[2]5.4007,[3]5.3073,[4]6.2965,[5]5.8940,[6]5.6096,[7]5.7942,[8]4.9297, ...
|
||||
# Final estimate: PPL = 5.4007 +/- 0.67339
|
||||
```
|
||||
|
||||
---
|
||||
</details>
|
||||
</details>
|
||||
|
||||
<details>
|
||||
- <details>
|
||||
<summary>Measure KL divergence</summary>
|
||||
|
||||
```bash
|
||||
# TODO
|
||||
```
|
||||
```bash
|
||||
# TODO
|
||||
```
|
||||
|
||||
---
|
||||
</details>
|
||||
</details>
|
||||
|
||||
[^1]: [examples/perplexity/README.md](examples/perplexity/README.md)
|
||||
[^2]: [https://huggingface.co/docs/transformers/perplexity](https://huggingface.co/docs/transformers/perplexity)
|
||||
|
||||
### [`llama-bench`](./example/bench)
|
||||
## [`llama-bench`](example/bench)
|
||||
|
||||
Benchmark the performance of the inference for various parameters.
|
||||
#### Benchmark the performance of the inference for various parameters.
|
||||
|
||||
<details>
|
||||
- <details open>
|
||||
<summary>Run default benchmark</summary>
|
||||
|
||||
```bash
|
||||
llama-bench -m model.gguf
|
||||
```bash
|
||||
llama-bench -m model.gguf
|
||||
|
||||
# Output:
|
||||
# | model | size | params | backend | threads | test | t/s |
|
||||
# | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
|
||||
# | qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | Metal,BLAS | 16 | pp512 | 5765.41 ± 20.55 |
|
||||
# | qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | Metal,BLAS | 16 | tg128 | 197.71 ± 0.81 |
|
||||
#
|
||||
# build: 3e0ba0e60 (4229)
|
||||
```
|
||||
# Output:
|
||||
# | model | size | params | backend | threads | test | t/s |
|
||||
# | ------------------- | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
|
||||
# | qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | Metal,BLAS | 16 | pp512 | 5765.41 ± 20.55 |
|
||||
# | qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | Metal,BLAS | 16 | tg128 | 197.71 ± 0.81 |
|
||||
#
|
||||
# build: 3e0ba0e60 (4229)
|
||||
```
|
||||
|
||||
---
|
||||
</details>
|
||||
</details>
|
||||
|
||||
|
||||
### [`llama-simple`](./examples/simple)
|
||||
## [`llama-simple`](examples/simple)
|
||||
|
||||
A minimal example for implementing apps using the `llama` library. Useful for developers.
|
||||
#### A minimal example for implementing apps with `llama.cpp`. Useful for developers.
|
||||
|
||||
<details>
|
||||
- <details>
|
||||
<summary>Basic text completion</summary>
|
||||
|
||||
```bash
|
||||
llama-simple -m model.gguf
|
||||
```bash
|
||||
llama-simple -m model.gguf
|
||||
|
||||
# Output:
|
||||
# Hello my name is Kaitlyn and I am a 16 year old girl. I am a junior in high school and I am currently taking a class called "The Art of
|
||||
```
|
||||
# Hello my name is Kaitlyn and I am a 16 year old girl. I am a junior in high school and I am currently taking a class called "The Art of
|
||||
```
|
||||
|
||||
---
|
||||
</details>
|
||||
|
||||
|
||||
### [`llama-embedding`](./example/embedding)
|
||||
</details>
|
||||
|
||||
|
||||
## Contributing
|
||||
|
@ -461,19 +463,19 @@ llama-simple -m model.gguf
|
|||
|
||||
## Other documentation
|
||||
|
||||
- [main (cli)](./examples/main/README.md)
|
||||
- [server](./examples/server/README.md)
|
||||
- [GBNF grammars](./grammars/README.md)
|
||||
- [main (cli)](examples/main/README.md)
|
||||
- [server](examples/server/README.md)
|
||||
- [GBNF grammars](grammars/README.md)
|
||||
|
||||
**Development documentation**
|
||||
#### Development documentation
|
||||
|
||||
- [How to build](./docs/build.md)
|
||||
- [Running on Docker](./docs/docker.md)
|
||||
- [Build on Android](./docs/android.md)
|
||||
- [Performance troubleshooting](./docs/development/token_generation_performance_tips.md)
|
||||
- [How to build](docs/build.md)
|
||||
- [Running on Docker](docs/docker.md)
|
||||
- [Build on Android](docs/android.md)
|
||||
- [Performance troubleshooting](docs/development/token_generation_performance_tips.md)
|
||||
- [GGML tips & tricks](https://github.com/ggerganov/llama.cpp/wiki/GGML-Tips-&-Tricks)
|
||||
|
||||
**Seminal papers and background on the models**
|
||||
#### Seminal papers and background on the models
|
||||
|
||||
If your issue is with model generation quality, then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT:
|
||||
- LLaMA:
|
||||
|
@ -484,3 +486,6 @@ If your issue is with model generation quality, then please at least scan the fo
|
|||
- GPT-3.5 / InstructGPT / ChatGPT:
|
||||
- [Aligning language models to follow instructions](https://openai.com/research/instruction-following)
|
||||
- [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)
|
||||
|
||||
#### References
|
||||
|
||||
|
|
Loading…
Add table
Add a link
Reference in a new issue