readme : update

Georgi Gerganov 2024-02-05 16:34:08 +02:00
parent 05c9cd81a9
commit 853b6b980d


@@ -6,7 +6,7 @@

 [Roadmap](https://github.com/users/ggerganov/projects/7) / [Project status](https://github.com/ggerganov/llama.cpp/discussions/3471) / [Manifesto](https://github.com/ggerganov/llama.cpp/discussions/205) / [ggml](https://github.com/ggerganov/ggml)

-Inference of Meta's [LLaMA](https://arxiv.org/abs/2302.13971) model (and others) in pure C/C++.
+Inference of Meta's [LLaMA](https://arxiv.org/abs/2302.13971) model (and others) in pure C/C++

 ### Hot topics
@@ -58,23 +58,20 @@ Inference of Meta's [LLaMA](https://arxiv.org/abs/2302.13971) model (and others)

 ## Description

-The goal of `llama.cpp` is to run large language models such as Meta's LLaMA model
-with minimal setup and state-of-the-art performance on a wide variety of hardware.
-Its selling points are:
+The main goal of `llama.cpp` is to enable LLM inference with minimal setup and state-of-the-art performance on a wide
+variety of hardware - locally and in the cloud.

-- Plain C/C++ implementation without any dependencies. AVX, AVX2, and AVX512 support on x86 architectures.
-- First-class Apple silicon support - optimized via ARM NEON, Accelerate and Metal frameworks.
-- Custom CUDA kernels for running LLMs on NVIDIA GPUs. Can be run on AMD GPUs via HIP.
-- Support for Vulkan, SYCL, and OpenCL.
-- 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization support for faster inference and reduced memory use.
-- CPU+GPU hybrid inference to partially accelerate models larger than total VRAM capacity.
-- Fast C++ implementations for a variety of samplers: top-k, top-p, typical free sampling, min-p, Mirostat, temperature.
-- Can be used as a library, from the command line via one of the examples, or via an HTTP web server.
-- Analysis tools that e.g. provide metrics such as perplexity or [KL divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence) to judge the precision loss from quantization.
+- Plain C/C++ implementation without any dependencies
+- Apple silicon is a first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks
+- AVX, AVX2 and AVX512 support for x86 architectures
+- 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use
+- Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP)
+- Vulkan, SYCL, and (partial) OpenCL backend support
+- CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity

-The original implementation of `llama.cpp` was [hacked in an evening](https://github.com/ggerganov/llama.cpp/issues/33#issuecomment-1465108022).
-Since then, the project has improved significantly thanks to many contributions.
-This project serves as the main playground for the more general [ggml](https://github.com/ggerganov/ggml) machine learning library.
+Since its [inception](https://github.com/ggerganov/llama.cpp/issues/33#issuecomment-1465108022), the project has
+improved significantly thanks to many contributions. It is the main playground for developing new features for the
+[ggml](https://github.com/ggerganov/ggml) library.

 **Supported platforms:**
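The "2-bit ... 8-bit integer quantization" bullet in the diff above refers to storing each block of weights as a shared scale plus small integers. A toy Python sketch of the Q4_0-style idea follows; it is an illustration only, not llama.cpp's actual C/C++ kernels, block sizes, or bit-packing:

```python
def quantize_q4_0(block):
    """Toy blockwise 4-bit quantization: one shared scale per block,
    each weight stored as an integer in [-8, 7]."""
    amax = max(abs(x) for x in block)
    scale = amax / 7.0 if amax > 0 else 1.0
    q = [max(-8, min(7, round(x / scale))) for x in block]
    return scale, q

def dequantize(scale, q):
    """Recover approximate fp weights from the scale and 4-bit integers."""
    return [scale * v for v in q]

# One small block of fp32 weights (real blocks hold 32 weights).
block = [0.12, -0.5, 0.33, 0.9, -0.7, 0.05, -0.28, 0.61]
scale, q = quantize_q4_0(block)
restored = dequantize(scale, q)

# Rounding error is at most half the scale per weight.
max_err = max(abs(a - b) for a, b in zip(block, restored))
```

This shows why quantization cuts memory: a 32-weight fp32 block takes 128 bytes, while 32 four-bit integers plus one fp16 scale take about 18 bytes, at the cost of a bounded rounding error per weight.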
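The "CPU+GPU hybrid inference" bullet works by offloading only as many whole transformer layers as fit in VRAM (exposed in llama.cpp as `-ngl` / `--n-gpu-layers`) and running the rest on the CPU. A hypothetical back-of-the-envelope planner, not actual llama.cpp code:

```python
def plan_offload(n_layers, layer_bytes, vram_bytes):
    """Split layers between GPU and CPU: offload as many whole layers
    as fit in the VRAM budget; the remainder stays on the CPU."""
    n_gpu = min(n_layers, vram_bytes // layer_bytes)
    return n_gpu, n_layers - n_gpu

# Hypothetical numbers: a 32-layer model quantized to ~210 MiB per layer,
# on a GPU with 6 GiB of usable VRAM.
n_gpu, n_cpu = plan_offload(32, 210 * 1024**2, 6 * 1024**3)
```

With enough VRAM the split degenerates to full offload (all layers on the GPU), which is why the same code path covers both hybrid and pure-GPU inference.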