From 853b6b980dad5a3019d55892b2e08e0fd5aeda4f Mon Sep 17 00:00:00 2001
From: Georgi Gerganov
Date: Mon, 5 Feb 2024 16:34:08 +0200
Subject: [PATCH] readme : update

---
 README.md | 29 +++++++++++++----------------
 1 file changed, 13 insertions(+), 16 deletions(-)

diff --git a/README.md b/README.md
index 50e4178d2..bb6c49338 100644
--- a/README.md
+++ b/README.md
@@ -6,7 +6,7 @@
 
 [Roadmap](https://github.com/users/ggerganov/projects/7) / [Project status](https://github.com/ggerganov/llama.cpp/discussions/3471) / [Manifesto](https://github.com/ggerganov/llama.cpp/discussions/205) / [ggml](https://github.com/ggerganov/ggml)
 
-Inference of Meta's [LLaMA](https://arxiv.org/abs/2302.13971) model (and others) in pure C/C++.
+Inference of Meta's [LLaMA](https://arxiv.org/abs/2302.13971) model (and others) in pure C/C++
 
 ### Hot topics
 
@@ -58,23 +58,20 @@ Inference of Meta's [LLaMA](https://arxiv.org/abs/2302.13971) model (and others)
 
 ## Description
 
-The goal of `llama.cpp` is to run large language models such as Meta's LLaMA model
-with minimal setup and state-of-the-art performance on a wide variety of hardware.
-Its selling points are:
+The main goal of `llama.cpp` is to enable LLM inference with minimal setup and state-of-the-art performance on a wide
+variety of hardware - locally and in the cloud.
 
-- Plain C/C++ implementation without any dependencies. AVX, AVX2, and AVX512 support on x86 architectures.
-- First-class Apple silicon support - optimized via ARM NEON, Accelerate and Metal frameworks.
-- Custom CUDA kernels for running LLMs on NVIDIA GPUs. Can be run on AMD GPUs via HIP.
-- Support for Vulkan, SYCL, and OpenCL.
-- 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization support for faster inference and reduced memory use.
-- CPU+GPU hybrid inference to partially accelerate models larger than total VRAM capacity.
-- Fast C++ implementations for a variety of samplers: top-k, top-p, typical free sampling, min-p, Mirostat, temperature.
-- Can be used as a library, from the command line via one of the examples, or via an HTTP web server.
-- Analysis tools that e.g. provide metrics such as perplexity or [KL divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence) to judge the precision loss from quantization.
+- Plain C/C++ implementation without any dependencies
+- Apple silicon is a first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks
+- AVX, AVX2 and AVX512 support for x86 architectures
+- 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use
+- Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP)
+- Vulkan, SYCL, and (partial) OpenCL backend support
+- CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity
 
-The original implementation of `llama.cpp` was [hacked in an evening](https://github.com/ggerganov/llama.cpp/issues/33#issuecomment-1465108022).
-Since then, the project has improved significantly thanks to many contributions.
-This project serves as the main playground for the more general [ggml](https://github.com/ggerganov/ggml) machine learning library.
+Since its [inception](https://github.com/ggerganov/llama.cpp/issues/33#issuecomment-1465108022), the project has
+improved significantly thanks to many contributions. It is the main playground for developing new features for the
+[ggml](https://github.com/ggerganov/ggml) library.
 
 **Supported platforms:**
 