	Docker
Prerequisites
- Docker must be installed and running on your system.
- Create a folder to store large models and intermediate files (e.g. /llama/models).
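The two prerequisites can be checked from a shell. The following is a minimal sketch; the function names `docker_ready` and `ensure_models_dir` are hypothetical helpers for illustration and are not part of llama.cpp.

```shell
# Hypothetical helpers; the names are illustrative, not part of llama.cpp.

# Succeeds only if the docker CLI is on PATH and the daemon answers.
docker_ready() {
    command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1
}

# Creates the models folder if it does not exist and prints its path.
# Falls back to the example path from the prerequisites list.
ensure_models_dir() {
    dir="${1:-/llama/models}"
    mkdir -p "$dir" && printf '%s\n' "$dir"
}
```

For example, `docker_ready || echo "Docker is not running" >&2` reports a stopped daemon, and `MODELS_DIR="$(ensure_models_dir "$HOME/llama/models")"` prepares a folder that can later be mounted with `-v`.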
Images
We have three Docker images available for this project:
- ghcr.io/ggerganov/llama.cpp:full: This image includes both the main executable file and the tools to convert LLaMA models into ggml and convert into 4-bit quantization. (platforms: linux/amd64, linux/arm64)
- ghcr.io/ggerganov/llama.cpp:light: This image only includes the main executable file. (platforms: linux/amd64, linux/arm64)
- ghcr.io/ggerganov/llama.cpp:server: This image only includes the server executable file. (platforms: linux/amd64, linux/arm64)
Additionally, there are the following images, similar to the above:
- ghcr.io/ggerganov/llama.cpp:full-cuda: Same as full but compiled with CUDA support. (platforms: linux/amd64)
- ghcr.io/ggerganov/llama.cpp:light-cuda: Same as light but compiled with CUDA support. (platforms: linux/amd64)
- ghcr.io/ggerganov/llama.cpp:server-cuda: Same as server but compiled with CUDA support. (platforms: linux/amd64)
- ghcr.io/ggerganov/llama.cpp:full-rocm: Same as full but compiled with ROCm support. (platforms: linux/amd64, linux/arm64)
- ghcr.io/ggerganov/llama.cpp:light-rocm: Same as light but compiled with ROCm support. (platforms: linux/amd64, linux/arm64)
- ghcr.io/ggerganov/llama.cpp:server-rocm: Same as server but compiled with ROCm support. (platforms: linux/amd64, linux/arm64)
- ghcr.io/ggerganov/llama.cpp:full-musa: Same as full but compiled with MUSA support. (platforms: linux/amd64)
- ghcr.io/ggerganov/llama.cpp:light-musa: Same as light but compiled with MUSA support. (platforms: linux/amd64)
- ghcr.io/ggerganov/llama.cpp:server-musa: Same as server but compiled with MUSA support. (platforms: linux/amd64)
The GPU-enabled images are not currently tested by CI beyond being built. They are built exactly as defined by the Dockerfiles in .devops/ and the GitHub Action in .github/workflows/docker.yml, with no variations. If you need different settings (for example, a different CUDA, ROCm or MUSA library), you'll need to build the images locally for now.
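The tags listed above follow a regular pattern: a variant (full, light or server), optionally followed by a backend suffix (cuda, rocm or musa). A small sketch of that naming scheme follows; `llama_image` is a hypothetical helper, not something shipped with the project.

```shell
# Hypothetical helper: prints the image reference for a variant and an
# optional GPU backend, using only the tag names listed above.
llama_image() {
    variant="$1"
    backend="${2:-}"
    case "$variant" in
        full|light|server) ;;
        *) echo "unknown variant: $variant" >&2; return 1 ;;
    esac
    case "$backend" in
        "") tag="$variant" ;;
        cuda|rocm|musa) tag="${variant}-${backend}" ;;
        *) echo "unknown backend: $backend" >&2; return 1 ;;
    esac
    printf 'ghcr.io/ggerganov/llama.cpp:%s\n' "$tag"
}
```

For example, `docker pull "$(llama_image server cuda)"` fetches the CUDA server image. On a host that can run more than one of the listed platforms, `docker pull --platform linux/arm64 "$(llama_image light)"` selects one explicitly.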
Usage
The easiest way to download the models, convert them to ggml and optimize them is with the --all-in-one command, which is included in the full Docker image.
Replace /path/to/models below with the actual path where you downloaded the models.
docker run -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:full --all-in-one "/models/" 7B
On completion, you are ready to play!
docker run -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:full --run -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512
or with a light image:
docker run -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:light -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512
or with a server image:
docker run -v /path/to/models:/models -p 8000:8000 ghcr.io/ggerganov/llama.cpp:server -m /models/7B/ggml-model-q4_0.gguf --port 8000 --host 0.0.0.0 -n 512
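Once the server container is running with port 8000 published, it can be queried over HTTP from the host. The sketch below assumes the server exposes a `POST /completion` endpoint that accepts a JSON body with `prompt` and `n_predict` fields; that comes from the llama.cpp server documentation rather than from this page, so verify it against the version you run. The function names and the `SERVER_URL` variable are hypothetical.

```shell
# Assumption: POST /completion with "prompt" and "n_predict" is taken from
# the llama.cpp server documentation, not from this page.
SERVER_URL="${SERVER_URL:-http://localhost:8000}"

# Prints a JSON request body. The prompt must not contain double quotes
# or backslashes, because no JSON escaping is done here.
completion_payload() {
    printf '{"prompt": "%s", "n_predict": %d}\n' "$1" "${2:-128}"
}

# Sends the request to the port published by the server example above.
request_completion() {
    curl --silent --request POST \
        --url "$SERVER_URL/completion" \
        --header "Content-Type: application/json" \
        --data "$(completion_payload "$@")"
}
```

For example, `request_completion "Building a website can be done in 10 simple steps:" 64` sends one request to the container started above.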
Docker With CUDA
Assuming one has the nvidia-container-toolkit properly installed on Linux, or is using a GPU-enabled cloud, cuBLAS should be accessible inside the container.
Building Docker locally
docker build -t local/llama.cpp:full-cuda --target full -f .devops/cuda.Dockerfile .
docker build -t local/llama.cpp:light-cuda --target light -f .devops/cuda.Dockerfile .
docker build -t local/llama.cpp:server-cuda --target server -f .devops/cuda.Dockerfile .
You may want to pass in some different ARGS, depending on the CUDA environment supported by your container host, as well as the GPU architecture.
The defaults are:
- CUDA_VERSION set to 12.6.0
- CUDA_DOCKER_ARCH set to the cmake build default, which includes all the supported architectures
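These ARGs can be overridden with Docker's `--build-arg` flag. The following is a minimal sketch: `build_cuda_image` and the `DOCKER` dry-run variable are hypothetical, and the values used in the usage note (12.4.0 and 86) are placeholders. The exact format expected by CUDA_DOCKER_ARCH is an assumption here, so check .devops/cuda.Dockerfile before relying on it.

```shell
# DOCKER is a hypothetical override for dry runs (e.g. DOCKER=echo);
# it defaults to the real docker CLI.
DOCKER="${DOCKER:-docker}"

# build_cuda_image TARGET [CUDA_VERSION] [CUDA_DOCKER_ARCH]
# Omitted arguments are not passed, so the Dockerfile defaults apply.
build_cuda_image() {
    target="$1"
    cuda_version="${2:-}"
    cuda_arch="${3:-}"
    set -- build -t "local/llama.cpp:${target}-cuda" --target "$target"
    if [ -n "$cuda_version" ]; then
        set -- "$@" --build-arg "CUDA_VERSION=$cuda_version"
    fi
    if [ -n "$cuda_arch" ]; then
        set -- "$@" --build-arg "CUDA_DOCKER_ARCH=$cuda_arch"
    fi
    "$DOCKER" "$@" -f .devops/cuda.Dockerfile .
}
```

For example, `build_cuda_image server 12.4.0 86` builds the server target with both ARGs overridden, while `build_cuda_image light` keeps the defaults. Setting `DOCKER=echo` first prints the command instead of running it.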
The resulting images are essentially the same as the non-CUDA images:
- local/llama.cpp:full-cuda: This image includes both the main executable file and the tools to convert LLaMA models into ggml and convert into 4-bit quantization.
- local/llama.cpp:light-cuda: This image only includes the main executable file.
- local/llama.cpp:server-cuda: This image only includes the server executable file.
Usage
After building locally, usage is similar to the non-CUDA examples, but you'll need to add the --gpus flag. You will also want to use the --n-gpu-layers flag.
docker run --gpus all -v /path/to/models:/models local/llama.cpp:full-cuda --run -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
docker run --gpus all -v /path/to/models:/models local/llama.cpp:light-cuda -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
docker run --gpus all -v /path/to/models:/models local/llama.cpp:server-cuda -m /models/7B/ggml-model-q4_0.gguf --port 8000 --host 0.0.0.0 -n 512 --n-gpu-layers 1
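Unlike the non-CUDA server example, the CUDA server command above does not publish a port, so the server would likely be unreachable from the host. The sketch below adds `-p` in the same way as the earlier example and makes the number of offloaded layers configurable; a higher --n-gpu-layers value places more of the model on the GPU. `run_cuda_server`, `DOCKER`, `MODELS_DIR` and `N_GPU_LAYERS` are hypothetical names introduced for illustration.

```shell
DOCKER="${DOCKER:-docker}"                  # hypothetical dry-run override
MODELS_DIR="${MODELS_DIR:-/path/to/models}" # host folder holding the models
N_GPU_LAYERS="${N_GPU_LAYERS:-1}"           # 1 matches the examples above

# run_cuda_server [PORT]
# Publishes PORT on the host and tells the server to listen on it.
run_cuda_server() {
    port="${1:-8000}"
    "$DOCKER" run --gpus all \
        -v "$MODELS_DIR:/models" \
        -p "$port:$port" \
        local/llama.cpp:server-cuda \
        -m /models/7B/ggml-model-q4_0.gguf \
        --port "$port" --host 0.0.0.0 \
        -n 512 --n-gpu-layers "$N_GPU_LAYERS"
}
```

For example, setting `N_GPU_LAYERS=32` and then calling `run_cuda_server 8000` requests 32 offloaded layers; how many layers actually fit depends on the model and the available GPU memory.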
Docker With MUSA
Assuming one has the mt-container-toolkit properly installed on Linux, muBLAS should be accessible inside the container.
Building Docker locally
docker build -t local/llama.cpp:full-musa --target full -f .devops/musa.Dockerfile .
docker build -t local/llama.cpp:light-musa --target light -f .devops/musa.Dockerfile .
docker build -t local/llama.cpp:server-musa --target server -f .devops/musa.Dockerfile .
You may want to pass in some different ARGS, depending on the MUSA environment supported by your container host, as well as the GPU architecture.
The defaults are:
- MUSA_VERSION set to rc3.1.0
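To build all three MUSA targets against one MUSA version, the ARG can be passed in a loop. This sketch passes the documented default explicitly; substitute the version your host supports. `build_musa_images` and the `DOCKER` dry-run variable are hypothetical.

```shell
DOCKER="${DOCKER:-docker}"              # hypothetical dry-run override
MUSA_VERSION="${MUSA_VERSION:-rc3.1.0}" # documented default

# Builds the full, light and server targets with the same MUSA_VERSION,
# stopping at the first failed build.
build_musa_images() {
    for target in full light server; do
        "$DOCKER" build -t "local/llama.cpp:${target}-musa" \
            --build-arg "MUSA_VERSION=$MUSA_VERSION" \
            --target "$target" -f .devops/musa.Dockerfile . || return 1
    done
}
```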
The resulting images are essentially the same as the non-MUSA images:
- local/llama.cpp:full-musa: This image includes both the main executable file and the tools to convert LLaMA models into ggml and convert into 4-bit quantization.
- local/llama.cpp:light-musa: This image only includes the main executable file.
- local/llama.cpp:server-musa: This image only includes the server executable file.
Usage
After building locally, usage is similar to the non-MUSA examples, but you'll need to set mthreads as the default Docker runtime. This can be done by executing (cd /usr/bin/musa && sudo ./docker setup $PWD) on the host machine and then verifying the change with docker info | grep mthreads. You will also want to use the --n-gpu-layers flag.
docker run -v /path/to/models:/models local/llama.cpp:full-musa --run -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
docker run -v /path/to/models:/models local/llama.cpp:light-musa -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
docker run -v /path/to/models:/models local/llama.cpp:server-musa -m /models/7B/ggml-model-q4_0.gguf --port 8000 --host 0.0.0.0 -n 512 --n-gpu-layers 1
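The `docker info | grep mthreads` check described in this section matches any line that mentions mthreads, including the list of installed runtimes. A slightly stricter sketch is shown below. It assumes `docker info` prints a "Default Runtime: <name>" line, and both function names are hypothetical.

```shell
# Reads `docker info` text on stdin and prints the default runtime name.
# Assumes the usual "Default Runtime: <name>" line in that output.
default_runtime() {
    sed -n 's/^[[:space:]]*Default Runtime:[[:space:]]*//p'
}

# Succeeds only if the docker info text on stdin names mthreads as default.
mthreads_is_default() {
    [ "$(default_runtime)" = "mthreads" ]
}
```

For example, `docker info | mthreads_is_default || echo "mthreads is not the default runtime" >&2` warns before any of the MUSA containers above are started.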