How to run llama.cpp

First, make sure you have configured a repository or rebuilt the packages.

Packages to install

Install the llama-cpp-tools package, possibly with additional GGML backends:

sudo apt install llama-cpp-tools libggml-blas
sudo apt install libggml-vulkan # if a GPU is available
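
To check that the tools are installed and on your PATH, you can print the built-in help:

llama-cli --help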

Test a small open source model on a laptop with a GPU:

llama-cli --hf-repo allenai/OLMo-2-1124-7B-Instruct-GGUF:Q4_K_M \
 -p "You are a helpful assistant." --conversation \
 --threads $(nproc) --cache-type-k q4_0 --cache-type-v q4_0 --flash-attn \
 --gpu-layers 33

Test Meta's small Llama 3.2 model on a server without a GPU:

llama-cli --hf-repo unsloth/Llama-3.2-3B-Instruct-GGUF:Q4_K_M \
 -p "You are a helpful assistant." --conversation \
 --threads $(nproc)

Note: the models are downloaded under ~/.cache/llama.cpp; make sure you have enough free space on that partition.
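
Once the cache directory exists, a quick way to see how much space the downloaded models take, and how much is left on the partition holding them:

du -sh ~/.cache/llama.cpp
df -h ~/.cache/llama.cpp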

Use torrents to deploy models

Using --hf-repo <owner>/<repo>:<quants> identifies models unambiguously, but by default the files are downloaded from the Hugging Face servers, which is not necessarily the most efficient option if the model has already been downloaded on a nearby machine.

An experimental alternative is to use the BitTorrent protocol to distribute the model files to the machines where they are needed.
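
As a rough sketch, assuming you already have a .torrent file describing the GGUF model (the file name below is only a placeholder, and aria2c is just one possible client), the download can be pointed straight at the llama.cpp cache directory:

# Placeholder torrent file name; replace it with the torrent you actually obtained.
aria2c -d ~/.cache/llama.cpp Llama-3.2-3B-Instruct-Q4_K_M.gguf.torrent

A BitTorrent client can also keep seeding the file afterwards, so other nearby machines can fetch it from you instead of from Hugging Face.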

Directory structure

The models are distributed as single GGUF files, with the same file name that llama.cpp uses when downloading with the --hf-repo option.

llama.cpp will expect such files to be stored under ~/.cache/llama.cpp.
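
Once a model file is already in that cache (whether it arrived over BitTorrent or was copied manually), llama-cli can load it directly with the -m option instead of --hf-repo. The file name below is only a placeholder; list the cache directory to see the actual names:

ls ~/.cache/llama.cpp
llama-cli -m ~/.cache/llama.cpp/<model-file>.gguf \
 -p "You are a helpful assistant." --conversation \
 --threads $(nproc)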