First, make sure you have configured a repository or rebuilt the packages.
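If in doubt, apt can confirm that the repository actually provides the packages (the names below are the ones used in the install command that follows):
apt policy llama-cpp-tools libggml-vulkan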
Install the llama-cpp-cli package, possibly with additional ggml backends:
sudo apt install llama-cpp-tools libggml-blas
sudo apt install libggml-vulkan # if a GPU is available
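To check that everything landed correctly, llama-cli can report its version and apt can list the installed ggml backends (a simple sanity check, nothing more):
llama-cli --version
apt list --installed 'libggml*'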
Test a small open-source model on a laptop with a GPU:
llama-cli --hf-repo allenai/OLMo-2-1124-7B-Instruct-GGUF:Q4_K_M \
-p "You are a helpful assistant." --conversation \
--threads $(nproc) --cache-type-k q4_0 --cache-type-v q4_0 --flash-attn \
--gpu-layers 33
Test Meta's small Llama 3.2 model on a server without a GPU:
llama-cli --hf-repo unsloth/Llama-3.2-3B-Instruct-GGUF:Q4_K_M \
-p "You are a helpful assistant." --conversation \
--threads $(nproc)
Note: the models will be downloaded under ~/.cache/llama.cpp; make sure you have enough space on that partition.
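A quick way to check the available and used space (plain coreutils, nothing specific to llama.cpp):
df -h ~/.cache                # free space on the partition holding the cache
du -sh ~/.cache/llama.cpp     # size of the downloaded models, once the directory exists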
Using --hf-repo <owner>/<repo>:<quants> identifies the models more precisely, but by default the files are downloaded from the Hugging Face servers, which is not necessarily the most efficient option if the model has already been downloaded nearby.
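For example, if a GGUF file is already available locally (copied from another machine or a shared mount; the path below is purely illustrative), llama-cli can load it directly with -m instead of downloading it again:
llama-cli -m /srv/models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
    -p "You are a helpful assistant." --conversation \
    --threads $(nproc)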
An experimental alternative is to use the BitTorrent protocol to dispatch the model files where they are needed.
The models are distributed as single GGUF files, with the same names that llama.cpp uses when downloading with the --hf-repo option.
llama.cpp will expect such files to be stored under ~/.cache/llama.cpp.
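As a sketch only (the torrent URL is a placeholder and aria2 is just one client that speaks BitTorrent), the downloaded file simply has to land under ~/.cache/llama.cpp with the expected name:
mkdir -p ~/.cache/llama.cpp
aria2c --dir ~/.cache/llama.cpp --seed-time=0 \
    https://example.org/torrents/Llama-3.2-3B-Instruct-Q4_K_M.gguf.torrent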