Use this skill for local GGUF inference, quant selection, or Hugging Face repo discovery for llama.cpp.
It covers getting a llama-server or llama-cli command from the Hub and listing the .gguf files and sizes for a repo.
Prefer URL workflows before asking for hf, Python, or custom scripts.
https://huggingface.co/models?apps=llama.cpp&sort=trending
- Add search=<term> for a model family
- Add num_parameters=min:0,max:24B or similar when the user has size constraints
Open the repo's llama.cpp local-app view:
https://huggingface.co/<repo>?local-app=llama.cpp
When the snippet is visible there:
- copy the llama-server or llama-cli command
- report the recommended quant exactly as HF shows it
Fetch the ?local-app=llama.cpp URL as page text or HTML and extract the section under Hardware compatibility:
- prefer its exact quant labels and sizes over generic tables
- keep repo-specific labels such as UD-Q4_K_M or IQ4_NL_XL
- if that section is not visible in the fetched page source, say so and fall back to the tree API plus generic quant guidance
Query the tree API for the repo's files:
https://huggingface.co/api/models/<repo>/tree/main?recursive=true
- keep entries where type is file and path ends with .gguf
- use path and size as the source of truth for filenames and byte sizes
- separate quantized checkpoints from mmproj-*.gguf projector files and BF16/ shard files
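- use https://huggingface.co/<repo>/tree/main only as a human fallback
If the HF snippet is missing, build the command from the repo and the chosen quant:
- quant shorthand: llama-server -hf <repo>:<QUANT>
- exact-file fallback: llama-server --hf-repo <repo> --hf-file <filename.gguf>
Install llama.cpp:
# macOS / Linux (simplest)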
brew install llama.cpp
# Windows
winget install llama.cpp
# Build from source
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
# Run a model straight from the Hub (interactive CLI, then HTTP server)
llama-cli -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0
llama-server -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0
Use this when the tree API shows custom file naming or the exact HF snippet is missing.
llama-server \
  --hf-repo microsoft/Phi-3-mini-4k-instruct-gguf \
  --hf-file Phi-3-mini-4k-instruct-q4.gguf \
  -c 4096
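The two command shapes above (quant shorthand and exact-file fallback) can be assembled programmatically. The following is a minimal sketch, not part of llama.cpp: the helper names llama_server_command and to_shell are hypothetical, and the only flags used are the ones already shown in this document (-hf, --hf-repo, --hf-file, -c).

```python
import shlex
from typing import List, Optional


def llama_server_command(
    repo: str,
    quant: Optional[str] = None,
    filename: Optional[str] = None,
    ctx_size: Optional[int] = None,
) -> List[str]:
    """Return llama-server arguments for a Hub repo (hypothetical helper).

    Pass `quant` for the shorthand form, or `filename` for the exact-file form.
    """
    # Exactly one of the two selectors must be given.
    if (quant is None) == (filename is None):
        raise ValueError("pass exactly one of quant or filename")
    if filename is not None:
        # Exact-file fallback: llama-server --hf-repo <repo> --hf-file <filename.gguf>
        args = ["llama-server", "--hf-repo", repo, "--hf-file", filename]
    else:
        # Quant shorthand: llama-server -hf <repo>:<QUANT>
        args = ["llama-server", "-hf", f"{repo}:{quant}"]
    if ctx_size is not None:
        args += ["-c", str(ctx_size)]
    return args


def to_shell(args: List[str]) -> str:
    """Join the argument list into a copy-pasteable, shell-quoted string."""
    return shlex.join(args)
```

Returning a list rather than a string keeps the result usable with subprocess.run without shell quoting issues; to_shell is only for showing the command to the user.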
# Query the server's OpenAI-compatible chat endpoint (default port 8080)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Write a limerick about Python exceptions"}
    ]
  }'
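The same request as the curl call above can be sent from Python with only the standard library. This is a sketch under the assumption that llama-server is running locally on port 8080 and returns the usual OpenAI-style response shape; the function names build_chat_request and chat are hypothetical.

```python
import json
import urllib.request


def build_chat_request(prompt: str, base_url: str = "http://localhost:8080") -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {"messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, base_url: str = "http://localhost:8080", timeout: float = 120.0) -> str:
    """Send the request and return the assistant's reply text.

    Assumes an OpenAI-style body: choices[0].message.content.
    """
    request = build_chat_request(prompt, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a server running, chat("Write a limerick about Python exceptions") returns the reply as a string.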
Install the Python bindings:
pip install llama-cpp-python
GPU builds set CMAKE_ARGS first:
- CUDA: CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
- Metal: CMAKE_ARGS="-DGGML_METAL=on" ...
from llama_cpp import Llama

# Basic text completion
llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=35,  # 0 for CPU, 99 to offload everything
    n_threads=8,
)
out = llm("What is machine learning?", max_tokens=256, temperature=0.7)
print(out["choices"][0]["text"])

# Chat completion
llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=35,
    chat_format="llama-3",  # or "chatml", "mistral", etc.
)
resp = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Python?"},
    ],
    max_tokens=256,
)
print(resp["choices"][0]["message"]["content"])

# Streaming: chunks arrive as they are generated
for chunk in llm("Explain quantum computing:", max_tokens=256, stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)

# Embeddings: the model must be loaded with embedding=True
llm = Llama(model_path="./model-q4_k_m.gguf", embedding=True, n_gpu_layers=35)
vec = llm.embed("This is a test sentence.")
print(f"Embedding dimension: {len(vec)}")
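You can also load a GGUF straight from the Hub (this path needs the huggingface-hub package installed):
llm = Llama.from_pretrained(
    repo_id="bartowski/Llama-3.2-3B-Instruct-GGUF",
    filename="*Q4_K_M.gguf",  # glob pattern matched against the repo's filenames
    n_gpu_layers=35,
)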
Use the Hub page first, generic heuristics second.
- Default to Q4_K_M.
- Move up to Q5_K_M or Q6_K if memory allows.
- Use Q3_K_M, IQ variants, or Q2 variants only if the user explicitly prioritizes fit over quality.
- List mmproj-*.gguf separately. The projector is not the main model file.
- Keep repo-native labels: if the repo shows UD-Q4_K_M, report UD-Q4_K_M.
When the user asks what GGUFs exist, return:
Ignore unless requested:
Use the tree API for this step:
https://huggingface.co/api/models/<repo>/tree/main?recursive=true
For a repo like unsloth/Qwen3.6-35B-A3B-GGUF, the local-app page can show quant chips such as UD-Q4_K_M, UD-Q5_K_M, UD-Q6_K, and Q8_0, while the tree API exposes exact file paths such as Qwen3.6-35B-A3B-UD-Q4_K_M.gguf and Qwen3.6-35B-A3B-Q8_0.gguf with byte sizes. Use the tree API to turn a quant label into an exact filename.
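The filtering rules for tree API entries (keep type file with a .gguf path, set aside mmproj-*.gguf and BF16/ files) and the label-to-filename step can be sketched as follows. The helper names are hypothetical, and the matching rule is an assumption: it expects filenames to end in -<label>.gguf, as in the examples above, and does not handle split or sharded quant files.

```python
import json
import urllib.request
from typing import Dict, List, Optional


def fetch_tree(repo: str) -> List[dict]:
    """Fetch the recursive file listing for a repo from the tree API."""
    url = f"https://huggingface.co/api/models/{repo}/tree/main?recursive=true"
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def classify_ggufs(entries: List[dict]) -> Dict[str, List[dict]]:
    """Split .gguf file entries into quantized checkpoints, projectors, and BF16 shards."""
    groups: Dict[str, List[dict]] = {"quantized": [], "projector": [], "bf16": []}
    for entry in entries:
        path = entry.get("path", "")
        # Keep only entries where type is file and path ends with .gguf.
        if entry.get("type") != "file" or not path.endswith(".gguf"):
            continue
        name = path.rsplit("/", 1)[-1]
        if name.startswith("mmproj-"):
            groups["projector"].append(entry)
        elif path.startswith("BF16/"):
            groups["bf16"].append(entry)
        else:
            groups["quantized"].append(entry)
    return groups


def find_file_for_quant(entries: List[dict], label: str) -> Optional[dict]:
    """Return the quantized entry whose filename ends in -<label>.gguf, or None.

    Assumption: a plain label such as Q4_K_M also matches a prefixed file such
    as ...-UD-Q4_K_M.gguf, so the shortest matching path is preferred.
    """
    suffix = f"-{label}.gguf".lower()
    matches = [
        entry
        for entry in classify_ggufs(entries)["quantized"]
        if entry["path"].lower().endswith(suffix)
    ]
    return min(matches, key=lambda entry: len(entry["path"]), default=None)
```

The path and size fields of the returned entry can then feed the exact-file command and the size shown to the user.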
Use these URL shapes directly:
https://huggingface.co/models?apps=llama.cpp&sort=trending
https://huggingface.co/models?search=<term>&apps=llama.cpp&sort=trending
https://huggingface.co/models?search=<term>&apps=llama.cpp&num_parameters=min:0,max:24B&sort=trending
https://huggingface.co/<repo>?local-app=llama.cpp
https://huggingface.co/api/models/<repo>/tree/main?recursive=true
https://huggingface.co/<repo>/tree/main
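The URL shapes above can be generated rather than typed by hand. This is a small sketch using only the standard library; the function names are hypothetical, and the query parameters are exactly the ones listed above.

```python
from typing import Optional
from urllib.parse import urlencode

HUB = "https://huggingface.co"


def search_url(term: Optional[str] = None, max_params: Optional[str] = None) -> str:
    """Build a model search URL filtered to llama.cpp-compatible repos."""
    query = {}
    if term:
        query["search"] = term
    query["apps"] = "llama.cpp"
    if max_params:
        # Size constraint, e.g. max_params="24B" gives num_parameters=min:0,max:24B
        query["num_parameters"] = f"min:0,max:{max_params}"
    query["sort"] = "trending"
    # Keep ':' and ',' unescaped so the URL matches the shapes listed above.
    return f"{HUB}/models?" + urlencode(query, safe=":,")


def local_app_url(repo: str) -> str:
    """Repo page with the llama.cpp local-app view."""
    return f"{HUB}/{repo}?local-app=llama.cpp"


def tree_api_url(repo: str) -> str:
    """Recursive file listing from the tree API."""
    return f"{HUB}/api/models/{repo}/tree/main?recursive=true"


def tree_page_url(repo: str) -> str:
    """Human-readable file browser, for use as a fallback only."""
    return f"{HUB}/{repo}/tree/main"
```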
When answering discovery requests, prefer a compact structured result like:
Repo: <repo>
Recommended quant from HF: <label> (<size>)
llama-server: <command>
Other GGUFs:
- <filename> - <size>
- <filename> - <size>
Source URLs:
- <local-app URL>
- <tree API URL>
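The result template above can be filled in with a short formatter. This is an illustrative sketch with hypothetical function names; the choice of decimal units (1 GB = 10**9 bytes) for tree API byte sizes is an assumption, and the recommended quant's size is passed through as text so that it can be reported exactly as HF shows it.

```python
from typing import List, Sequence, Tuple


def human_size(num_bytes: int) -> str:
    """Format a byte count using decimal units (assumed: 1 GB = 10**9 bytes)."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1000:
            return f"{int(size)} B" if unit == "B" else f"{size:.1f} {unit}"
        size /= 1000
    return f"{size:.1f} TB"


def render_result(
    repo: str,
    quant_label: str,
    quant_size: str,
    command: str,
    other_files: Sequence[Tuple[str, int]],
    source_urls: Sequence[str],
) -> str:
    """Render the compact discovery result in the layout shown above.

    other_files holds (filename, size_in_bytes) pairs taken from the tree API.
    """
    lines: List[str] = [
        f"Repo: {repo}",
        f"Recommended quant from HF: {quant_label} ({quant_size})",
        f"llama-server: {command}",
        "Other GGUFs:",
    ]
    lines += [f"- {name} - {human_size(size)}" for name, size in other_files]
    lines.append("Source URLs:")
    lines += [f"- {url}" for url in source_urls]
    return "\n".join(lines)
```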