GGUF (GPT-Generated Unified Format) is the file format used by llama.cpp, the successor to the older GGML format.
It is a binary format optimized for efficiently loading and saving quantized Large Language Models (and other machine learning models), particularly for inference.
It stores compressed versions of LLMs by reducing the precision of the weights, for example:
FP16 or FP32 → INT4 / INT5 / INT8, etc.
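To make the idea concrete, here is a minimal sketch of the block-quantization idea behind Q8_0: weights are split into fixed-size blocks, and each block stores one scale plus low-precision integer values. This is a simplified illustration, not llama.cpp's actual kernel; function names and the block size of 32 are assumptions for the example.

```python
import numpy as np

def quantize_q8_0(weights: np.ndarray, block_size: int = 32):
    """Toy Q8_0-style block quantization: one scale per block + int8 values."""
    blocks = weights.reshape(-1, block_size)            # split into blocks of 32
    scales = np.abs(blocks).max(axis=1) / 127.0         # per-block scale
    scales[scales == 0] = 1.0                           # avoid division by zero
    q = np.round(blocks / scales[:, None]).astype(np.int8)
    return q, scales.astype(np.float16)

def dequantize_q8_0(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reconstruct approximate FP32 weights from int8 blocks."""
    return (q.astype(np.float32) * scales[:, None].astype(np.float32)).ravel()

# Example: 1024 FP32 weights (4 KiB) shrink to int8 values plus per-block scales.
w = np.random.randn(1024).astype(np.float32)
q, s = quantize_q8_0(w)
w_hat = dequantize_q8_0(q, s)
print("max abs error:", np.abs(w - w_hat).max())
```

Lower-bit formats such as Q4 and Q5 follow the same principle but pack the integers more tightly and (in the K-quant variants) add extra per-block structure to preserve quality.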
GGUF Quant Levels (Simplified Table)
| Quant Level | Bits | Precision Type | Size | Speed | Quality | Typical Use |
|---|---|---|---|---|---|---|
| Q2_K | 2-bit | VERY low precision | 🔻 Smallest | 🚀 Fastest | ❌ Poor | Experiments, small devices |
| Q3_K | 3-bit | Low precision | 🔻 Small | 🚀 Fast | ⚠️ Low | Small devices, toy use |
| Q4_0 | 4-bit | Basic INT4 (legacy) | 🔻 Small | 🚀 Fast | ⚠️ Noticeable loss | Older models, testing |
| Q4_K_M | 4-bit | Modern INT4 (K-quant, medium) | ⚖️ Medium | ⚡ Fast | ✅ Good (~95% of FP16 quality) | ✅ Best Q4-level compromise |
| Q5_0 | 5-bit | Basic INT5 | ⚖️ Medium | ⚡ Fast | ✅+ Very good | Good balance for general use |
| Q5_K_M | 5-bit | Optimized 5-bit | 📈 Larger | ✅ Fast | ✅✅ Excellent | Near-FP16 quality |
| Q6_K | 6-bit | Almost INT8-level | 📈 Big | ✅ Fast | ✅✅✅ Very close to FP16 | Ideal if you have VRAM room |
| Q8_0 | 8-bit | Full INT8 | 📈📈 Large | ⚠️ Slower | ✅✅✅✅ Excellent | Very close to full precision |
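As a rough back-of-the-envelope check on the Size column, file size scales roughly with parameter count × bits-per-weight. The bits-per-weight figures below are approximate, illustrative values (they vary slightly between llama.cpp releases and models):

```python
def approx_gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough GGUF file size estimate: parameters * bits-per-weight, ignoring metadata."""
    return n_params * bits_per_weight / 8 / 1e9

# Approximate bits-per-weight for a few quant levels (illustrative values only).
for name, bpw in [("Q2_K", 2.6), ("Q4_K_M", 4.8), ("Q5_K_M", 5.7), ("Q8_0", 8.5), ("FP16", 16.0)]:
    print(f"7B model @ {name}: ~{approx_gguf_size_gb(7e9, bpw):.1f} GB")
```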
The community favorite right now is:
🔥 Q4_K_M → great speed & quality on mid-range GPUs like the RTX 3080 Ti.
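As a quick usage sketch, a Q4_K_M GGUF file can be loaded with the llama-cpp-python bindings. The model path below is a placeholder; any GGUF file works:

```python
from llama_cpp import Llama

# Load a Q4_K_M quantized GGUF file (path/filename is a placeholder).
llm = Llama(
    model_path="./models/llama-2-7b.Q4_K_M.gguf",
    n_gpu_layers=-1,   # offload all layers to the GPU if VRAM allows
    n_ctx=4096,        # context window size
)

out = llm("Q: What does GGUF stand for? A:", max_tokens=64)
print(out["choices"][0]["text"])
```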
GGML and GGUF are closely related: GGML is the earlier file format (named after the underlying tensor library), and GGUF is its successor, which adds richer metadata about the model and is designed to be extensible.
For more info: What is GGUF and GGML?
