GGUF (GPT-Generated Unified Format) is the current file format used by llama.cpp, the successor to the older GGML format.

It is a binary file format designed for efficiently loading and saving large language models (and other machine-learning models) with quantization, aimed primarily at inference.
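To give a feel for the binary layout, here is a minimal sketch that reads the fixed GGUF header: the 4-byte magic "GGUF", a uint32 format version, and (in format versions 2 and later) two uint64 counts for tensors and metadata key/value pairs. The path model.gguf is a placeholder for any local GGUF file.

```python
import struct

# Minimal sketch: parse the fixed GGUF header (little-endian).
# Layout (GGUF v2+): 4-byte magic "GGUF", uint32 version,
# uint64 tensor count, uint64 metadata key/value count.
with open("model.gguf", "rb") as f:  # placeholder path
    magic = f.read(4)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    version, n_tensors, n_kv = struct.unpack("<IQQ", f.read(4 + 8 + 8))

print(f"GGUF v{version}: {n_tensors} tensors, {n_kv} metadata entries")
```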

It stores compressed versions of LLMs by reducing the precision of the weights, for example:

FP16 or FP32 → INT4 / INT5 / INT8, etc.
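As a rough illustration of what that precision reduction means (not llama.cpp's actual quantization kernels, which work block-wise with per-block scales), here is a naive symmetric FP32 → INT8 round trip:

```python
import numpy as np

# Naive symmetric INT8 quantization of a weight tensor (illustration only;
# llama.cpp's quant formats use small blocks with per-block scales).
w = np.random.randn(4096).astype(np.float32)        # stand-in FP32 weights

scale = np.abs(w).max() / 127.0                      # one scale for the whole tensor
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

w_hat = q.astype(np.float32) * scale                 # dequantize before compute
print("max abs error:", np.abs(w - w_hat).max())
print("bytes: fp32 =", w.nbytes, " int8 =", q.nbytes)
```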


GGUF Quant Levels (Simplified Table)

| Quant Level | Bits | Precision Type | Size | Speed | Quality | Typical Use |
|---|---|---|---|---|---|---|
| Q2_K | 2-bit | Very low precision | 🔻 Smallest | 🚀 Fastest | ❌ Poor | Experiments, small devices |
| Q3_K | 3-bit | Low precision | 🔻 Small | 🚀 Fast | ⚠️ Low | Small devices, toy use |
| Q4_0 | 4-bit | Basic INT4 (old) | 🔻 Small | 🚀 Fast | ⚠️ Noticeable loss | Older models, testing |
| Q4_K_M | 4-bit | Modern INT4 + mul mat | ⚖️ Medium | ⚡ Fast | ✅ Good (~95%) | ✅ Best Q4-level compromise |
| Q5_0 | 5-bit | Basic INT5 | ⚖️ Medium | ⚡ Fast | ✅+ Very good | Good balance for general use |
| Q5_K_M | 5-bit | Optimized 5-bit | 📈 Larger | ✅ Fast | ✅✅ Excellent | Near-FP16 quality |
| Q6_K | 6-bit | Almost INT8-level | 📈 Big | ✅ Fast | ✅✅✅ Very close to FP16 | Ideal if you have VRAM room |
| Q8_0 | 8-bit | Full INT8 | 📈📈 Large | ⚠️ Slower | ✅✅✅✅ Excellent | Very close to full precision |
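To put the Size column in perspective, a back-of-the-envelope file-size estimate is parameter count × bits per weight. The sketch below uses only the nominal bit widths and ignores per-block scale overhead (and the mixed bit widths inside K-quants), so real GGUF files come out somewhat larger:

```python
# Rough size estimate: parameters × bits-per-weight / 8 bytes.
# Ignores per-block scales and K-quant mixing, so real files are a bit larger.
def approx_size_gb(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1e9

n_params = 7e9  # e.g. a 7B-parameter model
for name, bpw in [("Q2_K", 2), ("Q4_K_M", 4), ("Q6_K", 6), ("Q8_0", 8), ("FP16", 16)]:
    print(f"{name:7s} ~{approx_size_gb(n_params, bpw):4.1f} GB")
```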

The community favorite right now is:

🔥 Q4_K_M → great speed and quality on mid-range GPUs such as an RTX 3080 Ti.
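In llama.cpp the quant level is simply an argument to the quantization tool: assuming you already have an FP16 GGUF export of the model, something like `./llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M` produces the Q4_K_M file (the binary name has varied across llama.cpp versions, e.g. it was previously just `quantize`).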


GGML and GGUF are closely related: GGML is the older llama.cpp file format (named after the underlying ggml tensor library), and GGUF is its successor, which adds richer, extensible metadata about the model.

For more info: What is GGUF and GGML?