viva_tensor/quant/nf4
NF4 (NormalFloat4) Quantization - QLoRA Style
Reference: Dettmers et al. (2023) - “QLoRA: Efficient Finetuning of Quantized LLMs” https://arxiv.org/abs/2305.14314
The Key Insight
Neural network weights are approximately normally distributed. Standard 4-bit quantization uses uniformly spaced levels, which wastes precision on ranges that hold few weights. NF4 instead uses 16 levels derived from quantiles of N(0,1). Result: more precision where weights concentrate (near zero).
Compression Math
- NF4: 32-bit / 4-bit = 8x theoretical
- With block scaling overhead (one FP16 scale per 64 values, so 4 + 16/64 = 4.25 bits/param): ~7.5x effective
- Double Quantization (quantize the scales too, so about 4 + 8/64 = 4.125 bits/param): ~7.8x effective
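The ratios above follow from counting bits per parameter. A minimal sketch of that arithmetic, where `effective_ratio` is a hypothetical helper (not part of this module) and the single FP32 scale-of-scales used by Double Quantization is ignored as negligible:

```gleam
import gleam/int

/// Compression versus FP32 for 4-bit data plus one scale per block.
/// `scale_bits` is the width of each per-block scale (16 for FP16, 8 for INT8).
pub fn effective_ratio(block_size: Int, scale_bits: Int) -> Float {
  let bits_per_param =
    4.0 +. int.to_float(scale_bits) /. int.to_float(block_size)
  32.0 /. bits_per_param
}

pub fn main() {
  // FP16 scale per 64 values: 32 / 4.25, about 7.53
  let plain = effective_ratio(64, 16)
  // INT8 scale per 64 values: 32 / 4.125, about 7.76
  let double = effective_ratio(64, 8)
  #(plain, double)
}
```

The same helper shows the block-size tradeoff: `effective_ratio(32, 16)` and `effective_ratio(128, 16)` give the ratios for 32- and 128-value blocks.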
Why NF4 Beats Uniform Q4
- Uniform Q4: 16 evenly spaced levels in [-1, 1]
- NF4: 16 levels at normal distribution quantiles
- For Gaussian weights, NF4 gives roughly 2x lower quantization error
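Production Numbers
At about 4.25 bits (0.53 bytes) per parameter, 24 GB of VRAM holds the weights of a model with roughly 45B parameters, before activations and other runtime memory. LLaMA-33B (about 17.5 GB of weights) fits in 24 GB. LLaMA-65B needs about 34.5 GB for weights alone, so it does not fit; the QLoRA paper finetunes 65B models on a single 48 GB GPU.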
FP16 was a mistake for storage. NF4 is the future.
Implementation based on: bitsandbytes, Hugging Face Transformers
Types
Double Quantization: quantize the quantization constants
Genius insight from QLoRA paper:
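- Standard NF4 (QLoRA paper, FP32 scales, block size 64): 4 bits/param for data + 32/64 = 0.5 bits/param for scales
- Double Quant (QLoRA paper): 4 bits/param for data + 8/64 + 32/(64 * 256) ≈ 0.127 bits/param for scales
How? Quantize the per-block scales to INT8, with one FP32 scale for all scales. Against FP32 scales this cuts metadata overhead by ~75%. Against the FP16 scales this module stores (16/64 = 0.25 bits/param), the saving is about 50%.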
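The scale-quantization step can be sketched as follows. This is an illustration under stated assumptions, not the module's actual code: it assumes a plain absmax mapping of the non-negative per-block scales onto 0..255, and `quantize_scales` / `dequantize_scales` are hypothetical names.

```gleam
import gleam/float
import gleam/int
import gleam/list

/// Map non-negative per-block scales onto 0..255 with one shared scale.
pub fn quantize_scales(scales: List(Float)) -> #(List(Int), Float) {
  let max_scale = list.fold(scales, 0.0, float.max)
  // Fall back to 1.0 so an all-zero tensor does not divide by zero
  let scales_scale = case max_scale >. 0.0 {
    True -> max_scale /. 255.0
    False -> 1.0
  }
  let quantized_scales =
    list.map(scales, fn(s) { float.round(s /. scales_scale) })
  #(quantized_scales, scales_scale)
}

/// Recover approximate scales from the 8-bit codes.
pub fn dequantize_scales(
  quantized_scales: List(Int),
  scales_scale: Float,
) -> List(Float) {
  list.map(quantized_scales, fn(q) { int.to_float(q) *. scales_scale })
}
```

The two return values correspond to the `quantized_scales` and `scales_scale` fields of `DoubleQuantNF4`. The round trip is lossy: each recovered scale can differ from the original by up to half of `scales_scale`.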
pub type DoubleQuantNF4 {
  DoubleQuantNF4(
    blocks: List(NF4Block),
    quantized_scales: List(Int),
    scales_scale: Float,
    shape: List(Int),
    num_elements: Int,
    memory_bytes: Int,
  )
}
Constructors
- DoubleQuantNF4(blocks: List(NF4Block), quantized_scales: List(Int), scales_scale: Float, shape: List(Int), num_elements: Int, memory_bytes: Int)
Arguments
- quantized_scales: Scales quantized to INT8 (one per block)
- scales_scale: Global scale for the quantized scales (one FP32 for the entire tensor)
Single NF4 block (typically 64 values).
Block size 64 is the empirically optimal tradeoff between accuracy and metadata overhead:
- Smaller blocks (32): 2x scale overhead, marginal accuracy gain
- Larger blocks (128): half the scale overhead, noticeable accuracy loss
pub type NF4Block {
  NF4Block(indices: List(Int), abs_max: Float, block_size: Int)
}
Constructors
- NF4Block(indices: List(Int), abs_max: Float, block_size: Int)
Arguments
- indices: 4-bit indices [0-15] mapping to nf4_levels()
- abs_max: Per-block scale factor (max absolute value before normalization)
- block_size: Block size for unpacking
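The `memory_bytes` formula elsewhere in this module counts half a byte per element, which suggests two 4-bit indices share one byte when stored. A minimal sketch of such packing, assuming high-nibble-first order (an arbitrary choice here). `pack_indices` and `unpack_indices` are hypothetical helpers, not functions this module exposes:

```gleam
import gleam/list

/// Pack pairs of 4-bit indices (0..15) into bytes, high nibble first.
/// A trailing unpaired index is padded with a zero low nibble.
pub fn pack_indices(indices: List(Int)) -> List(Int) {
  list.sized_chunk(indices, 2)
  |> list.map(fn(pair) {
    case pair {
      [hi, lo] -> hi * 16 + lo
      [hi] -> hi * 16
      _ -> 0
    }
  })
}

/// Split bytes back into 4-bit indices, then drop any padding
/// by keeping only the first `block_size` indices.
pub fn unpack_indices(bytes: List(Int), block_size: Int) -> List(Int) {
  list.flat_map(bytes, fn(b) { [b / 16, b % 16] })
  |> list.take(block_size)
}
```

With an odd number of indices the last byte carries a padding nibble, which is one reason an unpacker needs to know `block_size`.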
NF4 configuration
pub type NF4Config {
  NF4Config(block_size: Int, double_quant: Bool)
}
Constructors
- NF4Config(block_size: Int, double_quant: Bool)
Arguments
- block_size: Block size (64 is the QLoRA default; don't change it unless you know why)
- double_quant: Double Quantization: quantize the scales themselves, an idea from the QLoRA paper. Cuts scale overhead by about 4x relative to FP32 scales (about 2x relative to FP16 scales).
Quantization statistics for analysis
pub type NF4Stats {
  NF4Stats(
    original_bytes: Int,
    compressed_bytes: Int,
    compression_ratio: Float,
    mean_error: Float,
    max_error: Float,
    num_blocks: Int,
  )
}
Constructors
- NF4Stats(original_bytes: Int, compressed_bytes: Int, compression_ratio: Float, mean_error: Float, max_error: Float, num_blocks: Int)
Complete NF4-quantized tensor
pub type NF4Tensor {
  NF4Tensor(
    blocks: List(NF4Block),
    shape: List(Int),
    num_elements: Int,
    memory_bytes: Int,
    compression_ratio: Float,
  )
}
Constructors
- NF4Tensor(blocks: List(NF4Block), shape: List(Int), num_elements: Int, memory_bytes: Int, compression_ratio: Float)
Arguments
- memory_bytes: Memory in bytes: (num_elements / 2) + (num_blocks * 2)
- compression_ratio: Effective compression ratio (typically 7.5-7.8x)
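The `memory_bytes` formula can be checked by hand. The sketch below assumes the number of blocks is rounded up so that a trailing partial block still gets its own 2-byte scale. `nf4_memory_bytes` and `ratio_vs_fp32` are hypothetical names, not part of this module.

```gleam
import gleam/int

/// Bytes for an NF4 tensor: two 4-bit indices per byte,
/// plus one 2-byte FP16 scale per block.
pub fn nf4_memory_bytes(num_elements: Int, block_size: Int) -> Int {
  // Round up so a partial final block is still counted
  let num_blocks = { num_elements + block_size - 1 } / block_size
  num_elements / 2 + num_blocks * 2
}

/// Compression relative to FP32 storage (4 bytes per element).
pub fn ratio_vs_fp32(num_elements: Int, block_size: Int) -> Float {
  let original_bytes = int.to_float(num_elements * 4)
  original_bytes /. int.to_float(nf4_memory_bytes(num_elements, block_size))
}
```

For 4096 elements in blocks of 64, this gives 2048 + 64 * 2 = 2176 bytes against 16384 bytes in FP32, a ratio of about 7.53.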
Values
pub fn benchmark_nf4() -> Nil
pub fn compute_stats(
  original: tensor.Tensor,
  nf4: NF4Tensor,
) -> NF4Stats
Compute quantization error statistics
pub fn dequantize(nf4: NF4Tensor) -> tensor.Tensor
Dequantize an NF4 tensor back to FP32.
Note: quantization error is permanent. This is NOT lossless.
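Per block, dequantization is a table lookup followed by a multiply: each value is the level at its index, times the block's `abs_max`. A minimal sketch, assuming `levels` is the list returned by `nf4_levels()`. `level_at` and `dequantize_block` are hypothetical helpers, not the module's internals.

```gleam
import gleam/list

/// Look up the level at `index`; out-of-range indices map to 0.0.
fn level_at(levels: List(Float), index: Int) -> Float {
  case levels, index {
    [], _ -> 0.0
    [first, ..], 0 -> first
    [_, ..rest], _ -> level_at(rest, index - 1)
  }
}

/// Reconstruct one block: value = levels[index] * abs_max.
pub fn dequantize_block(
  indices: List(Int),
  abs_max: Float,
  levels: List(Float),
) -> List(Float) {
  list.map(indices, fn(i) { level_at(levels, i) *. abs_max })
}
```

Every reconstructed value is one of at most 16 multiples of `abs_max`, which is why the original weights cannot be recovered exactly.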
pub fn double_quantize(
  t: tensor.Tensor,
  config: NF4Config,
) -> DoubleQuantNF4
Apply Double Quantization for maximum compression
pub fn nf4_levels() -> List(Float)
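The 16 NF4 levels are quantiles of N(0,1) normalized to [-1, 1]. These exact values are hardcoded in bitsandbytes and used by QLoRA.
Why these specific values?
- Level 7 (zero-based index) is exactly 0.0 (critical for sparse weights)
- More levels near zero (where weights concentrate)
- Fewer levels at tails (where weights are rare)
Derivation: quantiles of N(0,1) are estimated separately for the negative half (7 levels) and the positive half (8 levels), joined around an exact zero, then divided by the largest magnitude so the range is [-1, 1]. A plain quantile(k/16) for k in 1..16 would not work, because quantile(16/16) is infinite.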
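To make the spacing concrete, the sketch below lists the levels rounded to four decimals and computes the gap between neighbours. The rounding is for readability only, since the library hardcodes the full-precision constants. `nf4_levels_rounded` and `level_gaps` are hypothetical names.

```gleam
import gleam/list

/// NF4 levels rounded to four decimals (illustrative, not full precision).
pub fn nf4_levels_rounded() -> List(Float) {
  [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.723, 1.0,
  ]
}

/// Gaps between consecutive levels: small near zero, large at the tails.
pub fn level_gaps(levels: List(Float)) -> List(Float) {
  list.window_by_2(levels)
  |> list.map(fn(pair) { pair.1 -. pair.0 })
}
```

The gap next to zero is under 0.1, while the outermost gaps are close to 0.3, so resolution is roughly three times finer near zero than at the tails.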
pub fn quantize(t: tensor.Tensor, config: NF4Config) -> NF4Tensor
Quantize a tensor to NF4.
Compression: 32/4 = 8x theoretical, ~7.5x with FP16 scales per block.
Error: ~0.1% mean absolute error for Gaussian-distributed weights. Use compute_stats to measure the error on your own tensors.
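Quantizing one block comes down to three steps: take the block's largest absolute value, divide every value by it, and pick the nearest level. The sketch below assumes nearest-level rounding by linear scan and expects `levels` to be the list from `nf4_levels()`. `nearest_index` and `quantize_block` are hypothetical helpers, and the module's real implementation may differ.

```gleam
import gleam/float
import gleam/list

/// Index of the level closest to `x` (ties keep the lower index).
fn nearest_index(levels: List(Float), x: Float) -> Int {
  // Accumulator: #(best index so far, current index, best distance so far).
  // 1000.0 exceeds any possible distance for inputs normalized to [-1, 1].
  let #(best, _, _) =
    list.fold(levels, #(0, 0, 1000.0), fn(acc, level) {
      let #(best, i, best_dist) = acc
      let dist = float.absolute_value(level -. x)
      case dist <. best_dist {
        True -> #(i, i + 1, dist)
        False -> #(best, i + 1, best_dist)
      }
    })
  best
}

/// Returns the 4-bit indices and the abs_max scale for one block.
pub fn quantize_block(
  values: List(Float),
  levels: List(Float),
) -> #(List(Int), Float) {
  let abs_max =
    list.fold(values, 0.0, fn(m, v) { float.max(m, float.absolute_value(v)) })
  // Avoid dividing by zero when the whole block is zero
  let divisor = case abs_max >. 0.0 {
    True -> abs_max
    False -> 1.0
  }
  let indices = list.map(values, fn(v) { nearest_index(levels, v /. divisor) })
  #(indices, abs_max)
}
```

The returned pair corresponds to the `indices` and `abs_max` fields of `NF4Block`. A single outlier in a block raises `abs_max` and pushes the remaining values toward the levels near zero, which is the accuracy cost of larger blocks.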