viva_tensor/quant/nf4

NF4 (NormalFloat4) Quantization - QLoRA Style

Reference: Dettmers et al. (2023) - “QLoRA: Efficient Finetuning of Quantized LLMs” https://arxiv.org/abs/2305.14314

The Key Insight
Neural network weights follow a normal distribution (approximately).
Standard 4-bit quantization uses uniform levels: wasteful!
NF4 uses 16 levels derived from quantiles of N(0,1).
Result: more precision where weights concentrate (near zero).

Compression Math
NF4: 32-bit / 4-bit = 8x theoretical.
With block scaling overhead (one FP16 scale per 64 values, 0.25 bits/param): ~7.5x effective.
With Double Quantization (the scales are quantized too, about 0.125 bits/param): ~7.8x effective.
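
The ratios above follow from counting stored bits per parameter. The sketch below is in Python for brevity and is not part of this Gleam module; the helper names `ratio` and `int8_scales` are made up for illustration, and it assumes a block size of 64, one INT8 scale per block, and one FP32 scale for the whole tensor.

```python
# Bits stored per parameter under each scheme (block size 64 assumed).
BLOCK_SIZE = 64
DATA_BITS = 4  # one 4-bit NF4 index per weight


def ratio(scale_bits_per_param: float) -> float:
    """Compression ratio versus FP32 (32 bits per weight)."""
    return 32.0 / (DATA_BITS + scale_bits_per_param)


# One FP16 scale shared by every 64 weights.
fp16_scales = 16 / BLOCK_SIZE  # 0.25 bits/param


def int8_scales(num_params: int) -> float:
    """INT8 scale per block plus one FP32 scale amortized over the tensor."""
    return 8 / BLOCK_SIZE + 32 / num_params


print(f"no scales:   {ratio(0.0):.2f}x")
print(f"FP16 scales: {ratio(fp16_scales):.2f}x")
print(f"INT8 scales: {ratio(int8_scales(1_000_000)):.2f}x")
```

The shared FP32 scale is negligible for any realistically sized tensor, so the double-quantized overhead is dominated by the 8 bits per block.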

Why NF4 Beats Uniform Q4
Uniform Q4: 16 evenly spaced levels in [-1, 1].
NF4: 16 levels at normal distribution quantiles.
For Gaussian weights, NF4 has lower quantization error (quoted here as about 2x; the exact gain depends on block size and on the error metric).
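
One way to check the comparison is to quantize the same Gaussian samples with both level sets and compare the mean absolute error. This is a Python sketch, not the module's code: the level values are the ones published with QLoRA and bitsandbytes, while `nearest`, `mean_abs_error`, the seed, and the sample count are illustrative assumptions.

```python
import random

# NF4 levels as published in the QLoRA paper / bitsandbytes.
NF4_LEVELS = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]
# 16 evenly spaced levels in [-1, 1].
UNIFORM_LEVELS = [-1.0 + 2.0 * i / 15 for i in range(16)]


def nearest(levels, x):
    """Return the level closest to x."""
    return min(levels, key=lambda v: abs(v - x))


def mean_abs_error(levels, weights, block_size=64):
    """Absmax-normalize each block, snap to the nearest level, measure error."""
    total = 0.0
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        abs_max = max(abs(w) for w in block) or 1.0  # avoid dividing by zero
        for w in block:
            total += abs(w - nearest(levels, w / abs_max) * abs_max)
    return total / len(weights)


rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(64 * 256)]
nf4_err = mean_abs_error(NF4_LEVELS, weights)
uniform_err = mean_abs_error(UNIFORM_LEVELS, weights)
print(f"NF4 {nf4_err:.4f}  uniform {uniform_err:.4f}")
```

Changing `block_size` or switching to squared error shifts the gap, so treat any single ratio as specific to the setup that produced it.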

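Production Numbers
24 GB of VRAM at ~4.25 bits per parameter holds roughly 45B parameters of weights (24 GB * 7.5 / 4 bytes per FP32 value), before activations and optimizer state.
LLaMA-33B (about 17 GB in NF4)? Fits. LLaMA-65B (about 35 GB)? Needs a 48 GB GPU, which is the setup the QLoRA paper reports.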

FP16 was a mistake for storage. NF4 is the future.

Implementation based on: bitsandbytes, Hugging Face Transformers

Types

Double Quantization: quantize the quantization constants

Genius insight from QLoRA paper:

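  • Standard NF4: 4 bits/param for data + 0.5 bits/param for FP32 scales (32 bits per 64-value block) = 4.5 bits/param
  • Double Quant: 4 bits/param for data + 0.127 bits/param for scales (8 bits per block, plus one FP32 constant per 256 blocks) = 4.127 bits/param

How? Quantize the per-block scales to 8 bits and keep one higher-precision scale for the quantized scales. Relative to the FP32 scales in the paper, this cuts metadata overhead by ~75% (0.5 to 0.127 bits/param). This module starts from FP16 scales (0.25 bits/param) and quantizes them to INT8 with one FP32 scale for all scales, so the saving here is about 50% (0.25 to ~0.125 bits/param).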

pub type DoubleQuantNF4 {
  DoubleQuantNF4(
    blocks: List(NF4Block),
    quantized_scales: List(Int),
    scales_scale: Float,
    shape: List(Int),
    num_elements: Int,
    memory_bytes: Int,
  )
}

Constructors

  • DoubleQuantNF4(
      blocks: List(NF4Block),
      quantized_scales: List(Int),
      scales_scale: Float,
      shape: List(Int),
      num_elements: Int,
      memory_bytes: Int,
    )

    Arguments

    quantized_scales

    Scales quantized to INT8 (one per block)

    scales_scale

    Global scale for the quantized scales (one FP32 for entire tensor)
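
The relationship between `quantized_scales` and `scales_scale` can be sketched as follows. This is Python rather than Gleam, and it assumes plain symmetric absmax rounding to the range 0..127; the rounding scheme this module actually uses is not documented here, and the function names are hypothetical.

```python
def quantize_scales(scales):
    """Map non-negative per-block scales to INT8 codes with one shared scale."""
    largest = max(scales)
    scales_scale = largest / 127.0 if largest > 0 else 1.0
    quantized = [round(s / scales_scale) for s in scales]
    return quantized, scales_scale


def dequantize_scales(quantized, scales_scale):
    """Recover approximate per-block scales from the INT8 codes."""
    return [q * scales_scale for q in quantized]


# Four example per-block abs_max values (illustrative numbers).
scales = [0.021, 0.034, 0.018, 0.050]
quantized, scales_scale = quantize_scales(scales)
restored = dequantize_scales(quantized, scales_scale)
print(quantized, scales_scale)
print(restored)
```

Each restored scale differs from the original by at most half a quantization step, which is the second, smaller source of error that Double Quantization adds on top of the NF4 rounding itself.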

Single NF4 block (typically 64 values).

  • Block size 64: empirically optimal tradeoff between accuracy and metadata overhead.
  • Smaller blocks (32): 2x scale overhead, marginal accuracy gain.
  • Larger blocks (128): half the scale overhead, noticeable accuracy loss.

pub type NF4Block {
  NF4Block(indices: List(Int), abs_max: Float, block_size: Int)
}

Constructors

  • NF4Block(indices: List(Int), abs_max: Float, block_size: Int)

    Arguments

    indices

    4-bit indices [0-15] mapping to nf4_levels()

    abs_max

    Per-block scale factor (max absolute value before normalization)

    block_size

    Block size for unpacking
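
A block round trip might look like the Python sketch below. It is not this module's code: the nibble layout (high nibble first, zero padding for odd lengths) and all function names are assumptions chosen for illustration, and the level values are those published with QLoRA and bitsandbytes.

```python
NF4_LEVELS = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]


def quantize_block(values):
    """Normalize by abs_max, then store the index of the nearest NF4 level."""
    abs_max = max(abs(v) for v in values) or 1.0  # all-zero block: avoid /0
    indices = [
        min(range(16), key=lambda i: abs(NF4_LEVELS[i] - v / abs_max))
        for v in values
    ]
    return indices, abs_max


def pack(indices):
    """Pack two 4-bit indices per byte, high nibble first (assumed layout)."""
    padded = indices + [0] * (len(indices) % 2)
    return bytes((padded[i] << 4) | padded[i + 1] for i in range(0, len(padded), 2))


def unpack(packed, block_size):
    """Split bytes back into 4-bit indices and drop any padding."""
    out = []
    for byte in packed:
        out.extend((byte >> 4, byte & 0x0F))
    return out[:block_size]


def dequantize_block(indices, abs_max):
    """Look up each level and rescale by the block's abs_max."""
    return [NF4_LEVELS[i] * abs_max for i in indices]


values = [0.5, -0.25, 0.0, 1.0, -1.0]
indices, abs_max = quantize_block(values)
packed = pack(indices)
restored = dequantize_block(unpack(packed, len(values)), abs_max)
print(indices, packed.hex(), restored)
```

The block length has to be carried alongside the packed bytes because padding makes an odd-length block indistinguishable from an even one, which is one plausible reason for the `block_size` field.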

NF4 configuration

pub type NF4Config {
  NF4Config(block_size: Int, double_quant: Bool)
}

Constructors

  • NF4Config(block_size: Int, double_quant: Bool)

    Arguments

    block_size

    Block size (64 is QLoRA default, don’t change unless you know why)

    double_quant

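    Double Quantization: quantize the scales themselves. Genius idea from the QLoRA paper. Reduces scale overhead by about 4x relative to FP32 scales, or about 2x relative to the FP16 scales this module stores by default.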

Quantization statistics for analysis

pub type NF4Stats {
  NF4Stats(
    original_bytes: Int,
    compressed_bytes: Int,
    compression_ratio: Float,
    mean_error: Float,
    max_error: Float,
    num_blocks: Int,
  )
}

Constructors

  • NF4Stats(
      original_bytes: Int,
      compressed_bytes: Int,
      compression_ratio: Float,
      mean_error: Float,
      max_error: Float,
      num_blocks: Int,
    )

Complete NF4-quantized tensor

pub type NF4Tensor {
  NF4Tensor(
    blocks: List(NF4Block),
    shape: List(Int),
    num_elements: Int,
    memory_bytes: Int,
    compression_ratio: Float,
  )
}

Constructors

  • NF4Tensor(
      blocks: List(NF4Block),
      shape: List(Int),
      num_elements: Int,
      memory_bytes: Int,
      compression_ratio: Float,
    )

    Arguments

    memory_bytes

    Memory in bytes: (num_elements / 2) + (num_blocks * 2)

    compression_ratio

    Effective compression ratio versus FP32 (about 7.5x with FP16 scales at block size 64; about 7.8x applies to DoubleQuantNF4)
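
As a worked instance of the `memory_bytes` formula, the Python sketch below sizes a 4096 x 4096 weight matrix. The function name is hypothetical, and rounding partial blocks and odd element counts up with `ceil` is an assumption; the formula above uses plain division.

```python
import math


def nf4_memory_bytes(num_elements: int, block_size: int = 64) -> int:
    """Packed 4-bit indices plus one FP16 (2-byte) scale per block."""
    num_blocks = math.ceil(num_elements / block_size)
    packed_indices = math.ceil(num_elements / 2)  # two indices per byte
    scales = num_blocks * 2
    return packed_indices + scales


num_elements = 4096 * 4096
compressed = nf4_memory_bytes(num_elements)
original = num_elements * 4  # FP32 bytes
ratio = original / compressed
print(compressed, f"{ratio:.2f}x")
```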

Values

pub fn benchmark_nf4() -> Nil

pub fn compute_stats(
  original: tensor.Tensor,
  nf4: NF4Tensor,
) -> NF4Stats

Compute quantization error statistics

pub fn default_config() -> NF4Config

QLoRA default configuration

pub fn dequantize(nf4: NF4Tensor) -> tensor.Tensor

Dequantize NF4 tensor back to FP32. Note: quantization error is permanent. This is NOT lossless.

pub fn double_quantize(
  t: tensor.Tensor,
  config: NF4Config,
) -> DoubleQuantNF4

Apply Double Quantization for maximum compression

pub fn main() -> Nil

pub fn nf4_levels() -> List(Float)

The 16 NF4 levels are quantiles of N(0,1) normalized to [-1, 1]. These exact values are hardcoded in bitsandbytes and used by QLoRA.

Why these specific values?

  • Level 7 (0-based index) is exactly 0.0 (critical for sparse weights)
  • More levels near zero (where weights concentrate)
  • Fewer levels at tails (where weights are rare)

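Derivation: a plain quantile(k/16) grid has no exact zero and diverges at the tails, so the QLoRA construction takes quantiles separately for each half (7 negative levels and 8 positive levels), adds an exact 0.0, then normalizes by the largest magnitude.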
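
The levels can be reproduced approximately with the Python standard library. This sketch is modelled on the construction used in bitsandbytes as I understand it; treat the `offset` constant and the helper names as assumptions rather than a description of this module. The reference values are the ones published with QLoRA.

```python
from statistics import NormalDist

# Published NF4 levels, used only to check the derivation.
REFERENCE = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]


def linspace(start, stop, n):
    """n evenly spaced points from start to stop, inclusive."""
    step = (stop - start) / (n - 1)
    return [start + i * step for i in range(n)]


def derive_nf4_levels(offset=0.9677083):
    """Asymmetric quantile construction: 7 negative, exact zero, 8 positive."""
    ppf = NormalDist().inv_cdf
    # Drop the final point (p = 0.5, quantile 0) from each half; zero is
    # added explicitly so that it is represented exactly once.
    positive = [ppf(p) for p in linspace(offset, 0.5, 9)[:-1]]
    negative = [-ppf(p) for p in linspace(offset, 0.5, 8)[:-1]]
    levels = sorted(negative + [0.0] + positive)
    largest = max(abs(v) for v in levels)
    return [v / largest for v in levels]


levels = derive_nf4_levels()
max_dev = max(abs(a - b) for a, b in zip(levels, REFERENCE))
print(levels)
print(f"max deviation from published values: {max_dev:.2e}")
```

The two halves use different numbers of quantile points, which is why the negative and positive levels are not mirror images of each other.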

pub fn quantize(t: tensor.Tensor, config: NF4Config) -> NF4Tensor

Quantize tensor to NF4

Compression: 32/4 = 8x theoretical, ~7.5x with FP16 scales per block. Error: ~0.1% mean absolute error is quoted for Gaussian-distributed weights; compute_stats reports the measured figure for a given tensor.
