viva_tensor/quant/nf4

NF4 (NormalFloat4) Quantization - QLoRA Style

Reference: Dettmers et al. (2023) - “QLoRA: Efficient Finetuning of Quantized LLMs” https://arxiv.org/abs/2305.14314

The Key Insight

Neural network weights are approximately normally distributed. Standard 4-bit quantization spaces its 16 levels uniformly, which spends levels on the sparsely populated tails. NF4 instead places its 16 levels at quantiles of N(0,1). Result: more precision where weights concentrate (near zero).

Compression Math

NF4: 32-bit / 4-bit = 8x theoretical
With block scaling overhead (one FP16 scale per 64 values): ~7.5x effective
With Double Quantization (quantize the scales too): ~7.8x effective
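The ratios above follow from counting bits per weight. The snippet below is a minimal Python sketch of that arithmetic, not part of this module; the function names are hypothetical, and the per-tensor FP32 constant used by double quantization is ignored because it is negligible for large tensors.

```python
def bits_per_param(block_size=64, scale_bits=16, data_bits=4):
    """Bits stored per weight: the 4-bit index plus the block scale
    amortized over every value in the block."""
    return data_bits + scale_bits / block_size


def compression_ratio(bits, original_bits=32):
    """Compression relative to FP32 storage."""
    return original_bits / bits


fp16_scales = bits_per_param(scale_bits=16)  # 4.25 bits per weight
int8_scales = bits_per_param(scale_bits=8)   # 4.125 bits per weight

print(round(compression_ratio(fp16_scales), 2))  # 7.53
print(round(compression_ratio(int8_scales), 2))  # 7.76
```

Changing `block_size` shows the tradeoff directly: 32 values per block costs 4.5 bits per weight, 128 values costs 4.125.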

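Why NF4 Beats Uniform Q4

Uniform Q4: 16 evenly spaced levels in [-1, 1]
NF4: 16 levels at normal distribution quantiles
For Gaussian weights: NF4 has lower quantization error. Roughly 2x is the figure quoted here; the actual gain depends on block size and on how closely the weights follow a Gaussian.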

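Production Numbers

24 GB of VRAM with NF4: room for the weights of roughly 45B parameters (24 GB * 8 bits / 4.25 bits per parameter). Equivalently, 24 GB * 7.5 = 180 GB of FP32 weights, which is 45B parameters at 4 bytes each.
LLaMA-33B (~17.5 GB of weights) in 24 GB? Yes. LLaMA-65B (~34.5 GB)? No: the QLoRA paper uses a single 48 GB GPU for 65B.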
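The capacity arithmetic can be checked with a few lines of Python. This is a rough sketch under stated assumptions: 1 GB is taken as 10^9 bytes, only weight storage is counted (no activations, KV cache, or optimizer state), and the helper names are made up for illustration.

```python
GB = 10**9  # assumption: decimal gigabytes


def max_params(vram_bytes, bits_per_weight=4.25):
    """How many weights fit if every weight costs bits_per_weight bits."""
    return int(vram_bytes * 8 / bits_per_weight)


def weight_bytes(num_params, bits_per_weight=4.25):
    """Bytes needed to store num_params weights."""
    return num_params * bits_per_weight / 8


print(max_params(24 * GB) / 1e9)  # about 45.2 (billion parameters)
print(weight_bytes(33e9) / GB)    # about 17.5 (GB)
print(weight_bytes(65e9) / GB)    # about 34.5 (GB)
```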

FP16 was a mistake for storage. NF4 is the future.

Implementation based on: bitsandbytes, Hugging Face Transformers

Types

DoubleQuantNF4

Double Quantization: quantize the quantization constants

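Genius insight from the QLoRA paper:

Standard NF4: 4 bits/param for data + 0.25 bits/param for FP16 scales (16 bits / 64 values) = 4.25 bits/param
Double Quant: 4 bits/param for data + ~0.125 bits/param for INT8 scales (8 bits / 64 values, plus one FP32 per tensor) = ~4.125 bits/param

How? Quantize the FP16 scales to INT8, with one FP32 scale for all scales. This halves the scale overhead relative to FP16 scales. The QLoRA paper starts from FP32 scales (0.5 bits/param) and keeps one FP32 second-level constant per 256 blocks, reaching 0.127 bits/param, a reduction of about 75%.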
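A minimal Python sketch of the scale quantization step follows. It is an illustration only: the symmetric mapping onto 0..127 and the function names are assumptions, and the module's actual rounding and code range may differ (the QLoRA paper, for instance, also subtracts the mean of the scales first).

```python
def quantize_scales(scales):
    """Map non-negative per-block scales to small integer codes plus one
    shared float scale (assumed symmetric INT8 range, codes in 0..127)."""
    # Fall back to 1.0 when every scale is zero to avoid dividing by zero.
    scales_scale = max(scales) / 127.0 or 1.0
    codes = [round(s / scales_scale) for s in scales]
    return codes, scales_scale


def dequantize_scales(codes, scales_scale):
    """Recover approximate per-block scales from the integer codes."""
    return [c * scales_scale for c in codes]


codes, scales_scale = quantize_scales([0.5, 1.0, 0.25, 2.0])
restored = dequantize_scales(codes, scales_scale)
```

Each restored scale differs from the original by at most half of `scales_scale`, which is the extra error double quantization trades for the smaller metadata.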

pub type DoubleQuantNF4 {
  DoubleQuantNF4(
    blocks: List(NF4Block),
    quantized_scales: List(Int),
    scales_scale: Float,
    shape: List(Int),
    num_elements: Int,
    memory_bytes: Int,
  )
}

Constructors

DoubleQuantNF4(
  blocks: List(NF4Block),
  quantized_scales: List(Int),
  scales_scale: Float,
  shape: List(Int),
  num_elements: Int,
  memory_bytes: Int,
)

Arguments

quantized_scales: Scales quantized to INT8 (one per block)
scales_scale: Global scale for the quantized scales (one FP32 for entire tensor)

NF4Block

Single NF4 block (typically 64 values).

Block size 64: empirically a good tradeoff between accuracy and metadata overhead.
Smaller blocks (32): 2x the scale overhead, marginal accuracy gain.
Larger blocks (128): half the scale overhead, noticeable accuracy loss.

pub type NF4Block {
  NF4Block(indices: List(Int), abs_max: Float, block_size: Int)
}

Constructors

```
NF4Block(indices: List(Int), abs_max: Float, block_size: Int)
```

Arguments

indices

4-bit indices [0-15] mapping to nf4_levels()

abs_max

Per-block scale factor (max absolute value before normalization)

block_size

Number of values in the block; needed when unpacking the indices
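To illustrate why a block carries its size, here is a hedged Python sketch of packing 4-bit indices two per byte. The packed layout (first index in the high nibble, zero padding for odd lengths) is an assumption for illustration; the Gleam type above stores indices as a plain `List(Int)`.

```python
def pack_indices(indices):
    """Pack 4-bit indices two per byte, first index in the high nibble."""
    if len(indices) % 2:
        indices = indices + [0]  # pad odd-length input with a dummy index
    return bytes((hi << 4) | lo for hi, lo in zip(indices[0::2], indices[1::2]))


def unpack_indices(packed, block_size):
    """Unpack bytes back to indices; block_size drops any padding nibble."""
    out = []
    for byte in packed:
        out.append(byte >> 4)
        out.append(byte & 0x0F)
    return out[:block_size]


packed = pack_indices([0, 15, 7, 8, 3])
print(len(packed))                # 3
print(unpack_indices(packed, 5))  # [0, 15, 7, 8, 3]
```

Without `block_size`, the unpacker could not tell a real trailing index of 0 from padding.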

NF4Config

NF4 configuration

pub type NF4Config {
  NF4Config(block_size: Int, double_quant: Bool)
}

Constructors

```
NF4Config(block_size: Int, double_quant: Bool)
```

Arguments

block_size

Block size (64 is QLoRA default, don’t change unless you know why)

double_quant

Double Quantization: quantize the scales themselves. Genius idea from the QLoRA paper. Halves the scale overhead relative to FP16 scales (about 4x relative to the FP32 scales used as the baseline in the paper).

NF4Stats

Quantization statistics for analysis

pub type NF4Stats {
  NF4Stats(
    original_bytes: Int,
    compressed_bytes: Int,
    compression_ratio: Float,
    mean_error: Float,
    max_error: Float,
    num_blocks: Int,
  )
}

Constructors

NF4Stats(
  original_bytes: Int,
  compressed_bytes: Int,
  compression_ratio: Float,
  mean_error: Float,
  max_error: Float,
  num_blocks: Int,
)

NF4Tensor

Complete NF4-quantized tensor

pub type NF4Tensor {
  NF4Tensor(
    blocks: List(NF4Block),
    shape: List(Int),
    num_elements: Int,
    memory_bytes: Int,
    compression_ratio: Float,
  )
}

Constructors

NF4Tensor(
  blocks: List(NF4Block),
  shape: List(Int),
  num_elements: Int,
  memory_bytes: Int,
  compression_ratio: Float,
)

Arguments

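memory_bytes: Memory in bytes: (num_elements / 2) + (num_blocks * 2), i.e. two 4-bit indices per byte plus one 2-byte FP16 scale per block
compression_ratio: Effective compression ratio relative to FP32 (about 7.5x at block size 64; ~7.8x requires double quantization)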

Values

benchmark_nf4

pub fn benchmark_nf4() -> Nil

compute_stats

pub fn compute_stats(
  original: tensor.Tensor,
  nf4: NF4Tensor,
) -> NF4Stats

Compute quantization error statistics
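The fields of `NF4Stats` can be reproduced with a short calculation. The Python sketch below assumes that `mean_error` and `max_error` are absolute errors, that the original tensor is FP32, and that scales are stored as FP16; `Stats` and `compute_stats_sketch` are hypothetical names, not this module's API.

```python
import math
from dataclasses import dataclass


@dataclass
class Stats:
    original_bytes: int
    compressed_bytes: int
    compression_ratio: float
    mean_error: float
    max_error: float
    num_blocks: int


def compute_stats_sketch(original, restored, block_size=64):
    """Compare original values with their dequantized counterparts."""
    n = len(original)
    num_blocks = math.ceil(n / block_size)
    original_bytes = n * 4                               # FP32 input
    compressed_bytes = math.ceil(n / 2) + num_blocks * 2  # nibbles + FP16 scales
    errors = [abs(a - b) for a, b in zip(original, restored)]
    return Stats(
        original_bytes=original_bytes,
        compressed_bytes=compressed_bytes,
        compression_ratio=original_bytes / compressed_bytes,
        mean_error=sum(errors) / n,
        max_error=max(errors),
        num_blocks=num_blocks,
    )


stats = compute_stats_sketch([0.0, 1.0, -1.0, 0.5], [0.0, 0.9, -1.0, 0.6], block_size=2)
```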

default_config

pub fn default_config() -> NF4Config

QLoRA default configuration

dequantize

pub fn dequantize(nf4: NF4Tensor) -> tensor.Tensor

Dequantize an NF4 tensor back to FP32. Note: quantization error is permanent. The result approximates the original tensor; this is NOT lossless.

double_quantize

pub fn double_quantize(
  t: tensor.Tensor,
  config: NF4Config,
) -> DoubleQuantNF4

Apply Double Quantization for maximum compression

main

pub fn main() -> Nil

nf4_levels

pub fn nf4_levels() -> List(Float)

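The 16 NF4 levels are quantiles of N(0,1) normalized to [-1, 1]. These exact values are hardcoded in bitsandbytes and used by QLoRA.

Why these specific values?

The level at index 7 (zero-based) is exactly 0.0 (critical for sparse weights)
More levels near zero (where weights concentrate)
Fewer levels at tails (where weights are rare)

Derivation (per the QLoRA paper): estimate 8 quantiles for the negative half and 9 for the positive half of N(0,1), merge the two sets so that they share a single zero, then divide by the largest magnitude. The result is asymmetric: 7 negative levels, zero, and 8 positive levels. Plain quantile(k/16) for k in 1..16 cannot be used directly because quantile(1) is infinite.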
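For reference, the Python sketch below lists the 16 values as published with QLoRA and re-derives them approximately from normal quantiles. The derivation mirrors, as far as I recall it, the approach in bitsandbytes; the `offset` of 0.9677083 and the helper names are assumptions, so compare the table against `nf4_levels()` before relying on it.

```python
from statistics import NormalDist

# Values as published with QLoRA / bitsandbytes (float32 precision).
NF4_LEVELS = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]


def linspace(start, stop, n):
    """n evenly spaced points from start to stop, inclusive."""
    step = (stop - start) / (n - 1)
    return [start + i * step for i in range(n)]


def derive_nf4_levels(offset=0.9677083):
    """Asymmetric quantile construction: 8 positive levels, 7 negative, one zero."""
    ppf = NormalDist().inv_cdf
    positive = [ppf(p) for p in linspace(offset, 0.5, 9)[:-1]]
    negative = [-ppf(p) for p in linspace(offset, 0.5, 8)[:-1]]
    levels = sorted(negative + [0.0] + positive)
    largest = max(abs(v) for v in levels)
    return [v / largest for v in levels]  # rescale so the extremes are -1 and 1


derived = derive_nf4_levels()
# Largest disagreement between the derived and the published levels.
print(max(abs(a - b) for a, b in zip(derived, NF4_LEVELS)))

# The narrowest gap sits next to zero and the widest at the tails.
gaps = [b - a for a, b in zip(NF4_LEVELS, NF4_LEVELS[1:])]
print(min(gaps), max(gaps))
```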

quantize

pub fn quantize(t: tensor.Tensor, config: NF4Config) -> NF4Tensor

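Quantize a tensor to NF4.

Compression: 32/4 = 8x theoretical, ~7.5x with FP16 scales per block.
Error: quoted as ~0.1% mean absolute error for Gaussian-distributed weights; the actual figure depends on the weight distribution and block size, so measure it with compute_stats.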
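The round trip can be sketched in a few lines of Python. This is a simplified illustration of blockwise absmax quantization to the nearest NF4 level, operating on flat lists; it is not this module's Gleam implementation, and the function names are hypothetical.

```python
# NF4 levels as published with QLoRA / bitsandbytes.
NF4_LEVELS = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]


def quantize_block(values):
    """Normalize by the block's absolute maximum, then pick the nearest level."""
    abs_max = max(abs(v) for v in values)
    scale = abs_max if abs_max > 0 else 1.0  # an all-zero block maps to level 7
    indices = [
        min(range(16), key=lambda i: abs(NF4_LEVELS[i] - v / scale))
        for v in values
    ]
    return indices, abs_max


def quantize_sketch(values, block_size=64):
    """Split a flat list into blocks and quantize each one independently."""
    return [
        quantize_block(values[start:start + block_size])
        for start in range(0, len(values), block_size)
    ]


def dequantize_sketch(blocks):
    """Look up each index and rescale by the block's abs_max."""
    return [NF4_LEVELS[i] * abs_max for indices, abs_max in blocks for i in indices]


blocks = quantize_sketch([0.0, 2.0, -2.0, 1.0], block_size=4)
print(blocks[0][0])               # [7, 15, 0, 12]
print(dequantize_sketch(blocks))  # 1.0 comes back as about 0.881
```

Zero and the block extremes survive the round trip exactly; everything else snaps to the nearest level, which is where the permanent quantization error comes from.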