viva_tensor/quant

Quantization - NF4 / INT8 Compression

Run larger models in the same amount of memory by compressing weights to NF4 (4-bit) or INT8 (8-bit).

Values

pub fn matmul_nf4(
  a: ffi.NativeTensorRef,
  b_indices: List(Int),
  b_scales: List(Float),
  m: Int,
  n: Int,
  k: Int,
  block_size: Int,
) -> ffi.NativeTensorRef

Perform Matrix Multiplication with NF4 Quantized Weights

C = A @ Quantized(B)

B is dequantized on the fly during the computation rather than expanded to full precision beforehand. This trades extra compute per element for lower memory-bandwidth use.

  • a: Input activations (NativeTensorRef)
  • b_indices: Packed 4-bit NF4 indices for the weights (2 per byte)
  • b_scales: Scale factors, one per block of weights
  • m, n, k: Matrix dimensions (Int)
  • block_size: Number of weights per quantization block (usually 64 or 128)
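
To make the on-the-fly dequantization concrete, here is a minimal reference sketch in plain Python. It is not the library's implementation or API. It rests on several assumptions that this page does not confirm: the 16-level NF4 codebook published with QLoRA/bitsandbytes, high-nibble-first packing, A stored as m rows of k values, B stored row-major as k x n, and one scale per `block_size` consecutive weights. All helper names are hypothetical.

```python
# NF4 codebook as published with QLoRA / bitsandbytes (16 levels in [-1, 1]).
# Assumption: viva_tensor uses the same table.
NF4_LEVELS = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]


def unpack_nibbles(packed):
    """Split each byte into two 4-bit indices (assumption: high nibble first)."""
    indices = []
    for byte in packed:
        indices.append((byte >> 4) & 0x0F)
        indices.append(byte & 0x0F)
    return indices


def dequantize_nf4(packed, scales, block_size):
    """Map each index to its codebook level and multiply by its block's scale."""
    indices = unpack_nibbles(packed)
    return [
        NF4_LEVELS[idx] * scales[i // block_size]
        for i, idx in enumerate(indices)
    ]


def matmul_nf4_reference(a, packed, scales, m, n, k, block_size):
    """Compute C = A @ dequantize(B) with naive loops.

    Assumptions: `a` is a list of m rows with k floats each, and the
    dequantized B is a flat row-major list of k * n floats.
    """
    b = dequantize_nf4(packed, scales, block_size)
    return [
        [sum(a[i][p] * b[p * n + j] for p in range(k)) for j in range(n)]
        for i in range(m)
    ]
```

A real kernel would dequantize one block at a time inside the inner loop instead of expanding all of B first. The sketch expands B up front only to keep the arithmetic easy to follow.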