viva_tensor/optim/blackwell
Blackwell-Inspired Compression Engine
INSPIRED BY THE NVIDIA BLACKWELL ULTRA ARCHITECTURE:
- NVFP4: Two-level scaling (micro-block FP8 E4M3 + tensor-level FP32)
- Hardware decompression: 800 GB/s
- Micro-block size: 16 values
- Memory hierarchy: HBM3e → L2 → L1 → Registers
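The two-level scaling in the first bullet can be written out as below. This is a sketch of the general NVFP4-style scheme, not a statement of how this module implements it; the symbols q_i, s_b and s_g are notation introduced here for illustration.

```latex
\hat{x}_i = q_i \cdot s_b \cdot s_g
\qquad
\text{bits per value} \approx 4 + \frac{8}{16} = 4.5
```

Here q_i is the 4-bit code of element i, s_b is the FP8 E4M3 scale shared by the 16 values of micro-block b, and s_g is the single FP32 tensor-level scale. The 4.5-bit figure assumes one 8-bit scale per 16-value block and ignores the one global scale.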
GLEAM DIFFERENTIATOR:
- GenServer actors to manage memory chunks
- OTP supervisors for fault tolerance
- Zero-copy views via Erlang binaries
- BEAM schedulers for massive parallelism
SILICON PHYSICS (real limits):
- 8-bit multiplier: 64 area units
- 32-bit multiplier: 576 units (9x larger!)
- HBM4 (2026): 2 TB/s per stack
- Blackwell: 8 TB/s HBM3e bandwidth
GOAL: Make Pure Gleam compete with dedicated hardware!
Types
Blackwell-style compressed tensor
pub type BlackwellTensor {
BlackwellTensor(
blocks: List(MicroBlock),
global_scale: Float,
shape: List(Int),
num_elements: Int,
memory_bytes: Int,
compression_ratio: Float,
)
}
Constructors
- BlackwellTensor
Arguments
- blocks: Micro-blocks of 16 values each
- global_scale: Global tensor scale (FP32)
- shape: Original shape
- num_elements: Number of elements
- memory_bytes: Memory in bytes (actual)
- compression_ratio: Achieved compression ratio
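To make memory_bytes and compression_ratio concrete, here is a minimal sketch of how such figures could be estimated. It assumes packed 4-bit values, a 1-byte scale per micro-block and one 4-byte global scale; the function names are hypothetical and the module's own accounting may differ (for example, if it also counts zero_point).

```gleam
import gleam/int

/// Estimated compressed size in bytes (assumption: packed values,
/// 1 byte of scale per micro-block, 4 bytes for the global scale).
pub fn estimated_bytes(num_elements: Int, block_size: Int, bits: Int) -> Int {
  // Ceiling division: a partial trailing block still carries a scale.
  let num_blocks = { num_elements + block_size - 1 } / block_size
  // Round the packed payload up to whole bytes.
  let payload_bytes = { num_elements * bits + 7 } / 8
  payload_bytes + num_blocks + 4
}

/// Ratio of the original FP32 size to the estimated compressed size.
pub fn estimated_ratio(num_elements: Int, block_size: Int, bits: Int) -> Float {
  let original_bytes = num_elements * 4
  int.to_float(original_bytes)
  /. int.to_float(estimated_bytes(num_elements, block_size, bits))
}
```

Under these assumptions, 1024 FP32 values (4096 bytes) with 16-value blocks at 4 bits come to 512 + 64 + 4 = 580 bytes, a ratio of roughly 7x.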
Compression configuration
pub type CompressionConfig {
CompressionConfig(
block_size: Int,
bits_per_value: Int,
symmetric: Bool,
max_error_pct: Float,
)
}
Constructors
- CompressionConfig
Arguments
- block_size: Micro-block size (default: 16 for NVFP4)
- bits_per_value: Bits per value (4 for NVFP4, 8 for INT8)
- symmetric: Use symmetric quantization
- max_error_pct: Maximum error tolerance
Compression statistics
pub type CompressionStats {
CompressionStats(
original_bytes: Int,
compressed_bytes: Int,
compression_ratio: Float,
mean_error: Float,
max_error: Float,
blocks_processed: Int,
)
}
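The mean_error and max_error fields suggest a comparison between the original values and the decompressed ones. A minimal sketch of that comparison follows, assuming absolute error over flat lists; `abs_errors` is a hypothetical helper, and the module may define its error metric differently (for example, as a relative error).

```gleam
import gleam/float
import gleam/int
import gleam/list

/// Returns #(mean absolute error, max absolute error) for two lists.
/// If the lengths differ, the extra elements of the longer list are ignored.
pub fn abs_errors(
  original: List(Float),
  restored: List(Float),
) -> #(Float, Float) {
  let errors =
    list.zip(original, restored)
    |> list.map(fn(pair) { float.absolute_value(pair.0 -. pair.1) })
  let max_error = list.fold(errors, 0.0, float.max)
  let count = list.length(errors)
  let mean_error = case count {
    // An empty input has no error to report.
    0 -> 0.0
    _ -> float.sum(errors) /. int.to_float(count)
  }
  #(mean_error, max_error)
}
```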
Distribution statistics
pub type DistributionStats {
DistributionStats(
mean: Float,
std: Float,
min_val: Float,
max_val: Float,
dynamic_range: Float,
sparsity: Float,
)
}
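As an illustration of what the DistributionStats fields could hold, the sketch below computes them for a flat list of floats. The `Summary` type and `summarize` function are hypothetical stand-ins, and two definitions are assumptions: dynamic_range is taken as max minus min, and sparsity as the fraction of exact zeros.

```gleam
import gleam/float
import gleam/int
import gleam/list

/// Hypothetical stand-in mirroring the DistributionStats fields.
pub type Summary {
  Summary(
    mean: Float,
    std: Float,
    min_val: Float,
    max_val: Float,
    dynamic_range: Float,
    sparsity: Float,
  )
}

pub fn summarize(values: List(Float)) -> Summary {
  case values {
    [] -> Summary(0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
    [first, ..rest] -> {
      let n = int.to_float(list.length(values))
      let mean = float.sum(values) /. n
      // Population variance: mean of squared deviations.
      let squared =
        list.fold(values, 0.0, fn(acc, v) {
          acc +. { v -. mean } *. { v -. mean }
        })
      let std = case float.square_root(squared /. n) {
        Ok(s) -> s
        Error(_) -> 0.0
      }
      let min_val = list.fold(rest, first, float.min)
      let max_val = list.fold(rest, first, float.max)
      let zeros = list.length(list.filter(values, fn(v) { v == 0.0 }))
      Summary(
        mean: mean,
        std: std,
        min_val: min_val,
        max_val: max_val,
        dynamic_range: max_val -. min_val,
        sparsity: int.to_float(zeros) /. n,
      )
    }
  }
}
```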
Level in the memory hierarchy
pub type MemoryLevel {
Registers
L1Cache
L2Cache
Hbm
SystemRam
Storage
}
Constructors
- Registers: Registers (fastest, ~10 KB)
- L1Cache: L1 Cache (~128 KB, 100+ GB/s)
- L2Cache: L2 Cache (~6 MB, 50 GB/s)
- Hbm: HBM/DRAM (~24 GB, 8 TB/s for Blackwell)
- SystemRam: System RAM (~32 GB, 50 GB/s)
- Storage: NVMe SSD (~1 TB, 7 GB/s)
Micro-block of 16 values (Blackwell NVFP4 inspired)
pub type MicroBlock {
MicroBlock(values: List(Int), scale: Float, zero_point: Float)
}
Constructors
- MicroBlock
Arguments
- values: Quantized data (4-bit each, packed)
- scale: Micro-block scale (simulated FP8 E4M3)
- zero_point: Zero-point for negative values
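The relationship between values and scale can be sketched with a simple symmetric quantizer. This is an illustrative assumption, not the module's algorithm: it maps each float to a signed integer code in the range -7 to 7, omits zero_point, and keeps the scale as a plain Float rather than simulating FP8 E4M3. `Block`, `quantize_block` and `dequantize_block` are hypothetical names.

```gleam
import gleam/float
import gleam/int
import gleam/list

/// Hypothetical stand-in for MicroBlock, without the zero point.
pub type Block {
  Block(values: List(Int), scale: Float)
}

/// Quantizes one block of floats to signed 4-bit codes in -7..7.
pub fn quantize_block(data: List(Float)) -> Block {
  let max_abs =
    list.fold(data, 0.0, fn(acc, x) { float.max(acc, float.absolute_value(x)) })
  // Keep the scale non-zero when every value in the block is zero.
  let scale = case max_abs == 0.0 {
    True -> 1.0
    False -> max_abs /. 7.0
  }
  let values =
    list.map(data, fn(x) { int.clamp(float.round(x /. scale), min: -7, max: 7) })
  Block(values: values, scale: scale)
}

/// Reconstructs approximate floats from the codes and the block scale.
pub fn dequantize_block(block: Block) -> List(Float) {
  list.map(block.values, fn(q) { int.to_float(q) *. block.scale })
}
```

Because the scale is derived from the largest magnitude in the block, a single outlier coarsens the resolution of the other values in that block, which is one motivation for keeping blocks as small as 16 values.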
Streaming data chunk
pub type StreamChunk {
StreamChunk(id: Int, block: MicroBlock, compressed: Bool)
}
Streaming compressor state
pub type StreamState {
StreamState(
config: CompressionConfig,
processed_chunks: Int,
total_bytes_in: Int,
total_bytes_out: Int,
)
}
Values
pub fn analyze_and_compress(t: tensor.Tensor) -> BlackwellTensor
Analyzes the tensor and compresses it with the best-fitting configuration
pub fn benchmark_blackwell_compression() -> Nil
pub fn compress(
t: tensor.Tensor,
config: CompressionConfig,
) -> BlackwellTensor
Compresses a tensor using NVFP4-style micro-block quantization
pub fn compression_stats(
original: tensor.Tensor,
compressed: BlackwellTensor,
) -> CompressionStats
Computes compression statistics
pub fn decompress(bt: BlackwellTensor) -> tensor.Tensor
Decompresses a BlackwellTensor back to FP32
pub fn memory_bandwidth_gbps(level: MemoryLevel) -> Float
Returns the simulated bandwidth of a memory level in GB/s
pub fn new_stream(config: CompressionConfig) -> StreamState
Creates a new streaming state
pub fn nvfp4_config() -> CompressionConfig
Default NVFP4 configuration (Blackwell style)
pub fn process_chunk(
state: StreamState,
data: List(Float),
) -> #(StreamState, MicroBlock)
Processes one data chunk in streaming mode, returning the updated state and the resulting micro-block
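Since process_chunk returns a new StreamState rather than mutating one, callers thread the state through successive calls. The sketch below shows that state-threading pattern with a hypothetical `Totals` record and byte accounting that is assumed, not taken from the module (4 bytes per input float, packed 4-bit output plus 1 byte of scale per chunk).

```gleam
import gleam/list

/// Hypothetical stand-in for the counters kept in StreamState.
pub type Totals {
  Totals(chunks: Int, bytes_in: Int, bytes_out: Int)
}

/// Folds one chunk into the running totals and returns the new totals.
pub fn account_chunk(totals: Totals, chunk: List(Float)) -> Totals {
  let n = list.length(chunk)
  Totals(
    chunks: totals.chunks + 1,
    // Assumption: every input value is a 4-byte FP32.
    bytes_in: totals.bytes_in + n * 4,
    // Assumption: two 4-bit codes per byte, plus a 1-byte scale.
    bytes_out: totals.bytes_out + { n + 1 } / 2 + 1,
  )
}

/// Splits the data into chunks and threads the totals through each one.
pub fn account_stream(data: List(Float), block_size: Int) -> Totals {
  data
  |> list.sized_chunk(block_size)
  |> list.fold(Totals(0, 0, 0), account_chunk)
}
```

The same fold shape applies to the real API: the accumulator would be the StreamState and each step would keep the first element of the tuple that process_chunk returns.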
pub fn transfer_time_us(
size_mb: Float,
level: MemoryLevel,
) -> Float
Computes the transfer time in microseconds for size_mb megabytes at the given memory level
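The unit conversion behind a transfer-time estimate can be sketched as follows. This is an assumption about the arithmetic, not the module's implementation: it uses decimal units (1 GB = 1000 MB), takes the bandwidth as a plain number instead of a MemoryLevel, and ignores latency. The function name is hypothetical.

```gleam
/// Transfer time in microseconds for `size_mb` megabytes at `gbps` GB/s.
/// Derivation: size_mb / 1000 gives GB, dividing by GB/s gives seconds,
/// and multiplying by 1_000_000 gives microseconds, which simplifies to
/// size_mb * 1000 / gbps.
pub fn transfer_time_us_sketch(size_mb: Float, gbps: Float) -> Float {
  size_mb *. 1000.0 /. gbps
}
```

With the figures listed for MemoryLevel, 8 MB at 8 TB/s (8000 GB/s) would take about 1 microsecond, while 7 MB at 7 GB/s would take about 1000 microseconds.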