blackwell

Blackwell-Inspired Compression Engine

INSPIRED BY THE NVIDIA BLACKWELL ULTRA ARCHITECTURE:

NVFP4: Two-level scaling (micro-block FP8 E4M3 + tensor-level FP32)
Hardware decompression: 800 GB/s
Micro-block size: 16 values
Memory hierarchy: HBM3e → L2 → L1 → Registers
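The two-level scaling idea can be sketched in a few lines of Gleam. This is only an illustration, not the module's implementation: the names `abs_max` and `two_level_scales` are hypothetical, the FP4 (E2M1) and FP8 (E4M3) maxima of 6.0 and 448.0 are assumed, rounding of block scales to real FP8 is not modelled, and `block_size` is assumed to be positive.

```gleam
import gleam/float
import gleam/list

/// Largest magnitude in a list (0.0 for an empty list).
fn abs_max(values: List(Float)) -> Float {
  list.fold(values, 0.0, fn(acc, x) { float.max(acc, float.absolute_value(x)) })
}

/// Returns #(global_scale, per_block_scales) for blocks of `block_size`.
/// Assumptions: FP4 E2M1 tops out at 6.0 and FP8 E4M3 at 448.0.
pub fn two_level_scales(
  data: List(Float),
  block_size: Int,
) -> #(Float, List(Float)) {
  // Scale each block so its largest magnitude maps to the FP4 maximum.
  let raw =
    data
    |> list.sized_chunk(block_size)
    |> list.map(fn(block) { abs_max(block) /. 6.0 })
  // Pick the global scale so the largest block scale maps to the FP8 maximum.
  let global = case abs_max(raw) /. 448.0 {
    0.0 -> 1.0
    g -> g
  }
  #(global, list.map(raw, fn(s) { s /. global }))
}
```

Under this scheme a value would be reconstructed as its 4-bit code times the block scale times the global scale, so the per-block scales stay within the narrow FP8 range while the single FP32 scale absorbs the tensor's overall magnitude.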

GLEAM DIFFERENTIATOR:

GenServer actors to manage memory chunks
OTP supervisors for fault tolerance
Zero-copy views via Erlang binaries
BEAM schedulers for massive parallelism

SILICON PHYSICS (real limits):

8-bit multiplier: 64 area units
32-bit multiplier: 576 units (9x larger!)
HBM4 (2026): 2 TB/s per chip
Blackwell: 8 TB/s HBM3e bandwidth

GOAL: Make Pure Gleam compete with dedicated hardware!

Types

BlackwellTensor

Blackwell-style compressed tensor

pub type BlackwellTensor {
  BlackwellTensor(
    blocks: List(MicroBlock),
    global_scale: Float,
    shape: List(Int),
    num_elements: Int,
    memory_bytes: Int,
    compression_ratio: Float,
  )
}

Constructors

```
BlackwellTensor(
  blocks: List(MicroBlock),
  global_scale: Float,
  shape: List(Int),
  num_elements: Int,
  memory_bytes: Int,
  compression_ratio: Float,
)
```
Arguments

blocks

Micro-blocks of 16 values each

global_scale

Global tensor scale (FP32)

shape

Original shape

num_elements

Number of elements

memory_bytes

Memory in bytes (actual)

compression_ratio

Achieved compression ratio
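As a rough illustration of how `memory_bytes` and `compression_ratio` could relate to `num_elements`, the sketch below estimates the compressed size. The function names and the byte accounting are assumptions for illustration (values packed at `bits_per_value` bits each, one 1-byte scale per micro-block, one 4-byte global scale); the module's own bookkeeping may differ.

```gleam
import gleam/int

/// Hypothetical estimate of the compressed size in bytes.
/// Assumes one 1-byte (FP8) scale per micro-block and one 4-byte (FP32)
/// global scale on top of the packed values.
pub fn estimated_bytes(
  num_elements: Int,
  block_size: Int,
  bits_per_value: Int,
) -> Int {
  // Round up so a partial trailing block or byte is still counted.
  let num_blocks = { num_elements + block_size - 1 } / block_size
  let payload_bytes = { num_elements * bits_per_value + 7 } / 8
  payload_bytes + num_blocks + 4
}

/// Ratio of the original FP32 size (4 bytes per element) to the estimate.
pub fn estimated_ratio(
  num_elements: Int,
  block_size: Int,
  bits_per_value: Int,
) -> Float {
  let original = int.to_float(num_elements * 4)
  let compressed =
    int.to_float(estimated_bytes(num_elements, block_size, bits_per_value))
  original /. compressed
}
```

Under these assumptions a 1,024-element tensor with 16-value blocks and 4-bit values takes 580 bytes instead of 4,096, a ratio of roughly 7.06.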

CompressionConfig

Compression configuration

pub type CompressionConfig {
  CompressionConfig(
    block_size: Int,
    bits_per_value: Int,
    symmetric: Bool,
    max_error_pct: Float,
  )
}

Constructors

```
CompressionConfig(
  block_size: Int,
  bits_per_value: Int,
  symmetric: Bool,
  max_error_pct: Float,
)
```
Arguments

block_size

Micro-block size (default: 16 for NVFP4)

bits_per_value

Bits per value (4 for NVFP4, 8 for INT8)

symmetric

Use symmetric quantization

max_error_pct

Maximum error tolerance, as a percentage

CompressionStats

Compression statistics

pub type CompressionStats {
  CompressionStats(
    original_bytes: Int,
    compressed_bytes: Int,
    compression_ratio: Float,
    mean_error: Float,
    max_error: Float,
    blocks_processed: Int,
  )
}

Constructors

CompressionStats(
  original_bytes: Int,
  compressed_bytes: Int,
  compression_ratio: Float,
  mean_error: Float,
  max_error: Float,
  blocks_processed: Int,
)

DistributionStats

Distribution statistics

pub type DistributionStats {
  DistributionStats(
    mean: Float,
    std: Float,
    min_val: Float,
    max_val: Float,
    dynamic_range: Float,
    sparsity: Float,
  )
}

Constructors

DistributionStats(
  mean: Float,
  std: Float,
  min_val: Float,
  max_val: Float,
  dynamic_range: Float,
  sparsity: Float,
)
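A minimal sketch of how some of these fields might be derived from raw values follows. The helper `summarize` is hypothetical, and two definitions are assumed rather than taken from the module: `std` as the population standard deviation and `sparsity` as the fraction of exact zeros.

```gleam
import gleam/float
import gleam/int
import gleam/list

/// Returns #(mean, std, sparsity) for a list of floats.
/// Assumptions: population standard deviation; sparsity = share of exact zeros.
pub fn summarize(values: List(Float)) -> #(Float, Float, Float) {
  let n = int.to_float(list.length(values))
  let mean = float.sum(values) /. n
  let variance =
    float.sum(list.map(values, fn(x) { { x -. mean } *. { x -. mean } })) /. n
  // square_root only fails for negative input, which a variance never is.
  let std = case float.square_root(variance) {
    Ok(s) -> s
    Error(_) -> 0.0
  }
  let zeros = list.length(list.filter(values, fn(x) { x == 0.0 }))
  #(mean, std, int.to_float(zeros) /. n)
}
```

Because Gleam's float division returns 0.0 for a zero divisor, an empty list yields all zeros here instead of crashing.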

MemoryLevel

Level in the memory hierarchy

pub type MemoryLevel {
  Registers
  L1Cache
  L2Cache
  Hbm
  SystemRam
  Storage
}

Constructors

```
Registers
```
Registers (fastest, ~10KB)
```
L1Cache
```
L1 Cache (~128KB, 100+ GB/s)
```
L2Cache
```
L2 Cache (~6MB, 50 GB/s)
```
Hbm
```
HBM/DRAM (~24GB, 8 TB/s for Blackwell)
```
SystemRam
```
System RAM (~32GB, 50 GB/s)
```
Storage
```
NVMe SSD (~1TB, 7 GB/s)

MicroBlock

Micro-block of 16 values (Blackwell NVFP4 inspired)

pub type MicroBlock {
  MicroBlock(values: List(Int), scale: Float, zero_point: Float)
}

Constructors

```
MicroBlock(values: List(Int), scale: Float, zero_point: Float)
```
Arguments

values

Quantized data (4-bit each, packed)

scale

Micro-block scale (simulated FP8 E4M3)

zero_point

Zero-point for negative values
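To make the roles of `values`, `scale` and `zero_point` concrete, here is a sketch of affine 4-bit quantization for a single block. It uses a local stand-in type `Block` that mirrors the documented shape. The helper names are hypothetical, and treating `zero_point` as the block minimum is an assumption; the module may define it differently.

```gleam
import gleam/float
import gleam/int
import gleam/list

/// Local stand-in mirroring the documented `MicroBlock` shape.
pub type Block {
  Block(values: List(Int), scale: Float, zero_point: Float)
}

/// Quantizes floats to 4-bit codes (0..15) with an affine mapping.
/// Assumption: `zero_point` stores the block minimum as a float offset.
pub fn quantize_block(data: List(Float)) -> Block {
  case data {
    [] -> Block(values: [], scale: 1.0, zero_point: 0.0)
    [first, ..rest] -> {
      let lo = list.fold(rest, first, float.min)
      let hi = list.fold(rest, first, float.max)
      // Avoid a zero scale when every value in the block is identical.
      let scale = case hi -. lo {
        0.0 -> 1.0
        range -> range /. 15.0
      }
      let values =
        list.map(data, fn(x) {
          int.clamp(float.round({ x -. lo } /. scale), min: 0, max: 15)
        })
      Block(values: values, scale: scale, zero_point: lo)
    }
  }
}

/// Reconstructs approximate floats from a block.
pub fn dequantize_block(block: Block) -> List(Float) {
  list.map(block.values, fn(q) {
    int.to_float(q) *. block.scale +. block.zero_point
  })
}
```

Storing the minimum as an offset lets the sixteen available codes cover exactly the range the block actually uses, which is why negative inputs need no separate sign handling in this sketch.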

StreamChunk

Streaming data chunk

pub type StreamChunk {
  StreamChunk(id: Int, block: MicroBlock, compressed: Bool)
}

Constructors

StreamChunk(id: Int, block: MicroBlock, compressed: Bool)

StreamState

Streaming compressor state

pub type StreamState {
  StreamState(
    config: CompressionConfig,
    processed_chunks: Int,
    total_bytes_in: Int,
    total_bytes_out: Int,
  )
}

Constructors

StreamState(
  config: CompressionConfig,
  processed_chunks: Int,
  total_bytes_in: Int,
  total_bytes_out: Int,
)

Values

analyze_and_compress

pub fn analyze_and_compress(t: tensor.Tensor) -> BlackwellTensor

Analyzes the tensor and chooses the best compression configuration for it

benchmark_blackwell_compression

pub fn benchmark_blackwell_compression() -> Nil

compress

pub fn compress(
  t: tensor.Tensor,
  config: CompressionConfig,
) -> BlackwellTensor

Compresses a tensor NVFP4-style, using the given configuration

compression_stats

pub fn compression_stats(
  original: tensor.Tensor,
  compressed: BlackwellTensor,
) -> CompressionStats

Computes compression statistics by comparing the original tensor with its compressed form
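The `mean_error` and `max_error` fields could be computed along the following lines. This is a sketch under assumptions: the helper `error_stats` is hypothetical, it works on plain lists instead of tensors, and it measures absolute differences, which may not be the metric the module uses.

```gleam
import gleam/float
import gleam/int
import gleam/list

/// Returns #(mean_abs_error, max_abs_error) for two lists of floats.
/// Assumption: errors are absolute differences between paired elements.
/// If the lengths differ, the extra elements of the longer list are ignored.
pub fn error_stats(
  original: List(Float),
  restored: List(Float),
) -> #(Float, Float) {
  let errors =
    list.map2(original, restored, fn(a, b) { float.absolute_value(a -. b) })
  case errors {
    [] -> #(0.0, 0.0)
    _ -> {
      let total = float.sum(errors)
      let worst = list.fold(errors, 0.0, float.max)
      #(total /. int.to_float(list.length(errors)), worst)
    }
  }
}
```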

decompress

pub fn decompress(bt: BlackwellTensor) -> tensor.Tensor

Decompresses a Blackwell tensor back to FP32

int8_config

pub fn int8_config() -> CompressionConfig

INT8 configuration (higher precision)

main

pub fn main() -> Nil

memory_bandwidth_gbps

pub fn memory_bandwidth_gbps(level: MemoryLevel) -> Float

Returns the simulated bandwidth, in GB/s, of the given memory level

memory_latency_ns

pub fn memory_latency_ns(level: MemoryLevel) -> Int

Returns the simulated access latency, in nanoseconds, of the given memory level

new_stream

pub fn new_stream(config: CompressionConfig) -> StreamState

Creates a new streaming state from the given configuration

nvfp4_config

pub fn nvfp4_config() -> CompressionConfig

Default NVFP4 configuration (Blackwell style)

process_chunk

pub fn process_chunk(
  state: StreamState,
  data: List(Float),
) -> #(StreamState, MicroBlock)

Processes a data chunk in streaming mode, returning the updated state together with the resulting micro-block
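The `#(state, output)` return shape fits the callback expected by `list.map_fold` from the Gleam standard library, so a whole input can be streamed chunk by chunk. The sketch below shows the pattern with stand-ins: `Counters` and `step` are hypothetical replacements for `StreamState` and `process_chunk`, and the 4 bytes per incoming value is an assumption.

```gleam
import gleam/list

/// Stand-in for the documented `StreamState` counters.
pub type Counters {
  Counters(processed_chunks: Int, total_bytes_in: Int)
}

/// Stand-in for `process_chunk`: state in, #(state, output) out.
/// Here the output is just the chunk length; the documented function
/// returns a MicroBlock instead.
pub fn step(state: Counters, data: List(Float)) -> #(Counters, Int) {
  let n = list.length(data)
  let next =
    Counters(
      processed_chunks: state.processed_chunks + 1,
      // Assumes 4 bytes per incoming FP32 value.
      total_bytes_in: state.total_bytes_in + n * 4,
    )
  #(next, n)
}

/// Splits the input into chunks of `size` and threads the state through each.
pub fn run_stream(data: List(Float), size: Int) -> #(Counters, List(Int)) {
  let chunks = list.sized_chunk(data, size)
  list.map_fold(chunks, Counters(0, 0), step)
}
```

Threading the state explicitly keeps each step a pure function, which also makes it straightforward to move the same step into an actor later.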

transfer_time_us

pub fn transfer_time_us(
  size_mb: Float,
  level: MemoryLevel,
) -> Float

Computes the transfer time, in microseconds, for size_mb megabytes at the given memory level
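A simple model for such a calculation is size divided by bandwidth. The sketch below takes the bandwidth as a plain argument instead of a `MemoryLevel`. The function name is hypothetical, decimal units (1 GB = 1000 MB) are assumed, and access latency is ignored; the module's own formula may differ.

```gleam
/// Hypothetical transfer-time model: time = size / bandwidth.
/// Assumes decimal units (1 GB = 1000 MB) and ignores access latency.
pub fn transfer_time_us_sketch(
  size_mb: Float,
  bandwidth_gbps: Float,
) -> Float {
  // MB divided by GB/s gives milliseconds; multiply by 1000 for microseconds.
  size_mb /. bandwidth_gbps *. 1000.0
}
```

With these assumptions, moving 100 MB at 50 GB/s takes 2,000 microseconds, while the same 100 MB at 8,000 GB/s takes 12.5 microseconds.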
