viva_tensor/tflops

TFLOPS - Tera Floating Point Operations Per Second

Multi-platform measurement of computational throughput, with automatic backend dispatch. Backends range from pure Erlang (~0.001 TFLOPS) to CUDA 2:4 structured sparsity (~660 TFLOPS).

The Auto backend selects the fastest available compute, trying backends in this order: GPU Sparse > GPU FP16 > GPU INT8 > GPU FP32 > CPU MKL > CPU SIMD > Erlang

import gleam/io
import gleam/list
import viva_tensor/tflops

pub fn main() {
  // Auto-select the fastest backend
  let result = tflops.measure_matmul(tflops.Auto, 2048, 2048, 2048)
  io.println(tflops.format_result(result))

  // Benchmark all available backends
  let backends = tflops.detect_backends()
  let results =
    list.map(backends, fn(b) { tflops.measure_matmul(b, 1024, 1024, 1024) })
  io.println(tflops.format_table(results))
}

Types

Backend

Compute backend. Constructors are ordered from slowest to fastest, followed by Auto, which selects among them.

pub type Backend {
  PureErlang
  ZigSIMD
  MklBLAS
  CudaFP32
  CudaFP16
  CudaINT8
  CudaSparse
  Auto
}

Constructors

```
PureErlang
```
Pure Erlang lists — ~0.001 TFLOPS (baseline, always available)
```
ZigSIMD
```
Zig SIMD NIF — ~1.5 TFLOPS (AVX2/SSE, portable)
```
MklBLAS
```
Intel MKL BLAS — ~2.0 TFLOPS (multi-threaded SGEMM)
```
CudaFP32
```
CUDA FP32 cuBLAS — ~59 TFLOPS (RTX 4090 measured)
```
CudaFP16
```
CUDA FP16 Tensor Cores — ~172 TFLOPS (HMMA, RTX 4090 measured)
```
CudaINT8
```
CUDA INT8 IMMA Tensor Cores — ~330 TOPS
```
CudaSparse
```
CUDA 2:4 Sparse FP16 — ~660 TFLOPS (cuSPARSELt)
```
Auto
```
Auto-select fastest available backend

TflopsResult

TFLOPS measurement result

pub type TflopsResult {
  TflopsResult(
    backend: Backend,
    matrix_size: Int,
    flops: Int,
    time_us: Int,
    tflops: Float,
    gflops: Float,
    efficiency: Float,
  )
}
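
The fields of TflopsResult are not documented individually. Under the usual convention that a matmul with dimensions m, n, k costs 2·m·n·k floating point operations (one multiply and one add per inner-product term), the flops, time_us and tflops fields would relate roughly as in the sketch below. This is an assumption about how the module fills the record, not a description of its implementation; matmul_flops and to_tflops are hypothetical helpers that are not part of viva_tensor.

```gleam
import gleam/int

/// Operation count for a matmul with dimensions m, n, k,
/// assuming one multiply and one add per inner-product term.
pub fn matmul_flops(m: Int, n: Int, k: Int) -> Int {
  2 * m * n * k
}

/// Convert an operation count and an elapsed time in microseconds to TFLOPS.
/// flops / (time_us * 1e-6 seconds) / 1e12 simplifies to flops / time_us / 1e6.
pub fn to_tflops(flops: Int, time_us: Int) -> Float {
  case time_us > 0 {
    True -> int.to_float(flops) /. int.to_float(time_us) /. 1_000_000.0
    // Guard against a zero or negative timer reading.
    False -> 0.0
  }
}
```

Under the same assumption, gflops would be the tflops value multiplied by 1000.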

Values

backend_name

pub fn backend_name(backend: Backend) -> String

Backend name as string

best_backend

pub fn best_backend() -> Backend

Detect the fastest available backend

detect_backends

pub fn detect_backends() -> List(Backend)

Detect all available backends (ordered slowest to fastest)
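
Because the list is ordered from slowest to fastest, its last element should be the fastest detected backend. The sketch below relies only on that ordering; whether the result always matches best_backend, and whether Auto can appear in the list, is not stated here and is left as an assumption. fastest_detected is a hypothetical helper, not part of the module.

```gleam
import gleam/list
import viva_tensor/tflops.{type Backend}

/// Last (fastest) entry of the detected backends.
/// Falls back to PureErlang, which is documented as always available,
/// if the list turns out to be empty.
pub fn fastest_detected() -> Backend {
  case list.last(tflops.detect_backends()) {
    Ok(backend) -> backend
    Error(Nil) -> tflops.PureErlang
  }
}
```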

format_result

pub fn format_result(result: TflopsResult) -> String

Format single result as a one-line string

format_table

pub fn format_table(results: List(TflopsResult)) -> String

Format list of results as a table

measure_matmul

pub fn measure_matmul(
  backend: Backend,
  m: Int,
  n: Int,
  k: Int,
) -> TflopsResult

Measure the TFLOPS of a single matmul with dimensions m, n, k on the given backend

measure_matmul_averaged

pub fn measure_matmul_averaged(
  backend: Backend,
  m: Int,
  n: Int,
  k: Int,
  iterations: Int,
) -> TflopsResult

Measure TFLOPS averaged over the given number of iterations, preceded by a warmup
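
A single timed run can be noisy, so comparing backends is usually more reliable with the averaged variant. The following is a minimal sketch that benchmarks every detected backend and prints the results with the highest throughput first. rank_by_tflops is a hypothetical helper, and the iteration count of 10 is an arbitrary choice.

```gleam
import gleam/float
import gleam/io
import gleam/list
import viva_tensor/tflops.{type TflopsResult}

/// Sort results so the highest measured TFLOPS comes first.
pub fn rank_by_tflops(results: List(TflopsResult)) -> List(TflopsResult) {
  list.sort(results, fn(a: TflopsResult, b: TflopsResult) {
    // Arguments are swapped to get descending order.
    float.compare(b.tflops, a.tflops)
  })
}

pub fn main() {
  tflops.detect_backends()
  |> list.map(fn(b) { tflops.measure_matmul_averaged(b, 1024, 1024, 1024, 10) })
  |> rank_by_tflops
  |> tflops.format_table
  |> io.println
}
```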

theoretical_peak

pub fn theoretical_peak(backend: Backend) -> Float

Theoretical peak TFLOPS for a backend, using an RTX 4090 and an i9-13900K as the reference hardware
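
One way to put a measurement in context is to divide it by this peak. The sketch below does so for a single result. fraction_of_peak is a hypothetical helper, and it is an assumption that this ratio is comparable to the efficiency field of TflopsResult, whose definition and units are not documented here. The value returned for Auto is likewise unspecified, so a non-positive peak is treated as "no reference available".

```gleam
import viva_tensor/tflops.{type TflopsResult}

/// Measured throughput as a fraction of the backend's theoretical peak.
/// Returns 0.0 when no positive peak is available for the backend.
pub fn fraction_of_peak(result: TflopsResult) -> Float {
  let peak = tflops.theoretical_peak(result.backend)
  case peak >. 0.0 {
    True -> result.tflops /. peak
    False -> 0.0
  }
}
```

The peak figures refer to specific reference hardware, so on other machines the ratio is only a rough guide and can exceed 1.0.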