viva_tensor/tflops

TFLOPS - Tera Floating Point Operations Per Second

Multi-platform computational throughput measurement and auto-dispatch. From Pure Erlang (~0.001 TFLOPS) to CUDA Sparse 2:4 (~660 TFLOPS).

The Auto backend automatically selects the fastest available compute: GPU Sparse > GPU FP16 > GPU INT8 > GPU FP32 > CPU MKL > CPU SIMD > Erlang

import gleam/io
import gleam/list
import viva_tensor/tflops

pub fn main() {
  // Auto-select the fastest backend
  let result = tflops.measure_matmul(tflops.Auto, 2048, 2048, 2048)
  io.println(tflops.format_result(result))

  // Benchmark all available backends
  let backends = tflops.detect_backends()
  let results =
    list.map(backends, fn(b) { tflops.measure_matmul(b, 1024, 1024, 1024) })
  io.println(tflops.format_table(results))
}

Types

Compute backend. Concrete backends are listed from slowest to fastest; Auto, listed last, selects among them.

pub type Backend {
  PureErlang
  ZigSIMD
  MklBLAS
  CudaFP32
  CudaFP16
  CudaINT8
  CudaSparse
  Auto
}

Constructors

  • PureErlang

    Pure Erlang lists — ~0.001 TFLOPS (baseline, always available)

  • ZigSIMD

    Zig SIMD NIF — ~1.5 TFLOPS (AVX2/SSE, portable)

  • MklBLAS

    Intel MKL BLAS — ~2.0 TFLOPS (multi-threaded SGEMM)

  • CudaFP32

    CUDA FP32 cuBLAS — ~59 TFLOPS (RTX 4090 measured)

  • CudaFP16

    CUDA FP16 Tensor Cores — ~172 TFLOPS (HMMA, RTX 4090 measured)

  • CudaINT8

    CUDA INT8 IMMA Tensor Cores — ~330 TOPS

  • CudaSparse

    CUDA 2:4 Sparse FP16 — ~660 TFLOPS (cuSPARSELt)

  • Auto

    Auto-select fastest available backend
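The priority order documented for Auto can be sketched as a lookup over a list of available backends. This is only an illustration of that ordering, not the module's implementation: pick_backend is a hypothetical helper, and the fallback to PureErlang assumes it is always available, as stated above.

```gleam
import gleam/list
import viva_tensor/tflops.{type Backend}

/// Hypothetical helper (not part of viva_tensor/tflops): returns the first
/// backend in the documented Auto priority order that appears in `available`.
pub fn pick_backend(available: List(Backend)) -> Backend {
  // Priority as documented for Auto, highest first.
  let priority = [
    tflops.CudaSparse,
    tflops.CudaFP16,
    tflops.CudaINT8,
    tflops.CudaFP32,
    tflops.MklBLAS,
    tflops.ZigSIMD,
    tflops.PureErlang,
  ]
  case list.find(priority, fn(b) { list.contains(available, b) }) {
    Ok(backend) -> backend
    // Assumption: PureErlang is the always-available baseline.
    Error(Nil) -> tflops.PureErlang
  }
}
```

Note that this order places CudaFP16 ahead of CudaINT8, which differs from the slowest-to-fastest order of the type definition.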

TFLOPS measurement result

pub type TflopsResult {
  TflopsResult(
    backend: Backend,
    matrix_size: Int,
    flops: Int,
    time_us: Int,
    tflops: Float,
    gflops: Float,
    efficiency: Float,
  )
}

Constructors

  • TflopsResult(
      backend: Backend,
      matrix_size: Int,
      flops: Int,
      time_us: Int,
      tflops: Float,
      gflops: Float,
      efficiency: Float,
    )
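The numeric fields are related by simple arithmetic. The sketch below assumes the conventional count of 2 * m * n * k floating-point operations per matmul and that time_us is elapsed time in microseconds. matmul_flops and to_tflops are hypothetical helpers, not part of this module, and the module's own computation may differ.

```gleam
import gleam/int

/// Hypothetical helper: operation count for an (m x k) by (k x n) matmul,
/// counting one multiply and one add per inner-product term.
pub fn matmul_flops(m: Int, n: Int, k: Int) -> Int {
  2 * m * n * k
}

/// Hypothetical helper: convert an operation count and an elapsed time in
/// microseconds to TFLOPS.
pub fn to_tflops(flops: Int, time_us: Int) -> Float {
  // flops / (time_us * 1e-6 s) / 1e12 simplifies to flops / (time_us * 1e6).
  // Gleam float division by zero yields 0.0, so time_us == 0 gives 0.0.
  int.to_float(flops) /. { int.to_float(time_us) *. 1_000_000.0 }
}
```

Under these assumptions, a gflops value is the tflops value multiplied by 1000.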

Values

pub fn backend_name(backend: Backend) -> String

Backend name as string

pub fn best_backend() -> Backend

Detect the fastest available backend

pub fn detect_backends() -> List(Backend)

Detect all available backends (ordered slowest to fastest)
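A short usage sketch combining detect_backends, best_backend, and backend_name to report what the current machine supports. It uses only functions listed on this page plus the Gleam standard library; the output depends on the hardware and installed libraries.

```gleam
import gleam/io
import gleam/list
import gleam/string
import viva_tensor/tflops

pub fn main() {
  // Names of every detected backend, in the order detect_backends returns them.
  let names =
    tflops.detect_backends()
    |> list.map(tflops.backend_name)
    |> string.join(", ")
  io.println("Available: " <> names)
  io.println("Fastest: " <> tflops.backend_name(tflops.best_backend()))
}
```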

pub fn format_result(result: TflopsResult) -> String

Format single result as a one-line string

pub fn format_table(results: List(TflopsResult)) -> String

Format list of results as a table

pub fn measure_matmul(
  backend: Backend,
  m: Int,
  n: Int,
  k: Int,
) -> TflopsResult

Measure single matmul TFLOPS for a backend

pub fn measure_matmul_averaged(
  backend: Backend,
  m: Int,
  n: Int,
  k: Int,
  iterations: Int,
) -> TflopsResult

Measure TFLOPS averaged over the given number of iterations, after a warmup

pub fn theoretical_peak(backend: Backend) -> Float

Theoretical peak TFLOPS for a backend, based on reference hardware (RTX 4090 GPU, i9-13900K CPU)
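An averaged measurement can be compared against the theoretical peak to estimate how much of the hardware is being used. This is a sketch under stated assumptions: fraction_of_peak is a hypothetical helper, the iteration count of 10 is arbitrary, and best_backend is assumed to return a concrete backend rather than Auto. The result's efficiency field may already carry a similar figure, but its exact definition is not documented here.

```gleam
import gleam/float
import gleam/io
import viva_tensor/tflops

/// Hypothetical helper: achieved throughput as a fraction of peak.
/// Returns 0.0 when the peak is not positive.
pub fn fraction_of_peak(achieved: Float, peak: Float) -> Float {
  case peak >. 0.0 {
    True -> achieved /. peak
    False -> 0.0
  }
}

pub fn main() {
  let backend = tflops.best_backend()
  // 10 timed iterations is an arbitrary choice for illustration.
  let result = tflops.measure_matmul_averaged(backend, 4096, 4096, 4096, 10)
  let ratio = fraction_of_peak(result.tflops, tflops.theoretical_peak(backend))
  io.println(
    tflops.backend_name(backend)
    <> ": "
    <> float.to_string(ratio *. 100.0)
    <> "% of theoretical peak",
  )
}
```

Because the peak values refer to an RTX 4090 and an i9-13900K, the ratio is only meaningful on comparable hardware.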
