viva_tensor/tflops
TFLOPS - Tera Floating Point Operations Per Second
Multi-platform measurement of computational throughput, with automatic backend dispatch. Supported backends range from pure Erlang (~0.001 TFLOPS) to CUDA 2:4 sparse FP16 (~660 TFLOPS).
The Auto backend selects the fastest available backend, using this priority order:
GPU Sparse > GPU FP16 > GPU INT8 > GPU FP32 > CPU MKL > CPU SIMD > Erlang
import gleam/io
import gleam/list
import viva_tensor/tflops
// Auto-select fastest backend
let result = tflops.measure_matmul(tflops.Auto, 2048, 2048, 2048)
io.println(tflops.format_result(result))
// Benchmark all available backends
let backends = tflops.detect_backends()
let results = list.map(backends, fn(b) { tflops.measure_matmul(b, 1024, 1024, 1024) })
io.println(tflops.format_table(results))
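The priority order used by Auto can be read as "take the first entry in the list that is available". The following is a minimal sketch of that idea, not the library's actual implementation: the `Backend` type here is a local stand-in that omits `Auto`, and `pick_backend` is a hypothetical helper name.

```gleam
import gleam/list

// Local stand-in for the documented Backend type (Auto omitted).
pub type Backend {
  PureErlang
  ZigSIMD
  MklBLAS
  CudaFP32
  CudaFP16
  CudaINT8
  CudaSparse
}

// Hypothetical helper: return the first backend in the documented
// priority order that also appears in `available`.
pub fn pick_backend(available: List(Backend)) -> Backend {
  // GPU Sparse > GPU FP16 > GPU INT8 > GPU FP32 > CPU MKL > CPU SIMD > Erlang
  let priority = [
    CudaSparse, CudaFP16, CudaINT8, CudaFP32, MklBLAS, ZigSIMD, PureErlang,
  ]
  case list.find(priority, fn(b) { list.contains(available, b) }) {
    Ok(backend) -> backend
    // The docs describe PureErlang as always available, so use it as fallback.
    Error(Nil) -> PureErlang
  }
}
```

With this sketch, `pick_backend([ZigSIMD, CudaFP32, CudaFP16])` evaluates to `CudaFP16`, and an empty list falls back to `PureErlang`.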
Types
Compute backend. Constructors are ordered from slowest to fastest; Auto is a selector rather than a speed tier.
pub type Backend {
PureErlang
ZigSIMD
MklBLAS
CudaFP32
CudaFP16
CudaINT8
CudaSparse
Auto
}
Constructors
- PureErlang: Pure Erlang lists, ~0.001 TFLOPS (baseline, always available)
- ZigSIMD: Zig SIMD NIF, ~1.5 TFLOPS (AVX2/SSE, portable)
- MklBLAS: Intel MKL BLAS, ~2.0 TFLOPS (multi-threaded SGEMM)
- CudaFP32: CUDA FP32 cuBLAS, ~59 TFLOPS (RTX 4090 measured)
- CudaFP16: CUDA FP16 Tensor Cores, ~172 TFLOPS (HMMA, RTX 4090 measured)
- CudaINT8: CUDA INT8 IMMA Tensor Cores, ~330 TOPS
- CudaSparse: CUDA 2:4 Sparse FP16, ~660 TFLOPS (cuSPARSELt)
- Auto: Auto-select fastest available backend
TFLOPS measurement result
pub type TflopsResult {
TflopsResult(
backend: Backend,
matrix_size: Int,
flops: Int,
time_us: Int,
tflops: Float,
gflops: Float,
efficiency: Float,
)
}
Constructors
- TflopsResult(backend: Backend, matrix_size: Int, flops: Int, time_us: Int, tflops: Float, gflops: Float, efficiency: Float)
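The `flops`, `time_us`, `tflops`, and `gflops` fields are related by simple arithmetic. The sketch below shows one plausible way to derive them, assuming the standard count of 2·m·n·k floating point operations for a matmul. The function names are hypothetical, and treating `efficiency` as the ratio of measured to theoretical peak throughput is an assumption, since the source does not define it.

```gleam
import gleam/int

// FLOPs for an (m x k) by (k x n) matmul: one multiply and one add
// per inner-product term.
pub fn matmul_flops(m: Int, n: Int, k: Int) -> Int {
  2 * m * n * k
}

// flops / (time_us * 1e-6 seconds) / 1e12 simplifies to flops / time_us / 1e6.
pub fn to_tflops(flops: Int, time_us: Int) -> Float {
  int.to_float(flops) /. int.to_float(time_us) /. 1_000_000.0
}

// 1 TFLOPS = 1000 GFLOPS.
pub fn to_gflops(tflops: Float) -> Float {
  tflops *. 1000.0
}

// Assumed definition: measured throughput as a fraction of theoretical peak.
pub fn efficiency(measured_tflops: Float, peak_tflops: Float) -> Float {
  measured_tflops /. peak_tflops
}
```

For example, a 2048 x 2048 x 2048 matmul is 17,179,869,184 FLOPs; completing it in 1,000 microseconds corresponds to roughly 17.18 TFLOPS.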
Values
pub fn detect_backends() -> List(Backend)
Detect all available backends (ordered slowest to fastest)
pub fn format_result(result: TflopsResult) -> String
Format single result as a one-line string
pub fn format_table(results: List(TflopsResult)) -> String
Format list of results as a table
pub fn measure_matmul(
backend: Backend,
m: Int,
n: Int,
k: Int,
) -> TflopsResult
Measure single matmul TFLOPS for a backend
pub fn measure_matmul_averaged(
backend: Backend,
m: Int,
n: Int,
k: Int,
iterations: Int,
) -> TflopsResult
Measure TFLOPS averaged over the given number of iterations, after a warmup phase
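The warmup-then-average pattern can be sketched as follows. This is an illustration under stated assumptions rather than the module's code: `run_once` is a hypothetical callback that performs one matmul and returns its elapsed time in microseconds, and the separate `warmup` count is also an assumption (the documented function only takes `iterations`).

```gleam
import gleam/int

// Run `run_once` `remaining` times and sum the reported microseconds.
fn total_time_us(run_once: fn() -> Int, remaining: Int, acc: Int) -> Int {
  case remaining <= 0 {
    True -> acc
    False -> total_time_us(run_once, remaining - 1, acc + run_once())
  }
}

// Hypothetical helper: discard `warmup` runs, then return the mean time
// in microseconds over `iterations` timed runs (integer division).
pub fn average_time_us(
  run_once: fn() -> Int,
  warmup: Int,
  iterations: Int,
) -> Int {
  // Warmup runs let caches, JIT-like setup, and GPU clocks settle;
  // their timings are thrown away.
  let _ = total_time_us(run_once, warmup, 0)
  let total = total_time_us(run_once, iterations, 0)
  // Guard against a non-positive iteration count.
  total / int.max(iterations, 1)
}
```

Averaging the elapsed time first and converting to TFLOPS afterwards avoids giving short outlier runs disproportionate weight.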
pub fn theoretical_peak(backend: Backend) -> Float
Theoretical peak TFLOPS for a backend on the reference hardware (RTX 4090 GPU, i9-13900K CPU)
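As a usage sketch, the functions above can be combined to print each detected backend's averaged result next to its theoretical peak. It relies only on the signatures documented on this page; the matrix size and iteration count are arbitrary choices for illustration.

```gleam
import gleam/float
import gleam/io
import gleam/list
import viva_tensor/tflops

pub fn main() {
  tflops.detect_backends()
  |> list.each(fn(backend) {
    // Average over 10 iterations of a 1024 x 1024 x 1024 matmul.
    let result = tflops.measure_matmul_averaged(backend, 1024, 1024, 1024, 10)
    let peak = tflops.theoretical_peak(backend)
    io.println(
      tflops.format_result(result)
      <> " | theoretical peak: "
      <> float.to_string(peak)
      <> " TFLOPS",
    )
  })
}
```

Because the peak figures refer to an RTX 4090 and an i9-13900K, the comparison is only indicative on other hardware.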