
🇧🇷 Português · 🇺🇸 English · 🇨🇳 中文


“Tensors speak Gleam. Kernels burn silicon. The BEAM holds the soul.”


viva_tensor IS NOT A WRAPPER. It is a production-grade FP8 LLM inference engine written from scratch: hand-tuned CUDA kernels, blocked W8A16 GEMV, full-token CUDA Graphs, and a public ModelHandle API, all driven from Gleam on the BEAM.

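In the benchmarks reported under Performance, it decodes TinyLlama-1.1B faster than Ollama on the same hardware: 448 tok/s (best run) against 352 tok/s.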


🎯 Overview

A tensor library for Gleam on the BEAM. Provides a pure-Gleam tensor API for portability, an inference API for FP8 / INT4-2:4 / INT8-2:4 sparse linear layers, and a public LLM ModelHandle API for Llama-family HuggingFace checkpoints.

The library works fully in pure BEAM (slow but portable) and transparently upgrades to the native CUDA path when the NIF is loaded.

| Property | Value |
| --- | --- |
| Language | Pure Gleam (type-safe functional) |
| Runtime | BEAM / OTP 27+ |
| Native backend | CUDA 12 + CUTLASS + cuSPARSELt (SM89 / Ada) |
| Tests | 792 passing |
| Decode | 448 tok/s TinyLlama-1.1B (vs Ollama 352) |
| Public API | viva_tensor.load_model / viva_tensor.generate |

⚡ Quick Start

git clone https://github.com/gabrielmaialva33/viva_tensor.git && cd viva_tensor
gleam deps download

# Optional: native CUDA backend (RTX 4090 / Ada SM89)
make cutlass-libs    # CUTLASS + cuSPARSELt static archives
make zig             # the NIF shared object

gleam test           # 792 tests, all pass with NIF loaded

Generate text in 4 lines of Gleam

import viva_tensor as t

let assert Ok(model) = t.load_model("tmp/tinyllama/model.safetensors")
let assert Ok(result) = t.generate(model, "Hello", t.default_generate_opts())
result.text

📋 Prerequisites

| Tool | Version | Required for |
| --- | --- | --- |
| Gleam | >= 1.14 | Build / pure-Gleam |
| Erlang/OTP | >= 27 | BEAM runtime |
| CUDA toolkit | >= 12.0 | Native inference path |
| NVIDIA GPU | Ada+ (SM89) | FP8 / Tensor Cores |
| make + zig + clang | recent | NIF build pipeline |

The pure-Gleam path needs only Gleam + Erlang/OTP.


🏗️ Architecture

   ┌─────────────────────────────────────────────────────────┐
   │                  Gleam application code                 │
   │       viva_tensor.load_model / .generate / .Tensor      │
   └────────────────────────┬────────────────────────────────┘
                            │
   ┌────────────────────────▼────────────────────────────────┐
   │            Erlang public API (viva_tensor_llm)          │
   │  SafeTensors loader · BPE tokenizer · sampling · KV     │
   └────────────────────────┬────────────────────────────────┘
                            │
   ┌────────────────────────▼────────────────────────────────┐
   │          NIF dispatch (viva_tensor_zig.so + .erl)       │
   │   PackedWeight · EmbeddingTable · KvCache · ModelHandle │
   └─────────┬──────────────────────────────┬────────────────┘
             │                              │
   ┌─────────▼──────────┐         ┌─────────▼────────────────┐
   │ Pure-Gleam tensors │         │      CUDA kernels        │
   │   (no GPU needed)  │         │ W8A16 GEMV · FlashAttn   │
   │                    │         │ Full-token CUDA Graph    │
   │                    │         │ CUTLASS FP8/INT4 sparse  │
   └────────────────────┘         └──────────────────────────┘

📋 Core Modules

| Module | Description |
| --- | --- |
| viva_tensor | Public Gleam API: tensors, prepack, linear, LLM |
| viva_tensor_llm | load_model / generate (opaque ModelHandle) |
| viva_tensor_zig | NIF dispatch (Erlang stubs) |
| viva_tensor_safetensors_ffi | HF SafeTensors loader, sharded support, BF16/F16 |
| viva_tensor_tokenizer_ffi | SentencePiece + byte-level BPE (GPT-2/Llama-3) |
| zig_src/cuda_block_forward.cu | RMSNorm, RoPE, GQA flash attn, SiLU, residual |
| zig_src/nif_forward_block.c | Decode-step orchestration, CUDA Graph capture |
| zig_src/cuda_fp8_cutlass.cu | CUTLASS FP8 dense GEMM |
| zig_src/nif_prepack_int_sparse.c | INT4 / INT8 2:4 sparse weight prepack |
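
For readers unfamiliar with two of the operations listed for cuda_block_forward.cu, the usual textbook definitions of RMSNorm and SiLU are given here for orientation. This is the standard form from the literature, not a transcription of the kernel source; the symbols d (hidden size), ε (a small stabilising constant) and g (the learned gain vector) are notation introduced for this note.

```latex
\mathrm{RMSNorm}(x)_i = \frac{x_i}{\sqrt{\frac{1}{d}\sum_{j=1}^{d} x_j^{2} + \epsilon}}\; g_i ,
\qquad
\mathrm{SiLU}(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}}
```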

📊 Performance

All numbers measured on RTX 4090 (Ada SM89) + Intel i9-13900K (32 threads @ 5.80 GHz). Reproducible via bench/ harness.

Text generation: TinyLlama-1.1B-Chat (FP8 W8A16)

| Runtime | Decode speed |
| --- | --- |
| viva_tensor (best run) | 448 tok/s |
| Ollama local baseline (same model) | 352 tok/s |
| viva_tensor.generate (warm) | 2.31 ms/token |
| viva_tensor.generate, Llama-3.2-1B-Instruct | 2.47 ms/token |
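
The table above mixes two units, tok/s and ms/token. As a small worked example, the sketch below converts between them and computes the ratio against the Ollama baseline. The function names `tokens_per_second` and `speedup` are illustrative helpers written for this note, not part of viva_tensor.

```gleam
import gleam/float
import gleam/io

/// Convert a per-token latency in milliseconds to tokens per second.
pub fn tokens_per_second(ms_per_token: Float) -> Float {
  1000.0 /. ms_per_token
}

/// Ratio of two throughputs measured in the same unit.
pub fn speedup(ours: Float, baseline: Float) -> Float {
  ours /. baseline
}

pub fn main() {
  // 2.31 ms/token (warm TinyLlama) is roughly 433 tok/s.
  io.println(float.to_string(tokens_per_second(2.31)))
  // 2.47 ms/token (Llama-3.2-1B-Instruct) is roughly 405 tok/s.
  io.println(float.to_string(tokens_per_second(2.47)))
  // 448 tok/s over 352 tok/s is roughly a 1.27x speedup.
  io.println(float.to_string(speedup(448.0, 352.0)))
}
```

Read this way, the warm `generate` figure of 2.31 ms/token sits slightly below the 448 tok/s best run, which corresponds to about 2.23 ms/token.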

Validated models

| Model | Status | Path | Notes |
| --- | --- | --- | --- |
| TinyLlama-1.1B-Chat | ✅ validated | single safetensors | byte-identical baseline, 2.31 ms/tok |
| Llama-3.2-1B-Instruct (unsloth) | ✅ validated | single safetensors | tied embeddings, byte-level BPE, 2.47 ms/tok |
| NousResearch/Llama-2-7b-chat-hf | ✅ validated | sharded F16 (13.5 GB) | head_dim=128 dynamic path, 113 ms/tok |
| Phi-2 | ⚠️ partial | sharded folder | sharded discovery OK, Phi arch ≠ Llama |

Quantized GEMM kernels (RTX 4090)

| Kernel | Peak performance | Backend |
| --- | --- | --- |
| INT8 2:4 sparse (cuSPARSELt) | 1320 TOPS | cuSPARSELt |
| INT4 2:4 sparse (CUTLASS Sm80) | 1854 TOPS | CUTLASS |
| FP8 dense (CUTLASS E4M3 W8A8) | ~660 TFLOPS | CUTLASS |
| FP8 W8A16 blocked GEMV (custom) | decode-optimized | hand-tuned CUDA |

Full methodology + raw numbers in bench/results/matmul_showdown.md.
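
The "2:4" in the sparse kernel names refers to a structured pattern in which at most two of every four consecutive weights are non-zero. The sketch below shows one common way to impose that pattern, magnitude pruning, in plain Gleam. It is an illustration only: the README does not say how viva_tensor produces 2:4 weights, and `prune_2_of_4` is a hypothetical helper, not a library function.

```gleam
import gleam/float
import gleam/list

/// Zero the two smallest-magnitude values in each consecutive group of four.
pub fn prune_2_of_4(weights: List(Float)) -> List(Float) {
  weights
  |> list.sized_chunk(4)
  |> list.flat_map(prune_group)
}

fn prune_group(group: List(Float)) -> List(Float) {
  // Rank positions by descending magnitude and keep the first two.
  let keep =
    group
    |> list.index_map(fn(w, i) { #(i, float.absolute_value(w)) })
    |> list.sort(fn(a, b) { float.compare(b.1, a.1) })
    |> list.take(2)
    |> list.map(fn(pair) { pair.0 })

  // Rebuild the group in its original order, zeroing the dropped positions.
  list.index_map(group, fn(w, i) {
    case list.contains(keep, i) {
      True -> w
      False -> 0.0
    }
  })
}
```

For the group `[0.1, -3.0, 2.0, 0.5]` this keeps `-3.0` and `2.0` and zeroes the other two entries, which is the layout the sparse Tensor Core path expects before the weights are compressed.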


🧬 Design Principles

| Principle | Description |
| --- | --- |
| Honest numerics | argmax tokens stay byte-identical to HF reference fp32 |
| Pure-Gleam fallback | Every API works without CUDA, just slower |
| Owned device memory | PackedWeight, EmbeddingTable, KvCache are Erlang resources |
| Single-token by default | Decode is batch=1 first; batched prefill is future work |
| No magic kernels | Every .cu file is human-written, benchmarked, and committed |

🛠️ Public API

High-level: LLM inference

import viva_tensor as t

pub fn main() {
  let assert Ok(model) = t.load_model("tmp/tinyllama/model.safetensors")

  let opts =
    t.GenerateOpts(
      max_new_tokens: 50,
      temperature: 0.0,           // argmax: deterministic
      top_k: t.TopKInfinity,
      top_p: 1.0,
      seed: 42,
      stop_on_eos: True,
    )

  let assert Ok(result) = t.generate(model, "Hello", opts)
  result.text
}
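
With `temperature: 0.0` the options above select the highest-scoring token at every step. As a rough illustration of that rule, here is a pure-Gleam argmax over a list of logits. It is a sketch written for this note (the name `argmax` and the list-of-floats representation are assumptions), not the code path viva_tensor runs.

```gleam
import gleam/list

/// Index of the largest logit; ties resolve to the earliest index.
/// Returns Error(Nil) for an empty list.
pub fn argmax(logits: List(Float)) -> Result(Int, Nil) {
  case logits {
    [] -> Error(Nil)
    [first, ..rest] -> {
      // Accumulator: #(best index so far, best value so far, next index).
      let #(best_index, _, _) =
        list.fold(rest, #(0, first, 1), fn(acc, logit) {
          let #(best_i, best_v, i) = acc
          case logit >. best_v {
            True -> #(i, logit, i + 1)
            False -> #(best_i, best_v, i + 1)
          }
        })
      Ok(best_index)
    }
  }
}
```

Because no random number is drawn, the same prompt and weights give the same token sequence, which is what makes a byte-identical comparison against a reference implementation possible.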

Reproducible sampling

let sampling_opts =
  t.GenerateOpts(
    max_new_tokens: 30,
    temperature: 0.8,
    top_k: t.TopK(40),
    top_p: 0.95,
    seed: 42,
    stop_on_eos: True,
  )

Same seed → same token sequence across machines.
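
To make the `top_k` and `top_p` options concrete, the sketch below shows the usual filtering step applied to a probability distribution before a token is drawn: keep the k most likely tokens, cut the tail once the cumulative mass reaches p, then renormalise. This is the generic technique, not viva_tensor's sampler; `filter_top_k_top_p` and the `#(token_id, probability)` representation are assumptions made for the example, and the seeded random draw itself is left out.

```gleam
import gleam/float
import gleam/list

/// Keep the `k` most probable tokens, then the shortest prefix whose
/// cumulative probability reaches `p`, and renormalise the survivors.
pub fn filter_top_k_top_p(
  probs: List(#(Int, Float)),
  k: Int,
  p: Float,
) -> List(#(Int, Float)) {
  // Sort by descending probability and apply the top-k cut.
  let sorted = list.sort(probs, fn(a, b) { float.compare(b.1, a.1) })
  let top_k = list.take(sorted, k)

  // Nucleus cut: stop adding tokens once the kept mass has reached p.
  let #(kept_reversed, _) =
    list.fold(top_k, #([], 0.0), fn(acc, pair) {
      let #(kept, mass) = acc
      case mass >=. p {
        True -> acc
        False -> #([pair, ..kept], mass +. pair.1)
      }
    })
  let kept = list.reverse(kept_reversed)

  // Renormalise so the remaining probabilities sum to 1.
  let total = list.fold(kept, 0.0, fn(sum, pair) { sum +. pair.1 })
  list.map(kept, fn(pair) { #(pair.0, pair.1 /. total) })
}
```

With probabilities 0.5, 0.3, 0.15 and 0.05, `k = 3` and `p = 0.7`, only the first two tokens survive. Given a fixed seed, drawing from the filtered distribution is then repeatable.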

Low-level: quantized linears

let assert Ok(packed) = t.prepack_fp8_weight_blocked(weight, 16)
let assert Ok(output) = t.linear_fp8_w8a16(input, packed, bias)

Prepack once, run linear forwards many times: the FP8 weight + scales live on the device for the lifetime of the PackedWeight resource.
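
The "scales" mentioned above are what let 8-bit weights cover the range of the original values. The sketch below computes one scale per block of weights in plain Gleam, under two assumptions that the README does not confirm: that the second argument of `prepack_fp8_weight_blocked` (16 in the example) is the block size, and that the weights use the E4M3 format, whose largest finite magnitude is 448. `block_scales` is a hypothetical helper for illustration, not a viva_tensor function.

```gleam
import gleam/float
import gleam/list

/// Largest finite magnitude representable in FP8 E4M3.
const e4m3_max = 448.0

/// One scale per block of `block_size` weights: max(|w|) / 448.
/// Dividing a block by its scale maps its largest value onto the FP8 range.
pub fn block_scales(weights: List(Float), block_size: Int) -> List(Float) {
  weights
  |> list.sized_chunk(block_size)
  |> list.map(fn(block) {
    let max_abs =
      list.fold(block, 0.0, fn(m, w) { float.max(m, float.absolute_value(w)) })
    max_abs /. e4m3_max
  })
}
```

Smaller blocks track local weight magnitudes more closely at the cost of storing more scales, which is the usual trade-off behind a block-size parameter.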


🗺️ Roadmap

| Phase | Status |
| --- | --- |
| Pure-Gleam tensors | ✅ |
| CUDA backend (CUTLASS + cuSPARSELt) | ✅ |
| FP8 / INT4-2:4 / INT8-2:4 sparse kernels | ✅ |
| Public ModelHandle API | ✅ |
| Sharded SafeTensors loader | ✅ |
| Byte-level BPE + SentencePiece tokenizers | ✅ |
| Weight-tied embeddings | ✅ |
| Full-token CUDA Graph capture | ✅ |
| Reproducible temperature/top-k/top-p sampling | ✅ |
| FP16 weight dtype (Llama-2-7B) | 🔄 |
| Batched prefill | ⏳ |
| Speculative decoding | ⏳ |
| Hopper SM90 / Blackwell FP4 / NVFP4 | ⏳ |

🤝 Contributing

git checkout -b feature/your-feature
make cutlass-libs && make zig
gleam test          # 792 should pass with NIF loaded
make test-no-nif    # 791 should pass without NIF

See CONTRIBUTING.md for guidelines.


📚 Documentation

| Language | Link |
| --- | --- |
| 🇧🇷 Português | docs/pt-br/ |
| 🇺🇸 English | docs/en/ |
| 🇨🇳 中文 | docs/zh-cn/ |

Guides

API reference

Technical paper


📜 What's new in 2.2.102

The full evolution from a 63 tok/s baseline to 448 tok/s, across 14 rounds of optimization, is documented in CHANGELOG.md.


Star if you believe BEAM can do LLM inference ⭐


Created by Gabriel Maia · MIT License
