viva_tensor · v2.2.104

🇧🇷 Português · 🇺🇸 English · 🇨🇳 中文

“Tensors speak Gleam. Kernels burn silicon. The BEAM holds the soul.”

viva_tensor IS NOT A WRAPPER. It is a production-grade FP8 LLM inference engine written from scratch: hand-tuned CUDA kernels, blocked W8A16 GEMV, full-token CUDA Graphs, and a public ModelHandle API — all driven from Gleam on the BEAM.

On the same hardware (RTX 4090), its best TinyLlama-1.1B decode run reaches 448 tok/s, against 352 tok/s for Ollama.

🎯 Overview

viva_tensor is a tensor library for Gleam on the BEAM. It provides a pure-Gleam tensor API for portability, an inference API for FP8 dense and INT4 / INT8 2:4 sparse linear layers, and a public LLM ModelHandle API for Llama-family HuggingFace checkpoints.

The library works fully in pure BEAM (slow but portable) and transparently upgrades to the native CUDA path when the NIF is loaded.

Property	Value
Language	Pure Gleam (type-safe functional)
Runtime	BEAM / OTP 27+
Native backend	CUDA 12 + CUTLASS + cuSPARSELt (SM89 / Ada)
Tests	792 passing
Decode	`448 tok/s` TinyLlama-1.1B (vs Ollama `352 tok/s`)
Public API	`viva_tensor.load_model` / `viva_tensor.generate`

⚡ Quick Start

git clone https://github.com/gabrielmaialva33/viva_tensor.git && cd viva_tensor
gleam deps download

# Optional: native CUDA backend (RTX 4090 / Ada SM89)
make cutlass-libs    # CUTLASS + cuSPARSELt static archives
make zig             # the NIF shared object

gleam test           # 792 tests, all pass with NIF loaded

Generate text in a few lines of Gleam

import viva_tensor as t

pub fn main() {
  // Load the checkpoint once, then generate with the default options.
  let assert Ok(model) = t.load_model("tmp/tinyllama/model.safetensors")
  let assert Ok(result) = t.generate(model, "Hello", t.default_generate_opts())
  result.text
}

📋 Prerequisites

Tool	Version	Required for
Gleam	`>= 1.14`	Build / pure-Gleam
Erlang/OTP	`>= 27`	BEAM runtime
CUDA toolkit	`>= 12.0`	Native inference path
NVIDIA GPU	Ada+ (SM89)	FP8 / Tensor Cores
`make` + `zig` + `clang`	recent	NIF build pipeline

The pure-Gleam path needs only Gleam + Erlang/OTP.

🏗️ Architecture

   ┌─────────────────────────────────────────────────────────┐
   │                  Gleam application code                 │
   │       viva_tensor.load_model / .generate / .Tensor      │
   └────────────────────────┬────────────────────────────────┘
                            │
   ┌────────────────────────▼────────────────────────────────┐
   │            Erlang public API (viva_tensor_llm)          │
   │  SafeTensors loader · BPE tokenizer · sampling · KV     │
   └────────────────────────┬────────────────────────────────┘
                            │
   ┌────────────────────────▼────────────────────────────────┐
   │          NIF dispatch (viva_tensor_zig.so + .erl)       │
   │   PackedWeight · EmbeddingTable · KvCache · ModelHandle │
   └─────────┬──────────────────────────────┬────────────────┘
             │                              │
   ┌─────────▼──────────┐         ┌─────────▼────────────────┐
   │ Pure-Gleam tensors │         │      CUDA kernels        │
   │   (no GPU needed)  │         │ W8A16 GEMV · FlashAttn   │
   │                    │         │ Full-token CUDA Graph    │
   │                    │         │ CUTLASS FP8/INT4 sparse  │
   └────────────────────┘         └──────────────────────────┘

📋 Core Modules

Module	Description
`viva_tensor`	Public Gleam API: tensors, prepack, linear, LLM
`viva_tensor_llm`	`load_model` / `generate` — opaque `ModelHandle`
`viva_tensor_zig`	NIF dispatch (Erlang stubs)
`viva_tensor_safetensors_ffi`	HF SafeTensors loader, sharded support, BF16/F16
`viva_tensor_tokenizer_ffi`	SentencePiece + byte-level BPE (GPT-2/Llama-3)
`zig_src/cuda_block_forward.cu`	RMSNorm, RoPE, GQA flash attn, SiLU, residual
`zig_src/nif_forward_block.c`	Decode-step orchestration, CUDA Graph capture
`zig_src/cuda_fp8_cutlass.cu`	CUTLASS FP8 dense GEMM
`zig_src/nif_prepack_int_sparse.c`	INT4 / INT8 2:4 sparse weight prepack
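
For readers less familiar with the element-wise ops that `zig_src/cuda_block_forward.cu` is listed as covering, here is a minimal CPU reference sketch in Python of RMSNorm, SiLU, and a residual add. It is not the project's kernel code: the function names and the `eps` default are assumptions made for illustration, and RoPE and attention are left out.

```python
import math

def rms_norm(x, weight, eps=1e-5):
    """Scale x by the reciprocal of its root-mean-square, then by a learned weight.

    eps=1e-5 is an assumed default; real checkpoints set it in config.json.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

def silu(x):
    """SiLU: v * sigmoid(v), using sigmoid(v) = 0.5 * (1 + tanh(v / 2)) for stability."""
    return [0.5 * v * (1.0 + math.tanh(0.5 * v)) for v in x]

def residual_add(x, y):
    """Element-wise sum of the skip connection and the block output."""
    return [a + b for a, b in zip(x, y)]

hidden = [1.0, -2.0, 3.0, -4.0]
normed = rms_norm(hidden, [1.0] * len(hidden))
out = residual_add(hidden, silu(normed))
```

A reference like this is mainly useful as a numerical oracle when checking a fused GPU kernel against known-good values.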

📊 Performance

All numbers were measured on an RTX 4090 (Ada SM89) with an Intel i9-13900K (32 threads @ 5.80 GHz) and are reproducible via the bench/ harness.

Text generation — TinyLlama-1.1B-Chat (FP8 W8A16)

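Runtime	Decode speed
viva_tensor, TinyLlama-1.1B-Chat, best run	`448 tok/s` (≈ 2.23 ms/token)
Ollama local baseline, same model	`352 tok/s` (≈ 2.84 ms/token)
`viva_tensor.generate`, TinyLlama-1.1B-Chat (warm)	`2.31 ms/token` (≈ 433 tok/s)
`viva_tensor.generate`, Llama-3.2-1B-Instruct	`2.47 ms/token` (≈ 405 tok/s)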
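
The table above reports both tok/s and ms/token. The two are reciprocals of each other (scaled by 1000), so any row can be converted for a like-for-like comparison. A small Python sketch, with helper names chosen here for illustration:

```python
def ms_per_token_to_tok_per_s(ms_per_token):
    """Latency per token (ms) -> throughput (tokens per second)."""
    return 1000.0 / ms_per_token

def tok_per_s_to_ms_per_token(tok_per_s):
    """Throughput (tokens per second) -> latency per token (ms)."""
    return 1000.0 / tok_per_s

# Figures taken from the table above.
for label, ms in [("TinyLlama-1.1B-Chat", 2.31), ("Llama-3.2-1B-Instruct", 2.47)]:
    print(f"{label}: {ms} ms/token = {ms_per_token_to_tok_per_s(ms):.0f} tok/s")

for label, tps in [("viva_tensor best run", 448), ("Ollama baseline", 352)]:
    print(f"{label}: {tps} tok/s = {tok_per_s_to_ms_per_token(tps):.2f} ms/token")

speedup = 448 / 352  # best run relative to the Ollama baseline
print(f"speedup: {speedup:.2f}x")
```

With the published figures this gives roughly 433 and 405 tok/s for the two warm `generate` rows, and a best-run speedup of about 1.27x over the Ollama baseline.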

Validated models

Model	Status	Checkpoint layout	Notes
TinyLlama-1.1B-Chat	✅ validated	single safetensors	byte-identical baseline, `2.31 ms/tok`
Llama-3.2-1B-Instruct (unsloth)	✅ validated	single safetensors	tied embeddings, byte-level BPE, `2.47 ms/tok`
NousResearch/Llama-2-7b-chat-hf	✅ validated	sharded F16 (13.5 GB)	`head_dim=128` dynamic path, `113 ms/tok`
Phi-2	⚠️ partial	sharded folder	sharded discovery works; the Phi architecture is not Llama-family
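
The single-file and sharded layouts in the table above follow the usual HuggingFace conventions: a sharded checkpoint ships a `model.safetensors.index.json` whose `weight_map` maps each tensor name to the shard file that holds it. The Python sketch below shows one way such discovery can work. It is not the project's loader; the function names are hypothetical, and treating a missing `tie_word_embeddings` key as `False` is an assumption of this sketch.

```python
import json
from pathlib import Path

def discover_shards(path):
    """Return the sorted .safetensors files that make up a checkpoint.

    Accepts a single .safetensors file, a model.safetensors.index.json file,
    or a folder containing either.
    """
    p = Path(path)
    if p.is_dir():
        index = p / "model.safetensors.index.json"
        single = p / "model.safetensors"
        if index.is_file():
            p = index
        elif single.is_file():
            p = single
        else:
            raise FileNotFoundError(f"no safetensors checkpoint in {p}")
    if p.name.endswith(".index.json"):
        weight_map = json.loads(p.read_text(encoding="utf-8"))["weight_map"]
        # weight_map: tensor name -> shard file name, relative to the index file.
        return sorted({p.parent / shard for shard in weight_map.values()})
    return [p]

def uses_tied_embeddings(config_path):
    """Read tie_word_embeddings from config.json (assumed False when absent)."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    return bool(config.get("tie_word_embeddings", False))
```

Deduplicating through a set matters because many tensors share one shard, so the shard name repeats throughout `weight_map`.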

Quantized GEMM kernels (RTX 4090)

Kernel	Peak performance	Backend
INT8 2:4 sparse (cuSPARSELt)	`1320 TOPS`	cuSPARSELt
INT4 2:4 sparse (CUTLASS Sm80)	`1854 TOPS`	CUTLASS
FP8 dense (CUTLASS E4M3 W8A8)	`~660 TFLOPS`	CUTLASS
FP8 W8A16 blocked GEMV (custom)	decode-optimized	hand-tuned CUDA

The full methodology and raw numbers are in bench/results/matmul_showdown.md.
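
For context on the "2:4 sparse" rows: in 2:4 structured sparsity, at most two of every four consecutive weights are non-zero, so only the kept values and their in-group positions need to be stored. The Python sketch below illustrates the idea with magnitude-based pruning. The selection rule, function names, and storage layout are assumptions for illustration; the actual compressed formats used by cuSPARSELt and CUTLASS are hardware-specific and are not reproduced here.

```python
def prune_and_compress_2_of_4(row):
    """Keep the 2 largest-magnitude values in every group of 4.

    Returns (dense, values, positions): the pruned row, the kept values,
    and each kept value's position (0..3) inside its group.
    """
    assert len(row) % 4 == 0, "row length must be a multiple of 4"
    dense, values, positions = [], [], []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        by_magnitude = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)
        keep = sorted(by_magnitude[:2])
        dense.extend(v if j in keep else 0 for j, v in enumerate(group))
        values.extend(group[j] for j in keep)
        positions.extend(keep)
    return dense, values, positions

def sparse_dot(values, positions, x):
    """Dot product that touches only the kept weights (2 per group of 4)."""
    total = 0
    for k, (v, pos) in enumerate(zip(values, positions)):
        group = k // 2
        total += v * x[4 * group + pos]
    return total

dense, values, positions = prune_and_compress_2_of_4([3, -1, 0, 7, 2, 2, -5, 1])
# dense     == [3, 0, 0, 7, 2, 0, -5, 0]
# values    == [3, 7, 2, -5]
# positions == [0, 3, 0, 2]
```

The compressed form halves the number of multiply-accumulates per row, which is where the sparse kernels' throughput advantage over dense ones comes from.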

🧬 Design Principles

Principle	Description
Honest numerics	Argmax tokens stay byte-identical to the HuggingFace fp32 reference
Pure-Gleam fallback	Every API works without CUDA, just slower
Owned device memory	`PackedWeight`, `EmbeddingTable`, `KvCache` are Erlang resources
Single-token by default	Decode is `batch=1` first; batched prefill is future work
No magic kernels	Every `.cu` file is human-written, benchmarked, and committed

🛠️ Public API

High-level: LLM inference

import viva_tensor as t

pub fn main() {
  let assert Ok(model) = t.load_model("tmp/tinyllama/model.safetensors")

  let opts =
    t.GenerateOpts(
      max_new_tokens: 50,
      temperature: 0.0,           // argmax — deterministic
      top_k: t.TopKInfinity,
      top_p: 1.0,
      seed: 42,
      stop_on_eos: True,
    )

  let assert Ok(result) = t.generate(model, "Hello", opts)
  result.text
}

Reproducible sampling

let sampling_opts =
  t.GenerateOpts(
    max_new_tokens: 30,
    temperature: 0.8,
    top_k: t.TopK(40),
    top_p: 0.95,
    seed: 42,
    stop_on_eos: True,
  )

Same seed → same token sequence across machines.
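
To make the roles of `temperature`, `top_k`, `top_p`, and `seed` concrete, here is a minimal Python sketch of seeded sampling over a single logit vector. It is an illustration of the general technique, not viva_tensor's sampler: the function names are hypothetical, and the random number generator is Python's, which need not match the one the library uses.

```python
import math
import random

def sample_token(logits, temperature, top_k, top_p, rng):
    """Pick one token id. temperature == 0 means argmax; top_k=None disables top-k."""
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])

    # Rank token ids by logit, highest first, then apply the top-k cut.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]

    # Softmax over the survivors at the given temperature.
    peak = logits[ranked[0]]
    weights = [math.exp((logits[i] - peak) / temperature) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-p (nucleus): keep the smallest prefix whose mass reaches top_p.
    kept, mass = [], 0.0
    for token, p in zip(ranked, probs):
        kept.append((token, p))
        mass += p
        if mass >= top_p:
            break

    # Draw from the nucleus, rescaled to its own total mass.
    draw = rng.random() * mass
    for token, p in kept:
        draw -= p
        if draw < 0.0:
            return token
    return kept[-1][0]

def sample_sequence(logits, n, seed, **opts):
    """Draw n tokens from a fixed logit vector with a seeded generator."""
    rng = random.Random(seed)
    return [sample_token(logits, rng=rng, **opts) for _ in range(n)]

logits = [0.5, 2.0, -1.0, 1.5, 0.0]
first = sample_sequence(logits, 20, seed=42, temperature=0.8, top_k=40, top_p=0.95)
second = sample_sequence(logits, 20, seed=42, temperature=0.8, top_k=40, top_p=0.95)
```

Here `first` and `second` are equal because all randomness flows through one generator seeded once per sequence, which is the property the `seed` option relies on.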

Low-level: quantized linears

let assert Ok(packed) = t.prepack_fp8_weight_blocked(weight, 16)
let assert Ok(output) = t.linear_fp8_w8a16(input, packed, bias)

Prepack once, run linear forwards many times — the FP8 weight + scales live on the device for the lifetime of the PackedWeight resource.
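
The prepack-then-run split can be pictured with a CPU sketch. The Python below quantizes one weight row to FP8 E4M3 values with one scale per block, then computes a dot product against unquantized activations (the "W8A16" idea: 8-bit weights, higher-precision activations). Everything here is an assumption for illustration: the function names, per-block absmax scaling, and reading the `16` argument as a block size along the reduction dimension are guesses, not the library's documented behavior. The "codes" are kept as decoded Python floats rather than 8-bit patterns.

```python
import math

E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format

def round_to_e4m3(x):
    """Round a float to the nearest FP8 E4M3 value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), E4M3_MAX)
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) == e - 1.
    exponent = max(math.frexp(mag)[1] - 1, -6)  # -6: smallest normal exponent
    step = 2.0 ** (exponent - 3)                # 3 mantissa bits per binade
    return math.copysign(round(mag / step) * step, x)

def prepack_blocked(weight_row, block=16):
    """Quantize one row in blocks of `block` values, one absmax scale per block."""
    codes, scales = [], []
    for start in range(0, len(weight_row), block):
        chunk = weight_row[start:start + block]
        amax = max(abs(w) for w in chunk)
        scale = amax / E4M3_MAX if amax > 0.0 else 1.0
        scales.append(scale)
        codes.extend(round_to_e4m3(w / scale) for w in chunk)
    return codes, scales

def linear_row(x, codes, scales, block=16):
    """Dot an activation vector with one prepacked row, applying one scale per block."""
    acc = 0.0
    for b, scale in enumerate(scales):
        lo = b * block
        partial = sum(c * xv for c, xv in zip(codes[lo:lo + block], x[lo:lo + block]))
        acc += scale * partial
    return acc
```

Applying the scale once per block, after the inner sum, is what makes the blocked layout cheap: the inner loop is a plain multiply-accumulate over 8-bit weights.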

🗺️ Roadmap

Phase	Status
Pure-Gleam tensors	✅
CUDA backend (CUTLASS + cuSPARSELt)	✅
FP8 / INT4-2:4 / INT8-2:4 sparse kernels	✅
Public `ModelHandle` API	✅
Sharded SafeTensors loader	✅
Byte-level BPE + SentencePiece tokenizers	✅
Weight-tied embeddings	✅
Full-token CUDA Graph capture	✅
Reproducible temperature/top-k/top-p sampling	✅
FP16 weight dtype (Llama-2-7B)	🔄
Batched prefill	⏳
Speculative decoding	⏳
Hopper SM90 / Blackwell FP4 / NVFP4	⏳

🤝 Contributing

git checkout -b feature/your-feature
make cutlass-libs && make zig
gleam test          # 792 should pass with NIF loaded
make test-no-nif    # 791 should pass without NIF

See CONTRIBUTING.md for guidelines.

📚 Documentation

Language	Link
🇧🇷 Português	docs/pt-br/
🇺🇸 English	docs/en/
🇨🇳 中文	docs/zh-cn/

Guides

Getting started — install, first run.
LLM inference end-to-end — load → tokenize → decode → sample.
FFI architecture — Gleam ↔ Erlang ↔ C/CUDA boundaries.
Project structure — repo layout.

API reference

LLM ModelHandle — load_model, generate, tested models.
Inference API — prepack + linear FP8 / INT-sparse.
Tensor API — pure-Gleam tensor surface.

Technical paper

Honest paper — what works, what doesn’t, why.

📜 What’s new in 2.2.102

Public LLM API. viva_tensor.load_model(path) accepts a HuggingFace Llama-family checkpoint (single file, sharded, or folder) and returns an opaque ModelHandle. viva_tensor.generate(model, prompt, opts) drives deterministic argmax or seeded temperature/top-k/top-p sampling.
Fast FP8 W8A16 decode. Hand-tuned vt_w8a16_mmv_blocked_k16 GEMV with uint4 vectorized loads, full-token CUDA Graph capture with cudaGraphExecUpdate, persistent device-resident KV caches, and a cuBLASLt plan cache.
Multi-model validated. TinyLlama-1.1B and Llama-3.2-1B-Instruct both pass the byte-identical check through the same public API.
Sharded SafeTensors. Loads single .safetensors, HF model.safetensors.index.json, or any folder containing either.
Byte-level BPE. GPT-2 / Llama-3 byte-encoded vocabularies decode back to readable text; SentencePiece (▁) still works as before.
Tied embeddings. Detects tie_word_embeddings from config.json and reuses embed_tokens as lm_head when set.
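
As background for the byte-level BPE item above: GPT-2 style vocabularies store every byte as a printable character (a space becomes `Ġ`, a newline becomes `Ċ`), so decoding means mapping those characters back to bytes and then decoding UTF-8. The Python sketch below reproduces the well-known GPT-2 byte table and shows both decode paths. The decode function names are hypothetical, and the SentencePiece path is simplified to the `▁` replacement only.

```python
def bytes_to_unicode():
    """GPT-2's reversible byte -> printable-character table."""
    keep = (list(range(ord("!"), ord("~") + 1))
            + list(range(ord("¡"), ord("¬") + 1))
            + list(range(ord("®"), ord("ÿ") + 1)))
    table, shifted = {}, 0
    for b in range(256):
        if b in keep:
            table[b] = chr(b)              # already printable: maps to itself
        else:
            table[b] = chr(256 + shifted)  # remapped above U+00FF
            shifted += 1
    return table

UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}

def decode_byte_level(tokens):
    """Turn byte-level BPE token strings back into readable text."""
    raw = bytes(UNICODE_TO_BYTE[ch] for token in tokens for ch in token)
    return raw.decode("utf-8", errors="replace")

def decode_sentencepiece(tokens):
    """SentencePiece marks word starts with U+2581; map it back to a space."""
    return "".join(tokens).replace("\u2581", " ").lstrip(" ")

print(decode_byte_level(["Hello", "Ġworld"]))        # Hello world
print(decode_sentencepiece(["▁Hello", "▁world"]))    # Hello world
```

Decoding at the byte level, rather than per token, matters for multi-byte characters: a single UTF-8 character can be split across two tokens, so the bytes are concatenated before decoding.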

The full evolution from a 63 tok/s baseline to 448 tok/s, across 14 rounds of optimization, is documented in CHANGELOG.md.

Star if you believe BEAM can do LLM inference ⭐

Created by Gabriel Maia · MIT License