Getting started

Quick path from zero to a running tensor program. The pure-Gleam path needs only the Gleam toolchain. The CUDA inference path needs a CUDA 12.0+ toolkit and an Ada-or-newer NVIDIA GPU.

Run a model with the public API

git clone https://github.com/gabrielmaialva33/viva_tensor
cd viva_tensor
make cutlass-libs     # CUTLASS + cuSPARSELt static archives
make zig              # the NIF .so

import viva_tensor as t

pub fn main() {
  let assert Ok(model) = t.load_model("tmp/tinyllama/model.safetensors")
  let opts = t.default_generate_opts()
  let assert Ok(result) = t.generate(model, "Hello", opts)
  result.text
}

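This is the preferred path in v2.2.102 for Llama-family Hugging Face checkpoints. The same API has been validated on TinyLlama-1.1B and Llama-3.2-1B-Instruct. The snippet expects the checkpoint files under tmp/tinyllama; the download commands are in the "Run TinyLlama-1.1B end-to-end" section below.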

Pure-Gleam tensor path

gleam new my_app
cd my_app
gleam add viva_tensor

// src/my_app.gleam
import gleam/io
import gleam/string
import viva_tensor as t

pub fn main() {
  let assert Ok(a) = t.matrix(2, 3, [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
  let assert Ok(b) = t.matrix(3, 2, [1.0, 0.0, 0.0, 1.0, 1.0, 0.0])
  let assert Ok(c) = t.matmul(a, b)
  io.println(string.inspect(t.to_list(c)))
}

gleam run

This works on any platform Gleam supports; no NIF is needed.

CUDA inference path

Prerequisites:

NVIDIA GPU, Ada SM89 (RTX 4090) or newer recommended
CUDA 12.0+ toolkit (nvcc)
Driver 555+
zig 0.14+ (build system)
g++ 14 (GCC 16 has known nvcc breakage on <functional>; the Makefile auto-detects g++-14 as the host compiler)
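
Before building, it can save time to confirm that the tools listed above are on PATH. The following is a minimal sketch, not part of the repository: the `check_tool` helper is hypothetical, and the binary name `g++-14` is an assumption that matches Debian/Ubuntu packaging and may differ on other distributions.

```shell
#!/bin/sh
# Report whether a build prerequisite is on PATH; prints one line per tool.
# check_tool is a hypothetical helper, not something the project ships.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1"
  fi
}

# Tool names follow the prerequisites list; g++-14 is a distro-specific name.
for tool in nvcc nvidia-smi zig g++-14; do
  check_tool "$tool"
done

# If a driver is installed, print GPU name, compute capability and driver version
# so they can be compared against SM89 (compute capability 8.9) and driver 555+.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name,compute_cap,driver_version --format=csv,noheader
fi
```

The script only reports what it finds and never aborts, so it is safe to run on a CPU-only machine. Version numbers still need to be checked by hand with `nvcc --version` and `zig version`.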

Build:

make cutlass-libs     # CUTLASS + cuSPARSELt static archives
make zig              # the NIF .so
gleam test            # 792 tests, all should pass with NIF loaded

If the NIF .so isn’t present, the same gleam test still runs; it just skips the native-only paths.

Run TinyLlama-1.1B end-to-end

See inference.md for the full walkthrough. Quick version:

mkdir -p tmp/tinyllama && cd tmp/tinyllama
for f in model.safetensors config.json tokenizer.json tokenizer_config.json; do
  wget https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0/resolve/main/$f
done
cd ../..
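
An interrupted download can leave a missing or zero-byte file that only surfaces later as a load error. As a rough sketch (the `check_checkpoint` helper is hypothetical and checks presence and non-emptiness only, not integrity), the four files fetched above could be verified from the repository root like this:

```shell
#!/bin/sh
# Return 0 only if every expected checkpoint file exists and is non-empty.
# check_checkpoint is a hypothetical helper; it does not validate file contents.
check_checkpoint() {
  dir=$1
  rc=0
  for f in model.safetensors config.json tokenizer.json tokenizer_config.json; do
    if [ ! -s "$dir/$f" ]; then
      echo "missing or empty: $dir/$f"
      rc=1
    fi
  done
  return "$rc"
}

# Run from the repository root, after the download loop has finished.
if check_checkpoint tmp/tinyllama; then
  echo "checkpoint files look complete"
fi
```
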

erlc -o /tmp dev/llama_forward.erl
erl -pa /tmp -pa build/dev/erlang/viva_tensor/ebin -noshell \
    -eval 'llama_forward:run_generate_w8a16(22, <<"Hello">>, 20, #{}, 16), halt(0).'

Expected: the model prints a continuation of “Hello”, generated at ~2.31 ms/token through the public handle path. For everyday use, prefer the public API shown at the top of this page; the dev/llama_forward.erl runner is kept for advanced debugging and kernel bisects.
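
To put the quoted latency in more familiar terms, the arithmetic below converts ms/token into tokens per second and estimates decode time for a short run. It is only a worked calculation: the helper names are hypothetical, and it assumes the `20` in the command above is the number of tokens to generate, which this page does not state. Model load and prompt processing time are not included.

```shell
#!/bin/sh
# Convert a per-token latency in milliseconds into tokens per second.
tokens_per_second() {
  awk -v ms="$1" 'BEGIN { printf "%.1f\n", 1000 / ms }'
}

# Estimate total decode time in milliseconds for n tokens at a given latency.
decode_ms() {
  awk -v ms="$1" -v n="$2" 'BEGIN { printf "%.1f\n", ms * n }'
}

tokens_per_second 2.31   # prints 432.9
decode_ms 2.31 20        # prints 46.2 (assumes 20 is the token count)
```

Measured numbers depend on the GPU and driver, so treat the result as a reference point rather than a target.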

Verify your install

# Check NIF loaded
erl -pa build/dev/erlang/viva_tensor/ebin -noshell -eval \
    'io:format("~p~n", [viva_tensor_zig:cuda_available()]), halt(0).'
# -> true   (or false on CPU-only build)

# Run a quick CUDA matmul
gleam run -m viva_tensor/bench/peak
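
The two checks above can be chained so the benchmark only runs when the probe reports CUDA. This is a sketch under assumptions: `cuda_ready` is a hypothetical helper, and the script assumes it is run from the repository root after a successful build.

```shell
#!/bin/sh
# True only when the probe printed exactly "true".
# cuda_ready is a hypothetical helper, not part of viva_tensor.
cuda_ready() {
  [ "$1" = "true" ]
}

# Capture the probe output; fall back to a marker string if erl is not installed.
if command -v erl >/dev/null 2>&1; then
  status=$(erl -pa build/dev/erlang/viva_tensor/ebin -noshell -eval \
    'io:format("~p~n", [viva_tensor_zig:cuda_available()]), halt(0).' 2>/dev/null)
else
  status="erl-not-found"
fi

if cuda_ready "$status"; then
  gleam run -m viva_tensor/bench/peak
else
  echo "skipping CUDA benchmark (probe returned: $status)"
fi
```

On a CPU-only build the probe prints `false`, so the script reports the skip instead of starting a benchmark that needs the GPU.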

Next steps

inference.md — full end-to-end Llama walkthrough.
../api/tensor.md — public tensor API reference.
../api/inference.md — prepack / linear / sampling / tokenizer reference.
ffi-architecture.md — how the Gleam ↔ NIF boundary works (maintainer-facing).