LLM ModelHandle API

viva_tensor exposes a public ModelHandle API for Llama-family HuggingFace models stored as SafeTensors. It packages the production TinyLlama decode path into two calls: load the model once, then generate from the cached handle.

The API is designed for local BF16 HF checkpoints that use the standard Llama tensor names.

If config.json is present next to the SafeTensors file, viva_tensor reads the hidden size, layer count, head count, KV head count, RMSNorm epsilon, RoPE theta, intermediate size, and vocab size from it. Otherwise it infers what it can from tensor shapes and uses the TinyLlama-compatible defaults.
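The fallback described above can be pictured as an overlay of config.json values on a table of defaults. The following is a minimal sketch, not viva_tensor's implementation: the module and function names are hypothetical, the keys follow the HF config.json field names, and the default values are the ones published in TinyLlama-1.1B's own config.json. Whether viva_tensor uses exactly these values as its fallback is an assumption, and atom keys are used only for readability (a decoded JSON file would normally yield binary keys).

```erlang
-module(llama_config_sketch).
-export([defaults/0, resolve/1]).

%% TinyLlama-1.1B values from its HF config.json. Using them as the
%% fallback table here is an assumption made for illustration.
defaults() ->
    #{hidden_size => 2048,
      num_hidden_layers => 22,
      num_attention_heads => 32,
      num_key_value_heads => 4,
      rms_norm_eps => 1.0e-5,
      rope_theta => 10000.0,
      intermediate_size => 5632,
      vocab_size => 32000}.

%% Overlay the fields found in config.json on top of the defaults.
%% Keys the sketch does not know about are dropped; missing keys
%% keep their default value.
resolve(Config) when is_map(Config) ->
    Defaults = defaults(),
    Known = maps:with(maps:keys(Defaults), Config),
    maps:merge(Defaults, Known).
```

Calling `llama_config_sketch:resolve(#{num_hidden_layers => 16})` keeps every default except the layer count. Shape-based inference from the tensors themselves is not modelled here.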

Gleam

import viva_tensor as t

pub fn main() {
  let assert Ok(model) = t.load_model("tmp/tinyllama/model.safetensors")

  let opts =
    t.GenerateOpts(
      max_new_tokens: 50,
      temperature: 0.0,
      top_k: t.TopKInfinity,
      top_p: 1.0,
      seed: 42,
      stop_on_eos: True,
    )

  let assert Ok(result) = t.generate(model, "Hello", opts)
  result.text
}

temperature: 0.0 uses the fused argmax decode-step NIF, which gives byte-identical, reproducible output. temperature > 0.0 uses the fused top-k logits and then applies temperature scaling, top-k and top-p filtering, and seeded multinomial sampling on the host.

For reproducible sampling:

let opts =
  t.GenerateOpts(
    max_new_tokens: 20,
    temperature: 0.8,
    top_k: t.TopK(40),
    top_p: 0.95,
    seed: 42,
    stop_on_eos: True,
  )

let assert Ok(result) = t.generate(model, "Hello", opts)

Erlang

{ok, Model} = viva_tensor_llm:load(
    <<"tmp/tinyllama/model.safetensors">>,
    #{block_size => 16}
),

{ok, Result} = viva_tensor_llm:generate(
    Model,
    <<"Hello">>,
    #{max_new_tokens => 50, temperature => 0.0}
),

#{tokens := Tokens,
  text := Text,
  ms_per_token := MsPerToken,
  total_tokens := TotalTokens} = Result.

Load Options

viva_tensor_llm:load/2 accepts:

| Option | Default | Notes |
| --- | --- | --- |
| num_layers | detected from SafeTensors / config.json | Number of decoder blocks to load. |
| block_size | 16 | FP8 blocked prepack size used by the decode-step path. |
| tokenizer_path | <model>_tokenizer.json, then sibling tokenizer.json fallback | HF tokenizer JSON. |
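The tokenizer_path default implies a two-step lookup. Here is a rough sketch of that lookup order, under one loudly stated assumption: "<model>" is read as the model path with its extension removed, so model.safetensors would map to model_tokenizer.json. The module and function names are hypothetical and the sketch takes plain string paths rather than binaries.

```erlang
-module(tokenizer_path_sketch).
-export([candidates/1, resolve/1]).

%% Candidate tokenizer paths for a model file, in lookup order.
%% Assumption: "<model>" means the model path without its extension.
candidates(ModelPath) when is_list(ModelPath) ->
    Root = filename:rootname(ModelPath),
    Dir = filename:dirname(ModelPath),
    [Root ++ "_tokenizer.json", filename:join(Dir, "tokenizer.json")].

%% Return the first candidate that exists as a regular file.
resolve(ModelPath) ->
    case lists:filter(fun filelib:is_regular/1, candidates(ModelPath)) of
        [Found | _] -> {ok, Found};
        [] -> {error, tokenizer_not_found}
    end.
```

Passing tokenizer_path explicitly to viva_tensor_llm:load/2 avoids relying on either default location.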

Generation Options

viva_tensor_llm:generate/3 accepts:

| Option | Default | Notes |
| --- | --- | --- |
| max_new_tokens | 50 | Maximum generated tokens. |
| temperature | 0.0 | 0.0 keeps the argmax path and absolute reproducibility; values above zero enable sampling. |
| top_k | infinity | Sampling candidate cap. infinity uses up to 256 fused top-k logits; explicit values are capped at 256. |
| top_p | 1.0 | Nucleus sampling probability applied over the fused candidate set. |
| seed | 42 | Deterministic seed; the same prompt, model, and options reproduce the same sampled tokens. |
| stop_on_eos | true | Stop after emitting EOS. |
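To make the interaction between temperature, top_k, top_p, and seed concrete, here is a minimal host-side sampler written against the documented option names. It is an illustrative sketch and not the library's code: the module name, the {TokenId, Logit} input shape, and the choice of the exsss generator are all assumptions. It draws a single token; a real generation loop would carry the random state forward from token to token instead of reseeding.

```erlang
-module(sampling_sketch).
-export([sample/2]).

-define(MAX_CANDIDATES, 256).

%% Logits is a non-empty list of {TokenId, Logit} pairs.
%% Opts uses the documented keys and defaults.
sample(Logits, Opts) ->
    Temp = maps:get(temperature, Opts, 0.0),
    TopK = cap(maps:get(top_k, Opts, infinity)),
    TopP = maps:get(top_p, Opts, 1.0),
    Seed = maps:get(seed, Opts, 42),
    %% Highest logit first, then keep at most TopK candidates.
    Sorted = lists:reverse(lists:keysort(2, Logits)),
    Candidates = lists:sublist(Sorted, TopK),
    if
        Temp > 0 ->
            Probs = softmax(Candidates, Temp),
            draw(nucleus(Probs, TopP, 0.0, []), Seed);
        true ->
            %% temperature 0.0: plain argmax, no randomness involved.
            element(1, hd(Candidates))
    end.

%% infinity and large explicit values are both capped at 256.
cap(infinity) -> ?MAX_CANDIDATES;
cap(K) when is_integer(K), K > 0 -> min(K, ?MAX_CANDIDATES).

%% Temperature-scaled softmax; the first candidate holds the max logit.
softmax([{_, Max} | _] = Candidates, Temp) ->
    Weights = [{Id, math:exp((L - Max) / Temp)} || {Id, L} <- Candidates],
    Sum = lists:sum([W || {_, W} <- Weights]),
    [{Id, W / Sum} || {Id, W} <- Weights].

%% Keep the smallest prefix whose cumulative probability reaches TopP.
nucleus([], _TopP, _Cum, Kept) -> lists:reverse(Kept);
nucleus([{_, P} = H | T], TopP, Cum, Kept) ->
    case Cum + P >= TopP of
        true -> lists:reverse([H | Kept]);
        false -> nucleus(T, TopP, Cum + P, [H | Kept])
    end.

%% Seeded multinomial draw over the surviving candidates.
draw(Nucleus, Seed) ->
    Total = lists:sum([P || {_, P} <- Nucleus]),
    {U, _State} = rand:uniform_s(rand:seed_s(exsss, Seed)),
    pick(Nucleus, U * Total).

pick([{Id, _}], _R) -> Id;
pick([{Id, P} | T], R) ->
    case R < P of
        true -> Id;
        false -> pick(T, R - P)
    end.
```

Because the generator is seeded from the options, two calls with the same logits and the same options return the same token, which is the property the seed row describes.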

Cached vs Per Call

The ModelHandle caches:

Each generate call allocates fresh KV caches before prefill. KV caches are mutated during decode, so they are deliberately allocated per call; this keeps one ModelHandle reusable across prompts.
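Since the handle is reusable and the KV caches are per call, one loaded model can serve a list of prompts in turn. The helper below is a small sketch of that pattern. The module and function names are hypothetical, and the generate function is passed in as an argument so the helper only assumes the {ok, Result} return shape and the text key shown in the Erlang example above.

```erlang
-module(handle_reuse_sketch).
-export([generate_all/4]).

%% Run several prompts against one already-loaded handle.
%% Generate is expected to behave like viva_tensor_llm:generate/3,
%% returning {ok, Result} where Result is a map with a text key.
generate_all(Generate, Model, Prompts, Opts) ->
    [begin
         {ok, #{text := Text}} = Generate(Model, Prompt, Opts),
         {Prompt, Text}
     end || Prompt <- Prompts].
```

With a loaded Model, a call would look like `handle_reuse_sketch:generate_all(fun viva_tensor_llm:generate/3, Model, [<<"Hello">>, <<"Hi">>], #{max_new_tokens => 20})`. The sketch crashes on the first failed call rather than collecting errors.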

Tested Models

| Model | Status | Decode speed | Notes |
| --- | --- | --- | --- |
| TinyLlama-1.1B-Chat-v1.0 | validated | 2.31 ms/token | head_dim=64, GQA fast path, byte-level BPE tokenizer. |
| Llama-3.2-1B-Instruct | validated | 2.47 ms/token | sharded SafeTensors, tied embeddings / lm_head, Llama-3 tokenizer path. |
| NousResearch/Llama-2-7b-chat-hf | validated | 113.18 ms/token | sharded F16 SafeTensors, head_dim=128, no GQA; exercises the dynamic CUDA fallback path. |

The same public API drives all three models:

let assert Ok(model) = t.load_model("tmp/llama32_1b/model-00001-of-00002.safetensors")
let opts = t.default_generate_opts()
let assert Ok(result) = t.generate(model, "Hello", opts)

Performance

On an RTX 4090, the current public handle API has been validated at 2.31 ms/token for TinyLlama-1.1B, 2.47 ms/token for Llama-3.2-1B-Instruct, and 113.18 ms/token for NousResearch/Llama-2-7b-chat-hf. The Llama-2-7B run is functional and its output is coherent, but it is much slower because it exercises the current head_dim=128 dynamic CUDA fallback path. The best TinyLlama FP8 W8A16 decode run reached 448 tok/s, ahead of the local Ollama baseline at 352 tok/s.
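The figures above mix two units, ms/token and tok/s, which are reciprocals scaled by 1000. A tiny conversion helper (hypothetical module name) makes them comparable: 2.31 ms/token is about 433 tok/s, 2.47 ms/token about 405 tok/s, 113.18 ms/token about 8.8 tok/s, and 448 tok/s is about 2.23 ms/token. Note that the 448 tok/s figure is reported as a best run, so it need not match the validated 2.31 ms/token number exactly.

```erlang
-module(throughput_sketch).
-export([tok_per_s/1, ms_per_token/1]).

%% Convert a per-token latency in milliseconds to tokens per second.
tok_per_s(MsPerToken) when MsPerToken > 0 -> 1000.0 / MsPerToken.

%% Convert tokens per second to a per-token latency in milliseconds.
ms_per_token(TokPerS) when TokPerS > 0 -> 1000.0 / TokPerS.
```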

Generation still calls nt_forward_decode_step/8 once per decoded token. Prefill is also token-by-token today; a batched prefill path is future work.
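The control flow this implies, token-by-token prefill followed by one step call per generated token, can be sketched as follows. This is a schematic under stated assumptions, not the library's loop: Step is a hypothetical two-argument fun standing in for a single decode-step call (the real nt_forward_decode_step/8 takes eight arguments that are not described here), the cache is an opaque placeholder, and the module name is invented.

```erlang
-module(decode_loop_sketch).
-export([generate/4]).

%% Step is a hypothetical fun(Token, Cache) -> {NextToken, NewCache}
%% standing in for one decode-step call.
%% PromptTokens must be non-empty; Eos is the end-of-sequence token id.
generate(Step, [_ | _] = PromptTokens, MaxNew, Eos) ->
    Cache0 = [],  % placeholder for a fresh per-call KV cache
    %% Prefill: feed prompt tokens one at a time and keep only the
    %% prediction that follows the last prompt token.
    {First, Cache1} = lists:foldl(
        fun(Tok, {_Prev, Cache}) -> Step(Tok, Cache) end,
        {undefined, Cache0},
        PromptTokens),
    decode(Step, First, Cache1, MaxNew, Eos, []).

%% Emit up to N tokens, one step call per additional token,
%% stopping right after EOS has been emitted.
decode(_Step, _Tok, _Cache, 0, _Eos, Acc) ->
    lists:reverse(Acc);
decode(Step, Tok, Cache, N, Eos, Acc) ->
    Acc1 = [Tok | Acc],
    case Tok =:= Eos orelse N =:= 1 of
        true ->
            lists:reverse(Acc1);
        false ->
            {Next, Cache1} = Step(Tok, Cache),
            decode(Step, Next, Cache1, N - 1, Eos, Acc1)
    end.
```

A batched prefill would replace the fold over PromptTokens with a single call that consumes the whole prompt, leaving the decode half unchanged.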

Limitations
