viva_tensor/core/ffi

FFI - Foreign Function Interface to Erlang

The escape hatch from pure functional bliss into the world of mutable arrays and hardware-specific optimizations.

Why we need this:

  1. Erlang lists are O(n) for random access. That’s death for matrix ops.
  2. Erlang’s :array gives us O(1) access (technically O(log10 n), close enough).
  3. Native NIFs unlock SIMD, BLAS, and GPU backends.

Performance Hierarchy (fastest to slowest)

  1. Zig SIMD NIF: Hand-tuned SIMD for the hot paths. 10-100x vs pure Gleam.
  2. Apple Accelerate NIF: cblas_dgemm on macOS. Ridiculously optimized.
  3. Erlang :array: O(1) access, pure Erlang. 10-50x vs lists for matmul.
  4. Pure Gleam lists: Beautiful, correct, slow. Fine for small tensors.

Architecture

Three acceleration backends are available. The ops module auto-selects the best one at runtime.
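As a rough sketch of what such a selection could look like when done by hand: this is not the ops module's actual code, `pick_matmul` is a hypothetical name, and the priority order simply follows the performance hierarchy listed earlier. Only functions documented on this page are called.

```gleam
import viva_tensor/core/ffi

/// Hypothetical helper: try Zig SIMD, then Accelerate, then Erlang arrays.
pub fn pick_matmul(
  a: List(Float),
  b: List(Float),
  m: Int,
  n: Int,
  k: Int,
) -> Result(List(Float), String) {
  case ffi.zig_is_loaded(), ffi.is_nif_loaded() {
    True, _ -> ffi.zig_matmul(a, b, m, n, k)
    False, True -> ffi.nif_matmul(a, b, m, n, k)
    False, False -> {
      // Portable fallback: convert once, then multiply with array lookups.
      let c =
        ffi.array_matmul(ffi.list_to_array(a), ffi.list_to_array(b), m, n, k)
      Ok(ffi.array_to_list(c))
    }
  }
}
```

The fallback branch wraps its result in `Ok` so that all three paths share one return type.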

Types

pub type CudaInt8TensorRef
pub type CudaTensor16Ref
pub type CudaTensorRef

Erlang :array - the key to O(1) tensor element access.

Under the hood, it’s a tree of 10-element tuples. Technically O(log10 n) but that’s effectively O(1) for any reasonable tensor size.

A 1M element lookup: log10(1,000,000) = 6 tree levels. For lists: 500,000 cell hops on average. That’s the difference between matmul taking 1 second vs 2 hours.

The array is immutable (functional updates create new nodes), but we mostly just read from it during tensor operations.

pub type ErlangArray
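To make the lookup arithmetic concrete, here is a small self-contained sketch. `tree_depth` is a hypothetical helper, not part of this module. It counts how many levels a tree needs to address `n` elements for a given branching factor, shown for the 10-element nodes described above and, for comparison, for a 32-way tree.

```gleam
import gleam/int
import gleam/io

/// Levels needed to address `n` elements with `base` children per node.
pub fn tree_depth(n: Int, base: Int) -> Int {
  depth_loop(n, base, 1, base)
}

fn depth_loop(n: Int, base: Int, depth: Int, capacity: Int) -> Int {
  case capacity >= n {
    True -> depth
    False -> depth_loop(n, base, depth + 1, capacity * base)
  }
}

pub fn main() {
  // 10-way tree: 6 levels for one million elements.
  io.println("base 10: " <> int.to_string(tree_depth(1_000_000, 10)))
  // 32-way tree: 4 levels for one million elements.
  io.println("base 32: " <> int.to_string(tree_depth(1_000_000, 32)))
  // A linked list walks half the cells on average.
  io.println("list hops: " <> int.to_string(1_000_000 / 2))
}
```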

HDC vector reference - binary hyperdimensional vector

pub type HdcVectorRef

Horde reference - SoA entity collection

pub type HordeRef

LNS tensor reference - f32 storage for IADD-based multiplication

pub type LnsTensorRef

Native tensor resource - opaque reference to contiguous C memory. Created by NIF constructors; operations return new refs. The Erlang GC frees the native memory automatically via the resource destructor.

pub type NativeTensorRef
pub type SparseTensorRef

Values

pub fn abs(x: Float) -> Float

Absolute value.

Implemented in pure Gleam because :math.abs/1 doesn’t exist and erlang:abs/1 is polymorphic (returns same type as input).

pub fn array_dot(a: ErlangArray, b: ErlangArray) -> Float

Dot product using Erlang arrays.

Performance: ~10-50x faster than list-based for large vectors. The speedup comes entirely from O(1) vs O(n) element access.

pub fn array_get(arr: ErlangArray, index: Int) -> Float

Get element from array at index - O(1).

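Contrast with list indexing: O(n). For a 1000-element dot product (1000 iterations, each indexing both inputs), that’s 2,000 indexed reads. On lists each read walks about 500 cells on average, so roughly 1M cell hops (2M worst case) vs 2K array lookups. Huge difference.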

pub fn array_matmul(
  a: ErlangArray,
  b: ErlangArray,
  m: Int,
  n: Int,
  k: Int,
) -> ErlangArray

Matrix multiplication using Erlang arrays.

C[m,n] = A[m,k] @ B[k,n]

Naive O(mnk) algorithm but with O(1) element access. For 100x100 matrices: ~50x faster than list-based.

For serious work, use the Zig SIMD or Accelerate NIF backends. This is the reliable fallback that works everywhere.
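A minimal usage sketch, built only from the signatures on this page. The flat row-major layout is an assumption (the docs above do not state the storage order), and `small_matmul` is a hypothetical name.

```gleam
import viva_tensor/core/ffi

pub fn small_matmul() -> List(Float) {
  // A is 2x3 and B is 3x2, both flattened (row-major is assumed here).
  let a = ffi.list_to_array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
  let b = ffi.list_to_array([7.0, 8.0, 9.0, 10.0, 11.0, 12.0])
  // m = 2 rows of A, n = 2 columns of B, k = 3 shared dimension.
  ffi.array_matmul(a, b, 2, 2, 3)
  |> ffi.array_to_list
}
```

If the layout really is row-major, the product would be [58.0, 64.0, 139.0, 154.0]. Note the argument order: m and n describe the output, k is the shared inner dimension.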

pub fn array_scale(
  arr: ErlangArray,
  scalar: Float,
) -> ErlangArray

Scale all elements by scalar - O(n).

pub fn array_size(arr: ErlangArray) -> Int

Get array size - O(1).

pub fn array_sum(arr: ErlangArray) -> Float

Sum all elements - O(n).

pub fn array_to_list(arr: ErlangArray) -> List(Float)

Convert array back to list - O(n).

Use this for final output or when you need list operations. Try to stay in array-land as long as possible for hot paths.
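A short sketch of the "stay in array-land" advice, using only functions documented here (`scaled_sum` is a hypothetical name): convert once, do the work on the array, and convert back only for the final output.

```gleam
import viva_tensor/core/ffi

pub fn scaled_sum(values: List(Float), factor: Float) -> #(Float, List(Float)) {
  // One list -> array conversion up front.
  let scaled =
    values
    |> ffi.list_to_array
    |> ffi.array_scale(factor)
  // Reduce on the array, convert back to a list only at the boundary.
  #(ffi.array_sum(scaled), ffi.array_to_list(scaled))
}
```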

pub fn cos(x: Float) -> Float

Cosine - wraps :math.cos/1

pub fn ct16_available() -> Bool
pub fn ct16_from_list(
  data: List(Float),
  shape: List(Int),
) -> Result(CudaTensor16Ref, String)
pub fn ct16_matmul(
  a: CudaTensor16Ref,
  b: CudaTensor16Ref,
  m: Int,
  n: Int,
  k: Int,
) -> Result(CudaTensor16Ref, String)
pub fn ct16_shape(
  ref: CudaTensor16Ref,
) -> Result(List(Int), String)
pub fn ct16_to_list(
  ref: CudaTensor16Ref,
) -> Result(List(Float), String)
pub fn ct_from_list(
  data: List(Float),
  shape: List(Int),
) -> Result(CudaTensorRef, String)
pub fn ct_int8_available() -> Bool
pub fn ct_int8_from_list(
  data: List(Float),
  shape: List(Int),
) -> Result(CudaInt8TensorRef, String)
pub fn ct_int8_matmul(
  a: CudaInt8TensorRef,
  b: CudaInt8TensorRef,
  m: Int,
  n: Int,
  k: Int,
) -> Result(CudaInt8TensorRef, String)
pub fn ct_int8_shape(
  ref: CudaInt8TensorRef,
) -> Result(List(Int), String)
pub fn ct_int8_to_list(
  ref: CudaInt8TensorRef,
) -> Result(List(Float), String)
pub fn ct_matmul(
  a: CudaTensorRef,
  b: CudaTensorRef,
  m: Int,
  n: Int,
  k: Int,
) -> Result(CudaTensorRef, String)
pub fn ct_shape(ref: CudaTensorRef) -> Result(List(Int), String)
pub fn ct_to_list(
  ref: CudaTensorRef,
) -> Result(List(Float), String)
pub const e: Float

Euler’s number e to 20 decimal places.

pub fn exp(x: Float) -> Float

Exponential e^x - wraps :math.exp/1

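Watch for overflow: e^x exceeds the Float64 range for x above about 709.78. Erlang floats cannot hold infinity, so :math.exp/1 raises a badarith error there instead of returning inf. For softmax, subtract max first: exp(x - max(x)).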
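The max-subtraction trick can be sketched as follows. `softmax` is a hypothetical helper written for illustration, not a function of this module. It relies only on `exp` from this page plus the Gleam standard library.

```gleam
import gleam/float
import gleam/list
import viva_tensor/core/ffi

pub fn softmax(xs: List(Float)) -> List(Float) {
  // Largest input; after subtracting it every exponent is <= 0,
  // so exp never sees a large positive argument.
  let peak = case xs {
    [] -> 0.0
    [first, ..rest] -> list.fold(rest, first, float.max)
  }
  let exps = list.map(xs, fn(x) { ffi.exp(x -. peak) })
  let total = float.sum(exps)
  list.map(exps, fn(e) { e /. total })
}
```

With inputs such as [1000.0, 1000.0] a naive exp(1000.0) would overflow, while the shifted version only ever computes exp(0.0).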

pub fn hdc_bind(
  a: HdcVectorRef,
  b: HdcVectorRef,
) -> Result(HdcVectorRef, String)

XOR binding: associates two concepts (invertible: A XOR B XOR B = A)
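The invertibility claim can be checked with a sketch like this one, which uses only signatures from this page. `bind_roundtrip` is a hypothetical name, and the dimension 8192 is an arbitrary choice that satisfies the multiple-of-64 rule stated for hdc_create.

```gleam
import gleam/result
import viva_tensor/core/ffi

/// Binds a with b, unbinds with b again, and compares the result to a.
pub fn bind_roundtrip() -> Result(Float, String) {
  use a <- result.try(ffi.hdc_random(8192, 1))
  use b <- result.try(ffi.hdc_random(8192, 2))
  use bound <- result.try(ffi.hdc_bind(a, b))
  // (A XOR B) XOR B should give A back.
  use recovered <- result.try(ffi.hdc_bind(bound, b))
  ffi.hdc_similarity(recovered, a)
}
```

If binding is a plain XOR as described, the returned similarity should be 1.0.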

pub fn hdc_create(dim: Int) -> Result(HdcVectorRef, String)

Create empty hypervector (dim must be multiple of 64)

pub const hdc_default_dim: Int

Standard HDC dimension (10,000 bits)

pub fn hdc_dim(vec: HdcVectorRef) -> Result(Int, String)

Get dimensionality (total bits)

pub fn hdc_permute(
  vec: HdcVectorRef,
  shift: Int,
) -> Result(HdcVectorRef, String)

Circular permutation for sequence encoding: encode(ABC) = A XOR perm(B,1) XOR perm(C,2)
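The encode(ABC) formula translates fairly directly into calls on this module. The sketch below assumes that `shift` counts positions as in perm(X, shift). `encode_abc` is a hypothetical helper.

```gleam
import gleam/result
import viva_tensor/core/ffi.{type HdcVectorRef}

/// encode(ABC) = A XOR perm(B, 1) XOR perm(C, 2)
pub fn encode_abc(
  a: HdcVectorRef,
  b: HdcVectorRef,
  c: HdcVectorRef,
) -> Result(HdcVectorRef, String) {
  // Shift each symbol by its position so that order matters.
  use b1 <- result.try(ffi.hdc_permute(b, 1))
  use c2 <- result.try(ffi.hdc_permute(c, 2))
  // XOR binding is binary, so chain two binds.
  use ab <- result.try(ffi.hdc_bind(a, b1))
  ffi.hdc_bind(ab, c2)
}
```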

pub fn hdc_random(
  dim: Int,
  seed: Int,
) -> Result(HdcVectorRef, String)

Create random hypervector (seed for reproducibility)

pub fn hdc_similarity(
  a: HdcVectorRef,
  b: HdcVectorRef,
) -> Result(Float, String)

Cosine-like similarity via Hamming distance, in [0, 1]: 1 = identical, 0.5 = orthogonal (random), 0 = opposite.
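One mapping that matches the three reference points listed is similarity = 1 - hamming_distance / dim. Whether the NIF uses exactly this formula is an assumption. The helper below is purely illustrative and not part of the module.

```gleam
import gleam/int

/// Assumed mapping from Hamming distance to a [0, 1] similarity score.
pub fn hamming_to_similarity(distance: Int, dim: Int) -> Float {
  // 0 differing bits -> 1.0, half the bits -> 0.5, all bits -> 0.0.
  1.0 -. int.to_float(distance) /. int.to_float(dim)
}
```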

pub fn horde_count(horde: HordeRef) -> Result(Int, String)

Get entity count

pub fn horde_create(
  entity_count: Int,
  dims: Int,
) -> Result(HordeRef, String)

Create new Horde with entity count and dimensionality (1, 2, or 3)

pub fn horde_dampen(
  horde: HordeRef,
  friction: Float,
) -> Result(Nil, String)

Apply velocity damping: velocities *= friction

pub fn horde_get_positions(
  horde: HordeRef,
) -> Result(List(Float), String)

Get current positions as flat list

pub fn horde_get_velocities(
  horde: HordeRef,
) -> Result(List(Float), String)

Get current velocities as flat list

pub fn horde_integrate(
  horde: HordeRef,
  dt: Float,
) -> Result(Nil, String)

Euler integration step: positions += velocities * dt (FMA)

pub fn horde_kinetic_energy(
  horde: HordeRef,
) -> Result(Float, String)

Compute total kinetic energy: 0.5 * sum(vel^2)

pub fn horde_set_positions(
  horde: HordeRef,
  data: List(Float),
) -> Result(Nil, String)

Set all positions from flat list [x0, y0, x1, y1, …] for 2D

pub fn horde_set_velocities(
  horde: HordeRef,
  data: List(Float),
) -> Result(Nil, String)

Set all velocities from flat list

pub fn horde_wrap(
  horde: HordeRef,
  max_bound: Float,
) -> Result(Nil, String)

Toroidal wrap: positions mod max_bound
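Put together, the Horde functions suggest a simulation tick along these lines. This is a sketch under assumptions: the friction and bound values are arbitrary, velocities are assumed to use the same flat layout as positions, and `tick` and `demo` are hypothetical names.

```gleam
import gleam/result
import viva_tensor/core/ffi.{type HordeRef}

/// One simulation tick; returns the kinetic energy after the step.
pub fn tick(horde: HordeRef, dt: Float) -> Result(Float, String) {
  // positions += velocities * dt
  use _ <- result.try(ffi.horde_integrate(horde, dt))
  // velocities *= 0.99
  use _ <- result.try(ffi.horde_dampen(horde, 0.99))
  // keep positions inside [0, 100) on a torus
  use _ <- result.try(ffi.horde_wrap(horde, 100.0))
  ffi.horde_kinetic_energy(horde)
}

pub fn demo() -> Result(Float, String) {
  // Two entities in 2D: [x0, y0, x1, y1].
  use horde <- result.try(ffi.horde_create(2, 2))
  use _ <- result.try(ffi.horde_set_positions(horde, [0.0, 0.0, 50.0, 50.0]))
  use _ <- result.try(ffi.horde_set_velocities(horde, [1.0, 0.0, 0.0, -1.0]))
  tick(horde, 0.016)
}
```

The setters and step functions return Result(Nil, String) and mutate the Horde in place, which is why the intermediate values are discarded with `use _`.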

pub fn is_nif_loaded() -> Bool

Check if the Apple Accelerate NIF is loaded.

Returns True on macOS with the NIF built, False elsewhere. Use this to decide whether to use nif_* functions or fall back.

pub fn list_to_array(lst: List(Float)) -> ErlangArray

Convert list to Erlang array for O(1) access.

O(n) to build, but subsequent access is O(1). Worth it for any tensor you’ll index more than once.

pub fn lns_div(
  a: LnsTensorRef,
  b: LnsTensorRef,
) -> Result(LnsTensorRef, String)

LNS division via ISUB

pub fn lns_from_f64(
  ref: NativeTensorRef,
) -> Result(LnsTensorRef, String)

Convert f64 NativeTensor to f32 LNS tensor

pub fn lns_mul(
  a: LnsTensorRef,
  b: LnsTensorRef,
) -> Result(LnsTensorRef, String)

Fast LNS multiply via IADD (~11% max error, 8x throughput)

pub fn lns_mul_corrected(
  a: LnsTensorRef,
  b: LnsTensorRef,
) -> Result(LnsTensorRef, String)

Mitchell’s corrected LNS multiply (~2% max error)

pub fn lns_rsqrt(a: LnsTensorRef) -> Result(LnsTensorRef, String)

Fast inverse sqrt (Quake III trick)

pub fn lns_sqrt(a: LnsTensorRef) -> Result(LnsTensorRef, String)

LNS sqrt via bit shift

pub fn lns_to_f64(
  ref: LnsTensorRef,
) -> Result(NativeTensorRef, String)

Convert LNS tensor back to f64 NativeTensor
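A sketch of a full round trip through the LNS functions, using only the signatures on this page (`lns_product` is a hypothetical name): build f64 native tensors, convert to LNS, multiply, and convert back.

```gleam
import gleam/result
import viva_tensor/core/ffi

/// Element-wise product computed in the log domain.
pub fn lns_product(
  xs: List(Float),
  ys: List(Float),
  shape: List(Int),
) -> Result(List(Float), String) {
  use a <- result.try(ffi.nt_from_list(xs, shape))
  use b <- result.try(ffi.nt_from_list(ys, shape))
  use la <- result.try(ffi.lns_from_f64(a))
  use lb <- result.try(ffi.lns_from_f64(b))
  // Corrected variant: documented as roughly 2% max error.
  use product <- result.try(ffi.lns_mul_corrected(la, lb))
  use back <- result.try(ffi.lns_to_f64(product))
  ffi.nt_to_list(back)
}
```

Swapping in `ffi.lns_mul` trades accuracy for throughput, per the error figures quoted above.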

pub fn log(x: Float) -> Float

Natural logarithm - wraps :math.log/1

Undefined for x <= 0. Erlang floats have no -inf or NaN, so :math.log/1 raises a badarith error for zero and for negative input. Caller’s responsibility to check input.
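Since the input check is left to the caller, a small guard like the following can help. `safe_log` is a hypothetical wrapper, not part of this module.

```gleam
import viva_tensor/core/ffi

/// Returns an Error instead of calling log outside its domain.
pub fn safe_log(x: Float) -> Result(Float, String) {
  case x >. 0.0 {
    True -> Ok(ffi.log(x))
    False -> Error("log is undefined for x <= 0")
  }
}
```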

pub fn nif_backend_info() -> String

Get backend info string for debugging.

Returns something like “Apple Accelerate (cblas_dgemm, vDSP)” on macOS.

pub fn nif_dot(
  a: List(Float),
  b: List(Float),
) -> Result(Float, String)

NIF-accelerated dot product via vDSP.

pub fn nif_matmul(
  a: List(Float),
  b: List(Float),
  m: Int,
  n: Int,
  k: Int,
) -> Result(List(Float), String)

NIF-accelerated matrix multiplication via cblas_dgemm.

This is where the magic happens on macOS. Apple has spent years optimizing BLAS for their chips. We just call their code.

Falls back to pure Erlang if the NIF is not available.

pub fn nif_scale(
  data: List(Float),
  scalar: Float,
) -> Result(List(Float), String)

NIF-accelerated scale via vDSP.

pub fn nif_sum(data: List(Float)) -> Result(Float, String)

NIF-accelerated sum via vDSP.

pub fn now_microseconds() -> Int

Get current time in microseconds.

Use for benchmarking: before/after difference gives wall-clock time. For production profiling, use Erlang’s :fprof or :eprof instead.
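The before/after pattern can be wrapped in a tiny helper. `time_it` is a hypothetical name. It runs a zero-argument function and returns its result together with the elapsed microseconds.

```gleam
import viva_tensor/core/ffi

/// Runs `work` once and returns #(result, elapsed_microseconds).
pub fn time_it(work: fn() -> a) -> #(a, Int) {
  let start = ffi.now_microseconds()
  let value = work()
  let elapsed = ffi.now_microseconds() - start
  #(value, elapsed)
}
```

A call such as `time_it(fn() { ffi.array_sum(arr) })` measures a single run. For stable numbers, repeat the call and average.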

pub fn nt_add(
  a: NativeTensorRef,
  b: NativeTensorRef,
) -> Result(NativeTensorRef, String)

Native add: ref + ref → ref (zero copy)

pub fn nt_add_mut(
  a: NativeTensorRef,
  b: NativeTensorRef,
) -> Result(Nil, String)

In-place add: a += b. Returns ok. MUTATES a.

pub fn nt_dot(
  a: NativeTensorRef,
  b: NativeTensorRef,
) -> Result(Float, String)

Native dot product: ref · ref → scalar

pub fn nt_exp(
  a: NativeTensorRef,
) -> Result(NativeTensorRef, String)

Native exp

pub fn nt_fill(
  shape: List(Int),
  value: Float,
) -> Result(NativeTensorRef, String)

Create native tensor filled with value

pub fn nt_from_list(
  data: List(Float),
  shape: List(Int),
) -> Result(NativeTensorRef, String)

Create native tensor from list data + shape

pub fn nt_fused_linear_relu(
  a: NativeTensorRef,
  b: NativeTensorRef,
  bias: NativeTensorRef,
  m: Int,
  n: Int,
  k: Int,
) -> Result(NativeTensorRef, String)

Fused MatMul + Bias + ReLU: C = max(0, A@B + bias). Single pass, saves 2 full tensor traversals.
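For comparison, here is what the unfused path looks like with the other functions on this page. The sketch assumes `bias` already has shape [m, n] so that nt_add accepts it. How the fused NIF broadcasts bias is not documented here. Both function names are hypothetical.

```gleam
import gleam/result
import viva_tensor/core/ffi.{type NativeTensorRef}

/// Three separate passes: matmul, then add, then ReLU.
pub fn linear_relu_unfused(
  a: NativeTensorRef,
  b: NativeTensorRef,
  bias: NativeTensorRef,
  m: Int,
  n: Int,
  k: Int,
) -> Result(NativeTensorRef, String) {
  use product <- result.try(ffi.nt_matmul(a, b, m, n, k))
  use shifted <- result.try(ffi.nt_add(product, bias))
  ffi.nt_relu(shifted)
}

/// Same expression in one NIF call.
pub fn linear_relu_fused(
  a: NativeTensorRef,
  b: NativeTensorRef,
  bias: NativeTensorRef,
  m: Int,
  n: Int,
  k: Int,
) -> Result(NativeTensorRef, String) {
  ffi.nt_fused_linear_relu(a, b, bias, m, n, k)
}
```

The unfused version allocates two intermediate tensors and walks the output twice more, which is the cost the fused call is described as avoiding.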

pub fn nt_log(
  a: NativeTensorRef,
) -> Result(NativeTensorRef, String)

Native log

pub fn nt_matmul(
  a: NativeTensorRef,
  b: NativeTensorRef,
  m: Int,
  n: Int,
  k: Int,
) -> Result(NativeTensorRef, String)

Native matmul: [m,k] @ [k,n] → [m,n] in native memory

pub fn nt_matmul_nf4(
  a: NativeTensorRef,
  b_indices: List(Int),
  b_scales: List(Float),
  m: Int,
  n: Int,
  k: Int,
  block_size: Int,
) -> Result(NativeTensorRef, String)

Matrix multiplication with NF4 quantized weights

pub fn nt_max(a: NativeTensorRef) -> Result(Float, String)

Native max → scalar

pub fn nt_min(a: NativeTensorRef) -> Result(Float, String)

Native min → scalar

pub fn nt_mul(
  a: NativeTensorRef,
  b: NativeTensorRef,
) -> Result(NativeTensorRef, String)

Native element-wise mul: ref * ref → ref

pub fn nt_negate(
  a: NativeTensorRef,
) -> Result(NativeTensorRef, String)

Native negate: -ref → ref

pub fn nt_negate_mut(a: NativeTensorRef) -> Result(Nil, String)

In-place negate: a = -a. Returns ok. MUTATES a.

pub fn nt_ones(
  shape: List(Int),
) -> Result(NativeTensorRef, String)

Create native tensor of ones

pub fn nt_relu(
  a: NativeTensorRef,
) -> Result(NativeTensorRef, String)

Native ReLU activation

pub fn nt_relu_mut(a: NativeTensorRef) -> Result(Nil, String)

In-place ReLU: a = max(0, a). Returns ok. MUTATES a.

pub fn nt_resonance_mul(
  a: NativeTensorRef,
  b: NativeTensorRef,
) -> Result(NativeTensorRef, String)

Resonance Multiply: LNS element-wise multiply. result[i] = sign * exp(log|a[i]| + log|b[i]|). Multiplication via addition in the log domain, which gives better precision for chains.

pub fn nt_resonance_power(
  data: NativeTensorRef,
  exponent: Float,
) -> Result(NativeTensorRef, String)

Resonance Power: LNS element-wise power. result[i] = sign(x) * |x|^exponent, computed via exp(exponent * log|x|). Power = multiply in log domain. Sign preserved for bipolar states.
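A scalar version of that formula, written with the math wrappers from this module, may make the sign handling clearer. `signed_power` is a hypothetical helper, and returning 0.0 for a zero input is an assumption of this sketch rather than documented NIF behaviour.

```gleam
import viva_tensor/core/ffi

/// sign(x) * |x|^exponent, computed as exp(exponent * log|x|).
pub fn signed_power(x: Float, exponent: Float) -> Float {
  case x == 0.0 {
    // log|0| is undefined, so zero is handled separately (assumption).
    True -> 0.0
    False -> {
      let magnitude = ffi.exp(exponent *. ffi.log(ffi.abs(x)))
      case x <. 0.0 {
        True -> 0.0 -. magnitude
        False -> magnitude
      }
    }
  }
}
```

For example, `signed_power(-4.0, 0.5)` comes out close to -2.0, whereas an ordinary power of a negative base with a fractional exponent is not defined over the reals.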

pub fn nt_saturn_blend(
  texture: NativeTensorRef,
  shade: NativeTensorRef,
  bias: Float,
) -> Result(NativeTensorRef, String)

Saturn Blend: result = texture + (shade - bias). VDP1-inspired lighting with pure SIMD addition.

pub fn nt_scale(
  a: NativeTensorRef,
  scalar: Float,
) -> Result(NativeTensorRef, String)

Native scale: ref * scalar → ref

pub fn nt_scale_mut(
  a: NativeTensorRef,
  scalar: Float,
) -> Result(Nil, String)

In-place scale: a *= scalar. Returns ok. MUTATES a.
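The difference between the copying and in-place variants can be observed with a sketch like this (`scale_demo` is a hypothetical name). It reads the tensor back before and after each call.

```gleam
import gleam/result
import viva_tensor/core/ffi

/// Returns #(original, scaled_copy, after_mutation) as lists.
pub fn scale_demo() -> Result(
  #(List(Float), List(Float), List(Float)),
  String,
) {
  use t <- result.try(ffi.nt_from_list([1.0, 2.0, 3.0], [3]))
  // nt_scale returns a new ref; t itself should be untouched.
  use copy <- result.try(ffi.nt_scale(t, 2.0))
  use original <- result.try(ffi.nt_to_list(t))
  use scaled_copy <- result.try(ffi.nt_to_list(copy))
  // nt_scale_mut changes t in place and returns Nil.
  use _ <- result.try(ffi.nt_scale_mut(t, 2.0))
  use after_mutation <- result.try(ffi.nt_to_list(t))
  Ok(#(original, scaled_copy, after_mutation))
}
```

Because the `_mut` functions break the usual immutability guarantee, any other code holding the same ref sees the change too.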

pub fn nt_shape(
  ref: NativeTensorRef,
) -> Result(List(Int), String)

Get shape from native tensor

pub fn nt_sigmoid(
  a: NativeTensorRef,
) -> Result(NativeTensorRef, String)

Native sigmoid activation

pub fn nt_size(ref: NativeTensorRef) -> Result(Int, String)

Get total element count

pub fn nt_sub(
  a: NativeTensorRef,
  b: NativeTensorRef,
) -> Result(NativeTensorRef, String)

Native sub: ref - ref → ref

pub fn nt_sum(a: NativeTensorRef) -> Result(Float, String)

Native sum reduction → scalar

pub fn nt_to_list(
  ref: NativeTensorRef,
) -> Result(List(Float), String)

Extract data as list (one-time conversion at boundaries)
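As an illustration of converting only at the boundaries, the sketch below computes a mean squared error entirely on native tensors: lists go in once and only scalars come back out. `mean_squared_error` is a hypothetical helper built from the nt_* signatures on this page.

```gleam
import gleam/int
import gleam/list
import gleam/result
import viva_tensor/core/ffi

pub fn mean_squared_error(
  predicted: List(Float),
  target: List(Float),
) -> Result(Float, String) {
  let shape = [list.length(predicted)]
  // Boundary in: two list -> native conversions.
  use p <- result.try(ffi.nt_from_list(predicted, shape))
  use t <- result.try(ffi.nt_from_list(target, shape))
  // All intermediate work stays in native memory.
  use diff <- result.try(ffi.nt_sub(p, t))
  use squared <- result.try(ffi.nt_mul(diff, diff))
  // Boundary out: only scalars cross back.
  use total <- result.try(ffi.nt_sum(squared))
  use count <- result.try(ffi.nt_size(diff))
  Ok(total /. int.to_float(count))
}
```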

pub fn nt_transpose(
  a: NativeTensorRef,
) -> Result(NativeTensorRef, String)

Native transpose: [m,n] → [n,m] contiguous copy

pub fn nt_zeros(
  shape: List(Int),
) -> Result(NativeTensorRef, String)

Create native tensor of zeros

pub const pi: Float

Pi to 20 decimal places. More than Float64 can represent anyway.

pub fn pow(x: Float, y: Float) -> Float

Power x^y - wraps :math.pow/2

pub fn random_uniform() -> Float

Uniform random float in [0, 1).

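Uses Erlang’s per-process PRNG from the :rand module (exsss, a Xorshift116** variant, is the default since OTP 22; OTP 20 and 21 defaulted to exrop, Xoroshiro116+). Not suitable for cryptography, but fine for ML initialization.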

For reproducible results, seed with :rand.seed(Algorithm, Seed).
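For weight initialization, the [0, 1) samples usually need rescaling. A minimal sketch follows. `uniform_init` is a hypothetical helper, and the symmetric range is just one common choice.

```gleam
import gleam/list
import viva_tensor/core/ffi

/// `count` samples drawn uniformly from [-limit, limit).
pub fn uniform_init(count: Int, limit: Float) -> List(Float) {
  list.repeat(Nil, count)
  |> list.map(fn(_) {
    // Map [0, 1) to [-1, 1), then stretch to the requested limit.
    { ffi.random_uniform() *. 2.0 -. 1.0 } *. limit
  })
}
```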

pub fn sin(x: Float) -> Float

Sine - wraps :math.sin/1

pub fn sparse_available() -> Bool
pub fn sparse_compression_ratio(
  ref: SparseTensorRef,
) -> Result(Float, String)
pub fn sparse_from_ct16(
  ref: CudaTensor16Ref,
) -> Result(SparseTensorRef, String)
pub fn sparse_matmul(
  a_sparse: SparseTensorRef,
  b_dense: CudaTensor16Ref,
  m: Int,
  n: Int,
  k: Int,
) -> Result(CudaTensor16Ref, String)
pub fn sparse_shape(
  ref: SparseTensorRef,
) -> Result(List(Int), String)
pub fn sqrt(x: Float) -> Float

Square root - wraps :math.sqrt/1

pub fn tan(x: Float) -> Float

Tangent - wraps :math.tan/1

pub fn tanh(x: Float) -> Float

Hyperbolic tangent - wraps :math.tanh/1

Range: (-1, 1). Saturates for |x| > ~20. Used in some activation functions, though ReLU dominates now.

pub fn zig_add(
  a: List(Float),
  b: List(Float),
) -> Result(List(Float), String)

Zig SIMD element-wise add.

pub fn zig_backend_info() -> String

Get Zig backend info for debugging.

Returns SIMD capability info: “Zig SIMD (AVX2)” or “Zig SIMD (NEON)” etc.

pub fn zig_dot(
  a: List(Float),
  b: List(Float),
) -> Result(Float, String)

Zig SIMD dot product.

Uses 4-way or 8-way SIMD depending on platform. Unrolled loop with accumulator to maximize throughput.

pub fn zig_is_loaded() -> Bool

Check if Zig SIMD NIF is loaded.

pub fn zig_matmul(
  a: List(Float),
  b: List(Float),
  m: Int,
  n: Int,
  k: Int,
) -> Result(List(Float), String)

Zig SIMD matrix multiplication.

Tiled implementation with SIMD inner loops. Not quite BLAS-level but respectable: ~10-50 GFLOPS depending on platform.
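To check a GFLOPS figure on your own machine, count 2*m*n*k floating-point operations per matmul (one multiply and one add per inner-loop step) and divide by the elapsed time. The helper below is a hypothetical sketch that takes a timing in microseconds, for example one obtained from now_microseconds.

```gleam
import gleam/int

/// GFLOPS for an m x k by k x n matmul that took `elapsed_us` microseconds.
pub fn matmul_gflops(m: Int, n: Int, k: Int, elapsed_us: Int) -> Float {
  let flops = int.to_float(2 * m * n * k)
  // 1 GFLOPS = 1000 floating-point operations per microsecond.
  flops /. { int.to_float(elapsed_us) *. 1000.0 }
}
```

A 1000x1000x1000 multiply finishing in 100,000 microseconds works out to 20 GFLOPS, inside the range quoted above. Keep in mind that timings of the list-based zig_matmul include converting the lists to and from native buffers.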

pub fn zig_mul(
  a: List(Float),
  b: List(Float),
) -> Result(List(Float), String)

Zig SIMD element-wise multiply.

pub fn zig_scale(
  data: List(Float),
  scalar: Float,
) -> Result(List(Float), String)

Zig SIMD scale (multiply all elements by scalar).

pub fn zig_sum(data: List(Float)) -> Result(Float, String)

Zig SIMD sum reduction.
