viva_tensor/core/ffi
FFI - Foreign Function Interface to Erlang
The escape hatch from pure functional bliss into the world of mutable arrays and hardware-specific optimizations.
Why we need this:
- Erlang lists are O(n) for random access. That’s death for matrix ops.
- Erlang’s :array gives us effectively O(1) access (technically O(log10 n), close enough).
- Native NIFs unlock SIMD, BLAS, and GPU backends.
Performance Hierarchy (fastest to slowest)
- Zig SIMD NIF: Hand-tuned SIMD for the hot paths. 10-100x vs pure Gleam.
- Apple Accelerate NIF: cblas_dgemm on macOS. Ridiculously optimized.
- Erlang :array: O(1) access, pure Erlang. 10-50x vs lists for matmul.
- Pure Gleam lists: Beautiful, correct, slow. Fine for small tensors.
Architecture
There are three backends:
- viva_tensor_zig: Portable SIMD via Zig. Works everywhere Zig compiles.
- viva_tensor_nif: Apple Accelerate on macOS (cblas, vDSP).
- viva_tensor_ffi: Pure Erlang fallback. Always works, just slower.
The ops module auto-selects the best available backend at runtime.
Types
pub type CudaInt8TensorRef
pub type CudaTensor16Ref
pub type CudaTensorRef
Erlang :array - the key to O(1) tensor element access.
Under the hood, it’s a tree of 10-element tuples. Technically O(log10 n), but that’s effectively O(1) for any reasonable tensor size.
A 1M element lookup: log10(1_000_000) = 6 tree levels. For lists: 500,000 steps on average. That’s the difference between matmul taking 1 second vs 2 hours.
The array is immutable (functional updates create new nodes), but we mostly just read from it during tensor operations.
pub type ErlangArray
HDC vector reference - binary hyperdimensional vector
pub type HdcVectorRef
LNS tensor reference - f32 storage for IADD-based multiplication
pub type LnsTensorRef
Native tensor resource - opaque reference to contiguous C memory. Created by NIF constructors; ops return new refs. The Erlang GC frees the native memory automatically via the resource destructor.
pub type NativeTensorRef
pub type SparseTensorRef
Values
pub fn abs(x: Float) -> Float
Absolute value.
Implemented in pure Gleam because :math.abs/1 doesn’t exist and :erlang.abs/1 is polymorphic (returns same type as input).
pub fn array_dot(a: ErlangArray, b: ErlangArray) -> Float
Dot product using Erlang arrays.
Performance: ~10-50x faster than list-based for large vectors. The speedup comes entirely from O(1) vs O(n) element access.
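A minimal usage sketch, assuming the module is imported as `viva_tensor/core/ffi` (the path in this page's title). `dot_example` is a hypothetical name used only for illustration.

```gleam
import viva_tensor/core/ffi

/// Hypothetical helper: dot product of two small vectors via Erlang arrays.
pub fn dot_example() -> Float {
  // Pay the O(n) conversion once per vector...
  let a = ffi.list_to_array([1.0, 2.0, 3.0])
  let b = ffi.list_to_array([4.0, 5.0, 6.0])
  // ...then every element access inside the dot product is cheap.
  // 1*4 + 2*5 + 3*6 = 32.0
  ffi.array_dot(a, b)
}
```

For vectors this small the conversion cost dominates; the speedup quoted above only shows up for large inputs.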
pub fn array_get(arr: ErlangArray, index: Int) -> Float
Get element from array at index - O(1).
Contrast with list indexing: O(n). For a dot product over 1000-element vectors (1000 iterations, each indexing both inputs), that’s about 1M list-cell traversals (2,000 lookups averaging 500 steps each) vs 2K array lookups. Huge difference.
pub fn array_matmul(
a: ErlangArray,
b: ErlangArray,
m: Int,
n: Int,
k: Int,
) -> ErlangArray
Matrix multiplication using Erlang arrays.
C[m,n] = A[m,k] @ B[k,n]
Naive O(mnk) algorithm but with O(1) element access. For 100x100 matrices: ~50x faster than list-based.
For serious work, use the Zig SIMD or Accelerate NIF backends. This is the reliable fallback that works everywhere.
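A small sketch of the argument order, with A as m x k and B as k x n. The flat row-major layout is an assumption (this page does not state the layout), and `matmul_example` is a hypothetical name.

```gleam
import viva_tensor/core/ffi

/// Hypothetical helper: multiply a 2x3 matrix by a 3x2 matrix.
/// ASSUMPTION: flat data is stored row-major.
pub fn matmul_example() -> List(Float) {
  // A = [[1, 2, 3],
  //      [4, 5, 6]]          (m = 2, k = 3)
  let a = ffi.list_to_array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
  // B = [[ 7,  8],
  //      [ 9, 10],
  //      [11, 12]]           (k = 3, n = 2)
  let b = ffi.list_to_array([7.0, 8.0, 9.0, 10.0, 11.0, 12.0])
  // Arguments are (a, b, m, n, k), matching C[m,n] = A[m,k] @ B[k,n].
  let c = ffi.array_matmul(a, b, 2, 2, 3)
  // Under the row-major assumption: [58.0, 64.0, 139.0, 154.0]
  ffi.array_to_list(c)
}
```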
pub fn array_scale(
arr: ErlangArray,
scalar: Float,
) -> ErlangArray
Scale all elements by scalar - O(n).
pub fn array_to_list(arr: ErlangArray) -> List(Float)
Convert array back to list - O(n).
Use this for final output or when you need list operations. Try to stay in array-land as long as possible for hot paths.
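One way to stay in array-land is to convert once at each boundary and pipe through the array functions in between. A sketch, with `scaled` as a hypothetical helper name:

```gleam
import viva_tensor/core/ffi

/// Hypothetical helper: list in, list out, array in the middle.
pub fn scaled(xs: List(Float), factor: Float) -> List(Float) {
  xs
  |> ffi.list_to_array
  // The pipe supplies the array as the first argument.
  |> ffi.array_scale(factor)
  |> ffi.array_to_list
}
```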
pub fn ct16_available() -> Bool
pub fn ct16_from_list(
data: List(Float),
shape: List(Int),
) -> Result(CudaTensor16Ref, String)
pub fn ct16_matmul(
a: CudaTensor16Ref,
b: CudaTensor16Ref,
m: Int,
n: Int,
k: Int,
) -> Result(CudaTensor16Ref, String)
pub fn ct16_shape(
ref: CudaTensor16Ref,
) -> Result(List(Int), String)
pub fn ct16_to_list(
ref: CudaTensor16Ref,
) -> Result(List(Float), String)
pub fn ct_from_list(
data: List(Float),
shape: List(Int),
) -> Result(CudaTensorRef, String)
pub fn ct_int8_available() -> Bool
pub fn ct_int8_from_list(
data: List(Float),
shape: List(Int),
) -> Result(CudaInt8TensorRef, String)
pub fn ct_int8_matmul(
a: CudaInt8TensorRef,
b: CudaInt8TensorRef,
m: Int,
n: Int,
k: Int,
) -> Result(CudaInt8TensorRef, String)
pub fn ct_int8_shape(
ref: CudaInt8TensorRef,
) -> Result(List(Int), String)
pub fn ct_int8_to_list(
ref: CudaInt8TensorRef,
) -> Result(List(Float), String)
pub fn ct_matmul(
a: CudaTensorRef,
b: CudaTensorRef,
m: Int,
n: Int,
k: Int,
) -> Result(CudaTensorRef, String)
pub fn ct_shape(ref: CudaTensorRef) -> Result(List(Int), String)
pub fn ct_to_list(
ref: CudaTensorRef,
) -> Result(List(Float), String)
pub fn exp(x: Float) -> Float
Exponential e^x - wraps :math.exp/1
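Watch for overflow: exp(x) exceeds the Float64 range for x above ~709.78, and :math.exp/1 raises a badarith error in that case rather than returning inf. For softmax, subtract max first: exp(x - max(x)).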
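The max-subtraction trick can be sketched as follows. `softmax` is a hypothetical helper, not a function documented on this page; it relies only on `exp` from this module and on `gleam/float` and `gleam/list` from the standard library.

```gleam
import gleam/float
import gleam/list
import viva_tensor/core/ffi

/// Hypothetical helper: numerically stable softmax over a list of logits.
pub fn softmax(xs: List(Float)) -> List(Float) {
  case xs {
    [] -> []
    [first, ..rest] -> {
      // Largest logit; after shifting, every exponent is <= 0.
      let max = list.fold(rest, first, float.max)
      let exps = list.map(xs, fn(x) { ffi.exp(x -. max) })
      // The max element contributes exp(0) = 1, so total >= 1.
      let total = list.fold(exps, 0.0, fn(acc, e) { acc +. e })
      list.map(exps, fn(e) { e /. total })
    }
  }
}
```

Without the shift, an input like `[1000.0, 1000.0]` would overflow; with it, both exponents are 0 and the result is `[0.5, 0.5]`.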
pub fn hdc_bind(
a: HdcVectorRef,
b: HdcVectorRef,
) -> Result(HdcVectorRef, String)
XOR binding: associates two concepts (invertible: A XOR B XOR B = A)
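A sketch of the invertibility property, assuming `hdc_random` accepts the same dimension constraint as `hdc_create` (a multiple of 64). `bind_roundtrip` is a hypothetical name.

```gleam
import gleam/result
import viva_tensor/core/ffi

/// Hypothetical helper: bind A with B, then unbind with B again.
/// If binding is a plain XOR, the result should match A exactly.
pub fn bind_roundtrip() -> Result(Float, String) {
  // 1024 is a multiple of 64; the seeds are arbitrary.
  use a <- result.try(ffi.hdc_random(1024, 1))
  use b <- result.try(ffi.hdc_random(1024, 2))
  use bound <- result.try(ffi.hdc_bind(a, b))
  // (A XOR B) XOR B = A
  use recovered <- result.try(ffi.hdc_bind(bound, b))
  // Expected to be 1.0 (identical) if the property holds.
  ffi.hdc_similarity(recovered, a)
}
```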
pub fn hdc_create(dim: Int) -> Result(HdcVectorRef, String)
Create empty hypervector (dim must be multiple of 64)
pub fn hdc_dim(vec: HdcVectorRef) -> Result(Int, String)
Get dimensionality (total bits)
pub fn hdc_permute(
vec: HdcVectorRef,
shift: Int,
) -> Result(HdcVectorRef, String)
Circular permutation for sequence encoding: encode(ABC) = A XOR perm(B, 1) XOR perm(C, 2).
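The encoding formula above, written out with the functions on this page. `encode3` is a hypothetical helper name.

```gleam
import gleam/result
import viva_tensor/core/ffi.{type HdcVectorRef}

/// Hypothetical helper: encode the ordered triple (a, b, c) as
/// a XOR perm(b, 1) XOR perm(c, 2).
pub fn encode3(
  a: HdcVectorRef,
  b: HdcVectorRef,
  c: HdcVectorRef,
) -> Result(HdcVectorRef, String) {
  // Shift each item by its position so that order matters.
  use b_shifted <- result.try(ffi.hdc_permute(b, 1))
  use c_shifted <- result.try(ffi.hdc_permute(c, 2))
  use ab <- result.try(ffi.hdc_bind(a, b_shifted))
  ffi.hdc_bind(ab, c_shifted)
}
```

Because each position gets a different shift, encoding the same items in a different order should give a different vector.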
pub fn hdc_random(
dim: Int,
seed: Int,
) -> Result(HdcVectorRef, String)
Create random hypervector (seed for reproducibility)
pub fn hdc_similarity(
a: HdcVectorRef,
b: HdcVectorRef,
) -> Result(Float, String)
Cosine-like similarity via Hamming distance, in the range [0, 1]: 1 = identical, 0.5 = orthogonal (random), 0 = opposite.
pub fn horde_create(
entity_count: Int,
dims: Int,
) -> Result(HordeRef, String)
Create new Horde with entity count and dimensionality (1, 2, or 3)
pub fn horde_dampen(
horde: HordeRef,
friction: Float,
) -> Result(Nil, String)
Apply velocity damping: velocities *= friction
pub fn horde_get_positions(
horde: HordeRef,
) -> Result(List(Float), String)
Get current positions as flat list
pub fn horde_get_velocities(
horde: HordeRef,
) -> Result(List(Float), String)
Get current velocities as flat list
pub fn horde_integrate(
horde: HordeRef,
dt: Float,
) -> Result(Nil, String)
Euler integration step: positions += velocities * dt (FMA)
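A sketch of one simulation step for two entities in 2D, using the flat `[x0, y0, x1, y1, …]` layout described under `horde_set_positions`. `step_demo` is a hypothetical name, and the expected values in the comments assume the functions behave exactly as their one-line descriptions say.

```gleam
import gleam/result
import viva_tensor/core/ffi

/// Hypothetical helper: set state, integrate once, dampen, read back.
pub fn step_demo() -> Result(List(Float), String) {
  // Two entities, two dimensions each.
  use horde <- result.try(ffi.horde_create(2, 2))
  use _ <- result.try(ffi.horde_set_positions(horde, [0.0, 0.0, 1.0, 1.0]))
  use _ <- result.try(ffi.horde_set_velocities(horde, [1.0, 0.0, 0.0, -1.0]))
  // positions += velocities * 0.5  ->  [0.5, 0.0, 1.0, 0.5]
  use _ <- result.try(ffi.horde_integrate(horde, 0.5))
  // velocities *= 0.9 (positions are not touched by damping)
  use _ <- result.try(ffi.horde_dampen(horde, 0.9))
  ffi.horde_get_positions(horde)
}
```

The setters and step functions return `Result(Nil, String)`, so the horde is updated in place and the same reference is reused throughout.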
pub fn horde_kinetic_energy(
horde: HordeRef,
) -> Result(Float, String)
Compute total kinetic energy: 0.5 * sum(vel^2)
pub fn horde_set_positions(
horde: HordeRef,
data: List(Float),
) -> Result(Nil, String)
Set all positions from flat list [x0, y0, x1, y1, …] for 2D
pub fn horde_set_velocities(
horde: HordeRef,
data: List(Float),
) -> Result(Nil, String)
Set all velocities from flat list
pub fn horde_wrap(
horde: HordeRef,
max_bound: Float,
) -> Result(Nil, String)
Toroidal wrap: positions mod max_bound
pub fn is_nif_loaded() -> Bool
Check if the Apple Accelerate NIF is loaded.
Returns True on macOS with the NIF built, False elsewhere. Use this to decide whether to use nif_* functions or fall back.
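A sketch of that decision done by hand. The ops module already does this automatically, so this is purely illustrative; `dot` and `array_fallback` are hypothetical names.

```gleam
import viva_tensor/core/ffi

/// Pure-Erlang path: always available.
fn array_fallback(a: List(Float), b: List(Float)) -> Float {
  ffi.array_dot(ffi.list_to_array(a), ffi.list_to_array(b))
}

/// Hypothetical helper: prefer the Accelerate NIF, fall back otherwise.
pub fn dot(a: List(Float), b: List(Float)) -> Float {
  case ffi.is_nif_loaded() {
    True ->
      case ffi.nif_dot(a, b) {
        Ok(value) -> value
        // The NIF reported an error: use the fallback instead.
        Error(_) -> array_fallback(a, b)
      }
    False -> array_fallback(a, b)
  }
}
```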
pub fn list_to_array(lst: List(Float)) -> ErlangArray
Convert list to Erlang array for O(1) access.
O(n) to build, but subsequent access is O(1). Worth it for any tensor you’ll index more than once.
pub fn lns_div(
a: LnsTensorRef,
b: LnsTensorRef,
) -> Result(LnsTensorRef, String)
LNS division via ISUB
pub fn lns_from_f64(
ref: NativeTensorRef,
) -> Result(LnsTensorRef, String)
Convert f64 NativeTensor to f32 LNS tensor
pub fn lns_mul(
a: LnsTensorRef,
b: LnsTensorRef,
) -> Result(LnsTensorRef, String)
Fast LNS multiply via IADD (~11% max error, 8x throughput)
pub fn lns_mul_corrected(
a: LnsTensorRef,
b: LnsTensorRef,
) -> Result(LnsTensorRef, String)
Mitchell’s corrected LNS multiply (~2% max error)
pub fn lns_rsqrt(a: LnsTensorRef) -> Result(LnsTensorRef, String)
Fast inverse sqrt (Quake III trick)
pub fn lns_sqrt(a: LnsTensorRef) -> Result(LnsTensorRef, String)
LNS sqrt via bit shift
pub fn lns_to_f64(
ref: LnsTensorRef,
) -> Result(NativeTensorRef, String)
Convert LNS tensor back to f64 NativeTensor
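A sketch of a full round trip through the LNS path: f64 native tensor, LNS multiply, back to f64, then out to a list. `lns_product` is a hypothetical name, and the 1-D shape is an assumption made for the example.

```gleam
import gleam/list
import gleam/result
import viva_tensor/core/ffi

/// Hypothetical helper: approximate element-wise product via LNS.
/// Both lists are assumed to have the same length.
pub fn lns_product(
  xs: List(Float),
  ys: List(Float),
) -> Result(List(Float), String) {
  let shape = [list.length(xs)]
  use a <- result.try(ffi.nt_from_list(xs, shape))
  use b <- result.try(ffi.nt_from_list(ys, shape))
  // f64 -> f32 LNS storage
  use la <- result.try(ffi.lns_from_f64(a))
  use lb <- result.try(ffi.lns_from_f64(b))
  // Corrected variant: slower than lns_mul, documented at ~2% max error.
  use product <- result.try(ffi.lns_mul_corrected(la, lb))
  // Back to f64, then a single conversion to a list at the boundary.
  use out <- result.try(ffi.lns_to_f64(product))
  ffi.nt_to_list(out)
}
```

The results are approximations, so compare against the exact product with a tolerance rather than with equality.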
pub fn log(x: Float) -> Float
Natural logarithm - wraps :math.log/1
Undefined for x <= 0. :math.log/1 raises a badarith error for zero or negative input (BEAM floats cannot represent -inf or NaN). Caller’s responsibility to check input.
pub fn nif_backend_info() -> String
Get backend info string for debugging.
Returns something like “Apple Accelerate (cblas_dgemm, vDSP)” on macOS.
pub fn nif_dot(
a: List(Float),
b: List(Float),
) -> Result(Float, String)
NIF-accelerated dot product via vDSP.
pub fn nif_matmul(
a: List(Float),
b: List(Float),
m: Int,
n: Int,
k: Int,
) -> Result(List(Float), String)
NIF-accelerated matrix multiplication via cblas_dgemm.
This is where the magic happens on macOS. Apple has spent years optimizing BLAS for their chips. We just call their code.
Falls back to pure Erlang if the NIF is not available.
pub fn nif_scale(
data: List(Float),
scalar: Float,
) -> Result(List(Float), String)
NIF-accelerated scale via vDSP.
pub fn nif_sum(data: List(Float)) -> Result(Float, String)
NIF-accelerated sum via vDSP.
pub fn now_microseconds() -> Int
Get current time in microseconds.
Use for benchmarking: before/after difference gives wall-clock time. For production profiling, use Erlang’s :fprof or :eprof instead.
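The before/after pattern can be wrapped in a small helper. A sketch, with `time_it` as a hypothetical name:

```gleam
import viva_tensor/core/ffi

/// Hypothetical helper: run `work` once and return its result together
/// with the elapsed time in microseconds.
pub fn time_it(work: fn() -> a) -> #(a, Int) {
  let start = ffi.now_microseconds()
  let result = work()
  let elapsed = ffi.now_microseconds() - start
  #(result, elapsed)
}
```

A single run is noisy; for a meaningful number, repeat the call several times and look at the minimum or median.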
pub fn nt_add(
a: NativeTensorRef,
b: NativeTensorRef,
) -> Result(NativeTensorRef, String)
Native add: ref + ref → ref (zero copy)
pub fn nt_add_mut(
a: NativeTensorRef,
b: NativeTensorRef,
) -> Result(Nil, String)
In-place add: a += b. Returns Ok(Nil). MUTATES a.
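A sketch of accumulating into one buffer with the in-place variant. `accumulate` is a hypothetical name.

```gleam
import gleam/result
import viva_tensor/core/ffi

/// Hypothetical helper: add the same tensor into an accumulator twice.
pub fn accumulate() -> Result(List(Float), String) {
  use acc <- result.try(ffi.nt_zeros([3]))
  use g <- result.try(ffi.nt_from_list([1.0, 2.0, 3.0], [3]))
  // Each call mutates `acc`; no new tensor is allocated.
  use _ <- result.try(ffi.nt_add_mut(acc, g))
  use _ <- result.try(ffi.nt_add_mut(acc, g))
  // Expected: [2.0, 4.0, 6.0]
  ffi.nt_to_list(acc)
}
```

Mutation breaks the usual immutability guarantees: every holder of the `acc` reference sees the updated values.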
pub fn nt_dot(
a: NativeTensorRef,
b: NativeTensorRef,
) -> Result(Float, String)
Native dot product: ref · ref → scalar
pub fn nt_fill(
shape: List(Int),
value: Float,
) -> Result(NativeTensorRef, String)
Create native tensor filled with value
pub fn nt_from_list(
data: List(Float),
shape: List(Int),
) -> Result(NativeTensorRef, String)
Create native tensor from list data + shape
pub fn nt_fused_linear_relu(
a: NativeTensorRef,
b: NativeTensorRef,
bias: NativeTensorRef,
m: Int,
n: Int,
k: Int,
) -> Result(NativeTensorRef, String)
Fused MatMul + Bias + ReLU: C = max(0, A@B + bias). Single pass, saves 2 full tensor traversals.
pub fn nt_matmul(
a: NativeTensorRef,
b: NativeTensorRef,
m: Int,
n: Int,
k: Int,
) -> Result(NativeTensorRef, String)
Native matmul: [m,k] @ [k,n] → [m,n] in native memory
pub fn nt_matmul_nf4(
a: NativeTensorRef,
b_indices: List(Int),
b_scales: List(Float),
m: Int,
n: Int,
k: Int,
block_size: Int,
) -> Result(NativeTensorRef, String)
Matrix multiplication with NF4 quantized weights
pub fn nt_mul(
a: NativeTensorRef,
b: NativeTensorRef,
) -> Result(NativeTensorRef, String)
Native element-wise mul: ref * ref → ref
pub fn nt_negate(
a: NativeTensorRef,
) -> Result(NativeTensorRef, String)
Native negate: -ref → ref
pub fn nt_negate_mut(a: NativeTensorRef) -> Result(Nil, String)
In-place negate: a = -a. Returns Ok(Nil). MUTATES a.
pub fn nt_ones(
shape: List(Int),
) -> Result(NativeTensorRef, String)
Create native tensor of ones
pub fn nt_relu(
a: NativeTensorRef,
) -> Result(NativeTensorRef, String)
Native ReLU activation
pub fn nt_relu_mut(a: NativeTensorRef) -> Result(Nil, String)
In-place ReLU: a = max(0, a). Returns Ok(Nil). MUTATES a.
pub fn nt_resonance_mul(
a: NativeTensorRef,
b: NativeTensorRef,
) -> Result(NativeTensorRef, String)
Resonance Multiply: LNS element-wise multiply. result[i] = sign * exp(log|a[i]| + log|b[i]|). Multiplication via addition in log domain; better precision for chains.
pub fn nt_resonance_power(
data: NativeTensorRef,
exponent: Float,
) -> Result(NativeTensorRef, String)
Resonance Power: LNS element-wise power. result[i] = sign(x) * |x|^exponent via exp(exponent * log|x|). Power = multiply in log domain. Sign preserved for bipolar states.
pub fn nt_saturn_blend(
texture: NativeTensorRef,
shade: NativeTensorRef,
bias: Float,
) -> Result(NativeTensorRef, String)
Saturn Blend: result = texture + (shade - bias). VDP1-inspired lighting with pure SIMD addition.
pub fn nt_scale(
a: NativeTensorRef,
scalar: Float,
) -> Result(NativeTensorRef, String)
Native scale: ref * scalar → ref
pub fn nt_scale_mut(
a: NativeTensorRef,
scalar: Float,
) -> Result(Nil, String)
In-place scale: a *= scalar. Returns Ok(Nil). MUTATES a.
pub fn nt_shape(
ref: NativeTensorRef,
) -> Result(List(Int), String)
Get shape from native tensor
pub fn nt_sigmoid(
a: NativeTensorRef,
) -> Result(NativeTensorRef, String)
Native sigmoid activation
pub fn nt_sub(
a: NativeTensorRef,
b: NativeTensorRef,
) -> Result(NativeTensorRef, String)
Native sub: ref - ref → ref
pub fn nt_to_list(
ref: NativeTensorRef,
) -> Result(List(Float), String)
Extract data as list (one-time conversion at boundaries)
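A sketch of the intended pattern: convert at the boundary, keep intermediate results as native refs, and call `nt_to_list` once at the end. `dense_relu` is a hypothetical name; B is the identity matrix, so the result does not depend on the storage layout.

```gleam
import gleam/result
import viva_tensor/core/ffi

/// Hypothetical helper: matmul followed by ReLU, all in native memory.
pub fn dense_relu() -> Result(List(Float), String) {
  // A is 1x2; B is the 2x2 identity.
  use a <- result.try(ffi.nt_from_list([1.0, -2.0], [1, 2]))
  use b <- result.try(ffi.nt_from_list([1.0, 0.0, 0.0, 1.0], [2, 2]))
  // Arguments are (a, b, m, n, k): [1,2] @ [2,2] -> [1,2]
  use c <- result.try(ffi.nt_matmul(a, b, 1, 2, 2))
  // max(0, [1.0, -2.0]) -> [1.0, 0.0]
  use activated <- result.try(ffi.nt_relu(c))
  // Single conversion back to a Gleam list.
  ffi.nt_to_list(activated)
}
```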
pub fn nt_transpose(
a: NativeTensorRef,
) -> Result(NativeTensorRef, String)
Native transpose: [m,n] → [n,m] contiguous copy
pub fn nt_zeros(
shape: List(Int),
) -> Result(NativeTensorRef, String)
Create native tensor of zeros
pub fn random_uniform() -> Float
Uniform random float in [0, 1).
Uses Erlang’s per-process PRNG from :rand (exsss, i.e. Xorshift116**, by default since OTP 22; exrop, i.e. Xoroshiro116+, on OTP 20-21). Not suitable for cryptography, but fine for ML initialization.
For reproducible results, seed with :rand.seed(Algorithm, Seed).
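A sketch of a simple weight initializer built on `random_uniform`. `uniform_init` and its `limit` parameter are hypothetical; the rescaling maps [0, 1) to [-limit, limit).

```gleam
import gleam/list
import viva_tensor/core/ffi

/// Hypothetical helper: `n` weights drawn uniformly from [-limit, limit).
pub fn uniform_init(n: Int, limit: Float) -> List(Float) {
  // list.repeat returns [] for n <= 0, so no special case is needed.
  list.repeat(Nil, n)
  |> list.map(fn(_) {
    // [0, 1) -> [-1, 1) -> [-limit, limit)
    { ffi.random_uniform() *. 2.0 -. 1.0 } *. limit
  })
}
```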
pub fn sparse_available() -> Bool
pub fn sparse_compression_ratio(
ref: SparseTensorRef,
) -> Result(Float, String)
pub fn sparse_from_ct16(
ref: CudaTensor16Ref,
) -> Result(SparseTensorRef, String)
pub fn sparse_matmul(
a_sparse: SparseTensorRef,
b_dense: CudaTensor16Ref,
m: Int,
n: Int,
k: Int,
) -> Result(CudaTensor16Ref, String)
pub fn sparse_shape(
ref: SparseTensorRef,
) -> Result(List(Int), String)
pub fn tanh(x: Float) -> Float
Hyperbolic tangent - wraps :math.tanh/1
Range: (-1, 1). Saturates for |x| > ~20. Used in some activation functions, though ReLU dominates now.
pub fn zig_add(
a: List(Float),
b: List(Float),
) -> Result(List(Float), String)
Zig SIMD element-wise add.
pub fn zig_backend_info() -> String
Get Zig backend info for debugging.
Returns SIMD capability info: “Zig SIMD (AVX2)” or “Zig SIMD (NEON)” etc.
pub fn zig_dot(
a: List(Float),
b: List(Float),
) -> Result(Float, String)
Zig SIMD dot product.
Uses 4-way or 8-way SIMD depending on platform. Unrolled loop with accumulator to maximize throughput.
pub fn zig_matmul(
a: List(Float),
b: List(Float),
m: Int,
n: Int,
k: Int,
) -> Result(List(Float), String)
Zig SIMD matrix multiplication.
Tiled implementation with SIMD inner loops. Not quite BLAS-level but respectable: ~10-50 GFLOPS depending on platform.
pub fn zig_mul(
a: List(Float),
b: List(Float),
) -> Result(List(Float), String)
Zig SIMD element-wise multiply.
pub fn zig_scale(
data: List(Float),
scalar: Float,
) -> Result(List(Float), String)
Zig SIMD scale (multiply all elements by scalar).
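The list-based Zig functions all return `Result`, so they chain with `result.try`. A sketch computing alpha * x + y, with `axpy` as a hypothetical name:

```gleam
import gleam/result
import viva_tensor/core/ffi

/// Hypothetical helper: alpha * x + y using the Zig SIMD backend.
pub fn axpy(
  alpha: Float,
  x: List(Float),
  y: List(Float),
) -> Result(List(Float), String) {
  use scaled <- result.try(ffi.zig_scale(x, alpha))
  ffi.zig_add(scaled, y)
}
```

Each call here crosses the list/NIF boundary; for longer chains of operations, the `nt_*` functions avoid that by keeping data in native memory.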