protocol

Backend Protocol - Pluggable tensor computation backends

The BEAM’s actor model makes distributed tensor sharding natural. Work on a remote node is addressed the same way as work in a local process - no special distributed runtime needed. This is why Erlang/Elixir ML libraries can scale horizontally with less ceremony than MPI-based frameworks.

Performance reality (measured on M1 MacBook Pro, 1024x1024 matmul):

Pure Erlang: ~100 MFLOPS (lists are not contiguous memory)
Apple Accelerate: ~50 GFLOPS (500x faster - that’s BLAS for you)
Zig SIMD: ~40 GFLOPS (portable, nearly as fast as vendor libs)
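To relate these throughput figures to wall-clock time, the arithmetic can be sketched as below. The helper names are hypothetical and not part of this module, and the count of 2*m*n*k operations (one multiply and one add per inner-loop step) is the usual convention rather than something this page states.

```gleam
import gleam/int

/// Hypothetical helper: floating-point operations for A[m,k] @ B[k,n],
/// counting one multiply and one add per inner-loop step.
pub fn matmul_flops(m: Int, n: Int, k: Int) -> Int {
  2 * m * n * k
}

/// Hypothetical helper: estimated seconds at a sustained throughput
/// given in floating-point operations per second.
pub fn estimated_seconds(flops: Int, flops_per_second: Float) -> Float {
  int.to_float(flops) /. flops_per_second
}
```

Under that convention a 1024x1024 product is about 2.1 GFLOP, so the figures above correspond to roughly 21 s for Pure Erlang, 43 ms for Accelerate and 54 ms for Zig.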

Priority: Zig > Accelerate > Pure

Why? Portable SIMD comes first, the Apple-only library second, and the slow but always-available fallback last. Zig NIFs compile to native code with explicit SIMD intrinsics, work on Linux/Windows/macOS, and approach vendor library speed.

Distributed overhead: only worth it for matrices > 10K x 10K. Below that, network latency dominates compute time. The BEAM makes it easy, but easy != free.

Usage:

  let selected = backend.auto_select()
  let result = backend.matmul(selected, a, b, m, n, k)

Types

Backend

Available computation backends

Design decision: explicit variants rather than trait objects. Gleam’s pattern matching makes dispatch fast and obvious. No runtime type checking, no vtable indirection.

pub type Backend {
  Pure
  Accelerate
  Zig
  Distributed(nodes: List(Node))
}
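A minimal sketch of what dispatch over these variants can look like. The types are repeated so the snippet stands alone, and `describe` is a hypothetical helper, not one of this module's functions.

```gleam
import gleam/int
import gleam/list

pub type Node {
  Node(name: String)
}

pub type Backend {
  Pure
  Accelerate
  Zig
  Distributed(nodes: List(Node))
}

/// Hypothetical helper: a single `case` covers every variant, and the
/// compiler rejects the function if a variant is left unhandled.
pub fn describe(backend: Backend) -> String {
  case backend {
    Pure -> "pure"
    Accelerate -> "accelerate"
    Zig -> "zig"
    Distributed(nodes) ->
      "distributed across " <> int.to_string(list.length(nodes)) <> " node(s)"
  }
}
```

Adding a new variant to `Backend` turns every non-exhaustive `case` into a compile error, which is the practical benefit of explicit variants here.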

Constructors

```
Pure
```
Pure Erlang - always available, ~100 MFLOPS. Uses the Erlang array module for fast indexed access, but is still slow because there is no SIMD.
```
Accelerate
```
Apple Accelerate - macOS only, ~50 GFLOPS. Wraps cblas_sgemm and vDSP for vectorized ops.
```
Zig
```
Zig SIMD - cross-platform, ~40 GFLOPS. Explicit SIMD intrinsics; works on any platform the Zig compiler targets. Zig SIMD > handwritten assembly: the compiler knows your CPU better than you.
```
Distributed(nodes: List(Node))
```
Distributed - shards computation across BEAM nodes.

Row sharding: simple, but can be unbalanced for non-square matrices.
Column sharding: better for tall matrices, but the gather step is more complex.
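As an illustration of the row-sharding idea only, the sketch below splits a row count into contiguous ranges, one per shard. `RowRange` and `shard_rows` are hypothetical names, and this page does not say how the library itself partitions rows.

```gleam
import gleam/int

/// Hypothetical type: half-open row range [start, stop) for one shard.
pub type RowRange {
  RowRange(start: Int, stop: Int)
}

/// Split `rows` rows into `shards` contiguous ranges whose sizes differ
/// by at most one. Non-positive shard counts are treated as one shard.
pub fn shard_rows(rows: Int, shards: Int) -> List(RowRange) {
  shard_loop(int.max(shards, 1), int.max(rows, 0), 0)
}

fn shard_loop(shards_left: Int, rows_left: Int, start: Int) -> List(RowRange) {
  case shards_left {
    0 -> []
    _ -> {
      // Ceiling division hands any leftover rows to the earlier shards.
      let size = { rows_left + shards_left - 1 } / shards_left
      [
        RowRange(start, start + size),
        ..shard_loop(shards_left - 1, rows_left - size, start + size)
      ]
    }
  }
}
```

With more shards than rows the trailing ranges come out empty, so a caller would probably want to skip those nodes.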

Node

Represents a BEAM node for distributed computing. It can be local (same machine) or remote (over the network).

pub type Node {
  Node(name: String)
}

Constructors

```
Node(name: String)
```

Values

add

pub fn add(
  backend: Backend,
  a: List(Float),
  b: List(Float),
) -> Result(List(Float), String)

Element-wise addition using selected backend

auto_select

pub fn auto_select() -> Backend

Automatically select the best available backend

Priority: Zig > Accelerate > Pure

Rationale:

Zig: portable SIMD, works everywhere, ~40 GFLOPS
Accelerate: Apple-specific but highly optimized
Pure: fallback, always works, predictable (if slow)
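The priority order can be sketched as a walk over the candidates, taking the first one that reports itself available. `select_with` is a hypothetical helper that takes the availability check as a parameter so the snippet stands alone. It is an assumption about the shape of the logic, not the actual implementation of `auto_select`.

```gleam
import gleam/list
import gleam/result

pub type Node {
  Node(name: String)
}

pub type Backend {
  Pure
  Accelerate
  Zig
  Distributed(nodes: List(Node))
}

/// Hypothetical helper: try Zig, then Accelerate, and fall back to Pure,
/// which is always available.
pub fn select_with(is_available: fn(Backend) -> Bool) -> Backend {
  [Zig, Accelerate]
  |> list.find(is_available)
  |> result.unwrap(Pure)
}
```

Passing the check as a function also makes the fallback path easy to exercise in tests, for example `select_with(fn(_) { False })` yields `Pure`.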

dot

pub fn dot(
  backend: Backend,
  a: List(Float),
  b: List(Float),
) -> Result(Float, String)

Dot product using selected backend

For distributed: falls back to local backend. Why? Communication overhead > compute for O(n) operations. Only parallelize when compute dominates communication.
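A minimal sketch of that fallback, assuming the local backend is the pure one (this page does not say which local backend is chosen). `dot_sketch` and `dot_loop` are hypothetical names, and in this sketch the native variants also take the pure path, since the real NIF calls are not shown here.

```gleam
pub type Node {
  Node(name: String)
}

pub type Backend {
  Pure
  Accelerate
  Zig
  Distributed(nodes: List(Node))
}

pub fn dot_sketch(
  backend: Backend,
  a: List(Float),
  b: List(Float),
) -> Result(Float, String) {
  case backend {
    // O(n) work: shipping the vectors to other nodes would cost more
    // than computing locally, so ignore the nodes (assumed: use Pure).
    Distributed(_) -> dot_sketch(Pure, a, b)
    // A real implementation would call native code for Accelerate/Zig.
    Pure | Accelerate | Zig -> dot_loop(a, b, 0.0)
  }
}

fn dot_loop(a: List(Float), b: List(Float), acc: Float) -> Result(Float, String) {
  case a, b {
    [], [] -> Ok(acc)
    [x, ..xs], [y, ..ys] -> dot_loop(xs, ys, acc +. x *. y)
    _, _ -> Error("dot: vectors must have the same length")
  }
}
```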

info

pub fn info(backend: Backend) -> String

Get detailed backend info including version/capability strings

is_available

pub fn is_available(backend: Backend) -> Bool

Check if a specific backend is available

Used for graceful degradation and testing

matmul

pub fn matmul(
  backend: Backend,
  a: List(Float),
  b: List(Float),
  m: Int,
  n: Int,
  k: Int,
) -> Result(List(Float), String)

Matrix multiplication using the selected backend: A[m,k] @ B[k,n] -> C[m,n]

Complexity: O(m*n*k) FLOPs
Memory: O(m*n) for the result

Strassen/Winograd variants are not implemented - their constant factors only pay off for matrices larger than about 1000x1000, and BLAS is already optimized.
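To make the shape convention concrete, here is a reference sketch of the plain O(m*n*k) product. It assumes flat, row-major lists, which this page does not state explicitly, and `matmul_reference` is a hypothetical name rather than one of the backends.

```gleam
import gleam/list

/// Hypothetical reference: A[m,k] @ B[k,n] -> C[m,n] over row-major lists.
pub fn matmul_reference(
  a: List(Float),
  b: List(Float),
  m: Int,
  n: Int,
  k: Int,
) -> Result(List(Float), String) {
  let dims_ok = m > 0 && n > 0 && k > 0
  let sizes_ok = list.length(a) == m * k && list.length(b) == k * n
  case dims_ok, sizes_ok {
    False, _ -> Error("m, n and k must be positive")
    True, False -> Error("a must hold m*k values and b must hold k*n values")
    True, True -> {
      // Rows of A, and columns of B obtained by transposing its rows.
      let rows = list.sized_chunk(a, k)
      let cols = list.transpose(list.sized_chunk(b, n))
      Ok(
        list.flat_map(rows, fn(row) {
          list.map(cols, fn(col) { dot_row_col(row, col) })
        }),
      )
    }
  }
}

fn dot_row_col(row: List(Float), col: List(Float)) -> Float {
  list.zip(row, col)
  |> list.fold(0.0, fn(acc, pair) {
    let #(x, y) = pair
    acc +. x *. y
  })
}
```

For example, multiplying the 2x2 matrices `[1.0, 2.0, 3.0, 4.0]` and `[5.0, 6.0, 7.0, 8.0]` with `m = n = k = 2` gives `Ok([19.0, 22.0, 43.0, 50.0])`.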

name

pub fn name(backend: Backend) -> String

Get human-readable backend name

scale

pub fn scale(
  backend: Backend,
  data: List(Float),
  scalar: Float,
) -> Result(List(Float), String)

Scale (multiply by scalar) using selected backend

sum

pub fn sum(
  backend: Backend,
  data: List(Float),
) -> Result(Float, String)

Sum reduction using selected backend