viva_tensor/nn/autograd
Autograd - Reverse-Mode Automatic Differentiation
“The chain rule is the unsung hero of machine learning.” — Every ML practitioner who debugged NaN gradients at 3am
Implements reverse-mode AD (Speelpenning, 1980) with an explicit tape. Why reverse-mode? Because we have few outputs (loss) and many inputs (params). Forward-mode would require O(n) passes; reverse needs just one. Math wins.
References:
- Speelpenning, B. (1980). “Compiling Fast Partial Derivatives of Functions Given by Algorithms.” PhD thesis, UIUC. The OG automatic differentiation.
- Baydin et al. (2018). “Automatic Differentiation in Machine Learning: a Survey” https://arxiv.org/abs/1502.05767 - If you read one AD paper, make it this one.
- Paszke et al. (2017). “Automatic differentiation in PyTorch” - Dynamic graphs done right.
Design choice: Explicit tape > implicit global graph. Fight me. PyTorch uses dynamic graphs because Chainer proved it works (Tokui et al., 2015). We take it further: the tape is a value you pass around. Pure FP, no spooky action.
The math that makes it all work: Chain rule: dL/dx = dL/dy * dy/dx In reverse-mode, we propagate dL/dy backward, accumulating dL/dx.
Key Concepts
- Tape: The computation graph. Records ops for the backward pass.
- Variable: A tensor with an identity. It knows who it is in the graph.
- Traced(a): State monad in disguise. Carries the value AND updated tape.
Example
import viva_tensor/core/tensor
import viva_tensor/nn/autograd.{Traced}
let tape = autograd.new_tape()
let Traced(x, tape1) = autograd.new_variable(tape, tensor.from_list([2.0]))
let Traced(y, tape2) = autograd.new_variable(tape1, tensor.from_list([3.0]))
let assert Ok(Traced(z, tape3)) = autograd.mul(tape2, x, y)
let assert Ok(grads) = autograd.backward(tape3, z)
// dz/dx = y = 3.0 (partial derivative w.r.t. first input)
// dz/dy = x = 2.0 (partial derivative w.r.t. second input)
let assert Ok(dx) = dict.get(grads, x.id)
let assert Ok(dy) = dict.get(grads, y.id)
Types
The closure that computes gradients for parent nodes. Given dL/dself, returns [(parent_id, dL/dparent), …]. This is where the chain rule lives.
pub type BackwardFn =
fn(tensor.Tensor) -> List(#(Int, tensor.Tensor))
Unique identifier for each node in the computational graph. Sequential IDs give us implicit topological ordering for free. Sometimes the simplest solution is the best one.
pub type NodeId =
Int
The Tape: our explicit computation graph.
Unlike PyTorch’s implicit global state, we pass this around explicitly. Functional programming purists rejoice. Debugging becomes tractable. Trade-off: slightly more verbose code, but no hidden state surprises.
pub type Tape {
Tape(
next_id: Int,
operations: dict.Dict(
Int,
fn(tensor.Tensor) -> List(#(Int, tensor.Tensor)),
),
)
}
Constructors
-
Tape( next_id: Int, operations: dict.Dict( Int, fn(tensor.Tensor) -> List(#(Int, tensor.Tensor)), ), )Arguments
- operations
-
Maps node ID -> backward function that computes parent gradients. Only non-leaf nodes have entries here.
The result of a traced operation: value + updated tape.
This is secretly a State monad: State s a = s -> (a, s) We just make the state threading explicit. Gleam doesn’t have do-notation, so explicit is actually clearer here.
pub type Traced(a) {
Traced(value: a, tape: Tape)
}
Constructors
-
Traced(value: a, tape: Tape)
A variable tracked in the autograd system. Think of it as a tensor that remembers its place in the computation graph.
pub type Variable {
Variable(id: Int, data: tensor.Tensor)
}
Constructors
-
Variable(id: Int, data: tensor.Tensor)
Values
pub fn add(
tape: Tape,
a: Variable,
b: Variable,
) -> Result(Traced(Variable), error.TensorError)
Traced addition: c = a + b
Gradient: dc/da = 1, dc/db = 1 With broadcasting, we sum over expanded dimensions. This is trickier than it looks - broadcasting gradients must reduce back.
pub fn backward(
tape: Tape,
loss: Variable,
) -> Result(dict.Dict(Int, tensor.Tensor), String)
Executes backpropagation starting from a loss variable. Returns gradients for all variables: Map(NodeId -> Tensor). Instrumented: records graph size and backward pass latency.
The loss should be a scalar (or we treat it as sum of elements). Multi-output differentiation is possible but rarely needed in ML.
pub fn matmul(
tape: Tape,
a: Variable,
b: Variable,
) -> Result(Traced(Variable), error.TensorError)
Traced matrix multiplication: C = A @ B
Gradients (the beautiful part): dL/dA = dL/dC @ B^T dL/dB = A^T @ dL/dC
This is why linear algebra and calculus are best friends. The transpose “reverses” the dimension matching from the forward pass.
pub fn mean(tape: Tape, a: Variable) -> Traced(Variable)
Traced mean reduction: y = mean(x)
Gradient: dy/dx_i = 1/n for all i The gradient “fans out” uniformly to all inputs. This is why mean loss converges more stably than sum loss.
pub fn mul(
tape: Tape,
a: Variable,
b: Variable,
) -> Result(Traced(Variable), error.TensorError)
Traced element-wise multiplication: c = a * b (Hadamard product)
Gradient: dc/da = b, dc/db = a The classic product rule: d(uv) = udv + vdu
pub fn new_tape() -> Tape
Creates a fresh tape. The beginning of every gradient computation.
pub fn new_variable(
tape: Tape,
data: tensor.Tensor,
) -> Traced(Variable)
Registers a new variable (leaf node) in the graph. Leaf nodes have no backward function - they’re where gradients accumulate.
pub fn relu(tape: Tape, a: Variable) -> Traced(Variable)
Traced ReLU activation: y = max(0, x)
Gradient: dy/dx = 1 if x > 0, else 0
The “dying ReLU” problem lives here: once a neuron outputs 0, its gradient is 0, so it never learns again. RIP that neuron. Leaky ReLU fixes this, but plain ReLU is still surprisingly effective.
pub fn sequence(
input: Result(Traced(Variable), e),
layer_fn: fn(Tape, Variable) -> Result(Traced(Variable), e),
) -> Result(Traced(Variable), e)
Operation sequencing (monadic bind, essentially). Allows chaining: x |> sequence(layer1) |> sequence(layer2)
This is >>= from Haskell, but we call it sequence because Gleam users shouldn’t need a category theory PhD to read the code.