autograd

Autograd - Reverse-Mode Automatic Differentiation

“The chain rule is the unsung hero of machine learning.” — Every ML practitioner who debugged NaN gradients at 3am

Implements reverse-mode AD (Speelpenning, 1980) with an explicit tape. Why reverse-mode? Because we have few outputs (loss) and many inputs (params). Forward-mode would require O(n) passes; reverse needs just one. Math wins.

References:

Speelpenning, B. (1980). “Compiling Fast Partial Derivatives of Functions Given by Algorithms.” PhD thesis, UIUC. The OG automatic differentiation.
Baydin et al. (2018). “Automatic Differentiation in Machine Learning: a Survey” https://arxiv.org/abs/1502.05767 - If you read one AD paper, make it this one.
Paszke et al. (2017). “Automatic differentiation in PyTorch” - Dynamic graphs done right.

Design choice: Explicit tape > implicit global graph. Fight me. PyTorch uses dynamic graphs because Chainer proved it works (Tokui et al., 2015). We take it further: the tape is a value you pass around. Pure FP, no spooky action.

The math that makes it all work: Chain rule: dL/dx = dL/dy * dy/dx In reverse-mode, we propagate dL/dy backward, accumulating dL/dx.

Key Concepts

Tape: The computation graph. Records ops for the backward pass.
Variable: A tensor with an identity. It knows who it is in the graph.
Traced(a): State monad in disguise. Carries the value AND updated tape.

Example

import viva_tensor/core/tensor
import viva_tensor/nn/autograd.{Traced}

let tape = autograd.new_tape()
let Traced(x, tape1) = autograd.new_variable(tape, tensor.from_list([2.0]))
let Traced(y, tape2) = autograd.new_variable(tape1, tensor.from_list([3.0]))

let assert Ok(Traced(z, tape3)) = autograd.mul(tape2, x, y)
let assert Ok(grads) = autograd.backward(tape3, z)

// dz/dx = y = 3.0  (partial derivative w.r.t. first input)
// dz/dy = x = 2.0  (partial derivative w.r.t. second input)
let assert Ok(dx) = dict.get(grads, x.id)
let assert Ok(dy) = dict.get(grads, y.id)

Types

BackwardFn

</>

The closure that computes gradients for parent nodes. Given dL/dself, returns [(parent_id, dL/dparent), …]. This is where the chain rule lives.

pub type BackwardFn =
  fn(tensor.Tensor) -> List(#(Int, tensor.Tensor))

NodeId

</>

Unique identifier for each node in the computational graph. Sequential IDs give us implicit topological ordering for free. Sometimes the simplest solution is the best one.

pub type NodeId =
  Int

Tape

</>

The Tape: our explicit computation graph.

Unlike PyTorch’s implicit global state, we pass this around explicitly. Functional programming purists rejoice. Debugging becomes tractable. Trade-off: slightly more verbose code, but no hidden state surprises.

pub type Tape {
  Tape(
    next_id: Int,
    operations: dict.Dict(
      Int,
      fn(tensor.Tensor) -> List(#(Int, tensor.Tensor)),
    ),
  )
}

Constructors

Tape(
  next_id: Int,
  operations: dict.Dict(
    Int,
    fn(tensor.Tensor) -> List(#(Int, tensor.Tensor)),
  ),
)

Arguments

operations: Maps node ID -> backward function that computes parent gradients. Only non-leaf nodes have entries here.

Traced

</>

The result of a traced operation: value + updated tape.

This is secretly a State monad: State s a = s -> (a, s) We just make the state threading explicit. Gleam doesn’t have do-notation, so explicit is actually clearer here.

pub type Traced(a) {
  Traced(value: a, tape: Tape)
}

Constructors

```
Traced(value: a, tape: Tape)
```

Variable

</>

A variable tracked in the autograd system. Think of it as a tensor that remembers its place in the computation graph.

pub type Variable {
  Variable(id: Int, data: tensor.Tensor)
}

Constructors

```
Variable(id: Int, data: tensor.Tensor)
```

Values

add

</>

pub fn add(
  tape: Tape,
  a: Variable,
  b: Variable,
) -> Result(Traced(Variable), error.TensorError)

Traced addition: c = a + b

Gradient: dc/da = 1, dc/db = 1 With broadcasting, we sum over expanded dimensions. This is trickier than it looks - broadcasting gradients must reduce back.

backward

</>

pub fn backward(
  tape: Tape,
  loss: Variable,
) -> Result(dict.Dict(Int, tensor.Tensor), String)

Executes backpropagation starting from a loss variable. Returns gradients for all variables: Map(NodeId -> Tensor). Instrumented: records graph size and backward pass latency.

The loss should be a scalar (or we treat it as sum of elements). Multi-output differentiation is possible but rarely needed in ML.

matmul

</>

pub fn matmul(
  tape: Tape,
  a: Variable,
  b: Variable,
) -> Result(Traced(Variable), error.TensorError)

Traced matrix multiplication: C = A @ B

Gradients (the beautiful part): dL/dA = dL/dC @ B^T dL/dB = A^T @ dL/dC

This is why linear algebra and calculus are best friends. The transpose “reverses” the dimension matching from the forward pass.

mean

</>

pub fn mean(tape: Tape, a: Variable) -> Traced(Variable)

Traced mean reduction: y = mean(x)

Gradient: dy/dx_i = 1/n for all i The gradient “fans out” uniformly to all inputs. This is why mean loss converges more stably than sum loss.

mul

</>

pub fn mul(
  tape: Tape,
  a: Variable,
  b: Variable,
) -> Result(Traced(Variable), error.TensorError)

Traced element-wise multiplication: c = a * b (Hadamard product)

Gradient: dc/da = b, dc/db = a The classic product rule: d(uv) = udv + vdu

new_tape

</>

pub fn new_tape() -> Tape

Creates a fresh tape. The beginning of every gradient computation.

new_variable

</>

pub fn new_variable(
  tape: Tape,
  data: tensor.Tensor,
) -> Traced(Variable)

Registers a new variable (leaf node) in the graph. Leaf nodes have no backward function - they’re where gradients accumulate.

relu

</>

pub fn relu(tape: Tape, a: Variable) -> Traced(Variable)

Traced ReLU activation: y = max(0, x)

Gradient: dy/dx = 1 if x > 0, else 0

The “dying ReLU” problem lives here: once a neuron outputs 0, its gradient is 0, so it never learns again. RIP that neuron. Leaky ReLU fixes this, but plain ReLU is still surprisingly effective.

sequence

</>

pub fn sequence(
  input: Result(Traced(Variable), e),
  layer_fn: fn(Tape, Variable) -> Result(Traced(Variable), e),
) -> Result(Traced(Variable), e)

Operation sequencing (monadic bind, essentially). Allows chaining: x |> sequence(layer1) |> sequence(layer2)

This is >>= from Haskell, but we call it sequence because Gleam users shouldn’t need a category theory PhD to read the code.

sub

</>

pub fn sub(
  tape: Tape,
  a: Variable,
  b: Variable,
) -> Result(Traced(Variable), error.TensorError)

Traced subtraction: c = a - b

Gradient: dc/da = 1, dc/db = -1 Subtraction is just addition with a sign flip. Simple, elegant.

transpose

</>

pub fn transpose(
  tape: Tape,
  a: Variable,
) -> Result(Traced(Variable), error.TensorError)

Traced transpose: B = A^T

Gradient: dL/dA = (dL/dB)^T Transpose is its own inverse. Elegant symmetry.