layers

Neural Network Layers - The Building Blocks of Deep Learning

“A neural network is just a differentiable program.” — Yann LeCun (paraphrased, but he’d probably agree)

References:

Rumelhart, Hinton & Williams (1986). “Learning representations by back-propagating errors.” Nature. The paper that popularized backpropagation.
Glorot & Bengio (2010). “Understanding the difficulty of training deep feedforward neural networks.” Xavier initialization lives here.
He et al. (2015). “Delving Deep into Rectifiers.” Kaiming init for ReLU.

Design philosophy:

Layers are values, not classes. No hidden state, no surprises.
Forward pass returns Traced(Variable) - computation graph included.
Everything flows through the tape. Backprop just works.

Types

Linear

Linear (fully connected) layer.

Stores weights W: [out_features, in_features] and bias b: [out_features]. This follows the PyTorch convention of storing weights as [out, in], so the forward pass computes y = xW^T + b.

pub type Linear {
  Linear(w: autograd.Variable, b: autograd.Variable)
}

Constructors

Linear(w: autograd.Variable, b: autograd.Variable)

Values

linear

pub fn linear(
  tape: autograd.Tape,
  in_features: Int,
  out_features: Int,
) -> autograd.Traced(Linear)

Creates a new Linear layer with Xavier/Glorot initialization.

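Xavier init: W ~ Uniform(-a, a) with a = sqrt(6 / (fan_in + fan_out)), where fan_in = in_features and fan_out = out_features.

This keeps the variance of activations and gradients roughly stable across layers for tanh/sigmoid. For ReLU, He init (standard deviation sqrt(2 / fan_in)) is the usual choice instead.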

The bias starts at zero. Some argue for small positive values to ensure ReLU neurons fire initially, but zero works fine.
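
The Xavier bound above can be sketched in a few lines. This is a minimal illustration on plain Float values, not this module's implementation: the names xavier_limit and xavier_samples are hypothetical, and it assumes gleam_stdlib's float.random, which returns a value in [0, 1).

```gleam
import gleam/float
import gleam/int
import gleam/list

/// Xavier/Glorot uniform bound: a = sqrt(6 / (fan_in + fan_out)).
pub fn xavier_limit(fan_in: Int, fan_out: Int) -> Float {
  // The argument is positive for any positive fan sizes, so the
  // square root cannot fail here.
  let assert Ok(limit) =
    float.square_root(6.0 /. int.to_float(fan_in + fan_out))
  limit
}

/// Draws `count` samples from Uniform(-a, a).
pub fn xavier_samples(fan_in: Int, fan_out: Int, count: Int) -> List(Float) {
  let a = xavier_limit(fan_in, fan_out)
  list.repeat(Nil, count)
  // Map [0, 1) onto [-1, 1), then scale by the bound.
  |> list.map(fn(_) { { float.random() *. 2.0 -. 1.0 } *. a })
}
```

For a layer with 3 inputs and 3 outputs the bound is sqrt(6 / 6) = 1.0, so every sampled weight lies in [-1, 1).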

linear_forward

pub fn linear_forward(
  tape: autograd.Tape,
  layer: Linear,
  x: autograd.Variable,
) -> Result(autograd.Traced(autograd.Variable), error.TensorError)

Forward pass: y = xW^T + b

Instrumented: records forward pass latency.

The order of operations matters for gradient computation:

Transpose W: [out, in] -> [in, out]
Matmul: [batch, in] @ [in, out] -> [batch, out]
Add bias: [batch, out] + [out] (broadcast over batch dim)

Each step is traced, so backward() will compute dL/dW, dL/db, dL/dx.
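
To make the shapes concrete, the three steps can be written out on plain nested lists. This is only a sketch under stated assumptions: it uses List(List(Float)) instead of autograd.Variable, records nothing on a tape, and skips the shape checks that the real function reports through error.TensorError. The names dot, matmul and linear_forward_plain are hypothetical.

```gleam
import gleam/float
import gleam/list

/// Dot product of two equal-length vectors.
fn dot(a: List(Float), b: List(Float)) -> Float {
  list.map2(a, b, fn(x, y) { x *. y })
  |> float.sum
}

/// Matrix product of row-major matrices: [n, k] @ [k, m] -> [n, m].
fn matmul(a: List(List(Float)), b: List(List(Float))) -> List(List(Float)) {
  let b_columns = list.transpose(b)
  list.map(a, fn(row) { list.map(b_columns, fn(col) { dot(row, col) }) })
}

/// y = xW^T + b with W: [out, in], b: [out], x: [batch, in].
pub fn linear_forward_plain(
  w: List(List(Float)),
  b: List(Float),
  x: List(List(Float)),
) -> List(List(Float)) {
  // 1. Transpose W: [out, in] -> [in, out]
  let w_t = list.transpose(w)
  // 2. Matmul: [batch, in] @ [in, out] -> [batch, out]
  let xw = matmul(x, w_t)
  // 3. Add bias, broadcasting b over the batch dimension
  list.map(xw, fn(row) { list.map2(row, b, fn(v, bias) { v +. bias }) })
}
```

With W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]] (out = 3, in = 2), b = [0.5, 0.5, 0.5] and a single input row x = [[1.0, 1.0]], the result is [[3.5, 7.5, 11.5]].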

mse_loss

pub fn mse_loss(
  tape: autograd.Tape,
  pred: autograd.Variable,
  target: autograd.Variable,
) -> Result(autograd.Traced(autograd.Variable), error.TensorError)

Mean Squared Error loss: L = mean((pred - target)^2)

MSE: the L2 norm’s favorite child.

Properties:

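Gradient: dL/dpred = 2(pred - target) / n, where n is the number of elements
Strongly convex in pred (unique minimum at pred = target); generally not convex in the network's weights
Penalizes large errors quadratically (sensitive to outliers)
Minimizing it gives the maximum-likelihood estimate under a Gaussian noise assumption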
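
The loss and its gradient can be checked by hand with a small sketch. It works on plain List(Float) values rather than autograd.Variable, assumes both lists have the same length, and uses the hypothetical names mse and mse_grad, which are not part of this module.

```gleam
import gleam/float
import gleam/int
import gleam/list

/// L = mean((pred - target)^2)
pub fn mse(pred: List(Float), target: List(Float)) -> Float {
  let n = int.to_float(list.length(pred))
  let squared =
    list.map2(pred, target, fn(p, t) { { p -. t } *. { p -. t } })
  float.sum(squared) /. n
}

/// dL/dpred = 2 (pred - target) / n, one entry per element.
pub fn mse_grad(pred: List(Float), target: List(Float)) -> List(Float) {
  let n = int.to_float(list.length(pred))
  list.map2(pred, target, fn(p, t) { 2.0 *. { p -. t } /. n })
}
```

For pred = [1.0, 2.0, 3.0, 4.0] and target = [1.0, 2.0, 3.0, 6.0], only the last element is off, by 2. The loss is 4 / 4 = 1.0 and the gradient is [0.0, 0.0, 0.0, -1.0], which shows how a single large error dominates both.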

When to use:

Regression problems with Gaussian-distributed errors
When you want smooth, well-behaved gradients

When NOT to use:

Outlier-heavy data (use Huber or MAE instead)
Classification (use cross-entropy)

relu

pub fn relu(
  tape: autograd.Tape,
  x: autograd.Variable,
) -> autograd.Traced(autograd.Variable)

ReLU activation: f(x) = max(0, x)

One of the most widely used activation functions. Simple, effective, sometimes dead.

Why ReLU wins:

Sparse activations (many zeros = efficient)
No vanishing gradient for positive values (gradient = 1)
Computationally trivial (just a comparison)

Why ReLU loses:

“Dying ReLU”: a neuron whose pre-activation is negative for every input outputs 0, receives zero gradient, and stops learning
Unbounded output can cause numerical issues

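Alternatives: LeakyReLU (small nonzero slope for x < 0, so the gradient never vanishes entirely), GELU and SiLU/Swish (smooth, but costlier to compute). Each has tradeoffs.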
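
The function and its local derivative are short enough to sketch directly. This illustration works on plain List(Float) values and does not touch the tape; relu_plain and relu_grad are hypothetical names, and using 0 as the derivative at exactly x = 0 is a convention assumed here, not something this module documents.

```gleam
import gleam/float
import gleam/list

/// f(x) = max(0, x), applied elementwise.
pub fn relu_plain(xs: List(Float)) -> List(Float) {
  list.map(xs, fn(x) { float.max(0.0, x) })
}

/// Local derivative: 1 where x > 0, otherwise 0.
/// The value at exactly x = 0 is a convention; 0 is used here.
pub fn relu_grad(xs: List(Float)) -> List(Float) {
  list.map(xs, fn(x) {
    case x >. 0.0 {
      True -> 1.0
      False -> 0.0
    }
  })
}
```

For the input [-2.0, 0.0, 3.0] the output is [0.0, 0.0, 3.0] and the derivative is [0.0, 0.0, 1.0]. A unit whose inputs are all non-positive gets an all-zero derivative, which is the dying ReLU case.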