viva_tensor/nn/layers

Neural Network Layers - The Building Blocks of Deep Learning

“A neural network is just a differentiable program.” — Yann LeCun (paraphrased, but he’d probably agree)

Types

Linear (fully connected) layer.

Stores weights W with shape [out_features, in_features] and bias b with shape [out_features]. This follows the PyTorch convention: weights are stored as [out, in] for efficiency, and the forward pass computes y = xW^T + b.

pub type Linear {
  Linear(w: autograd.Variable, b: autograd.Variable)
}

Constructors

Values

pub fn linear(
  tape: autograd.Tape,
  in_features: Int,
  out_features: Int,
) -> autograd.Traced(Linear)

Creates a new Linear layer with Xavier/Glorot initialization.

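Xavier init: W ~ Uniform(-sqrt(6/(fan_in+fan_out)), sqrt(6/(fan_in+fan_out))), where fan_in = in_features and fan_out = out_features.

This keeps activation variance stable across layers for tanh/sigmoid. For ReLU, you’d want He init instead (normal with std sqrt(2/fan_in), or uniform with bound sqrt(6/fan_in)).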

The bias starts at zero. Some argue for small positive values to ensure ReLU neurons fire initially, but zero works fine.
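As a rough illustration of the initialization described above, here is a minimal sketch in plain Python. It is not the library's Gleam implementation: the name xavier_uniform, the nested-list weight layout, and the explicit rng argument are all assumptions made for this example.

```python
import math
import random


def xavier_uniform(in_features, out_features, rng=None):
    """Return (weights, bias, limit) for a hypothetical linear layer.

    weights has shape [out_features][in_features]; bias is all zeros.
    """
    rng = rng or random.Random()
    # Bound from the formula above: sqrt(6 / (fan_in + fan_out)).
    limit = math.sqrt(6.0 / (in_features + out_features))
    weights = [
        [rng.uniform(-limit, limit) for _ in range(in_features)]
        for _ in range(out_features)
    ]
    bias = [0.0] * out_features
    return weights, bias, limit


def he_normal_std(in_features):
    """Standard deviation used by He (normal) init: sqrt(2 / fan_in)."""
    return math.sqrt(2.0 / in_features)


# Example: 4 inputs, 2 outputs, so the bound is sqrt(6 / 6) = 1.
weights, bias, limit = xavier_uniform(4, 2, random.Random(0))
```

Passing a seeded random.Random keeps the example reproducible; how the real layer seeds its generator is not stated on this page.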

pub fn linear_forward(
  tape: autograd.Tape,
  layer: Linear,
  x: autograd.Variable,
) -> Result(autograd.Traced(autograd.Variable), error.TensorError)

Forward pass: y = xW^T + b

Instrumented: records forward pass latency.

The order of operations matters for gradient computation:

  1. Transpose W: [out, in] -> [in, out]
  2. Matmul: [batch, in] @ [in, out] -> [batch, out]
  3. Add bias: [batch, out] + [out] (broadcast over batch dim)

Each step is traced, so backward() will compute dL/dW, dL/db, dL/dx.
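The three steps above can be sketched with nested lists in plain Python. This only illustrates the shapes and arithmetic; it does no tracing or gradient bookkeeping, and the helper names (transpose, matmul, add_bias, linear_forward_sketch) are hypothetical, not part of viva_tensor.

```python
def transpose(m):
    """[rows, cols] -> [cols, rows]."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """a: [n, k], b: [k, m] -> [n, m]."""
    return [
        [sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
        for row in a
    ]


def add_bias(y, b):
    """Broadcast b: [out] over every row of y: [batch, out]."""
    return [[v + bj for v, bj in zip(row, b)] for row in y]


def linear_forward_sketch(x, w, b):
    w_t = transpose(w)         # 1. [out, in] -> [in, out]
    y = matmul(x, w_t)         # 2. [batch, in] @ [in, out] -> [batch, out]
    return add_bias(y, b)      # 3. [batch, out] + [out]


w = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]    # [out=2, in=3]
b = [0.5, -0.5]                           # [out=2]
x = [[1.0, 0.0, -1.0], [2.0, 1.0, 0.0]]   # [batch=2, in=3]
y = linear_forward_sketch(x, w, b)        # [batch=2, out=2]
```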

pub fn mse_loss(
  tape: autograd.Tape,
  pred: autograd.Variable,
  target: autograd.Variable,
) -> Result(autograd.Traced(autograd.Variable), error.TensorError)

Mean Squared Error loss: L = mean((pred - target)^2)

MSE: the L2 norm’s favorite child.

Properties:

  • Gradient: dL/dpred = 2(pred - target) / n
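  • Strongly convex in pred (unique minimum at pred = target); generally not convex in the network weights
  • Penalizes large errors quadratically (sensitive to outliers)
  • Minimizing it is maximum likelihood estimation under a Gaussian noise assumption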
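To make the loss and gradient formulas concrete, here is a small plain-Python sketch over flat lists, with a finite-difference check of the gradient. The names mse, mse_grad, and numeric_grad are hypothetical and independent of the library.

```python
def mse(pred, target):
    """L = mean((pred - target)^2)."""
    n = len(pred)
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / n


def mse_grad(pred, target):
    """dL/dpred_i = 2 * (pred_i - target_i) / n."""
    n = len(pred)
    return [2.0 * (p - t) / n for p, t in zip(pred, target)]


def numeric_grad(pred, target, eps=1e-6):
    """Central finite differences, used only to sanity-check mse_grad."""
    grads = []
    for i in range(len(pred)):
        up = pred[:i] + [pred[i] + eps] + pred[i + 1:]
        down = pred[:i] + [pred[i] - eps] + pred[i + 1:]
        grads.append((mse(up, target) - mse(down, target)) / (2.0 * eps))
    return grads


pred = [1.0, 2.0, 4.0]
target = [1.0, 0.0, 1.0]
loss = mse(pred, target)        # (0 + 4 + 9) / 3
grad = mse_grad(pred, target)   # [0, 4/3, 2]
```

The largest error (4.0 vs 1.0) contributes 9 of the 13 units of squared error, which is the outlier sensitivity noted in the list above.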

When to use:

  • Regression problems with Gaussian-distributed errors
  • When you want smooth, well-behaved gradients

When NOT to use:

  • Outlier-heavy data (use Huber or MAE instead)
  • Classification (use cross-entropy)

pub fn relu(
  tape: autograd.Tape,
  x: autograd.Variable,
) -> autograd.Traced(autograd.Variable)

ReLU activation: f(x) = max(0, x)

The most popular activation function. Simple, effective, sometimes dead.

Why ReLU wins:

  • Sparse activations (many zeros = efficient)
  • No vanishing gradient for positive values (gradient = 1)
  • Computationally trivial (just a comparison)

Why ReLU loses:

  • “Dying ReLU”: a neuron whose pre-activation is negative for every input outputs 0, receives 0 gradient, and can stay stuck that way
  • Unbounded output can cause numerical issues
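The dying-ReLU failure mode can be shown with a single scalar neuron in plain Python. This is a toy sketch, not library code: the functions relu_scalar and relu_grad and the chosen weight and bias are assumptions for illustration.

```python
def relu_scalar(z):
    """f(z) = max(0, z)."""
    return max(0.0, z)


def relu_grad(z):
    """Derivative of ReLU; 0 is used at z = 0 by convention."""
    return 1.0 if z > 0.0 else 0.0


# One neuron, z = w * x + b, with a bias so negative that z < 0
# for every input in this small dataset.
w, b = 0.5, -10.0
inputs = [-2.0, 0.0, 1.0, 3.0]

pre_activations = [w * x + b for x in inputs]
outputs = [relu_scalar(z) for z in pre_activations]

# Chain rule: d(output)/dw = relu'(z) * x and d(output)/db = relu'(z).
grads_w = [relu_grad(z) * x for z, x in zip(pre_activations, inputs)]
grads_b = [relu_grad(z) for z in pre_activations]
```

Every output and every gradient here is zero, so gradient descent on this data alone cannot move w or b back into the active region.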

Alternatives: LeakyReLU, GELU, SiLU/Swish. Each has tradeoffs.
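For reference, the alternatives named above can be written as scalar functions in plain Python using their standard textbook definitions. None of these are stated to exist in this module; the function names and the default negative_slope of 0.01 are assumptions for the sketch.

```python
import math


def leaky_relu(x, negative_slope=0.01):
    """x for x > 0, otherwise a small slope instead of a hard zero."""
    return x if x > 0.0 else negative_slope * x


def sigmoid(x):
    """Logistic function, branched to avoid overflow for large |x|."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def silu(x):
    """SiLU / Swish: x * sigmoid(x)."""
    return x * sigmoid(x)


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))
```

Unlike ReLU, all three return a nonzero value for negative inputs, so a neuron with a negative pre-activation still receives some gradient.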
