viva_tensor/nn/layers
Neural Network Layers - The Building Blocks of Deep Learning
“A neural network is just a differentiable program.” — Yann LeCun (paraphrased, but he’d probably agree)
References:
- Rumelhart, Hinton & Williams (1986). “Learning representations by back-propagating errors.” Nature 323, 533-536. The paper that popularized backpropagation.
- Glorot & Bengio (2010). “Understanding the difficulty of training deep feedforward neural networks.” AISTATS. Xavier initialization lives here.
- He et al. (2015). “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification.” ICCV. Kaiming init for ReLU.
Design philosophy:
- Layers are values, not classes. No hidden state, no surprises.
- Forward pass returns Traced(Variable) - computation graph included.
- Everything flows through the tape. Backprop just works.
Types
Linear (fully connected) layer.
Stores weights W: [out_features, in_features] and bias b: [out_features]. PyTorch convention: weights are [out, in] for efficiency in y = xW^T.
pub type Linear {
Linear(w: autograd.Variable, b: autograd.Variable)
}
Constructors
- Linear(w: autograd.Variable, b: autograd.Variable)
Values
pub fn linear(
tape: autograd.Tape,
in_features: Int,
out_features: Int,
) -> autograd.Traced(Linear)
Creates a new Linear layer with Xavier/Glorot initialization.
Xavier init: W ~ Uniform(-a, a) with a = sqrt(6 / (fan_in + fan_out)). This keeps the variance of activations and gradients roughly constant across layers for tanh/sigmoid. For ReLU, you’d want He init (standard deviation sqrt(2/fan_in)) instead.
The bias starts at zero. Some argue for small positive values to ensure ReLU neurons fire initially, but zero works fine.
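The two scale factors above can be sketched in plain Gleam. This is an illustration only: `xavier_limit`, `he_std`, and `scale_uniform` are hypothetical helper names, not part of viva_tensor, and the sketch assumes only `gleam/float` and `gleam/int` from the standard library.

```gleam
import gleam/float
import gleam/int

/// Half-width a of the Xavier/Glorot interval Uniform(-a, a),
/// where a = sqrt(6 / (fan_in + fan_out)). Hypothetical helper.
pub fn xavier_limit(fan_in: Int, fan_out: Int) -> Float {
  let fan_sum = int.to_float(fan_in + fan_out)
  // square_root only fails for negative input; fall back to 0.0 in that case.
  case float.square_root(6.0 /. fan_sum) {
    Ok(limit) -> limit
    Error(_) -> 0.0
  }
}

/// He/Kaiming standard deviation for ReLU layers: sqrt(2 / fan_in).
/// Hypothetical helper.
pub fn he_std(fan_in: Int) -> Float {
  case float.square_root(2.0 /. int.to_float(fan_in)) {
    Ok(std) -> std
    Error(_) -> 0.0
  }
}

/// Maps a sample u drawn from [0, 1) onto [-limit, limit).
pub fn scale_uniform(u: Float, limit: Float) -> Float {
  limit *. { 2.0 *. u -. 1.0 }
}
```

For example, `xavier_limit(2, 4)` is 1.0 and `he_std(8)` is 0.5, so a wider layer gets proportionally smaller initial weights under either scheme.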
pub fn linear_forward(
tape: autograd.Tape,
layer: Linear,
x: autograd.Variable,
) -> Result(autograd.Traced(autograd.Variable), error.TensorError)
Forward pass: y = xW^T + b
Instrumented: records forward pass latency.
The order of operations matters for gradient computation:
- Transpose W: [out, in] -> [in, out]
- Matmul: [batch, in] @ [in, out] -> [batch, out]
- Add bias: [batch, out] + [out] (broadcast over batch dim)
Each step is traced, so backward() will compute dL/dW, dL/db, dL/dx.
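To make the shapes concrete, here is an untraced reference version of the same arithmetic on nested lists. It is a sketch for checking the math by hand, not how `linear_forward` is implemented: `linear_forward_reference` and `dot` are hypothetical names, and nothing here touches the tape.

```gleam
import gleam/list

/// Dot product of two equal-length vectors.
fn dot(a: List(Float), b: List(Float)) -> Float {
  list.map2(a, b, fn(x, y) { x *. y })
  |> list.fold(0.0, fn(acc, v) { acc +. v })
}

/// y = xW^T + b for x: [batch, in], w: [out, in], b: [out].
/// Hypothetical reference helper on plain lists.
pub fn linear_forward_reference(
  x: List(List(Float)),
  w: List(List(Float)),
  b: List(Float),
) -> List(List(Float)) {
  list.map(x, fn(row) {
    // Taking the dot product of the input row with each row of W is the
    // same as multiplying by W^T; the bias is then added per output unit,
    // which is the broadcast over the batch dimension.
    list.map2(w, b, fn(w_row, bias) { dot(row, w_row) +. bias })
  })
}
```

With `x = [[1.0, 2.0]]`, `w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]`, and `b = [0.5, 0.5, 0.5]`, the result is `[[1.5, 2.5, 3.5]]`: one row per batch item, one column per output feature.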
pub fn mse_loss(
tape: autograd.Tape,
pred: autograd.Variable,
target: autograd.Variable,
) -> Result(autograd.Traced(autograd.Variable), error.TensorError)
Mean Squared Error loss: L = mean((pred - target)^2)
MSE: the L2 norm’s favorite child.
Properties:
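- Gradient: dL/dpred = 2(pred - target) / n, where n is the number of elements
- Strongly convex in pred (unique minimum at pred = target); the loss is generally not convex in the network weights
- Penalizes large errors quadratically (sensitive to outliers)
- Minimizing it is maximum-likelihood estimation under a Gaussian noise assumption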
When to use:
- Regression problems with Gaussian-distributed errors
- When you want smooth, well-behaved gradients
When NOT to use:
- Outlier-heavy data (use Huber or MAE instead)
- Classification (use cross-entropy)
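The loss and its gradient from the Properties list can be checked with a small untraced sketch. `mse` and `mse_grad` are hypothetical names operating on plain lists, not the viva_tensor functions, and both assume `pred` and `target` have the same non-zero length.

```gleam
import gleam/int
import gleam/list

/// L = mean((pred - target)^2). Hypothetical reference helper.
pub fn mse(pred: List(Float), target: List(Float)) -> Float {
  let n = int.to_float(list.length(pred))
  let sum_sq =
    list.map2(pred, target, fn(p, t) { { p -. t } *. { p -. t } })
    |> list.fold(0.0, fn(acc, v) { acc +. v })
  sum_sq /. n
}

/// dL/dpred = 2(pred - target) / n, computed element by element.
pub fn mse_grad(pred: List(Float), target: List(Float)) -> List(Float) {
  let n = int.to_float(list.length(pred))
  list.map2(pred, target, fn(p, t) { 2.0 *. { p -. t } /. n })
}
```

For `pred = [3.0, 1.0]` and `target = [1.0, 1.0]`, the loss is 2.0 and the gradient is `[2.0, 0.0]`: only the element that is wrong receives a gradient, and it scales linearly with the size of the error.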
pub fn relu(
tape: autograd.Tape,
x: autograd.Variable,
) -> autograd.Traced(autograd.Variable)
ReLU activation: f(x) = max(0, x)
One of the most widely used activation functions. Simple, effective, sometimes dead.
Why ReLU wins:
- Sparse activations (many zeros = efficient)
- No vanishing gradient for positive values (gradient = 1)
- Computationally trivial (just a comparison)
Why ReLU loses:
- “Dying ReLU”: a neuron whose pre-activation is negative for every input outputs 0, receives 0 gradient, and so never recovers
- Unbounded output can cause numerical issues
Alternatives: LeakyReLU, GELU, SiLU/Swish. Each has tradeoffs.
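A scalar sketch of ReLU, its derivative, and the LeakyReLU variant shows where the zero gradient comes from. The names `relu_scalar`, `relu_grad`, and `leaky_relu` are hypothetical and are not the traced `relu` above; the derivative at exactly 0 is taken to be 0 here, which is a common convention rather than something this module documents.

```gleam
import gleam/float

/// f(x) = max(0, x). Hypothetical scalar helper.
pub fn relu_scalar(x: Float) -> Float {
  float.max(0.0, x)
}

/// f'(x) = 1 for x > 0, else 0 (assumed convention at x = 0).
pub fn relu_grad(x: Float) -> Float {
  case x >. 0.0 {
    True -> 1.0
    False -> 0.0
  }
}

/// LeakyReLU keeps a small slope for non-positive inputs, so the
/// gradient there is `slope` instead of 0.
pub fn leaky_relu(x: Float, slope: Float) -> Float {
  case x >. 0.0 {
    True -> x
    False -> slope *. x
  }
}
```

For an input of -2.0, `relu_scalar` returns 0.0 and `relu_grad` returns 0.0, while `leaky_relu(-2.0, 0.5)` returns -1.0, so some gradient still flows.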