Backpropagation and Gradient Descent

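Forward propagation computes predictions, and a loss function scores how far they are from the targets. Backpropagation uses the chain rule to send that error signal backward through the network, and gradient descent uses the resulting gradients to nudge each weight downhill on the loss surface.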

Gradient descent intuition

Imagine a hiker in fog on a mountain who can only feel the local slope. Each step goes in the direction of steepest descent, that is, opposite the gradient, toward the valley (minimum loss):

W_new = W_old − η · (∂E/∂W)

The learning rate η sets the step size. If it is too small, training is slow; if it is too large, updates overshoot the minimum and the loss becomes unstable or diverges.
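To make the effect of η concrete, here is a minimal sketch in Python. The toy loss E(w) = (w − 3)^2, the starting point, the step count, and the three learning rates are all assumptions chosen for illustration, not values from any particular model.

```python
def grad(w):
    """Gradient of the toy loss E(w) = (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)


def gradient_descent(eta, steps=50, w0=0.0):
    """Apply w <- w - eta * dE/dw for `steps` iterations, starting from w0."""
    w = w0
    for _ in range(steps):
        w = w - eta * grad(w)
    return w


if __name__ == "__main__":
    # Hypothetical learning rates: too small, reasonable, too large.
    for eta in (0.01, 0.1, 1.1):
        w = gradient_descent(eta)
        print(f"eta={eta}: w after 50 steps = {w:.4g}")
```

On this toy problem each step multiplies the distance to the minimum by (1 − 2η). With η = 0.1 the distance shrinks quickly, with η = 0.01 it shrinks slowly, and with η = 1.1 the factor has magnitude above 1, so the iterate swings back and forth across the minimum with growing amplitude.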

Chain rule in backprop

∂E/∂w = (∂E/∂ŷ) · (∂ŷ/∂z) · (∂z/∂w)

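Here z is the weighted input (pre-activation) of an output neuron, ŷ is its prediction after the activation function, and E is the loss. The three factors measure how the loss changes with the prediction, how the prediction changes with the pre-activation, and how the pre-activation changes with the weight. Gradients flow from the output layer back through each hidden layer, so every weight receives a measure of how much it contributed to the error.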
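The chain-rule product can be checked numerically on a single neuron. The sketch below assumes a sigmoid activation and a squared-error loss E = 0.5 · (ŷ − y)^2; these choices, and all function names, are illustrative assumptions rather than anything fixed by the text above.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, b, x, y):
    """Squared error E = 0.5 * (y_hat - y)^2 for one sigmoid neuron (assumed setup)."""
    y_hat = sigmoid(w * x + b)
    return 0.5 * (y_hat - y) ** 2


def grad_w(w, b, x, y):
    """dE/dw as the product of the three chain-rule factors."""
    z = w * x + b
    y_hat = sigmoid(z)
    dE_dyhat = y_hat - y              # dE/dy_hat for the squared-error loss
    dyhat_dz = y_hat * (1.0 - y_hat)  # derivative of the sigmoid
    dz_dw = x                         # z = w * x + b, so dz/dw = x
    return dE_dyhat * dyhat_dz * dz_dw


def numeric_grad_w(w, b, x, y, h=1e-6):
    """Central finite-difference estimate of dE/dw, used only as a sanity check."""
    return (loss(w + h, b, x, y) - loss(w - h, b, x, y)) / (2.0 * h)


if __name__ == "__main__":
    args = (0.5, -0.2, 1.5, 1.0)  # hypothetical w, b, x, y
    print("chain rule:", grad_w(*args))
    print("numeric:   ", numeric_grad_w(*args))
```

The two printed values should agree to several decimal places. Comparing an analytic gradient against a finite-difference estimate like this is a common way to catch mistakes in a hand-written backward pass.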

Training loop

  1. Forward pass: the model predicts an output.
  2. Compute the loss between the prediction and the true label.
  3. Backward pass: compute gradients using the chain rule.
  4. Update the weights with an optimizer.

With plain gradient descent as the optimizer, step 4 is the update w ← w − η · (∂E/∂w), where η is the learning rate.
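The four steps above can be sketched as a short loop. This example assumes the simplest possible model, a linear unit ŷ = w·x + b trained with mean squared error on synthetic data generated from y = 2x + 1. The data, the function name train, and the hyperparameters are hypothetical and chosen only to keep the example small.

```python
def train(xs, ys, eta=0.05, epochs=1000):
    """Fit y_hat = w * x + b by full-batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    losses = []
    for _ in range(epochs):
        # 1. Forward pass: the model predicts an output for every input.
        preds = [w * x + b for x in xs]

        # 2. Loss: mean squared error between predictions and true labels.
        errors = [p - y for p, y in zip(preds, ys)]
        losses.append(sum(e * e for e in errors) / n)

        # 3. Backward pass: chain rule, dE/dw = mean(2 * error * x),
        #    dE/db = mean(2 * error).
        dE_dw = sum(2.0 * e * x for e, x in zip(errors, xs)) / n
        dE_db = sum(2.0 * e for e in errors) / n

        # 4. Weight update: plain gradient descent step.
        w -= eta * dE_dw
        b -= eta * dE_db
    return w, b, losses


# Synthetic data from the assumed target y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]
w, b, losses = train(xs, ys)

if __name__ == "__main__":
    print(f"w = {w:.4f}, b = {b:.4f}, final loss = {losses[-1]:.3g}")
```

In a real network the backward pass is usually computed by an automatic differentiation library, and step 4 is delegated to an optimizer object, but the order of the four steps stays the same.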

flowchart LR
  A[Input] --> B[Forward pass]
  B --> C[Loss]
  C --> D[Backprop gradients]
  D --> E[Weight update]
  E --> B

Popular optimizers