Forward propagation computes predictions; a loss function scores error. Backpropagation sends error signals backward via the chain rule; gradient descent nudges weights downhill.
Imagine a hiker in fog on a mountain who can only feel local slope. Each step moves opposite the steepest gradient toward the valley (minimum loss):
Wnew = Wold − η · (∂E/∂W)
Learning rate η is step size — too small = slow training; too large = overshooting and unstable loss.
∂E/∂w = (∂E/∂ŷ) · (∂ŷ/∂z) · (∂z/∂w)
Gradients flow from output layer back through each hidden layer so every weight knows how much it contributed to the error.
w = w - eta * dL/dw where eta is learning rate.