Activation functions add non-linearity, enabling neural networks to learn complex patterns. An MLP (Multi-Layer Perceptron) stacks fully connected layers of neurons.
Input → Hidden layer(s) → Output. Hidden layers must use non-linear activations; without them, the stack collapses to a single linear map and cannot solve XOR-like boundaries.
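The collapse can be checked numerically. The sketch below is a minimal illustration, not a training setup: the weights `W1` and `W2`, the input `x`, and the helper `matmul` are made-up names and values, and biases are left out for brevity (with biases the stack collapses to a single affine map instead).

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

# Hypothetical weights for a 2 -> 2 -> 1 network with NO activation and no biases.
W1 = [[1.0, -2.0], [0.5, 3.0]]  # hidden layer, shape (2, 2)
W2 = [[2.0, 1.0]]               # output layer, shape (1, 2)

x = [[1.0], [4.0]]              # one input as a column vector

# Apply the two layers one after the other.
two_layer = matmul(W2, matmul(W1, x))

# Fold both layers into a single matrix, then apply it once.
W_combined = matmul(W2, W1)
one_layer = matmul(W_combined, x)

print(two_layer, one_layer)  # both are [[-1.5]]
```

Because `W2 @ (W1 @ x)` equals `(W2 @ W1) @ x` for any input, extra linear layers add no expressive power; a non-linear activation between them is what breaks this equivalence.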
The standard tangent has discontinuities and diverges to infinity at 90° (π/2), so its outputs and gradients are unbounded, which invites exploding gradients. The hyperbolic tangent (tanh) is smooth, monotonic, and bounded in (−1, +1).
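A quick numerical comparison makes the contrast concrete. This is a small sketch using only Python's `math` module; the sample inputs and the helper name `tanh_grad` are chosen here for illustration.

```python
import math

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2, which never exceeds 1."""
    return 1.0 - math.tanh(x) ** 2

inputs = [-5.0, -1.0, 0.0, 1.0, 5.0]
tanh_values = [math.tanh(x) for x in inputs]
tanh_grads = [tanh_grad(x) for x in inputs]

# Just below pi/2 (90 degrees) the ordinary tangent is already enormous.
near_pole = math.pi / 2 - 1e-6
tan_value = math.tan(near_pole)
tan_grad = 1.0 + tan_value ** 2  # d/dx tan(x) = 1 + tan(x)^2

for x, v, g in zip(inputs, tanh_values, tanh_grads):
    print(f"tanh({x:+.1f}) = {v:+.4f}, gradient = {g:.4f}")
print(f"tan(pi/2 - 1e-6) = {tan_value:.3e}, gradient = {tan_grad:.3e}")
```

The tanh outputs stay inside (−1, +1) and its gradient peaks at 1 for an input of 0, while the tangent and its gradient both blow up near the pole.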
| Task | Output activation | Typical loss |
|---|---|---|
| Regression | Linear / identity | Mean Squared Error (MSE) |
| Binary classification | Sigmoid → (0, 1) | Binary cross-entropy (BCE) |
| Multi-class | Softmax → distribution | Categorical cross-entropy (CCE) |
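Softmax exponentiates the raw logits (which eliminates negatives and amplifies the winner), then normalizes so that all class probabilities sum to 1.0. CCE then heavily penalizes confident wrong predictions: the loss is the negative log of the probability assigned to the true class, so it grows without bound as that probability approaches 0.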
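The softmax and cross-entropy steps can be sketched in a few lines of plain Python. The function names and the example logits are illustrative assumptions. Subtracting the maximum logit before exponentiating is a common numerical-stability trick that is not part of the description above; it leaves the resulting probabilities unchanged.

```python
import math

def softmax(logits):
    """Exponentiate the logits, then normalize so the results sum to 1."""
    m = max(logits)  # stability trick: shifting by the max does not change the output
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_index):
    """Categorical cross-entropy for one sample with a one-hot target."""
    return -math.log(probs[true_index])

logits = [2.0, 1.0, -1.0]  # hypothetical raw scores for three classes
probs = softmax(logits)

loss_if_class_0 = cross_entropy(probs, 0)  # model's favourite class is correct
loss_if_class_2 = cross_entropy(probs, 2)  # model was confidently wrong

print([round(p, 3) for p in probs])
print(round(loss_if_class_0, 3), round(loss_if_class_2, 3))
```

With these logits the loss is much larger when the true class is the one the model considered least likely, which is the "confident and wrong" penalty in action.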