Foundations
Xavier initialization
Xavier initialization is a method for setting the initial weights of a neural network that aims to reduce the vanishing or exploding gradient problems, particularly in deep networks. It initializes weights based on the number of input and output neurons in a layer, drawing values from a distribution scaled to keep the variance of activations roughly the same across layers.
Explanation
In deep neural networks, if the initial weights are too small, the signal shrinks as it passes through each layer, leading to vanishing gradients and slow learning. Conversely, if the initial weights are too large, the signal grows exponentially, leading to exploding gradients and unstable training.

Xavier initialization, also known as Glorot initialization, addresses this by setting the weights such that the variance of the activations remains approximately constant across layers. Specifically, for a layer with n_in input neurons and n_out output neurons, the weights are typically drawn from a uniform distribution U(-sqrt(6 / (n_in + n_out)), sqrt(6 / (n_in + n_out))) or from a normal distribution with mean 0 and standard deviation sqrt(2 / (n_in + n_out)).

This scheme is most effective when the activation function is linear or approximately linear around zero (e.g., tanh, but not ReLU without modification). Variants such as He initialization are preferred for ReLU activations.
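The two sampling rules above can be sketched in a few lines of NumPy. The function names `xavier_uniform` and `xavier_normal` are illustrative, not part of any particular library's API:

```python
import numpy as np

def xavier_uniform(n_in, n_out, seed=None):
    # Uniform Xavier/Glorot: U(-limit, limit) with
    # limit = sqrt(6 / (n_in + n_out))
    rng = np.random.default_rng(seed)
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

def xavier_normal(n_in, n_out, seed=None):
    # Normal Xavier/Glorot: N(0, sigma^2) with
    # sigma = sqrt(2 / (n_in + n_out))
    rng = np.random.default_rng(seed)
    sigma = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, sigma, size=(n_in, n_out))

# Both distributions share the same variance, 2 / (n_in + n_out),
# which is what keeps activation variance roughly constant per layer.
W = xavier_uniform(512, 256, seed=0)
```

Note that the uniform bound sqrt(6 / (n_in + n_out)) is chosen precisely so that the uniform distribution's variance, limit^2 / 3, equals the 2 / (n_in + n_out) variance of the normal variant.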