The effects of weight initialization on neural nets
Author: Sayak Paul
Training a neural net is a stochastic process, meaning it might not give you the same results each time you run it. This makes neural nets hard to reproduce, and reproducibility is desirable for all the good reasons you can probably think of (deployment, reusability, etc.). We want our neural nets to show consistent performance across several runs. It's worth noting that reproducibility in machine learning is still an active area of research.

In this article, we'll review and compare a number of weight initialization methods for neural nets. We will also discuss a simple recipe for initializing the weights in a neural net.

Here’s a sneak peek of the comparison between the different methods we’ll cover.


A neural net can be viewed as a function with learnable parameters, and those parameters are often referred to as weights and biases. When training starts, these parameters (typically the weights) are initialized in a number of different ways: sometimes with constant values like 0's and 1's, sometimes with values sampled from some distribution (typically a uniform or normal distribution), and sometimes with more sophisticated schemes like Xavier initialization.
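As a quick illustration, here is a minimal sketch of how a few of these schemes can be expressed with the built-in initializers in tf.keras. The bounds and standard deviation shown are placeholder values for illustration, not recommendations.

import tensorflow as tf

# Constant initializers
zeros_init = tf.keras.initializers.Zeros()
ones_init = tf.keras.initializers.Ones()

# Values sampled from distributions (bounds/stddev here are placeholders)
uniform_init = tf.keras.initializers.RandomUniform(minval=-0.05, maxval=0.05)
normal_init = tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.05)

# A more sophisticated scheme: Xavier/Glorot initialization
xavier_init = tf.keras.initializers.GlorotUniform()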

The performance of a neural net depends a lot on how its parameters are initialized when training starts. Moreover, if we initialize the weights randomly with a different seed for each run, the results are (almost) bound to be non-reproducible, and sometimes not that performant either. On the other hand, if we initialize them with constant values, the network might take way too long to converge, and we also lose the benefits of randomness, which is what helps gradient-based learning reach convergence more quickly. We clearly need a better way to initialize the weights.

Careful initialization of weights not only helps us develop more reproducible neural nets but also helps us train them better, as we will see in this article. Let's dive in!

Different weight initialization schemes

We are going to study the effects of the following weight initialization schemes:

  • Weights initialized to all zeros
  • Weights initialized to all ones
  • Weights initialized with values sampled from a uniform distribution with a fixed bound
  • Weights initialized with values sampled from a uniform distribution with a careful tweak
  • Weights initialized with values sampled from a normal distribution with a careful tweak

Finally, we are going to see the effects of the default weight initialization scheme that comes with tf.keras.
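As a side note, you can inspect what tf.keras uses by default: a freshly constructed Dense layer reports its kernel and bias initializers, which (as of recent TensorFlow versions) are Glorot uniform and zeros respectively. A minimal check might look like this:

import tensorflow as tf

layer = tf.keras.layers.Dense(256, activation='relu')
# Prints the initializer objects; the default kernel initializer is GlorotUniform
print(layer.kernel_initializer)
print(layer.bias_initializer)  # the default bias initializer is Zeros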

If you want to follow along, you can spin up this Colab notebook and execute the cells one by one to try the different schemes yourself.

Data and a humble model for the experiments

To make the experiments quick and consistent, let's fix the dataset and a simple model architecture. For experiments like this, my favorite dataset to start with is the FashionMNIST dataset. We will be using the following model architecture:


The model takes a flattened feature vector of shape (784,) and, after passing it through a set of dropout and dense layers, produces a prediction vector of shape (10,) corresponding to the probabilities of the 10 different classes present in the FashionMNIST dataset.

This is the model architecture we will use for all the experiments, with sparse_categorical_crossentropy as the loss function and the Adam optimizer.
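For reference, here is a minimal sketch of such a model in tf.keras. The helper name get_model, the layer widths, and the dropout rate are assumptions for illustration; the linked notebook has the exact architecture.

import tensorflow as tf

def get_model(init_scheme='glorot_uniform'):
    # A small fully connected network on flattened 28x28 FashionMNIST images
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(784,)),
        tf.keras.layers.Dense(256, activation='relu',
                              kernel_initializer=init_scheme),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(256, activation='relu',
                              kernel_initializer=init_scheme),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10, activation='softmax',
                              kernel_initializer=init_scheme),
    ])
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    return model

Parameterizing the model builder by the initializer makes it easy to swap in each scheme for the experiments that follow.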

#1 Weights initialized to all zeros


Let's first throw a weight vector of all zeros at our model and see how it performs over 10 epochs of training. In tf.keras, layers like Dense, Conv2D, and LSTM have two arguments for this: kernel_initializer and bias_initializer. This is where we can pass in any pre-defined initializer or even a custom one. I would recommend taking a look at this documentation, which lists all the initializers available in tf.keras.

We can set the kernel_initializer argument of all the Dense layers in our model to zeros to initialize our weight vectors to all zeros. Since the bias is a scalar quantity, setting it to zeros doesn't matter as much as it does for the weights. In code, it would look like so:

# `init_scheme` holds the initializer for the current experiment, e.g. 'zeros'
tf.keras.layers.Dense(256, activation='relu',
                      kernel_initializer=init_scheme,
                      bias_initializer='zeros')
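Putting it together, one way to run this experiment, assuming the hypothetical get_model helper sketched earlier and FashionMNIST loaded via tf.keras.datasets, could be:

import tensorflow as tf

# Load and prepare FashionMNIST: flatten the images and scale them to [0, 1]
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
x_train = x_train.reshape(-1, 784).astype('float32') / 255.0
x_test = x_test.reshape(-1, 784).astype('float32') / 255.0

# Build the model with all-zeros weight initialization and train for 10 epochs
model = get_model(init_scheme='zeros')
history = model.fit(x_train, y_train,
                    validation_data=(x_test, y_test),
                    epochs=10, batch_size=128)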

Here’s how our model would train with weights initialized to all zeros -