Tuning Hyperparameters in Deep Learning

Hello everyone, welcome to the final blog in our Deep Learning Introduction series. In this blog, we will discuss hyperparameter tuning, which is a question on everyone’s mind when getting started with Deep Learning.

Hyperparameter tuning

There are several hyperparameters we should take into consideration while building deep learning models, many of which are specific to our design choices. In this blog, we will discuss the hyperparameters common to most deep learning models.

Weight Initialization

Weights are not exactly hyperparameters, but they form the heart of deep learning, and how we initialize them matters. Starting from well-scaled, non-zero initial weights helps the model converge faster and reach a better minimum. One such technique is Xavier initialization, where we draw the initial weights from a distribution with zero mean and a variance that depends on the network architecture.

The variance depends on two values, fan-in and fan-out, which are the numbers of incoming and outgoing connections of a given layer. Here’s the formula for the variance based on these values:

Var(W) = 2 / (fan_in + fan_out)
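As a rough illustration, here’s a minimal NumPy sketch of Xavier initialization; the layer sizes 784 and 256 are just example values, not recommendations:

```python
import numpy as np

def xavier_init(fan_in, fan_out):
    # Zero-mean Gaussian with variance 2 / (fan_in + fan_out)
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return np.random.normal(loc=0.0, scale=std, size=(fan_in, fan_out))

# Example: weights between a 784-unit input layer and a 256-unit hidden layer
W = xavier_init(784, 256)
print(W.shape, W.var())  # empirical variance should be close to 2 / (784 + 256)
```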

Learning Rate

The learning rate, often denoted α (or sometimes η), controls the pace at which the weights get updated. It can be kept fixed or changed adaptively during training.

Adaptive learning rates, where the rate varies over the course of training, can reduce training time and give better convergence. One of the most popular adaptive methods is Adam, which adjusts the step size for each parameter as the model trains.
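For illustration, assuming a Keras setup (the post doesn’t prescribe a framework), a fixed learning rate and Adam might be created like this; the rate values are common defaults, not recommendations:

```python
from tensorflow import keras

# Fixed learning rate: every update uses the same step size.
sgd = keras.optimizers.SGD(learning_rate=0.01)

# Adam adapts the effective step size per parameter during training;
# its initial learning rate is still a hyperparameter worth tuning.
adam = keras.optimizers.Adam(learning_rate=1e-3)
```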

Number of Hidden layers & units

The number of hidden layers and the number of neurons in each hidden layer play a huge role in the training process. Larger networks can learn more complex patterns and representations from the training set, and unsupervised models typically use more hidden layers than supervised ones.

There is no exact formula for the number of layers or units, but in practice models often perform well with more layers and units, and a common heuristic is to make the first hidden layer wider than the input layer.
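As a hypothetical example in Keras, here’s a small network whose first hidden layer is wider than its 784-feature input; the layer sizes and the 10-class output are assumptions for illustration only:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(784,)),               # e.g. flattened 28x28 images
    layers.Dense(1024, activation="relu"),   # first hidden layer wider than the input
    layers.Dense(256, activation="relu"),    # second, narrower hidden layer
    layers.Dense(10, activation="softmax"),  # assumed 10-class output
])
model.summary()
```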

Loss Function

The loss function can also be treated as a tunable hyperparameter, depending on the task our model is expected to perform. For classification we typically choose a categorical (cross-entropy) loss, while for regression we might use squared error, and there are many other regression losses as well. We need to choose the loss function that fits our data best.
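For instance, assuming Keras again, a classification loss and a regression loss might be picked like this; which one actually fits best depends on the data:

```python
from tensorflow import keras

# Classification: categorical cross-entropy over predicted class probabilities.
clf_loss = keras.losses.CategoricalCrossentropy()

# Regression: squared error is a common default, but alternatives such as
# mean absolute error or Huber loss may fit some data better.
reg_loss = keras.losses.MeanSquaredError()
```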

Epochs

The number of epochs, or training iterations, has a direct impact on our model’s performance. Normally, we train the model for a large number of epochs and use the early stopping technique to end training once performance on a validation set stops improving or starts getting worse.
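A minimal sketch of early stopping in Keras might look like this; the patience value and the commented-out training call are assumptions for illustration:

```python
from tensorflow import keras

# Stop training once the validation loss has not improved for 5 epochs,
# and roll back to the best weights seen so far.
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True
)

# Hypothetical usage (model, x_train, y_train assumed to exist):
# model.fit(x_train, y_train, validation_split=0.2,
#           epochs=200, callbacks=[early_stop])
```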

Grid Search

Grid search simply tries every hyperparameter setting over a specified range of values. This involves a cross-product of all intervals, so the computational expense is exponential in the number of parameters.

It can be easily parallelized, but care should be taken to ensure that if one job fails it fails gracefully; otherwise a portion of the hyperparameter space could be left unexplored.

Grid search iterates over every possible combination of the specified hyperparameters; once the best combination is found, we can use those values to build our final model.
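Here’s a minimal sketch of that loop in plain Python; the search space and the train_and_evaluate placeholder are hypothetical:

```python
from itertools import product

# Hypothetical search space: the grid is the cross-product of these lists,
# so its size grows exponentially with the number of hyperparameters.
learning_rates = [1e-2, 1e-3, 1e-4]
hidden_units = [64, 128, 256]
batch_sizes = [32, 64]

def train_and_evaluate(lr, units, batch_size):
    # Placeholder: build, train, and score a model with these settings,
    # returning e.g. validation accuracy.
    return 0.0  # stubbed out here

best_score, best_params = float("-inf"), None
for lr, units, batch_size in product(learning_rates, hidden_units, batch_sizes):
    score = train_and_evaluate(lr, units, batch_size)
    if score > best_score:
        best_score, best_params = score, (lr, units, batch_size)

print("Best combination:", best_params)
```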

That’s it for this blog. We will catch up again with some more concepts related to Deep Learning and even some real-world applications too. Until then, stay safe. Cheers ✌✌
