Deep learning (DL) terminology questions come up frequently in data science and machine learning interviews. In this tutorial, we will discuss the top 10 neural network concept interview questions and how to answer them.

**Resources for this post:**

- Video tutorial for this post on YouTube
- Click here for the Colab notebook
- More video tutorials on Data Science Interview Questions
- More blog posts on Data Science Interview Questions

Let’s get started!

### Question 1: What is weight initialization for a neural network model?

- Weight initialization sets the starting values of a neural network's weights before the training process begins. The choice of initialization affects the neural network model's performance.
- We can specify the initial weights as all zeros, all ones, a constant number, or values drawn from a distribution.

In the Python code below, we use a TensorFlow initializer to draw the initial weights from a normal distribution with mean 0 and standard deviation 1.

```python
# Import TensorFlow
import tensorflow as tf

# Set a random normal initializer with mean 0 and standard deviation 1
initializer = tf.keras.initializers.RandomNormal(mean=0., stddev=1.)

# Apply the initializer to a dense layer
layer = tf.keras.layers.Dense(3, kernel_initializer=initializer)
```

### Question 2: What is backpropagation?

- Backpropagation is a key step in training a neural network model.
- The goal of backpropagation is to update the weights for the neurons in order to minimize the loss function.
- Backpropagation takes the error from the forward pass and propagates it backward through the layers, using the gradients of the loss with respect to each weight to update the weights. This process is repeated until the neural network model converges.
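
As a minimal sketch of one training step in TensorFlow (the model, data, and shapes here are illustrative), `tf.GradientTape` records the forward pass and then backpropagates the loss to update the weights:

```python
import tensorflow as tf

# Illustrative model, inputs, and targets
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
x = tf.random.normal((8, 4))
y = tf.random.normal((8, 1))
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

with tf.GradientTape() as tape:
    y_pred = model(x)                              # forward propagation
    loss = tf.reduce_mean(tf.square(y - y_pred))   # mean squared error

# Backpropagation: gradients of the loss with respect to each weight
gradients = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(gradients, model.trainable_variables))
```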

### Question 3: What is the loss function of a neural network model?

A loss function measures the quality of a machine learning model by comparing the predicted target values with the actual ones. It applies to supervised models only because its calculation requires the ground-truth values of the target.

When training a neural network model, we need to specify the loss function name depending on the nature of the project.

- Regression: For a neural network model with a continuous target variable,

👉 `mean_squared_error` is the default.

👉 `mean_squared_logarithmic_error` is the mean squared error (MSE) calculated on log values. It is usually used when the dependent variable spans a wide range of values.

👉 `mean_absolute_error` is robust to outliers because the errors are not squared.

- Binary Classification: For a neural network model with two discrete labels as the target variable,

👉 `binary_crossentropy` is the default. It is the same as the log loss in logistic regression.

👉 `hinge` is for a target variable coded as -1 and 1. It rewards a prediction with the same sign as the target and penalizes a prediction with the opposite sign.

👉 `squared_hinge` is also for a target variable coded as -1 and 1. As the name suggests, it is the square of the hinge loss.

- Multi-Class Classification: For a neural network model with more than two discrete labels as the target variable,

👉 `categorical_crossentropy` is the same as `binary_crossentropy` but for multiple categories. It expects one-hot encoded targets.

👉 `sparse_categorical_crossentropy` computes the same loss but takes integer-encoded labels, which is convenient when the dependent variable has many categories.

👉 `kullback_leibler_divergence` measures how much a predicted probability distribution diverges from the target distribution.
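
In TensorFlow, the loss function is passed by name when compiling the model. Below is a minimal sketch for a binary classification model (the layer sizes and input shape are illustrative):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(10,)),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

# Specify the loss function by name based on the nature of the project
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```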

### Question 4: What are batch size and epoch in a neural network model?

**Batch size** is the number of training samples processed in each forward propagation and backpropagation before the model weights are updated. **Epoch** refers to one complete pass through the whole training dataset.

For example, if a neural network model has a batch size of 10 and the training sample size of 200, the model weights will be updated 20 times for each epoch.
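
In TensorFlow, both values are arguments to `model.fit`. A sketch matching the example above (assuming `model`, `X_train`, and `y_train` are already defined):

```python
# 200 training samples with batch_size=10 -> 20 weight updates per epoch,
# repeated for 5 complete passes through the training data
model.fit(X_train, y_train, batch_size=10, epochs=5)
```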

### Question 5: What are the commonly used activation functions for a neural network model?

An activation function is a mathematical transformation applied to a neuron's output. It introduces nonlinearity into a neural network model, enabling the model to capture complex nonlinear patterns in the training dataset.

The most commonly used activation functions are listed below:

- **Linear** activation function is usually used in the output layer of a neural network model with a continuous target variable.
- **Sigmoid** activation function is usually used in the output layer of a neural network model with a binary classification target variable.
- **Softmax** activation function is usually used in the output layer of a neural network model with a multi-class classification target variable.
- **Tanh** activation function is usually used in the hidden layers of a neural network model. It is a transformation of the sigmoid function, `tanh(x) = 2 * sigmoid(2x) - 1`, ranging from -1 to 1.
- **ReLU** is the rectified linear unit activation function. It is usually used in the hidden layers of a neural network model. Its formula is `max(0, x)`, which sets negative values to 0.
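
A sketch of where these activations typically appear in a Keras model (the layer sizes and the three-class output are illustrative):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu', input_shape=(8,)),  # hidden layer
    tf.keras.layers.Dense(16, activation='tanh'),                    # hidden layer
    tf.keras.layers.Dense(3, activation='softmax'),                  # multi-class output
])
```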

### Question 6: What are the differences between batch gradient descent, stochastic gradient descent, and mini-batch gradient descent?

Gradient descent is an optimization algorithm used to find the minimum of a function. It works by iteratively moving in the direction that reduces the value of the function the most, i.e., the direction of the negative gradient. Gradient descent is a common algorithm in machine learning for finding the optimal parameters of a model, and it can be used for both regression and classification models.

There are three commonly used gradient descent types, batch gradient descent, stochastic gradient descent, and mini-batch gradient descent. The main difference between the three variants is the amount of data used each time the weights are updated.

- Batch gradient descent uses the entire dataset to compute the gradient for each parameter update, so the weight updates are stable, but it requires a lot of memory for large datasets.
- Stochastic gradient descent (SGD) updates the model weights using one record at a time, so it requires less memory. However, SGD is less stable: the frequent weight updates produce noisy gradients, causing the loss to fluctuate instead of steadily decreasing.
- Mini-batch gradient descent lies between the other two: it uses a subset of the training dataset to compute the gradient at each step, combining the stability of batch gradient descent with the memory efficiency of stochastic gradient descent. Batch size is an important hyperparameter to tune in mini-batch gradient descent.
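
In Keras, the three variants can be sketched by changing the `batch_size` argument of `model.fit` (assuming `model`, `X_train`, and `y_train` are already defined):

```python
# Batch gradient descent: one update per epoch using the entire dataset
model.fit(X_train, y_train, batch_size=len(X_train), epochs=10)

# Stochastic gradient descent: one update per training sample
model.fit(X_train, y_train, batch_size=1, epochs=10)

# Mini-batch gradient descent: a subset of the data (here 32 samples) per update
model.fit(X_train, y_train, batch_size=32, epochs=10)
```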

### Question 7: What is batch normalization in a neural network model?

Batch normalization performs the normalization of layer inputs for each training mini-batch.

The pros of batch normalization are:

- Batch normalization stabilizes the weight change. It allows us to use much higher learning rates and be less careful about initialization.
- Batch normalization regularizes the neural network model because each training example is seen in conjunction with the other examples in its mini-batch, so the network no longer produces deterministic values for a given training example. The authors of the batch normalization paper mention that in some cases this eliminates the need for Dropout.
- Training runs faster in the sense that the model achieves the same accuracy in less time and with fewer epochs.

The con of the batch normalization is:

- Because the normalization is performed at the batch dimension, it does not work well for small batch sizes because the mean and variance for the batch are not representative of the dataset. A general rule of thumb is to have at least 16 samples in one batch.
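
In TensorFlow, batch normalization is available as a layer that is typically inserted between a dense layer and its activation. A minimal sketch with illustrative sizes:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, input_shape=(20,)),
    tf.keras.layers.BatchNormalization(),  # normalize layer inputs per mini-batch
    tf.keras.layers.Activation('relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
```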

### Question 8: What are exploding and vanishing gradients?

Vanishing and exploding gradients can happen for a deep multi-layer artificial neural network model or a recurrent neural network (RNN) model. The weights of the model cannot be updated properly when vanishing or exploding gradients happen.

- Vanishing gradients refer to the scenario where the gradients get smaller and smaller as the error is backpropagated through the layers, so the weights of the earlier layers receive almost no updates.
- Exploding gradients refer to the scenario where the gradients get larger and larger as the error is backpropagated, producing unstable, excessively large weight updates.

We can identify vanishing and exploding gradients by monitoring the training process.

- If vanishing gradients happen, we can observe larger updates applied to the weights of the later layers and smaller or even no updates to the weights of the earlier layers. The model learns slowly, and training stops with a poorly performing model.
- If exploding gradients happen, we can observe unstable updates from iteration to iteration and larger updates applied to the weights of the earlier layers. The model weights and loss can quickly become NaN.

There are several solutions for fixing vanishing gradients.

- We can use `ReLU` as the activation function and avoid using `sigmoid` or `tanh`, because the derivative of `ReLU` is either 0 or 1, so it does not shrink the gradients.
- We can also make the model structure simpler by including fewer hidden layers.
- Another way of reducing vanishing gradients is to initialize the weights from a uniform or normal distribution with a variance chosen to keep the variance of the activations the same across all layers. In TensorFlow, this is implemented as the `glorot_normal` and `glorot_uniform` options for `kernel_initializer`.
- Lastly, we can use an optimizer with momentum (e.g., `Adam`) that factors in the accumulated previous gradients.
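
A minimal TensorFlow sketch combining these fixes (the layer sizes are illustrative): `ReLU` activations, Glorot initialization, and the `Adam` optimizer:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    # ReLU activation and Glorot initialization help keep gradients from vanishing
    tf.keras.layers.Dense(32, activation='relu',
                          kernel_initializer='glorot_uniform', input_shape=(10,)),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

# Adam factors in the accumulated previous gradients (momentum)
model.compile(optimizer='adam', loss='binary_crossentropy')
```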

To fix exploding gradients, we can use the following methods:

- Use gradient clipping to cap the gradients at a threshold and use the capped gradients to update the weights.
- Setting the weight initializer to `glorot_normal` or `glorot_uniform` in a TensorFlow model can also help reduce exploding gradients.
- We can also use L2 regularization to shrink the weights and prevent exploding gradients.
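
A sketch of these fixes in TensorFlow (the clipping threshold, layer size, and L2 strength are illustrative):

```python
import tensorflow as tf

# Gradient clipping: cap each gradient value at a threshold (here 1.0)
optimizer = tf.keras.optimizers.Adam(clipvalue=1.0)

# Glorot initialization plus L2 regularization to shrink the weights
layer = tf.keras.layers.Dense(
    64,
    kernel_initializer='glorot_normal',
    kernel_regularizer=tf.keras.regularizers.l2(0.01),
)
```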

### Question 9: What is zero-shot learning?

Zero-shot learning (ZSL) refers to building a model and using it to make predictions on the tasks that the model was not trained to do.

For example, if we would like to classify millions of news articles into different topics, building a traditional multi-class classification model would be very costly because manually labeling the news topics takes a lot of time. Zero-shot text classification can make class predictions without explicitly building a supervised classification model on a labeled dataset. It is commonly implemented with a Natural Language Inference (NLI) model, where two sequences are compared to see whether they contradict each other, entail each other, or are neutral (neither contradict nor entail).
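
As a minimal sketch using the Hugging Face `pipeline` API (the text and candidate labels are illustrative; the default model is NLI-based):

```python
from transformers import pipeline

# Zero-shot classification: no labeled training data or fine-tuning needed
classifier = pipeline("zero-shot-classification")
result = classifier(
    "The stock market rallied after the earnings report.",
    candidate_labels=["finance", "sports", "politics"],
)
print(result["labels"][0])  # the highest-scoring topic
```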

Please check out my previous tutorial Zero-shot Topic Modeling with Deep Learning Using Python Hugging Face for the Python code for a zero-shot model.

### Question 10: What is transfer learning?

Transfer learning is a machine learning technique that reuses a pretrained large deep learning model on a new task. It usually includes the following steps:

- Select a pretrained model that is suitable for the new task. For example, if the new task includes text from different languages, a multi-language pretrained model needs to be selected.
- Keep all the weights and biases from the pretrained model except for the output layer. This is because the output layer of the pretrained model is specific to the pretraining task and needs to be replaced for the new task.
- Attach a new head with randomly initialized weights and biases for the new task. For a sentiment analysis transfer learning (aka fine-tuning) model on a pretrained BERT model, we remove the head that predicts masked words and replace it with a head for the two sentiment labels, positive and negative.
- Retrain the model for the new task with the new data, utilizing the pretrained weights and biases. Because the weights and biases store the knowledge learned from the pretrained model, the fine-tuned transfer learning model can build on that knowledge and does not need to learn from scratch.
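
A minimal TensorFlow sketch of these steps for the BERT sentiment example (the model name and hyperparameters are illustrative):

```python
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

# Load the pretrained BERT body; the masked-word head is dropped and a new
# 2-label classification head with randomly initialized weights is attached
model = TFAutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # positive and negative
)

# Retrain (fine-tune) on the new labeled data, building on the pretrained weights
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
# model.fit(train_dataset, epochs=3)  # assuming a tokenized tf.data.Dataset
```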

To learn how to implement transfer learning in python, please check out my tutorials on transfer learning using TensorFlow, transformers trainer, and PyTorch.

For more information about data science and machine learning, please check out my YouTube channel and Medium Page or follow me on LinkedIn.

### Recommended Tutorials

- GrabNGoInfo Machine Learning Tutorials Inventory
- Zero-shot Topic Modeling with Deep Learning Using Python
- Transfer Learning for Text Classification Using PyTorch
- Transfer Learning for Text Classification Using Hugging Face Transformers Trainer
- Customized Sentiment Analysis: Transfer Learning Using Tensorflow with Hugging Face
- Categorical Entity Embedding Using Python Tensorflow Keras
- Google Colab Tutorial for Beginners
- Sentiment Analysis: Hugging Face Zero-shot Model vs Flair Pre-trained Model
- Topic Modeling with Deep Learning Using Python BERTopic
- Top 7 Support Vector Machine (SVM) Interview Questions for Data Science and Machine Learning
- Top 5 Decision Tree Interview Questions for Data Science and Machine Learning
- Bagging vs Boosting vs Stacking in Machine Learning
- Top 10 NLP Concepts Interview Questions and Answers

### References

- TensorFlow documentation on initializers
- TensorFlow documentation on regularization
- Paper: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
- Paper: Attention Is All You Need
- Vanishing and Exploding Gradients in Neural Network Models: Debugging, Monitoring, and Fixing