Machine Learning Interview Questions

1. What is Machine Learning?

A field of study that gives computers the ability to learn without being explicitly programmed.

2. What are the different types of Machine Learning?

Supervised Learning: learn a model from labeled training data, then make predictions

Unsupervised Learning: explore the structure of the data to extract meaningful information

Reinforcement Learning: develop an agent that improves its performance based on interactions with the environment

3. What is overfitting and how can you avoid it?

Overfitting occurs when the model learns the training set too well.

It picks up random fluctuations in the training data and treats them as concepts. These fluctuations don't apply to new data, so they hurt the model's ability to generalize.

As a result, the model performs well on the training set, but high loss and low accuracy are seen on the test dataset.

There are three main methods to avoid overfitting:

Regularization: This adds a penalty term on the model parameters to the objective function. If some model parameters are likely to cause overfitting, regularization techniques like LASSO can be used to penalize these parameters.

Make a simple model: With fewer variables and parameters, the variance can be reduced.

Cross-validation: Methods like k-fold cross-validation can be used to check how well the model generalizes to data it was not trained on.
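The k-fold idea mentioned above can be sketched in plain Python. This is only an illustration of how the data is partitioned; the function name k_fold_indices and the sample sizes are assumptions made for the example, not part of any library.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = indices[start:stop]          # held-out fold
        train = indices[:start] + indices[stop:]  # everything else
        yield train, validation
        start = stop


# Hypothetical usage: 10 samples split into 5 folds.
folds = list(k_fold_indices(n_samples=10, k=5))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Each sample lands in the validation fold exactly once, so the model is always scored on data it did not see during that round of training. In practice the data is usually shuffled before splitting.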

4. What are the training set and test set in a Machine Learning model? How much data is allocated to the training, validation and test sets?

The training set consists of labeled examples given to the model to analyze and learn from. Usually about 70% of the data is taken as the training dataset.

The test set is used to test the accuracy of the hypothesis generated by the model. The remaining 30% is taken as the testing dataset. If a separate validation set is needed for tuning, it is carved out of these portions; the exact ratios vary by project.
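A minimal sketch of the 70/30 split described above, using only the standard library. The helper name train_test_split and the fixed seed are assumptions for the example; libraries such as scikit-learn ship their own, more complete versions.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical dataset of 100 examples.
data = list(range(100))
train, test = train_test_split(data)
print(len(train), len(test))
```

Shuffling before the cut matters when the data is ordered, for example by date or by class.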

5. How do you handle missing or corrupted data in a dataset?

The usual ways to handle missing or corrupted data are to drop the affected rows/columns or to replace the values with some other value.

There are useful methods in pandas:

 - isnull() and dropna() help find the columns/rows with missing data and drop them

 - fillna() replaces the missing values with a placeholder value (e.g. 0)
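The pandas calls listed above can be tried on a small frame. The column names and values here are made up for illustration, and the example assumes pandas is installed.

```python
import pandas as pd

# Hypothetical data; None becomes NaN in numeric columns.
df = pd.DataFrame({
    "age": [25, None, 40],
    "income": [50000, 62000, None],
})

missing_per_column = df.isnull().sum()  # count of missing values per column
dropped = df.dropna()                   # keep only rows with no missing values
filled = df.fillna(0)                   # replace missing values with 0

print(missing_per_column)
print(dropped)
print(filled)
```

Dropping is simple but throws data away, while a placeholder such as 0 keeps the rows but can distort statistics, so the choice depends on the dataset.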

6. How can you choose a classifier based on training set size?

When the training set is small, a model that has high bias and low variance tends to work better because it is less likely to overfit. For example, Naïve Bayes.

When the training set is large, models with low bias and high variance tend to perform better because they can capture complex relationships. E.g. Decision Tree.

7. Explain confusion matrix with respect to Machine Learning algorithms.

A confusion matrix is a table that is used to measure the performance of a classification algorithm.

It is mostly used in supervised learning (in unsupervised learning it is called a matching matrix).

A confusion matrix has two dimensions: Actual & Predicted

Both of these dimensions contain the identical set of classes.
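As a rough illustration, a confusion matrix can be built by counting (actual, predicted) pairs. The labels and predictions below are invented for the example, and the function name is an assumption rather than a library call.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


# Hypothetical spam-filter results.
actual = ["spam", "spam", "ham", "ham", "ham", "spam"]
predicted = ["spam", "ham", "ham", "spam", "ham", "spam"]

matrix = confusion_matrix(actual, predicted, labels=["spam", "ham"])
for row in matrix:
    print(row)
# [2, 1]  actual spam: 2 predicted spam, 1 predicted ham
# [1, 2]  actual ham:  1 predicted spam, 2 predicted ham
```

The diagonal holds the correct predictions; everything off the diagonal is an error.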

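8. What are false positives and false negatives, and how are they significant?

False positives are cases that are wrongly classified as True but are actually False.

False negatives, similarly, are cases that are wrongly classified as False but are actually True.

Their significance depends on the application: in medical screening a false negative (a missed disease) is usually the costlier error, while in spam filtering a false positive (a legitimate email marked as spam) is usually the costlier one.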

9. What are the three stages to build a model in Machine Learning?

Model Building: Choose the suitable algorithm for the model and train it according to the requirement

Model Testing: Check the accuracy of the model through the test data.

Applying the model: Make the required changes after testing and apply the final model.

10. What is Deep Learning?

Deep Learning involves systems that learn from data using artificial neural networks, which are loosely inspired by the human brain.

Deep Learning is a machine learning technique that teaches computers to do what comes naturally to humans: learn by example.

In deep learning, a computer model learns to perform tasks such as classification directly from images, text, or sound.

Models are trained by using a large set of labeled data and neural network architectures that contain many layers.

11. What is the difference between Machine Learning and Deep Learning?

Machine Learning:

It enables machines to take decisions on their own, based on past data.

It can work with a relatively small amount of training data.

It works well on low-end systems.

Most features need to be identified in advance and manually coded.

The problem is divided into parts and solved individually and then combined.

Deep Learning:

It enables machines to take decisions with the help of artificial neural networks.

It needs a large amount of training data.

It needs high-end systems to work.

The machine learns the features of the data it is provided.

The problem is solved in an end-to-end manner.

12. What are the applications of supervised machine learning in modern businesses?

Email Spam Detection

Sentiment analysis

Healthcare Diagnosis

Fraud Detection

13. What are the unsupervised machine learning techniques?

Clustering problems involve dividing the data into subsets. These subsets, also called clusters, contain data points that are similar to each other.

Different clusters reveal different details about the objects, which distinguishes clustering from classification or regression.

In an Association problem, we identify patterns of associations between different variables or items.

For example, e-commerce websites suggest other items for you to buy based on your prior purchases, spending habits, items in your wish list, other customers' purchase habits and so on.

14. What is the difference between supervised and unsupervised machine learning?

Supervised Learning: The model learns from labeled past data and makes predictions for new inputs.

Unsupervised Learning: The model uses unlabeled input data and finds structure in that information without guidance.

15. Compare K-Means and KNN algorithms.


K-Means:

It is unsupervised in nature.

It is a clustering algorithm

The points in each cluster are similar to each other and each cluster is different from its neighboring clusters.


KNN:

It is supervised in nature.

It is a classification algorithm.

It classifies an unlabeled observation based on its K (can be any number) nearest neighbors.
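The KNN side of this comparison can be sketched in a few lines: find the K closest labeled points and take a majority vote. The points, labels, and function name below are assumptions for the example, and Euclidean distance is only one possible choice of metric.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)  # Euclidean distance (Python 3.8+)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical labeled data: two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]

prediction = knn_predict(points, labels, query=(1.1, 1.0), k=3)
print(prediction)  # a
```

Note that KNN needs the labels up front, whereas K-Means would be given only the points and asked to discover the two groups on its own.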

16. How will you know which machine learning algorithm to choose for your classification problem?

If accuracy is the main concern, one can test different algorithms and cross-validate them.

If the training dataset is small, one should use models that have low variance and high bias.

If the training dataset is large, one should use models with high variance and low bias.

17. When will you use classification over regression?

Classification is used when your target variable is categorical in nature, while Regression is used when your target variable is continuous in nature. Both belong to the category of supervised machine learning algorithms.

Classification problems include estimating the gender of a person, the type of color, whether a result is True or False, etc.

Regression problems include estimating the sales and price of a product, predicting a sports score, the amount of rainfall, etc.

18. What is Random Forest?

Random Forest is a supervised machine learning algorithm that is generally used for classification, although it can also be used for regression.

Random Forest operates by constructing multiple Decision Trees during the training phase.

The decision of the majority of the trees is chosen by the random forest as the final decision.
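The final voting step can be illustrated on its own. This sketch covers only the aggregation, not how the trees are trained; the per-tree predictions are invented, and majority_vote is a hypothetical helper name.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical predictions from five trees for one transaction.
votes = ["fraud", "legit", "fraud", "fraud", "legit"]
final = majority_vote(votes)
print(final)  # fraud
```

Because each tree is trained on a different random sample of the data and features, their individual errors tend to differ, and the vote averages many of them out.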

19. Define Precision and Recall.

Precision is the ratio of the number of events you correctly recall to the number of all events you recall (a mix of correct and wrong recalls), i.e. TP / (TP + FP).

Recall is the ratio of the number of events you correctly recall to the total number of events, i.e. TP / (TP + FN).
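Both ratios can be computed directly from true positives (TP), false positives (FP) and false negatives (FN). The labels below are made up for the example, and returning 0.0 when a denominator is zero is a convention chosen here, not a universal rule.

```python
def precision_recall(actual, predicted, positive=1):
    """Return (precision, recall) for the given positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels: 4 actual positives, 4 predicted positives.
actual = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0]

precision, recall = precision_recall(actual, predicted)
print(precision, recall)  # 0.75 0.75
```

Here TP = 3, FP = 1 and FN = 1, so both values come out to 3/4. On real data the two usually differ, and improving one often lowers the other.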

20. Explain Logistic Regression.

Logistic Regression is a classification algorithm used to predict a binary outcome for a given set of independent variables.

The model outputs a probability between 0 and 1, which is then mapped to a class label of either 0 or 1.

It has a threshold value, which is generally 0.5.

Any value above 0.5 is considered as 1 and any value below 0.5 is considered as 0.
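The thresholding step can be sketched as follows. The weights and bias are hypothetical values standing in for a trained model, and treating a probability of exactly 0.5 as class 1 is an assumption made for the example.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(features, weights, bias, threshold=0.5):
    """Return (probability, label) for one observation."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    probability = sigmoid(z)
    # Assumption: a probability exactly at the threshold counts as class 1.
    return probability, int(probability >= threshold)


# Hypothetical parameters, as if already learned from data.
weights, bias = [0.8, -0.4], -0.2

prob_pos, label_pos = predict([2.0, 1.0], weights, bias)  # z is about 1.0
prob_neg, label_neg = predict([0.0, 2.0], weights, bias)  # z is about -1.0
print(round(prob_pos, 3), label_pos)
print(round(prob_neg, 3), label_neg)
```

The threshold does not have to stay at 0.5; lowering it trades more false positives for fewer false negatives.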