Question 1

What is the difference between supervised, unsupervised, and reinforcement learning?

Accepted Answer

Supervised learning trains on labeled data, learning a mapping from inputs to known outputs — for example predicting house prices or classifying emails. Unsupervised learning works with unlabeled data to find structure on its own, such as clustering customers or reducing dimensionality. Reinforcement learning has an agent take actions in an environment and learn from rewards and penalties over time, as in game-playing or robotics.

Question 2

What is the difference between classification and regression?

Accepted Answer

Both are supervised learning tasks, but they differ in the type of output. Classification predicts a discrete category or label, such as spam versus not spam, or which of several classes an image belongs to. Regression predicts a continuous numeric value, such as a price, temperature, or age. The choice of model, loss function, and evaluation metric depends on which type of problem you have.
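To make the difference in evaluation concrete, here is a minimal sketch in plain Python. The function names and the toy labels and prices are made up for illustration; accuracy and mean squared error are just one common metric for each task, not the only choice.

```python
def accuracy(y_true, y_pred):
    """Fraction of discrete labels predicted exactly right (classification)."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average squared gap between numeric targets and predictions (regression)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Classification: labels are categories, so each prediction is right or wrong.
spam_acc = accuracy(["spam", "ham", "spam", "ham"], ["spam", "ham", "ham", "ham"])

# Regression: targets are numbers, so we measure how far off each prediction is.
price_mse = mean_squared_error([200.0, 310.0, 150.0], [210.0, 300.0, 150.0])
```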

Question 3

What is overfitting, and how is it different from underfitting?

Accepted Answer

Overfitting happens when a model learns the training data too well, including its noise, so it performs strongly on training data but poorly on new, unseen data — it has high variance. Underfitting is the opposite: the model is too simple to capture the underlying pattern, so it performs poorly on both training and test data — it has high bias. The goal is a model that generalizes well, sitting between these two extremes.

Question 4

What is the bias-variance tradeoff?

Accepted Answer

Bias is error from overly simplistic assumptions that cause a model to miss real patterns (underfitting), while variance is error from being too sensitive to the training data and its noise (overfitting). Increasing model complexity typically lowers bias but raises variance, and simplifying it does the reverse. The tradeoff is about finding the balance that minimizes total error on unseen data.
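For squared-error loss the tradeoff is often written as a decomposition. The form below is the standard textbook one, under the assumption that the target is $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$ and that the expectation is taken over training sets and noise; the symbols are introduced here for illustration and do not come from the answer itself.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Only the first two terms depend on the model, which is why the balance is between bias and variance.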

Question 5

Why do we split data into training, validation, and test sets?

Accepted Answer

The training set is used to fit the model, the validation set is used to tune hyperparameters and compare models during development, and the test set is held out until the very end to give an unbiased estimate of real-world performance. Keeping the test set untouched prevents you from accidentally tuning to it, which would make your performance estimate overly optimistic. A common split is something like 60/20/20 or 70/15/15, depending on data size.
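A three-way split can be sketched with the standard library alone. The function name, default fractions, and seed below are assumptions for illustration. The sketch shuffles row indices, which is only appropriate when rows are independent; time-ordered or grouped data usually needs a different scheme.

```python
import random


def train_val_test_split(n_samples, val_frac=0.2, test_frac=0.2, seed=0):
    """Return three disjoint lists of row indices: train, validation, test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle so each subset is a random sample

    n_test = int(n_samples * test_frac)
    n_val = int(n_samples * val_frac)

    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = train_val_test_split(100)  # a 60/20/20 split
```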

Question 6

What is cross-validation?

Accepted Answer

Cross-validation is a technique for estimating how well a model generalizes by repeatedly training and testing on different subsets of the data. In k-fold cross-validation, the data is split into k folds; the model trains on k-1 folds and is evaluated on the remaining one, rotating until each fold has served as the validation set, then the scores are averaged. It gives a more reliable performance estimate than a single split and uses the data more efficiently, which is especially valuable on small datasets.
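The fold rotation can be sketched as follows. This is a minimal version, not a library API: `k_fold_indices` and `cross_val_score` are hypothetical names, `fit_and_score` stands in for whatever trains a model on the training indices and scores it on the validation indices, and the data is assumed to be shuffled already.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


def cross_val_score(fit_and_score, n_samples, k=5):
    """Average the score returned by a user-supplied fit_and_score(train, val)."""
    scores = [fit_and_score(tr, va) for tr, va in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


folds = list(k_fold_indices(10, 3))  # validation folds of size 4, 3, 3
```

Every index lands in exactly one validation fold, so each example is used for evaluation once and for training k-1 times.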

Question 7

What are precision, recall, and the F1 score?

Accepted Answer

Precision is the fraction of predicted positives that are actually positive — true positives divided by all predicted positives — and answers how trustworthy a positive prediction is. Recall is the fraction of actual positives the model correctly identified — true positives divided by all actual positives — and answers how many real positives were caught. The F1 score is the harmonic mean of precision and recall, giving a single balanced metric that is useful when you care about both, especially on imbalanced data.
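The three definitions translate directly into code. The sketch below works from raw counts; the counts are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not a universal rule.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical counts: 8 true positives, 2 false positives, 8 false negatives.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
```

Here precision is 0.8 and recall is 0.5, and F1 comes out near 0.615, below the arithmetic mean of 0.65. The harmonic mean is pulled toward the weaker of the two numbers, which is the point of using it.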

Question 8

What is a confusion matrix?

Accepted Answer

A confusion matrix is a table that summarizes a classifier's predictions against the actual labels, with cells for true positives, true negatives, false positives, and false negatives. It lets you see exactly what kinds of mistakes the model makes rather than just an overall accuracy number. From it you can derive metrics like precision, recall, accuracy, and specificity.
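For a binary problem the four cells can be counted in a single pass. A minimal sketch, with made-up labels and a hypothetical `confusion_counts` helper:

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for a binary problem."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)

overall_accuracy = (tp + tn) / (tp + tn + fp + fn)
specificity = tn / (tn + fp)  # true negative rate
```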

Question 9

What is regularization, and how do L1 and L2 differ?

Accepted Answer

Regularization adds a penalty on model complexity to the loss function to discourage overfitting by shrinking the weights. L1 regularization (Lasso) penalizes the sum of absolute values of the weights and tends to drive some weights exactly to zero, effectively performing feature selection. L2 regularization (Ridge) penalizes the sum of squared weights and shrinks them smoothly toward zero without eliminating them. Elastic Net combines both penalties.
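The different behavior of the two penalties shows up even with a single weight. The sketch below assumes a simplified one-dimensional setting: given an unregularized weight w, it returns the value v that minimizes a squared distance to w plus the penalty. The function names and the strength `lam` are illustrative, and real Lasso or Ridge fits on correlated features do not reduce to this closed form.

```python
def l1_shrink(w, lam):
    """Soft-thresholding: minimizer of 0.5*(v - w)**2 + lam*abs(v) over v."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0  # small weights are set exactly to zero


def l2_shrink(w, lam):
    """Minimizer of 0.5*(v - w)**2 + 0.5*lam*v**2 over v."""
    return w / (1.0 + lam)  # scaled toward zero, never exactly zero unless w is


weights = [3.0, 0.4, -0.2, -2.0]
l1_result = [l1_shrink(w, 0.5) for w in weights]
l2_result = [l2_shrink(w, 0.5) for w in weights]
```

With `lam = 0.5` the L1 version zeroes the two small weights and subtracts a fixed amount from the large ones, while the L2 version scales every weight by the same factor and leaves all of them nonzero.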

Question 10

What is gradient descent?

Accepted Answer

Gradient descent is an optimization algorithm that minimizes a loss function by iteratively moving the model's parameters in the direction opposite the gradient, which is the direction of steepest descent. The learning rate controls the step size: too large and the updates may overshoot or diverge, too small and convergence is slow. Common variants include batch gradient descent, which uses the full dataset for each update; stochastic gradient descent, which updates on one example at a time; and mini-batch gradient descent, which uses small batches.
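The update rule and the effect of the learning rate can be seen on a toy one-parameter loss. This is a minimal sketch: the loss (w - 3)**2, the starting point, and the step counts are chosen only for illustration.

```python
def gradient_descent(grad, w0, learning_rate, n_steps):
    """Repeatedly step against the gradient, starting from w0."""
    w = w0
    for _ in range(n_steps):
        w = w - learning_rate * grad(w)
    return w


def grad_loss(w):
    """Gradient of the toy loss (w - 3)**2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)


# A moderate learning rate converges toward the minimum at w = 3.
w_good = gradient_descent(grad_loss, w0=0.0, learning_rate=0.1, n_steps=100)

# A learning rate that is too large overshoots further on every step.
w_diverged = gradient_descent(grad_loss, w0=0.0, learning_rate=1.1, n_steps=20)
```

On this loss each step multiplies the distance to the minimum by (1 - 2 * learning_rate), so a rate of 0.1 shrinks it by a factor of 0.8 per step, while a rate of 1.1 grows it by a factor of 1.2 in magnitude.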

Question 11

What is feature scaling, and why does it matter?

Accepted Answer

Feature scaling brings features onto a comparable range, commonly through normalization (rescaling to a fixed range like 0 to 1) or standardization (rescaling to zero mean and unit variance). It matters because algorithms that rely on distances or gradients — such as k-nearest neighbors, SVMs, k-means, and gradient-descent-based models — can be dominated by features with larger numeric ranges. Tree-based models like decision trees and random forests generally do not require scaling.
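Both rescalings are a few lines each. A minimal sketch using only the standard library; the income values are hypothetical, and mapping a constant feature to zeros is a choice made here to avoid dividing by zero.

```python
import statistics


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


incomes = [20_000.0, 35_000.0, 50_000.0, 120_000.0]  # hypothetical feature
normalized = min_max_normalize(incomes)
standardized = standardize(incomes)
```

In practice the minimum, maximum, mean, and standard deviation should be computed on the training set only and then reused for validation and test data, so that no information leaks from the held-out sets.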

Question 12

How do you handle imbalanced datasets?

Accepted Answer

When one class greatly outnumbers another, accuracy becomes misleading and models tend to ignore the minority class. Common remedies include resampling — oversampling the minority class (for example with SMOTE, which synthesizes new examples) or undersampling the majority class — and using class weights so errors on the minority class are penalized more heavily. It is also important to evaluate with metrics like precision, recall, F1, or area under the precision-recall curve rather than raw accuracy.
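One common way to set class weights is inverse frequency: n_samples / (n_classes * class_count). The sketch below applies that heuristic to a hypothetical 90/10 dataset. The function name is made up, and other weighting schemes are equally valid.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


labels = [0] * 90 + [1] * 10  # hypothetical 90/10 imbalance
weights = balanced_class_weights(labels)
```

The minority class gets a weight of 5.0 and the majority class about 0.56, so each class contributes the same total weight to the loss.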

Question 13

What is the difference between a decision tree and a random forest?

Accepted Answer

A decision tree is a single model that splits the data on feature thresholds to form a tree of decisions; it is easy to interpret but prone to overfitting. A random forest is an ensemble of many decision trees, each trained on a bootstrap sample of the data and typically restricted to a random subset of features at each split. The trees' predictions are averaged for regression or combined by majority vote for classification. This reduces variance and usually generalizes much better than a single tree, at the cost of interpretability.

Question 14

What is the difference between bagging and boosting?

Accepted Answer

Both are ensemble methods that combine multiple base models, but they work differently. Bagging (bootstrap aggregating) trains models independently, and potentially in parallel, on different bootstrapped samples, then averages or votes their predictions to reduce variance; random forests are a bagging method. Boosting trains models sequentially, with each new model focusing on the errors of the previous ones, which mainly reduces bias; examples include AdaBoost and gradient boosting methods like XGBoost.
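The two mechanical pieces of bagging, bootstrap sampling and vote aggregation, can be sketched in a few lines. This only illustrates the bagging side; the helper names and the toy data are assumptions, and the base models themselves are left out.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]


def majority_vote(predictions):
    """Return the most common prediction among the ensemble members."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)
data = list(range(10))
samples = [bootstrap_sample(data, rng) for _ in range(3)]  # one per base model

# Each base model would be trained on its own sample; here we only vote.
ensemble_prediction = majority_vote(["spam", "ham", "spam"])
```

Because sampling is with replacement, each bootstrap sample usually repeats some rows and omits others, which is what makes the base models differ from one another.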

Question 15

How does the k-means clustering algorithm work?

Accepted Answer

K-means is an unsupervised algorithm that partitions data into k clusters. It starts by placing k initial centroids (for example at randomly chosen data points), assigns each point to the nearest centroid, then recomputes each centroid as the mean of its assigned points, repeating until assignments stop changing. You must choose k in advance, often using the elbow method or the silhouette score. Because results depend on the initial centroid positions, the algorithm is typically run several times with different initializations and the best result is kept.
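The assign-then-update loop can be sketched in plain Python. To keep the example deterministic, the initial centroids are passed in explicitly rather than drawn at random; the function name, the toy points, and the rule that an empty cluster keeps its old centroid are all choices made for this sketch.

```python
import math


def k_means(points, centroids, max_iters=100):
    """Assign/update loop on a list of equal-length tuples."""
    centroids = [tuple(c) for c in centroids]
    assignments = None
    for _ in range(max_iters):
        # Assignment step: index of the nearest centroid for every point.
        new_assignments = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # assignments stopped changing
        assignments = new_assignments

        # Update step: move each centroid to the mean of its points.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, assignments = k_means(points, centroids=[(1.0, 1.0), (1.0, 2.0)])
```

Even though both initial centroids start inside the same group, the loop separates the two groups within a few iterations on this well-separated toy data. On harder data a poor start can leave the algorithm in a worse local optimum, which is why multiple restarts are common.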

Question 16

What is the curse of dimensionality?

Accepted Answer

The curse of dimensionality refers to problems that arise when data has very many features. As dimensions grow, the volume of the space increases so fast that data becomes sparse, distances between points become less meaningful, and models need exponentially more data to generalize well. This is why dimensionality reduction techniques like PCA and careful feature selection are often used to keep models effective.
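The claim that distances become less meaningful can be checked with a small simulation. The sketch below assumes points drawn uniformly from the unit hypercube and measures the relative gap between the farthest and nearest neighbor of a random query point; the function name, sample size, and seed are arbitrary choices.

```python
import math
import random


def distance_contrast(n_dims, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    distances = [
        math.dist(query, [rng.random() for _ in range(n_dims)])
        for _ in range(n_points)
    ]
    return (max(distances) - min(distances)) / min(distances)


contrast_low = distance_contrast(n_dims=2)
contrast_high = distance_contrast(n_dims=500)
```

In two dimensions the nearest point is far closer than the farthest one, so the contrast is large. In 500 dimensions all points sit at nearly the same distance from the query and the contrast shrinks toward zero, which is what undermines nearest-neighbor style reasoning.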

Question 17

What is the difference between a parameter and a hyperparameter?

Accepted Answer

Parameters are values the model learns from the data during training, such as the weights in a linear model or the splits in a decision tree. Hyperparameters are settings you configure before training that control how learning happens, such as the learning rate, the number of trees in a forest, the value of k in k-means or k-nearest neighbors, or the regularization strength. Hyperparameters are typically chosen using the validation set or cross-validation, often through grid search, random search, or more advanced tuning methods.
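Grid search itself is just a loop over every combination of candidate values. In the sketch below, `grid_search` is a hypothetical helper, and `toy_validation_score` is a stand-in for the real work of training a model with the given hyperparameters and scoring it on the validation set.

```python
from itertools import product


def grid_search(param_grid, validation_score):
    """Try every combination in param_grid; keep the best validation score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = validation_score(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_score(params):
    """Stand-in for 'train with params, then score on the validation set'."""
    return -abs(params["learning_rate"] - 0.1) - abs(params["n_trees"] - 200) / 1000


grid = {"learning_rate": [0.01, 0.1, 1.0], "n_trees": [50, 200]}
best_params, best_score = grid_search(grid, toy_validation_score)
```

The cost grows with the product of the list lengths (six combinations here), which is one reason random search is often preferred when there are many hyperparameters.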
