Machine Learning Interview Questions
Machine learning fundamentals come up in data science, ML engineering, and analytics interviews. These are the questions interviewers actually ask, with concise answers you can deliver confidently.
17 questions with concise, interview-ready answers.
1. What is the difference between supervised, unsupervised, and reinforcement learning?
Supervised learning trains on labeled data, learning a mapping from inputs to known outputs — for example predicting house prices or classifying emails. Unsupervised learning works with unlabeled data to find structure on its own, such as clustering customers or reducing dimensionality. Reinforcement learning has an agent take actions in an environment and learn from rewards and penalties over time, as in game-playing or robotics.
2. What is the difference between classification and regression?
Both are supervised learning tasks, but they differ in the type of output. Classification predicts a discrete category or label, such as spam versus not spam, or which of several classes an image belongs to. Regression predicts a continuous numeric value, such as a price, temperature, or age. The choice of model, loss function, and evaluation metric depends on which type of problem you have.
3. What is overfitting, and how is it different from underfitting?
Overfitting happens when a model learns the training data too well, including its noise, so it performs strongly on training data but poorly on new, unseen data — it has high variance. Underfitting is the opposite: the model is too simple to capture the underlying pattern, so it performs poorly on both training and test data — it has high bias. The goal is a model that generalizes well, sitting between these two extremes.
4. What is the bias-variance tradeoff?
Bias is error from overly simplistic assumptions that cause a model to miss real patterns (underfitting), while variance is error from being too sensitive to the training data and its noise (overfitting). Increasing model complexity typically lowers bias but raises variance, and simplifying it does the reverse. The tradeoff is about finding the balance that minimizes total error on unseen data.
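For squared-error loss, the tradeoff is often written as a decomposition of expected error. This is a sketch under stated assumptions: the data follow y = f(x) + ε with zero-mean noise of variance σ², and \hat f denotes the fitted model. These symbols are introduced here for illustration and do not come from the answer above.

```latex
\mathbb{E}\big[(y - \hat f(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat f(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat f(x) - \mathbb{E}[\hat f(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The expectation is taken over training sets and noise. Only the first two terms depend on the model, which is why complexity is tuned to balance them.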
5. Why do we split data into training, validation, and test sets?
The training set is used to fit the model, the validation set is used to tune hyperparameters and compare models during development, and the test set is held out until the very end to give an unbiased estimate of real-world performance. Keeping the test set untouched prevents you from accidentally tuning to it, which would make your performance estimate overly optimistic. A common split is something like 60/20/20 or 70/15/15, depending on data size.
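A three-way split can be sketched in a few lines of plain Python. The function name, the default 60/20/20 fractions, and the fixed seed are illustrative assumptions, not a prescribed recipe.

```python
import random

def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle rows, then cut them into train, validation, and test lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split reproducible
    n_test = round(len(rows) * test_frac)
    n_val = round(len(rows) * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```

A plain shuffle like this assumes the rows are independent. Time-ordered or grouped data usually needs a split that respects that structure.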
6. What is cross-validation?
Cross-validation is a technique for estimating how well a model generalizes by repeatedly training and testing on different subsets of the data. In k-fold cross-validation, the data is split into k folds; the model trains on k-1 folds and is evaluated on the remaining one, rotating until each fold has served as the validation set, then the scores are averaged. It gives a more reliable performance estimate than a single split and uses the data more efficiently, which is especially valuable on small datasets.
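The fold rotation can be sketched as an index generator. This is a minimal illustration with a hypothetical helper name; it does not shuffle, so in practice the indices would usually be shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample is in a validation fold exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

for train_idx, val_idx in k_fold_indices(10, 5):
    print("validate on", val_idx, "train on", len(train_idx), "samples")
```

To score a model, you would fit it on each train_idx, evaluate it on the matching val_idx, and average the k scores.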
7. What are precision, recall, and the F1 score?
Precision is the fraction of predicted positives that are actually positive — true positives divided by all predicted positives — and answers how trustworthy a positive prediction is. Recall is the fraction of actual positives the model correctly identified — true positives divided by all actual positives — and answers how many real positives were caught. The F1 score is the harmonic mean of precision and recall, giving a single balanced metric that is useful when you care about both, especially on imbalanced data.
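The three definitions translate directly into code. The sketch below works from raw counts; returning 0.0 when a denominator is zero is a convention chosen here, not the only option.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# 10 predicted positives (8 correct), 16 actual positives (8 found)
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.8 0.5 0.615
```

The harmonic mean stays closer to the weaker of the two scores, so a model cannot get a high F1 by excelling at only one of them.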
8. What is a confusion matrix?
A confusion matrix is a table that summarizes a classifier's predictions against the actual labels, with cells for true positives, true negatives, false positives, and false negatives. It lets you see exactly what kinds of mistakes the model makes rather than just an overall accuracy number. From it you can derive metrics like precision, recall, accuracy, and specificity.
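For a binary classifier, the four cells can be counted directly from the labels. This is a small sketch with made-up labels; the function name and the choice of 1 as the positive class are assumptions for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            counts["tp" if actual == positive else "fp"] += 1
        else:
            counts["fn" if actual == positive else "tn"] += 1
    return counts

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
cm = confusion_counts(y_true, y_pred)
accuracy = (cm["tp"] + cm["tn"]) / len(y_true)
specificity = cm["tn"] / (cm["tn"] + cm["fp"])  # share of actual negatives correctly rejected
print(cm, accuracy, specificity)
```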
9. What is regularization, and how do L1 and L2 differ?
Regularization adds a penalty on model complexity to the loss function to discourage overfitting by shrinking the weights. L1 regularization (Lasso) penalizes the sum of absolute values of the weights and tends to drive some weights exactly to zero, effectively performing feature selection. L2 regularization (Ridge) penalizes the sum of squared weights and shrinks them smoothly toward zero without eliminating them. Elastic Net combines both penalties.
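The difference in behavior shows up even for a single weight. As a simplified sketch, suppose the unpenalized optimum for a weight is w and the data-fit term is 0.5*(v - w)**2; this one-dimensional setup is an assumption made for illustration. The L1 penalty then gives soft-thresholding, while the L2 penalty gives proportional shrinkage.

```python
def l1_shrink(w, lam):
    """Minimizer over v of 0.5*(v - w)**2 + lam*abs(v): soft-thresholding."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0  # small weights are set exactly to zero

def l2_shrink(w, lam):
    """Minimizer over v of 0.5*(v - w)**2 + 0.5*lam*v**2: proportional shrinkage."""
    return w / (1.0 + lam)

weights = [3.0, 0.4, -0.2, -2.0]
lam = 0.5
print([l1_shrink(w, lam) for w in weights])  # [2.5, 0.0, 0.0, -1.5]
print([l2_shrink(w, lam) for w in weights])  # every weight shrinks, none reaches zero
```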
10. What is gradient descent?
Gradient descent is an optimization algorithm that minimizes a loss function by iteratively moving the model's parameters in the direction opposite to the gradient, which is the direction of steepest descent. The learning rate controls the step size: too large and it may overshoot or diverge, too small and it converges slowly. Common variants include batch gradient descent, which computes each update from the full training set; stochastic gradient descent, which updates on one example at a time; and mini-batch gradient descent, which uses small batches.
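The update rule fits in a few lines. The sketch below minimizes a toy one-dimensional loss, f(w) = (w - 3)^2, chosen here only for illustration; the function name and step counts are likewise assumptions.

```python
def gradient_descent(grad, w0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from w0."""
    w = w0
    for _ in range(steps):
        w -= learning_rate * grad(w)
    return w

# f(w) = (w - 3)**2 has gradient 2*(w - 3) and its minimum at w = 3.
grad = lambda w: 2 * (w - 3)

converged = gradient_descent(grad, w0=0.0, learning_rate=0.1)
diverged = gradient_descent(grad, w0=0.0, learning_rate=1.1)
print(converged)  # very close to 3
print(diverged)   # far from 3: each step overshoots by more than the last
```

With a learning rate of 0.1, the distance to the minimum shrinks by a factor of 0.8 per step. With 1.1, it grows by a factor of 1.2 per step, which is the divergence the answer warns about.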
11. What is feature scaling, and why does it matter?
Feature scaling brings features onto a comparable range, commonly through normalization (rescaling to a fixed range like 0 to 1) or standardization (rescaling to zero mean and unit variance). It matters because algorithms that rely on distances or gradients — such as k-nearest neighbors, SVMs, k-means, and gradient-descent-based models — can be dominated by features with larger numeric ranges. Tree-based models like decision trees and random forests generally do not require scaling.
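Both rescalings are short formulas. This sketch applies them to one made-up feature column and assumes the column is not constant, so that neither denominator is zero.

```python
import statistics

def min_max_scale(values):
    """Normalization: rescale to the range [0, 1]. Assumes max > min."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Standardization: zero mean and unit variance. Assumes a nonzero spread."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]

incomes = [20000.0, 50000.0, 80000.0]
print(min_max_scale(incomes))  # [0.0, 0.5, 1.0]
print(standardize(incomes))
```

In a real pipeline, the minimum, maximum, mean, and standard deviation would be computed on the training set only and then reused for validation and test data, to avoid leaking information.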
12. How do you handle imbalanced datasets?
When one class greatly outnumbers another, accuracy becomes misleading and models tend to ignore the minority class. Common remedies include resampling — oversampling the minority class (for example with SMOTE, which synthesizes new examples) or undersampling the majority class — and using class weights so errors on the minority class are penalized more heavily. It is also important to evaluate with metrics like precision, recall, F1, or area under the precision-recall curve rather than raw accuracy.
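One common way to set class weights is inversely proportional to class frequency. The sketch below uses the formula n_samples / (n_classes * class_count), which is the same heuristic scikit-learn applies for class_weight="balanced"; the function name and the 90/10 label split are illustrative.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)  # the minority class gets the larger weight
```

Multiplying each example's loss by its class weight makes an error on the rare class cost as much, in total, as errors on the common class.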
13. What is the difference between a decision tree and a random forest?
A decision tree is a single model that splits the data on feature thresholds to form a tree of decisions; it is easy to interpret but prone to overfitting. A random forest is an ensemble of many decision trees, each trained on a bootstrap sample of the data and restricted to a random subset of features at each split, whose predictions are averaged (regression) or combined by majority vote (classification). This reduces variance and usually generalizes much better than a single tree, at the cost of interpretability.
14. What is the difference between bagging and boosting?
Both are ensemble methods that combine multiple base models, but they work differently. Bagging (bootstrap aggregating) trains models independently, and often in parallel, on different bootstrapped samples, then averages or votes their predictions to reduce variance; random forests are a bagging method. Boosting trains models sequentially, with each new model focusing on the errors of the previous ones, which mainly reduces bias; examples include AdaBoost and gradient boosting implementations such as XGBoost.
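The bagging half can be sketched with the standard library alone. The base learner below is a deliberately crude, hypothetical threshold stump for one-dimensional data with labels 0 and 1; it stands in for a real model such as a decision tree, and the data are made up.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def fit_stump(rows):
    """Hypothetical base learner: threshold halfway between the two class means."""
    zeros = [x for x, y in rows if y == 0]
    ones = [x for x, y in rows if y == 1]
    if not zeros or not ones:  # the sample happened to contain one class only
        only = 1 if ones else 0
        return lambda x: only
    threshold = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: 1 if x > threshold else 0

def bagging_predict(models, x):
    """Majority vote over independently trained models."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

rows = [(1.0, 0), (2.0, 0), (3.0, 0), (7.0, 1), (8.0, 1), (9.0, 1)]
rng = random.Random(0)
models = [fit_stump(bootstrap_sample(rows, rng)) for _ in range(25)]
print(bagging_predict(models, 0.5), bagging_predict(models, 9.5))
```

Boosting would replace the independent loop with a sequential one, in which each new model is fit to the examples or residuals the current ensemble handles worst.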
15. How does the k-means clustering algorithm work?
K-means is an unsupervised algorithm that partitions data into k clusters. It starts by placing k centroids, assigns each point to the nearest centroid, then recomputes each centroid as the mean of its assigned points, repeating until assignments stop changing. You must choose k in advance, often using the elbow method or the silhouette score, and because results depend on the initial centroid positions, it is typically run several times with different initializations, keeping the best result.
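The assign-then-update loop can be sketched in plain Python. This version assumes random data points as initial centroids and squared Euclidean distance, and it simply keeps the old centroid if a cluster ends up empty; these are choices made for the sketch.

```python
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Basic k-means on a list of equal-length tuples; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from k distinct data points
    assignments = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        new_assignments = [
            min(range(k), key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])))
            for point in points
        ]
        if new_assignments == assignments:  # converged: nothing moved
            break
        assignments = new_assignments
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
centroids, assignments = kmeans(points, k=2)
print(centroids, assignments)
```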
16. What is the curse of dimensionality?
The curse of dimensionality refers to problems that arise when data has very many features. As dimensions grow, the volume of the space increases so fast that data becomes sparse, distances between points become less meaningful, and models need exponentially more data to generalize well. This is why dimensionality reduction techniques like PCA and careful feature selection are often used to keep models effective.
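One way to see the sparsity is to ask what fraction of a unit hypercube lies inside the largest ball that fits in it (radius 0.5). The sketch below uses the standard formula for the volume of a d-dimensional ball; the function name is chosen here for illustration.

```python
import math

def inscribed_ball_fraction(d):
    """Volume of the radius-0.5 ball divided by the volume of the unit cube in d dimensions."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * 0.5 ** d

for d in (2, 3, 10, 50):
    print(d, inscribed_ball_fraction(d))
```

In 2 dimensions the ball covers about 79% of the square, but by 10 dimensions it covers well under 1% of the cube. Uniformly spread points therefore sit almost entirely in the corners, far from the center and from each other.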
17. What is the difference between a parameter and a hyperparameter?
Parameters are values the model learns from the data during training, such as the weights in a linear model or the splits in a decision tree. Hyperparameters are settings you configure before training that control how learning happens, such as the learning rate, the number of trees, the value of k, or the regularization strength. Hyperparameters are typically chosen using the validation set, often through grid search, random search, or more advanced tuning methods.
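The distinction can be made concrete with a tiny grid search. In this sketch the learned parameter is the slope w of a one-feature ridge model, and the hyperparameter is the regularization strength lam; the data, the grid, and the function names are all made up for illustration.

```python
def fit_ridge_slope(xs, ys, lam):
    """Learned parameter: the slope w minimizing sum((y - w*x)**2) + lam * w**2."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def mse(xs, ys, w):
    """Mean squared error of the predictions w*x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_x, train_y = [1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8]
val_x, val_y = [5.0, 6.0], [10.1, 11.9]

# Hyperparameter search: fit on the training set, score on the validation set.
grid = [0.0, 0.1, 1.0, 10.0]
scores = {lam: mse(val_x, val_y, fit_ridge_slope(train_x, train_y, lam)) for lam in grid}
best_lam = min(scores, key=scores.get)
best_w = fit_ridge_slope(train_x, train_y, best_lam)
print("best lam:", best_lam, "learned w:", best_w)
```

The value of w is never set by hand; it falls out of training for whichever lam is being tried. The value of lam is picked from outside the training procedure by comparing validation scores.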