Data Science Interview Questions
Data science interviews probe your grasp of statistics, the analytics workflow, and how you reason about data — not just modeling. These are the questions interviewers actually ask, with concise answers you can speak confidently.
17 questions with concise, interview-ready answers.
1. What are the typical stages of the data science lifecycle?
A common framing is: understand the business problem, collect and acquire the data, clean and prepare it, explore it (EDA), engineer features, build and train models, evaluate against the right metrics, then deploy and monitor in production. It is iterative rather than linear — insights from later stages often send you back to redefine the problem or gather more data. The early stages, framing the problem and preparing the data, usually consume the most time.
2. What is the difference between supervised and unsupervised learning?
Supervised learning uses labeled data, where each example has a known target, and the model learns to map inputs to that target — classification and regression are the two main types. Unsupervised learning works with unlabeled data and finds structure on its own, such as clustering similar points or reducing dimensionality. A middle ground, semi-supervised learning, uses a small amount of labeled data alongside a large amount of unlabeled data.
3. How do you handle missing data?
First understand why the data is missing, since the mechanism (missing completely at random, at random, or not at random) affects what is safe to do. Options include dropping rows when only a small share of values is missing, dropping a column when it is mostly empty, and imputation: filling with the mean, median, or mode, or with a model-based estimate like KNN or regression imputation. You can also add a binary indicator flag so the model can learn that the value was missing, which matters when the absence itself is informative.
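As a minimal sketch of median imputation combined with a missing-value indicator, the snippet below uses only the standard library. The function name `impute_median_with_flag` and the `ages` data are made up for illustration.

```python
from statistics import median

def impute_median_with_flag(values):
    """Fill None entries with the median of the observed values.

    Returns the imputed list plus a 0/1 flag marking which entries were missing.
    """
    observed = [v for v in values if v is not None]
    fill = median(observed)
    imputed = [fill if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return imputed, was_missing

ages = [34, None, 29, 41, None, 38]  # hypothetical data
imputed_ages, age_missing = impute_median_with_flag(ages)
print(imputed_ages)  # [34, 36.0, 29, 41, 36.0, 38]
print(age_missing)   # [0, 1, 0, 0, 1, 0]
```

In a real project the fill value would be computed from the training split only and then reused on validation and test data.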
4. What is an outlier, and how do you detect one?
An outlier is an observation that lies far from the bulk of the data and may reflect a genuine extreme, a measurement error, or a data-entry mistake. Common detection methods are the z-score (flagging points more than about three standard deviations from the mean), the IQR rule (points below Q1 minus 1.5 times the IQR or above Q3 plus 1.5 times the IQR), and visual tools like box plots and scatter plots. How you treat one depends on context — you might remove it, cap it, transform the variable, or keep it if it is a valid signal.
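The two rules above can be sketched with the standard library as follows. The function names and the sample data are illustrative assumptions, and the quartiles use the "inclusive" method, which is one of several conventions.

```python
from statistics import mean, stdev, quantiles

def iqr_outliers(values, k=1.5):
    """Return points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def zscore_outliers(values, threshold=3.0):
    """Return points more than `threshold` sample standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

data = [10, 12, 11, 13, 12, 11, 10, 12, 95]  # hypothetical data
print(iqr_outliers(data))     # [95]
print(zscore_outliers(data))  # []
```

The z-score rule misses the obvious extreme here because, in a small sample, the outlier itself inflates the standard deviation. With nine points no z-score can exceed about 2.67, which is one reason the IQR rule is often preferred for small or skewed samples.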
5. What is feature engineering, and why does it matter?
Feature engineering is the process of creating, transforming, or selecting input variables to make patterns easier for a model to learn. Examples include encoding categorical variables, creating interaction or ratio terms, extracting components from dates, binning continuous values, and applying log transforms to skewed data. It matters because well-chosen features often improve performance more than swapping in a more complex algorithm, since they inject domain knowledge the model cannot discover on its own.
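A short sketch of a few of these transformations applied to a single record. The field names (`order_date`, `amount`, `n_items`) and the derived feature names are hypothetical.

```python
import math
from datetime import date

def engineer_features(order):
    """Derive model-friendly features from one raw order record."""
    d = order["order_date"]
    return {
        # log1p compresses a right-skewed amount and is defined at zero
        "log_amount": math.log1p(order["amount"]),
        # components extracted from the date
        "order_month": d.month,
        "is_weekend": int(d.weekday() >= 5),  # Monday is 0, so 5 and 6 are the weekend
        # ratio feature: average spend per item
        "amount_per_item": order["amount"] / order["n_items"],
    }

order = {"order_date": date(2024, 3, 16), "amount": 120.0, "n_items": 4}
features = engineer_features(order)
print(features)
```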
6. What is the difference between normalization and standardization?
Normalization, often called min-max scaling, rescales values to a fixed range such as 0 to 1; because the result depends on the minimum and maximum, it is sensitive to outliers. Standardization rescales a feature to have a mean of 0 and a standard deviation of 1, producing z-scores, and does not bound the range. Use normalization when you need values in a set range or the data is not Gaussian, and standardization when the algorithm assumes roughly normal, centered features. Many distance-based and gradient-based methods benefit from scaled inputs of either kind.
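Both rescalings fit in a few lines. This is a minimal sketch with illustrative function names; it uses the population standard deviation and assumes the values are not all identical.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale to the range [0, 1] using the observed minimum and maximum."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale to mean 0 and standard deviation 1 (z-scores)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

values = [2, 4, 6, 8, 10]
normalized = min_max_scale(values)
z_scores = standardize(values)
print(normalized)  # [0.0, 0.25, 0.5, 0.75, 1.0]
print(z_scores)
```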
7. What is a p-value in hypothesis testing?
A p-value is the probability of observing a result at least as extreme as the one you got, assuming the null hypothesis is true. A small p-value means the data would be unlikely under the null; if it falls below a significance level chosen in advance (commonly 0.05), you reject the null in favor of the alternative. The p-value is not the probability that the null hypothesis is true, and statistical significance does not by itself mean the effect is large or practically important.
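To make the definition concrete, here is a sketch of a two-sided p-value for a one-sample z-test. It assumes the population standard deviation is known, which is rarely true in practice; the function name and the numbers are made up.

```python
from math import sqrt
from statistics import NormalDist

def z_test_p_value(sample_mean, null_mean, sigma, n):
    """Two-sided p-value for a one-sample z-test with known sigma."""
    z = (sample_mean - null_mean) / (sigma / sqrt(n))
    # probability, under the null, of a statistic at least this far from zero
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical example: a sample of 100 with mean 103, tested against a null mean of 100
p = z_test_p_value(sample_mean=103, null_mean=100, sigma=15, n=100)
print(round(p, 4))  # 0.0455
```

Here z equals 2.0, so the p-value falls just under 0.05.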
8. What are the null and alternative hypotheses?
The null hypothesis is the default claim of no effect or no difference — for example, that two groups have the same mean. The alternative hypothesis is what you are trying to find evidence for, such as that the means differ. A hypothesis test gathers data to decide whether there is enough evidence to reject the null in favor of the alternative; you never prove the null true, you only fail to reject it.
9. What is the difference between correlation and causation?
Correlation means two variables move together in a measurable, statistical way, while causation means one variable actually produces a change in the other. Correlation does not imply causation because a relationship can be driven by a confounding variable that influences both, or be pure coincidence. To establish causation you generally need a controlled or randomized experiment, or careful causal-inference techniques rather than observational correlation alone.
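A small simulation can illustrate confounding. In this sketch, which is an assumed toy setup rather than a causal-inference method, a hidden variable `z` drives both `x` and `y`. The two correlate strongly, yet the correlation vanishes once the known contribution of `z` is removed.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

rng = random.Random(42)
z = [rng.gauss(0, 1) for _ in range(2000)]   # confounder
x = [zi + rng.gauss(0, 0.5) for zi in z]     # x depends on z only
y = [zi + rng.gauss(0, 0.5) for zi in z]     # y depends on z only

raw_corr = pearson(x, y)
# Subtract the confounder's contribution; only independent noise remains
adjusted_corr = pearson([xi - zi for xi, zi in zip(x, z)],
                        [yi - zi for yi, zi in zip(y, z)])
print(round(raw_corr, 2), round(adjusted_corr, 2))
```

Neither variable influences the other in this setup, so the strong raw correlation is entirely due to `z`.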
10. What is the difference between a Type I and a Type II error?
A Type I error is a false positive — rejecting the null hypothesis when it is actually true, like concluding a drug works when it does not. A Type II error is a false negative — failing to reject the null when the alternative is actually true, like missing a real effect. The significance level alpha controls the Type I error rate, while the Type II error rate is beta, and statistical power equals one minus beta.
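The relationship between alpha, beta, and power can be sketched for the simplest case: a one-sided z-test with known standard deviation. The function name and the example numbers are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist

def one_sided_power(effect, sigma, n, alpha=0.05):
    """Power of a one-sided z-test to detect a true mean shift of `effect`."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)   # rejection cutoff set by alpha
    shift = effect * sqrt(n) / sigma         # true effect in standard-error units
    beta = std_normal.cdf(z_crit - shift)    # chance of failing to reject (Type II)
    return 1 - beta

print(round(one_sided_power(effect=0.5, sigma=1.0, n=25), 2))  # 0.8
print(round(one_sided_power(effect=0.0, sigma=1.0, n=25), 2))  # 0.05
```

When the true effect is zero, the rejection rate equals alpha, which is exactly the Type I error rate.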
11. What is the central limit theorem?
The central limit theorem states that as the sample size grows, the distribution of the mean of independent, identically distributed observations approaches a normal distribution, regardless of the shape of the underlying population, provided the variance is finite. A common rule of thumb is that the approximation works well once the sample size is around 30 or more, though heavily skewed or heavy-tailed populations can need larger samples. It is foundational because it lets us use normal-based confidence intervals and hypothesis tests for means even when the raw data is not normally distributed.
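A quick simulation shows the theorem at work. This sketch draws samples from an exponential distribution, which is strongly skewed, and checks that the sample means cluster around the population mean of 1 with a spread close to 1 / sqrt(n). The function name, seed, and repetition count are arbitrary choices.

```python
import random
from math import sqrt
from statistics import mean, stdev

def sample_means(n, repetitions, seed=0):
    """Means of `repetitions` samples of size n from an Exponential(1) population."""
    rng = random.Random(seed)
    return [mean([rng.expovariate(1.0) for _ in range(n)])
            for _ in range(repetitions)]

means = sample_means(n=30, repetitions=2000)
print(round(mean(means), 2))   # close to the population mean, 1.0
print(round(stdev(means), 2))  # close to 1 / sqrt(30), about 0.18
```

Plotting a histogram of `means` would show a roughly bell-shaped curve even though the underlying population is far from normal.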
12. How does an A/B test work?
An A/B test randomly splits users into a control group that sees the existing version and a treatment group that sees a change, then compares a chosen metric between them. Randomization balances out confounding factors so that any significant difference can be attributed to the change. You decide the metric, sample size, and significance level in advance, run the test until you reach that sample size rather than stopping as soon as the result looks significant, and use a statistical test to judge whether the difference is significant rather than due to chance.
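For a conversion-rate metric, one common choice of statistical test is the two-proportion z-test. The sketch below is one way to write it with the standard library; the function name and the conversion counts are made up.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (B minus A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # rate under the null of no difference
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical results: control converts 200 of 2000, treatment 250 of 2000
z_stat, p_value = two_proportion_z_test(200, 2000, 250, 2000)
print(round(z_stat, 2), round(p_value, 3))
```

With these numbers the p-value is below 0.05, so the lift from 10% to 12.5% would be judged significant at that level.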
13. What is exploratory data analysis (EDA)?
EDA is the initial investigation of a dataset to understand its structure, spot patterns, check assumptions, and find anomalies before modeling. It combines summary statistics (means, medians, distributions, correlations) with visualizations like histograms, box plots, and scatter plots. The goal is to build intuition about the data, surface quality issues such as missing values or outliers, and inform decisions about cleaning and feature engineering.
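A first pass at summary statistics might look like the sketch below, which uses only the standard library. The function name `summarize` and the `session_minutes` data are hypothetical.

```python
from statistics import mean, median, stdev

def summarize(values):
    """Basic summary of a numeric column that may contain None for missing values."""
    observed = [v for v in values if v is not None]
    return {
        "n": len(values),
        "n_missing": len(values) - len(observed),
        "min": min(observed),
        "median": median(observed),
        "mean": mean(observed),
        "max": max(observed),
        "stdev": stdev(observed),
    }

session_minutes = [3, 5, None, 8, 5, 100]  # hypothetical data
summary = summarize(session_minutes)
print(summary)
```

A mean (24.2) far above the median (5) is a quick hint of skew or an outlier worth inspecting in a histogram or box plot.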
14. What is data leakage, and how do you prevent it?
Data leakage happens when information that would not be available at prediction time leaks into training, causing overly optimistic results that collapse in production. Classic causes include using future information, including the target in a feature, or fitting scalers and imputers on the full dataset before splitting. To prevent it, split into train and test sets first, fit all preprocessing only on the training data (ideally inside a pipeline and within cross-validation folds), and carefully exclude features that are proxies for the target.
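The scaler case can be sketched in a few lines: fit the scaling statistics on the training split only, then reuse them on the test split. The helper names and data are illustrative; in practice a pipeline abstraction from a machine-learning library would usually handle this.

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Learn scaling statistics from the given values."""
    return mean(values), pstdev(values)

def transform(values, scaler):
    """Apply previously learned statistics to new values."""
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]

train = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical data
test = [100.0]

# Correct: statistics come from the training data only
scaler = fit_scaler(train)
train_scaled = transform(train, scaler)
test_scaled = transform(test, scaler)

# Leaky: the test value influences the mean and standard deviation
leaky_scaler = fit_scaler(train + test)
print(scaler[0], round(leaky_scaler[0], 2))  # 3.0 19.17
```

The leaky mean has already "seen" the test point, so any evaluation based on it no longer reflects truly unseen data.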
15. What is the difference between long and wide data formats?
Wide format has one row per subject with separate columns for each variable or time point, so the table is shorter and wider. Long format has multiple rows per subject, typically with a key column identifying the variable and a value column holding the measurement, making the table taller and narrower. Many plotting and modeling tools expect long format, and you convert between the two with reshaping operations: melt turns wide data into long, and pivot turns long data back into wide.
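To show what the two reshaping operations do, here is a plain-Python sketch over lists of dictionaries. The column names `variable` and `value` and the sample records are assumptions; dataframe libraries such as pandas provide `melt` and `pivot` methods for the same purpose.

```python
def melt(wide_rows, id_col):
    """Wide to long: one output row per (subject, variable) pair."""
    long_rows = []
    for row in wide_rows:
        for col, value in row.items():
            if col != id_col:
                long_rows.append({id_col: row[id_col], "variable": col, "value": value})
    return long_rows

def pivot(long_rows, id_col):
    """Long to wide: collect each subject's variables back into one row."""
    wide = {}
    for row in long_rows:
        record = wide.setdefault(row[id_col], {id_col: row[id_col]})
        record[row["variable"]] = row["value"]
    return list(wide.values())

wide = [
    {"id": "a", "week1": 5, "week2": 7},
    {"id": "b", "week1": 3, "week2": 4},
]
long = melt(wide, "id")
print(long[0])  # {'id': 'a', 'variable': 'week1', 'value': 5}
print(pivot(long, "id") == wide)  # True
```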
16. What are some common sampling methods?
Simple random sampling gives every member an equal chance of selection. Stratified sampling divides the population into subgroups and samples within each to preserve their proportions, which is useful for imbalanced classes. Systematic sampling selects every k-th element from an ordered list, and cluster sampling randomly selects whole groups and includes everyone within them. The aim is a representative sample that lets you generalize while avoiding selection bias.
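Stratified sampling is the method that most often needs a few lines of code. The sketch below samples the same fraction from each stratum so class proportions are preserved; the function name, the `label` key, and the data are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=0):
    """Sample the same fraction from every stratum defined by rows[key]."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # keep at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical imbalanced data: 10 positives, 90 negatives
rows = [{"id": i, "label": "pos" if i < 10 else "neg"} for i in range(100)]
sample = stratified_sample(rows, key="label", fraction=0.2)
n_pos = sum(r["label"] == "pos" for r in sample)
print(len(sample), n_pos)  # 20 2
```

A simple random sample of 20 rows could easily contain zero or one positive; stratifying guarantees the 10% share is kept.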
17. How do you choose evaluation metrics for a classification problem?
Accuracy is the share of correct predictions, but it is misleading on imbalanced data, so you look at precision (of the predicted positives, how many were right), recall (of the actual positives, how many you caught), and the F1 score that balances the two. For ranking and threshold-independent quality, ROC-AUC measures how well the model separates classes. The right choice depends on the cost of errors — favor recall when missing positives is expensive, like fraud or disease detection, and precision when false alarms are costly.
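These metrics all follow from the four confusion-matrix counts. The sketch below computes them for binary labels, returning 0.0 when a denominator is zero, which is one convention among several. The function name and data are illustrative.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 is positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)

# A model that always predicts the majority class on imbalanced data
always_negative = classification_metrics([1] * 5 + [0] * 95, [0] * 100)
print(always_negative["accuracy"], always_negative["recall"])  # 0.95 0.0
```

The second example shows why accuracy misleads on imbalanced data: 95% accuracy with zero positives caught.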