1. SQL Mastery
01. What is the difference between WHERE and HAVING?
WHERE is used to filter rows before grouping, while HAVING is used to filter groups after the GROUP BY clause is applied.
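A minimal sketch of the two filters working together, run through Python's built-in sqlite3 module. The `sales` table and its values are hypothetical and exist only for illustration.

```python
import sqlite3

# Hypothetical in-memory table, created purely for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("North", 100), ("North", 50), ("South", 30), ("South", 200), ("East", 500)],
)

# WHERE removes individual rows (amount < 40) before grouping.
# HAVING then removes whole groups whose total is below 200.
rows = conn.execute(
    """
    SELECT region, SUM(amount) AS total
    FROM sales
    WHERE amount >= 40
    GROUP BY region
    HAVING SUM(amount) >= 200
    ORDER BY region
    """
).fetchall()
print(rows)  # [('East', 500), ('South', 200)]
```

South totals 200 rather than 230 because WHERE discarded its 30 row before the groups were formed, and North (150) is dropped by HAVING.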
02. Explain the difference between Rank, Dense_Rank, and Row_Number.
ROW_NUMBER assigns a unique sequential number to each row, even when rows tie. RANK gives tied rows the same number and then skips ahead (e.g., 1, 2, 2, 4). DENSE_RANK gives tied rows the same number without leaving gaps (e.g., 1, 2, 2, 3).
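The three functions can be compared side by side on a tie. This is a sketch that assumes a hypothetical `scores` table and an SQLite build with window function support (version 3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")  # hypothetical table
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("Ann", 90), ("Bob", 85), ("Cid", 85), ("Dee", 70)],
)

# Bob and Cid tie on score. ROW_NUMBER gets a name tiebreaker so the
# numbering is deterministic; RANK and DENSE_RANK order by score only.
ranked = conn.execute(
    """
    SELECT name,
           ROW_NUMBER() OVER (ORDER BY score DESC, name) AS row_num,
           RANK()       OVER (ORDER BY score DESC)       AS rnk,
           DENSE_RANK() OVER (ORDER BY score DESC)       AS dense_rnk
    FROM scores
    ORDER BY score DESC, name
    """
).fetchall()
for row in ranked:
    print(row)
# ('Ann', 1, 1, 1)
# ('Bob', 2, 2, 2)
# ('Cid', 3, 2, 2)
# ('Dee', 4, 4, 3)
```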
03. What are CTEs and why are they used?
Common Table Expressions (CTEs) are temporary result sets defined within the execution scope of a single SELECT, INSERT, UPDATE, or DELETE statement. They improve readability and are useful for recursive queries.
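As an illustration of the recursive use case, the sketch below walks a small reporting hierarchy. The `employees` table, its columns, and the CTE name `org` are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Ava", None), (2, "Ben", 1), (3, "Cal", 2), (4, "Dia", 1)],
)

# The anchor member selects the root (no manager); the recursive member
# repeatedly joins employees to rows already in the CTE, one level at a time.
chain = conn.execute(
    """
    WITH RECURSIVE org(id, name, depth) AS (
        SELECT id, name, 0 FROM employees WHERE manager_id IS NULL
        UNION ALL
        SELECT e.id, e.name, org.depth + 1
        FROM employees AS e
        JOIN org ON e.manager_id = org.id
    )
    SELECT name, depth FROM org ORDER BY depth, name
    """
).fetchall()
print(chain)  # [('Ava', 0), ('Ben', 1), ('Dia', 1), ('Cal', 2)]
```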
04. How do you handle NULL values in SQL?
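Using COALESCE(column, replacement), a dialect-specific function such as ISNULL() in SQL Server or IFNULL() in MySQL and SQLite, or CASE expressions. It's important to remember that any comparison with NULL evaluates to unknown, so NULL = NULL is not true; test for NULL with IS NULL or IS NOT NULL.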
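A short sketch of how NULL comparisons behave in practice, using SQLite through Python's sqlite3 module. The `contacts` table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# NULL = NULL evaluates to NULL (unknown), which sqlite3 returns as None.
# NULL IS NULL evaluates to true, returned as 1.
null_eq, null_is = conn.execute("SELECT NULL = NULL, NULL IS NULL").fetchone()

conn.execute("CREATE TABLE contacts (name TEXT, phone TEXT)")  # hypothetical table
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?)", [("Ann", "555-0100"), ("Bob", None)]
)

# COALESCE substitutes a default for missing values.
filled = conn.execute(
    "SELECT name, COALESCE(phone, 'unknown') FROM contacts ORDER BY name"
).fetchall()

# IS NULL finds the missing row; "= NULL" silently matches nothing.
missing = conn.execute("SELECT name FROM contacts WHERE phone IS NULL").fetchall()
naive = conn.execute("SELECT name FROM contacts WHERE phone = NULL").fetchall()

print(null_eq, null_is)  # None 1
print(filled)            # [('Ann', '555-0100'), ('Bob', 'unknown')]
print(missing, naive)    # [('Bob',)] []
```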
05. What is a Self Join and when would you use it?
A self join is a regular join but the table is joined with itself. It's used when a table has a foreign key that references its own primary key (e.g., employee-manager hierarchy).
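The employee-manager case could look like the sketch below. The `employees` table and the aliases `e` and `m` are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Ava", None), (2, "Ben", 1), (3, "Cal", 2)],
)

# The same table appears twice under different aliases:
# e is the employee row, m is that employee's manager row.
# LEFT JOIN keeps Ava, who has no manager.
pairs = conn.execute(
    """
    SELECT e.name, m.name
    FROM employees AS e
    LEFT JOIN employees AS m ON e.manager_id = m.id
    ORDER BY e.id
    """
).fetchall()
print(pairs)  # [('Ava', None), ('Ben', 'Ava'), ('Cal', 'Ben')]
```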
06. Explain ACID properties in Databases.
Atomicity, Consistency, Isolation, and Durability. These ensure that database transactions are processed reliably.
07. What is the difference between UNION and UNION ALL?
UNION removes duplicate rows from the combined result set, while UNION ALL keeps every row, including duplicates. UNION ALL is generally faster because it skips the duplicate-elimination step.
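A minimal sketch of the difference, with two hypothetical single-column tables `a` and `b` that share the value 2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (x INTEGER)")
conn.execute("CREATE TABLE b (x INTEGER)")
conn.executemany("INSERT INTO a VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO b VALUES (?)", [(2,), (3,)])

# UNION deduplicates the combined rows; UNION ALL keeps the repeated 2.
union = conn.execute("SELECT x FROM a UNION SELECT x FROM b ORDER BY x").fetchall()
union_all = conn.execute(
    "SELECT x FROM a UNION ALL SELECT x FROM b ORDER BY x"
).fetchall()

print(union)      # [(1,), (2,), (3,)]
print(union_all)  # [(1,), (2,), (2,), (3,)]
```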
08. How do you find the 2nd highest salary in a table?
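Using a subquery: SELECT MAX(salary) FROM employees WHERE salary < (SELECT MAX(salary) FROM employees). Or using OFFSET (MySQL, PostgreSQL, SQLite): SELECT DISTINCT salary FROM employees ORDER BY salary DESC LIMIT 1 OFFSET 1. DISTINCT matters here: without it, a tie for the top salary makes the query return the highest salary again.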
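The tie case is worth checking by hand. The sketch below assumes a hypothetical `employees` table in which two people share the top salary, and compares the subquery form with the OFFSET form written with and without DISTINCT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ann", 300), ("Bob", 300), ("Cid", 200), ("Dee", 100)],
)

# Subquery form: the largest salary strictly below the maximum.
second_max = conn.execute(
    "SELECT MAX(salary) FROM employees "
    "WHERE salary < (SELECT MAX(salary) FROM employees)"
).fetchone()[0]

# OFFSET form over distinct salaries: skip the top value, take the next.
second_distinct = conn.execute(
    "SELECT DISTINCT salary FROM employees ORDER BY salary DESC LIMIT 1 OFFSET 1"
).fetchone()[0]

# Without DISTINCT, the second row is just the other 300.
second_row = conn.execute(
    "SELECT salary FROM employees ORDER BY salary DESC LIMIT 1 OFFSET 1"
).fetchone()[0]

print(second_max, second_distinct, second_row)  # 200 200 300
```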
09. What are Window Functions?
Window functions perform calculations across a set of table rows that are somehow related to the current row. Examples: SUM() OVER(), AVG() OVER(), LEAD(), LAG().
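A small sketch of a running total and a previous-row lookup. It assumes a hypothetical `daily` revenue table and an SQLite build with window function support (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily (day INTEGER, revenue INTEGER)")  # hypothetical
conn.executemany("INSERT INTO daily VALUES (?, ?)", [(1, 10), (2, 30), (3, 20)])

# Unlike GROUP BY, window functions keep one output row per input row.
# SUM ... OVER (ORDER BY day) accumulates up to the current row;
# LAG looks one row back and yields NULL for the first row.
windowed = conn.execute(
    """
    SELECT day,
           revenue,
           SUM(revenue) OVER (ORDER BY day) AS running_total,
           LAG(revenue) OVER (ORDER BY day) AS prev_revenue
    FROM daily
    ORDER BY day
    """
).fetchall()
for row in windowed:
    print(row)
# (1, 10, 10, None)
# (2, 30, 40, 10)
# (3, 20, 60, 30)
```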
10. Explain Correlated Subqueries.
A subquery that references values from the outer query. Conceptually it is evaluated once for every row processed by the outer query, although the optimizer may rewrite it as a join.
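A typical example is "employees who earn more than their own department's average". The table, columns, and values below are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Ann", "Eng", 120), ("Bob", "Eng", 80), ("Cid", "Ops", 50), ("Dee", "Ops", 70)],
)

# The inner query refers to e.dept from the outer query, so the average
# it computes depends on which outer row is being tested.
above_avg = conn.execute(
    """
    SELECT e.name
    FROM employees AS e
    WHERE e.salary > (
        SELECT AVG(e2.salary) FROM employees AS e2 WHERE e2.dept = e.dept
    )
    ORDER BY e.name
    """
).fetchall()
print(above_avg)  # [('Ann',), ('Dee',)]
```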
2. Python for Data
11. What is the difference between a list and a tuple?
Lists are mutable (can be changed) and use square brackets []. Tuples are immutable (cannot be changed) and use parentheses (). Tuples are generally faster and safer for data that shouldn't change.
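A quick sketch of the practical consequences of mutability. The variable names are illustrative only.

```python
point_list = [1, 2]
point_list.append(3)  # lists can be changed in place

point_tuple = (1, 2)
try:
    point_tuple[0] = 99  # item assignment is not allowed on a tuple
    tuple_error = None
except TypeError as exc:
    tuple_error = type(exc).__name__

# Because a tuple of hashable items is itself hashable,
# it can serve as a dictionary key; a list cannot.
labels = {(0, 0): "origin"}

print(point_list)      # [1, 2, 3]
print(tuple_error)     # TypeError
print(labels[(0, 0)])  # origin
```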
12. What are decorators in Python?
Decorators are a way to modify the behavior of a function or class without changing its source code. They use the @ symbol.
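As a minimal sketch, the hypothetical `count_calls` decorator below adds call counting to a function without touching its body.

```python
import functools


def count_calls(func):
    """Hypothetical decorator that counts how often func is called."""

    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)

    wrapper.calls = 0
    return wrapper


@count_calls  # equivalent to: add = count_calls(add)
def add(a, b):
    return a + b


add(1, 2)
add(3, 4)
print(add.calls)     # 2
print(add.__name__)  # add
```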
13. Explain the difference between loc and iloc in Pandas.
loc is label-based (uses index/column names), while iloc is integer-based (uses numeric positions).
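The difference is easiest to see when the index labels are not 0, 1, 2. A sketch with a hypothetical DataFrame, assuming pandas is installed:

```python
import pandas as pd

# Hypothetical data; the index labels are deliberately not 0, 1, 2.
df = pd.DataFrame(
    {"city": ["Oslo", "Rome", "Lima"], "temp": [5, 18, 22]},
    index=[10, 20, 30],
)

by_label = df.loc[20, "city"]  # row labeled 20, column named "city"
by_position = df.iloc[0, 1]    # first row, second column

# Label slices include the end label; positional slices exclude the end.
label_slice = df.loc[10:20]
position_slice = df.iloc[0:1]

print(by_label, by_position)                  # Rome 5
print(len(label_slice), len(position_slice))  # 2 1
```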
14. How do you handle missing data in a Pandas DataFrame?
Using df.dropna() to remove rows with missing values, or df.fillna(value) to replace them with a constant or a statistic such as the column mean or median.
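Both approaches in a short sketch, assuming pandas is installed. The DataFrame and the fill choices (column mean for `age`, a placeholder string for `city`) are illustrative, not a recommendation for every dataset.

```python
import pandas as pd

# Hypothetical data with one missing value per column.
df = pd.DataFrame({"age": [30.0, None, 50.0], "city": ["Oslo", "Rome", None]})

# dropna keeps only rows with no missing values.
dropped = df.dropna()

# fillna with a dict applies a different replacement per column.
# mean() ignores missing values by default, so it is (30 + 50) / 2.
filled = df.fillna({"age": df["age"].mean(), "city": "unknown"})

print(len(dropped))             # 1
print(filled["age"].tolist())   # [30.0, 40.0, 50.0]
print(filled["city"].tolist())  # ['Oslo', 'Rome', 'unknown']
```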
15. What is a lambda function?
An anonymous function defined with the lambda keyword and limited to a single expression. Used for short operations, often with map(), filter(), or as the key argument to sorted() and list.sort().
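Three common uses in a short sketch. The sample data is made up.

```python
words = ["pear", "fig", "banana"]

# As a sort key: order words by length.
by_length = sorted(words, key=lambda w: len(w))

# With map and filter: transform and select items.
squares = list(map(lambda x: x * x, [1, 2, 3]))
evens = list(filter(lambda x: x % 2 == 0, range(6)))

print(by_length)  # ['fig', 'pear', 'banana']
print(squares)    # [1, 4, 9]
print(evens)      # [0, 2, 4]
```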
16. Explain the difference between merge, join, and concatenate in Pandas.
merge() performs SQL-style joins on specific columns (or indexes). join() is a convenience method that joins on the index by default. concat() stacks DataFrames vertically or horizontally along an axis.
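The three operations side by side, as a sketch that assumes pandas is installed. The `orders` and `customers` frames are hypothetical.

```python
import pandas as pd

orders = pd.DataFrame({"cust_id": [1, 2, 3], "total": [20, 35, 50]})
customers = pd.DataFrame({"cust_id": [1, 2], "name": ["Ann", "Bob"]})

# merge: SQL-style left join on a column; customer 3 gets a missing name.
merged = orders.merge(customers, on="cust_id", how="left")

# join: aligns on the index, so both frames are indexed by cust_id first.
joined = orders.set_index("cust_id").join(
    customers.set_index("cust_id"), how="inner"
)

# concat: stacks frames on top of each other (axis=0 is the default).
stacked = pd.concat([orders, orders], ignore_index=True)

print(len(merged), len(joined), len(stacked))  # 3 2 6
```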
17. What is List Comprehension?
A concise way to create lists. Example: [x*x for x in range(10) if x % 2 == 0].
18. How do you optimize a Pandas operation on a large dataset?
Using vectorized operations instead of loops, specifying data types (e.g., int32 instead of int64), and using chunking for very large files.
19. What is the GIL in Python?
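The Global Interpreter Lock (GIL) is a CPython mechanism that ensures only one thread executes Python bytecode at a time. It protects the interpreter's internal state (it does not make your own code free of race conditions) and limits multi-core performance for CPU-bound threads. I/O-bound threads are less affected because the GIL is released while waiting on I/O, and multiprocessing sidesteps it for CPU-bound work.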
20. Explain Generators in Python.
Generators are functions that return an iterator using the yield keyword. They are memory-efficient because they produce items one at a time on demand.
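A minimal sketch of lazy evaluation; `countdown` is a hypothetical example function.

```python
def countdown(n):
    """Yield n, n-1, ..., 1, producing each value only when asked."""
    while n > 0:
        yield n
        n -= 1


gen = countdown(3)
first = next(gen)  # runs the body up to the first yield
rest = list(gen)   # consumes the remaining values

# A generator expression feeds sum() one value at a time,
# so the million squares are never held in memory together.
total = sum(x * x for x in range(1_000_000))

print(first, rest)  # 3 [2, 1]
```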
3. Statistics & Logic
21. What is the Central Limit Theorem?
The CLT states that the distribution of sample means approaches a normal distribution as the sample size increases, regardless of the shape of the population distribution, provided the samples are independent and the population has finite variance.
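A simulation can make this concrete. The sketch below draws from a strongly skewed exponential distribution (an arbitrary choice for illustration) and checks how sample means behave; the sample sizes, repeat count, and seed are assumptions.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable


def sample_mean(n):
    # Exponential(1) is right-skewed, with mean 1 and standard deviation 1.
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))


def share_within_one_sd(values):
    # For a normal distribution this share is about 0.68.
    mu, sd = statistics.fmean(values), statistics.stdev(values)
    return sum(abs(v - mu) <= sd for v in values) / len(values)


means_small = [sample_mean(2) for _ in range(5000)]
means_large = [sample_mean(50) for _ in range(5000)]

spread_small = statistics.stdev(means_small)
spread_large = statistics.stdev(means_large)
share_large = share_within_one_sd(means_large)

# Theory predicts the spread of sample means near 1 / sqrt(n).
print(round(spread_small, 3), round(spread_large, 3), round(share_large, 3))
```

With n = 50 the spread of the sample means should land near 1/sqrt(50), about 0.14, and the share within one standard deviation should sit close to the normal value of 0.68, even though the underlying data is far from normal.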
22. What is a P-value?
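A P-value is the probability of obtaining results at least as extreme as the observed results, assuming the null hypothesis is true. A p-value below the chosen significance level (0.05 by convention) is taken as evidence against the null hypothesis. It is not the probability that the null hypothesis is true.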
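For a small worked case, suppose a coin lands heads 15 times in 20 flips and the null hypothesis is that the coin is fair. The scenario and the helper name `binom_tail` are assumptions for illustration; the p-value is computed exactly from the binomial distribution.

```python
from math import comb


def binom_tail(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# "At least as extreme" in one direction: 15 or more heads out of 20.
p_one_sided = binom_tail(20, 15)

# Doubling is valid here because Binomial(20, 0.5) is symmetric.
p_two_sided = 2 * p_one_sided

print(round(p_one_sided, 4), round(p_two_sided, 4))  # 0.0207 0.0414
```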
23. Explain Type I and Type II errors.
Type I error (False Positive) is rejecting a true null hypothesis. Type II error (False Negative) is failing to reject a false null hypothesis.
24. What is the difference between Correlation and Causation?
Correlation is a relationship between two variables. Causation means one variable directly causes a change in the other. Correlation does not imply causation.
25. Explain the Law of Large Numbers.
As the number of independent trials increases, the sample average gets closer and closer to the expected value.
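A die-rolling simulation illustrates the idea. The seed and the number of rolls are arbitrary choices; the expected value of a fair six-sided die is 3.5.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable
rolls = [random.randint(1, 6) for _ in range(100_000)]


def running_mean(n):
    """Average of the first n rolls."""
    return sum(rolls[:n]) / n


# Distance from the expected value after few and after many rolls.
err_small = abs(running_mean(10) - 3.5)
err_large = abs(running_mean(100_000) - 3.5)

print(err_small, err_large)
```

After 100,000 rolls the average should sit very close to 3.5, while the average of the first 10 rolls can easily be off by a few tenths.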
26. What is Normal Distribution?
A continuous, bell-shaped probability distribution that is symmetric around its mean, so the mean, median, and mode are all equal.
27. What is the difference between Mean, Median, and Mode?
Mean is the average. Median is the middle value. Mode is the most frequent value. Median is more robust to outliers.
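The robustness of the median shows up as soon as one extreme value is present. The salary figures below are invented for illustration.

```python
import statistics

# Hypothetical salaries in thousands, with one extreme value.
salaries = [40, 42, 42, 45, 50, 400]

mean = statistics.mean(salaries)      # pulled upward by the 400
median = statistics.median(salaries)  # average of the two middle values
mode = statistics.mode(salaries)      # most frequent value

print(round(mean, 2), median, mode)  # 103.17 43.5 42
```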
28. Explain Standard Deviation.
Standard deviation measures the amount of variation or dispersion in a set of values. Low SD means data is close to the mean.
29. What is A/B Testing?
A statistical experiment where two versions (A and B) are compared to see which performs better based on a specific metric.
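One common way to compare conversion rates is a two-proportion z-test. The sketch below is one possible analysis under a normal approximation, not the only valid one; the function name and the visitor and conversion counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H0: both conversion rates are equal."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical experiment: A converts 100 of 1000 visitors, B converts 130 of 1000.
z, p_value = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=130, n_b=1000)
print(round(z, 2), round(p_value, 3))
```

Here z comes out near 2.1 with a p-value of roughly 0.035, so at the conventional 0.05 level the difference would be called statistically significant.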
30. What is an Outlier and how do you handle it?
An outlier is a data point significantly different from the others. You can handle them by removing them (if erroneous), capping them, or using robust statistical methods.
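One widely used rule of thumb flags values more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies it to made-up data and shows both removal and capping; the 1.5 multiplier is a convention, not a law.

```python
import statistics

values = [10, 12, 11, 13, 12, 11, 95]  # hypothetical measurements

# Quartiles via the stdlib (default "exclusive" method).
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [v for v in values if v < low or v > high]
removed = [v for v in values if low <= v <= high]     # option 1: drop them
capped = [min(max(v, low), high) for v in values]     # option 2: cap at the fences

print(outliers)     # [95]
print(max(capped))  # 16.0
```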
4. Excel & Visualization
31. What is a Pivot Table?
An Excel tool that allows you to summarize, analyze, and explore large datasets by dragging and dropping fields into different areas.
32. Explain VLOOKUP vs XLOOKUP.
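VLOOKUP searches for a value in the first column of a table and returns a value from a column to its right. XLOOKUP is more flexible: it can look up in any direction (including to the left), defaults to an exact match, and accepts a fallback value for when nothing is found.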
33. What is Power Query?
An Excel/Power BI tool used for ETL (Extract, Transform, Load) operations. It allows you to clean and reshape data before analysis.
34. What is DAX in Power BI?
Data Analysis Expressions (DAX) is a formula language used in Power BI, Analysis Services, and Power Pivot in Excel to create custom calculations.
35. How do you choose the right chart for your data?
Bar charts for comparisons, Line charts for trends, Pie charts for parts of a whole (sparingly), and Scatter plots for relationships.
36. What is Data Normalization?
The process of organizing data to reduce redundancy and improve integrity, typically by dividing large tables into smaller ones.
37. Explain the difference between Fact and Dimension tables.
Fact tables contain quantitative data (metrics). Dimension tables contain descriptive attributes (context) related to the metrics.
38. What is a Star Schema?
A data modeling technique where a central fact table is surrounded by several dimension tables, resembling a star.
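A tiny star schema might look like the sketch below, built in SQLite through Python's sqlite3 module. The table names (`fact_sales`, `dim_product`, `dim_date`) and the data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables: descriptive context.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER);

    -- Fact table: metrics plus a foreign key to each dimension.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id INTEGER REFERENCES dim_date(date_id),
        amount INTEGER
    );

    INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Games');
    INSERT INTO dim_date VALUES (1, 2023), (2, 2024);
    INSERT INTO fact_sales VALUES (1, 1, 10), (1, 2, 15), (2, 2, 40), (2, 2, 5);
    """
)

# A typical report joins the fact table out to its dimensions and aggregates.
report = conn.execute(
    """
    SELECT p.category, d.year, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY p.category, d.year
    ORDER BY p.category, d.year
    """
).fetchall()
print(report)  # [('Books', 2023, 10), ('Books', 2024, 15), ('Games', 2024, 45)]
```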
39. How do you handle "Dirty Data"?
By identifying duplicates, correcting errors, handling missing values, and standardizing formats.
40. What is Data Storytelling?
The ability to communicate insights from data using a narrative that connects the dots for stakeholders, often using visualizations.
5. Pro Interview Tips
41. Explain a time you found a significant insight in data.
Use the STAR method. Describe the Situation, the Task you were given, the Action (analysis) you took, and the Result (insight/money saved).
42. How do you handle a situation where your stakeholders disagree with your data?
I re-verify my analysis first. Then I sit with them to understand their domain knowledge and explain my methodology transparently.
43. What is your process for starting a new data project?
Define the business problem -> Data collection -> Data cleaning -> Exploratory Data Analysis (EDA) -> Insights -> Reporting.
44. How do you stay updated with data analysis trends?
By following blogs like Towards Data Science, taking part in Kaggle competitions, and learning new tools such as GenAI for data workflows.
45. What is the most difficult data challenge you've faced?
Focus on a technical challenge (like inconsistent data sources) and how you overcame it using coding or logic.
46. Why do you want to be a Data Analyst at our company?
Align your skills with their specific industry and mention their recent achievements or data culture.
47. How do you explain technical concepts to non-technical managers?
Using analogies, avoiding jargon, and focusing on the business impact rather than the technical details.
48. What is the difference between Data Analysis and Data Science?
Data Analysis focuses on past data to provide insights. Data Science uses past data to build predictive models for the future.
49. How do you prioritize tasks when you have multiple deadlines?
Using the Eisenhower Matrix (Urgent vs Important) and communicating early with stakeholders about timelines.
50. Do you have any questions for us?
Ask about their data stack, their biggest current data challenge, or the team's workflow.