1. Data Architecture & Warehousing
01. What is the difference between a Data Lake and a Data Warehouse?
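A Data Warehouse stores structured, curated data modeled for analytics, with the schema enforced when data is written (schema-on-write). A Data Lake stores raw data of any shape (structured, semi-structured, or unstructured) at scale, with the schema applied when the data is read (schema-on-read).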
02. Explain the difference between ETL and ELT.
ETL (Extract, Transform, Load) transforms data before loading it into the warehouse. ELT (Extract, Load, Transform) loads raw data first and leverages the warehouse's power to transform it.
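A minimal sketch of the two orderings, using an in-memory SQLite database as a stand-in for the warehouse. The table names and sample rows are made up for illustration.

```python
import sqlite3

# Made-up source rows: (order_id, amount as text straight from the source).
raw_orders = [("o1", " 19.50 "), ("o2", "5.00"), ("o3", "n/a")]

conn = sqlite3.connect(":memory:")

# ETL: clean in application code, then load only the finished rows.
conn.execute("CREATE TABLE orders_etl (order_id TEXT, amount REAL)")
cleaned = []
for order_id, amount in raw_orders:
    try:
        cleaned.append((order_id, float(amount)))
    except ValueError:
        continue  # drop rows whose amount is not numeric
conn.executemany("INSERT INTO orders_etl VALUES (?, ?)", cleaned)

# ELT: load the raw strings first, then transform with SQL inside the database.
conn.execute("CREATE TABLE orders_raw (order_id TEXT, amount TEXT)")
conn.executemany("INSERT INTO orders_raw VALUES (?, ?)", raw_orders)
conn.execute(
    """
    CREATE TABLE orders_elt AS
    SELECT order_id, CAST(TRIM(amount) AS REAL) AS amount
    FROM orders_raw
    WHERE TRIM(amount) GLOB '[0-9]*'
    """
)

etl_rows = conn.execute("SELECT * FROM orders_etl ORDER BY order_id").fetchall()
elt_rows = conn.execute("SELECT * FROM orders_elt ORDER BY order_id").fetchall()
```

Both paths end with the same clean table. The difference is where the transformation runs and whether the raw data is kept in the warehouse.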
03. What is a Star Schema vs a Snowflake Schema?
Star Schema has a central fact table connected to denormalized dimension tables. Snowflake Schema normalizes dimension tables, saving space but increasing join complexity.
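A small star schema can be sketched with SQLite as below. The table names, columns, and rows are assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Denormalized dimensions: category lives directly on the product row.
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);

    -- Central fact table: foreign keys to dimensions plus numeric measures.
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        date_key INTEGER REFERENCES dim_date(date_key),
        quantity INTEGER,
        revenue REAL
    );

    INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
    INSERT INTO dim_date VALUES (20240101, 2024, 1), (20240201, 2024, 2);
    INSERT INTO fact_sales VALUES
        (1, 20240101, 2, 2000.0),
        (2, 20240101, 1, 300.0),
        (1, 20240201, 1, 1000.0);
    """
)

# A typical star-schema query: one join per dimension, then aggregate.
revenue_by_category = conn.execute(
    """
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    GROUP BY p.category
    ORDER BY p.category
    """
).fetchall()
```

In a snowflake variant, category would move to its own table and the query would need one more join.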
04. What is Slowly Changing Dimension (SCD) Type 2?
A method to track historical data by creating new records for changes, using start/end dates or versions to identify the current record.
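A minimal sketch of the Type 2 update logic on an in-memory list. The function name apply_scd2 and the column names are hypothetical, and the surrogate key is a simple counter.

```python
from datetime import date

def apply_scd2(rows, customer_id, city, change_date):
    """Close the current row for customer_id and append a new current row."""
    current = next(
        (r for r in rows if r["customer_id"] == customer_id and r["end_date"] is None),
        None,
    )
    if current is not None:
        if current["city"] == city:
            return rows  # attribute unchanged: keep history as it is
        current["end_date"] = change_date  # expire the old version
    rows.append({
        "customer_sk": len(rows) + 1,  # surrogate key, no business meaning
        "customer_id": customer_id,    # business (natural) key
        "city": city,
        "start_date": change_date,
        "end_date": None,              # None marks the current record
    })
    return rows

dim_customer = []
apply_scd2(dim_customer, "C-100", "Boston", date(2023, 1, 1))
apply_scd2(dim_customer, "C-100", "Denver", date(2024, 6, 1))
apply_scd2(dim_customer, "C-100", "Denver", date(2024, 7, 1))  # no change, no new row
```

After these calls the dimension holds two rows for C-100: the expired Boston row and the current Denver row.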
05. Explain Data Modeling (Conceptual, Logical, Physical).
Conceptual: High-level business view. Logical: Detailed structure without platform specifics. Physical: Actual implementation on a specific DB platform.
06. What is a Surrogate Key?
A unique identifier produced by the system (usually an integer) that has no business meaning, used as a primary key in dimension tables.
07. Explain OLTP vs OLAP.
OLTP (Online Transaction Processing) handles many fast, short transactions (like banking). OLAP (Online Analytical Processing) handles complex queries and aggregations over large volumes of historical data.
08. What is Data Sharding?
Horizontal partitioning of data across multiple databases to improve scalability and performance.
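One common routing scheme is hash-based sharding, sketched below. The helper shard_for and the shard count are assumptions. Real systems often use consistent hashing or range-based schemes instead.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard with a stable hash (same key, same shard, every run)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Route a few made-up user ids to 4 shards.
shards = {i: [] for i in range(4)}
for user_id in ["u1", "u2", "u3", "u4", "u5", "u6"]:
    shards[shard_for(user_id, 4)].append(user_id)
```

A stable hash is used rather than Python's built-in hash(), which is randomized per process for strings.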
09. What is a Data Mesh?
A decentralized architectural pattern where data is managed by domain-specific teams as a product.
10. What is CDC (Change Data Capture)?
A set of techniques for identifying and capturing the changes (inserts, updates, deletes) made to source data, often by reading the database transaction log, so that downstream systems can act on just the changed data.
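The simplest form of change detection compares two snapshots keyed by primary key, as sketched below. The function diff_snapshots is hypothetical. Production CDC more often reads the transaction log rather than diffing full snapshots.

```python
def diff_snapshots(before, after):
    """Return (operation, key, row) tuples describing how 'after' differs from 'before'."""
    changes = []
    for key, row in after.items():
        if key not in before:
            changes.append(("insert", key, row))
        elif before[key] != row:
            changes.append(("update", key, row))
    for key, row in before.items():
        if key not in after:
            changes.append(("delete", key, row))
    return changes

before = {1: {"status": "new"}, 2: {"status": "paid"}, 3: {"status": "new"}}
after = {1: {"status": "shipped"}, 2: {"status": "paid"}, 4: {"status": "new"}}
changes = diff_snapshots(before, after)
```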
2. Big Data & Processing (Spark)
11. What is Apache Spark and why is it used?
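A unified analytics engine for large-scale data processing. It is often much faster than MapReduce, especially for iterative workloads, because it keeps intermediate results in memory instead of writing them to disk between stages.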
12. Explain RDD vs DataFrame vs Dataset in Spark.
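RDD is the low-level building block: a distributed collection of objects with no schema. DataFrame is a distributed collection of data organized into named columns, which lets Spark optimize queries. Dataset adds compile-time type safety on top of the same engine and is available only in Scala and Java, where a DataFrame is a Dataset of Row objects.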
13. What are Transformations and Actions in Spark?
Transformations (like map, filter) are lazy; they build a logical plan. Actions (like count, collect) trigger the actual computation.
14. Explain Spark's Lazy Evaluation.
Spark doesn't run transformations immediately. It records them and optimizes the execution plan when an action is called.
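The idea can be illustrated with a plain-Python analogy (this is not the Spark API): a generator expression describes the work, and nothing runs until something consumes it.

```python
calls = []  # records every time the "transformation" actually executes

def double(x):
    calls.append(x)
    return x * 2

# "Transformations": describe a filter followed by a map. Nothing executes yet.
pipeline = (double(x) for x in range(5) if x % 2 == 0)
calls_before_action = list(calls)

# "Action": consuming the generator triggers the computation.
result = list(pipeline)
```

Before the action, calls_before_action is empty. Afterwards, double has run only for the values that passed the filter.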
15. What is a Spark DAG?
A Directed Acyclic Graph: the graph of computations Spark will perform on the data, which the scheduler splits into stages and tasks and which can be optimized before execution.
16. What is Data Skew and how do you fix it in Spark?
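Data Skew is when data is unevenly distributed across partitions, so a few tasks do most of the work. Fixes: salting the keys, repartitioning, or using broadcast joins. In Spark 3.x, Adaptive Query Execution can also split skewed join partitions automatically.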
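Salting can be sketched in plain Python (not the Spark API) by simulating how keys map to partitions. The partition count, salt count, and sample keys are assumptions.

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 8
NUM_SALTS = 8

def partition_of(key: str) -> int:
    """Stand-in for a hash partitioner: stable hash of the key modulo partition count."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS

# Skewed input: one "hot" key accounts for 900 of 1000 records.
keys = ["hot"] * 900 + [f"k{i}" for i in range(100)]

rng = random.Random(0)

def salted(key: str) -> str:
    """Append a random salt so one logical key spreads over several physical keys."""
    return f"{key}_{rng.randrange(NUM_SALTS)}"

before = Counter(partition_of(k) for k in keys)
after = Counter(partition_of(salted(k)) for k in keys)
max_before = max(before.values())  # size of the largest partition without salting
max_after = max(after.values())    # size of the largest partition with salting
```

After aggregating on the salted key, a second aggregation that strips the salt is needed to get the final per-key result.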
17. Explain Broadcast Join in Spark.
When joining a large table with a small table, Spark sends the small table to all executors to avoid expensive data shuffles.
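A plain-Python sketch of the idea (not the Spark API): each partition of the large table is joined locally against a full copy of the small table, so no rows of the large table need to move. The data and the helper join_partition are made up.

```python
# Small lookup table: every worker would hold a full copy of this dict.
small_table = {"US": "United States", "DE": "Germany"}

# Large table, already split into partitions of (order_id, country_code).
large_partitions = [
    [("o1", "US"), ("o2", "DE")],
    [("o3", "US"), ("o4", "FR")],
]

def join_partition(partition, broadcast_copy):
    """Inner-join one partition against the broadcast copy with local hash lookups."""
    return [
        (order_id, code, broadcast_copy[code])
        for order_id, code in partition
        if code in broadcast_copy
    ]

joined = [
    row
    for partition in large_partitions
    for row in join_partition(partition, small_table)
]
```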
18. What is Spark Streaming?
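An extension of the Spark API that allows scalable, high-throughput, fault-tolerant processing of live data streams. The original DStream-based API is now considered legacy; new applications normally use Structured Streaming, which is built on DataFrames.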
19. Explain the role of the Driver and Executor in Spark.
The Driver coordinates the job and manages the DAG. Executors perform the actual data processing tasks.
20. What is cache() vs persist() in Spark?
Both store intermediate results so they can be reused. cache() uses the default storage level (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames and Datasets), while persist() lets you choose a level explicitly (e.g., DISK_ONLY).
3. Storage & Formats
21. What is the difference between Row-based and Columnar storage?
Row-based (like CSV, Avro) is good for transactional writes. Columnar (like Parquet, ORC) is much faster for analytical queries that read specific columns.
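The two layouts can be pictured with plain Python structures, using made-up records. Real formats add encoding and compression on top of this.

```python
# Row layout: each record is stored together (cheap to append or fetch one record).
rows = [
    {"id": 1, "country": "US", "amount": 10.0},
    {"id": 2, "country": "DE", "amount": 20.0},
    {"id": 3, "country": "US", "amount": 30.0},
]

# Column layout: each column is stored together (cheap to scan one column).
columns = {name: [row[name] for row in rows] for name in rows[0]}

total_from_rows = sum(row["amount"] for row in rows)  # walks every full record
total_from_columns = sum(columns["amount"])           # reads only the 'amount' column
```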
22. Explain the Parquet file format.
An open-source, columnar storage file format for Hadoop. It provides efficient data compression and encoding schemes.
23. What is Apache Avro?
A row-based data serialization system that uses JSON for defining data types and protocols, and serializes data into a compact binary format.
24. Explain Delta Lake.
An open-source storage layer that brings ACID transactions to Apache Spark and big data workloads.
25. What is HDFS?
Hadoop Distributed File System. A distributed file system designed to run on commodity hardware and provide high-throughput access to data.
26. Explain the Hive MetaStore.
A central repository of Apache Hive metadata, storing information about tables, partitions, and their storage locations.
27. What is NoSQL and its types?
Non-relational databases. Types: Document (MongoDB), Key-Value (Redis), Column-family (Cassandra), Graph (Neo4j).
28. What is Snowflake and its unique architecture?
A cloud-based data platform with a unique multi-cluster shared data architecture that separates storage, compute, and services.
29. Explain Partitioning vs Indexing.
Partitioning splits the table into physical chunks based on a column. Indexing creates a separate data structure to speed up data lookup.
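A rough in-memory illustration of the difference, using made-up event records: partitioning groups the rows themselves, while an index is a separate lookup structure pointing at rows.

```python
from collections import defaultdict

events = [
    {"id": 1, "day": "2024-01-01", "user": "a"},
    {"id": 2, "day": "2024-01-01", "user": "b"},
    {"id": 3, "day": "2024-01-02", "user": "a"},
]

# Partitioning: physically group rows by a column; a query for one day skips the rest.
partitions = defaultdict(list)
for event in events:
    partitions[event["day"]].append(event)

# Indexing: keep rows where they are and build a side structure from key to position.
index_by_id = {event["id"]: position for position, event in enumerate(events)}

jan_first = partitions["2024-01-01"]  # partition pruning
event_3 = events[index_by_id[3]]      # index lookup
```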
30. What is S3 and its role in Data Engineering?
Amazon Simple Storage Service. Often used as the storage layer for Data Lakes due to its durability and scalability.
4. Orchestration & Pipelines
31. What is Apache Airflow?
A platform to programmatically author, schedule, and monitor workflows using Directed Acyclic Graphs (DAGs).
32. Explain DAG in Airflow.
A collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies.
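How dependencies determine run order can be sketched with Python's standard graphlib module. This is not the Airflow API, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "quality_check": {"transform"},
    "load": {"transform"},
    "notify": {"quality_check", "load"},
}

# A valid execution order: every task appears after all of its upstream tasks.
run_order = list(TopologicalSorter(dependencies).static_order())
```

A cycle in the dependencies would raise graphlib.CycleError, which is why the graph must be acyclic.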
33. What are Operators in Airflow?
Templates for a single task in a workflow (e.g., PythonOperator, BashOperator, S3ToRedshiftOperator).
34. What is Apache Kafka?
A distributed event store and stream-processing platform capable of handling high-throughput, real-time data feeds.
35. Explain Kafka Topics, Producers, and Consumers.
Topic: A category name where messages are stored. Producer: Publishes messages to topics. Consumer: Reads messages from topics.
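A toy in-memory model of these roles is sketched below. It is not the Kafka client API. The ToyBroker class is hypothetical and ignores partitions, consumer groups, and persistence.

```python
from collections import defaultdict

class ToyBroker:
    """Each topic is an append-only list; a message's offset is its list position."""

    def __init__(self):
        self.topics = defaultdict(list)

    def produce(self, topic, message):
        self.topics[topic].append(message)
        return len(self.topics[topic]) - 1  # offset of the new message

    def consume(self, topic, offset):
        return self.topics[topic][offset:]  # everything from 'offset' onward

broker = ToyBroker()
broker.produce("orders", "order-1")
broker.produce("orders", "order-2")

# The consumer tracks its own position, so reading does not remove messages.
consumer_offset = 0
batch = broker.consume("orders", consumer_offset)
consumer_offset += len(batch)

broker.produce("orders", "order-3")
next_batch = broker.consume("orders", consumer_offset)
```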
36. What is idempotency in data pipelines?
The property that running a pipeline multiple times with the same input yields the same result, ensuring reliability during retries.
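One common way to get idempotency is to overwrite the target partition instead of appending to it. A minimal sketch, with hypothetical function names and an in-memory list standing in for a table:

```python
def load_append(table, day, rows):
    """Not idempotent: every run adds the rows again."""
    table.extend({"day": day, **row} for row in rows)

def load_overwrite(table, day, rows):
    """Idempotent: replace the partition for 'day', then write it fresh."""
    table[:] = [row for row in table if row["day"] != day]
    table.extend({"day": day, **row} for row in rows)

batch = [{"order_id": 1}, {"order_id": 2}]
appended, overwritten = [], []
for _ in range(2):  # simulate the same run being retried
    load_append(appended, "2024-01-01", batch)
    load_overwrite(overwritten, "2024-01-01", batch)
```

After the retry, the appended table contains duplicates while the overwritten table is unchanged.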
37. How do you handle pipeline failures?
Using retries, alerts, dead-letter queues for bad data, and ensuring tasks are idempotent.
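Retries combined with a dead-letter queue can be sketched as follows. The helper names are hypothetical. A production version would add a backoff delay between attempts and raise an alert when records land in the queue.

```python
def run_with_retries(records, handler, max_attempts=3):
    """Process each record, retrying failures; records that keep failing are set aside."""
    processed, dead_letters = [], []
    for record in records:
        for attempt in range(1, max_attempts + 1):
            try:
                processed.append(handler(record))
                break
            except Exception as exc:
                if attempt == max_attempts:
                    dead_letters.append({"record": record, "error": str(exc)})
    return processed, dead_letters

attempts_seen = {}

def flaky_parse(record):
    """Made-up handler: 'bad' always fails, '7' fails once with a transient error."""
    attempts_seen[record] = attempts_seen.get(record, 0) + 1
    if record == "bad":
        raise ValueError("not a number")
    if record == "7" and attempts_seen[record] == 1:
        raise TimeoutError("transient failure")
    return int(record)

processed, dead_letters = run_with_retries(["1", "7", "bad"], flaky_parse)
```

The transient failure succeeds on retry, while the permanently bad record ends up in dead_letters without blocking the rest of the batch.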
38. What is Data Quality Monitoring?
Automated checks to ensure data completeness, accuracy, consistency, and timeliness (e.g., using Great Expectations).
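Two such checks, completeness and key uniqueness, are sketched below in plain Python. This does not use Great Expectations, and check_quality is a hypothetical helper.

```python
def check_quality(rows, required, unique_key):
    """Count missing required values and duplicate keys; pass only if both are zero."""
    missing = sum(1 for row in rows for column in required if row.get(column) is None)
    keys = [row[unique_key] for row in rows]
    duplicates = len(keys) - len(set(keys))
    return {
        "missing_values": missing,
        "duplicate_keys": duplicates,
        "passed": missing == 0 and duplicates == 0,
    }

rows = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": None},  # completeness problem
    {"order_id": 2, "amount": 5.0},   # uniqueness problem
]
report = check_quality(rows, required=["order_id", "amount"], unique_key="order_id")
```

A pipeline would typically run checks like these after each load and fail or alert when the report does not pass.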
39. Explain the role of Docker/Kubernetes in Data Engineering.
Docker containers package dependencies for portability. Kubernetes orchestrates and scales these containers for large-scale data workloads.
40. What is dbt (data build tool)?
A tool that lets data analysts and engineers transform data already loaded in their warehouse using SQL and software engineering best practices (like version control and testing).
5. Scenario Based
41. How would you design a pipeline for real-time sales reporting?
Source (DB/App) -> Kafka -> Spark Streaming/Flink -> Warehouse (Snowflake/BigQuery) -> Dashboard.
42. You have a Spark job failing with OutOfMemory (OOM) error. How do you debug?
Check for data skew, increase executor memory, optimize joins (e.g., use broadcast for small tables), increase the number of partitions so each task handles less data, and avoid collecting large results to the driver.
43. How do you handle schema evolution in a data lake?
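Use formats that support schema evolution (Avro resolves reader and writer schemas, and Spark can merge compatible Parquet schemas), prefer backward-compatible changes such as adding optional fields, and use a schema registry for Kafka topics.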
44. Describe your experience with data security and compliance (GDPR/PII).
Discuss encryption at rest/transit, data masking, access control (RBAC), and ensuring PII is handled according to policy.
45. How do you optimize a long-running ETL job in Snowflake?
Check the query profile, use clustering keys, optimize the use of result caching, and ensure the warehouse size is appropriate.
46. What is the most complex data architecture you've built?
Focus on the scale, the tools used (Airflow, Spark, Kafka), and the business value it provided.
47. How do you balance speed vs quality in data delivery?
By using MVP approaches for speed while building automated data quality checks to ensure long-term quality.
48. Why is Python preferred over Java for many data engineering tasks today?
Due to its ease of use, massive data library ecosystem (Pandas, PySpark, Airflow), and faster development cycles.
49. How do you document your data pipelines?
Using tools like dbt docs, data dictionaries, and maintaining clear DAG code comments.
50. What is the future of Data Engineering?
Moving towards Data Contracts, Data Observability, and automated AI-driven pipeline optimization.