Data Pipeline Automation Interview Questions (2026)

📌 Foundation Questions

Q1. What is the difference between ETL and ELT, and when do you choose one over the other?

ETL (Extract, Transform, Load) transforms data before loading it into the target system. This was the dominant pattern when compute was expensive and targets were rigid relational databases. ELT (Extract, Load, Transform) loads raw data into a cloud data warehouse or lake first, then transforms it using the warehouse's own compute engine. Today I prefer ELT for cloud-native stacks because platforms like Snowflake, BigQuery, and Databricks scale compute elastically and on demand, so transforming in place is usually simpler than running a separate transformation tier. ETL still makes sense when the source data contains PII that must be masked before it ever touches the target system.
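To make the PII case concrete, here is a minimal sketch of a transform step that masks sensitive fields before anything is loaded. The field list, the salt handling, and the function names are illustrative assumptions, not a prescribed implementation; a real pipeline would pull the salt or key from a secrets manager and might use tokenization instead of hashing.

```python
import hashlib

# Hypothetical list of sensitive columns; in practice this would come from a data catalog.
PII_FIELDS = {"email", "phone"}


def mask_value(value: str, salt: str) -> str:
    """Return a salted SHA-256 hex digest of the value."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def transform(record: dict, salt: str) -> dict:
    """ETL-style transform: mask PII fields, pass every other field through unchanged."""
    return {
        key: mask_value(str(val), salt) if key in PII_FIELDS and val is not None else val
        for key, val in record.items()
    }


# Example record (made-up data). Only the masked version would be handed to the load step.
raw = {"user_id": 42, "email": "ada@example.com", "country": "UK"}
masked = transform(raw, salt="demo-salt")
```

Because the hash is deterministic for a given salt, the masked column can still be used as a join key downstream, which is often the reason to hash rather than drop the field.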

Q2. How do you handle schema drift in a data pipeline?

Schema drift is one of the most common production failures I handle. My standard approach has three layers. First, I enable schema evolution on the target table (for example, Delta Lake's mergeSchema write option or BigQuery's ALLOW_FIELD_ADDITION schema update option), so new columns from the source are added automatically. Second, I run a schema comparison step at the start of each pipeline run that diffs the incoming schema against the registered expected schema and alerts on breaking changes such as column type changes or column deletions. Third, for critical pipelines, I version source schemas in a schema registry so I know exactly what contract each producer is sending.
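The schema comparison step in the second layer can be sketched roughly as follows. This assumes schemas are represented as simple column-name-to-type mappings; the SchemaDiff class, the diff_schemas function, and the sample schemas are hypothetical names and data for illustration, and a real pipeline would read the expected schema from its registry.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaDiff:
    added: dict = field(default_factory=dict)         # column -> new type
    removed: dict = field(default_factory=dict)       # column -> old type
    type_changed: dict = field(default_factory=dict)  # column -> (old type, new type)

    @property
    def is_breaking(self) -> bool:
        """Deletions and type changes are breaking; added columns are not."""
        return bool(self.removed or self.type_changed)


def diff_schemas(expected: dict, incoming: dict) -> SchemaDiff:
    """Compare the registered expected schema against the incoming one."""
    diff = SchemaDiff()
    for column, expected_type in expected.items():
        if column not in incoming:
            diff.removed[column] = expected_type
        elif incoming[column] != expected_type:
            diff.type_changed[column] = (expected_type, incoming[column])
    for column, incoming_type in incoming.items():
        if column not in expected:
            diff.added[column] = incoming_type
    return diff


# Made-up schemas for illustration.
expected = {"id": "bigint", "amount": "decimal(10,2)", "created_at": "timestamp"}
incoming = {"id": "bigint", "amount": "string", "region": "string"}

result = diff_schemas(expected, incoming)
if result.is_breaking:
    # A real pipeline would raise an alert or fail the run here.
    print(f"Breaking change: removed={result.removed}, type_changed={result.type_changed}")
```

Treating added columns as non-breaking matches the first layer, where schema evolution on the target absorbs them; only deletions and type changes stop the run.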


🔁 Data Pipeline Automation Interview Guide