Preparing for an ETL interview? This curated list of ETL interview questions tests both technical expertise and analytical thinking.
1. What is ETL?
ETL (Extract, Transform, Load) is a data integration process used to combine data from multiple sources into a centralized repository, such as a data warehouse. It involves three stages:
- Extract: Collect raw data from sources (databases, APIs, files, etc.).
- Transform: Clean, validate, and restructure the data.
- Load: Insert processed data into a target system for analysis.
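As a minimal sketch of the three stages (the table names raw_orders, dim_customer, and fact_orders are illustrative), an extract job first lands source rows in a staging table, after which the transform and load steps can often be expressed in SQL:

```sql
-- Extract: assume an ingestion job has already copied source rows into the staging table raw_orders.
-- Transform + Load: clean and restructure the staged rows, then insert them into the warehouse fact table.
INSERT INTO fact_orders (order_id, customer_key, order_date, amount)
SELECT r.order_id,
       c.customer_key,
       CAST(r.order_date AS DATE),      -- transform: normalize the date type
       COALESCE(r.amount, 0)            -- transform: replace missing amounts with 0
FROM   raw_orders r
JOIN   dim_customer c ON c.customer_id = r.customer_id
WHERE  r.order_id IS NOT NULL;          -- transform: drop rows that fail a basic validation rule
```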
2. Topics Covered in ETL Pipeline Interviews
- Core ETL stages (extract, transform, load), data sources (APIs, databases), incremental vs. full loads.
- Transformation: cleaning, validation, business logic. Loading strategies, data warehousing (star schema).
- Tools: AWS Glue, Spark, Airflow. Data modeling (facts/dimensions). Performance: partitioning, indexing.
- Error handling: idempotency, logging. Governance: GDPR, encryption.
- Testing: data quality checks.
- Advanced: streaming (Kafka), data lakes.
- Scenarios: design pipelines, troubleshoot issues.
- Practical and soft skills: SQL/Python fluency, reasoning about trade-offs.

ETL Interview Questions
1. How do you analyze tables in ETL?
You can validate the structures of system objects using the ANALYZE statement.
This statement generates statistics, which are then used by a cost-based optimizer to determine the most effective strategy for data retrieval.
ESTIMATE, DELETE, and COMPUTE are additional operations that ANALYZE supports.
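For example, in Oracle the statement looks roughly like this (the table name sales is illustrative; recent Oracle versions favor DBMS_STATS for gathering statistics):

```sql
-- Compute full optimizer statistics for a table
ANALYZE TABLE sales COMPUTE STATISTICS;

-- Estimate statistics from a sample instead of scanning the whole table
ANALYZE TABLE sales ESTIMATE STATISTICS SAMPLE 10 PERCENT;

-- Remove previously gathered statistics
ANALYZE TABLE sales DELETE STATISTICS;
```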
2. What SQL commands allow you to validate data completeness?
You can validate data completeness using the INTERSECT and MINUS set operators.
Running source MINUS target and target MINUS source returns the rows that exist in one table but not in the other.
If the INTERSECT count is lower than the source row count, or either MINUS query returns rows, the target contains missing, mismatched, or duplicate rows.
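A rough sketch of these checks, assuming a source_table and a target_table (names are illustrative; MINUS is Oracle syntax, while SQL Server and PostgreSQL use EXCEPT):

```sql
-- Rows in the source that never made it to the target
SELECT * FROM source_table
MINUS
SELECT * FROM target_table;

-- Rows in the target that do not exist in the source
SELECT * FROM target_table
MINUS
SELECT * FROM source_table;

-- Count of rows present in both; compare this with SELECT COUNT(*) FROM source_table
SELECT COUNT(*) FROM (
    SELECT * FROM source_table
    INTERSECT
    SELECT * FROM target_table
);
```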
3. What role does impact analysis play in an ETL system?
Impact analysis analyzes the metadata relating to an object to determine what is impacted by a change in its structure or content.
Changing data-staging objects can disrupt processes that are critical to loading the data warehouse correctly, so you must conduct an impact analysis before modifying a table once it has been created in the staging area.
4. What are the various Phases of data mining?
A phase in data mining is a logical step in the process of sifting through vast amounts of data to identify meaningful information.
- Exploration: The exploration phase aims to identify significant variables and determine their characteristics.
- Pattern Identification: In this phase, the main task is to search for patterns and select the one that gives the best predictions.
- Deployment stage: This stage can only be attained once a reliable, highly predictive pattern is identified in stage 2.
5. Explain the Use of ETL in data migration projects.
ETL tools are widely used in data migration projects.
For instance, if a company previously managed its data in Oracle 10g and now wants to switch to a SQL Server cloud database, the data must be moved from the source to the target system.
ETL tools are very helpful in this kind of migration because writing the migration code by hand would take a great deal of the user's time; the tools make the work far easier than hand-coding SQL or T-SQL.
ETL is therefore a beneficial process for projects involving data migration.
6. Which performs better, joining data first, then filtering it, or filtering data first, then joining it with other resources?
Filtering data and then joining it with other data sources is better.
Filtering unnecessary data as early as possible in the process is an excellent way to improve the efficiency of the ETL process, because it reduces the time spent on data transfer, I/O, and in-memory processing.
The general idea is to minimize the number of processed rows and avoid altering data that is never utilized.
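As a hedged illustration (table and column names are hypothetical), pushing the filter into a CTE before the join keeps the joined row set small:

```sql
-- Filter first, then join: only recent orders ever reach the join
WITH recent_orders AS (
    SELECT order_id, customer_id, amount
    FROM   orders
    WHERE  order_date >= '2024-01-01'    -- discard unneeded rows as early as possible
)
SELECT c.customer_name, o.order_id, o.amount
FROM   recent_orders o
JOIN   customers c ON c.customer_id = o.customer_id;
```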
7. Explain how ETL and OLAP tools differ.
- ETL tools: ETL tools allow you to extract, transform, and load the data in the data warehouse or data mart.
- OLAP (Online Analytical Processing) tools: OLAP tools help generate reports from data marts and warehouses for business analysis.
8. Briefly explain ETL mapping sheets.
ETL mapping sheets usually contain complete details about a source table and a destination table, including every column and how to look each one up in reference tables.
ETL testers may have to generate big queries with several joins at any stage of the testing process to check the data.
When using ETL mapping sheets, writing data verification queries is much easier.
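For instance, a hypothetical mapping sheet entry stating that fact_orders.customer_key is populated by looking up staging_orders.customer_id in dim_customer translates almost directly into a verification query like this sketch:

```sql
-- Hypothetical check derived from one mapping sheet row:
-- flag staged orders whose customer lookup is missing or wrong in the target
SELECT s.order_id
FROM   staging_orders s
JOIN   dim_customer d ON d.customer_id = s.customer_id
LEFT JOIN fact_orders f ON f.order_id = s.order_id
                       AND f.customer_key = d.customer_key
WHERE  f.order_id IS NULL;
```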
9. Find every user who was active for three days in a row.
```python
# Schema (DataFrame: sf_events)
#   date: datetime64[ns]
#   account_id: object
#   user_id: object
import pandas as pd

# Keep one row per user per activity date
df = sf_events[['user_id', 'date']].drop_duplicates()
df = df.sort_values(['user_id', 'date'])

# The date two days after each activity date
df['3_days'] = df['date'] + pd.DateOffset(days=2)

# The activity date two rows further down for the same user
df['shift_3'] = df.groupby('user_id')['date'].shift(-2)

# If the date two rows later equals the date plus two days, the user was active three days in a row
df[df['shift_3'] == df['3_days']]['user_id']
```
10. How would you modify a big table with more than 10 million rows?
Using batches is the most typical approach: divide one large update into smaller ones, for example, ten thousand rows per batch.
```sql
DECLARE @id_control INT = 0      -- current batch
      , @batchSize  INT = 10000  -- size of the batch
      , @results    INT = 1      -- row count after batch

-- if 0 rows are returned, exit the loop
WHILE (@results > 0)
BEGIN
    UPDATE [table]
    SET [column] = [value]
    WHERE [PrimaryKey column] > @id_control
      AND [PrimaryKey column] <= @id_control + @batchSize

    -- the latest row count
    SET @results = @@ROWCOUNT

    -- start the next batch
    SET @id_control = @id_control + @batchSize
END
```
If the table is too big, it might be preferable to make a new one, add the updated data, and then switch tables.
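A rough T-SQL sketch of that alternative (table and column names are illustrative):

```sql
-- Build a new table containing the updated data in one pass
SELECT id,
       UPPER(col) AS col,    -- the updated value is computed while copying
       other_col
INTO   big_table_new
FROM   big_table;

-- Once the copy is verified, swap the tables by renaming
EXEC sp_rename 'big_table', 'big_table_old';
EXEC sp_rename 'big_table_new', 'big_table';
```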