Here we discuss the Azure Data Factory interview questions that interviewers most commonly ask for Data Engineer positions.
1. What is Azure Data Factory?
Azure Data Factory (ADF) is a cloud-based ETL/ELT service for orchestrating and automating data movement and transformation across on-premises and cloud sources.
Key components include pipelines (workflows), activities (tasks like copy/data flow), datasets (data structures), linked services (connection information), and the Integration Runtime (execution environment).
2. Azure Data Factory Interview Topics
- Use Cases: Data migration, hybrid ETL, analytics.
- Data Flows: Serverless Spark-based transformations.
- Integration: Azure services (Synapse, Databricks), on-prem via the self-hosted integration runtime (SHIR).
- Monitoring: Azure Monitor, pipeline logs.
- Triggers/Params: Scheduling, dynamic workflows.
- Error Handling: Retries, alerts.
- Cost Management: Serverless pricing, optimizing execution.
- Governance: Version control (Git), schema drift, security (Key Vault).
Azure Data Factory Interview Questions
1. Is Azure Data Factory an ETL or ELT tool?
Azure Data Factory is a Microsoft cloud-based data integration service for analytics at scale, and it supports both ETL and ELT paradigms.
2. Why is ADF needed?
With an increasing amount of big data, ADF can orchestrate and operationalize processes to refine the enormous stores of raw business data into actionable business insights.
3. What sets Azure Data Factory apart from conventional ETL tools?
Azure Data Factory stands out from other ETL tools as it provides:
- Enterprise Data Readiness: 90+ built-in connectors to bring your data from disparate sources into the Azure cloud.
- Code-Free Transformation: UI-driven mapping dataflows.
- Ability to Run Code on Any Azure Compute: hand-coded data transformations can be dispatched to the Azure compute service of your choice.
- Ability to Rehost On-Prem SSIS on Azure in 3 Steps: existing SSIS packages can be lifted and shifted to run on the Azure cloud.
- Seamless DataOps: source control, automated deployment, and simple templates.
- Secure Data Integration: Managed virtual networks protect against data exfiltration, which, in turn, simplifies your networking.
4. What are the major components of Azure Data Factory?
- Pipelines: A data factory can contain one or more pipelines, each of which is a logical grouping of activities that together perform a task. For example, a pipeline can read data from Azure Blob Storage, transform it according to business logic, and load it into Cosmos DB or Azure Synapse for analytics.
- Activities: Activities represent a processing step in a pipeline. For example, you might use a copy activity to copy data between data stores. Data Factory supports data movement, transformations, and control activities.
- Datasets: Datasets represent data structures within the data stores, which simply point to or reference the data you want to use in your activities as inputs or outputs.
- Linked service: Much like a connection string, a linked service holds the information Data Factory needs to connect to external resources. When reading from Azure Blob Storage, for example, the storage linked service specifies the connection string for the storage account, while the Azure Blob dataset specifies the container and folder containing the data (see the sketch after this list).
- Integration Runtime: Integration runtime instances provide the bridge between an activity and a linked service. The IR is referenced by the linked service or activity and provides the compute environment where the activity either runs directly or gets dispatched from. This way, the activity can be performed in the region closest to the target data store or compute service, in the most performant way, while meeting security (no public exposure of data) and compliance needs.
- Data Flows: These are objects you build visually in Data Factory that transform data at scale on backend Spark services. You do not need to understand programming or Spark internals; just design your data transformation intent using graphs (mapping data flows) or spreadsheets (the Power Query activity).
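To make the relationship between these pieces concrete, here is a minimal sketch of the JSON Data Factory stores for a Blob Storage linked service and a dataset that references it, written as Python dictionaries. All names, paths, and the connection string are placeholders rather than values from a real factory.

```python
# Minimal sketch of ADF authoring JSON, expressed as Python dicts.
# All names, paths, and the connection string are placeholders.

linked_service = {
    "name": "ls_blob_storage",                # hypothetical linked service name
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            # In practice the secret would come from Azure Key Vault, not plain text.
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
        }
    }
}

dataset = {
    "name": "ds_sales_csv",                   # hypothetical dataset name
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {                # the dataset points at the linked service
            "referenceName": "ls_blob_storage",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "raw",           # container and folder are placeholders
                "folderPath": "sales/2024"
            },
            "columnDelimiter": ","
        }
    }
}
```

Activities in a pipeline reference the dataset by name, and the dataset carries the pointer back to the linked service, which is how ADF resolves where and how to connect at run time.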
5. What are the different ways to execute pipelines in Azure Data Factory?
- Debug mode can be helpful when trying out pipeline code and acts as a tool to test and troubleshoot our code.
- Manual Execution is what we do by clicking on the ‘Trigger now’ option in a pipeline. This is useful if you want to run your pipelines on an ad-hoc basis.
- We can schedule our pipelines at predefined times and intervals via a Trigger.
6. What is the purpose of Linked services in Azure Data Factory?
Linked services are used majorly for two purposes in Data Factory:
- For representing a data store, i.e., any storage system such as an Azure Blob Storage account, a file share, or an Oracle DB/SQL Server instance.
- For representing a compute resource, i.e., the underlying environment on which the activity defined in the pipeline is executed.
7. Can you elaborate more on data factory integration runtime?
The Integration Runtime or IR is the compute infrastructure for Azure Data Factory pipelines. It is the bridge between activities and linked services.
It’s referenced by the linked service or activity and provides the computing environment where the activity is run directly or dispatched.
This allows the activity to be performed in the closest region to the target data stores or computing service.
There are three types of integration runtime supported by Azure Data Factory:
- Azure integration runtime
- Self-hosted integration runtime
- Azure SSIS integration runtime
8. What is the limit on the number of integration runtimes, if any?
Within a data factory, the default limit on the total number of entities is 5,000, which includes pipelines, datasets, triggers, linked services, private endpoints, and integration runtimes.
One can create an online support ticket to raise the limit to a higher number if required.
9. What are ARM templates in Azure Data Factory? What are they used for?
An ARM template is a JSON (JavaScript Object Notation) file that defines the infrastructure and configuration for the data factory pipeline, including pipeline activities, linked services, datasets, etc. The template will contain essentially the same code as our pipeline.
ARM templates are helpful when we want to migrate our pipeline code to higher environments, say Production or Staging from Development, after we are convinced that the code is working correctly.
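For orientation, the skeleton of such a template might look roughly like the following, shown as a Python dict for readability. The factory parameter, pipeline name, and the empty activity list are placeholders; a real template exported from ADF also carries datasets, linked services, triggers, and their dependencies.

```python
# Skeleton of an ADF ARM template, expressed as a Python dict.
# Names are placeholders; a real export contains many more resources.
arm_template = {
    "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {
        "factoryName": {"type": "string"}     # supplied per environment (Dev/Staging/Prod)
    },
    "resources": [
        {
            "type": "Microsoft.DataFactory/factories/pipelines",
            "apiVersion": "2018-06-01",
            "name": "[concat(parameters('factoryName'), '/pl_copy_sales')]",
            "properties": {
                "activities": []              # the same activity JSON you author in the ADF UI
            }
        }
    ]
}
```

Because environment-specific values (factory name, connection strings, and so on) are surfaced as template parameters, the same template can be deployed to Staging or Production with a different parameter file.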
10. How can we deploy code to higher environments in Azure Data Factory?
- Create a feature branch that will store our code base.
- Create a pull request to merge the code into the Dev (collaboration) branch once we are confident in the changes.
- Publish the code from the Dev branch to generate the ARM templates.
- This can trigger an automated CI/CD DevOps pipeline to promote code to higher environments like Staging or Production.
11. Which three activities can you run in Azure Data Factory?
There are three types of activities you can run in Azure Data Factory:
- Data movement activities
- Data transformation activities
- Control flow activities
12. What are two types of computing environments supported by Azure Data Factory to execute the transform activities?
- On-demand computing environment
- Bring-your-own environment (self-managed compute)
13. What are the steps involved in the ETL process?
The ETL process follows four main steps:
- Connect and collect
- Data transformation using computing services such as HDInsight, Hadoop, Spark, etc.
- Publish
- Monitor
14. If you want to use the output of a query in downstream activities, which activity should you use?
The Lookup activity can return the result of executing a query or a stored procedure.
The output can be a singleton value or an array of attributes, which can be consumed in subsequent copy data activity, or any transformation or control flow activity like ForEach activity.
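As a rough illustration with hypothetical names, a Lookup activity that runs a query, and the expressions a downstream activity would use to read its result, might look like this (sketched as a Python dict mirroring the activity JSON):

```python
# Sketch of a Lookup activity that executes a query; names and the query are placeholders.
lookup_activity = {
    "name": "LookupWatermark",
    "type": "Lookup",
    "typeProperties": {
        "source": {
            "type": "AzureSqlSource",
            "sqlReaderQuery": "SELECT MAX(ModifiedDate) AS watermark FROM dbo.Sales"
        },
        "dataset": {"referenceName": "ds_sql_config", "type": "DatasetReference"},
        "firstRowOnly": True      # set to False to return an array of rows instead of a single row
    }
}

# A later activity can reference the single value with:
#   @activity('LookupWatermark').output.firstRow.watermark
# or, when firstRowOnly is False, loop over the rows with a ForEach whose items are:
#   @activity('LookupWatermark').output.value
```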
15. Have you used the Databricks Notebook activity in Azure Data Factory? How do you pass parameters to a notebook activity?
We can use the Databricks Notebook activity to run a notebook on our Databricks cluster from an ADF pipeline. We can pass parameters to the notebook using the baseParameters property.
If a parameter is not defined or specified in the activity, the default value from the notebook is used.
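A hedged sketch of what that looks like in the activity JSON (notebook path, parameter names, and the linked service are placeholders), again written as a Python dict:

```python
# Sketch of a Databricks Notebook activity with baseParameters; all names are placeholders.
notebook_activity = {
    "name": "RunTransformNotebook",
    "type": "DatabricksNotebook",
    "linkedServiceName": {
        "referenceName": "ls_databricks",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "notebookPath": "/Shared/transform_sales",
        "baseParameters": {
            # Values can be literals or pipeline expressions.
            "run_date": "@formatDateTime(utcNow(), 'yyyy-MM-dd')",
            "env": "dev"
        }
    }
}
# Inside the notebook, dbutils.widgets.get('run_date') reads the passed value;
# if the activity omits a parameter, the widget's default defined in the notebook is used.
```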
16. What are some useful constructs available in the data factory?
- @parameter: Each activity in the pipeline can consume a parameter value passed to the pipeline with the @parameter construct.
- @coalesce: We can use the @coalesce construct in expressions to handle null values gracefully.
- @activity: The output of a previous activity can be consumed in a subsequent activity with the @activity construct.
17. Is it possible to push code and have CI/CD in ADF?
Data Factory offers full support for CI/CD of your data pipelines using Azure DevOps and GitHub. This allows you to develop and deliver your ETL processes incrementally before publishing the finished product.
After the raw data has been refined into a business-ready, consumable form, you can load it into Azure Synapse Analytics (formerly Azure SQL Data Warehouse), Azure SQL Database, Azure Data Lake, Azure Cosmos DB, or whichever analytics engine your business users can point to from their business intelligence tools.
18. What do you mean by variables in Azure Data Factory?
Variables in an Azure Data Factory pipeline hold values during a pipeline run. They are used for the same reason we use variables in any programming language, are available inside the scope of the pipeline, and can be set and modified with the Set Variable and Append Variable activities. There are two types of variables (a small sketch follows this list):
- System variables: read-only values exposed by ADF itself, such as the pipeline name, pipeline run ID, and trigger name/time.
- User variables: variables declared on the pipeline and manipulated by its logic.
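A small sketch of a user variable and a Set Variable activity in pipeline JSON, expressed as a Python dict (names are placeholders, and expression values are shown unwrapped for brevity):

```python
# Sketch of a pipeline declaring a user variable and setting it; names are placeholders.
pipeline = {
    "name": "pl_demo_variables",
    "properties": {
        "variables": {
            # User variables support the String, Boolean, and Array types.
            "runLabel": {"type": "String", "defaultValue": ""}
        },
        "activities": [
            {
                "name": "SetRunLabel",
                "type": "SetVariable",
                "typeProperties": {
                    "variableName": "runLabel",
                    # System values such as @pipeline().RunId or @pipeline().TriggerTime
                    # can be folded into the value of a user variable.
                    "value": "@concat('run-', pipeline().RunId)"
                }
            }
        ]
    }
}
# Later activities read the variable back with @variables('runLabel').
```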
19. What are mapping data flows?
Mapping data flows are visually designed data transformations in Azure Data Factory.
Data flows allow data engineers to develop data transformation logic without writing code. The resulting data flows are executed as activities within Azure Data Factory pipelines that use scaled-out Apache Spark clusters.
Data flow activities can be operationalized using existing Azure Data Factory scheduling, control flow, and monitoring capabilities.
Mapping data flows provides an entirely visual experience with no coding required.
Data flows run on ADF-managed execution clusters for scaled-out data processing. Azure Data Factory manages all the code translation, path optimization, and execution of the data flow jobs.
20. What is copy activity in Azure Data Factory?
Copy Activity is one of the most popular and widely used activities in Azure Data Factory.
It is used for ETL or lift-and-shift scenarios where you want to move data from one data store to another.
While copying the data, you can also perform transformations.
21. Can you elaborate more on copy activity?
At a high level, the Copy Activity performs the following steps (a minimal Copy activity definition is sketched after these steps):
- Read data from the source data store (e.g., Blob Storage).
- Perform the following tasks on the data:
  - Serialization/deserialization
  - Compression/decompression
  - Column mapping
- Write data to the destination data store or sink (e.g., Azure Data Lake).
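At the JSON level, a bare-bones Copy activity tying these steps together might look like the following sketch (dataset names, store types, and the column mapping are placeholders):

```python
# Sketch of a Copy activity reading delimited text from Blob Storage and writing Parquet to a lake.
# Dataset names are placeholders; the translator (explicit column mapping) is optional.
copy_activity = {
    "name": "CopyBlobToLake",
    "type": "Copy",
    "inputs":  [{"referenceName": "ds_blob_csv", "type": "DatasetReference"}],
    "outputs": [{"referenceName": "ds_lake_parquet", "type": "DatasetReference"}],
    "typeProperties": {
        "source": {"type": "DelimitedTextSource"},
        "sink":   {"type": "ParquetSink"},
        "translator": {
            "type": "TabularTranslator",
            "mappings": [
                {"source": {"name": "cust_id"}, "sink": {"name": "CustomerId"}}
            ]
        }
    }
}
```

Serialization/deserialization and compression are driven by the format settings on the source and sink datasets, while the translator handles the column mapping step.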
22. What are the different activities you performed in Azure Data Factory?
Here are some of the activities used in Azure Data Factory in real-world projects:
- Copy Data Activity to copy the data between datasets.
- ForEach Activity for looping.
- Get Metadata Activity which can provide metadata about any data source.
- Set Variable Activity to define and initiate variables within pipelines.
- Lookup Activity to do a lookup to get some values from a table/file.
- Wait Activity to wait for a specified amount of time before/in between the pipeline run.
- Validation Activity will validate the presence of files within the dataset.
- Web Activity to call a custom REST endpoint from an ADF pipeline.
23. How can I schedule the pipeline in Azure Data Factory?
You can use a tumbling window trigger or a schedule trigger to schedule a pipeline (a sample schedule trigger definition is sketched after the list below).
ADF supports three types of triggers:
- Tumbling window trigger: A trigger that operates on a periodic interval while retaining a state.
- Schedule Trigger: A trigger that invokes a pipeline on a wall-clock schedule.
- Event-Based Trigger: A trigger that responds to an event. e.g., a file getting placed inside a blob.
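For example, a schedule trigger that starts a pipeline once a day could be defined roughly as below (trigger name, pipeline name, parameter, and times are placeholders):

```python
# Sketch of a schedule trigger definition; names and times are placeholders.
schedule_trigger = {
    "name": "tr_daily_0600",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",            # Minute, Hour, Day, Week, or Month
                "interval": 1,
                "startTime": "2024-01-01T06:00:00Z",
                "timeZone": "UTC"
            }
        },
        "pipelines": [
            {
                "pipelineReference": {"referenceName": "pl_copy_sales", "type": "PipelineReference"},
                # Trigger-scoped values can be passed down as pipeline parameters.
                "parameters": {"loadDate": "@trigger().scheduledTime"}
            }
        ]
    }
}
```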
24. When should you choose Azure Data Factory?
One should consider using Data Factory:
- When working with big data and a data warehouse needs to be implemented, you might require a cloud-based integration solution like ADF to populate it.
- When not all team members are experienced in coding and some prefer graphical tools for working with data.
- When raw business data is spread across diverse data sources, both on-premises and in the cloud, and we want one solution like ADF to integrate it all in one place.
- When we want readily available data movement and processing capabilities while staying light on infrastructure management; a managed solution like ADF makes more sense in this case.
25. How can you access data using the other 90 dataset types in the data factory?
Use the Copy activity. The mapping data flow feature natively supports Azure SQL Database, Azure Synapse Analytics, delimited text files from an Azure storage account or Azure Data Lake Storage Gen2, and Parquet files from Blob Storage or Data Lake Storage Gen2 as source and sink. For the other connectors, use the Copy activity to stage the data first, and then run a Data Flow activity to transform it after it has been staged.
26. What is the difference between mapping and wrangling data flow?
- Mapping data flows transform data at scale without requiring coding. You can design a data transformation job in the data flow canvas by constructing a series of transformations. Start with any number of source transformations followed by data transformation steps. Complete your data flow with a sink to land your results in a destination. It is excellent at mapping and transforming data with known and unknown schemas in the sinks and sources.
- Power Query data wrangling allows you to do agile data preparation and exploration using the Power Query Online mashup editor at scale via Spark execution. With the rise of data lakes, sometimes you just need to explore a dataset or create a dataset in the lake.
27. Is it possible to calculate a value for a new column from an existing column in mapping data flows in ADF?
We can use the Derived Column transformation in a mapping data flow to generate a new column based on our desired logic.
- We can either create a new derived column or update an existing one.
- Enter the name of the column you’re creating in the Column textbox.
- You can use the column dropdown to override an existing column in your schema.
- Click the Enter expression textbox to start creating the derived column’s expression.
- You can type the expression directly or use the expression builder to construct your logic, for example an expression such as concat(firstName, ' ', lastName).
28. How is lookup activity useful in the data factory?
In an ADF pipeline, the Lookup activity is commonly used for configuration lookups, reading from whatever source dataset is available to it.
Moreover, it retrieves data from the source dataset and returns it as the output of the activity.
Generally, the output of the Lookup activity is used further along the pipeline for making decisions or for passing configuration values to subsequent activities.
29. What is the Get Metadata activity in the data factory?
The Get Metadata activity is used to retrieve the metadata of any data in the Azure Data Factory or a Synapse pipeline.
We can use the output from the Get Metadata activity in conditional expressions to perform validation or consume the metadata in subsequent activities.
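A rough sketch of a Get Metadata activity that lists the contents of a folder, and the expressions used to consume the result (the dataset name and field list are illustrative):

```python
# Sketch of a Get Metadata activity; the dataset name is a placeholder.
get_metadata_activity = {
    "name": "GetFolderMetadata",
    "type": "GetMetadata",
    "typeProperties": {
        "dataset": {"referenceName": "ds_landing_folder", "type": "DatasetReference"},
        # Commonly requested fields: childItems, exists, lastModified, itemName, size
        "fieldList": ["childItems", "exists", "lastModified"]
    }
}
# A ForEach activity can iterate over the folder contents with:
#   @activity('GetFolderMetadata').output.childItems
# and an If Condition can branch on:
#   @activity('GetFolderMetadata').output.exists
```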
30. How to debug the ADF pipeline?
Debugging is one of the crucial aspects of any coding-related activity and is needed to test the code for any issues it might have.
ADF provides a Debug option on the pipeline canvas that runs the pipeline interactively against the current draft, so we can test and troubleshoot it without publishing it or triggering the published version.
31. What does it mean by a breakpoint in the ADF pipeline?
A breakpoint in the ADF pipeline is used for debugging any error or failure of the pipeline or any activities.
For example, you are using three activities in the pipeline, and now you want to debug up to the second activity only. You can do this by placing the breakpoint at the second activity.
To add a breakpoint, you can click the circle present at the top of the activity.
32. What is the use of ADF service?
ADF (Azure Data Factory) is primarily used to orchestrate data copying between various relational and non-relational data sources, hosted either locally in data centers or in the cloud. Moreover, you can use the ADF service to transform the ingested data to fulfill business requirements.
In most Big Data solutions, ADF Service is used as an ETL or ELT tool for data ingestion.
33. Explain the data source in Azure Data Factory.
The data source in Azure Data Factory is the source or destination system that holds the data intended to be used or written by the pipeline.
34. Can you share any difficulties you faced while getting data from on-premises to Azure Data Factory?
One of the significant challenges we face while migrating from on-prem to the cloud is throughput and speed.
When we try to copy the data using Copy activity from on-prem, the speed of the process is relatively slow, and hence we don’t get the desired throughput.
35. How to copy multiple sheet data from an Excel file?
When we use an Excel connector within a data factory, we must provide a sheet name from which we have to load data.
This is manageable when we are dealing with one or a handful of sheets, but when there are many sheets (say 10+), it becomes tedious because the hard-coded sheet name has to be changed every time. A common workaround is to parameterize the sheet name in the Excel dataset and drive it from a ForEach activity over the list of sheet names (see the sketch below).
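One way that workaround can look, sketched with hypothetical names (expression values are shown unwrapped for brevity): the Excel dataset exposes a sheetName parameter, and a ForEach loops over a sheet-name array passed to the pipeline.

```python
# Sketch: parameterized Excel dataset plus a ForEach that copies one sheet per iteration.
# Dataset names, container, file name, and sheet names are placeholders.
excel_dataset = {
    "name": "ds_excel_sheet",
    "properties": {
        "type": "Excel",
        "parameters": {"sheetName": {"type": "string"}},
        "typeProperties": {
            "sheetName": "@dataset().sheetName",       # bound at run time
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "raw",
                "fileName": "sales.xlsx"
            }
        }
    }
}

foreach_sheets = {
    "name": "ForEachSheet",
    "type": "ForEach",
    "typeProperties": {
        # Assumes the pipeline has an array parameter such as ["Jan", "Feb", "Mar"].
        "items": "@pipeline().parameters.sheetNames",
        "activities": [
            {
                "name": "CopyOneSheet",
                "type": "Copy",
                "inputs": [{
                    "referenceName": "ds_excel_sheet",
                    "type": "DatasetReference",
                    "parameters": {"sheetName": "@item()"}
                }],
                "outputs": [{"referenceName": "ds_lake_parquet", "type": "DatasetReference"}],
                "typeProperties": {"source": {"type": "ExcelSource"}, "sink": {"type": "ParquetSink"}}
            }
        ]
    }
}
```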
36. Is it possible to have nested looping in Azure Data Factory?
There is no direct support for nested looping in the data factory for any looping activity (for each/until).
However, a ForEach or Until activity can contain an Execute Pipeline activity, and the child pipeline it invokes can contain its own loop activity, which achieves the effect of nesting (sketched below).
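A skeletal sketch of that workaround with hypothetical names, where the outer pipeline's ForEach calls a child pipeline that holds the inner loop:

```python
# Sketch of the nested-loop workaround: outer ForEach -> Execute Pipeline -> inner ForEach.
# Pipeline, parameter, and activity names are placeholders.
outer_foreach = {
    "name": "ForEachOuter",
    "type": "ForEach",
    "typeProperties": {
        "items": "@pipeline().parameters.outerItems",
        "activities": [
            {
                "name": "RunChildPipeline",
                "type": "ExecutePipeline",
                "typeProperties": {
                    "pipeline": {"referenceName": "pl_child_loop", "type": "PipelineReference"},
                    "waitOnCompletion": True,
                    # Pass the current outer item down to the child pipeline.
                    "parameters": {"outerItem": "@item()"}
                }
            }
        ]
    }
}
# pl_child_loop declares an 'outerItem' parameter and contains its own ForEach,
# which gives the effect of a nested loop.
```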
37. How to copy multiple tables from one data store to another data store?
An efficient approach to copy multiple tables from one data store to another data store would be:
- Maintain a lookup table or file containing the list of tables to be copied along with their source details.
- Use a Lookup activity to read that list and a ForEach activity to iterate through it.
- Inside the ForEach activity, use a Copy activity or a mapping data flow to copy each table to the destination data store (see the sketch after these steps).
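Pulling those steps together, a sketch of the ForEach body (all names hypothetical; the Lookup is assumed to return rows shaped like `{"schema": "dbo", "table": "Sales"}`, and expression values are shown unwrapped for brevity):

```python
# Sketch: iterate over a table list returned by a Lookup and copy each table.
# Dataset names are placeholders; the output dataset is assumed to expose a folderName parameter.
foreach_tables = {
    "name": "ForEachTable",
    "type": "ForEach",
    "typeProperties": {
        "items": "@activity('LookupTableList').output.value",
        "activities": [
            {
                "name": "CopyOneTable",
                "type": "Copy",
                "inputs":  [{"referenceName": "ds_sql_source", "type": "DatasetReference"}],
                "outputs": [{
                    "referenceName": "ds_lake_parquet",
                    "type": "DatasetReference",
                    # Parameterized output path: one folder per table.
                    "parameters": {"folderName": "@item().table"}
                }],
                "typeProperties": {
                    "source": {
                        "type": "AzureSqlSource",
                        "sqlReaderQuery": "@concat('SELECT * FROM ', item().schema, '.', item().table)"
                    },
                    "sink": {"type": "ParquetSink"}
                }
            }
        ]
    }
}
```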
38. What are some limitations of ADF?
Azure Data Factory provides great functionalities for data movement and transformations.
However, there are some limitations of ADF:
- We can’t nest looping activities in the data factory, and the restriction spans the control-flow activities (If Condition, ForEach, Switch, Until), so we must use a workaround such as an Execute Pipeline activity if our pipeline requires that structure.
- The Lookup activity can retrieve at most 5,000 rows at a time. To work with larger result sets, we need to combine it with a loop activity and source queries that page through the data (e.g., with an OFFSET/FETCH or LIMIT clause).
- A single pipeline can contain a maximum of 40 activities, including inner activities and containers. To stay within this limit, we should modularize pipelines, for example by splitting logic across child pipelines and keeping the number of datasets and activities in each one small.
39. How do you send email notifications on pipeline failure?
There are multiple ways to send email notifications on pipeline failure:
- Using Logic Apps invoked from a Web or Webhook activity (a sketch follows this list).
- Using Alerts and Metrics from pipeline options.
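For the Logic Apps route, the failure path of an activity can be wired (via an "On failure" dependency) to a Web activity that posts to the Logic App's HTTP trigger, which then sends the email. A hedged sketch, with a placeholder URL and hypothetical activity names:

```python
# Sketch of a Web activity that calls a Logic App HTTP trigger when a prior activity fails.
# The URL is a placeholder; the Logic App is assumed to parse this body and send the email.
notify_failure = {
    "name": "NotifyOnFailure",
    "type": "WebActivity",
    "dependsOn": [
        # Runs only when the main activity fails.
        {"activity": "CopyBlobToLake", "dependencyConditions": ["Failed"]}
    ],
    "typeProperties": {
        "url": "https://<logic-app-http-trigger-url>",
        "method": "POST",
        "headers": {"Content-Type": "application/json"},
        "body": {
            "pipelineName": "@{pipeline().Pipeline}",
            "runId": "@{pipeline().RunId}",
            "errorMessage": "@{activity('CopyBlobToLake').error.message}"
        }
    }
}
```

The Alerts and Metrics route needs no pipeline changes: an alert rule on the factory's failed pipeline runs metric can notify an action group that includes the email recipients.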
40. What is Azure SQL database? How to integrate with the data factory?
Azure SQL Database is an always up-to-date, fully managed relational database service built for the cloud for storing data.
We can easily design data pipelines to read and write to SQL DB using the Azure data factory.
41. Can you host an SQL server instance on Azure?
Yes. Azure SQL Managed Instance is an intelligent, scalable cloud database service that combines the broadest SQL Server engine compatibility with the benefits of a fully managed platform as a service, so an existing SQL Server instance can be rehosted on Azure with minimal changes.
42. What is Azure Data Lake Analytics?
Azure Data Lake Analytics is an on-demand analytics job service that simplifies storing and processing big data.