Connect Azure Data Factory to On-Prem Data Sources
Introduction
Azure Data Factory (ADF) is a powerful cloud-based data integration service that allows you to create, schedule, and orchestrate data workflows. In this blog, I’ll demonstrate how I extended an existing Azure Data Engineering project (originally created by Summit Mittal on YouTube) by adding functionality to connect ADF to an on-premises SQL Server database. This step-by-step guide covers the process of configuring the connection, ensuring a seamless data flow between the cloud and on-prem systems.
Why On-Premises SQL Server?
While the original project used Azure SQL Database, I chose an on-premises SQL Server for one key reason: I was on the Azure free account, whose credits last only one month. When I checked my subscription costs, I realized that the two Azure SQL databases required by the original project were consuming a significant portion of my budget, even though they were configured with the lowest service tier. With the free period about to expire and a pay-as-you-go subscription on the horizon, I wanted to avoid additional costs. By creating the required databases locally on my machine, I could keep experimenting with the project while keeping costs low.
Link to the original Project → https://www.youtube.com/watch?v=d3Vw3VtKDnc&t=17676s
Prerequisites
Before you begin, ensure you have the following:
A working Azure subscription.
Basic knowledge of Azure Data Factory and its components.
SQL Server installed locally on your machine.
CSV files for importing data into SQL Server (these can be downloaded from the project repository).
The Microsoft Integration Runtime installer and the 64-bit Java Runtime Environment (JRE) downloaded.
Steps to Connect Azure Data Factory to On-Premises Data Sources
Step 1: Import Data from CSV Files to SQL Server
- Create a New Database: Set up a local SQL Server database on your machine. In my case there are two databases, as required by the original project.
- Import CSV Files: Use SQL Server Management Studio (SSMS) to import the CSV files into tables. During the import, ensure the schema and data types are mapped correctly. The SQL Server Import and Export Wizard can be used for this; it is accessible by right-clicking the database and selecting Tasks.
- Verify the Data: Run a few SELECT queries to confirm the data was imported correctly, as sketched below.
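If you prefer to sanity-check the import from code rather than clicking through SSMS, a quick row-count script does the job. The snippet below is only a sketch: the instance name, database names, and table names are placeholders for whatever you created.

```python
# Minimal sketch: verify row counts of the imported tables with pyodbc.
# Instance, database, and table names are placeholders -- adjust to your setup.
import pyodbc

SERVER = r"localhost\SQLEXPRESS"                       # assumed local instance
DATABASES = {
    "source_db_1": ["dbo.customers", "dbo.orders"],    # hypothetical tables
    "source_db_2": ["dbo.products"],
}

for db, tables in DATABASES.items():
    conn = pyodbc.connect(
        f"DRIVER={{ODBC Driver 17 for SQL Server}};"
        f"SERVER={SERVER};DATABASE={db};Trusted_Connection=yes;"
    )
    cursor = conn.cursor()
    for table in tables:
        cursor.execute(f"SELECT COUNT(*) FROM {table}")
        print(f"{db}.{table}: {cursor.fetchone()[0]} rows")
    conn.close()
```

If the counts match the number of rows in your CSV files, the import is in good shape.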
Step 2: Set Up Self-Hosted Integration Runtime
Download and Install Integration Runtime:
Download the Microsoft Integration Runtime installer and run it.
Choose the manual setup option when creating the self-hosted integration runtime while configuring the linked service.
Note down the key (either of the two keys shown will work) and keep it safe.
This key will be used to register the integration runtime on your local machine.
Install Java Runtime Environment (JRE):
Download and install the 64-bit version of JRE.
Configure the JAVA_HOME environment variable to point to the JRE installation directory.
Add the bin directory of JRE to the system PATH variable.
Verify the installation by running the java -version command in a command prompt.
Installing Java 8 or Java 11 is recommended.
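Since a wrong Java setup caused me trouble later (see Step 4), a quick script that confirms JAVA_HOME is set and that the java on the PATH is 64-bit can save some debugging. This is just an illustrative check, not part of the official setup.

```python
# Sketch: confirm JAVA_HOME is set and that the java on PATH is 64-bit.
import os
import subprocess

java_home = os.environ.get("JAVA_HOME")
print("JAVA_HOME:", java_home or "NOT SET")

# `java -version` writes its output to stderr.
result = subprocess.run(["java", "-version"], capture_output=True, text=True)
version_info = result.stderr
print(version_info)

if "64-Bit" in version_info:
    print("64-bit JVM detected.")
else:
    print("Warning: this does not look like a 64-bit JVM.")
```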
Register the Key and verify the connectivity:
Open the Integration Runtime Configuration Manager on your local machine and paste the key to register it.
In the Configuration Manager, use the Diagnostics tab to test the connection to your local SQL Server. Enter the server name and database credentials. A green tick indicates a successful setup.
Step 3: Continue with the Linked Service and Dataset in Azure Data Factory
Complete the linked service setup in ADF:
Specify the server name, database name, and authentication details. NOTE: Azure Key Vault should always be used for project secrets; that is the recommended approach. However, because of time constraints and since this is for demo purposes, I used a plain username and password.
Use parameters within the linked service. This is required in order to follow the metadata-driven approach (a sketch of the parameterized definition is shown below).
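For reference, here is roughly what a parameterized on-premises SQL Server linked service looks like, written as a Python dict that mirrors the JSON the ADF UI generates behind the scenes. The names (OnPremSqlServerLS, SelfHostedIR, the DatabaseName parameter) and the connection details are placeholders from my own setup, not values from the original project, and in a real deployment the password should come from Azure Key Vault.

```python
import json

# Sketch of a parameterized SQL Server linked service (all names are placeholders).
# The DatabaseName parameter lets one linked service reach both local databases.
linked_service = {
    "name": "OnPremSqlServerLS",
    "properties": {
        "type": "SqlServer",
        "parameters": {
            "DatabaseName": {"type": "String"}
        },
        "connectVia": {  # route traffic through the self-hosted integration runtime
            "referenceName": "SelfHostedIR",
            "type": "IntegrationRuntimeReference",
        },
        "typeProperties": {
            "connectionString": (
                "Integrated Security=False;"
                r"Data Source=MY-MACHINE\SQLEXPRESS;"   # placeholder server name
                "Initial Catalog=@{linkedService().DatabaseName};"
                "User ID=adf_demo_user"                 # placeholder login
            ),
            # For a demo only; prefer an Azure Key Vault secret reference here.
            "password": {"type": "SecureString", "value": "<your-password>"},
        },
    },
}

print(json.dumps(linked_service, indent=2))
```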
Complete the dataset setup in ADF:
- Parameterize the database name, table and schema.
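The matching dataset can be parameterized the same way; again, this is an illustrative Python dict mirroring the ADF JSON, with placeholder names. The dataset parameters are forwarded to the linked service so a single dataset can point at any database, schema, and table.

```python
import json

# Sketch of a parameterized SQL Server table dataset (names are placeholders).
dataset = {
    "name": "OnPremSqlTableDS",
    "properties": {
        "type": "SqlServerTable",
        "linkedServiceName": {
            "referenceName": "OnPremSqlServerLS",
            "type": "LinkedServiceReference",
            # Forward the database name down to the linked service parameter.
            "parameters": {"DatabaseName": "@dataset().DatabaseName"},
        },
        "parameters": {
            "DatabaseName": {"type": "String"},
            "SchemaName": {"type": "String"},
            "TableName": {"type": "String"},
        },
        "typeProperties": {
            "schema": "@dataset().SchemaName",
            "table": "@dataset().TableName",
        },
    },
}

print(json.dumps(dataset, indent=2))
```

The copy activity (or a ForEach loop driven by a metadata table) then supplies concrete values for DatabaseName, SchemaName, and TableName at runtime, which is what makes the metadata-driven approach work.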
Step 4: Test and Debug
Run the Pipeline:
- Debug the pipeline to verify that data is successfully copied from the on-prem SQL Server to the destination. In our case, this is a container in the Azure Data Lake Storage Gen2 account. The files are copied in Parquet format.
Verify the results in Azure Monitor:
- Check the status of the pipeline. I had to re-run the pipeline multiple times because of integration runtime issues, mostly caused by an incompatible installation of the integration runtime and the JRE. Initially I used the express setup, which somehow ended up installing the 32-bit version of the runtime.
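I did all of this from the ADF Studio UI, but the run-and-check loop can also be scripted, which is handy when you end up re-running a pipeline repeatedly. Below is a sketch using the azure-mgmt-datafactory SDK; the subscription, resource group, factory, and pipeline names are placeholders, and note that this triggers a published run rather than a Debug run.

```python
# Sketch: trigger a pipeline run and poll its status with the ADF management SDK.
# All names/IDs below are placeholders for your own environment.
import time
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "adf-demo-rg"              # placeholder
FACTORY_NAME = "adf-demo-factory"           # placeholder
PIPELINE_NAME = "pl_copy_onprem_to_adls"    # placeholder

client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Kick off a (published) pipeline run.
run = client.pipelines.create_run(
    RESOURCE_GROUP, FACTORY_NAME, PIPELINE_NAME, parameters={}
)

# Poll until the run finishes.
while True:
    status = client.pipeline_runs.get(RESOURCE_GROUP, FACTORY_NAME, run.run_id)
    print("Pipeline status:", status.status)
    if status.status not in ("Queued", "InProgress"):
        break
    time.sleep(30)
```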
Verify the ADLS Gen 2 container:
- Check the destination directory and the last-modified date of each file. If all went well, the pipeline places the existing files in the archive folder and writes new Parquet files to the destination folder.
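The same check can be done from code with the azure-storage-file-datalake package, listing each file and its last-modified timestamp. The account, container, and folder names below are placeholders for your own environment.

```python
# Sketch: list files and last-modified times in the ADLS Gen2 destination folder.
# Account, container, and folder names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

ACCOUNT_URL = "https://<storage-account>.dfs.core.windows.net"
CONTAINER = "landing"            # placeholder container name
DESTINATION_FOLDER = "parquet"   # placeholder destination directory

service = DataLakeServiceClient(account_url=ACCOUNT_URL,
                                credential=DefaultAzureCredential())
file_system = service.get_file_system_client(CONTAINER)

# get_paths walks the folder recursively; each entry carries a last_modified timestamp.
for path in file_system.get_paths(path=DESTINATION_FOLDER):
    if not path.is_directory:
        print(f"{path.name}  (last modified: {path.last_modified})")
```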
Conclusion
Extending this Azure Data Engineering project has been an incredible learning journey. By integrating Azure Data Factory with an on-premises SQL Server database, I was able to tackle real-world challenges like cost optimization and secure data connectivity. This hands-on experience not only deepened my understanding of Azure services but also reinforced the importance of adopting a metadata-driven approach for scalable and efficient data workflows.
I’d like to take this opportunity to extend my heartfelt thanks to Summit Mittal for creating such an excellent and comprehensive end-to-end data engineering project. His tutorial provided a solid foundation and inspired me to build upon it. Projects like these are invaluable for aspiring and experienced data engineers alike, and they truly showcase the potential of modern data engineering platforms.
Thank you for following along! I hope this blog inspires you to explore new ways of working with Azure Data Factory and data integration tools. If you have any questions or suggestions, feel free to share them in the comments below. Happy learning!