Streamline data operations: Integrate Database Projects with Data Factory for enhanced efficiency

In a previous blog, we learned how to streamline your SQL development with Database Projects and Git.

In this blog post, we embark on a journey to optimize data operations by seamlessly integrating Database Projects with Azure Data Factory. Through a practical exploration, we'll demonstrate how this integration streamlines processes, from data ingestion to deployment, fostering enhanced efficiency and reliability.

First things first

In this blog post, we are going to show a practical use case of using database projects for SQL together with Azure Data Factory. The use case is the following:

We already have the data for the drivers, circuits, races and results. We would also like to have the qualifying times for every race and driver, and this data needs to be transformed so it can be linked to the other tables (a sketch of such a join follows the list below). We can add this data with the following steps:

  • Data is fetched from a CSV file located on the Plainsight Pro GitHub

  • A new table to hold the qualifying times is created in the database project

  • The data is copied to the SQL database on Azure using the Copy data activity

  • A new view is created that shows the cleaned data

  • The repository is updated with the changes
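
To give an idea of where we are heading: once the qualifying data is cleaned, it can be joined to the existing tables. The sketch below is illustrative only; it assumes the existing races and drivers tables expose raceId and driverId key columns, and it uses the stg_qualifying view we will create later in this post:

-- Illustrative only: linking cleaned qualifying times to the existing race and driver data
SELECT r.[raceId],
       d.[driverId],
       q.[position],
       q.[q1], q.[q2], q.[q3]
FROM [dbo].[stg_qualifying] AS q
JOIN [dbo].[races]   AS r ON r.[raceId]   = q.[raceId]
JOIN [dbo].[drivers] AS d ON d.[driverId] = q.[driverId];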

Make sure the following prerequisites are met before following along:

  • An Azure SQL database and an Azure Data Factory instance

  • Azure Data Studio with the SQL Database Projects extension installed

  • A Git repository connected to your database project (see the previous blog)

Set up connections

First, we will create connections to GitHub and to the database. In Data Factory, go to the Manage tab and, under Connections, choose Linked services. Click New and choose the HTTP option.

Give the linked service a name, fill in the GitHub base URL (https://raw.githubusercontent.com/ in this case) and choose Anonymous authentication. Then create another linked service and choose Azure SQL Database.

Give this linked service a name and choose the connection string option. Next, select the right subscription, server name and database name. Use System Assigned Managed Identity as the authentication type and click Create. Also, write down the managed identity name, as this will be needed later.

Create a database project

Go to the Database Projects tab in Azure Data Studio and click Create new. Choose Azure SQL Database and give the project a name. The database project has now been created and can be viewed in the Database Projects tab.

Head over to Azure Data Factory

Now our linked services can be used in Data Factory.

To use these linked services, we will create a dataset for both. First, we will create the dataset for GitHub. Go to the Author tab, click on the + icon and select New dataset.

Choose the HTTP option, name it and click continue. Select the DelimitedText format, select the newly created HTTP linked service and click OK. Now that our dataset has been created, we create a parameter called RelativeUrl and leave the value blank. In the Relative URL field in connection, choose Add dynamic content and choose the RelativeUrl parameter.

The first dataset is now ready. Let's now create the dataset for the database. Create a new dataset and select Azure SQL Database. Select the right Linked Service.

You might now see that Data Factory gives an error when it tries to fetch the table names. Don't worry: this is because we have not yet given the Data Factory managed identity access to our database. We need to create a user for Data Factory in the database. To do this, go back to the Connections tab in Azure Data Studio, right-click on the database and click New Query. Now run the following query, replacing DF_instance_name with the managed identity name you wrote down earlier:

-- Create a contained user for the Data Factory managed identity and make it a database owner
CREATE USER [DF_instance_name] FROM EXTERNAL PROVIDER;
ALTER ROLE db_owner ADD MEMBER [DF_instance_name];

This creates a new user for our external Data Factory instance and gives it the database owner role (this is needed to be able to create new tables).

Go back to Data Factory. The names of the tables should pop up now. Don't select a table; just click OK. Next, create the parameters schema and table and leave both blank. Go back to the Connection tab and click Add dynamic content for the schema field. Choose the schema parameter we just created and click OK. Do the same for the table parameter.
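
Tip: if the table names still do not show up, you can check on the database side whether the Data Factory user was actually created, for example with a query like this:

-- The Data Factory managed identity should now show up as an external user
SELECT name, type_desc, authentication_type_desc
FROM sys.database_principals
WHERE name = 'DF_instance_name';  -- replace with your managed identity name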

Now that the linked services and datasets are created, we can create a pipeline to copy the data from GitHub to our database. First, we will create a new table in our database project to map our CSV to. Go to the Database Projects tab in Azure Data Studio. Right-click on Tables and choose Add Table. This will generate a CREATE TABLE SQL file.

Paste the following code in the file:

CREATE TABLE [dbo].[qualifying] (
    [qualifyId]     INT NULL,
    [raceId]        INT NULL,
    [driverId]      INT NULL,
    [constructorId] INT NULL,
    [number]        INT NULL,
    [position]      INT NULL,
    [q1]            NVARCHAR (10) NULL,
    [q2]            NVARCHAR (10) NULL,
    [q3]            NVARCHAR (10) NULL
)

Save the changes and run them on the database to apply them, as the database project is only stored locally and nothing is executed on the database until we run it.
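
To double-check that the table actually landed in the database, you can run a quick query against the information schema:

-- The qualifying table should now exist in the dbo schema
SELECT TABLE_SCHEMA, TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'dbo'
  AND TABLE_NAME = 'qualifying';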

Let's now create our pipeline. Go back to the Author tab in Data Factory, click on the plus icon again, select Pipeline and give it a name.

Next, choose the Copy data activity (under Move and transform) and drag it onto the canvas. This activity will copy the data to the SQL database. Give the activity a name and go to the Source tab. In Source dataset, choose the GitHub dataset. In the RelativeUrl parameter, enter:

PlainsightPro/blogs-sql-server-dev-db-projects-and-cicd/main/Streamlining-data-operations-integrating-Database-Projects-with-Data-Factory-for-enhanced-efficiency/qualifying.csv

Go to the Sink tab and choose the SQL dataset. Enter dbo in the schema parameter and qualifying in the table parameter. For write behavior, choose Upsert. This is a combination of update and insert: if the key is not yet in the table, the row is inserted; otherwise it is updated. Under Key columns, add the qualifyId column.
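
Under the hood, this behaves much like a keyed T-SQL MERGE. The sketch below is purely illustrative; the #incoming temp table stands in for the batch of rows Data Factory stages during the copy:

-- Illustration only: roughly what an upsert keyed on qualifyId boils down to.
-- #incoming stands in for the rows Data Factory stages for the copy.
SELECT TOP (0) * INTO #incoming FROM [dbo].[qualifying];

MERGE [dbo].[qualifying] AS tgt
USING #incoming AS src
    ON tgt.[qualifyId] = src.[qualifyId]
WHEN MATCHED THEN
    UPDATE SET tgt.[raceId]        = src.[raceId],
               tgt.[driverId]      = src.[driverId],
               tgt.[constructorId] = src.[constructorId],
               tgt.[number]        = src.[number],
               tgt.[position]      = src.[position],
               tgt.[q1]            = src.[q1],
               tgt.[q2]            = src.[q2],
               tgt.[q3]            = src.[q3]
WHEN NOT MATCHED THEN
    INSERT ([qualifyId], [raceId], [driverId], [constructorId], [number], [position], [q1], [q2], [q3])
    VALUES (src.[qualifyId], src.[raceId], src.[driverId], src.[constructorId], src.[number], src.[position], src.[q1], src.[q2], src.[q3]);

DROP TABLE #incoming;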

Next, go to Mapping and click Import schemas. This will map the columns of the CSV file to the right columns in the database. Verify that the mapping is correct.

Now our pipeline is ready. Click the Debug button to test it. After it finishes successfully, we will add a trigger. Since Formula 1 qualifying takes place on either Friday or Saturday, we will trigger the pipeline every Sunday. Click Add trigger and choose New/Edit.

Create a new trigger that is scheduled every Sunday starting today.

The new qualifying times will now be added every Sunday and inserted into the database.
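
After a scheduled run, you can peek at the most recently loaded rows, for example with the query below (it assumes that a higher qualifyId means a newer entry in the source file):

-- Show the most recently loaded qualifying rows (assumes higher qualifyId = newer)
SELECT TOP (20) [qualifyId], [raceId], [driverId], [position], [q1], [q2], [q3]
FROM [dbo].[qualifying]
ORDER BY [qualifyId] DESC;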

Now to make sure the data is in the correct format, we will clean this data and put it into a view. Go back to the database project and add a folder under the dbo folder called Views. Right click on the folder, click create a view and name it. Enter the following code:

CREATE VIEW [dbo].[stg_qualifying]
  AS SELECT [qualifyId]
      ,[raceId]
      ,[driverId]
      ,[constructorId]
      ,[number]
      ,[position]
      ,TRY_CONVERT(time,'00:'+q1) AS q1
      ,TRY_CONVERT(time,'00:'+q2) AS q2
      ,TRY_CONVERT(time,'00:'+q3) AS q3
  FROM [dbo].[qualifying]

This staging view cleans the qualifying times and converts them to the time data type. It can then be used to create a fact table. Save the view and run it on the database. Now, in the Source Control tab, we can commit the changes and click Sync Changes.
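
To verify the conversion, you can compare a few of the raw lap time strings with the converted values, for example with a query like this:

-- Compare raw lap time strings with the values converted by the view's expression
SELECT TOP (10)
       [q1] AS raw_q1,
       TRY_CONVERT(time, '00:' + [q1]) AS q1_as_time
FROM [dbo].[qualifying]
WHERE [q1] IS NOT NULL;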

Stay tuned!

And there we have it: our database project has now been updated. The true power of using database projects together with Data Factory comes in the next blog, where we will automatically deploy this to production after our code has been tested on our development database. This makes sure that when we deploy a new Data Factory version to production, our SQL database is also updated to match the changes in Data Factory, without having to make manual changes to the production database.

