Event-based data integration with Azure Data Factory

The Discussion

  • JohnN

    Thank you for this demo, and to the ADF team for adding this much-needed functionality.

    But even your demo shows how slow Data Factory can be: at 7:32 in the video the stats show that 355 bytes took 44 seconds to transfer!

    Is there any way (as Azure admins) we can see a more granular breakdown of what those 44 seconds were spent doing? How can we speed this up for real-world data loads?

  • gauravmalhot

    Yes, you can see the granular breakdown using our visual tools by following these steps (a programmatic version of the same lookup is sketched below):

    a. Click the 'Monitor' tab on the left bar

    b. Identify the pipeline run for the copy operation

    c. Click the 'View Activity Runs' icon under the pipeline run

    d. Click the 'Details' icon under the 'Actions' column for your activity run

    e. You can see how much time was spent in the queue, the actual transfer time, the throughput, the overall duration, and so on.

    The time taken is not linear with the amount of data being transferred. You can transfer data from Blob storage to SQL DW, which was the case I was showcasing, at a throughput of 1 GB/s, meaning you can transfer 1 TB of data in roughly 20-25 minutes. Please give it a try and let us know if you see any perf issues.
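
    For anyone who wants to pull the same breakdown programmatically instead of through the Monitor UI, here is a minimal sketch using the Azure Data Factory Python SDK (azure-mgmt-datafactory with azure-identity). The subscription, resource group, factory name, and pipeline run ID are placeholders, and the exact keys inside each activity run's output can vary by copy activity version, so treat the field names as assumptions to verify against your own runs.

    # Sketch: query activity-run details (queue time, transfer duration, throughput)
    # for a given pipeline run. Assumes azure-identity and azure-mgmt-datafactory.
    from datetime import datetime, timedelta

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import RunFilterParameters

    # Placeholders -- replace with your own values (the run ID is shown in the Monitor tab).
    subscription_id = "<subscription-id>"
    resource_group = "<resource-group>"
    factory_name = "<data-factory-name>"
    pipeline_run_id = "<pipeline-run-id>"

    client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

    # Look at activity runs updated within the last day.
    filters = RunFilterParameters(
        last_updated_after=datetime.utcnow() - timedelta(days=1),
        last_updated_before=datetime.utcnow(),
    )

    runs = client.activity_runs.query_by_pipeline_run(
        resource_group, factory_name, pipeline_run_id, filters
    )

    for run in runs.value:
        print(run.activity_name, run.status, run.duration_in_ms, "ms")
        # For a Copy activity, run.output typically carries fields such as
        # 'queuingDuration', 'copyDuration', and 'throughput' (names may vary).
        if run.output:
            print("  output:", run.output)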

  • aljj

    Another cool feature added.

    Questions:

    1) Is there a way to trigger an activity based on more than one event? Say I want to start a Databricks notebook only after 2 files have arrived.

    2) I'd like to trigger when a Parquet table/folder is completed, that is, when a _committed_<guid> file is created in the folder. Is there a way to specify wildcards? Or how else can I do that?

    Thanks and keep adding features,

    a.

  • Cesar Hernandez

    1. Today we do not have a batching mechanism for events that would enable that scenario.

    2. It is not possible to specify wildcards directly in the filters.

    However, for 2, if the container and folder names are well known and only this type of file will be created there, then just omit the file name in the 'starts with' field, like so: '/containername/blobs/folderpath/'.

    If various types of files will be uploaded to the same folder but you only want the trigger to fire for a particular kind of file, you have two options.
    The first is to simply include the part of the file name that is known, like so: '/containername/blobs/folderpath/_committed_'.
    The second option is to use a combination of 'starts with' and 'ends with': for example, 'starts with' could be '/containername/blobs/folderpath/' and 'ends with' could be '.csv'. This last approach would only fire the trigger for .csv files in the specified location (an SDK sketch of these filters follows below).
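
    For reference, the same 'starts with' / 'ends with' filters can also be set up programmatically. Below is a minimal sketch using the Azure Data Factory Python SDK (azure-mgmt-datafactory). The subscription, resource group, factory, storage account, pipeline, and trigger names are placeholders, and exact model signatures can differ slightly between SDK versions, so treat this as an illustration rather than a drop-in script.

    # Sketch: a storage-event trigger that fires only for blobs under
    # '/containername/blobs/folderpath/' that end with '.csv'.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        BlobEventsTrigger,
        PipelineReference,
        TriggerPipelineReference,
        TriggerResource,
    )

    # Placeholders -- replace with your own values.
    subscription_id = "<subscription-id>"
    resource_group = "<resource-group>"
    factory_name = "<data-factory-name>"
    storage_account_id = (
        "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
        "/providers/Microsoft.Storage/storageAccounts/<storage-account>"
    )

    client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

    trigger = BlobEventsTrigger(
        events=["Microsoft.Storage.BlobCreated"],
        blob_path_begins_with="/containername/blobs/folderpath/",  # the 'starts with' filter
        blob_path_ends_with=".csv",                                # the 'ends with' filter
        scope=storage_account_id,
        pipelines=[
            TriggerPipelineReference(
                pipeline_reference=PipelineReference(
                    type="PipelineReference", reference_name="<pipeline-name>"
                )
            )
        ],
    )

    client.triggers.create_or_update(
        resource_group, factory_name, "<trigger-name>", TriggerResource(properties=trigger)
    )
    # The trigger still has to be started (from the UI or the SDK) before it will fire.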

  • MADHU

    Does the Azure Data Factory event trigger type support only Blob storage, or does it also support Data Lake Store Gen1?

    If it also supports Data Lake Store files, please provide the steps.

    Can anyone provide this information?

    Thanks in advance.

  • robcaron

    ADLS Gen1 does not emit events today. Once it begins to emit events, Data Factory plans to support it as a source.
  • Linus

    Integration based on file presence is an anti-pattern from the early 2000s. I really don't get why the first user story developed for event-based ADF triggers was to base it on the presence of files in a blob. Surely that's the very last use case in the vast list of different use cases for events and ADF?

    Events aren't generally described in the form "this data is now available in a blob". That's not an event, that's a result. Events are expressed in terms of "user X logged in" or "new user was created". If I wanted to run an ADF pipeline upon the event "new user was created", I would expect to just subscribe the ADF pipeline to that particular event, and then try to fetch the source data (with retries) until whatever data the pipeline needs was available. With this "event-based" data integration, I would have to create an Azure Function that is subscribed to the topic and creates some dummy file in the blob so that ADF can react to that event. It's overly complicated, slow, and probably also very unreliable.
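
    For concreteness, the workaround described above amounts to roughly the sketch below: an Event Grid-subscribed Azure Function that drops a marker blob for the ADF trigger to react to. The app setting name, container, and blob naming are assumptions, and the function additionally needs a function.json with an eventGridTrigger binding, which is exactly the kind of extra plumbing in question.

    # Sketch of the workaround: an Event Grid-triggered Azure Function that writes a
    # marker blob so an ADF storage-event trigger can fire on a business event.
    import json
    import os

    import azure.functions as func
    from azure.storage.blob import BlobClient


    def main(event: func.EventGridEvent) -> None:
        payload = event.get_json()  # e.g. the "new user was created" event data

        # Drop a small marker file whose arrival the Data Factory trigger watches.
        marker = BlobClient.from_connection_string(
            os.environ["MARKER_STORAGE_CONNECTION"],  # hypothetical app setting
            container_name="adf-triggers",            # hypothetical container
            blob_name=f"user-created/{event.id}.json",
        )
        marker.upload_blob(json.dumps(payload), overwrite=True)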
