Reduce ETL Processing Time

jmmc Member
edited August 2022 in Magic ETL

Hi Everyone.

I'm looking for strategies to reduce the run time of Magic ETL 2 data flows. I'm working with several left joins in sequence on about 17 datasets of 1 million rows each. Beyond this, the ETL performs no other operations.

I'm curious if anyone has experience with these options:

  • Data fusion
  • Adrenaline
  • Redshift SQL
  • Views as inputs, or views as the joining logic

I don't think adrenaline fits the use case, since it's used for large inputs and aggregations. For Redshift, I don't see this as an option in the SQL engine drop down. Even then, I think Magic ETL 2 is probably the fastest.

I also don't think views will perform any better.



  • Hi @jmmc how many datasets are you using in your ETLs and how much data do they contain?

    I recommend starting with this best practices article on the Knowledge Base:

  • jmmc
    jmmc Member

    Hi @MichelleH , thank you for the information. I'm combining about 17 data sets that have 1 million rows each. My thought is that Magic ETL 2 fits the use case most.

  • @jmmc I agree that ETL v2 is probably your best bet. Joins are generally very processing-intensive, so it's not surprising that a dataflow that joins 17 datasets would take a while to run. I'm not sure if this applies to your situation, but you could also see if there are ways to filter the datasets as much as possible before joining them together so that there are fewer rows to scan in each join.

  • jmmc
    jmmc Member

    @MichelleH that is a good idea. I'll have to explore that possibility.

    After reading the documentation, I'm wondering if data fusion is an option.

  • @jmmc How may rows of input data does the flow have? I have an ETL flow with 1.4 million input rows that has multiple joins, some groupings, and a rank & window tile that runs in about 1m 30s. Do your inputs have a lot of columns? If you don't use all the columns, you might try selecting only the columns you need from each input. Same goes for filtering rows, if your datasets go by date and you don't think you'll need dates before a certain point you can filter out those rows. Also, are your joins just on one column or multiple columns?

    **Was this post helpful? Click Agree or Like below**

    **Did this solve your problem? Accept it as a solution!**

  • Hi @jmmc

    When I'm doing ETL optimization I typically like to think about being a teapot - Short and Stout.

    Short - Filter / Drop unused columns / reduce the amount of data as early as possible.

    Another option may be to create a DataSet view of your datasets if you're filtering the data / dropping columns first, this will reduce the amount of data that the ETL engine needs to transfer over and will reduce the amount of time.

    Stout - Try to do as much stuff in parallel. Rather than chaining single joins with your dataset can you do multiple join tiles at the same time and then join those joins together?

    **Was this post helpful? Click Agree or Like below**
    **Did this solve your problem? Accept it as a solution!**
  • jmmc
    jmmc Member

    Hi @RobSomers and @GrantSmith , thank you for the feedback. I'll investigate these options. Curious, have either of you used data fusion? Would it fit this use case?

  • Data Fusion (or Blend as it's sometimes referenced) is the preceding version of DataSet Views. My understanding is they have the same functionality underneath but views offers more functionality.

    **Was this post helpful? Click Agree or Like below**
    **Did this solve your problem? Accept it as a solution!**
  • jmmc
    jmmc Member

    @GrantSmith thank you!

  • I use DataSet View to join a 12M row dataset with 6 or 7 other smaller datasets that contain details like categories, item descriptions, etc. All are left joins, with no other changes or filters.

    Comparing the view to an equivalent dataset that already has the additional columns, I didn't see any appreciable difference in performance when using the view or dataset as a source with Analyzer.

  • jmmc
    jmmc Member

    @MattBell, thanks for the input.

    I'm not familiar with using views, so I was curious if they would be a performance boost compared to ETLs. My understanding is that the view would be processed as needed, meaning the joins would occur on the fly when the dashboard is viewed.

    So I was wondering if that processing time, when loading the dashboard, would be less than the ETL processing time. Maybe something to do with not using indexes with views.

  • @MattBeal

    If you're not pre-filtering or dropping columns from your dataset with your views then you're likely not to see any improvement as the same amount of data needs to be transferred over to the ETL processing platform.

    **Was this post helpful? Click Agree or Like below**
    **Did this solve your problem? Accept it as a solution!**
  • @GrantSmith: It wasn't about expecting improvement, my concern was making sure that the view wasn't slowing things down. As far as I know, the view is regenerated when the underlying datasets change. The 12M row dataset gets rows appended hourly, so if there's any slowness because of the view, I'm unlikely to see it.

  • @jmmc

    @GrantSmith is spot on with his advice with one small addendum.

    If you can perform your JOIN in a Dataset View (which is basically Fusions 2.0 so ... ignore fusions) then, assuming your VIEW can return a result within 1 minute (processing requirement imposed on VIEWS / Adrenaline) then you should see significant improvement to your Magic ETL dataflow performance.

    In a nutshell. Views do operate on the database layer (Adrenaline) so therefore your JOINs should be more performant because the benefit from Indexing, columnar compression, and all the things we love about databases.

    Magic ETL is not a database engine (just wrote a longish explanation here -- ) so try to just do the tasks in ETL that a streaming text pipeline should be good at.

    If you're using many JOINs to convert a Snowflake schema into a more traditional Dimensional table (i.e. a ton of lookups on lookups) it may make sense -- as Grant described to create your dimensional table separately from the JOIN on to the transactional stuff (i.e. "chaining joins" as Grant describes).

    Hope that helps.

    Jae Wilson
    Check out my 🎥 Domo Training YouTube Channel 👨‍💻

    **Say "Thanks" by clicking the ❤️ in the post that helped you.
    **Please mark the post that solves your problem by clicking on "Accept as Solution"