Magic ETL Efficiency

Hello,

I have a simple Magic ETL that I use to filter a dataset from our company's source system to only show the latest version of each record. The dataflow finds the latest timestamp for each record, then uses an inner join on the record ID and timestamp to only include the most recent row. I've attached a screenshot of the dataflow configuration for reference.

Magic ETL.PNG

Currently it takes about 15-20 minutes for this dataflow to run, processing about 16M rows to produce an output of 1.7M rows. I'm trying to come up with ways to improve the efficiency of this dataflow as the input dataset continues to grow to ensure that it doesn't impact the timing of downstream dataflows. Does anyone have any suggestions or alternatives? Any help would be greatly appreciated.

Thanks!

Find more posts tagged with

Accepted answers

Godiepi

Hi @MichelleH

you could try the following. I think it might help improved the efficiency of your flow

Follow the instructions in the screenshot attached to

Remove the Group By and add a "Rank & Window"
instructions for the Rank window set up are in the screenshot
Click on Addf Function
Name the new column "Rank_Desc"
Select the function to apply a "Dense Rank" from the drop dow menu selection and click Apply
continue to step 2 selecting the column that identifies the lastest version
continue to step 3 and choose "Descending"
continue to step 4 and select the column that makes the partition, in your case "RecordID"
after setting up the rank window function , use the new column "Rank_Desc" to filter your dataset where the column = 1

All comments

Godiepi

Hi @MichelleH

you could try the following. I think it might help improved the efficiency of your flow

Follow the instructions in the screenshot attached to

Remove the Group By and add a "Rank & Window"
instructions for the Rank window set up are in the screenshot
Click on Addf Function
Name the new column "Rank_Desc"
Select the function to apply a "Dense Rank" from the drop dow menu selection and click Apply
continue to step 2 selecting the column that identifies the lastest version
continue to step 3 and choose "Descending"
continue to step 4 and select the column that makes the partition, in your case "RecordID"
after setting up the rank window function , use the new column "Rank_Desc" to filter your dataset where the column = 1

MichelleH

Hi @Godiepi ,

Thanks for the suggestion! I created a copy of the original dataflow and updated it using your instructions, then let the two dataflows run concurrently for a couple of days to observe the results.

While your method was faster overall, it only reduced the average run time by 1-2 minutes. I agree that using Dense Rank with a filter is a better option than a Group By and Join, however I'm still open to additional suggestions that could possibly help get the run time to under 10 minutes if possible.

Thanks again!

bdavis

There's not much else going on in the dataflow, especially nothing that adds complexity. I don't know if this is an option for you, but have you thought about decreasing the size of the input set? Perhaps increase the frequency of the import and only bring in rows added since the last import? I don't know how wide your data is, but reducing the number of columns might add some speed (assuming there are columns you could eliminate).

DataMaven

Great Suggestion @Godiepi! I often forget about using Rank!

I agree with @bdavis, and was about to suggest adding column selection in the front so it doesn't have to chunk through so much.

You also may be able to remove the 'Remove Duplicates' function if the selection of Rank=1 does the trick. It seems small, but it's still looking through every row across many attributes, especially if it's before your column selection.

MichelleH

@DataMaven I hadn't thought about the Remove Duplicates, but I'll get rid of that and see if that helps!

DataMaven

Did it help?