Domo ETL parallel processing to reduce ETL runtime


Hello,

I have a Python script that is part of an ETL and transforms a combination of 3 input datasets.

The script contains multiple if/else statements and for loops, and the ETL pipeline takes around 4 hours to complete.

Is there an option to parallelize the Python script in Magic ETL to reduce the runtime?

Thanks


Answers

  • GrantSmith

    It depends on the structure of your data and whether you're able to process it in parallel. Can you process it in subsections, or have the different steps happen at the same time, so you can use multiple Python code tiles? Can the logic you're using in Python be translated back into Magic ETL tiles? A Python tile must spin up a new virtual environment, copy over the dataset, process it and then pass it back, which is where most of the delay typically comes from.

    Do you have to process all of the records or can you filter it down and process fewer records?
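
    As a rough illustration of filtering first, here is a minimal pandas sketch. The `orders` data, its column names and the `expensive_rule` function are all hypothetical stand-ins, not taken from the original script; in a Python tile the DataFrame would come from the tile's input instead.

```python
import pandas as pd

# Hypothetical input data (assumption: not from the original ETL).
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "status": ["open", "closed", "open", "closed", "open"],
    "amount": [120.0, 80.0, 15.0, 300.0, 50.0],
})


def expensive_rule(row):
    # Hypothetical stand-in for the slow per-row if/else logic.
    if row["amount"] >= 100:
        return "large"
    return "small"


# Filter first so the slow logic only runs on the rows that need it.
needs_processing = orders[orders["status"] == "open"].copy()
needs_processing["size"] = needs_processing.apply(expensive_rule, axis=1)
```

    Here only the three "open" rows go through the per-row logic. The same idea applies if the filter is done upstream with a regular ETL tile, so that less data has to be copied into the Python tile.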

  • vguddanti

    This is what my ETL looks like. My Python script runs 3 for loops one after the other. Do you think splitting the code into 3 tiles would help speed up the processing?

  • GrantSmith

    It depends on what type of processing your Python tile is doing. Does the tile need all of the datasets as input so it can process them together, or can each dataset be processed independently?

  • vguddanti

    Each function requires a combination of 2 datasets. I then pull the matching data based on some conditions.

  • timehat (Contributor)

    Multiple scripting actions currently run sequentially: the first tile that is ready to start (once all of its upstream data has buffered) runs, and any other scripting tiles on parallel paths wait for the running tile, or one further up in the queue, to finish before they get their turn. This ensures that a more consistent set of execution resources (mainly memory) is available to each script, rather than having scripts compete with each other for resources.

    This is subject to change, but it is the current setup and isn't configurable today. If the logic in the script can be performed with the normal ETL tiles, it has a good chance of running more quickly, even though authoring tiles may be more tedious than writing a Python script.
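
    To make that concrete, "pulling matching data based on some conditions" can often be expressed as a join followed by a filter rather than nested loops, which is roughly what a join tile followed by a filter tile would do. A minimal pandas sketch, assuming hypothetical `customers` and `orders` datasets and made-up conditions (none of these names come from the original script):

```python
import pandas as pd

# Hypothetical inputs (assumption: the real datasets and keys will differ).
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
})
orders = pd.DataFrame({
    "order_id": [10, 11, 12, 13],
    "customer_id": [1, 2, 2, 4],
    "amount": [50.0, 200.0, 20.0, 75.0],
})

# Join step: keep only orders that have a matching customer.
matched = orders.merge(customers, on="customer_id", how="inner")

# Filter step: apply the matching conditions to the joined rows.
result = matched[(matched["region"] == "US") & (matched["amount"] > 100)]
```

    If the loops in the script reduce to this join-then-filter shape, the same steps can be built from regular tiles, or at least written in vectorized form inside the Python tile.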

  • vguddanti

    Thanks for your answer. Would using the Python package "Dask" help parallelize the script across all the available cores?