Domo ETL parallel processing to reduce ETL runtime
Hello,
I have a Python script that is part of an ETL and transforms a combination of 3 input datasets.
The script contains multiple if/else statements and for loops, and the ETL pipeline takes around 4 hours to complete.
I want to know if there is an option to parallelize the Python script in Magic ETL to reduce the runtime.
Thanks
Answers
-
It depends on the structure of your data and whether you're able to process the data in parallel. Can you process it in subsections, or have the different steps happen at the same time, so you can use multiple Python code tiles? Can the logic you're using in Python be translated back into Magic ETL tiles? When you use a Python tile, it must spin up a new virtual environment, copy over the dataset, process it, and then pass it back, which is where most of the delay typically comes from.
Do you have to process all of the records or can you filter it down and process fewer records?
**Was this post helpful? Click Agree or Like below**
**Did this solve your problem? Accept it as a solution!**
0 -
This is what my ETL looks like. My Python script runs 3 for loops one after the other. Do you think splitting the code into 3 tiles will help speed up the processing?
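Before splitting the script, it may help to measure which of the three loops actually dominates the runtime. The following is a minimal sketch, not the script from this thread: `loop_one`, `loop_two`, and `loop_three` are hypothetical stand-ins for the three loops, and `time_stages` is an illustrative helper built only on the standard library.

```python
import time


def time_stages(stages, data):
    """Run each (name, function) stage in order and record its wall-clock time.

    Each function receives the output of the previous stage and returns
    the input for the next one.
    """
    timings = {}
    for name, stage in stages:
        start = time.perf_counter()
        data = stage(data)
        timings[name] = time.perf_counter() - start
    return data, timings


# Hypothetical stand-ins for the three loops in the real script.
def loop_one(rows):
    return [r * 2 for r in rows]


def loop_two(rows):
    return [r + 1 for r in rows if r % 3 == 0]


def loop_three(rows):
    return sorted(rows, reverse=True)


result, timings = time_stages(
    [("loop_one", loop_one), ("loop_two", loop_two), ("loop_three", loop_three)],
    list(range(10)),
)
for name, seconds in timings.items():
    print(f"{name}: {seconds:.6f} s")
```

If one loop accounts for most of the time, that loop is the one worth rewriting or moving into standard tiles first.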
0 -
It depends on what type of processing your Python tile is doing. Does the tile need all of the datasets as input so it can process them together, or can each dataset be processed independently?
0 -
It requires a combination of 2 datasets for each function. Then I am pulling matching data based on some conditions.
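If the matching is currently done with nested for loops (an assumption, since the script is not shown), one common way to cut the runtime without any parallelism is to index one dataset by its join key first. The sketch below uses hypothetical column names (`customer_id`, `threshold`, `amount`) and a made-up condition purely for illustration.

```python
from collections import defaultdict


def match_nested(left, right):
    """Nested-loop match: compares every left row with every right row."""
    matches = []
    for l_row in left:
        for r_row in right:
            if (l_row["customer_id"] == r_row["customer_id"]
                    and r_row["amount"] > l_row["threshold"]):
                matches.append((l_row["customer_id"], r_row["amount"]))
    return matches


def match_indexed(left, right):
    """Same result, but the right rows are grouped by key once up front."""
    by_key = defaultdict(list)
    for r_row in right:
        by_key[r_row["customer_id"]].append(r_row)

    matches = []
    for l_row in left:
        # Only rows sharing the key are inspected, not the whole dataset.
        for r_row in by_key.get(l_row["customer_id"], []):
            if r_row["amount"] > l_row["threshold"]:
                matches.append((l_row["customer_id"], r_row["amount"]))
    return matches


# Hypothetical sample data.
left = [
    {"customer_id": 1, "threshold": 50},
    {"customer_id": 2, "threshold": 10},
]
right = [
    {"customer_id": 1, "amount": 40},
    {"customer_id": 1, "amount": 75},
    {"customer_id": 2, "amount": 20},
    {"customer_id": 3, "amount": 99},
]

print(match_indexed(left, right))
```

With pandas DataFrames, the equivalent idea is usually a `merge` on the key followed by a boolean filter, which is also close to what a Join tile plus a Filter tile would do in Magic ETL.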
0 -
Multiple scripting actions currently run sequentially: the first tile whose upstream data has all been buffered runs first, and any other scripting tiles on parallel paths wait for the running tile (or one further up in the queue) to finish before they get their turn. This ensures that a more consistent set of execution resources (mainly memory) is available to each script, rather than having the scripts compete with each other for resources.
Subject to change, but this is the current setup and isn't configurable today. If the logic in the script can be performed with the normal ETL tiles, that has a good chance of running more quickly (despite tile authoring perhaps being more tedious than authoring a Python script).
0 -
Thanks for your answer. Does using the Python package "Dask" help parallelize the script across all the available cores?
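For reference, the general pattern such libraries rely on is to split the rows into independent chunks and process each chunk in a separate worker. The sketch below shows that pattern with the standard library only. It is not confirmed by this thread that the Python tile exposes more than one core or that Dask is installed there, so any gain is an assumption to verify. `process_chunk` is a hypothetical placeholder for the per-row logic, which must not depend on rows in other chunks.

```python
import os
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def split_into_chunks(rows, n_chunks):
    """Split a list into at most n_chunks contiguous, near-equal pieces."""
    n_chunks = max(1, min(n_chunks, len(rows) or 1))
    size, extra = divmod(len(rows), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks


def process_chunk(chunk):
    """Hypothetical per-row work; each row is handled independently."""
    return [value * value for value in chunk if value % 2 == 0]


def run_in_parallel(rows, workers=None, executor_cls=ProcessPoolExecutor):
    """Process chunks in a worker pool and return results in input order.

    ProcessPoolExecutor is the usual choice for CPU-bound work;
    ThreadPoolExecutor can be passed in to compare behaviour.
    """
    workers = workers or os.cpu_count() or 1
    chunks = split_into_chunks(rows, workers)
    with executor_cls(max_workers=workers) as pool:
        results = pool.map(process_chunk, chunks)  # map preserves chunk order
        return [item for part in results for item in part]
```

When using the default process pool in a standalone script, call `run_in_parallel(rows)` from under an `if __name__ == "__main__":` guard. Each worker holds its own copy of its chunk, so memory use grows with the number of workers, which matters given that scripting tiles are serialized mainly to protect memory.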