ETL slowed down by large data input
I'm currently using a dataset in a few dataflows that contains around 10M rows of data (this includes historical extracts from the last day of each month as well as the previous day). Any ETL process that uses this dataset as an input takes a long time to run because all of the rows have to load in before anything can be done (the actual ETL steps run quickly). What are some best practices for isolating only the data I need in the data input step to reduce run time?
Best Answer
Use a Dataset View (beta feature; ask your CSM) to isolate the subset of data that you actually need in the ETL.
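For illustration only (Dataset Views are configured in the UI, not written as SQL), the filter logic you'd want the view to apply could look like the sketch below; the dataset and column names are assumptions, not your actual schema:

-- Hypothetical filter that keeps only the extracts the ETL actually needs.
-- `snapshot_date` is an assumed column stamping each historical extract.
SELECT *
FROM   your_10m_row_dataset
WHERE  snapshot_date = LAST_DAY(snapshot_date)        -- month-end extracts
   OR  snapshot_date = CURRENT_DATE - INTERVAL 1 DAY  -- previous day's extract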
Are you running Magic or SQL? Magic can start the ETL before all the data is loaded into the ETL environment; with SQL you have to wait until all the data is loaded into a table (and indexed, if you're using Redshift) before the dataflow can begin.
Also, the SQL engines cannot leverage a DSV. If you're using SQL, your pipeline has to be:
1) DSV
2) Dataset Copy (connector) of your DSV
3) SQL transform (a rough sketch follows this list).
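As a hypothetical sketch (the dataset and column names are assumptions), the SQL transform at the end of that pipeline only sees the pre-filtered copy, and in a MySQL dataflow you can also add an index transform so later steps that filter or join on the date column stay fast:

-- Assumed index transform (its own transform step in a MySQL dataflow).
ALTER TABLE copy_of_filtered_dsv ADD INDEX idx_snapshot (snapshot_date);

-- Assumed final transform: only the pre-filtered rows reach this step.
SELECT snapshot_date, account_id, SUM(amount) AS total_amount
FROM   copy_of_filtered_dsv
GROUP BY snapshot_date, account_id;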
If you have the Adrenaline Dataflows feature, you may be able to leverage that for faster performance because it all happens in Adrenaline, but this is a premium feature.
If you work with your support team, you may be able to:
1) materialize your DSV (so you can avoid the Dataset Copy)
2) pull the materialized DSV into SQL, but make it clear that that's your intention, because I think there's additional backend work that has to happen to use a materialized view in a SQL dataflow.
Short version of the story: you'll have a simpler data pipeline in Magic 2.0.
Jae Wilson
Check out my 🎥 Domo Training YouTube Channel 👨💻
**Say "Thanks" by clicking the ❤️ in the post that helped you.
**Please mark the post that solves your problem by clicking on "Accept as Solution"
Answers
@jaeW_at_Onyx Thank you so much for the suggestions! I will be switching these ETLs over to Magic 2.0 and taking a look at Adrenaline.