Combine Data Issues
I have 3 datasets called FYC Historical, FYC Details-Current, and FYC Summary. I have combined these with our VUE Production Clean Dataset and called the dataflow VUE FYC Combine (takes almost an hour to run).
I have then taken this output dataset, called VUE Historical and Current Test, and placed it into a copy of an overall dataflow called FYC TEST of Bridge | Course and Program Enrollment Status, which produces my 2 main datasets (FYC TEST Bridge | Course Enrollment Status and FYC TEST Bridge | Program Enrollment Status).
I have tried to run this dataflow, and it has gone longer than a day and ended up canceling with over 400 million rows and almost a terabyte of data. What am I doing wrong? What can I do to correct this to get it to run faster?
Comments
-
Hi DrGWright,
I'm sorry to learn that dataflow is failing to successfully run. The first thing is to make sure you've optimized the dataflow using the tactics described in this Help Center article: https://knowledge.domo.com/Prepare/Magic_Transforms/SQL_DataFlows/Optimizing_an_SQL_DataFlow. The suggestions include Indexing (if in MySQL), filtering the data down to only what will actually be used, grouping data when aggregations are used in the SELECT clause, and breaking up multiple table joins into their own transforms.
Another thing to look at is to ensure your joins are being made on unique keys (either a column with only unique values or on multiple columns which create a unique combination) and trying to join mostly (or only) on numeric columns; string columns take much longer to index when being joined.
If I have answered your question, please click "Yes" on my comment's option.
3 -
I'm not using SQL for any of this. I'm still not sure how to simplify. Is there anything within the ETL or Dataflow to help correct this issue?
0 -
Hi DrGWright,
Looks like you're joining 4 datasets to start. Are you using Dataflow?
- Databases:
FYC Historical
FYC Details-Current
FYC Summary
VUE Production Clean Dataset - Creates: VUE Historical and Current Test dataset
This is then joined into another data flow that's producing another 2 datasets. Lot's going on to start.
Do you have any further information on these databases? Types of database, locally or cloud/SaaS and also what field you are using to join? Also, their size can impact run times.
If you can give us a bit more context it will help. For example, if I wanted to join a CRM database and a Transaction database I might use Customer_ID as it's common in both Databases.0 - Databases:
Categories
- All Categories
- 1.8K Product Ideas
- 1.8K Ideas Exchange
- 1.5K Connect
- 1.2K Connectors
- 300 Workbench
- 6 Cloud Amplifier
- 8 Federated
- 2.9K Transform
- 100 SQL DataFlows
- 616 Datasets
- 2.2K Magic ETL
- 3.8K Visualize
- 2.5K Charting
- 738 Beast Mode
- 56 App Studio
- 40 Variables
- 684 Automate
- 176 Apps
- 452 APIs & Domo Developer
- 46 Workflows
- 10 DomoAI
- 35 Predict
- 14 Jupyter Workspaces
- 21 R & Python Tiles
- 394 Distribute
- 113 Domo Everywhere
- 275 Scheduled Reports
- 6 Software Integrations
- 123 Manage
- 120 Governance & Security
- 8 Domo Community Gallery
- 38 Product Releases
- 10 Domo University
- 5.4K Community Forums
- 40 Getting Started
- 30 Community Member Introductions
- 108 Community Announcements
- 4.8K Archive