Average run time of a Redshift data flow on a large dataset
Hello, I wanted to get a general idea of your Redshift run times. Any input is appreciated.
I have a dataset with about a million rows and 123 columns. However, it took 3.5 hours simply to pull a subset of one column, and almost 10 hours to run my data flow with some left joins. This dataset needs to be refreshed every day, so you can imagine how difficult it is for me to get anything done.
I am sure this dataset is not "Big Data" at all; I'm curious how long it takes you to run a similar size of data.
Also, please give me any advice on how to speed things up.
Thank you.
Olivia
Comments
-
Not having seen your joins, I would suspect that adding indexes to your input datasets on the columns you join on will vastly improve the run time.
What sort of ETL are you using? If it's the MySQL ETL, you can use the following syntax as a transform to add an index:
"ALTER TABLE `table_name` ADD INDEX (`column_name`)"
You should add an index to each input dataset on the column you join on, which will improve the execution time.
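For example (your actual table and column names will differ; these are just placeholders), if two inputs are joined on a company name column, the transforms would be:
"ALTER TABLE `table_a` ADD INDEX (`company_name`)"
"ALTER TABLE `table_b` ADD INDEX (`company_name`)"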
Try adding those and let me know if that doesn't reduce the run time.
-
Thank you, dthierjung.
I will try what you suggested. May I ask what your average run time is, and for what size of data?
But here is an example of a data flow I ran yesterday, and it is still running after 17 hours; I missed my deadline for an analysis.
This is what is in this data flow:
3 input tables:
Table A: 54 million rows, 9 columns
Table B: 24 rows, 3 columns
Table C: 26.4K rows, 32 columns
SELECT a.*, b.`Competitor`, e.`ActivityDate`, e.`Company`, e.`Title`, e.`Full Name`
from `Table A` a
left join `Table B` b
on a.`company_name` = b.`DB compnay Name`
left join `Table C` e
on b.`Company Name` = e.`Company`
-
We don't have any inputs that exceed 3 million rows, so I don't think I'd be able to make a useful comparison for you, although I'm sure others can comment on their run times. As an aside, I had a dataflow that executed in 3 hours but only had ~600k rows total; after adding indexes, it completed in less than 10 minutes, so indexes really can make a huge difference.
SQL best practices are applicable no matter the data size, though, and given the size of your inputs, using all the tools at your disposal would be best. There are a few more items you can consider (a rough sketch combining them follows the list):
1) Joining on integers is typically faster than joining on strings, so if you can join on `company_id`, for instance, that would be more ideal.
2) If you can filter your data in any way prior to joining, that will reduce your run time as well.
3) Also consider whether you need ALL the columns from `Table A` in the example you provided.
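As a very rough sketch combining all three points (the company_id, some_measure, and activity_date columns here are hypothetical, not taken from your schema), the query shape would be something like:
SELECT a.company_id, a.some_measure,        -- point 3: only the columns you actually need
       b.Competitor
FROM (
    SELECT company_id, some_measure
    FROM table_a
    WHERE activity_date >= '2018-01-01'     -- point 2: filter before joining
) a
LEFT JOIN table_b b
    ON a.company_id = b.company_id;         -- point 1: join on an integer key instead of a name string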
-
You reference a dataset of 123 columns and a million rows, but a table like that isn't mentioned in the query; those three tables are all different sizes. Where does this first dataset come into play?
In my experience with Domo, 54 million rows is rather large for most companies and I'm not surprised a dataflow took a few hours to run on that. I wouldn't expect one transform combining that data with a couple smaller tables to take 10 or 17 hours.
@dthierjung gives really solid advice, though. Strings, especially long ones, can destroy your run time. Since you're running Redshift, I don't think indexing the tables is going to be necessary; it's really only MySQL you need to do that with, and indexing absolutely helps there.
Do try your dataflow with a fairly limited subset, like a particular customer, and see how that improves the run time.
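For instance, taking the query you posted and restricting it to one customer (the customer value here is just a placeholder) would let you check the logic and the run time on a small slice:
SELECT a.*, b.`Competitor`
FROM `Table A` a
LEFT JOIN `Table B` b
    ON a.`company_name` = b.`DB compnay Name`
WHERE a.`company_name` = 'Some Customer Inc';   -- placeholder: limit to one customer for the test run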
Aaron
MajorDomo @ Merit Medical
**Say "Thanks" by clicking the heart in the post that helped you.
**Please mark the post that solves your problem by clicking on "Accept as Solution"1 -
Hi dthierjung, your suggestion works; it now takes about 3.5 hours to finish. But I think this is still too long. I'm wondering how you guys run ad hoc queries?
Thank you
-
Domo is fantastic for prototyping; however, with the sheer amount of data you're working with, ad hoc reports will be tough to do quickly.
My company has moved our productionized data flows into a MS SQL environment, which does all of the heavy lifting on our server, so processing times are typically much shorter.
Have you tried the Fusion ETL? I believe that handles datasets in the millions of rows more efficiently.
-
Hi, Aaron,
Thank you for your comments. The data with 123 columns and 54 million rows is the original source data. The example Table A I gave above is only a subset of this original data; I have another query to select those columns.
I have a data flow based on the original dataset, and it ran 10 hours in Redshift. I already asked a Domo tech person to take a look, and he said it looks good and nothing can be optimized in Redshift. Like you said, indexing only works in MySQL. This data needs to be refreshed every day, and the size is growing every day.
I am not sure I understand when you say to run it for a particular customer. For us, this is our day-to-day situation: if I only run a subset, how can I do my analysis when I need to compare with last year's numbers or summarize an overall analysis? I really need to get the data flow to run fast.
Thank you.
Olivia
-
I suggested limiting to one customer just to see how fast the dataflow would run, as a sort of check on the logic.
Running 54 million rows with that many columns, I'm not surprised by a three or four hour runtime.
What is the nature of the data? Does it go back many years? Is that why there are so many rows? Or are there so many dimensions that it takes a lot of rows to use them all?
One thing we've done in order to get descriptive data in faster times is to run two datasets simultaneously. One with just a year or two of data that refreshes often, like every hour, and another with more historical data that refreshes about every four hours. That adds complexity to the process but gets the right data in the hands of the right people at the right time.
You might try a similar strategy where you become more precise with the requirements, given the time constraints.
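As a rough sketch (the activity_date column and the two-year cutoff are placeholders, not your actual schema), the two flows might differ only by a date filter:
-- "Recent" dataset: small, refreshed frequently
SELECT *
FROM table_a
WHERE activity_date >= DATEADD(year, -2, CURRENT_DATE);

-- "Historical" dataset: full history, refreshed every few hours
SELECT *
FROM table_a;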
Aaron
MajorDomo @ Merit Medical
**Say "Thanks" by clicking the heart in the post that helped you.
**Please mark the post that solves your problem by clicking on "Accept as Solution"1 -
Thanks for letting me know about Fusion ETL. I am trying to get that one.
Olivia
-
Hi, Aaron,
Thank you for your great suggestion! I am going to see how to optimize my data flow like that.
Olivia
-
Redshift datasets can have a sort key added that will help them join faster. Below is some of our code to speed up a dataflow with big datasets (the BSEG customer table has 85 million rows). The biggest impact on run times will be the IO of moving the data into the Redshift environment. If you can limit the columns you use, it will go faster; you can join attributes back on later with a Fusion if they are not needed for whatever processing you are doing in the dataflow.
/* ===== Prepare Sorted Table =====
IMPORTANT! Using BEGIN & END makes these steps run as a single transaction.
Not using them will lead to occasions of the next transform running before the sorted table is ready.
*/
BEGIN;

DROP TABLE IF EXISTS raw_sap_bseg_customer_sorted;

-- Recreate the table with a SORTKEY on the columns used by later joins
CREATE TABLE raw_sap_bseg_customer_sorted
(
    BUKRS VARCHAR(24),
    BELNR VARCHAR(24),
    GJAHR VARCHAR(24),
    BUZEI VARCHAR(24),
    LIFNR VARCHAR(24),
    XREF1 VARCHAR(24),
    XREF2 VARCHAR(24),
    KUNNR VARCHAR(24)
)
SORTKEY (BUKRS, BELNR, GJAHR);

-- Copy the raw data into the sorted table
INSERT INTO raw_sap_bseg_customer_sorted
(SELECT * FROM raw_sap_bseg_customer);

END;
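That block only prepares the sorted table; the later transform should then join on the same columns the SORTKEY was defined on. A hypothetical follow-on transform (the second sorted table and the extra column are made up for illustration) might look like:
-- Join on the SORTKEY columns so Redshift can take advantage of the sorted data
SELECT c.*, h.BLART
FROM raw_sap_bseg_customer_sorted c
LEFT JOIN raw_sap_bkpf_header_sorted h   -- hypothetical header table, sorted on the same keys
    ON  c.BUKRS = h.BUKRS
    AND c.BELNR = h.BELNR
    AND c.GJAHR = h.GJAHR;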