Reusable dataflows / ETLs
We're trying to come up with a good strategy for dealing with large datasets that contain mostly historical data but need to undergo complex transformations in a dataflow or ETL.
For example, our user activity log (currently about 7M rows) needs to be joined to several other tables and then we need to segment the activity in multiple ways, group some of it together, etc.
We've been using a MySQL dataflow because of the complexity of the transformations but this takes a little while to run -- 20-30 minutes -- even if we only process 6 months' worth of activity (< 2M rows from the activity log). If we process the entire 7M-row activity log the dataflow can take over an hour to run.
So one general thought we've had is to only transform very recent data (say, from the current month) and then UNION it together with transformed historical data in a DataFusion. But we're struggling to come up with a good process for managing this in practice. Coming from a software-development background, my mind goes to reusable dataflows -- if it were possible to configure a DF to use a "variable" input, this process would be pretty easy to manage.
So for example on the first day of the month the DF would use the historical activity log as its input, and then on subsequent days it would use the "month-to-date" activity log. I don't think this is literally possible in Domo, without manually reconfiguring the DF twice a month, but maybe someone has come up with a technique that approximates it?
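To make that concrete, the "variable" would effectively switch the input query between a full-history read and a month-to-date read. A rough sketch of the two variants (table and column names here are placeholders, not our real schema):

```sql
-- Variant A (first day of the month): transform the full history
SELECT *
FROM user_activity_log;

-- Variant B (all other days): transform only the current month
SELECT *
FROM user_activity_log
WHERE activity_date >= DATE_FORMAT(CURDATE(), '%Y-%m-01');
```

Everything downstream of that input would be identical, which is why a single parameterized DF feels like the natural fit.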
Comments
-
Have you tried to rebuild your MySQL dataflow in Magic ETL? The latest Magic ETL runs extremely fast and I would expect it to handle 7M rows in under a couple of minutes. You can also use formula rules in the filter tile, so if you need the ETL to run even faster you could filter to month-to-date at the beginning and then do the rest of your work on that smaller slice. If you haven't watched the Domopalooza breakout session called Demystifying Data Pipelines, I would suggest watching it, as it gives a nice overview of the different types of data transformation tools and when to use them.
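For example, a month-to-date rule in the filter tile could be a formula along these lines (assuming a date column called activity_date; the column name and exact function names will vary with your dataset):

```sql
-- Keep only rows from the current calendar month
YEAR(activity_date) = YEAR(CURRENT_DATE())
AND MONTH(activity_date) = MONTH(CURRENT_DATE())
```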
Hope this helps.
**Check out my Domo Tips & Tricks Videos
**Make sure to like any user's posts that helped you.
**Please mark as accepted the ones that solved your issue.
-
Thanks, I'll check out that breakout session. We're kind of committed to using SQL dataflows for various reasons, at least for transforms of any complexity. These datasets are just going to keep getting bigger in any case, so I'm really more interested in specific techniques for breaking large, mostly-static datasets down before transformation and then reassembling them.
The general approach I have in mind would refactor this naive approach:
1) LARGE_DATASET -> DF -> LARGE_OUTPUT
To something like this:
1) MOST_OF_LARGE_DATASET -> DF -> LARGE_PARTIAL
2) VERY_RECENT_DATA -> DF -> SMALL_PARTIAL
3) LARGE_PARTIAL + SMALL_PARTIAL -> Union via Fusion or ETL -> FINAL_OUTPUT
Where step 1 would only run, say, monthly, and step 3 is a trivial union (sketched below). But I don't know how to make that happen without having two copies of the key dataflow, which then have to be maintained separately.
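Conceptually, the step-3 reassembly is nothing more than this (the table names are placeholders for the two partial outputs):

```sql
-- Stitch the monthly-refreshed history together with the daily-refreshed slice
SELECT *
FROM large_partial
WHERE activity_date < DATE_FORMAT(CURDATE(), '%Y-%m-01')  -- guard against overlap with the MTD slice
UNION ALL
SELECT *
FROM small_partial;
```

The hard part isn't the union; it's getting step 1 to run on its own, slower schedule without duplicating the transformation logic.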
-
@jaeW_at_Onyx is well-versed in this area. Jae, do you have some thoughts you can share with @Ben_Dunlap?
**Check out my Domo Tips & Tricks Videos
**Make sure to like any user's posts that helped you.
**Please mark as accepted the ones that solved your issue.
-
Flat out, with only 7M rows of data you (99% guaranteed) will have the best and fastest performance with Magic 2.0. The underlying transformation engine is more performant than the way data is piped in a MySQL dataflow (out of Domo Vault / Amazon S3, into a MySQL VM, and then back to S3, vs. streaming in an Apache-esque ETL environment).
Especially with the pending release of Magic 2.0 + Partitioning (which you're essentially describing with your incremental load process), your Dataflows are going to smoke MySQL performance.
Domo has not invested in making MySQL Dataflows faster in over 4 years, and that's reflected in the version of MySQL the dataflows run on (5.6 vs. 8+).
I work with 1B+ and 500M+ row datasets and consulted for 100+ clients prior to joining Sony... I know how hard a pill it is to swallow, but there are VERY few use cases where SQL will outperform Magic dataflows. That said, Domo does have a premium feature (Adrenaline Dataflows) that lets you write SQL directly against the database engine Domo uses to present cards, which will give you even faster performance than Magic 2.0 ... but if you're just transforming 7M rows, I'd still point you at Magic.
What you can do is use Dataset Views (which unfortunately are still somewhat buggy) to JOIN your datasets and perform light transforms, THEN pull the DSV into Magic 2.0 for any last processing steps and materialization. If you can do your ETL phase in a DSV, then you can just use Magic 2.0 for UNIONing and materializing (UNLOADing the dataset to disk in S3) the data. This gives you the best of all worlds: it supports transforming incrementally loaded data and takes advantage of the fastest ETL tools in Domo without relying too much on Magic.
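As a rough sketch of the split (dataset and column names below are made up for illustration), the DSV layer would hold the JOIN and light shaping:

```sql
-- Lives in the Dataset View: join + light transform only
SELECT a.user_id,
       a.activity_date,
       a.activity_type,
       u.segment,
       u.region
FROM user_activity_log a
JOIN user_dim u
  ON u.user_id = a.user_id;
```

Magic 2.0 would then just UNION that view with the historical output and materialize the result.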
... but seriously... for 7M rows just use Magic.
Jae Wilson
Check out my 🎥 Domo Training YouTube Channel 👨💻
**Say "Thanks" by clicking the ❤️ in the post that helped you.
**Please mark the post that solves your problem by clicking on "Accept as Solution"
-
I'm sure you're right about performance but we have other reasons apart from performance for wanting to stick with SQL dataflows. We're actually just wrapping up a major project of migrating all of our ETLs to SQL dataflows. For the most part the run times aren't a problem, and even in this case we can live with the long run time since the underlying raw data is controlled by a vendor that only gives us access to daily snapshots. So the DF really only needs to run once a day. Just wondering if anyone had solved the partitioning problem using vanilla SQL DFs without the need for periodic manual reconfiguration of DFs or duplicate DFs.
-
@Ben_Dunlap - I'll defer to @MarkSnodgrass and @jaeW_at_Onyx for the technical workings under the hood of Magic 2.0 versus MySQL. It sounds to me like your hands may be tied when it comes to which type of transform you use, and that the decision has already been made to use MySQL.
In that case, I would highly encourage you to filter your input datasets as much as possible and make sure that you are adding an index before any joins. You could also reach out to Domo to see if a technical consultant could review your dataflow to help you optimize it.
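For example, in a MySQL dataflow you can add a transform that indexes the join key before the join runs (table and column names here are placeholders):

```sql
-- Transform 1: index the join key ahead of the join
ALTER TABLE user_activity_log ADD INDEX idx_user_id (user_id);

-- Transform 2: the join itself can then use the index
SELECT a.*, u.segment
FROM user_activity_log a
JOIN user_dim u ON u.user_id = a.user_id;
```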
Out of curiosity, what are the "various reasons" that are preventing you from switching to the Magic 2.0 engine? In our case, we ran into some roadblocks when it came to documentation: some of our dataflow processes needed to be documented in specific ways, and pasting a screenshot of the ETL or the JSON text was not sufficient.
If it is just a matter of needing precise control over "complicated" joins, there is probably a way to manage that with Magic 2.0.
-
Ooof... migrated away from Magic. ooook.
Are you familiar with recursive dataflows? I cannot ... de-recommend them strongly enough, but they're an anti-pattern that works.
You can also add indexes to your MySQL dataflow, which may boost performance during the JOIN actions:
https://domohelp.domo.com/hc/en-us/articles/360042923014-Optimizing-a-SQL-DataFlow
https://domohelp.domo.com/hc/en-us/articles/360043428133-Creating-a-Recursive-Snapshot-SQL-DataFlow
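If you do go recursive, the core of the pattern is a feedback loop: the dataflow's own output dataset is also one of its inputs, each run keeps the rows it isn't refreshing, and only the new slice gets re-transformed. Very roughly (names are placeholders, not a drop-in solution):

```sql
-- "historical_output" is this dataflow's previous output, fed back in as an input
-- "new_transformed" is the freshly transformed recent slice (e.g., month-to-date)
SELECT *
FROM historical_output
WHERE activity_date < DATE_FORMAT(CURDATE(), '%Y-%m-01')  -- keep everything outside the refresh window
UNION ALL
SELECT *
FROM new_transformed;                                      -- swap in the re-transformed window
```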
Jae Wilson
Check out my 🎥 Domo Training YouTube Channel 👨💻
**Say "Thanks" by clicking the ❤️ in the post that helped you.
**Please mark the post that solves your problem by clicking on "Accept as Solution"
-
Thanks all, we're using indexes in our DFs and we've pushed some of our transforms down into the Postgres layer (our key raw datasets are coming out of a read-only Postgres db that our vendor provides to us). Both of these optimizations help a lot with performance.
@ST_-Superman-_ we had a few reasons for migrating to SQL:
1) We were finding that it often took several non-obvious steps in Magic to do something that would be very simple in SQL, which led to bloated and difficult-to-understand ETLs. I'm sure a lot of this has been solved in Magic 2.0, but we made this decision when all of our ETLs were still pre-2.0.
2) At the time of the decision, our business relationship with Domo was at an all-time low. This has improved quite a bit since then, but back then the goal was to become less dependent on Domo's proprietary tech, not more. SQL queries can be copy/pasted as a simple export/backup.
3) Along the same lines, we had just hired our first full-time data analyst, and I was more interested in giving this person an opportunity to develop portable skills than in having them become a Domo expert.
So to summarize: We prioritized portability and "readability of code" (so to speak) over performance. Given our relatively small datasets and the way that the data is made available to us, this still looks to me like a good trade-off for at least the next few years.
@jaeW_at_Onyx Recursive dataflows are intriguing, but in our use case one of the main performance bottlenecks is the step of loading all of the historical data as an input to the DF, which I think would still need to happen in a recursive dataflow. It's giving me some ideas though. I'll post back here if any of them come to anything.
-
@Ben_Dunlap - I strongly concur with the rest of the folks here that you should give Magic another shot. 7 million rows isn't a very large dataset, so it would help even if you only used Magic where you need the performance.
If your analysts aren't interacting a lot with the end-users, SQL can make sense as a way to leverage existing skillsets. However, if they are working a lot with end-users, it sure does help build trust and understanding!
There are some minor logic shifts which make the transformations much easier to navigate. If you have any questions about how to best achieve results, this group can help. A big advantage to Magic is that processes can run concurrently, so a well-designed dataflow will totally smoke SQL.
If you want to know how to grab the code from a Magic ETL dataflow, the folks responding here can help with that, too. Wanting to be able to mine your code and not be beholden to the platform totally makes sense.
DataMaven
Breaking Down Silos - Building Bridges
**Say "Thanks" by clicking a reaction in the post that helped you.
**Please mark the post that solves your problem by clicking on "Accept as Solution"
-
Thanks again all. I'm definitely seeing the appeal of Magic 2. I think we will probably end up with some sort of compromise where we still use a fair amount of SQL but do heavier-duty joins in Magic or in the underlying Postgres db. Just discovered the Java CLI thanks to a post elsewhere by @jaeW_at_Onyx, so now we have access to version-controllable JSONs of all of our stuff, which is a game-changer. I'm always amazed at how much more there is to dig into with Domo!