Import a huge amount of data from MySQL - is it possible to partition a datasource?

Good morning,

 

I'm writing because it seems I cannot solve this problem properly on my own, and I need your help.

 

So, I have a huge table in MySQL (50 million rows, 100 columns) that we used to import into Domo in its entirety every day to pick up the new data. Of course, this is getting slower and slower, and we cannot manage it with a full-reload approach anymore.

 

For simplicity, assume we have one date column that we can use to partition the table.

Given how we work with this table, I am really only interested in the last 7 days of data, and I would like to update only those rows.

 

I've considered using the merge option in the connector, but it does not work in my case: some rows within the last 7 days may be deleted from one day to the next, and merge would not let me remove those rows.

 

So, my question is: is it possible to partition my datasource so that every day I delete the last 7 days from Domo (based on the partitioning column) and insert them again from MySQL (so the information is up to date)? Of course, I want to do this without touching any of the other rows.
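
To make it concrete, the daily extract I have in mind would look roughly like this (just a sketch; my_table and event_date stand in for the real table and partition column):

    -- Daily extract: only the last 7 days, selected on the partition column.
    -- my_table and event_date are placeholder names for the real schema.
    SELECT *
    FROM my_table
    WHERE event_date >= CURDATE() - INTERVAL 7 DAY;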

 

I have already tried a dataflow that takes the old datasource and the new 7 days of rows as inputs and performs the delete + insert manually, but this takes far too long: we have to load the whole dataset (very slow), perform the delete + insert (very fast), and then write the updated dataset back out (very slow, again).
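
In SQL terms, the dataflow is effectively doing something like this (sketch only; the dataset names are placeholders):

    -- Keep everything older than 7 days from the existing datasource,
    -- then append the freshly extracted 7-day window from MySQL.
    SELECT * FROM old_datasource
    WHERE event_date < CURDATE() - INTERVAL 7 DAY
    UNION ALL
    SELECT * FROM new_last_7_days;

The logic itself is trivial; reading and rewriting 50 million rows around it is what takes all the time.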

 

If we could somehow delete rows from a datasource based on a partitioning column, this would make things dramatically faster.

 

Thank you in advance.

Best Answer

  • jaeW_at_Onyx (Coach)
    Answer ✓

    Couple points here.

     

    It is possible to run a job in the Domo Utility CLI (Java CLI) that will remove rows from an UPSERT-enabled dataset if you have a CSV with the UPSERT key of the removed rows.  So ... this will solve your hard-delete problem.
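
    As a sketch only (not CLI syntax; table and column names are placeholders), one way to build that CSV of removed keys on the MySQL side is to keep a snapshot of the keys you loaded yesterday and anti-join it against today's table:

        -- Hypothetical: keys_snapshot holds the (id, event_date) pairs loaded in yesterday's run.
        -- Any key present in the snapshot but missing from my_table today was hard-deleted.
        SELECT s.id
        FROM keys_snapshot AS s
        LEFT JOIN my_table AS t ON t.id = s.id
        WHERE s.event_date >= CURDATE() - INTERVAL 7 DAY
          AND t.id IS NULL;

    Export that result to CSV and feed it to the CLI removal job.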

     

    Alternatively, you can look at data processing models that handle incremental loads through the week and then do a massive transform once a month or once a week.  Integrating this into a Data Assembler (not Data Stacker) workflow might be a good fit depending on the details of the use case.

     

    Regarding your point about not being able to use a Fusion: Magic ETL 2.0 (beta, talk to your CSM) does support Dataset Views in Magic pipelines, so that may be a decent solution depending on data volumes.

    Jae Wilson
    Check out my 🎥 Domo Training YouTube Channel 👨‍💻

    **Say "Thanks" by clicking the ❤️ in the post that helped you.
    **Please mark the post that solves your problem by clicking on "Accept as Solution"

Answers

  • Hi @usergru 

     

    How I've typically done this in the past is to have two separate datasets, live and historical. The historical dataset APPENDs the data from 8 days ago every day, while the live dataset REPLACEs its contents with the data from the past 7 days. You can then use a Data Fusion to stack/union the two datasets into their final form. This way you're only ever updating 8 days' worth of data instead of the entire dataset.
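
    As a rough sketch (table and column names are placeholders for your real schema), the two source queries would look something like this:

        -- Historical dataset (APPEND): each day, load only the day that just aged out of the live window.
        SELECT *
        FROM my_table
        WHERE event_date = CURDATE() - INTERVAL 8 DAY;

        -- Live dataset (REPLACE): reload the last 7 days in full every day.
        SELECT *
        FROM my_table
        WHERE event_date >= CURDATE() - INTERVAL 7 DAY;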

     

    Another option, depending on how you're ingesting the data, would be to utilize UPSERT. Here's an article outlining UPSERT in Workbench 5: https://knowledge.domo.com/Connect/Connecting_to_Data_Using_Workbench_5/Using_Upsert_in_Workbench_5

    **Was this post helpful? Click Agree or Like below**
    **Did this solve your problem? Accept it as a solution!**
  • Hello and thanks for the answer.

     

    The UPSERT solution is not good for me because I may need to delete some rows within the last few days of data; if I delete them in MySQL between one UPSERT and the next, the deleted rows will still be present in the output dataset, and that is a problem for me. Since no DELETE option is available (without an external dataflow, which would make things far too slow), I gave up on upsert/merge.

     

    The first solution, on the other hand, would work. The problem, however, is that we cannot use a Data Fusion's output as a source; that is, even if I get the result, I cannot use it as an input for other ETLs.

     

    For this reason I'm really looking for another solution. Maybe Domo Streams could help?

  • bdavis (Contributor)

    There is a solution for this - contact your CSM and inquire about their data stacker. 
