Loading a huge amount of data to Domo using Workbench 5 - Issue

Hi All,

 

I'm new to Domo administration. I have configured a dataset connecting to an Oracle DB from Workbench 5, and I have the following questions about a large data load issue.
 

1. I have two tables in the Oracle DB (Tables A & B), and both have a huge amount of data. Table 'A' has 500 million rows and 20 columns, and Table 'B' has 400 million rows and 10 columns. I would like to know the best practice for importing this data to Domo.


2. I tried to import these two tables (A & B) to Domo by increasing the 'Query Execution Timeout' limit in Workbench, but the job keeps failing, possibly due to the huge amount of data.


Please let me know the best practice for loading the data to Domo using Workbench. I also need to do daily upserts of the last 30 days to the dataset.

 

Also, I tried to apply the logic (select * from table where cluster between '{lastvalue:cluster}!' and '!{lastvalue:cluster}!'+2000000 order by cluster) to do an initial data load to the dataset, and when I executed the job I got a "Replacement parameter does not contain a value for column" error in Workbench. Could someone please guide me through the steps to do an initial full load and then daily upserts for the last 30 days?


Please let me know if you have any questions,

 

Thanks in advance

 

Thanks,

Hari

Best Answer

  • NewsomSolutions (Contributor)
    Answer ✓

    Good Morning @user06848 ,

     

    I see there haven't been any public responses to this, so I'm going to see if I can help, even if it's just getting you to think about it differently.

     

    1.  For the 'best practice for a large amount of data' in Workbench, what I've done in the past is use the scheduling feature to import smaller segments of data over and over again.  I'd do this by using the replacement variables and by limiting how many days of data each run would pull in, say 10 days at a time, and I would just keep appending all of that data together (see the sketch just after point 2).  You may not have that as an option, so you could also try limiting the columns or creating indexes on the Oracle side and see if that helps, especially if your query has something like "where date > x" that is causing a problem.

     

    2. For the timeout, how long is it running?  I've run jobs before that held over 3B rows and ran for 16 hours, so maybe just go in and raise the timeout limit to 99999 and see if that fixes it.
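    To make the chunking idea in point 1 concrete, here is a minimal sketch of a date-windowed Workbench query. The table and column names (TABLE_A, LOAD_DATE, etc.) are placeholders, and it assumes the replacement variable comes back in the yyyy-mm-dd hh:mm:ss form mentioned further down, with the job's update method set to Append:

        -- Sketch only: pull roughly one 10-day window per run and append it.
        -- !{lastvalue:LOAD_DATE}! is the Workbench replacement variable holding
        -- the last LOAD_DATE value captured on the previous run.
        SELECT ORDER_ID, CUSTOMER_ID, AMOUNT, LOAD_DATE
        FROM   TABLE_A
        WHERE  LOAD_DATE >  TO_DATE('!{lastvalue:LOAD_DATE}!', 'YYYY-MM-DD HH24:MI:SS')
        AND    LOAD_DATE <= TO_DATE('!{lastvalue:LOAD_DATE}!', 'YYYY-MM-DD HH24:MI:SS') + 10
        ORDER  BY LOAD_DATE

    Scheduling that job to run repeatedly walks forward through the table about 10 days at a time until it catches up.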

     

    For all of this, you may want to think about how much data you can get away with importing in one shot... if you're worried about duplicates or holes, there are ways around that too.  You could do one Workbench job that pulls in smaller segments (more current data) and another that does a historical pull (less current data), and then join the two in ETL using a recursive dataflow (https://knowledge.domo.com/Prepare/DataFlow_Tips_and_Tricks/Creating_a_Recursive%2F%2FSnapshot_ETL_DataFlow).

     

    For the issue with the replacement variable, try naming your columns in your query instead of using "*".  I know that technically "should" work, but Workbench is a bit funny to me.  I've written almost the exact same queries using replacement variables and one works while the other doesn't.  So it may be that you just need to try it again and see if it works any better.
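    For instance, the original query might be rewritten with the columns named explicitly, something along these lines (CLUSTER_ID and COL1..COL3 are placeholders for the real column names, and the key is assumed to be numeric):

        -- Sketch only: same idea as the original query, but without SELECT *.
        SELECT CLUSTER_ID, COL1, COL2, COL3
        FROM   TABLE_A
        WHERE  CLUSTER_ID BETWEEN !{lastvalue:CLUSTER_ID}! AND !{lastvalue:CLUSTER_ID}! + 2000000
        ORDER  BY CLUSTER_ID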

     

    I hope that helps.
    Matt

Answers


  •  

    Thanks, Matt, for providing the steps to solve the data load issue.

     

    1. I was able to load the data in small chunks (row count <= 5 million rows). I applied a replacement variable on one of the unique columns and was able to load the data by scheduling the job to run every 15 minutes.

     

    2. Per your suggestion, I increased the job timeout limit to 99999, and I don't see any connectivity issues.

     

    3. I have another dataset where I have to load the data using a date range, and I wanted to check whether we can use a date (format only in MM-DD-YYYY) in a replacement variable so the job runs automatically, instead of manually updating the WHERE clause in the Edit Query section each time I load the data.

     

    Please let me know your suggestion on point 3.

     

    Thanks,

    Hari

  • @user06848 Yes, you should be able to do that.  In my limited experience using a date for the replacement variable, though, what I found was that no matter what format your date is set as, it comes back from the replacement variable as something like yyyy-mm-dd hh:mm:ss, and you then have to strip the time out and convert the date to the format you'd like.
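    For example, a minimal Oracle-side sketch of that conversion, assuming the variable comes back as yyyy-mm-dd hh:mm:ss and using placeholder table and column names, might look like:

        -- Sketch only: parse the replacement value, strip the time with TRUNC,
        -- and pull the trailing 30-day window for the daily upsert.
        SELECT ORDER_ID, ORDER_DATE, AMOUNT
        FROM   TABLE_B
        WHERE  ORDER_DATE >= TRUNC(TO_DATE('!{lastvalue:ORDER_DATE}!', 'YYYY-MM-DD HH24:MI:SS')) - 30
        ORDER  BY ORDER_DATE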

     

    I hope that helps.

    Matt

This discussion has been closed.