Magic ETL trigger timing - input datasets of varied size

I think I just discovered that if a Magic ETL is configured to run "only when datasets are updated" and has multiple input datasets, it will start running as soon as the smallest/quickest input dataset has been updated. If a larger/slower input dataset finishes updating before the ETL finishes running, the ETL will not run again.

 

Timeline of a concrete example that happened this morning, for an ETL that has six input datasets. I've numbered them Input 1 through Input 6 in order of size; Input 1 has 39 rows and Input 6 has over a million rows:

 

05:00:00 - Inputs 1-6 start updating
05:00:17 - Input 1 finishes updating, ETL starts running
[various times] - Inputs 2-5 finish updating
05:03:04 - Input 6 finishes updating
05:03:58 - ETL finishes running
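The race above can be sketched as a tiny simulation. This is a hypothetical model of the observed behavior, not Domo's actual scheduler: the ETL fires on the first input's update, and updates that finish mid-run neither join the run nor queue a re-run.

```python
# Hypothetical model of the observed trigger behavior (not Domo internals):
# the run starts when the first input finishes, and only inputs that had
# already finished by then contribute new data.

def simulate(update_finish_times, etl_duration):
    """Return (inputs whose new data made the run, run start, run end)."""
    start = min(update_finish_times.values())   # first input to finish triggers the run
    end = start + etl_duration
    included = {name for name, t in update_finish_times.items() if t <= start}
    # Inputs finishing in (start, end] are missed and do not trigger another run.
    return included, start, end

finish = {"Input 1": 17, "Input 6": 184}        # seconds after 05:00:00
included, start, end = simulate(finish, etl_duration=221)
print(included)   # {'Input 1'} -- Input 6's 05:03:04 update is missed
```

With the timeline above, the run ends 238 seconds in (05:03:58), so Input 6's update at 184 seconds lands mid-run and is silently dropped.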

 

Unfortunately the new data in Input 6 was not included in the ETL run, and the ETL was not triggered to run again. This makes sense in a way, but it also seems like a bug.

 

I'm going to report it as a bug to Domo Support but meanwhile was wondering if anyone else has experienced this or has any tips for simple/reliable workarounds?


Comments

  • Ritwik (Contributor)

    I've also struggled with this. 

     

    There was another post on this topic a while back. One 'workaround' suggested there was adding a dummy input dataset that updates on a frequent schedule, say every 20 or 30 minutes (you have flexibility on the interval and time range). That way, even if one of your real inputs hadn't finished updating when the master dataflow ran, its new data will likely be picked up on the next run.

     

    Far from perfect, especially if your inputs are very large datasets that you'd only want to process once a day each in the morning.

     

    A more sophisticated way to communicate this to the Domo dataflow UI would be incredibly useful: "Run when ALL inputs have been successfully run & updated once on the current day".


  • Quoting Ritwik: "A more sophisticated way to communicate this to the Domo dataflow UI would be incredibly useful. 'Run when ALL inputs have been successfully run & updated once on the current day'."


    Yes, a config option like that would help. On the other hand it does seem like this is actually a bug. I think it's pretty reasonable to interpret "Run when datasets are updated" as "run every time any dataset is updated", not "run when datasets are updated unless they finished updating while a run was already in progress".

     

    Each ETL could have a queue of input-dataset updates and could run once for each entry in the queue... or, when triggered to run, an ETL could wait as long as any input dataset was still in the process of updating, and throw an error/alert if it had to wait too long. I'll check with Domo Support.
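The "wait until inputs settle" idea can be sketched like this. Everything here is hypothetical — `is_updating`, `run_etl`, and the polling loop are stand-ins for illustration, not real Domo APIs:

```python
import time

def run_when_inputs_settled(inputs, is_updating, run_etl,
                            timeout_s=600, poll_s=5, sleep=time.sleep):
    """Poll until no input dataset is mid-update, then run the ETL once.

    inputs      -- list of input dataset ids
    is_updating -- callable(dataset_id) -> bool (still updating?)
    run_etl     -- callable() that executes the dataflow

    Raises TimeoutError (the error/alert mentioned above) if inputs are
    still updating after timeout_s seconds.
    """
    waited = 0
    while any(is_updating(d) for d in inputs):
        if waited >= timeout_s:
            raise TimeoutError(f"inputs still updating after {timeout_s}s")
        sleep(poll_s)
        waited += poll_s
    run_etl()
```

In this model the trigger still fires on the first input's update, but the run itself is deferred until every input is idle — which would have let the 05:00 run above include Input 6's data.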

  • This is why I will only select one data source when telling Domo to run when datasets update. I do think it would be a product improvement if Domo could check whether the dataset had a run that started after each of the input datasets last updated.

     

    You should be able to set an alert with the "Domo Stats - Data Sets" data. You would need to identify the input datasets and then create a beastmode that takes the max value of `DataSet Last Updated Date/Time` across those inputs and compares it to your dataset's last updated date/time. This wouldn't be perfect, though, because you could still have a case where the dataset in question (the final dataset) completed after all of the inputs but had started before one of the input datasets was finished.
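The comparison described above amounts to checking whether any input was updated after the output dataset. A minimal illustration, using made-up timestamps from the timeline in the original post rather than real "Domo Stats - Data Sets" rows:

```python
from datetime import datetime

def output_is_stale(input_updated_times, output_updated_time):
    """True if the newest input update is later than the output's last update."""
    return max(input_updated_times) > output_updated_time

inputs = [
    datetime(2020, 1, 22, 5, 0, 17),   # Input 1 finished updating
    datetime(2020, 1, 22, 5, 3, 4),    # Input 6 finished updating (last)
]
output = datetime(2020, 1, 22, 5, 3, 58)   # ETL finished after all inputs

print(output_is_stale(inputs, output))     # False
```

This prints False — the output *looks* fresh even though Input 6's data was missed, because the ETL started before Input 6 finished. That is exactly the imperfect case the comment describes, and why a start time in the stats dataset would be needed to catch it.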

     

    To really get this done, it would be great if the start time was included in the Domo Stats dataset. [screenshot: 2020-01-22_14-08-23.png]

     


    “There is a superhero in all of us, we just need the courage to put on the cape.” -Superman
  • That said, something like the dummy-dataset idea should work for me for now. I can configure my smallest input dataset (less than 100 rows, and unlikely to get much larger) to update hourly and that should take care of it. Thanks!

  • timehat (Domo Employee)

    Just want to add a comment to confirm that the team that owns this feature is aware of this pain point. I can't speak to timeline or anything, but I know it's helpful to know that the issue is well understood and on the radar.

  • Do you have the case number for this issue, please? This bug is causing me problems and I'd like to find out when it is going to get fixed.