How to remove duplicates in data sets

Hulachic5
Hulachic5 Member
edited March 2023 in Datasets

We have data that gets appended to the dataflow for historical purposes.  How do we remove the duplicate data created by the appending?  The column that determines which record should remain is "Date Uploaded".

 

HELP.

 

SPM

Best Answer

  • n8isjack-ret
    n8isjack-ret Domo Employee
    Answer ✓

    It is best to 'replace' the data from Workbench. That way the recursive dataflow is only fed the records that Workbench sent today.

     

    In Addition:

    One idea I would point out is "Intentional Overlap." I just made that term up, but I like it. When you are setting up Workbench to pull only the 'new' or 'recently updated' data you may want to ensure you pull the same data more than once to prevent loss.

     

    For example:

    I am pulling data every day. It runs every night and pulls all the data updated yesterday. That will include duplicates created on previous days that were updated yesterday, so I use the recursive dataflow. 

     

    Concern: One day something happens and the load doesn't run. I'm not sure why this would happen, but we're talking about 365 days a year, and the odds are that one day it will. The next night the job runs and pulls only one day in, and we never go back to pull the missed day. Now we are missing a day of data.

     

    How to prevent it? If I am pulling data every day, then I'll pull seven days in. Every day I pull seven days. Every day I have six days of duplicate data... but that is OK because my recursive dataflow handles de-duplicating it. When that one day comes and a data load is missed, the next night will catch it. In fact, I have a six-day buffer that protects me from such a problem. If some system is down and Domo cannot connect to it for six days, it will 'self-heal' on the seventh day and no one has to do anything.

    Former Domo employee you can find me in the Dojo Community here @n8isjack
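The seven-day "Intentional Overlap" above can be sketched as a toy simulation in Python (the dates, window size, and data are invented for illustration; this is not Workbench code, just the idea of overlap plus dedup making a missed load self-heal):

```python
from datetime import date, timedelta

# Hypothetical source system: one row per day for 30 days.
source = {date(2023, 3, 1) + timedelta(days=i): f"row-{i}" for i in range(30)}

history = {}  # stands in for the recursive dataflow output, keyed (deduped) by date
for run_day in sorted(source):
    if run_day == date(2023, 3, 10):
        continue  # the failed load: nothing is pulled that night
    window_start = run_day - timedelta(days=6)
    # Each nightly run pulls the last seven days, not just yesterday.
    pulled = {d: v for d, v in source.items() if window_start <= d <= run_day}
    history.update(pulled)  # dedup: later pulls simply overwrite earlier copies

# Despite the missed run, the overlap in later runs back-fills the gap.
print(len(history) == len(source))  # → True
```

The run on March 10 pulled nothing, but the March 11 run's seven-day window still covered March 10, so no day is lost.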

Answers

  • Hi @Hulachic5, you will want to use a recursive dataflow to deduplicate the data. This is different from using an append. We have some guidance for SQL and Magic ETL to help you with the framework.

     

    Basically, you will load the new data into a dataset that is set to "Replace" itself. That dataset feeds into a dataflow that identifies the incoming duplicates, removes the matching rows from the history, and appends the full new load onto the history.

     

    This can be tricky your first time but is very powerful once you have it down.

     

    In your case, where duplicates already exist, you will need to clean up the existing duplicates in a dataflow before starting the recursive dataflow. There are many ways to do that. One useful option is the 'Rank and Window' functions in Magic ETL. I would 'rank' the records so that the most recent record is rank 1 and the oldest has the highest rank. Then you can filter to only rank = 1 so that the older records are dropped.

    Former Domo employee you can find me in the Dojo Community here @n8isjack
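As a rough illustration of that rank-and-filter step, here is the equivalent logic in SQL, run through Python's sqlite3 for convenience (the table and column names are invented; in practice this would live in a Magic ETL 'Rank and Window' tile or a SQL dataflow):

```python
import sqlite3

# Hypothetical stand-in for the historical dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE history (record_id TEXT, value TEXT, date_uploaded TEXT)")
conn.executemany(
    "INSERT INTO history VALUES (?, ?, ?)",
    [
        ("A", "old", "2023-01-01"),
        ("A", "new", "2023-03-01"),  # duplicate of A, more recent upload
        ("B", "only", "2023-02-15"),
    ],
)

# Rank each record_id by Date Uploaded (newest = rank 1), then keep only rank 1.
deduped = conn.execute("""
    SELECT record_id, value, date_uploaded FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY record_id ORDER BY date_uploaded DESC
        ) AS rn
        FROM history
    ) WHERE rn = 1
    ORDER BY record_id
""").fetchall()
print(deduped)  # → [('A', 'new', '2023-03-01'), ('B', 'only', '2023-02-15')]
```

The older copy of record A is dropped and only the most recently uploaded row per key survives.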
  • Hi n8isjack --

     

    Is it wise to append the new data from Workbench into Domo, or to replace the data, since there is a recursive dataflow already in place?

  • jstan
    jstan Contributor

    Do you have any keys on the dataset, and do you use Workbench for updates? If so, you can do a full replace and then use an upsert dataset by flagging the key columns in the dataset. That way, you will always have the latest version of the data.

     

    Go to the schema in the Workbench job after the full replace. Then right-click on the column header and select the column chooser. From there, you can double-click Upsert. Check all the boxes for the columns that make up your unique table key. You should then be able to use a date field to pull just the last day or two, or use a last value, and you will not get any duplicates.

     

    This will only work if you can form a key with one or more columns in the dataset.

     

    Thanks,

    Josh
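A minimal sketch of the upsert semantics Josh describes, assuming a hypothetical two-column key of store_id and txn_date (the names and values are invented for illustration; Workbench does the real keyed upsert):

```python
# Existing dataset, keyed by the columns flagged as the upsert key.
existing = {
    ("S1", "2023-03-01"): {"amount": 100},
    ("S2", "2023-03-01"): {"amount": 250},
}

# Incoming load: one revised row and one brand-new row.
incoming = [
    {"store_id": "S1", "txn_date": "2023-03-01", "amount": 120},  # revision
    {"store_id": "S1", "txn_date": "2023-03-02", "amount": 80},   # new row
]

for row in incoming:
    key = (row["store_id"], row["txn_date"])
    existing[key] = {"amount": row["amount"]}  # insert, or overwrite by key

# Revised rows replace their old versions instead of duplicating them.
print(len(existing), existing[("S1", "2023-03-01")]["amount"])  # → 3 120
```

This only works when one or more columns really do form a unique key, which is the same caveat noted above.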

  • I know this is five years old, but this no longer works. I'm using Workbench to load datasets and we are on a consumption model, which means we bought "credits" and data loads count against them.

    When I try to update a dataset with only recent transactions, doing a replace without a "where" clause works, but it seems we are charged execution credits even when no updates or additions occurred.

    There is no way to assign unique keys to a dataset.