Remove Duplicates - Keep Most Current Instance?

Echelon
Echelon Contributor

I have a table of Opportunities from Salesforce that I am pulling into Domo, and Appending on a regular basis, for any Opportunities that were modified by a User or Automated Process in the last day. The purpose would be to have the most current records of all fields, for every Opportunity.

 

What I'm finding is that I most definitely need to de-duplicate in order to have accurate data. However, I'm concerned using the Duplicate Remover in MagicETL, on how to ensure that the Opportunity that is left in the Dataset, is the most current version.

 

What is the best way to approach this?

Best Answer

  • zcameron
    zcameron Domo Employee
    Answer ✓

    The method I describe in the document I uploaded on that thread doesn't actually use the remove duplicates method for that reason. I'll post the doc again here. Take a look and see if it makes any sense for you.

     

    One note is that you can skip the SELECT COLUMNS transform if in your append transform you choose to include columns that come from the branch that contains your NEW data. It will then automatically remove all the column names with "_1" attached.

Answers

  • zcameron
    zcameron Domo Employee

    Echelon,

     

    I use a technique called a Recursive Dataflow to handle this type of situation. There's documentation in the Help Center about how to do this. Also, check out this thread in Dojo for a conversation about this topic and a document I posted with instructions on how to do it in Magic ETL.

     

    I hope that helps!

  • Echelon
    Echelon Contributor

    I did find that, and it has definitely got me in the right direction.

     

    One piece that I can't figure out though, is how does the de-duplicating determing which row to remove and which to leave?

     

    What I need to do, is to update the Table that I am creating, and replace it if there is an updated version of a row, with new information. I'm not sure how to remove the old row out of the Table.

     

    In my case I know I would use a unique identifier (ID in my case), but also a Modified Date. Whatever is older, I would filter/remove.

  • zcameron
    zcameron Domo Employee
    Answer ✓

    The method I describe in the document I uploaded on that thread doesn't actually use the remove duplicates method for that reason. I'll post the doc again here. Take a look and see if it makes any sense for you.

     

    One note is that you can skip the SELECT COLUMNS transform if in your append transform you choose to include columns that come from the branch that contains your NEW data. It will then automatically remove all the column names with "_1" attached.

  • Echelon
    Echelon Contributor

    This did it - I'm 99% sure I'm pulling in the correct data now...only 1% not sure, because the rows don't EXACTLY match between what's directly in Salesforce, and what is being spit out through the ETL. But, I think it must be just from lag time between the systems.

     

    Thanks so much for the help!

  • zcameron
    zcameron Domo Employee

    Glad to hear it!

     

    Nice work!

  • sdr
    sdr Member

    Does this technique require starting with a dataset that's already free of duplicates? It will append rows that are not dupes, but it won't dedupe this original data. Is that right?