MySQL UNION not removing duplicates

bdavis
bdavis Contributor

I have a historical dataset (HIST) and one that's updated daily (DAILY).  The DAILY contains the past 3 days worth of data.

 

So the statement:

SELECT * FROM DAILY
UNION
SELECT * FROM HIST

...should produce a combined dataset, with no duplicate rows. But, I'm getting duplicate rows in my output as if I had used a UNION ALL instead. I know how to work around it, but I'm not understanding why it's happening in the first place. The HIST set was created from the DAILY with no transforms. 

 

Here's what's interesting, after it's ran and creates 3 days of duplicate data - if I run it again it doesn't create any more records. So, I interjected a transform to delete the past 2 days from HIST (leaving 1) and ran the dataflow again. It didn't duplicate anything at all. Ok, so I run it one last time (importing the past 3 days into a set that already has all 3 days in it) and expect no change. Nope, it duplicated all 3 days again.

 

Again, I'm just going to use a `SELECT DISTINCT` after the union to work around this, but I don't understand why this is occuring in Domo, but not on my local MySQL.

Comments

  • AS
    AS Coach

    Are you seeing this behavior in the full output, or just in the transform preview? I've noticed transform preview problems before that end up being find when I run it, but if it's part of a fully run dataflow, then it sounds like a bug that should be reported.

    Your workaround with a DISTINCT sounds appropriate here. You might also try using redshift SQL instead.

    Aaron
    MajorDomo @ Merit Medical

    **Say "Thanks" by clicking the heart in the post that helped you.
    **Please mark the post that solves your problem by clicking on "Accept as Solution"
  • So both sets of data have the exact same records? There's no like load date column or anything that might be different between the two?

     

    If not, I agree with @AS and I would reach out to support to have them check into it.

     

    Best of luck,

    Valiant

  • bdavis
    bdavis Contributor

    It's in the full output. I've also seen the previews displaying very incorrect data but generating the proper output. 

     

    I'm convinced something else had to occur (even though these particular queries are as simple as you get). So I'm going to nuke it all, push the historical data, and recreate the DF again to see if it resolves.

     

    UPDATE: So I tried it again from scratch - same results. Strangely, even SELECT DISTINCT * didn't work. I downloaded the data and verified it's 100% duplicated data via SQL and Excel. So I nuked it all again, uploaded the historical, and created the HIST dataset. Then I uploaded the past 3 days, created a transform that did the union between them, and the output query is just SELECT DISTINCT * FROM combo. It's still not removing duplicates. I've contacted support. This is quite strange as I've done this many times before with success.

  • AS
    AS Coach

    Interesting.  The only thing that comes to mind as @Valiant suggests is that there's a value, like a timestamp, that's a little bit different on each side.

    Maybe try doing the distinct or union logic on the datasets but removing suspicious columns beforehand, in order to determine if one column (or more) has distinction across datasets.

    Aaron
    MajorDomo @ Merit Medical

    **Say "Thanks" by clicking the heart in the post that helped you.
    **Please mark the post that solves your problem by clicking on "Accept as Solution"
  • bdavis
    bdavis Contributor

    I downloaded the dataset that has the duplicate values, loaded it into R and found duplicate rows (the exact same ones in the daily export). There's certainly no calculated field happening. I truly can't explain it. Hopefully support has an idea.

  • AS
    AS Coach

    That's really odd.
    If you use that output dataset in a card and do a distinct aggregation in a beast mode, do you get the same result as without the distinct?

    Aaron
    MajorDomo @ Merit Medical

    **Say "Thanks" by clicking the heart in the post that helped you.
    **Please mark the post that solves your problem by clicking on "Accept as Solution"
  • bdavis
    bdavis Contributor

    No, different results. This is a bizzarre scenario, I'm really stumped. I've done the same steps on a local MySQL installation and I get expected results. I've done it in R and gotten expected results. I'm not sure what's unique about the dataset/dataflow on Domo.