Duplicates after joining datasets

Hello, I have two data sets, that I ma trying to combine,

And

After I combine the sets, I get dupplicates

Any reccomendations?

Best Answers

  • GrantSmith
    GrantSmith Coach
    Answer ✓

    It appears your join is only on the month. If you have multiple values in the supplier code codes it will cause a cross product join creating duplicates. Try and include the supplier code in addition to the month in your join.

    **Was this post helpful? Click Agree or Like below**
    **Did this solve your problem? Accept it as a solution!**
  • RobSomers
    RobSomers Coach
    Answer ✓

    @Newbie23 Sine you're just joining on month it will cause duplication. For each line in your Test QTY Rejected dataset, it will take the month and look for that month in Test Received QTY. Since that one has multiples of the same month, it will add a row for each time that month appears. You would need to add supplier code as another primary key in your join. That would work as long as your Test QTY Rejected has no duplicate months per supplier code. If it still doesn't work, then you need to do further groupings in your datasets so that each one is at a supplier-month grain (each month only appears once per supplier.)

    **Was this post helpful? Click Agree or Like below**

    **Did this solve your problem? Accept it as a solution!**

Answers

  • How are you combining your data? Are you using a join in Magic ETL? What kind of join are you using? What are the values you're joining? It appears you may have an issue with how you're joining your two datasets together. Make sure you have all of the key fields in the join to avoid duplication of data.

    **Was this post helpful? Click Agree or Like below**
    **Did this solve your problem? Accept it as a solution!**
  • Hi,

    I am using Magic ETL, and using the outer left, since I only need one column from the QTY received. I did noticed one data set has move rows than the other, but I was thinking those will be blank and the data will aligned based on the month.

  • GrantSmith
    GrantSmith Coach
    Answer ✓

    It appears your join is only on the month. If you have multiple values in the supplier code codes it will cause a cross product join creating duplicates. Try and include the supplier code in addition to the month in your join.

    **Was this post helpful? Click Agree or Like below**
    **Did this solve your problem? Accept it as a solution!**
  • RobSomers
    RobSomers Coach
    Answer ✓

    @Newbie23 Sine you're just joining on month it will cause duplication. For each line in your Test QTY Rejected dataset, it will take the month and look for that month in Test Received QTY. Since that one has multiples of the same month, it will add a row for each time that month appears. You would need to add supplier code as another primary key in your join. That would work as long as your Test QTY Rejected has no duplicate months per supplier code. If it still doesn't work, then you need to do further groupings in your datasets so that each one is at a supplier-month grain (each month only appears once per supplier.)

    **Was this post helpful? Click Agree or Like below**

    **Did this solve your problem? Accept it as a solution!**

  • Thank you both for your help, I did the join by month and supplier and the data looks good, however, the new datasets only shows two suppliers and I have a lot more that that. When I filter by supplier trying to find the same supplier A58, it says not in preview or not rows matching the criteria.

  • Swap the two datasets you're using in your join. You're basing all of your dataset from the suppliers which have rejected records and not all vendors from the received records.

    TEST Received Qty should be on the left of your join, not the right.

    **Was this post helpful? Click Agree or Like below**
    **Did this solve your problem? Accept it as a solution!**
  • Are you suggesting this

    It works, see below, however my supplier A58 is not there, I wondering if I need to do some special grouping for my suppliers, most of my suppliers start with a letter and I don't see any of them, only see supplier codes that are numeric only

  • The main dataset for QTY received has 32 clolumns and I am extracting 3 columns: Month, supplier code and Qty received

  • Where does your A58 supplier exist? In both input datasets or just one?

    **Was this post helpful? Click Agree or Like below**
    **Did this solve your problem? Accept it as a solution!**
  • Exists in both.

    The QTY received has ALL the suppliers (several bussines units), I want to extract the suppliers for my bussines unit only ( about 15 suppupliers). the QTY rejected is for my business unit only.

    I am going to try to filter by supplier first then join my datasets.

  • I added more columns to my QTY received, like supplier name and company, and drop the supplier code. I will be sorting by supplier name instead of supplier code. It seams to be working fine, I will try it again tomorrow.

    Thanks for your help,