Magic ETL

Magic ETL

Compare rows and delete if one of the fields do not match

Hello,

 

I have data like this.

1, red

1, blue

2, purple

2, purple

3, yellow

3,red

 

I'd like to edit the data to show the data like this:

1, red

2, purple

2, purple

3, yellow

 

So I'd only like to only keep duplicates if the first and second field each have a match in the data. Any ideas? Assuming it is ETL somehow but can't wrap my head around it.

Best Answer

  • Contributor
    Answer ✓

    First, let me make sure I have my head wrapped around what you are trying to do.

     

    Playing with your data, with a name for the columns:

     

    1. Number Color
      1 red
      1 blue
      2 purple
      2 purple
      3 yellow
      3 red
      3 blue
      3 yellow

    If I understand correctly, you don't want the `Number` column to be associated with more than one value for `Color`. And if there are multiple values associated with the `Number` then you only want to keep the first found value. Also, any duplicates that have the same value as the first one can also be kept?

     

    Assuming that I am on the right track, you can get there. Here is the way I think I would do it using Magic ETL:

     

    Use a [Rank and Window] action to add a "Row Number" column that restarts at 1 for each `Number value`. These settings should work:

    1. Functions, just add one:
      1. "Row"
      2. Row Number
    2. Sorting is required and may re-arrange your data. If you are worried about which comes first you will need an extra column that can be sorted in your data.
    3. Ascending
    4. `Number`

     

    The data will look like this with the new column:

     

    1. Number Color Row
      1 red 1
      1 blue 2
      2 purple 1
      2 purple 2
      3 yellow 1
      3 red 2
      3 blue 3
      3 yellow 4

    Then split to two different [Filter Rows] actions. One where `Row` = 1 and the other where `Row` Not = 1 and they will look like this:

     

    1. [Filter 1]: Row = 1
      Number Color Row
      1 red 1
      2 purple 1
      3 yellow 1

      [Filter 2]: Row Not = 1
      Number Color Row
      1 blue 2
      2 purple 2
      3 red 2
      3 blue 3
      3 yellow 4

    We will reference them as [Filter 1] and [Filter 2]

     

    This gets a little harder to conceptualize, I'll try my best. 

     

    Concept: Remove the rows we don't want from the [Filter 2] set and then append it back into the [Filter 1] set. Sounds simple enough.

     

    Find valid matches:

    To identify them we will do a [Join Data] action with these settings:

    • [Filter 2] is on the left ([Filter 1] goes on the right automatically)
    • Inner Join
    • Key Fields
      • Number
      • Color

    The preview should look like this at that point:

    1. Number Color Row Number_1 Color_1 Row_1
      2 purple 2
      3 yellow 4

    Now we want to add this data back into the results of [Filter 1] and we will use an [Append Rows] action. In order to ignore the new columns with an "_1" after them you want to choose the correct dataset in Step 1 of the append rows action. You want to pick the [Filter 1] input.

     

    That should do it. Your final preview should look like:

    1. Number Color Row
      1 red 1
      2 purple 1
      3 yellow 1
      2 purple 2
      3 yellow 4

    You can use a [Select Columns] action to remove the `Row` column if it bugs you.

     


    **Say "Thanks" by clicking the "heart" in the post that helped you.
    **Please mark the post that solves your problem by clicking on "Accept as Solution"

Answers

  • Contributor
    Answer ✓

    First, let me make sure I have my head wrapped around what you are trying to do.

     

    Playing with your data, with a name for the columns:

     

    1. Number Color
      1 red
      1 blue
      2 purple
      2 purple
      3 yellow
      3 red
      3 blue
      3 yellow

    If I understand correctly, you don't want the `Number` column to be associated with more than one value for `Color`. And if there are multiple values associated with the `Number` then you only want to keep the first found value. Also, any duplicates that have the same value as the first one can also be kept?

     

    Assuming that I am on the right track, you can get there. Here is the way I think I would do it using Magic ETL:

     

    Use a [Rank and Window] action to add a "Row Number" column that restarts at 1 for each `Number value`. These settings should work:

    1. Functions, just add one:
      1. "Row"
      2. Row Number
    2. Sorting is required and may re-arrange your data. If you are worried about which comes first you will need an extra column that can be sorted in your data.
    3. Ascending
    4. `Number`

     

    The data will look like this with the new column:

     

    1. Number Color Row
      1 red 1
      1 blue 2
      2 purple 1
      2 purple 2
      3 yellow 1
      3 red 2
      3 blue 3
      3 yellow 4

    Then split to two different [Filter Rows] actions. One where `Row` = 1 and the other where `Row` Not = 1 and they will look like this:

     

    1. [Filter 1]: Row = 1
      Number Color Row
      1 red 1
      2 purple 1
      3 yellow 1

      [Filter 2]: Row Not = 1
      Number Color Row
      1 blue 2
      2 purple 2
      3 red 2
      3 blue 3
      3 yellow 4

    We will reference them as [Filter 1] and [Filter 2]

     

    This gets a little harder to conceptualize, I'll try my best. 

     

    Concept: Remove the rows we don't want from the [Filter 2] set and then append it back into the [Filter 1] set. Sounds simple enough.

     

    Find valid matches:

    To identify them we will do a [Join Data] action with these settings:

    • [Filter 2] is on the left ([Filter 1] goes on the right automatically)
    • Inner Join
    • Key Fields
      • Number
      • Color

    The preview should look like this at that point:

    1. Number Color Row Number_1 Color_1 Row_1
      2 purple 2
      3 yellow 4

    Now we want to add this data back into the results of [Filter 1] and we will use an [Append Rows] action. In order to ignore the new columns with an "_1" after them you want to choose the correct dataset in Step 1 of the append rows action. You want to pick the [Filter 1] input.

     

    That should do it. Your final preview should look like:

    1. Number Color Row
      1 red 1
      2 purple 1
      3 yellow 1
      2 purple 2
      3 yellow 4

    You can use a [Select Columns] action to remove the `Row` column if it bugs you.

     


    **Say "Thanks" by clicking the "heart" in the post that helped you.
    **Please mark the post that solves your problem by clicking on "Accept as Solution"
  • Fantastic Reply!  thank you for the walkthrough and that should do it!  Will let you knowif I have any questions but looks somewhat straightforward!

Welcome!

It looks like you're new here. Members get access to exclusive content, events, rewards, and more. Sign in or register to get started.
Sign In