Compare rows and delete if one of the fields do not match

colinr

Hello,

I have data like this.

1, red

1, blue

2, purple

3, yellow

3,red

I'd like to edit the data to show the data like this:

1, red

2, purple

3, yellow

So I'd only like to only keep duplicates if the first and second field each have a match in the data. Any ideas? Assuming it is ETL somehow but can't wrap my head around it.

Quick Links

Accepted answers

n8isjack

First, let me make sure I have my head wrapped around what you are trying to do.

Playing with your data, with a name for the columns:

Number   Color
1        red
1        blue
2        purple
2        purple
3        yellow
3        red
3        blue
3        yellow

If I understand correctly, you don't want the `Number` column to be associated with more than one value for `Color`. And if there are multiple values associated with the `Number` then you only want to keep the first found value. Also, any duplicates that have the same value as the first one can also be kept?

Assuming that I am on the right track, you can get there. Here is the way I think I would do it using Magic ETL:

Use a [Rank and Window] action to add a "Row Number" column that restarts at 1 for each `Number value`. These settings should work:

Functions, just add one:
1. "Row"
2. Row Number
Sorting is required and may re-arrange your data. If you are worried about which comes first you will need an extra column that can be sorted in your data.
Ascending
`Number`

The data will look like this with the new column:

Number   Color      Row
1        red        1
1        blue       2
2        purple     1
2        purple     2
3        yellow     1
3        red        2
3        blue       3
3        yellow     4

Then split to two different [Filter Rows] actions. One where `Row` = 1 and the other where `Row` Not = 1 and they will look like this:

[Filter 1]: Row = 1
Number   Color      Row
1        red        1
2        purple     1
3        yellow     1

[Filter 2]: Row Not = 1
Number   Color      Row
1        blue       2
2        purple     2
3        red        2
3        blue       3
3        yellow     4

We will reference them as [Filter 1] and [Filter 2]

This gets a little harder to conceptualize, I'll try my best.

Concept: Remove the rows we don't want from the [Filter 2] set and then append it back into the [Filter 1] set. Sounds simple enough.

Find valid matches:

To identify them we will do a [Join Data] action with these settings:

[Filter 2] is on the left ([Filter 1] goes on the right automatically)
Inner Join
Key Fields
- Number
- Color

The preview should look like this at that point:

Number   Color      Row  Number_1  Color_1  Row_1
2        purple     2 
3        yellow     4

Now we want to add this data back into the results of [Filter 1] and we will use an [Append Rows] action. In order to ignore the new columns with an "_1" after them you want to choose the correct dataset in Step 1 of the append rows action. You want to pick the [Filter 1] input.

That should do it. Your final preview should look like:

Number   Color      Row
1        red        1
2        purple     1
3        yellow     1
2        purple     2 
3        yellow     4

You can use a [Select Columns] action to remove the `Row` column if it bugs you.

All comments