It's easy to remove duplicates... how do I KEEP only the duplicates in a dataset?
I have a large dataset that I am constantly adding to, and the unique identifier for each item is a 17- or 22-character text string. It is critical for me to quickly identify whether I have duplicated a previous item when I add to the dataset.
ETL makes it simple to remove duplicates from a dataset... but is there a way to eliminate everything BUT the duplicates? Ideally, I'd like to either:
1. Create an alert anytime a new duplicate value is added to the dataset, or
2. Create an Output Dataset that consists ONLY of the rows that have a duplicate value in a specific column.
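For concreteness, here is a tiny sketch of the output I'm after in option 2, written in pandas with made-up column names and sample data (my real identifier column isn't shown here):

```python
import pandas as pd

# Made-up sample data; "id" stands in for the 17/22-character identifier column
df = pd.DataFrame({"id": ["A1", "B2", "A1", "C3"], "value": [1, 2, 3, 4]})

# Desired output: every row whose id appears more than once,
# including the first occurrence (keep=False marks all copies)
dupes_only = df[df.duplicated("id", keep=False)]
print(dupes_only)  # only the two "A1" rows survive
```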
Thanks in advance for any help.
Best Answer
-
Try this:
Items you will need:
Input table for your dataset.
Group by
2 Filters
Join
Output Dataset
Connect the Group By to your input dataset.
In step 1) put the field you want to check for duplicates.
In step 2) also put the field you want to check for duplicates, but do a Count on it and name it "Count".
Connect this Group By to a filter, and filter for "Count Greater Than 1".
Drag another connector from your Input Dataset to the Join, and drag a connector from your first Filter to the Join (basically joining the dataset to itself, after grouping and filtering). Join these two based on the field you were checking for duplicates.
Connect your Join to the 2nd filter, and filter for "Count is not null".
Connect your Filter to an Output Dataset, which should produce a table showing you ONLY the rows that have more than one occurrence of the field you were checking.
Optionally, you can also add a Select Columns tile between the 2nd filter and the Output Dataset, selecting all columns except the 2nd copy of the field you were checking for duplicates. Because both inputs to the Join have that field, it will appear twice in the final dataset unless you remove it this way.
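If it helps to see the same pattern outside of Magic ETL, here is a minimal pandas sketch of the group-by / filter / join logic above (the column name "item_id" and the sample data are assumptions for illustration, not from the original dataflow):

```python
import pandas as pd

# Sample input dataset; "item_id" stands in for your identifier column
df = pd.DataFrame({
    "item_id": ["A1", "B2", "A1", "C3", "B2"],
    "value":   [10,   20,   30,   40,   50],
})

# Group By tile: count rows per identifier
counts = df.groupby("item_id").size().reset_index(name="Count")

# 1st Filter tile: keep only identifiers with Count > 1
dupes = counts[counts["Count"] > 1]

# Join tile: join the original dataset back to the duplicate list.
# An inner join keeps only matching rows, so the "Count is not null"
# step is implicit here; with a left join you would filter out the
# null-Count rows afterwards, as in the steps above.
only_duplicates = df.merge(dupes, on="item_id", how="inner")

# Select Columns tile: drop the helper column before output
only_duplicates = only_duplicates.drop(columns=["Count"])
print(only_duplicates)  # only the "A1" and "B2" rows remain
```

The "Count is not null" filter exists because a left-style join keeps every row from the input dataset: rows whose identifier didn't survive the Count > 1 filter come through with a null Count and need to be removed. An inner join on the identifier accomplishes the same thing in one step.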
Good luck!
Answers
-
Thank you... that worked perfectly, and your instructions were clear and easy to implement!
-
Worked a treat. Thanks!
-
I tried this, but I keep getting duplicates.
Which join did you use, and which columns from which dataset did you drop?
My filter found no nulls.