It's easy to remove duplicates... how do I KEEP only the duplicates in a dataset?
I have a large dataset that I am constantly adding to, and the unique identifier for each item is a 17- or 22-character text string. It is critical for me to quickly identify whether I have duplicated a previous item when I add to the dataset.
ETL makes it simple to remove duplicates from a dataset... but is there a way to eliminate everything BUT the duplicates? Ideally, I'd like to either:
1. Create an alert anytime a new duplicate value is added to the dataset, or
2. Create an Output Dataset that consists ONLY of the rows that have a duplicate value in a specific column.
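(For anyone approaching this outside of Magic ETL, option 2 maps to a one-line pandas idiom; a minimal sketch, assuming the identifier column is named "id" — the column name and toy data are stand-ins, not from the actual dataset:

    import pandas as pd

    # Toy data standing in for the real dataset of 17- or 22-character IDs.
    df = pd.DataFrame({"id": ["A", "B", "A", "C", "B"], "value": [1, 2, 3, 4, 5]})

    # keep=False flags EVERY row whose id occurs more than once
    # (including the first occurrence), i.e. it keeps only the duplicates.
    dupes = df[df.duplicated(subset="id", keep=False)]
    print(dupes)
)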
Thanks in advance for any help.
Best Answer
Try this:
Items you will need:
Input table for your dataset.
Group By
2 Filters
Join
Output Dataset
Connect the Group By to your Input Dataset.
In step 1, put the field you want to check for duplicates.
In step 2, put that same field again, but do a Count on it and name it "Count".
Connect this Group By to a Filter, and filter for "Count Greater Than 1".
Drag another connector from your Input Dataset to the Join, and drag a connector from your first Filter to the Join (basically joining the dataset to itself, after grouping and filtering). Join these two on the field you were checking for duplicates.
Connect your Join to the 2nd Filter, and filter for "Count is not null".
Connect that Filter to your Output Dataset, which should produce a table showing ONLY the rows that have more than one occurrence of the field you were checking.
Optionally, you can also add a Select Columns tile between the 2nd Filter and the Output Dataset, selecting all columns except the 2nd copy of the field you were checking for duplicates. Because both sides of the Join will have that field, it will appear twice in the final output unless you remove it this way. (A pandas sketch of this same flow follows these steps.)
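If it helps to see the same flow as code, here is a minimal pandas sketch of the steps above, assuming the duplicate-check field is named "id" (the column name and sample rows are stand-ins):

    import pandas as pd

    # Stand-in input dataset; "id" is the field being checked for duplicates.
    df = pd.DataFrame({"id": ["A", "B", "A", "C"], "value": [1, 2, 3, 4]})

    # Group By: count occurrences of each id and name the result "Count".
    counts = df.groupby("id").size().reset_index(name="Count")

    # Filter 1: keep only ids with Count greater than 1.
    counts = counts[counts["Count"] > 1]

    # Join: left-join the input back to the filtered counts, so rows
    # without a duplicate get a null Count.
    joined = df.merge(counts, on="id", how="left")

    # Filter 2: "Count is not null" keeps only the duplicated rows.
    result = joined[joined["Count"].notna()]

    # Select Columns: drop the helper Count column (merging on "id" in
    # pandas does not duplicate the key column, so only Count needs dropping).
    result = result.drop(columns="Count")
    print(result)

Note that an inner join would make the second filter unnecessary; the left join plus the null check mirrors the tile setup described above.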
Good luck!
Answers
-
Thank you... that worked perfectly, and your instructions were clear and easy to implement!
-
Worked a treat, thanks!
-
I tried this but I keep getting duplicates.
What Join did you use, and which columns from which dataset did you drop?
My filter found no nulls.