Any tips on how to clean Typo's in a dataset?
Hi all,
I have a large and growing dataset with a number of human error typo's within the source.
I am looking for a simple formula I can add to my ETL (or other method) to clean and fix common errors.
An example to illustrate the issue, in my dataset I may have a field called 'Company Name' and within that field, for the same company I see the following records:
Company A
Company|A
Company A.
Company B
Company|B
etc.
Is there a method to clean all references and remove the spurrious | and .? At the moment I am having to it for individual 'companies' rather than a single line for all companies. All help is very much appreciated.
Thanks
Joe
Best Answer
-
No problem! The REPLACE() function would work inside of either a Magic ETL or in a Beast Mode inside of a card. The function will strip out the pipes "|" and replace with SPACE and strip out the periods "." and not replace those with anything.
If you want to go the ETL route you'd use an Add Formula tile, name it the same name as your Company Name column and paste the two replaces inside where the formula goes.
The process for a beast mode in a card is pretty similar, inside of Analyzer you would click "Create a Calculated Field" and create the Beast Mode. One thing to note however is that inside the card you would have to give it a different name because in this case you're creating a new field rather than overwriting an existing one as you would have in the ETL
2
Answers
-
If you just remove the pipe symbols then you'd end up with a mix of "Company A", "CompanyA","Company B", "CompanyB" etc. so you'd need to replace pipe with a space. But if you replace the periods with a space then you'll have trailing whitespace. So I'd suggest you have two nested REPLACE() functions so that you can handle the different scenarios separately:
REPLACE(REPLACE(`Company Name`,'|',' '),'.','')
1 -
Thank you @ggenovese. Apologies for the basic questions, I am a fairly basic user!
Would I need to write each query for each company name or would 1 line work for Company A, B and C etc.?
Thanks
0 -
No problem! The REPLACE() function would work inside of either a Magic ETL or in a Beast Mode inside of a card. The function will strip out the pipes "|" and replace with SPACE and strip out the periods "." and not replace those with anything.
If you want to go the ETL route you'd use an Add Formula tile, name it the same name as your Company Name column and paste the two replaces inside where the formula goes.
The process for a beast mode in a card is pretty similar, inside of Analyzer you would click "Create a Calculated Field" and create the Beast Mode. One thing to note however is that inside the card you would have to give it a different name because in this case you're creating a new field rather than overwriting an existing one as you would have in the ETL
2 -
Thank you @ggenovese that has worked really well.
1 -
@Joe_Pitcher I second @ggenovese's advice. I would also recommend adding validation rules to manual data sources whenever possible to prevent these data quality issues upstream. For example, if your data is coming from a spreadsheet then consider adding drop-downs and datatype validation in the spreadsheet so that the users are required to enter consistent values.
1
Categories
- All Categories
- 1.8K Product Ideas
- 1.8K Ideas Exchange
- 1.5K Connect
- 1.2K Connectors
- 300 Workbench
- 6 Cloud Amplifier
- 9 Federated
- 2.9K Transform
- 100 SQL DataFlows
- 622 Datasets
- 2.2K Magic ETL
- 3.9K Visualize
- 2.5K Charting
- 744 Beast Mode
- 58 App Studio
- 41 Variables
- 686 Automate
- 176 Apps
- 453 APIs & Domo Developer
- 47 Workflows
- 10 DomoAI
- 36 Predict
- 15 Jupyter Workspaces
- 21 R & Python Tiles
- 395 Distribute
- 113 Domo Everywhere
- 276 Scheduled Reports
- 6 Software Integrations
- 124 Manage
- 121 Governance & Security
- 8 Domo Community Gallery
- 38 Product Releases
- 10 Domo University
- 5.4K Community Forums
- 40 Getting Started
- 30 Community Member Introductions
- 108 Community Announcements
- 4.8K Archive