Any tips on how to clean Typo's in a dataset?

Hi all,

I have a large and growing dataset with a number of human error typo's within the source.

I am looking for a simple formula I can add to my ETL (or other method) to clean and fix common errors.

An example to illustrate the issue, in my dataset I may have a field called 'Company Name' and within that field, for the same company I see the following records:

Company A

Company|A

Company A.

Company B

Company|B

etc.

Is there a method to clean all references and remove the spurrious | and .? At the moment I am having to it for individual 'companies' rather than a single line for all companies. All help is very much appreciated.

Thanks

Joe

Best Answer

  • ggenovese
    ggenovese Contributor
    Answer ✓

    No problem! The REPLACE() function would work inside of either a Magic ETL or in a Beast Mode inside of a card. The function will strip out the pipes "|" and replace with SPACE and strip out the periods "." and not replace those with anything.

    If you want to go the ETL route you'd use an Add Formula tile, name it the same name as your Company Name column and paste the two replaces inside where the formula goes.

    The process for a beast mode in a card is pretty similar, inside of Analyzer you would click "Create a Calculated Field" and create the Beast Mode. One thing to note however is that inside the card you would have to give it a different name because in this case you're creating a new field rather than overwriting an existing one as you would have in the ETL

Answers

  • ggenovese
    ggenovese Contributor

    If you just remove the pipe symbols then you'd end up with a mix of "Company A", "CompanyA","Company B", "CompanyB" etc. so you'd need to replace pipe with a space. But if you replace the periods with a space then you'll have trailing whitespace. So I'd suggest you have two nested REPLACE() functions so that you can handle the different scenarios separately:

    REPLACE(REPLACE(`Company Name`,'|',' '),'.','')

  • Thank you @ggenovese. Apologies for the basic questions, I am a fairly basic user!

    Would I need to write each query for each company name or would 1 line work for Company A, B and C etc.?

    Thanks

  • ggenovese
    ggenovese Contributor
    Answer ✓

    No problem! The REPLACE() function would work inside of either a Magic ETL or in a Beast Mode inside of a card. The function will strip out the pipes "|" and replace with SPACE and strip out the periods "." and not replace those with anything.

    If you want to go the ETL route you'd use an Add Formula tile, name it the same name as your Company Name column and paste the two replaces inside where the formula goes.

    The process for a beast mode in a card is pretty similar, inside of Analyzer you would click "Create a Calculated Field" and create the Beast Mode. One thing to note however is that inside the card you would have to give it a different name because in this case you're creating a new field rather than overwriting an existing one as you would have in the ETL

  • Thank you @ggenovese that has worked really well.

  • @Joe_Pitcher I second @ggenovese's advice. I would also recommend adding validation rules to manual data sources whenever possible to prevent these data quality issues upstream. For example, if your data is coming from a spreadsheet then consider adding drop-downs and datatype validation in the spreadsheet so that the users are required to enter consistent values.