Inconsistent State Arbeviations in our Data

In SFDC_TechSupport dataflow, the state column holds inconsistent state acronyms. For example, California will be ca, CA, & cal. Is there a simple way to clean up the state data and get it to work with a United States map?

 

Link to SFDC_TechSupport dataset:https://rytecdoors.domo.com/datasources/dee055d6-c463-4a1e-8f70-2da011076bad/details/overview

Link to card using this data: https://rytecdoors.domo.com/page/157177905/kpis/details/339647928

 

Thanks!

Sarah

Best Answer

  • jstan
    jstan Contributor
    Answer ✓

    Hi Sarah, you will likely have to run it through a dataflow or use a calculated field (beast mode).

     

    Dataflow/ETL:

    An easy way to do it would be to get all the abbreviations from salesforce by creating a data table card from the existing salesforce dataset with the list of all the abbreviations and add an aggregated field (count or sum a field which will then show unique values for the abbreviations).  Then export the table to excel (or copy paste into Google Sheets).  Add the two letter state next to the abbreviation (format how you want - all uppercase for instance).  Then you can either blend/fusion the excel file/google sheet (case sensitive join) to your salesforce dataset or put it through a dataflow.  If you put it through a dataflow, you can use UPPER or LOWER to join so if someone puts another value in there that has different capitalization, the dataflow will catch it.  Just make sure to get distinct values first (use distinct or group by in a transform if you convert state values to UPPER/LOWER, so that your data will not be duplicated... UPPER(Cal) is the same as CAL).

     

    Calculated Field:

    The other option is a calculated field.  You could do a long case statement.  Capitalization will matter with the specific values you are trying to match unless you use upper or lower on the field for state name.  With this approach, you would really only need to identify the non-conforming state values like Cal or something spelled out.

    CASE WHEN UPPER(state_name) IN ('CAL','CA','CALIFORNIA') THEN 'CA'

    WHEN UPPER(state_name) IN ('OH','OHIO') THEN 'OH'

    WHEN UPPER(state_name) IN ('NEW YORK','NYC') THEN 'NY'

    ELSE state_name END

     

    The other option is to have your salesforce team clean up their data, but if your company is anything like mine, it would take a long time for it to happen.

     

    Thanks,
    Josh

     

Answers

  • Chips
    Chips Domo Employee

    You could certainly do a clean up via Magic ETL, thinking you would join to a static "Clean Up" Excel doc that would have all the different incorrect states

     

    State       S/B

    CA           CA

    ca            CA 

    cal           CA

     

    But the first thing I would do is look for other fields in your query that could help, that might be more reliable. Do you have zip code in there?

    Domo Consultant

    **Say "Thanks" by clicking the "heart" in the post that helped you.
    **Please mark the post that solves your problem by clicking on "Accept as Solution"
  • Jaketh13
    Jaketh13 Domo Employee

    Hey @sdarmody

     

    Unfoturnately, it looks like this data is being entered manually in SFDC and therefore there is some variation in how different users are entering the states. The only thing for it would be to clean up the data now that it's in Domo (or possibly have them all to go back and fix it, but I don't know if that's an option).

     

    I would personally do this with a Beast Mode. Something like:

    CASE WHEN LOWER(`State`) LIKE 'ca%' THEN 'CA'

              WHEN LOWER(`State`) LIKE 'n%v%' THEN 'NV'

              WHEN ...

     

    It might be a bit on the heavy side, but it should at least help you locate the outliers and then you could move it into a DataFlow if you prefered (for consistency's sake)


    Thanks,
    Jake
    **Say "Thanks" by clicking the "heart" in the post that helped you.
    **Please mark the post that solves your problem by clicking on "Accept as Solution"
  • jstan
    jstan Contributor
    Answer ✓

    Hi Sarah, you will likely have to run it through a dataflow or use a calculated field (beast mode).

     

    Dataflow/ETL:

    An easy way to do it would be to get all the abbreviations from salesforce by creating a data table card from the existing salesforce dataset with the list of all the abbreviations and add an aggregated field (count or sum a field which will then show unique values for the abbreviations).  Then export the table to excel (or copy paste into Google Sheets).  Add the two letter state next to the abbreviation (format how you want - all uppercase for instance).  Then you can either blend/fusion the excel file/google sheet (case sensitive join) to your salesforce dataset or put it through a dataflow.  If you put it through a dataflow, you can use UPPER or LOWER to join so if someone puts another value in there that has different capitalization, the dataflow will catch it.  Just make sure to get distinct values first (use distinct or group by in a transform if you convert state values to UPPER/LOWER, so that your data will not be duplicated... UPPER(Cal) is the same as CAL).

     

    Calculated Field:

    The other option is a calculated field.  You could do a long case statement.  Capitalization will matter with the specific values you are trying to match unless you use upper or lower on the field for state name.  With this approach, you would really only need to identify the non-conforming state values like Cal or something spelled out.

    CASE WHEN UPPER(state_name) IN ('CAL','CA','CALIFORNIA') THEN 'CA'

    WHEN UPPER(state_name) IN ('OH','OHIO') THEN 'OH'

    WHEN UPPER(state_name) IN ('NEW YORK','NYC') THEN 'NY'

    ELSE state_name END

     

    The other option is to have your salesforce team clean up their data, but if your company is anything like mine, it would take a long time for it to happen.

     

    Thanks,
    Josh

     

  • Good call, I'm currently looking for a good zip code field from SFDC. However, it looks like it's also inconsistent data. Still trying. If this doesn't work, I'll clean up the data by mapping current state abreviations to the desired abreviations. Thanks!