REGEX Tutorial to Validate US Phone Numbers

Hi everyone,

I wanted to create this discussion post to share a methodology I created using Domo's Regular Expressions in Magic ETL to parse and validate US phone numbers. This validation process follows the basic standards set by the North American Numbering Plan (NANP) and helps identify valid US phone numbers from a dataset using just a small reference table and just a few ETL tiles and formulas.

Valid US Phone Number Parsing Tutorial:

  1. Create a US Area Codes reference table: Simply upload the attached spreadsheet (or copy and paste the data into a Domo webform) that contains a current list of valid US area codes.
  2. Add a Formula tile and create an Area Code field using the following formula: LEFT(TRIM(REGEXP_REPLACE(`Phone Number`, '[^0-9]', '')),3)
  3. Join the US Area Codes reference table to your main dataset that using the Area Code as the join key. To resolve the duplicate field names, simply rename the reference table field to Valid Area Code.
  4. In another Formula tile, create a Valid Area Code? flag using the following formula: IFNULL(`Valid Area Code`,'Invalid')
  5. Create another formula called Valid Phone Number? underneath the "Valid Area Code?" flag:

CASE
WHEN `Valid Area Code?` = 'Invalid' THEN 'FALSE'
WHEN REGEXP_LIKE(TRIM(REGEXP_REPLACE(`Phone Number`, '[^0-9]', '')), '^(?:[2-9][0-9]{2}){2}[0-9]{4}$')
AND NOT REGEXP_LIKE(TRIM(REGEXP_REPLACE(`Phone Number`, '[^0-9]', '')), '^(.)\1{9}$')
AND NOT REGEXP_LIKE(TRIM(REGEXP_REPLACE(`Phone Number`, '[^0-9]', '')), '^[0-9]{3}911[0-9]{4}$')
AND NOT REGEXP_LIKE(TRIM(REGEXP_REPLACE(`Phone Number`, '[^0-9]', '')), '^[2-9][0-9]{2}555[0-9]{4}$')
THEN 'TRUE'
ELSE 'FALSE'
END

The formula above checks the following information for each phone number:

  • The number contains exactly 10 numerical digits with a valid structure defined by the North American Numbering Plan (NANP): (NPA) NXX-XXXX.
  • The number's area code (NPA) and prefix (NXX) both start with digits 2-9.
  • The number is not composed entirely of the same digit (e.g., 1111111111).
  • The number does not contain the "911" sequence in the prefix (NXX) position.
  • The number does not contain the "555" sequence in the prefix (NXX) position.

I hope this is helpful for anyone working with large datasets of phone numbers who need to filter out invalid US phone numbers! Please feel free to provide any feedback to further optimize or expand on this logic, especially if anyone has patterns for international formats or any additional edge cases I may have missed.

Thank you!

Comments