Separating words in a column where certain words have more spaces than others

I have a column in a dataset of production data. The column is formatted as follows:

Production

Cherry Tomato 12x1pt - Packed

Beefsteak 15 LB - Restack

Mini Sweet Peppers 14x4 FM - Packed

I'm wondering if there is a way to pull the "12x1pt" "15 LB" and "14x4 FM" into a column for "Size". Given the varying amount of white space, it isn't an easy task through the 'Split Column' plate in the ETL. I feel like this may require RegEx of which I have no experience with.

Thank you in advance.

Best Answer

  • GrantSmith
    GrantSmith Coach
    Answer ✓

    Sorry, trying to write regexes on mobile is not a good idea :D

    REGEXP_REPLACE(`PRODUCT`, '^.*?([0-9]+[^ -]*\s?.*?)(?= - ).*$', '$1') 
    

    Adding a ? after your \s will make the space optional so it then also works with your examples of Cocktail 16X14OZ - Packed or Cherry Tomato 12x1pt - Packed

    @ChrisGaffan

    Also I utilize a website called regex101.com to build and test my regular expressions which may be helpful.

    **Was this post helpful? Click Agree or Like below**
    **Did this solve your problem? Accept it as a solution!**

Answers

  • Will it always be one more more number followed by x and then one or more numbers followed by some extra characters or will that format possibly change?

    REGEXP_REPLACE(`Production`, '^.*([0-9]+x[0-9]+[^ ]*).*$', '$1')
    

    This regex follows that pattern where it's looking for one or more digits preceeding the x character and then followed by one or more digits and optionally any number of non space characters.

    **Was this post helpful? Click Agree or Like below**
    **Did this solve your problem? Accept it as a solution!**
  • ChrisGaffan
    ChrisGaffan Member
    edited July 31

    @GrantSmith Would i use a Formula plate for that or would I place it into a "Custom" delimiter within the "Split Column" plate? Also no, they won't always have an x. Some will be whitespace between a # and "LB" "PK" "PT"

  • You'd use a formula tile to calculate a new field in Magic ETL.

    If your end delimiter is ' - ' then we can tweak it to be something like:

    1. REGEXP_REPLACE(`Production`, '^.*([0-9]+.*) - .*$', '$1')

    This simplified version looks for numbers followed by any number of characters until it finds a space-space delimiter.

    Will there be other numbers in your product names or just the amounts like you have shown in your examples?

    **Was this post helpful? Click Agree or Like below**
    **Did this solve your problem? Accept it as a solution!**
  • @GrantSmith This seems to be removing anything before the x.

    "Mini Sweet Peppers 14x4 FM - Packed" comes out as just "4 FM"

  • ChrisGaffan
    ChrisGaffan Member
    edited July 31
    REGEXP_REPLACE(`PRODUCT`, '^.*?([0-9]+[^ -]*\s.*?)(?= - ).*$', '$1') 
    

    This works, however it then doesn't work for something like "Cocktail 16X14OZ - Packed". Looks like it's not going to be doable due to the variability of how the original column is formatted.

    @GrantSmith

  • GrantSmith
    GrantSmith Coach
    Answer ✓

    Sorry, trying to write regexes on mobile is not a good idea :D

    REGEXP_REPLACE(`PRODUCT`, '^.*?([0-9]+[^ -]*\s?.*?)(?= - ).*$', '$1') 
    

    Adding a ? after your \s will make the space optional so it then also works with your examples of Cocktail 16X14OZ - Packed or Cherry Tomato 12x1pt - Packed

    @ChrisGaffan

    Also I utilize a website called regex101.com to build and test my regular expressions which may be helpful.

    **Was this post helpful? Click Agree or Like below**
    **Did this solve your problem? Accept it as a solution!**
  • @GrantSmith Thank You again!