Dealing with column types & bad data

Here's the problem I've come across a few times now:

 

A CSV upload file has a column which looks like integers for the first hundred or so rows, but lurking further down are alpha characters. When creating a transform in magic, it loads the first 100 rows, sees they're all integers and sets the column type as such. Then, run the ETL and it fails with "Failed to parse data 'blah' as type Integer for column..."

 

Adding a 'set column type' tile to set it to text (and then strip non-integers) doesn't help as the error is triggered when the whole file is loaded into the ETL before it can process the column type change.

 

The only solution I've found is to strip out the bad data manually before uploading the file. Resolving at the datasource would be ideal, but not always possible. Has anyone found a way to deal with this without manual intevention so it just works? 

 

Thanks.

 

Best Answer

  • Valiant
    Valiant Coach
    Answer ✓

    Damien,

     

    One question on your situation, are you looking to remove the fields with non-numeric data completely (ie, make them NULL) or are you wanting to just strip out the specific offending characters or perhaps omit the row entirely if it has bad data?

     

    If you'll let me know how you're wanting to handle the bad data, I'll try to be more clear in my next response, but maybe the following will help.

     

    Regardless I think you're going to need to use the SQL transform to clean your data.

    When I'm looking for strictly numeric fields (0-9 only), you can do something like this:

    SELECT * FROM Table WHERE `column` REGEXP '^[0-9]+$'

    That would return only records where the column has numbers. If you have commas or periods in your data as well use:

    SELECT * FROM Table WHERE `column` REGEXP '^[0-9]+\\.?[0-9]*$'

     

    Again, hopefully thats somewhat helpful. If you need more assistance, just let me know a bit more about how you're wanting to handle the fields causing you problems.

     

    Sincerely,

    -ValiantSpur

     

     

     

Answers

  • Valiant
    Valiant Coach
    Answer ✓

    Damien,

     

    One question on your situation, are you looking to remove the fields with non-numeric data completely (ie, make them NULL) or are you wanting to just strip out the specific offending characters or perhaps omit the row entirely if it has bad data?

     

    If you'll let me know how you're wanting to handle the bad data, I'll try to be more clear in my next response, but maybe the following will help.

     

    Regardless I think you're going to need to use the SQL transform to clean your data.

    When I'm looking for strictly numeric fields (0-9 only), you can do something like this:

    SELECT * FROM Table WHERE `column` REGEXP '^[0-9]+$'

    That would return only records where the column has numbers. If you have commas or periods in your data as well use:

    SELECT * FROM Table WHERE `column` REGEXP '^[0-9]+\\.?[0-9]*$'

     

    Again, hopefully thats somewhat helpful. If you need more assistance, just let me know a bit more about how you're wanting to handle the fields causing you problems.

     

    Sincerely,

    -ValiantSpur

     

     

     

  • Thanks for the response. I was hoping it would be possible in Magic but happy with SQL.

     

    In this case I just want to make the non-numeric data null.