Levenshtein distance in MySQL dataflow

Hi,

 

I am looking to implement the levenshtein distance function (https://en.wikipedia.org/wiki/Levenshtein_distance) in a MySQL transform in order to find similar strings. I previously used Soundex to compare strings but it does not perform well on longer strings. How would I go about implementing this function in the dataflow?

 

Thanks,

Jason

Comments

  • NewsomSolutions
    NewsomSolutions Domo Employee

    Good Question - I didn't know that even had a name...thank you @jlederer  I have learned something new.

     

    Since I only have a hammer (soundex) I'm going to try to give you a nail.  Since you are familiar with Soundex and have used it but are having problems with larger strings, have you tried to break up your strings to see if that works any better?  May not be helpful, just thinking out loud really...

  • Thanks @NewsomSolutions for the response! I am glad to have shared new information with you. Your suggestion has potential, I believe. I guess I could break up my string and then compare the soundex of each. The issue is that the strings I am comparing are company names. For example, two records that are the same company but have different company name strings could be: "Joe's Auto Center" and "Joe's Auto Center and Detail In New York City". Both of these are the same company and I would like to merge them into one record, but the longer string produces a different soundex key. If I broke the longer string up and just compared the "Joe's Auto Center" part with each string then the soundex would work. I just fear this may lead to some similar company names being merged with different companies. I will give it a shot, thanks again.

  • NewsomSolutions
    NewsomSolutions Domo Employee

    @jlederer  Yea, I can see how that would not be the best solution for that scenario.  Could you do something where you say where name1 is in name2...but that again has it's own host of problems.  Too bad you can't find a Legal Company Name and DBA Company Name list with maybe Tax IDs on it somewhere and nail this down a bit tighter.  If you solve this outside of dojo solutions, please update this w/ what you did b/c I'm very curious.  Thanks

  • AS
    AS Coach

    I was loosely involved with a similar project while at Domo. My team created an app to group multiple variants of television program names together using soundex and Levenshtein.

    See if you can't get your account executive to put you in contact with Tanner Bindrup at Domo.  Guy's a genious.

    Aaron
    MajorDomo @ Merit Medical

    **Say "Thanks" by clicking the heart in the post that helped you.
    **Please mark the post that solves your problem by clicking on "Accept as Solution"