Similarity between two text columns

I have a dataset with two columns of company names. What is the best way to measure the similarity between the two?

 

Perhaps I can pass 3 columns in R/Python and return 4 columns with cosine similarity. Can I do that? Starter code or an example would be great.

Comments

  • Hi @Yogesh 

     

    How are you defining similarity? Spelled the same? Sound the same?

     

     

  • https://stackoverflow.com/questions/560709/levenshtein-distance-in-t-sql

     

    Levenshtein distance is a common way of calculating the similarity between two text values (i.e. how many characters you would have to change before they are the same). For example, "cat > rat" = 1 and "John > Jon" = 1.
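Since starter code in Python was requested, here is a minimal sketch of the idea, using only the standard library. The function names `levenshtein` and `similarity`, the sample rows, and the choice to lowercase and trim names before comparing are all assumptions for illustration, not something taken from the linked answer.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to 0..1, where 1.0 means identical.
    Assumption: case and surrounding whitespace should be ignored."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


# Hypothetical sample rows: (name in column A, name in column B)
rows = [("Acme Corp", "ACME Corp."), ("John", "Jon"), ("cat", "rat")]

# Each result is (name_a, name_b, score); the score is the new column.
scored = [(a, b, round(similarity(a, b), 2)) for a, b in rows]
```

Dividing by the longer string's length is just one way to normalize; a raw distance of 1 means something very different for a 3-letter name than for a 30-letter one, which is why a scaled score tends to be easier to threshold.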

     

    You'll have to rewrite it for MySQL, but it can be done. For a workflow like this, though, you'll want some sort of process where a user accepts or discards a recommendation, and you'll want to accumulate those decisions (recursively?) in a lookup table of 'approved matches'.

     

    In my example above, you may not want to automatically accept that John and Jon are the same entry, hence the need for a feedback loop. Domo can of course handle that as a simple webform, or more dynamically with a custom app with a polished user interface.
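To make the feedback loop a bit more concrete, here is a rough sketch of an 'approved matches' lookup. It is only an illustration: the in-memory dictionary stands in for whatever table or dataset would actually store the decisions, and the names `decisions`, `record_decision`, and `lookup_decision` are hypothetical.

```python
from typing import Dict, Optional, Tuple

# Hypothetical stand-in for the lookup table of reviewed pairs:
# (name, name) -> True if a user approved the match, False if discarded.
decisions: Dict[Tuple[str, str], bool] = {}


def pair_key(a: str, b: str) -> Tuple[str, str]:
    """Build an order-independent key so (A, B) and (B, A) are one entry.
    Assumption: case and surrounding whitespace are not significant."""
    first, second = sorted((a.strip().lower(), b.strip().lower()))
    return (first, second)


def record_decision(a: str, b: str, approved: bool) -> None:
    """Store the user's accept/discard choice for a suggested pair."""
    decisions[pair_key(a, b)] = approved


def lookup_decision(a: str, b: str) -> Optional[bool]:
    """Return True or False if the pair was already reviewed,
    or None if it still needs a human decision."""
    return decisions.get(pair_key(a, b))


# Example: one suggestion accepted, one rejected, one not yet reviewed.
record_decision("Acme Corp", "ACME Corp.", True)
record_decision("John", "Jon", False)
```

With something like this in place, a similarity score only has to produce suggestions; pairs that come back as `None` go to the reviewer, and everything else reuses the stored answer.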

    Jae Wilson
