[ https://issues.apache.org/jira/browse/HIVE-9556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14317524#comment-14317524 ]
Alexander Pivovarov commented on HIVE-9556:
-------------------------------------------

String similarity functions can be used to detect fraudulent activity, e.g. a person who registers several times under slightly different names: "Alexander" vs "Alexandre".

They can also be used to match addresses that refer to the same place: "110 Rock Harbor ln" vs "110 Rock harbour Lane".

Oracle has the SOUNDEX function to find strings that sound similar.

Postgres has:
- soundex
- difference
- levenshtein // returns int instead of double
- metaphone
- dmetaphone
http://www.postgresql.org/docs/9.1/static/fuzzystrmatch.html

String similarity functions might be useful to people migrating from Oracle or Postgres to Hive.

People who work with accounts, names, addresses, medical records, etc. can find string similarity functions extremely useful. They can be used by data scientists as well.

Levenshtein distance is available in Apache Commons Lang as StringUtils.getLevenshteinDistance(). Commons Lang is a standard library found in most Java projects.

It would be nice to have Levenshtein distance in Hive as well.

> create UDF to measure strings similarity using Levenshtein Distance algo
> ------------------------------------------------------------------------
>
>          Key: HIVE-9556
>          URL: https://issues.apache.org/jira/browse/HIVE-9556
>      Project: Hive
>   Issue Type: Improvement
>   Components: UDF
>     Reporter: Alexander Pivovarov
>     Assignee: Alexander Pivovarov
>  Attachments: HIVE-9556.1.patch, HIVE-9556.2.patch
>
>
> algorithm description http://en.wikipedia.org/wiki/Levenshtein_distance
> {code}
> --one edit operation, greatest str len = 12
> str_sim_levenshtein('Test String1', 'Test String2') = 1 - 1 / 12 = 0.91666667
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
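
For readers who want to check the arithmetic in the issue description, the similarity formula there (1 minus the edit distance divided by the length of the longer string) can be sketched in plain Java roughly as below. This is only an illustration and not the code from the attached patches: the class name `StrSimLevenshteinSketch` and its method names are hypothetical, the distance is computed by hand rather than through Commons Lang so the snippet has no dependencies, and returning 1.0 for two empty strings is an assumption.

```java
// Illustrative sketch only; names are hypothetical and not taken from the HIVE-9556 patches.
final class StrSimLevenshteinSketch {

    // Levenshtein distance via the standard two-row dynamic-programming table.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            int[] tmp = prev; // reuse the two rows instead of allocating new ones
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Similarity as described in the issue: 1 - distance / length of the longer string.
    static double similarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) {
            return 1.0; // assumption: two empty strings count as identical
        }
        return 1.0 - (double) distance(a, b) / maxLen;
    }

    public static void main(String[] args) {
        // One substitution over 12 characters, as in the issue description.
        System.out.println(similarity("Test String1", "Test String2"));
        // The swapped "er"/"re" costs two edits under plain Levenshtein distance.
        System.out.println(similarity("Alexander", "Alexandre"));
    }
}
```

Note that plain Levenshtein distance counts the transposition in "Alexander" vs "Alexandre" as two edits, so that pair scores 1 - 2/9, lower than the single-substitution example from the description.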