[jira] [Commented] (HIVE-9556) create UDF to measure strings similarity using Levenshtein Distance algo

Alexander Pivovarov (JIRA) Wed, 11 Feb 2015 19:43:42 -0800

    [ 
https://issues.apache.org/jira/browse/HIVE-9556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14317524#comment-14317524
 ]


Alexander Pivovarov commented on HIVE-9556:
-------------------------------------------

String similarity functions can be used to detect fraudulent activity, e.g. a 
person registering under slightly different names: "Alexander" vs "Alexandre".
They can also be used to match addresses that refer to the same place, e.g. 
"110 Rock Harbor ln" vs "110 Rock harbour Lane".

Oracle has a SOUNDEX function to find strings that sound similar.

Postgres has
- soundex
- difference
- levenshtein   // returns int instead of double
- metaphone
- dmetaphone

http://www.postgresql.org/docs/9.1/static/fuzzystrmatch.html

String similarity functions might be useful for people migrating from Oracle or 
Postgres to Hive.
People who work with accounts, names, addresses, medical records, etc. may find 
them extremely useful.
Data scientists can use string similarity functions as well.

Levenshtein distance is included in Apache Commons Lang as 
StringUtils.getLevenshteinDistance(). Commons Lang is a standard library found 
in most Java projects.

It would be nice to have Levenshtein distance in Hive as well.
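
For illustration, here is a minimal Java sketch of the similarity score from 
the issue description quoted below (1 - distance / greatest string length). It 
is not the code from the attached patches. The class and method names are 
hypothetical, the edit distance is a plain dynamic-programming version written 
out so the example has no dependencies, and treating two empty strings as 
fully similar is an assumption.

```java
// Hypothetical sketch; not the implementation from HIVE-9556.*.patch.
class StrSimLevenshteinSketch {

    // Levenshtein distance via dynamic programming, keeping only two rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            // Swap rows so prev always holds the last completed row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Similarity in [0, 1]: 1 - distance / greatest string length.
    static double similarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) {
            return 1.0; // assumption: two empty strings count as identical
        }
        return 1.0 - (double) levenshtein(a, b) / maxLen;
    }
}
```

With this sketch, similarity("Test String1", "Test String2") is 1 - 1/12, 
matching the example in the issue description, and "Alexander" vs "Alexandre" 
has a distance of 2. In a project that already depends on Commons Lang, the 
levenshtein helper could presumably be replaced by a call to 
StringUtils.getLevenshteinDistance().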

> create UDF to measure strings similarity using Levenshtein Distance algo
> ------------------------------------------------------------------------
>
>                 Key: HIVE-9556
>                 URL: https://issues.apache.org/jira/browse/HIVE-9556
>             Project: Hive
>          Issue Type: Improvement
>          Components: UDF
>            Reporter: Alexander Pivovarov
>            Assignee: Alexander Pivovarov
>         Attachments: HIVE-9556.1.patch, HIVE-9556.2.patch
>
>
> algorithm description http://en.wikipedia.org/wiki/Levenshtein_distance
> {code}
> --one edit operation, greatest str len = 12
> str_sim_levenshtein('Test String1', 'Test String2') = 1 - 1 / 12 = 0.91666667
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
