Yes, I already have a system for users reporting words: reports land on an operator screen, and a word is filtered if the operator approves the report or if 3 other people have marked it as a curse word.
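Roughly, the rule is this (a minimal sketch in plain Java; the class and method names are hypothetical, not my actual code):

import java.util.HashSet;
import java.util.Set;

// A reported word gets filtered once an operator approves the report
// or 3 distinct users have flagged it as a curse word.
class ReportedWord {
    private final Set<String> reporterIds = new HashSet<>();
    private boolean operatorApproved = false;

    void reportBy(String userId) {
        reporterIds.add(userId);
    }

    void approveByOperator() {
        operatorApproved = true;
    }

    // filter if the operator approved, or if 3 different users agree
    boolean isFiltered() {
        return operatorApproved || reporterIds.size() >= 3;
    }
}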
In the other thread you wrote:

> I would create 1-5 ngram sized shingles and measure the distance using the
> Tanimoto coefficient. That would probably work out just fine.
> You might want to add more weight the greater the size of the shingle.
>
> There are shingle filters in lucene/java/contrib/analyzers and there is a
> Tanimoto distance in lucene/mahout/.

Would that apply to my case? Tanimoto coefficient over shingles? (I have put a rough sketch of how I picture it at the bottom of this mail.)

Best,

On Thu, Sep 4, 2008 at 4:12 PM, Karl Wettin <[EMAIL PROTECTED]> wrote:
>
> On 4 Sep 2008, at 14:38, Cam Bazz wrote:
>
>> Hello,
>> This came up before, but if we were to make a swear word filter, string
>> edit distances are no good: for example, the word `shot` gets confused
>> with `shit`. There is also a problem with words like hitchcock.
>> Apparently I need something like Soundex or Double Metaphone. The thing
>> is, these are language specific, and I am not operating in English.
>>
>> I need a fuzzy curse word filter for Turkish, simply.
>
> You probably need to make a large list of words. I would try to learn from
> the users that do swear, perhaps even trust my users to report each other.
> I would probably also look at storing in what context the word is used,
> perhaps by adding the surrounding words (ngrams, shingles, markov chains).
> Compare "go to hell" and "when hell freezes over". The first is rather
> derogatory while the second doesn't have to be bad at all.
>
> I'm thinking Hidden Markov Models and Neural Networks.
>
> karl
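P.S. To make sure I understand the shingle/Tanimoto idea, here is a minimal sketch of how I picture it (plain Java, not the Lucene ShingleFilter or the Mahout Tanimoto implementation; the class and method names are made up): character n-grams of sizes 1 to 5, weighted by their length so that longer shingles count more, compared with a Tanimoto coefficient.

import java.util.HashMap;
import java.util.Map;

class ShingleTanimoto {

    // Collect character n-grams of sizes minSize..maxSize, each weighted
    // by its length so that longer shingles contribute more.
    static Map<String, Double> shingles(String word, int minSize, int maxSize) {
        Map<String, Double> result = new HashMap<>();
        for (int n = minSize; n <= maxSize; n++) {
            for (int i = 0; i + n <= word.length(); i++) {
                String gram = word.substring(i, i + n);
                result.merge(gram, (double) n, Double::sum);
            }
        }
        return result;
    }

    // Weighted Tanimoto coefficient: shared weight / (total weight - shared weight).
    static double tanimoto(Map<String, Double> a, Map<String, Double> b) {
        double shared = 0, totalA = 0, totalB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            totalA += e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                shared += Math.min(e.getValue(), other);
            }
        }
        for (double v : b.values()) {
            totalB += v;
        }
        double union = totalA + totalB - shared;
        return union == 0 ? 0 : shared / union;
    }

    public static void main(String[] args) {
        // e.g. compare a candidate word against a known curse word
        double similarity = tanimoto(shingles("shot", 1, 5), shingles("shit", 1, 5));
        System.out.println(similarity);   // value in [0, 1]; flag if above a threshold
    }
}

So the idea would be to shingle every incoming word and flag it when its Tanimoto similarity to some word on the curse list exceeds a threshold, is that right?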