Yes, I already have a system for users reporting words: reports land on an operator screen, and a word is filtered if the operator approves the report or if 3 other people have marked it as a curse word.
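Roughly, the rule is this (a minimal sketch in plain Java; the class and method names are hypothetical, not my actual code):

import java.util.HashSet;
import java.util.Set;

// A reported word gets filtered once an operator approves the report
// or 3 distinct users have flagged it as a curse word.
class ReportedWord {
    private final Set<String> reporterIds = new HashSet<>();
    private boolean operatorApproved = false;

    void reportBy(String userId) {
        reporterIds.add(userId);
    }

    void approveByOperator() {
        operatorApproved = true;
    }

    // filter if the operator approved, or if 3 different users agree
    boolean isFiltered() {
        return operatorApproved || reporterIds.size() >= 3;
    }
}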
In the other thread you wrote:

> I would create 1-5 ngram sized shingles and measure the distance using the
> Tanimoto coefficient. That would probably work out just fine.
> You might want to add more weight the greater the size of the shingle.
>
> There are shingle filters in lucene/java/contrib/analyzers and there is a
> Tanimoto distance in lucene/mahout/.

Would that apply to my case? Tanimoto coefficient over shingles? (I have put a rough sketch of how I picture it at the bottom of this mail.)

Best,

On Thu, Sep 4, 2008 at 4:12 PM, Karl Wettin <[EMAIL PROTECTED]> wrote:
>
> On 4 Sep 2008, at 14:38, Cam Bazz wrote:
>
>> Hello,
>> This came up before, but if we were to make a swear word filter, string
>> edit distances are no good: for example, the word `shot` gets confused
>> with `shit`. There is also a problem with words like hitchcock.
>> Apparently I need something like Soundex or Double Metaphone. The thing
>> is, these are language specific, and I am not operating in English.
>>
>> I need a fuzzy curse word filter for Turkish, simply.
>
> You probably need to make a large list of words. I would try to learn from
> the users that do swear, perhaps even trust my users to report each other.
> I would probably also look at storing in what context the word is used,
> perhaps by adding the surrounding words (ngrams, shingles, markov chains).
> Compare "go to hell" and "when hell freezes over". The first is rather
> derogatory while the second doesn't have to be bad at all.
>
> I'm thinking Hidden Markov Models and Neural Networks.
>
> karl
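P.S. To make sure I understand the shingle/Tanimoto idea, here is a minimal sketch of how I picture it (plain Java, not the Lucene ShingleFilter or the Mahout Tanimoto implementation; the class and method names are made up): character n-grams of sizes 1 to 5, weighted by their length so that longer shingles count more, compared with a Tanimoto coefficient.

import java.util.HashMap;
import java.util.Map;

class ShingleTanimoto {

    // Collect character n-grams of sizes minSize..maxSize, each weighted
    // by its length so that longer shingles contribute more.
    static Map<String, Double> shingles(String word, int minSize, int maxSize) {
        Map<String, Double> result = new HashMap<>();
        for (int n = minSize; n <= maxSize; n++) {
            for (int i = 0; i + n <= word.length(); i++) {
                String gram = word.substring(i, i + n);
                result.merge(gram, (double) n, Double::sum);
            }
        }
        return result;
    }

    // Weighted Tanimoto coefficient: shared weight / (total weight - shared weight).
    static double tanimoto(Map<String, Double> a, Map<String, Double> b) {
        double shared = 0, totalA = 0, totalB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            totalA += e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                shared += Math.min(e.getValue(), other);
            }
        }
        for (double v : b.values()) {
            totalB += v;
        }
        double union = totalA + totalB - shared;
        return union == 0 ? 0 : shared / union;
    }

    public static void main(String[] args) {
        // e.g. compare a candidate word against a known curse word
        double similarity = tanimoto(shingles("shot", 1, 5), shingles("shit", 1, 5));
        System.out.println(similarity);   // value in [0, 1]; flag if above a threshold
    }
}

So the idea would be to shingle every incoming word and flag it when its Tanimoto similarity to some word on the curse list exceeds a threshold, is that right?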