Need help "teaching" Japanese tokenizer to pick up slangs

Rahul Ratnakar Mon, 10 Mar 2014 10:58:32 -0700

I am trying to analyze some Japanese web pages for the presence of slang/adult
phrases using lucene-analyzers-kuromoji-4.6.0.jar. While the
tokenizer breaks the text up into proper words, I am more interested in
catching slang that seems to result from combining various "safe"
words.


A few examples of words that, according to our in-house Japanese language
expert (I have no knowledge of Japanese whatsoever), are slang and should be
caught "unbroken":

無臭正 - is a bad word and we want to catch it as is, but the tokenizer breaks
it up into 無臭 and 正 which are both apparently safe.

ハメ撮り - it was broken into ハメ and 撮り, again both safe on their own but bad
when combined.

中出し - it was broken into 中 and 出し, but should have been left as is because it
represents a bad phrase.
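
To make the goal concrete, here is a minimal sketch of one possible workaround: leave the tokenizer output alone, rejoin adjacent tokens, and check the rejoined strings against a phrase list. It is plain Java and does not use any Lucene or Kuromoji API. The class name SlangPhraseMatcher, the method findPhrases, and the maxTokens limit are hypothetical names chosen for illustration, and the token list is typed in by hand from the splits described above rather than produced by the tokenizer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of Lucene or Kuromoji.
class SlangPhraseMatcher {

    // Rejoins runs of one to maxTokens adjacent tokens and returns every
    // rejoined string that appears in the phrase list, in order of occurrence.
    static List<String> findPhrases(List<String> tokens, Set<String> phrases, int maxTokens) {
        List<String> hits = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            StringBuilder joined = new StringBuilder();
            int end = Math.min(tokens.size(), start + maxTokens);
            for (int i = start; i < end; i++) {
                joined.append(tokens.get(i));
                if (phrases.contains(joined.toString())) {
                    hits.add(joined.toString());
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Phrase list taken from the three examples above.
        Set<String> phrases = new HashSet<>(Arrays.asList("無臭正", "ハメ撮り", "中出し"));
        // Assumed token sequence, written by hand to match the splits described above.
        List<String> tokens = Arrays.asList("無臭", "正", "ハメ", "撮り", "中", "出し");
        System.out.println(findPhrases(tokens, phrases, 3));
    }
}
```

This only catches phrases that begin and end on token boundaries, so a phrase that starts in the middle of a token would be missed.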

Any help on how I can use the Kuromoji tokenizer, or any alternatives, would be
greatly appreciated.

Thanks.
