I am trying to analyze some Japanese web pages for the presence of
slang/adult phrases using lucene-analyzers-kuromoji-4.6.0.jar. While the
tokenizer breaks the text up into proper words, I am more interested in
catching slang terms, which seem to result from combining various "safe"
words.

A few examples of words that, according to our in-house Japanese language
expert (I have no knowledge of Japanese whatsoever), are slang and should
be caught "unbroken":

無臭正 - this is a bad word and we want to catch it as is, but the tokenizer
breaks it up into 無臭 and 正, which are both apparently safe.

ハメ撮り - it was broken into ハメ and 撮り, again both safe on their own but bad
when combined.

中出し - it was broken into 中 and 出し, but should have been left as is, since
it represents a bad phrase.
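
To make the requirement concrete, here is a minimal sketch of the kind of
matching I am after. It does not use Kuromoji at all: the token lists are
hard-coded from the three examples above, and the class name, the method
name, and the phrase set are all hypothetical. It simply re-joins adjacent
tokens and looks the result up in a list of known bad phrases.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TokenJoinCheck {

    // Hypothetical phrase list; in practice it would come from the language expert.
    static final Set<String> BLOCKED =
            new HashSet<String>(Arrays.asList("無臭正", "ハメ撮り", "中出し"));

    // Re-joins every run of 1..maxTokens adjacent tokens and returns the
    // joined strings that appear in the blocked set, in order of discovery.
    static List<String> findBlocked(List<String> tokens, Set<String> blocked, int maxTokens) {
        List<String> hits = new ArrayList<String>();
        for (int start = 0; start < tokens.size(); start++) {
            StringBuilder joined = new StringBuilder();
            for (int end = start; end < tokens.size() && end - start < maxTokens; end++) {
                joined.append(tokens.get(end));
                if (blocked.contains(joined.toString())) {
                    hits.add(joined.toString());
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Token sequences as reported above, not produced by a real tokenizer here.
        System.out.println(findBlocked(Arrays.asList("無臭", "正"), BLOCKED, 3));
        System.out.println(findBlocked(Arrays.asList("ハメ", "撮り"), BLOCKED, 3));
        System.out.println(findBlocked(Arrays.asList("中", "出し"), BLOCKED, 3));
    }
}
```

Something equivalent inside the analysis chain itself, so that the joined
form comes out as a single token, would be ideal.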

Any help on how I can do this with the Kuromoji tokenizer, or with any
alternative, would be greatly appreciated.

Thanks.
