I am trying to analyze some Japanese web pages for the presence of slang/adult phrases using lucene-analyzers-kuromoji-4.6.0.jar. The tokenizer splits the text into proper words, but I am more interested in catching slang terms, which seem to be formed by combining several "safe" words.
Here are a few examples of terms that, according to our in-house Japanese language expert (I have no knowledge of Japanese whatsoever), are slang and should be caught "unbroken":

無臭正 - a bad word that we want to catch as is, but the tokenizer breaks it up into 無臭 and 正, which are both apparently safe.
ハメ撮り - broken into ハメ and 撮り, again both safe on their own but bad when combined.
中出し - broken into 中 and 出し, but it should have been left as is because it represents a bad phrase.

Any help on how I can do this with the Kuromoji tokenizer, or with any alternative, would be greatly appreciated. Thanks.
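To make the requirement concrete, here is a minimal sketch of the behaviour being asked for: adjacent tokens are joined back together and checked against a list of known phrases. It does not call Kuromoji at all. The token list is hard-coded from the splits described above, and `PhraseJoiner`, `findPhrases`, and `maxTokens` are hypothetical names chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PhraseJoiner {
    // Returns every blocklisted phrase that can be formed by concatenating
    // between 1 and maxTokens adjacent tokens, in order of appearance.
    public static List<String> findPhrases(List<String> tokens, Set<String> blocklist, int maxTokens) {
        List<String> hits = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            StringBuilder joined = new StringBuilder();
            for (int end = start; end < tokens.size() && end < start + maxTokens; end++) {
                joined.append(tokens.get(end));
                if (blocklist.contains(joined.toString())) {
                    hits.add(joined.toString());
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Phrases that should be caught unbroken (taken from the examples above).
        Set<String> blocklist = new HashSet<>(Arrays.asList("無臭正", "ハメ撮り", "中出し"));
        // Tokens as the tokenizer reportedly splits them (hard-coded, not produced by Kuromoji here).
        List<String> tokens = Arrays.asList("無臭", "正", "ハメ", "撮り", "中", "出し");
        System.out.println(findPhrases(tokens, blocklist, 3));
    }
}
```

This only shows the matching step on already-split tokens. Whether the same result is better achieved inside the tokenizer itself is exactly what the question is asking.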