Hey all, I'm somewhat new to Lucene; I used it some time ago for a parser we wrote to tokenize documents into word grams.
The approach I took was simple:

1. Extended the Lucene Analyzer.
2. In the tokenStream method, used a ShingleMatrixFilter, passing in the StandardTokenizer and the shingle min/max/spacer (roughly as sketched below).

This worked pretty well for us. Now we would like to tokenize Hangul/Korean into word grams. I'm curious whether others have done something similar and would be willing to share their experience. Any pointers on getting started would be great. Thanks.
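For reference, here's roughly what that looks like, as a minimal sketch against the Lucene 3.x API (where ShingleMatrixFilter lives in the analyzers contrib); the class name, Version constant, and the min/max/spacer values are just placeholders for illustration:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleMatrixFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class WordGramAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // StandardTokenizer feeds word-level tokens into the shingle filter
        TokenStream source = new StandardTokenizer(Version.LUCENE_30, reader);
        // Combine tokens into 2- to 3-word shingles, joined with '_'
        return new ShingleMatrixFilter(source, 2, 3, '_');
    }
}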