Hey all,

I'm somewhat new to Lucene; my main experience is a parser we wrote some time
ago to tokenize documents into word grams.

The approach I took was simple:

1. Extended the Lucene Analyzer.
2. In the tokenStream method, wrapped the standard tokenizer in a
ShingleMatrixFilter, passing in the shingle min/max sizes and spacer character.
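
Roughly, the analyzer looked something like this (a sketch from memory against
the Lucene 3.x API; the min/max/spacer values are just examples, not the exact
settings we used):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleMatrixFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class WordGramAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Split the text into individual words first...
        TokenStream words = new StandardTokenizer(Version.LUCENE_30, reader);
        // ...then combine adjacent words into 2- and 3-word shingles,
        // joined with a space as the spacer character.
        return new ShingleMatrixFilter(words, 2, 3, ' ');
    }
}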

This worked pretty well for us.  Now we would like to tokenize Hangul/Korean
text into word grams.
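
To make the question a bit more concrete, these are the two directions I can
imagine; both are guesses on my part rather than something I've tried, and the
tokenizer choices are just placeholders:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.analysis.shingle.ShingleMatrixFilter;

// Guess A: treat whitespace-separated Korean words as the tokens and
// shingle those into word grams, same as the setup above.
class HangulWordGramAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new ShingleMatrixFilter(new WhitespaceTokenizer(reader), 2, 3, ' ');
    }
}

// Guess B: skip word shingles entirely and index character n-grams,
// which I understand is a common fallback for CJK text.
class HangulCharGramAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new NGramTokenizer(reader, 2, 3);
    }
}

I don't know whether either of these is reasonable for Korean, or whether a
proper morphological analyzer is needed instead.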

I'm curious whether others have done something similar and would be willing to
share their experience.  Any pointers on getting started with this would be great.

Thanks.
