Hey all, I'm somewhat new to Lucene; I used it some time ago for a parser we wrote to tokenize documents into word grams.
The approach I took was simple:

1. Extended the Lucene Analyzer.
2. In the tokenStream method, used a ShingleMatrixFilter, passing in the StandardTokenizer and the shingle min/max/spacer (roughly as sketched below).

This worked pretty well for us. Now we would like to tokenize Hangul/Korean into word grams. I'm curious whether others have done something similar and would be willing to share their experience. Any pointers on getting started would be great. Thanks.
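For reference, here's roughly what that looks like, as a minimal sketch against the Lucene 3.x API (where ShingleMatrixFilter lives in the analyzers contrib); the class name, Version constant, and the min/max/spacer values are just placeholders for illustration:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleMatrixFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class WordGramAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // StandardTokenizer feeds word-level tokens into the shingle filter
        TokenStream source = new StandardTokenizer(Version.LUCENE_30, reader);
        // Combine tokens into 2- to 3-word shingles, joined with '_'
        return new ShingleMatrixFilter(source, 2, 3, '_');
    }
}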