I don't think there's anything you can use out of the box, but if you search the mail archives (see the searchable archives) for a thread titled "Hebrew and Hindi analyzers" you might find something useful.
Not much help I know, but perhaps a place to start.

And yes, you should use the same analyzer for indexing and searching if at all possible. The reason is that the job of an analyzer is to break the incoming stream up into meaningful units (usually words). You wouldn't want the analyzer used at index time to, say, remove stopwords, and then search with a different analyzer that did NOT remove stopwords (or lowercase, or stem, etc.).

And certainly many people have indexed and searched non-English documents, and many have contributed the resulting Analyzers back to the Lucene community. If you find that you have to write your own, please consider contributing.

HTH
Erick

On Sat, May 23, 2009 at 2:23 AM, KK <dioxide.softw...@gmail.com> wrote:
> Hi All,
> I've been trying to index some non-English [Indian languages] text in
> Unicode UTF-8. For all these languages we don't have any stemmers or
> tokenizers, etc. To keep the searching simple I'd like to be able to do
> exact word searches/matches as a first step. I'd like to know which will
> be the simplest yet working analyzer to use for both indexing as well as
> searching [the Lucene wiki says both should be the same, else you might
> not get search results, right?]
>
> Many people must have done indexing for non-English text for which there
> are no standard analyzers. I request them to give me ideas on this. Along
> with this I would also like to do hit highlighting irrespective of
> language. Ideas on this will be equally helpful.
>
> Is SimpleAnalyzer good enough for indexing and searching?
>
> Thanks,
> KK
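The index/search symmetry described above is language-agnostic, so here is a minimal sketch of the idea outside Lucene — in Python rather than Java, with a trivial whitespace "analyzer" I made up for illustration. It shows why exact word matching over UTF-8 text (including Indic scripts) works fine as long as the *same* analysis is applied at index time and query time:

```python
from collections import defaultdict

def analyze(text):
    """A trivial 'analyzer': split on whitespace and lowercase.
    Lowercasing is a no-op for caseless scripts such as Devanagari."""
    return text.lower().split()

def build_index(docs):
    """Inverted index: token -> set of doc ids, built with analyze()."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Exact word match: the query MUST go through the SAME analyzer,
    otherwise index-time and query-time tokens won't line up."""
    hits = [index.get(token, set()) for token in analyze(query)]
    return set.intersection(*hits) if hits else set()

docs = {
    1: "नमस्ते दुनिया",   # Hindi, stored as plain UTF-8
    2: "hello world",
}
index = build_index(docs)
print(search(index, "दुनिया"))  # matches doc 1
print(search(index, "Hello"))   # matches doc 2 only because both
                                # stages lowercase consistently
```

In Lucene terms, the same principle amounts to passing the same Analyzer to the IndexWriter and to the QueryParser; if one side lowercases, stems, or strips stopwords and the other doesn't, the tokens never line up and you get no hits.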