Hi,

I am just beginning to implement text indexing for an application I am building and am not quite sure about a few things. The documents to be indexed will be in various languages, and range mostly from short notes to ~20 page articles (with the occasional book-length document). My plan is therefore to have a separate index for each language, each of which would contain a number of fields created from the same text analyzed in several ways. So for an English document I might have three fields, "stem", "suffix", and "token", generated from the same text with an EnglishAnalyzer, a custom analyzer ending in a ReverseStringFilter, and a StandardAnalyzer respectively.
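To make that concrete, here is roughly what I have in mind (a sketch against the Lucene 3.6 API, which is what I am on; ReverseAnalyzer is just my working name for the custom analyzer):

    import java.io.Reader;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.StopAnalyzer;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.reverse.ReverseStringFilter;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.standard.StandardFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.util.Version;

    // Analyzer for the "suffix" field: the same Standard/lowercase/stop
    // chain as StandardAnalyzer, with each surviving token reversed.
    public class ReverseAnalyzer extends Analyzer {
        private static final Version V = Version.LUCENE_36;

        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            TokenStream ts = new StandardTokenizer(V, reader);
            ts = new StandardFilter(V, ts);
            ts = new LowerCaseFilter(V, ts);
            ts = new StopFilter(V, ts, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
            return new ReverseStringFilter(V, ts);
        }
    }

with the three fields wired up through a PerFieldAnalyzerWrapper:

    Map<String, Analyzer> perField = new HashMap<String, Analyzer>();
    perField.put("stem", new EnglishAnalyzer(Version.LUCENE_36));
    perField.put("suffix", new ReverseAnalyzer());
    // "token" falls through to the default StandardAnalyzer
    Analyzer analyzer = new PerFieldAnalyzerWrapper(
            new StandardAnalyzer(Version.LUCENE_36), perField);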
Doing things this way seems to mean having the text go through the Standard and stop-word filters three times, once for each field. Is there a way, with custom analyzers, my own implementation of PerFieldAnalyzerWrapper, or even out of the box (I'm very new to Lucene), to avoid that duplicate processing*? Perhaps there is a way to store the result of the analysis for the "token" field**, to be reused as the starting point for the analysis of the "stem" and "suffix" fields (which would then only need a stemming filter and the ReverseStringFilter applied, respectively).

Note that I am keen to avoid any pre-analysis processing of the text, as I would like to keep the offsets etc. in line with the sources (stored externally) for hit highlighting, when I eventually get that far!

Any help/advice greatly appreciated.

Thanks and kind regards,
graham

* In languages requiring removal of diacritics for some fields and not others, etc., there will I guess be even more duplication.

** Would this be achievable with reusableTokenStream()? With my Google skills, I haven't been able to get any clear idea of how to go about using it.
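P.S. For what it's worth, this is as far as I have got with reusableTokenStream() (again a Lucene 3.6 sketch; "text" is a placeholder for my document contents). As far as I can tell it only reuses the stream object within a thread, so I don't see how it would let the three fields share one pass over the text, but I may well be missing something:

    import java.io.StringReader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
    // Returns the per-thread cached stream, reset over the new Reader.
    TokenStream ts = analyzer.reusableTokenStream("token", new StringReader(text));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
        System.out.println(term.toString());
    }
    ts.end();
    ts.close();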