I would like to be able to get multi-language support within a single index. I would appreciate input on what I am suggesting:
Assume that you want something like the following fields in your document:

    TitleEnglish
    TitleFrench
    TitleGerman
    KeywordEnglish
    KeywordFrench
    KeywordGerman

Let's pretend for now that each of these was created with a different, appropriate analyzer and that the mechanisms for doing this exist (see end of post for more on this).

How to handle a query? Could we associate an Analyzer with a set of fields, like this:

    // pseudo java
    Analyzer ea = new EnglishAnalyzer(new String[] {"TitleEnglish", "KeywordEnglish"});
    Analyzer fa = new FrenchAnalyzer(new String[] {"TitleFrench", "KeywordFrench"});
    Analyzer ga = new GermanAnalyzer(new String[] {"TitleGerman", "KeywordGerman"});
    MultiLanguageAnalyzer ml = new MultiLanguageAnalyzer();
    ml.add(ea);
    ml.add(fa);
    ml.add(ga);
    QueryParser parser = new MultiLanguageParser("TitleEnglish", ml);
    // end

Now when parser.parse("TitleEnglish:foo TitleFrench:bar smith") is called, MultiLanguageParser uses the appropriate analyzer for each field in the query to parse the sub-query, and rolls up all of the queries created by these analyzers into the real query.

I am thinking that this would require having separate term dictionaries for each language, thus demanding a significant change in the index format? [Note I am not an expert on Lucene internals]

Of course, something similar to the above could be used when adding documents to the index.

Looking at:
http://lucene.apache.org/java/docs/fileformats.html#Per-Segment%20Files
it seems that, instead of the present single set, it would need a set of segment files for each analyzer:

    .fnm (fields)
    .tis and .tii (term dictionary)
    .frq (term frequencies)
    .prx (positions)
    .nrm (normalizations)
    .tvx, .tvd, .tvf (term vectors)

How stable is the code for this part of the index, and would it easily support this kind of extension? Or would some refactoring be needed to make these sorts of manipulations to the nature of the segment files easier for mere mortal developers?
:-)

Is this something that is already being talked about, looked into, or implemented? :-)

thanks,
Glen Newton
http://zzzoot.blogspot.com/
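
P.S. To make the query-side idea a bit more concrete, here is a rough, self-contained sketch of the per-field dispatch described above. It does not use Lucene at all: the "analyzers" are toy functions from text to terms, and every name in it (PerFieldAnalysisSketch, MultiLanguageAnalyzer, parse, demoAnalyzer) is hypothetical and only for illustration. It assumes clauses are separated by whitespace and written as field:text, with bare words going to the default field.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

public class PerFieldAnalysisSketch {

    // Fallback "analyzer": lowercase and split on whitespace only.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String tok : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!tok.isEmpty()) out.add(tok);
        }
        return out;
    }

    // Toy English analyzer: crude plural stripping (not a real stemmer).
    static List<String> english(String text) {
        List<String> out = new ArrayList<>();
        for (String tok : tokens(text)) {
            boolean plural = tok.length() > 3 && tok.endsWith("s");
            out.add(plural ? tok.substring(0, tok.length() - 1) : tok);
        }
        return out;
    }

    // Toy French analyzer: strips a leading l' or d' elision.
    static List<String> french(String text) {
        List<String> out = new ArrayList<>();
        for (String tok : tokens(text)) {
            out.add(tok.replaceFirst("^[ld]'", ""));
        }
        return out;
    }

    // Hypothetical container mapping field names to analyzers.
    static final class MultiLanguageAnalyzer {
        private final Map<String, Function<String, List<String>>> byField = new HashMap<>();
        private final Function<String, List<String>> fallback;

        MultiLanguageAnalyzer(Function<String, List<String>> fallback) {
            this.fallback = fallback;
        }

        void add(Function<String, List<String>> analyzer, String... fields) {
            for (String field : fields) byField.put(field, analyzer);
        }

        List<String> analyze(String field, String text) {
            return byField.getOrDefault(field, fallback).apply(text);
        }
    }

    // Analyzes each clause with the analyzer registered for its field and
    // rolls the results up into one flat list of "field:term" clauses.
    static List<String> parse(String query, String defaultField, MultiLanguageAnalyzer ml) {
        List<String> clauses = new ArrayList<>();
        for (String clause : query.trim().split("\\s+")) {
            if (clause.isEmpty()) continue;
            int colon = clause.indexOf(':');
            String field = colon > 0 ? clause.substring(0, colon) : defaultField;
            String text = colon > 0 ? clause.substring(colon + 1) : clause;
            for (String term : ml.analyze(field, text)) {
                clauses.add(field + ":" + term);
            }
        }
        return clauses;
    }

    static MultiLanguageAnalyzer demoAnalyzer() {
        MultiLanguageAnalyzer ml = new MultiLanguageAnalyzer(PerFieldAnalysisSketch::tokens);
        ml.add(PerFieldAnalysisSketch::english, "TitleEnglish", "KeywordEnglish");
        ml.add(PerFieldAnalysisSketch::french, "TitleFrench", "KeywordFrench");
        return ml;
    }

    public static void main(String[] args) {
        System.out.println(parse("TitleEnglish:Foos TitleFrench:l'Arbre smith",
                "TitleEnglish", demoAnalyzer()));
    }
}
```

Since every term stays tagged with its field name here, the sketch at least suggests that the dispatch itself could live entirely on the analysis side; whether the index format would also need per-language term dictionaries is the part I am unsure about.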