Dear list,

I have a rather specific issue on which I would very much appreciate some thoughts before I start the actual implementation. Here is my task description: I would like to index corpora that have already been tokenized by an external tokenizer. This tokenization is stored in an external file, and it is the one I want to use for the Lucene index as well. For each document, there is a file that describes each token in the document by character offsets, e.g. "<token start="0" end="3" />". Leave aside the XML format; I'll write an appropriate XML parser so that we simply have that tokenization information available. I do not want to do any additional analysis on the input text, i.e. no stopword filtering etc.; each token that is specified in the external tokenization is supposed to result in one indexed token.
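To make the stand-off format concrete, here is a minimal stdlib-only sketch of reading such annotations into offset pairs and slicing the tokens out of the document text. It assumes end offsets are exclusive, and it uses a regex purely for illustration; the real parser will of course handle the XML properly:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExternalTokens {
    // One stand-off annotation: character offsets into the document text.
    // Assumption: 'end' is exclusive, as in String.substring().
    static final class Span {
        final int start, end;
        Span(int start, int end) { this.start = start; this.end = end; }
    }

    // Extract lines like <token start="0" end="3" /> into offset pairs.
    // (Illustrative regex matching; a production version would use a real XML parser.)
    static List<Span> parseSpans(String xml) {
        Pattern p = Pattern.compile("<token\\s+start=\"(\\d+)\"\\s+end=\"(\\d+)\"\\s*/>");
        Matcher m = p.matcher(xml);
        List<Span> spans = new ArrayList<>();
        while (m.find()) {
            spans.add(new Span(Integer.parseInt(m.group(1)),
                               Integer.parseInt(m.group(2))));
        }
        return spans;
    }

    // Materialize the token strings by slicing the document text at the given offsets.
    static List<String> tokens(String text, List<Span> spans) {
        List<String> out = new ArrayList<>();
        for (Span s : spans) {
            out.add(text.substring(s.start, s.end));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Das Haus ist alt";
        String xml = "<token start=\"0\" end=\"3\" /><token start=\"4\" end=\"8\" />"
                   + "<token start=\"9\" end=\"12\" /><token start=\"13\" end=\"16\" />";
        System.out.println(tokens(text, parseSpans(xml))); // [Das, Haus, ist, alt]
    }
}
```

The resulting (term, start, end) triples are exactly what the custom TokenStream described below would have to emit, one per call.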
My approach to achieving this goal would be to implement an Analyzer that reads the external tokenization information and generates a TokenStream containing all the Token objects, with offsets set according to the external tokenization, i.e. without a Tokenizer implementation of its own. I'm working with Lucene 3.5, which is why one very concrete question at this point is: how would you implement this using the Attribute interface; would you still use Token objects, or can/should I avoid them altogether? The documentation is quite vague on that point, and so is the "Lucene in Action (2nd ed.)" textbook.

The background is that I need to support different tokenizations, so there will potentially be multiple indexes for a text. Queries will then have to be tokenized by a user-defined tokenizer, and the suitable index will be searched.

So what are your thoughts on this approach? Is it the right strategy for the task? Please keep in mind that it is a given that the tokenization has to be read from an external file. In general, I am afraid that Lucene all but hardwires the analysis process. Even though it does allow custom tokenizers to be implemented, it does not seem to be intended that one comes up with a completely self-made text analysis process, does it?

Thank you very much!
Carsten

-- 
Carsten Schnober
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP -- Korpusanalyseplattform der nächsten Generation
http://www.ids-mannheim.de/kl/projekte/korap/
Tel.: +49-(0)621-1581-238

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org