Hi,

You should inform yourself about the difference between "stored" and "indexed" fields. The tokens in the ".tis" file are in fact the analyzed tokens retrieved from the TokenStream; this is controlled by the Field.Index parameter. The Field.Store parameter has nothing to do with indexing: if a field is marked as "stored", the full, unchanged string or binary value is written to the stored fields file (".fdt"). Stored fields are used, e.g., to display search results after a search has executed (the tokens alone are not helpful for displaying results).

In general, for every field you should think about what you want to do with it: index it if you want to search on it; store it if you want the value to be displayed in the search results (available via IndexReader/IndexSearcher.document()). In most cases only one of the two options is really needed. I prefer to keep stored and indexed fields completely separate, under different field names; e.g., a stored field can hold an XML file for search result display that has nothing to do with the field used for retrieval, because tokenizing and indexing that plain XML would not be useful.
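As a minimal sketch of the pattern above (assuming the Lucene 3.x API, where the Field(String, String, Field.Store, Field.Index) constructor and the ".tis"/".fdt" files exist; the field names "content" and "display" are illustrative, not from the original mail):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class StoredVsIndexedExample {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();
        IndexWriterConfig config = new IndexWriterConfig(
            Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
        IndexWriter writer = new IndexWriter(dir, config);

        Document doc = new Document();
        // Indexed but not stored: the analyzer turns this value into tokens
        // that end up in the terms dictionary (.tis); the original string
        // itself is NOT kept anywhere.
        doc.add(new Field("content", "some searchable text",
                          Field.Store.NO, Field.Index.ANALYZED));
        // Stored but not indexed: the verbatim value goes into the stored
        // fields file (.fdt) and is used only for search result display.
        doc.add(new Field("display",
                          "<result><title>Some Title</title></result>",
                          Field.Store.YES, Field.Index.NO));
        writer.addDocument(doc);
        writer.close();
    }
}
```

At search time, a hit retrieved via IndexSearcher.doc(docId) would then contain only the "display" field; doc.get("content") returns null, because that field was never stored.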
Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Carsten Schnober [mailto:schno...@ids-mannheim.de]
> Sent: Wednesday, April 18, 2012 5:00 PM
> To: java-user@lucene.apache.org
> Subject: Field value vs TokenStream
>
> Dear list,
> I'm studying the Lucene index file formats and I wonder: after having
> initialized a field with Field(String name, String value, Field.Store store,
> Field.Index index), where is the value String stored?
>
> I understand that the chosen analyzer does its processing on that value,
> including tokenization, and returns a TokenStream from which the indexer
> retrieves the attributes that it stores in the index.
> When I use a binary editor to inspect the term infos (.tis) file in the
> index directory, I can see every single token (term).
> For experimentation purposes, I implemented an analyzer that converts the
> value input to the field and noticed the following: the TokenStream still
> correctly generates the terms that end up stored in the .tis file, but the
> initial input value is still displayed as the field value when I retrieve a
> document from the index and output it with Document.toString(). I tried to
> analyse the Field's TokenStream, but tokenStreamValue() returns null; is
> that normal when retrieving a document from an existing index?
>
> Can someone let me know what happens to a Field's value string, and at which
> point in the pipeline it is replaced by the (term) attributes generated by
> the TokenStream?
>
> Thank you very much!
> Best,
> Carsten
>
> --
> Carsten Schnober
> Institut für Deutsche Sprache | http://www.ids-mannheim.de
> Projekt KorAP -- Korpusanalyseplattform der nächsten Generation
> http://korap.ids-mannheim.de/ | Tel.: +49-(0)621-1581-238
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org