Hi,

You should inform yourself about the difference between "stored" and "indexed" fields. The tokens in the ".tis" file are in fact the analyzed tokens retrieved from the TokenStream; this is controlled by the Field.Index parameter. The Field.Store parameter has nothing to do with indexing: if a field is marked as "stored", the full, unchanged string or binary value is written to the stored fields file (".fdt"). Stored fields are used, e.g., to display search results after a search has executed (the tokens alone are not helpful for displaying results).

In general, for every field you should think about what you want to do with it: index it if you want to search on it; store it if you want the value to be displayed in the search results (available via IndexReader/IndexSearcher.document()). In most cases only one of the two options is really needed. I prefer to keep stored and indexed fields completely separate, under different field names; e.g., a stored field can hold an XML file for search result display that has nothing to do with the field used for retrieval, because tokenizing and indexing that plain XML would not be useful.
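As a minimal sketch of the pattern above (assuming the Lucene 3.x API, where the Field(String, String, Field.Store, Field.Index) constructor and the ".tis"/".fdt" files exist; the field names "content" and "display" are illustrative, not from the original mail):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class StoredVsIndexedExample {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();
        IndexWriterConfig config = new IndexWriterConfig(
            Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
        IndexWriter writer = new IndexWriter(dir, config);

        Document doc = new Document();
        // Indexed but not stored: the analyzer turns this value into tokens
        // that end up in the terms dictionary (.tis); the original string
        // itself is NOT kept anywhere.
        doc.add(new Field("content", "some searchable text",
                          Field.Store.NO, Field.Index.ANALYZED));
        // Stored but not indexed: the verbatim value goes into the stored
        // fields file (.fdt) and is used only for search result display.
        doc.add(new Field("display",
                          "<result><title>Some Title</title></result>",
                          Field.Store.YES, Field.Index.NO));
        writer.addDocument(doc);
        writer.close();
    }
}
```

At search time, a hit retrieved via IndexSearcher.doc(docId) would then contain only the "display" field; doc.get("content") returns null, because that field was never stored.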
Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Carsten Schnober [mailto:schno...@ids-mannheim.de]
> Sent: Wednesday, April 18, 2012 5:00 PM
> To: java-user@lucene.apache.org
> Subject: Field value vs TokenStream
>
> Dear list,
> I'm studying the Lucene index file formats and I wonder: after having
> initialized a field with Field(String name, String value, Field.Store store,
> Field.Index index), where is the value String stored?
>
> I understand that the chosen analyzer does its processing on that value,
> including tokenization, and returns a TokenStream from which the indexer
> retrieves the attributes that it stores in the index.
> When I use a binary editor to inspect the term infos (.tis) file in the
> index directory, I can see every single token (term).
> For experimentation purposes, I implemented an analyzer that converts the
> value input to the field and noticed the following: the TokenStream still
> correctly generates the terms that end up stored in the .tis file, but the
> initial input value is still displayed as the field value when I retrieve a
> document from the index and output it with Document.toString(). I tried to
> analyse the Field's TokenStream, but tokenStreamValue() returns null; is
> that normal when retrieving a document from an existing index?
>
> Can someone let me know what happens to a Field's value string, and at which
> point in the pipeline it is replaced by the (term) attributes generated by
> the TokenStream?
>
> Thank you very much!
> Best,
> Carsten
>
> --
> Carsten Schnober
> Institut für Deutsche Sprache | http://www.ids-mannheim.de
> Projekt KorAP -- Korpusanalyseplattform der nächsten Generation
> http://korap.ids-mannheim.de/ | Tel.: +49-(0)621-1581-238
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org