Re: Split single string into several fields?

Andrzej Bialecki Wed, 28 Oct 2009 02:29:41 -0700

Robert Muir wrote:

Will, I think parsing documents into different fields is separate from, and
unrelated to, Lucene's analysis (tokenization).
Analysis comes into play once you have a field and want to break its text
into indexable units (words, or the entire field as one token, like your URLs).

I wouldn't suggest making one big, complicated analyzer that tries to parse
HTML in addition to breaking text into words; I would keep parsing and
analysis separate.
Then I would handle different fields with different analyzers. I think Erick
already mentioned PerFieldAnalyzerWrapper; it's useful for this.

It's also possible to do the tokenization ahead of time, i.e. before you
pass the document to IndexWriter. You can construct the TokenStream using
your own analysis chain, and use Field.setTokenStreamValue(); this way you
will index exactly the token stream you want, and you can even create other
fields in the document (or split this token stream into several fields).
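
To make the "parse first, then tokenize each field by its own rule" idea concrete, here is a minimal, library-free sketch. It is not Lucene code: the class name, the "name=value;name=value" record format, and the rule that only the "url" field is kept as a single token are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch, not Lucene API: shows parsing and tokenization as two steps.
class FieldSplitSketch {

    // Step 1 (parsing): split an assumed "name=value;name=value" record into raw fields.
    static Map<String, String> parse(String record) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String part : record.split(";")) {
            int eq = part.indexOf('=');
            if (eq > 0) {
                fields.put(part.substring(0, eq).trim(), part.substring(eq + 1).trim());
            }
        }
        return fields;
    }

    // Keyword-style rule: the whole value becomes one token (e.g. a URL).
    static List<String> keyword(String value) {
        return Collections.singletonList(value);
    }

    // Word-style rule: lowercase, then split on runs of non-letter, non-digit characters.
    static List<String> words(String value) {
        List<String> tokens = new ArrayList<>();
        for (String t : value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Step 2 (analysis): pick the tokenization rule per field name.
    static Map<String, List<String>> tokenize(Map<String, String> fields) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            String name = e.getKey();
            String value = e.getValue();
            out.put(name, name.equals("url") ? keyword(value) : words(value));
        }
        return out;
    }

    public static void main(String[] args) {
        String record = "url=http://example.com/Some-Page;title=Hello, Lucene World";
        // prints {url=[http://example.com/Some-Page], title=[hello, lucene, world]}
        System.out.println(tokenize(parse(record)));
    }
}
```

In Lucene itself, the per-field choice in step 2 is the job PerFieldAnalyzerWrapper does, and tokens computed ahead of time would be handed to the field as a TokenStream; the exact method for attaching it depends on the Lucene version, so check the Field javadoc for the release you use.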



--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

