Robert Muir wrote:
Will, I think this parsing of documents into different fields is separate
from, and unrelated to, Lucene's analysis (tokenization)...
The analysis comes into play once you have a field and you want to break the
text into indexable units (words, or the entire field as one token, like your URLs).

I wouldn't suggest making a big, complicated analyzer that tries to parse HTML
in addition to breaking text into words; I would keep parsing and analysis
separate.
Then I would handle different fields with different analyzers. I think Erick
already mentioned PerFieldAnalyzerWrapper; it's useful for this.
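
The separation described above can be sketched in plain Java, without Lucene on the classpath. This is only an illustration: the class name PerFieldAnalysisSketch and the field names "url" and "body" are hypothetical, and the two toy analyzers stand in for real Lucene analyzers. In Lucene itself, the per-field dispatch shown in analyze() is the role PerFieldAnalyzerWrapper plays.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for per-field analysis; this is NOT the Lucene API.
class PerFieldAnalysisSketch {

    // Keyword-style analysis: the whole field value becomes a single token (e.g. a URL).
    static List<String> keyword(String text) {
        return Collections.singletonList(text);
    }

    // Word-style analysis: lowercase, then split on runs of non-letter, non-digit characters.
    static List<String> words(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Field name -> analyzer. Fields not listed here fall back to words().
    static final Map<String, Function<String, List<String>>> ANALYZERS = new HashMap<>();
    static {
        ANALYZERS.put("url", PerFieldAnalysisSketch::keyword);
    }

    // Parsing has already happened by this point: the caller passes one field at a time.
    static List<String> analyze(String field, String text) {
        return ANALYZERS.getOrDefault(field, PerFieldAnalysisSketch::words).apply(text);
    }
}
```

The HTML parser that produces the (field, text) pairs stays outside this class entirely, which is the point: each analyzer only ever sees the text of one field.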

It's also possible to do the tokenization ahead of time, i.e. before you pass the document to IndexWriter. You can construct the TokenStream using your own analysis chain and attach it to a field with Field.setTokenStream() (or the Field constructor that takes a TokenStream). This way you will index exactly the token stream you want, and you can even create other fields in the document (or split this token stream into several fields).
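
As a rough illustration of that idea, the plain-Java sketch below tokenizes the text up front and routes the resulting tokens into two fields before anything reaches an indexer. Everything in it is hypothetical: the class name PreTokenizedSketch, the field names "url" and "body", and the rule that a token starting with "http://" or "https://" counts as a URL. It does not use the Lucene API; with Lucene, each per-field token list would instead be exposed as a TokenStream and attached to its own Field.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of "tokenize first, then build fields"; NOT the Lucene API.
class PreTokenizedSketch {

    // Step 1: run your own analysis chain ahead of time (here: split on whitespace).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Step 2: split the single token sequence into several fields.
    // URL-like tokens are kept verbatim; all other tokens are lowercased.
    static Map<String, List<String>> splitIntoFields(List<String> tokens) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        fields.put("url", new ArrayList<>());
        fields.put("body", new ArrayList<>());
        for (String t : tokens) {
            boolean isUrl = t.startsWith("http://") || t.startsWith("https://");
            if (isUrl) {
                fields.get("url").add(t);
            } else {
                fields.get("body").add(t.toLowerCase(Locale.ROOT));
            }
        }
        return fields;
    }
}
```

Because the token lists are fixed before indexing, the indexer has no further say in how the text is broken up; it stores exactly what it is given.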


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

