Robert Muir wrote:
Will, I think this parsing of documents into different fields is separate
from, and unrelated to, Lucene's analysis (tokenization)...
Analysis comes into play once you have a field and you want to break its
text into indexable units (words, or the entire field as a single token,
like your URLs).
I wouldn't suggest making one big, complicated analyzer that tries to parse
HTML in addition to breaking text into words; I would keep parsing and
analysis separate.
Then I would handle different fields with different analyzers. I think Erick
already mentioned PerFieldAnalyzerWrapper; it's useful for this.
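A minimal sketch of that per-field setup, assuming a recent Lucene where PerFieldAnalyzerWrapper takes a default analyzer plus a map of per-field overrides (older releases used an addAnalyzer method instead); the field names "url" and "body" are illustrative:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class PerFieldDemo {

    // Count the tokens an analyzer produces for a given field's text.
    static int countTokens(Analyzer a, String field, String text) throws IOException {
        try (TokenStream ts = a.tokenStream(field, text)) {
            ts.reset();
            int n = 0;
            while (ts.incrementToken()) n++;
            ts.end();
            return n;
        }
    }

    public static void main(String[] args) throws IOException {
        // "url" keeps the whole value as one token; everything else
        // falls back to the default analyzer.
        Map<String, Analyzer> perField = new HashMap<>();
        perField.put("url", new KeywordAnalyzer());
        Analyzer analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(), perField);

        System.out.println(countTokens(analyzer, "url", "http://www.sigram.com"));  // 1
        System.out.println(countTokens(analyzer, "body", "hello world"));           // 2
    }
}
```

Passing this wrapper to IndexWriterConfig means each field is analyzed by its own chain without any custom dispatch logic on your side.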
It's also possible to do the tokenization ahead of time, i.e. before you
pass the document to IndexWriter. You can construct the TokenStream
using your own analysis chain, and use Field.setTokenStreamValue() -
this way you will index exactly the token stream you want, and you can
even create other fields in the document (or split this token stream
into several fields).
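The Field.setTokenStreamValue() call mentioned above belongs to an older Lucene API; as a sketch of the same idea on a recent Lucene (8+), where a TextField can be constructed directly from a TokenStream, the chain below (WhitespaceTokenizer plus LowerCaseFilter) and the field name "body" are illustrative choices:

```java
import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class PreTokenizedDemo {

    // Index one document whose "body" field is tokenized ahead of time,
    // then count hits for the lowercased term "two".
    static long indexAndCount() throws IOException {
        Directory dir = new ByteBuffersDirectory();
        try (IndexWriter w = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            // Our own analysis chain, built before the document reaches IndexWriter.
            Tokenizer tok = new WhitespaceTokenizer();
            tok.setReader(new StringReader("One Two Three"));
            TokenStream chain = new LowerCaseFilter(tok);

            Document doc = new Document();
            // Pre-analyzed field: IndexWriter indexes exactly this stream,
            // bypassing the analyzer configured on the writer.
            doc.add(new TextField("body", chain));
            w.addDocument(doc);
        }
        try (DirectoryReader r = DirectoryReader.open(dir)) {
            IndexSearcher s = new IndexSearcher(r);
            return s.count(new TermQuery(new Term("body", "two")));
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(indexAndCount()); // 1
    }
}
```

The term "two" is findable because the hand-built chain lowercased it; the writer's StandardAnalyzer never touched this field, which is the point of pre-tokenizing.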
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com