You can use text extractors for the document formats you mentioned.
Lucene itself does not handle text extraction.
Following are the extractors we generally use:
PDF -> PDFBox (a Java API to read PDF documents): http://www.pdfbox.org
WORD -> Antiword: http://www
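Since Lucene only indexes text you hand it, the usual pattern is a small dispatch step that picks the right extractor for each file before building the Lucene Document. A minimal JDK-only sketch of that idea (the `Extractors` class and its method are illustrative, not a real PDFBox/Antiword API):

```java
// Sketch: choose an extractor per file extension before indexing.
// PDFBox is a Java library you call in-process; Antiword is an
// external command-line tool you would run and capture stdout from.
public class Extractors {

    // Return the name of the extractor we'd use for this file.
    public static String extractorFor(String filename) {
        String lower = filename.toLowerCase();
        if (lower.endsWith(".pdf")) return "PDFBox";   // parse via the PDFBox Java API
        if (lower.endsWith(".doc")) return "Antiword"; // run antiword, read its output
        return "plain text";                           // fall back: index the file as-is
    }

    public static void main(String[] args) {
        System.out.println(extractorFor("report.pdf")); // PDFBox
        System.out.println(extractorFor("memo.doc"));   // Antiword
    }
}
```

Whatever the extractor, the end result is the same: a plain `String` (or `Reader`) of the document's text that you pass to a Lucene Field.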
> I suspect that my issue is getting the Field constructor to use a
> different tokenizer. Can anyone help?
You basically need to come up with your own Tokenizer (you can always
write a corresponding JavaCC grammar; compiling it will give you the
Tokenizer).
Then you need to extend org.apache.lu
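To make concrete what a Tokenizer does, here is a standalone sketch using only the JDK (this is not Lucene's actual Tokenizer API, just the idea): consume characters from a Reader and emit tokens at boundaries you define. The boundary rule below ("runs of letters or digits") is deliberately simple; a JavaCC grammar lets you define much richer rules.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Standalone illustration of tokenization (JDK only, not Lucene's API):
// read characters from a Reader, emit one token per run of letters/digits.
public class SimpleTokenizer {
    private final Reader input;

    public SimpleTokenizer(Reader input) {
        this.input = input;
    }

    // Return the next token, or null at end of stream.
    public String next() throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = input.read()) != -1) {
            if (Character.isLetterOrDigit((char) c)) {
                sb.append((char) c);          // still inside a token
            } else if (sb.length() > 0) {
                return sb.toString();         // hit a boundary: emit token
            }
        }
        return sb.length() > 0 ? sb.toString() : null;
    }

    public static void main(String[] args) throws IOException {
        SimpleTokenizer t =
            new SimpleTokenizer(new StringReader("Lucene 2.0, tokenize me!"));
        String tok;
        while ((tok = t.next()) != null) {
            System.out.println(tok); // Lucene / 2 / 0 / tokenize / me
        }
    }
}
```

In real Lucene you would wrap logic like this in a subclass of Lucene's Tokenizer and plug it into an Analyzer, which is what the Field constructor ultimately uses.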