You can use text extractors for the document formats you mentioned.
Lucene itself does not handle text extraction.
Following are the extractors we generally use:
PDF -> PDFBox (a Java API to read PDF documents): http://www.pdfbox.org
WORD -> Antiword: http://www
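Since Lucene only indexes text you hand it, the usual pattern is a small dispatch step that picks the right extractor for each file before building the Lucene Document. A minimal JDK-only sketch of that idea (the `Extractors` class and its method are illustrative, not a real PDFBox/Antiword API):

```java
// Sketch: choose an extractor per file extension before indexing.
// PDFBox is a Java library you call in-process; Antiword is an
// external command-line tool you would run and capture stdout from.
public class Extractors {

    // Return the name of the extractor we'd use for this file.
    public static String extractorFor(String filename) {
        String lower = filename.toLowerCase();
        if (lower.endsWith(".pdf")) return "PDFBox";   // parse via the PDFBox Java API
        if (lower.endsWith(".doc")) return "Antiword"; // run antiword, read its output
        return "plain text";                           // fall back: index the file as-is
    }

    public static void main(String[] args) {
        System.out.println(extractorFor("report.pdf")); // PDFBox
        System.out.println(extractorFor("memo.doc"));   // Antiword
    }
}
```

Whatever the extractor, the end result is the same: a plain `String` (or `Reader`) of the document's text that you pass to a Lucene Field.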
> I suspect that my issue is getting the Field constructor to use a
> different tokenizer. Can anyone help?
You basically need to come up with your own Tokenizer (you can always
write a corresponding JavaCC grammar; compiling it will give you the
Tokenizer).
Then you need to extend org.apache.lu
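To make concrete what a Tokenizer does, here is a standalone sketch using only the JDK (this is not Lucene's actual Tokenizer API, just the idea): consume characters from a Reader and emit tokens at boundaries you define. The boundary rule below ("runs of letters or digits") is deliberately simple; a JavaCC grammar lets you define much richer rules.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Standalone illustration of tokenization (JDK only, not Lucene's API):
// read characters from a Reader, emit one token per run of letters/digits.
public class SimpleTokenizer {
    private final Reader input;

    public SimpleTokenizer(Reader input) {
        this.input = input;
    }

    // Return the next token, or null at end of stream.
    public String next() throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = input.read()) != -1) {
            if (Character.isLetterOrDigit((char) c)) {
                sb.append((char) c);          // still inside a token
            } else if (sb.length() > 0) {
                return sb.toString();         // hit a boundary: emit token
            }
        }
        return sb.length() > 0 ? sb.toString() : null;
    }

    public static void main(String[] args) throws IOException {
        SimpleTokenizer t =
            new SimpleTokenizer(new StringReader("Lucene 2.0, tokenize me!"));
        String tok;
        while ((tok = t.next()) != null) {
            System.out.println(tok); // Lucene / 2 / 0 / tokenize / me
        }
    }
}
```

In real Lucene you would wrap logic like this in a subclass of Lucene's Tokenizer and plug it into an Analyzer, which is what the Field constructor ultimately uses.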