Re: extracting non-english text from word, pdf, etc....??

Michael J. Prichard Thu, 02 Aug 2007 06:06:13 -0700

Yeah, I have seen those. I guess the question is: what do you all use to extract text from Word, Excel, PPT and PDF? Can I use POI, PDFBox and so on? That is what I use now to extract English text.


Thanks,
Michael


testn wrote:

If you can already extract a token stream from those files, you can simply
use different analyzers to analyze that token stream appropriately. Check out
the Lucene contrib analyzers at
http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/
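A minimal sketch of that idea in plain Java, in case it helps: the extracted text is the same Unicode string whatever the language, and only the analysis step changes. This is not the Lucene analyzer API. The class name LanguageAwareTokenizer, the analyze method and the tiny stop-word lists are all hypothetical stand-ins for what a real per-language analyzer from the contrib package would do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a per-language analyzer; not part of Lucene.
final class LanguageAwareTokenizer {

    // Tiny illustrative stop-word lists (assumed for this sketch, far from complete).
    private static final Map<String, Set<String>> STOP_WORDS = Map.of(
        "fr", Set.of("le", "la", "les", "de", "des", "et", "un", "une"),
        "es", Set.of("el", "la", "los", "las", "de", "y", "un", "una"),
        "en", Set.of("the", "a", "an", "of", "and"));

    // Tokenizes already-extracted text and drops the stop words of the given language.
    // Unknown language codes fall back to an empty stop-word list.
    static List<String> analyze(String text, String language) {
        Set<String> stopWords = STOP_WORDS.getOrDefault(language, Set.of());
        Locale locale = Locale.forLanguageTag(language);
        List<String> tokens = new ArrayList<>();
        // Split on anything that is not a Unicode letter or digit,
        // so accented characters stay inside their tokens.
        for (String raw : text.split("[^\\p{L}\\p{Nd}]+")) {
            if (raw.isEmpty()) {
                continue; // split() yields a leading empty string if the text starts with a separator
            }
            String token = raw.toLowerCase(locale);
            if (!stopWords.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```

The same extracted string can be passed through analyze with "fr", "es" or "en", which mirrors choosing a different analyzer per language at indexing time while leaving the extraction code untouched.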



heybluez wrote:

I know how to handle English text with POI, PDFBox and so on.  Now I want
to start indexing non-English languages such as French and Spanish.  Which
extraction libs are available for me?

I want to do:

Excel
Word
PowerPoint
PDF
HTML
RTF

Thanks!
Michael
