I've used POI as well as commercial providers. As always, it depends :-) I wasn't particularly impressed with the commercial providers given the amount of money they wanted. PDF was particularly tricky, but you weren't asking about that. At least with POI, you have the opportunity to fix things that don't work, based on your priorities. I don't know what the failure rate is for the commercial providers, but in my experience they will all fail at least once, so you had better plan for it.

I'd look at a framework like Tika or Aperture, where you can easily upgrade or plug in new or different libraries (including commercial providers) as needed without rewriting your code. With something like Tika or Aperture, you could also mix and match solutions, using one for Word and a different one for PPT or PDF.
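To make the plug-in idea concrete, here is a minimal sketch in plain Java. It is not the Tika or Aperture API; `ExtractorRegistry` and `TextExtractor` are hypothetical names made up for illustration. The assumption is that each extractor wraps one library (POI, a PDF parser, a commercial SDK) and that any of them may throw on a given file, so the registry tries them in order and falls back.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical registry: maps a file extension to an ordered list of extractors. */
public class ExtractorRegistry {

    /** Hypothetical plug-in interface; an implementation might wrap POI or a commercial SDK. */
    public interface TextExtractor {
        String extract(byte[] content) throws Exception;
    }

    private final Map<String, List<TextExtractor>> byExtension = new HashMap<>();

    /** Register an extractor for an extension; earlier registrations are tried first. */
    public void register(String extension, TextExtractor extractor) {
        byExtension
            .computeIfAbsent(extension.toLowerCase(Locale.ROOT), k -> new ArrayList<>())
            .add(extractor);
    }

    /**
     * Try each extractor registered for the file's extension, in order.
     * Returns null if none is registered or all of them fail, so the caller
     * can log the file and move on instead of aborting the whole indexing run.
     */
    public String extract(String fileName, byte[] content) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        List<TextExtractor> candidates =
            byExtension.getOrDefault(ext, Collections.<TextExtractor>emptyList());
        for (TextExtractor extractor : candidates) {
            try {
                return extractor.extract(content);
            } catch (Exception e) {
                // Failures are expected now and then: fall through to the next extractor.
            }
        }
        return null;
    }
}
```

With something like this, swapping the Word extractor or adding a second one as a fallback is a change to the `register` calls only; the indexing code that calls `extract` stays the same.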

One issue with any of them is how you plan to use them. If you need more than a bag of words, they all get less reliable, especially when it comes to PDFs and Office docs. In my experience, dealing with things like tables, columns, captions, labels, etc. has always been problematic when one wants to do higher-level processing (beyond keyword search).

HTH,
Grant

On May 12, 2008, at 10:03 AM, Lukas Vlcek wrote:

Hi,

I need to find a reliable way to extract content from Word, Excel and PowerPoint files prior to indexing, and I am not sure if POI is the best way to go. Can anybody share experience with POI and/or other [commercial] Java libraries for text extraction from MS formats?

My experience with POI is that it can sometimes be a pain to get the content out of MS files properly. I also know that the Nutch plugin uses POI for MS formats, but as far as I remember it is not 100% reliable (my experience, now more than a year old, is that about 1-2% of files were not parsed).

My requirements are that the text extraction software must run on Linux and should be written in Java; it can be an open source or a commercial library.

Regards,
Lukas

--
http://blog.lukas-vlcek.com/

--------------------------
Grant Ingersoll
http://lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ






