We are using Aspose (www.aspose.com). We are still in pre-release, but it works fine for all of the MS products. It's commercial, but it is a good deal as long as you don't have too many developers working on it, since the licensing is per seat. We had a little trouble with their PDF product. The other thing is that their main product line is .NET, but the Java line has kept up pretty well. For text extraction the APIs are straightforward.
mark harwood <[EMAIL PROTECTED]>
05/13/2008 07:44 AM
Please respond to java-user@lucene.apache.org

To java-user@lucene.apache.org
cc
Subject Re: Can POI provide reliable text extraction results for production search engine for Word, Excel and PowerPoint formats?

On the commercial front, Oracle's "Outside In" (previously Stellent) is the one that gets used in a lot of search engines. Being a C-based product though, integration isn't quite as nice/easy as pure Java solutions.

----- Original Message ----
From: Bowesman Antony <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Tuesday, 13 May, 2008 8:49:00 AM
Subject: Re: Can POI provide reliable text extraction results for production search engine for Word, Excel and PowerPoint formats?

We are using POI 3.0.2 FINAL. Like you, we find it is not very reliable for many Word files. It does not support Word 2 files, fast-saved files, or files which are not padded to 256 bytes. PPT and Excel support is quite bad; a large percentage of our PPT files throw exceptions. I have not tried 3.1 as it has just gone BETA 1, but I expect that the Word parsing is unchanged, and the changelog doesn't show any Word changes.

TextMining.org http://www.textmining.org/ is quite good, but the 0.4 version did not do Word 2 or fast-saved files. The 1.0 version should fix that, but I've not yet tried it. The licence for 1.0 is LGPL, whereas 0.4 was Apache 2.

AbiWord http://www.abisource.com/ is pretty good, but it's a complete GUI so it is quite slow if you want to use it for a lot of parsing. It can do text extraction via the command line. The Linux version supports pipes. It's based on wvWare http://wvware.sourceforge.net/

Catdoc (http://ftp.wagner.pp.ru/~vitus/software/catdoc/) is quite effective and fast. It also has catppt. I'm not sure if the text order is 100% according to the original though.

The last two are not licence friendly for distribution.
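For the command-line tools mentioned above, one way to use them from Java is to spawn the tool as a child process and collect what it writes to standard output. The following is only a minimal sketch under stated assumptions, not code from anyone in this thread: the class name ExternalTextExtractor is hypothetical, it assumes Java 9 or later, and it assumes the tool (for example catdoc) is on the PATH, accepts the document path as its last argument, and writes UTF-8 text to stdout. The output goes through a temporary file so that the timeout still applies if the tool hangs.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: runs an external converter and returns its stdout as text. */
final class ExternalTextExtractor {
    private final List<String> baseCommand;
    private final long timeoutSeconds;

    ExternalTextExtractor(List<String> baseCommand, long timeoutSeconds) {
        this.baseCommand = new ArrayList<>(baseCommand);
        this.timeoutSeconds = timeoutSeconds;
    }

    String extract(Path document) throws IOException, InterruptedException {
        // Assumption: the tool takes the document path as its final argument.
        List<String> command = new ArrayList<>(baseCommand);
        command.add(document.toString());

        Path output = Files.createTempFile("extracted-", ".txt");
        try {
            Process process = new ProcessBuilder(command)
                    .redirectOutput(output.toFile())                 // stdout goes to the temp file
                    .redirectError(ProcessBuilder.Redirect.DISCARD)  // stderr is dropped
                    .start();

            // Kill the tool if it does not finish in time.
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                throw new IOException("Timed out: " + command);
            }
            if (process.exitValue() != 0) {
                throw new IOException("Exit code " + process.exitValue() + ": " + command);
            }
            // Assumption: the tool emits UTF-8.
            return new String(Files.readAllBytes(output), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(output);
        }
    }
}
```

With catdoc installed, a call might look like new ExternalTextExtractor(List.of("catdoc"), 30).extract(Paths.get("report.doc")). Spawning one process per file has a cost, so this suits the fallback cases better than the main parsing path.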
I've extracted the Nutch parsing framework and am using it in our product. I have tested all of the above, and the priority for Word parsing is TextMining v0.4 first, then POI, and then the other two, which I plugged in via the parse-ext parser.

HTH
Antony

Lukas Vlcek wrote:
> Hi,
>
> I need to find a reliable way to extract content out of Word, Excel and
> PowerPoint formats prior to indexing and I am not sure if POI is the best
> way to go. Can anybody share experience with POI and/or other [commercial]
> Java library for text extraction from MS formats?
>
> My experience with POI is such that sometimes it can be a pain to get the
> content out of the MS files properly. I also know that Nutch plugin uses POI
> for MS formats but as far as I remember it is not 100% reliable (my more
> than one year old experience is that about 1-2% of files were not parsed).
>
> My requirements are that the text extraction software must run on Linux and
> should be written in Java, it can be open source or commercial library.
>
> Regards,
> Lukas
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
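The parser priority described in this thread (TextMining v0.4 first, then POI, then the external tools) amounts to a fallback chain: try each parser in order and move on when one throws or returns nothing. Here is a minimal sketch of that idea, assuming nothing about the Nutch or POI APIs. The names TextExtractor and FallbackExtractor are hypothetical, and the real parsers would have to be wrapped behind the small interface shown.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical wrapper interface; a real parser would be adapted to it. */
interface TextExtractor {
    String extract(byte[] document) throws Exception;
}

/** Tries the registered extractors in the order they were added. */
final class FallbackExtractor implements TextExtractor {
    // LinkedHashMap keeps insertion order, which is the priority order.
    private final Map<String, TextExtractor> extractors = new LinkedHashMap<>();

    FallbackExtractor add(String name, TextExtractor extractor) {
        extractors.put(name, extractor);
        return this;
    }

    @Override
    public String extract(byte[] document) {
        List<String> failures = new ArrayList<>();
        for (Map.Entry<String, TextExtractor> entry : extractors.entrySet()) {
            try {
                String text = entry.getValue().extract(document);
                if (text != null && !text.trim().isEmpty()) {
                    return text; // first usable result wins
                }
                failures.add(entry.getKey() + ": empty result");
            } catch (Exception e) {
                // A parser that throws is skipped, and the reason is recorded.
                failures.add(entry.getKey() + ": " + e);
            }
        }
        throw new IllegalStateException("All extractors failed: " + failures);
    }
}
```

Keeping the failure reasons makes it easier to see afterwards which files defeated which parser, which is the kind of data behind the percentages quoted in this thread.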