If it's Windows-only, you can roll your own with IFilters (http://www.ifilter.org/).
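
Whichever extractor you pick, Grant's point in the quoted thread below (keep the libraries pluggable, and plan for every one of them to fail on some document) is worth building in from the start. Here is a rough Python sketch of that idea, not how Tika or Aperture actually do it. The names ExtractorRegistry, ExtractionError and the two stand-in extractors are all made up for illustration; real code would wrap POI, an IFilter, or a commercial library behind the same kind of callable.

```python
from pathlib import Path
from typing import Callable, Dict, List

# An extractor takes raw document bytes and returns plain text.
Extractor = Callable[[bytes], str]


class ExtractionError(Exception):
    """Raised when no registered extractor could handle a document."""


class ExtractorRegistry:
    """Hypothetical registry: tries extractors per file extension, in order."""

    def __init__(self) -> None:
        self._by_ext: Dict[str, List[Extractor]] = {}

    def register(self, ext: str, extractor: Extractor) -> None:
        # Extractors registered first are tried first.
        self._by_ext.setdefault(ext.lower(), []).append(extractor)

    def extract(self, filename: str, data: bytes) -> str:
        ext = Path(filename).suffix.lower()
        errors: List[str] = []
        for extractor in self._by_ext.get(ext, []):
            try:
                return extractor(data)
            except Exception as exc:  # any library failure triggers fallback
                errors.append(f"{extractor.__name__}: {exc}")
        raise ExtractionError(
            f"no extractor succeeded for {filename!r}: {errors}"
        )


# Stand-ins for real libraries (assumptions, for illustration only).
def flaky_doc_extractor(data: bytes) -> str:
    raise ValueError("unsupported Word variant")


def fallback_doc_extractor(data: bytes) -> str:
    return data.decode("latin-1", errors="replace")
```

You would register the preferred extractor first and the fallback second, e.g. registry.register(".doc", flaky_doc_extractor) followed by registry.register(".doc", fallback_doc_extractor), so swapping or adding a library for one format never touches the indexing code.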

On Tue, May 13, 2008 at 10:23 AM, Lukas Vlcek <[EMAIL PROTECTED]> wrote:

> Does it make sense to consider using OpenOffice to convert from MS
> formats to PDF or HTML before indexing? Would this yield a lower
> failure rate than a pure POI approach? I don't care about formatting
> for now; I care about content in the first place. Formatting would be
> important only if Nutch or another piece of software were able to
> carry this information into the Lucene index (so that words in a
> headline would get a higher boost, for example).
>
> A couple of words about my motivation:
>
> We released SharePoint 2007 in our company. We are not very satisfied
> with its search capabilities, so I started looking for alternatives.
> The first thing I looked at was Google Search Appliance, as they claim
> it can crawl, index and search SharePoint portals.
>
> I realized that their integration with the outer world is done via a
> connector manager, which is itself an open source project written in
> Java, and the SharePoint connector implementation is also released as
> open source in Java. This makes me think that I should be able to
> test the SharePoint connector by replacing the GSA black box with
> Lucene, Nutch, Solr or whatever, and see how well this connector thing
> works. This would be a perfect test before we invest more in GSA. On
> the other hand, if I am able to run the SharePoint connector without
> GSA (replaced by a Lucene-based product), then text extraction from
> the MS family of formats could be the main impenetrable barrier.
>
> Lukas
>
> On Tue, May 13, 2008 at 4:13 PM, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
>
> > Grant Ingersoll wrote:
> >
> > > I've used POI, as well as commercial providers. As always, it
> > > depends :-) I wasn't particularly impressed with the commercial
> > > providers given the amount of money they wanted for it. PDF was
> > > particularly tricky, but you weren't asking about that.
> > > At least w/ POI, you have the opportunity to fix things that don't
> > > work based on your priorities. I don't know what the failure rate
> > > is for the commercial providers, but my experience is they will
> > > all fail at least once, so you better plan on it. I'd look to use
> > > a framework like Tika or Aperture, where you can easily upgrade or
> > > plug in new or different libraries (including commercial
> > > providers) as needed w/o rewriting your code. Additionally, with
> > > something like Tika or Aperture, you could easily mix and match
> > > your solutions, such that you use one for Word and a different one
> > > for PPT or PDF.
> > >
> > > One issue with any of them is how you plan to use them. If you
> > > need more than bag of words, they all get less reliable,
> > > especially when it comes to PDFs and Office docs. Dealing with
> > > things like tables, columns, captions, labels, etc. has always
> > > been problematic in my experience when one wants to do higher
> > > level processing (beyond keyword search).
> >
> > Yet another option ... In the past I used a licensed copy of MS
> > Office to extract things that I wanted, using a bit of OLE
> > automation and VBScript. Worked reasonably well, in the sense that I
> > had no issues whatsoever with extracting the content _and_
> > formatting from any documents that could be normally opened with MS
> > Office - however, performance was an issue, i.e. it was slow, a
> > CPU/memory hog, and occasionally it would get stuck in a weird state
> > where only a complete reboot would help.
> >
> > --
> > Best regards,
> > Andrzej Bialecki <><
> > Information Retrieval, Semantic Web
> > Embedded Unix, System Integration
> > http://www.sigram.com  Contact: info at sigram dot com
>
> --
> http://blog.lukas-vlcek.com/