I'm trying to upgrade the search functionality at my company (currently RTF/text only, and exact phrase match only), and have run into some concerns. Our four main formats are:

RTF - javax.swing looks fine; we use those classes already.

MS Word - I know that POI exists, but development on the Word portion seems to have stopped, and there are a lot of nasty-looking bugs in their bug database. Since we deal with contracts, many of our Word files are large and complicated. How has everyone's experience with POI's Word parsing been?

PDF - Looks like PDFBox has memory issues. Frankly, this is only a problem during indexing. Minor, but still a concern.

WordPerfect - There don't seem to be any converters for this format.

I would hate to have to recommend that my boss shell out $10k to $25k (or more!) in licensing fees for a commercial search engine just because I can't parse the files and the commercial ones can. But that is still cheaper than dedicating two engineers for 6 months to writing parsers for Word, PDF and WordPerfect if we go with Lucene (frankly, there's less risk too, considering how complicated parsing would be).

I know that Lucene doesn't deal with file formats, but the basic fact is that to use Lucene you have to present it with text strings, and there's no way to get those without dealing with file formats.

What is the experience of people on the list with implementing parsers for anything more than text, HTML and XML?

Thanks for any insights,
Jeff Wang
diCarta, Inc.