I faced a similar problem too.
May I suggest trying pdftotext? I have seen it being used by Google Desktop.
http://www.foolabs.com/xpdf/download.html

AFAIK it is licensed under the GNU General Public License.

Best Regards,
Aditya

From: Grant Ingersoll [mailto:[email protected]] On Behalf Of Grant Ingersoll
Sent: Tuesday, April 27, 2010 3:38 AM
To: [email protected]
Subject: Re: Problem with PDF extraction

Hi Marc,

Can you ask on [email protected] and give more information about any errors that occur in your Solr log, plus the setup of the ExtractingRequestHandler and the related schema?

-Grant

On Apr 26, 2010, at 5:04 PM, Marc Ghorayeb wrote:

Hello,

I have been having problems with PDFs randomly crashing the Solr 1.4 server, so I tried out the SVN version, which contains a newer Tika library. On its own, the Tika app correctly extracts the content of my PDF. However, inside Solr, when I upload a PDF file to my update/extract handler, it does not seem to parse it (a blank file is output). The literal values do get indexed, though.

I have had no luck getting the Tika parsing to work. For some reason, I get the same result whether or not tika-parsers-0.7.jar is present in the lib folder, whereas if the tika-core-0.7 jar is absent, it just crashes (which seems normal to me).

I don't seem to be the only one having this problem (judging by the user mailing list). Can anyone help me out? It would be greatly appreciated. I use a fairly classic schema and the default request handlers.

Marc Ghorayeb

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
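
As a rough sketch of the pdftotext suggestion at the top of this message: the snippet below shells out to the pdftotext command-line tool from Xpdf and captures the extracted text. It assumes pdftotext is installed and on the PATH. The function names build_pdftotext_command and extract_text are made up for illustration and are not part of Xpdf, Solr, or Tika.

```python
import subprocess
from typing import List


def build_pdftotext_command(pdf_path: str, encoding: str = "UTF-8") -> List[str]:
    """Build the pdftotext argument list (hypothetical helper).

    "-enc" selects the output text encoding, and passing "-" as the
    output file name makes pdftotext write the text to stdout.
    """
    return ["pdftotext", "-enc", encoding, pdf_path, "-"]


def extract_text(pdf_path: str) -> str:
    """Run pdftotext on one PDF and return the extracted text.

    Assumes the pdftotext binary is on the PATH. A non-zero exit status
    raises subprocess.CalledProcessError.
    """
    result = subprocess.run(
        build_pdftotext_command(pdf_path),
        capture_output=True,
        check=True,
    )
    # Output is requested as UTF-8 above, so decode it the same way.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_text("some.pdf") should return the document text as a string, which could then be indexed as an ordinary text field if the extraction inside Solr keeps coming back blank. This is only one possible workaround, not a fix for the handler itself.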
