RE: Problem with PDF extraction

Aditya Tue, 27 Apr 2010 00:35:05 -0700

I faced a similar problem too.


May I suggest trying pdftotext? I have seen it being used by Google
Desktop.

 

http://www.foolabs.com/xpdf/download.html

 

AFAIK it is licensed under the GNU General Public License.
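
As a rough illustration of the pdftotext route, here is a minimal Python sketch. It assumes the xpdf pdftotext binary is installed and on the PATH; the helper names (build_pdftotext_command, extract_text) and the file name in the usage line are hypothetical. Passing "-" as the output file asks pdftotext to write the text to stdout.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, layout=False):
    """Build the pdftotext argument list (hypothetical helper)."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if layout:
        # -layout tries to preserve the physical layout of the page
        cmd.append("-layout")
    # "-" as the text file sends the extracted text to stdout
    cmd += [pdf_path, "-"]
    return cmd


def extract_text(pdf_path, layout=False):
    """Run pdftotext and return the extracted text as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, layout),
        capture_output=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Usage would be something like extract_text("problem.pdf"). The resulting text could then be sent to Solr as a plain field value, which sidesteps the Tika parsing step entirely.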

 

Best Regards,

Aditya

 

From: Grant Ingersoll [mailto:[email protected]] On Behalf Of Grant Ingersoll
Sent: Tuesday, April 27, 2010 3:38 AM
To: [email protected]
Subject: Re: Problem with PDF extraction

 

Hi Marc,

 

Can you ask on [email protected] and give more information about
any errors that occur in your Solr log, plus the setup of the
ExtractingRequestHandler and the related schema?

 

-Grant

 

On Apr 26, 2010, at 5:04 PM, Marc Ghorayeb wrote:





Hello,

 

I have been having problems with PDFs randomly crashing the Solr 1.4 server,
so I tried out the SVN version, which contains a newer Tika library. On its
own, the Tika app extracts the content of my PDF correctly. However, inside
Solr, when I upload a PDF file to my update/extract handler, it does not
seem to parse it (the output is blank). The literal values do get
indexed, though. I have had no luck getting the Tika parsing to work. For
some reason, I get the same result whether or not tika-parsers-0.7.jar
is present in the lib folder, whereas if the tika-core-0.7 jar is absent, it
just crashes (which seems normal to me).
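
For anyone trying to reproduce this, here is a minimal Python sketch of the kind of request described above. It is not the exact setup from this message: the base URL (the default example port), the literal.id field, and the helper names are assumptions. Setting extractOnly=true asks the handler to return what Tika extracted instead of indexing it, which helps show whether the parser ran at all.

```python
import urllib.parse
import urllib.request

# Assumption: a default example Solr instance on localhost
SOLR_BASE = "http://localhost:8983/solr"


def build_extract_url(doc_id, extract_only=False, base=SOLR_BASE):
    """Build the update/extract URL with a literal id (hypothetical helper)."""
    params = {"literal.id": doc_id}
    if extract_only:
        # Return the extracted content in the response; nothing is indexed
        params["extractOnly"] = "true"
    else:
        params["commit"] = "true"
    return base + "/update/extract?" + urllib.parse.urlencode(params)


def post_pdf(pdf_path, doc_id, extract_only=False):
    """POST the raw PDF bytes to the handler and return the response body."""
    with open(pdf_path, "rb") as f:
        body = f.read()
    req = urllib.request.Request(
        build_extract_url(doc_id, extract_only),
        data=body,  # supplying data makes this a POST
        headers={"Content-Type": "application/pdf"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

If the extractOnly response has an empty body while the standalone Tika app extracts text from the same file, that points at the parser jars not being picked up by Solr rather than at the PDF itself.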

 

I don't seem to be the only one having this problem, judging by the user
mailing list. Can anyone help me out? It would be greatly appreciated.

 

I use a fairly classic schema and the default request handlers.

 

Marc Ghorayeb.

 


 

--------------------------

Grant Ingersoll

http://www.lucidimagination.com/

 

Search the Lucene ecosystem using Solr/Lucene:
http://www.lucidimagination.com/search
