In my experience, pdftotext is much better and faster.
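
For anyone who wants to try it, here is a minimal sketch of calling pdftotext (from poppler-utils) from a script and getting the text back for indexing. It assumes pdftotext is installed and on the PATH. The function names build_pdftotext_command and extract_text are hypothetical helpers made up for this example, not part of any library. The -layout and -enc options and the "-" output argument (write to stdout) are pdftotext's own.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, keep_layout=True):
    """Return the argument list for a pdftotext call that writes to stdout.

    Hypothetical helper for illustration only.
    """
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        # -layout asks pdftotext to keep the physical layout of the page,
        # which also tends to preserve spacing between words.
        cmd.append("-layout")
    # A trailing "-" tells pdftotext to write the text to stdout.
    cmd += [str(pdf_path), "-"]
    return cmd


def extract_text(pdf_path, keep_layout=True):
    """Run pdftotext on pdf_path and return the extracted text as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH (install poppler-utils)")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, keep_layout),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")
```

The returned string can then be handed to whatever indexes the content, in place of the text Tika produced.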

On Fri, Dec 3, 2010 at 08:52, Fabiano Nunes <fabi...@nunes.me> wrote:

> Have you ever tried other extractor tool than PDFBox? I used to extract contents with pdfbox: its capability of extract contents wasn't a problem, but its lack of structure information was.
> You can try poppler-utils (pdftotext) to extract contents with layout structure.
>
> Fabiano Nunes
>
> On Fri, Dec 3, 2010 at 10:08 AM, Ian Lea <ian....@gmail.com> wrote:
>
> > Maybe https://issues.apache.org/jira/browse/TIKA-548 is relevant.
> > Have you tried asking on the tika mailing list?
> > http://tika.apache.org/mail-lists.html.
> >
> > --
> > Ian.
> >
> > On Fri, Dec 3, 2010 at 11:55 AM, Ganesh <emailg...@yahoo.co.in> wrote:
> > > I first extract the contents from documents using tika and latter index it with Lucene. The problem is the extracted text from PDF using tika has no whitespaces.
> > >
> > > Regards
> > > Ganesh
> > >
> > > ----- Original Message -----
> > > From: "McGibbney, Lewis John" <lewis.mcgibb...@gcu.ac.uk>
> > > To: <java-user@lucene.apache.org>
> > > Sent: Friday, December 03, 2010 4:40 PM
> > > Subject: RE: PDF text extracted without spaces
> > >
> > >> Hi Ganesh
> > >>
> > >> I encountered this same problem last week. I was thinking if it was possible to include at minimum a WhitespaceAnalyzer somewhere within Tika which would solve the problem. I am not sure of how this would be done as I am not familiar with Tika codebase.
> > >>
> > >> Unfortunately I don't think that the solution to the first part of this problem lies within the java-user mailing list.
> > >>
> > >> When were you sending extracted contents to Lucene... at what later stage?
> > >>
> > >> Thank you
> > >>
> > >> Lewis
> > >>
> > >> -----Original Message-----
> > >> From: Ganesh [mailto:emailg...@yahoo.co.in]
> > >> Sent: 03 December 2010 10:44
> > >> To: java-user@lucene.apache.org
> > >> Subject: Re: PDF text extracted without spaces
> > >>
> > >> The main problem is i am not getting whitespace and newline char. This is happening only for PDF documents.
> > >>
> > >> Sample outoput: Someofthedifferencesare but it should be Some of the differences are
> > >>
> > >> Regards
> > >> Ganesh
> > >>
> > >> ----- Original Message -----
> > >> From: "Alexander Aristov" <alexander.aris...@gmail.com>
> > >> To: <java-user@lucene.apache.org>
> > >> Sent: Friday, December 03, 2010 2:39 PM
> > >> Subject: Re: PDF text extracted without spaces
> > >>
> > >>> anyway even if you get correct whitespaces and new lines this won't affect indexing.
> > >>>
> > >>> Best Regards
> > >>> Alexander Aristov
> > >>>
> > >>> On 3 December 2010 10:00, Lance Norskog <goks...@gmail.com> wrote:
> > >>>
> > >>>> The text should come out as a stream of words with space, but without any of the formatting in the PDF. Extraction is only good enough to tell you that a word is somewhere inside a PDF file. Can you post a short bit of the text that it extracted?
> > >>>>
> > >>>> Also, you should try this test on different PDF files that were made with different software.
> > >>>>
> > >>>> On Thu, Dec 2, 2010 at 9:35 PM, Ganesh <emailg...@yahoo.co.in> wrote:
> > >>>> > Hello all,
> > >>>> >
> > >>>> > I know, this is not the right group to ask this question, thought some of you guys might have experienced.
> > >>>> >
> > >>>> > I newbie with Tika. I am using latest version 0.8 version. I extracted text from PDF document but found spaces and new line missing. Indexing the data gives wrong result. Could any one in this group could help me? I am using tika directly to extract the contents, which later gets indexed.
> > >>>> >
> > >>>> > Regards
> > >>>> > Ganesh
> > >>>> > Send free SMS to your Friends on Mobile from your Yahoo! Messenger. Download Now! http://messenger.yahoo.com/download.php
> > >>>> >
> > >>>> > ---------------------------------------------------------------------
> > >>>> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > >>>> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > >>>>
> > >>>> --
> > >>>> Lance Norskog
> > >>>> goks...@gmail.com
> > >>
> > >> Email has been scanned for viruses by Altman Technologies' email management service - www.altman.co.uk/emailsystems
> > >>
> > >> Glasgow Caledonian University is a registered Scottish charity, number SC021474
> > >>
> > >> Winner: Times Higher Education’s Widening Participation Initiative of the Year 2009 and Herald Society’s Education Initiative of the Year 2009
> > >> http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html

--
Hans Merkl
Right On Point, LLC
215 Victor Parkway, Suite E
Annapolis, MD 21403
Phone: (443) 951-4324
E-mail: hme...@rightonpoint.us