Have you tried an extraction tool other than PDFBox? I used to extract content with PDFBox: its ability to extract content wasn't the problem, but its lack of structure information was. You can try poppler-utils (pdftotext) to extract content with the layout structure preserved.
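A minimal sketch of the pdftotext suggestion, assuming poppler-utils is installed and pdftotext is on the PATH. The class name PdfToTextSketch and its two methods are hypothetical names chosen for illustration; only the pdftotext options (-layout, -enc, and "-" for standard output) come from the tool itself. The returned string could then be handed to Lucene in place of the Tika output.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class PdfToTextSketch {

    // Builds the pdftotext command line.
    // -layout keeps the physical layout of the page as far as possible,
    // -enc UTF-8 sets the output encoding, and "-" writes the text to stdout.
    public static List<String> buildCommand(Path pdf) {
        return Arrays.asList("pdftotext", "-layout", "-enc", "UTF-8", pdf.toString(), "-");
    }

    // Runs pdftotext and returns the extracted text (requires Java 9+ for readAllBytes).
    public static String extract(Path pdf) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdf))
                .redirectError(ProcessBuilder.Redirect.INHERIT) // keep warnings out of the text
                .start();
        byte[] output = process.getInputStream().readAllBytes();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("pdftotext exited with code " + exitCode);
        }
        return new String(output, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java PdfToTextSketch <file.pdf>");
            return;
        }
        System.out.println(extract(Paths.get(args[0])));
    }
}
```

Running it on one of the PDFs that comes out without spaces is a quick way to check whether the problem lies in the file or in the extractor.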
Fabiano Nunes

On Fri, Dec 3, 2010 at 10:08 AM, Ian Lea <ian....@gmail.com> wrote:

> Maybe https://issues.apache.org/jira/browse/TIKA-548 is relevant.
> Have you tried asking on the tika mailing list?
> http://tika.apache.org/mail-lists.html.
>
> --
> Ian.
>
> On Fri, Dec 3, 2010 at 11:55 AM, Ganesh <emailg...@yahoo.co.in> wrote:
> > I first extract the contents from documents using tika and later index it with Lucene. The problem is that the text extracted from PDF using tika has no whitespace.
> >
> > Regards
> > Ganesh
> >
> > ----- Original Message -----
> > From: "McGibbney, Lewis John" <lewis.mcgibb...@gcu.ac.uk>
> > To: <java-user@lucene.apache.org>
> > Sent: Friday, December 03, 2010 4:40 PM
> > Subject: RE: PDF text extracted without spaces
> >
> >> Hi Ganesh
> >>
> >> I encountered this same problem last week. I was wondering if it was possible to include at minimum a WhitespaceAnalyzer somewhere within Tika, which would solve the problem. I am not sure how this would be done as I am not familiar with the Tika codebase.
> >>
> >> Unfortunately I don't think that the solution to the first part of this problem lies within the java-user mailing list.
> >>
> >> When were you sending extracted contents to Lucene... at what later stage?
> >>
> >> Thank you
> >>
> >> Lewis
> >>
> >> -----Original Message-----
> >> From: Ganesh [mailto:emailg...@yahoo.co.in]
> >> Sent: 03 December 2010 10:44
> >> To: java-user@lucene.apache.org
> >> Subject: Re: PDF text extracted without spaces
> >>
> >> The main problem is that I am not getting whitespace and newline characters. This is happening only for PDF documents.
> >>
> >> Sample output: Someofthedifferencesare but it should be Some of the differences are
> >>
> >> Regards
> >> Ganesh
> >>
> >> ----- Original Message -----
> >> From: "Alexander Aristov" <alexander.aris...@gmail.com>
> >> To: <java-user@lucene.apache.org>
> >> Sent: Friday, December 03, 2010 2:39 PM
> >> Subject: Re: PDF text extracted without spaces
> >>
> >>> Anyway, even if you get correct whitespace and new lines, this won't affect indexing.
> >>>
> >>> Best Regards
> >>> Alexander Aristov
> >>>
> >>> On 3 December 2010 10:00, Lance Norskog <goks...@gmail.com> wrote:
> >>>
> >>>> The text should come out as a stream of words with spaces, but without
> >>>> any of the formatting in the PDF. Extraction is only good enough to
> >>>> tell you that a word is somewhere inside a PDF file. Can you post a
> >>>> short bit of the text that it extracted?
> >>>>
> >>>> Also, you should try this test on different PDF files that were made
> >>>> with different software.
> >>>>
> >>>> On Thu, Dec 2, 2010 at 9:35 PM, Ganesh <emailg...@yahoo.co.in> wrote:
> >>>> > Hello all,
> >>>> >
> >>>> > I know this is not the right group to ask this question, but I thought some of you might have experienced this.
> >>>> >
> >>>> > I am a newbie with Tika. I am using the latest version, 0.8. I extracted text from a PDF document but found spaces and new lines missing. Indexing the data gives wrong results. Could anyone in this group help me? I am using tika directly to extract the contents, which later get indexed.
> >>>> >
> >>>> > Regards
> >>>> > Ganesh
> >>>> > Send free SMS to your Friends on Mobile from your Yahoo! Messenger. Download Now! http://messenger.yahoo.com/download.php
> >>>> >
> >>>> > ---------------------------------------------------------------------
> >>>> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >>>> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>>>
> >>>> --
> >>>> Lance Norskog
> >>>> goks...@gmail.com
> >>
> >> Email has been scanned for viruses by Altman Technologies' email management service - www.altman.co.uk/emailsystems
> >>
> >> Glasgow Caledonian University is a registered Scottish charity, number SC021474
> >>
> >> Winner: Times Higher Education’s Widening Participation Initiative of the Year 2009 and Herald Society’s Education Initiative of the Year 2009
> >> http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html