Maybe https://issues.apache.org/jira/browse/TIKA-548 is relevant. Have you tried asking on the tika mailing list? http://tika.apache.org/mail-lists.html.
-- Ian. On Fri, Dec 3, 2010 at 11:55 AM, Ganesh <emailg...@yahoo.co.in> wrote: > I first extract the contents from documents using tika and latter index it > with Lucene. The problem is the extracted text from PDF using tika has no > whitespaces. > > Regards > Ganesh > > > ----- Original Message ----- > From: "McGibbney, Lewis John" <lewis.mcgibb...@gcu.ac.uk> > To: <java-user@lucene.apache.org> > Sent: Friday, December 03, 2010 4:40 PM > Subject: RE: PDF text extracted without spaces > > >> Hi Ganesh >> >> I encountered this same problem last week. I was thinking if it was possible >> to include at minimum a WhitespaceAnalyzer somewhere within Tika which would >> solve the problem. I am not sure of how this would be done as I am not >> familiar with Tika codebase. >> >> Unfortunately I don't think that the solution to the first part of this >> problem lies within the java-user mailing list. >> >> When were you sending extracted contents to Lucene... at what later stage? >> >> Thank you >> >> Lewis >> >> -----Original Message----- >> From: Ganesh [mailto:emailg...@yahoo.co.in] >> Sent: 03 December 2010 10:44 >> To: java-user@lucene.apache.org >> Subject: Re: PDF text extracted without spaces >> >> The main problem is i am not getting whitespace and newline char. This is >> happening only for PDF documents. >> >> Sample outoput: Someofthedifferencesare but it should be Some of the >> differences are >> >> Regards >> Ganesh >> >> ----- Original Message ----- >> From: "Alexander Aristov" <alexander.aris...@gmail.com> >> To: <java-user@lucene.apache.org> >> Sent: Friday, December 03, 2010 2:39 PM >> Subject: Re: PDF text extracted without spaces >> >> >>> anyway even if you get correct whitespaces and new lines this won't affect >>> indexing. >>> >>> Best Regards >>> Alexander Aristov >>> >>> >>> On 3 December 2010 10:00, Lance Norskog <goks...@gmail.com> wrote: >>> >>>> The text should come out as a stream of words with space, but without >>>> any of the formatting in the PDF. Extraction is only good enough to >>>> tell you that a word is somewhere inside a PDF file. Can you post a >>>> short bit of the text that it extracted? >>>> >>>> Also, you should try this test on different PDF files that were made >>>> with different software. >>>> >>>> On Thu, Dec 2, 2010 at 9:35 PM, Ganesh <emailg...@yahoo.co.in> wrote: >>>> > Hello all, >>>> > >>>> > I know, this is not the right group to ask this question, thought some of >>>> you guys might have experienced. >>>> > >>>> > I newbie with Tika. I am using latest version 0.8 version. I extracted >>>> text from PDF document but found spaces and new line missing. Indexing the >>>> data gives wrong result. Could any one in this group could help me? I am >>>> using tika directly to extract the contents, which later gets indexed. >>>> > >>>> > Regards >>>> > Ganesh >>>> > Send free SMS to your Friends on Mobile from your Yahoo! Messenger. >>>> Download Now! http://messenger.yahoo.com/download.php >>>> > >>>> > --------------------------------------------------------------------- >>>> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>>> > For additional commands, e-mail: java-user-h...@lucene.apache.org >>>> > >>>> > >>>> >>>> >>>> >>>> -- >>>> Lance Norskog >>>> goks...@gmail.com >>>> >>>> --------------------------------------------------------------------- >>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>>> >>>> >>> >> Send free SMS to your Friends on Mobile from your Yahoo! Messenger. Download >> Now! http://messenger.yahoo.com/download.php >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> Email has been scanned for viruses by Altman Technologies' email management >> service - www.altman.co.uk/emailsystems >> >> Glasgow Caledonian University is a registered Scottish charity, number >> SC021474 >> >> Winner: Times Higher Education’s Widening Participation Initiative of the >> Year 2009 and Herald Society’s Education Initiative of the Year 2009 >> http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html >> > Send free SMS to your Friends on Mobile from your Yahoo! Messenger. Download > Now! http://messenger.yahoo.com/download.php > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org