Hi Ganesh I encountered this same problem last week. I was thinking if it was possible to include at minimum a WhitespaceAnalyzer somewhere within Tika which would solve the problem. I am not sure of how this would be done as I am not familiar with Tika codebase.
Unfortunately I don't think that the solution to the first part of this problem lies within the java-user mailing list. When were you sending extracted contents to Lucene... at what later stage? Thank you Lewis -----Original Message----- From: Ganesh [mailto:emailg...@yahoo.co.in] Sent: 03 December 2010 10:44 To: java-user@lucene.apache.org Subject: Re: PDF text extracted without spaces The main problem is i am not getting whitespace and newline char. This is happening only for PDF documents. Sample outoput: Someofthedifferencesare but it should be Some of the differences are Regards Ganesh ----- Original Message ----- From: "Alexander Aristov" <alexander.aris...@gmail.com> To: <java-user@lucene.apache.org> Sent: Friday, December 03, 2010 2:39 PM Subject: Re: PDF text extracted without spaces > anyway even if you get correct whitespaces and new lines this won't affect > indexing. > > Best Regards > Alexander Aristov > > > On 3 December 2010 10:00, Lance Norskog <goks...@gmail.com> wrote: > >> The text should come out as a stream of words with space, but without >> any of the formatting in the PDF. Extraction is only good enough to >> tell you that a word is somewhere inside a PDF file. Can you post a >> short bit of the text that it extracted? >> >> Also, you should try this test on different PDF files that were made >> with different software. >> >> On Thu, Dec 2, 2010 at 9:35 PM, Ganesh <emailg...@yahoo.co.in> wrote: >> > Hello all, >> > >> > I know, this is not the right group to ask this question, thought some of >> you guys might have experienced. >> > >> > I newbie with Tika. I am using latest version 0.8 version. I extracted >> text from PDF document but found spaces and new line missing. Indexing the >> data gives wrong result. Could any one in this group could help me? I am >> using tika directly to extract the contents, which later gets indexed. >> > >> > Regards >> > Ganesh >> > Send free SMS to your Friends on Mobile from your Yahoo! Messenger. >> Download Now! http://messenger.yahoo.com/download.php >> > >> > --------------------------------------------------------------------- >> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> > For additional commands, e-mail: java-user-h...@lucene.apache.org >> > >> > >> >> >> >> -- >> Lance Norskog >> goks...@gmail.com >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> > Send free SMS to your Friends on Mobile from your Yahoo! Messenger. Download Now! http://messenger.yahoo.com/download.php --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org Email has been scanned for viruses by Altman Technologies' email management service - www.altman.co.uk/emailsystems Glasgow Caledonian University is a registered Scottish charity, number SC021474 Winner: Times Higher Education’s Widening Participation Initiative of the Year 2009 and Herald Society’s Education Initiative of the Year 2009 http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html