The text should come out as a stream of words separated by spaces, but without any of the formatting in the PDF. Extraction is only good enough to tell you that a word is somewhere inside a PDF file. Can you post a short bit of the text that it extracted?
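If it helps to quantify "spaces and new lines missing" before posting a sample, here is a minimal sketch of a check you could run on the string you already extracted. It uses only the standard library. The class and method names (ExtractionCheck, whitespaceRatio, longestToken) are made up for illustration and are not part of Tika or Lucene.

```java
// Hypothetical helper, not part of Tika: a rough check for lost whitespace
// in text that has already been extracted from a PDF.
final class ExtractionCheck {

    /** Fraction of characters that are whitespace; 0.0 for empty input. */
    static double whitespaceRatio(String text) {
        if (text.isEmpty()) {
            return 0.0;
        }
        int whitespace = 0;
        for (int i = 0; i < text.length(); i++) {
            if (Character.isWhitespace(text.charAt(i))) {
                whitespace++;
            }
        }
        return (double) whitespace / text.length();
    }

    /** Length of the longest run of non-whitespace characters. */
    static int longestToken(String text) {
        int longest = 0;
        int current = 0;
        for (int i = 0; i < text.length(); i++) {
            if (Character.isWhitespace(text.charAt(i))) {
                current = 0; // a whitespace character ends the current token
            } else {
                current++;
                longest = Math.max(longest, current);
            }
        }
        return longest;
    }

    public static void main(String[] args) {
        // Sample strings for illustration only; substitute the extracted text.
        String spaced = "I extracted text from a PDF document";
        String glued = "IextractedtextfromaPDFdocument";
        System.out.printf("spaced: ratio=%.2f longest=%d%n",
                whitespaceRatio(spaced), longestToken(spaced));
        System.out.printf("glued:  ratio=%.2f longest=%d%n",
                whitespaceRatio(glued), longestToken(glued));
    }
}
```

A whitespace ratio near zero, or a longest token far longer than any real word, suggests the words were glued together during extraction rather than later at indexing time.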
Also, you should try the same extraction on different PDF files that were made with different software.

On Thu, Dec 2, 2010 at 9:35 PM, Ganesh <emailg...@yahoo.co.in> wrote:
> Hello all,
>
> I know this is not the right group to ask this question, but I thought some of
> you might have experienced this.
>
> I am a newbie with Tika. I am using the latest version, 0.8. I extracted text
> from a PDF document but found spaces and new lines missing. Indexing the data
> gives wrong results. Could anyone in this group help me? I am using
> Tika directly to extract the contents, which later get indexed.
>
> Regards
> Ganesh
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org

--
Lance Norskog
goks...@gmail.com