Re: PDF text extracted without spaces

2010-12-06 Thread Ganesh
- From: "Ralph Seward" To: Sent: Friday, December 03, 2010 8:21 PM Subject: Re: PDF text extracted without spaces pdftotext has usually worked quite well for my purposes. More info at http://www.foolabs.com/xpdf/about.html . "Xpdf runs under the X Window System on UNIX, VMS, and
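A minimal sketch of the pdftotext route suggested above, assuming the `pdftotext` binary from Xpdf is installed and on the PATH. The helper names `pdftotext_command` and `extract_text` are hypothetical, chosen for illustration; `-layout` and the `-` output argument (write to stdout) are standard pdftotext options.

```python
import subprocess
from typing import List


def pdftotext_command(pdf_path: str) -> List[str]:
    """Build the pdftotext command line (hypothetical helper).

    "-layout" asks pdftotext to keep the physical layout of the page;
    the trailing "-" sends the extracted text to stdout instead of a file.
    """
    return ["pdftotext", "-layout", pdf_path, "-"]


def extract_text(pdf_path: str) -> str:
    """Run pdftotext and return the extracted text.

    Raises subprocess.CalledProcessError if pdftotext exits with an error,
    and FileNotFoundError if the binary is not installed.
    """
    result = subprocess.run(
        pdftotext_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

The returned string could then be handed to Lucene for indexing in place of the Tika output, to compare whether the spaces survive.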

Re: PDF text extracted without spaces

2010-12-03 Thread Ralph Seward
is the extracted text from PDF using tika has no whitespaces. Regards Ganesh - Original Message - From: "McGibbne

Re: PDF text extracted without spaces

2010-12-03 Thread Hans Merkl
ments using tika and later index it with Lucene. The problem is the extracted text from PDF using tika has no whitespaces. Regards Ganesh - Original Message - From: "M

Re: PDF text extracted without spaces

2010-12-03 Thread Fabiano Nunes
From: "McGibbney, Lewis John" To: Sent: Friday, December 03, 2010 4:40 PM Subject: RE: PDF text extracted without spaces Hi Ganesh I encountered this same problem last week. I was thinking if

Re: PDF text extracted without spaces

2010-12-03 Thread Ian Lea
with Lucene. The problem is the extracted text from PDF using tika has no whitespaces. Regards Ganesh - Original Message - From: "McGibbney, Lewis John" To: Sent: Friday, December 03, 2010 4:40 PM Subject: RE: PDF text

Re: PDF text extracted without spaces

2010-12-03 Thread Ganesh
PM Subject: RE: PDF text extracted without spaces Hi Ganesh I encountered this same problem last week. I was thinking if it was possible to include at minimum a WhitespaceAnalyzer somewhere within Tika which would solve the problem. I am not sure of how this would be don

RE: PDF text extracted without spaces

2010-12-03 Thread McGibbney, Lewis John
-user@lucene.apache.org Subject: Re: PDF text extracted without spaces The main problem is that I am not getting whitespace and newline characters. This is happening only for PDF documents. Sample output: Someofthedifferencesare but it should be: Some of the differences are Regards Ganesh - Ori
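A quick way to confirm the symptom before indexing is to measure how much whitespace the extracted text contains. This is a rough sketch, not something from the thread: the function names and the 0.05 threshold are assumptions chosen for illustration, and the check is only a heuristic.

```python
def whitespace_ratio(text: str) -> float:
    """Return the fraction of characters in `text` that are whitespace.

    Empty text yields 0.0 to avoid dividing by zero.
    """
    if not text:
        return 0.0
    return sum(ch.isspace() for ch in text) / len(text)


def looks_space_stripped(text: str, min_ratio: float = 0.05) -> bool:
    """Heuristically flag extracted text that has lost its spaces.

    `min_ratio` is an assumed threshold: ordinary English prose has
    roughly one space per word, so a ratio far below that is suspicious.
    """
    return bool(text) and whitespace_ratio(text) < min_ratio


if __name__ == "__main__":
    # The two strings are the sample output and expected output quoted above.
    print(looks_space_stripped("Someofthedifferencesare"))
    print(looks_space_stripped("Some of the differences are"))
```

Running the check over each extracted document makes it easy to see whether only PDF input is affected, as reported.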

Re: PDF text extracted without spaces

2010-12-03 Thread Ganesh
, December 03, 2010 2:39 PM Subject: Re: PDF text extracted without spaces anyway even if you get correct whitespaces and new lines this won't affect indexing. Best Regards Alexander Aristov On 3 December 2010 10:00, Lance Norskog wrote:

Re: PDF text extracted without spaces

2010-12-03 Thread Alexander Aristov
anyway even if you get correct whitespaces and new lines this won't affect indexing. Best Regards Alexander Aristov On 3 December 2010 10:00, Lance Norskog wrote: > The text should come out as a stream of words with space, but without > any of the formatting in the PDF. Extraction is only good

Re: PDF text extracted without spaces

2010-12-02 Thread Lance Norskog
The text should come out as a stream of words with space, but without any of the formatting in the PDF. Extraction is only good enough to tell you that a word is somewhere inside a PDF file. Can you post a short bit of the text that it extracted? Also, you should try this test on different PDF fi

PDF text extracted without spaces

2010-12-02 Thread Ganesh
Hello all, I know this is not the right group for this question, but I thought some of you might have experienced this. I am a newbie with Tika, using the latest version, 0.8. I extracted text from a PDF document but found the spaces and newlines missing. Indexing the data gives wrong results. Cou