From: "Ralph Seward"
To:
Sent: Friday, December 03, 2010 8:21 PM
Subject: Re: PDF text extracted without spaces
pdftotext has usually worked quite well for my purposes. More info at
http://www.foolabs.com/xpdf/about.html .
"Xpdf runs under the X Window System on UNIX, VMS, and
is the extracted text from PDF using tika
has
> > no
> > > whitespaces.
> > > >
> > > > Regards
> > > > Ganesh
> > > >
> > > >
> > > > - Original Message -
> > > > From: "McGibbne
ments using tika and later index
> > it with Lucene. The problem is the extracted text from PDF using tika has
> no
> > whitespaces.
> > >
> > > Regards
> > > Ganesh
> > >
> > >
> > > - Original Message -
> > > From: "M
> From: "McGibbney, Lewis John"
> > To:
> > Sent: Friday, December 03, 2010 4:40 PM
> > Subject: RE: PDF text extracted without spaces
> >
> >
> >> Hi Ganesh
> >>
> >> I encountered this same problem last week. I was thinking if
> with Lucene. The problem is the extracted text from PDF using tika has no
> whitespaces.
>
> Regards
> Ganesh
>
>
> - Original Message -
> From: "McGibbney, Lewis John"
> To:
> Sent: Friday, December 03, 2010 4:40 PM
> Subject: RE: PDF text
PM
Subject: RE: PDF text extracted without spaces
> Hi Ganesh
>
> I encountered this same problem last week. I was thinking if it was possible
> to include at minimum a WhitespaceAnalyzer somewhere within Tika which would
> solve the problem. I am not sure of how this would be don
-user@lucene.apache.org
Subject: Re: PDF text extracted without spaces
The main problem is that I am not getting whitespace and newline characters.
This happens only for PDF documents.
Sample output: "Someofthedifferencesare", but it should be "Some of the
differences are".
Regards
Ganesh
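
For anyone who wants to check a batch of extracted files for the same symptom, here is a minimal sketch in Python. It is only a heuristic, not part of Tika or Lucene: the function names and the 20-character threshold are assumptions made for illustration. It flags text whose longest whitespace-delimited token is suspiciously long, which is what run-together output like the sample above looks like.

```python
def longest_token_length(text: str) -> int:
    """Return the length of the longest whitespace-delimited token (0 if none)."""
    return max((len(tok) for tok in text.split()), default=0)


def looks_unspaced(text: str, max_token_len: int = 20) -> bool:
    """Heuristic check: True if any token is longer than max_token_len.

    The default threshold of 20 is an assumed value; tune it for your
    documents, since long URLs or identifiers can also trip it.
    """
    return longest_token_length(text) > max_token_len


# The sample from this thread and the text it should have been.
broken = "Someofthedifferencesare"
expected = "Some of the differences are"

print(looks_unspaced(broken))
print(looks_unspaced(expected))
```

Running this over the text extracted from several different PDF files would show whether the problem affects every file or only some of them.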
- Ori
, December 03, 2010 2:39 PM
Subject: Re: PDF text extracted without spaces
> Anyway, even if you get correct whitespace and newlines, this won't affect
> indexing.
>
> Best Regards
> Alexander Aristov
>
>
> On 3 December 2010 10:00, Lance Norskog wrote:
>
Anyway, even if you get correct whitespace and newlines, this won't affect
indexing.
Best Regards
Alexander Aristov
On 3 December 2010 10:00, Lance Norskog wrote:
> The text should come out as a stream of words with spaces, but without
> any of the formatting in the PDF. Extraction is only good
The text should come out as a stream of words with spaces, but without
any of the formatting in the PDF. Extraction is only good enough to
tell you that a word is somewhere inside a PDF file. Can you post a
short bit of the text that it extracted?
Also, you should try this test on different PDF fi
Hello all,
I know this is not the right group to ask this question, but I thought some of
you might have experienced this.
I am a newbie with Tika. I am using the latest version, 0.8. I extracted text
from a PDF document but found spaces and newlines missing. Indexing the data
gives wrong results. Cou