On Fri, Dec 3, 2010 at 10:15 PM, Teruhiko Kurosaka wrote:
Hello,
I have a Tokenizer that generates a Payload, and a TokenFilter that uses it.
These work well with Solr 1.4.0 (therefore Lucene 2.9.1?), but when
I switched to the trunk version (I rebuilt the Tokenizer and TokenFilter
using the Lucene jar from the trunk and ran it), I encountered this
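(The message is cut off before the error itself. For context only, here is a minimal sketch of a payload-producing TokenFilter written against the attribute-based TokenStream API that Lucene 2.9+ and trunk use; the class name MarkerPayloadFilter and the payload bytes are invented for illustration and are not taken from the original post.)

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.index.Payload;

public final class MarkerPayloadFilter extends TokenFilter {
  // The standard idiom: declare the attributes this filter touches up front.
  private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

  public MarkerPayloadFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // Attach a one-byte payload to every token; a downstream TokenFilter can
    // read it back through the same PayloadAttribute instance.
    payloadAtt.setPayload(new Payload(new byte[] { 1 }));
    return true;
  }
}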
pdftotext has usually worked quite well for my purposes. More info at
http://www.foolabs.com/xpdf/about.html .
"Xpdf runs under the X Window System on UNIX, VMS, and OS/2. The non-X
components (pdftops, pdftotext, etc.) also run on Win32 systems and should
run on pretty much any system with a dece
pdftotext is much better and faster in my experience.
On Fri, Dec 3, 2010 at 08:52, Fabiano Nunes wrote:
Have you ever tried an extraction tool other than PDFBox? I used to extract
contents with PDFBox: its ability to extract content wasn't a problem,
but its lack of structure information was.
You can try poppler-utils (pdftotext) to extract contents with
layout structure.
Fabiano Nunes
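(For illustration, a minimal sketch of calling pdftotext from Java with ProcessBuilder, as suggested above. It assumes pdftotext from poppler-utils or Xpdf is on the PATH; the file name sample.pdf is made up.)

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class PdfToTextExample {
  public static void main(String[] args) throws Exception {
    // -layout preserves the physical page layout; "-" as the output file
    // sends the extracted text to stdout instead of writing a .txt file.
    ProcessBuilder pb =
        new ProcessBuilder("pdftotext", "-layout", "-enc", "UTF-8", "sample.pdf", "-");
    pb.redirectErrorStream(true);
    Process p = pb.start();

    StringBuilder text = new StringBuilder();
    BufferedReader reader =
        new BufferedReader(new InputStreamReader(p.getInputStream(), "UTF-8"));
    String line;
    while ((line = reader.readLine()) != null) {
      text.append(line).append('\n');
    }
    reader.close();

    if (p.waitFor() != 0) {
      throw new RuntimeException("pdftotext failed with exit code " + p.exitValue());
    }
    System.out.println(text);
  }
}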
Hello Lucene users,
On behalf of the Lucene development community I would like to announce the
release of Lucene Java versions 3.0.3 and 2.9.4:
Both releases fix bugs in the previous versions:
- 2.9.4 is a bugfix release for the Lucene Java 2.x series, based on Java 1.4.
- 3.0.3 has the same bug fixes, but is for the Lucene Java 3.x series and requires Java 5.
Maybe https://issues.apache.org/jira/browse/TIKA-548 is relevant.
Have you tried asking on the Tika mailing list?
http://tika.apache.org/mail-lists.html.
--
Ian.
On Fri, Dec 3, 2010 at 11:55 AM, Ganesh wrote:
I first extract the contents from documents using Tika and later index them with
Lucene. The problem is that the text extracted from PDFs using Tika has no
whitespace.
Regards
Ganesh
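(A minimal sketch of the extract-with-Tika-then-index-with-Lucene flow described above, assuming Tika 0.8-era and Lucene 3.0-era APIs; the file names and field names are made up for illustration.)

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class TikaToLucene {
  public static void main(String[] args) throws Exception {
    // 1. Extract plain text from the document with Tika's auto-detecting parser.
    BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
    Metadata metadata = new Metadata();
    InputStream in = new FileInputStream(new File("sample.pdf"));
    try {
      new AutoDetectParser().parse(in, handler, metadata, new ParseContext());
    } finally {
      in.close();
    }
    String text = handler.toString();

    // 2. Index the extracted text with Lucene (3.0-style IndexWriter constructor).
    IndexWriter writer = new IndexWriter(
        FSDirectory.open(new File("index")),
        new StandardAnalyzer(Version.LUCENE_30),
        IndexWriter.MaxFieldLength.UNLIMITED);
    Document doc = new Document();
    doc.add(new Field("path", "sample.pdf", Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("contents", text, Field.Store.NO, Field.Index.ANALYZED));
    writer.addDocument(doc);
    writer.close();
  }
}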
- Original Message -
From: "McGibbney, Lewis John"
Sent: Friday, December 03, 2010 4:40 PM
Hi Ganesh
I encountered the same problem last week. I was wondering whether it would be possible to
include at minimum a WhitespaceAnalyzer somewhere within Tika, which would solve
the problem. I am not sure how this would be done, as I am not familiar with the
Tika codebase.
Unfortunately I don't think th
The main problem is that I am not getting whitespace and newline characters. This is
happening only for PDF documents.
Sample output: "Someofthedifferencesare", but it should be "Some of the
differences are".
Regards
Ganesh
- Original Message -
From: "Alexander Aristov"
Anyway, even if you get correct whitespace and newlines, this won't affect
indexing.
Best Regards
Alexander Aristov
On 3 December 2010 10:00, Lance Norskog wrote:
> The text should come out as a stream of words with spaces, but without
> any of the formatting in the PDF. Extraction is only good