Re: PayloadAttribute behavior change between Lucene 2.9/3.0 and the trunk

2010-12-03 Thread Robert Muir
On Fri, Dec 3, 2010 at 10:15 PM, Teruhiko Kurosaka wrote: > Hello, > I have a Tokenizer that generates a Payload, and a TokenFilter that uses it. > These work well with Solr 1.4.0 (therefore Lucene 2.9.1?), but when > I switched to the trunk version (I rebuilt the Tokenizer and TokenFilter > using

PayloadAttribute behavior change between Lucene 2.9/3.0 and the trunk

2010-12-03 Thread Teruhiko Kurosaka
Hello, I have a Tokenizer that generates a Payload, and a TokenFilter that uses it. These work well with Solr 1.4.0 (therefore Lucene 2.9.1?), but when I switched to the trunk version (I rebuilt the Tokenizer and TokenFilter using the Lucene jar from the trunk and ran it), I encountered this

Re: PDF text extracted without spaces

2010-12-03 Thread Ralph Seward
pdftotext has usually worked quite well for my purposes. More info at http://www.foolabs.com/xpdf/about.html . "Xpdf runs under the X Window System on UNIX, VMS, and OS/2. The non-X components (pdftops, pdftotext, etc.) also run on Win32 systems and should run on pretty much any system with a dece

Re: PDF text extracted without spaces

2010-12-03 Thread Hans Merkl
pdftotext is much better and faster in my experience. On Fri, Dec 3, 2010 at 08:52, Fabiano Nunes wrote: > Have you ever tried an extraction tool other than PDFBox? I used to extract > content with PDFBox: its ability to extract content wasn't a problem, > but its lack of structure informati

Re: PDF text extracted without spaces

2010-12-03 Thread Fabiano Nunes
Have you ever tried an extraction tool other than PDFBox? I used to extract content with PDFBox: its ability to extract content wasn't a problem, but its lack of structure information was. You can try poppler-utils (pdftotext) to extract content with layout structure. Fabiano Nunes On Fri,

[ANNOUNCE] Release of Lucene Java versions 3.0.3 and 2.9.4

2010-12-03 Thread Uwe Schindler
Hello Lucene users, On behalf of the Lucene development community I would like to announce the release of Lucene Java versions 3.0.3 and 2.9.4: Both releases fix bugs in the previous versions: - 2.9.4 is a bugfix release for the Lucene Java 2.x series, based on Java 1.4. - 3.0.3 has the same bug

Re: PDF text extracted without spaces

2010-12-03 Thread Ian Lea
Maybe https://issues.apache.org/jira/browse/TIKA-548 is relevant. Have you tried asking on the Tika mailing list? http://tika.apache.org/mail-lists.html. -- Ian. On Fri, Dec 3, 2010 at 11:55 AM, Ganesh wrote: > I first extract the contents from documents using Tika and later index it > with

Re: PDF text extracted without spaces

2010-12-03 Thread Ganesh
I first extract the contents from documents using Tika and later index it with Lucene. The problem is that the text extracted from PDFs using Tika has no whitespace. Regards Ganesh - Original Message - From: "McGibbney, Lewis John" To: Sent: Friday, December 03, 2010 4:40 PM Subject: R

RE: PDF text extracted without spaces

2010-12-03 Thread McGibbney, Lewis John
Hi Ganesh, I encountered this same problem last week. I was wondering whether it would be possible to include at minimum a WhitespaceAnalyzer somewhere within Tika, which would solve the problem. I am not sure how this would be done, as I am not familiar with the Tika codebase. Unfortunately I don't think th

Re: PDF text extracted without spaces

2010-12-03 Thread Ganesh
The main problem is that I am not getting whitespace and newline characters. This happens only for PDF documents. Sample output: Someofthedifferencesare but it should be: Some of the differences are Regards Ganesh - Original Message - From: "Alexander Aristov" To: Sent: Friday, December

Re: PDF text extracted without spaces

2010-12-03 Thread Alexander Aristov
Anyway, even if you get correct whitespace and newlines, this won't affect indexing. Best Regards Alexander Aristov On 3 December 2010 10:00, Lance Norskog wrote: > The text should come out as a stream of words with space, but without > any of the formatting in the PDF. Extraction is only good