RE: Old Lucene src archive corrupt?

2010-03-02 Thread Toke Eskildsen
From: An Hong [an.h...@i365.com] > I'm trying to download some old Lucene source, e.g., > http://archive.apache.org/dist/lucene/java/lucene-2.9.0-src.zip I get an "unexpected end of Archive" report from WinRar, which has never had a problem

RE: Old Lucene src archive corrupt?

2010-03-02 Thread Uwe Schindler
The archive opens perfectly with 7zip for Windows. You can check whether your download is corrupt by verifying the "*.md5" checksum, or even validate the "*.asc" signature against the file with "gpg --verify". - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u.
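The checksum verification Uwe suggests can also be done in plain Java if `md5sum` or `gpg` isn't handy. A minimal sketch using only the JDK's `MessageDigest` (the file path on the command line is a placeholder; compare the printed digest against the contents of the published "*.md5" file):

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Check {
    // Compute the lowercase hex MD5 digest of a byte array,
    // the same format Apache publishes in its "*.md5" files.
    static String md5Hex(byte[] data) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest(data)) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical usage: java Md5Check lucene-2.9.0-src.zip
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(md5Hex(bytes));
    }
}
```

If the digest differs from the published one, the download is corrupt regardless of what WinRar reports. Note this only checks integrity; the "*.asc" signature with "gpg --verify" additionally checks authenticity.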

Old Lucene src archive corrupt?

2010-03-02 Thread An Hong
Hi, I'm trying to download some old Lucene source, e.g., http://archive.apache.org/dist/lucene/java/lucene-2.9.0-src.zip I get an "unexpected end of Archive" report from WinRar, which has never had a problem with zip files. It's also the sa

Re: Help wanted with Indexing PDF Documents

2010-03-02 Thread Ian Lea
Sounds like a question for the PDFBox mailing list. Once you've got the relevant info out of the PDF you can index it however you like. -- Ian. On Tue, Mar 2, 2010 at 4:11 PM, Ching Zheng wrote: > Hi, > I have about 50 PDF douments with size of each is around 10MB. I am using > PDFbox for pars

Help wanted with Indexing PDF Documents

2010-03-02 Thread Ching Zheng
Hi, I have about 50 PDF documents, each around 10 MB in size. I am using PDFBox for parsing; I'm just wondering how I can index bookmarks with their corresponding page information. I use PDDocumentOutline to get a bookmark's title, but I only have PDNamedDestination, which offers no page number info. C

Re: Lucene Indexing out of memory

2010-03-02 Thread Erick Erickson
It's not searching that I'm wondering about. The memory size, as far as I understand, really only has document resolution. That is, you can't index a part of a document, flush to disk, then index the rest of the document. The entire document is parsed into memory, and only then flushed to disk if R

Re: Lucene Indexing out of memory

2010-03-02 Thread Ian Lea
Where exactly are you hitting the OOM exception? Have you got a stack trace? How much memory are you allocating to the JVM? Have you run a profiler to find out what is using the memory? If it runs OK for 70K docs then fails, 2 possibilities come to mind: either the 70K + 1 doc is particularly la
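Ian's first questions (how much memory is allocated to the JVM, what is actually in use) can be answered from inside the indexing process itself before reaching for a profiler. A small sketch using only `Runtime` (print these numbers just before the point where the OOM occurs; raise the limit with e.g. `-Xmx512m`):

```java
public class HeapInfo {
    // The hard ceiling the JVM will ever grow the heap to (-Xmx).
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long mb = 1024 * 1024;
        // totalMemory is what the JVM has claimed so far; freeMemory is
        // the unused part of that claim. In-use = total - free.
        System.out.println("max heap (MB):   " + rt.maxMemory() / mb);
        System.out.println("total heap (MB): " + rt.totalMemory() / mb);
        System.out.println("in use (MB):     "
                + (rt.totalMemory() - rt.freeMemory()) / mb);
    }
}
```

If "in use" climbs steadily per document rather than plateauing, something is being retained across documents, which points at the second of Ian's two possibilities.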

Re: Lucene Indexing out of memory

2010-03-02 Thread ajay_gupta
Hi Erick, I tried setting setRAMBufferSizeMB to 200-500 MB as well, but it still hits an OOM error. I thought that since it's file-based indexing, memory shouldn't be an issue, but you might be right that searching uses a lot of memory? Is there a way to load documents in chunks or some other way

RE: Lucene Indexing out of memory

2010-03-02 Thread Murdoch, Paul
Ajay, Here is another thread I started on the same issue. http://stackoverflow.com/questions/1362460/why-does-lucene-cause-oom-when-indexing-large-files Paul -Original Message- From: java-user-return-45254-paul.b.murdoch=saic@lucene.apache.org [mailto:java-user-return-45254-paul.

RE: Lucene Indexing out of memory

2010-03-02 Thread Murdoch, Paul
Ajay, I've posted a few times on OOM issues. Here is one thread. http://mail-archives.apache.org/mod_mbox//lucene-java-user/200909.mbox/%3c5b20def02611534db08854076ce825d803626...@sc1exc2.corp.emainc.com%3e I'll try and get some more links to you from some other threads I started for OOM issue

Re: Lucene Indexing out of memory

2010-03-02 Thread Erick Erickson
I'm not following this entirely, but these docs may be huge by the time you add context for every word in them. You say that you "search the existing indices then I get the content and append". So is it possible that after 70K documents your additions become so huge that you're blowing up? Have

Lucene Indexing out of memory

2010-03-02 Thread ajay_gupta
Hi, it might be a general question, but I couldn't find the answer yet. I have around 90k documents totaling around 350 MB. Each document contains a record which has some text content. For each word in this text I want to store the context for that word and index it, so I am reading each document and
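The per-word context strings described above are a likely source of the memory blow-up later in the thread: materializing one context string for every word multiplies the corpus size by roughly the window width. A minimal sketch of such a context builder, using only the JDK (the method name and window-radius parameter are illustrative, not from the original post):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ContextWindows {
    // For each token, build a string containing the token plus up to
    // `radius` neighbours on each side. For a corpus of N tokens this
    // produces N strings of ~(2*radius + 1) tokens each, so a 350 MB
    // corpus can easily expand past the JVM heap if all contexts are
    // held in memory at once rather than indexed and discarded per doc.
    static List<String> contexts(String[] tokens, int radius) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            int from = Math.max(0, i - radius);
            int to = Math.min(tokens.length, i + radius + 1);
            out.add(String.join(" ", Arrays.copyOfRange(tokens, from, to)));
        }
        return out;
    }
}
```

Processing one document at a time and letting each document's contexts go out of scope before reading the next keeps the working set bounded by the largest single document rather than the whole corpus.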