RE: IndexWriter and memory usage

2010-04-13 Thread Woolf, Ross
Since the heap dump was so big and can't be attached, I have taken a few screenshots from Java VisualVM of the heap dump. In the first image you can see that, at the time our memory has become very tight, most of it is held up in bytes. In the second image I examine one of those instances and n

Problem with search

2010-04-13 Thread Sirish Vadala
Hello All, I am kind of new to Lucene, and am having a problem filtering search results. Background: My indexed documents have multiple bills and each bill has multiple versions. Each version of the same bill has a different bill Version Id, but the same bill Id. In the most likely case, the text in d

Re: Removing terms in the Index

2010-04-13 Thread Shai Erera
I ran your code. Since I don't have the queries file (Docs/documento.txt), I set this line instead: String termos = "\"Lucene in Action\""; When I set it to \"Lucene\", both documents are found. When I set it to \"Lucene in Action\" only the first document is found. Seems correct to me. Can you

RE: IndexWriter and memory usage

2010-04-13 Thread Woolf, Ross
Are these fixes in 2.9x branch? We are using 2.9x and can't move to 3x just yet. If so, where do I specifically pick this up from? -Original Message- From: Lance Norskog [mailto:goks...@gmail.com] Sent: Monday, April 12, 2010 10:20 PM To: java-user@lucene.apache.org Subject: Re: IndexW

Re: Understanding lucene indexes and disk I/O

2010-04-13 Thread Michael McCandless
On Tue, Apr 13, 2010 at 11:55 AM, Burton-West, Tom wrote: > At some point maybe the File Formats Document could be updated to make it > clear that the tii has an entry similar to the IndexInterval'th tis entry but > instead of holding frq/prx deltas it holds absolute pointers. Is it worth > e

RE: Understanding lucene indexes and disk I/O

2010-04-13 Thread Burton-West, Tom
Thanks Mike, At some point maybe the File Formats Document could be updated to make it clear that the tii has an entry similar to the IndexInterval'th tis entry but instead of holding frq/prx deltas it holds absolute pointers. Is it worth entering a JIRA issue? I would be happy to update the

Re: WhitespaceAnalyzer and version

2010-04-13 Thread Siraj Haider
Hi Shai, On 4/13/2010 1:41 AM, Shai Erera wrote: Hi WhitespaceAnalyzer definitely has a Version dependent ctor. What Lucene version do you use? You can use LUCENE_CURRENT but be aware that if a certain Analyzer's behavior has changed in a way that affects your app, you'll need to reindex your

Re: WhitespaceAnalyzer and version

2010-04-13 Thread Siraj Haider
Hi Uwe, On 4/13/2010 2:23 AM, Uwe Schindler wrote: As of Lucene 3.0, WhitespaceAnalyzer does not yet have a Version ctor. It will come in 3.1, when Lucene is changed to be Unicode 4.0 conformant (3.0 and before is Unicode 3.0, which is Java 1.4). QueryParser needs the Version ctor for the handling of s

Re: Exception, field is not stored

2010-04-13 Thread Grant Ingersoll
On Apr 12, 2010, at 1:31 PM, Ramon De Paula Marques wrote: > Hi guys, > > I'm trying to use the highlighter for a better search on my website, but when the > search gets HTML and PDF documents that were indexed with a reader, it causes an > exception saying that the field is not stored. > > I don't know w

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2010-04-13 Thread Michael McCandless
Can you whittle down your example even more? E.g., don't read the term vectors for the first hit. Just open a single reader and do the TermQuery search over and over? BTW what does this line in PyLucene do?: tfvP = lucene.TermFreqVector.cast_(tfv) You never hit exceptions in this code, right?

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2010-04-13 Thread Michael McCandless
On Mon, Apr 12, 2010 at 9:50 AM, Herbert L Roitblat wrote: > Thank you Michael. Your suggestions are helpful. I inherited all of the > code that uses pyLucene and don't consider myself an expert on it, so I very > much appreciate your suggestions. > > It does not seem to be the case that these e

Re: IndexWriter and memory usage

2010-04-13 Thread Michael McCandless
This would be a very good thing to try, given that you have some huge documents that, indexed alone, use far more than your RAM buffer. Mike On Tue, Apr 13, 2010 at 12:19 AM, Lance Norskog wrote: > There are some bugs where the writer data structures retain data after > it is flushed. They are co

Re: IndexWriter and memory usage

2010-04-13 Thread Michael McCandless
The infoStream generally looks healthy. You seem to have a contained set of unique field names. The one thing that's interesting is... your docs are quite large. If you grep for "flush: segment=" in your infoStream you see how many docs "fit" in 16 MB before flushing, and it's lowish (as high as

Re: Understanding lucene indexes and disk I/O

2010-04-13 Thread Michael McCandless
Hi Tom, Fear not: we only scan up to 128 terms, to find the specific term. First, the terms dict index (tii) is fully loaded into RAM, and then a binary search is done on this (in-RAM) to find the nearest index term just before the term you want. Then, we seek to that spot in the main terms dict
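
To illustrate the lookup Mike describes, here is a minimal sketch in Python. It is a toy model, not Lucene's code: the class name TermsDict, the method lookup, and the in-memory list standing in for the on-disk terms dict are all hypothetical, and the interval of 128 is taken from the "128 terms" figure in the message. It shows a sparse in-RAM index (every 128th term) searched by binary search, followed by a bounded linear scan of the main term list.

```python
import bisect

# Assumed interval, taken from the "128 terms" figure in the message above.
INDEX_INTERVAL = 128


class TermsDict:
    """Toy model (hypothetical, not Lucene's API) of a terms dict with a sparse index."""

    def __init__(self, sorted_terms, interval=INDEX_INTERVAL):
        # Stand-in for the main terms dict (tis); a real index keeps this on disk.
        self.terms = sorted_terms
        self.interval = interval
        # Stand-in for the terms dict index (tii): every interval-th term, held in RAM.
        self.index_terms = sorted_terms[::interval]

    def lookup(self, term):
        """Return (position, terms_scanned); position is None if the term is absent."""
        # Binary search the in-RAM index for the last index term <= the target.
        slot = bisect.bisect_right(self.index_terms, term) - 1
        if slot < 0:
            return None, 0  # target sorts before the first term in the dict
        # "Seek" to that spot in the main dict and scan at most `interval` terms.
        start = slot * self.interval
        end = min(start + self.interval, len(self.terms))
        scanned = 0
        for pos in range(start, end):
            scanned += 1
            if self.terms[pos] == term:
                return pos, scanned
            if self.terms[pos] > term:
                break  # passed the sort position, so the term is not present
        return None, scanned
```

Whatever the size of the term list, the scan in this sketch never touches more than 128 entries; only the binary search over the in-RAM index grows with the number of terms, and it grows logarithmically.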