Re: IndexWriter and memory usage

Michael McCandless Mon, 10 May 2010 01:27:26 -0700

Hmmmm...

Your usage (searching for old doc & updating it, to add new fields) is fine.

But: what memory usage do you see if you open a searcher, and search
for all docs, but don't open an IndexWriter?  We need to tease apart
the IndexReader vs IndexWriter memory usage you are seeing.  Also, can
you post the output of CheckIndex (java
org.apache.lucene.index.CheckIndex /path/to/index) of your fully built
index?  That may give some hints about expected memory usage of IR (eg
if # unique terms is large).

More comments below:

On Thu, May 6, 2010 at 1:03 PM, Woolf, Ross <ross_wo...@bmc.com> wrote:
> Sorry to be so long in getting back on this. The patch you provided has 
> improved the situation but we are still seeing some memory loss.  The 
> following are some images from the heap dump.  I'll share with you what we 
> are seeing now.
>
> This first image shows the memory pattern.  Our fist commit takes place at 
> about 3:54 when the steady trend up takes a drop and the new cycle begins.   
> What we have found is the 2422 fix has made the memory in the first phase 
> before the commit much better (and I'm sure throughout the entire run).  But 
> as you can see after the commit then we then again begin to lose memory.  One 
> of the pieces of info to know about this is what you are seeing, we have 5 
> threads that are pushing data to our Lucene plugin.  If we drop it down to 1 
> thread then we are much more successful and can actually index all of our 
> data without running out of memory but at 5 threads it gets into trouble.  We 
> still see a trend up in memory usage, but not as severe as when we use the 
> multiple threads.
> http://tinypic.com/view.php?pic=2w6bf68&s=5

Can you post the output of "svn diff" on the 2.9 code base you're
using?  I just want to look & verify all issues we've discussed are
included in your changes.  The fact that 1 thread is fine and 5
threads are not still sounds like a symptom of LUCENE-2283.

Also, does that heap usage graph exclude garbage?  Or, alternatively,
can you provoke an OOME w/ 512 MB heap and then capture the heap dump
at that point?

> There is another piece of the picture that I think might be coming into play. 
>  We have plugged Lucene into a legacy app and are subject to how we can get 
> it to deliver the data that we are indexing.  In some scenarios (like the one 
> we are having this problem with) we are building our documents progressively 
> (adding fields to the document through the process).  What you see before the 
> first commit is the legacy system handing us the first field for many 
> documents. Once we have gotten all of "field 1" for all documents then we 
> commit that data into the index.  Then the system starts feeding us "field 
> 2."  So we perform a search to see if the document already exists (for the 
> scenario you are seeing it does) and so it retrieves the original document 
> (we store a document ID) and it then adds the new field of data to the 
> existing document and we "update" the document in the index.  After the first 
> commit, the rest of the process is one where a document already exist and so 
> the new field is added and and the document is updated.  It is in this 
> process that we start rapidly losing memory.  The following images show some 
> examples of common areas where memory is being held.
>
> http://tinypic.com/view.php?pic=11vkwnb&s=5

This looks like "normal" memory usage of IndexWriter -- these are the
recycled buffers used for holding stored fields.  However: the net RAM
used by this allocation should not exceed your 16 MB IW ram buffer
size -- does it?

> http://tinypic.com/view.php?pic=abq9fp&s=5

This one is the byte[] buffer used by CompoundFileReader, opened by
IndexReader.  It's odd that you have so many of these (if I'm reading
this correctly) -- are you certain all opened readers are being
closed?  How many segments do you have in your index?  Or... are there
many unique threads doing the searching?  EG do you create a new
thread for every search or update?

> http://tinypic.com/view.php?pic=25pskyp&s=5

This one is also normal memory used by IndexWriter, but as above, the
net RAM used by this allocation (summed w/ the above one) should not
exceed your 16 MB IW ram buffer size.

> As mentioned, we are subject to how we can have the legacy app feed us the 
> data and so this is why we do it this way.  We treat this system as a real 
> time system and at anytime the legacy system may send us a field that needs 
> to be added or updated to a document.  So we search for the document and if 
> found we either add or update a field if the field is already existing in the 
> document.  So I started to wonder if a clue in this memory loss comes from 
> the fact that we are retrieving an existing document and then adding to it 
> and updating.
>
> Now, if we eliminate the updating and simply add each item as a new document 
> (which we did just to test but won't be adequate for our running system), 
> then we still see a slight trend upward in memory usage and the following 
> images show that now most of the  memory is consumed in char[] rather than 
> the byte[] we saw before.  We don't know if this is normal and expected, or 
> if it is something to be concerned about as well.
>
> http://tinypic.com/view.php?pic=vfgkyt&s=5

That memory usage is normal -- it's used by the in-RAM terms index of
your opened IndexReader.  But I'd like to see the memory usage of
simply opening your IndexReader and searching for documents to update,
but not opening an IndexWriter at all.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: IndexWriter and memory usage

Reply via email to