Phew! Thank for bringing closure Ross. Happy indexing, Mike
On Wed, May 19, 2010 at 12:50 PM, Woolf, Ross <ross_wo...@bmc.com> wrote: > Just wanted to report that Michael was able to find the issue that was > plaguing us, He has checked fixes into the 2.9.x, 3.0.x, 3.1.x, 4.0.x > branches. Most of the issues were related to indexing documents larger than > the indexing buffer size (16mb by default). Now we no longer run out of > memory during our large document indexing runs. > > Thanks for your help Michael in resolving this. > > -----Original Message----- > From: Michael McCandless [mailto:luc...@mikemccandless.com] > Sent: Friday, May 14, 2010 11:23 AM > To: java-user@lucene.apache.org > Subject: Re: IndexWriter and memory usage > > The patch looks correct. > > The 16 MB RAM buffer means the sum of the shared char[], byte[] and > PostingList/RawPostingList memory will be kept under 16 MB. There are > definitely other things that require memory beyond this -- eg during a > segment merge, SegmentReaders are opened for each segment being > merged. Also, if there are pending deletions, 4 bytes per doc is > allocated. > > Applying deletions also opens SegmentReaders. > > Also: a single very large document will cause IW to blow way past the > 16 MB limit, using up as much as is required to index that one doc. > When that doc is finished, it will then flush and free objects until > it's back under the 16 MB limit. If several threads happen to index > large docs at the same time, the problem is that much worse (they all > must finish before IW can flush). > > Can you print the size of the documents you're indexing and see if > that correlates to when you see the memory growth? > > Mike > > On Tue, May 11, 2010 at 2:57 PM, Woolf, Ross <ross_wo...@bmc.com> wrote: >> Still working on some of the things you ask, namely searching without >> indexing. I need to modify our code and the general indexing process takes >> 2 hours, so I won't have a quick turn around on that. We also have a hard >> time answering the question about items that are normal but do they use more >> than the 16 MB. The heap dump does not allow us to quickly identify >> specifics on objects like we show in images below, so we really don't know >> what the amount of memory is used up in objects of this sort. We only know >> that byte[] total for all is at 197891887. >> >> However, I have provided another image that breaks down the memory usage >> from the heap. A big question we have is that we talk about the 16 mb >> buffer, but is there other memory used by Lucene beyond that that we should >> expect to see? >> >> http://i39.tinypic.com/o0o560.jpg >> >> we have 197891887 used in byte[] (anyone we look at is related in some way >> to the index writer) >> we have 169263904 used in char[] (these are related to the index writer too) >> we have 72658944 used in FreqProxTermsWriter$PostingList >> we have 37722668 used in RawPostingList[] >> >> All of these are well over the 16mb. So we are a little lost as to what we >> should expect to see when we look at the memory usage >> >> I've attached the patch and the CheckIndex files. Unfortunately on the >> patch I guess my editor made some space line changes, so you get a lot of >> extra items in the patch that really are not any changes other than >> tab/space. >> >> If you are open to a live share again, then maybe you could look at this >> data quicker than the screen shots I send. >> >> -----Original Message----- >> From: Michael McCandless [mailto:luc...@mikemccandless.com] >> Sent: Monday, May 10, 2010 2:27 AM >> To: java-user@lucene.apache.org >> Subject: Re: IndexWriter and memory usage >> >> Hmmmm... >> >> Your usage (searching for old doc & updating it, to add new fields) is fine. >> >> But: what memory usage do you see if you open a searcher, and search >> for all docs, but don't open an IndexWriter? We need to tease apart >> the IndexReader vs IndexWriter memory usage you are seeing. Also, can >> you post the output of CheckIndex (java >> org.apache.lucene.index.CheckIndex /path/to/index) of your fully built >> index? That may give some hints about expected memory usage of IR (eg >> if # unique terms is large). >> >> More comments below: >> >> On Thu, May 6, 2010 at 1:03 PM, Woolf, Ross <ross_wo...@bmc.com> wrote: >>> Sorry to be so long in getting back on this. The patch you provided has >>> improved the situation but we are still seeing some memory loss. The >>> following are some images from the heap dump. I'll share with you what we >>> are seeing now. >>> >>> This first image shows the memory pattern. Our fist commit takes place at >>> about 3:54 when the steady trend up takes a drop and the new cycle begins. >>> What we have found is the 2422 fix has made the memory in the first phase >>> before the commit much better (and I'm sure throughout the entire run). >>> But as you can see after the commit then we then again begin to lose >>> memory. One of the pieces of info to know about this is what you are >>> seeing, we have 5 threads that are pushing data to our Lucene plugin. If >>> we drop it down to 1 thread then we are much more successful and can >>> actually index all of our data without running out of memory but at 5 >>> threads it gets into trouble. We still see a trend up in memory usage, but >>> not as severe as when we use the multiple threads. >>> http://tinypic.com/view.php?pic=2w6bf68&s=5 >> >> Can you post the output of "svn diff" on the 2.9 code base you're >> using? I just want to look & verify all issues we've discussed are >> included in your changes. The fact that 1 thread is fine and 5 >> threads are not still sounds like a symptom of LUCENE-2283. >> >> Also, does that heap usage graph exclude garbage? Or, alternatively, >> can you provoke an OOME w/ 512 MB heap and then capture the heap dump >> at that point? >> >>> There is another piece of the picture that I think might be coming into >>> play. We have plugged Lucene into a legacy app and are subject to how we >>> can get it to deliver the data that we are indexing. In some scenarios >>> (like the one we are having this problem with) we are building our >>> documents progressively (adding fields to the document through the >>> process). What you see before the first commit is the legacy system >>> handing us the first field for many documents. Once we have gotten all of >>> "field 1" for all documents then we commit that data into the index. Then >>> the system starts feeding us "field 2." So we perform a search to see if >>> the document already exists (for the scenario you are seeing it does) and >>> so it retrieves the original document (we store a document ID) and it then >>> adds the new field of data to the existing document and we "update" the >>> document in the index. After the first commit, the rest of the process is >>> one where a document already exist and so the new field is added and and >>> the document is updated. It is in this process that we start rapidly >>> losing memory. The following images show some examples of common areas >>> where memory is being held. >>> >>> http://tinypic.com/view.php?pic=11vkwnb&s=5 >> >> This looks like "normal" memory usage of IndexWriter -- these are the >> recycled buffers used for holding stored fields. However: the net RAM >> used by this allocation should not exceed your 16 MB IW ram buffer >> size -- does it? >> >>> http://tinypic.com/view.php?pic=abq9fp&s=5 >> >> This one is the byte[] buffer used by CompoundFileReader, opened by >> IndexReader. It's odd that you have so many of these (if I'm reading >> this correctly) -- are you certain all opened readers are being >> closed? How many segments do you have in your index? Or... are there >> many unique threads doing the searching? EG do you create a new >> thread for every search or update? >> >>> http://tinypic.com/view.php?pic=25pskyp&s=5 >> >> This one is also normal memory used by IndexWriter, but as above, the >> net RAM used by this allocation (summed w/ the above one) should not >> exceed your 16 MB IW ram buffer size. >> >>> As mentioned, we are subject to how we can have the legacy app feed us the >>> data and so this is why we do it this way. We treat this system as a real >>> time system and at anytime the legacy system may send us a field that needs >>> to be added or updated to a document. So we search for the document and if >>> found we either add or update a field if the field is already existing in >>> the document. So I started to wonder if a clue in this memory loss comes >>> from the fact that we are retrieving an existing document and then adding >>> to it and updating. >>> >>> Now, if we eliminate the updating and simply add each item as a new >>> document (which we did just to test but won't be adequate for our running >>> system), then we still see a slight trend upward in memory usage and the >>> following images show that now most of the memory is consumed in char[] >>> rather than the byte[] we saw before. We don't know if this is normal and >>> expected, or if it is something to be concerned about as well. >>> >>> http://tinypic.com/view.php?pic=vfgkyt&s=5 >> >> That memory usage is normal -- it's used by the in-RAM terms index of >> your opened IndexReader. But I'd like to see the memory usage of >> simply opening your IndexReader and searching for documents to update, >> but not opening an IndexWriter at all. >> >> Mike >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org