Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

Herbert Roitblat Mon, 12 Apr 2010 07:55:24 -0700

Update:

reusing the reader and searcher made almost no difference. It still eats upthe heap.----- Original Message -----From: "Herbert L Roitblat" <h...@orcatec.com>

To: <java-user@lucene.apache.org>
Sent: Monday, April 12, 2010 6:50 AM
Subject: Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

Thank you Michael.  Your suggestions are helpful.  I inherited all of
the code that uses pyLucene and don't consider myself an expert on it,
so I very much appreciate your suggestions.

It does not seem to be the case that these elements represent the index
of the collection. TermInfo and Term grow as I retrieve more documents.
There was no trouble building the index.

The contents of these fields are the tokens (some fields are tokenized,
others not) of the document fields.  In the tokenized fields, there is
one object for each word. They seem to be in order of the documents for
which the term vectors are being sought.  So these objects seem to
represent a "concatenation" of all of the documents being considered in
order, and if they are never removed, would always overwhelm the heap
with a large document set.  They are not the index in the usual sense, I
think.  Before I start retrieving documents, there is barely anything in
these objects.

What is holding the document contents in the heap after the fields
information is returned?

Can you say more about incRef/decRef?  I deleted all variables that
interacted with Lucene and it seems to have made no difference

There are not a lot of different fields, I would say on the order of 50
with about 20 of them in virtually every document.

It uses:
lucene.IndexReader.open(self._store)


One suggestion I got is to put the reader code in the class init
function and then reuse it.  I have not tried that one yet (next on the
agenda).  You suggested something similar and I will try that.

Thanks,
Herb


Michael McCandless wrote:

The large count of TermInfo & Term is completely normal -- this is
Lucene's term index, which is entirely RAM resident.

In 3.1, with flexible indexing, the RAM efficiency of the terms index
should be much improved.

While opening a new reader/searcher for every query is horribly
inefficient, it should not leak memory.  (Are you using
IndexReader.reopen?  I see calls to getReader, but this Lucene API
(near-real-time search) wasn't added until 2.9, and you're on 2.4, so
I think that's your own method?).

What do your get/getReader/getSearcher calls do?  Are you using
incRef/decRef at all to manage the lifetime of your readers?  How many
unique field names do you have, across all docs that you index?

If you change your test to open a single reader, but run that
TermQuery over and over and over again, do you still hit OOME?

Mike
On Sun, Apr 11, 2010 at 1:28 PM, Herbert L Roitblat <h...@orcatec.com>wrote:
Hi, Folks.  Thanks, Ruben, for your help.  It let me get a ways down the
road.

The problem is the the heap is filling up when I am doing a
lucene.TermQuery. What I am trying to accomplish is to get the terms inonefield of each document and their frequency in the document. A codesnippet
is attached below. It yields the results I want.
I managed to get a small enough heap dump into jhat. Now I could usesomehelp understanding what I have found and some help figuring out what todoabout it. I am a noobi at understanding the details of Lucene,pyLucene,
 and Java debugging.

If I understand correctly, the heap is filling up because it is keeping
instances of objects around after there is no longer any need for them.I
thought that it might be the case that Python was somehow keeping them
around, but that does not seem to be the case (true?).

From jhat, I got a class instance histogram:

290163 instances <http://192.168.1.155:7000/instances/0x7fbf693bb990> of
class org.apache.lucene.index.TermInfo
<http://192.168.1.155:7000/class/0x7fbf693bb990>
289988 instances <http://192.168.1.155:7000/instances/0x7fbf69412d80> of
class org.apache.lucene.index.Term
<http://192.168.1.155:7000/class/0x7fbf69412d80>
1976 instances <http://192.168.1.155:7000/instances/0x7fbf693f1300> ofclass
org.apache.lucene.index.FieldInfo
<http://192.168.1.155:7000/class/0x7fbf693f1300>
1976 instances <http://192.168.1.155:7000/instances/0x7fbf6940a1a8> ofclass
org.apache.lucene.index.SegmentReader$Norm
<http://192.168.1.155:7000/class/0x7fbf6940a1a8>
1081 instances <http://192.168.1.155:7000/instances/0x7fbf6928d460> ofclass
org.apache.lucene.store.FSDirectory$FSIndexInput
<http://192.168.1.155:7000/class/0x7fbf6928d460>
1048 instances <http://192.168.1.155:7000/instances/0x7fbf693ef958> ofclass
org.apache.lucene.index.CompoundFileReader$CSIndexInput
<http://192.168.1.155:7000/class/0x7fbf693ef958>
540 instances <http://192.168.1.155:7000/instances/0x7fbf69400510> ofclass
org.apache.lucene.index.TermBuffer
<http://192.168.1.155:7000/class/0x7fbf69400510>
540 instances <http://192.168.1.155:7000/instances/0x7fbf694011c8> ofclass
org.apache.lucene.util.UnicodeUtil$UTF16Result
<http://192.168.1.155:7000/class/0x7fbf694011c8>
540 instances <http://192.168.1.155:7000/instances/0x7fbf693bc168> ofclass
org.apache.lucene.util.UnicodeUtil$UTF8Result
<http://192.168.1.155:7000/class/0x7fbf693bc168>
...
There are way too many instance of index.TermInfo and index.indexTerm.So,I tracked down some instances and looked for rootset references. Therewerenone. If I understand correctly, this instance should be garbagecollected
if there are no rootset references.  True?

   Here's an example from jhat:

  Rootset references to org.apache.lucene.index.termi...@0x7fbf6e3f8218
(includes weak refs)
References to org.apache.lucene.index.termi...@0x7fbf6e3f8218 (40bytes)
  Other queries
  Exclude weak refs
---
There is at least one reference to the object, it is an element in anarray,
but the array does not have rootset references either.

Am I misinterpreting these results?  In any case, what can I do about
getting rid of these? Is it a bug in this version of Lucene? Is therea
known fix?  I think that I should be able to do an unlimited number of
queries without filling up the heap.
I am using pyLucene version 2.4.

Thanks for your help.

Herb

-------------------------------
Code snippet:
      reader = self.index.getReader()
      lReader = reader.get()
      searcher = self.index.getSearcher()
      lSearcher = searcher.get()
query = lucene.TermQuery(lucene.Term(OTDocument.UID_FIELD_ID,uid))
      hits = list(lSearcher.search(query))
      if hits:
          hit = lucene.Hit.cast_(hits[0])
          tfvs = lReader.getTermFreqVectors(hit.id)
if tfvs is not None: # this happens if the vector is notstored
              for tfv in tfvs: # There's one for each field that has a
TermFreqVector
                  tfvP = lucene.TermFreqVector.cast_(tfv)
if returnAllFields or tfvP.field in termFields: # addonly
asked fields
                      tFields[tfvP.field] = dict([(t,f) for (t,f) in
zip(tfvP.getTerms(),tfvP.getTermFrequencies()) if f >=minFreq])
      else:
# This shouldn't happen, but we just log the error and marchon
          self.log.error("Unable to fetch doc %s from index"%(uid))
##        if self.opCount % 1000 == 0:
##            print lucene.JCCEnv._dumpRefs(classes=True).items()
#http://lists.osafoundation.org/pipermail/pylucene-dev/2008-January/002171.html
##        self.opCount += 1

      lReader.close()
      lSearcher.close()
retFields = copy.deepcopy(tFields) #return a copy of tFields tofree
up references to it and its contents



Herbert Roitblat wrote:
Hi, folks.
I am using PyLucene and doing a lot of get tokens.  lucene.py reports
version 2.4.0.  It is rpath linux with 8GB of memory.  Python is 2.4.
I'm not sure what the maxheap is, I think that it is maxheap='2048m'.I
think that it's running in a 64 bit environment.
It indexes a set of 116,000 documents just fine.
Then I need to get the tokens from these documents and near the end, Irun
into:

java.lang.OutOfMemoryError: GC overhead limit exceeded

If I wait a bit and ask again for the same document's tokens, I can get
them, but it then is somewhat likely to post the same error on acertain
number of other documents.  I can handle these errors and ask again.
I have read that this error message means that the heap is gettingfilledup and garbage collection removes only a small amount of it. Since allI amdoing is retrieving, why should the heap be filling up? I restartedthe
system before starting the retrieval.
My guess is that there is some small memory leak because memoryassignedto my python program grows slowly as I request more document tokens.Since
I'm not intending to change anything in either my python program or in
Lucene, any growth is unintentional.  I'm just getting tokens.

 we use lucene.TermQuery as the query object to get the terms.
I cannot share the documents nor the application code, but I might beable
to provide snippets.
One last piece of information, the time needed to retrieve documentsslowsthroughout the process. In the beginning I was getting about 10documents
per second.  Towards the end, it is down to about 5 with about 5 second
pauses from time to time, perhaps due to garbage collection?

Any idea of why the heap is filling up and what I can do about it?

Thanks,
Herb
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

Reply via email to