Hi, folks. Thanks, Ruben, for your help; it got me a good way down the road.

The problem is that the heap fills up while I am running lucene.TermQuery searches. What I am trying to accomplish is to get the terms in one field of each document, along with their frequency in the document. A code snippet is attached below; it yields the results I want.

I managed to get a small enough heap dump into jhat. Now I could use some help understanding what I have found and figuring out what to do about it. I am a newbie at the details of Lucene, PyLucene, and Java debugging.

If I understand correctly, the heap is filling up because instances of objects are being kept around after there is no longer any need for them. I thought that Python might somehow be keeping them alive, but that does not seem to be the case (true?).
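For what it's worth, here is how I checked the Python side; a minimal sketch built on the undocumented JCC hook lucene.JCCEnv._dumpRefs (the same call that appears commented out in the snippet below), assuming the dict it returns maps Java class names to counts of references held from the Python side:

    import lucene
    lucene.initVM(lucene.CLASSPATH)

    def dump_jcc_refs(top=10):
        # _dumpRefs(classes=True) returns a dict keyed by Java class name;
        # if these counts grow without bound across queries, Python-side
        # wrappers are what is keeping the Java objects alive.
        refs = lucene.JCCEnv._dumpRefs(classes=True)
        pairs = [(count, name) for (name, count) in refs.items()]
        pairs.sort()
        pairs.reverse()
        for count, name in pairs[:top]:
            print count, name

If those counts climb from query to query, the leak is on the Python/JCC side; if they stay flat, the objects are being held somewhere inside the JVM.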

From jhat, I got a class instance histogram:

290163 instances of class org.apache.lucene.index.TermInfo
289988 instances of class org.apache.lucene.index.Term
1976 instances of class org.apache.lucene.index.FieldInfo
1976 instances of class org.apache.lucene.index.SegmentReader$Norm
1081 instances of class org.apache.lucene.store.FSDirectory$FSIndexInput
1048 instances of class org.apache.lucene.index.CompoundFileReader$CSIndexInput
540 instances of class org.apache.lucene.index.TermBuffer
540 instances of class org.apache.lucene.util.UnicodeUtil$UTF16Result
540 instances of class org.apache.lucene.util.UnicodeUtil$UTF8Result
...

There are way too many instances of index.TermInfo and index.Term. So, I tracked down some instances and looked for rootset references. There were none. If I understand correctly, an instance should be garbage collected if there are no rootset references. True?

Here's an example from jhat:

Rootset references to org.apache.lucene.index.TermInfo@0x7fbf6e3f8218 (includes weak refs)

   References to org.apache.lucene.index.TermInfo@0x7fbf6e3f8218 (40 bytes)
---
There is at least one reference to the object (it is an element in an array), but the array does not have rootset references either.

Am I misinterpreting these results? In any case, what can I do to get rid of these objects? Is it a bug in this version of Lucene? Is there a known fix? I think I should be able to run an unlimited number of queries without filling up the heap.
I am using PyLucene version 2.4.
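One thing I am going to try, sketched below: opening the reader and searcher once and reusing them for every query, instead of opening and closing a pair per call the way my snippet does. IndexReader, IndexSearcher, TermQuery, Term, and Hit are the stock Lucene 2.4 classes; the class and method names wrapped around them here are made up for the sketch:

    import lucene
    lucene.initVM(lucene.CLASSPATH)

    class TermVectorSource(object):
        def __init__(self, index_dir):
            # One reader/searcher pair for the life of the object.
            self.lReader = lucene.IndexReader.open(index_dir)
            self.lSearcher = lucene.IndexSearcher(self.lReader)

        def docIdForUid(self, field, uid):
            query = lucene.TermQuery(lucene.Term(field, uid))
            hits = list(self.lSearcher.search(query))
            if hits:
                return lucene.Hit.cast_(hits[0]).id
            return None

        def close(self):
            self.lSearcher.close()
            self.lReader.close()

Even if it doesn't cure the leak, it should at least cut down the per-query churn of FSIndexInput and friends in the histogram above.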

Thanks for your help.

Herb

-------------------------------
Code snippet:
        # Snippet from inside a method: lucene and copy are imported at module
        # level; tFields (a dict), termFields, returnAllFields, minFreq, and
        # uid all come from the surrounding method.
        reader = self.index.getReader()
        lReader = reader.get()
        searcher = self.index.getSearcher()
        lSearcher = searcher.get()
        query = lucene.TermQuery(lucene.Term(OTDocument.UID_FIELD_ID, uid))
        hits = list(lSearcher.search(query))
        if hits:
            hit = lucene.Hit.cast_(hits[0])
            tfvs = lReader.getTermFreqVectors(hit.id)

            if tfvs is not None:  # tfvs is None if the term vectors were not stored
                for tfv in tfvs:  # there's one for each field that has a TermFreqVector
                    tfvP = lucene.TermFreqVector.cast_(tfv)
                    if returnAllFields or tfvP.field in termFields:  # add only the requested fields
                        tFields[tfvP.field] = dict([(t, f) for (t, f)
                                                    in zip(tfvP.getTerms(), tfvP.getTermFrequencies())
                                                    if f >= minFreq])
        else:
            # This shouldn't happen, but we just log the error and march on
            self.log.error("Unable to fetch doc %s from index" % uid)

        ## if self.opCount % 1000 == 0:
        ##     print lucene.JCCEnv._dumpRefs(classes=True).items()  # http://lists.osafoundation.org/pipermail/pylucene-dev/2008-January/002171.html
        ## self.opCount += 1

        lReader.close()
        lSearcher.close()
        retFields = copy.deepcopy(tFields)  # return a copy of tFields to free up references to it and its contents
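One more experiment I plan to try (my own hypothesis, not a known fix): each JCC wrapper holds a reference on its Java object until the Python wrapper itself is collected, so nudging Python's collector periodically might release wrappers that are stuck in reference cycles. In this sketch, uids, fetchTermFreqs, and process are stand-ins for my application code:

    import gc

    for n, uid in enumerate(uids):
        tFields = fetchTermFreqs(uid)  # the method shown in the snippet above
        process(tFields)               # whatever the application does with them
        if n and n % 1000 == 0:
            gc.collect()               # free wrappers reachable only through cycles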



Herbert Roitblat wrote:
Hi, folks.
I am using PyLucene and retrieving a lot of tokens.  lucene.py reports version
2.4.0.  The machine runs rPath Linux with 8GB of memory.  Python is 2.4.
I'm not sure what the max heap is; I think it is maxheap='2048m'.  I think
it's running in a 64-bit environment.
It indexes a set of 116,000 documents just fine.
Then I need to get the tokens from these documents, and near the end I run into:

java.lang.OutOfMemoryError: GC overhead limit exceeded

If I wait a bit and ask again for the same document's tokens, I can get them,
but it is then somewhat likely to raise the same error on a number of
other documents.  I can handle these errors and ask again.

I have read that this error message means that the heap is getting filled up
and that garbage collection is reclaiming only a small fraction of it.  Since all I am
doing is retrieving, why should the heap be filling up?  I restarted the system
before starting the retrieval.
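For reference, I understand the heap ceiling is fixed when the JVM is created, so the only way to get more headroom is to pass a bigger maxheap to initVM at startup; the 4096m below is just a hypothetical value for this 8GB, 64-bit box:

    import lucene

    # maxheap is set once at JVM creation and cannot be raised afterwards;
    # '4096m' is an arbitrary example for an 8GB, 64-bit machine.
    lucene.initVM(lucene.CLASSPATH, maxheap='4096m')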

My guess is that there is some small memory leak, because the memory assigned to my
Python program grows slowly as I request more document tokens.  Since I'm not
intending to change anything in either my Python program or in Lucene, any
growth is unintentional.  I'm just getting tokens.

We use lucene.TermQuery as the query object to get the terms.

I cannot share the documents or the application code, but I might be able to
provide snippets.

One last piece of information: document retrieval slows down throughout the
process.  In the beginning I was getting about 10 documents per second;
towards the end, it is down to about 5, with roughly 5-second pauses from
time to time, perhaps due to garbage collection?

Any idea of why the heap is filling up and what I can do about it?

Thanks,
Herb


