Can you whittle down your example even more? E.g., don't read the term vectors for the first hit. Just open a single reader and do the TermQuery search over and over?
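Something like this, roughly -- an untested sketch, where the index path and the field/term are placeholders you would swap for your own:

    import lucene
    lucene.initVM(lucene.CLASSPATH)

    lReader = lucene.IndexReader.open("/path/to/index")        # placeholder path; open your store however you normally do
    lSearcher = lucene.IndexSearcher(lReader)

    query = lucene.TermQuery(lucene.Term("uid", "some-uid"))   # placeholder field/value
    for i in xrange(100000):
        hits = lSearcher.search(query)                         # 2.4-style Hits API
        if hits.length() > 0:
            docId = hits.id(0)                                 # note: no getTermFreqVectors call, on purpose

    lSearcher.close()
    lReader.close()

If that loop alone still blows up the heap, the leak is in the search path; if it doesn't, the problem is in the term-vector reading or in how your readers/searchers get opened and closed.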
BTW, what does this line in PyLucene do?

    tfvP = lucene.TermFreqVector.cast_(tfv)

You never hit exceptions in this code, right? (Because that would cause your close() calls to be skipped -- really, you should move the .close() calls into a finally clause.)
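Something like this (just a sketch, keeping your getReader/getSearcher wrappers as-is, whatever they do) guarantees the closes run even if the term-vector code throws:

    reader = self.index.getReader()
    lReader = reader.get()
    searcher = self.index.getSearcher()
    lSearcher = searcher.get()
    try:
        # ... run the TermQuery and collect the term vectors as before ...
        pass
    finally:
        # always runs, even when the block above raises
        lSearcher.close()
        lReader.close()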
Mike

On Mon, Apr 12, 2010 at 10:54 AM, Herbert Roitblat <h...@orcatec.com> wrote:
> Update: reusing the reader and searcher made almost no difference. It still eats up the heap.
>
> ----- Original Message -----
> From: "Herbert L Roitblat" <h...@orcatec.com>
> To: <java-user@lucene.apache.org>
> Sent: Monday, April 12, 2010 6:50 AM
> Subject: Re: java.lang.OutOfMemoryError: GC overhead limit exceeded
>
>> Thank you, Michael. Your suggestions are helpful. I inherited all of the code that uses PyLucene and don't consider myself an expert on it, so I very much appreciate your suggestions.
>>
>> It does not seem to be the case that these elements represent the index of the collection. TermInfo and Term grow as I retrieve more documents. There was no trouble building the index.
>>
>> The contents of these fields are the tokens (some fields are tokenized, others not) of the document fields. In the tokenized fields, there is one object for each word. They seem to be in the order of the documents for which the term vectors are being sought. So these objects seem to represent a "concatenation" of all of the documents being considered, in order, and if they are never removed, they would always overwhelm the heap with a large document set. They are not the index in the usual sense, I think. Before I start retrieving documents, there is barely anything in these objects.
>>
>> What is holding the document contents in the heap after the fields information is returned?
>>
>> Can you say more about incRef/decRef? I deleted all variables that interacted with Lucene and it seems to have made no difference.
>>
>> There are not a lot of different fields -- I would say on the order of 50, with about 20 of them in virtually every document.
>>
>> It uses:
>>
>>     lucene.IndexReader.open(self._store)
>>
>> One suggestion I got is to put the reader code in the class init function and then reuse it. I have not tried that one yet (next on the agenda). You suggested something similar and I will try that.
>>
>> Thanks,
>> Herb
>>
>> Michael McCandless wrote:
>>>
>>> The large count of TermInfo & Term is completely normal -- this is Lucene's term index, which is entirely RAM resident.
>>>
>>> In 3.1, with flexible indexing, the RAM efficiency of the terms index should be much improved.
>>>
>>> While opening a new reader/searcher for every query is horribly inefficient, it should not leak memory. (Are you using IndexReader.reopen? I see calls to getReader, but this Lucene API (near-real-time search) wasn't added until 2.9, and you're on 2.4, so I think that's your own method?)
>>>
>>> What do your get/getReader/getSearcher calls do? Are you using incRef/decRef at all to manage the lifetime of your readers? How many unique field names do you have, across all docs that you index?
>>>
>>> If you change your test to open a single reader, but run that TermQuery over and over and over again, do you still hit OOME?
>>>
>>> Mike
>>>
>>> On Sun, Apr 11, 2010 at 1:28 PM, Herbert L Roitblat <h...@orcatec.com> wrote:
>>>
>>>> Hi, Folks. Thanks, Ruben, for your help. It let me get a ways down the road.
>>>>
>>>> The problem is that the heap is filling up when I am doing a lucene.TermQuery. What I am trying to accomplish is to get the terms in one field of each document and their frequency in the document. A code snippet is attached below. It yields the results I want.
>>>>
>>>> I managed to get a small enough heap dump into jhat. Now I could use some help understanding what I have found, and some help figuring out what to do about it. I am a newbie at understanding the details of Lucene, PyLucene, and Java debugging.
>>>>
>>>> If I understand correctly, the heap is filling up because it is keeping instances of objects around after there is no longer any need for them. I thought that it might be the case that Python was somehow keeping them around, but that does not seem to be the case (true?).
>>>>
>>>> From jhat, I got a class instance histogram:
>>>>
>>>>     290163 instances of class org.apache.lucene.index.TermInfo
>>>>     289988 instances of class org.apache.lucene.index.Term
>>>>     1976 instances of class org.apache.lucene.index.FieldInfo
>>>>     1976 instances of class org.apache.lucene.index.SegmentReader$Norm
>>>>     1081 instances of class org.apache.lucene.store.FSDirectory$FSIndexInput
>>>>     1048 instances of class org.apache.lucene.index.CompoundFileReader$CSIndexInput
>>>>     540 instances of class org.apache.lucene.index.TermBuffer
>>>>     540 instances of class org.apache.lucene.util.UnicodeUtil$UTF16Result
>>>>     540 instances of class org.apache.lucene.util.UnicodeUtil$UTF8Result
>>>>     ...
>>>>
>>>> There are way too many instances of index.TermInfo and index.Term. So I tracked down some instances and looked for rootset references. There were none. If I understand correctly, an instance should be garbage collected if there are no rootset references. True?
>>>>
>>>> Here's an example from jhat:
>>>>
>>>>     Rootset references to org.apache.lucene.index.termi...@0x7fbf6e3f8218 (includes weak refs)
>>>>     References to org.apache.lucene.index.termi...@0x7fbf6e3f8218 (40 bytes)
>>>>
>>>> There is at least one reference to the object -- it is an element in an array -- but the array does not have rootset references either.
>>>>
>>>> Am I misinterpreting these results? In any case, what can I do about getting rid of these? Is it a bug in this version of Lucene? Is there a known fix? I think that I should be able to do an unlimited number of queries without filling up the heap.
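>>>>
>>>> (For what it's worth, here is roughly how one can check whether the Python/JCC side is pinning references -- this is what the commented-out _dumpRefs lines in the snippet below were for, per the pylucene-dev posting linked there. As far as I can tell, _dumpRefs(classes=True) returns a dict mapping class names to live JCC reference counts:)
>>>>
>>>>     refs = lucene.JCCEnv._dumpRefs(classes=True)
>>>>     # show the 20 classes with the most live references held on the JCC side
>>>>     for cls, count in sorted(refs.items(), key=lambda kv: -kv[1])[:20]:
>>>>         print cls, count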
>>>>
>>>> I am using PyLucene version 2.4.
>>>>
>>>> Thanks for your help.
>>>>
>>>> Herb
>>>>
>>>> -------------------------------
>>>> Code snippet:
>>>>
>>>>     reader = self.index.getReader()
>>>>     lReader = reader.get()
>>>>     searcher = self.index.getSearcher()
>>>>     lSearcher = searcher.get()
>>>>     query = lucene.TermQuery(lucene.Term(OTDocument.UID_FIELD_ID, uid))
>>>>     hits = list(lSearcher.search(query))
>>>>     if hits:
>>>>         hit = lucene.Hit.cast_(hits[0])
>>>>         tfvs = lReader.getTermFreqVectors(hit.id)
>>>>
>>>>         if tfvs is not None:  # this happens if the vector is not stored
>>>>             for tfv in tfvs:  # There's one for each field that has a TermFreqVector
>>>>                 tfvP = lucene.TermFreqVector.cast_(tfv)
>>>>                 if returnAllFields or tfvP.field in termFields:  # add only asked fields
>>>>                     tFields[tfvP.field] = dict([(t, f) for (t, f) in zip(tfvP.getTerms(), tfvP.getTermFrequencies()) if f >= minFreq])
>>>>     else:
>>>>         # This shouldn't happen, but we just log the error and march on
>>>>         self.log.error("Unable to fetch doc %s from index" % (uid))
>>>>
>>>>     ## if self.opCount % 1000 == 0:
>>>>     ##     print lucene.JCCEnv._dumpRefs(classes=True).items()
>>>>     # http://lists.osafoundation.org/pipermail/pylucene-dev/2008-January/002171.html
>>>>     ## self.opCount += 1
>>>>
>>>>     lReader.close()
>>>>     lSearcher.close()
>>>>     retFields = copy.deepcopy(tFields)  # return a copy of tFields to free up references to it and its contents
>>>>
>>>> Herbert Roitblat wrote:
>>>>
>>>>> Hi, folks.
>>>>>
>>>>> I am using PyLucene and doing a lot of "get tokens". lucene.py reports version 2.4.0. It is rPath Linux with 8 GB of memory. Python is 2.4. I'm not sure what the max heap is; I think it is maxheap='2048m'. I think it's running in a 64-bit environment.
>>>>>
>>>>> It indexes a set of 116,000 documents just fine. Then I need to get the tokens from these documents, and near the end I run into:
>>>>>
>>>>>     java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>>>
>>>>> If I wait a bit and ask again for the same document's tokens, I can get them, but then it is somewhat likely to post the same error on a certain number of other documents. I can handle these errors and ask again.
>>>>>
>>>>> I have read that this error message means that the heap is getting filled up and garbage collection removes only a small amount of it. Since all I am doing is retrieving, why should the heap be filling up? I restarted the system before starting the retrieval.
>>>>>
>>>>> My guess is that there is some small memory leak, because the memory assigned to my Python program grows slowly as I request more document tokens. Since I'm not intending to change anything in either my Python program or in Lucene, any growth is unintentional. I'm just getting tokens.
>>>>>
>>>>> We use lucene.TermQuery as the query object to get the terms.
>>>>>
>>>>> I cannot share the documents nor the application code, but I might be able to provide snippets.
>>>>>
>>>>> One last piece of information: the time needed to retrieve documents slows throughout the process. In the beginning I was getting about 10 documents per second. Towards the end, it is down to about 5, with roughly 5-second pauses from time to time, perhaps due to garbage collection?
>>>>>
>>>>> Any idea of why the heap is filling up and what I can do about it?
>>>>>
>>>>> Thanks,
>>>>> Herb

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org