On Apr 14, 2010, at 10:21, "Herbert Roitblat" <h...@orcatec.com> wrote:
Hi, folks.
I am using PyLucene and doing a lot of token retrieval. lucene.py
reports version 2.4.0. It is rPath Linux with 8 GB of memory;
Python is 2.4. The system indexes 116,000 documents just fine.
Max heap is '2048m', in a 64-bit environment.
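For reference, the JVM gets started roughly like this (simplified
sketch; the actual call in our code may differ):

import lucene

# Simplified sketch of the JVM startup; the maxheap value is the
# '2048m' mentioned above. initVM also accepts initialheap and
# vmargs if more tuning is needed.
lucene.initVM(lucene.CLASSPATH, maxheap='2048m')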
Then I need to get the tokens from these documents, and near the
end I run into:
java.lang.OutOfMemoryError: GC overhead limit exceeded
The heap is apparently filling up with each document retrieved and
never getting cleared. I was expecting that it would give me the
information for one document, then clear that and give me the info
for another, etc. I've looked at it with jhat.
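A rough way to rule out growth on the Python side (just a sketch,
not taken from the actual code):

import gc

# If the number of live Python objects stays flat from one document
# to the next, the growth that jhat shows must be on the Java heap,
# not in the Python wrappers.
before = len(gc.get_objects())
# ... fetch the term vectors for one document here ...
after = len(gc.get_objects())
print "python objects: %d -> %d" % (before, after)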
I have tried deleting the Python objects that receive any
information from Lucene--no effect.
I have tried reusing the Python objects that receive any information
from Lucene--no effect.
I have tried running the Python garbage collector (it slowed the
program slightly, but generally no effect).
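Concretely, the delete-and-collect attempt looked roughly like this
(getTokens and process are placeholders, not the real names):

import gc

for uid in uids:
    tFields = self.getTokens(uid)  # the snippet shown further down
    process(tFields)               # whatever consumes the tokens
    del tFields                    # drop the Python-side references
    gc.collect()                   # forced collection; made no difference in practice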
Is there anything else I can do to get the tokens for a document and
make sure that this does not fill up the heap? I need to be able to
run a million or more documents through this and get their tokens.
Could you try a newer version of PyLucene, such as 2.9.2 or 3.0.1?
I remember that some string leaks got fixed since 2.4.0, which is
quite old by now.
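If you are not sure which build gets picked up at runtime, something
like this should tell you (a quick sketch using the module-level
version constants):

import lucene

# Report which PyLucene/Lucene build is actually being imported.
print 'PyLucene:', lucene.VERSION
print 'Lucene:  ', lucene.LUCENE_VERSION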
Andi..
Here is a code snippet.
reader = self.index.getReader()
lReader = reader.get()
searcher = self.index.getSearcher()
lSearcher = searcher.get()
query = lucene.TermQuery(lucene.Term(OTDocument.UID_FIELD_ID, uid))
hits = list(lSearcher.search(query))
if hits:
    hit = lucene.Hit.cast_(hits[0])
    tfvs = lReader.getTermFreqVectors(hit.id)
    if tfvs is not None:  # this happens if the vector is not stored
        for tfv in tfvs:  # There's one for each field that has a TermFreqVector
            tfvP = lucene.TermFreqVector.cast_(tfv)
            if returnAllFields or tfvP.field in termFields:  # add only asked fields
                tFields[tfvP.field] = dict([(t, f) for (t, f)
                                            in zip(tfvP.getTerms(), tfvP.getTermFrequencies())
                                            if f >= minFreq])
else:
    # This shouldn't happen, but we just log the error and march on
    self.log.error("Unable to fetch doc %s from index" % (uid))
lReader.close()
lSearcher.close()
lReader is really:
lucene.IndexReader.open(self._store)
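One workaround I'm considering (sketch only, untested; getTermVectors
and handle are placeholders for the snippet above and whatever
consumes its output): share one reader/searcher across a batch of
documents and close them only between batches, inside a try/finally.

import gc
import lucene

BATCH = 10000  # arbitrary batch size

for start in range(0, len(uids), BATCH):
    lReader = lucene.IndexReader.open(self._store)
    lSearcher = lucene.IndexSearcher(lReader)
    try:
        for uid in uids[start:start + BATCH]:
            tFields = self.getTermVectors(lReader, lSearcher, uid)
            handle(tFields)
    finally:
        lSearcher.close()
        lReader.close()
    gc.collect()  # give the wrappers from this batch a chance to be collected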
I've tried the Lucene list, but no one there has come up with a
solution yet. If filling the heap is a Lucene problem (is it a
bug?), I need to find a way to work around it.
Thanks,
Herb