Hi, folks.
I am using PyLucene and doing a lot of token retrieval.  lucene.py reports
version 2.4.0.  It is rPath Linux with 8 GB of memory.  Python is 2.4.

The system indexes 116,000 documents just fine.  

Maxheap is '2048m' in a 64-bit environment.
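
For reference, the JVM is initialized roughly like this (a sketch; the exact
initVM keywords depend on the JCC version bundled with PyLucene):

import lucene

# 64-bit JVM with a 2 GB maximum heap
lucene.initVM(lucene.CLASSPATH, maxheap='2048m')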

Then I need to get the tokens from these documents and near the end, I run
into:

java.lang.OutOfMemoryError: GC overhead limit exceeded

The heap is apparently filling up with each document retrieved and never 
getting cleared.  I was expecting that it would give me the information for one 
document, then clear that and give me the info for another, etc.  I've looked 
at it with jhat.

I have tried deleting the Python objects that receive any information from
Lucene--no effect.
I have tried reusing the Python objects that receive any information from
Lucene--no effect.
I have tried running the Python garbage collector explicitly (it slowed the
program slightly, but otherwise no effect); a sketch of this per-document
cleanup is below.
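
Roughly, the per-document cleanup I tried looks like this (a sketch; gc is the
standard Python module, and the commented-out call assumes java.lang.System is
wrapped by the PyLucene build, which I have not verified):

import gc
import lucene  # initVM() has already been called at startup

# ... at the end of each per-document iteration ...
tfvs = None    # drop the Python wrappers so JCC can release the Java objects
tfvP = None
gc.collect()   # explicit Python collection: slowed things slightly, no effect
# lucene.System.gc()  # explicit JVM collection, if java.lang.System is wrapped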

Is there anything else I can do to get the tokens for a document and make sure 
that this does not fill up the heap?  I need to be able to run a million or 
more documents through this and get their tokens.


Here is a code snippet.

        reader = self.index.getReader()
        lReader = reader.get()
        searcher = self.index.getSearcher()
        lSearcher = searcher.get()
        query = lucene.TermQuery(lucene.Term(OTDocument.UID_FIELD_ID, uid))
        hits = list(lSearcher.search(query))
        if hits:
            hit = lucene.Hit.cast_(hits[0])
            tfvs = lReader.getTermFreqVectors(hit.id)

            if tfvs is not None: # this happens if the vector is not stored
                for tfv in tfvs: # There's one for each field that has a TermFreqVector
                    tfvP = lucene.TermFreqVector.cast_(tfv)
                    if returnAllFields or tfvP.field in termFields: # add only asked fields
                        tFields[tfvP.field] = dict([(t, f) for (t, f) in
                            zip(tfvP.getTerms(), tfvP.getTermFrequencies()) if f >= minFreq])
        else:
            # This shouldn't happen, but we just log the error and march on
            self.log.error("Unable to fetch doc %s from index"%(uid))

        lReader.close()
        lSearcher.close()

lReader is really:
lucene.IndexReader.open(self._store)
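
For comparison, a per-field variant of the same lookup would look roughly like
this (a sketch; it assumes termFields holds the field names of interest and
relies on IndexReader.getTermFreqVector(docNumber, field), which I believe is
in Lucene 2.4):

        if hits:
            hit = lucene.Hit.cast_(hits[0])
            for field in termFields:
                tfv = lReader.getTermFreqVector(hit.id, field)
                if tfv is None: # vector not stored for this field
                    continue
                tfvP = lucene.TermFreqVector.cast_(tfv)
                tFields[field] = dict([(t, f) for (t, f) in
                    zip(tfvP.getTerms(), tfvP.getTermFrequencies()) if f >= minFreq])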

I've tried the Lucene list, but no one there has yet come up with a solution.
If filling the heap like this is a Lucene problem (is it a bug?), I need to
find a way to work around it.

Thanks, 

Herb
