On Apr 14, 2010, at 10:32, Aric Coady <aric.co...@gmail.com> wrote:
Hey, Herb.
There is a memory leak in the string array wrapper in PyLucene 2.4; in this
case it is triggered by the iteration of tfvP.getTerms(). The fix made it
into 2.9; more history here:
http://mail-archives.apache.org/mod_mbox/lucene-pylucene-dev/200907.mbox/%3calpine.osx.2.01.0907301553230.5...@yuzu%3e
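A minimal sketch of the pattern that exercises the leak, assuming an existing 2.4-era index with stored term vectors (the index path is hypothetical); iterating the array returned by getTerms() in a loop like this is where the growth shows up under 2.4, while the same loop holds steady under 2.9:

    import lucene

    lucene.initVM(lucene.CLASSPATH)
    # Hypothetical index location; any index with stored term vectors will do.
    reader = lucene.IndexReader.open("/path/to/index")
    try:
        for doc_id in xrange(reader.maxDoc()):
            if reader.isDeleted(doc_id):
                continue
            tfvs = reader.getTermFreqVectors(doc_id)
            if tfvs is None:
                continue  # term vectors not stored for this document
            for tfv in tfvs:
                tfvP = lucene.TermFreqVector.cast_(tfv)
                # Under PyLucene 2.4 each pass over this string array leaks
                # a wrapper per element; watch resident memory grow.
                for term in tfvP.getTerms():
                    pass
    finally:
        reader.close()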
Yes, exactly. Thanks Aric, I should have read your reply before
sending mine.
Andi..
On Apr 14, 2010, at 10:21 AM, Herbert Roitblat wrote:
Hi, folks.
I am using PyLucene and retrieving the tokens for a lot of documents.
lucene.py reports version 2.4.0. The machine runs rPath Linux with 8 GB of
memory; Python is 2.4. The system indexes 116,000 documents just fine.
The max heap is '2048m', in a 64-bit environment.
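For reference, that heap limit comes from how the embedded JVM is started; a minimal sketch, assuming the usual initVM call for this PyLucene version (the '2048m' figure is just the one mentioned above):

    import lucene
    # Assumed startup call; maxheap is a real initVM keyword in this version.
    lucene.initVM(lucene.CLASSPATH, maxheap='2048m')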
Then I need to get the tokens from these documents, and near the end of that
pass I run into:
java.lang.OutOfMemoryError: GC overhead limit exceeded
The heap is apparently filling up with each document retrieved and
never getting cleared. I was expecting that it would give me the
information for one document, then clear that and give me the info
for another, etc. I've looked at it with jhat.
I have tried deleting the Python objects that receive any
information from Lucene--no effect.
I have tried reusing the Python objects that receive any
information from Lucene--no effect.
I have tried running the Python garbage collector (it slowed the
program slightly, but generally no effect).
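Roughly, those cleanup attempts look like the following sketch (the variable names match the snippet further down; the comments record the observed outcome):

    import gc

    tfvs = lReader.getTermFreqVectors(hit.id)
    # ... build the Python-side dict from the vectors ...
    del tfvs       # drop the Python wrappers explicitly -- no effect
    gc.collect()   # force a Python collection pass -- slightly slower, still no effect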
Is there anything else I can do to get the tokens for a document
and make sure that this does not fill up the heap? I need to be
able to run a million or more documents through this and get their
tokens.
Here is a code snippet.
    reader = self.index.getReader()
    lReader = reader.get()
    searcher = self.index.getSearcher()
    lSearcher = searcher.get()

    query = lucene.TermQuery(lucene.Term(OTDocument.UID_FIELD_ID, uid))
    hits = list(lSearcher.search(query))
    if hits:
        hit = lucene.Hit.cast_(hits[0])
        tfvs = lReader.getTermFreqVectors(hit.id)
        if tfvs is not None:  # this happens if the vector is not stored
            for tfv in tfvs:  # there's one for each field that has a TermFreqVector
                tfvP = lucene.TermFreqVector.cast_(tfv)
                if returnAllFields or tfvP.field in termFields:  # add only asked fields
                    tFields[tfvP.field] = dict([(t, f) for (t, f) in zip(tfvP.getTerms(), tfvP.getTermFrequencies()) if f >= minFreq])
    else:
        # This shouldn't happen, but we just log the error and march on
        self.log.error("Unable to fetch doc %s from index" % (uid))

    lReader.close()
    lSearcher.close()
lReader is really:
lucene.IndexReader.open(self._store)
I've tried the Lucene list, but no one there has come up with a solution
yet. If filling the heap is a Lucene problem (is it a bug?), I need to find
a way to work around it.
Thanks,
Herb