Hi Michael and others,

I did get to the bottom of my problem: there was a bug in my code that was eating up the memory, which I figured out after a lot of effort. Thanks, all of you, for your suggestions.
Regards,
Ajay


Michael McCandless-2 wrote:
>
> I agree, memory profiler or heap dump or small test case is the next step... the code looks fine.
>
> This is always a single thread adding docs?
>
> Are you really certain that the iterator only iterates over 2500 docs?
>
> What analyzer are you using?
>
> Mike
>
> On Thu, Mar 4, 2010 at 4:50 AM, Ian Lea <ian....@gmail.com> wrote:
>> Have you run it through a memory profiler yet? Seems the obvious next step.
>>
>> If that doesn't help, cut it down to the simplest possible self-contained program that demonstrates the problem and post it here.
>>
>> --
>> Ian.
>>
>> On Thu, Mar 4, 2010 at 6:04 AM, ajay_gupta <ajay...@gmail.com> wrote:
>>>
>>> Erick,
>>> w_context and context_str are local to this method and are used only for the 2500 documents in each chunk, not the entire 70k. I am clearing the hashmap after each 2500-doc chunk, and I also printed the memory consumed by the hashmap, which stays roughly constant from chunk to chunk. Each invocation of update_context should therefore use a roughly constant amount of memory, but after each invocation it grows by a few MB, and after about 70k documents it goes OOM. So something inside update_context -- some search/update/add-document operation -- is allocating memory that is not released when the method returns.
>>>
>>> -Ajay
>>>
>>>
>>> Erick Erickson wrote:
>>>>
>>>> The first place I'd look is how big your strings got. w_context and context_str come to mind. My first suspicion is that you're building ever-longer strings, and by around 70K documents your strings are large enough to produce OOMs.
>>>>
>>>> FWIW
>>>> Erick
>>>>
>>>> On Wed, Mar 3, 2010 at 1:09 PM, ajay_gupta <ajay...@gmail.com> wrote:
>>>>
>>>>>
>>>>> Mike,
>>>>> Actually my documents are very small. We have CSV files where each record represents a document, and none of them is very large, so I don't think document size is the issue.
>>>>> For each record I tokenize it, and for each token I keep 3 neighbouring tokens in a Hashtable. After every X documents (X is currently 2500) I build the index with the following code:
>>>>>
>>>>>     // Initialization step, done only once at startup
>>>>>     cram = FSDirectory.open(new File("lucenetemp2"));
>>>>>     context_writer = new IndexWriter(cram, analyzer, true,
>>>>>             IndexWriter.MaxFieldLength.LIMITED);
>>>>>
>>>>>     // Called after each batch of 2500 docs
>>>>>     update_context()
>>>>>     {
>>>>>         context_writer.commit();
>>>>>         context_writer.optimize();
>>>>>
>>>>>         IndexSearcher is = new IndexSearcher(cram);
>>>>>         IndexReader ir = is.getIndexReader();
>>>>>         Iterator<String> it = context.keySet().iterator();
>>>>>
>>>>>         while (it.hasNext())
>>>>>         {
>>>>>             String word = it.next();
>>>>>             // All the context of "word" for the 2500 docs in this chunk
>>>>>             StringBuffer w_context = context.get(word);
>>>>>             Term t = new Term("Word", word);
>>>>>             TermQuery tq = new TermQuery(t);
>>>>>             TopScoreDocCollector collector = TopScoreDocCollector.create(1, false);
>>>>>             is.search(tq, collector);
>>>>>             ScoreDoc[] hits = collector.topDocs().scoreDocs;
>>>>>
>>>>>             if (hits.length != 0)
>>>>>             {
>>>>>                 int id = hits[0].doc;
>>>>>                 TermFreqVector tfv = ir.getTermFreqVector(id, "Context");
>>>>>
>>>>>                 // Rebuilds the context string from the TermFreqVector.
>>>>>                 // E.g. if the TermFreqVector is word1(2), word2(1), word3(2),
>>>>>                 // the output is context_str = "word1 word1 word2 word3 word3"
>>>>>                 String context_str = getContextString(tfv);
>>>>>
>>>>>                 w_context.append(context_str);
>>>>>                 Document new_doc = new Document();
>>>>>                 new_doc.add(new Field("Word", word, Field.Store.YES,
>>>>>                         Field.Index.NOT_ANALYZED));
>>>>>                 new_doc.add(new Field("Context", w_context.toString(),
>>>>>                         Field.Store.YES, Field.Index.ANALYZED,
>>>>>                         Field.TermVector.YES));
>>>>>
>>>>>                 context_writer.updateDocument(t, new_doc);
>>>>>             } else {
>>>>>                 Document new_doc = new Document();
>>>>>                 new_doc.add(new Field("Word", word, Field.Store.YES,
>>>>>                         Field.Index.NOT_ANALYZED));
>>>>>                 new_doc.add(new Field("Context", w_context.toString(),
>>>>>                         Field.Store.YES, Field.Index.ANALYZED,
>>>>>                         Field.TermVector.YES));
>>>>>
>>>>>                 context_writer.addDocument(new_doc);
>>>>>             }
>>>>>         }
>>>>>         ir.close();
>>>>>         is.close();
>>>>>     }
>>>>>
>>>>> I also print the memory after each invocation of this method, and I observed that after each call of update_context memory increases; when it reaches around 65-70k documents it goes out of memory, so memory is growing on every invocation. I expected each invocation to take a roughly constant amount of memory rather than growing cumulatively. After each invocation of update_context I also call System.gc() to release memory, and I tried various other parameters such as
>>>>>     context_writer.setMaxBufferedDocs()
>>>>>     context_writer.setMaxMergeDocs()
>>>>>     context_writer.setRAMBufferSizeMB()
>>>>> I set these to smaller values as well, but nothing worked.
>>>>>
>>>>> Any hint will be very helpful.
>>>>>
>>>>> Thanks
>>>>> Ajay
>>>>>
>>>>>
>>>>> Michael McCandless-2 wrote:
>>>>> >
>>>>> > The worst case RAM usage for Lucene is a single doc with many unique terms. Lucene allocates ~60 bytes per unique term (plus space to hold that term's characters = 2 bytes per char). And, Lucene cannot flush within one document -- it must flush after the doc has been fully indexed.
>>>>> >
>>>>> > This past thread (also from Paul) delves into some of the details:
>>>>> >
>>>>> > http://lucene.markmail.org/thread/pbeidtepentm6mdn
>>>>> >
>>>>> > But it's not clear whether that is the issue affecting Ajay -- I think more details about the docs, or some code fragments, could help shed light.
>>>>> >
>>>>> > Mike
>>>>> >
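As a rough, back-of-the-envelope illustration of the figures Mike quotes above (the document size here is hypothetical, purely for scale): a single document containing 1,000,000 unique terms averaging 10 characters each would need on the order of

    1,000,000 terms x (60 bytes + 2 bytes/char x 10 chars) = 80,000,000 bytes ≈ 76 MB

of heap before it could be flushed, since Lucene cannot flush in the middle of a document.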
>>>>> > On Tue, Mar 2, 2010 at 8:47 AM, Murdoch, Paul <paul.b.murd...@saic.com> wrote:
>>>>> >> Ajay,
>>>>> >>
>>>>> >> Here is another thread I started on the same issue.
>>>>> >>
>>>>> >> http://stackoverflow.com/questions/1362460/why-does-lucene-cause-oom-when-indexing-large-files
>>>>> >>
>>>>> >> Paul
>>>>> >>
>>>>> >>
>>>>> >> -----Original Message-----
>>>>> >> From: java-user-return-45254-paul.b.murdoch=saic....@lucene.apache.org On Behalf Of ajay_gupta
>>>>> >> Sent: Tuesday, March 02, 2010 8:28 AM
>>>>> >> To: java-user@lucene.apache.org
>>>>> >> Subject: Lucene Indexing out of memory
>>>>> >>
>>>>> >>
>>>>> >> Hi,
>>>>> >> It might be a general question, but I couldn't find the answer yet. I have around 90k documents totalling around 350 MB. Each document contains a record with some text content. For each word in this text I want to store and index that word's context, so I read each document and, for each word in it, I append a fixed number of surrounding words. To do that I first search the existing index to see whether the word already exists; if it does, I get its content, append the new context, and update the document. If no context exists yet, I create a document with the fields "word" and "context", with the word and its context as the values.
>>>>> >>
>>>>> >> I tried this in RAM, but after a certain number of docs it gave an out-of-memory error, so I switched to FSDirectory -- but surprisingly, after 70k documents it also gave an OOM error. I have enough disk space, yet I still get this error, and I am not sure why disk-based indexing gives it at all. I thought disk-based indexing would be slow but at least scalable.
>>>>> >> Could someone suggest what the issue could be?
>>>>> >>
>>>>> >> Thanks
>>>>> >> Ajay
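Much of the thread above turns on measuring memory between chunks. Here is a minimal sketch of that kind of per-invocation heap logging, assuming only the update_context() method from the posted code; everything else is illustrative:

    // Approximate used heap; gc() is only a hint to the JVM, so the numbers are
    // rough, but steady growth across calls still shows up clearly.
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        rt.gc();
        return rt.totalMemory() - rt.freeMemory();
    }

    // Around each chunk flush:
    long before = usedHeapBytes();
    update_context();                      // method from the posted code
    long after = usedHeapBytes();
    System.out.println("retained after this chunk: "
            + ((after - before) / (1024 * 1024)) + " MB");

If the printed value climbs chunk after chunk, something reachable is accumulating across invocations, which is exactly the symptom described above.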
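Erick's "ever-longer strings" theory above is also cheap to check directly. A small sketch using names from the posted code (context is assumed to be the word-to-StringBuffer map that is rebuilt for every chunk):

    // How much character data is buffered for this chunk, and how large does the
    // biggest per-word context get? If the maximum keeps climbing from one chunk
    // to the next, the per-word context strings are growing without bound.
    long totalChars = 0;
    int maxChars = 0;
    for (StringBuffer sb : context.values()) {
        totalChars += sb.length();
        maxChars = Math.max(maxChars, sb.length());
    }
    System.out.println("chunk buffers: ~" + (2 * totalChars) / (1024 * 1024)
            + " MB of char data, largest word context = " + maxChars + " chars");

Printing this once per call to update_context(), after the existing context has been appended, makes it easy to see whether the stored "Context" field really does keep growing.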