Hi Michael and others,

I did get to the bottom of my problem: there was a bug in my code that was eating up the memory, which I figured out after a lot of effort. Thanks, all of you, for your suggestions.
Regards,
Ajay


Michael McCandless-2 wrote:
>
> I agree, memory profiler or heap dump or small test case is the next step... the code looks fine.
>
> This is always a single thread adding docs?
>
> Are you really certain that the iterator only iterates over 2500 docs?
>
> What analyzer are you using?
>
> Mike
>
> On Thu, Mar 4, 2010 at 4:50 AM, Ian Lea <ian....@gmail.com> wrote:
>> Have you run it through a memory profiler yet? Seems the obvious next step.
>>
>> If that doesn't help, cut it down to the simplest possible self-contained program that demonstrates the problem and post it here.
>>
>> --
>> Ian.
>>
>> On Thu, Mar 4, 2010 at 6:04 AM, ajay_gupta <ajay...@gmail.com> wrote:
>>>
>>> Erick,
>>> w_context and context_str are local to this method and are used only for the 2500 documents in each chunk, not the entire 70k. I am clearing the hashmap after each 2500-doc chunk, and I also printed the memory consumed by the hashmap, which stays roughly constant from chunk to chunk. Each invocation of update_context should therefore use a roughly constant amount of memory, but after each invocation it grows by a few MB, and after about 70k documents it goes OOM. So something inside update_context -- some search/update/add-document operation -- is allocating memory that is not released when the method returns.
>>>
>>> -Ajay
>>>
>>>
>>> Erick Erickson wrote:
>>>>
>>>> The first place I'd look is how big your strings got. w_context and context_str come to mind. My first suspicion is that you're building ever-longer strings, and by around 70K documents your strings are large enough to produce OOMs.
>>>>
>>>> FWIW
>>>> Erick
>>>>
>>>> On Wed, Mar 3, 2010 at 1:09 PM, ajay_gupta <ajay...@gmail.com> wrote:
>>>>
>>>>>
>>>>> Mike,
>>>>> Actually my documents are very small. We have CSV files where each record represents a document, and none of them is very large, so I don't think document size is the issue.
>>>>> For each record I tokenize it, and for each token I keep 3 neighbouring tokens in a Hashtable. After every X documents (X is currently 2500) I build the index with the following code:
>>>>>
>>>>>     // Initialization step, done only once at startup
>>>>>     cram = FSDirectory.open(new File("lucenetemp2"));
>>>>>     context_writer = new IndexWriter(cram, analyzer, true,
>>>>>             IndexWriter.MaxFieldLength.LIMITED);
>>>>>
>>>>>     // Called after each batch of 2500 docs
>>>>>     update_context()
>>>>>     {
>>>>>         context_writer.commit();
>>>>>         context_writer.optimize();
>>>>>
>>>>>         IndexSearcher is = new IndexSearcher(cram);
>>>>>         IndexReader ir = is.getIndexReader();
>>>>>         Iterator<String> it = context.keySet().iterator();
>>>>>
>>>>>         while (it.hasNext())
>>>>>         {
>>>>>             String word = it.next();
>>>>>             // All the context of "word" for the 2500 docs in this chunk
>>>>>             StringBuffer w_context = context.get(word);
>>>>>             Term t = new Term("Word", word);
>>>>>             TermQuery tq = new TermQuery(t);
>>>>>             TopScoreDocCollector collector = TopScoreDocCollector.create(1, false);
>>>>>             is.search(tq, collector);
>>>>>             ScoreDoc[] hits = collector.topDocs().scoreDocs;
>>>>>
>>>>>             if (hits.length != 0)
>>>>>             {
>>>>>                 int id = hits[0].doc;
>>>>>                 TermFreqVector tfv = ir.getTermFreqVector(id, "Context");
>>>>>
>>>>>                 // Rebuilds the context string from the TermFreqVector.
>>>>>                 // E.g. if the TermFreqVector is word1(2), word2(1), word3(2),
>>>>>                 // the output is context_str = "word1 word1 word2 word3 word3"
>>>>>                 String context_str = getContextString(tfv);
>>>>>
>>>>>                 w_context.append(context_str);
>>>>>                 Document new_doc = new Document();
>>>>>                 new_doc.add(new Field("Word", word, Field.Store.YES,
>>>>>                         Field.Index.NOT_ANALYZED));
>>>>>                 new_doc.add(new Field("Context", w_context.toString(),
>>>>>                         Field.Store.YES, Field.Index.ANALYZED,
>>>>>                         Field.TermVector.YES));
>>>>>
>>>>>                 context_writer.updateDocument(t, new_doc);
>>>>>             } else {
>>>>>                 Document new_doc = new Document();
>>>>>                 new_doc.add(new Field("Word", word, Field.Store.YES,
>>>>>                         Field.Index.NOT_ANALYZED));
>>>>>                 new_doc.add(new Field("Context", w_context.toString(),
>>>>>                         Field.Store.YES, Field.Index.ANALYZED,
>>>>>                         Field.TermVector.YES));
>>>>>
>>>>>                 context_writer.addDocument(new_doc);
>>>>>             }
>>>>>         }
>>>>>         ir.close();
>>>>>         is.close();
>>>>>     }
>>>>>
>>>>> I also print the memory after each invocation of this method, and I observed that after each call of update_context memory increases; when it reaches around 65-70k documents it goes out of memory, so memory is growing on every invocation. I expected each invocation to take a roughly constant amount of memory rather than growing cumulatively. After each invocation of update_context I also call System.gc() to release memory, and I tried various other parameters such as
>>>>>     context_writer.setMaxBufferedDocs()
>>>>>     context_writer.setMaxMergeDocs()
>>>>>     context_writer.setRAMBufferSizeMB()
>>>>> I set these to smaller values as well, but nothing worked.
>>>>>
>>>>> Any hint will be very helpful.
>>>>>
>>>>> Thanks
>>>>> Ajay
>>>>>
>>>>>
>>>>> Michael McCandless-2 wrote:
>>>>> >
>>>>> > The worst case RAM usage for Lucene is a single doc with many unique terms. Lucene allocates ~60 bytes per unique term (plus space to hold that term's characters = 2 bytes per char). And, Lucene cannot flush within one document -- it must flush after the doc has been fully indexed.
>>>>> >
>>>>> > This past thread (also from Paul) delves into some of the details:
>>>>> >
>>>>> > http://lucene.markmail.org/thread/pbeidtepentm6mdn
>>>>> >
>>>>> > But it's not clear whether that is the issue affecting Ajay -- I think more details about the docs, or some code fragments, could help shed light.
>>>>> >
>>>>> > Mike
>>>>> >
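As a rough, back-of-the-envelope illustration of the figures Mike quotes above (the document size here is hypothetical, purely for scale): a single document containing 1,000,000 unique terms averaging 10 characters each would need on the order of

    1,000,000 terms x (60 bytes + 2 bytes/char x 10 chars) = 80,000,000 bytes ≈ 76 MB

of heap before it could be flushed, since Lucene cannot flush in the middle of a document.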
>>>>> > On Tue, Mar 2, 2010 at 8:47 AM, Murdoch, Paul <paul.b.murd...@saic.com> wrote:
>>>>> >> Ajay,
>>>>> >>
>>>>> >> Here is another thread I started on the same issue.
>>>>> >>
>>>>> >> http://stackoverflow.com/questions/1362460/why-does-lucene-cause-oom-when-indexing-large-files
>>>>> >>
>>>>> >> Paul
>>>>> >>
>>>>> >>
>>>>> >> -----Original Message-----
>>>>> >> From: java-user-return-45254-paul.b.murdoch=saic....@lucene.apache.org On Behalf Of ajay_gupta
>>>>> >> Sent: Tuesday, March 02, 2010 8:28 AM
>>>>> >> To: java-user@lucene.apache.org
>>>>> >> Subject: Lucene Indexing out of memory
>>>>> >>
>>>>> >>
>>>>> >> Hi,
>>>>> >> It might be a general question, but I couldn't find the answer yet. I have around 90k documents totalling around 350 MB. Each document contains a record with some text content. For each word in this text I want to store and index that word's context, so I read each document and, for each word in it, I append a fixed number of surrounding words. To do that I first search the existing index to see whether the word already exists; if it does, I get its content, append the new context, and update the document. If no context exists yet, I create a document with the fields "word" and "context", with the word and its context as the values.
>>>>> >>
>>>>> >> I tried this in RAM, but after a certain number of docs it gave an out-of-memory error, so I switched to FSDirectory -- but surprisingly, after 70k documents it also gave an OOM error. I have enough disk space, yet I still get this error, and I am not sure why disk-based indexing gives it at all. I thought disk-based indexing would be slow but at least scalable.
>>>>> >> Could someone suggest what the issue could be?
>>>>> >>
>>>>> >> Thanks
>>>>> >> Ajay
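Much of the thread above turns on measuring memory between chunks. Here is a minimal sketch of that kind of per-invocation heap logging, assuming only the update_context() method from the posted code; everything else is illustrative:

    // Approximate used heap; gc() is only a hint to the JVM, so the numbers are
    // rough, but steady growth across calls still shows up clearly.
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        rt.gc();
        return rt.totalMemory() - rt.freeMemory();
    }

    // Around each chunk flush:
    long before = usedHeapBytes();
    update_context();                      // method from the posted code
    long after = usedHeapBytes();
    System.out.println("retained after this chunk: "
            + ((after - before) / (1024 * 1024)) + " MB");

If the printed value climbs chunk after chunk, something reachable is accumulating across invocations, which is exactly the symptom described above.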
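Erick's "ever-longer strings" theory above is also cheap to check directly. A small sketch using names from the posted code (context is assumed to be the word-to-StringBuffer map that is rebuilt for every chunk):

    // How much character data is buffered for this chunk, and how large does the
    // biggest per-word context get? If the maximum keeps climbing from one chunk
    // to the next, the per-word context strings are growing without bound.
    long totalChars = 0;
    int maxChars = 0;
    for (StringBuffer sb : context.values()) {
        totalChars += sb.length();
        maxChars = Math.max(maxChars, sb.length());
    }
    System.out.println("chunk buffers: ~" + (2 * totalChars) / (1024 * 1024)
            + " MB of char data, largest word context = " + maxChars + " chars");

Printing this once per call to update_context(), after the existing context has been appended, makes it easy to see whether the stored "Context" field really does keep growing.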