Erick, w_context and context_str are local to this method and are used only for the current chunk of 2,500 documents, not for the entire 70k. I clear the hashmap after each 2,500-document chunk, and I also print the memory consumed by the hashmap, which stays roughly constant from chunk to chunk. So each invocation of update_context should use a roughly constant amount of memory, but in practice every invocation grows the heap by a few MB, and after about 70k documents the process goes OOM. Something inside update_context -- some operation such as the search, updateDocument or addDocument call -- seems to be allocating memory that is not released after the method returns.
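
To be concrete about what I mean by "printing memory": around each call I log the used heap, roughly like the sketch below (not my exact code -- the logging in my program differs a bit, but the idea is the same):

    // Sketch only: log used heap before and after each chunk, forcing a GC
    // first so the numbers are comparable across invocations.
    Runtime rt = Runtime.getRuntime();
    System.gc();
    long before = rt.totalMemory() - rt.freeMemory();

    update_context();   // process the current 2,500-document chunk

    System.gc();
    long after = rt.totalMemory() - rt.freeMemory();
    System.out.println("update_context grew used heap by ~"
        + ((after - before) / (1024 * 1024)) + " MB, total used ~"
        + (after / (1024 * 1024)) + " MB");

It is this "total used" figure that keeps creeping up by a few MB per chunk, even though the hashmap itself stays about the same size.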
-Ajay


Erick Erickson wrote:
>
> The first place I'd look is how big your strings got. w_context and
> context_str come to mind. My first suspicion is that you're building
> ever-longer strings and around 70K documents your strings are large
> enough to produce OOMs.
>
> FWIW
> Erick
>
> On Wed, Mar 3, 2010 at 1:09 PM, ajay_gupta <ajay...@gmail.com> wrote:
>>
>> Mike,
>> Actually my documents are very small in size. We have csv files where
>> each record represents a document, which is not very large, so I don't
>> think document size is an issue. For each record I tokenize it, and for
>> each token I keep 3 neighbouring tokens in a Hashtable. After every X
>> documents, where X is currently 2500, I create the index with the
>> following code:
>>
>>     // Initialization step, done only once at startup
>>     cram = FSDirectory.open(new File("lucenetemp2"));
>>     context_writer = new IndexWriter(cram, analyzer, true,
>>         IndexWriter.MaxFieldLength.LIMITED);
>>
>>     // Called after each 2500 docs
>>     void update_context() throws IOException
>>     {
>>         context_writer.commit();
>>         context_writer.optimize();
>>
>>         IndexSearcher is = new IndexSearcher(cram);
>>         IndexReader ir = is.getIndexReader();
>>         Iterator<String> it = context.keySet().iterator();
>>
>>         while (it.hasNext())
>>         {
>>             String word = it.next();
>>             // All the context of "word" for the current 2500 docs
>>             StringBuffer w_context = context.get(word);
>>             Term t = new Term("Word", word);
>>             TermQuery tq = new TermQuery(t);
>>             TopScoreDocCollector collector = TopScoreDocCollector.create(1, false);
>>             is.search(tq, collector);
>>             ScoreDoc[] hits = collector.topDocs().scoreDocs;
>>
>>             if (hits.length != 0)
>>             {
>>                 int id = hits[0].doc;
>>                 TermFreqVector tfv = ir.getTermFreqVector(id, "Context");
>>
>>                 // Builds the context string from the TermFreqVector.
>>                 // E.g. if the vector is word1(2), word2(1), word3(2) then
>>                 // context_str = "word1 word1 word2 word3 word3"
>>                 String context_str = getContextString(tfv);
>>
>>                 w_context.append(context_str);
>>                 Document new_doc = new Document();
>>                 new_doc.add(new Field("Word", word,
>>                     Field.Store.YES, Field.Index.NOT_ANALYZED));
>>                 new_doc.add(new Field("Context", w_context.toString(),
>>                     Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
>>                 context_writer.updateDocument(t, new_doc);
>>             } else {
>>                 Document new_doc = new Document();
>>                 new_doc.add(new Field("Word", word,
>>                     Field.Store.YES, Field.Index.NOT_ANALYZED));
>>                 new_doc.add(new Field("Context", w_context.toString(),
>>                     Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
>>                 context_writer.addDocument(new_doc);
>>             }
>>         }
>>         ir.close();
>>         is.close();
>>     }
>>
>> I print memory after each invocation of this method and I observed
>> that after each call of update_context memory increases, and when it
>> reaches around 65-70k documents it goes out of memory, so somewhere
>> memory is growing on each invocation. I thought each invocation would
>> take a constant amount of memory rather than increase cumulatively.
>> After each invocation of update_context I also call System.gc() to
>> release memory, and I tried various other parameters as well, such as
>> context_writer.setMaxBufferedDocs(), context_writer.setMaxMergeDocs()
>> and context_writer.setRAMBufferSizeMB(). I set these parameters to
>> smaller values but nothing worked.
>>
>> Any hint will be very helpful.
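
(getContextString(tfv) is not shown above; per the comment in the code it just expands the term vector back into a string where each term is repeated as many times as its frequency. Roughly like this sketch -- the real code differs in details -- using Lucene's TermFreqVector API:

    private String getContextString(TermFreqVector tfv) {
        // Expand each term into freq repetitions, e.g. word1(2) -> "word1 word1"
        StringBuilder sb = new StringBuilder();
        String[] terms = tfv.getTerms();
        int[] freqs = tfv.getTermFrequencies();
        for (int i = 0; i < terms.length; i++) {
            for (int j = 0; j < freqs[i]; j++) {
                sb.append(terms[i]).append(' ');
            }
        }
        return sb.toString().trim();
    }

)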
>>
>> Thanks,
>> Ajay
>>
>>
>> Michael McCandless-2 wrote:
>> >
>> > The worst case RAM usage for Lucene is a single doc with many unique
>> > terms. Lucene allocates ~60 bytes per unique term (plus space to hold
>> > that term's characters = 2 bytes per char). And, Lucene cannot flush
>> > within one document -- it must flush after the doc has been fully
>> > indexed.
>> >
>> > This past thread (also from Paul) delves into some of the details:
>> >
>> > http://lucene.markmail.org/thread/pbeidtepentm6mdn
>> >
>> > But it's not clear whether that is the issue affecting Ajay -- I think
>> > more details about the docs, or some code fragments, could help shed
>> > light.
>> >
>> > Mike
>> >
>> > On Tue, Mar 2, 2010 at 8:47 AM, Murdoch, Paul <paul.b.murd...@saic.com> wrote:
>> >> Ajay,
>> >>
>> >> Here is another thread I started on the same issue.
>> >>
>> >> http://stackoverflow.com/questions/1362460/why-does-lucene-cause-oom-when-indexing-large-files
>> >>
>> >> Paul
>> >>
>> >> -----Original Message-----
>> >> From: java-user-return-45254-paul.b.murdoch=saic....@lucene.apache.org
>> >> On Behalf Of ajay_gupta
>> >> Sent: Tuesday, March 02, 2010 8:28 AM
>> >> To: java-user@lucene.apache.org
>> >> Subject: Lucene Indexing out of memory
>> >>
>> >> Hi,
>> >> It might be a general question, but I couldn't find the answer yet. I
>> >> have around 90k documents totalling around 350 MB. Each document
>> >> contains a record with some text content. For each word in this text
>> >> I want to store that word's context and index it, so I read each
>> >> document and for each word in it I append a fixed number of
>> >> surrounding words. To do that, I first search the existing index for
>> >> the word; if it already exists, I get its content, append the new
>> >> context and update the document. If no context exists yet, I create a
>> >> document with fields "word" and "context" and add those two fields
>> >> with the word and its context as values.
>> >>
>> >> I tried this in RAM, but after a certain number of docs it gave an
>> >> out-of-memory error, so I switched to the FSDirectory approach, but
>> >> surprisingly after 70k documents it also gave an OOM error. I have
>> >> enough disk space, yet I still get this error, and I am not sure why
>> >> disk-based indexing produces it. I thought disk-based indexing would
>> >> be slow but at least scalable. Could someone suggest what the issue
>> >> could be?
>> >>
>> >> Thanks
>> >> Ajay
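
(A back-of-the-envelope illustration of Mike's estimate above, with made-up numbers purely for scale: at ~60 bytes per unique term plus 2 bytes per character, a single "Context" document that has grown to, say, 500,000 unique terms averaging 10 characters each would need roughly 500,000 × (60 + 20) bytes, i.e. about 40 MB of indexing RAM for that one document alone -- and since Lucene cannot flush in the middle of a document, that memory has to be available all at once.)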