"Read" means "re-add", the spell checker in my mail program :-)
-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Uwe Schindler [mailto:u...@thetaphi.de]
> Sent: Thursday, April 15, 2010 2:13 PM
> To: java-user@lucene.apache.org
> Subject: RE: NumericField indexing performance
>
> Hi Tomislav,
>
> when reading your mail it's not 100% clear what you did wrong, but I
> think the following occurred (so it's no GC problem):
>
> You reused the Document and NumericField instance in your original
> approach. But on each document you called doc.add(nf) again. By that,
> for each document you added the field one more time to the document, and
> after, say, a thousand docs you have the numeric field in there 1000
> times and the indexer therefore indexes it 1000 times. After 2000 docs
> it's there 2000 times, so the total indexing time grows quadratically.
>
> So when you reuse doc instances you have to do either:
> - Don’t modify the fields at all (and also add no more fields) and just
>   set field values and add doc to writer
> - Clear the document and read fields
>
> But don’t read fields without clearing! :-) That was your mistake.
>
> Uwe
>
> > -----Original Message-----
> > From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
> > Sent: Thursday, April 15, 2010 2:00 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: NumericField indexing performance
> >
> > Hi,
> >
> > I actually don't follow your change, because after the "but changing
> > it to" line the only different thing I see is the doc.add(dateField)
> > call, which you didn't list before "but changing it to".
> >
> > Also, if I understood Uwe correctly, he was suggesting reusing
> > NumericField instances, which means "new NumericField("date")" should
> > exist and be called only *once* in your code. The same for
> > Document instances.
> > GC threads will thank you and Uwe for this change.
> >
> > Otis
> > ----
> > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > Hadoop ecosystem search :: http://search-hadoop.com/
> >
> > ----- Original Message ----
> > > From: Tomislav Poljak <tpol...@gmail.com>
> > > To: java-user@lucene.apache.org
> > > Sent: Thu, April 15, 2010 7:41:02 AM
> > > Subject: RE: NumericField indexing performance
> > >
> > > Hi Uwe,
> > > thank you very much for your answers. I've done Document and
> > > NumericField reuse like this:
> > >
> > > Document doc = getDocument();
> > > NumericField dateField = new NumericField("date");
> > >
> > > for each doc:
> > >
> > > doc.add(dateField.setLongValue(Long.parseLong(DateTools.dateToString(date, DateTools.Resolution.MINUTE))));
> > >
> > > but changing it to:
> > >
> > > Document doc = getDocument();
> > > NumericField dateField = new NumericField("date");
> > > doc.add(dateField);
> > >
> > > for each doc:
> > >
> > > dateField.setLongValue(Long.parseLong(DateTools.dateToString(date, DateTools.Resolution.MINUTE)));
> > >
> > > did the trick. Now indexing with NumericField takes minutes, not
> > > hours.
> > >
> > > Thanks again,
> > >
> > > Tomislav
> > >
> > > On Wed, 2010-04-14 at 23:38 +0200, Uwe Schindler wrote:
> > > One addition:
> > > If you are indexing millions of numeric fields, you should also try
> > > to reuse NumericField and Document instances (as described in
> > > JavaDocs). NumericField internally creates a NumericTokenStream and
> > > lots of small objects (attributes), so GC cost may be high. This is
> > > just another idea.
> > >
> > > Uwe
> > >
> > > > -----Original Message-----
> > > > From: Uwe Schindler [mailto:u...@thetaphi.de]
> > > > Sent: Wednesday, April 14, 2010 11:28 PM
> > > > To: java-user@lucene.apache.org
> > > > Subject: RE: NumericField indexing performance
> > > >
> > > > Hi Tomislav,
> > > >
> > > > indexing with NumericField takes longer (at least for the default
> > > > precision step of 4, which means a 32-bit integer is split into 8
> > > > subterms of 4 bits each). So you produce 8 times more terms
> > > > during indexing that must be handled by the indexer. If you have lots
> > > > of documents with distinct values, the term index gets larger and
> > > > larger, but search performance increases dramatically (for
> > > > NumericRangeQueries). So if you index *only* numeric fields and
> > > > nothing else, 8 times slower indexing can be true.
> > > >
> > > > If you are not using NumericRangeQuery or you want to tune indexing
> > > > performance, try larger precision steps like 6 or 8. If you don’t use
> > > > NumericRangeQuery and only want to index the numeric terms as *one*
> > > > term, use precStep=Integer.MAX_VALUE. Also check your memory
> > > > requirements, as the indexer may need more memory and GC may cost too
> > > > much. Also the index size will increase, so a lot more I/O is done.
> > > > Without more details I cannot say anything about your configuration.
> > > > So please tell us: how many documents, how many fields, and how many
> > > > numeric fields in which configuration do you use?
> > > >
> > > > Uwe
> > > >
> > > > > -----Original Message-----
> > > > > From: Tomislav Poljak [mailto:tpol...@gmail.com]
> > > > > Sent: Wednesday, April 14, 2010 8:13 PM
> > > > > To: java-user@lucene.apache.org
> > > > > Subject: NumericField indexing performance
> > > > >
> > > > > Hi,
> > > > > is it normal for indexing time to increase up to 10 times after
> > > > > introducing NumericField instead of Field (for two fields)?
> > > > >
> > > > > I've changed two date fields from String representation (Field) to
> > > > > NumericField, now it is:
> > > > >
> > > > > doc.add(new NumericField("time").setIntValue(date.getTime()/24/3600))
> > > > >
> > > > > and after this change indexing took 10x more time (before it was a
> > > > > few minutes and after more than an hour and a half). I've tested
> > > > > with a simple counter like this:
> > > > >
> > > > > doc.add(new NumericField("endTime").setIntValue(count++))
> > > > >
> > > > > but nothing changed, it still takes around 10x longer. If I comment
> > > > > out adding one numeric field to the index, time drops significantly,
> > > > > and if I comment out both fields, indexing takes only a few minutes
> > > > > again.
> > > > >
> > > > > Tomislav

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org