Re: MergeFactor and MaxBufferedDocs value should ...?

Grant Ingersoll Sat, 24 Mar 2007 03:43:37 -0800

I would also suggest that contrib/benchmark in the source has a niceframework for experimenting with different factors for mergeFactorand maxBufferedDocs. It is quite easy to set it up for a newcollection (i.e. yours) and run experiments that alter these two values.

Below is a sample "algorithm" file that I have been trying out. Tomake it work on yours, you need only implement a DocMaker that worksfor your collection (you probably already have the stuff for makingDocuments, you just need to implement it in the DocMaker interfaceand plug it in)



merge.factor=merge:10:100:1000:5000:10:10:10:10:100:1000:100:100

max.buffered=max.buffered:10:10:10:10:100:1000:10000:21580:21580:21580:1000:10000

compound=true

analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
directory=FSDirectory
#directory=RamDirectory

doc.stored=true
doc.tokenized=true
doc.term.vector=false
doc.add.log.step=1000

docs.dir=reuters-out
#docs.dir=reuters-111

#doc.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleDocMaker
doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker

#query.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleQueryMaker
query.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersQueryMaker

# task at this depth or less would print when they start
task.max.depth.log=2

log.queries=true

#-------------------------------------------------------------------------------------


{ "Rounds"

    ResetSystemErase

    { "Populate-Opt"
        CreateIndex
        { "MAddDocs" AddDoc > : 22000
        Optimize
        CloseIndex
    }

    NewRound

} : 13

RepSumByName
RepSumByPrefRound MAddDocs
RepSumByPrefRound Populate-Opt


On Mar 23, 2007, at 2:51 AM, SK R wrote:

Hi,
   I've looked the uses of MergeFactor and MaxBufferedDocs.

   If I set MergeFactor = 100 and MaxBufferedDocs=250 , then first 100
segments will be merged in RAMDir when 100 docs arrived. At the endof 350th
doc added to writer , RAMDir have 2 merged segment files + 50 seperate
segment files not merged together and these are flushed to FSDir.

   If wrong, please correct me.

   My doubt is whether we should set MergeFactor & MaxBufferedDocs in
proportional ratio (i.e) MaxBufferedDocs = n*MergeFactor where n =1,2 ...to reduce indexing time and get greater performance or no need toworry
about it's relation?


Thanks & Regards
RSK


--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ




---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: MergeFactor and MaxBufferedDocs value should ...?

Reply via email to