I would also suggest looking at contrib/benchmark in the source tree, which has a nice framework for experimenting with different values of mergeFactor and maxBufferedDocs. It is quite easy to set up for a new collection (i.e., yours) and to run experiments that vary these two values.

Below is a sample "algorithm" file that I have been trying out. To make it work on your collection, you need only implement a DocMaker for it (you probably already have the code for making Documents; you just need to wrap it in the DocMaker interface and plug it in via the doc.maker property).


merge.factor=merge:10:100:1000:5000:10:10:10:10:100:1000:100:100
max.buffered=max.buffered:10:10:10:10:100:1000:10000:21580:21580:21580:1000:10000
compound=true

analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
directory=FSDirectory
#directory=RAMDirectory

doc.stored=true
doc.tokenized=true
doc.term.vector=false
doc.add.log.step=1000

docs.dir=reuters-out
#docs.dir=reuters-111

#doc.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleDocMaker
doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker

#query.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleQueryMaker
query.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersQueryMaker

# tasks at this depth or less print a message when they start
task.max.depth.log=2

log.queries=true
# -------------------------------------------------------------------------------------

{ "Rounds"

    ResetSystemErase

    { "Populate-Opt"
        CreateIndex
        { "MAddDocs" AddDoc > : 22000
        Optimize
        CloseIndex
    }

    NewRound

} : 13

RepSumByName
RepSumByPrefRound MAddDocs
RepSumByPrefRound Populate-Opt
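
To see which (merge.factor, max.buffered) pair each of the 13 rounds uses, the two colon-separated properties above can be expanded by hand or with a small helper. The sketch below is standalone Java and does not call the benchmark package at all; the class name RoundValues and its methods are made up for illustration. It assumes, from my reading of the benchmark Config class, that the first token is a column name and that the value list wraps around when there are more rounds than values. Please check that assumption against your version of the source.

```java
// Hypothetical helper, not part of contrib/benchmark.
class RoundValues {

    // Parses "name:v1:v2:..." into its integer values, dropping the leading column name.
    static int[] parse(String prop) {
        String[] parts = prop.split(":");
        int[] vals = new int[parts.length - 1];
        for (int i = 1; i < parts.length; i++) {
            vals[i - 1] = Integer.parseInt(parts[i].trim());
        }
        return vals;
    }

    // Assumption: a round past the end of the list reuses values from the start.
    static int valueForRound(int[] vals, int round) {
        return vals[round % vals.length];
    }

    public static void main(String[] args) {
        int[] merge = parse("merge:10:100:1000:5000:10:10:10:10:100:1000:100:100");
        int[] buffered = parse("max.buffered:10:10:10:10:100:1000:10000:21580:21580:21580:1000:10000");
        int rounds = 13; // matches "} : 13" in the algorithm file
        for (int r = 0; r < rounds; r++) {
            System.out.printf("round %2d  merge.factor=%5d  max.buffered=%6d%n",
                    r, valueForRound(merge, r), valueForRound(buffered, r));
        }
    }
}
```

Each property lists 12 values while the algorithm runs 13 rounds, so under the wrap-around assumption the last round repeats the first pair.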


On Mar 23, 2007, at 2:51 AM, SK R wrote:

Hi,
   I've looked at the uses of MergeFactor and MaxBufferedDocs.

   If I set MergeFactor = 100 and MaxBufferedDocs = 250, then the first 100
segments will be merged in RAMDir when 100 docs have arrived. After the 350th
doc is added to the writer, RAMDir has 2 merged segment files + 50 separate
segment files that are not merged together, and these are flushed to FSDir.

   If this is wrong, please correct me.

   My question is whether we should set MergeFactor and MaxBufferedDocs in
a proportional ratio (i.e., MaxBufferedDocs = n*MergeFactor where n = 1, 2, ...)
to reduce indexing time and get better performance, or is there no need to
worry about their relation?


Thanks & Regards
RSK

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ


