I would also suggest looking at contrib/benchmark in the source tree,
which has a nice framework for experimenting with different values of
mergeFactor and maxBufferedDocs. It is quite easy to set it up for a new
collection (i.e., yours) and run experiments that alter these two values.

Below is a sample "algorithm" file that I have been trying out. To
make it work on your collection, you need only implement a DocMaker
for it (you probably already have the code for making Documents; you
just need to wrap it in the DocMaker interface and plug it in via the
doc.maker property).

merge.factor=merge:10:100:1000:5000:10:10:10:10:100:1000:100:100
max.buffered=max.buffered:10:10:10:10:100:1000:10000:21580:21580:21580:1000:10000
compound=true
analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
directory=FSDirectory
#directory=RamDirectory
doc.stored=true
doc.tokenized=true
doc.term.vector=false
doc.add.log.step=1000
docs.dir=reuters-out
#docs.dir=reuters-111
#doc.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleDocMaker
doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker
#query.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleQueryMaker
query.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersQueryMaker
# task at this depth or less would print when they start
task.max.depth.log=2
log.queries=true
# -------------------------------------------------------------------------------------
{ "Rounds"
    ResetSystemErase
    { "Populate-Opt"
        CreateIndex
        { "MAddDocs" AddDoc > : 22000
        Optimize
        CloseIndex
    }
    NewRound
} : 13

RepSumByName
RepSumByPrefRound MAddDocs
RepSumByPrefRound Populate-Opt

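A note on the colon-separated values in merge.factor and max.buffered: as far as I understand the benchmark's Config handling, the first token is a short name used in the reports and each following value applies to one round, with the list wrapping around when there are more rounds than values. The sketch below is not benchmark code; it is a small standalone Java class (RoundValues, parseValues and valueForRound are names made up for this illustration) that prints which pair of settings each of the 13 rounds would use under that wrap-around assumption.

```java
public class RoundValues {

    /** Splits "name:v1:v2:..." and returns the integer values after the name. */
    static int[] parseValues(String property) {
        String[] parts = property.split(":");
        int[] values = new int[parts.length - 1];
        for (int i = 1; i < parts.length; i++) {
            values[i - 1] = Integer.parseInt(parts[i].trim());
        }
        return values;
    }

    /** Value for a zero-based round, ASSUMING the list wraps around (round % length). */
    static int valueForRound(String property, int round) {
        int[] values = parseValues(property);
        return values[round % values.length];
    }

    public static void main(String[] args) {
        // Property values copied from the algorithm file above.
        String mergeFactor = "merge:10:100:1000:5000:10:10:10:10:100:1000:100:100";
        String maxBuffered = "max.buffered:10:10:10:10:100:1000:10000:21580:21580:21580:1000:10000";
        int rounds = 13; // matches "} : 13" in the algorithm

        for (int r = 0; r < rounds; r++) {
            System.out.printf("round %2d  merge.factor=%-5d max.buffered=%d%n",
                    r, valueForRound(mergeFactor, r), valueForRound(maxBuffered, r));
        }
    }
}
```

If the wrap-around assumption holds, the 13th round repeats the settings of the first one, since each list has only 12 values.
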
On Mar 23, 2007, at 2:51 AM, SK R wrote:
Hi,
I've looked at the uses of MergeFactor and MaxBufferedDocs.
If I set MergeFactor = 100 and MaxBufferedDocs = 250, then the first 100
segments will be merged in the RAMDir when 100 docs have arrived. After
the 350th doc is added to the writer, the RAMDir has 2 merged segment
files + 50 separate segment files not merged together, and these are
flushed to the FSDir.
If this is wrong, please correct me.
My doubt is whether we should set MergeFactor and MaxBufferedDocs in a
proportional ratio (i.e., MaxBufferedDocs = n*MergeFactor where n =
1, 2, ...) to reduce indexing time and get better performance, or
whether there is no need to worry about their relation.
Thanks & Regards
RSK
--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org
Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ