Re: Merge performance

Erick Erickson Wed, 18 Apr 2007 09:17:36 -0700

This *may* be relevant; I haven't needed to investigate
it yet...


http://issues.apache.org/jira/browse/LUCENE-845

Also, see the thread titled
"MergeFactor and MaxBufferedDocs value should ...?" for an
interesting discussion of how to optimize indexing, although
I'm not sure the notion of using IndexWriter.ramSizeInBytes
is of much use when merging indexes...

Erick

On 4/18/07, d m <[EMAIL PROTECTED]> wrote:


I'd like to share index merge performance data and have a couple
of questions about it...

We (AXS-One, www.axsone.com) build one "master" index per day.
For backup and recovery purposes, we also build many individual
"mini" indexes from the docs added to the master index.

Should one of our master indexes become unusable (for whatever
reason - and I'm glad to say this has not yet happened), we plan to
reconstruct it by merging its mini indexes.

I've done some merge testing so we have an idea of how long it will
take to reconstruct a master index.

For testing purposes, I have created 1,000 mini indexes. Each:
  - contains 1,000 documents
  - is optimized
  - uses the compound file format

The avg doc size across the 1 million docs is: 10.8 KB

My testing has been to merge the 1,000 mini indexes to an empty
destination index. Destination index settings:

  - mergeFactor: 40
  - minMergeDocs: 10,000
  - maxMergeDocs: Integer.MAX_VALUE
  - use compound file format

Those values were obtained from some empirical (but not exhaustive)
merge testing.

In each test run I merge N mini indexes into a single destination
index. Each merge starts with an empty destination index. N increases
by 25 for each data point.

That is, the test merges 25 minis into an empty index, then 50 minis
into an empty index, and so on, until it merges all 1,000 minis into
an empty index.

Mini indexes merged with V2.1 were indexed with V2.1.
Mini indexes merged with V2.0 were indexed with V2.0.

Hardware:
  - 64-bit
  - 4 CPUs: AMD Opteron 280, 2.41 GHz
  - 12.8 GB RAM
  - 1.63 TB disk space / 6 SCSI drives / RAID ??

Software:
  - Lucene V2.1 and V2.0
  - Java 1.6
  - Windows Server 2003, SP 1

I did tests with:
  - Lucene 2.1 using addIndexesNoOptimize() - 2 identical runs
  - Lucene 2.1 using addIndexes() - 1 run
  - Lucene 2.0 using addIndexes() - 1 run

The recorded merge times (in seconds) include a final call to
optimize() the destination index after returning from
addIndexesNoOptimize() or addIndexes().
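
The test schedule described above can be sketched roughly as follows.
This is only an illustration of the loop structure, not the harness
that was actually used: merge_and_optimize is a hypothetical callable
standing in for the real Lucene calls (create an empty destination
index, add the first n mini indexes, then optimize), and the function
and parameter names are assumptions.

```python
import time


def run_merge_series(merge_and_optimize, total_minis=1000, step=25):
    """Time one merge per data point: 25, 50, ..., total_minis mini indexes.

    merge_and_optimize(n) is a hypothetical callable. It is expected to
    start from an empty destination index, merge the first n mini
    indexes into it, and finish with an optimize of the destination.
    """
    results = []
    for n in range(step, total_minis + 1, step):
        start = time.perf_counter()
        merge_and_optimize(n)  # the timed region includes the final optimize
        elapsed = time.perf_counter() - start
        results.append((n, elapsed))
    return results
```

With the defaults this yields 40 data points, and the merge counts add
up to 20,500 indexes merged per run.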

I've included the test data below. (If you'd like, I can email an
Excel version of the data with a graph.)

A few things caught my attention (seen easily when graphing "Indexes"
vs "Merge Time (secs)"):

1. The runs (3 & 4) using addIndexes() show a relatively smooth
   increase in merge times (as expected).

2. The 2.1 runs (1 & 2) using addIndexesNoOptimize() show spikes in
   merge time at particular merge counts, with the next, larger merge
   counts running faster. The pattern of spikes was identical in both
   runs.

   The most notable spike occurs in the addIndexesNoOptimize() merge
   of 900 indexes, which took 44:26 (mm:ss) in one run and 43:37 in
   the other. In both runs the merges of 925, 950, 975, and 1000
   indexes each took less time than the 900 merge.

3. Overall, using addIndexes() appears to be faster than
   addIndexesNoOptimize().

4. V2.0 addIndexes() performs better than V2.1 addIndexes(). Look at
   the very last row of data below - it is the merge rate (in
   docs/min) for each test run.
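
The spikes in item 2 can also be located without a graph. The sketch
below uses the Run 1 and Run 2 columns from the table at the end of
this message. Treating a "spike" as a data point whose merge time is
higher than that of the next, larger merge count is an assumption made
for this illustration, and the names are made up for it.

```python
# Merge counts for the 40 data points: 25, 50, ..., 1000.
COUNTS = list(range(25, 1001, 25))

# Merge times in seconds, copied from the Run 1 and Run 2 columns.
RUN_1 = [39, 73, 113, 147, 179, 222, 246, 266, 297, 323,
         461, 393, 424, 465, 498, 527, 586, 677, 876, 841,
         920, 1213, 1094, 1250, 1367, 1473, 1600, 1570, 1605, 1852,
         1761, 1829, 2167, 2045, 2169, 2666, 2174, 2390, 2304, 2321]
RUN_2 = [59, 93, 131, 169, 198, 241, 261, 273, 301, 325,
         471, 388, 423, 467, 504, 528, 567, 656, 880, 832,
         914, 1206, 1065, 1207, 1337, 1433, 1575, 1552, 1587, 1808,
         1719, 1821, 2138, 2042, 2207, 2617, 2218, 2391, 2322, 2322]


def spikes(counts, times):
    """Return the merge counts that took longer than the next data point."""
    return [counts[i] for i in range(len(times) - 1)
            if times[i] > times[i + 1]]


print(spikes(COUNTS, RUN_1))
print(spikes(COUNTS, RUN_2))
```

Under that definition both runs report the same merge counts: 275,
475, 550, 675, 750, 825, 900 and 950.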

Can someone explain what might be happening to cause the spikes in 2.1,
not seen in 2.0?

Any thoughts on 2.0 merging faster than 2.1?

Thanks, david.


Run 1: Lucene 2.1 / addIndexesNoOptimize()
Run 2: Repeat Run 1
Run 3: Lucene 2.1 / addIndexes()
Run 4: Lucene 2.0 / addIndexes()
All runs include a final call to optimize()

      Merge Times (seconds)
Num.  Run   Run   Run   Run
Idxs  1     2     3     4
25    39    59    44    38
50    73    93    83    81
75    113   131   128   120
100   147   169   163   154
125   179   198   193   220
150   222   241   227   231
175   246   261   249   239
200   266   273   269   266
225   297   301   288   283
250   323   325   312   308
275   461   471   343   337
300   393   388   376   364
325   424   423   410   401
350   465   467   445   438
375   498   504   475   466
400   527   528   516   503
425   586   567   608   587
450   677   656   703   675
475   876   880   786   750
500   841   832   872   821
525   920   914   937   924
550   1213  1206  1038  995
575   1094  1065  1137  1069
600   1250  1207  1275  1189
625   1367  1337  1385  1315
650   1473  1433  1454  1396
675   1600  1575  1499  1468
700   1570  1552  1563  1516
725   1605  1587  1602  1581
750   1852  1808  1687  1627
775   1761  1719  1732  1668
800   1829  1821  1876  1753
825   2167  2138  1882  1832
850   2045  2042  2057  1887
875   2169  2207  2101  2025
900   2666  2617  2138  2014
925   2174  2218  2206  2057
950   2390  2391  2193  2121
975   2304  2322  2247  2149
1000  2321  2322  2321  2227

20500 43423 43248 41820 40095   <- Totals
      28326 28441 29412 30677   <- Merge rate: Docs per minute
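
The merge-rate row follows from the totals row, given that each mini
index holds 1,000 documents. A minimal sketch of the arithmetic, with
names chosen for this illustration only:

```python
DOCS_PER_MINI = 1000

# 25 + 50 + ... + 1000 indexes merged over the 40 data points of a run.
INDEXES_MERGED = sum(range(25, 1001, 25))

# Total merge time per run in seconds, from the totals row.
TOTAL_SECONDS = {
    "Run 1": 43423,
    "Run 2": 43248,
    "Run 3": 41820,
    "Run 4": 40095,
}


def docs_per_minute(total_seconds, indexes=INDEXES_MERGED,
                    docs_per_index=DOCS_PER_MINI):
    """Documents merged per minute over a whole run."""
    return indexes * docs_per_index * 60 / total_seconds


for run, seconds in TOTAL_SECONDS.items():
    print(run, round(docs_per_minute(seconds)))
```

With the totals above this gives roughly 28326, 28441, 29412 and 30677
docs per minute for Runs 1 to 4.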
