On 11/3/2021 1:44 PM, Michael Conrad wrote:
> Is there a way to set max segment count for builtin merge policy?
> I'm having a serious issue where I'm trying to reindex 75 million
> documents and the segment count skyrockets with associated significant
> drop in performance. To the point we start getting lots of timeouts.
> Is there a way to set the merge policy to try and keep the total segment
> count to around 16 or so? (This seems to be close to the max the hosts
> can manage without having serious performance issues.)
> Solr 7.7.3.
The way to reduce the total segment count is to lower the thresholds
that used to be controlled by mergeFactor, which are now maxMergeAtOnce
and segmentsPerTier. I don't know of a way to explicitly set the max
total count, but the per-tier count will affect the total count. There
will be at least three tiers of merging on most Solr installs, so the
max total segment count will be at least three times the per-tier
setting.
This config represents the defaults for Solr's merging policy:
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <int name="maxMergeAtOnce">10</int>
  <int name="segmentsPerTier">10</int>
</mergePolicyFactory>
On some Solr servers that I used to manage, those numbers were set to
35. I regularly saw total segment counts larger than 100. That did not
affect performance in a significant way.
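As a rough sketch of going the other direction, lowering both values to
5 would, by the three-tier estimate above, put the ceiling somewhere
around 15 segments or a bit more. The value 5 is only an illustration,
not something tested against a 75 million document index, and lower
values mean more merge work while indexing:
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <!-- illustrative values, not a tested recommendation -->
  <int name="maxMergeAtOnce">5</int>
  <int name="segmentsPerTier">5</int>
</mergePolicyFactory>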
If you are seeing significant performance problems it is more likely one
of two problems that have nothing to do with the segment count:
1) Your max heap size is not quite big enough and needs to be increased.
A heap that is too small can lead to severe GC pauses, with Java
spending more time doing GC than running the application.
2) Your index is so big that the amount of free memory on the server
cannot effectively cache it. The fix for that is to add physical
memory, so that more unallocated memory is available to the operating
system. Solr is absolutely reliant on effective index caching for
performance.
More of a side note: One problem that you might be having with indexing
millions of documents is that the indexing thread can get paused when
merging becomes heavy. This will be even more likely to happen if you
reduce the numbers in the config that I included above. The fix for
that is to fiddle with the mergeScheduler config.
<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
  <int name="maxMergeCount">6</int>
  <int name="maxThreadCount">1</int>
</mergeScheduler>
Some notes: Go with a maxMergeCount that's at least 6. If your indexes
are on spinning hard disks, leave maxThreadCount at 1. If the indexes
are on SSD, you can increase the thread count, but don't go too wild.
Probably 3 or 4 max, and I would be more likely to choose 2. I have
never had indexes on SSD, so I do not know how many threads are too many.
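Following those notes, a variant for SSD-backed indexes might look like
the sketch below. The thread count of 2 is the cautious choice mentioned
above and is an untested starting point, not a measured optimum:
<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
  <int name="maxMergeCount">6</int>
  <!-- SSD only; leave this at 1 on spinning disks -->
  <int name="maxThreadCount">2</int>
</mergeScheduler>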
Thanks,
Shawn