This is tricky, particularly on cloud storage (network storage rather than
local disk).

How big are your indexes?

You can adjust the merge policy settings so you end up with fewer, larger
segments. But that is usually done to improve query-time performance, not
indexing performance.
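As a sketch, a solrconfig.xml merge-policy override might look like the
following (element names match Solr's TieredMergePolicyFactory; the values
are illustrative, not recommendations — the Solr default for both is 10,
and lower per-tier thresholds merge sooner, leaving fewer, larger segments):

<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <int name="maxMergeAtOnce">5</int>
  <int name="segmentsPerTier">5</int>
</mergePolicyFactory>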

If the disks are truly the bottleneck, adding more RAM for the filesystem
cache and increasing ramBufferSizeMB should help.
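As a solrconfig.xml sketch (the default ramBufferSizeMB is 100; 512 here is
just an illustrative bump, sized against your available heap):

<!-- Buffer more documents in RAM before flushing a segment to disk;
     a larger buffer means fewer, larger initial segments and less I/O. -->
<ramBufferSizeMB>512</ramBufferSizeMB>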

But indexing performance can be improved in lots of ways:

1) Make sure you are batching updates; 200 docs per batch is often a good
size.
2) Make sure you are not over-committing; rely on auto-commits.
3) Make sure you are not under-committing; commit and open new searchers
about every minute.
4) Use more than one indexing client; usually 6-7 clients will saturate a
cluster.
5) Use CloudSolrClient for indexing, which does client-side document routing.
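As a sketch of point 1, client-side batching can be as simple as chunking
the document list. The CloudSolrClient call in the comment is an assumption
about how each batch would be sent (SolrJ's client.add(collection, batch));
the runnable part below is only the batching logic:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchingSketch {

    // Split a document list into fixed-size batches; 200 docs per batch
    // is often a reasonable starting point.
    static <T> List<List<T>> toBatches(List<T> docs, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < docs.size(); i += batchSize) {
            batches.add(docs.subList(i, Math.min(i + batchSize, docs.size())));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> docs = new ArrayList<>();
        for (int i = 0; i < 1000; i++) docs.add(i);

        // In a real indexer you would loop over these batches and call
        // cloudSolrClient.add(collection, batch), letting autoCommit
        // handle commits rather than committing per batch (assumption).
        List<List<Integer>> batches = toBatches(docs, 200);
        System.out.println(batches.size());        // 5 batches
        System.out.println(batches.get(4).size()); // last batch holds 200 docs
    }
}
```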

Harder changes:

1) Reduce the size of the index on disk by storing fewer fields.
2) Use stopwords to decrease the size of the positions stored in the index.
3) Use more shards.
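For point 2, a hedged schema.xml sketch of a field type that drops
stopwords at index time (the field-type name and the stopwords.txt path
are illustrative):

<fieldType name="text_stopped" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Dropping high-frequency terms shrinks the postings and
         positions data written to each segment. -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>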

Joel Bernstein
http://joelsolr.blogspot.com/


On Thu, Nov 4, 2021 at 8:35 AM Michael Conrad <mich...@newsrx.com> wrote:

> Would either of these two settings possibly help, at least during full
> reindexes?
>
> <!-- ramBufferSizeMB sets the amount of RAM that may be used by Lucene
>           indexing for buffering added documents and deletions before
> they are
>           flushed to the Directory.
>           maxBufferedDocs sets a limit on the number of documents buffered
>           before flushing.
>           If both ramBufferSizeMB and maxBufferedDocs is set, then
>           Lucene will flush based on whichever limit is hit first. -->
>      <!-- <ramBufferSizeMB>100</ramBufferSizeMB> -->
>      <!-- <maxBufferedDocs>1000</maxBufferedDocs> -->
>
>
> On 11/4/21 08:27, Michael Conrad wrote:
> > When segment count is around 16 to 20, performance seems OK.
> >
> > Here is our config:
> >
> > Solr 7.7.3.
> >
> > java -version
> > openjdk version "11.0.11" 2021-04-20
> >
> > Our current config is 4 solr nodes with 5 collections. Each collection
> > is split into two shards. Each collection is replicated.
> >
> > They are Amazon VMs. 8GB Ram each, 2 CPU cores each. Index storage is
> > on NVME backed volumes.
> >
> > When there is a high segment count processes on the systems start
> > showing very high "i/o wait" with associated idle CPU time according
> > to top. Which indicates to me that CPU core count isn't a main culprit.
> >
> > SOLR_JAVA_MEM="-Xms1g -Xmx5g"
> > GC_TUNE=""
> > SOLR_LOG_LEVEL="WARN"
> >
> > vm.swappiness = 0
> >
> > swap in use = 0
> >
> > free -h
> >               total        used        free      shared buff/cache
> > available
> > Mem:           7.7G        6.3G        130M         80M 1.3G        1.0G
> > Swap:           15G          0B         15G
> >
> >
> > On 11/3/21 23:19, Shawn Heisey wrote:
> >> On 11/3/2021 1:44 PM, Michael Conrad wrote:
> >>> Is there a way to set max segment count for builtin merge policy?
> >>>
> >>> I'm having a serious issue where I'm trying to reindex 75 million
> >>> documents and the segment count skyrockets with associated
> >>> significant drop in performance. To the point we start getting lots
> >>> of timeouts.
> >>>
> >>> Is there a way to set the merge policy to try and keep the total
> >>> segment count to around 16 or so? (This seems to be close to the max
> >>> the hosts can manage without having serious performance issues.)
> >>>
> >>> Solr 7.7.3.
> >>
> >> The way to reduce total segment count is to reduce the thresholds
> >> that used to be controlled by mergeFactor.  I don't know of a way to
> >> explicitly set the max total count, but the per-tier count will
> >> affect the total count.  There will be at least three tiers of
> >> merging on most Solr installs, so the max total segment count will be
> >> at least three times the per-tier setting.
> >>
> >> This config represents the defaults for Solr's merging policy:
> >>
> >> <mergePolicyFactory
> >> class="org.apache.solr.index.TieredMergePolicyFactory">
> >>   <int name="maxMergeAtOnce">10</int>
> >>   <int name="segmentsPerTier">10</int>
> >> </mergePolicyFactory>
> >>
> >> On some Solr servers that I used to manage, those numbers were set to
> >> 35.  I regularly saw total segment counts larger than 100. That did
> >> not affect performance in a significant way.
> >>
> >> If you are seeing significant performance problems it is more likely
> >> one of two problems that have nothing to do with the segment count:
> >>
> >> 1) Your max heap size is not quite big enough and needs to be
> >> increased.  This can lead to severe GC pauses because Java will spend
> >> more time doing GC than running the application.
> >>
> >> 2) Your index is so big that the amount of free memory on the server
> >> cannot effectively cache it.  The fix for that is to add physical
> >> memory, so that more unallocated memory is available to the operating
> >> system.  Solr is absolutely reliant on effective index caching for
> >> performance.
> >>
> >> More of a side note:  One problem that you might be having with
> >> indexing millions of documents is that the indexing thread can get
> >> paused when merging becomes heavy.  This will be even more likely to
> >> happen if you reduce the numbers in the config that I included
> >> above.  The fix for that is to fiddle with the mergeScheduler config.
> >>
> >> <mergeScheduler
> >> class="org.apache.lucene.index.ConcurrentMergeScheduler">
> >>   <int name="maxMergeCount">6</int>
> >>   <int name="maxThreadCount">1</int>
> >> </mergeScheduler>
> >>
> >> Some notes:  Go with a maxMergeCount that's at least 6.  If your
> >> indexes are on spinning hard disks, leave maxThreadCount at 1. If the
> >> indexes are on SSD, you can increase the thread count, but don't go
> >> too wild. Probably 3 or 4 max, and I would be more likely to choose
> >> 2.  I have never had indexes on SSD, so I do not know how many
> >> threads are too many.
> >>
> >> Thanks,
> >> Shawn
> >
>
