Re: Migration from NRT to TLOG performance issues

Houston Putman Fri, 11 Jun 2021 09:38:43 -0700

So the issue seems to be with the autocommit time.

The PULL replicas and TLOG followers fetch the index from the leader every
x seconds, where x is half of the autoCommit maxTime. So when you increased
your autoCommit time, you were actually just increasing the amount of time
your TLOG followers and PULL replicas kept their current index.
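
To make the arithmetic concrete, here is a minimal Python sketch of that
rule. It is only an illustration: the function name is made up for this
example and is not a Solr API, and it assumes the simple "half of the
autoCommit maxTime" relationship described above. The three inputs are the
autoCommit maxTime values that come up in this thread.

```python
def follower_poll_interval_ms(auto_commit_max_time_ms: int) -> int:
    """Hypothetical helper (not a Solr API): index fetch interval for TLOG
    followers and PULL replicas, assuming it is half of autoCommit maxTime."""
    if auto_commit_max_time_ms <= 0:
        raise ValueError("autoCommit maxTime must be a positive number of milliseconds")
    return auto_commit_max_time_ms // 2


# autoCommit maxTime values mentioned in this thread, in milliseconds.
for max_time_ms in (60_000, 120_000, 240_000):
    poll_ms = follower_poll_interval_ms(max_time_ms)
    print(f"autoCommit maxTime {max_time_ms} ms -> index fetch every {poll_ms} ms")
```

So under this assumption a 1 minute autoCommit means a fetch every 30
seconds, and a 4 minute autoCommit means a fetch every 2 minutes.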


Your TLOG leader is probably hard-committing quite often, since you have
<maxDocs>${solr.autoCommit.maxDocs:10000}</maxDocs> set and you are
"constantly indexing".
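
As a rough model of why maxDocs matters here: a hard commit is triggered by
whichever limit is reached first, maxTime or maxDocs. The Python sketch
below assumes a steady indexing rate. The rates used are hypothetical (the
actual indexing rate was not given in this thread), and the function is an
illustration, not a Solr API.

```python
def effective_hard_commit_interval_s(
    max_time_ms: int, max_docs: int, docs_per_second: float
) -> float:
    """Hypothetical model: seconds between hard commits at a steady indexing
    rate, where the commit fires on whichever limit is reached first."""
    if docs_per_second <= 0:
        raise ValueError("docs_per_second must be positive")
    seconds_until_max_time = max_time_ms / 1000
    seconds_until_max_docs = max_docs / docs_per_second
    return min(seconds_until_max_time, seconds_until_max_docs)


# Defaults from the solrconfig quoted in this thread: maxTime=60000, maxDocs=10000.
# The indexing rates (docs per second) are made-up values for illustration only.
for rate in (10, 500):
    interval_s = effective_hard_commit_interval_s(60_000, 10_000, rate)
    print(f"{rate} docs/s -> hard commit roughly every {interval_s} s")
```

If the indexing rate is high enough, maxDocs rather than maxTime decides how
often the leader hard-commits.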

I would recommend trying the following, so that we can test TLOG/PULL and
NRT with as similar setups as possible:

Set <openSearcher>false</openSearcher> for <autoCommit>, and set the
autoSoftCommit maxTime to be exactly half of the autoCommit maxTime.

>     <updateHandler class="solr.DirectUpdateHandler2">
>         <updateLog>
>             <str name="dir">${solr.data.dir:}</str>
>             <int name="tlogDfsReplication">${solr.ulog.tlogDfsReplication:3}</int>
>         </updateLog>
>
>         <autoCommit>
>             <maxTime>${solr.autoCommit.maxTime:240000}</maxTime>
>             <maxDocs>${solr.autoCommit.maxDocs:10000}</maxDocs>
>             <openSearcher>false</openSearcher>
>         </autoCommit>
>
>         <autoSoftCommit>
>             <maxTime>${solr.autoSoftCommit.maxTime:120000}</maxTime>
>         </autoSoftCommit>
>     </updateHandler>

That way all of your replicas, NRT, TLOG and PULL, should be opening
searchers at the same rate (every 2 minutes).
You can also adjust these times if 2 minutes is too long a wait. I just
want to see whether there is a difference when the searchers are opened at
the same interval.
If there is still a significant difference between the two, then we will
know there is something very strange going on with TLOG/PULL.

Lastly, you might consider removing the <maxDocs> setting, though that
doesn't matter much for the test we are trying to run.

- Houston

On Fri, Jun 11, 2021 at 12:07 PM Nick Vladiceanu <[email protected]>
wrote:

> Very good point, Mike. To avoid this, I’ve scaled the cluster to 34 nodes
> (which should compensate for the 6 TLOG replicas that aren’t going to be
> used for search), so we had only 2 fewer nodes serving search queries than
> the NRT cluster had. At a lower request rate, the results weren’t better
> either.
>
> The TLOG replicas are receiving requests (since they are behind the same
> LB), but they don’t perform searches (as can be seen from the load on CPU,
> memory, etc.); they only proxy the requests to the PULL replicas (in the
> TLOG + PULL case).
>
> We use c2.2xlarge instances with EBS volumes of 100Gi. On the disk side
> there is no pressure: the IOPS look OK, and reads/writes are normal. I’m
> using an on-demand secondary cluster, identical to the production one, to
> run all these experiments. The queries I generate while testing are
> captured from production, and thus identical too.
>
> Thank you
>
>
> > On 11. Jun 2021, at 5:54 PM, Mike Drob <[email protected]> wrote:
> >
> > When you have 6TLOG+24PULL and you're setting
> > shards.preference=replica.type:PULL,replica.type:TLOG,replica.location:local,
> > I would expect zero queries going to the TLOG replicas, can you
> > confirm that is the case? If so, this might be an issue of 24 nodes
> > trying to keep up with the work that 30 were doing previously. Maybe
> > try an experiment with 1TLOG+29PULL to see if that gets you closer to
> > the old numbers, but I wouldn't recommend running that for a long time
> > in production.
> >
> > Thinking out loud here... maybe there is some difference in how the OS
> > handles the page cache and memory mapping the index files if they come
> > in cold over the network vs being actively written by the Solr
> > process. What kind of storage are you using?
> >
> > On Fri, Jun 11, 2021 at 10:38 AM Nick Vladiceanu <[email protected]> wrote:
> >>
> >> Actually, I’m not using HDFSDirectory; it’s a leftover in the config
> >> from some previous tests.
> >>
> >> I don’t see anything in the logs related to maxWarmingSearchers, nor
> >> do any other errors or warnings show up in the logs. I tried reducing
> >> maxWarmingSearchers to 3 and increasing the hard commit maxTime to
> >> 2 minutes, and the results improved significantly, from ~350ms p99 to
> >> ~210ms p99, which is still higher than the NRT result, but better than
> >> it was.
> >>
> >> I also tried with only TLOG replicas, and the results are more or less
> >> the same, ~340ms p99 and ~110ms p95. So both are slower, TLOG + PULL and
> >> TLOG only.
> >>
> >>
> >>
> >>> On 11. Jun 2021, at 5:28 PM, Mike Drob <[email protected]> wrote:
> >>>
> >>> Are you using HDFSDirectory to serve your indices? I noticed that
> >>> tlogDfsReplication is set, so that's why I'm asking.
> >>>
> >>> 8 maxWarmingSearchers is very high, typically that value is 2 or maybe
> >>> 4, but you would know if this was an issue by looking at your logs.
> >>>
> >>> I'm assuming that you had 30 NRT replicas before? If you had fewer,
> >>> then your tail latencies might be higher because you're seeing cache
> >>> misses on the queries. Do you have metrics on the response times for
> >>> TLOG v PULL? Are they both slower, or just one?
> >>>
> >>> Mike
> >>>
> >>> On 2021/06/11 12:55:31, Nick Vladiceanu <[email protected]> wrote:
> >>>> hello,
> >>>> I’m facing some performance issues when moving from NRT replica types
> >>>> to TLOG + PULL. We’re constantly indexing new data and heavily querying
> >>>> (~2k rps).
> >>>>
> >>>> - index size is ~ 2.5Gi;
> >>>> - number of docs ~4.6M;
> >>>> - 2 shards;
> >>>> - 7 cores and 14Gi of memory
> >>>> - 30 instances
> >>>> - JVM Heap is 12Gi
> >>>>
> >>>> When running on NRT only, the response time is on average ~150ms p99
> >>>> and 40ms p95. When changing to TLOG (6 TLOG replicas) + 24 PULL, the
> >>>> response time grows to ~350ms p99 and 120ms p95.
> >>>>
> >>>> Here are some fragments from our solrconfig:
> >>>>
> >>>>
> >>>>>   <updateHandler class="solr.DirectUpdateHandler2">
> >>>>>       <updateLog>
> >>>>>           <str name="dir">${solr.data.dir:}</str>
> >>>>>           <int name="tlogDfsReplication">${solr.ulog.tlogDfsReplication:3}</int>
> >>>>>       </updateLog>
> >>>>>
> >>>>>       <autoCommit>
> >>>>>           <maxTime>${solr.autoCommit.maxTime:60000}</maxTime>
> >>>>>           <maxDocs>${solr.autoCommit.maxDocs:10000}</maxDocs>
> >>>>>           <openSearcher>true</openSearcher>
> >>>>>       </autoCommit>
> >>>>>
> >>>>>       <autoSoftCommit>
> >>>>>           <maxTime>${solr.autoSoftCommit.maxTime:300000}</maxTime>
> >>>>>       </autoSoftCommit>
> >>>>>   </updateHandler>
> >>>>
> >>>>>    <query>
> >>>>>       <maxBooleanClauses>1000</maxBooleanClauses>
> >>>>>       <filterCache class="solr.CaffeineCache"
> >>>>>                    size="${filterCache.size:32768}"
> >>>>>                    initialSize="${filterCache.initialSize:32768}"
> >>>>>                    autowarmCount="20%"/>
> >>>>>
> >>>>>       <queryResultCache class="solr.CaffeineCache"
> >>>>>                         size="${queryResultCache.size:32768}"
> >>>>>                         initialSize="${queryResultCache.initialSize:32768}"
> >>>>>                         autowarmCount="0%"/>
> >>>>>
> >>>>>       <documentCache class="solr.CaffeineCache"
> >>>>>                      size="${documentCache.size:150000}"
> >>>>>                      initialSize="${documentCache.initialSize:150000}"
> >>>>>                      autowarmCount="0%"/>
> >>>>>
> >>>>>       <enableLazyFieldLoading>true</enableLazyFieldLoading>
> >>>>>       <useFilterForSortedQuery>true</useFilterForSortedQuery>
> >>>>>
> >>>>>       <queryResultWindowSize>160</queryResultWindowSize>
> >>>>>       <queryResultMaxDocsCached>300</queryResultMaxDocsCached>
> >>>>>
> >>>>>       <listener event="newSearcher" class="solr.QuerySenderListener">
> >>>>>       </listener>
> >>>>>       <listener event="firstSearcher" class="solr.QuerySenderListener">
> >>>>>       </listener>
> >>>>>
> >>>>>       <useColdSearcher>false</useColdSearcher>
> >>>>>       <maxWarmingSearchers>8</maxWarmingSearchers>
> >>>>>   </query>
> >>>>
> >>>> One of my assumptions was that I should reduce maxWarmingSearchers and
> >>>> increase the autoCommit maxTime, since soft commits aren’t available
> >>>> anymore on TLOG replicas. Is that valid?
> >>>>
> >>>> I couldn’t find any documentation on the differences/considerations we
> >>>> need to take into account between NRT and TLOG; could you please help?
> >>>> Thanks a lot in advance. Please let me know if anything else is
> >>>> required.
> >>>>
> >>>> Best regards,
> >>>> Nick Vladiceanu
> >>
>
>
