So the issue seems to be with the autoCommit time. TLOG followers and PULL replicas fetch the index from the leader every x seconds, where x is half of the autoCommit maxTime. So when you increased your autoCommit maxTime, you were really just increasing how long your TLOG followers and PULL replicas kept their current index before fetching a new one.
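To make the arithmetic concrete, here is a minimal Python sketch. It assumes only the halving rule described above; the function name poll_interval_ms is hypothetical and not part of Solr, and it does not model what Solr does when autoCommit is not configured.

```python
def poll_interval_ms(auto_commit_max_time_ms: int) -> int:
    """Return the assumed follower poll interval: half of autoCommit maxTime.

    Hypothetical helper for illustration only; it is not a Solr API.
    """
    if auto_commit_max_time_ms <= 0:
        raise ValueError("autoCommit maxTime must be a positive number of milliseconds")
    return auto_commit_max_time_ms // 2


# autoCommit maxTime values mentioned in this thread, in milliseconds.
for max_time_ms in (60_000, 120_000, 240_000):
    interval = poll_interval_ms(max_time_ms)
    print(f"autoCommit maxTime={max_time_ms} ms -> poll every {interval} ms")
```

Under that assumption, a 60000 ms autoCommit means the followers poll every 30 seconds, and a 240000 ms autoCommit means they poll every 2 minutes.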
On your TLOG leader, it is probably hard-committing quite often, since you have <maxDocs>${solr.autoCommit.maxDocs:10000}</maxDocs> set and you are "constantly indexing". I would recommend trying the following, so that we can test TLOG/PULL and NRT with setups that are as similar as possible: set <openSearcher>false</openSearcher> in <autoCommit>, and set the autoSoftCommit maxTime to exactly half of the autoCommit maxTime.

> <updateHandler class="solr.DirectUpdateHandler2">
>   <updateLog>
>     <str name="dir">${solr.data.dir:}</str>
>     <int name="tlogDfsReplication">${solr.ulog.tlogDfsReplication:3}</int>
>   </updateLog>
>
>   <autoCommit>
>     <maxTime>${solr.autoCommit.maxTime:240000}</maxTime>
>     <maxDocs>${solr.autoCommit.maxDocs:10000}</maxDocs>
>     <openSearcher>false</openSearcher>
>   </autoCommit>
>
>   <autoSoftCommit>
>     <maxTime>${solr.autoSoftCommit.maxTime:120000}</maxTime>
>   </autoSoftCommit>
> </updateHandler>

That way all of your replicas, NRT, TLOG and PULL, should be opening searchers at the same rate (every 2 minutes). You can also adjust these times if 2 minutes is too long a wait. I just want to see if there is a difference when the searchers are opened at the same interval. If there is still a significant difference between the two, then we will know there is something very strange going on with TLOG/PULL.

Lastly, you might think about removing the <maxDocs> section, but that doesn't matter much for the test we are trying to run.

- Houston

On Fri, Jun 11, 2021 at 12:07 PM Nick Vladiceanu <vladicean...@gmail.com> wrote:

> Very good point Mike. To avoid this, I’ve scaled the cluster to 34 nodes
> (which would compensate the 6TLOG that aren’t going to be used for search),
> and we were only 2 nodes less for search queries than the NRT cluster had.
> At lower request rate, the results weren’t better either.
>
> TLOG replica are receiving requests (since it’s behind the same LB), but
> it doesn’t perform searches (can notice by the load on cpu, memory, etc.),
> only proxies the requests to PULL replicas (in case of TLOG + PULL).
>
> We use c2.2xlarge instances with EBS volumes of 100Gi, on the disk side
> there is no pressure, the IOPS look ok, reads/writes are normal. I’m using
> on-demand secondary cluster where I run all these experiments, which is
> identical to the production one. Request queries I generate while testing
> are captured from the production, thus, identical too.
>
> Thank you
>
>
> > On 11. Jun 2021, at 5:54 PM, Mike Drob <md...@apache.org> wrote:
> >
> > When you have 6TLOG+24PULL and you're setting
> > shards.preference=replica.type:PULL,replica.type:TLOG,replica.location:local,
> > I would expect zero queries going to the TLOG replicas, can you
> > confirm that is the case? If so, this might be an issue of 24 nodes
> > trying to keep up with the work that 30 were doing previously. Maybe
> > try an experiment with 1TLOG+29PULL to see if that gets you closer to
> > the old numbers, but I wouldn't recommend running that for a long time
> > in production.
> >
> > Thinking out loud here... maybe there is some difference in how the OS
> > handles the page cache and memory mapping the index files if they come
> > in cold over the network vs being actively written by the Solr
> > process. What kind of storage are you using?
> >
> > On Fri, Jun 11, 2021 at 10:38 AM Nick Vladiceanu <vladicean...@gmail.com> wrote:
> >>
> >> actually not using HDFSDirectory, it’s a leftover in the config from some previous tests.
> >>
> >> I don’t see anything in the logs related to maxWarmingSearchers, nor other errors/warnings show in the logs. I tried to reduce maxWarmingSearchers to 3 and increased the Hard commit maxTime to 2mins, the results improved significantly, from ~350ms p99 to ~210ms p99, which is still higher than NRT result, but better than it was.
> >>
> >> I also tried with only TLOG replicas, and the results are more or less the same, ~340ms p99 and ~110ms p95. So, both are slower, TLOG + PULL and TLOG only.
> >>
> >>
> >>
> >>> On 11. Jun 2021, at 5:28 PM, Mike Drob <md...@apache.org> wrote:
> >>>
> >>> Are you using HDFSDirectory to serve your indices? I noticed that tlogDfsReplication is set, so that's why I'm asking.
> >>>
> >>> 8 maxWarmingSearchers is very high, typically that value is 2 or maybe 4, but you would know if this was an issue by looking at your logs.
> >>>
> >>> I'm assuming that you had 30 NRT replicas before? If you had fewer, then your tail latencies might be higher because you're seeing cache misses on the queries. Do you have metrics on the response times for TLOG v PULL? Are they both slower, or just one?
> >>>
> >>> Mike
> >>>
> >>> On 2021/06/11 12:55:31, Nick Vladiceanu <vladicean...@gmail.com> wrote:
> >>>> hello,
> >>>> I’m facing some performance issues when moving from NRT replica types to TLOG + PULL. We’re constantly indexing new data and heavily querying (~2k rps).
> >>>>
> >>>> - index size is ~ 2.5Gi;
> >>>> - number of docs ~4.6M;
> >>>> - 2 shards;
> >>>> - 7 cores and 14Gi of memory
> >>>> - 30 instances
> >>>> - JVM Heap is 12Gi
> >>>>
> >>>> When running on NRT only, the response time in avg is ~150ms p99 and 40ms p95. When changing to TLOG (6 tlog replicas) + 24 PULL, the response time grows to ~350ms p99 and 120ms p95.
> >>>>
> >>>> Here are some fragments from our solrconfig:
> >>>>
> >>>>
> >>>>> <updateHandler class="solr.DirectUpdateHandler2">
> >>>>>   <updateLog>
> >>>>>     <str name="dir">${solr.data.dir:}</str>
> >>>>>     <int name="tlogDfsReplication">${solr.ulog.tlogDfsReplication:3}</int>
> >>>>>   </updateLog>
> >>>>>
> >>>>>   <autoCommit>
> >>>>>     <maxTime>${solr.autoCommit.maxTime:60000}</maxTime>
> >>>>>     <maxDocs>${solr.autoCommit.maxDocs:10000}</maxDocs>
> >>>>>     <openSearcher>true</openSearcher>
> >>>>>   </autoCommit>
> >>>>>
> >>>>>   <autoSoftCommit>
> >>>>>     <maxTime>${solr.autoSoftCommit.maxTime:300000}</maxTime>
> >>>>>   </autoSoftCommit>
> >>>>> </updateHandler>
> >>>>
> >>>>> <query>
> >>>>>   <maxBooleanClauses>1000</maxBooleanClauses>
> >>>>>   <filterCache class="solr.CaffeineCache"
> >>>>>                size="${filterCache.size:32768}"
> >>>>>                initialSize="${filterCache.initialSize:32768}"
> >>>>>                autowarmCount="20%"/>
> >>>>>
> >>>>>   <queryResultCache class="solr.CaffeineCache"
> >>>>>                     size="${queryResultCache.size:32768}"
> >>>>>                     initialSize="${queryResultCache.initialSize:32768}"
> >>>>>                     autowarmCount="0%"/>
> >>>>>
> >>>>>   <documentCache class="solr.CaffeineCache"
> >>>>>                  size="${documentCache.size:150000}"
> >>>>>                  initialSize="${documentCache.initialSize:150000}"
> >>>>>                  autowarmCount="0%"/>
> >>>>>
> >>>>>   <enableLazyFieldLoading>true</enableLazyFieldLoading>
> >>>>>   <useFilterForSortedQuery>true</useFilterForSortedQuery>
> >>>>>
> >>>>>   <queryResultWindowSize>160</queryResultWindowSize>
> >>>>>   <queryResultMaxDocsCached>300</queryResultMaxDocsCached>
> >>>>>
> >>>>>   <listener event="newSearcher" class="solr.QuerySenderListener">
> >>>>>   </listener>
> >>>>>   <listener event="firstSearcher" class="solr.QuerySenderListener">
> >>>>>   </listener>
> >>>>>
> >>>>>   <useColdSearcher>false</useColdSearcher>
> >>>>>   <maxWarmingSearchers>8</maxWarmingSearchers>
> >>>>> </query>
> >>>>
> >>>> One of my assumption was to reduce the maxWarmingSearchers and to increase the autoCommit maxTime, since the softCommit isn’t available anymore in TLOG replicas. Is that valid?
> >>>>
> >>>> I couldn’t find any documents with the differences/considerations we need to take into account between NRT and TLOG, could you please help? Thanks a lot in advance. Please let me know if there is anything else required.
> >>>>
> >>>> Best regards,
> >>>> Nick Vladiceanu
> >>
> >