Hello,

The test results confirmed the hypothesis that the autoCommit frequency is behind the TLOG performance degradation. I ran the following tests:

1. TLOG + PULL
34 replicas in total: 28 PULL, 6 TLOG.
Query params: shards.preference=replica.type:PULL,replica.type:TLOG,replica.location:local
Increased autoCommit from 60s to 420s (7 min). autoSoftCommit on the NRT cluster is 5 min, so I tried to get something close to that for the TLOG/PULL fetchers, following the formula x / 2, where x is the autoCommit interval. Set openSearcher to true.
Now we have an autoCommit every 420s and a TLOG/PULL fetch every 210s. The tests show p99 ~170ms and p95 ~45ms on TLOG+PULL vs. p99 ~150ms and p95 ~40ms on NRT (5 min softCommit, 1 min hardCommit).
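For reference, the commit settings for this run looked roughly like this (a sketch adapted from our solrconfig quoted further down the thread; I'm assuming maxDocs stayed at its 10000 default):

    <autoCommit>
      <!-- maxTime raised from 60000 to 420000 ms (7 min) -->
      <maxTime>${solr.autoCommit.maxTime:420000}</maxTime>
      <!-- assumed unchanged from the original config -->
      <maxDocs>${solr.autoCommit.maxDocs:10000}</maxDocs>
      <openSearcher>true</openSearcher>
    </autoCommit>

Queries were routed with the shards.preference parameter, e.g. (collection name hypothetical): /solr/mycollection/select?q=...&shards.preference=replica.type:PULL,replica.type:TLOG,replica.location:local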
2. NRT with a lower autoSoftCommit
30 replicas, all NRT.
Query params: shards.preference=replica.location:local
Left autoCommit at 60s. Lowered autoSoftCommit from 300s to 30s, which is half of the autoCommit interval. Set openSearcher to false.
Now we have an autoCommit every 60s and an autoSoftCommit every 30s. The tests show p99 ~340ms and p95 ~150ms, which is similar to TLOG + PULL with autoCommit every 60s.
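The commit settings for this run, again as a rough sketch against the quoted solrconfig (only the values described above changed):

    <autoCommit>
      <!-- unchanged: 60s -->
      <maxTime>${solr.autoCommit.maxTime:60000}</maxTime>
      <maxDocs>${solr.autoCommit.maxDocs:10000}</maxDocs>
      <!-- searchers are now opened by soft commits only -->
      <openSearcher>false</openSearcher>
    </autoCommit>

    <autoSoftCommit>
      <!-- lowered from 300000 (5 min) to 30000 ms (30s) -->
      <maxTime>${solr.autoSoftCommit.maxTime:30000}</maxTime>
    </autoSoftCommit>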
Before we do the go-live test on TLOG + PULL with the increased autoCommit time, we have a few main questions: is an autoCommit of 420s dangerous if we lose the leader? Will new data be lost if the leader goes down? And what is the best approach to controlling the TLOG+PULL fetch frequency?

Thanks a lot, folks, for all the help on this topic.

Best regards,
Nick Vladiceanu

> On 11. Jun 2021, at 6:38 PM, Houston Putman <houstonput...@gmail.com> wrote:
> 
> So the issue seems to be with the autocommit time.
> 
> The PULL and TLOG followers fetch the index every x seconds. This 'x' is 1/2 of the autocommit time, so when you increased your autocommit, you were actually just increasing the amount of time your TLOG followers and PULL replicas were able to keep their index.
> 
> On your TLOG leader, it is probably hard-committing quite often, since you have the <maxDocs>${solr.autoCommit.maxDocs:10000}</maxDocs> included and you are "constantly indexing".
> 
> I would recommend trying the following, so that we can test TLOG/PULL and NRT with as similar setups as possible:
> 
> Set <openSearcher>false</openSearcher> for <autocommit>, and set the autoSoftCommit maxTime to be exactly half of the autoCommit time.
> 
>> <updateHandler class="solr.DirectUpdateHandler2">
>>   <updateLog>
>>     <str name="dir">${solr.data.dir:}</str>
>>     <int name="tlogDfsReplication">${solr.ulog.tlogDfsReplication:3}</int>
>>   </updateLog>
>> 
>>   <autoCommit>
>>     <maxTime>${solr.autoCommit.maxTime:240000}</maxTime>
>>     <maxDocs>${solr.autoCommit.maxDocs:10000}</maxDocs>
>>     <openSearcher>false</openSearcher>
>>   </autoCommit>
>> 
>>   <autoSoftCommit>
>>     <maxTime>${solr.autoSoftCommit.maxTime:120000}</maxTime>
>>   </autoSoftCommit>
>> </updateHandler>
> 
> That way all of your replicas, NRT, TLOG and PULL, should be opening searchers at the same rate (every 2 minutes).
> You can also adjust these times if 2 minutes is too long of a wait. I just want to see if there is a difference when the searchers are opened at the same interval.
> If there is still a significant difference between the two, then we will know there is something very strange going on with TLOG/PULL.
> 
> Lastly, you might think about removing the <maxDocs> section, but that doesn't mean a whole lot for this test we are trying to run.
> 
> - Houston
> 
> On Fri, Jun 11, 2021 at 12:07 PM Nick Vladiceanu <vladicean...@gmail.com> wrote:
> 
>> Very good point Mike.
>> To avoid this, I scaled the cluster to 34 nodes (which would compensate for the 6 TLOG replicas that aren't going to be used for search), leaving us only 2 nodes short for search queries compared to the NRT cluster. At a lower request rate, the results weren't better either.
>> 
>> TLOG replicas do receive requests (since they're behind the same LB), but they don't perform searches (noticeable from the CPU and memory load, etc.); they only proxy the requests to PULL replicas (in the TLOG + PULL case).
>> 
>> We use c2.2xlarge instances with EBS volumes of 100Gi; on the disk side there is no pressure, the IOPS look OK, and reads/writes are normal. I run all these experiments on an on-demand secondary cluster that is identical to the production one. The request queries I generate while testing are captured from production, so they are identical too.
>> 
>> Thank you
>> 
>> 
>>> On 11. Jun 2021, at 5:54 PM, Mike Drob <md...@apache.org> wrote:
>>> 
>>> When you have 6TLOG+24PULL and you're setting shards.preference=replica.type:PULL,replica.type:TLOG,replica.location:local, I would expect zero queries going to the TLOG replicas, can you confirm that is the case? If so, this might be an issue of 24 nodes trying to keep up with the work that 30 were doing previously. Maybe try an experiment with 1TLOG+29PULL to see if that gets you closer to the old numbers, but I wouldn't recommend running that for a long time in production.
>>> 
>>> Thinking out loud here... maybe there is some difference in how the OS handles the page cache and memory mapping the index files if they come in cold over the network vs being actively written by the Solr process. What kind of storage are you using?
>>> 
>>> On Fri, Jun 11, 2021 at 10:38 AM Nick Vladiceanu <vladicean...@gmail.com> wrote:
>>>> 
>>>> actually not using HDFSDirectory, it's a leftover in the config from some previous tests.
>>>> 
>>>> I don't see anything in the logs related to maxWarmingSearchers, nor do any other errors/warnings show up in the logs. I tried reducing maxWarmingSearchers to 3 and increasing the hard commit maxTime to 2 min; the results improved significantly, from ~350ms p99 to ~210ms p99, which is still higher than the NRT result, but better than it was.
>>>> 
>>>> I also tried with only TLOG replicas, and the results are more or less the same, ~340ms p99 and ~110ms p95. So both are slower, TLOG + PULL and TLOG only.
>>>> 
>>>> 
>>>> 
>>>>> On 11. Jun 2021, at 5:28 PM, Mike Drob <md...@apache.org> wrote:
>>>>> 
>>>>> Are you using HDFSDirectory to serve your indices? I noticed that tlogDfsReplication is set, so that's why I'm asking.
>>>>> 
>>>>> 8 maxWarmingSearchers is very high; typically that value is 2 or maybe 4, but you would know if this was an issue by looking at your logs.
>>>>> 
>>>>> I'm assuming that you had 30 NRT replicas before? If you had fewer, then your tail latencies might be higher because you're seeing cache misses on the queries. Do you have metrics on the response times for TLOG vs PULL? Are they both slower, or just one?
>>>>> 
>>>>> Mike
>>>>> 
>>>>> On 2021/06/11 12:55:31, Nick Vladiceanu <vladicean...@gmail.com> wrote:
>>>>>> Hello,
>>>>>> I'm facing some performance issues when moving from NRT replica types to TLOG + PULL. We're constantly indexing new data and heavily querying (~2k rps).
>>>>>> 
>>>>>> - index size is ~2.5Gi;
>>>>>> - number of docs ~4.6M;
>>>>>> - 2 shards;
>>>>>> - 7 cores and 14Gi of memory;
>>>>>> - 30 instances;
>>>>>> - JVM heap is 12Gi.
>>>>>> 
>>>>>> When running NRT only, the average response time is ~150ms p99 and 40ms p95. When changing to TLOG (6 TLOG replicas) + 24 PULL, the response time grows to ~350ms p99 and 120ms p95.
>>>>>> 
>>>>>> Here are some fragments from our solrconfig:
>>>>>> 
>>>>>> 
>>>>>>> <updateHandler class="solr.DirectUpdateHandler2">
>>>>>>>   <updateLog>
>>>>>>>     <str name="dir">${solr.data.dir:}</str>
>>>>>>>     <int name="tlogDfsReplication">${solr.ulog.tlogDfsReplication:3}</int>
>>>>>>>   </updateLog>
>>>>>>> 
>>>>>>>   <autoCommit>
>>>>>>>     <maxTime>${solr.autoCommit.maxTime:60000}</maxTime>
>>>>>>>     <maxDocs>${solr.autoCommit.maxDocs:10000}</maxDocs>
>>>>>>>     <openSearcher>true</openSearcher>
>>>>>>>   </autoCommit>
>>>>>>> 
>>>>>>>   <autoSoftCommit>
>>>>>>>     <maxTime>${solr.autoSoftCommit.maxTime:300000}</maxTime>
>>>>>>>   </autoSoftCommit>
>>>>>>> </updateHandler>
>>>>>> 
>>>>>>> <query>
>>>>>>>   <maxBooleanClauses>1000</maxBooleanClauses>
>>>>>>>   <filterCache class="solr.CaffeineCache"
>>>>>>>                size="${filterCache.size:32768}"
>>>>>>>                initialSize="${filterCache.initialSize:32768}"
>>>>>>>                autowarmCount="20%"/>
>>>>>>> 
>>>>>>>   <queryResultCache class="solr.CaffeineCache"
>>>>>>>                     size="${queryResultCache.size:32768}"
>>>>>>>                     initialSize="${queryResultCache.initialSize:32768}"
>>>>>>>                     autowarmCount="0%"/>
>>>>>>> 
>>>>>>>   <documentCache class="solr.CaffeineCache"
>>>>>>>                  size="${documentCache.size:150000}"
>>>>>>>                  initialSize="${documentCache.initialSize:150000}"
>>>>>>>                  autowarmCount="0%"/>
>>>>>>> 
>>>>>>>   <enableLazyFieldLoading>true</enableLazyFieldLoading>
>>>>>>>   <useFilterForSortedQuery>true</useFilterForSortedQuery>
>>>>>>> 
>>>>>>>   <queryResultWindowSize>160</queryResultWindowSize>
>>>>>>>   <queryResultMaxDocsCached>300</queryResultMaxDocsCached>
>>>>>>> 
>>>>>>>   <listener event="newSearcher" class="solr.QuerySenderListener">
>>>>>>>   </listener>
>>>>>>>   <listener event="firstSearcher" class="solr.QuerySenderListener">
>>>>>>>   </listener>
>>>>>>> 
>>>>>>>   <useColdSearcher>false</useColdSearcher>
>>>>>>>   <maxWarmingSearchers>8</maxWarmingSearchers>
>>>>>>> </query>
>>>>>> 
>>>>>> One of my assumptions was to reduce maxWarmingSearchers and to increase the autoCommit maxTime, since softCommit isn't available anymore in TLOG replicas. Is that valid?
>>>>>> 
>>>>>> I couldn't find any document with the differences/considerations we need to take into account between NRT and TLOG; could you please help? Thanks a lot in advance. Please let me know if anything else is required.
>>>>>> 
>>>>>> Best regards,
>>>>>> Nick Vladiceanu