Very good point, Mike. To rule this out, I scaled the cluster to 34 nodes 
(compensating for the 6 TLOG replicas that won’t serve search traffic), which 
left us with 28 search-serving nodes, only 2 fewer than the 30 the NRT cluster 
had. The results weren’t any better at a lower request rate either. 

The TLOG replicas do receive requests (since they sit behind the same LB), but 
they don’t execute searches themselves (noticeable from their CPU and memory 
load); they only proxy the requests on to the PULL replicas (in the TLOG + 
PULL setup). 
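
For completeness, the per-core query handler metrics are another way to 
compare what the TLOG and PULL nodes are actually doing; something along 
these lines (the host name is a placeholder):

    curl 'http://tlog-node:8983/solr/admin/metrics?group=core&prefix=QUERY./select'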

We use c2.2xlarge instances with 100Gi EBS volumes; on the disk side there is 
no pressure, IOPS look OK, and reads/writes are normal. I run all these 
experiments on an on-demand secondary cluster that is identical to the 
production one. The queries I replay during testing are captured from 
production, so they are identical too. 
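
In case it matters, the replay is essentially just looping the captured query 
strings against the secondary cluster’s LB, roughly along these lines (host, 
collection, and file name are placeholders):

    while read -r q; do
      curl -s "http://secondary-lb:8983/solr/products/select?${q}" > /dev/null
    done < captured-queries.txt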

Thank you


> On 11. Jun 2021, at 5:54 PM, Mike Drob <md...@apache.org> wrote:
> 
> When you have 6 TLOG + 24 PULL and you're setting
> shards.preference=replica.type:PULL,replica.type:TLOG,replica.location:local,
> I would expect zero queries to go to the TLOG replicas; can you
> confirm that is the case? If so, this might be an issue of 24 nodes
> trying to keep up with the work that 30 were doing previously. Maybe
> try an experiment with 1 TLOG + 29 PULL to see if that gets you closer to
> the old numbers, but I wouldn't recommend running that for a long time
> in production.
> 
> Thinking out loud here... maybe there is some difference in how the OS
> handles the page cache and memory mapping the index files if they come
> in cold over the network vs being actively written by the Solr
> process. What kind of storage are you using?
> 
> On Fri, Jun 11, 2021 at 10:38 AM Nick Vladiceanu <vladicean...@gmail.com> 
> wrote:
>> 
>> We’re actually not using HDFSDirectory; it’s a leftover in the config from 
>> some previous tests. 
>> 
>> I don’t see anything in the logs related to maxWarmingSearchers, nor any 
>> other errors/warnings. I reduced maxWarmingSearchers to 3 and increased the 
>> hard commit maxTime to 2 minutes; the results improved significantly, from 
>> ~350ms p99 to ~210ms p99. That’s still higher than the NRT result, but 
>> better than it was. 
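>> 
>> For reference, the change amounts to roughly this in solrconfig.xml (the 
>> full config is quoted further down; 120000 ms = 2 minutes, everything else 
>> unchanged): 
>> 
>>    <autoCommit> 
>>        <maxTime>${solr.autoCommit.maxTime:120000}</maxTime> 
>>        <maxDocs>${solr.autoCommit.maxDocs:10000}</maxDocs> 
>>        <openSearcher>true</openSearcher> 
>>    </autoCommit> 
>>    ... 
>>    <maxWarmingSearchers>3</maxWarmingSearchers> 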
>> 
>> I also tried with TLOG replicas only, and the results are more or less the 
>> same: ~340ms p99 and ~110ms p95. So both setups, TLOG + PULL and TLOG-only, 
>> are slower. 
>> 
>> 
>> 
>>> On 11. Jun 2021, at 5:28 PM, Mike Drob <md...@apache.org> wrote:
>>> 
>>> Are you using HDFSDirectory to serve your indices? I noticed that 
>>> tlogDfsReplication is set, so that's why I'm asking.
>>> 
>>> 8 maxWarmingSearchers is very high; typically that value is 2 or maybe 4. 
>>> You would know if this was an issue by looking at your logs. 
>>> 
>>> I'm assuming that you had 30 NRT replicas before? If you had fewer, then 
>>> your tail latencies might be higher because you're seeing cache misses on 
>>> the queries. Do you have metrics on the response times for TLOG vs PULL? Are 
>>> they both slower, or just one?
>>> 
>>> Mike
>>> 
>>> On 2021/06/11 12:55:31, Nick Vladiceanu <vladicean...@gmail.com> wrote:
>>>> Hello, 
>>>> I’m facing some performance issues moving from NRT replica types to 
>>>> TLOG + PULL. We’re constantly indexing new data and querying heavily (~2k 
>>>> rps). 
>>>> 
>>>> - index size: ~2.5Gi; 
>>>> - number of docs: ~4.6M; 
>>>> - 2 shards; 
>>>> - 7 cores and 14Gi of memory; 
>>>> - 30 instances; 
>>>> - JVM heap is 12Gi. 
>>>> 
>>>> Running on NRT only, the response time is ~150ms p99 and ~40ms p95. After 
>>>> changing to TLOG (6 TLOG replicas) + 24 PULL, the response time grows to 
>>>> ~350ms p99 and ~120ms p95. 
>>>> 
>>>> Here are some fragments from our solrconfig:
>>>> 
>>>> 
>>>>>   <updateHandler class="solr.DirectUpdateHandler2">
>>>>>       <updateLog>
>>>>>           <str name="dir">${solr.data.dir:}</str>
>>>>>           <int name="tlogDfsReplication">${solr.ulog.tlogDfsReplication:3}</int>
>>>>>       </updateLog>
>>>>> 
>>>>>       <autoCommit>
>>>>>           <maxTime>${solr.autoCommit.maxTime:60000}</maxTime>
>>>>>           <maxDocs>${solr.autoCommit.maxDocs:10000}</maxDocs>
>>>>>           <openSearcher>true</openSearcher>
>>>>>       </autoCommit>
>>>>> 
>>>>>       <autoSoftCommit>
>>>>>           <maxTime>${solr.autoSoftCommit.maxTime:300000}</maxTime>
>>>>>       </autoSoftCommit>
>>>>>   </updateHandler>
>>>> 
>>>>>    <query>
>>>>>       <maxBooleanClauses>1000</maxBooleanClauses>
>>>>>       <filterCache class="solr.CaffeineCache"
>>>>>                    size="${filterCache.size:32768}"
>>>>>                    initialSize="${filterCache.initialSize:32768}"
>>>>>                    autowarmCount="20%"/>
>>>>> 
>>>>>       <queryResultCache class="solr.CaffeineCache"
>>>>>                         size="${queryResultCache.size:32768}"
>>>>>                         initialSize="${queryResultCache.initialSize:32768}"
>>>>>                         autowarmCount="0%"/>
>>>>> 
>>>>>       <documentCache class="solr.CaffeineCache"
>>>>>                      size="${documentCache.size:150000}"
>>>>>                      initialSize="${documentCache.initialSize:150000}"
>>>>>                      autowarmCount="0%"/>
>>>>> 
>>>>>       <enableLazyFieldLoading>true</enableLazyFieldLoading>
>>>>>       <useFilterForSortedQuery>true</useFilterForSortedQuery>
>>>>> 
>>>>>       <queryResultWindowSize>160</queryResultWindowSize>
>>>>>       <queryResultMaxDocsCached>300</queryResultMaxDocsCached>
>>>>> 
>>>>>       <listener event="newSearcher" class="solr.QuerySenderListener">
>>>>>       </listener>
>>>>>       <listener event="firstSearcher" class="solr.QuerySenderListener">
>>>>>       </listener>
>>>>> 
>>>>>       <useColdSearcher>false</useColdSearcher>
>>>>>       <maxWarmingSearchers>8</maxWarmingSearchers>
>>>>>   </query>
>>>> 
>>>> One of my assumptions was to reduce maxWarmingSearchers and to increase 
>>>> the autoCommit maxTime, since soft commits aren’t available anymore with 
>>>> TLOG replicas. Is that valid? 
>>>> 
>>>> I couldn’t find any documentation on the differences/considerations to 
>>>> take into account between NRT and TLOG; could you please help? Thanks a 
>>>> lot in advance. Please let me know if anything else is required. 
>>>> 
>>>> Best regards,
>>>> Nick Vladiceanu
>> 
