[ 
https://issues.apache.org/jira/browse/SOLR-17106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17817128#comment-17817128
 ] 

Aparna Suresh commented on SOLR-17106:
--------------------------------------

Thanks for the feedback! Sorry for not responding for a few weeks: I was out 
sick with Covid initially and then tied up investigating issues in Production. 
I appreciate the detailed evaluation. 

I completely missed the point about backwards compatibility!
{quote}I'm guessing what you meant to do is have 
{{reduceRemainingZombieTime(...)}} subtract 
{{zombieStateMonitoringIntervalMillis}} from {{remainingTime}}? ... but this 
approach still seems kind of confusing & misleading, because tracking & 
recording "remaining milliseconds" like this implies more granularity than 
there really is.

{{remainingTime=10 (ms)}} is meaningless if 
{{zombieStateMonitoringIntervalMillis=60_000}} – you're going to have to wait 
the full 60 seconds.
{quote}
I specified an override of zombieStateMonitoringIntervalMillis = 5s in my first 
commit on LBHttp2SolrClient, with remainingTime set to 10s. So that the thread 
running periodically doesn't evict a zombie entry right away, I added the 
following if condition, but I agree that it would keep some entries as zombies 
until the next run. I agree 100% that the time-based approach doesn't provide 
much flexibility compared to the numIters approach.

 
{code:java}
private void reduceRemainingZombieTime(ServerWrapper wrapper) {
  if (wrapper == null) {
    return;
  }
  if (wrapper.remainingTime == 0) {
    // evict from zombieServers, add to aliveServers
    zombieServers.remove(wrapper.getBaseUrl());
    wrapper.failedPings = 0;
    if (wrapper.standard) {
      addToAlive(wrapper);
    }
  } else {
    wrapper.remainingTime = Math.max(0, wrapper.remainingTime - minZombieReleaseTimeMillis);
  }
}
{code}

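To make the granularity concern concrete, here is a minimal stand-alone model of the countdown rule above. It is only a sketch: the class and method names are hypothetical, the decrement is passed in as a parameter rather than tied to a particular field, and it assumes the first monitor run fires one full interval after the entry becomes a zombie.

```java
/** Hypothetical stand-alone model of the countdown; not the SolrJ class. */
class ZombieCountdownSketch {

  /**
   * Counts how many monitor runs pass before a zombie is released, using the same rule
   * as the method above: release when the remaining time is 0, otherwise subtract a
   * fixed step and clamp at 0.
   */
  static int runsUntilRelease(long remainingTimeMillis, long stepMillis) {
    if (stepMillis <= 0) {
      throw new IllegalArgumentException("stepMillis must be positive");
    }
    long remaining = Math.max(0, remainingTimeMillis);
    int runs = 1; // the run that finally sees remaining == 0 and releases the entry
    while (remaining > 0) {
      remaining = Math.max(0, remaining - stepMillis);
      runs++;
    }
    return runs;
  }

  /** Delay before release, assuming (for illustration) one run per intervalMillis. */
  static long millisUntilRelease(long remainingTimeMillis, long stepMillis, long intervalMillis) {
    return runsUntilRelease(remainingTimeMillis, stepMillis) * intervalMillis;
  }

  public static void main(String[] args) {
    // 10s budget, 5s step and interval: decremented on runs 1 and 2, released on run 3.
    System.out.println(millisUntilRelease(10_000, 5_000, 5_000)); // 15000
    // 10ms budget, 60s step and interval: the wait is still whole monitor runs.
    System.out.println(millisUntilRelease(10, 60_000, 60_000)); // 120000
  }
}
```

In this model the release time is always a whole number of monitor runs, and the separate "== 0" branch costs one extra run, which is why some entries stay zombies until the next run.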
I have updated the PR based on your comments here: 
[https://github.com/apache/solr/pull/2160/files]

> LBSolrClient: Make it configurable to remove zombie ping checks
> ---------------------------------------------------------------
>
>                 Key: SOLR-17106
>                 URL: https://issues.apache.org/jira/browse/SOLR-17106
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Aparna Suresh
>            Priority: Minor
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Following up on a dev list discussion here: 
> [https://lists.apache.org/thread/f0zfmpg0t48xrtppyfsmfc5ltzsq2qqh]
> The issue involves scalability challenges in SolrJ's *LBSolrClient* when a 
> pod with numerous cores experiences connectivity problems. The "zombie" 
> tracking mechanism, operating on a core basis, becomes a bottleneck during 
> distributed search on a massive multi-shard collection. Threads attempting to 
> reach unhealthy cores contribute to a high computational load, causing 
> performance issues. 
> As suggested by Chris Hostetter: LBSolrClient could be configured to disable 
> zombie "ping" checks, but retain zombie tracking. Once a replica/endpoint is 
> identified as a zombie, it could be held in zombie jail for X seconds, before 
> being released - hoping that by this timeframe ZK would be updated to mark 
> this endpoint DOWN or the pod is back up and CloudSolrClient would avoid 
> querying it. In any event, only 1 failed query would be needed to send the 
> server back to zombie jail.
>  
> This change has two benefits:
>  * Eliminate the zombie ping requests, which would otherwise overload pod(s) 
> coming up after a restart
>  * Avoid memory leaks, in case a node/replica goes away permanently, but it 
> stays as zombie forever, with a background thread in LBSolrClient constantly 
> pinging it
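
The "zombie jail" idea in the description can be sketched roughly as follows. This is an illustrative model only, not the LBSolrClient implementation or the code in the PR: the class, its methods, and the choice to pass the current time in as a parameter are all assumptions made to keep the example self-contained.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of a ping-free zombie jail; not the LBSolrClient implementation. */
class ZombieJailSketch {
  private final long jailMillis;
  // base URL -> time (in millis) at which the endpoint may be tried again
  private final Map<String, Long> releaseAt = new ConcurrentHashMap<>();

  ZombieJailSketch(long jailMillis) {
    this.jailMillis = jailMillis;
  }

  /** A single failed request is enough to send the endpoint (back) to jail. */
  void onFailure(String baseUrl, long nowMillis) {
    releaseAt.put(baseUrl, nowMillis + jailMillis);
  }

  /** Returns true if the endpoint may be queried; expired entries are released without a ping. */
  boolean isUsable(String baseUrl, long nowMillis) {
    Long until = releaseAt.get(baseUrl);
    if (until == null) {
      return true;
    }
    if (nowMillis >= until) {
      releaseAt.remove(baseUrl, until); // only removes if the entry was not re-jailed meanwhile
      return true;
    }
    return false;
  }

  /** Drops every expired entry, so endpoints that are never queried again do not accumulate. */
  void purgeExpired(long nowMillis) {
    releaseAt.values().removeIf(until -> nowMillis >= until);
  }

  int jailedCount() {
    return releaseAt.size();
  }
}
```

Because entries expire on their own, an endpoint that disappears permanently is eventually dropped by purgeExpired instead of being pinged forever.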


