[ https://issues.apache.org/jira/browse/SOLR-17106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17817128#comment-17817128 ]
Aparna Suresh commented on SOLR-17106:
--------------------------------------

Thanks for the feedback! Sorry I did not have a chance to respond for a few weeks - I was out sick with Covid initially and then tied up investigating issues in Production.

Appreciate the detailed evaluation. I completely missed the point about backwards compatibility!

{quote}
I'm guessing what you meant to do is have {{reduceRemainingZombieTime(...)}} subtract {{zombieStateMonitoringIntervalMillis}} from {{remainingTime}} ?

... but this approach still seems kind of confusing & misleading, because tracking & recording "remaining milliseconds" like this implies more granularity than there really is. {{remainingTime=10 (ms)}} is meaningless if {{zombieStateMonitoringIntervalMillis=60_000}} – you're going to have to wait the full 60 seconds.
{quote}

I specified an override of zombieStateMonitoringIntervalMillis = 5s in my first commit on LBHttp2SolrClient, with remainingTime set to 10s. So that the periodically running thread doesn't evict a zombie entry right away, I added the following if condition - but I agree that this would keep some entries as zombies until the next run. I agree 100% that the time-based approach doesn't provide a lot of flexibility compared to the numIters approach.

{code:java}
private void reduceRemainingZombieTime(ServerWrapper wrapper) {
  if (wrapper == null) {
    return;
  }
  if (wrapper.remainingTime == 0) {
    // evict from zombieServers, add to aliveServers
    zombieServers.remove(wrapper.getBaseUrl());
    wrapper.failedPings = 0;
    if (wrapper.standard) {
      addToAlive(wrapper);
    }
  } else {
    // count down towards release, never going below zero
    wrapper.remainingTime = Math.max(0, (wrapper.remainingTime - minZombieReleaseTimeMillis));
  }
}
{code}

Have updated the PR based on your comments here: [https://github.com/apache/solr/pull/2160/files]

> LBSolrClient: Make it configurable to remove zombie ping checks
> ---------------------------------------------------------------
>
>                 Key: SOLR-17106
>                 URL: https://issues.apache.org/jira/browse/SOLR-17106
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Aparna Suresh
>            Priority: Minor
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Following a dev list discussion here:
> [https://lists.apache.org/thread/f0zfmpg0t48xrtppyfsmfc5ltzsq2qqh]
> The issue involves scalability challenges in SolrJ's *LBSolrClient* when a
> pod with numerous cores experiences connectivity problems. The "zombie"
> tracking mechanism, operating on a per-core basis, becomes a bottleneck during
> distributed search on a massive multi-shard collection. Threads attempting to
> reach unhealthy cores contribute to a high computational load, causing
> performance issues.
> As suggested by Chris Hostetter: LBSolrClient could be configured to disable
> zombie "ping" checks, but retain zombie tracking. Once a replica/endpoint is
> identified as a zombie, it could be held in zombie jail for X seconds before
> being released, in the hope that by then either ZK has been updated to mark
> this endpoint DOWN (so CloudSolrClient would avoid querying it) or the pod is
> back up. In any event, only 1 failed query would be needed to send the
> server back to zombie jail.
>
> There are benefits in doing this change:
> * Eliminate the zombie ping requests, which would otherwise overload pod(s)
> coming up after a restart
> * Avoid memory leaks, in case a node/replica goes away permanently, but it
> stays as zombie forever, with a background thread in LBSolrClient constantly
> pinging it
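
To illustrate the granularity point raised in the comment above, here is a minimal sketch of a "zombie jail" that counts monitor runs instead of milliseconds. It is not the LBSolrClient code or the code in the PR: the class {{ZombieJailSketch}} and its methods ({{markZombie}}, {{onMonitorTick}}, {{ticksInJail}}) are hypothetical names chosen for illustration. It assumes a background task calls {{onMonitorTick()}} once per monitoring interval, so a requested jail time is rounded up to a whole number of runs.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, NOT the LBSolrClient implementation: zombies are never
// pinged; they are simply released after a fixed number of monitor runs.
final class ZombieJailSketch {
  // base URL -> monitor runs left before the endpoint is released
  private final Map<String, Integer> remainingTicks = new ConcurrentHashMap<>();
  private final int ticksInJail;

  ZombieJailSketch(long minJailMillis, long monitorIntervalMillis) {
    if (minJailMillis < 0 || monitorIntervalMillis <= 0) {
      throw new IllegalArgumentException("jail time must be >= 0 and interval > 0");
    }
    // Round up: a zombie can only be released when the monitor runs, so a
    // 10 ms jail with a 60 s interval still means waiting one full run.
    long ticks = (minJailMillis + monitorIntervalMillis - 1) / monitorIntervalMillis;
    this.ticksInJail = (int) Math.max(1, ticks);
  }

  int ticksInJail() {
    return ticksInJail;
  }

  // Called when a request to baseUrl fails; a single failure (re)starts the jail term.
  void markZombie(String baseUrl) {
    remainingTicks.put(baseUrl, ticksInJail);
  }

  boolean isZombie(String baseUrl) {
    return remainingTicks.containsKey(baseUrl);
  }

  // Assumed to be called once per monitoring interval by a background task.
  void onMonitorTick() {
    remainingTicks.replaceAll((url, ticks) -> ticks - 1);
    remainingTicks.values().removeIf(ticks -> ticks <= 0);
  }
}
```

With a 10s jail time and a 5s interval, an endpoint stays jailed for two monitor runs; with a 10ms jail time and a 60s interval it still stays for one full run, which is why a run count describes the behaviour more honestly than "remaining milliseconds".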