[ https://issues.apache.org/jira/browse/SOLR-17106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17807491#comment-17807491 ]
Chris M. Hostetter commented on SOLR-17106: ------------------------------------------- Hi Aprna, thanks for creating this jira! I read through the changes on your branch, and I'm a little confused by a few things... 1. What was your motivation for replacing {{setAliveCheckInterval(int aliveCheckInterval, TimeUnit unit)}} with {{setZombieStateMonitoringIntervalMillis(long zombieStateMonitoringIntervalMillis)}} ? It appears that they ultimately serve the same purpose of dictating the {{scheduleAtFixedRate()}} executor, but changing the name just seems like an unnecessary back compat break? (We also, AFAIK, don't currently use the term "zombie" in any of the {{public}} APIs of SolrJ – only in the internal tracking of what servers are "not alive" .... we probably want to keep the APIs in terms of being "alive" or "not alive") Even if there is a strong reason I'm overlooking why it makes sense to change the method name: There's also been a push over the last few years to standardize on always taking in {{TimeUnit}} when dealing with method args that specify a time duration, rather then forcing the client to specify milliseconds (see SOLR-16595) 2. I don't think {{setMinZombieReleaseTimeMillis(long jailTimeInMillis)}} is being used the way you intended / expected? >From the javadocs you added... {quote}This param should be set if enableZombieChecks=false. This param configures the time a server should be regarded, at a minimum, as a zombie before being released {quote} ...but from what I've seen of where & how the values are actually used: * In {{addZombie(...)}} it explicitly sets {{ServerWraper.remainingTime = zombieCheckIntervalMillis}} * In {{{}reduceRemainingZombieTime(...){}}}, if {{ServerWraper.remainingTime}} is non zero, explicitly set it to {{wrapper.remainingTime = Math.max(0, (wrapper.remainingTime - minZombieReleaseTimeMillis));}} ...that's it. So no matter what the value {{minZombieReleaseTimeMillis}} is set to, the only thing that determines how long a URL is kept in the zombie list is 2x the {{zombieStateMonitoringIntervalMillis}} value used to configure the scheduled executor. The first time the executor calls {{reduceRemainingZombieTime(...)}} on a {{ServerWrapper}} it will reduce the {{remainingTime}} to zero. The Second time the executor calls {{reduceRemainingZombieTime(...)}} it will remove the URL from zombies and re-add it to alive. I'm guessing what you ment to do is have {{reduceRemainingZombieTime(...)}} subtract {{zombieStateMonitoringIntervalMillis}} from {{remainingTime}} ? ... but this approach still seems kind of confusing & misleading, because tracking & recording "remaining milliseconds" like this implies more granularity then here really is. {{remainingTime=10 (ms)}} is meaningless if {{zombieStateMonitoringIntervalMillis=60_000}} – you're going to have to wait the full 60 seconds. I would suggest a "num iters" type configuration, rather then a "time" based configuration, so that we are very transparent that our "liveness checker" runs every {{aliveCheckInterval}} – and you can configure the "liveness checker" to only inspect a URL ever "N runs". ---- Thinking about the API of this change holistically, my inclination would be to support disabling the zombie check queries as a special case of making the specifics of those queries configurable. Here's the API changes I would suggest: * Keep {{setAliveCheckInterval(...)}} the way it is * Add a new {{setAliveCheckSkipIters(int)}} that defaults to {{0}} ** Document that positive values can be used to ensure that the liveness checks for a given URL will only be run every "Nth" iteration of the {{setAliveCheckInterval}} * Add a new {{setAliveCheckQuery(SolrQuery)}} that defaults to the current static {{SolrQuery}} ** Document that if this is set to {{null}} the client will implicitly assume success anytime it would normally perform a a liveness check on a server. The underlying functional changes would be: * Add an {{int skipAliveCheckIters}} to {{ServerWrapper}} that is initialized to whatever {{setAliveCheckSkipIters(int)}} is set to * In {{checkAZombieServer(...)}} ... ** Befor any existing logic: {{if (0 < zombieServer.skipAliveCheckIters)}} ... decrement and return immediately ** Short circuit the the current {{QueryRequest}} logic such that if {{setAliveCheckQuery(null)}} is in use, we don't execute any request, we just enter the "success" block Unless I'm overlooking something, this would all be back-compatable, so we could merge it to the 9x branch, and then on the main branch we could change some of the defaults (like using {{setAliveCheckRequest(null)}} by default, increasing {{{}setAliveCheckSkipIters(int){}}}, decreasing {{{}setAliveCheckInterval(...){}}}) What do you think? > LBSolrClient: Make it configurable to remove zombie ping checks > --------------------------------------------------------------- > > Key: SOLR-17106 > URL: https://issues.apache.org/jira/browse/SOLR-17106 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Reporter: Aparna Suresh > Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > Following discussion from a dev list discussion here: > [https://lists.apache.org/thread/f0zfmpg0t48xrtppyfsmfc5ltzsq2qqh] > The issue involves scalability challenges in SolrJ's *LBSolrClient* when a > pod with numerous cores experiences connectivity problems. The "zombie" > tracking mechanism, operating on a core basis, becomes a bottleneck during > distributed search on a massive multi shard collection. Threads attempting to > reach unhealthy cores contribute to a high computational load, causing > performance issues. > As suggested by Chris Hostetter: LBSolrClient could be configured to disable > zombie "ping" checks, but retain zombie tracking. Once a replica/endpoint is > identified as a zombie, it could be held in zombie jail for X seconds, before > being released - hoping that by this timeframe ZK would be updated to mark > this endpoint DOWN or the pod is back up and CloudSolrClient would avoid > querying it. In any event, only 1 failed query would be needed to send the > server back to zombie jail. > > There are benefits in doing this change: > * Eliminate the zombie ping requests, which would otherwise overload pod(s) > coming up after a restart > * Avoid memory leaks, in case a node/replica goes away permanently, but it > stays as zombie forever, with a background thread in LBSolrClient constantly > pinging it -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org