[jira] [Commented] (SOLR-17348) Mitigate extreme parallelism of zkCallback executor

Michael Gibney (Jira) Wed, 26 Jun 2024 05:10:05 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-17348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17860138#comment-17860138
 ]


Michael Gibney commented on SOLR-17348:
---------------------------------------

Re: the higher {{corePoolSize}}: if you set {{allowCoreThreadTimeOut=true}}, 
idle core threads time out the same way other threads do, so the pool can 
shrink all the way down to 0 (as far as I can tell from the documentation). 
In that case, the only thing {{corePoolSize}} determines is the threshold at 
which the executor prefers enqueuing a task over creating a new thread without 
attempting to enqueue.
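
To make that concrete, here is a minimal sketch of the pool shape under 
discussion, using small illustrative sizes (core 2, queue 2) rather than the 
proposed 1024. The class and method names ({{CoreTimeoutSketch}}, {{newPool}}, 
{{demo}}) are hypothetical and are not taken from SolrZkClient.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class CoreTimeoutSketch {

    // Same shape as the proposed config, but with small illustrative sizes (assumption).
    static ThreadPoolExecutor newPool(int corePoolSize, int queueCapacity, long keepAliveMillis) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                corePoolSize,
                Integer.MAX_VALUE,                  // effectively unbounded maximumPoolSize
                keepAliveMillis, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>(queueCapacity));
        pool.allowCoreThreadTimeOut(true);          // requires keepAliveTime > 0
        return pool;
    }

    // Returns {pool size at core, queued tasks, pool size after queue overflow, pool size after idling}.
    static int[] demo() {
        ThreadPoolExecutor pool = newPool(2, 2, 100);
        CountDownLatch release = new CountDownLatch(1);
        Runnable blocked = () -> {
            try {
                release.await();                    // hold the worker until released
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        pool.execute(blocked);
        pool.execute(blocked);                      // below corePoolSize: one new thread per task
        int atCore = pool.getPoolSize();
        pool.execute(blocked);
        pool.execute(blocked);                      // at corePoolSize: tasks are enqueued instead
        int queued = pool.getQueue().size();
        pool.execute(blocked);                      // queue full: a thread beyond corePoolSize is created
        int afterOverflow = pool.getPoolSize();

        release.countDown();                        // let every task finish
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
        try {
            // Wait for idle workers, core threads included, to time out.
            while (pool.getPoolSize() > 0 && System.nanoTime() < deadline) {
                Thread.sleep(20);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        int afterIdle = pool.getPoolSize();
        pool.shutdown();
        return new int[] {atCore, queued, afterOverflow, afterIdle};
    }

    public static void main(String[] args) {
        int[] r = demo();
        System.out.printf("atCore=%d queued=%d afterOverflow=%d afterIdle=%d%n",
                r[0], r[1], r[2], r[3]);
    }
}
```

In this sketch the pool creates threads up to {{corePoolSize}}, then enqueues, 
then grows again only once the queue is full, and finally drains back to 0 
threads after the keep-alive elapses.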

Tbh, even if there are some dependencies between tasks that could in theory 
lead to deadlock, I suspect it's vanishingly unlikely/borderline-impossible 
that with a sufficiently high number of core threads (e.g. 1024), there would 
be enough interdependent threads blocked at once to fully lock things up. I say 
this because, even if there are interdependencies, they'd be scoped differently 
(per core or per collection), so all it would take is for _some_ of the threads 
to make progress in order to keep tasks moving. Still, it'd be nice to figure out 
what the interdependencies are, if they exist (and it does look like they may 
exist for leader election, at least under certain circumstances).

Virtual threads would definitely be helpful for this executor; but even if we 
went that route, there'd still have to be some context-switching overhead, 
right? And thus it still might be worth trying to control extreme parallelism.

For the Stack Overflow answer: that sounds reasonable, iff it's acceptable to 
block the calling thread. I'm assuming that would cause callbacks to get 
"queued" somewhere before submission to the executor -- not sure if that could 
potentially lead to overflow of some kind? (This is why the solution I was 
pursuing still admitted the possibility of queue overflow -> unbounded pool 
growth).
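
The Stack Overflow answer itself isn't quoted in this thread, so the following 
is only a sketch of one common way to block the calling thread, assuming the 
answer describes something like it: a {{RejectedExecutionHandler}} that waits 
for queue space instead of rejecting. The names ({{BlockingSubmitSketch}}, 
{{newBlockingPool}}, {{runBurst}}) and the sizes (4 threads, queue of 8) are 
hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.RejectedExecutionHandler;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BlockingSubmitSketch {

    // Bounded pool plus bounded queue; when both are full, the submitter blocks.
    static ThreadPoolExecutor newBlockingPool(int maxThreads, int queueCapacity) {
        RejectedExecutionHandler blockCaller = (task, executor) -> {
            if (executor.isShutdown()) {
                throw new RejectedExecutionException("executor is shut down");
            }
            try {
                executor.getQueue().put(task);      // blocks the calling thread until space frees up
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RejectedExecutionException("interrupted while waiting for queue space", e);
            }
        };
        return new ThreadPoolExecutor(
                maxThreads, maxThreads,
                60, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                blockCaller);
    }

    // Submits a burst of short tasks; returns {tasks completed, largest pool size observed}.
    static int[] runBurst(int taskCount) {
        ThreadPoolExecutor pool = newBlockingPool(4, 8);
        AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < taskCount; i++) {
            pool.execute(() -> {
                try {
                    Thread.sleep(1);                // stand-in for callback work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                completed.incrementAndGet();
            });
        }
        pool.shutdown();                            // queued tasks still run after shutdown()
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new int[] {completed.get(), pool.getLargestPoolSize()};
    }

    public static void main(String[] args) {
        int[] r = runBurst(100);
        System.out.printf("completed=%d largestPoolSize=%d%n", r[0], r[1]);
    }
}
```

Here the thread count stays capped, but the submitting thread stalls inside 
{{execute()}} whenever the queue is full, so any backlog accumulates wherever 
that caller gets its work from.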

> Mitigate extreme parallelism of zkCallback executor
> ---------------------------------------------------
>
>                 Key: SOLR-17348
>                 URL: https://issues.apache.org/jira/browse/SOLR-17348
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Michael Gibney
>            Priority: Minor
>
> zkCallback executor is [currently an unbounded thread pool of core size 
> 0|https://github.com/apache/solr/blob/709a1ee27df23b419d09fe8f67c3276409131a4a/solr/solrj-zookeeper/src/java/org/apache/solr/common/cloud/SolrZkClient.java#L91-L92],
>  using a SynchronousQueue. Thus, a flood of zkCallback events (as might be 
> triggered by a cluster restart, e.g.) can result in spinning up a very large 
> number of threads. In practice we have encountered as many as 35k threads 
> created in some such cases, even after the impact of this situation was 
> reduced by the fix for SOLR-11535.
> Inspired by [~cpoerschke]'s recent [closer look at thread pool 
> behavior|https://issues.apache.org/jira/browse/SOLR-13350?focusedCommentId=17853178#comment-17853178],
>  I wondered if we might be able to employ a bounded queue to alleviate some 
> of the pressure from bursty zk callbacks.
> The new config might look something like: {{corePoolSize=1024, 
> maximumPoolSize=Integer.MAX_VALUE, allowCoreThreadTimeOut=true, workQueue=new 
> LinkedBlockingQueue<>(1024)}}. This would allow the pool to grow up to (and 
> shrink from) corePoolSize in the same manner it currently does, but once 
> exceeding corePoolSize (e.g. during a cluster restart or other callback flood 
> event), tasks would be queued (up to some fixed limit). If the queue limit is 
> exceeded, new threads would still be created, but we would have avoided the 
> current “always create a thread” behavior, and by so doing hopefully reduce 
> task execution time and improve overall throughput.
> From the ThreadPoolExecutor javadocs:
> {quote}Direct handoffs. A good default choice for a work queue is a 
> SynchronousQueue that hands off tasks to threads without otherwise holding 
> them. Here, an attempt to queue a task will fail if no threads are 
> immediately available to run it, so a new thread will be constructed. This 
> policy avoids lockups when handling sets of requests that might have internal 
> dependencies. Direct handoffs generally require unbounded maximumPoolSizes to 
> avoid rejection of new submitted tasks. This in turn admits the possibility 
> of unbounded thread growth when commands continue to arrive on average faster 
> than they can be processed.{quote}
> So afaict SynchronousQueue mainly makes sense if there exists the possibility 
> of deadlock due to dependencies among tasks, and I think this should ideally 
> _not_ be the case with zk callbacks (though I'm not sure that actually holds 
> in practice).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

