[ https://issues.apache.org/jira/browse/SOLR-16414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628495#comment-17628495 ]
Ishan Chattopadhyaya commented on SOLR-16414:
---------------------------------------------

bq. Still, if this is a PRS-only issue and the fix today is likely to work on the 8-node 1000 collections test, then we should not hold up 9.1 to try to get a perfect solution for 1% of Solr's users.

This affects non-PRS collections as well. The test was specifically for the general case: https://github.com/fullstorydev/solr-bench/blob/ishan/repeatable-jenkins/suites/cluster-test.json#L7

bq. we should not hold up 9.1 to try to get a perfect solution for 1% of Solr's users.

However, this attitude towards our users needs to change. Fullstory is using PRS in production only because of its performance benefits. Punishing early adopters and active contributors for no reason is not useful for our project. Even though this issue is not PRS specific, a respin would have been warranted even if it were (given the severity of the problem).

bq. Do you know the root cause of the deadlock? We can guess that the unbounded parallelStream would cause too much traffic to ZK at once so that something breaks? But if someone has 100 Solr nodes instead of 8, you'd still get a massive parallel load on ZK?

Here is the thread dump at the time of the failed graceful shutdown: https://termbin.com/7cfz. It looks like a Java/JDK/JVM or usage issue around parallelStream that simply causes unnecessary contention on the thread pool.

> Race condition in PRS state updates
> -----------------------------------
>
>                 Key: SOLR-16414
>                 URL: https://issues.apache.org/jira/browse/SOLR-16414
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public (Default Security Level. Issues are Public)
>            Reporter: Noble Paul
>            Assignee: Noble Paul
>            Priority: Major
>             Fix For: 9.1
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> For PRS collections, the individual replica states are potentially updated from
> individual nodes and sometimes from the overseer too.
> It is possible that:
> # OP1 is sent to the overseer at T1
> # OP2 is executed on the node itself at T2
>
> Because we cannot guarantee that OP1, sent to the overseer, executes before
> OP2, the final state will be the result of OP1, which is incorrect and can
> lead to errors.
> The solution is to never do any PRS writes from the overseer.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
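The ordering problem described above can be sketched in a few lines of plain Java. This is a minimal illustration, not Solr's actual code: the class and field names (`PrsRaceSketch`, `replicaState`) are hypothetical, a single-threaded executor stands in for the overseer queue, and a deliberate delay stands in for the overseer applying OP1 late. The point is only that when two writers race on the same state entry and ordering is not guaranteed, the stale write can land last.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicReference;

public class PrsRaceSketch {
    // Stands in for the per-replica state entry (e.g. what would live in ZooKeeper).
    static final AtomicReference<String> replicaState = new AtomicReference<>("down");

    public static void main(String[] args) throws Exception {
        // Stands in for the overseer's serialized update queue.
        ExecutorService overseer = Executors.newSingleThreadExecutor();

        // OP1: sent to the overseer at T1, but applied with a delay.
        Future<?> op1 = overseer.submit(() -> {
            try {
                Thread.sleep(100); // simulate the overseer getting to OP1 late
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            replicaState.set("recovering"); // the stale write lands last
        });

        // OP2: executed directly on the node at T2 (after T1, before OP1 is applied).
        replicaState.set("active");

        op1.get();
        overseer.shutdown();

        // Final state is OP1's value, even though OP2 was the newer operation.
        System.out.println(replicaState.get()); // prints "recovering"
    }
}
```

Routing all PRS writes through a single path (here, never letting the "overseer" write at all) removes the race, which is exactly the fix the issue proposes.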