[ https://issues.apache.org/jira/browse/SOLR-16992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17768964#comment-17768964 ]
Alex Deparvu commented on SOLR-16992: ------------------------------------- {quote} I like the impl of {{SolrClientCache.close()}} a lot ... strictly speaking i don't think {{AtomicBoolean}} is actually necessary given that (all?) the methods are synchronized, but i don't care. i prefer your AtomicBoolean approach over depending on the synchronization (maybe/hopefully enough of the methods can similarly be improved to the point we can remove the synchronization some day) {quote} you are correct the atomic boolean is not strictly needed. {quote} checkState(); is too vague of a method name to be readable in context w/o consulting javados. The convention in lucene is {{ensureOpen(); }} which i think is really self documenting as you skim code {quote} yeah, will move to ensureOpen, sounds better. {quote} I think having a StreamExecutorHelper.submitAllAndAwaitAggregatingExceptions(...) method that is re-used in multiple streams is a fine idea, but I think the "guts" of it should be re-implemented as an ExecutorUtil class that takes an existing ExecutorService as an argument (and does not close it even on exception) {quote} I agree. will move it there {quote} but the if (result != null) ... logic should not be part of a general ExecutorUtil helper arguably it shouldn't be part of StreamExecutorHelper.submitAllAndAwaitAggregatingExceptions(...) either ... i'm pretty sure if you check the diffs of all the methods where you are using it in your PR not all of them currently ignore null results. {quote} this part, I am not sold on. looked at all uses again and the large majority filter nulls, one would NPE and the shards ones have no nulls. but I do see the point that a generic method does not need to concern itself with null values, it feels very 'implementation dependent' thanks for the review so far, will update and ping again > Non-reproducible StreamingTest failures -- suggests CloudSolrStream > concurency race condition > --------------------------------------------------------------------------------------------- > > Key: SOLR-16992 > URL: https://issues.apache.org/jira/browse/SOLR-16992 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Reporter: Chris M. Hostetter > Assignee: Alex Deparvu > Priority: Major > Attachments: > OUTPUT-org.apache.solr.client.solrj.io.stream.StreamingTest.txt, > thetaphi_solr_Solr-main-Linux_14679.log.txt > > Time Spent: 10m > Remaining Estimate: 0h > > Roughly 3% of all jenkins jobs that run {{StreamingTest}} wind up having > suite level failures. > These failures have historically taken the form of > {{com.carrotsearch.randomizedtesting.ThreadLeakError}} and the leaked threads > all have names like > {{"h2sc-718-thread-2"}} indicating that they come from the internal > {{ExecutorService}} of an {{{}Http2SolrClient{}}}. > In my experience, the seeds from these failures have never reproduced - > suggesting that the problem is related to concurrency. > SOLR-16983 restored the (correct) use of {{ObjectReleaseTracker}} which in > theory should help pinpoint where {{Http2SolrClient}} instances might not be > getting closed (by causing {{ObjectReleaseTracker}} to fail with stacktraces > of when/where any unclosed instances were created - ie: which test method) > In practice, I have managed to force one failure from {{StreamingTest}} since > the SOLR-16983 changes (logs to be attached soon) - but it still didn't > indicate any leaked/unclosed {{Http2SolrClient}} instances. What it instead > indicated was a _single_ unclosed {{InputStream}} instance related to > {{Http2SolrClient}} connections (SOLR-16983 also added better tracking of > this) coming from {{StreamingTest.testExceptionStream}} - a test method that > opens _five_ very similar {{ExceptionStream}} instances, wrapping > {{CloudSolrStream}} instance, which expect to trigger server side errors. > By it's very design, {{ExceptionStream}} catches & records any exceptions > from the stream it wraps, so even in the event of these "expected" server > side errors, {{ExceptionStream.close()}} should still be correctly getting > called (and propagating down to the {{CloudStream}} it wraps). > I believe the underlying problem has to do with a concurrency race condition > between the call to {{CloudStream.close()}} and the {{ExecutorService}} used > internally by {{CloudSolrStream.openStreams()}} (details to follow) -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org