[ https://issues.apache.org/jira/browse/SOLR-12386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17386590#comment-17386590 ]
Mark Robert Miller commented on SOLR-12386:
-------------------------------------------

It’s really all related. The goal is to move away from a strategy of N actors on X nodes all trying to independently create and manage zk paths (at a segment level), with retries and the same ground-zero strategy every time they come up. That would be fairly difficult to do without saying: no, this is going to be the sensible strategy for how we manage zk nodes. And that comes down to pretty much doing it like a human would on paper. We know which nodes should exist and when. We know which nodes and paths should be created or removed, and when. We share this knowledge with the computer.

I’m just talking about a simple zk lock for cluster creation. With Curator, it’s minimal code. With a fullish zk lock recipe, it’s a bit more. But you could also do something more basic, like designating an ephemeral zk node as the cluster-wide core zk layout state lock. On startup, try to create it. If it already exists, someone else has it: wait for a key final node to show up (/collections or whatever) and continue. If you succeeded in creating it, create the rest of the cluster layout and continue.

How much just this helps depends on what you want. I’d love to be able to start up hundreds of nodes, each with hundreds of cores, quickly and reliably. That’s a bit of a challenge with all of them, plus their internal objects, fighting to create and retry each individual zk path. Throw in some connection loss, perhaps a zk node blinked, or the hardware is chugging getting everything up. Now you have a storm of indeterminate time, ferocity and fallout. The system can scale like nuts, but not with this kind of behavior.
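
To make the "more basic" option concrete, here is a rough sketch of the startup sequence described above. It is only an illustration, not Solr code: `FakeZk` is an in-memory stand-in invented for this example (not a real ZooKeeper or Curator API), and the lock path name and the list of layout paths are assumptions, with /collections playing the role of the key final node. In a real ensemble the lock would be an ephemeral node, so it would disappear if the creator's session died.

```python
import threading


class NodeExistsError(Exception):
    """Stand-in for ZooKeeper's 'node already exists' error."""


class FakeZk:
    """In-memory stand-in for a ZooKeeper ensemble (hypothetical, not a client API)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._nodes = set()
        self.create_attempts = 0  # total create calls, to show how little contention there is

    def create(self, path):
        with self._cond:
            self.create_attempts += 1
            if path in self._nodes:
                raise NodeExistsError(path)
            self._nodes.add(path)
            self._cond.notify_all()  # wake anyone waiting for this path

    def exists(self, path):
        with self._cond:
            return path in self._nodes

    def wait_for(self, path, timeout):
        # Blocks until the path exists or the timeout expires; returns True if it exists.
        with self._cond:
            return self._cond.wait_for(lambda: path in self._nodes, timeout)


# Assumed names for illustration only. The final entry is the marker waiters look for.
LOCK_PATH = "/cluster_layout_lock"
LAYOUT = ["/overseer", "/live_nodes", "/configs", "/collections"]


def ensure_cluster_layout(zk, timeout=5.0):
    """Return True if this caller created the layout, False if it found or waited for it."""
    final_marker = LAYOUT[-1]
    if zk.exists(final_marker):
        return False  # layout already complete, nothing to do
    try:
        zk.create(LOCK_PATH)  # would be an ephemeral node in real ZooKeeper
    except NodeExistsError:
        # Someone else holds the lock: do not touch the layout, just wait for the marker.
        if not zk.wait_for(final_marker, timeout):
            raise TimeoutError("cluster layout did not appear in time")
        return False
    # We hold the lock: create every path in a fixed order, final marker last.
    for path in LAYOUT:
        try:
            zk.create(path)
        except NodeExistsError:
            pass  # left over from a creator that died partway through
    return True
```

With this shape, each starting node makes at most one contended create call (the lock) instead of every node retrying every path, and only the single lock holder writes the layout.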

> Test fails for "Can't find resource" for files in the _default configset
> ------------------------------------------------------------------------
>
>                 Key: SOLR-12386
>                 URL: https://issues.apache.org/jira/browse/SOLR-12386
>             Project: Solr
>          Issue Type: Test
>          Components: SolrCloud
>            Reporter: David Smiley
>            Priority: Minor
>         Attachments: cant find resource, stacktrace.txt
>
>
> Some tests, especially ConcurrentCreateRoutedAliasTest, have sporadically
> failed with the message "Can't find resource" pertaining to a
> file that is in the default ConfigSet yet mysteriously can't be found. This
> happens when a collection is being created that ultimately fails for this
> reason.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)