[ https://issues.apache.org/jira/browse/CASSANDRA-6648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13891217#comment-13891217 ]
Brandon Williams commented on CASSANDRA-6648:
---------------------------------------------

Thinking about this a bit more, I'm inclined to think that a) isFatClient should never have checked epState.isAlive, since a fat client can be either alive or dead, and neither makes it more or less of a fat client, and thus b) onAlive is the wrong event for MM to be looking at to decide on pulling schema, since potentially *every* node actually IS a fat client when first seen.  The true source of 'fatclientness' or not is TMD.isMember, but SS hasn't processed the onJoin event yet when onAlive is called.  We could possibly fix this by having isFatClient check for the presence of TOKENS, which a fat client shouldn't have, or we could make SS.onJoin trigger MM.maybeScheduleSchemaPull after it has processed the join event.

> Race condition during node bootstrapping
> ----------------------------------------
>
>                 Key: CASSANDRA-6648
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6648
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Sergio Bossa
>            Assignee: Sergio Bossa
>            Priority: Critical
>         Attachments: 6648-v2.txt, CASSANDRA-6648.patch
>
>
> When bootstrapping a new node, data is "missing" as if the new node didn't
> actually bootstrap, which I tracked down to the following scenario:
> 1) New node joins token ring and waits for schema to be settled before
> actually bootstrapping.
> 2) The schema check somewhat passes and it starts bootstrapping.
> 3) Bootstrapping doesn't find the ks/cf that it should have received from the
> other node.
> 4) Queries at this point cause NPEs, until they later "recover", but data
> is missing.
> The problem seems to be caused by a race condition between the migration
> manager and the bootstrapper, with the former running after the latter.
> I think this is supposed to protect against such scenarios:
> {noformat}
>             while (!MigrationManager.isReadyForBootstrap())
>             {
>                 setMode(Mode.JOINING, "waiting for schema information to complete", true);
>                 Uninterruptibles.sleepUninterruptibly(1, TimeUnit.SECONDS);
>             }
> {noformat}
> But MigrationManager.isReadyForBootstrap() implementation is quite fragile
> and doesn't take into account "slow" schema propagation.
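
To make the first option from the comment concrete (having isFatClient look for TOKENS), here is a minimal standalone sketch. It is not Cassandra code: FatClientSketch, AppState, EndpointState, and both isFatClientBy... methods are hypothetical stand-ins, and the ring-membership set only imitates the ordering problem described above, where membership has not been recorded yet when the node is first seen.

```java
import java.util.EnumMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class FatClientSketch {
    /** Hypothetical subset of gossip application states; only TOKENS matters here. */
    enum AppState { STATUS, SCHEMA, TOKENS }

    /** Hypothetical stand-in for the gossip state held about one endpoint. */
    static final class EndpointState {
        private final Map<AppState, String> states = new EnumMap<>(AppState.class);

        EndpointState with(AppState key, String value) {
            states.put(key, value);
            return this;
        }

        boolean has(AppState key) {
            return states.containsKey(key);
        }
    }

    /**
     * Membership-based check: depends on the join event having been processed
     * already, so it misclassifies a real node that was only just seen.
     */
    static boolean isFatClientByMembership(String endpoint, Set<String> ringMembers) {
        return !ringMembers.contains(endpoint);
    }

    /**
     * Token-based check (the assumption from the comment): a fat client should
     * not advertise TOKENS, so the answer does not depend on event ordering.
     */
    static boolean isFatClientByTokens(EndpointState state) {
        return !state.has(AppState.TOKENS);
    }

    public static void main(String[] args) {
        // The join event has not been processed yet, so the ring is still empty.
        Set<String> ringMembers = new HashSet<>();
        EndpointState storageNode = new EndpointState().with(AppState.TOKENS, "token-placeholder");
        EndpointState client = new EndpointState().with(AppState.SCHEMA, "schema-placeholder");

        // Storage node looks like a fat client under the membership check.
        System.out.println("membership, storage node: "
                + isFatClientByMembership("10.0.0.1", ringMembers));
        // The token check classifies both endpoints independently of ordering.
        System.out.println("tokens, storage node: " + isFatClientByTokens(storageNode));
        System.out.println("tokens, client: " + isFatClientByTokens(client));
    }
}
```

In this toy model the membership check reports the storage node as a fat client until the join is processed, while the token check does not. Whether TOKENS is reliably present in gossip state at the time onAlive fires in the real code is exactly what would need verifying before choosing this option over triggering the schema pull from SS.onJoin.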