rondagostino commented on pull request #10991: URL: https://github.com/apache/kafka/pull/10991#issuecomment-877430917
TL;DR: my preference is to skip the regex check rather than adding a new message and carefully/appropriately grepping for it. > I think you do need to special case this and send SIGKILL to the last n / 2 (round down) combined nodes Yeah, we have to keep track of the number of co-located controllers that are running in order to send `SIGKILL` to brokers when there are less than a majority of controllers available. For completeness, I will note here that co-located controller nodes will always be the lowest-numbered nodes in the cluster given the way the system test code is written (https://github.com/apache/kafka/blob/trunk/tests/kafkatest/services/kafka/kafka.py#L724-L729), and that means those nodes will be started and stopped first given the way ducktape is written. So upon shutdown of a co-located cluster, once we don't have a majority of controllers, the rest of the nodes will have to be killed. For example, if we were to have 5 brokers and only 3 co-located nodes, the first 2 nodes will be controllers and can be shutdown normally, but then the last 3 will have to be killed due to lack of quorum (I discovered a minor bug in the current PR, which I am fixing with a small commit, while testing locally f or the shutdown of a cluster with 3 co-located controllers out of 5 total nodes). For starting up nodes, the question becomes whether we wait for a log message or not when starting a broker with less than a majority of co-located controllers running. It gets a bit tricky because we have to wait for both the controller and the broker to start when starting a node with `process.roles=broker,controller` would result in a majority, but if we wouldn't have a majority with this node started then we would have to grep for just the controller log message. Furthermore, any controller-related message must not match the grep that identifies both the broker and controller being started. Finally, it generally would not be the case that we would start a node with `process.roles=broker` in the system tests with less than a majority of co-located controllers because of the startup order described above -- typically all co-located controllers are started before anything without a co-located controller starts. But if we wanted to deal with it accurately in case it were to happen, then we would have to know to skip the regex altogether here. Assume for a moment that we skip the regex check if the node to be started wouldn't get us to a co-located quorum. With 3 co-located controllers this means we skip the regex check until we start the second controller node, and at that point and thereafter we wait for the broker to be fully started with the current regex check. And if there were any problems previously that we didn't catch due to not explicitly checking for a controller-specific startup message (e.g. the first node didn't start properly) then this next controller would not start and the regex would not be satisfied -- and the test would fail. This isn't ideal -- we would like to have known about the failure when the previous controller did not start properly -- but it seems like a reasonable tradeoff to be able to avoid the messiness of expanding our `grep`-based paradigm. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
