rondagostino commented on pull request #10991:
URL: https://github.com/apache/kafka/pull/10991#issuecomment-877430917


   TL;DR: my preference is to skip the regex check rather than adding a new 
message and carefully/appropriately grepping for it.
   
   > I think you do need to special case this and send SIGKILL to the last n / 
2 (round down) combined nodes 
   
   Yeah, we have to keep track of the number of co-located controllers that are 
running in order to send `SIGKILL` to brokers when there are less than a 
majority of controllers available.  For completeness, I will note here that 
co-located controller nodes will always be the lowest-numbered nodes in the 
cluster given the way the system test code is written 
(https://github.com/apache/kafka/blob/trunk/tests/kafkatest/services/kafka/kafka.py#L724-L729),
 and that means those nodes will be started and stopped first given the way 
ducktape is written.  So upon shutdown of a co-located cluster, once we don't 
have a majority of controllers, the rest of the nodes will have to be killed.  
For example, if we were to have 5 brokers and only 3 co-located nodes, the 
first 2 nodes will be controllers and can be shutdown normally, but then the 
last 3 will have to be killed due to lack of quorum (I discovered a minor bug 
in the current PR, which I am fixing with a small commit, while testing locally 
f
 or the shutdown of a cluster with 3 co-located controllers out of 5 total 
nodes).
   
   For starting up nodes, the question becomes whether we wait for a log 
message or not when starting a broker with less than a majority of co-located 
controllers running.  It gets a bit tricky because we have to wait for both the 
controller and the broker to start when starting a node with 
`process.roles=broker,controller` would result in a majority, but if we 
wouldn't have a majority with this node started then we would have to grep for 
just the controller log message.  Furthermore, any controller-related message 
must not match the grep that identifies both the broker and controller being 
started.
   
   Finally, it generally would not be the case that we would start a node with 
`process.roles=broker` in the system tests with less than a majority of 
co-located controllers because of the startup order described above -- 
typically all co-located controllers are started before anything without a 
co-located controller starts.   But if we wanted to deal with it accurately in 
case it were to happen, then we would have to know to skip the regex altogether 
here.
   
   Assume for a moment that we skip the regex check if the node to be started 
wouldn't get us to a co-located quorum.  With 3 co-located controllers this 
means we skip the regex check until we start the second controller node, and at 
that point and thereafter we wait for the broker to be fully started with the 
current regex check.  And if there were any problems previously that we didn't 
catch due to not explicitly checking for a controller-specific startup message 
(e.g. the first node didn't start properly) then this next controller would not 
start and the regex would not be satisfied -- and the test would fail.  This 
isn't ideal -- we would like to have known about the failure when the previous 
controller did not start properly -- but it seems like a reasonable tradeoff to 
be able to avoid the messiness of expanding our `grep`-based paradigm.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to