Gwen, thanks for the response.

> 1.1 Your life may be a bit simpler if you have a way of starting a new
> broker with the same ID as the old one - this means it will
> automatically pick up the old replicas and you won't need to
> rebalance. Makes life slightly easier in some cases.

Yeah, this is definitely doable, I just don't *want* to do it. I really
want all of these to share the same code path: 1) rolling all nodes in
an ASG to pick up a new AMI, 2) hardware failure / unintentional node
termination, 3) resizing the ASG and rebalancing the data across nodes.
Everything but the first one means generating new node IDs, so I would
rather just do that across the board. It's the solution that best fits
the ASG model, so I'm reluctant to give up on it.

> 1.2 Careful not to rebalance too many partitions at once - you only
> have so much bandwidth and currently Kafka will not throttle
> rebalancing traffic.

Nod, got it. This is definitely something I plan to harden once I have
the basic nut of things working (or once I've had to give up and accept
a lesser solution).

> 2. I think your rebalance script is not rebalancing the offsets topic?
> It still has a replica on broker 1002. You have two good replicas, so
> you are nowhere near disaster, but make sure you get this working
> too.

Yes, this is another problem I am working on in parallel. The Shopify
sarama library <https://godoc.org/github.com/Shopify/sarama> uses the
__consumer_offsets topic, but it does *not* let you rebalance or resize
the topic when consumers connect, disconnect, or restart: "Note that
Sarama's Consumer implementation does not currently support automatic
consumer-group rebalancing and offset tracking." I'm working on trying
to get sarama-cluster to do something here. I think these problems are
likely related; I'm still not sure wtf you are *supposed* to do to
rebalance this goddamn topic.
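For my own notes, my best guess at the broker-side half of this: if I'm
reading the docs right, the stock kafka-reassign-partitions.sh tool
should treat __consumer_offsets like any other topic. A rough sketch
(the broker IDs here are made up for illustration, and the topic
defaults to 50 partitions via offsets.topic.num.partitions, so the real
JSON needs an entry per partition -- or use the tool's --generate mode
to produce a full candidate assignment first):

    $ cat move-offsets.json
    {
      "version": 1,
      "partitions": [
        {"topic": "__consumer_offsets", "partition": 0,
         "replicas": [1001, 1003, 1004]}
      ]
    }

    $ kafka-reassign-partitions.sh --zookeeper zk:2181 \
        --reassignment-json-file move-offsets.json --execute

    $ kafka-reassign-partitions.sh --zookeeper zk:2181 \
        --reassignment-json-file move-offsets.json --verify

If that works it would at least get the replica off dead broker 1002,
but it still wouldn't explain the consumer-group side of this.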
It also seems like we aren't using a consumer group, which
sarama-cluster depends on in order to rebalance a topic. I'm still
pretty confused by the 0.9 "consumer group" stuff (rough sketch of what
I'm attempting with sarama-cluster in the P.S. below). Seriously
considering downgrading to the latest 0.8 release, because there's a
massive gap in documentation for the new stuff in 0.9 (like consumer
groups) and we don't really need any of the new features.

> A common work-around is to configure the consumer to handle "offset
> out of range" exception by jumping to the last offset available in the
> log. This is the behavior of the Java client, and it would have saved
> your consumer here. Go client looks very low level, so I don't know
> how easy it is to do that.

Erf, this seems like it would almost guarantee data loss. :( Will check
it out tho.

> If I were you, I'd retest your ASG scripts without the auto leader
> election - since your own scripts can / should handle that.

Okay, this is straightforward enough. Will try it. And will keep trying
to figure out how to balance the __consumer_offsets topic, since I
increasingly think that's the key to this giant mess. If anyone has any
advice there, massively appreciated.

Thanks,
charity.
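P.S. In case it helps anyone else fighting this, here is roughly the
shape of the sarama-cluster consumer I'm trying to get working. This is
a minimal sketch assuming the bsm/sarama-cluster API; the broker
address, group, and topic names are placeholders:

    package main

    import (
        "log"

        "github.com/Shopify/sarama"
        cluster "github.com/bsm/sarama-cluster"
    )

    func main() {
        config := cluster.NewConfig()
        // If the group has no committed offset (or the committed offset
        // has fallen off the log), start from the oldest available
        // message instead of the newest, to avoid silently skipping data.
        config.Consumer.Offsets.Initial = sarama.OffsetOldest
        config.Consumer.Return.Errors = true

        // Joining a group is what triggers partition rebalancing across
        // consumers as they connect, disconnect, or restart.
        consumer, err := cluster.NewConsumer(
            []string{"broker1:9092"}, "my-group", []string{"my-topic"}, config)
        if err != nil {
            log.Fatalln(err)
        }
        defer consumer.Close()

        go func() {
            for err := range consumer.Errors() {
                log.Println("consume error:", err)
            }
        }()

        for msg := range consumer.Messages() {
            log.Printf("%s/%d@%d: %s", msg.Topic, msg.Partition, msg.Offset, msg.Value)
            // Mark the message as processed; offsets get committed back
            // to __consumer_offsets for the group.
            consumer.MarkOffset(msg, "")
        }
    }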