It would be cool to know whether Netflix runs Kafka in ASGs; I can't find any mention of it online. (https://github.com/Netflix/suro/wiki/FAQ sort of implies that they might, but it's not clear, and it's also old.)
I've seen other people talking about running Kafka in ASGs, e.g. http://blog.jimjh.com/building-elastic-clusters.html, but they all rely on reusing broker IDs. That certainly makes it easier, but imho it's the wrong way to do this, for all the reasons I listed before.

On Sun, Jul 3, 2016 at 11:29 AM, Charity Majors <char...@hound.sh> wrote:

> Great talks, but not relevant to either of my problems -- the golang
> client not rebalancing the consumer offset topic, or autoscaling group
> behavior (which I think is probably just a consequence of the first).
>
> Thanks though, there's good stuff in here.
>
> On Sun, Jul 3, 2016 at 10:23 AM, James Cheng <wushuja...@gmail.com> wrote:
>
>> Charity,
>>
>> I'm not sure about the specific problem you are having, but about Kafka
>> on AWS, Netflix did a talk at a meetup about their Kafka installation on
>> AWS. There might be some useful information in there. There is a video
>> stream as well as slides, and maybe you can get in touch with the speakers.
>> Look in the comment section for links to the slides and video.
>>
>> Kafka at Netflix
>>
>> http://www.meetup.com//http-kafka-apache-org/events/220355031/?showDescription=true
>>
>> There's also a talk about running Kafka on Mesos, which might be relevant.
>>
>> Kafka on Mesos
>>
>> http://www.meetup.com//http-kafka-apache-org/events/222537743/?showDescription=true
>>
>> -James
>>
>> Sent from my iPhone
>>
>> > On Jul 2, 2016, at 5:15 PM, Charity Majors <char...@hound.sh> wrote:
>> >
>> > Gwen, thanks for the response.
>> >
>> >> 1.1 Your life may be a bit simpler if you have a way of starting a new
>> >> broker with the same ID as the old one - this means it will
>> >> automatically pick up the old replicas and you won't need to
>> >> rebalance. Makes life slightly easier in some cases.
>> >
>> > Yeah, this is definitely doable, I just don't *want* to do it. I really
>> > want all of these to share the same code path: 1) rolling all nodes in an
>> > ASG to pick up a new AMI, 2) hardware failure / unintentional node
>> > termination, 3) resizing the ASG and rebalancing the data across nodes.
>> >
>> > Everything but the first one means generating new node IDs, so I would
>> > rather just do that across the board. It's the solution that really fits
>> > the ASG model best, so I'm reluctant to give up on it.
>> >
>> >> 1.2 Careful not to rebalance too many partitions at once - you only
>> >> have so much bandwidth and currently Kafka will not throttle
>> >> rebalancing traffic.
>> >
>> > Nod, got it. This is def something I plan to work on hardening once I have
>> > the basic nut of things working (or if I've had to give up on it and accept
>> > a lesser solution).
>> >
>> >> 2. I think your rebalance script is not rebalancing the offsets topic?
>> >> It still has a replica on broker 1002. You have two good replicas, so
>> >> you are nowhere near disaster, but make sure you get this working
>> >> too.
>> >
>> > Yes, this is another problem I am working on in parallel. The Shopify
>> > sarama library <https://godoc.org/github.com/Shopify/sarama> uses the
>> > __consumer_offsets topic, but it does *not* let you rebalance or resize the
>> > topic when consumers connect, disconnect, or restart.
>> >
>> > "Note that Sarama's Consumer implementation does not currently support
>> > automatic consumer-group rebalancing and offset tracking"
>> >
>> > I'm working on trying to get the sarama-cluster to do something here. I
>> > think these problems are likely related, I'm not sure wtf you are
>> > *supposed* to do to rebalance this god damn topic. It also seems like we
>> > aren't using a consumer group, which sarama-cluster depends on to rebalance
>> > a topic. I'm still pretty confused by the 0.9 "consumer group" stuff.
>> >
>> > Seriously considering downgrading to the latest 0.8 release, because
>> > there's a massive gap in documentation for the new stuff in 0.9 (like
>> > consumer groups) and we don't really need any of the new features.
>> >
>> >> A common work-around is to configure the consumer to handle "offset
>> >> out of range" exception by jumping to the last offset available in the
>> >> log. This is the behavior of the Java client, and it would have saved
>> >> your consumer here. Go client looks very low level, so I don't know
>> >> how easy it is to do that.
>> >
>> > Erf, this seems like it would almost guarantee data loss. :( Will check
>> > it out tho.
>> >
>> >> If I were you, I'd retest your ASG scripts without the auto leader
>> >> election - since your own scripts can / should handle that.
>> >
>> > Okay, this is straightforward enough. Will try it. And will keep trying
>> > to figure out how to balance the __consumer_offsets topic, since I
>> > increasingly think that's the key to this giant mess.
>> >
>> > If anyone has any advice there, massively appreciated.
>> >
>> > Thanks,
>> >
>> > charity.
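
For the stale __consumer_offsets replica on broker 1002 mentioned in the quoted thread, here is a rough sketch of how a rebalance script could build a reassignment plan that swaps a dead broker ID for a live one. It is only an illustration: the function name `plan_reassignment`, the partition layout, and every broker ID other than 1002 are made up for the example, and picking the least-loaded live broker is just one possible placement policy, not what any existing script does.

```python
import json


def plan_reassignment(topic, assignment, dead_broker, live_brokers):
    """Build a reassignment plan that replaces dead_broker in every replica list.

    assignment maps partition number -> ordered list of replica broker IDs.
    Partitions that do not reference dead_broker are left out of the plan.
    """
    # Count how many replicas each live broker already hosts, so that
    # replacement replicas are spread across the least-loaded brokers.
    load = {broker: 0 for broker in live_brokers}
    for replicas in assignment.values():
        for broker in replicas:
            if broker in load:
                load[broker] += 1

    partitions = []
    for partition in sorted(assignment):
        replicas = list(assignment[partition])
        if dead_broker not in replicas:
            continue
        # A broker may hold at most one replica of a given partition.
        candidates = [b for b in live_brokers if b not in replicas]
        if not candidates:
            raise ValueError(f"no spare broker for partition {partition}")
        target = min(candidates, key=lambda b: (load[b], b))
        load[target] += 1
        # Keep the dead broker's position so replica ordering is otherwise unchanged.
        replicas[replicas.index(dead_broker)] = target
        partitions.append(
            {"topic": topic, "partition": partition, "replicas": replicas}
        )
    return {"version": 1, "partitions": partitions}


if __name__ == "__main__":
    # Hypothetical layout: only broker 1002 comes from the thread.
    current = {
        0: [1001, 1002, 1003],
        1: [1003, 1004, 1001],
        2: [1002, 1004, 1005],
    }
    plan = plan_reassignment(
        "__consumer_offsets", current, dead_broker=1002,
        live_brokers=[1001, 1003, 1004, 1005],
    )
    print(json.dumps(plan, indent=2))
```

The resulting JSON is in the shape that kafka-reassign-partitions.sh accepts via --reassignment-json-file (run with --execute). Given the warning above about unthrottled rebalancing traffic, a real script would probably want to submit the partitions in small batches rather than all at once.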