Just out of curiosity, if you guys are in AWS for everything, why not use Kinesis?
On Tue, Jun 28, 2016 at 3:49 PM, Charity Majors <char...@hound.sh> wrote:

> Hi there,
>
> I just finished implementing kafka + autoscaling groups in a way that made sense to me. I have a _lot_ of experience with ASGs and various storage types, but I'm a kafka noob (about 4-5 months of using it in development, staging, and pre-launch production).
>
> It seems to be working fine from the Kafka POV but causing troubling side effects elsewhere that I don't understand. I don't know enough about Kafka to know if my implementation is just fundamentally flawed for some reason, or if so, how and why.
>
> My process is basically this:
>
> - *Terminate a node*, or increment the size of the ASG by one. (I'm not doing any graceful shutdowns because I don't want to rely on graceful shutdowns, and I'm not attempting to act upon more than one node at a time. Planning on doing a ZK lock or something later to enforce one process at a time, if I can work the major kinks out.)
>
> - *Firstboot script*, which runs on all hosts from rc.init. (We run ASGs for *everything*.) It infers things like the chef role, environment, cluster name, etc., registers DNS, bootstraps and runs chef-client, and so on. For storage nodes, it formats and mounts a PIOPS volume under the right mount point, or just remounts the volume if it already contains data. Etc.
>
> - *Run a balancing script from firstboot* on kafka nodes. It checks how many brokers there are and what their ids are, and checks for any underbalanced partitions with fewer than 3 ISRs. Then we generate a new assignment file for rebalancing partitions and execute it. We watch on the host for all the partitions to finish rebalancing, then complete.
>
> - *So far so good*. I have repeatedly killed kafka nodes and had them come up, rebalance the cluster, and everything on the kafka side looks healthy. All the partitions have the correct number of ISRs, etc.
>
> But after doing this, we have repeatedly gotten into a state where consumers that are pulling off the kafka partitions enter a weird state where their last known offset is *ahead* of the last known offset for that partition, and we can't recover from it.
>
> *An example.* Last night I terminated ... I think it was broker 1002 or 1005, and it came back up as broker 1009. It rebalanced on boot, and everything looked good from the kafka side. This morning we noticed that the storage node that maps to partition 5 has been broken for like 22 hours; it thinks the next offset is too far ahead / out of bounds, so it stopped consuming. This happened shortly after broker 1009 came online and the consumer caught up.
>
> From the storage node log:
>
> time="2016-06-28T21:51:48.286035635Z" level=info msg="Serving at 0.0.0.0:8089..."
> time="2016-06-28T21:51:48.293946529Z" level=error msg="Error creating consumer" error="kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition."
> time="2016-06-28T21:51:48.294532365Z" level=error msg="Failed to start services: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition."
> time="2016-06-28T21:51:48.29461156Z" level=info msg="Shutting down..."
>
> From the mysql mapping of partitions to storage nodes/statuses:
>
> PRODUCTION ubuntu@retriever-112c6d8d:/srv/hound/retriever/log$ hound-kennel
>
> Listing by default. Use -action <listkafka, nextoffset, watchlive, setstate, addslot, removeslot, removenode> for other actions
>
> Part  Status  Last Updated                   Hostname
> 0     live    2016-06-28 22:29:10 +0000 UTC  retriever-772045ec
> 1     live    2016-06-28 22:29:29 +0000 UTC  retriever-75e0e4f2
> 2     live    2016-06-28 22:29:25 +0000 UTC  retriever-78804480
> 3     live    2016-06-28 22:30:01 +0000 UTC  retriever-c0da5f85
> 4     live    2016-06-28 22:29:42 +0000 UTC  retriever-122c6d8e
> 5             2016-06-28 21:53:48 +0000 UTC
>
> PRODUCTION ubuntu@retriever-112c6d8d:/srv/hound/retriever/log$ hound-kennel -partition 5 -action nextoffset
>
> Next offset for partition 5: 12040353
>
> Interestingly, the leader for partition 5 is 1004, and its follower is the new node 1009. (Partition 2 has 1009 as its leader and 1004 as its follower, and seems just fine.)
>
> I've attached all the kafka logs for the broker 1009 node since it launched yesterday.
>
> I guess my main question is: *Is there something I am fundamentally missing about the kafka model that makes it not play well with autoscaling?* I see a couple of other people on the internet talking about using ASGs with kafka, but always in the context of maintaining a list of broker ids and reusing them.
>
> *I don't want to do that. I want the path for hardware termination, expanding the ASG size, and rolling entire ASGs to pick up new AMIs to all be the same.* I want all of these actions to be completely trivial and no big deal. Is there something I'm missing? Does anyone know why this is causing problems?
>
> Thanks so much for any help or insight anyone can provide,
> charity.
>
> P.S. Some additional details about our kafka/consumer configuration:
>
> - We autogenerate/autoincrement broker ids from zk.
>
> - We have one topic, with "many" partitions depending on the env, and a replication factor of 2 (now bumping to 3).
>
> - We have our own in-house written storage layer ("retriever") which consumes Kafka partitions. The mapping of partitions to storage nodes is stored in mysql, as well as the last known offset and some other details. Partitions currently have a 1-1 mapping with storage nodes, e.g. partition 5 => retriever-112c6d8d storage node.
>
> - We are using the golang Sarama client, with the __consumer_offsets internal topic. This also seems to have weird problems. It does not rebalance the way the docs say it is supposed to when consumers are added or restarted. (In fact, I haven't been able to figure out how to get it to rebalance or how to change the replication factor ... but I haven't really dived into this one and tried to debug it yet; I've been deep in the ASG stuff.) But I'm looking at this next, because it seems very likely related in some way: the __consumer_offsets topic seems to break at the same time. `kafkacat` and `kafka-topics --describe` output in the gist below:
>
> https://gist.github.com/charity/d83f25b5e3f4994eb202f35fae74e7d1
>
> As you can see, even though 2/3 of the __consumer_offsets replicas are online, it thinks none of them are available, despite the fact that 5 of 6 consumers are happily consuming away.
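For what it's worth, the "checks for any underbalanced partitions" step in your balancing script can be sanity-checked from the same Go/Sarama stack retriever already uses. Below is only a rough sketch of that ISR check, not your actual script: the broker address, topic name, and desired replica count are placeholders I made up, and it assumes a Sarama version whose Client interface exposes Partitions and InSyncReplicas.

package main

import (
	"log"

	"github.com/Shopify/sarama"
)

// Rough sketch of an "under-replicated partitions" check: flag any partition
// whose in-sync replica set is smaller than the replication factor we want.
func underReplicatedPartitions(client sarama.Client, topic string, desired int) ([]int32, error) {
	partitions, err := client.Partitions(topic)
	if err != nil {
		return nil, err
	}

	var suspect []int32
	for _, p := range partitions {
		isr, err := client.InSyncReplicas(topic, p)
		if err != nil {
			return nil, err
		}
		if len(isr) < desired {
			log.Printf("partition %d has %d in-sync replicas, want %d", p, len(isr), desired)
			suspect = append(suspect, p)
		}
	}
	return suspect, nil
}

func main() {
	// Placeholder broker address and topic name, not values from the thread.
	client, err := sarama.NewClient([]string{"localhost:9092"}, sarama.NewConfig())
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	log.Printf("cluster currently reports %d brokers", len(client.Brokers()))

	suspect, err := underReplicatedPartitions(client, "my-topic", 3)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("partitions needing reassignment: %v", suspect)
}

The assignment-file generation and execution would still go through whatever you're already using for that; the sketch only covers the detection half that decides whether a rebalance is needed after a new broker id joins.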
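On the "requested offset is outside the range of offsets" error in the retriever log: one way to let a consumer self-heal when a partition lands on a freshly rebuilt broker is to catch Sarama's ErrOffsetOutOfRange and restart from the oldest offset the broker still retains, rather than refusing to start. This is just a sketch of that idea, not retriever's code; the function name, the fallback-to-oldest policy, and the broker/topic values are assumptions on my part. (client.GetOffset with sarama.OffsetOldest / sarama.OffsetNewest is also handy for telling whether a stored offset like 12040353 is ahead of or behind what the broker actually has.)

package main

import (
	"log"

	"github.com/Shopify/sarama"
)

// Sketch: start consuming a partition from an offset stored externally (e.g.
// in the mysql mapping), but if that offset is out of range on the broker,
// rewind to the oldest offset it still retains instead of failing to start.
// Whether accepting that gap is OK is a policy decision, not a given.
func consumePartitionFrom(client sarama.Client, consumer sarama.Consumer, topic string, partition int32, stored int64) (sarama.PartitionConsumer, error) {
	pc, err := consumer.ConsumePartition(topic, partition, stored)
	if err == sarama.ErrOffsetOutOfRange {
		oldest, oerr := client.GetOffset(topic, partition, sarama.OffsetOldest)
		if oerr != nil {
			return nil, oerr
		}
		log.Printf("offset %d out of range for partition %d, rewinding to %d", stored, partition, oldest)
		pc, err = consumer.ConsumePartition(topic, partition, oldest)
	}
	return pc, err
}

func main() {
	// Placeholder broker address and topic name; partition 5 / offset 12040353
	// are just the values from the hound-kennel output above, for illustration.
	client, err := sarama.NewClient([]string{"localhost:9092"}, sarama.NewConfig())
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	consumer, err := sarama.NewConsumerFromClient(client)
	if err != nil {
		log.Fatal(err)
	}
	defer consumer.Close()

	pc, err := consumePartitionFrom(client, consumer, "my-topic", 5, 12040353)
	if err != nil {
		log.Fatal(err)
	}
	defer pc.Close()

	for msg := range pc.Messages() {
		log.Printf("partition %d offset %d: %d bytes", msg.Partition, msg.Offset, len(msg.Value))
	}
}

Whether rewinding to OffsetOldest (or jumping ahead to OffsetNewest and accepting the gap the other way) is acceptable depends on what retriever can tolerate; the point is just that the out-of-range condition is recoverable in the client rather than fatal.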