Thanks Jason.

We did run out of disk space and noticed IOExceptions too. No, the broker
did not shut itself down. Is there some configuration that would enable
this for one or all brokers? That would be a better scenario to be in.
Right now, we have set up alerts for when disk usage goes beyond a
threshold. We have also decreased the replication factor to 2. We hope
this will be enough to avert disaster. The only worrying part is the
consumer offsets getting reset. All our systems use high-level consumers.
In some cases, we have state that can be used to prevent reprocessing of
old messages; in other cases, we have nothing that could help us here.
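
For the consumers where we do have state, the idea is roughly the sketch
below (plain Java; OffsetDedupingHandler and its fields are made-up names
standing in for our actual components, not real code from our system):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: skip messages at or below the highest offset we have
// already processed, so an offset reset does not repeat side effects.
public class OffsetDedupingHandler {

    // partition -> highest offset fully processed; in practice this would be
    // persisted durably, e.g. alongside the processing results.
    private final ConcurrentMap<Integer, Long> lastProcessed =
            new ConcurrentHashMap<Integer, Long>();

    public void handle(int partition, long offset, byte[] message) {
        Long last = lastProcessed.get(partition);
        if (last != null && offset <= last) {
            return; // offsets were rewound; we already handled this message
        }
        process(message);                     // application-specific work
        lastProcessed.put(partition, offset); // record progress after success
    }

    private void process(byte[] message) {
        // ... actual business logic ...
    }
}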

On Tue, Jun 2, 2015 at 1:26 AM, Jason Rosenberg <j...@squareup.com> wrote:

> Hi Jananee,
>
> Do you know for sure that you ran out of disk space completely? Did you
> see IOExceptions on failed writes?  Normally, when that happens, the
> broker is supposed to immediately shut itself down.  Did the one broker
> shut itself down?
>
> The NotLeaderForPartitionExceptions are normal when partition leadership
> changes and clients don't yet know about it.  They usually discover a
> leadership change by getting this failure, and then re-checking the
> partition metadata.  But this metadata request can also fail in certain
> conditions, which results in repeated NotLeaderForPartitionExceptions.
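> 
> In case it helps, here is a minimal sketch of that client-side pattern
> (the KafkaClient interface and the exception class below are hypothetical
> stand-ins for illustration, not the real client API):
> 
> // Hypothetical stand-in for whatever fetch API the client is using.
> interface KafkaClient {
>     void fetch(String topic, int partition) throws NotLeaderForPartitionException;
>     void refreshPartitionMetadata(String topic, int partition);
> }
> 
> // Local stand-in for kafka.common.NotLeaderForPartitionException, so the
> // sketch compiles on its own.
> class NotLeaderForPartitionException extends RuntimeException {}
> 
> class LeaderAwareFetcher {
>     private final KafkaClient client;
> 
>     LeaderAwareFetcher(KafkaClient client) { this.client = client; }
> 
>     void fetchWithRetry(String topic, int partition, int maxAttempts)
>             throws InterruptedException {
>         for (int attempt = 1; attempt <= maxAttempts; attempt++) {
>             try {
>                 client.fetch(topic, partition);
>                 return;
>             } catch (NotLeaderForPartitionException e) {
>                 // Leadership moved: re-read metadata to find the new
>                 // leader, back off briefly, and retry the fetch.
>                 client.refreshPartitionMetadata(topic, partition);
>                 Thread.sleep(100L * attempt);
>             }
>         }
>     }
> }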
>
> I've seen consumer offsets get reset too, if/when there's an unclean
> leader election.  E.g., if the leader goes down hard without the followers
> being up to date (perhaps that happened here, if the leader was on the
> broker with the full disk)?  I'm not sure why the consumer offsets have to
> be completely reset, but that's what I've seen too.
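> 
> (For context: on 0.8.2+ brokers, whether an out-of-sync replica may be
> elected leader is controlled by a broker setting; a one-line
> server.properties sketch, assuming that version:
> 
>     unclean.leader.election.enable=false
> 
> Setting it to false trades availability for consistency: the partition
> stays offline until an in-sync replica comes back, instead of risking the
> data loss that can rewind offsets.)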
>
> Probably the most important thing to know is that you don't want to let
> your disks fill up, so if you can add early warning/monitoring and take
> action before that happens, you'll avoid these scenarios with unclean
> leader election, etc.
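> 
> Even a very small check run from cron can be enough for the early warning;
> a rough standalone sketch (the log directory path and the 20% threshold
> are placeholders, not recommendations):
> 
> import java.io.File;
> 
> public class DiskSpaceCheck {
>     public static void main(String[] args) {
>         // Placeholder path; point this at the broker's log.dirs location.
>         File logDir = new File(args.length > 0 ? args[0] : "/var/kafka-logs");
>         double freeRatio = (double) logDir.getUsableSpace() / logDir.getTotalSpace();
>         if (freeRatio < 0.20) {
>             System.err.println("WARNING: only " + Math.round(freeRatio * 100)
>                     + "% free on " + logDir + " -- act before the broker fills up");
>         }
>     }
> }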
>
> Jason
>
> On Wed, May 27, 2015 at 10:54 AM, Jananee S <janane...@gmail.com> wrote:
>
> > Hi,
> >
> >   We have the following setup -
> >
> > Number of brokers: 3
> > Number of zookeepers: 3
> > Default replication factor: 3
> > Offsets storage: kafka
> >
> > When one of our brokers ran out of disk space, we started seeing a lot
> > of errors in the broker logs at an alarming rate. This caused the other
> > brokers to run out of disk space as well.
> >
> > ERROR [ReplicaFetcherThread-0-101813211], Error for partition [xxxx,47] to
> > broker 101813211:class kafka.common.UnknownException
> > (kafka.server.ReplicaFetcherThread)
> >
> > WARN [Replica Manager on Broker 101813211]: Fetch request with correlation
> > id 161672 from client ReplicaFetcherThread-0-101813211 on partition
> > [xxxx,11] failed due to Leader not local for partition [xxxx,11] on broker
> > 101813211 (kafka.server.ReplicaManager)
> >
> > We also noticed NotLeaderForPartitionExceptions in the producer and
> > consumer logs (also at an alarming rate):
> >
> > ERROR [2015-05-27 09:54:48,613] kafka.consumer.ConsumerFetcherThread: [
> > ConsumerFetcherThread-xxxx_prod2-1432719772385-bd7608b8-0-101813211],
> > Error for partition [yyyy,1] to broker 101813211:class
> > kafka.common.NotLeaderForPartitionException
> >
> > The __consumer_offsets topic somehow got corrupted and consumers started
> > consuming already consumed messages on restart.
> >
> > We deleted the offending topic and tried restarting the brokers and
> > zookeepers. Now we are getting lots of corrupt index errors on broker
> > startup.
> >
> > Was all this due to the replication factor being the same as the number
> > of brokers? Why would the topic files get corrupted in such a scenario?
> > Please let us know how to recover from this situation. Also, how do we
> > turn down the error logging rate?
> >
> > Thanks,
> > Jananee
> >
>
