Can't he get this automatically, though, with Sriram's controlled shutdown stuff?
-Jay

On Thu, Aug 29, 2013 at 2:12 PM, Neha Narkhede <neha.narkh...@gmail.com> wrote:

>> How do you automate waiting for the broker to come up? Just keep
>> monitoring the process and keep trying to connect to the port?
>
> Every leader in a Kafka cluster exposes the UnderReplicatedPartitionCount
> metric. The safest way to issue a controlled shutdown is to wait until
> that metric reports 0 on the brokers. If you try to shut down the last
> broker in the ISR, the controlled shutdown cannot succeed, since there is
> no other broker to move the leader to. Waiting until the under-replicated
> partition count hits 0 prevents you from hitting this issue.
>
> This also solves the problem of waiting until the broker comes up, since
> you will automatically wait until the broker comes up and joins the ISR.
>
> Thanks,
> Neha
>
> On Thu, Aug 29, 2013 at 12:59 PM, Sam Meder <sam.me...@jivesoftware.com> wrote:
>
>> Ok, I spent some more time staring at our logs and figured out that it
>> was our fault. We were not waiting around for the Kafka broker to fully
>> initialize before moving on to the next broker, and loading the data
>> logs can take quite some time (~7 minutes in one case), so we ended up
>> with no replicas online at some point, and the replica that came back
>> first was a little short on data...
>>
>> How do you automate waiting for the broker to come up? Just keep
>> monitoring the process and keep trying to connect to the port?
>>
>> /Sam
>>
>> On Aug 29, 2013, at 6:40 PM, Sam Meder <sam.me...@jivesoftware.com> wrote:
>>
>>> On Aug 29, 2013, at 5:50 PM, Sriram Subramanian <srsubraman...@linkedin.com> wrote:
>>>
>>>> Do you know why you timed out on a regular shutdown?
>>>
>>> No, though I think it may just have been that the timeout we put in
>>> was too short.
>>>
>>>> If the replica had fallen off of the ISR and shutdown was forced on
>>>> the leader, this could happen.
>>>
>>> Hmm, but it shouldn't really be made leader if it isn't even in the
>>> ISR, should it?
>>>
>>> /Sam
>>>
>>>> With ack = -1, we guarantee that all the replicas in the in-sync set
>>>> have received the message before exposing the message to the
>>>> consumer.
>>>>
>>>> On 8/29/13 8:32 AM, "Sam Meder" <sam.me...@jivesoftware.com> wrote:
>>>>
>>>>> We've recently come across a scenario where we see consumers
>>>>> resetting their offsets to earliest, which, as far as I can tell,
>>>>> may also lead to data loss (we're running with ack = -1 to avoid
>>>>> loss). This seems to happen when we time out on doing a regular
>>>>> shutdown and instead kill -9 the Kafka broker, but it obviously
>>>>> applies to any scenario that involves an unclean exit. As far as I
>>>>> can tell, what happens is:
>>>>>
>>>>> 1. On restart the broker truncates the data for the affected
>>>>> partitions, i.e. not all data was written to disk.
>>>>> 2. The new broker then becomes a leader for the affected partitions
>>>>> and consumers get confused because they've already consumed beyond
>>>>> the now available offset.
>>>>>
>>>>> Does that seem like a possible failure scenario?
>>>>>
>>>>> /Sam
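
For anyone scripting the rolling-bounce sequence Neha describes, something
like the sketch below can serve as the wait step: it polls the broker's
under-replicated partition gauge over JMX and blocks until it reports 0.
This is only a sketch, assuming JMX is enabled on port 9999 (JMX_PORT) and
using the ObjectName from recent Kafka releases; 0.8-era brokers register
the same gauge under a slightly different (quoted) name, so check jconsole
for the exact ObjectName on your version.

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class WaitForUnderReplicatedZero {
        public static void main(String[] args) throws Exception {
            String host = args.length > 0 ? args[0] : "localhost";
            String port = args.length > 1 ? args[1] : "9999"; // broker's JMX_PORT

            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                // Metric name as registered by recent releases; older 0.8
                // brokers use quoted key values, e.g.
                // "kafka.server":type="ReplicaManager",name="UnderReplicatedPartitions"
                ObjectName gauge = new ObjectName(
                        "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions");
                while (true) {
                    int underReplicated =
                            ((Number) mbs.getAttribute(gauge, "Value")).intValue();
                    if (underReplicated == 0) {
                        break; // broker has rejoined the ISR everywhere; safe to move on
                    }
                    System.out.println(underReplicated
                            + " under-replicated partitions, waiting...");
                    Thread.sleep(5000); // poll every 5 seconds
                }
            }
            System.out.println("Under-replicated count is 0; safe to bounce the next broker.");
        }
    }

Running this between broker restarts also answers Sam's original question:
you never have to guess when log recovery has finished, because the count
only drops to 0 once the restarted broker is back in the ISR.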
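Since ack = -1 came up: with the 0.8-era producer, the all-ISR
acknowledgement Sriram describes is requested via the
request.required.acks property (newer clients spell it acks=all). A
minimal sketch using the old Scala producer's Java API, with hypothetical
broker hosts and a made-up topic name:

    import java.util.Properties;
    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    public class AckAllProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("metadata.broker.list", "broker1:9092,broker2:9092"); // hypothetical hosts
            props.put("serializer.class", "kafka.serializer.StringEncoder");
            // -1 = the leader waits for the full in-sync replica set before
            // acknowledging, which is the guarantee discussed in this thread
            props.put("request.required.acks", "-1");

            Producer<String, String> producer =
                    new Producer<String, String>(new ProducerConfig(props));
            producer.send(new KeyedMessage<String, String>("events", "some message"));
            producer.close();
        }
    }

Note that, as the thread shows, acks alone do not help if the replication
factor has effectively dropped to one during a rolling restart.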