Can't he get this automatically, though, with Sriram's controlled shutdown stuff?
-Jay

On Thu, Aug 29, 2013 at 2:12 PM, Neha Narkhede <neha.narkh...@gmail.com> wrote:

>> How do you automate waiting for the broker to come up? Just keep
>> monitoring the process and keep trying to connect to the port?
>
> Every leader in a Kafka cluster exposes the UnderReplicatedPartitionCount
> metric. The safest way to issue a controlled shutdown is to wait until
> that metric reports 0 on the brokers. If you try to shut down the last
> broker in the ISR, the controlled shutdown cannot succeed, since there is
> no other broker to move the leader to. Waiting until the under-replicated
> partition count hits 0 prevents you from hitting this issue.
>
> This also solves the problem of waiting until the broker comes up, since
> you will automatically wait until the broker comes up and joins the ISR.
>
> Thanks,
> Neha
>
> On Thu, Aug 29, 2013 at 12:59 PM, Sam Meder <sam.me...@jivesoftware.com> wrote:
>
>> Ok, I spent some more time staring at our logs and figured out that it
>> was our fault. We were not waiting around for the Kafka broker to fully
>> initialize before moving on to the next broker, and loading the data
>> logs can take quite some time (~7 minutes in one case), so we ended up
>> with no replicas online at some point, and the replica that came back
>> first was a little short on data...
>>
>> How do you automate waiting for the broker to come up? Just keep
>> monitoring the process and keep trying to connect to the port?
>>
>> /Sam
>>
>> On Aug 29, 2013, at 6:40 PM, Sam Meder <sam.me...@jivesoftware.com> wrote:
>>
>>> On Aug 29, 2013, at 5:50 PM, Sriram Subramanian <srsubraman...@linkedin.com> wrote:
>>>
>>>> Do you know why you timed out on a regular shutdown?
>>>
>>> No, though I think it may just have been that the timeout we put in
>>> was too short.
>>>
>>>> If the replica had fallen off of the ISR and shutdown was forced on
>>>> the leader, this could happen.
>>>
>>> Hmm, but it shouldn't really be made leader if it isn't even in the
>>> ISR, should it?
>>>
>>> /Sam
>>>
>>>> With ack = -1, we guarantee that all the replicas in the in-sync set
>>>> have received the message before exposing the message to the
>>>> consumer.
>>>>
>>>> On 8/29/13 8:32 AM, "Sam Meder" <sam.me...@jivesoftware.com> wrote:
>>>>
>>>>> We've recently come across a scenario where we see consumers
>>>>> resetting their offsets to earliest, which, as far as I can tell,
>>>>> may also lead to data loss (we're running with ack = -1 to avoid
>>>>> loss). This seems to happen when we time out on doing a regular
>>>>> shutdown and instead kill -9 the Kafka broker, but it obviously
>>>>> applies to any scenario that involves an unclean exit. As far as I
>>>>> can tell, what happens is:
>>>>>
>>>>> 1. On restart the broker truncates the data for the affected
>>>>> partitions, i.e. not all data was written to disk.
>>>>> 2. The new broker then becomes a leader for the affected partitions
>>>>> and consumers get confused because they've already consumed beyond
>>>>> the now available offset.
>>>>>
>>>>> Does that seem like a possible failure scenario?
>>>>>
>>>>> /Sam
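
For anyone scripting the rolling-bounce sequence Neha describes, something
like the sketch below can serve as the wait step: it polls the broker's
under-replicated partition gauge over JMX and blocks until it reports 0.
This is only a sketch, assuming JMX is enabled on port 9999 (JMX_PORT) and
using the ObjectName from recent Kafka releases; 0.8-era brokers register
the same gauge under a slightly different (quoted) name, so check jconsole
for the exact ObjectName on your version.

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class WaitForUnderReplicatedZero {
        public static void main(String[] args) throws Exception {
            String host = args.length > 0 ? args[0] : "localhost";
            String port = args.length > 1 ? args[1] : "9999"; // broker's JMX_PORT

            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                // Metric name as registered by recent releases; older 0.8
                // brokers use quoted key values, e.g.
                // "kafka.server":type="ReplicaManager",name="UnderReplicatedPartitions"
                ObjectName gauge = new ObjectName(
                        "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions");
                while (true) {
                    int underReplicated =
                            ((Number) mbs.getAttribute(gauge, "Value")).intValue();
                    if (underReplicated == 0) {
                        break; // broker has rejoined the ISR everywhere; safe to move on
                    }
                    System.out.println(underReplicated
                            + " under-replicated partitions, waiting...");
                    Thread.sleep(5000); // poll every 5 seconds
                }
            }
            System.out.println("Under-replicated count is 0; safe to bounce the next broker.");
        }
    }

Running this between broker restarts also answers Sam's original question:
you never have to guess when log recovery has finished, because the count
only drops to 0 once the restarted broker is back in the ISR.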
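Since ack = -1 came up: with the 0.8-era producer, the all-ISR
acknowledgement Sriram describes is requested via the
request.required.acks property (newer clients spell it acks=all). A
minimal sketch using the old Scala producer's Java API, with hypothetical
broker hosts and a made-up topic name:

    import java.util.Properties;
    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    public class AckAllProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("metadata.broker.list", "broker1:9092,broker2:9092"); // hypothetical hosts
            props.put("serializer.class", "kafka.serializer.StringEncoder");
            // -1 = the leader waits for the full in-sync replica set before
            // acknowledging, which is the guarantee discussed in this thread
            props.put("request.required.acks", "-1");

            Producer<String, String> producer =
                    new Producer<String, String>(new ProducerConfig(props));
            producer.send(new KeyedMessage<String, String>("events", "some message"));
            producer.close();
        }
    }

Note that, as the thread shows, acks alone do not help if the replication
factor has effectively dropped to one during a rolling restart.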