Thanks for the updates Dong, they look good to me. Ismael
On Wed, Mar 15, 2017 at 5:50 PM, Dong Lin <lindon...@gmail.com> wrote: > Hey Ismael, > > Sure, I have updated "Changes in Operational Procedures" section in KIP-113 > to specify the problem and solution with known disk failure. And I updated > the "Test Plan" section to note that we have test in KIP-113 to verify that > replicas already created on the good log directories will not be affected > by failure of other log directories. > > Please let me know if there is any other improvement I can make. Thanks for > your comment. > > Dong > > > On Wed, Mar 15, 2017 at 3:18 AM, Ismael Juma <ism...@juma.me.uk> wrote: > > > Hi Dong, > > > > Yes, that sounds good to me. I'd list option 2 first since that is safe > > and, as you said, no worse than what happens today. The file approach is > a > > bit hacky as you said, so it may be a bit fragile. Not sure if we really > > want to mention that. :) > > > > About the note in KIP-112 versus adding the test in KIP-113, I think it > > would make sense to add a short sentence stating that this scenario is > > covered in KIP-113. People won't necessarily read both KIPs at the same > > time and it's helpful to cross-reference when it makes sense. > > > > Thanks for your work on this. > > > > Ismael > > > > On Tue, Mar 14, 2017 at 11:00 PM, Dong Lin <lindon...@gmail.com> wrote: > > > > > Hey Ismael, > > > > > > I get your concern that it is more likely for a disk to be slow, or > > exhibit > > > other forms of non-fatal symptom, after some known fatal error. Then it > > is > > > weird for user to start broker with the likely-problematic disk in the > > > broker config. In that case, I think there are two things user can do: > > > > > > 1) Intentionally change the log directory in the config to point to a > > file. > > > This is a bit hacky but it works well before we make more-appropriate > > > long-term change in Kafka to handle this case. > > > 2) Just don't start broker with bad log directories. Always fix disk > > before > > > restarting the broker. This is a safe approach that is no worse than > > > current practice. > > > > > > Would this address your concern if I specify the problem and the two > > > solutions in the KIP? > > > > > > Thanks, > > > Dong > > > > > > On Tue, Mar 14, 2017 at 3:29 PM, Dong Lin <lindon...@gmail.com> wrote: > > > > > > > Hey Ismael, > > > > > > > > Thanks for the comment. Please see my reply below. > > > > > > > > On Tue, Mar 14, 2017 at 10:31 AM, Ismael Juma <ism...@juma.me.uk> > > wrote: > > > > > > > >> Thanks Dong. Comments inline. > > > >> > > > >> On Fri, Mar 10, 2017 at 6:25 PM, Dong Lin <lindon...@gmail.com> > > wrote: > > > >> > > > > >> > I get your point. But I am not sure we should recommend user to > > simply > > > >> > remove disk from the broker config. If user simply does this > without > > > >> > checking the utilization of good disks, replica on the bad disk > will > > > be > > > >> > re-created on the good disk and may overload the good disks, > causing > > > >> > cascading failure. > > > >> > > > > >> > > > >> Good point. > > > >> > > > >> > > > >> > > > > >> > I agree with you and Colin that slow disk may cause problem. > > However, > > > >> > performance degradation due to slow disk this is an existing > problem > > > >> that > > > >> > is not detected or handled by Kafka or KIP-112. > > > >> > > > >> > > > >> I think an important difference is that a number of disk errors are > > > >> currently fatal and won't be after KIP-112. So it introduces new > > > scenarios > > > >> (for example, bouncing a broker that is working fine although some > > disks > > > >> have been marked bad). > > > >> > > > > > > > > Hmm.. I am still trying to understand why KIP-112 creates new > > scenarios. > > > > Slow disk is not considered fatal error and won't be caught by either > > > > existing Kafka design or this KIP. If any disk is marked bad, it > means > > > > broker encounters IOException when accessing disk, most likely the > > broker > > > > will encounter IOException again when accessing this disk and mark > this > > > > disk as bad after bounce. I guess you are talking about the case > that a > > > > disk is marked bad, broker is bounced, then the disk provides > degraded > > > > performance without being marked bad, right? But this seems to be an > > > > existing problem we already have today with slow disk. > > > > > > > > Here are the possible scenarios with bad disk after broker bounce: > > > > > > > > 1) bad disk -> broker bounce -> good disk. This would be great. > > > > 2) bad disk -> broker bounce -> slow disk. Slow disk is an existing > > > > problem that is not addressed by Kafka today. > > > > 3) bad disk -> broker bounce -> bad disk. This is handled by this KIP > > > such > > > > that only replicas on the bad disk become offline. > > > > > > > > > > > >> > > > >> > Detection and handling of > > > >> > slow disk is a separate problem that needs to be addressed in a > > future > > > >> KIP. > > > >> > It is currently listed in the future work. Does this sound OK? > > > >> > > > > >> > > > >> I'm OK with it being handled in the future. In the meantime, I was > > just > > > >> hoping that we can make it clear to users about the potential issue > > of a > > > >> disk marked as bad becoming good again after a bounce (which can be > > > >> dangerous). > > > >> > > > >> The main benefit of creating the second topic after log directory > goes > > > >> > offline is that we can make sure the second topic is created on > the > > > good > > > >> > log directory. I am not sure we can simply assume that the first > > topic > > > >> will > > > >> > always be created on the first log directory in the broker config > > and > > > >> the > > > >> > second topic will be created on the second log directory in the > > broker > > > >> > config. > > > >> > > > >> > > > >> > > > >> > However, I can add this test in KIP-113 which allows user to > > > >> > re-assign replica to specific log directory of a broker. Is this > OK? > > > >> > > > > >> > > > >> OK. Please add a note to KIP-112 about this as well (so that it's > > clear > > > >> why > > > >> we only do it in KIP-113). > > > >> > > > > > > > > Sure. Instead of adding note to KIP-112, I have added test in KIP-113 > > to > > > > verify that bad log directories discovered during runtime would not > > > affect > > > > replicas on the good log directories. Does this address the problem? > > > > > > > > > > > >> Ismael > > > >> > > > > > > > > > > > > > >