Hey Jun,

I am happy to work on it for a few days if that is what it takes to discuss KIP-113. But if it takes 2+ weeks to discuss KIP-113, I am wondering if we can vote for KIP-112 first. We aim to start the JBOD test in a test cluster by the end of Q1, which is only three weeks away. If we miss that schedule, we may have to do a one-broker-per-disk setup first, which causes unnecessary trouble when migrating to JBOD later. It will greatly help me meet this schedule if we can vote for KIP-112 and get the patch reviewed while KIP-113 is under discussion.
It seems reasonable to vote for KIP-112 before KIP-113. The reason is that we have agreed on the motivation of both KIPs and we already have a compelling case for JBOD. What we need to discuss for KIP-113 is its implementation details rather than its motivation. Whatever implementation we end up with for KIP-113 should be implementable on top of KIP-112 without any backward-incompatible change. I understand that you are busy reviewing many things, so I would prefer not to ask you for a quick response on KIP-113. If KIP-112 is blocked on KIP-113, I would be under some pressure to bother you repeatedly. Allowing me to work on the KIP-112 implementation would make our schedule easier. :)

Thank you for all the time you have spent reviewing these KIPs!

Dong

On Thu, Mar 2, 2017 at 9:36 AM, Jun Rao <j...@confluent.io> wrote: > Hi, Dong, > > Ok. We can keep LeaderAndIsrRequest as it is in the wiki. > > Since we need both KIP-112 and KIP-113 to make a compelling case for JBOD, > perhaps we should discuss KIP-113 before voting for both? I left some > comments in the other thread. > > Thanks, > > Jun > > On Wed, Mar 1, 2017 at 1:58 PM, Dong Lin <lindon...@gmail.com> wrote: > > > Hey Jun, > > > > Do you think it is OK to keep the existing wire protocol in the KIP? I am > > wondering if we can initiate vote for this KIP. > > > > Thanks, > > Dong > > > > > > > > On Tue, Feb 28, 2017 at 2:41 PM, Dong Lin <lindon...@gmail.com> wrote: > > > > > Hey Jun, > > > > > > I just realized that StopReplicaRequest itself doesn't specify the > > > replicaId in the wire protocol. Thus controller would need to log the > > > brokerId with StopReplicaRequest in the log. Thus it may be > > > reasonable for controller to do the same with LeaderAndIsrRequest and > > only > > > specify the isNewReplica for the broker that receives > > LeaderAndIsrRequest. > > > > > > Thanks, > > > Dong > > > > > > On Tue, Feb 28, 2017 at 2:14 PM, Dong Lin <lindon...@gmail.com> wrote: > > > > > >> Hi Jun, > > >> > > >> Yeah there is tradeoff between controller's implementation complexity > > vs. > > >> wire-protocol complexity. I personally think it is more important to > > keep > > >> wire-protocol concise and only add information in wire-protocol if > > >> necessary. It seems fine to add a little bit complexity to > controller's > > >> implementation, e.g. log destination broker per LeaderAndIsrRequet. > > Becket > > >> also shares this opinion with me. Is the only purpose of doing so to > > make > > >> controller log simpler? > > >> > > >> And certainly, I have added Todd's comment in the wiki. > > >> > > >> Thanks, > > >> Dong > > >> > > >> > > >> On Tue, Feb 28, 2017 at 1:37 PM, Jun Rao <j...@confluent.io> wrote: > > >> > > >>> Hi, Dong, > > >>> > > >>> 52. What you suggested would work. However, I am thinking that it's > > >>> probably simpler to just set isNewReplica at the replica level. That > > way, > > >>> the LeaderAndIsrRequest can be created a bit simpler. When reading a > > >>> LeaderAndIsrRequest in the controller log, it's easier to see which > > >>> replicas are new without looking at which broker the request is > > intended > > >>> for. > > >>> > > >>> Could you also add those additional points from Todd's on 1 broker > per > > >>> disk > > >>> vs JBOD vs RAID5/6 to the KIP? > > >>> > > >>> Thanks, > > >>> > > >>> Hi, Todd, > > >>> > > >>> Thanks for the feedback. That's very useful. 
> > >>> > > >>> Jun > > >>> > > >>> On Tue, Feb 28, 2017 at 10:25 AM, Dong Lin <lindon...@gmail.com> > > wrote: > > >>> > > >>> > Hey Jun, > > >>> > > > >>> > Certainly, I have added Todd to reply to the thread. And I have > > >>> updated the > > >>> > item to in the wiki. > > >>> > > > >>> > 50. The full statement is "Broker assumes a log directory to be > good > > >>> after > > >>> > it starts, and mark log directory as bad once there is IOException > > when > > >>> > broker attempts to access (i.e. read or write) the log directory". > > This > > >>> > statement seems reasonable, right? If a log directory is actually > > bad, > > >>> then > > >>> > the broker will first assume it is OK, try to read logs on this log > > >>> > directory, encounter IOException, and then mark it as bad. > > >>> > > > >>> > 51. My bad. I thought I removed it but I didn't. It is removed now. > > >>> > > > >>> > 52. I don't think so.. The isNewReplica field in the > > >>> LeaderAndIsrRequest is > > >>> > only relevant to the replica (i.e. broker) that receives the > > >>> > LeaderAndIsrRequest. There is no need to specify whether each > replica > > >>> is > > >>> > new inside LeaderAndIsrRequest. In other words, if a broker sends > > >>> > LeaderAndIsrRequest to three different replicas of a given > partition, > > >>> the > > >>> > isNewReplica field can be different across these three requests. > > >>> > > > >>> > Yeah, I would definitely want to start discussion on KIP-113 after > we > > >>> have > > >>> > reached agreement on KIP-112. I have actually opened KIP-113 > > discussion > > >>> > thread on 1/12 together with this thread. I have yet to add the > > >>> ability to > > >>> > list offline directories in KIP-113 which we discussed in this > > thread. > > >>> > > > >>> > Thanks for all your reviews! Is there further concern with the > latest > > >>> KIP? > > >>> > > > >>> > Thanks! > > >>> > Dong > > >>> > > > >>> > On Tue, Feb 28, 2017 at 9:40 AM, Jun Rao <j...@confluent.io> wrote: > > >>> > > > >>> > > Hi, Dong, > > >>> > > > > >>> > > RAID6 is an improvement over RAID5 and can tolerate 2 disks > > failure. > > >>> > Eno's > > >>> > > point is that the rebuild of RAID5/RAID6 requires reading more > data > > >>> > > compared with RAID10, which increases the probability of error > > during > > >>> > > rebuild. This makes sense. In any case, do you think you could > ask > > >>> the > > >>> > SREs > > >>> > > at LinkedIn to share their opinions on RAID5/RAID6? > > >>> > > > > >>> > > Yes, when a replica is offline due to a bad disk, it makes sense > to > > >>> > handle > > >>> > > it immediately as if a StopReplicaRequest is received (i.e., > > replica > > >>> is > > >>> > no > > >>> > > longer considered a leader and is removed from any replica > fetcher > > >>> > thread). > > >>> > > Could you add that detail in item 2. in the wiki? > > >>> > > > > >>> > > 50. The wiki says "Broker assumes a log directory to be good > after > > it > > >>> > > starts" : A log directory actually could be bad during startup. > > >>> > > > > >>> > > 51. In item 4, the wiki says "The controller watches the path > > >>> > > /log_dir_event_notification for new znode.". This doesn't seem be > > >>> needed > > >>> > > now? > > >>> > > > > >>> > > 52. The isNewReplica field in LeaderAndIsrRequest should be for > > each > > >>> > > replica inside the replicas field, right? > > >>> > > > > >>> > > Other than those, the current KIP looks good to me. 
Do you want > to > > >>> start > > >>> > a > > >>> > > separate discussion thread on KIP-113? I do have some comments > > there. > > >>> > > > > >>> > > Thanks for working on this! > > >>> > > > > >>> > > Jun > > >>> > > > > >>> > > > > >>> > > On Mon, Feb 27, 2017 at 5:51 PM, Dong Lin <lindon...@gmail.com> > > >>> wrote: > > >>> > > > > >>> > > > Hi Jun, > > >>> > > > > > >>> > > > In addition to the Eno's reference of why rebuild time with > > RAID-5 > > >>> is > > >>> > > more > > >>> > > > expensive, another concern is that RAID-5 will fail if more > than > > >>> one > > >>> > disk > > >>> > > > fails. JBOD is still works with 1+ disk failure and has better > > >>> > > performance > > >>> > > > with one disk failure. These seems like good argument for using > > >>> JBOD > > >>> > > > instead of RAID-5. > > >>> > > > > > >>> > > > If a leader replica goes offline, the broker should first take > > all > > >>> > > actions > > >>> > > > (i.e. remove the partition from fetcher thread) as if it has > > >>> received > > >>> > > > StopReplicaRequest for this partition because the replica can > no > > >>> longer > > >>> > > > work anyway. It will also respond with error to any > > ProduceRequest > > >>> and > > >>> > > > FetchRequest for partition. The broker notifies controller by > > >>> writing > > >>> > > > notification znode in ZK. The controller learns the disk > failure > > >>> event > > >>> > > from > > >>> > > > ZK, sends LeaderAndIsrRequest and receives LeaderAndIsrResponse > > to > > >>> > learn > > >>> > > > that the replica is offline. The controller will then elect new > > >>> leader > > >>> > > for > > >>> > > > this partition and sends LeaderAndIsrRequest/MetadataUp > > dateRequest > > >>> to > > >>> > > > relevant brokers. The broker should stop adjusting the ISR for > > this > > >>> > > > partition as if the broker is already offline. I am not sure > > there > > >>> is > > >>> > any > > >>> > > > inconsistency in broker's behavior when it is leader or > follower. > > >>> Is > > >>> > > there > > >>> > > > any concern with this approach? > > >>> > > > > > >>> > > > Thanks for catching this. I have removed that reference from > the > > >>> KIP. > > >>> > > > > > >>> > > > Hi Eno, > > >>> > > > > > >>> > > > Thank you for providing the reference of the RAID-5. In > LinkedIn > > we > > >>> > have > > >>> > > 10 > > >>> > > > disks per Kafka machine. It will not be a show-stopper > > >>> operationally > > >>> > for > > >>> > > > LinkedIn if we have to deploy one-broker-per-disk. On the other > > >>> hand we > > >>> > > > previously discussed the advantage of JBOD vs. > > one-broker-per-disk > > >>> or > > >>> > > > one-broker-per-machine. One-broker-per-disk suffers from the > > >>> problems > > >>> > > > described in the KIP and one-broker-per-machine increases the > > >>> failure > > >>> > > > caused by disk failure by 10X. Since JBOD is strictly better > than > > >>> > either > > >>> > > of > > >>> > > > the two, it is also better then one-broker-per-multiple-disk > > which > > >>> is > > >>> > > > somewhere between one-broker-per-disk and > one-broker-per-machine. > > >>> > > > > > >>> > > > I personally think the benefits of JBOD design is worth the > > >>> > > implementation > > >>> > > > complexity it introduces. I would also argue that it is > > reasonable > > >>> for > > >>> > > > Kafka to manage this low level detail because Kafka is already > > >>> exposing > > >>> > > and > > >>> > > > managing replication factor of its data. 
But whether the > > >>> complexity is > > >>> > > > worthwhile can be subjective and I can not prove my opinion. I > am > > >>> > > > contributing significant amount of time to do this KIP because > > >>> Kafka > > >>> > > > develops at LinkedIn believes it is useful and worth the > effort. > > >>> Yeah, > > >>> > it > > >>> > > > will be useful to see what everyone else think about it. > > >>> > > > > > >>> > > > > > >>> > > > Thanks, > > >>> > > > Dong > > >>> > > > > > >>> > > > > > >>> > > > On Mon, Feb 27, 2017 at 1:16 PM, Jun Rao <j...@confluent.io> > > wrote: > > >>> > > > > > >>> > > > > Hi, Dong, > > >>> > > > > > > >>> > > > > For RAID5, I am not sure the rebuild cost is a big concern. > If > > a > > >>> disk > > >>> > > > > fails, typically an admin has to bring down the broker, > replace > > >>> the > > >>> > > > failed > > >>> > > > > disk with a new one, trigger the RAID rebuild, and bring up > the > > >>> > broker. > > >>> > > > > This way, there is no performance impact at runtime due to > > >>> rebuild. > > >>> > The > > >>> > > > > benefit is that a broker doesn't fail in a hard way when > there > > >>> is a > > >>> > > disk > > >>> > > > > failure and can be brought down in a controlled way for > > >>> maintenance. > > >>> > > > While > > >>> > > > > the broker is running with a failed disk, reads may be more > > >>> expensive > > >>> > > > since > > >>> > > > > they have to be computed from the parity. However, if most > > reads > > >>> are > > >>> > > from > > >>> > > > > page cache, this may not be a big issue either. So, it would > be > > >>> > useful > > >>> > > to > > >>> > > > > do some tests on RAID5 before we completely rule it out. > > >>> > > > > > > >>> > > > > Regarding whether to remove an offline replica from the > fetcher > > >>> > thread > > >>> > > > > immediately. What do we do when a failed replica is a leader? > > Do > > >>> we > > >>> > do > > >>> > > > > nothing or mark the replica as not the leader immediately? > > >>> > Intuitively, > > >>> > > > it > > >>> > > > > seems it's better if the broker acts consistently on a failed > > >>> replica > > >>> > > > > whether it's a leader or a follower. For ISR churns, I was > just > > >>> > > pointing > > >>> > > > > out that if we don't send StopReplicaRequest to a broker to > be > > >>> shut > > >>> > > down > > >>> > > > in > > >>> > > > > a controlled way, then the leader will shrink ISR, expand it > > and > > >>> > shrink > > >>> > > > it > > >>> > > > > again after the timeout. > > >>> > > > > > > >>> > > > > The KIP seems to still reference " > > >>> > > > > /broker/topics/[topic]/partitions/[partitionId]/ > > >>> > > > controller_managed_state". > > >>> > > > > > > >>> > > > > Thanks, > > >>> > > > > > > >>> > > > > Jun > > >>> > > > > > > >>> > > > > On Sat, Feb 25, 2017 at 7:49 PM, Dong Lin < > lindon...@gmail.com > > > > > >>> > wrote: > > >>> > > > > > > >>> > > > > > Hey Jun, > > >>> > > > > > > > >>> > > > > > Thanks for the suggestion. I think it is a good idea to > know > > >>> put > > >>> > > > created > > >>> > > > > > flag in ZK and simply specify isNewReplica=true in > > >>> > > LeaderAndIsrRequest > > >>> > > > if > > >>> > > > > > repilcas was in NewReplica state. 
It will only fail the > > replica > > >>> > > > creation > > >>> > > > > in > > >>> > > > > > the scenario that the controller fails after > > >>> > > > > > topic-creation/partition-reassignment/partition-number- > chang > > e > > >>> but > > >>> > > > before > > >>> > > > > > actually sends out the LeaderAndIsrRequest while there is > > >>> ongoing > > >>> > > disk > > >>> > > > > > failure, which should be pretty rare and acceptable. This > > >>> should > > >>> > > > simplify > > >>> > > > > > the design of this KIP. > > >>> > > > > > > > >>> > > > > > Regarding RAID-5, I think the concern with RAID-5/6 is not > > just > > >>> > about > > >>> > > > > > performance when there is no failure. For example, RAID-5 > can > > >>> > support > > >>> > > > up > > >>> > > > > to > > >>> > > > > > one disk failure and it takes time to rebuild disk after > one > > >>> disk > > >>> > > > > > failure. RAID 5 implementations are susceptible to system > > >>> failures > > >>> > > > > because > > >>> > > > > > of trends regarding array rebuild time and the chance of > > drive > > >>> > > failure > > >>> > > > > > during rebuild. There is no such performance degradation > for > > >>> JBOD > > >>> > and > > >>> > > > > JBOD > > >>> > > > > > can support multiple log directory failure without reducing > > >>> > > performance > > >>> > > > > of > > >>> > > > > > good log directories. Would this be a reasonable reason for > > >>> using > > >>> > > JBOD > > >>> > > > > > instead of RAID-5/6? > > >>> > > > > > > > >>> > > > > > Previously we discussed wether broker should remove offline > > >>> replica > > >>> > > > from > > >>> > > > > > replica fetcher thread. I still think it should do it > instead > > >>> of > > >>> > > > > printing a > > >>> > > > > > lot of error in the log4j log. We can still let controller > > send > > >>> > > > > > StopReplicaRequest to the broker. I am not sure I undertand > > why > > >>> > > > allowing > > >>> > > > > > broker to remove offline replica from fetcher thread will > > >>> increase > > >>> > > > churns > > >>> > > > > > in ISR. Do you think this is concern with this approach? > > >>> > > > > > > > >>> > > > > > I have updated the KIP to remove created flag from ZK and > > >>> change > > >>> > the > > >>> > > > > filed > > >>> > > > > > name to isNewReplica. Can you check if there is any issue > > with > > >>> the > > >>> > > > latest > > >>> > > > > > KIP? Thanks for your time! > > >>> > > > > > > > >>> > > > > > Regards, > > >>> > > > > > Dong > > >>> > > > > > > > >>> > > > > > > > >>> > > > > > On Sat, Feb 25, 2017 at 9:11 AM, Jun Rao <j...@confluent.io > > > > >>> wrote: > > >>> > > > > > > > >>> > > > > > > Hi, Dong, > > >>> > > > > > > > > >>> > > > > > > Thanks for the reply. > > >>> > > > > > > > > >>> > > > > > > Personally, I'd prefer not to write the created flag per > > >>> replica > > >>> > in > > >>> > > > ZK. > > >>> > > > > > > Your suggestion of disabling replica creation if there > is a > > >>> bad > > >>> > log > > >>> > > > > > > directory on the broker could work. The only thing is > that > > >>> it may > > >>> > > > delay > > >>> > > > > > the > > >>> > > > > > > creation of new replicas. I was thinking that an > > alternative > > >>> is > > >>> > to > > >>> > > > > extend > > >>> > > > > > > LeaderAndIsrRequest by adding a isNewReplica field per > > >>> replica. 
> > >>> > > That > > >>> > > > > > field > > >>> > > > > > > will be set when a replica is transitioning from the > > >>> NewReplica > > >>> > > state > > >>> > > > > to > > >>> > > > > > > Online state. Then, when a broker receives a > > >>> LeaderAndIsrRequest, > > >>> > > if > > >>> > > > a > > >>> > > > > > > replica is marked as the new replica, it will be created > > on a > > >>> > good > > >>> > > > log > > >>> > > > > > > directory, if not already present. Otherwise, it only > > >>> creates the > > >>> > > > > replica > > >>> > > > > > > if all log directories are good and the replica is not > > >>> already > > >>> > > > present. > > >>> > > > > > > This way, we don't delay the processing of new replicas > in > > >>> the > > >>> > > common > > >>> > > > > > case. > > >>> > > > > > > > > >>> > > > > > > I am ok with not persisting the offline replicas in ZK > and > > >>> just > > >>> > > > > > discovering > > >>> > > > > > > them through the LeaderAndIsrRequest. It handles the > cases > > >>> when a > > >>> > > > > broker > > >>> > > > > > > starts up with bad log directories better. So, the > > additional > > >>> > > > overhead > > >>> > > > > of > > >>> > > > > > > rediscovering the offline replicas is justified. > > >>> > > > > > > > > >>> > > > > > > > > >>> > > > > > > Another high level question. The proposal rejected > RAID5/6 > > >>> since > > >>> > it > > >>> > > > > adds > > >>> > > > > > > additional I/Os. The main issue with RAID5 is that to > > write a > > >>> > block > > >>> > > > > that > > >>> > > > > > > doesn't match the RAID stripe size, we have to first read > > >>> the old > > >>> > > > > parity > > >>> > > > > > to > > >>> > > > > > > compute the new one, which increases the number of I/Os ( > > >>> > > > > > > http://rickardnobel.se/raid-5-write-penalty/). I am > > >>> wondering if > > >>> > > you > > >>> > > > > > have > > >>> > > > > > > tested RAID5's performance by creating a file system > whose > > >>> block > > >>> > > size > > >>> > > > > > > matches the RAID stripe size ( > > https://www.percona.com/blog/ > > >>> > > > > > > 2011/12/16/setting-up-xfs-the-simple-edition/). This > way, > > >>> > writing > > >>> > > a > > >>> > > > > > block > > >>> > > > > > > doesn't require a read first. A large block size may > > >>> increase the > > >>> > > > > amount > > >>> > > > > > of > > >>> > > > > > > data writes, when the same block has to be written to > disk > > >>> > multiple > > >>> > > > > > times. > > >>> > > > > > > However, this is probably ok in Kafka's use case since we > > >>> batch > > >>> > the > > >>> > > > I/O > > >>> > > > > > > flush already. As you can see, we will be adding some > > >>> complexity > > >>> > to > > >>> > > > > > support > > >>> > > > > > > JBOD in Kafka one way or another. If we can tune the > > >>> performance > > >>> > of > > >>> > > > > RAID5 > > >>> > > > > > > to match that of RAID10, perhaps using RAID5 is a simpler > > >>> > solution. > > >>> > > > > > > > > >>> > > > > > > Thanks, > > >>> > > > > > > > > >>> > > > > > > Jun > > >>> > > > > > > > > >>> > > > > > > > > >>> > > > > > > On Fri, Feb 24, 2017 at 10:17 AM, Dong Lin < > > >>> lindon...@gmail.com> > > >>> > > > > wrote: > > >>> > > > > > > > > >>> > > > > > > > Hey Jun, > > >>> > > > > > > > > > >>> > > > > > > > I don't think we should allow failed replicas to be > > >>> re-created > > >>> > on > > >>> > > > the > > >>> > > > > > > good > > >>> > > > > > > > disks. Say there are 2 disks and each of them is 51% > > >>> loaded. 
If > > >>> > > any > > >>> > > > > > disk > > >>> > > > > > > > fail, and we allow replicas to be re-created on the > other > > >>> > disks, > > >>> > > > both > > >>> > > > > > > disks > > >>> > > > > > > > will fail. Alternatively we can disable replica > creation > > if > > >>> > there > > >>> > > > is > > >>> > > > > > bad > > >>> > > > > > > > disk on a broker. I personally think it is worth the > > >>> additional > > >>> > > > > > > complexity > > >>> > > > > > > > in the broker to store created replicas in ZK so that > we > > >>> allow > > >>> > > new > > >>> > > > > > > replicas > > >>> > > > > > > > to be created on the broker even when there is bad log > > >>> > directory. > > >>> > > > > This > > >>> > > > > > > > approach won't add complexity in the controller. But I > am > > >>> fine > > >>> > > with > > >>> > > > > > > > disabling replica creation when there is bad log > > directory > > >>> that > > >>> > > if > > >>> > > > it > > >>> > > > > > is > > >>> > > > > > > > the only blocking issue for this KIP. > > >>> > > > > > > > > > >>> > > > > > > > Whether we store created flags is independent of > > >>> whether/how we > > >>> > > > store > > >>> > > > > > > > offline replicas. Per our previous discussion, do you > > >>> think it > > >>> > is > > >>> > > > OK > > >>> > > > > > not > > >>> > > > > > > > store offline replicas in ZK and propagate the offline > > >>> replicas > > >>> > > > from > > >>> > > > > > > broker > > >>> > > > > > > > to controller via LeaderAndIsrRequest? > > >>> > > > > > > > > > >>> > > > > > > > Thanks, > > >>> > > > > > > > Dong > > >>> > > > > > > > > > >>> > > > > > > > > >>> > > > > > > > >>> > > > > > > >>> > > > > > >>> > > > > >>> > > > >>> > > >> > > >> > > > > > >
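To summarize the replica-creation behavior the thread converges on, here is a rough Java sketch; the class and method names (ReplicaCreationRule, shouldCreateReplica) are hypothetical and this is not the actual broker code:

    // Rough sketch of the proposed replica-creation rule, with hypothetical
    // names; not the actual Kafka broker code.
    final class ReplicaCreationRule {

        /**
         * Decide whether the broker should create a local log for a replica
         * named in a LeaderAndIsrRequest.
         *
         * @param isNewReplica     proposed per-replica flag, set by the controller
         *                         when the replica transitions from NewReplica to Online
         * @param anyLogDirOffline whether at least one log directory is marked bad
         * @param logAlreadyExists whether a local log already exists for the partition
         */
        static boolean shouldCreateReplica(boolean isNewReplica,
                                           boolean anyLogDirOffline,
                                           boolean logAlreadyExists) {
            if (logAlreadyExists) {
                return false;   // nothing to create
            }
            if (isNewReplica) {
                return true;    // a brand-new replica can go to any good log directory
            }
            // An existing replica that is missing locally may live on a bad log
            // directory, so only re-create it when every log directory is good.
            return !anyLogDirOffline;
        }

        private ReplicaCreationRule() { }
    }

The point is that a replica flagged as new can always be placed on a good log directory, while a replica that is merely missing locally is not re-created while any log directory is bad, since it may still live on the bad one.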
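Similarly, here is a rough sketch of the broker-side handling of a bad log directory as described above (a directory is assumed good until an IOException, and its replicas are then treated as if a StopReplicaRequest had been received); the helper methods are hypothetical:

    import java.io.IOException;
    import java.util.Collections;
    import java.util.List;

    // Rough sketch of the broker-side handling of a bad log directory, with
    // hypothetical helper methods; not the actual Kafka broker code.
    final class LogDirFailureHandler {

        // A log directory is assumed good after startup and is only marked bad
        // once an IOException is hit while reading from or writing to it.
        void onLogDirIOException(String logDir, IOException cause) {
            markLogDirOffline(logDir);

            for (String topicPartition : partitionsOn(logDir)) {
                // Handle the replica as if a StopReplicaRequest had been received:
                // it stops acting as leader, is removed from the replica fetcher
                // threads, and produce/fetch requests for it get an error response.
                stopReplica(topicPartition);
            }

            // The controller then learns which replicas are offline through the
            // LeaderAndIsrRequest/Response exchange (rather than persisting the
            // offline replicas in ZooKeeper) and elects new leaders as needed.
        }

        // --- hypothetical helpers, stubbed out for the sketch ---
        private void markLogDirOffline(String logDir) { }
        private List<String> partitionsOn(String logDir) { return Collections.emptyList(); }
        private void stopReplica(String topicPartition) { }
    }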