Hi, Dong,

Ok. We can keep LeaderAndIsrRequest as it is in the wiki.
Since we need both KIP-112 and KIP-113 to make a compelling case for JBOD, perhaps we should discuss KIP-113 before voting for both? I left some comments in the other thread.

Thanks,
Jun

On Wed, Mar 1, 2017 at 1:58 PM, Dong Lin <lindon...@gmail.com> wrote:

Hey Jun,

Do you think it is OK to keep the existing wire protocol in the KIP? I am wondering if we can initiate a vote for this KIP.

Thanks,
Dong

On Tue, Feb 28, 2017 at 2:41 PM, Dong Lin <lindon...@gmail.com> wrote:

Hey Jun,

I just realized that StopReplicaRequest itself doesn't specify the replicaId in the wire protocol. Thus the controller would need to log the brokerId with the StopReplicaRequest in the log. It may therefore be reasonable for the controller to do the same with LeaderAndIsrRequest and only specify isNewReplica for the broker that receives the LeaderAndIsrRequest.

Thanks,
Dong

On Tue, Feb 28, 2017 at 2:14 PM, Dong Lin <lindon...@gmail.com> wrote:

Hi Jun,

Yeah, there is a tradeoff between the controller's implementation complexity and the wire-protocol complexity. I personally think it is more important to keep the wire protocol concise and only add information to it if necessary. It seems fine to add a little bit of complexity to the controller's implementation, e.g. logging the destination broker per LeaderAndIsrRequest. Becket also shares this opinion with me. Is the only purpose of doing so to make the controller log simpler?

And certainly, I have added Todd's comment in the wiki.

Thanks,
Dong

On Tue, Feb 28, 2017 at 1:37 PM, Jun Rao <j...@confluent.io> wrote:

Hi, Dong,

52. What you suggested would work. However, I am thinking that it's probably simpler to just set isNewReplica at the replica level. That way, the LeaderAndIsrRequest can be created a bit more simply. When reading a LeaderAndIsrRequest in the controller log, it's easier to see which replicas are new without looking at which broker the request is intended for.

Could you also add those additional points from Todd on 1 broker per disk vs. JBOD vs. RAID5/6 to the KIP?

Thanks,

Hi, Todd,

Thanks for the feedback. That's very useful.

Jun

On Tue, Feb 28, 2017 at 10:25 AM, Dong Lin <lindon...@gmail.com> wrote:

Hey Jun,

Certainly, I have asked Todd to reply to the thread. And I have updated the item in the wiki.

50. The full statement is "Broker assumes a log directory to be good after it starts, and mark log directory as bad once there is IOException when broker attempts to access (i.e. read or write) the log directory". This statement seems reasonable, right? If a log directory is actually bad, then the broker will first assume it is OK, try to read logs on this log directory, encounter IOException, and then mark it as bad.

51. My bad. I thought I removed it but I didn't. It is removed now.

52. I don't think so. The isNewReplica field in the LeaderAndIsrRequest is only relevant to the replica (i.e. broker) that receives the LeaderAndIsrRequest. There is no need to specify whether each replica is new inside LeaderAndIsrRequest.
In other words, if the controller sends LeaderAndIsrRequest to three different replicas of a given partition, the isNewReplica field can be different across these three requests.

Yeah, I would definitely want to start discussion on KIP-113 after we have reached agreement on KIP-112. I actually opened the KIP-113 discussion thread on 1/12 together with this thread. I have yet to add the ability to list offline directories in KIP-113, which we discussed in this thread.

Thanks for all your reviews! Is there further concern with the latest KIP?

Thanks!
Dong

On Tue, Feb 28, 2017 at 9:40 AM, Jun Rao <j...@confluent.io> wrote:

Hi, Dong,

RAID6 is an improvement over RAID5 and can tolerate 2 disk failures. Eno's point is that the rebuild of RAID5/RAID6 requires reading more data compared with RAID10, which increases the probability of error during rebuild. This makes sense. In any case, do you think you could ask the SREs at LinkedIn to share their opinions on RAID5/RAID6?

Yes, when a replica is offline due to a bad disk, it makes sense to handle it immediately as if a StopReplicaRequest is received (i.e., the replica is no longer considered a leader and is removed from any replica fetcher thread). Could you add that detail in item 2 in the wiki?

50. The wiki says "Broker assumes a log directory to be good after it starts": a log directory actually could be bad during startup.

51. In item 4, the wiki says "The controller watches the path /log_dir_event_notification for new znode.". This doesn't seem to be needed now?

52. The isNewReplica field in LeaderAndIsrRequest should be for each replica inside the replicas field, right?

Other than those, the current KIP looks good to me. Do you want to start a separate discussion thread on KIP-113? I do have some comments there.

Thanks for working on this!

Jun

On Mon, Feb 27, 2017 at 5:51 PM, Dong Lin <lindon...@gmail.com> wrote:

Hi Jun,

In addition to Eno's reference on why rebuild time with RAID-5 is more expensive, another concern is that RAID-5 will fail if more than one disk fails. JBOD still works with 1+ disk failure and has better performance with one disk failure. These seem like good arguments for using JBOD instead of RAID-5.

If a leader replica goes offline, the broker should first take all actions (i.e. remove the partition from the fetcher thread) as if it has received a StopReplicaRequest for this partition, because the replica can no longer work anyway. It will also respond with an error to any ProduceRequest and FetchRequest for the partition. The broker notifies the controller by writing a notification znode in ZK.
The controller learns the disk failure event from ZK, sends a LeaderAndIsrRequest, and receives a LeaderAndIsrResponse to learn that the replica is offline. The controller will then elect a new leader for this partition and send LeaderAndIsrRequest/MetadataUpdateRequest to the relevant brokers. The broker should stop adjusting the ISR for this partition as if the broker were already offline. I am not sure there is any inconsistency in the broker's behavior when it is leader or follower. Is there any concern with this approach?

Thanks for catching this. I have removed that reference from the KIP.

Hi Eno,

Thank you for providing the reference on RAID-5. At LinkedIn we have 10 disks per Kafka machine. It will not be a show-stopper operationally for LinkedIn if we have to deploy one-broker-per-disk. On the other hand, we previously discussed the advantage of JBOD vs. one-broker-per-disk or one-broker-per-machine. One-broker-per-disk suffers from the problems described in the KIP, and one-broker-per-machine increases the failure rate caused by disk failure by 10X. Since JBOD is strictly better than either of the two, it is also better than one-broker-per-multiple-disk, which is somewhere between one-broker-per-disk and one-broker-per-machine.

I personally think the benefits of the JBOD design are worth the implementation complexity it introduces. I would also argue that it is reasonable for Kafka to manage this low-level detail because Kafka is already exposing and managing the replication factor of its data. But whether the complexity is worthwhile can be subjective and I cannot prove my opinion. I am contributing a significant amount of time to this KIP because Kafka developers at LinkedIn believe it is useful and worth the effort. Yeah, it will be useful to see what everyone else thinks about it.

Thanks,
Dong

On Mon, Feb 27, 2017 at 1:16 PM, Jun Rao <j...@confluent.io> wrote:

Hi, Dong,

For RAID5, I am not sure the rebuild cost is a big concern. If a disk fails, typically an admin has to bring down the broker, replace the failed disk with a new one, trigger the RAID rebuild, and bring up the broker. This way, there is no performance impact at runtime due to rebuild. The benefit is that a broker doesn't fail in a hard way when there is a disk failure and can be brought down in a controlled way for maintenance. While the broker is running with a failed disk, reads may be more expensive since they have to be computed from the parity. However, if most reads are from page cache, this may not be a big issue either.
So, it would be useful to do some tests on RAID5 before we completely rule it out.

Regarding whether to remove an offline replica from the fetcher thread immediately: what do we do when a failed replica is a leader? Do we do nothing, or mark the replica as not the leader immediately? Intuitively, it seems it's better if the broker acts consistently on a failed replica whether it's a leader or a follower. For ISR churns, I was just pointing out that if we don't send StopReplicaRequest to a broker to be shut down in a controlled way, then the leader will shrink the ISR, expand it, and shrink it again after the timeout.

The KIP seems to still reference "/broker/topics/[topic]/partitions/[partitionId]/controller_managed_state".

Thanks,

Jun

On Sat, Feb 25, 2017 at 7:49 PM, Dong Lin <lindon...@gmail.com> wrote:

Hey Jun,

Thanks for the suggestion. I think it is a good idea to not put the created flag in ZK and simply specify isNewReplica=true in the LeaderAndIsrRequest if the replica was in the NewReplica state. It will only fail the replica creation in the scenario where the controller fails after topic-creation/partition-reassignment/partition-number-change but before it actually sends out the LeaderAndIsrRequest while there is an ongoing disk failure, which should be pretty rare and acceptable. This should simplify the design of this KIP.

Regarding RAID-5, I think the concern with RAID-5/6 is not just about performance when there is no failure. For example, RAID-5 can support up to one disk failure and it takes time to rebuild the disk after one disk failure. RAID 5 implementations are susceptible to system failures because of trends regarding array rebuild time and the chance of drive failure during rebuild. There is no such performance degradation for JBOD, and JBOD can support multiple log directory failures without reducing the performance of good log directories. Would this be a reasonable reason for using JBOD instead of RAID-5/6?

Previously we discussed whether the broker should remove an offline replica from the replica fetcher thread. I still think it should do so instead of printing a lot of errors in the log4j log. We can still let the controller send StopReplicaRequest to the broker. I am not sure I understand why allowing the broker to remove an offline replica from the fetcher thread will increase churns in ISR.
Do you think this is a concern with this approach?

I have updated the KIP to remove the created flag from ZK and change the field name to isNewReplica. Can you check if there is any issue with the latest KIP? Thanks for your time!

Regards,
Dong

On Sat, Feb 25, 2017 at 9:11 AM, Jun Rao <j...@confluent.io> wrote:

Hi, Dong,

Thanks for the reply.

Personally, I'd prefer not to write the created flag per replica in ZK. Your suggestion of disabling replica creation if there is a bad log directory on the broker could work. The only thing is that it may delay the creation of new replicas. I was thinking that an alternative is to extend LeaderAndIsrRequest by adding an isNewReplica field per replica. That field will be set when a replica is transitioning from the NewReplica state to the Online state. Then, when a broker receives a LeaderAndIsrRequest, if a replica is marked as the new replica, it will be created on a good log directory, if not already present. Otherwise, it only creates the replica if all log directories are good and the replica is not already present. This way, we don't delay the processing of new replicas in the common case.

I am ok with not persisting the offline replicas in ZK and just discovering them through the LeaderAndIsrRequest. It handles the cases when a broker starts up with bad log directories better. So, the additional overhead of rediscovering the offline replicas is justified.

Another high level question. The proposal rejected RAID5/6 since it adds additional I/Os. The main issue with RAID5 is that to write a block that doesn't match the RAID stripe size, we have to first read the old parity to compute the new one, which increases the number of I/Os (http://rickardnobel.se/raid-5-write-penalty/). I am wondering if you have tested RAID5's performance by creating a file system whose block size matches the RAID stripe size (https://www.percona.com/blog/2011/12/16/setting-up-xfs-the-simple-edition/). This way, writing a block doesn't require a read first. A large block size may increase the amount of data writes, when the same block has to be written to disk multiple times.
However, this is probably ok in Kafka's use case since we batch the I/O flush already. As you can see, we will be adding some complexity to support JBOD in Kafka one way or another. If we can tune the performance of RAID5 to match that of RAID10, perhaps using RAID5 is a simpler solution.

Thanks,

Jun

On Fri, Feb 24, 2017 at 10:17 AM, Dong Lin <lindon...@gmail.com> wrote:

Hey Jun,

I don't think we should allow failed replicas to be re-created on the good disks. Say there are 2 disks and each of them is 51% loaded. If either disk fails and we allow its replicas to be re-created on the other disk, both disks will fail. Alternatively, we can disable replica creation if there is a bad disk on a broker. I personally think it is worth the additional complexity in the broker to store created replicas in ZK, so that we allow new replicas to be created on the broker even when there is a bad log directory. This approach won't add complexity in the controller. But I am fine with disabling replica creation when there is a bad log directory, if that is the only blocking issue for this KIP.

Whether we store created flags is independent of whether/how we store offline replicas. Per our previous discussion, do you think it is OK to not store offline replicas in ZK and to propagate the offline replicas from the broker to the controller via the LeaderAndIsrRequest?

Thanks,
Dong
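To make the disk-failure handling sequence discussed in this thread concrete, here is a minimal Java sketch of the broker-side steps Dong describes: assume a log directory is good at startup, mark it bad on an IOException, act as if a StopReplicaRequest had arrived for the affected replicas, and notify the controller through a ZK notification znode. All class and method names below are hypothetical stand-ins, not Kafka's actual internals.

    import java.io.IOException;
    import java.util.Set;
    import java.util.concurrent.CopyOnWriteArraySet;

    // Hypothetical sketch of the log-directory failure handling discussed above.
    public class LogDirFailureHandler {

        // Stand-ins for the broker components involved; not Kafka's real classes.
        interface ReplicaFetcherManager { void removeFetchersForPartitions(Set<TopicPartition> partitions); }
        interface ControllerNotifier { void writeLogDirEventNotification(int brokerId); }
        record TopicPartition(String topic, int partition) {}

        private final ReplicaFetcherManager fetcherManager;
        private final ControllerNotifier notifier;
        // A log directory is assumed good after startup; it is only added here
        // once an IOException is hit while reading or writing it.
        private final Set<String> offlineLogDirs = new CopyOnWriteArraySet<>();

        public LogDirFailureHandler(ReplicaFetcherManager fetcherManager, ControllerNotifier notifier) {
            this.fetcherManager = fetcherManager;
            this.notifier = notifier;
        }

        public void onIOException(int brokerId, String logDir,
                                  Set<TopicPartition> replicasOnDir, IOException cause) {
            // 1. Mark the log directory as bad (a real broker would also log 'cause').
            offlineLogDirs.add(logDir);

            // 2. Act as if a StopReplicaRequest had been received for the affected
            //    replicas: stop fetching, stop acting as leader, and let later
            //    Produce/Fetch requests for these partitions fail with an error.
            fetcherManager.removeFetchersForPartitions(replicasOnDir);

            // 3. Notify the controller via a notification znode (e.g. under
            //    /log_dir_event_notification). The controller then sends a
            //    LeaderAndIsrRequest, learns from the response which replicas are
            //    offline, and elects new leaders for those partitions.
            notifier.writeLogDirEventNotification(brokerId);
        }

        public boolean isOffline(String logDir) {
            return offlineLogDirs.contains(logDir);
        }
    }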
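Similarly, the replica-creation rule Jun proposes for a broker handling a LeaderAndIsrRequest with a per-replica isNewReplica field can be summarized in a short, self-contained sketch (illustrative names only, not the actual implementation):

    import java.util.List;

    // Hypothetical sketch of the proposed creation rule for a replica named in a
    // LeaderAndIsrRequest; not Kafka's actual code.
    public class ReplicaCreationRule {

        // Returns true if the broker should create the replica locally.
        static boolean shouldCreateReplica(boolean isNewReplica,
                                           boolean replicaAlreadyPresent,
                                           int totalLogDirs,
                                           int offlineLogDirs) {
            if (replicaAlreadyPresent) {
                return false; // already on disk, nothing to create
            }
            if (isNewReplica) {
                // A replica transitioning from NewReplica to Online is created on
                // any remaining good log directory.
                return offlineLogDirs < totalLogDirs;
            }
            // A replica that should already exist but is not found locally may
            // simply live on an offline directory, so it is only created when all
            // log directories are known to be good.
            return offlineLogDirs == 0;
        }

        public static void main(String[] args) {
            // Example: a broker with two log directories, one of which is offline.
            // A brand-new replica is still created on the good directory ...
            System.out.println(shouldCreateReplica(true, false, 2, 1));   // true
            // ... but a missing pre-existing replica is not re-created, since it
            // may be sitting on the offline directory.
            System.out.println(shouldCreateReplica(false, false, 2, 1));  // false
        }
    }

This captures why the proposal avoids delaying new-replica creation in the common case while still refusing to blindly re-create replicas that may live on a failed disk.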