Hey Jun,

Do you think it is OK to keep the existing wire protocol in the KIP? I am wondering if we can initiate a vote for this KIP.
Thanks,
Dong

On Tue, Feb 28, 2017 at 2:41 PM, Dong Lin <lindon...@gmail.com> wrote:

Hey Jun,

I just realized that StopReplicaRequest itself doesn't specify the replicaId in the wire protocol, so the controller would need to log the brokerId together with the StopReplicaRequest in its log. It may therefore be reasonable for the controller to do the same with LeaderAndIsrRequest and only specify isNewReplica for the broker that receives the LeaderAndIsrRequest.

Thanks,
Dong

On Tue, Feb 28, 2017 at 2:14 PM, Dong Lin <lindon...@gmail.com> wrote:

Hi Jun,

Yeah, there is a tradeoff between the controller's implementation complexity and the wire-protocol complexity. I personally think it is more important to keep the wire protocol concise and only add information to it when necessary. It seems fine to add a little complexity to the controller's implementation, e.g. logging the destination broker per LeaderAndIsrRequest. Becket also shares this opinion with me. Is the only purpose of doing so to make the controller log simpler?

And certainly, I have added Todd's comment to the wiki.

Thanks,
Dong

On Tue, Feb 28, 2017 at 1:37 PM, Jun Rao <j...@confluent.io> wrote:

Hi, Dong,

52. What you suggested would work. However, I am thinking that it's probably simpler to just set isNewReplica at the replica level. That way, the LeaderAndIsrRequest can be created a bit more simply. When reading a LeaderAndIsrRequest in the controller log, it's easier to see which replicas are new without looking at which broker the request is intended for.

Could you also add those additional points from Todd on one broker per disk vs. JBOD vs. RAID5/6 to the KIP?

Thanks,

Hi, Todd,

Thanks for the feedback. That's very useful.

Jun

On Tue, Feb 28, 2017 at 10:25 AM, Dong Lin <lindon...@gmail.com> wrote:

Hey Jun,

Certainly, I have added Todd to the thread so he can reply, and I have updated the item in the wiki.

50. The full statement is "Broker assumes a log directory to be good after it starts, and marks a log directory as bad once there is an IOException when the broker attempts to access (i.e. read or write) the log directory". This statement seems reasonable, right? If a log directory is actually bad, the broker will first assume it is OK, try to read logs on this log directory, encounter an IOException, and then mark it as bad.

51. My bad. I thought I had removed it but I didn't. It is removed now.

52. I don't think so. The isNewReplica field in the LeaderAndIsrRequest is only relevant to the replica (i.e. broker) that receives the LeaderAndIsrRequest. There is no need to specify whether each replica is new inside the LeaderAndIsrRequest. In other words, if the controller sends a LeaderAndIsrRequest to three different replicas of a given partition, the isNewReplica field can be different across these three requests.

Yeah, I would definitely want to start the discussion on KIP-113 after we have reached agreement on KIP-112. I actually opened the KIP-113 discussion thread on 1/12 together with this thread. I have yet to add the ability to list offline directories to KIP-113, which we discussed in this thread.

Thanks for all your reviews! Is there any further concern with the latest KIP?

Thanks!
Dong
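(To make item 52 concrete: a minimal sketch, with made-up names that are not from the KIP or the Kafka code base, of how the controller could fill in a per-destination-broker isNewReplica value, so the same partition can carry different values in the requests sent to its different replicas.)

import java.util.Set;

// Illustrative sketch only: the controller computes isNewReplica separately for
// each destination broker, so the flag describes only the replica hosted on the
// broker that receives the LeaderAndIsrRequest.
class IsNewReplicaSketch {

    // replicasInNewReplicaState: broker ids of replicas of this partition that the
    // controller is currently transitioning from the NewReplica state to Online.
    static boolean isNewReplicaFor(int destinationBrokerId,
                                   Set<Integer> replicasInNewReplicaState) {
        return replicasInNewReplicaState.contains(destinationBrokerId);
    }
}

For example, for a partition being newly assigned to broker 3, the request sent to broker 3 would carry isNewReplica=true while the requests sent to brokers 1 and 2 would carry isNewReplica=false.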
On Tue, Feb 28, 2017 at 9:40 AM, Jun Rao <j...@confluent.io> wrote:

Hi, Dong,

RAID6 is an improvement over RAID5 and can tolerate two disk failures. Eno's point is that the rebuild of RAID5/RAID6 requires reading more data compared with RAID10, which increases the probability of error during the rebuild. This makes sense. In any case, do you think you could ask the SREs at LinkedIn to share their opinions on RAID5/RAID6?

Yes, when a replica is offline due to a bad disk, it makes sense to handle it immediately as if a StopReplicaRequest had been received (i.e., the replica is no longer considered a leader and is removed from any replica fetcher thread). Could you add that detail to item 2 in the wiki?

50. The wiki says "Broker assumes a log directory to be good after it starts": a log directory actually could be bad during startup.

51. In item 4, the wiki says "The controller watches the path /log_dir_event_notification for new znode." This doesn't seem to be needed now?

52. The isNewReplica field in LeaderAndIsrRequest should be for each replica inside the replicas field, right?

Other than those, the current KIP looks good to me. Do you want to start a separate discussion thread on KIP-113? I do have some comments there.

Thanks for working on this!

Jun

On Mon, Feb 27, 2017 at 5:51 PM, Dong Lin <lindon...@gmail.com> wrote:

Hi Jun,

In addition to Eno's reference on why the rebuild time with RAID-5 is more expensive, another concern is that RAID-5 will fail if more than one disk fails. JBOD still works with one or more disk failures and has better performance with one disk failure. These seem like good arguments for using JBOD instead of RAID-5.

If a leader replica goes offline, the broker should first take all actions (i.e. remove the partition from the fetcher thread) as if it had received a StopReplicaRequest for this partition, because the replica can no longer work anyway. It will also respond with an error to any ProduceRequest and FetchRequest for the partition. The broker notifies the controller by writing a notification znode in ZK. The controller learns of the disk failure event from ZK, sends a LeaderAndIsrRequest and receives the LeaderAndIsrResponse to learn that the replica is offline. The controller will then elect a new leader for this partition and send LeaderAndIsrRequest/MetadataUpdateRequest to the relevant brokers. The broker should stop adjusting the ISR for this partition as if the broker were already offline. I am not sure there is any inconsistency in the broker's behavior when it is leader or follower. Is there any concern with this approach?
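(A minimal sketch of the broker-side part of this sequence, assuming placeholder interfaces and method names that are not from the KIP or the Kafka code base:)

import java.io.IOException;
import java.util.Set;

// Illustrative sketch: on the first IOException against a log directory, the broker
// stops acting on the affected replicas (as if a StopReplicaRequest had been
// received), starts failing produce/fetch requests for them, and notifies the
// controller through a notification znode so that new leaders can be elected.
class LogDirFailureSketch {

    interface ReplicaManager {
        Set<String> partitionsOnDir(String logDir);             // partitions whose logs live on this directory
        void removeFromFetcherThreads(Set<String> partitions);  // stop follower fetching for these partitions
        void markReplicasOffline(Set<String> partitions);       // subsequent produce/fetch requests get an error
    }

    interface ZkClient {
        void createSequentialNode(String path, String data);    // e.g. a child of /log_dir_event_notification
    }

    private final ReplicaManager replicaManager;
    private final ZkClient zkClient;
    private final int brokerId;

    LogDirFailureSketch(ReplicaManager replicaManager, ZkClient zkClient, int brokerId) {
        this.replicaManager = replicaManager;
        this.zkClient = zkClient;
        this.brokerId = brokerId;
    }

    // Called when any read or write against logDir throws an IOException.
    void onLogDirIOException(String logDir, IOException cause) {
        Set<String> affected = replicaManager.partitionsOnDir(logDir);
        replicaManager.removeFromFetcherThreads(affected);
        replicaManager.markReplicasOffline(affected);
        // Let the controller discover the failure and elect new leaders for the affected partitions.
        zkClient.createSequentialNode("/log_dir_event_notification/log_dir_event_", String.valueOf(brokerId));
    }
}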
Thanks for catching this. I have removed that reference from the KIP.

Hi Eno,

Thank you for providing the reference on RAID-5. At LinkedIn we have 10 disks per Kafka machine. It will not be a show-stopper operationally for LinkedIn if we have to deploy one broker per disk. On the other hand, we previously discussed the advantages of JBOD vs. one-broker-per-disk and one-broker-per-machine. One-broker-per-disk suffers from the problems described in the KIP, and one-broker-per-machine increases the failures caused by disk failure by 10X. Since JBOD is strictly better than either of the two, it is also better than one-broker-per-multiple-disks, which is somewhere between one-broker-per-disk and one-broker-per-machine.

I personally think the benefits of the JBOD design are worth the implementation complexity it introduces. I would also argue that it is reasonable for Kafka to manage this low-level detail, because Kafka already exposes and manages the replication factor of its data. But whether the complexity is worthwhile can be subjective and I cannot prove my opinion. I am contributing a significant amount of time to this KIP because Kafka developers at LinkedIn believe it is useful and worth the effort. Yeah, it will be useful to see what everyone else thinks about it.

Thanks,
Dong

On Mon, Feb 27, 2017 at 1:16 PM, Jun Rao <j...@confluent.io> wrote:

Hi, Dong,

For RAID5, I am not sure the rebuild cost is a big concern. If a disk fails, typically an admin has to bring down the broker, replace the failed disk with a new one, trigger the RAID rebuild, and bring up the broker. This way, there is no performance impact at runtime due to the rebuild. The benefit is that a broker doesn't fail in a hard way when there is a disk failure and can be brought down in a controlled way for maintenance. While the broker is running with a failed disk, reads may be more expensive since they have to be computed from the parity. However, if most reads are from the page cache, this may not be a big issue either. So, it would be useful to do some tests on RAID5 before we completely rule it out.

Regarding whether to remove an offline replica from the fetcher thread immediately: what do we do when a failed replica is a leader? Do we do nothing or mark the replica as not the leader immediately? Intuitively, it seems it's better if the broker acts consistently on a failed replica whether it's a leader or a follower. For ISR churn, I was just pointing out that if we don't send StopReplicaRequest to a broker being shut down in a controlled way, then the leader will shrink the ISR, expand it, and shrink it again after the timeout.

The KIP seems to still reference "/broker/topics/[topic]/partitions/[partitionId]/controller_managed_state".
Thanks,

Jun

On Sat, Feb 25, 2017 at 7:49 PM, Dong Lin <lindon...@gmail.com> wrote:

Hey Jun,

Thanks for the suggestion. I think it is a good idea to not put the created flag in ZK and simply specify isNewReplica=true in the LeaderAndIsrRequest if the replica was in the NewReplica state. It will only fail the replica creation in the scenario where the controller fails after topic-creation/partition-reassignment/partition-number-change but before it actually sends out the LeaderAndIsrRequest while there is an ongoing disk failure, which should be pretty rare and acceptable. This should simplify the design of this KIP.

Regarding RAID-5, I think the concern with RAID-5/6 is not just about performance when there is no failure. For example, RAID-5 can support up to one disk failure and it takes time to rebuild the disk after one disk failure. RAID 5 implementations are susceptible to system failures because of trends regarding array rebuild time and the chance of drive failure during rebuild. There is no such performance degradation for JBOD, and JBOD can support multiple log directory failures without reducing the performance of the good log directories. Would this be a reasonable reason for using JBOD instead of RAID-5/6?

Previously we discussed whether the broker should remove an offline replica from the replica fetcher thread. I still think it should do so instead of printing a lot of errors in the log4j log. We can still let the controller send StopReplicaRequest to the broker. I am not sure I understand why allowing the broker to remove an offline replica from the fetcher thread will increase churn in the ISR. Do you think this is a concern with this approach?

I have updated the KIP to remove the created flag from ZK and change the field name to isNewReplica. Can you check if there is any issue with the latest KIP? Thanks for your time!

Regards,
Dong

On Sat, Feb 25, 2017 at 9:11 AM, Jun Rao <j...@confluent.io> wrote:

Hi, Dong,

Thanks for the reply.

Personally, I'd prefer not to write the created flag per replica in ZK. Your suggestion of disabling replica creation if there is a bad log directory on the broker could work. The only thing is that it may delay the creation of new replicas. I was thinking that an alternative is to extend LeaderAndIsrRequest by adding an isNewReplica field per replica.
That field will be set when a replica is transitioning from the NewReplica state to the Online state. Then, when a broker receives a LeaderAndIsrRequest, if a replica is marked as a new replica, it will be created on a good log directory, if not already present. Otherwise, it only creates the replica if all log directories are good and the replica is not already present. This way, we don't delay the processing of new replicas in the common case.

I am ok with not persisting the offline replicas in ZK and just discovering them through the LeaderAndIsrRequest. It handles the case where a broker starts up with bad log directories better. So, the additional overhead of rediscovering the offline replicas is justified.

Another high-level question. The proposal rejected RAID5/6 since it adds additional I/Os. The main issue with RAID5 is that to write a block that doesn't match the RAID stripe size, we have to first read the old parity to compute the new one, which increases the number of I/Os (http://rickardnobel.se/raid-5-write-penalty/). I am wondering if you have tested RAID5's performance by creating a file system whose block size matches the RAID stripe size (https://www.percona.com/blog/2011/12/16/setting-up-xfs-the-simple-edition/). This way, writing a block doesn't require a read first. A large block size may increase the amount of data written, when the same block has to be written to disk multiple times. However, this is probably ok in Kafka's use case since we batch the I/O flush already. As you can see, we will be adding some complexity to support JBOD in Kafka one way or another. If we can tune the performance of RAID5 to match that of RAID10, perhaps using RAID5 is a simpler solution.

Thanks,

Jun
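(A minimal sketch of the replica-creation rule described above, for a broker processing one partition entry of a LeaderAndIsrRequest; the class and parameter names are made up for illustration and are not from the KIP:)

import java.util.List;

// Illustrative sketch of the replica-creation decision discussed in this thread.
class ReplicaCreationSketch {

    // isNewReplica: the flag proposed above, set by the controller when the replica
    //               is transitioning from the NewReplica state to Online.
    // replicaAlreadyPresent: true if the log for this replica already exists locally.
    // goodLogDirs / allLogDirs: the currently usable log directories vs. all configured ones.
    static boolean shouldCreateReplica(boolean isNewReplica,
                                       boolean replicaAlreadyPresent,
                                       List<String> goodLogDirs,
                                       List<String> allLogDirs) {
        if (replicaAlreadyPresent) {
            return false; // nothing to create
        }
        if (isNewReplica) {
            // A brand-new replica can be placed on any good log directory.
            return !goodLogDirs.isEmpty();
        }
        // An existing replica that is not found locally might live on a bad log
        // directory, so only (re)create it when every log directory is good.
        return goodLogDirs.size() == allLogDirs.size();
    }
}

The point of the isNewReplica branch is that a genuinely new replica can be placed on any good log directory immediately, while a replica whose location is unknown is only created when no log directory is offline, so data that may still exist on a bad disk is not silently re-created elsewhere.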
On Fri, Feb 24, 2017 at 10:17 AM, Dong Lin <lindon...@gmail.com> wrote:

Hey Jun,

I don't think we should allow failed replicas to be re-created on the good disks. Say there are 2 disks and each of them is 51% loaded. If either disk fails and we allow replicas to be re-created on the other disk, both disks will fail. Alternatively, we can disable replica creation if there is a bad disk on a broker. I personally think it is worth the additional complexity in the broker of storing created replicas in ZK, so that we allow new replicas to be created on the broker even when there is a bad log directory. This approach won't add complexity in the controller. But I am fine with disabling replica creation when there is a bad log directory, if that is the only blocking issue for this KIP.

Whether we store created flags is independent of whether/how we store offline replicas. Per our previous discussion, do you think it is OK not to store offline replicas in ZK and to propagate the offline replicas from broker to controller via LeaderAndIsrRequest?

Thanks,
Dong