Hey Jun, I have been thinking about whether it is better to return lag (i.e. HW - LEO) instead of LEO. Note that the lag in the DescribeDirsResponse may be negative if LEO > HW. It will almost always be negative for leader and in-sync replicas. Note that we can not calculate lag as max(0, HW - LEO) because we still need the difference between two lags to measure the progress of intra-broker replica movement. The AdminClient API can choose to return max(0, HW - LEO) depending on whether it is used for tracking progress of inter-broker reassignment or intra-broker movement. Is it OK? If so, I will update the KIP-113 accordingly to return lag in the DescribeDirsResponse .
Thanks, Dong <https://www.avast.com/sig-email?utm_medium=email&utm_source=link&utm_campaign=sig-email&utm_content=webmail&utm_term=icon> Virus-free. www.avast.com <https://www.avast.com/sig-email?utm_medium=email&utm_source=link&utm_campaign=sig-email&utm_content=webmail&utm_term=link> <#DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2> On Wed, Aug 9, 2017 at 5:06 PM, Jun Rao <j...@confluent.io> wrote: > Hi, Dong, > > Yes, the lag in a replica is calculated as the difference of LEO of the > replica and the HW. So, as long as a replica is in sync, the lag is always > 0. > > So, I was suggesting to return lag instead of LEO in DescribeDirsResponse > for each replica. I am not sure if we need to return HW though. > > Thanks, > > Jun > > On Wed, Aug 9, 2017 at 5:01 PM, Dong Lin <lindon...@gmail.com> wrote: > > > Hey Jun, > > > > It just came to me that you may be assuming that folower_lag = HW - > > follower_LEO. If that is the case, then we need to have new > > request/response to retrieve this lag since the DescribeDirsResponse > > doesn't even include HW. It seems that KIP-179 does not explicitly > specify > > the definition of this lag. > > > > I have been assuming that follow_lag = leader_LEO - follower_LEO given > that > > the request is used to query the reassignment status. Strictly speaking > the > > difference between leader_LEO and the HW is limited by the amount of data > > produced in KafkaConfig.replicaLagTimeMaxMs, which is 10 seconds. I also > > assumed that 10 seconds is probably not a big deal given the typical time > > length of the reassignment. > > > > Thanks, > > Dong > > > > On Wed, Aug 9, 2017 at 4:40 PM, Dong Lin <lindon...@gmail.com> wrote: > > > > > Hey Jun, > > > > > > If I understand you right, you are suggesting that, in the case when > > there > > > is continuous incoming traffic, the approach in the KIP-179 will report > > lag > > > as 0 whereas the approach using DescribeDirsRequest will report lag as > > > non-zero. But I think the approach in KIP-179 will also report non-zero > > lag > > > when there is continuous traffic. This is because at the time the > leader > > > receives ReplicaStatusRequest, it is likely that some data has been > > > appended to the partition after the last FetchRequest from the > follower. > > > Does this make sense? > > > > > > Thanks, > > > Dong > > > > > > > > > > > > On Wed, Aug 9, 2017 at 4:24 PM, Jun Rao <j...@confluent.io> wrote: > > > > > >> Hi, Dong, > > >> > > >> As for whether to return LEO or lag, my point was the following. What > > you > > >> are concerned about is that an in-sync replica could become out of > sync > > >> again. However, the more common case is that once a replica is caught > > up, > > >> it will stay in sync afterwards. In that case, once the reassignment > > >> process completes, if we report based on lag, all lags will be 0. If > we > > >> report based on Math.max(0, leaderLEO - followerLEO), the value may > not > > be > > >> 0 if there is continuous incoming traffic, which will be confusing to > > the > > >> user. > > >> > > >> Thanks, > > >> > > >> Jun > > >> > > >> > > >> > > >> On Tue, Aug 8, 2017 at 6:26 PM, Dong Lin <lindon...@gmail.com> wrote: > > >> > > >> > Hey Jun, > > >> > > > >> > Thanks for the comment! > > >> > > > >> > Yes, it should work. The tool can send request to any broker and > > broker > > >> can > > >> > just write the reassignment znode. My previous intuition is that it > > may > > >> be > > >> > better to only send this request to controller. But I don't have > good > > >> > reasons for this restriction. > > >> > > > >> > My intuition is that we can keep them separate as well. Becket and I > > >> have > > >> > discussed this both offline and in https://github.com/apache/ > > >> > kafka/pull/3621. > > >> > Currently I don't have a strong opinion on this and I am open to > using > > >> only > > >> > one API to do both if someone can come up with a reasonable API > > >> signature > > >> > for this method. For now I have added the method alterReplicaDir() > in > > >> > KafkaAdminClient instead of the AdminClient interface so that the > > >> > reassignment script can use this method without concluding what the > > API > > >> > would look like in AdminClient in the future. > > >> > > > >> > Regarding DescribeDirsResponse, I think it is probably OK to have > > >> slightly > > >> > more lag. The script can calculate the lag of the follower replica > as > > >> > Math.max(0, leaderLEO - followerLEO). I agree that it will be > slightly > > >> less > > >> > accurate than the current approach in KIP-179. But even with the > > current > > >> > approach in KIP-179, the result provided by the script is an > > >> approximation > > >> > anyway, since there is delay from the time that leader returns > > response > > >> to > > >> > the time that the script collects response from all brokers and > prints > > >> > result to user. I think if the slight difference in the accuracy > > between > > >> > the two approaches does not make a difference to the intended > use-case > > >> of > > >> > this API, then we probably want to re-use the exiting > request/response > > >> to > > >> > keep the protocol simple. > > >> > > > >> > Thanks, > > >> > Dong > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > On Tue, Aug 8, 2017 at 5:56 PM, Jun Rao <j...@confluent.io> wrote: > > >> > > > >> > > Hi, Dong, > > >> > > > > >> > > I think Tom was suggesting to have the AlterTopicsRequest sent to > > any > > >> > > broker, which just writes the reassignment json to ZK. The > > controller > > >> > will > > >> > > pick up the reassignment and act on it as usual. This should work, > > >> right? > > >> > > > > >> > > Having a separate AlterTopicsRequest and AlterReplicaDirRequest > > seems > > >> > > simpler to me. The former is handled by the controller and the > > latter > > >> is > > >> > > handled by the affected broker. They don't always have to be done > > >> > together. > > >> > > Merging the two into a single request probably will make both the > > api > > >> and > > >> > > the implementation a bit more complicated. If we do keep the two > > >> separate > > >> > > requests, it seems that we should just add AlterReplicaDirRequest > to > > >> the > > >> > > AdminClient interface? > > >> > > > > >> > > Now, regarding DescribeDirsResponse. I agree that it can be used > for > > >> the > > >> > > status reporting in KIP-179 as well. However, it seems that > > reporting > > >> the > > >> > > log end offset of each replica may not be easy to use. The log end > > >> offset > > >> > > will be returned from different brokers in slightly different > time. > > If > > >> > > there is continuous producing traffic, the difference in log end > > >> offset > > >> > > between the leader and the follower could be larger than 0 even if > > the > > >> > > follower has fully caught up. I am wondering if it's better to > > instead > > >> > > return the lag in offset per replica. This way, the status can > > >> probably > > >> > be > > >> > > reported more reliably. > > >> > > > > >> > > Thanks, > > >> > > > > >> > > Jun > > >> > > > > >> > > On Tue, Aug 8, 2017 at 11:23 AM, Dong Lin <lindon...@gmail.com> > > >> wrote: > > >> > > > > >> > > > Hey Tom, > > >> > > > > > >> > > > Thanks for the quick reply. Please see my comment inline. > > >> > > > > > >> > > > On Tue, Aug 8, 2017 at 11:06 AM, Tom Bentley < > > t.j.bent...@gmail.com > > >> > > > >> > > > wrote: > > >> > > > > > >> > > > > Hi Dong, > > >> > > > > > > >> > > > > Replies inline, as usual > > >> > > > > > > >> > > > > > As I originally envisaged it, KIP-179's support for > > reassigning > > >> > > > > partitions > > >> > > > > > > > >> > > > > > would have more-or-less taken the logic currently in the > > >> > > > > > > ReassignPartitionsCommand (that is, writing JSON to the > > >> > > > > > > ZkUtils.ReassignPartitionsPath) > > >> > > > > > > and put it behind a suitable network protocol API. Thus it > > >> > wouldn't > > >> > > > > > matter > > >> > > > > > > which broker received the protocol call: It would be acted > > on > > >> by > > >> > > > > brokers > > >> > > > > > > being notified of the change in the ZK path, just as > > >> currently. > > >> > > This > > >> > > > > > would > > >> > > > > > > have kept the ReassignPartitionsCommand relatively simple, > > as > > >> it > > >> > > > > > currently > > >> > > > > > > is. > > >> > > > > > > > > >> > > > > > > > >> > > > > > I am not sure I fully understand your proposal. I think you > > are > > >> > > saying > > >> > > > > that > > >> > > > > > any broker can receive and handle the AlterTopicRequest. > > >> > > > > > > >> > > > > > > >> > > > > That's right. > > >> > > > > > > >> > > > > > > >> > > > > > Let's say a > > >> > > > > > non-controller broker received AlterTopicRequest, is this > > broker > > >> > > going > > >> > > > to > > >> > > > > > send LeaderAndIsrRequest to other brokers? Or is this broker > > >> create > > >> > > the > > >> > > > > > reassignment znode in zookeper? > > >> > > > > > > >> > > > > > > >> > > > > Exactly: It's going to write some JSON to the relevant znode. > > >> Other > > >> > > > brokers > > >> > > > > will get notified by zk when the contents of this znode > changes, > > >> and > > >> > do > > >> > > > as > > >> > > > > they do now. This is what the tool/script does now. > > >> > > > > > > >> > > > > I will confess that I don't completely understand the role of > > >> > > > > LeaderAndIsrRequest, since the current code just seems to > write > > to > > >> > the > > >> > > > > znode do get the brokers to do the reassignment. If you could > > >> explain > > >> > > the > > >> > > > > role of LeaderAndIsrRequest that would be great. > > >> > > > > > > >> > > > > > >> > > > Currently only the controller will listen to the reassignment > > znode > > >> and > > >> > > > sends LeaderAndIsrRequest and StopReplicaRequest to brokers in > > >> order to > > >> > > > complete reassignment. Brokers won't need to listen to zookeeper > > for > > >> > any > > >> > > > reassignment -- brokers only reacts to the request from > > controller. > > >> > > > Currently Kafka's design replies a lot on the controller to > keep a > > >> > > > consistent view of who are the leader of partitions and what is > > the > > >> ISR > > >> > > > etc. It will be a pretty drastic change, if not impossible, for > > the > > >> > > script > > >> > > > to reassign partitions without going through controller. > > >> > > > > > >> > > > Thus I think it is likely that your AlterTopicsRequest can only > be > > >> sent > > >> > > to > > >> > > > controller. Then the controller can create the reassignment > znode > > in > > >> > > > zookeeper so that the information is persisted across controller > > >> fail > > >> > > over. > > >> > > > I haven't think through this in detail though. > > >> > > > > > >> > > > > > >> > > > > > >> > > > > > > >> > > > > > > >> > > > > > I may have missed it. But I couldn't find > > >> > > > > > the explanation of AlterTopicRequest handling in KIP-179. > > >> > > > > > > > >> > > > > > > >> > > > > You're right, it doesn't go into that much detail. I will fix > > >> that. > > >> > > > > > > >> > > > > > > >> > > > > > > > > >> > > > > > > KIP-113 is obviously seeking to make more radical changes. > > The > > >> > > > > algorithm > > >> > > > > > > described for moving a replica to a particular directory > on > > a > > >> > > > different > > >> > > > > > > broker ( > > >> > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP- > > >> > > > > > > 113%3A+Support+replicas+movement+between+log+ > > >> > directories#KIP-113: > > >> > > > > > > Supportreplicasmovementbetweenlogdirectories-2) > > >> > > > > > > Howtoreassignreplicabetweenlogdirectoriesacrossbrokers > > >> > > > > > > <https://cwiki.apache.org/confluence/display/KAFKA/KIP- > > >> > > > > > > 113%3A+Support+replicas+movement+between+log+ > > >> > directories#KIP-113: > > >> > > > > > > Supportreplicasmovementbetweenlogdirectories-2% > > >> > > > > > > 29Howtoreassignreplicabetweenl > ogdirectoriesacrossbrokers>) > > >> > > > > > > involves both sending AlterReplicaDirRequest to "the" > broker > > >> (the > > >> > > > > > receiving > > >> > > > > > > broker, I assume, but it's not spelled out), _as well as_ > > >> writing > > >> > > to > > >> > > > > the > > >> > > > > > ZK > > >> > > > > > > node. > > >> > > > > > > > > >> > > > > > > This assumes the script (ReassignPartitionsCommand) has > > direct > > >> > > access > > >> > > > > to > > >> > > > > > > ZooKeeper, which is what KIP-179 is seeking to deprecate. > It > > >> > seems > > >> > > a > > >> > > > > > waste > > >> > > > > > > of time to put the logic in the script as part of KIP-113, > > >> only > > >> > for > > >> > > > > > KIP-179 > > >> > > > > > > to have to move it back to the controller. > > >> > > > > > > > > >> > > > > > > > >> > > > > > I am not sure I understand what you mean by "It seems a > waste > > of > > >> > time > > >> > > > to > > >> > > > > > put the logic in the script as part of KIP-113, only for > > >> KIP-179 to > > >> > > > have > > >> > > > > to > > >> > > > > > move it back to the controller". > > >> > > > > > > >> > > > > > > >> > > > > Sorry, I misunderstood slightly what you were proposing in > > >> KIP-113, > > >> > so > > >> > > > the > > >> > > > > "waste of time" comment isn't quite right, but I'm still not > > >> > convinced > > >> > > > that > > >> > > > > KIP-113+KIP-179 (in its current form) ends with a satisfactory > > >> > result. > > >> > > > > > > >> > > > > Let me elaborate... KIP-113 says that to support reassigning > > >> replica > > >> > > > > between log directories across brokers: > > >> > > > > * ... > > >> > > > > * The script sends AlterReplicaDirRequest to those brokers > which > > >> need > > >> > > to > > >> > > > > move replicas... > > >> > > > > * The script creates reassignment znode in zookeeper. > > >> > > > > * The script retries AlterReplicaDirRequest to those broker... > > >> > > > > * ... > > >> > > > > > > >> > > > > So the ReassignPartitionsCommand still talks to ZK directly, > but > > >> now > > >> > > it's > > >> > > > > bracketed by calls to the AdminClient. KIP-179 could replace > > that > > >> > > talking > > >> > > > > to ZK directly with a new call to the AdminClient. But then > > we've > > >> > got a > > >> > > > > pretty weird API, where we have to make three AdminClient > calls > > >> (two > > >> > of > > >> > > > > them to the same method), to move a replica. I don't really > > >> > understand > > >> > > > why > > >> > > > > the admin client can't present a single API method to achieve > > >> this, > > >> > and > > >> > > > > encapsulate on the server side the careful sequence of events > > >> > necessary > > >> > > > to > > >> > > > > coordinate the movement. I understood this position is what > > Ismael > > >> > was > > >> > > > > advocating when he said it was better to put the logic in the > > >> > > controller > > >> > > > > than spread between the script and the controller. But maybe I > > >> > > > > misunderstood him. > > >> > > > > > > >> > > > > > >> > > > I have some concern with putting this logic in controller which > > can > > >> be > > >> > > > found in my previous email. Before that is addressed, the script > > (or > > >> > > > AdminClient) seems to be the simplest place to have this logic. > > >> > > > > > >> > > > I agree it is better to have a single API to achieve both > > partition > > >> and > > >> > > > replica -> dir assignment. I think it is likely that we will > find > > a > > >> > good > > >> > > > API to do both. I have updated the KIP-113 to remove API > > >> > alterReplicaDir > > >> > > > from AdminClient interface. > > >> > > > > > >> > > > > > >> > > > > > > >> > > > > > > >> > > > > > I assume that the logic you mentioned is > > >> > > > > > "movement of replica to the specified log directory". This > > logic > > >> > (or > > >> > > > the > > >> > > > > > implementation of this logic) resides mainly in the > > >> > KafkaAdminClient > > >> > > > and > > >> > > > > > broker. The script only needs to parse the json file as > > >> appropriate > > >> > > and > > >> > > > > > call the new API in AdminClient as appropriate. The logic in > > the > > >> > > script > > >> > > > > is > > >> > > > > > therefore not much and can be easily moved to other classes > if > > >> > > needed. > > >> > > > > > > > >> > > > > > Can you clarify why this logic, i.e. movement of replica to > > the > > >> > > > specified > > >> > > > > > log directory, needs to be moved to controller in KIP-179? I > > >> think > > >> > it > > >> > > > can > > >> > > > > > still be done in the script and controller should not need > to > > >> worry > > >> > > > about > > >> > > > > > log directory of any replica. > > >> > > > > > > > >> > > > > > Thanks, > > >> > > > > > Dong > > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > > > > > > > >