Re: [DISCUSS] KIP-113: Support replicas movement between log directories

Dong Lin Wed, 09 Aug 2017 21:29:25 -0700

Hey Jun,

I have been thinking about whether it is better to return lag (i.e. HW -
LEO) instead of LEO. Note that the lag in the DescribeDirsResponse may be
negative if LEO > HW. It will almost always be negative for leader and
in-sync replicas. Note that we can not calculate lag as max(0, HW - LEO)
because we still need the difference between two lags to measure the
progress of intra-broker replica movement. The AdminClient API can choose
to return max(0, HW - LEO) depending on whether it is used for tracking
progress of inter-broker reassignment or intra-broker movement. Is it OK?
If so, I will update the KIP-113 accordingly to return lag in the
DescribeDirsResponse .


Thanks,
Dong



<https://www.avast.com/sig-email?utm_medium=email&utm_source=link&utm_campaign=sig-email&utm_content=webmail&utm_term=icon>
Virus-free.
www.avast.com
<https://www.avast.com/sig-email?utm_medium=email&utm_source=link&utm_campaign=sig-email&utm_content=webmail&utm_term=link>
<#DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2>

On Wed, Aug 9, 2017 at 5:06 PM, Jun Rao <j...@confluent.io> wrote:

> Hi, Dong,
>
> Yes, the lag in a replica is calculated as the difference of LEO of the
> replica and the HW. So, as long as a replica is in sync, the lag is always
> 0.
>
> So, I was suggesting to return lag instead of LEO in DescribeDirsResponse
> for each replica. I am not sure if we need to return HW though.
>
> Thanks,
>
> Jun
>
> On Wed, Aug 9, 2017 at 5:01 PM, Dong Lin <lindon...@gmail.com> wrote:
>
> > Hey Jun,
> >
> > It just came to me that you may be assuming that folower_lag = HW -
> > follower_LEO. If that is the case, then we need to have new
> > request/response to retrieve this lag since the DescribeDirsResponse
> > doesn't even include HW. It seems that KIP-179 does not explicitly
> specify
> > the definition of this lag.
> >
> > I have been assuming that follow_lag = leader_LEO - follower_LEO given
> that
> > the request is used to query the reassignment status. Strictly speaking
> the
> > difference between leader_LEO and the HW is limited by the amount of data
> > produced in KafkaConfig.replicaLagTimeMaxMs, which is 10 seconds. I also
> > assumed that 10 seconds is probably not a big deal given the typical time
> > length of the reassignment.
> >
> > Thanks,
> > Dong
> >
> > On Wed, Aug 9, 2017 at 4:40 PM, Dong Lin <lindon...@gmail.com> wrote:
> >
> > > Hey Jun,
> > >
> > > If I understand you right, you are suggesting that, in the case when
> > there
> > > is continuous incoming traffic, the approach in the KIP-179 will report
> > lag
> > > as 0 whereas the approach using DescribeDirsRequest will report lag as
> > > non-zero. But I think the approach in KIP-179 will also report non-zero
> > lag
> > > when there is continuous traffic. This is because at the time the
> leader
> > > receives ReplicaStatusRequest, it is likely that some data has been
> > > appended to the partition after the last FetchRequest from the
> follower.
> > > Does this make sense?
> > >
> > > Thanks,
> > > Dong
> > >
> > >
> > >
> > > On Wed, Aug 9, 2017 at 4:24 PM, Jun Rao <j...@confluent.io> wrote:
> > >
> > >> Hi, Dong,
> > >>
> > >> As for whether to return LEO or lag, my point was the following. What
> > you
> > >> are concerned about is that an in-sync replica could become out of
> sync
> > >> again. However, the more common case is that once a replica is caught
> > up,
> > >> it will stay in sync afterwards. In that case, once the reassignment
> > >> process completes, if we report based on lag, all lags will be 0. If
> we
> > >> report based on Math.max(0, leaderLEO - followerLEO), the value may
> not
> > be
> > >> 0 if there is continuous incoming traffic, which will be confusing to
> > the
> > >> user.
> > >>
> > >> Thanks,
> > >>
> > >> Jun
> > >>
> > >>
> > >>
> > >> On Tue, Aug 8, 2017 at 6:26 PM, Dong Lin <lindon...@gmail.com> wrote:
> > >>
> > >> > Hey Jun,
> > >> >
> > >> > Thanks for the comment!
> > >> >
> > >> > Yes, it should work. The tool can send request to any broker and
> > broker
> > >> can
> > >> > just write the reassignment znode. My previous intuition is that it
> > may
> > >> be
> > >> > better to only send this request to controller. But I don't have
> good
> > >> > reasons for this restriction.
> > >> >
> > >> > My intuition is that we can keep them separate as well. Becket and I
> > >> have
> > >> > discussed this both offline and in https://github.com/apache/
> > >> > kafka/pull/3621.
> > >> > Currently I don't have a strong opinion on this and I am open to
> using
> > >> only
> > >> > one API to do both if someone can come up with a reasonable API
> > >> signature
> > >> > for this method. For now I have added the method alterReplicaDir()
> in
> > >> > KafkaAdminClient instead of the AdminClient interface so that the
> > >> > reassignment script can use this method without concluding what the
> > API
> > >> > would look like in AdminClient in the future.
> > >> >
> > >> > Regarding DescribeDirsResponse, I think it is probably OK to have
> > >> slightly
> > >> > more lag. The script can calculate the lag of the follower replica
> as
> > >> > Math.max(0, leaderLEO - followerLEO). I agree that it will be
> slightly
> > >> less
> > >> > accurate than the current approach in KIP-179. But even with the
> > current
> > >> > approach in KIP-179, the result provided by the script is an
> > >> approximation
> > >> > anyway, since there is delay from the time that leader returns
> > response
> > >> to
> > >> > the time that the script collects response from all brokers and
> prints
> > >> > result to user. I think if the slight difference in the accuracy
> > between
> > >> > the two approaches does not make a difference to the intended
> use-case
> > >> of
> > >> > this API, then we probably want to re-use the exiting
> request/response
> > >> to
> > >> > keep the protocol simple.
> > >> >
> > >> > Thanks,
> > >> > Dong
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> > On Tue, Aug 8, 2017 at 5:56 PM, Jun Rao <j...@confluent.io> wrote:
> > >> >
> > >> > > Hi, Dong,
> > >> > >
> > >> > > I think Tom was suggesting to have the AlterTopicsRequest sent to
> > any
> > >> > > broker, which just writes the reassignment json to ZK. The
> > controller
> > >> > will
> > >> > > pick up the reassignment and act on it as usual. This should work,
> > >> right?
> > >> > >
> > >> > > Having a separate AlterTopicsRequest and AlterReplicaDirRequest
> > seems
> > >> > > simpler to me. The former is handled by the controller and the
> > latter
> > >> is
> > >> > > handled by the affected broker. They don't always have to be done
> > >> > together.
> > >> > > Merging the two into a single request probably will make both the
> > api
> > >> and
> > >> > > the implementation a bit more complicated. If we do keep the two
> > >> separate
> > >> > > requests, it seems that we should just add AlterReplicaDirRequest
> to
> > >> the
> > >> > > AdminClient interface?
> > >> > >
> > >> > > Now, regarding DescribeDirsResponse. I agree that it can be used
> for
> > >> the
> > >> > > status reporting in KIP-179 as well. However, it seems that
> > reporting
> > >> the
> > >> > > log end offset of each replica may not be easy to use. The log end
> > >> offset
> > >> > > will be returned from different brokers in slightly different
> time.
> > If
> > >> > > there is continuous producing traffic, the difference in log end
> > >> offset
> > >> > > between the leader and the follower could be larger than 0 even if
> > the
> > >> > > follower has fully caught up. I am wondering if it's better to
> > instead
> > >> > > return the lag in offset per replica. This way, the status can
> > >> probably
> > >> > be
> > >> > > reported more reliably.
> > >> > >
> > >> > > Thanks,
> > >> > >
> > >> > > Jun
> > >> > >
> > >> > > On Tue, Aug 8, 2017 at 11:23 AM, Dong Lin <lindon...@gmail.com>
> > >> wrote:
> > >> > >
> > >> > > > Hey Tom,
> > >> > > >
> > >> > > > Thanks for the quick reply. Please see my comment inline.
> > >> > > >
> > >> > > > On Tue, Aug 8, 2017 at 11:06 AM, Tom Bentley <
> > t.j.bent...@gmail.com
> > >> >
> > >> > > > wrote:
> > >> > > >
> > >> > > > > Hi Dong,
> > >> > > > >
> > >> > > > > Replies inline, as usual
> > >> > > > >
> > >> > > > > > As I originally envisaged it, KIP-179's support for
> > reassigning
> > >> > > > > partitions
> > >> > > > > >
> > >> > > > > > would have more-or-less taken the logic currently in the
> > >> > > > > > > ReassignPartitionsCommand (that is, writing JSON to the
> > >> > > > > > > ZkUtils.ReassignPartitionsPath)
> > >> > > > > > > and put it behind a suitable network protocol API. Thus it
> > >> > wouldn't
> > >> > > > > > matter
> > >> > > > > > > which broker received the protocol call: It would be acted
> > on
> > >> by
> > >> > > > > brokers
> > >> > > > > > > being notified of the change in the ZK path, just as
> > >> currently.
> > >> > > This
> > >> > > > > > would
> > >> > > > > > > have kept the ReassignPartitionsCommand relatively simple,
> > as
> > >> it
> > >> > > > > > currently
> > >> > > > > > > is.
> > >> > > > > > >
> > >> > > > > >
> > >> > > > > > I am not sure I fully understand your proposal. I think you
> > are
> > >> > > saying
> > >> > > > > that
> > >> > > > > > any broker can receive and handle the AlterTopicRequest.
> > >> > > > >
> > >> > > > >
> > >> > > > > That's right.
> > >> > > > >
> > >> > > > >
> > >> > > > > > Let's say a
> > >> > > > > > non-controller broker received AlterTopicRequest, is this
> > broker
> > >> > > going
> > >> > > > to
> > >> > > > > > send LeaderAndIsrRequest to other brokers? Or is this broker
> > >> create
> > >> > > the
> > >> > > > > > reassignment znode in zookeper?
> > >> > > > >
> > >> > > > >
> > >> > > > > Exactly: It's going to write some JSON to the relevant znode.
> > >> Other
> > >> > > > brokers
> > >> > > > > will get notified by zk when the contents of this znode
> changes,
> > >> and
> > >> > do
> > >> > > > as
> > >> > > > > they do now. This is what the tool/script does now.
> > >> > > > >
> > >> > > > > I will confess that I don't completely understand the role of
> > >> > > > > LeaderAndIsrRequest, since the current code just seems to
> write
> > to
> > >> > the
> > >> > > > > znode do get the brokers to do the reassignment. If you could
> > >> explain
> > >> > > the
> > >> > > > > role of LeaderAndIsrRequest that would be great.
> > >> > > > >
> > >> > > >
> > >> > > > Currently only the controller will listen to the reassignment
> > znode
> > >> and
> > >> > > > sends LeaderAndIsrRequest and StopReplicaRequest to brokers in
> > >> order to
> > >> > > > complete reassignment. Brokers won't need to listen to zookeeper
> > for
> > >> > any
> > >> > > > reassignment -- brokers only reacts to the request from
> > controller.
> > >> > > > Currently Kafka's design replies a lot on the controller to
> keep a
> > >> > > > consistent view of who are the leader of partitions and what is
> > the
> > >> ISR
> > >> > > > etc. It will be a pretty drastic change, if not impossible, for
> > the
> > >> > > script
> > >> > > > to reassign partitions without going through controller.
> > >> > > >
> > >> > > > Thus I think it is likely that your AlterTopicsRequest can only
> be
> > >> sent
> > >> > > to
> > >> > > > controller. Then the controller can create the reassignment
> znode
> > in
> > >> > > > zookeeper so that the information is persisted across controller
> > >> fail
> > >> > > over.
> > >> > > > I haven't think through this in detail though.
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > > >
> > >> > > > >
> > >> > > > > > I may have missed it. But I couldn't find
> > >> > > > > > the explanation of AlterTopicRequest handling in KIP-179.
> > >> > > > > >
> > >> > > > >
> > >> > > > > You're right, it doesn't go into that much detail. I will fix
> > >> that.
> > >> > > > >
> > >> > > > >
> > >> > > > > > >
> > >> > > > > > > KIP-113 is obviously seeking to make more radical changes.
> > The
> > >> > > > > algorithm
> > >> > > > > > > described for moving a replica to a particular directory
> on
> > a
> > >> > > > different
> > >> > > > > > > broker (
> > >> > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > >> > > > > > > 113%3A+Support+replicas+movement+between+log+
> > >> > directories#KIP-113:
> > >> > > > > > > Supportreplicasmovementbetweenlogdirectories-2)
> > >> > > > > > > Howtoreassignreplicabetweenlogdirectoriesacrossbrokers
> > >> > > > > > > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > >> > > > > > > 113%3A+Support+replicas+movement+between+log+
> > >> > directories#KIP-113:
> > >> > > > > > > Supportreplicasmovementbetweenlogdirectories-2%
> > >> > > > > > > 29Howtoreassignreplicabetweenl
> ogdirectoriesacrossbrokers>)
> > >> > > > > > > involves both sending AlterReplicaDirRequest to "the"
> broker
> > >> (the
> > >> > > > > > receiving
> > >> > > > > > > broker, I assume, but it's not spelled out), _as well as_
> > >> writing
> > >> > > to
> > >> > > > > the
> > >> > > > > > ZK
> > >> > > > > > > node.
> > >> > > > > > >
> > >> > > > > > > This assumes the script (ReassignPartitionsCommand) has
> > direct
> > >> > > access
> > >> > > > > to
> > >> > > > > > > ZooKeeper, which is what KIP-179 is seeking to deprecate.
> It
> > >> > seems
> > >> > > a
> > >> > > > > > waste
> > >> > > > > > > of time to put the logic in the script as part of KIP-113,
> > >> only
> > >> > for
> > >> > > > > > KIP-179
> > >> > > > > > > to have to move it back to the controller.
> > >> > > > > > >
> > >> > > > > >
> > >> > > > > > I am not sure I understand what you mean by "It seems a
> waste
> > of
> > >> > time
> > >> > > > to
> > >> > > > > > put the logic in the script as part of KIP-113, only for
> > >> KIP-179 to
> > >> > > > have
> > >> > > > > to
> > >> > > > > > move it back to the controller".
> > >> > > > >
> > >> > > > >
> > >> > > > > Sorry, I misunderstood slightly what you were proposing in
> > >> KIP-113,
> > >> > so
> > >> > > > the
> > >> > > > > "waste of time" comment isn't quite right, but I'm still not
> > >> > convinced
> > >> > > > that
> > >> > > > > KIP-113+KIP-179 (in its current form) ends with a satisfactory
> > >> > result.
> > >> > > > >
> > >> > > > > Let me elaborate... KIP-113 says that to support reassigning
> > >> replica
> > >> > > > > between log directories across brokers:
> > >> > > > > * ...
> > >> > > > > * The script sends AlterReplicaDirRequest to those brokers
> which
> > >> need
> > >> > > to
> > >> > > > > move replicas...
> > >> > > > > * The script creates reassignment znode in zookeeper.
> > >> > > > > * The script retries AlterReplicaDirRequest to those broker...
> > >> > > > > * ...
> > >> > > > >
> > >> > > > > So the ReassignPartitionsCommand still talks to ZK directly,
> but
> > >> now
> > >> > > it's
> > >> > > > > bracketed by calls to the AdminClient. KIP-179 could replace
> > that
> > >> > > talking
> > >> > > > > to ZK directly with a new call to the AdminClient. But then
> > we've
> > >> > got a
> > >> > > > > pretty weird API, where we have to make three AdminClient
> calls
> > >> (two
> > >> > of
> > >> > > > > them to the same method), to move a replica. I don't really
> > >> > understand
> > >> > > > why
> > >> > > > > the admin client can't present a single API method to achieve
> > >> this,
> > >> > and
> > >> > > > > encapsulate on the server side the careful sequence of events
> > >> > necessary
> > >> > > > to
> > >> > > > > coordinate the movement. I understood this position is what
> > Ismael
> > >> > was
> > >> > > > > advocating when he said it was better to put the logic in the
> > >> > > controller
> > >> > > > > than spread between the script and the controller. But maybe I
> > >> > > > > misunderstood him.
> > >> > > > >
> > >> > > >
> > >> > > > I have some concern with putting this logic in controller which
> > can
> > >> be
> > >> > > > found in my previous email. Before that is addressed, the script
> > (or
> > >> > > > AdminClient) seems to be the simplest place to have this logic.
> > >> > > >
> > >> > > > I agree it is better to have a single API to achieve both
> > partition
> > >> and
> > >> > > > replica -> dir assignment. I think it is likely that we will
> find
> > a
> > >> > good
> > >> > > > API to do both. I have updated the KIP-113 to remove API
> > >> > alterReplicaDir
> > >> > > > from AdminClient interface.
> > >> > > >
> > >> > > >
> > >> > > > >
> > >> > > > >
> > >> > > > > > I assume that the logic you mentioned is
> > >> > > > > > "movement of replica to the specified log directory". This
> > logic
> > >> > (or
> > >> > > > the
> > >> > > > > > implementation of this logic) resides mainly in the
> > >> > KafkaAdminClient
> > >> > > > and
> > >> > > > > > broker. The script only needs to parse the json file as
> > >> appropriate
> > >> > > and
> > >> > > > > > call the new API in AdminClient as appropriate. The logic in
> > the
> > >> > > script
> > >> > > > > is
> > >> > > > > > therefore not much and can be easily moved to other classes
> if
> > >> > > needed.
> > >> > > > > >
> > >> > > > > > Can you clarify why this logic, i.e. movement of replica to
> > the
> > >> > > > specified
> > >> > > > > > log directory, needs to be moved to controller in KIP-179? I
> > >> think
> > >> > it
> > >> > > > can
> > >> > > > > > still be done in the script and controller should not need
> to
> > >> worry
> > >> > > > about
> > >> > > > > > log directory of any replica.
> > >> > > > > >
> > >> > > > > > Thanks,
> > >> > > > > > Dong
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>

Re: [DISCUSS] KIP-113: Support replicas movement between log directories

Reply via email to