Thanks, Jun! Hi all,

Thanks for all the comments. I am going to open the voting thread if there is no further concern with the KIP.

Dong

On Thu, Mar 30, 2017 at 3:19 PM, Jun Rao <j...@confluent.io> wrote:

Hi, Dong,

I don't have further concerns. If there are no more comments from other people, we can start the vote.

Thanks,
Jun

On Thu, Mar 30, 2017 at 10:59 AM, Dong Lin <lindon...@gmail.com> wrote:

Hey Jun,

Thanks much for the comment! Do you think we can start the vote for KIP-112 and KIP-113 if there is no further concern?

Dong

On Thu, Mar 30, 2017 at 10:40 AM, Jun Rao <j...@confluent.io> wrote:

Hi, Dong,

Ok, so it seems that in solution (2), if the tool exits successfully, then we know for sure that all replicas will be in the right log dirs. Solution (1) doesn't guarantee that. That seems better, and we can go with your current solution then.

Thanks,
Jun

On Fri, Mar 24, 2017 at 4:28 PM, Dong Lin <lindon...@gmail.com> wrote:

Hey Jun,

No... the current approach described in the KIP (see here: https://cwiki.apache.org/confluence/display/KAFKA/KIP-113%3A+Support+replicas+movement+between+log+directories#KIP-113:Supportreplicasmovementbetweenlogdirectories-2)Howtoreassignreplicabetweenlogdirectoriesacrossbrokers) also sends ChangeReplicaDirRequest before writing the reassignment path in ZK. I think we are discussing whether ChangeReplicaDirResponse should (1) show success or (2) specify ReplicaNotAvailableException if the replica has not been created yet.

Since both solutions send ChangeReplicaDirRequest before writing the reassignment in ZK, their chance of creating the replica in the right directory is the same.

To take care of the rarer case that some brokers go down immediately after the reassignment tool is run, solution (1) requires the reassignment tool to repeatedly send DescribeDirsRequest and ChangeReplicaDirRequest, while solution (2) requires the tool to only retry ChangeReplicaDirRequest if the response says ReplicaNotAvailableException. It seems that solution (2) is cleaner because ChangeReplicaDirRequest won't depend on DescribeDirsRequest. What do you think?

Thanks,
Dong

On Fri, Mar 24, 2017 at 3:56 PM, Jun Rao <j...@confluent.io> wrote:

Hi, Dong,

We are just comparing whether it's better for the reassignment tool to send ChangeReplicaDirRequest (1) before or (2) after writing the reassignment path in ZK.

In the case when all brokers are alive when the reassignment tool is run, (1) guarantees 100% that the new replicas will be in the right log dirs and (2) can't.

In the rarer case that some brokers go down immediately after the reassignment tool is run, in either approach there is a chance that when the failed broker comes back, it will complete the pending reassignment process by putting some replicas in the wrong log dirs.

Implementation wise, (1) and (2) seem to be the same. So, it seems to me that (1) is better?

Thanks,
Jun

On Thu, Mar 23, 2017 at 11:54 PM, Dong Lin <lindon...@gmail.com> wrote:

Hey Jun,

Thanks much for the response! I agree with you that if multiple replicas are created in the wrong directory, we may waste resources if either the ReplicaMoveThread number is low or intra.broker.throttled.rate is slow. Then the question is whether the suggested approach increases the chance of a replica being created in the correct log directory.

I think the answer is no, due to the argument provided in the previous email. Sending ChangeReplicaDirRequest before updating the znode has negligible impact on the chance that the broker processes ChangeReplicaDirRequest before the LeaderAndIsrRequest from the controller. If we still worry about the order in which they are sent, the reassignment tool can first send ChangeReplicaDirRequest (so that the broker remembers it in memory), create the reassignment znode, and then retry ChangeReplicaDirRequest if the previous ChangeReplicaDirResponse says the replica has not been created. This should give us the highest possible chance of creating the replica in the correct directory and avoid the problem of the suggested approach. I have updated "How to reassign replica between log directories across brokers" in the KIP to explain this procedure.

To answer your question, the reassignment tool should fail with a proper error message if the user has specified a log directory for a replica on an offline broker. This is reasonable because the reassignment tool cannot guarantee that the replica will be moved to the specified log directory if the broker is offline. If all brokers are online, the reassignment tool may hang for up to 10 seconds (by default) to retry ChangeReplicaDirRequest if any replica has not been created already. The user can change this timeout value using the newly added --timeout argument of the reassignment tool. This is specified in the Public Interfaces section of the KIP. The reassignment tool will only block if the user uses this new feature of reassigning a replica to a specific log directory on the broker. Therefore it seems backward compatible.

Does this address the concern?

Thanks,
Dong
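[For illustration, the execute-phase procedure Dong describes above (send ChangeReplicaDirRequest first, create the reassignment znode, then retry on ReplicaNotAvailableException) might look roughly like the following sketch. The helper names are hypothetical stand-ins for the tool's internals, not actual KIP-113 APIs.]

    import java.util.Map;

    // Hypothetical sketch of the tool-side flow; not actual KIP-113 code.
    abstract class ReassignmentExecutor {
        // Sends ChangeReplicaDirRequest for each replica and returns the subset
        // whose broker answered ReplicaNotAvailableException (not created yet).
        abstract Map<String, String> sendChangeReplicaDirRequests(Map<String, String> replicaToDir);

        // Writes the reassignment path in ZK, triggering LeaderAndIsrRequests.
        abstract void createReassignmentZnode(Iterable<String> replicas);

        void execute(Map<String, String> replicaToDir, long timeoutMs) throws InterruptedException {
            // 1. Let each broker remember the desired log dir before the replica exists.
            Map<String, String> pending = sendChangeReplicaDirRequests(replicaToDir);
            // 2. Create the reassignment znode; the controller takes over inter-broker movement.
            createReassignmentZnode(replicaToDir.keySet());
            // 3. Retry only the replicas that have not been created yet, with a
            //    0.5 s backoff, up to the tool's --timeout (10 s by default).
            long deadline = System.currentTimeMillis() + timeoutMs;
            while (!pending.isEmpty() && System.currentTimeMillis() < deadline) {
                Thread.sleep(500);
                pending = sendChangeReplicaDirRequests(pending);
            }
            if (!pending.isEmpty())
                throw new IllegalStateException("Replicas not yet in requested log dirs: " + pending.keySet());
        }
    }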
On Thu, Mar 23, 2017 at 10:06 PM, Jun Rao <j...@confluent.io> wrote:

Hi, Dong,

11.2 I think there are a few reasons why the cross-disk movement may not catch up if the replicas are created in the wrong log dirs to start with: (a) there could be more replica fetcher threads than disk movement threads; (b) intra.broker.throttled.rate may be configured lower than the replica throttle rate. That's why I think getting the replicas created in the right log dirs in the first place will be better.

For the corner case issue that you mentioned, I am not sure if the approach in the KIP completely avoids it. If a broker is down when the partition reassignment tool is started, does the tool just hang (keep retrying ChangeReplicaDirRequest) until the broker comes back? Currently, the partition reassignment tool doesn't block.

Thanks,
Jun

On Tue, Mar 21, 2017 at 11:24 AM, Dong Lin <lindon...@gmail.com> wrote:

Hey Jun,

Thanks for the explanation. Please see below my thoughts.

10. I see. So you are concerned with the potential implementation complexity, which I wasn't aware of. I think it is OK not to do log cleaning on the .move log, since there can be only one such log in each directory. I have updated the KIP to specify this:

"The log segments in the topicPartition.move directory will be subject to log truncation and log retention in the same way as the log segments in the source log directory. But we may not do log cleaning on the topicPartition.move to simplify the implementation."

11.2 Now I get your point. I think we have slightly different expectations of the order in which the reassignment tool updates the reassignment node in ZK and sends ChangeReplicaDirRequest.

I think the reassignment tool should first create the reassignment znode and then keep sending ChangeReplicaDirRequest until success. Sending ChangeReplicaDirRequest before updating the znode has negligible impact on the chance that the broker processes ChangeReplicaDirRequest before the LeaderAndIsrRequest from the controller, because the time for the controller to receive the ZK notification, handle the state machine changes and send LeaderAndIsrRequests should be much longer than the time for the reassignment tool to set up a connection with the broker and send ChangeReplicaDirRequest. Even if the broker receives the LeaderAndIsrRequest a bit sooner, the data in the original replica should be small enough for the .move log to catch up very quickly, so that the broker can swap the log soon after it receives ChangeReplicaDirRequest -- otherwise the intra.broker.throttled.rate is probably too small. Does this address your concern with the performance?

One concern with the suggested approach is that the ChangeReplicaDirRequest may be lost if the broker crashes before it creates the replica. I agree it is rare, but it will be confusing when it happens. Operators would have to keep verifying the reassignment and possibly retry execution until success if they want to make sure that the ChangeReplicaDirRequest is executed.

Thanks,
Dong

On Tue, Mar 21, 2017 at 8:37 AM, Jun Rao <j...@confluent.io> wrote:

Hi, Dong,

10. I was mainly concerned about the additional complexity needed to support log cleaning in the .move log. For example, LogToClean is keyed off TopicPartition. To be able to support cleaning different instances of the same partition, we need additional logic. I am not sure how much additional complexity is needed and whether it's worth it. If we don't do log cleaning at all on the .move log, then we don't have to change the log cleaner's code.

11.2 I was thinking of the following flow. In the execute phase, the reassignment tool first issues a ChangeReplicaDirRequest to the brokers where new replicas will be created. The brokers remember the mapping and return a success code. The reassignment tool then initiates the cross-broker movement through the controller. In the verify phase, in addition to checking the replica assignment at the brokers, it issues a DescribeDirsRequest to check the replica-to-log-dir mapping. For each partition in the response, the broker returns a state to indicate whether the replica is final, temporary or pending. If all replicas are in the final state, the tool checks if all replicas are in the expected log dirs. If they are not, it outputs a warning (and perhaps suggests that the users move the data again). However, this should be rare.

Thanks,
Jun
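[As a sketch of the verify phase Jun outlines above, with the final/temporary/pending replica states modeled as an enum. All names here are illustrative stand-ins, not the actual DescribeDirsResponse schema.]

    import java.util.Map;

    class VerifyPhase {
        enum ReplicaDirState { FINAL, TEMPORARY, PENDING }

        // Returns true once every replica is in the FINAL state; at that point,
        // a replica outside its expected log dir only triggers a warning, since
        // the user can simply move the data again.
        static boolean verify(Map<String, ReplicaDirState> stateByReplica,
                              Map<String, String> actualDirByReplica,
                              Map<String, String> expectedDirByReplica) {
            for (ReplicaDirState state : stateByReplica.values())
                if (state != ReplicaDirState.FINAL)
                    return false; // movement still in progress; check again later
            for (Map.Entry<String, String> e : expectedDirByReplica.entrySet())
                if (!e.getValue().equals(actualDirByReplica.get(e.getKey())))
                    System.err.println("Warning: " + e.getKey() + " ended up in "
                            + actualDirByReplica.get(e.getKey()) + ", expected "
                            + e.getValue() + "; consider moving the data again.");
            return true;
        }
    }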
On Mon, Mar 20, 2017 at 10:46 AM, Dong Lin <lindon...@gmail.com> wrote:

Hey Jun,

Thanks for the response! It seems that we have only two remaining issues. Please see my reply below.

On Mon, Mar 20, 2017 at 7:45 AM, Jun Rao <j...@confluent.io> wrote:

Hi, Dong,

Thanks for the update. A few replies inlined below.

On Thu, Mar 16, 2017 at 12:28 AM, Dong Lin <lindon...@gmail.com> wrote:

Hey Jun,

Thanks for your comment! Please see my reply below.

On Wed, Mar 15, 2017 at 9:45 PM, Jun Rao <j...@confluent.io> wrote:

Hi, Dong,

Thanks for the reply.

Jun (Mar 15): 10. Could you comment on that?

Dong (Mar 16): Sorry, I missed that comment. Good point. I think the log segments in the topicPartition.move directory will be subject to log truncation, log retention and log cleaning in the same way as the log segments in the source log directory. I have just specified this in the KIP.

Jun (Mar 20): This is ok, but it doubles the overhead of log cleaning. We probably want to think a bit more on this.

Dong (Mar 20): I think this is OK because the number of replicas that are being moved is limited by the number of ReplicaMoveThreads. The default number of ReplicaMoveThreads is the number of log directories, which means we incur this overhead for at most one replica per log directory at any time. Suppose there are more than 100 replicas in any log directory; then the increase in overhead is less than 1%.

Another way to look at this is that it is no worse than replica reassignment. When we reassign a replica from one broker to another, we double the overhead of log cleaning in the cluster for that replica. If we are OK with that, then we are OK with replica movement between log directories.

Jun (Mar 15): 11.2 "I am concerned that the ChangeReplicaDirRequest would be lost if broker restarts after it sends ChangeReplicaDirResponse but before it receives LeaderAndIsrRequest."

In that case, the reassignment tool could detect that through DescribeDirsRequest and issue ChangeReplicaDirRequest again, right? In the common case, this is probably not needed and we only need to write each replica once.

My main concern with the approach in the current KIP is that once a new replica is created in the wrong log dir, the cross-log-directory movement may not catch up until the new replica is fully bootstrapped. So, we end up writing the data for the same replica twice.

Dong (Mar 16): I agree with your concern. My main concern is that it is a bit weird if ChangeReplicaDirResponse cannot guarantee success and the tool needs to rely on DescribeDirsResponse to see if it needs to send ChangeReplicaDirRequest again.

How about this: if the broker does not already have a replica created for the specified topicPartition when it receives ChangeReplicaDirRequest, it will reply with ReplicaNotAvailableException AND remember the (replica, destination log directory) pair in memory, so as to create the replica in the specified log directory.

Jun (Mar 20): I am not sure if returning ReplicaNotAvailableException is useful. What will the client do on receiving ReplicaNotAvailableException in this case?

Perhaps we could just replace the is_temporary field in DescribeDirsResponsePartition with a state field. We can use 0 to indicate that the partition is created, 1 to indicate that it is temporary and 2 to indicate that it is pending.

Dong (Mar 20): ReplicaNotAvailableException is useful because the client can re-send ChangeReplicaDirRequest (with backoff) after receiving ReplicaNotAvailableException in the response. ChangeReplicaDirRequest will only succeed after the replica has been created for the specified partition on the broker.

I think this is cleaner than asking the reassignment tool to detect that through DescribeDirsRequest and issue ChangeReplicaDirRequest again.
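[A minimal sketch of the broker-side behavior Dong proposes here: reply with ReplicaNotAvailableException while remembering the requested destination. Simplified types stand in for the real request handling; everything below is illustrative.]

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    class ChangeReplicaDirHandler {
        // Remembered (replica -> destination log dir) pairs for replicas that
        // do not exist yet; consulted when the LeaderAndIsrRequest creates them.
        private final Map<String, String> pendingDirByReplica = new ConcurrentHashMap<>();

        // Returns an error name in place of a real Kafka error code.
        String handleChangeReplicaDir(String topicPartition, String destinationDir,
                                      boolean replicaExists) {
            if (!replicaExists) {
                pendingDirByReplica.put(topicPartition, destinationDir);
                return "REPLICA_NOT_AVAILABLE"; // client re-sends with backoff
            }
            startIntraBrokerMove(topicPartition, destinationDir);
            return "NONE";
        }

        private void startIntraBrokerMove(String tp, String dir) {
            // Hand the partition off to a ReplicaMoveThread to build the .move log.
        }
    }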
Both solutions have the same chance of writing the data for the same replica twice. In the original solution, the reassignment tool will keep retrying ChangeReplicaDirRequest until success. In the suggested solution, the reassignment tool needs to send ChangeReplicaDirRequest, send DescribeDirsRequest to verify the result, and retry ChangeReplicaDirRequest and DescribeDirsRequest again if the replica hasn't been created yet. Thus the suggested solution couples ChangeReplicaDirRequest with DescribeDirsRequest and makes the tool's logic a bit more complicated.

Besides, I am not sure I understand your suggestion for the is_temporary field. It seems that a replica can have only two states: normal, when it is being used to serve fetch/produce requests, and temporary, when it is a replica that is catching up with the normal one. If you think we should have the reassignment tool send DescribeDirsRequest before retrying ChangeReplicaDirRequest, can you elaborate a bit on what the "pending" state is?

Jun (Mar 15): 11.3 Are you saying the value in --throttle will be used to set both intra.broker.throttled.rate and leader.follower.replication.throttled.replicas?

Dong (Mar 16): No. --throttle will be used only to set leader.follower.replication.throttled.replicas, as it does now. I think we do not need any option in kafka-reassign-partitions.sh to specify intra.broker.throttled.rate. The user can set it in the broker config or dynamically using kafka-configs.sh. Does this sound OK?

Jun (Mar 20): Ok. This sounds good. It would be useful to make this clear in the wiki.

Dong (Mar 20): Sure. I have updated the wiki to specify this: "the quota specified by the argument `--throttle` will be applied only to inter-broker replica reassignment. It does not affect the quota for replica movement between log directories".
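[For example, the two quotas would then be set through separate channels, roughly as follows. The command shapes are illustrative only; the exact config name and whether it is dynamically settable were still being settled in this thread.]

    # Inter-broker reassignment quota, via the tool as today:
    bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 \
      --reassignment-json-file reassign.json --execute --throttle 50000000

    # Intra-broker (cross-directory) quota, as a broker config instead,
    # e.g. dynamically via kafka-configs.sh:
    bin/kafka-configs.sh --zookeeper localhost:2181 --alter \
      --entity-type brokers --entity-name 0 \
      --add-config intra.broker.throttled.rate=10485760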
Jun (Mar 15): 12.2 If the user only wants to check one topic, the tool could do the filtering on the client side, right? My concern with having both log_dirs and topics is the semantics. For example, if both are non-empty, do we return the intersection or the union?

Dong (Mar 16): Yes, the tool could filter on the client side. But the purpose of having this field is to reduce the response size in case the broker has a lot of topics. Both fields are used as filters and the result is the intersection. Do you think this semantic is confusing or counter-intuitive?

Jun (Mar 20): Ok. Could we document the semantics when both dirs and topics are specified?

Dong (Mar 20): Sure. I have updated the wiki to specify this: "log_dirs and topics are used to filter the results to include only the specified log_dir/topic. The result is the intersection of both filters".

Thanks,
Jun
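[The documented intersection semantics amount to something like this sketch. Simplified types, not the actual wire schema; a null filter is taken to mean "no filtering", matching the null-means-all convention discussed downthread.]

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Set;

    class DescribeDirsFilter {
        // A partition is returned only if it passes BOTH filters.
        static List<String[]> filter(List<String[]> dirTopicPairs, // each entry: {logDir, topic}
                                     Set<String> logDirs, Set<String> topics) {
            List<String[]> result = new ArrayList<>();
            for (String[] pair : dirTopicPairs) {
                boolean dirMatches = logDirs == null || logDirs.contains(pair[0]);
                boolean topicMatches = topics == null || topics.contains(pair[1]);
                if (dirMatches && topicMatches) // intersection of both filters
                    result.add(pair);
            }
            return result;
        }
    }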
On Mon, Mar 13, 2017 at 3:32 PM, Dong Lin <lindon...@gmail.com> wrote:

Hey Jun,

Thanks much for your detailed comments. Please see my reply below.

On Mon, Mar 13, 2017 at 9:09 AM, Jun Rao <j...@confluent.io> wrote:

Hi, Dong,

Thanks for the updated KIP. Some more comments below.

Jun (Mar 13): 10. For the .move log, do we perform any segment deletion (based on retention) or log cleaning (if a compacted topic)? Or do we only enable that after the swap?

11. kafka-reassign-partitions.sh:

11.1 If all reassigned replicas are in the current broker and only the log directories have changed, we can probably optimize the tool to not trigger partition reassignment through the controller and only send ChangeReplicaDirRequest.

Dong (Mar 13): Yes, the reassignment script should not create the reassignment znode if no replicas need to be moved between brokers. This falls under "How to move replica between log directories on the same broker" in the Proposed Changes section.

Jun (Mar 13): 11.2 If ChangeReplicaDirRequest specifies a replica that's not created yet, could the broker just remember that in memory and create the replica when the creation is requested? This way, when doing cluster expansion, we can make sure that the new replicas on the new brokers are created in the right log directory in the first place. We can also avoid the tool having to keep issuing ChangeReplicaDirRequest in response to ReplicaNotAvailableException.

Dong (Mar 13): I am concerned that the ChangeReplicaDirRequest would be lost if the broker restarts after it sends ChangeReplicaDirResponse but before it receives the LeaderAndIsrRequest. In that case, the user would receive success when they initiate the replica reassignment, but the reassignment would never complete when they verify it later. This would be confusing to the user.

There are three different approaches to this problem if the broker has not yet created the replica when it receives ChangeReplicaDirRequest:

1) The broker immediately replies with ReplicaNotAvailableException, and the user can decide to retry later. The advantage of this solution is that the broker logic is very simple and the reassignment script logic also seems straightforward. The disadvantage is that the user script has to retry. But that seems fine - we can set the interval between retries to 0.5 sec so that the broker won't be bombarded by those requests. This is the solution chosen in the current KIP.

2) The broker can put the ChangeReplicaDirRequest in a purgatory with a timeout and reply to the user after the replica has been created. I didn't choose this in the interest of keeping the broker logic simpler.

3) The broker can remember it by making a mark on disk, e.g. creating a topicPartition.tomove directory in the destination log directory. This mark would be persisted across broker restarts. This is the first idea I had, but I replaced it with solution 1) in the interest of keeping the broker simple.

It seems that solution 1) is the simplest one that works, but I am OK with switching to the other two solutions if we don't want the retry logic. What do you think?

Jun (Mar 13): 11.3 Do we need an option in the tool to specify intra.broker.throttled.rate?

Dong (Mar 13): I don't find it useful to add this option to kafka-reassign-partitions.sh. The reason we have the "--throttle" option in the script is that we usually want a higher quota to fix an offline replica and get out of URP (under-replicated partitions), but we are OK with a lower quota if we are moving replicas only to balance the cluster. Thus it is common for SREs to use a different quota when using kafka-reassign-partitions.sh to move replicas between brokers.
However, the only reason for moving a replica between log directories of the same broker is to balance cluster resources. Thus an option to specify intra.broker.throttled.rate in the tool is not that useful. I am inclined not to add this option, to keep the tool's usage simpler.

Jun (Mar 13): 12. DescribeDirsRequest:

12.1 In other requests like CreateTopicRequest, we return an empty list in the response for an empty input list. If the input list is null, we return everything. We should probably follow the same convention here.

Dong (Mar 13): Thanks. I wasn't aware of this convention. I have changed DescribeDirsRequest so that "null" indicates "all".

Jun (Mar 13): 12.2 Do we need the topics field? Since the request is about log dirs, it makes sense to specify the log dirs. But it's weird to specify topics.

Dong (Mar 13): The topics field is not necessary. But it is useful to reduce the response size in case users are only interested in the status of a few topics. For example, a user may have initiated the reassignment of a given replica from one log directory to another on the same broker, and the user only wants to check the status of this given partition by looking at the DescribeDirsResponse. Thus this field is useful.

I am not sure if it is weird to call this request DescribeDirsRequest. The response is a map from log directory to information about some partitions on that log directory. Do you think we need to change the name of the request?
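[For reference, the response shape under discussion is roughly the following. Field and type names are illustrative only, not the actual schema.]

    import java.util.List;
    import java.util.Map;

    class DescribeDirsResponseSketch {
        static class PartitionInfo {
            String topic;
            int partition;
            long sizeInBytes;    // e.g. individual log size, instead of JMX
            boolean isTemporary; // true for a topicPartition.move log
        }
        // Map from log directory to the partitions hosted in that directory.
        Map<String, List<PartitionInfo>> partitionsByLogDir;
    }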
Jun (Mar 13): 12.3 DescribeDirsResponsePartition: Should we include firstOffset and nextOffset in the response? That could be useful to track the progress of the movement.

Dong (Mar 13): Yeah, good point. I agree it is useful to include logEndOffset in the response. According to the Log.scala doc, the logEndOffset is equivalent to the nextOffset. A user can track progress by checking the difference between the logEndOffset of the given partition in the source and destination log directories. I have added logEndOffset to the DescribeDirsResponsePartition in the KIP.

But it seems that we don't need firstOffset in the response. Do you think firstOffset is still needed?
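[As a sketch, the progress check this enables is just an offset difference, with both values as they would be read from DescribeDirsResponse. Names are illustrative.]

    class MoveProgress {
        // Offsets still to copy from the source log to the .move log.
        static long remainingOffsets(long sourceLogEndOffset, long moveLogEndOffset) {
            return Math.max(0L, sourceLogEndOffset - moveLogEndOffset);
        }

        static boolean caughtUp(long sourceLogEndOffset, long moveLogEndOffset) {
            return remainingOffsets(sourceLogEndOffset, moveLogEndOffset) == 0L;
        }
    }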
Jun (Mar 13): 13. ChangeReplicaDirResponse: Do we need an error code at both levels?

Dong (Mar 13): My bad. It is not needed. I have removed the request-level error code. I also added ChangeReplicaDirRequestTopic and ChangeReplicaDirResponseTopic to reduce duplication of the "topic" string in the request and response.

Jun (Mar 13): 14. num.replica.move.threads: Does it default to the number of log dirs?

Dong (Mar 13): No, it doesn't. I expect the default to be set to a conservative value such as 3. It may be surprising to users if the number of threads increases just because they have assigned more log directories to the Kafka broker.

It seems that the number of replica move threads doesn't have to depend on the number of log directories. It is possible to have one thread that moves replicas across all log directories. On the other hand, we can have multiple threads moving replicas to the same log directory. For example, if the broker uses SSDs, the CPU rather than disk IO may be the replica-move bottleneck, and it will be faster to move replicas using multiple threads per log directory.

Thanks,
Jun

On Thu, Mar 9, 2017 at 7:04 PM, Dong Lin <lindon...@gmail.com> wrote:

I just made one correction in the KIP. If the broker receives a ChangeReplicaDirRequest and the replica hasn't been created there, the broker will respond with ReplicaNotAvailableException. kafka-reassign-partitions.sh will need to re-send ChangeReplicaDirRequest in this case in order to wait for the controller to send the LeaderAndIsrRequest to the broker. The previous approach of creating an empty directory seems hacky.

On Thu, Mar 9, 2017 at 6:33 PM, Dong Lin <lindon...@gmail.com> wrote:

Hey Jun,

Thanks for your comments! I have updated the KIP to address your comments. Please see my reply inline.

Can you let me know if the latest KIP has addressed your comments?

On Wed, Mar 8, 2017 at 9:56 PM, Jun Rao <j...@confluent.io> wrote:

Hi, Dong,

Thanks for the reply.

Jun (Mar 8): 1.3 So the thread gets the lock, checks if it has caught up and releases the lock if not? Then, in the case when there is continuous incoming data, the thread may never get a chance to swap. One way to address this is, when the thread is getting really close to catching up, to just hold onto the lock until the thread fully catches up.

Dong (Mar 9): Yes, that was my original solution. I see your point that the lock may not be fairly assigned to the ReplicaMoveThread and the RequestHandlerThread when there are frequent incoming requests. Your solution should address the problem and I have updated the KIP to use it.

Jun (Mar 8): 2.3 So, you are saying that the partition reassignment tool can first send a ChangeReplicaDirRequest to the relevant brokers to establish the log dir for replicas not created yet, then trigger the partition movement across brokers through the controller? That's actually a good idea. Then, we can just leave LeaderAndIsrRequest as it is.

Dong (Mar 9): Yes, that is what I plan to do. If a broker receives a ChangeReplicaDirRequest while it is not leader or follower of the partition, the broker will create an empty Log instance (i.e. a directory named topicPartition) in the destination log directory, so that the replica will be placed there when the broker receives the LeaderAndIsrRequest from the controller. The broker should clean up those empty Log instances on startup, just in case a ChangeReplicaDirRequest was mistakenly sent to a broker that was not meant to be follower/leader of the partition.

Jun (Mar 8): Another thing related to ChangeReplicaDirRequest: since this request may take long to complete, I am not sure if we should wait for the movement to complete before responding. While waiting for the movement to complete, the idle connection may be killed or the client may be gone already. An alternative is to return immediately and add a new request like CheckReplicaDirRequest to see if the movement has completed. The tool can take advantage of that to check the status.

Dong (Mar 9): I agree with your concern and solution. We need a request to query the partition -> log_directory mapping on the broker. I have updated the KIP to remove the need for a ChangeReplicaDirRequestPurgatory. Instead, kafka-reassign-partitions.sh will send DescribeDirsRequest to the brokers when the user wants to verify the partition assignment. Since we need this DescribeDirsRequest anyway, we can also use it to expose stats like the individual log size instead of using JMX. One drawback of using JMX is that the user has to manage the JMX port and related credentials if they haven't already done so, which is the case at LinkedIn.
Thanks,
Jun

On Wed, Mar 8, 2017 at 6:21 PM, Dong Lin <lindon...@gmail.com> wrote:

Hey Jun,

Thanks for the detailed explanation. I will use a separate thread pool to move replicas between log directories. I will let you know when the KIP has been updated to use a separate thread pool.

Here is my response to your other questions:

1.3 My idea is that the ReplicaMoveThread that moves data should get the lock before checking whether the replica in the destination log directory has caught up. If the new replica has caught up, then the ReplicaMoveThread swaps the replicas while it is still holding the lock. The ReplicaFetcherThread or RequestHandlerThread will not be able to append data to the replica in the source log directory during this period because they cannot get the lock. Does this address the problem?
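[The swap protocol that comes out of this exchange, together with the hold-the-lock-near-catch-up refinement agreed upthread, might look roughly like this sketch. The names and the simplified Log type are illustrative only.]

    class ReplicaMover {
        private final Object logLock = new Object();
        private static final long CLOSE_ENOUGH_BYTES = 1024 * 1024;

        interface Log { long sizeInBytes(); }

        // Called repeatedly by the ReplicaMoveThread. The lock is released between
        // calls while the .move log is far behind, so appenders are not starved;
        // once close, the final catch-up and the swap happen under the lock.
        void maybeSwap(Log source, Log move) {
            synchronized (logLock) {
                if (source.sizeInBytes() - move.sizeInBytes() > CLOSE_ENOUGH_BYTES)
                    return; // still far behind; keep copying outside the lock
                while (move.sizeInBytes() < source.sizeInBytes())
                    copyNextChunk(source, move); // finish catch-up holding the lock
                swap(source, move); // rename the .move dir and retire the old log
            }
        }

        private void copyNextChunk(Log src, Log dst) { /* bounded copy step */ }
        private void swap(Log src, Log dst) { /* atomic directory rename */ }
    }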
How about this: > > > > controller > > > > > > will > > > > > > > > > only > > > > > > > > > > > deal > > > > > > > > > > > > > > with > > > > > > > > > > > > > > > > >> > reassignment across brokers as it does > > now. > > > If > > > > > > user > > > > > > > > > > > specified > > > > > > > > > > > > > > > > >> destination > > > > > > > > > > > > > > > > >> > replica for any disk, the admin tool > will > > > send > > > > > > > > > > > > > > > ChangeReplicaDirRequest > > > > > > > > > > > > > > > > >> and > > > > > > > > > > > > > > > > >> > wait for response from broker to confirm > > > that > > > > > all > > > > > > > > > replicas > > > > > > > > > > > > have > > > > > > > > > > > > > > been > > > > > > > > > > > > > > > > >> moved > > > > > > > > > > > > > > > > >> > to the destination log direcotry. The > > broker > > > > > will > > > > > > > put > > > > > > > > > > > > > > > > >> > ChangeReplicaDirRequset in a purgatory > and > > > > > respond > > > > > > > > > either > > > > > > > > > > > when > > > > > > > > > > > > > the > > > > > > > > > > > > > > > > >> movement > > > > > > > > > > > > > > > > >> > is completed or when the request has > > > > timed-out. > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > >> > 4. I agree that we can expose these > > metrics > > > > via > > > > > > JMX. > > > > > > > > > But I > > > > > > > > > > > am > > > > > > > > > > > > > not > > > > > > > > > > > > > > > sure > > > > > > > > > > > > > > > > >> if > > > > > > > > > > > > > > > > >> > it can be obtained easily with good > > > > performance > > > > > > > using > > > > > > > > > > either > > > > > > > > > > > > > > > existing > > > > > > > > > > > > > > > > >> tools > > > > > > > > > > > > > > > > >> > or new script in kafka. I will ask SREs > > for > > > > > their > > > > > > > > > opinion. > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > >> > Thanks, > > > > > > > > > > > > > > > > >> > Dong > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > >> > On Wed, Mar 8, 2017 at 1:24 PM, Jun Rao > < > > > > > > > > > j...@confluent.io > > > > > > > > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > Hi, Dong, > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > >> > > Thanks for the updated KIP. A few more > > > > > comments > > > > > > > > below. > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > >> > > 1.1 and 1.2: I am still not sure there > > is > > > > > enough > > > > > > > > > benefit > > > > > > > > > > > of > > > > > > > > > > > > > > > reusing > > > > > > > > > > > > > > > > >> > > ReplicaFetchThread > > > > > > > > > > > > > > > > >> > > to move data across disks. > > > > > > > > > > > > > > > > >> > > (a) A big part of ReplicaFetchThread > is > > to > > > > > deal > > > > > > > with > > > > > > > > > > > issuing > > > > > > > > > > > > > and > > > > > > > > > > > > > > > > >> tracking > > > > > > > > > > > > > > > > >> > > fetch requests. 
4. I agree that we can expose these metrics via JMX. But I am not sure whether they can be obtained easily, with good performance, using either existing tools or a new script in Kafka. I will ask SREs for their opinion.

Thanks,
Dong

On Wed, Mar 8, 2017 at 1:24 PM, Jun Rao <j...@confluent.io> wrote:

Hi, Dong,

Thanks for the updated KIP. A few more comments below.

1.1 and 1.2: I am still not sure there is enough benefit in reusing ReplicaFetchThread to move data across disks.

(a) A big part of ReplicaFetchThread deals with issuing and tracking fetch requests. So, it doesn't feel that we get much from reusing ReplicaFetchThread only to disable the fetching part.

(b) The leader replica has no ReplicaFetchThread to start with. It feels weird to start one just for intra-broker data movement.

(c) The ReplicaFetchThread is per broker. Intuitively, the number of threads doing intra-broker data movement should be related to the number of disks in the broker, not the number of brokers in the cluster.

(d) If the destination disk fails, we want to stop the intra-broker data movement but continue inter-broker replication. So, logically, it seems better to separate out the two.

(e) I am also not sure we should reuse the existing replication throttling. It is designed to handle traffic across brokers, and the delaying is done in the fetch request. So, if we are not doing fetching in ReplicaFetchThread, I am not sure the existing throttling is effective. Also, when specifying the throttling of moving data across disks, the user shouldn't have to care about whether a replica is a leader or a follower; reusing the existing throttling config name would be awkward in this regard.
(f) It seems simpler and more consistent to use a separate thread pool for local data movement (for both leader and follower replicas). This process can then be configured (e.g. number of threads) and throttled independently.

1.3 Yes, we will need some synchronization there. So, if the movement thread catches up and gets the lock to do the swap, but realizes that new data has been added, it has to continue catching up while holding the lock?

2.3 The benefit of including the desired log directory in LeaderAndIsrRequest during partition reassignment is that the controller doesn't need to track the progress of disk movement. So, you don't need the additional BrokerDirStateUpdateRequest, and the controller never needs to issue ChangeReplicaDirRequest; only the admin tool will issue ChangeReplicaDirRequest to move data within a broker. I agree that this makes LeaderAndIsrRequest more complicated, but that seems simpler than changing the controller to track additional states during partition reassignment.

4. We want to make a decision on how to expose the stats. So far, we are exposing stats like the individual log size via JMX.
So, one way is to just add new JMX metrics that expose the log directory of individual replicas.

Thanks,

Jun

On Thu, Mar 2, 2017 at 11:18 PM, Dong Lin <lindon...@gmail.com> wrote:

Hey Jun,

Thanks for all the comments! Please see my answers below. I have updated the KIP to address most of the questions and to make it easier to understand.

Thanks,
Dong

On Thu, Mar 2, 2017 at 9:35 AM, Jun Rao <j...@confluent.io> wrote:

> Hi, Dong,
>
> Thanks for the KIP. A few comments below.
>
> 1. For moving data across directories:
>
> 1.1 I am not sure why we want to use ReplicaFetcherThread to move data around in the leader. ReplicaFetchThread fetches data from a socket. For moving data locally, it seems that we want to avoid the socket overhead.

The purpose of using ReplicaFetchThread is to reuse an existing thread instead of creating more threads and making our thread model more complex.
It seems like a natural choice for copying data between disks, since that is similar to copying data between brokers. Another reason is that if the replica to be moved is a follower, we don't need a lock to swap replicas when the destination replica has caught up, since the same thread that is fetching data from the leader will swap the replica.

The ReplicaFetchThread will not incur socket overhead while copying data between disks. It will read directly from the source disk (as we do when processing a FetchRequest) and write to the destination disk (as we do when processing a ProduceRequest).

> 1.2 I am also not sure about moving data in the ReplicaFetcherThread in the follower. For example, I am not sure setting replica.fetch.wait.max.ms to 0 is ideal. It may not always be effective, since a fetch request in the ReplicaFetcherThread could be arbitrarily delayed due to replication throttling on the leader. In general, the data movement logic across disks seems different from that in ReplicaFetcherThread.
> So, I am not sure why they need to be coupled.

While it may not be the most efficient way to copy data between local disks, it will be at least as efficient as copying data from the leader to the destination disk. The expected goal of KIP-113 is to enable data movement between disks with no less efficiency than what we do now when moving data between brokers. We can optimize its performance using a separate thread if the performance turns out not to be good enough.

> 1.3 Could you add a bit more detail on how we swap the replicas when the new ones are fully caught up? For example, what happens when the new replica in the new log directory is caught up, but some new data has arrived just as we want to do the swap?

If the replica is a leader, then the ReplicaFetcherThread will perform the replacement. A proper lock is needed to prevent KafkaRequestHandler from appending data to topicPartition.log on the source disk before this replacement is completed by the ReplicaFetcherThread.
If the replica is a follower, no lock is needed, because the same ReplicaFetchThread that fetches data from the leader will also swap the replica.

I have updated the KIP to specify both cases more explicitly.

> 1.4 Do we need to do the .move at the log segment level, or could we just do that at the replica directory level? Renaming just a directory is much faster than renaming the log segments.

Great point. I have updated the KIP to rename the log directory instead (a sketch of the idea follows below).

> 1.5 Could you also describe a bit what happens when either the source or the target log directory fails while the data movement is in progress?

If the source log directory fails, the replica movement will stop and the source replica will be marked offline. If the destination log directory fails, the replica movement will stop. I have updated the KIP to clarify this.
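The directory-level rename amounts to a single filesystem metadata operation regardless of how many segments the replica has. A minimal sketch (the paths and the ".move" suffix here are illustrative, not necessarily the KIP's exact naming):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;

    // One rename for the whole replica directory, instead of renaming
    // every log segment file inside it.
    class DirRenameSketch {
        public static void main(String[] args) throws IOException {
            Path replicaDir = Paths.get("/data1/kafka-logs/test-0");
            Path moveDir = Paths.get("/data1/kafka-logs/test-0.move");
            Files.move(replicaDir, moveDir, StandardCopyOption.ATOMIC_MOVE);
        }
    }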
> 2. For partition reassignment:
>
> 2.1 I am not sure the controller can block on ChangeReplicaDirRequest. Data movement may take a long time to complete. If there is an outstanding request from the controller to a broker, that broker won't be able to process any new request from the controller. So if another event (e.g. a broker failure) happens while the data movement is in progress, subsequent LeaderAndIsrRequests will be delayed.

Yeah, good point. I missed the fact that there can be only one in-flight request from the controller to a broker.

How about I add a request, e.g. BrokerDirStateUpdateRequest, which maps topicPartition to log directory and can be sent from the broker to the controller to indicate completion?

> 2.2 In the KIP, the partition reassignment tool is also used for cases where an admin just wants to balance the existing data across log directories in the broker. In this case, it seems overkill to have the process go through the controller. A simpler approach is to issue an RPC request to the broker directly.
I agree we can optimize this case. It is just that we would have to add new logic or a new code path to handle a scenario that is already covered by the more complicated scenario. I will add it to the KIP.

> 2.3 When using the partition reassignment tool to move replicas across brokers, it makes sense to be able to specify the log directory of the newly created replicas. The KIP does that in two separate requests, ChangeReplicaDirRequest and LeaderAndIsrRequest, and tracks the progress of each independently. An alternative is to do it just in LeaderAndIsrRequest. That way, the new replicas will be created in the right log dir in the first place, and the controller just needs to track the progress of partition reassignment in the current way.

I agree it is better to use one request instead of two to request replica movement between disks.
But I think the performance advantage of doing so is negligible, because we trigger replica reassignment much less often than all other kinds of events in the Kafka cluster. I am not sure the benefit is worth the effort of adding an optional string field to the LeaderAndIsrRequest. Also, if we add this optional field to the LeaderAndIsrRequest, we probably want to remove ChangeReplicaDirRequest to avoid having two requests doing the same thing. But that means a user script could not send a request directly to the broker to trigger replica movement between log directories.

I will do it if you feel strongly about this optimization.

> 3. /admin/reassign_partitions: Including the log dir in every replica may not be efficient. We could include a list of log directories and reference the index of the log directory in each replica.

Good point. I have updated the KIP to use this solution.
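For illustration, the znode content might then look roughly like this; the field names are my guess at the shape, not the KIP's final format:

    // Illustrative only: a shared "log_dirs" list with per-replica indexes into it.
    class ReassignmentJsonSketch {
        static final String EXAMPLE = """
            {"version": 1,
             "log_dirs": ["/data1/kafka-logs", "/data2/kafka-logs"],
             "partitions": [
               {"topic": "test", "partition": 0,
                "replicas": [1, 2],
                "log_dir_indexes": [0, 1]}
             ]}
            """;
    }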
> 4. DescribeDirsRequest: The stats in the request are already available from JMX. Do we need the new request?

Does JMX also include the state (i.e. offline or online) of each log directory and the log directory of each replica? If not, then maybe we still need DescribeDirsRequest?

> 5. We want to be consistent on ChangeReplicaDirRequest vs. ChangeReplicaRequest.

I think ChangeReplicaRequest and ChangeReplicaResponse were my typo. Sorry, they are fixed now.

> Thanks,
>
> Jun

On Fri, Feb 3, 2017 at 6:19 PM, Dong Lin <lindon...@gmail.com> wrote:

Hey Alexey,

Thanks for all the comments!

I have updated the KIP to specify how we enforce the quota. I also updated the section "The thread model and broker logic for moving replica data between log directories" to make it easier to read. You can find the exact change here <https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=67638408&selectedPageVersions=5&selectedPageVersions=6>. The idea is to use the same replication quota mechanism introduced in KIP-73.
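As a rough illustration of what such a throttle amounts to inside the copy loop, here is a simple average-rate meter of my own; it is not the KIP-73 implementation:

    // Sketch: the replica-move thread calls acquire() before copying each chunk.
    class MoveThrottleSketch {
        private final long bytesPerSec;
        private final long startMs = System.currentTimeMillis();
        private long bytesSoFar = 0;

        MoveThrottleSketch(long bytesPerSec) { this.bytesPerSec = bytesPerSec; }

        void acquire(int chunkBytes) throws InterruptedException {
            bytesSoFar += chunkBytes;
            long elapsedMs = Math.max(System.currentTimeMillis() - startMs, 1);
            long allowedBytes = bytesPerSec * elapsedMs / 1000;
            if (bytesSoFar > allowedBytes) {
                // Sleep just long enough for the average rate to drop under the quota.
                Thread.sleep((bytesSoFar - allowedBytes) * 1000 / bytesPerSec);
            }
        }
    }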
Thanks,
Dong

On Wed, Feb 1, 2017 at 2:16 AM, Alexey Ozeritsky <aozerit...@yandex.ru> wrote:

24.01.2017, 22:03, "Dong Lin" <lindon...@gmail.com>:

> Hey Alexey,
>
> Thanks. I think we agreed that the suggested solution doesn't work in general for Kafka users. To answer your questions:
>
> 1. I agree we need a quota to rate-limit replica movement when a broker is moving a "leader" replica. I will come up with a solution, probably by reusing the replication quota config introduced in KIP-73.
>
> 2. Good point. I agree that this is a problem in general. If there is no new data on that broker, then with the current default values of replica.fetch.wait.max.ms and replica.fetch.max.bytes, the replica will be moved at only 2 MBps throughput.
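> (For reference, that figure follows from the defaults, assuming replica.fetch.max.bytes = 1 MB and replica.fetch.wait.max.ms = 500 ms: with no new data, each fetch returns at most 1 MB after waiting the full 500 ms, i.e. about 1 MB / 0.5 s = 2 MBps.)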
> I think the solution is for the broker to set replica.fetch.wait.max.ms to 0 in its FetchRequest if the corresponding ReplicaFetcherThread needs to move some replica to another disk.
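A sketch of that check; the flag and the method are made up for illustration:

    // Drop the fetch wait time to zero while this fetcher still has local data to copy.
    class FetchWaitSketch {
        static int effectiveMaxWaitMs(boolean moveInProgressOnThisThread,
                                      int configuredMaxWaitMs) {
            // Don't let the leader park our fetch for replica.fetch.wait.max.ms when
            // the real bottleneck is the local copy; return immediately instead.
            return moveInProgressOnThisThread ? 0 : configuredMaxWaitMs;
        }
    }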
> 3. I have updated the KIP to mention that the read size of a given partition is configured using replica.fetch.max.bytes when we move replicas between disks.
>
> Please see this <https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=67638408&selectedPageVersions=4&selectedPageVersions=5> for the change to the KIP. I will come up with a solution to throttle replica movement when a broker is moving a "leader" replica.

Thanks. It looks great.

On Tue, Jan 24, 2017 at 3:30 AM, Alexey Ozeritsky <aozerit...@yandex.ru> wrote:

23.01.2017, 22:11, "Dong Lin" <lindon...@gmail.com>:

> Thanks. Please see my comment inline.
>
> On Mon, Jan 23, 2017 at 6:45 AM, Alexey Ozeritsky <aozerit...@yandex.ru> wrote:
>
> > 13.01.2017, 22:29, "Dong Lin" <lindon...@gmail.com>:
> >
> > > Hey Alexey,
> > >
> > > Thanks for your review and the alternative approach. Here is my understanding of your patch: Kafka's background threads are used to move data between replicas. When data movement is triggered, the log will be rolled, the new logs will be put in the new directory, and background threads will move the segments from the old directory to the new directory (a sketch follows below).
> > >
> > > It is important to note that KIP-112 is intended to work with KIP-113 to support JBOD.
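For concreteness, a sketch of the background move just described; this is my illustration, not Alexey's actual patch:

    import java.io.IOException;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;

    // New segments are already written to the new directory; a background
    // thread drains the old one.
    class SegmentMoveSketch {
        static void moveOldSegments(Path oldDir, Path newDir) throws IOException {
            try (DirectoryStream<Path> segments = Files.newDirectoryStream(oldDir, "*.log")) {
                for (Path segment : segments) {
                    // Copy, then delete, one segment at a time, so a crash mid-move
                    // leaves every segment intact in at least one directory.
                    Files.copy(segment, newDir.resolve(segment.getFileName()));
                    Files.delete(segment);
                }
            }
        }
    }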
> > > I think your solution is definitely simpler and better under the current Kafka implementation, where a broker will fail if any disk fails. But I am not sure we want to allow a broker to run with a partial disk failure. Let's say a replica is being moved from log_dir_old to log_dir_new, and then log_dir_old stops working due to a disk failure. How would your existing patch handle it? To make the scenario a bit more

> > We will lose log_dir_old. After broker restart we can read the data from log_dir_new.

> No, you probably can't. This is because the broker doesn't have *all* the data for this partition.
> For example, say the broker has partition_segment_1, partition_segment_50 and partition_segment_100 in log_dir_old. partition_segment_100, which has the latest data, has been moved to log_dir_new, and log_dir_old fails before partition_segment_50 and partition_segment_1 are moved to log_dir_new. When the broker restarts, it won't have partition_segment_50. This causes a problem if the broker is elected leader and a consumer wants to consume the data in partition_segment_1.

Right.

> > > complicated, let's say the broker is shut down, log_dir_old's disk fails, and the broker starts. In this case the broker doesn't even know whether log_dir_new has all the data needed for this replica.
> > > It becomes a problem if the broker is elected leader of this partition in this case.

> > log_dir_new contains the most recent data, so we will lose the tail of the partition. This is not a big problem for us because we already delete tails by hand (see https://issues.apache.org/jira/browse/KAFKA-1712). Also, we don't use automatic leader balancing (auto.leader.rebalance.enable=false), so this partition becomes the leader with a low probability. I think my patch can be modified to prohibit leader election until the partition has moved completely.

> I guess you are saying that you have deleted the tails by hand in your own Kafka branch. But KAFKA-1712 is not accepted into Kafka trunk, and I am not

No.
We just modify segment mtimes with a cron job. This works with vanilla Kafka.

> sure it is the right solution. How would this solution address the problem mentioned above?

If you need only fresh data, and you remove old data by hand, this is not a problem. But in the general case this is a problem, of course.

> BTW, I am not sure the solution mentioned in KAFKA-1712 is the right way to address its problem. Now that we have a timestamp in the message, we can use that to delete old segments instead of relying on the log segment mtime. Just an idea; we don't have to discuss this problem here.
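For illustration, retention keyed off message timestamps rather than file mtime could look roughly like this (the names are made up):

    // The decision depends only on the data itself, so a cron job touching mtime
    // can no longer defeat (or falsely trigger) retention.
    class TimestampRetentionSketch {
        static boolean shouldDelete(long segmentMaxMessageTimestampMs,
                                    long nowMs, long retentionMs) {
            return nowMs - segmentMaxMessageTimestampMs > retentionMs;
        }
    }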
> > > The solution presented in the KIP attempts to handle it by replacing the replica in an atomic fashion after the log in the new directory has fully caught up with the log in the old directory. At any time the log can be considered to exist in only one log directory.

> > As I understand it, your solution does not cover quotas. What happens if someone starts to transfer 100 partitions?

> Good point. A quota can be implemented in the future. It is currently mentioned as a potential future improvement in KIP-112 <https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD>. Thanks for the reminder. I will move it to KIP-113.
> > > > If yes, it will read a ByteBufferMessageSet from
> > > > topicPartition.log and append the message set to
> > > > topicPartition.move.
>
> > > i.e. processPartitionData will read data from the beginning of
> > > topicPartition.log? What is the read size? A ReplicaFetchThread
> > > reads many partitions, so if one of them does some complicated
> > > work (= reads a lot of data from disk) everything will slow down.
> > > I think the read size should not be very big.
> > >
> > > On the other hand, at this point (processPartitionData) one can
> > > use only the new data (the ByteBufferMessageSet from the
> > > parameters) and wait until
> > > (topicPartition.move.smallestOffset <= topicPartition.log.smallestOffset
> > > && topicPartition.move.largestOffset == topicPartition.log.largestOffset).
> > > In this case the write speed to topicPartition.move and
> > > topicPartition.log will be the same, so this will allow us to
> > > move many partitions to one disk.
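Expressed as code, that catch-up test is roughly the following Scala
sketch (LogView is an illustrative stand-in exposing only the two
offsets; it is not a real Kafka class):

    // Minimal view of a log: the offset range it currently holds.
    case class LogView(smallestOffset: Long, largestOffset: Long)

    // The copy in the destination dir may replace the original only
    // once it covers at least the same offset range: no old data is
    // missing and it has caught up to the tip of the log.
    def readyToSwap(log: LogView, move: LogView): Boolean =
      move.smallestOffset <= log.smallestOffset &&
        move.largestOffset == log.largestOffset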
> > The read size of a given partition is configured using
> > replica.fetch.max.bytes, which is the same size used by the
> > FetchRequest from follower to leader. If the broker is moving a
> > replica for which it acts as a follower, the disk write rate for
> > moving this replica is at most the rate at which it fetches from
> > the leader (assuming it is catching up and has sufficient data to
> > read from the leader), which is subject to the round-trip time
> > between itself and the leader. Thus this part is probably fine even
> > without quota.
>
> OK. Could you mention it in the KIP?
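As a back-of-envelope check on that argument: a follower-driven move
receives at most replica.fetch.max.bytes per fetch round trip, so its
disk write rate is capped by fetch throughput. A sketch with
illustrative numbers (1 MiB is the config's default; the round-trip
time is assumed):

    // Upper bound on the disk write rate of a follower-driven move.
    val replicaFetchMaxBytes = 1048576L   // 1 MiB, the default
    val roundTripMs = 5L                  // assumed network round trip
    val maxWriteBytesPerSec = replicaFetchMaxBytes * 1000 / roundTripMs
    // = 200 MiB/s here; a slower network lowers the cap further.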
> I think there are two problems:
>
> 1. Without a speed limiter this will not work well even for one
> partition. In our production we had a problem, so we did the
> throughput limiter:
> https://github.com/resetius/kafka/commit/cda31dadb2f135743bf41083062927886c5ddce1#diff-ffa8861e850121997a534ebdde2929c6R713
>
> 2. I don't understand how it will work in the case of a big
> replica.fetch.wait.max.ms and a partition with irregular flow. For
> example, someone could have replica.fetch.wait.max.ms = 10 minutes
> and a partition that has very high data flow from 12:00 to 13:00 and
> zero flow otherwise. In this case processPartitionData could be
> called once per 10 minutes, so if we start moving data at 13:01 the
> move will finish the next day.
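For illustration of point 1, a minimal sleep-based rate limiter
sketch in Scala (a simplified stand-in, not the limiter from the
commit linked above):

    // Callers report how many bytes they just copied; the throttler
    // sleeps long enough to hold the average rate to bytesPerSec.
    class Throttler(bytesPerSec: Long) {
      private var windowStartMs = System.currentTimeMillis()
      private var bytesInWindow = 0L

      def maybeThrottle(justCopied: Long): Unit = synchronized {
        bytesInWindow += justCopied
        val elapsedMs = System.currentTimeMillis() - windowStartMs
        // How long these bytes should have taken at the target rate.
        val expectedMs = bytesInWindow * 1000 / bytesPerSec
        if (expectedMs > elapsedMs)
          Thread.sleep(expectedMs - elapsedMs)
        // Start a fresh accounting window roughly every second.
        if (System.currentTimeMillis() - windowStartMs >= 1000) {
          windowStartMs = System.currentTimeMillis()
          bytesInWindow = 0L
        }
      }
    }

A mover thread would call maybeThrottle after each chunk it copies,
so one bulk move cannot saturate a disk shared with live traffic.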
> > But if the broker is moving a replica for which it acts as a
> > leader, as of the current KIP the broker will keep reading from
> > log_dir_old and appending to log_dir_new without having to wait for
> > a round trip. We probably need a quota for this in the future.
>
> > > > And to answer your question: yes, topicPartition.log refers to
> > > > topic-partition/segment.log.
> > > >
> > > > Thanks,
> > > > Dong
> > > >
> > > > On Fri, Jan 13, 2017 at 4:12 AM, Alexey Ozeritsky
> > > > <aozerit...@yandex.ru> wrote:
>
> > > > > Hi,
> > > > >
> > > > > We have a similar solution that has been working in
> > > > > production since 2014. You can see it here:
> > > > > https://github.com/resetius/kafka/commit/20658593e246d2184906879defa2e763c4d413fb
> > > > >
> > > > > The idea is very simple:
> > > > >
> > > > > 1. The disk balancer runs in a separate thread inside the
> > > > > scheduler pool.
> > > > > 2. It does not touch empty partitions.
> > > > > 3. Before it moves a partition, it forcibly creates a new
> > > > > segment on the destination disk.
> > > > > 4. It moves segment by segment from new to old.
> > > > > 5. The Log class works with segments on both disks.
> > > > >
> > > > > Your approach seems too complicated; moreover, it means that
> > > > > you have to patch different components of the system.
> > > > >
> > > > > Could you clarify what you mean by topicPartition.log? Is it
> > > > > topic-partition/segment.log?
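As a rough illustration of that five-step scheme, a Scala sketch
(every name below is a hypothetical stand-in, not code from the
commit above):

    import java.nio.file.{Files, Path}

    // One balancer pass for a single partition. rollNewSegment stands
    // in for whatever forces the log to start a fresh active segment
    // on the destination disk (step 3).
    def movePartition(segments: Seq[Path], destDir: Path,
                      rollNewSegment: () => Unit): Unit = {
      if (segments.isEmpty) return     // step 2: skip empty partitions
      rollNewSegment()                 // step 3: new writes land on destDir
      // Step 4: move one closed segment at a time. Step 5 assumes the
      // Log class can serve reads from segments on either disk while
      // the move is in progress.
      for (seg <- segments)
        Files.move(seg, destDir.resolve(seg.getFileName))
    }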
> > > > > 12.01.2017, 21:47, "Dong Lin" <lindon...@gmail.com>:
>
> > > > > > Hi all,
> > > > > >
> > > > > > We created KIP-113: Support replicas movement between log
> > > > > > directories. Please find the KIP wiki at
> > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-113%3A+Support+replicas+movement+between+log+directories
> > > > > >
> > > > > > This KIP is related to KIP-112
> > > > > > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD>:
> > > > > > Handle disk failure for JBOD. Both are needed in order to
> > > > > > support JBOD in Kafka. Please help review the KIP. Your
> > > > > > feedback is appreciated!
> > > > > >
> > > > > > Thanks,
> > > > > > Dong