Hi Jason,

Can KIP-352 split the two metrics MaxLag and TotalLag for reassignment 
replication as well?  From the dashboard of these 2 metrics, one can see 
whether the replication is stuck (flat line) and estimate how long the 
reassignments will complete (how fast the Lag line is going down).

Thanks,
George


    On Thursday, August 8, 2019, 01:41:49 PM PDT, Jason Gustafson 
<ja...@confluent.io> wrote:  
 
 Hey Stan,

Thanks for the suggestion. I have updated the proposal to include two new
meters for reassignment traffic inbound and outbound.

-Jason

On Thu, Aug 8, 2019 at 12:07 PM Stanislav Kozlovski <stanis...@confluent.io>
wrote:

> Agreed on not totally spitting the replication incoming/outgoing metric -
> that could cause confusion. Just adding a new metric sounds good to me!
>
> The throttle follow-up is mentioned as part of future work in KIP-455 and I
> agree that it is way out of scope for this one.
>
>
> On Thu, Aug 8, 2019 at 8:03 PM Jason Gustafson <ja...@confluent.io> wrote:
>
> > Hi Stan,
> >
> > That's an interesting thought. I'm wondering if it would be better to
> leave
> > the current replication metrics counting for the total replication
> traffic
> > and add a new metric for reassignment traffic?
> >
> > By the way, a further KIP-455 follow-up that I won't attempt here would
> be
> > to have a separate throttle for reassignment traffic.
> >
> > -Jason
> >
> > On Thu, Aug 8, 2019 at 11:34 AM Stanislav Kozlovski <
> > stanis...@confluent.io>
> > wrote:
> >
> > > Hi Jason,
> > >
> > > I like the new ReassigningPartitions metric. Would it be easy to expand
> > the
> > > scope of the KIP to split the ReplicationIncoming/Outgoing metric to
> > > distringuish between reassigning/non-reassigning traffic, or do you
> > prefer
> > > to keep this KIP nice and small?
> > >
> > > On Thu, Aug 8, 2019 at 12:08 AM Jason Gustafson <ja...@confluent.io>
> > > wrote:
> > >
> > > > Hi All,
> > > >
> > > > Since KIP-455 is passed, I would like to revive this proposal. I have
> > > > reduced the scope so that it is just changing the way we compute URP
> > and
> > > > adding a new metric for the number of reassigning partitions. Please
> > > take a
> > > > look:
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-352%3A+Distinguish+URPs+caused+by+reassignment
> > > > .
> > > >
> > > > Thanks,
> > > > Jason
> > > >
> > > > On Sun, Aug 5, 2018 at 3:41 PM Dong Lin <lindon...@gmail.com> wrote:
> > > >
> > > > > Hey Jason,
> > > > >
> > > > > Yes, I also think the right solution is probably to include the
> > > > > expected_replication_factor in the per-topic znode and include this
> > > > > information in the UpdateMetadataRequest.
> > > > >
> > > > > And I agree with Gwen and Ismael that it is reasonable to redefine
> > URP
> > > as
> > > > > those partitions whose ISR set size < expected replication factor.
> > > Given
> > > > > that URP is being used for monitoring cluster availability and and
> > > > > reassignment progress, it seems that one of them will break
> depending
> > > on
> > > > > the URP definition. It seems better to keep the legitimate
> use-case,
> > > i.e.
> > > > > monitoring cluster availability. Users who want to monitor
> > reassignment
> > > > > progress probably should use the "kafka-reassign-partitions.sh
> > > --verify".
> > > > >
> > > > >
> > > > > Thanks,
> > > > > Dong
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Aug 3, 2018 at 2:57 PM, Jason Gustafson <
> ja...@confluent.io>
> > > > > wrote:
> > > > >
> > > > > > @Dong
> > > > > >
> > > > > > Ugh, you are right. This is indeed trickier than I imagined. You
> > > could
> > > > > > argue that the problem here is that there is no single source of
> > > truth
> > > > > for
> > > > > > the expected replication factor. While a reassignment is in
> > progress,
> > > > the
> > > > > > reassignment itself has the expected replica set. Otherwise, it
> is
> > > > stored
> > > > > > partition metadata itself. This is why manually deleting the
> > > > reassignment
> > > > > > is problematic. We lose the expected replica set and we depend on
> > > users
> > > > > to
> > > > > > reinstall it. I guess I'm starting to think that the way we track
> > > > > > reassignment state in the controller is problematic. In addition
> to
> > > the
> > > > > > problems caused by deletion, we cannot easily change an existing
> > > > > > reassignment.
> > > > > >
> > > > > > High level what I'm thinking is that we need to move the pending
> > > > > > reassignment state out of the single znode and into the
> individual
> > > > > metadata
> > > > > > of the reassigned partitions so that there is a single source of
> > > truth
> > > > > for
> > > > > > the expected replicas (and hence the replication factor). This
> > would
> > > > also
> > > > > > give us an easier mechanism to manage the batching of multiple
> > > > > > reassignments. Let me think on it a bit and see if I can come up
> > > with a
> > > > > > proposal.
> > > > > >
> > > > > > @Gwen, @Ismael
> > > > > >
> > > > > > That is fair. I also prefer to redefine URP if we think the
> > > > compatibility
> > > > > > impact is a lower concern than continued misuse.
> > > > > >
> > > > > > -Jason
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Fri, Aug 3, 2018 at 12:25 PM, Ismael Juma <ism...@juma.me.uk>
> > > > wrote:
> > > > > >
> > > > > > > RIght, that was my thinking too.
> > > > > > >
> > > > > > > Ismael
> > > > > > >
> > > > > > > On Fri, Aug 3, 2018 at 12:04 PM Gwen Shapira <
> g...@confluent.io>
> > > > > wrote:
> > > > > > >
> > > > > > > > On Fri, Aug 3, 2018 at 11:23 AM, Jason Gustafson <
> > > > ja...@confluent.io
> > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hey Ismael,
> > > > > > > > >
> > > > > > > > > Yeah, my initial inclination was to redefine URP as well.
> My
> > > only
> > > > > > doubt
> > > > > > > > was
> > > > > > > > > how it would affect existing tools which might depend on
> URPs
> > > to
> > > > > > track
> > > > > > > > the
> > > > > > > > > progress of a reassignment. I decided to be conservative in
> > the
> > > > > end,
> > > > > > > but
> > > > > > > > > I'd reconsider if we think it is not a major concern. It is
> > > > > annoying
> > > > > > to
> > > > > > > > > need a new category.
> > > > > > > > >
> > > > > > > >
> > > > > > > > There are existing tools that use URP to track reassignment,
> > but
> > > > > there
> > > > > > > are
> > > > > > > > many more tools that use URP for monitoring and alerting. If
> I
> > > > > > understand
> > > > > > > > Ismael's suggestion correctly, a re-definition will improve
> the
> > > > > > > reliability
> > > > > > > > of the monitoring tools (since there won't be false alerts in
> > > case
> > > > of
> > > > > > > > re-assignment) without having to switch to a new metric.
> > > > > > > >
> > > > > > > > I think we should choose the proposal that improves the more
> > > common
> > > > > > usage
> > > > > > > > of the metric, in this case, failure monitoring rather than
> > > > > > reassignment.
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > About your question about storage in ZK, I can't think of
> > > > anything
> > > > > > > > > additional that we need. Probably the main difficulty is
> > > getting
> > > > > > access
> > > > > > > > to
> > > > > > > > > the replication factor in the topic utility. My basic
> thought
> > > was
> > > > > > just
> > > > > > > to
> > > > > > > > > collect the URPs (as we know them today) and use the config
> > API
> > > > to
> > > > > > > > > partition them based on the replication factor. Do you see
> > any
> > > > > > problems
> > > > > > > > > with this?
> > > > > > > > >
> > > > > > > > > -Jason
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Thu, Aug 2, 2018 at 12:14 PM, Ismael Juma <
> > > ism...@juma.me.uk>
> > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Thanks Jason. This is definitely a pain point. I actually
> > > > prefer
> > > > > > the
> > > > > > > > > option
> > > > > > > > > > to redefine what under-replicated means (currently under
> > > > rejected
> > > > > > > > > > alternatives). Also, do we need to make changes to what
> we
> > > > store
> > > > > in
> > > > > > > ZK?
> > > > > > > > > If
> > > > > > > > > > so, that should be in the KIP too.
> > > > > > > > > >
> > > > > > > > > > Ismael
> > > > > > > > > >
> > > > > > > > > > On Thu, Aug 2, 2018 at 11:45 AM Jason Gustafson <
> > > > > > ja...@confluent.io>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hey All,
> > > > > > > > > > >
> > > > > > > > > > > Another day, another KIP. This one is hopefully
> > > > > straightforward:
> > > > > > > > > > >
> > > > > > > > > > >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-352%
> > > > > > > > > > 3A+Distinguish+URPs+caused+by+reassignment
> > > > > > > > > > > .
> > > > > > > > > > > Have a look and let me know what you think!
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Jason
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > *Gwen Shapira*
> > > > > > > > Product Manager | Confluent
> > > > > > > > 650.450.2760 | @gwenshap
> > > > > > > > Follow us: Twitter <https://twitter.com/ConfluentInc> | blog
> > > > > > > > <http://www.confluent.io/blog>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > Best,
> > > Stanislav
> > >
> >
>
>
> --
> Best,
> Stanislav
>
  

Reply via email to