Hey Stan,

Thanks for the suggestion. I have updated the proposal to include two new
meters for reassignment traffic inbound and outbound.

-Jason

On Thu, Aug 8, 2019 at 12:07 PM Stanislav Kozlovski <stanis...@confluent.io>
wrote:

> Agreed on not totally spitting the replication incoming/outgoing metric -
> that could cause confusion. Just adding a new metric sounds good to me!
>
> The throttle follow-up is mentioned as part of future work in KIP-455 and I
> agree that it is way out of scope for this one.
>
>
> On Thu, Aug 8, 2019 at 8:03 PM Jason Gustafson <ja...@confluent.io> wrote:
>
> > Hi Stan,
> >
> > That's an interesting thought. I'm wondering if it would be better to
> leave
> > the current replication metrics counting for the total replication
> traffic
> > and add a new metric for reassignment traffic?
> >
> > By the way, a further KIP-455 follow-up that I won't attempt here would
> be
> > to have a separate throttle for reassignment traffic.
> >
> > -Jason
> >
> > On Thu, Aug 8, 2019 at 11:34 AM Stanislav Kozlovski <
> > stanis...@confluent.io>
> > wrote:
> >
> > > Hi Jason,
> > >
> > > I like the new ReassigningPartitions metric. Would it be easy to expand
> > the
> > > scope of the KIP to split the ReplicationIncoming/Outgoing metric to
> > > distringuish between reassigning/non-reassigning traffic, or do you
> > prefer
> > > to keep this KIP nice and small?
> > >
> > > On Thu, Aug 8, 2019 at 12:08 AM Jason Gustafson <ja...@confluent.io>
> > > wrote:
> > >
> > > > Hi All,
> > > >
> > > > Since KIP-455 is passed, I would like to revive this proposal. I have
> > > > reduced the scope so that it is just changing the way we compute URP
> > and
> > > > adding a new metric for the number of reassigning partitions. Please
> > > take a
> > > > look:
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-352%3A+Distinguish+URPs+caused+by+reassignment
> > > > .
> > > >
> > > > Thanks,
> > > > Jason
> > > >
> > > > On Sun, Aug 5, 2018 at 3:41 PM Dong Lin <lindon...@gmail.com> wrote:
> > > >
> > > > > Hey Jason,
> > > > >
> > > > > Yes, I also think the right solution is probably to include the
> > > > > expected_replication_factor in the per-topic znode and include this
> > > > > information in the UpdateMetadataRequest.
> > > > >
> > > > > And I agree with Gwen and Ismael that it is reasonable to redefine
> > URP
> > > as
> > > > > those partitions whose ISR set size < expected replication factor.
> > > Given
> > > > > that URP is being used for monitoring cluster availability and and
> > > > > reassignment progress, it seems that one of them will break
> depending
> > > on
> > > > > the URP definition. It seems better to keep the legitimate
> use-case,
> > > i.e.
> > > > > monitoring cluster availability. Users who want to monitor
> > reassignment
> > > > > progress probably should use the "kafka-reassign-partitions.sh
> > > --verify".
> > > > >
> > > > >
> > > > > Thanks,
> > > > > Dong
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Aug 3, 2018 at 2:57 PM, Jason Gustafson <
> ja...@confluent.io>
> > > > > wrote:
> > > > >
> > > > > > @Dong
> > > > > >
> > > > > > Ugh, you are right. This is indeed trickier than I imagined. You
> > > could
> > > > > > argue that the problem here is that there is no single source of
> > > truth
> > > > > for
> > > > > > the expected replication factor. While a reassignment is in
> > progress,
> > > > the
> > > > > > reassignment itself has the expected replica set. Otherwise, it
> is
> > > > stored
> > > > > > partition metadata itself. This is why manually deleting the
> > > > reassignment
> > > > > > is problematic. We lose the expected replica set and we depend on
> > > users
> > > > > to
> > > > > > reinstall it. I guess I'm starting to think that the way we track
> > > > > > reassignment state in the controller is problematic. In addition
> to
> > > the
> > > > > > problems caused by deletion, we cannot easily change an existing
> > > > > > reassignment.
> > > > > >
> > > > > > High level what I'm thinking is that we need to move the pending
> > > > > > reassignment state out of the single znode and into the
> individual
> > > > > metadata
> > > > > > of the reassigned partitions so that there is a single source of
> > > truth
> > > > > for
> > > > > > the expected replicas (and hence the replication factor). This
> > would
> > > > also
> > > > > > give us an easier mechanism to manage the batching of multiple
> > > > > > reassignments. Let me think on it a bit and see if I can come up
> > > with a
> > > > > > proposal.
> > > > > >
> > > > > > @Gwen, @Ismael
> > > > > >
> > > > > > That is fair. I also prefer to redefine URP if we think the
> > > > compatibility
> > > > > > impact is a lower concern than continued misuse.
> > > > > >
> > > > > > -Jason
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Fri, Aug 3, 2018 at 12:25 PM, Ismael Juma <ism...@juma.me.uk>
> > > > wrote:
> > > > > >
> > > > > > > RIght, that was my thinking too.
> > > > > > >
> > > > > > > Ismael
> > > > > > >
> > > > > > > On Fri, Aug 3, 2018 at 12:04 PM Gwen Shapira <
> g...@confluent.io>
> > > > > wrote:
> > > > > > >
> > > > > > > > On Fri, Aug 3, 2018 at 11:23 AM, Jason Gustafson <
> > > > ja...@confluent.io
> > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hey Ismael,
> > > > > > > > >
> > > > > > > > > Yeah, my initial inclination was to redefine URP as well.
> My
> > > only
> > > > > > doubt
> > > > > > > > was
> > > > > > > > > how it would affect existing tools which might depend on
> URPs
> > > to
> > > > > > track
> > > > > > > > the
> > > > > > > > > progress of a reassignment. I decided to be conservative in
> > the
> > > > > end,
> > > > > > > but
> > > > > > > > > I'd reconsider if we think it is not a major concern. It is
> > > > > annoying
> > > > > > to
> > > > > > > > > need a new category.
> > > > > > > > >
> > > > > > > >
> > > > > > > > There are existing tools that use URP to track reassignment,
> > but
> > > > > there
> > > > > > > are
> > > > > > > > many more tools that use URP for monitoring and alerting. If
> I
> > > > > > understand
> > > > > > > > Ismael's suggestion correctly, a re-definition will improve
> the
> > > > > > > reliability
> > > > > > > > of the monitoring tools (since there won't be false alerts in
> > > case
> > > > of
> > > > > > > > re-assignment) without having to switch to a new metric.
> > > > > > > >
> > > > > > > > I think we should choose the proposal that improves the more
> > > common
> > > > > > usage
> > > > > > > > of the metric, in this case, failure monitoring rather than
> > > > > > reassignment.
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > About your question about storage in ZK, I can't think of
> > > > anything
> > > > > > > > > additional that we need. Probably the main difficulty is
> > > getting
> > > > > > access
> > > > > > > > to
> > > > > > > > > the replication factor in the topic utility. My basic
> thought
> > > was
> > > > > > just
> > > > > > > to
> > > > > > > > > collect the URPs (as we know them today) and use the config
> > API
> > > > to
> > > > > > > > > partition them based on the replication factor. Do you see
> > any
> > > > > > problems
> > > > > > > > > with this?
> > > > > > > > >
> > > > > > > > > -Jason
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Thu, Aug 2, 2018 at 12:14 PM, Ismael Juma <
> > > ism...@juma.me.uk>
> > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Thanks Jason. This is definitely a pain point. I actually
> > > > prefer
> > > > > > the
> > > > > > > > > option
> > > > > > > > > > to redefine what under-replicated means (currently under
> > > > rejected
> > > > > > > > > > alternatives). Also, do we need to make changes to what
> we
> > > > store
> > > > > in
> > > > > > > ZK?
> > > > > > > > > If
> > > > > > > > > > so, that should be in the KIP too.
> > > > > > > > > >
> > > > > > > > > > Ismael
> > > > > > > > > >
> > > > > > > > > > On Thu, Aug 2, 2018 at 11:45 AM Jason Gustafson <
> > > > > > ja...@confluent.io>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hey All,
> > > > > > > > > > >
> > > > > > > > > > > Another day, another KIP. This one is hopefully
> > > > > straightforward:
> > > > > > > > > > >
> > > > > > > > > > >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-352%
> > > > > > > > > > 3A+Distinguish+URPs+caused+by+reassignment
> > > > > > > > > > > .
> > > > > > > > > > > Have a look and let me know what you think!
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Jason
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > *Gwen Shapira*
> > > > > > > > Product Manager | Confluent
> > > > > > > > 650.450.2760 | @gwenshap
> > > > > > > > Follow us: Twitter <https://twitter.com/ConfluentInc> | blog
> > > > > > > > <http://www.confluent.io/blog>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > Best,
> > > Stanislav
> > >
> >
>
>
> --
> Best,
> Stanislav
>

Reply via email to