Hey Stan, Thanks for the suggestion. I have updated the proposal to include two new meters for reassignment traffic inbound and outbound.
-Jason On Thu, Aug 8, 2019 at 12:07 PM Stanislav Kozlovski <stanis...@confluent.io> wrote: > Agreed on not totally spitting the replication incoming/outgoing metric - > that could cause confusion. Just adding a new metric sounds good to me! > > The throttle follow-up is mentioned as part of future work in KIP-455 and I > agree that it is way out of scope for this one. > > > On Thu, Aug 8, 2019 at 8:03 PM Jason Gustafson <ja...@confluent.io> wrote: > > > Hi Stan, > > > > That's an interesting thought. I'm wondering if it would be better to > leave > > the current replication metrics counting for the total replication > traffic > > and add a new metric for reassignment traffic? > > > > By the way, a further KIP-455 follow-up that I won't attempt here would > be > > to have a separate throttle for reassignment traffic. > > > > -Jason > > > > On Thu, Aug 8, 2019 at 11:34 AM Stanislav Kozlovski < > > stanis...@confluent.io> > > wrote: > > > > > Hi Jason, > > > > > > I like the new ReassigningPartitions metric. Would it be easy to expand > > the > > > scope of the KIP to split the ReplicationIncoming/Outgoing metric to > > > distringuish between reassigning/non-reassigning traffic, or do you > > prefer > > > to keep this KIP nice and small? > > > > > > On Thu, Aug 8, 2019 at 12:08 AM Jason Gustafson <ja...@confluent.io> > > > wrote: > > > > > > > Hi All, > > > > > > > > Since KIP-455 is passed, I would like to revive this proposal. I have > > > > reduced the scope so that it is just changing the way we compute URP > > and > > > > adding a new metric for the number of reassigning partitions. Please > > > take a > > > > look: > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-352%3A+Distinguish+URPs+caused+by+reassignment > > > > . > > > > > > > > Thanks, > > > > Jason > > > > > > > > On Sun, Aug 5, 2018 at 3:41 PM Dong Lin <lindon...@gmail.com> wrote: > > > > > > > > > Hey Jason, > > > > > > > > > > Yes, I also think the right solution is probably to include the > > > > > expected_replication_factor in the per-topic znode and include this > > > > > information in the UpdateMetadataRequest. > > > > > > > > > > And I agree with Gwen and Ismael that it is reasonable to redefine > > URP > > > as > > > > > those partitions whose ISR set size < expected replication factor. > > > Given > > > > > that URP is being used for monitoring cluster availability and and > > > > > reassignment progress, it seems that one of them will break > depending > > > on > > > > > the URP definition. It seems better to keep the legitimate > use-case, > > > i.e. > > > > > monitoring cluster availability. Users who want to monitor > > reassignment > > > > > progress probably should use the "kafka-reassign-partitions.sh > > > --verify". > > > > > > > > > > > > > > > Thanks, > > > > > Dong > > > > > > > > > > > > > > > > > > > > On Fri, Aug 3, 2018 at 2:57 PM, Jason Gustafson < > ja...@confluent.io> > > > > > wrote: > > > > > > > > > > > @Dong > > > > > > > > > > > > Ugh, you are right. This is indeed trickier than I imagined. You > > > could > > > > > > argue that the problem here is that there is no single source of > > > truth > > > > > for > > > > > > the expected replication factor. While a reassignment is in > > progress, > > > > the > > > > > > reassignment itself has the expected replica set. Otherwise, it > is > > > > stored > > > > > > partition metadata itself. This is why manually deleting the > > > > reassignment > > > > > > is problematic. We lose the expected replica set and we depend on > > > users > > > > > to > > > > > > reinstall it. I guess I'm starting to think that the way we track > > > > > > reassignment state in the controller is problematic. In addition > to > > > the > > > > > > problems caused by deletion, we cannot easily change an existing > > > > > > reassignment. > > > > > > > > > > > > High level what I'm thinking is that we need to move the pending > > > > > > reassignment state out of the single znode and into the > individual > > > > > metadata > > > > > > of the reassigned partitions so that there is a single source of > > > truth > > > > > for > > > > > > the expected replicas (and hence the replication factor). This > > would > > > > also > > > > > > give us an easier mechanism to manage the batching of multiple > > > > > > reassignments. Let me think on it a bit and see if I can come up > > > with a > > > > > > proposal. > > > > > > > > > > > > @Gwen, @Ismael > > > > > > > > > > > > That is fair. I also prefer to redefine URP if we think the > > > > compatibility > > > > > > impact is a lower concern than continued misuse. > > > > > > > > > > > > -Jason > > > > > > > > > > > > > > > > > > > > > > > > On Fri, Aug 3, 2018 at 12:25 PM, Ismael Juma <ism...@juma.me.uk> > > > > wrote: > > > > > > > > > > > > > RIght, that was my thinking too. > > > > > > > > > > > > > > Ismael > > > > > > > > > > > > > > On Fri, Aug 3, 2018 at 12:04 PM Gwen Shapira < > g...@confluent.io> > > > > > wrote: > > > > > > > > > > > > > > > On Fri, Aug 3, 2018 at 11:23 AM, Jason Gustafson < > > > > ja...@confluent.io > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > Hey Ismael, > > > > > > > > > > > > > > > > > > Yeah, my initial inclination was to redefine URP as well. > My > > > only > > > > > > doubt > > > > > > > > was > > > > > > > > > how it would affect existing tools which might depend on > URPs > > > to > > > > > > track > > > > > > > > the > > > > > > > > > progress of a reassignment. I decided to be conservative in > > the > > > > > end, > > > > > > > but > > > > > > > > > I'd reconsider if we think it is not a major concern. It is > > > > > annoying > > > > > > to > > > > > > > > > need a new category. > > > > > > > > > > > > > > > > > > > > > > > > > There are existing tools that use URP to track reassignment, > > but > > > > > there > > > > > > > are > > > > > > > > many more tools that use URP for monitoring and alerting. If > I > > > > > > understand > > > > > > > > Ismael's suggestion correctly, a re-definition will improve > the > > > > > > > reliability > > > > > > > > of the monitoring tools (since there won't be false alerts in > > > case > > > > of > > > > > > > > re-assignment) without having to switch to a new metric. > > > > > > > > > > > > > > > > I think we should choose the proposal that improves the more > > > common > > > > > > usage > > > > > > > > of the metric, in this case, failure monitoring rather than > > > > > > reassignment. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > About your question about storage in ZK, I can't think of > > > > anything > > > > > > > > > additional that we need. Probably the main difficulty is > > > getting > > > > > > access > > > > > > > > to > > > > > > > > > the replication factor in the topic utility. My basic > thought > > > was > > > > > > just > > > > > > > to > > > > > > > > > collect the URPs (as we know them today) and use the config > > API > > > > to > > > > > > > > > partition them based on the replication factor. Do you see > > any > > > > > > problems > > > > > > > > > with this? > > > > > > > > > > > > > > > > > > -Jason > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Aug 2, 2018 at 12:14 PM, Ismael Juma < > > > ism...@juma.me.uk> > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > Thanks Jason. This is definitely a pain point. I actually > > > > prefer > > > > > > the > > > > > > > > > option > > > > > > > > > > to redefine what under-replicated means (currently under > > > > rejected > > > > > > > > > > alternatives). Also, do we need to make changes to what > we > > > > store > > > > > in > > > > > > > ZK? > > > > > > > > > If > > > > > > > > > > so, that should be in the KIP too. > > > > > > > > > > > > > > > > > > > > Ismael > > > > > > > > > > > > > > > > > > > > On Thu, Aug 2, 2018 at 11:45 AM Jason Gustafson < > > > > > > ja...@confluent.io> > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > Hey All, > > > > > > > > > > > > > > > > > > > > > > Another day, another KIP. This one is hopefully > > > > > straightforward: > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-352% > > > > > > > > > > 3A+Distinguish+URPs+caused+by+reassignment > > > > > > > > > > > . > > > > > > > > > > > Have a look and let me know what you think! > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > Jason > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > *Gwen Shapira* > > > > > > > > Product Manager | Confluent > > > > > > > > 650.450.2760 | @gwenshap > > > > > > > > Follow us: Twitter <https://twitter.com/ConfluentInc> | blog > > > > > > > > <http://www.confluent.io/blog> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > Best, > > > Stanislav > > > > > > > > -- > Best, > Stanislav >