Ketan Talaulikar has entered the following ballot position for draft-ietf-bess-evpn-unequal-lb-33: Discuss
When responding, please keep the subject line intact and reply to all email addresses included in the To and CC lines. (Feel free to cut this introductory paragraph, however.)

Please refer to https://www.ietf.org/about/groups/iesg/statements/handling-ballot-positions/ for more information about how to handle DISCUSS and COMMENT positions.

The document, along with other ballot positions, can be found here:
https://datatracker.ietf.org/doc/draft-ietf-bess-evpn-unequal-lb/

----------------------------------------------------------------------
DISCUSS:
----------------------------------------------------------------------

Thanks to the WG and the authors for their work on this document that brings weighted load-balancing to EVPN networks.

I have a few points that I would like to discuss.

661  * Each egress PE MUST advertise an EVPN Link Bandwidth Extended
662    Community along with the ES route to signal the PE–CE link
663    bandwidth associated with the ES.

<discuss-1> What if one of the ePEs does not send this EC or if it is invalid? What does the receiver do? Is the BW Capability ignored and everything falls back to the default DF election algorithm?

674  As a result, a given PE MAY appear multiple times in the DF candidate
675  list. Consequently, the value N used in the (V mod N) operation
676  defined in [RFC7432] MUST be interpreted as the total number of
677  ordinals in the weighted candidate list, rather than the total number
678  of distinct egress PEs in the ES.

<discuss-2> Since the default DF election is being modified, would this document also not update RFC7432? I am thinking that this document should either be tagged as "updates" RFC7432, RFC8584, RFC9785 (but also draft-ietf-bess-evpn-per-mcast-flow-df-election?) or none of those. If this is considered an "extension" or "enhancement" of the DF election rather than a "bugfix", then the "updates" tag is not necessary IMHO.
Please see my comments on the existing text in section 6.3, which gives me the impression that this is an extension. My point is that unnecessary "updates" tags on RFCs make it harder for implementers/operators/readers to differentiate real "fixes" from "enhancements/extensions". I am seeking consistency here and will leave the options for the authors/WG to consider.

840  7.1. Real-time Available Bandwidth

842  PE-CE link bandwidth availability may sometimes vary in real-time
843  disproportionately across PE-CE links within a multi-homed ES due to
844  various factors such as flow based hashing combined with fat flows
845  and unbalanced hashing. Reacting to real-time available bandwidth is
846  at this time outside the scope of this document.

848  Operators SHOULD be aware, however, that too frequent or dynamic re-
849  adjustment of advertised bandwidth values may lead to instability due
850  to repeated weighted path-list recomputation and DF election changes.
851  Appropriate guards, such as dampening or hysteresis mechanisms,
852  SHOULD be considered when dynamic bandwidth advertisement is used.

<discuss-3> Up to this point the document talked about link bandwidth and not available/free bandwidth. This section gives the impression that the value signaled could be something other than the fixed link bandwidth (i.e., fixed besides scenarios where LAG members go up/down). Why does the document not say that the values signaled MUST NOT vary based on link usage, as doing that would be very problematic? It is not sufficient to say that this is outside the scope. Then the first "SHOULD" is actually a "should". And the second SHOULD can give the impression that this is dynamic when it is really not the case except in the situation of LAG members going up/down. On the whole, this entire section is problematic from the perspective of routing stability. Likely I am misunderstanding the intent and, if so, please clarify.

854  7.2.
Weighted Load-balancing to Multi-homed Subnets

856  EVPN Link bandwidth extended community may also be used to achieve
857  unequal load-balancing of prefix routed traffic by including this
858  extended community in EVPN Route Type 5. When included in EVPN RT-5,
859  its value is to be interpreted as egress PE's relative weight for the
860  prefix included in this RT-5. Ingress PE will then compute the
861  forwarding path-list for the prefix route using weighted paths
862  received from each egress PE. EVPN Link bandwidth extended community
863  MUST be encoded with "Value-Units = 0x01" to signal a generalized
864  weight associated with the advertising PE.

<discuss-4> The MUST here is not clear to me. Is the intent that for RT5 only the Value-Units = 1 MUST be used? If so, why? Also, why is it buried down here instead of being called out prominently in section 4.1? Or is it that if weights are used then the Value-Units MUST be 1? If so, isn't this covered in section 4 already? Am I missing something?

890  7.5. EVPN Link Bandwidth Extended Community in Non-EVPN Networks

892  While this document does not preclude future applicability to non-
893  EVPN networks, it considers usage and handling of EVPN Link Bandwidth
894  Extended Community specified in this document with non-EVPN routes
895  out of scope.

<discuss-5> I would like to discuss why the use of an EVPN EC is being left "open" for other BGP address families, especially when a generic Link Bandwidth EC already exists in BGP to provide similar functionality. Should this document not explicitly limit the EVPN Link Bandwidth EC to EVPN only? If so, this needs to be clarified upfront where the EC is defined and this section can then be removed.

897  7.6. Preference for EVPN Link Bandwidth in EVPN Networks

899  It is possible that a non-EVPN Link Bandwidth extended community such
900  as [BGP-LINK-BW] is leaked from an IP or IPVPN route into an EVPN
901  RT-5 towards an EVPN network.
If an EVPN PE receives an EVPN route
902  with both the EVPN Link Bandwidth extended community specified in
903  this document and a non-EVPN Link Bandwidth extended community such
904  as the one specified in [BGP-LINK-BW], it MUST as default behavior,
905  prefer the EVPN Link Bandwidth extended community and handle it as
906  per procedures specified in this document. In other words, any non-
907  EVPN Link Bandwidth extended community is to be ignored if an EVPN
908  route is received with the EVPN Link Bandwidth extended community
909  specified in this document.

<discuss-6> What if some routes to a destination have both and some have only the Link Bandwidth EC? Would a mix of the two ECs for different paths for the same destination route be acceptable?

914  7.7. Interworking with Non-EVPN networks

916  In EVPN routing interworking use cases with IPVPN and IPv4/IPv6
917  routing, it is not beneficial to preserve the EVPN Link Bandwidth
918  extended community from EVPN routes to non-EVPN routes as the next-
919  hop is rewritten when a prefix learnt via EVPN RT-5 is advertised
920  into IPVPN or IP routing networks. Interworking procedures,
921  including preservation, cummulation or translation of EVPN Link
922  Bandwidth extended community to address current or future use cases
923  are however considered beyond the scope of this document. Readers
924  are encouraged to refer to [EVPN-IPVPN] for interworking
925  specification.

<discuss-7> There is no discussion in draft-ietf-bess-evpn-ipvpn-interworking that is related to handling of this EC propagation. On the contrary, that draft explicitly prohibits the propagation of all EVPN-specific ECs. I agree with what is specified by the interworking document and I wonder why this document does not normatively prohibit propagation of the EVPN-specific Link Bandwidth EC into any other address-family.
Also, I would have expected that this specification instead cover how the conversion is done between this and the BGP Link Bandwidth ECs - if not in this document then where else does the WG plan to do it? Having introduced two ECs for practically the same thing (and I am not debating how we got to this stage), isn't the onus on this document to cover this aspect? Then, about the cumulation aspect as the NH changes across the gateway PE but also for inter-AS option B, the document says out of scope. But where else would the WG cover that? Now, there is also the draft-ietf-bess-ebgp-dmz that covers this aspect but for BGP LBW EC. Can that also cover for the EVPN LBW EC?

976  10. IANA Considerations

978  10.1. Bandwidth Weighted DF Election Capability

980  [RFC8584] defines a new extended community for PEs within a
981  redundancy group to signal and agree on uniform DF Election Type and
982  Capabilities for each ES. This document requests IANA to allocate a
983  bit in the "DF Election capabilities" registry setup by [RFC8584]
984  with the following suggested bit number:

986  Bit 4: BW (Bandwidth Weighted DF Election)

<discuss-8> The first sentence is not suitable for IANA considerations (as suggested in my comments, please move it into section 6.1). The registry group is not specified here (the same applies to section 10.2) and it would be the BGP Extended Communities registry group.

----------------------------------------------------------------------
COMMENT:
----------------------------------------------------------------------

Please also find below some comments inline in the idnits output of v33 of this document. Please look for the tag <EoRv33> at the end to ensure you have received the full review.

2  BESS WorkGroup N. Malhotra, Ed.
3  Internet-Draft A. Sajassi
4  Updates: RFC8584 (if approved) Cisco Systems

<minor> Please put only "8584" instead of "RFC8584" in the "updates" tag

35  homing PE set.
The document updates RFC 8584 to enable weighted load

<minor> s/RFC 8584/RFC8584

142  distributed across all egress PEs. However, this assumption can be
143  restrictive in operational environments, particularly when adding or
144  removing member links in a multi-homed Link Aggregation Group (LAG),
145  and can be violated in the presence of individual PE–CE link
146  failures.

<minor> Perhaps an informational reference to IEEE_802.1AX_2014 is required for LAG?

284  respective access bandwidths. Specifically, the fraction of unicast
285  and Broadcast, Unknown Unicast, and Multicast (BUM) traffic serviced
286  by egress PEx SHOULD be:

288  Lx / (L1 + L2 + ... + Ln)

<major> This is an example and I don't think the use of normative SHOULD is appropriate here. Also, this is covered normatively in section 5.2 as well? So perhaps consider:

Specifically, the fraction of unicast and Broadcast, Unknown Unicast, and Multicast (BUM) traffic serviced by egress PEx is:

291  connected to a multi-homed Ethernet Segment. However, the
292  requirement described in this section is not limited to physical
293  Ethernet Segments. It equally applies to virtual Ethernet Segments
294  (vES) and to multi-homed subnets advertised using EVPN IP Prefix
295  routes.

<major> Please add normative reference to RFC9136. I am also wondering if this document is changing/updating anything in RFC9136? Likely not, but just checking.

319  * ES: Ethernet Segment

321  * ESI: Ethernet Segment ID

323  * vES: Virtual Ethernet Segment

325  * EVI: Ethernet virtual Instance, this is a mac-vrf

<major> Please provide references (in most cases RFC7432?) for all the EVPN terms above (as also IMET, DF, etc. below).

331  * RT-5: EVPN Route Type 5 as defined in [RFC7432]

<major> RT5 is in RFC9136

381  4. EVPN Link Bandwidth Extended Community

383  This document defines a new EVPN Link Bandwidth Extended Community to
384  support the solution described herein.
<major> Even if it is obvious, I was not able to find an RFC that restricts the use of the EVPN EC sub-types only for EVPN address-family. Could you please add a suitable reference that says that and if not then state that this type is specific to the EVPN AFI/SAFI and not applicable to others? This is related to one of the discuss points.

389  * IANA has assigned Sub-Type value 0x10 for the EVPN Link Bandwidth
390    Extended Community.

<minor> Perhaps? The Sub-Type value 0x10 is allocated for the EVPN Link Bandwidth Extended Community.

433  of Mbps. Support for generalized weight values is OPTIONAL. No
434  other Value-Units code points are defined at this time.

<minor> Can we please add a reference to section 10.3 in the last sentence so the reader becomes aware of the registry being created?

441  * Value-Units Consistency: When an EVPN Link Bandwidth Extended
442    Community is received with a route, a PE MUST verify that the
443    Value-Units field is consistent across all paths associated with

<minor> Is consistent or more specifically, is identical? There are other uses of the word consistent/consistently where it perhaps means "equal" or "same".

452  * Multiplicity: A PE MUST ensure that at most one instance of the
453    EVPN Link Bandwidth Extended Community is received per path. If
454    more than one instance is present, the extended community MUST be
455    ignored for all paths associated with the route.

<major> How can a PE ensure what it receives? It can only ensure what it sends. Can you please rephrase the first sentence so the MUST applies to the sender? The MUST in the second sentence can then be rephrased so that it applies to the receiver.

482  * Unexpected Route Types: This document specifies the use of the
483    EVPN Link Bandwidth Extended Community only with per-ES RT-1, RT-
484    4, and RT-5 routes.
If the extended community is received with
485    any other EVPN route type, including per-[ES, EVI] RT-1 or RT-2
486    routes, it MUST be ignored, and a syslog message [RFC5424] SHOULD
487    be generated indicating the reason.

<major> Why not leave the possibility of future route types being able to explicitly specify and use this EC?

497  generalized weight. New EVPN link bandwidth extended community
498  defined in this document is used for this purpose.

<minor> s/New EVPN/The EVPN

504  5.2. Ingress PE Behavior

506  An ingress PE MUST ensure that the EVPN Link Bandwidth Extended
507  Community is received from all egress PEs associated with a given ES,
508  and MUST verify that the received Value-Units are consistent across
509  all such egress PEs. If the EVPN Link Bandwidth Extended Community
510  is missing from one or more egress PEs, or if inconsistent Value-
511  Units are detected, the ingress PE MUST ignore the EVPN Link
512  Bandwidth Extended Community for that ES and MUST revert to regular
513  ECMP forwarding toward that ES. When the EVPN Link Bandwidth
514  Extended Community is ignored, the ingress PE SHOULD generate a
515  syslog [RFC5424] notification.

<major> Please remove the entire paragraph above since it is a duplicate of the text in section 4.1.1

521  for the ES. These normalized weights SHOULD then be used to
522  construct a weighted forwarding path-list for load balancing, instead
523  of using an ECMP-based path-list. The computation of egress PE

<minor> Perhaps s/an ECMP-based path-list/an equal weighted path-list ?

544  For a MAC+IP Advertisement Route (EVPN Route Type 2) received for ES-
545  y, the ingress PE MAY compute a MAC and IP forwarding path-list

<major> s/the ingress PE MAY compute/the ingress PE computes ... the normative part is already stated previously in this section. The MAY conflicts with the previous SHOULD.
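
To make the Section 5.2 behavior quoted above concrete (ignore the EC and revert to ECMP when it is missing from any egress PE or when Value-Units differ, otherwise build a weighted path-list), here is a minimal Python sketch. It is only one possible reading, not the draft's algorithm: the function name, the dictionary input, and the choice to reduce the advertised values by their highest common factor are all assumptions made for illustration.

```python
from functools import reduce
from math import gcd


def weighted_path_list(link_bw):
    """Return {egress PE: integer weight} for one ES (hypothetical helper).

    link_bw maps each egress PE to a (value_units, value) tuple taken from
    its EVPN Link Bandwidth EC, or to None when the PE attached no EC.
    """
    pes = sorted(link_bw)
    ecs = [link_bw[pe] for pe in pes]
    # EC missing on any PE, or Value-Units not identical across PEs:
    # ignore the EC for the whole ES and use equal weights (regular ECMP).
    if any(ec is None for ec in ecs) or len({units for units, _ in ecs}) != 1:
        return {pe: 1 for pe in pes}
    values = [value for _, value in ecs]
    hcf = reduce(gcd, values)  # highest common factor of advertised values
    if hcf == 0:
        # All values are zero: no usable ratio, so fall back to ECMP
        # (an assumption; the quoted text does not cover this case).
        return {pe: 1 for pe in pes}
    # Assumed normalization: divide every value by the common factor.
    return {pe: value // hcf for pe, value in zip(pes, values)}
```

For example, three egress PEs advertising generalized weights of 20, 10 and 10 with the same Value-Units reduce to weights 2, 1 and 1, while a single PE without the EC makes the whole ES fall back to equal weights.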
568  For a remote MAC+IP host route associated with ES-10, the resulting
569  forwarding path-list MAY therefore be computed as:

<major> s/path-list MAY therefore be computed/path-list is, therefore, computed

581  The above computation algorithm is provided for illustration only.
582  Weighted path-list computation based on the EVPN Link Bandwidth
583  Extended Community is a local implementation choice. If the received

<major> Please remove the above sentence since it has already been stated previously in the same section.

584  bandwidth values do not yield a suitable HCF that allows programming
585  reasonable integer weights in hardware, an implementation MAY apply
586  alternative approximation or rounding methods to derive implementable
587  weight values.

<minor> The above sentence is better placed right after the previous text in this section about how weights are determined and that they are local implementation matters.

589  Weighted path-list computation MUST be performed for an ES only if
590  the EVPN Link Bandwidth Extended Community is received from all
591  egress PEs advertising reachability to that ES via Ethernet A-D per-
592  ES Route Type 1. If the EVPN Link Bandwidth Extended Community is
593  not received from one or more such egress PEs, the ingress PE MUST
594  compute the forwarding path-list using regular ECMP semantics. A
595  default weight MUST NOT be assumed for an egress PE that does not
596  advertise link bandwidth, as the computed weights are strictly
597  relative.

<major> The second-to-last statement is yet another repetition. The last sentence is new - please consider putting it either in section 4.1.1 or where weights are discussed earlier in this section.

599  If a per-ES Route Type 1 is not advertised, or is withdrawn, by an
600  egress PE as specified in [RFC7432], that egress PE MUST be removed
601  from the forwarding path-list for the corresponding [EVI, ES], and
602  the weighted path-list MUST be recomputed accordingly.
604  If a per-[ES, EVI] Route Type 1 is not advertised by an egress PE as
605  specified in [RFC7432], that egress PE MUST NOT be included in the
606  forwarding path-list for the corresponding [EVI, ES]. In this case,
607  the weighted path-list MUST be computed using only the weights
608  received from egress PEs that advertised the per-[ES, EVI] Route Type
609  1.

<major> The first sentences in the above 2 paragraphs are restating in a normative manner something that was already specified in RFC7432. This is wrong. Perhaps the intention here was to offer a reminder to the reader, and if so, please rephrase accordingly. Then the last sentences are obvious, but perhaps can be stated more generically that any change in the path-list results in the recomputation of the ratios of weights for each existing path (or something like that?).

622  6.1. The BW Capability in the DF Election Extended Community

624  This document requests IANA to allocate a new bit in the DF Election
625  Capabilities registry defined by [RFC8584]:

<major> Please make requests to IANA only in IANA consideration sections. This is already done in section 10.1 so the above sentence needs to be rephrased with a TBD bit value. Later it will get replaced by the actual one upon RFC publication. Further, the following sentence that is in 10.1 has no place in IANA considerations and is better moved as the first sentence in this section.

"[RFC8584] defines a new extended community for PEs within a redundancy group to signal and agree on uniform DF Election Type and Capabilities for each ES."
639  The BW Capability MAY be advertised with the following DF Types:

641  * Type 0: Default DF Election algorithm, as specified in [RFC7432]

643  * Type 1: Highest Random Weight (HRW) algorithm, as specified in
644    [RFC8584]

646  * Type 2: Preference-based DF Election algorithm, as specified in
647    [RFC9785]

649  * Type 4: HRW per-multicast-flow DF Election algorithm, as specified
650    in [EVPN-PER-MCAST-FLOW-DF]

<major> Perhaps explicitly mention that future documents introducing new DF types are expected to specify their working with the BW Capability, as applicable?

688  6.3. BW Capability and HRW DF Election algorithm (Type 1 and 4)

690  [RFC8584] introduces Highest Random Weight (HRW) algorithm (DF Type
691  1) for DF election in order to solve potential DF election skew
692  depending on Ethernet tag space distribution. [EVPN-PER-MCAST-FLOW-
693  DF] further extends HRW algorithm for per-multicast flow based hash
694  computations (DF Type 4). This section describes extensions to HRW
695  Algorithm for EVPN DF Election specified in [RFC8584] and in [EVPN-
696  PER-MCAST-FLOW-DF] in order to achieve DF election distribution that
697  is weighted by link bandwidth.

<major> This paragraph gives the correct impression that what this document is doing is extensions and not "updates" to all those other RFCs. Please reconsider the "updates" tag, which moreover covers only RFC8584. Based on the current logic, draft-ietf-bess-evpn-per-mcast-flow-df-election would also get added to the list of "updates" RFCs?

729  Note that the bandwidth increment must always be an integer,

<major> Is that a must or a MUST?

799  6.4. BW Capability and Preference DF Election algorithm

801  This section applies to ES'es where all the PEs in the ES agree use
802  the BW Capability with DF Type 2. The BW Capability modifies the
803  Preference DF Election procedure [RFC9785], by adding the LBW value
804  as a tie-breaker as follows:

<major> So, does this document also "update" RFC9785?
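
For the Type 0 (default) DF election with the BW Capability, the draft text quoted under <discuss-2> says that a PE may appear multiple times in the candidate list and that N in (V mod N) counts ordinals rather than distinct PEs. The following Python sketch illustrates that rule under stated assumptions: the function name and the `ordinals_per_pe` mapping are hypothetical, and placing a PE's repeated ordinals next to each other is an illustrative choice, since the quoted text does not say how the repeats are ordered or how many each PE gets.

```python
import ipaddress


def default_df(pe_addresses, vlan, ordinals_per_pe=None):
    """Pick the DF for Ethernet tag `vlan` using the (V mod N) rule.

    ordinals_per_pe is a hypothetical mapping {pe_address: repeat count};
    when omitted, every PE gets exactly one ordinal (plain RFC 7432).
    """
    # RFC 7432 orders candidate PEs by ascending IP address, ordinals from 0.
    ordered = sorted(pe_addresses, key=ipaddress.ip_address)
    candidates = []
    for pe in ordered:
        repeats = 1 if ordinals_per_pe is None else ordinals_per_pe[pe]
        # Assumption: repeated ordinals for one PE are contiguous.
        candidates.extend([pe] * repeats)
    # N is the number of ordinals, not the number of distinct PEs.
    return candidates[vlan % len(candidates)]
```

With two PEs and one ordinal each, tags alternate between them; giving the first PE two ordinals makes N equal to 3, so it wins two out of every three consecutive tag values.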
873  and per-[ES, EVI] RT-1 from egress PEs. In such a case, only the
874  weights received via per-ES RT-1 from the egress PEs included in the
875  MAC path-list are to be considered for weighted path-list
876  computation.

<major> Would "only ... path-list MUST be considered ..." be more suitable given the implications on interoperability?

878  7.4. EVPN IRB Multi-homing With Non-EVPN routing

880  EVPN-LAG based multi-homing on an IRB gateway may also be deployed

<major> Perhaps informative reference to RFC9135 is required here?

940  * When a generalized weight is used, the operator MUST ensure
941    consistent interpretation of the advertised value across all
942    egress PEs associated with the Ethernet Segment. This requirement
943    applies even when the egress PEs span multiple routing domains or
944    Autonomous Systems.

<major> The above seems odd when the document does not define any specification for this feature across domains or ASes.

988  10.2. EVPN Link Bandwidth Extended Community

990  This document defines a new EVPN Link Bandwidth extended community to
991  signal local ES link bandwidth to ingress PEs. This extended
992  community is defined of type 0x06 (EVPN Extended Community Sub-
993  Types). IANA has assigned a sub-type value of 0x10 for the EVPN Link
994  bandwidth extended community, of type 0x06 (EVPN Extended Community
995  Sub-Types). EVPN Link Bandwidth extended community is defined as
996  transitive.

<major> Only the third sentence in the above paragraph is suitable for IANA considerations as the rest is a description of the extension that is already covered in section 4.

1096  Appendix A. BGP-Link-Bandwidth-Extended-Community

1098  Link bandwidth extended community described in [BGP-LINK-BW] for
1099  layer 3 VPNs was considered for re-use here. This Link bandwidth
1100  extended community is however defined in [BGP-LINK-BW] as optional
1101  non-transitive.
Since it is not possible to change deployed behavior
1102  of extended community defined in [BGP-LINK-BW], it was decided to
1103  define a new one. In inter-AS scenarios within an EVPN network, EVPN
1104  link-bandwidth needs to be signaled to eBGP neighbors. When signaled
1105  across AS boundary, this extended community can be used to achieve
1106  optimal load-balancing towards egress PEs in a different AS. This is
1107  applicable both when next-hop is changed or unchanged across AS
1108  boundaries.

<major> If you look at the latest version of draft-ietf-idr-link-bandwidth that is now in the RFC Editor queue, then the above appendix is not correct as there are now both transitive and non-transitive types. Please consider deleting this appendix or re-writing it for accuracy so as to explain how we got to having two things for the same thing. I would suggest knocking this off to keep things simple.

<EoRv33>

_______________________________________________
BESS mailing list -- [email protected]
To unsubscribe send an email to [email protected]
