Jim,

A revision addressing your comments has been submitted: https://datatracker.ietf.org/doc/draft-ietf-rtgwg-net2cloud-problem-statement/

Thank you very much,
Linda

From: Linda Dunbar
Sent: Friday, August 16, 2024 11:08 AM
To: James Guichard <james.n.guich...@futurewei.com>; draft-ietf-rtgwg-net2cloud-problem-statem...@ietf.org
Cc: rtgwg@ietf.org
Subject: RE: AD review for draft-ietf-rtgwg-net2cloud-problem-statement

Jim,

Thank you very much for the detailed review and the comments. The resolutions to address your comments are inserted below. Please let us know if any further actions are needed.

Thanks,
Linda

From: James Guichard <james.n.guich...@futurewei.com>
Sent: Friday, August 16, 2024 6:54 AM
To: draft-ietf-rtgwg-net2cloud-problem-statem...@ietf.org
Cc: rtgwg@ietf.org
Subject: AD review for draft-ietf-rtgwg-net2cloud-problem-statement

Dear authors,

Please find my comments for draft-ietf-rtgwg-net2cloud-problem-statement (I have included line numbers from nits to help identify where in the document the comment is relevant):

Please update references below.

  == Outdated reference: A later version (-13) exists of draft-ietf-idr-sdwan-edge-discovery-12
  == Outdated reference: A later version (-12) exists of draft-ietf-opsawg-ntw-attachment-circuit-08
  == Outdated reference: A later version (-23) exists of draft-ietf-idr-5g-edge-service-metadata-16
  == Outdated reference: A later version (-15) exists of draft-ietf-opsawg-teas-attachment-circuit-10
  == Outdated reference: A later version (-14) exists of draft-ietf-add-split-horizon-authority-07

[Linda] fixed.

109   Cloud services are generally exposed, on-demand services that claim
110   to be scalable, highly available, and have usage-based billing. Most

Jim> The above sentence is difficult to parse. Do you mean "Cloud services are generally exposed as on-demand..." rather than "Cloud services are generally exposed,..."

[Linda] Yes, fixed it.

115   hosts services to many customers.

Jim> s/to/too

[Linda] meant to say "hosts services for multiple customers."

137   "edge" locations. <https://cloud.google.com/learn/what-
138   is-hybrid-cloud>.

Jim> Please remove the in-text reference and replace with a [] reference as either normative or informative.

[Linda] fixed.

144   https://en.wikipedia.org/wiki/Internet_exchange_point.

Jim> Please remove in-text reference and replace with a [] reference as either normative or informative.

[Linda] fixed.

186   - If a Cloud Gateway (GW), a BGP speaker, receives from its BGP
187   peer a capability that it does not itself support or recognize,
188   it need to ignore that capability, and the BGP session need not

Jim> As per RFC5492 it MUST ignore that capability and the BGP session MUST NOT be terminated. See section 3 of RFC5492 and correct the above text.

[Linda] changed "need" to "MUST" per RFC5492.

189   be terminated per [RFC5492]. When receiving a BGP UPDATE with a
190   malformed attribute, the revised BGP error handling procedure
191   in [RFC7606] should be followed instead of session resetting.

Jim> The above paragraph seems to be confused. The first sentence is talking about BGP OPEN and how to handle capabilities, and then the second sentence talks about BGP UPDATE messages that have malformed attributes. These are two completely different things so I am struggling to understand why they are referenced in the same paragraph and what exactly they have to do with each other in the context of a Cloud Gateway. Everything referenced is existing behavior, nothing new, so why is it here and what are the authors trying to convey? If they are trying to simply say that a Cloud Gateway should adhere to the procedure as specified in RFCs 5492 and 7606 then why not simply say that?
If the authors wish to keep the text I would suggest a rewrite as follows:

- If a Cloud Gateway (GW), a BGP speaker, receives from its BGP peer a BGP OPEN with a capability that it does not support or recognize, it MUST ignore that capability, and the BGP session MUST NOT be terminated, as per [RFC 5492].

- When receiving a BGP UPDATE with a malformed attribute, the revised BGP error handling procedures in [RFC 7606] should be followed instead of resetting the BGP session.

[Linda] Thanks for the suggestion. Those requirements are from Azure. Sometimes, the BGP sessions with their clients got reset upon receiving unsupported capabilities. They want to make sure that the BGP sessions stay up. Do you know why RFC5492 says "the Unsupported Capability NOTIFICATION message MUST NOT be generated"? Azure wants notification for the unsupported capability.

196   - When a Cloud DC eBGP session supports a limited number of
197   routes from external entities, the on-premises DCs need to set
198   up default routes and filter as many routes as practical
199   replacing them with a default in the eBGP advertisement to
200   minimize the number of routes to be exchanged with the Cloud DC
201   eBGP peers.

Jim> I do not understand the above paragraph. Is a Cloud DC different to an on-premise DC? Who is advertising default to who? The scenario that you are trying to convey above is non-obvious, at least to me, so please clarify.

[Linda] The statement is meant to emphasize that when a cloud DC GW doesn't support multi-hop eBGP sessions with its peers, a tunnel should be established to achieve IP adjacency. For example, AWS Transit Gateway does not support traditional multi-hop eBGP sessions. AWS recommends establishing eBGP sessions with third-party virtual appliances (like SD-WAN appliances) running in a VPC to exchange routing information between the on-premises network and multiple VPCs through a central point.

202   - When a Cloud GW receives inbound routes exceeding the maximum
203   routes threshold for a peer, the currently common practice is
204   generating out-of-band alerts (e.g., Syslog entries) via the
205   management system or terminating the BGP session (with cease
206   notification messages [RFC4486] being sent). Although out of
207   the scope of this document, more discussion is needed in the
208   IETF Inter-Domain Routing (IDR) Working Group for potential in-
209   band or autonomous notification directly to the peers when the
210   inbound routes exceed the maximum routes threshold.

Jim> More explanation is needed here including a reference to section 4 of RFC4486 that describes the procedure for terminating a peering with a NOTIFICATION message and error code providing a reason e.g. "Maximum number of prefixes reached".

[Linda] Azure doesn't want the BGP session to be terminated when the maximum number of prefixes is reached. Azure wants a method to notify the peers when the routes received exceed some threshold. Today's practice of using Syslog only informs Azure when the maximum number of routes is exceeded. But there is no effective way to notify peers to reduce routes. draft-sas-idr-maxprefix-inbound-05 would be a good solution. But the draft is expired. We are hoping to continue the draft by stating that "more discussion is needed in the IETF Inter-Domain Routing (IDR) Working Group for potential in-band or autonomous notification directly to the peers when the inbound routes exceed the maximum routes threshold."

222   Failures within a Cloud site, which can be a building, a floor, a
223   pod, or a server rack, include capacity degradation or complete out-
224   of-service failure. Here are some events that can trigger a site
225   failure: a) fiber cut for links connecting to the site or among pods
226   within the site; b) cooling failures; c) insufficient backup power
227   during a power failure; d) cyber threat attacks; e) too many changes
228   outside of the maintenance window; etc. A fiber-cut is not uncommon
229   in a Cloud site or between sites.

Jim> I would suggest to say above that the types of events are not an exhaustive list but just some examples.

[Linda] s/Here are some events that can trigger a site failure/Some examples of events that can trigger a site failure/

244   [RFC7432] specifies a mass withdrawal mechanism for EVPN to signal a
245   large number of routes being changed to remote PE nodes as quickly
246   as possible.

Jim> I am not sure that RFC 7432 is relevant here or why EVPN is even mentioned. Is there a reason to mention this or should the text simply be removed?

[Linda] The goal of the document is to list relevant IETF work for all the problems identified in the document. The paragraph is meant to explain that RFC7432 Mass Withdrawal alone is insufficient, as the routes at the sites might not all be EVPN routes. Changed the paragraph to the following:

   [RFC7432] specifies a mass withdrawal mechanism for EVPN to signal a large number of routes being changed to remote PE nodes as quickly as possible. However, this alone is insufficient, as the routes at the sites might not all be EVPN routes.

597   premesis CPEs to a Cloud DC via a private VPN requires the private

Jim> s/premesis/premise

[Linda] fixed.

691   necessary. Alternative encapsulations, like SRH (Segment Routing

Jim> Please provide a reference to RFC 8754 (SRH)

[Linda] added.

695   6. Requirements for Networks Connecting Cloud Data Centers

Jim> Why are there requirements in a problem statement document? Did the WG discuss splitting these out into a separate document?

[Linda] This document is meant to list the high-level requirements, which will lead to other documents in the future.

Thanks!

Jim
_______________________________________________
rtgwg mailing list -- rtgwg@ietf.org
To unsubscribe send an email to rtgwg-le...@ietf.org