Jim,

A revision addressing your comments has been submitted:
https://datatracker.ietf.org/doc/draft-ietf-rtgwg-net2cloud-problem-statement/

Thank you very much,
Linda

From: Linda Dunbar
Sent: Friday, August 16, 2024 11:08 AM
To: James Guichard <james.n.guich...@futurewei.com>; 
draft-ietf-rtgwg-net2cloud-problem-statem...@ietf.org
Cc: rtgwg@ietf.org
Subject: RE: AD review for draft-ietf-rtgwg-net2cloud-problem-statement

Jim,

Thank you very much for the detailed review and the comments.
The resolutions to address your comments are inserted below.
Please let us know if any further actions are needed.
Thanks,
Linda

From: James Guichard <james.n.guich...@futurewei.com>
Sent: Friday, August 16, 2024 6:54 AM
To: draft-ietf-rtgwg-net2cloud-problem-statem...@ietf.org
Cc: rtgwg@ietf.org
Subject: AD review for draft-ietf-rtgwg-net2cloud-problem-statement

Dear authors,

Please find my comments for draft-ietf-rtgwg-net2cloud-problem-statement (I 
have included line numbers from nits to help identify where in the document the 
comment is relevant):

Please update references below.


  == Outdated reference: A later version (-13) exists of
     draft-ietf-idr-sdwan-edge-discovery-12

  == Outdated reference: A later version (-12) exists of
     draft-ietf-opsawg-ntw-attachment-circuit-08

  == Outdated reference: A later version (-23) exists of
     draft-ietf-idr-5g-edge-service-metadata-16

  == Outdated reference: A later version (-15) exists of
     draft-ietf-opsawg-teas-attachment-circuit-10

  == Outdated reference: A later version (-14) exists of
     draft-ietf-add-split-horizon-authority-07



[Linda] fixed.



109     Cloud services are generally exposed, on-demand services that claim
110     to be scalable, highly available, and have usage-based billing. Most



Jim> The above sentence is difficult to parse. Do you mean "Cloud services are 
generally exposed as on-demand..." rather than "Cloud services are generally 
exposed,..."

[Linda] Yes, fixed it.



115     hosts services to many customers.



Jim> s/to/too

[Linda] meant to say "hosts services for multiple customers."



137                 "edge" locations. <https://cloud.google.com/learn/what-
138                 is-hybrid-cloud>.



Jim> Please remove the in-text reference and replace with a [] reference as 
either normative or informative.

[Linda] fixed.





144                 https://en.wikipedia.org/wiki/Internet_exchange_point.



Jim> Please remove in-text reference and replace with a [] reference as either 
normative or informative.

[Linda] fixed.





186       - If a Cloud Gateway (GW), a BGP speaker, receives from its BGP
187           peer a capability that it does not itself support or recognize,
188           it need to ignore that capability, and the BGP session need not



Jim> As per RFC5492 it MUST ignore that capability and the BGP session MUST NOT 
be terminated. See section 3 of RFC5492 and correct the above text.

[Linda] Changed "need" to "MUST" per RFC5492.



189           be terminated per [RFC5492]. When receiving a BGP UPDATE with a
190           malformed attribute, the revised BGP error handling procedure
191           in [RFC7606] should be followed instead of session resetting.



Jim> the above paragraph seems to be confused. The first sentence is talking 
about BGP OPEN and how to handle capabilities, and then the second sentence 
talks about BGP UPDATE messages that have malformed attributes. These are two 
completely different things so I am struggling to understand why they are 
referenced in the same paragraph and what exactly they have to do with each 
other in the context of a Cloud Gateway? Everything referenced is existing 
behavior, nothing new, so why is it here and what are the authors trying to 
convey? If they are trying to simply say that a Cloud Gateway should adhere to 
the procedure as specified in RFCs 5492 and 7606 then why not simply say that? 
If the authors wish to keep the text I would suggest a rewrite as follows:



      - If a Cloud Gateway (GW), a BGP speaker, receives from its BGP peer a
        BGP OPEN with a capability that it does not support or recognize, it
        MUST ignore that capability, and the BGP session MUST NOT be
        terminated, as per [RFC 5492].

      - When receiving a BGP UPDATE with a malformed attribute, the revised
        BGP error handling procedures in [RFC 7606] should be followed
        instead of resetting the BGP session.

[Linda] Thanks for the suggestion. Those requirements are from Azure. 
Sometimes the BGP sessions with their clients were reset upon receiving 
unsupported capabilities. They want to make sure that the BGP sessions stay up.

Do you know why RFC5492 says "the Unsupported Capability NOTIFICATION message 
MUST NOT be generated"? Azure wants a notification for the unsupported 
capability.
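
For reference, the RFC 5492 behavior discussed above can be sketched roughly as follows. This is only an illustration, not text proposed for the draft: the function name `process_peer_capabilities`, the `OpenResult` type, and the capability code 200 used in the usage note are hypothetical, and only a small subset of registered capability codes is shown.

```python
from dataclasses import dataclass, field

# Illustrative subset of registered BGP capability codes.
MULTIPROTOCOL = 1    # RFC 4760
ROUTE_REFRESH = 2    # RFC 2918
FOUR_OCTET_AS = 65   # RFC 6793


@dataclass
class OpenResult:
    """Outcome of processing the capabilities in a peer's BGP OPEN (hypothetical type)."""
    negotiated: set = field(default_factory=set)  # advertised by peer and supported locally
    ignored: set = field(default_factory=set)     # advertised by peer, unknown locally
    session_up: bool = True                       # never cleared on an unknown capability


def process_peer_capabilities(local_supported, peer_advertised):
    """Sketch of RFC 5492 Section 3 handling of received capabilities.

    A capability the local speaker does not support or recognize is
    ignored: no Unsupported Capability NOTIFICATION is generated and
    the session is not terminated.
    """
    result = OpenResult()
    for code in peer_advertised:
        if code in local_supported:
            result.negotiated.add(code)
        else:
            # Ignore silently; an implementation may still log this locally.
            result.ignored.add(code)
    return result
```

With `local_supported = {MULTIPROTOCOL, ROUTE_REFRESH}` and a peer advertising `[1, 65, 200]`, the sketch keeps the session up, uses capability 1, and records 65 and 200 as ignored. Any alerting on the ignored set would be a local, out-of-band matter in this sketch.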





196       - When a Cloud DC eBGP session supports a limited number of
197           routes from external entities, the on-premises DCs need to set
198           up default routes and filter as many routes as practical
199           replacing them with a default in the eBGP advertisement to
200           minimize the number of routes to be exchanged with the Cloud DC
201           eBGP peers.



Jim> I do not understand the above paragraph. Is a Cloud DC different to an 
on-premise DC? Who is advertising default to who? The scenario that you are 
trying to convey above is non-obvious, at least to me, so please clarify.

[Linda] The statement is meant to emphasize that when a Cloud DC GW doesn't 
support multi-hop eBGP sessions with its peers, a tunnel should be 
established to achieve IP adjacency.

For example, AWS Transit Gateway does not support traditional multi-hop eBGP 
sessions. AWS recommends establishing eBGP sessions with third-party virtual 
appliances (like SD-WAN appliances) running in a VPC to exchange routing 
information between the on-premises network and multiple VPCs through a 
central point.





202       - When a Cloud GW receives inbound routes exceeding the maximum
203           routes threshold for a peer, the currently common practice is
204           generating out-of-band alerts (e.g., Syslog entries) via the
205           management system or terminating the BGP session (with cease
206           notification messages [RFC4486] being sent). Although out of
207           the scope of this document, more discussion is needed in the
208           IETF Inter-Domain Routing (IDR) Working Group for potential in-
209           band or autonomous notification directly to the peers when the
210           inbound routes exceed the maximum routes threshold.



Jim> More explanation is needed here including a reference to section 4 of 
RFC4486 that describes the procedure for terminating a peering with a 
NOTIFICATION message and error code providing a reason e.g. "Maximum number of 
prefixes reached".

[Linda] Azure doesn't want the BGP session to be terminated when the maximum 
number of prefixes is reached. Azure wants a method to notify the peers when 
the routes received exceed some threshold. Today's practice of using Syslog 
only informs Azure when the maximum number of routes is exceeded, but there is 
no effective way to notify peers to reduce routes. 
draft-sas-idr-maxprefix-inbound-05 would be a good solution, but the draft has 
expired.

We are hoping to continue the draft by stating that "more discussion is needed 
in the IETF Inter-Domain Routing (IDR) Working Group for potential in-band or 
autonomous notification directly to the peers when the inbound routes exceed 
the maximum routes threshold."
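
To make the two current practices concrete, the decision a Cloud GW takes on inbound route count can be sketched roughly as below. This is a simplified illustration under stated assumptions: `check_max_prefix`, `warn_ratio`, and `terminate_on_limit` are hypothetical names, and the returned strings stand in for whatever an implementation actually does. The only protocol facts used are that Cease is NOTIFICATION error code 6 (RFC 4271) and that "Maximum Number of Prefixes Reached" is Cease subcode 1 (RFC 4486).

```python
CEASE = 6                 # BGP NOTIFICATION error code "Cease" (RFC 4271)
MAX_PREFIXES_REACHED = 1  # Cease subcode (RFC 4486)


def check_max_prefix(received, limit, warn_ratio=0.8, terminate_on_limit=True):
    """Return the action for a peer that has sent `received` prefixes.

    "ok"    : below the warning threshold
    "warn"  : warning threshold reached, out-of-band alert only (e.g. Syslog)
    "cease" : limit exceeded, send NOTIFICATION (CEASE, MAX_PREFIXES_REACHED)
              and terminate the session
    "alert" : limit exceeded but the session is kept up, out-of-band alert only
    """
    if received > limit:
        return "cease" if terminate_on_limit else "alert"
    if received >= warn_ratio * limit:
        return "warn"
    return "ok"
```

With a limit of 100, `check_max_prefix(101, 100)` yields `"cease"`, while `check_max_prefix(101, 100, terminate_on_limit=False)` yields `"alert"`, which matches the behavior described for Azure. In both the `"warn"` and `"alert"` cases the peer itself learns nothing, which is the gap the quoted sentence about in-band notification points at.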





222     Failures within a Cloud site, which can be a building, a floor, a
223     pod, or a server rack, include capacity degradation or complete out-
224     of-service failure. Here are some events that can trigger a site
225     failure: a) fiber cut for links connecting to the site or among pods
226     within the site; b) cooling failures; c) insufficient backup power
227     during a power failure; d) cyber threat attacks; e) too many changes
228     outside of the maintenance window; etc. A fiber-cut is not uncommon
229     in a Cloud site or between sites.



Jim> I would suggest to say above that the types of events are not an 
exhaustive list but just some examples.



[Linda] s/Here are some events that can trigger a site failure/Some examples 
of events that can trigger a site failure/



244     [RFC7432] specifies a mass withdrawal mechanism for EVPN to signal a
245     large number of routes being changed to remote PE nodes as quickly
246     as possible.



Jim> I am not sure that RFC 7432 is relevant here or why EVPN is even 
mentioned. Is there a reason to mention this or should the text simply be 
removed?

[Linda] The goal of the document is to list relevant IETF work for all the 
problems identified in the document. The paragraph is meant to explain that 
RFC7432 mass withdrawal alone is insufficient, as the routes at the sites 
might not all be EVPN routes. Changed the paragraph to the following:

[RFC7432] specifies a mass withdrawal mechanism for EVPN to signal a large 
number of routes being changed to remote PE nodes as quickly as possible. 
However, this alone is insufficient, as the routes at the sites might not all 
be EVPN routes.



597     premesis CPEs to a Cloud DC via a private VPN requires the private



Jim> s/premesis/premise

[Linda] fixed.



691     necessary. Alternative encapsulations, like SRH (Segment Routing



Jim> Please provide a reference to RFC 8754 (SRH)

[Linda] added.



695   6. Requirements for Networks Connecting Cloud Data Centers



Jim> Why are there requirements in a problem statement document? Did the WG 
discuss splitting these out into a separate document?

[Linda] This document is meant to list the high-level requirements, which 
will lead to other documents in the future.



Thanks!



Jim





_______________________________________________
rtgwg mailing list -- rtgwg@ietf.org
To unsubscribe send an email to rtgwg-le...@ietf.org
