Hi Gunter, thanks for the thorough review and the readability suggestions. I have uploaded -09, incorporating most of your suggestions with the exception of very minor details. For example, I used “largest (latest)” instead of s/largest/latest/: we are talking about both values and time, so both largest and latest make sense, and I just kept both terms.
Regards,
Luc André

Luc André Burdet | Cisco | laburdet.i...@gmail.com | Tel: +1 613 254 4814

From: Gunter van de Velde (Nokia) <gunter.van_de_velde=40nokia....@dmarc.ietf.org>
Date: Thursday, May 30, 2024 at 07:28
To: draft-ietf-bess-evpn-fast-df-recov...@ietf.org <draft-ietf-bess-evpn-fast-df-recov...@ietf.org>
Cc: 'BESS' <bess@ietf.org>
Subject: [bess] [Shepherding AD review] review of draft-ietf-bess-evpn-fast-df-recovery-08

# Gunter Van de Velde, RTG AD, comments for draft-ietf-bess-evpn-fast-df-recovery-08

Hi All,

Please find here a shepherding AD review of draft-ietf-bess-evpn-fast-df-recovery-08.

I'm sorry it took a bit of time to get started on this draft. I've begun reviewing this document before we kick off the IETF Last Call process. Once we address these points, we can move forward with the document through the IESG chain.

A big thank you to Adrian Farrel for his RTG-DIR review on the -07 version, which helped improve the document to its -08 version, and to Matthew Bocci for the Shepherd's write-up (4 July 2022).

In my review, I've noted some final observations while going through the document. For better readability, I've suggested some paragraph edits. One thing I noticed is that there's not much RFC 2119-based normative language used. Maybe the authors can take another look and add or update the RFC 2119 text where needed.

You can find my review notes below.

#GENERIC COMMENTS
#================

88   Virtualization Overlay (NVO) and DC inte)rconnect (DCI) services, and

Typo with the ")".

100  multihomed Ethernet Segment. This DF election is achieved
101  independent of the number of EVPN Instances (EVIs) associated with
102  that Ethernet Segment and it is performed via simple signaling
103  between the recovered node and each of the other nodes in the
104  multihomed group.

I believe that the word 'simple' is reasonably subjective. It may be better to replace it with a construct using 'straightforward'.
Possible rewrite:

"This Designated Forwarder (DF) election is conducted independently of the number of EVPN Instances (EVIs) associated with the Ethernet Segment and is executed through straightforward signaling between the recovered node and each of the other nodes in the multihomed group."

105  This document updates the state machine described in Section 2.1 of

Being more explicit about what is updated could be better:

"This document updates the DF Election Finite State Machine (FSM) described in Section 2.1 of"

131  In EVPN technology, multiple Provider Edge (PE) devices have the
132  ability to encap and decap data belonging to the same VLAN. In

Expand on encap and decap for better readability.

131  In EVPN technology, multiple Provider Edge (PE) devices have the
132  ability to encap and decap data belonging to the same VLAN. In
133  certain situations, this may cause L2 duplicates and even loops if
134  there is a momentary overlap of forwarding roles between two or more
135  PE devices, leading to broadcast storms.

Possible readability rewrite:

"In EVPN technology, multiple Provider Edge (PE) devices possess the capability to encapsulate and decapsulate data associated with the same VLAN. Under certain conditions, this may result in Layer 2 duplicates and potential loops if there is a temporary overlap in forwarding roles among two or more PE devices, consequently leading to broadcast storms."

137  EVPN [RFC7432] currently uses timer based synchronization among PE
138  devices in a redundancy group that can result in duplications (and
139  even loops) because of multiple DFs if the timer is too short or
140  packets being dropped if the timer is too long.

RFC7432 specifies this timer rather than merely using it. Hence a more explicit text blob to document this property:

"EVPN [RFC7432] currently specifies timer-based synchronization among PE devices within a redundancy group.
This approach can lead to duplications and potential loops due to multiple Designated Forwarders (DFs) if the timer interval is too short, or to packet drops if the timer interval is too long."

142  Using split-horizon filtering (Section 8.3 of [RFC7432]) can prevent
143  loops (but not duplicates). However, if there are overlapping DFs in
144  two different sites at the same time for the same VLAN, the site
145  identifier will be different upon the packet re-entering the Ethernet
146  Segment and hence the split-horizon check will fail, leading to L2
147  loops.

Strange grammatical construct and usage of "()". Potential rewrite to correct this, assuming I captured the described issue correctly:

"Employing split-horizon filtering, as described in Section 8.3 of [RFC7432], can prevent loops but does not address duplicates. However, if there are overlapping Designated Forwarders (DFs) at two different sites simultaneously for the same VLAN, the site identifier will differ when the packet re-enters the Ethernet Segment. Consequently, the split-horizon check will fail, resulting in Layer 2 loops."

149  The updated DF procedures in [RFC8584] use the well known Highest
150  Random Weight (HRW) algorithm to avoid reshuffling of VLANs among PE
151  devices in the redundancy group upon failure/recovery. This reduces
152  the impact to VLANs not assigned to the failed/recovered ports and
153  eliminates loops or duplicates at failure/recovery events.

Is there a reference that can be used for the well-known HRW algorithm? What about the following rewrite proposal for readability:

"The updated Designated Forwarder (DF) procedures outlined in [RFC8584] utilize the well-known Highest Random Weight (HRW) algorithm to prevent the reshuffling of VLANs among PE devices within the redundancy group during failure or recovery events. This approach minimizes the impact on VLANs not assigned to the failed or recovered ports and eliminates the occurrence of loops or duplicates during such events."

179  a given VLAN is possible. Duplication of DF roles may eventually
180  lead to duplication of traffic as well as L2 loops.

In the previous text the word 'overlap' was used, while here 'Duplication of DF roles' is used.

195  * Complicated handshake signamling mechanisms and state machines are
196    avoided in favor of a simple uni-directional signaling approach.

s/Complicated/Complex/
s/signamling/signaling/

198  * The solution is backwards-compatible (see Section 4), by PEs
199    simply discarding the unrecognized new BGP Extended Community.

I think that "The solution" is a reasonably opaque description. Maybe we should explicitly mention that this concerns the fast DF recovery solution. I only noted this here as the first occurrence, but the more explicit text can be used in multiple locations within the draft text. What about:

"The fast DF recovery solution maintains backwards compatibility (see Section 4) by ensuring that PEs discard any unrecognized new BGP Extended Community."

201  * Existing DF Election algorithms are supported.

s/are/remain/

232  Upon receipt of that new BGP Extended Community, partner PEs can
233  determine the service carving time of the newly insterted PE. The
234  notion of skew is introduced to eliminate any potential duplicate
235  traffic or loops. The receiving partner PEs add a skew (default =
236  -10ms) to the Service Carving Time to enforce this. The previously
237  inserted PE(s) must carve first, followed shortly (skew) by the newly
238  insterted PE.

I got thrown off guard by the word skew as a non-native English speaker. Maybe a small explanation would be helpful. What about the following:

"Upon receipt of the new BGP Extended Community, partner PEs can determine the service carving time of the newly inserted PE.
To eliminate any potential for duplicate traffic or loops, the concept of skew, a small time delay added to the service carving process to ensure a controlled and orderly transition when multiple Provider Edge (PE) devices are involved, is introduced. The receiving partner PEs add a skew (default = -10ms) to the service carving time to enforce this mechanism. This ensures that the previously inserted PEs complete their carving process first, followed shortly thereafter (by the specified skew) by the newly inserted PE."

240  To summarize, all peering PEs carve almost simultaneously at the time
241  announced by the newly added/recovered PE. The newly inserted PE
242  initiates the SCT, and carves immediately on its local timer expiry.
243  The previously inserted PE(s) receiving Ethernet Segment route (RT-4)
244  with a SCT BGP extended community, carve shortly before Service
245  Carving Time.

This text causes me some confusion. The term "to carve" generally means to cut or shape something from a larger piece, often with precision and care. Hence I was a bit surprised to see it used here. May I assume that in the context of these network operations, and specifically within EVPN (Ethernet VPN) and MPLS (Multiprotocol Label Switching) environments, "to carve" typically refers to the process of determining and establishing roles or responsibilities for forwarding traffic among Provider Edge (PE) devices? If yes, maybe such a text blob should be explicitly mentioned somewhere in the draft?

266  [RFC5905]. As the current NTP era value is not exchanged, a local
267  clock which is "synchronized" but to the wrong era is outside of the
268  scope of this document.

What is the era value?
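
To illustrate how I read the skew text quoted above, here is a minimal sketch of the arithmetic. It is only an illustration under assumptions: times are integer milliseconds on a common synchronized clock, and the names `carve_times`, `partner_ms` and `recovering_ms` are hypothetical, not taken from the draft.

```python
# Sketch of the skew arithmetic as described in the quoted draft text.
# Assumption: all times are integer milliseconds on a shared clock.

DEFAULT_SKEW_MS = -10  # default skew quoted in the draft text (-10ms)


def carve_times(sct_ms: int, skew_ms: int = DEFAULT_SKEW_MS) -> dict:
    """Return when each kind of PE carves for a given Service Carving Time.

    The newly inserted (recovering) PE carves at the SCT itself; the
    previously inserted partner PEs add the (negative) skew and therefore
    carve slightly earlier.
    """
    if skew_ms > 0:
        # A positive skew would make partners carve after the recovering PE,
        # which contradicts the ordering described in the quoted text.
        raise ValueError("skew is expected to be zero or negative")
    return {
        "partner_ms": sct_ms + skew_ms,  # previously inserted PE(s)
        "recovering_ms": sct_ms,         # newly inserted PE
    }


if __name__ == "__main__":
    # SCT announced as t=103 s, expressed in milliseconds.
    print(carve_times(103_000))
```

With the default skew, the partner PEs carve 10 ms ahead of the announced SCT, which matches "carve shortly before Service Carving Time".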
257                        1                   2                   3
258    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
259   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
260   | Type = 0x06   | Sub-Type(0x0F)|      Timestamp Seconds        ~
261   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
262   ~  Timestamp Seconds            | Timestamp Fractional Seconds  |
263   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

A figure number/caption is missing.

269  The 64-bit timestamp of NTP consists of a 32-bit part for seconds and
270  a 32-bit part for fractional second:

There seem to be 32-bit, 64-bit and 128-bit timestamp formats according to:
https://datatracker.ietf.org/doc/html/rfc5905#section-6
Should the description not align with all of these?

274  * Timestamp Fractional Seconds: the high order 16 bits of the NTP
275    fractional seconds are encoded in this field. The use of a 16-bit
276    fractional seconds yields adequate precision of 15 microseconds
277    (2^-16 s).

I assume that the lower order 16 bits are assumed to be '0'? Maybe that should be explicitly called out?

296  This capability is used in conjunction with the agreed upon DF Type
297  (DF Election Type). For example if all the PEs in the Ethernet
298  Segment indicate having Time Synchronization capability and are
299  requesting the DF type to be HRW, then the HRW algorithm is used in
300  conjunction with this capability.

Readability rewrite:

"This capability is utilized in conjunction with the agreed-upon Designated Forwarder (DF) Type (DF Election Type). For instance, if all the PE devices in the Ethernet Segment indicate possessing Time Synchronization capability and request the DF Type to be Highest Random Weight (HRW), then the HRW algorithm is employed in conjunction with this capability."

Note: what happens if one of the involved PEs does not support the Time Synchronization capability?

309  The peering PE's FSM in DF_DONE which receives a RECV_ES transitions
310  to DF_CALC. Because of the SCT carried in the Ethernet-Segment
311  update, the output of the DF_CALC and transition back into DF_DONE
312  are delayed, as are accompanying forwarding updates to DF/NDF state.

This process is not so easy to follow. I assume that all these are states of the FSM? Would the following be a correct rewrite for readability?

"Upon receiving a RECV_ES message, the peering PE's Finite State Machine (FSM) transitions from the DF_DONE state (indicating the DF election process was complete) to the DF_CALC state (indicating that a new DF calculation is needed). Due to the Service Carving Time (SCT) included in the Ethernet-Segment update, the completion of the DF_CALC state and the subsequent transition back to the DF_DONE state are delayed. This delay ensures proper synchronization and prevents conflicts. Consequently, the accompanying forwarding updates to the Designated Forwarder (DF) and Non-Designated Forwarder (NDF) states are also deferred."

314  The corresponding actions when transitions are performed or states
315  are entered/exited is modified as follows:
316
317  9. DF_CALC on CALCULATED: Mark the election result for the VLAN or
318     Bundle.
319
320     9.1 Where SCT timestamp is present on the RECV_ES event of
321         Action 11, wait until the time indicated by the SCT before
322         continuing to 9.2.
323
324     9.2 Assume a DF/NDF for the local PE for the VLAN or VLAN
325         Bundle, and transition to DF_DONE.

What about the following procedure text blob description for clarity:

"The corresponding actions when transitions are performed or states are entered/exited are modified as follows:

9. DF_CALC on CALCULATED: Mark the election result for the VLAN or VLAN Bundle.

   9.1. If an SCT timestamp is present during the RECV_ES event of Action 11, wait until the time indicated by the SCT before proceeding to step 9.2.

   9.2. Assume the role of DF or NDF for the local PE concerning the VLAN or VLAN Bundle, and transition to the DF_DONE state.
This revised approach ensures proper timing and synchronization in the DF election process, avoiding conflicts and ensuring accurate forwarding updates."

329  Let's take Figure 1 as an example where initially PE2 had failed and
330  PE1 had taken over. This example shows the problem with the
331  DF-Election mechanism in Section 8.5 of [RFC7432], using the value of
332  the timer configured for all PEs on the Ethernet Segment.

To bring the text closer to Proposed Standard style, what about this text blob for readability:

"Consider Figure 1 as an example, where initially PE2 has failed and PE1 has taken over. This scenario illustrates the problem with the DF-Election mechanism described in Section 8.5 of [RFC7432], specifically in the context of the timer value configured for all PEs on the Ethernet Segment."

334  Based on Section 8.5 of [RFC7432] and using the default 3 second
335  timer in step 2:
337  1. Initial state: PE1 is in steady-state, PE2 is recovering
339  2. PE2 recovers at (absolute) time t=99
341  3. PE2 advertises RT-4 (sent at t=100) to partner PE1
343  4. PE2 starts a 3 second timer to allow the reception of RT-4 from
344     other PE nodes
346  5. PE1 carves immediately on RT-4 reception, i.e. t=100 + minimal
347     BGP propagation delay
349  6. PE2 carves at time t=103
350
351  [RFC7432] aims of favouring traffic being dropped over duplicate
352  traffic. With the above procedure, traffic drops will occur as part
353  of each PE recovery sequence since PE1 has transitioned some VLANs to
354  Non-Designated-Forwarder (NDF) immediately upon reception.
355  The timer value (default = 3 seconds) has a direct effect on the
356  duration of the packets being dropped. A shorter (especially zero)
357  timer may, however, result in duplicate traffic or traffic loops.

What about:

"Procedure Based on Section 8.5 of [RFC7432] with Default 3-Second Timer:

1. Initial State: PE1 is in a steady state, and PE2 is recovering.
2. Recovery: PE2 recovers at an absolute time of t=99.
3. Advertisement: PE2 advertises RT-4, sent at t=100, to partner PE1.
4. Timer Start: PE2 starts a 3-second timer to allow the reception of RT-4 from other PE nodes.
5. Immediate Carving: PE1 carves immediately upon RT-4 reception, i.e., t=100 plus minimal BGP propagation delay.
6. Delayed Carving: PE2 carves at time t=103.

[RFC7432] favors traffic drops over duplicate traffic. With the above procedure, traffic drops will occur as part of each PE recovery sequence since PE1 transitions some VLANs to Non-Designated Forwarder (NDF) immediately upon RT-4 reception. The timer value (default = 3 seconds) directly affects the duration of the packet drops. A shorter (or zero) timer may result in duplicate traffic or traffic loops."

359  Based on the Service Carving Time (SCT) approach:
361  1. Initial state: PE1 is in steady-state, PE2 is recovering
363  2. PE2 recovers at (absolute) time t=99
365  3. PE2 advertises RT-4 (sent at t=100) with target SCT value t=103
366     to partner PE1
368  4. PE2 starts a 3 second timer to allow the reception of RT-4 from
369     other PE nodes
371  5. PE1 starts service carving timer, with remaining time until t=103
373  6. Both PE1 and PE2 carve at (absolute) time t=103
374  In fact, PE1 should carve slightly before PE2 (skew) to maintain the
375  preference of minimal loss over duplicate traffic. The previously
376  inserted PE2 that is recovering performs both transitions DF to NDF
377  and NDF to DF per VLANs at the timer's expiry. Since the goal is to
378  prevent duplicates, the original PE1, which received the SCT will
379  apply:
381  * DF to NDF transition at t=SCT minus skew, where both PEs are NDF
382    for 'skew' amount of time
384  * NDF to DF transition at t=SCT
385
386  It is this split-behaviour which ensures a good transition of DF role
387  with contained amount of loss.
388
389  Using SCT approach, the negative effect of the timer to allow the
390  reception of RT-4 from other PE nodes is mitigated.
Furthermore, the
391  BGP Ethernet Segment route (RT-4) transmission delay (from PE2 to
392  PE1) becomes a non-issue. The use of SCT approach remedies the
393  problem associated with this timer: the 3 second timer window is
394  shortened to the order of milliseconds.

What about the following text blobs for readability:

"Procedure Based on the Service Carving Time (SCT) Approach:

1. Initial State: PE1 is in a steady state, and PE2 is recovering.
2. Recovery: PE2 recovers at an absolute time of t=99.
3. Advertisement: PE2 advertises RT-4, sent at t=100, with a target SCT value of t=103 to partner PE1.
4. Timer Start: PE2 starts a 3-second timer to allow the reception of RT-4 from other PE nodes.
5. Service Carving Timer: PE1 starts the service carving timer, with the remaining time until t=103.
6. Simultaneous Carving: Both PE1 and PE2 carve at an absolute time of t=103.

To maintain the preference for minimal loss over duplicate traffic, PE1 should carve slightly before PE2 (with skew). The recovering PE2 performs both DF to NDF and NDF to DF transitions per VLAN at the timer's expiry. The original PE1, which received the SCT, applies the following:

* DF to NDF Transition: At t=SCT minus skew, where both PEs are NDF for the skew duration.
* NDF to DF Transition: At t=SCT.

This split-behavior ensures a smooth DF role transition with minimal loss.

Using the SCT approach, the negative effect of the timer to allow the reception of RT-4 from other PE nodes is mitigated. Furthermore, the BGP Ethernet Segment route (RT-4) transmission delay (from PE2 to PE1) becomes a non-issue. The SCT approach shortens the 3-second timer window to the order of milliseconds, addressing the associated problems."

396  3.1. Concurrent Recoveries

This section seems to be missing RFC2119 language on how nodes need to behave with respect to the procedures outlined in this document.

402  Election.
A similar situation arises in staggered recovering PEs,
403  when a second PE recovers at rougly a first PE's advertised SCT
404  expiry, and with its own new SCT-2 outside of the initial SCT window.

The word staggered is oddly used. What about the following:

"A similar situation arises in sequentially recovering PEs, when a second PE recovers approximately at the time of the first PE's advertised SCT expiry, and with its own new SCT-2 outside of the initial SCT window."

406  In the case of multiple outstanding DF elections, one requested by
407  each of the recovering PEs, the SCTs must simply be time-ordered and
408  all PEs execute only a single DF Election at the service carving time
409  corresponding to the largest received timestamp value. The DF
410  Election will involve all the active PEs in a single DF Election
411  update.

To match the edited writing style used elsewhere:

"In the case of multiple concurrent DF elections, each initiated by one of the recovering PEs, the SCTs must be ordered chronologically. All PEs shall execute only a single DF Election at the service carving time corresponding to the latest received timestamp value. This DF Election will involve all active PEs in a unified DF Election update."

However, it may require some formal RFC2119 language to make sure that implementations behave according to this procedure.

413  Example:
415  1. Initial state: PE1 is in steady-state, all services elected at
416     PE1.
418  2. PE2 recovers at time t=100, advertises RT-4 with target SCT value
419     t=103 to partners (PE1)
421  3. PE2 starts a 3 second timer to allow the reception of RT-4 from
422     other PE nodes
424  4. PE1 starts service carving timer, with remaining time until t=103
426  5. PE3 recovers at time t=102, advertises RT-4 with target SCT value
427     t=105 to partners (PE1, PE2)
429  6. PE3 starts a 3 second timer to allow the reception of RT-4 from
430     other PE nodes
432  7. PE2 cancels the running timer, starts service carving timer with
433     remaining time until t=105
435  8. PE1 updates service carving timer, with remaining time until
436     t=105
438  9. PE1, PE2 and PE3 carve at (absolute) time t=105

Example:

1. Initial State: PE1 is in a steady state, with all services elected at PE1.
2. Recovery of PE2: PE2 recovers at time t=100 and advertises RT-4 with a target SCT value of t=103 to its partners (PE1).
3. Timer Initiation by PE2: PE2 starts a 3-second timer to allow the reception of RT-4 from other PE nodes.
4. Timer Initiation by PE1: PE1 starts the service carving timer, with the remaining time until t=103.
5. Recovery of PE3: PE3 recovers at time t=102 and advertises RT-4 with a target SCT value of t=105 to its partners (PE1, PE2).
6. Timer Initiation by PE3: PE3 starts a 3-second timer to allow the reception of RT-4 from other PE nodes.
7. Timer Update by PE2: PE2 cancels the running timer and starts the service carving timer with the remaining time until t=105.
8. Timer Update by PE1: PE1 updates its service carving timer, with the remaining time until t=105.
9. Service Carving: PE1, PE2, and PE3 perform service carving at the absolute time of t=105.

446  4. Backwards Compatibility
447
448  Per redundancy group, for the DF election procedures to be globally
449  convergent and unanimous, it is necessary that all the participating
450  PEs agree on the DF Election algorithm to be used. It is, however,
451  possible that some PEs continue to use the existing modulo-based DF
452  election and do not rely on the new SCT BGP extended community. PEs
453  running a baseline DF election mechanism will simply discard the new
454  SCT BGP extended community as unrecognized.
455
456  A PE can indicate its willingness to support clock-synched carving by
457  signaling the new 'T' DF Election Capability as well as including the
458  new Service Carving Time BGP extended community along with the
459  Ethernet Segment Route (Type-4).
In the case where one or more PEs
460  attached to the Ethernet Segment do not signal T=1, all PEs in the
461  Ethernet Segment SHALL revert back to the [RFC7432] timer approach.
462  This is especially important in the context of the VLAN shuffling
463  with more than 2 PEs.

I am not sure what the modulo-based DF election is. Is that the RFC7432 procedure? I believe this is the first time it is mentioned in this draft. What about the following rewrite proposal for readability, but please add a reference for the modulo-based election:

"For the DF election procedures to achieve global convergence and unanimity within a redundancy group, it is essential that all participating PEs agree on the DF election algorithm to be employed. However, it is possible that some PEs may continue to use the existing modulo-based DF election algorithm and not utilize the new Service Carving Time (SCT) BGP extended community. PEs that operate using the baseline DF election mechanism will simply discard the new SCT BGP extended community as unrecognized.

A PE can indicate its willingness to support clock-synchronized carving by signaling the new 'T' DF Election Capability and including the new SCT BGP extended community along with the Ethernet Segment Route (Type-4). If one or more PEs attached to the Ethernet Segment do not signal T=1, then all PEs in the Ethernet Segment SHALL revert to the timer-based approach as specified in [RFC7432]. This reversion is particularly crucial in preventing VLAN shuffling when more than two PEs are involved"

465  5. Security Considerations

For the case where the SCT is far in the future, it was not entirely clear or spelled out what an implementation should do. Maybe make it more explicit in the textual description, in a normative way using RFC2119 language.

_______________________________________________
BESS mailing list -- bess@ietf.org
To unsubscribe send an email to bess-le...@ietf.org