[bess] Re: [Shepherding AD review] review of draft-ietf-bess-evpn-fast-df-recovery-08

Luc André Burdet Mon, 08 Jul 2024 14:45:51 -0700

Hi Gunter, thanks for the thorough review and suggestions towards readability.
I have uploaded -09 incorporating most of your suggestions with the exception 
of very minor details:
For example, I used “largest (latest)” instead of applying s/largest/latest/. 
We are talking about both values and time, so both largest and latest make 
sense, and I simply kept both terms.

Regards,
Luc André

Luc André Burdet |  Cisco  |  laburdet.i...@gmail.com  |  Tel: +1 613 254 4814

From: Gunter van de Velde (Nokia) 
<gunter.van_de_velde=40nokia....@dmarc.ietf.org>
Date: Thursday, May 30, 2024 at 07:28
To: draft-ietf-bess-evpn-fast-df-recov...@ietf.org 
<draft-ietf-bess-evpn-fast-df-recov...@ietf.org>
Cc: 'BESS' <bess@ietf.org>
Subject: [bess] [Shepherding AD review] review of 
draft-ietf-bess-evpn-fast-df-recovery-08
# Gunter Van de Velde, RTG AD, comments for 
draft-ietf-bess-evpn-fast-df-recovery-08

Hi All,

Please find here a shepherding AD review of 
draft-ietf-bess-evpn-fast-df-recovery-08

I'm sorry it took a bit of time to get started on this draft.

I've begun reviewing this document before we kick off the IETF Last Call 
process. Once we address these points, we can move forward with the document 
through the IESG chain.

A big thank you to Adrian Farrel for his RTG-DIR review on the -07 version, 
which helped improve the document to its -08 version, and to Matthew Bocci for 
the Shepherd's write-up (4 July 2022).

In my review, I've noted some final observations while going through the 
document. For better readability, I've suggested some paragraph edits.

One thing I noticed is that there's not much RFC 2119-based normative language 
used. Maybe the authors can take another look and add or update the RFC 2119 
text where needed.

You can find my review notes below.

#GENERIC COMMENTS
#================

88         Virtualization Overlay (NVO) and DC inte)rconnect (DCI) services, and

Typo with the ")"

100        multihomed Ethernet Segment.  This DF election is achieved
101        independent of the number of EVPN Instances (EVIs) associated with
102        that Ethernet Segment and it is performed via simple signaling
103        between the recovered node and each of the other nodes in the
104        multihomed group.

I believe that the word 'simple' is reasonably subjective. It may be better to 
replace it with a construct using 'straightforward'. Possible rewrite:

"This Designated Forwarder (DF) election is conducted independently of the 
number of EVPN Instances (EVIs) associated with the Ethernet Segment and is 
executed through straightforward signaling between the recovered node and each 
of the other nodes in the multihomed group.
"

105        This document updates the state machine described in Section 2.1 of

Being more explicit in what is updated could be better.

"This document updates the DF Election Finite State Machine (FSM) described in 
Section 2.1 of"

131        In EVPN technology, multiple Provider Edge (PE) devices have the
132        ability to encap and decap data belonging to the same VLAN.  In

Expand 'encap' and 'decap' for better readability.

131        In EVPN technology, multiple Provider Edge (PE) devices have the
132        ability to encap and decap data belonging to the same VLAN.  In
133        certain situations, this may cause L2 duplicates and even loops if
134        there is a momentary overlap of forwarding roles between two or more
135        PE devices, leading to broadcast storms.

possible readability rewrite:
"In EVPN technology, multiple Provider Edge (PE) devices possess the capability 
to encapsulate and decapsulate data associated with the same VLAN. Under 
certain conditions, this may result in Layer 2 duplicates and potential loops 
if there is a temporary overlap in forwarding roles among two or more PE 
devices, consequently leading to broadcast storms.
"
137        EVPN [RFC7432] currently uses timer based synchronization among PE
138        devices in a redundancy group that can result in duplications (and
139        even loops) because of multiple DFs if the timer is too short or
140        packets being dropped if the timer is too long.

RFC7432 specifies a timer rather than merely using one. Hence a more explicit 
text to document this property:

"EVPN [RFC7432] currently specifies timer-based synchronization among PE 
devices within a redundancy group. This approach can lead to duplications and 
potential loops due to multiple Designated Forwarders (DFs) if the timer 
interval is too short, or to packet drops if the timer interval is too long."

142        Using split-horizon filtering (Section 8.3 of [RFC7432]) can prevent
143        loops (but not duplicates).  However, if there are overlapping DFs in
144        two different sites at the same time for the same VLAN, the site
145        identifier will be different upon the packet re-entering the Ethernet
146        Segment and hence the split-horizon check will fail, leading to L2
147        loops.

Strange grammatical construct and usage of "()". Potential rewrite to correct 
this, assuming I have understood the described issue correctly:

"Employing split-horizon filtering, as described in Section 8.3 of [RFC7432], 
can prevent loops but does not address duplicates. However, if there are 
overlapping Designated Forwarders (DFs) at two different sites simultaneously 
for the same VLAN, the site identifier will differ when the packet re-enters 
the Ethernet Segment. Consequently, the split-horizon check will fail, 
resulting in Layer 2 loops.
"

149        The updated DF procedures in [RFC8584] use the well known Highest
150        Random Weight (HRW) algorithm to avoid reshuffling of VLANs among PE
151        devices in the redundancy group upon failure/recovery.  This reduces
152        the impact to VLANs not assigned to the failed/recovered ports and
153        eliminates loops or duplicates at failure/recovery events.

Is there a reference that can be used for the well known HRW algorithm?
What about the following rewrite proposal for readability:

"The updated Designated Forwarder (DF) procedures outlined in [RFC8584] utilize 
the well-known Highest Random Weight (HRW) algorithm to prevent the reshuffling 
of VLANs among PE devices within the redundancy group during failure or 
recovery events. This approach minimizes the impact on VLANs not assigned to 
the failed or recovered ports and eliminates the occurrence of loops or 
duplicates during such events.
"
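To illustrate why HRW avoids reshuffling, here is a minimal Python sketch of 
generic highest-random-weight (rendezvous) selection. It is not the weight 
function from [RFC8584], which defines its own; the SHA-256 based weight and 
the names hrw_weight and elect_df are assumptions made for illustration only.

```python
import hashlib


def hrw_weight(vlan: int, pe: str) -> int:
    # Stand-in weight (assumption): SHA-256 over the (VLAN, PE) pair.
    # RFC 8584 specifies its own weight function; this is NOT it.
    digest = hashlib.sha256(f"{vlan}|{pe}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def elect_df(vlan: int, pes: list[str]) -> str:
    # The PE with the highest weight for this VLAN becomes the DF.
    return max(pes, key=lambda pe: hrw_weight(vlan, pe))


vlans = range(1, 101)
before = {v: elect_df(v, ["PE1", "PE2", "PE3"]) for v in vlans}
after = {v: elect_df(v, ["PE1", "PE2"]) for v in vlans}  # PE3 has failed

# VLANs whose DF changed after PE3 left the redundancy group.
moved = [v for v in vlans if before[v] != after[v]]
```

In this sketch the only VLANs that move are those that were served by PE3, 
since the relative order of the PE1 and PE2 weights for any VLAN is unchanged.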

179        a given VLAN is possible.  Duplication of DF roles may eventually
180        lead to duplication of traffic as well as L2 loops.

In the previous text the word 'overlap' was used, while here the phrase 
'Duplication of DF roles' is used.

195        *  Complicated handshake signamling mechanisms and state machines are
196           avoided in favor of a simple uni-directional signaling approach.

s/Complicated/Complex/
s/signamling/signaling/

198        *  The solution is backwards-compatible (see Section 4), by PEs
199           simply discarding the unrecognized new BGP Extended Community.

I think that "The solution" is a reasonably opaque description. Maybe we 
should explicitly mention that this concerns the fast DF recovery solution. I 
only noted this here as the first occurrence, but the more explicit text can be 
used in multiple locations within the draft text.

What about:
"The fast DF recovery solution maintains backwards compatibility (see Section 
4) by ensuring that PEs discard any unrecognized new BGP Extended Community."

201        *  Existing DF Election algorithms are supported.

s/are/remain/

232        Upon receipt of that new BGP Extended Community, partner PEs can
233        determine the service carving time of the newly insterted PE.  The
234        notion of skew is introduced to eliminate any potential duplicate
235        traffic or loops.  The receiving partner PEs add a skew (default =
236        -10ms) to the Service Carving Time to enforce this.  The previously
237        inserted PE(s) must carve first, followed shortly (skew) by the newly
238        insterted PE.

I got thrown off-guard with the word skew as a non-native English speaker.
Maybe a small explanation would be helpful. What about the following:

"Upon receipt of the new BGP Extended Community, partner PEs can determine the 
service carving time of the newly inserted PE. To eliminate any potential for 
duplicate traffic or loops, the concept of skew is introduced: a small time 
delay added to the service carving process to ensure a controlled and orderly 
transition when multiple Provider Edge (PE) devices are involved. The receiving 
partner PEs add a skew (default = -10ms) to the service carving time to enforce 
this mechanism. This ensures that the previously inserted PEs complete their 
carving process first, followed shortly thereafter (by the specified skew) by 
the newly inserted PE.
"

240        To summarize, all peering PEs carve almost simultaneously at the time
241        announced by the newly added/recovered PE.  The newly inserted PE
242        initiates the SCT, and carves immediately on its local timer expiry.
243        The previously inserted PE(s) receiving Ethernet Segment route (RT-4)
244        with a SCT BGP extended community, carve shortly before Service
245        Carving Time.

This text confuses me somewhat. The term "to carve" generally means to 
cut or shape something from a larger piece, often with precision and care. 
Hence I was a bit surprised to see it used here.

May I assume that in the context of these network operations and specifically 
within EVPN (Ethernet VPN) and MPLS (Multiprotocol Label Switching) 
environments, "to carve" typically refers to the process of determining and 
establishing roles or responsibilities for forwarding traffic among Provider 
Edge (PE) devices?

If so, maybe such an explanation should be stated explicitly somewhere in the 
draft?

266        [RFC5905].  As the current NTP era value is not exchanged, a local
267        clock which is "synchronized" but to the wrong era is outside of the
268        scope of this document.

What is the era value?

257                             1                   2                   3
258         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
259        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
260        | Type = 0x06   | Sub-Type(0x0F)|      Timestamp Seconds        ~
261        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
262        ~  Timestamp Seconds            | Timestamp Fractional Seconds  |
263        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

A figure number/caption is missing.

269        The 64-bit timestamp of NTP consists of a 32-bit part for seconds and
270        a 32-bit part for fractional second:

There seem to be 32-bit, 64-bit and 128-bit timestamp formats according to:
https://datatracker.ietf.org/doc/html/rfc5905#section-6
Should the description not align with all of these?

274        *  Timestamp Fractional Seconds: the high order 16 bits of the NTP
275           fractional seconds are encoded in this field.  The use of a 16-bit
276           fractional seconds yields adequate precision of 15 microseconds
277           (2^-16 s).

I assume that the low-order 16 bits are taken to be '0'? Maybe that should 
be explicitly called out?
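
As a way to make the encoding question concrete, here is a minimal Python 
sketch of the 8-octet layout shown in the figure above (Type 0x06, Sub-Type 
0x0F, 32-bit seconds, high-order 16 bits of the fraction). The function names 
are hypothetical, and treating the dropped low-order 16 bits as zero on 
decode is exactly the assumption questioned above, not something the draft 
states.

```python
import struct

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_UNIX_OFFSET = 2_208_988_800


def encode_sct(unix_time: float) -> bytes:
    """Pack a Unix time into the 8-octet SCT extended community layout."""
    ntp = unix_time + NTP_UNIX_OFFSET
    seconds = int(ntp) & 0xFFFFFFFF          # era number is not carried
    frac16 = int((ntp - int(ntp)) * (1 << 16)) & 0xFFFF  # top 16 fraction bits
    return struct.pack("!BBIH", 0x06, 0x0F, seconds, frac16)


def decode_sct(ec: bytes) -> tuple[int, int]:
    """Return (seconds, 32-bit fraction); low 16 fraction bits assumed zero."""
    ec_type, sub_type, seconds, frac16 = struct.unpack("!BBIH", ec)
    if (ec_type, sub_type) != (0x06, 0x0F):
        raise ValueError("not an SCT extended community")
    return seconds, frac16 << 16


ec = encode_sct(0.5)  # half a second after the Unix epoch
```

The 16-bit fraction gives a resolution of 2^-16 s, roughly 15 microseconds, 
which matches the precision quoted in the draft text.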

296        This capability is used in conjunction with the agreed upon DF Type
297        (DF Election Type).  For example if all the PEs in the Ethernet
298        Segment indicate having Time Synchronization capability and are
299        requesting the DF type to be HRW, then the HRW algorithm is used in
300        conjunction with this capability.

readability rewrite:
"This capability is utilized in conjunction with the agreed-upon Designated 
Forwarder (DF) Type (DF Election Type). For instance, if all the PE devices in 
the Ethernet Segment indicate possessing Time Synchronization capability and 
request the DF Type to be Highest Random Weight (HRW), then the HRW algorithm 
is employed in conjunction with this capability.
"

Note: what happens if one of the involved PEs does not support the Time 
Synchronization capability?

309        The peering PE's FSM in DF_DONE which receives a RECV_ES transitions
310        to DF_CALC.  Because of the SCT carried in the Ethernet-Segment
311        update, the output of the DF_CALC and transition back into DF_DONE
312        are delayed, as are accompanying forwarding updates to DF/NDF state.

This is not so easy to process. I assume that all of these are states of the FSM?
Would the following be a correct rewrite for readability?

"Upon receiving a RECV_ES message, the peering PE's Finite State Machine (FSM) 
transitions from the DF_DONE (indicating the DF election process was complete) 
state to the DF_CALC (indicating that a new DF calculation is needed) state. 
Due to the Service Carving Time (SCT) included in the Ethernet-Segment update, 
the completion of the DF_CALC state and the subsequent transition back to the 
DF_DONE state are delayed. This delay ensures proper synchronization and 
prevents conflicts. Consequently, the accompanying forwarding updates to the 
Designated Forwarder (DF) and Non-Designated Forwarder (NDF) states are also 
deferred.
"

314        The corresponding actions when transitions are performed or states
315        are entered/exited is modified as follows:
316
317        9.  DF_CALC on CALCULATED: Mark the election result for the VLAN or
318            Bundle.
319
320            9.1  Where SCT timestamp is present on the RECV_ES event of
321                 Action 11, wait until the time indicated by the SCT before
322                 continuing to 9.2.
323
324            9.2  Assume a DF/NDF for the local PE for the VLAN or VLAN
325                 Bundle, and transition to DF_DONE.

What about the following procedure text blob description for clarity:

"
The corresponding actions when transitions are performed or states are 
entered/exited are modified as follows:

9. DF_CALC on CALCULATED: Mark the election result for the VLAN or VLAN Bundle.

9.1. If an SCT timestamp is present during the RECV_ES event of Action 11, wait 
until the time indicated by the SCT before proceeding to step 9.2.

9.2. Assume the role of DF or NDF for the local PE concerning the VLAN or VLAN 
Bundle, and transition to the DF_DONE state.

This revised approach ensures proper timing and synchronization in the DF 
election process, avoiding conflicts and ensuring accurate forwarding updates.
"

329        Let's take Figure 1 as an example where initially PE2 had failed and
330        PE1 had taken over.  This example shows the problem with the
331        DF-Election mechanism in Section 8.5 of [RFC7432], using the value of
332        the timer configured for all PEs on the Ethernet Segment.

To bring the text closer to Proposed Standard style, what about this text for 
readability:

"Consider Figure 1 as an example, where initially PE2 has failed and PE1 has 
taken over. This scenario illustrates the problem with the DF-Election 
mechanism described in Section 8.5 of [RFC7432], specifically in the context of 
the timer value configured for all PEs on the Ethernet Segment.
"

334        Based on Section 8.5 of [RFC7432] and using the default 3 second
335        timer in step 2:
337        1.  Initial state: PE1 is in steady-state, PE2 is recovering
339        2.  PE2 recovers at (absolute) time t=99
341        3.  PE2 advertises RT-4 (sent at t=100) to partner PE1
343        4.  PE2 starts a 3 second timer to allow the reception of RT-4 from
344            other PE nodes
346        5.  PE1 carves immediately on RT-4 reception, i.e. t=100 + minimal
347            BGP propagation delay
349        6.  PE2 carves at time t=103
350
351        [RFC7432] aims of favouring traffic being dropped over duplicate
352        traffic.  With the above procedure, traffic drops will occur as part
353        of each PE recovery sequence since PE1 has transitioned some VLANs to
354        Non-Designated-Forwarder (NDF) immediately upon reception.
355        The timer value (default = 3 seconds) has a direct effect on the
356        duration of the packets being dropped.  A shorter (especially zero)
357        timer may, however, result in duplicate traffic or traffic loops.

What about:

"Procedure Based on Section 8.5 of [RFC7432] with Default 3-Second Timer:
1. Initial State: PE1 is in a steady state, and PE2 is recovering.
2. Recovery: PE2 recovers at an absolute time of t=99.
3. Advertisement: PE2 advertises RT-4, sent at t=100, to partner PE1.
4. Timer Start: PE2 starts a 3-second timer to allow the reception of RT-4 from 
other PE nodes.
5. Immediate Carving: PE1 carves immediately upon RT-4 reception, i.e., t=100 
plus minimal BGP propagation delay.
6. Delayed Carving: PE2 carves at time t=103.

[RFC7432] favors traffic drops over duplicate traffic. With the above 
procedure, traffic drops will occur as part of each PE recovery sequence since 
PE1 transitions some VLANs to Non-Designated Forwarder (NDF) immediately upon 
RT-4 reception. The timer value (default = 3 seconds) directly affects the 
duration of the packet drops. A shorter (or zero) timer may result in duplicate 
traffic or traffic loops.
"
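
The timing in the six steps above can be checked with a small Python sketch. 
It only restates the arithmetic of the example; the function name and the 
BGP delay value are assumptions for illustration.

```python
def rfc7432_drop_window(advert_time: float, bgp_delay: float,
                        timer: float = 3.0) -> float:
    """Seconds during which a VLAN moving from PE1 to PE2 has no DF."""
    pe1_carve = advert_time + bgp_delay   # PE1 carves on RT-4 reception
    pe2_carve = advert_time + timer       # PE2 carves on timer expiry
    # A negative gap would mean overlapping DFs (duplicates), not drops.
    return max(0.0, pe2_carve - pe1_carve)


# RT-4 sent at t=100, assumed 50 ms propagation delay, default 3 s timer.
window = rfc7432_drop_window(100.0, 0.05)
# With a zero timer there is no drop window, but PE2 carves before PE1.
no_window = rfc7432_drop_window(100.0, 0.05, timer=0.0)
```

With the default timer the window is close to the full 3 seconds, which is 
the loss the SCT approach is meant to remove.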

359        Based on the Service Carving Time (SCT) approach:
361        1.  Initial state: PE1 is in steady-state, PE2 is recovering
363        2.  PE2 recovers at (absolute) time t=99
365        3.  PE2 advertises RT-4 (sent at t=100) with target SCT value t=103
366            to partner PE1
368        4.  PE2 starts a 3 second timer to allow the reception of RT-4 from
369            other PE nodes
371        5.  PE1 starts service carving timer, with remaining time until t=103
373        6.  Both PE1 and PE2 carve at (absolute) time t=103
374        In fact, PE1 should carve slightly before PE2 (skew) to maintain the
375        preference of minimal loss over duplicate traffic.  The previously
376        inserted PE2 that is recovering performs both transitions DF to NDF
377        and NDF to DF per VLANs at the timer's expiry.  Since the goal is to
378        prevent duplicates, the original PE1, which received the SCT will
379        apply:
381        *  DF to NDF transition at t=SCT minus skew, where both PEs are NDF
382           for 'skew' amount of time
384        *  NDF to DF transition at t=SCT
385
386        It is this split-behaviour which ensures a good transition of DF role
387        with contained amount of loss.
388
389        Using SCT approach, the negative effect of the timer to allow the
390        reception of RT-4 from other PE nodes is mitigated.  Furthermore, the
391        BGP Ethernet Segment route (RT-4) transmission delay (from PE2 to
392        PE1) becomes a non-issue.  The use of SCT approach remedies the
393        problem associated with this timer: the 3 second timer window is
394        shortened to the order of milliseconds.

What about the following textblobs for readability:

"Procedure Based on the Service Carving Time (SCT) Approach:
1. Initial State: PE1 is in a steady state, and PE2 is recovering.
2. Recovery: PE2 recovers at an absolute time of t=99.
3. Advertisement: PE2 advertises RT-4, sent at t=100, with a target SCT value 
of t=103 to partner PE1.
4. Timer Start: PE2 starts a 3-second timer to allow the reception of RT-4 from 
other PE nodes.
5. Service Carving Timer: PE1 starts the service carving timer, with the 
remaining time until t=103.
6. Simultaneous Carving: Both PE1 and PE2 carve at an absolute time of t=103.

To maintain the preference for minimal loss over duplicate traffic, PE1 should 
carve slightly before PE2 (with skew). The recovering PE2 performs both DF to 
NDF and NDF to DF transitions per VLAN at the timer's expiry. The original PE1, 
which received the SCT, applies the following:

* DF to NDF Transition: At t=SCT minus skew, where both PEs are NDF for the 
skew duration.
* NDF to DF Transition: At t=SCT.

This split-behavior ensures a smooth DF role transition with minimal loss.

Using the SCT approach, the negative effect of the timer to allow the reception 
of RT-4 from other PE nodes is mitigated. Furthermore, the BGP Ethernet Segment 
route (RT-4) transmission delay (from PE2 to PE1) becomes a non-issue. The SCT 
approach shortens the 3-second timer window to the order of milliseconds, 
addressing the associated problems.
"
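
The split behaviour at PE1 can be sketched in a few lines of Python. Times are 
in milliseconds, the skew is the draft's default of -10 ms added to the SCT, 
and the function and key names are hypothetical.

```python
DEFAULT_SKEW_MS = -10  # default skew from the draft, added to the SCT


def partner_schedule(sct_ms: int, skew_ms: int = DEFAULT_SKEW_MS) -> dict:
    """Transition times for a PE that received the SCT (PE1 in the example)."""
    return {
        "df_to_ndf": sct_ms + skew_ms,  # give up DF roles slightly early
        "ndf_to_df": sct_ms,            # take over DF roles at the SCT
    }


# SCT of t=103 s expressed in milliseconds.
schedule = partner_schedule(103_000)
# Interval during which both PEs are NDF for a VLAN moving from PE1 to PE2.
both_ndf_ms = schedule["ndf_to_df"] - schedule["df_to_ndf"]
```

The loss for a VLAN moving from PE1 to PE2 is bounded by the skew (10 ms 
here) rather than by the 3-second timer.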

396     3.1.  Concurrent Recoveries

This section seems to be missing RFC2119 language on how nodes need to behave 
with respect to the procedures outlined in this document.

402        Election.  A similar situation arises in staggered recovering PEs,
403        when a second PE recovers at rougly a first PE's advertised SCT
404        expiry, and with its own new SCT-2 outside of the initial SCT window.

The word staggered is oddly used. What about the following:

"A similar situation arises in sequentially recovering PEs, when a second PE 
recovers approximately at the time of the first PE's advertised SCT expiry, and 
with its own new SCT-2 outside of the initial SCT window."

406        In the case of multiple outstanding DF elections, one requested by
407        each of the recovering PEs, the SCTs must simply be time-ordered and
408        all PEs execute only a single DF Election at the service carving time
409        corresponding to the largest received timestamp value.  The DF
410        Election will involve all the active PEs in a single DF Election
411        update.

To match the edited writing style used above:
"In the case of multiple concurrent DF elections, each initiated by one of the 
recovering PEs, the SCTs must be ordered chronologically. All PEs shall execute 
only a single DF Election at the service carving time corresponding to the 
latest received timestamp value. This DF Election will involve all active PEs 
in a unified DF Election update.
"

However, it may require some formal RFC2119 language to make sure that 
implementations behave according to this procedure.

413        Example:
415        1.  Initial state: PE1 is in steady-state, all services elected at
416            PE1.
418        2.  PE2 recovers at time t=100, advertises RT-4 with target SCT value
419            t=103 to partners (PE1)
421        3.  PE2 starts a 3 second timer to allow the reception of RT-4 from
422            other PE nodes
424        4.  PE1 starts service carving timer, with remaining time until t=103
426        5.  PE3 recovers at time t=102, advertises RT-4 with target SCT value
427            t=105 to partners (PE1, PE2)
429        6.  PE3 starts a 3 second timer to allow the reception of RT-4 from
430            other PE nodes
432        7.  PE2 cancels the running timer, starts service carving timer with
433            remaining time until t=105
435        8.  PE1 updates service carving timer, with remaining time until
436            t=105
438        9.  PE1, PE2 and PE3 carve at (absolute) time t=105

Example:
1. Initial State: PE1 is in a steady state, with all services elected at PE1.
2. Recovery of PE2: PE2 recovers at time t=100 and advertises RT-4 with a 
target SCT value of t=103 to its partners (PE1).
3. Timer Initiation by PE2: PE2 starts a 3-second timer to allow the reception 
of RT-4 from other PE nodes.
4. Timer Initiation by PE1: PE1 starts the service carving timer, with the 
remaining time until t=103.
5. Recovery of PE3: PE3 recovers at time t=102 and advertises RT-4 with a 
target SCT value of t=105 to its partners (PE1, PE2).
6. Timer Initiation by PE3: PE3 starts a 3-second timer to allow the reception 
of RT-4 from other PE nodes.
7. Timer Update by PE2: PE2 cancels the running timer and starts the service 
carving timer with the remaining time until t=105.
8. Timer Update by PE1: PE1 updates its service carving timer, with the 
remaining time until t=105.
9. Service Carving: PE1, PE2, and PE3 perform service carving at the absolute 
time of t=105.
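
The timer handling in this example amounts to keeping only the largest 
(latest) SCT seen so far. A minimal Python sketch, with a hypothetical 
CarvingTimer class that is not taken from the draft:

```python
class CarvingTimer:
    """Tracks the single pending DF election time on one PE."""

    def __init__(self) -> None:
        self.pending_sct = None  # no election outstanding

    def on_sct(self, sct: float) -> float:
        # Keep only the largest (latest) SCT; earlier ones are superseded.
        if self.pending_sct is None or sct > self.pending_sct:
            self.pending_sct = sct
        return self.pending_sct


pe1 = CarvingTimer()
pe1.on_sct(103)  # RT-4 from PE2, SCT t=103
pe1.on_sct(105)  # RT-4 from PE3, SCT t=105 supersedes t=103
```

After both updates PE1 holds a single pending election at t=105, and a 
late-arriving smaller SCT would not move it back.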

446     4.  Backwards Compatibility
447
448        Per redundancy group, for the DF election procedures to be globally
449        convergent and unanimous, it is necessary that all the participating
450        PEs agree on the DF Election algorithm to be used.  It is, however,
451        possible that some PEs continue to use the existing modulo-based DF
452        election and do not rely on the new SCT BGP extended community.  PEs
453        running a baseline DF election mechanism will simply discard the new
454        SCT BGP extended community as unrecognized.
455
456        A PE can indicate its willingness to support clock-synched carving by
457        signaling the new 'T' DF Election Capability as well as including the
458        new Service Carving Time BGP extended community along with the
459        Ethernet Segment Route (Type-4).  In the case where one or more PEs
460        attached to the Ethernet Segment do not signal T=1, all PEs in the
461        Ethernet Segment SHALL revert back to the [RFC7432] timer approach.
462        This is especially important in the context of the VLAN shuffling
463        with more than 2 PEs.

I am not sure what the modulo-based DF election is. Is that the RFC7432 
procedure? I believe this is the first time it is mentioned in this draft.

What about the following rewrite proposal for readability? Please add a 
reference for the modulo-based election:

"For the DF election procedures to achieve global convergence and unanimity 
within a redundancy group, it is essential that all participating PEs agree on 
the DF election algorithm to be employed. However, it is possible that some PEs 
may continue to use the existing modulo-based DF election algorithm and not 
utilize the new Service Carving Time (SCT) BGP extended community. PEs that 
operate using the baseline DF election mechanism will simply discard the new 
SCT BGP extended community as unrecognized.

A PE can indicate its willingness to support clock-synchronized carving by 
signaling the new 'T' DF Election Capability and including the new SCT BGP 
extended community along with the Ethernet Segment Route (Type-4). If one or 
more PEs attached to the Ethernet Segment do not signal T=1, then all PEs in 
the Ethernet Segment SHALL revert to the timer-based approach as specified in 
[RFC7432]. This reversion is particularly crucial in preventing VLAN shuffling 
when more than two PEs are involved."
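
The fallback rule in the second paragraph can be expressed as a simple check 
over the 'T' capability signalled by each PE. This is a sketch only; the 
function name and the returned labels are hypothetical.

```python
def election_timing(t_capability: dict[str, bool]) -> str:
    """Choose the carving mode for an Ethernet Segment from the T bits."""
    # SCT carving is used only if every attached PE signals T=1.
    if t_capability and all(t_capability.values()):
        return "sct"
    # Otherwise all PEs revert to the RFC 7432 timer approach.
    return "rfc7432-timer"


all_capable = election_timing({"PE1": True, "PE2": True, "PE3": True})
one_legacy = election_timing({"PE1": True, "PE2": True, "PE3": False})
```

A single PE that does not signal T=1 is enough to move the whole Ethernet 
Segment back to the timer-based behaviour.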

465     5.  Security Considerations

For the case where the SCT is far in the future, it is not entirely clear or 
spelled out what an implementation should do. Maybe make this more explicit in 
the text as a normative statement using RFC2119 language.

_______________________________________________
BESS mailing list -- bess@ietf.org
To unsubscribe send an email to bess-le...@ietf.org
