[OPSAWG]Re: draft-ietf-opsawg-discardmodel-08 review

Evans, John Wed, 17 Sep 2025 10:08:13 -0700

Hi Randy,

Thanks for taking the time to review the draft, and for your feedback.  Please
see inline: je#

Cheers

John

From: Randy Bush <[email protected]>
Date: Monday 8 September 2025 at 20:41
To: "[email protected]" <[email protected]>
Subject: [EXTERNAL] [OPSAWG]Re: draft-ietf-opsawg-discardmodel-08 review

i was poked to take a look at this draft. despite lack of yang fu,
here are some cursory comments.

While certain types of packet loss, such as policy-based discards,
are intentional and part of normal network operation, unintended
packet loss can impact customer services.

intentional drops also impact the customer. who are we kidding here?
when debugging loss, ignoring intentional drop can hide misconfigured
drop configuration.

je# I would clarify that discards due to a misconfigured ACL are still
unintended; it’s just that you can’t tell that without additional context.  In
either case, the point is that we need precise classification of discard
reporting.  With error discards, the discard metric alone may be enough to
determine unintended loss.  With misconfigured ACL discards, we may need the
metric plus other context, e.g. that we made an ACL change, and when.

This is partly covered in Appendix C:

   7.   It is not possible to identify a configuration error - e.g.,
        when intended discards are unintended - with device discard
        metrics alone.  For example, additional context is needed to
        determine if ACL discards are intended or due to a misconfigured
        ACL, i.e., with configuration validation before deployment or by
        detecting a significant change in ACL discards after a
        configuration change compared to before.

Will add clarification in the terminology section:

Device discard counters do not by themselves establish operator intent. 
Discards reported under policy (e.g., ACL/policer) indicate only that traffic 
matched a configured rule; such discards may still be unintended if the 
configuration is in error. Determining intent for policy discards requires 
external context (e.g., configuration validation and change history) which is 
out of scope for this specification.
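
To illustrate the kind of external context meant here, the before/after
comparison from Appendix C item 7 could look roughly like the sketch below.
This is only an illustration and not part of the draft: the function name, the
ratio threshold and the minimum-rate floor are all assumptions, and it only
flags increases in the discard rate.

```python
from statistics import mean


def acl_discard_shift(before, after, ratio_threshold=3.0, min_rate=1.0):
    """Flag a significant increase in ACL discard rate after a config change.

    before/after are discard-rate samples (e.g. packets per second) taken on
    either side of the change. The threshold values are hypothetical.
    """
    if not before or not after:
        raise ValueError("need samples on both sides of the change")
    base = mean(before)
    post = mean(after)
    # Ignore very small absolute rates so that noise is not flagged.
    if post < min_rate:
        return False
    # A zero baseline followed by a non-trivial rate counts as a shift.
    if base == 0:
        return True
    return post / base >= ratio_threshold
```

A result of True says only that the discard rate changed around the
configuration change; whether the change was intended still needs the change
record itself.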

The scope of this document is limited to reporting packet loss at
Layer 3 and frames discarded at Layer 2. This document considers
only the signals that may trigger automated mitigation actions and
not how the actions are defined or executed.

The fundamental problem for network operators is how to automatically
detect when unintended packet loss is occurring
^ and where

je# added

FEATURE-DISCARD-CLASS: The type or class of discards, which is
crucial for selecting the appropriate type of mitigation - for example:
error discards may require taking faulty components out of
service; no-buffer discards may require traffic redistribution;
policy discards typically require no automated action

policy discards may be due to misconfiguration of policies

je# clarified as: “intended policy discards”
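
As a rough illustration of the class-to-mitigation mapping in the quoted text,
a minimal sketch follows. The enum, the action strings and the function name
are hypothetical and do not come from the draft or its YANG modules.

```python
from enum import Enum


class DiscardClass(Enum):
    ERROR = "error"
    NO_BUFFER = "no-buffer"
    POLICY = "policy"


# Example mapping taken from the FEATURE-DISCARD-CLASS text; the action
# names are illustrative labels only.
MITIGATION = {
    DiscardClass.ERROR: "take-out-of-service",
    DiscardClass.NO_BUFFER: "redistribute-traffic",
    # Assumes the policy discards are intended; the counters alone cannot
    # establish that.
    DiscardClass.POLICY: "none",
}


def suggest_mitigation(discard_class):
    """Return the example mitigation action for a discard class."""
    return MITIGATION[discard_class]
```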

The discard reporting can be organized into several types: control
plane, interface, flow, and device.

is a drop reported in multiple types? i.e. on the device, the interface
where it was dropped, the flow it affected, ...? while 5.2 makes this
clear, comment here might be helpful.

je# I went back and forth on this, but any comment here ended up duplicating
5.2, so I propose to leave it as is unless there is a strong objection?

The "ietf-packet-discard-reporting-sx" module uses the "sx" structure
defined in [RFC8791].

the Features list in 4.3 is not the same order as the abstract data
model structure in 4.1

je# will align

 identity ingress {
   base direction;
   description
     "Reports statistics for the received from the network
      packets.";
 }

 identity egress {
   base direction;
   description
     "Reports statistics for the sent to the network
      packets.";
 }

in a complex device, i wonder if ingress and egress could be a bit
confusing

je# We’ve successfully implemented this information model in both centralised
forwarding and linecard/chassis based devices

 grouping qos {
   description
     "Quality of Service (QoS) traffic counters.";

not differentiated from security/acl, no-route, etc. drops?

je# sorry – don’t follow your feedback – could you please clarify?

 grouping errors-l3-rx {
 ...
   leaf no-route {

no-route on a received packet, tx, sure, but rx? or are you expecting a
route back toward the source?

je# The rx in this case meant received packet – is that confusing?

i wonder about being able to differentiate between drops due to static
vs dynamic (e.g. flow export) security/TE policies.

je# that seems like another case of external context?

If all of the requirements listed in Section 5.2 are met, a "good"
unicast IPv4 packet received would increment:
...

the analogous rules for counting L2 frames are not formally described

je# added an L2 example:

a “good” Layer-2 frame received would increment:
- interface/ingress/traffic/l2/frames
- interface/ingress/traffic/l2/bytes
- interface/ingress/traffic/qos/class[id="0"]/packets
- interface/ingress/traffic/qos/class[id="0"]/bytes
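
For illustration only, the four increments above can be sketched as below. The
counter paths are the ones listed; the function name and the use of a plain
in-memory counter are assumptions and not part of the model.

```python
from collections import Counter


def count_good_l2_frame(counters, frame_len, qos_class="0"):
    """Apply the increments listed for one good received Layer-2 frame."""
    base = "interface/ingress/traffic"
    counters[f"{base}/l2/frames"] += 1
    counters[f"{base}/l2/bytes"] += frame_len
    # The same frame is also counted once against its QoS class.
    counters[f'{base}/qos/class[id="{qos_class}"]/packets'] += 1
    counters[f'{base}/qos/class[id="{qos_class}"]/bytes'] += frame_len
    return counters
```

Calling count_good_l2_frame(Counter(), 64) yields exactly four counters: the
two packet/frame counters at 1 and the two byte counters at 64.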

                          +-----------+
                          |           |
                          |    CPU    |
                          |           |
                          +---+---^---+
                     from_cpu |   | to_cpu

i suspect the intent is s/CPU/Control Plane/

je# Yes - changed

+----+----+  +----------+  +---------+  +----------+  +----+----+
|         |  |          |  |         |  |          |  |         |
Rx--> PHY/MAC +--> Ingress  +--> Buffers +--> Egress   +--> PHY/MAC +-> Tx
|         |  | Pipeline |  |         |  | Pipeline |  |         |
+---------+  +----------+  +---------+  +----------+  +---------+

on complex devices, there are more buffers. at a minimum input vs
output buffers. C.3 starts to address this.

je# the model supports reporting no-buffer discards on ingress, egress and at 
the device level

worse, the control plane can be more complex than ingress and egress.
punt path is good, but what about rib to (distributed) fib?

je# control plane is really capturing discards of packets to or from the 
control plane.  I don’t follow what you’re referring to with the rib to 
(distributed) fib case – could you please clarify?

The effectiveness of automated mitigation depends on correctly
mapping discard signals to root causes and appropriate actions.
Table 1 gives example discard signal-to-mitigation action mappings
based on the features described in section 3.

i wonder about the effects of different mitigation actions across
different vendors in a multi-vendor environment. with more coffee, i
suspect one could posit undesirable behavior.

je# We have not come across any major issues across 8 hardware platforms from
4 vendors

i abuse the excuse of not being a yang expert to not dive deeply into
the model presentations :)

and again, i am a n00b here. but no refunds will be provided. :)

randy
