Hi Randy,

Thanks for taking the time to review the draft, and for your feedback. Please see inline: je#

Cheers
John

From: Randy Bush <[email protected]>
Date: Monday 8 September 2025 at 20:41
To: "[email protected]" <[email protected]>
Subject: [EXTERNAL] [OPSAWG]Re: draft-ietf-opsawg-discardmodel-08 review

i was poked to take a look at this draft. despite lack of yang fu, here are some cursory comments.

    While certain types of packet loss, such as policy-based discards, are intentional and part of normal network operation, unintended packet loss can impact customer services.

intentional drops also impact the customer. who are we kidding here? when debugging loss, ignoring intentional drop can hide misconfigured drop configuration.

je# I would make the clarification that discards due to a misconfigured ACL are still unintended; it’s just that you don’t know that without additional context. In either case, the point really is that we need precise classification of discard reporting. With errored discards, the discard metric may be enough to determine unintended loss. With misconfigured ACL discards, we may need the metric + other context, e.g. that we made an ACL change and when. This is partly covered in Appendix C:

    7. It is not possible to identify a configuration error - e.g., when intended discards are unintended - with device discard metrics alone. For example, additional context is needed to determine if ACL discards are intended or due to a misconfigured ACL, i.e., with configuration validation before deployment or by detecting a significant change in ACL discards after a configuration change compared to before.

Will add clarification in the terminology section: Device discard counters do not by themselves establish operator intent.
Discards reported under policy (e.g., ACL/policer) indicate only that traffic matched a configured rule; such discards may still be unintended if the configuration is in error. Determining intent for policy discards requires external context (e.g., configuration validation and change history), which is out of scope for this specification.

    The scope of this document is limited to reporting packet loss at Layer 3 and frames discarded at Layer 2. This document considers only the signals that may trigger automated mitigation actions and not how the actions are defined or executed.

    The fundamental problem for network operators is how to automatically detect when unintended packet loss is occurring
    ^ and where

je# added

    FEATURE-DISCARD-CLASS: The type or class of discards, which is crucial for selecting the appropriate mitigation - for example: error discards may require taking faulty components out of service; no-buffer discards may require traffic redistribution; policy discards typically require no automated action

policy discards may be due to misconfiguration of policies

je# clarified as: “intended policy discards”

    The discard reporting can be organized into several types: control plane, interface, flow, and device.

is a drop reported in multiple types? i.e. on the device, the interface where it was dropped, the flow it affected, ...? while 5.2 makes this clear, comment here might be helpful.

je# went to and fro on this, but it ended up overlapping with 5.2, so propose to leave as is unless there is a strong objection?

    The "ietf-packet-discard-reporting-sx" module uses the "sx" structure defined in [RFC8791].

the Features list in 4.3 is not the same order as the abstract data model structure in 4.1

je# will align

    identity ingress {
      base direction;
      description
        "Reports statistics for the received from the network packets.";
    }

    identity egress {
      base direction;
      description
        "Reports statistics for the sent to the network packets.";
    }

in a complex device, i wonder if ingress and egress could be a bit confusing

je# We’ve successfully implemented this information model in both centralised forwarding and linecard/chassis based devices

    grouping qos {
      description
        "Quality of Service (QoS) traffic counters.";

not differentiated from security/acl, no-route, etc. drops?

je# sorry, I don’t follow your feedback. Could you please clarify?

    grouping errors-l3-rx {
      ...
      leaf no-route {

no-route on a received packet, tx, sure, but rx? or are you expecting a route back toward the source?

je# The rx in this case means a received packet. Is that confusing?

i wonder about being able to differentiate between drops due to static vs dynamic (e.g. flow export) security/TE policies.

je# that seems like another case of external context?

    If all of the requirements listed in Section 5.2 are met, a "good" unicast IPv4 packet received would increment:
    ...

the analogous rules for counting L2 frames are not formally described

je# added an L2 example: a “good” Layer-2 frame received would increment:
- interface/ingress/traffic/l2/frames
- interface/ingress/traffic/l2/bytes
- interface/ingress/traffic/qos/class[id="0"]/packets
- interface/ingress/traffic/qos/class[id="0"]/bytes

         +-----------+
         |           |
         |    CPU    |
         |           |
         +---+---^---+
    from_cpu |   | to_cpu

i suspect the intent is s/CPU/Control Plane/

je# Yes - changed

        +----+----+  +----------+  +---------+  +----------+  +----+----+
        |         |  |          |  |         |  |          |  |         |
    Rx--> PHY/MAC +--> Ingress  +--> Buffers +--> Egress   +--> PHY/MAC +-> Tx
        |         |  | Pipeline |  |         |  | Pipeline |  |         |
        +---------+  +----------+  +---------+  +----------+  +---------+

on complex devices, there are more buffers.
at a minimum input vs output buffers. C.3 starts to address this.

je# the model supports reporting no-buffer discards on ingress, egress and at the device level

worse, the control plane can be more complex than ingress and egress. punt path is good, but what about rib to (distributed) fib?

je# control plane is really capturing discards of packets to or from the control plane. I don’t follow what you’re referring to with the rib to (distributed) fib case. Could you please clarify?

    The effectiveness of automated mitigation depends on correctly mapping discard signals to root causes and appropriate actions. Table 1 gives example discard signal-to-mitigation action mappings based on the features described in section 3.

i wonder about the effects of different mitigation actions across different vendors in a multi-vendor environment. with more coffee, i suspect one could posit undesirable behavior.

je# We have not come across any major issues on 8 hardware platforms from 4 vendors

i abuse the excuse of not being a yang expert to not dive deeply into the model presentations :)

and again, i am a n00b here. but no refunds will be provided. :)

randy
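
For illustration only, the "significant change in ACL discards after a configuration change compared to before" check mentioned in the Appendix C discussion above could be sketched roughly as follows. This is a minimal sketch, not part of the draft or its YANG model: the function name, the (timestamp, rate) sample format, and both thresholds are hypothetical, and the samples are assumed to have already been collected from the ACL discard counters.

```python
from statistics import mean


def acl_discard_shift(samples, change_time, ratio_threshold=3.0, min_rate=1.0):
    """Flag a large jump in ACL discard rate after a configuration change.

    samples: list of (timestamp, discards_per_second) tuples (hypothetical format).
    change_time: timestamp of the ACL configuration change (external context).
    Returns True/False, or None when there is no data on one side of the change.
    """
    before = [rate for ts, rate in samples if ts < change_time]
    after = [rate for ts, rate in samples if ts >= change_time]
    if not before or not after:
        return None  # not enough context to compare before vs after

    before_rate = mean(before)
    after_rate = mean(after)

    # Floor the baseline so a near-zero "before" rate cannot cause a
    # division by zero or an exaggerated ratio.
    baseline = max(before_rate, min_rate)
    return after_rate / baseline >= ratio_threshold
```

The counter alone does not say whether the discards are intended; the change timestamp is the external context that turns a rate shift into a candidate misconfiguration signal. The ratio threshold is an arbitrary placeholder that an operator would need to tune.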
