John Scudder's Discuss on draft-ietf-rtgwg-segment-routing-ti-lfa-13: (with DISCUSS and COMMENT)

John Scudder via Datatracker Tue, 16 Apr 2024 13:14:17 -0700

John Scudder has entered the following ballot position for
draft-ietf-rtgwg-segment-routing-ti-lfa-13: Discuss

When responding, please keep the subject line intact and reply to all
email addresses included in the To and CC lines. (Feel free to cut this
introductory paragraph, however.)

Please refer to
https://www.ietf.org/about/groups/iesg/statements/handling-ballot-positions/
for more information about how to handle DISCUSS and COMMENT positions.

The document, along with other ballot positions, can be found here:
https://datatracker.ietf.org/doc/draft-ietf-rtgwg-segment-routing-ti-lfa/

----------------------------------------------------------------------
DISCUSS:
----------------------------------------------------------------------

# John Scudder, RTG AD, comments for draft-ietf-rtgwg-segment-routing-ti-lfa-13
CC @jgscudder

Thanks for this document. The technology is valuable, and the underlying
techniques sound. Despite what you might guess from my abundance of comments, I
like this document. I suspect it has suffered from having been edited
repeatedly by the same set of experts, it becomes hard to see it as a new
reader might. So, please accept my comments in the spirit of a new set of eyes
looking over some long-established text.

Although most of my remarks are non-blocking COMMENTs, I am putting two in as
DISCUSS points. Some of my other comments are of a similar nature, but I think
they're less serious for the overall coherence of the document.

## DISCUSS

### Whole document, is post-convergence of the essence, or not?

The document seems to be arguing with itself about whether following the
post-convergence path is, or is not an essential/required feature. In the
Abstract:

A key aspect of
TI-LFA is the FRR path selection approach establishing protection
over the expected post-convergence paths from the point of local
repair

It's a *key* aspect! OK! But then, in the Introduction:

Although not a Ti-LFA requirement or constraint, TI-LFA also brings
the benefit of the ability to provide a backup path that follows the
expected post-convergence path

Wait, it's "key" but "not a requirement or constraint"? Moving on to Section 6,

The repair list encodes the
explicit post-convergence path to the destination

So it "encodes the explicit post-convergence path". "Encodes", not "might
encode" or "can encode". So the Abstract is right and Section 2 is wrong. But
wait there's more! Later in Section 11,

traffic can
be steered by the PLR onto its expected post-convergence path during
the FRR phase

So it "can"... which implies "doesn't have to be".

There's more, for example, all of Section 5 talks about post-convergence paths,
and there are many more mentions in Section 6 too.

Given that Sections 5-8 seem to come closest to being the normative ones
(though the document is sadly not very precise in this regard) I'm left with
the impression that the Abstract is right ("key"), and the quoted passages of
Sections 2 and 11 are wrong. In any case, I think this needs to be resolved in
some way.

### Section 10, multiple unrelated failures

Implementations of TI-LFA should deal with the
occurence of multiple unrelated failures in accordance to the IP Fast
Reroute Framework [RFC5714].

(Nit, you misspelled "occurrence".)

Can you explain what you mean by this sentence? I haven't reviewed it carefully
but RFC 5714 is a framework, and I don’t know what it means to be "in
accordance to" it. The only relevant text I was able to find in it was RFC 5714
Section 5.2.6,

However, it is important that the occurrence of
a second failure while one failure is undergoing repair should not
result in a level of service which is significantly worse than that
which would have been achieved in the absence of any repair strategy.

Putting that together with the quote from your specification, I come up with an
interpretation like "it's important to behave reasonably in the face of
multiple failures, we aren't going tell you how to do it, this other document
we are citing isn't going to tell you how to do it either, it's just going to
tell you that our specification was supposed to cover this, but we didn't.”

----------------------------------------------------------------------
COMMENT:
----------------------------------------------------------------------

## COMMENTS

### General Gripe, Y U no XML?

It looks like you must have uploaded the TXT rendering instead of the XML
source. Whenever it is you upload your next revision, please consider uploading
the XML source instead. The set of renderings available when you upload text is
inferior to the renderings available when you upload source (notably, the
modern HTML rendering is not available). If there's some reason uploading
source is difficult or impossible for you, disregard this request of course.
(The only reason I can think of for it to be hard is if you used a less-common
tool to produce your draft, for example, nroff or the Microsoft Word template.
If you use the more mainstream XML or MD workflows, it should be easy enough.)

### Abstract, for want of a hyphen the meaning was lost

It extends these concepts to provide guaranteed coverage in
any two connected networks using a link-state IGP.

The only way I can make sense of this is to add a hyphen:

It extends these concepts to provide guaranteed coverage in
any two-connected networks using a link-state IGP.

Is that what you meant? I would also suggest changing "using" to "that uses",
as in,

It extends these concepts to provide guaranteed coverage in
any two-connected network that uses a link-state IGP.

I'm not sure the concept of two-connectedness is universally understood enough
that it's suitable for use in an Abstract, so I'd support further editing to
get rid of the graph theory term entirely, but at least this edit makes it
correct.

### Section 1, orphan definitions

- RSPT is defined, but never used. Delete?

- SLA is defined, only used once (and in a comment below I suggest deleting
that use!). Delete and just expand on first use, if the first-and-only use is
kept at all?

- SPT is defined, but never used. Delete?

- SRGB is defined, only used once. Delete and just expand on first use?

- TLDP is defined, only used once. Delete and just expand on first use? (If you
keep it, please correct the nit from Ben Niven-Jenkins’ RTGAREA review.)

### Section 2, "aims at"

Segment Routing aims at supporting services with tight SLA guarantees
[RFC8402].

This seems to be rewriting history. That is to say, sure SR “aims at supporting
services with tight SLA guarantees”, but only in the sense that every
general-purpose packet transport does; without specifics this isn’t very
meaningful. In any case, the citation doesn’t supply evidence for the
statement, indeed the string “SLA” or “service level” never occurs in RFC 8402.
Maybe just remove this sentence? It doesn’t appear essential in any case.

### Section 2, can't parse this sentence

By relying on SR this document provides a local repair
mechanism for standard link-state IGP shortest path capable of
restoring end-to-end connectivity in the case of a sudden directly
connected failure of a network component.

I can’t parse this sentence. I am *guessing* that what you mean is something
like,

NEW:
This document leverages Segment Routing (SR) to provide a local
repair mechanism for a shortest path computed by a standard
link-state IGP. The local repair is capable of restoring end-to-end
connectivity in the case of a failure of a directly connected
network component.

If this is what you mean, you’re welcome to use this text if you want. If it’s
not what you mean, please explain?

Note I removed “sudden” in my NEW text because I’m guessing you don’t mean to
exclude a gradual or a foreseeable failure. A failure is a failure, after all.

### Section 2, When the network reconverges

When the network reconverges

I suggest,

NEW:
When the network reconverges after a failure

See also the next comment.

### Section 2, microloops

Much of Section 2 is a discussion of microloops and exactly how TI-LFA relates
to them. This doesn’t seem like introductory material, especially because the
rest of the specification doesn't talk about microloops at all. I suggest
moving that material to a new (sub)section for that purpose and just mentioning
it in the Introduction, as in something like "microloops are not addressed by
TI-LFA and can be a concern in some deployments. This is discussed in <xref>."

I don't insist that you make this change, the document is still usable without
it. However, I think that the Introduction as it stands is not very useful *as
an introduction* because of the abundance of non-introductory material. (Some
of my subsequent comments fall under the same heading.)

### Section 2, “primary link”

You mention the “primary link” in a few places here (and nowhere else). What is
the “primary link”? Please clarify or re-word. For example, maybe you mean “the
link whose failure is detected”.

### Section 2, what value does comparing to older FRR techniques add to an
intro?

By using SR, TI-LFA does not require the establishment of TLDP
sessions (Targeted Label Distribution Protocol) with remote nodes in
order to take advantage of the applicability of remote LFAs (RLFA)
[RFC7490][RFC7916] or remote LFAs with directed forwarding
(DLFA)[RFC5714]. All the Segment Identifiers (SIDs) are available in
the link state database (LSDB) of the IGP. As a result, preferring
LFAs over RLFAs or DLFAs, as well as minimizing the number of RLFA or
DLFA repair nodes is not required anymore.

I see why you wanted this paragraph during the process of developing this spec
and persuading the WG of its value. I don’t see how it contributes any value to
the final spec. Similarly,

By using SR, there is no need to create state in the network in order
to enforce an explicit FRR path. This relieves the nodes themselves
from having to maintain extra state, and it relieves the operator
from having to deploy an extra protocol or extra protocol sessions
just to enhance the protection coverage.

Appears to just be a restatement of the value proposition of SR itself. I don’t
see value in restating it in this document.

I think you could remove both these paragraphs without harm, and it would make
the document a quicker and clearer read.

### Section 2, encoding challenges

One of the challenges of TI-
LFA is to encode the expected post-convergence path by combining
adjacency segments and node segments.

Do you mean “compactly encode”, “efficiently encode”, or similar? Is encoding,
per se, the challenge?

### Section 2, roadmap should be a roadmap

I agree with Ben Niven-Jenkins that the omission of Sections 8-10 in the
overview list at the end of the Introduction is a bit jarring to the reader. (I
would add Section 13, too.)

### Section 3, exclude from that set

I think this is wrong:

Exclude from that set of neighbors that are reachable from R using X.

Did you mean,

NEW:
Exclude from that set, the neighbors that are reachable from R using X.

### Section 3, defined but not used

A symmetric network is a network such that the IGP metric of each
link is the same in both directions of the link.

This definition is never used.

### Section 6, you can't guarantee that

is guaranteed to be loop-
free irrespective of the state of FIBs along the nodes belonging to
the explicit path.

As written, there’s no way to guarantee that. (Trivial proof, one possible
state of a FIB is to point back to the preceding node along the path. That
might not be an *expected* state, but it is *a* state.) I think this is just a
case of overly casual writing, and you mean that the loop-free property will
exist regardless of whether the nodes belonging to the explicit path have
converged to recognize failure X or not. Consider rewriting along those lines?

### Section 6 and others

Please supply definitions for “P node” and “Q node”.

### Section 6, terminology is inconsistent and unclear

I can understand NodeSID(R1) well enough even though you haven't supplied a
definition, it’s the node SID for router R1. But what is “Node_SID(P)”? P isn’t
a router, it’s a set of routers, or a space. Please clarify, whether with a
definition or otherwise. Probably your clarification will fix both this and the
previous point.

While you’re at it, you might as well make your terminology consistent. In one
of the node SID cases above you use an underscore, and in the other, you don’t.
Also, your node SID notation is inconsistent with your adjacency SID notation
which looks like AdjSID_R1R2 — so in one case an underscore and parentheses, in
another case parentheses with no underscore, and in the final case an
underscore with no parentheses. Pick one.

(And then there's Section 12 with "node-SID"...)

### Sections 6.1, 6.2, 6.3, SHOULD NOT use SHOULD

What work are the SHOULDs doing here? Considering that in other parts of the
document you leave the computation of the repair path up to the implementation,
why are you mandating it here? And, if you’re mandating it, why not mandate it
all the way with a MUST?

It seems to me that if you don’t want to mandate implementation, it would be
sensible to take the RFC 2119 language out altogether. If you do want to
mandate implementation, I don’t see why you wouldn’t make this a MUST. If you
really do want it to be SHOULD, Please explain your reasoning.

### Section 7, primary outgoing interface

The existence of a "primary outgoing interface" seems to imply the existence of
a secondary outgoing interface, tertiary outgoing interface, etc. Please define
primary outgoing interface, or if this isn't an important distinction, consider
whether you can simplify to just say "outgoing interface".

### Section 7.1, first

The active segment becomes the first
segment of the repair list.

By “first” do you mean “first to be pushed, last to be processed”? If so, I
suggest clarifying that in the text, because the plain English reading of
"first" is the opposite.

### Section 7.2, "as stated"... where?

As stated in Section 2, when SR
policies are involved and a strict compliance of the policy is
required, an end-to-end protection should be preferred over a local
repair mechanism.

I don’t see this in section 2 (I searched for “end-to-end”, “prefer”, “policy”
and “policies”). Can you help me understand what text in section 2 you’re
talking about?

### Section 7.2.1, what's an Adj()?

Please define Adj(). Is this yet another terminology variation? (c.f. Section 6
comment on AdjSID_R1R2)

### Section 8.1, what's the tail end of a node segment?

1. If the active segment is a node segment that has been signaled
with penultimate hop popping and the repair list ends with an
adjacency segment terminating on the tail-end of the active
segment, then the active segment MUST be popped before pushing
the repair list.

What is the tail end of a node segment? I can’t figure out what that means.

I think I know what you’re trying to say, but please find a way to reword it
that doesn’t end up making me try to parse the above with my "what did the
authors actually *mean*?" glasses on.

### Section 8.2 and others, description of SRv6 behaviors

RFC 8754 describes forwarding behaviors using a kind of line-numbered
pseudocode, and later documents that modify forwarding behaviors specify
updates to the pseudocode. (Examples: RFC 8986,
draft-ietf-spring-srv6-srh-compression-15,
draft-ietf-rtgwg-srv6-egress-protection-16,
draft-ietf-spring-sr-redundancy-protection-03,
draft-ietf-spring-srv6-path-segment-07)

You don’t do this, you use a descriptive approach instead. I'm ok with this in
isolation, but I’d like to know if you made an affirmative decision to diverge
from the usual SRv6 way of doing things, and if so, why, and if the SPRING
working group specifically considered this and is OK with it.

### Section 8.2, shorter than what?

In such case, there is no need for a preceding
Prefix SID and the resulting repair list is likely shorter.

Shorter than what? This is the first place in this document the string “Prefix
SID” occurs, so I’m confused.

### Section 11, limit the implementation of local FRR policies

Based on this assumption, in order to facilitate the operation of
FRR, and limit the implementation of local FRR policies

Do you mean "limit the need for implementation of local FRR policies"? (And you
could drop “implementation of” for that matter.)

### Section 11, TI-LFA and SR policies don't mix?

The last paragraph, regarding the use of SR policies, and also Section 9,
leaves me wondering whether a simpler statement would be that TI-LFA is
inappropriate for use in a network that makes use of SR policies. Is this a
fair characterization?

### Section 12, can't this be an appendix?

Shouldn’t this be an Appendix? In general, that seems like a common (and good!)
practice for inessential information like this, especially when it has a
potentially limited shelf-life.

Also, although I appreciate that you provided some rudimentary parameterization
of the topologies in Table 1, I think it would be helpful to at least say what
time period the topologies reflect —
draft-francois-rtgwg-segment-routing-ti-lfa-00 dates to summer of 2015; are we
talking about the topologies that were in vogue in 2015? Those of 2023? Etc.

### Section 12, granularity wut

We do not cover the
case for 2 SIDs (Section 6.3) separately because there was no
granularity in the result.

I don’t understand what this means. Can you rephrase it? Generally, I find the
words “granular”. “granularity”, and related have almost zero descriptive
power. :-(

### Section 12, "2 or more" or "2", "3", and no more?

In your description of the table, you say,

The convention that we use is as follows

* 0 SIDs: the calculated repair path starts with a directly
...
* 1 SIDs: the repair node is a PQ node, in which case only 1 SID is
...
* 2 or more SIDs: The repair path consists of 2 or more SIDs as
described in Section 6.3 and Section 6.4. We do not cover the
case for 2 SIDs (Section 6.3) separately
...

But the table headers show:

+-------------+------------+------------+------------+------------+
| Network | 0 SIDs | 1 SID | 2 SIDs | 3 SIDs |
+-------------+------------+------------+------------+------------+

I.e. the table headers don’t show “2 or more” they show 2, and 3, broken out
distinctly, and no "or more" case. Seems like these need to be reconciled.

### Section 13, guaranteed upper bound

The techniques described in this document are internal
functionalities to a router that result in the ability to guarantee
an upper bound on the time taken to restore traffic flow upon the
failure of a directly connected link or node.

This is the only place in the document where you talk about guaranteed upper
bound. This is a fairly strong promise to make, I think you shouldn't be
mentioning it unless you provide some kind of support for how the guarantee is
provided. Note, I don't question that TI-LFA can be part of the machinery
providing such a guarantee, but without showing your work I don't think you can
make this claim.

## Notes

This review is in the ["IETF Comments" Markdown format][ICMF], You can use the
[`ietf-comments` tool][ICT] to automatically convert this review into
individual GitHub issues.

[ICMF]: https://github.com/mnot/ietf-comments/blob/main/format.md
[ICT]: https://github.com/mnot/ietf-comments

_______________________________________________
rtgwg mailing list
rtgwg@ietf.org
https://www.ietf.org/mailman/listinfo/rtgwg

John Scudder's Discuss on draft-ietf-rtgwg-segment-routing-ti-lfa-13: (with DISCUSS and COMMENT)

Reply via email to