On 04/12/2024 17:59, Tony Li wrote:

Les,

Upgrades are the motivation for deploying multiple algorithms. It allows for incremental rollout of a new algorithm. Yes, there are significant operational considerations.

Reverting to full flooding is neither practical nor necessary.  Migration has the strong advantage of having a minimal blast radius, as has been requested.

One can migrate from one algo to the other without reverting to full flooding by using the leader-announced algo.

Just upgrade all routers participating in V1 algo so that they support V2 algo. After all of them are upgraded, let the leader announce the V2 version. This would switch all of them from V1 to V2.
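
The procedure above can be sketched as a simple gate: the leader keeps announcing V1 until every participating router reports support for V2, and only then switches the whole area. This is a minimal illustration under stated assumptions, not something taken from any draft; the `Router` type, its `supported` field, and `leader_announcement` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    name: str
    # Algorithm versions this router is able to run (hypothetical field).
    supported: set = field(default_factory=set)


def leader_announcement(routers, current, target):
    """Return the version the leader should announce.

    The leader keeps announcing `current` until every router supports
    `target`; only then does it switch the whole area at once.
    """
    if all(target in r.supported for r in routers):
        return target
    return current


routers = [
    Router("r1", {"V1", "V2"}),
    Router("r2", {"V1"}),          # not yet upgraded
    Router("r3", {"V1", "V2"}),
]
before = leader_announcement(routers, "V1", "V2")   # one router still lacks V2

routers[1].supported.add("V2")                      # last router upgraded
after = leader_announcement(routers, "V1", "V2")    # all routers support V2
```

In this sketch `before` stays at V1 and `after` becomes V2, so no router is ever asked to run an algorithm it does not support.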

From my perspective this is way easier than migrating routers one by one and messing around with the presence of the multiple algos.

thanks,
Peter



Interoperability is not a serious problem as there is a boundary of legacy flooding between dissimilar algorithms.

Once you grasp that you have only a single algorithm within a subgraph, debugging gets a whole lot easier.
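
One way to picture the subgraph view is to classify every link by the algorithm and version of its two endpoints. The sketch below is only an illustration under stated assumptions: `classify_links` and the `algo_of` mapping are hypothetical names, and a node mapped to `None` is assumed to run legacy full flooding.

```python
def classify_links(links, algo_of):
    """Split links into per-algorithm subgraphs and legacy boundary links.

    links   -- iterable of (node_a, node_b) pairs
    algo_of -- dict mapping node -> (algorithm, version) tuple,
               or None for a node running legacy full flooding
    """
    subgraphs = {}   # (algorithm, version) -> links inside that subgraph
    boundary = []    # links that must use legacy full flooding
    for a, b in links:
        same = algo_of[a] is not None and algo_of[a] == algo_of[b]
        if same:
            subgraphs.setdefault(algo_of[a], []).append((a, b))
        else:
            boundary.append((a, b))
    return subgraphs, boundary


algo_of = {
    "A": ("X", 1), "B": ("X", 1),   # subgraph running X version 1
    "C": ("X", 2), "D": ("X", 2),   # subgraph running X version 2
}
links = [("A", "B"), ("B", "C"), ("C", "D")]
subgraphs, boundary = classify_links(links, algo_of)
```

Here links A-B and C-D each fall inside a single-algorithm subgraph, while B-C joins dissimilar versions and lands in `boundary`, the set of links that falls back to legacy flooding.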

T

p.s. Tony and I have discussed things offline and I am hoping that he will revise his drafts so that they are easier to absorb.


On Dec 4, 2024, at 8:36 AM, Les Ginsberg (ginsberg) - ginsberg at cisco.com <[email protected]> wrote:

Tony –

Upgrades are orthogonal to my comments.

I am speaking about the need to deploy multiple flooding algorithms in a network (one of which may be “static”).

We have never considered that in scope before – and there are obvious challenges to doing so – not least of which is the ability to test.

I think when you say “upgrade” you are talking about needing to migrate from algorithm X to algorithm Y – or from Algo X-V1 to Algo X-V2 where V2 has some fix that isn’t fully interoperable with V1.

We already have a way of handling this case:

Revert to base flooding everywhere – do the upgrade – and then enable the upgraded algo.
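
The three steps can be modelled as a sequence of area-wide states, which makes the key property easy to check: at no step do two different flooding behaviours coexist. This is a rough sketch under stated assumptions; the `revert_upgrade_enable` function and the per-router dictionaries are hypothetical and do not reflect any implementation.

```python
BASE = "base"  # legacy full flooding


def revert_upgrade_enable(routers, new_algo):
    """Yield (step_name, running_state) after each of the three steps.

    routers -- dict mapping router name -> {"running": str, "supports": set}
    """
    def snapshot():
        return {name: r["running"] for name, r in routers.items()}

    # Step 1: revert to base flooding everywhere.
    for r in routers.values():
        r["running"] = BASE
    yield "reverted", snapshot()

    # Step 2: upgrade each router while the area still runs base flooding.
    for r in routers.values():
        r["supports"].add(new_algo)
    yield "upgraded", snapshot()

    # Step 3: enable the upgraded algorithm everywhere.
    for r in routers.values():
        r["running"] = new_algo
    yield "enabled", snapshot()


routers = {
    "r1": {"running": "X-V1", "supports": {"X-V1"}},
    "r2": {"running": "X-V1", "supports": {"X-V1"}},
}
steps = list(revert_upgrade_enable(routers, "X-V2"))
# Steps in which routers were running different things (expected: none).
mixed = [name for name, state in steps if len(set(state.values())) > 1]
```

The cost of this approach is visible in the first two steps, where the whole area runs unoptimized base flooding until the upgrade completes.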

Conceptually, this is consistent with how we have deployed major infra upgrades (e.g., narrow to wide metrics).

This is far safer than trying to deal with co-existence – not least because once you allow co-existence you have to allow that a customer might use this as a permanent state – not just an upgrade state.

Given the challenges we already face with interoperability even when all routers are trying to “do the same thing” (and I am not limiting this comment to just flooding), the idea that we should now embrace a persistent state where routers are intentionally doing inconsistent things seems at best naïve.

Imagine that you and I are called to root cause problems in a customer network.

Your implementation supports algorithm X and doesn’t understand algorithm Y.

My implementation supports algorithm Y and doesn’t understand algorithm X.

Flooding issues are notoriously difficult to diagnose – even when all nodes are supposed to be doing the same thing.

All the while our mutual customer is (rightfully) pressuring us to get this fixed ASAP.

We might well ask “how did we get into this mess”.

Les

*From:*Tony Li <[email protected]> *On Behalf Of *Tony Li
*Sent:* Wednesday, December 4, 2024 7:54 AM
*To:* Les Ginsberg (ginsberg) <[email protected]>
*Cc:* Tony Przygienda <[email protected]>; Peter Psenak (ppsenak) <[email protected]>; Shraddha Hegde <[email protected]>; Robert Raszuk <[email protected]>; lsr <[email protected]>
*Subject:* Re: [Lsr] Another counter-example

Les,

The step that you’re missing is that upgrades are inevitable and thus an operational necessity.

We are very, very, very unlikely to get things right on the first go. Therefore, we will need to fix our bugs. How do you deploy that bug fix? Add to the mix that we’re not willing to do a flag day cutover to the fix.

A better way of thinking of mesh groups is that they are the ’static routes’ of legacy flooding.  They are installed by network operators and are presumed to be perfect. No signaling necessary.

Tony



    On Dec 4, 2024, at 7:28 AM, Les Ginsberg (ginsberg) - ginsberg at
    cisco.com <[email protected]> wrote:

    I am very much in agreement with Peter – though I think his
    commentary is “too kind”. 😊

    The issue with mesh groups is that they are opaque to other nodes
    i.e., you may come up with a way of signaling that a node has
    configured mesh groups (which BTW the distoptflood draft does NOT
    currently have – and I hope it never does…) but unless you are
    going to also propose that a node signal what links are/are not
    being used for flooding the best you can do from the POV of other
    nodes is treat the node as if it is running a flooding algorithm
    which is totally opaque – and which is also “brittle” i.e., it
    doesn’t do well in the event of topology changes.

    To Tony P – one of the things that disturbs me about the way this
    discussion is taking place is how we seem to have “skipped steps”.

    The interest in optimized flooding dates back decades.

    Early attempts include:

    https://datatracker.ietf.org/doc/rfc2973/ (Mesh Groups) (circa 2000)

    https://datatracker.ietf.org/doc/html/draft-ietf-ospf-isis-flood-opt-01
    (circa 2001)

    MANET work (circa 2014)

    All of these attempts were very conservative in nature. The
    notion of deploying multiple solutions simultaneously and
    thinking about how they might “interoperate” was quite
    deliberately not looked at. The general view has been “be very
    very careful when you mess with flooding”.

    Suddenly, we now seem to have “leaped off the cliff” and are talking
    about deploying multiple algorithms and trying to get them to
    “interoperate”.

    At what point did the WG conclude that this is a real requirement
    and that it actually can be deployed safely?

    If people want to discuss this – the WG is a fine place to do it.
    But I would appreciate discussion that does not skip over the
    very real concerns that have kept us from even considering this
    for the last three decades.

    Les

    *From:*Tony Przygienda <[email protected]>
    *Sent:* Wednesday, December 4, 2024 12:35 AM
    *To:* Peter Psenak (ppsenak) <[email protected]>
    *Cc:* Shraddha Hegde <[email protected]>; Robert Raszuk
    <[email protected]>; Tony Li <[email protected]>; lsr <[email protected]>
    *Subject:* [Lsr] Re: Another counter-example

    Valid point of view, but there are other possible solutions to the
    whole thing that don't require upgrading the mesh-group nodes as a
    precondition. If consensus passes and we start to work on the
    details of the necessary leaderless signalling, putting that in
    some framework that's part of operational considerations would be
    my take ...

    thanks

    -- tony

    On Wed, Dec 4, 2024 at 9:25 AM Peter Psenak <[email protected]>
    wrote:

        Hi Shraddha,

        so you define mesh-groups to be a separate flooding algorithm
        itself, requiring all routers using them to be upgraded.  By
        the time you do that, you can also replace mesh-groups with
        the distopt on all routers and be done with it, instead of
        trying to solve the coexistence of the two.

        thanks,
        Peter

        On 04/12/2024 07:48, Shraddha Hegde wrote:

            Hi Robert,

            With dist-opt flood reduction running in leaderless mode,
            it is possible for the operator to run mesh-groups in some
            part of the network and introduce distopt flooding in
            other parts where needed. The nodes configured with
            mesh-groups have to be upgraded to advertise that they are
            running a different flood reduction algorithm. The distopt
            algorithm will then ensure that the neighbors of the nodes
            running mesh-groups always become reflooders, and hence
            correct flooding behaviour is ensured in the CDS where
            distopt runs.

            Some networks have mesh-groups deployed in a well-defined
            part of the topology, where they reduce back-flooding by
            50%. This has been deployed for many years and is serving
            well. If an operator wants to keep that config and
            introduce distopt in other parts of the topology (during
            migration or otherwise), it’s a very valid use case and
            can be supported with the distopt algorithm.

            Rgds

            Shraddha

            *From:* Robert Raszuk <[email protected]>
            *Sent:* 27 November 2024 15:58
            *To:* Peter Psenak <[email protected]>
            *Cc:* Tony Li <[email protected]>; Tony Przygienda
            <[email protected]>; lsr <[email protected]>
            *Subject:* [Lsr] Re: Another counter-example

            > you are talking about mixing the manual mesh group with
            optimized flooding.

            I am talking about an accidental mix (legacy
            configuration at some nodes) not a planned one.

            And you either auto detect it and disable the ability to
            optimally flood or you push full responsibility to the
            operator.

            Thx,

            R.

            On Wed, Nov 27, 2024 at 11:16 AM Peter Psenak
            <[email protected]> wrote:

                Robert,

                On 27/11/2024 10:32, Robert Raszuk wrote:

                    Peter,

                    My point was that this should at least be
                    mentioned in the operational considerations
                    section if dynamic flooding is expected to work
                    in mixed networks where some nodes support the
                    new algorithm and some do not (your "regular
                    flooding case").

                you are talking about mixing the manual mesh group
                with optimized flooding. I don't think we want to go
                that path.

                thanks,

                Peter

                    On Wed, Nov 27, 2024 at 10:28 AM Peter Psenak
                    <[email protected]> wrote:

                        Robert,

                        On 27/11/2024 10:22, Robert Raszuk wrote:

                            Peter,

                            I am not sure if what Tony said is a
                            requirement or an observation.

                            > Note that combining routers that run
                            the elected optimized algorithm

                            > with routers that do run the regular
                            flooding is not a problem.

                            Note that static mesh groups can be
                            present today too and you can't assume
                            that it is either an optimized algorithm
                            or full flooding.

                        please do not compare apples with oranges.

                        Static mesh groups are manually configured
                        and if not done correctly can result in
                        broken flooding. What we are discussing here
                        is a dynamic flooding algorithm, not manual
                        flooding blocking.

                        thanks,
                        Peter

                            Thx,

                            R.

                            On Wed, Nov 27, 2024 at 9:58 AM Peter
                            Psenak
                            <[email protected]> wrote:

                                On 27/11/2024 00:18, Tony Li wrote:
                                > A distributed algorithm computing a
                                flooding topology must only
                                > operate upon nodes running the same
                                algorithm (and version). If
                                > multiple algorithms (and/or
                                versions) are running in the same
                                network,
                                > then any given algorithm and
                                version defines a subgraph and the
                                > algorithm can only optimize
                                flooding within its own subgraph. Legacy
                                > full flooding must be used between
                                subgraphs of different algorithms
                                > or versions.

                                This is a new requirement for the
                                flooding algorithm itself. This does
                                not exist with the existing leader
                                based election, as that guarantees
                                that only one optimized flooding
                                algorithm is ever present in the area.
                                Note that combining routers that run
                                the elected optimized algorithm
                                with routers that do run the regular
                                flooding is not a problem.

                                thanks,
                                Peter

                                _______________________________________________
                                Lsr mailing list -- [email protected]
                                To unsubscribe send an email to
                                [email protected]

