Fred, I understood full well that you only envision a small number of reassembly devices.  After all, on any given path only one device will likely reassemble.  Still, that device will be spending a lot of resources in a very expensive part of the path (fast path forwarding) to provide a small benefit to some hosts.

Fundamentally you are asking the architecture to spend those resources for a use case that you have not explained.  "I have proof" is not relevant.  Without knowing the scenarios and the assumptions, it does not help us to judge.  It is worse than the case in the early days of the MANET working group where the competing proposal repeatedly said "my simulation shows ..."

Fundamentally, it is not the network's job to reassemble packets for a host.  If you want NICs to do that, as Tom has said, that's fine.  It is a private matter between the host and the NIC.  But you are asking for functionality in the network.

I note also that you are assuming that hosts have links that support actual MTUs larger than 64K.  I know of no link that has those properties in current use.  (I am vaguely familiar with HIPPI and Fibre Channel.  Neither appears to be relevant.)

Yours,

Joel

On 7/12/2022 10:02 AM, Templin (US), Fred L wrote:

Joel, you are misunderstanding what nodes would be involved in reassembly; this would not be at every single IP layer router in the path. It would only be at possibly 0, 1 or 2 adaptation layer middleboxes in the path from source to destination. And, then most likely only at a near-end middlebox very near the destination that happens to know the destination would prefer to receive larger parcels.

About segment size, I have proof that using segment sizes significantly larger than the path MTU can often produce dramatic performance increases even when fragmentation is intentionally invoked. I also have proof that packaging multiple segments in the same system call can drive performance even higher and without reducing the segment size. IP parcels takes it to the logical next step of allowing multiple segments to travel together in the same packet, which may or may not be subject to fragmentation and reassembly.

But, let’s not get so hung up on the middlebox question that we forget the benefits for end-to-end.

Fred

*From:*Joel Halpern [mailto:j...@joelhalpern.com]
*Sent:* Monday, July 11, 2022 4:02 PM
*To:* Templin (US), Fred L <fred.l.temp...@boeing.com>
*Cc:* int-area@ietf.org
*Subject:* Re: [Int-area] Re: Call for WG adoption of draft-templin-intarea-parcels-10


No, intermediate reassembly is not an optimization.

First, it is a bad idea.  It is very painful for routers to perform reassembly.  They have to burn expensive resources managing such attempted reassembly.  It has major cost even if the router decides to give up and forward the pieces.

And second, unless one makes some unstated assumptions, in the absence of such reassembly the sending host will be throttled to the receiving host rate.  So the benefit of the entire system is markedly reduced.

Net: we should not adopt this draft.

Yours,

Joel

On 7/11/2022 6:41 PM, Templin (US), Fred L wrote:

    Tom,

    > Why would someone put six segments in a parcel if they already
    have a 9K link MTU?

    > Why not just send one segment in 9K?

    This is the mindset that we need to overcome. We have had it
    drilled into our heads that MSS must be the same as the path MTU,
    but it does not need to be that way. If the MSS is smaller than the
    path MTU, but we can send multiple segments in a single parcel that
    more closely approaches the size of the path MTU then amortization
    savings are possible.

    >The algorithm isn't the problem, it's supporting new protocols and
    multiple

    >checksums in a packet in hardware.

    But Tom, how hard can this be? Instead of running the Internet
    checksum 1 time over N octets of data simply run it M times over N/M
    octet chunks of the data in succession but still in a single pass.
    You spoke before of NICs adapting to support TCP jumbograms – if
    they can do that, why not a very straightforward application of
    Internet checksum? I haven’t looked at this in a long while, but
    isn’t this also similar to what UDP-lite did?
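
As a rough illustration of the "M times over N/M octet chunks" idea, here is a minimal Python sketch. It assumes the standard one's-complement Internet checksum (RFC 1071) and a hypothetical helper, `per_segment_checksums`, that is not taken from the draft. It shows only the arithmetic; it says nothing about the hardware offload cost that Tom raises.

```python
from typing import List


def internet_checksum(data: bytes) -> int:
    """One's-complement Internet checksum (RFC 1071) over `data`."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero octet
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def per_segment_checksums(payload: bytes, segment_size: int) -> List[int]:
    """Hypothetical helper: walk the payload once, one checksum per segment."""
    return [
        internet_checksum(payload[off:off + segment_size])
        for off in range(0, len(payload), segment_size)
    ]


if __name__ == "__main__":
    payload = bytes.fromhex("0001f203f4f5f6f7")
    print(hex(internet_checksum(payload)))
    print([hex(c) for c in per_segment_checksums(payload, 4)])
```

Each octet is still read exactly once; the only difference from the single-checksum case is that the running sum is finalized and restarted at every segment boundary.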

    > Either you're trivializing reassembly or maybe you're thinking of
    some new method that

    > somehow avoids all the pitfalls and problems we've had with
    reassembly over the years!

    Intermediate node parcel reassembly is really just an optimization
    to try to pass the largest possible parcels on to the next hop
    instead of passing many smaller ones. It is really just a
    concatenation of segments of sub-parcels belonging to the same
    original parcel. Reordering is unimportant – it is OK to concatenate
    sub-parcels 3,8,5,2 in that order and without even waiting for any
    other sub-parcels to show up. The application will simply perceive
    it as a case of network reordering and the upper layer protocol will
    do the correct thing with the sequence numbers. AFAICT, the only
    hard requirement is that the final sub-parcel must not be
    concatenated as an intermediate sub-parcel.
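
The concatenation rule described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the draft's wire format: the `SubParcel` type, its `index` and `is_final` fields, and the `combine` function are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SubParcel:
    index: int               # hypothetical position within the original parcel
    segments: List[bytes]    # upper layer protocol segments carried
    is_final: bool = False   # hypothetical marker for the final sub-parcel


def combine(arrived: List[SubParcel]) -> List[bytes]:
    """Concatenate segments in arrival order, keeping any final sub-parcel last."""
    body = [sp for sp in arrived if not sp.is_final]
    tail = [sp for sp in arrived if sp.is_final]
    out: List[bytes] = []
    for sp in body + tail:
        out.extend(sp.segments)
    return out


if __name__ == "__main__":
    arrived = [
        SubParcel(3, [b"c"]),
        SubParcel(9, [b"z"], is_final=True),
        SubParcel(5, [b"e"]),
        SubParcel(2, [b"b"]),
    ]
    print(combine(arrived))
```

Note that the sketch works only on sub-parcels already in hand; it does not wait for missing ones, which matches the "without even waiting" wording but leaves the timer and state questions raised elsewhere in the thread untouched.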

    This stuff will all work, and it will work for the betterment of
    the Internet.

    Fred

    *From:*Tom Herbert [mailto:t...@herbertland.com]
    *Sent:* Monday, July 11, 2022 2:57 PM
    *To:* Templin (US), Fred L <fred.l.temp...@boeing.com>
    *Cc:* Richard Li <richard...@futurewei.com>; Juan Carlos Zuniga
    (juzuniga) <juzuniga=40cisco....@dmarc.ietf.org>; int-area@ietf.org
    *Subject:* Re: [EXTERNAL] Re: [Int-area] Call for WG adoption of
    draft-templin-intarea-parcels-10


        

    On Mon, Jul 11, 2022 at 2:20 PM Templin (US), Fred L
    <fred.l.temp...@boeing.com> wrote:

        Tom, some rejoinders:

        >Yes, I agree if the packet is fragmented by the network then this
        is a nice feature.

        >However, today we already have this from a host perspective property
        by just

        >sending "small" packets.

        It can be readily shown that some applications get much greater
        performance by sending larger packets that trigger
        fragmentation/reassembly than by sending smaller packets that do
        not. Performance increases of multiple orders of magnitude are
        indeed possible.

        >I'm not sure the savings qualify as significant. 9K MTUs are
        becoming common in data centers

        >and the standard TCP/IPv6 header is 80 bytes so that's already
        less than 1% overhead.

        I think 9K is only a starting point, and IP parcels pave the
        way to much larger link MTUs,

        possibly even in excess of 64KB. And, doing the math, even for
        just a 9K link sending a

        single parcel that contains 6x 1440 octet segments would save
        5 * 60 == 300 octets in

    Why would someone put six segments in a parcel if they already
    have a 9K link MTU? Why not just send one segment in 9K?

        comparison with sending 6x  1500 octet packets with 60 octets
        of IP/TCP headers per

        packet. For links with larger MTUs, the savings for sending
        parcels with lots of segments

        (up to 64) becomes even greater.
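
The header arithmetic in this paragraph can be checked with a short Python sketch. It uses only the figures quoted above (60 octets of IP/TCP headers, 1440-octet segments, six segments) and assumes, for simplicity, that the parcel adds no per-segment fields of its own; any such fields in the actual parcel format would reduce the saving.

```python
HEADER_OCTETS = 60     # IP/TCP headers per packet, as quoted in the message
SEGMENT_OCTETS = 1440  # payload octets per segment
SEGMENTS = 6


def separate_packets_total(n: int, seg: int, hdr: int) -> int:
    """Octets on the wire when every segment carries its own headers."""
    return n * (seg + hdr)


def single_parcel_total(n: int, seg: int, hdr: int) -> int:
    """Octets on the wire when n segments share one set of headers.

    Assumption: no extra per-segment fields are counted.
    """
    return hdr + n * seg


separate = separate_packets_total(SEGMENTS, SEGMENT_OCTETS, HEADER_OCTETS)
parcel = single_parcel_total(SEGMENTS, SEGMENT_OCTETS, HEADER_OCTETS)
saved = separate - parcel

if __name__ == "__main__":
    print(f"separate packets: {separate} octets")
    print(f"single parcel:    {parcel} octets")
    print(f"saved:            {saved} octets ({100 * saved / separate:.2f}%)")
```

Under these assumptions the saving is (n - 1) header sets, so it grows linearly with the number of segments per parcel.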

        >As I already mentioned, this is addressed by the BiGTCP work
        (https://lwn.net/Articles/884104).

        >Sending or receiving multi-megabytes TCP segments in one system call
        is now feasible. Also, it's

        >inevitable that NIC vendors will apply this also to be able to offload
        TCP jumbo grams. Given this

        >is just software that doesn't require hardware change or
        on-the-wire protocols to change, it's

        >immediately deployable with just a software change which is a huge
        benefit to datacenter operators.

        As I have said, IP parcels has the same advantage within the
        host system-call (user-space

        to kernel-space) context. But, IP parcels goes a step further
        to provide efficient packaging

        over-the-wire, whereas the approach you are referring to opens
        the box inside the

        kernel and sends individual packets instead of aggregates.

        >All modern NIC HW can deal with offloading a single checksum per
        packet, it's going to be

        >a major effort for them to offload multiple checksums like IP
        parcels needs. Without checksum

        >offload, this would be a non-starter for a lot of deployments.

        Check the latest spec (now at -12 and likely to stay that way
        until IETF114). Any H/W checksum

        that can run over the first segment of a packet should be
        possible to make run over the N-1

        additional segments of the same packet (parcel) by applying
        the very familiar Internet

        checksum algorithm.

    The algorithm isn't the problem, it's supporting new protocols and
    multiple checksums in a packet in hardware.

        >I'm not convinced of that. For instance, I'm skeptical that
        intermediate devices trying to reassemble

        >packets that aren't addressed to themselves could ever be robust or
        efficient (i.e. complexity, non-work

        >conserving resource requirements, security issues with reassembly,
        multi-path that causes latency

        >increase, potential DoS vector, etc.). Can you comment on this?

        Perhaps what is confusing this matter is that the intermediate
        devices referred to

        here most certainly do not refer to all routers in the path.
        Instead, what is intended

        here is an OMNI intermediate device, of which there may be
        something on the order

        of 0, 1, or 2 of them on the path between the OMNI source and
        destination even

        though there may be many 10’s or even 100’s of ordinary IP
        routers on the path.

        And, again, this is not a strict reassembly case – instead, it
        is an opportunistic

        “combine if convenient; else forward” swift decision.

    Either you're trivializing reassembly or maybe you're thinking of
    some new method that somehow avoids all the pitfalls and problems
    we've had with reassembly over the years! Consider that many NIC
    vendors have tried, and largely failed, to get any sort of device
    reassembly widely deployed (e.g. IP reassembly, TCP segmentation
    reassembly, etc.). The reason they failed is because they can't
    give the host stack transparency and control over the reassembly
    process.

    By its nature reassembly can only be done with at least two packets.
    That means a device performing reassembly has to receive one
    packet, hold it, and wait for the following packet to perform
    reassembly. That makes reassembly, unlike fragmentation, a
    non-work conserving process. Many issues and policies arise from
    this. For instance, what happens if a packet is held and the
    following packet is never seen? (usually implies a reassembly
    timer). What happens if a packet is received OOO and is already
    forwarded, but the preceding packet is then received, do we try to
    reassemble that one? (the solution here seems to be to maintain
    some sort of flow state)? What about overlapping fragments and the
    security issues around that?
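
To make the state these questions imply concrete, here is a minimal Python sketch of a reassembly buffer. It is not any device's actual implementation: the `ReassemblyBuffer` class, its flow key, the one-second default timeout, and the drop-on-overlap policy are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Hashable, Optional


@dataclass
class Pending:
    first_seen: float
    fragments: Dict[int, bytes] = field(default_factory=dict)  # offset -> data
    total_len: Optional[int] = None  # known once the last fragment arrives


class ReassemblyBuffer:
    """Holds fragments per flow key until complete or timed out."""

    def __init__(self, timeout: float = 1.0) -> None:
        self.timeout = timeout
        self.pending: Dict[Hashable, Pending] = {}

    def receive(self, key: Hashable, offset: int, data: bytes,
                more_fragments: bool, now: float) -> Optional[bytes]:
        entry = self.pending.setdefault(key, Pending(first_seen=now))
        # Assumed policy: any overlap discards the whole reassembly context.
        for off, frag in entry.fragments.items():
            if offset < off + len(frag) and off < offset + len(data):
                del self.pending[key]
                return None
        entry.fragments[offset] = data
        if not more_fragments:
            entry.total_len = offset + len(data)
        if self._complete(entry):
            del self.pending[key]
            return b"".join(entry.fragments[o] for o in sorted(entry.fragments))
        return None  # packet is held: the device did no forwarding work

    @staticmethod
    def _complete(entry: Pending) -> bool:
        if entry.total_len is None:
            return False
        expected = 0
        for off in sorted(entry.fragments):
            if off != expected:
                return False  # gap: still waiting for a fragment
            expected += len(entry.fragments[off])
        return expected == entry.total_len

    def expire(self, now: float) -> int:
        """Reassembly timer: drop contexts older than the timeout."""
        stale = [k for k, e in self.pending.items()
                 if now - e.first_seen > self.timeout]
        for k in stale:
            del self.pending[k]
        return len(stale)
```

Even this toy version needs per-flow memory, a timer sweep, and an overlap policy, and it still does not handle a fragment that arrives after its neighbours were already forwarded.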

    IMO, if the WG does pursue this, I believe a lot of the effort
    will be in specifying how reassembly in intermediate nodes works.

    Tom

        Thanks - Fred

        *From:*Tom Herbert [mailto:t...@herbertland.com]
        *Sent:* Monday, July 11, 2022 1:34 PM
        *To:* Templin (US), Fred L <fred.l.temp...@boeing.com>
        *Cc:* Richard Li <richard...@futurewei.com>; Juan Carlos
        Zuniga (juzuniga) <juzuniga=40cisco....@dmarc.ietf.org>;
        int-area@ietf.org
        *Subject:* [EXTERNAL] Re: [Int-area] Call for WG adoption of
        draft-templin-intarea-parcels-10


                

        On Mon, Jul 11, 2022 at 12:22 PM Templin (US), Fred L
        <fred.l.temp...@boeing.com> wrote:

            Richard and others, thank you for these comments and for
            the ensuing discussion that

            took place over the time I was away on vacation. Strange
            how the timing hit when I

            was away from the office and off the grid - I was on a
            camping trip in Canada not far

            from where Steve Deering lives although I did not visit him.

            In any event, I was able to push out a new draft version
            ahead of the deadline that

            may address some (but likely not all) of your concerns:

            https://datatracker.ietf.org/doc/draft-templin-intarea-parcels/

            The major change is that the draft now talks about
            interactions with upper layer

            protocols including TCP and UDP, whereas the previous
            draft versions were silent

            regarding upper layer protocol framing.

            To others who have commented, I beg to differ and maintain
            that IP parcels do

            represent a significant improvement over the current state
            of affairs and over

            just regular IP jumbograms. In particular:

        Hi Fred, some comments in line.

            1) IP parcels make it so that the loss unit is a single
            segment instead of the entire

            packet/parcel, and loss of a segment often results in
            retransmission of just that

            segment instead of the entire packet/parcel.

        Yes, I agree if the packet is fragmented by the network then
        this is a nice feature. However, today we already have this
        from a host perspective property by just sending "small" packets.

            2) IP parcels are more efficient than sending a single
            segment per IP packet, since

            the parcel includes a single IP header plus single full
            {TCP,UDP} header for possibly

            many segments. This can result in significant savings in
            terms of bits over the wire

            for omitting unnecessary header bytes.

        I'm not sure the savings qualify as significant. 9K MTUs are
        becoming common in data centers and the standard TCP/IPv6
        header is 80 bytes so that's already less than 1% overhead.

            Consider the postal service analogy; when

            many items can be sent together in a single package/parcel
            there is a large savings

            in shipping and handling costs compared with when each individual
            item is shipped separately.

        As I already mentioned, this is addressed by the BiGTCP work
        (https://lwn.net/Articles/884104). Sending or receiving
        multi-megabytes TCP segments in one system call is now
        feasible. Also, it's inevitable that NIC vendors will apply
        this also to be able to offload TCP jumbo grams. Given this is
        just software that doesn't require hardware change or
        on-the-wire protocols to change, it's immediately deployable
        with just a software change which is a huge benefit to
        datacenter operators.

            3) IP parcels improve large packet integrity by including
            a separate checksum for

            each segment instead of a single checksum for the entire
            packet.

        All modern NIC HW can deal with offloading a single checksum
        per packet, it's going to be a major effort for them to
        offload multiple checksums like IP parcels needs. Without
        checksum offload, this would be a non-starter for a lot of
        deployments.

            This means that

            large parcels (up to a few MB) can be sent in one piece
            over links with sufficiently

            large MTU without requiring the link itself to provide
            strong integrity checks over

            the entire length of the parcel. This means that link MTUs
            significantly larger than

            9KB are now safely possible.

            4) IP parcels offer all of the efficiency advantages to
            upper layers as are offered

            by GSO/GRO, etc. but also provide benefits 1) through 3)
            above that are not

            offered by GSO/GRO.

        Most of this is doable in GSO/GRO.

            5) Plus, the idea is just plain neat. Better packaging is
            good. More efficient

            handling is good. Reduced header overhead is good. SAFE
            larger MTUs are

            good. The idea itself is good.

        I'm not convinced of that. For instance, I'm skeptical that
        intermediate devices trying to reassemble packets that aren't
        addressed to themselves could ever be robust or efficient
        (i.e. complexity, non-work conserving resource requirements,
        security issues with reassembly, multi-path that causes
        latency increase, potential DoS vector, etc.). Can you comment
        on this?

        Tom

            Fred

            *From:*Int-area [mailto:int-area-boun...@ietf.org] *On
            Behalf Of *Richard Li
            *Sent:* Friday, July 01, 2022 3:11 PM
            *To:* Juan Carlos Zuniga (juzuniga)
            <juzuniga=40cisco....@dmarc.ietf.org>
            *Cc:* int-area@ietf.org
            *Subject:* Re: [Int-area] Call for WG adoption of
            draft-templin-intarea-parcels-10

            Chairs and Authors,

            I always like every new idea and effort to improve the
            Internet performance, and thus I have read this draft with
            a great interest. The following are my
            observations/comments/questions. If they don’t make any
            sense to you, please accept my apology, and disregard them.

            1.The text “multiple upper layer protocol segments” is
            ambiguous. It seems that you really mean “multiple
            segments from ‘the same’ upper layer protocol”, doesn’t
            it? It seems that multiple segments from different upper
            layer protocols are not allowed in your parcel.

            2.Is the following a fair statement? All segments in the
            same packet come from the same application identified by
            the 5-tuple (source address, destination address, source
            port, destination port, protocol number).

            3.Segment size

            You require that their sizes be the same except for the
            last one. Is this required for easy implementation or
            what? Do you require it for any other reasons?

            4.TTL issue

            You described how parcels are forwarded over the
            Internetwork, and in particular you described what the
            ingress/egress middlebox does about parcels. I understand
            that the ingress middlebox may break the parcel into
            smaller ones, which may rejoin at the egress middlebox. My
            question is about TTL. As different smaller parcels may
            traverse along different paths, as a result their TTLs may
            be different when they reach the egress middlebox. How
            does the egress middlebox set up the TTL value? Please
            provide more descriptions.

            5.Reordering at the egress middlebox

            The parcels would arrive one after another, and therefore
            the egress middlebox would “wait” for a little bit to
            identify and pick up enough parcels/packets for their
            rejoining and repackaging. A description of the egress
            middlebox behavior would be useful and helpful, in
            particular I would like to know more about the waiting
            time if any, and how you deal with the reordering and loss.

            6.IPv4 option

            Does the IETF still allow changing or adding IPv4 option
            fields? I might be wrong, but aren’t they frozen? Also, do
            commercial routers still care about IPv4 options?

            7.IPv6 option

            This draft has defined a hop-by-hop option; it will
            require every intermediate IPv6 router to inspect this
            option. There have been some discussions on the pros/cons
            about Hop-by-Hop IPv6 Option. Is there any feedback from
            WG 6man?

            8.Parcel Path Qualification

            This draft has described a method for parcel path
            qualification probe from end to end. It is nice to have
            it, but it is unreliable simply for the following reason:
            a probe parcel goes along one specific path, and your real
            application parcels may take different paths.

            9.Integrity

            First paragraph of Section 7. More explanation/elaboration
            should be useful. I might have missed it in previous
            paragraphs, but if I did, please provide a reference to it
            such as “as described in …”.

            10.Implementation Status

            In section 10. TSO’s performance gain and Parcel’s gain
            should be regarded as two different things. Since this
            draft is adding a hop-by-hop option, every intermediate
            router is required to process the hop-by-hop option, which
            will, theoretically speaking, lead to performance
            downgrade. Of course, the whole performance would depend
            on many other factors, such as the total numbers of
            routing table lookups and number of segments.

            11.General observation

            This proposal essentially tries to solve a problem caused
            by MTU. If the MTU were very large, one would simply put the whole
            data in a single packet. Since MTU is limited, a packet
            has to be cut into many smaller pieces (segments). In the
            existing specification, when an intermediate router sees a
            packet with its size larger than MTU, the router would be
            expected to fragment it so that the fragments could be
            forwarded. Here let me call it “fragmentation as needed”.
            In reality, however, some (if not all) commercial routers
            don’t do “fragmentation as needed”, instead of fragmenting
            the packet they simply discard it in order to achieve the
            wire-speed. This draft defines a new way to address the
            MTU issue: when a router sees a packet with its size
            larger than MTU, the router is asked to fragment it in a
            prescribed way (fragment it into pre-packaged segments).
            If I may, let me call it “fragmentation as prescribed”.
            Both “fragmentation as needed” and “fragmentation as
            prescribed” would require the support from intermediate
            routers. As with fragmentation as needed,
            fragmentation as prescribed may downgrade the performance
            of intermediate routers. What is more, intermediate
            routers/boxes may perform “rejoining and repackaging”,
            which will adversely impact the performance of the
            intermediate routers/boxes.

            Best regards,

            Richard

            *From:*Int-area <int-area-boun...@ietf.org> *On Behalf Of
            *Juan Carlos Zuniga (juzuniga)
            *Sent:* Wednesday, June 22, 2022 12:25 PM
            *To:* int-area@ietf.org
            *Subject:* [Int-area] Call for WG adoption of
            draft-templin-intarea-parcels-10

            Dear IntArea WG,

            We are starting a 2-week call for adoption of the
            IP-Parcels draft:

            
            https://www.ietf.org/archive/id/draft-templin-intarea-parcels-10.html


            The document has been discussed for some time and it has
            received multiple comments.

            If you have an opinion on whether this document should be
            adopted by the IntArea WG please indicate it on the list
            by the end of Wednesday July 6th.

            Thanks,

            Juan-Carlos & Wassim

            (IntArea WG chairs)

            _______________________________________________
            Int-area mailing list
            Int-area@ietf.org
            https://www.ietf.org/mailman/listinfo/int-area


