No, intermediate reassembly is not an optimization.

First, it is a bad idea.  It is very painful for routers to perform reassembly.  They have to burn expensive resources managing such attempted reassembly.  It has major cost even if the router decides to give up and forward the pieces.

And second, unless one makes some unstated assumptions, in the absence of such reassembly the sending host will be throttled to the receiving host rate.  So the benefit of the entire system is markedly reduced.

Net: we should not adopt this draft.

Yours,

Joel

On 7/11/2022 6:41 PM, Templin (US), Fred L wrote:

Tom,

> Why would someone put six segments in a parcel if they already have a 9K link MTU?

> Why not just send one segment in 9K?

This is the mindset that we need to overcome. We have had it drilled into our heads

that MSS must be the same as the path MTU, but it does not need to be that way.

If the MSS is smaller than the path MTU, but we can send multiple segments in a

single parcel that more closely approaches the size of the path MTU then

amortization savings are possible.

>The algorithm isn't the problem, it's supporting new protocols and multiple

>checksums in a packet in hardware.

But Tom, how hard can this be? Instead of running the Internet checksum 1 time

over N octets of data simply run it M times over N/M octet chunks of the data in

succession but still in a single pass. You spoke before of NICs adapting to support

TCP jumbograms – if they can do that, why not a very straightforward application

of Internet checksum? I haven’t looked at this in a long while, but isn’t this also

similar to what UDP-lite did?
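
To make the "M times over N/M octet chunks" idea concrete, here is a minimal Python sketch of the arithmetic only. It assumes the RFC 1071 one's-complement sum and equal-sized segments; the function names are made up for illustration, and it says nothing about pseudo-headers or about the hardware offload question Tom raises.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero octet
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    # Fold any carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def per_segment_checksums(payload: bytes, segment_len: int) -> list[int]:
    """Walk the payload once, producing one checksum per segment.

    Hypothetical helper: segment_len plays the role of N/M above.
    """
    return [
        internet_checksum(payload[off:off + segment_len])
        for off in range(0, len(payload), segment_len)
    ]
```

Each octet is still read exactly once; the only difference from a single checksum is that the accumulator is finalized and reset at every segment boundary.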

> Either you're trivializing reassembly or maybe you're thinking of some new method that

> somehow avoids all the pitfalls and problems we've had with reassembly over the years!

Intermediate node parcel reassembly is really just an optimization to try to pass the

largest possible parcels on to the next hop instead of passing many smaller ones. It is

really just a concatenation of segments of sub-parcels belonging to the same original

parcel. Reordering is unimportant – it is OK to concatenate sub-parcels 3,8,5,2 in that

order and without even waiting for any other sub-parcels to show up. The application

will simply perceive it as a case of network reordering and the upper layer protocol

will do the correct thing with the sequence numbers. AFAICT, the only hard requirement

is that the final sub-parcel must not be concatenated as an intermediate sub-parcel.
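
As a rough illustration of that rule, the following Python sketch joins sub-parcels in whatever order they arrived and only makes sure the final one ends up last. The `SubParcel` type and its fields are hypothetical and are not taken from the draft's wire format; this is one reading of the requirement, not a specification of it.

```python
from dataclasses import dataclass


@dataclass
class SubParcel:
    index: int             # position within the original parcel (illustrative)
    segments: list[bytes]  # upper layer protocol segments carried
    is_final: bool = False # True for the last sub-parcel of the parcel


def concatenate(arrived: list[SubParcel]) -> list[bytes]:
    """Join segments in arrival order, keeping the final sub-parcel last."""
    non_final = [sp for sp in arrived if not sp.is_final]
    final = [sp for sp in arrived if sp.is_final]
    out: list[bytes] = []
    for sp in non_final + final:
        out.extend(sp.segments)
    return out
```

With sub-parcels arriving as 3, 8, 5, 2 and 8 marked final, the result carries the segments of 3, 5, 2 and then 8; no sorting or waiting for missing sub-parcels is attempted.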

This stuff will all work, and it will work for the betterment of the Internet.

Fred

*From:*Tom Herbert [mailto:t...@herbertland.com]
*Sent:* Monday, July 11, 2022 2:57 PM
*To:* Templin (US), Fred L <fred.l.temp...@boeing.com>
*Cc:* Richard Li <richard...@futurewei.com>; Juan Carlos Zuniga (juzuniga) <juzuniga=40cisco....@dmarc.ietf.org>; int-area@ietf.org
*Subject:* Re: [EXTERNAL] Re: [Int-area] Call for WG adoption of draft-templin-intarea-parcels-10


        

EXT email: be mindful of links/attachments.

On Mon, Jul 11, 2022 at 2:20 PM Templin (US), Fred L <fred.l.temp...@boeing.com> wrote:

    Tom, some rejoinders:

    >Yes, I agree if the packet is fragmented by the network then this is a
    nice feature.

    >However, today we already have this from a host perspective property by just

    >sending "small" packets.

    It can be readily shown that some applications get much greater
    performance by

    sending larger packets that trigger fragmentation/reassembly than
    by sending

    smaller packets that do not. Multiple orders of magnitude
    performance increases

    are indeed possible.

    >I'm not sure the savings qualify as significant. 9K MTUs are becoming
    common in data centers

    >and the standard TCP/IPv6 header is 80 bytes so that's already less
    than 1% overhead.

    I think 9K is only a starting point, and IP parcels pave the way
    to much larger link MTUs,

    possibly even in excess of 64KB. And, doing the math, even for
    just a 9K link sending a

    single parcel that contains 6x 1440 octet segments would save 5 *
    60 == 300 octets in

Why would someone put six segments in a parcel if they already have a 9K link MTU? Why not just send one segment in 9K?

    comparison with sending 6x  1500 octet packets with 60 octets of
    IP/TCP headers per

    packet. For links with larger MTUs, the savings for sending
    parcels with lots of segments

    (up to 64) becomes even greater.
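
The two figures traded above (Tom's "less than 1% overhead" and Fred's "5 * 60 == 300 octets") can be checked with a few lines of Python. This is a back-of-the-envelope sketch: the function names are illustrative, and it deliberately ignores any per-segment fields a parcel might add, which would reduce the savings.

```python
def header_savings(num_segments: int, header_len: int) -> int:
    """Octets saved by sending one header instead of one per segment."""
    return (num_segments - 1) * header_len


def overhead_fraction(header_len: int, mtu: int) -> float:
    """Share of a full-MTU packet that is consumed by headers."""
    return header_len / mtu


# Fred's example: six segments, 60 octets of IP/TCP headers per packet.
savings = header_savings(num_segments=6, header_len=60)

# Tom's example: 80 octets of TCP/IPv6 headers on a 9K MTU link.
overhead = overhead_fraction(header_len=80, mtu=9000)
```

Both statements hold on their own terms: the parcel saves 300 octets against six separate packets, while a single 9K packet already spends under 1% of its size on headers.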

    >As I already mentioned, this is addressed by the BiGTCP work
    (https://lwn.net/Articles/884104).

    >Sending or receiving multi-megabytes TCP segments in one system call is
    now feasible. Also, it's

    >inevitable that NIC vendors will apply this also to be able to offload TCP
    jumbo grams. Given this

    >is just software that doesn't require hardware change or on-the-wire
    protocols to change, it's

    >immediately deployable with just a software change which is a huge benefit to
    datacenter operators.

    As I have said, IP parcels has the same advantage within the host
    system-call (user-space

    to kernel-space) context. But, IP parcels goes a step further to
    provide efficient packaging

    over-the-wire, whereas the approach you are referring to opens the
    box inside the

    kernel and sends individual packets instead of aggregates.

    >All modern NIC HW can deal with offloading a single checksum per
    packet, it's going to be

    >a major effort for them to offload multiple checksums like IP
    parcels need. Without checksum

    >offload, this would be a non-starter for a lot of deployments.

    Check the latest spec (now at -12 and likely to stay that way
    until IETF114). Any H/W checksum

    that can run over the first segment of a packet should be possible
    to make run over the N-1

    additional segments of the same packet (parcel) by applying the
    very familiar Internet

    checksum algorithm.

The algorithm isn't the problem, it's supporting new protocols and multiple checksums in a packet in hardware.

    >I'm not convinced of that. For instance, I'm skeptical that
    intermediate devices trying to reassemble

    >packets that aren't addressed to themselves could ever be robust or
    efficient (i.e. complexity, non-work

    >conserving resource requirements, security issues with reassembly,
    multi-path that causes latency

    >increase, potential DoS vector, etc.). Can you comment on this?

    Perhaps what is confusing this matter is that the intermediate
    devices referred to

    here most certainly do not refer to all routers in the path.
    Instead, what is intended

    here is an OMNI intermediate device, of which there may be
    something on the order

    of 0, 1, or 2 of them on the path between the OMNI source and
    destination even

    though there may be many 10’s or even 100’s of ordinary IP routers
    on the path.

    And, again, this is not a strict reassembly case – instead, it is
    an opportunistic

    “combine if convenient; else forward” swift decision.

Either you're trivializing reassembly or maybe you're thinking of some new method that somehow avoids all the pitfalls and problems we've had with reassembly over the years! Consider that many NIC vendors have tried, and largely failed, to get any sort of device reassembly widely deployed (e.g. IP reassembly, TCP segmentation reassembly, etc.). The reason they failed is because they can't give the host stack transparency and control over the reassembly process.

By its nature, reassembly can only be done with at least two packets. That means a device performing reassembly has to receive one packet, hold it, and wait for the following packet to perform reassembly. That makes reassembly, unlike fragmentation, a non-work conserving process. Many issues and policies arise from this. For instance, what happens if a packet is held and the following packet is never seen? (This usually implies a reassembly timer.) What happens if a packet is received OOO and is already forwarded, but the preceding packet is then received; do we try to reassemble that one? (The solution here seems to be to maintain some sort of flow state.) What about overlapping fragments and the security issues around that?
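
A toy Python sketch of the hold-and-wait behavior shows where the state and the timer come from. Everything here is hypothetical (the `HoldBuffer` class, pairing exactly two packets per flow, passing the clock in as `now`); it is meant only to illustrate why the process is non-work conserving, not how any real device does it.

```python
class HoldBuffer:
    """Holds one packet per flow until a partner arrives or a timer expires."""

    def __init__(self, timeout: float):
        self.timeout = timeout
        self.held = {}  # flow_id -> (packet bytes, arrival time)

    def receive(self, flow_id, packet: bytes, now: float) -> list[bytes]:
        """Return the packets that can be forwarded right now.

        An empty list means the device is idle on this packet while holding it.
        """
        if flow_id in self.held:
            first, _ = self.held.pop(flow_id)
            return [first + packet]  # merge with the held packet and forward
        self.held[flow_id] = (packet, now)
        return []

    def expire(self, now: float) -> list[bytes]:
        """Give up on held packets whose partner never showed up."""
        stale = [f for f, (_, t) in self.held.items() if now - t >= self.timeout]
        return [self.held.pop(f)[0] for f in stale]
```

Even this minimal version needs per-flow memory and a timeout policy, and it does not touch reordering after forwarding or overlapping fragments.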

IMO, if the WG does pursue this, I believe a lot of the effort will be in specifying how reassembly in intermediate nodes works.

Tom

    Thanks - Fred

    *From:*Tom Herbert [mailto:t...@herbertland.com]
    *Sent:* Monday, July 11, 2022 1:34 PM
    *To:* Templin (US), Fred L <fred.l.temp...@boeing.com>
    *Cc:* Richard Li <richard...@futurewei.com>; Juan Carlos Zuniga
    (juzuniga) <juzuniga=40cisco....@dmarc.ietf.org>; int-area@ietf.org
    *Subject:* [EXTERNAL] Re: [Int-area] Call for WG adoption of
    draft-templin-intarea-parcels-10


        


    On Mon, Jul 11, 2022 at 12:22 PM Templin (US), Fred L
    <fred.l.temp...@boeing.com> wrote:

        Richard and others, thank you for these comments and for the
        ensuing discussion that

        took place over the time I was away on vacation. Strange how
        the timing hit when I

        was away from the office and off the grid - I was on a camping
        trip in Canada not far

        from where Steve Deering lives although I did not visit him.

        In any event, I was able to push out a new draft version ahead
        of the deadline that

        may address some (but likely not all) of your concerns:

        https://datatracker.ietf.org/doc/draft-templin-intarea-parcels/

        The major change is that the draft now talks about
        interactions with upper layer

        protocols including TCP and UDP, whereas the previous draft
        versions were silent

        regarding upper layer protocol framing.

        To others who have commented, I beg to differ and maintain
        that IP parcels do

        represent a significant improvement over the current state of
        affairs and over

        just regular IP jumbograms. In particular:

    Hi Fred, some comments in line.

        1) IP parcels make it so that the loss unit is a single
        segment instead of the entire

        packet/parcel, and loss of a segment often results in
        retransmission of just that

        segment instead of the entire packet/parcel.

    Yes, I agree if the packet is fragmented by the network then this
    is a nice feature. However, today we already have this from a host
    perspective property by just sending "small" packets.

        2) IP parcels are more efficient than sending a single segment
        per IP packet, since

        the parcel includes a single IP header plus single full
        {TCP,UDP} header for possibly

        many segments. This can result in significant savings in terms
        of bits over the wire

        for omitting unnecessary header bytes.

    I'm not sure the savings qualify as significant. 9K MTUs are
    becoming common in data centers and the standard TCP/IPv6 header
    is 80 bytes so that's already less than 1% overhead.

        Consider the postal service analogy; when

        many items can be sent together in a single package/parcel
        there is a large savings

        in shipping and handling costs than when each individual item
        is shipped separately.

    As I already mentioned, this is addressed by the BiGTCP work
    (https://lwn.net/Articles/884104). Sending or receiving
    multi-megabytes TCP segments in one system call is now feasible.
    Also, it's inevitable that NIC vendors will apply this also to be
    able to offload TCP jumbo grams. Given this is just software that
    doesn't require hardware change or on-the-wire protocols to
    change, it's immediately deployable with just a software change
    which is a huge benefit to datacenter operators.

        3) IP parcels improve large packet integrity by including a
        separate checksum for

        each segment instead of a single checksum for the entire packet.

    All modern NIC HW can deal with offloading a single checksum per
    packet, it's going to be a major effort for them to offload
    multiple checksums like IP parcels need. Without checksum offload,
    this would be a non-starter for a lot of deployments.

        This means that

        large parcels (up to a few MB) can be sent in one piece over
        links with sufficiently

        large MTU without requiring the link itself to provide strong
        integrity checks over

        the entire length of the parcel. This means that link MTUs
        significantly larger than

        9KB are now safely possible.

        4) IP parcels offer all of the efficiency advantages to upper
        layers as are offered

        by GSO/GRO, etc. but also provide benefits 1) through 3) above
        that are not

        offered by GSO/GRO.

    Most of this is doable in GSO/GRO.

        5) Plus, the idea is just plain neat. Better packaging is
        good. More efficient

        handling is good. Reduced header overhead is good. SAFE larger
        MTUs are

        good. The idea itself is good.

    I'm not convinced of that. For instance, I'm skeptical that
    intermediate devices trying to reassemble packets that aren't
    addressed to themselves could ever be robust or efficient (i.e.
    complexity, non-work conserving resource requirements, security
    issues with reassembly, multi-path that causes latency increase,
    potential DoS vector, etc.). Can you comment on this?

    Tom

        Fred

        *From:*Int-area [mailto:int-area-boun...@ietf.org] *On Behalf
        Of *Richard Li
        *Sent:* Friday, July 01, 2022 3:11 PM
        *To:* Juan Carlos Zuniga (juzuniga)
        <juzuniga=40cisco....@dmarc.ietf.org>
        *Cc:* int-area@ietf.org
        *Subject:* Re: [Int-area] Call for WG adoption of
        draft-templin-intarea-parcels-10

        Chairs and Authors,

        I always like every new idea and effort to improve the
        Internet performance, and thus I have read this draft with a
        great interest. The following are my
        observations/comments/questions. If they don’t make any sense
        to you, please accept my apology, and disregard them.

        1.The text “multiple upper layer protocol segments” is
        ambiguous. It seems that you really mean “multiple segments
        from ‘the same’ upper layer protocol”, doesn’t it? It seems
        that multiple segments from different upper layer protocols
        are not allowed in your parcel.

        2.Is the following a fair statement? All segments in the same
        packet come from the same application identified by the 5-tuple
        (source address, destination address, source port, destination
        port, protocol number).

        3.Segment size

        You require that their sizes be the same except for the last
        one. Is this required for easy implementation or what? Do you
        require it for any other reasons?

        4.TTL issue

        You described how parcels are forwarded over the Internetwork,
        and in particular you described what the ingress/egress
        middlebox does about parcels. I understand that the ingress
        middlebox may break the parcel into smaller ones, which may
        rejoin at the egress middlebox. My question is about TTL. As
        different smaller parcels may traverse along different paths,
        as a result their TTLs may be different when they reach the
        egress middlebox. How does the egress middlebox set up the
        TTL value? Please provide more descriptions.

        5.Reordering at the egress middlebox

        The parcels would arrive one after another, and therefore the
        egress middlebox would “wait” for a little bit to identify and
        pick up enough parcels/packets for their rejoining and
        repackaging. A description of the egress middlebox behavior
        would be useful and helpful, in particular I would like to
        know more about the waiting time if any, and how you deal with
        the reordering and loss.

        6.IPv4 option

        Does IETF still allow to change/add IPv4 option fields? I
        might be wrong, but aren’t they frozen? Also, do commercial
        routers still care about IPv4 options?

        7.IPv6 option

        This draft has defined a hop-by-hop option, it will require
        every intermediate IPv6 router to inspect this option. There
        have been some discussions on the pros/cons about Hop-by-Hop
        IPv6 Option. Is there any feedback from WG 6man?

        8.Parcel Path Qualification

        This draft has described a method for parcel path
        qualification probe from end to end. It is nice to have it,
        but it is unreliable simply for the following reason: a probe
        parcel goes along one specific path, and your real application
        parcels may take different paths.

        9.Integrity

        First paragraph of Section 7. More explanation/elaboration
        should be useful. I might have missed it in previous
        paragraphs, but if I do, please provide a reference to it such
        as “as described in …”.

        10.Implementation Status

        In section 10. TSO’s performance gain and Parcel’s gain should
        be regarded as two different things. Since this draft is
        adding a hop-by-hop option, every intermediate router is
        required to process the hop-by-hop option, which will,
        theoretically speaking, lead to performance downgrade. Of
        course, the whole performance would depend on many other
        factors, such as the total numbers of routing table lookups
        and number of segments.

        11.General observation

        This proposal essentially tries to solve a problem caused by
        MTU. If MTU be very big, one would simply put the whole data
        in a single packet. Since MTU is limited, a packet has to be
        cut into many smaller pieces (segments). In the existing
        specification, when an intermediate router sees a packet with
        its size larger than MTU, the router would be expected to
        fragment it so that the fragments could be forwarded. Here let
        me call it “fragmentation as needed”. In reality, however,
        some (if not all) commercial routers don’t do “fragmentation
        as needed”, instead of fragmenting the packet they simply
        discard it in order to achieve the wire-speed. This draft
        defines a new way to address the MTU issue: when a router sees
        a packet with its size larger than MTU, the router is asked to
        fragment it in a prescribed way (fragment it into pre-packaged
        segments). If I may, let me call it “fragmentation as
        prescribed”. Both “fragmentation as needed” and “fragmentation
        as prescribed” would require the support from intermediate
        routers. As with fragmentation as needed, fragmentation
        as prescribed may downgrade the performance of intermediate
        routers. What is more, intermediate routers/boxes may perform
        “rejoining and repackaging”, which will adversely impact the
        performance of the intermediate routers/boxes.
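
To illustrate what "fragmentation as prescribed" could look like, here is a small Python sketch that splits a parcel only at segment boundaries so that each piece fits a given MTU. The function name, the single fixed `header_len`, and the greedy grouping are all assumptions made for illustration; the draft may prescribe the split differently.

```python
def split_at_segments(
    segments: list[bytes], header_len: int, mtu: int
) -> list[list[bytes]]:
    """Greedily group whole segments into sub-parcels that fit within mtu."""
    groups: list[list[bytes]] = []
    current: list[bytes] = []
    size = header_len  # every sub-parcel carries its own headers
    for seg in segments:
        if header_len + len(seg) > mtu:
            raise ValueError("a single segment does not fit within the MTU")
        if current and size + len(seg) > mtu:
            groups.append(current)
            current, size = [], header_len
        current.append(seg)
        size += len(seg)
    if current:
        groups.append(current)
    return groups
```

For six 1440-octet segments with 60 octets of headers, a 4500-octet MTU yields two sub-parcels of three segments each, and a 1500-octet MTU yields six single-segment pieces. No segment is ever cut in the middle, which is the difference from "fragmentation as needed".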

        Best regards,

        Richard

        *From:*Int-area <int-area-boun...@ietf.org> *On Behalf Of
        *Juan Carlos Zuniga (juzuniga)
        *Sent:* Wednesday, June 22, 2022 12:25 PM
        *To:* int-area@ietf.org
        *Subject:* [Int-area] Call for WG adoption of
        draft-templin-intarea-parcels-10

        Dear IntArea WG,

        We are starting a 2-week call for adoption of the IP-Parcels
        draft:

        https://www.ietf.org/archive/id/draft-templin-intarea-parcels-10.html
        


        The document has been discussed for some time and it has
        received multiple comments.

        If you have an opinion on whether this document should be
        adopted by the IntArea WG please indicate it on the list by
        the end of Wednesday July 6th.

        Thanks,

        Juan-Carlos & Wassim

        (IntArea WG chairs)

        _______________________________________________
        Int-area mailing list
        Int-area@ietf.org
        https://www.ietf.org/mailman/listinfo/int-area

