Re: [Int-area] Call for WG adoption of draft-templin-intarea-parcels-10

Joel Halpern Tue, 12 Jul 2022 07:44:28 -0700

Fred, I understood full well that you only envision a small number of reassembly devices. After all, on any given path only one device will likely reassemble. Still, that device will be spending a lot of resources in a very expensive part of the path (fast path forwarding) to provide a small benefit to some hosts.

Fundamentally you are asking the architecture to spend those resources for a use case that you have not explained. "I have proof" is not relevant. Without knowing the scenarios and the assumptions, it does not help us to judge. It is worse than the case in the early days of the MANET working group where the competing proposal repeatedly said "my simulation shows ..."

Fundamentally, it is not the network's job to reassemble packets for a host. If you want NICs to do that, as Tom has said, that's fine. It is a private matter between the host and the NIC. But you are asking for functionality in the network.

I note also that you are assuming that hosts have links that support actual MTUs larger than 64K. I know of no link that has those properties in current use. (I am vaguely familiar with HIPPI and Fibre Channel. Neither appears to be relevant.)


Yours,

Joel

On 7/12/2022 10:02 AM, Templin (US), Fred L wrote:

Joel, you are misunderstanding what nodes would be involved in reassembly; this would not be at every single IP layer router in the path. It would only be at possibly 0, 1 or 2 adaptation layer middleboxes in the path from source to destination. And then, most likely only at a near-end middlebox very near the destination that happens to know the destination would prefer to receive larger parcels.

About segment size, I have proof that using segment sizes significantly larger than the path MTU can often produce dramatic performance increases even when fragmentation is intentionally invoked. I also have proof that packaging multiple segments in the same system call can drive performance even higher, and without reducing the segment size. IP parcels take it to the logical next step of allowing multiple segments to travel together in the same packet, which may or may not be subject to fragmentation and reassembly.

But, let’s not get so hung up on the middlebox question that we forget the benefits for end-to-end.

Fred

*From:*Joel Halpern [mailto:j...@joelhalpern.com]
*Sent:* Monday, July 11, 2022 4:02 PM
*To:* Templin (US), Fred L <fred.l.temp...@boeing.com>
*Cc:* int-area@ietf.org

*Subject:* Re: [Int-area] Re: Call for WG adoption of draft-templin-intarea-parcels-10



No, intermediate reassembly is not an optimization.

First, it is a bad idea. It is very painful for routers to perform reassembly. They have to burn expensive resources managing such attempted reassembly. It has major cost even if the router decides to give up and forward the pieces.

And second, unless one makes some unstated assumptions, in the absence of such reassembly the sending host will be throttled to the receiving host's rate. So the benefit of the entire system is markedly reduced.

Net: we should not adopt this draft.

Yours,

Joel

On 7/11/2022 6:41 PM, Templin (US), Fred L wrote:

Tom,

> Why would someone put six segments in a parcel if they already have a 9K link MTU?
> Why not just send one segment in 9K?

This is the mindset that we need to overcome. We have had it drilled into our heads that MSS must be the same as the path MTU, but it does not need to be that way. If the MSS is smaller than the path MTU, but we can send multiple segments in a single parcel that more closely approaches the size of the path MTU, then amortization savings are possible.

>The algorithm isn't the problem, it's supporting new protocols and multiple
>checksums in a packet in hardware.

But Tom, how hard can this be? Instead of running the Internet checksum 1 time over N octets of data simply run it M times over N/M octet chunks of the data in succession but still in a single pass. You spoke before of NICs adapting to support TCP jumbograms – if they can do that, why not a very straightforward application of Internet checksum? I haven’t looked at this in a long while, but isn’t this also similar to what UDP-lite did?
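To make the "M times over N/M octet chunks" idea concrete, here is a minimal software sketch. It computes the standard Internet checksum (ones' complement sum of 16-bit words) separately for each fixed-size chunk while walking the buffer once. The function names and the flat payload layout are assumptions for illustration only; a real TCP or UDP checksum also covers a pseudo-header, and this says nothing about how the draft encodes per-segment checksums or how hard the equivalent hardware offload would be.

```python
def internet_checksum(chunk: bytes) -> int:
    """Ones' complement Internet checksum over one chunk (no pseudo-header)."""
    if len(chunk) % 2:
        chunk += b"\x00"  # pad an odd-length chunk with a zero octet
    total = 0
    for i in range(0, len(chunk), 2):
        total += (chunk[i] << 8) | chunk[i + 1]  # big-endian 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def per_segment_checksums(payload: bytes, seg_len: int) -> list:
    """Walk the payload once, producing one checksum per seg_len-octet chunk.

    The last chunk may be shorter than seg_len.
    """
    return [
        internet_checksum(payload[off:off + seg_len])
        for off in range(0, len(payload), seg_len)
    ]


# One checksum over all 8 octets versus two checksums over 4-octet chunks.
data = bytes.fromhex("0001f203f4f5f6f7")
whole = internet_checksum(data)
parts = per_segment_checksums(data, 4)
```

The arithmetic per chunk is unchanged from the single-checksum case; what differs is that the accumulator is reset and a result emitted at each chunk boundary, which is the part Tom is saying has to be supported in hardware.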

> Either you're trivializing reassembly or maybe you're thinking of some new method that
> somehow avoids all the pitfalls and problems we've had with reassembly over the years!

Intermediate node parcel reassembly is really just an optimization to try to pass the largest possible parcels on to the next hop instead of passing many smaller ones. It is really just a concatenation of segments of sub-parcels belonging to the same original parcel. Reordering is unimportant – it is OK to concatenate sub-parcels 3,8,5,2 in that order and without even waiting for any other sub-parcels to show up. The application will simply perceive it as a case of network reordering and the upper layer protocol will do the correct thing with the sequence numbers. AFAICT, the only hard requirement is that the final sub-parcel must not be concatenated as an intermediate sub-parcel.
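A rough sketch of the concatenation rule described above, under stated assumptions: the `SubParcel` type, its fields, and the `concatenate` helper are hypothetical names invented for this illustration and are not taken from the draft. The sketch joins whatever sub-parcels have arrived in arrival order, with the single constraint that a final sub-parcel, if present, is placed last.

```python
from dataclasses import dataclass


@dataclass
class SubParcel:
    index: int              # position in the original parcel (hypothetical field)
    segments: list          # upper-layer segments carried by this sub-parcel
    is_final: bool = False  # True only for the last sub-parcel of the parcel


def concatenate(arrived):
    """Join sub-parcels in arrival order, keeping any final sub-parcel last.

    Returns (segments, ends_with_final). Intermediate sub-parcels are not
    sorted; any reordering is left to the upper layer protocol.
    """
    intermediates = [sp for sp in arrived if not sp.is_final]
    finals = [sp for sp in arrived if sp.is_final]
    ordered = intermediates + finals
    segments = [seg for sp in ordered for seg in sp.segments]
    return segments, bool(finals)


# Sub-parcels arrive out of order; index 8 happens to be the final one here.
arrived = [
    SubParcel(3, ["s3"]),
    SubParcel(8, ["s8"], is_final=True),
    SubParcel(5, ["s5"]),
    SubParcel(2, ["s2"]),
]
segments, ends_with_final = concatenate(arrived)
```

Note that this only covers the ordering rule. It does not model how long a node waits before combining, which is the question raised elsewhere in the thread.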

This stuff will all work, and it will work for the betterment of
the Internet.

Fred

*From:* Tom Herbert [mailto:t...@herbertland.com]
*Sent:* Monday, July 11, 2022 2:57 PM
*To:* Templin (US), Fred L <fred.l.temp...@boeing.com>
*Cc:* Richard Li <richard...@futurewei.com>; Juan Carlos Zuniga (juzuniga)
<juzuniga=40cisco....@dmarc.ietf.org>; int-area@ietf.org
*Subject:* Re: [EXTERNAL] Re: [Int-area] Call for WG adoption of
draft-templin-intarea-parcels-10


On Mon, Jul 11, 2022 at 2:20 PM Templin (US), Fred L
<fred.l.temp...@boeing.com> wrote:

Tom, some rejoinders:

>Yes, I agree if the packet is fragmented by the network then this is a nice feature.
>However, today we already have this property from a host perspective by just
>sending "small" packets.

It can be readily shown that some applications get much greater performance by sending larger packets that trigger fragmentation/reassembly than by sending smaller packets that do not. Performance increases of multiple orders of magnitude are indeed possible.

>I'm not sure the savings qualify as significant. 9K MTUs are
becoming common in data centers

>and the standard TCP/IPv6 header is 80 bytes so that's already
less than 1% overhead.

I think 9K is only a starting point, and IP parcels pave the
way to much larger link MTUs,

possibly even in excess of 64KB. And, doing the math, even for
just a 9K link sending a

single parcel that contains 6x 1440 octet segments would save
5 * 60 == 300 octets in

Why would someone put six segments in a parcel if they already
have a 9K link MTU? Why not just send one segment in 9K?

comparison with sending 6x 1500 octet packets with 60 octets
of IP/TCP headers per

packet. For links with larger MTUs, the savings for sending
parcels with lots of segments

(up to 64) becomes even greater.
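The two figures being traded here can be checked with a few lines of arithmetic. This is only a sketch under simplifying assumptions: "9K" is taken as 9000 octets, the parcel is assumed to carry exactly one full IP/TCP header, and any per-segment fields a parcel adds (such as per-segment checksums) are ignored. The function names are made up for illustration.

```python
def header_overhead_fraction(header_len: int, mtu: int) -> float:
    """Fraction of a full-sized packet consumed by headers."""
    return header_len / mtu


def parcel_header_savings(n_segments: int, header_len: int) -> int:
    """Header octets saved by sending n_segments under one header
    instead of one header per segment (per-segment parcel fields ignored)."""
    separate_packets = n_segments * header_len
    single_parcel = header_len
    return separate_packets - single_parcel


# Tom's figure: an 80-octet TCP/IPv6 header on a 9K MTU link.
overhead = header_overhead_fraction(80, 9000)

# Fred's figure: six 1440-octet segments with 60 octets of IP/TCP headers each.
saved = parcel_header_savings(6, 60)  # 5 * 60 == 300
```

Both numbers are consistent with what each side states: the header share of a single 9K packet is under 1%, and the six-segment parcel saves 300 header octets relative to six 1500-octet packets. The disagreement is over whether that saving is significant, not over the arithmetic.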

>As I already mentioned, this is addressed by the BiGTCP work (https://lwn.net/Articles/884104).
>Sending or receiving multi-megabyte TCP segments in one system call is now feasible. Also, it's
>inevitable that NIC vendors will apply this also to be able to offload TCP jumbograms. Given this
>is just software that doesn't require hardware change or on-the-wire protocols to change, it's
>immediately deployable with just a software change which is a huge benefit to datacenter operators.

As I have said, IP parcels has the same advantage within the
host system-call (user-space

to kernel-space) context. But, IP parcels goes a step further
to provide efficient packaging

over-the-wire, whereas the approach you are referring to opens
the box inside the

kernel and sends individual packets instead of aggregates.

>All modern NIC HW can deal with offloading a single checksum per packet; it's going to be
>a major effort for them to offload multiple checksums as IP parcels need. Without checksum
>offload, this would be a non-starter for a lot of deployments.

Check the latest spec (now at -12 and likely to stay that way until IETF114). Any H/W checksum that can run over the first segment of a packet should be able to run over the N-1 additional segments of the same packet (parcel) by applying the very familiar Internet checksum algorithm.

The algorithm isn't the problem, it's supporting new protocols and
multiple checksums in a packet in hardware.

>I'm not convinced of that. For instance, I'm skeptical that
intermediate devices trying to reassemble

>packets that aren't addressed to themselves could ever be robust or
efficient (i.e. complexity, non-work

>conserving resource requirements, security issues with reassembly,
multi-path that causes latency

>increase, potential DoS vector, etc.). Can you comment on this?

Perhaps what is confusing this matter is that the intermediate
devices referred to

here most certainly do not refer to all routers in the path.
Instead, what is intended

here is an OMNI intermediate device, of which there may be
something on the order

of 0, 1, or 2 of them on the path between the OMNI source and
destination even

though there may be many 10’s or even 100’s of ordinary IP
routers on the path.

And, again, this is not a strict reassembly case – instead, it
is an opportunistic

“combine if convenient; else forward” swift decision.

Either you're trivializing reassembly or maybe you're thinking of
some new method that somehow avoids all the pitfalls and problems
we've had with reassembly over the years! Consider that many NIC
vendors have tried, and largely failed, to get any sort of device
reassembly widely deployed (e.g. IP reassembly, TCP segmentation
reassembly, etc.). The reason they failed is because they can't
give the host stack transparency and control over the reassembly
process.

By its nature reassembly can only be done with at least two packets.
That means a device performing reassembly has to receive one
packet, hold it, and wait for the following packet to perform
reassembly. That makes reassembly, unlike fragmentation, a
non-work conserving process. Many issues and policies arise from
this. For instance, what happens if a packet is held and the
following packet is never seen? (usually implies a reassembly
timer). What happens if a packet is received OOO and is already
forwarded, but the preceding packet is then received, do we try to
reassemble that one? (the solution here seems to be to maintain
some sort of flow state)? What about overlapping fragments and the
security issues around that?
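The hold-and-wait behavior described above can be sketched as a toy model. Everything here is hypothetical and invented for illustration: the `HoldBuffer` class, the two-piece pairing rule, and the caller-supplied clock. It shows only the two effects named in the text, that the first piece is held rather than forwarded and that a timer is needed for pieces whose partner never arrives. It does not address out-of-order arrival after forwarding, overlapping fragments, or flow state.

```python
from dataclasses import dataclass, field


@dataclass
class HoldBuffer:
    """Toy two-piece reassembler keyed by a flow/parcel identifier."""
    timeout: float
    held: dict = field(default_factory=dict)  # key -> (piece, arrival_time)

    def receive(self, key, piece: bytes, now: float) -> list:
        """Return the units that can be forwarded right now (possibly none)."""
        if key in self.held:
            first, _ = self.held.pop(key)
            return [first + piece]      # partner arrived: forward the combined unit
        self.held[key] = (piece, now)   # nothing is forwarded while we wait
        return []

    def expire(self, now: float) -> list:
        """Give up on pieces held for at least `timeout`; forward them as-is."""
        expired = [k for k, (_, t) in self.held.items() if now - t >= self.timeout]
        return [self.held.pop(k)[0] for k in expired]


buf = HoldBuffer(timeout=1.0)
first_result = buf.receive("flow-a", b"ab", now=0.0)   # held, nothing forwarded
second_result = buf.receive("flow-a", b"cd", now=0.5)  # combined and forwarded
orphan_result = buf.receive("flow-b", b"zz", now=1.0)  # partner never arrives
late_flush = buf.expire(now=2.0)                       # timer forwards the orphan
```

Even in this toy form the device stores state per flow and delays the first piece by up to the timeout, which is the non-work-conserving cost being pointed out.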

IMO, if the WG does pursue this, I believe a lot of the effort
will be in specifying how reassembly in intermediate nodes works.

Tom

Thanks - Fred

*From:*Tom Herbert [mailto:t...@herbertland.com]
*Sent:* Monday, July 11, 2022 1:34 PM
*To:* Templin (US), Fred L <fred.l.temp...@boeing.com>
*Cc:* Richard Li <richard...@futurewei.com>; Juan Carlos
Zuniga (juzuniga) <juzuniga=40cisco....@dmarc.ietf.org>;
int-area@ietf.org
*Subject:* [EXTERNAL] Re: [Int-area] Call for WG adoption of
draft-templin-intarea-parcels-10


On Mon, Jul 11, 2022 at 12:22 PM Templin (US), Fred L
<fred.l.temp...@boeing.com> wrote:

Richard and others, thank you for these comments and for
the ensuing discussion that

took place over the time I was away on vacation. Strange
how the timing hit when I

was away from the office and off the grid - I was on a
camping trip in Canada not far

from where Steve Deering lives although I did not visit him.

In any event, I was able to push out a new draft version
ahead of the deadline that

may address some (but likely not all) of your concerns:

https://datatracker.ietf.org/doc/draft-templin-intarea-parcels/

The major change is that the draft now talks about
interactions with upper layer

protocols including TCP and UDP, whereas the previous
draft versions were silent

regarding upper layer protocol framing.

To others who have commented, I beg to differ and maintain
that IP parcels do

represent a significant improvement over the current state
of affairs and over

just regular IP jumbograms. In particular:

Hi Fred, some comments in line.

1) IP parcels make it so that the loss unit is a single
segment instead of the entire

packet/parcel, and loss of a segment often results in
retransmission of just that

segment instead of the entire packet/parcel.

Yes, I agree if the packet is fragmented by the network then
this is a nice feature. However, today we already have this
property from a host perspective by just sending "small" packets.

2) IP parcels are more efficient than sending a single
segment per IP packet, since

the parcel includes a single IP header plus single full
{TCP,UDP} header for possibly

many segments. This can result in significant savings in
terms of bits over the wire

for omitting unnecessary header bytes.

I'm not sure the savings qualify as significant. 9K MTUs are
becoming common in data centers and the standard TCP/IPv6
header is 80 bytes so that's already less than 1% overhead.

Consider the postal service analogy; when many items can be sent together in a single package/parcel there is a large savings in shipping and handling costs compared with when each individual item is shipped separately.

As I already mentioned, this is addressed by the BiGTCP work
(https://lwn.net/Articles/884104). Sending or receiving
multi-megabyte TCP segments in one system call is now
feasible. Also, it's inevitable that NIC vendors will apply
this also to be able to offload TCP jumbograms. Given this is
just software that doesn't require hardware change or
on-the-wire protocols to change, it's immediately deployable
with just a software change which is a huge benefit to
datacenter operators.

3) IP parcels improve large packet integrity by including
a separate checksum for

each segment instead of a single checksum for the entire
packet.

All modern NIC HW can deal with offloading a single checksum
per packet; it's going to be a major effort for them to
offload multiple checksums as IP parcels need. Without
checksum offload, this would be a non-starter for a lot of
deployments.

This means that

large parcels (up to a few MB) can be sent in one piece
over links with sufficiently

large MTU without requiring the link itself to provide
strong integrity checks over

the entire length of the parcel. This means that link MTUs
significantly larger than

9KB are now safely possible.

4) IP parcels offer all of the efficiency advantages to
upper layers as are offered

by GSO/GRO, etc. but also provide benefits 1) through 3)
above that are not

offered by GSO/GRO.

Most of this is doable in GSO/GRO.

5) Plus, the idea is just plain neat. Better packaging is
good. More efficient

handling is good. Reduced header overhead is good. SAFE
larger MTUs are

good. The idea itself is good.

I'm not convinced of that. For instance, I'm skeptical that
intermediate devices trying to reassemble packets that aren't
addressed to themselves could ever be robust or efficient
(i.e. complexity, non-work conserving resource requirements,
security issues with reassembly, multi-path that causes
latency increase, potential DoS vector, etc.). Can you comment
on this?

Tom

Fred

*From:*Int-area [mailto:int-area-boun...@ietf.org] *On
Behalf Of *Richard Li
*Sent:* Friday, July 01, 2022 3:11 PM
*To:* Juan Carlos Zuniga (juzuniga)
<juzuniga=40cisco....@dmarc.ietf.org>
*Cc:* int-area@ietf.org
*Subject:* Re: [Int-area] Call for WG adoption of
draft-templin-intarea-parcels-10

Chairs and Authors,

I always like every new idea and effort to improve the
Internet performance, and thus I have read this draft with
a great interest. The following are my
observations/comments/questions. If they don’t make any
sense to you, please accept my apology, and disregard them.

1.The text “multiple upper layer protocol segments” is
ambiguous. It seems that you really mean “multiple
segments from ‘the same’ upper layer protocol”, doesn’t
it? It seems that multiple segments from different upper
layer protocols are not allowed in your parcel.

2.Is the following a fair statement? All segments in the
same packet come from the same application identified by
the 5-tuple (source address, destination address, source
port, destination port, protocol number).

3.Segment size

You require that their sizes be the same except for the
last one. Is this required for easy implementation or
what? Do you require it for any other reasons?

4.TTL issue

You described how parcels are forwarded over the
Internetwork, and in particular you described what the
ingress/egress middlebox does about parcels. I understand
that the ingress middlebox may break the parcel into
smaller ones, which may rejoin at the egress middlebox. My
question is about TTL. As different smaller parcels may
traverse along different paths, as a result their TTLs may
be different when they reach the egress middlebox . How
does the egress middlebox set up the TTL value? Please
provide more descriptions.

5.Reordering at the egress middlebox

The parcels would arrive one after another, and therefore
the egress middlebox would “wait” for a little bit to
identify and pick up enough parcels/packets for their
rejoining and repackaging. A description of the egress
middlebox behavior would be useful and helpful, in
particular I would like to know more about the waiting
time if any, and how you deal with the reordering and loss.

6.IPv4 option

Does the IETF still allow changing or adding IPv4 option fields? I
might be wrong, but aren’t they frozen? Also, do
commercial routers still care about IPv4 options?

7.IPv6 option

This draft has defined a hop-by-hop option, which will
require every intermediate IPv6 router to inspect this
option. There have been some discussions on the pros/cons
of the Hop-by-Hop IPv6 Option. Is there any feedback from
the 6man WG?

8.Parcel Path Qualification

This draft has described a method for parcel path
qualification probe from end to end. It is nice to have
it, but it is unreliable simply for the following reason:
a probe parcel goes along one specific path, and your real
application parcels may take different paths.

9.Integrity

First paragraph of Section 7. More explanation/elaboration
should be useful. I might have missed it in previous
paragraphs, but if I do, please provide a reference to it
such as “as described in …”.

10.Implementation Status

In section 10. TSO’s performance gain and Parcel’s gain
should be regarded as two different things. Since this
draft is adding a hop-by-hop option, every intermediate
router is required to process the hop-by-hop option, which
will, theoretically speaking, lead to performance
downgrade. Of course, the whole performance would depend
on many other factors, such as the total numbers of
routing table lookups and number of segments.

11.General observation

This proposal essentially tries to solve a problem caused
by the MTU. If the MTU were very large, one would simply put the whole
data in a single packet. Since the MTU is limited, a packet
has to be cut into many smaller pieces (segments). In the
existing specification, when an intermediate router sees a
packet with its size larger than MTU, the router would be
expected to fragment it so that the fragments could be
forwarded. Here let me call it “fragmentation as needed”.
In reality, however, some (if not all) commercial routers
don’t do “fragmentation as needed”, instead of fragmenting
the packet they simply discard it in order to achieve the
wire-speed. This draft defines a new way to address the
MTU issue: when a router sees a packet with its size
larger than MTU, the router is asked to fragment it in a
prescribed way (fragment it into pre-packaged segments).
If I may, let me call it “fragmentation as prescribed”.
Both “fragmentation as needed” and “fragmentation as
prescribed” would require support from intermediate
routers. Like fragmentation as needed,
fragmentation as prescribed may degrade the performance
of intermediate routers. What is more, intermediate
routers/boxes may perform “rejoining and repackaging”,
which will adversely impact the performance of the
intermediate routers/boxes.

Best regards,

Richard

*From:*Int-area <int-area-boun...@ietf.org> *On Behalf Of
*Juan Carlos Zuniga (juzuniga)
*Sent:* Wednesday, June 22, 2022 12:25 PM
*To:* int-area@ietf.org
*Subject:* [Int-area] Call for WG adoption of
draft-templin-intarea-parcels-10

Dear IntArea WG,

We are starting a 2-week call for adoption of the
IP-Parcels draft:

https://www.ietf.org/archive/id/draft-templin-intarea-parcels-10.html


The document has been discussed for some time and it has
received multiple comments.

If you have an opinion on whether this document should be
adopted by the IntArea WG please indicate it on the list
by the end of Wednesday July 6th.

Thanks,

Juan-Carlos & Wassim

(IntArea WG chairs)

_______________________________________________
Int-area mailing list
Int-area@ietf.org
https://www.ietf.org/mailman/listinfo/int-area

