Re: [Int-area] Call for WG adoption of draft-templin-intarea-parcels-10

Templin (US), Fred L Tue, 12 Jul 2022 08:24:20 -0700

Joel, I can show you an orders-of-magnitude performance speed-up when I send
large blocks of data using larger segment sizes that invoke fragmentation and
reassembly. I can also show a significant speed-up when system calls pass
multiple larger segments in a single system call instead of one at a time. This
is on real systems with real data, and not in simulations.

About links with larger MTUs, I am specifically NOT saying that we need to wait
until we have links with MTU>64K. What I am saying is that parcels would pave
the way toward evolution of links with larger MTUs than what we have in
current practice, allowing a path forward for future innovation. But, parcels are
still good even for the smallish MTUs in widescale deployment today.

Fred

From: Joel Halpern [mailto:jmh.dir...@joelhalpern.com]
Sent: Tuesday, July 12, 2022 7:44 AM
To: Templin (US), Fred L <fred.l.temp...@boeing.com>
Cc: int-area@ietf.org
Subject: [EXTERNAL] Re: [Int-area] Re: Call for WG adoption of 
draft-templin-intarea-parcels-10

EXT email: be mindful of links/attachments.

Fred, I understood full well that you only envision a small number of 
reassembly devices.  After all, on any given path only one device will likely 
reassemble.  Still, that device will be spending a lot of resources in a very 
expensive part of the path (fast path forwarding) to provide a small benefit to 
some hosts.

Fundamentally you are asking the architecture to spend those resources for a use 
case that you have not explained.  "I have proof" is not relevant.  Without 
knowing the scenarios and the assumptions, it does not help us to judge.  It is 
worse than the case in the early days of the MANET working group where the 
competing proposal repeatedly said "my simulation shows ..."

Fundamentally, it is not the network's job to reassemble packets for a host.  
If you want NICs to do that, as Tom has said, that's fine.  It is a private 
matter between the host and the NIC.  But you are asking for functionality in 
the network.

I note also that you are assuming that hosts have links that support actual 
MTUs larger than 64K.  I know of no link that has those properties in current 
use.  (I am vaguely familiar with HIPPI and Fibre Channel.  Neither appears to 
be relevant.)

Yours,

Joel
On 7/12/2022 10:02 AM, Templin (US), Fred L wrote:
Joel, you are misunderstanding what nodes would be involved in reassembly; this
would not be at every single IP layer router in the path. It would only be at
possibly 0, 1 or 2 adaptation layer middleboxes in the path from source to
destination. And, then most likely only at a near-end middlebox very near the
destination that happens to know the destination would prefer to receive larger
parcels.

About segment size, I have proof that using segment sizes significantly larger
than the path MTU can often produce dramatic performance increases even when
fragmentation is intentionally invoked. I also have proof that packaging
multiple segments in the same system call can drive performance even higher and
without reducing the segment size. IP parcels takes the logical next step of
allowing multiple segments to travel together in the same packet, which may or
may not be subject to fragmentation and reassembly. But, let’s not get so hung
up on the middlebox question that we forget the benefits for end-to-end.

Fred

From: Joel Halpern [mailto:j...@joelhalpern.com]
Sent: Monday, July 11, 2022 4:02 PM
To: Templin (US), Fred L <fred.l.temp...@boeing.com>
Cc: int-area@ietf.org
Subject: Re: [Int-area] Re: Call for WG adoption of 
draft-templin-intarea-parcels-10

No, intermediate reassembly is not an optimization.

First, it is a bad idea.  It is very painful for routers to perform reassembly.
They have to burn expensive resources managing such attempted reassembly.
It has major cost even if the router decides to give up and forward the pieces.

And second, unless one makes some unstated assumptions, in the absence of such 
reassembly the sending host will be throttled to the receiving host rate.  So 
the benefit of the entire system is markedly reduced.

Net: we should not adopt this draft.

Yours,

Joel
On 7/11/2022 6:41 PM, Templin (US), Fred L wrote:
Tom,

> Why would someone put six segments in a parcel if they already have a 9K link
> MTU? Why not just send one segment in 9K?

This is the mindset that we need to overcome. We have had it drilled into our
heads that MSS must be the same as the path MTU, but it does not need to be that
way. If the MSS is smaller than the path MTU, but we can send multiple segments
in a single parcel that more closely approaches the size of the path MTU, then
amortization savings are possible.

>The algorithm isn't the problem, it's supporting new protocols and multiple
>checksums in a packet in hardware.

But Tom, how hard can this be? Instead of running the Internet checksum 1 time
over N octets of data simply run it M times over N/M octet chunks of the data in
succession but still in a single pass. You spoke before of NICs adapting to
support TCP jumbograms – if they can do that, why not a very straightforward
application of Internet checksum? I haven’t looked at this in a long while, but
isn’t this also similar to what UDP-lite did?
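
A minimal software sketch of the "run it M times over N/M octet chunks" idea,
assuming the ordinary RFC 1071 Internet checksum applied independently to each
fixed-size segment. The function names and the segment_size parameter are
illustrative only and are not taken from the draft; the sketch also says
nothing about how costly the equivalent would be in NIC hardware.

```python
from typing import List


def internet_checksum(data: bytes) -> int:
    """RFC 1071 checksum: one's-complement of the one's-complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero octet
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def per_segment_checksums(payload: bytes, segment_size: int) -> List[int]:
    """Walk the payload once, producing one checksum per segment_size chunk."""
    return [
        internet_checksum(payload[off:off + segment_size])
        for off in range(0, len(payload), segment_size)
    ]
```

For example, per_segment_checksums(payload, 1440) over an 8640-octet payload
would yield six independent checksums rather than one.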

> Either you're trivializing reassembly or maybe you're thinking of some new
> method that somehow avoids all the pitfalls and problems we've had with
> reassembly over the years!

Intermediate node parcel reassembly is really just an optimization to try to
pass the largest possible parcels on to the next hop instead of passing many
smaller ones. It is really just a concatenation of segments of sub-parcels
belonging to the same original parcel. Reordering is unimportant – it is OK to
concatenate sub-parcels 3,8,5,2 in that order and without even waiting for any
other sub-parcels to show up. The application will simply perceive it as a case
of network reordering and the upper layer protocol will do the correct thing
with the sequence numbers. AFAICT, the only hard requirement is that the final
sub-parcel must not be concatenated as an intermediate sub-parcel.
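
The concatenation rule described above could be sketched roughly as follows.
This is only an illustration under stated assumptions: the SubParcel type, its
fields, and the concatenate function are hypothetical and do not come from the
draft, and "final sub-parcel" is modelled as a simple flag.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SubParcel:
    index: int               # hypothetical label for position in the original parcel
    segments: List[bytes]    # upper layer protocol segments carried by this sub-parcel
    is_final: bool = False   # True if this sub-parcel carries the parcel's last segment


def concatenate(arrived: List[SubParcel]) -> List[bytes]:
    """Concatenate sub-parcels in arrival order, keeping any final one at the end."""
    body = [sp for sp in arrived if not sp.is_final]
    tail = [sp for sp in arrived if sp.is_final]
    merged: List[bytes] = []
    for sp in body + tail:
        merged.extend(sp.segments)
    return merged
```

Sub-parcels arriving as 3,8,5,2 are merged in exactly that order; only a
sub-parcel flagged as final is moved, so that it never lands in the middle.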

This stuff will all work, and it will work for the betterment of the Internet.

Fred

From: Tom Herbert [mailto:t...@herbertland.com]
Sent: Monday, July 11, 2022 2:57 PM
To: Templin (US), Fred L <fred.l.temp...@boeing.com>
Cc: Richard Li <richard...@futurewei.com>; Juan Carlos Zuniga (juzuniga)
<juzuniga=40cisco....@dmarc.ietf.org>; int-area@ietf.org
Subject: Re: [EXTERNAL] Re: [Int-area] Call for WG adoption of 
draft-templin-intarea-parcels-10

On Mon, Jul 11, 2022 at 2:20 PM Templin (US), Fred L
<fred.l.temp...@boeing.com> wrote:
Tom, some rejoinders:

>Yes, I agree if the packet is fragmented by the network then this is a nice
>feature. However, today we already have this from a host perspective property
>by just sending "small" packets.

It can be readily shown that some applications get much greater performance by
sending larger packets that trigger fragmentation/reassembly than by sending
smaller packets that do not. Multiple order of magnitude performance increases
are indeed possible.

>I'm not sure the savings qualify as significant. 9K MTUs are becoming common
>in data centers and the standard TCP/IPv6 header is 80 bytes so that's already
>less than 1% overhead.

I think 9K is only a starting point, and IP parcels pave the way to much larger
link MTUs, possibly even in excess of 64KB. And, doing the math, even for just a
9K link sending a single parcel that contains 6x 1440 octet segments would save
5 * 60 == 300 octets in

Why would someone put six segments in a parcel if they already have a 9K link 
MTU? Why not just send one segment in 9K?

comparison with sending 6x 1500 octet packets with 60 octets of IP/TCP headers
per packet. For links with larger MTUs, the savings for sending parcels with
lots of segments (up to 64) becomes even greater.
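
The arithmetic in this exchange can be checked with a short sketch. It assumes,
as the figures above do, that a parcel carries exactly one 60-octet IP/TCP
header and no additional per-segment fields; the function names are
illustrative, and the 80-octet / 9000-octet case is the one from the quoted
"less than 1% overhead" remark.

```python
def bytes_on_wire(num_segments: int, segment_size: int, header_size: int,
                  one_header_per_segment: bool) -> int:
    """Total octets sent for num_segments segments, with or without a header on each."""
    if one_header_per_segment:
        return num_segments * (header_size + segment_size)
    return header_size + num_segments * segment_size


def header_overhead(header_size: int, packet_size: int) -> float:
    """Fraction of a packet occupied by headers."""
    return header_size / packet_size


separate = bytes_on_wire(6, 1440, 60, one_header_per_segment=True)   # 6 x 1500 = 9000
parcel = bytes_on_wire(6, 1440, 60, one_header_per_segment=False)    # 60 + 8640 = 8700
savings = separate - parcel                                          # 5 * 60 = 300
```

Under these assumptions the saving is 300 octets out of 9000, roughly 3.3%,
while a single 80-octet header in a 9000-octet packet is under 1%.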

>As I already mentioned, this is addressed by the BiGTCP work
>(https://lwn.net/Articles/884104). Sending or receiving multi-megabytes TCP
>segments in one system call is now feasible. Also, it's inevitable that NIC
>vendors will apply this also to be able to offload TCP jumbo grams. Given this
>is just software that doesn't require hardware change or on-the-wire protocols
>to change, it's immediately deployable with just a software change which is a
>huge benefit to datacenter operators.

As I have said, IP parcels has the same advantage within the host system-call
(user-space to kernel-space) context. But, IP parcels goes a step further to
provide efficient packaging over-the-wire, whereas the approach you are
referring to opens the box inside the kernel and sends individual packets
instead of aggregates.

>All modern NIC HW can deal with offloading a single checksum per packet, it's
>going to be a major effort for them to offload multiple checksum like IP
>parcels needs. Without checksum offload, this would be a non-starter for a lot
>of deployments.

Check the latest spec (now at -12 and likely to stay that way until IETF114).
Any H/W checksum that can run over the first segment of a packet should be
possible to make run over the N-1 additional segments of the same packet
(parcel) by applying the very familiar Internet checksum algorithm.

The algorithm isn't the problem, it's supporting new protocols and multiple 
checksums in a packet in hardware.

>I'm not convinced of that. For instance, I'm skeptical that intermediate
>devices trying to reassemble packets that aren't addressed to themselves could
>ever be robust or efficient (i.e. complexity, non-work conserving resource
>requirements, security issues with reassembly, multi-path that causes latency
>increase, potential DoS vector, etc.). Can you comment on this?

Perhaps what is confusing this matter is that the intermediate devices referred
to here most certainly do not refer to all routers in the path. Instead, what is
intended here is an OMNI intermediate device, of which there may be something on
the order of 0, 1, or 2 of them on the path between the OMNI source and
destination even though there may be many 10’s or even 100’s of ordinary IP
routers on the path. And, again, this is not a strict reassembly case – instead,
it is an opportunistic “combine if convenient; else forward” swift decision.

Either you're trivializing reassembly or maybe you're thinking of some new 
method that somehow avoids all the pitfalls and problems we've had with 
reassembly over the years! Consider that many NIC vendors have tried, and 
largely failed, to get any sort of device reassembly widely deployed (e.g. IP 
reassembly, TCP segmentation reassembly, etc.). The reason they failed is 
because they can't give the host stack transparency and control over the 
reassembly process.

By its nature reassembly can only be done with at least two packets. That means a 
device performing reassembly has to receive one packet, hold it, and wait for 
the following packet to perform reassembly. That makes reassembly, unlike 
fragmentation, a non-work conserving process. Many issues and policies arise 
from this. For instance, what happens if a packet is held and the following 
packet is never seen? (usually implies a reassembly timer). What happens if a 
packet is received OOO and is already forwarded, but the preceding packet is 
then received, do we try to reassemble that one? (the solution here seems to be 
to maintain some sort of flow state)? What about overlapping fragments and the 
security issues around that?
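
To make the "hold and wait" cost concrete, here is a toy model of such a device.
It is a sketch under simplifying assumptions, not a description of any real
implementation: the HoldBuffer class, its per-flow key, and the explicit clock
argument are all hypothetical, and overlap handling and out-of-order arrival
are deliberately left out.

```python
from typing import Dict, Hashable, List, Tuple


class HoldBuffer:
    """Hold one packet per flow until a partner arrives or a timer expires."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        # Per-flow state: arrival time and contents of the held packet.
        self.held: Dict[Hashable, Tuple[float, bytes]] = {}

    def receive(self, flow: Hashable, packet: bytes, now: float) -> List[bytes]:
        """Return packets ready to forward; an empty list means the packet is held."""
        if flow in self.held:
            _, first = self.held.pop(flow)
            return [first + packet]  # combine with the packet that was waiting
        self.held[flow] = (now, packet)
        return []

    def expire(self, now: float) -> List[bytes]:
        """Give up on packets held at least `timeout` and forward them uncombined."""
        stale = [f for f, (t, _) in self.held.items() if now - t >= self.timeout]
        return [self.held.pop(f)[1] for f in stale]
```

Even this minimal version needs per-flow state and a timer, and the first packet
of each pair is delayed until its partner or the timeout arrives, which is the
non-work conserving behaviour described above.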

IMO, if the WG does pursue this, I believe a lot of the effort will be in 
specifying how reassembly in intermediate nodes works.

Tom

Thanks - Fred

From: Tom Herbert [mailto:t...@herbertland.com]
Sent: Monday, July 11, 2022 1:34 PM
To: Templin (US), Fred L <fred.l.temp...@boeing.com>
Cc: Richard Li <richard...@futurewei.com>; Juan Carlos Zuniga (juzuniga)
<juzuniga=40cisco....@dmarc.ietf.org>; int-area@ietf.org
Subject: [EXTERNAL] Re: [Int-area] Call for WG adoption of 
draft-templin-intarea-parcels-10

On Mon, Jul 11, 2022 at 12:22 PM Templin (US), Fred L
<fred.l.temp...@boeing.com> wrote:
Richard and others, thank you for these comments and for the ensuing discussion
that took place over the time I was away on vacation. Strange how the timing hit
when I was away from the office and off the grid - I was on a camping trip in
Canada not far from where Steve Deering lives although I did not visit him.

In any event, I was able to push out a new draft version ahead of the deadline
that may address some (but likely not all) of your concerns:

https://datatracker.ietf.org/doc/draft-templin-intarea-parcels/

The major change is that the draft now talks about interactions with upper layer
protocols including TCP and UDP, whereas the previous draft versions were silent
regarding upper layer protocol framing.

To others who have commented, I beg to differ and maintain that IP parcels do
represent a significant improvement over the current state of affairs and over
just regular IP jumbograms. In particular:

Hi Fred, some comments in line.

1) IP parcels make it so that the loss unit is a single segment instead of the
entire packet/parcel, and loss of a segment often results in retransmission of
just that segment instead of the entire packet/parcel.

Yes, I agree if the packet is fragmented by the network then this is a nice 
feature. However, today we already have this from a host perspective property 
by just sending "small" packets.

2) IP parcels are more efficient than sending a single segment per IP packet,
since the parcel includes a single IP header plus single full {TCP,UDP} header
for possibly many segments. This can result in significant savings in terms of
bits over the wire for omitting unnecessary header bytes.

I'm not sure the savings qualify as significant. 9K MTUs are becoming common in 
data centers and the standard TCP/IPv6 header is 80 bytes so that's already 
less than 1% overhead.

Consider the postal service analogy; when many items can be sent together in a
single package/parcel there is a large savings in shipping and handling costs
compared to when each individual item is shipped separately.

As I already mentioned, this is addressed by the BiGTCP work 
(https://lwn.net/Articles/884104). Sending or receiving multi-megabytes TCP 
segments in one system call is now feasible. Also, it's inevitable that NIC 
vendors will apply this also to be able to offload TCP jumbo grams. Given this 
is just software that doesn't require hardware change or on-the-wire protocols 
to change, it's immediately deployable with just a software change which is a 
huge benefit to datacenter operators.

3) IP parcels improve large packet integrity by including a separate checksum
for each segment instead of a single checksum for the entire packet.

All modern NIC HW can deal with offloading a single checksum per packet, it's 
going to be a major effort for them to offload multiple checksum like IP 
parcels needs. Without checksum offload, this would be a non-starter for a lot 
of deployments.

This means that large parcels (up to a few MB) can be sent in one piece over
links with sufficiently large MTU without requiring the link itself to provide
strong integrity checks over the entire length of the parcel. This means that
link MTUs significantly larger than 9KB are now safely possible.

4) IP parcels offer all of the efficiency advantages to upper layers as are
offered by GSO/GRO, etc. but also provide benefits 1) through 3) above that are
not offered by GSO/GRO.

Most of this is doable in GSO/GRO.

5) Plus, the idea is just plain neat. Better packaging is good. More efficient
handling is good. Reduced header overhead is good. SAFE larger MTUs are
good. The idea itself is good.

I'm not convinced of that. For instance, I'm skeptical that intermediate 
devices trying to reassemble packets that aren't addressed to themselves could 
ever be robust or efficient (i.e. complexity, non-work conserving resource 
requirements, security issues with reassembly, multi-path that causes latency 
increase, potential DoS vector, etc.). Can you comment on this?

Tom

Fred

From: Int-area [mailto:int-area-boun...@ietf.org] On Behalf Of Richard Li
Sent: Friday, July 01, 2022 3:11 PM
To: Juan Carlos Zuniga (juzuniga) <juzuniga=40cisco....@dmarc.ietf.org>
Cc: int-area@ietf.org
Subject: Re: [Int-area] Call for WG adoption of draft-templin-intarea-parcels-10

Chairs and Authors,

I always like every new idea and effort to improve Internet performance,
and thus I have read this draft with great interest. The following are my
observations/comments/questions. If they don’t make any sense to you, please
accept my apology, and disregard them.

1.      The text “multiple upper layer protocol segments” is ambiguous. It 
seems that you really mean “multiple segments from ‘the same’ upper layer 
protocol”, doesn’t it? It seems that multiple segments from different upper 
layer protocols are not allowed in your parcel.

2.      Is the following a fair statement? All segments in the same packet come 
from the same application identified by the 5-tuple (source address, destination 
address, source port, destination port, protocol number).

3.      Segment size
You require that their sizes be the same except for the last one. Is this 
required for easy implementation or what? Do you require it for any other 
reasons?

4.      TTL issue
You described how parcels are forwarded over the Internetwork, and in 
particular you described what the ingress/egress middlebox does about parcels. 
I understand that the ingress middlebox may break the parcel into smaller ones, 
which may rejoin at the egress middlebox. My question is about TTL. Since
different smaller parcels may traverse different paths, their TTLs may differ
when they reach the egress middlebox. How does the egress middlebox set the TTL
value? Please provide more description.

5.      Reordering at the egress middlebox
The parcels would arrive one after another, and therefore the egress middlebox 
would “wait” for a little bit to identify and pick up enough parcels/packets 
for their rejoining and repackaging. A description of the egress middlebox 
behavior would be useful and helpful, in particular I would like to know more 
about the waiting time if any, and how you deal with the reordering and loss.

6.      IPv4 option
Does the IETF still allow changing/adding IPv4 option fields? I might be wrong, but 
aren’t they frozen? Also, do commercial routers still care about IPv4 options?

7.      IPv6 option
This draft has defined a hop-by-hop option, which will require every intermediate 
IPv6 router to inspect this option. There have been some discussions on the 
pros/cons about Hop-by-Hop IPv6 Option. Is there any feedback from WG 6man?

8.      Parcel Path Qualification
This draft has described a method for parcel path qualification probe from end 
to end. It is nice to have it, but it is unreliable simply for the following 
reason: a probe parcel goes along one specific path, and your real application 
parcels may take different paths.

9.      Integrity
First paragraph of Section 7. More explanation/elaboration would be useful. I
might have missed it in previous paragraphs, but if I did, please provide a
reference to it such as “as described in …”.

10.   Implementation Status
In Section 10, TSO’s performance gain and Parcel’s gain should be regarded as
two different things. Since this draft is adding a hop-by-hop option, every
intermediate router is required to process the hop-by-hop option, which will,
theoretically speaking, lead to performance degradation. Of course, the overall
performance would depend on many other factors, such as the total numbers of
routing table lookups and number of segments.

11.   General observation
This proposal essentially tries to solve a problem caused by MTU. If the MTU were 
very big, one would simply put the whole data in a single packet. Since MTU is 
limited, a packet has to be cut into many smaller pieces (segments). In the 
existing specification, when an intermediate router sees a packet with its size 
larger than MTU, the router would be expected to fragment it so that the 
fragments could be forwarded. Here let me call it “fragmentation as needed”. In 
reality, however, some (if not all) commercial routers don’t do “fragmentation 
as needed”, instead of fragmenting the packet they simply discard it in order 
to achieve the wire-speed. This draft defines a new way to address the MTU 
issue: when a router sees a packet with its size larger than MTU, the router is 
asked to fragment it in a prescribed way (fragment it into pre-packaged 
segments). If I may, let me call it “fragmentation as prescribed”. Both 
“fragmentation as needed” and “fragmentation as prescribed” would require the 
support from intermediate routers. Like fragmentation as needed, 
fragmentation as prescribed may degrade the performance of intermediate 
routers. What is more, intermediate routers/boxes may perform “rejoining and 
repackaging”, which will adversely impact the performance of the intermediate 
routers/boxes.
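
As a rough illustration of what "fragmentation as prescribed" means here,
the sketch below splits a parcel only at segment boundaries so that each
resulting sub-parcel fits a given MTU. The function name, parameters, and the
single fixed header length are assumptions made for the example and are not
taken from the draft.

```python
from typing import List


def split_as_prescribed(segments: List[bytes], mtu: int,
                        header_len: int) -> List[List[bytes]]:
    """Group whole segments into sub-parcels whose header plus payload fits in mtu."""
    sub_parcels: List[List[bytes]] = []
    current: List[bytes] = []
    size = header_len
    for seg in segments:
        if header_len + len(seg) > mtu:
            raise ValueError("a single segment does not fit within the MTU")
        if current and size + len(seg) > mtu:
            sub_parcels.append(current)      # close the current sub-parcel
            current, size = [], header_len   # start a new one with its own header
        current.append(seg)
        size += len(seg)
    if current:
        sub_parcels.append(current)
    return sub_parcels
```

With six 1440-octet segments and a 60-octet header, a 1500-octet MTU yields six
single-segment sub-parcels, while a 9000-octet MTU leaves the parcel whole.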

Best regards,

Richard

From: Int-area <int-area-boun...@ietf.org> On Behalf Of Juan Carlos Zuniga
(juzuniga)
Sent: Wednesday, June 22, 2022 12:25 PM
To: int-area@ietf.org
Subject: [Int-area] Call for WG adoption of draft-templin-intarea-parcels-10

Dear IntArea WG,

We are starting a 2-week call for adoption of the IP-Parcels draft:
https://www.ietf.org/archive/id/draft-templin-intarea-parcels-10.html

The document has been discussed for some time and it has received multiple 
comments.

If you have an opinion on whether this document should be adopted by the 
IntArea WG please indicate it on the list by the end of Wednesday July 6th.

Thanks,

Juan-Carlos & Wassim
(IntArea WG chairs)

_______________________________________________
Int-area mailing list
Int-area@ietf.org
https://www.ietf.org/mailman/listinfo/int-area
