Herbert and I talked the IPV6 PMTU situation over this
afternoon, and we discovered that IPV4 has the same
problem :-)

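IPV4 calls ip_rt_frag_needed() unconditionally from
net/ipv4/icmp.c:icmp_unreach(), so the cached PMTU gets updated
whether or not the transport protocol would have accepted the ICMP
message.  That makes all of the verifications done by TCP (sequence
number checks etc.) basically for nothing.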

We can fix this without too much pain I think.  There are some
important questions to answer, but the basic idea is to move at least
all of the PMTU icmp handling into the ipprot->err_handler().
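
To make the idea concrete, here is a rough userspace sketch of the
ordering I have in mind: validate the ICMP error against the connection
first, and only then touch the PMTU.  Everything named sketch_* is made
up for illustration and is not kernel code; the window test is only
loosely modeled on the kind of sequence check TCP already does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for per-connection TCP state. */
struct sketch_conn {
	uint32_t snd_una;	/* oldest unacknowledged sequence number */
	uint32_t snd_nxt;	/* next sequence number to be sent */
	uint32_t pmtu;		/* path MTU currently used by this flow */
};

/* Wraparound-safe test that seq lies within [snd_una, snd_nxt]. */
static bool sketch_seq_in_window(const struct sketch_conn *c, uint32_t seq)
{
	return (uint32_t)(seq - c->snd_una) <=
	       (uint32_t)(c->snd_nxt - c->snd_una);
}

/*
 * Stand-in for the "fragmentation needed" branch of a transport
 * err_handler().  quoted_seq is the sequence number found in the TCP
 * header quoted inside the ICMP error.  Returns true only if the PMTU
 * was actually lowered.
 */
static bool sketch_tcp_frag_needed(struct sketch_conn *c,
				   uint32_t quoted_seq, uint32_t new_mtu)
{
	assert(c != NULL);

	if (!sketch_seq_in_window(c, quoted_seq))
		return false;	/* stale or forged ICMP: leave PMTU alone */
	if (new_mtu >= c->pmtu)
		return false;	/* this path only ever lowers the PMTU */

	c->pmtu = new_mtu;
	return true;
}
```

With the update done here, an ICMP error quoting a sequence number
outside the window never reaches the route metrics at all.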

There is another issue to address, which is what Herbert and I were
discussing when we discovered that ipv4 was broken too, and that is
how ipv6 routes work.

Routes on ipv6 can be prefix routes, unlike ipv4, which uses a
destination cache in which every route is to a precise destination.
So when we get a PMTU message under ipv6, we might have to clone the
prefix route into a route for one specific destination within that
prefix.

In order to clone/cow the route properly, we need to know the full
source and destination addresses.  Currently, dst->ops->update_pmtu()
has no way to pass in that information.  To be honest, ipv6 should
have a routing cache or similar, as that would solve this along with
many other problems.  But that is a discussion for another time.

One idea is to pass in a "const struct flowi *flp" into the
dst->ops->update_pmtu() handler which has the src/dest address fields
filled in.  I did code something like that up, then threw it away, but
such a simple change is easily rematerializable.
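
As a rough illustration of why the addresses are needed, here is a
small userspace sketch of an update_pmtu-style handler that takes a
flow key and clones a prefix route into a host route before lowering
the MTU.  All of the sketch_* types and functions are hypothetical
stand-ins, not the real struct flowi or dst_ops; the table is a flat
array, and the source address is carried in the key but not consulted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flow key carrying full source and destination. */
struct sketch_flow {
	uint8_t saddr[16];	/* carried, but unused by this sketch */
	uint8_t daddr[16];
};

struct sketch_route {
	uint8_t  prefix[16];
	unsigned plen;		/* prefix length in bits, 128 = host route */
	uint32_t mtu;
};

struct sketch_table {
	struct sketch_route routes[8];
	size_t count;
};

/* Does addr fall inside prefix/plen? */
static int sketch_prefix_match(const uint8_t *addr, const uint8_t *prefix,
			       unsigned plen)
{
	unsigned full = plen / 8, rem = plen % 8;
	uint8_t mask;

	if (memcmp(addr, prefix, full) != 0)
		return 0;
	if (rem == 0)
		return 1;
	mask = (uint8_t)(0xFF << (8 - rem));
	return (addr[full] & mask) == (prefix[full] & mask);
}

/* Longest-prefix match over the flat table. */
static struct sketch_route *sketch_lookup(struct sketch_table *t,
					  const uint8_t *daddr)
{
	struct sketch_route *best = NULL;
	size_t i;

	for (i = 0; i < t->count; i++) {
		struct sketch_route *r = &t->routes[i];

		if (sketch_prefix_match(daddr, r->prefix, r->plen) &&
		    (best == NULL || r->plen > best->plen))
			best = r;
	}
	return best;
}

/*
 * Lower the MTU for the destination in fl.  If only a prefix route
 * matches, clone it into a host route first so that the rest of the
 * prefix keeps its old MTU.  Returns the route that was updated, or
 * NULL if nothing matched or the table is full.
 */
static struct sketch_route *sketch_update_pmtu(struct sketch_table *t,
					       const struct sketch_flow *fl,
					       uint32_t mtu)
{
	struct sketch_route *r;

	assert(t != NULL && fl != NULL);

	r = sketch_lookup(t, fl->daddr);
	if (r == NULL)
		return NULL;
	if (r->plen < 128) {
		struct sketch_route *host;

		if (t->count == 8)
			return NULL;
		host = &t->routes[t->count++];
		*host = *r;			/* inherit the metrics */
		memcpy(host->prefix, fl->daddr, 16);
		host->plen = 128;
		r = host;
	}
	if (mtu < r->mtu)
		r->mtu = mtu;
	return r;
}
```

Without the destination address in hand there is nothing to build the
host route from, which is exactly what the current handler is missing.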

If we start to do the PMTU route metric update call solely from
ipprot->err_handler(), this would also fix a nagging issue some people
see in ipv4 RSVP and MPLS environments.  There, the MPLS/RSVP cloud
rewrites the TOS field, so on PMTU reception we either update the
wrong routing cache entry or update no entry at all, since none of the
routing cache entries has a matching TOS value.
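
A toy userspace sketch of the TOS mismatch, with made-up sketch_* names
and a flat array standing in for the routing cache: a lookup keyed on
the TOS quoted in the ICMP payload misses once the cloud has rewritten
it, while an update made through the route the socket already holds
does not depend on the quoted TOS at all.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical routing cache entry, keyed by destination and TOS. */
struct sketch_rt {
	uint32_t daddr;
	uint8_t  tos;
	uint32_t mtu;
};

/* Hypothetical socket holding a reference to the route it sends on. */
struct sketch_sock {
	struct sketch_rt *dst;
};

/* Lookup keyed on fields taken from the header quoted in the ICMP. */
static struct sketch_rt *sketch_cache_find(struct sketch_rt *cache, size_t n,
					   uint32_t daddr, uint8_t tos)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (cache[i].daddr == daddr && cache[i].tos == tos)
			return &cache[i];
	return NULL;	/* rewritten TOS: no entry matches */
}

/* Update via the socket's own route; the quoted TOS is never used. */
static void sketch_sock_update_pmtu(struct sketch_sock *sk, uint32_t mtu)
{
	assert(sk != NULL && sk->dst != NULL);

	if (mtu < sk->dst->mtu)
		sk->dst->mtu = mtu;
}
```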

We have the same problem with redirects, but we get those in response
to forwarded frames so these changes being discussed won't fix that
case of mismatching TOS.

Herbert also mentioned the idea of doing a hostcache to store the TCP
metrics in, just like BSD does.  This has been discussed before, and
if we decide to do that it's a more involved project.  We already have
the inetpeer cache on the ipv4 side, and we could create something
similar for ipv6 for this purpose.  Again, this is just another idea.

I always believed that TCP metrics should have the deepest granularity
possible, addresses and ports and everything else that can key a
route, because even a change in destination port can elicit a
different IPSEC rule and thus a totally different path with totally
different PMTU, RTT, etc.

But the arguments are mounting for not doing metrics like this any
more.

This also brings up the older topic of doing IPSEC PMTU handling
properly.