After talking the IPV6 PMTU situation over with Herbert this afternoon, we discovered that IPV4 has the same problem :-)
It calls ip_rt_frag_needed() unconditionally from net/ipv4/icmp.c:icmp_unreach(), which makes all of the verifications done by TCP (sequence number checks etc.) basically for nothing.
We can fix this without too much pain I think. There are some important questions to answer, but the basic idea is to move at least all of the PMTU icmp handling into the ipprot->err_handler().
There is another issue to address, which is what Herbert and I were discussing when we discovered that ipv4 was broken too, and that is how ipv6 routes work. Routes can be prefixed on ipv6, unlike ipv4 which uses a destination cache and all routes are to precise destinations. So when we get PMTU messages under ipv6, we might have to clone the prefixed route into one to a specific destination in that prefix. In order to clone/cow the route properly, we need to know the full source and destination addresses. Currently, dst->ops->update_pmtu() has no way to pass in that information.
To be honest, ipv6 should have a routing cache or similar, as that would solve this along with many other problems. But that is a discussion for another time.
One idea is to pass in a "const struct flowi *flp" into the dst->ops->update_pmtu() handler which has the src/dest address fields filled in. I did code something like that up, then threw it away, but such a simple change is easily rematerializable.
If we start to do the PMTU route metric update call solely from ipprot->err_handler(), this would fix a nagging issue some people see in ipv4 RSVP and MPLS environments, wherein the MPLS/RSVP cloud rewrites the TOS field so we update the wrong routing cache entry on PMTU reception or no entry at all since none of the routing cache entries have the matching TOS value. We have the same problem with redirects, but we get those in response to forwarded frames so these changes being discussed won't fix that case of mismatching TOS.
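To make the two ideas above a bit more concrete, here is a rough userspace sketch, not kernel code. It models (a) doing the PMTU update only from the protocol's error handler, after the TCP-style sequence check has passed, and (b) handing the update routine a flow key so it can clone a prefixed route into a host route. Everything in it is hypothetical: struct flow_key stands in for struct flowi, route_update_pmtu() stands in for dst->ops->update_pmtu(), conn_err_frag_needed() stands in for the err_handler, and addresses are 32 bits only to keep it short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct flowi: just the two addresses. */
struct flow_key {
	uint32_t src;
	uint32_t dst;
};

/* Hypothetical route entry; mask HOST_MASK marks a cloned host route. */
struct route {
	uint32_t prefix;
	uint32_t mask;
	uint32_t src;		/* filled in only on cloned host routes */
	unsigned int pmtu;
	bool in_use;
};

#define MAX_ROUTES 8
#define HOST_MASK 0xffffffffu

struct route_table {
	struct route r[MAX_ROUTES];
};

/* Longest-prefix match over the table; NULL if nothing covers dst. */
struct route *route_lookup(struct route_table *t, uint32_t dst)
{
	struct route *best = NULL;

	for (size_t i = 0; i < MAX_ROUTES; i++) {
		struct route *rt = &t->r[i];

		if (!rt->in_use || (dst & rt->mask) != rt->prefix)
			continue;
		if (best == NULL || rt->mask > best->mask)
			best = rt;
	}
	return best;
}

/*
 * Model of an update_pmtu() that is told the flow.  If the best match is
 * a prefixed route, clone it into a host route for fl->dst and lower the
 * PMTU only on the clone.  Returns the route that now covers the flow,
 * or NULL if there is no covering route or no room for a clone.
 */
struct route *route_update_pmtu(struct route_table *t,
				const struct flow_key *fl, unsigned int mtu)
{
	struct route *rt;

	assert(t != NULL && fl != NULL);

	rt = route_lookup(t, fl->dst);
	if (rt == NULL)
		return NULL;
	if (mtu >= rt->pmtu)
		return rt;	/* this sketch never raises the PMTU */

	if (rt->mask != HOST_MASK) {
		struct route *clone = NULL;

		for (size_t i = 0; i < MAX_ROUTES; i++) {
			if (!t->r[i].in_use) {
				clone = &t->r[i];
				break;
			}
		}
		if (clone == NULL)
			return NULL;
		*clone = *rt;
		clone->prefix = fl->dst;
		clone->mask = HOST_MASK;
		clone->src = fl->src;
		rt = clone;
	}
	rt->pmtu = mtu;
	return rt;
}

/* Hypothetical connection state: flow plus the unacknowledged window. */
struct conn {
	struct flow_key fl;
	uint32_t snd_una;
	uint32_t snd_nxt;
};

/*
 * Model of the protocol error handler.  The route is touched only if the
 * sequence number quoted in the ICMP payload lies in [snd_una, snd_nxt]
 * (modular arithmetic).  Returns true if the message passed that check
 * and a covering route was found.
 */
bool conn_err_frag_needed(struct route_table *t, const struct conn *c,
			  uint32_t quoted_seq, unsigned int mtu)
{
	if ((uint32_t)(quoted_seq - c->snd_una) >
	    (uint32_t)(c->snd_nxt - c->snd_una))
		return false;
	return route_update_pmtu(t, &c->fl, mtu) != NULL;
}
```

With a single 10.0.0.0/8 route at PMTU 1500, a message quoting an out-of-window sequence number leaves the table alone, while a valid one creates a host route for that one destination and leaves the rest of the prefix at 1500.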
Herbert also mentioned the idea of doing a hostcache to store the TCP metrics into just like BSD does. This has been discussed before, and if we decide to do that it's a more involved project. We already have the inetpeer cache on the ipv4 side, and we could create something similar for ipv6 for this purpose. Again, this is just another idea. I always believed that TCP metrics should have the deepest granularity possible, addresses and ports and everything else that can key a route, because even a change in destination port can elicit a different IPSEC rule and thus a totally different path with totally different PMTU, RTT, etc. But the arguments are mounting for not doing metrics like this any more. This also brings up the older topic of doing IPSEC PMTU handling properly.
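For illustration only, a hostcache in the sense above could look roughly like the following userspace toy. It is keyed by destination address alone, so every connection to that host shares one set of metrics whatever port it uses, which is exactly the granularity trade-off in question. All names here (struct hostcache, hostcache_save(), hostcache_seed()) are made up for the sketch and do not correspond to the inetpeer code or to the BSD implementation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* The metrics we would want to remember across connections. */
struct metrics {
	unsigned int pmtu;
	unsigned int rtt_us;
};

/* One cache slot, keyed by destination address only (no ports). */
struct host_entry {
	uint32_t daddr;
	struct metrics m;
	bool in_use;
};

#define HOSTCACHE_SIZE 16

struct hostcache {
	struct host_entry e[HOSTCACHE_SIZE];
};

/*
 * Open-addressing lookup with linear probing.  Entries are never deleted
 * in this toy, so hitting an empty slot means the key is absent.
 * Returns NULL on a miss without create, or when the table is full.
 */
struct host_entry *hostcache_get(struct hostcache *hc, uint32_t daddr,
				 bool create)
{
	size_t start = daddr % HOSTCACHE_SIZE;

	for (size_t i = 0; i < HOSTCACHE_SIZE; i++) {
		struct host_entry *he = &hc->e[(start + i) % HOSTCACHE_SIZE];

		if (he->in_use && he->daddr == daddr)
			return he;
		if (!he->in_use) {
			if (!create)
				return NULL;
			he->in_use = true;
			he->daddr = daddr;
			he->m = (struct metrics){ 0 };
			return he;
		}
	}
	return NULL;
}

/* Store metrics when a connection to daddr closes. */
bool hostcache_save(struct hostcache *hc, uint32_t daddr, struct metrics m)
{
	struct host_entry *he = hostcache_get(hc, daddr, true);

	if (he == NULL)
		return false;
	he->m = m;
	return true;
}

/* Seed a new connection to daddr from the cache, if we have an entry. */
bool hostcache_seed(struct hostcache *hc, uint32_t daddr, struct metrics *out)
{
	struct host_entry *he = hostcache_get(hc, daddr, false);

	if (he == NULL)
		return false;
	*out = he->m;
	return true;
}
```

Making the key finer grained (ports, or whatever else selects an IPSEC rule) would only change the key and the comparison in hostcache_get(); the cost is more entries and less sharing between connections.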