Hello,

We would like to propose this patchset again. Only minor details have changed since the last version: we incorporated Jesse's suggestion to always store the size of the largest fragment received, regardless of the DF bit.
Thus we never generate fragments bigger than those originally received, regardless of whether DF is set or not.

We would like to summarize the current discussion on this topic and again ask you to consider applying this patchset to net-next.

Several proposals were suggested:

#1 employ GRO engine

- Reassembly would only work within one napi poll run. But reassembly must happen independently of the interface on which the frame is received. Delays cause individual fragments to arrive in different napi runs, where they would not be aggregated.
- We would have to kill the 1:1 correspondence between aggregation and segmentation: with TCP we can stop aggregating frames at any point without harm, because it is a streaming protocol. Fragmentation is different in that we need to reassemble the complete packet before processing; we cannot make sense of 'half skbs'.

#2 keep fragments attached to reassembled skb

The idea is to attach the original skbs to the reassembled one, so the networking stack can choose which ones to use depending on the use case. Forwarding would operate on the original ones, while code dealing with PACKET_HOST frames would use the reassembled one.

- We have the overhead of carrying more skbs around, which we currently don't do.
- This information cannot be stored in any of the currently available fields in the skb or shared_info. A new pointer would therefore be necessary in every skb, whether it is fragmented or not. This change impacts the fast path and the skb size.
- Sometimes using the reassembled skb and sometimes the original ones could lead to TOCTTOU attacks, e.g. when a packet is split within the TCP header: the core stack sees the complete reassembled TCP packet but netfilter sees only part of the header, so different decisions might be made.
- It impacts the fast path in netfilter for every packet: pskb_may_pull is not enough to check whether we can access enough of the header. Because of overlapping or duplicate fragments we actually have to touch all those fragments, thus creating new slow paths in netfilter.
- All netfilter helpers would need to be adapted for cases such as a fragmented UDP packet containing a SIP message.
- In case we change the fragment size, we don't have clear semantics, and the only behaviour which makes sense is what this patchset does (i.e., refragment).
- Still, even such a complex change does not allow us to act as a transparent router/bridge: we still have to queue up fragments, and in case we cannot reassemble them we have to drop them (else a firewall bypass is possible).

#3 max_frag_size vector

As it is based on the idea of keeping fragments attached to the reassembled skb, it inherits a lot of the problems stated in section #2.

- It still needs an additional way to store this information in the skb, thus enlarging a structure we are trying to shrink.
- TOCTTOU attacks are not possible because we inspect the same data all the time ...
- ... but at the same time, we cannot deal with overlapping or duplicated fragments (without making this complex again).

For years the Linux kernel never correctly handled fragmented packets in the L3 or L2 forwarding cases, and we never heard any complaints. These patches try to make Linux a better internet citizen by correctly handling some edge cases, without harming core code or affecting performance. Thus we consider our proposed patches superior in all aspects.

We are happy to discuss any ideas on how to solve this otherwise.
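
As a rough illustration of the behaviour described at the top of this mail (always remember the largest fragment received, and never emit a bigger one when refragmenting), here is a small userspace sketch. It is not code from the patches: struct reasm_state, reasm_note_fragment(), refrag_limit() and refrag_count() are hypothetical names made up for this example, and sizes are assumed to include the IP header. The only protocol fact relied on is that non-final fragments carry a payload that is a multiple of 8 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-datagram reassembly state (not a kernel structure). */
struct reasm_state {
	size_t max_frag_size;	/* largest fragment seen, IP header included */
};

/* Record the size of a received fragment, whether or not DF was set. */
void reasm_note_fragment(struct reasm_state *st, size_t frag_size)
{
	if (frag_size > st->max_frag_size)
		st->max_frag_size = frag_size;
}

/*
 * Upper bound for fragments we emit on egress: the egress MTU, clamped
 * to the largest fragment we originally received (if any was recorded).
 */
size_t refrag_limit(const struct reasm_state *st, size_t mtu)
{
	if (st->max_frag_size && st->max_frag_size < mtu)
		return st->max_frag_size;
	return mtu;
}

/*
 * Number of fragments needed to carry payload_len bytes when each
 * fragment has a header of hdr_len bytes and must fit in the limit.
 */
size_t refrag_count(const struct reasm_state *st, size_t mtu,
		    size_t hdr_len, size_t payload_len)
{
	size_t limit = refrag_limit(st, mtu);
	size_t per_frag;

	/* A fragment must fit the header plus at least one 8-byte unit. */
	assert(limit >= hdr_len + 8);

	/* Non-final fragments carry a multiple of 8 payload bytes. */
	per_frag = (limit - hdr_len) & ~(size_t)7;

	return (payload_len + per_frag - 1) / per_frag;
}
```

For example, after fragments of 576, 1280 and 600 bytes have been noted, refrag_limit() with a 1500 byte egress MTU yields 1280, so no emitted fragment exceeds what was originally received. With a 1000 byte egress MTU the MTU itself is the limit.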

We investigated alternate approaches to allow transparent refragmentation for the common case of "well-formed" (i.e., non-overlapping, no duplicates, ...) fragments. Unfortunately, this involves removing an IP defragmentation optimization when netfilter conntrack is active. The two patches that enable this are included as [RFC] as part of this series so they can be discussed.

Thanks,
Hannes, Florian