On 1 August 2016 at 13:07, pravin shelar <pshe...@ovn.org> wrote:
> On Mon, Aug 1, 2016 at 11:46 AM, Joe Stringer <j...@ovn.org> wrote:
>> On 1 August 2016 at 10:21, pravin shelar <pshe...@ovn.org> wrote:
>>> On Tue, Jul 12, 2016 at 3:26 PM, Joe Stringer <j...@ovn.org> wrote:
>>>> The core fragmentation handling logic is exported on all supported
>>>> kernels, so it's not necessary to backport the latest version of this.
>>>> This greatly simplifies the code due to inconsistencies between the old
>>>> per-lookup garbage collection and the newer workqueue-based garbage
>>>> collection.
>>>>
>>>> As a result of simplifying and removing unnecessary backport code, a few
>>>> bugs are fixed for corner cases, such as when some fragments remain in
>>>> the fragment cache when openvswitch is unloaded.
>>>>
>>>> Some backported IP functions need a little extra logic beyond what is
>>>> seen in the latest code due to this, for instance on kernels <3.17:
>>>> * Call inet_frag_evictor() before defrag
>>>> * Limit hashsize in ip{,6}_fragment logic
>>>>
>>>> The pernet init/exit logic also differs a little from upstream. Upstream
>>>> ipv[46]_defrag logic initializes the various pernet fragment parameters
>>>> and its own global fragments cache. In the OVS backport, the pernet
>>>> parameters are shared while the fragments cache is separate. The
>>>> backport relies upon upstream pernet initialization to perform the
>>>> shared setup, and performs no pernet initialization of its own. When it
>>>> comes to pernet exit, however, the backport must ensure that all
>>>> OVS-specific fragment state is cleared, while the shared state remains
>>>> untouched so that the regular ipv[46] logic may do its own cleanup. In
>>>> practice this means that OVS must have its own divergent implementation
>>>> of inet_frags_exit_net().
>>>>
>>>> Fixes the following crash:
>>>>
>>>>   Call Trace:
>>>>    <IRQ>
>>>>    [<ffffffff810744f6>] ? call_timer_fn+0x36/0x100
>>>>    [<ffffffff8107548f>] run_timer_softirq+0x1ef/0x2f0
>>>>    [<ffffffff8106cccc>] __do_softirq+0xec/0x2c0
>>>>    [<ffffffff8106d215>] irq_exit+0x105/0x110
>>>>    [<ffffffff81737095>] smp_apic_timer_interrupt+0x45/0x60
>>>>    [<ffffffff81735a1d>] apic_timer_interrupt+0x6d/0x80
>>>>    <EOI>
>>>>    [<ffffffff8104f596>] ? native_safe_halt+0x6/0x10
>>>>    [<ffffffff8101cb2f>] default_idle+0x1f/0xc0
>>>>    [<ffffffff8101d406>] arch_cpu_idle+0x26/0x30
>>>>    [<ffffffff810bf3a5>] cpu_startup_entry+0xc5/0x290
>>>>    [<ffffffff810415ed>] start_secondary+0x21d/0x2d0
>>>>   Code: Bad RIP value.
>>>>   RIP [<ffffffffa0177480>] 0xffffffffa0177480
>>>>    RSP <ffff88003f703e78>
>>>>   CR2: ffffffffa0177480
>>>>   ---[ end trace eb98ca80ba07bd9c ]---
>>>>   Kernel panic - not syncing: Fatal exception in interrupt
>>>>
>>>> Signed-off-by: Joe Stringer <j...@ovn.org>
>>>> ---
>>>> I've tested this on CentOS kernel 3.10.0-327 and Ubuntu kernels
>>>> 3.13.0-68, 3.16.0-70, 3.19.0-58, and 4.2.0-35. Some earlier kernel
>>>> versions may still trigger fragmentation-related crashes due to upstream
>>>> bugs, for instance on Ubuntu's 3.13.0-24.
>>>> ---
>>>>  acinclude.m4                                  |   1 +
>>>>  datapath/linux/compat/include/net/inet_frag.h |  58 ++-
>>>>  datapath/linux/compat/inet_fragment.c         | 486 ++------------------------
>>>>  datapath/linux/compat/ip_fragment.c           |  36 +-
>>>>  datapath/linux/compat/nf_conntrack_reasm.c    |  17 +
>>>>  5 files changed, 91 insertions(+), 507 deletions(-)
>>>>
>>> ...
>>>
>>>> -void inet_frag_maybe_warn_overflow(struct inet_frag_queue *q,
>>>> -                                   const char *prefix)
>>>> -{
>>>> -       static const char msg[] = "inet_frag_find: Fragment hash bucket"
>>>> -               " list length grew over limit " __stringify(INETFRAGS_MAXDEPTH)
>>>> -               ". Dropping fragment.\n";
>>>> +       local_bh_disable();
>>>> +       inet_frag_evictor(nf, f, true);
>>>> +       local_bh_enable();
>>>>
>>>> -       if (PTR_ERR(q) == -ENOBUFS)
>>>> -               net_dbg_ratelimited("%s%s", prefix, msg);
>>>> +       nf->low_thresh = thresh;
>>>>  }
>>>
>>> Upstream inet_frags_exit_net() and the backported one look the same; we
>>> can just use the exported symbol.
>>
>> We share nf->mem with the upstream frag code, so we shouldn't
>> percpu_counter_destroy(&nf->mem) when we exit the namespace; the
>> upstream frag code will handle that.
>>
>> My other concern was that we use the shared nf->low_thresh to clear
>> out our fragments cache, but if we are exiting our module while the
>> upstream frag code remains, we should restore nf->low_thresh
>> afterwards so that we don't cause flushing of the upstream frag cache.
>>
> I think we can restore both objects in compat code after calling the
> default exit function.
>
> But the exit code implementation itself is not that complicated, so I
> am fine with the code.
>
>>>> +#endif /* HAVE_INET_FRAGS_WITH_FRAGS_WORK */
>>>>
>>>>  #endif /* !HAVE_CORRECT_MRU_HANDLING */
>>>> diff --git a/datapath/linux/compat/ip_fragment.c b/datapath/linux/compat/ip_fragment.c
>>>> index 8d01088abc0a..64e2cf23c327 100644
>>>> --- a/datapath/linux/compat/ip_fragment.c
>>>> +++ b/datapath/linux/compat/ip_fragment.c
>>>> @@ -76,12 +76,7 @@ struct ipfrag_skb_cb
>>>>
>>>>  /* Describe an entry in the "incomplete datagrams" queue. */
>>>>  struct ipq {
>>>> -       union {
>>>> -               struct inet_frag_queue q;
>>>> -#ifndef HAVE_INET_FRAG_QUEUE_WITH_LIST_EVICTOR
>>>> -               struct ovs_inet_frag_queue oq;
>>>> -#endif
>>>> -       };
>>>> +       struct inet_frag_queue q;
>>>>
>>>>         u32 user;
>>>>         __be32 saddr;
>>>> @@ -119,6 +114,12 @@ static unsigned int ipqhashfn(__be16 id, __be32 saddr, __be32 daddr, u8 prot)
>>>>                        (__force u32)saddr, (__force u32)daddr,
>>>>                        ip4_frags.rnd);
>>>>  }
>>>> +/* fb3cfe6e75b9 ("inet: frag: remove hash size assumptions from callers")
>>>> + * shifted this logic into inet_fragment, but prior kernels still need this.
>>>> + */
>>>> +#if LINUX_VERSION_CODE < KERNEL_VERSION(3,17,0)
>>>> +#define ipqhashfn(a, b, c, d) (ipqhashfn(a, b, c, d) & (INETFRAGS_HASHSZ - 1))
>>>> +#endif
>>>>
>>>>  #ifdef HAVE_INET_FRAGS_CONST
>>>>  static unsigned int ip4_hashfn(const struct inet_frag_queue *q)
>>>> @@ -267,6 +268,23 @@ out:
>>>>         ipq_put(qp);
>>>>  }
>>>>
>>>> +/* Memory limiting on fragments. Evictor trashes the oldest
>>>> + * fragment queue until we are back under the threshold.
>>>> + *
>>>> + * Necessary for kernels earlier than v3.17. Replaced in commit
>>>> + * b13d3cbfb8e8 ("inet: frag: move eviction of queues to work queue").
>>>> + */
>>>> +static void ip_evictor(struct net *net)
>>>> +{
>>>> +#ifdef HAVE_INET_FRAG_EVICTOR
>>>> +       int evicted;
>>>> +
>>>> +       evicted = inet_frag_evictor(&net->ipv4.frags, &ip4_frags, false);
>>>> +       if (evicted)
>>>> +               IP_ADD_STATS_BH(net, IPSTATS_MIB_REASMFAILS, evicted);
>>>> +#endif
>>>> +}
>>>> +
>>>>  /* Find the correct entry in the "incomplete datagrams" queue for
>>>>  * this IP datagram, and create new one, if nothing is found.
>>>>  */
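For the archives: the divergent pernet exit path discussed above boils down
to roughly the following shape. This is a sketch only, covering just the
pre-v3.17 (inet_frag_evictor) case; the function name is illustrative, not
necessarily what the patch uses:

        /* Flush only the OVS-private fragment queues on pernet exit.
         * nf->mem and the pernet parameters are shared with the regular
         * ipv[46] code, which does its own teardown.
         */
        static void ovs_inet_frags_exit_net(struct netns_frags *nf,
                                            struct inet_frags *f)
        {
                int thresh = nf->low_thresh;    /* Save the shared threshold. */

                nf->low_thresh = 0;             /* Make every queue over-limit... */
                local_bh_disable();
                inet_frag_evictor(nf, f, true); /* ...and force-evict them all. */
                local_bh_enable();

                nf->low_thresh = thresh;        /* Restore for the upstream cache. */

                /* Deliberately no percpu_counter_destroy(&nf->mem): the counter
                 * is shared with the regular ipv[46] code, which destroys it in
                 * its own pernet exit.
                 */
        }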
>>>> @@ -281,6 +299,11 @@ static struct ipq *ip_find(struct net *net, struct iphdr *iph,
>>>>         arg.user = user;
>>>>         arg.vif = vif;
>>>>
>>>> +       ip_evictor(net);
>>>> +
>>> IP fragment eviction can be done after this lookup, so that there is
>>> a better chance of finding the fragment.
>>
>> I think I was trying to follow upstream more closely, although it's
>> actually in ip_defrag() in v3.16. I could shift it there instead; I
>> don't remember specifically why I put it here in the first place.
>>
>> This seems like a reasonable suggestion at face value: on affected
>> kernels an LRU is maintained for frag lists, so if the current frag
>> adds to the oldest queue then that queue will be bumped to the front
>> of the LRU and be saved. If it /would/ have been deleted by the
>> ip_evictor(), it is then no longer at risk; it could also end up
>> reassembling the message, thereby releasing the memory and reducing
>> the amount of work for ip_evictor() to do.
>>
>> The tradeoff is that if it's actually a fragment from a brand-new
>> message, then if the fragment cache is full and we don't clean out old
>> fragments first, we'll fail to allocate a fragqueue for the new
>> message. We'll clean out the frags list afterwards, then subsequently
>> receive later fragments for the same message. These fragments won't be
>> able to reassemble the message, because we dropped the first one.
>>
>> I'm inclined to think that the latter is more problematic than the
>> former: most likely fragments arrive in quick succession, so if we're
>> deleting the oldest frag queue then it is more likely to be stale and
>> not receiving the remaining fragments. So we should prioritize
>> ensuring that the new fragment can create a new fragqueue.
>>
>> Thoughts?
>
> I agree the latter seems the bigger issue. We could insert the evict
> between search and create, but that would diverge the compat code a lot
> compared to upstream. So let's keep it as it is.
>
> At this point I am fine with the patch as it is.
>
> Acked-by: Pravin B Shelar <pshe...@ovn.org>
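To make the tradeoff above easier to follow: on the affected (pre-v3.17)
kernels, eviction walks a per-netns LRU of fragment queues oldest-first,
roughly like this (a simplified outline of the upstream inet_frag_evictor()
loop, with locking and stats omitted; nf and f are the usual netns_frags
and inet_frags arguments):

        while (frag_mem_limit(nf) > nf->low_thresh) {
                struct inet_frag_queue *q;

                if (list_empty(&nf->lru_list))
                        break;

                /* Least recently touched queue first: most likely stale. */
                q = list_first_entry(&nf->lru_list,
                                     struct inet_frag_queue, lru_list);
                inet_frag_kill(q, f);   /* Unhash it... */
                inet_frag_put(q, f);    /* ...and drop its fragments. */
        }

Running this before ip_find() guarantees room for a brand-new queue, at the
cost of possibly evicting the very queue an arriving fragment belongs to;
running it after would instead favor the existing queue.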
Thanks for the reviews. To bring the ip_evictor() call closer to upstream, I
plan to also apply this incremental on top of this series and push soon:

diff --git a/datapath/linux/compat/ip_fragment.c b/datapath/linux/compat/ip_fragment.c
index 3b737f304287..e707c607fb30 100644
--- a/datapath/linux/compat/ip_fragment.c
+++ b/datapath/linux/compat/ip_fragment.c
@@ -268,6 +268,7 @@ out:
        ipq_put(qp);
 }

+#ifdef HAVE_INET_FRAG_EVICTOR
 /* Memory limiting on fragments. Evictor trashes the oldest
  * fragment queue until we are back under the threshold.
  *
@@ -276,14 +277,13 @@ out:
  */
 static void ip_evictor(struct net *net)
 {
-#ifdef HAVE_INET_FRAG_EVICTOR
        int evicted;

        evicted = inet_frag_evictor(&net->ipv4.frags, &ip4_frags, false);
        if (evicted)
                IP_ADD_STATS_BH(net, IPSTATS_MIB_REASMFAILS, evicted);
-#endif
 }
+#endif

 /* Find the correct entry in the "incomplete datagrams" queue for
  * this IP datagram, and create new one, if nothing is found.
@@ -299,8 +299,6 @@ static struct ipq *ip_find(struct net *net, struct iphdr *iph,
        arg.user = user;
        arg.vif = vif;

-       ip_evictor(net);
-
 #ifdef HAVE_INET_FRAGS_WITH_RWLOCK
        read_lock(&ip4_frags.lock);
 #endif
@@ -706,6 +704,11 @@ int rpl_ip_defrag(struct net *net, struct sk_buff *skb, u32 user)
        IP_INC_STATS_BH(net, IPSTATS_MIB_REASMREQDS);
        skb_orphan(skb);

+#ifdef HAVE_INET_FRAG_EVICTOR
+       /* Start by cleaning up the memory. */
+       ip_evictor(net);
+#endif
+
        /* Lookup (or create) queue header */
        qp = ip_find(net, ip_hdr(skb), user, vif);
        if (qp) {

Lastly, looking over the LRU-related code upstream, I noticed that we don't
actually bump the entries each time they are touched in our backport. I sent
a patch for that:
http://openvswitch.org/pipermail/dev/2016-August/076961.html
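For context on that last point: on pre-v3.17 kernels, upstream bumps a queue
to the tail of the LRU whenever a fragment is successfully added to it, via
inet_frag_lru_move() at the end of ip_frag_queue(). The fix presumably
amounts to restoring that call in the backported ip_frag_queue(); the
placement below is illustrative, see the linked patch for the actual hunk:

        /* End of the successful path in the backported ip_frag_queue()
         * (pre-v3.17 only): mark this queue recently used, so that
         * inet_frag_evictor() reclaims genuinely idle queues first. */
        inet_frag_lru_move(&qp->q);

inet_frag_lru_move() is the upstream inline that, roughly, does
list_move_tail(&q->lru_list, &q->net->lru_list) under the lru_lock, which
is what keeps active queues off the eviction end of the list.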