Re: [ovs-dev] Fwd: [ovs-discuss] OVS segfault in recirculation

Andy Zhou Tue, 17 Feb 2015 15:25:15 -0800

Thanks for the confirmation. I have pushed the fix to both master and
branch-2.3. Race conditions can some times hard to track down. Glad
this is not one of those painful experiences.


On Mon, Feb 16, 2015 at 2:23 AM, Salvatore Cambria
<salvatore.camb...@citrix.com> wrote:
> Hi Andy,
>
> I can tell you that this morning the test is still running without any
> segfault. The patch seems to fix the problem!
>
> Thanks for your help,
>
> Salvatore
>
>
>
> On 13/02/15 17:18, Andy Zhou wrote:
>>
>> Great! looking forward to the test results.
>>
>> On Fri, Feb 13, 2015 at 3:56 AM, Salvatore Cambria
>> <salvatore.camb...@citrix.com> wrote:
>>>
>>> Hi Andy,
>>>
>>> Yes, it makes a lot of sense!
>>>
>>> I applied the patch and I'm running my script again in order to test it.
>>> It
>>> looks good so far!
>>> I will leave the script running all across the weekend and I'll come back
>>> to
>>> you with my results on Monday.
>>>
>>> Thank you for your very fast response and for your effort,
>>>
>>> Salvatore
>>>
>>> On 13/02/15 02:10, Andy Zhou wrote:
>>>>
>>>> Does not mean to drop the list.
>>>>
>>>>
>>>> ---------- Forwarded message ----------
>>>> From: Andy Zhou <az...@nicira.com>
>>>> Date: Thu, Feb 12, 2015 at 3:42 PM
>>>> Subject: Re: [ovs-discuss] OVS segfault in recirculation
>>>> To: Salvatore Cambria <salvatore.camb...@citrix.com>
>>>>
>>>>
>>>> Hi, Salvatore,
>>>>
>>>> I think I found a bug: hmap_remove needs to be protected by a lock. As
>>>> you pointed out, the output_normal() path is not properly protected.
>>>> Would you please
>>>> try the attached patch to see if it helps?
>>>>
>>>> Bond object is protected by reference counting.  Assuming reference
>>>> counter is valid, it should be O.K. for a bundle to hold on to the
>>>> object. Notice
>>>> bundle_destroy() calls unref_bond().  At any rate, I did not spot
>>>> anything obvious there.
>>>>
>>>> Thanks for providing such a detailed description. It is very helpful.
>>>>
>>>> Andy
>>>>
>>>> ================================
>>>>
>>>> iff --git a/ofproto/bond.c b/ofproto/bond.c
>>>> index c4b3a3a..a9b7a83 100644
>>>> --- a/ofproto/bond.c
>>>> +++ b/ofproto/bond.c
>>>> @@ -923,8 +923,8 @@ bond_may_recirc(const struct bond *bond, uint32_t
>>>> *recirc_id,
>>>>        }
>>>>    }
>>>>
>>>> -void
>>>> -bond_update_post_recirc_rules(struct bond* bond, const bool force)
>>>> +static void
>>>> +bond_update_post_recirc_rules__(struct bond* bond, const bool force)
>>>>    {
>>>>       struct bond_entry *e;
>>>>       bool update_rules = force;  /* Always update rules if caller
>>>> forces
>>>> it. */
>>>> @@ -945,6 +945,14 @@ bond_update_post_recirc_rules(struct bond* bond,
>>>> const bool force)
>>>>            update_recirc_rules(bond);
>>>>       }
>>>>    }
>>>> +
>>>> +void
>>>> +bond_update_post_recirc_rules(struct bond* bond, const bool force)
>>>> +{
>>>> +    ovs_rwlock_wrlock(&rwlock);
>>>> +    bond_update_post_recirc_rules__(bond, force);
>>>> +    ovs_rwlock_unlock(&rwlock);
>>>> +}
>>>>    ^L
>>>>    /* Rebalancing. */
>>>>
>>>> @@ -1203,7 +1211,7 @@ bond_rebalance(struct bond *bond)
>>>>        }
>>>>
>>>>        if (use_recirc && rebalanced) {
>>>> -        bond_update_post_recirc_rules(bond,true);
>>>> +        bond_update_post_recirc_rules__(bond,true);
>>>>        }
>>>>
>>>>    done:
>>>>
>>>>
>>>> On Tue, Feb 10, 2015 at 7:57 AM, Salvatore Cambria
>>>> <salvatore.camb...@citrix.com> wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> I am a Software Engineer from Citrix, Cambridge UK office.
>>>>>
>>>>> We are having a problem when running our nightly test on a XenServer
>>>>> host
>>>>> with lacp bonds with OVS. Indeed, when deleting the bond the following
>>>>> segfault error appears (from GDB):
>>>>>
>>>>> Program terminated with signal 11, Segmentation fault.
>>>>>
>>>>> #0  0x0000000000411d2f in hmap_remove (node=0x7fd9d402e730,
>>>>> hmap=0x1c3e9e8)
>>>>>
>>>>>       at lib/hmap.h:236
>>>>>
>>>>> 236         while (*bucket != node) {
>>>>>
>>>>> (gdb) bt
>>>>>
>>>>> #0  0x0000000000411843 in hmap_remove (node=0x7fd9d402e730,
>>>>> hmap=0x1c3e9e8)
>>>>>
>>>>>       at lib/hmap.h:236
>>>>>
>>>>> #1  update_recirc_rules (bond=0x1c3e920) at ofproto/bond.c:382
>>>>>
>>>>> #2  0x0000000000001ff5 in ?? ()
>>>>>
>>>>> (gdb) p *hmap
>>>>>
>>>>> $1 = {buckets = 0x7fd9d4039b70, one = 0x0, mask = 255, n = 147}
>>>>>
>>>>> (gdb) p *node
>>>>>
>>>>> $2 = {hash = 2450845263, next = 0x0}
>>>>>
>>>>> (gdb) p *bucket
>>>>>
>>>>> Cannot access memory at address 0x8
>>>>>
>>>>> (gdb) p *(hmap->buckets)
>>>>>
>>>>> $3 = (struct hmap_node *) 0x0
>>>>>
>>>>> (gdb) p node->hash & hmap->mask
>>>>>
>>>>> $4 = 79
>>>>>
>>>>> (gdb) p *(&hmap->buckets[79])
>>>>>
>>>>> $5 = (struct hmap_node *) 0x0
>>>>>
>>>>> I managed to reproduce the error using a while loop in which I
>>>>> repeatedly
>>>>> create and destroy a lacp bond, waiting for 15 seconds after each of
>>>>> these
>>>>> operations.
>>>>>
>>>>> The error is not constant in timing and I think that it can be a race
>>>>> condition due to the recirculation process. It happens indeed that when
>>>>> calling hmap_remove() from update_recirc_rules() (in case of DEL), the
>>>>> bucket points to NULL. My idea is the following:
>>>>>
>>>>> hmap_remove() is called from two different places in ofproto/bond.c:
>>>>>
>>>>> 1. from update_recirc_rules() (in which it raises the segfault)
>>>>>
>>>>> hmap_remove(&bond->pr_rule_ops, &pr_op->hmap_node);
>>>>> *pr_op->pr_rule = NULL;
>>>>> free(pr_op);
>>>>>
>>>>> 2. and from bond_unref() (called when I try to delete the bond)
>>>>>
>>>>> HMAP_FOR_EACH_SAFE(pr_op, next_op, hmap_node, &bond->pr_rule_ops) {
>>>>>       hmap_remove(&bond->pr_rule_ops, &pr_op->hmap_node);
>>>>>       free(pr_op);
>>>>> }
>>>>>
>>>>> Now, if these two calls happen at the same time, a conflict on the bond
>>>>> pr_rule may happen. Looking at the code I can see that the
>>>>> recirculation
>>>>> hmap_remove() is executed inside an external lock
>>>>> (ovs_rwlock_wrlock(&rwlock)) if called by bond_rebalance()
>>>>>
>>>>> void bond_rebalance(struct bond *bond) {
>>>>> ...
>>>>> ovs_rwlock_wrlock(&rwlock);
>>>>> ...
>>>>> ...
>>>>> if (use_recirc && rebalanced) {
>>>>>       bond_update_post_recirc_rules(bond,true); <---- hmap_remove()
>>>>> }
>>>>>
>>>>> done:
>>>>>       ovs_rwlock_unlock(&rwlock);
>>>>> }
>>>>>
>>>>> but I can't see any lock when called by output_normal()
>>>>>
>>>>> static void output_normal(struct xlate_ctx *ctx, const struct xbundle
>>>>> *out_xbundle, uint16_t vlan) {
>>>>> ...
>>>>> if (ctx->use_recirc) {
>>>>>       ...
>>>>>       bond_update_post_recirc_rules(out_xbundle->bond, false); <----
>>>>> hmap_remove()
>>>>>       ...
>>>>> }
>>>>> ...
>>>>>
>>>>> The same lock is used in bond_unref() but only for
>>>>> hmap_remove(all_bonds,
>>>>> &bond->hmap_node) and not for hmap_remove(&bond->pr_rule_ops,
>>>>> &pr_op->hmap_node).
>>>>>
>>>>> ...
>>>>> ovs_rwlock_wrlock(&rwlock);
>>>>> hmap_remove(all_bonds, &bond->hmap_node);
>>>>> ovs_rwlock_unlock(&rwlock);
>>>>> ...
>>>>> HMAP_FOR_EACH_SAFE(pr_op, next_op, hmap_node, &bond->pr_rule_ops) {
>>>>>       hmap_remove(&bond->pr_rule_ops, &pr_op->hmap_node);
>>>>>       free(pr_op);
>>>>> }
>>>>> ...
>>>>>
>>>>> I suppose that the goal is to remove any pointer to the bond, from
>>>>> all_bonds, inside the lock and to work on it locally. My doubt is: can
>>>>> the
>>>>> bond be still accessible from somewhere else (let's say from a bundle)?
>>>>> If
>>>>> yes, what happen if a bundle tries to access a bond which was
>>>>> previously
>>>>> removed from all_bonds (let's say from bundle_destroy())?
>>>>>
>>>>> The version of OVS that I am using is openvswitch-2.3.0-7.8312.x86_64.
>>>>>
>>>>> Could you please help me to find the problem?
>>>>>
>>>>>
>>>>> Thank you for the help & kind regards,
>>>>>
>>>>> Salvatore
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> discuss mailing list
>>>>> disc...@openvswitch.org
>>>>> http://openvswitch.org/mailman/listinfo/discuss
>>>>>
>
_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev

Re: [ovs-dev] Fwd: [ovs-discuss] OVS segfault in recirculation

Reply via email to