Thanks for the confirmation. I have pushed the fix to both master and branch-2.3. Race conditions can some times hard to track down. Glad this is not one of those painful experiences.
On Mon, Feb 16, 2015 at 2:23 AM, Salvatore Cambria <salvatore.camb...@citrix.com> wrote: > Hi Andy, > > I can tell you that this morning the test is still running without any > segfault. The patch seems to fix the problem! > > Thanks for your help, > > Salvatore > > > > On 13/02/15 17:18, Andy Zhou wrote: >> >> Great! looking forward to the test results. >> >> On Fri, Feb 13, 2015 at 3:56 AM, Salvatore Cambria >> <salvatore.camb...@citrix.com> wrote: >>> >>> Hi Andy, >>> >>> Yes, it makes a lot of sense! >>> >>> I applied the patch and I'm running my script again in order to test it. >>> It >>> looks good so far! >>> I will leave the script running all across the weekend and I'll come back >>> to >>> you with my results on Monday. >>> >>> Thank you for your very fast response and for your effort, >>> >>> Salvatore >>> >>> On 13/02/15 02:10, Andy Zhou wrote: >>>> >>>> Does not mean to drop the list. >>>> >>>> >>>> ---------- Forwarded message ---------- >>>> From: Andy Zhou <az...@nicira.com> >>>> Date: Thu, Feb 12, 2015 at 3:42 PM >>>> Subject: Re: [ovs-discuss] OVS segfault in recirculation >>>> To: Salvatore Cambria <salvatore.camb...@citrix.com> >>>> >>>> >>>> Hi, Salvatore, >>>> >>>> I think I found a bug: hmap_remove needs to be protected by a lock. As >>>> you pointed out, the output_normal() path is not properly protected. >>>> Would you please >>>> try the attached patch to see if it helps? >>>> >>>> Bond object is protected by reference counting. Assuming reference >>>> counter is valid, it should be O.K. for a bundle to hold on to the >>>> object. Notice >>>> bundle_destroy() calls unref_bond(). At any rate, I did not spot >>>> anything obvious there. >>>> >>>> Thanks for providing such a detailed description. It is very helpful. >>>> >>>> Andy >>>> >>>> ================================ >>>> >>>> iff --git a/ofproto/bond.c b/ofproto/bond.c >>>> index c4b3a3a..a9b7a83 100644 >>>> --- a/ofproto/bond.c >>>> +++ b/ofproto/bond.c >>>> @@ -923,8 +923,8 @@ bond_may_recirc(const struct bond *bond, uint32_t >>>> *recirc_id, >>>> } >>>> } >>>> >>>> -void >>>> -bond_update_post_recirc_rules(struct bond* bond, const bool force) >>>> +static void >>>> +bond_update_post_recirc_rules__(struct bond* bond, const bool force) >>>> { >>>> struct bond_entry *e; >>>> bool update_rules = force; /* Always update rules if caller >>>> forces >>>> it. */ >>>> @@ -945,6 +945,14 @@ bond_update_post_recirc_rules(struct bond* bond, >>>> const bool force) >>>> update_recirc_rules(bond); >>>> } >>>> } >>>> + >>>> +void >>>> +bond_update_post_recirc_rules(struct bond* bond, const bool force) >>>> +{ >>>> + ovs_rwlock_wrlock(&rwlock); >>>> + bond_update_post_recirc_rules__(bond, force); >>>> + ovs_rwlock_unlock(&rwlock); >>>> +} >>>> ^L >>>> /* Rebalancing. */ >>>> >>>> @@ -1203,7 +1211,7 @@ bond_rebalance(struct bond *bond) >>>> } >>>> >>>> if (use_recirc && rebalanced) { >>>> - bond_update_post_recirc_rules(bond,true); >>>> + bond_update_post_recirc_rules__(bond,true); >>>> } >>>> >>>> done: >>>> >>>> >>>> On Tue, Feb 10, 2015 at 7:57 AM, Salvatore Cambria >>>> <salvatore.camb...@citrix.com> wrote: >>>>> >>>>> Hello, >>>>> >>>>> I am a Software Engineer from Citrix, Cambridge UK office. >>>>> >>>>> We are having a problem when running our nightly test on a XenServer >>>>> host >>>>> with lacp bonds with OVS. Indeed, when deleting the bond the following >>>>> segfault error appears (from GDB): >>>>> >>>>> Program terminated with signal 11, Segmentation fault. >>>>> >>>>> #0 0x0000000000411d2f in hmap_remove (node=0x7fd9d402e730, >>>>> hmap=0x1c3e9e8) >>>>> >>>>> at lib/hmap.h:236 >>>>> >>>>> 236 while (*bucket != node) { >>>>> >>>>> (gdb) bt >>>>> >>>>> #0 0x0000000000411843 in hmap_remove (node=0x7fd9d402e730, >>>>> hmap=0x1c3e9e8) >>>>> >>>>> at lib/hmap.h:236 >>>>> >>>>> #1 update_recirc_rules (bond=0x1c3e920) at ofproto/bond.c:382 >>>>> >>>>> #2 0x0000000000001ff5 in ?? () >>>>> >>>>> (gdb) p *hmap >>>>> >>>>> $1 = {buckets = 0x7fd9d4039b70, one = 0x0, mask = 255, n = 147} >>>>> >>>>> (gdb) p *node >>>>> >>>>> $2 = {hash = 2450845263, next = 0x0} >>>>> >>>>> (gdb) p *bucket >>>>> >>>>> Cannot access memory at address 0x8 >>>>> >>>>> (gdb) p *(hmap->buckets) >>>>> >>>>> $3 = (struct hmap_node *) 0x0 >>>>> >>>>> (gdb) p node->hash & hmap->mask >>>>> >>>>> $4 = 79 >>>>> >>>>> (gdb) p *(&hmap->buckets[79]) >>>>> >>>>> $5 = (struct hmap_node *) 0x0 >>>>> >>>>> I managed to reproduce the error using a while loop in which I >>>>> repeatedly >>>>> create and destroy a lacp bond, waiting for 15 seconds after each of >>>>> these >>>>> operations. >>>>> >>>>> The error is not constant in timing and I think that it can be a race >>>>> condition due to the recirculation process. It happens indeed that when >>>>> calling hmap_remove() from update_recirc_rules() (in case of DEL), the >>>>> bucket points to NULL. My idea is the following: >>>>> >>>>> hmap_remove() is called from two different places in ofproto/bond.c: >>>>> >>>>> 1. from update_recirc_rules() (in which it raises the segfault) >>>>> >>>>> hmap_remove(&bond->pr_rule_ops, &pr_op->hmap_node); >>>>> *pr_op->pr_rule = NULL; >>>>> free(pr_op); >>>>> >>>>> 2. and from bond_unref() (called when I try to delete the bond) >>>>> >>>>> HMAP_FOR_EACH_SAFE(pr_op, next_op, hmap_node, &bond->pr_rule_ops) { >>>>> hmap_remove(&bond->pr_rule_ops, &pr_op->hmap_node); >>>>> free(pr_op); >>>>> } >>>>> >>>>> Now, if these two calls happen at the same time, a conflict on the bond >>>>> pr_rule may happen. Looking at the code I can see that the >>>>> recirculation >>>>> hmap_remove() is executed inside an external lock >>>>> (ovs_rwlock_wrlock(&rwlock)) if called by bond_rebalance() >>>>> >>>>> void bond_rebalance(struct bond *bond) { >>>>> ... >>>>> ovs_rwlock_wrlock(&rwlock); >>>>> ... >>>>> ... >>>>> if (use_recirc && rebalanced) { >>>>> bond_update_post_recirc_rules(bond,true); <---- hmap_remove() >>>>> } >>>>> >>>>> done: >>>>> ovs_rwlock_unlock(&rwlock); >>>>> } >>>>> >>>>> but I can't see any lock when called by output_normal() >>>>> >>>>> static void output_normal(struct xlate_ctx *ctx, const struct xbundle >>>>> *out_xbundle, uint16_t vlan) { >>>>> ... >>>>> if (ctx->use_recirc) { >>>>> ... >>>>> bond_update_post_recirc_rules(out_xbundle->bond, false); <---- >>>>> hmap_remove() >>>>> ... >>>>> } >>>>> ... >>>>> >>>>> The same lock is used in bond_unref() but only for >>>>> hmap_remove(all_bonds, >>>>> &bond->hmap_node) and not for hmap_remove(&bond->pr_rule_ops, >>>>> &pr_op->hmap_node). >>>>> >>>>> ... >>>>> ovs_rwlock_wrlock(&rwlock); >>>>> hmap_remove(all_bonds, &bond->hmap_node); >>>>> ovs_rwlock_unlock(&rwlock); >>>>> ... >>>>> HMAP_FOR_EACH_SAFE(pr_op, next_op, hmap_node, &bond->pr_rule_ops) { >>>>> hmap_remove(&bond->pr_rule_ops, &pr_op->hmap_node); >>>>> free(pr_op); >>>>> } >>>>> ... >>>>> >>>>> I suppose that the goal is to remove any pointer to the bond, from >>>>> all_bonds, inside the lock and to work on it locally. My doubt is: can >>>>> the >>>>> bond be still accessible from somewhere else (let's say from a bundle)? >>>>> If >>>>> yes, what happen if a bundle tries to access a bond which was >>>>> previously >>>>> removed from all_bonds (let's say from bundle_destroy())? >>>>> >>>>> The version of OVS that I am using is openvswitch-2.3.0-7.8312.x86_64. >>>>> >>>>> Could you please help me to find the problem? >>>>> >>>>> >>>>> Thank you for the help & kind regards, >>>>> >>>>> Salvatore >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> discuss mailing list >>>>> disc...@openvswitch.org >>>>> http://openvswitch.org/mailman/listinfo/discuss >>>>> > _______________________________________________ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev