Hi Andy,

Yes, it makes a lot of sense!

I applied the patch and I'm running my script again in order to test it. It looks good so far! I will leave the script running all across the weekend and I'll come back to you with my results on Monday.

Thank you for your very fast response and for your effort,

Salvatore

On 13/02/15 02:10, Andy Zhou wrote:
Does not mean to drop the list.


---------- Forwarded message ----------
From: Andy Zhou <az...@nicira.com>
Date: Thu, Feb 12, 2015 at 3:42 PM
Subject: Re: [ovs-discuss] OVS segfault in recirculation
To: Salvatore Cambria <salvatore.camb...@citrix.com>


Hi, Salvatore,

I think I found a bug: hmap_remove needs to be protected by a lock. As
you pointed out, the output_normal() path is not properly protected.
Would you please
try the attached patch to see if it helps?

Bond object is protected by reference counting.  Assuming reference
counter is valid, it should be O.K. for a bundle to hold on to the
object. Notice
bundle_destroy() calls unref_bond().  At any rate, I did not spot
anything obvious there.

Thanks for providing such a detailed description. It is very helpful.

Andy

================================

iff --git a/ofproto/bond.c b/ofproto/bond.c
index c4b3a3a..a9b7a83 100644
--- a/ofproto/bond.c
+++ b/ofproto/bond.c
@@ -923,8 +923,8 @@ bond_may_recirc(const struct bond *bond, uint32_t
*recirc_id,
      }
  }

-void
-bond_update_post_recirc_rules(struct bond* bond, const bool force)
+static void
+bond_update_post_recirc_rules__(struct bond* bond, const bool force)
  {
     struct bond_entry *e;
     bool update_rules = force;  /* Always update rules if caller forces it. */
@@ -945,6 +945,14 @@ bond_update_post_recirc_rules(struct bond* bond,
const bool force)
          update_recirc_rules(bond);
     }
  }
+
+void
+bond_update_post_recirc_rules(struct bond* bond, const bool force)
+{
+    ovs_rwlock_wrlock(&rwlock);
+    bond_update_post_recirc_rules__(bond, force);
+    ovs_rwlock_unlock(&rwlock);
+}
  ^L
  /* Rebalancing. */

@@ -1203,7 +1211,7 @@ bond_rebalance(struct bond *bond)
      }

      if (use_recirc && rebalanced) {
-        bond_update_post_recirc_rules(bond,true);
+        bond_update_post_recirc_rules__(bond,true);
      }

  done:


On Tue, Feb 10, 2015 at 7:57 AM, Salvatore Cambria
<salvatore.camb...@citrix.com> wrote:
Hello,

I am a Software Engineer from Citrix, Cambridge UK office.

We are having a problem when running our nightly test on a XenServer host
with lacp bonds with OVS. Indeed, when deleting the bond the following
segfault error appears (from GDB):

Program terminated with signal 11, Segmentation fault.

#0  0x0000000000411d2f in hmap_remove (node=0x7fd9d402e730, hmap=0x1c3e9e8)

     at lib/hmap.h:236

236         while (*bucket != node) {

(gdb) bt

#0  0x0000000000411843 in hmap_remove (node=0x7fd9d402e730, hmap=0x1c3e9e8)

     at lib/hmap.h:236

#1  update_recirc_rules (bond=0x1c3e920) at ofproto/bond.c:382

#2  0x0000000000001ff5 in ?? ()

(gdb) p *hmap

$1 = {buckets = 0x7fd9d4039b70, one = 0x0, mask = 255, n = 147}

(gdb) p *node

$2 = {hash = 2450845263, next = 0x0}

(gdb) p *bucket

Cannot access memory at address 0x8

(gdb) p *(hmap->buckets)

$3 = (struct hmap_node *) 0x0

(gdb) p node->hash & hmap->mask

$4 = 79

(gdb) p *(&hmap->buckets[79])

$5 = (struct hmap_node *) 0x0

I managed to reproduce the error using a while loop in which I repeatedly
create and destroy a lacp bond, waiting for 15 seconds after each of these
operations.

The error is not constant in timing and I think that it can be a race
condition due to the recirculation process. It happens indeed that when
calling hmap_remove() from update_recirc_rules() (in case of DEL), the
bucket points to NULL. My idea is the following:

hmap_remove() is called from two different places in ofproto/bond.c:

1. from update_recirc_rules() (in which it raises the segfault)

hmap_remove(&bond->pr_rule_ops, &pr_op->hmap_node);
*pr_op->pr_rule = NULL;
free(pr_op);

2. and from bond_unref() (called when I try to delete the bond)

HMAP_FOR_EACH_SAFE(pr_op, next_op, hmap_node, &bond->pr_rule_ops) {
     hmap_remove(&bond->pr_rule_ops, &pr_op->hmap_node);
     free(pr_op);
}

Now, if these two calls happen at the same time, a conflict on the bond
pr_rule may happen. Looking at the code I can see that the recirculation
hmap_remove() is executed inside an external lock
(ovs_rwlock_wrlock(&rwlock)) if called by bond_rebalance()

void bond_rebalance(struct bond *bond) {
...
ovs_rwlock_wrlock(&rwlock);
...
...
if (use_recirc && rebalanced) {
     bond_update_post_recirc_rules(bond,true); <---- hmap_remove()
}

done:
     ovs_rwlock_unlock(&rwlock);
}

but I can't see any lock when called by output_normal()

static void output_normal(struct xlate_ctx *ctx, const struct xbundle
*out_xbundle, uint16_t vlan) {
...
if (ctx->use_recirc) {
     ...
     bond_update_post_recirc_rules(out_xbundle->bond, false); <----
hmap_remove()
     ...
}
...

The same lock is used in bond_unref() but only for hmap_remove(all_bonds,
&bond->hmap_node) and not for hmap_remove(&bond->pr_rule_ops,
&pr_op->hmap_node).

...
ovs_rwlock_wrlock(&rwlock);
hmap_remove(all_bonds, &bond->hmap_node);
ovs_rwlock_unlock(&rwlock);
...
HMAP_FOR_EACH_SAFE(pr_op, next_op, hmap_node, &bond->pr_rule_ops) {
     hmap_remove(&bond->pr_rule_ops, &pr_op->hmap_node);
     free(pr_op);
}
...

I suppose that the goal is to remove any pointer to the bond, from
all_bonds, inside the lock and to work on it locally. My doubt is: can the
bond be still accessible from somewhere else (let's say from a bundle)? If
yes, what happen if a bundle tries to access a bond which was previously
removed from all_bonds (let's say from bundle_destroy())?

The version of OVS that I am using is openvswitch-2.3.0-7.8312.x86_64.

Could you please help me to find the problem?


Thank you for the help & kind regards,

Salvatore



_______________________________________________
discuss mailing list
disc...@openvswitch.org
http://openvswitch.org/mailman/listinfo/discuss


_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev

Reply via email to