Thanks, applied to all relevant branches.
On Thu, Jul 10, 2014 at 04:42:31PM -0700, Alex Wang wrote:
> Thanks for the explanation, looks good to me.
>
> I also confirm that it solves the packet duplication issue.
>
> Acked-by: Alex Wang <al...@nicira.com>
>
>
> On Thu, Jul 10, 2014 at 4:03 PM, Ben Pfaff <b...@nicira.com> wrote:
>
> > The Linux kernel Netlink implementation has two races that cause problems
> > for processes that attempt to dump a table in a multithreaded manner.
> >
> > The first race is in the structure of the kernel netlink_recv() function.
> > This function pulls a message from the socket queue and, if there is none,
> > reports EAGAIN:
> >
> >     skb = skb_recv_datagram(sk, flags, noblock, &err);
> >     if (skb == NULL)
> >         goto out;
> >
> > Only if a message is successfully read from the socket queue does the
> > function, toward the end, try to queue up a new message to be dumped:
> >
> >     if (nlk->cb && atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf / 2) {
> >         ret = netlink_dump(sk);
> >         if (ret) {
> >             sk->sk_err = ret;
> >             sk->sk_error_report(sk);
> >         }
> >     }
> >
> > This means that if thread A reads a message from a dump, and thread B
> > attempts to read one before A queues up the next, B will get EAGAIN.
> > Following EAGAIN, then, B needs to wait until A returns to userspace
> > before it tries to read the socket again.  nl_dump_next() already does
> > this, using 'dump->status_seq' (although the need for it has never been
> > explained clearly, to my knowledge).
> >
> > The second race is more serious.  Suppose thread X and thread Y both
> > simultaneously attempt to queue up a new message to be dumped, using the
> > call to netlink_dump() quoted above.  netlink_dump() begins with:
> >
> >     mutex_lock(nlk->cb_mutex);
> >
> >     cb = nlk->cb;
> >     if (cb == NULL) {
> >         err = -EINVAL;
> >         goto errout_skb;
> >     }
> >
> > Suppose that X gets cb_mutex first and finds that the dump is complete.
> > It will therefore, toward the end of netlink_dump(), clear nlk->cb to
> > NULL to indicate that no dump is in progress and release the mutex:
> >
> >     nlk->cb = NULL;
> >     mutex_unlock(nlk->cb_mutex);
> >
> > When Y grabs cb_mutex afterward, it will see that nlk->cb is NULL and
> > return -EINVAL as quoted above.  netlink_recv() stuffs -EINVAL into
> > sk_err, but that error is not reported immediately; instead, it is saved
> > for the next read from the socket.  Since Open vSwitch maintains a pool
> > of Netlink sockets, that next failure can crop up pretty much anywhere.
> > One of the worst places for it to crop up is in the execution of a later
> > transaction (e.g. in nl_sock_transact_multiple__()), because userspace
> > treats Netlink transactions as idempotent and will re-execute them when
> > socket errors occur.  For a transaction that sends a packet, this causes
> > packet duplication, which we actually observed in practice.  (ENOBUFS
> > should actually cause transactions to be re-executed in many cases, but
> > EINVAL should not; this is a separate bug in the userspace netlink code.)
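As a concrete illustration of the fix the patch below applies: every recv()
on the dump socket is serialized behind a per-dump mutex, so only one thread
at a time can enter the kernel path that requeues the next dump message.
The following is a minimal standalone sketch of that pattern; 'struct dump'
and dump_recv() are illustrative names, not the real OVS API (which uses
'struct nl_dump', nl_sock_recv__(), and ovs_mutex):

    /* Minimal sketch of the serialization pattern; the names here are
     * illustrative only, not the real OVS API. */
    #include <pthread.h>
    #include <sys/socket.h>
    #include <sys/types.h>

    struct dump {
        int fd;                    /* Netlink dump socket. */
        pthread_mutex_t mutex;     /* Serializes recv() calls on 'fd'. */
    };

    static ssize_t
    dump_recv(struct dump *dump, void *buf, size_t len)
    {
        ssize_t n;

        /* Hold the mutex across the whole recv() so that two threads
         * cannot race inside the kernel's netlink_dump(): the loser of
         * that race would find nlk->cb == NULL and stuff -EINVAL into
         * sk_err, poisoning a later, unrelated read on this socket. */
        pthread_mutex_lock(&dump->mutex);
        n = recv(dump->fd, buf, len, MSG_DONTWAIT);
        pthread_mutex_unlock(&dump->mutex);

        return n;
    }

With the recv() calls serialized this way, the loser of the old race simply
blocks until the winner has returned to userspace, so the kernel never runs
its requeueing path for two readers of the same dump socket at once.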
> >
> > VMware-BZ: #1283188
> > Reported-and-tested-by: Alex Wang <al...@nicira.com>
> > Signed-off-by: Ben Pfaff <b...@nicira.com>
> > ---
> >  lib/netlink-socket.c | 10 ++++++++++
> >  lib/netlink-socket.h |  2 ++
> >  2 files changed, 12 insertions(+)
> >
> > diff --git a/lib/netlink-socket.c b/lib/netlink-socket.c
> > index 42ba7e1..8f8d603 100644
> > --- a/lib/netlink-socket.c
> > +++ b/lib/netlink-socket.c
> > @@ -714,6 +714,7 @@ nl_dump_start(struct nl_dump *dump, int protocol, const struct ofpbuf *request)
> >      atomic_init(&dump->status, status << 1);
> >      dump->nl_seq = nl_msg_nlmsghdr(request)->nlmsg_seq;
> >      dump->status_seq = seq_create();
> > +    ovs_mutex_init(&dump->mutex);
> >  }
> >
> >  /* Attempts to retrieve another reply from 'dump' into 'buffer'.  'dump' must
> > @@ -756,7 +757,15 @@ nl_dump_next(struct nl_dump *dump, struct ofpbuf *reply, struct ofpbuf *buffer)
> >          return false;
> >      }
> >
> > +    /* Take the mutex here to avoid an in-kernel race.  If two threads try
> > +     * to read from a Netlink dump socket at once, then the socket error
> > +     * can be set to EINVAL, which will be encountered on the next recv on
> > +     * that socket, which could be anywhere due to the way that we pool
> > +     * Netlink sockets.  Serializing the recv calls avoids the issue. */
> > +    ovs_mutex_lock(&dump->mutex);
> >      retval = nl_sock_recv__(dump->sock, buffer, false);
> > +    ovs_mutex_unlock(&dump->mutex);
> > +
> >      if (retval) {
> >          ofpbuf_clear(buffer);
> >          if (retval == EAGAIN) {
> > @@ -835,6 +844,7 @@ nl_dump_done(struct nl_dump *dump)
> >      }
> >      nl_pool_release(dump->sock);
> >      seq_destroy(dump->status_seq);
> > +    ovs_mutex_destroy(&dump->mutex);
> >      return status >> 1;
> >  }
> >
> > diff --git a/lib/netlink-socket.h b/lib/netlink-socket.h
> > index dd32409..8ac201a 100644
> > --- a/lib/netlink-socket.h
> > +++ b/lib/netlink-socket.h
> > @@ -52,6 +52,7 @@
> >  #include <stdint.h>
> >  #include "ofpbuf.h"
> >  #include "ovs-atomic.h"
> > +#include "ovs-thread.h"
> >
> >  struct nl_sock;
> >
> > @@ -114,6 +115,7 @@ struct nl_dump {
> >      atomic_uint status;     /* Low bit set if we read final message.
> >                               * Other bits hold an errno (0 for success). */
> >      struct seq *status_seq; /* Tracks changes to the above 'status'. */
> > +    struct ovs_mutex mutex;
> >  };
> >
> >  void nl_dump_start(struct nl_dump *, int protocol,
> >                     const struct ofpbuf *request);
> > --
> > 1.7.10.4
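
For the first race, the 'dump->status_seq' mechanism mentioned above can be
approximated with an ordinary sequence number guarded by a condition
variable: a thread that hits EAGAIN waits for the value to change before
retrying, and a thread that successfully read a batch bumps the value once
its recv() has returned (i.e. once the kernel has had a chance to queue the
next batch).  This is only a self-contained sketch with hypothetical names
(dump_seq_wait(), dump_seq_change()); the real OVS seq API
(seq_create()/seq_wait()) integrates with OVS's poll loop rather than using
a condition variable:

    /* Sketch of a seq-style "wait for another thread to make progress"
     * primitive; hypothetical names, not the real OVS seq API. */
    #include <pthread.h>
    #include <stdint.h>

    struct dump_seq {
        pthread_mutex_t mutex;
        pthread_cond_t cond;
        uint64_t value;        /* Bumped whenever a reader makes progress. */
    };

    /* Called by a thread whose nonblocking recv() returned EAGAIN:
     * blocks until 'seq' changes from the value sampled before the
     * failed read. */
    static void
    dump_seq_wait(struct dump_seq *seq, uint64_t sampled)
    {
        pthread_mutex_lock(&seq->mutex);
        while (seq->value == sampled) {
            pthread_cond_wait(&seq->cond, &seq->mutex);
        }
        pthread_mutex_unlock(&seq->mutex);
    }

    /* Called by a thread that read a batch successfully, after its
     * recv() has returned to userspace. */
    static void
    dump_seq_change(struct dump_seq *seq)
    {
        pthread_mutex_lock(&seq->mutex);
        seq->value++;
        pthread_cond_broadcast(&seq->cond);
        pthread_mutex_unlock(&seq->mutex);
    }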