On Wed, Sep 21, 2011 at 11:02:03PM -0700, Jesse Gross wrote: > On Wed, Sep 21, 2011 at 9:37 PM, Ben Pfaff <b...@nicira.com> wrote: > > On Mon, Sep 19, 2011 at 03:00:04PM -0700, Jesse Gross wrote: > >> In the future, the kernel will use unicast messages instead of > >> multicast to send upcalls. ??As a result, we need to be able to > >> tell it where to direct the traffic. ??This adds a function to expose > >> the Netlink pid of a socket so it can be included in messages to the > >> kernel. > > > > The Netlink socket library in OVS has logic to keep fds used for Netlink > > dump operations separate from fds used for multicast reception, because > > we don't want reading dump output to discard an arbitrary number of > > possibly interleaved multicast messages. > > > > Even though it's not technically multicast, this new way of receiving > > upcalls is still asynchronous and could still have the same issue as a > > socket subscribed to a multicast group. > > That's a good point although you'll also have similar issues if you > execute transactions on the same socket so it's still necessary to > segregate operations on a single socket (as is done here). We could > try to deal with that too although I'm not sure it's worth it and all > these games with swapping FDs behind the scenes makes me somewhat > nervous.
The biggest reason for the fd-swapping games is to overcome an awkward limitation of Netlink dump operations, which is that you can only have one of them going on at a time on a single Netlink socket. So either the Netlink clients like dpif-linux have to deal with this themselves, by keeping track of when a dump is going on and opening new sockets when another one starts, or the netlink library has to do it. I think it's less error-prone to have the logic in just one place, instead of spreading it out among all the clients. Note that dpif-linux doesn't read all of a dump's output in one go. It exposes an iterator interface to its clients and reads additional Netlink packets as the client consumes the existing ones. It would waste a lot of memory to consume the whole flow dump at once. This problem is admittedly a bit of a corner case. It's not too common to want to do two dumps at the same time. (In my opinion, that's another good reason to centralize the logic; if it's distributed and redundant across clients that rarely encounter the problem, then if they ever do we'll be running untested code.) The next biggest reason is much less of a corner case: if there's an ongoing dump operation on a netlink fd, then simple request-reply transactions on the same fd will screw up the dump, by skipping past some (perhaps all) of the dump replies. We could store those dump replies, but that requires an unbounded amount of memory. Using a separate fd avoids the problem. Keeping multicast traffic separate from other traffic is way down the importance scale, since it's easy for clients to segregate the nl_socks they're using for transactions and dumps from the nl_socks they're using for receiving multicast traffic. In fact, it looks like all the existing clients in OVS already do that. I think that it is necessary for correctness, so I guess that we should simply declare that a socket that receives asynchronous messages must not be used for transactions or dumps. Here's a patch that does that. After this is applied, I agree that your original patch is fine. --8<--------------------------cut here-------------------------->8-- From: Ben Pfaff <b...@nicira.com> Date: Thu, 22 Sep 2011 10:37:59 -0700 Subject: [PATCH] netlink-socket: Async notifications are incompatible with other operations. A Netlink socket that receives asynchronous notifications (e.g. from a multicast group) cannot be used for transactions or dumps, because those operations would discard asynchronous messages that arrive while waiting for replies. This commit documents this issue in a comment on nl_sock_join_mcgroup(). It also removes an internal attempt to avoid mixing multicast reception with other operations. The attempt was incomplete, because it only handled dumps even though ordinary transactions are also problematic. It seems better to remove it than to fix it because, first, all of the existing users in OVS already separate multicast reception from other operations and, second, an upcoming commit will start using unicast Netlink for asynchronous notifications, which has the same issues but doesn't use nl_sock_join_mcgroup(). --- lib/netlink-socket.c | 15 ++++++++------- 1 files changed, 8 insertions(+), 7 deletions(-) diff --git a/lib/netlink-socket.c b/lib/netlink-socket.c index f4635c8..085918d 100644 --- a/lib/netlink-socket.c +++ b/lib/netlink-socket.c @@ -62,7 +62,6 @@ struct nl_sock int fd; uint32_t pid; int protocol; - bool any_groups; struct nl_dump *dump; }; @@ -92,7 +91,6 @@ nl_sock_create(int protocol, struct nl_sock **sockp) goto error; } sock->protocol = protocol; - sock->any_groups = false; sock->dump = NULL; retval = alloc_pid(&sock->pid); @@ -164,6 +162,10 @@ nl_sock_destroy(struct nl_sock *sock) /* Tries to add 'sock' as a listener for 'multicast_group'. Returns 0 if * successful, otherwise a positive errno value. * + * A socket that is subscribed to a multicast group that receives asynchronous + * notifications must not be used for Netlink transactions or dumps, because + * transactions and dumps can cause notifications to be lost. + * * Multicast group numbers are always positive. * * It is not an error to attempt to join a multicast group to which a socket @@ -181,7 +183,6 @@ nl_sock_join_mcgroup(struct nl_sock *sock, unsigned int multicast_group) multicast_group, strerror(errno)); return errno; } - sock->any_groups = true; return 0; } @@ -536,10 +537,10 @@ nl_dump_start(struct nl_dump *dump, nlmsghdr->nlmsg_flags |= NLM_F_DUMP | NLM_F_ACK; dump->seq = nlmsghdr->nlmsg_seq; dump->buffer = NULL; - if (sock->any_groups || sock->dump) { - /* 'sock' might belong to some multicast group, or it already has an - * ongoing dump. Clone the socket to avoid possibly intermixing - * multicast messages or previous dump results with our results. */ + if (sock->dump) { + /* 'sock' already has an ongoing dump. Clone the socket to avoid + * possibly intermixing multicast messages or previous dump results + * with our results. */ dump->status = nl_sock_clone(sock, &dump->sock); if (dump->status) { return; -- 1.7.4.4 _______________________________________________ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev