On Thu, Mar 10, 2016 at 11:24:36PM +0000, Daniele Di Proietto wrote: > ovs-numa reads the CPUs and NUMA nodes configuration from sysfs. > It is only used to decide how to set the affinity of pmd threads. > > This new code just needs get_mempolicy() from libnuma: it is used > to get the NUMA node of a virtual address and it is simply a wrapper > to a system call. Want do you think? Should we implement that in OVS?
Looking at the implementation in libnuma, although it's a wrapper around a system call, it requires a lot of architecture-specific crud. I don't object to using libnuma for this. > If we implement the system call wrapper in ovs-numa we could avoid > depending on libnuma-dev for the build, but (I guess) we would still > need to link with -lnuma, as it is a dependency of the statically linked > DPDK library > > Thanks > > On 10/03/2016 14:38, "Ben Pfaff" <b...@ovn.org> wrote: > > >How does this numa library relate to lib/ovs-numa.[ch]? > > > >On Thu, Mar 10, 2016 at 01:22:42AM +0000, Daniele Di Proietto wrote: > >> Thanks for the patch, I'll put this in the use case list for > >> my series if I need to resend it! > >> > >> It would be nice to get the numa socket information without > >> linking OVS with libnuma, maybe using some DPDK api. From > >> a quick look I didn't find any way, but maybe you know a > >> better way. > >> > >> Some preliminary comments inline > >> > >> On 04/03/2016 02:08, "dev on behalf of Ciara Loftus" > >> <dev-boun...@openvswitch.org on behalf of ciara.lof...@intel.com> wrote: > >> > >> >This commit allows for vHost memory from QEMU, DPDK and OVS, as well > >> >as the servicing PMD, to all come from the same socket. > >> > > >> >DPDK v2.2 introduces a new configuration option: > >> >CONFIG_RTE_LIBRTE_VHOST_NUMA. If enabled, DPDK detects the socket > >> >from which a vhost device's memory has been allocated by QEMU, and > >> >accordingly reallocates device memory managed by DPDK to that same > >> >socket. > >> > > >> >OVS by default sets the socket id of a vhost port to that of the > >> >master lcore. This commit introduces the ability to update the > >> >socket id of the port if it is detected (during VM boot) that the > >> >port memory is not on the default NUMA node. If this is the case, the > >> >mempool of the port is also changed to the new node, and a PMD > >> >thread currently servicing the port will no longer, in favour of a > >> >thread from the new node (if enabled in the CPU mask). > >> > > >> >Signed-off-by: Ciara Loftus <ciara.lof...@intel.com> > >> >--- > >> > INSTALL.DPDK.md | 6 +++++- > >> > acinclude.m4 | 2 +- > >> > lib/netdev-dpdk.c | 25 +++++++++++++++++++++++-- > >> > 3 files changed, 29 insertions(+), 4 deletions(-) > >> > > >> >diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md > >> >index dca79bd..82e6908 100644 > >> >--- a/INSTALL.DPDK.md > >> >+++ b/INSTALL.DPDK.md > >> >@@ -33,6 +33,10 @@ on Debian/Ubuntu) > >> > > >> > `CONFIG_RTE_BUILD_COMBINE_LIBS=y` > >> > > >> >+ Enable NUMA-aware vHost by modifying the following in the same > >>file: > >> >+ > >> >+ `CONFIG_RTE_LIBRTE_VHOST_NUMA=y` > >> >+ > >> > >> I guess we should also update install_dpdk() in ./travis/build.sh to do > >> this if it's required > >> > >> > Then run `make install` to build and install the library. > >> > For default install without IVSHMEM: > >> > > >> >@@ -383,7 +387,7 @@ Performance Tuning: > >> > > >> > It is good practice to ensure that threads that are in the datapath > >>are > >> > pinned to cores in the same NUMA area. e.g. pmd threads and QEMU > >>vCPUs > >> >- responsible for forwarding. > >> >+ responsible for forwarding. This is now default behavior for vHost > >> >ports. > >> > > >> > 9. Rx Mergeable buffers > >> > > >> >diff --git a/acinclude.m4 b/acinclude.m4 > >> >index 11c7787..432bdbd 100644 > >> >--- a/acinclude.m4 > >> >+++ b/acinclude.m4 > >> >@@ -199,7 +199,7 @@ AC_DEFUN([OVS_CHECK_DPDK], [ > >> > found=false > >> > save_LIBS=$LIBS > >> > for extras in "" "-ldl"; do > >> >- LIBS="$DPDK_LIB $extras $save_LIBS $DPDK_EXTRA_LIB" > >> >+ LIBS="$DPDK_LIB $extras $save_LIBS $DPDK_EXTRA_LIB -lnuma" > >> > >> I guess we should also list libnuma-dev in .travis.yml and something > >> similar > >> in rhel/openvswitch-fedora.spec > >> > >> > AC_LINK_IFELSE( > >> > [AC_LANG_PROGRAM([#include <rte_config.h> > >> > #include <rte_eal.h>], > >> >diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c > >> >index 17b8d51..4e1ce53 100644 > >> >--- a/lib/netdev-dpdk.c > >> >+++ b/lib/netdev-dpdk.c > >> >@@ -29,6 +29,7 @@ > >> > #include <stdio.h> > >> > #include <sys/types.h> > >> > #include <sys/stat.h> > >> >+#include <numaif.h> > >> > > >> > #include "dirs.h" > >> > #include "dp-packet.h" > >> >@@ -1878,6 +1879,8 @@ new_device(struct virtio_net *dev) > >> > { > >> > struct netdev_dpdk *netdev; > >> > bool exists = false; > >> >+ int newnode = 0; > >> >+ long err = 0; > >> > > >> > ovs_mutex_lock(&dpdk_mutex); > >> > /* Add device to the vhost port with the same name as that passed > >> >down. */ > >> >@@ -1891,6 +1894,24 @@ new_device(struct virtio_net *dev) > >> > } > >> > ovsrcu_set(&netdev->virtio_dev, dev); > >> > exists = true; > >> >+ > >> >+ /* Get NUMA information */ > >> >+ err = get_mempolicy(&newnode, NULL, 0, dev, MPOL_F_NODE | > >> >MPOL_F_ADDR); > >> >+ if (err) { > >> >+ VLOG_INFO("Error getting NUMA info for vHost Device > >> >'%s'", > >> >+ dev->ifname); > >> >+ newnode = netdev->socket_id; > >> >+ } else if (newnode != netdev->socket_id) { > >> >+ netdev->socket_id = newnode; > >> >+ /* Change mempool to new NUMA Node */ > >> >+ dpdk_mp_put(netdev->dpdk_mp); > >> >+ netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, > >> >netdev->mtu); > >> >+ /* Request netdev reconfiguration. The port may now be > >> >+ * serviced by a PMD on the new node if enabled in the > >> >cpu > >> >+ * mask */ > >> >+ netdev_request_reconfigure(&netdev->up); > >> > >> I think here I would prefer: > >> > >> 1) Remembering the configuration change request > >> netdev->requested_socket_id = newnode > >> 2) Calling netdev_request_reconfigure() > >> 3) In the netdev_dpdk_vhost_user_reconfigure() method: > >> netdev->socket_id = netdev_requested_socket_id > >> > >> Otherwise the datapath might be confused, because it assumes that the > >> socket_id doesn't change while the device is polled by the pmd threads. > >> > >> It's safe to change almost everything inside the > >> netdev_dpdk_vhost_user_reconfigure(), because the device will not be > >> polled by the datapath when the function is called. > >> > >> The same applies for the mempool: the actual change should be done > >> in netdev_dpdk_vhost_user_reconfigure(), because there might still be > >> threads using it. > >> > >> >+ } > >> >+ > >> > dev->flags |= VIRTIO_DEV_RUNNING; > >> > /* Disable notifications. */ > >> > set_irq_status(dev); > >> >@@ -1907,8 +1928,8 @@ new_device(struct virtio_net *dev) > >> > return -1; > >> > } > >> > > >> >- VLOG_INFO("vHost Device '%s' %"PRIu64" has been added", > >>dev->ifname, > >> >- dev->device_fh); > >> >+ VLOG_INFO("vHost Device '%s' %"PRIu64" has been added on socket > >>%i", > >> >+ dev->ifname, dev->device_fh, newnode); > >> > return 0; > >> > } > >> > > >> >-- > >> >2.4.3 > >> > > >> >_______________________________________________ > >> >dev mailing list > >> >dev@openvswitch.org > >> >http://openvswitch.org/mailman/listinfo/dev > >> > >> _______________________________________________ > >> dev mailing list > >> dev@openvswitch.org > >> http://openvswitch.org/mailman/listinfo/dev > _______________________________________________ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev