On Thu, Mar 10, 2016 at 11:24:36PM +0000, Daniele Di Proietto wrote:
> ovs-numa reads the CPUs and NUMA nodes configuration from sysfs.
> It is only used to decide how to set the affinity of pmd threads.
> 
> This new code just needs get_mempolicy() from libnuma: it is used
> to get the NUMA node of a virtual address and it is simply a wrapper
> to a system call.  What do you think? Should we implement that in OVS?

Looking at the implementation in libnuma, although it's a wrapper around
a system call, it requires a lot of architecture-specific crud.  I don't
object to using libnuma for this.

> If we implement the system call wrapper in ovs-numa we could avoid
> depending on libnuma-dev for the build, but (I guess) we would still
> need to link with -lnuma, as it is a dependency of the statically linked
> DPDK library
> 
> Thanks
> 
> On 10/03/2016 14:38, "Ben Pfaff" <b...@ovn.org> wrote:
> 
> >How does this numa library relate to lib/ovs-numa.[ch]?
> >
> >On Thu, Mar 10, 2016 at 01:22:42AM +0000, Daniele Di Proietto wrote:
> >> Thanks for the patch, I'll put this in the use case list for
> >> my series if I need to resend it!
> >> 
> >> It would be nice to get the numa socket information without
> >> linking OVS with libnuma, maybe using some DPDK api. From
> >> a quick look I didn't find any way, but maybe you know a
> >> better way.
> >> 
> >> Some preliminary comments inline
> >> 
> >> On 04/03/2016 02:08, "dev on behalf of Ciara Loftus"
> >> <dev-boun...@openvswitch.org on behalf of ciara.lof...@intel.com> wrote:
> >> 
> >> >This commit allows for vHost memory from QEMU, DPDK and OVS, as well
> >> >as the servicing PMD, to all come from the same socket.
> >> >
> >> >DPDK v2.2 introduces a new configuration option:
> >> >CONFIG_RTE_LIBRTE_VHOST_NUMA. If enabled, DPDK detects the socket
> >> >from which a vhost device's memory has been allocated by QEMU, and
> >> >accordingly reallocates device memory managed by DPDK to that same
> >> >socket.
> >> >
> >> >OVS by default sets the socket id of a vhost port to that of the
> >> >master lcore. This commit introduces the ability to update the
> >> >socket id of the port if it is detected (during VM boot) that the
> >> >port memory is not on the default NUMA node. If this is the case, the
> >> >mempool of the port is also changed to the new node, and a PMD
> >> >thread currently servicing the port will no longer do so, in favour of
> >> >a thread from the new node (if enabled in the CPU mask).
> >> >
> >> >Signed-off-by: Ciara Loftus <ciara.lof...@intel.com>
> >> >---
> >> > INSTALL.DPDK.md   |  6 +++++-
> >> > acinclude.m4      |  2 +-
> >> > lib/netdev-dpdk.c | 25 +++++++++++++++++++++++--
> >> > 3 files changed, 29 insertions(+), 4 deletions(-)
> >> >
> >> >diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
> >> >index dca79bd..82e6908 100644
> >> >--- a/INSTALL.DPDK.md
> >> >+++ b/INSTALL.DPDK.md
> >> >@@ -33,6 +33,10 @@ on Debian/Ubuntu)
> >> > 
> >> >      `CONFIG_RTE_BUILD_COMBINE_LIBS=y`
> >> > 
> >> >+     Enable NUMA-aware vHost by modifying the following in the same file:
> >> >+
> >> >+     `CONFIG_RTE_LIBRTE_VHOST_NUMA=y`
> >> >+
> >> 
> >> I guess we should also update install_dpdk() in ./travis/build.sh to do
> >> this if it's required
> >> 
> >> >      Then run `make install` to build and install the library.
> >> >      For default install without IVSHMEM:
> >> > 
> >> >@@ -383,7 +387,7 @@ Performance Tuning:
> >> > 
> >> >  It is good practice to ensure that threads that are in the datapath are
> >> >  pinned to cores in the same NUMA area. e.g. pmd threads and QEMU vCPUs
> >> >- responsible for forwarding.
> >> >+ responsible for forwarding. This is now default behavior for vHost ports.
> >> > 
> >> >   9. Rx Mergeable buffers
> >> > 
> >> >diff --git a/acinclude.m4 b/acinclude.m4
> >> >index 11c7787..432bdbd 100644
> >> >--- a/acinclude.m4
> >> >+++ b/acinclude.m4
> >> >@@ -199,7 +199,7 @@ AC_DEFUN([OVS_CHECK_DPDK], [
> >> >     found=false
> >> >     save_LIBS=$LIBS
> >> >     for extras in "" "-ldl"; do
> >> >-        LIBS="$DPDK_LIB $extras $save_LIBS $DPDK_EXTRA_LIB"
> >> >+        LIBS="$DPDK_LIB $extras $save_LIBS $DPDK_EXTRA_LIB -lnuma"
> >> 
> >> I guess we should also list libnuma-dev in .travis.yml and something
> >> similar in rhel/openvswitch-fedora.spec
> >> 
> >> >         AC_LINK_IFELSE(
> >> >            [AC_LANG_PROGRAM([#include <rte_config.h>
> >> >                              #include <rte_eal.h>],
> >> >diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> >> >index 17b8d51..4e1ce53 100644
> >> >--- a/lib/netdev-dpdk.c
> >> >+++ b/lib/netdev-dpdk.c
> >> >@@ -29,6 +29,7 @@
> >> > #include <stdio.h>
> >> > #include <sys/types.h>
> >> > #include <sys/stat.h>
> >> >+#include <numaif.h>
> >> > 
> >> > #include "dirs.h"
> >> > #include "dp-packet.h"
> >> >@@ -1878,6 +1879,8 @@ new_device(struct virtio_net *dev)
> >> > {
> >> >     struct netdev_dpdk *netdev;
> >> >     bool exists = false;
> >> >+    int newnode = 0;
> >> >+    long err = 0;
> >> > 
> >> >     ovs_mutex_lock(&dpdk_mutex);
> >> >     /* Add device to the vhost port with the same name as that passed down. */
> >> >@@ -1891,6 +1894,24 @@ new_device(struct virtio_net *dev)
> >> >             }
> >> >             ovsrcu_set(&netdev->virtio_dev, dev);
> >> >             exists = true;
> >> >+
> >> >+            /* Get NUMA information */
> >> >+            err = get_mempolicy(&newnode, NULL, 0, dev, MPOL_F_NODE | MPOL_F_ADDR);
> >> >+            if (err) {
> >> >+                VLOG_INFO("Error getting NUMA info for vHost Device '%s'",
> >> >+                        dev->ifname);
> >> >+                newnode = netdev->socket_id;
> >> >+            } else if (newnode != netdev->socket_id) {
> >> >+                netdev->socket_id = newnode;
> >> >+                /* Change mempool to new NUMA Node */
> >> >+                dpdk_mp_put(netdev->dpdk_mp);
> >> >+                netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, netdev->mtu);
> >> >+                /* Request netdev reconfiguration. The port may now be
> >> >+                 * serviced by a PMD on the new node if enabled in the cpu
> >> >+                 * mask */
> >> >+                netdev_request_reconfigure(&netdev->up);
> >> 
> >> I think here I would prefer:
> >> 
> >> 1) Remembering the configuration change request
> >>     netdev->requested_socket_id = newnode
> >> 2) Calling netdev_request_reconfigure()
> >> 3) In the netdev_dpdk_vhost_user_reconfigure() method:
> >>     netdev->socket_id = netdev->requested_socket_id
> >> 
> >> Otherwise the datapath might be confused, because it assumes that the
> >> socket_id doesn't change while the device is polled by the pmd threads.
> >> 
> >> It's safe to change almost everything inside the
> >> netdev_dpdk_vhost_user_reconfigure(), because the device will not be
> >> polled by the datapath when the function is called.
> >> 
> >> The same applies for the mempool: the actual change should be done
> >> in netdev_dpdk_vhost_user_reconfigure(), because there might still be
> >> threads using it.
> >> 
> >> >+            }
> >> >+
> >> >             dev->flags |= VIRTIO_DEV_RUNNING;
> >> >             /* Disable notifications. */
> >> >             set_irq_status(dev);
> >> >@@ -1907,8 +1928,8 @@ new_device(struct virtio_net *dev)
> >> >         return -1;
> >> >     }
> >> > 
> >> >-    VLOG_INFO("vHost Device '%s' %"PRIu64" has been added", dev->ifname,
> >> >-              dev->device_fh);
> >> >+    VLOG_INFO("vHost Device '%s' %"PRIu64" has been added on socket %i",
> >> >+              dev->ifname, dev->device_fh, newnode);
> >> >     return 0;
> >> > }
> >> > 
> >> >-- 
> >> >2.4.3
> >> >
> >> >_______________________________________________
> >> >dev mailing list
> >> >dev@openvswitch.org
> >> >http://openvswitch.org/mailman/listinfo/dev
> >> 
> 