> 
> > However, there is a bit of a problem with this model as it currently
> > stands, which is largely independent of this patch and was raised by
> > Daniel above: freeing mbufs in the guest may cause packet corruption.
> > This was an issue with DPDK (I think it still is). When returning an
> > mbuf to the mempool, DPDK will first try to return the mbuf to a
> > per-lcore cache, and that per-lcore cache gets periodically flushed to
> > the central mempool. The per-lcore cache is not thread-safe, but the
> > central mempool is.
> > This is not a major issue when everything is running on the host, but
> > when you run a DPDK application in the guest you potentially have two
> > DPDK threads with an lcore id of 0 (one in the host and one in the
> > guest!). You can hit a situation where both threads simultaneously try
> > to access the per-lcore cache for lcore 0 and corrupt it.
> 
> Is this just an issue if two threads share lcore 0, or would the same thing
> happen if two threads share lcore 1?  Would a hacky solution be to just
> somehow detect overlapping lcore ids between the guest and the host, and
> avoid them?

If any cores share an lcore id, you'll see something similar. The solution you
describe above should work. You could even set the lcore_id that is kept in
thread-local storage, which would let you choose which mempool cache gets used.
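
For what it's worth, a rough sketch of that hack is below. The lcore id used
and the direct write to the EAL's per-thread variable are my own assumptions,
so treat it as an illustration of the idea rather than a recommendation:

    /* Sketch only: force the guest's polling thread onto a per-lcore cache
     * slot that the host is known not to use, so the two processes never
     * touch the same (non-thread-safe) cache.  GUEST_LCORE_ID is made up
     * here; it must be < RTE_MAX_LCORE and must avoid every lcore used on
     * the host. */
    #include <rte_lcore.h>
    #include <rte_per_lcore.h>

    #define GUEST_LCORE_ID 6        /* hypothetical: unused by the host */

    static void
    guest_pmd_thread_init(void)
    {
        /* rte_lcore_id() reads this thread-local variable, and the mempool
         * get/put fast path uses it to pick the per-lcore cache slot. */
        RTE_PER_LCORE(_lcore_id) = GUEST_LCORE_ID;
    }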

We've asked the dpdk guys to have a look at this, so maybe a better solution
will appear in DPDK.
> 
> > When we did something similar previously, we approached it differently
> > by adding ivshmem support directly to DPDK (there is an ivshmem
> > target). Using this, you are able to select which dpdk objects you
> > would like to share from the host to the guest. For example, you would
> > say "I want to share this ring and this mempool".
> > Then, the DPDK process running in the guest would automatically detect
> > that it was running in a virtualized environment with ivshmem (it
> > would detect the ivshmem pci device) and allow the guest application
> > to get those objects. In this model, you have better security (you
> > don’t need to share an entire page - you can just share objects), and
> > you can run a primary process in the guest (even if the host process
> > is also a primary process).
> > However, you still have the problem with per-lcore cache corruption
> > described above.
> > The way you can resolve this is:
> 
> > a) don’t ever free (or allocate) packets in the guest
> 
> This doesn't seem feasible in general.  It's somewhat of an odd requirement
> that I suspect would be hard to enforce.

It's pretty hard to get a software networking application to never drop a
packet (and dropping a packet means freeing the mbuf)!
> 
> > b) create a free ring and an alloc ring that get handled in the host. If
> > the guest needs an mbuf, it would request one from the alloc ring. If it
> > needs to free an mbuf, it would push it to the free ring.
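
To make (b) a bit more concrete, a minimal sketch of the guest side is below.
The ring names and helper functions are made up, and it assumes the host
creates both rings and services them (refilling the alloc ring from the
mempool and freeing whatever lands on the free ring), so only the host ever
touches the mempool and its per-lcore caches:

    #include <errno.h>
    #include <rte_ring.h>
    #include <rte_mbuf.h>

    static struct rte_ring *alloc_ring;   /* host -> guest: fresh mbufs   */
    static struct rte_ring *free_ring;    /* guest -> host: mbufs to free */

    static int
    guest_rings_init(void)
    {
        alloc_ring = rte_ring_lookup("guest_alloc");   /* hypothetical names */
        free_ring = rte_ring_lookup("guest_free");
        return (alloc_ring && free_ring) ? 0 : -1;
    }

    /* Used instead of rte_pktmbuf_alloc() in the guest. */
    static struct rte_mbuf *
    guest_mbuf_alloc(void)
    {
        void *m = NULL;

        if (rte_ring_dequeue(alloc_ring, &m) < 0) {
            return NULL;        /* host has not refilled the ring yet */
        }
        return m;
    }

    /* Used instead of rte_pktmbuf_free() in the guest. */
    static void
    guest_mbuf_free(struct rte_mbuf *m)
    {
        while (rte_ring_enqueue(free_ring, m) == -ENOBUFS) {
            ;                   /* free ring full; host will drain it */
        }
    }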
> 
> This seems like a lot of work.  Do we really care that much about IVSHMEM?
> My impression was that we're more-or-less standardizing around vhost-user.
> My inclination would be to drop the patch instead of implementing this logic
> unless there's consensus that IVSHMEM is going to be a relevant IO
> mechanism in the future.
> 

I think ivshmem is very useful for communicating with another application on
the host - acting as some kind of service port - and that use case wouldn’t
hit this issue.

If you move ahead with this patch, there are some comments in the code below
that you should look at.

> > However, maybe for your use-case this isn’t an issue?
> 
> Yeah, for our (Melvin + me) specific use case we'll just do (a) and not free
> or allocate packets; shouldn't be too hard for us.
> 
> Ethan
> 
> >
> >> >
> >> >Signed-off-by: Melvin Walls <mwall...@gmail.com>
> >> >Signed-off-by: Ethan Jackson <et...@nicira.com>
> >> >---
> >> > lib/netdev-dpdk.c | 158 +++++++++++++++++++++++++++++++++++++++++-------------
> >> > 1 file changed, 122 insertions(+), 36 deletions(-)
> >> >
> >> >diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> >> >index 5ae805e..5abe90f 100644
> >> >--- a/lib/netdev-dpdk.c
> >> >+++ b/lib/netdev-dpdk.c
> >> >@@ -227,6 +227,10 @@ struct netdev_dpdk {
> >> >     /* Identifier used to distinguish vhost devices from each other */
> >> >     char vhost_id[PATH_MAX];
> >> >
> >> >+    /* Rings for secondary processes in IVSHMEM setups, NULL otherwise */
> >> >+    struct rte_ring *rx_ring;
> >> >+    struct rte_ring *tx_ring;
> >> >+
> >> >     /* In dpdk_list. */
> >> >     struct ovs_list list_node OVS_GUARDED_BY(dpdk_mutex);
> >> > };
> >> >@@ -340,12 +344,16 @@ dpdk_mp_get(int socket_id, int mtu) OVS_REQUIRES(dpdk_mutex)
> >> >             return NULL;
> >> >         }
> >> >
> >> >-        dmp->mp = rte_mempool_create(mp_name, mp_size, MBUF_SIZE(mtu),
> >> >-                                     MP_CACHE_SZ,
> >> >-                                     sizeof(struct rte_pktmbuf_pool_private),
> >> >-                                     rte_pktmbuf_pool_init, NULL,
> >> >-                                     ovs_rte_pktmbuf_init, NULL,
> >> >-                                     socket_id, 0);
> >> >+        if (rte_eal_process_type() == RTE_PROC_PRIMARY) {
> >> >+            dmp->mp = rte_mempool_create(mp_name, mp_size, MBUF_SIZE(mtu),
> >> >+                                         MP_CACHE_SZ,
> >> >+                                         sizeof(struct rte_pktmbuf_pool_private),
> >> >+                                         rte_pktmbuf_pool_init, NULL,
> >> >+                                         ovs_rte_pktmbuf_init, NULL,
> >> >+                                         socket_id, 0);
> >> >+        } else {
> >> >+            dmp->mp = rte_mempool_lookup(mp_name);
> >> >+        }
> >
> > Does this not break when someone sets the mtu?
> >
> > On a side note, I saw some DPDK patches on dpdk.org that will enable
> > the freeing of a memzone!
> >
> >> >     } while (!dmp->mp && rte_errno == ENOMEM && (mp_size /= 2) >= MIN_NB_MBUF);
> >> >
> >> >     if (dmp->mp == NULL) {
> >> >@@ -439,39 +447,41 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev) OVS_REQUIRES(dpdk_mutex)
> >> >     dev->up.n_rxq = MIN(info.max_rx_queues, dev->up.n_rxq);
> >> >     dev->real_n_txq = MIN(info.max_tx_queues, dev->up.n_txq);
> >> >
> >> >-    diag = rte_eth_dev_configure(dev->port_id, dev->up.n_rxq, dev->real_n_txq,
> >> >-                                 &port_conf);
> >> >-    if (diag) {
> >> >-        VLOG_ERR("eth dev config error %d. rxq:%d txq:%d", diag, dev->up.n_rxq,
> >> >-                 dev->real_n_txq);
> >> >-        return -diag;
> >> >-    }
> >> >-
> >> >-    for (i = 0; i < dev->real_n_txq; i++) {
> >> >-        diag = rte_eth_tx_queue_setup(dev->port_id, i, NIC_PORT_TX_Q_SIZE,
> >> >-                                      dev->socket_id, NULL);
> >> >+    if (rte_eal_process_type() == RTE_PROC_PRIMARY) {
> >> >+        diag = rte_eth_dev_configure(dev->port_id, dev->up.n_rxq, dev->real_n_txq,
> >> >+                                     &port_conf);
> >> >         if (diag) {
> >> >-            VLOG_ERR("eth dev tx queue setup error %d",diag);
> >> >+            VLOG_ERR("eth dev config error %d. rxq:%d txq:%d", diag, dev->up.n_rxq,
> >> >+                     dev->real_n_txq);
> >> >             return -diag;
> >> >         }
> >> >-    }
> >> >
> >> >-    for (i = 0; i < dev->up.n_rxq; i++) {
> >> >-        diag = rte_eth_rx_queue_setup(dev->port_id, i, NIC_PORT_RX_Q_SIZE,
> >> >-                                      dev->socket_id,
> >> >-                                      NULL, dev->dpdk_mp->mp);
> >> >+        for (i = 0; i < dev->real_n_txq; i++) {
> >> >+            diag = rte_eth_tx_queue_setup(dev->port_id, i, NIC_PORT_TX_Q_SIZE,
> >> >+                                          dev->socket_id, NULL);
> >> >+            if (diag) {
> >> >+                VLOG_ERR("eth dev tx queue setup error %d",diag);
> >> >+                return -diag;
> >> >+            }
> >> >+        }
> >> >+
> >> >+        for (i = 0; i < dev->up.n_rxq; i++) {
> >> >+            diag = rte_eth_rx_queue_setup(dev->port_id, i, NIC_PORT_RX_Q_SIZE,
> >> >+                                          dev->socket_id,
> >> >+                                          NULL, dev->dpdk_mp->mp);
> >> >+            if (diag) {
> >> >+                VLOG_ERR("eth dev rx queue setup error %d",diag);
> >> >+                return -diag;
> >> >+            }
> >> >+        }
> >> >+
> >> >+        diag = rte_eth_dev_start(dev->port_id);
> >> >         if (diag) {
> >> >-            VLOG_ERR("eth dev rx queue setup error %d",diag);
> >> >+            VLOG_ERR("eth dev start error %d",diag);
> >> >             return -diag;
> >> >         }
> >
> > I don’t think you should enable this. Are you suggesting that a
> > primary process could initialize the NIC, but the vswitch would be the
> > process that actually uses the NIC?
> >
> > I think that when the vswitch is started as a secondary process it
> > should not be allowed to add physical ports at all.
> >
> >> >     }
> >> >
> >> >-    diag = rte_eth_dev_start(dev->port_id);
> >> >-    if (diag) {
> >> >-        VLOG_ERR("eth dev start error %d",diag);
> >> >-        return -diag;
> >> >-    }
> >> >-
> >> >     rte_eth_promiscuous_enable(dev->port_id);
> >> >     rte_eth_allmulticast_enable(dev->port_id);
> >> >
> >> >@@ -532,6 +542,8 @@ netdev_dpdk_init(struct netdev *netdev_, unsigned int port_no,
> >> >     OVS_REQUIRES(dpdk_mutex)
> >> > {
> >> >     struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
> >> >+    char *rxq_name = xasprintf("%s_tx", netdev->up.name);
> >> >+    char *txq_name = xasprintf("%s_rx", netdev->up.name);
> >
> > I think this patch would be more useful if you could give the name of
> > the ring you want to connect to explicitly, rather than just the prefix
> > - maybe as an option to the add-port command?
> >
> >> >     int sid;
> >> >     int err = 0;
> >> >
> >> >@@ -574,6 +586,19 @@ netdev_dpdk_init(struct netdev *netdev_, unsigned int port_no,
> >> >         }
> >> >     }
> >> >
> >> >+    if (rte_eal_process_type() == RTE_PROC_PRIMARY) {
> >> >+        netdev->rx_ring = netdev->tx_ring = NULL;
> >> >+    } else {
> >> >+        netdev->rx_ring = rte_ring_lookup(rxq_name);
> >> >+        netdev->tx_ring = rte_ring_lookup(txq_name);
> >> >+        if (!netdev->rx_ring || !netdev->tx_ring) {
> >> >+            err = ENOMEM;
> >> >+        }
> >> >+    }
> >> >+
> >> >+    free(rxq_name);
> >> >+    free(txq_name);
> >> >+
> >> >     list_push_back(&dpdk_list, &netdev->list_node);
> >> >
> >> > unlock:
> >> >@@ -957,6 +982,36 @@ netdev_dpdk_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet **packets,
> >> >     return 0;
> >> > }
> >> >
> >> >+static int
> >> >+netdev_dpdk_ring_rxq_recv(struct netdev_rxq *rxq_,
> >> >+                           struct dp_packet **packets, int *c) {
> >> >+    struct netdev_dpdk *netdev = netdev_dpdk_cast(rxq_->netdev);
> >> >+    struct rte_ring *rx_ring = netdev->rx_ring;
> >> >+    unsigned rx_pkts = NETDEV_MAX_BURST;
> >> >+
> >> >+    /* Only use netdev_dpdk_ring_rxq_recv() as a secondary process. There are operations
> >> >+     * performed by netdev_dpdk_rxq_recv() that primary processes are responsible for and
> >> >+     * cannot be performed by secondary processes. */
> >> >+    if (OVS_LIKELY(rte_eal_process_type() == RTE_PROC_PRIMARY)) {
> >> >+        return netdev_dpdk_rxq_recv(rxq_,packets,c);
> >> >+    }
> >> >+
> >> >+    while (OVS_UNLIKELY(rte_ring_dequeue_bulk(rx_ring, (void **)packets, rx_pkts) != 0) &&
> >> >+        rx_pkts > 0) {
> >> >+        rx_pkts = rte_ring_count(rx_ring);
> >> >+        rx_pkts = (unsigned)MIN(rx_pkts,NETDEV_MAX_BURST);
> >
> > I wonder whether it would be more efficient to return EAGAIN at this point
> > rather than retry (see the sketch after this function)? Would the
> > following 'if' statement ever be hit?
> >
> >> >+    }
> >> >+
> >> >+    if (!rx_pkts) {
> >> >+        return EAGAIN;
> >> >+    }
> >> >+
> >> >+    *c = rx_pkts;
> >> >+
> >> >+    return 0;
> >> >+}
> >> >+
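
To illustrate the EAGAIN comment above, this is roughly what I had in mind for
the secondary-process path. Sketch only, untested, and the helper name is made
up; it leans on rte_ring_dequeue_burst() to take whatever is available in one
call:

    /* Secondary-process receive without the retry loop: grab whatever is in
     * the ring right now, or tell the caller to come back later.  As with
     * the version in the patch, this would only run as a secondary process. */
    static int
    netdev_dpdk_ring_rxq_recv_once(struct netdev_rxq *rxq_,
                                   struct dp_packet **packets, int *c)
    {
        struct netdev_dpdk *netdev = netdev_dpdk_cast(rxq_->netdev);
        unsigned n;

        n = rte_ring_dequeue_burst(netdev->rx_ring, (void **) packets,
                                   NETDEV_MAX_BURST);
        if (!n) {
            return EAGAIN;      /* ring empty; poll again on the next pass */
        }

        *c = n;
        return 0;
    }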
> >> > static void
> >> > __netdev_dpdk_vhost_send(struct netdev *netdev, struct dp_packet **pkts,
> >> >                          int cnt, bool may_steal)
> >> >@@ -1147,6 +1202,20 @@ netdev_dpdk_vhost_send(struct netdev *netdev, int qid OVS_UNUSED, struct dp_pack
> >> > }
> >> >
> >> > static inline void
> >> >+netdev_dpdk_ring_send__(struct netdev_dpdk *netdev,
> >> >+                        struct dp_packet **pkts, int cnt) {
> >> >+    struct rte_ring *tx_ring = netdev->tx_ring;
> >> >+    int rslt = 0;
> >> >+
> >> >+    if (tx_ring != NULL) {
> >> >+        do {
> >> >+            rslt = rte_ring_enqueue_bulk(tx_ring, (void **)pkts, cnt);
> >> >+        } while (rslt == -ENOBUFS);
> >
> > Bit of a nit, but the convention in the file is to use err instead of rslt.
> >
> >> >+    }
> >> >+}
> >> >+
> >> >+static inline void
> >> > netdev_dpdk_send__(struct netdev_dpdk *dev, int qid,
> >> >                    struct dp_packet **pkts, int cnt, bool may_steal)
> >> > {
> >> >@@ -1812,8 +1881,13 @@ dpdk_ring_create(const char dev_name[], unsigned int port_no,
> >> >     }
> >> >
> >> >     /* Create single consumer/producer rings, netdev does explicit locking. */
> >> >-    ivshmem->cring_tx = rte_ring_create(ring_name, DPDK_RING_SIZE, SOCKET0,
> >> >-                                        RING_F_SP_ENQ | RING_F_SC_DEQ);
> >> >+    if (rte_eal_process_type() == RTE_PROC_PRIMARY) {
> >> >+        ivshmem->cring_tx = rte_ring_create(ring_name, DPDK_RING_SIZE, SOCKET0,
> >> >+                                            RING_F_SP_ENQ | RING_F_SC_DEQ);
> >> >+    } else {
> >> >+        ivshmem->cring_tx = rte_ring_lookup(ring_name);
> >> >+    }
> >> >+
> >> >     if (ivshmem->cring_tx == NULL) {
> >> >         rte_free(ivshmem);
> >> >         return ENOMEM;
> >> >@@ -1825,8 +1899,13 @@ dpdk_ring_create(const char dev_name[], unsigned int port_no,
> >> >     }
> >> >
> >> >     /* Create single consumer/producer rings, netdev does explicit locking. */
> >> >-    ivshmem->cring_rx = rte_ring_create(ring_name, DPDK_RING_SIZE, SOCKET0,
> >> >-                                        RING_F_SP_ENQ | RING_F_SC_DEQ);
> >> >+    if (rte_eal_process_type() == RTE_PROC_PRIMARY) {
> >> >+        ivshmem->cring_rx = rte_ring_create(ring_name, DPDK_RING_SIZE, SOCKET0,
> >> >+                                            RING_F_SP_ENQ | RING_F_SC_DEQ);
> >> >+    } else {
> >> >+        ivshmem->cring_rx = rte_ring_lookup(ring_name);
> >> >+    }
> >> >+
> >> >     if (ivshmem->cring_rx == NULL) {
> >> >         rte_free(ivshmem);
> >> >         return ENOMEM;
> >> >@@ -1888,7 +1967,14 @@ netdev_dpdk_ring_send(struct netdev *netdev_, int qid,
> >> >         dp_packet_set_rss_hash(pkts[i], 0);
> >> >     }
> >> >
> >> >-    netdev_dpdk_send__(netdev, qid, pkts, cnt, may_steal);
> >> >+    /* Only use netdev_dpdk_send__() as a primary process. It leads to the execution
> >> >+     * of code that cannot be executed by secondary processes. */
> >> >+    if (OVS_LIKELY(rte_eal_process_type() == RTE_PROC_PRIMARY)) {
> >> >+        netdev_dpdk_send__(netdev, qid, pkts, cnt, may_steal);
> >> >+    } else {
> >> >+        netdev_dpdk_ring_send__(netdev, pkts, cnt);
> >
> > If this is a single-producer ring and there is only one qid, how do
> > you ensure thread safety? (See the sketch after this hunk for what I mean.)
> >
> > What happens if may_steal is true?
> >
> >> >+    }
> >> >+
> >> >     return 0;
> >> > }
> >> >
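
On the thread-safety point above: if several PMD threads in the secondary
process can reach netdev_dpdk_ring_send__() for the same netdev, the
RING_F_SP_ENQ ring needs some external serialisation. A rough sketch is below;
the tx_ring_lock member is hypothetical and would need to be added to struct
netdev_dpdk and initialised with rte_spinlock_init():

    /* Sketch only: serialise enqueues from the secondary process so that a
     * single-producer ring never sees two concurrent producers. */
    #include <rte_spinlock.h>

    static inline void
    netdev_dpdk_ring_send_locked(struct netdev_dpdk *netdev,
                                 struct dp_packet **pkts, int cnt)
    {
        rte_spinlock_lock(&netdev->tx_ring_lock);    /* assumed new member */
        while (rte_ring_enqueue_bulk(netdev->tx_ring, (void **) pkts, cnt)
               == -ENOBUFS) {
            ;                       /* ring full; the host is expected to drain it */
        }
        rte_spinlock_unlock(&netdev->tx_ring_lock);
    }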
> >> >@@ -2101,7 +2187,7 @@ static const struct netdev_class dpdk_ring_class =
> >> >         netdev_dpdk_get_stats,
> >> >         netdev_dpdk_get_features,
> >> >         netdev_dpdk_get_status,
> >> >-        netdev_dpdk_rxq_recv);
> >> >+        netdev_dpdk_ring_rxq_recv);
> >> >
> >> > static const struct netdev_class OVS_UNUSED dpdk_vhost_cuse_class =
> >> >     NETDEV_DPDK_CLASS(
> >> >--
> >> >1.9.3 (Apple Git-50)
_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev
