> >"Kavanagh, Mark B" <mark.b.kavan...@intel.com> writes: >> Further clarifications below Aaron - please let me know if you have >> any additional comments. >> >> Thanks, >> Mark > >Thanks again for this work, Mark! > >>> >>>One addendum below, Aaron. >>> >>>> >>>>>Hi Mark, >>>>> >>>>>Thanks for the patch! I've not done a very thorough review of this, but >>>>>my first-blush comments are inline. >>>> >>>>Thanks for the review Aaron! >>>> >>>>Any additional comments are welcome - my initial responses are inline, >>>>below. >>>> >>>>Thanks, >>>>Mark >>>> >>>>> >>>>>Mark Kavanagh <mark.b.kavan...@intel.com> writes: >>>>>> Add support for Jumbo Frames to DPDK-enabled port types, >>>>>> using single-segment-mbufs. >>>>>> >>>>>> Using this approach, the amount of memory allocated for each mbuf >>>>>> to store frame data is increased to a value greater than 1518B >>>>>> (typical Ethernet maximum frame length). The increased space >>>>>> available in the mbuf means that an entire Jumbo Frame can be carried >>>>>> in a single mbuf, as opposed to partitioning it across multiple mbuf >>>>>> segments. >>>>>> >>>>>> The amount of space allocated to each mbuf to hold frame data is >>>>>> defined dynamically by the user when adding a DPDK port to a bridge. >>>>>> If an MTU value is not supplied, or the user-supplied value is invalid, >>>>>> the MTU for the port defaults to standard Ethernet MTU (i.e. 1500B). >>>>>> >>>>>> Signed-off-by: Mark Kavanagh <mark.b.kavan...@intel.com> >>>>>> --- >>>>> >>>>>This is a new feature which has user-visible impact, so should be >>>>>announced in NEWS (plus, it's great news). >>>> >>>>No problem - I'll add an entry. >>>> >>>>> >>>>>> INSTALL.DPDK.md | 59 ++++++++- >>>>>> lib/netdev-dpdk.c | 356 >> +++++++++++++++++++++++++++++++++++++++++------------- >>>>>> 2 files changed, 328 insertions(+), 87 deletions(-) >>>>>> >>>>>> diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md >>>>>> index 96b686c..2f23e27 100644 >>>>>> --- a/INSTALL.DPDK.md >>>>>> +++ b/INSTALL.DPDK.md >>>>>> @@ -859,10 +859,61 @@ by adding the following string: >>>>>> to <interface> sections of all network devices used by DPDK. Parameter >>>>>> 'N' >>>>>> determines how many queues can be used by the guest. >>>>>> >>>>>> + >>>>>> +Jumbo Frames >>>>>> +------------ >>>>>> + >>>>>> +Support for Jumbo Frames may be enabled at run-time for DPDK-type ports. >>>>>> + >>>>>> +To avail of Jumbo Frame support, add the '--dpdk-mtu' option to >> the ovs-vsctl >>>>>> +'add-port' command-line, along with the required MTU for the port. >>>>>> +e.g. >>>>> >>>>>This text and the example do not match. I think '--dpdk-mtu' is not valid. >>>> >>>>Good catch, thanks. >>>> >>>>> >>>>>> + >>>>>> + ``` >>>>>> + ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk >> options:dpdk-mtu=9000 >>>>>> + ``` >>>>>> + >>>>>> +When Jumbo Frames are enabled, the size of a DPDK port's mbuf segments >>>>>> are >>>>>> +increased, such that a full Jumbo Frame may be accommodated >> inside a single >>>>>> +mbuf segment. Once set, the MTU for a DPDK port is immutable. >>>>> >>>>>Why no support? DPDK supports changing the mtu. >>>> >>>>I guess my rationale here is that an MTU change can't be triggered >> via OVS command-line, nor >>>>can it be triggered programmatically via DPDK (apart from an explicit call >>>>to >>>>rte_eth_dev_set_mtu). >>>>So, while technically it's possibly, from a user's point of view, >> there's no way to >>>configure >>>>it, outside of modifying the code directly. If I've missed something >> here, please let me >>>>know. >>>> >>>>> >>>>>> +Jumbo frame support has been validated against 13312B frames, using the >>>>>> +DPDK `igb_uio` driver, but larger frames and other DPDK NIC drivers may >>>>>> +theoretically be supported. Supported port types excludes >> vHost-Cuse ports, as >>>>>> +this feature is pending deprecation. >>>>>> + >>>>>> + >>>>>> +vHost Ports and Jumbo Frames >>>>>> +---------------------------- >>>>>> +Jumbo frame support is available for DPDK vHost-User ports >> only. Some additional >>>>>> +configuration is needed to take advantage of this feature: >>>>>> + >>>>>> + 1. `mergeable buffers` must be enabled for vHost ports, as >> demonstrated in >>>>>> + the QEMU command line snippet below: >>>>>> + >>>>>> + ``` >>>>>> + '-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \' >>>>>> + '-device >> virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on' >>>>>> + ``` >>>>>> + >>>>>> + 2. Where virtio devices are bound to the Linux kernel driver in a >>>>>> guest >>>>>> + environment (i.e. interfaces are not bound to an in-guest DPDK >> driver), the >>>>>> + MTU of those logical network interfaces must also be increased. >>>>>> This >>>>>> + avoids segmentation of Jumbo Frames in the guest. Note that >> 'MTU' refers >>>>>> + to the length of the IP packet only, and not that of the entire >> frame. >>>>>> + >>>>>> + e.g. To calculate the exact MTU of a standard IPv4 frame, >> subtract the L2 >>>>>> + header and CRC lengths (i.e. 18B) from the max supported frame >>>>>> size. >>>>>> + So, to set the MTU for a 13312B Jumbo Frame: >>>>>> + >>>>>> + ``` >>>>>> + ifconfig eth1 mtu 13294 >>>>>> + ``` >>>>>> + >>>>>> + >>>>>> Restrictions: >>>>>> ------------- >>>>>> >>>>>> - - Work with 1500 MTU, needs few changes in DPDK lib to fix this issue. >>>>>> - Currently DPDK port does not make use any offload functionality. >>>>>> - DPDK-vHost support works with 1G huge pages. >>>>>> >>>>>> @@ -903,6 +954,12 @@ Restrictions: >>>>>> the next release of DPDK (which includes the above patch) is >> available and >>>>>> integrated into OVS. >>>>>> >>>>>> + Jumbo Frames: >>>>>> + - `virtio-pmd`: DPDK apps in the guest do not exit >> gracefully. The source of >>>>>> + this issue is currently being investigated. >>>>>> + - vHost-Cuse: Jumbo Frame support is not available for vHost Cuse >>>>>> ports. >>>>>> + >>>>>> + >>>>>> Bug Reporting: >>>>>> -------------- >>>>>> >>>>>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c >>>>>> index de7e488..5f23e28 100644 >>>>>> --- a/lib/netdev-dpdk.c >>>>>> +++ b/lib/netdev-dpdk.c >>>>>> @@ -62,20 +62,25 @@ static struct vlog_rate_limit rl = >> VLOG_RATE_LIMIT_INIT(5, 20); >>>>>> #define OVS_CACHE_LINE_SIZE CACHE_LINE_SIZE >>>>>> #define OVS_VPORT_DPDK "ovs_dpdk" >>>>>> >>>>>> +#define NETDEV_DPDK_JUMBO_FRAME_ENABLED 1 >>>>>> +#define NETDEV_DPDK_DEFAULT_RX_BUFSIZE 1024 >>>>>> + >>>>>> /* >>>>>> * need to reserve tons of extra space in the mbufs so we can align the >>>>>> * DMA addresses to 4KB. >>>>>> * The minimum mbuf size is limited to avoid scatter behaviour and drop >>>>>> in >>>>>> * performance for standard Ethernet MTU. >>>>>> */ >>>>>> -#define MTU_TO_MAX_LEN(mtu) ((mtu) + ETHER_HDR_LEN + ETHER_CRC_LEN) >>>>>> -#define MBUF_SIZE_MTU(mtu) (MTU_TO_MAX_LEN(mtu) \ >>>>>> - + sizeof(struct dp_packet) \ >>>>>> - + RTE_PKTMBUF_HEADROOM) >>>>>> -#define MBUF_SIZE_DRIVER (2048 \ >>>>>> - + sizeof (struct rte_mbuf) \ >>>>>> - + RTE_PKTMBUF_HEADROOM) >>>>>> -#define MBUF_SIZE(mtu) MAX(MBUF_SIZE_MTU(mtu), MBUF_SIZE_DRIVER) >>>>>> +#define MTU_TO_FRAME_LEN(mtu) ((mtu) + ETHER_HDR_LEN + >> ETHER_CRC_LEN) >>>>>> +#define FRAME_LEN_TO_MTU(frame_len) ((frame_len)- ETHER_HDR_LEN >> - ETHER_CRC_LEN) >>>>>> +#define MBUF_SEGMENT_SIZE(mtu) ( MTU_TO_FRAME_LEN(mtu) \ >>>>>> + + sizeof(struct dp_packet) \ >>>>>> + + RTE_PKTMBUF_HEADROOM) >>>>>> + >>>>>> +/* This value should be specified as a multiple of the DPDK NIC driver's >>>>>> + * 'min_rx_bufsize' attribute (currently 1024B for 'igb_uio'). >>>>>> + */ >>>>>> +#define NETDEV_DPDK_MAX_FRAME_LEN 13312 >>>>>> >>>>>> /* Max and min number of packets in the mempool. OVS tries to allocate >>>>>> a >>>>>> * mempool with MAX_NB_MBUF: if this fails (because the system >> doesn't have >>>>>> @@ -86,6 +91,8 @@ static struct vlog_rate_limit rl = >> VLOG_RATE_LIMIT_INIT(5, 20); >>>>>> #define MIN_NB_MBUF (4096 * 4) >>>>>> #define MP_CACHE_SZ RTE_MEMPOOL_CACHE_MAX_SIZE >>>>>> >>>>>> +#define DPDK_VLAN_TAG_LEN 4 >>>>>> + >>>>>> /* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */ >>>>>> BUILD_ASSERT_DECL(MAX_NB_MBUF % >> ROUND_DOWN_POW2(MAX_NB_MBUF/MIN_NB_MBUF) == 0); >>>>>> >>>>>> @@ -114,7 +121,6 @@ static const struct rte_eth_conf port_conf = { >>>>>> .header_split = 0, /* Header Split disabled */ >>>>>> .hw_ip_checksum = 0, /* IP checksum offload disabled */ >>>>>> .hw_vlan_filter = 0, /* VLAN filtering disabled */ >>>>>> - .jumbo_frame = 0, /* Jumbo Frame Support disabled */ >>>>>> .hw_strip_crc = 0, >>>>>> }, >>>>>> .rx_adv_conf = { >>>>>> @@ -254,6 +260,39 @@ is_dpdk_class(const struct netdev_class *class) >>>>>> return class->construct == netdev_dpdk_construct; >>>>>> } >>>>>> >>>>>> +/* DPDK NIC drivers allocate RX buffers at a particular granularity >>>>>> + * (specified by rte_eth_dev_info.min_rx_bufsize - currently 1K >> for igb_uio). >>>>>> + * If 'frame_len' is not a multiple of this value, insufficient >> buffers are >>>>>> + * allocated to accomodate the packet in its >> entirety. Furthermore, the igb_uio >>>>>> + * driver needs to ensure that there is also sufficient space in >> the Rx buffer >>>>>> + * to accommodate two VLAN tags (for QinQ frames). If the RX buffer is >>>>>> too >>>>>> + * small, then the driver enables scatter RX behaviour, which reduces >>>>>> + * performance. To prevent this, use a buffer size that is closest to >>>>>> + * 'frame_len', but which satisfies the aforementioned criteria. >>>>>> + */ >>>>>> +static uint32_t >>>>>> +dpdk_buf_size(struct netdev_dpdk *netdev, int frame_len) >>>>>> +{ >>>>>> + struct rte_eth_dev_info info; >>>>>> + uint32_t buf_size; >>>>>> + uint32_t len = frame_len + (2 * DPDK_VLAN_TAG_LEN); >>>>>> + >>>>>> + if(netdev->type == DPDK_DEV_ETH) { >>>>>> + rte_eth_dev_info_get(netdev->port_id, &info); >>>>>> + buf_size = (info.min_rx_bufsize == 0) ? >>>>>> + NETDEV_DPDK_DEFAULT_RX_BUFSIZE : >>>>>> + info.min_rx_bufsize; >>>>> >>>>>Why not use the rte_eth_dev_get_mtu call to get the port configured mtu? >>>> >>>>I'm not sure if I follow - the MTU isn't the issue here, but rather >> the buffer size used >>>>when creating the rte_mempool, which must be a multiple of the >> driver's min_rx_bufsize. >>>> >>>>> >>>>>> + } else { >>>>>> + buf_size = NETDEV_DPDK_DEFAULT_RX_BUFSIZE; >>>>>> + } >>>>>> + >>>>>> + if(len % buf_size) { >>>>>> + len = buf_size * ((len/buf_size) + 1); >>>>>> + } >>>>>> + >>>>>> + return len; >>>>>> +} >>>>>> + >>>>>> /* XXX: use dpdk malloc for entire OVS. in fact huge page should be used >>>>>> * for all other segments data, bss and text. */ >>>>>> >>>>>> @@ -280,31 +319,70 @@ free_dpdk_buf(struct dp_packet *p) >>>>>> } >>>>>> >>>>>> static void >>>>>> -__rte_pktmbuf_init(struct rte_mempool *mp, >>>>>> - void *opaque_arg OVS_UNUSED, >>>>>> - void *_m, >>>>>> - unsigned i OVS_UNUSED) >>>>>> +ovs_rte_pktmbuf_pool_init(struct rte_mempool *mp, void *opaque_arg) >>>>>> { >>>>>> - struct rte_mbuf *m = _m; >>>>>> - uint32_t buf_len = mp->elt_size - sizeof(struct dp_packet); >>>>>> + struct rte_pktmbuf_pool_private *user_mbp_priv, *mbp_priv; >>>>>> + struct rte_pktmbuf_pool_private default_mbp_priv; >>>>>> + uint16_t roomsz; >>>>>> >>>>>> RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet)); >>>>>> >>>>>> - memset(m, 0, mp->elt_size); >>>>>> + /* if no structure is provided, assume no mbuf private area */ >>>>>> + >>>>>> + user_mbp_priv = opaque_arg; >>>>>> + if (user_mbp_priv == NULL) { >>>>>> + default_mbp_priv.mbuf_priv_size = 0; >>>>>> + if (mp->elt_size > sizeof(struct dp_packet)) { >>>>>> + roomsz = mp->elt_size - sizeof(struct dp_packet); >>>>>> + } else { >>>>>> + roomsz = 0; >>>>>> + } >>>>>> + default_mbp_priv.mbuf_data_room_size = roomsz; >>>>>> + user_mbp_priv = &default_mbp_priv; >>>>>> + } >>>>>> + >>>>>> + RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet) + >>>>>> + user_mbp_priv->mbuf_data_room_size + >>>>>> + user_mbp_priv->mbuf_priv_size); >>>>>> + >>>>>> + mbp_priv = rte_mempool_get_priv(mp); >>>>>> + memcpy(mbp_priv, user_mbp_priv, sizeof(*mbp_priv)); >>>>>> +} >>>>>> + >>>>>> +/* Initialise some fields in the mbuf structure that are not >> modified by the >>>>>> + * user once created (origin pool, buffer start address, etc.*/ >>>>>> +static void >>>>>> +__ovs_rte_pktmbuf_init(struct rte_mempool *mp, >>>>>> + void *opaque_arg OVS_UNUSED, >>>>>> + void *_m, >>>>>> + unsigned i OVS_UNUSED) >>>>>> +{ >>>>>> + struct rte_mbuf *m = _m; >>>>>> + uint32_t buf_size, buf_len, priv_size; >>>>>> + >>>>>> + priv_size = rte_pktmbuf_priv_size(mp); >>>>>> + buf_size = sizeof(struct dp_packet) + priv_size; >>>>>> + buf_len = rte_pktmbuf_data_room_size(mp); >>>>>> >>>>>> - /* start of buffer is just after mbuf structure */ >>>>>> - m->buf_addr = (char *)m + sizeof(struct dp_packet); >>>>>> - m->buf_physaddr = rte_mempool_virt2phy(mp, m) + >>>>>> - sizeof(struct dp_packet); >>>>>> - m->buf_len = (uint16_t)buf_len; >>>>>> + RTE_MBUF_ASSERT(RTE_ALIGN(priv_size, RTE_MBUF_PRIV_ALIGN) == >>>>>> priv_size); >>>>>> + RTE_MBUF_ASSERT(mp->elt_size >= buf_size); >>>>>> + RTE_MBUF_ASSERT(buf_len <= UINT16_MAX); >>>>>> >>>>>> - /* keep some headroom between start of buffer and data */ >>>>>> - m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, m->buf_len); >>>>>> + memset(m, 0, mp->elt_size); >>>>>> >>>>>> - /* init some constant fields */ >>>>>> - m->pool = mp; >>>>>> - m->nb_segs = 1; >>>>>> - m->port = 0xff; >>>>>> + /* start of buffer is after dp_packet structure and priv data */ >>>>>> + m->priv_size = priv_size; >>>>>> + m->buf_addr = (char *)m + buf_size; >>>>>> + m->buf_physaddr = rte_mempool_virt2phy(mp, m) + buf_size; >>>>>> + m->buf_len = (uint16_t)buf_len; >>>>>> + >>>>>> + /* keep some headroom between start of buffer and data */ >>>>>> + m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, >>>>>> (uint16_t)m->buf_len); >>>>>> + >>>>>> + /* init some constant fields */ >>>>>> + m->pool = mp; >>>>>> + m->nb_segs = 1; >>>>>> + m->port = 0xff; >>>>> >>>>>Please don't mix tabs and spaces in the file. >>>> >>>>Apologies - this was a copy/paste from a DPDK file, which uses tabs. >>>> >>>>> >>>>>> } >>>>>> >>>>>> static void >>>>>> @@ -315,7 +393,7 @@ ovs_rte_pktmbuf_init(struct rte_mempool *mp, >>>>>> { >>>>>> struct rte_mbuf *m = _m; >>>>>> >>>>>> - __rte_pktmbuf_init(mp, opaque_arg, _m, i); >>>>>> + __ovs_rte_pktmbuf_init(mp, opaque_arg, m, i); >>>>>> >>>>>> dp_packet_init_dpdk((struct dp_packet *) m, m->buf_len); >>>>>> } >>>>>> @@ -326,6 +404,7 @@ dpdk_mp_get(int socket_id, int mtu) >> OVS_REQUIRES(dpdk_mutex) >>>>>> struct dpdk_mp *dmp = NULL; >>>>>> char mp_name[RTE_MEMPOOL_NAMESIZE]; >>>>>> unsigned mp_size; >>>>>> + struct rte_pktmbuf_pool_private mbp_priv; >>>>>> >>>>>> LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) { >>>>>> if (dmp->socket_id == socket_id && dmp->mtu == mtu) { >>>>>> @@ -338,6 +417,8 @@ dpdk_mp_get(int socket_id, int mtu) >> OVS_REQUIRES(dpdk_mutex) >>>>>> dmp->socket_id = socket_id; >>>>>> dmp->mtu = mtu; >>>>>> dmp->refcount = 1; >>>>>> + mbp_priv.mbuf_data_room_size = MTU_TO_FRAME_LEN(mtu) + >> RTE_PKTMBUF_HEADROOM; >>>>>> + mbp_priv.mbuf_priv_size = 0; >>>>>> >>>>>> mp_size = MAX_NB_MBUF; >>>>>> do { >>>>>> @@ -346,10 +427,10 @@ dpdk_mp_get(int socket_id, int mtu) >> OVS_REQUIRES(dpdk_mutex) >>>>>> return NULL; >>>>>> } >>>>>> >>>>>> - dmp->mp = rte_mempool_create(mp_name, mp_size, MBUF_SIZE(mtu), >>>>>> + dmp->mp = rte_mempool_create(mp_name, mp_size, >> MBUF_SEGMENT_SIZE(mtu), >>>>>> MP_CACHE_SZ, >>>>>> sizeof(struct rte_pktmbuf_pool_private), >>>>>> - rte_pktmbuf_pool_init, NULL, >>>>>> + ovs_rte_pktmbuf_pool_init, >>>>>> &mbp_priv, >>>>>> ovs_rte_pktmbuf_init, NULL, >>>>>> socket_id, 0); >>>>>> } while (!dmp->mp && rte_errno == ENOMEM && (mp_size /= 2) >= >> MIN_NB_MBUF); >>>>>> @@ -433,6 +514,7 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk >> *dev, int n_rxq, int >>>>n_txq) >>>>>> { >>>>>> int diag = 0; >>>>>> int i; >>>>>> + struct rte_eth_conf conf = port_conf; >>>>>> >>>>>> /* A device may report more queues than it makes available (this has >>>>>> * been observed for Intel xl710, which reserves some of them for >>>>>> @@ -444,7 +526,12 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk >> *dev, int n_rxq, int >>>>>n_txq) >>>>>> VLOG_INFO("Retrying setup with (rxq:%d txq:%d)", n_rxq, n_txq); >>>>>> } >>>>>> >>>>>> - diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, >> &port_conf); >>>>>> + if(OVS_UNLIKELY(dev->mtu > ETHER_MTU)) { >>>>>> + conf.rxmode.jumbo_frame = NETDEV_DPDK_JUMBO_FRAME_ENABLED; >>>>>> + conf.rxmode.max_rx_pkt_len = MTU_TO_FRAME_LEN(dev->mtu); >>>>>> + } >>>>>> + >>>>>> + diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, &conf); >>>>>> if (diag) { >>>>>> break; >>>>>> } >>>>>> @@ -586,6 +673,7 @@ netdev_dpdk_init(struct netdev *netdev_, >> unsigned int port_no, >>>>>> struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_); >>>>>> int sid; >>>>>> int err = 0; >>>>>> + uint32_t buf_size; >>>>>> >>>>>> ovs_mutex_init(&netdev->mutex); >>>>>> ovs_mutex_lock(&netdev->mutex); >>>>>> @@ -605,10 +693,16 @@ netdev_dpdk_init(struct netdev *netdev_, >> unsigned int port_no, >>>>>> netdev->port_id = port_no; >>>>>> netdev->type = type; >>>>>> netdev->flags = 0; >>>>>> + >>>>>> + /* Initialize port's MTU and frame len to the default Ethernet >>>>>> values. >>>>>> + * Larger, user-specified (jumbo) frame buffers are accommodated in >>>>>> + * netdev_dpdk_set_config. >>>>>> + */ >>>>>> + netdev->max_packet_len = ETHER_MAX_LEN; >>>>>> netdev->mtu = ETHER_MTU; >>>>>> - netdev->max_packet_len = MTU_TO_MAX_LEN(netdev->mtu); >>>>>> >>>>>> - netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, netdev->mtu); >>>>>> + buf_size = dpdk_buf_size(netdev, ETHER_MAX_LEN); >>>>>> + netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, >> FRAME_LEN_TO_MTU(buf_size)); >>>>>> if (!netdev->dpdk_mp) { >>>>>> err = ENOMEM; >>>>>> goto unlock; >>>>>> @@ -651,6 +745,24 @@ dpdk_dev_parse_name(const char dev_name[], >> const char prefix[], >>>>>> return 0; >>>>>> } >>>>>> >>>>>> +static void >>>>>> +dpdk_dev_parse_mtu(const struct smap *args, int *mtu) >>>>>> +{ >>>>>> + const char *mtu_str = smap_get(args, "dpdk-mtu"); >>>>>> + int local_mtu; >>>>>> + >>>>>> + if(mtu_str) { >>>>>> + local_mtu = atoi(mtu_str); >>>>> >>>>>Please use strtol or strtoul here, and detect errors in the string by >>>>>checking endptr. That way, we can be sure that random garbage on the >>>>>stack won't accidentally become the mtu. >>>> >>>>Will do, thanks. >>>> >>>>> >>>>>> + } >>>>>> + if(!mtu_str || local_mtu < ETHER_MTU || >>>>>> + local_mtu > FRAME_LEN_TO_MTU(NETDEV_DPDK_MAX_FRAME_LEN)) { >>>>>> + local_mtu = ETHER_MTU; >>>>>> + VLOG_WARN("Invalid or missing dpdk-mtu parameter - defaulting >> to %d.\n", >>>>>> + local_mtu); >>>>>> + } >>>>>> + *mtu = local_mtu; >>>>>> +} >>>>>> + >>>>>> static int >>>>>> vhost_construct_helper(struct netdev *netdev_) OVS_REQUIRES(dpdk_mutex) >>>>>> { >>>>>> @@ -777,11 +889,77 @@ netdev_dpdk_get_config(const struct netdev >> *netdev_, struct smap >>>>>*args) >>>>>> smap_add_format(args, "configured_rx_queues", "%d", netdev_->n_rxq); >>>>>> smap_add_format(args, "requested_tx_queues", "%d", netdev_->n_txq); >>>>>> smap_add_format(args, "configured_tx_queues", "%d", >>>>>> dev->real_n_txq); >>>>>> + smap_add_format(args, "mtu", "%d", dev->mtu); >>>>>> ovs_mutex_unlock(&dev->mutex); >>>>>> >>>>>> return 0; >>>>>> } >>>>>> >>>>>> +/* Set the mtu of DPDK_DEV_ETH ports */ >>>>>> +static int >>>>>> +netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu) >>>>>> +{ >>>>>> + struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); >>>>>> + int old_mtu, err; >>>>>> + uint32_t buf_size; >>>>>> + int dpdk_mtu; >>>>>> + struct dpdk_mp *old_mp; >>>>>> + struct dpdk_mp *mp; >>>>>> + >>>>>> + ovs_mutex_lock(&dpdk_mutex); >>>>>> + ovs_mutex_lock(&dev->mutex); >>>>>> + if (dev->mtu == mtu) { >>>>>> + err = 0; >>>>>> + goto out; >>>>>> + } >>>>>> + >>>>>> + buf_size = dpdk_buf_size(dev, MTU_TO_FRAME_LEN(mtu)); >>>>>> + dpdk_mtu = FRAME_LEN_TO_MTU(buf_size); >>>>>> + >>>>>> + mp = dpdk_mp_get(dev->socket_id, dpdk_mtu); >>>>>> + if (!mp) { >>>>>> + err = ENOMEM; >>>>>> + goto out; >>>>>> + } >>>>>> + >>>>>> + rte_eth_dev_stop(dev->port_id); >>>>>> + >>>>>> + old_mtu = dev->mtu; >>>>>> + old_mp = dev->dpdk_mp; >>>>>> + dev->dpdk_mp = mp; >>>>>> + dev->mtu = mtu; >>>>>> + dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu); >>>>>> + >>>>>> + err = dpdk_eth_dev_init(dev); >>>>> >>>>>Why call the dpdk_eth_dev_init here? Since you are directly calling >>>>>rte_eth_dev_X functions here - wouldn't it make sense to use the >>>>>rte_eth_dev_start() instead? >> >> The rx queues for the device need to be assigned the mempool that was >> allocated for the jumbo frames; since dpdk_eth_dev_init handles the >> queue setup, rte_eth_dev configuration, etc. it makes sense to call it >> here. >> >>>> >>>>This is legacy code, but I can refactor if needs be. >>>> >>>>> >>>>>> + if (err) { >>>>>> + VLOG_WARN("Unable to set MTU '%d' for '%s'; reverting to last >> known " >>>>>> + "good value '%d'\n", mtu, dev->up.name, old_mtu); >>>>>> + dpdk_mp_put(mp); >>>>>> + dev->mtu = old_mtu; >>>>>> + dev->dpdk_mp = old_mp; >>>>>> + dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu); >>>>>> + err = dpdk_eth_dev_init(dev); >>>>> >>>>>What happens if this returns a failure? By using dpdk_eth_dev_init, you >>>>>will cause a queue reconfiguration action, which is not needed, and can >>>>>error. >> >> This call is needed, in case the eth_dev's queue configuration was >> changed by the previous call. >> The error handling for this call is arguably outside the scope of this >> patch, and could be submitted as a separate patch - what do you think? > >That's probably a good idea. > >>>>> >>>> >>>>As above. >>>> >>>>>If you are intending to set MTU, I would have expected calls to >>>>>rte_eth_dev_set_mtu. However, I don't see any. Is there a reason? I >>>>>recognize that much of this is legacy, but since you are touching it :) >>>> >>>>I actually had a number of issues with rte_eth_dev_set_mtu early on >> in the patch; I can take >>>>a second look now that the code is more stable though. >>> >>>I remembered why I didn't use this. >>> >>>If you look at the IXGBE driver's mtu_set function, it refuses frame >> lengths which are larger >>>than the eth_dev's min_rx_buf_size (- RTE_PKTMBUF_HEADROOM). >>>This might be avoided by configuring the device for scatter RX >> behavior, but this incurs a >>>performance impact. > >It seems strange to me that we need to ignore the library API for this >one case, right? What if the user is only doing vhostuser PMDs mixed >with a bnx2x card? >
I can't say that I've familiar enough with this setup to form an answer on that one Aaron. >It would be better to get it fixed in the library, I think. After all, >they gave a generic setting to change MTU - why don't they make it >behave well? > Let me follow up on that with some local DPDK folks here. Would you consider this a showstopper for upstreaming of this feature? If we need to wait for a change to DPDK, Jumbo Frames will miss the 2.5 window. >>><snippet> >>>ixgbe_dev_mtu_set(...) { >>> >>>... >>> >>> /* refuse mtu that requires the support of scattered packets when this >>> * feature has not been enabled before. */ >>> if (!dev->data->scattered_rx && >>> (frame_size + 2 * IXGBE_VLAN_TAG_SIZE > >>> dev->data->min_rx_buf_size - RTE_PKTMBUF_HEADROOM)) >>> return -EINVAL; >>> >>>... >>>} >>></snippet> >>> >>> >>>> >>>>> >>>>>> + goto out; >>>>>> + } >>>>>> + >>>>>> + dpdk_mp_put(old_mp); >>>>>> + netdev_change_seq_changed(netdev); >>>>>> +out: >>>>>> + ovs_mutex_unlock(&dev->mutex); >>>>>> + ovs_mutex_unlock(&dpdk_mutex); >>>>>> + return err; >>>>>> +} >>>>>> + >>>>>> +static int >>>>>> +netdev_dpdk_set_config(struct netdev *netdev_, const struct smap *args) >>>>>> +{ >>>>>> + int mtu; >>>>>> + >>>>>> + dpdk_dev_parse_mtu(args, &mtu); >>>>>> + >>>>>> + return netdev_dpdk_set_mtu(netdev_, mtu); >>>>>> +} >>>>>> + >>>>>> static int >>>>>> netdev_dpdk_get_numa_id(const struct netdev *netdev_) >>>>>> { >>>>>> @@ -1358,54 +1536,6 @@ netdev_dpdk_get_mtu(const struct netdev >> *netdev, int *mtup) >>>>>> >>>>>> return 0; >>>>>> } >>>>>> - >>>>>> -static int >>>>>> -netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu) >>>>>> -{ >>>>>> - struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); >>>>>> - int old_mtu, err; >>>>>> - struct dpdk_mp *old_mp; >>>>>> - struct dpdk_mp *mp; >>>>>> - >>>>>> - ovs_mutex_lock(&dpdk_mutex); >>>>>> - ovs_mutex_lock(&dev->mutex); >>>>>> - if (dev->mtu == mtu) { >>>>>> - err = 0; >>>>>> - goto out; >>>>>> - } >>>>>> - >>>>>> - mp = dpdk_mp_get(dev->socket_id, dev->mtu); >>>>>> - if (!mp) { >>>>>> - err = ENOMEM; >>>>>> - goto out; >>>>>> - } >>>>>> - >>>>>> - rte_eth_dev_stop(dev->port_id); >>>>>> - >>>>>> - old_mtu = dev->mtu; >>>>>> - old_mp = dev->dpdk_mp; >>>>>> - dev->dpdk_mp = mp; >>>>>> - dev->mtu = mtu; >>>>>> - dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu); >>>>>> - >>>>>> - err = dpdk_eth_dev_init(dev); >>>>>> - if (err) { >>>>>> - dpdk_mp_put(mp); >>>>>> - dev->mtu = old_mtu; >>>>>> - dev->dpdk_mp = old_mp; >>>>>> - dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu); >>>>>> - dpdk_eth_dev_init(dev); >>>>>> - goto out; >>>>>> - } >>>>>> - >>>>>> - dpdk_mp_put(old_mp); >>>>>> - netdev_change_seq_changed(netdev); >>>>>> -out: >>>>>> - ovs_mutex_unlock(&dev->mutex); >>>>>> - ovs_mutex_unlock(&dpdk_mutex); >>>>>> - return err; >>>>>> -} >>>>>> - >>>>>> static int >>>>>> netdev_dpdk_get_carrier(const struct netdev *netdev_, bool *carrier); >>>>>> >>>>>> @@ -1682,7 +1812,7 @@ netdev_dpdk_get_status(const struct netdev >> *netdev_, struct smap >>>>>*args) >>>>>> smap_add_format(args, "numa_id", "%d", >> rte_eth_dev_socket_id(dev->port_id)); >>>>>> smap_add_format(args, "driver_name", "%s", dev_info.driver_name); >>>>>> smap_add_format(args, "min_rx_bufsize", "%u", >> dev_info.min_rx_bufsize); >>>>>> - smap_add_format(args, "max_rx_pktlen", "%u", >>>>>> dev_info.max_rx_pktlen); >>>>>> + smap_add_format(args, "max_rx_pktlen", "%u", dev->max_packet_len); >>>>>> smap_add_format(args, "max_rx_queues", "%u", >>>>>> dev_info.max_rx_queues); >>>>>> smap_add_format(args, "max_tx_queues", "%u", >>>>>> dev_info.max_tx_queues); >>>>>> smap_add_format(args, "max_mac_addrs", "%u", >>>>>> dev_info.max_mac_addrs); >>>>>> @@ -1904,6 +2034,51 @@ dpdk_vhost_user_class_init(void) >>>>>> return 0; >>>>>> } >>>>>> >>>>>> +/* Set the mtu of DPDK_DEV_VHOST ports */ >>>>>> +static int >>>>>> +netdev_dpdk_vhost_set_mtu(const struct netdev *netdev, int mtu) >>>>>> +{ >>>>>> + struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); >>>>>> + int err = 0; >>>>>> + struct dpdk_mp *old_mp; >>>>>> + struct dpdk_mp *mp; >>>>>> + >>>>>> + ovs_mutex_lock(&dpdk_mutex); >>>>>> + ovs_mutex_lock(&dev->mutex); >>>>>> + if (dev->mtu == mtu) { >>>>>> + err = 0; >>>>>> + goto out; >>>>>> + } >>>>>> + >>>>>> + mp = dpdk_mp_get(dev->socket_id, mtu); >>>>>> + if (!mp) { >>>>>> + err = ENOMEM; >>>>>> + goto out; >>>>>> + } >>>>>> + >>>>>> + old_mp = dev->dpdk_mp; >>>>>> + dev->dpdk_mp = mp; >>>>>> + dev->mtu = mtu; >>>>>> + dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu); >>>>>> + >>>>>> + dpdk_mp_put(old_mp); >>>>>> + netdev_change_seq_changed(netdev); >>>>>> +out: >>>>>> + ovs_mutex_unlock(&dev->mutex); >>>>>> + ovs_mutex_unlock(&dpdk_mutex); >>>>>> + return err; >>>>>> +} >>>>>> + >>>>>> +static int >>>>>> +netdev_dpdk_vhost_set_config(struct netdev *netdev_, const >> struct smap *args) >>>>>> +{ >>>>>> + int mtu; >>>>>> + >>>>>> + dpdk_dev_parse_mtu(args, &mtu); >>>>>> + >>>>>> + return netdev_dpdk_vhost_set_mtu(netdev_, mtu); >>>>>> +} >>>>>> + >>>>>> static void >>>>>> dpdk_common_init(void) >>>>>> { >>>>>> @@ -2040,8 +2215,9 @@ unlock_dpdk: >>>>>> return err; >>>>>> } >>>>>> >>>>>> -#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, MULTIQ, >>>>>> SEND, \ >>>>>> - GET_CARRIER, GET_STATS, GET_FEATURES, GET_STATUS, RXQ_RECV) >>>>>> \ >>>>>> +#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, SET_CONFIG, \ >>>>>> + MULTIQ, SEND, SET_MTU, GET_CARRIER, GET_STATS, GET_FEATURES, \ >>>>>> + GET_STATUS, RXQ_RECV) \ >>>>>> { \ >>>>>> NAME, \ >>>>>> INIT, /* init */ \ >>>>>> @@ -2053,7 +2229,7 @@ unlock_dpdk: >>>>>> DESTRUCT, \ >>>>>> netdev_dpdk_dealloc, \ >>>>>> netdev_dpdk_get_config, \ >>>>>> - NULL, /* netdev_dpdk_set_config */ \ >>>>>> + SET_CONFIG, \ >>>>>> NULL, /* get_tunnel_config */ \ >>>>>> NULL, /* build header */ \ >>>>>> NULL, /* push header */ \ >>>>>> @@ -2067,7 +2243,7 @@ unlock_dpdk: >>>>>> netdev_dpdk_set_etheraddr, \ >>>>>> netdev_dpdk_get_etheraddr, \ >>>>>> netdev_dpdk_get_mtu, \ >>>>>> - netdev_dpdk_set_mtu, \ >>>>>> + SET_MTU, \ >>>>>> netdev_dpdk_get_ifindex, \ >>>>>> GET_CARRIER, \ >>>>>> netdev_dpdk_get_carrier_resets, \ >>>>>> @@ -2213,8 +2389,10 @@ static const struct netdev_class dpdk_class = >>>>>> NULL, >>>>>> netdev_dpdk_construct, >>>>>> netdev_dpdk_destruct, >>>>>> + netdev_dpdk_set_config, >>>>>> netdev_dpdk_set_multiq, >>>>>> netdev_dpdk_eth_send, >>>>>> + netdev_dpdk_set_mtu, >>>>>> netdev_dpdk_get_carrier, >>>>>> netdev_dpdk_get_stats, >>>>>> netdev_dpdk_get_features, >>>>>> @@ -2227,8 +2405,10 @@ static const struct netdev_class dpdk_ring_class = >>>>>> NULL, >>>>>> netdev_dpdk_ring_construct, >>>>>> netdev_dpdk_destruct, >>>>>> + netdev_dpdk_set_config, >>>>>> netdev_dpdk_set_multiq, >>>>>> netdev_dpdk_ring_send, >>>>>> + netdev_dpdk_set_mtu, >>>>>> netdev_dpdk_get_carrier, >>>>>> netdev_dpdk_get_stats, >>>>>> netdev_dpdk_get_features, >>>>>> @@ -2241,8 +2421,10 @@ static const struct netdev_class >> OVS_UNUSED dpdk_vhost_cuse_class >>>= >>>>>> dpdk_vhost_cuse_class_init, >>>>>> netdev_dpdk_vhost_cuse_construct, >>>>>> netdev_dpdk_vhost_destruct, >>>>>> + NULL, >>>>>> netdev_dpdk_vhost_set_multiq, >>>>>> netdev_dpdk_vhost_send, >>>>>> + NULL, >>>>>> netdev_dpdk_vhost_get_carrier, >>>>>> netdev_dpdk_vhost_get_stats, >>>>>> NULL, >>>>>> @@ -2255,8 +2437,10 @@ static const struct netdev_class >> OVS_UNUSED dpdk_vhost_user_class >>>= >>>>>> dpdk_vhost_user_class_init, >>>>>> netdev_dpdk_vhost_user_construct, >>>>>> netdev_dpdk_vhost_destruct, >>>>>> + netdev_dpdk_vhost_set_config, >>>>>> netdev_dpdk_vhost_set_multiq, >>>>>> netdev_dpdk_vhost_send, >>>>>> + netdev_dpdk_vhost_set_mtu, >>>>>> netdev_dpdk_vhost_get_carrier, >>>>>> netdev_dpdk_vhost_get_stats, >>>>>> NULL, >>>>_______________________________________________ >>>>dev mailing list >>>>dev@openvswitch.org >>>>http://openvswitch.org/mailman/listinfo/dev >>>_______________________________________________ >>>dev mailing list >>>dev@openvswitch.org >>>http://openvswitch.org/mailman/listinfo/dev ><#secure method=pgpmime mode=encrypt> _______________________________________________ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev