One addendum below, Aaron. > >>Hi Mark, >> >>Thanks for the patch! I've not done a very thorough review of this, but >>my first-blush comments are inline. > >Thanks for the review Aaron! > >Any additional comments are welcome - my initial responses are inline, below. > >Thanks, >Mark > >> >>Mark Kavanagh <mark.b.kavan...@intel.com> writes: >>> Add support for Jumbo Frames to DPDK-enabled port types, >>> using single-segment-mbufs. >>> >>> Using this approach, the amount of memory allocated for each mbuf >>> to store frame data is increased to a value greater than 1518B >>> (typical Ethernet maximum frame length). The increased space >>> available in the mbuf means that an entire Jumbo Frame can be carried >>> in a single mbuf, as opposed to partitioning it across multiple mbuf >>> segments. >>> >>> The amount of space allocated to each mbuf to hold frame data is >>> defined dynamically by the user when adding a DPDK port to a bridge. >>> If an MTU value is not supplied, or the user-supplied value is invalid, >>> the MTU for the port defaults to standard Ethernet MTU (i.e. 1500B). >>> >>> Signed-off-by: Mark Kavanagh <mark.b.kavan...@intel.com> >>> --- >> >>This is a new feature which has user-visible impact, so should be >>announced in NEWS (plus, it's great news). > >No problem - I'll add an entry. > >> >>> INSTALL.DPDK.md | 59 ++++++++- >>> lib/netdev-dpdk.c | 356 >>> +++++++++++++++++++++++++++++++++++++++++------------- >>> 2 files changed, 328 insertions(+), 87 deletions(-) >>> >>> diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md >>> index 96b686c..2f23e27 100644 >>> --- a/INSTALL.DPDK.md >>> +++ b/INSTALL.DPDK.md >>> @@ -859,10 +859,61 @@ by adding the following string: >>> to <interface> sections of all network devices used by DPDK. Parameter 'N' >>> determines how many queues can be used by the guest. >>> >>> + >>> +Jumbo Frames >>> +------------ >>> + >>> +Support for Jumbo Frames may be enabled at run-time for DPDK-type ports. >>> + >>> +To avail of Jumbo Frame support, add the '--dpdk-mtu' option to the >>> ovs-vsctl >>> +'add-port' command-line, along with the required MTU for the port. >>> +e.g. >> >>This text and the example do not match. I think '--dpdk-mtu' is not valid. > >Good catch, thanks. > >> >>> + >>> + ``` >>> + ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk >>> options:dpdk-mtu=9000 >>> + ``` >>> + >>> +When Jumbo Frames are enabled, the size of a DPDK port's mbuf segments are >>> +increased, such that a full Jumbo Frame may be accommodated inside a single >>> +mbuf segment. Once set, the MTU for a DPDK port is immutable. >> >>Why no support? DPDK supports changing the mtu. > >I guess my rationale here is that an MTU change can't be triggered via OVS >command-line, nor >can it be triggered programmatically via DPDK (apart from an explicit call to >rte_eth_dev_set_mtu). >So, while technically it's possibly, from a user's point of view, there's no >way to configure >it, outside of modifying the code directly. If I've missed something here, >please let me >know. > >> >>> +Jumbo frame support has been validated against 13312B frames, using the >>> +DPDK `igb_uio` driver, but larger frames and other DPDK NIC drivers may >>> +theoretically be supported. Supported port types excludes vHost-Cuse >>> ports, as >>> +this feature is pending deprecation. >>> + >>> + >>> +vHost Ports and Jumbo Frames >>> +---------------------------- >>> +Jumbo frame support is available for DPDK vHost-User ports only. Some >>> additional >>> +configuration is needed to take advantage of this feature: >>> + >>> + 1. `mergeable buffers` must be enabled for vHost ports, as demonstrated >>> in >>> + the QEMU command line snippet below: >>> + >>> + ``` >>> + '-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \' >>> + '-device >>> virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on' >>> + ``` >>> + >>> + 2. Where virtio devices are bound to the Linux kernel driver in a guest >>> + environment (i.e. interfaces are not bound to an in-guest DPDK >>> driver), the >>> + MTU of those logical network interfaces must also be increased. This >>> + avoids segmentation of Jumbo Frames in the guest. Note that 'MTU' >>> refers >>> + to the length of the IP packet only, and not that of the entire frame. >>> + >>> + e.g. To calculate the exact MTU of a standard IPv4 frame, subtract >>> the L2 >>> + header and CRC lengths (i.e. 18B) from the max supported frame size. >>> + So, to set the MTU for a 13312B Jumbo Frame: >>> + >>> + ``` >>> + ifconfig eth1 mtu 13294 >>> + ``` >>> + >>> + >>> Restrictions: >>> ------------- >>> >>> - - Work with 1500 MTU, needs few changes in DPDK lib to fix this issue. >>> - Currently DPDK port does not make use any offload functionality. >>> - DPDK-vHost support works with 1G huge pages. >>> >>> @@ -903,6 +954,12 @@ Restrictions: >>> the next release of DPDK (which includes the above patch) is available >>> and >>> integrated into OVS. >>> >>> + Jumbo Frames: >>> + - `virtio-pmd`: DPDK apps in the guest do not exit gracefully. The >>> source of >>> + this issue is currently being investigated. >>> + - vHost-Cuse: Jumbo Frame support is not available for vHost Cuse ports. >>> + >>> + >>> Bug Reporting: >>> -------------- >>> >>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c >>> index de7e488..5f23e28 100644 >>> --- a/lib/netdev-dpdk.c >>> +++ b/lib/netdev-dpdk.c >>> @@ -62,20 +62,25 @@ static struct vlog_rate_limit rl = >>> VLOG_RATE_LIMIT_INIT(5, 20); >>> #define OVS_CACHE_LINE_SIZE CACHE_LINE_SIZE >>> #define OVS_VPORT_DPDK "ovs_dpdk" >>> >>> +#define NETDEV_DPDK_JUMBO_FRAME_ENABLED 1 >>> +#define NETDEV_DPDK_DEFAULT_RX_BUFSIZE 1024 >>> + >>> /* >>> * need to reserve tons of extra space in the mbufs so we can align the >>> * DMA addresses to 4KB. >>> * The minimum mbuf size is limited to avoid scatter behaviour and drop in >>> * performance for standard Ethernet MTU. >>> */ >>> -#define MTU_TO_MAX_LEN(mtu) ((mtu) + ETHER_HDR_LEN + ETHER_CRC_LEN) >>> -#define MBUF_SIZE_MTU(mtu) (MTU_TO_MAX_LEN(mtu) \ >>> - + sizeof(struct dp_packet) \ >>> - + RTE_PKTMBUF_HEADROOM) >>> -#define MBUF_SIZE_DRIVER (2048 \ >>> - + sizeof (struct rte_mbuf) \ >>> - + RTE_PKTMBUF_HEADROOM) >>> -#define MBUF_SIZE(mtu) MAX(MBUF_SIZE_MTU(mtu), MBUF_SIZE_DRIVER) >>> +#define MTU_TO_FRAME_LEN(mtu) ((mtu) + ETHER_HDR_LEN + ETHER_CRC_LEN) >>> +#define FRAME_LEN_TO_MTU(frame_len) ((frame_len)- ETHER_HDR_LEN - >>> ETHER_CRC_LEN) >>> +#define MBUF_SEGMENT_SIZE(mtu) ( MTU_TO_FRAME_LEN(mtu) \ >>> + + sizeof(struct dp_packet) \ >>> + + RTE_PKTMBUF_HEADROOM) >>> + >>> +/* This value should be specified as a multiple of the DPDK NIC driver's >>> + * 'min_rx_bufsize' attribute (currently 1024B for 'igb_uio'). >>> + */ >>> +#define NETDEV_DPDK_MAX_FRAME_LEN 13312 >>> >>> /* Max and min number of packets in the mempool. OVS tries to allocate a >>> * mempool with MAX_NB_MBUF: if this fails (because the system doesn't have >>> @@ -86,6 +91,8 @@ static struct vlog_rate_limit rl = >>> VLOG_RATE_LIMIT_INIT(5, 20); >>> #define MIN_NB_MBUF (4096 * 4) >>> #define MP_CACHE_SZ RTE_MEMPOOL_CACHE_MAX_SIZE >>> >>> +#define DPDK_VLAN_TAG_LEN 4 >>> + >>> /* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */ >>> BUILD_ASSERT_DECL(MAX_NB_MBUF % ROUND_DOWN_POW2(MAX_NB_MBUF/MIN_NB_MBUF) >>> == 0); >>> >>> @@ -114,7 +121,6 @@ static const struct rte_eth_conf port_conf = { >>> .header_split = 0, /* Header Split disabled */ >>> .hw_ip_checksum = 0, /* IP checksum offload disabled */ >>> .hw_vlan_filter = 0, /* VLAN filtering disabled */ >>> - .jumbo_frame = 0, /* Jumbo Frame Support disabled */ >>> .hw_strip_crc = 0, >>> }, >>> .rx_adv_conf = { >>> @@ -254,6 +260,39 @@ is_dpdk_class(const struct netdev_class *class) >>> return class->construct == netdev_dpdk_construct; >>> } >>> >>> +/* DPDK NIC drivers allocate RX buffers at a particular granularity >>> + * (specified by rte_eth_dev_info.min_rx_bufsize - currently 1K for >>> igb_uio). >>> + * If 'frame_len' is not a multiple of this value, insufficient buffers are >>> + * allocated to accomodate the packet in its entirety. Furthermore, the >>> igb_uio >>> + * driver needs to ensure that there is also sufficient space in the Rx >>> buffer >>> + * to accommodate two VLAN tags (for QinQ frames). If the RX buffer is too >>> + * small, then the driver enables scatter RX behaviour, which reduces >>> + * performance. To prevent this, use a buffer size that is closest to >>> + * 'frame_len', but which satisfies the aforementioned criteria. >>> + */ >>> +static uint32_t >>> +dpdk_buf_size(struct netdev_dpdk *netdev, int frame_len) >>> +{ >>> + struct rte_eth_dev_info info; >>> + uint32_t buf_size; >>> + uint32_t len = frame_len + (2 * DPDK_VLAN_TAG_LEN); >>> + >>> + if(netdev->type == DPDK_DEV_ETH) { >>> + rte_eth_dev_info_get(netdev->port_id, &info); >>> + buf_size = (info.min_rx_bufsize == 0) ? >>> + NETDEV_DPDK_DEFAULT_RX_BUFSIZE : >>> + info.min_rx_bufsize; >> >>Why not use the rte_eth_dev_get_mtu call to get the port configured mtu? > >I'm not sure if I follow - the MTU isn't the issue here, but rather the >buffer size used >when creating the rte_mempool, which must be a multiple of the driver's >min_rx_bufsize. > >> >>> + } else { >>> + buf_size = NETDEV_DPDK_DEFAULT_RX_BUFSIZE; >>> + } >>> + >>> + if(len % buf_size) { >>> + len = buf_size * ((len/buf_size) + 1); >>> + } >>> + >>> + return len; >>> +} >>> + >>> /* XXX: use dpdk malloc for entire OVS. in fact huge page should be used >>> * for all other segments data, bss and text. */ >>> >>> @@ -280,31 +319,70 @@ free_dpdk_buf(struct dp_packet *p) >>> } >>> >>> static void >>> -__rte_pktmbuf_init(struct rte_mempool *mp, >>> - void *opaque_arg OVS_UNUSED, >>> - void *_m, >>> - unsigned i OVS_UNUSED) >>> +ovs_rte_pktmbuf_pool_init(struct rte_mempool *mp, void *opaque_arg) >>> { >>> - struct rte_mbuf *m = _m; >>> - uint32_t buf_len = mp->elt_size - sizeof(struct dp_packet); >>> + struct rte_pktmbuf_pool_private *user_mbp_priv, *mbp_priv; >>> + struct rte_pktmbuf_pool_private default_mbp_priv; >>> + uint16_t roomsz; >>> >>> RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet)); >>> >>> - memset(m, 0, mp->elt_size); >>> + /* if no structure is provided, assume no mbuf private area */ >>> + >>> + user_mbp_priv = opaque_arg; >>> + if (user_mbp_priv == NULL) { >>> + default_mbp_priv.mbuf_priv_size = 0; >>> + if (mp->elt_size > sizeof(struct dp_packet)) { >>> + roomsz = mp->elt_size - sizeof(struct dp_packet); >>> + } else { >>> + roomsz = 0; >>> + } >>> + default_mbp_priv.mbuf_data_room_size = roomsz; >>> + user_mbp_priv = &default_mbp_priv; >>> + } >>> + >>> + RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet) + >>> + user_mbp_priv->mbuf_data_room_size + >>> + user_mbp_priv->mbuf_priv_size); >>> + >>> + mbp_priv = rte_mempool_get_priv(mp); >>> + memcpy(mbp_priv, user_mbp_priv, sizeof(*mbp_priv)); >>> +} >>> + >>> +/* Initialise some fields in the mbuf structure that are not modified by >>> the >>> + * user once created (origin pool, buffer start address, etc.*/ >>> +static void >>> +__ovs_rte_pktmbuf_init(struct rte_mempool *mp, >>> + void *opaque_arg OVS_UNUSED, >>> + void *_m, >>> + unsigned i OVS_UNUSED) >>> +{ >>> + struct rte_mbuf *m = _m; >>> + uint32_t buf_size, buf_len, priv_size; >>> + >>> + priv_size = rte_pktmbuf_priv_size(mp); >>> + buf_size = sizeof(struct dp_packet) + priv_size; >>> + buf_len = rte_pktmbuf_data_room_size(mp); >>> >>> - /* start of buffer is just after mbuf structure */ >>> - m->buf_addr = (char *)m + sizeof(struct dp_packet); >>> - m->buf_physaddr = rte_mempool_virt2phy(mp, m) + >>> - sizeof(struct dp_packet); >>> - m->buf_len = (uint16_t)buf_len; >>> + RTE_MBUF_ASSERT(RTE_ALIGN(priv_size, RTE_MBUF_PRIV_ALIGN) == priv_size); >>> + RTE_MBUF_ASSERT(mp->elt_size >= buf_size); >>> + RTE_MBUF_ASSERT(buf_len <= UINT16_MAX); >>> >>> - /* keep some headroom between start of buffer and data */ >>> - m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, m->buf_len); >>> + memset(m, 0, mp->elt_size); >>> >>> - /* init some constant fields */ >>> - m->pool = mp; >>> - m->nb_segs = 1; >>> - m->port = 0xff; >>> + /* start of buffer is after dp_packet structure and priv data */ >>> + m->priv_size = priv_size; >>> + m->buf_addr = (char *)m + buf_size; >>> + m->buf_physaddr = rte_mempool_virt2phy(mp, m) + buf_size; >>> + m->buf_len = (uint16_t)buf_len; >>> + >>> + /* keep some headroom between start of buffer and data */ >>> + m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, (uint16_t)m->buf_len); >>> + >>> + /* init some constant fields */ >>> + m->pool = mp; >>> + m->nb_segs = 1; >>> + m->port = 0xff; >> >>Please don't mix tabs and spaces in the file. > >Apologies - this was a copy/paste from a DPDK file, which uses tabs. > >> >>> } >>> >>> static void >>> @@ -315,7 +393,7 @@ ovs_rte_pktmbuf_init(struct rte_mempool *mp, >>> { >>> struct rte_mbuf *m = _m; >>> >>> - __rte_pktmbuf_init(mp, opaque_arg, _m, i); >>> + __ovs_rte_pktmbuf_init(mp, opaque_arg, m, i); >>> >>> dp_packet_init_dpdk((struct dp_packet *) m, m->buf_len); >>> } >>> @@ -326,6 +404,7 @@ dpdk_mp_get(int socket_id, int mtu) >>> OVS_REQUIRES(dpdk_mutex) >>> struct dpdk_mp *dmp = NULL; >>> char mp_name[RTE_MEMPOOL_NAMESIZE]; >>> unsigned mp_size; >>> + struct rte_pktmbuf_pool_private mbp_priv; >>> >>> LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) { >>> if (dmp->socket_id == socket_id && dmp->mtu == mtu) { >>> @@ -338,6 +417,8 @@ dpdk_mp_get(int socket_id, int mtu) >>> OVS_REQUIRES(dpdk_mutex) >>> dmp->socket_id = socket_id; >>> dmp->mtu = mtu; >>> dmp->refcount = 1; >>> + mbp_priv.mbuf_data_room_size = MTU_TO_FRAME_LEN(mtu) + >>> RTE_PKTMBUF_HEADROOM; >>> + mbp_priv.mbuf_priv_size = 0; >>> >>> mp_size = MAX_NB_MBUF; >>> do { >>> @@ -346,10 +427,10 @@ dpdk_mp_get(int socket_id, int mtu) >>> OVS_REQUIRES(dpdk_mutex) >>> return NULL; >>> } >>> >>> - dmp->mp = rte_mempool_create(mp_name, mp_size, MBUF_SIZE(mtu), >>> + dmp->mp = rte_mempool_create(mp_name, mp_size, >>> MBUF_SEGMENT_SIZE(mtu), >>> MP_CACHE_SZ, >>> sizeof(struct >>> rte_pktmbuf_pool_private), >>> - rte_pktmbuf_pool_init, NULL, >>> + ovs_rte_pktmbuf_pool_init, &mbp_priv, >>> ovs_rte_pktmbuf_init, NULL, >>> socket_id, 0); >>> } while (!dmp->mp && rte_errno == ENOMEM && (mp_size /= 2) >= >>> MIN_NB_MBUF); >>> @@ -433,6 +514,7 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int >>> n_rxq, int >n_txq) >>> { >>> int diag = 0; >>> int i; >>> + struct rte_eth_conf conf = port_conf; >>> >>> /* A device may report more queues than it makes available (this has >>> * been observed for Intel xl710, which reserves some of them for >>> @@ -444,7 +526,12 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int >>> n_rxq, int >>n_txq) >>> VLOG_INFO("Retrying setup with (rxq:%d txq:%d)", n_rxq, n_txq); >>> } >>> >>> - diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, >>> &port_conf); >>> + if(OVS_UNLIKELY(dev->mtu > ETHER_MTU)) { >>> + conf.rxmode.jumbo_frame = NETDEV_DPDK_JUMBO_FRAME_ENABLED; >>> + conf.rxmode.max_rx_pkt_len = MTU_TO_FRAME_LEN(dev->mtu); >>> + } >>> + >>> + diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, &conf); >>> if (diag) { >>> break; >>> } >>> @@ -586,6 +673,7 @@ netdev_dpdk_init(struct netdev *netdev_, unsigned int >>> port_no, >>> struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_); >>> int sid; >>> int err = 0; >>> + uint32_t buf_size; >>> >>> ovs_mutex_init(&netdev->mutex); >>> ovs_mutex_lock(&netdev->mutex); >>> @@ -605,10 +693,16 @@ netdev_dpdk_init(struct netdev *netdev_, unsigned int >>> port_no, >>> netdev->port_id = port_no; >>> netdev->type = type; >>> netdev->flags = 0; >>> + >>> + /* Initialize port's MTU and frame len to the default Ethernet values. >>> + * Larger, user-specified (jumbo) frame buffers are accommodated in >>> + * netdev_dpdk_set_config. >>> + */ >>> + netdev->max_packet_len = ETHER_MAX_LEN; >>> netdev->mtu = ETHER_MTU; >>> - netdev->max_packet_len = MTU_TO_MAX_LEN(netdev->mtu); >>> >>> - netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, netdev->mtu); >>> + buf_size = dpdk_buf_size(netdev, ETHER_MAX_LEN); >>> + netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, >>> FRAME_LEN_TO_MTU(buf_size)); >>> if (!netdev->dpdk_mp) { >>> err = ENOMEM; >>> goto unlock; >>> @@ -651,6 +745,24 @@ dpdk_dev_parse_name(const char dev_name[], const char >>> prefix[], >>> return 0; >>> } >>> >>> +static void >>> +dpdk_dev_parse_mtu(const struct smap *args, int *mtu) >>> +{ >>> + const char *mtu_str = smap_get(args, "dpdk-mtu"); >>> + int local_mtu; >>> + >>> + if(mtu_str) { >>> + local_mtu = atoi(mtu_str); >> >>Please use strtol or strtoul here, and detect errors in the string by >>checking endptr. That way, we can be sure that random garbage on the >>stack won't accidentally become the mtu. > >Will do, thanks. > >> >>> + } >>> + if(!mtu_str || local_mtu < ETHER_MTU || >>> + local_mtu > >>> FRAME_LEN_TO_MTU(NETDEV_DPDK_MAX_FRAME_LEN)) { >>> + local_mtu = ETHER_MTU; >>> + VLOG_WARN("Invalid or missing dpdk-mtu parameter - defaulting to >>> %d.\n", >>> + local_mtu); >>> + } >>> + *mtu = local_mtu; >>> +} >>> + >>> static int >>> vhost_construct_helper(struct netdev *netdev_) OVS_REQUIRES(dpdk_mutex) >>> { >>> @@ -777,11 +889,77 @@ netdev_dpdk_get_config(const struct netdev *netdev_, >>> struct smap >>*args) >>> smap_add_format(args, "configured_rx_queues", "%d", netdev_->n_rxq); >>> smap_add_format(args, "requested_tx_queues", "%d", netdev_->n_txq); >>> smap_add_format(args, "configured_tx_queues", "%d", dev->real_n_txq); >>> + smap_add_format(args, "mtu", "%d", dev->mtu); >>> ovs_mutex_unlock(&dev->mutex); >>> >>> return 0; >>> } >>> >>> +/* Set the mtu of DPDK_DEV_ETH ports */ >>> +static int >>> +netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu) >>> +{ >>> + struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); >>> + int old_mtu, err; >>> + uint32_t buf_size; >>> + int dpdk_mtu; >>> + struct dpdk_mp *old_mp; >>> + struct dpdk_mp *mp; >>> + >>> + ovs_mutex_lock(&dpdk_mutex); >>> + ovs_mutex_lock(&dev->mutex); >>> + if (dev->mtu == mtu) { >>> + err = 0; >>> + goto out; >>> + } >>> + >>> + buf_size = dpdk_buf_size(dev, MTU_TO_FRAME_LEN(mtu)); >>> + dpdk_mtu = FRAME_LEN_TO_MTU(buf_size); >>> + >>> + mp = dpdk_mp_get(dev->socket_id, dpdk_mtu); >>> + if (!mp) { >>> + err = ENOMEM; >>> + goto out; >>> + } >>> + >>> + rte_eth_dev_stop(dev->port_id); >>> + >>> + old_mtu = dev->mtu; >>> + old_mp = dev->dpdk_mp; >>> + dev->dpdk_mp = mp; >>> + dev->mtu = mtu; >>> + dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu); >>> + >>> + err = dpdk_eth_dev_init(dev); >> >>Why call the dpdk_eth_dev_init here? Since you are directly calling >>rte_eth_dev_X functions here - wouldn't it make sense to use the >>rte_eth_dev_start() instead? > >This is legacy code, but I can refactor if needs be. > >> >>> + if (err) { >>> + VLOG_WARN("Unable to set MTU '%d' for '%s'; reverting to last >>> known " >>> + "good value '%d'\n", mtu, dev->up.name, old_mtu); >>> + dpdk_mp_put(mp); >>> + dev->mtu = old_mtu; >>> + dev->dpdk_mp = old_mp; >>> + dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu); >>> + err = dpdk_eth_dev_init(dev); >> >>What happens if this returns a failure? By using dpdk_eth_dev_init, you >>will cause a queue reconfiguration action, which is not needed, and can >>error. >> > >As above. > >>If you are intending to set MTU, I would have expected calls to >>rte_eth_dev_set_mtu. However, I don't see any. Is there a reason? I >>recognize that much of this is legacy, but since you are touching it :) > >I actually had a number of issues with rte_eth_dev_set_mtu early on in the >patch; I can take >a second look now that the code is more stable though.
I remembered why I didn't use this. If you look at the IXGBE driver's mtu_set function, it refuses frame lengths which are larger than the eth_dev's min_rx_buf_size (- RTE_PKTMBUF_HEADROOM). This might be avoided by configuring the device for scatter RX behavior, but this incurs a performance impact. <snippet> ixgbe_dev_mtu_set(...) { ... /* refuse mtu that requires the support of scattered packets when this * feature has not been enabled before. */ if (!dev->data->scattered_rx && (frame_size + 2 * IXGBE_VLAN_TAG_SIZE > dev->data->min_rx_buf_size - RTE_PKTMBUF_HEADROOM)) return -EINVAL; ... } </snippet> > >> >>> + goto out; >>> + } >>> + >>> + dpdk_mp_put(old_mp); >>> + netdev_change_seq_changed(netdev); >>> +out: >>> + ovs_mutex_unlock(&dev->mutex); >>> + ovs_mutex_unlock(&dpdk_mutex); >>> + return err; >>> +} >>> + >>> +static int >>> +netdev_dpdk_set_config(struct netdev *netdev_, const struct smap *args) >>> +{ >>> + int mtu; >>> + >>> + dpdk_dev_parse_mtu(args, &mtu); >>> + >>> + return netdev_dpdk_set_mtu(netdev_, mtu); >>> +} >>> + >>> static int >>> netdev_dpdk_get_numa_id(const struct netdev *netdev_) >>> { >>> @@ -1358,54 +1536,6 @@ netdev_dpdk_get_mtu(const struct netdev *netdev, int >>> *mtup) >>> >>> return 0; >>> } >>> - >>> -static int >>> -netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu) >>> -{ >>> - struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); >>> - int old_mtu, err; >>> - struct dpdk_mp *old_mp; >>> - struct dpdk_mp *mp; >>> - >>> - ovs_mutex_lock(&dpdk_mutex); >>> - ovs_mutex_lock(&dev->mutex); >>> - if (dev->mtu == mtu) { >>> - err = 0; >>> - goto out; >>> - } >>> - >>> - mp = dpdk_mp_get(dev->socket_id, dev->mtu); >>> - if (!mp) { >>> - err = ENOMEM; >>> - goto out; >>> - } >>> - >>> - rte_eth_dev_stop(dev->port_id); >>> - >>> - old_mtu = dev->mtu; >>> - old_mp = dev->dpdk_mp; >>> - dev->dpdk_mp = mp; >>> - dev->mtu = mtu; >>> - dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu); >>> - >>> - err = dpdk_eth_dev_init(dev); >>> - if (err) { >>> - dpdk_mp_put(mp); >>> - dev->mtu = old_mtu; >>> - dev->dpdk_mp = old_mp; >>> - dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu); >>> - dpdk_eth_dev_init(dev); >>> - goto out; >>> - } >>> - >>> - dpdk_mp_put(old_mp); >>> - netdev_change_seq_changed(netdev); >>> -out: >>> - ovs_mutex_unlock(&dev->mutex); >>> - ovs_mutex_unlock(&dpdk_mutex); >>> - return err; >>> -} >>> - >>> static int >>> netdev_dpdk_get_carrier(const struct netdev *netdev_, bool *carrier); >>> >>> @@ -1682,7 +1812,7 @@ netdev_dpdk_get_status(const struct netdev *netdev_, >>> struct smap >>*args) >>> smap_add_format(args, "numa_id", "%d", >>> rte_eth_dev_socket_id(dev->port_id)); >>> smap_add_format(args, "driver_name", "%s", dev_info.driver_name); >>> smap_add_format(args, "min_rx_bufsize", "%u", dev_info.min_rx_bufsize); >>> - smap_add_format(args, "max_rx_pktlen", "%u", dev_info.max_rx_pktlen); >>> + smap_add_format(args, "max_rx_pktlen", "%u", dev->max_packet_len); >>> smap_add_format(args, "max_rx_queues", "%u", dev_info.max_rx_queues); >>> smap_add_format(args, "max_tx_queues", "%u", dev_info.max_tx_queues); >>> smap_add_format(args, "max_mac_addrs", "%u", dev_info.max_mac_addrs); >>> @@ -1904,6 +2034,51 @@ dpdk_vhost_user_class_init(void) >>> return 0; >>> } >>> >>> +/* Set the mtu of DPDK_DEV_VHOST ports */ >>> +static int >>> +netdev_dpdk_vhost_set_mtu(const struct netdev *netdev, int mtu) >>> +{ >>> + struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); >>> + int err = 0; >>> + struct dpdk_mp *old_mp; >>> + struct dpdk_mp *mp; >>> + >>> + ovs_mutex_lock(&dpdk_mutex); >>> + ovs_mutex_lock(&dev->mutex); >>> + if (dev->mtu == mtu) { >>> + err = 0; >>> + goto out; >>> + } >>> + >>> + mp = dpdk_mp_get(dev->socket_id, mtu); >>> + if (!mp) { >>> + err = ENOMEM; >>> + goto out; >>> + } >>> + >>> + old_mp = dev->dpdk_mp; >>> + dev->dpdk_mp = mp; >>> + dev->mtu = mtu; >>> + dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu); >>> + >>> + dpdk_mp_put(old_mp); >>> + netdev_change_seq_changed(netdev); >>> +out: >>> + ovs_mutex_unlock(&dev->mutex); >>> + ovs_mutex_unlock(&dpdk_mutex); >>> + return err; >>> +} >>> + >>> +static int >>> +netdev_dpdk_vhost_set_config(struct netdev *netdev_, const struct smap >>> *args) >>> +{ >>> + int mtu; >>> + >>> + dpdk_dev_parse_mtu(args, &mtu); >>> + >>> + return netdev_dpdk_vhost_set_mtu(netdev_, mtu); >>> +} >>> + >>> static void >>> dpdk_common_init(void) >>> { >>> @@ -2040,8 +2215,9 @@ unlock_dpdk: >>> return err; >>> } >>> >>> -#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, MULTIQ, SEND, \ >>> - GET_CARRIER, GET_STATS, GET_FEATURES, GET_STATUS, RXQ_RECV) \ >>> +#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, SET_CONFIG, \ >>> + MULTIQ, SEND, SET_MTU, GET_CARRIER, GET_STATS, GET_FEATURES, \ >>> + GET_STATUS, RXQ_RECV) \ >>> { \ >>> NAME, \ >>> INIT, /* init */ \ >>> @@ -2053,7 +2229,7 @@ unlock_dpdk: >>> DESTRUCT, \ >>> netdev_dpdk_dealloc, \ >>> netdev_dpdk_get_config, \ >>> - NULL, /* netdev_dpdk_set_config */ \ >>> + SET_CONFIG, \ >>> NULL, /* get_tunnel_config */ \ >>> NULL, /* build header */ \ >>> NULL, /* push header */ \ >>> @@ -2067,7 +2243,7 @@ unlock_dpdk: >>> netdev_dpdk_set_etheraddr, \ >>> netdev_dpdk_get_etheraddr, \ >>> netdev_dpdk_get_mtu, \ >>> - netdev_dpdk_set_mtu, \ >>> + SET_MTU, \ >>> netdev_dpdk_get_ifindex, \ >>> GET_CARRIER, \ >>> netdev_dpdk_get_carrier_resets, \ >>> @@ -2213,8 +2389,10 @@ static const struct netdev_class dpdk_class = >>> NULL, >>> netdev_dpdk_construct, >>> netdev_dpdk_destruct, >>> + netdev_dpdk_set_config, >>> netdev_dpdk_set_multiq, >>> netdev_dpdk_eth_send, >>> + netdev_dpdk_set_mtu, >>> netdev_dpdk_get_carrier, >>> netdev_dpdk_get_stats, >>> netdev_dpdk_get_features, >>> @@ -2227,8 +2405,10 @@ static const struct netdev_class dpdk_ring_class = >>> NULL, >>> netdev_dpdk_ring_construct, >>> netdev_dpdk_destruct, >>> + netdev_dpdk_set_config, >>> netdev_dpdk_set_multiq, >>> netdev_dpdk_ring_send, >>> + netdev_dpdk_set_mtu, >>> netdev_dpdk_get_carrier, >>> netdev_dpdk_get_stats, >>> netdev_dpdk_get_features, >>> @@ -2241,8 +2421,10 @@ static const struct netdev_class OVS_UNUSED >>> dpdk_vhost_cuse_class = >>> dpdk_vhost_cuse_class_init, >>> netdev_dpdk_vhost_cuse_construct, >>> netdev_dpdk_vhost_destruct, >>> + NULL, >>> netdev_dpdk_vhost_set_multiq, >>> netdev_dpdk_vhost_send, >>> + NULL, >>> netdev_dpdk_vhost_get_carrier, >>> netdev_dpdk_vhost_get_stats, >>> NULL, >>> @@ -2255,8 +2437,10 @@ static const struct netdev_class OVS_UNUSED >>> dpdk_vhost_user_class = >>> dpdk_vhost_user_class_init, >>> netdev_dpdk_vhost_user_construct, >>> netdev_dpdk_vhost_destruct, >>> + netdev_dpdk_vhost_set_config, >>> netdev_dpdk_vhost_set_multiq, >>> netdev_dpdk_vhost_send, >>> + netdev_dpdk_vhost_set_mtu, >>> netdev_dpdk_vhost_get_carrier, >>> netdev_dpdk_vhost_get_stats, >>> NULL, >_______________________________________________ >dev mailing list >dev@openvswitch.org >http://openvswitch.org/mailman/listinfo/dev _______________________________________________ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev