Further clarifications below Aaron - please let me know if you have any additional comments.
Thanks, Mark > >One addendum below, Aaron. > >> >>>Hi Mark, >>> >>>Thanks for the patch! I've not done a very thorough review of this, but >>>my first-blush comments are inline. >> >>Thanks for the review Aaron! >> >>Any additional comments are welcome - my initial responses are inline, below. >> >>Thanks, >>Mark >> >>> >>>Mark Kavanagh <mark.b.kavan...@intel.com> writes: >>>> Add support for Jumbo Frames to DPDK-enabled port types, >>>> using single-segment-mbufs. >>>> >>>> Using this approach, the amount of memory allocated for each mbuf >>>> to store frame data is increased to a value greater than 1518B >>>> (typical Ethernet maximum frame length). The increased space >>>> available in the mbuf means that an entire Jumbo Frame can be carried >>>> in a single mbuf, as opposed to partitioning it across multiple mbuf >>>> segments. >>>> >>>> The amount of space allocated to each mbuf to hold frame data is >>>> defined dynamically by the user when adding a DPDK port to a bridge. >>>> If an MTU value is not supplied, or the user-supplied value is invalid, >>>> the MTU for the port defaults to standard Ethernet MTU (i.e. 1500B). >>>> >>>> Signed-off-by: Mark Kavanagh <mark.b.kavan...@intel.com> >>>> --- >>> >>>This is a new feature which has user-visible impact, so should be >>>announced in NEWS (plus, it's great news). >> >>No problem - I'll add an entry. >> >>> >>>> INSTALL.DPDK.md | 59 ++++++++- >>>> lib/netdev-dpdk.c | 356 >>>> +++++++++++++++++++++++++++++++++++++++++------------- >>>> 2 files changed, 328 insertions(+), 87 deletions(-) >>>> >>>> diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md >>>> index 96b686c..2f23e27 100644 >>>> --- a/INSTALL.DPDK.md >>>> +++ b/INSTALL.DPDK.md >>>> @@ -859,10 +859,61 @@ by adding the following string: >>>> to <interface> sections of all network devices used by DPDK. Parameter 'N' >>>> determines how many queues can be used by the guest. >>>> >>>> + >>>> +Jumbo Frames >>>> +------------ >>>> + >>>> +Support for Jumbo Frames may be enabled at run-time for DPDK-type ports. >>>> + >>>> +To avail of Jumbo Frame support, add the '--dpdk-mtu' option to the >>>> ovs-vsctl >>>> +'add-port' command-line, along with the required MTU for the port. >>>> +e.g. >>> >>>This text and the example do not match. I think '--dpdk-mtu' is not valid. >> >>Good catch, thanks. >> >>> >>>> + >>>> + ``` >>>> + ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk >>>> options:dpdk-mtu=9000 >>>> + ``` >>>> + >>>> +When Jumbo Frames are enabled, the size of a DPDK port's mbuf segments are >>>> +increased, such that a full Jumbo Frame may be accommodated inside a >>>> single >>>> +mbuf segment. Once set, the MTU for a DPDK port is immutable. >>> >>>Why no support? DPDK supports changing the mtu. >> >>I guess my rationale here is that an MTU change can't be triggered via OVS >>command-line, nor >>can it be triggered programmatically via DPDK (apart from an explicit call to >>rte_eth_dev_set_mtu). >>So, while technically it's possibly, from a user's point of view, there's no >>way to >configure >>it, outside of modifying the code directly. If I've missed something here, >>please let me >>know. >> >>> >>>> +Jumbo frame support has been validated against 13312B frames, using the >>>> +DPDK `igb_uio` driver, but larger frames and other DPDK NIC drivers may >>>> +theoretically be supported. Supported port types excludes vHost-Cuse >>>> ports, as >>>> +this feature is pending deprecation. >>>> + >>>> + >>>> +vHost Ports and Jumbo Frames >>>> +---------------------------- >>>> +Jumbo frame support is available for DPDK vHost-User ports only. Some >>>> additional >>>> +configuration is needed to take advantage of this feature: >>>> + >>>> + 1. `mergeable buffers` must be enabled for vHost ports, as demonstrated >>>> in >>>> + the QEMU command line snippet below: >>>> + >>>> + ``` >>>> + '-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \' >>>> + '-device >>>> virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on' >>>> + ``` >>>> + >>>> + 2. Where virtio devices are bound to the Linux kernel driver in a guest >>>> + environment (i.e. interfaces are not bound to an in-guest DPDK >>>> driver), the >>>> + MTU of those logical network interfaces must also be increased. This >>>> + avoids segmentation of Jumbo Frames in the guest. Note that 'MTU' >>>> refers >>>> + to the length of the IP packet only, and not that of the entire >>>> frame. >>>> + >>>> + e.g. To calculate the exact MTU of a standard IPv4 frame, subtract >>>> the L2 >>>> + header and CRC lengths (i.e. 18B) from the max supported frame size. >>>> + So, to set the MTU for a 13312B Jumbo Frame: >>>> + >>>> + ``` >>>> + ifconfig eth1 mtu 13294 >>>> + ``` >>>> + >>>> + >>>> Restrictions: >>>> ------------- >>>> >>>> - - Work with 1500 MTU, needs few changes in DPDK lib to fix this issue. >>>> - Currently DPDK port does not make use any offload functionality. >>>> - DPDK-vHost support works with 1G huge pages. >>>> >>>> @@ -903,6 +954,12 @@ Restrictions: >>>> the next release of DPDK (which includes the above patch) is >>>> available and >>>> integrated into OVS. >>>> >>>> + Jumbo Frames: >>>> + - `virtio-pmd`: DPDK apps in the guest do not exit gracefully. The >>>> source of >>>> + this issue is currently being investigated. >>>> + - vHost-Cuse: Jumbo Frame support is not available for vHost Cuse ports. >>>> + >>>> + >>>> Bug Reporting: >>>> -------------- >>>> >>>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c >>>> index de7e488..5f23e28 100644 >>>> --- a/lib/netdev-dpdk.c >>>> +++ b/lib/netdev-dpdk.c >>>> @@ -62,20 +62,25 @@ static struct vlog_rate_limit rl = >>>> VLOG_RATE_LIMIT_INIT(5, 20); >>>> #define OVS_CACHE_LINE_SIZE CACHE_LINE_SIZE >>>> #define OVS_VPORT_DPDK "ovs_dpdk" >>>> >>>> +#define NETDEV_DPDK_JUMBO_FRAME_ENABLED 1 >>>> +#define NETDEV_DPDK_DEFAULT_RX_BUFSIZE 1024 >>>> + >>>> /* >>>> * need to reserve tons of extra space in the mbufs so we can align the >>>> * DMA addresses to 4KB. >>>> * The minimum mbuf size is limited to avoid scatter behaviour and drop in >>>> * performance for standard Ethernet MTU. >>>> */ >>>> -#define MTU_TO_MAX_LEN(mtu) ((mtu) + ETHER_HDR_LEN + ETHER_CRC_LEN) >>>> -#define MBUF_SIZE_MTU(mtu) (MTU_TO_MAX_LEN(mtu) \ >>>> - + sizeof(struct dp_packet) \ >>>> - + RTE_PKTMBUF_HEADROOM) >>>> -#define MBUF_SIZE_DRIVER (2048 \ >>>> - + sizeof (struct rte_mbuf) \ >>>> - + RTE_PKTMBUF_HEADROOM) >>>> -#define MBUF_SIZE(mtu) MAX(MBUF_SIZE_MTU(mtu), MBUF_SIZE_DRIVER) >>>> +#define MTU_TO_FRAME_LEN(mtu) ((mtu) + ETHER_HDR_LEN + >>>> ETHER_CRC_LEN) >>>> +#define FRAME_LEN_TO_MTU(frame_len) ((frame_len)- ETHER_HDR_LEN - >>>> ETHER_CRC_LEN) >>>> +#define MBUF_SEGMENT_SIZE(mtu) ( MTU_TO_FRAME_LEN(mtu) \ >>>> + + sizeof(struct dp_packet) \ >>>> + + RTE_PKTMBUF_HEADROOM) >>>> + >>>> +/* This value should be specified as a multiple of the DPDK NIC driver's >>>> + * 'min_rx_bufsize' attribute (currently 1024B for 'igb_uio'). >>>> + */ >>>> +#define NETDEV_DPDK_MAX_FRAME_LEN 13312 >>>> >>>> /* Max and min number of packets in the mempool. OVS tries to allocate a >>>> * mempool with MAX_NB_MBUF: if this fails (because the system doesn't >>>> have >>>> @@ -86,6 +91,8 @@ static struct vlog_rate_limit rl = >>>> VLOG_RATE_LIMIT_INIT(5, 20); >>>> #define MIN_NB_MBUF (4096 * 4) >>>> #define MP_CACHE_SZ RTE_MEMPOOL_CACHE_MAX_SIZE >>>> >>>> +#define DPDK_VLAN_TAG_LEN 4 >>>> + >>>> /* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */ >>>> BUILD_ASSERT_DECL(MAX_NB_MBUF % ROUND_DOWN_POW2(MAX_NB_MBUF/MIN_NB_MBUF) >>>> == 0); >>>> >>>> @@ -114,7 +121,6 @@ static const struct rte_eth_conf port_conf = { >>>> .header_split = 0, /* Header Split disabled */ >>>> .hw_ip_checksum = 0, /* IP checksum offload disabled */ >>>> .hw_vlan_filter = 0, /* VLAN filtering disabled */ >>>> - .jumbo_frame = 0, /* Jumbo Frame Support disabled */ >>>> .hw_strip_crc = 0, >>>> }, >>>> .rx_adv_conf = { >>>> @@ -254,6 +260,39 @@ is_dpdk_class(const struct netdev_class *class) >>>> return class->construct == netdev_dpdk_construct; >>>> } >>>> >>>> +/* DPDK NIC drivers allocate RX buffers at a particular granularity >>>> + * (specified by rte_eth_dev_info.min_rx_bufsize - currently 1K for >>>> igb_uio). >>>> + * If 'frame_len' is not a multiple of this value, insufficient buffers >>>> are >>>> + * allocated to accomodate the packet in its entirety. Furthermore, the >>>> igb_uio >>>> + * driver needs to ensure that there is also sufficient space in the Rx >>>> buffer >>>> + * to accommodate two VLAN tags (for QinQ frames). If the RX buffer is too >>>> + * small, then the driver enables scatter RX behaviour, which reduces >>>> + * performance. To prevent this, use a buffer size that is closest to >>>> + * 'frame_len', but which satisfies the aforementioned criteria. >>>> + */ >>>> +static uint32_t >>>> +dpdk_buf_size(struct netdev_dpdk *netdev, int frame_len) >>>> +{ >>>> + struct rte_eth_dev_info info; >>>> + uint32_t buf_size; >>>> + uint32_t len = frame_len + (2 * DPDK_VLAN_TAG_LEN); >>>> + >>>> + if(netdev->type == DPDK_DEV_ETH) { >>>> + rte_eth_dev_info_get(netdev->port_id, &info); >>>> + buf_size = (info.min_rx_bufsize == 0) ? >>>> + NETDEV_DPDK_DEFAULT_RX_BUFSIZE : >>>> + info.min_rx_bufsize; >>> >>>Why not use the rte_eth_dev_get_mtu call to get the port configured mtu? >> >>I'm not sure if I follow - the MTU isn't the issue here, but rather the >>buffer size used >>when creating the rte_mempool, which must be a multiple of the driver's >>min_rx_bufsize. >> >>> >>>> + } else { >>>> + buf_size = NETDEV_DPDK_DEFAULT_RX_BUFSIZE; >>>> + } >>>> + >>>> + if(len % buf_size) { >>>> + len = buf_size * ((len/buf_size) + 1); >>>> + } >>>> + >>>> + return len; >>>> +} >>>> + >>>> /* XXX: use dpdk malloc for entire OVS. in fact huge page should be used >>>> * for all other segments data, bss and text. */ >>>> >>>> @@ -280,31 +319,70 @@ free_dpdk_buf(struct dp_packet *p) >>>> } >>>> >>>> static void >>>> -__rte_pktmbuf_init(struct rte_mempool *mp, >>>> - void *opaque_arg OVS_UNUSED, >>>> - void *_m, >>>> - unsigned i OVS_UNUSED) >>>> +ovs_rte_pktmbuf_pool_init(struct rte_mempool *mp, void *opaque_arg) >>>> { >>>> - struct rte_mbuf *m = _m; >>>> - uint32_t buf_len = mp->elt_size - sizeof(struct dp_packet); >>>> + struct rte_pktmbuf_pool_private *user_mbp_priv, *mbp_priv; >>>> + struct rte_pktmbuf_pool_private default_mbp_priv; >>>> + uint16_t roomsz; >>>> >>>> RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet)); >>>> >>>> - memset(m, 0, mp->elt_size); >>>> + /* if no structure is provided, assume no mbuf private area */ >>>> + >>>> + user_mbp_priv = opaque_arg; >>>> + if (user_mbp_priv == NULL) { >>>> + default_mbp_priv.mbuf_priv_size = 0; >>>> + if (mp->elt_size > sizeof(struct dp_packet)) { >>>> + roomsz = mp->elt_size - sizeof(struct dp_packet); >>>> + } else { >>>> + roomsz = 0; >>>> + } >>>> + default_mbp_priv.mbuf_data_room_size = roomsz; >>>> + user_mbp_priv = &default_mbp_priv; >>>> + } >>>> + >>>> + RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet) + >>>> + user_mbp_priv->mbuf_data_room_size + >>>> + user_mbp_priv->mbuf_priv_size); >>>> + >>>> + mbp_priv = rte_mempool_get_priv(mp); >>>> + memcpy(mbp_priv, user_mbp_priv, sizeof(*mbp_priv)); >>>> +} >>>> + >>>> +/* Initialise some fields in the mbuf structure that are not modified by >>>> the >>>> + * user once created (origin pool, buffer start address, etc.*/ >>>> +static void >>>> +__ovs_rte_pktmbuf_init(struct rte_mempool *mp, >>>> + void *opaque_arg OVS_UNUSED, >>>> + void *_m, >>>> + unsigned i OVS_UNUSED) >>>> +{ >>>> + struct rte_mbuf *m = _m; >>>> + uint32_t buf_size, buf_len, priv_size; >>>> + >>>> + priv_size = rte_pktmbuf_priv_size(mp); >>>> + buf_size = sizeof(struct dp_packet) + priv_size; >>>> + buf_len = rte_pktmbuf_data_room_size(mp); >>>> >>>> - /* start of buffer is just after mbuf structure */ >>>> - m->buf_addr = (char *)m + sizeof(struct dp_packet); >>>> - m->buf_physaddr = rte_mempool_virt2phy(mp, m) + >>>> - sizeof(struct dp_packet); >>>> - m->buf_len = (uint16_t)buf_len; >>>> + RTE_MBUF_ASSERT(RTE_ALIGN(priv_size, RTE_MBUF_PRIV_ALIGN) == priv_size); >>>> + RTE_MBUF_ASSERT(mp->elt_size >= buf_size); >>>> + RTE_MBUF_ASSERT(buf_len <= UINT16_MAX); >>>> >>>> - /* keep some headroom between start of buffer and data */ >>>> - m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, m->buf_len); >>>> + memset(m, 0, mp->elt_size); >>>> >>>> - /* init some constant fields */ >>>> - m->pool = mp; >>>> - m->nb_segs = 1; >>>> - m->port = 0xff; >>>> + /* start of buffer is after dp_packet structure and priv data */ >>>> + m->priv_size = priv_size; >>>> + m->buf_addr = (char *)m + buf_size; >>>> + m->buf_physaddr = rte_mempool_virt2phy(mp, m) + buf_size; >>>> + m->buf_len = (uint16_t)buf_len; >>>> + >>>> + /* keep some headroom between start of buffer and data */ >>>> + m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, (uint16_t)m->buf_len); >>>> + >>>> + /* init some constant fields */ >>>> + m->pool = mp; >>>> + m->nb_segs = 1; >>>> + m->port = 0xff; >>> >>>Please don't mix tabs and spaces in the file. >> >>Apologies - this was a copy/paste from a DPDK file, which uses tabs. >> >>> >>>> } >>>> >>>> static void >>>> @@ -315,7 +393,7 @@ ovs_rte_pktmbuf_init(struct rte_mempool *mp, >>>> { >>>> struct rte_mbuf *m = _m; >>>> >>>> - __rte_pktmbuf_init(mp, opaque_arg, _m, i); >>>> + __ovs_rte_pktmbuf_init(mp, opaque_arg, m, i); >>>> >>>> dp_packet_init_dpdk((struct dp_packet *) m, m->buf_len); >>>> } >>>> @@ -326,6 +404,7 @@ dpdk_mp_get(int socket_id, int mtu) >>>> OVS_REQUIRES(dpdk_mutex) >>>> struct dpdk_mp *dmp = NULL; >>>> char mp_name[RTE_MEMPOOL_NAMESIZE]; >>>> unsigned mp_size; >>>> + struct rte_pktmbuf_pool_private mbp_priv; >>>> >>>> LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) { >>>> if (dmp->socket_id == socket_id && dmp->mtu == mtu) { >>>> @@ -338,6 +417,8 @@ dpdk_mp_get(int socket_id, int mtu) >>>> OVS_REQUIRES(dpdk_mutex) >>>> dmp->socket_id = socket_id; >>>> dmp->mtu = mtu; >>>> dmp->refcount = 1; >>>> + mbp_priv.mbuf_data_room_size = MTU_TO_FRAME_LEN(mtu) + >>>> RTE_PKTMBUF_HEADROOM; >>>> + mbp_priv.mbuf_priv_size = 0; >>>> >>>> mp_size = MAX_NB_MBUF; >>>> do { >>>> @@ -346,10 +427,10 @@ dpdk_mp_get(int socket_id, int mtu) >>>> OVS_REQUIRES(dpdk_mutex) >>>> return NULL; >>>> } >>>> >>>> - dmp->mp = rte_mempool_create(mp_name, mp_size, MBUF_SIZE(mtu), >>>> + dmp->mp = rte_mempool_create(mp_name, mp_size, >>>> MBUF_SEGMENT_SIZE(mtu), >>>> MP_CACHE_SZ, >>>> sizeof(struct >>>> rte_pktmbuf_pool_private), >>>> - rte_pktmbuf_pool_init, NULL, >>>> + ovs_rte_pktmbuf_pool_init, &mbp_priv, >>>> ovs_rte_pktmbuf_init, NULL, >>>> socket_id, 0); >>>> } while (!dmp->mp && rte_errno == ENOMEM && (mp_size /= 2) >= >>>> MIN_NB_MBUF); >>>> @@ -433,6 +514,7 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int >>>> n_rxq, int >>n_txq) >>>> { >>>> int diag = 0; >>>> int i; >>>> + struct rte_eth_conf conf = port_conf; >>>> >>>> /* A device may report more queues than it makes available (this has >>>> * been observed for Intel xl710, which reserves some of them for >>>> @@ -444,7 +526,12 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int >>>> n_rxq, int >>>n_txq) >>>> VLOG_INFO("Retrying setup with (rxq:%d txq:%d)", n_rxq, >>>> n_txq); >>>> } >>>> >>>> - diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, >>>> &port_conf); >>>> + if(OVS_UNLIKELY(dev->mtu > ETHER_MTU)) { >>>> + conf.rxmode.jumbo_frame = NETDEV_DPDK_JUMBO_FRAME_ENABLED; >>>> + conf.rxmode.max_rx_pkt_len = MTU_TO_FRAME_LEN(dev->mtu); >>>> + } >>>> + >>>> + diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, &conf); >>>> if (diag) { >>>> break; >>>> } >>>> @@ -586,6 +673,7 @@ netdev_dpdk_init(struct netdev *netdev_, unsigned int >>>> port_no, >>>> struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_); >>>> int sid; >>>> int err = 0; >>>> + uint32_t buf_size; >>>> >>>> ovs_mutex_init(&netdev->mutex); >>>> ovs_mutex_lock(&netdev->mutex); >>>> @@ -605,10 +693,16 @@ netdev_dpdk_init(struct netdev *netdev_, unsigned >>>> int port_no, >>>> netdev->port_id = port_no; >>>> netdev->type = type; >>>> netdev->flags = 0; >>>> + >>>> + /* Initialize port's MTU and frame len to the default Ethernet values. >>>> + * Larger, user-specified (jumbo) frame buffers are accommodated in >>>> + * netdev_dpdk_set_config. >>>> + */ >>>> + netdev->max_packet_len = ETHER_MAX_LEN; >>>> netdev->mtu = ETHER_MTU; >>>> - netdev->max_packet_len = MTU_TO_MAX_LEN(netdev->mtu); >>>> >>>> - netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, netdev->mtu); >>>> + buf_size = dpdk_buf_size(netdev, ETHER_MAX_LEN); >>>> + netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, >>>> FRAME_LEN_TO_MTU(buf_size)); >>>> if (!netdev->dpdk_mp) { >>>> err = ENOMEM; >>>> goto unlock; >>>> @@ -651,6 +745,24 @@ dpdk_dev_parse_name(const char dev_name[], const char >>>> prefix[], >>>> return 0; >>>> } >>>> >>>> +static void >>>> +dpdk_dev_parse_mtu(const struct smap *args, int *mtu) >>>> +{ >>>> + const char *mtu_str = smap_get(args, "dpdk-mtu"); >>>> + int local_mtu; >>>> + >>>> + if(mtu_str) { >>>> + local_mtu = atoi(mtu_str); >>> >>>Please use strtol or strtoul here, and detect errors in the string by >>>checking endptr. That way, we can be sure that random garbage on the >>>stack won't accidentally become the mtu. >> >>Will do, thanks. >> >>> >>>> + } >>>> + if(!mtu_str || local_mtu < ETHER_MTU || >>>> + local_mtu > >>>> FRAME_LEN_TO_MTU(NETDEV_DPDK_MAX_FRAME_LEN)) { >>>> + local_mtu = ETHER_MTU; >>>> + VLOG_WARN("Invalid or missing dpdk-mtu parameter - defaulting to >>>> %d.\n", >>>> + local_mtu); >>>> + } >>>> + *mtu = local_mtu; >>>> +} >>>> + >>>> static int >>>> vhost_construct_helper(struct netdev *netdev_) OVS_REQUIRES(dpdk_mutex) >>>> { >>>> @@ -777,11 +889,77 @@ netdev_dpdk_get_config(const struct netdev *netdev_, >>>> struct smap >>>*args) >>>> smap_add_format(args, "configured_rx_queues", "%d", netdev_->n_rxq); >>>> smap_add_format(args, "requested_tx_queues", "%d", netdev_->n_txq); >>>> smap_add_format(args, "configured_tx_queues", "%d", dev->real_n_txq); >>>> + smap_add_format(args, "mtu", "%d", dev->mtu); >>>> ovs_mutex_unlock(&dev->mutex); >>>> >>>> return 0; >>>> } >>>> >>>> +/* Set the mtu of DPDK_DEV_ETH ports */ >>>> +static int >>>> +netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu) >>>> +{ >>>> + struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); >>>> + int old_mtu, err; >>>> + uint32_t buf_size; >>>> + int dpdk_mtu; >>>> + struct dpdk_mp *old_mp; >>>> + struct dpdk_mp *mp; >>>> + >>>> + ovs_mutex_lock(&dpdk_mutex); >>>> + ovs_mutex_lock(&dev->mutex); >>>> + if (dev->mtu == mtu) { >>>> + err = 0; >>>> + goto out; >>>> + } >>>> + >>>> + buf_size = dpdk_buf_size(dev, MTU_TO_FRAME_LEN(mtu)); >>>> + dpdk_mtu = FRAME_LEN_TO_MTU(buf_size); >>>> + >>>> + mp = dpdk_mp_get(dev->socket_id, dpdk_mtu); >>>> + if (!mp) { >>>> + err = ENOMEM; >>>> + goto out; >>>> + } >>>> + >>>> + rte_eth_dev_stop(dev->port_id); >>>> + >>>> + old_mtu = dev->mtu; >>>> + old_mp = dev->dpdk_mp; >>>> + dev->dpdk_mp = mp; >>>> + dev->mtu = mtu; >>>> + dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu); >>>> + >>>> + err = dpdk_eth_dev_init(dev); >>> >>>Why call the dpdk_eth_dev_init here? Since you are directly calling >>>rte_eth_dev_X functions here - wouldn't it make sense to use the >>>rte_eth_dev_start() instead? The rx queues for the device need to be assigned the mempool that was allocated for the jumbo frames; since dpdk_eth_dev_init handles the queue setup, rte_eth_dev configuration, etc. it makes sense to call it here. >> >>This is legacy code, but I can refactor if needs be. >> >>> >>>> + if (err) { >>>> + VLOG_WARN("Unable to set MTU '%d' for '%s'; reverting to last >>>> known " >>>> + "good value '%d'\n", mtu, dev->up.name, old_mtu); >>>> + dpdk_mp_put(mp); >>>> + dev->mtu = old_mtu; >>>> + dev->dpdk_mp = old_mp; >>>> + dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu); >>>> + err = dpdk_eth_dev_init(dev); >>> >>>What happens if this returns a failure? By using dpdk_eth_dev_init, you >>>will cause a queue reconfiguration action, which is not needed, and can >>>error. This call is needed, in case the eth_dev's queue configuration was changed by the previous call. The error handling for this call is arguably outside the scope of this patch, and could be submitted as a separate patch - what do you think? >>> >> >>As above. >> >>>If you are intending to set MTU, I would have expected calls to >>>rte_eth_dev_set_mtu. However, I don't see any. Is there a reason? I >>>recognize that much of this is legacy, but since you are touching it :) >> >>I actually had a number of issues with rte_eth_dev_set_mtu early on in the >>patch; I can take >>a second look now that the code is more stable though. > >I remembered why I didn't use this. > >If you look at the IXGBE driver's mtu_set function, it refuses frame lengths >which are larger >than the eth_dev's min_rx_buf_size (- RTE_PKTMBUF_HEADROOM). >This might be avoided by configuring the device for scatter RX behavior, but >this incurs a >performance impact. > ><snippet> >ixgbe_dev_mtu_set(...) { > >... > > /* refuse mtu that requires the support of scattered packets when this > * feature has not been enabled before. */ > if (!dev->data->scattered_rx && > (frame_size + 2 * IXGBE_VLAN_TAG_SIZE > > dev->data->min_rx_buf_size - RTE_PKTMBUF_HEADROOM)) > return -EINVAL; > >... >} ></snippet> > > >> >>> >>>> + goto out; >>>> + } >>>> + >>>> + dpdk_mp_put(old_mp); >>>> + netdev_change_seq_changed(netdev); >>>> +out: >>>> + ovs_mutex_unlock(&dev->mutex); >>>> + ovs_mutex_unlock(&dpdk_mutex); >>>> + return err; >>>> +} >>>> + >>>> +static int >>>> +netdev_dpdk_set_config(struct netdev *netdev_, const struct smap *args) >>>> +{ >>>> + int mtu; >>>> + >>>> + dpdk_dev_parse_mtu(args, &mtu); >>>> + >>>> + return netdev_dpdk_set_mtu(netdev_, mtu); >>>> +} >>>> + >>>> static int >>>> netdev_dpdk_get_numa_id(const struct netdev *netdev_) >>>> { >>>> @@ -1358,54 +1536,6 @@ netdev_dpdk_get_mtu(const struct netdev *netdev, >>>> int *mtup) >>>> >>>> return 0; >>>> } >>>> - >>>> -static int >>>> -netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu) >>>> -{ >>>> - struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); >>>> - int old_mtu, err; >>>> - struct dpdk_mp *old_mp; >>>> - struct dpdk_mp *mp; >>>> - >>>> - ovs_mutex_lock(&dpdk_mutex); >>>> - ovs_mutex_lock(&dev->mutex); >>>> - if (dev->mtu == mtu) { >>>> - err = 0; >>>> - goto out; >>>> - } >>>> - >>>> - mp = dpdk_mp_get(dev->socket_id, dev->mtu); >>>> - if (!mp) { >>>> - err = ENOMEM; >>>> - goto out; >>>> - } >>>> - >>>> - rte_eth_dev_stop(dev->port_id); >>>> - >>>> - old_mtu = dev->mtu; >>>> - old_mp = dev->dpdk_mp; >>>> - dev->dpdk_mp = mp; >>>> - dev->mtu = mtu; >>>> - dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu); >>>> - >>>> - err = dpdk_eth_dev_init(dev); >>>> - if (err) { >>>> - dpdk_mp_put(mp); >>>> - dev->mtu = old_mtu; >>>> - dev->dpdk_mp = old_mp; >>>> - dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu); >>>> - dpdk_eth_dev_init(dev); >>>> - goto out; >>>> - } >>>> - >>>> - dpdk_mp_put(old_mp); >>>> - netdev_change_seq_changed(netdev); >>>> -out: >>>> - ovs_mutex_unlock(&dev->mutex); >>>> - ovs_mutex_unlock(&dpdk_mutex); >>>> - return err; >>>> -} >>>> - >>>> static int >>>> netdev_dpdk_get_carrier(const struct netdev *netdev_, bool *carrier); >>>> >>>> @@ -1682,7 +1812,7 @@ netdev_dpdk_get_status(const struct netdev *netdev_, >>>> struct smap >>>*args) >>>> smap_add_format(args, "numa_id", "%d", >>>> rte_eth_dev_socket_id(dev->port_id)); >>>> smap_add_format(args, "driver_name", "%s", dev_info.driver_name); >>>> smap_add_format(args, "min_rx_bufsize", "%u", >>>> dev_info.min_rx_bufsize); >>>> - smap_add_format(args, "max_rx_pktlen", "%u", dev_info.max_rx_pktlen); >>>> + smap_add_format(args, "max_rx_pktlen", "%u", dev->max_packet_len); >>>> smap_add_format(args, "max_rx_queues", "%u", dev_info.max_rx_queues); >>>> smap_add_format(args, "max_tx_queues", "%u", dev_info.max_tx_queues); >>>> smap_add_format(args, "max_mac_addrs", "%u", dev_info.max_mac_addrs); >>>> @@ -1904,6 +2034,51 @@ dpdk_vhost_user_class_init(void) >>>> return 0; >>>> } >>>> >>>> +/* Set the mtu of DPDK_DEV_VHOST ports */ >>>> +static int >>>> +netdev_dpdk_vhost_set_mtu(const struct netdev *netdev, int mtu) >>>> +{ >>>> + struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); >>>> + int err = 0; >>>> + struct dpdk_mp *old_mp; >>>> + struct dpdk_mp *mp; >>>> + >>>> + ovs_mutex_lock(&dpdk_mutex); >>>> + ovs_mutex_lock(&dev->mutex); >>>> + if (dev->mtu == mtu) { >>>> + err = 0; >>>> + goto out; >>>> + } >>>> + >>>> + mp = dpdk_mp_get(dev->socket_id, mtu); >>>> + if (!mp) { >>>> + err = ENOMEM; >>>> + goto out; >>>> + } >>>> + >>>> + old_mp = dev->dpdk_mp; >>>> + dev->dpdk_mp = mp; >>>> + dev->mtu = mtu; >>>> + dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu); >>>> + >>>> + dpdk_mp_put(old_mp); >>>> + netdev_change_seq_changed(netdev); >>>> +out: >>>> + ovs_mutex_unlock(&dev->mutex); >>>> + ovs_mutex_unlock(&dpdk_mutex); >>>> + return err; >>>> +} >>>> + >>>> +static int >>>> +netdev_dpdk_vhost_set_config(struct netdev *netdev_, const struct smap >>>> *args) >>>> +{ >>>> + int mtu; >>>> + >>>> + dpdk_dev_parse_mtu(args, &mtu); >>>> + >>>> + return netdev_dpdk_vhost_set_mtu(netdev_, mtu); >>>> +} >>>> + >>>> static void >>>> dpdk_common_init(void) >>>> { >>>> @@ -2040,8 +2215,9 @@ unlock_dpdk: >>>> return err; >>>> } >>>> >>>> -#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, MULTIQ, SEND, \ >>>> - GET_CARRIER, GET_STATS, GET_FEATURES, GET_STATUS, RXQ_RECV) \ >>>> +#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, SET_CONFIG, \ >>>> + MULTIQ, SEND, SET_MTU, GET_CARRIER, GET_STATS, GET_FEATURES, \ >>>> + GET_STATUS, RXQ_RECV) \ >>>> { \ >>>> NAME, \ >>>> INIT, /* init */ \ >>>> @@ -2053,7 +2229,7 @@ unlock_dpdk: >>>> DESTRUCT, \ >>>> netdev_dpdk_dealloc, \ >>>> netdev_dpdk_get_config, \ >>>> - NULL, /* netdev_dpdk_set_config */ \ >>>> + SET_CONFIG, \ >>>> NULL, /* get_tunnel_config */ \ >>>> NULL, /* build header */ \ >>>> NULL, /* push header */ \ >>>> @@ -2067,7 +2243,7 @@ unlock_dpdk: >>>> netdev_dpdk_set_etheraddr, \ >>>> netdev_dpdk_get_etheraddr, \ >>>> netdev_dpdk_get_mtu, \ >>>> - netdev_dpdk_set_mtu, \ >>>> + SET_MTU, \ >>>> netdev_dpdk_get_ifindex, \ >>>> GET_CARRIER, \ >>>> netdev_dpdk_get_carrier_resets, \ >>>> @@ -2213,8 +2389,10 @@ static const struct netdev_class dpdk_class = >>>> NULL, >>>> netdev_dpdk_construct, >>>> netdev_dpdk_destruct, >>>> + netdev_dpdk_set_config, >>>> netdev_dpdk_set_multiq, >>>> netdev_dpdk_eth_send, >>>> + netdev_dpdk_set_mtu, >>>> netdev_dpdk_get_carrier, >>>> netdev_dpdk_get_stats, >>>> netdev_dpdk_get_features, >>>> @@ -2227,8 +2405,10 @@ static const struct netdev_class dpdk_ring_class = >>>> NULL, >>>> netdev_dpdk_ring_construct, >>>> netdev_dpdk_destruct, >>>> + netdev_dpdk_set_config, >>>> netdev_dpdk_set_multiq, >>>> netdev_dpdk_ring_send, >>>> + netdev_dpdk_set_mtu, >>>> netdev_dpdk_get_carrier, >>>> netdev_dpdk_get_stats, >>>> netdev_dpdk_get_features, >>>> @@ -2241,8 +2421,10 @@ static const struct netdev_class OVS_UNUSED >>>> dpdk_vhost_cuse_class >= >>>> dpdk_vhost_cuse_class_init, >>>> netdev_dpdk_vhost_cuse_construct, >>>> netdev_dpdk_vhost_destruct, >>>> + NULL, >>>> netdev_dpdk_vhost_set_multiq, >>>> netdev_dpdk_vhost_send, >>>> + NULL, >>>> netdev_dpdk_vhost_get_carrier, >>>> netdev_dpdk_vhost_get_stats, >>>> NULL, >>>> @@ -2255,8 +2437,10 @@ static const struct netdev_class OVS_UNUSED >>>> dpdk_vhost_user_class >= >>>> dpdk_vhost_user_class_init, >>>> netdev_dpdk_vhost_user_construct, >>>> netdev_dpdk_vhost_destruct, >>>> + netdev_dpdk_vhost_set_config, >>>> netdev_dpdk_vhost_set_multiq, >>>> netdev_dpdk_vhost_send, >>>> + netdev_dpdk_vhost_set_mtu, >>>> netdev_dpdk_vhost_get_carrier, >>>> netdev_dpdk_vhost_get_stats, >>>> NULL, >>_______________________________________________ >>dev mailing list >>dev@openvswitch.org >>http://openvswitch.org/mailman/listinfo/dev >_______________________________________________ >dev mailing list >dev@openvswitch.org >http://openvswitch.org/mailman/listinfo/dev _______________________________________________ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev