"Kavanagh, Mark B" <mark.b.kavan...@intel.com> writes: >> >>"Kavanagh, Mark B" <mark.b.kavan...@intel.com> writes: >>> Further clarifications below Aaron - please let me know if you have >>> any additional comments. >>> >>> Thanks, >>> Mark >> >>Thanks again for this work, Mark! >> >>>> >>>>One addendum below, Aaron. >>>> >>>>> >>>>>>Hi Mark, >>>>>> >>>>>>Thanks for the patch! I've not done a very thorough review of this, but >>>>>>my first-blush comments are inline. >>>>> >>>>>Thanks for the review Aaron! >>>>> >>>>>Any additional comments are welcome - my initial responses are > inline, below. >>>>> >>>>>Thanks, >>>>>Mark >>>>> >>>>>> >>>>>>Mark Kavanagh <mark.b.kavan...@intel.com> writes: >>>>>>> Add support for Jumbo Frames to DPDK-enabled port types, >>>>>>> using single-segment-mbufs. >>>>>>> >>>>>>> Using this approach, the amount of memory allocated for each mbuf >>>>>>> to store frame data is increased to a value greater than 1518B >>>>>>> (typical Ethernet maximum frame length). The increased space >>>>>>> available in the mbuf means that an entire Jumbo Frame can be carried >>>>>>> in a single mbuf, as opposed to partitioning it across multiple mbuf >>>>>>> segments. >>>>>>> >>>>>>> The amount of space allocated to each mbuf to hold frame data is >>>>>>> defined dynamically by the user when adding a DPDK port to a bridge. >>>>>>> If an MTU value is not supplied, or the user-supplied value is invalid, >>>>>>> the MTU for the port defaults to standard Ethernet MTU (i.e. 1500B). >>>>>>> >>>>>>> Signed-off-by: Mark Kavanagh <mark.b.kavan...@intel.com> >>>>>>> --- >>>>>> >>>>>>This is a new feature which has user-visible impact, so should be >>>>>>announced in NEWS (plus, it's great news). >>>>> >>>>>No problem - I'll add an entry. >>>>> >>>>>> >>>>>>> INSTALL.DPDK.md | 59 ++++++++- >>>>>>> lib/netdev-dpdk.c | 356 >>> +++++++++++++++++++++++++++++++++++++++++------------- >>>>>>> 2 files changed, 328 insertions(+), 87 deletions(-) >>>>>>> >>>>>>> diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md >>>>>>> index 96b686c..2f23e27 100644 >>>>>>> --- a/INSTALL.DPDK.md >>>>>>> +++ b/INSTALL.DPDK.md >>>>>>> @@ -859,10 +859,61 @@ by adding the following string: >>>>>>> to <interface> sections of all network devices used by > DPDK. Parameter 'N' >>>>>>> determines how many queues can be used by the guest. >>>>>>> >>>>>>> + >>>>>>> +Jumbo Frames >>>>>>> +------------ >>>>>>> + >>>>>>> +Support for Jumbo Frames may be enabled at run-time for DPDK-type >>>>>>> ports. >>>>>>> + >>>>>>> +To avail of Jumbo Frame support, add the '--dpdk-mtu' option to >>> the ovs-vsctl >>>>>>> +'add-port' command-line, along with the required MTU for the port. >>>>>>> +e.g. >>>>>> >>>>>>This text and the example do not match. I think '--dpdk-mtu' is not valid. >>>>> >>>>>Good catch, thanks. >>>>> >>>>>> >>>>>>> + >>>>>>> + ``` >>>>>>> + ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk >>> options:dpdk-mtu=9000 >>>>>>> + ``` >>>>>>> + >>>>>>> +When Jumbo Frames are enabled, the size of a DPDK port's mbuf > segments are >>>>>>> +increased, such that a full Jumbo Frame may be accommodated >>> inside a single >>>>>>> +mbuf segment. Once set, the MTU for a DPDK port is immutable. >>>>>> >>>>>>Why no support? DPDK supports changing the mtu. >>>>> >>>>>I guess my rationale here is that an MTU change can't be triggered >>> via OVS command-line, nor >>>>>can it be triggered programmatically via DPDK (apart from an > explicit call to >>>>>rte_eth_dev_set_mtu). >>>>>So, while technically it's possibly, from a user's point of view, >>> there's no way to >>>>configure >>>>>it, outside of modifying the code directly. If I've missed something >>> here, please let me >>>>>know. >>>>> >>>>>> >>>>>>> +Jumbo frame support has been validated against 13312B frames, using the >>>>>>> +DPDK `igb_uio` driver, but larger frames and other DPDK NIC drivers may >>>>>>> +theoretically be supported. Supported port types excludes >>> vHost-Cuse ports, as >>>>>>> +this feature is pending deprecation. >>>>>>> + >>>>>>> + >>>>>>> +vHost Ports and Jumbo Frames >>>>>>> +---------------------------- >>>>>>> +Jumbo frame support is available for DPDK vHost-User ports >>> only. Some additional >>>>>>> +configuration is needed to take advantage of this feature: >>>>>>> + >>>>>>> + 1. `mergeable buffers` must be enabled for vHost ports, as >>> demonstrated in >>>>>>> + the QEMU command line snippet below: >>>>>>> + >>>>>>> + ``` >>>>>>> + '-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \' >>>>>>> + '-device >>> virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on' >>>>>>> + ``` >>>>>>> + >>>>>>> + 2. Where virtio devices are bound to the Linux kernel driver > in a guest >>>>>>> + environment (i.e. interfaces are not bound to an in-guest DPDK >>> driver), the >>>>>>> + MTU of those logical network interfaces must also be > increased. This >>>>>>> + avoids segmentation of Jumbo Frames in the guest. Note that >>> 'MTU' refers >>>>>>> + to the length of the IP packet only, and not that of the entire >>> frame. >>>>>>> + >>>>>>> + e.g. To calculate the exact MTU of a standard IPv4 frame, >>> subtract the L2 >>>>>>> + header and CRC lengths (i.e. 18B) from the max supported > frame size. >>>>>>> + So, to set the MTU for a 13312B Jumbo Frame: >>>>>>> + >>>>>>> + ``` >>>>>>> + ifconfig eth1 mtu 13294 >>>>>>> + ``` >>>>>>> + >>>>>>> + >>>>>>> Restrictions: >>>>>>> ------------- >>>>>>> >>>>>>> - - Work with 1500 MTU, needs few changes in DPDK lib to fix this >>>>>>> issue. >>>>>>> - Currently DPDK port does not make use any offload functionality. >>>>>>> - DPDK-vHost support works with 1G huge pages. >>>>>>> >>>>>>> @@ -903,6 +954,12 @@ Restrictions: >>>>>>> the next release of DPDK (which includes the above patch) is >>> available and >>>>>>> integrated into OVS. >>>>>>> >>>>>>> + Jumbo Frames: >>>>>>> + - `virtio-pmd`: DPDK apps in the guest do not exit >>> gracefully. The source of >>>>>>> + this issue is currently being investigated. >>>>>>> + - vHost-Cuse: Jumbo Frame support is not available for vHost > Cuse ports. >>>>>>> + >>>>>>> + >>>>>>> Bug Reporting: >>>>>>> -------------- >>>>>>> >>>>>>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c >>>>>>> index de7e488..5f23e28 100644 >>>>>>> --- a/lib/netdev-dpdk.c >>>>>>> +++ b/lib/netdev-dpdk.c >>>>>>> @@ -62,20 +62,25 @@ static struct vlog_rate_limit rl = >>> VLOG_RATE_LIMIT_INIT(5, 20); >>>>>>> #define OVS_CACHE_LINE_SIZE CACHE_LINE_SIZE >>>>>>> #define OVS_VPORT_DPDK "ovs_dpdk" >>>>>>> >>>>>>> +#define NETDEV_DPDK_JUMBO_FRAME_ENABLED 1 >>>>>>> +#define NETDEV_DPDK_DEFAULT_RX_BUFSIZE 1024 >>>>>>> + >>>>>>> /* >>>>>>> * need to reserve tons of extra space in the mbufs so we can align the >>>>>>> * DMA addresses to 4KB. >>>>>>> * The minimum mbuf size is limited to avoid scatter behaviour > and drop in >>>>>>> * performance for standard Ethernet MTU. >>>>>>> */ >>>>>>> -#define MTU_TO_MAX_LEN(mtu) ((mtu) + ETHER_HDR_LEN + ETHER_CRC_LEN) >>>>>>> -#define MBUF_SIZE_MTU(mtu) (MTU_TO_MAX_LEN(mtu) \ >>>>>>> - + sizeof(struct dp_packet) \ >>>>>>> - + RTE_PKTMBUF_HEADROOM) >>>>>>> -#define MBUF_SIZE_DRIVER (2048 \ >>>>>>> - + sizeof (struct rte_mbuf) \ >>>>>>> - + RTE_PKTMBUF_HEADROOM) >>>>>>> -#define MBUF_SIZE(mtu) MAX(MBUF_SIZE_MTU(mtu), MBUF_SIZE_DRIVER) >>>>>>> +#define MTU_TO_FRAME_LEN(mtu) ((mtu) + ETHER_HDR_LEN + >>> ETHER_CRC_LEN) >>>>>>> +#define FRAME_LEN_TO_MTU(frame_len) ((frame_len)- ETHER_HDR_LEN >>> - ETHER_CRC_LEN) >>>>>>> +#define MBUF_SEGMENT_SIZE(mtu) ( MTU_TO_FRAME_LEN(mtu) \ >>>>>>> + + sizeof(struct dp_packet) \ >>>>>>> + + RTE_PKTMBUF_HEADROOM) >>>>>>> + >>>>>>> +/* This value should be specified as a multiple of the DPDK NIC >>>>>>> driver's >>>>>>> + * 'min_rx_bufsize' attribute (currently 1024B for 'igb_uio'). >>>>>>> + */ >>>>>>> +#define NETDEV_DPDK_MAX_FRAME_LEN 13312 >>>>>>> >>>>>>> /* Max and min number of packets in the mempool. OVS tries to > allocate a >>>>>>> * mempool with MAX_NB_MBUF: if this fails (because the system >>> doesn't have >>>>>>> @@ -86,6 +91,8 @@ static struct vlog_rate_limit rl = >>> VLOG_RATE_LIMIT_INIT(5, 20); >>>>>>> #define MIN_NB_MBUF (4096 * 4) >>>>>>> #define MP_CACHE_SZ RTE_MEMPOOL_CACHE_MAX_SIZE >>>>>>> >>>>>>> +#define DPDK_VLAN_TAG_LEN 4 >>>>>>> + >>>>>>> /* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */ >>>>>>> BUILD_ASSERT_DECL(MAX_NB_MBUF % >>> ROUND_DOWN_POW2(MAX_NB_MBUF/MIN_NB_MBUF) == 0); >>>>>>> >>>>>>> @@ -114,7 +121,6 @@ static const struct rte_eth_conf port_conf = { >>>>>>> .header_split = 0, /* Header Split disabled */ >>>>>>> .hw_ip_checksum = 0, /* IP checksum offload disabled */ >>>>>>> .hw_vlan_filter = 0, /* VLAN filtering disabled */ >>>>>>> - .jumbo_frame = 0, /* Jumbo Frame Support disabled */ >>>>>>> .hw_strip_crc = 0, >>>>>>> }, >>>>>>> .rx_adv_conf = { >>>>>>> @@ -254,6 +260,39 @@ is_dpdk_class(const struct netdev_class *class) >>>>>>> return class->construct == netdev_dpdk_construct; >>>>>>> } >>>>>>> >>>>>>> +/* DPDK NIC drivers allocate RX buffers at a particular granularity >>>>>>> + * (specified by rte_eth_dev_info.min_rx_bufsize - currently 1K >>> for igb_uio). >>>>>>> + * If 'frame_len' is not a multiple of this value, insufficient >>> buffers are >>>>>>> + * allocated to accomodate the packet in its >>> entirety. Furthermore, the igb_uio >>>>>>> + * driver needs to ensure that there is also sufficient space in >>> the Rx buffer >>>>>>> + * to accommodate two VLAN tags (for QinQ frames). If the RX > buffer is too >>>>>>> + * small, then the driver enables scatter RX behaviour, which reduces >>>>>>> + * performance. To prevent this, use a buffer size that is closest to >>>>>>> + * 'frame_len', but which satisfies the aforementioned criteria. >>>>>>> + */ >>>>>>> +static uint32_t >>>>>>> +dpdk_buf_size(struct netdev_dpdk *netdev, int frame_len) >>>>>>> +{ >>>>>>> + struct rte_eth_dev_info info; >>>>>>> + uint32_t buf_size; >>>>>>> + uint32_t len = frame_len + (2 * DPDK_VLAN_TAG_LEN); >>>>>>> + >>>>>>> + if(netdev->type == DPDK_DEV_ETH) { >>>>>>> + rte_eth_dev_info_get(netdev->port_id, &info); >>>>>>> + buf_size = (info.min_rx_bufsize == 0) ? >>>>>>> + NETDEV_DPDK_DEFAULT_RX_BUFSIZE : >>>>>>> + info.min_rx_bufsize; >>>>>> >>>>>>Why not use the rte_eth_dev_get_mtu call to get the port configured mtu? >>>>> >>>>>I'm not sure if I follow - the MTU isn't the issue here, but rather >>> the buffer size used >>>>>when creating the rte_mempool, which must be a multiple of the >>> driver's min_rx_bufsize. >>>>> >>>>>> >>>>>>> + } else { >>>>>>> + buf_size = NETDEV_DPDK_DEFAULT_RX_BUFSIZE; >>>>>>> + } >>>>>>> + >>>>>>> + if(len % buf_size) { >>>>>>> + len = buf_size * ((len/buf_size) + 1); >>>>>>> + } >>>>>>> + >>>>>>> + return len; >>>>>>> +} >>>>>>> + >>>>>>> /* XXX: use dpdk malloc for entire OVS. in fact huge page should be >>>>>>> used >>>>>>> * for all other segments data, bss and text. */ >>>>>>> >>>>>>> @@ -280,31 +319,70 @@ free_dpdk_buf(struct dp_packet *p) >>>>>>> } >>>>>>> >>>>>>> static void >>>>>>> -__rte_pktmbuf_init(struct rte_mempool *mp, >>>>>>> - void *opaque_arg OVS_UNUSED, >>>>>>> - void *_m, >>>>>>> - unsigned i OVS_UNUSED) >>>>>>> +ovs_rte_pktmbuf_pool_init(struct rte_mempool *mp, void *opaque_arg) >>>>>>> { >>>>>>> - struct rte_mbuf *m = _m; >>>>>>> - uint32_t buf_len = mp->elt_size - sizeof(struct dp_packet); >>>>>>> + struct rte_pktmbuf_pool_private *user_mbp_priv, *mbp_priv; >>>>>>> + struct rte_pktmbuf_pool_private default_mbp_priv; >>>>>>> + uint16_t roomsz; >>>>>>> >>>>>>> RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet)); >>>>>>> >>>>>>> - memset(m, 0, mp->elt_size); >>>>>>> + /* if no structure is provided, assume no mbuf private area */ >>>>>>> + >>>>>>> + user_mbp_priv = opaque_arg; >>>>>>> + if (user_mbp_priv == NULL) { >>>>>>> + default_mbp_priv.mbuf_priv_size = 0; >>>>>>> + if (mp->elt_size > sizeof(struct dp_packet)) { >>>>>>> + roomsz = mp->elt_size - sizeof(struct dp_packet); >>>>>>> + } else { >>>>>>> + roomsz = 0; >>>>>>> + } >>>>>>> + default_mbp_priv.mbuf_data_room_size = roomsz; >>>>>>> + user_mbp_priv = &default_mbp_priv; >>>>>>> + } >>>>>>> + >>>>>>> + RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet) + >>>>>>> + user_mbp_priv->mbuf_data_room_size + >>>>>>> + user_mbp_priv->mbuf_priv_size); >>>>>>> + >>>>>>> + mbp_priv = rte_mempool_get_priv(mp); >>>>>>> + memcpy(mbp_priv, user_mbp_priv, sizeof(*mbp_priv)); >>>>>>> +} >>>>>>> + >>>>>>> +/* Initialise some fields in the mbuf structure that are not >>> modified by the >>>>>>> + * user once created (origin pool, buffer start address, etc.*/ >>>>>>> +static void >>>>>>> +__ovs_rte_pktmbuf_init(struct rte_mempool *mp, >>>>>>> + void *opaque_arg OVS_UNUSED, >>>>>>> + void *_m, >>>>>>> + unsigned i OVS_UNUSED) >>>>>>> +{ >>>>>>> + struct rte_mbuf *m = _m; >>>>>>> + uint32_t buf_size, buf_len, priv_size; >>>>>>> + >>>>>>> + priv_size = rte_pktmbuf_priv_size(mp); >>>>>>> + buf_size = sizeof(struct dp_packet) + priv_size; >>>>>>> + buf_len = rte_pktmbuf_data_room_size(mp); >>>>>>> >>>>>>> - /* start of buffer is just after mbuf structure */ >>>>>>> - m->buf_addr = (char *)m + sizeof(struct dp_packet); >>>>>>> - m->buf_physaddr = rte_mempool_virt2phy(mp, m) + >>>>>>> - sizeof(struct dp_packet); >>>>>>> - m->buf_len = (uint16_t)buf_len; >>>>>>> + RTE_MBUF_ASSERT(RTE_ALIGN(priv_size, RTE_MBUF_PRIV_ALIGN) == > priv_size); >>>>>>> + RTE_MBUF_ASSERT(mp->elt_size >= buf_size); >>>>>>> + RTE_MBUF_ASSERT(buf_len <= UINT16_MAX); >>>>>>> >>>>>>> - /* keep some headroom between start of buffer and data */ >>>>>>> - m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, m->buf_len); >>>>>>> + memset(m, 0, mp->elt_size); >>>>>>> >>>>>>> - /* init some constant fields */ >>>>>>> - m->pool = mp; >>>>>>> - m->nb_segs = 1; >>>>>>> - m->port = 0xff; >>>>>>> + /* start of buffer is after dp_packet structure and priv data */ >>>>>>> + m->priv_size = priv_size; >>>>>>> + m->buf_addr = (char *)m + buf_size; >>>>>>> + m->buf_physaddr = rte_mempool_virt2phy(mp, m) + buf_size; >>>>>>> + m->buf_len = (uint16_t)buf_len; >>>>>>> + >>>>>>> + /* keep some headroom between start of buffer and data */ >>>>>>> + m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, > (uint16_t)m->buf_len); >>>>>>> + >>>>>>> + /* init some constant fields */ >>>>>>> + m->pool = mp; >>>>>>> + m->nb_segs = 1; >>>>>>> + m->port = 0xff; >>>>>> >>>>>>Please don't mix tabs and spaces in the file. >>>>> >>>>>Apologies - this was a copy/paste from a DPDK file, which uses tabs. >>>>> >>>>>> >>>>>>> } >>>>>>> >>>>>>> static void >>>>>>> @@ -315,7 +393,7 @@ ovs_rte_pktmbuf_init(struct rte_mempool *mp, >>>>>>> { >>>>>>> struct rte_mbuf *m = _m; >>>>>>> >>>>>>> - __rte_pktmbuf_init(mp, opaque_arg, _m, i); >>>>>>> + __ovs_rte_pktmbuf_init(mp, opaque_arg, m, i); >>>>>>> >>>>>>> dp_packet_init_dpdk((struct dp_packet *) m, m->buf_len); >>>>>>> } >>>>>>> @@ -326,6 +404,7 @@ dpdk_mp_get(int socket_id, int mtu) >>> OVS_REQUIRES(dpdk_mutex) >>>>>>> struct dpdk_mp *dmp = NULL; >>>>>>> char mp_name[RTE_MEMPOOL_NAMESIZE]; >>>>>>> unsigned mp_size; >>>>>>> + struct rte_pktmbuf_pool_private mbp_priv; >>>>>>> >>>>>>> LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) { >>>>>>> if (dmp->socket_id == socket_id && dmp->mtu == mtu) { >>>>>>> @@ -338,6 +417,8 @@ dpdk_mp_get(int socket_id, int mtu) >>> OVS_REQUIRES(dpdk_mutex) >>>>>>> dmp->socket_id = socket_id; >>>>>>> dmp->mtu = mtu; >>>>>>> dmp->refcount = 1; >>>>>>> + mbp_priv.mbuf_data_room_size = MTU_TO_FRAME_LEN(mtu) + >>> RTE_PKTMBUF_HEADROOM; >>>>>>> + mbp_priv.mbuf_priv_size = 0; >>>>>>> >>>>>>> mp_size = MAX_NB_MBUF; >>>>>>> do { >>>>>>> @@ -346,10 +427,10 @@ dpdk_mp_get(int socket_id, int mtu) >>> OVS_REQUIRES(dpdk_mutex) >>>>>>> return NULL; >>>>>>> } >>>>>>> >>>>>>> - dmp->mp = rte_mempool_create(mp_name, mp_size, MBUF_SIZE(mtu), >>>>>>> + dmp->mp = rte_mempool_create(mp_name, mp_size, >>> MBUF_SEGMENT_SIZE(mtu), >>>>>>> MP_CACHE_SZ, >>>>>>> sizeof(struct rte_pktmbuf_pool_private), >>>>>>> - rte_pktmbuf_pool_init, NULL, >>>>>>> + ovs_rte_pktmbuf_pool_init, &mbp_priv, >>>>>>> ovs_rte_pktmbuf_init, NULL, >>>>>>> socket_id, 0); >>>>>>> } while (!dmp->mp && rte_errno == ENOMEM && (mp_size /= 2) >= >>> MIN_NB_MBUF); >>>>>>> @@ -433,6 +514,7 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk >>> *dev, int n_rxq, int >>>>>n_txq) >>>>>>> { >>>>>>> int diag = 0; >>>>>>> int i; >>>>>>> + struct rte_eth_conf conf = port_conf; >>>>>>> >>>>>>> /* A device may report more queues than it makes available (this >>>>>>> has >>>>>>> * been observed for Intel xl710, which reserves some of them for >>>>>>> @@ -444,7 +526,12 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk >>> *dev, int n_rxq, int >>>>>>n_txq) >>>>>>> VLOG_INFO("Retrying setup with (rxq:%d txq:%d)", n_rxq, n_txq); >>>>>>> } >>>>>>> >>>>>>> - diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, >>> &port_conf); >>>>>>> + if(OVS_UNLIKELY(dev->mtu > ETHER_MTU)) { >>>>>>> + conf.rxmode.jumbo_frame = NETDEV_DPDK_JUMBO_FRAME_ENABLED; >>>>>>> + conf.rxmode.max_rx_pkt_len = MTU_TO_FRAME_LEN(dev->mtu); >>>>>>> + } >>>>>>> + >>>>>>> + diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, >>>>>>> &conf); >>>>>>> if (diag) { >>>>>>> break; >>>>>>> } >>>>>>> @@ -586,6 +673,7 @@ netdev_dpdk_init(struct netdev *netdev_, >>> unsigned int port_no, >>>>>>> struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_); >>>>>>> int sid; >>>>>>> int err = 0; >>>>>>> + uint32_t buf_size; >>>>>>> >>>>>>> ovs_mutex_init(&netdev->mutex); >>>>>>> ovs_mutex_lock(&netdev->mutex); >>>>>>> @@ -605,10 +693,16 @@ netdev_dpdk_init(struct netdev *netdev_, >>> unsigned int port_no, >>>>>>> netdev->port_id = port_no; >>>>>>> netdev->type = type; >>>>>>> netdev->flags = 0; >>>>>>> + >>>>>>> + /* Initialize port's MTU and frame len to the default > Ethernet values. >>>>>>> + * Larger, user-specified (jumbo) frame buffers are accommodated in >>>>>>> + * netdev_dpdk_set_config. >>>>>>> + */ >>>>>>> + netdev->max_packet_len = ETHER_MAX_LEN; >>>>>>> netdev->mtu = ETHER_MTU; >>>>>>> - netdev->max_packet_len = MTU_TO_MAX_LEN(netdev->mtu); >>>>>>> >>>>>>> - netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, netdev->mtu); >>>>>>> + buf_size = dpdk_buf_size(netdev, ETHER_MAX_LEN); >>>>>>> + netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, >>> FRAME_LEN_TO_MTU(buf_size)); >>>>>>> if (!netdev->dpdk_mp) { >>>>>>> err = ENOMEM; >>>>>>> goto unlock; >>>>>>> @@ -651,6 +745,24 @@ dpdk_dev_parse_name(const char dev_name[], >>> const char prefix[], >>>>>>> return 0; >>>>>>> } >>>>>>> >>>>>>> +static void >>>>>>> +dpdk_dev_parse_mtu(const struct smap *args, int *mtu) >>>>>>> +{ >>>>>>> + const char *mtu_str = smap_get(args, "dpdk-mtu"); >>>>>>> + int local_mtu; >>>>>>> + >>>>>>> + if(mtu_str) { >>>>>>> + local_mtu = atoi(mtu_str); >>>>>> >>>>>>Please use strtol or strtoul here, and detect errors in the string by >>>>>>checking endptr. That way, we can be sure that random garbage on the >>>>>>stack won't accidentally become the mtu. >>>>> >>>>>Will do, thanks. >>>>> >>>>>> >>>>>>> + } >>>>>>> + if(!mtu_str || local_mtu < ETHER_MTU || >>>>>>> + local_mtu > FRAME_LEN_TO_MTU(NETDEV_DPDK_MAX_FRAME_LEN)) { >>>>>>> + local_mtu = ETHER_MTU; >>>>>>> + VLOG_WARN("Invalid or missing dpdk-mtu parameter - defaulting >>> to %d.\n", >>>>>>> + local_mtu); >>>>>>> + } >>>>>>> + *mtu = local_mtu; >>>>>>> +} >>>>>>> + >>>>>>> static int >>>>>>> vhost_construct_helper(struct netdev *netdev_) OVS_REQUIRES(dpdk_mutex) >>>>>>> { >>>>>>> @@ -777,11 +889,77 @@ netdev_dpdk_get_config(const struct netdev >>> *netdev_, struct smap >>>>>>*args) >>>>>>> smap_add_format(args, "configured_rx_queues", "%d", >>>>>>> netdev_->n_rxq); >>>>>>> smap_add_format(args, "requested_tx_queues", "%d", netdev_->n_txq); >>>>>>> smap_add_format(args, "configured_tx_queues", "%d", > dev->real_n_txq); >>>>>>> + smap_add_format(args, "mtu", "%d", dev->mtu); >>>>>>> ovs_mutex_unlock(&dev->mutex); >>>>>>> >>>>>>> return 0; >>>>>>> } >>>>>>> >>>>>>> +/* Set the mtu of DPDK_DEV_ETH ports */ >>>>>>> +static int >>>>>>> +netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu) >>>>>>> +{ >>>>>>> + struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); >>>>>>> + int old_mtu, err; >>>>>>> + uint32_t buf_size; >>>>>>> + int dpdk_mtu; >>>>>>> + struct dpdk_mp *old_mp; >>>>>>> + struct dpdk_mp *mp; >>>>>>> + >>>>>>> + ovs_mutex_lock(&dpdk_mutex); >>>>>>> + ovs_mutex_lock(&dev->mutex); >>>>>>> + if (dev->mtu == mtu) { >>>>>>> + err = 0; >>>>>>> + goto out; >>>>>>> + } >>>>>>> + >>>>>>> + buf_size = dpdk_buf_size(dev, MTU_TO_FRAME_LEN(mtu)); >>>>>>> + dpdk_mtu = FRAME_LEN_TO_MTU(buf_size); >>>>>>> + >>>>>>> + mp = dpdk_mp_get(dev->socket_id, dpdk_mtu); >>>>>>> + if (!mp) { >>>>>>> + err = ENOMEM; >>>>>>> + goto out; >>>>>>> + } >>>>>>> + >>>>>>> + rte_eth_dev_stop(dev->port_id); >>>>>>> + >>>>>>> + old_mtu = dev->mtu; >>>>>>> + old_mp = dev->dpdk_mp; >>>>>>> + dev->dpdk_mp = mp; >>>>>>> + dev->mtu = mtu; >>>>>>> + dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu); >>>>>>> + >>>>>>> + err = dpdk_eth_dev_init(dev); >>>>>> >>>>>>Why call the dpdk_eth_dev_init here? Since you are directly calling >>>>>>rte_eth_dev_X functions here - wouldn't it make sense to use the >>>>>>rte_eth_dev_start() instead? >>> >>> The rx queues for the device need to be assigned the mempool that was >>> allocated for the jumbo frames; since dpdk_eth_dev_init handles the >>> queue setup, rte_eth_dev configuration, etc. it makes sense to call it >>> here. >>> >>>>> >>>>>This is legacy code, but I can refactor if needs be. >>>>> >>>>>> >>>>>>> + if (err) { >>>>>>> + VLOG_WARN("Unable to set MTU '%d' for '%s'; reverting to last >>> known " >>>>>>> + "good value '%d'\n", mtu, dev->up.name, old_mtu); >>>>>>> + dpdk_mp_put(mp); >>>>>>> + dev->mtu = old_mtu; >>>>>>> + dev->dpdk_mp = old_mp; >>>>>>> + dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu); >>>>>>> + err = dpdk_eth_dev_init(dev); >>>>>> >>>>>>What happens if this returns a failure? By using dpdk_eth_dev_init, you >>>>>>will cause a queue reconfiguration action, which is not needed, and can >>>>>>error. >>> >>> This call is needed, in case the eth_dev's queue configuration was >>> changed by the previous call. >>> The error handling for this call is arguably outside the scope of this >>> patch, and could be submitted as a separate patch - what do you think? >> >>That's probably a good idea. >> >>>>>> >>>>> >>>>>As above. >>>>> >>>>>>If you are intending to set MTU, I would have expected calls to >>>>>>rte_eth_dev_set_mtu. However, I don't see any. Is there a reason? I >>>>>>recognize that much of this is legacy, but since you are touching it :) >>>>> >>>>>I actually had a number of issues with rte_eth_dev_set_mtu early on >>> in the patch; I can take >>>>>a second look now that the code is more stable though. >>>> >>>>I remembered why I didn't use this. >>>> >>>>If you look at the IXGBE driver's mtu_set function, it refuses frame >>> lengths which are larger >>>>than the eth_dev's min_rx_buf_size (- RTE_PKTMBUF_HEADROOM). >>>>This might be avoided by configuring the device for scatter RX >>> behavior, but this incurs a >>>>performance impact. >> >>It seems strange to me that we need to ignore the library API for this >>one case, right? What if the user is only doing vhostuser PMDs mixed >>with a bnx2x card? >> > > I can't say that I've familiar enough with this setup to form an > answer on that one Aaron. > >>It would be better to get it fixed in the library, I think. After all, >>they gave a generic setting to change MTU - why don't they make it >>behave well? >> > > Let me follow up on that with some local DPDK folks here. > > Would you consider this a showstopper for upstreaming of this feature? > If we need to wait for a change to DPDK, Jumbo Frames will miss the > 2.5 window.
I don't think it would be a show stopper as long as other devices wouldn't be negatively impacted, but it's a place where a /*XXX: This is a workaround for dpdk*/ comment should be included so that we can fix it when/if it is updated in the library. What do you think? >>>><snippet> >>>>ixgbe_dev_mtu_set(...) { >>>> >>>>... >>>> >>>> /* refuse mtu that requires the support of scattered packets when this >>>> * feature has not been enabled before. */ >>>> if (!dev->data->scattered_rx && >>>> (frame_size + 2 * IXGBE_VLAN_TAG_SIZE > >>>> dev->data->min_rx_buf_size - RTE_PKTMBUF_HEADROOM)) >>>> return -EINVAL; >>>> >>>>... >>>>} >>>></snippet> >>>> >>>> >>>>> >>>>>> >>>>>>> + goto out; >>>>>>> + } >>>>>>> + >>>>>>> + dpdk_mp_put(old_mp); >>>>>>> + netdev_change_seq_changed(netdev); >>>>>>> +out: >>>>>>> + ovs_mutex_unlock(&dev->mutex); >>>>>>> + ovs_mutex_unlock(&dpdk_mutex); >>>>>>> + return err; >>>>>>> +} >>>>>>> + >>>>>>> +static int >>>>>>> +netdev_dpdk_set_config(struct netdev *netdev_, const struct smap *args) >>>>>>> +{ >>>>>>> + int mtu; >>>>>>> + >>>>>>> + dpdk_dev_parse_mtu(args, &mtu); >>>>>>> + >>>>>>> + return netdev_dpdk_set_mtu(netdev_, mtu); >>>>>>> +} >>>>>>> + >>>>>>> static int >>>>>>> netdev_dpdk_get_numa_id(const struct netdev *netdev_) >>>>>>> { >>>>>>> @@ -1358,54 +1536,6 @@ netdev_dpdk_get_mtu(const struct netdev >>> *netdev, int *mtup) >>>>>>> >>>>>>> return 0; >>>>>>> } >>>>>>> - >>>>>>> -static int >>>>>>> -netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu) >>>>>>> -{ >>>>>>> - struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); >>>>>>> - int old_mtu, err; >>>>>>> - struct dpdk_mp *old_mp; >>>>>>> - struct dpdk_mp *mp; >>>>>>> - >>>>>>> - ovs_mutex_lock(&dpdk_mutex); >>>>>>> - ovs_mutex_lock(&dev->mutex); >>>>>>> - if (dev->mtu == mtu) { >>>>>>> - err = 0; >>>>>>> - goto out; >>>>>>> - } >>>>>>> - >>>>>>> - mp = dpdk_mp_get(dev->socket_id, dev->mtu); >>>>>>> - if (!mp) { >>>>>>> - err = ENOMEM; >>>>>>> - goto out; >>>>>>> - } >>>>>>> - >>>>>>> - rte_eth_dev_stop(dev->port_id); >>>>>>> - >>>>>>> - old_mtu = dev->mtu; >>>>>>> - old_mp = dev->dpdk_mp; >>>>>>> - dev->dpdk_mp = mp; >>>>>>> - dev->mtu = mtu; >>>>>>> - dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu); >>>>>>> - >>>>>>> - err = dpdk_eth_dev_init(dev); >>>>>>> - if (err) { >>>>>>> - dpdk_mp_put(mp); >>>>>>> - dev->mtu = old_mtu; >>>>>>> - dev->dpdk_mp = old_mp; >>>>>>> - dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu); >>>>>>> - dpdk_eth_dev_init(dev); >>>>>>> - goto out; >>>>>>> - } >>>>>>> - >>>>>>> - dpdk_mp_put(old_mp); >>>>>>> - netdev_change_seq_changed(netdev); >>>>>>> -out: >>>>>>> - ovs_mutex_unlock(&dev->mutex); >>>>>>> - ovs_mutex_unlock(&dpdk_mutex); >>>>>>> - return err; >>>>>>> -} >>>>>>> - >>>>>>> static int >>>>>>> netdev_dpdk_get_carrier(const struct netdev *netdev_, bool *carrier); >>>>>>> >>>>>>> @@ -1682,7 +1812,7 @@ netdev_dpdk_get_status(const struct netdev >>> *netdev_, struct smap >>>>>>*args) >>>>>>> smap_add_format(args, "numa_id", "%d", >>> rte_eth_dev_socket_id(dev->port_id)); >>>>>>> smap_add_format(args, "driver_name", "%s", dev_info.driver_name); >>>>>>> smap_add_format(args, "min_rx_bufsize", "%u", >>> dev_info.min_rx_bufsize); >>>>>>> - smap_add_format(args, "max_rx_pktlen", "%u", > dev_info.max_rx_pktlen); >>>>>>> + smap_add_format(args, "max_rx_pktlen", "%u", dev->max_packet_len); >>>>>>> smap_add_format(args, "max_rx_queues", "%u", > dev_info.max_rx_queues); >>>>>>> smap_add_format(args, "max_tx_queues", "%u", > dev_info.max_tx_queues); >>>>>>> smap_add_format(args, "max_mac_addrs", "%u", > dev_info.max_mac_addrs); >>>>>>> @@ -1904,6 +2034,51 @@ dpdk_vhost_user_class_init(void) >>>>>>> return 0; >>>>>>> } >>>>>>> >>>>>>> +/* Set the mtu of DPDK_DEV_VHOST ports */ >>>>>>> +static int >>>>>>> +netdev_dpdk_vhost_set_mtu(const struct netdev *netdev, int mtu) >>>>>>> +{ >>>>>>> + struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); >>>>>>> + int err = 0; >>>>>>> + struct dpdk_mp *old_mp; >>>>>>> + struct dpdk_mp *mp; >>>>>>> + >>>>>>> + ovs_mutex_lock(&dpdk_mutex); >>>>>>> + ovs_mutex_lock(&dev->mutex); >>>>>>> + if (dev->mtu == mtu) { >>>>>>> + err = 0; >>>>>>> + goto out; >>>>>>> + } >>>>>>> + >>>>>>> + mp = dpdk_mp_get(dev->socket_id, mtu); >>>>>>> + if (!mp) { >>>>>>> + err = ENOMEM; >>>>>>> + goto out; >>>>>>> + } >>>>>>> + >>>>>>> + old_mp = dev->dpdk_mp; >>>>>>> + dev->dpdk_mp = mp; >>>>>>> + dev->mtu = mtu; >>>>>>> + dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu); >>>>>>> + >>>>>>> + dpdk_mp_put(old_mp); >>>>>>> + netdev_change_seq_changed(netdev); >>>>>>> +out: >>>>>>> + ovs_mutex_unlock(&dev->mutex); >>>>>>> + ovs_mutex_unlock(&dpdk_mutex); >>>>>>> + return err; >>>>>>> +} >>>>>>> + >>>>>>> +static int >>>>>>> +netdev_dpdk_vhost_set_config(struct netdev *netdev_, const >>> struct smap *args) >>>>>>> +{ >>>>>>> + int mtu; >>>>>>> + >>>>>>> + dpdk_dev_parse_mtu(args, &mtu); >>>>>>> + >>>>>>> + return netdev_dpdk_vhost_set_mtu(netdev_, mtu); >>>>>>> +} >>>>>>> + >>>>>>> static void >>>>>>> dpdk_common_init(void) >>>>>>> { >>>>>>> @@ -2040,8 +2215,9 @@ unlock_dpdk: >>>>>>> return err; >>>>>>> } >>>>>>> >>>>>>> -#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, > MULTIQ, SEND, \ >>>>>>> - GET_CARRIER, GET_STATS, GET_FEATURES, GET_STATUS, RXQ_RECV) \ >>>>>>> +#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, SET_CONFIG, >>>>>>> \ >>>>>>> + MULTIQ, SEND, SET_MTU, GET_CARRIER, GET_STATS, GET_FEATURES, >>>>>>> \ >>>>>>> + GET_STATUS, RXQ_RECV) >>>>>>> \ >>>>>>> { \ >>>>>>> NAME, \ >>>>>>> INIT, /* init */ \ >>>>>>> @@ -2053,7 +2229,7 @@ unlock_dpdk: >>>>>>> DESTRUCT, \ >>>>>>> netdev_dpdk_dealloc, \ >>>>>>> netdev_dpdk_get_config, \ >>>>>>> - NULL, /* netdev_dpdk_set_config */ \ >>>>>>> + SET_CONFIG, \ >>>>>>> NULL, /* get_tunnel_config */ \ >>>>>>> NULL, /* build header */ \ >>>>>>> NULL, /* push header */ \ >>>>>>> @@ -2067,7 +2243,7 @@ unlock_dpdk: >>>>>>> netdev_dpdk_set_etheraddr, \ >>>>>>> netdev_dpdk_get_etheraddr, \ >>>>>>> netdev_dpdk_get_mtu, \ >>>>>>> - netdev_dpdk_set_mtu, \ >>>>>>> + SET_MTU, \ >>>>>>> netdev_dpdk_get_ifindex, \ >>>>>>> GET_CARRIER, \ >>>>>>> netdev_dpdk_get_carrier_resets, \ >>>>>>> @@ -2213,8 +2389,10 @@ static const struct netdev_class dpdk_class = >>>>>>> NULL, >>>>>>> netdev_dpdk_construct, >>>>>>> netdev_dpdk_destruct, >>>>>>> + netdev_dpdk_set_config, >>>>>>> netdev_dpdk_set_multiq, >>>>>>> netdev_dpdk_eth_send, >>>>>>> + netdev_dpdk_set_mtu, >>>>>>> netdev_dpdk_get_carrier, >>>>>>> netdev_dpdk_get_stats, >>>>>>> netdev_dpdk_get_features, >>>>>>> @@ -2227,8 +2405,10 @@ static const struct netdev_class dpdk_ring_class >>>>>>> = >>>>>>> NULL, >>>>>>> netdev_dpdk_ring_construct, >>>>>>> netdev_dpdk_destruct, >>>>>>> + netdev_dpdk_set_config, >>>>>>> netdev_dpdk_set_multiq, >>>>>>> netdev_dpdk_ring_send, >>>>>>> + netdev_dpdk_set_mtu, >>>>>>> netdev_dpdk_get_carrier, >>>>>>> netdev_dpdk_get_stats, >>>>>>> netdev_dpdk_get_features, >>>>>>> @@ -2241,8 +2421,10 @@ static const struct netdev_class >>> OVS_UNUSED dpdk_vhost_cuse_class >>>>= >>>>>>> dpdk_vhost_cuse_class_init, >>>>>>> netdev_dpdk_vhost_cuse_construct, >>>>>>> netdev_dpdk_vhost_destruct, >>>>>>> + NULL, >>>>>>> netdev_dpdk_vhost_set_multiq, >>>>>>> netdev_dpdk_vhost_send, >>>>>>> + NULL, >>>>>>> netdev_dpdk_vhost_get_carrier, >>>>>>> netdev_dpdk_vhost_get_stats, >>>>>>> NULL, >>>>>>> @@ -2255,8 +2437,10 @@ static const struct netdev_class >>> OVS_UNUSED dpdk_vhost_user_class >>>>= >>>>>>> dpdk_vhost_user_class_init, >>>>>>> netdev_dpdk_vhost_user_construct, >>>>>>> netdev_dpdk_vhost_destruct, >>>>>>> + netdev_dpdk_vhost_set_config, >>>>>>> netdev_dpdk_vhost_set_multiq, >>>>>>> netdev_dpdk_vhost_send, >>>>>>> + netdev_dpdk_vhost_set_mtu, >>>>>>> netdev_dpdk_vhost_get_carrier, >>>>>>> netdev_dpdk_vhost_get_stats, >>>>>>> NULL, >>>>>_______________________________________________ >>>>>dev mailing list >>>>>dev@openvswitch.org >>>>>http://openvswitch.org/mailman/listinfo/dev >>>>_______________________________________________ >>>>dev mailing list >>>>dev@openvswitch.org >>>>http://openvswitch.org/mailman/listinfo/dev >><#secure method=pgpmime mode=encrypt> _______________________________________________ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev