Re: [dpdk-dev] [PATCH v1 1/2] net/octeontx: fix null pointer dereference
On Tue, Mar 06, 2018 at 05:51:27PM +, Ferruh Yigit wrote: > On 2/20/2018 5:14 PM, Santosh Shukla wrote: > > Fixes: f18b146c498d ("net/octeontx: create ethdev ports") > > Coverity issue: 195040 > > > > Cc: sta...@dpdk.org > > Signed-off-by: Santosh Shukla > > Series applied to dpdk-next-net/master, thanks. > Hi Ferruh, > BTW, what is the plan to switching new offloading API in PMD? This release it > is > planned to remove support for old API. Thanks for the heads up, we will send out a patch switching to the new offload scheme. Pavan.
Re: [dpdk-dev] [PATCH] eal: register rte_panic user callback
Hi, 06/03/2018 19:28, Arnon Warshavsky: > The use case addressed here is dpdk environment init > aborting the process due to panic, > preventing the calling process from running its own tear-down actions. Thank you for working on this long standing issue. > A preferred, though ABI breaking solution would be > to have the environment init always return a value > rather than abort upon distress. Yes, it is the preferred solution. We should not use exit (panic & co) inside a library. It is important enough to break the API. I would be in favor of accepting such breakage in 18.05. > This patch defines a couple of callback registration functions, > one for panic and one for exit > in case one wishes to distinguish between these events. > Once a callback is set and panic takes place, > it will be called prior to calling abort. > > Maiden voyage patch for Qwilt and myself. Are you OK to visit the other side of the solution?
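For illustration, a minimal sketch of how a calling process could use such a registration hook before rte_eal_init(). The function name rte_set_panic_callback() and the callback signature below are assumptions for the sake of the example, not the names defined by the patch:

#include <stdio.h>
#include <rte_eal.h>

/* Hypothetical callback: application-specific tear-down, run by the EAL
 * just before it calls abort() on a panic. */
static void app_teardown(const char *panic_msg)
{
	fprintf(stderr, "EAL panic (%s): releasing application resources\n",
		panic_msg);
}

int main(int argc, char **argv)
{
	rte_set_panic_callback(app_teardown);	/* hypothetical registration API */

	if (rte_eal_init(argc, argv) < 0)
		return -1;
	/* ... */
	return 0;
}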
Re: [dpdk-dev] [PATCH] eal: register rte_panic user callback
> > > Are you OK to visit the other side of the solution? > > Sure. If no one is emotionally attached to those panic aborts, this patch can be discarded and I will create a new one with the license to break.
Re: [dpdk-dev] [PATCH 3/3] vhost: support VFIO based accelerator
On Tue, Mar 06, 2018 at 03:24:27PM +0100, Maxime Coquelin wrote: > On 03/06/2018 11:43 AM, Tiwei Bie wrote: [...] > > + > > +static int vhost_user_slave_set_vring_file(struct virtio_net *dev, > > + uint32_t request, > > + struct vhost_vring_file *file) > Why passing the request as an argument? > It seems to be called only with the same request ID. I thought there may be other requests that also need to send a file descriptor for a ring in the future. So I made this a common routine. Maybe it's not really helpful. I won't pass the request as an argument in next version. > > > +{ > > + int *fdp = NULL; > > + size_t fd_num = 0; > > + int ret; > > + struct VhostUserMsg msg = { > > + .request.slave = request, > > + .flags = VHOST_USER_VERSION | VHOST_USER_NEED_REPLY, > > + .payload.u64 = file->index & VHOST_USER_VRING_IDX_MASK, > > + .size = sizeof(msg.payload.u64), > > + }; > > + > > + if (file->fd < 0) > > + msg.payload.u64 |= VHOST_USER_VRING_NOFD_MASK; > > + else { > > + fdp = &file->fd; > > + fd_num = 1; > > + } > > + > > + ret = send_vhost_message(dev->slave_req_fd, &msg, fdp, fd_num); > > + if (ret < 0) { > > + RTE_LOG(ERR, VHOST_CONFIG, > > + "Failed to send slave message %u (%d)\n", > > + request, ret); > > + return ret; > > + } > > + > > + return process_slave_message_reply(dev, &msg); > > Maybe not needed right now, but we'll need a lock to avoid concurrent > requests sending and waiting for reply. Yeah, probably, we need a lock for each slave channel. I didn't check the code of Linux. Maybe it will cause problems when two threads send e.g. below messages at the same time: thread A: IOTLB miss message thread B: VFIO group message which has a file descriptor Thanks for the comments! :) Best regards, Tiwei Bie
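As a rough sketch of the per-channel lock mentioned above (not part of the posted patch), the request/reply pair on the slave channel could be serialized with a mutex, so that e.g. an IOTLB miss message and a VFIO group message cannot interleave their replies. The slave_req_lock field is an assumed addition to struct virtio_net; send_vhost_message() and process_slave_message_reply() are the helpers from the quoted code:

static int
send_slave_request_locked(struct virtio_net *dev, struct VhostUserMsg *msg,
			  int *fds, size_t fd_num)
{
	int ret;

	/* assumed new per-slave-channel lock */
	pthread_mutex_lock(&dev->slave_req_lock);

	ret = send_vhost_message(dev->slave_req_fd, msg, fds, fd_num);
	if (ret >= 0)
		ret = process_slave_message_reply(dev, msg);

	pthread_mutex_unlock(&dev->slave_req_lock);

	return ret;
}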
Re: [dpdk-dev] [PATCH] eal: register rte_panic user callback
On 07-Mar-18 8:32 AM, Thomas Monjalon wrote: Hi, 06/03/2018 19:28, Arnon Warshavsky: The use case addressed here is dpdk environment init aborting the process due to panic, preventing the calling process from running its own tear-down actions. Thank you for working on this long standing issue. A preferred, though ABI breaking solution would be to have the environment init always return a value rather than abort upon distress. Yes, it is the preferred solution. We should not use exit (panic & co) inside a library. It is important enough to break the API. +1, panic exists mostly for historical reasons AFAIK. it's a pity i didn't think of it at the time of submitting the memory hotplug RFC, because i now hit the same issue with the v1 - we might panic while holding a lock, and didn't realize that it was an API break to change this behavior. Can this really go into current release without deprecation notices? -- Thanks, Anatoly
Re: [dpdk-dev] [PATCH] vhost: stop device before updating public vring data
> -Original Message- > From: Maxime Coquelin [mailto:maxime.coque...@redhat.com] > Sent: Tuesday, March 6, 2018 5:27 PM > To: Kulasek, TomaszX ; y...@fridaylinux.org > Cc: Verkamp, Daniel ; Harris, James R > ; Wodkowski, PawelX > ; dev@dpdk.org; Stojaczyk, DariuszX > > Subject: Re: [dpdk-dev] [PATCH] vhost: stop device before updating public > vring > data > > Hi Tomasz, > > On 03/05/2018 05:11 PM, Tomasz Kulasek wrote: > > For now DPDK assumes that callfd, kickfd and last_idx are being set just > > once during vring initialization and device cannot be running while DPDK > > receives SET_VRING_KICK, SET_VRING_CALL and SET_VRING_BASE messages. > > However, that assumption is wrong. For Vhost SCSI messages might arrive > > at any point of time, possibly multiple times, one after another. > > > > QEMU issues SET_VRING_CALL once during device initialization, then again > > during device start. The second message will close previous callfd, > > which is still being used by the user-implementation of vhost device. > > This results in writing to invalid (closed) callfd. > > > > Other messages like SET_FEATURES, SET_VRING_ADDR etc also will change > > internal state of VQ or device. To prevent race condition device should > > also be stopped before updateing vring data. > > > > Signed-off-by: Dariusz Stojaczyk > > Signed-off-by: Pawel Wodkowski > > Signed-off-by: Tomasz Kulasek > > --- > > lib/librte_vhost/vhost_user.c | 40 > > > 1 file changed, 40 insertions(+) > > In last release, we have introduced a per-virtqueue lock to protect > vring handling against asynchronous device changes. > > I think that would solve the issue you are facing, but you would need > to export the VQs locking functions to the vhost-user lib API to be > able to use it. > > I don't think your current patch is the right solution anyway, because > it destroys the device in case we don't want it to remain alive, like > set_log_base, or set_features when only the logging feature gets > enabled. Please correct me if I can't see something obvious, but how this lock protect against eg SET_MEM_TABLE message? Current flow you are thinking of is: DPDK vhost-user thread 1.1. vhost_user_lock_all_queue_pairs() 1.2. vhost_user_set_mem_table() 1.3. vhost_user_unlock_all_queue_pairs() BACKEND: virito-net: 2.1. rte_spinlock_lock(&vq->access_lock); 2.2. Process vrings and copy all data 2.3. rte_spinlock_unlock(&vq->access_lock); Yes, it will synchronize access to virtio_net structure but what if the BACKEND is in zero copy mode and/or pass buffers to physical device? The request will not end in 2.2 and you unmap the memory regions in the middle of request. Even worse, the physical device will just abort the request but BACKEND can segfault or write random memory because BACKEND try to use invalid memory address (retrieved at request start). To use this per-virtqueue lock: 1. the lock need to be held from request start to the end - but this can starve DPDK vhost-user thread as there might be many request on-the-fly and when one is done the new one might be started. 2. Becouse we don't know if something changed between requst start and request end BACKEND need walk through all descriptors chain at the request end and do the rte_vhost_gpa_to_vva() again. The SET_MEM_TABLE is most obvious message but the same is true for other like VHOST_IOTLB_INVALIDATE or SET_FEATURES. Pawel > > Cheers, > Maxime
Re: [dpdk-dev] [PATCH v1] net/tap: allow user MAC to be passed as args
Hi Ferruh, You are correct about this, I will add initialization send a next version patch. > -Original Message- > From: Yigit, Ferruh > Sent: Tuesday, March 6, 2018 4:42 PM > To: Varghese, Vipin ; dev@dpdk.org; > pascal.ma...@6wind.com > Cc: Jain, Deepak K > Subject: Re: [PATCH v1] net/tap: allow user MAC to be passed as args > > On 2/12/2018 2:44 PM, Vipin Varghese wrote: > > Allow TAP PMD to pass user desired MAC address as argument. > > The argument value is processed as string delimited by ':', is parsed > > and converted to HEX MAC address after validation. > > > > Signed-off-by: Vipin Varghese > > Signed-off-by: Pascal Mazon > > <...> > > > @@ -1589,7 +1630,7 @@ enum ioctl_mode { > > int speed; > > char tap_name[RTE_ETH_NAME_MAX_LEN]; > > char remote_iface[RTE_ETH_NAME_MAX_LEN]; > > - int fixed_mac_type = 0; > > + struct ether_addr user_mac; > > > > name = rte_vdev_device_name(dev); > > params = rte_vdev_device_args(dev); > > @@ -1626,7 +1667,7 @@ enum ioctl_mode { > > ret = rte_kvargs_process(kvlist, > > ETH_TAP_MAC_ARG, > > &set_mac_type, > > -&fixed_mac_type); > > +&user_mac); > > if (ret == -1) > > goto leave; > > } > > @@ -1637,7 +1678,7 @@ enum ioctl_mode { > > RTE_LOG(NOTICE, PMD, "Initializing pmd_tap for %s as %s\n", > > name, tap_name); > > > > - ret = eth_dev_tap_create(dev, tap_name, remote_iface, > fixed_mac_type); > > + ret = eth_dev_tap_create(dev, tap_name, remote_iface, &user_mac); > > "user_mac" without initial value is leading error when no "mac" argument is > provided. It should be zeroed out.
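A minimal sketch of the fix being discussed: give user_mac a defined (zeroed) value before the kvargs are parsed, so behaviour stays defined when no "mac" devarg is given.

	struct ether_addr user_mac;

	/* zeroed so the no-"mac"-argument case is well defined */
	memset(&user_mac, 0, sizeof(user_mac));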
Re: [dpdk-dev] [PATCH] eal: register rte_panic user callback
07/03/2018 10:05, Burakov, Anatoly: > On 07-Mar-18 8:32 AM, Thomas Monjalon wrote: > > Hi, > > > > 06/03/2018 19:28, Arnon Warshavsky: > >> The use case addressed here is dpdk environment init > >> aborting the process due to panic, > >> preventing the calling process from running its own tear-down actions. > > > > Thank you for working on this long standing issue. > > > >> A preferred, though ABI breaking solution would be > >> to have the environment init always return a value > >> rather than abort upon distress. > > > > Yes, it is the preferred solution. > > We should not use exit (panic & co) inside a library. > > It is important enough to break the API. > > +1, panic exists mostly for historical reasons AFAIK. it's a pity i > didn't think of it at the time of submitting the memory hotplug RFC, > because i now hit the same issue with the v1 - we might panic while > holding a lock, and didn't realize that it was an API break to change > this behavior. > > Can this really go into current release without deprecation notices? If such an exception is done, it must be approved by the technical board. We need to check few criterias: - which functions need to be changed - how the application is impacted - what is the urgency If a panic is removed and the application is not already checking some error code, the execution will continue without considering the error. Some rte_panic could be probably removed without any impact on applications. Some rte_panic could wait for 18.08 with a notice in 18.05. If some rte_panic cannot wait, it must be discussed specifically.
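To make the application impact concrete, here is a sketch (not taken from any pending patch) of what one removed rte_panic looks like on both sides: the library path returns an error instead of aborting, and the application only notices if it actually checks the return code.

	/* library side, before: the whole process is aborted on failure */
	if (cfg == NULL)
		rte_panic("Cannot allocate runtime configuration\n");

	/* library side, after: report and propagate the error instead */
	if (cfg == NULL) {
		RTE_LOG(ERR, EAL, "Cannot allocate runtime configuration\n");
		return -1;
	}

	/* application side: the failure is silently ignored unless checked */
	if (rte_eal_init(argc, argv) < 0)
		rte_exit(EXIT_FAILURE, "EAL initialization failed\n");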
Re: [dpdk-dev] [PATCH] vhost: stop device before updating public vring data
On 03/07/2018 10:16 AM, Wodkowski, PawelX wrote: -Original Message- From: Maxime Coquelin [mailto:maxime.coque...@redhat.com] Sent: Tuesday, March 6, 2018 5:27 PM To: Kulasek, TomaszX ; y...@fridaylinux.org Cc: Verkamp, Daniel ; Harris, James R ; Wodkowski, PawelX ; dev@dpdk.org; Stojaczyk, DariuszX Subject: Re: [dpdk-dev] [PATCH] vhost: stop device before updating public vring data Hi Tomasz, On 03/05/2018 05:11 PM, Tomasz Kulasek wrote: For now DPDK assumes that callfd, kickfd and last_idx are being set just once during vring initialization and device cannot be running while DPDK receives SET_VRING_KICK, SET_VRING_CALL and SET_VRING_BASE messages. However, that assumption is wrong. For Vhost SCSI messages might arrive at any point of time, possibly multiple times, one after another. QEMU issues SET_VRING_CALL once during device initialization, then again during device start. The second message will close previous callfd, which is still being used by the user-implementation of vhost device. This results in writing to invalid (closed) callfd. Other messages like SET_FEATURES, SET_VRING_ADDR etc also will change internal state of VQ or device. To prevent race condition device should also be stopped before updateing vring data. Signed-off-by: Dariusz Stojaczyk Signed-off-by: Pawel Wodkowski Signed-off-by: Tomasz Kulasek --- lib/librte_vhost/vhost_user.c | 40 1 file changed, 40 insertions(+) In last release, we have introduced a per-virtqueue lock to protect vring handling against asynchronous device changes. I think that would solve the issue you are facing, but you would need to export the VQs locking functions to the vhost-user lib API to be able to use it. I don't think your current patch is the right solution anyway, because it destroys the device in case we don't want it to remain alive, like set_log_base, or set_features when only the logging feature gets enabled. Please correct me if I can't see something obvious, but how this lock protect against eg SET_MEM_TABLE message? Current flow you are thinking of is: DPDK vhost-user thread 1.1. vhost_user_lock_all_queue_pairs() 1.2. vhost_user_set_mem_table() 1.3. vhost_user_unlock_all_queue_pairs() BACKEND: virito-net: 2.1. rte_spinlock_lock(&vq->access_lock); 2.2. Process vrings and copy all data 2.3. rte_spinlock_unlock(&vq->access_lock); Yes, it will synchronize access to virtio_net structure but what if the BACKEND is in zero copy mode and/or pass buffers to physical device? The request will not end in 2.2 and you unmap the memory regions in the middle of request. Even worse, the physical device will just abort the request but BACKEND can segfault or write random memory because BACKEND try to use invalid memory address (retrieved at request start). Right, it doesn't work with zero-copy. To use this per-virtqueue lock: 1. the lock need to be held from request start to the end - but this can starve DPDK vhost-user thread as there might be many request on-the-fly and when one is done the new one might be started. 2. Becouse we don't know if something changed between requst start and request end BACKEND need walk through all descriptors chain at the request end and do the rte_vhost_gpa_to_vva() again. The SET_MEM_TABLE is most obvious message but the same is true for other like VHOST_IOTLB_INVALIDATE or SET_FEATURES. SET_FEATURE should never be sent as soon as the device is started, except to enable logging. 
For VHOST_IOTLB_INVALIDATE, the solution might be to have a ref counter per entry, and to only remove it from the cache once it is zero and send the reply-ack to the master once this is done. But the cost would be huge as with large entries, a lot of threads might increment/decrement the same variable so there will be contention. For all other cases, like SET_MEM_TABLE, maybe the solution is to disable/enable all the queues using the existing ops. The application or library would have to take care that no guest buffers are in the wild before returning from the disable. Do you think that would work? Cheers, Maxime Pawel Cheers, Maxime

Re: [dpdk-dev] [PATCH v3] net/null:Different mac address support
On 3/7/2018 3:31 AM, Mallesh Koujalagi wrote: > After attaching two Null device to ovs, seeing "00.00.00.00.00.00" mac > address for both null devices. Fix this issue, by setting different mac > address. > > Signed-off-by: Mallesh Koujalagi Reviewed-by: Ferruh Yigit There are some commit formatting issues which I can fix while applying for this one, but in the future can you please run "./devtools/check-git-log.sh" before sending patches.
Re: [dpdk-dev] [PATCH] net/null: Support bulk alloc and free.
On 3/5/2018 3:36 PM, Ananyev, Konstantin wrote: > > >> -Original Message- >> From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Ferruh Yigit >> Sent: Monday, March 5, 2018 3:25 PM >> To: Koujalagi, MalleshX ; dev@dpdk.org >> Cc: mtetsu...@gmail.com >> Subject: Re: [dpdk-dev] [PATCH] net/null: Support bulk alloc and free. >> >> On 2/3/2018 3:11 AM, Mallesh Koujalagi wrote: >>> After bulk allocation and freeing of multiple mbufs increase more than ~2% >>> throughput on single core. >>> >>> Signed-off-by: Mallesh Koujalagi >>> --- >>> drivers/net/null/rte_eth_null.c | 16 +++- >>> 1 file changed, 7 insertions(+), 9 deletions(-) >>> >>> diff --git a/drivers/net/null/rte_eth_null.c >>> b/drivers/net/null/rte_eth_null.c >>> index 9385ffd..247ede0 100644 >>> --- a/drivers/net/null/rte_eth_null.c >>> +++ b/drivers/net/null/rte_eth_null.c >>> @@ -130,10 +130,11 @@ eth_null_copy_rx(void *q, struct rte_mbuf **bufs, >>> uint16_t nb_bufs) >>> return 0; >>> >>> packet_size = h->internals->packet_size; >>> + >>> + if (rte_pktmbuf_alloc_bulk(h->mb_pool, bufs, nb_bufs) != 0) >>> + return 0; >>> + >>> for (i = 0; i < nb_bufs; i++) { >>> - bufs[i] = rte_pktmbuf_alloc(h->mb_pool); >>> - if (!bufs[i]) >>> - break; >>> rte_memcpy(rte_pktmbuf_mtod(bufs[i], void *), h->dummy_packet, >>> packet_size); >>> bufs[i]->data_len = (uint16_t)packet_size; >>> @@ -149,18 +150,15 @@ eth_null_copy_rx(void *q, struct rte_mbuf **bufs, >>> uint16_t nb_bufs) >>> static uint16_t >>> eth_null_tx(void *q, struct rte_mbuf **bufs, uint16_t nb_bufs) >>> { >>> - int i; >>> struct null_queue *h = q; >>> >>> if ((q == NULL) || (bufs == NULL)) >>> return 0; >>> >>> - for (i = 0; i < nb_bufs; i++) >>> - rte_pktmbuf_free(bufs[i]); >>> + rte_mempool_put_bulk(bufs[0]->pool, (void **)bufs, nb_bufs); >> >> Is it guarantied that all mbufs will be from same mempool? > > I don't think it does, plus > rte_pktmbuf_free(mb) != rte_mempool_put_bulk(mb->pool, &mb, 1); Perhaps we can just benefit from bulk alloc. Hi Mallesh, Does it give any performance improvement if we switch "rte_pktmbuf_alloc()" to "rte_pktmbuf_alloc_bulk()" but keep free functions untouched? Thanks, ferruh > Konstantin > >> >>> + rte_atomic64_add(&h->tx_pkts, nb_bufs); >>> >>> - rte_atomic64_add(&(h->tx_pkts), i); >>> - >>> - return i; >>> + return nb_bufs; >>> } >>> >>> static uint16_t >>> >
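For reference, a sketch of the variant asked about above: only the RX allocation switches to rte_pktmbuf_alloc_bulk(), while eth_null_tx() keeps the original per-mbuf rte_pktmbuf_free() loop. Field names follow the quoted driver code; the rx_pkts counter name is an assumption.

static uint16_t
eth_null_copy_rx(void *q, struct rte_mbuf **bufs, uint16_t nb_bufs)
{
	int i;
	struct null_queue *h = q;
	unsigned int packet_size;

	if ((q == NULL) || (bufs == NULL))
		return 0;

	packet_size = h->internals->packet_size;

	/* one bulk grab from the mempool instead of nb_bufs single allocs */
	if (rte_pktmbuf_alloc_bulk(h->mb_pool, bufs, nb_bufs) != 0)
		return 0;

	for (i = 0; i < nb_bufs; i++) {
		rte_memcpy(rte_pktmbuf_mtod(bufs[i], void *), h->dummy_packet,
			   packet_size);
		bufs[i]->data_len = (uint16_t)packet_size;
		bufs[i]->pkt_len = packet_size;
	}

	rte_atomic64_add(&h->rx_pkts, i);	/* counter name assumed */

	return i;
}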
Re: [dpdk-dev] [PATCH] vhost: stop device before updating public vring data
> -Original Message- > From: Maxime Coquelin [mailto:maxime.coque...@redhat.com] > Sent: Wednesday, March 7, 2018 11:09 AM > To: Wodkowski, PawelX ; Kulasek, TomaszX > ; y...@fridaylinux.org > Cc: Verkamp, Daniel ; Harris, James R > ; dev@dpdk.org; Stojaczyk, DariuszX > > Subject: Re: [dpdk-dev] [PATCH] vhost: stop device before updating public > vring > data > > > > On 03/07/2018 10:16 AM, Wodkowski, PawelX wrote: > >> -Original Message- > >> From: Maxime Coquelin [mailto:maxime.coque...@redhat.com] > >> Sent: Tuesday, March 6, 2018 5:27 PM > >> To: Kulasek, TomaszX ; y...@fridaylinux.org > >> Cc: Verkamp, Daniel ; Harris, James R > >> ; Wodkowski, PawelX > >> ; dev@dpdk.org; Stojaczyk, DariuszX > >> > >> Subject: Re: [dpdk-dev] [PATCH] vhost: stop device before updating public > vring > >> data > >> > >> Hi Tomasz, > >> > >> On 03/05/2018 05:11 PM, Tomasz Kulasek wrote: > >>> For now DPDK assumes that callfd, kickfd and last_idx are being set just > >>> once during vring initialization and device cannot be running while DPDK > >>> receives SET_VRING_KICK, SET_VRING_CALL and SET_VRING_BASE > messages. > >>> However, that assumption is wrong. For Vhost SCSI messages might arrive > >>> at any point of time, possibly multiple times, one after another. > >>> > >>> QEMU issues SET_VRING_CALL once during device initialization, then again > >>> during device start. The second message will close previous callfd, > >>> which is still being used by the user-implementation of vhost device. > >>> This results in writing to invalid (closed) callfd. > >>> > >>> Other messages like SET_FEATURES, SET_VRING_ADDR etc also will > change > >>> internal state of VQ or device. To prevent race condition device should > >>> also be stopped before updateing vring data. > >>> > >>> Signed-off-by: Dariusz Stojaczyk > >>> Signed-off-by: Pawel Wodkowski > >>> Signed-off-by: Tomasz Kulasek > >>> --- > >>>lib/librte_vhost/vhost_user.c | 40 > >> > >>>1 file changed, 40 insertions(+) > >> > >> In last release, we have introduced a per-virtqueue lock to protect > >> vring handling against asynchronous device changes. > >> > >> I think that would solve the issue you are facing, but you would need > >> to export the VQs locking functions to the vhost-user lib API to be > >> able to use it. > >> > >> I don't think your current patch is the right solution anyway, because > >> it destroys the device in case we don't want it to remain alive, like > >> set_log_base, or set_features when only the logging feature gets > >> enabled. > > > > Please correct me if I can't see something obvious, but how this lock > > protect > against eg > > SET_MEM_TABLE message? Current flow you are thinking of is: > > > > DPDK vhost-user thread > > 1.1. vhost_user_lock_all_queue_pairs() > > 1.2. vhost_user_set_mem_table() > > 1.3. vhost_user_unlock_all_queue_pairs() > > > > BACKEND: virito-net: > > 2.1. rte_spinlock_lock(&vq->access_lock); > > 2.2. Process vrings and copy all data > > 2.3. rte_spinlock_unlock(&vq->access_lock); > > > > Yes, it will synchronize access to virtio_net structure but what if the > > BACKEND > is in > > zero copy mode and/or pass buffers to physical device? The request will > > not end in 2.2 and you unmap the memory regions in the middle of request. > > Even worse, the physical device will just abort the request but BACKEND can > segfault > > or write random memory because BACKEND try to use invalid memory > address > > (retrieved at request start). > > Right, it doesn't work with zero-copy. 
> > > To use this per-virtqueue lock: > > 1. the lock need to be held from request start to the end - but this can > > starve > DPDK > > vhost-user thread as there might be many request on-the-fly and when one is > done > > the new one might be started. > > 2. Becouse we don't know if something changed between requst start and > request end > > BACKEND need walk through all descriptors chain at the request end and do > the > > rte_vhost_gpa_to_vva() again. > > > > The SET_MEM_TABLE is most obvious message but the same is true for other > like > > VHOST_IOTLB_INVALIDATE or SET_FEATURES. > > SET_FEATURE should never be sent as soon as the device is started, > except to enable logging. > > For VHOST_IOTLB_INVALIDATE, the solution might be to have a ref counter > per entry, and to only remove it for the cache once it is zero and send > the reply-ack tothe master once this is done. But the cost would be huge > as with large entries, a lot of threads might increment/decrement the > same variable so there will be contention. > > For all other cases, like SET_MEM_TABLE, maybe the solution is to > disable/enable all the queues using the existing ops. > The application or library would have to take care that no guest buffers > are in the wild before returning from the disable. > > Do you think that would work? What kind of ops can be used to reliably disable all queues and inform b
Re: [dpdk-dev] [PATCH] eal: register rte_panic user callback
> > Can this really go into current release without deprecation notices? > > If such an exception is done, it must be approved by the technical board. > We need to check few criterias: > - which functions need to be changed > - how the application is impacted > - what is the urgency > > If a panic is removed and the application is not already checking some > error code, the execution will continue without considering the error. > > Some rte_panic could be probably removed without any impact on > applications. > Some rte_panic could wait for 18.08 with a notice in 18.05. > If some rte_panic cannot wait, it must be discussed specifically. > Every panic removal must be handled all the way up in all call paths. If not all instances can be removed at once in 18.05 (which seems to be the case), maybe we should keep the callback patch until all the remaining ones are gone.
Re: [dpdk-dev] [PATCH 5/8] net/mrvl: add classifier support
On 2/21/2018 2:14 PM, Tomasz Duszynski wrote: > Add classifier configuration support via rte_flow api. > > Signed-off-by: Natalie Samsonov > Signed-off-by: Tomasz Duszynski > --- > doc/guides/nics/mrvl.rst | 168 +++ > drivers/net/mrvl/Makefile |1 + > drivers/net/mrvl/mrvl_ethdev.c | 59 + > drivers/net/mrvl/mrvl_ethdev.h | 10 + > drivers/net/mrvl/mrvl_flow.c | 2787 > > 5 files changed, 3025 insertions(+) > create mode 100644 drivers/net/mrvl/mrvl_flow.c <...> > diff --git a/drivers/net/mrvl/mrvl_flow.c b/drivers/net/mrvl/mrvl_flow.c > new file mode 100644 > index 000..a2c25e6 > --- /dev/null > +++ b/drivers/net/mrvl/mrvl_flow.c > @@ -0,0 +1,2787 @@ > +/*- > + * BSD LICENSE > + * > + * Copyright(c) 2018 Marvell International Ltd. > + * Copyright(c) 2018 Semihalf. > + * All rights reserved. > + * > + * Redistribution and use in source and binary forms, with or without > + * modification, are permitted provided that the following conditions > + * are met: > + * > + * * Redistributions of source code must retain the above copyright > + * notice, this list of conditions and the following disclaimer. > + * * Redistributions in binary form must reproduce the above copyright > + * notice, this list of conditions and the following disclaimer in > + * the documentation and/or other materials provided with the > + * distribution. > + * * Neither the name of the copyright holder nor the names of its > + * contributors may be used to endorse or promote products derived > + * from this software without specific prior written permission. > + * > + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS > + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT > + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR > + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT > + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, > + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT > + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, > + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY > + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT > + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE > + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. > + */ Can you please use SPDX licensing tags for new files? And marvell PMD seems not switched to SPDX tags yet, can you please plan the switch? Thanks, ferruh
Re: [dpdk-dev] [PATCH 5/8] net/mrvl: add classifier support
On Wed, Mar 07, 2018 at 11:07:14AM +, Ferruh Yigit wrote: > On 2/21/2018 2:14 PM, Tomasz Duszynski wrote: > > Add classifier configuration support via rte_flow api. > > > > Signed-off-by: Natalie Samsonov > > Signed-off-by: Tomasz Duszynski > > --- > > doc/guides/nics/mrvl.rst | 168 +++ > > drivers/net/mrvl/Makefile |1 + > > drivers/net/mrvl/mrvl_ethdev.c | 59 + > > drivers/net/mrvl/mrvl_ethdev.h | 10 + > > drivers/net/mrvl/mrvl_flow.c | 2787 > > > > 5 files changed, 3025 insertions(+) > > create mode 100644 drivers/net/mrvl/mrvl_flow.c > > <...> > > > diff --git a/drivers/net/mrvl/mrvl_flow.c b/drivers/net/mrvl/mrvl_flow.c > > new file mode 100644 > > index 000..a2c25e6 > > --- /dev/null > > +++ b/drivers/net/mrvl/mrvl_flow.c > > @@ -0,0 +1,2787 @@ > > +/*- > > + * BSD LICENSE > > + * > > + * Copyright(c) 2018 Marvell International Ltd. > > + * Copyright(c) 2018 Semihalf. > > + * All rights reserved. > > + * > > + * Redistribution and use in source and binary forms, with or without > > + * modification, are permitted provided that the following conditions > > + * are met: > > + * > > + * * Redistributions of source code must retain the above copyright > > + * notice, this list of conditions and the following disclaimer. > > + * * Redistributions in binary form must reproduce the above copyright > > + * notice, this list of conditions and the following disclaimer in > > + * the documentation and/or other materials provided with the > > + * distribution. > > + * * Neither the name of the copyright holder nor the names of its > > + * contributors may be used to endorse or promote products derived > > + * from this software without specific prior written permission. > > + * > > + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS > > + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT > > + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR > > + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT > > + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, > > + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT > > + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, > > + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY > > + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT > > + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE > > + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. > > + */ > > Can you please use SPDX licensing tags for new files? > > And marvell PMD seems not switched to SPDX tags yet, can you please plan the > switch? SPDX conversion patches are waiting for submission. Once Marvell gives final approval they will be pushed out. > > Thanks, > ferruh -- - Tomasz Duszyński
Re: [dpdk-dev] [PATCH 5/8] net/mrvl: add classifier support
On 3/7/2018 11:16 AM, Tomasz Duszynski wrote: > On Wed, Mar 07, 2018 at 11:07:14AM +, Ferruh Yigit wrote: >> On 2/21/2018 2:14 PM, Tomasz Duszynski wrote: >>> Add classifier configuration support via rte_flow api. >>> >>> Signed-off-by: Natalie Samsonov >>> Signed-off-by: Tomasz Duszynski >>> --- >>> doc/guides/nics/mrvl.rst | 168 +++ >>> drivers/net/mrvl/Makefile |1 + >>> drivers/net/mrvl/mrvl_ethdev.c | 59 + >>> drivers/net/mrvl/mrvl_ethdev.h | 10 + >>> drivers/net/mrvl/mrvl_flow.c | 2787 >>> >>> 5 files changed, 3025 insertions(+) >>> create mode 100644 drivers/net/mrvl/mrvl_flow.c >> >> <...> >> >>> diff --git a/drivers/net/mrvl/mrvl_flow.c b/drivers/net/mrvl/mrvl_flow.c >>> new file mode 100644 >>> index 000..a2c25e6 >>> --- /dev/null >>> +++ b/drivers/net/mrvl/mrvl_flow.c >>> @@ -0,0 +1,2787 @@ >>> +/*- >>> + * BSD LICENSE >>> + * >>> + * Copyright(c) 2018 Marvell International Ltd. >>> + * Copyright(c) 2018 Semihalf. >>> + * All rights reserved. >>> + * >>> + * Redistribution and use in source and binary forms, with or without >>> + * modification, are permitted provided that the following conditions >>> + * are met: >>> + * >>> + * * Redistributions of source code must retain the above copyright >>> + * notice, this list of conditions and the following disclaimer. >>> + * * Redistributions in binary form must reproduce the above copyright >>> + * notice, this list of conditions and the following disclaimer in >>> + * the documentation and/or other materials provided with the >>> + * distribution. >>> + * * Neither the name of the copyright holder nor the names of its >>> + * contributors may be used to endorse or promote products derived >>> + * from this software without specific prior written permission. >>> + * >>> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS >>> + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT >>> + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR >>> + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT >>> + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, >>> + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT >>> + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, >>> + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY >>> + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT >>> + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE >>> + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. >>> + */ >> >> Can you please use SPDX licensing tags for new files? >> >> And marvell PMD seems not switched to SPDX tags yet, can you please plan the >> switch? > > SPDX conversion patches are waiting for submission. Once Marvell > gives final approval they will be pushed out. Good, thank you. Will it be possible to send this set with SPDX license? > >> >> Thanks, >> ferruh > > -- > - Tomasz Duszyński >
Re: [dpdk-dev] [PATCH] vhost: stop device before updating public vring data
On 03/07/2018 11:59 AM, Wodkowski, PawelX wrote: -Original Message- From: Maxime Coquelin [mailto:maxime.coque...@redhat.com] Sent: Wednesday, March 7, 2018 11:09 AM To: Wodkowski, PawelX ; Kulasek, TomaszX ; y...@fridaylinux.org Cc: Verkamp, Daniel ; Harris, James R ; dev@dpdk.org; Stojaczyk, DariuszX Subject: Re: [dpdk-dev] [PATCH] vhost: stop device before updating public vring data On 03/07/2018 10:16 AM, Wodkowski, PawelX wrote: -Original Message- From: Maxime Coquelin [mailto:maxime.coque...@redhat.com] Sent: Tuesday, March 6, 2018 5:27 PM To: Kulasek, TomaszX ; y...@fridaylinux.org Cc: Verkamp, Daniel ; Harris, James R ; Wodkowski, PawelX ; dev@dpdk.org; Stojaczyk, DariuszX Subject: Re: [dpdk-dev] [PATCH] vhost: stop device before updating public vring data Hi Tomasz, On 03/05/2018 05:11 PM, Tomasz Kulasek wrote: For now DPDK assumes that callfd, kickfd and last_idx are being set just once during vring initialization and device cannot be running while DPDK receives SET_VRING_KICK, SET_VRING_CALL and SET_VRING_BASE messages. However, that assumption is wrong. For Vhost SCSI messages might arrive at any point of time, possibly multiple times, one after another. QEMU issues SET_VRING_CALL once during device initialization, then again during device start. The second message will close previous callfd, which is still being used by the user-implementation of vhost device. This results in writing to invalid (closed) callfd. Other messages like SET_FEATURES, SET_VRING_ADDR etc also will change internal state of VQ or device. To prevent race condition device should also be stopped before updateing vring data. Signed-off-by: Dariusz Stojaczyk Signed-off-by: Pawel Wodkowski Signed-off-by: Tomasz Kulasek --- lib/librte_vhost/vhost_user.c | 40 1 file changed, 40 insertions(+) In last release, we have introduced a per-virtqueue lock to protect vring handling against asynchronous device changes. I think that would solve the issue you are facing, but you would need to export the VQs locking functions to the vhost-user lib API to be able to use it. I don't think your current patch is the right solution anyway, because it destroys the device in case we don't want it to remain alive, like set_log_base, or set_features when only the logging feature gets enabled. Please correct me if I can't see something obvious, but how this lock protect against eg SET_MEM_TABLE message? Current flow you are thinking of is: DPDK vhost-user thread 1.1. vhost_user_lock_all_queue_pairs() 1.2. vhost_user_set_mem_table() 1.3. vhost_user_unlock_all_queue_pairs() BACKEND: virito-net: 2.1. rte_spinlock_lock(&vq->access_lock); 2.2. Process vrings and copy all data 2.3. rte_spinlock_unlock(&vq->access_lock); Yes, it will synchronize access to virtio_net structure but what if the BACKEND is in zero copy mode and/or pass buffers to physical device? The request will not end in 2.2 and you unmap the memory regions in the middle of request. Even worse, the physical device will just abort the request but BACKEND can segfault or write random memory because BACKEND try to use invalid memory address (retrieved at request start). Right, it doesn't work with zero-copy. To use this per-virtqueue lock: 1. the lock need to be held from request start to the end - but this can starve DPDK vhost-user thread as there might be many request on-the-fly and when one is done the new one might be started. 2. 
Becouse we don't know if something changed between requst start and request end BACKEND need walk through all descriptors chain at the request end and do the rte_vhost_gpa_to_vva() again. The SET_MEM_TABLE is most obvious message but the same is true for other like VHOST_IOTLB_INVALIDATE or SET_FEATURES. SET_FEATURE should never be sent as soon as the device is started, except to enable logging. For VHOST_IOTLB_INVALIDATE, the solution might be to have a ref counter per entry, and to only remove it for the cache once it is zero and send the reply-ack tothe master once this is done. But the cost would be huge as with large entries, a lot of threads might increment/decrement the same variable so there will be contention. For all other cases, like SET_MEM_TABLE, maybe the solution is to disable/enable all the queues using the existing ops. The application or library would have to take care that no guest buffers are in the wild before returning from the disable. Do you think that would work? What kind of ops can be used to reliably disable all queues and inform backend what changed beside new_device/destroy_device? Those informations are very well hidden inside vhost.c and vhost-user.c files. (struct vhost_device_ops).vring_state_changed() When a queue is disabled, I think we can expect the application won't use its resources anymore. I think we need new set of ops/callbacks in vhost_device_ops struct that let the backend decide how
Re: [dpdk-dev] [PATCH] eal: register rte_panic user callback
On 07-Mar-18 9:59 AM, Thomas Monjalon wrote: 07/03/2018 10:05, Burakov, Anatoly: On 07-Mar-18 8:32 AM, Thomas Monjalon wrote: Hi, 06/03/2018 19:28, Arnon Warshavsky: The use case addressed here is dpdk environment init aborting the process due to panic, preventing the calling process from running its own tear-down actions. Thank you for working on this long standing issue. A preferred, though ABI breaking solution would be to have the environment init always return a value rather than abort upon distress. Yes, it is the preferred solution. We should not use exit (panic & co) inside a library. It is important enough to break the API. +1, panic exists mostly for historical reasons AFAIK. it's a pity i didn't think of it at the time of submitting the memory hotplug RFC, because i now hit the same issue with the v1 - we might panic while holding a lock, and didn't realize that it was an API break to change this behavior. Can this really go into current release without deprecation notices? If such an exception is done, it must be approved by the technical board. We need to check few criterias: - which functions need to be changed - how the application is impacted - what is the urgency If a panic is removed and the application is not already checking some error code, the execution will continue without considering the error. Some rte_panic could be probably removed without any impact on applications. Some rte_panic could wait for 18.08 with a notice in 18.05. If some rte_panic cannot wait, it must be discussed specifically. Can we add a compile warning for adding new rte_panic's into code? It's a nice tool while debugging, but it probably shouldn't be in any new production code. -- Thanks, Anatoly
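Two possible build-time mechanisms for that, sketched as illustrations only (neither is an agreed or existing DPDK policy):

/* 1. Mark the prototype deprecated so every call site emits a warning: */
__attribute__((deprecated("return an error code instead of calling rte_panic")))
void rte_panic(const char *format, ...);

/* 2. Or poison the identifier in an internal-only header, so any new use
 *    inside the DPDK tree fails to compile while the public API is kept: */
#pragma GCC poison rte_panic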
Re: [dpdk-dev] [PATCH] net/nfp: add port id to mbuf
On 2/22/2018 11:13 AM, Alejandro Lucero wrote: > Although this can be done by the app, because other PMDs are doing it, > apps expect this behaviour from the PMD. Although it isn't explicitly stated, I think the expectation is for the PMD to set it; the sample applications I checked in the dpdk don't set this. And setting it in the PMD, where the data is hot, can give better performance. Fixes: b812daadad0d ("nfp: add Rx and Tx") Cc: sta...@dpdk.org > Signed-off-by: Alejandro Lucero Applied to dpdk-next-net/master, thanks.
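For context, a generic sketch (not the actual nfp code, names are illustrative) of what the fix amounts to inside an RX burst function: the receive port is stamped on each mbuf while its cache line is already being written.

	/* inside the PMD's rx_pkt_burst loop, while filling the mbuf */
	mb->data_len = pkt_len;
	mb->pkt_len = pkt_len;
	mb->port = rxq->port_id;	/* the per-mbuf port id the patch adds */
	rx_pkts[nb_rx++] = mb;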
Re: [dpdk-dev] [dpdk-stable] [PATCH] net/nfp: fix barrier location
On 2/22/2018 11:30 AM, Alejandro Lucero wrote: > The barrier needs to be after reading the DD bit. It has not been > a problem because the potential reads which can not happen before > reading the DD bit seem to be far enough, so the compiler is not > rescheduling them. However, a refactoring could make this problem > to arise. > > Fixes: b812daadad0d ("nfp: add Rx and Tx") Cc: sta...@dpdk.org > > Signed-off-by: Alejandro Lucero Applied to dpdk-next-net/master, thanks. Unrelated to this patch, but the nfp driver is still missing: 1- SPDX licensing tags 2- new offloading API. Can you please plan for them for this release? Especially the second one is important because missing it may break the driver for this release. Thanks, ferruh
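A generic sketch of the ordering the fix is about; descriptor field and flag names here are illustrative, not the actual nfp definitions:

	/* poll the Descriptor Done bit first */
	if ((rxd->flags & DESC_RX_DD) == 0)
		break;			/* descriptor not ready yet */

	/* the read barrier must sit after the DD check, so the reads of the
	 * rest of the descriptor below cannot be reordered before it */
	rte_rmb();

	pkt_len = rxd->data_len;	/* now safe to read */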
Re: [dpdk-dev] [dpdk-stable] [PATCH] net/nfp: fix link speed capabilities reported
On 2/22/2018 11:57 AM, Alejandro Lucero wrote: > Mixing numeric macros with bit shifts macros is not a good idea. > > Fixes: 011411586e03 ("net/nfp: extend speed capabilities advertised") Cc: sta...@dpdk.org > Signed-off-by: Alejandro Lucero Applied to dpdk-next-net/master, thanks.
[dpdk-dev] [RFC PATCH v1 2/4] net/e1000: add TxRx tuning parameters
The optimal values of several transmission & reception related parameters, such as burst sizes, descriptor ring sizes, and number of queues, varies between different network interface devices. This patch allows individual PMDs to specify preferred parameter values. Signed-off-by: Remy Horton --- drivers/net/e1000/em_ethdev.c | 8 1 file changed, 8 insertions(+) diff --git a/drivers/net/e1000/em_ethdev.c b/drivers/net/e1000/em_ethdev.c index 242375f..e81abd1 100644 --- a/drivers/net/e1000/em_ethdev.c +++ b/drivers/net/e1000/em_ethdev.c @@ -1099,6 +1099,8 @@ static void eth_em_infos_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info) { struct e1000_hw *hw = E1000_DEV_PRIVATE_TO_HW(dev->data->dev_private); + struct rte_eth_dev_pref_queue_info *pref_q_info = + &dev_info->preferred_queue_values; dev_info->pci_dev = RTE_ETH_DEV_TO_PCI(dev); dev_info->min_rx_bufsize = 256; /* See BSIZE field of RCTL register. */ @@ -1152,6 +1154,12 @@ eth_em_infos_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info) dev_info->speed_capa = ETH_LINK_SPEED_10M_HD | ETH_LINK_SPEED_10M | ETH_LINK_SPEED_100M_HD | ETH_LINK_SPEED_100M | ETH_LINK_SPEED_1G; + + /* Preferred queue parameters */ + pref_q_info->nb_tx_queues = 1; + pref_q_info->nb_rx_queues = 1; + pref_q_info->tx_ring_size = 256; + pref_q_info->rx_ring_size = 256; } /* return 0 means link status changed, -1 means not changed */ -- 2.9.5
[dpdk-dev] [RFC PATCH v1 0/4] ethdev: add per-PMD tuning of RxTx parmeters
The optimal values of several transmission & reception related parameters, such as burst sizes, descriptor ring sizes, and number of queues, varies between different network interface devices. This patchset allows individual PMDs to specify their preferred parameter values, and if so indicated by an application, for them to be used automatically by the ethdev layer. This RFC/V1 includes per-PMD values for e1000 and i40e but it is expected that subsequent patchsets will cover other PMDs. A deprecation notice covering the API/ABI change is in place. Remy Horton (4): ethdev: add support for PMD-tuned Tx/Rx parameters net/e1000: add TxRx tuning parameters net/i40e: add TxRx tuning parameters testpmd: make use of per-PMD TxRx parameters app/test-pmd/testpmd.c | 5 +++-- drivers/net/e1000/em_ethdev.c | 8 drivers/net/i40e/i40e_ethdev.c | 35 --- lib/librte_ether/rte_ethdev.c | 18 ++ lib/librte_ether/rte_ethdev.h | 15 +++ 5 files changed, 76 insertions(+), 5 deletions(-) -- 2.9.5
[dpdk-dev] [RFC PATCH v1 1/4] ethdev: add support for PMD-tuned Tx/Rx parameters
The optimal values of several transmission & reception related parameters, such as burst sizes, descriptor ring sizes, and number of queues, varies between different network interface devices. This patch allows individual PMDs to specify preferred parameter values. Signed-off-by: Remy Horton --- lib/librte_ether/rte_ethdev.c | 18 ++ lib/librte_ether/rte_ethdev.h | 15 +++ 2 files changed, 33 insertions(+) diff --git a/lib/librte_ether/rte_ethdev.c b/lib/librte_ether/rte_ethdev.c index 0590f0c..1630407 100644 --- a/lib/librte_ether/rte_ethdev.c +++ b/lib/librte_ether/rte_ethdev.c @@ -1461,6 +1461,10 @@ rte_eth_rx_queue_setup(uint16_t port_id, uint16_t rx_queue_id, return -EINVAL; } + /* Use default specified by driver, if nb_rc_desc is zero */ + if (nb_rx_desc == 0) + nb_rx_desc = dev_info.preferred_queue_values.rx_ring_size; + if (nb_rx_desc > dev_info.rx_desc_lim.nb_max || nb_rx_desc < dev_info.rx_desc_lim.nb_min || nb_rx_desc % dev_info.rx_desc_lim.nb_align != 0) { @@ -1584,6 +1588,10 @@ rte_eth_tx_queue_setup(uint16_t port_id, uint16_t tx_queue_id, rte_eth_dev_info_get(port_id, &dev_info); + /* Use default specified by driver, if nb_tx_desc is zero */ + if (nb_tx_desc == 0) + nb_tx_desc = dev_info.preferred_queue_values.tx_ring_size; + if (nb_tx_desc > dev_info.tx_desc_lim.nb_max || nb_tx_desc < dev_info.tx_desc_lim.nb_min || nb_tx_desc % dev_info.tx_desc_lim.nb_align != 0) { @@ -2394,6 +2402,16 @@ rte_eth_dev_info_get(uint16_t port_id, struct rte_eth_dev_info *dev_info) dev_info->rx_desc_lim = lim; dev_info->tx_desc_lim = lim; + /* Defaults for drivers that don't implement preferred +* queue parameters. +*/ + dev_info->preferred_queue_values.rx_burst_size = 0; + dev_info->preferred_queue_values.tx_burst_size = 0; + dev_info->preferred_queue_values.nb_rx_queues = 1; + dev_info->preferred_queue_values.nb_tx_queues = 1; + dev_info->preferred_queue_values.rx_ring_size = 1024; + dev_info->preferred_queue_values.tx_ring_size = 1024; + RTE_FUNC_PTR_OR_RET(*dev->dev_ops->dev_infos_get); (*dev->dev_ops->dev_infos_get)(dev, dev_info); dev_info->driver_name = dev->device->driver->name; diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h index 0361533..67ce82d 100644 --- a/lib/librte_ether/rte_ethdev.h +++ b/lib/librte_ether/rte_ethdev.h @@ -988,6 +988,18 @@ struct rte_eth_conf { struct rte_pci_device; +/* + * Preferred queue parameters. + */ +struct rte_eth_dev_pref_queue_info { + uint16_t rx_burst_size; + uint16_t tx_burst_size; + uint16_t rx_ring_size; + uint16_t tx_ring_size; + uint16_t nb_rx_queues; + uint16_t nb_tx_queues; +}; + /** * Ethernet device information */ @@ -1029,6 +1041,9 @@ struct rte_eth_dev_info { /** Configured number of rx/tx queues */ uint16_t nb_rx_queues; /**< Number of RX queues. */ uint16_t nb_tx_queues; /**< Number of TX queues. */ + + /** Queue size recommendations */ + struct rte_eth_dev_pref_queue_info preferred_queue_values; }; /** -- 2.9.5
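A short usage sketch for the API above (mirroring what patch 4/4 does in testpmd): an application with no ring-size preference passes 0 descriptors and lets the ethdev layer substitute the PMD's preferred value. mb_pool is assumed to be an already-created mempool.

	struct rte_eth_dev_info dev_info;
	int ret;

	rte_eth_dev_info_get(port_id, &dev_info);

	/* nb_rx_desc == 0: rte_eth_rx_queue_setup() falls back to
	 * dev_info.preferred_queue_values.rx_ring_size */
	ret = rte_eth_rx_queue_setup(port_id, 0 /* queue id */, 0 /* nb_rx_desc */,
				     rte_eth_dev_socket_id(port_id), NULL, mb_pool);
	if (ret < 0)
		rte_exit(EXIT_FAILURE, "rx queue setup failed\n");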
[dpdk-dev] [RFC PATCH v1 3/4] net/i40e: add TxRx tuning parameters
The optimal values of several transmission & reception related parameters, such as burst sizes, descriptor ring sizes, and number of queues, varies between different network interface devices. This patch allows individual PMDs to specify preferred parameter values. Signed-off-by: Remy Horton --- drivers/net/i40e/i40e_ethdev.c | 35 --- 1 file changed, 32 insertions(+), 3 deletions(-) diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c index 508b417..4bcd05e 100644 --- a/drivers/net/i40e/i40e_ethdev.c +++ b/drivers/net/i40e/i40e_ethdev.c @@ -3168,6 +3168,7 @@ i40e_dev_info_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info) struct i40e_hw *hw = I40E_DEV_PRIVATE_TO_HW(dev->data->dev_private); struct i40e_vsi *vsi = pf->main_vsi; struct rte_pci_device *pci_dev = RTE_ETH_DEV_TO_PCI(dev); + struct rte_eth_dev_pref_queue_info *pref_q_info; dev_info->pci_dev = pci_dev; dev_info->max_rx_queues = vsi->nb_qps; @@ -3248,15 +3249,43 @@ i40e_dev_info_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info) dev_info->max_tx_queues += dev_info->vmdq_queue_num; } - if (I40E_PHY_TYPE_SUPPORT_40G(hw->phy.phy_types)) + pref_q_info = &dev_info->preferred_queue_values; + if (I40E_PHY_TYPE_SUPPORT_40G(hw->phy.phy_types)) { /* For XL710 */ dev_info->speed_capa = ETH_LINK_SPEED_40G; - else if (I40E_PHY_TYPE_SUPPORT_25G(hw->phy.phy_types)) + pref_q_info->nb_tx_queues = 2; + pref_q_info->nb_rx_queues = 2; + if (dev->data->nb_rx_queues == 1) + pref_q_info->rx_ring_size = 2048; + else + pref_q_info->rx_ring_size = 1024; + if (dev->data->nb_tx_queues == 1) + pref_q_info->tx_ring_size = 1024; + else + pref_q_info->tx_ring_size = 512; + + } else if (I40E_PHY_TYPE_SUPPORT_25G(hw->phy.phy_types)) { /* For XXV710 */ dev_info->speed_capa = ETH_LINK_SPEED_25G; - else + pref_q_info->nb_tx_queues = 1; + pref_q_info->nb_rx_queues = 1; + pref_q_info->rx_ring_size = 256; + pref_q_info->tx_ring_size = 256; + } else { /* For X710 */ dev_info->speed_capa = ETH_LINK_SPEED_1G | ETH_LINK_SPEED_10G; + pref_q_info->nb_tx_queues = 1; + pref_q_info->nb_rx_queues = 1; + if (dev->data->dev_conf.link_speeds & ETH_LINK_SPEED_10G) { + pref_q_info->rx_ring_size = 512; + pref_q_info->tx_ring_size = 256; + } else { + pref_q_info->rx_ring_size = 256; + pref_q_info->tx_ring_size = 256; + } + } + pref_q_info->tx_burst_size = 32; + pref_q_info->rx_burst_size = 32; } static int -- 2.9.5
[dpdk-dev] [RFC PATCH v1 4/4] testpmd: make use of per-PMD TxRx parameters
The optimal values of several transmission & reception related parameters, such as burst sizes, descriptor ring sizes, and number of queues, varies between different network interface devices. This patch allows testpmd to make use of per-PMD tuned parameter values. Signed-off-by: Remy Horton --- app/test-pmd/testpmd.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c index 4c0e258..82eb197 100644 --- a/app/test-pmd/testpmd.c +++ b/app/test-pmd/testpmd.c @@ -210,9 +210,10 @@ queueid_t nb_txq = 1; /**< Number of TX queues per port. */ /* * Configurable number of RX/TX ring descriptors. + * Defaults are supplied by drivers via ethdev. */ -#define RTE_TEST_RX_DESC_DEFAULT 1024 -#define RTE_TEST_TX_DESC_DEFAULT 1024 +#define RTE_TEST_RX_DESC_DEFAULT 0 +#define RTE_TEST_TX_DESC_DEFAULT 0 uint16_t nb_rxd = RTE_TEST_RX_DESC_DEFAULT; /**< Number of RX descriptors. */ uint16_t nb_txd = RTE_TEST_TX_DESC_DEFAULT; /**< Number of TX descriptors. */ -- 2.9.5
Re: [dpdk-dev] [RFC v1 1/1] lib/cryptodev: add support of asymmetric crypto
Hi Fiona >-Original Message- >From: Trahe, Fiona [mailto:fiona.tr...@intel.com] >Sent: 09 February 2018 23:43 >To: dev@dpdk.org; Athreya, Narayana Prasad >; Murthy, Nidadavolu >; Sahu, Sunila ; Gupta, >Ashish ; Verma, >Shally ; Doherty, Declan ; >Keating, Brian A ; >Griffin, John >Cc: Trahe, Fiona ; De Lara Guarch, Pablo > >Subject: RE: [dpdk-dev] [RFC v1 1/1] lib/cryptodev: add support of asymmetric >crypto > >Hi Shally, >Comments below. > >> -Original Message- >> From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Shally Verma >> Sent: Tuesday, January 23, 2018 9:54 AM >> To: Doherty, Declan >> Cc: dev@dpdk.org; pathr...@caviumnetworks.com; nmur...@caviumnetworks.com; >> ss...@caviumnetworks.com; agu...@caviumnetworks.com; Shally Verma >> >> Subject: [dpdk-dev] [RFC v1 1/1] lib/cryptodev: add support of asymmetric >> crypto >> //snip > >> +RTE_CRYPTO_ASYM_XFORM_FECC, >> +/**< Fundamental Elliptic curve operations. >> + * Perform elliptic curve operation: >> + * add, double point, multiplication >> + * Refer to enum rte_crypto_fecc_optype >> + */ >> +RTE_CRYPTO_ASYM_XFORM_MODINV, >> +/**< Modular Inverse */ >[Fiona] would be nicer to group modinv with modexp [Shally] I thought of it but having a xform RTE_CRYPTO_XFORM_MOD with two ops RTE_CRYPTO_OP_MOD_EXP and MOD_INV doesn’t seem to provide any benefit. Or do you have something else in mind? In addition, I am thinking probably we don’t need sessions for modexp or modinv ops. I am thinking to change their support as sessionless only. App can directly attach xform to compute modexp or inverse to op. What do you suggest? > >> +RTE_CRYPTO_ASYM_XFORM_TYPE_LIST_END >> +/**< End of list */ >> +}; >> + >> +/** >> + * Asymmetric cryptogr operation type variants >[Fiona] typo: Use crypto or cryptographic > >> + */ >> +enum rte_crypto_asym_op_type { >> +RTE_CRYPTO_ASYM_OP_NOT_SPECIFIED = 1, >[Fiona] Why does this start at 1? >And is it necessary? > [Shally] We need to indicate list of supported op in xform capability structure. Because an implementation may support RSA encrypt and decrypt but not RSA Sign and verify. Or, Can support DSA Sign compute but not verify. So, it was added to indicate end-of-array marker (though doesn’t need to be 1 for that reason). but now when I think to re-design its support, then it won't be needed. So, I thought rather than carrying op_type array, I can add an op_type bitmask in xform capability to show supported ops. Example capability check code then would look like: int rte_crypto_asym_check_op_type ( const rte_crypto_asym_capabilties *capa, int op_type) { If(capa->op_types & (1 << op_type)) return 0; return -1; } Please let me know your feedback, if you have any preferences here. >> +/**< Operation unspecified */ >> +RTE_CRYPTO_ASYM_OP_ENCRYPT, >> +/**< Asymmetric encrypt operation */ >> +RTE_CRYPTO_ASYM_OP_DECRYPT, >> +/**< Asymmetric Decrypt operation */ >> +RTE_CRYPTO_ASYM_OP_SIGN, >> +/**< Signature generation operation */ >> +RTE_CRYPTO_ASYM_OP_VERIFY, >> +/**< Signature verification operation */ >> +RTE_CRYPTO_ASYM_OP_KEY_PAIR_GENERATION, >> +/**< Public/Private key pair generation operation */ >[Fiona] In the comment, clarify that this is for DH and ECDH, and for the > generation of the public key (and optionally the private key?) > [Shally] so far, I was assuming it will generate both but when you say private key optional, where you expect it to be coming from? - from app or generated internally? Is their hw variant which may not generate private key? 
//snip >> +/** >> + * Fundamental ECC operation type variants. >> + */ >> +enum rte_crypto_fecc_optype { >> +RTE_CRYPTO_FECC_OP_NOT_SPECIFIED = 1, >> +/**< FECC operation type unspecified */ >[Fiona] as above. Why 1? And is it needed? [Shally] This is for same reason to indicate in fecc xform capability list of supported op type in fundamental EC operation. And if we agree to use proposal above to use bitmask, it won't be needed. > >> +RTE_CRYPTO_FECC_OP_POINT_ADD, >> +/**< Fundamental ECC point addition operation */ >> +RTE_CRYPTO_FECC_OP_POINT_DBL, >> +/**< Fundamental ECC point doubling operation */ >> +RTE_CRYPTO_FECC_OP_POINT_MULTIPLY, >> +/**< Fundamental ECC point multiplication operation */ >> +RTE_CRYPTO_FECC_OP_LIST_END >> +}; >> + >> +#define RTE_CRYPTO_EC_CURVE_NOT_SPECIFIED -1 >[Fiona] Wouldn't it be better to put this back in as the initial value in the >enum as originally done? >Else will there not be a compiler warning if a param of that enum type is >initialised to above #define? >And are _BINARY and _PRIME values needed in this case? [Shally] Agreed. We would need to use typecast to avoid warning so I will revert and define _Primary and Binary variant. But before that I have one question on published list. See below. > >> +/** >> + * ECC list o
Re: [dpdk-dev] [PATCH v2] net/qede: fix alloc from socket 0
On 2/26/2018 6:38 PM, Patil, Harish wrote: > -Original Message- > From: Pascal Mazon > Date: Monday, February 26, 2018 at 12:01 AM > To: "dev@dpdk.org" , "Mody, Rasesh" > , Harish Patil , "Shaikh, > Shahed" > Cc: "pascal.ma...@6wind.com" , "sta...@dpdk.org" > > Subject: [PATCH v2] net/qede: fix alloc from socket 0 > >> In case osal_dma_alloc_coherent() or osal_dma_alloc_coherent_aligned() are >> called from a management thread, core_id turn out to be LCORE_ID_ANY, and >> the resulting socket for alloc will be socket 0. >> >> This is not desirable when using a NIC from socket 1 which might very >> likely be configured to use memory from that socket only. >> In that case, allocation will fail. >> >> To address this, use master lcore instead when called from mgmt thread. >> The associated socket should have memory available. >> >> Fixes: ec94dbc57362 ("qede: add base driver") >> Cc: sta...@dpdk.org >> >> Signed-off-by: Pascal Mazon >> Acked-by: Harish Patil > Acked-by: Harish Patil Applied to dpdk-next-net/master, thanks.
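A sketch of the socket selection described in the fix above (variable names illustrative, not the qede code): a management thread has rte_lcore_id() == LCORE_ID_ANY, which would map allocations to socket 0, so the master lcore's socket is used instead.

	unsigned int core_id = rte_lcore_id();
	unsigned int socket_id;

	if (core_id == (unsigned int)LCORE_ID_ANY)
		core_id = rte_get_master_lcore();

	socket_id = rte_lcore_to_socket_id(core_id);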
Re: [dpdk-dev] [PATCH] eal: register rte_panic user callback
> > Can we add a compile warning for adding new rte_panic's into code? It's a > nice tool while debugging, but it probably shouldn't be in any new > production code. > I thought about renaming the current function and calls to something like deprecated_rte_panic(), and keeping the old API with __rte_deprecated. Is this kind of API break acceptable?
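A sketch of what that renaming could look like (only the names suggested in the mail; nothing here is an agreed API):

/* renamed implementation, used internally while call paths are converted */
void deprecated_rte_panic(const char *format, ...);

/* the old public symbol is kept, but marked so remaining or new users get a
 * build-time warning until all call paths return error codes instead */
__rte_deprecated
void rte_panic(const char *format, ...);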
[dpdk-dev] [PATCH v2 0/4] ixgbe: convert to new offloads API
This patch set adds support for per-queue VLAN strip offloading in ixgbe PF and VF, and converts the ixgbe PF and VF to the new offloads API. --- v2: improve error checking Wei Dai (4): net/ixgbe: support VLAN strip per queue offloading in PF net/ixgbe: support VLAN strip per queue offloading in VF net/ixgbe: convert to new Rx offloads API net/ixgbe: convert to new Tx offloads API drivers/net/ixgbe/ixgbe_ethdev.c | 264 ++ drivers/net/ixgbe/ixgbe_ethdev.h | 4 +- drivers/net/ixgbe/ixgbe_ipsec.c | 13 +- drivers/net/ixgbe/ixgbe_pf.c | 5 +- drivers/net/ixgbe/ixgbe_rxtx.c| 245 --- drivers/net/ixgbe/ixgbe_rxtx.h| 13 ++ drivers/net/ixgbe/ixgbe_rxtx_vec_common.h | 2 +- drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c | 2 +- 8 files changed, 376 insertions(+), 172 deletions(-) -- 2.7.5
[dpdk-dev] [PATCH v2 4/4] net/ixgbe: convert to new Tx offloads API
Ethdev Tx offloads API has changed since: commit cba7f53b717d ("ethdev: introduce Tx queue offloads API") This commit support the new Tx offloads API. Signed-off-by: Wei Dai --- drivers/net/ixgbe/ixgbe_ethdev.c | 56 +-- drivers/net/ixgbe/ixgbe_ipsec.c | 5 ++- drivers/net/ixgbe/ixgbe_rxtx.c | 81 ++-- drivers/net/ixgbe/ixgbe_rxtx.h | 9 + 4 files changed, 116 insertions(+), 35 deletions(-) diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c index 9437f05..6288690 100644 --- a/drivers/net/ixgbe/ixgbe_ethdev.c +++ b/drivers/net/ixgbe/ixgbe_ethdev.c @@ -2337,6 +2337,7 @@ ixgbe_dev_configure(struct rte_eth_dev *dev) (struct ixgbe_adapter *)dev->data->dev_private; struct rte_eth_dev_info dev_info; uint64_t rx_offloads; + uint64_t tx_offloads; int ret; PMD_INIT_FUNC_TRACE(); @@ -2356,6 +2357,13 @@ ixgbe_dev_configure(struct rte_eth_dev *dev) rx_offloads, dev_info.rx_offload_capa); return -ENOTSUP; } + tx_offloads = dev->data->dev_conf.txmode.offloads; + if ((tx_offloads & dev_info.tx_offload_capa) != tx_offloads) { + PMD_DRV_LOG(ERR, "Some Tx offloads are not supported " + "requested 0x%" PRIx64 " supported 0x%" PRIx64, + tx_offloads, dev_info.tx_offload_capa); + return -ENOTSUP; + } /* set flag to update link status after init */ intr->flags |= IXGBE_FLAG_NEED_LINK_UPDATE; @@ -3649,28 +3657,8 @@ ixgbe_dev_info_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info) dev_info->rx_queue_offload_capa = ixgbe_get_rx_queue_offloads(dev); dev_info->rx_offload_capa = (ixgbe_get_rx_port_offloads(dev) | dev_info->rx_queue_offload_capa); - - dev_info->tx_offload_capa = - DEV_TX_OFFLOAD_VLAN_INSERT | - DEV_TX_OFFLOAD_IPV4_CKSUM | - DEV_TX_OFFLOAD_UDP_CKSUM | - DEV_TX_OFFLOAD_TCP_CKSUM | - DEV_TX_OFFLOAD_SCTP_CKSUM | - DEV_TX_OFFLOAD_TCP_TSO; - - if (hw->mac.type == ixgbe_mac_82599EB || - hw->mac.type == ixgbe_mac_X540) - dev_info->tx_offload_capa |= DEV_TX_OFFLOAD_MACSEC_INSERT; - - if (hw->mac.type == ixgbe_mac_X550 || - hw->mac.type == ixgbe_mac_X550EM_x || - hw->mac.type == ixgbe_mac_X550EM_a) - dev_info->tx_offload_capa |= DEV_TX_OFFLOAD_OUTER_IPV4_CKSUM; - -#ifdef RTE_LIBRTE_SECURITY - if (dev->security_ctx) - dev_info->tx_offload_capa |= DEV_TX_OFFLOAD_SECURITY; -#endif + dev_info->tx_queue_offload_capa = ixgbe_get_tx_queue_offloads(dev); + dev_info->tx_offload_capa = ixgbe_get_tx_port_offloads(dev); dev_info->default_rxconf = (struct rte_eth_rxconf) { .rx_thresh = { @@ -3692,7 +3680,9 @@ ixgbe_dev_info_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info) .tx_free_thresh = IXGBE_DEFAULT_TX_FREE_THRESH, .tx_rs_thresh = IXGBE_DEFAULT_TX_RSBIT_THRESH, .txq_flags = ETH_TXQ_FLAGS_NOMULTSEGS | - ETH_TXQ_FLAGS_NOOFFLOADS, +ETH_TXQ_FLAGS_NOOFFLOADS | +ETH_TXQ_FLAGS_IGNORE, + .offloads = 0, }; dev_info->rx_desc_lim = rx_desc_lim; @@ -3776,12 +3766,8 @@ ixgbevf_dev_info_get(struct rte_eth_dev *dev, dev_info->rx_queue_offload_capa = ixgbe_get_rx_queue_offloads(dev); dev_info->rx_offload_capa = (ixgbe_get_rx_port_offloads(dev) | dev_info->rx_queue_offload_capa); - dev_info->tx_offload_capa = DEV_TX_OFFLOAD_VLAN_INSERT | - DEV_TX_OFFLOAD_IPV4_CKSUM | - DEV_TX_OFFLOAD_UDP_CKSUM | - DEV_TX_OFFLOAD_TCP_CKSUM | - DEV_TX_OFFLOAD_SCTP_CKSUM | - DEV_TX_OFFLOAD_TCP_TSO; + dev_info->tx_queue_offload_capa = ixgbe_get_tx_queue_offloads(dev); + dev_info->tx_offload_capa = ixgbe_get_tx_port_offloads(dev); dev_info->default_rxconf = (struct rte_eth_rxconf) { .rx_thresh = { @@ -3803,7 +3789,9 @@ ixgbevf_dev_info_get(struct rte_eth_dev *dev, .tx_free_thresh = 
IXGBE_DEFAULT_TX_FREE_THRESH, .tx_rs_thresh = IXGBE_DEFAULT_TX_RSBIT_THRESH, .txq_flags = ETH_TXQ_FLAGS_NOMULTSEGS | - ETH_TXQ_FLAGS_NOOFFLOADS, +ETH_TXQ_FLAGS_NOOFFLOADS | +ETH_TXQ_FLAGS_IGNORE, + .offloads = 0, }; dev_info->rx_desc_lim = rx_desc_lim; @@ -4941,6 +4929,7 @@ ixgbevf_dev_configure(struct
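From the application side, the conversion above means Tx offloads are requested through txmode.offloads and validated against tx_offload_capa, roughly as below (a hedged sketch of a caller, not part of the patch; the function name and offload selection are illustrative):

#include <string.h>
#include <errno.h>
#include <rte_ethdev.h>

/* Configure a port with Tx checksum offloads using the new offloads API,
 * checking the request against the advertised capabilities first, the
 * same way ixgbe_dev_configure() now does on the driver side.
 */
static int
configure_port_tx_offloads(uint16_t port_id, uint16_t nb_rxq, uint16_t nb_txq)
{
	struct rte_eth_dev_info dev_info;
	struct rte_eth_conf port_conf;

	memset(&port_conf, 0, sizeof(port_conf));
	rte_eth_dev_info_get(port_id, &dev_info);

	port_conf.txmode.offloads = DEV_TX_OFFLOAD_IPV4_CKSUM |
				    DEV_TX_OFFLOAD_TCP_CKSUM;

	/* Only request what the PMD advertises, otherwise configure fails. */
	if ((port_conf.txmode.offloads & dev_info.tx_offload_capa) !=
			port_conf.txmode.offloads)
		return -ENOTSUP;

	return rte_eth_dev_configure(port_id, nb_rxq, nb_txq, &port_conf);
}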
[dpdk-dev] [PATCH v2 2/4] net/ixgbe: support VLAN strip per queue offloading in VF
VLAN strip is a per queue offloading in VF. With this patch it can be enabled or disabled on any Rx queue in VF. Signed-off-by: Wei Dai --- drivers/net/ixgbe/ixgbe_ethdev.c | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c index 73755d2..8bb67ba 100644 --- a/drivers/net/ixgbe/ixgbe_ethdev.c +++ b/drivers/net/ixgbe/ixgbe_ethdev.c @@ -5215,15 +5215,17 @@ ixgbevf_vlan_offload_set(struct rte_eth_dev *dev, int mask) { struct ixgbe_hw *hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private); + struct ixgbe_rx_queue *rxq; uint16_t i; int on = 0; /* VF function only support hw strip feature, others are not support */ if (mask & ETH_VLAN_STRIP_MASK) { - on = !!(dev->data->dev_conf.rxmode.hw_vlan_strip); - - for (i = 0; i < hw->mac.max_rx_queues; i++) + for (i = 0; i < hw->mac.max_rx_queues; i++) { + rxq = dev->data->rx_queues[i]; + on = !!(rxq->offloads & DEV_RX_OFFLOAD_VLAN_STRIP); ixgbevf_vlan_strip_queue_set(dev, i, on); + } } return 0; -- 2.7.5
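With per-queue VLAN strip now honoured from rxq->offloads, an application can request it for an individual Rx queue along these lines (a minimal usage sketch under the new API; the helper name and parameters are illustrative):

#include <rte_ethdev.h>
#include <rte_mempool.h>

/* Set up one Rx queue with VLAN stripping enabled via the per-queue
 * offloads field. Assumes the port is already configured and that
 * DEV_RX_OFFLOAD_VLAN_STRIP appears in dev_info.rx_queue_offload_capa.
 */
static int
setup_rxq_with_vlan_strip(uint16_t port_id, uint16_t queue_id,
		uint16_t nb_desc, struct rte_mempool *mp)
{
	struct rte_eth_dev_info dev_info;
	struct rte_eth_rxconf rxconf;

	rte_eth_dev_info_get(port_id, &dev_info);
	rxconf = dev_info.default_rxconf;
	rxconf.offloads |= DEV_RX_OFFLOAD_VLAN_STRIP;

	return rte_eth_rx_queue_setup(port_id, queue_id, nb_desc,
			rte_eth_dev_socket_id(port_id), &rxconf, mp);
}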
[dpdk-dev] [PATCH v2 3/4] net/ixgbe: convert to new Rx offloads API
Ethdev Rx offloads API has changed since: commit ce17eddefc20 ("ethdev: introduce Rx queue offloads API") This commit support the new Rx offloads API. Signed-off-by: Wei Dai --- drivers/net/ixgbe/ixgbe_ethdev.c | 93 + drivers/net/ixgbe/ixgbe_ipsec.c | 8 +- drivers/net/ixgbe/ixgbe_rxtx.c| 163 ++ drivers/net/ixgbe/ixgbe_rxtx.h| 3 + drivers/net/ixgbe/ixgbe_rxtx_vec_common.h | 2 +- drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c | 2 +- 6 files changed, 205 insertions(+), 66 deletions(-) diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c index 8bb67ba..9437f05 100644 --- a/drivers/net/ixgbe/ixgbe_ethdev.c +++ b/drivers/net/ixgbe/ixgbe_ethdev.c @@ -2105,19 +2105,22 @@ ixgbe_vlan_hw_strip_config(struct rte_eth_dev *dev) static int ixgbe_vlan_offload_set(struct rte_eth_dev *dev, int mask) { + struct rte_eth_rxmode *rxmode; + rxmode = &dev->data->dev_conf.rxmode; + if (mask & ETH_VLAN_STRIP_MASK) { ixgbe_vlan_hw_strip_config(dev); } if (mask & ETH_VLAN_FILTER_MASK) { - if (dev->data->dev_conf.rxmode.hw_vlan_filter) + if (rxmode->offloads & DEV_RX_OFFLOAD_VLAN_FILTER) ixgbe_vlan_hw_filter_enable(dev); else ixgbe_vlan_hw_filter_disable(dev); } if (mask & ETH_VLAN_EXTEND_MASK) { - if (dev->data->dev_conf.rxmode.hw_vlan_extend) + if (rxmode->offloads & DEV_RX_OFFLOAD_VLAN_EXTEND) ixgbe_vlan_hw_extend_enable(dev); else ixgbe_vlan_hw_extend_disable(dev); @@ -2332,6 +2335,8 @@ ixgbe_dev_configure(struct rte_eth_dev *dev) IXGBE_DEV_PRIVATE_TO_INTR(dev->data->dev_private); struct ixgbe_adapter *adapter = (struct ixgbe_adapter *)dev->data->dev_private; + struct rte_eth_dev_info dev_info; + uint64_t rx_offloads; int ret; PMD_INIT_FUNC_TRACE(); @@ -2343,6 +2348,15 @@ ixgbe_dev_configure(struct rte_eth_dev *dev) return ret; } + ixgbe_dev_info_get(dev, &dev_info); + rx_offloads = dev->data->dev_conf.rxmode.offloads; + if ((rx_offloads & dev_info.rx_offload_capa) != rx_offloads) { + PMD_DRV_LOG(ERR, "Some Rx offloads are not supported " + "requested 0x%" PRIx64 " supported 0x%" PRIx64, + rx_offloads, dev_info.rx_offload_capa); + return -ENOTSUP; + } + /* set flag to update link status after init */ intr->flags |= IXGBE_FLAG_NEED_LINK_UPDATE; @@ -3632,30 +3646,9 @@ ixgbe_dev_info_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info) else dev_info->max_vmdq_pools = ETH_64_POOLS; dev_info->vmdq_queue_num = dev_info->max_rx_queues; - dev_info->rx_offload_capa = - DEV_RX_OFFLOAD_VLAN_STRIP | - DEV_RX_OFFLOAD_IPV4_CKSUM | - DEV_RX_OFFLOAD_UDP_CKSUM | - DEV_RX_OFFLOAD_TCP_CKSUM | - DEV_RX_OFFLOAD_CRC_STRIP; - - /* -* RSC is only supported by 82599 and x540 PF devices in a non-SR-IOV -* mode. 
-*/ - if ((hw->mac.type == ixgbe_mac_82599EB || -hw->mac.type == ixgbe_mac_X540) && - !RTE_ETH_DEV_SRIOV(dev).active) - dev_info->rx_offload_capa |= DEV_RX_OFFLOAD_TCP_LRO; - - if (hw->mac.type == ixgbe_mac_82599EB || - hw->mac.type == ixgbe_mac_X540) - dev_info->rx_offload_capa |= DEV_RX_OFFLOAD_MACSEC_STRIP; - - if (hw->mac.type == ixgbe_mac_X550 || - hw->mac.type == ixgbe_mac_X550EM_x || - hw->mac.type == ixgbe_mac_X550EM_a) - dev_info->rx_offload_capa |= DEV_RX_OFFLOAD_OUTER_IPV4_CKSUM; + dev_info->rx_queue_offload_capa = ixgbe_get_rx_queue_offloads(dev); + dev_info->rx_offload_capa = (ixgbe_get_rx_port_offloads(dev) | +dev_info->rx_queue_offload_capa); dev_info->tx_offload_capa = DEV_TX_OFFLOAD_VLAN_INSERT | @@ -3675,10 +3668,8 @@ ixgbe_dev_info_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info) dev_info->tx_offload_capa |= DEV_TX_OFFLOAD_OUTER_IPV4_CKSUM; #ifdef RTE_LIBRTE_SECURITY - if (dev->security_ctx) { - dev_info->rx_offload_capa |= DEV_RX_OFFLOAD_SECURITY; + if (dev->security_ctx) dev_info->tx_offload_capa |= DEV_TX_OFFLOAD_SECURITY; - } #endif dev_info->default_rxconf = (struct rte_eth_rxconf) { @@ -3689,6 +3680,7 @@ ixgbe_dev_info_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info) }, .rx_free_thresh = IXGBE_DEFAULT_RX_FREE_THRESH, .rx_drop_en = 0, +
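A side effect of splitting capabilities into rx_offload_capa and rx_queue_offload_capa, as done above, is that an application can tell purely per-port Rx offloads apart from those that can also be toggled per queue; a small hedged sketch:

#include <rte_ethdev.h>

/* Offloads advertised for the port but not per queue can only be set in
 * rxmode.offloads and then apply to every Rx queue; per-queue-capable
 * ones may additionally be set in rte_eth_rxconf.offloads.
 */
static uint64_t
rx_port_only_offloads(uint16_t port_id)
{
	struct rte_eth_dev_info dev_info;

	rte_eth_dev_info_get(port_id, &dev_info);
	return dev_info.rx_offload_capa & ~dev_info.rx_queue_offload_capa;
}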
[dpdk-dev] [PATCH v2 1/4] net/ixgbe: support VLAN strip per queue offloading in PF
VLAN strip is a per queue offloading in PF. With this patch it can be enabled or disabled on any Rx queue in PF. Signed-off-by: Wei Dai --- drivers/net/ixgbe/ixgbe_ethdev.c | 109 +-- drivers/net/ixgbe/ixgbe_ethdev.h | 4 +- drivers/net/ixgbe/ixgbe_pf.c | 5 +- drivers/net/ixgbe/ixgbe_rxtx.c | 1 + drivers/net/ixgbe/ixgbe_rxtx.h | 1 + 5 files changed, 51 insertions(+), 69 deletions(-) diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c index 4483258..73755d2 100644 --- a/drivers/net/ixgbe/ixgbe_ethdev.c +++ b/drivers/net/ixgbe/ixgbe_ethdev.c @@ -2001,64 +2001,6 @@ ixgbe_vlan_hw_strip_enable(struct rte_eth_dev *dev, uint16_t queue) ixgbe_vlan_hw_strip_bitmap_set(dev, queue, 1); } -void -ixgbe_vlan_hw_strip_disable_all(struct rte_eth_dev *dev) -{ - struct ixgbe_hw *hw = - IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private); - uint32_t ctrl; - uint16_t i; - struct ixgbe_rx_queue *rxq; - - PMD_INIT_FUNC_TRACE(); - - if (hw->mac.type == ixgbe_mac_82598EB) { - ctrl = IXGBE_READ_REG(hw, IXGBE_VLNCTRL); - ctrl &= ~IXGBE_VLNCTRL_VME; - IXGBE_WRITE_REG(hw, IXGBE_VLNCTRL, ctrl); - } else { - /* Other 10G NIC, the VLAN strip can be setup per queue in RXDCTL */ - for (i = 0; i < dev->data->nb_rx_queues; i++) { - rxq = dev->data->rx_queues[i]; - ctrl = IXGBE_READ_REG(hw, IXGBE_RXDCTL(rxq->reg_idx)); - ctrl &= ~IXGBE_RXDCTL_VME; - IXGBE_WRITE_REG(hw, IXGBE_RXDCTL(rxq->reg_idx), ctrl); - - /* record those setting for HW strip per queue */ - ixgbe_vlan_hw_strip_bitmap_set(dev, i, 0); - } - } -} - -void -ixgbe_vlan_hw_strip_enable_all(struct rte_eth_dev *dev) -{ - struct ixgbe_hw *hw = - IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private); - uint32_t ctrl; - uint16_t i; - struct ixgbe_rx_queue *rxq; - - PMD_INIT_FUNC_TRACE(); - - if (hw->mac.type == ixgbe_mac_82598EB) { - ctrl = IXGBE_READ_REG(hw, IXGBE_VLNCTRL); - ctrl |= IXGBE_VLNCTRL_VME; - IXGBE_WRITE_REG(hw, IXGBE_VLNCTRL, ctrl); - } else { - /* Other 10G NIC, the VLAN strip can be setup per queue in RXDCTL */ - for (i = 0; i < dev->data->nb_rx_queues; i++) { - rxq = dev->data->rx_queues[i]; - ctrl = IXGBE_READ_REG(hw, IXGBE_RXDCTL(rxq->reg_idx)); - ctrl |= IXGBE_RXDCTL_VME; - IXGBE_WRITE_REG(hw, IXGBE_RXDCTL(rxq->reg_idx), ctrl); - - /* record those setting for HW strip per queue */ - ixgbe_vlan_hw_strip_bitmap_set(dev, i, 1); - } - } -} - static void ixgbe_vlan_hw_extend_disable(struct rte_eth_dev *dev) { @@ -2114,14 +2056,57 @@ ixgbe_vlan_hw_extend_enable(struct rte_eth_dev *dev) */ } +void +ixgbe_vlan_hw_strip_config(struct rte_eth_dev *dev) +{ + struct ixgbe_hw *hw = + IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private); + struct rte_eth_rxmode *rxmode = &dev->data->dev_conf.rxmode; + uint32_t ctrl; + uint16_t i; + struct ixgbe_rx_queue *rxq; + bool on; + + PMD_INIT_FUNC_TRACE(); + + if (hw->mac.type == ixgbe_mac_82598EB) { + if (rxmode->offloads & DEV_RX_OFFLOAD_VLAN_STRIP) { + ctrl = IXGBE_READ_REG(hw, IXGBE_VLNCTRL); + ctrl |= IXGBE_VLNCTRL_VME; + IXGBE_WRITE_REG(hw, IXGBE_VLNCTRL, ctrl); + } else { + ctrl = IXGBE_READ_REG(hw, IXGBE_VLNCTRL); + ctrl &= ~IXGBE_VLNCTRL_VME; + IXGBE_WRITE_REG(hw, IXGBE_VLNCTRL, ctrl); + } + } else { + /* +* Other 10G NIC, the VLAN strip can be setup +* per queue in RXDCTL +*/ + for (i = 0; i < dev->data->nb_rx_queues; i++) { + rxq = dev->data->rx_queues[i]; + ctrl = IXGBE_READ_REG(hw, IXGBE_RXDCTL(rxq->reg_idx)); + if (rxq->offloads & DEV_RX_OFFLOAD_VLAN_STRIP) { + ctrl |= IXGBE_RXDCTL_VME; + on = TRUE; + } else { + ctrl &= ~IXGBE_RXDCTL_VME; + on = FALSE; + } + IXGBE_WRITE_REG(hw, 
IXGBE_RXDCTL(rxq->reg_idx), ctrl); + + /* record those setting for HW strip per queue */ + ixgbe_vlan_hw_strip_bitmap_set(dev, i, on); + } + } +} + static int ixgbe_vlan_offload_set(
Re: [dpdk-dev] [PATCH] doc: fixing grammar
> -Original Message- > From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Alejandro Lucero > Sent: Thursday, February 22, 2018 12:16 PM > To: dev@dpdk.org > Cc: sta...@dpdk.org > Subject: [dpdk-dev] [PATCH] doc: fixing grammar > > My english is far worse than those from the marketing team. > > Signed-off-by: Alejandro Lucero > --- > doc/guides/nics/nfp.rst | 43 ++- > 1 file changed, 22 insertions(+), 21 deletions(-) > <...> Acked-by: Marko Kovacevic
Re: [dpdk-dev] [PATCH 2/4] bus/vdev: bus scan by multi-process channel
On 04-Mar-18 3:30 PM, Jianfeng Tan wrote: To scan the vdevs in primary, we send a request to the primary process to obtain the names of the vdevs. Only the name is shared from the primary. In probe(), the device driver is supposed to locate (or request more of) the detailed information from the primary. Signed-off-by: Jianfeng Tan --- General note - you probably want to synchronize access to the tailq. Multiple secondaries may initialize, a vdev hotplug event may be in progress, etc. -- Thanks, Anatoly
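As an illustration of the kind of locking suggested here, a tailq shared between the IPC handler and hotplug paths could be guarded with a spinlock along these lines (purely a sketch; the structure and names are placeholders, not the actual vdev code):

#include <sys/queue.h>
#include <rte_spinlock.h>

struct vdev_name_entry {
	TAILQ_ENTRY(vdev_name_entry) next;
	char name[64];
};

TAILQ_HEAD(vdev_name_list, vdev_name_entry);

static struct vdev_name_list vdev_names = TAILQ_HEAD_INITIALIZER(vdev_names);
static rte_spinlock_t vdev_names_lock = RTE_SPINLOCK_INITIALIZER;

/* Every insertion/removal/walk of the shared list takes the lock, so a
 * secondary process initializing concurrently with a hotplug event
 * cannot corrupt the tailq.
 */
static void
vdev_name_insert(struct vdev_name_entry *entry)
{
	rte_spinlock_lock(&vdev_names_lock);
	TAILQ_INSERT_TAIL(&vdev_names, entry, next);
	rte_spinlock_unlock(&vdev_names_lock);
}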
Re: [dpdk-dev] [PATCH v2 5/5] eal: fix race condition in IPC requests
On 3/2/2018 4:41 PM, Anatoly Burakov wrote: Unlocking the action list before sending message and locking it again aftterwards (Typo: afterwards) introduces a window where a response might arrive before we have a chance to start waiting on a condition, resulting in timeouts on valid messages. Fixes: 783b6e54971d ("eal: add synchronous multi-process communication") Cc: jianfeng@intel.com Signed-off-by: Anatoly Burakov Acked-by: Jianfeng Tan Thank you for catching another bug :-)
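The pattern the fix enforces can be reduced to the sketch below: the pending request stays protected by the mutex from the moment it becomes visible until pthread_cond_timedwait() atomically releases that mutex, so a reply can never be handled before the requester is actually waiting (structure names here are illustrative, not the eal_common_proc.c ones):

#include <errno.h>
#include <pthread.h>
#include <time.h>

struct pending_request {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	int reply_received;
};

/* Wait for a reply without ever dropping the lock between publishing the
 * request and sleeping on the condition variable.
 */
static int
wait_for_reply(struct pending_request *req, const struct timespec *deadline)
{
	int ret = 0;

	pthread_mutex_lock(&req->lock);
	/* ... register the request and send the message here, still locked ... */
	while (req->reply_received == 0 && ret != ETIMEDOUT)
		ret = pthread_cond_timedwait(&req->cond, &req->lock, deadline);
	pthread_mutex_unlock(&req->lock);

	return req->reply_received ? 0 : -1;
}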
[dpdk-dev] [dpdk-announce] DPDK 16.11.5 (LTS) released
Hi all, Here is a new stable release: http://fast.dpdk.org/rel/dpdk-16.11.5.tar.xz The git tree is at: http://dpdk.org/browse/dpdk-stable/ Apologies for the delays of a few days, but some extra time was necessary to sort through the regression tests results. Luca Boccassi --- MAINTAINERS| 1 + app/Makefile | 2 +- app/test-pmd/cmdline.c | 8 +- app/test-pmd/config.c | 54 +- app/test-pmd/txonly.c | 1 + app/test/test.c| 14 +- app/test/test_cryptodev.c | 2 + app/test/test_memzone.c| 253 +--- app/test/test_pmd_perf.c | 10 +- app/test/test_reorder.c| 11 + app/test/test_ring_perf.c | 36 +- app/test/test_table.c | 44 +- app/test/test_table_acl.c | 2 + app/test/test_timer_perf.c | 1 + buildtools/pmdinfogen/pmdinfogen.c | 5 +- config/common_base | 5 + config/common_linuxapp | 1 + doc/guides/cryptodevs/aesni_mb.rst | 2 +- doc/guides/nics/features/i40e.ini | 1 + doc/guides/nics/features/i40e_vec.ini | 1 + doc/guides/nics/i40e.rst | 27 + doc/guides/rel_notes/release_16_11.rst | 132 + doc/guides/sample_app_ug/keep_alive.rst| 2 +- drivers/crypto/qat/qat_adf/qat_algs_build_desc.c | 10 + drivers/crypto/qat/qat_crypto.c| 5 +- drivers/net/af_packet/rte_eth_af_packet.c | 2 +- drivers/net/bnxt/bnxt.h| 1 + drivers/net/bnxt/bnxt_ethdev.c | 34 +- drivers/net/bnxt/bnxt_hwrm.c | 58 +- drivers/net/bnxt/bnxt_hwrm.h | 4 +- drivers/net/bnxt/bnxt_ring.c | 24 +- drivers/net/bnxt/bnxt_ring.h | 3 +- drivers/net/bnxt/bnxt_rxr.c| 7 +- drivers/net/bnxt/bnxt_txr.c| 17 +- drivers/net/bonding/rte_eth_bond_8023ad.c | 3 +- drivers/net/bonding/rte_eth_bond_api.c | 11 +- drivers/net/bonding/rte_eth_bond_pmd.c | 10 +- drivers/net/e1000/em_ethdev.c | 2 +- drivers/net/e1000/igb_ethdev.c | 20 +- drivers/net/ena/ena_ethdev.c | 10 +- drivers/net/enic/enic.h| 26 +- drivers/net/enic/enic_ethdev.c | 18 +- drivers/net/enic/enic_main.c | 43 +- drivers/net/fm10k/fm10k_ethdev.c | 4 +- drivers/net/i40e/Makefile | 2 + drivers/net/i40e/base/i40e_adminq.c| 23 +- drivers/net/i40e/base/i40e_common.c| 8 +- drivers/net/i40e/base/i40e_nvm.c | 3 +- drivers/net/i40e/base/i40e_type.h | 1 + drivers/net/i40e/i40e_ethdev.c | 473 +++ drivers/net/i40e/i40e_ethdev.h | 63 +- drivers/net/i40e/i40e_ethdev_vf.c | 13 +- drivers/net/i40e/i40e_fdir.c | 8 +- drivers/net/i40e/i40e_rxtx.c | 1 + drivers/net/i40e/i40e_rxtx_vec_altivec.c | 654 + drivers/net/ixgbe/base/ixgbe_82599.c | 7 + drivers/net/ixgbe/base/ixgbe_api.c | 2 + drivers/net/ixgbe/base/ixgbe_common.c | 10 +- drivers/net/ixgbe/base/ixgbe_mbx.c | 22 - drivers/net/ixgbe/base/ixgbe_type.h| 4 +- drivers/net/ixgbe/ixgbe_ethdev.c | 167 +- drivers/net/mlx5/mlx5.h| 16 + drivers/net/mlx5/mlx5_ethdev.c | 18 +- drivers/net/nfp/nfp_net.c | 19 +- drivers/net/null/rte_eth_null.c| 2 +- drivers/net/pcap/rte_eth_pcap.c| 6 +- drivers/net/qede/base/ecore_dcbx.c | 7 +- drivers/net/qede/base/ecore_vf.c | 6 + drivers/net/qede/base/ecore_vfpf_if.h | 2 + drivers/net/qede/qede_ethdev.c | 160 - drivers/net/qede/qede_rxtx.c | 55 +- drivers/net/qede/qede_rxtx.h | 15 +- drivers/net/ring/rte_eth_ring.c| 2 +- drivers/net/szedata2/rte_eth_szedata2.c| 4 +- drivers/net/thunderx/nicvf_ethdev.c| 2 +- drivers/net/thunderx/nicvf_rxtx.c | 2 +- drivers/net/vhost/rte_eth_vhost.c
Re: [dpdk-dev] [PATCH] eal: register rte_panic user callback
07/03/2018 12:02, Arnon Warshavsky: > > > Can this really go into current release without deprecation notices? > > > > If such an exception is done, it must be approved by the technical board. > > We need to check few criterias: > > - which functions need to be changed > > - how the application is impacted > > - what is the urgency > > > > If a panic is removed and the application is not already checking some > > error code, the execution will continue without considering the error. > > > > > Some rte_panic could be probably removed without any impact on > > applications. > > Some rte_panic could wait for 18.08 with a notice in 18.05. > > If some rte_panic cannot wait, it must be discussed specifically. > > > > Every panic removal must be handled all the way up in all call paths. > If not all instances can be removed at once in 18.05 (which seems to be the case) > maybe we should keep the callback patch until all the remains are gone. Why introduce a new API for a temporary solution? It has always been like that, so the remaining occurrences could wait one more release, couldn't they?
Re: [dpdk-dev] [PATCH] eal: register rte_panic user callback
07/03/2018 14:23, Arnon Warshavsky: > > > > Can we add a compile warning for adding new rte_panic's into code? It's a > > > nice tool while debugging, but it probably shouldn't be in any new > > > production code. Yes, it could be nice to automatically detect it in drivers/ or lib/ directories. > I thought about renaming the current function and calls to something like > deprecated_rte_panic(), and keeping the old API with __rte_deprecated. > Is this kind of API break acceptable? No, rte_panic can be used in applications.
Re: [dpdk-dev] [PATCH v2] ether: fix invalid string length in ethdev name comparison
On 2/27/2018 9:38 AM, Ananyev, Konstantin wrote: > > >> -Original Message- >> From: Awal, Mohammad Abdul >> Sent: Tuesday, February 27, 2018 8:58 AM >> To: tho...@monjalon.net >> Cc: rke...@gmail.com; dev@dpdk.org; Ananyev, Konstantin >> ; Awal, Mohammad Abdul >> >> Subject: [PATCH v2] ether: fix invalid string length in ethdev name >> comparison >> >> The current code compares two strings up to the length of the 1st string >> (searched name). If the 1st string is a prefix of the 2nd string (existing name), >> the string comparison returns the port_id of the earliest prefix match. >> This patch fixes the bug by using strcmp instead of strncmp. >> >> Fixes: 9c5b8d8b9fe ("ethdev: clean port id retrieval when attaching") >> >> Signed-off-by: Mohammad Abdul Awal > Acked-by: Konstantin Ananyev Applied to dpdk-next-net/master, thanks.
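The class of bug fixed here is easy to reproduce in isolation; with strncmp() bounded by the searched name's length, any existing name that merely starts with it looks like a match (the names below are made up for illustration):

#include <stdio.h>
#include <string.h>

int
main(void)
{
	const char *existing = "net_vdev0_extra";
	const char *searched = "net_vdev0";

	/* Bounded compare: 0, i.e. a spurious match on the prefix. */
	printf("strncmp -> %d\n",
		strncmp(searched, existing, strlen(searched)));
	/* Full compare: non-zero, correctly rejecting the prefix match. */
	printf("strcmp  -> %d\n", strcmp(searched, existing));

	return 0;
}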
Re: [dpdk-dev] [PATCH 00/41] Memory Hotplug for DPDK
Hi Anatoly, I am trying to run some tests with this series, but it seems to be based on some other commits of yours. I have already identified the following one [1], but it seems I am missing some others. Is it possible to have a list of commits to apply on the current master branch [2] before this series? Thanks, [1] https://dpdk.org/patch/35043 [2] https://dpdk.org/browse/dpdk/commit/?id=c06ddf9698e0c2a9653cfa971f9ddc205065662c -- Nélio Laranjeiro 6WIND
Re: [dpdk-dev] [PATCH 00/41] Memory Hotplug for DPDK
On 07-Mar-18 3:27 PM, Nélio Laranjeiro wrote: Hi Anatoly, I am trying to run some tests with this series, but it seems to be based on some other commits of yours. I have already identified the following one [1], but it seems I am missing some others. Is it possible to have a list of commits to apply on the current master branch [2] before this series? Thanks, [1] https://dpdk.org/patch/35043 [2] https://dpdk.org/browse/dpdk/commit/?id=c06ddf9698e0c2a9653cfa971f9ddc205065662c Hi Nelio, Yes, my apologies. I'm aware of the apply issues. The issue is due to me missing a rebase on one of the dependent patchsets. I'm preparing a v2 that will fix the issue (pending some internal processes). -- Thanks, Anatoly
Re: [dpdk-dev] [PATCH 00/41] Memory Hotplug for DPDK
On 07-Mar-18 3:27 PM, Nélio Laranjeiro wrote: Hi Anatoly, I am trying to run some tests with this series, but it seems to be based on some other commits of yours. I have already identified the following one [1], but it seems I am missing some others. Is it possible to have a list of commits to apply on the current master branch [2] before this series? Thanks, [1] https://dpdk.org/patch/35043 [2] https://dpdk.org/browse/dpdk/commit/?id=c06ddf9698e0c2a9653cfa971f9ddc205065662c Also, the cover letter you're responding to lists dependent patches as well :) It's just that the current patchset does not apply cleanly on top of them due to rebase errors on my side. -- Thanks, Anatoly
Re: [dpdk-dev] [dpdk-stable] [PATCH] doc: fixing grammar
On 3/7/2018 1:41 PM, Kovacevic, Marko wrote: > > >> -Original Message- >> From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Alejandro Lucero >> Sent: Thursday, February 22, 2018 12:16 PM >> To: dev@dpdk.org >> Cc: sta...@dpdk.org >> Subject: [dpdk-dev] [PATCH] doc: fixing grammar >> >> My english is far worse than those from the marketing team. Fixes: 80bc1752f16e ("nfp: add guide") Fixes: d625beafc8be ("doc: update NFP with PF support information") Fixes: 80987c40fd28 ("config: enable nfp driver on Linux") Cc: sta...@dpdk.org >> Signed-off-by: Alejandro Lucero >> --- >> doc/guides/nics/nfp.rst | 43 ++- >> 1 file changed, 22 insertions(+), 21 deletions(-) >> > <...> > > Acked-by: Marko Kovacevic Applied to dpdk-next-net/master, thanks.
Re: [dpdk-dev] [PATCH] eal: register rte_panic user callback
> > maybe we should keep the callback patch until all the remains are gone. > > Why introduce a new API for a temporary solution? > It has always been like that, so the remaining occurrences could wait > one more release, couldn't they? > > Yes. I guess I am over-excited to get rid of my local changes faster :)
Re: [dpdk-dev] [PATCH] eal: register rte_panic user callback
> > Can we add a compile warning for adding new rte_panic's into code? It's a > > > nice tool while debugging, but it probably shouldn't be in any new > > > production code. > > Yes, it could be nice to automatically detect it in drivers/ or lib/ > directories. > How do we apply a warning only to new code? Via checkpatch?
[dpdk-dev] [PATCH v5 5/6] eal: simplify IPC sync request timeout code
Signed-off-by: Anatoly Burakov --- Notes: v4: add this patch lib/librte_eal/common/eal_common_proc.c | 18 -- 1 file changed, 4 insertions(+), 14 deletions(-) diff --git a/lib/librte_eal/common/eal_common_proc.c b/lib/librte_eal/common/eal_common_proc.c index c6fef75..fe27d68 100644 --- a/lib/librte_eal/common/eal_common_proc.c +++ b/lib/librte_eal/common/eal_common_proc.c @@ -586,7 +586,6 @@ mp_request_one(const char *dst, struct rte_mp_msg *req, struct rte_mp_reply *reply, const struct timespec *ts) { int ret; - struct timeval now; struct rte_mp_msg msg, *tmp; struct sync_request sync_req, *exist; @@ -618,19 +617,10 @@ mp_request_one(const char *dst, struct rte_mp_msg *req, reply->nb_sent++; do { - pthread_cond_timedwait(&sync_req.cond, &sync_requests.lock, ts); - /* Check spurious wakeups */ - if (sync_req.reply_received == 1) - break; - /* Check if time is out */ - if (gettimeofday(&now, NULL) < 0) - break; - if (ts->tv_sec < now.tv_sec) - break; - else if (now.tv_sec == ts->tv_sec && -now.tv_usec * 1000 < ts->tv_nsec) - break; - } while (1); + ret = pthread_cond_timedwait(&sync_req.cond, + &sync_requests.lock, ts); + } while (ret != 0 && ret != ETIMEDOUT); + /* We got the lock now */ TAILQ_REMOVE(&sync_requests.requests, &sync_req, next); pthread_mutex_unlock(&sync_requests.lock); -- 2.7.4
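The simplified loop above leans on pthread_cond_timedwait() itself returning ETIMEDOUT instead of re-checking gettimeofday() by hand; the absolute deadline it is given is built by the caller roughly as follows (a sketch of the surrounding logic, not a copy of it):

#include <sys/time.h>
#include <time.h>

/* Turn a relative timeout in milliseconds into the absolute
 * CLOCK_REALTIME deadline expected by pthread_cond_timedwait().
 */
static void
make_deadline(struct timespec *ts, unsigned int timeout_ms)
{
	struct timeval now;

	gettimeofday(&now, NULL);
	ts->tv_sec = now.tv_sec + timeout_ms / 1000;
	ts->tv_nsec = now.tv_usec * 1000 + (timeout_ms % 1000) * 1000000;
	if (ts->tv_nsec >= 1000000000) {
		ts->tv_sec++;
		ts->tv_nsec -= 1000000000;
	}
}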
[dpdk-dev] [PATCH v5 3/6] eal: don't hardcode socket filter value in IPC
Currently, filter value is hardcoded and disconnected from actual value returned by eal_mp_socket_path(). Fix this to generate filter value by deriving it from eal_mp_socket_path() instead. Signed-off-by: Anatoly Burakov Acked-by: Jianfeng Tan --- Notes: v5: removed init files v4: added filtering for init files as well lib/librte_eal/common/eal_common_proc.c | 15 --- 1 file changed, 8 insertions(+), 7 deletions(-) diff --git a/lib/librte_eal/common/eal_common_proc.c b/lib/librte_eal/common/eal_common_proc.c index 1aab3ac..9587211 100644 --- a/lib/librte_eal/common/eal_common_proc.c +++ b/lib/librte_eal/common/eal_common_proc.c @@ -359,18 +359,19 @@ int rte_mp_channel_init(void) { char thread_name[RTE_MAX_THREAD_NAME_LEN]; - char *path; + char path[PATH_MAX]; pthread_t tid; - snprintf(mp_filter, PATH_MAX, ".%s_unix_*", -internal_config.hugefile_prefix); + /* create filter path */ + create_socket_path("*", path, sizeof(path)); + snprintf(mp_filter, sizeof(mp_filter), "%s", basename(path)); - path = strdup(eal_mp_socket_path()); - snprintf(mp_dir_path, PATH_MAX, "%s", dirname(path)); - free(path); + /* path may have been modified, so recreate it */ + create_socket_path("*", path, sizeof(path)); + snprintf(mp_dir_path, sizeof(mp_dir_path), "%s", dirname(path)); if (rte_eal_process_type() == RTE_PROC_PRIMARY && - unlink_sockets(mp_filter)) { + unlink_sockets(mp_filter)) { RTE_LOG(ERR, EAL, "failed to unlink mp sockets\n"); return -1; } -- 2.7.4
[dpdk-dev] [PATCH v5 1/6] eal: add internal flag indicating init has completed
Currently, primary process initialization is finalized by setting the RTE_MAGIC value in the shared config. However, it is not possible to check whether secondary process initialization has completed. Add such a value to internal config. Signed-off-by: Anatoly Burakov --- Notes: v4: make init_complete volatile This patchset is dependent upon earlier IPC fixes patchset [1]. [1] http://dpdk.org/dev/patchwork/bundle/aburakov/IPC_Fixes/ lib/librte_eal/common/eal_common_options.c | 1 + lib/librte_eal/common/eal_internal_cfg.h | 2 ++ lib/librte_eal/linuxapp/eal/eal.c | 2 ++ 3 files changed, 5 insertions(+) diff --git a/lib/librte_eal/common/eal_common_options.c b/lib/librte_eal/common/eal_common_options.c index 9f2f8d2..0be80cb 100644 --- a/lib/librte_eal/common/eal_common_options.c +++ b/lib/librte_eal/common/eal_common_options.c @@ -194,6 +194,7 @@ eal_reset_internal_config(struct internal_config *internal_cfg) internal_cfg->vmware_tsc_map = 0; internal_cfg->create_uio_dev = 0; internal_cfg->user_mbuf_pool_ops_name = NULL; + internal_cfg->init_complete = 0; } static int diff --git a/lib/librte_eal/common/eal_internal_cfg.h b/lib/librte_eal/common/eal_internal_cfg.h index 1169fcc..a0082d1 100644 --- a/lib/librte_eal/common/eal_internal_cfg.h +++ b/lib/librte_eal/common/eal_internal_cfg.h @@ -56,6 +56,8 @@ struct internal_config { /**< user defined mbuf pool ops name */ unsigned num_hugepage_sizes; /**< how many sizes on this system */ struct hugepage_info hugepage_info[MAX_HUGEPAGE_SIZES]; + volatile unsigned int init_complete; + /**< indicates whether EAL has completed initialization */ }; extern struct internal_config internal_config; /**< Global EAL configuration. */ diff --git a/lib/librte_eal/linuxapp/eal/eal.c b/lib/librte_eal/linuxapp/eal/eal.c index 38306bf..2ecd07b 100644 --- a/lib/librte_eal/linuxapp/eal/eal.c +++ b/lib/librte_eal/linuxapp/eal/eal.c @@ -669,6 +669,8 @@ rte_eal_mcfg_complete(void) /* ALL shared mem_config related INIT DONE */ if (rte_config.process_type == RTE_PROC_PRIMARY) rte_config.mem_config->magic = RTE_MAGIC; + + internal_config.init_complete = 1; } /* -- 2.7.4
[dpdk-dev] [PATCH v5 4/6] eal: lock IPC directory on init and send
When sending IPC messages, prevent new sockets from initializing. Signed-off-by: Anatoly Burakov --- Notes: v5: removed init files introduced in v4 v4: fixed resource leaks and added support for init files introduced in v4 series lib/librte_eal/common/eal_common_proc.c | 59 +++-- 1 file changed, 56 insertions(+), 3 deletions(-) diff --git a/lib/librte_eal/common/eal_common_proc.c b/lib/librte_eal/common/eal_common_proc.c index 9587211..c6fef75 100644 --- a/lib/librte_eal/common/eal_common_proc.c +++ b/lib/librte_eal/common/eal_common_proc.c @@ -13,6 +13,7 @@ #include #include #include +#include #include #include #include @@ -360,6 +361,7 @@ rte_mp_channel_init(void) { char thread_name[RTE_MAX_THREAD_NAME_LEN]; char path[PATH_MAX]; + int dir_fd; pthread_t tid; /* create filter path */ @@ -370,19 +372,38 @@ rte_mp_channel_init(void) create_socket_path("*", path, sizeof(path)); snprintf(mp_dir_path, sizeof(mp_dir_path), "%s", dirname(path)); + /* lock the directory */ + dir_fd = open(mp_dir_path, O_RDONLY); + if (dir_fd < 0) { + RTE_LOG(ERR, EAL, "failed to open %s: %s\n", + mp_dir_path, strerror(errno)); + return -1; + } + + if (flock(dir_fd, LOCK_EX)) { + RTE_LOG(ERR, EAL, "failed to lock %s: %s\n", + mp_dir_path, strerror(errno)); + close(dir_fd); + return -1; + } + if (rte_eal_process_type() == RTE_PROC_PRIMARY && unlink_sockets(mp_filter)) { RTE_LOG(ERR, EAL, "failed to unlink mp sockets\n"); + close(dir_fd); return -1; } - if (open_socket_fd() < 0) + if (open_socket_fd() < 0) { + close(dir_fd); return -1; + } if (pthread_create(&tid, NULL, mp_handle, NULL) < 0) { RTE_LOG(ERR, EAL, "failed to create mp thead: %s\n", strerror(errno)); close(mp_fd); + close(dir_fd); mp_fd = -1; return -1; } @@ -390,6 +411,11 @@ rte_mp_channel_init(void) /* try best to set thread name */ snprintf(thread_name, RTE_MAX_THREAD_NAME_LEN, "rte_mp_handle"); rte_thread_setname(tid, thread_name); + + /* unlock the directory */ + flock(dir_fd, LOCK_UN); + close(dir_fd); + return 0; } @@ -465,7 +491,7 @@ send_msg(const char *dst_path, struct rte_mp_msg *msg, int type) static int mp_send(struct rte_mp_msg *msg, const char *peer, int type) { - int ret = 0; + int dir_fd, ret = 0; DIR *mp_dir; struct dirent *ent; @@ -487,6 +513,17 @@ mp_send(struct rte_mp_msg *msg, const char *peer, int type) rte_errno = errno; return -1; } + + dir_fd = dirfd(mp_dir); + /* lock the directory to prevent processes spinning up while we send */ + if (flock(dir_fd, LOCK_EX)) { + RTE_LOG(ERR, EAL, "Unable to lock directory %s\n", + mp_dir_path); + rte_errno = errno; + closedir(mp_dir); + return -1; + } + while ((ent = readdir(mp_dir))) { char path[PATH_MAX]; @@ -498,7 +535,10 @@ mp_send(struct rte_mp_msg *msg, const char *peer, int type) if (send_msg(path, msg, type) < 0) ret = -1; } + /* unlock the dir */ + flock(dir_fd, LOCK_UN); + /* dir_fd automatically closed on closedir */ closedir(mp_dir); return ret; } @@ -619,7 +659,7 @@ int __rte_experimental rte_mp_request(struct rte_mp_msg *req, struct rte_mp_reply *reply, const struct timespec *ts) { - int ret = 0; + int dir_fd, ret = 0; DIR *mp_dir; struct dirent *ent; struct timeval now; @@ -655,6 +695,16 @@ rte_mp_request(struct rte_mp_msg *req, struct rte_mp_reply *reply, return -1; } + dir_fd = dirfd(mp_dir); + /* lock the directory to prevent processes spinning up while we send */ + if (flock(dir_fd, LOCK_EX)) { + RTE_LOG(ERR, EAL, "Unable to lock directory %s\n", + mp_dir_path); + closedir(mp_dir); + rte_errno = errno; + return -1; + } + while ((ent = readdir(mp_dir))) { char 
path[PATH_MAX]; @@ -667,7 +717,10 @@ rte_mp_request(struct rte_mp_msg *req, struct rte_mp_reply *reply, if (mp_request_one(path, req, reply, &end)) ret = -1; } + /* unlock the directory */ + flock(dir_fd, LOCK_UN); + /* dir_fd automatically closed on closedir */ closedir(mp_dir); return re
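Stripped of the IPC specifics, the directory-locking pattern introduced above boils down to the helper below (a hedged sketch; the wrapper name is invented for illustration):

#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Hold an exclusive advisory lock on a directory while fn() runs, so no
 * new per-process socket can appear in it mid-operation.
 */
static int
with_dir_locked(const char *dir_path, int (*fn)(void *), void *arg)
{
	int dir_fd, ret;

	dir_fd = open(dir_path, O_RDONLY);
	if (dir_fd < 0)
		return -1;

	if (flock(dir_fd, LOCK_EX) < 0) {
		close(dir_fd);
		return -1;
	}

	ret = fn(arg);

	flock(dir_fd, LOCK_UN);
	close(dir_fd);
	return ret;
}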
[dpdk-dev] [PATCH v5 6/6] eal: ignore messages until init is complete
If we receive messages that don't have a callback registered for them, and we haven't finished initialization yet, it can be reasonably inferred that we shouldn't have gotten the message in the first place. Therefore, send requester a special message telling them to ignore response to this request, as if this process wasn't there. Signed-off-by: Anatoly Burakov --- Notes: v5: add this patch No changes in mp_send and send_msg - just code move. lib/librte_eal/common/eal_common_proc.c | 280 +--- 1 file changed, 151 insertions(+), 129 deletions(-) diff --git a/lib/librte_eal/common/eal_common_proc.c b/lib/librte_eal/common/eal_common_proc.c index fe27d68..1ea6045 100644 --- a/lib/librte_eal/common/eal_common_proc.c +++ b/lib/librte_eal/common/eal_common_proc.c @@ -52,6 +52,7 @@ enum mp_type { MP_MSG, /* Share message with peers, will not block */ MP_REQ, /* Request for information, Will block for a reply */ MP_REP, /* Response to previously-received request */ + MP_IGN, /* Response telling requester to ignore this response */ }; struct mp_msg_internal { @@ -205,6 +206,130 @@ rte_mp_action_unregister(const char *name) free(entry); } +/** + * Return -1, as fail to send message and it's caused by the local side. + * Return 0, as fail to send message and it's caused by the remote side. + * Return 1, as succeed to send message. + * + */ +static int +send_msg(const char *dst_path, struct rte_mp_msg *msg, int type) +{ + int snd; + struct iovec iov; + struct msghdr msgh; + struct cmsghdr *cmsg; + struct sockaddr_un dst; + struct mp_msg_internal m; + int fd_size = msg->num_fds * sizeof(int); + char control[CMSG_SPACE(fd_size)]; + + m.type = type; + memcpy(&m.msg, msg, sizeof(*msg)); + + memset(&dst, 0, sizeof(dst)); + dst.sun_family = AF_UNIX; + snprintf(dst.sun_path, sizeof(dst.sun_path), "%s", dst_path); + + memset(&msgh, 0, sizeof(msgh)); + memset(control, 0, sizeof(control)); + + iov.iov_base = &m; + iov.iov_len = sizeof(m) - sizeof(msg->fds); + + msgh.msg_name = &dst; + msgh.msg_namelen = sizeof(dst); + msgh.msg_iov = &iov; + msgh.msg_iovlen = 1; + msgh.msg_control = control; + msgh.msg_controllen = sizeof(control); + + cmsg = CMSG_FIRSTHDR(&msgh); + cmsg->cmsg_len = CMSG_LEN(fd_size); + cmsg->cmsg_level = SOL_SOCKET; + cmsg->cmsg_type = SCM_RIGHTS; + memcpy(CMSG_DATA(cmsg), msg->fds, fd_size); + + do { + snd = sendmsg(mp_fd, &msgh, 0); + } while (snd < 0 && errno == EINTR); + + if (snd < 0) { + rte_errno = errno; + /* Check if it caused by peer process exits */ + if (errno == ECONNREFUSED && + rte_eal_process_type() == RTE_PROC_PRIMARY) { + unlink(dst_path); + return 0; + } + if (errno == ENOBUFS) { + RTE_LOG(ERR, EAL, "Peer cannot receive message %s\n", + dst_path); + return 0; + } + RTE_LOG(ERR, EAL, "failed to send to (%s) due to %s\n", + dst_path, strerror(errno)); + return -1; + } + + return 1; +} + +static int +mp_send(struct rte_mp_msg *msg, const char *peer, int type) +{ + int dir_fd, ret = 0; + DIR *mp_dir; + struct dirent *ent; + + if (!peer && (rte_eal_process_type() == RTE_PROC_SECONDARY)) + peer = eal_mp_socket_path(); + + if (peer) { + if (send_msg(peer, msg, type) < 0) + return -1; + else + return 0; + } + + /* broadcast to all secondary processes */ + mp_dir = opendir(mp_dir_path); + if (!mp_dir) { + RTE_LOG(ERR, EAL, "Unable to open directory %s\n", + mp_dir_path); + rte_errno = errno; + return -1; + } + + dir_fd = dirfd(mp_dir); + /* lock the directory to prevent processes spinning up while we send */ + if (flock(dir_fd, LOCK_EX)) { + RTE_LOG(ERR, EAL, "Unable to lock directory 
%s\n", + mp_dir_path); + rte_errno = errno; + closedir(mp_dir); + return -1; + } + + while ((ent = readdir(mp_dir))) { + char path[PATH_MAX]; + + if (fnmatch(mp_filter, ent->d_name, 0) != 0) + continue; + + snprintf(path, sizeof(path), "%s/%s", mp_dir_path, +ent->d_name); + if (send_msg(path, msg, type) < 0) + ret = -1; + } + /* unlock the dir */ + flock(dir_fd, LOCK_UN); + + /* d
[dpdk-dev] [PATCH v5 0/6] Improvements for DPDK IPC
This is an assortment of loosely related improvements to IPC, mostly related to handling corner cases and avoiding race conditions. Main addition is an attempt to avoid undefined behavior when receiving messages while secondary process is initializing. It is assumed that once callback is registered, it is safe to receive messages. If the callback wasn't registered, then there are two choices - either we haven't reached the stage where we register this callback (init is not finished), or user has forgotten to register callback for this message. The latter can only be known once initialization is complete, so until init is complete, treat this process as not-existing if there is no registered callback for the message. This will handle both scenarios. v5: - added cover-letter :) - drop the "don't send messages to processes which haven't finished initializing" model added in previous version. instead, allow everyone to receive all messages, but check if initialization is completed, and check if there is a callback registered for this message. if there is no callback, assume we just didn't get around to it yet, so just send a special message to the requestor that it should treat this process as if it wasn't there. v4: - make init_complete volatile - changed from "don't process messages until init complete" to "don't send messages to processes which haven't finished initializing", as the former would have resulted in timeouts if init took too long to complete - fixed resource leaks - added patch to simplify IPC timeouts handling v3: - move init_complete until after receiving message v2: - added patch to prevent IPC from sending messages while primary is initializing - added patch to generate filter from eal_mp_socket_path() instead of hardcoding the value Anatoly Burakov (6): eal: add internal flag indicating init has completed eal: abstract away IPC socket path generation eal: don't hardcode socket filter value in IPC eal: lock IPC directory on init and send eal: simplify IPC sync request timeout code eal: ignore messages until init is complete lib/librte_eal/common/eal_common_options.c | 1 + lib/librte_eal/common/eal_common_proc.c| 382 + lib/librte_eal/common/eal_internal_cfg.h | 2 + lib/librte_eal/linuxapp/eal/eal.c | 2 + 4 files changed, 228 insertions(+), 159 deletions(-) -- 2.7.4
[dpdk-dev] [PATCH v3] eal: add asynchronous request API to DPDK IPC
This API is similar to the blocking API that is already present, but reply will be received in a separate callback by the caller. Under the hood, we create a separate thread to deal with replies to asynchronous requests, that will just wait to be notified by the main thread, or woken up on a timer (it'll wake itself up every minute regardless of whether it was called, but if there are no requests in the queue, nothing will be done and it'll go to sleep for another minute). Signed-off-by: Anatoly Burakov --- Notes: v3: - added support for MP_IGN messages introduced in IPC improvements v5 patchset v2: - fixed deadlocks and race conditions by not calling callbacks while iterating over sync request list - fixed use-after-free by making a copy of request - changed API to also give user a copy of original request, so that they know to which message the callback is a reply to - fixed missing .map file entries This patch is dependent upon previously published patchsets for IPC fixes [1] and improvements [2]. rte_mp_action_unregister and rte_mp_async_reply_unregister do the same thing - should we perhaps make it one function? [1] http://dpdk.org/dev/patchwork/bundle/aburakov/IPC_Fixes/ [2] http://dpdk.org/dev/patchwork/bundle/aburakov/IPC_Improvements/ lib/librte_eal/common/eal_common_proc.c | 563 ++-- lib/librte_eal/common/include/rte_eal.h | 72 lib/librte_eal/rte_eal_version.map | 3 + 3 files changed, 607 insertions(+), 31 deletions(-) diff --git a/lib/librte_eal/common/eal_common_proc.c b/lib/librte_eal/common/eal_common_proc.c index 1ea6045..d99ba56 100644 --- a/lib/librte_eal/common/eal_common_proc.c +++ b/lib/librte_eal/common/eal_common_proc.c @@ -26,6 +26,7 @@ #include #include #include +#include #include "eal_private.h" #include "eal_filesystem.h" @@ -39,7 +40,11 @@ static pthread_mutex_t mp_mutex_action = PTHREAD_MUTEX_INITIALIZER; struct action_entry { TAILQ_ENTRY(action_entry) next; char action_name[RTE_MP_MAX_NAME_LEN]; - rte_mp_t action; + RTE_STD_C11 + union { + rte_mp_t action; + rte_mp_async_reply_t reply; + }; }; /** Double linked list of actions. 
*/ @@ -60,13 +65,37 @@ struct mp_msg_internal { struct rte_mp_msg msg; }; +enum mp_request_type { + REQUEST_TYPE_SYNC, + REQUEST_TYPE_ASYNC +}; + +struct async_request_shared_param { + struct rte_mp_reply *user_reply; + struct timespec *end; + int n_requests_processed; +}; + +struct async_request_param { + struct async_request_shared_param *param; +}; + +struct sync_request_param { + pthread_cond_t cond; +}; + struct sync_request { TAILQ_ENTRY(sync_request) next; - int reply_received; + enum mp_request_type type; char dst[PATH_MAX]; struct rte_mp_msg *request; - struct rte_mp_msg *reply; - pthread_cond_t cond; + struct rte_mp_msg *reply_msg; + int reply_received; + RTE_STD_C11 + union { + struct sync_request_param sync; + struct async_request_param async; + }; }; TAILQ_HEAD(sync_request_list, sync_request); @@ -74,9 +103,12 @@ TAILQ_HEAD(sync_request_list, sync_request); static struct { struct sync_request_list requests; pthread_mutex_t lock; + pthread_cond_t async_cond; } sync_requests = { .requests = TAILQ_HEAD_INITIALIZER(sync_requests.requests), - .lock = PTHREAD_MUTEX_INITIALIZER + .lock = PTHREAD_MUTEX_INITIALIZER, + .async_cond = PTHREAD_COND_INITIALIZER + /**< used in async requests only */ }; static struct sync_request * @@ -159,50 +191,50 @@ validate_action_name(const char *name) return 0; } -int __rte_experimental -rte_mp_action_register(const char *name, rte_mp_t action) +static struct action_entry * +action_register(const char *name) { struct action_entry *entry; if (validate_action_name(name)) - return -1; + return NULL; entry = malloc(sizeof(struct action_entry)); if (entry == NULL) { rte_errno = ENOMEM; - return -1; + return NULL; } strcpy(entry->action_name, name); - entry->action = action; - pthread_mutex_lock(&mp_mutex_action); if (find_action_entry_by_name(name) != NULL) { pthread_mutex_unlock(&mp_mutex_action); rte_errno = EEXIST; free(entry); - return -1; + return NULL; } TAILQ_INSERT_TAIL(&action_entry_list, entry, next); - pthread_mutex_unlock(&mp_mutex_action); - return 0; + + /* async and sync replies are handled by different threads, so even +* though they a share pointer in a union, one will nev
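The once-a-minute wakeup described in the commit message corresponds to a wait loop of roughly this shape (only a sketch of the idea; the real handler also walks the pending-request list and fires the registered callbacks):

#include <pthread.h>
#include <time.h>

static pthread_mutex_t async_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t async_cond = PTHREAD_COND_INITIALIZER;

/* Sleep until signalled about a new reply, but wake up at least once a
 * minute anyway so that timed-out requests are noticed even when no
 * reply ever arrives.
 */
static void *
async_reply_thread(void *arg)
{
	(void)arg;

	for (;;) {
		struct timespec wakeup;

		clock_gettime(CLOCK_REALTIME, &wakeup);
		wakeup.tv_sec += 60;

		pthread_mutex_lock(&async_lock);
		pthread_cond_timedwait(&async_cond, &async_lock, &wakeup);
		/* ... process finished/timed-out async requests here ... */
		pthread_mutex_unlock(&async_lock);
	}

	return NULL;
}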
[dpdk-dev] [PATCH v5 2/6] eal: abstract away IPC socket path generation
Signed-off-by: Anatoly Burakov --- Notes: v5: remove lock files, leaving only socket paths code v4: replace lock files with init files lib/librte_eal/common/eal_common_proc.c | 48 - 1 file changed, 23 insertions(+), 25 deletions(-) diff --git a/lib/librte_eal/common/eal_common_proc.c b/lib/librte_eal/common/eal_common_proc.c index da7930f..1aab3ac 100644 --- a/lib/librte_eal/common/eal_common_proc.c +++ b/lib/librte_eal/common/eal_common_proc.c @@ -91,6 +91,17 @@ find_sync_request(const char *dst, const char *act_name) return r; } +static void +create_socket_path(const char *name, char *buf, int len) +{ + const char *prefix = eal_mp_socket_path(); + + if (strlen(name) > 0) + snprintf(buf, len, "%s_%s", prefix, name); + else + snprintf(buf, len, "%s", prefix); +} + int rte_eal_primary_proc_alive(const char *config_file_path) { @@ -290,8 +301,12 @@ mp_handle(void *arg __rte_unused) static int open_socket_fd(void) { + char peer_name[PATH_MAX] = {0}; struct sockaddr_un un; - const char *prefix = eal_mp_socket_path(); + + if (rte_eal_process_type() == RTE_PROC_SECONDARY) + snprintf(peer_name, sizeof(peer_name), + "%d_%"PRIx64, getpid(), rte_rdtsc()); mp_fd = socket(AF_UNIX, SOCK_DGRAM, 0); if (mp_fd < 0) { @@ -301,13 +316,11 @@ open_socket_fd(void) memset(&un, 0, sizeof(un)); un.sun_family = AF_UNIX; - if (rte_eal_process_type() == RTE_PROC_PRIMARY) - snprintf(un.sun_path, sizeof(un.sun_path), "%s", prefix); - else { - snprintf(un.sun_path, sizeof(un.sun_path), "%s_%d_%"PRIx64, -prefix, getpid(), rte_rdtsc()); - } + + create_socket_path(peer_name, un.sun_path, sizeof(un.sun_path)); + unlink(un.sun_path); /* May still exist since last run */ + if (bind(mp_fd, (struct sockaddr *)&un, sizeof(un)) < 0) { RTE_LOG(ERR, EAL, "failed to bind %s: %s\n", un.sun_path, strerror(errno)); @@ -342,20 +355,6 @@ unlink_sockets(const char *filter) return 0; } -static void -unlink_socket_by_path(const char *path) -{ - char *filename; - char *fullpath = strdup(path); - - if (!fullpath) - return; - filename = basename(fullpath); - unlink_sockets(filename); - free(fullpath); - RTE_LOG(INFO, EAL, "Remove socket %s\n", path); -} - int rte_mp_channel_init(void) { @@ -444,10 +443,9 @@ send_msg(const char *dst_path, struct rte_mp_msg *msg, int type) if (snd < 0) { rte_errno = errno; /* Check if it caused by peer process exits */ - if (errno == ECONNREFUSED) { - /* We don't unlink the primary's socket here */ - if (rte_eal_process_type() == RTE_PROC_PRIMARY) - unlink_socket_by_path(dst_path); + if (errno == ECONNREFUSED && + rte_eal_process_type() == RTE_PROC_PRIMARY) { + unlink(dst_path); return 0; } if (errno == ENOBUFS) { -- 2.7.4
[dpdk-dev] [PATCH v2 01/41] eal: move get_virtual_area out of linuxapp eal_memory.c
Move get_virtual_area out of linuxapp EAL memory and make it common to EAL, so that other code could reserve virtual areas as well. Signed-off-by: Anatoly Burakov --- lib/librte_eal/common/eal_common_memory.c | 101 ++ lib/librte_eal/common/eal_private.h | 33 +++ lib/librte_eal/linuxapp/eal/eal_memory.c | 137 ++ 3 files changed, 161 insertions(+), 110 deletions(-) diff --git a/lib/librte_eal/common/eal_common_memory.c b/lib/librte_eal/common/eal_common_memory.c index 852f3bb..042881b 100644 --- a/lib/librte_eal/common/eal_common_memory.c +++ b/lib/librte_eal/common/eal_common_memory.c @@ -2,10 +2,12 @@ * Copyright(c) 2010-2014 Intel Corporation */ +#include #include #include #include #include +#include #include #include #include @@ -14,12 +16,111 @@ #include #include #include +#include #include #include "eal_private.h" #include "eal_internal_cfg.h" /* + * Try to mmap *size bytes in /dev/zero. If it is successful, return the + * pointer to the mmap'd area and keep *size unmodified. Else, retry + * with a smaller zone: decrease *size by hugepage_sz until it reaches + * 0. In this case, return NULL. Note: this function returns an address + * which is a multiple of hugepage size. + */ + +static uint64_t baseaddr_offset; +static uint64_t system_page_sz; + +void * +eal_get_virtual_area(void *requested_addr, uint64_t *size, + uint64_t page_sz, int flags, int mmap_flags) +{ + bool addr_is_hint, allow_shrink, unmap, no_align; + uint64_t map_sz; + void *mapped_addr, *aligned_addr; + + if (system_page_sz == 0) + system_page_sz = sysconf(_SC_PAGESIZE); + + mmap_flags |= MAP_PRIVATE | MAP_ANONYMOUS; + + RTE_LOG(DEBUG, EAL, "Ask a virtual area of 0x%zx bytes\n", *size); + + addr_is_hint = (flags & EAL_VIRTUAL_AREA_ADDR_IS_HINT) > 0; + allow_shrink = (flags & EAL_VIRTUAL_AREA_ALLOW_SHRINK) > 0; + unmap = (flags & EAL_VIRTUAL_AREA_UNMAP) > 0; + + if (requested_addr == NULL && internal_config.base_virtaddr != 0) { + requested_addr = (void *) (internal_config.base_virtaddr + + baseaddr_offset); + requested_addr = RTE_PTR_ALIGN(requested_addr, page_sz); + addr_is_hint = true; + } + + /* if requested address is not aligned by page size, or if requested +* address is NULL, add page size to requested length as we may get an +* address that's aligned by system page size, which can be smaller than +* our requested page size. additionally, we shouldn't try to align if +* system page size is the same as requested page size. +*/ + no_align = (requested_addr != NULL && + ((uintptr_t)requested_addr & (page_sz - 1)) == 0) || + page_sz == system_page_sz; + + do { + map_sz = no_align ? *size : *size + page_sz; + + mapped_addr = mmap(requested_addr, map_sz, PROT_READ, + mmap_flags, -1, 0); + if (mapped_addr == MAP_FAILED && allow_shrink) + *size -= page_sz; + } while (allow_shrink && mapped_addr == MAP_FAILED && *size > 0); + + /* align resulting address - if map failed, we will ignore the value +* anyway, so no need to add additional checks. +*/ + aligned_addr = no_align ? 
mapped_addr : + RTE_PTR_ALIGN(mapped_addr, page_sz); + + if (*size == 0) { + RTE_LOG(ERR, EAL, "Cannot get a virtual area of any size: %s\n", + strerror(errno)); + rte_errno = errno; + return NULL; + } else if (mapped_addr == MAP_FAILED) { + RTE_LOG(ERR, EAL, "Cannot get a virtual area: %s\n", + strerror(errno)); + /* pass errno up the call chain */ + rte_errno = errno; + return NULL; + } else if (requested_addr != NULL && !addr_is_hint && + aligned_addr != requested_addr) { + RTE_LOG(ERR, EAL, "Cannot get a virtual area at requested address: %p (got %p)\n", + requested_addr, aligned_addr); + munmap(mapped_addr, map_sz); + rte_errno = EADDRNOTAVAIL; + return NULL; + } else if (requested_addr != NULL && addr_is_hint && + aligned_addr != requested_addr) { + RTE_LOG(WARNING, EAL, "WARNING! Base virtual address hint (%p != %p) not respected!\n", + requested_addr, aligned_addr); + RTE_LOG(WARNING, EAL, " This may cause issues with mapping memory into secondary processes\n"); + } + + if (unmap) + munmap(mapped_addr, map_sz); + + RTE_LOG(DEBUG, EAL, "Virtual area found at %p (size = 0x%zx)\n", + a
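Based on the signature and flags visible in the diff, an internal caller of the new helper would look something like this (a hypothetical usage sketch, not code from the series):

#include <stdint.h>
#include "eal_private.h"	/* eal_get_virtual_area() and its flags */

/* Reserve a chunk of virtual address space with no particular address
 * preference, letting the helper shrink the request page by page if the
 * full size cannot be mapped; *len is updated with the size actually
 * reserved. The reservation stays mapped because EAL_VIRTUAL_AREA_UNMAP
 * is not passed.
 */
static void *
reserve_va_space(uint64_t *len, uint64_t page_sz)
{
	return eal_get_virtual_area(NULL, len, page_sz,
			EAL_VIRTUAL_AREA_ALLOW_SHRINK, 0);
}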
[dpdk-dev] [PATCH v2 00/41] Memory Hotplug for DPDK
This patchset introduces dynamic memory allocation for DPDK (aka memory hotplug). Based upon RFC submitted in December [1]. Dependencies (to be applied in specified order): - IPC bugfixes patchset [2] - IPC improvements patchset [3] - IPC asynchronous request API patch [4] - Function to return number of sockets [5] Deprecation notices relevant to this patchset: - General outline of memory hotplug changes [6] - EAL NUMA node count changes [7] The vast majority of changes are in the EAL and malloc, the external API disruption is minimal: a new set of API's are added for contiguous memory allocation for rte_memzone, and a few API additions in rte_memory due to switch to memseg_lists as opposed to memsegs. Every other API change is internal to EAL, and all of the memory allocation/freeing is handled through rte_malloc, with no externally visible API changes. Quick outline of all changes done as part of this patchset: * Malloc heap adjusted to handle holes in address space * Single memseg list replaced by multiple memseg lists * VA space for hugepages is preallocated in advance * Added alloc/free for pages happening as needed on rte_malloc/rte_free * Added contiguous memory allocation API's for rte_memzone * Integrated Pawel Wodkowski's patch for registering/unregistering memory with VFIO [8] * Callbacks for registering memory allocations * Multiprocess support done via DPDK IPC introduced in 18.02 The biggest difference is a "memseg" now represents a single page (as opposed to being a big contiguous block of pages). As a consequence, both memzones and malloc elements are no longer guaranteed to be physically contiguous, unless the user asks for it at reserve time. To preserve whatever functionality that was dependent on previous behavior, a legacy memory option is also provided, however it is expected (or perhaps vainly hoped) to be temporary solution. Why multiple memseg lists instead of one? Since memseg is a single page now, the list of memsegs will get quite big, and we need to locate pages somehow when we allocate and free them. We could of course just walk the list and allocate one contiguous chunk of VA space for memsegs, but this implementation uses separate lists instead in order to speed up many operations with memseg lists. For v1 and v2, the following limitations are present: - FreeBSD does not even compile, let alone run - No 32-bit support - There are some minor quality-of-life improvements planned that aren't ready yet and will be part of v3 - VFIO support is only smoke-tested (but is expected to work), VFIO support with secondary processes is not tested; work is ongoing to validate VFIO for all use cases - Dynamic mapping/unmapping memory with VFIO is not supported in sPAPR IOMMU mode - help from sPAPR maintainers requested Nevertheless, this patchset should be testable under 64-bit Linux, and should work for all use cases bar those mentioned above. 
v2: - fixed deadlock at init - reverted rte_panic changes at init, this is now handled inside IPC [1] http://dpdk.org/dev/patchwork/bundle/aburakov/Memory_RFC/ [2] http://dpdk.org/dev/patchwork/bundle/aburakov/IPC_Fixes/ [3] http://dpdk.org/dev/patchwork/bundle/aburakov/IPC_Improvements/ [4] http://dpdk.org/dev/patchwork/bundle/aburakov/IPC_Async_Request/ [5] http://dpdk.org/dev/patchwork/bundle/aburakov/Num_Sockets/ [6] http://dpdk.org/dev/patchwork/patch/34002/ [7] http://dpdk.org/dev/patchwork/patch/33853/ [8] http://dpdk.org/dev/patchwork/patch/24484/ Anatoly Burakov (41): eal: move get_virtual_area out of linuxapp eal_memory.c eal: move all locking to heap eal: make malloc heap a doubly-linked list eal: add function to dump malloc heap contents test: add command to dump malloc heap contents eal: make malloc_elem_join_adjacent_free public eal: make malloc free list remove public eal: make malloc free return resulting malloc element eal: add rte_fbarray eal: add "single file segments" command-line option eal: add "legacy memory" option eal: read hugepage counts from node-specific sysfs path eal: replace memseg with memseg lists eal: add support for mapping hugepages at runtime eal: add support for unmapping pages at runtime eal: make use of memory hotplug for init eal: enable memory hotplug support in rte_malloc test: fix malloc autotest to support memory hotplug eal: add API to check if memory is contiguous eal: add backend support for contiguous allocation eal: enable reserving physically contiguous memzones eal: replace memzone array with fbarray mempool: add support for the new allocation methods vfio: allow to map other memory regions eal: map/unmap memory with VFIO when alloc/free pages eal: prepare memseg lists for multiprocess sync eal: add multiprocess init with memory hotplug eal: add support for multiprocess memory hotplug eal: add support for callbacks on memory hotplug eal: enable callbacks on malloc/free and mp sync ethdev: use contiguou
[dpdk-dev] [PATCH v2 03/41] eal: make malloc heap a doubly-linked list
As we are preparing for dynamic memory allocation, we need to be able to handle holes in our malloc heap, hence we're switching to doubly linked list, and prepare infrastructure to support it. Since our heap is now aware where are our first and last elements, there is no longer any need to have a dummy element at the end of each heap, so get rid of that as well. Instead, let insert/remove/ join/split operations handle end-of-list conditions automatically. Signed-off-by: Anatoly Burakov --- lib/librte_eal/common/include/rte_malloc_heap.h | 6 + lib/librte_eal/common/malloc_elem.c | 200 +++- lib/librte_eal/common/malloc_elem.h | 14 +- lib/librte_eal/common/malloc_heap.c | 8 +- 4 files changed, 179 insertions(+), 49 deletions(-) diff --git a/lib/librte_eal/common/include/rte_malloc_heap.h b/lib/librte_eal/common/include/rte_malloc_heap.h index ba99ed9..9ec4b62 100644 --- a/lib/librte_eal/common/include/rte_malloc_heap.h +++ b/lib/librte_eal/common/include/rte_malloc_heap.h @@ -13,12 +13,18 @@ /* Number of free lists per heap, grouped by size. */ #define RTE_HEAP_NUM_FREELISTS 13 +/* dummy definition, for pointers */ +struct malloc_elem; + /** * Structure to hold malloc heap */ struct malloc_heap { rte_spinlock_t lock; LIST_HEAD(, malloc_elem) free_head[RTE_HEAP_NUM_FREELISTS]; + struct malloc_elem *first; + struct malloc_elem *last; + unsigned alloc_count; size_t total_size; } __rte_cache_aligned; diff --git a/lib/librte_eal/common/malloc_elem.c b/lib/librte_eal/common/malloc_elem.c index ea041e2..eb41200 100644 --- a/lib/librte_eal/common/malloc_elem.c +++ b/lib/librte_eal/common/malloc_elem.c @@ -31,6 +31,7 @@ malloc_elem_init(struct malloc_elem *elem, elem->heap = heap; elem->ms = ms; elem->prev = NULL; + elem->next = NULL; memset(&elem->free_list, 0, sizeof(elem->free_list)); elem->state = ELEM_FREE; elem->size = size; @@ -39,15 +40,56 @@ malloc_elem_init(struct malloc_elem *elem, set_trailer(elem); } -/* - * Initialize a dummy malloc_elem header for the end-of-memseg marker - */ void -malloc_elem_mkend(struct malloc_elem *elem, struct malloc_elem *prev) +malloc_elem_insert(struct malloc_elem *elem) { - malloc_elem_init(elem, prev->heap, prev->ms, 0); - elem->prev = prev; - elem->state = ELEM_BUSY; /* mark busy so its never merged */ + struct malloc_elem *prev_elem, *next_elem; + struct malloc_heap *heap = elem->heap; + + if (heap->first == NULL && heap->last == NULL) { + /* if empty heap */ + heap->first = elem; + heap->last = elem; + prev_elem = NULL; + next_elem = NULL; + } else if (elem < heap->first) { + /* if lower than start */ + prev_elem = NULL; + next_elem = heap->first; + heap->first = elem; + } else if (elem > heap->last) { + /* if higher than end */ + prev_elem = heap->last; + next_elem = NULL; + heap->last = elem; + } else { + /* the new memory is somewhere inbetween start and end */ + uint64_t dist_from_start, dist_from_end; + + dist_from_end = RTE_PTR_DIFF(heap->last, elem); + dist_from_start = RTE_PTR_DIFF(elem, heap->first); + + /* check which is closer, and find closest list entries */ + if (dist_from_start < dist_from_end) { + prev_elem = heap->first; + while (prev_elem->next < elem) + prev_elem = prev_elem->next; + next_elem = prev_elem->next; + } else { + next_elem = heap->last; + while (next_elem->prev > elem) + next_elem = next_elem->prev; + prev_elem = next_elem->prev; + } + } + + /* insert new element */ + elem->prev = prev_elem; + elem->next = next_elem; + if (prev_elem) + prev_elem->next = elem; + if (next_elem) + next_elem->prev = elem; } /* @@ -98,18 +140,58 
@@ malloc_elem_can_hold(struct malloc_elem *elem, size_t size,unsigned align, static void split_elem(struct malloc_elem *elem, struct malloc_elem *split_pt) { - struct malloc_elem *next_elem = RTE_PTR_ADD(elem, elem->size); + struct malloc_elem *next_elem = elem->next; const size_t old_elem_size = (uintptr_t)split_pt - (uintptr_t)elem; const size_t new_elem_size = elem->size - old_elem_size; malloc_elem_init(split_pt, elem->heap, elem->ms, new_elem_size); split_pt->prev = elem; - next_elem->prev = split_pt; + split_pt->next = next_elem; + if (next_elem) +
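As a standalone illustration of the insertion strategy used by malloc_elem_insert() above - keep the list sorted by virtual address and start the walk from whichever end is closer to the new element - here is a minimal sketch with a hypothetical node/list type (not DPDK code):

#include <stddef.h>
#include <stdint.h>

struct node {
	struct node *prev, *next;
};

struct addr_list {
	struct node *first, *last;
};

/* insert 'elem' keeping the list sorted by the nodes' addresses */
static void
list_insert_by_addr(struct addr_list *l, struct node *elem)
{
	struct node *prev, *next;

	if (l->first == NULL && l->last == NULL) {
		/* empty list */
		l->first = l->last = elem;
		prev = next = NULL;
	} else if (elem < l->first) {
		/* lower than current start */
		prev = NULL;
		next = l->first;
		l->first = elem;
	} else if (elem > l->last) {
		/* higher than current end */
		prev = l->last;
		next = NULL;
		l->last = elem;
	} else {
		/* somewhere in the middle: walk from the closer end */
		uintptr_t d_start = (uintptr_t)elem - (uintptr_t)l->first;
		uintptr_t d_end = (uintptr_t)l->last - (uintptr_t)elem;

		if (d_start < d_end) {
			prev = l->first;
			while (prev->next < elem)
				prev = prev->next;
			next = prev->next;
		} else {
			next = l->last;
			while (next->prev > elem)
				next = next->prev;
			prev = next->prev;
		}
	}

	elem->prev = prev;
	elem->next = next;
	if (prev != NULL)
		prev->next = elem;
	if (next != NULL)
		next->prev = elem;
}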
[dpdk-dev] [PATCH v2 02/41] eal: move all locking to heap
Down the line, we will need to do everything from the heap as any alloc or free may trigger alloc/free OS memory, which would involve growing/shrinking heap. Signed-off-by: Anatoly Burakov --- lib/librte_eal/common/malloc_elem.c | 16 ++-- lib/librte_eal/common/malloc_heap.c | 38 + lib/librte_eal/common/malloc_heap.h | 6 ++ lib/librte_eal/common/rte_malloc.c | 4 ++-- 4 files changed, 48 insertions(+), 16 deletions(-) diff --git a/lib/librte_eal/common/malloc_elem.c b/lib/librte_eal/common/malloc_elem.c index 0cadc8a..ea041e2 100644 --- a/lib/librte_eal/common/malloc_elem.c +++ b/lib/librte_eal/common/malloc_elem.c @@ -243,10 +243,6 @@ join_elem(struct malloc_elem *elem1, struct malloc_elem *elem2) int malloc_elem_free(struct malloc_elem *elem) { - if (!malloc_elem_cookies_ok(elem) || elem->state != ELEM_BUSY) - return -1; - - rte_spinlock_lock(&(elem->heap->lock)); size_t sz = elem->size - sizeof(*elem) - MALLOC_ELEM_TRAILER_LEN; uint8_t *ptr = (uint8_t *)&elem[1]; struct malloc_elem *next = RTE_PTR_ADD(elem, elem->size); @@ -274,8 +270,6 @@ malloc_elem_free(struct malloc_elem *elem) memset(ptr, 0, sz); - rte_spinlock_unlock(&(elem->heap->lock)); - return 0; } @@ -292,11 +286,10 @@ malloc_elem_resize(struct malloc_elem *elem, size_t size) return 0; struct malloc_elem *next = RTE_PTR_ADD(elem, elem->size); - rte_spinlock_lock(&elem->heap->lock); if (next ->state != ELEM_FREE) - goto err_return; + return -1; if (elem->size + next->size < new_size) - goto err_return; + return -1; /* we now know the element fits, so remove from free list, * join the two @@ -311,10 +304,5 @@ malloc_elem_resize(struct malloc_elem *elem, size_t size) split_elem(elem, split_pt); malloc_elem_free_list_insert(split_pt); } - rte_spinlock_unlock(&elem->heap->lock); return 0; - -err_return: - rte_spinlock_unlock(&elem->heap->lock); - return -1; } diff --git a/lib/librte_eal/common/malloc_heap.c b/lib/librte_eal/common/malloc_heap.c index 7aafc88..7d8d70a 100644 --- a/lib/librte_eal/common/malloc_heap.c +++ b/lib/librte_eal/common/malloc_heap.c @@ -145,6 +145,44 @@ malloc_heap_alloc(struct malloc_heap *heap, return elem == NULL ? 
NULL : (void *)(&elem[1]); } +int +malloc_heap_free(struct malloc_elem *elem) +{ + struct malloc_heap *heap; + int ret; + + if (!malloc_elem_cookies_ok(elem) || elem->state != ELEM_BUSY) + return -1; + + /* elem may be merged with previous element, so keep heap address */ + heap = elem->heap; + + rte_spinlock_lock(&(heap->lock)); + + ret = malloc_elem_free(elem); + + rte_spinlock_unlock(&(heap->lock)); + + return ret; +} + +int +malloc_heap_resize(struct malloc_elem *elem, size_t size) +{ + int ret; + + if (!malloc_elem_cookies_ok(elem) || elem->state != ELEM_BUSY) + return -1; + + rte_spinlock_lock(&(elem->heap->lock)); + + ret = malloc_elem_resize(elem, size); + + rte_spinlock_unlock(&(elem->heap->lock)); + + return ret; +} + /* * Function to retrieve data for heap on given socket */ diff --git a/lib/librte_eal/common/malloc_heap.h b/lib/librte_eal/common/malloc_heap.h index e0defa7..ab0005c 100644 --- a/lib/librte_eal/common/malloc_heap.h +++ b/lib/librte_eal/common/malloc_heap.h @@ -28,6 +28,12 @@ malloc_heap_alloc(struct malloc_heap *heap, const char *type, size_t size, unsigned flags, size_t align, size_t bound); int +malloc_heap_free(struct malloc_elem *elem); + +int +malloc_heap_resize(struct malloc_elem *elem, size_t size); + +int malloc_heap_get_stats(struct malloc_heap *heap, struct rte_malloc_socket_stats *socket_stats); diff --git a/lib/librte_eal/common/rte_malloc.c b/lib/librte_eal/common/rte_malloc.c index e0e0d0b..970813e 100644 --- a/lib/librte_eal/common/rte_malloc.c +++ b/lib/librte_eal/common/rte_malloc.c @@ -29,7 +29,7 @@ void rte_free(void *addr) { if (addr == NULL) return; - if (malloc_elem_free(malloc_elem_from_data(addr)) < 0) + if (malloc_heap_free(malloc_elem_from_data(addr)) < 0) rte_panic("Fatal error: Invalid memory\n"); } @@ -140,7 +140,7 @@ rte_realloc(void *ptr, size_t size, unsigned align) size = RTE_CACHE_LINE_ROUNDUP(size), align = RTE_CACHE_LINE_ROUNDUP(align); /* check alignment matches first, and if ok, see if we can resize block */ if (RTE_PTR_ALIGN(ptr,align) == ptr && - malloc_elem_resize(elem, size) == 0) + malloc_heap_resize(elem, size) == 0) return ptr; /* either alignment is off, or we have no room to expand, -- 2.7.4
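The pattern applied above is simply to hoist locking one level up: the element-level routines assume the heap lock is already held, and the heap-level entry points are the only places that take it. A minimal sketch of that wrapper shape, reusing rte_spinlock as the patch does (the heap layout and element routine are hypothetical):

#include <rte_spinlock.h>

struct my_heap {
	rte_spinlock_t lock;
	/* free lists, element list, statistics, ... */
};

/* element-level work: caller must hold heap->lock */
static int
elem_free_unlocked(struct my_heap *heap, void *elem)
{
	(void)heap;
	(void)elem;
	/* coalesce with neighbours, re-insert into a free list, ... */
	return 0;
}

/* heap-level entry point: the only place that touches the lock */
static int
my_heap_free(struct my_heap *heap, void *elem)
{
	int ret;

	rte_spinlock_lock(&heap->lock);
	ret = elem_free_unlocked(heap, elem);
	rte_spinlock_unlock(&heap->lock);

	return ret;
}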
[dpdk-dev] [PATCH v2 05/41] test: add command to dump malloc heap contents
Signed-off-by: Anatoly Burakov --- test/test/commands.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/test/test/commands.c b/test/test/commands.c index cf0b726..6bfdc02 100644 --- a/test/test/commands.c +++ b/test/test/commands.c @@ -137,6 +137,8 @@ static void cmd_dump_parsed(void *parsed_result, rte_log_dump(stdout); else if (!strcmp(res->dump, "dump_malloc_stats")) rte_malloc_dump_stats(stdout, NULL); + else if (!strcmp(res->dump, "dump_malloc_heaps")) + rte_malloc_dump_heaps(stdout); } cmdline_parse_token_string_t cmd_dump_dump = @@ -147,6 +149,7 @@ cmdline_parse_token_string_t cmd_dump_dump = "dump_ring#" "dump_mempool#" "dump_malloc_stats#" +"dump_malloc_heaps#" "dump_devargs#" "dump_log_types"); -- 2.7.4
[dpdk-dev] [PATCH v2 11/41] eal: add "legacy memory" option
This adds a "--legacy-mem" command-line switch. It will be used to go back to the old memory behavior, one where we can't dynamically allocate/free memory (the downside), but one where the user can get physically contiguous memory, like before (the upside). For now, nothing but the legacy behavior exists, non-legacy memory init sequence will be added later. Signed-off-by: Anatoly Burakov --- lib/librte_eal/bsdapp/eal/eal.c| 3 +++ lib/librte_eal/common/eal_common_options.c | 4 lib/librte_eal/common/eal_internal_cfg.h | 4 lib/librte_eal/common/eal_options.h| 2 ++ lib/librte_eal/linuxapp/eal/eal.c | 1 + lib/librte_eal/linuxapp/eal/eal_memory.c | 24 6 files changed, 34 insertions(+), 4 deletions(-) diff --git a/lib/librte_eal/bsdapp/eal/eal.c b/lib/librte_eal/bsdapp/eal/eal.c index 4eafcb5..45e5670 100644 --- a/lib/librte_eal/bsdapp/eal/eal.c +++ b/lib/librte_eal/bsdapp/eal/eal.c @@ -531,6 +531,9 @@ rte_eal_init(int argc, char **argv) return -1; } + /* FreeBSD always uses legacy memory model */ + internal_config.legacy_mem = true; + if (eal_plugins_init() < 0) { rte_eal_init_alert("Cannot init plugins\n"); rte_errno = EINVAL; diff --git a/lib/librte_eal/common/eal_common_options.c b/lib/librte_eal/common/eal_common_options.c index dbc3fb5..3e92551 100644 --- a/lib/librte_eal/common/eal_common_options.c +++ b/lib/librte_eal/common/eal_common_options.c @@ -74,6 +74,7 @@ eal_long_options[] = { {OPT_VFIO_INTR, 1, NULL, OPT_VFIO_INTR_NUM}, {OPT_VMWARE_TSC_MAP,0, NULL, OPT_VMWARE_TSC_MAP_NUM }, {OPT_SINGLE_FILE_SEGMENTS, 0, NULL, OPT_SINGLE_FILE_SEGMENTS_NUM}, + {OPT_LEGACY_MEM,0, NULL, OPT_LEGACY_MEM_NUM }, {0, 0, NULL, 0} }; @@ -1165,6 +1166,9 @@ eal_parse_common_option(int opt, const char *optarg, case OPT_SINGLE_FILE_SEGMENTS_NUM: conf->single_file_segments = 1; break; + case OPT_LEGACY_MEM_NUM: + conf->legacy_mem = 1; + break; /* don't know what to do, leave this to caller */ default: diff --git a/lib/librte_eal/common/eal_internal_cfg.h b/lib/librte_eal/common/eal_internal_cfg.h index d4c02d6..4a43de6 100644 --- a/lib/librte_eal/common/eal_internal_cfg.h +++ b/lib/librte_eal/common/eal_internal_cfg.h @@ -51,6 +51,10 @@ struct internal_config { /**< true if storing all pages within single files (per-page-size, * per-node). */ + volatile unsigned legacy_mem; + /**< true to enable legacy memory behavior (no dynamic allocation, +* contiguous segments). 
+*/ volatile int syslog_facility; /**< facility passed to openlog() */ /** default interrupt mode for VFIO */ volatile enum rte_intr_mode vfio_intr_mode; diff --git a/lib/librte_eal/common/eal_options.h b/lib/librte_eal/common/eal_options.h index a4b80d5..f9a679d 100644 --- a/lib/librte_eal/common/eal_options.h +++ b/lib/librte_eal/common/eal_options.h @@ -57,6 +57,8 @@ enum { OPT_VMWARE_TSC_MAP_NUM, #define OPT_SINGLE_FILE_SEGMENTS"single-file-segments" OPT_SINGLE_FILE_SEGMENTS_NUM, +#define OPT_LEGACY_MEM"legacy-mem" + OPT_LEGACY_MEM_NUM, OPT_LONG_MAX_NUM }; diff --git a/lib/librte_eal/linuxapp/eal/eal.c b/lib/librte_eal/linuxapp/eal/eal.c index c84e6bf..5207713 100644 --- a/lib/librte_eal/linuxapp/eal/eal.c +++ b/lib/librte_eal/linuxapp/eal/eal.c @@ -349,6 +349,7 @@ eal_usage(const char *prgname) " --"OPT_CREATE_UIO_DEV"Create /dev/uioX (usually done by hotplug)\n" " --"OPT_VFIO_INTR" Interrupt mode for VFIO (legacy|msi|msix)\n" " --"OPT_SINGLE_FILE_SEGMENTS" Put all hugepage memory in single files\n" + " --"OPT_LEGACY_MEM"Legacy memory mode (no dynamic allocation, contiguous segments)\n" "\n"); /* Allow the application to print its usage message too if hook is set */ if ( rte_application_usage_hook ) { diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c index 5c11d77..b9bcb75 100644 --- a/lib/librte_eal/linuxapp/eal/eal_memory.c +++ b/lib/librte_eal/linuxapp/eal/eal_memory.c @@ -919,8 +919,8 @@ huge_recover_sigbus(void) * 6. unmap the first mapping * 7. fill memsegs in configuration with contiguous zones */ -int -rte_eal_hugepage_init(void) +static int +eal_legacy_hugepage_init(void) { struct rte_mem_config *mcfg; struct hugepage_file *hugepage = NULL, *tmp_hp = NULL; @@ -1262,8 +1262,8 @@ getFileSize(int fd) * configuration and finds the hugepages which form that segment, mapping them * in order to form a contiguous block in the virtual memory space */ -int -rte_eal_hugepage_a
[dpdk-dev] [PATCH v2 12/41] eal: read hugepage counts from node-specific sysfs path
For non-legacy memory init mode, instead of looking at generic sysfs path, look at sysfs paths pertaining to each NUMA node for hugepage counts. Note that per-NUMA node path does not provide information regarding reserved pages, so we might not get the best info from these paths, but this saves us from the whole mapping/remapping business before we're actually able to tell which page is on which socket, because we no longer require our memory to be physically contiguous. Legacy memory init will not use this. Signed-off-by: Anatoly Burakov --- lib/librte_eal/linuxapp/eal/eal_hugepage_info.c | 79 +++-- 1 file changed, 73 insertions(+), 6 deletions(-) diff --git a/lib/librte_eal/linuxapp/eal/eal_hugepage_info.c b/lib/librte_eal/linuxapp/eal/eal_hugepage_info.c index 8bbf771..706b6d5 100644 --- a/lib/librte_eal/linuxapp/eal/eal_hugepage_info.c +++ b/lib/librte_eal/linuxapp/eal/eal_hugepage_info.c @@ -30,6 +30,7 @@ #include "eal_filesystem.h" static const char sys_dir_path[] = "/sys/kernel/mm/hugepages"; +static const char sys_pages_numa_dir_path[] = "/sys/devices/system/node"; /* this function is only called from eal_hugepage_info_init which itself * is only called from a primary process */ @@ -70,6 +71,45 @@ get_num_hugepages(const char *subdir) return num_pages; } +static uint32_t +get_num_hugepages_on_node(const char *subdir, unsigned int socket) +{ + char path[PATH_MAX], socketpath[PATH_MAX]; + DIR *socketdir; + unsigned long num_pages = 0; + const char *nr_hp_file = "free_hugepages"; + + snprintf(socketpath, sizeof(socketpath), "%s/node%u/hugepages", + sys_pages_numa_dir_path, socket); + + socketdir = opendir(socketpath); + if (socketdir) { + /* Keep calm and carry on */ + closedir(socketdir); + } else { + /* Can't find socket dir, so ignore it */ + return 0; + } + + snprintf(path, sizeof(path), "%s/%s/%s", + socketpath, subdir, nr_hp_file); + if (eal_parse_sysfs_value(path, &num_pages) < 0) + return 0; + + if (num_pages == 0) + RTE_LOG(WARNING, EAL, "No free hugepages reported in %s\n", + subdir); + + /* +* we want to return a uint32_t and more than this looks suspicious +* anyway ... 
+*/ + if (num_pages > UINT32_MAX) + num_pages = UINT32_MAX; + + return num_pages; +} + static uint64_t get_default_hp_size(void) { @@ -248,7 +288,7 @@ eal_hugepage_info_init(void) { const char dirent_start_text[] = "hugepages-"; const size_t dirent_start_len = sizeof(dirent_start_text) - 1; - unsigned i, num_sizes = 0; + unsigned int i, total_pages, num_sizes = 0; DIR *dir; struct dirent *dirent; @@ -302,9 +342,27 @@ eal_hugepage_info_init(void) if (clear_hugedir(hpi->hugedir) == -1) break; - /* for now, put all pages into socket 0, -* later they will be sorted */ - hpi->num_pages[0] = get_num_hugepages(dirent->d_name); + /* +* first, try to put all hugepages into relevant sockets, but +* if first attempts fails, fall back to collecting all pages +* in one socket and sorting them later +*/ + total_pages = 0; + /* we also don't want to do this for legacy init */ + if (!internal_config.legacy_mem) + for (i = 0; i < rte_num_sockets(); i++) { + unsigned int num_pages = + get_num_hugepages_on_node( + dirent->d_name, i); + hpi->num_pages[i] = num_pages; + total_pages += num_pages; + } + /* +* we failed to sort memory from the get go, so fall +* back to old way +*/ + if (total_pages == 0) + hpi->num_pages[0] = get_num_hugepages(dirent->d_name); #ifndef RTE_ARCH_64 /* for 32-bit systems, limit number of hugepages to @@ -328,10 +386,19 @@ eal_hugepage_info_init(void) sizeof(internal_config.hugepage_info[0]), compare_hpi); /* now we have all info, check we have at least one valid size */ - for (i = 0; i < num_sizes; i++) + for (i = 0; i < num_sizes; i++) { + /* pages may no longer all be on socket 0, so check all */ + unsigned int j, num_pages = 0; + + for (j = 0; j < RTE_MAX_NUMA_NODES; j++) { + struct hugepage_info *hpi = + &internal_config.hugepage_info[i]; +
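For reference, reading a per-node count boils down to composing the sysfs path and parsing a single integer. A standalone sketch using nothing but libc (the path layout is the standard Linux one; error handling is reduced to returning 0):

#include <stdio.h>

/* read free_hugepages for a given page size (in kB) on a given NUMA node,
 * e.g. /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages
 */
static unsigned long
free_hugepages_on_node(unsigned int node, unsigned int pagesz_kb)
{
	char path[256];
	unsigned long num = 0;
	FILE *f;

	snprintf(path, sizeof(path),
		"/sys/devices/system/node/node%u/hugepages/hugepages-%ukB/free_hugepages",
		node, pagesz_kb);

	f = fopen(path, "r");
	if (f == NULL)
		return 0;	/* node or page size not present */
	if (fscanf(f, "%lu", &num) != 1)
		num = 0;
	fclose(f);

	return num;
}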
[dpdk-dev] [PATCH v2 07/41] eal: make malloc free list remove public
Signed-off-by: Anatoly Burakov --- lib/librte_eal/common/malloc_elem.c | 12 ++-- lib/librte_eal/common/malloc_elem.h | 3 +++ 2 files changed, 9 insertions(+), 6 deletions(-) diff --git a/lib/librte_eal/common/malloc_elem.c b/lib/librte_eal/common/malloc_elem.c index 2291ee1..008f5a3 100644 --- a/lib/librte_eal/common/malloc_elem.c +++ b/lib/librte_eal/common/malloc_elem.c @@ -245,8 +245,8 @@ malloc_elem_free_list_insert(struct malloc_elem *elem) /* * Remove the specified element from its heap's free list. */ -static void -elem_free_list_remove(struct malloc_elem *elem) +void +malloc_elem_free_list_remove(struct malloc_elem *elem) { LIST_REMOVE(elem, free_list); } @@ -266,7 +266,7 @@ malloc_elem_alloc(struct malloc_elem *elem, size_t size, unsigned align, const size_t trailer_size = elem->size - old_elem_size - size - MALLOC_ELEM_OVERHEAD; - elem_free_list_remove(elem); + malloc_elem_free_list_remove(elem); if (trailer_size > MALLOC_ELEM_OVERHEAD + MIN_DATA_SIZE) { /* split it, too much free space after elem */ @@ -340,7 +340,7 @@ malloc_elem_join_adjacent_free(struct malloc_elem *elem) erase = RTE_PTR_SUB(elem->next, MALLOC_ELEM_TRAILER_LEN); /* remove from free list, join to this one */ - elem_free_list_remove(elem->next); + malloc_elem_free_list_remove(elem->next); join_elem(elem, elem->next); /* erase header and trailer */ @@ -360,7 +360,7 @@ malloc_elem_join_adjacent_free(struct malloc_elem *elem) erase = RTE_PTR_SUB(elem, MALLOC_ELEM_TRAILER_LEN); /* remove from free list, join to this one */ - elem_free_list_remove(elem->prev); + malloc_elem_free_list_remove(elem->prev); new_elem = elem->prev; join_elem(new_elem, elem); @@ -423,7 +423,7 @@ malloc_elem_resize(struct malloc_elem *elem, size_t size) /* we now know the element fits, so remove from free list, * join the two */ - elem_free_list_remove(elem->next); + malloc_elem_free_list_remove(elem->next); join_elem(elem, elem->next); if (elem->size - new_size >= MIN_DATA_SIZE + MALLOC_ELEM_OVERHEAD) { diff --git a/lib/librte_eal/common/malloc_elem.h b/lib/librte_eal/common/malloc_elem.h index 99921d2..46e2383 100644 --- a/lib/librte_eal/common/malloc_elem.h +++ b/lib/librte_eal/common/malloc_elem.h @@ -151,6 +151,9 @@ malloc_elem_join_adjacent_free(struct malloc_elem *elem); int malloc_elem_resize(struct malloc_elem *elem, size_t size); +void +malloc_elem_free_list_remove(struct malloc_elem *elem); + /* * dump contents of malloc elem to a file. */ -- 2.7.4
[dpdk-dev] [PATCH v2 08/41] eal: make malloc free return resulting malloc element
Signed-off-by: Anatoly Burakov --- lib/librte_eal/common/malloc_elem.c | 4 ++-- lib/librte_eal/common/malloc_elem.h | 2 +- lib/librte_eal/common/malloc_heap.c | 4 ++-- 3 files changed, 5 insertions(+), 5 deletions(-) diff --git a/lib/librte_eal/common/malloc_elem.c b/lib/librte_eal/common/malloc_elem.c index 008f5a3..c18f050 100644 --- a/lib/librte_eal/common/malloc_elem.c +++ b/lib/librte_eal/common/malloc_elem.c @@ -379,7 +379,7 @@ malloc_elem_join_adjacent_free(struct malloc_elem *elem) * blocks either immediately before or immediately after newly freed block * are also free, the blocks are merged together. */ -int +struct malloc_elem * malloc_elem_free(struct malloc_elem *elem) { void *ptr; @@ -397,7 +397,7 @@ malloc_elem_free(struct malloc_elem *elem) memset(ptr, 0, data_len); - return 0; + return elem; } /* diff --git a/lib/librte_eal/common/malloc_elem.h b/lib/librte_eal/common/malloc_elem.h index 46e2383..9c1614c 100644 --- a/lib/librte_eal/common/malloc_elem.h +++ b/lib/librte_eal/common/malloc_elem.h @@ -138,7 +138,7 @@ malloc_elem_alloc(struct malloc_elem *elem, size_t size, * blocks either immediately before or immediately after newly freed block * are also free, the blocks are merged together. */ -int +struct malloc_elem * malloc_elem_free(struct malloc_elem *elem); struct malloc_elem * diff --git a/lib/librte_eal/common/malloc_heap.c b/lib/librte_eal/common/malloc_heap.c index 44538d7..a2c2e4c 100644 --- a/lib/librte_eal/common/malloc_heap.c +++ b/lib/librte_eal/common/malloc_heap.c @@ -145,7 +145,7 @@ int malloc_heap_free(struct malloc_elem *elem) { struct malloc_heap *heap; - int ret; + struct malloc_elem *ret; if (!malloc_elem_cookies_ok(elem) || elem->state != ELEM_BUSY) return -1; @@ -159,7 +159,7 @@ malloc_heap_free(struct malloc_elem *elem) rte_spinlock_unlock(&(heap->lock)); - return ret; + return ret != NULL ? 0 : -1; } int -- 2.7.4
[dpdk-dev] [PATCH v2 10/41] eal: add "single file segments" command-line option
For now, this option does nothing, but it will be useful in dynamic memory allocation down the line. Currently, DPDK stores all pages as separate files in hugetlbfs. This option will allow storing all pages in one file (one file per socket, per page size). Signed-off-by: Anatoly Burakov --- lib/librte_eal/common/eal_common_options.c | 4 lib/librte_eal/common/eal_internal_cfg.h | 4 lib/librte_eal/common/eal_options.h| 2 ++ lib/librte_eal/linuxapp/eal/eal.c | 1 + 4 files changed, 11 insertions(+) diff --git a/lib/librte_eal/common/eal_common_options.c b/lib/librte_eal/common/eal_common_options.c index 0be80cb..dbc3fb5 100644 --- a/lib/librte_eal/common/eal_common_options.c +++ b/lib/librte_eal/common/eal_common_options.c @@ -73,6 +73,7 @@ eal_long_options[] = { {OPT_VDEV, 1, NULL, OPT_VDEV_NUM }, {OPT_VFIO_INTR, 1, NULL, OPT_VFIO_INTR_NUM}, {OPT_VMWARE_TSC_MAP,0, NULL, OPT_VMWARE_TSC_MAP_NUM }, + {OPT_SINGLE_FILE_SEGMENTS, 0, NULL, OPT_SINGLE_FILE_SEGMENTS_NUM}, {0, 0, NULL, 0} }; @@ -1161,6 +1162,9 @@ eal_parse_common_option(int opt, const char *optarg, core_parsed = LCORE_OPT_MAP; break; + case OPT_SINGLE_FILE_SEGMENTS_NUM: + conf->single_file_segments = 1; + break; /* don't know what to do, leave this to caller */ default: diff --git a/lib/librte_eal/common/eal_internal_cfg.h b/lib/librte_eal/common/eal_internal_cfg.h index a0082d1..d4c02d6 100644 --- a/lib/librte_eal/common/eal_internal_cfg.h +++ b/lib/librte_eal/common/eal_internal_cfg.h @@ -47,6 +47,10 @@ struct internal_config { volatile unsigned force_sockets; volatile uint64_t socket_mem[RTE_MAX_NUMA_NODES]; /**< amount of memory per socket */ uintptr_t base_virtaddr; /**< base address to try and reserve memory from */ + volatile unsigned single_file_segments; + /**< true if storing all pages within single files (per-page-size, +* per-node). +*/ volatile int syslog_facility; /**< facility passed to openlog() */ /** default interrupt mode for VFIO */ volatile enum rte_intr_mode vfio_intr_mode; diff --git a/lib/librte_eal/common/eal_options.h b/lib/librte_eal/common/eal_options.h index e86c711..a4b80d5 100644 --- a/lib/librte_eal/common/eal_options.h +++ b/lib/librte_eal/common/eal_options.h @@ -55,6 +55,8 @@ enum { OPT_VFIO_INTR_NUM, #define OPT_VMWARE_TSC_MAP"vmware-tsc-map" OPT_VMWARE_TSC_MAP_NUM, +#define OPT_SINGLE_FILE_SEGMENTS"single-file-segments" + OPT_SINGLE_FILE_SEGMENTS_NUM, OPT_LONG_MAX_NUM }; diff --git a/lib/librte_eal/linuxapp/eal/eal.c b/lib/librte_eal/linuxapp/eal/eal.c index 2ecd07b..c84e6bf 100644 --- a/lib/librte_eal/linuxapp/eal/eal.c +++ b/lib/librte_eal/linuxapp/eal/eal.c @@ -348,6 +348,7 @@ eal_usage(const char *prgname) " --"OPT_BASE_VIRTADDR" Base virtual address\n" " --"OPT_CREATE_UIO_DEV"Create /dev/uioX (usually done by hotplug)\n" " --"OPT_VFIO_INTR" Interrupt mode for VFIO (legacy|msi|msix)\n" + " --"OPT_SINGLE_FILE_SEGMENTS" Put all hugepage memory in single files\n" "\n"); /* Allow the application to print its usage message too if hook is set */ if ( rte_application_usage_hook ) { -- 2.7.4
[dpdk-dev] [PATCH v2 04/41] eal: add function to dump malloc heap contents
Malloc heap is now a doubly linked list, so it's now possible to iterate over each malloc element regardless of its state. Signed-off-by: Anatoly Burakov --- lib/librte_eal/common/include/rte_malloc.h | 9 + lib/librte_eal/common/malloc_elem.c| 24 lib/librte_eal/common/malloc_elem.h| 6 ++ lib/librte_eal/common/malloc_heap.c| 22 ++ lib/librte_eal/common/malloc_heap.h| 3 +++ lib/librte_eal/common/rte_malloc.c | 16 lib/librte_eal/rte_eal_version.map | 1 + 7 files changed, 81 insertions(+) diff --git a/lib/librte_eal/common/include/rte_malloc.h b/lib/librte_eal/common/include/rte_malloc.h index f02a8ba..a3fc83e 100644 --- a/lib/librte_eal/common/include/rte_malloc.h +++ b/lib/librte_eal/common/include/rte_malloc.h @@ -278,6 +278,15 @@ void rte_malloc_dump_stats(FILE *f, const char *type); /** + * Dump contents of all malloc heaps to a file. + * + * @param f + * A pointer to a file for output + */ +void +rte_malloc_dump_heaps(FILE *f); + +/** * Set the maximum amount of allocated memory for this type. * * This is not yet implemented diff --git a/lib/librte_eal/common/malloc_elem.c b/lib/librte_eal/common/malloc_elem.c index eb41200..e02ed88 100644 --- a/lib/librte_eal/common/malloc_elem.c +++ b/lib/librte_eal/common/malloc_elem.c @@ -1,6 +1,7 @@ /* SPDX-License-Identifier: BSD-3-Clause * Copyright(c) 2010-2014 Intel Corporation */ +#include #include #include #include @@ -434,3 +435,26 @@ malloc_elem_resize(struct malloc_elem *elem, size_t size) } return 0; } + +static inline const char * +elem_state_to_str(enum elem_state state) +{ + switch (state) { + case ELEM_PAD: + return "PAD"; + case ELEM_BUSY: + return "BUSY"; + case ELEM_FREE: + return "FREE"; + } + return "ERROR"; +} + +void +malloc_elem_dump(const struct malloc_elem *elem, FILE *f) +{ + fprintf(f, "Malloc element at %p (%s)\n", elem, + elem_state_to_str(elem->state)); + fprintf(f, " len: 0x%zx pad: 0x%" PRIx32 "\n", elem->size, elem->pad); + fprintf(f, " prev: %p next: %p\n", elem->prev, elem->next); +} diff --git a/lib/librte_eal/common/malloc_elem.h b/lib/librte_eal/common/malloc_elem.h index 238e451..40e8eb5 100644 --- a/lib/librte_eal/common/malloc_elem.h +++ b/lib/librte_eal/common/malloc_elem.h @@ -149,6 +149,12 @@ int malloc_elem_resize(struct malloc_elem *elem, size_t size); /* + * dump contents of malloc elem to a file. + */ +void +malloc_elem_dump(const struct malloc_elem *elem, FILE *f); + +/* * Given an element size, compute its freelist index. 
*/ size_t diff --git a/lib/librte_eal/common/malloc_heap.c b/lib/librte_eal/common/malloc_heap.c index 9c95166..44538d7 100644 --- a/lib/librte_eal/common/malloc_heap.c +++ b/lib/librte_eal/common/malloc_heap.c @@ -217,6 +217,28 @@ malloc_heap_get_stats(struct malloc_heap *heap, return 0; } +/* + * Function to retrieve data for heap on given socket + */ +void +malloc_heap_dump(struct malloc_heap *heap, FILE *f) +{ + struct malloc_elem *elem; + + rte_spinlock_lock(&heap->lock); + + fprintf(f, "Heap size: 0x%zx\n", heap->total_size); + fprintf(f, "Heap alloc count: %u\n", heap->alloc_count); + + elem = heap->first; + while (elem) { + malloc_elem_dump(elem, f); + elem = elem->next; + } + + rte_spinlock_unlock(&heap->lock); +} + int rte_eal_malloc_heap_init(void) { diff --git a/lib/librte_eal/common/malloc_heap.h b/lib/librte_eal/common/malloc_heap.h index ab0005c..bb28422 100644 --- a/lib/librte_eal/common/malloc_heap.h +++ b/lib/librte_eal/common/malloc_heap.h @@ -37,6 +37,9 @@ int malloc_heap_get_stats(struct malloc_heap *heap, struct rte_malloc_socket_stats *socket_stats); +void +malloc_heap_dump(struct malloc_heap *heap, FILE *f); + int rte_eal_malloc_heap_init(void); diff --git a/lib/librte_eal/common/rte_malloc.c b/lib/librte_eal/common/rte_malloc.c index 970813e..80fb6cc 100644 --- a/lib/librte_eal/common/rte_malloc.c +++ b/lib/librte_eal/common/rte_malloc.c @@ -182,6 +182,22 @@ rte_malloc_get_socket_stats(int socket, } /* + * Function to dump contents of all heaps + */ +void +rte_malloc_dump_heaps(FILE *f) +{ + struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; + unsigned int socket; + + for (socket = 0; socket < rte_num_sockets(); socket++) { + fprintf(f, "Heap on socket %i:\n", socket); + malloc_heap_dump(&mcfg->malloc_heaps[socket], f); + } + +} + +/* * Print stats on memory type. If type is NULL, info on all types is printed */ void diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map index 52f5940..18b8bf5 100644 --- a/lib/librte_eal/rte_eal_version.map +++ b/l
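From an application's point of view the new dump is a one-liner; a minimal usage sketch, assuming EAL has already been initialized:

#include <stdio.h>
#include <rte_malloc.h>

static void
show_heaps(void)
{
	/* walks every socket's heap under its lock and prints each malloc
	 * element's state, size, padding and prev/next pointers */
	rte_malloc_dump_heaps(stdout);
}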
[dpdk-dev] [PATCH v2 06/41] eal: make malloc_elem_join_adjacent_free public
We need this function to join newly allocated segments with the heap. Signed-off-by: Anatoly Burakov --- lib/librte_eal/common/malloc_elem.c | 6 +++--- lib/librte_eal/common/malloc_elem.h | 3 +++ 2 files changed, 6 insertions(+), 3 deletions(-) diff --git a/lib/librte_eal/common/malloc_elem.c b/lib/librte_eal/common/malloc_elem.c index e02ed88..2291ee1 100644 --- a/lib/librte_eal/common/malloc_elem.c +++ b/lib/librte_eal/common/malloc_elem.c @@ -325,8 +325,8 @@ join_elem(struct malloc_elem *elem1, struct malloc_elem *elem2) elem1->next = next; } -static struct malloc_elem * -elem_join_adjacent_free(struct malloc_elem *elem) +struct malloc_elem * +malloc_elem_join_adjacent_free(struct malloc_elem *elem) { /* * check if next element exists, is adjacent and is free, if so join @@ -388,7 +388,7 @@ malloc_elem_free(struct malloc_elem *elem) ptr = RTE_PTR_ADD(elem, sizeof(*elem)); data_len = elem->size - MALLOC_ELEM_OVERHEAD; - elem = elem_join_adjacent_free(elem); + elem = malloc_elem_join_adjacent_free(elem); malloc_elem_free_list_insert(elem); diff --git a/lib/librte_eal/common/malloc_elem.h b/lib/librte_eal/common/malloc_elem.h index 40e8eb5..99921d2 100644 --- a/lib/librte_eal/common/malloc_elem.h +++ b/lib/librte_eal/common/malloc_elem.h @@ -141,6 +141,9 @@ malloc_elem_alloc(struct malloc_elem *elem, size_t size, int malloc_elem_free(struct malloc_elem *elem); +struct malloc_elem * +malloc_elem_join_adjacent_free(struct malloc_elem *elem); + /* * attempt to resize a malloc_elem by expanding into any free space * immediately after it in memory. -- 2.7.4
[dpdk-dev] [PATCH v2 18/41] test: fix malloc autotest to support memory hotplug
The test was expecting memory already being allocated on all sockets, and thus was failing because calling rte_malloc could trigger memory hotplug event and allocate memory where there was none before. Fix it to instead report availability of memory on specific sockets by attempting to allocate a page and see if that succeeds. Technically, this can still cause failure as memory might not be available at the time of check, but become available by the time the test is run, but this is a corner case not worth considering. Signed-off-by: Anatoly Burakov --- test/test/test_malloc.c | 52 + 1 file changed, 44 insertions(+), 8 deletions(-) diff --git a/test/test/test_malloc.c b/test/test/test_malloc.c index 8484fb6..2aaf1b8 100644 --- a/test/test/test_malloc.c +++ b/test/test/test_malloc.c @@ -22,6 +22,8 @@ #include #include +#include "../../lib/librte_eal/common/eal_memalloc.h" + #include "test.h" #define N 1 @@ -708,22 +710,56 @@ test_malloc_bad_params(void) /* Check if memory is avilable on a specific socket */ static int -is_mem_on_socket(int32_t socket) +is_mem_on_socket(unsigned int socket) { + struct rte_malloc_socket_stats stats; const struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; - unsigned i; + uint64_t prev_pgsz; + unsigned int i; + + /* we cannot know if there's memory on a specific socket, since it might +* be available, but not yet allocated. so, in addition to checking +* already mapped memory, we will attempt to allocate a page from that +* socket and see if it works. +*/ + if (socket >= rte_num_sockets()) + return 0; + rte_malloc_get_socket_stats(socket, &stats); + + /* if heap has memory allocated, stop */ + if (stats.heap_totalsz_bytes > 0) + return 1; + + /* to allocate a page, we will have to know its size, so go through all +* supported page sizes and try with each one. +*/ + prev_pgsz = 0; for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) { - const struct rte_memseg_list *msl = - &mcfg->memsegs[i]; - const struct rte_fbarray *arr = &msl->memseg_arr; + const struct rte_memseg_list *msl = &mcfg->memsegs[i]; + uint64_t page_sz; - if (msl->socket_id != socket) + /* skip unused memseg lists */ + if (msl->memseg_arr.len == 0) continue; + page_sz = msl->hugepage_sz; - if (arr->count) - return 1; + /* skip page sizes we've tried already */ + if (prev_pgsz == page_sz) + continue; + + prev_pgsz = page_sz; + + struct rte_memseg *ms = eal_memalloc_alloc_page(page_sz, + socket); + + if (ms == NULL) + continue; + + eal_memalloc_free_page(ms); + + return 1; } return 0; } -- 2.7.4
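The same "probe by allocating" idea can also be expressed with the public rte_malloc API: attempt a socket-bound allocation and free it straight away. A rough sketch (as the commit message notes, this only proves memory was allocatable at that instant):

#include <rte_malloc.h>

/* return 1 if an allocation on 'socket' succeeds right now, 0 otherwise */
static int
socket_has_memory(int socket)
{
	void *p = rte_malloc_socket(NULL, 4096, 0, socket);

	if (p == NULL)
		return 0;
	rte_free(p);
	return 1;
}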
[dpdk-dev] [PATCH v2 09/41] eal: add rte_fbarray
rte_fbarray is a simple indexed array stored in shared memory by mapping a file into memory. The rationale for its existence is the following: since we are going to map memory page-by-page, there could be quite a lot of memory segments to keep track of (for smaller page sizes, the page count can easily reach thousands). We can't really make page lists truly dynamic and infinitely expandable, because that would involve reallocating memory (which is a big no-no in multiprocess). What we can do instead is set the maximum capacity to something really, really large, and decide at allocation time how big the array is going to be. We map the entire file into memory, which makes it possible to use fbarray as shared memory, provided the structure itself is allocated in shared memory. Per-fbarray locking is also used to avoid index data races (but not contents data races - synchronizing those is up to the user application). In addition, since we will frequently need to scan this array for free space and iterating over the array linearly can become slow, rte_fbarray provides facilities to index the array's usage. The following use cases are covered:
- find the next free/used slot (useful either for adding new elements to the fbarray, or for walking the list)
- find the starting index for the next N free/used slots (useful when we want to allocate a chunk of VA-contiguous memory composed of several pages)
- find how many contiguous free/used slots there are, starting from a specified index (useful when we want to figure out how many pages we have until the next hole in allocated memory, to speed up bulk operations where we would otherwise have to walk the array and add pages one by one)
This is accomplished by storing a usage mask in memory, right after the data section of the array, and using some bit-level magic to figure out the information we need. Signed-off-by: Anatoly Burakov --- Notes: The initial version of this had resizing capability; however, it was removed because, in a multiprocess scenario, each fbarray would have its own view of the mapped memory, which might not correspond with the others' if some other process performed a resize that the current process didn't know about. It was therefore decided that, to avoid the cost of synchronization on each and every operation (to make sure the array wasn't resized), the resizing feature should be dropped.
lib/librte_eal/bsdapp/eal/Makefile | 1 + lib/librte_eal/common/Makefile | 2 +- lib/librte_eal/common/eal_common_fbarray.c | 859 lib/librte_eal/common/eal_filesystem.h | 13 + lib/librte_eal/common/include/rte_fbarray.h | 352 lib/librte_eal/common/meson.build | 2 + lib/librte_eal/linuxapp/eal/Makefile| 1 + lib/librte_eal/rte_eal_version.map | 17 + 8 files changed, 1246 insertions(+), 1 deletion(-) create mode 100644 lib/librte_eal/common/eal_common_fbarray.c create mode 100644 lib/librte_eal/common/include/rte_fbarray.h diff --git a/lib/librte_eal/bsdapp/eal/Makefile b/lib/librte_eal/bsdapp/eal/Makefile index ed1d17b..1b43d77 100644 --- a/lib/librte_eal/bsdapp/eal/Makefile +++ b/lib/librte_eal/bsdapp/eal/Makefile @@ -53,6 +53,7 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_BSDAPP) += eal_common_dev.c SRCS-$(CONFIG_RTE_EXEC_ENV_BSDAPP) += eal_common_options.c SRCS-$(CONFIG_RTE_EXEC_ENV_BSDAPP) += eal_common_thread.c SRCS-$(CONFIG_RTE_EXEC_ENV_BSDAPP) += eal_common_proc.c +SRCS-$(CONFIG_RTE_EXEC_ENV_BSDAPP) += eal_common_fbarray.c SRCS-$(CONFIG_RTE_EXEC_ENV_BSDAPP) += rte_malloc.c SRCS-$(CONFIG_RTE_EXEC_ENV_BSDAPP) += malloc_elem.c SRCS-$(CONFIG_RTE_EXEC_ENV_BSDAPP) += malloc_heap.c diff --git a/lib/librte_eal/common/Makefile b/lib/librte_eal/common/Makefile index ea824a3..48f870f 100644 --- a/lib/librte_eal/common/Makefile +++ b/lib/librte_eal/common/Makefile @@ -16,7 +16,7 @@ INC += rte_pci_dev_feature_defs.h rte_pci_dev_features.h INC += rte_malloc.h rte_keepalive.h rte_time.h INC += rte_service.h rte_service_component.h INC += rte_bitmap.h rte_vfio.h rte_hypervisor.h rte_test.h -INC += rte_reciprocal.h +INC += rte_reciprocal.h rte_fbarray.h GENERIC_INC := rte_atomic.h rte_byteorder.h rte_cycles.h rte_prefetch.h GENERIC_INC += rte_spinlock.h rte_memcpy.h rte_cpuflags.h rte_rwlock.h diff --git a/lib/librte_eal/common/eal_common_fbarray.c b/lib/librte_eal/common/eal_common_fbarray.c new file mode 100644 index 000..76d86c3 --- /dev/null +++ b/lib/librte_eal/common/eal_common_fbarray.c @@ -0,0 +1,859 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2017-2018 Intel Corporation + */ + +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include + +#include "eal_filesystem.h" +#include "eal_private.h" + +#include "rte_fbarray.h" + +#define MASK_SHIFT 6ULL +#define MASK_ALIGN (1 << MASK_SHIFT) +#define MASK_LEN_TO_IDX(x) ((x) >> MASK_SHIFT) +#define MA
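The "bit-level magic" mentioned in the commit message is conventional bitmap scanning: the usage mask is an array of 64-bit words, and finding the next used (or free) slot means masking off the bits below the start index and taking the first set bit of the first non-zero word. A simplified standalone sketch of that idea (not the actual rte_fbarray code; it assumes bits past nbits in the last word are zero):

#include <stdint.h>

#define BITS_PER_WORD 64

/* find the first set bit at index >= start, or -1 if there is none */
static int
find_next_set(const uint64_t *mask, unsigned int nbits, unsigned int start)
{
	unsigned int word = start / BITS_PER_WORD;
	unsigned int nwords = (nbits + BITS_PER_WORD - 1) / BITS_PER_WORD;
	uint64_t cur;

	if (start >= nbits)
		return -1;

	/* ignore bits below 'start' in the first word */
	cur = mask[word] & ~((1ULL << (start % BITS_PER_WORD)) - 1);

	for (;;) {
		if (cur != 0)
			return (int)(word * BITS_PER_WORD + __builtin_ctzll(cur));
		if (++word == nwords)
			return -1;
		cur = mask[word];
	}
}

Finding a run of N contiguous set (or clear) bits works the same way, continuing the scan across word boundaries until the run is long enough.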
[dpdk-dev] [PATCH v2 16/41] eal: make use of memory hotplug for init
Add a new (non-legacy) memory init path for EAL. It uses the new memory hotplug facilities, although it's only being run at startup. If no -m or --socket-mem switches were specified, the new init will not allocate anything, whereas if those switches were passed, appropriate amounts of pages would be requested, just like for legacy init. Since rte_malloc support for dynamic allocation comes in later patches, running DPDK without --socket-mem or -m switches will fail in this patch. Also, allocated pages will be physically discontiguous (or rather, they're not guaranteed to be physically contiguous - they may still be, by accident) unless IOVA_AS_VA mode is used. Since memory hotplug subsystem relies on partial file locking, replace flock() locks with fcntl() locks. Signed-off-by: Anatoly Burakov --- Notes: This commit shows "the wolrd as it could have been". All of this other monstrous amount of code in eal_memory.c is there because of legacy init option. Do we *really* want to keep it around, and make DPDK init and memory system suffer from split personality? lib/librte_eal/linuxapp/eal/eal_hugepage_info.c | 25 - lib/librte_eal/linuxapp/eal/eal_memory.c| 74 +++-- 2 files changed, 92 insertions(+), 7 deletions(-) diff --git a/lib/librte_eal/linuxapp/eal/eal_hugepage_info.c b/lib/librte_eal/linuxapp/eal/eal_hugepage_info.c index 706b6d5..7e2475f 100644 --- a/lib/librte_eal/linuxapp/eal/eal_hugepage_info.c +++ b/lib/librte_eal/linuxapp/eal/eal_hugepage_info.c @@ -15,6 +15,7 @@ #include #include #include +#include #include #include @@ -200,6 +201,18 @@ get_hugepage_dir(uint64_t hugepage_sz) } /* + * uses fstat to report the size of a file on disk + */ +static off_t +getFileSize(int fd) +{ + struct stat st; + if (fstat(fd, &st) < 0) + return 0; + return st.st_size; +} + +/* * Clear the hugepage directory of whatever hugepage files * there are. Checks if the file is locked (i.e. * if it's in use by another DPDK process). @@ -229,6 +242,8 @@ clear_hugedir(const char * hugedir) } while(dirent != NULL){ + struct flock lck = {0}; + /* skip files that don't match the hugepage pattern */ if (fnmatch(filter, dirent->d_name, 0) > 0) { dirent = readdir(dir); @@ -245,11 +260,17 @@ clear_hugedir(const char * hugedir) } /* non-blocking lock */ - lck_result = flock(fd, LOCK_EX | LOCK_NB); + lck.l_type = F_RDLCK; + lck.l_whence = SEEK_SET; + lck.l_start = 0; + lck.l_len = getFileSize(fd); + + lck_result = fcntl(fd, F_SETLK, &lck); /* if lock succeeds, unlock and remove the file */ if (lck_result != -1) { - flock(fd, LOCK_UN); + lck.l_type = F_UNLCK; + fcntl(fd, F_SETLK, &lck); unlinkat(dir_fd, dirent->d_name, 0); } close (fd); diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c index 9512da9..e0b4988 100644 --- a/lib/librte_eal/linuxapp/eal/eal_memory.c +++ b/lib/librte_eal/linuxapp/eal/eal_memory.c @@ -40,6 +40,7 @@ #include #include "eal_private.h" +#include "eal_memalloc.h" #include "eal_internal_cfg.h" #include "eal_filesystem.h" #include "eal_hugepages.h" @@ -260,6 +261,7 @@ map_all_hugepages(struct hugepage_file *hugepg_tbl, struct hugepage_info *hpi, void *virtaddr; void *vma_addr = NULL; size_t vma_len = 0; + struct flock lck = {0}; #ifdef RTE_EAL_NUMA_AWARE_HUGEPAGES int node_id = -1; int essential_prev = 0; @@ -434,8 +436,12 @@ map_all_hugepages(struct hugepage_file *hugepg_tbl, struct hugepage_info *hpi, } - /* set shared flock on the file. */ - if (flock(fd, LOCK_SH | LOCK_NB) == -1) { + /* set shared lock on the file. 
*/ + lck.l_type = F_RDLCK; + lck.l_whence = SEEK_SET; + lck.l_start = 0; + lck.l_len = hugepage_sz; + if (fcntl(fd, F_SETLK, &lck) == -1) { RTE_LOG(DEBUG, EAL, "%s(): Locking file failed:%s \n", __func__, strerror(errno)); close(fd); @@ -1300,6 +1306,62 @@ eal_legacy_hugepage_init(void) return -1; } +static int +eal_hugepage_init(void) +{ + struct hugepage_info used_hp[MAX_HUGEPAGE_SIZES]; + uint64_t memory[RTE_MAX_NUMA_NODES]; + int hp_sz_idx, socket_id; + + test_phys_addrs_available(); + + memset(used_hp, 0, sizeof(used_hp)); + + for (hp_sz_idx = 0; + hp_sz_idx < (int) internal_config.num_hugepage_sizes; + hp_sz_idx++) { +
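The reason for trading flock() for fcntl() is that fcntl() locks can cover a byte range rather than the whole file, which is what per-page locking inside a single per-size segment file requires. A minimal sketch of taking and releasing a shared lock over one such range, using only standard POSIX calls:

#include <fcntl.h>
#include <unistd.h>

/* non-blocking shared (read) lock on [offset, offset + len) of 'fd';
 * returns 0 on success, -1 if another process holds a conflicting lock */
static int
lock_range_shared(int fd, off_t offset, off_t len)
{
	struct flock lck = { 0 };

	lck.l_type = F_RDLCK;
	lck.l_whence = SEEK_SET;
	lck.l_start = offset;
	lck.l_len = len;

	return fcntl(fd, F_SETLK, &lck);
}

static int
unlock_range(int fd, off_t offset, off_t len)
{
	struct flock lck = { 0 };

	lck.l_type = F_UNLCK;
	lck.l_whence = SEEK_SET;
	lck.l_start = offset;
	lck.l_len = len;

	return fcntl(fd, F_SETLK, &lck);
}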
[dpdk-dev] [PATCH v2 20/41] eal: add backend support for contiguous allocation
No major changes, just add some checks in a few key places, and a new parameter to pass around. Signed-off-by: Anatoly Burakov --- lib/librte_eal/common/eal_common_memzone.c | 20 +++--- lib/librte_eal/common/malloc_elem.c| 101 ++--- lib/librte_eal/common/malloc_elem.h| 4 +- lib/librte_eal/common/malloc_heap.c| 57 ++-- lib/librte_eal/common/malloc_heap.h| 4 +- lib/librte_eal/common/rte_malloc.c | 6 +- 6 files changed, 134 insertions(+), 58 deletions(-) diff --git a/lib/librte_eal/common/eal_common_memzone.c b/lib/librte_eal/common/eal_common_memzone.c index 718dee8..75c7dd9 100644 --- a/lib/librte_eal/common/eal_common_memzone.c +++ b/lib/librte_eal/common/eal_common_memzone.c @@ -98,7 +98,8 @@ find_heap_max_free_elem(int *s, unsigned align) static const struct rte_memzone * memzone_reserve_aligned_thread_unsafe(const char *name, size_t len, - int socket_id, unsigned flags, unsigned align, unsigned bound) + int socket_id, unsigned int flags, unsigned int align, + unsigned int bound, bool contig) { struct rte_memzone *mz; struct rte_mem_config *mcfg; @@ -182,7 +183,7 @@ memzone_reserve_aligned_thread_unsafe(const char *name, size_t len, /* allocate memory on heap */ void *mz_addr = malloc_heap_alloc(NULL, requested_len, socket_id, flags, - align, bound); + align, bound, contig); if (mz_addr == NULL) { rte_errno = ENOMEM; @@ -215,9 +216,9 @@ memzone_reserve_aligned_thread_unsafe(const char *name, size_t len, } static const struct rte_memzone * -rte_memzone_reserve_thread_safe(const char *name, size_t len, - int socket_id, unsigned flags, unsigned align, - unsigned bound) +rte_memzone_reserve_thread_safe(const char *name, size_t len, int socket_id, + unsigned int flags, unsigned int align, unsigned int bound, + bool contig) { struct rte_mem_config *mcfg; const struct rte_memzone *mz = NULL; @@ -228,7 +229,7 @@ rte_memzone_reserve_thread_safe(const char *name, size_t len, rte_rwlock_write_lock(&mcfg->mlock); mz = memzone_reserve_aligned_thread_unsafe( - name, len, socket_id, flags, align, bound); + name, len, socket_id, flags, align, bound, contig); rte_rwlock_write_unlock(&mcfg->mlock); @@ -245,7 +246,7 @@ rte_memzone_reserve_bounded(const char *name, size_t len, int socket_id, unsigned flags, unsigned align, unsigned bound) { return rte_memzone_reserve_thread_safe(name, len, socket_id, flags, - align, bound); + align, bound, false); } /* @@ -257,7 +258,7 @@ rte_memzone_reserve_aligned(const char *name, size_t len, int socket_id, unsigned flags, unsigned align) { return rte_memzone_reserve_thread_safe(name, len, socket_id, flags, - align, 0); + align, 0, false); } /* @@ -269,7 +270,8 @@ rte_memzone_reserve(const char *name, size_t len, int socket_id, unsigned flags) { return rte_memzone_reserve_thread_safe(name, len, socket_id, - flags, RTE_CACHE_LINE_SIZE, 0); + flags, RTE_CACHE_LINE_SIZE, 0, + false); } int diff --git a/lib/librte_eal/common/malloc_elem.c b/lib/librte_eal/common/malloc_elem.c index eabad66..d2dba35 100644 --- a/lib/librte_eal/common/malloc_elem.c +++ b/lib/librte_eal/common/malloc_elem.c @@ -17,6 +17,7 @@ #include #include +#include "eal_memalloc.h" #include "malloc_elem.h" #include "malloc_heap.h" @@ -94,33 +95,88 @@ malloc_elem_insert(struct malloc_elem *elem) } /* + * Attempt to find enough physically contiguous memory in this block to store + * our data. Assume that element has at least enough space to fit in the data, + * so we just check the page addresses. 
+ */ +static bool +elem_check_phys_contig(struct rte_memseg_list *msl, void *start, size_t size) +{ + uint64_t page_sz; + void *aligned_start, *end, *aligned_end; + size_t aligned_len; + + /* figure out how many pages we need to fit in current data */ + page_sz = msl->hugepage_sz; + aligned_start = RTE_PTR_ALIGN_FLOOR(start, page_sz); + end = RTE_PTR_ADD(start, size); + aligned_end = RTE_PTR_ALIGN_CEIL(end, page_sz); + + aligned_len = RTE_PTR_DIFF(aligned_end, aligned_start); + + return eal_memalloc_is_contig(msl, aligned_start, aligned_len); +} + +/* * calculate the starting point of where data of the requested size * and
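The contiguity check above is mostly pointer arithmetic: round the candidate data region out to page boundaries and ask whether that whole page range is IOVA-contiguous. A small sketch of the alignment step, with the actual contiguity query left as a stub since it depends on walking the memseg list (page_sz is assumed to be a power of two):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* stub: in EAL this question is answered by eal_memalloc_is_contig() */
static bool
pages_are_iova_contig(void *start, size_t len)
{
	(void)start;
	(void)len;
	return true;
}

/* could [start, start + size) be served from physically contiguous pages? */
static bool
region_check_contig(void *start, size_t size, uint64_t page_sz)
{
	uint64_t s = (uint64_t)(uintptr_t)start;
	uint64_t aligned_start = s & ~(page_sz - 1);			/* round down */
	uint64_t aligned_end = (s + size + page_sz - 1) & ~(page_sz - 1); /* round up */

	return pages_are_iova_contig((void *)(uintptr_t)aligned_start,
			(size_t)(aligned_end - aligned_start));
}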
[dpdk-dev] [PATCH v2 13/41] eal: replace memseg with memseg lists
Before, we were aggregating multiple pages into one memseg, so the number of memsegs was small. Now, each page gets its own memseg, so the list of memsegs is huge. To accommodate the new memseg list size and to keep the under-the-hood workings sane, the memseg list is now not just a single list, but multiple lists. To be precise, each hugepage size available on the system gets one or more memseg lists, per socket.

In order to support dynamic memory allocation, we reserve all memory in advance. That is, we do an anonymous mmap() of the entire maximum size of memory per hugepage size, per socket (which is limited to either RTE_MAX_MEMSEG_PER_TYPE pages or RTE_MAX_MEM_PER_TYPE gigabytes worth of memory, whichever is the smaller one), split over multiple lists (which are limited to either RTE_MAX_MEMSEG_PER_LIST memsegs or RTE_MAX_MEM_PER_LIST gigabytes per list, whichever is the smaller one). So, for each hugepage size, we get (by default) up to 128G worth of memory per socket, split into chunks of up to 32G in size. The address space is claimed at the start, in eal_common_memory.c. The actual page allocation code is in eal_memalloc.c (Linux-only for now), and largely consists of copied EAL memory init code.

Pages in the list are also indexed by address. That is, in non-legacy mode, in order to figure out where a page belongs, one can simply look at the base address of its memseg list. Similarly, figuring out the IOVA address of a memzone is a matter of finding the right memseg list, getting the offset and dividing it by the page size to get the appropriate memseg. For legacy mode, the old behavior of walking the memseg list remains.

Due to the switch to fbarray and to avoid any intrusive changes, secondary processes are not supported in this commit. Also, one particular API call (dump physmem layout) no longer makes sense and was removed, according to the deprecation notice [1]. In legacy mode, nothing is preallocated, and all memsegs are in a list like before, but each segment still resides in an appropriate memseg list.

The rest of the changes are really ripple effects from the memseg change - heap changes, compile fixes, and rewrites to support fbarray-backed memseg lists.
[1] http://dpdk.org/dev/patchwork/patch/34002/ Signed-off-by: Anatoly Burakov --- config/common_base| 15 +- drivers/bus/pci/linux/pci.c | 29 +- drivers/net/virtio/virtio_user/vhost_kernel.c | 108 +--- lib/librte_eal/common/eal_common_memory.c | 322 +++--- lib/librte_eal/common/eal_common_memzone.c| 12 +- lib/librte_eal/common/eal_hugepages.h | 2 + lib/librte_eal/common/eal_internal_cfg.h | 2 +- lib/librte_eal/common/include/rte_eal_memconfig.h | 22 +- lib/librte_eal/common/include/rte_memory.h| 33 ++- lib/librte_eal/common/include/rte_memzone.h | 1 - lib/librte_eal/common/malloc_elem.c | 8 +- lib/librte_eal/common/malloc_elem.h | 6 +- lib/librte_eal/common/malloc_heap.c | 92 +-- lib/librte_eal/common/rte_malloc.c| 22 +- lib/librte_eal/linuxapp/eal/eal.c | 21 +- lib/librte_eal/linuxapp/eal/eal_memory.c | 297 +--- lib/librte_eal/linuxapp/eal/eal_vfio.c| 164 +++ lib/librte_eal/rte_eal_version.map| 3 +- test/test/test_malloc.c | 29 +- test/test/test_memory.c | 43 ++- test/test/test_memzone.c | 17 +- 21 files changed, 917 insertions(+), 331 deletions(-) diff --git a/config/common_base b/config/common_base index ad03cf4..e9c1d93 100644 --- a/config/common_base +++ b/config/common_base @@ -61,7 +61,20 @@ CONFIG_RTE_CACHE_LINE_SIZE=64 CONFIG_RTE_LIBRTE_EAL=y CONFIG_RTE_MAX_LCORE=128 CONFIG_RTE_MAX_NUMA_NODES=8 -CONFIG_RTE_MAX_MEMSEG=256 +CONFIG_RTE_MAX_MEMSEG_LISTS=32 +# each memseg list will be limited to either RTE_MAX_MEMSEG_PER_LIST pages +# or RTE_MAX_MEM_PER_LIST gigabytes worth of memory, whichever is the smallest +CONFIG_RTE_MAX_MEMSEG_PER_LIST=8192 +CONFIG_RTE_MAX_MEM_PER_LIST=32 +# a "type" is a combination of page size and NUMA node. total number of memseg +# lists per type will be limited to either RTE_MAX_MEMSEG_PER_TYPE pages (split +# over multiple lists of RTE_MAX_MEMSEG_PER_LIST pages), or RTE_MAX_MEM_PER_TYPE +# gigabytes of memory (split over multiple lists of RTE_MAX_MEM_PER_LIST), +# whichever is the smallest +CONFIG_RTE_MAX_MEMSEG_PER_TYPE=32768 +CONFIG_RTE_MAX_MEM_PER_TYPE=128 +# legacy mem mode only +CONFIG_RTE_MAX_LEGACY_MEMSEG=256 CONFIG_RTE_MAX_MEMZONE=2560 CONFIG_RTE_MAX_TAILQ=32 CONFIG_RTE_ENABLE_ASSERT=n diff --git a/drivers/bus/pci/linux/pci.c b/drivers/bus/pci/linux/pci.c index abde641..ec05d7c 100644 --- a/drivers/bus/pci/linux/pci.c +++ b/drivers/bus/pci/linux/pci.c @@ -119,19 +119,30 @@ rte_pci_unmap_device(struct rte_pci_device *dev) void
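The address-indexed lookup the commit message describes amounts to: find the memseg list whose VA range contains the address, then divide the offset from its base by the page size to get the segment index. A simplified sketch with a hypothetical, pared-down list descriptor:

#include <stddef.h>
#include <stdint.h>

struct seg_list {
	void *base_va;		/* start of the reserved VA area */
	uint64_t page_sz;	/* page size backing this list */
	size_t len;		/* length of the VA area in bytes */
};

/* index of the page (memseg) containing 'addr', or -1 if out of range */
static int
addr_to_seg_idx(const struct seg_list *msl, const void *addr)
{
	uintptr_t base = (uintptr_t)msl->base_va;
	uintptr_t a = (uintptr_t)addr;

	if (a < base || a >= base + msl->len)
		return -1;

	return (int)((a - base) / msl->page_sz);
}

The IOVA of an address then follows from the IOVA recorded for that memseg plus the offset of the address within its page.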
[dpdk-dev] [PATCH v2 17/41] eal: enable memory hotplug support in rte_malloc
This set of changes enables rte_malloc to allocate and free memory as needed. The way it works is, first malloc checks if there is enough memory already allocated to satisfy user's request. If there isn't, we try and allocate more memory. The reverse happens with free - we free an element, check its size (including free element merging due to adjacency) and see if it's bigger than hugepage size and that its start and end span a hugepage or more. Then we remove the area from malloc heap (adjusting element lengths where appropriate), and deallocate the page. For legacy mode, runtime alloc/free of pages is disabled. It is worth noting that memseg lists are being sorted by page size, and that we try our best to satisfy user's request. That is, if the user requests an element from a 2MB page memory, we will check if we can satisfy that request from existing memory, if not we try and allocate more 2MB pages. If that fails and user also specified a "size is hint" flag, we then check other page sizes and try to allocate from there. If that fails too, then, depending on flags, we may try allocating from other sockets. In other words, we try our best to give the user what they asked for, but going to other sockets is last resort - first we try to allocate more memory on the same socket. Signed-off-by: Anatoly Burakov --- lib/librte_eal/common/eal_common_memzone.c | 23 +- lib/librte_eal/common/malloc_elem.c| 85 lib/librte_eal/common/malloc_elem.h| 3 + lib/librte_eal/common/malloc_heap.c| 332 - lib/librte_eal/common/malloc_heap.h| 4 +- lib/librte_eal/common/rte_malloc.c | 31 +-- 6 files changed, 416 insertions(+), 62 deletions(-) diff --git a/lib/librte_eal/common/eal_common_memzone.c b/lib/librte_eal/common/eal_common_memzone.c index ed36174..718dee8 100644 --- a/lib/librte_eal/common/eal_common_memzone.c +++ b/lib/librte_eal/common/eal_common_memzone.c @@ -103,7 +103,6 @@ memzone_reserve_aligned_thread_unsafe(const char *name, size_t len, struct rte_memzone *mz; struct rte_mem_config *mcfg; size_t requested_len; - int socket, i; /* get pointer to global configuration */ mcfg = rte_eal_get_configuration()->mem_config; @@ -181,27 +180,9 @@ memzone_reserve_aligned_thread_unsafe(const char *name, size_t len, } } - if (socket_id == SOCKET_ID_ANY) - socket = malloc_get_numa_socket(); - else - socket = socket_id; - /* allocate memory on heap */ - void *mz_addr = malloc_heap_alloc(&mcfg->malloc_heaps[socket], NULL, - requested_len, flags, align, bound); - - if ((mz_addr == NULL) && (socket_id == SOCKET_ID_ANY)) { - /* try other heaps */ - for (i = 0; i < RTE_MAX_NUMA_NODES; i++) { - if (socket == i) - continue; - - mz_addr = malloc_heap_alloc(&mcfg->malloc_heaps[i], - NULL, requested_len, flags, align, bound); - if (mz_addr != NULL) - break; - } - } + void *mz_addr = malloc_heap_alloc(NULL, requested_len, socket_id, flags, + align, bound); if (mz_addr == NULL) { rte_errno = ENOMEM; diff --git a/lib/librte_eal/common/malloc_elem.c b/lib/librte_eal/common/malloc_elem.c index 701bffd..eabad66 100644 --- a/lib/librte_eal/common/malloc_elem.c +++ b/lib/librte_eal/common/malloc_elem.c @@ -400,6 +400,91 @@ malloc_elem_free(struct malloc_elem *elem) return elem; } +/* assume all checks were already done */ +void +malloc_elem_hide_region(struct malloc_elem *elem, void *start, size_t len) +{ + size_t len_before, len_after; + struct malloc_elem *prev, *next; + void *end, *elem_end; + + end = RTE_PTR_ADD(start, len); + elem_end = RTE_PTR_ADD(elem, elem->size); + len_before = RTE_PTR_DIFF(start, elem); + len_after = 
RTE_PTR_DIFF(elem_end, end); + + prev = elem->prev; + next = elem->next; + + if (len_after >= MALLOC_ELEM_OVERHEAD + MIN_DATA_SIZE) { + /* split after */ + struct malloc_elem *split_after = end; + + split_elem(elem, split_after); + + next = split_after; + + malloc_elem_free_list_insert(split_after); + } else if (len_after >= MALLOC_ELEM_HEADER_LEN) { + struct malloc_elem *pad_elem = end; + + /* shrink current element */ + elem->size -= len_after; + memset(pad_elem, 0, sizeof(*pad_elem)); + + /* copy next element's data to our pad */ + memcpy(pad_elem, next, sizeof(*pad_elem)); + + /* pad next element */ + next->state = ELEM_PAD; + next->pad = l
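The allocation policy described above can be summarized as a loop over page sizes: try the existing heap for the preferred page size and socket, try to grow the heap with more pages of that size, and only fall back to other page sizes when the size-is-hint flag allows it (other sockets are tried by the caller as a last resort). A pseudocode-level C sketch of that control flow; all the helpers are hypothetical stand-ins for the real heap internals:

#include <stdbool.h>
#include <stddef.h>

/* hypothetical stubs standing in for the real heap internals */
static void *heap_alloc_existing(int socket, size_t size, size_t pg_sz)
{ (void)socket; (void)size; (void)pg_sz; return NULL; }
static bool heap_grow(int socket, size_t size, size_t pg_sz)
{ (void)socket; (void)size; (void)pg_sz; return false; }
static size_t next_page_size(int socket, size_t after)	/* 0 when exhausted */
{ (void)socket; (void)after; return 0; }

static void *
alloc_on_socket(int socket, size_t size, size_t pref_pg_sz, bool size_is_hint)
{
	size_t pg_sz = pref_pg_sz;

	while (pg_sz != 0) {
		/* 1. can the request be served from memory we already have? */
		void *ret = heap_alloc_existing(socket, size, pg_sz);
		if (ret != NULL)
			return ret;

		/* 2. try to allocate more pages of this size and retry */
		if (heap_grow(socket, size, pg_sz)) {
			ret = heap_alloc_existing(socket, size, pg_sz);
			if (ret != NULL)
				return ret;
		}

		/* 3. only consider other page sizes if the user allowed it */
		if (!size_is_hint)
			break;
		pg_sz = next_page_size(socket, pg_sz);
	}
	return NULL;	/* caller may then retry on other sockets */
}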
[dpdk-dev] [PATCH v2 21/41] eal: enable reserving physically contiguous memzones
This adds a new set of _contig API's to rte_memzone. Signed-off-by: Anatoly Burakov --- lib/librte_eal/common/eal_common_memzone.c | 44 lib/librte_eal/common/include/rte_memzone.h | 154 2 files changed, 198 insertions(+) diff --git a/lib/librte_eal/common/eal_common_memzone.c b/lib/librte_eal/common/eal_common_memzone.c index 75c7dd9..8c9aa28 100644 --- a/lib/librte_eal/common/eal_common_memzone.c +++ b/lib/librte_eal/common/eal_common_memzone.c @@ -170,6 +170,12 @@ memzone_reserve_aligned_thread_unsafe(const char *name, size_t len, socket_id = SOCKET_ID_ANY; if (len == 0) { + /* len == 0 is only allowed for non-contiguous zones */ + if (contig) { + RTE_LOG(DEBUG, EAL, "Reserving zero-length contiguous memzones is not supported\n"); + rte_errno = EINVAL; + return NULL; + } if (bound != 0) requested_len = bound; else { @@ -251,6 +257,19 @@ rte_memzone_reserve_bounded(const char *name, size_t len, int socket_id, /* * Return a pointer to a correctly filled memzone descriptor (with a + * specified alignment and boundary). If the allocation cannot be done, + * return NULL. + */ +const struct rte_memzone * +rte_memzone_reserve_bounded_contig(const char *name, size_t len, int socket_id, + unsigned int flags, unsigned int align, unsigned int bound) +{ + return rte_memzone_reserve_thread_safe(name, len, socket_id, flags, + align, bound, true); +} + +/* + * Return a pointer to a correctly filled memzone descriptor (with a * specified alignment). If the allocation cannot be done, return NULL. */ const struct rte_memzone * @@ -262,6 +281,18 @@ rte_memzone_reserve_aligned(const char *name, size_t len, int socket_id, } /* + * Return a pointer to a correctly filled memzone descriptor (with a + * specified alignment). If the allocation cannot be done, return NULL. + */ +const struct rte_memzone * +rte_memzone_reserve_aligned_contig(const char *name, size_t len, int socket_id, + unsigned int flags, unsigned int align) +{ + return rte_memzone_reserve_thread_safe(name, len, socket_id, flags, + align, 0, true); +} + +/* * Return a pointer to a correctly filled memzone descriptor. If the * allocation cannot be done, return NULL. */ @@ -274,6 +305,19 @@ rte_memzone_reserve(const char *name, size_t len, int socket_id, false); } +/* + * Return a pointer to a correctly filled memzone descriptor. If the + * allocation cannot be done, return NULL. + */ +const struct rte_memzone * +rte_memzone_reserve_contig(const char *name, size_t len, int socket_id, + unsigned int flags) +{ + return rte_memzone_reserve_thread_safe(name, len, socket_id, + flags, RTE_CACHE_LINE_SIZE, 0, + true); +} + int rte_memzone_free(const struct rte_memzone *mz) { diff --git a/lib/librte_eal/common/include/rte_memzone.h b/lib/librte_eal/common/include/rte_memzone.h index a69f068..5f1293f 100644 --- a/lib/librte_eal/common/include/rte_memzone.h +++ b/lib/librte_eal/common/include/rte_memzone.h @@ -227,6 +227,160 @@ const struct rte_memzone *rte_memzone_reserve_bounded(const char *name, unsigned flags, unsigned align, unsigned bound); /** + * Reserve an IOVA-contiguous portion of physical memory. + * + * This function reserves some IOVA-contiguous memory and returns a pointer to a + * correctly filled memzone descriptor. If the allocation cannot be + * done, return NULL. + * + * @param name + * The name of the memzone. If it already exists, the function will + * fail and return NULL. + * @param len + * The size of the memory to be reserved. + * @param socket_id + * The socket identifier in the case of + * NUMA. 
The value can be SOCKET_ID_ANY if there is no NUMA + * constraint for the reserved zone. + * @param flags + * The flags parameter is used to request memzones to be + * taken from specifically sized hugepages. + * - RTE_MEMZONE_2MB - Reserved from 2MB pages + * - RTE_MEMZONE_1GB - Reserved from 1GB pages + * - RTE_MEMZONE_16MB - Reserved from 16MB pages + * - RTE_MEMZONE_16GB - Reserved from 16GB pages + * - RTE_MEMZONE_256KB - Reserved from 256KB pages + * - RTE_MEMZONE_256MB - Reserved from 256MB pages + * - RTE_MEMZONE_512MB - Reserved from 512MB pages + * - RTE_MEMZONE_4GB - Reserved from 4GB pages + * - RTE_MEMZONE_SIZE_HINT_ONLY - Allow alternative page size to be used if + * the requested page size is unavailable. + * If this flag is not set, the function + *
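A caller that genuinely needs an IOVA-contiguous region spanning multiple pages (for example, a large hardware descriptor ring) would then pick the new variant instead of rte_memzone_reserve(). A minimal usage sketch of the API added here (the zone name is just an example):

#include <rte_memzone.h>

static const struct rte_memzone *
reserve_hw_ring(size_t len, int socket_id)
{
	const struct rte_memzone *mz;

	/* same semantics as rte_memzone_reserve(), but the backing memory is
	 * guaranteed to be IOVA-contiguous, or the call fails and sets
	 * rte_errno (e.g. ENOMEM when no contiguous run can be found) */
	mz = rte_memzone_reserve_contig("hw_ring", len, socket_id, 0);

	return mz;
}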
[dpdk-dev] [PATCH v2 23/41] mempool: add support for the new allocation methods
If a user has specified that the zone should have contiguous memory, use the new _contig allocation API's instead of normal ones. Otherwise, account for the fact that unless we're in IOVA_AS_VA mode, we cannot guarantee that the pages would be physically contiguous, so we calculate the memzone size and alignments as if we were getting the smallest page size available. Signed-off-by: Anatoly Burakov --- lib/librte_mempool/rte_mempool.c | 87 +++- 1 file changed, 78 insertions(+), 9 deletions(-) diff --git a/lib/librte_mempool/rte_mempool.c b/lib/librte_mempool/rte_mempool.c index 54f7f4b..5c4d3fd 100644 --- a/lib/librte_mempool/rte_mempool.c +++ b/lib/librte_mempool/rte_mempool.c @@ -98,6 +98,27 @@ static unsigned optimize_object_size(unsigned obj_size) return new_obj_size * RTE_MEMPOOL_ALIGN; } +static size_t +get_min_page_size(void) +{ + const struct rte_mem_config *mcfg = + rte_eal_get_configuration()->mem_config; + int i; + size_t min_pagesz = SIZE_MAX; + + for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) { + const struct rte_memseg_list *msl = &mcfg->memsegs[i]; + + if (msl->base_va == NULL) + continue; + + if (msl->hugepage_sz < min_pagesz) + min_pagesz = msl->hugepage_sz; + } + + return min_pagesz == SIZE_MAX ? (size_t) getpagesize() : min_pagesz; +} + static void mempool_add_elem(struct rte_mempool *mp, void *obj, rte_iova_t iova) { @@ -549,6 +570,7 @@ rte_mempool_populate_default(struct rte_mempool *mp) unsigned mz_id, n; unsigned int mp_flags; int ret; + bool force_contig, no_contig; /* mempool must not be populated */ if (mp->nb_mem_chunks != 0) @@ -563,10 +585,46 @@ rte_mempool_populate_default(struct rte_mempool *mp) /* update mempool capabilities */ mp->flags |= mp_flags; - if (rte_eal_has_hugepages()) { - pg_shift = 0; /* not needed, zone is physically contiguous */ + no_contig = mp->flags & MEMPOOL_F_NO_PHYS_CONTIG; + force_contig = mp->flags & MEMPOOL_F_CAPA_PHYS_CONTIG; + + /* +* there are several considerations for page size and page shift here. +* +* if we don't need our mempools to have physically contiguous objects, +* then just set page shift and page size to 0, because the user has +* indicated that there's no need to care about anything. +* +* if we do need contiguous objects, there is also an option to reserve +* the entire mempool memory as one contiguous block of memory, in +* which case the page shift and alignment wouldn't matter as well. +* +* if we require contiguous objects, but not necessarily the entire +* mempool reserved space to be contiguous, then there are two options. +* +* if our IO addresses are virtual, not actual physical (IOVA as VA +* case), then no page shift needed - our memory allocation will give us +* contiguous physical memory as far as the hardware is concerned, so +* act as if we're getting contiguous memory. +* +* if our IO addresses are physical, we may get memory from bigger +* pages, or we might get memory from smaller pages, and how much of it +* we require depends on whether we want bigger or smaller pages. +* However, requesting each and every memory size is too much work, so +* what we'll do instead is walk through the page sizes available, pick +* the smallest one and set up page shift to match that one. We will be +* wasting some space this way, but it's much nicer than looping around +* trying to reserve each and every page size. 
+*/ + + if (no_contig || force_contig || rte_eal_iova_mode() == RTE_IOVA_VA) { pg_sz = 0; + pg_shift = 0; align = RTE_CACHE_LINE_SIZE; + } else if (rte_eal_has_hugepages()) { + pg_sz = get_min_page_size(); + pg_shift = rte_bsf32(pg_sz); + align = pg_sz; } else { pg_sz = getpagesize(); pg_shift = rte_bsf32(pg_sz); @@ -585,23 +643,34 @@ rte_mempool_populate_default(struct rte_mempool *mp) goto fail; } - mz = rte_memzone_reserve_aligned(mz_name, size, - mp->socket_id, mz_flags, align); - /* not enough memory, retry with the biggest zone we have */ - if (mz == NULL) - mz = rte_memzone_reserve_aligned(mz_name, 0, + if (force_contig) { + /* +* if contiguous memory for entire mempool memory was +* requested, don't try reserving again if we fail. +*/ +
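The page size/shift selection described above can be condensed as follows (an illustrative restatement only, not code from the patch; the hugepage vs. system-page distinction is folded into the min_page_sz argument):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <rte_common.h>

/* Pick pg_sz/pg_shift for mempool sizing: if per-object physical
 * contiguity is not required (or is guaranteed anyway), no page math
 * is needed; otherwise assume the smallest page size available. */
static void
pick_page_params(bool no_contig, bool force_contig, bool iova_as_va,
                size_t min_page_sz, size_t *pg_sz, unsigned int *pg_shift)
{
        if (no_contig || force_contig || iova_as_va) {
                *pg_sz = 0;
                *pg_shift = 0;
        } else {
                *pg_sz = min_page_sz;
                *pg_shift = rte_bsf32((uint32_t)*pg_sz);
        }
}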
[dpdk-dev] [PATCH v2 29/41] eal: add support for callbacks on memory hotplug
Each process will have its own callbacks. Callbacks will indicate whether it's allocation and deallocation that's happened, and will also provide start VA address and length of allocated block. Since memory hotplug isn't supported on FreeBSD and in legacy mem mode, it will not be possible to register them in either. Signed-off-by: Anatoly Burakov --- lib/librte_eal/common/eal_common_memalloc.c | 132 lib/librte_eal/common/eal_common_memory.c | 28 ++ lib/librte_eal/common/eal_memalloc.h| 10 +++ lib/librte_eal/common/include/rte_memory.h | 48 ++ lib/librte_eal/rte_eal_version.map | 2 + 5 files changed, 220 insertions(+) diff --git a/lib/librte_eal/common/eal_common_memalloc.c b/lib/librte_eal/common/eal_common_memalloc.c index 62e8c16..4fb55f2 100644 --- a/lib/librte_eal/common/eal_common_memalloc.c +++ b/lib/librte_eal/common/eal_common_memalloc.c @@ -2,16 +2,46 @@ * Copyright(c) 2017-2018 Intel Corporation */ +#include + +#include #include #include #include #include #include +#include #include "eal_private.h" #include "eal_internal_cfg.h" #include "eal_memalloc.h" +struct mem_event_callback_entry { + TAILQ_ENTRY(mem_event_callback_entry) next; + char name[RTE_MEM_EVENT_CALLBACK_NAME_LEN]; + rte_mem_event_callback_t clb; +}; + +/** Double linked list of actions. */ +TAILQ_HEAD(mem_event_callback_entry_list, mem_event_callback_entry); + +static struct mem_event_callback_entry_list callback_list = + TAILQ_HEAD_INITIALIZER(callback_list); + +static rte_rwlock_t rwlock = RTE_RWLOCK_INITIALIZER; + +static struct mem_event_callback_entry * +find_callback(const char *name) +{ + struct mem_event_callback_entry *r; + + TAILQ_FOREACH(r, &callback_list, next) { + if (!strcmp(r->name, name)) + break; + } + return r; +} + bool eal_memalloc_is_contig(struct rte_memseg_list *msl, void *start, size_t len) @@ -47,3 +77,105 @@ eal_memalloc_is_contig(struct rte_memseg_list *msl, void *start, return true; } + +int +eal_memalloc_callback_register(const char *name, + rte_mem_event_callback_t clb) +{ + struct mem_event_callback_entry *entry; + int ret, len; + if (name == NULL || clb == NULL) { + rte_errno = EINVAL; + return -1; + } + len = strnlen(name, RTE_MEM_EVENT_CALLBACK_NAME_LEN); + if (len == 0) { + rte_errno = EINVAL; + return -1; + } else if (len == RTE_MEM_EVENT_CALLBACK_NAME_LEN) { + rte_errno = ENAMETOOLONG; + return -1; + } + rte_rwlock_write_lock(&rwlock); + + entry = find_callback(name); + if (entry != NULL) { + rte_errno = EEXIST; + ret = -1; + goto unlock; + } + + entry = malloc(sizeof(*entry)); + if (entry == NULL) { + rte_errno = ENOMEM; + ret = -1; + goto unlock; + } + + /* callback successfully created and is valid, add it to the list */ + entry->clb = clb; + snprintf(entry->name, RTE_MEM_EVENT_CALLBACK_NAME_LEN, "%s", name); + TAILQ_INSERT_TAIL(&callback_list, entry, next); + + ret = 0; + + RTE_LOG(DEBUG, EAL, "Mem event callback '%s' registered\n", name); + +unlock: + rte_rwlock_write_unlock(&rwlock); + return ret; +} + +int +eal_memalloc_callback_unregister(const char *name) +{ + struct mem_event_callback_entry *entry; + int ret, len; + + if (name == NULL) { + rte_errno = EINVAL; + return -1; + } + len = strnlen(name, RTE_MEM_EVENT_CALLBACK_NAME_LEN); + if (len == 0) { + rte_errno = EINVAL; + return -1; + } else if (len == RTE_MEM_EVENT_CALLBACK_NAME_LEN) { + rte_errno = ENAMETOOLONG; + return -1; + } + rte_rwlock_write_lock(&rwlock); + + entry = find_callback(name); + if (entry == NULL) { + rte_errno = ENOENT; + ret = -1; + goto unlock; + } + TAILQ_REMOVE(&callback_list, entry, next); + 
free(entry); + + ret = 0; + + RTE_LOG(DEBUG, EAL, "Mem event callback '%s' unregistered\n", name); + +unlock: + rte_rwlock_write_unlock(&rwlock); + return ret; +} + +void +eal_memalloc_notify(enum rte_mem_event event, const void *start, size_t len) +{ + struct mem_event_callback_entry *entry; + + rte_rwlock_read_lock(&rwlock); + + TAILQ_FOREACH(entry, &callback_list, next) { + RTE_LOG(DEBUG, EAL, "Calling mem event callback %s", + entry->name); + entry->clb(event, start, len); + } + + rte_rwlock_read_unlock(&rwlock); +} diff --git a/lib/librte_eal/common/eal_common_memory.c b/lib/librte_e
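A minimal sketch of a user callback matching the invocation above (the enum rte_mem_event values and the rte_mem_event_callback_t typedef come from the rte_memory.h part of this patch; the public registration wrapper is not shown in this excerpt, so the internal eal_memalloc_callback_register() is referenced purely for illustration):

#include <stddef.h>
#include <rte_memory.h>
#include <rte_log.h>

/* Called with the event type, start VA and length of the affected area. */
static void
mem_event_cb(enum rte_mem_event event, const void *start, size_t len)
{
        if (event == RTE_MEM_EVENT_ALLOC)
                RTE_LOG(DEBUG, EAL, "new memory at %p, len %zu\n", start, len);
        else /* RTE_MEM_EVENT_FREE */
                RTE_LOG(DEBUG, EAL, "memory at %p, len %zu being freed\n",
                        start, len);
}

/* registration, e.g. from driver init:
 * eal_memalloc_callback_register("my-driver-mem-cb", mem_event_cb);
 */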
[dpdk-dev] [PATCH v2 26/41] eal: prepare memseg lists for multiprocess sync
In preparation for implementing multiprocess support, we are adding a version number and write locks to memseg lists. There are two ways of implementing multiprocess support for memory hotplug: either all information about mapped memory is shared between processes, and secondary processes simply attempt to map/unmap memory based on requests from the primary, or secondary processes store their own maps and only check if they are in sync with the primary process' maps. This implementation will opt for the latter option: primary process shared mappings will be authoritative, and each secondary process will use its own interal view of mapped memory, and will attempt to synchronize on these mappings using versioning. Under this model, only primary process will decide which pages get mapped, and secondary processes will only copy primary's page maps and get notified of the changes via IPC mechanism (coming in later commits). To avoid race conditions, memseg lists will also have write locks - that is, it will be possible for several secondary processes to initialize concurrently, but it will not be possible for several processes to request memory allocation unless all other allocations were complete (on a single socket - it is OK to allocate/free memory on different sockets concurrently). In principle, it is possible for multiple processes to request allocation/deallcation on multiple sockets, but we will only allow one such request to be active at any one time. Signed-off-by: Anatoly Burakov --- lib/librte_eal/bsdapp/eal/eal_memalloc.c | 7 + lib/librte_eal/common/eal_memalloc.h | 4 + lib/librte_eal/common/include/rte_eal_memconfig.h | 2 + lib/librte_eal/linuxapp/eal/eal_memalloc.c| 288 +- 4 files changed, 295 insertions(+), 6 deletions(-) diff --git a/lib/librte_eal/bsdapp/eal/eal_memalloc.c b/lib/librte_eal/bsdapp/eal/eal_memalloc.c index be8340b..255aedc 100644 --- a/lib/librte_eal/bsdapp/eal/eal_memalloc.c +++ b/lib/librte_eal/bsdapp/eal/eal_memalloc.c @@ -24,3 +24,10 @@ eal_memalloc_alloc_page(uint64_t __rte_unused size, int __rte_unused socket) RTE_LOG(ERR, EAL, "Memory hotplug not supported on FreeBSD\n"); return NULL; } + +int +eal_memalloc_sync_with_primary(void) +{ + RTE_LOG(ERR, EAL, "Memory hotplug not supported on FreeBSD\n"); + return -1; +} diff --git a/lib/librte_eal/common/eal_memalloc.h b/lib/librte_eal/common/eal_memalloc.h index 08ba70e..beac296 100644 --- a/lib/librte_eal/common/eal_memalloc.h +++ b/lib/librte_eal/common/eal_memalloc.h @@ -24,4 +24,8 @@ bool eal_memalloc_is_contig(struct rte_memseg_list *msl, void *start, size_t len); +/* synchronize local memory map to primary process */ +int +eal_memalloc_sync_with_primary(void); + #endif // EAL_MEMALLOC_H diff --git a/lib/librte_eal/common/include/rte_eal_memconfig.h b/lib/librte_eal/common/include/rte_eal_memconfig.h index b6bdb21..d653d57 100644 --- a/lib/librte_eal/common/include/rte_eal_memconfig.h +++ b/lib/librte_eal/common/include/rte_eal_memconfig.h @@ -32,6 +32,8 @@ struct rte_memseg_list { }; int socket_id; /**< Socket ID for all memsegs in this list. */ uint64_t hugepage_sz; /**< page size for all memsegs in this list. */ + rte_rwlock_t mplock; /**< read-write lock for multiprocess sync. */ + uint32_t version; /**< version number for multiprocess sync. 
*/ struct rte_fbarray memseg_arr; }; diff --git a/lib/librte_eal/linuxapp/eal/eal_memalloc.c b/lib/librte_eal/linuxapp/eal/eal_memalloc.c index c03e7bc..227d703 100644 --- a/lib/librte_eal/linuxapp/eal/eal_memalloc.c +++ b/lib/librte_eal/linuxapp/eal/eal_memalloc.c @@ -65,6 +65,9 @@ static struct msl_entry_list msl_entry_list = TAILQ_HEAD_INITIALIZER(msl_entry_list); static rte_spinlock_t tailq_lock = RTE_SPINLOCK_INITIALIZER; +/** local copy of a memory map, used to synchronize memory hotplug in MP */ +static struct rte_memseg_list local_memsegs[RTE_MAX_MEMSEG_LISTS]; + static sigjmp_buf huge_jmpenv; static void __rte_unused huge_sigbus_handler(int signo __rte_unused) @@ -619,11 +622,14 @@ eal_memalloc_alloc_page_bulk(struct rte_memseg **ms, int n, continue; msl = cur_msl; + /* lock memseg list */ + rte_rwlock_write_lock(&msl->mplock); + /* try finding space in memseg list */ cur_idx = rte_fbarray_find_next_n_free(&msl->memseg_arr, 0, n); if (cur_idx < 0) - continue; + goto next_list; end_idx = cur_idx + n; start_idx = cur_idx; @@ -637,7 +643,6 @@ eal_memalloc_alloc_page_bulk(struct rte_memseg **ms, int n, if (alloc_page(cur, addr, size, socket, hi, msl_idx, cur_idx)) { - RTE_LOG(DEBUG, EAL, "attemp
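A conceptual sketch of the version-based sync described in the commit message (the helper copy_primary_mappings() is hypothetical; the real logic lives in eal_memalloc_sync_with_primary()):

#include <rte_rwlock.h>
#include <rte_eal_memconfig.h>

/* hypothetical helper: replay the primary's fbarray contents locally */
int copy_primary_mappings(const struct rte_memseg_list *primary,
                struct rte_memseg_list *local);

/* Only re-walk a memseg list when the local version lags behind. */
static int
sync_one_list(struct rte_memseg_list *primary_msl,
                struct rte_memseg_list *local_msl)
{
        int ret = 0;

        rte_rwlock_write_lock(&primary_msl->mplock);
        if (local_msl->version != primary_msl->version) {
                ret = copy_primary_mappings(primary_msl, local_msl);
                if (ret == 0)
                        local_msl->version = primary_msl->version;
        }
        rte_rwlock_write_unlock(&primary_msl->mplock);

        return ret;
}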
[dpdk-dev] [PATCH v2 19/41] eal: add API to check if memory is contiguous
This will be helpful down the line when we implement support for allocating physically contiguous memory. We can no longer guarantee physically contiguous memory unless we're in IOVA_AS_VA mode, but we can certainly try and see if we succeed. In addition, this would be useful for e.g. PMD's who may allocate chunks that are smaller than the pagesize, but they must not cross the page boundary, in which case we will be able to accommodate that request. Signed-off-by: Anatoly Burakov --- lib/librte_eal/bsdapp/eal/Makefile | 1 + lib/librte_eal/common/eal_common_memalloc.c | 49 + lib/librte_eal/common/eal_memalloc.h| 5 +++ lib/librte_eal/common/meson.build | 1 + lib/librte_eal/linuxapp/eal/Makefile| 1 + 5 files changed, 57 insertions(+) create mode 100644 lib/librte_eal/common/eal_common_memalloc.c diff --git a/lib/librte_eal/bsdapp/eal/Makefile b/lib/librte_eal/bsdapp/eal/Makefile index 19f9322..907e30d 100644 --- a/lib/librte_eal/bsdapp/eal/Makefile +++ b/lib/librte_eal/bsdapp/eal/Makefile @@ -41,6 +41,7 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_BSDAPP) += eal_common_timer.c SRCS-$(CONFIG_RTE_EXEC_ENV_BSDAPP) += eal_common_memzone.c SRCS-$(CONFIG_RTE_EXEC_ENV_BSDAPP) += eal_common_log.c SRCS-$(CONFIG_RTE_EXEC_ENV_BSDAPP) += eal_common_launch.c +SRCS-$(CONFIG_RTE_EXEC_ENV_BSDAPP) += eal_common_memalloc.c SRCS-$(CONFIG_RTE_EXEC_ENV_BSDAPP) += eal_common_memory.c SRCS-$(CONFIG_RTE_EXEC_ENV_BSDAPP) += eal_common_tailqs.c SRCS-$(CONFIG_RTE_EXEC_ENV_BSDAPP) += eal_common_errno.c diff --git a/lib/librte_eal/common/eal_common_memalloc.c b/lib/librte_eal/common/eal_common_memalloc.c new file mode 100644 index 000..62e8c16 --- /dev/null +++ b/lib/librte_eal/common/eal_common_memalloc.c @@ -0,0 +1,49 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2017-2018 Intel Corporation + */ + +#include +#include +#include +#include +#include + +#include "eal_private.h" +#include "eal_internal_cfg.h" +#include "eal_memalloc.h" + +bool +eal_memalloc_is_contig(struct rte_memseg_list *msl, void *start, + size_t len) +{ + const struct rte_memseg *ms; + uint64_t page_sz; + void *end; + int start_page, end_page, cur_page; + rte_iova_t expected; + + /* for legacy memory, it's always contiguous */ + if (internal_config.legacy_mem) + return true; + + /* figure out how many pages we need to fit in current data */ + page_sz = msl->hugepage_sz; + end = RTE_PTR_ADD(start, len); + + start_page = RTE_PTR_DIFF(start, msl->base_va) / page_sz; + end_page = RTE_PTR_DIFF(end, msl->base_va) / page_sz; + + /* now, look for contiguous memory */ + ms = rte_fbarray_get(&msl->memseg_arr, start_page); + expected = ms->iova + page_sz; + + for (cur_page = start_page + 1; cur_page < end_page; + cur_page++, expected += page_sz) { + ms = rte_fbarray_get(&msl->memseg_arr, cur_page); + + if (ms->iova != expected) + return false; + } + + return true; +} diff --git a/lib/librte_eal/common/eal_memalloc.h b/lib/librte_eal/common/eal_memalloc.h index adf59c4..08ba70e 100644 --- a/lib/librte_eal/common/eal_memalloc.h +++ b/lib/librte_eal/common/eal_memalloc.h @@ -8,6 +8,7 @@ #include #include +#include struct rte_memseg * eal_memalloc_alloc_page(uint64_t size, int socket); @@ -19,4 +20,8 @@ eal_memalloc_alloc_page_bulk(struct rte_memseg **ms, int n, uint64_t size, int eal_memalloc_free_page(struct rte_memseg *ms); +bool +eal_memalloc_is_contig(struct rte_memseg_list *msl, void *start, + size_t len); + #endif // EAL_MEMALLOC_H diff --git a/lib/librte_eal/common/meson.build b/lib/librte_eal/common/meson.build index 7d02191..a1ada24 100644 --- 
a/lib/librte_eal/common/meson.build +++ b/lib/librte_eal/common/meson.build @@ -16,6 +16,7 @@ common_sources = files( 'eal_common_launch.c', 'eal_common_lcore.c', 'eal_common_log.c', + 'eal_common_memalloc.c', 'eal_common_memory.c', 'eal_common_memzone.c', 'eal_common_options.c', diff --git a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile index af6b9be..5380ba8 100644 --- a/lib/librte_eal/linuxapp/eal/Makefile +++ b/lib/librte_eal/linuxapp/eal/Makefile @@ -49,6 +49,7 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_common_timer.c SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_common_memzone.c SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_common_log.c SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_common_launch.c +SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_common_memalloc.c SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_common_memory.c SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_common_tailqs.c SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_common_errno.c -- 2.7.4
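A short sketch of how an allocator could use the new check before handing out memory that must be IOVA-contiguous (msl, start and len are assumed to describe the candidate block):

#include <stdbool.h>
#include <stddef.h>
#include "eal_memalloc.h"

/* Accept a candidate block only if it is IOVA-contiguous (or if the
 * caller does not actually need contiguity). */
static bool
candidate_is_usable(struct rte_memseg_list *msl, void *start, size_t len,
                bool need_iova_contig)
{
        if (!need_iova_contig)
                return true;
        return eal_memalloc_is_contig(msl, start, len);
}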
[dpdk-dev] [PATCH v2 32/41] crypto/qat: use contiguous allocation for DMA memory
Signed-off-by: Anatoly Burakov Acked-by: Fiona Trahe --- drivers/crypto/qat/qat_qp.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/crypto/qat/qat_qp.c b/drivers/crypto/qat/qat_qp.c index 87b9ce0..3f8ed4d 100644 --- a/drivers/crypto/qat/qat_qp.c +++ b/drivers/crypto/qat/qat_qp.c @@ -95,8 +95,8 @@ queue_dma_zone_reserve(const char *queue_name, uint32_t queue_size, default: memzone_flags = RTE_MEMZONE_SIZE_HINT_ONLY; } - return rte_memzone_reserve_aligned(queue_name, queue_size, socket_id, - memzone_flags, queue_size); + return rte_memzone_reserve_aligned_contig(queue_name, queue_size, + socket_id, memzone_flags, queue_size); } int qat_crypto_sym_qp_setup(struct rte_cryptodev *dev, uint16_t queue_pair_id, -- 2.7.4
[dpdk-dev] [PATCH v2 31/41] ethdev: use contiguous allocation for DMA memory
This fixes the following drivers in one go: grep -Rl rte_eth_dma_zone_reserve drivers/ drivers/net/avf/avf_rxtx.c drivers/net/thunderx/nicvf_ethdev.c drivers/net/e1000/igb_rxtx.c drivers/net/e1000/em_rxtx.c drivers/net/fm10k/fm10k_ethdev.c drivers/net/vmxnet3/vmxnet3_rxtx.c drivers/net/liquidio/lio_rxtx.c drivers/net/i40e/i40e_rxtx.c drivers/net/sfc/sfc.c drivers/net/ixgbe/ixgbe_rxtx.c drivers/net/nfp/nfp_net.c Signed-off-by: Anatoly Burakov --- lib/librte_ether/rte_ethdev.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/lib/librte_ether/rte_ethdev.c b/lib/librte_ether/rte_ethdev.c index 0590f0c..7935230 100644 --- a/lib/librte_ether/rte_ethdev.c +++ b/lib/librte_ether/rte_ethdev.c @@ -3401,7 +3401,8 @@ rte_eth_dma_zone_reserve(const struct rte_eth_dev *dev, const char *ring_name, if (mz) return mz; - return rte_memzone_reserve_aligned(z_name, size, socket_id, 0, align); + return rte_memzone_reserve_aligned_contig(z_name, size, socket_id, 0, + align); } int -- 2.7.4
[dpdk-dev] [PATCH v2 24/41] vfio: allow to map other memory regions
Currently it is not possible to use memory that is not owned by DPDK to perform DMA. This scenarion might be used in vhost applications (like SPDK) where guest send its own memory table. To fill this gap provide API to allow registering arbitrary address in VFIO container. Signed-off-by: Pawel Wodkowski Signed-off-by: Anatoly Burakov --- lib/librte_eal/bsdapp/eal/eal.c | 16 lib/librte_eal/common/include/rte_vfio.h | 39 lib/librte_eal/linuxapp/eal/eal_vfio.c | 153 ++- lib/librte_eal/linuxapp/eal/eal_vfio.h | 11 +++ 4 files changed, 196 insertions(+), 23 deletions(-) diff --git a/lib/librte_eal/bsdapp/eal/eal.c b/lib/librte_eal/bsdapp/eal/eal.c index 3b06e21..5a7f436 100644 --- a/lib/librte_eal/bsdapp/eal/eal.c +++ b/lib/librte_eal/bsdapp/eal/eal.c @@ -755,6 +755,8 @@ int rte_vfio_enable(const char *modname); int rte_vfio_is_enabled(const char *modname); int rte_vfio_noiommu_is_enabled(void); int rte_vfio_clear_group(int vfio_group_fd); +int rte_vfio_dma_map(uint64_t vaddr, uint64_t iova, uint64_t len); +int rte_vfio_dma_unmap(uint64_t vaddr, uint64_t iova, uint64_t len); int rte_vfio_setup_device(__rte_unused const char *sysfs_base, __rte_unused const char *dev_addr, @@ -790,3 +792,17 @@ int rte_vfio_clear_group(__rte_unused int vfio_group_fd) { return 0; } + +int +rte_vfio_dma_map(uint64_t __rte_unused vaddr, __rte_unused uint64_t iova, + __rte_unused uint64_t len) +{ + return -1; +} + +int +rte_vfio_dma_unmap(uint64_t __rte_unused vaddr, uint64_t __rte_unused iova, + __rte_unused uint64_t len) +{ + return -1; +} diff --git a/lib/librte_eal/common/include/rte_vfio.h b/lib/librte_eal/common/include/rte_vfio.h index e981a62..093c309 100644 --- a/lib/librte_eal/common/include/rte_vfio.h +++ b/lib/librte_eal/common/include/rte_vfio.h @@ -123,6 +123,45 @@ int rte_vfio_noiommu_is_enabled(void); int rte_vfio_clear_group(int vfio_group_fd); +/** + * Map memory region for use with VFIO. + * + * @param vaddr + * Starting virtual address of memory to be mapped. + * + * @param iova + * Starting IOVA address of memory to be mapped. + * + * @param len + * Length of memory segment being mapped. + * + * @return + * 0 if success. + * -1 on error. + */ +int +rte_vfio_dma_map(uint64_t vaddr, uint64_t iova, uint64_t len); + + +/** + * Unmap memory region from VFIO. + * + * @param vaddr + * Starting virtual address of memory to be unmapped. + * + * @param iova + * Starting IOVA address of memory to be unmapped. + * + * @param len + * Length of memory segment being unmapped. + * + * @return + * 0 if success. + * -1 on error. 
+ */ +int +rte_vfio_dma_unmap(uint64_t vaddr, uint64_t iova, uint64_t len); + #endif /* VFIO_PRESENT */ #endif /* _RTE_VFIO_H_ */ diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c b/lib/librte_eal/linuxapp/eal/eal_vfio.c index 5192763..8fe8984 100644 --- a/lib/librte_eal/linuxapp/eal/eal_vfio.c +++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c @@ -22,17 +22,35 @@ static struct vfio_config vfio_cfg; static int vfio_type1_dma_map(int); +static int vfio_type1_dma_mem_map(int, uint64_t, uint64_t, uint64_t, int); static int vfio_spapr_dma_map(int); static int vfio_noiommu_dma_map(int); +static int vfio_noiommu_dma_mem_map(int, uint64_t, uint64_t, uint64_t, int); /* IOMMU types we support */ static const struct vfio_iommu_type iommu_types[] = { /* x86 IOMMU, otherwise known as type 1 */ - { RTE_VFIO_TYPE1, "Type 1", &vfio_type1_dma_map}, + { + .type_id = RTE_VFIO_TYPE1, + .name = "Type 1", + .dma_map_func = &vfio_type1_dma_map, + .dma_user_map_func = &vfio_type1_dma_mem_map + }, /* ppc64 IOMMU, otherwise known as spapr */ - { RTE_VFIO_SPAPR, "sPAPR", &vfio_spapr_dma_map}, + { + .type_id = RTE_VFIO_SPAPR, + .name = "sPAPR", + .dma_map_func = &vfio_spapr_dma_map, + .dma_user_map_func = NULL + // TODO: work with PPC64 people on enabling this, window size! + }, /* IOMMU-less mode */ - { RTE_VFIO_NOIOMMU, "No-IOMMU", &vfio_noiommu_dma_map}, + { + .type_id = RTE_VFIO_NOIOMMU, + .name = "No-IOMMU", + .dma_map_func = &vfio_noiommu_dma_map, + .dma_user_map_func = &vfio_noiommu_dma_mem_map + }, }; int @@ -333,9 +351,10 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr, */ if (internal_config.process_type == RTE_PROC_PRIMARY && vfio_cfg.vfio_active_groups == 1) { + const struct vfio_iommu_type *t; + /* select an IOMMU type which we will be using */ - const struct vfio_iommu_type *t = - vfio_set_iommu
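A minimal usage sketch for the new API, e.g. a vhost backend registering a guest memory region (addresses are placeholders; IOVA-as-VA mode is assumed, so iova == vaddr):

#include <stdint.h>
#include <stddef.h>
#include <rte_vfio.h>
#include <rte_log.h>

static int
register_external_region(void *vaddr, size_t len)
{
        uint64_t va = (uint64_t)(uintptr_t)vaddr;

        if (rte_vfio_dma_map(va, va /* iova == va */, len) < 0) {
                RTE_LOG(ERR, EAL, "DMA map of %p (+%zu) failed\n", vaddr, len);
                return -1;
        }
        return 0;
}

/* on teardown: rte_vfio_dma_unmap(va, va, len); */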
[dpdk-dev] [PATCH v2 27/41] eal: add multiprocess init with memory hotplug
for legacy memory mode, attach to primary's memseg list, and map hugepages as before. for non-legacy mode, preallocate all VA space and then do a sync of local memory map. Signed-off-by: Anatoly Burakov --- lib/librte_eal/bsdapp/eal/eal_hugepage_info.c | 7 ++ lib/librte_eal/common/eal_common_memory.c | 99 + lib/librte_eal/common/eal_hugepages.h | 5 ++ lib/librte_eal/linuxapp/eal/eal.c | 18 +++-- lib/librte_eal/linuxapp/eal/eal_hugepage_info.c | 53 - lib/librte_eal/linuxapp/eal/eal_memory.c| 24 -- 6 files changed, 159 insertions(+), 47 deletions(-) diff --git a/lib/librte_eal/bsdapp/eal/eal_hugepage_info.c b/lib/librte_eal/bsdapp/eal/eal_hugepage_info.c index be2dbf0..18e6e5e 100644 --- a/lib/librte_eal/bsdapp/eal/eal_hugepage_info.c +++ b/lib/librte_eal/bsdapp/eal/eal_hugepage_info.c @@ -103,3 +103,10 @@ eal_hugepage_info_init(void) return 0; } + +/* memory hotplug is not supported in FreeBSD, so no need to implement this */ +int +eal_hugepage_info_read(void) +{ + return 0; +} diff --git a/lib/librte_eal/common/eal_common_memory.c b/lib/librte_eal/common/eal_common_memory.c index 457e239..a571e24 100644 --- a/lib/librte_eal/common/eal_common_memory.c +++ b/lib/librte_eal/common/eal_common_memory.c @@ -20,6 +20,7 @@ #include #include +#include "eal_memalloc.h" #include "eal_private.h" #include "eal_internal_cfg.h" @@ -147,19 +148,11 @@ alloc_memseg_list(struct rte_memseg_list *msl, uint64_t page_sz, char name[RTE_FBARRAY_NAME_LEN]; int max_pages; uint64_t mem_amount; - void *addr; if (!internal_config.legacy_mem) { mem_amount = get_mem_amount(page_sz); max_pages = mem_amount / page_sz; - - addr = eal_get_virtual_area(NULL, &mem_amount, page_sz, 0, 0); - if (addr == NULL) { - RTE_LOG(ERR, EAL, "Cannot reserve memory\n"); - return -1; - } } else { - addr = NULL; /* numer of memsegs in each list, these will not be single-page * segments, so RTE_MAX_LEGACY_MEMSEG is like old default. */ @@ -177,7 +170,7 @@ alloc_memseg_list(struct rte_memseg_list *msl, uint64_t page_sz, msl->hugepage_sz = page_sz; msl->socket_id = socket_id; - msl->base_va = addr; + msl->base_va = NULL; RTE_LOG(DEBUG, EAL, "Memseg list allocated: 0x%zxkB at socket %i\n", page_sz >> 10, socket_id); @@ -186,16 +179,46 @@ alloc_memseg_list(struct rte_memseg_list *msl, uint64_t page_sz, } static int -memseg_init(void) +alloc_va_space(struct rte_memseg_list *msl) +{ + uint64_t mem_sz, page_sz; + void *addr; + int flags = 0; + +#ifdef RTE_ARCH_PPC_64 + flags |= MAP_HUGETLB; +#endif + + page_sz = msl->hugepage_sz; + mem_sz = page_sz * msl->memseg_arr.len; + + addr = eal_get_virtual_area(msl->base_va, &mem_sz, page_sz, 0, flags); + if (addr == NULL) { + if (rte_errno == EADDRNOTAVAIL) + RTE_LOG(ERR, EAL, "Could not mmap %llu bytes at [%p] - please use '--base-virtaddr' option\n", + (unsigned long long)mem_sz, msl->base_va); + else + RTE_LOG(ERR, EAL, "Cannot reserve memory\n"); + return -1; + } + msl->base_va = addr; + + return 0; +} + + +static int +memseg_primary_init(void) { struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; int socket_id, hpi_idx, msl_idx = 0; struct rte_memseg_list *msl; - if (rte_eal_process_type() == RTE_PROC_SECONDARY) { - RTE_LOG(ERR, EAL, "Secondary process not supported\n"); - return -1; - } + /* if we start allocating memory segments for pages straight away, VA +* space will become fragmented, reducing chances of success when +* secondary process maps the same addresses. to fix this, allocate +* fbarrays first, and then allocate VA space for them. 
+*/ /* create memseg lists */ for (hpi_idx = 0; hpi_idx < (int) internal_config.num_hugepage_sizes; @@ -235,12 +258,55 @@ memseg_init(void) total_segs += msl->memseg_arr.len; total_mem = total_segs * msl->hugepage_sz; type_msl_idx++; + + /* no need to preallocate VA in legacy mode */ + if (internal_config.legacy_mem) + continue; + + if (alloc_va_space(msl)) { + RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list\n"); + return -1; +
[dpdk-dev] [PATCH v2 30/41] eal: enable callbacks on malloc/free and mp sync
Also, rewrite VFIO to rely on memory callbacks instead of manually registering memory with VFIO. Callbacks will only be registered if VFIO is enabled. Signed-off-by: Anatoly Burakov --- lib/librte_eal/common/malloc_heap.c| 21 + lib/librte_eal/linuxapp/eal/eal_memalloc.c | 37 +- lib/librte_eal/linuxapp/eal/eal_vfio.c | 35 3 files changed, 82 insertions(+), 11 deletions(-) diff --git a/lib/librte_eal/common/malloc_heap.c b/lib/librte_eal/common/malloc_heap.c index 9935238..d932ead 100644 --- a/lib/librte_eal/common/malloc_heap.c +++ b/lib/librte_eal/common/malloc_heap.c @@ -223,6 +223,7 @@ try_expand_heap_primary(struct malloc_heap *heap, uint64_t pg_sz, void *map_addr; size_t map_len; int n_pages; + bool callback_triggered = false; map_len = RTE_ALIGN_CEIL(align + elt_size + MALLOC_ELEM_TRAILER_LEN, pg_sz); @@ -242,14 +243,25 @@ try_expand_heap_primary(struct malloc_heap *heap, uint64_t pg_sz, map_addr = ms[0]->addr; + /* notify user about changes in memory map */ + eal_memalloc_notify(RTE_MEM_EVENT_ALLOC, map_addr, map_len); + /* notify other processes that this has happened */ if (request_sync()) { /* we couldn't ensure all processes have mapped memory, * so free it back and notify everyone that it's been * freed back. +* +* technically, we could've avoided adding memory addresses to +* the map, but that would've led to inconsistent behavior +* between primary and secondary processes, as those get +* callbacks during sync. therefore, force primary process to +* do alloc-and-rollback syncs as well. */ + callback_triggered = true; goto free_elem; } + heap->total_size += map_len; RTE_LOG(DEBUG, EAL, "Heap on socket %d was expanded by %zdMB\n", @@ -260,6 +272,9 @@ try_expand_heap_primary(struct malloc_heap *heap, uint64_t pg_sz, return 0; free_elem: + if (callback_triggered) + eal_memalloc_notify(RTE_MEM_EVENT_FREE, map_addr, map_len); + rollback_expand_heap(ms, n_pages, elem, map_addr, map_len); request_sync(); @@ -615,6 +630,10 @@ malloc_heap_free(struct malloc_elem *elem) heap->total_size -= n_pages * msl->hugepage_sz; if (rte_eal_process_type() == RTE_PROC_PRIMARY) { + /* notify user about changes in memory map */ + eal_memalloc_notify(RTE_MEM_EVENT_FREE, + aligned_start, aligned_len); + /* don't care if any of this fails */ malloc_heap_free_pages(aligned_start, aligned_len); @@ -637,6 +656,8 @@ malloc_heap_free(struct malloc_elem *elem) * already removed from the heap, so it is, for all intents and * purposes, hidden from the rest of DPDK even if some other * process (including this one) may have these pages mapped. +* +* notifications about deallocated memory happen during sync. 
*/ request_to_primary(&req); } diff --git a/lib/librte_eal/linuxapp/eal/eal_memalloc.c b/lib/librte_eal/linuxapp/eal/eal_memalloc.c index 227d703..1008fae 100644 --- a/lib/librte_eal/linuxapp/eal/eal_memalloc.c +++ b/lib/librte_eal/linuxapp/eal/eal_memalloc.c @@ -34,7 +34,6 @@ #include #include #include -#include #include "eal_filesystem.h" #include "eal_internal_cfg.h" @@ -480,10 +479,6 @@ alloc_page(struct rte_memseg *ms, void *addr, uint64_t size, int socket_id, ms->iova = iova; ms->socket_id = socket_id; - /* map the segment so that VFIO has access to it */ - if (rte_eal_iova_mode() == RTE_IOVA_VA && - rte_vfio_dma_map(ms->addr_64, iova, size)) - RTE_LOG(DEBUG, EAL, "Cannot register segment with VFIO\n"); return 0; mapped: @@ -515,12 +510,6 @@ free_page(struct rte_memseg *ms, struct hugepage_info *hi, char path[PATH_MAX]; int fd, ret; - /* unmap the segment from VFIO */ - if (rte_eal_iova_mode() == RTE_IOVA_VA && - rte_vfio_dma_unmap(ms->addr_64, ms->iova, ms->len)) { - RTE_LOG(DEBUG, EAL, "Cannot unregister segment with VFIO\n"); - } - if (mmap(ms->addr, ms->hugepage_sz, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) == MAP_FAILED) { @@ -808,6 +797,19 @@ sync_chunk(struct rte_memseg_list *primary_msl, diff_len = RTE_MIN(chunk_len, diff_len); + /* if we are freeing memory, notif the application */ + if (!used) { + struct rte_memseg *ms; + void *start_v
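A simplified illustration of the kind of callback this patch registers for VFIO - map newly allocated ranges, unmap freed ones (the real callback in eal_vfio.c walks the memsegs to find IOVAs; IOVA-as-VA is assumed here so iova == vaddr):

#include <stdint.h>
#include <stddef.h>
#include <rte_memory.h>
#include <rte_vfio.h>

static void
vfio_mem_event_cb(enum rte_mem_event event, const void *start, size_t len)
{
        uint64_t va = (uint64_t)(uintptr_t)start;

        if (event == RTE_MEM_EVENT_ALLOC)
                rte_vfio_dma_map(va, va, len);
        else /* RTE_MEM_EVENT_FREE */
                rte_vfio_dma_unmap(va, va, len);
}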
[dpdk-dev] [PATCH v2 33/41] net/avf: use contiguous allocation for DMA memory
Signed-off-by: Anatoly Burakov --- drivers/net/avf/avf_ethdev.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/avf/avf_ethdev.c b/drivers/net/avf/avf_ethdev.c index 4df6617..f69d697 100644 --- a/drivers/net/avf/avf_ethdev.c +++ b/drivers/net/avf/avf_ethdev.c @@ -1365,7 +1365,7 @@ avf_allocate_dma_mem_d(__rte_unused struct avf_hw *hw, return AVF_ERR_PARAM; snprintf(z_name, sizeof(z_name), "avf_dma_%"PRIu64, rte_rand()); - mz = rte_memzone_reserve_bounded(z_name, size, SOCKET_ID_ANY, 0, + mz = rte_memzone_reserve_bounded_contig(z_name, size, SOCKET_ID_ANY, 0, alignment, RTE_PGSIZE_2M); if (!mz) return AVF_ERR_NO_MEMORY; -- 2.7.4
[dpdk-dev] [PATCH v2 37/41] net/enic: use contiguous allocation for DMA memory
Signed-off-by: Anatoly Burakov Acked-by: John Daley --- Notes: It is not 100% clear that second call to memzone_reserve is allocating DMA memory. Corrections welcome. drivers/net/enic/enic_main.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/enic/enic_main.c b/drivers/net/enic/enic_main.c index ec9d343..cb2a7ba 100644 --- a/drivers/net/enic/enic_main.c +++ b/drivers/net/enic/enic_main.c @@ -319,7 +319,7 @@ enic_alloc_consistent(void *priv, size_t size, struct enic *enic = (struct enic *)priv; struct enic_memzone_entry *mze; - rz = rte_memzone_reserve_aligned((const char *)name, + rz = rte_memzone_reserve_aligned_contig((const char *)name, size, SOCKET_ID_ANY, 0, ENIC_ALIGN); if (!rz) { pr_err("%s : Failed to allocate memory requested for %s\n", @@ -787,7 +787,7 @@ int enic_alloc_wq(struct enic *enic, uint16_t queue_idx, "vnic_cqmsg-%s-%d-%d", enic->bdf_name, queue_idx, instance++); - wq->cqmsg_rz = rte_memzone_reserve_aligned((const char *)name, + wq->cqmsg_rz = rte_memzone_reserve_aligned_contig((const char *)name, sizeof(uint32_t), SOCKET_ID_ANY, 0, ENIC_ALIGN); -- 2.7.4
[dpdk-dev] [PATCH v2 39/41] net/qede: use contiguous allocation for DMA memory
Signed-off-by: Anatoly Burakov --- Notes: Doing "grep -R rte_memzone_reserve drivers/net/qede" returns the following: drivers/net/qede/qede_fdir.c: mz = rte_memzone_reserve_aligned(mz_name, QEDE_MAX_FDIR_PKT_LEN, drivers/net/qede/base/bcm_osal.c: mz = rte_memzone_reserve_aligned_contig(mz_name, size, drivers/net/qede/base/bcm_osal.c: mz = rte_memzone_reserve_aligned_contig(mz_name, size, socket_id, 0, I took a brief look at memzone in qede_fdir and it didn't look like memzone was used for DMA, so i left it alone. Corrections welcome. drivers/net/qede/base/bcm_osal.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/net/qede/base/bcm_osal.c b/drivers/net/qede/base/bcm_osal.c index fe42f32..707d553 100644 --- a/drivers/net/qede/base/bcm_osal.c +++ b/drivers/net/qede/base/bcm_osal.c @@ -135,7 +135,7 @@ void *osal_dma_alloc_coherent(struct ecore_dev *p_dev, if (core_id == (unsigned int)LCORE_ID_ANY) core_id = 0; socket_id = rte_lcore_to_socket_id(core_id); - mz = rte_memzone_reserve_aligned(mz_name, size, + mz = rte_memzone_reserve_aligned_contig(mz_name, size, socket_id, 0, RTE_CACHE_LINE_SIZE); if (!mz) { DP_ERR(p_dev, "Unable to allocate DMA memory " @@ -174,7 +174,8 @@ void *osal_dma_alloc_coherent_aligned(struct ecore_dev *p_dev, if (core_id == (unsigned int)LCORE_ID_ANY) core_id = 0; socket_id = rte_lcore_to_socket_id(core_id); - mz = rte_memzone_reserve_aligned(mz_name, size, socket_id, 0, align); + mz = rte_memzone_reserve_aligned_contig(mz_name, size, socket_id, 0, + align); if (!mz) { DP_ERR(p_dev, "Unable to allocate DMA memory " "of size %zu bytes - %s\n", -- 2.7.4
[dpdk-dev] [PATCH v2 41/41] net/vmxnet3: use contiguous allocation for DMA memory
Signed-off-by: Anatoly Burakov --- Notes: Not sure if DMA-capable memzones are needed for vmxnet3. Corrections welcome. drivers/net/vmxnet3/vmxnet3_ethdev.c | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/drivers/net/vmxnet3/vmxnet3_ethdev.c b/drivers/net/vmxnet3/vmxnet3_ethdev.c index 4e68aae..c787379 100644 --- a/drivers/net/vmxnet3/vmxnet3_ethdev.c +++ b/drivers/net/vmxnet3/vmxnet3_ethdev.c @@ -150,14 +150,15 @@ gpa_zone_reserve(struct rte_eth_dev *dev, uint32_t size, if (!reuse) { if (mz) rte_memzone_free(mz); - return rte_memzone_reserve_aligned(z_name, size, socket_id, - 0, align); + return rte_memzone_reserve_aligned_contig(z_name, size, + socket_id, 0, align); } if (mz) return mz; - return rte_memzone_reserve_aligned(z_name, size, socket_id, 0, align); + return rte_memzone_reserve_aligned_contig(z_name, size, socket_id, 0, + align); } /** -- 2.7.4
[dpdk-dev] [PATCH v2 28/41] eal: add support for multiprocess memory hotplug
This enables multiprocess synchronization for memory hotplug requests at runtime (as opposed to initialization). The basic workflow is as follows. The primary process always does the initial mapping and unmapping, and secondary processes always follow the primary's page map. Only one allocation request can be active at any one time.

When the primary allocates memory, it ensures that all other processes have allocated the same set of hugepages successfully; otherwise, any allocations made are rolled back and the heap space is freed back. The heap is locked throughout the process, so no race conditions can occur. When the primary frees memory, it frees the heap, deallocates the affected pages, and notifies other processes of the deallocations. Since the heap is freed from that memory chunk, the area essentially becomes invisible to other processes even if they happen to fail to unmap that specific set of pages, so it is completely safe to ignore the results of sync requests.

When a secondary allocates memory, it does not do so by itself. Instead, it sends a request to the primary process to try to allocate pages of the specified size and on the specified socket, such that the pending heap allocation request can complete. The primary process then sends all secondaries (including the requestor) a separate notification of the allocated pages, and expects all secondary processes to report success before considering the pages as "allocated". Only after the primary has ensured that all memory was successfully allocated in all secondary processes does it respond positively to the initial request and let the secondary proceed with the allocation. Since the heap now has memory that can satisfy the allocation request, and it was locked all this time (so no other allocations could take place), the secondary process will be able to allocate memory from the heap.

When a secondary frees memory, it hides the pages to be deallocated from the heap. Then, it sends a deallocation request to the primary process, so that the primary deallocates the pages itself, and then sends a separate sync request to all other processes (including the requestor) to unmap the same pages. This way, even if the secondary fails to notify other processes of this deallocation, that memory becomes invisible to other processes and will not be allocated from again.

So, to summarize: address space only becomes part of the heap if the primary process can ensure that all other processes have allocated this memory successfully. If anything goes wrong, the worst that can happen is that a page will "leak" and will be available to neither DPDK nor the system, as some process will still hold onto it. It is not an actual leak, as we can account for the page - it is just that none of the processes will be able to use this page for anything useful until the primary allocates from it again.

Due to the underlying DPDK IPC implementation being single-threaded, some asynchronous magic had to be done, as we need to complete several requests before we can definitively allow a secondary process to use the allocated memory (namely, it has to be present in all other secondary processes before it can be used). Additionally, only one allocation request is allowed to be submitted at once. Memory allocation requests are only allowed when there are no secondary processes currently initializing. 
To enforce that, a shared rwlock is used, that is set to read lock on init (so that several secondaries could initialize concurrently), and write lock on making allocation requests (so that either secondary init will have to wait, or allocation request will have to wait until all processes have initialized). Signed-off-by: Anatoly Burakov --- Notes: v2: - fixed deadlocking on init problem - reverted rte_panic changes (fixed by changes in IPC instead) This problem is evidently complex to solve without multithreaded IPC implementation. An alternative approach would be to process each individual message in its own thread (or at least spawn a thread per incoming request) - that way, we can send requests while responding to another request, and this problem becomes trivial to solve (and in fact it was solved that way initially, before my aversion to certain other programming languages kicked in). Is the added complexity worth saving a couple of thread spin-ups here and there? lib/librte_eal/bsdapp/eal/Makefile| 1 + lib/librte_eal/common/eal_common_memory.c | 16 +- lib/librte_eal/common/include/rte_eal_memconfig.h | 3 + lib/librte_eal/common/malloc_heap.c | 255 ++-- lib/librte_eal/common/malloc_mp.c | 723 ++ lib/librte_eal/common/malloc_mp.h | 86 +++ lib/librte_eal/common/meson.build | 1 + lib/librte_eal/linuxapp/eal/Makefile | 1 + 8 files changed, 1040 insertions(+), 46 deletions(-) create mode 100644 lib/librte_eal/common/malloc_mp.c create mode 100644 lib/librte_eal/com
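The request flow described above boils down to a handful of message types; a conceptual outline (names are illustrative, the real definitions live in the malloc_mp.c/h files added by this patch):

/* illustrative only - not the actual wire format */
enum malloc_req_type {
        REQ_ALLOC,      /* secondary -> primary: allocate pages for my heap request */
        REQ_FREE,       /* secondary -> primary: deallocate these pages */
        REQ_SYNC,       /* primary -> all secondaries: replay my page map */
        REQ_ROLLBACK,   /* primary -> all: undo a partially failed allocation */
};
/* An allocation succeeds only when every secondary acks the sync;
 * otherwise the primary rolls back and fails the original request. */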
[dpdk-dev] [PATCH v2 40/41] net/virtio: use contiguous allocation for DMA memory
Signed-off-by: Anatoly Burakov Reviewed-by: Venkatesh Srinivas --- Notes: Not sure if virtio needs to allocate DMA-capable memory, being a software driver and all. Corrections welcome. drivers/net/virtio/virtio_ethdev.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c index 884f74a..35812e4 100644 --- a/drivers/net/virtio/virtio_ethdev.c +++ b/drivers/net/virtio/virtio_ethdev.c @@ -391,7 +391,7 @@ virtio_init_queue(struct rte_eth_dev *dev, uint16_t vtpci_queue_idx) PMD_INIT_LOG(DEBUG, "vring_size: %d, rounded_vring_size: %d", size, vq->vq_ring_size); - mz = rte_memzone_reserve_aligned(vq_name, vq->vq_ring_size, + mz = rte_memzone_reserve_aligned_contig(vq_name, vq->vq_ring_size, SOCKET_ID_ANY, 0, VIRTIO_PCI_VRING_ALIGN); if (mz == NULL) { @@ -417,9 +417,9 @@ virtio_init_queue(struct rte_eth_dev *dev, uint16_t vtpci_queue_idx) if (sz_hdr_mz) { snprintf(vq_hdr_name, sizeof(vq_hdr_name), "port%d_vq%d_hdr", dev->data->port_id, vtpci_queue_idx); - hdr_mz = rte_memzone_reserve_aligned(vq_hdr_name, sz_hdr_mz, -SOCKET_ID_ANY, 0, -RTE_CACHE_LINE_SIZE); + hdr_mz = rte_memzone_reserve_aligned_contig(vq_hdr_name, + sz_hdr_mz, SOCKET_ID_ANY, 0, + RTE_CACHE_LINE_SIZE); if (hdr_mz == NULL) { if (rte_errno == EEXIST) hdr_mz = rte_memzone_lookup(vq_hdr_name); -- 2.7.4
[dpdk-dev] [PATCH v2 14/41] eal: add support for mapping hugepages at runtime
Nothing uses this code yet. The bulk of it is copied from old memory allocation code (linuxapp eal_memory.c). We provide an EAL-internal API to allocate either one page or multiple pages, guaranteeing that we'll get contiguous VA for all of the pages that we requested. For single-file segments, we will use fallocate() to grow and shrink memory segments, however fallocate() is not supported on all kernel versions, so we will fall back to using ftruncate() to grow the file, and disable shrinking as there's little we can do there. This will enable vhost use cases where having single file segments is of great value even without support for hot-unplugging memory. Not supported on FreeBSD. Locking is done via fcntl() because that way, when it comes to taking out write locks or unlocking on deallocation, we don't have to keep original fd's around. Plus, using fcntl() gives us ability to lock parts of a file, which is useful for single-file segments. Signed-off-by: Anatoly Burakov --- lib/librte_eal/bsdapp/eal/Makefile | 1 + lib/librte_eal/bsdapp/eal/eal_memalloc.c | 26 ++ lib/librte_eal/bsdapp/eal/meson.build | 1 + lib/librte_eal/common/eal_memalloc.h | 19 + lib/librte_eal/linuxapp/eal/Makefile | 2 + lib/librte_eal/linuxapp/eal/eal_memalloc.c | 609 + lib/librte_eal/linuxapp/eal/meson.build| 1 + 7 files changed, 659 insertions(+) create mode 100644 lib/librte_eal/bsdapp/eal/eal_memalloc.c create mode 100644 lib/librte_eal/common/eal_memalloc.h create mode 100644 lib/librte_eal/linuxapp/eal/eal_memalloc.c diff --git a/lib/librte_eal/bsdapp/eal/Makefile b/lib/librte_eal/bsdapp/eal/Makefile index 1b43d77..19f9322 100644 --- a/lib/librte_eal/bsdapp/eal/Makefile +++ b/lib/librte_eal/bsdapp/eal/Makefile @@ -29,6 +29,7 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_BSDAPP) += eal_memory.c SRCS-$(CONFIG_RTE_EXEC_ENV_BSDAPP) += eal_hugepage_info.c SRCS-$(CONFIG_RTE_EXEC_ENV_BSDAPP) += eal_thread.c SRCS-$(CONFIG_RTE_EXEC_ENV_BSDAPP) += eal_debug.c +SRCS-$(CONFIG_RTE_EXEC_ENV_BSDAPP) += eal_memalloc.c SRCS-$(CONFIG_RTE_EXEC_ENV_BSDAPP) += eal_lcore.c SRCS-$(CONFIG_RTE_EXEC_ENV_BSDAPP) += eal_timer.c SRCS-$(CONFIG_RTE_EXEC_ENV_BSDAPP) += eal_interrupts.c diff --git a/lib/librte_eal/bsdapp/eal/eal_memalloc.c b/lib/librte_eal/bsdapp/eal/eal_memalloc.c new file mode 100644 index 000..be8340b --- /dev/null +++ b/lib/librte_eal/bsdapp/eal/eal_memalloc.c @@ -0,0 +1,26 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2017-2018 Intel Corporation + */ + +#include + +#include +#include + +#include "eal_memalloc.h" + +int +eal_memalloc_alloc_page_bulk(struct rte_memseg **ms __rte_unused, + int __rte_unused n, uint64_t __rte_unused size, + int __rte_unused socket, bool __rte_unused exact) +{ + RTE_LOG(ERR, EAL, "Memory hotplug not supported on FreeBSD\n"); + return -1; +} + +struct rte_memseg * +eal_memalloc_alloc_page(uint64_t __rte_unused size, int __rte_unused socket) +{ + RTE_LOG(ERR, EAL, "Memory hotplug not supported on FreeBSD\n"); + return NULL; +} diff --git a/lib/librte_eal/bsdapp/eal/meson.build b/lib/librte_eal/bsdapp/eal/meson.build index e83fc91..4b40223 100644 --- a/lib/librte_eal/bsdapp/eal/meson.build +++ b/lib/librte_eal/bsdapp/eal/meson.build @@ -8,6 +8,7 @@ env_sources = files('eal_alarm.c', 'eal_hugepage_info.c', 'eal_interrupts.c', 'eal_lcore.c', + 'eal_memalloc.c', 'eal_thread.c', 'eal_timer.c', 'eal.c', diff --git a/lib/librte_eal/common/eal_memalloc.h b/lib/librte_eal/common/eal_memalloc.h new file mode 100644 index 000..c1076cf --- /dev/null +++ b/lib/librte_eal/common/eal_memalloc.h @@ -0,0 +1,19 
@@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2017-2018 Intel Corporation + */ + +#ifndef EAL_MEMALLOC_H +#define EAL_MEMALLOC_H + +#include + +#include + +struct rte_memseg * +eal_memalloc_alloc_page(uint64_t size, int socket); + +int +eal_memalloc_alloc_page_bulk(struct rte_memseg **ms, int n, uint64_t size, + int socket, bool exact); + +#endif // EAL_MEMALLOC_H diff --git a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile index c407a43..af6b9be 100644 --- a/lib/librte_eal/linuxapp/eal/Makefile +++ b/lib/librte_eal/linuxapp/eal/Makefile @@ -36,6 +36,7 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_thread.c SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_log.c SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_vfio.c SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_vfio_mp_sync.c +SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_memalloc.c SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_debug.c SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_lcore.c SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_timer.c @@ -82,6 +83,7 @@ CFLAGS_eal_interrupts.o := -D_GNU_SOURCE CFLAGS_eal_vfio_mp_sync.o := -D_GNU_SOURCE CFLAGS_eal_timer.o
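The fcntl()-based locking mentioned above roughly follows this pattern (an approximation of the helper the patch later calls as lock(fd, offset, len, F_RDLCK/F_WRLCK/F_UNLCK); the return convention assumed here is 1 = lock taken, 0 = held elsewhere, -1 = error):

#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <stdint.h>

static int
lock(int fd, uint64_t offset, uint64_t len, int type)
{
        struct flock fl = {
                .l_type = type,         /* F_RDLCK, F_WRLCK or F_UNLCK */
                .l_whence = SEEK_SET,
                .l_start = offset,      /* lock only part of the file, so
                                         * single-file segments can lock
                                         * individual pages */
                .l_len = len,
        };

        if (fcntl(fd, F_SETLK, &fl) == -1)
                return (errno == EAGAIN || errno == EACCES) ? 0 : -1;
        return 1;
}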
[dpdk-dev] [PATCH v2 25/41] eal: map/unmap memory with VFIO when alloc/free pages
Signed-off-by: Anatoly Burakov --- lib/librte_eal/linuxapp/eal/eal_memalloc.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/lib/librte_eal/linuxapp/eal/eal_memalloc.c b/lib/librte_eal/linuxapp/eal/eal_memalloc.c index bbeeeba..c03e7bc 100644 --- a/lib/librte_eal/linuxapp/eal/eal_memalloc.c +++ b/lib/librte_eal/linuxapp/eal/eal_memalloc.c @@ -34,6 +34,7 @@ #include #include #include +#include #include "eal_filesystem.h" #include "eal_internal_cfg.h" @@ -476,6 +477,10 @@ alloc_page(struct rte_memseg *ms, void *addr, uint64_t size, int socket_id, ms->iova = iova; ms->socket_id = socket_id; + /* map the segment so that VFIO has access to it */ + if (rte_eal_iova_mode() == RTE_IOVA_VA && + rte_vfio_dma_map(ms->addr_64, iova, size)) + RTE_LOG(DEBUG, EAL, "Cannot register segment with VFIO\n"); return 0; mapped: @@ -507,6 +512,12 @@ free_page(struct rte_memseg *ms, struct hugepage_info *hi, char path[PATH_MAX]; int fd, ret; + /* unmap the segment from VFIO */ + if (rte_eal_iova_mode() == RTE_IOVA_VA && + rte_vfio_dma_unmap(ms->addr_64, ms->iova, ms->len)) { + RTE_LOG(DEBUG, EAL, "Cannot unregister segment with VFIO\n"); + } + if (mmap(ms->addr, ms->hugepage_sz, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) == MAP_FAILED) { -- 2.7.4
[dpdk-dev] [PATCH v2 15/41] eal: add support for unmapping pages at runtime
This isn't used anywhere yet, but the support is now there. Also, adding cleanup to allocation procedures, so that if we fail to allocate everything we asked for, we can free all of it back. Signed-off-by: Anatoly Burakov --- lib/librte_eal/common/eal_memalloc.h | 3 + lib/librte_eal/linuxapp/eal/eal_memalloc.c | 148 - 2 files changed, 146 insertions(+), 5 deletions(-) diff --git a/lib/librte_eal/common/eal_memalloc.h b/lib/librte_eal/common/eal_memalloc.h index c1076cf..adf59c4 100644 --- a/lib/librte_eal/common/eal_memalloc.h +++ b/lib/librte_eal/common/eal_memalloc.h @@ -16,4 +16,7 @@ int eal_memalloc_alloc_page_bulk(struct rte_memseg **ms, int n, uint64_t size, int socket, bool exact); +int +eal_memalloc_free_page(struct rte_memseg *ms); + #endif // EAL_MEMALLOC_H diff --git a/lib/librte_eal/linuxapp/eal/eal_memalloc.c b/lib/librte_eal/linuxapp/eal/eal_memalloc.c index 1ba1201..bbeeeba 100644 --- a/lib/librte_eal/linuxapp/eal/eal_memalloc.c +++ b/lib/librte_eal/linuxapp/eal/eal_memalloc.c @@ -499,6 +499,64 @@ alloc_page(struct rte_memseg *ms, void *addr, uint64_t size, int socket_id, return -1; } +static int +free_page(struct rte_memseg *ms, struct hugepage_info *hi, + unsigned int list_idx, unsigned int seg_idx) +{ + uint64_t map_offset; + char path[PATH_MAX]; + int fd, ret; + + if (mmap(ms->addr, ms->hugepage_sz, PROT_READ, + MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) == + MAP_FAILED) { + RTE_LOG(DEBUG, EAL, "couldn't unmap page\n"); + return -1; + } + + fd = get_page_fd(path, sizeof(path), hi, list_idx, seg_idx); + if (fd < 0) + return -1; + + if (internal_config.single_file_segments) { + map_offset = seg_idx * ms->hugepage_sz; + if (resize_hugefile(fd, map_offset, ms->hugepage_sz, false)) + return -1; + /* if file is zero-length, we've already shrunk it, so it's +* safe to remove. +*/ + if (is_zero_length(fd)) { + struct msl_entry *te = get_msl_entry_by_idx(list_idx); + if (te != NULL && te->fd >= 0) { + close(te->fd); + te->fd = -1; + } + unlink(path); + } + ret = 0; + } else { + /* if we're able to take out a write lock, we're the last one +* holding onto this page. +*/ + + ret = lock(fd, 0, ms->hugepage_sz, F_WRLCK); + if (ret >= 0) { + /* no one else is using this page */ + if (ret == 1) + unlink(path); + ret = lock(fd, 0, ms->hugepage_sz, F_UNLCK); + if (ret != 1) + RTE_LOG(ERR, EAL, "%s(): unable to unlock file %s\n", + __func__, path); + } + close(fd); + } + + memset(ms, 0, sizeof(*ms)); + + return ret; +} + int eal_memalloc_alloc_page_bulk(struct rte_memseg **ms, int n, uint64_t size, int socket, bool exact) @@ -507,7 +565,7 @@ eal_memalloc_alloc_page_bulk(struct rte_memseg **ms, int n, struct rte_memseg_list *msl = NULL; void *addr; unsigned int msl_idx; - int cur_idx, end_idx, i, ret = -1; + int cur_idx, start_idx, end_idx, i, j, ret = -1; #ifdef RTE_EAL_NUMA_AWARE_HUGEPAGES bool have_numa; int oldpolicy; @@ -557,6 +615,7 @@ eal_memalloc_alloc_page_bulk(struct rte_memseg **ms, int n, continue; end_idx = cur_idx + n; + start_idx = cur_idx; for (i = 0; cur_idx < end_idx; cur_idx++, i++) { struct rte_memseg *cur; @@ -567,25 +626,56 @@ eal_memalloc_alloc_page_bulk(struct rte_memseg **ms, int n, if (alloc_page(cur, addr, size, socket, hi, msl_idx, cur_idx)) { + RTE_LOG(DEBUG, EAL, "attempted to allocate %i pages, but only %i were allocated\n", n, i); - /* if exact number wasn't requested, stop */ - if (!exact) + /* if exact number of pages wasn't requested, +* failing to allocate is not an error. 
we could +* of course try other lists to see if there are +* better fits, but a bird in the hand... +*/ + if (!exact) {
[dpdk-dev] [PATCH v2 22/41] eal: replace memzone array with fbarray
It's there, so we might as well use it. Some operations will be sped up by that. Since we have to allocate an fbarray for memzones, we have to do it before we initialize memory subsystem, because that, in secondary processes, will (later) allocate more fbarrays than the primary process, which will result in inability to attach to memzone fbarray if we do it after the fact. Signed-off-by: Anatoly Burakov --- Notes: Code for ENA driver makes little sense to me, but i've attempted to keep the same semantics as the old code. drivers/net/ena/ena_ethdev.c | 10 +- lib/librte_eal/bsdapp/eal/eal.c | 6 + lib/librte_eal/common/eal_common_memzone.c| 180 +++--- lib/librte_eal/common/include/rte_eal_memconfig.h | 4 +- lib/librte_eal/common/malloc_heap.c | 4 + lib/librte_eal/linuxapp/eal/eal.c | 13 +- test/test/test_memzone.c | 9 +- 7 files changed, 157 insertions(+), 69 deletions(-) diff --git a/drivers/net/ena/ena_ethdev.c b/drivers/net/ena/ena_ethdev.c index 34b2a8d..f7bfc7a 100644 --- a/drivers/net/ena/ena_ethdev.c +++ b/drivers/net/ena/ena_ethdev.c @@ -264,11 +264,15 @@ static const struct eth_dev_ops ena_dev_ops = { static inline int ena_cpu_to_node(int cpu) { struct rte_config *config = rte_eal_get_configuration(); + struct rte_fbarray *arr = &config->mem_config->memzones; + const struct rte_memzone *mz; - if (likely(cpu < RTE_MAX_MEMZONE)) - return config->mem_config->memzone[cpu].socket_id; + if (unlikely(cpu >= RTE_MAX_MEMZONE)) + return NUMA_NO_NODE; - return NUMA_NO_NODE; + mz = rte_fbarray_get(arr, cpu); + + return mz->socket_id; } static inline void ena_rx_mbuf_prepare(struct rte_mbuf *mbuf, diff --git a/lib/librte_eal/bsdapp/eal/eal.c b/lib/librte_eal/bsdapp/eal/eal.c index 45e5670..3b06e21 100644 --- a/lib/librte_eal/bsdapp/eal/eal.c +++ b/lib/librte_eal/bsdapp/eal/eal.c @@ -608,6 +608,12 @@ rte_eal_init(int argc, char **argv) return -1; } + if (rte_eal_malloc_heap_init() < 0) { + rte_eal_init_alert("Cannot init malloc heap\n"); + rte_errno = ENODEV; + return -1; + } + if (rte_eal_tailqs_init() < 0) { rte_eal_init_alert("Cannot init tail queues for objects\n"); rte_errno = EFAULT; diff --git a/lib/librte_eal/common/eal_common_memzone.c b/lib/librte_eal/common/eal_common_memzone.c index 8c9aa28..a7cfdaf 100644 --- a/lib/librte_eal/common/eal_common_memzone.c +++ b/lib/librte_eal/common/eal_common_memzone.c @@ -28,42 +28,29 @@ static inline const struct rte_memzone * memzone_lookup_thread_unsafe(const char *name) { - const struct rte_mem_config *mcfg; + struct rte_mem_config *mcfg; + struct rte_fbarray *arr; const struct rte_memzone *mz; - unsigned i = 0; + int i = 0; /* get pointer to global configuration */ mcfg = rte_eal_get_configuration()->mem_config; + arr = &mcfg->memzones; /* * the algorithm is not optimal (linear), but there are few * zones and this function should be called at init only */ - for (i = 0; i < RTE_MAX_MEMZONE; i++) { - mz = &mcfg->memzone[i]; - if (mz->addr != NULL && !strncmp(name, mz->name, RTE_MEMZONE_NAMESIZE)) - return &mcfg->memzone[i]; + while ((i = rte_fbarray_find_next_used(arr, i)) >= 0) { + mz = rte_fbarray_get(arr, i++); + if (mz->addr != NULL && + !strncmp(name, mz->name, RTE_MEMZONE_NAMESIZE)) + return mz; } return NULL; } -static inline struct rte_memzone * -get_next_free_memzone(void) -{ - struct rte_mem_config *mcfg; - unsigned i = 0; - - /* get pointer to global configuration */ - mcfg = rte_eal_get_configuration()->mem_config; - - for (i = 0; i < RTE_MAX_MEMZONE; i++) { - if (mcfg->memzone[i].addr == NULL) - return &mcfg->memzone[i]; - } - - return 
NULL; -} /* This function will return the greatest free block if a heap has been * specified. If no heap has been specified, it will return the heap and @@ -103,13 +90,16 @@ memzone_reserve_aligned_thread_unsafe(const char *name, size_t len, { struct rte_memzone *mz; struct rte_mem_config *mcfg; + struct rte_fbarray *arr; size_t requested_len; + int idx; /* get pointer to global configuration */ mcfg = rte_eal_get_configuration()->mem_config; + arr = &mcfg->memzones; /* no more room in config */ - if (mcfg->memzone_cnt >= RTE_MAX_MEMZONE) { + if (arr->count >= arr->len) { RTE_LOG(ERR, EAL, "%s(): No more room in config\n", __fun
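The lookup above uses the generic fbarray iteration pattern; as a standalone sketch (arr is assumed to be an initialized or attached rte_fbarray holding struct rte_memzone entries):

#include <string.h>
#include <rte_fbarray.h>
#include <rte_memzone.h>

static const struct rte_memzone *
find_zone(struct rte_fbarray *arr, const char *name)
{
        int i = 0;

        /* walk only the slots marked as used */
        while ((i = rte_fbarray_find_next_used(arr, i)) >= 0) {
                const struct rte_memzone *mz = rte_fbarray_get(arr, i++);

                if (mz != NULL && strcmp(mz->name, name) == 0)
                        return mz;
        }
        return NULL;
}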
[dpdk-dev] [PATCH v2 35/41] net/cxgbe: use contiguous allocation for DMA memory
Signed-off-by: Anatoly Burakov
---
Notes:
    It is not 100% clear if this memzone is used for DMA, corrections welcome.

 drivers/net/cxgbe/sge.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/cxgbe/sge.c b/drivers/net/cxgbe/sge.c
index 3d5aa59..e31474c 100644
--- a/drivers/net/cxgbe/sge.c
+++ b/drivers/net/cxgbe/sge.c
@@ -1299,7 +1299,8 @@ static void *alloc_ring(size_t nelem, size_t elem_size,
 	 * handle the maximum ring size is allocated in order to allow for
 	 * resizing in later calls to the queue setup function.
 	 */
-	tz = rte_memzone_reserve_aligned(z_name, len, socket_id, 0, 4096);
+	tz = rte_memzone_reserve_aligned_contig(z_name, len, socket_id, 0,
+						4096);
 
 	if (!tz)
 		return NULL;
-- 
2.7.4
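As a rough illustration of the driver-side pattern: with the legacy memory model a plain reserve was implicitly IOVA-contiguous, while with the new dynamic model a descriptor ring used for DMA needs the explicit _contig variant introduced earlier in this series (not part of current releases). The helper name and 4K alignment below are made up for the sketch:

#include <rte_memzone.h>

/* Hypothetical helper: reserve an IOVA-contiguous, 4K-aligned ring so
 * the NIC can address the whole ring from a single base address.
 */
static const struct rte_memzone *
ring_reserve_sketch(const char *name, size_t len, int socket_id)
{
	return rte_memzone_reserve_aligned_contig(name, len, socket_id,
						  0, 4096);
}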
[dpdk-dev] [PATCH v2 38/41] net/i40e: use contiguous allocation for DMA memory
Signed-off-by: Anatoly Burakov
---
Notes:
    It is not 100% clear that all users of this function need to allocate
    DMA memory. Corrections welcome.

 drivers/net/i40e/i40e_ethdev.c | 2 +-
 drivers/net/i40e/i40e_rxtx.c   | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index 508b417..0fffe2c 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -4010,7 +4010,7 @@ i40e_allocate_dma_mem_d(__attribute__((unused)) struct i40e_hw *hw,
 		return I40E_ERR_PARAM;
 
 	snprintf(z_name, sizeof(z_name), "i40e_dma_%"PRIu64, rte_rand());
-	mz = rte_memzone_reserve_bounded(z_name, size, SOCKET_ID_ANY, 0,
+	mz = rte_memzone_reserve_bounded_contig(z_name, size, SOCKET_ID_ANY, 0,
 					 alignment, RTE_PGSIZE_2M);
 	if (!mz)
 		return I40E_ERR_NO_MEMORY;
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 1217e5a..6b2b40e 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -2189,7 +2189,7 @@ i40e_memzone_reserve(const char *name, uint32_t len, int socket_id)
 	if (mz)
 		return mz;
 
-	mz = rte_memzone_reserve_aligned_contig(name, len,
+	mz = rte_memzone_reserve_aligned_contig(name, len,
 			socket_id, 0, I40E_RING_BASE_ALIGN);
 	return mz;
 }
-- 
2.7.4
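The i40e admin-queue path additionally bounds the zone so it does not cross a 2 MB boundary. A hedged sketch of that combination, again assuming the _contig variants proposed in this series (helper name is illustrative):

#include <rte_memory.h>		/* SOCKET_ID_ANY, RTE_PGSIZE_2M */
#include <rte_memzone.h>

/* Hypothetical: a DMA buffer that must be IOVA-contiguous and must not
 * straddle a 2 MB boundary (a typical hardware restriction on admin
 * queue memory).
 */
static const struct rte_memzone *
dma_reserve_bounded_sketch(const char *name, size_t size, size_t align)
{
	return rte_memzone_reserve_bounded_contig(name, size, SOCKET_ID_ANY,
						  0, align, RTE_PGSIZE_2M);
}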
[dpdk-dev] [PATCH v2 34/41] net/bnx2x: use contiguous allocation for DMA memory
Signed-off-by: Anatoly Burakov
---
 drivers/net/bnx2x/bnx2x.c      | 2 +-
 drivers/net/bnx2x/bnx2x_rxtx.c | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/bnx2x/bnx2x.c b/drivers/net/bnx2x/bnx2x.c
index fb02d0f..81f5dae 100644
--- a/drivers/net/bnx2x/bnx2x.c
+++ b/drivers/net/bnx2x/bnx2x.c
@@ -177,7 +177,7 @@ bnx2x_dma_alloc(struct bnx2x_softc *sc, size_t size, struct bnx2x_dma *dma,
 			rte_get_timer_cycles());
 
 	/* Caller must take care that strlen(mz_name) < RTE_MEMZONE_NAMESIZE */
-	z = rte_memzone_reserve_aligned(mz_name, (uint64_t) (size),
+	z = rte_memzone_reserve_aligned_contig(mz_name, (uint64_t)size,
 					SOCKET_ID_ANY, 0, align);
 	if (z == NULL) {
diff --git a/drivers/net/bnx2x/bnx2x_rxtx.c b/drivers/net/bnx2x/bnx2x_rxtx.c
index a0d4ac9..325b94d 100644
--- a/drivers/net/bnx2x/bnx2x_rxtx.c
+++ b/drivers/net/bnx2x/bnx2x_rxtx.c
@@ -26,7 +26,8 @@ ring_dma_zone_reserve(struct rte_eth_dev *dev, const char *ring_name,
 	if (mz)
 		return mz;
 
-	return rte_memzone_reserve_aligned(z_name, ring_size, socket_id, 0, BNX2X_PAGE_SIZE);
+	return rte_memzone_reserve_aligned_contig(z_name, ring_size, socket_id,
+						  0, BNX2X_PAGE_SIZE);
 }
 
 static void
-- 
2.7.4
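The bnx2x ring helper also shows the usual lookup-then-reserve idiom drivers use to survive a second queue setup for the same ring name. A small sketch of that pattern (names are illustrative, and the _contig variant is again the one proposed in this series):

#include <rte_memzone.h>

/* Reuse an existing zone on queue re-setup; otherwise reserve a new,
 * IOVA-contiguous one. Memzones are not implicitly freed, so a second
 * setup call with the same name finds the first allocation.
 */
static const struct rte_memzone *
ring_get_or_reserve_sketch(const char *name, size_t len, int socket_id,
			   size_t align)
{
	const struct rte_memzone *mz = rte_memzone_lookup(name);

	if (mz != NULL)
		return mz;
	return rte_memzone_reserve_aligned_contig(name, len, socket_id,
						  0, align);
}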
[dpdk-dev] [PATCH v2 36/41] net/ena: use contiguous allocation for DMA memory
Signed-off-by: Anatoly Burakov
---
 drivers/net/ena/base/ena_plat_dpdk.h | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ena/base/ena_plat_dpdk.h b/drivers/net/ena/base/ena_plat_dpdk.h
index 8cba319..c1ebf00 100644
--- a/drivers/net/ena/base/ena_plat_dpdk.h
+++ b/drivers/net/ena/base/ena_plat_dpdk.h
@@ -188,7 +188,8 @@ typedef uint64_t dma_addr_t;
 		ENA_TOUCH(dmadev); ENA_TOUCH(handle);			\
 		snprintf(z_name, sizeof(z_name),			\
 				"ena_alloc_%d", ena_alloc_cnt++);	\
-		mz = rte_memzone_reserve(z_name, size, SOCKET_ID_ANY, 0); \
+		mz = rte_memzone_reserve_contig(z_name,			\
+				size, SOCKET_ID_ANY, 0);		\
 		memset(mz->addr, 0, size);				\
 		virt = mz->addr;					\
 		phys = mz->iova;					\
@@ -206,7 +207,7 @@ typedef uint64_t dma_addr_t;
 		ENA_TOUCH(dmadev); ENA_TOUCH(dev_node);			\
 		snprintf(z_name, sizeof(z_name),			\
 				"ena_alloc_%d", ena_alloc_cnt++);	\
-		mz = rte_memzone_reserve(z_name, size, node, 0);	\
+		mz = rte_memzone_reserve_contig(z_name, size, node, 0);	\
 		memset(mz->addr, 0, size);				\
 		virt = mz->addr;					\
 		phys = mz->iova;					\
@@ -219,7 +220,7 @@ typedef uint64_t dma_addr_t;
 		ENA_TOUCH(dmadev); ENA_TOUCH(dev_node);			\
 		snprintf(z_name, sizeof(z_name),			\
 				"ena_alloc_%d", ena_alloc_cnt++);	\
-		mz = rte_memzone_reserve(z_name, size, node, 0);	\
+		mz = rte_memzone_reserve_contig(z_name, size, node, 0);	\
 		memset(mz->addr, 0, size);				\
 		virt = mz->addr;					\
 	} while (0)
-- 
2.7.4
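Expressed as a plain function rather than a macro, the allocation pattern these ENA shims implement looks roughly like the sketch below. This is illustrative only (the function name is made up, and the driver itself stays macro-based); unlike the original macros, the sketch also checks the reservation result before zeroing:

#include <string.h>
#include <rte_memory.h>
#include <rte_memzone.h>

/* Hypothetical equivalent of ENA_MEM_ALLOC_COHERENT: reserve an
 * IOVA-contiguous zone, zero it, and hand back both the virtual
 * address and the IOVA for the device.
 */
static void *
ena_dma_zalloc_sketch(const char *name, size_t size, int socket,
		      rte_iova_t *iova)
{
	const struct rte_memzone *mz =
		rte_memzone_reserve_contig(name, size, socket, 0);

	if (mz == NULL)
		return NULL;
	memset(mz->addr, 0, size);
	*iova = mz->iova;
	return mz->addr;
}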
Re: [dpdk-dev] [PATCH] eal: register rte_panic user callback
07/03/2018 17:31, Arnon Warshavsky:
> > > Can we add a compile warning for adding new rte_panic's into code? It's a
> > > nice tool while debugging, but it probably shouldn't be in any new
> > > production code.
> >
> > Yes could be nice to automatically detect it in drivers/ or lib/
> > directories.
>
> How do we apply a warning only to new code? via checkpatch?

Yes, in devtools/checkpatches.sh
Re: [dpdk-dev] [PATCH v2] ethdev: remove versioning of ethdev filter control function
On 2/27/2018 2:18 PM, Kirill Rybalchenko wrote:
> In 18.02 release the ABI of ethdev component was changed.
> To keep compatibility with previous versions of the library
> the versioning of rte_eth_dev_filter_ctrl function was implemented.
> As soon as deprecation note was issued in 18.02 release, there is
> no need to keep compatibility with previous versions.
> Remove the versioning of rte_eth_dev_filter_ctrl function.
>
> v2:
> Modify map file, increment library version,
> remove deprecation notice
>
> Signed-off-by: Kirill Rybalchenko

Reviewed-by: Ferruh Yigit
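For readers unfamiliar with what is being removed: DPDK function versioning is usually expressed with the rte_compat.h macros, roughly as sketched below. This is a hedged illustration only; the exact wrapper names and version tags used for rte_eth_dev_filter_ctrl may differ from what the real 18.02 code had:

#include <rte_compat.h>
#include <rte_ethdev.h>

/* Old implementation, kept for binaries linked against the older ABI. */
int rte_eth_dev_filter_ctrl_v1711(uint16_t port_id,
		enum rte_filter_type filter_type,
		enum rte_filter_op filter_op, void *arg);
VERSION_SYMBOL(rte_eth_dev_filter_ctrl, _v1711, 17.11);

/* New implementation, exported as the default symbol going forward. */
int rte_eth_dev_filter_ctrl_v1802(uint16_t port_id,
		enum rte_filter_type filter_type,
		enum rte_filter_op filter_op, void *arg);
BIND_DEFAULT_SYMBOL(rte_eth_dev_filter_ctrl, _v1802, 18.02);

Removing the versioning means deleting the old wrapper, the VERSION_SYMBOL/BIND_DEFAULT_SYMBOL lines and the extra map-file node, and bumping LIBABIVER instead.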
[dpdk-dev] [RFC] config: remove RTE_NEXT_ABI
Now that the experimental API process is defined, do we still need the RTE_NEXT_ABI config option and the process around it, which has similar goals?

Do distros disable experimental APIs when delivering DPDK? And is any config option required to control this, as RTE_NEXT_ABI was intended to do?

Cc: Neil Horman
Cc: Thomas Monjalon
Cc: Luca Boccassi
Cc: Christian Ehrhardt

Signed-off-by: Ferruh Yigit
---
 config/common_base                     |  5 -
 devtools/test-build.sh                 |  2 --
 devtools/validate-abi.sh               |  1 -
 doc/guides/contributing/versioning.rst | 10 --
 mk/rte.lib.mk                          |  5 -
 pkg/dpdk.spec                          |  1 -
 6 files changed, 24 deletions(-)

diff --git a/config/common_base b/config/common_base
index ad03cf433..6b867f6a9 100644
--- a/config/common_base
+++ b/config/common_base
@@ -41,11 +41,6 @@ CONFIG_RTE_ARCH_STRICT_ALIGN=n
 CONFIG_RTE_BUILD_SHARED_LIB=n
 
 #
-# Use newest code breaking previous ABI
-#
-CONFIG_RTE_NEXT_ABI=y
-
-#
 # Major ABI to overwrite library specific LIBABIVER
 #
 CONFIG_RTE_MAJOR_ABI=
diff --git a/devtools/test-build.sh b/devtools/test-build.sh
index 3362edcc5..22b4e1a98 100755
--- a/devtools/test-build.sh
+++ b/devtools/test-build.sh
@@ -154,8 +154,6 @@ config () #
 	# Built-in options (lowercase)
 	! echo $3 | grep -q '+default' || \
 		sed -ri 's,(RTE_MACHINE=")native,\1default,' $1/.config
-	echo $3 | grep -q '+next' || \
-		sed -ri 's,(NEXT_ABI=)y,\1n,' $1/.config
 	! echo $3 | grep -q '+shared' || \
 		sed -ri 's,(SHARED_LIB=)n,\1y,' $1/.config
 	! echo $3 | grep -q '+debug' || ( \
diff --git a/devtools/validate-abi.sh b/devtools/validate-abi.sh
index 138436d93..a64edf92f 100755
--- a/devtools/validate-abi.sh
+++ b/devtools/validate-abi.sh
@@ -105,7 +105,6 @@ set_log_file() {
 fixup_config() {
 	local conf=config/defconfig_$target
 	cmd sed -i -e"$ a\CONFIG_RTE_BUILD_SHARED_LIB=y" $conf
-	cmd sed -i -e"$ a\CONFIG_RTE_NEXT_ABI=n" $conf
 	cmd sed -i -e"$ a\CONFIG_RTE_EAL_IGB_UIO=n" $conf
 	cmd sed -i -e"$ a\CONFIG_RTE_LIBRTE_KNI=n" $conf
 	cmd sed -i -e"$ a\CONFIG_RTE_KNI_KMOD=n" $conf
diff --git a/doc/guides/contributing/versioning.rst b/doc/guides/contributing/versioning.rst
index c495294db..59ff0e8b7 100644
--- a/doc/guides/contributing/versioning.rst
+++ b/doc/guides/contributing/versioning.rst
@@ -91,19 +91,9 @@ being provided. The requirements for doing so are:
    interest" be sought for each deprecation, for example: from NIC vendors,
    CPU vendors, end-users, etc.
 
-#. The changes (including an alternative map file) must be gated with
-   the ``RTE_NEXT_ABI`` option, and provided with a deprecation notice at the
-   same time.
-   It will become the default ABI in the next release.
-
 #. A full deprecation cycle, as explained above, must be made to offer
    downstream consumers sufficient warning of the change.
 
-#. At the beginning of the next release cycle, every ``RTE_NEXT_ABI``
-   conditions will be removed, the ``LIBABIVER`` variable in the makefile(s)
-   where the ABI is changed will be incremented, and the map files will
-   be updated.
-
 Note that the above process for ABI deprecation should not be undertaken
 lightly. ABI stability is extremely important for downstream consumers of the
 DPDK, especially when distributed in shared object form. Every effort should
diff --git a/mk/rte.lib.mk b/mk/rte.lib.mk
index c696a2174..8ac26face 100644
--- a/mk/rte.lib.mk
+++ b/mk/rte.lib.mk
@@ -20,11 +20,6 @@ endif
 ifeq ($(CONFIG_RTE_BUILD_SHARED_LIB),y)
 LIB := $(patsubst %.a,%.so.$(LIBABIVER),$(LIB))
 ifeq ($(EXTLIB_BUILD),n)
-ifeq ($(CONFIG_RTE_MAJOR_ABI),)
-ifeq ($(CONFIG_RTE_NEXT_ABI),y)
-LIB := $(LIB).1
-endif
-endif
 CPU_LDFLAGS += --version-script=$(SRCDIR)/$(EXPORT_MAP)
 endif
 endif
diff --git a/pkg/dpdk.spec b/pkg/dpdk.spec
index 4d3b5745c..d118f0463 100644
--- a/pkg/dpdk.spec
+++ b/pkg/dpdk.spec
@@ -84,7 +84,6 @@ make O=%{target} T=%{config} config
 sed -ri 's,(RTE_MACHINE=).*,\1%{machine},' %{target}/.config
 sed -ri 's,(RTE_APP_TEST=).*,\1n,' %{target}/.config
 sed -ri 's,(RTE_BUILD_SHARED_LIB=).*,\1y,' %{target}/.config
-sed -ri 's,(RTE_NEXT_ABI=).*,\1n,' %{target}/.config
 sed -ri 's,(LIBRTE_VHOST=).*,\1y,' %{target}/.config
 sed -ri 's,(LIBRTE_PMD_PCAP=).*,\1y,' %{target}/.config
 make O=%{target} %{?_smp_mflags}
-- 
2.13.6
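For reference, the option being dropped was consumed in code as a plain compile-time gate. A schematic example (not taken from any specific DPDK file) of how an ABI-breaking change used to be staged behind it:

#include <stdint.h>

/* Schematic only: an ABI-breaking struct change staged behind
 * RTE_NEXT_ABI, so it is built only when CONFIG_RTE_NEXT_ABI=y.
 */
#ifdef RTE_NEXT_ABI
struct example_stats {
	uint64_t packets;
	uint64_t bytes;
	uint64_t errors;	/* new field, changes the struct layout */
};
#else
struct example_stats {
	uint64_t packets;
	uint64_t bytes;
};
#endif

With the experimental API process in place, new APIs are marked experimental instead of being hidden behind such a build-time switch, which is the motivation for this RFC.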