On Fri, Jul 19, 2024 at 12:10:24PM +0100, Ferruh Yigit wrote:
> On 7/19/2024 10:57 AM, Bruce Richardson wrote:
> > On Fri, Jul 19, 2024 at 09:59:50AM +0100, Ferruh Yigit wrote:
> >> On 7/11/2024 1:35 PM, Bruce Richardson wrote:
> >>> When allocating memory for an ethdev, the rte_malloc_socket call used
> >>> only allocates memory on the NUMA node/socket local to the device. This
> >>> means that even if the user wanted to, they could never use a remote NIC
> >>> without also having memory on that NIC's socket.
> >>>
> >>> For example, if we change examples/skeleton/basicfwd.c to have
> >>> SOCKET_ID_ANY as the socket_id parameter for Rx and Tx rings, we should
> >>> be able to run the app cross-numa e.g. as below, where the two PCI
> >>> devices are on socket 1, and core 1 is on socket 0:
> >>>
> >>> ./build/examples/dpdk-skeleton -l 1 --legacy-mem --socket-mem=1024,0 \
> >>> -a a8:00.0 -a b8:00.0
> >>>
> >>> This fails however, with the error:
> >>>
> >>> ETHDEV: failed to allocate private data
> >>> PCI_BUS: Requested device 0000:a8:00.0 cannot be used
> >>>
> >>> We can remove this restriction by doing a fallback call to general
> >>> rte_malloc after a call to rte_malloc_socket fails. This should be safe
> >>> to do because the later ethdev calls to setup Rx/Tx queues all take a
> >>> socket_id parameter, which can be used by applications to enforce the
> >>> requirement for local-only memory for a device, if so desired. [If
> >>> device-local memory is present it will be used as before, while if not
> >>> present the rte_eth_dev_configure call will now pass, but the subsequent
> >>> queue setup calls requesting local memory will fail].
> >>>
> >>> Fixes: e489007a411c ("ethdev: add generic create/destroy ethdev APIs")
> >>> Fixes: dcd5c8112bc3 ("ethdev: add PCI driver helpers")
> >>> Cc: sta...@dpdk.org
> >>>
> >>> Signed-off-by: Bruce Richardson <bruce.richard...@intel.com>
> >>> Signed-off-by: Padraig Connolly <padraig.j.conno...@intel.com>
> >>>
> >>
> >> Hi Bruce,
> >>
> >> If device-local memory is present, behavior will be same, so I agree
> >> this is low impact.
> >>
> >> But for the case device-local memory is NOT present, should we enforce
> >> the HW setup is the question. This can be beneficial for users new to DPDK.
> >>
> >
> > No we should not do so, because if we do, there is no way for the user to
> > allow using remote memory - the probe/init and even configure call has NO
> > socket_id parameter in it, so the enforcement of local memory is an
> > internal assumption on the part of the API which is not documented
> > anywhere, and is not possible for the user to override.
> >
> >> Probably 'dev_private' on its own has small impact on the performance
> >> (although it may depend if these fields used in datapath), but it may be
> >> vehicle to enforce local memory.
> >>
> >
> > As I explain above in the cover letter, there are already other ways to
> > enforce local memory - we don't need another one. If the user only wants to
> > use local memory for a port, they can do so by setting the socket_id
> > parameter of the rx and tx queues. Enforcing local memory in probe
> > doesn't add anything to that, and just prevents other use-cases.
> >
> >> What is enabled by enabling app to run on cross-numa memory, since on a
> >> production I expect users would like to use device-local memory for
> >> performance reasons anyway?
> >>
> >
> > Mostly users want socket-local memory, but with increasing speeds of NICs
> > we are already seeing cases where users want to run cross-NUMA. For
> > example, a multi-port NIC may have some ports in use on each socket.
> >
> Ack.
>
> >>
> >> Also I am not sure if this is a fix, or change of a intentional behavior.
> >>
> >
> > I suppose it can be viewed either way. However, for me this is a fix,
> > because right now it is impossible for many users to run their applications
> > with memory on a different socket to the ports. Nowhere is it documented in
> > DPDK that there is a hard restriction that ports must have local memory, so
> > any enforcement of such a policy is wrong.
> >
>
> Although it seems this is done intentionally in the API, I agree that
> this is not documented in the API, or this restriction is not part of
> the API definition.
>
> > Turning things the other way around - I can't see how anything will break
> > or even slow down with this patch applied. As I point out above, the user
> > can already enforce local memory by passing the required socket id when
> > allocating rx and tx rings - this patch only pushed the failure for
> > non-local memory a bit later in the initialization sequence, where the user
> > can actually specify the desired NUMA behaviour. Is there some
> > case I'm missing where you forsee this causing problems?
> >
>
> A new user may not know that allocating memory from cross-numa impacts
> performance negatively and have this configuration unintentionally,
> current failure enforces the ideal configuration.
>
> One option can be adding a warning log to the fallback case, saying that
> memory allocated from non-local socket and performance will be less.
> Although this message may not mean much to a new user, it may still help
> via a support engineer or internet search...
>
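To recap the mechanics before getting to the warning question: the patch
only touches the private-data allocation, trying the device-local socket
first and falling back to any socket when that fails. Roughly like the
below - a sketch rather than the literal diff, so the helper name and the
way the size and NUMA node arrive are illustrative:

#include <rte_common.h>
#include <rte_malloc.h>

/* illustrative helper, not the actual ethdev function */
static void *
ethdev_alloc_private(size_t priv_size, int numa_node)
{
	/* preferred: allocate on the device's NUMA node */
	void *priv = rte_malloc_socket("ethdev private data", priv_size,
			RTE_CACHE_LINE_SIZE, numa_node);

	if (priv == NULL)
		/* fallback: no memory on that socket, take any socket */
		priv = rte_malloc("ethdev private data", priv_size,
				RTE_CACHE_LINE_SIZE);

	return priv;
}

An application that insists on device-local memory still gets a hard
failure, just later, at rte_eth_rx_queue_setup()/rte_eth_tx_queue_setup(),
where it passes an explicit socket_id.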
Yes, we could add a warning, but that is probably better placed in the
app itself.

Thinking about where we get issues, they primarily stem from running the
cores on a different NUMA node. Since the private_data struct is accessed
by cores, not devices, any perf degradation will come from having it
remote to the cores. Because of that, I actually think the original
implementation should really have used "rte_socket_id()" rather than the
device socket id for the allocation.

For v2, will I therefore include a warning for the case where
rte_socket_id() != device socket_id (rough sketch below)? WDYT?

/Bruce
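P.S. In code terms, the v2 check I have in mind is along these lines -
untested sketch, and the log macro and the way the device's NUMA node is
obtained are placeholders for whatever fits the real allocation path:

#include <rte_lcore.h>
#include <rte_log.h>
#include <rte_memory.h>

/* illustrative only: warn when the allocating lcore and the device are
 * on different NUMA nodes */
static void
warn_if_cross_numa(int dev_numa_node)
{
	if (dev_numa_node != SOCKET_ID_ANY &&
			dev_numa_node != (int)rte_socket_id())
		RTE_LOG(WARNING, EAL,
			"ethdev private data allocated by lcore on socket %u for device on socket %d; cross-NUMA access may reduce performance\n",
			rte_socket_id(), dev_numa_node);
}

Exact wording and placement are open to suggestions, of course.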