Hi Lucero,

No, we have reproduced the multi-process issues (including symmetric_mp, simple_mp, hotplug_mp, the multi-process unit test, …) on most of our servers. Strangely, one or two servers do not show the issue.

Bind two NNT ports or FVL ports, then run:

./build/symmetric_mp -c 4 --proc-type=auto -- -p 3 --num-procs=4 --proc-id=1
EAL: Detected 88 lcore(s)
EAL: Detected 2 NUMA nodes
EAL: Auto-detected process type: SECONDARY
[New Thread 0x7ffff6eda700 (LWP 90103)]
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket_90099_2f1b553882b62
[New Thread 0x7ffff66d9700 (LWP 90104)]

Thread 1 "symmetric_mp" received signal SIGSEGV, Segmentation fault.
0x00000000005566b5 in rte_fbarray_find_next_used ()
(gdb) bt
#0 0x00000000005566b5 in rte_fbarray_find_next_used ()
#1 0x000000000054da9c in rte_eal_check_dma_mask ()
#2 0x0000000000572ae7 in pci_one_device_iommu_support_va ()
#3 0x0000000000573988 in rte_pci_get_iommu_class ()
#4 0x000000000054f743 in rte_bus_get_iommu_class ()
#5 0x000000000053c123 in rte_eal_init ()
#6 0x000000000046be2b in main ()

Best regards,
Xueqin

From: Alejandro Lucero [mailto:alejandro.luc...@netronome.com]
Sent: Tuesday, October 30, 2018 5:41 PM
To: Lin, Xueqin <xueqin....@intel.com>
Cc: Yao, Lei A <lei.a....@intel.com>; Thomas Monjalon <tho...@monjalon.net>; dev <dev@dpdk.org>; Xu, Qian Q <qian.q...@intel.com>; Burakov, Anatoly <anatoly.bura...@intel.com>; Yigit, Ferruh <ferruh.yi...@intel.com>; Zhang, Qi Z <qi.z.zh...@intel.com>
Subject: Re: [dpdk-dev] [PATCH v3 0/6] use IOVAs check based on DMA mask

On Tue, Oct 30, 2018 at 3:20 AM Lin, Xueqin <xueqin....@intel.com> wrote:

Hi Lucero & Thomas,

We found that the patch does not fix the multi-process cases.

Hi,

I think it is not specifically about multi-process but about hotplug with multi-process, because I can run symmetric_mp successfully with a secondary process.

Working on this as a priority.

Thanks.

Steps:
1. Set up the primary process successfully:
./hotplug_mp --proc-type=auto
2.
Fail to set up the secondary process:
./hotplug_mp --proc-type=auto
EAL: Detected 88 lcore(s)
EAL: Detected 2 NUMA nodes
EAL: Auto-detected process type: SECONDARY
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket_147212_2bfe08ee88d23
Segmentation fault (core dumped)

More information below:

Thread 1 "hotplug_mp" received signal SIGSEGV, Segmentation fault.
0x0000000000597cfb in find_next (arr=0x7ffff7ff20a4, start=0, used=true) at /root/dpdk/lib/librte_eal/common/eal_common_fbarray.c:264
264 for (idx = first; idx < msk->n_masks; idx++) {
#0 0x0000000000597cfb in find_next (arr=0x7ffff7ff20a4, start=0, used=true) at /root/dpdk/lib/librte_eal/common/eal_common_fbarray.c:264
#1 0x0000000000598573 in fbarray_find (arr=0x7ffff7ff20a4, start=0, next=true, used=true) at /root/dpdk/lib/librte_eal/common/eal_common_fbarray.c:1001
#2 0x000000000059929b in rte_fbarray_find_next_used (arr=0x7ffff7ff20a4, start=0) at /root/dpdk/lib/librte_eal/common/eal_common_fbarray.c:1018
#3 0x000000000058c877 in rte_memseg_walk_thread_unsafe (func=0x58c401 <check_iova>, arg=0x7fffffffcc38) at /root/dpdk/lib/librte_eal/common/eal_common_memory.c:589
#4 0x000000000058ce08 in rte_eal_check_dma_mask (maskbits=48 '0') at /root/dpdk/lib/librte_eal/common/eal_common_memory.c:465
#5 0x00000000005b96c4 in pci_one_device_iommu_support_va (dev=0x11b3d90) at /root/dpdk/drivers/bus/pci/linux/pci.c:593
#6 0x00000000005b9738 in pci_devices_iommu_support_va () at /root/dpdk/drivers/bus/pci/linux/pci.c:626
#7 0x00000000005b97a7 in rte_pci_get_iommu_class () at /root/dpdk/drivers/bus/pci/linux/pci.c:650
#8 0x000000000058f1ce in rte_bus_get_iommu_class () at /root/dpdk/lib/librte_eal/common/eal_common_bus.c:237
#9 0x0000000000577c7a in rte_eal_init (argc=2, argv=0x7fffffffdf98) at /root/dpdk/lib/librte_eal/linuxapp/eal/eal.c:919
#10 0x000000000045dd56 in main (argc=2, argv=0x7fffffffdf98) at /root/dpdk/examples/multi_process/hotplug_mp/main.c:28

Best regards,
Xueqin

From: Alejandro Lucero
[mailto:alejandro.luc...@netronome.com]
Sent: Monday, October 29, 2018 9:41 PM
To: Yao, Lei A <lei.a....@intel.com>
Cc: Thomas Monjalon <tho...@monjalon.net>; dev <dev@dpdk.org>; Xu, Qian Q <qian.q...@intel.com>; Lin, Xueqin <xueqin....@intel.com>; Burakov, Anatoly <anatoly.bura...@intel.com>; Yigit, Ferruh <ferruh.yi...@intel.com>
Subject: Re: [dpdk-dev] [PATCH v3 0/6] use IOVAs check based on DMA mask

On Mon, Oct 29, 2018 at 1:18 PM Yao, Lei A <lei.a....@intel.com> wrote:

From: Alejandro Lucero [mailto:alejandro.luc...@netronome.com]
Sent: Monday, October 29, 2018 8:56 PM
To: Thomas Monjalon <tho...@monjalon.net>
Cc: Yao, Lei A <lei.a....@intel.com>; dev <dev@dpdk.org>; Xu, Qian Q <qian.q...@intel.com>; Lin, Xueqin <xueqin....@intel.com>; Burakov, Anatoly <anatoly.bura...@intel.com>; Yigit, Ferruh <ferruh.yi...@intel.com>
Subject: Re: [dpdk-dev] [PATCH v3 0/6] use IOVAs check based on DMA mask

On Mon, Oct 29, 2018 at 11:46 AM Thomas Monjalon <tho...@monjalon.net> wrote:

29/10/2018 12:39, Alejandro Lucero:
> I got a patch that solves a bug when calling rte_eal_check_dma_mask using the
> mask instead of the maskbits. However, this does not solve the deadlock.

The deadlock is a bigger concern I think.

I think once the call to rte_eal_check_dma_mask uses the maskbits instead of the mask, calling rte_memseg_walk_thread_unsafe avoids the deadlock.

Yao, can you try with the attached patch?

Hi, Lucero

This patch can fix the issue at my side.
Thanks a lot for your quick action.

Great! I will send an official patch with the changes.

I have to say that I tested the patchset, but I think it was at a time when legacy_mem was still there, and therefore the dynamic memory allocation code was not used during memory initialization.

There is something that concerns me though. Using rte_memseg_walk_thread_unsafe could be a problem in some situations, although those situations are unlikely. Usually, rte_eal_check_dma_mask is called during initialization, and then it is safe to use the unsafe function for walking memsegs. But with device hotplug and dynamic memory allocation, there is a potential race condition when the primary process is allocating more memory while, concurrently, a device is hotplugged and a secondary process does the device initialization. For now, this is a problem only with the NFP, and the race window is really unlikely to be hit, but I will work on this asap.

BRs
Lei

> Interestingly, the problem looks like a compiler one. Calling
> rte_memseg_walk does not return when called inside rte_eal_check_dma_mask, but if
> you modify the call like this:
>
> - if (rte_memseg_walk(check_iova, &mask))
> + if (!rte_memseg_walk(check_iova, &mask))
>
> it works, although the value returned to the invoker changes, of course.
> But the point here is that the behaviour when calling
> rte_memseg_walk should be the same as before, and it is not.

Anyway, the coding style requires saving the return value in a variable, instead of nesting the call in an "if" condition. And the "if" check should be explicitly != 0 because it is not a real boolean.

PS: please do not top post and avoid HTML emails, thanks
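
To make the two points from this thread concrete (the check takes maskbits and builds the mask itself, and the walk's return value is stored in a variable and compared explicitly with != 0), here is a minimal standalone sketch in C. It does not use DPDK: struct seg, walk_segments, check_iova_sketch and check_dma_mask_sketch are hypothetical stand-ins, and the way the mask is derived from maskbits is an assumption for illustration, so the real rte_memseg_walk and rte_eal_check_dma_mask may differ in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a memory segment; not the DPDK structure. */
struct seg {
	uint64_t iova; /* start address as seen by the device */
	uint64_t len;  /* segment length in bytes, assumed > 0 */
};

typedef int (*seg_walk_t)(const struct seg *s, void *arg);

/* Hypothetical walker: stops at the first callback that returns nonzero. */
static int
walk_segments(const struct seg *segs, size_t n, seg_walk_t func, void *arg)
{
	size_t i;

	for (i = 0; i < n; i++) {
		int ret = func(&segs[i], arg);

		if (ret != 0)
			return ret;
	}
	return 0;
}

/* Callback: returns 1 if the last byte of the segment needs a bit
 * that the device cannot address, 0 otherwise. */
static int
check_iova_sketch(const struct seg *s, void *arg)
{
	const uint64_t *mask = arg; /* 1s mark the NON-addressable bits */
	uint64_t last = s->iova + s->len - 1;

	if ((last & *mask) != 0)
		return 1;
	return 0;
}

/* Takes the number of addressable bits (maskbits), not a prebuilt mask.
 * Returns 0 if every segment fits, -1 otherwise. */
int
check_dma_mask_sketch(const struct seg *segs, size_t n, uint8_t maskbits)
{
	uint64_t mask;
	int ret;

	assert(segs != NULL || n == 0);

	/* Shifting a 64-bit value by 64 or more is undefined, so treat
	 * maskbits >= 64 as "every bit is addressable". */
	if (maskbits >= 64)
		mask = 0;
	else
		mask = ~((UINT64_C(1) << maskbits) - 1);

	/* Save the return value first, then compare explicitly with 0. */
	ret = walk_segments(segs, n, check_iova_sketch, &mask);
	if (ret != 0)
		return -1;

	return 0;
}
```

With one segment at 0x100000000 (length 0x1000) and another at 1 << 40 (length 0x200000), check_dma_mask_sketch returns 0 for maskbits = 48 and -1 for maskbits = 40, since the second segment needs bit 40.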