Thanks for the checks, David! The fix is tagged as being in v23, and Debian/experimental is already at v24, so mid-term this will be resolved by a re-sync to a later version. For now, though, we will need an Ubuntu delta to fix it.
Eoan isn't fully open yet, but we can prepare the SRU template, PPAs and MPs for review to get that out of the way already. A PPA is available at [1] and the respective MP is at [2]. Until 19.10 is fully open for development this might take a bit, but you should be able to use the PPA as needed until then.

[1]: https://launchpad.net/~paelzer/+archive/ubuntu/bug-1823836-rdma-core-mem-clear/+packages
[2]: https://code.launchpad.net/~paelzer/ubuntu/+source/rdma-core/+git/rdma-core/+merge/366380

** Merge proposal linked:
   https://code.launchpad.net/~paelzer/ubuntu/+source/rdma-core/+git/rdma-core/+merge/366380

** Description changed:

+ [Impact]
+
+  * a missing memset can make rdma (users) use uninitialized memory
+    In the reported case this was a failure to initialize DPDK devices on
+    ppc64, but it could be almost anything else using the cmd buffers
+
+  * The patch is already at the v22 stable branch (backported and
+    intended to be in v22.2 once released)
+
+ [Test Case]
+
+  * So far the only way to trigger this that was found is to run a
+    ConnectX-5 card on ppc64 (power9) and try to initialize it, e.g.
+    $ /usr/bin/dpdk-testpmd -i -a
+
+    This requires special HW, but I hope due to the patch being a simple
+    one liner that should not be concerning for the SRU.
+
+ [Regression Potential]
+
+  * Without the memset it would be random memory, I could imagine a lucky
+    case that ran despite this issue but I can not imagine an issue
+    "relying" on the memory being not-set-to-zero (unless stealing data
+    was your use case).
+
+ [Other Info]
+
+  * n/a
+
+ ---- original bug report ----
+
  == Comment: #0 - David J. Wilder - 2019-04-05 12:44:56 ==

  ---Problem Description---
  dpdk-testpmd is failing in net_mlx5.

  /usr/bin/dpdk-testpmd \
  -w 0000:01:00.0 \
  -l 0-3 \
  -n 4 -- \
  -i -a

  EAL: Detected 128 lcore(s)
  EAL: Detected 2 NUMA nodes
  EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
  EAL: No free hugepages reported in hugepages-2048kB
  EAL: Probing VFIO support...
  EAL: VFIO support initialized
  EAL: PCI device 0000:01:00.0 on NUMA socket 0
  EAL: probe driver: 15b3:1019 net_mlx5
  net_mlx5: probe of PCI device 0000:01:00.0 aborted after encountering an error: Unknown error -95
  testpmd: No probed ethernet devices
  Interactive-mode selected
  Auto-start selected
  testpmd: create a new mbuf pool <mbuf_pool_socket_0>: n=171456, size=2176, socket=0
  testpmd: preferred mempool ops selected: ring_mp_mc
  Done
  Start automatic packet forwarding
  io packet forwarding - ports=0 - cores=0 - streams=0 - NUMA support enabled, MP allocation mode: native
-   io packet forwarding packets/burst=32
-   nb forwarding cores=1 - nb forwarding ports=0
-
-
- Contact Information = David Wilder/wil...@us.ibm.com
-
+   io packet forwarding packets/burst=32
+   nb forwarding cores=1 - nb forwarding ports=0
+
+ Contact Information = David Wilder/wil...@us.ibm.com
+
  ---uname output---
  Linux ltc17u31 5.0.0-8-generic #9-Ubuntu SMP Tue Mar 12 21:59:39 UTC 2019 ppc64le ppc64le ppc64le GNU/Linux
-
- Machine Type = 9006-22P Boston
-
+
+ Machine Type = 9006-22P Boston
+
  ---Debugger---
  A debugger is not configured
-
+
  ---Steps to Reproduce---
- Installed 19.04 (ppc64le)
+ Installed 19.04 (ppc64le)
  Installed dpdk and dpdk-dev
  ---- run dpdk-testpmd
  /usr/bin/dpdk-testpmd \
  -w 0000:01:00.0 \
  -l 0-3 \
  -n 4 -- \
  -i -a
  EAL: Detected 128 lcore(s)
  EAL: Detected 2 NUMA nodes
  EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
  EAL: No free hugepages reported in hugepages-2048kB
  EAL: Probing VFIO support...
  EAL: VFIO support initialized
  EAL: PCI device 0000:01:00.0 on NUMA socket 0
  EAL: probe driver: 15b3:1019 net_mlx5
  net_mlx5: probe of PCI device 0000:01:00.0 aborted after encountering an error: Unknown error -95
  testpmd: No probed ethernet devices
  Interactive-mode selected
  Auto-start selected
  testpmd: create a new mbuf pool <mbuf_pool_socket_0>: n=171456, size=2176, socket=0
  testpmd: preferred mempool ops selected: ring_mp_mc
  Done
  Start automatic packet forwarding
  io packet forwarding - ports=0 - cores=0 - streams=0 - NUMA support enabled, MP allocation mode: native
-   io packet forwarding packets/burst=32
-   nb forwarding cores=1 - nb forwarding ports=0
-
-
- Userspace tool common name: testpmd
-
- The userspace tool has the following bit modes: 64-bit
+   io packet forwarding packets/burst=32
+   nb forwarding cores=1 - nb forwarding ports=0
+
+ Userspace tool common name: testpmd
+
+ The userspace tool has the following bit modes: 64-bit

  Userspace rpm: dpdk-dev/disco,now 18.11-6 ppc64el
-
  == Comment: #1 - David J. Wilder - 2019-04-05 12:45:35 ==
  # lspci -vvv -s 0000:01:00.0
  0000:01:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
-     Subsystem: IBM MT28800 Family [ConnectX-5 Ex]
-     Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
-     Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
-     Latency: 0
-     Interrupt: pin A routed to IRQ 24
-     NUMA node: 0
-     Region 0: Memory at 6000800000000 (64-bit, prefetchable) [size=512M]
-     [virtual] Expansion ROM at 600c000000000 [disabled] [size=1M]
-     Capabilities: [60] Express (v2) Endpoint, MSI 00
-         DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
-             ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 75.000W
-         DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
-             RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
-             MaxPayload 512 bytes, MaxReadReq 512 bytes
-         DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
-         LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM not supported, Exit Latency L0s unlimited, L1 unlimited
-             ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
-         LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
-             ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
-         LnkSta: Speed 16GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
-         DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
-         DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis+, LTR-, OBFF Disabled
-         LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
-             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
-             Compliance De-emphasis: -6dB
-         LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
-             EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
-     Capabilities: [48] Vital Product Data
-         Product Name: PCIe4 2-port 100Gb EDR Adapter x16
-         Read-only fields:
-             [PN] Part number: 00WT174
-             [EC] Engineering changes: P40094
-             [VF] Vendor specific: 00WT176
-             [SN] Serial number: YA50YF7CE0V3
-             [Z0] Unknown: 49 42 4d 30 30 30 30 30 30 30 30 30 32
-             [VC] Vendor specific: EC64
-             [MN] Manufacture ID: 37 35 30 58 30 39 31 37 32 35 33 30 38 37 20
-             [VH] Vendor specific: 2CF2
-             [VK] Vendor specific: ipzSeries
-             [RV] Reserved: checksum good, 0 byte(s) reserved
-         End
-     Capabilities: [9c] MSI-X: Enable+ Count=256 Masked-
-         Vector table: BAR=0 offset=00002000
-         PBA: BAR=0 offset=00003000
-     Capabilities: [c0] Vendor Specific Information: Len=18 <?>
-     Capabilities: [40] Power Management version 3
-         Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot-,D3cold+)
-         Status: D0 NoSoftRst+ PME-Enable+ DSel=0 DScale=0 PME-
-     Capabilities: [100 v1] Advanced Error Reporting
-         UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
-         UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
-         UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
-         CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
-         CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
-         AERCap: First Error Pointer: 04, GenCap+ CGenEn+ ChkCap+ ChkEn+
-     Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
-         ARICap: MFVC- ACS-, Next Function: 1
-         ARICtl: MFVC- ACS-, Function Group: 0
-     Capabilities: [180 v1] Single Root I/O Virtualization (SR-IOV)
-         IOVCap: Migration-, Interrupt Message Number: 000
-         IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+
-         IOVSta: Migration-
-         Initial VFs: 8, Total VFs: 8, Number of VFs: 0, Function Dependency Link: 00
-         VF offset: 2, stride: 1, Device ID: 101a
-         Supported Page Size: 000007ff, System Page Size: 00000010
-         Region 0: Memory at 0006000000000000 (64-bit, prefetchable)
-         VF Migration: offset: 00000000, BIR: 0
-     Capabilities: [1c0 v1] #19
-     Capabilities: [230 v1] Access Control Services
-         ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
-         ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
-     Capabilities: [320 v1] #27
-     Capabilities: [370 v1] #26
-     Capabilities: [420 v1] #25
-     Kernel driver in use: mlx5_core
-     Kernel modules: mlx5_core
+     Subsystem: IBM MT28800 Family [ConnectX-5 Ex]
+     Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
+     Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
+     Latency: 0
+     Interrupt: pin A routed to IRQ 24
+     NUMA node: 0
+     Region 0: Memory at 6000800000000 (64-bit, prefetchable) [size=512M]
+     [virtual] Expansion ROM at 600c000000000 [disabled] [size=1M]
+     Capabilities: [60] Express (v2) Endpoint, MSI 00
+         DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
+             ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 75.000W
+         DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
+             RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
+             MaxPayload 512 bytes, MaxReadReq 512 bytes
+         DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
+         LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM not supported, Exit Latency L0s unlimited, L1 unlimited
+             ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
+         LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
+             ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
+         LnkSta: Speed 16GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
+         DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
+         DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis+, LTR-, OBFF Disabled
+         LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
+             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
+             Compliance De-emphasis: -6dB
+         LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
+             EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
+     Capabilities: [48] Vital Product Data
+         Product Name: PCIe4 2-port 100Gb EDR Adapter x16
+         Read-only fields:
+             [PN] Part number: 00WT174
+             [EC] Engineering changes: P40094
+             [VF] Vendor specific: 00WT176
+             [SN] Serial number: YA50YF7CE0V3
+             [Z0] Unknown: 49 42 4d 30 30 30 30 30 30 30 30 30 32
+             [VC] Vendor specific: EC64
+             [MN] Manufacture ID: 37 35 30 58 30 39 31 37 32 35 33 30 38 37 20
+             [VH] Vendor specific: 2CF2
+             [VK] Vendor specific: ipzSeries
+             [RV] Reserved: checksum good, 0 byte(s) reserved
+         End
+     Capabilities: [9c] MSI-X: Enable+ Count=256 Masked-
+         Vector table: BAR=0 offset=00002000
+         PBA: BAR=0 offset=00003000
+     Capabilities: [c0] Vendor Specific Information: Len=18 <?>
+     Capabilities: [40] Power Management version 3
+         Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot-,D3cold+)
+         Status: D0 NoSoftRst+ PME-Enable+ DSel=0 DScale=0 PME-
+     Capabilities: [100 v1] Advanced Error Reporting
+         UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
+         UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
+         UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
+         CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
+         CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
+         AERCap: First Error Pointer: 04, GenCap+ CGenEn+ ChkCap+ ChkEn+
+     Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
+         ARICap: MFVC- ACS-, Next Function: 1
+         ARICtl: MFVC- ACS-, Function Group: 0
+     Capabilities: [180 v1] Single Root I/O Virtualization (SR-IOV)
+         IOVCap: Migration-, Interrupt Message Number: 000
+         IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+
+         IOVSta: Migration-
+         Initial VFs: 8, Total VFs: 8, Number of VFs: 0, Function Dependency Link: 00
+         VF offset: 2, stride: 1, Device ID: 101a
+         Supported Page Size: 000007ff, System Page Size: 00000010
+         Region 0: Memory at 0006000000000000 (64-bit, prefetchable)
+         VF Migration: offset: 00000000, BIR: 0
+     Capabilities: [1c0 v1] #19
+     Capabilities: [230 v1] Access Control Services
+         ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
+         ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
+     Capabilities: [320 v1] #27
+     Capabilities: [370 v1] #26
+     Capabilities: [420 v1] #25
+     Kernel driver in use: mlx5_core
+     Kernel modules: mlx5_core

  == Comment: #2 - David J. Wilder - 2019-04-05 12:54:17 ==
  Building from git://dpdk.org/dpdk tag=v18.11 in the same environment also shows the same error.

  == Comment: #4 - David J. Wilder - 2019-04-05 12:56:25 ==
  Testing dpdk on beta 19.04 is showing an error with Mellanox Technologies MT28800 Family [ConnectX-5 Ex] ethernet controller.

  == Comment: #6 - David J. Wilder - 2019-04-05 13:35:12 ==
  Chasing the source of the error.

  gdb dpdk/ppc_64-power8-linuxapp-gcc/app/testpmd
  <....>
  (gdb) break mlx5_ind_table_ibv_drop_new
  Breakpoint 1 at 0x4998e8: file /home/wilder/ubuntu-19.04-debug/dpdk/drivers/net/mlx5/mlx5_rxq.c, line 2067.
  (gdb) run -w 0000:01:00.0 -l 0-3 -n 4 -- -i -a
  Starting program: /home/wilder/ubuntu-19.04-debug/dpdk/ppc_64-power8-linuxapp-gcc/app/testpmd -w 0000:01:00.0 -l 0-3 -n 4 -- -i -a
  [Thread debugging using libthread_db enabled]
  Using host libthread_db library "/lib/powerpc64le-linux-gnu/libthread_db.so.1".
  EAL: Detected 128 lcore(s)
  EAL: Detected 2 NUMA nodes
  [New Thread 0x7ffff795dc90 (LWP 117018)]
  EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
  [New Thread 0x7ffff714dc90 (LWP 117019)]
  EAL: No free hugepages reported in hugepages-2048kB
  EAL: Probing VFIO support...
  EAL: VFIO support initialized
  [New Thread 0x7ffff693dc90 (LWP 117020)]
  [New Thread 0x7ffff612dc90 (LWP 117021)]
  [New Thread 0x7ffff591dc90 (LWP 117022)]
  EAL: PCI device 0000:01:00.0 on NUMA socket 0
  EAL: probe driver: 15b3:1019 net_mlx5

  Thread 1 "testpmd" hit Breakpoint 1, 0x00000001004998e8 in mlx5_ind_table_ibv_drop_new (dev=0x100d97580 <rte_eth_devices>)
-     at /home/wilder/ubuntu-19.04-debug/dpdk/drivers/net/mlx5/mlx5_rxq.c:2067
+     at /home/wilder/ubuntu-19.04-debug/dpdk/drivers/net/mlx5/mlx5_rxq.c:2067
  2067 {
  (gdb) list
  2062  * @return
  2063  * The Verbs object initialised, NULL otherwise and rte_errno is set.
  2064  */
  2065 struct mlx5_ind_table_ibv *
  2066 mlx5_ind_table_ibv_drop_new(struct rte_eth_dev *dev)
  2067 {
  2068     struct priv *priv = dev->data->dev_private;
  2069     struct mlx5_ind_table_ibv *ind_tbl;
  2070     struct mlx5_rxq_ibv *rxq;
  2071     struct mlx5_ind_table_ibv tmpl;
- (gdb)
- 2072
+ (gdb)
+ 2072
  2073     rxq = mlx5_rxq_ibv_drop_new(dev);
  2074     if (!rxq)
  2075         return NULL;
  2076     tmpl.ind_table = mlx5_glue->create_rwq_ind_table
  2077         (priv->ctx,
  2078          &(struct ibv_rwq_ind_table_init_attr){
  2079             .log_ind_tbl_size = 0,
  2080             .ind_tbl = &rxq->wq,
  2081             .comp_mask = 0,
- (gdb)
+ (gdb)
  2082          });
  2083     if (!tmpl.ind_table) {
  2084         DEBUG("port %u cannot allocate indirection table for drop"
  2085               " queue",
  2086               dev->data->port_id);
  2087         rte_errno = errno;
  2088         goto error;
  2089     }
  2090     ind_tbl = rte_calloc(__func__, 1, sizeof(*ind_tbl), 0);
  2091     if (!ind_tbl) {
  (gdb) break 2084
  Breakpoint 2 at 0x1004999d0: file /home/wilder/ubuntu-19.04-debug/dpdk/drivers/net/mlx5/mlx5_rxq.c, line 2084.
  (gdb) cont
  Continuing.

  Thread 1 "testpmd" hit Breakpoint 2, mlx5_ind_table_ibv_drop_new (dev=0x100d97580 <rte_eth_devices>)
-     at /home/wilder/ubuntu-19.04-debug/dpdk/drivers/net/mlx5/mlx5_rxq.c:2087
+     at /home/wilder/ubuntu-19.04-debug/dpdk/drivers/net/mlx5/mlx5_rxq.c:2087
  2087         rte_errno = errno;
  (gdb) print errno
  $1 = 95
- (gdb)
+ (gdb)

  ------
  == Comment: #7 - David J. Wilder - 2019-04-05 18:53:33 ==
  Interesting excerpt from strace:

  write(1, "mlx5_glue_create_rwq_ind_table: "..., 65) = 65
  ioctl(23, RDMA_VERBS_IOCTL, 0x7fffe3966c70) = -1 EOPNOTSUPP (Operation not supported)

  == Comment: #8 - David J. Wilder <wil...@us.ibm.com> - 2019-04-05 21:05:21 ==
  ConnectX-5 Firmware version:

  # mstflint -d 0000:01:00.0 q
  Image type:            FS4
  FW Version:            16.23.1020
  FW Release Date:       10.7.2018
  Product Version:       16.23.1020
  Description:           UID                GuidsNumber
  Base GUID:             ec0d9a0300cab17c   4
  Base MAC:              ec0d9acab17c       4
  Image VSD:             N/A
  Device VSD:            N/A
  PSID:                  IBM0000000020
  Security Attributes:   N/A

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1823836

Title:
  dpdk app is reporting: net_mlx5: probe of PCI device xxxx aborted
  after encountering an error: Unknown error -95

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1823836/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs