Hi Ben, Lijian, Honnappa,

The issue is reproducible after the second invocation of show pci:

DBGvpp# show pci
Address      Sock VID:PID     Link Speed   Driver     Product Name                    Vital Product Data
0000:11:00.0   2  8086:10fb   5.0 GT/s x8  ixgbe
0000:11:00.1   2  8086:10fb   5.0 GT/s x8  ixgbe
0002:f9:00.0   0  15b3:1015   8.0 GT/s x8  mlx5_core  CX4121A - ConnectX-4 LX SFP28   PN: MCX4121A-ACAT_C12
                                                                                      EC: A1
                                                                                      SN: MT1745K13032
                                                                                      V0: 0x 50 43 49 65 47 65 6e 33 ...
                                                                                      RV: 0x ba
0002:f9:00.1   0  15b3:1015   8.0 GT/s x8  mlx5_core  CX4121A - ConnectX-4 LX SFP28   PN: MCX4121A-ACAT_C12
                                                                                      EC: A1
                                                                                      SN: MT1745K13032
                                                                                      V0: 0x 50 43 49 65 47 65 6e 33 ...
                                                                                      RV: 0x ba
DBGvpp# show pci
Address      Sock VID:PID     Link Speed   Driver     Product Name                    Vital Product Data
0000:11:00.0   2  8086:10fb   5.0 GT/s x8  ixgbe
0000:11:00.1   2  8086:10fb   5.0 GT/s x8  ixgbe
Aborted
Makefile:546: recipe for target 'run' failed
make: *** [run] Error 134
I've tried to do some debugging with a debug build:

(gdb) bt
...
#5  0x0000ffffbe775000 in format_vlib_pci_vpd (s=0xffff7efa9e80 "0002:f9:00.0 0 15b3:1015 8.0 GT/s x8 mlx5_core CX4121A - ConnectX-4 LX SFP28", args=0xffff7ef729b0) at /home/testuser/vpp/src/vlib/pci/pci.c:230
...
(gdb) frame 5
#5  0x0000ffffbe775000 in format_vlib_pci_vpd (s=0xffff7efa9e80 "0002:f9:00.0 0 15b3:1015 8.0 GT/s x8 mlx5_core CX4121A - ConnectX-4 LX SFP28", args=0xffff7ef729b0) at /home/testuser/vpp/src/vlib/pci/pci.c:230
230       else if (*(u16 *) & data[p] == *(u16 *) id)
(gdb) info locals
data = 0xffff7efa9cd0 "PN\025MCX4121A-ACAT_C12 EC\002A1SN\030MT1745K13032", ' ' <repeats 12 times>, "V0\023PCIeGen3 x8 RV\001\272"
id = 0xaaa8000000000000 <error: Cannot access memory at address 0xaaa8000000000000>
indent = 91
string_types = {0xffffbe7b7950 "PN", 0xffffbe7b7958 "EC", 0xffffbe7b7960 "SN", 0xffffbe7b7968 "MN", 0x0}
p = 0
first_line = 1

It looks like something went wrong with the 'id' variable: it points to memory that cannot be accessed. More is attached.

As a temporary workaround (until we fix this), we're going to replace show pci with something else in CSIT:
https://gerrit.fd.io/r/c/csit/+/23785

Juraj

-----Original Message-----
From: Peter Mikus -X (pmikus - PANTHEON TECH SRO at Cisco) <pmi...@cisco.com>
Sent: Tuesday, December 3, 2019 3:58 PM
To: Juraj Linkeš <juraj.lin...@pantheon.tech>; Benoit Ganne (bganne) <bga...@cisco.com>; Maciek Konstantynowicz (mkonstan) <mkons...@cisco.com>; vpp-dev <vpp-dev@lists.fd.io>; csit-...@lists.fd.io
Cc: Vratko Polak -X (vrpolak - PANTHEON TECH SRO at Cisco) <vrpo...@cisco.com>; lijian.zh...@arm.com; Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
Subject: RE: CSIT - performance tests failing on Taishan

The latest update is that Benoit has no access over VPN, so he tried to replicate the issue in a local lab (assuming x86).
I will do a quick fix in CSIT: I will disable the MLX driver on Taishan.

Peter Mikus
Engineer - Software
Cisco Systems Limited

> -----Original Message-----
> From: Juraj Linkeš <juraj.lin...@pantheon.tech>
> Sent: Tuesday, December 3, 2019 3:09 PM
> To: Benoit Ganne (bganne) <bga...@cisco.com>; Peter Mikus -X (pmikus - PANTHEON TECH SRO at Cisco) <pmi...@cisco.com>; Maciek Konstantynowicz (mkonstan) <mkons...@cisco.com>; vpp-dev <vpp-dev@lists.fd.io>; csit-d...@lists.fd.io
> Cc: Vratko Polak -X (vrpolak - PANTHEON TECH SRO at Cisco) <vrpo...@cisco.com>; lijian.zh...@arm.com; Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
> Subject: RE: CSIT - performance tests failing on Taishan
>
> Hi Benoit,
>
> Do you have access to FD.io lab? The Taishan servers are in it.
>
> Juraj
>
> -----Original Message-----
> From: Benoit Ganne (bganne) <bga...@cisco.com>
> Sent: Friday, November 29, 2019 4:03 PM
> To: Peter Mikus -X (pmikus - PANTHEON TECH SRO at Cisco) <pmi...@cisco.com>; Juraj Linkeš <juraj.lin...@pantheon.tech>; Maciek Konstantynowicz (mkonstan) <mkons...@cisco.com>; vpp-dev <vpp-d...@lists.fd.io>; csit-...@lists.fd.io
> Cc: Vratko Polak -X (vrpolak - PANTHEON TECH SRO at Cisco) <vrpo...@cisco.com>; lijian.zh...@arm.com; Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
> Subject: RE: CSIT - performance tests failing on Taishan
>
> Hi Peter, can I get access to the setup to investigate?
>
> Best
> ben
>
> > -----Original Message-----
> > From: Peter Mikus -X (pmikus - PANTHEON TECH SRO at Cisco) <pmi...@cisco.com>
> > Sent: Friday, November 29, 2019 11:08
> > To: Benoit Ganne (bganne) <bga...@cisco.com>; Juraj Linkeš <juraj.lin...@pantheon.tech>; Maciek Konstantynowicz (mkonstan) <mkons...@cisco.com>; vpp-dev <vpp-dev@lists.fd.io>; csit-...@lists.fd.io
> > Cc: Vratko Polak -X (vrpolak - PANTHEON TECH SRO at Cisco) <vrpo...@cisco.com>; Benoit Ganne (bganne) <bga...@cisco.com>; lijian.zh...@arm.com; Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
> > Subject: RE: CSIT - performance tests failing on Taishan
> >
> > +dev lists
> >
> > Peter Mikus
> > Engineer - Software
> > Cisco Systems Limited
> >
> > > -----Original Message-----
> > > From: Peter Mikus -X (pmikus - PANTHEON TECH SRO at Cisco)
> > > Sent: Friday, November 29, 2019 11:06 AM
> > > To: Benoit Ganne (bganne) <bga...@cisco.com>; Juraj Linkeš <juraj.lin...@pantheon.tech>; Maciek Konstantynowicz (mkonstan) <mkons...@cisco.com>
> > > Cc: Vratko Polak -X (vrpolak - PANTHEON TECH SRO at Cisco) <vrpo...@cisco.com>; Benoit Ganne (bganne) <bga...@cisco.com>; lijian.zh...@arm.com; Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
> > > Subject: CSIT - performance tests failing on Taishan
> > >
> > > Hello all,
> > >
> > > In CSIT we are observing an issue with the Taishan boxes where performance tests are failing.
> > > There has been a long, misleading discussion about the potential issue, root cause, and which workaround to apply.
> > >
> > > Issue
> > > =====
> > > VPP is being restarted after an attempt to read "show pci" over the socket on '/run/vpp/cli.sock' in a loop.
> > > This loop test is executed in CSIT towards VPP with the default startup configuration, via the command below, to check whether VPP is really up and responding.
> > >
> > > How to reproduce
> > > ================
> > > for i in $(seq 1 120); do echo "show pci" | sudo socat - UNIX-CONNECT:/run/vpp/cli.sock; sudo netstat -ap | grep vpp; done
> > >
> > > The same can be reproduced using vppctl:
> > >
> > > for i in $(seq 1 120); do echo "show pci" | sudo vppctl; sudo netstat -ap | grep vpp; done
> > >
> > > To rule out a problem with the test itself, I used "show version":
> > >
> > > for i in $(seq 1 120); do echo "show version" | sudo socat - UNIX-CONNECT:/run/vpp/cli.sock; sudo netstat -ap | grep vpp; done
> > >
> > > This test passes with "show version" and VPP is not restarted.
> > >
> > >
> > > Root cause
> > > ==========
> > > The root cause seems to be:
> > >
> > > Thread 1 "vpp_main" received signal SIGSEGV, Segmentation fault.
> > > 0x0000ffffbeb4f3d0 in format_vlib_pci_vpd (
> > >     s=0xffff7fabe830 "0002:f9:00.0 0 15b3:1015 8.0 GT/s x8 mlx5_core CX4121A - ConnectX-4 LX SFP28", args=<optimized out>)
> > >     at /w/workspace/vpp-arm-merge-master-ubuntu1804/src/vlib/pci/pci.c:230
> > > 230 /w/workspace/vpp-arm-merge-master-ubuntu1804/src/vlib/pci/pci.c: No such file or directory.
> > > (gdb)
> > > Continuing.
> > >
> > > Thread 1 "vpp_main" received signal SIGABRT, Aborted.
> > > __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
> > > 51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
> > > (gdb)
> > >
> > >
> > > The issue started after the MLX NIC was installed into Taishan.
> > >
> > >
> > > @Benoit Ganne (bganne) can you please help fix the root cause?
> > >
> > > Thank you.
> > >
> > > Peter Mikus
> > > Engineer - Software
> > > Cisco Systems Limited
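
A note on the faulting line: frame 5 in the debug session stops at pci.c:230, where both data[p] and id are read through a u16 cast, and info locals shows id pointing at inaccessible memory. The data dump looks like a sequence of records made of a 2-byte keyword, a 1-byte length and a payload (for example "EC\002A1"). Below is a minimal C sketch, assuming that layout, of a lookup that compares keywords with memcmp and treats a NULL id as "not found". The name vpd_find_field and its parameters are hypothetical and chosen for illustration; this is not VPP code and not the fix for this issue.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not part of VPP): look up one keyword in a VPD
 * keyword area. Assumed record layout: 2-byte keyword, 1-byte length,
 * then `length` bytes of payload.
 *
 * `id` must be NULL or point to at least two readable bytes.
 * Returns a pointer to the payload and stores its length in *field_len,
 * or returns NULL if the keyword is absent or the data is truncated.
 */
const uint8_t *
vpd_find_field (const uint8_t *data, size_t len, const char *id,
                uint8_t *field_len)
{
  size_t p = 0;

  assert (field_len != NULL);

  if (data == NULL || id == NULL)
    return NULL;

  /* Each record needs at least its 3-byte header. */
  while (p + 3 <= len)
    {
      size_t n = data[p + 2];

      /* Stop if the payload would run past the end of the buffer. */
      if (p + 3 + n > len)
        return NULL;

      /* Byte-wise compare: no casted u16 loads on either side. */
      if (memcmp (&data[p], id, 2) == 0)
        {
          *field_len = (uint8_t) n;
          return &data[p + 3];
        }

      p += 3 + n;
    }

  return NULL;
}
```

Called with the leading bytes of the data dump and "EC", this returns a pointer to "A1" with a length of 2. A NULL check cannot catch a non-NULL garbage pointer such as the 0xaaa8000000000000 seen in info locals, so the caller still has to pass either NULL or a valid pointer.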
show_pci_mlnx.dbg