Hi Ben, Lijian, Honnappa,

The issue reproduces on the second invocation of show pci:
DBGvpp# show pci
Address      Sock VID:PID     Link Speed   Driver          Product Name                    Vital Product Data
0000:11:00.0   2  8086:10fb   5.0 GT/s x8  ixgbe
0000:11:00.1   2  8086:10fb   5.0 GT/s x8  ixgbe
0002:f9:00.0   0  15b3:1015   8.0 GT/s x8  mlx5_core       CX4121A - ConnectX-4 LX SFP28   PN: MCX4121A-ACAT_C12
                                                                                           EC: A1
                                                                                           SN: MT1745K13032
                                                                                           V0: 0x 50 43 49 65 47 65 6e 33 ...
                                                                                           RV: 0x ba
0002:f9:00.1   0  15b3:1015   8.0 GT/s x8  mlx5_core       CX4121A - ConnectX-4 LX SFP28   PN: MCX4121A-ACAT_C12
                                                                                           EC: A1
                                                                                           SN: MT1745K13032
                                                                                           V0: 0x 50 43 49 65 47 65 6e 33 ...
                                                                                           RV: 0x ba
DBGvpp# show pci
Address      Sock VID:PID     Link Speed   Driver          Product Name                    Vital Product Data
0000:11:00.0   2  8086:10fb   5.0 GT/s x8  ixgbe
0000:11:00.1   2  8086:10fb   5.0 GT/s x8  ixgbe
Aborted
Makefile:546: recipe for target 'run' failed
make: *** [run] Error 134

I've tried to do some debugging with a debug build:
(gdb) bt
...
#5  0x0000ffffbe775000 in format_vlib_pci_vpd (s=0xffff7efa9e80 "0002:f9:00.0   0  15b3:1015   8.0 GT/s x8  mlx5_core       CX4121A - ConnectX-4 LX SFP28", args=0xffff7ef729b0) at /home/testuser/vpp/src/vlib/pci/pci.c:230
...
(gdb) frame 5
#5  0x0000ffffbe775000 in format_vlib_pci_vpd (s=0xffff7efa9e80 "0002:f9:00.0   0  15b3:1015   8.0 GT/s x8  mlx5_core       CX4121A - ConnectX-4 LX SFP28", args=0xffff7ef729b0) at /home/testuser/vpp/src/vlib/pci/pci.c:230
230           else if (*(u16 *) & data[p] == *(u16 *) id)
(gdb) info locals
data = 0xffff7efa9cd0 "PN\025MCX4121A-ACAT_C12    EC\002A1SN\030MT1745K13032", ' ' <repeats 12 times>, "V0\023PCIeGen3 x8        RV\001\272"
id = 0xaaa8000000000000 <error: Cannot access memory at address 0xaaa8000000000000>
indent = 91
string_types = {0xffffbe7b7950 "PN", 0xffffbe7b7958 "EC", 0xffffbe7b7960 "SN", 0xffffbe7b7968 "MN", 0x0}
p = 0
first_line = 1

It looks like the 'id' variable is invalid: gdb cannot access the memory it points to (0xaaa8000000000000). More details are in the attachment.
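
For reference, the 'data' buffer in the gdb output is a sequence of VPD fields, each made of a 2-byte keyword, a 1-byte length and the payload. Below is a minimal sketch of walking such a buffer with an explicit NULL check on the keyword and memcmp instead of a u16 cast. It is an illustration only, not the code in pci.c: vpd_find_field and its parameters are made-up names, and treating a NULL id as "match the first field" is my assumption.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not part of VPP): return a pointer to the payload of
 * the first VPD field whose 2-byte keyword equals id, or NULL if none.
 * Field layout: keyword[2], length[1], payload[length].
 * Assumption: a NULL id matches the first well-formed field. */
static const uint8_t *
vpd_find_field (const uint8_t *data, size_t data_len, const char *id,
                uint8_t *field_len)
{
  size_t p = 0;

  if (data == NULL || field_len == NULL)
    return NULL;

  /* Each iteration needs at least the 3-byte field header. */
  while (p + 3 <= data_len)
    {
      uint8_t len = data[p + 2];

      /* Stop on a field that would run past the end of the buffer. */
      if (p + 3 + (size_t) len > data_len)
        break;

      /* Compare the keyword bytewise; id is only read when non-NULL. */
      if (id == NULL || memcmp (&data[p], id, 2) == 0)
        {
          *field_len = len;
          return &data[p + 3];
        }

      p += 3 + (size_t) len;
    }

  return NULL;
}
```

Called on the buffer shown above with id "SN", this should return the 24-byte serial number field. Note that a non-NULL but invalid id, like the one in the backtrace, would still fault here, so the sketch says nothing about how 'id' became invalid.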

As a temporary workaround (until we fix this), we're going to replace show pci 
with something else in CSIT: https://gerrit.fd.io/r/c/csit/+/23785

Juraj

-----Original Message-----
From: Peter Mikus -X (pmikus - PANTHEON TECH SRO at Cisco) <pmi...@cisco.com> 
Sent: Tuesday, December 3, 2019 3:58 PM
To: Juraj Linkeš <juraj.lin...@pantheon.tech>; Benoit Ganne (bganne) 
<bga...@cisco.com>; Maciek Konstantynowicz (mkonstan) <mkons...@cisco.com>; 
vpp-dev <vpp-dev@lists.fd.io>; csit-...@lists.fd.io
Cc: Vratko Polak -X (vrpolak - PANTHEON TECH SRO at Cisco) <vrpo...@cisco.com>; 
lijian.zh...@arm.com; Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
Subject: RE: CSIT - performance tests failing on Taishan

The latest update is that Benoit has no access over VPN, so he tried to 
reproduce the issue in his local lab (presumably on x86).
I will do a quick fix in CSIT: I will disable the MLX driver on Taishan.

Peter Mikus
Engineer - Software
Cisco Systems Limited

> -----Original Message-----
> From: Juraj Linkeš <juraj.lin...@pantheon.tech>
> Sent: Tuesday, December 3, 2019 3:09 PM
> To: Benoit Ganne (bganne) <bga...@cisco.com>; Peter Mikus -X (pmikus - 
> PANTHEON TECH SRO at Cisco) <pmi...@cisco.com>; Maciek Konstantynowicz
> (mkonstan) <mkons...@cisco.com>; vpp-dev <vpp-dev@lists.fd.io>; csit- 
> d...@lists.fd.io
> Cc: Vratko Polak -X (vrpolak - PANTHEON TECH SRO at Cisco) 
> <vrpo...@cisco.com>; lijian.zh...@arm.com; Honnappa Nagarahalli 
> <honnappa.nagaraha...@arm.com>
> Subject: RE: CSIT - performance tests failing on Taishan
> 
> Hi Benoit,
> 
> Do you have access to FD.io lab? The Taishan servers are in it.
> 
> Juraj
> 
> -----Original Message-----
> From: Benoit Ganne (bganne) <bga...@cisco.com>
> Sent: Friday, November 29, 2019 4:03 PM
> To: Peter Mikus -X (pmikus - PANTHEON TECH SRO at Cisco) 
> <pmi...@cisco.com>; Juraj Linkeš <juraj.lin...@pantheon.tech>; Maciek 
> Konstantynowicz (mkonstan) <mkons...@cisco.com>; vpp-dev <vpp- 
> d...@lists.fd.io>; csit-...@lists.fd.io
> Cc: Vratko Polak -X (vrpolak - PANTHEON TECH SRO at Cisco) 
> <vrpo...@cisco.com>; lijian.zh...@arm.com; Honnappa Nagarahalli 
> <honnappa.nagaraha...@arm.com>
> Subject: RE: CSIT - performance tests failing on Taishan
> 
> Hi Peter, can I get access to the setup to investigate?
> 
> Best
> ben
> 
> > -----Original Message-----
> > From: Peter Mikus -X (pmikus - PANTHEON TECH SRO at Cisco) 
> > <pmi...@cisco.com>
> > Sent: Friday, November 29, 2019 11:08
> > To: Benoit Ganne (bganne) <bga...@cisco.com>; Juraj Linkeš 
> > <juraj.lin...@pantheon.tech>; Maciek Konstantynowicz (mkonstan) 
> > <mkons...@cisco.com>; vpp-dev <vpp-dev@lists.fd.io>; 
> > csit-...@lists.fd.io
> > Cc: Vratko Polak -X (vrpolak - PANTHEON TECH SRO at Cisco) 
> > <vrpo...@cisco.com>; Benoit Ganne (bganne) <bga...@cisco.com>; 
> > lijian.zh...@arm.com; Honnappa Nagarahalli 
> > <honnappa.nagaraha...@arm.com>
> > Subject: RE: CSIT - performance tests failing on Taishan
> >
> > +dev lists
> >
> > Peter Mikus
> > Engineer - Software
> > Cisco Systems Limited
> >
> > > -----Original Message-----
> > > From: Peter Mikus -X (pmikus - PANTHEON TECH SRO at Cisco)
> > > Sent: Friday, November 29, 2019 11:06 AM
> > > To: Benoit Ganne (bganne) <bga...@cisco.com>; Juraj Linkeš 
> > > <juraj.lin...@pantheon.tech>; Maciek Konstantynowicz (mkonstan) 
> > > <mkons...@cisco.com>
> > > Cc: Vratko Polak -X (vrpolak - PANTHEON TECH SRO at Cisco) 
> > > <vrpo...@cisco.com>; Benoit Ganne (bganne) <bga...@cisco.com>; 
> > > lijian.zh...@arm.com; Honnappa Nagarahalli
> > > <honnappa.nagaraha...@arm.com>
> > > Subject: CSIT - performance tests failing on Taishan
> > >
> > > Hello all,
> > >
> > > In CSIT we are observing an issue with the Taishan boxes where 
> > > performance tests are failing.
> > > There has been a long, misleading discussion about the potential 
> > > issue, root cause and what workaround to apply.
> > >
> > > Issue
> > > =====
> > > VPP restarts when "show pci" is read in a loop over the CLI 
> > > socket '/run/vpp/cli.sock'. CSIT runs this loop against VPP with 
> > > the default startup configuration, using the command below, to 
> > > check that VPP is really up and responding.
> > >
> > > How to reproduce
> > > ================
> > > for i in $(seq 1 120); do echo "show pci" | sudo socat - UNIX-CONNECT:/run/vpp/cli.sock; sudo netstat -ap | grep vpp; done
> > >
> > > The same can be reproduced using vppctl:
> > >
> > > for i in $(seq 1 120); do echo "show pci" | sudo vppctl; sudo netstat -ap | grep vpp; done
> > >
> > > To rule out an issue with the test itself, I ran the same loop with "show version":
> > > for i in $(seq 1 120); do echo "show version" | sudo socat - UNIX-CONNECT:/run/vpp/cli.sock; sudo netstat -ap | grep vpp; done
> > >
> > > This test is passing with "show version" and VPP is not restarted.
> > >
> > >
> > > Root cause
> > > ==========
> > > The root cause seems to be:
> > >
> > > Thread 1 "vpp_main" received signal SIGSEGV, Segmentation fault.
> > > 0x0000ffffbeb4f3d0 in format_vlib_pci_vpd (
> > >     s=0xffff7fabe830 "0002:f9:00.0   0  15b3:1015   8.0 GT/s x8  mlx5_core       CX4121A - ConnectX-4 LX SFP28", args=<optimized out>)
> > >     at /w/workspace/vpp-arm-merge-master-ubuntu1804/src/vlib/pci/pci.c:230
> > > 230     /w/workspace/vpp-arm-merge-master-ubuntu1804/src/vlib/pci/pci.c: No such file or directory.
> > > (gdb)
> > > Continuing.
> > >
> > > Thread 1 "vpp_main" received signal SIGABRT, Aborted.
> > > __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
> > > 51      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
> > > (gdb)
> > >
> > >
> > > The issue started after MLX was installed in the Taishan boxes.
> > >
> > >
> > > @Benoit Ganne (bganne) can you please help fix the root cause?
> > >
> > > Thank you.
> > >
> > > Peter Mikus
> > > Engineer - Software
> > > Cisco Systems Limited

Attachment: show_pci_mlnx.dbg
Description: show_pci_mlnx.dbg
