Good morning,

We have a cluster with two types of InfiniBand cards.

The first one:

lspci | grep -i mella
5e:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]

mstflint -d 5e:00.0 q
Image type:            FS3
FW Version:            12.24.1000
FW Release Date:       26.11.2018
Product Version:       12.24.1000
Rom Info:              type=PXE version=3.5.603 cpu=AMD64
Description:           UID                GuidsNumber
Base GUID:             506b4b03001be9fa        4
Base MAC:              506b4b1be9fa            4
Image VSD:             N/A
Device VSD:            N/A
PSID:                  DEL2180110032
Security Attributes:   N/A

# ibv_devinfo
hca_id:            mlx5_0
            transport:                               InfiniBand (0)
            fw_ver:                                               12.24.1000
            node_guid:                             506b:4b03:001b:e9fa
            sys_image_guid:                                506b:4b03:001b:e9fa
            vendor_id:                              0x02c9
            vendor_part_id:                                 4115
            hw_ver:                                              0x0
            board_id:                                DEL2180110032
            phys_port_cnt:                                  1
                        port:    1
                                    state:                          PORT_ACTIVE (4)
                                    max_mtu:                   4096 (5)
                                    active_mtu:                4096 (5)
                                    sm_lid:                        1
                                    port_lid:                      20
                                    port_lmc:                    0x00
                                    link_layer:                   InfiniBand
#ibstat
CA 'mlx5_0'
            CA type: MT4115
            Number of ports: 1
            Firmware version: 12.24.1000
            Hardware version: 0
            Node GUID: 0x506b4b03001be9fa
            System image GUID: 0x506b4b03001be9fa
            Port 1:
                        State: Active
                        Physical state: LinkUp
                        Rate: 100
                        Base lid: 20
                        LMC: 0
                        SM lid: 1
                        Capability mask: 0x2659e848
                        Port GUID: 0x506b4b03001be9fa
                        Link layer: InfiniBand


And the other one:

lspci | grep -i mella
06:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
mstflint -d 06:00.0 q
Image type:            FS4
FW Version:            20.26.4012
FW Release Date:       10.12.2019
Product Version:       20.26.4012
Rom Info:              type=UEFI version=14.19.17 cpu=AMD64
                       type=PXE version=3.5.805 cpu=AMD64
Description:           UID                GuidsNumber
Base GUID:             b8599f0300e4453e        4
Base MAC:              b8599fe4453e            4

ibv_devinfo
hca_id:            mlx5_0
            transport:                               InfiniBand (0)
            fw_ver:                                               20.26.4012
            node_guid:                             b859:9f03:00e4:453e
            sys_image_guid:                                b859:9f03:00e4:453e
            vendor_id:                              0x02c9
            vendor_part_id:                                 4123
            hw_ver:                                              0x0
            board_id:                                LNV0000000016
            phys_port_cnt:                                  1
                        port:    1
                                    state:                          PORT_ACTIVE (4)
                                    max_mtu:                   4096 (5)
                                    active_mtu:                4096 (5)
                                    sm_lid:                        1
                                    port_lid:                      3
                                    port_lmc:                    0x00
                                    link_layer:                   InfiniBand
ibstat
CA 'mlx5_0'
            CA type: MT4123
            Number of ports: 1
            Firmware version: 20.26.4012
            Hardware version: 0
            Node GUID: 0xb8599f0300e4453e
            System image GUID: 0xb8599f0300e4453e
            Port 1:
                        State: Active
                        Physical state: LinkUp
                        Rate: 100
                        Base lid: 3
                        LMC: 0
                        SM lid: 1
                        Capability mask: 0x2659e848
                        Port GUID: 0xb8599f0300e4453e
                        Link layer: InfiniBand


At first we only had the first type, the ConnectX-4 cards, and we used
openmpi-3.1.3:
openmpi@3.1.3 -cuda +cxx_exceptions fabrics=verbs -java -legacylaunchers -memchecker +pmi schedulers=slurm -sqlite3 +thread_multiple +vt

The program that we execute is WRF, and it works fine.

When we expanded the cluster, the new nodes came with ConnectX-6 cards. We
used the same Open MPI build there, but we got warnings about mxm and the
program ran slower than on the first cluster, so we switched to openmpi-4.0.4:
openmpi@4.0.3 -cuda +cxx_exceptions fabrics=verbs -java -legacylaunchers -memchecker +pmi schedulers=slurm -sqlite3 +thread_multiple +vt
With this build the program starts to run but suddenly stops, and if we run
ps we get:

…..
0 S  4556  87383  87361  0  80   0 - 126676 hrtime ?       00:05:25 real.exe
0 S  4556  87384  87361  0  80   0 - 126677 hrtime ?       00:05:33 real.exe
0 S  4556  87385  87361  0  80   0 - 126675 hrtime ?       00:05:28 real.exe
……

(real.exe is part of WRF.)

WCHAN is hrtime, so the processes look like they are running, but they are not actually making any progress.
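
One way to confirm that the ranks are really stalled, rather than just slow, is to take two ps listings a short time apart and compare the accumulated CPU time per PID. The following is only a minimal sketch in Python, not something from our setup: it assumes the listing uses the long-format columns (F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD), as the lines above appear to, and the function names are hypothetical.

```python
def cpu_seconds(time_field):
    """Convert a ps TIME field ([DD-]HH:MM:SS) to seconds."""
    days = 0
    if "-" in time_field:
        day_part, time_field = time_field.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(x) for x in time_field.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def parse_ps_long(text):
    """Map PID -> (WCHAN, CPU seconds) from long-format ps lines.

    Assumed column order (an assumption about how ps was invoked):
    F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD
    """
    procs = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the header, ellipsis lines, and anything too short.
        if len(fields) < 14 or not fields[3].isdigit():
            continue
        procs[int(fields[3])] = (fields[10], cpu_seconds(fields[12]))
    return procs


def stalled_pids(before, after):
    """Return PIDs whose CPU time did not advance between two snapshots."""
    return sorted(
        pid for pid in after
        if pid in before and after[pid][1] == before[pid][1]
    )
```

Feeding two listings taken, say, a minute apart through parse_ps_long and then stalled_pids would list the ranks whose TIME column did not move. Ranks that sleep in hrtime but still accumulate CPU time are more likely polling than fully blocked.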

Do you know anything about this problem? We have another program that shows
the same problem.

We launch our program with Slurm, using srun --mpi=pmix


________________________________________________

Angelines Alberto Morillas

Unidad de Arquitectura Informática
Despacho: 22.1.32
Telf.: +34 91 346 6119
Fax:   +34 91 346 6537

skype: angelines.alberto

CIEMAT
Avenida Complutense, 40
28040 MADRID
________________________________________________

