Good morning,

We have a cluster with two types of InfiniBand cards.
The first one:

    # lspci | grep -i mella
    5e:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]

    # mstflint -d 5e:00.0 q
    Image type:            FS3
    FW Version:            12.24.1000
    FW Release Date:       26.11.2018
    Product Version:       12.24.1000
    Rom Info:              type=PXE version=3.5.603 cpu=AMD64
    Description:           UID                 GuidsNumber
    Base GUID:             506b4b03001be9fa    4
    Base MAC:              506b4b1be9fa        4
    Image VSD:             N/A
    Device VSD:            N/A
    PSID:                  DEL2180110032
    Security Attributes:   N/A

    # ibv_devinfo
    hca_id: mlx5_0
        transport:         InfiniBand (0)
        fw_ver:            12.24.1000
        node_guid:         506b:4b03:001b:e9fa
        sys_image_guid:    506b:4b03:001b:e9fa
        vendor_id:         0x02c9
        vendor_part_id:    4115
        hw_ver:            0x0
        board_id:          DEL2180110032
        phys_port_cnt:     1
            port: 1
                state:         PORT_ACTIVE (4)
                max_mtu:       4096 (5)
                active_mtu:    4096 (5)
                sm_lid:        1
                port_lid:      20
                port_lmc:      0x00
                link_layer:    InfiniBand

    # ibstat
    CA 'mlx5_0'
        CA type: MT4115
        Number of ports: 1
        Firmware version: 12.24.1000
        Hardware version: 0
        Node GUID: 0x506b4b03001be9fa
        System image GUID: 0x506b4b03001be9fa
        Port 1:
            State: Active
            Physical state: LinkUp
            Rate: 100
            Base lid: 20
            LMC: 0
            SM lid: 1
            Capability mask: 0x2659e848
            Port GUID: 0x506b4b03001be9fa
            Link layer: InfiniBand

And the other one:

    # lspci | grep -i mella
    06:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]

    # mstflint -d 06:00.0 q
    Image type:            FS4
    FW Version:            20.26.4012
    FW Release Date:       10.12.2019
    Product Version:       20.26.4012
    Rom Info:              type=UEFI version=14.19.17 cpu=AMD64
                           type=PXE version=3.5.805 cpu=AMD64
    Description:           UID                 GuidsNumber
    Base GUID:             b8599f0300e4453e    4
    Base MAC:              b8599fe4453e        4

    # ibv_devinfo
    hca_id: mlx5_0
        transport:         InfiniBand (0)
        fw_ver:            20.26.4012
        node_guid:         b859:9f03:00e4:453e
        sys_image_guid:    b859:9f03:00e4:453e
        vendor_id:         0x02c9
        vendor_part_id:    4123
        hw_ver:            0x0
        board_id:          LNV0000000016
        phys_port_cnt:     1
            port: 1
                state:         PORT_ACTIVE (4)
                max_mtu:       4096 (5)
                active_mtu:    4096 (5)
                sm_lid:        1
                port_lid:      3
                port_lmc:      0x00
                link_layer:    InfiniBand

    # ibstat
    CA 'mlx5_0'
        CA type: MT4123
        Number of ports: 1
        Firmware version: 20.26.4012
        Hardware version: 0
        Node GUID: 0xb8599f0300e4453e
        System image GUID: 0xb8599f0300e4453e
        Port 1:
            State: Active
            Physical state: LinkUp
            Rate: 100
            Base lid: 3
            LMC: 0
            SM lid: 1
            Capability mask: 0x2659e848
            Port GUID: 0xb8599f0300e4453e
            Link layer: InfiniBand

At the beginning we only had the first kind, the ConnectX-4 cards, and we used openmpi-3.1.3, built with this Spack spec:

    openmpi@3.1.3 -cuda +cxx_exceptions fabrics=verbs -java -legacylaunchers -memchecker +pmi schedulers=slurm -sqlite3 +thread_multiple +vt

The program that we execute is WRF, and it works fine.

When we expanded the cluster, the new nodes came with ConnectX-6 cards. With the same Open MPI we got warnings about MXM, and in this case the program was slower than on the first part of the cluster, so we switched to openmpi-4.0.4:

    openmpi@4.0.3 -cuda +cxx_exceptions fabrics=verbs -java -legacylaunchers -memchecker +pmi schedulers=slurm -sqlite3 +thread_multiple +vt

With this version the program starts to run but suddenly stops making progress, and if we do a ps we get:

    ...
    0 S  4556 87383 87361  0  80  0 - 126676 hrtime ?  00:05:25 real.exe
    0 S  4556 87384 87361  0  80  0 - 126677 hrtime ?  00:05:33 real.exe
    0 S  4556 87385 87361  0  80  0 - 126675 hrtime ?  00:05:28 real.exe
    ...

(real.exe is part of WRF.) The WCHAN is hrtime, so the processes look as if they are running, but in reality they make no progress. Do you know anything about this problem?
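In case it helps, below is a minimal sketch of the checks we can run while the job is stuck. It assumes gdb, ompi_info and the UCX tools are installed on the compute node; the PID 87383 is just one of the real.exe processes from the ps listing above:

    # Attach to one stuck rank, dump backtraces of all its threads, then detach
    gdb -p 87383 -batch -ex "thread apply all bt"

    # Show which point-to-point components (pml/mtl/btl) this Open MPI build has
    ompi_info | grep -E 'pml|mtl|btl'

    # If UCX is installed, list the transports it detects on the ConnectX-6 node
    ucx_info -d | grep -i transport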
We have another program that shows the same problem. We launch our programs with Slurm: srun --mpi=pmix

________________________________________________
Angelines Alberto Morillas
Unidad de Arquitectura Informática
Office: 22.1.32
Tel.: +34 91 346 6119
Fax: +34 91 346 6537
skype: angelines.alberto
CIEMAT
Avenida Complutense, 40
28040 MADRID
________________________________________________
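P.S. In case it matters, a quick way to confirm which MPI plugin types our Slurm supports (the output varies per installation):

    srun --mpi=list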