Re: [OMPI users] Unable to run MPI application
I tried this again and it resulted in the same error:

nymph3.29935PSM can't open /dev/ipath for reading and writing (err=23)
nymph3.29937PSM can't open /dev/ipath for reading and writing (err=23)
nymph3.29936PSM can't open /dev/ipath for reading and writing (err=23)
--
PSM was unable to open an endpoint. Please make sure that the network
link is active on the node and the hardware is functioning.

  Error: Failure in initializing endpoint
—

The link is up according to ibstat:

CA 'qib0'
        CA type: InfiniPath_QLE7340
        Number of ports: 1
        Firmware version:
        Hardware version: 2
        Node GUID: 0x00117576ec76
        System image GUID: 0x00117576ec76
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 40
                Base lid: 6
                LMC: 0
                SM lid: 7
                Capability mask: 0x0761086a
                Port GUID: 0x00117576ec76
                Link layer: InfiniBand

Any other ideas?

Dean

> On 27 Jun 2020, at 16:58, Jeff Squyres (jsquyres) wrote:
>
> On Jun 26, 2020, at 7:30 AM, Peter Kjellström via users wrote:
>>
>>> The cluster hardware is QLogic InfiniBand with Intel CPUs. My
>>> understanding is that we should be using the old PSM for networking.
>>>
>>> Any thoughts what might be going wrong with the build?
>>
>> Yes, only PSM will perform well on that hardware. Make sure that PSM
>> works on the system. Then make sure you got an mca_mtl_psm built.
>
> I think Peter is right: you want to use
>
>     mpirun --mca pml cm --mca mtl psm ...
>
> I *think* Intel InfiniPath is PSM and Intel OmniPath is PSM2, so "psm" is
> what you want (not "psm2").
>
> Don't try to use pml/ob1 + btl/openib, and don't try to use UCX. PSM is
> Intel's native support for its InfiniPath network.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
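For reference, a minimal sanity check that the PSM MTL is actually built and then forcing it at launch (assuming a standard Open MPI install; ./my_app is just a placeholder for the real binary) would look something like:

    $ ompi_info | grep -i mtl      # should list a "psm" MTL component
    $ mpirun --mca pml cm --mca mtl psm -np 4 ./my_app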
Re: [OMPI users] Unable to run MPI application
On Thu, 2 Jul 2020 08:38:51 +
"CHESTER, DEAN \(PGR\) via users" wrote:

> I tried this again and it resulted in the same error:
> nymph3.29935PSM can't open /dev/ipath for reading and writing (err=23)
> nymph3.29937PSM can't open /dev/ipath for reading and writing (err=23)
> nymph3.29936PSM can't open /dev/ipath for reading and writing (err=23)
> --
> PSM was unable to open an endpoint. Please make sure that the network
> link is active on the node and the hardware is functioning.
>
> Error: Failure in initializing endpoint
> —
>
> The link is up according to ibstat:
> CA 'qib0'
> CA type: InfiniPath_QLE7340
...
> Port 1:
> State: Active
> Physical state: LinkUp
> Rate: 40
...
>
> Any other ideas?

Does anything psm-related work?

Do you have the correct permissions on /dev/ipath*? Here are mine:

# ls -ltr /dev/ipath*
crw------- 1 root root 244, 128 Jun  4 12:20 /dev/ipath_diagpkt
crw------- 1 root root 244, 129 Jun  4 12:20 /dev/ipath_diag0
crw-rw-rw- 1 root root 244,   1 Jun  4 12:20 /dev/ipath0
crw-rw-rw- 1 root root 244,   0 Jun  4 12:20 /dev/ipath
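If the permissions on your nodes differ from the listing above, something like the following (run as root) would restore what is shown; whether world read/write is appropriate is a local policy question, and normally the InfiniPath packaging/udev rules are supposed to set this up for you:

    # chmod 666 /dev/ipath /dev/ipath0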
Re: [OMPI users] Unable to run MPI application
The permissions were incorrect!

For our old installation of OMPI 1.10.6 it didn't complain, which is strange.

Thanks for the help.

Dean

> On 2 Jul 2020, at 11:01, Peter Kjellström wrote:
>
> On Thu, 2 Jul 2020 08:38:51 +
> "CHESTER, DEAN \(PGR\) via users" wrote:
>
>> I tried this again and it resulted in the same error:
>> nymph3.29935PSM can't open /dev/ipath for reading and writing (err=23)
>> nymph3.29937PSM can't open /dev/ipath for reading and writing (err=23)
>> nymph3.29936PSM can't open /dev/ipath for reading and writing (err=23)
>> --
>> PSM was unable to open an endpoint. Please make sure that the network
>> link is active on the node and the hardware is functioning.
>>
>> Error: Failure in initializing endpoint
>> —
>>
>> The link is up according to ibstat:
>> CA 'qib0'
>> CA type: InfiniPath_QLE7340
> ...
>> Port 1:
>> State: Active
>> Physical state: LinkUp
>> Rate: 40
> ...
>>
>> Any other ideas?
>
> Does anything psm-related work?
>
> Do you have the correct permissions on /dev/ipath*? Here are mine:
>
> # ls -ltr /dev/ipath*
> crw------- 1 root root 244, 128 Jun  4 12:20 /dev/ipath_diagpkt
> crw------- 1 root root 244, 129 Jun  4 12:20 /dev/ipath_diag0
> crw-rw-rw- 1 root root 244,   1 Jun  4 12:20 /dev/ipath0
> crw-rw-rw- 1 root root 244,   0 Jun  4 12:20 /dev/ipath
Re: [OMPI users] Unable to run MPI application
On Thu, 2 Jul 2020 10:27:51 +
"CHESTER, DEAN \(PGR\) via users" wrote:

> The permissions were incorrect!
>
> For our old installation of OMPI 1.10.6 it didn't complain, which is
> strange.

Then that did not use PSM and as such had horrible performance :-(

/Peter K
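To double-check that a given installation really is using PSM rather than silently falling back to something slower, one option (the verbosity values below are just illustrative, and ./my_app again stands in for the real binary) is to force the cm PML / psm MTL and turn up the framework verbosity:

    $ mpirun --mca pml cm --mca mtl psm \
             --mca pml_base_verbose 10 --mca mtl_base_verbose 10 \
             -np 2 ./my_app

Because the components are forced, the run should fail loudly if PSM cannot be initialized instead of quietly falling back to ob1/TCP.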
[OMPI users] Signal code: Non-existant physical address (2)
I manage a very heterogeneous cluster. I have nodes of different ages with
different processors, different amounts of RAM, etc. One user is reporting
that on certain nodes, his jobs keep crashing with the errors below. His
application is using OpenMPI 1.10.3, which I know is an ancient version of
OpenMPI, but someone else in his research group built the code with that,
so that's what he's stuck with.

I did a Google search of "Signal code: Non-existant physical address", and
it appears that this may be a bug in 1.10.3 that happens on certain
hardware. Can anyone else confirm this?

The full error message is below:

[dawson120:29064] *** Process received signal ***
[dawson120:29062] *** Process received signal ***
[dawson120:29062] Signal: Bus error (7)
[dawson120:29062] Signal code: Non-existant physical address (2)
[dawson120:29062] Failing at address: 0x7ff3f030f180
[dawson120:29067] *** Process received signal ***
[dawson120:29067] Signal: Bus error (7)
[dawson120:29067] Signal code: Non-existant physical address (2)
[dawson120:29067] Failing at address: 0x7fb2b8a61d18
[dawson120:29077] *** Process received signal ***
[dawson120:29078] *** Process received signal ***
[dawson120:29078] Signal: Bus error (7)
[dawson120:29078] Signal code: Non-existant physical address (2)
[dawson120:29078] Failing at address: 0x7f60a13d2c98
[dawson120:29078] [ 0] /lib64/libpthread.so.0(+0xf7e0)[0x7f60b7efd7e0]
[dawson120:29078] [ 1] /usr/pppl/intel/2015-pkgs/openmpi-1.10.3/lib/openmpi/mca_allocator_bucket.so(mca_allocator_bucket_alloc_align+0x84)[0x7f60b20f6ea4]

I've asked the user to switch to a newer version of OpenMPI, but since his
research group is all using the same application and someone else built it,
he's not in a position to do that. For now, he's excluding the "bad" nodes
with Slurm's -x option.

I just want to know if this is in fact a bug in 1.10.3, or if there's
something we can do to fix this error.

Thanks,

--
Prentice
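For completeness, the node exclusion mentioned above is just Slurm's standard exclude option; the job script name here is a placeholder and dawson120 is one of the nodes seen in the logs:

    $ sbatch --exclude=dawson120 job_script.sh    # long form
    $ sbatch -x dawson120 job_script.sh           # short form, same effect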