Re: [OMPI users] Unable to run MPI application
I tried this again and it resulted in the same error:

nymph3.29935PSM can't open /dev/ipath for reading and writing (err=23)
nymph3.29937PSM can't open /dev/ipath for reading and writing (err=23)
nymph3.29936PSM can't open /dev/ipath for reading and writing (err=23)
--
PSM was unable to open an endpoint. Please make sure that the network
link is active on the node and the hardware is functioning.

  Error: Failure in initializing endpoint
—

The link is up according to ibstat:

CA 'qib0'
        CA type: InfiniPath_QLE7340
        Number of ports: 1
        Firmware version:
        Hardware version: 2
        Node GUID: 0x00117576ec76
        System image GUID: 0x00117576ec76
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 40
                Base lid: 6
                LMC: 0
                SM lid: 7
                Capability mask: 0x0761086a
                Port GUID: 0x00117576ec76
                Link layer: InfiniBand

Any other ideas?

Dean

> On 27 Jun 2020, at 16:58, Jeff Squyres (jsquyres) wrote:
>
> On Jun 26, 2020, at 7:30 AM, Peter Kjellström via users wrote:
>>
>>> The cluster hardware is QLogic InfiniBand with Intel CPUs. My
>>> understanding is that we should be using the old PSM for networking.
>>>
>>> Any thoughts what might be going wrong with the build?
>>
>> Yes, only PSM will perform well on that hardware. Make sure that PSM
>> works on the system. Then make sure you got an mca_mtl_psm built.
>
> I think Peter is right: you want to use
>
>     mpirun --mca pml cm --mca mtl psm ...
>
> I *think* Intel InfiniPath is PSM and Intel OmniPath is PSM2, so "psm" is
> what you want (not "psm2").
>
> Don't try to use pml/ob1 + btl/openib, and don't try to use UCX. PSM is
> Intel's native support for its InfiniPath network.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
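For reference, a minimal sanity check that the PSM MTL is actually built and then forcing it at launch (assuming a standard Open MPI install; ./my_app is just a placeholder for the real binary) would look something like:

    $ ompi_info | grep -i mtl      # should list a "psm" MTL component
    $ mpirun --mca pml cm --mca mtl psm -np 4 ./my_app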
Re: [OMPI users] Unable to run MPI application
On Thu, 2 Jul 2020 08:38:51 +
"CHESTER, DEAN \(PGR\) via users" wrote:

> I tried this again and it resulted in the same error:
> nymph3.29935PSM can't open /dev/ipath for reading and writing (err=23)
> nymph3.29937PSM can't open /dev/ipath for reading and writing (err=23)
> nymph3.29936PSM can't open /dev/ipath for reading and writing (err=23)
> --
> PSM was unable to open an endpoint. Please make sure that the network
> link is active on the node and the hardware is functioning.
>
> Error: Failure in initializing endpoint
> —
>
> The link is up according to ibstat:
> CA 'qib0'
> CA type: InfiniPath_QLE7340
...
> Port 1:
> State: Active
> Physical state: LinkUp
> Rate: 40
...
>
> Any other ideas?

Does anything psm-related work?

Do you have the correct permissions on /dev/ipath*? Here are mine:

# ls -ltr /dev/ipath*
crw------- 1 root root 244, 128 Jun  4 12:20 /dev/ipath_diagpkt
crw------- 1 root root 244, 129 Jun  4 12:20 /dev/ipath_diag0
crw-rw-rw- 1 root root 244,   1 Jun  4 12:20 /dev/ipath0
crw-rw-rw- 1 root root 244,   0 Jun  4 12:20 /dev/ipath
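If the permissions on your nodes differ from the listing above, something like the following (run as root) would restore what is shown; whether world read/write is appropriate is a local policy question, and normally the InfiniPath packaging/udev rules are supposed to set this up for you:

    # chmod 666 /dev/ipath /dev/ipath0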
Re: [OMPI users] Unable to run MPI application
The permissions were incorrect!

For our old installation of OMPI 1.10.6 it didn't complain, which is strange.

Thanks for the help.

Dean

> On 2 Jul 2020, at 11:01, Peter Kjellström wrote:
>
> On Thu, 2 Jul 2020 08:38:51 +
> "CHESTER, DEAN \(PGR\) via users" wrote:
>
>> I tried this again and it resulted in the same error:
>> nymph3.29935PSM can't open /dev/ipath for reading and writing (err=23)
>> nymph3.29937PSM can't open /dev/ipath for reading and writing (err=23)
>> nymph3.29936PSM can't open /dev/ipath for reading and writing (err=23)
>> --
>> PSM was unable to open an endpoint. Please make sure that the network
>> link is active on the node and the hardware is functioning.
>>
>> Error: Failure in initializing endpoint
>> —
>>
>> The link is up according to ibstat:
>> CA 'qib0'
>> CA type: InfiniPath_QLE7340
> ...
>> Port 1:
>> State: Active
>> Physical state: LinkUp
>> Rate: 40
> ...
>>
>> Any other ideas?
>
> Does anything psm-related work?
>
> Do you have the correct permissions on /dev/ipath*? Here are mine:
>
> # ls -ltr /dev/ipath*
> crw------- 1 root root 244, 128 Jun  4 12:20 /dev/ipath_diagpkt
> crw------- 1 root root 244, 129 Jun  4 12:20 /dev/ipath_diag0
> crw-rw-rw- 1 root root 244,   1 Jun  4 12:20 /dev/ipath0
> crw-rw-rw- 1 root root 244,   0 Jun  4 12:20 /dev/ipath
Re: [OMPI users] Unable to run MPI application
On Thu, 2 Jul 2020 10:27:51 +
"CHESTER, DEAN \(PGR\) via users" wrote:

> The permissions were incorrect!
>
> For our old installation of OMPI 1.10.6 it didn't complain, which is
> strange.

Then that did not use PSM and as such had horrible performance :-(

/Peter K
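To double-check that a given installation really is using PSM rather than silently falling back to something slower, one option (the verbosity values below are just illustrative, and ./my_app again stands in for the real binary) is to force the cm PML / psm MTL and turn up the framework verbosity:

    $ mpirun --mca pml cm --mca mtl psm \
             --mca pml_base_verbose 10 --mca mtl_base_verbose 10 \
             -np 2 ./my_app

Because the components are forced, the run should fail loudly if PSM cannot be initialized instead of quietly falling back to ob1/TCP.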
[OMPI users] Signal code: Non-existant physical address (2)
I manage a very heterogeneous cluster. I have nodes of different ages with
different processors, different amounts of RAM, etc. One user is reporting
that on certain nodes, his jobs keep crashing with the errors below. His
application is using OpenMPI 1.10.3, which I know is an ancient version of
OpenMPI, but someone else in his research group built the code with that,
so that's what he's stuck with.

I did a Google search of "Signal code: Non-existant physical address", and
it appears that this may be a bug in 1.10.3 that happens on certain
hardware. Can anyone else confirm this?

The full error message is below:

[dawson120:29064] *** Process received signal ***
[dawson120:29062] *** Process received signal ***
[dawson120:29062] Signal: Bus error (7)
[dawson120:29062] Signal code: Non-existant physical address (2)
[dawson120:29062] Failing at address: 0x7ff3f030f180
[dawson120:29067] *** Process received signal ***
[dawson120:29067] Signal: Bus error (7)
[dawson120:29067] Signal code: Non-existant physical address (2)
[dawson120:29067] Failing at address: 0x7fb2b8a61d18
[dawson120:29077] *** Process received signal ***
[dawson120:29078] *** Process received signal ***
[dawson120:29078] Signal: Bus error (7)
[dawson120:29078] Signal code: Non-existant physical address (2)
[dawson120:29078] Failing at address: 0x7f60a13d2c98
[dawson120:29078] [ 0] /lib64/libpthread.so.0(+0xf7e0)[0x7f60b7efd7e0]
[dawson120:29078] [ 1] /usr/pppl/intel/2015-pkgs/openmpi-1.10.3/lib/openmpi/mca_allocator_bucket.so(mca_allocator_bucket_alloc_align+0x84)[0x7f60b20f6ea4]

I've asked the user to switch to a newer version of OpenMPI, but since his
research group is all using the same application and someone else built it,
he's not in a position to do that. For now, he's excluding the "bad" nodes
with Slurm's -x option.

I just want to know if this is in fact a bug in 1.10.3, or if there's
something we can do to fix this error.

Thanks,

--
Prentice
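For completeness, the node exclusion mentioned above is just Slurm's standard exclude option; the job script name here is a placeholder and dawson120 is one of the nodes seen in the logs:

    $ sbatch --exclude=dawson120 job_script.sh    # long form
    $ sbatch -x dawson120 job_script.sh           # short form, same effect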