Re: [OMPI users] *** Error in `orted': double free or corruption (out): 0x00002aaab4001680 ***, in some node combos.

2018-09-12 Thread Balázs Hajgató

Dear Jeff,

Setting mca oob to tcp works. I will stick to this solution in our 
production environment.


I am not sure whether it is relevant, but I also tried the patch on a 
non-production OpenMPI 3.1.1, and "mpirun -host nic114,nic151 hostname" 
works without any parameters, although it still issues the libibverbs error 
(libibverbs: GRH is mandatory For RoCE address handle).


However, if I force mca oob ud, it does not work; it hangs after 
issuing the error:
[nic151:23609] [[45140,0],2] ORTE_ERROR_LOG: Unreachable in file oob_ud_send.c at line 141


After a "ctrl-c":
[nic115:23707] [[45140,0],0] ORTE_ERROR_LOG: Unreachable in file oob_ud_send.c at line 141


Thank you for your answer!

Regards,

Balazs


On 12/09/2018 00:37, Jeff Squyres (jsquyres) via users wrote:

Thanks for reporting the issue.

First, you can work around the issue by using:

 mpirun --mca oob tcp ...

This uses a different out-of-band plugin (TCP) instead of verbs unreliable 
datagrams.
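
For reference, the same MCA parameter can also be set without adding it to every mpirun command line. The following is a minimal sketch, assuming a Bourne-style shell. It relies on Open MPI's convention of reading MCA parameters from environment variables named OMPI_MCA_<name> and from a per-user mca-params.conf file; the echo line is only there to confirm the setting.

```shell
# Equivalent to "mpirun --mca oob tcp ...": Open MPI picks up MCA
# parameters from environment variables of the form OMPI_MCA_<name>.
export OMPI_MCA_oob=tcp

# To make the choice persistent for one user instead, the same setting
# can go into the per-user parameter file (one "name = value" per line):
#   mkdir -p "$HOME/.openmpi"
#   echo "oob = tcp" >> "$HOME/.openmpi/mca-params.conf"

# Confirm what mpirun will inherit from this shell.
echo "oob component selection: ${OMPI_MCA_oob}"
```

A value given on the mpirun command line takes precedence over the environment variable, which in turn takes precedence over the parameter file.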

Second, I just filed a fix for our current release branches (v2.1.x, v3.0.x, 
and v3.1.x):

 https://github.com/open-mpi/ompi/issues/5672

Could you try it out and let me know if it works for you?

Thanks!



On Sep 10, 2018, at 5:36 PM, Balazs HAJGATO wrote:

Dear list readers,

I have some problems with OpenMPI 3.1.1. In some node combos, I get the error 
(libibverbs: GRH is mandatory For RoCE address handle; *** Error in 
`/apps/brussel/CO7/ivybridge-ib/software/OpenMPI/3.1.1-GCC-7.3.0-2.30/bin/orted': 
double free or corruption (out): 0x2aaab4001680 ***), see details in file 
114_151.out.bz2, even with the simplest run, like
mpirun -host nic114,nic151 hostname
In the file 114_151.out.bz2, you can see the output when I run the command from 
nic114. If I run the same command from nic151, it simply prints the 
hostnames, without any errors.

I also enclosed the ompi_info --all --parsable output from nic114 (nic151 is 
identical, see ompi.nic114.bz2). I do not have the config.log file, although I 
still have the config output (see confilg.out.bz2). The nodes have identical 
operating systems (as we use the same image), and OpenMPI is also loaded from a 
central directory shared amongst the nodes. We have an InfiniBand network (with 
IP over IB) and an Ethernet network. Intel MPI works without a problem, and I 
confirmed that the network is IB when I use Intel MPI. It is not clear 
whether the orted error is a consequence of the libibverbs error, and it is 
also not clear why OpenMPI wants to use RoCE at all. (ibv_devinfo is also 
attached; we do have a somewhat creative InfiniBand topology, based on fat-tree, 
but changing the topology did not solve the problem.) The /tmp directory is 
writable and not full. As a matter of fact, I get the same error in case of 
OpenMPI 2.0.2 and 2.1.1, and I do not get this error in case of OpenMPI 
1.10.2 and 1.10.3. Does anyone have any thoughts about this issue?

Regards,

Balazs Hajgato
<114_151.out.bz2>___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users




--
HPC consultant
HPC/VSC Support and System Administration
Computing Center
ULB/VUB
Avenue Adolphe Buyllaan 91 - CP 197
1050 Brussels
Belgium




Re: [OMPI users] No network interfaces were found for out-of-band communications.

2018-09-12 Thread Ralph H Castain
What OMPI version are we talking about here?


> On Sep 11, 2018, at 6:56 PM, Greg Russell wrote:
> 
> I have a single machine with 96 cores. It runs CentOS7 and is not connected to 
> any network, as it needs to be isolated for security.
> 
> I attempted the standard install process and upon attempting to run ./mpirun 
> I find the error message
> 
> "No network interfaces were found for out-of-band communications. We require 
> at least one available network for out-of-band messaging."
> 
> I'm a rookie with openMPI so I'm guessing maybe some configuration flags 
> might fix the whole problem?  Any ideas are very much appreciated.
> 
> Thank you,
> Russell

Re: [OMPI users] No network interfaces were found for out-of-band communications.

2018-09-12 Thread Greg Russell
OpenMPI-3.1.2


Re: [OMPI users] No network interfaces were found for out-of-band communications.

2018-09-12 Thread Jeff Squyres (jsquyres) via users
Can you send all the information listed here:

https://www.open-mpi.org/community/help/





-- 
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI users] No network interfaces were found for out-of-band communications.

2018-09-12 Thread Ralph H Castain
Just looking at the code, we do require that at least the loopback device be 
available. So you need to “activate” the Ethernet support, but you can restrict 
it to only loopback, which should meet your security requirement.
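
As a rough illustration of the loopback-only setup described above, the sketch below checks whether the loopback interface is up and, if mpirun is on the PATH, restricts the out-of-band TCP component to it. It assumes a Linux system that exposes /sys/class/net/lo/flags and that the interface is named lo. iface_is_up is a hypothetical helper, and the choice of the self and vader BTLs for a single-node run is an assumption, not something stated in this thread.

```shell
# Hypothetical helper: succeed if an interface-flags value (hex, as
# exposed in /sys/class/net/<ifname>/flags) has the IFF_UP bit (0x1) set.
iface_is_up() {
    [ $(( $1 & 0x1 )) -ne 0 ]
}

# Read the loopback flags; fall back to 0x0 (down) if the file is missing.
flags=$(cat /sys/class/net/lo/flags 2>/dev/null || echo 0x0)

if iface_is_up "$flags"; then
    echo "lo is up"
    # Assumption: single-node job, so shared memory (vader) and self are
    # enough for MPI traffic, and the OOB layer only needs loopback.
    if command -v mpirun >/dev/null 2>&1; then
        mpirun --mca oob_tcp_if_include lo --mca btl self,vader -np 2 hostname \
            || echo "mpirun exited with an error"
    fi
else
    echo "lo is down; as root, 'ip link set lo up' would enable it"
fi
```

oob_tcp_if_include takes a comma-separated list of interface names, so limiting it to lo keeps the run-time control traffic on the local machine.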

