Re: [OMPI users] openmpi 1.6.3 fails to identify local host if its IP is 127.0.1.1

2013-06-20 Thread Ralph Castain
Been trying to decipher this problem, and think maybe I'm beginning to
understand it. Just to clarify:

* when you execute "hostname", you get the .local response?

* you somewhere have it set up so that 10.x.x.x resolves to the host name with
no ".local" extension?

Correct?



On Wed, Jun 19, 2013 at 1:17 PM, Riccardo Murri wrote:

> On 19 June 2013 20:42, Ralph Castain  wrote:
> > I'm assuming that the offending host has some other address besides
> > just 127.0.1.1 as otherwise it couldn't connect to anything.
>
> Yes, it has an IP on some 10.x.x.x network.
>
>
> > I'm heading out the door for a couple of weeks, but can try to look at
> > it when I return.
>
> We have a workaround (just create the hostfile using FQDNs -- actually
> FQDNs or UQDNs depending on what `uname -n` returns), so it's
> definitely not urgent for us.  But if you think it's a bug worth
> fixing, I can provide details and/or test code.
>
> Thanks,
> Riccardo
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] openmpi 1.6.3 fails to identify local host if its IP is 127.0.1.1

2013-06-20 Thread Riccardo Murri
On 20 June 2013 06:33, Ralph Castain  wrote:
> Been trying to decipher this problem, and think maybe I'm beginning to
> understand it. Just to clarify:
>
> * when you execute "hostname", you get the .local response?

Yes:

[rmurri@nh64-2-11 ~]$ hostname
nh64-2-11.local

[rmurri@nh64-2-11 ~]$ uname -n
nh64-2-11.local

[rmurri@nh64-2-11 ~]$ hostname -s
nh64-2-11

[rmurri@nh64-2-11 ~]$ hostname -f
nh64-2-11.local


> * you somewhere have it set up so that 10.x.x.x resolves to the host name with
> no ".local" extension?

No. Host name resolution is correct, but the hostname resolves to the
127.0.1.1 address:

[rmurri@nh64-2-11 ~]$ getent hosts `hostname`
127.0.1.1       nh64-2-11.local nh64-2-11

Note that `/etc/hosts` also lists a 10.x.x.x address, which is the one
actually assigned to the ethernet interface:

[rmurri@nh64-2-11 ~]$ fgrep `hostname -s` /etc/hosts
127.0.1.1       nh64-2-11.local nh64-2-11
10.1.255.201    nh64-2-11.local nh64-2-11
192.168.255.206 nh64-2-11-myri0

If we remove the `127.0.1.1` line from `/etc/hosts`, then everything
works again.  Also, everything works if we use only FQDNs in the
hostfile.

So it seems that the 127.0.1.1 address is treated specially.
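
A minimal sketch of how one might check a node for this situation. It is not taken from Open MPI: it only scans hosts-file text for entries naming a given host and separates loopback addresses from the rest. The function name `addresses_for` and the `SAMPLE_HOSTS` data are illustrative assumptions; the sample lines mirror the `/etc/hosts` excerpt above.

```python
import ipaddress


def addresses_for(hosts_text, name):
    """Return every address that a hosts-file text maps to `name`."""
    found = []
    for line in hosts_text.splitlines():
        # Drop any trailing comment, then split into address + names.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and name in fields[1:]:
            found.append(ipaddress.ip_address(fields[0]))
    return found


# Sample data mirroring the /etc/hosts excerpt quoted above.
SAMPLE_HOSTS = """\
127.0.1.1       nh64-2-11.local nh64-2-11
10.1.255.201    nh64-2-11.local nh64-2-11
192.168.255.206 nh64-2-11-myri0
"""

addrs = addresses_for(SAMPLE_HOSTS, "nh64-2-11")
loopback = [str(a) for a in addrs if a.is_loopback]
routable = [str(a) for a in addrs if not a.is_loopback]
print("loopback:", loopback)
print("routable:", routable)
```

On a real node the text would come from reading `/etc/hosts`; a non-empty loopback list for the node's own name is the condition described in this thread.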

Thanks,
Riccardo


Re: [OMPI users] openmpi 1.6.3 fails to identify local host if its IP is 127.0.1.1

2013-06-20 Thread Riccardo Murri
On 19 June 2013 23:52, Reuti  wrote:
> On 19.06.2013 at 22:14, Riccardo Murri wrote:
>
>> On 19 June 2013 20:42, Reuti wrote:
>>> On 19.06.2013 at 19:43, Riccardo Murri wrote:
>>>
>>>> On 19 June 2013 16:01, Ralph Castain wrote:
>>>>> How is OMPI picking up this hostfile? It isn't being specified on the cmd
>>>>> line - are you running under some resource manager?
>>>>
>>>> Via the environment variable `OMPI_MCA_orte_default_hostfile`.
>>>>
>>>> We're running under SGE, but disable the OMPI/SGE integration (rather
>
> BTW: Which version of SGE?

SGE6.2u4 running under Rocks 5.3:

$ qstat -h
GE 6.2u4

$ cat /etc/rocks-release
Rocks release 5.3 (Rolled Tacos)


>> It's enabled but (IIRC) the problem is that OpenMPI detects the
>> presence of SGE from some environment variable
>
> Correct.
>
>
>> , which, in our version
>> of SGE, simply isn't there.
>
> Do you use a custom "starter_method" in the queue definition?

No custom starter_method.


> Does a submitted script with:
>
> #!/bin/sh
> env
>
> list at least some of the SGE* environment variables - or none at all?

Quite a few SGE_* variables are in the environment:

$ cat env.sh
env | sort

$ qsub -pe mpi 2 env.sh
Your job 29590 ("env.sh") has been submitted

$ egrep ^SGE_ env.sh.o29590
SGE_ACCOUNT=sge
SGE_ARCH=lx26-amd64
...

However, I cannot reproduce the issue now -- it's quite possible that
it originated on an older cluster (now decommissioned) and we just kept
the submission script on newer hardware without checking.
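
For anyone wanting to check which variables a job actually receives, something like the following could be run from inside the submitted script. It is only a sketch: the four names listed are, to my understanding, the ones Open MPI's gridengine support looks for, but that list is an assumption and should be verified against the installed version. `missing_sge_vars` is a hypothetical helper.

```python
import os

# Assumption: variables believed to be checked by Open MPI's SGE support.
ASSUMED_SGE_VARS = ("SGE_ROOT", "ARC", "PE_HOSTFILE", "JOB_ID")


def missing_sge_vars(env, required=ASSUMED_SGE_VARS):
    """Return the required variable names that are absent or empty in `env`."""
    return [name for name in required if not env.get(name)]


# Report against the environment this script is running in.
missing = missing_sge_vars(os.environ)
print("missing:", ", ".join(missing) if missing else "none")
```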

Thanks for the help,
Riccardo


Re: [OMPI users] openmpi 1.6.3 fails to identify local host if its IP is 127.0.1.1

2013-06-20 Thread Lorenzo Donà
Dear all, thanks to everyone who has helped me.
I compiled Open MPI following all the advice posted, but the error is always
the same. I'm also able to run the examples shipped with the package,
but I really don't know what to do to solve the problem.
I'm counting on you to help me.
Regards, Lorenzo.
 



Re: [OMPI users] openmpi 1.6.3 fails to identify local host if its IP is 127.0.1.1

2013-06-20 Thread Jeff Squyres (jsquyres)
Er... are you having problems with host IP address 127.0.1.1, or did you
reply to the wrong thread?

I thought you were asking about problems with multiple mpif90's in your PATH,
etc. -- not 127.0.1.1 IP address issues.  IIRC, there were a bunch of
suggestions over on that thread about how to fix your problem.  If those were
not helpful to you, it might be easier to find a local Linux/OS X/shell guru
and get them to help you set up your PATH / LD_LIBRARY_PATH correctly, and give
you a quick tutorial on shell basics.




-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI users] error running with mpirun

2013-06-20 Thread Lorenzo Donà
Dear all who have helped me: THANKS for your patience with me.
I was able to compile with Open MPI,
but now I get this error message when running programs compiled with Open MPI:

A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
Open MPI stopped checking at the first component that it did not find.

Host:  MacBook-Pro-di-Lorenzo-Dona.local
Framework: ras
Component: proxy
--
[MacBook-Pro-di-Lorenzo-Dona.local:34123] [[34784,0],0] ORTE_ERROR_LOG: Error 
in file ess_hnp_module.c at line 360
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ras_base_open failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--
[MacBook-Pro-di-Lorenzo-Dona.local:34122] [[INVALID],INVALID] ORTE_ERROR_LOG: 
Unable to start a daemon on the local node in file ess_singleton_module.c at 
line 428
[MacBook-Pro-di-Lorenzo-Dona.local:34122] [[INVALID],INVALID] ORTE_ERROR_LOG: 
Unable to start a daemon on the local node in file ess_singleton_module.c at 
line 211
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value Unable to start a daemon on the local node (-127) instead 
of ORTE_SUCCESS
--
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: orte_init failed
  --> Returned "Unable to start a daemon on the local node" (-127) instead of 
"Success" (0)
--
[MacBook-Pro-di-Lorenzo-Dona:34122] *** An error occurred in MPI_Init
[MacBook-Pro-di-Lorenzo-Dona:34122] *** reported by process 
[4294967295,4294967295]
[MacBook-Pro-di-Lorenzo-Dona:34122] *** on a NULL communicator
[MacBook-Pro-di-Lorenzo-Dona:34122] *** Unknown error
[MacBook-Pro-di-Lorenzo-Dona:34122] *** MPI_ERRORS_ARE_FATAL (processes in this 
communicator will now abort,
[MacBook-Pro-di-Lorenzo-Dona:34122] ***and potentially your MPI job)
--
An MPI process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly.  You should
double check that everything has shut down cleanly.

  Reason: Before MPI_INIT completed
  Local host: MacBook-Pro-di-Lorenzo-Dona.local
  PID:34122
--
MacBook-Pro-di-Lorenzo-Dona:v1 lorenzodona$ export LD_LIBRARY_PATH=/Users/lorenzodona/Desktop/openmpi-1.7.1/bin/lib:$LD_LIBRARY_PATH
MacBook-Pro-di-Lorenzo-Dona:v1 lorenzodona$ mpirun -np 1 /Users/lorenzodona/Downloads/abinit-7.2.2/src/98_main/abinit
  Returned value Error (-1) instead of ORTE_SUCCESS
--
[MacBook-Pro-di-Lorenzo-Dona.local:34142] [[INVALID],INVALID] ORTE_ERROR_LOG: 
Unable to start a daemon on the local node in file ess_singleton_module.c at 
line 428
[MacBook-Pro-di-Lorenzo-Dona.local:34142] [[INVALID],INVALID] ORTE_ERROR_LOG: 
Unable to start a daemon on the local node in file ess_singleton_module.c at 
line 211
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some addition

[OMPI users] Detecting Node Failure

2013-06-20 Thread Claire Williams
Hi all,

I was wondering if Open-MPI had any way to detect that a node has crashed, 
rebooted, etc. I am currently trying to integrate my MPI application with 
Amazon EC2 spot instances, and since spot instances can be terminated at any 
time, I would like to try to make it so that my application can detect this 
node failure, maybe remove the node from the machine file, and restart the 
application automatically. Right now, when one of the worker nodes is rebooted 
or terminated, the master that is waiting on the results of that node will just 
hang, waiting for results that will never come. 
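
One application-level way to avoid the indefinite hang described above would be for the master to poll for each result with a deadline instead of blocking. The sketch below is generic Python and does not touch MPI. `wait_with_deadline` and the `poll` callable are hypothetical names, and wiring `poll` to a non-blocking MPI test (for instance `Request.test()` in mpi4py) is an assumption left to the application. It only detects a missing result; it does not make the MPI job itself survive a lost node.

```python
import time


def wait_with_deadline(poll, timeout, interval=0.05):
    """Call `poll()` until it reports completion or `timeout` seconds pass.

    `poll` must return a (done, value) pair without blocking.
    Returns (True, value) on completion, (False, None) on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        done, value = poll()
        if done:
            return True, value
        if time.monotonic() >= deadline:
            return False, None
        time.sleep(interval)


# Stand-in for a worker that answers on the third poll.
calls = {"n": 0}


def fake_worker_poll():
    calls["n"] += 1
    ready = calls["n"] >= 3
    return ready, ("result" if ready else None)


print(wait_with_deadline(fake_worker_poll, timeout=1.0))
# Stand-in for a worker that never answers.
print(wait_with_deadline(lambda: (False, None), timeout=0.2))
```

On a timeout the master could mark that worker as lost and hand its work to another node, or exit so that an outer script can restart the job.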

Thanks,

Claire  

Re: [OMPI users] Detecting Node Failure

2013-06-20 Thread Ralph Castain
It should detect and abort - what version are you using?
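
If the job does abort when a node is lost, the restart described in the original question can be driven from outside MPI. A rough sketch, assuming a launcher wrapper and a reachability probe that are both hypothetical: `launch` would wrap the real `mpirun` call and return its exit code, and `is_alive` could be a ping or an EC2 status query. Only the machine-file pruning and retry logic is shown concretely.

```python
def prune_machinefile(lines, is_alive):
    """Keep comments, blank lines, and entries whose host passes `is_alive`."""
    kept = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            kept.append(line)
            continue
        host = stripped.split()[0]  # first field is the host name
        if is_alive(host):
            kept.append(line)
    return kept


def run_with_restart(lines, launch, is_alive, max_attempts=3):
    """Relaunch until `launch` returns 0, dropping dead hosts in between.

    Returns (exit_code, lines_used_for_the_last_attempt).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        code = launch(lines)
        if code == 0 or attempt == max_attempts - 1:
            return code, lines
        lines = prune_machinefile(lines, is_alive)


# Demo with stand-ins: node2 is "dead", and the fake launcher fails
# for as long as a dead host is still listed.
machinefile = ["node1 slots=2", "node2 slots=2", "# spare", "node3 slots=2"]
dead = {"node2"}


def fake_launch(lines):
    hosts = {l.split()[0] for l in lines if l.strip() and not l.startswith("#")}
    return 1 if hosts & dead else 0


print(run_with_restart(machinefile, fake_launch, lambda h: h not in dead))
```

Note that a restart like this begins the application from scratch unless the application saves and reloads its own state.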

Sent from my iPhone



Re: [OMPI users] Detecting Node Failure

2013-06-20 Thread Claire Williams
Hi Ralph,

I'm using 1.4.3. Thanks

- Claire




Re: [OMPI users] Detecting Node Failure

2013-06-20 Thread Ralph Castain
Wow, that's ancient - can you upgrade to the 1.6 series?

Sent from my iPhone



Re: [OMPI users] Detecting Node Failure

2013-06-20 Thread Andreas Schäfer
On 14:59 Thu 20 Jun, Ralph Castain wrote:
> It should detect and abort - what version are you using?

Would it be possible to call MPI_Comm_disconnect() in case the
communicator in question is an intercommunicator -- without having OMPI abort?

I'm asking because if we had a possibility to dynamically
connect/disconnect nodes in a robust way, then we could build
fault-resilient apps on top of that.

Best
-Andreas


-- 
==
Andreas Schäfer
HPC and Grid Computing
Chair of Computer Science 3
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
+49 9131 85-27910
PGP/GPG key via keyserver
http://www.libgeodecomp.org
==

(\___/)
(+'.'+)
(")_(")
This is Bunny. Copy and paste Bunny into your
signature to help him gain world domination!




Re: [OMPI users] Detecting Node Failure

2013-06-20 Thread Jeff Squyres (jsquyres)
Not at present, no.

But you might want to look at a fork of the OMPI code base that was exploring 
fault resilience issues:

http://fault-tolerance.org/




-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Detecting Node Failure

2013-06-20 Thread Ralph Castain
We will also be supporting that in the developer's trunk fairly soon, and
that will appear later on in the 1.9 series.


