Re: [OMPI users] openmpi 1.6.3 fails to identify local host if its IP is 127.0.1.1
Been trying to decipher this problem, and think maybe I'm beginning to understand it. Just to clarify:

* when you execute "hostname", you get the .local response?
* you somewhere have it set up so that 10.x.x.x resolves to a name with no ".local" extension?

Correct?

On Wed, Jun 19, 2013 at 1:17 PM, Riccardo Murri wrote:
> On 19 June 2013 20:42, Ralph Castain wrote:
> > I'm assuming that the offending host has some other address besides
> > just 127.0.1.1 as otherwise it couldn't connect to anything.
>
> Yes, it has an IP on some 10.x.x.x network.
>
> > I'm heading out the door for a couple of weeks, but can try to look at
> > it when I return.
>
> We have a workaround (just create the hostfile using FQDNs -- actually
> FQDNs or UQDNs depending on what `uname -n` returns), so it's
> definitely not urgent for us. But if you think it's a bug worth
> fixing, I can provide details and/or test code.
>
> Thanks,
> Riccardo
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] openmpi 1.6.3 fails to identify local host if its IP is 127.0.1.1
On 20 June 2013 06:33, Ralph Castain wrote:
> Been trying to decipher this problem, and think maybe I'm beginning to
> understand it. Just to clarify:
>
> * when you execute "hostname", you get the .local response?

Yes:

    [rmurri@nh64-2-11 ~]$ hostname
    nh64-2-11.local
    [rmurri@nh64-2-11 ~]$ uname -n
    nh64-2-11.local
    [rmurri@nh64-2-11 ~]$ hostname -s
    nh64-2-11
    [rmurri@nh64-2-11 ~]$ hostname -f
    nh64-2-11.local

> * you somewhere have it set up so that 10.x.x.x resolves to a name with no
> ".local" extension?

No. Host name resolution is correct, but the hostname resolves to the 127.0.1.1 address:

    [rmurri@nh64-2-11 ~]$ getent hosts `hostname`
    127.0.1.1       nh64-2-11.local nh64-2-11

Note that `/etc/hosts` also lists a 10.x.x.x address, which is the one actually assigned to the ethernet interface:

    [rmurri@nh64-2-11 ~]$ fgrep `hostname -s` /etc/hosts
    127.0.1.1       nh64-2-11.local nh64-2-11
    10.1.255.201    nh64-2-11.local nh64-2-11
    192.168.255.206 nh64-2-11-myri0

If we remove the `127.0.1.1` line from `/etc/hosts`, then everything works again. Also, everything works if we use only FQDNs in the hostfile.

So it seems that the 127.0.1.1 address is treated specially.

Thanks,
Riccardo
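For anyone who wants to check a node for the same condition, here is a minimal sketch in Python. It is not anything Open MPI itself does; it only scans hosts-file text and reports whether the first entry listing a given name is a loopback address. The function names `first_match` and `resolves_to_loopback` are made up for this illustration, and the "first matching line wins" rule is an assumption based on the `getent` output above (a real resolver may also consult DNS or other sources).

```python
import ipaddress


def first_match(hosts_text, name):
    """Return the address on the first hosts-file line that lists `name`, or None.

    Assumption: the first matching line wins, as the getent output suggests.
    """
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        if name in names:
            return address
    return None


def resolves_to_loopback(hosts_text, name):
    """True if the first entry for `name` is in a loopback range (e.g. 127.0.0.0/8)."""
    address = first_match(hosts_text, name)
    return address is not None and ipaddress.ip_address(address).is_loopback


# Sample text modelled on the /etc/hosts lines quoted in the message above.
SAMPLE = """\
127.0.1.1       nh64-2-11.local nh64-2-11
10.1.255.201    nh64-2-11.local nh64-2-11
192.168.255.206 nh64-2-11-myri0
"""

if __name__ == "__main__":
    for host in ("nh64-2-11.local", "nh64-2-11", "nh64-2-11-myri0"):
        print(host, first_match(SAMPLE, host), resolves_to_loopback(SAMPLE, host))
```

On a real node one would pass the contents of `/etc/hosts` instead of `SAMPLE`.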
Re: [OMPI users] openmpi 1.6.3 fails to identify local host if its IP is 127.0.1.1
On 19 June 2013 23:52, Reuti wrote:
> On 19.06.2013 at 22:14, Riccardo Murri wrote:
>
>> On 19 June 2013 20:42, Reuti wrote:
>>> On 19.06.2013 at 19:43, Riccardo Murri wrote:
>>>
>>>> On 19 June 2013 16:01, Ralph Castain wrote:
>>>>> How is OMPI picking up this hostfile? It isn't being specified on the cmd
>>>>> line - are you running under some resource manager?
>>>>
>>>> Via the environment variable `OMPI_MCA_orte_default_hostfile`. We're
>>>> running under SGE, but disable the OMPI/SGE integration (rather
>
> BTW: Which version of SGE?

SGE6.2u4 running under Rocks 5.3:

    $ qstat -h
    GE 6.2u4

    $ cat /etc/rocks-release
    Rocks release 5.3 (Rolled Tacos)

>> It's enabled but (IIRC) the problem is that OpenMPI detects the
>> presence of SGE from some environment variable
>
> Correct.
>
>> , which, in our version
>> of SGE, simply isn't there.
>
> Do you use a custom "starter_method" in the queue definition?

No custom starter_method.

> Does a submitted script with:
>
> #!/bin/sh
> env
>
> list at least some of the SGE* environment variables - or none at all?

Quite a few SGE_* variables are in the environment:

    $ cat env.sh
    env | sort

    $ qsub -pe mpi 2 env.sh
    Your job 29590 ("env.sh") has been submitted

    $ egrep ^SGE_ env.sh.o29590
    SGE_ACCOUNT=sge
    SGE_ARCH=lx26-amd64
    ...

However, I cannot reproduce the issue now -- it's quite possible that it originated on an older cluster (now decommissioned) and we just kept the submission script on newer hardware without checking.

Thanks for the help,
Riccardo
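The workaround mentioned earlier in this thread (write the hostfile with names in the same form that `uname -n` returns) could be scripted roughly as follows. This is only a sketch under stated assumptions: the helper names `match_style` and `build_hostfile` are invented here, and it assumes every host shares the local node's domain suffix, which may not hold on other clusters. The only name taken from the thread is the `OMPI_MCA_orte_default_hostfile` environment variable.

```python
import os
import tempfile


def match_style(host, nodename):
    """Rewrite `host` so it is qualified exactly when `nodename` is.

    Assumption: all hosts live in the same domain as the local node.
    """
    short = host.split(".", 1)[0]
    if "." in nodename:
        domain = nodename.split(".", 1)[1]
        return f"{short}.{domain}"
    return short


def build_hostfile(hosts, nodename):
    """Return hostfile text, one host per line, in the style of `nodename`."""
    return "".join(match_style(h, nodename) + "\n" for h in hosts)


if __name__ == "__main__":
    nodename = os.uname().nodename  # the value `uname -n` prints (POSIX only)
    text = build_hostfile(["nh64-2-11", "nh64-2-12.local"], nodename)

    # Write the hostfile and point Open MPI at it via the variable named above.
    with tempfile.NamedTemporaryFile("w", suffix=".hosts", delete=False) as fh:
        fh.write(text)
    os.environ["OMPI_MCA_orte_default_hostfile"] = fh.name
    print(text, end="")
```

The host names in the example are placeholders; in a batch job they would come from whatever node list the scheduler provides.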
Re: [OMPI users] openmpi 1.6.3 fails to identify local host if its IP is 127.0.1.1
Dear all who have helped me, thanks to everyone.

I compiled Open MPI following all the advice you posted, but the error is always the same. I'm also able to run the examples found with the package, but I really don't know what else I can do to solve the problem. I trust you can help me.

Regards,
Lorenzo

On 20 Jun 2013, at 06:33, Ralph Castain wrote:
> Been trying to decipher this problem, and think maybe I'm beginning to
> understand it. Just to clarify:
>
> * when you execute "hostname", you get the .local response?
>
> * you somewhere have it set up so that 10.x.x.x resolves to a name with no
> ".local" extension?
>
> Correct?
Re: [OMPI users] openmpi 1.6.3 fails to identify local host if its IP is 127.0.1.1
Er... are you having problems with host IP addresses 127.0.1.1, or did you reply to the wrong thread?

I thought you were asking about problems with multiple mpif90's in your PATH, etc. -- not 127.0.1.1 IP address issues. IIRC, there were a bunch of suggestions over on that thread about how to fix your problem. If those were not helpful to you, it might be easier to find a local Linux/OS X/shell guru and get them to help you set up your PATH / LD_LIBRARY_PATH correctly, and give you a quick tutorial on shell basics.

On Jun 20, 2013, at 10:04 AM, Lorenzo Donà wrote:
> Dear all that help me thanks to everyone.
> I compiled open MPI with all yours advices posted but the error is always the
> same I'm also able to run the examples found with the package.
> but really I don't know what can I do to solve the problem.
> I trust in you to help me.
> Dearly Lorenzo.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
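A quick way to see whether several copies of a wrapper such as mpif90 are on the PATH is to list every match rather than only the first one the shell would pick. The sketch below does that in Python; `find_all_on_path` is a name invented for this example, not part of Open MPI or any standard tool.

```python
import os


def find_all_on_path(program, path=None):
    """Return every executable file named `program` found on `path`, in search order.

    `path` defaults to the PATH environment variable. The first entry of the
    result is the copy a shell would normally run.
    """
    if path is None:
        path = os.environ.get("PATH", "")
    hits = []
    for directory in path.split(os.pathsep):
        if not directory:
            continue  # skip empty PATH entries
        candidate = os.path.join(directory, program)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            hits.append(candidate)
    return hits


if __name__ == "__main__":
    for name in ("mpif90", "mpirun"):
        matches = find_all_on_path(name)
        print(name, "->", matches if matches else "not found")
```

More than one match for the same name suggests that the compiler wrapper and the mpirun being used may come from different installations.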
[OMPI users] error running with mpirun
Dear all who have helped me: THANKS for your patience with me.

I was able to compile with Open MPI, but now I get this error message when running programs compiled with Open MPI:

A requested component was not found, or was unable to be opened. This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded). Note that
Open MPI stopped checking at the first component that it did not find.

Host:      MacBook-Pro-di-Lorenzo-Dona.local
Framework: ras
Component: proxy
--
[MacBook-Pro-di-Lorenzo-Dona.local:34123] [[34784,0],0] ORTE_ERROR_LOG: Error in file ess_hnp_module.c at line 360
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ras_base_open failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--
[MacBook-Pro-di-Lorenzo-Dona.local:34122] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 428
[MacBook-Pro-di-Lorenzo-Dona.local:34122] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 211
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value Unable to start a daemon on the local node (-127) instead of ORTE_SUCCESS
--
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: orte_init failed
  --> Returned "Unable to start a daemon on the local node" (-127) instead of "Success" (0)
--
[MacBook-Pro-di-Lorenzo-Dona:34122] *** An error occurred in MPI_Init
[MacBook-Pro-di-Lorenzo-Dona:34122] *** reported by process [4294967295,4294967295]
[MacBook-Pro-di-Lorenzo-Dona:34122] *** on a NULL communicator
[MacBook-Pro-di-Lorenzo-Dona:34122] *** Unknown error
[MacBook-Pro-di-Lorenzo-Dona:34122] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[MacBook-Pro-di-Lorenzo-Dona:34122] ***    and potentially your MPI job)
--
An MPI process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly. You should
double check that everything has shut down cleanly.

  Reason:     Before MPI_INIT completed
  Local host: MacBook-Pro-di-Lorenzo-Dona.local
  PID:        34122
--
MacBook-Pro-di-Lorenzo-Dona:v1 lorenzodona$ export LD_LIBRARY_PATH=/Users/lorenzodona/Desktop/openmpi-1.7.1/bin/lib:$LD_LIBRARY_PATH
MacBook-Pro-di-Lorenzo-Dona:v1 lorenzodona$ mpirun -np 1 /Users/lorenzodona/Downloads/abinit-7.2.2/src/98_main/abinit
  Returned value Error (-1) instead of ORTE_SUCCESS
--
[MacBook-Pro-di-Lorenzo-Dona.local:34142] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 428
[MacBook-Pro-di-Lorenzo-Dona.local:34142] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 211
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some addition
[OMPI users] Detecting Node Failure
Hi all,

I was wondering if Open-MPI had any way to detect that a node has crashed, rebooted, etc. I am currently trying to integrate my MPI application with Amazon EC2 spot instances, and since spot instances can be terminated at any time, I would like to try to make it so that my application can detect this node failure, maybe remove the node from the machine file, and restart the application automatically.

Right now, when one of the worker nodes is rebooted or terminated, the master that is waiting on the results of that node will just hang, waiting for results that will never come.

Thanks,

Claire
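The "remove the node from the machine file and restart" step described above could live in a small wrapper outside the MPI job. The following Python sketch shows only the machine-file part. It assumes one host per line with the host name as the first token (optionally followed by settings such as `slots=4`) and `#` comments; the function name `prune_machinefile` is hypothetical, and deciding which hosts are dead is left to the caller.

```python
def prune_machinefile(text, dead_hosts):
    """Return machine-file text with every entry for a host in `dead_hosts` removed.

    Assumption: the host name is the first whitespace-separated token on a
    line. Comment lines and blank lines are kept unchanged.
    """
    dead = set(dead_hosts)
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            host = stripped.split()[0]
            if host in dead:
                continue  # drop the terminated node
        kept.append(line)
    return "".join(item + "\n" for item in kept)


# Placeholder host names, not taken from the thread.
EXAMPLE = """\
# spot instances
worker-1 slots=4
worker-2 slots=4
worker-3 slots=4
"""

if __name__ == "__main__":
    print(prune_machinefile(EXAMPLE, ["worker-2"]), end="")
```

A supervising script could call this after mpirun exits with an error, write the result back, and launch the job again with the reduced host list.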
Re: [OMPI users] Detecting Node Failure
It should detect and abort - what version are you using?

On Jun 20, 2013, at 2:02 PM, Claire Williams wrote:
> Hi all,
>
> I was wondering if Open-MPI had any way to detect that a node has crashed,
> rebooted, etc. I am currently trying to integrate my MPI application with
> Amazon EC2 spot instances, and since spot instances can be terminated at any
> time, I would like to try to make it so that my application can detect this
> node failure, maybe remove the node from the machine file, and restart the
> application automatically. Right now, when one of the worker nodes is
> rebooted or terminated, the master that is waiting on the results of that
> node will just hang, waiting for results that will never come.
>
> Thanks,
>
> Claire
Re: [OMPI users] Detecting Node Failure
Hi Ralph,

I'm using 1.4.3. Thanks

- Claire

From: Ralph Castain
To: Claire Williams; Open MPI Users
Sent: Thursday, June 20, 2013 1:59 PM
Subject: Re: [OMPI users] Detecting Node Failure

It should detect and abort - what version are you using?
Re: [OMPI users] Detecting Node Failure
Wow, that's ancient - can you upgrade to the 1.6 series?

On Jun 20, 2013, at 3:05 PM, Claire Williams wrote:
> Hi Ralph,
>
> I'm using 1.4.3. Thanks
>
> - Claire
Re: [OMPI users] Detecting Node Failure
On 14:59 Thu 20 Jun, Ralph Castain wrote:
> It should detect and abort - what version are you using?

Would it be possible to call MPI_Comm_disconnect() in the case where the communicator in question is an intercommunicator -- without having OMPI abort?

I'm asking because if we had a possibility to dynamically connect/disconnect nodes in a robust way, then we could build fault-resilient apps on top of that.

Best
-Andreas

--
==
Andreas Schäfer
HPC and Grid Computing
Chair of Computer Science 3
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
+49 9131 85-27910
PGP/GPG key via keyserver
http://www.libgeodecomp.org
==

(\___/)
(+'.'+)
(")_(")
This is Bunny. Copy and paste Bunny into your
signature to help him gain world domination!
Re: [OMPI users] Detecting Node Failure
Not at present, no.

But you might want to look at a fork of the OMPI code base that was exploring fault resilience issues:

http://fault-tolerance.org/

On Jun 20, 2013, at 5:57 PM, Andreas Schäfer wrote:
> On 14:59 Thu 20 Jun, Ralph Castain wrote:
>> It should detect and abort - what version are you using?
>
> Would it be possible to call MPI_Comm_disconnect() in the case the
> communicator in question is an intercom -- without having OMPI abort?
>
> I'm asking because if we had a possibility to dynamically
> connect/disconnect nodes in a robust way, then we could build
> fault-resilient apps on top of that.
>
> Best
> -Andreas

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] Detecting Node Failure
We will also be supporting that in the developer's trunk fairly soon, and that will appear later on in the 1.9 series.

On Thu, Jun 20, 2013 at 4:18 PM, Jeff Squyres (jsquyres) wrote:
> Not at present, no.
>
> But you might want to look at a fork of the OMPI code base that was
> exploring fault resilience issues:
>
> http://fault-tolerance.org/