[OMPI users] delimiter in appfile
Hi,

I get strange results if I use a tab instead of a space as a delimiter in an appfile. Perhaps I've missed something, but I can't remember having read that tabs are not allowed.

Tab between "2" and "-host":

-np 2 -host tyr.informatik.hs-fulda.de rank_size

tyr small_prog 144 mpiexec -app app_rank_size.openmpi_fulda
--
mpiexec was unable to launch the specified application as it could not find an executable:

Executable: tyr.informatik.hs-fulda.de
Node: tyr.informatik.hs-fulda.de

while attempting to start process rank 0.
--
2 total processes failed to start
tyr small_prog 145

Tab between "-host" and "tyr":

-np 2 -host tyr.informatik.hs-fulda.de rank_size

tyr small_prog 145 mpiexec -app app_rank_size.openmpi_fulda
--
mpiexec was unable to launch the specified application as it could not find an executable:

Executable: -o
Node: tyr.informatik.hs-fulda.de

while attempting to start process rank 0.
--
2 total processes failed to start
tyr small_prog 146

Tab before "rank_size":

-np 2 -host tyr.informatik.hs-fulda.de rank_size

tyr small_prog 147 mpiexec -app app_rank_size.openmpi_fulda
--
No executable was specified on the mpiexec command line.

Aborting.
--
tyr small_prog 148

Everything works fine if I only use spaces.

tyr small_prog 132 mpiexec -app app_rank_size.openmpi_fulda
I'm process 1 of 2 available processes running on tyr.informatik.hs-fulda.de.
I'm process 0 of 2 available processes running on tyr.informatik.hs-fulda.de.
MPI standard 2.1 is supported.
MPI standard 2.1 is supported.

Is it possible to change the behaviour so that both tab and space can be used as delimiters?

Kind regards

Siegmar
[OMPI users] problem with rankfile
Hi,

the man page for "mpiexec" shows the following:

cat myrankfile
rank 0=aa slot=1:0-2
rank 1=bb slot=0:0,1
rank 2=cc slot=1-2
mpirun -H aa,bb,cc,dd -rf myrankfile ./a.out

So that

Rank 0 runs on node aa, bound to socket 1, cores 0-2.
Rank 1 runs on node bb, bound to socket 0, cores 0 and 1.
Rank 2 runs on node cc, bound to cores 1 and 2.

Does it mean that the process with rank 0 should be bound to core 0, 1, or 2 of socket 1?

I tried to use a rankfile and have a problem. My rankfile contains the following lines.

rank 0=tyr.informatik.hs-fulda.de slot=0:0
rank 1=tyr.informatik.hs-fulda.de slot=1:0
#rank 2=rs0.informatik.hs-fulda.de slot=0:0

Everything is fine if I use the file with just my local machine (the first two lines).

tyr small_prog 115 mpiexec -report-bindings -rf my_rankfile rank_size
[tyr.informatik.hs-fulda.de:01133] [[9849,0],0] odls:default:fork binding child [[9849,1],0] to slot_list 0:0
[tyr.informatik.hs-fulda.de:01133] [[9849,0],0] odls:default:fork binding child [[9849,1],1] to slot_list 1:0
I'm process 0 of 2 available processes running on tyr.informatik.hs-fulda.de.
MPI standard 2.1 is supported.
I'm process 1 of 2 available processes running on tyr.informatik.hs-fulda.de.
MPI standard 2.1 is supported.
tyr small_prog 116

I can also change the socket number and the processes will be attached to the correct cores. Unfortunately it doesn't work if I add one other machine (third line).

tyr small_prog 112 mpiexec -report-bindings -rf my_rankfile rank_size
--
We were unable to successfully process/set the requested processor affinity settings:

Specified slot list: 0:0
Error: Cross-device link

This could mean that a non-existent processor was specified, or that the specification had improper syntax.
--
[tyr.informatik.hs-fulda.de:01520] [[10212,0],0] odls:default:fork binding child [[10212,1],0] to slot_list 0:0
[tyr.informatik.hs-fulda.de:01520] [[10212,0],0] odls:default:fork binding child [[10212,1],1] to slot_list 1:0
[rs0.informatik.hs-fulda.de:12047] [[10212,0],1] odls:default:fork binding child [[10212,1],2] to slot_list 0:0
[tyr.informatik.hs-fulda.de:01520] [[10212,0],0] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file ../../../../../openmpi-1.6/orte/mca/rml/oob/rml_oob_send.c at line 145
[tyr.informatik.hs-fulda.de:01520] [[10212,0],0] attempted to send to [[10212,1],0]: tag 20
[tyr.informatik.hs-fulda.de:01520] [[10212,0],0] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file ../../../../openmpi-1.6/orte/mca/odls/base/odls_base_default_fns.c at line 2501
--
mpiexec was unable to start the specified application as it encountered an error:

Error name: Error 0
Node: rs0.informatik.hs-fulda.de

when attempting to start process rank 2.
--
tyr small_prog 113

The other machine has two 8-core processors.

tyr small_prog 121 ssh rs0 psrinfo -v
Status of virtual processor 0 as of: 09/03/2012 19:51:15
  on-line since 07/26/2012 15:03:14.
  The sparcv9 processor operates at 2400 MHz, and has a sparcv9 floating point processor.
Status of virtual processor 1 as of: 09/03/2012 19:51:15
...
Status of virtual processor 15 as of: 09/03/2012 19:51:15
  on-line since 07/26/2012 15:03:16.
  The sparcv9 processor operates at 2400 MHz, and has a sparcv9 floating point processor.
tyr small_prog 122

Is it necessary to specify another option on the command line, or is my rankfile faulty? Thank you very much for any suggestions in advance.

Kind regards

Siegmar
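For reference, rank_size is the poster's own small test program; judging only from the output it prints in the transcripts above, it is roughly equivalent to the following hypothetical reconstruction (not the original source):

/* rank_size.c - hypothetical reconstruction of the poster's test program,
 * based only on the output it prints in the transcripts above. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, version, subversion, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);
    MPI_Get_version(&version, &subversion);

    printf("I'm process %d of %d available processes running on %s.\n",
           rank, size, host);
    printf("MPI standard %d.%d is supported.\n", version, subversion);

    MPI_Finalize();
    return 0;
}

MPI_Get_version() returning 2 and 1 matches the "MPI standard 2.1 is supported." lines shown in the transcripts.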
Re: [OMPI users] problem with rankfile
Are *all* the machines Sparc? Or just the 3rd one (rs0)?

On Sep 3, 2012, at 12:43 PM, Siegmar Gross wrote:

> Hi,
>
> the man page for "mpiexec" shows the following:
> [...]
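One way to double-check what a slot list such as 0:0 actually produced, independently of -report-bindings, is to have every rank print the cpuset it ended up bound to. Below is an illustrative sketch using the hwloc library; it assumes hwloc 1.x headers are installed and is not part of the poster's setup:

/* show_binding.c - each MPI rank prints the cpuset it is bound to.
 * Illustrative sketch; assumes hwloc 1.x is available (link with -lhwloc). */
#include <hwloc.h>
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    hwloc_topology_t topo;
    hwloc_bitmap_t set;
    char *str = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);
    set = hwloc_bitmap_alloc();

    /* Ask for the binding of the whole process, as set by mpiexec/the rankfile. */
    if (hwloc_get_cpubind(topo, set, HWLOC_CPUBIND_PROCESS) == 0) {
        hwloc_bitmap_asprintf(&str, set);
        printf("rank %d bound to cpuset %s\n", rank, str);
        free(str);
    } else {
        printf("rank %d: could not query binding\n", rank);
    }

    hwloc_bitmap_free(set);
    hwloc_topology_destroy(topo);
    MPI_Finalize();
    return 0;
}

A hypothetical build and run would be "mpicc show_binding.c -o show_binding -lhwloc", started under the same rankfile; the printed cpusets should agree with the odls binding messages.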
Re: [OMPI users] delimiter in appfile
Possible - yes. Likely to happen immediately - less so, as most of us are quite busy right now. I'll add it to the "requested feature" list, but can make no promises on if/when it might happen. It certainly won't be included in anything prior to the upcoming 1.7 series.

On Sep 3, 2012, at 12:42 PM, Siegmar Gross wrote:

> Hi,
>
> I get strange results if I use a tab instead of a space as a
> delimiter in an appfile. Perhaps I've missed something, but I
> can't remember having read that tabs are not allowed.
> [...]
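For illustration only, treating tabs and spaces interchangeably is mostly a question of the delimiter set used when an appfile line is tokenized. The sketch below is not Open MPI's actual appfile parser; split_line() is a hypothetical helper, and the sample string is the appfile line from the report above with a literal tab in it:

/* split_args.c - illustrative sketch only (not Open MPI's appfile parser):
 * split one appfile line into argv-style tokens, treating any run of
 * spaces and/or tabs as a single delimiter. */
#include <stdio.h>
#include <string.h>

#define MAX_TOKENS 64

static int split_line(char *line, char *tokens[], int max)
{
    int n = 0;
    /* " \t" makes tab and space equivalent delimiters */
    for (char *tok = strtok(line, " \t\r\n"); tok != NULL && n < max;
         tok = strtok(NULL, " \t\r\n"))
        tokens[n++] = tok;
    return n;
}

int main(void)
{
    /* Same line as in the appfile above, but with a tab between "2" and "-host". */
    char line[] = "-np 2\t-host tyr.informatik.hs-fulda.de rank_size";
    char *tokens[MAX_TOKENS];
    int n = split_line(line, tokens, MAX_TOKENS);

    for (int i = 0; i < n; i++)
        printf("argv[%d] = \"%s\"\n", i, tokens[i]);
    return 0;
}

With " \t" as the delimiter set, all three failing appfile variants above would tokenize the same way as the space-only version.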
[OMPI users] some mpi processes "disappear" on a cluster of servers
I have asked my admin and he said that no log messages were present in /var/log, apart from my login on the compute node. No killed processes, no full stack traces, and the memory is OK: 1 GB is used and 2 GB are free. Actually I don't know what kind of problem it could be. Does someone have ideas, or at least a suspect?

Please, don't leave me alone with this!

Sorry for the trouble with the mail.

2012/9/1 :
> Today's Topics:
>
>    1. Re: some mpi processes "disappear" on a cluster of servers (John Hearns)
>    2. Re: users Digest, Vol 2339, Issue 5 (Andrea Negri)
>
> Message: 1
> Date: Sat, 1 Sep 2012 08:48:56 +0100
> From: John Hearns
> Subject: Re: [OMPI users] some mpi processes "disappear" on a cluster of servers
>
> Apologies, I have not taken the time to read your comprehensive diagnostics!
>
> As Gus says, this sounds like a memory problem.
> My suspicion would be the kernel Out Of Memory (OOM) killer.
> Log into those nodes (or ask your systems manager to do this). Look
> closely at /var/log/messages, where there will be notifications when
> the OOM killer kicks in and - well - kills large memory processes!
> Grep for "killed process" in /var/log/messages.
>
> http://linux-mm.org/OOM_Killer
>
> Message: 2
> Date: Sat, 1 Sep 2012 11:50:59 +0200
> From: Andrea Negri
> Subject: Re: [OMPI users] users Digest, Vol 2339, Issue 5
>
> Hi, Gus and John,
>
> my code (zeusmp2) is an F77 code ported to F95, and the very first time
> I launched it, the processes disappeared almost immediately.
> I checked the code with valgrind and some unallocated arrays were
> passed wrongly to some subroutines.
> After having corrected this bug, the code runs for a while and then
> all the stuff described in my first post occurs.
> However, the code still performs a lot of main temporal cycles before
> it "dies" (I don't know if this piece of information is useful).
>
> Now I'm going to check the memory usage (I also have a lot of unused
> variables in this pretty large code; maybe I should comment them out).
>
> uname -a returns
> Linux cloud 2.6.9-42.0.3.ELsmp #1 SMP Thu Oct 5 16:29:37 CDT 2006 x86_64 x86_64 x86_64 GNU/Linux
>
> ulimit -a returns:
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 1024
> max locked memory       (kbytes, -l) 32
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 1024
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> stack size              (kbytes, -s) 10240
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 36864
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
>
> I can log on to the login nodes, but unfortunately the command ls /var/log returns
>
> acpid               cron.4              messages.3         secure.4
> anaconda.log        cups                messages.4         spooler
> anaconda.syslog     dmesg               mpi_uninstall.log  spooler.1
> anaconda.xlog       gdm                 ppp                spooler.2
> audit               httpd               prelink.log        spooler.3
> boot.log            itac_uninstall.log  rpmpkgs            spooler.4
> boot.log.1          lastlog             rpmpkgs.1          vbox
> boot.log.2          mail                rpmpkgs.2          wtmp
> boot.log.3          maillog             rpmpkgs.3          wtmp.1
> boot.log.4          maillog.1           rpmpkgs.4          Xorg.0.log
> cmkl_install.log    maillog.2           samba              Xorg.0.log.old
> cmkl_uninstall.log  maillog.3           scrollkeeper.log   yum.log
> cron                maillog.4           secure             yum.log.1
> cron.1              messages            secure.1
> cron.2              messages.1          secure.2
> cron.3              messages.2          secure.3
>
> so the log should be in some of these files (I don't have read
> permission on these files). I
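Since the next step mentioned above is checking the memory usage, one low-effort way to watch for growth before a possible OOM kill is to log each rank's peak resident set size. A minimal sketch, assuming a Linux node and a hypothetical helper report_memory() that the application would call periodically:

/* mem_report.c - minimal sketch for the "check the memory usage" step:
 * call report_memory() from the application at intervals (or adapt the idea)
 * to log the peak resident set size of each MPI rank before a process vanishes. */
#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>

void report_memory(int rank)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) == 0) {
        /* On Linux, ru_maxrss is the peak resident set size in kilobytes. */
        printf("rank %d (pid %ld): peak RSS %ld kB\n",
               rank, (long)getpid(), ru.ru_maxrss);
        fflush(stdout);
    }
}

int main(void)
{
    report_memory(0);   /* standalone demo; in the real code pass the MPI rank */
    return 0;
}

If the peak RSS climbs toward the node's physical memory shortly before ranks disappear, that points back to the OOM-killer theory; if it stays flat, the network explanation given later in the thread becomes more likely.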
[OMPI users] -hostfile ignored in 1.6.1 / SGE integration broken
Hi all,

I just compiled Open MPI 1.6.1 and, before digging any deeper: does anyone else notice that the command

$ mpiexec -n 4 -machinefile mymachines ./mpihello

will ignore the argument "-machinefile mymachines" and use the file "openmpi-default-hostfile" instead, all the time?

==

SGE issue

I usually don't install new versions instantly, so I only noticed right now that in 1.4.5 I get a wrong allocation inside SGE (always one process less than requested with `qsub -pe orted N ...`). I only tried this because with 1.6.1 I get:

--
There are no nodes allocated to this job.
--

all the time.

==

I configured with:

./configure --prefix=$HOME/local/... --enable-static --disable-shared --with-sge

and adjusted my PATHs accordingly (at least: I hope so).

-- Reuti
Re: [OMPI users] some mpi processes "disappear" on a cluster of servers
It looks to me like the network is losing connections - your error messages all state "no route to host", which implies that the network interface dropped out.

On Sep 3, 2012, at 1:39 PM, Andrea Negri wrote:

> I have asked my admin and he said that no log messages were present
> in /var/log, apart from my login on the compute node.
> No killed processes, no full stack traces, and the memory is OK:
> 1 GB is used and 2 GB are free.
> [...]
Re: [OMPI users] -hostfile ignored in 1.6.1 / SGE integration broken
On Sep 3, 2012, at 2:12 PM, Reuti wrote:

> Hi all,
>
> I just compiled Open MPI 1.6.1 and, before digging any deeper: does anyone else notice that the command
>
> $ mpiexec -n 4 -machinefile mymachines ./mpihello
>
> will ignore the argument "-machinefile mymachines" and use the file "openmpi-default-hostfile" instead, all the time?

Try setting "-mca orte_default_hostfile mymachines" instead.

> ==
>
> SGE issue
>
> I usually don't install new versions instantly, so I only noticed right now that in 1.4.5 I get a wrong allocation inside SGE (always one process less than requested with `qsub -pe orted N ...`). I only tried this because with 1.6.1 I get:
>
> --
> There are no nodes allocated to this job.
> --
>
> all the time.

Weird - I'm not sure I understand what you are saying. Is this happening with 1.6.1 as well? Or just with 1.4.5?

> [...]
Re: [OMPI users] MPI::Intracomm::Spawn and cluster configuration
Give the attached patch a try - this works for me, but I'd like it verified before it goes into the next 1.6 release (singleton comm_spawn is so rarely used that it can easily be overlooked for some time). Thx Ralph singleton_comm_spawn.diff Description: Binary data On Aug 31, 2012, at 3:32 PM, Brian Budge wrote: > Thanks, much appreciated. > > On Fri, Aug 31, 2012 at 2:37 PM, Ralph Castain wrote: >> I see - well, I hope to work on it this weekend and may get it fixed. If I >> do, I can provide you with a patch for the 1.6 series that you can use until >> the actual release is issued, if that helps. >> >> >> On Aug 31, 2012, at 2:33 PM, Brian Budge wrote: >> >>> Hi Ralph - >>> >>> This is true, but we may not know until well into the process whether >>> we need MPI at all. We have an SMP/NUMA mode that is designed to run >>> faster on a single machine. We also may build our application on >>> machines where there is no MPI, and we simply don't build the code >>> that runs the MPI functionality in that case. We have scripts all >>> over the place that need to start this application, and it would be >>> much easier to be able to simply run the program than to figure out >>> when or if mpirun needs to be starting the program. >>> >>> Before, we went so far as to fork and exec a full mpirun when we run >>> in clustered mode. This resulted in an additional process running, >>> and we had to use sockets to get the data to the new master process. >>> I very much like the idea of being able to have our process become the >>> MPI master instead, so I have been very excited about your work around >>> this singleton fork/exec under the hood. >>> >>> Once I get my new infrastructure designed to work with mpirun -n 1 + >>> spawn, I will try some previous openmpi versions to see if I can find >>> a version with this singleton functionality in-tact. >>> >>> Thanks again, >>> Brian >>> >>> On Thu, Aug 30, 2012 at 4:51 PM, Ralph Castain wrote: not off the top of my head. However, as noted earlier, there is absolutely no advantage to a singleton vs mpirun start - all the singleton does is immediately fork/exec "mpirun" to support the rest of the job. In both cases, you have a daemon running the job - only difference is in the number of characters the user types to start it. On Aug 30, 2012, at 8:44 AM, Brian Budge wrote: > In the event that I need to get this up-and-running soon (I do need > something working within 2 weeks), can you recommend an older version > where this is expected to work? > > Thanks, > Brian > > On Tue, Aug 28, 2012 at 4:58 PM, Brian Budge > wrote: >> Thanks! >> >> On Tue, Aug 28, 2012 at 4:57 PM, Ralph Castain wrote: >>> Yeah, I'm seeing the hang as well when running across multiple >>> machines. Let me dig a little and get this fixed. >>> >>> Thanks >>> Ralph >>> >>> On Aug 28, 2012, at 4:51 PM, Brian Budge wrote: >>> Hmmm, I went to the build directories of openmpi for my two machines, went into the orte/test/mpi directory and made the executables on both machines. I set the hostsfile in the env variable on the "master" machine. Here's the output: OMPI_MCA_orte_default_hostfile=/home/budgeb/p4/pseb/external/install/openmpi-1.6.1/orte/test/mpi/hostsfile ./simple_spawn Parent [pid 97504] starting up! 0 completed MPI_Init Parent [pid 97504] about to spawn! Parent [pid 97507] starting up! Parent [pid 97508] starting up! Parent [pid 30626] starting up! ^C zsh: interrupt OMPI_MCA_orte_default_hostfile= ./simple_spawn I had to ^C to kill the hung process. 
When I run using mpirun: OMPI_MCA_orte_default_hostfile=/home/budgeb/p4/pseb/external/install/openmpi-1.6.1/orte/test/mpi/hostsfile mpirun -np 1 ./simple_spawn Parent [pid 97511] starting up! 0 completed MPI_Init Parent [pid 97511] about to spawn! Parent [pid 97513] starting up! Parent [pid 30762] starting up! Parent [pid 30764] starting up! Parent done with spawn Parent sending message to child 1 completed MPI_Init Hello from the child 1 of 3 on host budgeb-sandybridge pid 97513 0 completed MPI_Init Hello from the child 0 of 3 on host budgeb-interlagos pid 30762 2 completed MPI_Init Hello from the child 2 of 3 on host budgeb-interlagos pid 30764 Child 1 disconnected Child 0 received msg: 38 Child 0 disconnected Parent disconnected Child 2 disconnected 97511: exiting 97513: exiting 30762: exiting 30764: exiting As you can see, I'm using openmpi v
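For context, simple_spawn is one of the test programs shipped in Open MPI's orte/test/mpi directory; its exact source is not reproduced here. A minimal program in the same spirit, showing a parent that spawns copies of itself and that can be started either as a singleton or under mpirun -np 1, might look like this (an illustrative sketch, not the actual test):

/* spawn_demo.c - minimal parent/child MPI_Comm_spawn example, in the spirit
 * of the simple_spawn test discussed above. Run as a singleton (./spawn_demo)
 * or via "mpirun -np 1 ./spawn_demo". */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm parent, intercomm;
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        /* Parent: spawn 3 copies of this same binary as children. */
        int errcodes[3];
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 3, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &intercomm, errcodes);
        printf("parent done with spawn\n");
        MPI_Comm_disconnect(&intercomm);
    } else {
        /* Child: report who we are, then disconnect from the parent. */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("child %d of %d starting up\n", rank, size);
        MPI_Comm_disconnect(&parent);
    }

    MPI_Finalize();
    return 0;
}

With the patch applied, the singleton invocation ./spawn_demo and "mpirun -np 1 ./spawn_demo" are expected to behave the same, since, as described earlier in the thread, the singleton case simply forks an mpirun under the hood.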
Re: [OMPI users] -hostfile ignored in 1.6.1 / SGE integration broken
Hi Ralph,

On 03.09.2012 at 23:34, Ralph Castain wrote:

> On Sep 3, 2012, at 2:12 PM, Reuti wrote:
>
>> I just compiled Open MPI 1.6.1 and, before digging any deeper: does anyone else notice that the command
>>
>> $ mpiexec -n 4 -machinefile mymachines ./mpihello
>>
>> will ignore the argument "-machinefile mymachines" and use the file "openmpi-default-hostfile" instead, all the time?
>
> Try setting "-mca orte_default_hostfile mymachines" instead.

Is this a known bug that will be fixed, or is this the new syntax?

>> ==
>>
>> SGE issue
>> [...]
>
> Weird - I'm not sure I understand what you are saying. Is this happening with 1.6.1 as well? Or just with 1.4.5?

1.6.1 = no nodes allocated
1.4.5 = one process less than requested
1.4.1 = works as it should

-- Reuti
Re: [OMPI users] some mpi processes "disappear" on a cluster of servers
In which ways can I check the failure of the Ethernet connections?

2012/9/3 :
> Today's Topics:
>
>    1. -hostfile ignored in 1.6.1 / SGE integration broken (Reuti)
>    2. Re: some mpi processes "disappear" on a cluster of servers (Ralph Castain)
> [...]
Re: [OMPI users] -hostfile ignored in 1.6.1 / SGE integration broken
On Sep 3, 2012, at 2:40 PM, Reuti wrote:

> Hi Ralph,
>
> On 03.09.2012 at 23:34, Ralph Castain wrote:
>
>>> I just compiled Open MPI 1.6.1 and, before digging any deeper: does anyone else notice that the command
>>>
>>> $ mpiexec -n 4 -machinefile mymachines ./mpihello
>>>
>>> will ignore the argument "-machinefile mymachines" and use the file "openmpi-default-hostfile" instead, all the time?
>>
>> Try setting "-mca orte_default_hostfile mymachines" instead.
>
> Is this a known bug that will be fixed, or is this the new syntax?

I'm leaning towards fixing it - it came up due to discussions on how to handle the hostfile when there is an allocation. For now, though, that should work.

>>> [...]
>>
>> Weird - I'm not sure I understand what you are saying. Is this happening with 1.6.1 as well? Or just with 1.4.5?
>
> 1.6.1 = no nodes allocated
> 1.4.5 = one process less than requested
> 1.4.1 = works as it should

Well that seems strange! Can you run 1.6.1 with the following on the mpirun cmd line:

-mca ras_gridengine_debug 1 -mca ras_gridengine_verbose 10 -mca ras_base_verbose 10

My guess is that something in the pe_hostfile syntax may have changed and we didn't pick up on it.
Re: [OMPI users] some mpi processes "disappear" on a cluster of servers
This is something you probably need to work on with your sys admin - it sounds like there is something unreliable in your network, and that's usually a somewhat hard thing to diagnose.

On Sep 3, 2012, at 2:49 PM, Andrea Negri wrote:

> In which ways can I check the failure of the Ethernet connections?
> [...]
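To answer the question above directly: the failures Open MPI reports as "no route to host" correspond to the EHOSTUNREACH errno, so a tiny TCP probe run repeatedly from one compute node against another can show whether the path really drops out during a run. A minimal sketch; the file name route_check.c is made up here, and it assumes the target node is listening on the chosen port (port 22, sshd, by default):

/* route_check.c - tiny TCP probe to see whether a peer node is reachable.
 * Usage: ./route_check <ipv4-address> [port]   (port 22 assumed: sshd)
 * "No route to host" from Open MPI corresponds to errno EHOSTUNREACH here. */
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <ipv4-address> [port]\n", argv[0]);
        return 1;
    }
    int port = (argc > 2) ? atoi(argv[2]) : 22;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, argv[1], &addr.sin_addr) != 1) {
        fprintf(stderr, "invalid address: %s\n", argv[1]);
        return 1;
    }

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0) {
        printf("%s:%d reachable\n", argv[1], port);
    } else {
        /* EHOSTUNREACH is the errno behind "No route to host". */
        printf("%s:%d NOT reachable: %s\n", argv[1], port, strerror(errno));
    }
    close(fd);
    return 0;
}

If the probe intermittently reports "No route to host" while the job is running, the problem sits below Open MPI (cabling, switch, driver, or an interface being reset), which is exactly the kind of issue to raise with the system administrator as suggested above.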
Re: [OMPI users] -hostfile ignored in 1.6.1 / SGE integration broken
On 04.09.2012 at 00:07, Ralph Castain wrote:

> I'm leaning towards fixing it - it came up due to discussions on how to handle
> the hostfile when there is an allocation. For now, though, that should work.

Oh, did I miss this on the list? If there is a hostfile given as an argument, it should override the default hostfile IMO.

> [...]
>
> Well that seems strange! Can you run 1.6.1 with the following on the mpirun
> cmd line:
>
> -mca ras_gridengine_debug 1 -mca ras_gridengine_verbose 10 -mca ras_base_verbose 10

[pc15381:06250] mca: base: components_open: Looking for ras components
[pc15381:06250] mca: base: components_open: opening ras components
[pc15381:06250] mca: base: components_open: found loaded component cm
[pc15381:06250] mca: base: components_open: component cm has no register function
[pc15381:06250] mca: base: components_open: component cm open function successful
[pc15381:06250] mca: base: components_open: found loaded component gridengine
[pc15381:06250] mca: base: components_open: component gridengine has no register function
[pc15381:06250] mca: base: components_open: component gridengine open function successful
[pc15381:06250] mca: base: components_open: found loaded component loadleveler
[pc15381:06250] mca: base: components_open: component loadleveler has no register function
[pc15381:06250] mca: base: components_open: component loadleveler open function successful
[pc15381:06250] mca: base: components_open: found loaded component slurm
[pc15381:06250] mca: base: components_open: component slurm has no register function
[pc15381:06250] mca: base: components_open: component slurm open function successful
[pc15381:06250] mca:base:select: Auto-selecting ras components
[pc15381:06250] mca:base:select:( ras) Querying component [cm]
[pc15381:06250] mca:base:select:( ras) Skipping component [cm]. Query failed to return a module
[pc15381:06250] mca:base:select:( ras) Querying component [gridengine]
[pc15381:06250] mca:base:select:( ras) Query of component [gridengine] set priority to 100
[pc15381:06250] mca:base:select:( ras) Querying component [loadleveler]
[pc15381:06250] mca:base:select:( ras) Skipping component [loadleveler]. Query failed to return a module
[pc15381:06250] mca:base:select:( ras) Querying component [slurm]
[pc15381:06250] mca:base:select:( ras) Skipping component [slurm]. Query failed to return a module
[pc15381:06250] mca:base:select:( ras) Selected component [gridengine]
[pc15381:06250] mca: base: close: unloading component cm
[pc15381:06250] mca: base: close: unloading component loadleveler
[pc15381:06250] mca: base: close: unloading component slurm
[pc15381:06250] ras:gridengine: JOB_ID: 4636
[pc15381:06250] ras:gridengine: PE_HOSTFILE: /var/spool/sge/pc15381/active_jobs/4636.1/pe_hostfile
[pc15381:06250] ras:gridengine: pc15381: PE_HOSTFILE shows slots=1
[pc15381:06250] ras:gridengine: pc15370: PE_HOSTFILE shows slots=2
--
There are no nodes allocated to this job.
--
[pc15381:06250] mca: base: close: component gridengine closed
[pc15381:06250] mca: base: close: unloading component gridengine

The actual hostfile contains:

pc15381 1 all.q@pc15381 UNDEFINED
pc15370 2 extra.q@pc15370 UNDEFINED
pc15381 1 extra.q@pc15381 UNDEFINED

and it was submitted with `qsub -pe orted 4 ...`.

Aha, I remember an issue on the list: if a job gets slots from several queues, they weren't added up. That was the issue in 1.4.5, OK. Wasn't it fixed later on? But here it's getting no allocation at all.

==

If I force it to get slots only from one queue:

[pc15370:30447] mca: base: components_open: Looking for ras components
[pc15370:30447] mca: base: components_open: opening ras components
[pc15370:30447] mca: base: components_open: found loaded component cm
[pc15370:30447] mca: base: components_open: component cm has no register function
[pc15370:30447] mca: base: components_open: component cm open function successful
[pc15370:30447] mca: base: components_open: found loaded component gridengine
[pc15370:30447] mca: base: components_open: component gridengine has no
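For illustration, the pe_hostfile quoted above has one line per queue in the form "host slots queue flag", so a host that serves the job from two queues (pc15381 here) appears twice and its slots have to be summed. The following standalone parser is not the Open MPI gridengine RAS code, only a sketch of the expected aggregation:

/* pe_hostfile_sum.c - illustrative only: read a Grid Engine pe_hostfile
 * ("host slots queue flag" per line) and sum slots per host, so a host that
 * appears under two queues (as pc15381 does above) gets all of its slots. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_HOSTS 256

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : getenv("PE_HOSTFILE");
    if (!path) { fprintf(stderr, "no pe_hostfile given\n"); return 1; }

    FILE *fp = fopen(path, "r");
    if (!fp) { perror(path); return 1; }

    char name[MAX_HOSTS][256];
    int slots[MAX_HOSTS], nhosts = 0, n;
    char host[256], queue[256], flag[256];

    while (fscanf(fp, "%255s %d %255s %255s", host, &n, queue, flag) == 4) {
        int i;
        for (i = 0; i < nhosts; i++)
            if (strcmp(name[i], host) == 0)
                break;                        /* host already seen */
        if (i == nhosts) {
            if (nhosts == MAX_HOSTS) break;   /* sketch: ignore overflow */
            strcpy(name[nhosts], host);
            slots[nhosts] = 0;
            nhosts++;
        }
        slots[i] += n;                        /* sum slots across queues */
    }
    fclose(fp);

    for (int i = 0; i < nhosts; i++)
        printf("%s slots=%d\n", name[i], slots[i]);
    return 0;
}

Run against the pe_hostfile above it would print pc15381 slots=2 and pc15370 slots=2, i.e. the four slots that the `qsub -pe orted 4` request actually received.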
Re: [OMPI users] -hostfile ignored in 1.6.1 / SGE integration broken
On Sep 3, 2012, at 3:50 PM, Reuti wrote:

> On 04.09.2012 at 00:07, Ralph Castain wrote:
>
>> I'm leaning towards fixing it - it came up due to discussions on how to handle
>> the hostfile when there is an allocation. For now, though, that should work.
>
> Oh, did I miss this on the list? If there is a hostfile given as an argument, it should override the default hostfile IMO.

This was several years ago now - it first showed up in the 1.5 series. Unless someone objects, I'll change it.

>> Well that seems strange! Can you run 1.6.1 with the following on the mpirun
>> cmd line:
>>
>> -mca ras_gridengine_debug 1 -mca ras_gridengine_verbose 10 -mca ras_base_verbose 10

I'll take a look at this and see what's going on - have to get back to you on it. Thx!

> [...]
Re: [OMPI users] MPI::Intracomm::Spawn and cluster configuration
Great. I'll try applying this tomorrow and I'll let you know if it works for me.

Brian

On Mon, Sep 3, 2012 at 2:36 PM, Ralph Castain wrote:

> Give the attached patch a try - this works for me, but I'd like it verified
> before it goes into the next 1.6 release (singleton comm_spawn is so rarely
> used that it can easily be overlooked for some time).
> [...]