[OMPI users] delimiter in appfile

2012-09-03 Thread Siegmar Gross
Hi,

I get strange results if I use a tab instead of a space as a
delimiter in an appfile. Perhaps I've missed something, but I
can't remember reading that tabs are not allowed.


Tab between 2 and -host.

-np 2   -host tyr.informatik.hs-fulda.de rank_size

tyr small_prog 144 mpiexec -app app_rank_size.openmpi_fulda
--
mpiexec was unable to launch the specified application as it could
  not find an executable:

Executable: tyr.informatik.hs-fulda.de
Node: tyr.informatik.hs-fulda.de

while attempting to start process rank 0.
--
2 total processes failed to start
tyr small_prog 145 



Tab between -host and tyr.

-np 2 -host   tyr.informatik.hs-fulda.de rank_size

tyr small_prog 145 mpiexec -app app_rank_size.openmpi_fulda
--
mpiexec was unable to launch the specified application as it could
  not find an executable:

Executable: -o
Node: tyr.informatik.hs-fulda.de

while attempting to start process rank 0.
--
2 total processes failed to start
tyr small_prog 146 



Tab before rank_size.

-np 2 -host tyr.informatik.hs-fulda.de  rank_size

tyr small_prog 147 mpiexec -app app_rank_size.openmpi_fulda
--
No executable was specified on the mpiexec command line.

Aborting.
--
tyr small_prog 148 



Everything works fine if I only use spaces.

tyr small_prog 132 mpiexec -app app_rank_size.openmpi_fulda
I'm process 1 of 2 available processes running on tyr.informatik.hs-fulda.de.
I'm process 0 of 2 available processes running on tyr.informatik.hs-fulda.de.
MPI standard 2.1 is supported.
MPI standard 2.1 is supported.


Is it possible to change the behaviour so that both tab and space
can be used as delimiters?
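
A possible workaround until the parser changes would be to normalize the
appfile before use - a sketch using standard tr; the output filename is
arbitrary:

# replace each tab with a space and squeeze repeated spaces
tr -s '\t' ' ' < app_rank_size.openmpi_fulda > app_rank_size.spaces
mpiexec -app app_rank_size.spaces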


Kind regards

Siegmar





[OMPI users] problem with rankfile

2012-09-03 Thread Siegmar Gross
Hi,

the man page for "mpiexec" shows the following:

 cat myrankfile
 rank 0=aa slot=1:0-2
 rank 1=bb slot=0:0,1
 rank 2=cc slot=1-2
 mpirun -H aa,bb,cc,dd -rf myrankfile ./a.out
 So that

   Rank 0 runs on node aa, bound to socket 1, cores 0-2.
   Rank 1 runs on node bb, bound to socket 0, cores 0 and 1.
   Rank 2 runs on node cc, bound to cores 1 and 2.

Does it mean that the process with rank 0 should be bound to
core 0, 1, or 2 of socket 1?

I tried to use a rankfile and have a problem. My rankfile contains
the following lines.

rank 0=tyr.informatik.hs-fulda.de slot=0:0
rank 1=tyr.informatik.hs-fulda.de slot=1:0
#rank 2=rs0.informatik.hs-fulda.de slot=0:0


Everything is fine if I use the file with just my local machine
(the first two lines).

tyr small_prog 115 mpiexec -report-bindings -rf my_rankfile rank_size
[tyr.informatik.hs-fulda.de:01133] [[9849,0],0]
  odls:default:fork binding child [[9849,1],0] to slot_list 0:0
[tyr.informatik.hs-fulda.de:01133] [[9849,0],0]
  odls:default:fork binding child [[9849,1],1] to slot_list 1:0
I'm process 0 of 2 available processes running on tyr.informatik.hs-fulda.de.
MPI standard 2.1 is supported.
I'm process 1 of 2 available processes running on tyr.informatik.hs-fulda.de.
MPI standard 2.1 is supported.
tyr small_prog 116 


I can also change the socket number and the processes will be bound
to the correct cores. Unfortunately, it doesn't work if I add another
machine (the third line).


tyr small_prog 112 mpiexec -report-bindings -rf my_rankfile rank_size
--
We were unable to successfully process/set the requested processor
affinity settings:

Specified slot list: 0:0
Error: Cross-device link

This could mean that a non-existent processor was specified, or
that the specification had improper syntax.
--
[tyr.informatik.hs-fulda.de:01520] [[10212,0],0]
  odls:default:fork binding child [[10212,1],0] to slot_list 0:0
[tyr.informatik.hs-fulda.de:01520] [[10212,0],0]
  odls:default:fork binding child [[10212,1],1] to slot_list 1:0
[rs0.informatik.hs-fulda.de:12047] [[10212,0],1]
  odls:default:fork binding child [[10212,1],2] to slot_list 0:0
[tyr.informatik.hs-fulda.de:01520] [[10212,0],0]
  ORTE_ERROR_LOG: A message is attempting to be sent to a process
  whose contact information is unknown in file
  ../../../../../openmpi-1.6/orte/mca/rml/oob/rml_oob_send.c at line 145
[tyr.informatik.hs-fulda.de:01520] [[10212,0],0] attempted to send
  to [[10212,1],0]: tag 20
[tyr.informatik.hs-fulda.de:01520] [[10212,0],0] ORTE_ERROR_LOG:
  A message is attempting to be sent to a process whose contact
  information is unknown in file
  ../../../../openmpi-1.6/orte/mca/odls/base/odls_base_default_fns.c
  at line 2501
--
mpiexec was unable to start the specified application as it
  encountered an error:

Error name: Error 0
Node: rs0.informatik.hs-fulda.de

when attempting to start process rank 2.
--
tyr small_prog 113 



The other machine has two 8-core processors.

tyr small_prog 121 ssh rs0 psrinfo -v
Status of virtual processor 0 as of: 09/03/2012 19:51:15
  on-line since 07/26/2012 15:03:14.
  The sparcv9 processor operates at 2400 MHz,
and has a sparcv9 floating point processor.
Status of virtual processor 1 as of: 09/03/2012 19:51:15
...
Status of virtual processor 15 as of: 09/03/2012 19:51:15
  on-line since 07/26/2012 15:03:16.
  The sparcv9 processor operates at 2400 MHz,
and has a sparcv9 floating point processor.
tyr small_prog 122 



Is it necessary to specify another option on the command line, or
is my rankfile faulty? Thank you very much in advance for any
suggestions.
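
To rule out a simple numbering mismatch it may also help to compare the
socket/core layout that the slot lists refer to on both machines - a sketch,
assuming hwloc's lstopo is installed (it prints a text summary when no
graphical display is available):

# show the socket and core numbering on the local and the remote machine
lstopo
ssh rs0 lstopo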


Kind regards

Siegmar




Re: [OMPI users] problem with rankfile

2012-09-03 Thread Ralph Castain
Are *all* the machines Sparc? Or just the 3rd one (rs0)?

On Sep 3, 2012, at 12:43 PM, Siegmar Gross 
 wrote:

> Hi,
> 
> the man page for "mpiexec" shows the following:
> 
> cat myrankfile
> rank 0=aa slot=1:0-2
> rank 1=bb slot=0:0,1
> rank 2=cc slot=1-2
> mpirun -H aa,bb,cc,dd -rf myrankfile ./a.out
> So that
> 
>   Rank 0 runs on node aa, bound to socket 1, cores 0-2.
>   Rank 1 runs on node bb, bound to socket 0, cores 0 and 1.
>   Rank 2 runs on node cc, bound to cores 1 and 2.
> 
> Does it mean that the process with rank 0 should be bound to
> core 0, 1, or 2 of socket 1?
> 
> I tried to use a rankfile and have a problem. My rankfile contains
> the following lines.
> 
> rank 0=tyr.informatik.hs-fulda.de slot=0:0
> rank 1=tyr.informatik.hs-fulda.de slot=1:0
> #rank 2=rs0.informatik.hs-fulda.de slot=0:0
> 
> 
> Everything is fine if I use the file with just my local machine
> (the first two lines).
> 
> tyr small_prog 115 mpiexec -report-bindings -rf my_rankfile rank_size
> [tyr.informatik.hs-fulda.de:01133] [[9849,0],0]
>  odls:default:fork binding child [[9849,1],0] to slot_list 0:0
> [tyr.informatik.hs-fulda.de:01133] [[9849,0],0]
>  odls:default:fork binding child [[9849,1],1] to slot_list 1:0
> I'm process 0 of 2 available processes running on tyr.informatik.hs-fulda.de.
> MPI standard 2.1 is supported.
> I'm process 1 of 2 available processes running on tyr.informatik.hs-fulda.de.
> MPI standard 2.1 is supported.
> tyr small_prog 116 
> 
> 
> I can also change the socket number and the processes will be attached
> to the correct cores. Unfortunately it doesn't work if I add one
> other machine (third line).
> 
> 
> tyr small_prog 112 mpiexec -report-bindings -rf my_rankfile rank_size
> --
> We were unable to successfully process/set the requested processor
> affinity settings:
> 
> Specified slot list: 0:0
> Error: Cross-device link
> 
> This could mean that a non-existent processor was specified, or
> that the specification had improper syntax.
> --
> [tyr.informatik.hs-fulda.de:01520] [[10212,0],0]
>  odls:default:fork binding child [[10212,1],0] to slot_list 0:0
> [tyr.informatik.hs-fulda.de:01520] [[10212,0],0]
>  odls:default:fork binding child [[10212,1],1] to slot_list 1:0
> [rs0.informatik.hs-fulda.de:12047] [[10212,0],1]
>  odls:default:fork binding child [[10212,1],2] to slot_list 0:0
> [tyr.informatik.hs-fulda.de:01520] [[10212,0],0]
>  ORTE_ERROR_LOG: A message is attempting to be sent to a process
>  whose contact information is unknown in file
>  ../../../../../openmpi-1.6/orte/mca/rml/oob/rml_oob_send.c at line 145
> [tyr.informatik.hs-fulda.de:01520] [[10212,0],0] attempted to send
>  to [[10212,1],0]: tag 20
> [tyr.informatik.hs-fulda.de:01520] [[10212,0],0] ORTE_ERROR_LOG:
>  A message is attempting to be sent to a process whose contact
>  information is unknown in file
>  ../../../../openmpi-1.6/orte/mca/odls/base/odls_base_default_fns.c
>  at line 2501
> --
> mpiexec was unable to start the specified application as it
>  encountered an error:
> 
> Error name: Error 0
> Node: rs0.informatik.hs-fulda.de
> 
> when attempting to start process rank 2.
> --
> tyr small_prog 113 
> 
> 
> 
> The other machine has two 8 core processors.
> 
> tyr small_prog 121 ssh rs0 psrinfo -v
> Status of virtual processor 0 as of: 09/03/2012 19:51:15
>  on-line since 07/26/2012 15:03:14.
>  The sparcv9 processor operates at 2400 MHz,
>and has a sparcv9 floating point processor.
> Status of virtual processor 1 as of: 09/03/2012 19:51:15
> ...
> Status of virtual processor 15 as of: 09/03/2012 19:51:15
>  on-line since 07/26/2012 15:03:16.
>  The sparcv9 processor operates at 2400 MHz,
>and has a sparcv9 floating point processor.
> tyr small_prog 122 
> 
> 
> 
> Is it necessary to specify another option on the command line or
> is my rankfile faulty? Thank you very much for any suggestions in
> advance.
> 
> 
> Kind regards
> 
> Siegmar
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] delimiter in appfile

2012-09-03 Thread Ralph Castain
Possible - yes. Likely to happen immediately - less so as most of us are quite 
busy right now. I'll add it to the "requested feature" list, but can make no 
promises on if/when it might happen. Certainly won't be included in anything 
prior to the upcoming 1.7 series.


On Sep 3, 2012, at 12:42 PM, Siegmar Gross 
 wrote:

> Hi,
> 
> I get strange results if I use a tab instead of a space as a
> delimiter in an appfile. Perhaps I've missed something but I
> can't remember that I read that tabs are not allowed.
> 
> 
> Tab between 2 and -host.
> 
> -np 2   -host tyr.informatik.hs-fulda.de rank_size
> 
> tyr small_prog 144 mpiexec -app app_rank_size.openmpi_fulda
> --
> mpiexec was unable to launch the specified application as it could
>  not find an executable:
> 
> Executable: tyr.informatik.hs-fulda.de
> Node: tyr.informatik.hs-fulda.de
> 
> while attempting to start process rank 0.
> --
> 2 total processes failed to start
> tyr small_prog 145 
> 
> 
> 
> Tab between -host and tyr.
> 
> -np 2 -host   tyr.informatik.hs-fulda.de rank_size
> 
> tyr small_prog 145 mpiexec -app app_rank_size.openmpi_fulda
> --
> mpiexec was unable to launch the specified application as it could
>  not find an executable:
> 
> Executable: -o
> Node: tyr.informatik.hs-fulda.de
> 
> while attempting to start process rank 0.
> --
> 2 total processes failed to start
> tyr small_prog 146 
> 
> 
> 
> Tab before rank_size.
> 
> -np 2 -host tyr.informatik.hs-fulda.de  rank_size
> 
> tyr small_prog 147 mpiexec -app app_rank_size.openmpi_fulda
> --
> No executable was specified on the mpiexec command line.
> 
> Aborting.
> --
> tyr small_prog 148 
> 
> 
> 
> Everything works fine if I only use spaces.
> 
> tyr small_prog 132 mpiexec -app app_rank_size.openmpi_fulda
> I'm process 1 of 2 available processes running on tyr.informatik.hs-fulda.de.
> I'm process 0 of 2 available processes running on tyr.informatik.hs-fulda.de.
> MPI standard 2.1 is supported.
> MPI standard 2.1 is supported.
> 
> 
> Is it possible to change the behaviour so that both tab and space
> can be used as delimiter?
> 
> 
> Kind regards
> 
> Siegmar
> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




[OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-03 Thread Andrea Negri
I have asked my admin and he said that no log messages were present
in /var/log, apart from my login on the compute node.
No killed processes, no full stack errors; the memory is ok: 1 GB is
used and 2 GB are free.
Actually I don't know what kind of problem it could be. Does someone
have ideas? Or at least a suspicion?

Please don't leave me alone with this!

Sorry for the trouble with the mail.
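
One way to check for OOM kills even without read access to
/var/log/messages is the kernel ring buffer, which is usually
world-readable - a sketch; it only works if the messages have not yet
rotated out of the buffer:

# search the kernel ring buffer for OOM-killer activity
dmesg | grep -i -e "out of memory" -e "killed process"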

2012/9/1  :
> Send users mailing list submissions to
> us...@open-mpi.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> or, via email, send a message with subject or body 'help' to
> users-requ...@open-mpi.org
>
> You can reach the person managing the list at
> users-ow...@open-mpi.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of users digest..."
>
>
> Today's Topics:
>
>1. Re: some mpi processes "disappear" on a cluster of servers
>   (John Hearns)
>2. Re: users Digest, Vol 2339, Issue 5 (Andrea Negri)
>
>
> --
>
> Message: 1
> Date: Sat, 1 Sep 2012 08:48:56 +0100
> From: John Hearns 
> Subject: Re: [OMPI users] some mpi processes "disappear" on a cluster
> of  servers
> To: Open MPI Users 
> Message-ID:
> 
> Content-Type: text/plain; charset=ISO-8859-1
>
> Apologies, I have not taken the time to read your comprehensive diagnostics!
>
> As Gus says, this sounds like a memory problem.
> My suspicion would be the kernel Out Of Memory (OOM) killer.
> Log into those nodes (or ask your systems manager to do this). Look
> closely at /var/log/messages where there will be notifications when
> the OOM Killer kicks in and - well - kills large memory processes!
> Grep for "killed process" in /var/log/messages
>
> http://linux-mm.org/OOM_Killer
>
>
> --
>
> Message: 2
> Date: Sat, 1 Sep 2012 11:50:59 +0200
> From: Andrea Negri 
> Subject: Re: [OMPI users] users Digest, Vol 2339, Issue 5
> To: us...@open-mpi.org
> Message-ID:
> 
> Content-Type: text/plain; charset=ISO-8859-1
>
> Hi, Gus and John,
>
> my code (zeusmp2) is an F77 code ported to F95, and the very first time
> I launched it, the processes disappeared almost immediately.
> I checked the code with valgrind and found that some unallocated arrays
> were wrongly passed to some subroutines.
> After correcting this bug, the code runs for a while and then all the
> things described in my first post occur.
> However, the code still gets through a lot of iterations of the main
> temporal loop before it "dies" (I don't know if this piece of
> information is useful).
>
> Now I'm going to check the memory usage (I also have a lot of unused
> variables in this pretty large code; maybe I should comment them out).
>
> uname -a returns
> Linux cloud 2.6.9-42.0.3.ELsmp #1 SMP Thu Oct 5 16:29:37 CDT 2006
> x86_64 x86_64 x86_64 GNU/Linux
>
> ulimit -a returns:
> core file size(blocks, -c) 0
> data seg size   (kbytes, -d) unlimited
> file size   (blocks, -f) unlimited
> pending signals(-i) 1024
> max locked memory (kbytes, -l) 32
> max memory size(kbytes, -m) unlimited
> open files   (-n) 1024
> pipe size(512 bytes, -p) 8
> POSIX message queues (bytes, -q) 819200
> stack size   (kbytes, -s) 10240
> cpu time (seconds, -t) unlimited
> max user processes (-u) 36864
> virtual memory   (kbytes, -v) unlimited
> file locks(-x) unlimited
>
>
> I can log on to the login nodes, but unfortunately the command ls
> /var/log/messages returns
> acpid   cron.4  messages.3 secure.4
> anaconda.logcupsmessages.4 spooler
> anaconda.syslog dmesg   mpi_uninstall.log  spooler.1
> anaconda.xlog   gdm pppspooler.2
> audit   httpd   prelink.logspooler.3
> boot.logitac_uninstall.log  rpmpkgsspooler.4
> boot.log.1  lastlog rpmpkgs.1  vbox
> boot.log.2  mailrpmpkgs.2  wtmp
> boot.log.3  maillog rpmpkgs.3  wtmp.1
> boot.log.4  maillog.1   rpmpkgs.4  Xorg.0.log
> cmkl_install.logmaillog.2   samba  Xorg.0.log.old
> cmkl_uninstall.log  maillog.3   scrollkeeper.log   yum.log
> cronmaillog.4   secure yum.log.1
> cron.1  messagessecure.1
> cron.2  messages.1  secure.2
> cron.3  messages.2  secure.3
>
> so, the log should be in some of these files (I don't have read
> permission on these files). I

[OMPI users] -hostfile ignored in 1.6.1 / SGE integration broken

2012-09-03 Thread Reuti
Hi all,

I just compiled Open MPI 1.6.1 and before digging any deeper: does anyone else
notice that the command:

$ mpiexec -n 4 -machinefile mymachines ./mpihello

will ignore the argument "-machinefile mymachines" and use the file 
"openmpi-default-hostfile" instead all the time?

==

SGE issue

I usually don't install new versions instantly, so I only noticed right now
that in 1.4.5 I get a wrong allocation inside SGE (always one process less
than requested with `qsub -pe orted N ...`). I only tried this because with
1.6.1 I get:

--
There are no nodes allocated to this job.
--

all the time.

==

I configured with:

./configure --prefix=$HOME/local/... --enable-static --disable-shared --with-sge

and adjusted my PATHs accordingly (at least: I hope so).

-- Reuti


Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-03 Thread Ralph Castain
It looks to me like the network is losing connections - your error messages all 
state "no route to host", which implies that the network interface dropped out.

On Sep 3, 2012, at 1:39 PM, Andrea Negri  wrote:

> I have asked my admin and he said that no log messages were present
> in /var/log, apart from my login on the compute node.
> No killed processes, no full stack errors; the memory is ok: 1 GB is
> used and 2 GB are free.
> Actually I don't know what kind of problem it could be. Does someone
> have ideas? Or at least a suspicion?
> 
> Please don't leave me alone with this!
> 
> Sorry for the trouble with the mail.
> 
> 2012/9/1  :
>> Send users mailing list submissions to
>>us...@open-mpi.org
>> 
>> To subscribe or unsubscribe via the World Wide Web, visit
>>http://www.open-mpi.org/mailman/listinfo.cgi/users
>> or, via email, send a message with subject or body 'help' to
>>users-requ...@open-mpi.org
>> 
>> You can reach the person managing the list at
>>users-ow...@open-mpi.org
>> 
>> When replying, please edit your Subject line so it is more specific
>> than "Re: Contents of users digest..."
>> 
>> 
>> Today's Topics:
>> 
>>   1. Re: some mpi processes "disappear" on a cluster of servers
>>  (John Hearns)
>>   2. Re: users Digest, Vol 2339, Issue 5 (Andrea Negri)
>> 
>> 
>> --
>> 
>> Message: 1
>> Date: Sat, 1 Sep 2012 08:48:56 +0100
>> From: John Hearns 
>> Subject: Re: [OMPI users] some mpi processes "disappear" on a cluster
>>of  servers
>> To: Open MPI Users 
>> Message-ID:
>>
>> Content-Type: text/plain; charset=ISO-8859-1
>> 
>> Apologies, I have not taken the time to read your comprehensive diagnostics!
>> 
>> As Gus says, this sounds like a memory problem.
>> My suspicion would be the kernel Out Of Memory (OOM) killer.
>> Log into those nodes (or ask your systems manager to do this). Look
>> closely at /var/log/messages where there will be notifications when
>> the OOM Killer kicks in and - well - kills large memory processes!
>> Grep for "killed process" in /var/log/messages
>> 
>> http://linux-mm.org/OOM_Killer
>> 
>> 
>> --
>> 
>> Message: 2
>> Date: Sat, 1 Sep 2012 11:50:59 +0200
>> From: Andrea Negri 
>> Subject: Re: [OMPI users] users Digest, Vol 2339, Issue 5
>> To: us...@open-mpi.org
>> Message-ID:
>>
>> Content-Type: text/plain; charset=ISO-8859-1
>> 
>> Hi, Gus and John,
>> 
>> my code (zeusmp2) is an F77 code ported to F95, and the very first time
>> I launched it, the processes disappeared almost immediately.
>> I checked the code with valgrind and found that some unallocated arrays
>> were wrongly passed to some subroutines.
>> After correcting this bug, the code runs for a while and then all the
>> things described in my first post occur.
>> However, the code still gets through a lot of iterations of the main
>> temporal loop before it "dies" (I don't know if this piece of
>> information is useful).
>>
>> Now I'm going to check the memory usage (I also have a lot of unused
>> variables in this pretty large code; maybe I should comment them out).
>> 
>> uname -a returns
>> Linux cloud 2.6.9-42.0.3.ELsmp #1 SMP Thu Oct 5 16:29:37 CDT 2006
>> x86_64 x86_64 x86_64 GNU/Linux
>> 
>> ulimit -a returns:
>> core file size(blocks, -c) 0
>> data seg size   (kbytes, -d) unlimited
>> file size   (blocks, -f) unlimited
>> pending signals(-i) 1024
>> max locked memory (kbytes, -l) 32
>> max memory size(kbytes, -m) unlimited
>> open files   (-n) 1024
>> pipe size(512 bytes, -p) 8
>> POSIX message queues (bytes, -q) 819200
>> stack size   (kbytes, -s) 10240
>> cpu time (seconds, -t) unlimited
>> max user processes (-u) 36864
>> virtual memory   (kbytes, -v) unlimited
>> file locks(-x) unlimited
>> 
>> 
>> I can log on to the login nodes, but unfortunately the command ls
>> /var/log/messages returns
>> acpid   cron.4  messages.3 secure.4
>> anaconda.logcupsmessages.4 spooler
>> anaconda.syslog dmesg   mpi_uninstall.log  spooler.1
>> anaconda.xlog   gdm pppspooler.2
>> audit   httpd   prelink.logspooler.3
>> boot.logitac_uninstall.log  rpmpkgsspooler.4
>> boot.log.1  lastlog rpmpkgs.1  vbox
>> boot.log.2  mailrpmpkgs.2  wtmp
>> boot.log.3  maillog rpmpkgs.3  wtmp.1
>> boot.log.4  maillog.1   rpmpkgs.4  Xorg.0.log
>> cmkl_install.logmaillog.2   samba  Xorg.0.log.old
>> cmkl_uninstall.log  maillog.3   

Re: [OMPI users] -hostfile ignored in 1.6.1 / SGE integration broken

2012-09-03 Thread Ralph Castain

On Sep 3, 2012, at 2:12 PM, Reuti  wrote:

> Hi all,
> 
> I just compiled Open MPI 1.6.1 and before digging any deeper: does anyone 
> else notice, that the command:
> 
> $ mpiexec -n 4 -machinefile mymachines ./mpihello
> 
> will ignore the argument "-machinefile mymachines" and use the file 
> "openmpi-default-hostfile" instead all the time?

Try setting "-mca orte_default_hostfile mymachines" instead.
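
The same default-hostfile setting can be given either as an MCA parameter on
the command line or through the corresponding environment variable - a sketch
reusing the names from this thread:

# pass the MCA parameter directly to mpiexec
mpiexec -mca orte_default_hostfile mymachines -n 4 ./mpihello

# or set it via the environment for a single run
OMPI_MCA_orte_default_hostfile=mymachines mpiexec -n 4 ./mpihello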

> 
> ==
> 
> SGE issue
> 
> I usually don't install new versions instantly, so I only noticed right now, 
> that in 1.4.5 I get a wrong allocation inside SGE (always one process less 
> than requested with `qsub -pe orted N ...`. This I tried only, as with 1.6.1 
> I get:
> 
> --
> There are no nodes allocated to this job.
> --
> 
> all the time.

Weird - I'm not sure I understand what you are saying. Is this happening with 
1.6.1 as well? Or just with 1.4.5?

> 
> ==
> 
> I configured with:
> 
> ./configure --prefix=$HOME/local/... --enable-static --disable-shared 
> --with-sge
> 
> and adjusted my PATHs accordingly (at least: I hope so).
> 
> -- Reuti
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] MPI::Intracomm::Spawn and cluster configuration

2012-09-03 Thread Ralph Castain
Give the attached patch a try - this works for me, but I'd like it verified 
before it goes into the next 1.6 release (singleton comm_spawn is so rarely 
used that it can easily be overlooked for some time).

Thx
Ralph



singleton_comm_spawn.diff
Description: Binary data
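
For anyone who wants to try the patch against their own build, applying it
and rebuilding would look roughly like this - a sketch; the exact -p level
depends on how the diff was generated:

# apply the patch in the top-level openmpi-1.6 source directory
cd openmpi-1.6
patch -p1 < /path/to/singleton_comm_spawn.diff
# then rebuild and reinstall as usual
make -j4 && make install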


On Aug 31, 2012, at 3:32 PM, Brian Budge  wrote:

> Thanks, much appreciated.
> 
> On Fri, Aug 31, 2012 at 2:37 PM, Ralph Castain  wrote:
>> I see - well, I hope to work on it this weekend and may get it fixed. If I 
>> do, I can provide you with a patch for the 1.6 series that you can use until 
>> the actual release is issued, if that helps.
>> 
>> 
>> On Aug 31, 2012, at 2:33 PM, Brian Budge  wrote:
>> 
>>> Hi Ralph -
>>> 
>>> This is true, but we may not know until well into the process whether
>>> we need MPI at all.  We have an SMP/NUMA mode that is designed to run
>>> faster on a single machine.  We also may build our application on
>>> machines where there is no MPI, and we simply don't build the code
>>> that runs the MPI functionality in that case.  We have scripts all
>>> over the place that need to start this application, and it would be
>>> much easier to be able to simply run the program than to figure out
>>> when or if mpirun needs to be starting the program.
>>> 
>>> Before, we went so far as to fork and exec a full mpirun when we run
>>> in clustered mode.  This resulted in an additional process running,
>>> and we had to use sockets to get the data to the new master process.
>>> I very much like the idea of being able to have our process become the
>>> MPI master instead, so I have been very excited about your work around
>>> this singleton fork/exec under the hood.
>>> 
>>> Once I get my new infrastructure designed to work with mpirun -n 1 +
>>> spawn, I will try some previous openmpi versions to see if I can find
>>> a version with this singleton functionality intact.
>>> 
>>> Thanks again,
>>> Brian
>>> 
>>> On Thu, Aug 30, 2012 at 4:51 PM, Ralph Castain  wrote:
 not off the top of my head. However, as noted earlier, there is absolutely 
 no advantage to a singleton vs mpirun start - all the singleton does is 
 immediately fork/exec "mpirun" to support the rest of the job. In both 
 cases, you have a daemon running the job - only difference is in the 
 number of characters the user types to start it.
 
 
 On Aug 30, 2012, at 8:44 AM, Brian Budge  wrote:
 
> In the event that I need to get this up-and-running soon (I do need
> something working within 2 weeks), can you recommend an older version
> where this is expected to work?
> 
> Thanks,
> Brian
> 
> On Tue, Aug 28, 2012 at 4:58 PM, Brian Budge  
> wrote:
>> Thanks!
>> 
>> On Tue, Aug 28, 2012 at 4:57 PM, Ralph Castain  wrote:
>>> Yeah, I'm seeing the hang as well when running across multiple 
>>> machines. Let me dig a little and get this fixed.
>>> 
>>> Thanks
>>> Ralph
>>> 
>>> On Aug 28, 2012, at 4:51 PM, Brian Budge  wrote:
>>> 
 Hmmm, I went to the build directories of openmpi for my two machines,
 went into the orte/test/mpi directory and made the executables on both
 machines.  I set the hostsfile in the env variable on the "master"
 machine.
 
 Here's the output:
 
 OMPI_MCA_orte_default_hostfile=/home/budgeb/p4/pseb/external/install/openmpi-1.6.1/orte/test/mpi/hostsfile
 ./simple_spawn
 Parent [pid 97504] starting up!
 0 completed MPI_Init
 Parent [pid 97504] about to spawn!
 Parent [pid 97507] starting up!
 Parent [pid 97508] starting up!
 Parent [pid 30626] starting up!
 ^C
 zsh: interrupt  OMPI_MCA_orte_default_hostfile= ./simple_spawn
 
 I had to ^C to kill the hung process.
 
 When I run using mpirun:
 
 OMPI_MCA_orte_default_hostfile=/home/budgeb/p4/pseb/external/install/openmpi-1.6.1/orte/test/mpi/hostsfile
 mpirun -np 1 ./simple_spawn
 Parent [pid 97511] starting up!
 0 completed MPI_Init
 Parent [pid 97511] about to spawn!
 Parent [pid 97513] starting up!
 Parent [pid 30762] starting up!
 Parent [pid 30764] starting up!
 Parent done with spawn
 Parent sending message to child
 1 completed MPI_Init
 Hello from the child 1 of 3 on host budgeb-sandybridge pid 97513
 0 completed MPI_Init
 Hello from the child 0 of 3 on host budgeb-interlagos pid 30762
 2 completed MPI_Init
 Hello from the child 2 of 3 on host budgeb-interlagos pid 30764
 Child 1 disconnected
 Child 0 received msg: 38
 Child 0 disconnected
 Parent disconnected
 Child 2 disconnected
 97511: exiting
 97513: exiting
 30762: exiting
 30764: exiting
 
 As you can see, I'm using openmpi v

Re: [OMPI users] -hostfile ignored in 1.6.1 / SGE integration broken

2012-09-03 Thread Reuti
Hi Ralph,

Am 03.09.2012 um 23:34 schrieb Ralph Castain:

> 
> On Sep 3, 2012, at 2:12 PM, Reuti  wrote:
> 
>> Hi all,
>> 
>> I just compiled Open MPI 1.6.1 and before digging any deeper: does anyone 
>> else notice, that the command:
>> 
>> $ mpiexec -n 4 -machinefile mymachines ./mpihello
>> 
>> will ignore the argument "-machinefile mymachines" and use the file 
>> "openmpi-default-hostfile" instead all the time?
> 
> Try setting "-mca orte_default_hostfile mymachines" instead.

Is this a known bug that will be fixed, or is this the new syntax?


>> ==
>> 
>> SGE issue
>> 
>> I usually don't install new versions instantly, so I only noticed right now, 
>> that in 1.4.5 I get a wrong allocation inside SGE (always one process less 
>> than requested with `qsub -pe orted N ...`. This I tried only, as with 1.6.1 
>> I get:
>> 
>> --
>> There are no nodes allocated to this job.
>> --
>> 
>> all the time.
> 
> Weird - I'm not sure I understand what you are saying. Is this happening with 
> 1.6.1 as well? Or just with 1.4.5?

1.6.1 = no nodes allocated
1.4.5 = one process less than requested
1.4.1 = works as it should

-- Reuti


> 
>> 
>> ==
>> 
>> I configured with:
>> 
>> ./configure --prefix=$HOME/local/... --enable-static --disable-shared 
>> --with-sge
>> 
>> and adjusted my PATHs accordingly (at least: I hope so).
>> 
>> -- Reuti
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-03 Thread Andrea Negri
In what ways can I check for failures of the ethernet connections?

2012/9/3  :
> Send users mailing list submissions to
> us...@open-mpi.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> or, via email, send a message with subject or body 'help' to
> users-requ...@open-mpi.org
>
> You can reach the person managing the list at
> users-ow...@open-mpi.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of users digest..."
>
>
> Today's Topics:
>
>1. -hostfile ignored in 1.6.1 / SGE integration broken (Reuti)
>2. Re: some mpi processes "disappear" on a cluster of servers
>   (Ralph Castain)
>
>
> --
>
> Message: 1
> Date: Mon, 3 Sep 2012 23:12:14 +0200
> From: Reuti 
> Subject: [OMPI users] -hostfile ignored in 1.6.1 / SGE integration
> broken
> To: Open MPI Users 
> Message-ID:
> 
> Content-Type: text/plain; charset=us-ascii
>
> Hi all,
>
> I just compiled Open MPI 1.6.1 and before digging any deeper: does anyone 
> else notice, that the command:
>
> $ mpiexec -n 4 -machinefile mymachines ./mpihello
>
> will ignore the argument "-machinefile mymachines" and use the file 
> "openmpi-default-hostfile" instead all the time?
>
> ==
>
> SGE issue
>
> I usually don't install new versions instantly, so I only noticed right now, 
> that in 1.4.5 I get a wrong allocation inside SGE (always one process less 
> than requested with `qsub -pe orted N ...`. This I tried only, as with 1.6.1 
> I get:
>
> --
> There are no nodes allocated to this job.
> --
>
> all the time.
>
> ==
>
> I configured with:
>
> ./configure --prefix=$HOME/local/... --enable-static --disable-shared 
> --with-sge
>
> and adjusted my PATHs accordingly (at least: I hope so).
>
> -- Reuti
>
>
> --
>
> Message: 2
> Date: Mon, 3 Sep 2012 14:32:48 -0700
> From: Ralph Castain 
> Subject: Re: [OMPI users] some mpi processes "disappear" on a cluster
> of  servers
> To: Open MPI Users 
> Message-ID: 
> Content-Type: text/plain; charset=us-ascii
>
> It looks to me like the network is losing connections - your error messages 
> all state "no route to host", which implies that the network interface 
> dropped out.
>
> On Sep 3, 2012, at 1:39 PM, Andrea Negri  wrote:
>
>> I have asked my admin and he said that no log messages were present
>> in /var/log, apart from my login on the compute node.
>> No killed processes, no full stack errors; the memory is ok: 1 GB is
>> used and 2 GB are free.
>> Actually I don't know what kind of problem it could be. Does someone
>> have ideas? Or at least a suspicion?
>>
>> Please don't leave me alone with this!
>>
>> Sorry for the trouble with the mail.
>>
>> 2012/9/1  :
>>> Send users mailing list submissions to
>>>us...@open-mpi.org
>>>
>>> To subscribe or unsubscribe via the World Wide Web, visit
>>>http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> or, via email, send a message with subject or body 'help' to
>>>users-requ...@open-mpi.org
>>>
>>> You can reach the person managing the list at
>>>users-ow...@open-mpi.org
>>>
>>> When replying, please edit your Subject line so it is more specific
>>> than "Re: Contents of users digest..."
>>>
>>>
>>> Today's Topics:
>>>
>>>   1. Re: some mpi processes "disappear" on a cluster of servers
>>>  (John Hearns)
>>>   2. Re: users Digest, Vol 2339, Issue 5 (Andrea Negri)
>>>
>>>
>>> --
>>>
>>> Message: 1
>>> Date: Sat, 1 Sep 2012 08:48:56 +0100
>>> From: John Hearns 
>>> Subject: Re: [OMPI users] some mpi processes "disappear" on a cluster
>>>of  servers
>>> To: Open MPI Users 
>>> Message-ID:
>>>
>>> Content-Type: text/plain; charset=ISO-8859-1
>>>
>>> Apologies, I have not taken the time to read your comprehensive diagnostics!
>>>
>>> As Gus says, this sounds like a memory problem.
>>> My suspicion would be the kernel Out Of Memory (OOM) killer.
>>> Log into those nodes (or ask your systems manager to do this). Look
>>> closely at /var/log/messages where there will be notifications when
>>> the OOM Killer kicks in and - well - kills large memory processes!
>>> Grep for "killed process" in /var/log/messages
>>>
>>> http://linux-mm.org/OOM_Killer
>>>
>>>
>>> --
>>>
>>> Message: 2
>>> Date: Sat, 1 Sep 2012 11:50:59 +0200
>>> From: Andrea Negri 
>>> Subject: Re: [OMPI users] users Digest, Vol 2339, Issue 5
>>> To: us...@open-mpi.org
>>> Message-ID:
>>>
>>> Content-Type: text/plain; charset=ISO-8859-1
>>>
>>> Hi, Gus and John,
>>>
>>> my code (zeusmp2) is an F77 code ported to F95, and the very first time
>>> I have

Re: [OMPI users] -hostfile ignored in 1.6.1 / SGE integration broken

2012-09-03 Thread Ralph Castain

On Sep 3, 2012, at 2:40 PM, Reuti  wrote:

> Hi Ralph,
> 
> Am 03.09.2012 um 23:34 schrieb Ralph Castain:
> 
>> 
>> On Sep 3, 2012, at 2:12 PM, Reuti  wrote:
>> 
>>> Hi all,
>>> 
>>> I just compiled Open MPI 1.6.1 and before digging any deeper: does anyone 
>>> else notice, that the command:
>>> 
>>> $ mpiexec -n 4 -machinefile mymachines ./mpihello
>>> 
>>> will ignore the argument "-machinefile mymachines" and use the file 
>>> "openmpi-default-hostfile" instead all the time?
>> 
>> Try setting "-mca orte_default_hostfile mymachines" instead.
> 
> Is this a known bug and will be fixed or is this the new syntax?

I'm leaning towards fixing it - it came due to discussions on how to handle 
hostfile when there is an allocation. For now, though, that should work.

> 
> 
>>> ==
>>> 
>>> SGE issue
>>> 
>>> I usually don't install new versions instantly, so I only noticed right 
>>> now, that in 1.4.5 I get a wrong allocation inside SGE (always one process 
>>> less than requested with `qsub -pe orted N ...`. This I tried only, as with 
>>> 1.6.1 I get:
>>> 
>>> --
>>> There are no nodes allocated to this job.
>>> --
>>> 
>>> all the time.
>> 
>> Weird - I'm not sure I understand what you are saying. Is this happening 
>> with 1.6.1 as well? Or just with 1.4.5?
> 
> 1.6.1 = no nodes allocated
> 1.4.5 = one process less than requested
> 1.4.1 = works as it should
> 

Well that seems strange! Can you run 1.6.1 with the following on the mpirun cmd 
line:

-mca ras_gridengine_debug 1 -mca ras_gridengine_verbose 10 -mca 
ras_base_verbose 10

My guess is that something in the pe_hostfile syntax may have changed and we 
didn't pick up on it.


> -- Reuti
> 
> 
>> 
>>> 
>>> ==
>>> 
>>> I configured with:
>>> 
>>> ./configure --prefix=$HOME/local/... --enable-static --disable-shared 
>>> --with-sge
>>> 
>>> and adjusted my PATHs accordingly (at least: I hope so).
>>> 
>>> -- Reuti
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-03 Thread Ralph Castain
This is something you probably need to work on with your sys admin - it sounds 
like there is something unreliable in your network, and that's usually a 
somewhat hard thing to diagnose.
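
A few basic checks run on each node may help narrow it down - a sketch; eth0
and "othernode" are placeholders for the actual interface and peer hostname:

# basic reachability between the nodes
ping -c 5 othernode
# per-interface error and drop counters
ip -s link show eth0
# link state and negotiated speed
ethtool eth0
# kernel messages about the NIC or the link going up/down
dmesg | grep -i -e eth -e link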


On Sep 3, 2012, at 2:49 PM, Andrea Negri  wrote:

> In what ways can I check for failures of the ethernet connections?
> 
> 2012/9/3  :
>> Send users mailing list submissions to
>>us...@open-mpi.org
>> 
>> To subscribe or unsubscribe via the World Wide Web, visit
>>http://www.open-mpi.org/mailman/listinfo.cgi/users
>> or, via email, send a message with subject or body 'help' to
>>users-requ...@open-mpi.org
>> 
>> You can reach the person managing the list at
>>users-ow...@open-mpi.org
>> 
>> When replying, please edit your Subject line so it is more specific
>> than "Re: Contents of users digest..."
>> 
>> 
>> Today's Topics:
>> 
>>   1. -hostfile ignored in 1.6.1 / SGE integration broken (Reuti)
>>   2. Re: some mpi processes "disappear" on a cluster of servers
>>  (Ralph Castain)
>> 
>> 
>> --
>> 
>> Message: 1
>> Date: Mon, 3 Sep 2012 23:12:14 +0200
>> From: Reuti 
>> Subject: [OMPI users] -hostfile ignored in 1.6.1 / SGE integration
>>broken
>> To: Open MPI Users 
>> Message-ID:
>>
>> Content-Type: text/plain; charset=us-ascii
>> 
>> Hi all,
>> 
>> I just compiled Open MPI 1.6.1 and before digging any deeper: does anyone 
>> else notice, that the command:
>> 
>> $ mpiexec -n 4 -machinefile mymachines ./mpihello
>> 
>> will ignore the argument "-machinefile mymachines" and use the file 
>> "openmpi-default-hostfile" instead all the time?
>> 
>> ==
>> 
>> SGE issue
>> 
>> I usually don't install new versions instantly, so I only noticed right now, 
>> that in 1.4.5 I get a wrong allocation inside SGE (always one process less 
>> than requested with `qsub -pe orted N ...`. This I tried only, as with 1.6.1 
>> I get:
>> 
>> --
>> There are no nodes allocated to this job.
>> --
>> 
>> all the time.
>> 
>> ==
>> 
>> I configured with:
>> 
>> ./configure --prefix=$HOME/local/... --enable-static --disable-shared 
>> --with-sge
>> 
>> and adjusted my PATHs accordingly (at least: I hope so).
>> 
>> -- Reuti
>> 
>> 
>> --
>> 
>> Message: 2
>> Date: Mon, 3 Sep 2012 14:32:48 -0700
>> From: Ralph Castain 
>> Subject: Re: [OMPI users] some mpi processes "disappear" on a cluster
>>of  servers
>> To: Open MPI Users 
>> Message-ID: 
>> Content-Type: text/plain; charset=us-ascii
>> 
>> It looks to me like the network is losing connections - your error messages 
>> all state "no route to host", which implies that the network interface 
>> dropped out.
>> 
>> On Sep 3, 2012, at 1:39 PM, Andrea Negri  wrote:
>> 
>>> I have asked my admin and he said that no log messages were present
>>> in /var/log, apart from my login on the compute node.
>>> No killed processes, no full stack errors; the memory is ok: 1 GB is
>>> used and 2 GB are free.
>>> Actually I don't know what kind of problem it could be. Does someone
>>> have ideas? Or at least a suspicion?
>>> 
>>> Please don't leave me alone with this!
>>> 
>>> Sorry for the trouble with the mail.
>>> 
>>> 2012/9/1  :
 Send users mailing list submissions to
   us...@open-mpi.org
 
 To subscribe or unsubscribe via the World Wide Web, visit
   http://www.open-mpi.org/mailman/listinfo.cgi/users
 or, via email, send a message with subject or body 'help' to
   users-requ...@open-mpi.org
 
 You can reach the person managing the list at
   users-ow...@open-mpi.org
 
 When replying, please edit your Subject line so it is more specific
 than "Re: Contents of users digest..."
 
 
 Today's Topics:
 
 1. Re: some mpi processes "disappear" on a cluster of servers
 (John Hearns)
  2. Re: users Digest, Vol 2339, Issue 5 (Andrea Negri)
 
 
 --
 
 Message: 1
 Date: Sat, 1 Sep 2012 08:48:56 +0100
 From: John Hearns 
 Subject: Re: [OMPI users] some mpi processes "disappear" on a cluster
   of  servers
 To: Open MPI Users 
 Message-ID:
   
 Content-Type: text/plain; charset=ISO-8859-1
 
 Apologies, I have not taken the time to read your comprehensive 
 diagnostics!
 
 As Gus says, this sounds like a memory problem.
 My suspicion would be the kernel Out Of Memory (OOM) killer.
 Log into those nodes (or ask your systems manager to do this). Look
 closely at /var/log/messages where there will be notifications when
 the OOM Killer kicks in and - well - kills large memory processes!
 Grep for "killed process" in /var/log/messages
 
 http://linux-mm.org

Re: [OMPI users] -hostfile ignored in 1.6.1 / SGE integration broken

2012-09-03 Thread Reuti
Am 04.09.2012 um 00:07 schrieb Ralph Castain:

> I'm leaning towards fixing it - it came due to discussions on how to handle 
> hostfile when there is an allocation. For now, though, that should work.

Oh, did I miss this on the list? If there is a hostfile given as argument, it 
should override the default hostfile IMO. 


>> 
>> 
 ==
 
 SGE issue
 
 I usually don't install new versions instantly, so I only noticed right 
 now, that in 1.4.5 I get a wrong allocation inside SGE (always one process 
 less than requested with `qsub -pe orted N ...`. This I tried only, as 
 with 1.6.1 I get:
 
 --
 There are no nodes allocated to this job.
 --
 
 all the time.
>>> 
>>> Weird - I'm not sure I understand what you are saying. Is this happening 
>>> with 1.6.1 as well? Or just with 1.4.5?
>> 
>> 1.6.1 = no nodes allocated
>> 1.4.5 = one process less than requested
>> 1.4.1 = works as it should
>> 
> 
> Well that seems strange! Can you run 1.6.1 with the following on the mpirun 
> cmd line:
> 
> -mca ras_gridengine_debug 1 -mca ras_gridengine_verbose 10 -mca 
> ras_base_verbose 10

[pc15381:06250] mca: base: components_open: Looking for ras components
[pc15381:06250] mca: base: components_open: opening ras components
[pc15381:06250] mca: base: components_open: found loaded component cm
[pc15381:06250] mca: base: components_open: component cm has no register 
function
[pc15381:06250] mca: base: components_open: component cm open function 
successful
[pc15381:06250] mca: base: components_open: found loaded component gridengine
[pc15381:06250] mca: base: components_open: component gridengine has no 
register function
[pc15381:06250] mca: base: components_open: component gridengine open function 
successful
[pc15381:06250] mca: base: components_open: found loaded component loadleveler
[pc15381:06250] mca: base: components_open: component loadleveler has no 
register function
[pc15381:06250] mca: base: components_open: component loadleveler open function 
successful
[pc15381:06250] mca: base: components_open: found loaded component slurm
[pc15381:06250] mca: base: components_open: component slurm has no register 
function
[pc15381:06250] mca: base: components_open: component slurm open function 
successful
[pc15381:06250] mca:base:select: Auto-selecting ras components
[pc15381:06250] mca:base:select:(  ras) Querying component [cm]
[pc15381:06250] mca:base:select:(  ras) Skipping component [cm]. Query failed 
to return a module
[pc15381:06250] mca:base:select:(  ras) Querying component [gridengine]
[pc15381:06250] mca:base:select:(  ras) Query of component [gridengine] set 
priority to 100
[pc15381:06250] mca:base:select:(  ras) Querying component [loadleveler]
[pc15381:06250] mca:base:select:(  ras) Skipping component [loadleveler]. Query 
failed to return a module
[pc15381:06250] mca:base:select:(  ras) Querying component [slurm]
[pc15381:06250] mca:base:select:(  ras) Skipping component [slurm]. Query 
failed to return a module
[pc15381:06250] mca:base:select:(  ras) Selected component [gridengine]
[pc15381:06250] mca: base: close: unloading component cm
[pc15381:06250] mca: base: close: unloading component loadleveler
[pc15381:06250] mca: base: close: unloading component slurm
[pc15381:06250] ras:gridengine: JOB_ID: 4636
[pc15381:06250] ras:gridengine: PE_HOSTFILE: 
/var/spool/sge/pc15381/active_jobs/4636.1/pe_hostfile
[pc15381:06250] ras:gridengine: pc15381: PE_HOSTFILE shows slots=1
[pc15381:06250] ras:gridengine: pc15370: PE_HOSTFILE shows slots=2
--
There are no nodes allocated to this job.
--
[pc15381:06250] mca: base: close: component gridengine closed
[pc15381:06250] mca: base: close: unloading component gridengine

The actual pe_hostfile contains:

pc15381 1 all.q@pc15381 UNDEFINED
pc15370 2 extra.q@pc15370 UNDEFINED
pc15381 1 extra.q@pc15381 UNDEFINED

and it was submitted with `qsub -pe orted 4 ...`.


Aha, I remember an issue on the list: if a job gets slots from several queues,
they weren't added up. This was the issue in 1.4.5, ok. Wasn't it fixed
later on? But here it's getting no allocation at all.
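
For reference, summing the slots per host across queues in that pe_hostfile
shows the allocation one would expect Open MPI to come up with - a sketch
using awk and the path from the output above:

# total slots per host, across all queues listed in the pe_hostfile
awk '{ slots[$1] += $2 } END { for (h in slots) print h, slots[h] }' \
    /var/spool/sge/pc15381/active_jobs/4636.1/pe_hostfile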

==

If I force it to get jobs only from one queue:

[pc15370:30447] mca: base: components_open: Looking for ras components
[pc15370:30447] mca: base: components_open: opening ras components
[pc15370:30447] mca: base: components_open: found loaded component cm
[pc15370:30447] mca: base: components_open: component cm has no register 
function
[pc15370:30447] mca: base: components_open: component cm open function 
successful
[pc15370:30447] mca: base: components_open: found loaded component gridengine
[pc15370:30447] mca: base: components_open: component gridengine has no 

Re: [OMPI users] -hostfile ignored in 1.6.1 / SGE integration broken

2012-09-03 Thread Ralph Castain

On Sep 3, 2012, at 3:50 PM, Reuti  wrote:

> Am 04.09.2012 um 00:07 schrieb Ralph Castain:
> 
>> I'm leaning towards fixing it - it came due to discussions on how to handle 
>> hostfile when there is an allocation. For now, though, that should work.
> 
> Oh, did I miss this on the list? If there is a hostfile given as argument, it 
> should override the default hostfile IMO. 

This was several years ago now - first showed up in the 1.5 series. Unless 
someone objects, I'll change it.

> 
> 
>>> 
>>> 
> ==
> 
> SGE issue
> 
> I usually don't install new versions instantly, so I only noticed right 
> now, that in 1.4.5 I get a wrong allocation inside SGE (always one 
> process less than requested with `qsub -pe orted N ...`. This I tried 
> only, as with 1.6.1 I get:
> 
> --
> There are no nodes allocated to this job.
> --
> 
> all the time.
 
 Weird - I'm not sure I understand what you are saying. Is this happening 
 with 1.6.1 as well? Or just with 1.4.5?
>>> 
>>> 1.6.1 = no nodes allocated
>>> 1.4.5 = one process less than requested
>>> 1.4.1 = works as it should
>>> 
>> 
>> Well that seems strange! Can you run 1.6.1 with the following on the mpirun 
>> cmd line:
>> 
>> -mca ras_gridengine_debug 1 -mca ras_gridengine_verbose 10 -mca 
>> ras_base_verbose 10

I'll take a look at this and see what's going on - have to get back to you on 
it.

Thx!

> 
> [pc15381:06250] mca: base: components_open: Looking for ras components
> [pc15381:06250] mca: base: components_open: opening ras components
> [pc15381:06250] mca: base: components_open: found loaded component cm
> [pc15381:06250] mca: base: components_open: component cm has no register 
> function
> [pc15381:06250] mca: base: components_open: component cm open function 
> successful
> [pc15381:06250] mca: base: components_open: found loaded component gridengine
> [pc15381:06250] mca: base: components_open: component gridengine has no 
> register function
> [pc15381:06250] mca: base: components_open: component gridengine open 
> function successful
> [pc15381:06250] mca: base: components_open: found loaded component loadleveler
> [pc15381:06250] mca: base: components_open: component loadleveler has no 
> register function
> [pc15381:06250] mca: base: components_open: component loadleveler open 
> function successful
> [pc15381:06250] mca: base: components_open: found loaded component slurm
> [pc15381:06250] mca: base: components_open: component slurm has no register 
> function
> [pc15381:06250] mca: base: components_open: component slurm open function 
> successful
> [pc15381:06250] mca:base:select: Auto-selecting ras components
> [pc15381:06250] mca:base:select:(  ras) Querying component [cm]
> [pc15381:06250] mca:base:select:(  ras) Skipping component [cm]. Query failed 
> to return a module
> [pc15381:06250] mca:base:select:(  ras) Querying component [gridengine]
> [pc15381:06250] mca:base:select:(  ras) Query of component [gridengine] set 
> priority to 100
> [pc15381:06250] mca:base:select:(  ras) Querying component [loadleveler]
> [pc15381:06250] mca:base:select:(  ras) Skipping component [loadleveler]. 
> Query failed to return a module
> [pc15381:06250] mca:base:select:(  ras) Querying component [slurm]
> [pc15381:06250] mca:base:select:(  ras) Skipping component [slurm]. Query 
> failed to return a module
> [pc15381:06250] mca:base:select:(  ras) Selected component [gridengine]
> [pc15381:06250] mca: base: close: unloading component cm
> [pc15381:06250] mca: base: close: unloading component loadleveler
> [pc15381:06250] mca: base: close: unloading component slurm
> [pc15381:06250] ras:gridengine: JOB_ID: 4636
> [pc15381:06250] ras:gridengine: PE_HOSTFILE: 
> /var/spool/sge/pc15381/active_jobs/4636.1/pe_hostfile
> [pc15381:06250] ras:gridengine: pc15381: PE_HOSTFILE shows slots=1
> [pc15381:06250] ras:gridengine: pc15370: PE_HOSTFILE shows slots=2
> --
> There are no nodes allocated to this job.
> --
> [pc15381:06250] mca: base: close: component gridengine closed
> [pc15381:06250] mca: base: close: unloading component gridengine
> 
> The actual hostfile contains:
> 
> pc15381 1 all.q@pc15381 UNDEFINED
> pc15370 2 extra.q@pc15370 UNDEFINED
> pc15381 1 extra.q@pc15381 UNDEFINED
> 
> and it was submitted with `qsub -pe orted 4 ...`.
> 
> 
> Aha, I remember an issue on the list: if a job gets slots from several
> queues, they weren't added up. This was the issue in 1.4.5, ok. Wasn't it
> fixed later on? But here it's getting no allocation at all.
> 
> ==
> 
> If I force it to get jobs only from one queue:
> 
> [pc15370:30447] mca: base: components_open: Looking for ras components
> [pc15370:30447] mca: 

Re: [OMPI users] MPI::Intracomm::Spawn and cluster configuration

2012-09-03 Thread Brian Budge
Great.  I'll try applying this tomorrow and I'll let you know if it
works for me.

  Brian

On Mon, Sep 3, 2012 at 2:36 PM, Ralph Castain  wrote:
> Give the attached patch a try - this works for me, but I'd like it verified 
> before it goes into the next 1.6 release (singleton comm_spawn is so rarely 
> used that it can easily be overlooked for some time).
>
> Thx
> Ralph
>
>
>
>
> On Aug 31, 2012, at 3:32 PM, Brian Budge  wrote:
>
>> Thanks, much appreciated.
>>
>> On Fri, Aug 31, 2012 at 2:37 PM, Ralph Castain  wrote:
>>> I see - well, I hope to work on it this weekend and may get it fixed. If I 
>>> do, I can provide you with a patch for the 1.6 series that you can use 
>>> until the actual release is issued, if that helps.
>>>
>>>
>>> On Aug 31, 2012, at 2:33 PM, Brian Budge  wrote:
>>>
 Hi Ralph -

 This is true, but we may not know until well into the process whether
 we need MPI at all.  We have an SMP/NUMA mode that is designed to run
 faster on a single machine.  We also may build our application on
 machines where there is no MPI, and we simply don't build the code
 that runs the MPI functionality in that case.  We have scripts all
 over the place that need to start this application, and it would be
 much easier to be able to simply run the program than to figure out
 when or if mpirun needs to be starting the program.

 Before, we went so far as to fork and exec a full mpirun when we run
 in clustered mode.  This resulted in an additional process running,
 and we had to use sockets to get the data to the new master process.
 I very much like the idea of being able to have our process become the
 MPI master instead, so I have been very excited about your work around
 this singleton fork/exec under the hood.

 Once I get my new infrastructure designed to work with mpirun -n 1 +
 spawn, I will try some previous openmpi versions to see if I can find
 a version with this singleton functionality intact.

 Thanks again,
 Brian

 On Thu, Aug 30, 2012 at 4:51 PM, Ralph Castain  wrote:
> not off the top of my head. However, as noted earlier, there is 
> absolutely no advantage to a singleton vs mpirun start - all the 
> singleton does is immediately fork/exec "mpirun" to support the rest of 
> the job. In both cases, you have a daemon running the job - only 
> difference is in the number of characters the user types to start it.
>
>
> On Aug 30, 2012, at 8:44 AM, Brian Budge  wrote:
>
>> In the event that I need to get this up-and-running soon (I do need
>> something working within 2 weeks), can you recommend an older version
>> where this is expected to work?
>>
>> Thanks,
>> Brian
>>
>> On Tue, Aug 28, 2012 at 4:58 PM, Brian Budge  
>> wrote:
>>> Thanks!
>>>
>>> On Tue, Aug 28, 2012 at 4:57 PM, Ralph Castain  
>>> wrote:
 Yeah, I'm seeing the hang as well when running across multiple 
 machines. Let me dig a little and get this fixed.

 Thanks
 Ralph

 On Aug 28, 2012, at 4:51 PM, Brian Budge  wrote:

> Hmmm, I went to the build directories of openmpi for my two machines,
> went into the orte/test/mpi directory and made the executables on both
> machines.  I set the hostsfile in the env variable on the "master"
> machine.
>
> Here's the output:
>
> OMPI_MCA_orte_default_hostfile=/home/budgeb/p4/pseb/external/install/openmpi-1.6.1/orte/test/mpi/hostsfile
> ./simple_spawn
> Parent [pid 97504] starting up!
> 0 completed MPI_Init
> Parent [pid 97504] about to spawn!
> Parent [pid 97507] starting up!
> Parent [pid 97508] starting up!
> Parent [pid 30626] starting up!
> ^C
> zsh: interrupt  OMPI_MCA_orte_default_hostfile= ./simple_spawn
>
> I had to ^C to kill the hung process.
>
> When I run using mpirun:
>
> OMPI_MCA_orte_default_hostfile=/home/budgeb/p4/pseb/external/install/openmpi-1.6.1/orte/test/mpi/hostsfile
> mpirun -np 1 ./simple_spawn
> Parent [pid 97511] starting up!
> 0 completed MPI_Init
> Parent [pid 97511] about to spawn!
> Parent [pid 97513] starting up!
> Parent [pid 30762] starting up!
> Parent [pid 30764] starting up!
> Parent done with spawn
> Parent sending message to child
> 1 completed MPI_Init
> Hello from the child 1 of 3 on host budgeb-sandybridge pid 97513
> 0 completed MPI_Init
> Hello from the child 0 of 3 on host budgeb-interlagos pid 30762
> 2 completed MPI_Init
> Hello from the child 2 of 3 on host budgeb-interlagos pid 30764
> Child 1 disconnected
> Child 0 received msg: 38
> Child 0 disconnected
>