Re: [OMPI users] Latest Intel Compilers (ICS, version 12.1.0.233 Build 20110811) issues ...

2012-02-01 Thread Götz Waschk
On Tue, Jan 31, 2012 at 5:20 PM, Richard Walsh
 wrote:
> in the malloc.c routine in 1.5.5.  Perhaps you should lower the optimization
> level to zero and see what you get.
Hi Richard,

thanks for the suggestion. I was able to solve the problem by
upgrading the Intel Compiler to version 12.1.2 and recompiling the
openmpi runtime with unchanged options. Now I cannot reproduce that
crash. I'll have to test some more, but I think the problem is solved.

Thanks, Götz



Re: [OMPI users] Segfault on mpirun with OpenMPI 1.4.5rc2

2012-02-01 Thread Götz Waschk
On Tue, Jan 31, 2012 at 8:19 PM, Daniel Milroy
 wrote:
> Hello,
>
> I have built OpenMPI 1.4.5rc2 with Intel 12.1 compilers in an HPC
> environment.  We are running RHEL 5, kernel 2.6.18-238 with Intel Xeon
> X5660 cpus.  You can find my build options below.  In an effort to
> test the OpenMPI build, I compiled "Hello world" with an mpi_init call
> in C and Fortran.  Mpirun of both versions on a single node results in
> a segfault.  I have attached the pertinent portion of gdb's output of
> the "Hello world" core dump.

Hi Daniel,

that looks like the problem I had with my intel build of openmpi. I
could solve it by upgrading the Intel Compiler version to 12.1.2.273:
% icc -v
icc version 12.1.2 (gcc version 4.4.5 compatibility)
% icc -V
Intel(R) C Intel(R) 64 Compiler XE for applications running on
Intel(R) 64, Version 12.1 Build 2028
Copyright (C) 1985-2011 Intel Corporation.  All rights reserved.


After a rebuild of the openmpi runtime, the crashes went away. I was
using openmpi 1.5.3, but you could still have the same problem.

Regards, Götz



Re: [OMPI users] Spawn_multiple with tight integration to SGE grid engine

2012-02-01 Thread Reuti
Am 31.01.2012 um 21:25 schrieb Ralph Castain:

> 
> On Jan 31, 2012, at 12:58 PM, Reuti wrote:
> 
>> 
>> Am 31.01.2012 um 20:38 schrieb Ralph Castain:
>> 
>>> Not sure I fully grok this thread, but will try to provide an answer.
>>> 
>>> When you start a singleton, it spawns off a daemon that is the equivalent 
>>> of "mpirun". This daemon is created for the express purpose of allowing the 
>>> singleton to use MPI dynamics like comm_spawn - without it, the singleton 
>>> would be unable to execute those functions.
>>> 
>>> The first thing the daemon does is read the local allocation, using the 
>>> same methods as used by mpirun. So whatever allocation is present that 
>>> mpirun would have read, the daemon will get. This includes hostfiles and 
>>> SGE allocations.
>> 
>> So it should also honor the default hostfile of Open MPI if running outside 
>> of SGE, i.e. from the command line?
> 
> Yes

BTW: is there any default hostfile for Open MPI - I mean any in my home 
directory or /etc? When I check `man orte_hosts` and all possible options are 
unset (like in a singleton run), it will only run locally (Job is co-located with 
mpirun).


>>> The exception to this is when the singleton gets started in an altered 
>>> environment - e.g., if SGE changes the environmental variables when 
>>> launching the singleton process. We see this in some resource managers - 
>>> you can get an allocation of N nodes, but when you launch a job, the envar 
>>> in that job only indicates the number of nodes actually running processes 
>>> in that job. In such a situation, the daemon will see the altered value as 
>>> its "allocation", potentially causing confusion.
>> 
>> Not sure whether I get it right. When I launch the same application with:
>> 
>> "mpiexec -np1 ./Mpitest" (and get an allocation of 2+2 on the two machines):
>> 
>> 27422 ?Sl 4:12 /usr/sge/bin/lx24-x86/sge_execd
>> 9504 ?S  0:00  \_ sge_shepherd-3791 -bg
>> 9506 ?Ss 0:00  \_ /bin/sh 
>> /var/spool/sge/pc15370/job_scripts/3791
>> 9507 ?S  0:00  \_ mpiexec -np 1 ./Mpitest
>> 9508 ?R  0:07  \_ ./Mpitest
>> 9509 ?Sl 0:00  \_ /usr/sge/bin/lx24-x86/qrsh 
>> -inherit -nostdin -V pc15381  orted -mca
>> 9513 ?S  0:00  \_ /home/reuti/mpitest/Mpitest --child
>> 
>> 2861 ?Sl10:47 /usr/sge/bin/lx24-x86/sge_execd
>> 25434 ?Sl 0:00  \_ sge_shepherd-3791 -bg
>> 25436 ?Ss 0:00  \_ /usr/sge/utilbin/lx24-x86/qrsh_starter 
>> /var/spool/sge/pc15381/active_jobs/3791.1/1.pc15381
>> 25444 ?S  0:00  \_ orted -mca ess env -mca 
>> orte_ess_jobid 821952512 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 
>> --hnp-uri 
>> 25447 ?S  0:01  \_ /home/reuti/mpitest/Mpitest 
>> --child
>> 25448 ?S  0:01  \_ /home/reuti/mpitest/Mpitest 
>> --child
>> 
>> This is what I expect (main + 1 child, other node gets 2 children). Now I 
>> launch the singleton instead (nothing changed besides this, still 2+2 
>> granted):
>> 
>> "./Mpitest" and get:
>> 
>> 27422 ?Sl 4:12 /usr/sge/bin/lx24-x86/sge_execd
>> 9546 ?S  0:00  \_ sge_shepherd-3793 -bg
>> 9548 ?Ss 0:00  \_ /bin/sh 
>> /var/spool/sge/pc15370/job_scripts/3793
>> 9549 ?R  0:00  \_ ./Mpitest
>> 9550 ?Ss 0:00  \_ orted --hnp --set-sid --report-uri 
>> 6 --singleton-died-pipe 7
>> 9551 ?Sl 0:00  \_ /usr/sge/bin/lx24-x86/qrsh 
>> -inherit -nostdin -V pc15381 orted
>> 9554 ?S  0:00  \_ /home/reuti/mpitest/Mpitest 
>> --child
>> 9555 ?S  0:00  \_ /home/reuti/mpitest/Mpitest 
>> --child
>> 
>> 2861 ?Sl10:47 /usr/sge/bin/lx24-x86/sge_execd
>> 25494 ?Sl 0:00  \_ sge_shepherd-3793 -bg
>> 25495 ?Ss 0:00  \_ /usr/sge/utilbin/lx24-x86/qrsh_starter 
>> /var/spool/sge/pc15381/active_jobs/3793.1/1.pc15381
>> 25502 ?S  0:00  \_ orted -mca ess env -mca 
>> orte_ess_jobid 814940160 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 
>> --hnp-uri 
>> 25503 ?S  0:00  \_ /home/reuti/mpitest/Mpitest 
>> --child
>> 
>> Only one child is going to the other node. The environment is the same in 
>> both cases. Is this the correct behavior?
> 
> 
> We probably aren't correctly marking the original singleton on that node, and 
> so the mapper thinks there are still two slots available on the original node.

Okay. There is something to discuss/fix. BTW: if started as a singleton, I get an 
error at the end with the program the OP provided:

[pc15381:25502] [[12435,0],1] routed:binomial: Connection to lifeline 
[[12435,0],0] lost

It's not the case if run by mpiexec.

-- Reuti


Re: [OMPI users] Spawn_multiple with tight integration to SGE grid engine

2012-02-01 Thread Ralph Castain

On Feb 1, 2012, at 3:49 AM, Reuti wrote:

> Am 31.01.2012 um 21:25 schrieb Ralph Castain:
> 
>> 
>> On Jan 31, 2012, at 12:58 PM, Reuti wrote:
> 
> BTW: is there any default for a hostfile for Open MPI - I mean any in my home 
> directory or /etc? When I check `man orte_hosts`, and all possible options 
> are unset (like in a singleton run), it will only run locally (Job is 
> co-located with mpirun).

Yep - it is /etc/openmpi-default-hostfile


>> We probably aren't correctly marking the original singleton on that node, 
>> and so the mapper thinks there are still two slots available on the original 
>> node.
> 
> Okay. There is something to discuss/fix. BTW: if started as singleton I get 
> an error at the end with the program the OP provided:
> 
> [pc15381:25502] [[12435,0],1] routed:binomial: Connection to lifeline 
> [[12435,0],0] lost

Okay, I'll take a look at it - but it may take awhile before I can address 
either issue as other priorities loom.

> 
> It's not the case if run by mpiexec.
> 
> -- Reuti
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] Spawn_multiple with tight integration to SGE grid engine

2012-02-01 Thread Reuti
Am 01.02.2012 um 15:38 schrieb Ralph Castain:

> On Feb 1, 2012, at 3:49 AM, Reuti wrote:
> 
>> Am 31.01.2012 um 21:25 schrieb Ralph Castain:
>> 
>>> On Jan 31, 2012, at 12:58 PM, Reuti wrote:
>> 
>> BTW: is there any default for a hostfile for Open MPI - I mean any in my 
>> home directory or /etc? When I check `man orte_hosts`, and all possible 
>> options are unset (like in a singleton run), it will only run locally (Job is 
>> co-located with mpirun).
> 
> Yep - it is /etc/openmpi-default-hostfile

Thx for replying Ralph.

I spotted it too, but it is not working for me - neither for mpiexec from the 
command line, nor for any singleton. I also tried a plain /etc as the location 
of this file.

reuti@pc15370:~> which mpicc
/home/reuti/local/openmpi-1.4.4-thread/bin/mpicc
reuti@pc15370:~> cat 
/home/reuti/local/openmpi-1.4.4-thread/etc/openmpi-default-hostfile
pc15370 slots=2
pc15381 slots=2
reuti@pc15370:~> mpicc -o mpihello mpihello.c
reuti@pc15370:~> mpiexec -np 4 ./mpihello
Hello World from Node 0.
Hello World from Node 1.
Hello World from Node 2.
Hello World from Node 3.

But all is local (no spawn here, traditional mpihello):

19503 ?Ss 0:00 /usr/sbin/sshd -o PidFile=/var/run/sshd.init.pid
11583 ?Ss 0:00  \_ sshd: reuti [priv]   
  
11585 ?S  0:00  |   \_ sshd: reuti@pts/6
  
11587 pts/6Ss 0:00  |   \_ -bash
13470 pts/6S+ 0:00  |   \_ mpiexec -np 4 ./mpihello
13471 pts/6R+ 0:00  |   \_ ./mpihello
13472 pts/6R+ 0:00  |   \_ ./mpihello
13473 pts/6R+ 0:00  |   \_ ./mpihello
13474 pts/6R+ 0:00  |   \_ ./mpihello

-- Reuti


>>> We probably aren't correctly marking the original singleton on that node, 
>>> and so the mapper thinks there are still two slots available on the 
>>> original node.
>> 
>> Okay. There is something to discuss/fix. BTW: if started as singleton I get 
>> an error at the end with the program the OP provided:
>> 
>> [pc15381:25502] [[12435,0],1] routed:binomial: Connection to lifeline 
>> [[12435,0],0] lost
> 
> Okay, I'll take a look at it - but it may take awhile before I can address 
> either issue as other priorities loom.
> 
>> 
>> It's not the case if run by mpiexec.
>> 
>> -- Reuti
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




[OMPI users] Mpirun: How to print STDOUT of just one process?

2012-02-01 Thread Frank
When running

mpirun -n 2 <program>

the STDOUT streams of both processes are combined and are displayed by
the shell. In such an interleaved format it's hard to tell what line
comes from which node.

Is there a way to have mpirun merge the STDOUT of just one process into its
STDOUT stream?

Best,
Frank

Cross-reference:
http://stackoverflow.com/questions/9098781/mpirun-how-to-print-stdout-of-just-one-process


Re: [OMPI users] Mpirun: How to print STDOUT of just one process?

2012-02-01 Thread Lloyd Brown
I don't know about using mpirun to do it, but you can actually call
mpirun on a script, and have that script individually call a single
instance of your program.  Then that script could use shell redirection
to redirect the output of the program's instance to a separate file.

I've used this technique to play with ulimit sort of things in the
script before.  I'm not entirely sure what variables are exposed to you
in the script, such that you could come up with a unique filename to
output to, though.
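
For example, a minimal sketch of such a wrapper (assuming bash, a hypothetical
script name rank-redirect.sh and output directory ./out; Open MPI exports
OMPI_COMM_WORLD_RANK to each launched process) could look like this:

#!/bin/bash
# Usage: mpirun -n 2 ./rank-redirect.sh ./your_program [args]
# Send each rank's stdout/stderr to a rank-unique file under ./out.
mkdir -p ./out
exec "$@" >  "./out/stdout.${OMPI_COMM_WORLD_RANK}" \
          2> "./out/stderr.${OMPI_COMM_WORLD_RANK}"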

Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu

On 02/01/2012 08:59 AM, Frank wrote:
> When running
> 
> mpirun -n 2 <program>
> 
> the STDOUT streams of both processes are combined and are displayed by
> the shell. In such an interleaved format it's hard to tell what line
> comes from which node.
> 
> Is there a way to have mpirun merge the STDOUT of just one process into its
> STDOUT stream?
> 
> Best,
> Frank
> 
> Cross-reference:
> http://stackoverflow.com/questions/9098781/mpirun-how-to-print-stdout-of-just-one-process
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] Mpirun: How to print STDOUT of just one process?

2012-02-01 Thread Noam Bernstein
man mpirun
. 
. 
. 
   -output-filename, --output-filename <filename>
          Redirect the stdout, stderr, and stddiag of all ranks to a
          rank-unique version of the specified filename. Any directories in
          the filename will automatically be created. Each output file will
          consist of filename.rank, where the rank will be left-filled with
          zeros for correct ordering in listings.
. 
. 
. 
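
A minimal usage sketch (with a hypothetical program ./mpihello): each rank's
output should then land in a rank-suffixed file such as /tmp/hello.0,
/tmp/hello.1, and so on.

$ mpirun -np 4 --output-filename /tmp/hello ./mpihello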




Re: [OMPI users] Mpirun: How to print STDOUT of just one process?

2012-02-01 Thread Eugene Loh

On 2/1/2012 7:59 AM, Frank wrote:

When running

mpirun -n 2 <program>

the STDOUT streams of both processes are combined and are displayed by
the shell. In such an interleaved format it's hard to tell what line
comes from which node.
As far as this part goes, there is also "mpirun --tag-output".  Check 
the mpirun man page.
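
A quick sketch of what that gives (hypothetical two-rank run of the mpihello
used elsewhere in this thread); each output line is prefixed with the job and
rank that produced it, roughly like:

$ mpirun -np 2 --tag-output ./mpihello
[1,0]<stdout>:Hello World from Node 0.
[1,1]<stdout>:Hello World from Node 1.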

Is there a way to have mpirun merge the STDOUT of just one process into its
STDOUT stream?


Re: [OMPI users] Spawn_multiple with tight integration to SGE grid engine

2012-02-01 Thread Ralph Castain
Could you add --display-allocation to your cmd line? This will tell us if it 
found/read the default hostfile, or if the problem is with the mapper.


On Feb 1, 2012, at 7:58 AM, Reuti wrote:

> Am 01.02.2012 um 15:38 schrieb Ralph Castain:
> 
>> On Feb 1, 2012, at 3:49 AM, Reuti wrote:
>> 
>>> Am 31.01.2012 um 21:25 schrieb Ralph Castain:
>>> 
 On Jan 31, 2012, at 12:58 PM, Reuti wrote:
>>> 
>>> BTW: is there any default for a hostfile for Open MPI - I mean any in my 
>>> home directory or /etc? When I check `man orte_hosts`, and all possible 
>>> options are unset (like in a singleton run), it will only run locally (Job 
>>> is co-located with mpirun).
>> 
>> Yep - it is /etc/openmpi-default-hostfile
> 
> Thx for replying Ralph.
> 
> I spotted it too, but this is not working for me. Neither for mpiexec from 
> the command line, nor any singleton. I also tried a plain /etc as location of 
> this file as well.
> 
> reuti@pc15370:~> which mpicc
> /home/reuti/local/openmpi-1.4.4-thread/bin/mpicc
> reuti@pc15370:~> cat 
> /home/reuti/local/openmpi-1.4.4-thread/etc/openmpi-default-hostfile
> pc15370 slots=2
> pc15381 slots=2
> reuti@pc15370:~> mpicc -o mpihello mpihello.c
> reuti@pc15370:~> mpiexec -np 4 ./mpihello
> Hello World from Node 0.
> Hello World from Node 1.
> Hello World from Node 2.
> Hello World from Node 3.
> 
> But all is local (no spawn here, traditional mpihello):
> 
> 19503 ?Ss 0:00 /usr/sbin/sshd -o PidFile=/var/run/sshd.init.pid
> 11583 ?Ss 0:00  \_ sshd: reuti [priv] 
> 
> 11585 ?S  0:00  |   \_ sshd: reuti@pts/6  
> 
> 11587 pts/6Ss 0:00  |   \_ -bash
> 13470 pts/6S+ 0:00  |   \_ mpiexec -np 4 ./mpihello
> 13471 pts/6R+ 0:00  |   \_ ./mpihello
> 13472 pts/6R+ 0:00  |   \_ ./mpihello
> 13473 pts/6R+ 0:00  |   \_ ./mpihello
> 13474 pts/6R+ 0:00  |   \_ ./mpihello
> 
> -- Reuti
> 
> 
 We probably aren't correctly marking the original singleton on that node, 
 and so the mapper thinks there are still two slots available on the 
 original node.
>>> 
>>> Okay. There is something to discuss/fix. BTW: if started as singleton I get 
>>> an error at the end with the program the OP provided:
>>> 
>>> [pc15381:25502] [[12435,0],1] routed:binomial: Connection to lifeline 
>>> [[12435,0],0] lost
>> 
>> Okay, I'll take a look at it - but it may take awhile before I can address 
>> either issue as other priorities loom.
>> 
>>> 
>>> It's not the case if run by mpiexec.
>>> 
>>> -- Reuti
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Mpirun: How to print STDOUT of just one process?

2012-02-01 Thread Paul Kapinos

Try out the attached wrapper:
$ mpiexec -np 2 masterstdout <program>


mpirun -n 2 <program>



Is there a way to have mpirun merge the STDOUT of just one process into its
STDOUT stream?





--
Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
RWTH Aachen University, Center for Computing and Communication
Seffenter Weg 23,  D 52074  Aachen (Germany)
Tel: +49 241/80-24915
#!/bin/bash
# Run the given command; only rank 0's stdout/stderr reaches the terminal.
# OMPI_COMM_WORLD_RANK is set by Open MPI for each launched process.
if [[ $OMPI_COMM_WORLD_RANK == 0 ]]
then
  "$@"
else
  "$@" >/dev/null 2>&1
fi




Re: [OMPI users] Spawn_multiple with tight integration to SGE grid engine

2012-02-01 Thread Reuti
Am 01.02.2012 um 17:16 schrieb Ralph Castain:

> Could you add --display-allocation to your cmd line? This will tell us if it 
> found/read the default hostfile, or if the problem is with the mapper.

Sure:

reuti@pc15370:~> mpiexec --display-allocation -np 4 ./mpihello

==   ALLOCATED NODES   ==

 Data for node: Name: pc15370   Num slots: 1    Max slots: 0

=
Hello World from Node 0.
Hello World from Node 1.
Hello World from Node 2.
Hello World from Node 3.

(Nothing in `strace` about accessing something with "default")


reuti@pc15370:~> mpiexec --default-hostfile 
local/openmpi-1.4.4-thread/etc/openmpi-default-hostfile --display-allocation 
-np 4 ./mpihello

==   ALLOCATED NODES   ==

 Data for node: Name: pc15370   Num slots: 2    Max slots: 0
 Data for node: Name: pc15381   Num slots: 2    Max slots: 0

=
Hello World from Node 0.
Hello World from Node 3.
Hello World from Node 2.
Hello World from Node 1.

Specifying it works fine with correct distribution in `ps`.
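
(Presumably the same thing could also be set via the corresponding MCA
parameter instead of the command-line flag, e.g. in the environment or in
etc/openmpi-mca-params.conf - untested here:

export OMPI_MCA_orte_default_hostfile=$HOME/local/openmpi-1.4.4-thread/etc/openmpi-default-hostfile
)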

-- Reuti


> On Feb 1, 2012, at 7:58 AM, Reuti wrote:
> 
>> Am 01.02.2012 um 15:38 schrieb Ralph Castain:
>> 
>>> On Feb 1, 2012, at 3:49 AM, Reuti wrote:
>>> 
 Am 31.01.2012 um 21:25 schrieb Ralph Castain:
 
> On Jan 31, 2012, at 12:58 PM, Reuti wrote:
 
 BTW: is there any default for a hostfile for Open MPI - I mean any in my 
 home directory or /etc? When I check `man orte_hosts`, and all possible 
 options are unset (like in a singleton run), it will only run locally (Job 
 is co-located with mpirun).
>>> 
>>> Yep - it is /etc/openmpi-default-hostfile
>> 
>> Thx for replying Ralph.
>> 
>> I spotted it too, but this is not working for me. Neither for mpiexec from 
>> the command line, nor any singleton. I also tried a plain /etc as location 
>> of this file as well.
>> 
>> reuti@pc15370:~> which mpicc
>> /home/reuti/local/openmpi-1.4.4-thread/bin/mpicc
>> reuti@pc15370:~> cat 
>> /home/reuti/local/openmpi-1.4.4-thread/etc/openmpi-default-hostfile
>> pc15370 slots=2
>> pc15381 slots=2
>> reuti@pc15370:~> mpicc -o mpihello mpihello.c
>> reuti@pc15370:~> mpiexec -np 4 ./mpihello
>> Hello World from Node 0.
>> Hello World from Node 1.
>> Hello World from Node 2.
>> Hello World from Node 3.
>> 
>> But all is local (no spawn here, traditional mpihello):
>> 
>> 19503 ?Ss 0:00 /usr/sbin/sshd -o PidFile=/var/run/sshd.init.pid
>> 11583 ?Ss 0:00  \_ sshd: reuti [priv]
>>  
>> 11585 ?S  0:00  |   \_ sshd: reuti@pts/6 
>>  
>> 11587 pts/6Ss 0:00  |   \_ -bash
>> 13470 pts/6S+ 0:00  |   \_ mpiexec -np 4 ./mpihello
>> 13471 pts/6R+ 0:00  |   \_ ./mpihello
>> 13472 pts/6R+ 0:00  |   \_ ./mpihello
>> 13473 pts/6R+ 0:00  |   \_ ./mpihello
>> 13474 pts/6R+ 0:00  |   \_ ./mpihello
>> 
>> -- Reuti
>> 
>> 
> We probably aren't correctly marking the original singleton on that node, 
> and so the mapper thinks there are still two slots available on the 
> original node.
 
 Okay. There is something to discuss/fix. BTW: if started as singleton I 
 get an error at the end with the program the OP provided:
 
 [pc15381:25502] [[12435,0],1] routed:binomial: Connection to lifeline 
 [[12435,0],0] lost
>>> 
>>> Okay, I'll take a look at it - but it may take awhile before I can address 
>>> either issue as other priorities loom.
>>> 
 
 It's not the case if run by mpiexec.
 
 -- Reuti
 ___
 users mailing list
 us...@open-mpi.org
 http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Spawn_multiple with tight integration to SGE grid engine

2012-02-01 Thread Ralph Castain
Ah - crud. Looks like the default-hostfile mca param isn't getting set to the 
default value. Will resolve - thanks!

On Feb 1, 2012, at 9:28 AM, Reuti wrote:

> Am 01.02.2012 um 17:16 schrieb Ralph Castain:
> 
>> Could you add --display-allocation to your cmd line? This will tell us if it 
>> found/read the default hostfile, or if the problem is with the mapper.
> 
> Sure:
> 
> reuti@pc15370:~> mpiexec --display-allocation -np 4 ./mpihello
> 
> ==   ALLOCATED NODES   ==
> 
> Data for node: Name: pc15370  Num slots: 1    Max slots: 0
> 
> =
> Hello World from Node 0.
> Hello World from Node 1.
> Hello World from Node 2.
> Hello World from Node 3.
> 
> (Nothing in `strace` about accessing something with "default")
> 
> 
> reuti@pc15370:~> mpiexec --default-hostfile 
> local/openmpi-1.4.4-thread/etc/openmpi-default-hostfile --display-allocation 
> -np 4 ./mpihello
> 
> ==   ALLOCATED NODES   ==
> 
> Data for node: Name: pc15370  Num slots: 2    Max slots: 0
> Data for node: Name: pc15381  Num slots: 2    Max slots: 0
> 
> =
> Hello World from Node 0.
> Hello World from Node 3.
> Hello World from Node 2.
> Hello World from Node 1.
> 
> Specifying it works fine with correct distribution in `ps`.
> 
> -- Reuti
> 
> 
>> On Feb 1, 2012, at 7:58 AM, Reuti wrote:
>> 
>>> Am 01.02.2012 um 15:38 schrieb Ralph Castain:
>>> 
 On Feb 1, 2012, at 3:49 AM, Reuti wrote:
 
> Am 31.01.2012 um 21:25 schrieb Ralph Castain:
> 
>> On Jan 31, 2012, at 12:58 PM, Reuti wrote:
> 
> BTW: is there any default for a hostfile for Open MPI - I mean any in my 
> home directory or /etc? When I check `man orte_hosts`, and all possible 
> options are unset (like in a singleton run), it will only run locally (Job 
> is co-located with mpirun).
 
 Yep - it is /etc/openmpi-default-hostfile
>>> 
>>> Thx for replying Ralph.
>>> 
>>> I spotted it too, but this is not working for me. Neither for mpiexec from 
>>> the command line, nor any singleton. I also tried a plain /etc as location 
>>> of this file as well.
>>> 
>>> reuti@pc15370:~> which mpicc
>>> /home/reuti/local/openmpi-1.4.4-thread/bin/mpicc
>>> reuti@pc15370:~> cat 
>>> /home/reuti/local/openmpi-1.4.4-thread/etc/openmpi-default-hostfile
>>> pc15370 slots=2
>>> pc15381 slots=2
>>> reuti@pc15370:~> mpicc -o mpihello mpihello.c
>>> reuti@pc15370:~> mpiexec -np 4 ./mpihello
>>> Hello World from Node 0.
>>> Hello World from Node 1.
>>> Hello World from Node 2.
>>> Hello World from Node 3.
>>> 
>>> But all is local (no spawn here, traditional mpihello):
>>> 
>>> 19503 ?Ss 0:00 /usr/sbin/sshd -o PidFile=/var/run/sshd.init.pid
>>> 11583 ?Ss 0:00  \_ sshd: reuti [priv]   
>>>   
>>> 11585 ?S  0:00  |   \_ sshd: reuti@pts/6
>>>   
>>> 11587 pts/6Ss 0:00  |   \_ -bash
>>> 13470 pts/6S+ 0:00  |   \_ mpiexec -np 4 ./mpihello
>>> 13471 pts/6R+ 0:00  |   \_ ./mpihello
>>> 13472 pts/6R+ 0:00  |   \_ ./mpihello
>>> 13473 pts/6R+ 0:00  |   \_ ./mpihello
>>> 13474 pts/6R+ 0:00  |   \_ ./mpihello
>>> 
>>> -- Reuti
>>> 
>>> 
>> We probably aren't correctly marking the original singleton on that 
>> node, and so the mapper thinks there are still two slots available on 
>> the original node.
> 
> Okay. There is something to discuss/fix. BTW: if started as singleton I 
> get an error at the end with the program the OP provided:
> 
> [pc15381:25502] [[12435,0],1] routed:binomial: Connection to lifeline 
> [[12435,0],0] lost
 
 Okay, I'll take a look at it - but it may take awhile before I can address 
 either issue as other priorities loom.
 
> 
> It's not the case if run by mpiexec.
> 
> -- Reuti
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
 
 ___
 users mailing list
 us...@open-mpi.org
 http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Mpirun: How to print STDOUT of just one process?

2012-02-01 Thread Frank
Great, that works!! Many Thanks!

On Wed, Feb 1, 2012 at 4:17 PM, Paul Kapinos  wrote:
> Try out the attached wrapper:
> $ mpiexec -np 2 masterstdout <program>
>
>> mpirun -n 2 <program>
>
>
>> Is there a way to have mpirun merge the STDOUT of just one process into its
>> STDOUT stream?
>
>
>
>
>
> --
> Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
> RWTH Aachen University, Center for Computing and Communication
> Seffenter Weg 23,  D 52074  Aachen (Germany)
> Tel: +49 241/80-24915
>
> #!/bin/bash
> # Run the given command; only rank 0's stdout/stderr reaches the terminal.
> # OMPI_COMM_WORLD_RANK is set by Open MPI for each launched process.
> if [[ $OMPI_COMM_WORLD_RANK == 0 ]]
> then
>  "$@"
> else
>  "$@" >/dev/null 2>&1
> fi
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] Segfault on mpirun with OpenMPI 1.4.5rc2

2012-02-01 Thread Daniel Milroy
Hi Jeff,

Pending further testing, your suggestion seems to have fixed the
issue.  Thank you very much.
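
For reference, the rebuild with the suggested option added to my original
configure line (hypothetical prefix for the rc4 tarball) looks roughly like
this:

        ./configure --prefix=/openmpi-1.4.5rc4_intel-12.1 \
            --with-tm=/torque-2.5.8/ --enable-shared --enable-static \
            --without-psm --without-memory-manager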


Dan Milroy


2012/1/31 Jeff Squyres :
> We have heard reports of failures with the Intel 12.1 compilers.
>
> Can you try with rc4 (that was literally just released) with the 
> --without-memory-manager configure option?
>
>
> On Jan 31, 2012, at 2:19 PM, Daniel Milroy wrote:
>
>> Hello,
>>
>> I have built OpenMPI 1.4.5rc2 with Intel 12.1 compilers in an HPC
>> environment.  We are running RHEL 5, kernel 2.6.18-238 with Intel Xeon
>> X5660 cpus.  You can find my build options below.  In an effort to
>> test the OpenMPI build, I compiled "Hello world" with an mpi_init call
>> in C and Fortran.  Mpirun of both versions on a single node results in
>> a segfault.  I have attached the pertinent portion of gdb's output of
>> the "Hello world" core dump.  Submitting a parallel "Hello world" job
>> to torque results in segfaults across the respective nodes.  However,
>> if I execute mpirun of C or Fortran "Hello world" following a segfault
>> the program will exit successfully.  Additionally, if I strace mpirun
>> on either a single node or on multiple nodes in parallel "Hello world"
>> runs successfully.  I am unsure how to proceed- any help would be
>> greatly appreciated.
>>
>>
>> Thank you in advance,
>>
>> Dan Milroy
>>
>>
>> Build options:
>>
>>        source /ics_2012.0.032/composer_xe_2011_sp1.6.233/bin/iccvars.sh 
>> intel64
>>        source /ics_2012.0.032/composer_xe_2011_sp1.6.233/bin/ifortvars.sh
>> intel64
>>        export CC=/ics_2012.0.032/composer_xe_2011_sp1.6.233/bin/intel64/icc
>>        export CXX=/ics_2012.0.032/composer_xe_2011_sp1.6.233/bin/intel64/icpc
>>        export 
>> F77=/ics_2012.0.032/composer_xe_2011_sp1.6.233/bin/intel64/ifort
>>        export 
>> F90=/ics_2012.0.032/composer_xe_2011_sp1.6.233/bin/intel64/ifort
>>        export FC=/ics_2012.0.032/composer_xe_2011_sp1.6.233/bin/intel64/ifort
>>        ./configure --prefix=/openmpi-1.4.5rc2_intel-12.1
>> --with-tm=/torque-2.5.8/ --enable-shared --enable-static --without-psm
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] Segfault on mpirun with OpenMPI 1.4.5rc2

2012-02-01 Thread Daniel Milroy
Hi Götz,

I don't know whether we can implement your suggestion; it is dependent
on the terms of our license with Intel.  I will take this under
advisement.  Thank you very much.


Dan Milroy


2012/2/1 Götz Waschk :
> On Tue, Jan 31, 2012 at 8:19 PM, Daniel Milroy
>  wrote:
>> Hello,
>>
>> I have built OpenMPI 1.4.5rc2 with Intel 12.1 compilers in an HPC
>> environment.  We are running RHEL 5, kernel 2.6.18-238 with Intel Xeon
>> X5660 cpus.  You can find my build options below.  In an effort to
>> test the OpenMPI build, I compiled "Hello world" with an mpi_init call
>> in C and Fortran.  Mpirun of both versions on a single node results in
>> a segfault.  I have attached the pertinent portion of gdb's output of
>> the "Hello world" core dump.
>
> Hi Daniel,
>
> that looks like the problem I had with my intel build of openmpi. I
> could solve it by upgrading the Intel Compiler version to 12.1.2.273:
> % icc -v
> icc version 12.1.2 (gcc version 4.4.5 compatibility)
> % icc -V
> Intel(R) C Intel(R) 64 Compiler XE for applications running on
> Intel(R) 64, Version 12.1 Build 2028
> Copyright (C) 1985-2011 Intel Corporation.  All rights reserved.
>
>
> After a rebuild of the openmpi runtime, the crashes went away. I was
> using openmpi 1.5.3, but you could still have the same problem.
>
> Regards, Götz
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] Latest Intel Compilers (ICS, version 12.1.0.233 Build 20110811) issues ...

2012-02-01 Thread Jeff Squyres
I think we need to add something to the FAQ so that it's googleable "The Intel 
12.1 Linux compilers before 12.1.2 are busted.  Upgrade to at least 12.1.2, and 
OMPI should compile and work fine."


On Feb 1, 2012, at 3:34 AM, Götz Waschk wrote:

> On Tue, Jan 31, 2012 at 5:20 PM, Richard Walsh
>  wrote:
>> in the malloc.c routine in 1.5.5.  Perhaps you should lower the optimization
>> level to zero and see what you get.
> Hi Richard,
> 
> thanks for the suggestion. I was able to solve the problem by
> upgrading the Intel Compiler to version 12.1.2 and recompiling the
> openmpi runtime with unchanged options. Now I cannot reproduce that
> crash. I'll have to test some more, but I think the problem is solved.
> 
> Thanks, Götz
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Mpirun: How to print STDOUT of just one process?

2012-02-01 Thread Gustavo Correa
Hi Frank, Lloyd

If all you want is to sort out which process the output is coming from,
you can use the "-tag-output" switch to the [OpenMPI] mpirun.
Check it out with 'man mpirun'.

I hope this helps,
Gus Correa

On Feb 1, 2012, at 11:04 AM, Lloyd Brown wrote:

> I don't know about using mpirun to do it, but you can actually call
> mpirun on a script, and have that script individually call a single
> instance of your program.  Then that script could use shell redirection
> to redirect the output of the program's instance to a separate file.
> 
> I've used this technique to play with ulimit sort of things in the
> script before.  I'm not entirely sure what variables are exposed to you
> in the script, such that you could come up with a unique filename to
> output to, though.
> 
> Lloyd Brown
> Systems Administrator
> Fulton Supercomputing Lab
> Brigham Young University
> http://marylou.byu.edu
> 
> On 02/01/2012 08:59 AM, Frank wrote:
>> When running
>> 
>> mpirun -n 2 <program>
>> 
>> the STDOUT streams of both processes are combined and are displayed by
>> the shell. In such an interleaved format it's hard to tell what line
>> comes from which node.
>> 
>> Is there a way to have mpirun merge the STDOUT of just one process into its
>> STDOUT stream?
>> 
>> Best,
>> Frank
>> 
>> Cross-reference:
>> http://stackoverflow.com/questions/9098781/mpirun-how-to-print-stdout-of-just-one-process
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Latest Intel Compilers (ICS, version 12.1.0.233 Build 20110811) issues ...

2012-02-01 Thread Jeff Squyres
I just added it:

http://www.open-mpi.org/faq/?category=troubleshooting#intel-12.1-compiler


On Feb 1, 2012, at 12:41 PM, Jeff Squyres wrote:

> I think we need to add something to the FAQ so that it's googleable "The 
> Intel 12.1 Linux compilers before 12.1.2 are busted.  Upgrade to at least 
> 12.1.2, and OMPI should compile and work fine."
> 
> 
> On Feb 1, 2012, at 3:34 AM, Götz Waschk wrote:
> 
>> On Tue, Jan 31, 2012 at 5:20 PM, Richard Walsh
>>  wrote:
>>> in the malloc.c routine in 1.5.5.  Perhaps you should lower the optimization
>>> level to zero and see what you get.
>> Hi Richard,
>> 
>> thanks for the suggestion. I was able to solve the problem by
>> upgrading the Intel Compiler to version 12.1.2 and recompiling the
>> openmpi runtime with unchanged options. Now I cannot reproduce that
>> crash. I'll have to test some more, but I think the problem is solved.
>> 
>> Thanks, Götz
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Spawn_multiple with tight integration to SGE grid engine

2012-02-01 Thread Ralph Castain
FWIW: I have fixed this on the developer's trunk, and Jeff has scheduled it for 
the upcoming 1.6 release (when the 1.5 series rolls over). I don't 
expect we'll backport it to 1.4 unless someone really needs it there.

Thanks!
Ralph

On Feb 1, 2012, at 9:31 AM, Ralph Castain wrote:

> Ah - crud. Looks like the default-hostfile mca param isn't getting set to the 
> default value. Will resolve - thanks!
> 
> On Feb 1, 2012, at 9:28 AM, Reuti wrote:
> 
>> Am 01.02.2012 um 17:16 schrieb Ralph Castain:
>> 
>>> Could you add --display-allocation to your cmd line? This will tell us if 
>>> it found/read the default hostfile, or if the problem is with the mapper.
>> 
>> Sure:
>> 
>> reuti@pc15370:~> mpiexec --display-allocation -np 4 ./mpihello
>> 
>> ==   ALLOCATED NODES   ==
>> 
>> Data for node: Name: pc15370 Num slots: 1    Max slots: 0
>> 
>> =
>> Hello World from Node 0.
>> Hello World from Node 1.
>> Hello World from Node 2.
>> Hello World from Node 3.
>> 
>> (Nothing in `strace` about accessing something with "default")
>> 
>> 
>> reuti@pc15370:~> mpiexec --default-hostfile 
>> local/openmpi-1.4.4-thread/etc/openmpi-default-hostfile --display-allocation 
>> -np 4 ./mpihello
>> 
>> ==   ALLOCATED NODES   ==
>> 
>> Data for node: Name: pc15370 Num slots: 2    Max slots: 0
>> Data for node: Name: pc15381 Num slots: 2    Max slots: 0
>> 
>> =
>> Hello World from Node 0.
>> Hello World from Node 3.
>> Hello World from Node 2.
>> Hello World from Node 1.
>> 
>> Specifying it works fine with correct distribution in `ps`.
>> 
>> -- Reuti
>> 
>> 
>>> On Feb 1, 2012, at 7:58 AM, Reuti wrote:
>>> 
 Am 01.02.2012 um 15:38 schrieb Ralph Castain:
 
> On Feb 1, 2012, at 3:49 AM, Reuti wrote:
> 
>> Am 31.01.2012 um 21:25 schrieb Ralph Castain:
>> 
>>> On Jan 31, 2012, at 12:58 PM, Reuti wrote:
>> 
>> BTW: is there any default for a hostfile for Open MPI - I mean any in my 
>> home directory or /etc? When I check `man orte_hosts`, and all possible 
>> options are unset (like in a singleton run), it will only run locally 
>> (Job is co-located with mpirun).
> 
> Yep - it is /etc/openmpi-default-hostfile
 
 Thx for replying Ralph.
 
 I spotted it too, but this is not working for me. Neither for mpiexec from 
 the command line, nor any singleton. I also tried a plain /etc as location 
 of this file as well.
 
 reuti@pc15370:~> which mpicc
 /home/reuti/local/openmpi-1.4.4-thread/bin/mpicc
 reuti@pc15370:~> cat 
 /home/reuti/local/openmpi-1.4.4-thread/etc/openmpi-default-hostfile
 pc15370 slots=2
 pc15381 slots=2
 reuti@pc15370:~> mpicc -o mpihello mpihello.c
 reuti@pc15370:~> mpiexec -np 4 ./mpihello
 Hello World from Node 0.
 Hello World from Node 1.
 Hello World from Node 2.
 Hello World from Node 3.
 
 But all is local (no spawn here, traditional mpihello):
 
 19503 ?Ss 0:00 /usr/sbin/sshd -o PidFile=/var/run/sshd.init.pid
 11583 ?Ss 0:00  \_ sshd: reuti [priv]  

 11585 ?S  0:00  |   \_ sshd: reuti@pts/6   

 11587 pts/6Ss 0:00  |   \_ -bash
 13470 pts/6S+ 0:00  |   \_ mpiexec -np 4 ./mpihello
 13471 pts/6R+ 0:00  |   \_ ./mpihello
 13472 pts/6R+ 0:00  |   \_ ./mpihello
 13473 pts/6R+ 0:00  |   \_ ./mpihello
 13474 pts/6R+ 0:00  |   \_ ./mpihello
 
 -- Reuti
 
 
>>> We probably aren't correctly marking the original singleton on that 
>>> node, and so the mapper thinks there are still two slots available on 
>>> the original node.
>> 
>> Okay. There is something to discuss/fix. BTW: if started as singleton I 
>> get an error at the end with the program the OP provided:
>> 
>> [pc15381:25502] [[12435,0],1] routed:binomial: Connection to lifeline 
>> [[12435,0],0] lost
> 
> Okay, I'll take a look at it - but it may take awhile before I can 
> address either issue as other priorities loom.
> 
>> 
>> It's not the case if run by mpiexec.
>> 
>> -- Reuti
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
 
 
 ___
 users mailing list
 us...@open-mpi.org
 http://www.open-mpi.org/mailman/listinfo.cgi/users
>>

Re: [OMPI users] OpenMPI / SLURM -> Send/Recv blocking

2012-02-01 Thread Jeff Squyres
On Jan 31, 2012, at 11:16 AM, adrian sabou wrote:

> Like I said, a very simple program.
> When launching this application with SLURM (using "salloc -N2 mpirun 
> ./<program>"), it hangs at the barrier.

Are you able to run the MPI example programs in examples/ ?
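
E.g., something along these lines (the examples/ directory in the Open MPI
tarball ships hello_c.c and ring_c.c with a Makefile; the salloc invocation
below mirrors yours):

  cd examples
  make
  salloc -N2 mpirun ./ring_c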

> However, it passes the barrier if I launch it without SLURM (using "mpirun 
> -np 2 ./<program>"). I first noticed this problem when my application hung 
> if I tried to send two successive messages from a process to another. Only 
> the first MPI_Send would work. The second MPI_Send would block indefinitely. 
> I was wondering whether any of you have encountered a similar problem, or may 
> have an idea as to what is causing the Send/Receive pair to block when using 
> SLURM. The exact output in my console is as follows:
>  
> salloc: Granted job allocation 1138
> Process 0 - Sending...
> Process 1 - Receiving...
> Process 1 - Received.
> Process 1 - Barrier reached.
> Process 0 - Sent.
> Process 0 - Barrier reached.
> (it just hangs here)
>  
> I am new to MPI programming and to OpenMPI and would greatly appreciate any 
> help. My OpenMPI version is 1.4.4 (although I have also tried it on 1.5.4), 
> my SLURM version is 0.3.3-1 (slurm-llnl 2.1.0-1),

I'm not sure what SLURM version that is -- my "srun --version" shows 2.2.4.  
0.3.3 would be pretty ancient, no?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/