Re: [OMPI users] Unknown overhead in "mpirun -am ft-enable-cr"

2011-02-09 Thread Nguyen Toan
Hi Josh,
Thanks for the reply. I did not use the '--enable-ft-thread' option. Here are
my build options:

CFLAGS=-g \
./configure \
--with-ft=cr \
--enable-mpi-threads \
--with-blcr=/home/nguyen/opt/blcr \
--with-blcr-libdir=/home/nguyen/opt/blcr/lib \
--prefix=/home/nguyen/opt/openmpi \
--with-openib \
--enable-mpirun-prefix-by-default

My application requires lots of communication in every loop, focusing on
MPI_Isend, MPI_Irecv and MPI_Wait. Also, for my purposes I want to take only
one checkpoint per application execution, but the unknown overhead exists
even when no checkpoint is taken.
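
(For reference, a single checkpoint of a job started under "-am ft-enable-cr"
can be taken from another terminal with the ompi-checkpoint tool; a sketch,
assuming mpirun's PID is 1234:

    shell$ ompi-checkpoint 1234

)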

Do you have any other ideas?

Regards,
Nguyen Toan


On Wed, Feb 9, 2011 at 12:41 AM, Joshua Hursey wrote:

> There are a few reasons why this might be occurring. Did you build with the
> '--enable-ft-thread' option?
>
> If so, it looks like I didn't move over the thread_sleep_wait adjustment
> from the trunk - the thread was being a bit too aggressive. Try adding the
> following to your command line options, and see if it changes the
> performance.
>  "-mca opal_cr_thread_sleep_wait 1000"
>
> There are other places to look as well depending on how frequently your
> application communicates, how often you checkpoint, process layout, ... But
> usually the aggressive nature of the thread is the main problem.
>
> Let me know if that helps.
>
> -- Josh
>
> On Feb 8, 2011, at 2:50 AM, Nguyen Toan wrote:
>
> > Hi all,
> >
> > I am using the latest version of OpenMPI (1.5.1) and BLCR (0.8.2).
> > I found that when running an application, which uses MPI_Isend, MPI_Irecv
> > and MPI_Wait,
> > enabling C/R, i.e. using "-am ft-enable-cr", the application runtime is
> > much longer than the normal execution with mpirun (no checkpoint was taken).
> > This overhead becomes larger when the normal execution runtime is longer.
> > Does anybody have any idea about this overhead, and how to eliminate it?
> > Thanks.
> >
> > Regards,
> > Nguyen
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> 
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Unknown overhead in "mpirun -am ft-enable-cr"

2011-02-09 Thread Joshua Hursey
It looks like the logic in the configure script is turning on the FT thread for 
you when you specify both '--with-ft=cr' and '--enable-mpi-threads'. 

Can you send me the output of 'ompi_info'? Can you also try the MCA parameter 
that I mentioned earlier to see if that changes the performance?
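
A quick way to check whether the FT thread was built in is the C/R line in
the ompi_info output; a sketch (the exact wording varies by version):

    shell$ ompi_info | grep -i "FT Checkpoint"
       FT Checkpoint support: yes (checkpoint thread: yes)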

If there are many non-blocking sends and receives, there might be a
performance bug with the way the point-to-point wrapper is tracking request
objects. If the above MCA parameter does not help the situation, let me know
and I might be able to take a look at this next week.
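
For concreteness, the suggested flag is just added to the usual C/R
invocation; a sketch (the application name is hypothetical):

    shell$ mpirun -am ft-enable-cr -mca opal_cr_thread_sleep_wait 1000 -np 4 ./app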

Thanks,
Josh

On Feb 9, 2011, at 1:40 AM, Nguyen Toan wrote:

> Hi Josh,
> Thanks for the reply. I did not use the '--enable-ft-thread' option. Here are 
> my build options:
> 
> CFLAGS=-g \
> ./configure \
> --with-ft=cr \
> --enable-mpi-threads \
> --with-blcr=/home/nguyen/opt/blcr \
> --with-blcr-libdir=/home/nguyen/opt/blcr/lib \
> --prefix=/home/nguyen/opt/openmpi \
> --with-openib \
> --enable-mpirun-prefix-by-default
> 
> My application requires lots of communication in every loop, focusing on 
> MPI_Isend, MPI_Irecv and MPI_Wait. Also, for my purposes I want to take only 
> one checkpoint per application execution, but the unknown overhead exists 
> even when no checkpoint is taken.
> 
> Do you have any other ideas?
> 
> Regards,
> Nguyen Toan
> 
> 
> On Wed, Feb 9, 2011 at 12:41 AM, Joshua Hursey  wrote:
> There are a few reasons why this might be occurring. Did you build with the 
> '--enable-ft-thread' option?
> 
> If so, it looks like I didn't move over the thread_sleep_wait adjustment from 
> the trunk - the thread was being a bit too aggressive. Try adding the 
> following to your command line options, and see if it changes the performance.
>  "-mca opal_cr_thread_sleep_wait 1000"
> 
> There are other places to look as well depending on how frequently your 
> application communicates, how often you checkpoint, process layout, ... But 
> usually the aggressive nature of the thread is the main problem.
> 
> Let me know if that helps.
> 
> -- Josh
> 
> On Feb 8, 2011, at 2:50 AM, Nguyen Toan wrote:
> 
> > Hi all,
> >
> > I am using the latest version of OpenMPI (1.5.1) and BLCR (0.8.2).
> > I found that when running an application, which uses MPI_Isend, MPI_Irecv 
> > and MPI_Wait,
> > enabling C/R, i.e. using "-am ft-enable-cr", the application runtime is much 
> > longer than the normal execution with mpirun (no checkpoint was taken).
> > This overhead becomes larger when the normal execution runtime is longer.
> > Does anybody have any idea about this overhead, and how to eliminate it?
> > Thanks.
> >
> > Regards,
> > Nguyen
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey




Re: [OMPI users] Segmentation fault with SLURM and non-local nodes

2011-02-09 Thread Samuel K. Gutierrez

On Feb 8, 2011, at 8:21 PM, Ralph Castain wrote:

I would personally suggest not reconfiguring your system simply to  
support a particular version of OMPI. The only difference between  
the 1.4 and 1.5 series wrt slurm is that we changed a few things to  
support a more recent version of slurm. It is relatively easy to  
backport that code to the 1.4 series, and it should be (mostly)  
backward compatible.


OMPI is agnostic wrt resource managers. We try to support all  
platforms, with our effort reflective of the needs of our developers  
and their organizations, and our perception of the relative size of  
the user community for a particular platform. Slurm is a fairly  
small community, mostly centered in the three DOE weapons labs, so  
our support for that platform tends to focus on their usage.


So, with that understanding...

Sam: can you confirm that 1.5.1 works on your TLCC machines?


Open MPI 1.5.1 works as expected on our TLCC machines.  Open MPI 1.4.3
with your SLURM update has also been tested.




I have created a ticket to upgrade the 1.4.4 release (due out any  
time now) with the 1.5.1 slurm support. Any interested parties can  
follow it here:


Thanks Ralph!

Sam



https://svn.open-mpi.org/trac/ompi/ticket/2717

Ralph


On Feb 8, 2011, at 6:23 PM, Michael Curtis wrote:



On 09/02/2011, at 9:16 AM, Ralph Castain wrote:


See below


On Feb 8, 2011, at 2:44 PM, Michael Curtis wrote:



On 09/02/2011, at 2:17 AM, Samuel K. Gutierrez wrote:


Hi Michael,

You may have tried to send some debug information to the list,  
but it appears to have been blocked.  Compressed text output of  
the backtrace text is sufficient.



Odd, I thought I sent it to you directly.  In any case, here is  
the backtrace and some information from gdb:


$ salloc -n16 gdb -args mpirun mpi
(gdb) run
Starting program: /mnt/f1/michael/openmpi/bin/mpirun /mnt/f1/michael/home/ServerAdmin/mpi

[Thread debugging using libthread_db enabled]

Program received signal SIGSEGV, Segmentation fault.
0x77b76869 in process_orted_launch_report (fd=-1, opal_event=1, data=0x681170) at base/plm_base_launch_support.c:342
342	pdatorted[mev->sender.vpid]->state = ORTE_PROC_STATE_RUNNING;

(gdb) bt
#0  0x77b76869 in process_orted_launch_report (fd=-1, opal_event=1, data=0x681170) at base/plm_base_launch_support.c:342
#1  0x778a7338 in event_process_active (base=0x615240) at event.c:651
#2  0x778a797e in opal_event_base_loop (base=0x615240, flags=1) at event.c:823
#3  0x778a756f in opal_event_loop (flags=1) at event.c:730
#4  0x7789b916 in opal_progress () at runtime/opal_progress.c:189
#5  0x77b76e20 in orte_plm_base_daemon_callback (num_daemons=2) at base/plm_base_launch_support.c:459
#6  0x77b7bed7 in plm_slurm_launch_job (jdata=0x610560) at plm_slurm_module.c:360
#7  0x00403f46 in orterun (argc=2, argv=0x7fffe7d8) at orterun.c:754
#8  0x00402fb4 in main (argc=2, argv=0x7fffe7d8) at main.c:13

(gdb) print pdatorted
$1 = (orte_proc_t **) 0x67c610
(gdb) print mev
$2 = (orte_message_event_t *) 0x681550
(gdb) print mev->sender.vpid
$3 = 4294967295
(gdb) print mev->sender
$4 = {jobid = 1721696256, vpid = 4294967295}
(gdb) print *mev
$5 = {super = {obj_magic_id = 16046253926196952813, obj_class = 0x77dd4f40, obj_reference_count = 1, cls_init_file_name = 0x77bb9a78 "base/plm_base_launch_support.c", cls_init_lineno = 423}, ev = 0x680850, sender = {jobid = 1721696256, vpid = 4294967295}, buffer = 0x6811b0, tag = 10, file = 0x680640 "rml_oob_component.c", line = 279}


The jobid and vpid look like the defined INVALID values,  
indicating that something is quite wrong. This would quite likely  
lead to the segfault.
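
(The printed vpid, 4294967295, is UINT32_MAX, the conventional "unset/invalid"
sentinel; a minimal C sketch of that reading, assuming the vpid type is a
32-bit unsigned integer as the gdb output suggests:

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        /* (uint32_t)-1 == UINT32_MAX == 4294967295, the value gdb
           printed for mev->sender.vpid above */
        uint32_t invalid_vpid = UINT32_MAX;
        printf("%u\n", (unsigned) invalid_vpid);   /* prints 4294967295 */
        return 0;
    }

)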


From this, it would indeed appear that you are getting some kind  
of library confusion - the most likely cause of such an error is  
a daemon from a different version trying to respond, and so the  
returned message isn't correct.


Not sure why else it would be happening... you could try setting
-mca plm_base_verbose 5 to get more debug output displayed on your
screen, assuming you built OMPI with --enable-debug.
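
For example (command shape only; the application name is hypothetical):

    shell$ mpirun -mca plm_base_verbose 5 -np 16 ./my_mpi_app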




Found the problem. It is a site configuration issue, which I'll
need to find a workaround for.


[bio-ipc.{FQDN}:27523] mca:base:select:(  plm) Query of component [slurm] set priority to 75
[bio-ipc.{FQDN}:27523] mca:base:select:(  plm) Selected component [slurm]
[bio-ipc.{FQDN}:27523] mca: base: close: component rsh closed
[bio-ipc.{FQDN}:27523] mca: base: close: unloading component rsh
[bio-ipc.{FQDN}:27523] plm:base:set_hnp_name: initial bias 27523 nodename hash 1936089714
[bio-ipc.{FQDN}:27523] plm:base:set_hnp_name: final jobfam 31383
[bio-ipc.{FQDN}:27523] [[31383,0],0] plm:base:receive start comm
[bio-ipc.{FQDN}:27523] [[31383,0],0] plm:slurm: launching job [31383,1]
[bio-ipc.{FQDN}:27523] [[31383,0],0] plm:base:setup_job for job [31383,1]
[bio-ipc.{FQDN}:27523] [

[OMPI users] Mpirun --app option not working

2011-02-09 Thread Sindhi, Waris PW
Hi,
I am having trouble using the --app option with OpenMPI's mpirun
command. The MPI processes launched with the --app option get launched
on the Linux node that the mpirun command is executed on.

The same MPI executable works when specified on the command line using
the -np option.

Please let me know what I am doing wrong?

Bad launch :

head-node % /usr/lib64/openmpi/1.4-gcc/bin/mpirun --host
node1,node1,node2,node2 --app appfile 
head-node :Hello world from 0
head-node :Hello world from 3
head-node :Hello world from 1
head-node :Hello world from 2

Good launch :

head-node % /usr/lib64/openmpi/1.4-gcc/bin/mpirun --host
node1,node1,node2,node2 -np 4 mpiinit
node1 :Hello world from 0
node2 :Hello world from 2
node2 :Hello world from 3
node1 :Hello world from 1

head-node % cat appfile
-np 1 /home/user461/OPENMPI/mpiinit
-np 1 /home/user461/OPENMPI/mpiinit
-np 1 /home/user461/OPENMPI/mpiinit
-np 1 /home/user461/OPENMPI/mpiinit

head-node % cat mpiinit.c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv)
{
int rc, me;
char pname[MPI_MAX_PROCESSOR_NAME];
int plen;

MPI_Init(
   &argc,
   &argv
);

rc = MPI_Comm_rank(
MPI_COMM_WORLD,
&me
);

if (rc != MPI_SUCCESS)
{
   return rc;
}

MPI_Get_processor_name(
   pname,
   &plen
);

printf("%s:Hello world from %d\n", pname, me);

MPI_Finalize();

return 0;
}


head-node % /usr/lib64/openmpi/1.4-gcc/bin/ompi_info
 Package: Open MPI mockbu...@x86-004.build.bos.redhat.com Distribution
Open MPI: 1.4
   Open MPI SVN revision: r22285
   Open MPI release date: Dec 08, 2009
Open RTE: 1.4
   Open RTE SVN revision: r22285
   Open RTE release date: Dec 08, 2009
OPAL: 1.4
   OPAL SVN revision: r22285
   OPAL release date: Dec 08, 2009
Ident string: 1.4
  Prefix: /usr/lib64/openmpi/1.4-gcc
 Configured architecture: x86_64-unknown-linux-gnu
  Configure host: x86-004.build.bos.redhat.com
   Configured by: mockbuild
   Configured on: Tue Feb 23 12:39:24 EST 2010
  Configure host: x86-004.build.bos.redhat.com
Built by: mockbuild
Built on: Tue Feb 23 12:41:54 EST 2010
  Built host: x86-004.build.bos.redhat.com
  C bindings: yes
C++ bindings: yes
  Fortran77 bindings: yes (all)
  Fortran90 bindings: yes
 Fortran90 bindings size: small
  C compiler: gcc
 C compiler absolute: /usr/bin/gcc
C++ compiler: g++
   C++ compiler absolute: /usr/bin/g++
  Fortran77 compiler: gfortran
  Fortran77 compiler abs: /usr/bin/gfortran
  Fortran90 compiler: gfortran
  Fortran90 compiler abs: /usr/bin/gfortran
 C profiling: yes
   C++ profiling: yes
 Fortran77 profiling: yes
 Fortran90 profiling: yes
  C++ exceptions: no
  Thread support: posix (mpi: no, progress: no)
   Sparse Groups: no
  Internal debug support: no
 MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
 libltdl support: yes
   Heterogeneous support: no
 mpirun default --prefix: yes
 MPI I/O support: yes
   MPI_WTIME support: gettimeofday
Symbol visibility support: yes
   FT Checkpoint support: no  (checkpoint thread: no)
   MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.4)
  MCA memory: ptmalloc2 (MCA v2.0, API v2.0, Component v1.4)
   MCA paffinity: linux (MCA v2.0, API v2.0, Component v1.4)
   MCA carto: auto_detect (MCA v2.0, API v2.0, Component v1.4)
   MCA carto: file (MCA v2.0, API v2.0, Component v1.4)
   MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.4)
   MCA maffinity: libnuma (MCA v2.0, API v2.0, Component v1.4)
   MCA timer: linux (MCA v2.0, API v2.0, Component v1.4)
 MCA installdirs: env (MCA v2.0, API v2.0, Component v1.4)
 MCA installdirs: config (MCA v2.0, API v2.0, Component v1.4)
 MCA dpm: orte (MCA v2.0, API v2.0, Component v1.4)
  MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.4)
   MCA allocator: basic (MCA v2.0, API v2.0, Component v1.4)
   MCA allocator: bucket (MCA v2.0, API v2.0, Component v1.4)
MCA coll: basic (MCA v2.0, API v2.0, Component v1.4)
MCA coll: hierarch (MCA v2.0, API v2.0, Component v1.4)
MCA coll: inter (MCA v2.0, API v2.0, Component v1.4)
MCA coll: self (MCA v2.0, API v2.0, Component v1.4)
MCA coll: sm (MCA v2.0, API v2.0, Component v1.4)
MCA coll: sync (MCA v2.0, API v2.0, Component v1.4)
MCA coll: tuned (MCA v2.0, API v2.0, Component v1.4)
  MCA io: romio (MCA v2.0, API v2.0, Component v1.4)
   MCA mpool: fake (MCA v2.0, API v2.0, 

[OMPI users] Totalview not showing main program on startup with OpenMPI 1.3.x and 1.4.x

2011-02-09 Thread Dennis McRitchie
Hi,

I'm encountering a strange problem and can't find it having been discussed on 
this mailing list.

When building and running my parallel program using any recent Intel compiler 
and OpenMPI 1.2.8, TotalView behaves entirely correctly, displaying the 
"Process mpirun is a parallel job. Do you want to stop the job now?" dialog 
box, and stopping at the start of the program. The code displayed is the source 
code of my program's function main, and the stack trace window shows that we 
are stopped in the poll function many levels "up" from my main function's call 
to MPI_Init. I can then set breakpoints, single step, etc., and the code runs 
appropriately.

But when building and running using Intel compilers with OpenMPI 1.3.x or 
1.4.x, TotalView displays the usual dialog box, and stops at the start of the 
program; but my main program's source code is *not* displayed. The stack trace 
window again shows that we are stopped in the poll function several levels "up" 
from my main function's call to MPI_Init; but this time, the code displayed is 
the assembler code for the poll function itself.

If I click on 'main' in the stack trace window, the source code for my 
program's function main is then displayed, and I can now set breakpoints, 
single step, etc. as usual.

So why is the program's source code not displayed when using 1.3.x and 1.4.x, 
but displayed when using 1.2.8? This change in behavior is fairly confusing 
to our users, and it would be nice to have it work as it used to, if possible.

Thanks,
   Dennis

Dennis McRitchie
Computational Science and Engineering Support (CSES)
Academic Services Department
Office of Information Technology
Princeton University




Re: [OMPI users] Totalview not showing main program on startup with OpenMPI 1.3.x and 1.4.x

2011-02-09 Thread Terry Dontje
This sounds like something I ran into some time ago that involved the 
compiler omitting frame pointers.  You may want to try compiling your 
code with -fno-omit-frame-pointer.  I am unsure whether you need to do 
the same while building MPI, though.
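
For example, a sketch of the suggested compile line (file names hypothetical):

    shell$ mpicc -g -fno-omit-frame-pointer -o myprog myprog.c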


--td

On 02/09/2011 02:49 PM, Dennis McRitchie wrote:

Hi,

I'm encountering a strange problem and can't find it having been discussed on 
this mailing list.

When building and running my parallel program using any recent Intel compiler and OpenMPI 1.2.8, 
TotalView behaves entirely correctly, displaying the "Process mpirun is a parallel job. Do you 
want to stop the job now?" dialog box, and stopping at the start of the program. The code 
displayed is the source code of my program's function main, and the stack trace window shows that 
we are stopped in the poll function many levels "up" from my main function's call to 
MPI_Init. I can then set breakpoints, single step, etc., and the code runs appropriately.

But when building and running using Intel compilers with OpenMPI 1.3.x or 1.4.x, 
TotalView displays the usual dialog box, and stops at the start of the program; but my 
main program's source code is *not* displayed. The stack trace window again shows that we 
are stopped in the poll function several levels "up" from my main function's 
call to MPI_Init; but this time, the code displayed is the assembler code for the poll 
function itself.

If I click on 'main' in the stack trace window, the source code for my 
program's function main is then displayed, and I can now set breakpoints, 
single step, etc. as usual.

So why is the program's source code not displayed when using 1.3.x and 1.4.x, 
but displayed when using 1.2.8? This change in behavior is fairly confusing 
to our users, and it would be nice to have it work as it used to, if possible.

Thanks,
Dennis

Dennis McRitchie
Computational Science and Engineering Support (CSES)
Academic Services Department
Office of Information Technology
Princeton University


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
Oracle
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle *- Performance Technologies*
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com 





Re: [OMPI users] Mpirun --app option not working

2011-02-09 Thread Gus Correa

Sindhi, Waris PW wrote:

Hi,
I am having trouble using the --app option with OpenMPI's mpirun
command. The MPI processes launched with the --app option get launched
on the Linux node that the mpirun command is executed on.

The same MPI executable works when specified on the command line using
the -np option.

Please let me know what I am doing wrong?

Bad launch :

head-node % /usr/lib64/openmpi/1.4-gcc/bin/mpirun --host
node1,node1,node2,node2 --app appfile 
head-node :Hello world from 0

head-node :Hello world from 3
head-node :Hello world from 1
head-node :Hello world from 2

Good launch :

head-node % /usr/lib64/openmpi/1.4-gcc/bin/mpirun --host
node1,node1,node2,node2 -np 4 mpiinit
node1 :Hello world from 0
node2 :Hello world from 2
node2 :Hello world from 3
node1 :Hello world from 1

head-node % cat appfile
-np 1 /home/user461/OPENMPI/mpiinit
-np 1 /home/user461/OPENMPI/mpiinit
-np 1 /home/user461/OPENMPI/mpiinit
-np 1 /home/user461/OPENMPI/mpiinit

head-node % cat mpiinit.c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv)
{
int rc, me;
char pname[MPI_MAX_PROCESSOR_NAME];
int plen;

MPI_Init(
   &argc,
   &argv
);

rc = MPI_Comm_rank(
MPI_COMM_WORLD,
&me
);

if (rc != MPI_SUCCESS)
{
   return rc;
}

MPI_Get_processor_name(
   pname,
   &plen
);

printf("%s:Hello world from %d\n", pname, me);

MPI_Finalize();

return 0;
}


head-node % /usr/lib64/openmpi/1.4-gcc/bin/ompi_info
 Package: Open MPI mockbu...@x86-004.build.bos.redhat.com Distribution
Open MPI: 1.4
   Open MPI SVN revision: r22285
   Open MPI release date: Dec 08, 2009
Open RTE: 1.4
   Open RTE SVN revision: r22285
   Open RTE release date: Dec 08, 2009
OPAL: 1.4
   OPAL SVN revision: r22285
   OPAL release date: Dec 08, 2009
Ident string: 1.4
  Prefix: /usr/lib64/openmpi/1.4-gcc
 Configured architecture: x86_64-unknown-linux-gnu
  Configure host: x86-004.build.bos.redhat.com
   Configured by: mockbuild
   Configured on: Tue Feb 23 12:39:24 EST 2010
  Configure host: x86-004.build.bos.redhat.com
Built by: mockbuild
Built on: Tue Feb 23 12:41:54 EST 2010
  Built host: x86-004.build.bos.redhat.com
  C bindings: yes
C++ bindings: yes
  Fortran77 bindings: yes (all)
  Fortran90 bindings: yes
 Fortran90 bindings size: small
  C compiler: gcc
 C compiler absolute: /usr/bin/gcc
C++ compiler: g++
   C++ compiler absolute: /usr/bin/g++
  Fortran77 compiler: gfortran
  Fortran77 compiler abs: /usr/bin/gfortran
  Fortran90 compiler: gfortran
  Fortran90 compiler abs: /usr/bin/gfortran
 C profiling: yes
   C++ profiling: yes
 Fortran77 profiling: yes
 Fortran90 profiling: yes
  C++ exceptions: no
  Thread support: posix (mpi: no, progress: no)
   Sparse Groups: no
  Internal debug support: no
 MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
 libltdl support: yes
   Heterogeneous support: no
 mpirun default --prefix: yes
 MPI I/O support: yes
   MPI_WTIME support: gettimeofday
Symbol visibility support: yes
   FT Checkpoint support: no  (checkpoint thread: no)
   MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.4)
  MCA memory: ptmalloc2 (MCA v2.0, API v2.0, Component v1.4)
   MCA paffinity: linux (MCA v2.0, API v2.0, Component v1.4)
   MCA carto: auto_detect (MCA v2.0, API v2.0, Component v1.4)
   MCA carto: file (MCA v2.0, API v2.0, Component v1.4)
   MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.4)
   MCA maffinity: libnuma (MCA v2.0, API v2.0, Component v1.4)
   MCA timer: linux (MCA v2.0, API v2.0, Component v1.4)
 MCA installdirs: env (MCA v2.0, API v2.0, Component v1.4)
 MCA installdirs: config (MCA v2.0, API v2.0, Component v1.4)
 MCA dpm: orte (MCA v2.0, API v2.0, Component v1.4)
  MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.4)
   MCA allocator: basic (MCA v2.0, API v2.0, Component v1.4)
   MCA allocator: bucket (MCA v2.0, API v2.0, Component v1.4)
MCA coll: basic (MCA v2.0, API v2.0, Component v1.4)
MCA coll: hierarch (MCA v2.0, API v2.0, Component v1.4)
MCA coll: inter (MCA v2.0, API v2.0, Component v1.4)
MCA coll: self (MCA v2.0, API v2.0, Component v1.4)
MCA coll: sm (MCA v2.0, API v2.0, Component v1.4)
MCA coll: sync (MCA v2.0, API v2.0, Component v1.4)
MCA coll: tuned (MCA v2.0, API v2.0, Component v1.4)
  MCA io: romio (MCA v2.0, API v2.0, Component v1.4)
   MCA mpool:

Re: [OMPI users] Totalview not showing main program on startup with OpenMPI 1.3.x and 1.4.x

2011-02-09 Thread Dennis McRitchie
Thanks Terry.

Unfortunately, -fno-omit-frame-pointer is the default for the Intel compiler 
when -g is used, which I am using since it is necessary for source-level 
debugging. So the compiler kindly tells me that it is ignoring your suggested 
option when I specify it.  :)

Also, since I can reproduce this problem by simply changing the OpenMPI 
version, without changing the compiler version, it strikes me as being more 
likely to be an OpenMPI-related issue: 1.2.8 works, but anything later does not 
(as described below).

I have tried different versions of TotalView from 8.1 to 8.9, but all behave 
the same.

I was wondering if a change to the openmpi-totalview.tcl script might be needed.

Dennis


From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Terry Dontje
Sent: Wednesday, February 09, 2011 5:02 PM
To: us...@open-mpi.org
Subject: Re: [OMPI users] Totalview not showing main program on startup with 
OpenMPI 1.3.x and 1.4.x

This sounds like something I ran into some time ago that involved the compiler 
omitting frame pointers.  You may want to try to compile your code with 
-fno-omit-frame-pointer.  I am unsure if you may need to do the same while 
building MPI though.

--td

On 02/09/2011 02:49 PM, Dennis McRitchie wrote:

Hi,



I'm encountering a strange problem and can't find it having been discussed on 
this mailing list.



When building and running my parallel program using any recent Intel compiler 
and OpenMPI 1.2.8, TotalView behaves entirely correctly, displaying the 
"Process mpirun is a parallel job. Do you want to stop the job now?" dialog 
box, and stopping at the start of the program. The code displayed is the source 
code of my program's function main, and the stack trace window shows that we 
are stopped in the poll function many levels "up" from my main function's call 
to MPI_Init. I can then set breakpoints, single step, etc., and the code runs 
appropriately.



But when building and running using Intel compilers with OpenMPI 1.3.x or 
1.4.x, TotalView displays the usual dialog box, and stops at the start of the 
program; but my main program's source code is *not* displayed. The stack trace 
window again shows that we are stopped in the poll function several levels "up" 
from my main function's call to MPI_Init; but this time, the code displayed is 
the assembler code for the poll function itself.



If I click on 'main' in the stack trace window, the source code for my 
program's function main is then displayed, and I can now set breakpoints, 
single step, etc. as usual.



So why is the program's source code not displayed when using 1.3.x and 1.4.x, 
but displayed when using 1.2.8? This change in behavior is fairly confusing 
to our users, and it would be nice to have it work as it used to, if possible.



Thanks,

   Dennis



Dennis McRitchie

Computational Science and Engineering Support (CSES)

Academic Services Department

Office of Information Technology

Princeton University





___

users mailing list

us...@open-mpi.org

http://www.open-mpi.org/mailman/listinfo.cgi/users

--
[Oracle]
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com




Re: [OMPI users] Default hostfile not being used by mpirun

2011-02-09 Thread Jeff Squyres
You may have mentioned this in a prior mail, but what version are you using?

I tested and am unable to replicate your problem -- my openmpi-mca-params.conf 
file is always read.

Double check the value of your mca_param_files MCA parameter:

shell$ ompi_info --param mca param_files

Mine comes out as:

/users/jsquyres/.openmpi/mca-params.conf:/home/jsquyres/bogus/etc/openmpi-mca-params.conf

If I add an MCA param to either one of those files, it is definitely read and 
found.
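
For example, the parameter at issue in this thread would look like this in
either file (parameter name and path taken from the original report):

    orte_default_hostfile = /home/omu/openmpi/etc/openmpi-default-hostfile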

Remember, too, that the location of this file is relative to whatever nodes 
your MPI processes are running on.  So if you have local installs of Open MPI 
(vs. a network filesystem install), you might need to edit the system config 
file on every node.



On Feb 6, 2011, at 8:52 AM, Barnet Wagman wrote:

> Setting the orte_default_hostfile param in 
> $HOME/.openmpi/mca-params.conf
> works (with $HOME set, of course), but for some reason setting it in the 
> system conf file,
> $prefix/etc/openmpi-mca-params.conf
> does not.  Using 'ompi_info --param  ...', it appears that the system conf 
> file isn't being read at all.
> 
>  It would be nice to figure out why the system conf file isn't being read, 
> but I can easily get by with the user conf file.
> 
> Thanks
> 
> On 2/5/11 7:06 PM, Ralph Castain wrote:
>> The easiest solution is to take advantage of the fact that the default 
>> hostfile is an MCA parameter - so you can specify it in several ways other 
>> than on the cmd line. It can be in your environment, in the default MCA 
>> parameter file, or in an MCA param file in your home directory.
>> 
>> See
>> 
>> http://www.open-mpi.org/faq/?category=tuning#setting-mca-params
>> 
>> for a full description on how to do this.
>> 
>> 
>> On Feb 5, 2011, at 3:14 PM, ETHAN DENEAULT wrote:
>> 
>>> Barnet,
>>> 
>>> This isn't the most straightforward solution, but as a workaround, could 
>>> you create a bash script and run that script through npRmpi? Something like:
>>> 
>>> #!/bin/bash
>>> 
>>> mpirun -np 15 -hostfile /path/to/hostfile $1
>>> 
>>> Cheers,
>>> Ethan
>>> 
>>> --
>>> Dr. Ethan Deneault
>>> Assistant Professor of Physics
>>> The University of Tampa
>>> 401 W Kennedy Blvd
>>> Tampa, FL 33606
>>> (813) 732-3718
>>> 
>>> Barnet Wagman  wrote:
>>> 
>>> There have been many postings about openmpi-default-hostfile on the
>>> list, but I haven't found one that answers my question, so I hope you
>>> won't mind one more.
>>> 
>>> When I use mpirun, openmpi-default-hostfile does not appear to get used.
>>> I've added three lines to the default host file:
>>> 
>>>node0 slots=3
>>>node1 slots=4
>>>node2 slots=4
>>> 
>>> 'node0' is the local (master) host.
>>> 
>>> If I explicitly list the hostfile in the mpirun command, everything
>>> works correctly.  E.g.
>>> 
>>>mpirun -np 15 -hostfile /full/path/to/openmpi-default-hostfile hello_c
>>> 
>>> works correctly - hello_c gets run using all three nodes.
>>> 
>>> However, if I don't specify the hostfile, only the local node, node0, is
>>> used. E.g.
>>> 
>>>mpirun -np 15 hello_c
>>> 
>>> creates all 15 processes on node0.  I was under the impression that all
>>> machines listed in openmpi-default-hostfile should get used by default. 
>>> Is that correct?
>>> 
>>> Unfortunately I can't use the hostfile command line option.  I'm going
>>> to be using a mpi app (npRmpi) that doesn't let me pass params to
>>> mpirun. So I need all my nodes used by default.
>>> 
>>> Configuration details:
>>> 
>>>openmpi 1.4.3, built from source.
>>> 
>>>OS: Debian lenny (but the Debian openmpi package is NOT installed).
>>> 
>>>Installation dir: /home/omu/openmpi
>>> 
>>>The default host file has pathname
>>>/home/omu/openmpi/etc/openmpi-default-hostfile
>>> 
>>>I've set two envirnmental variables to support open mpi:
>>> 
>>>PATH=/home/omu/openmpi/bin:...
>>>LD_LIBRARY_PATH=/home/omu/openmpi/lib:...
>>> 
>>> 
>>> Are there any other environmental variables that need to be set?
>>> 
>>> I'd appreciate any suggestions about this.
>>> 
>>> thanks,
>>> 
>>> Barnet Wagman
>>> 
>>> 
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
>> ___
>> users mailing list
>> 
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Mpirun --app option not working

2011-02-09 Thread Ralph Castain
Gus is correct - the -host option needs to be in the appfile.
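
A sketch of the corrected appfile, using the hosts from the original report:

    -H node1 -np 1 /home/user461/OPENMPI/mpiinit
    -H node1 -np 1 /home/user461/OPENMPI/mpiinit
    -H node2 -np 1 /home/user461/OPENMPI/mpiinit
    -H node2 -np 1 /home/user461/OPENMPI/mpiinit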


On Feb 9, 2011, at 3:32 PM, Gus Correa wrote:

> Sindhi, Waris PW wrote:
>> Hi,
>> I am having trouble using the --app option with OpenMPI's mpirun
>> command. The MPI processes launched with the --app option get launched
>> on the Linux node that the mpirun command is executed on.
>> The same MPI executable works when specified on the command line using
>> the -np option.
>> Please let me know what I am doing wrong?
>> Bad launch :
>> head-node % /usr/lib64/openmpi/1.4-gcc/bin/mpirun --host
>> node1,node1,node2,node2 --app appfile 
>> head-node :Hello world from 0
>> head-node :Hello world from 3
>> head-node :Hello world from 1
>> head-node :Hello world from 2
>> Good launch :
>> head-node % /usr/lib64/openmpi/1.4-gcc/bin/mpirun --host
>> node1,node1,node2,node2 -np 4 mpiinit
>> node1 :Hello world from 0
>> node2 :Hello world from 2
>> node2 :Hello world from 3
>> node1 :Hello world from 1
>> head-node % cat appfile
>> -np 1 /home/user461/OPENMPI/mpiinit
>> -np 1 /home/user461/OPENMPI/mpiinit
>> -np 1 /home/user461/OPENMPI/mpiinit
>> -np 1 /home/user461/OPENMPI/mpiinit
>> head-node % cat mpiinit.c
>> #include <mpi.h>
>> #include <stdio.h>
>> int main(int argc, char** argv)
>> {
>>int rc, me;
>>char pname[MPI_MAX_PROCESSOR_NAME];
>>int plen;
>>MPI_Init(
>>   &argc,
>>   &argv
>>);
>>rc = MPI_Comm_rank(
>>MPI_COMM_WORLD,
>>&me
>>);
>>if (rc != MPI_SUCCESS)
>>{
>>   return rc;
>>}
>>MPI_Get_processor_name(
>>   pname,
>>   &plen
>>);
>>printf("%s:Hello world from %d\n", pname, me);
>>MPI_Finalize();
>>return 0;
>> }
>> head-node % /usr/lib64/openmpi/1.4-gcc/bin/ompi_info
>> Package: Open MPI mockbu...@x86-004.build.bos.redhat.com Distribution
>>Open MPI: 1.4
>>   Open MPI SVN revision: r22285
>>   Open MPI release date: Dec 08, 2009
>>Open RTE: 1.4
>>   Open RTE SVN revision: r22285
>>   Open RTE release date: Dec 08, 2009
>>OPAL: 1.4
>>   OPAL SVN revision: r22285
>>   OPAL release date: Dec 08, 2009
>>Ident string: 1.4
>>  Prefix: /usr/lib64/openmpi/1.4-gcc
>> Configured architecture: x86_64-unknown-linux-gnu
>>  Configure host: x86-004.build.bos.redhat.com
>>   Configured by: mockbuild
>>   Configured on: Tue Feb 23 12:39:24 EST 2010
>>  Configure host: x86-004.build.bos.redhat.com
>>Built by: mockbuild
>>Built on: Tue Feb 23 12:41:54 EST 2010
>>  Built host: x86-004.build.bos.redhat.com
>>  C bindings: yes
>>C++ bindings: yes
>>  Fortran77 bindings: yes (all)
>>  Fortran90 bindings: yes
>> Fortran90 bindings size: small
>>  C compiler: gcc
>> C compiler absolute: /usr/bin/gcc
>>C++ compiler: g++
>>   C++ compiler absolute: /usr/bin/g++
>>  Fortran77 compiler: gfortran
>>  Fortran77 compiler abs: /usr/bin/gfortran
>>  Fortran90 compiler: gfortran
>>  Fortran90 compiler abs: /usr/bin/gfortran
>> C profiling: yes
>>   C++ profiling: yes
>> Fortran77 profiling: yes
>> Fortran90 profiling: yes
>>  C++ exceptions: no
>>  Thread support: posix (mpi: no, progress: no)
>>   Sparse Groups: no
>>  Internal debug support: no
>> MPI parameter check: runtime
>> Memory profiling support: no
>> Memory debugging support: no
>> libltdl support: yes
>>   Heterogeneous support: no
>> mpirun default --prefix: yes
>> MPI I/O support: yes
>>   MPI_WTIME support: gettimeofday
>> Symbol visibility support: yes
>>   FT Checkpoint support: no  (checkpoint thread: no)
>>   MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.4)
>>  MCA memory: ptmalloc2 (MCA v2.0, API v2.0, Component v1.4)
>>   MCA paffinity: linux (MCA v2.0, API v2.0, Component v1.4)
>>   MCA carto: auto_detect (MCA v2.0, API v2.0, Component v1.4)
>>   MCA carto: file (MCA v2.0, API v2.0, Component v1.4)
>>   MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.4)
>>   MCA maffinity: libnuma (MCA v2.0, API v2.0, Component v1.4)
>>   MCA timer: linux (MCA v2.0, API v2.0, Component v1.4)
>> MCA installdirs: env (MCA v2.0, API v2.0, Component v1.4)
>> MCA installdirs: config (MCA v2.0, API v2.0, Component v1.4)
>> MCA dpm: orte (MCA v2.0, API v2.0, Component v1.4)
>>  MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.4)
>>   MCA allocator: basic (MCA v2.0, API v2.0, Component v1.4)
>>   MCA allocator: bucket (MCA v2.0, API v2.0, Component v1.4)
>>MCA coll: basic (MCA v2.0, API v2.0, Component v1.4)
>>MCA coll: hierarch (MCA v2.0, API v2.0, Component v1.4)
>>MCA coll: inter (MCA v2.0, API v2.0, Compone

[OMPI users] MPI_ERR_IN_STATUS from MPI_Bcast?

2011-02-09 Thread Jeremiah Willcock

I get the following Open MPI error from 1.4.1:

*** An error occurred in MPI_Bcast
*** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
*** MPI_ERR_IN_STATUS: error code in status
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)

(hostname and port removed from each line).  There is no MPI_Status 
returned by MPI_Bcast, so I don't know what the error is.  Is this 
something that people have seen before?


-- Jeremiah Willcock


Re: [OMPI users] MPI_ERR_IN_STATUS from MPI_Bcast?

2011-02-09 Thread Jeremiah Willcock

On Wed, 9 Feb 2011, Jeremiah Willcock wrote:


I get the following Open MPI error from 1.4.1:

*** An error occurred in MPI_Bcast
*** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
*** MPI_ERR_IN_STATUS: error code in status
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)

(hostname and port removed from each line).  There is no MPI_Status returned 
by MPI_Bcast, so I don't know what the error is.  Is this something that 
people have seen before?


For the record, this appears to be caused by specifying inconsistent data 
sizes on the different ranks in the broadcast operation.  The error 
message could still be improved, though.
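
For anyone hitting the same message, a minimal C sketch of the kind of
mismatch described above (a hypothetical program; the counts deliberately
disagree across ranks):

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, buf[2] = {0, 0};

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* root broadcasts two ints, every other rank expects one --
           an inconsistent size that can surface as MPI_ERR_IN_STATUS */
        int count = (rank == 0) ? 2 : 1;
        MPI_Bcast(buf, count, MPI_INT, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }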


-- Jeremiah Willcock


[OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-09 Thread Tena Sakai
Hi

I have an app.ac1 file like below:
[tsakai@vixen local]$ cat app.ac1
-H vixen.egcrc.org   -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 5
-H vixen.egcrc.org   -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 6
-H blitzen.egcrc.org -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 7
-H blitzen.egcrc.org -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 8

The program I run is
Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R x
where x is [5..8].  The machines vixen and blitzen each handle 2 runs.

Here’s the program fib.R:
[tsakai@vixen local]$ cat fib.R
# fib() computes, given index n, fibonacci number iteratively
# here's the first dozen sequence (indexed from 0..11)
# 1, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89

fib <- function( n ) {
a <- 0
b <- 1
for ( i in 1:n ) {
 t <- b
 b <- a
 a <- a + t
}
a
}

arg <- commandArgs( TRUE )
myHost <- system( 'hostname', intern=TRUE )
cat( fib(arg), myHost, '\n' )

It reads an argument from the command line and produces a fibonacci number that
corresponds to that index, followed by the machine name.  Pretty simple stuff.

Here’s the run output:
[tsakai@vixen local]$ mpirun -app app.ac1
5 vixen.egcrc.org
8 vixen.egcrc.org
13 blitzen.egcrc.org
21 blitzen.egcrc.org

Which is exactly what I expect.  So far so good.

Now I want to run the same thing in the cloud.  I launch 2 instances of the same
virtual machine, which I get to by:
[tsakai@vixen local]$ ssh -A -i ~/.ssh/tsakai machine-instance-A-public-dns

Now I am on machine A:
[tsakai@domU-12-31-39-00-D1-F2 ~]$
[tsakai@domU-12-31-39-00-D1-F2 ~]$ # and I can go to machine B without 
password authentication,
[tsakai@domU-12-31-39-00-D1-F2 ~]$ # i.e., use public/private key
[tsakai@domU-12-31-39-00-D1-F2 ~]$
[tsakai@domU-12-31-39-00-D1-F2 ~]$ hostname
domU-12-31-39-00-D1-F2
[tsakai@domU-12-31-39-00-D1-F2 ~]$ ssh -i .ssh/tsakai domU-12-31-39-0C-C8-01
Last login: Wed Feb  9 20:51:48 2011 from 10.254.214.4
[tsakai@domU-12-31-39-0C-C8-01 ~]$
[tsakai@domU-12-31-39-0C-C8-01 ~]$ # I am now on machine B
[tsakai@domU-12-31-39-0C-C8-01 ~]$ hostname
domU-12-31-39-0C-C8-01
[tsakai@domU-12-31-39-0C-C8-01 ~]$
[tsakai@domU-12-31-39-0C-C8-01 ~]$ # now show I can get to machine A 
without using password
[tsakai@domU-12-31-39-0C-C8-01 ~]$
[tsakai@domU-12-31-39-0C-C8-01 ~]$ ssh -i .ssh/tsakai domU-12-31-39-00-D1-F2
The authenticity of host 'domu-12-31-39-00-d1-f2 (10.254.214.4)' can't be 
established.
RSA key fingerprint is e3:ad:75:b1:a4:63:7f:0f:c4:0b:10:71:f3:2f:21:81.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'domu-12-31-39-00-d1-f2' (RSA) to the list of 
known hosts.
Last login: Wed Feb  9 20:49:34 2011 from 10.215.203.239
[tsakai@domU-12-31-39-00-D1-F2 ~]$
[tsakai@domU-12-31-39-00-D1-F2 ~]$ hostname
domU-12-31-39-00-D1-F2
[tsakai@domU-12-31-39-00-D1-F2 ~]$
[tsakai@domU-12-31-39-00-D1-F2 ~]$ exit
logout
Connection to domU-12-31-39-00-D1-F2 closed.
[tsakai@domU-12-31-39-0C-C8-01 ~]$
[tsakai@domU-12-31-39-0C-C8-01 ~]$ exit
logout
Connection to domU-12-31-39-0C-C8-01 closed.
[tsakai@domU-12-31-39-00-D1-F2 ~]$
[tsakai@domU-12-31-39-00-D1-F2 ~]$ # back at machine A
[tsakai@domU-12-31-39-00-D1-F2 ~]$ hostname
domU-12-31-39-00-D1-F2

As you can see, neither machine uses a password for authentication; each uses
public/private key pairs.  There is no problem (that I can see) with ssh
invocation from one machine to the other.  This is so because I have a copy of
the public key and a copy of the private key on each instance.
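
(Note the difference from the failing run below: the manual logins above
always pass "-i .ssh/tsakai" to ssh, whereas mpirun launches its remote
daemons with a plain "ssh <host>". A hedged sketch of one way to pass the
identity file through, assuming the rsh/ssh launcher is in use and that this
Open MPI version accepts the plm_rsh_agent parameter:

    mpirun -mca plm_rsh_agent "ssh -i /home/tsakai/.ssh/tsakai" -app app.ac1

)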

The app.ac file is identical, except for the node names:
[tsakai@domU-12-31-39-00-D1-F2 ~]$ cat app.ac1
-H domU-12-31-39-00-D1-F2 -np 1 Rscript /home/tsakai/fib.R 5
-H domU-12-31-39-00-D1-F2 -np 1 Rscript /home/tsakai/fib.R 6
-H domU-12-31-39-0C-C8-01 -np 1 Rscript /home/tsakai/fib.R 7
-H domU-12-31-39-0C-C8-01 -np 1 Rscript /home/tsakai/fib.R 8

Here’s what happens with mpirun:

[tsakai@domU-12-31-39-00-D1-F2 ~]$ mpirun -app app.ac1
tsakai@domu-12-31-39-0c-c8-01's password:
Permission denied, please try again.
tsakai@domu-12-31-39-0c-c8-01's password: mpirun: killing job...

--
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--

mpirun: clean termination accomplished

[tsakai@domU-12-31-39-00-D1-F2 ~]$

Mpirun (or somebody else?) asks me for a password, which I don't have.
I end up typing control-C.

Here’s my question:
How can I get past authentication by mpirun whe