[O-MPI users] Trouble combining OpenMPI and OpenMP

2006-01-13 Thread Glenn Morris

I'm having trouble with an application (CosmoMC) that can use both
OpenMPI and OpenMP.

I have several Opteron boxes, each with two dual-core CPUs. I want to
run the application with 4 MPI processes (one per box), each of which
in turn splits into 4 OpenMP threads (one per core).

The code is Fortran 90, and the compiler is the Intel Fortran Compiler
Version 8.1. OpenMPI v1.0.1 works fine (communicating between boxes or
amongst the CPUs in a single box) without OpenMP, and OpenMP works
fine without OpenMPI.

The combination OpenMP + OpenMPI works fine if I restrict the
application to only 1 OpenMP thread per MPI process (in other words
the code at least compiles and runs fine with both options on, in this
limited sense). If I try to use my desired value of 4 OpenMP threads,
it crashes. It works fine, however, if I use MPICH for the MPI
implementation.

The hostfile specifies "slots=4 max-slots=4" for each host (trying to
lie and say "slots=1" did not help), and I use "-np 4 --bynode" to get
only one MPI process per host. I'm using ssh over Gbit Ethernet
between hosts.
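
For concreteness, the hostfile and the launch look essentially like
this (hostnames are the real ones that appear in the logs below; the
OMP_NUM_THREADS setting is simply how I pick the OpenMP thread count,
shown here for tcsh):

  # hostfile
  coma003 slots=4 max-slots=4
  coma004 slots=4 max-slots=4
  coma005 slots=4 max-slots=4
  coma006 slots=4 max-slots=4

  setenv OMP_NUM_THREADS 4
  mpirun -np 4 --bynode --hostfile ./hostfile ./cosmomc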

There is no useful error message that I can see. Watching top, I can
see that processes are spawned on the four hosts, split into 4 OpenMP
threads, and then crash immediately. The only error message is:

mpirun noticed that job rank 0 with PID 30243 on node "coma006"
exited on signal 11.
Broken pipe


Using mpirun -d reveals nothing useful to me (see end of message).


I realize this is all rather vague. Any advice, or tips for debugging
(or OpenMPI + OpenMP success stories!) appreciated.


TIA.


[coma006:30450] Info: Setting up debugger process table for applications
  MPIR_being_debugged = 0
  MPIR_debug_gate = 0
  MPIR_debug_state = 1
  MPIR_acquired_pre_main = 0
  MPIR_i_am_starter = 0
  MPIR_proctable_size = 4
  MPIR_proctable:
(i, host, exe, pid) = (0, coma003, ./cosmomc, 20847)
(i, host, exe, pid) = (1, coma004, ./cosmomc, 21622)
(i, host, exe, pid) = (2, coma005, ./cosmomc, 22080)
(i, host, exe, pid) = (3, coma006, ./cosmomc, 30461)
[coma006:30450] spawn: in job_state_callback(jobid = 1, state = 0x4)
[coma006:30461] [0,1,0] ompi_mpi_init completed
[coma004:21622] [0,1,2] ompi_mpi_init completed
[coma005:22080] [0,1,1] ompi_mpi_init completed
[coma003:20847] [0,1,3] ompi_mpi_init completed

[coma005:22079] sess_dir_finalize: found proc session dir empty - deleting
[coma005:22079] sess_dir_finalize: found job session dir empty - deleting
[coma005:22079] sess_dir_finalize: univ session dir not empty - leaving
[coma006:30450] spawn: in job_state_callback(jobid = 1, state = 0xa)
[coma006:30451] orted: job_state_callback(jobid = 1, state = 
ORTE_PROC_STATE_ABORTED)
[coma005:22079] orted: job_state_callback(jobid = 1, state = 
ORTE_PROC_STATE_ABORTED)
[coma004:21621] orted: job_state_callback(jobid = 1, state = 
ORTE_PROC_STATE_ABORTED)
mpirun noticed that job rank 1 with PID 22080 on node "coma005" exited on 
signal 11.
[coma003:20846] orted: job_state_callback(jobid = 1, state = 
ORTE_PROC_STATE_ABORTED)
[coma005:22079] sess_dir_finalize: found proc session dir empty - deleting
[coma005:22079] sess_dir_finalize: found job session dir empty - deleting
[coma005:22079] sess_dir_finalize: found univ session dir empty - deleting
[coma005:22079] sess_dir_finalize: top session dir not empty - leaving

3 processes killed (possibly by Open MPI)
[coma006:30451] orted: job_state_callback(jobid = 1, state = 
ORTE_PROC_STATE_TERMINATED)
[coma006:30451] sess_dir_finalize: found proc session dir empty - deleting
[coma006:30451] sess_dir_finalize: job session dir not empty - leaving



Re: [O-MPI users] Trouble combining OpenMPI and OpenMP

2006-01-16 Thread Glenn Morris
Brian Barrett wrote:

[debugging advice]


Thanks, I will look into this some more and try to provide a proper
report (if it is not a program bug), as I should have done in the
first place. I think we may have totalview around somewhere...



Re: [O-MPI users] Trouble combining OpenMPI and OpenMP

2006-01-18 Thread Glenn Morris

Don't know if this will be of help, but on further investigation the
problem seems to be some code that essentially does the following:


!$OMP PARALLEL DO
do i=1,n
  do j=1,m
    call sub(arg1,...)
  end do
end do
!$OMP END PARALLEL DO


where subroutine sub allocates a temporary array:


subroutine sub(arg1,...)
   real, intent(in)  :: arg1
   real, dimension(:), allocatable :: u

   allocate(u(1:arg1))

   ...

   deallocate(u)

end subroutine
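
For what it's worth, a self-contained toy version of the same pattern
(all names and sizes made up, not taken from CosmoMC) would look
something like:

program alloc_in_omp
  implicit none
  integer :: i, j
  !$OMP PARALLEL DO PRIVATE(j)
  do i = 1, 100
     do j = 1, 100
        call sub(50 + mod(i*j, 50))
     end do
  end do
  !$OMP END PARALLEL DO
contains
  subroutine sub(n)
    integer, intent(in) :: n
    real, dimension(:), allocatable :: u
    allocate(u(1:n))      ! per-call temporary, allocated by each thread
    u = real(n)
    deallocate(u)         ! freed again before returning
  end subroutine sub
end program alloc_in_omp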


The only backtrace I can get is:

Thread received signal SEGV
__cfree () in /lib/tls/libc-2.3.2.so
(idb) bt
#0  0x008a91f6 in __cfree () in /lib/tls/libc-2.3.2.so
#1  0x081bdf46 in opal_mem_free_free_hook () in 


If I change the subroutine to make u have a fixed size larger than the
largest possible required value, it runs OK past that point (but then
tends to crash further on in the code with a similar sounding problem
in __cfree or somesuch).
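
In other words, the workaround amounts to something like this (nmax is
a made-up upper bound):

subroutine sub_fixed(arg1)
   integer, intent(in)   :: arg1
   integer, parameter    :: nmax = 100000   ! made-up upper bound on arg1
   real, dimension(nmax) :: u               ! fixed size instead of allocatable

   u(1:arg1) = 0.0    ! only the first arg1 elements are actually used

end subroutine sub_fixed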



[O-MPI users] mpirun tcsh LD_LIBRARY_PATH problem

2006-01-19 Thread Glenn Morris

Using openmpi-1.0.1, attempting to launch programs via 'mpirun --mca
pls_rsh_agent ssh' fails if the user's login shell is tcsh and
LD_LIBRARY_PATH is unset at startup.

if ($?FOO) setenv BAR $FOO

is an error in tcsh if $FOO is unset, because it expands the whole
line at once. Instead one has to use:

if ($?FOO) then
  setenv BAR $FOO
endif


Hence the following patch to orte/mca/pls/rsh/pls_rsh_module.c seems
to fix things:


*** pls_rsh_module.c	2005-11-11 11:22:33.0 -0800
--- /home/gmorris/pls_rsh_module.c	2006-01-17 14:15:44.0 -0800
***************
*** 806,815 ****
          if (remote_csh) {
              asprintf (&argv[local_exec_index],
                        "set path = ( %s/bin $path ) ; "
!                       "if ( \"$?LD_LIBRARY_PATH\" == 1 ) "
!                       "setenv LD_LIBRARY_PATH %s/lib:$LD_LIBRARY_PATH ; "
!                       "if ( \"$?LD_LIBRARY_PATH\" == 0 ) "
!                       "setenv LD_LIBRARY_PATH %s/lib ; "
                        "%s/bin/%s",
                        prefix_dir,
                        prefix_dir,
--- 806,816 ----
          if (remote_csh) {
              asprintf (&argv[local_exec_index],
                        "set path = ( %s/bin $path ) ; "
!                       "if ( \"$?LD_LIBRARY_PATH\" == 1 ) then\n"
!                       "setenv LD_LIBRARY_PATH %s/lib:$LD_LIBRARY_PATH\n"
!                       "else\n"
!                       "setenv LD_LIBRARY_PATH %s/lib\n"
!                       "endif ; "
                        "%s/bin/%s",
                        prefix_dir,
                        prefix_dir,





Re: [O-MPI users] Trouble combining OpenMPI and OpenMP

2006-01-25 Thread Glenn Morris

I tried nightly snapshot 1.1a1r8803 and it said the following. I'm
willing to try and debug this further, but would need some guidance. I
have access to totalview.


Signal:11 info.si_errno:0(Success) si_code:2(SEGV_ACCERR)
Failing at addr:0x97421004
[0] 
func:/afs/slac.stanford.edu/g/ki/users/gmorris/tmp/ompi-1.1a1r8803/lib/libopal.so.0
 [0x1cc9fa]
[1] func:/lib/tls/libpthread.so.0 [0xfd2f80]
[2] 
func:/afs/slac.stanford.edu/g/ki/users/gmorris/tmp/ompi-1.1a1r8803/lib/libopal.so.0(free+0x5e)
 [0x1cf0a2]
[3] func:./cosmomc(for_deallocate+0x56) [0x80d8806]
[4] func:./cosmomc(for_dealloc_allocatable+0x59) [0x80d886d]
[5] func:./cosmomc(spline_+0x4f2) [0x805a2ea]
[6] func:./cosmomc(cambmain_mp_initsourceinterpolation_+0x243) [0x8089b65]
[7] 
func:/afs/slac.stanford.edu/package/intel_tools/compiler8.1/i386_linux24/ifort/lib/libguide.so(__kmp_invoke_microtask+0x4d)
 [0x19f8cd]
[8] 
func:/afs/slac.stanford.edu/package/intel_tools/compiler8.1/i386_linux24/ifort/lib/libguide.so(__kmpc_invoke_task_func+0xa2)
 [0x18fea6]
[9] 
func:/afs/slac.stanford.edu/package/intel_tools/compiler8.1/i386_linux24/ifort/lib/libguide.so(__kmp_internal_fork+0x19b)
 [0x1900a1]
[10] 
func:/afs/slac.stanford.edu/package/intel_tools/compiler8.1/i386_linux24/ifort/lib/libguide.so(__kmp_fork_call+0x334)
 [0x18af18]
[11] 
func:/afs/slac.stanford.edu/package/intel_tools/compiler8.1/i386_linux24/ifort/lib/libguide.so(__kmpc_fork_call+0x35)
 [0x19242d]
[12] func:./cosmomc(cambmain_mp_initsourceinterpolation_+0xa8) [0x80899ca]
[13] func:./cosmomc(cambmain_mp_cmbmain_+0x8a8) [0x8085c34]
[14] func:./cosmomc(camb_mp_camb_getresults_+0x99) [0x80936db]
[15] func:./cosmomc(camb_mp_camb_gettransfers_+0x117) [0x8093387]
[16] func:./cosmomc(cmb_cls_mp_getcls_+0x102) [0x80aa82a]
[17] func:./cosmomc(calclike_mp_getloglikepost_+0x1f9) [0x80b43e9]
[18] func:./cosmomc(calclike_mp_getloglike_+0x23e) [0x80b41d0]
[19] func:./cosmomc(montecarlo_mp_mcmcsample_+0x130) [0x80b68dc]
[20] func:./cosmomc(MAIN__+0x15a3) [0x80b885b]
[21] func:./cosmomc(main+0x20) [0x8059758]
[22] func:/lib/tls/libc.so.6(__libc_start_main+0xda) [0x23c79a]
[23] func:./cosmomc(sinh+0x49) [0x8059611]
*** End of error message ***
1 process killed (possibly by Open MPI)




Re: [O-MPI users] Trouble combining OpenMPI and OpenMP

2006-01-26 Thread Glenn Morris

Thanks for your suggestions.


Jeff Squyres wrote:

> From the stack trace, it looks like you're in the middle of a
> complex deallocation of some C++ objects, so I really can't tell
> (i.e., not in an MPI function at all).

Well, not intentionally! I'm just calling "deallocate" in a purely
Fortran program. No C++ around anywhere.

> - configure your Open MPI --with-memory-manager=none and see if the
> problem goes away. This tells Open MPI to not intercept memory
> manager functions, so if you still have the problem, it's more
> likely to be a problem in your application than in OMPI.

If I rebuild Open MPI with --with-memory-manager=none, the program
runs to completion with no problems.
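
For reference, the rebuild was essentially the following (other
configure options as before; the prefix is just an example path):

  ./configure --prefix=$HOME/ompi-nomm --with-memory-manager=none
  make all install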

> - run your application through a memory-checking debugger (such as
> valgrind) and see if it identifies any memory faults within your
> code.

I've tried, but I'm not convinced valgrind understands Fortran 90. I
certainly don't understand valgrind... It spat out a lot of
information, but I don't think any of it was especially relevant to
this issue.

> - send the additional information for run-time problems listed on  
> http://www.open-mpi.org/community/help/

tar.gz attached as requested.



ompi-info.tar.gz
Description: OpenMPI config.log and ompi_info output


Re: [O-MPI users] mpirun tcsh LD_LIBRARY_PATH problem

2006-01-30 Thread Glenn Morris
Jeff Squyres wrote:

> I'll commit this to the trunk and v1.0 branch shortly; it'll be  
> included in v1.0.2.

Thanks.



Re: [O-MPI users] Trouble combining OpenMPI and OpenMP

2006-01-30 Thread Glenn Morris

Thanks for persevering with this. I'm far from sure that the
information I am providing is of much use, largely because I'm pretty
confused about what's going on. Anyway...


Brian Barrett wrote:

> Can you rebuild Open MPI with debugging symbols (just setting CFLAGS
> to -g during configure should do it), rebuild, and get a full call
> stack with line numbers?

For (superfluous) thoroughness, I configured with --enable-debug and
--enable-memdebug, plus CFLAGS, FFLAGS, and FCFLAGS all set to -g.
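
That is, roughly (other options as before):

  ./configure --enable-debug --enable-memdebug CFLAGS=-g FFLAGS=-g FCFLAGS=-g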

gdb tells me (abbreviated):

[New Thread 2853808 (LWP 16590)]
[New Thread 18697136 (LWP 16591)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 18697136 (LWP 16591)]
0x00e47a92 in _int_free (av=0xe75580, mem=0x9cb4190) at malloc.c:4371
4371  nextsize = chunksize(nextchunk);
(gdb) bt
#0  0x00e47a92 in _int_free (av=0xe75580, mem=0x9cb4190) at malloc.c:4371
#1  0x00e466fa in free (mem=0x9cb4190) at malloc.c:3501
#2  0x08154590 in for_deallocate. ()
#3  0x08154505 in for_dealloc_allocatable ()
#4  0x0805d71f in spline (x=0x9b37eb0, y=0x9ba5fe8, n=93, yp1=1e+40, 
ypn=1e+40, y2=0x9c63fe0) at subroutines.f90:167

(gdb) bt full 5
#0  0x00e47a92 in _int_free (av=0xe75580, mem=0x9cb4190) at malloc.c:4371
p = 0x9cb4188
size = 134776
fb = (mfastbinptr *) 0xe464fd
nextchunk = 0x9cd5000
nextsize = 744
nextinuse = 15160704
prevsize = 14968205
bck = 0x11d48b4
fwd = 0x2e8
#1  0x00e466fa in free (mem=0x9cb4190) at malloc.c:3501
ar_ptr = 0xe75580
p = 0x9cb4188
hook = (void (*)(void *, const void *)) 0
#2  0x08154590 in for_deallocate. ()
No symbol table info available.
#3  0x08154505 in for_dealloc_allocatable ()
No symbol table info available.
#4  0x0805d71f in spline (x=0x9b37eb0, y=0x9ba5fe8, n=93, yp1=1e+40, 
ypn=1e+40, y2=0x9c63fe0) at subroutines.f90:167
un = 0
sig = 0.5
qn = 0
p = 1.8660254037844382
k = 0
i = 93
u = 0x11d4904


Totalview's memory debugger tells me: "Allocator returned a block
already in use: heap may be corrupted" (at an allocation that gives
the crash when the associated storage is deallocated).


[valgrind]
> The output might be useful to us, if we could take a look (at least,  
> on the OMPI build that fails).  Again, doing this with a build of  
> Open MPI that contains debugging symbols would greatly increase the  
> usefulness to us.

I have to suppress many (irrelevant, I think...) warnings, else
valgrind stops reporting them before the crash. The final one is:

==10446== 
==10446== Invalid read of size 4
==10446==at 0x1C02FA92: _int_free (malloc.c:4371)
==10446==by 0x1C02E6F9: free (malloc.c:3501)
==10446==by 0x815458F: for_deallocate. (in 
/afs/slac.stanford.edu/g/ki/users/gmorris/cosmomc/benchmarks/cosmomc/coma-mpi-openmp/O0-ompi-1.1a1r8803-ifort9-memdebug/cosmomc)
==10446==by 0x8154504: for_dealloc_allocatable (in 
/afs/slac.stanford.edu/g/ki/users/gmorris/cosmomc/benchmarks/cosmomc/coma-mpi-openmp/O0-ompi-1.1a1r8803-ifort9-memdebug/cosmomc)
==10446==  Address 0x8FD3004 is not stack'd, malloc'd or (recently) free'd
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x8fd3004
[0] 
func:/afs/slac.stanford.edu/g/ki/users/gmorris/tmp/ompi-1.1a1r8803-memdebug-ifort9/lib/libopal.so.0
 [0x1c02987a]
[1] func:[0x52bff000]
[2] 
func:/afs/slac.stanford.edu/g/ki/users/gmorris/tmp/ompi-1.1a1r8803-memdebug-ifort9/lib/libopal.so.0(free+0xa6)
 [0x1c02e6fa]
[3] func:./cosmomc(for_deallocate.+0x54) [0x8154590]
[4] func:./cosmomc(for_dealloc_allocatable+0x5b) [0x8154505]
[...]
*** End of error message ***
==10446== 
==10446== Process terminating with default action of signal 11 (SIGSEGV)
==10446==  Access not within mapped region at address 0x4
==10446==at 0x1C02FA92: _int_free (malloc.c:4371)
==10446==by 0x1C02E6F9: free (malloc.c:3501)
==10446==by 0x815458F: for_deallocate. (in 
/afs/slac.stanford.edu/g/ki/users/gmorris/cosmomc/benchmarks/cosmomc/coma-mpi-openmp/O0-ompi-1.1a1r8803-ifort9-memdebug/cosmomc)
==10446==by 0x8154504: for_dealloc_allocatable (in 
/afs/slac.stanford.edu/g/ki/users/gmorris/cosmomc/benchmarks/cosmomc/coma-mpi-openmp/O0-ompi-1.1a1r8803-ifort9-memdebug/cosmomc)
==10446== 



Re: [O-MPI users] mpirun tcsh LD_LIBRARY_PATH problem

2006-01-30 Thread Glenn Morris
Jeff Squyres wrote:

> After sending this reply, I thought about this issue a bit more --
> do you have any idea how portable the embedding of \n's in an ssh
> command is? I.e., will this work everywhere?

:) I almost commented the last time "I don't know how portable this
is". I would imagine it's entirely portable, but I only have access to
OpenSSH to test it - I think it's just a case of typing:

ssh host 'echo a
echo b
echo c'

at the command line and checking it does what it should.



Re: [O-MPI users] mpirun tcsh LD_LIBRARY_PATH problem

2006-01-31 Thread Glenn Morris
Jeff Squyres wrote:

> After sending this reply, I thought about this issue a bit more --
> do you have any idea how portable the embedding of \n's in an ssh
> command is? I.e., will this work everywhere?

On further reflection, if worried about portability, you could just
reverse the order of the "$?LD_LIBRARY_PATH" == 1 and == 0 tests in
the original (no newline) format. If L_L_P is unset, two copies of the
OpenMPI lib directory get added, which is a little ugly but harmless.
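
In tcsh terms, that would be something like (library path illustrative):

  if ( "$?LD_LIBRARY_PATH" == 0 ) setenv LD_LIBRARY_PATH /some/prefix/lib
  if ( "$?LD_LIBRARY_PATH" == 1 ) setenv LD_LIBRARY_PATH /some/prefix/lib:$LD_LIBRARY_PATH

Because the == 0 test now runs first, LD_LIBRARY_PATH is always
defined by the time the == 1 line is processed, so the original tcsh
error cannot occur; the only side effect is the duplicate entry when
it started out unset.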



[O-MPI users] tcsh 'Unmatched ".' error on localhost

2006-02-01 Thread Glenn Morris

Using v1.0.1, with tcsh as user login shell, trying to mpirun a job on
the localhost that involves tcsh produces an error from tcsh.

E.g.

hostfile = "localhost"

mpirun -np 1 --hostfile ./hostfile \
  --mca pls_rsh_agent ssh ... /bin/tcsh -c hostname

results in the error `Unmatched ".' from tcsh. /bin/bash is fine, as
is any host which is not the local machine.

tcsh -V showed the warning to come from one of the standard files in
/etc/profile.d/, which was trying to manipulate ${path}.

I believe the problem is caused by the \n added to the end of PATH and
LD_LIBRARY_PATH in pls_rsh_module.c at lines 749 and 762. tcsh does
not seem to like these, and removing them stops the error message
occurring.



Re: [O-MPI users] mpirun tcsh LD_LIBRARY_PATH problem

2006-02-02 Thread Glenn Morris
Jeff Squyres wrote:

> Excellent point.  Hardly elegant, but definitely no portability  
> issues there -- so I like it better.

Last word on this trivial issue I promise - if you don't want two
copies added to L_L_P, you could use a temporary variable, e.g.:

tcsh -c 'if ( "$?LD_LIBRARY_PATH" == 1 ) set foo ;
  if ( "$?LD_LIBRARY_PATH" == 0 ) setenv LD_LIBRARY_PATH blah ;
  if ( "$?foo" == 1 ) setenv LD_LIBRARY_PATH blah:$LD_LIBRARY_PATH ;
  unset foo'



[O-MPI users] mpirun sets umask to 0

2006-02-06 Thread Glenn Morris

mpirun (v1.0.1) sets the umask to 0, and hence creates world-writable
output files. Interestingly, adding the -d option to mpirun makes this
problem go away. To reproduce:

mpirun -np 1 --hostfile ./hostfile --mca pls_rsh_agent ssh ./a.out

where a.out is compiled from:

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
int main ()
{
printf("%.4o\n", umask( 022 ) );
return 0;
}