[O-MPI users] thread support

2005-10-18 Thread Hugh Merz

Howdy,

  I tried installing the release candidate with thread support 
enabled ( --enable-mpi-threads and --enable-progress-threads ) using an 
old RH 7.3 install and a recent FC4 install (Intel compilers). When I try 
to run a simple test program, the executable, mpirun and orted all sleep 
in what appears to be a deadlock.  If I compile Open MPI without threads, 
everything works fine.


  The FAQ states that thread support has only been lightly tested, and 
there was only brief discussion about it on the mailing list 8 months ago - 
have there been any developments, and should I expect it to work properly?


Thanks,

Hugh


Re: [O-MPI users] thread support

2005-10-24 Thread Hugh Merz

It's still only lightly tested.  I'm surprised that it totally hangs for
you, though -- what is your simple test program doing?


It just initializes MPI (I tried both MPI_Init and MPI_Init_thread), prints 
a string and exits.  It works fine without thread support compiled into 
Open MPI.

It happens with any MPI program I try.
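
For reference, the test is nothing more than a minimal sketch along these 
lines (in C; not my exact source - the MPI_Init_thread variant just swaps 
the init call):

   #include <stdio.h>
   #include <mpi.h>

   int main(int argc, char **argv)
   {
       int rank;

       MPI_Init(&argc, &argv);    /* with the threaded build, startup never completes */
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
       printf("hello from rank %d\n", rank);
       MPI_Finalize();
       return 0;
   }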

Attaching gdb to each thread of the executable gives:

(original process)
#0  0x420293d5 in sigsuspend () from /lib/i686/libc.so.6
#1  0x401e8609 in __pthread_wait_for_restart_signal () from 
/lib/i686/libpthread.so.0
#2  0x401e4eec in pthread_cond_wait () from /lib/i686/libpthread.so.0
#3  0x40bda418 in mca_oob_tcp_msg_wait () from 
/opt/openmpi-1.0rc2_asynch/lib/openmpi/mca_oob_tcp.so

(thread 1)
#0  0x420e01a7 in poll () from /lib/i686/libc.so.6
#1  0x401e5c30 in __pthread_manager () from /lib/i686/libpthread.so.0

(thread 2)
#0  0x420e01a7 in poll () from /lib/i686/libc.so.6
#1  0x4013268b in poll_dispatch () from 
/opt/openmpi-1.0rc2_asynch/lib/libopal.so.0
Cannot access memory at address 0x3e8

(thread 3)
#0  0x420dae14 in read () from /lib/i686/libc.so.6
#1  0x401f3b18 in __DTOR_END__ () from /lib/i686/libpthread.so.0
#2  0x40c8dfe3 in mca_btl_sm_component_event_thread ()
   from /opt/openmpi-1.0rc2_asynch/lib/openmpi/mca_btl_sm.so

And there are also 2 additional threads spawned by each of mpirun and 
orted.


Any clues or hints on how to debug this would be appreciated, but I 
understand that it is probably not high priority right now.


Thanks,

Hugh





Re: [O-MPI users] thread support

2005-10-26 Thread Hugh Merz
I've tried a build that reports "Thread support: posix (mpi: yes, progress: 
no)", using both MPI_THREAD_MULTIPLE and MPI_THREAD_SINGLE, and these all 
hang as well.
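
Concretely, the MPI_THREAD_MULTIPLE case is requested with something like 
this sketch (not my exact code):

   #include <stdio.h>
   #include <mpi.h>

   int main(int argc, char **argv)
   {
       int provided;

       /* Ask for the highest thread level; Open MPI reports what it
        * actually grants in 'provided'.  With the threaded build this
        * call never returns for me. */
       MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
       printf("requested MPI_THREAD_MULTIPLE, provided = %d\n", provided);
       MPI_Finalize();
       return 0;
   }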


Unlike Arnstein, I do not find that jobs run properly even on a 
single node.


Hugh

On Tue, 25 Oct 2005, Jeff Squyres wrote:


Hugh --

We are actually unable to replicate the problem; we've run some
single-threaded and multi-threaded apps with no problems.  This is
unfortunately probably symptomatic of bugs that are still remaining in
the code.  :-(

Can you try disabling MPI progress threads (I believe that tcp may be
the only BTL component that has async progress support implemented
anyway; sm *may*, but I'd have to go back and check)?  Leave MPI threads
enabled (i.e., MPI_THREAD_MULTIPLE) and see if that gets you further.






[OMPI users] Intel EM64T Compiler error on Opteron

2006-04-11 Thread Hugh Merz

I am trying to build OpenMPI v1.0.2 (stable) on an Opteron using the v8.1 Intel 
EM64T compilers:

Intel(R) C Compiler for Intel(R) EM64T-based applications, Version 8.1 Build 
20041123 Package ID: l_cce_pc_8.1.024
Intel(R) Fortran Compiler for Intel(R) EM64T-based applications, Version 8.1 
Build 20041123 Package ID: l_fce_pc_8.1.024

The compiler core dumps during make with:

 icc -DHAVE_CONFIG_H -I. -I. -I../../include -I../../include 
-DOMPI_PKGDATADIR=\"/scratch/merz//share/openmpi\" -I../../include -I../.. 
-I../.. -I../../include -I../../opal -I../../orte -I../../ompi -O3 -DNDEBUG 
-fno-strict-aliasing -pthread -MT cmd_line.lo -MD -MP -MF .deps/cmd_line.Tpo -c 
cmd_line.c  -fPIC -DPIC -o .libs/cmd_line.o
icc: error: /opt/intel_cce_80/bin/mcpcom: core dumped
icc: error: Fatal error in /opt/intel_cce_80/bin/mcpcom, terminated by unknown 
signal(139)

I couldn't find any other threads in the mailing list concerning usage of the 
Intel EM64T compilers - has anyone successfully compiled OpenMPI using this 
combination?  It also occurs on the Athlon 64 processor.  Logs attached.

Thanks,

Hugh

openmpi_1.0.2_logs.tar.bz2
Description: BZip2 compressed data


Re: [OMPI users] Intel EM64T Compiler error on Opteron

2006-04-12 Thread Hugh Merz

FWIW, I know that we saw similar issues with the Intel 8.1 series
(segv's during compilation).  Since we are not doing anything illegal in
terms of C++, we simply treated this as a compiler bug that we couldn't
really do much about.  Plus, we [perhaps incorrectly] assumed that most
sites using the Intel compilers would be using more recent versions.

Troy seems to confirm that later builds of the 8.1 series seem to have
fixed the problem -- can you try upgrading?


I tried the most recent v8.1 compiler, Build 20060202 Package ID: 
l_cce_pc_8.1.034

But it still core dumps:

icc: error: /scratch/merz/intel_cce_80/bin/mcpcom: core dumped
icc: error: Fatal error in /scratch/merz/intel_cce_80/bin/mcpcom, terminated by 
unknown signal(139)
compilation aborted for cmd_line.c (code 1)

I'm running on Opteron 254s, Fedora Core 4.  I can get by with building it
using gcc, so there's no urgency.

Thanks!

Hugh


-Original Message-
From: users-boun...@open-mpi.org
[mailto:users-boun...@open-mpi.org] On Behalf Of Troy Telford
Sent: Tuesday, April 11, 2006 4:14 PM
To: Open MPI Users
Subject: Re: [OMPI users] Intel EM64T Compiler error on Opteron

On Tue, 11 Apr 2006 13:48:43 -0600, Troy Telford
 wrote:


I have compiled Open MPI (on an Opteron) with the Intel 9 EM64T compilers;
it's been a while since I've used the 8.1 series, but I'll give it a shot
with Intel 8.1 and tell you what happens.


I can confirm that I'm able to compile Open MPI 1.0.2 on my systems.

Other info:
* Opteron 244 CPUs
* SLES 9 SP3 x86_64
* Intel(R) C Compiler for Intel(R) EM64T-based applications, Version 8.1 
  Build 20050628
* Intel(R) Fortran Compiler for Intel(R) EM64T-based applications, Version 8.1 
  Build 20050517
--
Troy Telford
Linux Networx
ttelf...@linuxnetworx.com
(801) 649-1356



Re: [OMPI users] error for open-mpi application

2006-06-07 Thread Hugh Merz

However, when I used Open MPI to compile an application program (the molecular 
dynamics code Amber9), error messages are given:


I think you would be better off using the Open MPI wrapper compilers rather 
than trying to link the MPI libraries by hand.  For more information read the 
FAQ, which contains a section on how to compile MPI programs with Open MPI:

http://www.open-mpi.org/faq/?category=mpi-apps

Likely all that is required is to change the Makefile for Amber9.


For PMEMD module:


For example, use mpif90 instead of pgf90 here:


pgf90  -o pmemd gbl_constants.o gbl_datatypes.o state_info.o file_io_dat.o
...
-L/home/ytang/gdata/whli/openmpi/lib -lmpich
/usr/bin/ld: cannot find -lmpich


This is the error - there is no libmpich in Open MPI (that library belongs to 
MPICH).
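
With the wrapper, the link step above would look something like the line below 
(object list elided as in your output; mpif90 adds the Open MPI libraries 
itself, so the -L path and -lmpich go away):

   mpif90 -o pmemd gbl_constants.o gbl_datatypes.o state_info.o file_io_dat.o ...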


make[1]: *** [pmemd] Error 2
make[1]: Leaving directory `/gdata/lun8/ytang/whli/amber9/src/pmemd/src'
make: *** [install] Error 2

For sander module:


Presumably this was not compiled with mpif90 either, but you did not send the 
command you used to compile it.


../lmod/lmod.a ../lapack/lapack.a ../blas/blas.a \
../lib/nxtsec.o ../lib/sys.a  -L/home/ytang/gdata/whli/openmpi/lib -lmpi_f90
-lmpi -lorte -lopal -lutil -lnsl -lpthread -ldl -Wl,--export-dynamic -lm
-lutil -lnsl -lpthread -ldl
/usr/bin/ld: skipping incompatible
/home/ytang/gdata/whli/openmpi/lib/libmpi_f90.a when searching for -lmpi_f90
/usr/bin/ld: cannot find -lmpi_f90
make[1]: *** [sander.MPI] Error 2
make[1]: Leaving directory `/gdata/lun8/ytang/whli/amber9/src/sander'
make: *** [parallel] Error 2

I know it must be something wrong with the installation of open-mpi, but I
don't know where it is.

Could you please give me some advice?


Read the FAQ.  I also suggest you email the Amber mailing list and read through 
any documentation on their site, which looks quite extensive.

Hugh


Re: [OMPI users] Dual core Intel CPU

2006-08-17 Thread Hugh Merz

On Wed, 16 Aug 2006, Allan Menezes wrote:

Hi Anyone,
  I have an 18 node cluster of heterogeneous machines. I used the FC5 SMP
kernel and OSCAR 5.0 beta.
I tried the following out on a machine with Open MPI 1.1 and 1.1.1b4
versions. The machine has a D-Link DGE-530T 1 Gb/s Ethernet card and a
2.66 GHz dual-core Intel Pentium D 805 CPU with 1 GB of dual-channel DDR
3200 RAM. I compiled the ATLAS libs (ver 3.7.13beta) for this machine
and HPL (the xhpl executable) and ran the following experiment twice:
content of my "hosts" file for this machine for the 1st experiment:
a8.lightning.net slots=2
content of my "hosts" file for this machine for the 2nd experiment:
a8.lightning.net

On the single node I ran HPL with N = 6840 and NB = 120 in HPL.dat: with
1024 MB of RAM, N = sqrt(0.75 * ((1024 - 32 video overhead)/2) * 10^6 / 8)
is approximately 6840, i.e. 512 MB of RAM per CPU; otherwise the OS uses
the hard drive for virtual memory. This way it resides totally in RAM.
I ran this command twice, once for each of the two hosts files above:
# mpirun --prefix /opt/openmpi114 --hostfile hosts -mca btl tcp,self
-np 1 ./xhpl
In both cases the performance remains the same, around 4.040 GFlops. Since
in the 1st experiment I am running with slots=2 (two CPUs), I would expect
a performance increase over the 2nd experiment of 50-100%, but I see no
difference. Can anybody tell me why this is so?


You are only launching 1 process in both cases. Try `mpirun -np 2 ...` to 
launch 2 processes, which will load each of your processors with an xhpl 
process.
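
With your own command line, that is something like:

# mpirun --prefix /opt/openmpi114 --hostfile hosts -mca btl tcp,self -np 2 ./xhpl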

Please read the FAQ:

http://www.open-mpi.org/faq/?category=running#simple-spmd-run

It includes a lot of information about slots and how they should be set as well.

Hugh


I have not tried MPICH2.
Thank you,
Regards,
Allan Menezes




Re: [OMPI users] building openmpi with gfortran and g95

2006-08-18 Thread Hugh Merz

On Fri, 18 Aug 2006, Steven A. DuChene wrote:

I am attempting to build OpenMPI-1.1 on a RHEL4u2 system that has
the standard gfortran installed as part of the distro, plus a self-installed
recent version of g95 from g95.org.  When I use the FC variable with
configure to tell it where to find the g95 executable, I get the following
messages during the configure run:

*** Fortran 90/95 compiler
checking whether we are using the GNU Fortran compiler... yes
checking whether /app/g95-newer/g95-install/bin/x86_64-unknown-linux-gnu-g95 
accepts -g... yes
checking if Fortran compiler works... yes
checking whether gfortran and 
/app/g95-newer/g95-install/bin/x86_64-unknown-linux-gnu-g95 compilers are 
compatible... no
configure: WARNING: *** Fortran 77 and Fortran 90 compilers are not link 
compatible
configure: WARNING: *** Disabling MPI Fortran 90/95 bindings
checking if Fortran 90 compiler supports LOGICAL... skipped
checking if Fortran 90 compiler supports INTEGER... skipped
checking if Fortran 90 compiler supports INTEGER*1... skipped
checking if Fortran 90 compiler supports INTEGER*2... skipped
checking if Fortran 90 compiler supports INTEGER*4... skipped
checking if Fortran 90 compiler supports INTEGER*8... skipped
checking if Fortran 90 compiler supports INTEGER*16... skipped
checking if Fortran 90 compiler supports REAL... skipped
checking if Fortran 90 compiler supports REAL*4... skipped
checking if Fortran 90 compiler supports REAL*8... skipped
checking if Fortran 90 compiler supports REAL*16... skipped
checking if Fortran 90 compiler supports DOUBLE PRECISION... skipped
checking if Fortran 90 compiler supports COMPLEX... skipped
checking if Fortran 90 compiler supports COMPLEX*8... skipped
checking if Fortran 90 compiler supports COMPLEX*16... skipped
checking if Fortran 90 compiler supports COMPLEX*32... skipped
checking if Fortran 90 compiler supports DOUBLE COMPLEX... skipped

I am not specifying any particular FCFLAGS, but from searching through the 
mailing lists I am pretty sure I should be - I just don't know what exactly.

Any info on this anywhere?


Compiler incompatibility is hinted at in this FAQ entry:

   http://www.open-mpi.org/faq/?category=sysadmin#multiple-installs

Try setting the F77 environment variable so that g95 is used as the Fortran 77 
compiler as well.

If you don't need F77 bindings, then supply '--disable-f77' to configure and it 
should build fine.
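
For the first option, the configure invocation would look something like this 
(compiler path taken from your output above; keep whatever other arguments you 
are already passing):

   ./configure F77=/app/g95-newer/g95-install/bin/x86_64-unknown-linux-gnu-g95 \
               FC=/app/g95-newer/g95-install/bin/x86_64-unknown-linux-gnu-g95 ...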

Hugh


Re: [OMPI users] efficient memory to memory transfer

2006-11-08 Thread Hugh Merz

On Wed, 8 Nov 2006, Larry Stewart wrote:

Miguel Figueiredo Mascarenhas Sousa Filipe wrote:

the MPI model assumes you don't have a "shared memory" system...
therefore it is "message passing" oriented, and not designed to
perform optimally on shared-memory systems (like SMPs or cc-NUMA machines).


For many programs with both MPI and shared memory implementations, the
MPI version runs faster on SMPs and numa-CCs. Why? See the previous
paragraph...



Of course it does... it's faster to copy data in main memory than it is
to do it through any kind of network interface. You can optimize your
message passing implementation down to a couple of memory-to-memory copies
when ranks are on the same node. In the worst case, even if using
local IP addresses to communicate between peers/ranks (on the same
node), the operating system doesn't even touch the interface - it
will just copy data from a TCP send buffer to a TCP receive
buffer; in the end, that's always faster than going through a
physical network link.



There are a lot of papers about the relative merits of a mixed shared-memory
and MPI model - OpenMP on-node and MPI inter-node, for example.  Generally
they seem to show that MPI is at least as good.


The conventional wisdom of pure MPI being as good as hybrid models is primarily 
driven by the fact that people haven't had much incentive to re-write their 
algorithms to support both models.  It's a lot easier to focus only on MPI, 
hence the limited (and lightly tested) support for MPI_THREAD_MULTIPLE and 
asynchronous progress in Open MPI.

If current HPC trends continue into the future there is going to be increased 
motivation to implement fine-grained parallelism in addition to MPI.  As an 
example, the amount of RAM per node doesn't seem to be increasing as fast as 
the number of cores per node, so pure MPI codes which use a significant amount 
of memory for buffers (domain decomposition algorithms are a good example) will 
not scale to as large a problem size as hybrid implementations in 
weak-scaling scenarios.
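
For concreteness, a hypothetical sketch of the hybrid model being discussed 
(MPI between nodes, OpenMP threads within a node), where per-rank buffers are 
paid once per node rather than once per core:

   #include <stdio.h>
   #include <mpi.h>

   int main(int argc, char **argv)
   {
       int provided, rank, nranks;

       /* Only the main thread makes MPI calls, so FUNNELED is enough. */
       MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
       MPI_Comm_size(MPI_COMM_WORLD, &nranks);

       double local = 0.0;
       /* On-node parallelism: OpenMP threads share the rank's memory, so
        * the work array / halo buffers exist once per node, not per core. */
       #pragma omp parallel for reduction(+:local)
       for (int i = rank; i < 1000000; i += nranks)
           local += 1.0 / (1.0 + (double)i);

       double total;
       MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
       if (rank == 0)
           printf("sum = %f\n", total);

       MPI_Finalize();
       return 0;
   }

The FUNNELED level is enough here because only the main thread touches MPI, 
which also sidesteps the MPI_THREAD_MULTIPLE issues discussed earlier in this 
archive.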

Hugh


[OMPI users] v1.2 Bus Error (/tmp usage)

2007-03-20 Thread Hugh Merz
Good Day,

  I'm using Open MPI on a diskless cluster (/tmp is part of a 1m ramdisk), and 
I found that after upgrading from v1.1.4 to v1.2, jobs using np > 4 would 
fail to start during MPI_Init, due to what appears to be a lack of space in 
/tmp.  The error output is:

-

[tpb200:32193] *** Process received signal ***
[tpb200:32193] Signal: Bus error (7)
[tpb200:32193] Signal code:  (2)
[tpb200:32193] Failing at address: 0x2a998f4120
[tpb200:32193] [ 0] /lib64/tls/libpthread.so.0 [0x2a95f6e430]
[tpb200:32193] [ 1] 
/opt/openmpi/1.2.gcc3/lib/libmpi.so.0(ompi_free_list_grow+0x138) [0x2a9568abc8]
[tpb200:32193] [ 2] 
/opt/openmpi/1.2.gcc3/lib/libmpi.so.0(ompi_free_list_resize+0x2d) [0x2a9568b0dd]
[tpb200:32193] [ 3] 
/opt/openmpi/1.2.gcc3/lib/openmpi/mca_btl_sm.so(mca_btl_sm_add_procs_same_base_addr+0x6bf)
 [0x2a98ba419f]
[tpb200:32193] [ 4] 
/opt/openmpi/1.2.gcc3/lib/openmpi/mca_bml_r2.so(mca_bml_r2_add_procs+0x28a) 
[0x2a9899a4fa]
[tpb200:32193] [ 5] 
/opt/openmpi/1.2.gcc3/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0xe8) 
[0x2a98889308]
[tpb200:32193] [ 6] /opt/openmpi/1.2.gcc3/lib/libmpi.so.0(ompi_mpi_init+0x45d) 
[0x2a956a32ed]
[tpb200:32193] [ 7] /opt/openmpi/1.2.gcc3/lib/libmpi.so.0(MPI_Init+0x93) 
[0x2a956c5c93]
[tpb200:32193] [ 8] a.out(main+0x1c) [0x400a44]
[tpb200:32193] [ 9] /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x2a960933fb]
[tpb200:32193] [10] a.out [0x40099a]
[tpb200:32193] *** End of error message ***

... lots of the above for each process ...

mpirun noticed that job rank 0 with PID 32040 on node tpb200 exited on signal 7 
(Bus error). 

--/--

  If I increase the size of my ramdisk or point $TMP to a network filesystem 
then jobs start and complete fine, so it's not a showstopper, but with v1.1.4 
(or LAM v7.1.2) I didn't encounter this issue with my default 1m ramdisk (even 
with np > 100).  Is there a way to limit /tmp usage in Open MPI v1.2?

Hugh