[O-MPI users] thread support
Howdy,

I tried installing the release candidate with thread support enabled (--enable-mpi-threads and --enable-progress-threads) on an old RH 7.3 install and a recent FC4 install (Intel compilers). When I try to run a simple test program, the executable, mpirun and orted all sleep in what appears to be a deadlock. If I compile Open MPI without threads, everything works fine.

The FAQ states that thread support has only been lightly tested, and there was only a brief discussion about it on the mailing list 8 months ago - have there been any developments, and should I expect it to work properly?

Thanks,

Hugh
Re: [O-MPI users] thread support
> It's still only lightly tested. I'm surprised that it totally hangs for you, though -- what is your simple test program doing?

It just initializes MPI (I tried both MPI_Init and MPI_Init_thread), prints a string and exits. It works fine without thread support compiled into Open MPI, and the hang happens with any MPI program I try.

Attaching gdb to each thread of the executable gives:

(original process)
#0  0x420293d5 in sigsuspend () from /lib/i686/libc.so.6
#1  0x401e8609 in __pthread_wait_for_restart_signal () from /lib/i686/libpthread.so.0
#2  0x401e4eec in pthread_cond_wait () from /lib/i686/libpthread.so.0
#3  0x40bda418 in mca_oob_tcp_msg_wait () from /opt/openmpi-1.0rc2_asynch/lib/openmpi/mca_oob_tcp.so

(thread 1)
#0  0x420e01a7 in poll () from /lib/i686/libc.so.6
#1  0x401e5c30 in __pthread_manager () from /lib/i686/libpthread.so.0

(thread 2)
#0  0x420e01a7 in poll () from /lib/i686/libc.so.6
#1  0x4013268b in poll_dispatch () from /opt/openmpi-1.0rc2_asynch/lib/libopal.so.0
Cannot access memory at address 0x3e8

(thread 3)
#0  0x420dae14 in read () from /lib/i686/libc.so.6
#1  0x401f3b18 in __DTOR_END__ () from /lib/i686/libpthread.so.0
#2  0x40c8dfe3 in mca_btl_sm_component_event_thread () from /opt/openmpi-1.0rc2_asynch/lib/openmpi/mca_btl_sm.so

There are also 2 additional threads spawned by each of mpirun and orted. Any clues or hints on how to debug this would be appreciated, but I understand that it is probably not high priority right now.

Thanks,

Hugh
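A minimal test program of the kind described above (a sketch, not the poster's actual code) looks like this; the reported hang occurs inside the initialization call itself, before the print is reached:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* Initialize, print a string, exit. Either MPI_Init or
       MPI_Init_thread was reported to reproduce the hang. */
    MPI_Init(&argc, &argv);
    printf("hello\n");
    MPI_Finalize();
    return 0;
}

Compiled with mpicc and launched with mpirun, this reportedly runs fine against a non-threaded build and deadlocks against a build configured with --enable-mpi-threads --enable-progress-threads.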
Re: [O-MPI users] thread support
I've tried with "Thread support: posix (mpi: yes, progress: no)", using both MPI_THREAD_MULTIPLE and MPI_THREAD_SINGLE, and these all hang as well. Unlike Arnstein, I do not find that jobs will run properly on a single node.

Hugh

On Tue, 25 Oct 2005, Jeff Squyres wrote:

> Hugh --
>
> We are actually unable to replicate the problem; we've run some single-threaded and multi-threaded apps with no problems. This is unfortunately probably symptomatic of bugs that are still remaining in the code. :-(
>
> Can you try disabling MPI progress threads (I believe that tcp may be the only BTL component that has async progress support implemented anyway; sm *may*, but I'd have to go back and check)? Leave MPI threads enabled (i.e., MPI_THREAD_MULTIPLE) and see if that gets you further.
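To see which thread level a given build actually grants at run time, a standard check (generic MPI code, not taken from this thread) compares the requested and provided levels returned by MPI_Init_thread:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int requested = MPI_THREAD_MULTIPLE, provided;
    MPI_Init_thread(&argc, &argv, requested, &provided);

    /* provided is one of MPI_THREAD_SINGLE, _FUNNELED, _SERIALIZED,
       _MULTIPLE (ordered ascending); a build without MPI thread support
       will typically grant less than what was requested. */
    if (provided < requested) {
        printf("requested level %d, library provided only %d\n",
               requested, provided);
    } else {
        printf("MPI_THREAD_MULTIPLE granted\n");
    }

    MPI_Finalize();
    return 0;
}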
[OMPI users] Intel EM64T Compiler error on Opteron
I am trying to build Open MPI v1.0.2 (stable) on an Opteron using the v8.1 Intel EM64T compilers:

Intel(R) C Compiler for Intel(R) EM64T-based applications, Version 8.1 Build 20041123 Package ID: l_cce_pc_8.1.024
Intel(R) Fortran Compiler for Intel(R) EM64T-based applications, Version 8.1 Build 20041123 Package ID: l_fce_pc_8.1.024

The compiler core dumps during make with:

icc -DHAVE_CONFIG_H -I. -I. -I../../include -I../../include -DOMPI_PKGDATADIR=\"/scratch/merz//share/openmpi\" -I../../include -I../.. -I../.. -I../../include -I../../opal -I../../orte -I../../ompi -O3 -DNDEBUG -fno-strict-aliasing -pthread -MT cmd_line.lo -MD -MP -MF .deps/cmd_line.Tpo -c cmd_line.c -fPIC -DPIC -o .libs/cmd_line.o
icc: error: /opt/intel_cce_80/bin/mcpcom: core dumped
icc: error: Fatal error in /opt/intel_cce_80/bin/mcpcom, terminated by unknown signal(139)

I couldn't find any other threads in the mailing list concerning usage of the Intel EM64T compilers - has anyone successfully compiled Open MPI using this combination? It also occurs on the Athlon 64 processor. Logs attached.

Thanks,

Hugh

[Attachment: openmpi_1.0.2_logs.tar.bz2 - BZip2 compressed data]
Re: [OMPI users] Intel EM64T Compiler error on Opteron
> FWIW, I know that we saw similar issues with the Intel 8.1 series (segv's during compilation). Since we are not doing anything illegal in terms of C++, we already treated this as a compiler bug that we couldn't really do much about. Plus, we [perhaps incorrectly] assumed that most sites using the Intel compilers would be using more recent versions. Troy seems to confirm that later builds of the 8.1 series have fixed the problem -- can you try upgrading?

I tried the most recent v8.1 compiler, Build 20060202 Package ID: l_cce_pc_8.1.034, but it still core dumps:

icc: error: /scratch/merz/intel_cce_80/bin/mcpcom: core dumped
icc: error: Fatal error in /scratch/merz/intel_cce_80/bin/mcpcom, terminated by unknown signal(139)
compilation aborted for cmd_line.c (code 1)

I'm running on Opteron 254s, Fedora Core 4. I can get by with building it using gcc, so there's no urgency.

Thanks!

Hugh

On Tuesday, April 11, 2006, Troy Telford wrote:

> On Tue, 11 Apr 2006 13:48:43 -0600, Troy Telford wrote:
>
> > I have compiled Open MPI (on an Opteron) with the Intel 9 EM64T compilers; it's been a while since I've used the 8.1 series, but I'll give it a shot with Intel 8.1 and tell you what happens.
>
> I can confirm that I'm able to compile Open MPI 1.0.2 on my systems. Other info:
>
> * Opteron 244 CPUs
> * SLES 9 SP3 x86_64
> * Intel(R) C Compiler for Intel(R) EM64T-based applications, Version 8.1 Build 20050628
> * Intel(R) Fortran Compiler for Intel(R) EM64T-based applications, Version 8.1 Build 20050517
>
> --
> Troy Telford
> Linux Networx
> ttelf...@linuxnetworx.com
> (801) 649-1356
Re: [OMPI users] error for open-mpi application
> However, when I used openmpi to compile an application program (molecular dynamics code: Amber9), error messages are given:

I think you would be better off using the Open MPI wrapper compilers rather than trying to link the MPI libraries by hand. For more information read the FAQ, which contains a section on how to compile MPI programs with Open MPI:

http://www.open-mpi.org/faq/?category=mpi-apps

Likely all that is required is to change the Makefile for Amber9.

> For PMEMD module:

For example, use mpif90 instead of pgf90 here:

> pgf90 -o pmemd gbl_constants.o gbl_datatypes.o state_info.o file_io_dat.o ... -L/home/ytang/gdata/whli/openmpi/lib -lmpich
> /usr/bin/ld: cannot find -lmpich

This is the error - there is no libmpich in Open MPI.

> make[1]: *** [pmemd] Error 2
> make[1]: Leaving directory `/gdata/lun8/ytang/whli/amber9/src/pmemd/src'
> make: *** [install] Error 2

> For sander module:

Presumably this was not compiled with mpif90 either, but you did not send the command you used to compile it.

> ../lmod/lmod.a ../lapack/lapack.a ../blas/blas.a \
> ../lib/nxtsec.o ../lib/sys.a -L/home/ytang/gdata/whli/openmpi/lib -lmpi_f90 -lmpi -lorte -lopal -lutil -lnsl -lpthread -ldl -Wl,--export-dynamic -lm -lutil -lnsl -lpthread -ldl
> /usr/bin/ld: skipping incompatible /home/ytang/gdata/whli/openmpi/lib/libmpi_f90.a when searching for -lmpi_f90
> /usr/bin/ld: cannot find -lmpi_f90
> make[1]: *** [sander.MPI] Error 2
> make[1]: Leaving directory `/gdata/lun8/ytang/whli/amber9/src/sander'
> make: *** [parallel] Error 2

> I know it must be something wrong with the installation of open-mpi, but I don't know where it is. Could you please give me some advice?

Read the FAQ. I also suggest you email the Amber mailing list and read through the documentation on their site, which looks quite extensive.

Hugh
Re: [OMPI users] Dual core Intel CPU
On Wed, 16 Aug 2006, Allan Menezes wrote:

> Hi Anyone,
>
> I have an 18 node cluster of heterogeneous machines. I used the FC5 SMP kernel and OSCAR 5.0 beta. I tried the following out on a machine with Open MPI 1.1 and 1.1.1b4 versions. The machine consists of a D-Link 1 Gb/s DGE-530T Ethernet card and a 2.66 GHz dual-core Intel Pentium D 805 CPU with dual-channel 1 GB DDR 3200 RAM. I compiled the ATLAS libs (ver 3.7.13beta) for this machine and HPL (the xhpl executable) and ran the following experiment twice.
>
> Content of my "hosts" file1 for this machine for the 1st experiment:
> a8.lightning.net slots=2
>
> Content of my "hosts" file2 for this machine for the 2nd experiment:
> a8.lightning.net
>
> On the single node I ran with HPL.dat N = 6840 and NB = 120 for 1024 MB of RAM:
> N = sqrt(0.75 * ((1024 - 32 video overhead)/2) * 100 * 1/8) = approx 6840; 512 MB of RAM per CPU, otherwise the OS uses the hard drive for virtual memory. This way it resides totally in RAM.
>
> I ran this command twice, once for each of the two hosts files above:
>
> # mpirun --prefix /opt/openmpi114 --hostsfile hosts -mca btl tcp,self -np 1 ./xhpl
>
> In both cases the performance remains the same, around 4.040 GFlops. Since I am running slots=2 as two CPUs, I would expect a 50-100% performance increase over experiment 2, but I see no difference. Can anybody tell me why this is so?

You are only launching 1 process in both cases. Try `mpirun -np 2 ...` to launch 2 processes, which will load each of your processors with an xhpl process.

Please read the FAQ:

http://www.open-mpi.org/faq/?category=running#simple-spmd-run

It includes a lot of information about slots and how they should be set as well.

Hugh

> I have not tried mpich 2.
> Thank you,
> Regards,
> Allan Menezes
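To confirm how many processes a given mpirun invocation actually starts, a trivial rank/size report (a generic sketch, not part of the original post) can be run in place of xhpl:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    /* With "-np 1" this prints one line regardless of slots=2;
       with "-np 2" it should print two lines, one per process. */
    printf("rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}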
Re: [OMPI users] building openmpi with gfortran and g95
On Fri, 18 Aug 2006, Steven A. DuChene wrote:

> I am attempting to build OpenMPI-1.1 on a RHEL4u2 system that has the standard gfortran installed as part of the distro, plus a self-installed recent version of g95 from g95.org. When I use the FC flag to configure to tell it where to find the g95 executable, I get the following during the configure run:
>
> *** Fortran 90/95 compiler
> checking whether we are using the GNU Fortran compiler... yes
> checking whether /app/g95-newer/g95-install/bin/x86_64-unknown-linux-gnu-g95 accepts -g... yes
> checking if Fortran compiler works... yes
> checking whether gfortran and /app/g95-newer/g95-install/bin/x86_64-unknown-linux-gnu-g95 compilers are compatible... no
> configure: WARNING: *** Fortran 77 and Fortran 90 compilers are not link compatible
> configure: WARNING: *** Disabling MPI Fortran 90/95 bindings
> checking if Fortran 90 compiler supports LOGICAL... skipped
> checking if Fortran 90 compiler supports INTEGER... skipped
> checking if Fortran 90 compiler supports INTEGER*1... skipped
> checking if Fortran 90 compiler supports INTEGER*2... skipped
> checking if Fortran 90 compiler supports INTEGER*4... skipped
> checking if Fortran 90 compiler supports INTEGER*8... skipped
> checking if Fortran 90 compiler supports INTEGER*16... skipped
> checking if Fortran 90 compiler supports REAL... skipped
> checking if Fortran 90 compiler supports REAL*4... skipped
> checking if Fortran 90 compiler supports REAL*8... skipped
> checking if Fortran 90 compiler supports REAL*16... skipped
> checking if Fortran 90 compiler supports DOUBLE PRECISION... skipped
> checking if Fortran 90 compiler supports COMPLEX... skipped
> checking if Fortran 90 compiler supports COMPLEX*8... skipped
> checking if Fortran 90 compiler supports COMPLEX*16... skipped
> checking if Fortran 90 compiler supports COMPLEX*32... skipped
> checking if Fortran 90 compiler supports DOUBLE COMPLEX... skipped
>
> I am not specifying any particular FCFLAGS, but from searching through the mailing lists I am pretty sure I should be; I just don't know what exactly. Any info on this anywhere?

Compiler incompatibility is hinted at in this FAQ entry:

http://www.open-mpi.org/faq/?category=sysadmin#multiple-installs

Try setting the F77 environment variable so that g95 is used as the Fortran 77 compiler as well. If you don't need the F77 bindings, then supply '--disable-f77' to configure and it should build fine.

Hugh
Re: [OMPI users] efficient memory to memory transfer
On Wed, 8 Nov 2006, Larry Stewart wrote:

> Miguel Figueiredo Mascarenhas Sousa Filipe wrote:
>
> > The MPI model assumes you don't have a "shared memory" system.. therefore it is "message passing" oriented, and not designed to perform optimally on shared memory systems (like SMPs, or numa-CCs).
>
> For many programs with both MPI and shared memory implementations, the MPI version runs faster on SMPs and numa-CCs. Why? See the previous paragraph...
>
> > Of course it does.. it's faster to copy data in main memory than it is to do it through any kind of network interface. You can optimize your message passing implementation down to a couple of memory-to-memory copies when ranks are on the same node. In the worst case, even if using local IP addresses to communicate between peers/ranks on the same node, the operating system doesn't even touch the interface.. it will just copy data from a TCP sender buffer to a TCP receiver buffer.. in the end, that's always faster than going through a physical network link.
>
> There are a lot of papers about the relative merits of a mixed shared memory and MPI model - OpenMP on-node and MPI inter-node, for example. Generally they seem to show that MPI is at least as good.

The conventional wisdom of pure MPI being as good as hybrid models is primarily driven by the fact that people haven't had much incentive to re-write their algorithms to support both models. It's a lot easier to focus only on MPI, hence the limited (and lightly tested) support for MPI_THREAD_MULTIPLE and asynchronous progress in Open MPI.

If current HPC trends continue, there is going to be increased motivation to implement fine-grained parallelism in addition to MPI. As an example, the amount of RAM per node doesn't seem to be increasing as fast as the number of cores per node, so pure MPI codes which use a significant amount of memory for buffers (domain decomposition algorithms are a good example) will not scale to as large a problem size as hybrid implementations in weak-scaling scenarios.

Hugh
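A generic sketch of the hybrid model being discussed (not code from the thread; it assumes an OpenMP-capable compiler and an MPI library granting at least MPI_THREAD_FUNNELED): MPI carries the inter-node decomposition while OpenMP threads share one copy of the per-process buffers instead of duplicating them per rank:

#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    /* MPI_THREAD_FUNNELED: only the main thread makes MPI calls. */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* One domain slice per MPI rank, shared by all OpenMP threads. */
    const int n = 1 << 20;
    double *slice = malloc(n * sizeof(double));
    double local_sum = 0.0;

    /* Fine-grained, on-node parallelism via OpenMP. */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < n; i++) {
        slice[i] = (double)rank * n + i;
        local_sum += slice[i];
    }

    /* Inter-rank (inter-node) communication stays in MPI. */
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("ranks=%d threads/rank=%d sum=%g\n",
               size, omp_get_max_threads(), global_sum);
    }

    free(slice);
    MPI_Finalize();
    return 0;
}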
[OMPI users] v1.2 Bus Error (/tmp usage)
Good Day,

I'm using Open MPI on a diskless cluster (/tmp is part of a 1m ramdisk), and I found that after upgrading from v1.1.4 to v1.2, jobs using np > 4 would fail to start during MPI_Init, due to what appears to be a lack of space in /tmp. The error output is:

[tpb200:32193] *** Process received signal ***
[tpb200:32193] Signal: Bus error (7)
[tpb200:32193] Signal code: (2)
[tpb200:32193] Failing at address: 0x2a998f4120
[tpb200:32193] [ 0] /lib64/tls/libpthread.so.0 [0x2a95f6e430]
[tpb200:32193] [ 1] /opt/openmpi/1.2.gcc3/lib/libmpi.so.0(ompi_free_list_grow+0x138) [0x2a9568abc8]
[tpb200:32193] [ 2] /opt/openmpi/1.2.gcc3/lib/libmpi.so.0(ompi_free_list_resize+0x2d) [0x2a9568b0dd]
[tpb200:32193] [ 3] /opt/openmpi/1.2.gcc3/lib/openmpi/mca_btl_sm.so(mca_btl_sm_add_procs_same_base_addr+0x6bf) [0x2a98ba419f]
[tpb200:32193] [ 4] /opt/openmpi/1.2.gcc3/lib/openmpi/mca_bml_r2.so(mca_bml_r2_add_procs+0x28a) [0x2a9899a4fa]
[tpb200:32193] [ 5] /opt/openmpi/1.2.gcc3/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0xe8) [0x2a98889308]
[tpb200:32193] [ 6] /opt/openmpi/1.2.gcc3/lib/libmpi.so.0(ompi_mpi_init+0x45d) [0x2a956a32ed]
[tpb200:32193] [ 7] /opt/openmpi/1.2.gcc3/lib/libmpi.so.0(MPI_Init+0x93) [0x2a956c5c93]
[tpb200:32193] [ 8] a.out(main+0x1c) [0x400a44]
[tpb200:32193] [ 9] /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x2a960933fb]
[tpb200:32193] [10] a.out [0x40099a]
[tpb200:32193] *** End of error message ***

... lots of the above for each process ...

mpirun noticed that job rank 0 with PID 32040 on node tpb200 exited on signal 7 (Bus error).

If I increase the size of my ramdisk or point $TMP to a network filesystem then jobs start and complete fine, so it's not a showstopper, but with v1.1.4 (or LAM v7.1.2) I didn't encounter this issue with my default 1m ramdisk (even with np > 100). Is there a way to limit /tmp usage in Open MPI v1.2?

Hugh
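For background on why a full /tmp surfaces as a bus error rather than a clean "out of space" failure: the sm BTL mmap()s a backing file in the session directory under the temporary directory, and touching a page the filesystem cannot back delivers SIGBUS. A standalone sketch of that failure mode follows (the file name and size are made-up values for illustration, not Open MPI's actual paths):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/tmp/sm_backing_file";   /* hypothetical file name */
    size_t len = 512 * 1024 * 1024;              /* larger than the ramdisk */

    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0) { perror("open"); return 1; }

    /* ftruncate succeeds because the file is sparse... */
    if (ftruncate(fd, len) != 0) { perror("ftruncate"); return 1; }

    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* ...but writing the pages fails with SIGBUS once /tmp runs out of space. */
    memset(p, 0, len);

    puts("no SIGBUS: /tmp was large enough");
    munmap(p, len);
    close(fd);
    unlink(path);
    return 0;
}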