Re: [OMPI users] OPENMPI 1.2.7 & PGI compilers: configure option --disable-ptmalloc2-opt-sbrk
Hi Jeff,

Sorry to disturb you. I am sending you the stack frame captured with TotalView. The example program "Callocrash" dies with a segmentation violation in the sYSMALLOc function, at:

    set_head(remainder, remainder_size | PREV_INUSE);

The stack frame of function "sYSMALLOc" is:

    nb:             0x00025216d050 (9967161424)
    av:             0x2a95c1ef00 (&main_arena) -> (struct malloc_state)
    Local variables:
    old_top:        0x0b8bc110 -> (struct malloc_chunk)
    old_size:       0x00020ef0 (134896)
    old_end:        0x0b8dd000 -> ""
    size:           0x00025218def0 (9967296240)
    correction:     0x0 (0)
    brk:            0x0b8dd000 -> ""
    snd_brk:        0x ->
    front_misalign: 0x0 (0)
    end_misalign:   0x0b8dd000 (193843200)
    aligned_brk:    0x00507000 -> ""
    p:              0x0b8bc110 -> (struct malloc_chunk)
    remainder:      0x25da29160 -> (struct malloc_chunk)
    remainder_size: 0x00020ea0 (134816)
    sum:            0x3828b000 (942190592)
    pagemask:       0x0fff (4095)

On 16/10/08 14:05, "Francesco Iannone" wrote:

> Hi Jeff
>
> I used the configure option:
>
>     --enable-ptmalloc2-opt-sbrk
>
> to solve a segmentation fault in memory allocation with Open MPI 1.2.x
> and PGI 7.1-4 and 7.2.
>
> I have a simple source file (Callocrash.c) as an example of this (see
> below). Could you test this code on a node with 8 GByte of RAM running
> RedHat Enterprise 4, Open MPI 1.2.x, and PGI 7.1-4?
>
> I compiled it with:
>
>     pgcc -o Callocrash Callocrash.c    (it's ok)
>     gnu4 -o Callocrash Callocrash.c    (it's ok)
>     mpicc -o Callocrash Callocrash.c   (Segmentation fault in sYSMALLOc
>                                         when it has to allocate
>                                         622947588 bytes)
>
> However, thanks in advance.
>
> greetings
>
>
> Callocrash.c
>
> #include <stdio.h>
> #include <stdlib.h>
>
> int main( int argc, char *argv[])
> {
>     /*
>      * memory allocations simulation for ~50M nonzeros:
>      * nd=180 md=350 mdy=420
>      *
>      * if this program crashes, there is a compiler problem
>      */
>     printf("memory allocations simulation for ~50M nonzeros: "
>            "nd=180 md=350 mdy=420\n");
>     printf("if this program crashes, check your compiler/environment "
>            "configuration\n");
>
>     printf("sizeof(int)    %d\n", (int)sizeof(int));
>     printf("sizeof(int*)   %d\n", (int)sizeof(int*));
>     printf("sizeof(size_t) %d\n", (int)sizeof(size_t));
>
>     if( sizeof(size_t)<8 || sizeof(int*)<8 )
>     {
>         printf("please compile this program for a 64 bit environment!\n");
>         return -1;
>     }
>
>     int *p;
>
>     printf("allocation 1/4..\n");
>     p = calloc(47109185,16);
>     if(!p) printf("..failed.\n");
>     printf("allocation 2/4..\n");
>     p = calloc(47109185,4);
>     if(!p) printf("..failed.\n");
>     printf("allocation 3/4..\n");
>     p = calloc(47109185,4);
>     if(!p) printf("..failed.\n");
>     printf("allocation 4/4..\n");
>     p = calloc(622947588,16);
>     if(!p) printf("..failed.\n");
>     if(!p) return -1;
>
>     printf("allocations test passed (no crash)\n");
>     return 0;
> }
>
> On 15/10/08 19:42, "Jeff Squyres" wrote:
>
>> On Oct 15, 2008, at 9:35 AM, Francesco Iannone wrote:
>>
>>> I have a cluster of 16 nodes (dual-CPU, dual-core AMD, 16 GB RAM) with
>>> Cisco InfiniBand HCAs and an InfiniBand switch.
>>> It uses Linux RH Enterprise 4 64 bit, Open MPI 1.2.7, PGI 7.1-4 and
>>> openib-1.2-7.
>>>
>>> Hence it means that the option disable-ptmalloc2 is catastrophic in
>>> the above configuration.
>>
>> Actually, I notice that in your original message, you said
>> "--disable-ptmalloc2-opt-sbrk", but here you said "--disable-ptmalloc2".
>> The former is:
>>
>>     Only trigger callbacks when sbrk is used for small
>>     allocations, rather than every call to malloc/free.
>>     (default: enabled)
>>
>> So it should be fine to disable; it shouldn't affect overall MPI
>> performance too much.
>>
>> The latter disables ptmalloc2 entirely (and you'll likely get lower
>> benchmark bandwidth for large messages).
>>
>> I'm unaware of either of these options leading to problems with the
>> PGI compiler suite; I have tested OMPI v1.2.x with several versions of
>> the PGI compiler without problem (although my latest version is PGI
>> 7.1-4).
>
> Dr. Francesco Iannone
> Associazione EURATOM-ENEA sulla Fusione
> C.R. ENEA Frascati
> Via E. Fermi 45
> 00044 Frascati (Roma) Italy
> phone 00-39-06-9400-5124
> fax 00-39-06-9400-5524
> mailto:francesco.iann...@frascati.enea.it
> http://www.afs.enea.it/iannone
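A quick cross-check of the numbers above (a side calculation, not from the original thread; it assumes glibc ptmalloc2's usual 16-byte chunk overhead on a 64-bit platform): the failing request matches the "nb" value in the stack frame exactly.

    #include <stdio.h>

    int main(void)
    {
        /* The fourth allocation in Callocrash.c requests
         * 622947588 elements of 16 bytes each. */
        unsigned long long payload = 622947588ULL * 16ULL;

        /* 9967161408 bytes, roughly 9.3 GiB */
        printf("payload: %llu bytes (~%.1f GiB)\n", payload,
               payload / (1024.0 * 1024.0 * 1024.0));

        /* Payload plus the assumed 16-byte chunk header gives
         * 9967161424, the "nb" value in the sYSMALLOc frame above --
         * i.e. a single request larger than the test node's 8 GB of RAM. */
        printf("nb: %llu bytes\n", payload + 16ULL);
        return 0;
    }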
[OMPI users] OPAL_PREFIX is not passed to remote node in pls_rsh_module.c
Hi All,

We have bundled Open MPI with our product and shipped it to the customer. Following http://www.open-mpi.org/faq/?category=building#installdirs , below is the command we used to launch an MPI program:

    env OPAL_PREFIX=/path/to/openmpi \
        /path/to/openmpi/bin/orterun --prefix /path/to/openmpi \
        -x PATH -x LD_LIBRARY_PATH -x OPAL_PREFIX \
        -np 2 --host host1,host2 ring_c

The interesting fact is that it always works under csh/tcsh, but quite a few users told us that they run into the errors below:

    [compute-28-1.local:11174] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init_stage1.c at line 182
    --------------------------------------------------------------------------
    Sorry! You were supposed to get help about:
        orte_init:startup:internal-failure
    from the file:
        help-orte-runtime
    But I couldn't find any file matching that name. Sorry!
    --------------------------------------------------------------------------
    [compute-28-1.local:11174] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_system_init.c at line 42
    [compute-28-1.local:11174] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 52
    --------------------------------------------------------------------------
    Sorry! You were supposed to get help about:
        orted:init-failure
    from the file:
        help-orted.txt
    But I couldn't find any file matching that name. Sorry!
    --------------------------------------------------------------------------

Jeff did mention in http://www.open-mpi.org/community/lists/users/2008/09/6582.php that OPAL_PREFIX was propagated for him automatically. I bet Jeff uses csh/tcsh. Anyway, it can be traced back to how the daemon is launched.

sh/bash:

    [x:25369] pls:rsh: executing: (//usr/bin/ssh) /usr/bin/ssh x OPAL_PREFIX=/opt/openmpi-1.2.4 ; PATH=/opt/openmpi-1.2.4/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/opt/openmpi-1.2.4/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ;

csh/tcsh:

    [x:09886] pls:rsh: executing: (//usr/bin/ssh) /usr/bin/ssh x setenv OPAL_PREFIX /opt/openmpi-1.2.4 ;

Notice that in the sh/bash case OPAL_PREFIX is assigned but never exported. It seems to work after I patched pls_rsh_module.c:

    --- pls_rsh_module.c.orig	2008-10-16 17:15:32.0 -0400
    +++ pls_rsh_module.c	2008-10-16 17:15:51.0 -0400
    @@ -989,7 +989,7 @@
                 "%s/%s/%s",
                 (opal_prefix != NULL ? "OPAL_PREFIX=" : ""),
                 (opal_prefix != NULL ? opal_prefix : ""),
    -            (opal_prefix != NULL ? " ;" : ""),
    +            (opal_prefix != NULL ? " ; export OPAL_PREFIX ; " : ""),
                 prefix_dir, bin_base,
                 prefix_dir, lib_base,
                 prefix_dir, bin_base,

Another workaround is to add "export OPAL_PREFIX" to $HOME/.bashrc.

Jeff, is this a bug in the code? Or is there a reason that OPAL_PREFIX is not exported for sh/bash?

Teng
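For readers following along: the rsh launcher assembles the remote command as one string, so the fix simply appends an explicit "export" for Bourne-family shells -- a plain "VAR=value ;" sets the variable in the remote shell but does not pass it to the orted child process, while csh's "setenv" exports implicitly, which is why csh/tcsh users never saw the bug. A minimal standalone sketch of the fixed string construction (illustrative only, not the actual Open MPI source):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* Assume OPAL_PREFIX is set locally, as in the mpirun invocation. */
        const char *opal_prefix = getenv("OPAL_PREFIX");
        char cmd[1024];

        /* Bourne shells need the explicit export before the daemon runs. */
        snprintf(cmd, sizeof(cmd), "%s%s%s orted",
                 opal_prefix != NULL ? "OPAL_PREFIX=" : "",
                 opal_prefix != NULL ? opal_prefix : "",
                 opal_prefix != NULL ? " ; export OPAL_PREFIX ;" : "");
        printf("remote command: %s\n", cmd);
        return 0;
    }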
[OMPI users] Problems with OpenMPI running with Rmpi
Dear all,

I managed to install Rmpi 0.5-5 successfully on a quad-Opteron machine (8 cores overall) running OpenSUSE 11.0 and Open MPI 1.5.2. This is what I get:

    > library(Rmpi)
    [gauss:24207] mca: base: component_find: unable to open osc pt2pt: file not found (ignored)
    libibverbs: Fatal: couldn't read uverbs ABI version.
    --------------------------------------------------------------------------
    [0,0,0]: OpenIB on host gauss was unable to find any HCAs.
    Another transport will be used instead, although this may result in
    lower performance.
    --------------------------------------------------------------------------
    WARNING: Failed to open "OpenIB-cma"
    [DAT_PROVIDER_NOT_FOUND:DAT_NAME_NOT_REGISTERED].
    This may be a real error or it may be an invalid entry in the uDAPL
    Registry which is contained in the dat.conf file. Contact your local
    System Administrator to confirm the availability of the interfaces in
    the dat.conf file.
    --------------------------------------------------------------------------
    [... the same WARNING repeats for "OpenIB-cma-1", "OpenIB-cma-2",
    "OpenIB-cma-3", "OpenIB-bond", "ofa-v2-ib0", "ofa-v2-ib1" and
    "ofa-v2-ib2" ...]
    --------------------------------------------------------------------------
    [0,0,0]: uDAPL on host gauss was unable to find any NICs.
    Another transport will be used instead, although this may result in
    lower performance.
    --------------------------------------------------------------------------
    > mpi.spawn.Rslaves()
    1 slaves are spawned successfully. 0 failed.
    master (rank 0, comm 1) of size 2 is running on: gauss
    slave1 (rank 1, comm 1) of size 2 is running on: gauss

As you can see, just 1 CPU per session (2 cores) is recognized and used.

And this is the content of my /etc/dat.conf:

    OpenIB-cma u1.2 nonthreadsafe default libdaplcma.so.1 dapl
Re: [OMPI users] OpenMPI portability problems: debug info isn't helpful
Some further clarification: I read a post over on the SGE mailing list that said --with-sge is part of OMPI 1.3, not 1.2.x.

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Aleksej Saushev
Sent: Thursday, October 16, 2008 12:39 PM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI portability problems: debug info isn't helpful

Jeff Squyres writes:

> On Oct 11, 2008, at 10:20 AM, Aleksej Saushev wrote:
>
>> $ ompi_info | grep oob
>>     MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
>>     MCA rml: oob (MCA v1.0, API v1.0, Component v1.2.7)
>
> Good!
>
>>> $ mpirun --mca rml_base_debug 100 -np 2 skosfile
>> [asau.local:09060] mca: base: components_open: Looking for rml components
>> [asau.local:09060] mca: base: components_open: distilling rml components
>> [asau.local:09060] mca: base: components_open: accepting all rml components
>> [asau.local:09060] mca: base: components_open: opening rml components
>> [asau.local:09060] mca: base: components_open: found loaded component oob
>> [asau.local:09060] mca: base: components_open: component oob open function successful
>> [asau.local:09060] orte_rml_base_select: initializing rml component oob
>> [asau.local:09060] orte_rml_base_select: init returned failure
>
> Ah ha -- this is progress. For some reason, your "oob" RML plugin is
> declining to run. I see that its query/initialization function is
> actually quite short:
>
>     if(mca_oob_base_init() != ORTE_SUCCESS)
>         return NULL;
>     *priority = 1;
>     return &orte_rml_oob_module;
>
> So it must be failing the mca_oob_base_init() function -- this is what
> initializes the underlying "OOB" (out of band) communications subsystem.
>
> Of course, this doesn't fail often, so we don't have any run-time
> switches to enable the debugging output. :-( Edit
> orte/mca/oob/base/oob_base_open.c line 43 and change the value of
> mca_oob_base_output from -1 to 0. Let's see that output -- I'm
> particularly interested in the output from querying the tcp oob
> component. I suspect that it's declining to run as well.
>
> I wonder if this is going to end up being an opal_if() issue -- where
> we are traversing all the IP network interfaces from the kernel... I'll
> bet even money that it is.

[asau.local:04648] opal_ifinit: ioctl(SIOCGIFFLAGS) failed with errno=6
[asau.local:04648] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init_stage1.c at line 182
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can fail
during orte_init; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

    orte_rml_base_select failed
    --> Returned value -13 instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[asau.local:04648] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_system_init.c at line 42
[asau.local:04648] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 52
--------------------------------------------------------------------------
Open RTE was unable to initialize properly. The error occured while
attempting to orte_init(). Returned value -13 instead of ORTE_SUCCESS.
--------------------------------------------------------------------------

Why don't you use strerror(3) to print an explanation of the errno value?
From <errno.h>:

    #define ENXIO 6 /* Device not configured */

It seems that I have to debug network interface probing; how should I use the *_output subroutines so that they do print?
I tried these changes, but in vain:

    --- opal/util/if.c.orig	2008-08-25 23:16:50.0 +0400
    +++ opal/util/if.c	2008-10-15 23:55:07.0 +0400
    @@ -242,6 +242,8 @@
             if(ifr->ifr_addr.sa_family != AF_INET)
                 continue;

    +        opal_output(0, "opal_ifinit: checking netif %s", ifr->ifr_name);
    +
             /* HERE IT FAILS!! */
             if(ioctl(sd, SIOCGIFFLAGS, ifr) < 0) {
                 opal_output(0, "opal_ifinit: ioctl(SIOCGIFFLAGS) failed with errno=%d", errno);
                 continue;

    --- opal/util/if.c.orig	2008-08-25 23:16:50.0 +0400
    +++ opal/util/if.c	2008-10-15 23:55:07.0 +0400
    @@ -242,6 +242,8 @@
             if(ifr->ifr_addr.sa_family != AF_INET)
                 continue;

    +        fprintf(stderr, "opal_ifinit: checking netif %s\n", ifr->ifr_name);
    +
             /* HERE IT FAILS!! */
             if(ioctl(sd, SIOCGIFFLAGS, ifr) < 0) {
                 opal_output(0, "opal_ifinit: ioctl(SIOCGIFFLAGS) failed with errno=%d", errno);
                 continue;

    --- opal/util/output.c.orig	2008-08-25 23:16:50.0 +0400
    +++ opal/util/output.c	2008-10-16 19:58:49.
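On the strerror(3) suggestion above, a tiny self-contained illustration of the requested change (the real code would use Open MPI's opal_output rather than fprintf; this is only a sketch):

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* errno 6 is ENXIO -- "Device not configured" on the poster's
         * BSD-style system -- which is what the failing
         * ioctl(SIOCGIFFLAGS) reported above. */
        errno = 6;
        fprintf(stderr,
                "opal_ifinit: ioctl(SIOCGIFFLAGS) failed: %s (errno=%d)\n",
                strerror(errno), errno);
        return 0;
    }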
[OMPI users] Debian MPI -- mpirun missing
Hi all,

I'm very new to MPI and am trying to install it on a Debian Etch system. I did have mpich installed and I believe that is causing me problems. I completely uninstalled it and then ran:

    update-alternatives --remove-all mpicc

Then, I installed the following packages:

    libibverbs1 openmpi-bin openmpi-common openmpi-libs0 openmpi-dbg openmpi-dev

And it now says:

    >> update-alternatives --display mpicc
    mpicc - status is auto.
    link currently points to /usr/bin/mpicc.openmpi
    /usr/bin/mpicc.openmpi - priority 40
    slave mpif90: /usr/bin/mpif90.openmpi
    slave mpiCC: /usr/bin/mpic++.openmpi
    slave mpic++: /usr/bin/mpic++.openmpi
    slave mpif77: /usr/bin/mpif77.openmpi
    slave mpicxx: /usr/bin/mpic++.openmpi
    Current `best' version is /usr/bin/mpicc.openmpi.

which seems OK to me. So, I tried to compile something (I had sample code from a book I purchased a while back, but for mpich). I can run the program as-is, but I think I should be running it with mpirun -- the FAQ suggests there is one? But there is no mpirun anywhere. It's not in /usr/bin. I updated the filename database (updatedb) and tried a "locate mpirun", and I get only one hit:

    /usr/include/openmpi/ompi/runtime/mpiruntime.h

Is there a package that I neglected to install? I did an "aptitude search openmpi" and installed everything listed... :-) Or perhaps I haven't removed all trace of mpich?

Thank you in advance!

Ray
Re: [OMPI users] Debian MPI -- mpirun missing
Er, shouldn't this be on the Debian support list? A correctly installed Open MPI will give you mpirun. If their openmpi-bin package doesn't, then surely it's broken? (Or is there a straight openmpi package?)

On Sat, 2008-10-18 at 00:16 +0900, Raymond Wan wrote:

> I'm very new to MPI and am trying to install it on a Debian Etch
> system. [...]
>
> Is there a package that I neglected to install? I did an "aptitude
> search openmpi" and installed everything listed... :-) Or perhaps I
> haven't removed all trace of mpich?
Re: [OMPI users] The --with-sge option
On Oct 16, 2008, at 12:06 PM, Mike Hanby wrote:

> I'm compiling 1.2.8 on a system with SGE 6.1u4 and came across the
> "--with-sge" option on a Grid Engine posting. A couple questions:
>
> 1. I don't see --with-sge mentioned in the "./configure --help" output,
> nor can I find much reference to it on the open-mpi site. Is this
> option really implemented? What does it do?

Sorry -- this is an option for OMPI v1.3 and later; it doesn't exist in the v1.2 series.

    [8:31] svbu-mpi:~/svn/ompi4 % ./configure --help |& grep sge
      --with-sge              Build SGE or Grid Engine support (default: no)

So in the v1.3 series, using --without-sge will disable OMPI from understanding SGE host lists, etc.

> 2. After compiling openmpi providing the --with-sge switch, I ran the
> ompi_info binary and grep'd for sge in the output. There isn't any
> reference; should there be if the option was successfully passed to
> configure?

From your second mail:

> I did find the following in ompi_info:
>
>     MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.2.7)
>     MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.2.7)
>
> However I see that in an ompi_info built without using the --with-sge
> switch.

Per above, that should be OK in the 1.2 series.

> Also, since I'm building 1.2.8, shouldn't those versions after
> Component reflect 1.2.8?

Yes, actually, they should... That's somewhat concerning.

> I set the PATH and LD_LIBRARY_PATH to point to the temp location of my
> new build and it still reports 1.2.7.

You might want to double check your setup. Since OMPI uses plugins, it can be easy to accidentally mix versions by installing one over another, etc.

Note that the output from configure will also indicate whether it's going to build SGE support. Look in the stdout of configure and search for "gridengine".

--
Jeff Squyres
Cisco Systems
[OMPI users] OpenMPI 1.2.8 on Solaris: configure problems
Hi guys,

did you test Open MPI 1.2.8 on Solaris at all?!

We tried to compile Open MPI 1.2.8 on Solaris, on SPARC and on Opteron, for both GCC and Sun Studio compilers, in 32-bit and 64-bit versions -- all 2*2*2 = 8 combinations -- in the very same manner in which we installed the 1.2.5 and 1.2.6 versions.

The configure process runs through, but when "gmake all" is called, it seems that the configure stage restarts or is resumed:

    ..
    orte/mca/smr/bproc/Makefile.am:47: Libtool library used but `LIBTOOL' is undefined
    orte/mca/smr/bproc/Makefile.am:47: The usual way to define `LIBTOOL' is to add `AC_PROG_LIBTOOL'
    orte/mca/smr/bproc/Makefile.am:47: to `configure.ac' and run `aclocal' and `autoconf' again.
    orte/mca/smr/bproc/Makefile.am:47: If `AC_PROG_LIBTOOL' is in `configure.ac', make sure
    orte/mca/smr/bproc/Makefile.am:47: its definition is in aclocal's search path.
    test/support/Makefile.am:29: library used but `RANLIB' is undefined
    test/support/Makefile.am:29: The usual way to define `RANLIB' is to add `AC_PROG_RANLIB'
    test/support/Makefile.am:29: to `configure.ac' and run `autoconf' again.

... and breaks. If we run "gmake all" again, we also see error messages like:

    *** Fortran 77 compiler
    checking for gfortran... gfortran
    checking whether we are using the GNU Fortran 77 compiler... yes
    checking whether gfortran accepts -g... yes
    checking if Fortran 77 compiler works... yes
    checking gfortran external symbol convention... ./configure: line 26340: ./conftest.o: Permission denied
    ./configure: line 26342: ./conftest.o: Permission denied
    ./configure: line 26344: ./conftest.o: Permission denied
    ./configure: line 26346: ./conftest.o: Permission denied
    ./configure: line 26348: ./conftest.o: Permission denied
    configure: error: Could not determine Fortran naming convention.

Looking at the configure script, we see these lines in ./configure:

    if $NM conftest.o | grep foo_bar__ >/dev/null 2>&1 ; then
        ompi_cv_f77_external_symbol="double underscore"
    elif $NM conftest.o | grep foo_bar_ >/dev/null 2>&1 ; then
        ompi_cv_f77_external_symbol="single underscore"
    elif $NM conftest.o | grep FOO_bar >/dev/null 2>&1 ; then
        ompi_cv_f77_external_symbol="mixed case"
    elif $NM conftest.o | grep foo_bar >/dev/null 2>&1 ; then
        ompi_cv_f77_external_symbol="no underscore"
    elif $NM conftest.o | grep FOO_BAR >/dev/null 2>&1 ; then
        ompi_cv_f77_external_symbol="upper case"
    else
        $NM conftest.o >conftest.out 2>&1

and searching through ./configure tells us that $NM is never set (neither in ./configure nor in our environment).

So, we think something is not OK with the ./configure script. Note that we were able to install 1.2.5 and 1.2.6 some time ago on the same boxes without problems.

Or maybe we are doing something wrong?

best regards,
Paul Kapinos
HPC Group RZ RWTH Aachen

P.S. Folks, has anybody compiled Open MPI 1.2.8 successfully on any Solaris box?

[Attached config.log excerpt]

This file contains any messages produced by compilers while running configure, to aid debugging if configure makes a mistake.

It was created by Open MPI configure 1.2.8, which was generated by GNU Autoconf 2.61. Invocation command line was

    $ ./configure --with-devel-headers CFLAGS=-O2 -m64 CXXFLAGS=-O2 -m64 FFLAGS=-O2 -m64 FCFLAGS=-O2 -m64 LDFLAGS=-O2 -m64 --prefix=/rwthfs/rz/SW/MPI/openmpi-1.2.8/solx8664/gcc CC=gcc CXX=g++ FC=gfortran --enable-ltdl-convenience --no-create --no-recursion

    ## --------- ##
    ## Platform. ##
    ## --------- ##

    hostname = sunoc63.rz.RWTH-Aachen.DE
    uname -m = i86pc
    uname -r = 5.10
    uname -s = SunOS
    uname -v = Generic_137112-06

    /usr/bin/uname -p = i386
    /bin/uname -X = System = SunOS
    Node = sunoc63.rz.RWTH-Aachen.DE
    Release = 5.10
    KernelID = Generic_137112-06
    Machine = i86pc
    BusType =
    Serial =
    Users =
    OEM# = 0
    Origin# = 1
    NumCPU = 4

    /bin/arch = i86pc
    /usr/bin/arch -k = i86pc
    /usr/convex/getsysinfo = unknown
    /usr/bin/hostinfo = unknown
    /bin/machine = unknown
    /usr/bin/oslevel = unknown
    /bin/universe = unknown

    PATH: /home/pk224850/bin
    PATH: /rwthfs/rz/SW/UTIL.common/gcc/4.2/i386-pc-solaris2.10/bin
    PATH: /home/pk224850/bin
    PATH: /home/pk224850/bin
    PATH: /usr/local_host/sbin
    PATH: /usr/local_host/bin
    PATH: /usr/local_rwth/sbin
    PATH: /usr/local_rwth/bin
    PATH: /usr/bin
    PATH: /usr/sbin
    PATH: /sbin
    PATH: /usr/dt/bin
    PATH: /usr/X11/bin
    PATH: /usr/java/bin
    PATH: /usr/openwin/bin
    PATH: /usr/ccs/bin
    PATH: /usr/ucb
    PATH: /opt/SUNWexplo/bin
    PATH: /usr/sfw/bin
    PATH: /opt/sfw/bin
    PATH: /usr/local/bin
    PATH: /usr/local/sbin
    PATH: /opt/csw/bin
    PATH: .

    ## ----------- ##
    ## Core tests. ##
    ## ----------- ##

    configure:2817: checking for a BSD-compatible install
    configure:2873: result: /usr/local_rwth/bin/ginstall -c
    configure:2884: checking whet
Re: [OMPI users] Problem launching onto Bourne shell
Doh; yes we did. This was a minor glitch in porting the 1.2-series fix to the trunk/v1.3 (i.e., the fix in v1.2.8 is OK -- whew!). Fixed on the trunk in r19758; thanks for noticing. I'll file a CMR for v1.3.

On Oct 16, 2008, at 7:05 PM, Mostyn Lewis wrote:

Jeff,

You broke my ksh (and I expect something else). Today's SVN 1.4a1r19757, orte/mca/plm/rsh/plm_rsh_module.c line 471:

    tmp = opal_argv_split("( test ! -r ./.profile || . ./.profile;", ' ');
                           ^
                           ARGHH No (

Change it to

    tmp = opal_argv_split(" test ! -r ./.profile || . ./.profile;", ' ');

and all is well again :)

Regards,
Mostyn

On Thu, 9 Oct 2008, Jeff Squyres wrote:

FWIW, the fix has been pushed into the trunk, 1.2.8, and 1.3 SVN branches. So I'll probably take down the hg tree (we use those as temporary branches).

On Oct 9, 2008, at 2:32 PM, Hahn Kim wrote:

Hi, thanks for providing a fix; sorry for the delay in response. Once I found out about -x, I've been busy working on the rest of our code, so I haven't had the time to try out the fix. I'll take a look at it as soon as I can and will let you know how it works out.

Hahn

On Oct 7, 2008, at 5:41 PM, Jeff Squyres wrote:

On Oct 7, 2008, at 4:19 PM, Hahn Kim wrote:

> you probably want to set the LD_LIBRARY_PATH (and PATH, likely, and
> possibly others, such as that LICENSE key, etc.) regardless of whether
> it's an interactive or non-interactive login.

Right, that's exactly what I want to do. I was hoping that mpirun would run .profile as the FAQ page stated, but the -x fix works for now.

If you're using Bash, it should be running .bashrc. But it looks like you did identify a bug that we're *not* running .profile. I have a Mercurial branch up with a fix if you want to give it a spin:

    http://www.open-mpi.org/hg/hgwebdir.cgi/jsquyres/sh-profile-fixes/

I just realized that I'm using .bash_profile on the x86 and need to move its contents into .bashrc and call .bashrc from .bash_profile, since eventually I will also be launching MPI jobs onto other x86 processors. Thanks to everyone for their help.

Hahn

On Oct 7, 2008, at 2:16 PM, Jeff Squyres wrote:

On Oct 7, 2008, at 12:48 PM, Hahn Kim wrote:

Regarding 1., we're actually using 1.2.5. We started using Open MPI last winter and just stuck with it. For now, using the -x flag with mpirun works. If this really is a bug in 1.2.7, then I think we'll stick with 1.2.5 for now, then upgrade later when it's fixed.

It looks like this behavior has been the same throughout the entire 1.2 series.

Regarding 2., are you saying I should run the commands you suggest from the x86 node running bash, so that ssh logs into the Cell node running Bourne?

I'm saying that if "ssh othernode env" gives different answers than "ssh othernode" / "env", then your .bashrc or .profile or whatever is dumping out early depending on whether you have an interactive login or not. This is the real cause of the error -- you probably want to set the LD_LIBRARY_PATH (and PATH, likely, and possibly others, such as that LICENSE key, etc.) regardless of whether it's an interactive or non-interactive login.
When I run "ssh othernode env" from the x86 node, I get the following vanilla environment: USER=ha17646 HOME=/home/ha17646 LOGNAME=ha17646 SHELL=/bin/sh PWD=/home/ha17646 When I run "ssh othernode" from the x86 node, then run "env" on the Cell, I get the following: USER=ha17646 LD_LIBRARY_PATH=/opt/cell/toolchain/lib/gcc/ppu/4.1.1/32 HOME=/home/ha17646 MCS_LICENSE_PATH=/opt/MultiCorePlus/mcf.key LOGNAME=ha17646 TERM=xterm-color PATH=/usr/local/bin:/usr/bin:/sbin:/bin:/tools/openmpi-1.2.5/ bin:/ tools/cmake-2.4.7/bin:/tools SHELL=/bin/sh PWD=/home/ha17646 TZ=EST5EDT Hahn On Oct 7, 2008, at 12:07 PM, Jeff Squyres wrote: Ralph and I just talked about this a bit: 1. In all released versions of OMPI, we *do* source the .profile file on the target node if it exists (because vanilla Bourne shells do not source anything on remote nodes -- Bash does, though, per the FAQ). However, looking in 1.2.7, it looks like it might not be executing that code -- there *may* be a bug in this area. We're checking into it. 2. You might want to check your configuration to see if your .bashrc is dumping out early because it's a non-interactive shell. Check the output of: ssh othernode env vs. ssh othernode env (i.e., a non-interactive running of "env" vs. an interactive login and running "env") On Oct 7, 2008, at 8:53 AM, Ralph Castain wrote: I am unaware of anything in the code that would "source .profile" for you. I believe the FAQ page is in error here. Ralph On Oct 6, 2008, at 7:47 PM, Hahn Kim wrote: Great, that worked, thanks! However, it still concerns me that the FAQ page says that mpirun will execute .profile which doesn't seem to work for me. Are there any configuration issues that could possibly be preventing mpirun from doing this? It wou
Re: [OMPI users] Problems with OpenMPI running with Rmpi
On 17 October 2008 at 12:42, Simone Giannerini wrote:
| Dear all,
|
| I managed to install Rmpi 0.5-5 successfully on a quad-Opteron machine
| (8 cores overall) running OpenSUSE 11.0 and Open MPI 1.5.2.
|
| This is what I get:
|
| > library(Rmpi)
| [gauss:24207] mca: base: component_find: unable to open osc pt2pt: file not found (ignored)
| libibverbs: Fatal: couldn't read uverbs ABI version.
| --------------------------------------------------------------------------
| [0,0,0]: OpenIB on host gauss was unable to find any HCAs.
| Another transport will be used instead, although this may result in
| lower performance.
| --------------------------------------------------------------------------

I am surprised that your googling did not lead you to the dozens of posts on this, telling you that the config file /etc/openmpi/openmpi-mca-params.conf (location for Debian etc.) can be changed to explicitly set btl to 'no openib', as in

    # Disable the use of InfiniBand
    # btl = ^openib
    btl = ^openib

which will suppress the warning by suppressing the load of IB. Better still, newer Open MPI releases do this by default.

| I have searched the archives and found that the following suggestion was
| given for a similar problem:
|
| > Open MPI has Infiniband module compiled but there is no IB device found
| > on your host. Try to add "--mca btl ^openib" string to your command
| > line.

That's one way of suppressing it, but not the only one.

| Since I am not calling mpi directly but through Rmpi I do not know where
| to put that flag. I might contact the Rmpi maintainer; in any case, I
| would be grateful if you had further suggestions.

There is nothing Rmpi can do there, so contacting Dr Yu, while generally a good idea with actual Rmpi issues, is not really advised here.

Cheers, Dirk

--
Three out of two people have difficulties with fractions.
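If editing the system-wide config file is not an option, Open MPI also reads MCA parameters from OMPI_MCA_-prefixed environment variables, which can be set before R (and hence Rmpi) starts. A minimal sketch of that mechanism (illustrative only; in practice you would export the variable in the shell or in ~/.Renviron rather than in C):

    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        /* Equivalent to "--mca btl ^openib" on the mpirun command line;
         * this must happen before MPI_Init reads the MCA parameters. */
        setenv("OMPI_MCA_btl", "^openib", 1);

        MPI_Init(&argc, &argv);
        MPI_Finalize();
        return 0;
    }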
Re: [OMPI users] Debian MPI -- mpirun missing
On Sat, 2008-10-18 at 00:16 +0900, Raymond Wan wrote:

> Is there a package that I neglected to install? I did an "aptitude
> search openmpi" and installed everything listed... :-) Or perhaps I
> haven't removed all trace of mpich?

According to packages.debian.org, there isn't an openmpi package which contains mpirun, which, as you note, isn't expected. There is an orterun, however, which you could use instead.

The Etch version of openmpi is very old; openmpi has made a lot of progress since 1.1-2.3. I'd recommend building from source if you are able to.

Ashley.
Re: [OMPI users] Debian MPI -- mpirun missing
On 18 October 2008 at 00:16, Raymond Wan wrote:
|
| Hi all,
|
| I'm very new to MPI and am trying to install it on a Debian Etch
| system. I did have mpich installed and I believe that is causing me

Etch is getting old, and its Open MPI 1.1 packages were in suboptimal shape. A few of us got together as a new Open MPI team within Debian, and the 1.2.* packages are in much better shape. So please try to get the 1.2 packages.

| problems. I completely uninstalled it and then ran:
|
|     update-alternatives --remove-all mpicc
|
| Then, I installed the following packages:
|
|     libibverbs1 openmpi-bin openmpi-common openmpi-libs0 openmpi-dbg openmpi-dev
|
| And it now says:
|
| >> update-alternatives --display mpicc
| mpicc - status is auto.
| link currently points to /usr/bin/mpicc.openmpi
| /usr/bin/mpicc.openmpi - priority 40
| slave mpif90: /usr/bin/mpif90.openmpi
| slave mpiCC: /usr/bin/mpic++.openmpi
| slave mpic++: /usr/bin/mpic++.openmpi
| slave mpif77: /usr/bin/mpif77.openmpi
| slave mpicxx: /usr/bin/mpic++.openmpi
| Current `best' version is /usr/bin/mpicc.openmpi.
|
| which seems ok to me... So, I tried to compile something (I had sample
| code from a book I purchased a while back, but for mpich), however, I
| can run the program as-is, but I think I should be running it with
| mpirun -- the FAQ suggests there is one? But, there is no mpirun
| anywhere. It's not in /usr/bin. I updated the filename database
| (updatedb) and tried a "locate mpirun", and I get only one hit:

Well, when I use Open MPI I go with the new convention and call orterun instead of mpirun. I think you should have. Maybe a local alias in your ~/.bashrc can do the trick.

Current packages do have mpirun.openmpi, but we were unable to devise a bullet-proof scheme between lam, mpich, and Open MPI for sharing / updating / ... the alternatives links, as there are subtle differences that prevent us from switching all these aliases consistently.

Hope this helps, Dirk

| /usr/include/openmpi/ompi/runtime/mpiruntime.h
|
| Is there a package that I neglected to install? I did an "aptitude
| search openmpi" and installed everything listed... :-) Or perhaps I
| haven't removed all trace of mpich?
|
| Thank you in advance!
|
| Ray

--
Three out of two people have difficulties with fractions.
Re: [OMPI users] Debian MPI -- mpirun missing
> Well, when I use Open MPI I go with the new convention and call orterun
> instead of mpirun. I think you should have. Maybe a local alias in your
> ~/.bashrc can do the trick.
>
> Current packages do have mpirun.openmpi, but we were unable to devise a
> bullet-proof scheme between lam, mpich, and Open MPI for sharing /
> updating / ... the alternatives links, as there are subtle differences
> that prevent us from switching all these aliases consistently.

Eh? Surely it's a simple case of conflict? If you want multiple packages providing similar functionality, it's up to you to specify how the user should choose which one they want to run. Breaking any particular package (or all packages) seems like a particularly poor choice, but that's only my opinion.

I would argue that orterun is a very long way from a "new convention". I'd draw attention to section 8.8 of the MPI 2.1 standard.

But again, this is a discussion for the Debian list.
Re: [OMPI users] OpenMPI 1.2.8 on Solaris: configure problems
On Fri, Oct/17/2008 05:53:07PM, Paul Kapinos wrote:

> Hi guys,
>
> did you test OpenMPI 1.2.8 on Solaris at all?!

We built 1.2.8 on Solaris successfully a few days ago:

    http://www.open-mpi.org/mtt/index.php?do_redir=869

But due to hardware/software/man-hour resource limitations, there are often combinations of configure options, mpirun options, etc. that end up going untested. E.g., I see you're using some configure options we haven't tried:

  * --enable-ltdl-convenience
  * --no-create
  * --no-recursion
  * GCC on Solaris

> We tried to compile OpenMPI 1.2.8 on Solaris on SPARC and on Opteron,
> for both GCC and Sun Studio compilers, in 32-bit and 64-bit versions --
> all 2*2*2 = 8 combinations -- in the very same manner in which we
> installed the 1.2.5 and 1.2.6 versions.
>
> The configure process runs through, but when "gmake all" is called, it
> seems that the configure stage restarts or is resumed:
>
>     orte/mca/smr/bproc/Makefile.am:47: Libtool library used but `LIBTOOL' is undefined
>     orte/mca/smr/bproc/Makefile.am:47: The usual way to define `LIBTOOL' is to add `AC_PROG_LIBTOOL'
>     orte/mca/smr/bproc/Makefile.am:47: to `configure.ac' and run `aclocal' and `autoconf' again.
>     orte/mca/smr/bproc/Makefile.am:47: If `AC_PROG_LIBTOOL' is in `configure.ac', make sure
>     orte/mca/smr/bproc/Makefile.am:47: its definition is in aclocal's search path.
>     test/support/Makefile.am:29: library used but `RANLIB' is undefined
>     test/support/Makefile.am:29: The usual way to define `RANLIB' is to add `AC_PROG_RANLIB'
>     test/support/Makefile.am:29: to `configure.ac' and run `autoconf' again.
>
> ... and breaks.

I'm confused why aclocal (or are these automake errors?) is getting invoked in "gmake all". Did you try running "aclocal" and "autoconf" in the top-level directory? (You shouldn't have to do that, but it might resolve this problem.) Make sure "ranlib" is in your PATH; mine's at /usr/ccs/bin/ranlib. (Also, we don't have a sys/bproc.h file on our lab machine, so the above might be an untested scenario.)

> If we run "gmake all" again, we also see error messages like:
>
>     *** Fortran 77 compiler
>     checking for gfortran... gfortran
>     checking whether we are using the GNU Fortran 77 compiler... yes
>     checking whether gfortran accepts -g... yes
>     checking if Fortran 77 compiler works... yes
>     checking gfortran external symbol convention... ./configure: line 26340: ./conftest.o: Permission denied
>     ./configure: line 26342: ./conftest.o: Permission denied
>     ./configure: line 26344: ./conftest.o: Permission denied
>     ./configure: line 26346: ./conftest.o: Permission denied
>     ./configure: line 26348: ./conftest.o: Permission denied
>     configure: error: Could not determine Fortran naming convention.

We didn't test 1.2.8 with GCC/Solaris. Let me see if we can reproduce this, and get back to you.
> Looking at the configure script, we see these lines in ./configure:
>
>     if $NM conftest.o | grep foo_bar__ >/dev/null 2>&1 ; then
>         ompi_cv_f77_external_symbol="double underscore"
>     elif $NM conftest.o | grep foo_bar_ >/dev/null 2>&1 ; then
>         ompi_cv_f77_external_symbol="single underscore"
>     elif $NM conftest.o | grep FOO_bar >/dev/null 2>&1 ; then
>         ompi_cv_f77_external_symbol="mixed case"
>     elif $NM conftest.o | grep foo_bar >/dev/null 2>&1 ; then
>         ompi_cv_f77_external_symbol="no underscore"
>     elif $NM conftest.o | grep FOO_BAR >/dev/null 2>&1 ; then
>         ompi_cv_f77_external_symbol="upper case"
>     else
>         $NM conftest.o >conftest.out 2>&1
>
> and searching through ./configure tells us that $NM is never set
> (neither in ./configure nor in our environment).

Is "nm" in your path? I have this in my config.log file:

    NM='/usr/ccs/bin/nm -p'

Thanks,
Ethan

> So, we think something is not OK with the ./configure script. Note that
> we were able to install 1.2.5 and 1.2.6 some time ago on the same boxes
> without problems.
>
> Or maybe we are doing something wrong?
>
> best regards,
> Paul Kapinos
> HPC Group RZ RWTH Aachen
>
> P.S. Folks, has anybody compiled Open MPI 1.2.8 successfully on any
> Solaris box?
Re: [OMPI users] Debian MPI -- mpirun missing
On 18 October 2008 at 03:30, Terry Frankcombe wrote:
|
| > Well, when I use Open MPI I go with the new convention and call orterun
| > instead of mpirun. I think you should have. Maybe a local alias in your
| > ~/.bashrc can do the trick.
| >
| > Current packages do have mpirun.openmpi, but we were unable to devise a
| > bullet-proof scheme between lam, mpich, and Open MPI for sharing /
| > updating / ... the alternatives links, as there are subtle differences
| > that prevent us from switching all these aliases consistently.
|
| Eh? Surely it's a simple case of conflict? If you want multiple

It is not simple, or else we'd do it. Trust us, several folks tried. IIRC, one of the issues was that among mpich, lam, and Open MPI, the set of supplied and potentially conflicting apps (and their manual pages etc.) is not perfectly overlapping.

| packages providing similar functionality, it's up to you to specify how
| the user should choose which one they want to run. Breaking any
| particular package (or all packages) seems like a particularly poor
| choice, but that's only my opinion.
|
| I would argue that orterun is a very long way from a "new convention".
| I'd draw attention to section 8.8 of the MPI 2.1 standard.
|
| But again, this is a discussion for the Debian list.

In particular, for the 'package Open MPI maintainers' list at

    http://lists.alioth.debian.org/mailman/listinfo/pkg-openmpi-maintainers

so if you want to continue this discussion, please take it there. We can also point you to a couple of discussions in the Debian bug tracking system, for example http://bugs.debian.org/452047 , where Manuel actually goes through the motions. If you think you have fixes for this 'simple case of conflict', as you call it, do not hold back and tell us -- but please over on that list.

Thank you, Dirk

--
Three out of two people have difficulties with fractions.
Re: [OMPI users] OpenMPI 1.2.8 on Solaris: configure problems
On Oct 17, 2008, at 12:59 PM, Ethan Mallove wrote:

> E.g., I see you're using some configure options we haven't tried:
>
>   * --enable-ltdl-convenience
>   * --no-create
>   * --no-recursion
>   * GCC on Solaris

A user is not usually supposed to add these options. They are added by default when the build system detects that one of the configure files (configure.ac or one of the m4 files) has been modified and that the regeneration of configure is required.

I did have such errors in the past. I figured out that they were generated by a mismatch between the original version of the autotools (used to create the first configure and the cache files) and the one used by the build system when it had to rebuild the configure. If you use NFS, you should check that your NTP is doing what it is supposed to do; a wrong timestamp on one of the m4 files might be the reason for this.

george.
[OMPI users] MPI_ERR_TRUNCATE
Hi,

I'm getting an error I don't quite understand. The code:

    MPI_Irecv(recv->data, recv->count, recv->datatype, recv->sender_id,
              recv->agent_type, MPI_COMM_WORLD, &recv->request);
    ...
    recv = (AgentRequestRecv*) item->data;
    MPI_Wait(&recv->request, &status);
    receive_complete(process, recv);

And under some conditions, I get the error:

    [3] [belafonte.home:04938] *** An error occurred in MPI_Wait
    [3] [belafonte.home:04938] *** on communicator MPI_COMM_WORLD
    [3] [belafonte.home:04938] *** MPI_ERR_TRUNCATE: message truncated
    [3] [belafonte.home:04938] *** MPI_ERRORS_ARE_FATAL (goodbye)

When I do get the error, tracking the send and receive counts shows them as equal. And what I don't understand is that the receive_complete function in the above executes, and the recv struct actually contains the data that was sent. So I'm confused about the error and what it's trying to tell me, as it looks like everything worked OK.

This is on OS X 10.5.5 with Open MPI 1.2.6.

thanks,
Nick
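For reference, MPI_ERR_TRUNCATE is MPI's way of saying that a matched message was longer than the posted receive buffer. A minimal standalone reproducer of that condition (a sketch for comparison only, not the poster's code):

    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank;
        int send_buf[8] = {0};
        int recv_buf[4];
        MPI_Request req;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* 8 ints go out... */
            MPI_Send(send_buf, 8, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* ...but only 4 are expected, so MPI_Wait fails with
             * MPI_ERR_TRUNCATE under MPI_ERRORS_ARE_FATAL. */
            MPI_Irecv(recv_buf, 4, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, &status);
        }

        MPI_Finalize();
        return 0;
    }

Run it with at least two processes (e.g., mpirun -np 2 a.out). Note that if the counts really are equal, the same error can still arise from a mismatched datatype or from a wildcard receive matching a longer message from an unintended sender/tag pair.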
Re: [OMPI users] Debian MPI -- mpirun missing
Hi all,

Dirk Eddelbuettel wrote:
> On 18 October 2008 at 03:30, Terry Frankcombe wrote:
> |
> | But again, this is a discussion for the Debian list.
>
> In particular, for the 'package Open MPI maintainers' list at
> http://lists.alioth.debian.org/mailman/listinfo/pkg-openmpi-maintainers
> so if you want to continue this discussion, please take it there.

Thanks a lot, Dirk; I'll take my Debian problems over to that list then! I didn't realize that this had to be a Debian-specific problem; I know so little, I was even open to a response like, "No, there is no mpirun anymore". Of course, if mpirun is just an alias for orterun, then I will just do that (use orterun instead).

The system administrator of one of the machines I'll use prefers to stick to Debian packages, despite their age, so unless I can find a good reason (serious security flaw), I guess doing this is far easier (politically) than installing from source.

Thank you all for your help!

Ray
[OMPI users] Bus Error in ompi_free_list_grow
Hi:

A customer is running our parallel application on an SGI Altix machine. They compiled OMPI 1.2.8 themselves. The Altix uses IB interfaces, and they recently upgraded to OFED 1.3 (in SGI ProPack 6). They are receiving a bus error in ompi_free_list_grow:

    [r1i0n0:01321] *** Process received signal ***
    [r1i0n0:01321] Signal: Bus error (7)
    [r1i0n0:01321] Signal code: (2)
    [r1i0n0:01321] Failing at address: 0x2b04ba07c4a0
    [r1i0n0:01321] [ 0] /lib64/libpthread.so.0 [0x2b04b00cfc00]
    [r1i0n0:01321] [ 1] /usr/local/attila/severian-0.3.2-beta/lib/x86_64-Linux/libmpi.so.0(ompi_free_list_grow+0x14a) [0x2b04af7dc058]
    [r1i0n0:01321] [ 2] /usr/local/attila/severian-0.3.2-beta/lib/x86_64-Linux/openmpi/mca_btl_sm.so(mca_btl_sm_alloc+0x321) [0x2b04b38c8e35]
    [r1i0n0:01321] [ 3] /usr/local/attila/severian-0.3.2-beta/lib/x86_64-Linux/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_copy+0x26d) [0x2b04b3378f91]
    [r1i0n0:01321] [ 4] /usr/local/attila/severian-0.3.2-beta/lib/x86_64-Linux/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x546) [0x2b04b3370c7e]
    [r1i0n0:01321] [ 5] /usr/local/attila/severian-0.3.2-beta/lib/x86_64-Linux/libmpi.so.0(MPI_Send+0x28) [0x2b04af814098]

Here is some more information about the machine:

  * SGI Altix ICE 8200 cluster; each node has two quad-core Xeons with 16 GB
  * SUSE Linux Enterprise Server 10 Service Pack 2
  * GNU C Library stable release version 2.4 (20080421)
  * gcc (GCC) 4.1.2 20070115 (SUSE Linux)
  * SGI ProPack 6 (just upgraded from ProPack 5 SP3: changed from OFED 1.2 to 1.3)

The output from ompi_info is attached. I would appreciate any help debugging this.

Thanks,
Allen

--
Allen Barnett
E-Mail: al...@transpireinc.com
Skype: allenbarnett
Ph: 518-887-2930

[Attachment: ompi_info.txt.bz2]