Thanks, Jeff. The kind of faults I am trying to trap are application/node failures: I literally kill the application on another node in the hope of trapping the event and reacting accordingly. This is similar to how FT-MPI shrinks the communicator size, etc.
If you can suggest a different implementation that would let me trap these failures, please let me know.
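For what it's worth, the closest one can get within the MPI-1/2 spec is to replace the default MPI_ERRORS_ARE_FATAL handler with MPI_ERRORS_RETURN and inspect error codes; whether any rank survives a peer's death at all is implementation-dependent (FT-MPI extends the spec precisely because standard MPI does not promise this). A minimal sketch, assuming the implementation returns rather than aborts:

```fortran
program trap_sketch
  ! Hedged sketch: standard MPI does not guarantee that any rank
  ! survives a peer failure; this only converts the default abort
  ! into an error code you can inspect.
  implicit none
  include 'mpif.h'
  integer :: ierr, rc
  call mpi_init(ierr)
  ! Swap MPI_ERRORS_ARE_FATAL for MPI_ERRORS_RETURN on the
  ! communicator so failures surface as return codes (MPI-2 call).
  call mpi_comm_set_errhandler(mpi_comm_world, mpi_errors_return, ierr)
  ! Any communication touching the killed rank may now report an error.
  call mpi_barrier(mpi_comm_world, rc)
  if (rc /= mpi_success) then
     write (*,*) 'trapped failure, error code', rc
     ! React here: checkpoint, resubmit, or shut down cleanly.
     ! Shrinking the communicator as FT-MPI does is not portable MPI.
  end if
  call mpi_finalize(ierr)
end program trap_sketch
```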
Regards,
Mohammad Huwaidi

users-requ...@open-mpi.org wrote:
Send users mailing list submissions to us...@open-mpi.org

To subscribe or unsubscribe via the World Wide Web, visit
http://www.open-mpi.org/mailman/listinfo.cgi/users
or, via email, send a message with subject or body 'help' to
users-requ...@open-mpi.org

You can reach the person managing the list at users-ow...@open-mpi.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of users digest..."

Today's Topics:

   1. open-mpi 1.2 build failure under Mac OS X 10.3.9 (Marius Schamschula)
   2. Re: OpenMPI 1.2 bug: segmentation violation in mpi_pack (Jeff Squyres)
   3. Re: Fault Tolerance (Jeff Squyres)
   4. Re: Signal 13 (Ralph Castain)

----------------------------------------------------------------------

Message: 1
Date: Fri, 16 Mar 2007 18:42:22 -0500
From: Marius Schamschula <mar...@physics.aamu.edu>
Subject: [OMPI users] open-mpi 1.2 build failure under Mac OS X 10.3.9
To: us...@open-mpi.org
Message-ID: <82367db0-ebc6-4438-bbc2-d78963186...@physics.aamu.edu>
Content-Type: text/plain; charset="us-ascii"

Hi all,

I was building open-mpi 1.2 on my G4 running Mac OS X 10.3.9 and had a build failure with the following:

depbase=`echo runtime/ompi_mpi_preconnect.lo | sed 's|[^/]*$|.deps/&|;s|\.lo$||'`; \
if /bin/sh ../libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. -I. -I../opal/include -I../orte/include -I../ompi/include -I../ompi/include -I.. -D_REENTRANT -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -MT runtime/ompi_mpi_preconnect.lo -MD -MP -MF "$depbase.Tpo" -c -o runtime/ompi_mpi_preconnect.lo runtime/ompi_mpi_preconnect.c; \
then mv -f "$depbase.Tpo" "$depbase.Plo"; else rm -f "$depbase.Tpo"; exit 1; fi
libtool: compile: gcc -DHAVE_CONFIG_H -I. -I. -I../opal/include -I../orte/include -I../ompi/include -I../ompi/include -I.. -D_REENTRANT -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -MT runtime/ompi_mpi_preconnect.lo -MD -MP -MF runtime/.deps/ompi_mpi_preconnect.Tpo -c runtime/ompi_mpi_preconnect.c -fno-common -DPIC -o runtime/.libs/ompi_mpi_preconnect.o
runtime/ompi_mpi_preconnect.c: In function `ompi_init_do_oob_preconnect':
runtime/ompi_mpi_preconnect.c:74: error: storage size of `msg' isn't known
make[2]: *** [runtime/ompi_mpi_preconnect.lo] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all-recursive] Error 1

$ gcc -v
Reading specs from /usr/libexec/gcc/darwin/ppc/3.3/specs
Thread model: posix
gcc version 3.3 20030304 (Apple Computer, Inc. build 1495)

$ g77 -v
Reading specs from /usr/local/lib/gcc/powerpc-apple-darwin7.3.0/3.5.0/specs
Configured with: ../gcc/configure --enable-threads=posix --enable-languages=f77
Thread model: posix
gcc version 3.5.0 20040429 (experimental) (g77 from hpc.sf.net)

Note: I had no such problem under Mac OS X 10.4.9 with my ppc and x86 builds. However, I did notice that the configure script did not detect g95 from g95.org correctly:

*** Fortran 90/95 compiler
checking for gfortran... no
checking for f95... no
checking for fort... no
checking for xlf95... no
checking for ifort... no
checking for ifc... no
checking for efc... no
checking for pgf95... no
checking for lf95... no
checking for f90... no
checking for xlf90... no
checking for pgf90... no
checking for epcf90... no
checking whether we are using the GNU Fortran compiler... no

configure --help doesn't give any hint about specifying F95.
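(A workaround sketch for the g95 detection: autoconf-generated configure scripts generally honor compiler variables passed on the command line, so, assuming Open MPI's configure uses the standard FC/F77 variables and with the install prefix below being only a placeholder, something like this should bypass the autodetection:

```
./configure FC=g95 F77=g77 --prefix=/opt/openmpi-1.2
```

Whether g95 then passes Open MPI's Fortran 90 checks is a separate question.)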
TIA,

Marius

--
Marius Schamschula, Alabama A & M University, Department of Physics
The Center for Hydrology Soil Climatology and Remote Sensing
http://optics.physics.aamu.edu/ - http://www.physics.aamu.edu/
http://wx.aamu.edu/ - http://www.aamu.edu/hscars/

-------------- next part --------------
HTML attachment scrubbed and removed

------------------------------

Message: 2
Date: Fri, 16 Mar 2007 19:46:39 -0400
From: Jeff Squyres <jsquy...@cisco.com>
Subject: Re: [OMPI users] OpenMPI 1.2 bug: segmentation violation in mpi_pack
To: Open MPI Users <us...@open-mpi.org>
Message-ID: <045dabac-1369-4e45-8e0c-fd9fba13c...@cisco.com>
Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed

The problem with both the f77 and f90 programs is that you forgot to put "ierr" as the last argument to MPI_PACK. This causes a segv because neither of them are correct MPI programs.

But it's always good to hear that we can deliver a smaller corefile in v1.2! :-)

On Mar 16, 2007, at 7:25 PM, Erik Deumens wrote:

I have a small program in F77 that makes a SEGV crash with a 130 MB core file. It is true that the crash is much cleaner in OpenMPI 1.2; nice improvement! The core file is 500 MB with OpenMPI 1.1. I am running on CentOS 4.4 with the latest patches.

mpif77 -g -o bug bug.f
mpirun -np 2 ./bug

I also have a bug.f90 (which I made first) and it crashes too with the Intel ifort compiler 9.1.039.

--
Dr. Erik Deumens, Interim Director, Quantum Theory Project
New Physics Building 2334, University of Florida
Gainesville, Florida 32611-8435
deum...@qtp.ufl.edu - http://www.qtp.ufl.edu/~deumens
(352) 392-6980

      program mainf
c     mpif77 -g -o bug bug.f
c     mpirun -np 2 ./bug
      implicit none
      include 'mpif.h'
      character*80 inpfile
      integer l
      integer i
      integer stat
      integer cmdbuf(4)
      integer lcmdbuf
      integer ierr
      integer ntasks
      integer taskid
      integer bufpos
      integer cmd
      integer ldata
      character*(mpi_max_processor_name) hostnm
      integer iuinp
      integer iuout
      integer lnam
      real*8 bcaststart
      iuinp = 5
      iuout = 6
      lcmdbuf = 16
      i = 0
      call mpi_init(ierr)
      call mpi_comm_size (mpi_comm_world, ntasks, ierr)
      call mpi_comm_rank (mpi_comm_world, taskid, ierr)
      hostnm = ' '
      call mpi_get_processor_name (hostnm, lnam, ierr)
      write (iuout,*) 'task',taskid,'of',ntasks,'on ',hostnm(1:lnam)
      if (taskid == 0) then
         inpfile = ' '
         do
            write (iuout,*) 'Enter .inp filename:'
            read (iuinp,*) inpfile
            if (inpfile /= ' ') exit
         end do
         l = len_trim(inpfile)
         write (iuout,*) 'task',taskid,inpfile(1:l)
         bufpos = 0
         cmd = 1099
         ldata = 7
         write (iuout,*) 'task',taskid,cmdbuf,bufpos
         write (iuout,*) 'task',taskid,cmd,lcmdbuf
         call mpi_pack (cmd, 1, MPI_INTEGER,
     *        cmdbuf, lcmdbuf, bufpos, MPI_COMM_WORLD)
         write (iuout,*) 'task',taskid,cmdbuf,bufpos
         write (iuout,*) 'task',taskid,ldata,lcmdbuf
         call mpi_pack (ldata, 1, MPI_INTEGER,
     *        cmdbuf, lcmdbuf, bufpos, MPI_COMM_WORLD)
         bcaststart = mpi_wtime()
         write (iuout,*) 'task',taskid,cmdbuf,bufpos
         write (iuout,*) 'task',taskid,bcaststart,lcmdbuf
         call mpi_pack (bcaststart, 1, MPI_DOUBLE_PRECISION,
     *        cmdbuf, lcmdbuf, bufpos, MPI_COMM_WORLD)
         write (iuout,*) 'task',taskid,cmdbuf,bufpos
      end if
      call mpi_bcast (cmdbuf, lcmdbuf, MPI_PACKED,
     *     0, MPI_COMM_WORLD, ierr)
      call mpi_finalize(ierr)
      stop
      end program mainf

program mainf
! ifort -g -I /share/local/lib/ompi/include -o bug bug.f90
!       -L /share/local/lib/ompi/lib -lmpi_f77 -lmpi
! mpirun -np 2 ./bug
  implicit none
  include 'mpif.h'
  character(len=80) :: inpfile
  character(len=1), dimension(80) :: cinpfile
  integer :: l
  integer :: i
  integer :: stat
  integer, dimension(4) :: cmdbuf
  integer :: lcmdbuf
  integer :: ierr
  integer :: ntasks
  integer :: taskid
  integer :: bufpos
  integer :: cmd
  integer :: ldata
  character(len=mpi_max_processor_name) :: hostnm
  integer :: iuinp = 5
  integer :: iuout = 6
  integer :: lnam
  real(8) :: bcaststart
  lcmdbuf = 16
  i = 0
  call mpi_init(ierr)
  call mpi_comm_size (mpi_comm_world, ntasks, ierr)
  call mpi_comm_rank (mpi_comm_world, taskid, ierr)
  hostnm = ' '
  call mpi_get_processor_name (hostnm, lnam, ierr)
  write (iuout,*) 'task',taskid,'of',ntasks,'on ',hostnm(1:lnam)
  if (taskid == 0) then
     inpfile = ' '
     do
        write (iuout,*) 'Enter .inp filename:'
        read (iuinp,*) inpfile
        if (inpfile /= ' ') exit
     end do
     l = len_trim(inpfile)
     do i=1,l
        cinpfile(i) = inpfile(i:i)
     end do
     cinpfile(l+1) = char(0)
     write (iuout,*) 'task',taskid,inpfile(1:l)
     bufpos = 0
     cmd = 1099
     ldata = 7
     write (iuout,*) 'task',taskid,cmdbuf,bufpos
     ! The next two lines exhibit the bug
     ! Uncomment the first and the program works
     ! Uncomment the second and the program dies in mpi_pack
     ! and produces a 137 MB core file.
     write (iuout,*) 'task',taskid,cmd,lcmdbuf
     ! write (iuout,*) 'task',taskid,cmd
     call mpi_pack (cmd, 1, MPI_INTEGER, &
          cmdbuf, lcmdbuf, bufpos, MPI_COMM_WORLD)
     write (iuout,*) 'task',taskid,cmdbuf,bufpos
     write (iuout,*) 'task',taskid,ldata,lcmdbuf
     call mpi_pack (ldata, 1, MPI_INTEGER, &
          cmdbuf, lcmdbuf, bufpos, MPI_COMM_WORLD)
     bcaststart = mpi_wtime()
     write (iuout,*) 'task',taskid,cmdbuf,bufpos
     write (iuout,*) 'task',taskid,bcaststart,lcmdbuf
     call mpi_pack (bcaststart, 1, MPI_DOUBLE_PRECISION, &
          cmdbuf, lcmdbuf, bufpos, MPI_COMM_WORLD)
     write (iuout,*) 'task',taskid,cmdbuf,bufpos
  end if
  call mpi_bcast (cmdbuf, lcmdbuf, MPI_PACKED, &
       0, MPI_COMM_WORLD, ierr)
  call mpi_finalize(ierr)
  stop
end program mainf

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
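(To make Jeff's fix concrete, a minimal corrected sketch simply adds the trailing ierr argument that both bug.f and bug.f90 omitted from their mpi_pack calls:

```fortran
program packfix
  ! Sketch of the corrected call: the Fortran binding of MPI_PACK takes
  ! a final integer error argument, which both test programs left out.
  implicit none
  include 'mpif.h'
  integer :: cmdbuf(4), lcmdbuf, bufpos, cmd, ierr
  lcmdbuf = 16
  bufpos = 0
  cmd = 1099
  call mpi_init(ierr)
  call mpi_pack(cmd, 1, MPI_INTEGER, &
                cmdbuf, lcmdbuf, bufpos, MPI_COMM_WORLD, ierr)
  call mpi_finalize(ierr)
end program packfix
```

The same trailing ierr applies to the mpi_pack calls for ldata and bcaststart.)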
--
Regards,
Mohammad Huwaidi

We can't resolve problems by using the same kind of thinking we used when we created them. --Albert Einstein