I'm not sure why using a group communicator would make a difference - the code area in question knows nothing about the MPI aspects of the job. It looks like you are hitting a race condition that causes a particular internal recv to no longer exist by the time we try to cancel it, which generates that error message.

How did you configure OMPI?
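To illustrate the kind of ordering problem involved, here is a minimal user-level Fortran sketch (illustration only: the receive involved in your failure is an internal ORTE one in C, in base/plm_base_launch_support.c, not anything in your program). The point is simply that a nonblocking receive can only be cancelled while it is still pending, and the cancel still has to be completed with a wait:

-------------------------------------------------
program cancelsketch
  ! illustration only: cancel a pending nonblocking receive
  use mpi
  implicit none
  integer :: ier, req, stat(mpi_status_size)
  integer, dimension(1) :: buf
  logical :: done

  call mpi_init(ier)

  call mpi_irecv(buf, 1, mpi_integer, mpi_any_source, 0, &
                 mpi_comm_world, req, ier)

  ! if the receive has already completed, there is nothing left to cancel
  call mpi_test(req, done, stat, ier)
  if (.not. done) call mpi_cancel(req, ier)  ! valid only while still pending
  call mpi_wait(req, stat, ier)              ! completes the request either way

  call mpi_finalize(ier)

end program cancelsketch
-------------------------------------------------

At the user level the worst case is that the cancel simply does not succeed; the internal ORTE bookkeeping is less forgiving when the request it expects has already gone away, which appears to be where the ORTE_ERROR_LOG message comes from.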
On Oct 3, 2010, at 6:40 PM, Milan Hodoscek wrote:

> Hi,
>
> I am a long time happy user of mpi_comm_spawn() routine. But so far I
> used it only with the MPI_COMM_WORLD communicator. Now I want to
> execute more mpi_comm_spawn() routines, by creating and using group
> communicators. However this seems to have some problems. I can get it
> to run about 50% times on my laptop, but on some more "speedy"
> machines it just produces the following message:
>
> $ mpirun -n 4 a.out
> [ala:31406] [[45304,0],0] ORTE_ERROR_LOG: Not found in file
> base/plm_base_launch_support.c at line 758
> --------------------------------------------------------------------------
> mpirun was unable to start the specified application as it encountered an
> error.
> More information may be available above.
> --------------------------------------------------------------------------
>
> I am attaching the 2 programs needed to test the behavior. Compile:
> $ mpif90 -o sps sps.f08 # spawned program
> $ mpif90 mspbug.f08 # program with problems
> $ mpirun -n 4 a.out
>
> The compiler is gfortran-4.4.4, and openmpi is 1.4.2.
>
> Needless to say it runs with mpich2, but mpich2 doesn't know how to
> deal with stdin on a spawned process, so it's useless for my project :-(
>
> Any ideas?
>
> -------------------------------------------------
> program sps
> use mpi
> implicit none
> integer :: ier,nproc,me,pcomm,meroot,mi,on
> integer, dimension(1:10) :: num
>
> call mpi_init(ier)
>
> mi=mpi_integer
> call mpi_comm_rank(mpi_comm_world,me,ier)
> meroot=0
>
> on=1
>
> call mpi_comm_get_parent(pcomm,ier)
>
> call mpi_bcast(num,on,mi,meroot,pcomm,ier)
> write(*,*)'sps>me,num=',me,num(on)
>
> call mpi_finalize(ier)
>
> end program sps
> -------------------------------------------------
>
> program groupspawn
>
> use mpi
>
> implicit none
> ! in the case use mpi does not work (eg Ubuntu) use the include below
> ! include 'mpif.h'
> integer :: ier,intercom,nproc,meroot,info,mpierrs(1),mcw
> integer :: i,myrepsiz,me,np,mcg,repdgrp,repdcom,on,mi,op
> integer, dimension(1:10) :: myrepgrp
> character(len=5) :: sarg(1),prog
> integer, dimension(1:10) :: num,sm
> integer :: newme,ngrp,igrp
>
> call mpi_init(ier)
>
> prog='sps'
> sarg(1) = ''
> nproc=2
> on=1
> meroot=0
> mcw=mpi_comm_world
> info=mpi_info_null
> mi=mpi_integer
> op=mpi_sum
> mpierrs(1)=mpi_errcodes_ignore(1)
>
> call mpi_comm_rank(mcw,me,ier)
> call mpi_comm_size(mcw,np,ier)
>
> ngrp=2 ! lets have some groups
> myrepsiz=np/ngrp
> igrp=me/myrepsiz
> do i = 1, myrepsiz
> myrepgrp(i)=i+me-mod(me,myrepsiz)-1
> enddo
>
> call mpi_comm_group(mcw,mcg,ier)
> call mpi_group_incl(mcg,myrepsiz,myrepgrp,repdgrp,ier)
> call mpi_comm_create(mcw,repdgrp,repdcom,ier)
>
> ! call mpi_comm_spawn(prog,sarg,nproc,info,meroot,mcw,intercom,mpierrs,ier)
> call mpi_comm_spawn(prog,sarg,nproc,info,meroot,repdcom,intercom,mpierrs,ier)
>
> ! send a number to spawned ones...
>
> call mpi_comm_rank(intercom,newme,ier)
> write(*,*)'me,intercom,newme=',me,intercom,newme
> num(1)=111*(igrp+1)
>
> meroot=mpi_proc_null
> if(newme == 0) meroot=mpi_root ! to send data
>
> call mpi_bcast(num,on,mi,meroot,intercom,ier)
> ! sometimes there is no output from sps programs, so we wait here: WEIRD :-(
> !call sleep(1)
>
> call mpi_finalize(ier)
>
> end program groupspawn
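If it helps to narrow things down, here is a sketch of the same per-group spawn written with mpi_comm_split in place of the comm_group/group_incl/comm_create sequence; it builds the same two sub-communicators with less bookkeeping. It assumes the same 'sps' child binary and mpirun -n 4 as above, and it is only a simplification of the test case, not a fix for the error:

-------------------------------------------------
program splitspawn
  ! sketch: one sub-communicator per group via mpi_comm_split, then each
  ! group spawns its own children and broadcasts over the intercommunicator
  use mpi
  implicit none
  integer :: ier, me, np, igrp, subcom, intercom, newme, meroot
  integer, dimension(1) :: num

  call mpi_init(ier)
  call mpi_comm_rank(mpi_comm_world, me, ier)
  call mpi_comm_size(mpi_comm_world, np, ier)

  igrp = me/max(np/2, 1)                    ! two groups, as with -n 4 above
  call mpi_comm_split(mpi_comm_world, igrp, me, subcom, ier)

  ! each group's rank 0 acts as the root of its own spawn
  call mpi_comm_spawn('sps', mpi_argv_null, 2, mpi_info_null, 0, &
                      subcom, intercom, mpi_errcodes_ignore, ier)

  num(1) = 111*(igrp+1)
  call mpi_comm_rank(intercom, newme, ier)  ! rank within the local group
  meroot = mpi_proc_null
  if (newme == 0) meroot = mpi_root         ! exactly one sender per group
  call mpi_bcast(num, 1, mpi_integer, meroot, intercom, ier)

  call mpi_finalize(ier)

end program splitspawn
-------------------------------------------------

Compile and run it the same way as mspbug.f08; if this variant shows the same intermittent failure, that would point at the spawn path itself rather than at how the sub-communicator is constructed.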