This might not have anything to do with your problem, but how do you finalize your worker nodes when your master loop terminates?
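
For what it's worth, here is a minimal sketch of the pattern I have in mind (the tag names, the loop bound, and the int payload are placeholders for illustration, not taken from your code): after the master loop finishes, the master sends every worker an explicit stop message, each worker breaks out of its receive loop on that tag, and every rank reaches MPI_Finalize. A worker that is never told to stop can block in MPI_Recv indefinitely.

/* Minimal sketch of a tag-based shutdown; WORK_TAG, STOP_TAG, the loop
 * bound, and the int payload are made-up placeholders. */
#include <mpi.h>

#define WORK_TAG 1
#define STOP_TAG 2

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                              /* master */
        for (int loop = 0; loop < 700; ++loop) {
            /* ... distribute tasks with WORK_TAG, collect results ... */
        }
        /* after the loop: tell every worker it is done */
        int dummy = 0;
        for (int w = 1; w < size; ++w)
            MPI_Send(&dummy, 1, MPI_INT, w, STOP_TAG, MPI_COMM_WORLD);
    } else {                                      /* worker */
        while (1) {
            int task;
            MPI_Status status;
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            if (status.MPI_TAG == STOP_TAG)
                break;                            /* master is finished */
            /* ... run the task (e.g. CPLEX call), send result back ... */
        }
    }

    MPI_Finalize();                               /* every rank reaches this */
    return 0;
}

If the workers instead count loop iterations themselves, a mismatch between the master's and the workers' loop counts is another common way ranks end up blocked at the end of a run.
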
On Sun, Mar 27, 2011 at 3:27 PM, Jack Bryan <dtustud...@hotmail.com> wrote:
> Hi, my original bug is:
>
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 77967 on node n342 exited on
> signal 9 (Killed).
> --------------------------------------------------------------------------
>
> The main framework of my code is:
>
> main()
> {
>     for master node:
>         while (loop <= LOOP_NUMBER)
>         {
>             master node distributes tasks to workers;
>             master collects results from workers;
>             ++loop;
>         }
>     for worker nodes:
>     {
>         get the task;
>         run the task;        // call CPLEX API lib
>         return results to master;
>     }
> }
>
> When LOOP_NUMBER <= 600 (with 200 parallel processes), it works well.
> But when LOOP_NUMBER >= 700 (with 200 parallel processes), I get the error
> shown above.
>
> Could some limit imposed by Torque be the reason for this error?
>
> It seems that Torque complains about high I/O caused by each process
> printing output.
>
> But if I comment out the print statements in my code, the Torque complaints
> go away, yet the signal 9 error is still there.
>
> Any help is really appreciated.
>
> thanks
>
> Jack
>
>
> ------------------------------
> From: r...@open-mpi.org
> Date: Sun, 27 Mar 2011 13:08:31 -0600
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] OMPI error terminate w/o reasons
>
> It means that Torque is unhappy with your job - either you are running
> longer than it permits, or you exceeded some other system limit.
>
> Talk to your sys admin about imposed limits. Usually, there are flags you
> can provide to your job submission that allow you to change limits for your
> program.
>
>
> On Mar 27, 2011, at 12:59 PM, Jack Bryan wrote:
>
> Hi, I have figured out how to run the command:
>
> OMPI_RANKFILE=$HOME/$PBS_JOBID.ranks
>
> mpirun -np 200 -rf $OMPI_RANKFILE --mca btl self,sm,openib
>     -output-filename 700g200i200p14ye ./myapplication
>
> Each process prints its output to a distinct file.
>
> But the program is terminated with the error:
>
> ---------------------------------------------------------------------------------------------------------------------
> =>> PBS: job killed: node 18 (n314) requested job terminate, 'EOF' (code
> 1099) - received SISTER_EOF attempting to communicate with sister MOM's
> mpirun: Forwarding signal 10 to job
> mpirun: killing job...
>
> --------------------------------------------------------------------------
> mpirun was unable to cleanly terminate the daemons on the nodes shown
> below. Additional manual cleanup may be required - please refer to
> the "orte-clean" tool for assistance.
> --------------------------------------------------------------------------
> n341
> n338
> n337
> n336
> n335
> n334
> n333
> n332
> n331
> n329
> n328
> n326
> n324
> n321
> n318
> n316
> n315
> n314
> n313
> n312
> n309
> n308
> n306
> n305
>
> --------------------------------------------------------------------
>
> After searching, I find that the error is probably related to highly
> frequent I/O activity.
>
> I have also run valgrind to do a memory check, to find the possible cause
> of the original signal 9 (SIGKILL) problem:
>
> mpirun -np 200 -rf $OMPI_RANKFILE --mca btl self,sm,openib
>     /usr/bin/valgrind --tool=memcheck --error-limit=no --leak-check=yes
>     --log-file=nsga2b_g700_pop200_p200_valg_cystorm_mpi.log ./myapplication
>
> But I got a similar error to the one above.
>
> What does the error mean?
> I cannot change the file system of the cluster.
>
> I only want to find a way to locate the bug, which only appears when the
> problem size is very large.
>
> But I am stuck on the SIGKILL and now also on the MOM sister (SISTER_EOF)
> issue above.
>
> Any help is really appreciated.
>
> thanks
>
> Jack
>
>
> --------------------------------------------------------------------------------------------------------
> From: r...@open-mpi.org
> Date: Sat, 26 Mar 2011 20:47:19 -0600
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] OMPI error terminate w/o reasons
>
> That command line cannot possibly work. Both the -rf and --output-filename
> options require arguments.
>
> PLEASE read the documentation? mpirun -h, or "man mpirun" will tell you how
> to correctly use these options.
>
>
> On Mar 26, 2011, at 6:35 PM, Jack Bryan wrote:
>
> Hi, I used:
>
> mpirun -np 200 -rf --output-filename /mypath/myapplication
>
> But no files are printed out.
>
> Can the "--debug" option help me here?
>
> When I tried:
>
> -bash-3.2$ mpirun -debug
> --------------------------------------------------------------------------
> A suitable debugger could not be found in your PATH. Check the values
> specified in the orte_base_user_debugger MCA parameter for the list of
> debuggers that was searched.
> --------------------------------------------------------------------------
>
> Any help is really appreciated.
>
> thanks
>
> ------------------------------
> From: r...@open-mpi.org
> Date: Sat, 26 Mar 2011 15:45:39 -0600
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] OMPI error terminate w/o reasons
>
> If you use that mpirun option, mpirun will place the output from each rank
> into a -separate- file for you. Give it:
>
> mpirun --output-filename /myhome/debug/run01
>
> and in /myhome/debug, you will find files:
>
> run01.0
> run01.1
> ...
>
> each with the output from the indicated rank.
>
>
> On Mar 26, 2011, at 3:41 PM, Jack Bryan wrote:
>
> The cluster can collect all output into one file,
> but checking it for bugs is very hard.
>
> The cluster also writes possible error messages into one file,
> but sometimes that error file is empty, and sometimes it only shows the
> signal 9 message.
>
> If I run only dummy tasks on the worker nodes, there are no errors.
>
> If I run the real task, sometimes processes are terminated without any
> errors before the program exits normally.
> Sometimes the program gets signal 9 but no other error messages.
>
> It is weird.
>
> Any help is really appreciated.
>
> Jack
> ------------------------------
> From: r...@open-mpi.org
> Date: Sat, 26 Mar 2011 15:18:53 -0600
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] OMPI error terminate w/o reasons
>
> I don't know, but Ashley may be able to help - or you can see his web site
> for instructions.
>
> Alternatively, since you can put print statements into your code, have you
> considered using mpirun's option to direct output from each rank into its
> own file? Look at "mpirun -h" for the options.
>
> -output-filename|--output-filename <arg0>
>     Redirect output from application processes into filename.rank
>
>
> On Mar 26, 2011, at 2:48 PM, Jack Bryan wrote:
>
> Is it possible to make padb print the stack trace and other program
> execution information to a file?
>
> I can run the program under gdb like this:
>
> mpirun -np 200 -e gdb ./myapplication
>
> How can I make gdb print the debug information to a file,
> so that I can check it after the program is terminated?
>
> thanks
>
> Jack
>
> ------------------------------
> From: r...@open-mpi.org
> Date: Sat, 26 Mar 2011 13:56:13 -0600
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] OMPI error terminate w/o reasons
>
> You don't need to install anything in a system folder - you can just
> install it in your home directory, assuming that is accessible on the
> remote nodes.
>
> As for the script - unless you can somehow modify it to allow you to run
> under a debugger, I am afraid you are completely out of luck.
>
>
> On Mar 26, 2011, at 12:54 PM, Jack Bryan wrote:
>
> Hi,
>
> I am working on a cluster where I am not allowed to install software in
> system folders.
>
> My Open MPI is 1.3.4.
>
> I have had a very quick look at padb on http://padb.pittman.org.uk/ .
>
> Does it require some software to be installed on the cluster in order to
> use it?
>
> I cannot run jobs on the cluster from the command line, only through a
> script.
>
> thanks
>
> ------------------------------
> From: r...@open-mpi.org
> Date: Sat, 26 Mar 2011 12:12:11 -0600
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] OMPI error terminate w/o reasons
>
> Have you tried a parallel debugger such as padb?
>
> On Mar 26, 2011, at 10:34 AM, Jack Bryan wrote:
>
> Hi,
>
> I have tried this. But the printout from 200 parallel processes makes it
> very hard to locate the possible bug.
>
> The processes may not all stop at the same point when the program gets
> signal 9.
>
> So even if I can sort through the print statements from all 200 processes,
> the many different locations where the processes stop make it hard to find
> any hints about the bug.
>
> Are there other programming tricks that could help me narrow down the
> suspect spots quickly?
> Any help is appreciated.
>
> Jack
>
> ------------------------------
> From: r...@open-mpi.org
> Date: Sat, 26 Mar 2011 07:53:40 -0600
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] OMPI error terminate w/o reasons
>
> Try adding some print statements so you can see where the error occurs.
>
> On Mar 25, 2011, at 11:49 PM, Jack Bryan wrote:
>
> Hi, All:
>
> I am running an Open MPI (1.3.4) program with 200 parallel processes.
>
> But the program is terminated with
>
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 77967 on node n342 exited on
> signal 9 (Killed).
> --------------------------------------------------------------------------
>
> After searching, I found that signal 9 means:
>
>     the process is currently in an unworkable state and should be
>     terminated with extreme prejudice
>
>     If a process does not respond to any other termination signals, sending
>     it a SIGKILL signal will almost always cause it to go away.
>
>     The system will generate SIGKILL for a process itself under some
>     unusual conditions where the program cannot possibly continue to run
>     (even to run a signal handler).
>
> But the error message does not indicate any possible reason for the
> termination.
>
> There is a FOR loop in the main() program. If the loop count is small
> (< 200), the program works well, but as it becomes larger and larger, the
> program gets SIGKILL.
>
> The cluster where I am running the MPI program does not allow running
> debug tools.
>
> If I run it on a workstation, it will take a very, very long time (for
> > 200 loops) to make the error occur again.
>
> What can I do to find the possible bug?
>
> Any help is really appreciated.
>
> thanks
>
> Jack
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
David Zhang
University of California, San Diego