Hi,

The job queue has a time budget, which is set in my job script. For example, my current queue allows 24 hours. But my program received SIGKILL (signal 9) less than 2 hours after it started running.

Are there other possible settings that I need to consider?

thanks

Jack
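For reference, here is a minimal sketch of the kind of job script I am using (the queue name and resource values below are placeholders, not my exact settings). It requests the 24-hour walltime and also records the limits the scheduler actually applied, so a premature SIGKILL can be compared against the walltime, CPU-time, and memory ceilings afterwards:

#!/bin/bash
# Hypothetical PBS/Torque job script -- queue name and resource values are placeholders.
#PBS -q myqueue                 # queue with a 24-hour wall-clock limit (assumption)
#PBS -l walltime=24:00:00       # requested wall-clock time
#PBS -l nodes=25:ppn=8          # 200 processes in total (assumption)
#PBS -N sigkill_debug

cd $PBS_O_WORKDIR

# Record the limits that were actually granted, so a premature kill can be
# matched against walltime, CPU-time, or memory ceilings later.
qstat -f $PBS_JOBID > job_limits.log
ulimit -a >> job_limits.log

mpirun -np 200 ./myapplication

If the kill happens well inside the requested walltime, the remaining suspects are usually a per-queue CPU-time or memory limit, or the kernel OOM killer on a node.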
> From: jsquy...@cisco.com
> Date: Sun, 27 Mar 2011 20:29:11 -0400
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] OMPI error terminate w/o reasons
>
> +1 on what Ralph is saying.
>
> You need to talk to your local administrators and ask them why Torque is killing your job. Perhaps you're submitting to a queue that only allows jobs to run for a few seconds, or something like that.
>
>
> On Mar 27, 2011, at 3:08 PM, Ralph Castain wrote:
>
> > It means that Torque is unhappy with your job - either you are running longer than it permits, or you exceeded some other system limit.
> >
> > Talk to your sys admin about imposed limits. Usually, there are flags you can provide to your job submission that allow you to change limits for your program.
> >
> >
> > On Mar 27, 2011, at 12:59 PM, Jack Bryan wrote:
> >
> >> Hi, I have figured out how to run the command:
> >>
> >> OMPI_RANKFILE=$HOME/$PBS_JOBID.ranks
> >>
> >> mpirun -np 200 -rf $OMPI_RANKFILE --mca btl self,sm,openib -output-filename 700g200i200p14ye ./myapplication
> >>
> >> Each process prints its output to a distinct file.
> >>
> >> But the program is terminated with this error:
> >>
> >> =>> PBS: job killed: node 18 (n314) requested job terminate, 'EOF' (code 1099) - received SISTER_EOF attempting to communicate with sister MOM's
> >> mpirun: Forwarding signal 10 to job
> >> mpirun: killing job...
> >>
> >> --------------------------------------------------------------------------
> >> mpirun was unable to cleanly terminate the daemons on the nodes shown
> >> below. Additional manual cleanup may be required - please refer to
> >> the "orte-clean" tool for assistance.
> >> --------------------------------------------------------------------------
> >> n341
> >> n338
> >> n337
> >> n336
> >> n335
> >> n334
> >> n333
> >> n332
> >> n331
> >> n329
> >> n328
> >> n326
> >> n324
> >> n321
> >> n318
> >> n316
> >> n315
> >> n314
> >> n313
> >> n312
> >> n309
> >> n308
> >> n306
> >> n305
> >>
> >> After searching, I find that the error is probably related to very frequent I/O activity.
> >>
> >> I have also run valgrind to do a memory check, to look for the cause of the original signal 9 (SIGKILL) problem:
> >>
> >> mpirun -np 200 -rf $OMPI_RANKFILE --mca btl self,sm,openib /usr/bin/valgrind --tool=memcheck --error-limit=no --leak-check=yes --log-file=nsga2b_g700_pop200_p200_valg_cystorm_mpi.log ./myapplication
> >>
> >> But I got a similar error to the one above.
> >>
> >> What does the error mean? I cannot change the file system of the cluster.
> >>
> >> I only want to find a way to locate the bug, which only appears when the problem size is very large.
> >>
> >> But I am stuck on the SIGKILL and the SISTER_EOF issues now.
> >>
> >> Any help is really appreciated.
> >>
> >> thanks
> >>
> >> Jack
> >>
> >> From: r...@open-mpi.org
> >> Date: Sat, 26 Mar 2011 20:47:19 -0600
> >> To: us...@open-mpi.org
> >> Subject: Re: [OMPI users] OMPI error terminate w/o reasons
> >>
> >> That command line cannot possibly work. Both the -rf and --output-filename options require arguments.
> >>
> >> PLEASE read the documentation?
> >> mpirun -h, or "man mpirun", will tell you how to correctly use these options.
> >>
> >>
> >> On Mar 26, 2011, at 6:35 PM, Jack Bryan wrote:
> >>
> >> Hi, I used:
> >>
> >> mpirun -np 200 -rf --output-filename /mypath/myapplication
> >>
> >> But no files are printed out.
> >>
> >> Can the "--debug" option help me here?
> >>
> >> When I tried:
> >>
> >> -bash-3.2$ mpirun -debug
> >> --------------------------------------------------------------------------
> >> A suitable debugger could not be found in your PATH. Check the values
> >> specified in the orte_base_user_debugger MCA parameter for the list of
> >> debuggers that was searched.
> >> --------------------------------------------------------------------------
> >>
> >> Any help is really appreciated.
> >>
> >> thanks
> >>
> >> From: r...@open-mpi.org
> >> Date: Sat, 26 Mar 2011 15:45:39 -0600
> >> To: us...@open-mpi.org
> >> Subject: Re: [OMPI users] OMPI error terminate w/o reasons
> >>
> >> If you use that mpirun option, mpirun will place the output from each rank into a -separate- file for you. Give it:
> >>
> >> mpirun --output-filename /myhome/debug/run01
> >>
> >> and in /myhome/debug, you will find files:
> >>
> >> run01.0
> >> run01.1
> >> ...
> >>
> >> each with the output from the indicated rank.
> >>
> >>
> >> On Mar 26, 2011, at 3:41 PM, Jack Bryan wrote:
> >>
> >> The cluster can print all output into one file, but checking it for bugs is very hard.
> >>
> >> The cluster also prints possible error messages into one file, but sometimes the error file is empty, and sometimes it only shows signal 9.
> >>
> >> If I only run dummy tasks on the worker nodes, there are no errors.
> >>
> >> If I run the real task, sometimes processes are terminated without any error messages before the program exits normally. Sometimes the program gets signal 9 but no other error messages.
> >>
> >> It is weird.
> >>
> >> Any help is really appreciated.
> >>
> >> Jack
> >>
> >> From: r...@open-mpi.org
> >> Date: Sat, 26 Mar 2011 15:18:53 -0600
> >> To: us...@open-mpi.org
> >> Subject: Re: [OMPI users] OMPI error terminate w/o reasons
> >>
> >> I don't know, but Ashley may be able to help - or you can see his web site for instructions.
> >>
> >> Alternatively, since you can put print statements into your code, have you considered using mpirun's option to direct output from each rank into its own file? Look at "mpirun -h" for the options.
> >>
> >> -output-filename|--output-filename <arg0>
> >>     Redirect output from application processes into filename.rank
> >>
> >>
> >> On Mar 26, 2011, at 2:48 PM, Jack Bryan wrote:
> >>
> >> Is it possible to make padb print the stack trace and other program execution information into a file?
> >>
> >> I can run the program under gdb like this:
> >>
> >> mpirun -np 200 -e gdb ./myapplication
> >>
> >> How do I make gdb print its debug output to a file, so that I can check it when the program is terminated?
> >>
> >> thanks
> >>
> >> Jack
> >>
> >> From: r...@open-mpi.org
> >> Date: Sat, 26 Mar 2011 13:56:13 -0600
> >> To: us...@open-mpi.org
> >> Subject: Re: [OMPI users] OMPI error terminate w/o reasons
> >>
> >> You don't need to install anything in a system folder - you can just install it in your home directory, assuming that is accessible on the remote nodes.
> >>
> >> As for the script - unless you can somehow modify it to allow you to run under a debugger, I am afraid you are completely out of luck.
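One way to get gdb output into a file per rank, as asked above, is a small wrapper script that runs each rank under gdb in batch mode. A minimal sketch (the name gdb_wrap.sh is hypothetical; it assumes the Open MPI version in use exports OMPI_COMM_WORLD_RANK to each launched process, which recent 1.3.x releases do):

#!/bin/bash
# Hypothetical per-rank gdb wrapper (gdb_wrap.sh). Each MPI rank runs under gdb in
# batch mode; gdb's output, including a full backtrace if the program dies on a
# catchable signal (SIGSEGV, SIGABRT, ...), goes to a per-rank log file.
# Note: SIGKILL cannot be intercepted, so a rank killed with signal 9 only leaves
# the output produced up to that point.
LOG=gdb_rank_${OMPI_COMM_WORLD_RANK}.log
exec gdb -batch -ex run -ex "bt full" --args "$@" > "$LOG" 2>&1

Launched as, for example, mpirun -np 200 ./gdb_wrap.sh ./myapplication; each rank then leaves a gdb_rank_N.log file that can be inspected after the job ends.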
> >>
> >>
> >> On Mar 26, 2011, at 12:54 PM, Jack Bryan wrote:
> >>
> >> Hi,
> >>
> >> I am working on a cluster where I am not allowed to install software in system folders.
> >>
> >> My Open MPI is 1.3.4.
> >>
> >> I have had a very quick look at padb on http://padb.pittman.org.uk/ .
> >>
> >> Does it require some software to be installed on the cluster in order to use it?
> >>
> >> I cannot run jobs on the cluster from the command line, only through a script.
> >>
> >> thanks
> >>
> >> From: r...@open-mpi.org
> >> Date: Sat, 26 Mar 2011 12:12:11 -0600
> >> To: us...@open-mpi.org
> >> Subject: Re: [OMPI users] OMPI error terminate w/o reasons
> >>
> >> Have you tried a parallel debugger such as padb?
> >>
> >> On Mar 26, 2011, at 10:34 AM, Jack Bryan wrote:
> >>
> >> Hi,
> >>
> >> I have tried this. But the printout from 200 parallel processes makes it very hard to locate the possible bug.
> >>
> >> They may not stop at the same point when the program gets signal 9.
> >>
> >> So, even if I can sort out the print statements from all 200 processes, the many different locations where the processes stop make it hard to find any hints about the bug.
> >>
> >> Are there some other programming tricks that can help me narrow down the suspect points as soon as possible?
> >>
> >> Any help is appreciated.
> >>
> >> Jack
> >>
> >> From: r...@open-mpi.org
> >> Date: Sat, 26 Mar 2011 07:53:40 -0600
> >> To: us...@open-mpi.org
> >> Subject: Re: [OMPI users] OMPI error terminate w/o reasons
> >>
> >> Try adding some print statements so you can see where the error occurs.
> >>
> >> On Mar 25, 2011, at 11:49 PM, Jack Bryan wrote:
> >>
> >> Hi, All:
> >>
> >> I am running an Open MPI (1.3.4) program with 200 parallel processes.
> >>
> >> But the program is terminated with:
> >>
> >> --------------------------------------------------------------------------
> >> mpirun noticed that process rank 0 with PID 77967 on node n342 exited on
> >> signal 9 (Killed).
> >> --------------------------------------------------------------------------
> >>
> >> After searching, I found that signal 9 means: the process is currently in an unworkable state and should be terminated with extreme prejudice. If a process does not respond to any other termination signals, sending it a SIGKILL signal will almost always cause it to go away. The system will generate SIGKILL for a process itself under some unusual conditions where the program cannot possibly continue to run (even to run a signal handler).
> >>
> >> But the error message does not indicate any possible reason for the termination.
> >>
> >> There is a FOR loop in the main() program. If the loop count is small (< 200), the program works well, but as it becomes larger and larger, the program gets SIGKILL.
> >>
> >> The cluster where I am running the MPI program does not allow running debug tools.
> >>
> >> If I run it on a workstation, it will take a very long time (for more than 200 loops) to make the error occur again.
> >>
> >> What can I do to find the possible bugs?
> >>
> >> Any help is really appreciated.
> >>
> >> thanks
> >>
> >> Jack
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
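P.S. Since the SIGKILL only shows up when the problem size grows, one thing I plan to try is watching the memory use of the ranks from inside the job script, in case the kill is coming from a memory ceiling or the kernel OOM killer rather than from walltime. A rough sketch (it assumes the application binary is named myapplication and that the procps "ps" command is available; it only sees the ranks on the node where the script runs):

# Hypothetical memory watchdog started from the job script before mpirun.
# Every 60 seconds it records the resident set size (RSS) and virtual size (VSZ)
# of each local rank, so a later SIGKILL can be correlated with memory growth.
(
  while true; do
    date
    ps -C myapplication -o pid,rss,vsz,comm
    sleep 60
  done
) > mem_watch_$(hostname).log 2>&1 &
WATCHDOG_PID=$!

mpirun -np 200 -rf $OMPI_RANKFILE --mca btl self,sm,openib ./myapplication

kill $WATCHDOG_PID

If the RSS climbs steadily with each loop iteration, the valgrind leak check mentioned earlier in the thread is the natural next step.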