Hi, 
The job queue has a time budget, which is set in my job script.
For example, my current queue's time limit is 24 hours.
But my program got SIGKILL (signal 9) less than 2 hours after it
began to run.
Are there other possible settings that I need to consider ? 
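(One other limit worth ruling out is memory: on many clusters a job that exceeds its memory allocation is killed with SIGKILL and no message, e.g. by the kernel OOM killer or a Torque mem/pvmem limit. Below is a minimal sketch, assuming Linux and the /proc filesystem, for logging resident memory from inside the program; my real application is C++, so this Python version is purely illustrative of the technique.)

```python
def rss_kb():
    """Return this process's resident set size in kB (Linux /proc)."""
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # VmRSS is reported in kB
    return 0

# Inside the job's main loop, log memory periodically, e.g.:
# if i % 10 == 0:
#     print("iter", i, "RSS kB:", rss_kb(), flush=True)
```

If the logged RSS grows steadily with the loop count, the kill at large problem sizes is likely a memory limit rather than walltime.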
thanks
Jack

> From: jsquy...@cisco.com
> Date: Sun, 27 Mar 2011 20:29:11 -0400
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] OMPI error terminate w/o reasons
> 
> +1 on what Ralph is saying.
> 
> You need to talk to your local administrators and ask them why Torque is 
> killing your job.  Perhaps you're submitting to a queue that only allows jobs 
> to run for a few seconds, or something like that.
> 
> 
> On Mar 27, 2011, at 3:08 PM, Ralph Castain wrote:
> 
> > It means that Torque is unhappy with your job - either you are running 
> > longer than it permits, or you exceeded some other system limit.
> > 
> > Talk to your sys admin about imposed limits. Usually, there are flags you 
> > can provide to your job submission that allow you to change limits for your 
> > program.
> > 
> > 
> > On Mar 27, 2011, at 12:59 PM, Jack Bryan wrote:
> > 
> >> Hi, I have figured out how to run the command. 
> >> 
> >> OMPI_RANKFILE=$HOME/$PBS_JOBID.ranks
> >> 
> >>  mpirun -np 200  -rf $OMPI_RANKFILE --mca btl self,sm,openib 
> >> -output-filename 700g200i200p14ye  ./myapplication 
> >> 
> >> Each process prints its output to a distinct file.
> >> 
> >> But the program was terminated with this error :
> >> ---------------------------------------------------------------------------------------------------------------------
> >> =>> PBS: job killed: node 18 (n314) requested job terminate, 'EOF' (code 
> >> 1099) - received SISTER_EOF attempting to communicate with sister MOM's
> >> mpirun: Forwarding signal 10 to job
> >> mpirun: killing job...
> >> 
> >> --------------------------------------------------------------------------
> >> mpirun was unable to cleanly terminate the daemons on the nodes shown
> >> below. Additional manual cleanup may be required - please refer to
> >> the "orte-clean" tool for assistance.
> >> --------------------------------------------------------------------------
> >>         n341
> >>         n338
> >>         n337
> >>         n336
> >>         n335
> >>         n334
> >>         n333
> >>         n332
> >>         n331
> >>         n329
> >>         n328
> >>         n326
> >>         n324
> >>         n321
> >>         n318
> >>         n316
> >>         n315
> >>         n314
> >>         n313
> >>         n312
> >>         n309
> >>         n308
> >>         n306
> >>         n305
> >> 
> >> --------------------------------------------------------------------
> >> 
> >> After searching, I find that the error is probably related to very 
> >> frequent I/O activity. 
> >> 
> >> I have also run valgrind's memcheck to look for the possible cause of 
> >> the original 
> >> signal 9 (SIGKILL) problem. 
> >> 
> >> mpirun -np 200 -rf $OMPI_RANKFILE --mca btl self,sm,openib  
> >> /usr/bin/valgrind --tool=memcheck --error-limit=no --leak-check=yes 
> >> --log-file=nsga2b_g700_pop200_p200_valg_cystorm_mpi.log  ./myapplication 
> >> 
> >> But I got a similar error to the one above. 
> >> 
> >> What does this error mean ?   
> >> I cannot change the file system of the cluster. 
> >> 
> >> I only want to find a way to locate the bug, which appears only when 
> >> the problem size is very large. 
> >> 
> >> But I am stuck on the SIGKILL and then the above sister-MOM issues now. 
> >> 
> >> Any help is really appreciated. 
> >> 
> >> thanks
> >> 
> >> Jack 
> >> 
> >> --------------------------------------------------------------------------------------------------------
> >> From: r...@open-mpi.org
> >> Date: Sat, 26 Mar 2011 20:47:19 -0600
> >> To: us...@open-mpi.org
> >> Subject: Re: [OMPI users] OMPI error terminate w/o reasons
> >> 
> >> That command line cannot possibly work. Both the -rf and --output-filename 
> >> options require arguments.
> >> 
> >> PLEASE read the documentation: "mpirun -h" or "man mpirun" will tell 
> >> you how to use these options correctly.
> >> 
> >> 
> >> On Mar 26, 2011, at 6:35 PM, Jack Bryan wrote:
> >> 
> >> Hi, I used : 
> >> 
> >>  mpirun -np 200 -rf  --output-filename /mypath/myapplication
> >> But no files were printed out.
> >> 
> >> Can the "--debug" option help me here ? 
> >> 
> >> When I tried :
> >> 
> >> -bash-3.2$ mpirun -debug
> >> --------------------------------------------------------------------------
> >> A suitable debugger could not be found in your PATH.  Check the values
> >> specified in the orte_base_user_debugger MCA parameter for the list of
> >> debuggers that was searched.
> >> --------------------------------------------------------------------------
> >> Any help is really appreciated. 
> >> 
> >> thanks
> >> 
> >> From: r...@open-mpi.org
> >> Date: Sat, 26 Mar 2011 15:45:39 -0600
> >> To: us...@open-mpi.org
> >> Subject: Re: [OMPI users] OMPI error terminate w/o reasons
> >> 
> >> If you use that mpirun option, mpirun will place the output from each rank 
> >> into a -separate- file for you. Give it:
> >> 
> >> mpirun --output-filename /myhome/debug/run01
> >> 
> >> and in /myhome/debug, you will find files:
> >> 
> >> run01.0
> >> run01.1
> >> ...
> >> 
> >> each with the output from the indicated rank.
> >> 
> >> 
> >> 
> >> On Mar 26, 2011, at 3:41 PM, Jack Bryan wrote:
> >> 
> >> The cluster can print all output into one file. 
> >> 
> >> But checking it for bugs is very hard. 
> >> 
> >> The cluster also prints possible error messages into one file. 
> >> 
> >> But sometimes the error file is empty, and sometimes it reports signal 9.
> >> 
> >> If I only run dummy tasks on the worker nodes, there are no errors. 
> >> 
> >> If I run the real task, sometimes processes are terminated without any 
> >> errors before the program exits normally.
> >> Sometimes the program gets signal 9 but no other error messages. 
> >> 
> >> It is weird. 
> >> 
> >> Any help is really appreciated. 
> >> 
> >> Jack
> >> From: r...@open-mpi.org
> >> Date: Sat, 26 Mar 2011 15:18:53 -0600
> >> To: us...@open-mpi.org
> >> Subject: Re: [OMPI users] OMPI error terminate w/o reasons
> >> 
> >> I don't know, but Ashley may be able to help - or you can see his web site 
> >> for instructions.
> >> 
> >> Alternatively, since you can put print statements into your code, have you 
> >> considered using mpirun's option to direct output from each rank into its 
> >> own file? Look at "mpirun -h" for the options.
> >> 
> >>    -output-filename|--output-filename <arg0>  
> >>                          Redirect output from application processes into
> >>                          filename.rank
> >> 
> >> 
> >> On Mar 26, 2011, at 2:48 PM, Jack Bryan wrote:
> >> 
> >> Is it possible to enable padb to print out the stack trace and other 
> >> program execute information into a file ?
> >> 
> >> I can run the program under gdb like this: 
> >> 
> >> mpirun -np 200 -e gdb ./myapplication 
> >> 
> >> How can I make gdb write its debug output to a file, 
> >> so that I can check it after the program is terminated ? 
> >> 
> >> thanks
> >> 
> >> Jack
> >> 
> >> From: r...@open-mpi.org
> >> Date: Sat, 26 Mar 2011 13:56:13 -0600
> >> To: us...@open-mpi.org
> >> Subject: Re: [OMPI users] OMPI error terminate w/o reasons
> >> 
> >> You don't need to install anything on a system folder - you can just 
> >> install it in your home directory, assuming that is accessible on the 
> >> remote nodes.
> >> 
> >> As for the script - unless you can somehow modify it to allow you to run 
> >> under a debugger, I am afraid you are completely out of luck.
> >> 
> >> 
> >> On Mar 26, 2011, at 12:54 PM, Jack Bryan wrote:
> >> 
> >> Hi, 
> >> 
> >> I am working on a cluster where I am not allowed to install software 
> >> in system folders. 
> >> 
> >> My Open MPI version is 1.3.4. 
> >> 
> >> I have had a very quick look at padb on http://padb.pittman.org.uk/ . 
> >> 
> >> Does it require installing any software on the cluster in order to use it ? 
> >> 
> >> I cannot run jobs on the cluster from the command line, only through a script.
> >> 
> >> thanks
> >> 
> >> From: r...@open-mpi.org
> >> Date: Sat, 26 Mar 2011 12:12:11 -0600
> >> To: us...@open-mpi.org
> >> Subject: Re: [OMPI users] OMPI error terminate w/o reasons
> >> 
> >> Have you tried a parallel debugger such as padb?
> >> 
> >> On Mar 26, 2011, at 10:34 AM, Jack Bryan wrote:
> >> 
> >> Hi, 
> >> 
> >> I have tried this. But the printout from 200 parallel processes makes it 
> >> very hard to locate the possible bug. 
> >> 
> >> The processes may not all stop at the same point when the program gets signal 9.
> >> 
> >> So even if I can sort through the print statements from all
> >> 200 processes, the many different locations where the processes
> >> stop make it harder to find any hints about the bug. 
> >> 
> >> Are there other programming tricks that could help me 
> >> narrow down the suspect points quickly ?
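(One common trick: have each rank append flushed progress markers to its own file, so that after a kill the last line of each file shows how far that rank got. A sketch in Python, assuming the rank is available via the OMPI_COMM_WORLD_RANK environment variable that recent Open MPI versions set; in a C/C++ MPI program the rank from MPI_Comm_rank serves the same purpose.)

```python
import os

rank = int(os.environ.get("OMPI_COMM_WORLD_RANK", "0"))
log = open("progress.%d.log" % rank, "w")

def checkpoint(tag):
    # Flush and fsync immediately so the last marker survives a SIGKILL.
    log.write(tag + "\n")
    log.flush()
    os.fsync(log.fileno())

# e.g. call checkpoint("iter %d" % i) at the top of every loop iteration;
# after a kill, the last line of each rank's file shows where it stopped.
```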
> >> Any help is appreciated. 
> >> 
> >> Jack
> >> 
> >> From: r...@open-mpi.org
> >> Date: Sat, 26 Mar 2011 07:53:40 -0600
> >> To: us...@open-mpi.org
> >> Subject: Re: [OMPI users] OMPI error terminate w/o reasons
> >> 
> >> Try adding some print statements so you can see where the error occurs.
> >> 
> >> On Mar 25, 2011, at 11:49 PM, Jack Bryan wrote:
> >> 
> >> Hi , All: 
> >> 
> >> I am running an Open MPI (1.3.4) program with 200 parallel processes. 
> >> 
> >> But the program was terminated with 
> >> 
> >> --------------------------------------------------------------------------
> >> mpirun noticed that process rank 0 with PID 77967 on node n342 exited on 
> >> signal 9 (Killed).
> >> --------------------------------------------------------------------------
> >> 
> >> After searching, I found that signal 9 means: 
> >> 
> >> the process is currently in an unworkable state and should be terminated 
> >> with extreme prejudice
> >> 
> >>  If a process does not respond to any other termination signals, sending 
> >> it a SIGKILL signal will almost always cause it to go away.
> >> 
> >>  The system will generate SIGKILL for a process itself under some unusual 
> >> conditions where the program cannot possibly continue to run (even to run 
> >> a signal handler).
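(In other words, SIGKILL is the one termination signal a process can never handle; a quick Python sketch confirming that the OS rejects any attempt to install a handler for it:)

```python
import signal

def handler(signum, frame):
    pass

signal.signal(signal.SIGTERM, handler)   # SIGTERM can be caught...

try:
    signal.signal(signal.SIGKILL, handler)  # ...but SIGKILL cannot
except OSError as err:
    print("cannot catch SIGKILL:", err)
```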
> >>  
> >> But the error message does not indicate any possible reason for the 
> >> termination. 
> >> 
> >> There is a FOR loop in the main() program. If the loop count is small (< 
> >> 200), the program works well, 
> >> but as it becomes larger and larger, the program gets SIGKILL. 
> >> 
> >> The cluster where I am running the MPI program does not allow running 
> >> debug tools. 
> >> 
> >> If I run it on a workstation, it takes a very long time (for > 
> >> 200 loops) to 
> >> make the error occur again. 
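(That pattern, fine for small loop counts but killed for large ones, often points to per-iteration growth such as a leak. A sketch of comparing allocations across iterations with Python's tracemalloc; it is illustrative only, since for a C/C++ MPI program the valgrind memcheck run serves the same purpose.)

```python
import tracemalloc

tracemalloc.start()
retained = []

before = tracemalloc.take_snapshot()
for i in range(100):
    retained.append([0] * 1000)   # stands in for real per-iteration allocations
after = tracemalloc.take_snapshot()

# The top growth sites between the snapshots show what accumulates per loop.
for stat in after.compare_to(before, "lineno")[:3]:
    print(stat)
```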
> >> 
> >> What can I do to find the possible bugs ? 
> >> 
> >> Any help is really appreciated. 
> >> 
> >> thanks
> >> 
> >> Jack
> >> 
> >> 
> >> 
> >> 
> >> 
> >> _______________________________________________
> >> users mailing list
> >> us...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >> 
> >> 
> > 
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 