I have verified that Maui is killing the job. I actually ran into this with another user all of a sudden; I don't know why it's only affecting a few users at the moment. Here's the Maui log extract for a current run of this user's program:
-----------
[root@aeolus log]# grep 2120 *
maui.log:05/29 09:01:45 INFO: job '2118' loaded: 1 patton patton 1800 Idle 0 1212076905 [NONE] [NONE] [NONE] >= 0 >= 0 [NONE] 1212076905
maui.log:05/29 09:23:40 INFO: job '2119' loaded: 1 patton patton 1800 Idle 0 1212078218 [NONE] [NONE] [NONE] >= 0 >= 0 [NONE] 1212078220
maui.log:05/29 09:26:19 MPBSJobLoad(2120,2120.aeolus.eecs.wsu.edu,J,TaskList,0)
maui.log:05/29 09:26:19 MReqCreate(2120,SrcRQ,DstRQ,DoCreate)
maui.log:05/29 09:26:19 MJobSetCreds(2120,patton,patton,)
maui.log:05/29 09:26:19 INFO: default QOS for job 2120 set to DEFAULT(0) (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
maui.log:05/29 09:26:19 INFO: default QOS for job 2120 set to DEFAULT(0) (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
maui.log:05/29 09:26:19 INFO: default QOS for job 2120 set to DEFAULT(0) (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
maui.log:05/29 09:26:19 INFO: job '2120' loaded: 1 patton patton 1800 Idle 0 1212078378 [NONE] [NONE] [NONE] >= 0 >= 0 [NONE] 1212078379
maui.log:05/29 09:26:19 INFO: job '2120' Priority: 1
maui.log:05/29 09:26:19 INFO: job '2120' Priority: 1
maui.log:05/29 09:26:19 INFO: 8 feasible tasks found for job 2120:0 in partition DEFAULT (1 Needed)
maui.log:05/29 09:26:19 INFO: 1 requested hostlist tasks allocated for job 2120 (0 remain)
maui.log:05/29 09:26:19 MJobStart(2120)
maui.log:05/29 09:26:19 MJobDistributeTasks(2120,base,NodeList,TaskMap)
maui.log:05/29 09:26:19 MAMAllocJReserve(2120,RIndex,ErrMsg)
maui.log:05/29 09:26:19 MRMJobStart(2120,Msg,SC)
maui.log:05/29 09:26:19 MPBSJobStart(2120,base,Msg,SC)
maui.log:05/29 09:26:19 MPBSJobModify(2120,Resource_List,Resource,compute-0-0.local)
maui.log:05/29 09:26:19 MPBSJobModify(2120,Resource_List,Resource,1)
maui.log:05/29 09:26:19 INFO: job '2120' successfully started
maui.log:05/29 09:26:19 MStatUpdateActiveJobUsage(2120)
maui.log:05/29 09:26:19 MResJCreate(2120,MNodeList,00:00:00,ActiveJob,Res)
maui.log:05/29 09:26:19 INFO: starting job '2120'
maui.log:05/29 09:26:50 INFO: node compute-0-0.local has joblist '0/2120.aeolus.eecs.wsu.edu'
maui.log:05/29 09:26:50 INFO: job 2120 adds 1 processors per task to node compute-0-0.local (1)
maui.log:05/29 09:26:50 MPBSJobUpdate(2120,2120.aeolus.eecs.wsu.edu,TaskList,0)
maui.log:05/29 09:26:50 MStatUpdateActiveJobUsage(2120)
maui.log:05/29 09:26:50 MResDestroy(2120)
maui.log:05/29 09:26:50 MResChargeAllocation(2120,2)
maui.log:05/29 09:26:50 MResJCreate(2120,MNodeList,-00:00:31,ActiveJob,Res)
maui.log:05/29 09:26:50 INFO: job '2120' Priority: 1
maui.log:05/29 09:26:50 INFO: job '2120' Priority: 1
maui.log:05/29 09:27:21 INFO: node compute-0-0.local has joblist '0/2120.aeolus.eecs.wsu.edu'
maui.log:05/29 09:27:21 INFO: job 2120 adds 1 processors per task to node compute-0-0.local (1)
maui.log:05/29 09:27:21 MPBSJobUpdate(2120,2120.aeolus.eecs.wsu.edu,TaskList,0)
maui.log:05/29 09:27:21 MStatUpdateActiveJobUsage(2120)
maui.log:05/29 09:27:21 MResDestroy(2120)
maui.log:05/29 09:27:21 MResChargeAllocation(2120,2)
maui.log:05/29 09:27:21 MResJCreate(2120,MNodeList,-00:01:02,ActiveJob,Res)
maui.log:05/29 09:27:21 INFO: job '2120' Priority: 1
maui.log:05/29 09:27:21 INFO: job '2120' Priority: 1
maui.log:05/29 09:27:21 INFO: job 2120 exceeds requested proc limit (3.72 > 1.00)
maui.log:05/29 09:27:21 MSysRegEvent(JOBRESVIOLATION: job '2120' in state 'Running' has exceeded PROC resource limit (372 > 100) (action CANCEL will be taken) job start time: Thu May 29 09:26:19
maui.log:05/29 09:27:21 MRMJobCancel(2120,job violates resource utilization policies,SC)
maui.log:05/29 09:27:21 MPBSJobCancel(2120,base,CMsg,Msg,job violates resource utilization policies)
maui.log:05/29 09:27:21 INFO: job '2120' successfully cancelled
maui.log:05/29 09:27:27 INFO: active PBS job 2120 has been removed from the queue.
assuming successful completion
maui.log:05/29 09:27:27 MJobProcessCompleted(2120)
maui.log:05/29 09:27:27 MAMAllocJDebit(A,2120,SC,ErrMsg)
maui.log:05/29 09:27:27 INFO: job '2120' completed. QueueTime: 1 RunTime: 62 Accuracy: 3.44 XFactor: 0.04
maui.log:05/29 09:27:27 INFO: job '2120' completed X: 0.035000 T: 62 PS: 62 A: 0.034444
maui.log:05/29 09:27:27 MJobSendFB(2120)
maui.log:05/29 09:27:27 INFO: job usage sent for job '2120'
maui.log:05/29 09:27:27 MJobRemove(2120)
maui.log:05/29 09:27:27 MResDestroy(2120)
maui.log:05/29 09:27:27 MResChargeAllocation(2120,2)
maui.log:05/29 09:27:27 MJobDestroy(2120)
maui.log:05/29 09:42:54 INFO: job '2121' loaded: 1 sledburg sledburg 1800 Idle 0 1212079373 [NONE] [NONE] [NONE] >= 0 >= 0 [NONE] 1212079374
maui.log:05/29 09:43:34 INFO: job '2122' loaded: 1 sledburg sledburg 1800 Idle 0 1212079413 [NONE] [NONE] [NONE] >= 0 >= 0 [NONE] 1212079414
[root@aeolus log]#
------------------------------

Any thoughts? Thank you.

On Wed, May 28, 2008 at 5:21 AM, Jeff Squyres <jsquy...@cisco.com> wrote:
> (I'm not a subscriber to the torqueusers or mauiusers lists -- I'm not
> sure my post will get through)
>
> I wonder if Jan's idea has merit -- if Torque is killing the job for
> some other reason (i.e., not wallclock). The message printed by
> mpirun ("mpirun: killing job...") is *only* displayed if mpirun
> receives a SIGINT or SIGTERM. So perhaps some other resource limit is
> being reached...?
>
> Is there a way to have Torque log if it is killing a job for some
> reason?
>
> On May 27, 2008, at 7:02 PM, Jim Kusznir wrote:
>
>> Yep. Wall time is nowhere near violation (dies about 2 minutes into
>> a 30 minute allocation). I did a ulimit -a through qsub and direct on
>> the node (as the same user in both cases), and the results were
>> identical (most items were unlimited).
>>
>> Any other ideas?
>>
>> --Jim
>>
>> On Tue, May 27, 2008 at 9:25 AM, Jan Ploski <jan.plo...@offis.de> wrote:
>>>
>>> This suggestion is rather trivial, but since you have not mentioned
>>> anything in this area:
>>>
>>> Are you sure that the job is not exceeding resource limits (walltime -
>>> enforced by TORQUE, or rlimits such as memory - enforced by the kernel,
>>> but they could be set differently in TORQUE and your manual invocations
>>> of mpirun)?
>>>
>>> Regards,
>>> Jan Ploski
>>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> Cisco Systems
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
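For the archives, a note on what the log actually shows: the decisive entries are "exceeds requested proc limit (3.72 > 1.00)" and the JOBRESVIOLATION event. Maui measured roughly 3.7 processors' worth of CPU use against the single processor the job requested and took the CANCEL action. If I'm reading the Maui admin docs correctly, that enforcement is governed by the RESOURCELIMITPOLICY setting in maui.cfg (something like `RESOURCELIMITPOLICY PROC:ALWAYS:CANCEL`), so relaxing that action temporarily would confirm the diagnosis. As a small illustration (my own awk one-liner, not a Maui tool), the two figures can be pulled straight out of such a log line:

```shell
#!/bin/sh
# Illustrative parse of the Maui violation line quoted above: split the
# line on '(', '>' and ')', then strip spaces from the two numeric
# fields to recover the used vs. requested processor counts.
line="maui.log:05/29 09:27:21 INFO: job 2120 exceeds requested proc limit (3.72 > 1.00)"
echo "$line" | awk -F'[(>)]' '{ gsub(/ /, "", $2); gsub(/ /, "", $3); print "used=" $2, "requested=" $3 }'
# prints: used=3.72 requested=1.00
```

A used figure of almost four times the request would be consistent with mpirun launching more processes than the one slot Torque granted, which could also explain why only some users' jobs trip the limit.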