I have verified that Maui is killing the job. I actually just ran into
this with another user as well; I don't know why it's only affecting a
few users at the moment. Here's the Maui log extract for a current run
of this user's program:
-----------
[root@aeolus log]# grep 2120 *
maui.log:05/29 09:01:45 INFO: job '2118' loaded: 1 patton patton 1800 Idle 0 1212076905 [NONE] [NONE] [NONE] >= 0 >= 0 [NONE] 1212076905
maui.log:05/29 09:23:40 INFO: job '2119' loaded: 1 patton patton 1800 Idle 0 1212078218 [NONE] [NONE] [NONE] >= 0 >= 0 [NONE] 1212078220
maui.log:05/29 09:26:19 MPBSJobLoad(2120,2120.aeolus.eecs.wsu.edu,J,TaskList,0)
maui.log:05/29 09:26:19 MReqCreate(2120,SrcRQ,DstRQ,DoCreate)
maui.log:05/29 09:26:19 MJobSetCreds(2120,patton,patton,)
maui.log:05/29 09:26:19 INFO: default QOS for job 2120 set to DEFAULT(0) (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
maui.log:05/29 09:26:19 INFO: default QOS for job 2120 set to DEFAULT(0) (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
maui.log:05/29 09:26:19 INFO: default QOS for job 2120 set to DEFAULT(0) (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
maui.log:05/29 09:26:19 INFO: job '2120' loaded: 1 patton patton 1800 Idle 0 1212078378 [NONE] [NONE] [NONE] >= 0 >= 0 [NONE] 1212078379
maui.log:05/29 09:26:19 INFO: job '2120' Priority: 1
maui.log:05/29 09:26:19 INFO: job '2120' Priority: 1
maui.log:05/29 09:26:19 INFO: 8 feasible tasks found for job 2120:0 in partition DEFAULT (1 Needed)
maui.log:05/29 09:26:19 INFO: 1 requested hostlist tasks allocated for job 2120 (0 remain)
maui.log:05/29 09:26:19 MJobStart(2120)
maui.log:05/29 09:26:19 MJobDistributeTasks(2120,base,NodeList,TaskMap)
maui.log:05/29 09:26:19 MAMAllocJReserve(2120,RIndex,ErrMsg)
maui.log:05/29 09:26:19 MRMJobStart(2120,Msg,SC)
maui.log:05/29 09:26:19 MPBSJobStart(2120,base,Msg,SC)
maui.log:05/29 09:26:19 MPBSJobModify(2120,Resource_List,Resource,compute-0-0.local)
maui.log:05/29 09:26:19 MPBSJobModify(2120,Resource_List,Resource,1)
maui.log:05/29 09:26:19 INFO: job '2120' successfully started
maui.log:05/29 09:26:19 MStatUpdateActiveJobUsage(2120)
maui.log:05/29 09:26:19 MResJCreate(2120,MNodeList,00:00:00,ActiveJob,Res)
maui.log:05/29 09:26:19 INFO: starting job '2120'
maui.log:05/29 09:26:50 INFO: node compute-0-0.local has joblist '0/2120.aeolus.eecs.wsu.edu'
maui.log:05/29 09:26:50 INFO: job 2120 adds 1 processors per task to node compute-0-0.local (1)
maui.log:05/29 09:26:50 MPBSJobUpdate(2120,2120.aeolus.eecs.wsu.edu,TaskList,0)
maui.log:05/29 09:26:50 MStatUpdateActiveJobUsage(2120)
maui.log:05/29 09:26:50 MResDestroy(2120)
maui.log:05/29 09:26:50 MResChargeAllocation(2120,2)
maui.log:05/29 09:26:50 MResJCreate(2120,MNodeList,-00:00:31,ActiveJob,Res)
maui.log:05/29 09:26:50 INFO: job '2120' Priority: 1
maui.log:05/29 09:26:50 INFO: job '2120' Priority: 1
maui.log:05/29 09:27:21 INFO: node compute-0-0.local has joblist '0/2120.aeolus.eecs.wsu.edu'
maui.log:05/29 09:27:21 INFO: job 2120 adds 1 processors per task to node compute-0-0.local (1)
maui.log:05/29 09:27:21 MPBSJobUpdate(2120,2120.aeolus.eecs.wsu.edu,TaskList,0)
maui.log:05/29 09:27:21 MStatUpdateActiveJobUsage(2120)
maui.log:05/29 09:27:21 MResDestroy(2120)
maui.log:05/29 09:27:21 MResChargeAllocation(2120,2)
maui.log:05/29 09:27:21 MResJCreate(2120,MNodeList,-00:01:02,ActiveJob,Res)
maui.log:05/29 09:27:21 INFO: job '2120' Priority: 1
maui.log:05/29 09:27:21 INFO: job '2120' Priority: 1
maui.log:05/29 09:27:21 INFO: job 2120 exceeds requested proc limit (3.72 > 1.00)
maui.log:05/29 09:27:21 MSysRegEvent(JOBRESVIOLATION: job '2120' in state 'Running' has exceeded PROC resource limit (372 > 100) (action CANCEL will be taken) job start time: Thu May 29 09:26:19
maui.log:05/29 09:27:21 MRMJobCancel(2120,job violates resource utilization policies,SC)
maui.log:05/29 09:27:21 MPBSJobCancel(2120,base,CMsg,Msg,job violates resource utilization policies)
maui.log:05/29 09:27:21 INFO: job '2120' successfully cancelled
maui.log:05/29 09:27:27 INFO: active PBS job 2120 has been removed from the queue. assuming successful completion
maui.log:05/29 09:27:27 MJobProcessCompleted(2120)
maui.log:05/29 09:27:27 MAMAllocJDebit(A,2120,SC,ErrMsg)
maui.log:05/29 09:27:27 INFO: job ' 2120' completed. QueueTime: 1 RunTime: 62 Accuracy: 3.44 XFactor: 0.04
maui.log:05/29 09:27:27 INFO: job '2120' completed X: 0.035000 T: 62 PS: 62 A: 0.034444
maui.log:05/29 09:27:27 MJobSendFB(2120)
maui.log:05/29 09:27:27 INFO: job usage sent for job '2120'
maui.log:05/29 09:27:27 MJobRemove(2120)
maui.log:05/29 09:27:27 MResDestroy(2120)
maui.log:05/29 09:27:27 MResChargeAllocation(2120,2)
maui.log:05/29 09:27:27 MJobDestroy(2120)
maui.log:05/29 09:42:54 INFO: job '2121' loaded: 1 sledburg sledburg 1800 Idle 0 1212079373 [NONE] [NONE] [NONE] >= 0 >= 0 [NONE] 1212079374
maui.log:05/29 09:43:34 INFO: job '2122' loaded: 1 sledburg sledburg 1800 Idle 0 1212079413 [NONE] [NONE] [NONE] >= 0 >= 0 [NONE] 1212079414
[root@aeolus log]#
------------------------------
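If I'm reading this right, the smoking gun is the JOBRESVIOLATION line:
Maui believes the job is using 3.72 processors' worth of load while it
only requested 1, so its resource-utilization policy cancels the job. I
believe that behavior is controlled by the RESOURCELIMITPOLICY parameter
in maui.cfg; a minimal sketch of a setting that would produce exactly
this cancel (assuming the documented <RESOURCE>:<POLICY>:<ACTION>
syntax -- this is not necessarily our actual config) would be:
-----------
# maui.cfg (sketch, not verified against our install)
# Always check processor utilization against what the job
# requested, and CANCEL jobs that exceed it.
RESOURCELIMITPOLICY PROC:ALWAYS:CANCEL
-----------
If that's the cause, either relaxing this policy or having the user
request as many processors as mpirun actually spawns should stop the
kills.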
Any thoughts?
Thank you.
On Wed, May 28, 2008 at 5:21 AM, Jeff Squyres <jsquy...@cisco.com> wrote:
(I'm not a subscriber to the torqueusers or mauiusers lists -- I'm not
sure my post will get through.)
I wonder if Jan's idea has merit -- if Torque is killing the job for
some other reason (i.e., not wallclock). The message printed by mpirun
("mpirun: killing job...") is *only* displayed if mpirun receives a
SIGINT or SIGTERM. So perhaps some other resource limit is being
reached...?
Is there a way to have Torque log if it is killing a job for some
reason?
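I don't know of a direct "reason" log, but one thing I can try is
raising the pbs_mom log level so the mom records more detail about the
signals it delivers; a sketch, assuming the stock Torque layout on our
install (paths may differ):
-----------
# on the compute node, as root (paths are assumptions)
echo '$loglevel 7' >> /var/spool/torque/mom_priv/config
/etc/init.d/pbs_mom restart        # make the mom re-read its config
tail -f /var/spool/torque/mom_logs/$(date +%Y%m%d)   # watch during a run
-----------
In this case, though, the Maui log above seems to answer the question:
Maui, not Torque, is issuing the cancel.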
On May 27, 2008, at 7:02 PM, Jim Kusznir wrote:
Yep. Wall time is nowhere near violation (the job dies about 2 minutes
into a 30-minute allocation). I did a ulimit -a both through qsub and
directly on the node (as the same user in both cases), and the results
were identical (most items were unlimited); roughly what I ran is
sketched below.
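(For the archives, from memory -- the node and file names are just
examples:
-----------
# limits inside the batch environment
echo 'ulimit -a' | qsub -l nodes=1 -j oe -o ulimit.batch.out
# limits directly on the compute node, as the same user
ssh compute-0-0 'ulimit -a'
-----------
Both outputs matched.)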
Any other ideas?
--Jim
On Tue, May 27, 2008 at 9:25 AM, Jan Ploski <jan.plo...@offis.de> wrote:
This suggestion is rather trivial, but since you have not mentioned
anything in this area: are you sure that the job is not exceeding
resource limits? Walltime is enforced by TORQUE, and rlimits such as
memory are enforced by the kernel, but they could be set differently in
TORQUE than in your manual invocations of mpirun.
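(One quick cross-check on Jan's point, assuming a standard qstat -- the
job ID below is just an example:
-----------
# show the limits TORQUE is enforcing and the usage it has recorded
qstat -f 2120.aeolus.eecs.wsu.edu | grep -i resource
-----------
The Resource_List.* attributes are the enforced limits; the
resources_used.* attributes show actual consumption.)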
Regards,
Jan Ploski
--
Jeff Squyres
Cisco Systems