I have verified that Maui is killing the job. I actually ran into this with another user all of a sudden; I don't know why it's only affecting a few users at the moment. Here's the Maui log extract for a current run of this user's program:
-----------
[root@aeolus log]# grep 2120 *
maui.log:05/29 09:01:45 INFO: job '2118' loaded: 1 patton patton 1800 Idle 0 1212076905 [NONE] [NONE] [NONE] >= 0 >= 0 [NONE] 1212076905
maui.log:05/29 09:23:40 INFO: job '2119' loaded: 1 patton patton 1800 Idle 0 1212078218 [NONE] [NONE] [NONE] >= 0 >= 0 [NONE] 1212078220
maui.log:05/29 09:26:19 MPBSJobLoad(2120,2120.aeolus.eecs.wsu.edu,J,TaskList,0)
maui.log:05/29 09:26:19 MReqCreate(2120,SrcRQ,DstRQ,DoCreate)
maui.log:05/29 09:26:19 MJobSetCreds(2120,patton,patton,)
maui.log:05/29 09:26:19 INFO: default QOS for job 2120 set to DEFAULT(0) (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
maui.log:05/29 09:26:19 INFO: default QOS for job 2120 set to DEFAULT(0) (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
maui.log:05/29 09:26:19 INFO: default QOS for job 2120 set to DEFAULT(0) (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
maui.log:05/29 09:26:19 INFO: job '2120' loaded: 1 patton patton 1800 Idle 0 1212078378 [NONE] [NONE] [NONE] >= 0 >= 0 [NONE] 1212078379
maui.log:05/29 09:26:19 INFO: job '2120' Priority: 1
maui.log:05/29 09:26:19 INFO: job '2120' Priority: 1
maui.log:05/29 09:26:19 INFO: 8 feasible tasks found for job 2120:0 in partition DEFAULT (1 Needed)
maui.log:05/29 09:26:19 INFO: 1 requested hostlist tasks allocated for job 2120 (0 remain)
maui.log:05/29 09:26:19 MJobStart(2120)
maui.log:05/29 09:26:19 MJobDistributeTasks(2120,base,NodeList,TaskMap)
maui.log:05/29 09:26:19 MAMAllocJReserve(2120,RIndex,ErrMsg)
maui.log:05/29 09:26:19 MRMJobStart(2120,Msg,SC)
maui.log:05/29 09:26:19 MPBSJobStart(2120,base,Msg,SC)
maui.log:05/29 09:26:19 MPBSJobModify(2120,Resource_List,Resource,compute-0-0.local)
maui.log:05/29 09:26:19 MPBSJobModify(2120,Resource_List,Resource,1)
maui.log:05/29 09:26:19 INFO: job '2120' successfully started
maui.log:05/29 09:26:19 MStatUpdateActiveJobUsage(2120)
maui.log:05/29 09:26:19 MResJCreate(2120,MNodeList,00:00:00,ActiveJob,Res)
maui.log:05/29 09:26:19 INFO: starting job '2120'
maui.log:05/29 09:26:50 INFO: node compute-0-0.local has joblist '0/2120.aeolus.eecs.wsu.edu'
maui.log:05/29 09:26:50 INFO: job 2120 adds 1 processors per task to node compute-0-0.local (1)
maui.log:05/29 09:26:50 MPBSJobUpdate(2120,2120.aeolus.eecs.wsu.edu,TaskList,0)
maui.log:05/29 09:26:50 MStatUpdateActiveJobUsage(2120)
maui.log:05/29 09:26:50 MResDestroy(2120)
maui.log:05/29 09:26:50 MResChargeAllocation(2120,2)
maui.log:05/29 09:26:50 MResJCreate(2120,MNodeList,-00:00:31,ActiveJob,Res)
maui.log:05/29 09:26:50 INFO: job '2120' Priority: 1
maui.log:05/29 09:26:50 INFO: job '2120' Priority: 1
maui.log:05/29 09:27:21 INFO: node compute-0-0.local has joblist '0/2120.aeolus.eecs.wsu.edu'
maui.log:05/29 09:27:21 INFO: job 2120 adds 1 processors per task to node compute-0-0.local (1)
maui.log:05/29 09:27:21 MPBSJobUpdate(2120,2120.aeolus.eecs.wsu.edu,TaskList,0)
maui.log:05/29 09:27:21 MStatUpdateActiveJobUsage(2120)
maui.log:05/29 09:27:21 MResDestroy(2120)
maui.log:05/29 09:27:21 MResChargeAllocation(2120,2)
maui.log:05/29 09:27:21 MResJCreate(2120,MNodeList,-00:01:02,ActiveJob,Res)
maui.log:05/29 09:27:21 INFO: job '2120' Priority: 1
maui.log:05/29 09:27:21 INFO: job '2120' Priority: 1
maui.log:05/29 09:27:21 INFO: job 2120 exceeds requested proc limit (3.72 > 1.00)
maui.log:05/29 09:27:21 MSysRegEvent(JOBRESVIOLATION: job '2120' in state 'Running' has exceeded PROC resource limit (372 > 100) (action CANCEL will be taken) job start time: Thu May 29 09:26:19
maui.log:05/29 09:27:21 MRMJobCancel(2120,job violates resource utilization policies,SC)
maui.log:05/29 09:27:21 MPBSJobCancel(2120,base,CMsg,Msg,job violates resource utilization policies)
maui.log:05/29 09:27:21 INFO: job '2120' successfully cancelled
maui.log:05/29 09:27:27 INFO: active PBS job 2120 has been removed from the queue.
assuming successful completion
maui.log:05/29 09:27:27 MJobProcessCompleted(2120)
maui.log:05/29 09:27:27 MAMAllocJDebit(A,2120,SC,ErrMsg)
maui.log:05/29 09:27:27 INFO: job '2120' completed. QueueTime: 1 RunTime: 62 Accuracy: 3.44 XFactor: 0.04
maui.log:05/29 09:27:27 INFO: job '2120' completed X: 0.035000 T: 62 PS: 62 A: 0.034444
maui.log:05/29 09:27:27 MJobSendFB(2120)
maui.log:05/29 09:27:27 INFO: job usage sent for job '2120'
maui.log:05/29 09:27:27 MJobRemove(2120)
maui.log:05/29 09:27:27 MResDestroy(2120)
maui.log:05/29 09:27:27 MResChargeAllocation(2120,2)
maui.log:05/29 09:27:27 MJobDestroy(2120)
maui.log:05/29 09:42:54 INFO: job '2121' loaded: 1 sledburg sledburg 1800 Idle 0 1212079373 [NONE] [NONE] [NONE] >= 0 >= 0 [NONE] 1212079374
maui.log:05/29 09:43:34 INFO: job '2122' loaded: 1 sledburg sledburg 1800 Idle 0 1212079413 [NONE] [NONE] [NONE] >= 0 >= 0 [NONE] 1212079414
[root@aeolus log]#
------------------------------

Any thoughts? Thank you.

On Wed, May 28, 2008 at 5:21 AM, Jeff Squyres <jsquy...@cisco.com> wrote:
> (I'm not a subscriber to the torqueusers or mauiusers lists -- I'm not
> sure my post will get through)
>
> I wonder if Jan's idea has merit -- if Torque is killing the job for
> some other reason (i.e., not wallclock). The message printed by
> mpirun ("mpirun: killing job...") is *only* displayed if mpirun
> receives a SIGINT or SIGTERM. So perhaps some other resource limit is
> being reached...?
>
> Is there a way to have Torque log if it is killing a job for some
> reason?
>
> On May 27, 2008, at 7:02 PM, Jim Kusznir wrote:
>
>> Yep. Wall time is nowhere near violation (dies about 2 minutes into
>> a 30 minute allocation). I did a ulimit -a through qsub and direct on
>> the node (as the same user in both cases), and the results were
>> identical (most items were unlimited).
>>
>> Any other ideas?
>>
>> --Jim
>>
>> On Tue, May 27, 2008 at 9:25 AM, Jan Ploski <jan.plo...@offis.de> wrote:
>>>
>>> This suggestion is rather trivial, but since you have not mentioned
>>> anything in this area:
>>>
>>> Are you sure that the job is not exceeding resource limits (walltime -
>>> enforced by TORQUE, or rlimits such as memory - enforced by the kernel,
>>> but they could be set differently in TORQUE and your manual invocations
>>> of mpirun)?
>>>
>>> Regards,
>>> Jan Ploski
>>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> Cisco Systems
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
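For the archives, a note on what the log actually shows: the decisive entries are "exceeds requested proc limit (3.72 > 1.00)" and the JOBRESVIOLATION event. Maui measured roughly 3.7 processors' worth of CPU use against the single processor the job requested and took the CANCEL action. If I'm reading the Maui admin docs correctly, that enforcement is governed by the RESOURCELIMITPOLICY setting in maui.cfg (something like `RESOURCELIMITPOLICY PROC:ALWAYS:CANCEL`), so relaxing that action temporarily would confirm the diagnosis. As a small illustration (my own awk one-liner, not a Maui tool), the two figures can be pulled straight out of such a log line:

```shell
#!/bin/sh
# Illustrative parse of the Maui violation line quoted above: split the
# line on '(', '>' and ')', then strip spaces from the two numeric
# fields to recover the used vs. requested processor counts.
line="maui.log:05/29 09:27:21 INFO: job 2120 exceeds requested proc limit (3.72 > 1.00)"
echo "$line" | awk -F'[(>)]' '{ gsub(/ /, "", $2); gsub(/ /, "", $3); print "used=" $2, "requested=" $3 }'
# prints: used=3.72 requested=1.00
```

A used figure of almost four times the request would be consistent with mpirun launching more processes than the one slot Torque granted, which could also explain why only some users' jobs trip the limit.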