Re: [OMPI users] [torqueusers] Job dies randomly, but only through torque

Jeff Squyres Wed, 28 May 2008 08:21:59 -0400

(I'm not a subscriber to the torqueusers or mauiusers lists -- I'm not sure my post will get through)

I wonder if Jan's idea has merit -- if Torque is killing the job for some other reason (i.e., not wallclock). The message printed by mpirun ("mpirun: killing job...") is *only* displayed if mpirun receives a SIGINT or SIGTERM. So perhaps some other resource limit is being reached...?
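One way to check which signal actually arrives is to run a small stand-in process under the same Torque job and have it record what it receives. The following is a minimal sketch, assuming Python is available on the compute node; the names `record_signal`, `install_handlers`, and `wait_and_report` are hypothetical and are not part of Open MPI or Torque. It can only observe catchable signals: SIGKILL cannot be trapped, so a job that dies without any record here may have been killed that way.

```python
import os
import signal
import time

# Each entry is (signal name, wall-clock time of arrival).
received = []

def record_signal(signum, frame):
    """Remember which signal arrived and when."""
    received.append((signal.Signals(signum).name, time.time()))

def install_handlers():
    """Trap the two signals that make mpirun print its 'killing job' message."""
    for sig in (signal.SIGINT, signal.SIGTERM):
        signal.signal(sig, record_signal)

def wait_and_report():
    """Block until a trapped signal arrives, then print what was seen."""
    install_handlers()
    print(f"pid {os.getpid()} waiting for SIGINT or SIGTERM", flush=True)
    while not received:
        signal.pause()  # sleeps until any signal handler has run
    name, when = received[0]
    print(f"received {name} at {time.ctime(when)}", flush=True)
```

Calling `wait_and_report()` from a script submitted in place of the real job would show, in the job's output file, whether a SIGINT or SIGTERM is delivered and how long after job start.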

Is there a way to have Torque log if it is killing a job for some reason?



On May 27, 2008, at 7:02 PM, Jim Kusznir wrote:

Yep.  Wall time is nowhere near violation (dies about 2 minutes into
a 30 minute allocation).  I did a ulimit -a through qsub and direct on
the node (as the same user in both cases), and the results were
identical (most items were unlimited).
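The same comparison can be made from inside a process rather than from the shell. The sketch below is only an illustration, assuming Python on a Unix node; `snapshot_limits`, `show`, and `format_limits` are hypothetical helper names. It prints the soft and hard value of every rlimit the platform defines, so the output of a run through qsub can be diffed against a run started directly on the node.

```python
import resource

def snapshot_limits():
    """Return {name: (soft, hard)} for every RLIMIT_* constant defined here."""
    limits = {}
    for name in sorted(dir(resource)):
        if name.startswith("RLIMIT_"):
            try:
                limits[name] = resource.getrlimit(getattr(resource, name))
            except (ValueError, OSError):
                # Constant is defined but not supported by the running kernel.
                continue
    return limits

def show(value):
    """Render a limit value, spelling out the 'no limit' sentinel."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)

def format_limits(limits):
    """One line per limit, in a stable order that diffs cleanly."""
    return "\n".join(
        f"{name} soft={show(soft)} hard={show(hard)}"
        for name, (soft, hard) in limits.items()
    )

if __name__ == "__main__":
    print(format_limits(snapshot_limits()))
```

Any line that differs between the two outputs points at a limit that Torque sets differently from an interactive login.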

Any other ideas?

--Jim
On Tue, May 27, 2008 at 9:25 AM, Jan Ploski <jan.plo...@offis.de> wrote:
This suggestion is rather trivial, but since you have not mentioned
anything in this area:
Are you sure that the job is not exceeding resource limits (walltime -
enforced by TORQUE, or rlimits such as memory - enforced by the kernel,
but they could be set differently in TORQUE and your manual invocations of
mpirun).

Regards,
Jan Ploski



--
Jeff Squyres
Cisco Systems
