(I'm not a subscriber to the torqueusers or mauiusers lists -- I'm not
sure my post will get through)
I wonder if Jan's idea has merit -- if Torque is killing the job for
some other reason (i.e., not wallclock). The message printed by
mpirun ("mpirun: killing job...") is *only* displayed if mpirun
receives a SIGINT or SIGTERM. So perhaps some other resource limit is
being reached...?
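
One way to check whether a signal is really being delivered is to submit a tiny stand-in job that does nothing but record SIGINT and SIGTERM. The following is only a sketch: the names `install_handlers` and `wait_for_signal` are made up for illustration, and it assumes that whatever is ending the job sends a catchable signal (SIGKILL cannot be caught, so it would not show up here).

```python
import signal
import time

received = []  # names of the signals seen so far


def log_signal(signum, frame):
    """Record the signal and print a timestamped line instead of dying."""
    name = signal.Signals(signum).name
    received.append(name)
    print(f"{time.strftime('%H:%M:%S')} caught {name}", flush=True)


def install_handlers():
    """Trap the two signals that make mpirun print 'killing job...'."""
    for sig in (signal.SIGINT, signal.SIGTERM):
        signal.signal(sig, log_signal)


def wait_for_signal(timeout_s):
    """Sleep in one-second steps until a signal arrives or time runs out."""
    deadline = time.monotonic() + timeout_s
    while not received and time.monotonic() < deadline:
        time.sleep(1)
    return list(received)
```

Calling `install_handlers()` and then `wait_for_signal(1800)` at the end of the script, and submitting it through qsub with the same resource request as the real job, should show whether a signal arrives around the two-minute mark.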
Is there a way to have Torque log if it is killing a job for some
reason?
On May 27, 2008, at 7:02 PM, Jim Kusznir wrote:
Yep. Wall time is nowhere near violation (the job dies about 2 minutes
into a 30 minute allocation). I ran ulimit -a both through qsub and
directly on the node (as the same user in both cases), and the results
were identical (most items were unlimited).
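
A comparison like this can also be made from inside the process itself, which may matter if the shell that runs ulimit and the processes that are actually launched do not inherit the same limits. This is a rough sketch using Python's standard `resource` module; the helper names `fmt` and `collect_limits` are illustrative only.

```python
import resource


def fmt(value):
    """Render a limit the way ulimit does for the infinite case."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def collect_limits():
    """Return {RLIMIT_name: (soft, hard)} for every limit this platform defines."""
    limits = {}
    for name in sorted(dir(resource)):
        if not name.startswith("RLIMIT_"):
            continue
        try:
            soft, hard = resource.getrlimit(getattr(resource, name))
        except (ValueError, OSError):
            continue  # limit not supported by the running kernel
        limits[name] = (fmt(soft), fmt(hard))
    return limits


if __name__ == "__main__":
    for name, (soft, hard) in collect_limits().items():
        print(f"{name:20s} soft={soft:>12s} hard={hard:>12s}")
```

Running the script once inside the qsub job and once by hand on the same node, then diffing the two outputs, would show any limit that differs between the two environments.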
Any other ideas?
--Jim
On Tue, May 27, 2008 at 9:25 AM, Jan Ploski <jan.plo...@offis.de>
wrote:
This suggestion is rather trivial, but since you have not mentioned
anything in this area:
Are you sure that the job is not exceeding resource limits (walltime,
enforced by TORQUE, or rlimits such as memory, enforced by the kernel)?
The rlimits could be set differently under TORQUE and in your manual
invocations of mpirun.
Regards,
Jan Ploski
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
--
Jeff Squyres
Cisco Systems