(I'm not a subscriber to the torqueusers or mauiusers lists -- I'm not sure my post will get through)

I wonder if Jan's idea has merit -- if Torque is killing the job for some other reason (i.e., not wallclock). The message printed by mpirun ("mpirun: killing job...") is *only* displayed if mpirun receives a SIGINT or SIGTERM. So perhaps some other resource limit is being reached...?
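
One way to test that theory is to run a tiny stand-in for the job that records any SIGINT or SIGTERM it receives. The following is only a minimal sketch: the names `caught`, `record_signal`, and `install_handlers` are made up for illustration and are not part of Open MPI or Torque, and the `os.kill` call merely simulates whatever the batch system might send.

```python
import os
import signal
import time

# Hypothetical helper names; nothing here is Open MPI or Torque API.
caught = []

def record_signal(signum, frame):
    # Remember which signal arrived and when, instead of exiting right away.
    caught.append((signal.Signals(signum).name, time.time()))

def install_handlers():
    # SIGINT and SIGTERM can be trapped; SIGKILL cannot be caught at all.
    for sig in (signal.SIGINT, signal.SIGTERM):
        signal.signal(sig, record_signal)

if __name__ == "__main__":
    install_handlers()
    # Stand-in for an external kill; a real test would just wait here.
    os.kill(os.getpid(), signal.SIGTERM)
    time.sleep(0.1)
    for name, when in caught:
        print(f"received {name} at {time.ctime(when)}")
```

If a script like this, submitted through qsub in place of the real job, reports a signal at about the same point where the real job dies, that would suggest something outside the application is sending it.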

Is there a way to have Torque log if it is killing a job for some reason?


On May 27, 2008, at 7:02 PM, Jim Kusznir wrote:

Yep.  Wall time is nowhere near violation (dies about 2 minutes into
a 30 minute allocation).  I did a ulimit -a through qsub and direct on
the node (as the same user in both cases), and the results were
identical (most items were unlimited).
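
A comparison like this can also be scripted so that soft and hard limits are both captured (bash's plain `ulimit -a` reports only the soft values). This is a rough sketch, assuming Python is available on the compute node; `current_limits` and `diff_limits` are illustrative names, and the list of limits is just a sample of what the platform may define.

```python
import resource

# Limit names to check; any that this platform lacks are skipped.
NAMES = ["RLIMIT_CPU", "RLIMIT_FSIZE", "RLIMIT_DATA", "RLIMIT_STACK",
         "RLIMIT_CORE", "RLIMIT_NOFILE", "RLIMIT_AS", "RLIMIT_MEMLOCK",
         "RLIMIT_NPROC", "RLIMIT_RSS"]

def fmt(value):
    # Show the "no limit" sentinel the way ulimit does.
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)

def current_limits():
    # Map each available limit name to its (soft, hard) pair as strings.
    out = {}
    for name in NAMES:
        const = getattr(resource, name, None)
        if const is None:
            continue
        soft, hard = resource.getrlimit(const)
        out[name] = (fmt(soft), fmt(hard))
    return out

def diff_limits(a, b):
    # Return only the entries whose values differ between two snapshots.
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}

if __name__ == "__main__":
    for name, (soft, hard) in sorted(current_limits().items()):
        print(f"{name}: soft={soft} hard={hard}")
```

Running it once from a job script submitted with qsub and once from an interactive shell on the same node, then comparing the two outputs, would show whether any limit really differs between the two environments.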

Any other ideas?

--Jim

On Tue, May 27, 2008 at 9:25 AM, Jan Ploski <jan.plo...@offis.de> wrote:

This suggestion is rather trivial, but since you have not mentioned
anything in this area:

Are you sure that the job is not exceeding resource limits?  Walltime
is enforced by TORQUE, and rlimits such as memory are enforced by the
kernel, but they could be set differently in TORQUE and in your manual
invocations of mpirun.

Regards,
Jan Ploski

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


--
Jeff Squyres
Cisco Systems
