Re: [OMPI users] [torqueusers] Job dies randomly, but only through torque

2008-05-29 Thread Jeff Squyres
I don't know much about Maui, but these lines from the log seem relevant: - maui.log:05/29 09:27:21 INFO: job 2120 exceeds requested proc limit (3.72 > 1.00) maui.log:05/29 09:27:21 MSysRegEvent(JOBRESVIOLATION: job '2120' in state 'Running' has exceeded PROC resource limit (372 >
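A minimal sketch, assuming the "exceeds requested proc limit" line quoted above is representative of how Maui logs this condition: the following Python scans log lines for that message and pulls out the job ID and the two numbers. The regular expression is inferred from that single quoted line, and the function name `find_proc_violations` is hypothetical, not part of Maui or Torque.

```python
import re

# Pattern inferred from the one maui.log line quoted in this thread;
# other Maui versions may format the message differently (assumption).
PROC_LIMIT_RE = re.compile(
    r"job (?P<job>\S+) exceeds requested proc limit "
    r"\((?P<used>[\d.]+) > (?P<limit>[\d.]+)\)"
)


def find_proc_violations(lines):
    """Return (job_id, used, limit) for each line reporting a proc-limit overrun."""
    violations = []
    for line in lines:
        match = PROC_LIMIT_RE.search(line)
        if match:
            violations.append(
                (
                    match.group("job"),
                    float(match.group("used")),
                    float(match.group("limit")),
                )
            )
    return violations


# Sample input: the log line quoted in the message above.
sample = [
    "maui.log:05/29 09:27:21 INFO: job 2120 exceeds requested proc limit (3.72 > 1.00)",
]

for job, used, limit in find_proc_violations(sample):
    print(f"job {job}: used {used:.2f} procs, requested {limit:.2f}")
```

Run against the output of a `grep` like the one Jim shows, this would list which jobs Maui considers over their processor request and by how much.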

Re: [OMPI users] [torqueusers] Job dies randomly, but only through torque

2008-05-29 Thread Jim Kusznir
I have verified that maui is killing the job. I actually ran into this with another user all of a sudden. I don't know why it's only affecting a few currently. Here's the maui log extract for a current run of this user's program: --- [root@aeolus log]# grep 2120 * maui.log:05/29 09:01:45

Re: [OMPI users] [torqueusers] Job dies randomly, but only through torque

2008-05-28 Thread Jeff Squyres
(I'm not a subscriber to the torqueusers or mauiusers lists -- I'm not sure my post will get through) I wonder if Jan's idea has merit -- if Torque is killing the job for some other reason (i.e., not wallclock). The message printed by mpirun ("mpirun: killing job...") is *only* displayed i

Re: [OMPI users] [torqueusers] Job dies randomly, but only through torque

2008-05-27 Thread Jim Kusznir
Yep. Wall time is nowhere near violation (dies about 2 minutes into a 30 minute allocation). I did a ulimit -a through qsub and directly on the node (as the same user in both cases), and the results were identical (most items were unlimited). Any other ideas? --Jim On Tue, May 27, 2008 at 9:25