In the message dated: Fri, 18 Oct 2019 15:34:02 -0000,
The pithy ruminations from WALLIS Michael on 
[[gridengine users] SGE rolling over MAX_SEQNUM, peculiar things happened] were:
=> Hi folks,
=> 
=> Our instance of (quite old, 2011.11p1_155) SGE rolled over 10,000,000 jobs
=> at the start of the month, and then started again at 1 as expected. About
=> ten days later we started the qmaster a few times (it was segfaulting,
=> originally we thought that a user was using newer qstat binaries to query
=> an old qmaster) with JID nearing ~20k, only after each of the restarts the
=> JID started at about 1100, not the number we were expecting. Because of
=> this there's duplicate JID entries in accounting and it's causing a bit of
=> a problem for people who monitor for failed jobs.

We've seen that too.

Restarting the qmaster doesn't rotate the accounting file, so qacct output
may be 'wrong' unless the query is restricted to a time range (e.g.,
job ID 1000 may exist from both 2017 and 2019).
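
To illustrate the point about reused job IDs, here is a minimal Python sketch
that scans accounting records for job numbers with more than one distinct
submission time, and then narrows a lookup to a time window (the same idea as
giving qacct a begin and end time). It assumes the colon-separated layout
described in accounting(5), with the job number in the 6th field and the
submission time in the 9th, in epoch seconds. The function names and the sample
records are made up for illustration; they are not taken from any real cluster.

```python
from collections import defaultdict

# 0-based field positions after splitting a record on ':'.
# Assumption: layout as documented in accounting(5).
JOB_NUMBER = 5
SUBMISSION_TIME = 8


def parse_accounting(lines):
    """Yield (job_id, submission_epoch) per record, skipping comments."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) <= SUBMISSION_TIME:
            continue  # too short to be a usable record
        yield int(fields[JOB_NUMBER]), int(fields[SUBMISSION_TIME])


def reused_job_ids(lines):
    """Return {job_id: sorted submission times} for reused job IDs.

    Array tasks share one submission time, so only job IDs with
    several distinct submission times are reported.
    """
    seen = defaultdict(set)
    for job_id, submitted in parse_accounting(lines):
        seen[job_id].add(submitted)
    return {j: sorted(t) for j, t in seen.items() if len(t) > 1}


def records_in_window(lines, job_id, begin, end):
    """Return records for job_id submitted in [begin, end) epoch seconds."""
    return [
        (j, t) for j, t in parse_accounting(lines)
        if j == job_id and begin <= t < end
    ]


# Hypothetical sample records (invented values, truncated to 13 fields).
sample = [
    "# made-up accounting data",
    "all.q:node1:grp:alice:sim:1000:sge:0:1500000000:1500000010:1500000100:0:0",
    "all.q:node2:grp:bob:fit:1000:sge:0:1570000000:1570000005:1570000200:0:0",
    "all.q:node2:grp:bob:fit:1001:sge:0:1570000300:1570000305:1570000400:0:0",
]

duplicates = reused_job_ids(sample)
# 1546300800 and 1577836800 are 2019-01-01 and 2020-01-01 UTC.
only_2019 = records_in_window(sample, 1000, 1546300800, 1577836800)
```

In practice the lines would come from the cell's accounting file rather than
a list, and monitoring scripts could key on (job ID, submission time) instead
of the job ID alone.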

Mark

=> 
=> Because of the nature of the workload the currently-running JIDs are now
=> all over the place, with some JIDs in the queue still in the 9,99n,nnn
=> range and some in four figures. If we need to restart the qmaster again,
=> will the jobseqnum file be overwritten with the largest JID still in the
=> queue (as suggested in
=> http://arc.liv.ac.uk/pipermail/gridengine-users/2010-January/028661.html)?
=> 
=> Am aware that this is an old version of SGE and we're in the middle of
=> transitioning to a much newer one, but this is a bit of an issue while
=> we're still shifting workloads over.
=> 
=> Thanks,
=> Mike
=> --
=> Mike Wallis x503305
=> University of Edinburgh, Research Services,
=> Argyle House, 3 Lady Lawson Street,
=> Edinburgh, EH3 9DR
=> 
=> 

