In the message dated: Fri, 18 Oct 2019 15:34:02 -0000,
The pithy ruminations from WALLIS Michael on
[[gridengine users] SGE rolling over MAX_SEQNUM, peculiar things happened] were:
=> Hi folks,
=>
=> Our instance of (quite old, 2011.11p1_155) SGE rolled over 10,000,000 jobs at the start of the
=> month, and then started again at 1 as expected. About ten days later we started the qmaster
=> a few times (it was segfaulting, originally we thought that a user was using newer qstat
=> binaries to query an old qmaster) with JID nearing ~20k, only after each of the restarts the JID
=> started at about 1100, not the number we were expecting. Because of this there's duplicate JID
=> entries in accounting and it's causing a bit of a problem for people who monitor for failed jobs.

We've seen that too. Restarting the queue master doesn't rotate the
accounting file, so qacct output may be 'wrong' unless the query is
restricted to a time range (e.g., job ID 1000 may exist once from 2017
and again from 2019).

Mark

=>
=> Because of the nature of the workload the currently-running JIDs are now all over the place,
=> with some JIDs in the queue still in the 9,99n,nnn range and some in four figures. If we need to
=> restart the qmaster again, will the jobseqnum file be overwritten with the largest JID still in
=> the queue (as suggested in
=> http://arc.liv.ac.uk/pipermail/gridengine-users/2010-January/028661.html)?
=>
=> Am aware that this is an old version of SGE and we're in the middle of transitioning to a
=> much newer one, but this is a bit of an issue while we're still shifting workloads over.
=>
=> Thanks,
=> Mike
=> --
=> Mike Wallis x503305
=> University of Edinburgh, Research Services,
=> Argyle House, 3 Lady Lawson Street,
=> Edinburgh, EH3 9DR
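
To illustrate the time-range restriction mentioned in the reply above, here is a rough sketch of separating duplicate job IDs by filtering accounting records on submission time. It assumes the colon-separated field order described in accounting(5) (job_name in field 5, job_number in field 6, submission_time in field 9, counting from 1) and timestamps in epoch seconds; both should be checked against your own SGE version. The function name, the helper and the sample records are made up for the example, and job names containing a colon are not handled.

```python
from datetime import datetime, timezone

# 0-based field positions, assumed from accounting(5); verify for your version.
JOB_NAME, JOB_NUMBER, SUBMISSION_TIME = 4, 5, 8


def epoch(year, month=1, day=1):
    """Midnight UTC on the given date, as epoch seconds."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


def records_for_job(lines, job_number, begin, end):
    """Return the split fields of records for job_number submitted in [begin, end)."""
    matches = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comment and blank lines
        fields = line.rstrip("\n").split(":")
        if len(fields) <= SUBMISSION_TIME:
            continue  # too short to be an accounting record
        try:
            jid = int(fields[JOB_NUMBER])
            submitted = int(fields[SUBMISSION_TIME])
        except ValueError:
            continue  # malformed record
        if jid == job_number and begin <= submitted < end:
            matches.append(fields)
    return matches


# Hypothetical records: job ID 1000 appears once in 2017 and again in 2019.
SAMPLE = [
    "all.q:node01:staff:alice:old_job:1000:sge:0:1500000000:1500000010:1500000100:0:0",
    "all.q:node02:staff:bob:new_job:1000:sge:0:1571000000:1571000010:1571000100:0:0",
    "all.q:node02:staff:bob:other_job:1001:sge:0:1571000500:1571000510:1571000600:0:0",
]

if __name__ == "__main__":
    for rec in records_for_job(SAMPLE, 1000, epoch(2019), epoch(2020)):
        print(rec[JOB_NUMBER], rec[JOB_NAME])
```

On a real system the same function could be fed the open accounting file instead of SAMPLE. qacct's own -b and -e options should give a similar restriction without any scripting, though they select on job start time rather than submission time.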