The only time I ever had problems with duplicate IDs, it was simply because
they rolled over. That was a while ago, though (might've been SGE 6.2,
actually; I think that hit the maximum job ID at 999999). You'd have
to run through a very large number of jobs to hit that monthly, though.
Tina
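To put a rough number on "a very large number of jobs": the sketch below estimates the submission rate needed for a job ID to come around again between the two qsub_time values in the quoted qacct output. It assumes the counter wraps back to 1 after a maximum of 999999, which is only the recollection above and not a confirmed limit for this installation; the function name is made up for illustration.

```python
from datetime import datetime

def jobs_per_day_to_reuse_id(max_job_id, first_submit, second_submit):
    """Average jobs/day needed for the same ID to be handed out twice.

    Assumes IDs run 1..max_job_id and then wrap to 1, so exactly
    max_job_id jobs are submitted between two uses of the same ID.
    """
    elapsed_days = (second_submit - first_submit).total_seconds() / 86400
    return max_job_id / elapsed_days

# qsub_time values of the two records sharing job number 878815
first = datetime(2017, 1, 10, 11, 49, 45)
second = datetime(2017, 2, 8, 12, 37, 38)

# 999999 is an assumed rollover point, not a value read from the cluster
rate = jobs_per_day_to_reuse_id(999999, first, second)
print(f"about {rate:.0f} jobs per day")
```

If the cluster's real throughput is nowhere near that figure, a plain rollover is an unlikely explanation and something else is reusing the ID.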
On 14/02/17 10:49, William Hay wrote:
On Mon, Feb 13, 2017 at 03:17:29PM -0500, Douglas Duckworth wrote:
Hello
About a month ago we started seeing duplicate job IDs in SGE.
For example:
sysadmin@panda2[~]$ qacct -j 878815
==============================================================
qname standard.q
hostname node127.panda.pbtech
group abc
owner developer
project NONE
department cmlab.u
jobname old job
jobnumber 878815
taskid undefined
account sge
priority 0
qsub_time Tue Jan 10 11:49:45 2017
start_time Tue Jan 10 11:51:40 2017
end_time Tue Jan 10 11:51:40 2017
granted_pe smp
slots 1
failed 0
exit_status 0
ru_wallclock 0
ru_utime 0.001
ru_stime 0.006
ru_maxrss 1428
ru_ixrss 0
ru_ismrss 0
ru_idrss 0
ru_isrss 0
ru_minflt 1254
ru_majflt 0
ru_nswap 0
ru_inblock 0
ru_oublock 8
ru_msgsnd 0
ru_msgrcv 0
ru_nsignals 0
ru_nvcsw 60
ru_nivcsw 4
cpu 0.007
mem 0.000
io 0.000
iow 0.000
maxvmem 0.000
arid undefined
==============================================================
qname standard.q
hostname node120.panda.pbtech
group abc
owner developer
project NONE
department cmlab.u
jobname newjob
jobnumber 878815
taskid undefined
account sge
priority 0
qsub_time Wed Feb 8 12:37:38 2017
start_time Wed Feb 8 13:20:49 2017
end_time Wed Feb 8 13:41:01 2017
granted_pe smp
slots 12
failed 100 : assumedly after job
exit_status 137
ru_wallclock 1212
ru_utime 0.002
ru_stime 0.022
ru_maxrss 1280
ru_ixrss 0
ru_ismrss 0
ru_idrss 0
ru_isrss 0
ru_minflt 623
ru_majflt 0
ru_nswap 0
ru_inblock 0
ru_oublock 8
ru_msgsnd 0
ru_msgrcv 0
ru_nsignals 0
ru_nvcsw 47
ru_nivcsw 2
cpu 13816.930
mem 48585.941
io 34.210
iow 0.000
maxvmem 3.692G
arid undefined
As you can see, the two jobs were submitted nearly a month apart. The
duplicate IDs do not affect the jobs' ability to complete, but we are
required not to have duplicate job IDs.
Has anyone experienced this issue or have an idea of what could be causing
this behavior?
We are not rotating our accounting logs.
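Since the accounting file is not rotated, it can be scanned for every job number that appears with more than one submission time, which would show how widespread the reuse is. This is a minimal sketch, assuming the colon-separated layout described in accounting(5), with job_number as the 6th field and submission_time as the 9th; those positions should be checked against the local man page. Comparing submission times rather than counting records is deliberate, because array tasks and parallel jobs can legitimately write several records for one job. The function name is hypothetical.

```python
from collections import defaultdict

# 0-based field positions, assumed from the accounting(5) layout
JOB_NUMBER = 5
SUBMISSION_TIME = 8

def find_reused_job_ids(lines):
    """Map each job number seen with more than one distinct
    submission time to the sorted list of those times."""
    seen = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split(":")
        if len(fields) <= SUBMISSION_TIME:
            continue  # too short to be an accounting record
        try:
            seen[fields[JOB_NUMBER]].add(int(fields[SUBMISSION_TIME]))
        except ValueError:
            continue  # submission time was not an integer
    return {job: sorted(times) for job, times in seen.items() if len(times) > 1}
```

To use it, open the accounting file (commonly $SGE_ROOT/$SGE_CELL/common/accounting, though the location may differ per site) and pass the file object to find_reused_job_ids; an empty result means no job number was recorded with two different submission times.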
Thanks,
Douglas Duckworth, MSc, LFCS
HPC System Administrator
Scientific Computing Unit
Physiology and Biophysics
Weill Cornell Medicine
E: d...@med.cornell.edu
O: 212-746-6305
F: 212-746-8690
Apart from Reuti's suggestion: possibly the original job got returned to the
queue for some reason, and the user then qalter'd it so it was unrecognisable?
William
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users