The only time I ever had problems with duplicate IDs, it was simply because they had rolled over. That was a while ago (might've been SGE 6.2, actually; I think that might've hit max job ID at 999999). You'd have to run through a very large number of jobs to hit that every month, though.
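
To put a rough number on that: a small Python sketch, assuming job IDs are handed out sequentially with no gaps and wrap at 999999 (the ceiling recalled above, not a confirmed value for any particular version). It uses the two qsub_time values from the qacct output quoted further down in this thread; the function name is made up for illustration.

```python
from datetime import datetime

# Assumption: job IDs wrap after 999999, as recalled above. A real
# installation may use a different ceiling.
MAX_JOB_ID = 999_999

def jobs_per_day_for_wrap(first_qsub: str, second_qsub: str,
                          max_id: int = MAX_JOB_ID) -> float:
    """Average submissions per day needed for the ID counter to wrap exactly
    once between two submissions that received the same job ID."""
    # Timestamp layout as printed by qacct; %a and %b assume an English locale.
    fmt = "%a %b %d %H:%M:%S %Y"
    gap = datetime.strptime(second_qsub, fmt) - datetime.strptime(first_qsub, fmt)
    return max_id / (gap.total_seconds() / 86400)

# qsub_time values of the two records sharing jobnumber 878815 in this thread
rate = jobs_per_day_for_wrap("Tue Jan 10 11:49:45 2017",
                             "Wed Feb  8 12:37:38 2017")
print(f"{rate:,.0f} jobs/day")
```

With those two timestamps this works out to roughly 34,000 submissions a day, sustained for the whole month, so a rollover is only plausible on a very busy cluster.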

Tina

On 14/02/17 10:49, William Hay wrote:
On Mon, Feb 13, 2017 at 03:17:29PM -0500, Douglas Duckworth wrote:
   Hello
   About a month ago we started seeing duplicate job IDs in SGE.
   For example:
   sysadmin@panda2[~]$ qacct -j 878815
   ==============================================================
   qname        standard.q
   hostname     node127.panda.pbtech
   group        abc
   owner        developer
   project      NONE
   department   cmlab.u
   jobname      old job
   jobnumber    878815
   taskid       undefined
   account      sge
   priority     0
   qsub_time    Tue Jan 10 11:49:45 2017
   start_time   Tue Jan 10 11:51:40 2017
   end_time     Tue Jan 10 11:51:40 2017
   granted_pe   smp
   slots        1
   failed       0
   exit_status  0
   ru_wallclock 0
   ru_utime     0.001
   ru_stime     0.006
   ru_maxrss    1428
   ru_ixrss     0
   ru_ismrss    0
   ru_idrss     0
   ru_isrss     0
   ru_minflt    1254
   ru_majflt    0
   ru_nswap     0
   ru_inblock   0
   ru_oublock   8
   ru_msgsnd    0
   ru_msgrcv    0
   ru_nsignals  0
   ru_nvcsw     60
   ru_nivcsw    4
   cpu          0.007
   mem          0.000
   io           0.000
   iow          0.000
   maxvmem      0.000
   arid         undefined
   ==============================================================
   qname        standard.q
   hostname     node120.panda.pbtech
   group        abc
   owner        developer
   project      NONE
   department   cmlab.u
   jobname      newjob
   jobnumber    878815
   taskid       undefined
   account      sge
   priority     0
   qsub_time    Wed Feb  8 12:37:38 2017
   start_time   Wed Feb  8 13:20:49 2017
   end_time     Wed Feb  8 13:41:01 2017
   granted_pe   smp
   slots        12
   failed       100 : assumedly after job
   exit_status  137
   ru_wallclock 1212
   ru_utime     0.002
   ru_stime     0.022
   ru_maxrss    1280
   ru_ixrss     0
   ru_ismrss    0
   ru_idrss     0
   ru_isrss     0
   ru_minflt    623
   ru_majflt    0
   ru_nswap     0
   ru_inblock   0
   ru_oublock   8
   ru_msgsnd    0
   ru_msgrcv    0
   ru_nsignals  0
   ru_nvcsw     47
   ru_nivcsw    2
   cpu          13816.930
   mem          48585.941
   io           34.210
   iow          0.000
   maxvmem      3.692G
   arid         undefined
   As you can see, the jobs are nearly a month apart.  This does not affect
   their ability to complete, though we are required not to have
   these duplicates.
   Has anyone experienced this issue, or does anyone have an idea of what
   could be causing this behavior?
   We are not rotating our accounting logs.
   Thanks,
   Douglas Duckworth, MSc, LFCS
   HPC System Administrator
   Scientific Computing Unit
   Physiology and Biophysics
   Weill Cornell Medicine
   E: d...@med.cornell.edu
   O: 212-746-6305
   F: 212-746-8690
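
To see how widespread the reuse is, one option is to scan the accounting file directly. The following is only a sketch: the field positions are an assumption taken from the order described in accounting(5) and should be checked against your version, the function name is hypothetical, and the sample records are made up. It groups by job number and flags only those seen with more than one submission time, on the expectation that array tasks of a single job share one submission time and should not be reported.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set

# Assumed 0-based field positions in the colon-separated accounting file,
# following the order given in accounting(5). Verify for your version.
JOB_NUMBER_FIELD = 5
SUBMISSION_TIME_FIELD = 8

def reused_job_numbers(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Return job numbers that occur with more than one distinct submission
    time, mapped to the set of submission times seen for each."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split(":")
        if len(fields) <= SUBMISSION_TIME_FIELD:
            continue  # malformed or truncated record
        seen[fields[JOB_NUMBER_FIELD]].add(fields[SUBMISSION_TIME_FIELD])
    return {job: times for job, times in seen.items() if len(times) > 1}

# Made-up, truncated records for illustration only (not real accounting data).
sample = [
    "# comment line",
    "standard.q:nodeA:grp:user:first:42:sge:0:100:110:120",
    "standard.q:nodeB:grp:user:second:42:sge:0:200:210:220",
    "standard.q:nodeA:grp:user:array:43:sge:0:300:310:320",
    "standard.q:nodeA:grp:user:array:43:sge:0:300:330:340",
]
print(reused_job_numbers(sample))
```

On a real system the same function can be fed an open file handle for the accounting file (commonly under $SGE_ROOT/$SGE_CELL/common/, though that path is also an assumption here).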

Apart from Reuti's suggestion: possibly the original job got returned to the
queue for some reason and the user then qalter'd it so it was unrecognisable?
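
One way to eyeball this hypothesis is to list which fields differ between the two records that share a job number. The sketch below parses `qacct -j` text into one dict per record and diffs them; it assumes each record starts at a line of '=' characters, as in the output quoted in this thread, and both function names are made up. Which fields a requeued and altered job would actually keep (qsub_time, for instance) is something to verify against your own version.

```python
from typing import Dict, List, Optional, Tuple

def parse_qacct(text: str) -> List[Dict[str, str]]:
    """Split `qacct -j` output into one dict per record.
    Assumes every record is introduced by a line made only of '='."""
    records: List[Dict[str, str]] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if set(line) == {"="}:
            records.append({})  # separator line starts a new record
            continue
        if not records:
            continue  # ignore anything before the first separator
        parts = line.split(None, 1)  # key, then the rest of the line
        records[-1][parts[0]] = parts[1] if len(parts) > 1 else ""
    return records

def differing_fields(a: Dict[str, str], b: Dict[str, str]
                     ) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Fields whose values differ between two records (None if missing)."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}

# Excerpt of the output quoted in this thread (most fields omitted).
excerpt = """\
==============================================================
qname        standard.q
jobnumber    878815
qsub_time    Tue Jan 10 11:49:45 2017
slots        1
==============================================================
qname        standard.q
jobnumber    878815
qsub_time    Wed Feb  8 12:37:38 2017
slots        12
"""
first, second = parse_qacct(excerpt)
print(differing_fields(first, second))
```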


William



_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users

