On Mon, Feb 13, 2017 at 03:17:29PM -0500, Douglas Duckworth wrote:
> Hello
>
> About a month ago we recently started seeing duplicate job in SGE.
>
> For example:
>
> sysadmin@panda2[~]$ qacct -j 878815
> ==============================================================
> qname        standard.q
> hostname     node127.panda.pbtech
> group        abc
> owner        developer
> project      NONE
> department   cmlab.u
> jobname      old job
> jobnumber    878815
> taskid       undefined
> account      sge
> priority     0
> qsub_time    Tue Jan 10 11:49:45 2017
> start_time   Tue Jan 10 11:51:40 2017
> end_time     Tue Jan 10 11:51:40 2017
> granted_pe   smp
> slots        1
> failed       0
> exit_status  0
> ru_wallclock 0
> ru_utime     0.001
> ru_stime     0.006
> ru_maxrss    1428
> ru_ixrss     0
> ru_ismrss    0
> ru_idrss     0
> ru_isrss     0
> ru_minflt    1254
> ru_majflt    0
> ru_nswap     0
> ru_inblock   0
> ru_oublock   8
> ru_msgsnd    0
> ru_msgrcv    0
> ru_nsignals  0
> ru_nvcsw     60
> ru_nivcsw    4
> cpu          0.007
> mem          0.000
> io           0.000
> iow          0.000
> maxvmem      0.000
> arid         undefined
> ==============================================================
> qname        standard.q
> hostname     node120.panda.pbtech
> group        abc
> owner        developer
> project      NONE
> department   cmlab.u
> jobname      newjob
> jobnumber    878815
> taskid       undefined
> account      sge
> priority     0
> qsub_time    Wed Feb  8 12:37:38 2017
> start_time   Wed Feb  8 13:20:49 2017
> end_time     Wed Feb  8 13:41:01 2017
> granted_pe   smp
> slots        12
> failed       100 : assumedly after job
> exit_status  137
> ru_wallclock 1212
> ru_utime     0.002
> ru_stime     0.022
> ru_maxrss    1280
> ru_ixrss     0
> ru_ismrss    0
> ru_idrss     0
> ru_isrss     0
> ru_minflt    623
> ru_majflt    0
> ru_nswap     0
> ru_inblock   0
> ru_oublock   8
> ru_msgsnd    0
> ru_msgrcv    0
> ru_nsignals  0
> ru_nvcsw     47
> ru_nivcsw    2
> cpu          13816.930
> mem          48585.941
> io           34.210
> iow          0.000
> maxvmem      3.692G
> arid         undefined
>
> As you can see the jobs are nearly a month apart. This does not affect
> their ability to complete though it's required that we not have
> these duplicates.
>
> Has anyone experienced this issue or have an idea of what could be causing
> this behavior?
>
> We are not rotating our accounting logs.
>
> Thanks,
>
> Douglas Duckworth, MSc, LFCS
> HPC System Administrator
> Scientific Computing Unit
> Physiology and Biophysics
> Weill Cornell Medicine
> E: d...@med.cornell.edu
> O: 212-746-6305
> F: 212-746-8690

Apart from Reuti's suggestion: possibly the original job was returned to the
queue for some reason, and the user then qalter'd it so heavily that it became
unrecognisable?

William
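
To see how widespread the problem is, something along these lines could list
every job number that appears in the accounting file with more than one
submission time. It is only a rough sketch: it assumes the colon-separated
record layout described in accounting(5), with the job number as the sixth
field and the submission time as the ninth, and it will misparse any record
whose job name contains a colon. The function name find_reused_job_numbers is
made up for illustration. Array tasks of one job share a submission time, so
they should not show up as duplicates.

```python
from collections import defaultdict

# Assumed zero-based field positions, per accounting(5):
# qname:hostname:group:owner:job_name:job_number:account:priority:submission_time:...
JOB_NUMBER = 5
SUBMISSION_TIME = 8


def find_reused_job_numbers(lines):
    """Return {job_number: sorted submission times} for job numbers
    recorded with more than one distinct submission time."""
    seen = defaultdict(set)
    for line in lines:
        line = line.strip()
        # Skip blank lines and the comment header at the top of the file.
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        # Ignore records too short to contain a submission time.
        if len(fields) <= SUBMISSION_TIME:
            continue
        seen[fields[JOB_NUMBER]].add(fields[SUBMISSION_TIME])
    return {job: sorted(times) for job, times in seen.items() if len(times) > 1}


# Example use, with the path as an assumption about the local installation:
#   with open("/path/to/sge/default/common/accounting") as f:
#       for job, times in find_reused_job_numbers(f).items():
#           print(job, times)
```

If the list turns out to be long and the submission times are weeks apart, that
would point away from a one-off requeue and towards the job counter itself.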
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users