Hi,

Did you restore any part of the qmaster? Either $SGE_ROOT/default or a dedicated /var/spool/sge/qmaster? There is a file "jobseqnum" there, which contains the next job number; if it was restored from an older backup, job numbers that were already used would be issued again.

For parallel jobs, getting more than one entry in the accounting file might be fine, as long as "accounting_summary FALSE" is set in the PE and there were `qrsh -inherit …` calls included. But the dates being so far apart and the jobs having different names points to the first cause given.

-- Reuti


On 13.02.2017 at 21:17, Douglas Duckworth wrote:

> Hello
>
> About a month ago we started seeing duplicate job numbers in SGE.
>
> For example:
>
> sysadmin@panda2[~]$ qacct -j 878815
>
> ==============================================================
> qname        standard.q
> hostname     node127.panda.pbtech
> group        abc
> owner        developer
> project      NONE
> department   cmlab.u
> jobname      old job
> jobnumber    878815
> taskid       undefined
> account      sge
> priority     0
> qsub_time    Tue Jan 10 11:49:45 2017
> start_time   Tue Jan 10 11:51:40 2017
> end_time     Tue Jan 10 11:51:40 2017
> granted_pe   smp
> slots        1
> failed       0
> exit_status  0
> ru_wallclock 0
> ru_utime     0.001
> ru_stime     0.006
> ru_maxrss    1428
> ru_ixrss     0
> ru_ismrss    0
> ru_idrss     0
> ru_isrss     0
> ru_minflt    1254
> ru_majflt    0
> ru_nswap     0
> ru_inblock   0
> ru_oublock   8
> ru_msgsnd    0
> ru_msgrcv    0
> ru_nsignals  0
> ru_nvcsw     60
> ru_nivcsw    4
> cpu          0.007
> mem          0.000
> io           0.000
> iow          0.000
> maxvmem      0.000
> arid         undefined
> ==============================================================
> qname        standard.q
> hostname     node120.panda.pbtech
> group        abc
> owner        developer
> project      NONE
> department   cmlab.u
> jobname      newjob
> jobnumber    878815
> taskid       undefined
> account      sge
> priority     0
> qsub_time    Wed Feb 8 12:37:38 2017
> start_time   Wed Feb 8 13:20:49 2017
> end_time     Wed Feb 8 13:41:01 2017
> granted_pe   smp
> slots        12
> failed       100 : assumedly after job
> exit_status  137
> ru_wallclock 1212
> ru_utime     0.002
> ru_stime     0.022
> ru_maxrss    1280
> ru_ixrss     0
> ru_ismrss    0
> ru_idrss     0
> ru_isrss     0
> ru_minflt    623
> ru_majflt    0
> ru_nswap     0
> ru_inblock   0
> ru_oublock   8
> ru_msgsnd    0
> ru_msgrcv    0
> ru_nsignals  0
> ru_nvcsw     47
> ru_nivcsw    2
> cpu          13816.930
> mem          48585.941
> io           34.210
> iow          0.000
> maxvmem      3.692G
> arid         undefined
>
> As you can see, the jobs are nearly a month apart. This does not affect their
> ability to complete, but we are required not to have these duplicates.
>
> Has anyone experienced this issue, or does anyone have an idea of what could
> be causing this behavior?
>
> We are not rotating our accounting logs.
>
> Thanks,
>
> Douglas Duckworth, MSc, LFCS
> HPC System Administrator
> Scientific Computing Unit
> Physiology and Biophysics
> Weill Cornell Medicine
> E: d...@med.cornell.edu
> O: 212-746-6305
> F: 212-746-8690
> _______________________________________________
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
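
To gauge how many job numbers were handed out twice, the distinction drawn in the reply (same job number, but different names and submission dates) can be checked directly against the accounting file. The following is a minimal sketch, not a tested tool: it assumes the colon-separated layout described in accounting(5), with job_name, job_number and submission_time as fields 5, 6 and 9, and that job names contain no colons. The function name `find_reused_job_numbers` is made up for illustration. Multiple entries of one parallel job share name and submission time, so they are not reported.

```python
import sys
from collections import defaultdict

# Zero-based field positions, ASSUMED from the accounting(5) layout:
# qname:hostname:group:owner:job_name:job_number:account:priority:submission_time:...
# Verify against the man page of your Grid Engine version before relying on them.
JOB_NAME, JOB_NUMBER, SUBMISSION_TIME = 4, 5, 8


def find_reused_job_numbers(lines):
    """Return {job_number: sorted list of distinct (submission_time, job_name)}
    for every job number that appears with more than one such pair."""
    seen = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and comment lines
            continue
        fields = line.split(":")
        if len(fields) <= SUBMISSION_TIME:  # too short to be a job record
            continue
        try:
            number = int(fields[JOB_NUMBER])
            submitted = int(fields[SUBMISSION_TIME])
        except ValueError:
            continue  # not a record in the assumed layout
        seen[number].add((submitted, fields[JOB_NAME]))
    return {n: sorted(pairs) for n, pairs in seen.items() if len(pairs) > 1}


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python find_reused.py /path/to/accounting
    with open(sys.argv[1], errors="replace") as fh:
        for number, pairs in sorted(find_reused_job_numbers(fh).items()):
            print(number, pairs)
```

On a default installation the file to pass is usually $SGE_ROOT/default/common/accounting. The lowest reused job number and the submission time of its second entry should give a rough idea of when, and to what value, the job counter was reset.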