Reuti,

Your insight helped us solve the problem.
Thank you!

Thanks,

Douglas Duckworth, MSc, LFCS
HPC System Administrator
Scientific Computing Unit
Physiology and Biophysics
Weill Cornell Medicine
E: d...@med.cornell.edu
O: 212-746-5454
F: 212-746-8690

On Feb 13, 2017 4:54 PM, "Reuti" <re...@staff.uni-marburg.de> wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hi,
>
> Did you restore a part on the qmaster? Either $SGE_ROOT/default or a
> dedicated /var/spool/sge/qmaster? There is a file "jobseqnum" - this file
> contains the next job number?
>
> For parallel jobs getting more than one entry in the accounting file might
> be fine, as long as the "accounting_summary FALSE" is set in the PE and
> there were `qrsh -inherit …` calls included. But the dates being so far
> apart and having different names points to the first cause given.
>
> - -- Reuti
>
>
> Am 13.02.2017 um 21:17 schrieb Douglas Duckworth:
>
> > Hello
> >
> > About a month ago we started seeing duplicate jobs in SGE.
> >
> > For example:
> >
> > sysadmin@panda2[~]$ qacct -j 878815
> >
> > ==============================================================
> > qname        standard.q
> > hostname     node127.panda.pbtech
> > group        abc
> > owner        developer
> > project      NONE
> > department   cmlab.u
> > jobname      old job
> > jobnumber    878815
> > taskid       undefined
> > account      sge
> > priority     0
> > qsub_time    Tue Jan 10 11:49:45 2017
> > start_time   Tue Jan 10 11:51:40 2017
> > end_time     Tue Jan 10 11:51:40 2017
> > granted_pe   smp
> > slots        1
> > failed       0
> > exit_status  0
> > ru_wallclock 0
> > ru_utime     0.001
> > ru_stime     0.006
> > ru_maxrss    1428
> > ru_ixrss     0
> > ru_ismrss    0
> > ru_idrss     0
> > ru_isrss     0
> > ru_minflt    1254
> > ru_majflt    0
> > ru_nswap     0
> > ru_inblock   0
> > ru_oublock   8
> > ru_msgsnd    0
> > ru_msgrcv    0
> > ru_nsignals  0
> > ru_nvcsw     60
> > ru_nivcsw    4
> > cpu          0.007
> > mem          0.000
> > io           0.000
> > iow          0.000
> > maxvmem      0.000
> > arid         undefined
> > ==============================================================
> > qname        standard.q
> > hostname     node120.panda.pbtech
> > group        abc
> > owner        developer
> > project      NONE
> > department   cmlab.u
> > jobname      newjob
> > jobnumber    878815
> > taskid       undefined
> > account      sge
> > priority     0
> > qsub_time    Wed Feb 8 12:37:38 2017
> > start_time   Wed Feb 8 13:20:49 2017
> > end_time     Wed Feb 8 13:41:01 2017
> > granted_pe   smp
> > slots        12
> > failed       100 : assumedly after job
> > exit_status  137
> > ru_wallclock 1212
> > ru_utime     0.002
> > ru_stime     0.022
> > ru_maxrss    1280
> > ru_ixrss     0
> > ru_ismrss    0
> > ru_idrss     0
> > ru_isrss     0
> > ru_minflt    623
> > ru_majflt    0
> > ru_nswap     0
> > ru_inblock   0
> > ru_oublock   8
> > ru_msgsnd    0
> > ru_msgrcv    0
> > ru_nsignals  0
> > ru_nvcsw     47
> > ru_nivcsw    2
> > cpu          13816.930
> > mem          48585.941
> > io           34.210
> > iow          0.000
> > maxvmem      3.692G
> > arid         undefined
> >
> > As you can see the jobs are nearly a month apart. This does not affect
> > their ability to complete though it's required that we not have these
> > duplicates.
> >
> > Has anyone experienced this issue or have an idea of what could be
> > causing this behavior?
> >
> > We are not rotating our accounting logs.
> >
> > Thanks,
> >
> > Douglas Duckworth, MSc, LFCS
> > HPC System Administrator
> > Scientific Computing Unit
> > Physiology and Biophysics
> > Weill Cornell Medicine
> > E: d...@med.cornell.edu
> > O: 212-746-6305
> > F: 212-746-8690
> > _______________________________________________
> > users mailing list
> > users@gridengine.org
> > https://gridengine.org/mailman/listinfo/users
>
> -----BEGIN PGP SIGNATURE-----
> Comment: GPGTools - https://gpgtools.org
>
> iEYEARECAAYFAliiKvYACgkQo/GbGkBRnRrXVwCgzfztDlbPga7OH8KjHIdOpl3f
> aswAnA5BTYi0JHz/zilUFKpTYF0iGnu7
> =/bC2
> -----END PGP SIGNATURE-----
>
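
For anyone who lands on this thread with the same symptom, here is a rough Python sketch of how one might scan the accounting file for job numbers that were handed out more than once. It is not something from the thread itself. It assumes the colon-separated record layout described in accounting(5), where job_name, job_number and submission_time are the 5th, 6th and 9th fields, and it assumes job names contain no colons. The function name and the sample records are made up for illustration. Entries that share a job number and a submission time (array tasks, or parallel jobs with "accounting_summary FALSE") are deliberately not reported, following Reuti's point above.

```python
from collections import defaultdict

# Zero-based field positions in a colon-separated accounting record.
# Assumption: layout as described in accounting(5).
JOB_NAME, JOB_NUMBER, SUBMISSION_TIME = 4, 5, 8


def find_reused_job_numbers(lines):
    """Map each job number whose records disagree on submission time
    to a sorted list of (submission_time, job_name) pairs."""
    seen = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split(":")
        if len(fields) <= SUBMISSION_TIME:
            continue  # skip records that are too short to parse
        try:
            submitted = int(fields[SUBMISSION_TIME])
        except ValueError:
            continue  # skip records with a non-numeric submission time
        seen[fields[JOB_NUMBER]].add((submitted, fields[JOB_NAME]))
    # Same job number but more than one distinct submission time
    # suggests the job counter was reset or restored.
    return {
        number: sorted(entries)
        for number, entries in seen.items()
        if len({submitted for submitted, _ in entries}) > 1
    }


if __name__ == "__main__":
    # Invented sample records (truncated after end_time), not real data.
    sample = [
        "standard.q:node127:abc:developer:oldjob:878815:sge:0:1000:1100:1100",
        "standard.q:node120:abc:developer:newjob:878815:sge:0:9000:9500:9900",
        "standard.q:node121:abc:developer:other:878816:sge:0:9100:9200:9300",
    ]
    for number, entries in find_reused_job_numbers(sample).items():
        print(number, entries)
```

To run it against a live cluster one would pass the lines of the accounting file instead of the sample list; that file is usually found at $SGE_ROOT/$SGE_CELL/common/accounting, but check your own installation.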