Reuti

Your insight helped us solve the problem.

Thank you!

Thanks,

Douglas Duckworth, MSc, LFCS
HPC System Administrator
Scientific Computing Unit
Physiology and Biophysics
Weill Cornell Medicine
E: d...@med.cornell.edu
O: 212-746-5454
F: 212-746-8690

On Feb 13, 2017 4:54 PM, "Reuti" <re...@staff.uni-marburg.de> wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hi,
>
> Did you restore any part of the qmaster? Either $SGE_ROOT/default or a
> dedicated /var/spool/sge/qmaster? There is a file "jobseqnum"; this file
> contains the next job number.
>
> For parallel jobs getting more than one entry in the accounting file might
> be fine, as long as the "accounting_summary FALSE" is set in the PE and
> there were `qrsh -inherit …` calls included. But the dates being so far
> apart and having different names points to the first cause given.
>
> - -- Reuti
>
>
> Am 13.02.2017 um 21:17 schrieb Douglas Duckworth:
>
> > Hello
> >
> > About a month ago we started seeing duplicate job numbers in SGE.
> >
> > For example:
> >
> > sysadmin@panda2[~]$ qacct -j 878815
> >
> > ==============================================================
> > qname        standard.q
> > hostname     node127.panda.pbtech
> > group        abc
> > owner        developer
> > project      NONE
> > department   cmlab.u
> > jobname      old job
> > jobnumber    878815
> > taskid       undefined
> > account      sge
> > priority     0
> > qsub_time    Tue Jan 10 11:49:45 2017
> > start_time   Tue Jan 10 11:51:40 2017
> > end_time     Tue Jan 10 11:51:40 2017
> > granted_pe   smp
> > slots        1
> > failed       0
> > exit_status  0
> > ru_wallclock 0
> > ru_utime     0.001
> > ru_stime     0.006
> > ru_maxrss    1428
> > ru_ixrss     0
> > ru_ismrss    0
> > ru_idrss     0
> > ru_isrss     0
> > ru_minflt    1254
> > ru_majflt    0
> > ru_nswap     0
> > ru_inblock   0
> > ru_oublock   8
> > ru_msgsnd    0
> > ru_msgrcv    0
> > ru_nsignals  0
> > ru_nvcsw     60
> > ru_nivcsw    4
> > cpu          0.007
> > mem          0.000
> > io           0.000
> > iow          0.000
> > maxvmem      0.000
> > arid         undefined
> > ==============================================================
> > qname        standard.q
> > hostname     node120.panda.pbtech
> > group        abc
> > owner        developer
> > project      NONE
> > department   cmlab.u
> > jobname      newjob
> > jobnumber    878815
> > taskid       undefined
> > account      sge
> > priority     0
> > qsub_time    Wed Feb  8 12:37:38 2017
> > start_time   Wed Feb  8 13:20:49 2017
> > end_time     Wed Feb  8 13:41:01 2017
> > granted_pe   smp
> > slots        12
> > failed       100 : assumedly after job
> > exit_status  137
> > ru_wallclock 1212
> > ru_utime     0.002
> > ru_stime     0.022
> > ru_maxrss    1280
> > ru_ixrss     0
> > ru_ismrss    0
> > ru_idrss     0
> > ru_isrss     0
> > ru_minflt    623
> > ru_majflt    0
> > ru_nswap     0
> > ru_inblock   0
> > ru_oublock   8
> > ru_msgsnd    0
> > ru_msgrcv    0
> > ru_nsignals  0
> > ru_nvcsw     47
> > ru_nivcsw    2
> > cpu          13816.930
> > mem          48585.941
> > io           34.210
> > iow          0.000
> > maxvmem      3.692G
> > arid         undefined
> >
> > As you can see, the jobs are nearly a month apart. This does not affect
> > their ability to complete, but we are required not to have these
> > duplicates.
> >
> > Has anyone experienced this issue, or does anyone have an idea of what
> > could be causing this behavior?
> >
> > We are not rotating our accounting logs.
> >
> > Thanks,
> >
> > Douglas Duckworth, MSc, LFCS
> > HPC System Administrator
> > Scientific Computing Unit
> > Physiology and Biophysics
> > Weill Cornell Medicine
> > E: d...@med.cornell.edu
> > O: 212-746-6305
> > F: 212-746-8690
> > _______________________________________________
> > users mailing list
> > users@gridengine.org
> > https://gridengine.org/mailman/listinfo/users
>
> -----BEGIN PGP SIGNATURE-----
> Comment: GPGTools - https://gpgtools.org
>
> iEYEARECAAYFAliiKvYACgkQo/GbGkBRnRrXVwCgzfztDlbPga7OH8KjHIdOpl3f
> aswAnA5BTYi0JHz/zilUFKpTYF0iGnu7
> =/bC2
> -----END PGP SIGNATURE-----
>
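
A rough way to check for the job counter reset Reuti describes is to scan the accounting file for job numbers that occur with more than one submission time. The Python sketch below assumes the colon-separated layout described in accounting(5), with job_name, job_number and submission_time as the 5th, 6th and 9th fields. The function name and field constants are illustrative, not part of Grid Engine, so verify the positions against your version. Array tasks, and the extra entries written for parallel jobs, should share a single submission time and are therefore not flagged.

```python
from collections import defaultdict

# Zero-based field positions, assumed from the accounting(5) layout.
# Verify these against the Grid Engine version in use.
JOB_NAME, JOB_NUMBER, SUBMISSION_TIME = 4, 5, 8


def find_reused_job_numbers(lines):
    """Return {job_number: [(submission_time, job_name), ...]} for every
    job number that appears with more than one distinct submission time."""
    seen = defaultdict(set)
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        # Skip truncated records that lack the fields we need.
        if len(fields) <= SUBMISSION_TIME:
            continue
        seen[fields[JOB_NUMBER]].add((fields[SUBMISSION_TIME], fields[JOB_NAME]))
    return {
        job: sorted(entries)
        for job, entries in seen.items()
        if len({submitted for submitted, _ in entries}) > 1
    }
```

Passing an open file handle for the accounting file (commonly found under $SGE_ROOT/<cell>/common/accounting, though the location depends on the installation) should return job 878815 with both job names for the case shown above, if the assumptions hold.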
