-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi,

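Did you restore any part of the qmaster? Either $SGE_ROOT/default or a dedicated 
/var/spool/sge/qmaster? There is a file "jobseqnum" there, and this file contains 
the next job number. If an older copy of it was restored, job numbers that were 
already used would be handed out again.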

For parallel jobs getting more than one entry in the accounting file might be 
fine, as long as the "accounting_summary FALSE" is set in the PE and there were 
`qrsh -inherit …` calls included. But the dates being so far apart and having 
different names points to the first cause given.
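
To see how many job numbers are affected, the accounting file can be scanned 
for job numbers that occur with more than one distinct submission. Below is a 
minimal sketch in Python, not a tested tool. It assumes the colon-separated 
record layout from accounting(5), with job_name, job_number and 
submission_time as the 5th, 6th and 9th fields; please verify the positions 
for your version. The function name find_reused_job_numbers is made up for 
this example.

```python
from collections import defaultdict

# Assumed 0-based field positions in a colon-separated accounting record,
# following the field order documented in accounting(5). Verify locally.
JOB_NAME, JOB_NUMBER, SUBMISSION_TIME = 4, 5, 8


def find_reused_job_numbers(lines):
    """Map each reused job number to its sorted (submission_time, job_name) pairs.

    A job number counts as reused when it appears with more than one distinct
    (submission_time, job_name) combination. Repeated records of the same
    submission (e.g. per-task entries of a parallel job) are not reported.
    """
    seen = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split(":")
        if len(fields) <= SUBMISSION_TIME:
            continue  # skip truncated records
        try:
            job_number = int(fields[JOB_NUMBER])
            submitted = int(fields[SUBMISSION_TIME])
        except ValueError:
            continue  # skip records with non-numeric fields
        seen[job_number].add((submitted, fields[JOB_NAME]))
    return {num: sorted(subs) for num, subs in seen.items() if len(subs) > 1}
```

It could be called with the open accounting file, e.g. 
find_reused_job_numbers(open(path)), where path points to the accounting file 
of your cell. If the reused numbers all lie in one contiguous range, that would 
fit a "jobseqnum" which was set back.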

- -- Reuti


On 13.02.2017 at 21:17, Douglas Duckworth wrote:

> Hello
> 
> About a month ago we started seeing duplicate jobs in SGE.
> 
> For example:
> 
> sysadmin@panda2[~]$ qacct -j 878815
> 
> ==============================================================
> qname        standard.q          
> hostname     node127.panda.pbtech
> group        abc                 
> owner        developer             
> project      NONE                
> department   cmlab.u             
> jobname      old job
> jobnumber    878815              
> taskid       undefined
> account      sge                 
> priority     0                   
> qsub_time    Tue Jan 10 11:49:45 2017
> start_time   Tue Jan 10 11:51:40 2017
> end_time     Tue Jan 10 11:51:40 2017
> granted_pe   smp                 
> slots        1                   
> failed       0    
> exit_status  0                   
> ru_wallclock 0            
> ru_utime     0.001        
> ru_stime     0.006        
> ru_maxrss    1428                
> ru_ixrss     0                   
> ru_ismrss    0                   
> ru_idrss     0                   
> ru_isrss     0                   
> ru_minflt    1254                
> ru_majflt    0                   
> ru_nswap     0                   
> ru_inblock   0                   
> ru_oublock   8                   
> ru_msgsnd    0                   
> ru_msgrcv    0                   
> ru_nsignals  0                   
> ru_nvcsw     60                  
> ru_nivcsw    4                   
> cpu          0.007        
> mem          0.000             
> io           0.000             
> iow          0.000             
> maxvmem      0.000
> arid         undefined
> ==============================================================
> qname        standard.q          
> hostname     node120.panda.pbtech
> group        abc                 
> owner        developer             
> project      NONE                
> department   cmlab.u             
> jobname      newjob
> jobnumber    878815              
> taskid       undefined
> account      sge                 
> priority     0                   
> qsub_time    Wed Feb  8 12:37:38 2017
> start_time   Wed Feb  8 13:20:49 2017
> end_time     Wed Feb  8 13:41:01 2017
> granted_pe   smp                 
> slots        12                  
> failed       100 : assumedly after job
> exit_status  137                 
> ru_wallclock 1212         
> ru_utime     0.002        
> ru_stime     0.022        
> ru_maxrss    1280                
> ru_ixrss     0                   
> ru_ismrss    0                   
> ru_idrss     0                   
> ru_isrss     0                   
> ru_minflt    623                 
> ru_majflt    0                   
> ru_nswap     0                   
> ru_inblock   0                   
> ru_oublock   8                   
> ru_msgsnd    0                   
> ru_msgrcv    0                   
> ru_nsignals  0                   
> ru_nvcsw     47                  
> ru_nivcsw    2                   
> cpu          13816.930    
> mem          48585.941         
> io           34.210            
> iow          0.000             
> maxvmem      3.692G
> arid         undefined
> 
> As you can see, the jobs are nearly a month apart.  This does not affect their 
> ability to complete, but we are required not to have these duplicates.
> 
> Has anyone experienced this issue, or does anyone have an idea of what could 
> be causing this behavior?
> 
> We are not rotating our accounting logs.
> 
> Thanks,
> 
> Douglas Duckworth, MSc, LFCS
> HPC System Administrator
> Scientific Computing Unit
> Physiology and Biophysics
> Weill Cornell Medicine
> E: d...@med.cornell.edu
> O: 212-746-6305
> F: 212-746-8690
> _______________________________________________
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users

-----BEGIN PGP SIGNATURE-----
Comment: GPGTools - https://gpgtools.org

iEYEARECAAYFAliiKvYACgkQo/GbGkBRnRrXVwCgzfztDlbPga7OH8KjHIdOpl3f
aswAnA5BTYi0JHz/zilUFKpTYF0iGnu7
=/bC2
-----END PGP SIGNATURE-----
