On Mon, Feb 13, 2017 at 03:17:29PM -0500, Douglas Duckworth wrote:
>    Hello
>    About a month ago we started seeing duplicate job numbers in SGE.
>    For example:
>    sysadmin@panda2[~]$ qacct -j 878815
>    ==============================================================
>    qname        standard.q          
>    hostname     node127.panda.pbtech
>    group        abc                 
>    owner        developer             
>    project      NONE                
>    department   cmlab.u             
>    jobname      old job
>    jobnumber    878815              
>    taskid       undefined
>    account      sge                 
>    priority     0                   
>    qsub_time    Tue Jan 10 11:49:45 2017
>    start_time   Tue Jan 10 11:51:40 2017
>    end_time     Tue Jan 10 11:51:40 2017
>    granted_pe   smp                 
>    slots        1                   
>    failed       0    
>    exit_status  0                   
>    ru_wallclock 0            
>    ru_utime     0.001        
>    ru_stime     0.006        
>    ru_maxrss    1428                
>    ru_ixrss     0                   
>    ru_ismrss    0                   
>    ru_idrss     0                   
>    ru_isrss     0                   
>    ru_minflt    1254                
>    ru_majflt    0                   
>    ru_nswap     0                   
>    ru_inblock   0                   
>    ru_oublock   8                   
>    ru_msgsnd    0                   
>    ru_msgrcv    0                   
>    ru_nsignals  0                   
>    ru_nvcsw     60                  
>    ru_nivcsw    4                   
>    cpu          0.007        
>    mem          0.000             
>    io           0.000             
>    iow          0.000             
>    maxvmem      0.000
>    arid         undefined
>    ==============================================================
>    qname        standard.q          
>    hostname     node120.panda.pbtech
>    group        abc                 
>    owner        developer             
>    project      NONE                
>    department   cmlab.u             
>    jobname      newjob
>    jobnumber    878815              
>    taskid       undefined
>    account      sge                 
>    priority     0                   
>    qsub_time    Wed Feb  8 12:37:38 2017
>    start_time   Wed Feb  8 13:20:49 2017
>    end_time     Wed Feb  8 13:41:01 2017
>    granted_pe   smp                 
>    slots        12                  
>    failed       100 : assumedly after job
>    exit_status  137                 
>    ru_wallclock 1212         
>    ru_utime     0.002        
>    ru_stime     0.022        
>    ru_maxrss    1280                
>    ru_ixrss     0                   
>    ru_ismrss    0                   
>    ru_idrss     0                   
>    ru_isrss     0                   
>    ru_minflt    623                 
>    ru_majflt    0                   
>    ru_nswap     0                   
>    ru_inblock   0                   
>    ru_oublock   8                   
>    ru_msgsnd    0                   
>    ru_msgrcv    0                   
>    ru_nsignals  0                   
>    ru_nvcsw     47                  
>    ru_nivcsw    2                   
>    cpu          13816.930    
>    mem          48585.941         
>    io           34.210            
>    iow          0.000             
>    maxvmem      3.692G
>    arid         undefined
>    As you can see, the two jobs are nearly a month apart.  This does not
>    affect their ability to complete, but we are required not to have
>    these duplicates.
>    Has anyone experienced this issue, or does anyone have an idea of what
>    could be causing this behavior?
>    We are not rotating our accounting logs.
>    Thanks,
>    Douglas Duckworth, MSc, LFCS
>    HPC System Administrator
>    Scientific Computing Unit
>    Physiology and Biophysics
>    Weill Cornell Medicine
>    E: d...@med.cornell.edu
>    O: 212-746-6305
>    F: 212-746-8690

Apart from Reuti's suggestion: possibly the original job was returned to the
queue for some reason, and the user then qalter'd it so that it was
unrecognisable?
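
One way to look into this might be to group the qacct records by job number
and compare their qsub_time values. The sketch below is only an illustration
in Python: it assumes the text has the same layout as the "qacct -j" output
quoted above (records separated by lines of "=", one "key value" pair per
line), and the function names parse_qacct and reused_job_numbers are made up
for this example. A differing qsub_time is a hint rather than proof, so the
result should be treated as a starting point for inspection.

```python
from collections import defaultdict


def parse_qacct(text):
    """Split `qacct -j` style output into one dict per accounting record."""
    records, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("====="):
            # A separator line starts a new record.
            if current:
                records.append(current)
            current = {}
            continue
        if not line:
            continue
        # The key is the first word; the rest of the line is the value.
        key, _, value = line.partition(" ")
        current[key] = value.strip()
    if current:
        records.append(current)
    return records


def reused_job_numbers(records):
    """Map each job number seen with more than one qsub_time to those times."""
    seen = defaultdict(set)
    for rec in records:
        job = rec.get("jobnumber")
        if job is None:
            continue
        seen[job].add(rec.get("qsub_time"))
    return {job: sorted(times) for job, times in seen.items() if len(times) > 1}


if __name__ == "__main__":
    # Trimmed-down copy of the two records quoted in the original message.
    sample = """\
==============================================================
jobname      old job
jobnumber    878815
qsub_time    Tue Jan 10 11:49:45 2017
==============================================================
jobname      newjob
jobnumber    878815
qsub_time    Wed Feb  8 12:37:38 2017
"""
    print(reused_job_numbers(parse_qacct(sample)))
```

To run it against real data, the output of qacct could be saved to a file and
its contents passed to parse_qacct in place of the sample string.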


William

_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
