Howdy.
A user accidentally submitted a job array with 1.4 BILLION tasks to our HPC
cluster. How can I remove it?
I cannot qdel or qhold the job; either command crashes SGE.
I can restart SGE just fine but the job remains.
I removed the job script itself from /var/spool/sge/job_scripts and
restarted SGE, but the job still remains.
The only thing that works is removing tasks, either one at a time or in
groups, but with 1.4 BILLION tasks that will take a while.
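Removing the tasks in groups can at least be scripted so it runs unattended. The sketch below assumes that your qdel accepts a task range given as `JOB_ID -t FIRST-LAST` (check `man qdel` and `man sge_types` on your installation); the function name `delete_in_chunks` and the `DRY_RUN` switch are hypothetical, not part of SGE. With `DRY_RUN=1` it only prints the commands it would run.

```shell
#!/bin/sh
# Hypothetical helper: delete tasks FIRST..LAST of array job JOB_ID in
# ranges of CHUNK tasks. Assumes qdel accepts "JOB_ID -t START-END".
# Set DRY_RUN=1 to print the qdel commands instead of executing them.
delete_in_chunks() {
    job_id=$1
    first=$2
    last=$3
    chunk=$4
    start=$first
    while [ "$start" -le "$last" ]; do
        # Clamp the final range so it does not run past LAST.
        end=$((start + chunk - 1))
        if [ "$end" -gt "$last" ]; then
            end=$last
        fi
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "qdel $job_id -t $start-$end"
        else
            # Stop at the first failure (e.g. if qmaster stops responding).
            qdel "$job_id" -t "$start-$end" || return 1
        fi
        start=$((end + 1))
    done
}
```

For example, `DRY_RUN=1 delete_in_chunks 12345 1 1400000000 100000 | head` previews the first few ranges for a hypothetical job 12345. At 100,000 tasks per call that is 14,000 qdel invocations; whether qmaster tolerates ranges that large without crashing is untested here, so start with a small chunk and increase it.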
To prevent this in the future, I set max_aj_tasks in the SGE configuration:
# qconf -sconf|grep tasks
max_aj_tasks 100000
Any help appreciated.
Thank you,
Joseph
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users