On Wed, 7 Aug 2019 at 4:40pm, Joseph Farran wrote:
A user accidentally submitted a 1.4 BILLION job array on our HPC cluster. How can I remove it?
And I thought I had problems with a user submitting a million+ individual jobs. That was fun too.
I cannot qdel the job, nor can I qhold it, because doing so crashes SGE. I can restart SGE just fine, but the job remains. I removed the SGE job script itself from /var/spool/sge/job_scripts and restarted SGE, and the job still remains.
You also need to remove the job's entry in the job "database". Assuming you're using flat-file spooling, that entry will be a directory under the "jobs" directory in the spool. For example, if the job ID is 8027327, then the directory is jobs/00/0802/7327. Stop SGE, 'rm -rf jobs/00/0802/7327', then start SGE up again and the job should be gone.
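To avoid deleting the wrong directory, the ID-to-path mapping from the example can be sketched in a few lines of Python. This assumes the job ID is zero-padded to 10 digits and split into groups of 2, 4, and 4, which is what the 8027327 example implies; the function name job_spool_dir and the default "jobs" root are made up for illustration, so check the result against what is actually in your spool before removing anything.

```python
from pathlib import PurePosixPath


def job_spool_dir(job_id: int, jobs_root: str = "jobs") -> str:
    """Return the spool directory for a job ID under flat-file spooling.

    Assumption (inferred from the example above, not from SGE docs):
    the ID is zero-padded to 10 digits and split 2/4/4.
    """
    if not 0 < job_id <= 9_999_999_999:
        raise ValueError(f"job ID out of range for a 10-digit layout: {job_id}")
    padded = f"{job_id:010d}"  # e.g. 8027327 -> "0008027327"
    return str(PurePosixPath(jobs_root, padded[:2], padded[2:6], padded[6:]))


# Only prints the path; the rm -rf step is deliberately left manual.
print(job_spool_dir(8027327))  # jobs/00/0802/7327
```

Pass the full path to your spool's jobs directory as jobs_root to get an absolute path, and only run the rm -rf once SGE is stopped.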
-- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF