Thanks Joshua!
That did the 1-Trillion dollar trick!
Best,
Joseph
On 8/7/2019 10:50 PM, Joshua Baker-LePain wrote:
> On Wed, 7 Aug 2019 at 4:40pm, Joseph Farran wrote:
>
>> A user accidentally submitted a 1.4 BILLION job array on our HPC
>> cluster. How can I remove it?
>
> And I thought I had problems with a user submitting a million+
> individual jobs. That was fun too.
>
>> I cannot qdel the job nor can I qhold the job because it crashes
>> SGE. I can restart SGE just fine but the job remains.
>>
>> I removed the SGE job script itself from /var/spool/sge/job_scripts
>> and restarted SGE, job remains.
>
> You also need to remove the job's entry in the job "database".
> Assuming you're using flat files spooling, that entry will be a
> directory under the "jobs" directory in the spool. If the job ID
> is 8027327, e.g., then the directory is jobs/00/0802/7327. Stop
> SGE, 'rm -rf jobs/00/0802/7327', then start SGE up again and the
> job should be gone.
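
The mapping from job ID to spool directory in the example above can be sketched as a small shell helper. This is only a sketch: it assumes the ID is zero-padded to 10 digits and split into groups of 2, 4 and 4, which is inferred from the single example in this thread (8027327 -> jobs/00/0802/7327). The function name job_spool_dir is hypothetical, and the resulting path is relative to the spool directory. Check that the directory actually exists before removing anything.

```shell
# Hypothetical helper: print the flat-file spool directory for a job ID.
# Assumption: the ID is zero-padded to 10 digits and split 2/4/4,
# as suggested by the example 8027327 -> jobs/00/0802/7327.
# Pass the ID without leading zeros (printf would read them as octal).
job_spool_dir() {
    padded=$(printf '%010d' "$1") || return 1
    top=$(printf '%s' "$padded" | cut -c1-2)
    mid=$(printf '%s' "$padded" | cut -c3-6)
    leaf=$(printf '%s' "$padded" | cut -c7-10)
    printf 'jobs/%s/%s/%s\n' "$top" "$mid" "$leaf"
}

job_spool_dir 8027327
```

With SGE stopped, the printed path would then be the argument to the 'rm -rf' described above.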
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users