Hi all,
We are having issues submitting MPI jobs. There are also intermittent
issues where no Slurm commands can be issued at all. The major concern is
that jobs cannot be submitted to a particular set of 16 nodes that sit in
their own partition (a single MPI job, for example, not hundreds at once).
There are numerous
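A first sanity check (only a sketch; "mpi16" below is a placeholder name
for that 16-node partition) would be to confirm the node states and that
the controller responds at all:

    # Show the state of the 16-node partition and its nodes ("mpi16" is a placeholder)
    sinfo -N -l -p mpi16
    scontrol show partition mpi16
    # Confirm slurmctld responds, and look at its RPC load
    scontrol ping
    sdiag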
I had similar problems in the past.
The two most common causes were:
1. Controller load - if slurmctld was under heavy load, it sometimes
did not respond in a timely manner and exceeded the timeout limit.
2. Topology, message forwarding, and message aggregation.
For
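A quick way to see whether either of those applies here (a sketch only;
the parameter names are the usual slurm.conf ones and may vary by
version) is to dump the relevant settings and the controller's RPC
statistics:

    # Timeout, topology and message-aggregation related settings
    scontrol show config | egrep -i 'MessageTimeout|SlurmctldTimeout|TreeWidth|Topology|MsgAggregation'
    # RPC statistics from slurmctld; a large backlog points at cause 1 (controller load)
    sdiag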
On Tue, 2019-06-11 at 13:56:34 +, Marcelo Garcia wrote:
> Hi
>
> Since mid-March 2019 we have been having a strange problem with Slurm.
> Sometimes, the command "sbatch" fails:
>
> + sbatch -o /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.1 -p
> operw /home2/mma002/ecf/home/Aos/Prod/
Hi
Since mid-March 2019 we have been having a strange problem with Slurm.
Sometimes, the command "sbatch" fails:
+ sbatch -o /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.1 -p
operw /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.job1
sbatch: error: Batch job submission failed:
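Since the error text is cut off above, one way to capture the actual
failure reason (just a sketch; the 30-second pause and retry count are
arbitrary) is to rerun the same submission with verbose client-side
output and, if the failure looks transient, retry a couple of times:

    # Rerun with verbose output to see why the submission is rejected
    sbatch -vvv -o /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.1 -p operw \
        /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.job1

    # Minimal retry wrapper, assuming the failures are transient controller timeouts
    for i in 1 2 3; do
        sbatch -o /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.1 -p operw \
            /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.job1 && break
        sleep 30
    done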