Hi!
I have a cluster running CentOS release 6.7 (Final), on which I
installed SGE version 8.1.9 with RPMs downloaded from
https://arc.liv.ac.uk/downloads/SGE/releases/8.1.9/, e.g. :
I have also installed openmpi-1.8-1.8.1-5.el6.x86_64 and
environment-modules-3.2.10-3.el6.x86_64 from epel6.
Out of the box, this SGE distribution comes with 3 pre-configured PEs,
one of which is 'mpi' and which is associated with the default all.q queue:
qconf -sp mpi
pe_name mpi
slots 99999
user_lists NONE
xuser_lists NONE
start_proc_args NONE
stop_proc_args NONE
allocation_rule $fill_up
control_slaves TRUE
job_is_first_task TRUE
urgency_slots min
accounting_summary FALSE
qsort_args NONE
qconf -sq all.q|grep pe_list
pe_list make smp mpi
However, if I try to run the simplest possible MPI job, with a script like this:
## -q all.q
#$ -N MPI_Job
#$ -pe mpi 5
#$ -cwd
## ------------------------------------
module load openmpi-x86_64
echo "I got $NSLOTS slots to run on!"
mpirun -np $NSLOTS ./mpi_hello_world
where mpi_hello_world is compiled with openmpi-1.8 libraries from this
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    // Initialize the MPI environment. The two arguments to MPI_Init are not
    // currently used by MPI implementations, but are there in case future
    // implementations might need the arguments.
    MPI_Init(NULL, NULL);

    // Get the number of processes
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Get the rank of the process
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Get the name of the processor
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    // Print off a hello world message
    printf("Hello world from processor %s, rank %d out of %d processors\n",
           processor_name, world_rank, world_size);

    // Finalize the MPI environment. No more MPI calls can be made after this
    MPI_Finalize();
}
the job just sits there in a 'qw' state, with this scheduling info showing:
qstat -j 13
job_number: 13
exec_file: job_scripts/13
submission_time: Thu Jun 9 12:01:55 2016
owner: rsultana
uid: 10010
group: domain-users
gid: 5000
sge_o_home: /home/rsultana
sge_o_log_name: rsultana
sge_o_shell: /bin/bash
sge_o_workdir: /home/rsultana
sge_o_host: bio-pp9-01
account: sge
cwd: /home/rsultana
mail_list: rsultana@bio-pp9-01
notify: FALSE
job_name: MPI_Job
jobshare: 0
env_list: TERM=NONE
script_file: mpi_job.sh
parallel environment: mpi range: 5
binding: NONE
job_type: NONE
scheduling info: cannot run in PE "mpi" because it only
offers 2147483648 slots
I have tried everything I could think of: changing the number of slots in
the PE queue, changing the allocation rule, etc.
Nothing changed: all the jobs submitted with `-pe mpi` fail to be scheduled.
This looks like a bug to me.
2147483648 is 0x80000000, which is -2147483648 when interpreted as a signed
32-bit int, so the 5 slots I request exceed it (5 > -2147483648).
But of course, the number of slots available to the PE should never be
this value, whatever I configure (I tried 9999, 99 and 10, with no change).
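The signed/unsigned reading of that number can be checked with a few lines of C. This is only a minimal sketch of the arithmetic, not code taken from SGE: the function names `as_signed32` and `print_both_readings` are made up for illustration, and I am assuming the slot count is held in a 32-bit integer somewhere along the way.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Reinterpret the 32 bits of an unsigned value as a signed 32-bit integer.
 * memcpy copies the bit pattern unchanged, so no value conversion happens.
 * (Hypothetical helper, not part of SGE.) */
int32_t as_signed32(uint32_t bits)
{
    int32_t value;
    memcpy(&value, &bits, sizeof value);
    return value;
}

/* Print the same bit pattern both ways, e.g. for the slot count
 * reported in the scheduling info. (Hypothetical helper.) */
void print_both_readings(uint32_t bits)
{
    printf("unsigned: %" PRIu32 "  signed: %" PRId32 "\n",
           bits, as_signed32(bits));
}
```

Calling `print_both_readings(2147483648u)` prints 2147483648 as the unsigned reading and -2147483648 as the signed one, which is why a request for 5 slots would compare as larger than what the PE "offers" if the comparison is done on signed values.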
I searched the archives of this list (and of its predecessors) for
similar error messages. The message pops up from time to time, but
nobody seems to know why it happens or how to fix it.
Any suggestions on how to fix this issue?
Thank you,
