Hi!
I have a cluster running CentOS release 6.7 (Final), on which I installed SGE 8.1.9 from the RPMs at
https://arc.liv.ac.uk/downloads/SGE/releases/8.1.9/, e.g.:

gridengine-8.1.9-1.el6.x86_64.rpm
gridengine-execd-8.1.9-1.el6.x86_64.rpm
gridengine-qmaster-8.1.9-1.el6.x86_64.rpm
gridengine-qmon-8.1.9-1.el6.x86_64.rpm

I have also installed openmpi-1.8-1.8.1-5.el6.x86_64 and environment-modules-3.2.10-3.el6.x86_64 from epel6.

Out of the box, this SGE distribution comes with three pre-configured PEs, one of which, 'mpi', is associated with the default all.q queue:
qconf -sp mpi
pe_name            mpi
slots              99999
user_lists         NONE
xuser_lists        NONE
start_proc_args    NONE
stop_proc_args     NONE
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE
qsort_args         NONE

qconf -sq all.q | grep pe_list
pe_list               make smp mpi
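
For reference, here is roughly how I have been inspecting and editing this configuration; it is only a sketch using my PE/queue names and an arbitrary temp file name, so the qconf man page is the authority on the exact syntax:

# list all PEs and cluster queues known to the qmaster
qconf -spl
qconf -sql

# full queue definition (slots, pe_list, hostlist, ...)
qconf -sq all.q

# execution hosts registered with the qmaster
qconf -sel
qhost

# edit the PE non-interactively by dumping, editing and reloading it
qconf -sp mpi > mpi_pe.txt    # file name arbitrary; edit slots/allocation_rule here
qconf -Mp mpi_pe.txt          # same idea for the queue with -sq/-Mq all.q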

However, even the simplest MPI job fails to get scheduled. The job script is:
#!/bin/bash
## ---- EMBEDDED SGE ARGUMENTS ----
## -q all.q
#$ -N MPI_Job
#$ -pe mpi 5
#$ -cwd
## ------------------------------------
module load openmpi-x86_64
echo "I got $NSLOTS slots to run on!"
mpirun -np $NSLOTS ./mpi_hello_world

mpi_hello_world is compiled against the openmpi-1.8 libraries from this source (the build and submit commands are sketched after the listing):
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
  // Initialize the MPI environment. The two arguments to MPI_Init are not
  // currently used by MPI implementations, but are there in case future
  // implementations might need them.
  MPI_Init(NULL, NULL);

  // Get the number of processes
  int world_size;
  MPI_Comm_size(MPI_COMM_WORLD, &world_size);

  // Get the rank of the process
  int world_rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

  // Get the name of the processor
  char processor_name[MPI_MAX_PROCESSOR_NAME];
  int name_len;
  MPI_Get_processor_name(processor_name, &name_len);

  // Print off a hello world message
  printf("Hello world from processor %s, rank %d out of %d processors\n",
         processor_name, world_rank, world_size);

  // Finalize the MPI environment. No more MPI calls can be made after this
  MPI_Finalize();
}
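
The build and submit steps are nothing special; roughly this (the source file name is just for illustration):

# build with the OpenMPI 1.8 compiler wrapper provided by the EPEL module
module load openmpi-x86_64
mpicc -o mpi_hello_world mpi_hello_world.c

# submit and watch the queue ('#$ -pe mpi 5' in the script asks for 5 slots)
qsub mpi_job.sh
qstat -u "$USER"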

When I submit the script, the job just sits in the 'qw' state, with this scheduling info showing:
qstat -j 13
==============================================================
job_number:                 13
exec_file:                  job_scripts/13
submission_time:            Thu Jun  9 12:01:55 2016
owner:                      rsultana
uid:                        10010
group:                      domain-users
gid:                        5000
sge_o_home:                 /home/rsultana
sge_o_log_name:             rsultana
sge_o_path: /opt/sge/bin:/opt/sge/bin/lx-amd64:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/rsultana/bin
sge_o_shell:                /bin/bash
sge_o_workdir:              /home/rsultana
sge_o_host:                 bio-pp9-01
account:                    sge
cwd:                        /home/rsultana
mail_list:                  rsultana@bio-pp9-01
notify:                     FALSE
job_name:                   MPI_Job
jobshare:                   0
env_list:                   TERM=NONE
script_file:                mpi_job.sh
parallel environment:  mpi range: 5
binding:                    NONE
job_type:                   NONE
scheduling info: cannot run in PE "mpi" because it only offers 2147483648 slots
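
In case it helps, these are the checks I can run to confirm that slots and hosts really are available (happy to paste the output if useful):

# cluster-wide slot summary per queue (used / reserved / available / total)
qstat -g c

# full listing of the all.q queue instances on each host, with their states
qstat -f -q all.q

# hosts as seen by the qmaster, together with their queue instances
qhost -q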

I have tried everything I could think of: changing the number of slots in the PE, changing the allocation rule, and so on.
Nothing changed: every job submitted with `-pe mpi` fails to be scheduled.

This looks like a bug to me.
2147483648 is 0x80000000, i.e. -2147483648 when interpreted as a signed 32-bit integer, so the scheduler is presumably comparing the 5 requested slots against a negative slot count. In any case, the number of slots actually available to the PE should be nothing like this value (I tried setting slots to 9999, 99, and 10; no change).
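
If more detail from the scheduler itself would help, I believe a scheduling run can be dumped like this (the runlog path below is the usual $SGE_ROOT/$SGE_CELL/common location; it may differ on other installs):

# dry-run validation of the submission, without actually queueing the job
qsub -w v mpi_job.sh

# ask the scheduler to log one full dispatch run, then read the result
qconf -tsm
cat $SGE_ROOT/$SGE_CELL/common/schedd_runlog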

I searched the archives of this list (and of its predecessors) for similar error messages; although the message pops up from time to time, nobody seems to know why it happens or how to fix it.
Any suggestions on how to fix this?

Thank you,
Razvan

