Dear Experts:

I am new to Linux clusters and have only yeoman competence in information 
technology generally, so my background and intuition are not deep.  Some help 
with a perplexing problem would be greatly appreciated.

About a month ago we installed Rocks v. 6.1 on a small cluster consisting of a 
FrontEnd and two compute nodes.  The installation proceeded without error and 
parallel processing on the cluster works fine.  The SGE queue, which is 
installed as part of the Rocks software, works fine as well.

We have just installed another Rocks 6.1 cluster, using different hardware, on 
a FrontEnd and 5 Compute nodes.  After some adjustments to the BIOS on the 
motherboards of the compute nodes, the installation looks normal: 1 FrontEnd 
and compute-0-0, compute-0-1…compute-0-5.  Each compute node has a SuperMicro 
motherboard with dual Intel E5 2650 processors with hyper-threading, giving 
16 physical cores per node and, with hyper-threading, 32 "effective" cores.

> When parallel MPI jobs are submitted "by hand", by typing out the explicit 
> commands at the command line, the system works without any problem.  
> When the very same job is submitted to the SGE queue, an error is generated, 
> and although qstat shows the job as running, in fact it is not.  qstat -f 
> shows that the job was not distributed among the four compute nodes as 
> specified by the mpi.exe command.
> 
> The command line for submitting the job to the SGE queue is
> 
> qsub -pe orte 64 shellfile.sh
> 
> (64 cores are requested for the job, spread over 4 compute nodes)
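
For context, a job script submitted this way often looks roughly like the sketch below. This is not the actual shellfile.sh, which is not reproduced in this message. The program name ./my_mpi_program and the helper summarize_pe_hostfile are hypothetical, and the sketch assumes an Open MPI build with SGE support, so that mpirun takes its host list from SGE. The helper prints what SGE actually granted, which can be compared with the intended 4 nodes x 16 slots.

```shell
#!/bin/bash
#$ -cwd
#$ -S /bin/bash

# Print "host slots" for each line of an SGE PE hostfile, then the total.
# Assumed line format: <hostname> <slots> <queue instance> <processor range>
summarize_pe_hostfile() {
    awk '{ print $1, $2; total += $2 } END { print "total", total + 0 }' "$1"
}

# PE_HOSTFILE and NSLOTS are set by SGE for parallel jobs; outside SGE this
# block is skipped, so the script is harmless to run by hand.
if [ -n "${PE_HOSTFILE:-}" ] && [ -r "$PE_HOSTFILE" ]; then
    echo "Slots granted by SGE (NSLOTS=$NSLOTS):"
    summarize_pe_hostfile "$PE_HOSTFILE"
    # ./my_mpi_program is a placeholder for the real executable.
    mpirun -np "$NSLOTS" ./my_mpi_program
fi
```

The summary ends up in the job's output file, so it shows the granted hosts and slot counts even if the MPI launch itself fails.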

In this case job 43 was started, but the program did not run on the specified 
nodes with the specified cores.
> 
> The error from shellfile.sh.e43 is:
> 
> error: executing task of job 43 failed: execution daemon on host 
> "compute-0-0" didn't accept task
> error: executing task of job 43 failed: execution daemon on host 
> "compute-0-1" didn't accept task
> 
> The job had been submitted to compute-0-0, compute-0-1, compute-0-2, 
> compute-0-3
> 
> What does "execution daemon on host "compute-0-0" didn't accept task" mean?
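
For anyone looking into this, the sketch below collects the standard SGE queries that seem most relevant to the error. It is only a starting point for gathering detail, not a diagnosis. The function name sge_diag is hypothetical; the job number 43 and the PE name orte are taken from the commands above. No output is shown because it depends on the cluster.

```shell
#!/bin/sh
# Hypothetical helper: gather basic SGE state for a parallel job.
# Usage: sge_diag <job_id> <pe_name>
sge_diag() {
    job_id="$1"
    pe_name="$2"

    # Bail out early if the SGE client tools are not on PATH.
    if ! command -v qstat >/dev/null 2>&1; then
        echo "SGE commands not found in PATH" >&2
        return 1
    fi

    qstat -j "$job_id"    # scheduler's view of the job, including the PE request
    qstat -g t            # master/slave task placement of parallel jobs
    qconf -sp "$pe_name"  # PE settings such as allocation_rule and control_slaves
    qhost -q              # execution hosts with their queue instances and slots
}
```

On the FrontEnd it would be called as `sge_diag 43 orte` while the job is still listed by qstat.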

Since SGE works without problems on the earlier cluster, I don't understand 
where the error is here.
> 
> Any suggestions would be much appreciated.

John
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
