Dear Experts:

I am a newbie to Linux clusters and have only yeoman competence in information technology generally, so my background and intuition are not deep. Some help on a perplexing problem would be greatly appreciated.
About a month ago we installed Rocks v. 6.1 on a small cluster consisting of a FrontEnd and two compute nodes. The installation proceeded without error, and parallel processing on the cluster works fine. The SGE queue, which is installed as a package with the Rocks software, works fine as well.

We have just installed another Rocks 6.1 cluster, using different hardware, on a FrontEnd and 5 Compute nodes. After some adjustments to the BIOS on the motherboards of the compute nodes, the installation looks normal: 1 FrontEnd and compute-0-0, compute-0-1…compute-0-5. Each compute node consists of a SuperMicro motherboard with dual processors, Intel E5 2650 with Hyper-Threading. Each compute node has a total of 16 physical cores, and with Hyper-Threading the "effective" number of cores is 32.

When parallel jobs using MPI are submitted "by hand", by typing out the explicit commands at the command line, the system works without any problem. When the very same job is submitted to the SGE queue, an error is generated, and although qstat indicates a running program, in fact it is not running. qstat -f shows that the job was not distributed among the four compute nodes as specified by the mpi.exe command.

The command line for submitting the job to the SGE queue is

    qsub -pe orte 64 shellfile.sh

(64 cores are specified for the job, on 4 compute nodes.) In this case job 43 was started, but the program does not run on the specified nodes with the specified cores.

The error from shellfile.sh.e43 is:

    error: executing task of job 43 failed: execution daemon on host "compute-0-0" didn't accept task
    error: executing task of job 43 failed: execution daemon on host "compute-0-1" didn't accept task

The job had been submitted to compute-0-0, compute-0-1, compute-0-2, and compute-0-3.

What does 'execution daemon on host "compute-0-0" didn't accept task' mean? Since SGE works without problems on the earlier cluster, I don't understand where the error is here.
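In case it helps anyone reproduce the diagnosis, here is a minimal sketch of how the rejecting hosts could be tallied from an error file like shellfile.sh.e43. It is only an illustration: the function name `rejecting_hosts` is made up for this example, the sample text is copied from the two error lines quoted above, and the pattern assumes the message wording shown there (other SGE versions may phrase it differently).

```python
import re
from collections import Counter

# Sample text copied from the shellfile.sh.e43 output quoted above.
ERROR_TEXT = '''\
error: executing task of job 43 failed: execution daemon on host "compute-0-0" didn't accept task
error: executing task of job 43 failed: execution daemon on host "compute-0-1" didn't accept task
'''

# Assumption: the message format is exactly as seen in the quoted output.
REJECT_RE = re.compile(
    r'executing task of job (\d+) failed: '
    r'execution daemon on host "([^"]+)" didn\'t accept task'
)


def rejecting_hosts(text):
    """Return a Counter mapping (job_id, host) to the number of rejected tasks."""
    counts = Counter()
    for line in text.splitlines():
        match = REJECT_RE.search(line)
        if match:
            counts[(int(match.group(1)), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    # Print one summary line per (job, host) pair found in the sample text.
    for (job_id, host), n in sorted(rejecting_hosts(ERROR_TEXT).items()):
        print(f"job {job_id}: {host} rejected {n} task(s)")
```

With the sample above, only compute-0-0 and compute-0-1 appear in the tally, even though the job was submitted to four nodes.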
Any suggestions would be much appreciated.

John
