On 15.01.2013 at 02:06, John Weiner wrote:

> Dear Experts:
>
> I am a newbie to Linux clusters and have only yeoman competence in
> information technology generally so my culture and intuition are not deep.
> Some help on a perplexing problem would be greatly appreciated.
>
> About a month ago we installed Rocks v. 6.1 on a small cluster consisting of
> a FrontEnd and two compute nodes. The installation proceeded without error
> and parallel processing on the cluster works fine. The SGE queue works fine
> as well. SGE is a package installed with Rocks software.
>
> We have just installed another Rocks 6.1 cluster, using different hardware,
> on a FrontEnd and 5 Compute nodes. After some adjustments to the BIOS on the
> motherboards of the compute nodes, the installation looks normal, 1 FrontEnd
> and compute-0-0, compute-0-1…compute-0-5. The compute nodes consist of a
> SuperMicro motherboard with dual processors, Intel E5 2650 with hyper
> threading. Each compute node has a total of 16 physical cores and with hyper
> threading the "effective" number of cores is 32.
>
>> When parallel jobs, using MPI,
By MPI you mean Open MPI (since you use the PE "orte" below)? Is only one
`mpirun` installed, or is the correct one called in the job script?

>> are submitted "by hand", typing out the explicit commands at the command
>> line, the system works without any problem. When the very same job is
>> submitted to the SGE queue, an error is generated, and although qstat
>> indicates a running program, in fact it is not. qstat -f shows that the job
>> was not distributed among the four compute nodes as specified by mpi.exe
>> command.

It works the other way round: SGE grants access to the requested number of
slots, and Open MPI has to use these and only these.

http://www.open-mpi.org/faq/?category=building#build-rte-sge
http://www.open-mpi.org/faq/?category=running#run-n1ge-or-sge

The granted allocation can be viewed with:

$ qstat -g t

>> The command line for submitting the job to the SGE queue is
>>
>> qsub -pe orte 64 shellfile.sh (there are 64 cores specified for the job on
>> 4 compute nodes)

Did you compile Open MPI with --with-sge?

> In this case job 43 was started, but the program does not run on the
> specified nodes with the specified cores.
>>
>> The error from shellfile.sh.e43 is:
>>
>> error: executing task of job 43 failed: execution daemon on host
>> "compute-0-0" didn't accept task
>> error: executing task of job 43 failed: execution daemon on host
>> "compute-0-1" didn't accept task

Did you set up a PE for Tight Integration of Open MPI? One possible cause is
that at least one more `qrsh -inherit ...` call is made to a slave machine of
the parallel job than the slot count granted on that machine allows.

-- Reuti

NB: Nowadays often only one `qrsh -inherit ...` call is made at all to each
slave machine, as additional processes are started as forks (you can observe
this with `ps -e f`).

>> The job had been submitted to compute-0-0, compute-0-1, compute-0-2,
>> compute-0-3
>>
>> What does "execution daemon on host "compute-0-0" didn't accept task" mean?
>
> Since SGE works without problems on the earlier cluster, I don't understand
> where the error is here.
>>
>> Any suggestions would be much appreciated.
>
> John
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
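To check from inside the job script which slots SGE actually granted, something like the following may help. It is only a minimal sketch: it assumes the usual `$PE_HOSTFILE` layout (host name, slot count, queue, processor range on each line), and the function name `granted_slots` is made up for illustration; it is not part of SGE or Open MPI.

```shell
#!/bin/sh
# Sketch: summarize the slots SGE granted to this job, per host.
# Assumes each line of the PE hostfile reads:
#   <host> <slots> <queue> <processor range>
# granted_slots is a hypothetical helper name, not an SGE command.
granted_slots() {
    # Sum column 2 (slots) per host in column 1, then sort for stable output.
    awk '{ slots[$1] += $2 } END { for (h in slots) print h, slots[h] }' "$1" | sort
}

# Inside a job started with `qsub -pe orte 64 ...`, SGE sets PE_HOSTFILE.
# Outside SGE the variable is unset and nothing is printed.
if [ -n "${PE_HOSTFILE:-}" ] && [ -r "$PE_HOSTFILE" ]; then
    granted_slots "$PE_HOSTFILE"
fi
```

Placed near the top of shellfile.sh, this prints one "host slots" line per granted host into the job's output file. Comparing that with `qstat -g t` and with the processes actually started on each node may help narrow down where a "didn't accept task" error comes from.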
