Hi,

>> Not sure whether I get it right. When I launch the same application with:
>>
>> "mpiexec -np1 ./Mpitest" (and get an allocation of 2+2 on the two machines):
>>
>> 27422 ? Sl  4:12 /usr/sge/bin/lx24-x86/sge_execd
>>  9504 ? S   0:00 \_ sge_shepherd-3791 -bg
>>  9506 ? Ss  0:00 \_ /bin/sh /var/spool/sge/pc15370/job_scripts/3791
>>  9507 ? S   0:00 \_ mpiexec -np 1 ./Mpitest
>>  9508 ? R   0:07 \_ ./Mpitest
>>  9509 ? Sl  0:00 \_ /usr/sge/bin/lx24-x86/qrsh -inherit -nostdin -V pc15381 orted -mca
>>  9513 ? S   0:00 \_ /home/reuti/mpitest/Mpitest --child
>>
>>  2861 ? Sl 10:47 /usr/sge/bin/lx24-x86/sge_execd
>> 25434 ? Sl  0:00 \_ sge_shepherd-3791 -bg
>> 25436 ? Ss  0:00 \_ /usr/sge/utilbin/lx24-x86/qrsh_starter /var/spool/sge/pc15381/active_jobs/3791.1/1.pc15381
>> 25444 ? S   0:00 \_ orted -mca ess env -mca orte_ess_jobid 821952512 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri
>> 25447 ? S   0:01 \_ /home/reuti/mpitest/Mpitest --child
>> 25448 ? S   0:01 \_ /home/reuti/mpitest/Mpitest --child
>>
>> This is what I expect (main + 1 child, other node gets 2 children). Now I
>> launch the singleton instead (nothing changed besides this, still 2+2
>> granted):
>>
>> "./Mpitest" and get:
>>
>> 27422 ? Sl  4:12 /usr/sge/bin/lx24-x86/sge_execd
>>  9546 ? S   0:00 \_ sge_shepherd-3793 -bg
>>  9548 ? Ss  0:00 \_ /bin/sh /var/spool/sge/pc15370/job_scripts/3793
>>  9549 ? R   0:00 \_ ./Mpitest
>>  9550 ? Ss  0:00 \_ orted --hnp --set-sid --report-uri 6 --singleton-died-pipe 7
>>  9551 ? Sl  0:00 \_ /usr/sge/bin/lx24-x86/qrsh -inherit -nostdin -V pc15381 orted
>>  9554 ? S   0:00 \_ /home/reuti/mpitest/Mpitest --child
>>  9555 ? S   0:00 \_ /home/reuti/mpitest/Mpitest --child
>>
>>  2861 ? Sl 10:47 /usr/sge/bin/lx24-x86/sge_execd
>> 25494 ? Sl  0:00 \_ sge_shepherd-3793 -bg
>> 25495 ? Ss  0:00 \_ /usr/sge/utilbin/lx24-x86/qrsh_starter /var/spool/sge/pc15381/active_jobs/3793.1/1.pc15381
>> 25502 ? S   0:00 \_ orted -mca ess env -mca orte_ess_jobid 814940160 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri
>> 25503 ? S   0:00 \_ /home/reuti/mpitest/Mpitest --child
>>
>> Only one child is going to the other node. The environment is the same in
>> both cases. Is this the correct behavior?
>
>
> We probably aren't correctly marking the original singleton on that node, and
> so the mapper thinks there are still two slots available on the original node.

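
The explanation quoted above can be illustrated with a toy model. This is only a sketch of the suspected slot accounting, not Open MPI's actual mapper: the function name `map_by_slot`, the fill-first-node policy, and the idea that the parent is simply "counted" or "not counted" are assumptions made for illustration. The host names, the 2+2 allocation, and the three spawned children are taken from the listings above.

```python
def map_by_slot(slots_per_node, used_per_node, n_procs):
    """Place each new process on the first node that still has a free slot.

    Toy model only: the fill policy is an assumption, not Open MPI's mapper.
    """
    used = dict(used_per_node)
    placement = {node: 0 for node in slots_per_node}
    for _ in range(n_procs):
        for node, slots in slots_per_node.items():
            if used.get(node, 0) < slots:
                used[node] = used.get(node, 0) + 1
                placement[node] += 1
                break
        else:
            raise RuntimeError("no free slot left")
    return placement


# Granted allocation: 2 slots on each of the two machines.
slots = {"pc15370": 2, "pc15381": 2}

# Started via mpiexec: the parent is counted as using one slot on pc15370.
with_mpiexec = map_by_slot(slots, {"pc15370": 1}, 3)

# Started as a singleton: assume the parent is not counted at all.
as_singleton = map_by_slot(slots, {}, 3)

print(with_mpiexec)  # {'pc15370': 1, 'pc15381': 2}
print(as_singleton)  # {'pc15370': 2, 'pc15381': 1}
```

Under these assumptions the model reproduces both observed distributions of the three children: 1 local + 2 remote with mpiexec, 2 local + 1 remote for the singleton.
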
Was there any further discussion off-list about the different slot allocations between the two startup methods?

One could even argue that the current behavior is the intended one:

- you have an MPI-style application (rank 0 is doing work)
  => use mpiexec
  Corresponding SGE setting: "job_is_first_task true" in the PE

- you have a true master/slave application and the master is not doing any work
  => start it as a singleton
  Corresponding SGE setting: "job_is_first_task false" in the PE

This would then be worth noting somewhere in the FAQ.

(I couldn't compile the Mpitest with MPICH2 to check its behavior; it chokes on some overloaded operators.)

-- Reuti