Yes, I believe this solves the mystery. In short, OGE and ORTE both
work. In the linear:1 case the job is exiting because there are not
enough resources for the ORTE binding to work, which actually makes
sense. In the linear:2 case I think we've proven that we are binding to
the right amount of resources, and to the correct physical resources,
at the process level.
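For what it's worth, here is how I read the masks in your output,
assuming they are hexadecimal bitmasks over logical CPU numbers
(bit 0 = core 0):

  linear:2 on exec1 (2 slots):
    external binding  0028 -> binary 101000 -> cores 3 and 5
    rank 0 bound to   0008 -> binary 001000 -> core 3
    rank 1 bound to   0020 -> binary 100000 -> core 5

  linear:1 on exec1 (still 2 slots):
    external binding  0020 -> binary 100000 -> core 5 only

So in the linear:2 run the two local ranks land exactly on the two
cores OGE handed us, and in the linear:1 run -bind-to-core has only one
core available for two local processes, which is why it aborts with the
"not enough processors" message.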
In the case where you do not pass -bind-to-core to mpirun under a qsub
using linear:2, the processes on the same node will actually bind to
the same two cores. The only way to determine this is to run something
that prints out the binding from the system (a sketch of such a program
is below). There is no way to see it via OMPI, because OMPI only
reports bindings when you ask mpirun to do some type of binding itself
(like -bind-to-core or -bind-to-socket).
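Here is a minimal sketch of that kind of check, assuming a Linux/glibc
system (sched_getaffinity); each rank just prints the CPUs the kernel
says it is allowed to run on:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, cpu;
    char host[256];
    cpu_set_t mask;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    gethostname(host, sizeof(host));

    /* Ask the kernel which CPUs this process may run on. */
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_getaffinity");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    printf("rank %d on %s: allowed cpus:", rank, host);
    for (cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &mask))
            printf(" %d", cpu);
    }
    printf("\n");

    MPI_Finalize();
    return 0;
}

Compile it with mpicc and launch it from the same qsub jobs, once with
and once without -bind-to-core, and you can see directly what each
process inherited from the OGE core binding.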
In the linear:1 case with no binding, I think the processes on the same
node end up running on the same core, which is exactly what you are
asking for, I believe.
So I believe we understand what is going on with the binding, and it
makes sense to me. As for the allocation issue of slots vs. cores and
trying not to overallocate cores, I believe the new allocation rule
makes sense, but I'll let you hash that out with Daniel.
In summary, I don't believe there are any OMPI bugs related to what
we've seen, and the OGE issue is just the allocation issue, right?
--td
On 11/18/2010 01:32 AM, Chris Jewell wrote:
Perhaps if someone could run this test again with --report-bindings
--leave-session-attached and provide -all- output we could verify that analysis
and clear up the confusion?
Yeah, however I bet you we still won't see output.
Actually, it seems we do get more output! Results of 'qsub -pe mpi 8 -binding
linear:2 myScript.com' with
'mpirun -mca ras_gridengine_verbose 100 -report-bindings
--leave-session-attached -bycore -bind-to-core ./unterm':
[exec1:06504] System has detected external process binding to cores 0028
[exec1:06504] ras:gridengine: JOB_ID: 59467
[exec1:06504] ras:gridengine: PE_HOSTFILE:
/usr/sge/default/spool/exec1/active_jobs/59467.1/pe_hostfile
[exec1:06504] ras:gridengine: exec1.cluster.stats.local: PE_HOSTFILE shows
slots=2
[exec1:06504] ras:gridengine: exec3.cluster.stats.local: PE_HOSTFILE shows
slots=1
[exec1:06504] ras:gridengine: exec2.cluster.stats.local: PE_HOSTFILE shows
slots=1
[exec1:06504] ras:gridengine: exec7.cluster.stats.local: PE_HOSTFILE shows
slots=1
[exec1:06504] ras:gridengine: exec4.cluster.stats.local: PE_HOSTFILE shows
slots=1
[exec1:06504] ras:gridengine: exec5.cluster.stats.local: PE_HOSTFILE shows
slots=1
[exec1:06504] ras:gridengine: exec6.cluster.stats.local: PE_HOSTFILE shows
slots=1
[exec1:06504] [[59608,0],0] odls:default:fork binding child [[59608,1],0] to
cpus 0008
[exec1:06504] [[59608,0],0] odls:default:fork binding child [[59608,1],1] to
cpus 0020
[exec3:20248] [[59608,0],1] odls:default:fork binding child [[59608,1],2] to
cpus 0008
[exec4:26792] [[59608,0],4] odls:default:fork binding child [[59608,1],5] to
cpus 0001
[exec2:32462] [[59608,0],2] odls:default:fork binding child [[59608,1],3] to
cpus 0001
[exec7:09833] [[59608,0],3] odls:default:fork binding child [[59608,1],4] to
cpus 0002
[exec5:10834] [[59608,0],5] odls:default:fork binding child [[59608,1],6] to
cpus 0001
[exec6:04230] [[59608,0],6] odls:default:fork binding child [[59608,1],7] to
cpus 0001
AHHA! Now I get the following if I use 'qsub -pe mpi 8 -binding linear:1
myScript.com' with the above mpirun command:
[exec1:06552] System has detected external process binding to cores 0020
[exec1:06552] ras:gridengine: JOB_ID: 59468
[exec1:06552] ras:gridengine: PE_HOSTFILE:
/usr/sge/default/spool/exec1/active_jobs/59468.1/pe_hostfile
[exec1:06552] ras:gridengine: exec1.cluster.stats.local: PE_HOSTFILE shows
slots=2
[exec1:06552] ras:gridengine: exec3.cluster.stats.local: PE_HOSTFILE shows
slots=1
[exec1:06552] ras:gridengine: exec2.cluster.stats.local: PE_HOSTFILE shows
slots=1
[exec1:06552] ras:gridengine: exec7.cluster.stats.local: PE_HOSTFILE shows
slots=1
[exec1:06552] ras:gridengine: exec4.cluster.stats.local: PE_HOSTFILE shows
slots=1
[exec1:06552] ras:gridengine: exec5.cluster.stats.local: PE_HOSTFILE shows
slots=1
[exec1:06552] ras:gridengine: exec6.cluster.stats.local: PE_HOSTFILE shows
slots=1
--------------------------------------------------------------------------
mpirun was unable to start the specified application as it encountered an error:
Error name: Unknown error: 1
Node: exec1
when attempting to start process rank 0.
--------------------------------------------------------------------------
[exec1:06552] [[59432,0],0] odls:default:fork binding child [[59432,1],0] to
cpus 0020
--------------------------------------------------------------------------
Not enough processors were found on the local host to meet the requested
binding action:
Local host: exec1
Action requested: bind-to-core
Application name: ./unterm
Please revise the request and try again.
--------------------------------------------------------------------------
[exec4:26816] [[59432,0],4] odls:default:fork binding child [[59432,1],5] to
cpus 0001
[exec3:20345] [[59432,0],1] odls:default:fork binding child [[59432,1],2] to
cpus 0020
[exec2:32486] [[59432,0],2] odls:default:fork binding child [[59432,1],3] to
cpus 0001
[exec7:09921] [[59432,0],3] odls:default:fork binding child [[59432,1],4] to
cpus 0002
[exec6:04257] [[59432,0],6] odls:default:fork binding child [[59432,1],7] to
cpus 0001
[exec5:10861] [[59432,0],5] odls:default:fork binding child [[59432,1],6] to
cpus 0001
Hope that helps clear up the confusion! Please say it does, my head hurts...
Chris
--
Dr Chris Jewell
Department of Statistics
University of Warwick
Coventry
CV4 7AL
UK
Tel: +44 (0)24 7615 0778
--
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com