Yes, I believe this solves the mystery. In short, OGE and ORTE both work. In the linear:1 case the job is exiting because there are not enough resources for the ORTE binding to work, which actually makes sense: OGE has bound the whole job on exec1 to a single core, yet the PE_HOSTFILE gives exec1 two slots, so there is no second core available when mpirun tries to bind the second local rank. In the linear:2 case I think we've proven that we are binding to the right number of resources, and to the correct physical resources, at the process level.

If you do not pass -bind-to-core to mpirun on a qsub that uses linear:2, the processes on the same node will actually be bound to the same two cores. The only way to see this is to run something that prints out the binding from the system; there is no way to get it from OMPI, because OMPI only reports bindings when you ask mpirun to do some kind of binding itself (like -bind-to-core or -bind-to-socket).
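Something like the rough sketch below would do that check (my example, assuming Linux, where sched_getaffinity(2) reports the mask a process inherited); launch it under the same mpirun line in place of the real application and each rank prints the cores it is allowed to run on:

    /* print_binding.c -- rough sketch, assumes Linux (sched_getaffinity).
     * Each process prints the cores the OS says it is bound to; run it
     * under mpirun in place of the real application to see what the
     * ranks inherit from OGE. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        cpu_set_t mask;
        char host[64];

        gethostname(host, sizeof(host));
        if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_getaffinity");
            return 1;
        }

        printf("%s pid %d bound to cores:", host, (int)getpid());
        for (int c = 0; c < CPU_SETSIZE; c++) {
            if (CPU_ISSET(c, &mask)) {
                printf(" %d", c);
            }
        }
        printf("\n");
        return 0;
    }

Compiled with plain cc (or mpicc) and run with and without -bind-to-core, the difference described above should show up directly.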

In the linear:1 case with no binding, I think the processes on the same node end up running on the same core, which is exactly what you are asking for, I believe.

So I believe we understand what is going on with the binding, and it makes sense to me. As for the allocation issue of slots vs. cores and trying not to over-allocate cores, I believe the new allocation rule makes sense, but I'll let you hash that out with Daniel.

In summary, I don't believe there are any OMPI bugs related to what we've seen, and the OGE issue is just the allocation issue, right?

--td


On 11/18/2010 01:32 AM, Chris Jewell wrote:
>> Perhaps if someone could run this test again with --report-bindings
>> --leave-session-attached and provide -all- output we could verify that
>> analysis and clear up the confusion?

> Yeah, however I bet you we still won't see output.
Actually, it seems we do get more output! Results of 'qsub -pe mpi 8 -binding linear:2 myScript.com'

with

'mpirun -mca ras_gridengine_verbose 100 -report-bindings --leave-session-attached -bycore -bind-to-core ./unterm'

[exec1:06504] System has detected external process binding to cores 0028
[exec1:06504] ras:gridengine: JOB_ID: 59467
[exec1:06504] ras:gridengine: PE_HOSTFILE: /usr/sge/default/spool/exec1/active_jobs/59467.1/pe_hostfile
[exec1:06504] ras:gridengine: exec1.cluster.stats.local: PE_HOSTFILE shows slots=2
[exec1:06504] ras:gridengine: exec3.cluster.stats.local: PE_HOSTFILE shows slots=1
[exec1:06504] ras:gridengine: exec2.cluster.stats.local: PE_HOSTFILE shows slots=1
[exec1:06504] ras:gridengine: exec7.cluster.stats.local: PE_HOSTFILE shows slots=1
[exec1:06504] ras:gridengine: exec4.cluster.stats.local: PE_HOSTFILE shows slots=1
[exec1:06504] ras:gridengine: exec5.cluster.stats.local: PE_HOSTFILE shows slots=1
[exec1:06504] ras:gridengine: exec6.cluster.stats.local: PE_HOSTFILE shows slots=1
[exec1:06504] [[59608,0],0] odls:default:fork binding child [[59608,1],0] to cpus 0008
[exec1:06504] [[59608,0],0] odls:default:fork binding child [[59608,1],1] to cpus 0020
[exec3:20248] [[59608,0],1] odls:default:fork binding child [[59608,1],2] to cpus 0008
[exec4:26792] [[59608,0],4] odls:default:fork binding child [[59608,1],5] to cpus 0001
[exec2:32462] [[59608,0],2] odls:default:fork binding child [[59608,1],3] to cpus 0001
[exec7:09833] [[59608,0],3] odls:default:fork binding child [[59608,1],4] to cpus 0002
[exec5:10834] [[59608,0],5] odls:default:fork binding child [[59608,1],6] to cpus 0001
[exec6:04230] [[59608,0],6] odls:default:fork binding child [[59608,1],7] to cpus 0001

AHHA! Now I get the following if I use 'qsub -pe mpi 8 -binding linear:1 myScript.com' with the above mpirun command:

[exec1:06552] System has detected external process binding to cores 0020
[exec1:06552] ras:gridengine: JOB_ID: 59468
[exec1:06552] ras:gridengine: PE_HOSTFILE: /usr/sge/default/spool/exec1/active_jobs/59468.1/pe_hostfile
[exec1:06552] ras:gridengine: exec1.cluster.stats.local: PE_HOSTFILE shows slots=2
[exec1:06552] ras:gridengine: exec3.cluster.stats.local: PE_HOSTFILE shows slots=1
[exec1:06552] ras:gridengine: exec2.cluster.stats.local: PE_HOSTFILE shows slots=1
[exec1:06552] ras:gridengine: exec7.cluster.stats.local: PE_HOSTFILE shows slots=1
[exec1:06552] ras:gridengine: exec4.cluster.stats.local: PE_HOSTFILE shows slots=1
[exec1:06552] ras:gridengine: exec5.cluster.stats.local: PE_HOSTFILE shows slots=1
[exec1:06552] ras:gridengine: exec6.cluster.stats.local: PE_HOSTFILE shows slots=1
--------------------------------------------------------------------------
mpirun was unable to start the specified application as it encountered an error:

Error name: Unknown error: 1
Node: exec1

when attempting to start process rank 0.
--------------------------------------------------------------------------
[exec1:06552] [[59432,0],0] odls:default:fork binding child [[59432,1],0] to cpus 0020
--------------------------------------------------------------------------
Not enough processors were found on the local host to meet the requested
binding action:

   Local host:        exec1
   Action requested:  bind-to-core
   Application name:  ./unterm

Please revise the request and try again.
--------------------------------------------------------------------------
[exec4:26816] [[59432,0],4] odls:default:fork binding child [[59432,1],5] to cpus 0001
[exec3:20345] [[59432,0],1] odls:default:fork binding child [[59432,1],2] to cpus 0020
[exec2:32486] [[59432,0],2] odls:default:fork binding child [[59432,1],3] to cpus 0001
[exec7:09921] [[59432,0],3] odls:default:fork binding child [[59432,1],4] to cpus 0002
[exec6:04257] [[59432,0],6] odls:default:fork binding child [[59432,1],7] to cpus 0001
[exec5:10861] [[59432,0],5] odls:default:fork binding child [[59432,1],6] to cpus 0001



Hope that helps clear up the confusion!  Please say it does, my head hurts...

Chris


--
Dr Chris Jewell
Department of Statistics
University of Warwick
Coventry
CV4 7AL
UK
Tel: +44 (0)24 7615 0778

--
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com


