On Mar 27, 2017, at 11:00 AM, r...@open-mpi.org wrote:
> 
> I’m confused - mpi_yield_when_idle=1 is precisely the “oversubscribed” 
> setting. So why would you expect different results?

A few additional points on top of Ralph's question:

1. Recall that sched_yield() has effectively become a no-op in newer Linux 
kernels.  Hence, Open MPI's "yield when idle" may not do much to actually 
de-schedule a currently-running process.  (A sketch of what such a yield loop 
looks like is below, after point #3.)

2. As for why there is a difference in oversubscription behavior between 
versions 1.10.1 and 1.10.2: we likely do not know offhand (as all of these 
emails have shown!).  Honestly, we don't pay much attention to 
oversubscription performance -- our focus tends to be on under- and 
exactly-subscribed performance, because that's the normal operating mode 
for MPI applications.  With oversubscription, we have typically just said 
"all bets are off" and left it at that.

3. I don't recall whether there was a default affinity policy change between 
1.10.1 and 1.10.2.  Do you know -- for absolutely sure -- that your taskset 
command is overriding what Open MPI is doing?  Or is what Open MPI is doing 
in terms of affinity/binding somehow getting merged with what your taskset 
call is doing...?  (That seems unlikely, but I figured I'd ask anyway.)  One 
way to check this is sketched below as well.
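
To make point #1 concrete: here is a minimal sketch (illustrative only -- the 
function below is made up, not Open MPI's actual progress code) of what a 
"yield when idle" wait loop generally looks like.  When sched_yield() is 
effectively a no-op, the call returns almost immediately and the loop keeps 
consuming its core:

#include <sched.h>

/* Illustrative sketch only -- not Open MPI's actual implementation. */
static void wait_for_completion(volatile int *done)
{
    while (!*done) {
        /* ...poll the network / progress engine here... */

        /* "Degraded" mode: offer the core back to the OS.  On newer Linux
         * kernels this often reschedules us almost immediately. */
        sched_yield();
    }
}

int main(void)
{
    int done = 1;   /* trivially "complete" so the example returns right away */
    wait_for_completion(&done);
    return 0;
}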
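
On the binding question in point #3: one way to answer it definitively is to 
have each process report the CPU set it actually ended up with.  mpirun's 
--report-bindings option shows what Open MPI thinks it did; to see the net 
result of Open MPI's binding plus your taskset call, something like the 
following sketch would tell you (shown as a standalone program for clarity -- 
in practice you'd drop the equivalent few lines into the top of the MPI 
application):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

/* Print the CPUs this process is actually allowed to run on, i.e., the
 * combined effect of Open MPI's binding and the inherited taskset mask. */
int main(void)
{
    cpu_set_t mask;
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_getaffinity");
        return 1;
    }
    printf("pid %d allowed CPUs:", (int)getpid());
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &mask)) {
            printf(" %d", cpu);
        }
    }
    printf("\n");
    return 0;
}

If the printed CPU sets differ between your 1.10.1 and 1.10.2 runs, that would 
point at a binding/affinity change rather than a spin-wait difference.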

One more question -- see below:

>> Thanks for your feedback. As described here 
>> (https://www.open-mpi.org/faq/?category=running#oversubscribing), OpenMPI 
>> detects that I'm oversubscribing and runs in degraded mode (yielding the 
>> processor). Anyway, I repeated the experiments setting explicitly the 
>> yielding flag, and I obtained the same weird results:
>> 
>> $HOME/openmpi-bin-1.10.1/bin/mpirun --mca mpi_yield_when_idle 1 -np 36 
>> taskset -c 0-27 $HOME/NPB/NPB3.3-MPI/bin/bt.C.36 -> Time in seconds = 82.79
>> $HOME/openmpi-bin-1.10.2/bin/mpirun --mca mpi_yield_when_idle 1 -np 36 
>> taskset -c 0-27 $HOME/NPB/NPB3.3-MPI/bin/bt.C.36 -> Time in seconds = 110.93

Per the text later in your mail, "taskset -c 0-27" corresponds to the first 
hardware thread on each core.

Hence, this is effectively binding each process to the set of all "first 
hardware threads" across all cores.

>> Given these results, it seems that spin-waiting is not causing the issue.

I'm guessing that this difference will end up being a symptom of a highly 
complex system, in which spin-waiting plays a part.  I.e., if Open MPI weren't 
spin-waiting, this might not be happening.

-- 
Jeff Squyres
jsquy...@cisco.com
