Hi,

On 27/03/2017 17:51, Jeff Squyres (jsquyres) wrote:
> 1. Recall that sched_yield() has effectively become a no-op in newer Linux kernels.
> Hence, Open MPI's "yield when idle" may not do much to actually de-schedule a
> currently-running process.
Yes, I'm aware of this. However, it should affect both Open MPI versions in the same way.
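Just to illustrate what I mean, below is a rough sketch of a "yield when idle" style spin loop (not Open MPI's actual progress engine; message_arrived is a made-up completion flag). Under CFS, sched_yield() only requeues the caller, so an oversubscribed rank spinning like this still burns most of its timeslice, and it does so identically on both versions:

    /* Rough sketch of a "yield when idle" spin-wait (NOT Open MPI's real
     * progress engine; message_arrived is a hypothetical completion flag).
     * Under CFS, sched_yield() merely requeues the caller, so a spinning
     * oversubscribed rank still consumes most of its timeslice. */
    #include <sched.h>
    #include <stdatomic.h>

    static atomic_int message_arrived;

    static void wait_for_message(void)
    {
        while (!atomic_load(&message_arrived)) {
            /* poll network / shared-memory queues here ... */
            sched_yield();   /* effectively a no-op on newer kernels */
        }
    }
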
> 2. As for why there is a difference between version 1.10.1 and 1.10.2 in oversubscription
> behavior, we likely do not know offhand (as all of these emails have shown!).  Honestly,
> we don't really pay much attention to oversubscription performance -- our focus tends to
> be on under/exactly-subscribed performance, because that's the normal operating mode for
> MPI applications.  With oversubscribed, we have typically just said "all bets are
> off" and leave it at that.
I agree that oversubscription is not the typical usage scenario, and I can understand that optimizing its performance is not a priority. But maybe the problem I'm facing is just a symptom that something is not working properly, which could also impact undersubscribed scenarios (to a lesser extent, of course).

> 3. I don't recall if there was a default affinity policy change between 1.10.1
> and 1.10.2.  Do you know that your taskset command is -- for absolutely sure --
> overriding what Open MPI is doing?  Or is what Open MPI is doing in terms of
> affinity/binding getting merged with what your taskset call is doing
> somehow...?  (seems unlikely, but I figured I'd ask anyway)
Regarding the changes between 1.10.1 and 1.10.2, I only found one that seems related to oversubscription (i.e. "Correctly handle oversubscription when not given directives to permit it"). I don't know whether this could be having an impact somehow.

Regarding the interaction of the Open MPI affinity options with taskset, I'd say it is a combination: with taskset I'm just constraining the affinity placement decided by Open MPI to the set of processors 0 to 27. In any case, the affinity configuration is the same for v1.10.1 and v1.10.2, namely:

     Mapper requested: NULL  Last mapper: round_robin  Mapping policy: BYSOCKET  Ranking policy: SLOT
     Binding policy: NONE:IF-SUPPORTED  Cpu set: NULL  PPR: NULL  Cpus-per-rank: 1
     Num new daemons: 0    New daemon starting vpid INVALID
     Num nodes: 1
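To check whether taskset is really overriding (or being merged with) what Open MPI does, I can run something like the sketch below inside each rank (just an illustration; it assumes glibc's sched_getaffinity, and the program name is arbitrary). If every rank prints a set that stays within 0-27, the taskset mask is being honoured; if a rank reports only a single CPU, Open MPI is additionally binding within that mask.

    #define _GNU_SOURCE
    #include <mpi.h>
    #include <sched.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        cpu_set_t mask;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Print the CPUs this rank is actually allowed to run on. */
        if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
            printf("rank %d allowed CPUs:", rank);
            for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
                if (CPU_ISSET(cpu, &mask))
                    printf(" %d", cpu);
            printf("\n");
        }

        MPI_Finalize();
        return 0;
    }

Launched under "taskset -c 0-27 mpirun ...", getting the same output from v1.10.1 and v1.10.2 would confirm that the difference is not in the effective CPU masks.
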

> Per text later in your mail, "taskset -c 0-27" corresponds to the first
> hardware thread on each core.
>
> Hence, this is effectively binding each process to the set of all "first hardware
> threads" across all cores.
Yes, that was the intention: to avoid running two MPI processes on the same physical core.
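For completeness, this is how I would double-check that assumption on the node (a Linux-only sketch reading sysfs; the 0-27 range just mirrors the taskset mask above). If every CPU in 0-27 reports a distinct (socket, core) pair, none of them are hyperthread siblings of each other.

    #include <stdio.h>

    int main(void)
    {
        /* Read socket and core ids for logical CPUs 0-27 from sysfs to
         * verify they all sit on distinct physical cores (Linux only). */
        for (int cpu = 0; cpu <= 27; cpu++) {
            char path[128];
            int core_id = -1, pkg_id = -1;
            FILE *f;

            snprintf(path, sizeof(path),
                     "/sys/devices/system/cpu/cpu%d/topology/core_id", cpu);
            if ((f = fopen(path, "r")) != NULL) {
                fscanf(f, "%d", &core_id);
                fclose(f);
            }

            snprintf(path, sizeof(path),
                     "/sys/devices/system/cpu/cpu%d/topology/physical_package_id",
                     cpu);
            if ((f = fopen(path, "r")) != NULL) {
                fscanf(f, "%d", &pkg_id);
                fclose(f);
            }

            printf("cpu %2d -> socket %d, core %d\n", cpu, pkg_id, core_id);
        }
        return 0;
    }
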
> I'm guessing that this difference is going to end up being the symptom of a
> highly complex system, of which spin-waiting is playing a part.  I.e., if Open
> MPI weren't spin waiting, this might not be happening.

I'm not sure about the impact of spin-waiting here, taking into account that Open MPI is running in degraded mode (so yield-when-idle should already be active).

Thanks
