Hi,
On 27/03/2017 17:51, Jeff Squyres (jsquyres) wrote:
> 1. Recall that sched_yield() has effectively become a no-op in newer Linux kernels.
> Hence, Open MPI's "yield when idle" may not do much to actually de-schedule a
> currently-running process.
Yes, I'm aware of this. However, that behavior should affect both Open MPI
versions (1.10.1 and 1.10.2) in the same way.
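Just to make sure we are talking about the same thing, below is a minimal sketch
of the kind of "yield when idle" spin-wait loop I have in mind (my own toy
example, not Open MPI's actual progress engine; compile with -pthread):

    #include <pthread.h>
    #include <sched.h>       /* sched_yield() */
    #include <stdatomic.h>
    #include <unistd.h>

    static atomic_int request_done;        /* set when the "message" arrives */

    /* Spin until the request completes, optionally yielding on every pass. */
    static void wait_for_completion(int yield_when_idle)
    {
        while (!atomic_load(&request_done)) {
            /* ... a real progress engine would poll the network here ... */
            if (yield_when_idle)
                sched_yield();   /* with CFS this often returns immediately,
                                    so the core is not necessarily released */
        }
    }

    static void *completer(void *arg)
    {
        (void)arg;
        sleep(1);                          /* simulate a late sender */
        atomic_store(&request_done, 1);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, completer, NULL);
        wait_for_completion(1);            /* burns a core even while yielding */
        pthread_join(t, NULL);
        return 0;
    }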
> 2. As for why there is a difference between version 1.10.1 and 1.10.2 in oversubscription
> behavior, we likely do not know offhand (as all of these emails have shown!). Honestly,
> we don't really pay much attention to oversubscription performance -- our focus tends to
> be on under/exactly-subscribed performance, because that's the normal operating mode for
> MPI applications. With oversubscribed, we have typically just said "all bets are
> off" and leave it at that.
I agree that oversubscription is not the typical usage scenario, and I
can understand that optimizing its performance is not a priority. But
maybe the problem I'm facing is just a symptom that something is not
working properly, and that could also impact undersubscribed scenarios
(to a lesser extent, of course).
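One thing I can try, to keep the two versions on an equal footing, is to pass
the oversubscription directive explicitly instead of relying on the defaults,
e.g. (the binary name and process count are placeholders, and the exact option
spelling should be checked against each build's mpirun man page):

    taskset -c 0-27 mpirun -np <N> --oversubscribe ./my_app
    # or, using the mapping modifier instead of the standalone flag:
    taskset -c 0-27 mpirun -np <N> --map-by socket:OVERSUBSCRIBE ./my_app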
> 3. I don't recall if there was a default affinity policy change between 1.10.1
> and 1.10.2. Do you know that your taskset command is -- for absolutely sure --
> overriding what Open MPI is doing? Or is what Open MPI is doing in terms of
> affinity/binding getting merged with what your taskset call is doing
> somehow...? (seems unlikely, but I figured I'd ask anyway)
Regarding the changes between 1.10.1 and 1.10.2, I only found one that
seems related to oversubscription (i.e. "Correctly handle
oversubscription when not given directives to permit it"). I don't know
whether this could be having an impact here.
Regarding the interaction of Open MPI's affinity options with taskset, I'd say
it is a combination: with taskset I'm just constraining the affinity placement
decided by Open MPI to the set of processors 0 to 27. In any case, the affinity
configuration reported by Open MPI is the same for v1.10.1 and v1.10.2, namely:
    Mapper requested: NULL
    Last mapper: round_robin
    Mapping policy: BYSOCKET
    Ranking policy: SLOT
    Binding policy: NONE:IF-SUPPORTED
    Cpu set: NULL
    PPR: NULL
    Cpus-per-rank: 1
    Num new daemons: 0
    New daemon starting vpid INVALID
    Num nodes: 1
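In case it helps, this is how I can double-check what each rank actually ends
up allowed to run on under both versions (the binary name and process count are
placeholders):

    # Open MPI's own view of the mapping/binding of every rank:
    taskset -c 0-27 mpirun -np <N> --report-bindings ./my_app

    # Effective affinity mask enforced by the kernel for a running rank:
    grep Cpus_allowed_list /proc/<pid_of_rank>/status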
> Per text later in your mail, "taskset -c 0-27" corresponds to the first
> hardware thread on each core.
> Hence, this is effectively binding each process to the set of all "first hardware
> threads" across all cores.
Yes, that was the intention: to avoid running two MPI processes on the
same physical core.
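To confirm that logical CPUs 0-27 really are the first hardware thread of each
core (and not, say, both threads of the first 14 cores), the topology can be
checked with hwloc or sysfs, for example:

    lstopo --no-io          # hwloc's view of sockets, cores and PUs
    cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
                            # shows which logical CPU shares cpu0's core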
> I'm guessing that this difference is going to end up being the symptom of a
> highly complex system, of which spin-waiting is playing a part. I.e., if Open
> MPI weren't spin waiting, this might not be happening.
I'm not sure how much spin-waiting matters here, taking into account that
Open MPI is running in degraded mode (i.e., yielding when idle).
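If it helps to rule this part out, I can force and inspect the yield-when-idle
setting explicitly, so that both installations definitely run with the same
value (the exact ompi_info syntax and parameter level may vary a bit between
releases, and the binary name is again a placeholder):

    # Force yield-when-idle (degraded mode) for both versions:
    taskset -c 0-27 mpirun -np <N> --mca mpi_yield_when_idle 1 ./my_app

    # Check the parameter's default/current value:
    ompi_info --param mpi all --level 9 | grep yield_when_idle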
Thanks