Up to this point, I've been running a single MPI rank per physical host (using multithreading within my application to use all available cores). I use this command:

    mpirun -N 1 --bind-to none --hostfile hosts.txt

where hosts.txt has an IP address on each line.

I've started running on machines with significant NUMA effects. On a single one of these machines, I've started running a separate rank per NUMA node. On a machine with 64 CPUs and 4 NUMA nodes, I do this:

    mpirun -N 1 --bind-to numa

Watching which processors are active in 'top', I've convinced myself that this behaves the way I want it to.

I now want to combine these two: running on, say, 10 physical hosts with 4 NUMA nodes each, for a total of 40 ranks. But the order of the ranks is important (for efficiency, due to how the application divides up work across ranks). So I want ranks 0-3 to be on host 0 across its NUMA nodes, then ranks 4-7 on host 1 across its NUMA nodes, etc.

Some guesses:

    mpirun -n 40 --map-by numa --rank-by numa --hostfile hosts.txt

or

    mpirun --map-by ppr:4:node --rank-by numa --hostfile hosts.txt

where hosts.txt still has a single IP address per line (and doesn't need a 'slots=4').

I'd like to make sure I get the syntax right in general, rather than empirically trying guesses until one looks like it works and then inevitably finding that it doesn't behave the way I thought when I change the number of machines or run on machines with a different number of NUMA nodes.

Thanks.
-Adam
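
P.S. To make the intended layout concrete, here is a small Python sketch of the rank-to-location mapping I'm after. The function names are made up for illustration and are not part of Open MPI. It also assumes that ranks within a host follow NUMA node order (rank 0 on NUMA node 0, rank 1 on NUMA node 1, and so on), which is stricter than what I strictly need.

```python
def expected_placement(rank, numa_per_host=4):
    """Return (host_index, numa_index) for a rank under the desired layout.

    Hypothetical helper: consecutive ranks fill one host's NUMA nodes
    before moving on to the next host. The NUMA ordering within a host
    is an assumption made for illustration.
    """
    return divmod(rank, numa_per_host)


def layout(num_hosts=10, numa_per_host=4):
    """List the expected (host_index, numa_index) for every rank, in rank order."""
    total_ranks = num_hosts * numa_per_host
    return [expected_placement(r, numa_per_host) for r in range(total_ranks)]


if __name__ == "__main__":
    # Print the target table for 10 hosts with 4 NUMA nodes each (40 ranks).
    for rank, (host, numa) in enumerate(layout()):
        print(f"rank {rank:2d} -> host {host}, NUMA node {numa}")
```

A table like this is what I would compare against the bindings that mpirun reports (for example with --report-bindings) for whichever option combination turns out to be correct. Changing num_hosts or numa_per_host gives the target for other cluster shapes.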
_______________________________________________ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users