Re: [OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-25 Thread Brock Palen
Yes, ompi_info --all works; ompi_info --param all all gives: [brockp@flux-login1 34241]$ ompi_info --param all all Error getting SCIF driver version MCA btl: parameter "btl_tcp_if_include" (current value: "", data source: default, level: 1 user/basic, type:

Re: [OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-24 Thread Jeff Squyres (jsquyres)
Brock -- Can you run with "ompi_info --all"? With "--param all all", ompi_info in v1.8.x defaults to showing only level 1 MCA params. It's showing you all possible components and variables, but only level 1. You could also use "--level 9" to show all 9 levels. Here's the relevant se
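
For reference, a minimal sketch of the invocations being discussed (flag spellings as in 1.8.x ompi_info; exact output will vary by build):

    # 1.8.x default: only level 1 (user/basic) MCA params are shown
    ompi_info --param all all

    # raise the reporting threshold to all nine MCA parameter levels
    ompi_info --param all all --level 9

    # or dump every component and variable in one go
    ompi_info --all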

Re: [OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-24 Thread Ralph Castain
That's odd - it shouldn't truncate the output. I'll take a look later today - we're all gathered for a developer's conference this week, so I'll be able to poke at this with Nathan. On Mon, Jun 23, 2014 at 3:15 PM, Brock Palen wrote: > Perfection, flexible, extensible, so nice. > > BTW this do

Re: [OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-24 Thread Ralph Castain
Let's say that the downside is an unknown at this time. The only real impact of setting that param is that each daemon now reports its topology at startup. Without the param, only the daemon on the first node does so. The concern expressed when we first added that report was that the volume of data

Re: [OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-23 Thread Maxime Boissonneault
Hi, I've been following this thread because it may be relevant to our setup. Is there a drawback to having orte_hetero_nodes=1 as the default MCA parameter? Is there a reason why the most generic case is not assumed? Maxime Boissonneault On 2014-06-20 13:48, Ralph Castain wrote: Put "orte_h

Re: [OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-23 Thread Brock Palen
Perfection, flexible, extensible, so nice. BTW this doesn't happen in older versions: [brockp@flux-login2 34241]$ ompi_info --param all all Error getting SCIF driver version MCA btl: parameter "btl_tcp_if_include" (current value: "", data source: default,

Re: [OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-20 Thread Ralph Castain
Put "orte_hetero_nodes=1" in your default MCA param file - uses can override by setting that param to 0 On Jun 20, 2014, at 10:30 AM, Brock Palen wrote: > Perfection! That appears to do it for our standard case. > > Now I know how to set MCA options by env var or config file. How can I make

Re: [OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-20 Thread Brock Palen
Perfection! That appears to do it for our standard case. Now I know how to set MCA options by env var or config file. How can I make this the default, which a user can then override? Brock Palen www.umich.edu/~brockp CAEN Advanced Computing XSEDE Campus Champion bro...@umich.edu (734)936-1985

Re: [OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-20 Thread Ralph Castain
I think I begin to grok at least part of the problem. If you are assigning different cpus on each node, then you'll need to tell us that by setting --hetero-nodes; otherwise we won't have any way to report that back to mpirun for its binding calculation. Without it, we expect that the cpuset of t
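
For example, a hypothetical run using the flag Ralph mentions (./my_app is a placeholder; --report-bindings is added only to make the resulting binding visible):

    # tell mpirun the per-node cpusets/topologies may differ,
    # so each daemon reports its own topology for the binding calculation
    mpirun --hetero-nodes --report-bindings -np 16 ./my_app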

Re: [OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-20 Thread Brock Palen
Extra data point if I do: [brockp@nyx5508 34241]$ mpirun --report-bindings --bind-to core hostname -- A request was made to bind to that would result in binding more processes than cpus on a resource: Bind to: CORE

Re: [OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-20 Thread Brock Palen
I was able to reproduce it in my test. orted affinity set by cpuset: [root@nyx5874 ~]# hwloc-bind --get --pid 103645 0xc002 This mask (1, 14, 15), which spans sockets, matches the cpuset set up by the batch system. [root@nyx5874 ~]# cat /dev/cpuset/torque/12719806.nyx.engin.umich.edu/cpus
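
The same check, written out as a short sketch (PID, job ID, and mask are the values quoted above; the hwloc-calc step is an assumed extra that decodes the bitmask into processing-unit indexes):

    # which CPUs is the orted actually bound to?
    hwloc-bind --get --pid 103645          # -> 0xc002 (bits 1, 14, 15)

    # which CPUs did the batch system's cpuset grant the job?
    cat /dev/cpuset/torque/12719806.nyx.engin.umich.edu/cpus

    # optional: expand the mask into a list of PU indexes
    hwloc-calc --intersect PU 0xc002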

Re: [OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-20 Thread Brock Palen
Got it, I have the input from the user and am testing it out. It probably has less to do with Torque and more with cpusets; I'm working on reproducing it myself as well. Brock Palen www.umich.edu/~brockp CAEN Advanced Computing XSEDE Campus Champion bro...@umich.edu (734)936-1985 On Jun 20, 2014, at

Re: [OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-20 Thread Ralph Castain
Thanks - I'm just trying to reproduce one problem case so I can look at it. Given that I don't have access to a Torque machine, I need to "fake" it. On Jun 20, 2014, at 9:15 AM, Brock Palen wrote: > In this case they are a single socket, but as you can see they could be > ether/or depending o

Re: [OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-20 Thread Brock Palen
In this case they are on a single socket, but as you can see it could be either way, depending on the job. Brock Palen www.umich.edu/~brockp CAEN Advanced Computing XSEDE Campus Champion bro...@umich.edu (734)936-1985 On Jun 19, 2014, at 2:44 PM, Ralph Castain wrote: > Sorry, I should have been

Re: [OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-19 Thread Ralph Castain
Sorry, I should have been clearer - I was asking if cores 8-11 are all on one socket, or span multiple sockets On Jun 19, 2014, at 11:36 AM, Brock Palen wrote: > Ralph, > > It was a large job spread across. Our system allows users to ask for 'procs' > which are laid out in any format. > >

Re: [OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-19 Thread Brock Palen
Ralph, It was a large job spread across several nodes. Our system allows users to ask for 'procs', which are laid out in any format. The list: > [nyx5406:2][nyx5427:2][nyx5506:2][nyx5311:3] > [nyx5329:4][nyx5398:4][nyx5396:11][nyx5397:11] > [nyx5409:11][nyx5411:11][nyx5412:3] shows that nyx5406 had 2 cores

Re: [OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-18 Thread Ralph Castain
The default binding option depends on the number of procs - it is bind-to core for np=2, and bind-to socket for np > 2. You never said, but should I assume you ran 4 ranks? If so, then we should be trying to bind-to socket. I'm not sure what your cpuset is telling us - are you binding us to a so
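
A quick way to see what those defaults resolve to on a given allocation (illustrative only; ./a.out is a placeholder):

    # np=2 should report core bindings, np>2 socket bindings
    mpirun -np 2 --report-bindings ./a.out
    mpirun -np 4 --report-bindings ./a.out

    # force a policy explicitly to compare against the default
    mpirun -np 4 --bind-to socket --report-bindings ./a.out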

[OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-18 Thread Brock Palen
I have started using 1.8.1 for some codes (meep in this case), and it usually works fine, but in a few cases I am seeing ranks being given overlapping CPU assignments, though not always. Example job, default binding options (so by-core, right?): Assigned nodes, the one in question is nyx5398, w