Yes, ompi_info --all works, and ompi_info --param all all gives:

[brockp@flux-login1 34241]$ ompi_info --param all all
Error getting SCIF driver version
                 MCA btl: parameter "btl_tcp_if_include" (current value: "",
                          data source: default, level: 1 user/basic, type: string)
                          Comma-delimited list of devices and/or CIDR notation of
                          networks to use for MPI communication (e.g.,
                          "eth0,192.168.0.0/16"). Mutually exclusive with
                          btl_tcp_if_exclude.
                 MCA btl: parameter "btl_tcp_if_exclude" (current value:
                          "127.0.0.1/8,sppp", data source: default, level: 1
                          user/basic, type: string)
                          Comma-delimited list of devices and/or CIDR notation of
                          networks to NOT use for MPI communication -- all devices
                          not matching these specifications will be used (e.g.,
                          "eth0,192.168.0.0/16"). If set to a non-default value,
                          it is mutually exclusive with btl_tcp_if_include.

[brockp@flux-login1 34241]$ ompi_info --param all all --level 9

gives me what I expect. Thanks,

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
XSEDE Campus Champion
bro...@umich.edu
(734)936-1985

On Jun 24, 2014, at 10:22 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:

> Brock --
>
> Can you run with "ompi_info --all"?
>
> With "--param all all", ompi_info in v1.8.x is defaulting to only showing level 1 MCA params. It's showing you all possible components and variables, but only level 1.
>
> Or you could also use "--level 9" to show all 9 levels. Here's the relevant section from the README:
>
> -----
> The following options may be helpful:
>
> --all       Show a *lot* of information about your Open MPI installation.
> --parsable  Display all the information in an easily grep/cut/awk/sed-able format.
> --param <framework> <component>
>             A <framework> of "all" and a <component> of "all" will show all
>             parameters to all components.  Otherwise, the parameters of all
>             the components in a specific framework, or just the parameters of
>             a specific component can be displayed by using an appropriate
>             <framework> and/or <component> name.
> --level <level>
>             By default, ompi_info only shows "Level 1" MCA parameters --
>             parameters that can affect whether MPI processes can run
>             successfully or not (e.g., determining which network interfaces
>             to use).  The --level option will display all MCA parameters from
>             level 1 to <level> (the max <level> value is 9).  Use "ompi_info
>             --param <framework> <component> --level 9" to see *all* MCA
>             parameters for a given component.  See "The Modular Component
>             Architecture (MCA)" section, below, for a fuller explanation.
> -----
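For reference, the two approaches compose: --level can be combined with a specific framework/component to keep the output manageable. A couple of illustrative invocations, assuming a stock Open MPI 1.8 install:

    shell$ ompi_info --param btl tcp --level 9   # every TCP BTL parameter, levels 1-9
    shell$ ompi_info --all                       # everything about the installation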
> On Jun 24, 2014, at 5:19 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> That's odd - it shouldn't truncate the output. I'll take a look later today - we're all gathered for a developer's conference this week, so I'll be able to poke at this with Nathan.
>>
>> On Mon, Jun 23, 2014 at 3:15 PM, Brock Palen <bro...@umich.edu> wrote:
>> Perfection, flexible, extensible, so nice.
>>
>> BTW, this doesn't happen in older versions:
>>
>> [brockp@flux-login2 34241]$ ompi_info --param all all
>> Error getting SCIF driver version
>>          MCA btl: parameter "btl_tcp_if_include" (current value: "", data source: default, level: 1 user/basic, type: string)
>>                   Comma-delimited list of devices and/or CIDR notation of networks to use for MPI communication (e.g., "eth0,192.168.0.0/16"). Mutually exclusive with btl_tcp_if_exclude.
>>          MCA btl: parameter "btl_tcp_if_exclude" (current value: "127.0.0.1/8,sppp", data source: default, level: 1 user/basic, type: string)
>>                   Comma-delimited list of devices and/or CIDR notation of networks to NOT use for MPI communication -- all devices not matching these specifications will be used (e.g., "eth0,192.168.0.0/16"). If set to a non-default value, it is mutually exclusive with btl_tcp_if_include.
>>
>> This is normally much longer. And yes, we don't have the Phi stuff installed on all nodes -- strange that 'all all' is now very short; ompi_info -a still works though.
>>
>> On Jun 20, 2014, at 1:48 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> Put "orte_hetero_nodes=1" in your default MCA param file - users can override by setting that param to 0.
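For anyone following along, a sketch of what that looks like; the paths assume a default installation prefix, so adjust for your build. The system-wide file sets the default, and a user can override it via their own file, an environment variable, or the mpirun command line (the more specific source wins):

    # site-wide default: $prefix/etc/openmpi-mca-params.conf
    orte_hetero_nodes = 1

    # per-user override: ~/.openmpi/mca-params.conf
    orte_hetero_nodes = 0

    # or per-job, via environment or command line:
    shell$ export OMPI_MCA_orte_hetero_nodes=0
    shell$ mpirun --mca orte_hetero_nodes 0 ./a.out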
>>>
>>> On Jun 20, 2014, at 10:30 AM, Brock Palen <bro...@umich.edu> wrote:
>>>
>>>> Perfection! That appears to do it for our standard case.
>>>>
>>>> Now I know how to set MCA options by env var or config file. How can I make this the default, so that a user can then override it?
>>>>
>>>> On Jun 20, 2014, at 1:21 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>
>>>>> I think I begin to grok at least part of the problem. If you are assigning different cpus on each node, then you'll need to tell us that by setting --hetero-nodes; otherwise we won't have any way to report that back to mpirun for its binding calculation.
>>>>>
>>>>> Otherwise, we expect that the cpuset of the first node we launch a daemon onto (or where mpirun is executing, if we are only launching local to mpirun) accurately represents the cpuset on every node in the allocation.
>>>>>
>>>>> We still might well have a bug in our binding computation - but the above will definitely impact what you said the user did.
>>>>>
>>>>> On Jun 20, 2014, at 10:06 AM, Brock Palen <bro...@umich.edu> wrote:
>>>>>
>>>>>> Extra data point. If I do:
>>>>>>
>>>>>> [brockp@nyx5508 34241]$ mpirun --report-bindings --bind-to core hostname
>>>>>> --------------------------------------------------------------------------
>>>>>> A request was made to bind to that would result in binding more
>>>>>> processes than cpus on a resource:
>>>>>>
>>>>>>    Bind to:     CORE
>>>>>>    Node:        nyx5513
>>>>>>    #processes:  2
>>>>>>    #cpus:       1
>>>>>>
>>>>>> You can override this protection by adding the "overload-allowed"
>>>>>> option to your binding directive.
>>>>>> --------------------------------------------------------------------------
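The qualifier named in that message is appended to the binding directive with a colon; a minimal illustration, with hostname standing in for a real application:

    shell$ mpirun --bind-to core:overload-allowed hostname

This disables the overload protection, so two ranks may then legitimately land on the same core.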
>>>>>>
>>>>>> [brockp@nyx5508 34241]$ mpirun -H nyx5513 uptime
>>>>>>  13:01:37 up 31 days, 23:06, 0 users, load average: 10.13, 10.90, 12.38
>>>>>>  13:01:37 up 31 days, 23:06, 0 users, load average: 10.13, 10.90, 12.38
>>>>>> [brockp@nyx5508 34241]$ mpirun -H nyx5513 --bind-to core hwloc-bind --get
>>>>>> 0x00000010
>>>>>> 0x00001000
>>>>>> [brockp@nyx5508 34241]$ cat $PBS_NODEFILE | grep nyx5513
>>>>>> nyx5513
>>>>>> nyx5513
>>>>>>
>>>>>> Interesting: if I force bind-to core, MPI barfs saying there is only 1 cpu available, while PBS says it gave it two; and if I run hwloc-bind --get on just that node (this is all inside an interactive job), I get what I expect.
>>>>>>
>>>>>> Is there a way to get a map of what MPI thinks it has on each host?
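One way to get such a picture, for reference: mpirun can print its own view of the allocation and the resulting process map (flags as documented for Open MPI 1.8):

    shell$ mpirun --display-allocation --display-map hostname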
>>>>>>
>>>>>> On Jun 20, 2014, at 12:38 PM, Brock Palen <bro...@umich.edu> wrote:
>>>>>>
>>>>>>> I was able to reproduce it in my test.
>>>>>>>
>>>>>>> orted affinity set by cpuset:
>>>>>>> [root@nyx5874 ~]# hwloc-bind --get --pid 103645
>>>>>>> 0x0000c002
>>>>>>>
>>>>>>> This mask (cores 1, 14, and 15), which spans sockets, matches the cpuset set up by the batch system:
>>>>>>> [root@nyx5874 ~]# cat /dev/cpuset/torque/12719806.nyx.engin.umich.edu/cpus
>>>>>>> 1,14-15
>>>>>>>
>>>>>>> The ranks, though, were all set to the same core:
>>>>>>>
>>>>>>> [root@nyx5874 ~]# hwloc-bind --get --pid 103871
>>>>>>> 0x00008000
>>>>>>> [root@nyx5874 ~]# hwloc-bind --get --pid 103872
>>>>>>> 0x00008000
>>>>>>> [root@nyx5874 ~]# hwloc-bind --get --pid 103873
>>>>>>> 0x00008000
>>>>>>>
>>>>>>> Which is core 15. --report-bindings gave me the output below; you can see how on a few nodes all the ranks were bound to the same core, the last one in each case. Above I only gave you the results for the host nyx5874.
>>>>>>>
>>>>>>> [nyx5526.engin.umich.edu:23726] MCW rank 0 is not bound (or bound to all available processors)
>>>>>>> [nyx5878.engin.umich.edu:103925] MCW rank 8 is not bound (or bound to all available processors)
>>>>>>> [nyx5533.engin.umich.edu:123988] MCW rank 1 is not bound (or bound to all available processors)
>>>>>>> [nyx5879.engin.umich.edu:102808] MCW rank 9 is not bound (or bound to all available processors)
>>>>>>> [nyx5874.engin.umich.edu:103645] MCW rank 41 bound to socket 1[core 15[hwt 0]]: [./././././././.][./././././././B]
>>>>>>> [nyx5874.engin.umich.edu:103645] MCW rank 42 bound to socket 1[core 15[hwt 0]]: [./././././././.][./././././././B]
>>>>>>> [nyx5874.engin.umich.edu:103645] MCW rank 43 bound to socket 1[core 15[hwt 0]]: [./././././././.][./././././././B]
>>>>>>> [nyx5888.engin.umich.edu:117400] MCW rank 11 is not bound (or bound to all available processors)
>>>>>>> [nyx5786.engin.umich.edu:30004] MCW rank 19 bound to socket 1[core 15[hwt 0]]: [./././././././.][./././././././B]
>>>>>>> [nyx5786.engin.umich.edu:30004] MCW rank 18 bound to socket 1[core 15[hwt 0]]: [./././././././.][./././././././B]
>>>>>>> [nyx5594.engin.umich.edu:33884] MCW rank 24 bound to socket 1[core 15[hwt 0]]: [./././././././.][./././././././B]
>>>>>>> [nyx5594.engin.umich.edu:33884] MCW rank 25 bound to socket 1[core 15[hwt 0]]: [./././././././.][./././././././B]
>>>>>>> [nyx5594.engin.umich.edu:33884] MCW rank 26 bound to socket 1[core 15[hwt 0]]: [./././././././.][./././././././B]
>>>>>>> [nyx5798.engin.umich.edu:53026] MCW rank 59 bound to socket 1[core 15[hwt 0]]: [./././././././.][./././././././B]
>>>>>>> [nyx5798.engin.umich.edu:53026] MCW rank 60 bound to socket 1[core 15[hwt 0]]: [./././././././.][./././././././B]
>>>>>>> [nyx5798.engin.umich.edu:53026] MCW rank 56 bound to socket 1[core 15[hwt 0]]: [./././././././.][./././././././B]
>>>>>>> [nyx5798.engin.umich.edu:53026] MCW rank 57 bound to socket 1[core 15[hwt 0]]: [./././././././.][./././././././B]
>>>>>>> [nyx5798.engin.umich.edu:53026] MCW rank 58 bound to socket 1[core 15[hwt 0]]: [./././././././.][./././././././B]
>>>>>>> [nyx5545.engin.umich.edu:88170] MCW rank 2 is not bound (or bound to all available processors)
>>>>>>> [nyx5613.engin.umich.edu:25229] MCW rank 31 is not bound (or bound to all available processors)
>>>>>>> [nyx5880.engin.umich.edu:01406] MCW rank 10 is not bound (or bound to all available processors)
>>>>>>> [nyx5770.engin.umich.edu:86538] MCW rank 6 is not bound (or bound to all available processors)
>>>>>>> [nyx5613.engin.umich.edu:25228] MCW rank 30 is not bound (or bound to all available processors)
>>>>>>> [nyx5577.engin.umich.edu:65949] MCW rank 4 is not bound (or bound to all available processors)
>>>>>>> [nyx5607.engin.umich.edu:30379] MCW rank 14 is not bound (or bound to all available processors)
>>>>>>> [nyx5544.engin.umich.edu:72960] MCW rank 47 is not bound (or bound to all available processors)
>>>>>>> [nyx5544.engin.umich.edu:72959] MCW rank 46 is not bound (or bound to all available processors)
>>>>>>> [nyx5848.engin.umich.edu:04332] MCW rank 33 is not bound (or bound to all available processors)
>>>>>>> [nyx5848.engin.umich.edu:04333] MCW rank 34 is not bound (or bound to all available processors)
>>>>>>> [nyx5544.engin.umich.edu:72958] MCW rank 45 is not bound (or bound to all available processors)
>>>>>>> [nyx5858.engin.umich.edu:12165] MCW rank 35 is not bound (or bound to all available processors)
>>>>>>> [nyx5607.engin.umich.edu:30380] MCW rank 15 is not bound (or bound to all available processors)
>>>>>>> [nyx5544.engin.umich.edu:72957] MCW rank 44 is not bound (or bound to all available processors)
>>>>>>> [nyx5858.engin.umich.edu:12167] MCW rank 37 is not bound (or bound to all available processors)
>>>>>>> [nyx5870.engin.umich.edu:33811] MCW rank 7 is not bound (or bound to all available processors)
>>>>>>> [nyx5582.engin.umich.edu:81994] MCW rank 5 is not bound (or bound to all available processors)
>>>>>>> [nyx5848.engin.umich.edu:04331] MCW rank 32 is not bound (or bound to all available processors)
>>>>>>> [nyx5557.engin.umich.edu:46654] MCW rank 50 is not bound (or bound to all available processors)
>>>>>>> [nyx5858.engin.umich.edu:12166] MCW rank 36 is not bound (or bound to all available processors)
>>>>>>> [nyx5799.engin.umich.edu:67802] MCW rank 22 is not bound (or bound to all available processors)
>>>>>>> [nyx5799.engin.umich.edu:67803] MCW rank 23 is not bound (or bound to all available processors)
>>>>>>> [nyx5556.engin.umich.edu:50889] MCW rank 3 is not bound (or bound to all available processors)
>>>>>>> [nyx5625.engin.umich.edu:95931] MCW rank 53 is not bound (or bound to all available processors)
>>>>>>> [nyx5625.engin.umich.edu:95930] MCW rank 52 is not bound (or bound to all available processors)
>>>>>>> [nyx5557.engin.umich.edu:46655] MCW rank 51 is not bound (or bound to all available processors)
>>>>>>> [nyx5625.engin.umich.edu:95932] MCW rank 54 is not bound (or bound to all available processors)
>>>>>>> [nyx5625.engin.umich.edu:95933] MCW rank 55 is not bound (or bound to all available processors)
>>>>>>> [nyx5866.engin.umich.edu:16306] MCW rank 40 is not bound (or bound to all available processors)
>>>>>>> [nyx5861.engin.umich.edu:22761] MCW rank 61 is not bound (or bound to all available processors)
>>>>>>> [nyx5861.engin.umich.edu:22762] MCW rank 62 is not bound (or bound to all available processors)
>>>>>>> [nyx5861.engin.umich.edu:22763] MCW rank 63 is not bound (or bound to all available processors)
>>>>>>> [nyx5557.engin.umich.edu:46652] MCW rank 48 is not bound (or bound to all available processors)
>>>>>>> [nyx5557.engin.umich.edu:46653] MCW rank 49 is not bound (or bound to all available processors)
>>>>>>> [nyx5866.engin.umich.edu:16304] MCW rank 38 is not bound (or bound to all available processors)
>>>>>>> [nyx5788.engin.umich.edu:02465] MCW rank 20 is not bound (or bound to all available processors)
>>>>>>> [nyx5597.engin.umich.edu:68071] MCW rank 27 is not bound (or bound to all available processors)
>>>>>>> [nyx5775.engin.umich.edu:27952] MCW rank 17 is not bound (or bound to all available processors)
>>>>>>> [nyx5866.engin.umich.edu:16305] MCW rank 39 is not bound (or bound to all available processors)
>>>>>>> [nyx5788.engin.umich.edu:02466] MCW rank 21 is not bound (or bound to all available processors)
>>>>>>> [nyx5775.engin.umich.edu:27951] MCW rank 16 is not bound (or bound to all available processors)
>>>>>>> [nyx5597.engin.umich.edu:68073] MCW rank 29 is not bound (or bound to all available processors)
>>>>>>> [nyx5597.engin.umich.edu:68072] MCW rank 28 is not bound (or bound to all available processors)
>>>>>>> [nyx5552.engin.umich.edu:30481] MCW rank 12 is not bound (or bound to all available processors)
>>>>>>> [nyx5552.engin.umich.edu:30482] MCW rank 13 is not bound (or bound to all available processors)
>>>>>>>
>>>>>>> On Jun 20, 2014, at 12:20 PM, Brock Palen <bro...@umich.edu> wrote:
>>>>>>>
>>>>>>>> Got it. I have the input from the user and am testing it out.
>>>>>>>>
>>>>>>>> It probably has less to do with Torque and more with cpusets.
>>>>>>>>
>>>>>>>> I'm working on reproducing it myself also.
>>>>>>>>
>>>>>>>> On Jun 20, 2014, at 12:18 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>
>>>>>>>>> Thanks - I'm just trying to reproduce one problem case so I can look at it. Given that I don't have access to a Torque machine, I need to "fake" it.
>>>>>>>>>
>>>>>>>>> On Jun 20, 2014, at 9:15 AM, Brock Palen <bro...@umich.edu> wrote:
>>>>>>>>>
>>>>>>>>>> In this case they are on a single socket, but as you can see they could be either/or depending on the job.
>>>>>>>>>>
>>>>>>>>>> On Jun 19, 2014, at 2:44 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> Sorry, I should have been clearer - I was asking if cores 8-11 are all on one socket, or span multiple sockets.
>>>>>>>>>>>
>>>>>>>>>>> On Jun 19, 2014, at 11:36 AM, Brock Palen <bro...@umich.edu> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Ralph,
>>>>>>>>>>>>
>>>>>>>>>>>> It was a large job spread across nodes. Our system allows users to ask for 'procs', which can be laid out in any arrangement.
>>>>>>>>>>>>
>>>>>>>>>>>> The list:
>>>>>>>>>>>>
>>>>>>>>>>>>> [nyx5406:2][nyx5427:2][nyx5506:2][nyx5311:3]
>>>>>>>>>>>>> [nyx5329:4][nyx5398:4][nyx5396:11][nyx5397:11]
>>>>>>>>>>>>> [nyx5409:11][nyx5411:11][nyx5412:3]
>>>>>>>>>>>>
>>>>>>>>>>>> shows that nyx5406 had 2 cores, nyx5427 also 2, and nyx5411 had 11.
>>>>>>>>>>>>
>>>>>>>>>>>> They could be spread across any number of socket configurations. We start very lax -- "a user requests X procs" -- and the user can then add stricter requirements from there. We support mostly serial users, and users can colocate on nodes.
>>>>>>>>>>>>
>>>>>>>>>>>> That is good to know. I think we would want to turn our default to 'bind to core', except for our few users who use hybrid mode.
>>>>>>>>>>>>
>>>>>>>>>>>> Our cpuset tells you which cores the job is assigned. So in the problem case provided, the cpuset/cgroup shows only cores 8-11 are available to this job on this node.
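For reference, hwloc can show whether a core range such as 8-11 stays within one socket. hwloc-ls renders the machine topology on the console, and hwloc-calc should report which socket(s) a set of cores intersects (syntax per hwloc 1.x; treat the second command as a sketch):

    shell$ hwloc-ls                                  # sockets/cores/PUs as an indented tree
    shell$ hwloc-calc --intersect socket core:8-11   # socket indexes covering cores 8-11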
>>>>>>>>>>>>
>>>>>>>>>>>> On Jun 18, 2014, at 11:10 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> The default binding option depends on the number of procs - it is bind-to core for np=2, and bind-to socket for np > 2. You never said, but should I assume you ran 4 ranks? If so, then we should be trying to bind-to socket.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm not sure what your cpuset is telling us - are you binding us to a socket? Are some cpus in one socket, and some in another?
>>>>>>>>>>>>>
>>>>>>>>>>>>> It could be that the cpuset + bind-to socket is resulting in some odd behavior, but I'd need a little more info to narrow it down.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Jun 18, 2014, at 7:48 PM, Brock Palen <bro...@umich.edu> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have started using 1.8.1 for some codes (meep in this case). It sometimes works fine, but in a few cases I am seeing ranks being given overlapping CPU assignments (not always, though).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Example job, default binding options (so by-core, right?):
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Assigned nodes -- the one in question is nyx5398; we use Torque cpusets and use TM to spawn:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [nyx5406:2][nyx5427:2][nyx5506:2][nyx5311:3]
>>>>>>>>>>>>>> [nyx5329:4][nyx5398:4][nyx5396:11][nyx5397:11]
>>>>>>>>>>>>>> [nyx5409:11][nyx5411:11][nyx5412:3]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [root@nyx5398 ~]# hwloc-bind --get --pid 16065
>>>>>>>>>>>>>> 0x00000200
>>>>>>>>>>>>>> [root@nyx5398 ~]# hwloc-bind --get --pid 16066
>>>>>>>>>>>>>> 0x00000800
>>>>>>>>>>>>>> [root@nyx5398 ~]# hwloc-bind --get --pid 16067
>>>>>>>>>>>>>> 0x00000200
>>>>>>>>>>>>>> [root@nyx5398 ~]# hwloc-bind --get --pid 16068
>>>>>>>>>>>>>> 0x00000800
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [root@nyx5398 ~]# cat /dev/cpuset/torque/12703230.nyx.engin.umich.edu/cpus
>>>>>>>>>>>>>> 8-11
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So Torque claims the cpuset set up for the job has 4 cores, but as you can see the ranks were given identical bindings.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I checked the PIDs; they were part of the correct cpuset. I also checked orted:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [root@nyx5398 ~]# hwloc-bind --get --pid 16064
>>>>>>>>>>>>>> 0x00000f00
>>>>>>>>>>>>>> [root@nyx5398 ~]# hwloc-calc --intersect PU 16064
>>>>>>>>>>>>>> ignored unrecognized argument 16064
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [root@nyx5398 ~]# hwloc-calc --intersect PU 0x00000f00
>>>>>>>>>>>>>> 8,9,10,11
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Which is exactly what I would expect.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So umm, I'm lost as to why this might happen. What else should I check? Like I said, not all jobs show this behavior.
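As a quick cross-check, the same hwloc-calc form decodes the per-rank masks from the example above; two ranks reporting the same PU index are sharing a core:

    shell$ hwloc-calc --intersect PU 0x00000200   # -> 9   (PIDs 16065 and 16067)
    shell$ hwloc-calc --intersect PU 0x00000800   # -> 11  (PIDs 16066 and 16068)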
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Brock Palen
>>>>>>>>>>>>>> www.umich.edu/~brockp
>>>>>>>>>>>>>> CAEN Advanced Computing
>>>>>>>>>>>>>> XSEDE Campus Champion
>>>>>>>>>>>>>> bro...@umich.edu
>>>>>>>>>>>>>> (734)936-1985
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/