Re: [OMPI users] btl_openib_connect_oob.c:867:rml_recv_cb error after Infini-band stack update.

2014-06-20 Thread Ivanov, Aleksandar (INR)
Joshua, I am using a job scheduling system so ulimit -v is set by me. Nevertheless the ulimit -l is always half the value of ulimit -v. This is a bit strange, I am not sure whether this might be an issue (31GB and 156GB are decent values). For completeness the output of ulimit -o from one of th

Re: [OMPI users] btl_openib_connect_oob.c:867:rml_recv_cb error after Infini-band stack update.

2014-06-20 Thread Joshua Ladd
Aleksandar, Please ensure your system administrator follows the guidelines outlined in the link printed in the error message http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages Best, Josh On Fri, Jun 20, 2014 at 2:56 PM, Ivanov, Aleksandar (INR) < aleksandar.iva...@kit.edu> wrot
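A minimal sketch of how the locked-memory limit discussed in this thread could be checked on a compute node. It assumes the FAQ's recommendation that the limit be "unlimited"; the helper name check_memlock is hypothetical, and the optional argument exists only so the logic can be exercised without changing real limits.

```shell
# Hypothetical helper: report whether the locked-memory limit is unlimited.
# With no argument it reads the current shell's limit via `ulimit -l`.
check_memlock() {
    limit="${1:-$(ulimit -l)}"
    if [ "$limit" = "unlimited" ]; then
        echo "ok"
    else
        # Shells typically report this value in kilobytes.
        echo "limited: ${limit} KB"
    fi
}

check_memlock
```

Limits seen in an interactive login can differ from those inside a batch job, so a check like this is most useful when run from within a job script.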

Re: [OMPI users] btl_openib_connect_oob.c:867:rml_recv_cb error after Infini-band stack update.

2014-06-20 Thread Ivanov, Aleksandar (INR)
Hi, I was not the one updating the machine unfortunately, however I can ask my colleagues for a specific list of modifications done. If I understand you correctly you are referring to the "ulimit" parameters. They are properly set, in fact we use JMS as job scheduler, therefore the "ulimit -v" is

Re: [OMPI users] btl_openib_connect_oob.c:867:rml_recv_cb error after Infini-band stack update.

2014-06-20 Thread Ralph Castain
What was updated? If the OS, did you remember to set the memory registration limits to max? On Jun 20, 2014, at 11:25 AM, Ivanov, Aleksandar (INR) wrote: > > Dear Sir or Madam, > > I am using the openmpi 1.6.5 library compiled with IFORT / ICC 13.1.5. Since > a recent update of our machi

[OMPI users] btl_openib_connect_oob.c:867:rml_recv_cb error after Infini-band stack update.

2014-06-20 Thread Ivanov, Aleksandar (INR)
Dear Sir or Madam, I am using the openmpi 1.6.5 library compiled with IFORT / ICC 13.1.5. Since a recent update of our machine I started generating mpi errors. The code crashes after completing approx. 24 % of the total job. The same code and input were run before on the same machine and no

Re: [OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-20 Thread Ralph Castain
Put "orte_hetero_nodes=1" in your default MCA param file - users can override by setting that param to 0 On Jun 20, 2014, at 10:30 AM, Brock Palen wrote: > Perfection! That appears to do it for our standard case. > > Now I know how to set MCA options by env var or config file. How can I make
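A sketch of the default-plus-override arrangement suggested above, assuming a standard Open MPI install layout in which the system-wide parameter file is $prefix/etc/openmpi-mca-params.conf ($prefix stands for the install prefix and is not taken from the thread). The admin default goes in that file; a user overrides it through the environment.

```shell
# Site-wide default, set by the administrator in
#   $prefix/etc/openmpi-mca-params.conf
# as a single line:
#   orte_hetero_nodes = 1

# Per-user override for the current session: an OMPI_MCA_<param>
# environment variable takes precedence over the parameter files.
export OMPI_MCA_orte_hetero_nodes=0
```

A user could also place the same "orte_hetero_nodes = 0" line in $HOME/.openmpi/mca-params.conf to make the override persistent.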

Re: [OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-20 Thread Brock Palen
Perfection! That appears to do it for our standard case. Now I know how to set MCA options by env var or config file. How can I make this the default, that then a user can override? Brock Palen www.umich.edu/~brockp CAEN Advanced Computing XSEDE Campus Champion bro...@umich.edu (734)936-1985

Re: [OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-20 Thread Ralph Castain
I think I begin to grok at least part of the problem. If you are assigning different cpus on each node, then you'll need to tell us that by setting --hetero-nodes otherwise we won't have any way to report that back to mpirun for its binding calculation. Otherwise, we expect that the cpuset of t

Re: [OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-20 Thread Brock Palen
Extra data point if I do: [brockp@nyx5508 34241]$ mpirun --report-bindings --bind-to core hostname -- A request was made to bind to that would result in binding more processes than cpus on a resource: Bind to: CORE

Re: [OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-20 Thread Brock Palen
I was able to produce it in my test. orted affinity set by cpuset: [root@nyx5874 ~]# hwloc-bind --get --pid 103645 0xc002 This mask (CPUs 1, 14, 15), which spans sockets, matches the cpuset set up by the batch system. [root@nyx5874 ~]# cat /dev/cpuset/torque/12719806.nyx.engin.umich.edu/cpus
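The reading of the mask above can be checked by hand: 0xc002 has bits 1, 14 and 15 set. A small sketch that does the same decoding; the function name mask_to_cpus is hypothetical, and it only handles a single hex word that fits in shell arithmetic, not the comma-separated masks hwloc prints on larger machines.

```shell
# Hypothetical helper: list the CPU indices whose bits are set in a hex mask.
mask_to_cpus() {
    m=$(($1))      # convert the hex string to an integer
    i=0            # index of the bit currently being tested
    out=""
    while [ "$m" -gt 0 ]; do
        if [ $((m & 1)) -eq 1 ]; then
            out="${out:+$out }$i"   # append the index, space-separated
        fi
        m=$((m >> 1))
        i=$((i + 1))
    done
    echo "$out"
}

mask_to_cpus 0xc002   # prints: 1 14 15
```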

Re: [OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-20 Thread Brock Palen
Got it, I have the input from the user and am testing it out. It probably has less to do with Torque and more with cpusets; I'm working on reproducing it myself as well. Brock Palen www.umich.edu/~brockp CAEN Advanced Computing XSEDE Campus Champion bro...@umich.edu (734)936-1985 On Jun 20, 2014, at

Re: [OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-20 Thread Ralph Castain
Thanks - I'm just trying to reproduce one problem case so I can look at it. Given that I don't have access to a Torque machine, I need to "fake" it. On Jun 20, 2014, at 9:15 AM, Brock Palen wrote: > In this case they are a single socket, but as you can see they could be > either/or depending o

Re: [OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-20 Thread Brock Palen
In this case they are a single socket, but as you can see they could be either/or depending on the job. Brock Palen www.umich.edu/~brockp CAEN Advanced Computing XSEDE Campus Champion bro...@umich.edu (734)936-1985 On Jun 19, 2014, at 2:44 PM, Ralph Castain wrote: > Sorry, I should have been