Edson,
Based on your questions I would suggest you take a look at the ULFM-enabled
version of Open MPI. You can find it at http://fault-tolerance.org/.
George.
On Aug 11, 2013, at 15:33 , Edson Tavares de Camargo
wrote:
> Thanks a lot for your reply, Ralph!
>
> Could you tell me in what si
Dear Open MPI pros
On one of the clusters here, which has InfiniBand,
I am getting this type of error from
Open MPI 1.4.3 (OK, I know it is old ...):
*
Tcl_InitNotifier: unable to start notifier thread
Abort: Command not found.
Tcl_InitNotifi
Check ompi_info - was it built with openib support?
Then check that the mca_btl_openib library is present in the prefix/lib/openmpi
directory
Sounds like it isn't finding the openib plugin.
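
The second check above (is the plugin file actually in the component directory?) could be scripted roughly as follows. This is only a sketch: the helper name find_openib_plugin, the PREFIX environment variable, and the /usr/local fallback are assumptions made for illustration, not anything Open MPI provides. An empty result is not conclusive on its own, since the component may live elsewhere or have been built in a different way.

```python
import glob
import os


def find_openib_plugin(prefix):
    """Return files named mca_btl_openib* under <prefix>/lib/openmpi.

    `prefix` stands for the Open MPI installation prefix; the helper
    name itself is hypothetical.
    """
    plugin_dir = os.path.join(prefix, "lib", "openmpi")
    return sorted(glob.glob(os.path.join(plugin_dir, "mca_btl_openib*")))


if __name__ == "__main__":
    # PREFIX and the /usr/local fallback are assumptions; adjust to your install.
    prefix = os.environ.get("PREFIX", "/usr/local")
    matches = find_openib_plugin(prefix)
    if matches:
        for path in matches:
            print("found:", path)
    else:
        print("no mca_btl_openib* file under", os.path.join(prefix, "lib", "openmpi"))
```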
On Aug 12, 2013, at 11:57 AM, Gus Correa wrote:
> Dear Open MPI pros
>
> On one of the clusters here,
Hi, George!
I had studied the ULFM document before beginning the tests with failure
detection in Open MPI, and it seems to me a good choice.
But I'm having trouble with the ULFM-enabled version of Open MPI
(openmpi-1.7ft_b3.tar.gz). I followed the ULFM setup (at
http://fault-tolerance.org/ulfm/ulfm-setup/). T
Thank you for the prompt help, Ralph!
Yes, it is OMPI 1.4.3 built with openib support:
$ ompi_info | grep openib
MCA btl: openib (MCA v2.0, API v2.0, Component v1.4.3)
There are three libraries in prefix/lib/openmpi,
but no mca_btl_openib library.
$ ls $PREFIX/lib/openmpi/
libompi
Hi Ralph, all
I include more information below,
after turning on btl_openib_verbose 30.
As you can see, OMPI tries, and fails, to load openib.
Last week I reduced the memlock limit from unlimited
to ~12GB, as part of a general attempt to rein in memory
use/abuse by jobs sharing a node.
No paral
No, this has nothing to do with the registration limit. For some reason, the
system is refusing to create a thread - i.e., it is pthread_create that is
failing. I have no idea what would be causing that to happen.
Try setting it to unlimited and see if it allows the thread to start, I guess.
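
One way to look at this without MPI in the picture is a small script that prints the memlock limit the process actually sees and then tries to start a single thread. This is a rough sketch, not a diagnosis: the function names are made up for illustration, and it assumes a Unix-like node where Python's resource module exposes RLIMIT_MEMLOCK. Run it inside the batch job, since limits there can differ from those of a login shell.

```python
import resource
import threading


def describe_limit(value):
    """Render an rlimit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def can_start_thread():
    """Try to start and join one thread; return True if it ran."""
    ran = []
    worker = threading.Thread(target=lambda: ran.append(True))
    try:
        worker.start()
    except RuntimeError:
        # CPython raises RuntimeError ("can't start new thread") when the
        # underlying thread creation fails.
        return False
    worker.join()
    return bool(ran)


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    print("memlock soft limit (bytes):", describe_limit(soft))
    print("memlock hard limit (bytes):", describe_limit(hard))
    print("thread started:", can_start_thread())
```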
O
Hi Ralph
Sorry if this is more of an IB than an OMPI problem,
but from where I sit it shows up as OMPI jobs failing.
Yes, indeed I was setting memlock to unlimited in limits.conf
and in the pbs_mom, restarting everything, relaunching the job.
The error message changes, but it still fails on I
Seems strange that it would have something to do with IB - it seems that alloc
itself is failing, and at only 512 bytes, that doesn't seem like something IB
would cause.
If you write a little program that calls alloc (no MPI), does it also fail?
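
A minimal sketch of such a test, assuming "alloc" here means the C library's malloc and using the 512-byte size mentioned above. It calls malloc through Python's ctypes rather than from a compiled C program, which is a substitution made for brevity; the function name try_malloc is hypothetical.

```python
import ctypes
import ctypes.util


def try_malloc(nbytes=512):
    """Call the C library's malloc directly; return True if it succeeds."""
    # find_library may return None on some systems; CDLL(None) then falls
    # back to the symbols already loaded into the process on Linux.
    libc = ctypes.CDLL(ctypes.util.find_library("c"))
    libc.malloc.restype = ctypes.c_void_p
    libc.malloc.argtypes = [ctypes.c_size_t]
    libc.free.restype = None
    libc.free.argtypes = [ctypes.c_void_p]

    ptr = libc.malloc(nbytes)
    if not ptr:
        # With a c_void_p restype, a NULL return shows up as None.
        return False
    libc.free(ptr)
    return True


if __name__ == "__main__":
    print("malloc(512) succeeded:", try_malloc(512))
```

If this succeeds inside the same batch environment where the MPI job fails, the allocation failure is more likely tied to how the job is launched than to the node in general.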
On Aug 12, 2013, at 3:35 PM, Gus Correa wrote: