It is quite likely that the LSF integration in the 1.6 series is broken. We no longer have a way to test it (all our LSF access is gone). I was recently given brief access to an LSF machine and fixed it for the 1.7 series, but that series doesn't support checkpoint/restart.
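For staying on 1.6, the `--mca plm rsh` workaround described in the quoted message can be applied only when the job is actually running under LSF. The following is a minimal sketch, not a tested fix: `build_mpirun_cmd` is a hypothetical helper name, `./my_app` and `-np 4` are placeholders, and it assumes the rsh/ssh launcher can reach the allocated nodes without a password prompt.

```shell
#!/bin/sh
# Hypothetical helper: print the mpirun command line to use.
# If LSB_JOBID is set (i.e. we appear to be inside an LSF job), add
# "--mca plm rsh" so mpirun launches via rsh/ssh instead of the LSF
# integration, which is the workaround reported in the quoted message.
build_mpirun_cmd() {
    if [ -n "${LSB_JOBID:-}" ]; then
        echo "mpirun --mca plm rsh $*"
    else
        echo "mpirun $*"
    fi
}

# Show the command that would be run; the application name and process
# count are placeholders, not values taken from the original report.
build_mpirun_cmd -np 4 ./my_app
```

The helper only prints the command, so it can be inspected before being run from a job script. Note that bypassing the LSF launcher this way means the processes are started over rsh/ssh rather than by LSF itself.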
On Mar 30, 2013, at 1:01 AM, Jorge Naranjo Bouzas <jona...@gmail.com> wrote:

> Hello!
>
> We are having problems integrating BLCR + OpenMPI + LSF in a linux cluster
> with Infiniband
>
> We compiled OpenMPI version 1.6 with gcc version 4.6.0 ... The configure line
> was like:
>
> ./configure --prefix=/opt/share/mpi-openmpi/1.6-gcc-4.6.0/el6/x86_64
> --with-lsf --with-openib --with-blcr=/opt/share/blcrv0.8.4.app/ --with-ft=cr
> --enable-ft-thread --enable-opal-multi-threads --with-psm
>
> The problem I am having is that for some reason the ft-enable-cr features
> freezes my mpi application when I use more that one node. The job is never
> started ...
>
> We narrowed the search down and we noticed that when mpirun is used out of
> the batch system, it works... but as soon as the mpirun detects the env
> variable LSB_JOBID and assumes it is under LSF environment, the problem
> arises... Additionally, if we use "--mca plm rsh" which should deactivate the
> LSF integration , it works again, as expected...
>
> So, or guess is: or there is something misconfigured in our LSF or there is a
> problem in the plm module inside openmpi ...
>
> Any hint???
>
> Thanks!!
>
> Jorge Naranjo
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users