It is quite likely that the LSF integration in the 1.6 series is broken. We 
don't have a way to test it any more (all our LSF access is gone). I was 
recently given brief access to an LSF machine and fixed it for the 1.7 series, 
but that series doesn't support checkpoint/restart.
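
For anyone who has to stay on 1.6 in the meantime, the workaround described in 
the quoted message below (forcing the rsh launcher whenever LSB_JOBID is set) 
could be wrapped in a small shell helper. This is only a sketch and has not 
been tested against LSF; select_plm_args, ./my_app and the process count are 
made-up names for illustration.

```shell
#!/bin/sh
# Sketch: force the rsh launcher when running inside an LSF job.
# LSB_JOBID is the variable the quoted message says mpirun uses to detect LSF;
# "--mca plm rsh" is the workaround reported there.
# select_plm_args is a hypothetical helper name, not part of Open MPI.
select_plm_args() {
    if [ -n "${LSB_JOBID:-}" ]; then
        # Inside an LSF job: bypass the LSF launcher.
        printf '%s\n' "--mca plm rsh"
    else
        # Outside LSF: print nothing and let mpirun pick its default launcher.
        :
    fi
}

# Example use (./my_app and -np 4 are placeholders):
#   mpirun $(select_plm_args) -np 4 ./my_app
```

The helper only prints the extra arguments, so the same job script can be used 
both inside and outside the batch system.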


On Mar 30, 2013, at 1:01 AM, Jorge Naranjo Bouzas <jona...@gmail.com> wrote:

> Hello!
> 
> We are having problems integrating BLCR + OpenMPI + LSF in a Linux cluster 
> with InfiniBand.
> 
> We compiled OpenMPI version 1.6 with gcc version 4.6.0 ... The configure line 
> was like:
> 
> ./configure --prefix=/opt/share/mpi-openmpi/1.6-gcc-4.6.0/el6/x86_64 
> --with-lsf --with-openib --with-blcr=/opt/share/blcrv0.8.4.app/ --with-ft=cr 
> --enable-ft-thread --enable-opal-multi-threads --with-psm
> 
> The problem I am having is that for some reason the ft-enable-cr feature 
> freezes my MPI application when I use more than one node. The job is never 
> started ...
> 
> We narrowed the search down and noticed that when mpirun is used outside 
> the batch system, it works... but as soon as mpirun detects the env 
> variable LSB_JOBID and assumes it is running under LSF, the problem 
> arises... Additionally, if we use "--mca plm rsh", which should deactivate the 
> LSF integration, it works again, as expected...
> 
> So our guess is: either something is misconfigured in our LSF, or there is a 
> problem in the plm module inside OpenMPI ...
> 
> Any hint???
> 
> Thanks!!
> 
> Jorge Naranjo
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
