A few thoughts occur:

1. 1.4.3 is awfully old - I would recommend you update to at least the 1.6 
series if you can. We don't actively support 1.4 any more, and I don't know 
what the issues might have been with PSM that long ago.

2. I see that you built LSF support for some reason, or there are stale LSF 
support libraries left over from a prior build (hence the mca_*_lsf warnings 
in your output). You might want to clean those out just to avoid any future 
problems.
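If they are indeed stale components from a prior build, something along 
these lines should clear them out (path taken from your output; adjust as 
needed):

  # remove the stale LSF components so OMPI stops trying to dlopen them
  rm -f /opt/share/mpi-openmpi/1.4.3-icc-11.1/el6/x86_64/lib/openmpi/mca_*_lsf.*

A fresh "make install" into a clean prefix accomplishes the same thing.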

3. Just looking at your output, I see something a little odd: you appear to 
load both gcc and icc modules, then load an icc build of OMPI. Any chance 
you are picking up mismatched libc's as a result?
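One quick way to check is to look at what the launcher and your test binary 
actually resolve to at run time, e.g.:

  # which compiler/MPI modules are active, and what the binaries link against
  module list
  ldd $(which mpirun)
  ldd ./hello_world    # stand-in name for your test program

If the ssh and SGE cases resolve to different libraries, that would point to 
the module mix-up.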

4. The error message seems to indicate an issue with initializing the PSM 
driver. Is it possible that you need to load a module or something to prep PSM 
- something you do in your environment that ssh would activate (say in a 
.bashrc), but sge isn't doing automatically for you?
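One way to narrow that down is to capture the environment from both paths 
and diff them - a rough sketch (host-selection syntax may differ on your SGE 
setup):

  # environment as seen from an ssh login on the compute node
  ssh c1bay2 'env | sort' > env.ssh

  # environment as seen from inside an SGE job pinned to the same node
  echo 'env | sort' | qsub -cwd -j y -o env.sge -l hostname=c1bay2

  # once the job finishes, look for PSM/InfiniPath-related differences
  diff env.ssh env.sge

Anything that only shows up in the ssh case (paths, limits, modules) is a 
candidate for what PSM is missing under SGE.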

Ralph

On Oct 28, 2013, at 6:58 AM, Luigi Cavallo <luigi.cava...@kaust.edu.sa> wrote:

> 
> Hi,
> 
> we are facing problems with Open MPI under SGE on a cluster equipped with 
> QLogic IB HCAs.  Outside of SGE, Open MPI works perfectly: we can dispatch 
> jobs as we want, with no warning/error messages at all.  If we do the same 
> under SGE, even the hello-world program crashes.  The main issue is PSM 
> related, as you can see from the error message attached at the end of this 
> email.  We worked around this by switching off PSM, using one of 2 possible 
> strategies: either adding --mca mtl ^psm to the mpirun command line, or 
> setting the env variable OMPI_MCA_pml to ob1.  Either way, jobs under SGE 
> run properly.  Any preference between the two options we found to switch 
> off PSM?
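> For concreteness, the two variants look something like this (./hello_world 
> is just a stand-in for the test program):
> 
>   # option 1: disable the PSM MTL on the mpirun command line
>   mpirun --mca mtl ^psm -np 4 ./hello_world
> 
>   # option 2: force the ob1 PML via the environment, bypassing the cm/PSM path
>   export OMPI_MCA_pml=ob1
>   mpirun -np 4 ./hello_world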
> 
> However, we would really like to understand why we get this PSM error when 
> we run under SGE, since IB performance without PSM is of course degraded.  
> We asked the SGE users list, but got nothing useful from them.  Wondering 
> if this list can help.
> 
> Thanks,
> Luigi
> 
> 
> --------- BEGINNING OF error file from sge ------------
> Loading module gcc version 4.6.0
> Initial gcc version: 4.4.6
> Current gcc version: 4.6.0
> Loading module icc version 11.1.075
> Current icc version: none
> Current icc version: 11.1
> Loading module ifort version 11.1.075
> Current ifort version: none
> Current ifort version: 11.1
> Loading module for compilers-extra
> Extra compiler modules now loaded
> Loading module mpi-openmpi version 1.4.3-icc-11.1
> Current mpi-openmpi version: 1.4.3
> [c1bay2:31113] mca: base: component_find: unable to open 
> /opt/share/mpi-openmpi/1.4.3-icc-11.1/el6/x86_64/lib/openmpi/mca_ess_lsf: 
> perhaps a missing symbol, or compiled for a different version of Open MPI? 
> (ignored)
> [c1bay2:31113] mca: base: component_find: unable to open 
> /opt/share/mpi-openmpi/1.4.3-icc-11.1/el6/x86_64/lib/openmpi/mca_plm_lsf: 
> perhaps a missing symbol, or compiled for a different version of Open MPI? 
> (ignored)
> [c1bay2:31113] mca: base: component_find: unable to open 
> /opt/share/mpi-openmpi/1.4.3-icc-11.1/el6/x86_64/lib/openmpi/mca_ras_lsf: 
> perhaps a missing symbol, or compiled for a different version of Open MPI? 
> (ignored)
> c1bay2.31114Driver initialization failure on /dev/ipath (err=23)
> c1bay2.31116Driver initialization failure on /dev/ipath (err=23)
> c1bay2.31117Driver initialization failure on /dev/ipath (err=23)
> --------------------------------------------------------------------------
> PSM was unable to open an endpoint. Please make sure that the network link is
> active on the node and the hardware is functioning.
> 
>  Error: Failure in initializing endpoint
> --------------------------------------------------------------------------
> c1bay2.31115Driver initialization failure on /dev/ipath (err=23)
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
> 
>  PML add procs failed
>  --> Returned "Error" (-1) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** The MPI_Init() function was called before MPI_INIT was invoked.
> *** This is disallowed by the MPI standard.
> *** Your MPI job will now abort.
> [c1bay2:31114] Abort before MPI_INIT completed successfully; not able to 
> guarantee that all other processes were killed!
> *** The MPI_Init() function was called before MPI_INIT was invoked.
> *** This is disallowed by the MPI standard.
> *** Your MPI job will now abort.
> 
> --------- END OF error file from sge ------------
> 
