A few thoughts occur:

1. 1.4.3 is awfully old - I would recommend you update to at least the 1.6 series if you can. We don't actively support 1.4 any more, and I don't know what the issues might have been with PSM that long ago.
2. I see that you built LSF support for some reason, or there is a stale LSF support library from a prior build. You might want to clean that out just to avoid any future problems.

3. Just looking at your output, I see something a little weird: you appear to load both gcc and icc modules, then load an icc build of OMPI. Any chance you are getting conflicting libc's as a result?

4. The error message seems to indicate an issue with initializing the PSM driver. Is it possible that you need to load a module or something to prep PSM - something you do in your environment that ssh would activate (say in a .bashrc), but that SGE isn't doing automatically for you? See the sketch below for the kind of thing I mean.
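For example, a rough sketch of what the top of the SGE job script might look like (the module name here is purely a placeholder - substitute whatever your site actually uses to set up the ipath/PSM stack):

    #!/bin/bash
    #$ -S /bin/bash
    # Pull in the same setup an interactive ssh login would get
    source ~/.bashrc
    # ...or load the module that prepares the PSM driver environment
    # ("qlogic-psm" is hypothetical - use your site's real module name)
    module load qlogic-psm
    mpirun ./hello_world

As a quick test, submitting with qsub -V (which exports your current interactive environment into the job) would also tell you whether a missing environment setting is the culprit.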
Ralph


On Oct 28, 2013, at 6:58 AM, Luigi Cavallo <luigi.cava...@kaust.edu.sa> wrote:

> Hi,
>
> we are facing problems with Open MPI under SGE on a cluster equipped with
> QLogic IB HCAs. Outside SGE, Open MPI works perfectly: we can dispatch
> jobs as we want, with no warning/error messages at all. If we do the same
> under SGE, even a hello-world program crashes. The main issue is PSM
> related, as you can see from the error message attached at the end of this
> email. We worked around it by switching off PSM, using one of two
> strategies: either adding --mca mtl ^psm to the mpirun command line, or
> setting the environment variable OMPI_MCA_pml to ob1. This way jobs run
> properly under SGE. Any preference for one of the two options we found to
> switch off PSM?
>
> However, we would really like to understand why we get this PSM error when
> we run under SGE, since IB performance without PSM is of course degraded.
> We asked the SGE users list, but nothing helpful came of it. Wondering if
> this list can help.
>
> Thanks,
> Luigi
>
>
> --------- BEGINNING OF error file from sge ------------
> Loading module gcc version 4.6.0
> Initial gcc version: 4.4.6
> Current gcc version: 4.6.0
> Loading module icc version 11.1.075
> Current icc version: none
> Current icc version: 11.1
> Loading module ifort version 11.1.075
> Current ifort version: none
> Current ifort version: 11.1
> Loading module for compilers-extra
> Extra compiler modules now loaded
> Loading module mpi-openmpi version 1.4.3-icc-11.1
> Current mpi-openmpi version: 1.4.3
> [c1bay2:31113] mca: base: component_find: unable to open
> /opt/share/mpi-openmpi/1.4.3-icc-11.1/el6/x86_64/lib/openmpi/mca_ess_lsf:
> perhaps a missing symbol, or compiled for a different version of Open MPI?
> (ignored)
> [c1bay2:31113] mca: base: component_find: unable to open
> /opt/share/mpi-openmpi/1.4.3-icc-11.1/el6/x86_64/lib/openmpi/mca_plm_lsf:
> perhaps a missing symbol, or compiled for a different version of Open MPI?
> (ignored)
> [c1bay2:31113] mca: base: component_find: unable to open
> /opt/share/mpi-openmpi/1.4.3-icc-11.1/el6/x86_64/lib/openmpi/mca_ras_lsf:
> perhaps a missing symbol, or compiled for a different version of Open MPI?
> (ignored)
> c1bay2.31114Driver initialization failure on /dev/ipath (err=23)
> c1bay2.31116Driver initialization failure on /dev/ipath (err=23)
> c1bay2.31117Driver initialization failure on /dev/ipath (err=23)
> --------------------------------------------------------------------------
> PSM was unable to open an endpoint. Please make sure that the network link is
> active on the node and the hardware is functioning.
>
> Error: Failure in initializing endpoint
> --------------------------------------------------------------------------
> c1bay2.31115Driver initialization failure on /dev/ipath (err=23)
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
> PML add procs failed
> --> Returned "Error" (-1) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** The MPI_Init() function was called before MPI_INIT was invoked.
> *** This is disallowed by the MPI standard.
> *** Your MPI job will now abort.
> [c1bay2:31114] Abort before MPI_INIT completed successfully; not able to
> guarantee that all other processes were killed!
> *** The MPI_Init() function was called before MPI_INIT was invoked.
> *** This is disallowed by the MPI standard.
> *** Your MPI job will now abort.
>
> --------- END OF error file from sge ------------
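P.S. For the archives, the two workarounds described above would look something like this in practice (the -np argument and program name are illustrative only):

    # Option 1: exclude the PSM MTL on the mpirun command line
    mpirun --mca mtl ^psm -np 4 ./hello_world

    # Option 2: force the ob1 PML via the environment
    export OMPI_MCA_pml=ob1
    mpirun -np 4 ./hello_world

The net effect should be the same either way: the PSM path is avoided and ob1 carries the traffic over the regular BTLs instead, which is also why the IB performance drops.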