Hi,

On 28.10.2013 at 14:58, Luigi Cavallo wrote:

> we are facing problems with openmpi under sge on a cluster equipped with
> QLogic IB HCAs. Working off sge, openmpi works perfectly, we can dispatch
> the job as we want, no warning/error messages at all. If we do the same
> under sge, even the hello-world program crashes. The main issue is PSM
> related, as you can see from the error message attached at the end of this
> email. We solved this issue by switching off PSM, basically using 2
> possible strategies. Either adding --mca mtl ^psm at the mpirun command, or
> setting the env variable OMPI_MCA_pml ob1. This way jobs under SGE runs
> properly. Any preference for one or the two options we found to switch off
> PSM ?

So, Open MPI was built "--with-sge"?

There are options in the "execd_params" setting to raise the limits on locked memory and locks: S_MEMORYLOCKED, H_MEMORYLOCKED, S_LOCKS, H_LOCKS (`man sge_conf`). Raising them is often necessary for IB.

> However, we would really like to understand why we have this PSM error when
> we run under SGE, since the IB performance without PSM is of course
> deteriorated. We asked SGE users list, but nothing smart from them.

Which list do you refer to - the one at http://gridengine.org?

> <snip>
> [c1bay2:31113] mca: base: component_find: unable to open
> /opt/share/mpi-openmpi/1.4.3-icc-11.1/el6/x86_64/lib/openmpi/mca_ras_lsf:
> perhaps a missing symbol, or compiled for a different version of Open MPI?
> (ignored)

Is the same version of Open MPI installed on all machines, and is it the first one found in $LD_LIBRARY_PATH and $PATH?

-- Reuti

> c1bay2.31114Driver initialization failure on /dev/ipath (err=23)
> c1bay2.31116Driver initialization failure on /dev/ipath (err=23)
> c1bay2.31117Driver initialization failure on /dev/ipath (err=23)
> --------------------------------------------------------------------------
> PSM was unable to open an endpoint. Please make sure that the network link is
> active on the node and the hardware is functioning.
>
> Error: Failure in initializing endpoint
> --------------------------------------------------------------------------
> c1bay2.31115Driver initialization failure on /dev/ipath (err=23)
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
> PML add procs failed
> --> Returned "Error" (-1) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** The MPI_Init() function was called before MPI_INIT was invoked.
> *** This is disallowed by the MPI standard.
> *** Your MPI job will now abort.
> [c1bay2:31114] Abort before MPI_INIT completed successfully; not able to
> guarantee that all other processes were killed!
> *** The MPI_Init() function was called before MPI_INIT was invoked.
> *** This is disallowed by the MPI standard.
> *** Your MPI job will now abort.
>
> --------- END OF error file from sge ------------
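
The questions above (SGE support in the build, locked-memory limit, which Open MPI comes first in the paths) could be checked with a small script along the following lines. This is only a sketch: the function name `psm_env_report` is made up for illustration, and it assumes that `ompi_info` and `qconf` are in the PATH where they are installed (each call is skipped otherwise).

```shell
#!/bin/bash
# Hypothetical diagnostic helper (name and layout are illustrative only).
# Run it once in an interactive shell and once inside an SGE job, then compare.
psm_env_report() {
    echo "host: $(uname -n)"
    # Locked-memory limit of the current shell; a small value inside the job
    # would point towards the H_MEMORYLOCKED/S_MEMORYLOCKED settings.
    echo "max locked memory (ulimit -l): $(ulimit -l 2>/dev/null || echo unknown)"
    # Which mpirun is found first, and the library search path in effect.
    echo "mpirun in PATH: $(command -v mpirun || echo 'not found')"
    echo "LD_LIBRARY_PATH: ${LD_LIBRARY_PATH:-<unset>}"
    # An Open MPI built --with-sge should list a gridengine component.
    if command -v ompi_info >/dev/null 2>&1; then
        ompi_info | grep -i gridengine || echo "ompi_info: no gridengine component listed"
    else
        echo "ompi_info: not found"
    fi
    # Current execd_params of the global SGE configuration, if qconf exists.
    if command -v qconf >/dev/null 2>&1; then
        qconf -sconf | grep execd_params || echo "qconf: no execd_params line"
    else
        echo "qconf: not found"
    fi
}

psm_env_report
```

Submitting the same file with `qsub` and comparing the two outputs should show whether the job environment differs from the interactive one. If the locked-memory value is much lower under SGE, that would be consistent with the limits mentioned above, though it does not prove they cause the PSM error.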