Hi,

On 28.10.2013 at 14:58, Luigi Cavallo wrote:

> We are facing problems with Open MPI under SGE on a cluster equipped with
> QLogic IB HCAs.  Outside SGE, Open MPI works perfectly: we can dispatch
> jobs as we want, with no warning or error messages at all.  If we do the
> same under SGE, even the hello-world program crashes.  The main issue is
> PSM related, as you can see from the error message attached at the end of
> this email.  We worked around it by switching off PSM, using one of two
> strategies: either adding "--mca mtl ^psm" to the mpirun command line, or
> setting the environment variable OMPI_MCA_pml to ob1.  This way, jobs under
> SGE run properly.  Is there any preference between the two options we found
> to switch off PSM?
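
For reference, the two workarounds from the quoted message could be written roughly as below. This is only a sketch: ./hello_world, the fallback slot count of 4, and the run_or_show helper are placeholders introduced here, not part of the original report. The helper just prints the command when mpirun or the test binary is not available, so the snippet can be tried anywhere.

```shell
#!/bin/sh
# Sketch: two ways to keep Open MPI from selecting PSM.
# ./hello_world, the default of 4 slots and run_or_show are placeholders.

NP="${NSLOTS:-4}"   # SGE sets NSLOTS inside a parallel job

# Run the command if mpirun and the test binary exist, otherwise just print it.
run_or_show() {
    if command -v mpirun >/dev/null 2>&1 && [ -x ./hello_world ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

# Option 1: exclude the PSM MTL with an MCA parameter on the command line.
run_or_show mpirun -np "$NP" --mca mtl ^psm ./hello_world

# Option 2: force the ob1 PML through the environment. ob1 drives BTLs
# rather than MTLs, so PSM is not selected either.
export OMPI_MCA_pml=ob1
run_or_show mpirun -np "$NP" ./hello_world
unset OMPI_MCA_pml
```

The command-line form affects a single mpirun invocation, while an exported variable (OMPI_MCA_pml=ob1, or equally OMPI_MCA_mtl=^psm) applies to every Open MPI job started from that shell, which may be more convenient inside an SGE job script.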

So, Open MPI was built "--with-sge"? The "execd_params" setting in the SGE
configuration has options to raise the locked-memory and lock limits for
jobs: S_MEMORYLOCKED, H_MEMORYLOCKED, S_LOCKS, H_LOCKS (`man sge_conf`).
Raising them is often necessary for IB.
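
One way to check whether this applies is to compare the locked-memory limit in an interactive shell with the limit inside an SGE job. The following is a minimal sketch; it assumes a shell whose `ulimit` supports `-l` (bash does) and prints "unknown" otherwise.

```shell
#!/bin/sh
# Sketch: print the locked-memory limits visible to this process.
# Run it once interactively and once as an SGE job, then compare.

soft=$(ulimit -S -l 2>/dev/null); soft=${soft:-unknown}
hard=$(ulimit -H -l 2>/dev/null); hard=${hard:-unknown}

# JOB_ID is set by SGE inside a job; "none" means we are outside SGE.
echo "host=$(uname -n) job=${JOB_ID:-none} memlock_soft=$soft memlock_hard=$hard"
```

If the values inside the job are much smaller than in the interactive shell, the limit could be raised by adding something like H_MEMORYLOCKED=infinity to execd_params with `qconf -mconf`. The exact value syntax is an assumption here and should be checked against `man sge_conf`.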


> However, we would really like to understand why we get this PSM error when
> we run under SGE, since IB performance without PSM is of course degraded.
> We asked on the SGE users list, but got no useful answer.


Which list are you referring to: the one at http://gridengine.org?


> <snip>
> [c1bay2:31113] mca: base: component_find: unable to open 
> /opt/share/mpi-openmpi/1.4.3-icc-11.1/el6/x86_64/lib/openmpi/mca_ras_lsf: 
> perhaps a missing symbol, or compiled for a different version of Open MPI? 
> (ignored)

Is the same version of Open MPI installed on all machines, and is it the
first one found in $LD_LIBRARY_PATH and $PATH respectively?
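
A quick way to answer this is to print what each node actually resolves, both interactively and from inside an SGE job. The script below is a sketch under assumptions: first_libmpi is a helper made up for this example, and it only looks for files named libmpi.so* in the directories listed in $LD_LIBRARY_PATH.

```shell
#!/bin/sh
# Sketch: show which Open MPI installation this environment resolves to.
# Run on every node (and inside an SGE job) and compare the output.

# Print the first directory of a colon-separated list that holds libmpi.so*.
first_libmpi() {
    old_ifs=$IFS
    IFS=:
    for dir in $1; do
        [ -n "$dir" ] || continue
        for f in "$dir"/libmpi.so*; do
            if [ -e "$f" ]; then
                IFS=$old_ifs
                echo "$dir"
                return 0
            fi
        done
    done
    IFS=$old_ifs
    return 1
}

echo "host:   $(uname -n)"
echo "mpirun: $(command -v mpirun || echo 'not found in PATH')"
if command -v mpirun >/dev/null 2>&1; then
    # Print only the first line of the version banner.
    mpirun --version 2>&1 | head -n 1
fi
echo "libmpi: $(first_libmpi "${LD_LIBRARY_PATH:-}" || echo 'not found in LD_LIBRARY_PATH')"
```

If the paths or versions differ between nodes, or between the interactive shell and the job, a mixed installation would be one possible explanation for the "unable to open ... mca_ras_lsf" warning quoted above.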

-- Reuti


> c1bay2.31114Driver initialization failure on /dev/ipath (err=23)
> c1bay2.31116Driver initialization failure on /dev/ipath (err=23)
> c1bay2.31117Driver initialization failure on /dev/ipath (err=23)
> --------------------------------------------------------------------------
> PSM was unable to open an endpoint. Please make sure that the network link is
> active on the node and the hardware is functioning.
> 
>  Error: Failure in initializing endpoint
> --------------------------------------------------------------------------
> c1bay2.31115Driver initialization failure on /dev/ipath (err=23)
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
> 
>  PML add procs failed
>  --> Returned "Error" (-1) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** The MPI_Init() function was called before MPI_INIT was invoked.
> *** This is disallowed by the MPI standard.
> *** Your MPI job will now abort.
> [c1bay2:31114] Abort before MPI_INIT completed successfully; not able to 
> guarantee that all other processes were killed!
> *** The MPI_Init() function was called before MPI_INIT was invoked.
> *** This is disallowed by the MPI standard.
> *** Your MPI job will now abort.
> 
> --------- END OF error file from sge ------------
