I'm running into a problem where something called "PSM" fails to start,
which in turn prevents my job from running. The command and output are
below. I'd like to understand what's going on. Apparently this build of
Open MPI includes PSM support, but if PSM isn't available, why fail when
other transports are? Also, in my command I think I've told Open MPI not
to use anything but self and sm, so why would it try to use PSM at all?

Thanks in advance for any help...
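
In case it's relevant: my (possibly mistaken) understanding is that PSM
is pulled in through the MTL layer (via the "cm" PML) rather than through
a BTL, which would explain why restricting --mca btl doesn't keep it out.
This is the command I'd use to list the PML and MTL components in this
build; ompi_info and grep are standard, but the interpretation is my own
guess:

/usr/mpi/intel/openmpi-1.4.3/bin/ompi_info | grep -E ' MCA (pml|mtl):'

Here is what grepping for psm shows on my machine: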

user@machinename:~> /usr/mpi/intel/openmpi-1.4.3/bin/ompi_info -all | grep psm
                 MCA mtl: psm (MCA v2.0, API v2.0, Component v1.4.3)
                 MCA mtl: parameter "mtl_psm_connect_timeout" (current value: "180", data source: default value)
                 MCA mtl: parameter "mtl_psm_debug" (current value: "1", data source: default value)
                 MCA mtl: parameter "mtl_psm_ib_unit" (current value: "-1", data source: default value)
                 MCA mtl: parameter "mtl_psm_ib_port" (current value: "0", data source: default value)
                 MCA mtl: parameter "mtl_psm_ib_service_level" (current value: "0", data source: default value)
                 MCA mtl: parameter "mtl_psm_ib_pkey" (current value: "32767", data source: default value)
                 MCA mtl: parameter "mtl_psm_priority" (current value: "0", data source: default value)

Here is my command:

/usr/mpi/intel/openmpi-1.4.3/bin/mpirun -n 1 --mca btl_base_verbose 30 --mca btl self,sm /release/cfd/simgrid/P_OPT.LINUX64

and here is the output:

[machinename:01124] mca: base: components_open: Looking for btl components
[machinename:01124] mca: base: components_open: opening btl components
[machinename:01124] mca: base: components_open: found loaded component self
[machinename:01124] mca: base: components_open: component self has no register function
[machinename:01124] mca: base: components_open: component self open function successful
[machinename:01124] mca: base: components_open: found loaded component sm
[machinename:01124] mca: base: components_open: component sm has no register function
[machinename:01124] mca: base: components_open: component sm open function successful
machinename.1124ipath_userinit: assign_context command failed: Network is down
machinename.1124can't open /dev/ipath, network down (err=26)
--------------------------------------------------------------------------
PSM was unable to open an endpoint. Please make sure that the network link is
active on the node and the hardware is functioning.

  Error: Could not detect network connectivity
--------------------------------------------------------------------------
[machinename:01124] mca: base: close: component self closed
[machinename:01124] mca: base: close: unloading component self
[machinename:01124] mca: base: close: component sm closed
[machinename:01124] mca: base: close: unloading component sm
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** The MPI_Init() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
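
For what it's worth, here's what I was planning to try next, on the
assumption (mine, not from the docs) that PSM can be kept out either by
excluding the psm MTL or by forcing the ob1 PML, which I believe is the
one that drives the BTLs:

# exclude just the psm MTL
/usr/mpi/intel/openmpi-1.4.3/bin/mpirun -n 1 --mca mtl ^psm --mca btl self,sm /release/cfd/simgrid/P_OPT.LINUX64

# or force the ob1 PML so the MTL layer is never selected
/usr/mpi/intel/openmpi-1.4.3/bin/mpirun -n 1 --mca pml ob1 --mca btl self,sm /release/cfd/simgrid/P_OPT.LINUX64

If either of those is the wrong way to go about it, I'd appreciate a
pointer to the right one.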
