[OMPI users] orte_grpcomm_modex failed

2011-10-21 Thread devendra rai
Hello Community,

I have been struggling with this error for quite some time:

It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  orte_grpcomm_modex failed
  --> Returned "Data unpack would read past end of buffer" (-26) instead of 
"Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 18945 on
node tik35x.ethz.ch exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).

I am running this on a cluster, and it started happening only after a recent
rebuild of openmpi-1.4.3. Interestingly, I have the same version of Open MPI
on my PC, and the same application works fine there.

I have looked into this error on the web, but there is very little discussion
of its causes or how to correct it. I have asked the admin to attempt a
reinstall of Open MPI, but I am not sure whether this will solve the problem.
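
For reference, a minimal test along these lines (only MPI_Init and
MPI_Finalize, nothing application-specific; treat it as a sketch, and the
file name is just illustrative) should be enough to show whether the failure
is in the MPI setup itself rather than in the application:

  /* init_only.c: minimal MPI program -- initialize, report, finalize. */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank;

      MPI_Init(&argc, &argv);               /* the call that currently aborts */
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      printf("rank %d: MPI_Init succeeded\n", rank);
      MPI_Finalize();
      return 0;
  }

(Compiled with mpicc and launched with the same mpirun command line as the
real application, e.g. "mpicc init_only.c -o init_only" followed by
"mpirun -np 2 ./init_only".)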

Can someone please help?

Thanks a lot.

Best,

Devendra Rai


Re: [OMPI users] orte_grpcomm_modex failed

2011-10-21 Thread Jeff Squyres
This usually means that you have an Open MPI version mismatch between some of 
your nodes.  Meaning: on some nodes, you're finding version X.Y.Z of Open MPI 
by default, but on other nodes, you're finding version A.B.C.
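
The quickest check is to run ompi_info on each node (and "which mpirun") and
compare the reported versions and paths.  Once a job launches at all, a small
program along these lines can also confirm that every rank picked up the same
Open MPI; this is only a sketch, and the OMPI_*_VERSION macros are Open
MPI-specific compile-time values:

  /* version_check.c: each rank reports its host and the Open MPI version
   * it was built against, to help spot per-node mismatches. */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      char host[MPI_MAX_PROCESSOR_NAME];
      int rank, len;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Get_processor_name(host, &len);
      printf("rank %d on %s: built against Open MPI %d.%d.%d\n",
             rank, host, OMPI_MAJOR_VERSION, OMPI_MINOR_VERSION,
             OMPI_RELEASE_VERSION);
      MPI_Finalize();
      return 0;
  }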




-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI users] Technical details of various MPI APIs

2011-10-21 Thread ramu
Hi,
I am trying to explore the technical details of the MPI APIs implemented in
Open MPI (e.g., MPI_Init(), MPI_Barrier(), MPI_Send(), MPI_Recv(),
MPI_Waitall(), MPI_Finalize()) when the MPI processes are running on an
InfiniBand (OFED) cluster.  Specifically: what messages are exchanged between
MPI processes over IB, how do the processes identify each other, what do they
exchange in order to do so, and what is needed to trigger data traffic?  Is
there any documentation or link available that describes these details?
Please suggest where I should look.

Thanks & Regards,
Ramu