Hi,

> On 29.09.2016 at 14:41, <aditi...@wipro.com> wrote:
> 
> Hi,
>  
> I am trying to run a Job on parallel nodes using openmpi1.4.5

Open MPI 1.6.5 would be a better choice. Nevertheless, the questions are:

- Was Open MPI compiled with SGE integration, i.e. does `ompi_info` show something like:

$ ompi_info  | grep grid
                 MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.6.5)


- Did you request a PE in the submission and how is this PE set up?

- What does the `mpiexec` line in your jobscript look like?

- Can all nodes talk to each other directly?
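
The first two questions can be answered with two small helpers. This is only a sketch: the function names are made up for illustration, and the second one assumes that the output of `qconf -sp <pe_name>` contains a `control_slaves` line, which Open MPI's tight integration needs set to TRUE.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the ompi_info output read from stdin
# lists the gridengine component of the "ras" framework.
has_gridengine() {
    grep -Eq 'MCA ras:[[:space:]]*gridengine'
}

# Hypothetical helper: succeeds if the PE definition read from stdin
# (as printed by "qconf -sp <pe_name>") has control_slaves set to TRUE.
pe_controls_slaves() {
    awk '$1 == "control_slaves" && $2 == "TRUE" { found = 1 }
         END { exit !found }'
}
```

Usage would be `ompi_info | has_gridengine && echo "SGE support built in"` and `qconf -sp <pe_name> | pe_controls_slaves || echo "check control_slaves"`, with `<pe_name>` being the PE you request in `qsub -pe`.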


> and ge2011.11,  job goes in Running state and then gets aborted.
> After Job gets aborted, I get following error message on the primary node:
>  
> error: executing task of job 28561 failed: failed sending task to 
> ex...@punehpcdl01.wiprohpc.com: can't find connection
>  
> Or 
>  
> error: executing task of job 28560 failed: failed sending task to 
> ex...@punehpcdl01.wiprohpc.com: can't find connection
> --------------------------------------------------------------------------
> A daemon (pid 20651) died unexpectedly with status 1 while attempting
> to launch so we are aborting.
>  
> There may be more information reported by the environment (see above).
>  
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the

- Regarding this error: you may have to set LD_LIBRARY_PATH explicitly to the 
path of the shared libraries and export it to the nodes from within your 
jobscript:

export LD_LIBRARY_PATH=<your_location_of_the_shared_libs>
mpiexec -x LD_LIBRARY_PATH
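
A slightly more defensive variant of the export above, as a sketch only. The function name is made up, and the path used in the usage note is a placeholder for wherever your Open MPI libraries are installed. It avoids a trailing colon when LD_LIBRARY_PATH was empty before, since an empty entry makes the dynamic linker search the current directory.

```shell
#!/bin/sh
# Hypothetical helper: prepend a directory to LD_LIBRARY_PATH and export it.
# "${VAR:+:$VAR}" expands to ":$VAR" only if VAR is set and non-empty,
# so no empty path entry is created.
prepend_lib_path() {
    LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    export LD_LIBRARY_PATH
}
```

In the jobscript this would be called as `prepend_lib_path /opt/openmpi/lib` (placeholder path) before the `mpiexec -x LD_LIBRARY_PATH` line.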

BTW: Is Open MPI also available on all nodes?
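
One way to check this from inside a running parallel job, as a rough sketch: SGE provides the granted hosts in the file named by $PE_HOSTFILE, one line per host with the host name in the first column. The function names are made up, and the second function assumes that `qrsh -inherit` works in your PE and that `orted` (the Open MPI daemon) should be found in the PATH on each node.

```shell
#!/bin/sh
# Hypothetical helper: print the unique host names from a PE hostfile.
# Only the first column of each line (the host name) is used.
pe_hosts() {
    awk '{ print $1 }' "$1" | sort -u
}

# Hypothetical helper, to be called inside a job only: report every granted
# host on which "orted" cannot be found in the PATH of the remote shell.
check_orted_on_hosts() {
    for h in $(pe_hosts "$PE_HOSTFILE"); do
        qrsh -inherit "$h" sh -c 'command -v orted' >/dev/null 2>&1 \
            || echo "orted not found on $h"
    done
}
```

Calling `check_orted_on_hosts` in the jobscript before `mpiexec` would print the hosts where the check fails. A failure may also just mean that `qrsh -inherit` itself could not reach the node.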

-- Reuti


> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> mpirun: clean termination accomplished
>  
> Is this an MPI issue? Please suggest how I can resolve this connection issue 
> between nodes.
>  
> Thanks & Regards,
> Aditi
>  
> _______________________________________________
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users

