The RLIMIT_MEMLOCK error is very common when using Open MPI + OFED + Sun Grid Engine. You can find more information and several remedies here:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
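To confirm what locked-memory limit a process actually sees, a small check like the following may help. It is only a sketch: it assumes Python is available on the compute nodes, and the helper name describe_memlock is made up for this illustration. Running it inside a job submitted through SGE, rather than from an interactive login, should show the limit that the job's processes inherit, which is the value libibverbs complains about (32768 bytes in the quoted output).

```python
import resource


def describe_memlock(soft, hard):
    """Format a (soft, hard) RLIMIT_MEMLOCK pair for printing.

    Hypothetical helper for this example; RLIM_INFINITY means "unlimited".
    """
    def fmt(value):
        return "unlimited" if value == resource.RLIM_INFINITY else f"{value} bytes"

    return f"RLIMIT_MEMLOCK soft={fmt(soft)} hard={fmt(hard)}"


if __name__ == "__main__":
    # Query the limits of the current process (Linux-specific resource).
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    print(describe_memlock(soft, hard))
```

If the value printed from inside a job differs from what an interactive shell on the same node reports, the limit is most likely being set by the environment in which the SGE daemon was started.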
I usually resolve this problem by adding "ulimit -l unlimited" near the top of the SGE startup script on the computation nodes and restarting SGE on every node.

Jeremy Stout

On Sat, Jan 24, 2009 at 6:06 AM, Sangamesh B <forum....@gmail.com> wrote:
> Hello all,
>
> Open MPI 1.3 is installed on Rocks 4.3 Linux cluster with support of
> SGE i.e using --with-sge.
> But the ompi_info shows only one component:
> # /opt/mpi/openmpi/1.3/intel/bin/ompi_info | grep gridengine
> MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3)
>
> Is this right? Because during ompi installation SGE qmaster daemon was
> not working.
>
> Now the problem is, the open mpi parallel jobs submitted thru
> gridengine are failing (when run on multiple nodes) with the error:
>
> $ cat err.26.Helloworld-PRL
> ssh_exchange_identification: Connection closed by remote host
> --------------------------------------------------------------------------
> A daemon (pid 8462) died unexpectedly with status 129 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> mpirun: clean termination accomplished
>
> When the job runs on single node, it runs well with producing the
> output but with an error:
> $ cat err.23.Helloworld-PRL
> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
> This will severely limit memory registrations.
> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
> This will severely limit memory registrations.
> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
> This will severely limit memory registrations.
> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
> This will severely limit memory registrations.
> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
> This will severely limit memory registrations.
> --------------------------------------------------------------------------
> WARNING: There was an error initializing an OpenFabrics device.
>
> Local host: node-0-4.local
> Local device: mthca0
> --------------------------------------------------------------------------
> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
> This will severely limit memory registrations.
> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
> This will severely limit memory registrations.
> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
> This will severely limit memory registrations.
> [node-0-4.local:07869] 7 more processes have sent help message
> help-mpi-btl-openib.txt / error in device init
> [node-0-4.local:07869] Set MCA parameter "orte_base_help_aggregate" to
> 0 to see all help / error messages
>
> What may be the problem for this behavior?
>
> Thanks,
> Sangamesh
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users