On 25.01.2009 at 06:16, Sangamesh B wrote:
Thanks Reuti for the reply.
On Sun, Jan 25, 2009 at 2:22 AM, Reuti <re...@staff.uni-marburg.de> wrote:
On 24.01.2009 at 17:12, Jeremy Stout wrote:
The RLIMIT error is very common when using OpenMPI + OFED + Sun Grid
Engine. You can find more information and several remedies here:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
I usually resolve this problem by adding "ulimit -l unlimited" near
the top of the SGE startup script on the computation nodes and
restarting SGE on every node.
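For example, on a typical installation that amounts to something like the
following (a sketch only; the exact name and location of the SGE execd
startup script, e.g. /etc/init.d/sgeexecd, depend on how SGE was installed):

  # near the top of the SGE execd startup script on each compute node,
  # raise the locked-memory limit before sge_execd is started:
  ulimit -l unlimited

  # then restart the execution daemon on every node, for instance:
  /etc/init.d/sgeexecd stop
  /etc/init.d/sgeexecd start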
Did you request/set any limits with SGE's h_vmem/h_stack resource
request?
Was this also your problem:
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=99442
-- Reuti
No.
The used queue is as follows:
# qconf -sq ib.q
qname ib.q
hostlist @ibhosts
seq_no 0
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:05:00
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list orte
rerun FALSE
slots 8
tmpdir /tmp
shell /bin/bash
prolog NONE
epilog NONE
shell_start_mode unix_behavior
starter_method NONE
suspend_method NONE
resume_method NONE
terminate_method NONE
notify 00:00:60
owner_list NONE
user_lists NONE
xuser_lists NONE
subordinate_list NONE
complex_values NONE
projects NONE
xprojects NONE
calendar NONE
initial_state default
s_rt INFINITY
h_rt INFINITY
s_cpu INFINITY
h_cpu INFINITY
s_fsize INFINITY
h_fsize INFINITY
s_data INFINITY
h_data INFINITY
s_stack INFINITY
h_stack INFINITY
s_core INFINITY
h_core INFINITY
s_rss INFINITY
h_rss INFINITY
s_vmem INFINITY
h_vmem INFINITY
# qconf -sp orte
pe_name orte
slots 999
user_lists NONE
xuser_lists NONE
start_proc_args /bin/true
stop_proc_args /bin/true
allocation_rule $fill_up
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
# qconf -shgrp @ibhosts
group_name @ibhosts
hostlist node-0-0.local node-0-1.local node-0-2.local node-0-3.local \
         node-0-4.local node-0-5.local node-0-6.local node-0-7.local \
         node-0-8.local node-0-9.local node-0-10.local node-0-11.local \
         node-0-12.local node-0-13.local node-0-14.local node-0-16.local \
         node-0-17.local node-0-18.local node-0-19.local node-0-20.local \
         node-0-21.local node-0-22.local
The hostnames of the IB interfaces are like ibc0, ibc1, ..., ibc22.
Is this difference causing the problem?
ssh behavior:
between master & node: works fine, but with some delay.
between nodes: works fine, no delay.
From the command line the Open MPI jobs run with no error, even when the
master node is not used in the hostfile.
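For comparison, a minimal SGE job script for such a run might look like the
sketch below (the mpirun path follows the install prefix quoted later in
the thread; the executable name ./helloworld and the option choices are
assumptions):

  #!/bin/bash
  #$ -N Helloworld-PRL
  #$ -q ib.q
  #$ -pe orte 8
  #$ -cwd
  # with a tight SGE integration, mpirun takes the slot count and the
  # host list from the PE environment, so no -np or -hostfile is given
  /opt/mpi/openmpi/1.3/intel/bin/mpirun ./helloworld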
Thanks,
Sangamesh
-- Reuti
Jeremy Stout
On Sat, Jan 24, 2009 at 6:06 AM, Sangamesh B <forum....@gmail.com> wrote:
Hello all,
Open MPI 1.3 is installed on a Rocks 4.3 Linux cluster with SGE support,
i.e. configured using --with-sge.
But ompi_info shows only one gridengine component:
# /opt/mpi/openmpi/1.3/intel/bin/ompi_info | grep gridengine
MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3)
Is this right? During the Open MPI installation, the SGE qmaster daemon
was not running.
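For reference, the build and the check would look roughly like this (a
sketch assuming the install prefix above; the Intel compiler settings are
an assumption based on that path):

  ./configure --prefix=/opt/mpi/openmpi/1.3/intel --with-sge \
              CC=icc CXX=icpc F77=ifort FC=ifort
  make all install

  # afterwards, verify that gridengine support was built in:
  /opt/mpi/openmpi/1.3/intel/bin/ompi_info | grep gridengine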
Now the problem is that the Open MPI parallel jobs submitted through
gridengine fail (when run on multiple nodes) with the error:
$ cat err.26.Helloworld-PRL
ssh_exchange_identification: Connection closed by remote host
--------------------------------------------------------------------------
A daemon (pid 8462) died unexpectedly with status 129 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have
the location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
mpirun: clean termination accomplished
When the job runs on a single node, it runs and produces the output, but
with an error:
$ cat err.23.Helloworld-PRL
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
This will severely limit memory registrations.
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
This will severely limit memory registrations.
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
This will severely limit memory registrations.
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
This will severely limit memory registrations.
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
This will severely limit memory registrations.
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   node-0-4.local
  Local device: mthca0
--------------------------------------------------------------------------
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
This will severely limit memory registrations.
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
This will severely limit memory registrations.
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
This will severely limit memory registrations.
[node-0-4.local:07869] 7 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[node-0-4.local:07869] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
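As a diagnostic (a hedged suggestion, not part of the original report), it
can help to compare the memlock limit that SGE-started processes actually
get with the one seen over a plain ssh login:

  # limit inside an SGE job on one of the IB nodes:
  echo 'ulimit -l' | qsub -q ib.q -l hostname=node-0-4.local -j y -o memlock.out
  # ... after the job has run, inspect memlock.out

  # limit over a plain ssh login, for comparison:
  ssh node-0-4.local ulimit -l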
What could be the cause of this behavior?
Thanks,
Sangamesh
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users