Rahul,
Per the logs, it seems the /sys pseudo-filesystem is not mounted in your
chroot.
As a first step, can you make sure it is mounted and try again?
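For example (run on JARVICENAE27, outside the chroot; /path/to/chroot is a
placeholder for wherever your chroot actually lives):

mount --bind /sys /path/to/chroot/sys

and, if they are not already in place, the same idea for /proc and /dev:

mount -t proc proc /path/to/chroot/proc
mount --bind /dev /path/to/chroot/dev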
Cheers,
Gilles
On 5/26/2015 12:51 PM, Rahul Yadav wrote:
We were able to solve the ssh problem.
But now MPI is not able to use the yalla component. We are running the
following command:
mpirun --allow-run-as-root --mca pml yalla -n 1 --hostfile /root/host1
/root/app2 : -n 1 --hostfile /root/host2 /root/backend
The command is run in a chroot environment on JARVICENAE27, and the other
node is JARVICENAE125. JARVICENAE125 is able to select yalla since it is a
remote node and thus is not running the job in the chroot environment. But
JARVICENAE27 is throwing a few MXM-related errors and yalla is not selected.
Following are the verbose logs of the command.
Any idea what might be wrong?
[1432283901.548917] sys.c:719 MXM WARN Conflicting CPU
frequencies detected, using: 2601.00
[JARVICENAE125:00909] mca: base: components_register: registering pml
components
[JARVICENAE125:00909] mca: base: components_register: found loaded
component v
[JARVICENAE125:00909] mca: base: components_register: component v
register function successful
[JARVICENAE125:00909] mca: base: components_register: found loaded
component bfo
[JARVICENAE125:00909] mca: base: components_register: component bfo
register function successful
[JARVICENAE125:00909] mca: base: components_register: found loaded
component cm
[JARVICENAE125:00909] mca: base: components_register: component cm
register function successful
[JARVICENAE125:00909] mca: base: components_register: found loaded
component ob1
[JARVICENAE125:00909] mca: base: components_register: component ob1
register function successful
[JARVICENAE125:00909] mca: base: components_register: found loaded
component yalla
[JARVICENAE125:00909] mca: base: components_register: component yalla
register function successful
[JARVICENAE125:00909] mca: base: components_open: opening pml components
[JARVICENAE125:00909] mca: base: components_open: found loaded component v
[JARVICENAE125:00909] mca: base: components_open: component v open
function successful
[JARVICENAE125:00909] mca: base: components_open: found loaded
component bfo
[JARVICENAE125:00909] mca: base: components_open: component bfo open
function successful
[JARVICENAE125:00909] mca: base: components_open: found loaded
component cm
[JARVICENAE125:00909] mca: base: components_open: component cm open
function successful
[JARVICENAE125:00909] mca: base: components_open: found loaded
component ob1
[JARVICENAE125:00909] mca: base: components_open: component ob1 open
function successful
[JARVICENAE125:00909] mca: base: components_open: found loaded
component yalla
[JARVICENAE125:00909] mca: base: components_open: component yalla open
function successful
[JARVICENAE125:00909] select: component v not in the include list
[JARVICENAE125:00909] select: component bfo not in the include list
[JARVICENAE125:00909] select: initializing pml component cm
[JARVICENAE27:06474] mca: base: components_register: registering pml
components
[JARVICENAE27:06474] mca: base: components_register: found loaded
component v
[JARVICENAE27:06474] mca: base: components_register: component v
register function successful
[JARVICENAE27:06474] mca: base: components_register: found loaded
component bfo
[JARVICENAE27:06474] mca: base: components_register: component bfo
register function successful
[JARVICENAE27:06474] mca: base: components_register: found loaded
component cm
[JARVICENAE27:06474] mca: base: components_register: component cm
register function successful
[JARVICENAE27:06474] mca: base: components_register: found loaded
component ob1
[JARVICENAE27:06474] mca: base: components_register: component ob1
register function successful
[JARVICENAE27:06474] mca: base: components_register: found loaded
component yalla
[JARVICENAE27:06474] mca: base: components_register: component yalla
register function successful
[JARVICENAE27:06474] mca: base: components_open: opening pml components
[JARVICENAE27:06474] mca: base: components_open: found loaded component v
[JARVICENAE27:06474] mca: base: components_open: component v open
function successful
[JARVICENAE27:06474] mca: base: components_open: found loaded
component bfo
[JARVICENAE27:06474] mca: base: components_open: component bfo open
function successful
[JARVICENAE27:06474] mca: base: components_open: found loaded component cm
libibverbs: Warning: no userspace device-specific driver found for
/sys/class/infiniband_verbs/uverbs0
[1432283901.559929] sys.c:719 MXM WARN Conflicting CPU
frequencies detected, using: 2601.00
[1432283901.561294] [JARVICENAE27:6474 :0] ib_dev.c:573 MXM
ERROR There are no Mellanox cards detected.
[JARVICENAE27:06474] mca: base: close: component cm closed
[JARVICENAE27:06474] mca: base: close: unloading component cm
[JARVICENAE27:06474] mca: base: components_open: found loaded
component ob1
[JARVICENAE27:06474] mca: base: components_open: component ob1 open
function successful
[JARVICENAE27:06474] mca: base: components_open: found loaded
component yalla
[1432283901.561732] [JARVICENAE27:6474 :0] ib_dev.c:573 MXM
ERROR There are no Mellanox cards detected.
[JARVICENAE27:06474] mca: base: components_open: component yalla open
function failed
[JARVICENAE27:06474] mca: base: close: component yalla closed
[JARVICENAE27:06474] mca: base: close: unloading component yalla
[JARVICENAE27:06474] select: component v not in the include list
[JARVICENAE27:06474] select: component bfo not in the include list
[JARVICENAE27:06474] select: initializing pml component ob1
[JARVICENAE27:06474] select: init returned priority 20
[JARVICENAE27:06474] selected ob1 best priority 20
[JARVICENAE27:06474] select: component ob1 selected
[JARVICENAE27:06474] mca: base: close: component v closed
[JARVICENAE27:06474] mca: base: close: unloading component v
[JARVICENAE27:06474] mca: base: close: component bfo closed
[JARVICENAE27:06474] mca: base: close: unloading component bfo
[JARVICENAE125:00909] select: init returned priority 30
[JARVICENAE125:00909] select: initializing pml component ob1
[JARVICENAE125:00909] select: init returned failure for component ob1
[JARVICENAE125:00909] select: initializing pml component yalla
[JARVICENAE125:00909] select: init returned priority 50
[JARVICENAE125:00909] selected yalla best priority 50
[JARVICENAE125:00909] select: component cm not selected / finalized
[JARVICENAE125:00909] select: component yalla selected
[JARVICENAE125:00909] mca: base: close: component v closed
[JARVICENAE125:00909] mca: base: close: unloading component v
[JARVICENAE125:00909] mca: base: close: component bfo closed
[JARVICENAE125:00909] mca: base: close: unloading component bfo
[JARVICENAE125:00909] mca: base: close: component cm closed
[JARVICENAE125:00909] mca: base: close: unloading component cm
[JARVICENAE125:00909] mca: base: close: component ob1 closed
[JARVICENAE125:00909] mca: base: close: unloading component ob1
[JARVICENAE27:06474] check:select: modex not reqd
On Wed, May 13, 2015 at 8:02 AM, Ralph Castain <r...@open-mpi.org> wrote:
Okay, so we see two nodes have been allocated:
1. JARVICENAE27 - appears to be the node where mpirun is running
2. 10.3.0.176
Does that match what you expected?
If you cannot ssh (without a password) between machines, then we
will not be able to run.
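If passwordless ssh is not configured yet, something along these lines
usually does it (assuming you run as root and the second node is the
10.3.0.176 from the allocation; adjust to your setup):

ssh-keygen -t rsa                # accept the defaults, empty passphrase
ssh-copy-id root@10.3.0.176
ssh root@10.3.0.176 hostname     # should succeed without a password prompt

Note that if ssh is being invoked from inside the chroot, the keys also
need to be visible under the chroot's /root/.ssh.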
On May 13, 2015, at 12:21 AM, Rahul Yadav <robora...@gmail.com> wrote:
I get the following output with verbose:
[JARVICENAE27:00654] mca: base: components_register: registering
ras components
[JARVICENAE27:00654] mca: base: components_register: found loaded
component loadleveler
[JARVICENAE27:00654] mca: base: components_register: component
loadleveler register function successful
[JARVICENAE27:00654] mca: base: components_register: found loaded
component simulator
[JARVICENAE27:00654] mca: base: components_register: component
simulator register function successful
[JARVICENAE27:00654] mca: base: components_register: found loaded
component slurm
[JARVICENAE27:00654] mca: base: components_register: component
slurm register function successful
[JARVICENAE27:00654] mca: base: components_open: opening ras
components
[JARVICENAE27:00654] mca: base: components_open: found loaded
component loadleveler
[JARVICENAE27:00654] mca: base: components_open: component
loadleveler open function successful
[JARVICENAE27:00654] mca: base: components_open: found loaded
component simulator
[JARVICENAE27:00654] mca: base: components_open: found loaded
component slurm
[JARVICENAE27:00654] mca: base: components_open: component slurm
open function successful
[JARVICENAE27:00654] mca:base:select: Auto-selecting ras components
[JARVICENAE27:00654] mca:base:select:( ras) Querying component
[loadleveler]
[JARVICENAE27:00654] mca:base:select:( ras) Skipping component
[loadleveler]. Query failed to return a module
[JARVICENAE27:00654] mca:base:select:( ras) Querying component
[simulator]
[JARVICENAE27:00654] mca:base:select:( ras) Skipping component
[simulator]. Query failed to return a module
[JARVICENAE27:00654] mca:base:select:( ras) Querying component
[slurm]
[JARVICENAE27:00654] mca:base:select:( ras) Skipping component
[slurm]. Query failed to return a module
[JARVICENAE27:00654] mca:base:select:( ras) No component selected!
====================== ALLOCATED NODES ======================
JARVICENAE27: slots=1 max_slots=0 slots_inuse=0 state=UP
10.3.0.176: slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
Also, I am not able to ssh from one machine to the other in the chroot
environment. Can that be a problem?
Thanks
Rahul
On Thu, May 7, 2015 at 8:06 AM, Ralph Castain <r...@open-mpi.org> wrote:
Try adding --mca ras_base_verbose 10 to your command line and let's see
what it thinks it is doing. Which OMPI version are you using - master?
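Applied to the command in your mail below, that would look something like
this (the only change is the added verbosity flag):

mpirun --allow-run-as-root --mca pml yalla --mca ras_base_verbose 10 \
    -n 1 --hostfile /root/host1 /root/app2 : \
    -n 1 --hostfile /root/host2 /root/backend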
On May 6, 2015, at 11:24 PM, Rahul Yadav <robora...@gmail.com> wrote:
Hi,
We have been trying to run an MPI job consisting of two different binaries
(one on each node) across two nodes, using the hostfile option as follows:
mpirun --allow-run-as-root --mca pml yalla -n 1 --hostfile
/root/host1 /root/app2 : -n 1 --hostfile /root/host2
/root/backend
We are doing this in a chroot environment. We have set up the HPCX
environment inside the chroot itself. /root/host1 and /root/host2 (inside
the chroot) contain the IPs of the two nodes, respectively.
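For reference, each hostfile is a single line in the usual Open MPI
hostfile format; the IPs and slot counts below are placeholders, not our
exact values:

/root/host1:
<ip-of-node-1> slots=1

/root/host2:
<ip-of-node-2> slots=1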
We are getting the following error:
"all nodes which are allocated for this job are already filled"
However, when we use chroot but don't use the hostfile option (both
processes run on the same node), or when we use the hostfile option outside
the chroot, it works.
Does anyone have any idea whether chroot can cause the above error, and how
to solve it?
Thanks
Rahul