Rahul,

Per the logs, it seems the /sys pseudo filesystem is not mounted in your chroot.

As a first step, can you make sure it is mounted and try again?
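
Something along these lines should do the trick; /path/to/chroot below is just a placeholder for wherever your chroot lives, /proc and /dev usually need the same treatment, and ibv_devinfo is only available if libibverbs-utils is installed inside the chroot:

mount --bind /sys  /path/to/chroot/sys
mount --bind /proc /path/to/chroot/proc
mount --bind /dev  /path/to/chroot/dev

# then, from inside the chroot, check that the Mellanox HCA is visible again
ls /sys/class/infiniband
ibv_devinfo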

Cheers,

Gilles

On 5/26/2015 12:51 PM, Rahul Yadav wrote:
We were able to solve the ssh problem.

But now MPI is not able to use the yalla component. We are running the following command:

mpirun --allow-run-as-root --mca pml yalla -n 1 --hostfile /root/host1 /root/app2 : -n 1 --hostfile /root/host2 /root/backend

The command is run in a chroot environment on JARVICENAE27; the other node is JARVICENAE125. JARVICENAE125 is able to select yalla, since it is a remote node and is therefore not running the job in a chroot environment. But JARVICENAE27 throws a few MXM-related errors and yalla is not selected.

Following are the logs of the command with verbose output enabled.

Any idea what might be wrong?

[1432283901.548917] sys.c:719 MXM WARN Conflicting CPU frequencies detected, using: 2601.00
[JARVICENAE125:00909] mca: base: components_register: registering pml components
[JARVICENAE125:00909] mca: base: components_register: found loaded component v
[JARVICENAE125:00909] mca: base: components_register: component v register function successful
[JARVICENAE125:00909] mca: base: components_register: found loaded component bfo
[JARVICENAE125:00909] mca: base: components_register: component bfo register function successful
[JARVICENAE125:00909] mca: base: components_register: found loaded component cm
[JARVICENAE125:00909] mca: base: components_register: component cm register function successful
[JARVICENAE125:00909] mca: base: components_register: found loaded component ob1
[JARVICENAE125:00909] mca: base: components_register: component ob1 register function successful
[JARVICENAE125:00909] mca: base: components_register: found loaded component yalla
[JARVICENAE125:00909] mca: base: components_register: component yalla register function successful
[JARVICENAE125:00909] mca: base: components_open: opening pml components
[JARVICENAE125:00909] mca: base: components_open: found loaded component v
[JARVICENAE125:00909] mca: base: components_open: component v open function successful
[JARVICENAE125:00909] mca: base: components_open: found loaded component bfo
[JARVICENAE125:00909] mca: base: components_open: component bfo open function successful
[JARVICENAE125:00909] mca: base: components_open: found loaded component cm
[JARVICENAE125:00909] mca: base: components_open: component cm open function successful
[JARVICENAE125:00909] mca: base: components_open: found loaded component ob1
[JARVICENAE125:00909] mca: base: components_open: component ob1 open function successful
[JARVICENAE125:00909] mca: base: components_open: found loaded component yalla
[JARVICENAE125:00909] mca: base: components_open: component yalla open function successful
[JARVICENAE125:00909] select: component v not in the include list
[JARVICENAE125:00909] select: component bfo not in the include list
[JARVICENAE125:00909] select: initializing pml component cm
[JARVICENAE27:06474] mca: base: components_register: registering pml components
[JARVICENAE27:06474] mca: base: components_register: found loaded component v
[JARVICENAE27:06474] mca: base: components_register: component v register function successful
[JARVICENAE27:06474] mca: base: components_register: found loaded component bfo
[JARVICENAE27:06474] mca: base: components_register: component bfo register function successful
[JARVICENAE27:06474] mca: base: components_register: found loaded component cm
[JARVICENAE27:06474] mca: base: components_register: component cm register function successful
[JARVICENAE27:06474] mca: base: components_register: found loaded component ob1
[JARVICENAE27:06474] mca: base: components_register: component ob1 register function successful
[JARVICENAE27:06474] mca: base: components_register: found loaded component yalla
[JARVICENAE27:06474] mca: base: components_register: component yalla register function successful
[JARVICENAE27:06474] mca: base: components_open: opening pml components
[JARVICENAE27:06474] mca: base: components_open: found loaded component v
[JARVICENAE27:06474] mca: base: components_open: component v open function successful
[JARVICENAE27:06474] mca: base: components_open: found loaded component bfo
[JARVICENAE27:06474] mca: base: components_open: component bfo open function successful
[JARVICENAE27:06474] mca: base: components_open: found loaded component cm
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
[1432283901.559929] sys.c:719 MXM WARN Conflicting CPU frequencies detected, using: 2601.00
[1432283901.561294] [JARVICENAE27:6474 :0] ib_dev.c:573 MXM ERROR There are no Mellanox cards detected.
[JARVICENAE27:06474] mca: base: close: component cm closed
[JARVICENAE27:06474] mca: base: close: unloading component cm
[JARVICENAE27:06474] mca: base: components_open: found loaded component ob1
[JARVICENAE27:06474] mca: base: components_open: component ob1 open function successful
[JARVICENAE27:06474] mca: base: components_open: found loaded component yalla
[1432283901.561732] [JARVICENAE27:6474 :0] ib_dev.c:573 MXM ERROR There are no Mellanox cards detected.
[JARVICENAE27:06474] mca: base: components_open: component yalla open function failed
[JARVICENAE27:06474] mca: base: close: component yalla closed
[JARVICENAE27:06474] mca: base: close: unloading component yalla
[JARVICENAE27:06474] select: component v not in the include list
[JARVICENAE27:06474] select: component bfo not in the include list
[JARVICENAE27:06474] select: initializing pml component ob1
[JARVICENAE27:06474] select: init returned priority 20
[JARVICENAE27:06474] selected ob1 best priority 20
[JARVICENAE27:06474] select: component ob1 selected
[JARVICENAE27:06474] mca: base: close: component v closed
[JARVICENAE27:06474] mca: base: close: unloading component v
[JARVICENAE27:06474] mca: base: close: component bfo closed
[JARVICENAE27:06474] mca: base: close: unloading component bfo
[JARVICENAE125:00909] select: init returned priority 30
[JARVICENAE125:00909] select: initializing pml component ob1
[JARVICENAE125:00909] select: init returned failure for component ob1
[JARVICENAE125:00909] select: initializing pml component yalla
[JARVICENAE125:00909] select: init returned priority 50
[JARVICENAE125:00909] selected yalla best priority 50
[JARVICENAE125:00909] select: component cm not selected / finalized
[JARVICENAE125:00909] select: component yalla selected
[JARVICENAE125:00909] mca: base: close: component v closed
[JARVICENAE125:00909] mca: base: close: unloading component v
[JARVICENAE125:00909] mca: base: close: component bfo closed
[JARVICENAE125:00909] mca: base: close: unloading component bfo
[JARVICENAE125:00909] mca: base: close: component cm closed
[JARVICENAE125:00909] mca: base: close: unloading component cm
[JARVICENAE125:00909] mca: base: close: component ob1 closed
[JARVICENAE125:00909] mca: base: close: unloading component ob1
[JARVICENAE27:06474] check:select: modex not reqd


On Wed, May 13, 2015 at 8:02 AM, Ralph Castain <r...@open-mpi.org> wrote:

    Okay, so we see two nodes have been allocated:

    1. JARVICENAE27 - appears to be the node where mpirun is running

    2. 10.3.0.176

    Does that match what you expected?

    If you cannot ssh (without a password) between machines, then we
    will not be able to run.
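
    If passwordless ssh is not yet set up between the nodes, something along
    these lines usually does it (standard OpenSSH tools; the address below is
    just the second node from your allocation, adjust as needed):

    ssh-keygen -t rsa            # accept the defaults, empty passphrase
    ssh-copy-id root@10.3.0.176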


    On May 13, 2015, at 12:21 AM, Rahul Yadav <robora...@gmail.com> wrote:

    I get the following output with verbose enabled:

    [JARVICENAE27:00654] mca: base: components_register: registering ras components
    [JARVICENAE27:00654] mca: base: components_register: found loaded component loadleveler
    [JARVICENAE27:00654] mca: base: components_register: component loadleveler register function successful
    [JARVICENAE27:00654] mca: base: components_register: found loaded component simulator
    [JARVICENAE27:00654] mca: base: components_register: component simulator register function successful
    [JARVICENAE27:00654] mca: base: components_register: found loaded component slurm
    [JARVICENAE27:00654] mca: base: components_register: component slurm register function successful
    [JARVICENAE27:00654] mca: base: components_open: opening ras components
    [JARVICENAE27:00654] mca: base: components_open: found loaded component loadleveler
    [JARVICENAE27:00654] mca: base: components_open: component loadleveler open function successful
    [JARVICENAE27:00654] mca: base: components_open: found loaded component simulator
    [JARVICENAE27:00654] mca: base: components_open: found loaded component slurm
    [JARVICENAE27:00654] mca: base: components_open: component slurm open function successful
    [JARVICENAE27:00654] mca:base:select: Auto-selecting ras components
    [JARVICENAE27:00654] mca:base:select:(  ras) Querying component [loadleveler]
    [JARVICENAE27:00654] mca:base:select:(  ras) Skipping component [loadleveler]. Query failed to return a module
    [JARVICENAE27:00654] mca:base:select:(  ras) Querying component [simulator]
    [JARVICENAE27:00654] mca:base:select:(  ras) Skipping component [simulator]. Query failed to return a module
    [JARVICENAE27:00654] mca:base:select:(  ras) Querying component [slurm]
    [JARVICENAE27:00654] mca:base:select:(  ras) Skipping component [slurm]. Query failed to return a module
    [JARVICENAE27:00654] mca:base:select:(  ras) No component selected!

    ======================   ALLOCATED NODES ======================
         JARVICENAE27: slots=1 max_slots=0 slots_inuse=0 state=UP
         10.3.0.176: slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN

    Also, I am not able to ssh from one machine to the other in the chroot
    environment. Can that be a problem?

    Thanks
    Rahul

    On Thu, May 7, 2015 at 8:06 AM, Ralph Castain <r...@open-mpi.org> wrote:

        Try adding --mca ras_base_verbose 10 to your cmd line and let's see
        what it thinks it is doing. Which OMPI version are you using - master?
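
        For example, applied to the mpirun command from your earlier mail it
        would look something like this (just to show where the option goes):

        mpirun --allow-run-as-root --mca ras_base_verbose 10 --mca pml yalla \
            -n 1 --hostfile /root/host1 /root/app2 : \
            -n 1 --hostfile /root/host2 /root/backend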


        On May 6, 2015, at 11:24 PM, Rahul Yadav <robora...@gmail.com> wrote:

        Hi,

        We have been trying to run an MPI job (consisting of two different
        binaries, one on each node) across two nodes, using the hostfile
        option as follows:

        mpirun --allow-run-as-root --mca pml yalla -n 1 --hostfile
        /root/host1 /root/app2 : -n 1 --hostfile /root/host2
        /root/backend

        We are doing this in a chroot environment. We have set up the HPCX
        environment inside the chroot itself. /root/host1 and /root/host2
        (inside the chroot) contain the IPs of the two nodes, respectively.
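
        For reference, each hostfile just lists the address of one node, i.e.
        a single line such as the following (the slots=1 part is optional and
        only shown as an example):

        10.3.0.176 slots=1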

        We are getting the following error:

        " all nodes which are allocated for this job are already
        filled "

        However, when we use chroot but don't use the hostfile option (both
        processes run on the same node), or when we use the hostfile option
        outside chroot, it works.

        Does anyone have any idea whether chroot can cause the above error,
        and how to solve it?

        Thanks
        Rahul




_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/05/26927.php
