Yes Ralph, MXM cards are on the node. The command runs fine if I run it outside
the chroot environment.
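
For reference, a quick sanity check inside the chroot (assuming the usual OFED
device paths; the chroot path below is a placeholder) would be something like:

  # inside the chroot: confirm the verbs stack can actually see the HCA
  ls /sys/class/infiniband     # should list the Mellanox device, e.g. mlx4_0
  ls /dev/infiniband           # uverbs0, rdma_cm, etc. must be present
  ibv_devinfo                  # (from libibverbs-utils, if installed) should report the HCA

  # if the sysfs entries or device nodes are missing inside the chroot,
  # bind-mount them from the host before launching
  mount --bind /sys /path/to/chroot/sys
  mount --bind /dev /path/to/chroot/dev

The libibverbs warning in the log below ("no userspace device-specific driver
found") would also be consistent with the mlx4/mlx5 provider libraries not
being installed inside the chroot.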

Thanks
Rahul

On Mon, May 25, 2015 at 9:03 PM, Ralph Castain <r...@open-mpi.org> wrote:

> Well, it isn’t finding any MXM cards on NAE27 - do you have any there?
>
> You can’t use yalla without MXM cards on all nodes.
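>
> A quick way to confirm the rest of the setup, if you want, would be to drop the
> forced yalla selection and run with the ob1 PML (which the MXM-less node already
> falls back to in your log), e.g.:
>
>   mpirun --allow-run-as-root --mca pml ob1 -n 1 --hostfile /root/host1 /root/app2 : -n 1 --hostfile /root/host2 /root/backend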
>
>
> On May 25, 2015, at 8:51 PM, Rahul Yadav <robora...@gmail.com> wrote:
>
> We were able to solve the ssh problem.
>
> But now MPI is not able to use the yalla component. We are running the
> following command:
>
> mpirun --allow-run-as-root --mca pml yalla -n 1 --hostfile /root/host1
> /root/app2 : -n 1 --hostfile /root/host2 /root/backend
>
> The command is run in a chroot environment on JARVICENAE27; the other node
> is JARVICENAE125. JARVICENAE125 is able to select yalla since it is a
> remote node and thus is not running the job in the chroot environment.
> But JARVICENAE27 is throwing a few MXM-related errors and yalla is not
> selected.
>
> Below are the verbose logs of the command.
>
> Any idea what might be wrong?
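>
> (For reference, the PML selection output below was presumably produced by
> adding a verbosity parameter to the same command, i.e. something like:
>
>   mpirun --allow-run-as-root --mca pml yalla --mca pml_base_verbose 10 -n 1 --hostfile /root/host1 /root/app2 : -n 1 --hostfile /root/host2 /root/backend
> )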
>
> [1432283901.548917]         sys.c:719  MXM  WARN  Conflicting CPU
> frequencies detected, using: 2601.00
> [JARVICENAE125:00909] mca: base: components_register: registering pml
> components
> [JARVICENAE125:00909] mca: base: components_register: found loaded
> component v
> [JARVICENAE125:00909] mca: base: components_register: component v register
> function successful
> [JARVICENAE125:00909] mca: base: components_register: found loaded
> component bfo
> [JARVICENAE125:00909] mca: base: components_register: component bfo
> register function successful
> [JARVICENAE125:00909] mca: base: components_register: found loaded
> component cm
> [JARVICENAE125:00909] mca: base: components_register: component cm
> register function successful
> [JARVICENAE125:00909] mca: base: components_register: found loaded
> component ob1
> [JARVICENAE125:00909] mca: base: components_register: component ob1
> register function successful
> [JARVICENAE125:00909] mca: base: components_register: found loaded
> component yalla
> [JARVICENAE125:00909] mca: base: components_register: component yalla
> register function successful
> [JARVICENAE125:00909] mca: base: components_open: opening pml components
> [JARVICENAE125:00909] mca: base: components_open: found loaded component v
> [JARVICENAE125:00909] mca: base: components_open: component v open
> function successful
> [JARVICENAE125:00909] mca: base: components_open: found loaded component
> bfo
> [JARVICENAE125:00909] mca: base: components_open: component bfo open
> function successful
> [JARVICENAE125:00909] mca: base: components_open: found loaded component cm
> [JARVICENAE125:00909] mca: base: components_open: component cm open
> function successful
> [JARVICENAE125:00909] mca: base: components_open: found loaded component
> ob1
> [JARVICENAE125:00909] mca: base: components_open: component ob1 open
> function successful
> [JARVICENAE125:00909] mca: base: components_open: found loaded component
> yalla
> [JARVICENAE125:00909] mca: base: components_open: component yalla open
> function successful
> [JARVICENAE125:00909] select: component v not in the include list
> [JARVICENAE125:00909] select: component bfo not in the include list
> [JARVICENAE125:00909] select: initializing pml component cm
> [JARVICENAE27:06474] mca: base: components_register: registering pml
> components
> [JARVICENAE27:06474] mca: base: components_register: found loaded
> component v
> [JARVICENAE27:06474] mca: base: components_register: component v register
> function successful
> [JARVICENAE27:06474] mca: base: components_register: found loaded
> component bfo
> [JARVICENAE27:06474] mca: base: components_register: component bfo
> register function successful
> [JARVICENAE27:06474] mca: base: components_register: found loaded
> component cm
> [JARVICENAE27:06474] mca: base: components_register: component cm register
> function successful
> [JARVICENAE27:06474] mca: base: components_register: found loaded
> component ob1
> [JARVICENAE27:06474] mca: base: components_register: component ob1
> register function successful
> [JARVICENAE27:06474] mca: base: components_register: found loaded
> component yalla
> [JARVICENAE27:06474] mca: base: components_register: component yalla
> register function successful
> [JARVICENAE27:06474] mca: base: components_open: opening pml components
> [JARVICENAE27:06474] mca: base: components_open: found loaded component v
> [JARVICENAE27:06474] mca: base: components_open: component v open function
> successful
> [JARVICENAE27:06474] mca: base: components_open: found loaded component bfo
> [JARVICENAE27:06474] mca: base: components_open: component bfo open
> function successful
> [JARVICENAE27:06474] mca: base: components_open: found loaded component cm
> libibverbs: Warning: no userspace device-specific driver found for
> /sys/class/infiniband_verbs/uverbs0
> [1432283901.559929]         sys.c:719  MXM  WARN  Conflicting CPU
> frequencies detected, using: 2601.00
> [1432283901.561294] [JARVICENAE27:6474 :0]      ib_dev.c:573  MXM  ERROR
> There are no Mellanox cards detected.
> [JARVICENAE27:06474] mca: base: close: component cm closed
> [JARVICENAE27:06474] mca: base: close: unloading component cm
> [JARVICENAE27:06474] mca: base: components_open: found loaded component ob1
> [JARVICENAE27:06474] mca: base: components_open: component ob1 open
> function successful
> [JARVICENAE27:06474] mca: base: components_open: found loaded component
> yalla
> [1432283901.561732] [JARVICENAE27:6474 :0]      ib_dev.c:573  MXM  ERROR
> There are no Mellanox cards detected.
> [JARVICENAE27:06474] mca: base: components_open: component yalla open
> function failed
> [JARVICENAE27:06474] mca: base: close: component yalla closed
> [JARVICENAE27:06474] mca: base: close: unloading component yalla
> [JARVICENAE27:06474] select: component v not in the include list
> [JARVICENAE27:06474] select: component bfo not in the include list
> [JARVICENAE27:06474] select: initializing pml component ob1
> [JARVICENAE27:06474] select: init returned priority 20
> [JARVICENAE27:06474] selected ob1 best priority 20
> [JARVICENAE27:06474] select: component ob1 selected
> [JARVICENAE27:06474] mca: base: close: component v closed
> [JARVICENAE27:06474] mca: base: close: unloading component v
> [JARVICENAE27:06474] mca: base: close: component bfo closed
> [JARVICENAE27:06474] mca: base: close: unloading component bfo
> [JARVICENAE125:00909] select: init returned priority 30
> [JARVICENAE125:00909] select: initializing pml component ob1
> [JARVICENAE125:00909] select: init returned failure for component ob1
> [JARVICENAE125:00909] select: initializing pml component yalla
> [JARVICENAE125:00909] select: init returned priority 50
> [JARVICENAE125:00909] selected yalla best priority 50
> [JARVICENAE125:00909] select: component cm not selected / finalized
> [JARVICENAE125:00909] select: component yalla selected
> [JARVICENAE125:00909] mca: base: close: component v closed
> [JARVICENAE125:00909] mca: base: close: unloading component v
> [JARVICENAE125:00909] mca: base: close: component bfo closed
> [JARVICENAE125:00909] mca: base: close: unloading component bfo
> [JARVICENAE125:00909] mca: base: close: component cm closed
> [JARVICENAE125:00909] mca: base: close: unloading component cm
> [JARVICENAE125:00909] mca: base: close: component ob1 closed
> [JARVICENAE125:00909] mca: base: close: unloading component ob1
> [JARVICENAE27:06474] check:select: modex not reqd
>
>
> On Wed, May 13, 2015 at 8:02 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> Okay, so we see two nodes have been allocated:
>>
>> 1. JARVICENAE27 - appears to be the node where mpirun is running
>>
>> 2. 10.3.0.176
>>
>> Does that match what you expected?
>>
>> If you cannot ssh (without a password) between machines, then we will not
>> be able to run.
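>>
>> A quick way to check from the node where mpirun runs (the IP is the second
>> node from your allocation output; adjust as needed):
>>
>>   ssh -o BatchMode=yes 10.3.0.176 hostname
>>
>> With BatchMode=yes, ssh fails instead of prompting for a password, so if this
>> does not print the remote hostname, mpirun will not be able to launch there.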
>>
>>
>> On May 13, 2015, at 12:21 AM, Rahul Yadav <robora...@gmail.com> wrote:
>>
>> I get the following output with verbose enabled:
>>
>> [JARVICENAE27:00654] mca: base: components_register: registering ras
>> components
>> [JARVICENAE27:00654] mca: base: components_register: found loaded
>> component loadleveler
>> [JARVICENAE27:00654] mca: base: components_register: component
>> loadleveler register function successful
>> [JARVICENAE27:00654] mca: base: components_register: found loaded
>> component simulator
>> [JARVICENAE27:00654] mca: base: components_register: component simulator
>> register function successful
>> [JARVICENAE27:00654] mca: base: components_register: found loaded
>> component slurm
>> [JARVICENAE27:00654] mca: base: components_register: component slurm
>> register function successful
>> [JARVICENAE27:00654] mca: base: components_open: opening ras components
>> [JARVICENAE27:00654] mca: base: components_open: found loaded component
>> loadleveler
>> [JARVICENAE27:00654] mca: base: components_open: component loadleveler
>> open function successful
>> [JARVICENAE27:00654] mca: base: components_open: found loaded component
>> simulator
>> [JARVICENAE27:00654] mca: base: components_open: found loaded component
>> slurm
>> [JARVICENAE27:00654] mca: base: components_open: component slurm open
>> function successful
>> [JARVICENAE27:00654] mca:base:select: Auto-selecting ras components
>> [JARVICENAE27:00654] mca:base:select:(  ras) Querying component
>> [loadleveler]
>> [JARVICENAE27:00654] mca:base:select:(  ras) Skipping component
>> [loadleveler]. Query failed to return a module
>> [JARVICENAE27:00654] mca:base:select:(  ras) Querying component
>> [simulator]
>> [JARVICENAE27:00654] mca:base:select:(  ras) Skipping component
>> [simulator]. Query failed to return a module
>> [JARVICENAE27:00654] mca:base:select:(  ras) Querying component [slurm]
>> [JARVICENAE27:00654] mca:base:select:(  ras) Skipping component [slurm].
>> Query failed to return a module
>> [JARVICENAE27:00654] mca:base:select:(  ras) No component selected!
>>
>> ======================   ALLOCATED NODES   ======================
>>        JARVICENAE27: slots=1 max_slots=0 slots_inuse=0 state=UP
>>        10.3.0.176: slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>>
>> Also, I am not able to ssh from one machine to the other inside the chroot
>> environment. Can that be a problem?
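>>
>> (If it is just a matter of missing keys, the usual passwordless-ssh setup run
>> inside the chroot would be something along these lines, assuming a reachable
>> sshd and the standard ~/.ssh layout; the remote address is a placeholder:
>>
>>   ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
>>   ssh-copy-id root@<other-node-IP>   # or append id_rsa.pub to the remote authorized_keys
>> )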
>>
>> Thanks
>> Rahul
>>
>> On Thu, May 7, 2015 at 8:06 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> Try adding --mca ras_base_verbose 10 to your cmd line and let’s see what
>>> it thinks it is doing. Which OMPI version are you using - master?
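>>>
>>> That is, keeping your original command and just adding the verbosity
>>> parameter, something like:
>>>
>>>   mpirun --allow-run-as-root --mca ras_base_verbose 10 --mca pml yalla -n 1 --hostfile /root/host1 /root/app2 : -n 1 --hostfile /root/host2 /root/backend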
>>>
>>>
>>> On May 6, 2015, at 11:24 PM, Rahul Yadav <robora...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> We have been trying to run an MPI job (consisting of two different
>>> binaries, one on each node) across two nodes, using the hostfile option as
>>> follows:
>>>
>>> mpirun --allow-run-as-root --mca pml yalla -n 1 --hostfile /root/host1
>>> /root/app2 : -n 1 --hostfile /root/host2 /root/backend
>>>
>>> We are doing this in a chroot environment. We have set up the HPCX
>>> environment inside the chroot itself. /root/host1 and /root/host2 (inside
>>> the chroot) contain the IPs of the two nodes, respectively.
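>>>
>>> (For reference, each hostfile here is just the corresponding node's address
>>> on one line, optionally with a slot count; the IPs below are placeholders:
>>>
>>>   # /root/host1
>>>   <IP-of-first-node> slots=1
>>>
>>>   # /root/host2
>>>   <IP-of-second-node> slots=1
>>> )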
>>>
>>> We are getting the following error:
>>>
>>> " all nodes which are allocated for this job are already filled "
>>>
>>> However, when we use chroot but don't use the hostfile option (both
>>> processes run on the same node), OR when we use the hostfile option outside
>>> chroot, it works.
>>>
>>> Does anyone have any idea whether chroot can cause the above error and how
>>> to solve it?
>>>
>>> Thanks
>>> Rahul
>>
>>
>
>
