Well, it isn’t finding any MXM cards on NAE27. Do you have any there?

You can’t use yalla without MXM cards on all nodes.
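
A quick way to check whether a node actually has a usable Mellanox HCA (assuming the standard InfiniBand diagnostic tools are installed):

lspci | grep -i mellanox    # is a Mellanox adapter physically present?
ibv_devinfo                 # can libibverbs see and open the device?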


> On May 25, 2015, at 8:51 PM, Rahul Yadav <robora...@gmail.com> wrote:
> 
> We were able to solve the ssh problem. 
> 
> But now MPI is not able to use the yalla component. We are running the following 
> command:
> 
> mpirun --allow-run-as-root --mca pml yalla -n 1 --hostfile /root/host1 
> /root/app2 : -n 1 --hostfile /root/host2 /root/backend
> 
> The command is run in a chroot environment on JARVICENAE27; the other node is 
> JARVICENAE125. JARVICENAE125 is able to select yalla, since it is a remote 
> node and thus does not run the job in the chroot environment. But 
> JARVICENAE27 throws a few MXM-related errors, and yalla is not selected.
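> 
> If the chroot does not expose the host's InfiniBand device files, MXM cannot 
> find the HCA even when one is present. A minimal sketch of how they could be 
> made visible, assuming a hypothetical chroot root of /chroot:
> 
> mount --bind /dev/infiniband /chroot/dev/infiniband   # verbs device nodes
> mount --bind /sys /chroot/sys                         # device metadata read by libibverbs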
> 
> Below are the verbose logs from the command.
> 
> Any idea what might be wrong?
> 
> [1432283901.548917]         sys.c:719  MXM  WARN  Conflicting CPU frequencies 
> detected, using: 2601.00
> [JARVICENAE125:00909] mca: base: components_register: registering pml 
> components
> [JARVICENAE125:00909] mca: base: components_register: found loaded component v
> [JARVICENAE125:00909] mca: base: components_register: component v register 
> function successful
> [JARVICENAE125:00909] mca: base: components_register: found loaded component 
> bfo
> [JARVICENAE125:00909] mca: base: components_register: component bfo register 
> function successful
> [JARVICENAE125:00909] mca: base: components_register: found loaded component 
> cm
> [JARVICENAE125:00909] mca: base: components_register: component cm register 
> function successful
> [JARVICENAE125:00909] mca: base: components_register: found loaded component 
> ob1
> [JARVICENAE125:00909] mca: base: components_register: component ob1 register 
> function successful
> [JARVICENAE125:00909] mca: base: components_register: found loaded component 
> yalla
> [JARVICENAE125:00909] mca: base: components_register: component yalla 
> register function successful
> [JARVICENAE125:00909] mca: base: components_open: opening pml components
> [JARVICENAE125:00909] mca: base: components_open: found loaded component v
> [JARVICENAE125:00909] mca: base: components_open: component v open function 
> successful
> [JARVICENAE125:00909] mca: base: components_open: found loaded component bfo
> [JARVICENAE125:00909] mca: base: components_open: component bfo open function 
> successful
> [JARVICENAE125:00909] mca: base: components_open: found loaded component cm
> [JARVICENAE125:00909] mca: base: components_open: component cm open function 
> successful
> [JARVICENAE125:00909] mca: base: components_open: found loaded component ob1
> [JARVICENAE125:00909] mca: base: components_open: component ob1 open function 
> successful
> [JARVICENAE125:00909] mca: base: components_open: found loaded component yalla
> [JARVICENAE125:00909] mca: base: components_open: component yalla open 
> function successful
> [JARVICENAE125:00909] select: component v not in the include list
> [JARVICENAE125:00909] select: component bfo not in the include list
> [JARVICENAE125:00909] select: initializing pml component cm
> [JARVICENAE27:06474] mca: base: components_register: registering pml 
> components
> [JARVICENAE27:06474] mca: base: components_register: found loaded component v
> [JARVICENAE27:06474] mca: base: components_register: component v register 
> function successful
> [JARVICENAE27:06474] mca: base: components_register: found loaded component 
> bfo
> [JARVICENAE27:06474] mca: base: components_register: component bfo register 
> function successful
> [JARVICENAE27:06474] mca: base: components_register: found loaded component cm
> [JARVICENAE27:06474] mca: base: components_register: component cm register 
> function successful
> [JARVICENAE27:06474] mca: base: components_register: found loaded component 
> ob1
> [JARVICENAE27:06474] mca: base: components_register: component ob1 register 
> function successful
> [JARVICENAE27:06474] mca: base: components_register: found loaded component 
> yalla
> [JARVICENAE27:06474] mca: base: components_register: component yalla register 
> function successful
> [JARVICENAE27:06474] mca: base: components_open: opening pml components
> [JARVICENAE27:06474] mca: base: components_open: found loaded component v
> [JARVICENAE27:06474] mca: base: components_open: component v open function 
> successful
> [JARVICENAE27:06474] mca: base: components_open: found loaded component bfo
> [JARVICENAE27:06474] mca: base: components_open: component bfo open function 
> successful
> [JARVICENAE27:06474] mca: base: components_open: found loaded component cm
> libibverbs: Warning: no userspace device-specific driver found for 
> /sys/class/infiniband_verbs/uverbs0
> [1432283901.559929]         sys.c:719  MXM  WARN  Conflicting CPU frequencies 
> detected, using: 2601.00
> [1432283901.561294] [JARVICENAE27:6474 :0]      ib_dev.c:573  MXM  ERROR 
> There are no Mellanox cards detected.
> [JARVICENAE27:06474] mca: base: close: component cm closed
> [JARVICENAE27:06474] mca: base: close: unloading component cm
> [JARVICENAE27:06474] mca: base: components_open: found loaded component ob1
> [JARVICENAE27:06474] mca: base: components_open: component ob1 open function 
> successful
> [JARVICENAE27:06474] mca: base: components_open: found loaded component yalla
> [1432283901.561732] [JARVICENAE27:6474 :0]      ib_dev.c:573  MXM  ERROR 
> There are no Mellanox cards detected.
> [JARVICENAE27:06474] mca: base: components_open: component yalla open 
> function failed
> [JARVICENAE27:06474] mca: base: close: component yalla closed
> [JARVICENAE27:06474] mca: base: close: unloading component yalla
> [JARVICENAE27:06474] select: component v not in the include list
> [JARVICENAE27:06474] select: component bfo not in the include list
> [JARVICENAE27:06474] select: initializing pml component ob1
> [JARVICENAE27:06474] select: init returned priority 20
> [JARVICENAE27:06474] selected ob1 best priority 20
> [JARVICENAE27:06474] select: component ob1 selected
> [JARVICENAE27:06474] mca: base: close: component v closed
> [JARVICENAE27:06474] mca: base: close: unloading component v
> [JARVICENAE27:06474] mca: base: close: component bfo closed
> [JARVICENAE27:06474] mca: base: close: unloading component bfo
> [JARVICENAE125:00909] select: init returned priority 30
> [JARVICENAE125:00909] select: initializing pml component ob1
> [JARVICENAE125:00909] select: init returned failure for component ob1
> [JARVICENAE125:00909] select: initializing pml component yalla
> [JARVICENAE125:00909] select: init returned priority 50
> [JARVICENAE125:00909] selected yalla best priority 50
> [JARVICENAE125:00909] select: component cm not selected / finalized
> [JARVICENAE125:00909] select: component yalla selected
> [JARVICENAE125:00909] mca: base: close: component v closed
> [JARVICENAE125:00909] mca: base: close: unloading component v
> [JARVICENAE125:00909] mca: base: close: component bfo closed
> [JARVICENAE125:00909] mca: base: close: unloading component bfo
> [JARVICENAE125:00909] mca: base: close: component cm closed
> [JARVICENAE125:00909] mca: base: close: unloading component cm
> [JARVICENAE125:00909] mca: base: close: component ob1 closed
> [JARVICENAE125:00909] mca: base: close: unloading component ob1
> [JARVICENAE27:06474] check:select: modex not reqd
> 
> 
> On Wed, May 13, 2015 at 8:02 AM, Ralph Castain <r...@open-mpi.org> wrote:
> Okay, so we see two nodes have been allocated:
> 
> 1. JARVICENAE27, which appears to be the node where mpirun is running
> 
> 2. 10.3.0.176
> 
> Does that match what you expected?
> 
> If you cannot ssh (without a password) between machines, then we will not be 
> able to run.
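> 
> Passwordless ssh between the nodes can be set up along these lines (a sketch; 
> assumes OpenSSH with its default key locations):
> 
> ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa   # generate a key with no passphrase
> ssh-copy-id root@10.3.0.176                # install the public key on the remote node
> ssh root@10.3.0.176 true                   # should return without prompting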
> 
> 
>> On May 13, 2015, at 12:21 AM, Rahul Yadav <robora...@gmail.com> wrote:
>> 
>> I get the following output with verbose:
>> 
>> [JARVICENAE27:00654] mca: base: components_register: registering ras 
>> components
>> [JARVICENAE27:00654] mca: base: components_register: found loaded component 
>> loadleveler
>> [JARVICENAE27:00654] mca: base: components_register: component loadleveler 
>> register function successful
>> [JARVICENAE27:00654] mca: base: components_register: found loaded component 
>> simulator
>> [JARVICENAE27:00654] mca: base: components_register: component simulator 
>> register function successful
>> [JARVICENAE27:00654] mca: base: components_register: found loaded component 
>> slurm
>> [JARVICENAE27:00654] mca: base: components_register: component slurm 
>> register function successful
>> [JARVICENAE27:00654] mca: base: components_open: opening ras components
>> [JARVICENAE27:00654] mca: base: components_open: found loaded component 
>> loadleveler
>> [JARVICENAE27:00654] mca: base: components_open: component loadleveler open 
>> function successful
>> [JARVICENAE27:00654] mca: base: components_open: found loaded component 
>> simulator
>> [JARVICENAE27:00654] mca: base: components_open: found loaded component slurm
>> [JARVICENAE27:00654] mca: base: components_open: component slurm open 
>> function successful
>> [JARVICENAE27:00654] mca:base:select: Auto-selecting ras components
>> [JARVICENAE27:00654] mca:base:select:(  ras) Querying component [loadleveler]
>> [JARVICENAE27:00654] mca:base:select:(  ras) Skipping component 
>> [loadleveler]. Query failed to return a module
>> [JARVICENAE27:00654] mca:base:select:(  ras) Querying component [simulator]
>> [JARVICENAE27:00654] mca:base:select:(  ras) Skipping component [simulator]. 
>> Query failed to return a module
>> [JARVICENAE27:00654] mca:base:select:(  ras) Querying component [slurm]
>> [JARVICENAE27:00654] mca:base:select:(  ras) Skipping component [slurm]. 
>> Query failed to return a module
>> [JARVICENAE27:00654] mca:base:select:(  ras) No component selected!
>> 
>> ======================   ALLOCATED NODES   ======================
>>        JARVICENAE27: slots=1 max_slots=0 slots_inuse=0 state=UP
>>        10.3.0.176: slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>> 
>> Also, I am not able to ssh from one machine to the other in the chroot 
>> environment. Can that be a problem?
>> 
>> Thanks
>> Rahul
>> 
>> On Thu, May 7, 2015 at 8:06 AM, Ralph Castain <r...@open-mpi.org> wrote:
>> Try adding --mca ras_base_verbose 10 to your command line and let’s see what it 
>> thinks it is doing. Which OMPI version are you using? Master?
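>> 
>> For example, applied to the mpirun command quoted below (only the verbosity 
>> flag is new; everything else is unchanged):
>> 
>> mpirun --allow-run-as-root --mca ras_base_verbose 10 --mca pml yalla \
>>     -n 1 --hostfile /root/host1 /root/app2 : -n 1 --hostfile /root/host2 /root/backend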
>> 
>> 
>>> On May 6, 2015, at 11:24 PM, Rahul Yadav <robora...@gmail.com> wrote:
>>> 
>>> Hi,
>>> 
>>> We have been trying to run MPI jobs (consisting of two different binaries, 
>>> one on each node) on two nodes, using the hostfile option as follows:
>>> 
>>> mpirun --allow-run-as-root --mca pml yalla -n 1 --hostfile /root/host1 
>>> /root/app2 : -n 1 --hostfile /root/host2 /root/backend
>>> 
>>> We are doing this in a chroot environment. We have set up the HPCX env inside 
>>> the chroot environment itself. /root/host1 and /root/host2 (inside the chroot 
>>> env) contain the IPs of the two nodes, respectively.
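>>> 
>>> For reference, each hostfile here just names one node; illustratively (the 
>>> first address is a placeholder for JARVICENAE27's IP):
>>> 
>>> # /root/host1 -- node that runs /root/app2
>>> <IP-of-JARVICENAE27> slots=1
>>> 
>>> # /root/host2 -- node that runs /root/backend
>>> 10.3.0.176 slots=1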
>>> 
>>> We are getting the following error:
>>> 
>>> " all nodes which are allocated for this job are already filled "
>>> 
>>> However, when we use chroot but don't use the hostfile option (both processes 
>>> run on the same node), or when we use the hostfile option outside the chroot, 
>>> it works.
>>> 
>>> Does anyone have any idea whether chroot can cause the above error, and how 
>>> to solve it?
>>> 
>>> Thanks
>>> Rahul
