Well, MXM isn’t finding any Mellanox cards on NAE27 - do you have any there? You can’t use yalla without MXM-capable cards on all of the nodes.
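A quick way to confirm that from inside the chroot on NAE27 is something like the following. This is only a rough sketch - it assumes the InfiniBand userspace tools are installed in the chroot and that the host's /dev and /sys entries are reachable from it:

    # HCAs the kernel has registered
    ls /sys/class/infiniband
    # query them through libibverbs; MXM needs a working Mellanox device here
    ibv_devinfo
    # the device nodes must also be visible inside the chroot
    ls -l /dev/infiniband

The "libibverbs: Warning: no userspace device-specific driver found" line in your log suggests the chroot may be missing the Mellanox userspace provider (libmlx4/libmlx5) or the bind mounts for /dev/infiniband and /sys, rather than the node having no card at all.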
> On May 25, 2015, at 8:51 PM, Rahul Yadav <robora...@gmail.com> wrote:
> 
> We were able to solve the ssh problem.
> 
> But now MPI is not able to use the yalla component. We are running the following command:
> 
> mpirun --allow-run-as-root --mca pml yalla -n 1 --hostfile /root/host1 /root/app2 : -n 1 --hostfile /root/host2 /root/backend
> 
> The command is run in a chroot environment on JARVICENAE27; the other node is JARVICENAE125. JARVICENAE125 is able to select yalla since it is the remote node and thus is not trying to run the job in the chroot environment. But JARVICENAE27 is throwing a few MXM-related errors and yalla is not selected.
> 
> Following are the logs of the command with verbose output.
> 
> Any idea what might be wrong?
> 
> [1432283901.548917] sys.c:719 MXM WARN Conflicting CPU frequencies detected, using: 2601.00
> [JARVICENAE125:00909] mca: base: components_register: registering pml components
> [JARVICENAE125:00909] mca: base: components_register: found loaded component v
> [JARVICENAE125:00909] mca: base: components_register: component v register function successful
> [JARVICENAE125:00909] mca: base: components_register: found loaded component bfo
> [JARVICENAE125:00909] mca: base: components_register: component bfo register function successful
> [JARVICENAE125:00909] mca: base: components_register: found loaded component cm
> [JARVICENAE125:00909] mca: base: components_register: component cm register function successful
> [JARVICENAE125:00909] mca: base: components_register: found loaded component ob1
> [JARVICENAE125:00909] mca: base: components_register: component ob1 register function successful
> [JARVICENAE125:00909] mca: base: components_register: found loaded component yalla
> [JARVICENAE125:00909] mca: base: components_register: component yalla register function successful
> [JARVICENAE125:00909] mca: base: components_open: opening pml components
> [JARVICENAE125:00909] mca: base: components_open: found loaded component v
> [JARVICENAE125:00909] mca: base: components_open: component v open function successful
> [JARVICENAE125:00909] mca: base: components_open: found loaded component bfo
> [JARVICENAE125:00909] mca: base: components_open: component bfo open function successful
> [JARVICENAE125:00909] mca: base: components_open: found loaded component cm
> [JARVICENAE125:00909] mca: base: components_open: component cm open function successful
> [JARVICENAE125:00909] mca: base: components_open: found loaded component ob1
> [JARVICENAE125:00909] mca: base: components_open: component ob1 open function successful
> [JARVICENAE125:00909] mca: base: components_open: found loaded component yalla
> [JARVICENAE125:00909] mca: base: components_open: component yalla open function successful
> [JARVICENAE125:00909] select: component v not in the include list
> [JARVICENAE125:00909] select: component bfo not in the include list
> [JARVICENAE125:00909] select: initializing pml component cm
> [JARVICENAE27:06474] mca: base: components_register: registering pml components
> [JARVICENAE27:06474] mca: base: components_register: found loaded component v
> [JARVICENAE27:06474] mca: base: components_register: component v register function successful
> [JARVICENAE27:06474] mca: base: components_register: found loaded component bfo
> [JARVICENAE27:06474] mca: base: components_register: component bfo register function successful
> [JARVICENAE27:06474] mca: base: components_register: found loaded component cm
> [JARVICENAE27:06474] mca: base: components_register: component cm register function successful
> [JARVICENAE27:06474] mca: base: components_register: found loaded component ob1
> [JARVICENAE27:06474] mca: base: components_register: component ob1 register function successful
> [JARVICENAE27:06474] mca: base: components_register: found loaded component yalla
> [JARVICENAE27:06474] mca: base: components_register: component yalla register function successful
> [JARVICENAE27:06474] mca: base: components_open: opening pml components
> [JARVICENAE27:06474] mca: base: components_open: found loaded component v
> [JARVICENAE27:06474] mca: base: components_open: component v open function successful
> [JARVICENAE27:06474] mca: base: components_open: found loaded component bfo
> [JARVICENAE27:06474] mca: base: components_open: component bfo open function successful
> [JARVICENAE27:06474] mca: base: components_open: found loaded component cm
> libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
> [1432283901.559929] sys.c:719 MXM WARN Conflicting CPU frequencies detected, using: 2601.00
> [1432283901.561294] [JARVICENAE27:6474 :0] ib_dev.c:573 MXM ERROR There are no Mellanox cards detected.
> [JARVICENAE27:06474] mca: base: close: component cm closed
> [JARVICENAE27:06474] mca: base: close: unloading component cm
> [JARVICENAE27:06474] mca: base: components_open: found loaded component ob1
> [JARVICENAE27:06474] mca: base: components_open: component ob1 open function successful
> [JARVICENAE27:06474] mca: base: components_open: found loaded component yalla
> [1432283901.561732] [JARVICENAE27:6474 :0] ib_dev.c:573 MXM ERROR There are no Mellanox cards detected.
> [JARVICENAE27:06474] mca: base: components_open: component yalla open function failed
> [JARVICENAE27:06474] mca: base: close: component yalla closed
> [JARVICENAE27:06474] mca: base: close: unloading component yalla
> [JARVICENAE27:06474] select: component v not in the include list
> [JARVICENAE27:06474] select: component bfo not in the include list
> [JARVICENAE27:06474] select: initializing pml component ob1
> [JARVICENAE27:06474] select: init returned priority 20
> [JARVICENAE27:06474] selected ob1 best priority 20
> [JARVICENAE27:06474] select: component ob1 selected
> [JARVICENAE27:06474] mca: base: close: component v closed
> [JARVICENAE27:06474] mca: base: close: unloading component v
> [JARVICENAE27:06474] mca: base: close: component bfo closed
> [JARVICENAE27:06474] mca: base: close: unloading component bfo
> [JARVICENAE125:00909] select: init returned priority 30
> [JARVICENAE125:00909] select: initializing pml component ob1
> [JARVICENAE125:00909] select: init returned failure for component ob1
> [JARVICENAE125:00909] select: initializing pml component yalla
> [JARVICENAE125:00909] select: init returned priority 50
> [JARVICENAE125:00909] selected yalla best priority 50
> [JARVICENAE125:00909] select: component cm not selected / finalized
> [JARVICENAE125:00909] select: component yalla selected
> [JARVICENAE125:00909] mca: base: close: component v closed
> [JARVICENAE125:00909] mca: base: close: unloading component v
> [JARVICENAE125:00909] mca: base: close: component bfo closed
> [JARVICENAE125:00909] mca: base: close: unloading component bfo
> [JARVICENAE125:00909] mca: base: close: component cm closed
> [JARVICENAE125:00909] mca: base: close: unloading component cm
> [JARVICENAE125:00909] mca: base: close: component ob1 closed
> [JARVICENAE125:00909] mca: base: close: unloading component ob1
> [JARVICENAE27:06474] check:select: modex not reqd
> 
> 
> On Wed, May 13, 2015 at 8:02 AM, Ralph Castain <r...@open-mpi.org> wrote:
> Okay, so we see two nodes have been allocated:
> 
> 1. JARVICENAE27 - appears to be the node where mpirun is running
> 
> 2. 10.3.0.176
> 
> Does that match what you expected?
> 
> If you cannot ssh (without a password) between machines, then we will not be able to run.
> 
>> On May 13, 2015, at 12:21 AM, Rahul Yadav <robora...@gmail.com> wrote:
>> 
>> I get the following output with verbose:
>> 
>> [JARVICENAE27:00654] mca: base: components_register: registering ras components
>> [JARVICENAE27:00654] mca: base: components_register: found loaded component loadleveler
>> [JARVICENAE27:00654] mca: base: components_register: component loadleveler register function successful
>> [JARVICENAE27:00654] mca: base: components_register: found loaded component simulator
>> [JARVICENAE27:00654] mca: base: components_register: component simulator register function successful
>> [JARVICENAE27:00654] mca: base: components_register: found loaded component slurm
>> [JARVICENAE27:00654] mca: base: components_register: component slurm register function successful
>> [JARVICENAE27:00654] mca: base: components_open: opening ras components
>> [JARVICENAE27:00654] mca: base: components_open: found loaded component loadleveler
>> [JARVICENAE27:00654] mca: base: components_open: component loadleveler open function successful
>> [JARVICENAE27:00654] mca: base: components_open: found loaded component simulator
>> [JARVICENAE27:00654] mca: base: components_open: found loaded component slurm
>> [JARVICENAE27:00654] mca: base: components_open: component slurm open function successful
>> [JARVICENAE27:00654] mca:base:select: Auto-selecting ras components
>> [JARVICENAE27:00654] mca:base:select:( ras) Querying component [loadleveler]
>> [JARVICENAE27:00654] mca:base:select:( ras) Skipping component [loadleveler]. Query failed to return a module
>> [JARVICENAE27:00654] mca:base:select:( ras) Querying component [simulator]
>> [JARVICENAE27:00654] mca:base:select:( ras) Skipping component [simulator]. Query failed to return a module
>> [JARVICENAE27:00654] mca:base:select:( ras) Querying component [slurm]
>> [JARVICENAE27:00654] mca:base:select:( ras) Skipping component [slurm]. Query failed to return a module
>> [JARVICENAE27:00654] mca:base:select:( ras) No component selected!
>> 
>> ====================== ALLOCATED NODES ======================
>> JARVICENAE27: slots=1 max_slots=0 slots_inuse=0 state=UP
>> 10.3.0.176: slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>> 
>> Also, I am not able to ssh from one machine to the other in the chroot environment. Can that be a problem?
>> 
>> Thanks
>> Rahul
>> 
>> On Thu, May 7, 2015 at 8:06 AM, Ralph Castain <r...@open-mpi.org> wrote:
>> Try adding --mca ras_base_verbose 10 to your cmd line and let’s see what it thinks it is doing. Which OMPI version are you using - master?
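For reference, adding the suggested ras_base_verbose setting to the command used in the later messages would look roughly like this; a sketch assembled from the thread, not a command quoted verbatim in it:

    mpirun --allow-run-as-root --mca ras_base_verbose 10 --mca pml_base_verbose 10 --mca pml yalla \
        -n 1 --hostfile /root/host1 /root/app2 : -n 1 --hostfile /root/host2 /root/backend

ras_base_verbose and pml_base_verbose are both standard MCA verbosity parameters, so they can be combined on the same command line that already selects pml yalla, giving the allocation and PML-selection output in one run.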
>> 
>> 
>>> On May 6, 2015, at 11:24 PM, Rahul Yadav <robora...@gmail.com> wrote:
>>> 
>>> Hi,
>>> 
>>> We have been trying to run MPI jobs (consisting of two different binaries, one on each node) on two nodes, using the hostfile option as follows:
>>> 
>>> mpirun --allow-run-as-root --mca pml yalla -n 1 --hostfile /root/host1 /root/app2 : -n 1 --hostfile /root/host2 /root/backend
>>> 
>>> We are doing this in a chroot environment. We have set up the HPCX environment in the chroot'ed environment itself. /root/host1 and /root/host2 (inside the chroot env) contain the IPs of the two nodes respectively.
>>> 
>>> We are getting the following error:
>>> 
>>> "all nodes which are allocated for this job are already filled"
>>> 
>>> However, when we use chroot but don't use the hostfile option (both processes run on the same node), OR we use the hostfile option but outside chroot, it works.
>>> 
>>> Does anyone have any idea whether chroot can cause the above error and how to solve it?
>>> 
>>> Thanks
>>> Rahul
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2015/05/26927.php
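As an aside on the hostfile format used above: an Open MPI hostfile lists one host per line, optionally with a slot count. A hypothetical sketch follows; these are not the actual contents of /root/host1 or /root/host2:

    # /root/host1 (illustrative only - use the real node name or IP)
    JARVICENAE27 slots=1

    # /root/host2 (illustrative only)
    10.3.0.176 slots=1

The "all nodes which are allocated for this job are already filled" error generally means mpirun could not map the requested processes into the slots it believes are available, so the slot counts in these files, and whether the chroot'ed mpirun can resolve and reach the hosts listed in them, are worth checking.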