You should ping the Rocks maintainers and ask them to upgrade. Open MPI 1.4.3 was released in September of 2010.
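For reference, both of these commands ship with every Open MPI install and report which version a given node is actually running, so they are an easy way to confirm what the Rocks roll put on each machine before and after an upgrade:

    shell$ ompi_info | grep "Open MPI:"
    shell$ mpirun --version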
On Apr 8, 2014, at 5:37 AM, Nisha Dhankher -M.Tech(CSE) <nishadhankher-coaese...@pau.edu> wrote:

> the latest Rocks 6.2 carries this version only
>
>
> On Tue, Apr 8, 2014 at 3:49 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> Open MPI 1.4.3 is *ancient*. Please upgrade -- we just released Open MPI 1.8 last week.
>
> Also, please look at this FAQ entry -- it steps you through a lot of basic troubleshooting steps for getting basic MPI programs working:
>
>     http://www.open-mpi.org/faq/?category=running#diagnose-multi-host-problems
>
> Once you get basic MPI programs working, then try with mpiBLAST.
>
>
> On Apr 5, 2014, at 3:11 AM, Nisha Dhankher -M.Tech(CSE) <nishadhankher-coaese...@pau.edu> wrote:
>
> > mpirun --mca btl ^openib --mca btl_tcp_if_include eth0 -np 16 -machinefile mf mpiblast -d all.fas -p blastn -i query.fas -o out.txt
> >
> > was the command I executed on the cluster...
> >
> >
> > On Sat, Apr 5, 2014 at 12:34 PM, Nisha Dhankher -M.Tech(CSE) <nishadhankher-coaese...@pau.edu> wrote:
> > sorry Ralph, my mistake, it's not "names"... it is "it does not happen on the same nodes."
> >
> >
> > On Sat, Apr 5, 2014 at 12:33 PM, Nisha Dhankher -M.Tech(CSE) <nishadhankher-coaese...@pau.edu> wrote:
> > the same VM on all machines, i.e. virt-manager
> >
> >
> > On Sat, Apr 5, 2014 at 12:32 PM, Nisha Dhankher -M.Tech(CSE) <nishadhankher-coaese...@pau.edu> wrote:
> > Open MPI version 1.4.3
> >
> >
> > On Fri, Apr 4, 2014 at 8:13 PM, Ralph Castain <r...@open-mpi.org> wrote:
> > Okay, so if you run mpiBLAST on all the non-"names" nodes, everything is okay? What do you mean by "names nodes"?
> >
> >
> > On Apr 4, 2014, at 7:32 AM, Nisha Dhankher -M.Tech(CSE) <nishadhankher-coaese...@pau.edu> wrote:
> >
> >> no, it does not happen on the "names" nodes
> >>
> >>
> >> On Fri, Apr 4, 2014 at 7:51 PM, Ralph Castain <r...@open-mpi.org> wrote:
> >> Hi Nisha
> >>
> >> I'm sorry if my questions appear abrasive - I'm just a little frustrated at the communication bottleneck, as I can't seem to get a clear picture of your situation. So you really don't need to keep calling me "sir" :-)
> >>
> >> The error you are hitting is very unusual - it means that the processes are able to make a connection, but are failing to correctly complete a simple handshake exchange of their process identifications. There are only a few ways that can happen, and I'm trying to get you to test for them.
> >>
> >> So let's try and see if we can narrow this down. You mention that it works on some machines, but not all. Is this consistent - i.e., is it always the same machines that work, and the same ones that generate the error? If you exclude the ones that show the error, does it work? If so, what is different about those nodes? Are they a different architecture?
> >>
> >>
> >> On Apr 3, 2014, at 11:09 PM, Nisha Dhankher -M.Tech(CSE) <nishadhankher-coaese...@pau.edu> wrote:
> >>
> >>> sir,
> >>> the same virt-manager is being used by all the PCs. No, I didn't enable openmpi-hetero. Yes, the Open MPI version is the same on all of them, through the same kickstart file.
> >>> ok... actually, sir, Rocks itself installed and configured Open MPI and MPICH on its own through the HPC roll.
> >>>
> >>>
> >>> On Fri, Apr 4, 2014 at 9:25 AM, Ralph Castain <r...@open-mpi.org> wrote:
> >>>
> >>> On Apr 3, 2014, at 8:03 PM, Nisha Dhankher -M.Tech(CSE) <nishadhankher-coaese...@pau.edu> wrote:
> >>>
> >>>> thank you Ralph.
> >>>> Yes, the cluster is heterogeneous...
> >>>
> >>> And did you configure OMPI with --enable-heterogeneous? And are you running it with --hetero-nodes? What version of OMPI are you using anyway?
> >>>
> >>> Note that we don't care if the host PCs are hetero - what we care about is the VM. If all the VMs are the same, then it shouldn't matter. However, most VM technologies don't handle hetero hardware very well - i.e., you can't emulate an x86 architecture on top of a Sparc or Power chip or vice versa.
> >>>
> >>>> And I haven't made compute nodes on the physical nodes (PCs) directly, because in college it is not possible to take a whole lab of 32 PCs for your work, so I ran on VMs.
> >>>
> >>> Yes, but at least it would let you test the setup by running MPI across even a couple of PCs - this is simple debugging practice.
> >>>
> >>>> In a Rocks cluster, the frontend gives the same kickstart to all the PCs, so the Open MPI version should be the same, I guess.
> >>>
> >>> Guess? or know? Makes a difference - might be worth testing.
> >>>
> >>>> Sir,
> >>>> mpiformatdb is a command to distribute database fragments to different compute nodes after partitioning of the database.
> >>>> And sir, have you done mpiblast?
> >>>
> >>> Nope - but that isn't the issue, is it? The issue is with the MPI setup.
> >>>
> >>>>
> >>>> On Fri, Apr 4, 2014 at 4:48 AM, Ralph Castain <r...@open-mpi.org> wrote:
> >>>> What is "mpiformatdb"? We don't have an MPI database in our system, and I have no idea what that command means.
> >>>>
> >>>> As for that error - it means that the identifier we exchange between processes is failing to be recognized. This could mean a couple of things:
> >>>>
> >>>> 1. the OMPI version on the two ends is different - could be you aren't getting the right paths set on the various machines
> >>>>
> >>>> 2. the cluster is heterogeneous
> >>>>
> >>>> You say you have "virtual nodes" running on various PCs? That would be an unusual setup - VMs can be problematic given the way they handle TCP connections, so that might be another source of the problem if my understanding of your setup is correct. Have you tried running this across the PCs directly - i.e., without any VMs?
> >>>>
> >>>>
> >>>> On Apr 3, 2014, at 10:13 AM, Nisha Dhankher -M.Tech(CSE) <nishadhankher-coaese...@pau.edu> wrote:
> >>>>
> >>>>> I first formatted my database with the mpiformatdb command, then I ran:
> >>>>> mpirun -np 64 -machinefile mf mpiblast -d all.fas -p blastn -i query.fas -o output.txt
> >>>>> but then it gave this error 113 from some hosts and continued to run for the others, but with no results even after 2 hours had elapsed... on a Rocks 6.0 cluster with 12 virtual nodes on PCs - 2 on each, using virt-manager, 1 GB RAM each.
> >>>>>
> >>>>>
> >>>>> On Thu, Apr 3, 2014 at 10:41 PM, Nisha Dhankher -M.Tech(CSE) <nishadhankher-coaese...@pau.edu> wrote:
> >>>>> I also made a machine file which contains the IP addresses of all compute nodes, plus a .ncbirc file with the path to mpiblast and the shared and local storage paths....
> >>>>> Sir,
> >>>>> I ran the same mpirun command on my college supercomputer, 8 nodes each having 24 processors, but it just kept running... it gave no result even after 3 hours...
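A minimal sketch of the kind of multi-host sanity check the FAQ entry quoted above walks through, reusing the machinefile `mf` from the commands in this thread (the process count is illustrative, and ring_c.c is the example program shipped in the Open MPI source tarball's examples/ directory):

    # Step 1: no MPI communication yet -- just verify that mpirun can
    # launch something on every node listed in the machinefile.
    shell$ mpirun -np 4 -machinefile mf hostname

    # Step 2: a trivial MPI program, so the TCP BTL actually has to
    # pass messages between hosts (the binary must be on a path visible
    # to all nodes, e.g. a shared home directory).
    shell$ mpicc ring_c.c -o ring_c
    shell$ mpirun -np 4 -machinefile mf ./ring_c

If step 1 already fails, the problem is remote launch (ssh, or PATH/LD_LIBRARY_PATH on the compute nodes); if step 1 works but step 2 hangs or dies with error 113, the problem is TCP connectivity between the VMs rather than anything in mpiBLAST.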
> >>>>>
> >>>>> On Thu, Apr 3, 2014 at 10:39 PM, Nisha Dhankher -M.Tech(CSE) <nishadhankher-coaese...@pau.edu> wrote:
> >>>>> I first formatted my database with the mpiformatdb command, then I ran:
> >>>>> mpirun -np 64 -machinefile mf mpiblast -d all.fas -p blastn -i query.fas -o output.txt
> >>>>> but then it gave this error 113 from some hosts and continued to run for the others, but with no results even after 2 hours had elapsed... on a Rocks 6.0 cluster with 12 virtual nodes on PCs - 2 on each, using virt-manager, 1 GB RAM each.
> >>>>>
> >>>>>
> >>>>> On Thu, Apr 3, 2014 at 8:37 PM, Ralph Castain <r...@open-mpi.org> wrote:
> >>>>> I'm having trouble understanding your note, so perhaps I am getting this wrong. Let's see if I can figure out what you said:
> >>>>>
> >>>>> * your perl command fails with "no route to host" - but I don't see any host in your cmd. Maybe I'm just missing something.
> >>>>>
> >>>>> * you tried running a couple of "mpirun" commands, but the mpirun command wasn't recognized? Is that correct?
> >>>>>
> >>>>> * you then ran mpiblast and it sounds like it successfully started the processes, but then one aborted? Was there an error message beyond just the -1 return status?
> >>>>>
> >>>>>
> >>>>> On Apr 2, 2014, at 11:17 PM, Nisha Dhankher -M.Tech(CSE) <nishadhankher-coaese...@pau.edu> wrote:
> >>>>>
> >>>>>> error: btl_tcp_endpoint.c:638 connection failed due to error 113
> >>>>>>
> >>>>>> In Open MPI, this error came when I ran my mpiblast program on the Rocks cluster. Connecting to the hosts at 10.1.255.236 and 10.1.255.244 failed. And when I run the following command:
> >>>>>>
> >>>>>>     linux_shell$ perl -e 'die$!=113'
> >>>>>>
> >>>>>> this message comes back: "No route to host at -e line 1."
> >>>>>>
> >>>>>>     shell$ mpirun --mca btl ^tcp
> >>>>>>     shell$ mpirun --mca btl_tcp_if_include eth1,eth2
> >>>>>>     shell$ mpirun --mca btl_tcp_if_include 10.1.255.244
> >>>>>>
> >>>>>> were also executed, but it did not recognize these commands... and aborted.... What should I do? When I ran my mpiblast program for the first time, it gave an mpi_abort error... "bailing out of signal -1 on rank 2 processor"... then I removed my public Ethernet cable... and then it gave the btl_tcp_endpoint error 113....
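Error 113 is Linux errno EHOSTUNREACH ("No route to host"), which is exactly what the perl one-liner above decodes: the failing processes cannot reach the peer's IP at all, so this points at IP routing and interface selection rather than anything inside MPI. A sketch of how one might narrow it down, assuming eth0 is the interface that carries the private 10.1.x.x addresses on these VMs (adjust the name to whatever the VMs actually use):

    # 1. Check plain IP reachability between the nodes that fail,
    #    with no MPI involved at all:
    shell$ ping -c 2 10.1.255.236
    shell$ ping -c 2 10.1.255.244

    # 2. Restrict the TCP BTL to the private cluster interface only.
    #    In the 1.4 series, btl_tcp_if_include takes interface names
    #    (e.g. eth0), not IP addresses.
    shell$ mpirun --mca btl tcp,sm,self --mca btl_tcp_if_include eth0 \
           -np 16 -machinefile mf mpiblast -d all.fas -p blastn -i query.fas -o out.txt

If the pings themselves fail between the VMs, the issue is almost certainly the virt-manager networking (bridging/NAT between the host PCs) rather than any Open MPI setting.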
--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/