Hi,

I ran my classic test (see below) with 1.3.3, and unfortunately it doesn't work. It seems that the modifications you made in openmpi-1.4a1r21142 (where the test passed) have not been correctly carried over into this release. Could you confirm that?
Thanks,

Geoffroy

***** BASIC TEST ******

cat rankf:
rank 0=node1 slot=*
rank 1=node2 slot=*

cat hostf:
node1 slots=2
node2 slots=2

mpirun --rankfile rankf --hostfile hostf --host node1 -n 1 hostname : --host node2 -n 1 hostname
--------------------------------------------------------------------------
Error, invalid rank (1) in the rankfile (rankf)
--------------------------------------------------------------------------
[rdmftd02:01726] [[60757,0],0] ORTE_ERROR_LOG: Bad parameter in file rmaps_rank_file.c at line 404
[rdmftd02:01726] [[60757,0],0] ORTE_ERROR_LOG: Bad parameter in file base/rmaps_base_map_job.c at line 87
[rdmftd02:01726] [[60757,0],0] ORTE_ERROR_LOG: Bad parameter in file base/plm_base_launch_support.c at line 77
[rdmftd02:01726] [[60757,0],0] ORTE_ERROR_LOG: Bad parameter in file plm_rsh_module.c at line 990

2009/7/15 Geoffroy Pignot <geopig...@gmail.com>

> Hi Lenny and Ralph,
>
> I saw nothing about rankfile in the 1.3.3 press release. Does it mean that
> the bug fixes are not included there?
> Thanks
>
> Geoffroy
>
> 2009/7/15 <users-requ...@open-mpi.org>
>
>> Message: 1
>> Date: Wed, 15 Jul 2009 15:08:39 +0300
>> From: Lenny Verkhovsky <lenny.verkhov...@gmail.com>
>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>> To: Open MPI Users <us...@open-mpi.org>
>>
>> Same result.
>> I still suspect that the rankfile looks for the node in the small hostlist
>> provided by the line in the app file, and not in the hostlist provided to
>> mpirun on the HNP node.
>> If my suspicion is right, your proposal should not work (and it does not),
>> since in the appfile line I provide np=1 and one host, while the rankfile
>> tries to allocate all ranks (np=2).
>>
>> $orte/mca/rmaps/rank_file/rmaps_rank_file.c at line 338
>>
>> if (ORTE_SUCCESS != (rc = orte_rmaps_base_get_target_nodes(&node_list, &num_slots,
>>                                                            app, map->policy))) {
>>
>> node_list will be partial, according to the app, and not the full list
>> provided on the mpirun command line. If I don't provide a hostlist in the
>> appfile line, mpirun uses the local host and not the hosts from the hostfile.
>>
>> Tell me if I am wrong in expecting the following behavior:
>>
>> I provide to mpirun the NP, the full hostlist, the full rankfile and the appfile;
>> in the appfile I provide only a partial NP and a partial hostlist;
>> and it works.
>>
>> Currently, in order to get it working I need to provide the full hostlist
>> in the appfile, which is quite problematic.
>>
>> $mpirun -np 2 -rf rankfile -app appfile
>> --------------------------------------------------------------------------
>> Rankfile claimed host +n1 by index that is bigger than number of allocated hosts.
>> --------------------------------------------------------------------------
>> [dellix7:17277] [[23928,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../../../../orte/mca/rmaps/rank_file/rmaps_rank_file.c at line 422
>> [dellix7:17277] [[23928,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../../../orte/mca/rmaps/base/rmaps_base_map_job.c at line 85
>> [dellix7:17277] [[23928,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../../../orte/mca/plm/base/plm_base_launch_support.c at line 103
>> [dellix7:17277] [[23928,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../../../../orte/mca/plm/rsh/plm_rsh_module.c at line 1001
>>
>> Thanks
>> Lenny.
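To make Lenny's observation concrete: a relative rankfile host such as +n1 is an index into a host list, and whether that index is in range depends on which list is consulted. The following is only a small standalone C sketch (not the actual ORTE mapper code; resolve_relative_host is a made-up helper) showing how the same "+n1" resolves against the full two-host list given to mpirun but falls out of range against the one-host list taken from a single appfile line:

#include <stdio.h>
#include <stdlib.h>

/* Made-up helper (not an ORTE function): resolve a relative host spec
 * of the form "+nX" against a host list of length n. */
static const char *resolve_relative_host(const char *spec,
                                         const char **hosts, int n)
{
    if (spec[0] == '+' && spec[1] == 'n') {
        int idx = atoi(&spec[2]);
        if (idx < 0 || idx >= n) {
            fprintf(stderr, "Rankfile claimed host %s by index that is "
                            "bigger than number of allocated hosts.\n", spec);
            return NULL;
        }
        return hosts[idx];
    }
    return spec; /* absolute host name, use it as-is */
}

int main(void)
{
    const char *full_list[]    = { "witch1", "witch2" }; /* hosts given to mpirun */
    const char *partial_list[] = { "witch1" };           /* hosts from one appfile line */

    const char *a = resolve_relative_host("+n1", full_list, 2);
    printf("against the full job hostlist:      %s\n", a ? a : "(error)");

    const char *b = resolve_relative_host("+n1", partial_list, 1);
    printf("against a per-app-context hostlist: %s\n", b ? b : "(error)");
    return 0;
}

The second call reproduces the "bigger than number of allocated hosts" failure Lenny reports, which is why repeating the full hostlist on every appfile line currently serves as the workaround.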
>> On Wed, Jul 15, 2009 at 2:02 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> > Try your "not working" example without the -H on the mpirun cmd line -
>> > i.e., just use "mpirun -np 2 -rf rankfile -app appfile". Does that work?
>> > Sorry to have to keep asking you to try things - I don't have a setup here
>> > where I can test this, as everything is RM managed.
>> >
>> > On Jul 15, 2009, at 12:09 AM, Lenny Verkhovsky wrote:
>> >
>> > Thanks Ralph, after playing with prefixes it worked.
>> >
>> > I still have a problem running an app file with a rankfile while providing
>> > the full hostlist on the mpirun command line and not in the app file.
>> > Is this planned behaviour, or can it be fixed?
>> >
>> > See working example:
>> >
>> > $cat rankfile
>> > rank 0=+n1 slot=0
>> > rank 1=+n0 slot=0
>> > $cat appfile
>> > -np 1 -H witch1,witch2 ./hello_world
>> > -np 1 -H witch1,witch2 ./hello_world
>> >
>> > $mpirun -rf rankfile -app appfile
>> > Hello world! I'm 1 of 2 on witch1
>> > Hello world! I'm 0 of 2 on witch2
>> >
>> > See NOT working example:
>> >
>> > $cat appfile
>> > -np 1 -H witch1 ./hello_world
>> > -np 1 -H witch2 ./hello_world
>> > $mpirun -np 2 -H witch1,witch2 -rf rankfile -app appfile
>> > --------------------------------------------------------------------------
>> > Rankfile claimed host +n1 by index that is bigger than number of allocated hosts.
>> > --------------------------------------------------------------------------
>> > [dellix7:16405] [[24080,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../../../../orte/mca/rmaps/rank_file/rmaps_rank_file.c at line 422
>> > [dellix7:16405] [[24080,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../../../orte/mca/rmaps/base/rmaps_base_map_job.c at line 85
>> > [dellix7:16405] [[24080,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../../../orte/mca/plm/base/plm_base_launch_support.c at line 103
>> > [dellix7:16405] [[24080,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../../../../orte/mca/plm/rsh/plm_rsh_module.c at line 1001
>> >
>> > On Wed, Jul 15, 2009 at 6:58 AM, Ralph Castain <r...@open-mpi.org> wrote:
>> >
>> >> Took a deeper look into this, and I think that your first guess was correct.
>> >> When we changed hostfile and -host to be per-app-context options, it
>> >> became necessary for you to put that info in the appfile itself. So try
>> >> adding it there. What you would need in your appfile is the following:
>> >>
>> >> -np 1 -H witch1 hostname
>> >> -np 1 -H witch2 hostname
>> >>
>> >> That should get you what you want.
>> >> Ralph
>> >>
>> >> On Jul 14, 2009, at 10:29 AM, Lenny Verkhovsky wrote:
>> >>
>> >> No, it's not working as I expect, unless I expect something wrong.
>> >> ( sorry for the long PATH, I needed to provide it ) >> >> >> >> >> $LD_LIBRARY_PATH=/hpc/home/USERS/lennyb/work/svn/ompi/trunk/build_x86-64/install/lib/ >> >> >> /hpc/home/USERS/lennyb/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun >> >> -np 2 -H witch1,witch2 hostname >> >> witch1 >> >> witch2 >> >> >> >> >> $LD_LIBRARY_PATH=/hpc/home/USERS/lennyb/work/svn/ompi/trunk/build_x86-64/install/lib/ >> >> >> /hpc/home/USERS/lennyb/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun >> >> -np 2 -H witch1,witch2 -app appfile >> >> dellix7 >> >> dellix7 >> >> $cat appfile >> >> -np 1 hostname >> >> -np 1 hostname >> >> >> >> >> >> On Tue, Jul 14, 2009 at 7:08 PM, Ralph Castain <r...@open-mpi.org> >> wrote: >> >> >> >>> Run it without the appfile, just putting the apps on the cmd line - >> does >> >>> it work right then? >> >>> >> >>> On Jul 14, 2009, at 10:04 AM, Lenny Verkhovsky wrote: >> >>> >> >>> additional info >> >>> I am running mpirun on hostA, and providing hostlist with hostB and >> >>> hostC. >> >>> I expect that each application would run on hostB and hostC, but I get >> >>> all of them running on hostA. >> >>> dellix7$cat appfile >> >>> -np 1 hostname >> >>> -np 1 hostname >> >>> dellix7$mpirun -np 2 -H witch1,witch2 -app appfile >> >>> dellix7 >> >>> dellix7 >> >>> Thanks >> >>> Lenny. >> >>> >> >>> On Tue, Jul 14, 2009 at 4:59 PM, Ralph Castain <r...@open-mpi.org> >> wrote: >> >>> >> >>>> Strange - let me have a look at it later today. Probably something >> >>>> simple that another pair of eyes might spot. >> >>>> On Jul 14, 2009, at 7:43 AM, Lenny Verkhovsky wrote: >> >>>> >> >>>> Seems like connected problem: >> >>>> I can't use rankfile with app, even after all those fixes ( working >> with >> >>>> trunk 1.4a1r21657). >> >>>> This is my case : >> >>>> >> >>>> $cat rankfile >> >>>> rank 0=+n1 slot=0 >> >>>> rank 1=+n0 slot=0 >> >>>> $cat appfile >> >>>> -np 1 hostname >> >>>> -np 1 hostname >> >>>> $mpirun -np 2 -H witch1,witch2 -rf rankfile -app appfile >> >>>> >> >>>> >> -------------------------------------------------------------------------- >> >>>> Rankfile claimed host +n1 by index that is bigger than number of >> >>>> allocated hosts. >> >>>> >> >>>> >> -------------------------------------------------------------------------- >> >>>> [dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file >> >>>> ../../../../../orte/mca/rmaps/rank_file/rmaps_rank_file.c at line 422 >> >>>> [dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file >> >>>> ../../../../orte/mca/rmaps/base/rmaps_base_map_job.c at line 85 >> >>>> [dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file >> >>>> ../../../../orte/mca/plm/base/plm_base_launch_support.c at line 103 >> >>>> [dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file >> >>>> ../../../../../orte/mca/plm/rsh/plm_rsh_module.c at line 1001 >> >>>> >> >>>> >> >>>> The problem is, that rankfile mapper tries to find an appropriate >> host >> >>>> in the partial ( and not full ) hostlist. >> >>>> >> >>>> Any suggestions how to fix it? >> >>>> >> >>>> Thanks >> >>>> Lenny. >> >>>> >> >>>> On Wed, May 13, 2009 at 1:55 AM, Ralph Castain <r...@open-mpi.org >> >wrote: >> >>>> >> >>>>> Okay, I fixed this today too....r21219 >> >>>>> >> >>>>> >> >>>>> On May 11, 2009, at 11:27 PM, Anton Starikov wrote: >> >>>>> >> >>>>> Now there is another problem :) >> >>>>>> >> >>>>>> You can try oversubscribe node. At least by 1 task. 
>> >>>>>> If you hostfile and rank file limit you at N procs, you can ask >> mpirun >> >>>>>> for N+1 and it wil be not rejected. >> >>>>>> Although in reality there will be N tasks. >> >>>>>> So, if your hostfile limit is 4, then "mpirun -np 4" and "mpirun >> -np >> >>>>>> 5" both works, but in both cases there are only 4 tasks. It isn't >> crucial, >> >>>>>> because there is nor real oversubscription, but there is still some >> bug >> >>>>>> which can affect something in future. >> >>>>>> >> >>>>>> -- >> >>>>>> Anton Starikov. >> >>>>>> >> >>>>>> On May 12, 2009, at 1:45 AM, Ralph Castain wrote: >> >>>>>> >> >>>>>> This is fixed as of r21208. >> >>>>>>> >> >>>>>>> Thanks for reporting it! >> >>>>>>> Ralph >> >>>>>>> >> >>>>>>> >> >>>>>>> On May 11, 2009, at 12:51 PM, Anton Starikov wrote: >> >>>>>>> >> >>>>>>> Although removing this check solves problem of having more slots >> in >> >>>>>>>> rankfile than necessary, there is another problem. >> >>>>>>>> >> >>>>>>>> If I set rmaps_base_no_oversubscribe=1 then if, for example: >> >>>>>>>> >> >>>>>>>> >> >>>>>>>> hostfile: >> >>>>>>>> >> >>>>>>>> node01 >> >>>>>>>> node01 >> >>>>>>>> node02 >> >>>>>>>> node02 >> >>>>>>>> >> >>>>>>>> rankfile: >> >>>>>>>> >> >>>>>>>> rank 0=node01 slot=1 >> >>>>>>>> rank 1=node01 slot=0 >> >>>>>>>> rank 2=node02 slot=1 >> >>>>>>>> rank 3=node02 slot=0 >> >>>>>>>> >> >>>>>>>> mpirun -np 4 ./something >> >>>>>>>> >> >>>>>>>> complains with: >> >>>>>>>> >> >>>>>>>> "There are not enough slots available in the system to satisfy >> the 4 >> >>>>>>>> slots >> >>>>>>>> that were requested by the application" >> >>>>>>>> >> >>>>>>>> but "mpirun -np 3 ./something" will work though. It works, when >> you >> >>>>>>>> ask for 1 CPU less. And the same behavior in any case (shared >> nodes, >> >>>>>>>> non-shared nodes, multi-node) >> >>>>>>>> >> >>>>>>>> If you switch off rmaps_base_no_oversubscribe, then it works and >> all >> >>>>>>>> affinities set as it requested in rankfile, there is no >> oversubscription. >> >>>>>>>> >> >>>>>>>> >> >>>>>>>> Anton. >> >>>>>>>> >> >>>>>>>> On May 5, 2009, at 3:08 PM, Ralph Castain wrote: >> >>>>>>>> >> >>>>>>>> Ah - thx for catching that, I'll remove that check. It no longer >> is >> >>>>>>>>> required. >> >>>>>>>>> >> >>>>>>>>> Thx! >> >>>>>>>>> >> >>>>>>>>> On Tue, May 5, 2009 at 7:04 AM, Lenny Verkhovsky < >> >>>>>>>>> lenny.verkhov...@gmail.com> wrote: >> >>>>>>>>> According to the code it does cares. >> >>>>>>>>> >> >>>>>>>>> $vi orte/mca/rmaps/rank_file/rmaps_rank_file.c +572 >> >>>>>>>>> >> >>>>>>>>> ival = orte_rmaps_rank_file_value.ival; >> >>>>>>>>> if ( ival > (np-1) ) { >> >>>>>>>>> orte_show_help("help-rmaps_rank_file.txt", "bad-rankfile", true, >> >>>>>>>>> ival, rankfile); >> >>>>>>>>> rc = ORTE_ERR_BAD_PARAM; >> >>>>>>>>> goto unlock; >> >>>>>>>>> } >> >>>>>>>>> >> >>>>>>>>> If I remember correctly, I used an array to map ranks, and since >> >>>>>>>>> the length of array is NP, maximum index must be less than np, >> so if you >> >>>>>>>>> have the number of rank > NP, you have no place to put it inside >> array. >> >>>>>>>>> >> >>>>>>>>> "Likewise, if you have more procs than the rankfile specifies, >> we >> >>>>>>>>> map the additional procs either byslot (default) or bynode (if >> you specify >> >>>>>>>>> that option). So the rankfile doesn't need to contain an entry >> for every >> >>>>>>>>> proc." - Correct point. >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> Lenny. 
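The bounds check Lenny quotes above is easy to see in isolation. Below is only a minimal standalone C sketch (not the ORTE source; check_rank is a made-up stand-in for the quoted code) of why a rankfile entry for rank 1 is rejected whenever the mapper only sees np=1:

#include <stdio.h>

/* Made-up stand-in for the quoted check in rmaps_rank_file.c: a rank
 * index from the rankfile must fit into an array of length np, so any
 * ival greater than np-1 is rejected. */
static int check_rank(int ival, int np, const char *rankfile)
{
    if (ival > (np - 1)) {
        fprintf(stderr, "Error, invalid rank (%d) in the rankfile (%s)\n",
                ival, rankfile);
        return -1; /* plays the role of ORTE_ERR_BAD_PARAM */
    }
    return 0;
}

int main(void)
{
    /* np counted across the whole job: ranks 0 and 1 both fit. */
    check_rank(0, 2, "rankf");
    check_rank(1, 2, "rankf");

    /* np taken from a single "-n 1" app context: rank 1 trips the check,
     * producing the "invalid rank (1)" message seen earlier in this thread. */
    check_rank(1, 1, "rankf");
    return 0;
}

Whether np is counted over the whole job or over a single app context is what decides if a two-entry rankfile like Geoffroy's is accepted or rejected.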
>> >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> On 5/5/09, Ralph Castain <r...@open-mpi.org> wrote: Sorry Lenny, >> >>>>>>>>> but that isn't correct. The rankfile mapper doesn't care if the >> rankfile >> >>>>>>>>> contains additional info - it only maps up to the number of >> processes, and >> >>>>>>>>> ignores anything beyond that number. So there is no need to >> remove the >> >>>>>>>>> additional info. >> >>>>>>>>> >> >>>>>>>>> Likewise, if you have more procs than the rankfile specifies, we >> >>>>>>>>> map the additional procs either byslot (default) or bynode (if >> you specify >> >>>>>>>>> that option). So the rankfile doesn't need to contain an entry >> for every >> >>>>>>>>> proc. >> >>>>>>>>> >> >>>>>>>>> Just don't want to confuse folks. >> >>>>>>>>> Ralph >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> On Tue, May 5, 2009 at 5:59 AM, Lenny Verkhovsky < >> >>>>>>>>> lenny.verkhov...@gmail.com> wrote: >> >>>>>>>>> Hi, >> >>>>>>>>> maximum rank number must be less then np. >> >>>>>>>>> if np=1 then there is only rank 0 in the system, so rank 1 is >> >>>>>>>>> invalid. >> >>>>>>>>> please remove "rank 1=node2 slot=*" from the rankfile >> >>>>>>>>> Best regards, >> >>>>>>>>> Lenny. >> >>>>>>>>> >> >>>>>>>>> On Mon, May 4, 2009 at 11:14 AM, Geoffroy Pignot < >> >>>>>>>>> geopig...@gmail.com> wrote: >> >>>>>>>>> Hi , >> >>>>>>>>> >> >>>>>>>>> I got the openmpi-1.4a1r21095.tar.gz tarball, but unfortunately >> my >> >>>>>>>>> command doesn't work >> >>>>>>>>> >> >>>>>>>>> cat rankf: >> >>>>>>>>> rank 0=node1 slot=* >> >>>>>>>>> rank 1=node2 slot=* >> >>>>>>>>> >> >>>>>>>>> cat hostf: >> >>>>>>>>> node1 slots=2 >> >>>>>>>>> node2 slots=2 >> >>>>>>>>> >> >>>>>>>>> mpirun --rankfile rankf --hostfile hostf --host node1 -n 1 >> >>>>>>>>> hostname : --host node2 -n 1 hostname >> >>>>>>>>> >> >>>>>>>>> Error, invalid rank (1) in the rankfile (rankf) >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> >> -------------------------------------------------------------------------- >> >>>>>>>>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in >> >>>>>>>>> file rmaps_rank_file.c at line 403 >> >>>>>>>>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in >> >>>>>>>>> file base/rmaps_base_map_job.c at line 86 >> >>>>>>>>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in >> >>>>>>>>> file base/plm_base_launch_support.c at line 86 >> >>>>>>>>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in >> >>>>>>>>> file plm_rsh_module.c at line 1016 >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> Ralph, could you tell me if my command syntax is correct or not >> ? >> >>>>>>>>> if not, give me the expected one ? >> >>>>>>>>> >> >>>>>>>>> Regards >> >>>>>>>>> >> >>>>>>>>> Geoffroy >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> 2009/4/30 Geoffroy Pignot <geopig...@gmail.com> >> >>>>>>>>> >> >>>>>>>>> Immediately Sir !!! :) >> >>>>>>>>> >> >>>>>>>>> Thanks again Ralph >> >>>>>>>>> >> >>>>>>>>> Geoffroy >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> ------------------------------ >> >>>>>>>>> >> >>>>>>>>> Message: 2 >> >>>>>>>>> Date: Thu, 30 Apr 2009 06:45:39 -0600 >> >>>>>>>>> From: Ralph Castain <r...@open-mpi.org> >> >>>>>>>>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ?? 
>> >>>>>>>>> To: Open MPI Users <us...@open-mpi.org> >> >>>>>>>>> Message-ID: >> >>>>>>>>> <71d2d8cc0904300545v61a42fe1k50086d2704d0f...@mail.gmail.com >> > >> >>>>>>>>> Content-Type: text/plain; charset="iso-8859-1" >> >>>>>>>>> >> >>>>>>>>> I believe this is fixed now in our development trunk - you can >> >>>>>>>>> download any >> >>>>>>>>> tarball starting from last night and give it a try, if you like. >> >>>>>>>>> Any >> >>>>>>>>> feedback would be appreciated. >> >>>>>>>>> >> >>>>>>>>> Ralph >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> On Apr 14, 2009, at 7:57 AM, Ralph Castain wrote: >> >>>>>>>>> >> >>>>>>>>> Ah now, I didn't say it -worked-, did I? :-) >> >>>>>>>>> >> >>>>>>>>> Clearly a bug exists in the program. I'll try to take a look at >> it >> >>>>>>>>> (if Lenny >> >>>>>>>>> doesn't get to it first), but it won't be until later in the >> week. >> >>>>>>>>> >> >>>>>>>>> On Apr 14, 2009, at 7:18 AM, Geoffroy Pignot wrote: >> >>>>>>>>> >> >>>>>>>>> I agree with you Ralph , and that 's what I expect from openmpi >> but >> >>>>>>>>> my >> >>>>>>>>> second example shows that it's not working >> >>>>>>>>> >> >>>>>>>>> cat hostfile.0 >> >>>>>>>>> r011n002 slots=4 >> >>>>>>>>> r011n003 slots=4 >> >>>>>>>>> >> >>>>>>>>> cat rankfile.0 >> >>>>>>>>> rank 0=r011n002 slot=0 >> >>>>>>>>> rank 1=r011n003 slot=1 >> >>>>>>>>> >> >>>>>>>>> mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 >> >>>>>>>>> hostname >> >>>>>>>>> ### CRASHED >> >>>>>>>>> >> >>>>>>>>> > > Error, invalid rank (1) in the rankfile (rankfile.0) >> >>>>>>>>> > > >> >>>>>>>>> > >> >>>>>>>>> >> -------------------------------------------------------------------------- >> >>>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter >> in >> >>>>>>>>> file >> >>>>>>>>> > > rmaps_rank_file.c at line 404 >> >>>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter >> in >> >>>>>>>>> file >> >>>>>>>>> > > base/rmaps_base_map_job.c at line 87 >> >>>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter >> in >> >>>>>>>>> file >> >>>>>>>>> > > base/plm_base_launch_support.c at line 77 >> >>>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter >> in >> >>>>>>>>> file >> >>>>>>>>> > > plm_rsh_module.c at line 985 >> >>>>>>>>> > > >> >>>>>>>>> > >> >>>>>>>>> >> -------------------------------------------------------------------------- >> >>>>>>>>> > > A daemon (pid unknown) died unexpectedly on signal 1 while >> >>>>>>>>> > attempting to >> >>>>>>>>> > > launch so we are aborting. >> >>>>>>>>> > > >> >>>>>>>>> > > There may be more information reported by the environment >> (see >> >>>>>>>>> > above). >> >>>>>>>>> > > >> >>>>>>>>> > > This may be because the daemon was unable to find all the >> >>>>>>>>> needed >> >>>>>>>>> > shared >> >>>>>>>>> > > libraries on the remote node. You may set your >> LD_LIBRARY_PATH >> >>>>>>>>> to >> >>>>>>>>> > have the >> >>>>>>>>> > > location of the shared libraries on the remote nodes and >> this >> >>>>>>>>> will >> >>>>>>>>> > > automatically be forwarded to the remote nodes. >> >>>>>>>>> > > >> >>>>>>>>> > >> >>>>>>>>> >> -------------------------------------------------------------------------- >> >>>>>>>>> > > >> >>>>>>>>> > >> >>>>>>>>> >> -------------------------------------------------------------------------- >> >>>>>>>>> > > orterun noticed that the job aborted, but has no info as to >> the >> >>>>>>>>> > process >> >>>>>>>>> > > that caused that situation. 
>> >>>>>>>>> > > >> >>>>>>>>> > >> >>>>>>>>> >> -------------------------------------------------------------------------- >> >>>>>>>>> > > orterun: clean termination accomplished >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> Message: 4 >> >>>>>>>>> Date: Tue, 14 Apr 2009 06:55:58 -0600 >> >>>>>>>>> From: Ralph Castain <r...@lanl.gov> >> >>>>>>>>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ?? >> >>>>>>>>> To: Open MPI Users <us...@open-mpi.org> >> >>>>>>>>> Message-ID: <f6290ada-a196-43f0-a853-cbcb802d8...@lanl.gov> >> >>>>>>>>> Content-Type: text/plain; charset="us-ascii"; Format="flowed"; >> >>>>>>>>> DelSp="yes" >> >>>>>>>>> >> >>>>>>>>> The rankfile cuts across the entire job - it isn't applied on an >> >>>>>>>>> app_context basis. So the ranks in your rankfile must correspond >> to >> >>>>>>>>> the eventual rank of each process in the cmd line. >> >>>>>>>>> >> >>>>>>>>> Unfortunately, that means you have to count ranks. In your case, >> >>>>>>>>> you >> >>>>>>>>> only have four, so that makes life easier. Your rankfile would >> look >> >>>>>>>>> something like this: >> >>>>>>>>> >> >>>>>>>>> rank 0=r001n001 slot=0 >> >>>>>>>>> rank 1=r001n002 slot=1 >> >>>>>>>>> rank 2=r001n001 slot=1 >> >>>>>>>>> rank 3=r001n002 slot=2 >> >>>>>>>>> >> >>>>>>>>> HTH >> >>>>>>>>> Ralph >> >>>>>>>>> >> >>>>>>>>> On Apr 14, 2009, at 12:19 AM, Geoffroy Pignot wrote: >> >>>>>>>>> >> >>>>>>>>> > Hi, >> >>>>>>>>> > >> >>>>>>>>> > I agree that my examples are not very clear. What I want to do >> is >> >>>>>>>>> to >> >>>>>>>>> > launch a multiexes application (masters-slaves) and benefit >> from >> >>>>>>>>> the >> >>>>>>>>> > processor affinity. >> >>>>>>>>> > Could you show me how to convert this command , using -rf >> option >> >>>>>>>>> > (whatever the affinity is) >> >>>>>>>>> > >> >>>>>>>>> > mpirun -n 1 -host r001n001 master.x options1 : -n 1 -host >> >>>>>>>>> r001n002 >> >>>>>>>>> > master.x options2 : -n 1 -host r001n001 slave.x options3 : -n >> 1 - >> >>>>>>>>> > host r001n002 slave.x options4 >> >>>>>>>>> > >> >>>>>>>>> > Thanks for your help >> >>>>>>>>> > >> >>>>>>>>> > Geoffroy >> >>>>>>>>> > >> >>>>>>>>> > >> >>>>>>>>> > >> >>>>>>>>> > >> >>>>>>>>> > >> >>>>>>>>> > Message: 2 >> >>>>>>>>> > Date: Sun, 12 Apr 2009 18:26:35 +0300 >> >>>>>>>>> > From: Lenny Verkhovsky <lenny.verkhov...@gmail.com> >> >>>>>>>>> > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ?? >> >>>>>>>>> > To: Open MPI Users <us...@open-mpi.org> >> >>>>>>>>> > Message-ID: >> >>>>>>>>> > < >> >>>>>>>>> 453d39990904120826t2e1d1d33l7bb1fe3de65b5...@mail.gmail.com> >> >>>>>>>>> > Content-Type: text/plain; charset="iso-8859-1" >> >>>>>>>>> > >> >>>>>>>>> > Hi, >> >>>>>>>>> > >> >>>>>>>>> > The first "crash" is OK, since your rankfile has ranks 0 and 1 >> >>>>>>>>> > defined, >> >>>>>>>>> > while n=1, which means only rank 0 is present and can be >> >>>>>>>>> allocated. >> >>>>>>>>> > >> >>>>>>>>> > NP must be >= the largest rank in rankfile. >> >>>>>>>>> > >> >>>>>>>>> > What exactly are you trying to do ? 
>> >>>>>>>>> > >> >>>>>>>>> > I tried to recreate your seqv but all I got was >> >>>>>>>>> > >> >>>>>>>>> > ~/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun >> --hostfile >> >>>>>>>>> > hostfile.0 >> >>>>>>>>> > -rf rankfile.0 -n 1 hostname : -rf rankfile.1 -n 1 hostname >> >>>>>>>>> > [witch19:30798] mca: base: component_find: paffinity >> >>>>>>>>> > "mca_paffinity_linux" >> >>>>>>>>> > uses an MCA interface that is not recognized (component MCA >> >>>>>>>>> v1.0.0 != >> >>>>>>>>> > supported MCA v2.0.0) -- ignored >> >>>>>>>>> > >> >>>>>>>>> >> -------------------------------------------------------------------------- >> >>>>>>>>> > It looks like opal_init failed for some reason; your parallel >> >>>>>>>>> > process is >> >>>>>>>>> > likely to abort. There are many reasons that a parallel >> process >> >>>>>>>>> can >> >>>>>>>>> > fail during opal_init; some of which are due to configuration >> or >> >>>>>>>>> > environment problems. This failure appears to be an internal >> >>>>>>>>> failure; >> >>>>>>>>> > here's some additional information (which may only be relevant >> to >> >>>>>>>>> an >> >>>>>>>>> > Open MPI developer): >> >>>>>>>>> > >> >>>>>>>>> > opal_carto_base_select failed >> >>>>>>>>> > --> Returned value -13 instead of OPAL_SUCCESS >> >>>>>>>>> > >> >>>>>>>>> >> -------------------------------------------------------------------------- >> >>>>>>>>> > [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found >> in >> >>>>>>>>> file >> >>>>>>>>> > ../../orte/runtime/orte_init.c at line 78 >> >>>>>>>>> > [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found >> in >> >>>>>>>>> file >> >>>>>>>>> > ../../orte/orted/orted_main.c at line 344 >> >>>>>>>>> > >> >>>>>>>>> >> -------------------------------------------------------------------------- >> >>>>>>>>> > A daemon (pid 11629) died unexpectedly with status 243 while >> >>>>>>>>> > attempting >> >>>>>>>>> > to launch so we are aborting. >> >>>>>>>>> > >> >>>>>>>>> > There may be more information reported by the environment (see >> >>>>>>>>> above). >> >>>>>>>>> > >> >>>>>>>>> > This may be because the daemon was unable to find all the >> needed >> >>>>>>>>> > shared >> >>>>>>>>> > libraries on the remote node. You may set your LD_LIBRARY_PATH >> to >> >>>>>>>>> > have the >> >>>>>>>>> > location of the shared libraries on the remote nodes and this >> >>>>>>>>> will >> >>>>>>>>> > automatically be forwarded to the remote nodes. >> >>>>>>>>> > >> >>>>>>>>> >> -------------------------------------------------------------------------- >> >>>>>>>>> > >> >>>>>>>>> >> -------------------------------------------------------------------------- >> >>>>>>>>> > mpirun noticed that the job aborted, but has no info as to the >> >>>>>>>>> process >> >>>>>>>>> > that caused that situation. >> >>>>>>>>> > >> >>>>>>>>> >> -------------------------------------------------------------------------- >> >>>>>>>>> > mpirun: clean termination accomplished >> >>>>>>>>> > >> >>>>>>>>> > >> >>>>>>>>> > Lenny. >> >>>>>>>>> > >> >>>>>>>>> > >> >>>>>>>>> > On 4/10/09, Geoffroy Pignot <geopig...@gmail.com> wrote: >> >>>>>>>>> > > >> >>>>>>>>> > > Hi , >> >>>>>>>>> > > >> >>>>>>>>> > > I am currently testing the process affinity capabilities of >> >>>>>>>>> > openmpi and I >> >>>>>>>>> > > would like to know if the rankfile behaviour I will describe >> >>>>>>>>> below >> >>>>>>>>> > is normal >> >>>>>>>>> > > or not ? 
>> >>>>>>>>> > > >> >>>>>>>>> > > cat hostfile.0 >> >>>>>>>>> > > r011n002 slots=4 >> >>>>>>>>> > > r011n003 slots=4 >> >>>>>>>>> > > >> >>>>>>>>> > > cat rankfile.0 >> >>>>>>>>> > > rank 0=r011n002 slot=0 >> >>>>>>>>> > > rank 1=r011n003 slot=1 >> >>>>>>>>> > > >> >>>>>>>>> > > >> >>>>>>>>> > > >> >>>>>>>>> > >> >>>>>>>>> >> >>>>>>>>> >> ################################################################################## >> >>>>>>>>> > > >> >>>>>>>>> > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 2 hostname >> ### >> >>>>>>>>> OK >> >>>>>>>>> > > r011n002 >> >>>>>>>>> > > r011n003 >> >>>>>>>>> > > >> >>>>>>>>> > > >> >>>>>>>>> > > >> >>>>>>>>> > >> >>>>>>>>> >> >>>>>>>>> >> ################################################################################## >> >>>>>>>>> > > but >> >>>>>>>>> > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : >> -n >> >>>>>>>>> 1 >> >>>>>>>>> > hostname >> >>>>>>>>> > > ### CRASHED >> >>>>>>>>> > > * >> >>>>>>>>> > > >> >>>>>>>>> > >> >>>>>>>>> >> -------------------------------------------------------------------------- >> >>>>>>>>> > > Error, invalid rank (1) in the rankfile (rankfile.0) >> >>>>>>>>> > > >> >>>>>>>>> > >> >>>>>>>>> >> -------------------------------------------------------------------------- >> >>>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter >> in >> >>>>>>>>> file >> >>>>>>>>> > > rmaps_rank_file.c at line 404 >> >>>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter >> in >> >>>>>>>>> file >> >>>>>>>>> > > base/rmaps_base_map_job.c at line 87 >> >>>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter >> in >> >>>>>>>>> file >> >>>>>>>>> > > base/plm_base_launch_support.c at line 77 >> >>>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter >> in >> >>>>>>>>> file >> >>>>>>>>> > > plm_rsh_module.c at line 985 >> >>>>>>>>> > > >> >>>>>>>>> > >> >>>>>>>>> >> -------------------------------------------------------------------------- >> >>>>>>>>> > > A daemon (pid unknown) died unexpectedly on signal 1 while >> >>>>>>>>> > attempting to >> >>>>>>>>> > > launch so we are aborting. >> >>>>>>>>> > > >> >>>>>>>>> > > There may be more information reported by the environment >> (see >> >>>>>>>>> > above). >> >>>>>>>>> > > >> >>>>>>>>> > > This may be because the daemon was unable to find all the >> >>>>>>>>> needed >> >>>>>>>>> > shared >> >>>>>>>>> > > libraries on the remote node. You may set your >> LD_LIBRARY_PATH >> >>>>>>>>> to >> >>>>>>>>> > have the >> >>>>>>>>> > > location of the shared libraries on the remote nodes and >> this >> >>>>>>>>> will >> >>>>>>>>> > > automatically be forwarded to the remote nodes. >> >>>>>>>>> > > >> >>>>>>>>> > >> >>>>>>>>> >> -------------------------------------------------------------------------- >> >>>>>>>>> > > >> >>>>>>>>> > >> >>>>>>>>> >> -------------------------------------------------------------------------- >> >>>>>>>>> > > orterun noticed that the job aborted, but has no info as to >> the >> >>>>>>>>> > process >> >>>>>>>>> > > that caused that situation. >> >>>>>>>>> > > >> >>>>>>>>> > >> >>>>>>>>> >> -------------------------------------------------------------------------- >> >>>>>>>>> > > orterun: clean termination accomplished >> >>>>>>>>> > > * >> >>>>>>>>> > > It seems that the rankfile option is not propagted to the >> >>>>>>>>> second >> >>>>>>>>> > command >> >>>>>>>>> > > line ; there is no global understanding of the ranking >> inside a >> >>>>>>>>> > mpirun >> >>>>>>>>> > > command. 
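Ralph's point earlier in this thread, that ranks are numbered across the entire mpirun job rather than per app context, can be sketched in a few lines of plain C (illustrative only, not ORTE code):

#include <stdio.h>

int main(void)
{
    /* e.g. "mpirun ... -n 1 hostname : -n 1 hostname" has two app contexts */
    int np_per_context[] = { 1, 1 };
    int ncontexts = 2;
    int rank = 0;

    /* Global ranks run consecutively across all app contexts, so a single
     * rankfile has to name every one of them (here, ranks 0 and 1). */
    for (int c = 0; c < ncontexts; c++) {
        for (int i = 0; i < np_per_context[c]; i++) {
            printf("app context %d, local proc %d -> global rank %d\n",
                   c, i, rank++);
        }
    }
    return 0;
}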
>> >>>>>>>>> > > >> >>>>>>>>> > > >> >>>>>>>>> > > >> >>>>>>>>> > >> >>>>>>>>> >> >>>>>>>>> >> ################################################################################## >> >>>>>>>>> > > >> >>>>>>>>> > > Assuming that , I tried to provide a rankfile to each >> command >> >>>>>>>>> line: >> >>>>>>>>> > > >> >>>>>>>>> > > cat rankfile.0 >> >>>>>>>>> > > rank 0=r011n002 slot=0 >> >>>>>>>>> > > >> >>>>>>>>> > > cat rankfile.1 >> >>>>>>>>> > > rank 0=r011n003 slot=1 >> >>>>>>>>> > > >> >>>>>>>>> > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : >> -rf >> >>>>>>>>> > rankfile.1 >> >>>>>>>>> > > -n 1 hostname ### CRASHED >> >>>>>>>>> > > *[r011n002:28778] *** Process received signal *** >> >>>>>>>>> > > [r011n002:28778] Signal: Segmentation fault (11) >> >>>>>>>>> > > [r011n002:28778] Signal code: Address not mapped (1) >> >>>>>>>>> > > [r011n002:28778] Failing at address: 0x34 >> >>>>>>>>> > > [r011n002:28778] [ 0] [0xffffe600] >> >>>>>>>>> > > [r011n002:28778] [ 1] >> >>>>>>>>> > > /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so. >> >>>>>>>>> > 0(orte_odls_base_default_get_add_procs_data+0x55d) >> >>>>>>>>> > > [0x5557decd] >> >>>>>>>>> > > [r011n002:28778] [ 2] >> >>>>>>>>> > > /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so. >> >>>>>>>>> > 0(orte_plm_base_launch_apps+0x117) >> >>>>>>>>> > > [0x555842a7] >> >>>>>>>>> > > [r011n002:28778] [ 3] /tmp/HALMPI/openmpi-1.3.1/lib/openmpi/ >> >>>>>>>>> > mca_plm_rsh.so >> >>>>>>>>> > > [0x556098c0] >> >>>>>>>>> > > [r011n002:28778] [ 4] /tmp/HALMPI/openmpi-1.3.1/bin/orterun >> >>>>>>>>> > [0x804aa27] >> >>>>>>>>> > > [r011n002:28778] [ 5] /tmp/HALMPI/openmpi-1.3.1/bin/orterun >> >>>>>>>>> > [0x804a022] >> >>>>>>>>> > > [r011n002:28778] [ 6] /lib/libc.so.6(__libc_start_main+0xdc) >> >>>>>>>>> > [0x9f1dec] >> >>>>>>>>> > > [r011n002:28778] [ 7] /tmp/HALMPI/openmpi-1.3.1/bin/orterun >> >>>>>>>>> > [0x8049f71] >> >>>>>>>>> > > [r011n002:28778] *** End of error message *** >> >>>>>>>>> > > Segmentation fault (core dumped)* >> >>>>>>>>> > > >> >>>>>>>>> > > >> >>>>>>>>> > > >> >>>>>>>>> > > I hope that I've found a bug because it would be very >> important >> >>>>>>>>> > for me to >> >>>>>>>>> > > have this kind of capabiliy . >> >>>>>>>>> > > Launch a multiexe mpirun command line and be able to bind my >> >>>>>>>>> exes >> >>>>>>>>> > and >> >>>>>>>>> > > sockets together. 
>> >>>>>>>>> > >
>> >>>>>>>>> > > Thanks in advance for your help
>> >>>>>>>>> > >
>> >>>>>>>>> > > Geoffroy