Hi Lenny and Ralph, I saw nothing about rankfile in the 1.3.3 press release. Does that mean the bug fixes are not included there? Thanks
Geoffroy

2009/7/15 <users-requ...@open-mpi.org>

> Today's Topics:
>
>   1. Re: 1.3.1 -rf rankfile behaviour ?? (Lenny Verkhovsky)
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Wed, 15 Jul 2009 15:08:39 +0300
> From: Lenny Verkhovsky <lenny.verkhov...@gmail.com>
> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
> To: Open MPI Users <us...@open-mpi.org>
> Message-ID:
>         <453d39990907150508j33ffa3f0qefc0801ea40f0...@mail.gmail.com>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Same result.
> I still suspect that the rankfile looks for the node in the small host list
> provided by the appfile line, and not in the host list provided to mpirun on
> the HNP node.
> According to my suspicions your proposal should not work (and it does not),
> since in the appfile line I provide np=1 and one host, while the rankfile
> tries to allocate all ranks (np=2).
>
> $orte/mca/rmaps/rank_file/rmaps_rank_file.c at line 338
>
> if(ORTE_SUCCESS != (rc = orte_rmaps_base_get_target_nodes(&node_list,
>                                                           &num_slots, app,
>                                                           map->policy))) {
>
> node_list will be partial, according to the app, and not the full list
> provided on the mpirun cmd line. If I don't provide a host list in the
> appfile line, mpirun uses the local host and not the hosts from the
> hostfile.
>
> Tell me if I am wrong to expect the following behavior:
>
> I provide mpirun with NP, the full host list, the full rankfile, and an
> appfile; the appfile lines carry only a partial NP and a partial host list;
> and it works.
>
> Currently, in order to get it working I need to provide the full host list
> in the appfile, which is quite problematic.
>
> $mpirun -np 2 -rf rankfile -app appfile
> --------------------------------------------------------------------------
> Rankfile claimed host +n1 by index that is bigger than number of allocated
> hosts.
> --------------------------------------------------------------------------
> [dellix7:17277] [[23928,0],0] ORTE_ERROR_LOG: Bad parameter in file
> ../../../../../orte/mca/rmaps/rank_file/rmaps_rank_file.c at line 422
> [dellix7:17277] [[23928,0],0] ORTE_ERROR_LOG: Bad parameter in file
> ../../../../orte/mca/rmaps/base/rmaps_base_map_job.c at line 85
> [dellix7:17277] [[23928,0],0] ORTE_ERROR_LOG: Bad parameter in file
> ../../../../orte/mca/plm/base/plm_base_launch_support.c at line 103
> [dellix7:17277] [[23928,0],0] ORTE_ERROR_LOG: Bad parameter in file
> ../../../../../orte/mca/plm/rsh/plm_rsh_module.c at line 1001
>
> Thanks
> Lenny.
>
> On Wed, Jul 15, 2009 at 2:02 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
> > Try your "not working" example without the -H on the mpirun cmd line -
> > i.e., just use "mpirun -np 2 -rf rankfile -app appfile". Does that work?
> > Sorry to have to keep asking you to try things - I don't have a setup here
> > where I can test this as everything is RM managed.
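For anyone joining the thread here: as I understand the relative-host syntax, +nX in a rankfile refers to the X-th node of the allocation that the mapper can see, counting from zero. So with the rankfile used throughout this exchange (this is only my reading of the examples, not something I have re-tested):

rank 0=+n1 slot=0     <- second node the mapper knows about (witch2 in the working example below)
rank 1=+n0 slot=0     <- first node the mapper knows about (witch1)

the mapping only resolves when each app context sees the full witch1,witch2 host list. With only -H witch1 visible for a context, index +n1 is out of range, which is exactly the "Rankfile claimed host +n1 by index that is bigger than number of allocated hosts" error above.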
> > > > > > On Jul 15, 2009, at 12:09 AM, Lenny Verkhovsky wrote: > > > > > > Thanks Ralph, after playing with prefixes it worked, > > > > I still have a problem running app file with rankfile, by providing full > > hostlist in mpirun command and not in app file. > > Is is planned behaviour, or it can be fixed ? > > > > See Working example: > > > > $cat rankfile > > rank 0=+n1 slot=0 > > rank 1=+n0 slot=0 > > $cat appfile > > -np 1 -H witch1,witch2 ./hello_world > > -np 1 -H witch1,witch2 ./hello_world > > > > $mpirun -rf rankfile -app appfile > > Hello world! I'm 1 of 2 on witch1 > > Hello world! I'm 0 of 2 on witch2 > > > > See NOT working example: > > > > $cat appfile > > -np 1 -H witch1 ./hello_world > > -np 1 -H witch2 ./hello_world > > $mpirun -np 2 -H witch1,witch2 -rf rankfile -app appfile > > > -------------------------------------------------------------------------- > > Rankfile claimed host +n1 by index that is bigger than number of > allocated > > hosts. > > > -------------------------------------------------------------------------- > > [dellix7:16405] [[24080,0],0] ORTE_ERROR_LOG: Bad parameter in file > > ../../../../../orte/mca/rmaps/rank_file/rmaps_rank_file.c at line 422 > > [dellix7:16405] [[24080,0],0] ORTE_ERROR_LOG: Bad parameter in file > > ../../../../orte/mca/rmaps/base/rmaps_base_map_job.c at line 85 > > [dellix7:16405] [[24080,0],0] ORTE_ERROR_LOG: Bad parameter in file > > ../../../../orte/mca/plm/base/plm_base_launch_support.c at line 103 > > [dellix7:16405] [[24080,0],0] ORTE_ERROR_LOG: Bad parameter in file > > ../../../../../orte/mca/plm/rsh/plm_rsh_module.c at line 1001 > > > > > > > > On Wed, Jul 15, 2009 at 6:58 AM, Ralph Castain <r...@open-mpi.org> wrote: > > > >> Took a deeper look into this, and I think that your first guess was > >> correct. > >> When we changed hostfile and -host to be per-app-context options, it > >> became necessary for you to put that info in the appfile itself. So try > >> adding it there. What you would need in your appfile is the following: > >> > >> -np 1 -H witch1 hostname > >> -np 1 -H witch2 hostname > >> > >> That should get you what you want. > >> Ralph > >> > >> On Jul 14, 2009, at 10:29 AM, Lenny Verkhovsky wrote: > >> > >> No, it's not working as I expect , unless I expect somthing wrong . > >> ( sorry for the long PATH, I needed to provide it ) > >> > >> > $LD_LIBRARY_PATH=/hpc/home/USERS/lennyb/work/svn/ompi/trunk/build_x86-64/install/lib/ > >> > /hpc/home/USERS/lennyb/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun > >> -np 2 -H witch1,witch2 hostname > >> witch1 > >> witch2 > >> > >> > $LD_LIBRARY_PATH=/hpc/home/USERS/lennyb/work/svn/ompi/trunk/build_x86-64/install/lib/ > >> > /hpc/home/USERS/lennyb/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun > >> -np 2 -H witch1,witch2 -app appfile > >> dellix7 > >> dellix7 > >> $cat appfile > >> -np 1 hostname > >> -np 1 hostname > >> > >> > >> On Tue, Jul 14, 2009 at 7:08 PM, Ralph Castain <r...@open-mpi.org> > wrote: > >> > >>> Run it without the appfile, just putting the apps on the cmd line - > does > >>> it work right then? > >>> > >>> On Jul 14, 2009, at 10:04 AM, Lenny Verkhovsky wrote: > >>> > >>> additional info > >>> I am running mpirun on hostA, and providing hostlist with hostB and > >>> hostC. > >>> I expect that each application would run on hostB and hostC, but I get > >>> all of them running on hostA. 
> >>> dellix7$cat appfile > >>> -np 1 hostname > >>> -np 1 hostname > >>> dellix7$mpirun -np 2 -H witch1,witch2 -app appfile > >>> dellix7 > >>> dellix7 > >>> Thanks > >>> Lenny. > >>> > >>> On Tue, Jul 14, 2009 at 4:59 PM, Ralph Castain <r...@open-mpi.org> > wrote: > >>> > >>>> Strange - let me have a look at it later today. Probably something > >>>> simple that another pair of eyes might spot. > >>>> On Jul 14, 2009, at 7:43 AM, Lenny Verkhovsky wrote: > >>>> > >>>> Seems like connected problem: > >>>> I can't use rankfile with app, even after all those fixes ( working > with > >>>> trunk 1.4a1r21657). > >>>> This is my case : > >>>> > >>>> $cat rankfile > >>>> rank 0=+n1 slot=0 > >>>> rank 1=+n0 slot=0 > >>>> $cat appfile > >>>> -np 1 hostname > >>>> -np 1 hostname > >>>> $mpirun -np 2 -H witch1,witch2 -rf rankfile -app appfile > >>>> > >>>> > -------------------------------------------------------------------------- > >>>> Rankfile claimed host +n1 by index that is bigger than number of > >>>> allocated hosts. > >>>> > >>>> > -------------------------------------------------------------------------- > >>>> [dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file > >>>> ../../../../../orte/mca/rmaps/rank_file/rmaps_rank_file.c at line 422 > >>>> [dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file > >>>> ../../../../orte/mca/rmaps/base/rmaps_base_map_job.c at line 85 > >>>> [dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file > >>>> ../../../../orte/mca/plm/base/plm_base_launch_support.c at line 103 > >>>> [dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file > >>>> ../../../../../orte/mca/plm/rsh/plm_rsh_module.c at line 1001 > >>>> > >>>> > >>>> The problem is, that rankfile mapper tries to find an appropriate host > >>>> in the partial ( and not full ) hostlist. > >>>> > >>>> Any suggestions how to fix it? > >>>> > >>>> Thanks > >>>> Lenny. > >>>> > >>>> On Wed, May 13, 2009 at 1:55 AM, Ralph Castain <r...@open-mpi.org > >wrote: > >>>> > >>>>> Okay, I fixed this today too....r21219 > >>>>> > >>>>> > >>>>> On May 11, 2009, at 11:27 PM, Anton Starikov wrote: > >>>>> > >>>>> Now there is another problem :) > >>>>>> > >>>>>> You can try oversubscribe node. At least by 1 task. > >>>>>> If you hostfile and rank file limit you at N procs, you can ask > mpirun > >>>>>> for N+1 and it wil be not rejected. > >>>>>> Although in reality there will be N tasks. > >>>>>> So, if your hostfile limit is 4, then "mpirun -np 4" and "mpirun -np > >>>>>> 5" both works, but in both cases there are only 4 tasks. It isn't > crucial, > >>>>>> because there is nor real oversubscription, but there is still some > bug > >>>>>> which can affect something in future. > >>>>>> > >>>>>> -- > >>>>>> Anton Starikov. > >>>>>> > >>>>>> On May 12, 2009, at 1:45 AM, Ralph Castain wrote: > >>>>>> > >>>>>> This is fixed as of r21208. > >>>>>>> > >>>>>>> Thanks for reporting it! > >>>>>>> Ralph > >>>>>>> > >>>>>>> > >>>>>>> On May 11, 2009, at 12:51 PM, Anton Starikov wrote: > >>>>>>> > >>>>>>> Although removing this check solves problem of having more slots in > >>>>>>>> rankfile than necessary, there is another problem. 
> >>>>>>>> > >>>>>>>> If I set rmaps_base_no_oversubscribe=1 then if, for example: > >>>>>>>> > >>>>>>>> > >>>>>>>> hostfile: > >>>>>>>> > >>>>>>>> node01 > >>>>>>>> node01 > >>>>>>>> node02 > >>>>>>>> node02 > >>>>>>>> > >>>>>>>> rankfile: > >>>>>>>> > >>>>>>>> rank 0=node01 slot=1 > >>>>>>>> rank 1=node01 slot=0 > >>>>>>>> rank 2=node02 slot=1 > >>>>>>>> rank 3=node02 slot=0 > >>>>>>>> > >>>>>>>> mpirun -np 4 ./something > >>>>>>>> > >>>>>>>> complains with: > >>>>>>>> > >>>>>>>> "There are not enough slots available in the system to satisfy the > 4 > >>>>>>>> slots > >>>>>>>> that were requested by the application" > >>>>>>>> > >>>>>>>> but "mpirun -np 3 ./something" will work though. It works, when > you > >>>>>>>> ask for 1 CPU less. And the same behavior in any case (shared > nodes, > >>>>>>>> non-shared nodes, multi-node) > >>>>>>>> > >>>>>>>> If you switch off rmaps_base_no_oversubscribe, then it works and > all > >>>>>>>> affinities set as it requested in rankfile, there is no > oversubscription. > >>>>>>>> > >>>>>>>> > >>>>>>>> Anton. > >>>>>>>> > >>>>>>>> On May 5, 2009, at 3:08 PM, Ralph Castain wrote: > >>>>>>>> > >>>>>>>> Ah - thx for catching that, I'll remove that check. It no longer > is > >>>>>>>>> required. > >>>>>>>>> > >>>>>>>>> Thx! > >>>>>>>>> > >>>>>>>>> On Tue, May 5, 2009 at 7:04 AM, Lenny Verkhovsky < > >>>>>>>>> lenny.verkhov...@gmail.com> wrote: > >>>>>>>>> According to the code it does cares. > >>>>>>>>> > >>>>>>>>> $vi orte/mca/rmaps/rank_file/rmaps_rank_file.c +572 > >>>>>>>>> > >>>>>>>>> ival = orte_rmaps_rank_file_value.ival; > >>>>>>>>> if ( ival > (np-1) ) { > >>>>>>>>> orte_show_help("help-rmaps_rank_file.txt", "bad-rankfile", true, > >>>>>>>>> ival, rankfile); > >>>>>>>>> rc = ORTE_ERR_BAD_PARAM; > >>>>>>>>> goto unlock; > >>>>>>>>> } > >>>>>>>>> > >>>>>>>>> If I remember correctly, I used an array to map ranks, and since > >>>>>>>>> the length of array is NP, maximum index must be less than np, so > if you > >>>>>>>>> have the number of rank > NP, you have no place to put it inside > array. > >>>>>>>>> > >>>>>>>>> "Likewise, if you have more procs than the rankfile specifies, we > >>>>>>>>> map the additional procs either byslot (default) or bynode (if > you specify > >>>>>>>>> that option). So the rankfile doesn't need to contain an entry > for every > >>>>>>>>> proc." - Correct point. > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> Lenny. > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> On 5/5/09, Ralph Castain <r...@open-mpi.org> wrote: Sorry Lenny, > >>>>>>>>> but that isn't correct. The rankfile mapper doesn't care if the > rankfile > >>>>>>>>> contains additional info - it only maps up to the number of > processes, and > >>>>>>>>> ignores anything beyond that number. So there is no need to > remove the > >>>>>>>>> additional info. > >>>>>>>>> > >>>>>>>>> Likewise, if you have more procs than the rankfile specifies, we > >>>>>>>>> map the additional procs either byslot (default) or bynode (if > you specify > >>>>>>>>> that option). So the rankfile doesn't need to contain an entry > for every > >>>>>>>>> proc. > >>>>>>>>> > >>>>>>>>> Just don't want to confuse folks. > >>>>>>>>> Ralph > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> On Tue, May 5, 2009 at 5:59 AM, Lenny Verkhovsky < > >>>>>>>>> lenny.verkhov...@gmail.com> wrote: > >>>>>>>>> Hi, > >>>>>>>>> maximum rank number must be less then np. > >>>>>>>>> if np=1 then there is only rank 0 in the system, so rank 1 is > >>>>>>>>> invalid. 
> >>>>>>>>> please remove "rank 1=node2 slot=*" from the rankfile > >>>>>>>>> Best regards, > >>>>>>>>> Lenny. > >>>>>>>>> > >>>>>>>>> On Mon, May 4, 2009 at 11:14 AM, Geoffroy Pignot < > >>>>>>>>> geopig...@gmail.com> wrote: > >>>>>>>>> Hi , > >>>>>>>>> > >>>>>>>>> I got the openmpi-1.4a1r21095.tar.gz tarball, but unfortunately > my > >>>>>>>>> command doesn't work > >>>>>>>>> > >>>>>>>>> cat rankf: > >>>>>>>>> rank 0=node1 slot=* > >>>>>>>>> rank 1=node2 slot=* > >>>>>>>>> > >>>>>>>>> cat hostf: > >>>>>>>>> node1 slots=2 > >>>>>>>>> node2 slots=2 > >>>>>>>>> > >>>>>>>>> mpirun --rankfile rankf --hostfile hostf --host node1 -n 1 > >>>>>>>>> hostname : --host node2 -n 1 hostname > >>>>>>>>> > >>>>>>>>> Error, invalid rank (1) in the rankfile (rankf) > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > -------------------------------------------------------------------------- > >>>>>>>>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in > >>>>>>>>> file rmaps_rank_file.c at line 403 > >>>>>>>>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in > >>>>>>>>> file base/rmaps_base_map_job.c at line 86 > >>>>>>>>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in > >>>>>>>>> file base/plm_base_launch_support.c at line 86 > >>>>>>>>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in > >>>>>>>>> file plm_rsh_module.c at line 1016 > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> Ralph, could you tell me if my command syntax is correct or not ? > >>>>>>>>> if not, give me the expected one ? > >>>>>>>>> > >>>>>>>>> Regards > >>>>>>>>> > >>>>>>>>> Geoffroy > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> 2009/4/30 Geoffroy Pignot <geopig...@gmail.com> > >>>>>>>>> > >>>>>>>>> Immediately Sir !!! :) > >>>>>>>>> > >>>>>>>>> Thanks again Ralph > >>>>>>>>> > >>>>>>>>> Geoffroy > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> ------------------------------ > >>>>>>>>> > >>>>>>>>> Message: 2 > >>>>>>>>> Date: Thu, 30 Apr 2009 06:45:39 -0600 > >>>>>>>>> From: Ralph Castain <r...@open-mpi.org> > >>>>>>>>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ?? > >>>>>>>>> To: Open MPI Users <us...@open-mpi.org> > >>>>>>>>> Message-ID: > >>>>>>>>> <71d2d8cc0904300545v61a42fe1k50086d2704d0f...@mail.gmail.com> > >>>>>>>>> Content-Type: text/plain; charset="iso-8859-1" > >>>>>>>>> > >>>>>>>>> I believe this is fixed now in our development trunk - you can > >>>>>>>>> download any > >>>>>>>>> tarball starting from last night and give it a try, if you like. > >>>>>>>>> Any > >>>>>>>>> feedback would be appreciated. > >>>>>>>>> > >>>>>>>>> Ralph > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> On Apr 14, 2009, at 7:57 AM, Ralph Castain wrote: > >>>>>>>>> > >>>>>>>>> Ah now, I didn't say it -worked-, did I? :-) > >>>>>>>>> > >>>>>>>>> Clearly a bug exists in the program. I'll try to take a look at > it > >>>>>>>>> (if Lenny > >>>>>>>>> doesn't get to it first), but it won't be until later in the > week. 
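For anyone skimming this long thread, here is my reading of the "invalid rank (1)" failure that keeps coming up, illustrated with the hostfile.0 and rankfile.0 from the April messages quoted further down. It is only a sketch of the behaviour reported there; per the later replies the offending check was relaxed in the trunk, and I have not re-tested any of it myself:

$ cat hostfile.0
r011n002 slots=4
r011n003 slots=4
$ cat rankfile.0
rank 0=r011n002 slot=0
rank 1=r011n003 slot=1

$ mpirun --hostfile hostfile.0 -rf rankfile.0 -n 2 hostname
  (works: the job has ranks 0 and 1, and both appear in the rankfile)

$ mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 hostname
  (crashes: the same two ranks in total, but the mapper appears to validate
   the rankfile against the n=1 of a single app context, so rank 1 is
   reported as invalid -- see Lenny's explanation below)

As Ralph explains below, the rankfile is meant to cut across the whole job, so its largest rank has to stay below the total number of processes rather than below any single context's -n.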
> >>>>>>>>> > >>>>>>>>> On Apr 14, 2009, at 7:18 AM, Geoffroy Pignot wrote: > >>>>>>>>> > >>>>>>>>> I agree with you Ralph , and that 's what I expect from openmpi > but > >>>>>>>>> my > >>>>>>>>> second example shows that it's not working > >>>>>>>>> > >>>>>>>>> cat hostfile.0 > >>>>>>>>> r011n002 slots=4 > >>>>>>>>> r011n003 slots=4 > >>>>>>>>> > >>>>>>>>> cat rankfile.0 > >>>>>>>>> rank 0=r011n002 slot=0 > >>>>>>>>> rank 1=r011n003 slot=1 > >>>>>>>>> > >>>>>>>>> mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 > >>>>>>>>> hostname > >>>>>>>>> ### CRASHED > >>>>>>>>> > >>>>>>>>> > > Error, invalid rank (1) in the rankfile (rankfile.0) > >>>>>>>>> > > > >>>>>>>>> > > >>>>>>>>> > -------------------------------------------------------------------------- > >>>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter > in > >>>>>>>>> file > >>>>>>>>> > > rmaps_rank_file.c at line 404 > >>>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter > in > >>>>>>>>> file > >>>>>>>>> > > base/rmaps_base_map_job.c at line 87 > >>>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter > in > >>>>>>>>> file > >>>>>>>>> > > base/plm_base_launch_support.c at line 77 > >>>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter > in > >>>>>>>>> file > >>>>>>>>> > > plm_rsh_module.c at line 985 > >>>>>>>>> > > > >>>>>>>>> > > >>>>>>>>> > -------------------------------------------------------------------------- > >>>>>>>>> > > A daemon (pid unknown) died unexpectedly on signal 1 while > >>>>>>>>> > attempting to > >>>>>>>>> > > launch so we are aborting. > >>>>>>>>> > > > >>>>>>>>> > > There may be more information reported by the environment > (see > >>>>>>>>> > above). > >>>>>>>>> > > > >>>>>>>>> > > This may be because the daemon was unable to find all the > >>>>>>>>> needed > >>>>>>>>> > shared > >>>>>>>>> > > libraries on the remote node. You may set your > LD_LIBRARY_PATH > >>>>>>>>> to > >>>>>>>>> > have the > >>>>>>>>> > > location of the shared libraries on the remote nodes and this > >>>>>>>>> will > >>>>>>>>> > > automatically be forwarded to the remote nodes. > >>>>>>>>> > > > >>>>>>>>> > > >>>>>>>>> > -------------------------------------------------------------------------- > >>>>>>>>> > > > >>>>>>>>> > > >>>>>>>>> > -------------------------------------------------------------------------- > >>>>>>>>> > > orterun noticed that the job aborted, but has no info as to > the > >>>>>>>>> > process > >>>>>>>>> > > that caused that situation. > >>>>>>>>> > > > >>>>>>>>> > > >>>>>>>>> > -------------------------------------------------------------------------- > >>>>>>>>> > > orterun: clean termination accomplished > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> Message: 4 > >>>>>>>>> Date: Tue, 14 Apr 2009 06:55:58 -0600 > >>>>>>>>> From: Ralph Castain <r...@lanl.gov> > >>>>>>>>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ?? > >>>>>>>>> To: Open MPI Users <us...@open-mpi.org> > >>>>>>>>> Message-ID: <f6290ada-a196-43f0-a853-cbcb802d8...@lanl.gov> > >>>>>>>>> Content-Type: text/plain; charset="us-ascii"; Format="flowed"; > >>>>>>>>> DelSp="yes" > >>>>>>>>> > >>>>>>>>> The rankfile cuts across the entire job - it isn't applied on an > >>>>>>>>> app_context basis. So the ranks in your rankfile must correspond > to > >>>>>>>>> the eventual rank of each process in the cmd line. > >>>>>>>>> > >>>>>>>>> Unfortunately, that means you have to count ranks. 
In your case, > >>>>>>>>> you > >>>>>>>>> only have four, so that makes life easier. Your rankfile would > look > >>>>>>>>> something like this: > >>>>>>>>> > >>>>>>>>> rank 0=r001n001 slot=0 > >>>>>>>>> rank 1=r001n002 slot=1 > >>>>>>>>> rank 2=r001n001 slot=1 > >>>>>>>>> rank 3=r001n002 slot=2 > >>>>>>>>> > >>>>>>>>> HTH > >>>>>>>>> Ralph > >>>>>>>>> > >>>>>>>>> On Apr 14, 2009, at 12:19 AM, Geoffroy Pignot wrote: > >>>>>>>>> > >>>>>>>>> > Hi, > >>>>>>>>> > > >>>>>>>>> > I agree that my examples are not very clear. What I want to do > is > >>>>>>>>> to > >>>>>>>>> > launch a multiexes application (masters-slaves) and benefit > from > >>>>>>>>> the > >>>>>>>>> > processor affinity. > >>>>>>>>> > Could you show me how to convert this command , using -rf > option > >>>>>>>>> > (whatever the affinity is) > >>>>>>>>> > > >>>>>>>>> > mpirun -n 1 -host r001n001 master.x options1 : -n 1 -host > >>>>>>>>> r001n002 > >>>>>>>>> > master.x options2 : -n 1 -host r001n001 slave.x options3 : -n 1 > - > >>>>>>>>> > host r001n002 slave.x options4 > >>>>>>>>> > > >>>>>>>>> > Thanks for your help > >>>>>>>>> > > >>>>>>>>> > Geoffroy > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> > Message: 2 > >>>>>>>>> > Date: Sun, 12 Apr 2009 18:26:35 +0300 > >>>>>>>>> > From: Lenny Verkhovsky <lenny.verkhov...@gmail.com> > >>>>>>>>> > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ?? > >>>>>>>>> > To: Open MPI Users <us...@open-mpi.org> > >>>>>>>>> > Message-ID: > >>>>>>>>> > < > >>>>>>>>> 453d39990904120826t2e1d1d33l7bb1fe3de65b5...@mail.gmail.com> > >>>>>>>>> > Content-Type: text/plain; charset="iso-8859-1" > >>>>>>>>> > > >>>>>>>>> > Hi, > >>>>>>>>> > > >>>>>>>>> > The first "crash" is OK, since your rankfile has ranks 0 and 1 > >>>>>>>>> > defined, > >>>>>>>>> > while n=1, which means only rank 0 is present and can be > >>>>>>>>> allocated. > >>>>>>>>> > > >>>>>>>>> > NP must be >= the largest rank in rankfile. > >>>>>>>>> > > >>>>>>>>> > What exactly are you trying to do ? > >>>>>>>>> > > >>>>>>>>> > I tried to recreate your seqv but all I got was > >>>>>>>>> > > >>>>>>>>> > ~/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun > --hostfile > >>>>>>>>> > hostfile.0 > >>>>>>>>> > -rf rankfile.0 -n 1 hostname : -rf rankfile.1 -n 1 hostname > >>>>>>>>> > [witch19:30798] mca: base: component_find: paffinity > >>>>>>>>> > "mca_paffinity_linux" > >>>>>>>>> > uses an MCA interface that is not recognized (component MCA > >>>>>>>>> v1.0.0 != > >>>>>>>>> > supported MCA v2.0.0) -- ignored > >>>>>>>>> > > >>>>>>>>> > -------------------------------------------------------------------------- > >>>>>>>>> > It looks like opal_init failed for some reason; your parallel > >>>>>>>>> > process is > >>>>>>>>> > likely to abort. There are many reasons that a parallel process > >>>>>>>>> can > >>>>>>>>> > fail during opal_init; some of which are due to configuration > or > >>>>>>>>> > environment problems. 
This failure appears to be an internal > >>>>>>>>> failure; > >>>>>>>>> > here's some additional information (which may only be relevant > to > >>>>>>>>> an > >>>>>>>>> > Open MPI developer): > >>>>>>>>> > > >>>>>>>>> > opal_carto_base_select failed > >>>>>>>>> > --> Returned value -13 instead of OPAL_SUCCESS > >>>>>>>>> > > >>>>>>>>> > -------------------------------------------------------------------------- > >>>>>>>>> > [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found > in > >>>>>>>>> file > >>>>>>>>> > ../../orte/runtime/orte_init.c at line 78 > >>>>>>>>> > [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found > in > >>>>>>>>> file > >>>>>>>>> > ../../orte/orted/orted_main.c at line 344 > >>>>>>>>> > > >>>>>>>>> > -------------------------------------------------------------------------- > >>>>>>>>> > A daemon (pid 11629) died unexpectedly with status 243 while > >>>>>>>>> > attempting > >>>>>>>>> > to launch so we are aborting. > >>>>>>>>> > > >>>>>>>>> > There may be more information reported by the environment (see > >>>>>>>>> above). > >>>>>>>>> > > >>>>>>>>> > This may be because the daemon was unable to find all the > needed > >>>>>>>>> > shared > >>>>>>>>> > libraries on the remote node. You may set your LD_LIBRARY_PATH > to > >>>>>>>>> > have the > >>>>>>>>> > location of the shared libraries on the remote nodes and this > >>>>>>>>> will > >>>>>>>>> > automatically be forwarded to the remote nodes. > >>>>>>>>> > > >>>>>>>>> > -------------------------------------------------------------------------- > >>>>>>>>> > > >>>>>>>>> > -------------------------------------------------------------------------- > >>>>>>>>> > mpirun noticed that the job aborted, but has no info as to the > >>>>>>>>> process > >>>>>>>>> > that caused that situation. > >>>>>>>>> > > >>>>>>>>> > -------------------------------------------------------------------------- > >>>>>>>>> > mpirun: clean termination accomplished > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> > Lenny. > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> > On 4/10/09, Geoffroy Pignot <geopig...@gmail.com> wrote: > >>>>>>>>> > > > >>>>>>>>> > > Hi , > >>>>>>>>> > > > >>>>>>>>> > > I am currently testing the process affinity capabilities of > >>>>>>>>> > openmpi and I > >>>>>>>>> > > would like to know if the rankfile behaviour I will describe > >>>>>>>>> below > >>>>>>>>> > is normal > >>>>>>>>> > > or not ? 
> >>>>>>>>> > > > >>>>>>>>> > > cat hostfile.0 > >>>>>>>>> > > r011n002 slots=4 > >>>>>>>>> > > r011n003 slots=4 > >>>>>>>>> > > > >>>>>>>>> > > cat rankfile.0 > >>>>>>>>> > > rank 0=r011n002 slot=0 > >>>>>>>>> > > rank 1=r011n003 slot=1 > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> > > >>>>>>>>> > >>>>>>>>> > ################################################################################## > >>>>>>>>> > > > >>>>>>>>> > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 2 hostname > ### > >>>>>>>>> OK > >>>>>>>>> > > r011n002 > >>>>>>>>> > > r011n003 > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> > > >>>>>>>>> > >>>>>>>>> > ################################################################################## > >>>>>>>>> > > but > >>>>>>>>> > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : > -n > >>>>>>>>> 1 > >>>>>>>>> > hostname > >>>>>>>>> > > ### CRASHED > >>>>>>>>> > > * > >>>>>>>>> > > > >>>>>>>>> > > >>>>>>>>> > -------------------------------------------------------------------------- > >>>>>>>>> > > Error, invalid rank (1) in the rankfile (rankfile.0) > >>>>>>>>> > > > >>>>>>>>> > > >>>>>>>>> > -------------------------------------------------------------------------- > >>>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter > in > >>>>>>>>> file > >>>>>>>>> > > rmaps_rank_file.c at line 404 > >>>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter > in > >>>>>>>>> file > >>>>>>>>> > > base/rmaps_base_map_job.c at line 87 > >>>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter > in > >>>>>>>>> file > >>>>>>>>> > > base/plm_base_launch_support.c at line 77 > >>>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter > in > >>>>>>>>> file > >>>>>>>>> > > plm_rsh_module.c at line 985 > >>>>>>>>> > > > >>>>>>>>> > > >>>>>>>>> > -------------------------------------------------------------------------- > >>>>>>>>> > > A daemon (pid unknown) died unexpectedly on signal 1 while > >>>>>>>>> > attempting to > >>>>>>>>> > > launch so we are aborting. > >>>>>>>>> > > > >>>>>>>>> > > There may be more information reported by the environment > (see > >>>>>>>>> > above). > >>>>>>>>> > > > >>>>>>>>> > > This may be because the daemon was unable to find all the > >>>>>>>>> needed > >>>>>>>>> > shared > >>>>>>>>> > > libraries on the remote node. You may set your > LD_LIBRARY_PATH > >>>>>>>>> to > >>>>>>>>> > have the > >>>>>>>>> > > location of the shared libraries on the remote nodes and this > >>>>>>>>> will > >>>>>>>>> > > automatically be forwarded to the remote nodes. > >>>>>>>>> > > > >>>>>>>>> > > >>>>>>>>> > -------------------------------------------------------------------------- > >>>>>>>>> > > > >>>>>>>>> > > >>>>>>>>> > -------------------------------------------------------------------------- > >>>>>>>>> > > orterun noticed that the job aborted, but has no info as to > the > >>>>>>>>> > process > >>>>>>>>> > > that caused that situation. > >>>>>>>>> > > > >>>>>>>>> > > >>>>>>>>> > -------------------------------------------------------------------------- > >>>>>>>>> > > orterun: clean termination accomplished > >>>>>>>>> > > * > >>>>>>>>> > > It seems that the rankfile option is not propagted to the > >>>>>>>>> second > >>>>>>>>> > command > >>>>>>>>> > > line ; there is no global understanding of the ranking inside > a > >>>>>>>>> > mpirun > >>>>>>>>> > > command. 
> >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> > > >>>>>>>>> > >>>>>>>>> > ################################################################################## > >>>>>>>>> > > > >>>>>>>>> > > Assuming that , I tried to provide a rankfile to each command > >>>>>>>>> line: > >>>>>>>>> > > > >>>>>>>>> > > cat rankfile.0 > >>>>>>>>> > > rank 0=r011n002 slot=0 > >>>>>>>>> > > > >>>>>>>>> > > cat rankfile.1 > >>>>>>>>> > > rank 0=r011n003 slot=1 > >>>>>>>>> > > > >>>>>>>>> > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : > -rf > >>>>>>>>> > rankfile.1 > >>>>>>>>> > > -n 1 hostname ### CRASHED > >>>>>>>>> > > *[r011n002:28778] *** Process received signal *** > >>>>>>>>> > > [r011n002:28778] Signal: Segmentation fault (11) > >>>>>>>>> > > [r011n002:28778] Signal code: Address not mapped (1) > >>>>>>>>> > > [r011n002:28778] Failing at address: 0x34 > >>>>>>>>> > > [r011n002:28778] [ 0] [0xffffe600] > >>>>>>>>> > > [r011n002:28778] [ 1] > >>>>>>>>> > > /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so. > >>>>>>>>> > 0(orte_odls_base_default_get_add_procs_data+0x55d) > >>>>>>>>> > > [0x5557decd] > >>>>>>>>> > > [r011n002:28778] [ 2] > >>>>>>>>> > > /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so. > >>>>>>>>> > 0(orte_plm_base_launch_apps+0x117) > >>>>>>>>> > > [0x555842a7] > >>>>>>>>> > > [r011n002:28778] [ 3] /tmp/HALMPI/openmpi-1.3.1/lib/openmpi/ > >>>>>>>>> > mca_plm_rsh.so > >>>>>>>>> > > [0x556098c0] > >>>>>>>>> > > [r011n002:28778] [ 4] /tmp/HALMPI/openmpi-1.3.1/bin/orterun > >>>>>>>>> > [0x804aa27] > >>>>>>>>> > > [r011n002:28778] [ 5] /tmp/HALMPI/openmpi-1.3.1/bin/orterun > >>>>>>>>> > [0x804a022] > >>>>>>>>> > > [r011n002:28778] [ 6] /lib/libc.so.6(__libc_start_main+0xdc) > >>>>>>>>> > [0x9f1dec] > >>>>>>>>> > > [r011n002:28778] [ 7] /tmp/HALMPI/openmpi-1.3.1/bin/orterun > >>>>>>>>> > [0x8049f71] > >>>>>>>>> > > [r011n002:28778] *** End of error message *** > >>>>>>>>> > > Segmentation fault (core dumped)* > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> > > I hope that I've found a bug because it would be very > important > >>>>>>>>> > for me to > >>>>>>>>> > > have this kind of capabiliy . > >>>>>>>>> > > Launch a multiexe mpirun command line and be able to bind my > >>>>>>>>> exes > >>>>>>>>> > and > >>>>>>>>> > > sockets together. 
> >>>>>>>>> > > Thanks in advance for your help
> >>>>>>>>> > >
> >>>>>>>>> > > Geoffroy
>
> ------------------------------
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> End of users Digest, Vol 1289, Issue 3
> **************************************
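P.S. For the archives, here is how I read Ralph's advice above about counting ranks across the whole command line, applied to the master/slave example I originally asked about. The slot numbers are just the ones Ralph suggested and I have not re-run this against 1.3.3, so please treat it as a sketch rather than a verified recipe:

$ cat rankfile
rank 0=r001n001 slot=0
rank 1=r001n002 slot=1
rank 2=r001n001 slot=1
rank 3=r001n002 slot=2

$ mpirun -rf rankfile \
    -n 1 -host r001n001 master.x options1 : \
    -n 1 -host r001n002 master.x options2 : \
    -n 1 -host r001n001 slave.x options3 : \
    -n 1 -host r001n002 slave.x options4

Ranks are numbered left to right across the app contexts (0 and 1 are the two masters, 2 and 3 the two slaves), so the single rankfile has to cover all four, together with whatever hostfile the allocation needs, as in the earlier examples.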