Thanks, Ralph. Your guess was correct; here is the display map.
$cat rankfile
rank 0=+n1 slot=0
rank 1=+n0 slot=0
$cat appfile
-np 1 -host witch1 ./hello_world
-np 1 -host witch2 ./hello_world
$mpirun -np 2 -rf rankfile --display-allocation -app appfile

======================   ALLOCATED NODES   ======================
 Data for node: Name: dellix7   Num slots: 0    Max slots: 0
 Data for node: Name: witch1    Num slots: 1    Max slots: 0
 Data for node: Name: witch2    Num slots: 1    Max slots: 0
=================================================================
--------------------------------------------------------------------------
Rankfile claimed host +n1 by index that is bigger than number of allocated hosts.
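For anyone skimming the thread: the +n0/+n1 syntax in the rankfile refers to allocated nodes by index, so the message above means the mapper is resolving +n1 against a node list that is shorter than the rankfile expects. A rough, self-contained sketch of that kind of lookup (illustrative data and names only, not the actual ORTE code):

/* Illustrative sketch only -- not the ORTE source.  Shows the idea of
 * resolving a relative "+nK" rankfile host against an allocated node list,
 * and the bounds check that presumably produces the error above when the
 * list seen by the mapper is shorter than expected. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* hypothetical allocated-node list */
static const char *allocated[] = { "witch1", "witch2" };
static const int num_allocated = 2;

/* resolve "+nK" (the K-th allocated node) or pass a literal hostname through */
static const char *resolve_host(const char *spec)
{
    if (strncmp(spec, "+n", 2) == 0) {
        int idx = atoi(spec + 2);
        if (idx < 0 || idx >= num_allocated) {
            fprintf(stderr,
                    "Rankfile claimed host %s by index that is bigger than "
                    "number of allocated hosts.\n", spec);
            return NULL;
        }
        return allocated[idx];
    }
    return spec;   /* literal hostname, used as given */
}

int main(void)
{
    printf("rank 0 -> %s\n", resolve_host("+n1"));   /* ok: witch2 */
    printf("rank 1 -> %s\n", resolve_host("+n0"));   /* ok: witch1 */
    resolve_host("+n2");                             /* past the end -> error */
    return 0;
}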
On Wed, Jul 15, 2009 at 4:10 PM, Ralph Castain <r...@open-mpi.org> wrote:
> What is supposed to happen is this:
>
> 1. each line of the appfile causes us to create a new app_context. We store the provided -host info in that object.
>
> 2. when we create the "allocation", we cycle through -all- the app_contexts and add -all- of their -host info into the list of allocated nodes
>
> 3. when we get_target_nodes, we start with the entire list of allocated nodes, and then use -host for that app_context to filter down to the hosts allowed for that specific app_context
>
> So you should only have to provide -np 1 and 1 host on each line. My guess is that the rankfile mapper isn't correctly behaving for multiple app_contexts.
>
> Add --display-allocation to your mpirun cmd line for the "not working" case and let's see what mpirun thinks the total allocation is - I'll bet that both nodes show up, which would tell us that my "guess" is correct. Then I'll know what needs to be fixed.
>
> Thanks
> Ralph
>
> On Wed, Jul 15, 2009 at 6:08 AM, Lenny Verkhovsky <lenny.verkhov...@gmail.com> wrote:
>
>> Same result.
>> I still suspect that the rankfile claims a node from the small host list provided by the line in the appfile, and not from the host list provided to mpirun on the HNP node.
>> According to my suspicion your proposal should not work (and it does not), since in the appfile line I provide np=1 and 1 host, while the rankfile tries to allocate all ranks (np=2).
>>
>> $orte/mca/rmaps/rank_file/rmaps_rank_file.c at line 338
>>
>> if(ORTE_SUCCESS != (rc = orte_rmaps_base_get_target_nodes(&node_list, &num_slots, app,
>>                                                           map->policy))) {
>>
>> node_list will be partial, according to the app, and not the full list provided on the mpirun cmd line. If I didn't provide a host list in the appfile line, mpirun uses the local host and not the hosts from the hostfile.
>>
>> Tell me if I am wrong in expecting the following behavior:
>>
>> I provide to mpirun NP, full_hostlist, full_rankfile, appfile.
>> I provide in the appfile only partial NP and partial hostlist,
>> and it works.
>>
>> Currently, in order to get it working I need to provide the full host list in the appfile, which is quite problematic.
>>
>> $mpirun -np 2 -rf rankfile -app appfile
>> --------------------------------------------------------------------------
>> Rankfile claimed host +n1 by index that is bigger than number of allocated hosts.
>> --------------------------------------------------------------------------
>> [dellix7:17277] [[23928,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../../../../orte/mca/rmaps/rank_file/rmaps_rank_file.c at line 422
>> [dellix7:17277] [[23928,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../../../orte/mca/rmaps/base/rmaps_base_map_job.c at line 85
>> [dellix7:17277] [[23928,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../../../orte/mca/plm/base/plm_base_launch_support.c at line 103
>> [dellix7:17277] [[23928,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../../../../orte/mca/plm/rsh/plm_rsh_module.c at line 1001
>>
>> Thanks
>> Lenny.
>>
>> On Wed, Jul 15, 2009 at 2:02 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> Try your "not working" example without the -H on the mpirun cmd line - i.e., just use "mpirun -np 2 -rf rankfile -app appfile". Does that work?
>>> Sorry to have to keep asking you to try things - I don't have a setup here where I can test this as everything is RM managed.
>>>
>>> On Jul 15, 2009, at 12:09 AM, Lenny Verkhovsky wrote:
>>>
>>> Thanks Ralph, after playing with prefixes it worked.
>>>
>>> I still have a problem running an appfile with a rankfile, by providing the full host list on the mpirun command line and not in the appfile.
>>> Is this planned behaviour, or can it be fixed?
>>>
>>> See working example:
>>>
>>> $cat rankfile
>>> rank 0=+n1 slot=0
>>> rank 1=+n0 slot=0
>>> $cat appfile
>>> -np 1 -H witch1,witch2 ./hello_world
>>> -np 1 -H witch1,witch2 ./hello_world
>>>
>>> $mpirun -rf rankfile -app appfile
>>> Hello world! I'm 1 of 2 on witch1
>>> Hello world! I'm 0 of 2 on witch2
>>>
>>> See NOT working example:
>>>
>>> $cat appfile
>>> -np 1 -H witch1 ./hello_world
>>> -np 1 -H witch2 ./hello_world
>>> $mpirun -np 2 -H witch1,witch2 -rf rankfile -app appfile
>>> --------------------------------------------------------------------------
>>> Rankfile claimed host +n1 by index that is bigger than number of allocated hosts.
>>> --------------------------------------------------------------------------
>>> [dellix7:16405] [[24080,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../../../../orte/mca/rmaps/rank_file/rmaps_rank_file.c at line 422
>>> [dellix7:16405] [[24080,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../../../orte/mca/rmaps/base/rmaps_base_map_job.c at line 85
>>> [dellix7:16405] [[24080,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../../../orte/mca/plm/base/plm_base_launch_support.c at line 103
>>> [dellix7:16405] [[24080,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../../../../orte/mca/plm/rsh/plm_rsh_module.c at line 1001
>>>
>>> On Wed, Jul 15, 2009 at 6:58 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>>> Took a deeper look into this, and I think that your first guess was correct.
>>>> When we changed hostfile and -host to be per-app-context options, it became necessary for you to put that info in the appfile itself. So try adding it there. What you would need in your appfile is the following:
>>>>
>>>> -np 1 -H witch1 hostname
>>>> -np 1 -H witch2 hostname
>>>>
>>>> That should get you what you want.
>>>> Ralph
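Below is a minimal sketch of the per-app-context behaviour Ralph describes in steps 1-3 near the top of this thread. The types and function names are invented for illustration; this is not the ORTE API, just the shape of the logic:

/* Sketch of the described behaviour: build the allocation from every
 * app_context's -host list, then filter it back down per app_context.
 * All names are invented for illustration. */
#include <stdio.h>
#include <string.h>

#define MAX_NODES 8

typedef struct {
    const char *hosts[4];   /* the -host list given on one appfile line */
    int         nhosts;
} app_context_t;

/* 2. the "allocation" is the union of every app_context's -host list */
static void build_allocation(const app_context_t *apps, int napps,
                             const char *alloc[], int *nalloc)
{
    *nalloc = 0;
    for (int a = 0; a < napps; a++) {
        for (int h = 0; h < apps[a].nhosts; h++) {
            int seen = 0;
            for (int i = 0; i < *nalloc; i++)
                if (strcmp(alloc[i], apps[a].hosts[h]) == 0) seen = 1;
            if (!seen) alloc[(*nalloc)++] = apps[a].hosts[h];
        }
    }
}

/* 3. get_target_nodes: start from the full allocation, keep only the hosts
 *    allowed for this particular app_context */
static void target_nodes(const char *alloc[], int nalloc, const app_context_t *app)
{
    for (int i = 0; i < nalloc; i++)
        for (int h = 0; h < app->nhosts; h++)
            if (strcmp(alloc[i], app->hosts[h]) == 0)
                printf("  %s\n", alloc[i]);
}

int main(void)
{
    /* 1. one app_context per appfile line */
    app_context_t apps[2] = {
        { { "witch1" }, 1 },
        { { "witch2" }, 1 },
    };
    const char *alloc[MAX_NODES];
    int nalloc;

    build_allocation(apps, 2, alloc, &nalloc);
    printf("allocation has %d nodes\n", nalloc);   /* both witches show up */
    for (int a = 0; a < 2; a++) {
        printf("targets for app %d:\n", a);
        target_nodes(alloc, nalloc, &apps[a]);
    }
    return 0;
}

With the two single-host appfile lines, the allocation ends up containing both witches and each app_context is then filtered down to its own host, which is why the full allocation should still be visible to the rankfile mapper.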
>>>>
>>>> On Jul 14, 2009, at 10:29 AM, Lenny Verkhovsky wrote:
>>>>
>>>> No, it's not working as I expect, unless I expect something wrong.
>>>> (sorry for the long PATH, I needed to provide it)
>>>>
>>>> $LD_LIBRARY_PATH=/hpc/home/USERS/lennyb/work/svn/ompi/trunk/build_x86-64/install/lib/ /hpc/home/USERS/lennyb/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun -np 2 -H witch1,witch2 hostname
>>>> witch1
>>>> witch2
>>>>
>>>> $LD_LIBRARY_PATH=/hpc/home/USERS/lennyb/work/svn/ompi/trunk/build_x86-64/install/lib/ /hpc/home/USERS/lennyb/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun -np 2 -H witch1,witch2 -app appfile
>>>> dellix7
>>>> dellix7
>>>> $cat appfile
>>>> -np 1 hostname
>>>> -np 1 hostname
>>>>
>>>> On Tue, Jul 14, 2009 at 7:08 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>
>>>>> Run it without the appfile, just putting the apps on the cmd line - does it work right then?
>>>>>
>>>>> On Jul 14, 2009, at 10:04 AM, Lenny Verkhovsky wrote:
>>>>>
>>>>> Additional info:
>>>>> I am running mpirun on hostA, and providing a host list with hostB and hostC.
>>>>> I expect that each application would run on hostB and hostC, but I get all of them running on hostA.
>>>>> dellix7$cat appfile
>>>>> -np 1 hostname
>>>>> -np 1 hostname
>>>>> dellix7$mpirun -np 2 -H witch1,witch2 -app appfile
>>>>> dellix7
>>>>> dellix7
>>>>> Thanks
>>>>> Lenny.
>>>>>
>>>>> On Tue, Jul 14, 2009 at 4:59 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>
>>>>>> Strange - let me have a look at it later today. Probably something simple that another pair of eyes might spot.
>>>>>>
>>>>>> On Jul 14, 2009, at 7:43 AM, Lenny Verkhovsky wrote:
>>>>>>
>>>>>> Seems like a connected problem:
>>>>>> I can't use a rankfile with an appfile, even after all those fixes (working with trunk 1.4a1r21657).
>>>>>> This is my case:
>>>>>>
>>>>>> $cat rankfile
>>>>>> rank 0=+n1 slot=0
>>>>>> rank 1=+n0 slot=0
>>>>>> $cat appfile
>>>>>> -np 1 hostname
>>>>>> -np 1 hostname
>>>>>> $mpirun -np 2 -H witch1,witch2 -rf rankfile -app appfile
>>>>>> --------------------------------------------------------------------------
>>>>>> Rankfile claimed host +n1 by index that is bigger than number of allocated hosts.
>>>>>> --------------------------------------------------------------------------
>>>>>> [dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../../../../orte/mca/rmaps/rank_file/rmaps_rank_file.c at line 422
>>>>>> [dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../../../orte/mca/rmaps/base/rmaps_base_map_job.c at line 85
>>>>>> [dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../../../orte/mca/plm/base/plm_base_launch_support.c at line 103
>>>>>> [dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../../../../orte/mca/plm/rsh/plm_rsh_module.c at line 1001
>>>>>>
>>>>>> The problem is that the rankfile mapper tries to find an appropriate host in the partial (and not the full) host list.
>>>>>>
>>>>>> Any suggestions how to fix it?
>>>>>>
>>>>>> Thanks
>>>>>> Lenny.
>>>>>>
>>>>>> On Wed, May 13, 2009 at 1:55 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>
>>>>>>> Okay, I fixed this today too....r21219
>>>>>>>
>>>>>>> On May 11, 2009, at 11:27 PM, Anton Starikov wrote:
>>>>>>>
>>>>>>>> Now there is another problem :)
>>>>>>>>
>>>>>>>> You can try to oversubscribe a node, at least by 1 task.
>>>>>>>> If your hostfile and rankfile limit you at N procs, you can ask mpirun for N+1 and it will not be rejected.
>>>>>>>> Although in reality there will be N tasks.
>>>>>>>> So, if your hostfile limit is 4, then "mpirun -np 4" and "mpirun -np 5" both work, but in both cases there are only 4 tasks. It isn't crucial, because there is no real oversubscription, but there is still some bug which can affect something in the future.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Anton Starikov.
>>>>>>>>
>>>>>>>> On May 12, 2009, at 1:45 AM, Ralph Castain wrote:
>>>>>>>>
>>>>>>>>> This is fixed as of r21208.
>>>>>>>>>
>>>>>>>>> Thanks for reporting it!
>>>>>>>>> Ralph
>>>>>>>>>
>>>>>>>>> On May 11, 2009, at 12:51 PM, Anton Starikov wrote:
>>>>>>>>>
>>>>>>>>>> Although removing this check solves the problem of having more slots in the rankfile than necessary, there is another problem.
>>>>>>>>>>
>>>>>>>>>> If I set rmaps_base_no_oversubscribe=1 then if, for example:
>>>>>>>>>>
>>>>>>>>>> hostfile:
>>>>>>>>>>
>>>>>>>>>> node01
>>>>>>>>>> node01
>>>>>>>>>> node02
>>>>>>>>>> node02
>>>>>>>>>>
>>>>>>>>>> rankfile:
>>>>>>>>>>
>>>>>>>>>> rank 0=node01 slot=1
>>>>>>>>>> rank 1=node01 slot=0
>>>>>>>>>> rank 2=node02 slot=1
>>>>>>>>>> rank 3=node02 slot=0
>>>>>>>>>>
>>>>>>>>>> mpirun -np 4 ./something
>>>>>>>>>>
>>>>>>>>>> complains with:
>>>>>>>>>>
>>>>>>>>>> "There are not enough slots available in the system to satisfy the 4 slots that were requested by the application"
>>>>>>>>>>
>>>>>>>>>> but "mpirun -np 3 ./something" will work. It works when you ask for 1 CPU less, and the behavior is the same in every case (shared nodes, non-shared nodes, multi-node).
>>>>>>>>>>
>>>>>>>>>> If you switch off rmaps_base_no_oversubscribe, then it works and all affinities are set as requested in the rankfile; there is no oversubscription.
>>>>>>>>>>
>>>>>>>>>> Anton.
>>>>>>>>>>
>>>>>>>>>> On May 5, 2009, at 3:08 PM, Ralph Castain wrote:
>>>>>>>>>>
>>>>>>>>>>> Ah - thx for catching that, I'll remove that check. It no longer is required.
>>>>>>>>>>>
>>>>>>>>>>> Thx!
>>>>>>>>>>>
>>>>>>>>>>> On Tue, May 5, 2009 at 7:04 AM, Lenny Verkhovsky <lenny.verkhov...@gmail.com> wrote:
>>>>>>>>>>> According to the code it does care.
>>>>>>>>>>>
>>>>>>>>>>> $vi orte/mca/rmaps/rank_file/rmaps_rank_file.c +572
>>>>>>>>>>>
>>>>>>>>>>> ival = orte_rmaps_rank_file_value.ival;
>>>>>>>>>>> if ( ival > (np-1) ) {
>>>>>>>>>>>     orte_show_help("help-rmaps_rank_file.txt", "bad-rankfile", true, ival, rankfile);
>>>>>>>>>>>     rc = ORTE_ERR_BAD_PARAM;
>>>>>>>>>>>     goto unlock;
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> If I remember correctly, I used an array to map ranks, and since the length of the array is NP, the maximum index must be less than NP; so if you have a rank number beyond that, you have no place to put it inside the array.
>>>>>>>>>>>
>>>>>>>>>>> "Likewise, if you have more procs than the rankfile specifies, we map the additional procs either byslot (default) or bynode (if you specify that option). So the rankfile doesn't need to contain an entry for every proc." - Correct point.
>>>>>>>>>>>
>>>>>>>>>>> Lenny.
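To make the np/rank relationship concrete, here is a toy version of the check quoted above (simplified, not the real rmaps_rank_file parser): a rank index in the rankfile has to fall in 0..np-1, because the mapping array has length np.

/* Toy validator mirroring the quoted check; not the real parser. */
#include <stdio.h>

static int check_rank(int ival, int np)
{
    if (ival > np - 1) {
        fprintf(stderr, "Error, invalid rank (%d) in the rankfile\n", ival);
        return -1;   /* ORTE_ERR_BAD_PARAM in the real code */
    }
    return 0;
}

int main(void)
{
    check_rank(1, 2);   /* ok: np=2 allows ranks 0 and 1  */
    check_rank(1, 1);   /* fails: np=1 only allows rank 0 */
    return 0;
}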
>>>>>>>>>>>
>>>>>>>>>>> On 5/5/09, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>>> Sorry Lenny, but that isn't correct. The rankfile mapper doesn't care if the rankfile contains additional info - it only maps up to the number of processes, and ignores anything beyond that number. So there is no need to remove the additional info.
>>>>>>>>>>>
>>>>>>>>>>> Likewise, if you have more procs than the rankfile specifies, we map the additional procs either byslot (default) or bynode (if you specify that option). So the rankfile doesn't need to contain an entry for every proc.
>>>>>>>>>>>
>>>>>>>>>>> Just don't want to confuse folks.
>>>>>>>>>>> Ralph
>>>>>>>>>>>
>>>>>>>>>>> On Tue, May 5, 2009 at 5:59 AM, Lenny Verkhovsky <lenny.verkhov...@gmail.com> wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>> the maximum rank number must be less than np.
>>>>>>>>>>> If np=1 then there is only rank 0 in the system, so rank 1 is invalid.
>>>>>>>>>>> Please remove "rank 1=node2 slot=*" from the rankfile.
>>>>>>>>>>> Best regards,
>>>>>>>>>>> Lenny.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, May 4, 2009 at 11:14 AM, Geoffroy Pignot <geopig...@gmail.com> wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I got the openmpi-1.4a1r21095.tar.gz tarball, but unfortunately my command doesn't work:
>>>>>>>>>>>
>>>>>>>>>>> cat rankf:
>>>>>>>>>>> rank 0=node1 slot=*
>>>>>>>>>>> rank 1=node2 slot=*
>>>>>>>>>>>
>>>>>>>>>>> cat hostf:
>>>>>>>>>>> node1 slots=2
>>>>>>>>>>> node2 slots=2
>>>>>>>>>>>
>>>>>>>>>>> mpirun --rankfile rankf --hostfile hostf --host node1 -n 1 hostname : --host node2 -n 1 hostname
>>>>>>>>>>>
>>>>>>>>>>> Error, invalid rank (1) in the rankfile (rankf)
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file rmaps_rank_file.c at line 403
>>>>>>>>>>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file base/rmaps_base_map_job.c at line 86
>>>>>>>>>>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file base/plm_base_launch_support.c at line 86
>>>>>>>>>>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file plm_rsh_module.c at line 1016
>>>>>>>>>>>
>>>>>>>>>>> Ralph, could you tell me if my command syntax is correct or not? If not, could you give me the expected one?
>>>>>>>>>>>
>>>>>>>>>>> Regards
>>>>>>>>>>>
>>>>>>>>>>> Geoffroy
>>>>>>>>>>>
>>>>>>>>>>> 2009/4/30 Geoffroy Pignot <geopig...@gmail.com>
>>>>>>>>>>>
>>>>>>>>>>> Immediately Sir !!! :)
>>>>>>>>>>>
>>>>>>>>>>> Thanks again Ralph
>>>>>>>>>>>
>>>>>>>>>>> Geoffroy
>>>>>>>>>>>
>>>>>>>>>>> ------------------------------
>>>>>>>>>>>
>>>>>>>>>>> Message: 2
>>>>>>>>>>> Date: Thu, 30 Apr 2009 06:45:39 -0600
>>>>>>>>>>> From: Ralph Castain <r...@open-mpi.org>
>>>>>>>>>>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>>>>>>>>>>> To: Open MPI Users <us...@open-mpi.org>
>>>>>>>>>>> Message-ID: <71d2d8cc0904300545v61a42fe1k50086d2704d0f...@mail.gmail.com>
>>>>>>>>>>> Content-Type: text/plain; charset="iso-8859-1"
>>>>>>>>>>>
>>>>>>>>>>> I believe this is fixed now in our development trunk - you can download any tarball starting from last night and give it a try, if you like.
>>>>>>>>>>> Any feedback would be appreciated.
>>>>>>>>>>>
>>>>>>>>>>> Ralph
>>>>>>>>>>>
>>>>>>>>>>> On Apr 14, 2009, at 7:57 AM, Ralph Castain wrote:
>>>>>>>>>>>
>>>>>>>>>>> Ah now, I didn't say it -worked-, did I? :-)
>>>>>>>>>>>
>>>>>>>>>>> Clearly a bug exists in the program. I'll try to take a look at it (if Lenny doesn't get to it first), but it won't be until later in the week.
>>>>>>>>>>>
>>>>>>>>>>> On Apr 14, 2009, at 7:18 AM, Geoffroy Pignot wrote:
>>>>>>>>>>>
>>>>>>>>>>> I agree with you Ralph, and that's what I expect from openmpi, but my second example shows that it's not working:
>>>>>>>>>>>
>>>>>>>>>>> cat hostfile.0
>>>>>>>>>>> r011n002 slots=4
>>>>>>>>>>> r011n003 slots=4
>>>>>>>>>>>
>>>>>>>>>>> cat rankfile.0
>>>>>>>>>>> rank 0=r011n002 slot=0
>>>>>>>>>>> rank 1=r011n003 slot=1
>>>>>>>>>>>
>>>>>>>>>>> mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 hostname   ### CRASHED
>>>>>>>>>>>
>>>>>>>>>>> > > Error, invalid rank (1) in the rankfile (rankfile.0)
>>>>>>>>>>> > > --------------------------------------------------------------------------
>>>>>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file rmaps_rank_file.c at line 404
>>>>>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file base/rmaps_base_map_job.c at line 87
>>>>>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file base/plm_base_launch_support.c at line 77
>>>>>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file plm_rsh_module.c at line 985
>>>>>>>>>>> > > --------------------------------------------------------------------------
>>>>>>>>>>> > > A daemon (pid unknown) died unexpectedly on signal 1 while attempting to launch so we are aborting.
>>>>>>>>>>> > >
>>>>>>>>>>> > > There may be more information reported by the environment (see above).
>>>>>>>>>>> > >
>>>>>>>>>>> > > This may be because the daemon was unable to find all the needed shared libraries on the remote node. You may set your LD_LIBRARY_PATH to have the location of the shared libraries on the remote nodes and this will automatically be forwarded to the remote nodes.
>>>>>>>>>>> > > --------------------------------------------------------------------------
>>>>>>>>>>> > > --------------------------------------------------------------------------
>>>>>>>>>>> > > orterun noticed that the job aborted, but has no info as to the process that caused that situation.
>>>>>>>>>>> > > --------------------------------------------------------------------------
>>>>>>>>>>> > > orterun: clean termination accomplished
>>>>>>>>>>>
>>>>>>>>>>> ------------------------------
>>>>>>>>>>>
>>>>>>>>>>> Message: 4
>>>>>>>>>>> Date: Tue, 14 Apr 2009 06:55:58 -0600
>>>>>>>>>>> From: Ralph Castain <r...@lanl.gov>
>>>>>>>>>>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>>>>>>>>>>> To: Open MPI Users <us...@open-mpi.org>
>>>>>>>>>>> Message-ID: <f6290ada-a196-43f0-a853-cbcb802d8...@lanl.gov>
>>>>>>>>>>> Content-Type: text/plain; charset="us-ascii"; Format="flowed"; DelSp="yes"
>>>>>>>>>>>
>>>>>>>>>>> The rankfile cuts across the entire job - it isn't applied on an app_context basis. So the ranks in your rankfile must correspond to the eventual rank of each process in the cmd line.
>>>>>>>>>>>
>>>>>>>>>>> Unfortunately, that means you have to count ranks. In your case, you only have four, so that makes life easier. Your rankfile would look something like this:
>>>>>>>>>>>
>>>>>>>>>>> rank 0=r001n001 slot=0
>>>>>>>>>>> rank 1=r001n002 slot=1
>>>>>>>>>>> rank 2=r001n001 slot=1
>>>>>>>>>>> rank 3=r001n002 slot=2
>>>>>>>>>>>
>>>>>>>>>>> HTH
>>>>>>>>>>> Ralph
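To make the rank counting concrete, here is a toy sketch (invented code, not ORTE) that just prints which global ranks each app_context of a command like the master/slave one quoted below would get:

/* Ranks are global across the whole mpirun command, assigned app_context
 * by app_context in the order the apps appear.  Toy illustration only. */
#include <stdio.h>

int main(void)
{
    /* e.g. "-n 1 master.x : -n 1 master.x : -n 1 slave.x : -n 1 slave.x" */
    const char *apps[] = { "master.x", "master.x", "slave.x", "slave.x" };
    int np[] = { 1, 1, 1, 1 };
    int first = 0;

    for (int a = 0; a < 4; a++) {
        printf("app %d (%s): global ranks %d..%d\n",
               a, apps[a], first, first + np[a] - 1);
        first += np[a];
    }
    /* so a single rankfile for this command needs entries for ranks 0..3 */
    return 0;
}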
>>>>>>>>>>>
>>>>>>>>>>> On Apr 14, 2009, at 12:19 AM, Geoffroy Pignot wrote:
>>>>>>>>>>>
>>>>>>>>>>> > Hi,
>>>>>>>>>>> >
>>>>>>>>>>> > I agree that my examples are not very clear. What I want to do is to launch a multi-exe application (masters-slaves) and benefit from the processor affinity.
>>>>>>>>>>> > Could you show me how to convert this command, using the -rf option (whatever the affinity is):
>>>>>>>>>>> >
>>>>>>>>>>> > mpirun -n 1 -host r001n001 master.x options1 : -n 1 -host r001n002 master.x options2 : -n 1 -host r001n001 slave.x options3 : -n 1 -host r001n002 slave.x options4
>>>>>>>>>>> >
>>>>>>>>>>> > Thanks for your help
>>>>>>>>>>> >
>>>>>>>>>>> > Geoffroy
>>>>>>>>>>> >
>>>>>>>>>>> > Message: 2
>>>>>>>>>>> > Date: Sun, 12 Apr 2009 18:26:35 +0300
>>>>>>>>>>> > From: Lenny Verkhovsky <lenny.verkhov...@gmail.com>
>>>>>>>>>>> > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>>>>>>>>>>> > To: Open MPI Users <us...@open-mpi.org>
>>>>>>>>>>> > Message-ID: <453d39990904120826t2e1d1d33l7bb1fe3de65b5...@mail.gmail.com>
>>>>>>>>>>> > Content-Type: text/plain; charset="iso-8859-1"
>>>>>>>>>>> >
>>>>>>>>>>> > Hi,
>>>>>>>>>>> >
>>>>>>>>>>> > The first "crash" is OK, since your rankfile has ranks 0 and 1 defined, while n=1, which means only rank 0 is present and can be allocated.
>>>>>>>>>>> >
>>>>>>>>>>> > NP must be greater than the largest rank in the rankfile.
>>>>>>>>>>> >
>>>>>>>>>>> > What exactly are you trying to do?
>>>>>>>>>>> >
>>>>>>>>>>> > I tried to recreate your segv but all I got was
>>>>>>>>>>> >
>>>>>>>>>>> > ~/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -rf rankfile.1 -n 1 hostname
>>>>>>>>>>> > [witch19:30798] mca: base: component_find: paffinity "mca_paffinity_linux" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
>>>>>>>>>>> > --------------------------------------------------------------------------
>>>>>>>>>>> > It looks like opal_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during opal_init; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):
>>>>>>>>>>> >
>>>>>>>>>>> > opal_carto_base_select failed
>>>>>>>>>>> > --> Returned value -13 instead of OPAL_SUCCESS
>>>>>>>>>>> > --------------------------------------------------------------------------
>>>>>>>>>>> > [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file ../../orte/runtime/orte_init.c at line 78
>>>>>>>>>>> > [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file ../../orte/orted/orted_main.c at line 344
>>>>>>>>>>> > --------------------------------------------------------------------------
>>>>>>>>>>> > A daemon (pid 11629) died unexpectedly with status 243 while attempting to launch so we are aborting.
>>>>>>>>>>> >
>>>>>>>>>>> > There may be more information reported by the environment (see above).
>>>>>>>>>>> >
>>>>>>>>>>> > This may be because the daemon was unable to find all the needed shared libraries on the remote node. You may set your LD_LIBRARY_PATH to have the location of the shared libraries on the remote nodes and this will automatically be forwarded to the remote nodes.
>>>>>>>>>>> > --------------------------------------------------------------------------
>>>>>>>>>>> > --------------------------------------------------------------------------
>>>>>>>>>>> > mpirun noticed that the job aborted, but has no info as to the process that caused that situation.
>>>>>>>>>>> > --------------------------------------------------------------------------
>>>>>>>>>>> > mpirun: clean termination accomplished
>>>>>>>>>>> >
>>>>>>>>>>> > Lenny.
>>>>>>>>>>> >
>>>>>>>>>>> > On 4/10/09, Geoffroy Pignot <geopig...@gmail.com> wrote:
>>>>>>>>>>> > >
>>>>>>>>>>> > > Hi,
>>>>>>>>>>> > >
>>>>>>>>>>> > > I am currently testing the process affinity capabilities of openmpi and I would like to know if the rankfile behaviour I will describe below is normal or not?
>>>>>>>>>>> > >
>>>>>>>>>>> > > cat hostfile.0
>>>>>>>>>>> > > r011n002 slots=4
>>>>>>>>>>> > > r011n003 slots=4
>>>>>>>>>>> > >
>>>>>>>>>>> > > cat rankfile.0
>>>>>>>>>>> > > rank 0=r011n002 slot=0
>>>>>>>>>>> > > rank 1=r011n003 slot=1
>>>>>>>>>>> > >
>>>>>>>>>>> > > ##################################################################################
>>>>>>>>>>> > >
>>>>>>>>>>> > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 2 hostname   ### OK
>>>>>>>>>>> > > r011n002
>>>>>>>>>>> > > r011n003
>>>>>>>>>>> > >
>>>>>>>>>>> > > ##################################################################################
>>>>>>>>>>> > > but
>>>>>>>>>>> > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 hostname   ### CRASHED
>>>>>>>>>>> > >
>>>>>>>>>>> > > --------------------------------------------------------------------------
>>>>>>>>>>> > > Error, invalid rank (1) in the rankfile (rankfile.0)
>>>>>>>>>>> > > --------------------------------------------------------------------------
>>>>>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file rmaps_rank_file.c at line 404
>>>>>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file base/rmaps_base_map_job.c at line 87
>>>>>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file base/plm_base_launch_support.c at line 77
>>>>>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file plm_rsh_module.c at line 985
>>>>>>>>>>> > > --------------------------------------------------------------------------
>>>>>>>>>>> > > A daemon (pid unknown) died unexpectedly on signal 1 while attempting to launch so we are aborting.
>>>>>>>>>>> > >
>>>>>>>>>>> > > There may be more information reported by the environment (see above).
>>>>>>>>>>> > >
>>>>>>>>>>> > > This may be because the daemon was unable to find all the needed shared libraries on the remote node. You may set your LD_LIBRARY_PATH to have the location of the shared libraries on the remote nodes and this will automatically be forwarded to the remote nodes.
>>>>>>>>>>> > > --------------------------------------------------------------------------
>>>>>>>>>>> > > --------------------------------------------------------------------------
>>>>>>>>>>> > > orterun noticed that the job aborted, but has no info as to the process that caused that situation.
>>>>>>>>>>> > > --------------------------------------------------------------------------
>>>>>>>>>>> > > orterun: clean termination accomplished
>>>>>>>>>>> > >
>>>>>>>>>>> > > It seems that the rankfile option is not propagated to the second command line; there is no global understanding of the ranking inside a mpirun command.
>>>>>>>>>>> > >
>>>>>>>>>>> > > ##################################################################################
>>>>>>>>>>> > >
>>>>>>>>>>> > > Assuming that, I tried to provide a rankfile to each command line:
>>>>>>>>>>> > >
>>>>>>>>>>> > > cat rankfile.0
>>>>>>>>>>> > > rank 0=r011n002 slot=0
>>>>>>>>>>> > >
>>>>>>>>>>> > > cat rankfile.1
>>>>>>>>>>> > > rank 0=r011n003 slot=1
>>>>>>>>>>> > >
>>>>>>>>>>> > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -rf rankfile.1 -n 1 hostname   ### CRASHED
>>>>>>>>>>> > >
>>>>>>>>>>> > > [r011n002:28778] *** Process received signal ***
>>>>>>>>>>> > > [r011n002:28778] Signal: Segmentation fault (11)
>>>>>>>>>>> > > [r011n002:28778] Signal code: Address not mapped (1)
>>>>>>>>>>> > > [r011n002:28778] Failing at address: 0x34
>>>>>>>>>>> > > [r011n002:28778] [ 0] [0xffffe600]
>>>>>>>>>>> > > [r011n002:28778] [ 1] /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0x55d) [0x5557decd]
>>>>>>>>>>> > > [r011n002:28778] [ 2] /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0x117) [0x555842a7]
>>>>>>>>>>> > > [r011n002:28778] [ 3] /tmp/HALMPI/openmpi-1.3.1/lib/openmpi/mca_plm_rsh.so [0x556098c0]
>>>>>>>>>>> > > [r011n002:28778] [ 4] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x804aa27]
>>>>>>>>>>> > > [r011n002:28778] [ 5] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x804a022]
>>>>>>>>>>> > > [r011n002:28778] [ 6] /lib/libc.so.6(__libc_start_main+0xdc) [0x9f1dec]
>>>>>>>>>>> > > [r011n002:28778] [ 7] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x8049f71]
>>>>>>>>>>> > > [r011n002:28778] *** End of error message ***
>>>>>>>>>>> > > Segmentation fault (core dumped)
>>>>>>>>>>> > >
>>>>>>>>>>> > > I hope that I've found a bug, because it would be very important for me to have this kind of capability: launch a multi-exe mpirun command line and be able to bind my exes and sockets together.
>>>>>>>>>>> > >
>>>>>>>>>>> > > Thanks in advance for your help
>>>>>>>>>>> > >
>>>>>>>>>>> > > Geoffroy