Thanks Ralph, after playing with prefixes it worked. However, I still have a problem running an appfile together with a rankfile when the full hostlist is given on the mpirun command line rather than inside the appfile. Is this the planned behaviour, or can it be fixed?
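To make the failure easier to see, here is a small conceptual sketch in plain C. It is not the real ORTE mapper code; resolve_relative_host() and the host arrays are made up purely for illustration. It only shows what seems to happen: the relative entry +n1 is resolved against the per-app-context hostlist (one host) instead of the full hostlist given to mpirun (two hosts), so the index falls out of range. The actual working and failing runs are below.

/* Conceptual sketch only -- not the actual ORTE implementation.
 * resolve_relative_host() and the host arrays are made up for illustration. */
#include <stdio.h>

/* Resolve a "+nX" rankfile entry against a hostlist of the given length;
 * return NULL when the index is out of range. */
static const char *resolve_relative_host(int index, const char **hosts, int nhosts)
{
    if (index >= nhosts) {
        /* corresponds to: "Rankfile claimed host +n1 by index that is
         * bigger than number of allocated hosts." */
        return NULL;
    }
    return hosts[index];
}

int main(void)
{
    const char *full_list[]    = { "witch1", "witch2" };  /* -H witch1,witch2 on the mpirun line */
    const char *app_ctx_list[] = { "witch1" };            /* -H witch1 on a single appfile line */

    /* rank 0=+n1: index 1 is fine against the full two-host list ... */
    const char *ok = resolve_relative_host(1, full_list, 2);
    printf("full hostlist : +n1 -> %s\n", ok ? ok : "error");

    /* ... but out of range against the one-host app-context list. */
    const char *bad = resolve_relative_host(1, app_ctx_list, 1);
    printf("app context   : +n1 -> %s\n",
           bad ? bad : "error: index bigger than number of allocated hosts");
    return 0;
}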
See working example:

$cat rankfile
rank 0=+n1 slot=0
rank 1=+n0 slot=0
$cat appfile
-np 1 -H witch1,witch2 ./hello_world
-np 1 -H witch1,witch2 ./hello_world
$mpirun -rf rankfile -app appfile
Hello world! I'm 1 of 2 on witch1
Hello world! I'm 0 of 2 on witch2

See NOT working example:

$cat appfile
-np 1 -H witch1 ./hello_world
-np 1 -H witch2 ./hello_world
$mpirun -np 2 -H witch1,witch2 -rf rankfile -app appfile
--------------------------------------------------------------------------
Rankfile claimed host +n1 by index that is bigger than number of allocated hosts.
--------------------------------------------------------------------------
[dellix7:16405] [[24080,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../../../../orte/mca/rmaps/rank_file/rmaps_rank_file.c at line 422
[dellix7:16405] [[24080,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../../../orte/mca/rmaps/base/rmaps_base_map_job.c at line 85
[dellix7:16405] [[24080,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../../../orte/mca/plm/base/plm_base_launch_support.c at line 103
[dellix7:16405] [[24080,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../../../../orte/mca/plm/rsh/plm_rsh_module.c at line 1001

On Wed, Jul 15, 2009 at 6:58 AM, Ralph Castain <r...@open-mpi.org> wrote:

> Took a deeper look into this, and I think that your first guess was correct.
> When we changed hostfile and -host to be per-app-context options, it became
> necessary for you to put that info in the appfile itself. So try adding it
> there. What you would need in your appfile is the following:
>
> -np 1 -H witch1 hostname
> -np 1 -H witch2 hostname
>
> That should get you what you want.
> Ralph
>
> On Jul 14, 2009, at 10:29 AM, Lenny Verkhovsky wrote:
>
> No, it's not working as I expect, unless I expect something wrong.
> (sorry for the long PATH, I needed to provide it)
>
> $LD_LIBRARY_PATH=/hpc/home/USERS/lennyb/work/svn/ompi/trunk/build_x86-64/install/lib/
> /hpc/home/USERS/lennyb/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun
> -np 2 -H witch1,witch2 hostname
> witch1
> witch2
>
> $LD_LIBRARY_PATH=/hpc/home/USERS/lennyb/work/svn/ompi/trunk/build_x86-64/install/lib/
> /hpc/home/USERS/lennyb/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun
> -np 2 -H witch1,witch2 -app appfile
> dellix7
> dellix7
> $cat appfile
> -np 1 hostname
> -np 1 hostname
>
>
> On Tue, Jul 14, 2009 at 7:08 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> Run it without the appfile, just putting the apps on the cmd line - does
>> it work right then?
>>
>> On Jul 14, 2009, at 10:04 AM, Lenny Verkhovsky wrote:
>>
>> Additional info:
>> I am running mpirun on hostA, and providing a hostlist with hostB and hostC.
>> I expect each application to run on hostB and hostC, but I get all
>> of them running on hostA.
>> dellix7$cat appfile
>> -np 1 hostname
>> -np 1 hostname
>> dellix7$mpirun -np 2 -H witch1,witch2 -app appfile
>> dellix7
>> dellix7
>> Thanks
>> Lenny.
>>
>> On Tue, Jul 14, 2009 at 4:59 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> Strange - let me have a look at it later today. Probably something simple
>>> that another pair of eyes might spot.
>>> On Jul 14, 2009, at 7:43 AM, Lenny Verkhovsky wrote:
>>>
>>> Seems like a connected problem:
>>> I can't use a rankfile with an appfile, even after all those fixes (working with
>>> trunk 1.4a1r21657). 
>>> This is my case : >>> >>> $cat rankfile >>> rank 0=+n1 slot=0 >>> rank 1=+n0 slot=0 >>> $cat appfile >>> -np 1 hostname >>> -np 1 hostname >>> $mpirun -np 2 -H witch1,witch2 -rf rankfile -app appfile >>> >>> -------------------------------------------------------------------------- >>> Rankfile claimed host +n1 by index that is bigger than number of >>> allocated hosts. >>> >>> -------------------------------------------------------------------------- >>> [dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file >>> ../../../../../orte/mca/rmaps/rank_file/rmaps_rank_file.c at line 422 >>> [dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file >>> ../../../../orte/mca/rmaps/base/rmaps_base_map_job.c at line 85 >>> [dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file >>> ../../../../orte/mca/plm/base/plm_base_launch_support.c at line 103 >>> [dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file >>> ../../../../../orte/mca/plm/rsh/plm_rsh_module.c at line 1001 >>> >>> >>> The problem is, that rankfile mapper tries to find an appropriate host in >>> the partial ( and not full ) hostlist. >>> >>> Any suggestions how to fix it? >>> >>> Thanks >>> Lenny. >>> >>> On Wed, May 13, 2009 at 1:55 AM, Ralph Castain <r...@open-mpi.org> wrote: >>> >>>> Okay, I fixed this today too....r21219 >>>> >>>> >>>> On May 11, 2009, at 11:27 PM, Anton Starikov wrote: >>>> >>>> Now there is another problem :) >>>>> >>>>> You can try oversubscribe node. At least by 1 task. >>>>> If you hostfile and rank file limit you at N procs, you can ask mpirun >>>>> for N+1 and it wil be not rejected. >>>>> Although in reality there will be N tasks. >>>>> So, if your hostfile limit is 4, then "mpirun -np 4" and "mpirun -np 5" >>>>> both works, but in both cases there are only 4 tasks. It isn't crucial, >>>>> because there is nor real oversubscription, but there is still some bug >>>>> which can affect something in future. >>>>> >>>>> -- >>>>> Anton Starikov. >>>>> >>>>> On May 12, 2009, at 1:45 AM, Ralph Castain wrote: >>>>> >>>>> This is fixed as of r21208. >>>>>> >>>>>> Thanks for reporting it! >>>>>> Ralph >>>>>> >>>>>> >>>>>> On May 11, 2009, at 12:51 PM, Anton Starikov wrote: >>>>>> >>>>>> Although removing this check solves problem of having more slots in >>>>>>> rankfile than necessary, there is another problem. >>>>>>> >>>>>>> If I set rmaps_base_no_oversubscribe=1 then if, for example: >>>>>>> >>>>>>> >>>>>>> hostfile: >>>>>>> >>>>>>> node01 >>>>>>> node01 >>>>>>> node02 >>>>>>> node02 >>>>>>> >>>>>>> rankfile: >>>>>>> >>>>>>> rank 0=node01 slot=1 >>>>>>> rank 1=node01 slot=0 >>>>>>> rank 2=node02 slot=1 >>>>>>> rank 3=node02 slot=0 >>>>>>> >>>>>>> mpirun -np 4 ./something >>>>>>> >>>>>>> complains with: >>>>>>> >>>>>>> "There are not enough slots available in the system to satisfy the 4 >>>>>>> slots >>>>>>> that were requested by the application" >>>>>>> >>>>>>> but "mpirun -np 3 ./something" will work though. It works, when you >>>>>>> ask for 1 CPU less. And the same behavior in any case (shared nodes, >>>>>>> non-shared nodes, multi-node) >>>>>>> >>>>>>> If you switch off rmaps_base_no_oversubscribe, then it works and all >>>>>>> affinities set as it requested in rankfile, there is no >>>>>>> oversubscription. >>>>>>> >>>>>>> >>>>>>> Anton. >>>>>>> >>>>>>> On May 5, 2009, at 3:08 PM, Ralph Castain wrote: >>>>>>> >>>>>>> Ah - thx for catching that, I'll remove that check. It no longer is >>>>>>>> required. >>>>>>>> >>>>>>>> Thx! 
>>>>>>>> >>>>>>>> On Tue, May 5, 2009 at 7:04 AM, Lenny Verkhovsky < >>>>>>>> lenny.verkhov...@gmail.com> wrote: >>>>>>>> According to the code it does cares. >>>>>>>> >>>>>>>> $vi orte/mca/rmaps/rank_file/rmaps_rank_file.c +572 >>>>>>>> >>>>>>>> ival = orte_rmaps_rank_file_value.ival; >>>>>>>> if ( ival > (np-1) ) { >>>>>>>> orte_show_help("help-rmaps_rank_file.txt", "bad-rankfile", true, >>>>>>>> ival, rankfile); >>>>>>>> rc = ORTE_ERR_BAD_PARAM; >>>>>>>> goto unlock; >>>>>>>> } >>>>>>>> >>>>>>>> If I remember correctly, I used an array to map ranks, and since the >>>>>>>> length of array is NP, maximum index must be less than np, so if you >>>>>>>> have >>>>>>>> the number of rank > NP, you have no place to put it inside array. >>>>>>>> >>>>>>>> "Likewise, if you have more procs than the rankfile specifies, we >>>>>>>> map the additional procs either byslot (default) or bynode (if you >>>>>>>> specify >>>>>>>> that option). So the rankfile doesn't need to contain an entry for >>>>>>>> every >>>>>>>> proc." - Correct point. >>>>>>>> >>>>>>>> >>>>>>>> Lenny. >>>>>>>> >>>>>>>> >>>>>>>> On 5/5/09, Ralph Castain <r...@open-mpi.org> wrote: Sorry Lenny, but >>>>>>>> that isn't correct. The rankfile mapper doesn't care if the rankfile >>>>>>>> contains additional info - it only maps up to the number of processes, >>>>>>>> and >>>>>>>> ignores anything beyond that number. So there is no need to remove the >>>>>>>> additional info. >>>>>>>> >>>>>>>> Likewise, if you have more procs than the rankfile specifies, we map >>>>>>>> the additional procs either byslot (default) or bynode (if you specify >>>>>>>> that >>>>>>>> option). So the rankfile doesn't need to contain an entry for every >>>>>>>> proc. >>>>>>>> >>>>>>>> Just don't want to confuse folks. >>>>>>>> Ralph >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Tue, May 5, 2009 at 5:59 AM, Lenny Verkhovsky < >>>>>>>> lenny.verkhov...@gmail.com> wrote: >>>>>>>> Hi, >>>>>>>> maximum rank number must be less then np. >>>>>>>> if np=1 then there is only rank 0 in the system, so rank 1 is >>>>>>>> invalid. >>>>>>>> please remove "rank 1=node2 slot=*" from the rankfile >>>>>>>> Best regards, >>>>>>>> Lenny. >>>>>>>> >>>>>>>> On Mon, May 4, 2009 at 11:14 AM, Geoffroy Pignot < >>>>>>>> geopig...@gmail.com> wrote: >>>>>>>> Hi , >>>>>>>> >>>>>>>> I got the openmpi-1.4a1r21095.tar.gz tarball, but unfortunately my >>>>>>>> command doesn't work >>>>>>>> >>>>>>>> cat rankf: >>>>>>>> rank 0=node1 slot=* >>>>>>>> rank 1=node2 slot=* >>>>>>>> >>>>>>>> cat hostf: >>>>>>>> node1 slots=2 >>>>>>>> node2 slots=2 >>>>>>>> >>>>>>>> mpirun --rankfile rankf --hostfile hostf --host node1 -n 1 >>>>>>>> hostname : --host node2 -n 1 hostname >>>>>>>> >>>>>>>> Error, invalid rank (1) in the rankfile (rankf) >>>>>>>> >>>>>>>> >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file >>>>>>>> rmaps_rank_file.c at line 403 >>>>>>>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file >>>>>>>> base/rmaps_base_map_job.c at line 86 >>>>>>>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file >>>>>>>> base/plm_base_launch_support.c at line 86 >>>>>>>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file >>>>>>>> plm_rsh_module.c at line 1016 >>>>>>>> >>>>>>>> >>>>>>>> Ralph, could you tell me if my command syntax is correct or not ? if >>>>>>>> not, give me the expected one ? 
>>>>>>>> >>>>>>>> Regards >>>>>>>> >>>>>>>> Geoffroy >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> 2009/4/30 Geoffroy Pignot <geopig...@gmail.com> >>>>>>>> >>>>>>>> Immediately Sir !!! :) >>>>>>>> >>>>>>>> Thanks again Ralph >>>>>>>> >>>>>>>> Geoffroy >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> ------------------------------ >>>>>>>> >>>>>>>> Message: 2 >>>>>>>> Date: Thu, 30 Apr 2009 06:45:39 -0600 >>>>>>>> From: Ralph Castain <r...@open-mpi.org> >>>>>>>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ?? >>>>>>>> To: Open MPI Users <us...@open-mpi.org> >>>>>>>> Message-ID: >>>>>>>> <71d2d8cc0904300545v61a42fe1k50086d2704d0f...@mail.gmail.com> >>>>>>>> Content-Type: text/plain; charset="iso-8859-1" >>>>>>>> >>>>>>>> I believe this is fixed now in our development trunk - you can >>>>>>>> download any >>>>>>>> tarball starting from last night and give it a try, if you like. Any >>>>>>>> feedback would be appreciated. >>>>>>>> >>>>>>>> Ralph >>>>>>>> >>>>>>>> >>>>>>>> On Apr 14, 2009, at 7:57 AM, Ralph Castain wrote: >>>>>>>> >>>>>>>> Ah now, I didn't say it -worked-, did I? :-) >>>>>>>> >>>>>>>> Clearly a bug exists in the program. I'll try to take a look at it >>>>>>>> (if Lenny >>>>>>>> doesn't get to it first), but it won't be until later in the week. >>>>>>>> >>>>>>>> On Apr 14, 2009, at 7:18 AM, Geoffroy Pignot wrote: >>>>>>>> >>>>>>>> I agree with you Ralph , and that 's what I expect from openmpi but >>>>>>>> my >>>>>>>> second example shows that it's not working >>>>>>>> >>>>>>>> cat hostfile.0 >>>>>>>> r011n002 slots=4 >>>>>>>> r011n003 slots=4 >>>>>>>> >>>>>>>> cat rankfile.0 >>>>>>>> rank 0=r011n002 slot=0 >>>>>>>> rank 1=r011n003 slot=1 >>>>>>>> >>>>>>>> mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 >>>>>>>> hostname >>>>>>>> ### CRASHED >>>>>>>> >>>>>>>> > > Error, invalid rank (1) in the rankfile (rankfile.0) >>>>>>>> > > >>>>>>>> > >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in >>>>>>>> file >>>>>>>> > > rmaps_rank_file.c at line 404 >>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in >>>>>>>> file >>>>>>>> > > base/rmaps_base_map_job.c at line 87 >>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in >>>>>>>> file >>>>>>>> > > base/plm_base_launch_support.c at line 77 >>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in >>>>>>>> file >>>>>>>> > > plm_rsh_module.c at line 985 >>>>>>>> > > >>>>>>>> > >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> > > A daemon (pid unknown) died unexpectedly on signal 1 while >>>>>>>> > attempting to >>>>>>>> > > launch so we are aborting. >>>>>>>> > > >>>>>>>> > > There may be more information reported by the environment (see >>>>>>>> > above). >>>>>>>> > > >>>>>>>> > > This may be because the daemon was unable to find all the needed >>>>>>>> > shared >>>>>>>> > > libraries on the remote node. You may set your LD_LIBRARY_PATH >>>>>>>> to >>>>>>>> > have the >>>>>>>> > > location of the shared libraries on the remote nodes and this >>>>>>>> will >>>>>>>> > > automatically be forwarded to the remote nodes. 
>>>>>>>> > > >>>>>>>> > >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> > > >>>>>>>> > >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> > > orterun noticed that the job aborted, but has no info as to the >>>>>>>> > process >>>>>>>> > > that caused that situation. >>>>>>>> > > >>>>>>>> > >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> > > orterun: clean termination accomplished >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Message: 4 >>>>>>>> Date: Tue, 14 Apr 2009 06:55:58 -0600 >>>>>>>> From: Ralph Castain <r...@lanl.gov> >>>>>>>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ?? >>>>>>>> To: Open MPI Users <us...@open-mpi.org> >>>>>>>> Message-ID: <f6290ada-a196-43f0-a853-cbcb802d8...@lanl.gov> >>>>>>>> Content-Type: text/plain; charset="us-ascii"; Format="flowed"; >>>>>>>> DelSp="yes" >>>>>>>> >>>>>>>> The rankfile cuts across the entire job - it isn't applied on an >>>>>>>> app_context basis. So the ranks in your rankfile must correspond to >>>>>>>> the eventual rank of each process in the cmd line. >>>>>>>> >>>>>>>> Unfortunately, that means you have to count ranks. In your case, you >>>>>>>> only have four, so that makes life easier. Your rankfile would look >>>>>>>> something like this: >>>>>>>> >>>>>>>> rank 0=r001n001 slot=0 >>>>>>>> rank 1=r001n002 slot=1 >>>>>>>> rank 2=r001n001 slot=1 >>>>>>>> rank 3=r001n002 slot=2 >>>>>>>> >>>>>>>> HTH >>>>>>>> Ralph >>>>>>>> >>>>>>>> On Apr 14, 2009, at 12:19 AM, Geoffroy Pignot wrote: >>>>>>>> >>>>>>>> > Hi, >>>>>>>> > >>>>>>>> > I agree that my examples are not very clear. What I want to do is >>>>>>>> to >>>>>>>> > launch a multiexes application (masters-slaves) and benefit from >>>>>>>> the >>>>>>>> > processor affinity. >>>>>>>> > Could you show me how to convert this command , using -rf option >>>>>>>> > (whatever the affinity is) >>>>>>>> > >>>>>>>> > mpirun -n 1 -host r001n001 master.x options1 : -n 1 -host >>>>>>>> r001n002 >>>>>>>> > master.x options2 : -n 1 -host r001n001 slave.x options3 : -n 1 - >>>>>>>> > host r001n002 slave.x options4 >>>>>>>> > >>>>>>>> > Thanks for your help >>>>>>>> > >>>>>>>> > Geoffroy >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > Message: 2 >>>>>>>> > Date: Sun, 12 Apr 2009 18:26:35 +0300 >>>>>>>> > From: Lenny Verkhovsky <lenny.verkhov...@gmail.com> >>>>>>>> > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ?? >>>>>>>> > To: Open MPI Users <us...@open-mpi.org> >>>>>>>> > Message-ID: >>>>>>>> > < >>>>>>>> 453d39990904120826t2e1d1d33l7bb1fe3de65b5...@mail.gmail.com> >>>>>>>> > Content-Type: text/plain; charset="iso-8859-1" >>>>>>>> > >>>>>>>> > Hi, >>>>>>>> > >>>>>>>> > The first "crash" is OK, since your rankfile has ranks 0 and 1 >>>>>>>> > defined, >>>>>>>> > while n=1, which means only rank 0 is present and can be >>>>>>>> allocated. >>>>>>>> > >>>>>>>> > NP must be >= the largest rank in rankfile. >>>>>>>> > >>>>>>>> > What exactly are you trying to do ? 
>>>>>>>> > >>>>>>>> > I tried to recreate your seqv but all I got was >>>>>>>> > >>>>>>>> > ~/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun --hostfile >>>>>>>> > hostfile.0 >>>>>>>> > -rf rankfile.0 -n 1 hostname : -rf rankfile.1 -n 1 hostname >>>>>>>> > [witch19:30798] mca: base: component_find: paffinity >>>>>>>> > "mca_paffinity_linux" >>>>>>>> > uses an MCA interface that is not recognized (component MCA v1.0.0 >>>>>>>> != >>>>>>>> > supported MCA v2.0.0) -- ignored >>>>>>>> > >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> > It looks like opal_init failed for some reason; your parallel >>>>>>>> > process is >>>>>>>> > likely to abort. There are many reasons that a parallel process >>>>>>>> can >>>>>>>> > fail during opal_init; some of which are due to configuration or >>>>>>>> > environment problems. This failure appears to be an internal >>>>>>>> failure; >>>>>>>> > here's some additional information (which may only be relevant to >>>>>>>> an >>>>>>>> > Open MPI developer): >>>>>>>> > >>>>>>>> > opal_carto_base_select failed >>>>>>>> > --> Returned value -13 instead of OPAL_SUCCESS >>>>>>>> > >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> > [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in >>>>>>>> file >>>>>>>> > ../../orte/runtime/orte_init.c at line 78 >>>>>>>> > [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in >>>>>>>> file >>>>>>>> > ../../orte/orted/orted_main.c at line 344 >>>>>>>> > >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> > A daemon (pid 11629) died unexpectedly with status 243 while >>>>>>>> > attempting >>>>>>>> > to launch so we are aborting. >>>>>>>> > >>>>>>>> > There may be more information reported by the environment (see >>>>>>>> above). >>>>>>>> > >>>>>>>> > This may be because the daemon was unable to find all the needed >>>>>>>> > shared >>>>>>>> > libraries on the remote node. You may set your LD_LIBRARY_PATH to >>>>>>>> > have the >>>>>>>> > location of the shared libraries on the remote nodes and this will >>>>>>>> > automatically be forwarded to the remote nodes. >>>>>>>> > >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> > >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> > mpirun noticed that the job aborted, but has no info as to the >>>>>>>> process >>>>>>>> > that caused that situation. >>>>>>>> > >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> > mpirun: clean termination accomplished >>>>>>>> > >>>>>>>> > >>>>>>>> > Lenny. >>>>>>>> > >>>>>>>> > >>>>>>>> > On 4/10/09, Geoffroy Pignot <geopig...@gmail.com> wrote: >>>>>>>> > > >>>>>>>> > > Hi , >>>>>>>> > > >>>>>>>> > > I am currently testing the process affinity capabilities of >>>>>>>> > openmpi and I >>>>>>>> > > would like to know if the rankfile behaviour I will describe >>>>>>>> below >>>>>>>> > is normal >>>>>>>> > > or not ? 
>>>>>>>> > > >>>>>>>> > > cat hostfile.0 >>>>>>>> > > r011n002 slots=4 >>>>>>>> > > r011n003 slots=4 >>>>>>>> > > >>>>>>>> > > cat rankfile.0 >>>>>>>> > > rank 0=r011n002 slot=0 >>>>>>>> > > rank 1=r011n003 slot=1 >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > >>>>>>>> >>>>>>>> ################################################################################## >>>>>>>> > > >>>>>>>> > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 2 hostname ### >>>>>>>> OK >>>>>>>> > > r011n002 >>>>>>>> > > r011n003 >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > >>>>>>>> >>>>>>>> ################################################################################## >>>>>>>> > > but >>>>>>>> > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 >>>>>>>> > hostname >>>>>>>> > > ### CRASHED >>>>>>>> > > * >>>>>>>> > > >>>>>>>> > >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> > > Error, invalid rank (1) in the rankfile (rankfile.0) >>>>>>>> > > >>>>>>>> > >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in >>>>>>>> file >>>>>>>> > > rmaps_rank_file.c at line 404 >>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in >>>>>>>> file >>>>>>>> > > base/rmaps_base_map_job.c at line 87 >>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in >>>>>>>> file >>>>>>>> > > base/plm_base_launch_support.c at line 77 >>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in >>>>>>>> file >>>>>>>> > > plm_rsh_module.c at line 985 >>>>>>>> > > >>>>>>>> > >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> > > A daemon (pid unknown) died unexpectedly on signal 1 while >>>>>>>> > attempting to >>>>>>>> > > launch so we are aborting. >>>>>>>> > > >>>>>>>> > > There may be more information reported by the environment (see >>>>>>>> > above). >>>>>>>> > > >>>>>>>> > > This may be because the daemon was unable to find all the needed >>>>>>>> > shared >>>>>>>> > > libraries on the remote node. You may set your LD_LIBRARY_PATH >>>>>>>> to >>>>>>>> > have the >>>>>>>> > > location of the shared libraries on the remote nodes and this >>>>>>>> will >>>>>>>> > > automatically be forwarded to the remote nodes. >>>>>>>> > > >>>>>>>> > >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> > > >>>>>>>> > >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> > > orterun noticed that the job aborted, but has no info as to the >>>>>>>> > process >>>>>>>> > > that caused that situation. >>>>>>>> > > >>>>>>>> > >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> > > orterun: clean termination accomplished >>>>>>>> > > * >>>>>>>> > > It seems that the rankfile option is not propagted to the second >>>>>>>> > command >>>>>>>> > > line ; there is no global understanding of the ranking inside a >>>>>>>> > mpirun >>>>>>>> > > command. 
>>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > >>>>>>>> >>>>>>>> ################################################################################## >>>>>>>> > > >>>>>>>> > > Assuming that , I tried to provide a rankfile to each command >>>>>>>> line: >>>>>>>> > > >>>>>>>> > > cat rankfile.0 >>>>>>>> > > rank 0=r011n002 slot=0 >>>>>>>> > > >>>>>>>> > > cat rankfile.1 >>>>>>>> > > rank 0=r011n003 slot=1 >>>>>>>> > > >>>>>>>> > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -rf >>>>>>>> > rankfile.1 >>>>>>>> > > -n 1 hostname ### CRASHED >>>>>>>> > > *[r011n002:28778] *** Process received signal *** >>>>>>>> > > [r011n002:28778] Signal: Segmentation fault (11) >>>>>>>> > > [r011n002:28778] Signal code: Address not mapped (1) >>>>>>>> > > [r011n002:28778] Failing at address: 0x34 >>>>>>>> > > [r011n002:28778] [ 0] [0xffffe600] >>>>>>>> > > [r011n002:28778] [ 1] >>>>>>>> > > /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so. >>>>>>>> > 0(orte_odls_base_default_get_add_procs_data+0x55d) >>>>>>>> > > [0x5557decd] >>>>>>>> > > [r011n002:28778] [ 2] >>>>>>>> > > /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so. >>>>>>>> > 0(orte_plm_base_launch_apps+0x117) >>>>>>>> > > [0x555842a7] >>>>>>>> > > [r011n002:28778] [ 3] /tmp/HALMPI/openmpi-1.3.1/lib/openmpi/ >>>>>>>> > mca_plm_rsh.so >>>>>>>> > > [0x556098c0] >>>>>>>> > > [r011n002:28778] [ 4] /tmp/HALMPI/openmpi-1.3.1/bin/orterun >>>>>>>> > [0x804aa27] >>>>>>>> > > [r011n002:28778] [ 5] /tmp/HALMPI/openmpi-1.3.1/bin/orterun >>>>>>>> > [0x804a022] >>>>>>>> > > [r011n002:28778] [ 6] /lib/libc.so.6(__libc_start_main+0xdc) >>>>>>>> > [0x9f1dec] >>>>>>>> > > [r011n002:28778] [ 7] /tmp/HALMPI/openmpi-1.3.1/bin/orterun >>>>>>>> > [0x8049f71] >>>>>>>> > > [r011n002:28778] *** End of error message *** >>>>>>>> > > Segmentation fault (core dumped)* >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > I hope that I've found a bug because it would be very important >>>>>>>> > for me to >>>>>>>> > > have this kind of capabiliy . >>>>>>>> > > Launch a multiexe mpirun command line and be able to bind my >>>>>>>> exes >>>>>>>> > and >>>>>>>> > > sockets together. 
>>>>>>>> > >
>>>>>>>> > > Thanks in advance for your help
>>>>>>>> > >
>>>>>>>> > > Geoffroy