Hi,

The maximum rank number must be less than np. If np=1, there is only rank 0 in the system, so rank 1 is invalid. Please remove "rank 1=node2 slot=*" from the rankfile.

Best regards,
Lenny.
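P.S. Using the node names from your mail quoted below, the corrected rankfile would keep only the rank 0 line (an untested sketch; the hostfile and the mpirun command line stay unchanged):

cat rankf:
rank 0=node1 slot=*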
On Mon, May 4, 2009 at 11:14 AM, Geoffroy Pignot <geopig...@gmail.com>wrote: > Hi , > > I got the > openmpi-1.4a1r21095.tar.gz<http://www.open-mpi.org/nightly/trunk/openmpi-1.4a1r21095.tar.gz>tarball, > but unfortunately my command doesn't work > > cat rankf: > rank 0=node1 slot=* > rank 1=node2 slot=* > > cat hostf: > node1 slots=2 > node2 slots=2 > > mpirun --rankfile rankf --hostfile hostf --host node1 -n 1 hostname : > --host node2 -n 1 hostname > > Error, invalid rank (1) in the rankfile (rankf) > > -------------------------------------------------------------------------- > [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file > rmaps_rank_file.c at line 403 > [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file > base/rmaps_base_map_job.c at line 86 > [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file > base/plm_base_launch_support.c at line 86 > [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file > plm_rsh_module.c at line 1016 > > > Ralph, could you tell me if my command syntax is correct or not ? if not, > give me the expected one ? > > Regards > > Geoffroy > > > > > 2009/4/30 Geoffroy Pignot <geopig...@gmail.com> > >> Immediately Sir !!! :) >> >> Thanks again Ralph >> >> Geoffroy >> >> >> >>> >>> >>> ------------------------------ >>> >>> Message: 2 >>> Date: Thu, 30 Apr 2009 06:45:39 -0600 >>> From: Ralph Castain <r...@open-mpi.org> >>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ?? >>> To: Open MPI Users <us...@open-mpi.org> >>> Message-ID: >>> <71d2d8cc0904300545v61a42fe1k50086d2704d0f...@mail.gmail.com> >>> Content-Type: text/plain; charset="iso-8859-1" >>> >>> I believe this is fixed now in our development trunk - you can download >>> any >>> tarball starting from last night and give it a try, if you like. Any >>> feedback would be appreciated. >>> >>> Ralph >>> >>> >>> On Apr 14, 2009, at 7:57 AM, Ralph Castain wrote: >>> >>> Ah now, I didn't say it -worked-, did I? :-) >>> >>> Clearly a bug exists in the program. I'll try to take a look at it (if >>> Lenny >>> doesn't get to it first), but it won't be until later in the week. >>> >>> On Apr 14, 2009, at 7:18 AM, Geoffroy Pignot wrote: >>> >>> I agree with you Ralph , and that 's what I expect from openmpi but my >>> second example shows that it's not working >>> >>> cat hostfile.0 >>> r011n002 slots=4 >>> r011n003 slots=4 >>> >>> cat rankfile.0 >>> rank 0=r011n002 slot=0 >>> rank 1=r011n003 slot=1 >>> >>> mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 hostname >>> ### CRASHED >>> >>> > > Error, invalid rank (1) in the rankfile (rankfile.0) >>> > > >>> > >>> -------------------------------------------------------------------------- >>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file >>> > > rmaps_rank_file.c at line 404 >>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file >>> > > base/rmaps_base_map_job.c at line 87 >>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file >>> > > base/plm_base_launch_support.c at line 77 >>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file >>> > > plm_rsh_module.c at line 985 >>> > > >>> > >>> -------------------------------------------------------------------------- >>> > > A daemon (pid unknown) died unexpectedly on signal 1 while >>> > attempting to >>> > > launch so we are aborting. >>> > > >>> > > There may be more information reported by the environment (see >>> > above). 
>>> > > >>> > > This may be because the daemon was unable to find all the needed >>> > shared >>> > > libraries on the remote node. You may set your LD_LIBRARY_PATH to >>> > have the >>> > > location of the shared libraries on the remote nodes and this will >>> > > automatically be forwarded to the remote nodes. >>> > > >>> > >>> -------------------------------------------------------------------------- >>> > > >>> > >>> -------------------------------------------------------------------------- >>> > > orterun noticed that the job aborted, but has no info as to the >>> > process >>> > > that caused that situation. >>> > > >>> > >>> -------------------------------------------------------------------------- >>> > > orterun: clean termination accomplished >>> >>> >>> >>> Message: 4 >>> Date: Tue, 14 Apr 2009 06:55:58 -0600 >>> From: Ralph Castain <r...@lanl.gov> >>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ?? >>> To: Open MPI Users <us...@open-mpi.org> >>> Message-ID: <f6290ada-a196-43f0-a853-cbcb802d8...@lanl.gov> >>> Content-Type: text/plain; charset="us-ascii"; Format="flowed"; >>> DelSp="yes" >>> >>> The rankfile cuts across the entire job - it isn't applied on an >>> app_context basis. So the ranks in your rankfile must correspond to >>> the eventual rank of each process in the cmd line. >>> >>> Unfortunately, that means you have to count ranks. In your case, you >>> only have four, so that makes life easier. Your rankfile would look >>> something like this: >>> >>> rank 0=r001n001 slot=0 >>> rank 1=r001n002 slot=1 >>> rank 2=r001n001 slot=1 >>> rank 3=r001n002 slot=2 >>> >>> HTH >>> Ralph >>> >>> On Apr 14, 2009, at 12:19 AM, Geoffroy Pignot wrote: >>> >>> > Hi, >>> > >>> > I agree that my examples are not very clear. What I want to do is to >>> > launch a multiexes application (masters-slaves) and benefit from the >>> > processor affinity. >>> > Could you show me how to convert this command , using -rf option >>> > (whatever the affinity is) >>> > >>> > mpirun -n 1 -host r001n001 master.x options1 : -n 1 -host r001n002 >>> > master.x options2 : -n 1 -host r001n001 slave.x options3 : -n 1 - >>> > host r001n002 slave.x options4 >>> > >>> > Thanks for your help >>> > >>> > Geoffroy >>> > >>> > >>> > >>> > >>> > >>> > Message: 2 >>> > Date: Sun, 12 Apr 2009 18:26:35 +0300 >>> > From: Lenny Verkhovsky <lenny.verkhov...@gmail.com> >>> > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ?? >>> > To: Open MPI Users <us...@open-mpi.org> >>> > Message-ID: >>> > <453d39990904120826t2e1d1d33l7bb1fe3de65b5...@mail.gmail.com> >>> > Content-Type: text/plain; charset="iso-8859-1" >>> > >>> > Hi, >>> > >>> > The first "crash" is OK, since your rankfile has ranks 0 and 1 >>> > defined, >>> > while n=1, which means only rank 0 is present and can be allocated. >>> > >>> > NP must be >= the largest rank in rankfile. >>> > >>> > What exactly are you trying to do ? >>> > >>> > I tried to recreate your seqv but all I got was >>> > >>> > ~/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun --hostfile >>> > hostfile.0 >>> > -rf rankfile.0 -n 1 hostname : -rf rankfile.1 -n 1 hostname >>> > [witch19:30798] mca: base: component_find: paffinity >>> > "mca_paffinity_linux" >>> > uses an MCA interface that is not recognized (component MCA v1.0.0 != >>> > supported MCA v2.0.0) -- ignored >>> > >>> -------------------------------------------------------------------------- >>> > It looks like opal_init failed for some reason; your parallel >>> > process is >>> > likely to abort. 
There are many reasons that a parallel process can >>> > fail during opal_init; some of which are due to configuration or >>> > environment problems. This failure appears to be an internal failure; >>> > here's some additional information (which may only be relevant to an >>> > Open MPI developer): >>> > >>> > opal_carto_base_select failed >>> > --> Returned value -13 instead of OPAL_SUCCESS >>> > >>> -------------------------------------------------------------------------- >>> > [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file >>> > ../../orte/runtime/orte_init.c at line 78 >>> > [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file >>> > ../../orte/orted/orted_main.c at line 344 >>> > >>> -------------------------------------------------------------------------- >>> > A daemon (pid 11629) died unexpectedly with status 243 while >>> > attempting >>> > to launch so we are aborting. >>> > >>> > There may be more information reported by the environment (see above). >>> > >>> > This may be because the daemon was unable to find all the needed >>> > shared >>> > libraries on the remote node. You may set your LD_LIBRARY_PATH to >>> > have the >>> > location of the shared libraries on the remote nodes and this will >>> > automatically be forwarded to the remote nodes. >>> > >>> -------------------------------------------------------------------------- >>> > >>> -------------------------------------------------------------------------- >>> > mpirun noticed that the job aborted, but has no info as to the process >>> > that caused that situation. >>> > >>> -------------------------------------------------------------------------- >>> > mpirun: clean termination accomplished >>> > >>> > >>> > Lenny. >>> > >>> > >>> > On 4/10/09, Geoffroy Pignot <geopig...@gmail.com> wrote: >>> > > >>> > > Hi , >>> > > >>> > > I am currently testing the process affinity capabilities of >>> > openmpi and I >>> > > would like to know if the rankfile behaviour I will describe below >>> > is normal >>> > > or not ? 
>>> > > >>> > > cat hostfile.0 >>> > > r011n002 slots=4 >>> > > r011n003 slots=4 >>> > > >>> > > cat rankfile.0 >>> > > rank 0=r011n002 slot=0 >>> > > rank 1=r011n003 slot=1 >>> > > >>> > > >>> > > >>> > >>> >>> ################################################################################## >>> > > >>> > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 2 hostname ### OK >>> > > r011n002 >>> > > r011n003 >>> > > >>> > > >>> > > >>> > >>> >>> ################################################################################## >>> > > but >>> > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 >>> > hostname >>> > > ### CRASHED >>> > > * >>> > > >>> > >>> -------------------------------------------------------------------------- >>> > > Error, invalid rank (1) in the rankfile (rankfile.0) >>> > > >>> > >>> -------------------------------------------------------------------------- >>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file >>> > > rmaps_rank_file.c at line 404 >>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file >>> > > base/rmaps_base_map_job.c at line 87 >>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file >>> > > base/plm_base_launch_support.c at line 77 >>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file >>> > > plm_rsh_module.c at line 985 >>> > > >>> > >>> -------------------------------------------------------------------------- >>> > > A daemon (pid unknown) died unexpectedly on signal 1 while >>> > attempting to >>> > > launch so we are aborting. >>> > > >>> > > There may be more information reported by the environment (see >>> > above). >>> > > >>> > > This may be because the daemon was unable to find all the needed >>> > shared >>> > > libraries on the remote node. You may set your LD_LIBRARY_PATH to >>> > have the >>> > > location of the shared libraries on the remote nodes and this will >>> > > automatically be forwarded to the remote nodes. >>> > > >>> > >>> -------------------------------------------------------------------------- >>> > > >>> > >>> -------------------------------------------------------------------------- >>> > > orterun noticed that the job aborted, but has no info as to the >>> > process >>> > > that caused that situation. >>> > > >>> > >>> -------------------------------------------------------------------------- >>> > > orterun: clean termination accomplished >>> > > * >>> > > It seems that the rankfile option is not propagted to the second >>> > command >>> > > line ; there is no global understanding of the ranking inside a >>> > mpirun >>> > > command. >>> > > >>> > > >>> > > >>> > >>> >>> ################################################################################## >>> > > >>> > > Assuming that , I tried to provide a rankfile to each command line: >>> > > >>> > > cat rankfile.0 >>> > > rank 0=r011n002 slot=0 >>> > > >>> > > cat rankfile.1 >>> > > rank 0=r011n003 slot=1 >>> > > >>> > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -rf >>> > rankfile.1 >>> > > -n 1 hostname ### CRASHED >>> > > *[r011n002:28778] *** Process received signal *** >>> > > [r011n002:28778] Signal: Segmentation fault (11) >>> > > [r011n002:28778] Signal code: Address not mapped (1) >>> > > [r011n002:28778] Failing at address: 0x34 >>> > > [r011n002:28778] [ 0] [0xffffe600] >>> > > [r011n002:28778] [ 1] >>> > > /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so. 
>>> > 0(orte_odls_base_default_get_add_procs_data+0x55d) >>> > > [0x5557decd] >>> > > [r011n002:28778] [ 2] >>> > > /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so. >>> > 0(orte_plm_base_launch_apps+0x117) >>> > > [0x555842a7] >>> > > [r011n002:28778] [ 3] /tmp/HALMPI/openmpi-1.3.1/lib/openmpi/ >>> > mca_plm_rsh.so >>> > > [0x556098c0] >>> > > [r011n002:28778] [ 4] /tmp/HALMPI/openmpi-1.3.1/bin/orterun >>> > [0x804aa27] >>> > > [r011n002:28778] [ 5] /tmp/HALMPI/openmpi-1.3.1/bin/orterun >>> > [0x804a022] >>> > > [r011n002:28778] [ 6] /lib/libc.so.6(__libc_start_main+0xdc) >>> > [0x9f1dec] >>> > > [r011n002:28778] [ 7] /tmp/HALMPI/openmpi-1.3.1/bin/orterun >>> > [0x8049f71] >>> > > [r011n002:28778] *** End of error message *** >>> > > Segmentation fault (core dumped)* >>> > > >>> > > >>> > > I hope that I've found a bug because it would be very important >>> > for me to >>> > > have this kind of capabiliy . >>> > > Launch a multiexe mpirun command line and be able to bind my exes >>> > and >>> > > sockets together. >>> > > >>> > > Thanks in advance for your help >>> > > >>> > > Geoffroy
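A closing note on the multi-exe case discussed earlier in the thread: as Ralph explained, the rankfile covers the whole job rather than a single app_context. Putting his four-rank example together with the command line quoted above gives roughly the following (a sketch assembled from the quoted messages, not retested here):

cat rankfile:
rank 0=r001n001 slot=0
rank 1=r001n002 slot=1
rank 2=r001n001 slot=1
rank 3=r001n002 slot=2

mpirun -rf rankfile -n 1 -host r001n001 master.x options1 : -n 1 -host r001n002 master.x options2 : -n 1 -host r001n001 slave.x options3 : -n 1 -host r001n002 slave.x options4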