The rankfile applies across the entire job; it isn't applied on a per-app_context basis. So the ranks in your rankfile must correspond to the eventual rank of each process on the command line.

Unfortunately, that means you have to count ranks. In your case, you only have four, so that makes life easier. Your rankfile would look something like this:

rank 0=r001n001 slot=0
rank 1=r001n002 slot=1
rank 2=r001n001 slot=1
rank 3=r001n002 slot=2
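
If you save that as, say, "myrankfile" (the name is just an example), your original command line should only need the -rf option added, something like:

mpirun -rf myrankfile -n 1 -host r001n001 master.x options1 : -n 1 -host r001n002 master.x options2 : -n 1 -host r001n001 slave.x options3 : -n 1 -host r001n002 slave.x options4

Ranks are assigned in order across the app_contexts, so ranks 0 and 1 are your two masters and ranks 2 and 3 your two slaves, matching the rankfile above.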

HTH
Ralph

On Apr 14, 2009, at 12:19 AM, Geoffroy Pignot wrote:

Hi,

I agree that my examples are not very clear. What I want to do is launch a multi-executable application (masters and slaves) and benefit from processor affinity. Could you show me how to convert this command using the -rf option (whatever the affinity is)?

mpirun -n 1 -host r001n001 master.x options1 : -n 1 -host r001n002 master.x options2 : -n 1 -host r001n001 slave.x options3 : -n 1 -host r001n002 slave.x options4

Thanks for your help

Geoffroy





On Apr 12, 2009, at 6:26 PM, Lenny Verkhovsky <lenny.verkhov...@gmail.com> wrote:

Hi,

The first "crash" is OK, since your rankfile has ranks 0 and 1 defined,
while n=1, which means only rank 0 is present and can be allocated.

NP must be >= the largest rank in rankfile.
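
For example, the smallest launch that satisfies a rankfile defining ranks 0 and 1 is:

mpirun --hostfile hostfile.0 -rf rankfile.0 -n 2 hostname

which is exactly the case that worked for you.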

What exactly are you trying to do?

I tried to recreate your segv, but all I got was:

~/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun --hostfile hostfile.0
-rf rankfile.0 -n 1 hostname : -rf rankfile.1 -n 1 hostname
[witch19:30798] mca: base: component_find: paffinity "mca_paffinity_linux"
uses an MCA interface that is not recognized (component MCA v1.0.0 !=
supported MCA v2.0.0) -- ignored
--------------------------------------------------------------------------
It looks like opal_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

 opal_carto_base_select failed
 --> Returned value -13 instead of OPAL_SUCCESS
--------------------------------------------------------------------------
[witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
../../orte/runtime/orte_init.c at line 78
[witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
../../orte/orted/orted_main.c at line 344
--------------------------------------------------------------------------
A daemon (pid 11629) died unexpectedly with status 243 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
mpirun: clean termination accomplished


Lenny.


On 4/10/09, Geoffroy Pignot <geopig...@gmail.com> wrote:
>
> Hi ,
>
> I am currently testing the process affinity capabilities of openmpi and I
> would like to know if the rankfile behaviour I will describe below is
> normal or not?
>
> cat hostfile.0
> r011n002 slots=4
> r011n003 slots=4
>
> cat rankfile.0
> rank 0=r011n002 slot=0
> rank 1=r011n003 slot=1
>
>
> ##################################################################################
>
> mpirun --hostfile hostfile.0 -rf rankfile.0 -n 2  hostname ### OK
> r011n002
> r011n003
>
>
> ##################################################################################
> but
> mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 hostname
> ### CRASHED
> --------------------------------------------------------------------------
> Error, invalid rank (1) in the rankfile (rankfile.0)
> --------------------------------------------------------------------------
> [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> rmaps_rank_file.c at line 404
> [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> base/rmaps_base_map_job.c at line 87
> [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> base/plm_base_launch_support.c at line 77
> [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> plm_rsh_module.c at line 985
> --------------------------------------------------------------------------
> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
> launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> orterun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> orterun: clean termination accomplished
> It seems that the rankfile option is not propagated to the second command
> line; there is no global understanding of the ranking inside an mpirun
> command.
>
>
> ##################################################################################
>
> Assuming that, I tried to provide a rankfile to each command line:
>
> cat rankfile.0
> rank 0=r011n002 slot=0
>
> cat rankfile.1
> rank 0=r011n003 slot=1
>
> mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -rf rankfile.1
> -n 1 hostname ### CRASHED
> [r011n002:28778] *** Process received signal ***
> [r011n002:28778] Signal: Segmentation fault (11)
> [r011n002:28778] Signal code: Address not mapped (1)
> [r011n002:28778] Failing at address: 0x34
> [r011n002:28778] [ 0] [0xffffe600]
> [r011n002:28778] [ 1]
> /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0x55d)
> [0x5557decd]
> [r011n002:28778] [ 2]
> /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0x117)
> [0x555842a7]
> [r011n002:28778] [ 3] /tmp/HALMPI/openmpi-1.3.1/lib/openmpi/mca_plm_rsh.so
> [0x556098c0]
> [r011n002:28778] [ 4] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x804aa27]
> [r011n002:28778] [ 5] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x804a022]
> [r011n002:28778] [ 6] /lib/libc.so.6(__libc_start_main+0xdc) [0x9f1dec]
> [r011n002:28778] [ 7] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x8049f71]
> [r011n002:28778] *** End of error message ***
> Segmentation fault (core dumped)
>
>
>
> I hope that I've found a bug, because it would be very important for me to
> have this kind of capability: launching a multi-executable mpirun command
> line and being able to bind my executables and sockets together.
>
> Thanks in advance for your help
>
> Geoffroy