This seems like a related problem: I can't use a rankfile together with an appfile (-app), even after all those fixes (working with trunk 1.4a1r21657).
This is my case :
$cat rankfile
rank 0=+n1 slot=0
rank 1=+n0 slot=0
$cat appfile
-np 1 hostname
-np 1 hostname
$mpirun -np 2 -H witch1,witch2 -rf rankfile -app appfile
--------------------------------------------------------------------------
Rankfile claimed host +n1 by index that is bigger than number of
allocated hosts.
--------------------------------------------------------------------------
[dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../../../../orte/mca/rmaps/rank_file/rmaps_rank_file.c at line 422
[dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../../../orte/mca/rmaps/base/rmaps_base_map_job.c at line 85
[dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../../../orte/mca/plm/base/plm_base_launch_support.c at line 103
[dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../../../../orte/mca/plm/rsh/plm_rsh_module.c at line 1001
The problem is that the rankfile mapper tries to find the appropriate host in the partial (per-app-context) host list rather than in the full one.
Any suggestions on how to fix it?
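To illustrate what I mean (this is not the real ORTE code, just a standalone C sketch with made-up names), a relative host spec like "+n1" needs to be resolved against the full list of allocated nodes, not only against the nodes that ended up in the current app_context:

#include <stdio.h>
#include <stdlib.h>

/* Standalone illustration only -- these are NOT the real ORTE structures
 * or functions; all names here are hypothetical. */
typedef struct {
    const char *name;
} node_t;

/* Resolve a relative rankfile host such as "+n1" against a node list.
 * Returns NULL when the index does not fit the list that was searched. */
static const node_t *resolve_relative_host(const char *spec,
                                           const node_t *nodes, int num_nodes)
{
    int idx;
    if (spec[0] != '+' || spec[1] != 'n') {
        return NULL;                /* not a relative "+nX" spec */
    }
    idx = atoi(&spec[2]);
    if (idx < 0 || idx >= num_nodes) {
        return NULL;                /* "index bigger than number of allocated hosts" */
    }
    return &nodes[idx];
}

int main(void)
{
    /* Full allocation, as given by -H witch1,witch2. */
    const node_t allocation[] = { { "witch1" }, { "witch2" } };
    /* Partial list seen by a single app_context from the appfile. */
    const node_t app_subset[] = { { "witch1" } };

    /* Searching only the per-app_context subset fails for "+n1" ... */
    printf("subset:     %s\n",
           resolve_relative_host("+n1", app_subset, 1) ? "found" : "not found");
    /* ... while searching the full allocation succeeds. */
    printf("allocation: %s\n",
           resolve_relative_host("+n1", allocation, 2) ? "found" : "not found");
    return 0;
}

In the real mapper, the same idea would presumably mean looking the host up in the job-level allocation instead of the app_context's node list.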
Thanks
Lenny.
On Wed, May 13, 2009 at 1:55 AM, Ralph Castain <rhc@open-mpi.org> wrote:
Okay, I fixed this today too....r21219
On May 11, 2009, at 11:27 PM, Anton Starikov wrote:
Now there is another problem :)
You can oversubscribe a node, at least by one task.
If your hostfile and rankfile limit you to N procs, you can ask mpirun for N+1 and it will not be rejected, although in reality there will only be N tasks.
So, if your hostfile limit is 4, then "mpirun -np 4" and "mpirun -np 5" both work, but in both cases there are only 4 tasks. It isn't crucial, because there is no real oversubscription, but there is still some bug which could affect something in the future.
--
Anton Starikov.
On May 12, 2009, at 1:45 AM, Ralph Castain wrote:
This is fixed as of r21208.
Thanks for reporting it!
Ralph
On May 11, 2009, at 12:51 PM, Anton Starikov wrote:
Although removing this check solves the problem of having more slots in the rankfile than necessary, there is another problem.
If I set rmaps_base_no_oversubscribe=1, then with, for example:
hostfile:
node01
node01
node02
node02
rankfile:
rank 0=node01 slot=1
rank 1=node01 slot=0
rank 2=node02 slot=1
rank 3=node02 slot=0
mpirun -np 4 ./something
complains with:
"There are not enough slots available in the system to satisfy
the 4 slots
that were requested by the application"
but "mpirun -np 3 ./something" will work though. It works, when
you ask for 1 CPU less. And the same behavior in any case
(shared nodes, non-shared nodes, multi-node)
If you switch off rmaps_base_no_oversubscribe, then it works and
all affinities set as it requested in rankfile, there is no
oversubscription.
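For reference (assuming the usual MCA parameter syntax rather than a config file), the parameter can be toggled directly on the mpirun command line:
mpirun --mca rmaps_base_no_oversubscribe 1 -np 4 ./something   ### check enabled, fails as described above
mpirun --mca rmaps_base_no_oversubscribe 0 -np 4 ./something   ### check disabled, works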
Anton.
On May 5, 2009, at 3:08 PM, Ralph Castain wrote:
Ah - thanks for catching that, I'll remove that check. It is no longer required.
Thx!
On Tue, May 5, 2009 at 7:04 AM, Lenny Verkhovsky <lenny.verkhov...@gmail.com> wrote:
According to the code, it does care.
$ vi orte/mca/rmaps/rank_file/rmaps_rank_file.c +572

    ival = orte_rmaps_rank_file_value.ival;
    /* reject any rank index that does not fit in the np-entry mapping array */
    if ( ival > (np-1) ) {
        orte_show_help("help-rmaps_rank_file.txt", "bad-rankfile", true, ival, rankfile);
        rc = ORTE_ERR_BAD_PARAM;
        goto unlock;
    }
If I remember correctly, I used an array to map ranks, and since the length of the array is np, the maximum index must be less than np; so if a rank number is greater than np-1, there is no place to put it in the array.
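Roughly like this (just a standalone illustration with made-up names, not the actual mapper code):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int np = 2;                                 /* number of processes requested */
    const char **rank_to_host = calloc(np, sizeof(*rank_to_host)); /* one entry per rank */
    int rank = 2;                               /* rank index read from the rankfile */

    if (rank > np - 1) {                        /* same condition as the check above */
        fprintf(stderr, "Error, invalid rank (%d) in the rankfile\n", rank);
        free(rank_to_host);
        return 1;
    }
    rank_to_host[rank] = "node1";               /* safe only when 0 <= rank <= np-1 */
    free(rank_to_host);
    return 0;
}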
"Likewise, if you have more procs than the rankfile specifies,
we map the additional procs either byslot (default) or bynode
(if you specify that option). So the rankfile doesn't need to
contain an entry for every proc." - Correct point.
Lenny.
On 5/5/09, Ralph Castain <r...@open-mpi.org> wrote:
Sorry Lenny, but that isn't correct. The rankfile mapper doesn't care if the rankfile contains additional info - it only maps up to the number of processes, and ignores anything beyond that number. So there is no need to remove the additional info.
Likewise, if you have more procs than the rankfile specifies, we
map the additional procs either byslot (default) or bynode (if
you specify that option). So the rankfile doesn't need to
contain an entry for every proc.
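For example (hypothetical host names, just illustrating the behaviour described above), a rankfile that only pins ranks 0 and 1 can still be used for a 4-process run; ranks 2 and 3 are then mapped byslot (or bynode with -bynode):
cat rankfile
rank 0=node1 slot=0
rank 1=node2 slot=0
mpirun --hostfile hostfile -rf rankfile -np 4 ./app   ### ranks 2 and 3 placed byslot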
Just don't want to confuse folks.
Ralph
On Tue, May 5, 2009 at 5:59 AM, Lenny Verkhovsky <lenny.verkhov...@gmail.com> wrote:
Hi,
The maximum rank number must be less than np.
If np=1, then there is only rank 0 in the system, so rank 1 is invalid.
Please remove "rank 1=node2 slot=*" from the rankfile.
Best regards,
Lenny.
On Mon, May 4, 2009 at 11:14 AM, Geoffroy Pignot <geopig...@gmail.com> wrote:
Hi,
I got the openmpi-1.4a1r21095.tar.gz tarball, but unfortunately my command doesn't work:
cat rankf:
rank 0=node1 slot=*
rank 1=node2 slot=*
cat hostf:
node1 slots=2
node2 slots=2
mpirun --rankfile rankf --hostfile hostf --host node1 -n 1 hostname : --host node2 -n 1 hostname
Error, invalid rank (1) in the rankfile (rankf)
--------------------------------------------------------------------------
[r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file rmaps_rank_file.c at line 403
[r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file base/rmaps_base_map_job.c at line 86
[r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file base/plm_base_launch_support.c at line 86
[r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file plm_rsh_module.c at line 1016
Ralph, could you tell me whether my command syntax is correct or not? If not, could you give me the expected one?
Regards
Geoffroy
2009/4/30 Geoffroy Pignot <geopig...@gmail.com>
Immediately Sir !!! :)
Thanks again Ralph
Geoffroy
------------------------------
Message: 2
Date: Thu, 30 Apr 2009 06:45:39 -0600
From: Ralph Castain <r...@open-mpi.org>
Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
I believe this is fixed now in our development trunk - you can
download any
tarball starting from last night and give it a try, if you like.
Any
feedback would be appreciated.
Ralph
On Apr 14, 2009, at 7:57 AM, Ralph Castain wrote:
Ah now, I didn't say it -worked-, did I? :-)
Clearly a bug exists in the program. I'll try to take a look at
it (if Lenny
doesn't get to it first), but it won't be until later in the week.
On Apr 14, 2009, at 7:18 AM, Geoffroy Pignot wrote:
I agree with you Ralph, and that's what I expect from Open MPI, but my second example shows that it's not working:
cat hostfile.0
r011n002 slots=4
r011n003 slots=4
cat rankfile.0
rank 0=r011n002 slot=0
rank 1=r011n003 slot=1
mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 hostname   ### CRASHED
> > Error, invalid rank (1) in the rankfile (rankfile.0)
> > --------------------------------------------------------------------------
> > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file rmaps_rank_file.c at line 404
> > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file base/rmaps_base_map_job.c at line 87
> > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file base/plm_base_launch_support.c at line 77
> > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file plm_rsh_module.c at line 985
> > --------------------------------------------------------------------------
> > A daemon (pid unknown) died unexpectedly on signal 1 while attempting to launch so we are aborting.
> >
> > There may be more information reported by the environment (see above).
> >
> > This may be because the daemon was unable to find all the needed shared libraries on the remote node. You may set your LD_LIBRARY_PATH to have the location of the shared libraries on the remote nodes and this will automatically be forwarded to the remote nodes.
> > --------------------------------------------------------------------------
> > --------------------------------------------------------------------------
> > orterun noticed that the job aborted, but has no info as to the process that caused that situation.
> > --------------------------------------------------------------------------
> > orterun: clean termination accomplished
Message: 4
Date: Tue, 14 Apr 2009 06:55:58 -0600
From: Ralph Castain <r...@lanl.gov>
Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
The rankfile cuts across the entire job - it isn't applied on an
app_context basis. So the ranks in your rankfile must correspond
to
the eventual rank of each process in the cmd line.
Unfortunately, that means you have to count ranks. In your case,
you
only have four, so that makes life easier. Your rankfile would
look
something like this:
rank 0=r001n001 slot=0
rank 1=r001n002 slot=1
rank 2=r001n001 slot=1
rank 3=r001n002 slot=2
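Combined with the command line from your original question, that would presumably look something like this (hypothetically saving the rankfile above as "rankfile"):
mpirun -rf rankfile -n 1 -host r001n001 master.x options1 : -n 1 -host r001n002 master.x options2 : -n 1 -host r001n001 slave.x options3 : -n 1 -host r001n002 slave.x options4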
HTH
Ralph
On Apr 14, 2009, at 12:19 AM, Geoffroy Pignot wrote:
> Hi,
>
> I agree that my examples are not very clear. What I want to do is to launch a multi-exe application (masters-slaves) and benefit from processor affinity.
> Could you show me how to convert this command, using the -rf option (whatever the affinity is):
>
> mpirun -n 1 -host r001n001 master.x options1 : -n 1 -host r001n002 master.x options2 : -n 1 -host r001n001 slave.x options3 : -n 1 -host r001n002 slave.x options4
>
> Thanks for your help
>
> Geoffroy
>
> Message: 2
> Date: Sun, 12 Apr 2009 18:26:35 +0300
> From: Lenny Verkhovsky <lenny.verkhov...@gmail.com>
> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>
> Hi,
>
> The first "crash" is OK, since your rankfile has ranks 0 and 1 defined, while n=1, which means only rank 0 is present and can be allocated.
>
> NP must be >= the largest rank in the rankfile.
>
> What exactly are you trying to do?
>
> I tried to recreate your segv, but all I got was:
>
> ~/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -rf rankfile.1 -n 1 hostname
> [witch19:30798] mca: base: component_find: paffinity "mca_paffinity_linux" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
> --------------------------------------------------------------------------
> It looks like opal_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during opal_init; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):
>
>   opal_carto_base_select failed
>   --> Returned value -13 instead of OPAL_SUCCESS
> --------------------------------------------------------------------------
> [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file ../../orte/runtime/orte_init.c at line 78
> [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file ../../orte/orted/orted_main.c at line 344
> --------------------------------------------------------------------------
> A daemon (pid 11629) died unexpectedly with status 243 while attempting to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared libraries on the remote node. You may set your LD_LIBRARY_PATH to have the location of the shared libraries on the remote nodes and this will automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process that caused that situation.
> --------------------------------------------------------------------------
> mpirun: clean termination accomplished
>
>
> Lenny.
>
>
> On 4/10/09, Geoffroy Pignot <geopig...@gmail.com> wrote:
> >
> > Hi,
> >
> > I am currently testing the process affinity capabilities of openmpi and I would like to know if the rankfile behaviour I will describe below is normal or not?
> >
> > cat hostfile.0
> > r011n002 slots=4
> > r011n003 slots=4
> >
> > cat rankfile.0
> > rank 0=r011n002 slot=0
> > rank 1=r011n003 slot=1
> >
> > ##################################################################################
> >
> > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 2 hostname   ### OK
> > r011n002
> > r011n003
> >
> > ##################################################################################
> > but
> > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 hostname   ### CRASHED
> >
> > --------------------------------------------------------------------------
> > Error, invalid rank (1) in the rankfile (rankfile.0)
> > --------------------------------------------------------------------------
> > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file rmaps_rank_file.c at line 404
> > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file base/rmaps_base_map_job.c at line 87
> > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file base/plm_base_launch_support.c at line 77
> > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file plm_rsh_module.c at line 985
> > --------------------------------------------------------------------------
> > A daemon (pid unknown) died unexpectedly on signal 1 while attempting to launch so we are aborting.
> >
> > There may be more information reported by the environment (see above).
> >
> > This may be because the daemon was unable to find all the needed shared libraries on the remote node. You may set your LD_LIBRARY_PATH to have the location of the shared libraries on the remote nodes and this will automatically be forwarded to the remote nodes.
> > --------------------------------------------------------------------------
> > --------------------------------------------------------------------------
> > orterun noticed that the job aborted, but has no info as to the process that caused that situation.
> > --------------------------------------------------------------------------
> > orterun: clean termination accomplished
> >
> > It seems that the rankfile option is not propagated to the second command line; there is no global understanding of the ranking inside an mpirun command.
> >
> > ##################################################################################
> >
> > Assuming that, I tried to provide a rankfile to each command line:
> >
> > cat rankfile.0
> > rank 0=r011n002 slot=0
> >
> > cat rankfile.1
> > rank 0=r011n003 slot=1
> >
> > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -rf rankfile.1 -n 1 hostname   ### CRASHED
> >
> > [r011n002:28778] *** Process received signal ***
> > [r011n002:28778] Signal: Segmentation fault (11)
> > [r011n002:28778] Signal code: Address not mapped (1)
> > [r011n002:28778] Failing at address: 0x34
> > [r011n002:28778] [ 0] [0xffffe600]
> > [r011n002:28778] [ 1] /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0x55d) [0x5557decd]
> > [r011n002:28778] [ 2] /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0x117) [0x555842a7]
> > [r011n002:28778] [ 3] /tmp/HALMPI/openmpi-1.3.1/lib/openmpi/mca_plm_rsh.so [0x556098c0]
> > [r011n002:28778] [ 4] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x804aa27]
> > [r011n002:28778] [ 5] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x804a022]
> > [r011n002:28778] [ 6] /lib/libc.so.6(__libc_start_main+0xdc) [0x9f1dec]
> > [r011n002:28778] [ 7] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x8049f71]
> > [r011n002:28778] *** End of error message ***
> > Segmentation fault (core dumped)
> >
> > I hope that I've found a bug, because this kind of capability would be very important for me: launching a multi-exe mpirun command line and being able to bind my exes and sockets together.
> >
> > Thanks in advance for your help
> >
> > Geoffroy