------------------------------
Message: 2
Date: Mon, 4 May 2009 04:45:57 -0600
From: Ralph Castain <r...@open-mpi.org>
Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
To: Open MPI Users <us...@open-mpi.org>
Message-ID: <d01d7b16-4b47-46f3-ad41-d1a90b2e4...@open-mpi.org>
Content-Type: text/plain; charset="us-ascii"; Format="flowed";
DelSp="yes"
My apologies - I wasn't clear enough. You need a tarball from r21111 or greater, such as:
http://www.open-mpi.org/nightly/trunk/openmpi-1.4a1r21142.tar.gz
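For reference, building a nightly tarball is just the usual configure/make sequence; a rough sketch (the install prefix below is only an example - pick your own):
tar xzf openmpi-1.4a1r21142.tar.gz
cd openmpi-1.4a1r21142
./configure --prefix=$HOME/ompi-trunk
make all install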
HTH
Ralph
On May 4, 2009, at 2:14 AM, Geoffroy Pignot wrote:
Hi,
I got the openmpi-1.4a1r21095.tar.gz tarball, but unfortunately my command doesn't work:
cat rankf:
rank 0=node1 slot=*
rank 1=node2 slot=*
cat hostf:
node1 slots=2
node2 slots=2
mpirun --rankfile rankf --hostfile hostf --host node1 -n 1 hostname : --host node2 -n 1 hostname
Error, invalid rank (1) in the rankfile (rankf)
--------------------------------------------------------------------------
[r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file rmaps_rank_file.c at line 403
[r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file base/rmaps_base_map_job.c at line 86
[r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file base/plm_base_launch_support.c at line 86
[r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file plm_rsh_module.c at line 1016
Ralph, could you tell me whether my command syntax is correct? If not, could you give me the expected one?
Regards
Geoffroy
2009/4/30 Geoffroy Pignot <geopig...@gmail.com>
Immediately Sir !!! :)
Thanks again Ralph
Geoffroy
------------------------------
Message: 2
Date: Thu, 30 Apr 2009 06:45:39 -0600
From: Ralph Castain <r...@open-mpi.org>
Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
To: Open MPI Users <us...@open-mpi.org>
Message-ID:
<71d2d8cc0904300545v61a42fe1k50086d2704d0f...@mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"
I believe this is fixed now in our development trunk - you can download any tarball starting from last night and give it a try, if you like. Any feedback would be appreciated.
Ralph
On Apr 14, 2009, at 7:57 AM, Ralph Castain wrote:
Ah now, I didn't say it -worked-, did I? :-)
Clearly a bug exists in the program. I'll try to take a look at it (if Lenny doesn't get to it first), but it won't be until later in the week.
On Apr 14, 2009, at 7:18 AM, Geoffroy Pignot wrote:
I agree with you, Ralph, and that's what I expect from Open MPI, but my second example shows that it's not working:
cat hostfile.0
r011n002 slots=4
r011n003 slots=4
cat rankfile.0
rank 0=r011n002 slot=0
rank 1=r011n003 slot=1
mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 hostname ### CRASHED
Error, invalid rank (1) in the rankfile (rankfile.0)
--------------------------------------------------------------------------
[r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file rmaps_rank_file.c at line 404
[r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file base/rmaps_base_map_job.c at line 87
[r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file base/plm_base_launch_support.c at line 77
[r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file plm_rsh_module.c at line 985
--------------------------------------------------------------------------
A daemon (pid unknown) died unexpectedly on signal 1 while attempting to launch so we are aborting.
There may be more information reported by the environment (see above).
This may be because the daemon was unable to find all the needed shared libraries on the remote node. You may set your LD_LIBRARY_PATH to have the location of the shared libraries on the remote nodes and this will automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
orterun noticed that the job aborted, but has no info as to the process that caused that situation.
--------------------------------------------------------------------------
orterun: clean termination accomplished
Message: 4
Date: Tue, 14 Apr 2009 06:55:58 -0600
From: Ralph Castain <r...@lanl.gov>
Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
To: Open MPI Users <us...@open-mpi.org>
Message-ID: <f6290ada-a196-43f0-a853-cbcb802d8...@lanl.gov>
Content-Type: text/plain; charset="us-ascii"; Format="flowed";
DelSp="yes"
The rankfile cuts across the entire job - it isn't applied on an app_context basis. So the ranks in your rankfile must correspond to the eventual rank of each process in the cmd line.
Unfortunately, that means you have to count ranks. In your case, you only have four, so that makes life easier. Your rankfile would look something like this:
rank 0=r001n001 slot=0
rank 1=r001n002 slot=1
rank 2=r001n001 slot=1
rank 3=r001n002 slot=2
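With that single rankfile, the whole job from your earlier command line would then be launched by one mpirun, roughly like this (a sketch only; the rankfile name my_rankfile is just a placeholder):
mpirun -rf my_rankfile -n 1 -host r001n001 master.x options1 : -n 1 -host r001n002 master.x options2 : -n 1 -host r001n001 slave.x options3 : -n 1 -host r001n002 slave.x options4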
HTH
Ralph
On Apr 14, 2009, at 12:19 AM, Geoffroy Pignot wrote:
Hi,
I agree that my examples are not very clear. What I want to do is launch a multi-exe application (masters/slaves) and benefit from processor affinity.
Could you show me how to convert this command using the -rf option (whatever the affinity is)?
mpirun -n 1 -host r001n001 master.x options1 : -n 1 -host r001n002 master.x options2 : -n 1 -host r001n001 slave.x options3 : -n 1 -host r001n002 slave.x options4
Thanks for your help
Geoffroy
Message: 2
Date: Sun, 12 Apr 2009 18:26:35 +0300
From: Lenny Verkhovsky <lenny.verkhov...@gmail.com>
Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
To: Open MPI Users <us...@open-mpi.org>
Message-ID:
<453d39990904120826t2e1d1d33l7bb1fe3de65b5...@mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"
Hi,
The first "crash" is OK, since your rankfile has ranks 0 and 1
defined,
while n=1, which means only rank 0 is present and can be
allocated.
NP must be >= the largest rank in rankfile.
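For example, with rankfile.0 defining ranks 0 and 1 you need at least two processes in total, as in your first (working) run:
mpirun --hostfile hostfile.0 -rf rankfile.0 -n 2 hostname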
What exactly are you trying to do?
I tried to recreate your segv, but all I got was:
~/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -rf rankfile.1 -n 1 hostname
[witch19:30798] mca: base: component_find: paffinity "mca_paffinity_linux" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
--------------------------------------------------------------------------
It looks like opal_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during opal_init; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):
opal_carto_base_select failed
--> Returned value -13 instead of OPAL_SUCCESS
--------------------------------------------------------------------------
[witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file ../../orte/runtime/orte_init.c at line 78
[witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file ../../orte/orted/orted_main.c at line 344
--------------------------------------------------------------------------
A daemon (pid 11629) died unexpectedly with status 243 while attempting to launch so we are aborting.
There may be more information reported by the environment (see above).
This may be because the daemon was unable to find all the needed shared libraries on the remote node. You may set your LD_LIBRARY_PATH to have the location of the shared libraries on the remote nodes and this will automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process that caused that situation.
--------------------------------------------------------------------------
mpirun: clean termination accomplished
Lenny.
On 4/10/09, Geoffroy Pignot <geopig...@gmail.com> wrote:
Hi,
I am currently testing the process affinity capabilities of Open MPI, and I would like to know whether the rankfile behaviour I describe below is normal or not.
cat hostfile.0
r011n002 slots=4
r011n003 slots=4
cat rankfile.0
rank 0=r011n002 slot=0
rank 1=r011n003 slot=1
##################################################################################
mpirun --hostfile hostfile.0 -rf rankfile.0 -n 2 hostname ### OK
r011n002
r011n003
##################################################################################
but
mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 hostname ### CRASHED
--------------------------------------------------------------------------
Error, invalid rank (1) in the rankfile (rankfile.0)
--------------------------------------------------------------------------
[r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file rmaps_rank_file.c at line 404
[r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file base/rmaps_base_map_job.c at line 87
[r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file base/plm_base_launch_support.c at line 77
[r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file plm_rsh_module.c at line 985
--------------------------------------------------------------------------
A daemon (pid unknown) died unexpectedly on signal 1 while attempting to launch so we are aborting.
There may be more information reported by the environment (see above).
This may be because the daemon was unable to find all the needed shared libraries on the remote node. You may set your LD_LIBRARY_PATH to have the location of the shared libraries on the remote nodes and this will automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
orterun noticed that the job aborted, but has no info as to the process that caused that situation.
--------------------------------------------------------------------------
orterun: clean termination accomplished
It seems that the rankfile option is not propagated to the second command line; there is no global understanding of the ranking inside an mpirun command.
##################################################################################
Assuming that, I tried to provide a rankfile to each command line:
cat rankfile.0
rank 0=r011n002 slot=0
cat rankfile.1
rank 0=r011n003 slot=1
mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -rf rankfile.1 -n 1 hostname ### CRASHED
[r011n002:28778] *** Process received signal ***
[r011n002:28778] Signal: Segmentation fault (11)
[r011n002:28778] Signal code: Address not mapped (1)
[r011n002:28778] Failing at address: 0x34
[r011n002:28778] [ 0] [0xffffe600]
[r011n002:28778] [ 1] /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0x55d) [0x5557decd]
[r011n002:28778] [ 2] /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0x117) [0x555842a7]
[r011n002:28778] [ 3] /tmp/HALMPI/openmpi-1.3.1/lib/openmpi/mca_plm_rsh.so [0x556098c0]
[r011n002:28778] [ 4] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x804aa27]
[r011n002:28778] [ 5] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x804a022]
[r011n002:28778] [ 6] /lib/libc.so.6(__libc_start_main+0xdc) [0x9f1dec]
[r011n002:28778] [ 7] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x8049f71]
[r011n002:28778] *** End of error message ***
Segmentation fault (core dumped)
I hope that I've found a bug, because it would be very important for me to have this kind of capability: launching a multi-exe mpirun command line and being able to bind my executables and sockets together.
Thanks in advance for your help
Geoffroy
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
End of users Digest, Vol 1221, Issue 3
**************************************