Hi Ralph
Thanks for your extra tests. Before leaving, I just pointed out a
problem that comes from running plpa across different RHEL distributions
(i.e. different Linux kernels). Indeed, I configure and compile Open MPI
on rhel4, then I run on rhel5. I think my problem comes from this
mismatch. I'll do a few more tests tomorrow morning (France) and
keep you informed.
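One check I plan to run (just a sketch; the pid below is a placeholder) is to trace the
affinity syscall and then look at the mask the process actually ends up with:

    strace -f -e trace=sched_setaffinity \
        /tmp/HALMPI/openmpi-1.4a/bin/mpirun --hostfile hostf --rankfile rankf --wdir /tmp -n 1 a.out 2>&1 | grep sched_setaffinity
    taskset -p <pid_of_a.out>    # show the affinity mask the kernel applied

If sched_setaffinity is never called, or returns an error on the rhel5 node, that would
point to the plpa/kernel mismatch.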
Regards
Geoffroy
Message: 2
Date: Mon, 4 May 2009 13:34:40 -0600
From: Ralph Castain <r...@open-mpi.org>
Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
To: Open MPI Users <us...@open-mpi.org>
Message-ID: <71d2d8cc0905041234m76eb5a9dx57a773997779d...@mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"
Hmmm...I'm afraid I can't replicate the problem. All seems to be working
just fine on the RHEL systems available to me. The procs indeed bind to the
specified processors in every case.
[rhc@odin ~/trunk]$ cat rankfile
rank 0=odin001 slot=0
rank 1=odin002 slot=1
[rhc@odin mpi]$ mpirun -rf ../../../rankfile -n 2 --leave-session-attached -mca paffinity_base_verbose 5 ./mpi_spin
[odin001.cs.indiana.edu:09297] paffinity slot assignment: slot_list == 0
[odin001.cs.indiana.edu:09297] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
[odin002.cs.indiana.edu:13566] paffinity slot assignment: slot_list == 1
[odin002.cs.indiana.edu:13566] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
Suspended
[rhc@odin mpi]$ ssh odin001
[rhc@odin001 ~]$ ps axo stat,user,psr,pid,pcpu,comm | grep rhc
S rhc 0 9296 0.0 orted
RLl rhc 0 9297 100 mpi_spin
[rhc@odin mpi]$ ssh odin002
[rhc@odin002 ~]$ ps axo stat,user,psr,pid,pcpu,comm | grep rhc
S rhc 0 13562 0.0 orted
RLl rhc 1 13566 102 mpi_spin
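As a further cross-check (a rough sketch, reusing the pids from the run above), taskset
can show the affinity list the kernel actually applied, independent of the psr column:

[rhc@odin001 ~]$ taskset -p -c 9297    # expect the allowed cpu list to be 0
[rhc@odin002 ~]$ taskset -p -c 13566   # expect the allowed cpu list to be 1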
Not sure where to go from here...perhaps someone else can spot the
problem?
Ralph
On Mon, May 4, 2009 at 8:28 AM, Ralph Castain <r...@open-mpi.org> wrote:
> Unfortunately, I didn't write any of that code - I was just fixing the
> mapper so it would properly map the procs. From what I can tell, the proper
> things are happening there.
>
> I'll have to dig into the code that specifically deals with parsing the
> results to bind the processes. Afraid that will take awhile longer - pretty
> dark in that hole.
>
>
> On Mon, May 4, 2009 at 8:04 AM, Geoffroy Pignot <geopig...@gmail.com> wrote:
>
>> Hi,
>>
>> So, there are no more crashes with my "crazy" mpirun command. But the
>> paffinity feature seems to be broken. Indeed, I am not able to pin my
>> processes.
>>
>> Simple test with a program using your plpa library:
>>
>> r011n006% cat hostf
>> r011n006 slots=4
>>
>> r011n006% cat rankf
>> rank 0=r011n006 slot=0 ----> bind to CPU 0, exact?
>>
>> r011n006% /tmp/HALMPI/openmpi-1.4a/bin/mpirun --hostfile hostf --rankfile rankf --wdir /tmp -n 1 a.out
>> >>> PLPA Number of processors online: 4
>> >>> PLPA Number of processor sockets: 2
>> >>> PLPA Socket 0 (ID 0): 2 cores
>> >>> PLPA Socket 1 (ID 3): 2 cores
>>
>> Ctrl+Z
>> r011n006% bg
>>
>> r011n006% ps axo stat,user,psr,pid,pcpu,comm | grep gpignot
>> R+ gpignot 3 9271 97.8 a.out
>>
>> In fact, whatever slot number I put in my rankfile, a.out always runs
>> on CPU 3. I was expecting it on CPU 0 according to my cpuinfo file
>> (see below).
>> The result is the same if I try the other syntax (rank 0=r011n006 slot=0:0
>> ----> bind to socket 0 - core 0, exact?).
>>
>> Thanks in advance
>>
>> Geoffroy
>>
>> PS: I run on rhel5
>>
>> r011n006% uname -a
>> Linux r011n006 2.6.18-92.1.1NOMAP32.el5 #1 SMP Sat Mar 15 01:46:39 CDT 2008 x86_64 x86_64 x86_64 GNU/Linux
>>
>> My configure is:
>> ./configure --prefix=/tmp/openmpi-1.4a --libdir='${exec_prefix}/lib64' --disable-dlopen --disable-mpi-cxx --enable-heterogeneous
>>
>>
>> r011n006% cat /proc/cpuinfo
>> processor : 0
>> vendor_id : GenuineIntel
>> cpu family : 6
>> model : 15
>> model name : Intel(R) Xeon(R) CPU 5150 @ 2.66GHz
>> stepping : 6
>> cpu MHz : 2660.007
>> cache size : 4096 KB
>> physical id : 0
>> siblings : 2
>> core id : 0
>> cpu cores : 2
>> fpu : yes
>> fpu_exception : yes
>> cpuid level : 10
>> wp : yes
>> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
>> bogomips : 5323.68
>> clflush size : 64
>> cache_alignment : 64
>> address sizes : 36 bits physical, 48 bits virtual
>> power management:
>>
>> processor : 1
>> vendor_id : GenuineIntel
>> cpu family : 6
>> model : 15
>> model name : Intel(R) Xeon(R) CPU 5150 @ 2.66GHz
>> stepping : 6
>> cpu MHz : 2660.007
>> cache size : 4096 KB
>> physical id : 3
>> siblings : 2
>> core id : 0
>> cpu cores : 2
>> fpu : yes
>> fpu_exception : yes
>> cpuid level : 10
>> wp : yes
>> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
>> bogomips : 5320.03
>> clflush size : 64
>> cache_alignment : 64
>> address sizes : 36 bits physical, 48 bits virtual
>> power management:
>>
>> processor : 2
>> vendor_id : GenuineIntel
>> cpu family : 6
>> model : 15
>> model name : Intel(R) Xeon(R) CPU 5150 @ 2.66GHz
>> stepping : 6
>> cpu MHz : 2660.007
>> cache size : 4096 KB
>> physical id : 0
>> siblings : 2
>> core id : 1
>> cpu cores : 2
>> fpu : yes
>> fpu_exception : yes
>> cpuid level : 10
>> wp : yes
>> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
>> bogomips : 5319.39
>> clflush size : 64
>> cache_alignment : 64
>> address sizes : 36 bits physical, 48 bits virtual
>> power management:
>>
>> processor : 3
>> vendor_id : GenuineIntel
>> cpu family : 6
>> model : 15
>> model name : Intel(R) Xeon(R) CPU 5150 @ 2.66GHz
>> stepping : 6
>> cpu MHz : 2660.007
>> cache size : 4096 KB
>> physical id : 3
>> siblings : 2
>> core id : 1
>> cpu cores : 2
>> fpu : yes
>> fpu_exception : yes
>> cpuid level : 10
>> wp : yes
>> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
>> bogomips : 5320.03
>> clflush size : 64
>> cache_alignment : 64
>> address sizes : 36 bits physical, 48 bits virtual
>> power management:
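>>
>> To summarize the topology implied by the cpuinfo above (logical cpu -> physical id / core id):
>>
>> cpu 0 -> socket 0, core 0
>> cpu 1 -> socket 3, core 0
>> cpu 2 -> socket 0, core 1
>> cpu 3 -> socket 3, core 1
>>
>> So if slot=0 means logical cpu 0, as I assume, the process should end up on
>> socket 0 / core 0 rather than on cpu 3.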
>>
>>
>>> ------------------------------
>>>
>>> Message: 2
>>> Date: Mon, 4 May 2009 04:45:57 -0600
>>> From: Ralph Castain <r...@open-mpi.org>
>>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>>> To: Open MPI Users <us...@open-mpi.org>
>>> Message-ID: <D01D7B16-4B47-46F3-AD41-D1A90B2E4927@open-mpi.org>
>>>
>>> Content-Type: text/plain; charset="us-ascii"; Format="flowed"; DelSp="yes"
>>>
>>> My apologies - I wasn't clear enough. You need a tarball from r21111
>>> or greater...such as:
>>>
>>> http://www.open-mpi.org/nightly/trunk/openmpi-1.4a1r21142.tar.gz
>>>
>>> HTH
>>> Ralph
>>>
>>>
>>> On May 4, 2009, at 2:14 AM, Geoffroy Pignot wrote:
>>>
>>> > Hi ,
>>> >
>>> > I got the openmpi-1.4a1r21095.tar.gz tarball, but unfortunately my
>>> > command doesn't work:
>>> >
>>> > cat rankf:
>>> > rank 0=node1 slot=*
>>> > rank 1=node2 slot=*
>>> >
>>> > cat hostf:
>>> > node1 slots=2
>>> > node2 slots=2
>>> >
>>> > mpirun --rankfile rankf --hostfile hostf --host node1 -n 1 hostname : --host node2 -n 1 hostname
>>> >
>>> > Error, invalid rank (1) in the rankfile (rankf)
>>> >
>>> >
>>>
--------------------------------------------------------------------------
>>> > [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>> > rmaps_rank_file.c at line 403
>>> > [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>> > base/rmaps_base_map_job.c at line 86
>>> > [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>> > base/plm_base_launch_support.c at line 86
>>> > [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>> > plm_rsh_module.c at line 1016
>>> >
>>> >
>>> > Ralph, could you tell me if my command syntax is correct or not? If
>>> > not, could you give me the expected one?
>>> >
>>> > Regards
>>> >
>>> > Geoffroy
>>> >
>>> >
>>> >
>>> >
>>> > 2009/4/30 Geoffroy Pignot <geopig...@gmail.com>
>>> > Immediately Sir !!! :)
>>> >
>>> > Thanks again Ralph
>>> >
>>> > Geoffroy
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > ------------------------------
>>> >
>>> > Message: 2
>>> > Date: Thu, 30 Apr 2009 06:45:39 -0600
>>> > From: Ralph Castain <r...@open-mpi.org>
>>> > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>>> > To: Open MPI Users <us...@open-mpi.org>
>>> > Message-ID: <71d2d8cc0904300545v61a42fe1k50086d2704d0f...@mail.gmail.com>
>>> > Content-Type: text/plain; charset="iso-8859-1"
>>> >
>>> > I believe this is fixed now in our development trunk - you can
>>> > download any tarball starting from last night and give it a try, if you
>>> > like. Any feedback would be appreciated.
>>> >
>>> > Ralph
>>> >
>>> >
>>> > On Apr 14, 2009, at 7:57 AM, Ralph Castain wrote:
>>> >
>>> > Ah now, I didn't say it -worked-, did I? :-)
>>> >
>>> > Clearly a bug exists in the program. I'll try to take a look at it
>>> > (if Lenny doesn't get to it first), but it won't be until later in the week.
>>> >
>>> > On Apr 14, 2009, at 7:18 AM, Geoffroy Pignot wrote:
>>> >
>>> > I agree with you Ralph, and that's what I expect from openmpi, but my
>>> > second example shows that it's not working.
>>> >
>>> > cat hostfile.0
>>> > r011n002 slots=4
>>> > r011n003 slots=4
>>> >
>>> > cat rankfile.0
>>> > rank 0=r011n002 slot=0
>>> > rank 1=r011n003 slot=1
>>> >
>>> > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 hostname
>>> > ### CRASHED
>>> >
>>> > > > Error, invalid rank (1) in the rankfile (rankfile.0)
>>> > > >
>>> > >
>>> >
>>>
--------------------------------------------------------------------------
>>> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>> > > > rmaps_rank_file.c at line 404
>>> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>> > > > base/rmaps_base_map_job.c at line 87
>>> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>> > > > base/plm_base_launch_support.c at line 77
>>> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>> > > > plm_rsh_module.c at line 985
>>> > > >
>>> > >
>>> >
>>>
--------------------------------------------------------------------------
>>> > > > A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
>>> > > > launch so we are aborting.
>>> > > >
>>> > > > There may be more information reported by the environment (see above).
>>> > > >
>>> > > > This may be because the daemon was unable to find all the needed shared
>>> > > > libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>> > > > location of the shared libraries on the remote nodes and this will
>>> > > > automatically be forwarded to the remote nodes.
>>> > > >
>>> > >
>>> >
>>>
--------------------------------------------------------------------------
>>> > > >
>>> > >
>>> >
>>>
--------------------------------------------------------------------------
>>> > > > orterun noticed that the job aborted, but has no info as to the process
>>> > > > that caused that situation.
>>> > > >
>>> > >
>>> >
>>>
--------------------------------------------------------------------------
>>> > > > orterun: clean termination accomplished
>>> >
>>> >
>>> >
>>> > Message: 4
>>> > Date: Tue, 14 Apr 2009 06:55:58 -0600
>>> > From: Ralph Castain <r...@lanl.gov>
>>> > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>>> > To: Open MPI Users <us...@open-mpi.org>
>>> > Message-ID: <f6290ada-a196-43f0-a853-cbcb802d8...@lanl.gov>
>>> > Content-Type: text/plain; charset="us-ascii"; Format="flowed"; DelSp="yes"
>>> >
>>> > The rankfile cuts across the entire job - it isn't applied on an
>>> > app_context basis. So the ranks in your rankfile must correspond to
>>> > the eventual rank of each process in the cmd line.
>>> >
>>> > Unfortunately, that means you have to count ranks. In your case, you
>>> > only have four, so that makes life easier. Your rankfile would look
>>> > something like this:
>>> >
>>> > rank 0=r001n001 slot=0
>>> > rank 1=r001n002 slot=1
>>> > rank 2=r001n001 slot=1
>>> > rank 3=r001n002 slot=2
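>>> >
>>> > (To make the mapping explicit - just a sketch assuming the command you
>>> > quoted below - the ranks follow the app contexts in order: rank 0 =
>>> > master.x on r001n001, rank 1 = master.x on r001n002, rank 2 = slave.x on
>>> > r001n001, rank 3 = slave.x on r001n002.)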
>>> >
>>> > HTH
>>> > Ralph
>>> >
>>> > On Apr 14, 2009, at 12:19 AM, Geoffroy Pignot wrote:
>>> >
>>> > > Hi,
>>> > >
>>> > > I agree that my examples are not very clear. What I want to do is to
>>> > > launch a multi-exe application (masters-slaves) and benefit from the
>>> > > processor affinity.
>>> > > Could you show me how to convert this command, using the -rf option
>>> > > (whatever the affinity is)?
>>> > >
>>> > > mpirun -n 1 -host r001n001 master.x options1 : -n 1 -host r001n002
>>> > > master.x options2 : -n 1 -host r001n001 slave.x options3 : -n 1 -host
>>> > > r001n002 slave.x options4
>>> > >
>>> > > Thanks for your help
>>> > >
>>> > > Geoffroy
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > > Message: 2
>>> > > Date: Sun, 12 Apr 2009 18:26:35 +0300
>>> > > From: Lenny Verkhovsky <lenny.verkhov...@gmail.com>
>>> > > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>>> > > To: Open MPI Users <us...@open-mpi.org>
>>> > > Message-ID: <453d39990904120826t2e1d1d33l7bb1fe3de65b5...@mail.gmail.com>
>>> > > Content-Type: text/plain; charset="iso-8859-1"
>>> > >
>>> > > Hi,
>>> > >
>>> > > The first "crash" is OK, since your rankfile has ranks
0 and 1
>>> > > defined,
>>> > > while n=1, which means only rank 0 is present and can be
allocated.
>>> > >
>>> > > NP must be >= the largest rank in rankfile.
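>>> > >
>>> > > For instance (just a sketch based on your files): rankfile.0 defines
>>> > > ranks 0 and 1, so the command needs at least two processes in total
>>> > > across its app contexts, e.g.
>>> > >
>>> > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 2 hostname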
>>> > >
>>> > > What exactly are you trying to do ?
>>> > >
>>> > > I tried to recreate your segv but all I got was
>>> > >
>>> > > ~/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun --hostfile hostfile.0
>>> > > -rf rankfile.0 -n 1 hostname : -rf rankfile.1 -n 1 hostname
>>> > > [witch19:30798] mca: base: component_find: paffinity "mca_paffinity_linux"
>>> > > uses an MCA interface that is not recognized (component MCA v1.0.0 !=
>>> > > supported MCA v2.0.0) -- ignored
>>> > >
>>> >
>>>
--------------------------------------------------------------------------
>>> > > It looks like opal_init failed for some reason; your parallel
>>> > > process is likely to abort. There are many reasons that a parallel
>>> > > process can fail during opal_init; some of which are due to
>>> > > configuration or environment problems. This failure appears to be an
>>> > > internal failure; here's some additional information (which may only
>>> > > be relevant to an Open MPI developer):
>>> > >
>>> > > opal_carto_base_select failed
>>> > > --> Returned value -13 instead of OPAL_SUCCESS
>>> > >
>>> >
>>>
--------------------------------------------------------------------------
>>> > > [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
>>> > > ../../orte/runtime/orte_init.c at line 78
>>> > > [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
>>> > > ../../orte/orted/orted_main.c at line 344
>>> > >
>>> >
>>>
--------------------------------------------------------------------------
>>> > > A daemon (pid 11629) died unexpectedly with status 243 while attempting
>>> > > to launch so we are aborting.
>>> > >
>>> > > There may be more information reported by the environment (see above).
>>> > >
>>> > > This may be because the daemon was unable to find all the needed shared
>>> > > libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>> > > location of the shared libraries on the remote nodes and this will
>>> > > automatically be forwarded to the remote nodes.
>>> > >
>>> >
>>>
--------------------------------------------------------------------------
>>> > >
>>> >
>>>
--------------------------------------------------------------------------
>>> > > mpirun noticed that the job aborted, but has no info as to the process
>>> > > that caused that situation.
>>> > >
>>> >
>>>
--------------------------------------------------------------------------
>>> > > mpirun: clean termination accomplished
>>> > >
>>> > >
>>> > > Lenny.
>>> > >
>>> > >
>>> > > On 4/10/09, Geoffroy Pignot <geopig...@gmail.com> wrote:
>>> > > >
>>> > > > Hi ,
>>> > > >
>>> > > > I am currently testing the process affinity capabilities of openmpi and I
>>> > > > would like to know if the rankfile behaviour I will describe below is
>>> > > > normal or not?
>>> > > >
>>> > > > cat hostfile.0
>>> > > > r011n002 slots=4
>>> > > > r011n003 slots=4
>>> > > >
>>> > > > cat rankfile.0
>>> > > > rank 0=r011n002 slot=0
>>> > > > rank 1=r011n003 slot=1
>>> > > >
>>> > > >
>>> > > >
>>> > >
>>> >
>>>
##################################################################################
>>> > > >
>>> > > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 2 hostname ### OK
>>> > > > r011n002
>>> > > > r011n003
>>> > > >
>>> > > >
>>> > > >
>>> > >
>>> >
>>>
##################################################################################
>>> > > > but
>>> > > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 hostname
>>> > > > ### CRASHED
>>> > > >
>>> > >
>>> >
>>>
--------------------------------------------------------------------------
>>> > > > Error, invalid rank (1) in the rankfile (rankfile.0)
>>> > > >
>>> > >
>>> >
>>>
--------------------------------------------------------------------------
>>> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>> > > > rmaps_rank_file.c at line 404
>>> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>> > > > base/rmaps_base_map_job.c at line 87
>>> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>> > > > base/plm_base_launch_support.c at line 77
>>> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>> > > > plm_rsh_module.c at line 985
>>> > > >
>>> > >
>>> >
>>>
--------------------------------------------------------------------------
>>> > > > A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
>>> > > > launch so we are aborting.
>>> > > >
>>> > > > There may be more information reported by the environment (see above).
>>> > > >
>>> > > > This may be because the daemon was unable to find all the needed shared
>>> > > > libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>> > > > location of the shared libraries on the remote nodes and this will
>>> > > > automatically be forwarded to the remote nodes.
>>> > > >
>>> > >
>>> >
>>>
--------------------------------------------------------------------------
>>> > > >
>>> > >
>>> >
>>>
--------------------------------------------------------------------------
>>> > > > orterun noticed that the job aborted, but has no info as to the process
>>> > > > that caused that situation.
>>> > > >
>>> > >
>>> >
>>>
--------------------------------------------------------------------------
>>> > > > orterun: clean termination accomplished
>>> > > > It seems that the rankfile option is not propagated to the second
>>> > > > command line; there is no global understanding of the ranking inside
>>> > > > a mpirun command.
>>> > > >
>>> > > >
>>> > > >
>>> > >
>>> >
>>>
##################################################################################
>>> > > >
>>> > > > Assuming that, I tried to provide a rankfile to each command line:
>>> > > >
>>> > > > cat rankfile.0
>>> > > > rank 0=r011n002 slot=0
>>> > > >
>>> > > > cat rankfile.1
>>> > > > rank 0=r011n003 slot=1
>>> > > >
>>> > > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -rf rankfile.1
>>> > > > -n 1 hostname ### CRASHED
>>> > > > [r011n002:28778] *** Process received signal ***
>>> > > > [r011n002:28778] Signal: Segmentation fault (11)
>>> > > > [r011n002:28778] Signal code: Address not mapped (1)
>>> > > > [r011n002:28778] Failing at address: 0x34
>>> > > > [r011n002:28778] [ 0] [0xffffe600]
>>> > > > [r011n002:28778] [ 1] /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0x55d) [0x5557decd]
>>> > > > [r011n002:28778] [ 2] /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0x117) [0x555842a7]
>>> > > > [r011n002:28778] [ 3] /tmp/HALMPI/openmpi-1.3.1/lib/openmpi/mca_plm_rsh.so [0x556098c0]
>>> > > > [r011n002:28778] [ 4] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x804aa27]
>>> > > > [r011n002:28778] [ 5] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x804a022]
>>> > > > [r011n002:28778] [ 6] /lib/libc.so.6(__libc_start_main+0xdc) [0x9f1dec]
>>> > > > [r011n002:28778] [ 7] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x8049f71]
>>> > > > [r011n002:28778] *** End of error message ***
>>> > > > Segmentation fault (core dumped)
>>> > > >
>>> > > >
>>> > > >
>>> > > > I hope that I've found a bug, because it would be very important for me
>>> > > > to have this kind of capability:
>>> > > > launch a multi-exe mpirun command line and be able to bind my exes and
>>> > > > sockets together.
>>> > > >
>>> > > > Thanks in advance for your help
>>> > > >
>>> > > > Geoffroy
------------------------------------------------------------------------
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users