Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??

Geoffroy Pignot Mon, 4 May 2009 16:00:11 -0400

Hi Ralph

Thanks for your extra tests. Before leaving, I just spotted a problem
coming from running PLPA across different RHEL releases (and therefore
different Linux kernels). I configure and compile Open MPI on RHEL4, then
run on RHEL5. I think my problem comes from this shortcut. I'll do a few more
tests tomorrow morning (France time) and keep you informed.


Regards

Geoffroy

>
>
> Message: 2
> Date: Mon, 4 May 2009 13:34:40 -0600
> From: Ralph Castain <r...@open-mpi.org>
> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
> To: Open MPI Users <us...@open-mpi.org>
> Message-ID:
>        <71d2d8cc0905041234m76eb5a9dx57a773997779d...@mail.gmail.com>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Hmmm...I'm afraid I can't replicate the problem. All seems to be working
> just fine on the RHEL systems available to me. The procs indeed bind to the
> specified processors in every case.
>
> [rhc@odin ~/trunk]$ cat rankfile
> rank 0=odin001 slot=0
> rank 1=odin002 slot=1
>
> [rhc@odin mpi]$ mpirun -rf ../../../rankfile -n 2 --leave-session-attached
> -mca paffinity_base_verbose 5 ./mpi_spin
> [odin001.cs.indiana.edu:09297] paffinity slot assignment: slot_list == 0
> [odin001.cs.indiana.edu:09297] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> [odin002.cs.indiana.edu:13566] paffinity slot assignment: slot_list == 1
> [odin002.cs.indiana.edu:13566] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
>
> Suspended
> [rhc@odin mpi]$ ssh odin001
> [rhc@odin001 ~]$ ps axo stat,user,psr,pid,pcpu,comm | grep rhc
> S    rhc        0  9296  0.0 orted
> RLl  rhc        0  9297  100 mpi_spin
>
> [rhc@odin mpi]$ ssh odin002
> [rhc@odin002 ~]$ ps axo stat,user,psr,pid,pcpu,comm | grep rhc
> S    rhc        0 13562  0.0 orted
> RLl  rhc        1 13566  102 mpi_spin
>
>
> Not sure where to go from here...perhaps someone else can spot the problem?
> Ralph
>
>
> On Mon, May 4, 2009 at 8:28 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
> > Unfortunately, I didn't write any of that code - I was just fixing the
> > mapper so it would properly map the procs. From what I can tell, the proper
> > things are happening there.
> >
> > I'll have to dig into the code that specifically deals with parsing the
> > results to bind the processes. Afraid that will take awhile longer - pretty
> > dark in that hole.
> >
> >
> >
> > On Mon, May 4, 2009 at 8:04 AM, Geoffroy Pignot <geopig...@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> So, there are no more crashes with my "crazy" mpirun command. But the
> >> paffinity feature seems to be broken. Indeed I am not able to pin my
> >> processes.
> >>
> >> Simple test with a program using your plpa library :
> >>
> >> r011n006% cat hostf
> >> r011n006 slots=4
> >>
> >> r011n006% cat rankf
> >> rank 0=r011n006 slot=0   ----> bind to CPU 0, correct?
> >>
> >> r011n006% /tmp/HALMPI/openmpi-1.4a/bin/mpirun --hostfile hostf --rankfile rankf --wdir /tmp -n 1 a.out
> >>  >>> PLPA Number of processors online: 4
> >>  >>> PLPA Number of processor sockets: 2
> >>  >>> PLPA Socket 0 (ID 0): 2 cores
> >>  >>> PLPA Socket 1 (ID 3): 2 cores
> >>
> >> Ctrl+Z
> >> r011n006% bg
> >>
> >> r011n006% ps axo stat,user,psr,pid,pcpu,comm | grep gpignot
> >> R+   gpignot    3  9271 97.8 a.out
> >>
> >> In fact, whatever slot number I put in my rankfile, a.out always runs
> >> on CPU 3. I expected it on CPU 0, according to my cpuinfo file
> >> (see below).
> >> The result is the same if I try another syntax (rank 0=r011n006 slot=0:0,
> >> which should bind to socket 0, core 0, correct?).
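
A minimal sketch, in Python, of reading the PSR column from `ps` output like the line quoted above. The helper names `parse_ps_line` and `on_expected_cpu` are hypothetical and not part of Open MPI or procps. Keep in mind that `psr` reports the processor a process was last scheduled on, not its allowed affinity mask, so this is only a rough check.

```python
def parse_ps_line(line: str) -> dict:
    """Split one line of `ps axo stat,user,psr,pid,pcpu,comm` into fields."""
    stat, user, psr, pid, pcpu, comm = line.split(None, 5)
    return {
        "stat": stat,
        "user": user,
        "psr": int(psr),      # processor the process was last scheduled on
        "pid": int(pid),
        "pcpu": float(pcpu),
        "comm": comm.strip(),
    }


def on_expected_cpu(line: str, expected_cpu: int) -> bool:
    """True if the process on this ps line was last seen on expected_cpu."""
    return parse_ps_line(line)["psr"] == expected_cpu


# Line taken from the ps output quoted above.
sample = "R+   gpignot    3  9271 97.8 a.out"
print(parse_ps_line(sample)["psr"])   # 3
print(on_expected_cpu(sample, 0))     # False
```

Sampling the same process several times gives a better picture than a single reading, since an unbound process can migrate between CPUs.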
> >>
> >> Thanks in advance
> >>
> >> Geoffroy
> >>
> >> PS: I run on rhel5
> >>
> >> r011n006% uname -a
> >> Linux r011n006 2.6.18-92.1.1NOMAP32.el5 #1 SMP Sat Mar 15 01:46:39 CDT
> >> 2008 x86_64 x86_64 x86_64 GNU/Linux
> >>
> >> My configure is :
> >>  ./configure --prefix=/tmp/openmpi-1.4a --libdir='${exec_prefix}/lib64'
> >> --disable-dlopen --disable-mpi-cxx --enable-heterogeneous
> >>
> >>
> >> r011n006% cat /proc/cpuinfo
> >> processor       : 0
> >> vendor_id       : GenuineIntel
> >> cpu family      : 6
> >> model           : 15
> >> model name      : Intel(R) Xeon(R) CPU            5150  @ 2.66GHz
> >> stepping        : 6
> >> cpu MHz         : 2660.007
> >> cache size      : 4096 KB
> >> physical id     : 0
> >> siblings        : 2
> >> core id         : 0
> >> cpu cores       : 2
> >> fpu             : yes
> >> fpu_exception   : yes
> >> cpuid level     : 10
> >> wp              : yes
> >> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
> >> cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm
> >> constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
> >> bogomips        : 5323.68
> >> clflush size    : 64
> >> cache_alignment : 64
> >> address sizes   : 36 bits physical, 48 bits virtual
> >> power management:
> >>
> >> processor       : 1
> >> vendor_id       : GenuineIntel
> >> cpu family      : 6
> >> model           : 15
> >> model name      : Intel(R) Xeon(R) CPU            5150  @ 2.66GHz
> >> stepping        : 6
> >> cpu MHz         : 2660.007
> >> cache size      : 4096 KB
> >> physical id     : 3
> >> siblings        : 2
> >> core id         : 0
> >> cpu cores       : 2
> >> fpu             : yes
> >> fpu_exception   : yes
> >> cpuid level     : 10
> >> wp              : yes
> >> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
> >> cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm
> >> constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
> >> bogomips        : 5320.03
> >> clflush size    : 64
> >> cache_alignment : 64
> >> address sizes   : 36 bits physical, 48 bits virtual
> >> power management:
> >>
> >> processor       : 2
> >> vendor_id       : GenuineIntel
> >> cpu family      : 6
> >> model           : 15
> >> model name      : Intel(R) Xeon(R) CPU            5150  @ 2.66GHz
> >> stepping        : 6
> >> cpu MHz         : 2660.007
> >> cache size      : 4096 KB
> >> physical id     : 0
> >> siblings        : 2
> >> core id         : 1
> >> cpu cores       : 2
> >> fpu             : yes
> >> fpu_exception   : yes
> >> cpuid level     : 10
> >> wp              : yes
> >> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
> >> cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm
> >> constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
> >> bogomips        : 5319.39
> >> clflush size    : 64
> >> cache_alignment : 64
> >> address sizes   : 36 bits physical, 48 bits virtual
> >> power management:
> >>
> >> processor       : 3
> >> vendor_id       : GenuineIntel
> >> cpu family      : 6
> >> model           : 15
> >> model name      : Intel(R) Xeon(R) CPU            5150  @ 2.66GHz
> >> stepping        : 6
> >> cpu MHz         : 2660.007
> >> cache size      : 4096 KB
> >> physical id     : 3
> >> siblings        : 2
> >> core id         : 1
> >> cpu cores       : 2
> >> fpu             : yes
> >> fpu_exception   : yes
> >> cpuid level     : 10
> >> wp              : yes
> >> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
> >> cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm
> >> constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
> >> bogomips        : 5320.03
> >> clflush size    : 64
> >> cache_alignment : 64
> >> address sizes   : 36 bits physical, 48 bits virtual
> >> power management:
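
A minimal sketch, in Python, that condenses a /proc/cpuinfo listing like the one above into a map from logical processor to (physical id, core id). The function name `cpu_topology` is hypothetical, and the sample text keeps only the three fields of interest from the listing. How rankfile slot numbers relate to these ids is not something this sketch establishes.

```python
def cpu_topology(cpuinfo_text: str) -> dict:
    """Map each logical processor number to (physical id, core id)."""
    topology = {}
    fields = {}
    # The appended empty string makes sure the last stanza is flushed.
    for line in cpuinfo_text.splitlines() + [""]:
        if line.strip():
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
        elif "processor" in fields:
            topology[int(fields["processor"])] = (
                int(fields.get("physical id", -1)),  # -1 if the field is absent
                int(fields.get("core id", -1)),
            )
            fields = {}
    return topology


# Reduced copy of the listing above (three fields per processor).
SAMPLE = """\
processor       : 0
physical id     : 0
core id         : 0

processor       : 1
physical id     : 3
core id         : 0

processor       : 2
physical id     : 0
core id         : 1

processor       : 3
physical id     : 3
core id         : 1
"""

print(cpu_topology(SAMPLE))
# {0: (0, 0), 1: (3, 0), 2: (0, 1), 3: (3, 1)}
```

On this machine the two sockets report physical ids 0 and 3, and logical processors alternate between them, which matches the "Socket 1 (ID 3)" line in the PLPA output.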
> >>
> >>
> >>> ------------------------------
> >>>
> >>> Message: 2
> >>> Date: Mon, 4 May 2009 04:45:57 -0600
> >>> From: Ralph Castain <r...@open-mpi.org>
> >>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
> >>> To: Open MPI Users <us...@open-mpi.org>
> >>> Message-ID: <d01d7b16-4b47-46f3-ad41-d1a90b2e4...@open-mpi.org>
> >>>
> >>> Content-Type: text/plain; charset="us-ascii"; Format="flowed";
> >>>        DelSp="yes"
> >>>
> >>> My apologies - I wasn't clear enough. You need a tarball from r21111
> >>> or greater...such as:
> >>>
> >>> http://www.open-mpi.org/nightly/trunk/openmpi-1.4a1r21142.tar.gz
> >>>
> >>> HTH
> >>> Ralph
> >>>
> >>>
> >>> On May 4, 2009, at 2:14 AM, Geoffroy Pignot wrote:
> >>>
> >>> > Hi ,
> >>> >
> >>> > I got the openmpi-1.4a1r21095.tar.gz tarball, but unfortunately my
> >>> > command doesn't work
> >>> >
> >>> > cat rankf:
> >>> > rank 0=node1 slot=*
> >>> > rank 1=node2 slot=*
> >>> >
> >>> > cat hostf:
> >>> > node1 slots=2
> >>> > node2 slots=2
> >>> >
> >>> > mpirun  --rankfile rankf --hostfile hostf  --host node1 -n 1
> >>> > hostname : --host node2 -n 1 hostname
> >>> >
> >>> > Error, invalid rank (1) in the rankfile (rankf)
> >>> >
> >>> >
> >>> > --------------------------------------------------------------------------
> >>> > [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
> >>> > rmaps_rank_file.c at line 403
> >>> > [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
> >>> > base/rmaps_base_map_job.c at line 86
> >>> > [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
> >>> > base/plm_base_launch_support.c at line 86
> >>> > [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
> >>> > plm_rsh_module.c at line 1016
> >>> >
> >>> >
> >>> > Ralph, could you tell me whether my command syntax is correct? If
> >>> > not, could you give me the expected one?
> >>> >
> >>> > Regards
> >>> >
> >>> > Geoffroy
> >>> >
> >>> >
> >>> >
> >>> >
> >>> > 2009/4/30 Geoffroy Pignot <geopig...@gmail.com>
> >>> > Immediately Sir !!! :)
> >>> >
> >>> > Thanks again Ralph
> >>> >
> >>> > Geoffroy
> >>> >
> >>> >
> >>> >
> >>> >
> >>> >
> >>> > ------------------------------
> >>> >
> >>> > Message: 2
> >>> > Date: Thu, 30 Apr 2009 06:45:39 -0600
> >>> > From: Ralph Castain <r...@open-mpi.org>
> >>> > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
> >>> > To: Open MPI Users <us...@open-mpi.org>
> >>> > Message-ID:
> >>> >        <71d2d8cc0904300545v61a42fe1k50086d2704d0f...@mail.gmail.com>
> >>> > Content-Type: text/plain; charset="iso-8859-1"
> >>> >
> >>> > I believe this is fixed now in our development trunk - you can
> >>> > download any
> >>> > tarball starting from last night and give it a try, if you like. Any
> >>> > feedback would be appreciated.
> >>> >
> >>> > Ralph
> >>> >
> >>> >
> >>> > On Apr 14, 2009, at 7:57 AM, Ralph Castain wrote:
> >>> >
> >>> > Ah now, I didn't say it -worked-, did I? :-)
> >>> >
> >>> > Clearly a bug exists in the program. I'll try to take a look at it
> >>> > (if Lenny
> >>> > doesn't get to it first), but it won't be until later in the week.
> >>> >
> >>> > On Apr 14, 2009, at 7:18 AM, Geoffroy Pignot wrote:
> >>> >
> >>> > I agree with you, Ralph, and that's what I expect from Open MPI, but my
> >>> > second example shows that it's not working
> >>> >
> >>> > cat hostfile.0
> >>> >   r011n002 slots=4
> >>> >   r011n003 slots=4
> >>> >
> >>> >  cat rankfile.0
> >>> >    rank 0=r011n002 slot=0
> >>> >    rank 1=r011n003 slot=1
> >>> >
> >>> > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1
> >>> > hostname
> >>> > ### CRASHED
> >>> >
> >>> > > > Error, invalid rank (1) in the rankfile (rankfile.0)
> >>> > > >
> >>> > > > --------------------------------------------------------------------------
> >>> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> >>> > > > rmaps_rank_file.c at line 404
> >>> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> >>> > > > base/rmaps_base_map_job.c at line 87
> >>> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> >>> > > > base/plm_base_launch_support.c at line 77
> >>> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> >>> > > > plm_rsh_module.c at line 985
> >>> > > > --------------------------------------------------------------------------
> >>> > > > A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
> >>> > > > launch so we are aborting.
> >>> > > >
> >>> > > > There may be more information reported by the environment (see above).
> >>> > > >
> >>> > > > This may be because the daemon was unable to find all the needed shared
> >>> > > > libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> >>> > > > location of the shared libraries on the remote nodes and this will
> >>> > > > automatically be forwarded to the remote nodes.
> >>> > > > --------------------------------------------------------------------------
> >>> > > > --------------------------------------------------------------------------
> >>> > > > orterun noticed that the job aborted, but has no info as to the process
> >>> > > > that caused that situation.
> >>> > > > --------------------------------------------------------------------------
> >>> > > > orterun: clean termination accomplished
> >>> >
> >>> >
> >>> >
> >>> > Message: 4
> >>> > Date: Tue, 14 Apr 2009 06:55:58 -0600
> >>> > From: Ralph Castain <r...@lanl.gov>
> >>> > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
> >>> > To: Open MPI Users <us...@open-mpi.org>
> >>> > Message-ID: <f6290ada-a196-43f0-a853-cbcb802d8...@lanl.gov>
> >>> > Content-Type: text/plain; charset="us-ascii"; Format="flowed";
> >>> >       DelSp="yes"
> >>> >
> >>> > The rankfile cuts across the entire job - it isn't applied on an
> >>> > app_context basis. So the ranks in your rankfile must correspond to
> >>> > the eventual rank of each process in the cmd line.
> >>> >
> >>> > Unfortunately, that means you have to count ranks. In your case, you
> >>> > only have four, so that makes life easier. Your rankfile would look
> >>> > something like this:
> >>> >
> >>> > rank 0=r001n001 slot=0
> >>> > rank 1=r001n002 slot=1
> >>> > rank 2=r001n001 slot=1
> >>> > rank 3=r001n002 slot=2
> >>> >
> >>> > HTH
> >>> > Ralph
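
A minimal sketch, in Python, of the rank counting Ralph describes above: ranks are numbered once across the whole command line, not per app context. The function name `build_rankfile` and the (host, process count, slots) tuple layout are hypothetical, and the sketch assumes ranks are handed out consecutively in the order the app contexts appear on the mpirun line.

```python
from typing import List, Tuple


def build_rankfile(app_contexts: List[Tuple[str, int, List[int]]]) -> str:
    """Return rankfile text for app contexts given in command-line order.

    Each app context is (host, process_count, slots). Ranks are numbered
    consecutively across all contexts, starting from 0.
    """
    lines = []
    rank = 0
    for host, count, slots in app_contexts:
        if len(slots) != count:
            raise ValueError(f"{host}: need {count} slot(s), got {len(slots)}")
        for slot in slots:
            lines.append(f"rank {rank}={host} slot={slot}")
            rank += 1
    return "\n".join(lines) + "\n"


# Four single-process app contexts, with the hosts and slot numbers
# used in Ralph's example rankfile.
contexts = [
    ("r001n001", 1, [0]),
    ("r001n002", 1, [1]),
    ("r001n001", 1, [1]),
    ("r001n002", 1, [2]),
]
print(build_rankfile(contexts), end="")
# rank 0=r001n001 slot=0
# rank 1=r001n002 slot=1
# rank 2=r001n001 slot=1
# rank 3=r001n002 slot=2
```

Generating the file this way avoids counting ranks by hand when the number of app contexts grows.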
> >>> >
> >>> > On Apr 14, 2009, at 12:19 AM, Geoffroy Pignot wrote:
> >>> >
> >>> > > Hi,
> >>> > >
> >>> > > I agree that my examples are not very clear. What I want to do is to
> >>> > > launch a multi-executable application (masters and slaves) and benefit
> >>> > > from processor affinity.
> >>> > > Could you show me how to convert this command to use the -rf option
> >>> > > (whatever the affinity is)?
> >>> > >
> >>> > > mpirun -n 1 -host r001n001 master.x options1 : -n 1 -host r001n002
> >>> > > master.x options2 : -n 1 -host r001n001 slave.x options3 : -n 1
> >>> > > -host r001n002 slave.x options4
> >>> > >
> >>> > > Thanks for your help
> >>> > >
> >>> > > Geoffroy
> >>> > >
> >>> > >
> >>> > >
> >>> > >
> >>> > >
> >>> > > Message: 2
> >>> > > Date: Sun, 12 Apr 2009 18:26:35 +0300
> >>> > > From: Lenny Verkhovsky <lenny.verkhov...@gmail.com>
> >>> > > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
> >>> > > To: Open MPI Users <us...@open-mpi.org>
> >>> > > Message-ID:
> >>> > >        <453d39990904120826t2e1d1d33l7bb1fe3de65b5...@mail.gmail.com>
> >>> > > Content-Type: text/plain; charset="iso-8859-1"
> >>> > >
> >>> > > Hi,
> >>> > >
> >>> > > The first "crash" is OK, since your rankfile has ranks 0 and 1 defined,
> >>> > > while n=1, which means only rank 0 is present and can be allocated.
> >>> > >
> >>> > > NP must be >= the largest rank in rankfile.
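
A minimal sketch, in Python, of the consistency check Lenny describes above. The helper names `rankfile_ranks` and `ranks_out_of_range` are hypothetical, and `total_procs` is assumed to be the sum of all -n values on the mpirun line. Since MPI ranks are numbered from 0, the check treats any rank equal to or above the total process count as out of range.

```python
import re

# Matches the start of a rankfile entry such as "rank 1=r011n003 slot=1".
RANK_LINE = re.compile(r"^\s*rank\s+(\d+)\s*=")


def rankfile_ranks(text: str) -> list:
    """Return the rank numbers named in a rankfile, in file order."""
    ranks = []
    for line in text.splitlines():
        match = RANK_LINE.match(line)
        if match:
            ranks.append(int(match.group(1)))
    return ranks


def ranks_out_of_range(text: str, total_procs: int) -> list:
    """Return the ranks that cannot exist in a job of total_procs processes."""
    return sorted(r for r in rankfile_ranks(text) if r >= total_procs)


rankfile = "rank 0=r011n002 slot=0\nrank 1=r011n003 slot=1\n"
print(ranks_out_of_range(rankfile, 1))   # [1]
print(ranks_out_of_range(rankfile, 2))   # []
```

Running such a check before calling mpirun catches the "invalid rank" case early, without depending on the launcher's error output.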
> >>> > >
> >>> > > What exactly are you trying to do ?
> >>> > >
> >>> > > I tried to recreate your segv, but all I got was
> >>> > >
> >>> > > ~/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun --hostfile hostfile.0
> >>> > > -rf rankfile.0 -n 1 hostname : -rf rankfile.1 -n 1 hostname
> >>> > > [witch19:30798] mca: base: component_find: paffinity "mca_paffinity_linux"
> >>> > > uses an MCA interface that is not recognized (component MCA v1.0.0 !=
> >>> > > supported MCA v2.0.0) -- ignored
> >>> > > --------------------------------------------------------------------------
> >>> > > It looks like opal_init failed for some reason; your parallel process is
> >>> > > likely to abort. There are many reasons that a parallel process can
> >>> > > fail during opal_init; some of which are due to configuration or
> >>> > > environment problems. This failure appears to be an internal failure;
> >>> > > here's some additional information (which may only be relevant to an
> >>> > > Open MPI developer):
> >>> > >
> >>> > >  opal_carto_base_select failed
> >>> > >  --> Returned value -13 instead of OPAL_SUCCESS
> >>> > > --------------------------------------------------------------------------
> >>> > > [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
> >>> > > ../../orte/runtime/orte_init.c at line 78
> >>> > > [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
> >>> > > ../../orte/orted/orted_main.c at line 344
> >>> > > --------------------------------------------------------------------------
> >>> > > A daemon (pid 11629) died unexpectedly with status 243 while attempting
> >>> > > to launch so we are aborting.
> >>> > >
> >>> > > There may be more information reported by the environment (see above).
> >>> > >
> >>> > > This may be because the daemon was unable to find all the needed shared
> >>> > > libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> >>> > > location of the shared libraries on the remote nodes and this will
> >>> > > automatically be forwarded to the remote nodes.
> >>> > > --------------------------------------------------------------------------
> >>> > > --------------------------------------------------------------------------
> >>> > > mpirun noticed that the job aborted, but has no info as to the process
> >>> > > that caused that situation.
> >>> > > --------------------------------------------------------------------------
> >>> > > mpirun: clean termination accomplished
> >>> > >
> >>> > >
> >>> > > Lenny.
> >>> > >
> >>> > >
> >>> > > On 4/10/09, Geoffroy Pignot <geopig...@gmail.com> wrote:
> >>> > > >
> >>> > > > Hi ,
> >>> > > >
> >>> > > > I am currently testing the process affinity capabilities of Open MPI and I
> >>> > > > would like to know whether the rankfile behaviour described below is
> >>> > > > normal or not.
> >>> > > >
> >>> > > > cat hostfile.0
> >>> > > > r011n002 slots=4
> >>> > > > r011n003 slots=4
> >>> > > >
> >>> > > > cat rankfile.0
> >>> > > > rank 0=r011n002 slot=0
> >>> > > > rank 1=r011n003 slot=1
> >>> > > >
> >>> > > >
> >>> > > > ##################################################################################
> >>> > > >
> >>> > > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 2  hostname ### OK
> >>> > > > r011n002
> >>> > > > r011n003
> >>> > > >
> >>> > > > ##################################################################################
> >>> > > > but
> >>> > > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 hostname
> >>> > > > ### CRASHED
> >>> > > > --------------------------------------------------------------------------
> >>> > > > Error, invalid rank (1) in the rankfile (rankfile.0)
> >>> > > > --------------------------------------------------------------------------
> >>> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> >>> > > > rmaps_rank_file.c at line 404
> >>> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> >>> > > > base/rmaps_base_map_job.c at line 87
> >>> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> >>> > > > base/plm_base_launch_support.c at line 77
> >>> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> >>> > > > plm_rsh_module.c at line 985
> >>> > > > --------------------------------------------------------------------------
> >>> > > > A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
> >>> > > > launch so we are aborting.
> >>> > > >
> >>> > > > There may be more information reported by the environment (see above).
> >>> > > >
> >>> > > > This may be because the daemon was unable to find all the needed shared
> >>> > > > libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> >>> > > > location of the shared libraries on the remote nodes and this will
> >>> > > > automatically be forwarded to the remote nodes.
> >>> > > > --------------------------------------------------------------------------
> >>> > > > --------------------------------------------------------------------------
> >>> > > > orterun noticed that the job aborted, but has no info as to the process
> >>> > > > that caused that situation.
> >>> > > > --------------------------------------------------------------------------
> >>> > > > orterun: clean termination accomplished
> >>> > > > It seems that the rankfile option is not propagated to the second command
> >>> > > > line; there is no global understanding of the ranking inside an mpirun
> >>> > > > command.
> >>> > > >
> >>> > > >
> >>> > > > ##################################################################################
> >>> > > >
> >>> > > > Assuming that, I tried to provide a rankfile to each command line:
> >>> > > >
> >>> > > > cat rankfile.0
> >>> > > > rank 0=r011n002 slot=0
> >>> > > >
> >>> > > > cat rankfile.1
> >>> > > > rank 0=r011n003 slot=1
> >>> > > >
> >>> > > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -rf rankfile.1
> >>> > > > -n 1 hostname ### CRASHED
> >>> > > > [r011n002:28778] *** Process received signal ***
> >>> > > > [r011n002:28778] Signal: Segmentation fault (11)
> >>> > > > [r011n002:28778] Signal code: Address not mapped (1)
> >>> > > > [r011n002:28778] Failing at address: 0x34
> >>> > > > [r011n002:28778] [ 0] [0xffffe600]
> >>> > > > [r011n002:28778] [ 1] /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0x55d) [0x5557decd]
> >>> > > > [r011n002:28778] [ 2] /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0x117) [0x555842a7]
> >>> > > > [r011n002:28778] [ 3] /tmp/HALMPI/openmpi-1.3.1/lib/openmpi/mca_plm_rsh.so [0x556098c0]
> >>> > > > [r011n002:28778] [ 4] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x804aa27]
> >>> > > > [r011n002:28778] [ 5] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x804a022]
> >>> > > > [r011n002:28778] [ 6] /lib/libc.so.6(__libc_start_main+0xdc) [0x9f1dec]
> >>> > > > [r011n002:28778] [ 7] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x8049f71]
> >>> > > > [r011n002:28778] *** End of error message ***
> >>> > > > Segmentation fault (core dumped)
> >>> > > >
> >>> > > >
> >>> > > >
> >>> > > > I hope that I've found a bug, because it would be very important for me to
> >>> > > > have this kind of capability: launching a multi-executable mpirun command
> >>> > > > line and being able to bind my executables and sockets together.
> >>> > > >
> >>> > > > Thanks in advance for your help
> >>> > > >
> >>> > > > Geoffroy
> >>> > > _______________________________________________
> >>> > > users mailing list
> >>> > > us...@open-mpi.org
> >>> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
