Re: [OMPI users] Spawn_multiple with tight integration to SGE grid engine

2012-01-31 Thread Rayson Ho
On Mon, Jan 30, 2012 at 11:33 PM, Tom Bryan  wrote:
> For our use, yes, spawn_multiple makes sense.  We won't be spawning lots and
> lots of jobs in quick succession.  We're using MPI as a robust way to get
> IPC as we spawn multiple child processes while using SGE to help us with
> load balancing our compute nodes.

Note that spawn_multiple is not going to buy you anything as SGE and
Open Grid Scheduler (and most other batch systems) do not handle
dynamic slot allocation. There is no way to change the number of slots
that are used by a job once it's running.

For this reason, I don't recall seeing any users using spawn_multiple
(and also, IIRC, the call was introduced in MPI-2)... and you might
want to make sure that normal MPI jobs work before debugging a
spawn_multiple() job.
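
(For readers new to the call, a minimal sketch of what an MPI_Comm_spawn_multiple
invocation looks like; "worker_a" and "worker_b" are hypothetical child binaries,
and the spawned ranks still have to fit inside the slots the batch system already
granted, since the allocation cannot grow afterwards:)

    /* Hedged sketch, not anyone's production code: spawn two different
     * child binaries ("worker_a"/"worker_b" are hypothetical names). */
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        MPI_Comm children;
        char    *cmds[2]     = { "worker_a", "worker_b" };
        int      maxprocs[2] = { 2, 2 };
        MPI_Info infos[2]    = { MPI_INFO_NULL, MPI_INFO_NULL };

        MPI_Init(&argc, &argv);
        MPI_Comm_spawn_multiple(2, cmds, MPI_ARGVS_NULL, maxprocs, infos,
                                0, MPI_COMM_WORLD, &children,
                                MPI_ERRCODES_IGNORE);
        /* ... talk to the children over the intercommunicator ... */
        MPI_Comm_disconnect(&children);
        MPI_Finalize();
        return 0;
    }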

Rayson

=
Grid Engine / Open Grid Scheduler
http://gridscheduler.sourceforge.net/

Scalable Grid Engine Support Program
http://www.scalablelogic.com/


>
>> Anyway:
>> do you see on the master node of the parallel job in:
>
> Yes, I should have included that kind of output.  I'll have to run it again
> with the cols option, but I used pstree to see that I have mpitest --child
> processes as children of orted by way of sge_shepherd and sge_execd.
>
> Thanks,
> ---Tom
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



-- 
Rayson

==
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/



Re: [OMPI users] Latest Intel Compilers (ICS, version 12.1.0.233 Build 20110811) issues ...

2012-01-31 Thread Götz Waschk
On Mon, Jan 30, 2012 at 5:11 PM, Richard Walsh wrote:
> I have not seen this mpirun error with the OpenMPI version I have built
> with Intel 12.1 and the mpicc fix:
> openmpi-1.5.5rc1.tar.bz2

Hi,

I haven't tried that version yet. I was trying to build a
supplementary package to the openmpi 1.5.3 shipped with RHEL6.2, the
same source, just built using the Intel compiler.

> and from the looks of things, I wonder if your problem is related.  The
> solution in the original case was to conditionally dial-down optimization
> when using the 12.1 compiler to prevent the compiler itself from crashing
> during a compile.  What you present is a failure during execution.  Such
> failures might be due to over zealous optimization, but there seems to be
> little reason on the face of it to believe that there is a connection between
> the former and the latter.

Well, the similarity is that it is also a crash in the malloc routine.
I don't know if my optflags are too high; I have derived them from Red
Hat's, replacing the options unknown to icc:
-O2 -g -pipe -Wall -fexceptions -fstack-protector
--param=ssp-buffer-size=4 -m64 -mtune=pentium4

> Does this failure occur with all attempts to use 'mpirun' whatever the source?
> My 'mpicc' problem did.  If this is true and if you believe it is an
> optimization level issue you could try turning it off in the failing routine
> and see if that produces a remedy.  I would also try things with the very
> latest release.

Yes, the mpicc crash happened every time; I could reproduce that.

I have only tested the most basic code, the cpi.c example. The funny
thing is that mpirun -np 8 cpi doesn't always crash; sometimes it
finishes just fine.

Regards, Götz Waschk



Re: [OMPI users] MPI_AllGather null terminator character

2012-01-31 Thread Gabriele Fatigati
Dear Jeff,

I have very interesting news. I recompiled OpenMPI 1.4.4 enabling the
memchecker.

Now the warning on strcmp has disappeared, even without initializing the
buffers with memset!

So is the warning a false positive? Is my simple code safe?

Thanks.

2012/1/28 Jeff Squyres 

> On Jan 28, 2012, at 5:22 AM, Gabriele Fatigati wrote:
>
> > I had the same idea, so in my simple code I have already done calloc and
> > memset.
> >
> > The same warning still appears using strncmp, which should exclude
> > uninitialized bytes in hostnam_recv_buf :(
>
> Bummer.
>
> > My apologies for being so insistent, but I would like to understand whether
> > there is some bug in MPI_Allgather, strcmp or Valgrind itself.
>
> Understood.
>
> I still think that MPI_Allgather will exactly send the bytes starting at
> the buffer you specify, regardless of whether they include \0 or not.
>
> I was unable to replicate the valgrind warning on my systems.  A few more
> things to try:
>
> 1. Are you using the latest version of Valgrind?
>
> 2. (I should have asked this before - sorry!) Are you using InfiniBand to
> transmit the data across your network?  If so, Valgrind might not have
> visibility on the receive buffers being filled because IB, by its nature,
> uses OS bypass to fill in receive buffers.  Meaning: Valgrind won't "see"
> the receive buffers getting filled, and therefore will think that they are
> uninitialized.  If you re-run your experiment with TCP and/or shared memory
> (like I did), you won't see the Valgrind uninitialized warnings.
>
> To avoid these OS-bypass issues, you might try installing Open MPI with
> --with-valgrind=DIR (DIR = directory where Valgrind is installed -- we need
> valgrind.h, IIRC).  What this does is allow Open MPI to use Valgrind's
> external tools API to say "don't worry Valgrind, the entire contents of
> this buffer are initialized" in cases exactly like this.
>
> There is a performance cost to using Valgrind integration, though.  So
> don't make this your production copy of Open MPI.
>
> 3. Do a for loop accessing each position of the buffer *before* you send
> it.  Not just up to the \0, but traverse the *entire length* of the buffer
> and do some meaningless action with each byte.  See if Valgrind complains.
>  If it doesn't, you know for certain that the entire source buffer is not
> the origin of the warning.
>
> 4. Similarly, do a loop accessing each position of the received buffer.
>  You can have Valgrind attach a debugger when it runs into issues; with
> that, you can see exactly which position Valgrind thinks is uninitialized.
>  Compare the value that was sent to the value that was received and ensure
> that they are the same.
>
> Hope that helps!
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
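
(A minimal C sketch of the touch-every-byte check suggested in items 3 and 4 of
the quoted reply above; the function is a hypothetical helper, not code from the
actual test program:)

    /* Touch every byte of a buffer before the MPI_Allgather (and,
     * analogously, of the receive buffer afterwards).  If Valgrind stays
     * quiet in this loop, the buffer is fully initialized and is not the
     * origin of the warning. */
    #include <stddef.h>

    unsigned char touch_every_byte(const void *buf, size_t len)
    {
        const unsigned char *p = (const unsigned char *)buf;
        volatile unsigned char sink = 0;  /* keeps the loop from being optimized away */
        size_t i;

        for (i = 0; i < len; ++i)
            sink ^= p[i];
        return sink;
    }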



-- 
Ing. Gabriele Fatigati

HPC specialist

SuperComputing Applications and Innovation Department

Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy

www.cineca.it    Tel: +39 051 6171722

g.fatigati [AT] cineca.it


Re: [OMPI users] [openib] segfault when using openib btl

2012-01-31 Thread Eloi Gaudry

Hi,

I just would like to give you an update on this issue.
Since we are using OpenMPI-1.4.4, we cannot reproduce it anymore.

Regards,
Eloi



On 09/29/2010 06:01 AM, Nysal Jan wrote:

Hi Eloi,
We discussed this issue during the weekly developer meeting & there 
were no further suggestions, apart from checking the driver and 
firmware levels. The consensus was that it would be better if you 
could take this up directly with your IB vendor.


Regards
--Nysal

On Mon, Sep 27, 2010 at 8:14 PM, Eloi Gaudry wrote:


Terry,

Please find enclosed the requested check outputs (using
-output-filename stdout.tag.null option).
I'm displaying frag->hdr->tag here.

Eloi

On Monday 27 September 2010 16:29:12 Terry Dontje wrote:
> Eloi, sorry can you print out frag->hdr->tag?
>
> Unfortunately from your last email I think it will still all have
> non-zero values.
> If that ends up being the case then there must be something odd with the
> descriptor pointer to the fragment.
>
> --td
>
> Eloi Gaudry wrote:
> > Terry,
> >
> > Please find enclosed the requested check outputs (using the
> > -output-filename stdout.tag.null option).
> >
> > For information, Nysal in his first message referred to
> > ompi/mca/pml/ob1/pml_ob1_hdr.h and said that the hdr->tag value was wrong on
> > the receiving side:
> > #define MCA_PML_OB1_HDR_TYPE_MATCH (MCA_BTL_TAG_PML + 1)
> > #define MCA_PML_OB1_HDR_TYPE_RNDV  (MCA_BTL_TAG_PML + 2)
> > #define MCA_PML_OB1_HDR_TYPE_RGET  (MCA_BTL_TAG_PML + 3)
> > #define MCA_PML_OB1_HDR_TYPE_ACK   (MCA_BTL_TAG_PML + 4)
> > #define MCA_PML_OB1_HDR_TYPE_NACK  (MCA_BTL_TAG_PML + 5)
> > #define MCA_PML_OB1_HDR_TYPE_FRAG  (MCA_BTL_TAG_PML + 6)
> > #define MCA_PML_OB1_HDR_TYPE_GET   (MCA_BTL_TAG_PML + 7)
> > #define MCA_PML_OB1_HDR_TYPE_PUT   (MCA_BTL_TAG_PML + 8)
> > #define MCA_PML_OB1_HDR_TYPE_FIN   (MCA_BTL_TAG_PML + 9)
> > and in ompi/mca/btl/btl.h:
> > #define MCA_BTL_TAG_PML 0x40
> >
> > Eloi
> >
> > On Monday 27 September 2010 14:36:59 Terry Dontje wrote:
> >> I am thinking checking the value of *frag->hdr right before the return
> >> in the post_send function in ompi/mca/btl/openib/btl_openib_endpoint.h.
> >> It is line 548 in the trunk:
> >> https://svn.open-mpi.org/source/xref/ompi-trunk/ompi/mca/btl/openib/btl_openib_endpoint.h#548
> >>
> >> --td
> >>
> >> Eloi Gaudry wrote:
> >>> Hi Terry,
> >>>
> >>> Do you have any patch that I could apply to be able to do so? I'm
> >>> remotely working on a cluster (with a terminal) and I cannot use any
> >>> parallel debugger or sequential debugger (with a call to xterm...). I
> >>> can track frag->hdr->tag value in
> >>> ompi/mca/btl/openib/btl_openib_component.c::handle_wc in the
> >>> SEND/RDMA_WRITE case, but this is all I can think of alone.
> >>>
> >>> You'll find a stacktrace (receive side) in this thread (10th or 11th
> >>> message) but it might be pointless.
> >>>
> >>> Regards,
> >>> Eloi
> >>>
> >>> On Monday 27 September 2010 11:43:55 Terry Dontje wrote:
>  So it sounds like coalescing is not your issue and that the problem
>  has something to do with the queue sizes.  It would be helpful if we
>  could detect the hdr->tag == 0 issue on the sending side and get at
>  least a stack trace.  There is something really odd going on here.
> 
>  --td
> 
>  Eloi Gaudry wrote:
> > Hi Terry,
> >
> > I'm sorry to say that I might have missed a point here.
> >
> > I've lately been relaunching all previously failing computations with
> > the message coalescing feature being switched off, and I saw the same
> > hdr->tag=0 error several times, always during a collective call
> > (MPI_Comm_create, MPI_Allreduce and MPI_Broadcast, so far). And as
> > soon as I switched to the peer queue option I was previously using
> > (--mca btl_openib_receive_queues P,65536,256,192,128 instead of using
> > --mca btl_openib_use_message_coalescing 0), all computations ran
> > flawlessly.
> >
> > As for the reproducer, I've already tried to write something but I
> > haven't succeeded so far at reproducing the hdr->tag=0 issue with it.
> >
> > Eloi
> >
> > On 24/09/2010 18:37, Terry Dontje wrote:
> >> Eloi Gaudry wrote:
> >>> Terry,
> >>>
> >>> You were right, the error indeed seems to come from the message
> >>> coalescing feature. If I turn it off using the "--mca
> >>> btl_ope

Re: [OMPI users] Spawn_multiple with tight integration to SGE grid engine

2012-01-31 Thread Reuti
On 31.01.2012 at 06:33, Rayson Ho wrote:

> On Mon, Jan 30, 2012 at 11:33 PM, Tom Bryan  wrote:
>> For our use, yes, spawn_multiple makes sense.  We won't be spawning lots and
>> lots of jobs in quick succession.  We're using MPI as a robust way to get
>> IPC as we spawn multiple child processes while using SGE to help us with
>> load balancing our compute nodes.
> 
> Note that spawn_multiple is not going to buy you anything as SGE and
> Open Grid Scheduler (and most other batch systems) do not handle
> dynamic slot allocation. There is no way to change the number of slots
> that are used by a job once it's running.

Agreed. The problem is first how to phrase it in a submission command, e.g.: I 
need 2 cores for 2 hours, 4 cores for one hour, and finally 1 core for 8 hours. 
And the application must act accordingly. This all sounds more like a real-time 
queuing system and application, where this can be guaranteed to happen in time.

-- Reuti


> For this reason, I don't recall seeing any users using spawn_multiple
> (and also, IIRC, the call was introduced in MPI-2)... and you might
> want to make sure that normal MPI jobs work before debugging a
> spawn_multiple() job.
> 
> Rayson
> 
> =
> Grid Engine / Open Grid Scheduler
> http://gridscheduler.sourceforge.net/
> 
> Scalable Grid Engine Support Program
> http://www.scalablelogic.com/
> 
> 
>> 
>>> Anyway:
>>> do you see on the master node of the parallel job in:
>> 
>> Yes, I should have included that kind of output.  I'll have to run it again
>> with the cols option, but I used pstree to see that I have mpitest --child
>> processes as children of orted by way of sge_shepherd and sge_execd.
>> 
>> Thanks,
>> ---Tom
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> 
> -- 
> Rayson
> 
> ==
> Open Grid Scheduler - The Official Open Source Grid Engine
> http://gridscheduler.sourceforge.net/
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 




[OMPI users] core binding failure on Interlagos (and possibly Magny-Cours)

2012-01-31 Thread Dave Love
This is to help anyone else having this problem, as it doesn't seem to
be mentioned anywhere I can find, rather surprisingly.

Core binding is broken on Interlagos with open-mpi 1.5.4.  I guess it
also bites on Magny-Cours, but all our systems are currently busy and I
can't check.

It does work, at least basically, in 1.5.5rc1, but the release notes for
that don't give any indication.  Perhaps someone could mention
Interlagos in the notes, and any other hardware that might be affected
(presumably Magny-Cours and some Power if it's confusion introduced by
the extra NUMA level).

As an example of the error, with 1.5.4 on 32-core Interlagos invoked
like

  mpirun -np 32 --bind-to-core --bycore  --report-bindings ...

you get

  ...
  [compute002:18153] [[14894,0],0] odls:default:fork binding child 
[[14894,1],15] to cpus 4000
  --
  An invalid physical processor id was returned when attempting to
  set processor affinity - please check to ensure that your system
  supports such functionality. If so, then this is probably something
  that should be reported to the OMPI developers.
  --
  ...

It works up to 16 cores.

We seem to have issues even with 1.5.5rc1, but I'll try to get bug
reports into the tracker.  I hope the heads-up here is useful though.



Re: [OMPI users] MPI_AllGather null terminator character

2012-01-31 Thread Jeff Squyres
On Jan 31, 2012, at 3:59 AM, Gabriele Fatigati wrote:

> I have very interesting news. I recompiled OpenMPI 1.4.4 enabling the 
> memchecker. 
> 
> Now the warning on strcmp has disappeared, even without initializing the
> buffers with memset!
> 
> So is the warning a false positive? Is my simple code safe?

If you were using IB as the network transport, yes, it's a false positive.

With memchecker enabled, Open MPI will *always* tell Valgrind that the entire 
contents of the buffer are defined, even when the data is coming from an 
OS-bypass transport (such as an OpenFabrics-based device, like IB).
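
(What the Valgrind integration described above boils down to is a client request
from Valgrind's memcheck.h; the following is a minimal illustrative sketch, not
Open MPI's actual code, and the helper name is hypothetical:)

    /* After an OS-bypass transport (e.g. IB) has filled a receive buffer
     * behind Valgrind's back, mark it as defined so later reads (strcmp
     * and friends) don't raise "uninitialised value" warnings. */
    #include <stddef.h>
    #include <valgrind/memcheck.h>

    static void mark_received_buffer_defined(void *recvbuf, size_t len)
    {
        VALGRIND_MAKE_MEM_DEFINED(recvbuf, len);
    }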

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] core binding failure on Interlagos (and possibly Magny-Cours)

2012-01-31 Thread Jeff Squyres
On Jan 31, 2012, at 6:18 AM, Dave Love wrote:

> Core binding is broken on Interlagos with open-mpi 1.5.4.  I guess it
> also bites on Magny-Cours, but all our systems are currently busy and I
> can't check.
> 
> It does work, at least basically, in 1.5.5rc1, but the release notes for
> that don't give any indication.  Perhaps someone could mention
> Interlagos in the notes, and any other hardware that might be affected
> (presumably Magny-Cours and some Power if it's confusion introduced by
> the extra NUMA level).

I think there was some weirdness in how AMD chips were represented to the Linux 
kernel (they present differently than Intel chips).  I believe the issues have 
been worked out by hwloc.  OMPI 1.5.4 uses an older version of hwloc (v1.2); 
1.5.5rc1 was synced to a newer version of hwloc.

Note: a) there's one more hwloc sync that's going to happen before 1.5.5 is 
released, and b) per https://svn.open-mpi.org/trac/ompi/ticket/2990, perhaps 
there's still some weirdness going on in OMPI 1.5.x's affinity code.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI users] Invitation to connect on LinkedIn

2012-01-31 Thread Song Guo via LinkedIn
LinkedIn





Song Guo requested to add you as a connection on LinkedIn:


--

Mohan,

I'd like to add you to my professional network on LinkedIn.

- Song

Accept invitation from Song Guo
http://www.linkedin.com/e/kq0fyp-gy2z9znd-1x/uYFEuWAc-_V_w7MB9hFjx_pd4WRoHI/blk/I187460085_55/1BpC5vrmRLoRZcjkkZt5YCpnlOt3RApnhMpmdzgmhxrSNBszYRdlYRe30MdzgTe359bS55izdxkT18bP0SczcPdz4PdPkLrCBxbOYWrSlI/EML_comm_afe/?hs=false&tok=1xCm-tn7ZTKR41

View invitation from Song Guo
http://www.linkedin.com/e/kq0fyp-gy2z9znd-1x/uYFEuWAc-_V_w7MB9hFjx_pd4WRoHI/blk/I187460085_55/3kRnPkUc30Sd3sUckALqnpPbOYWrSlI/svi/?hs=false&tok=2TeaW-9HpTKR41

--

Why might connecting with Song Guo be a good idea?

Song Guo's connections could be useful to you:

After accepting Song Guo's invitation, check Song Guo's connections to see who 
else you may know and who you might want an introduction to. Building these 
connections can create opportunities in the future.

-- 
(c) 2012, LinkedIn Corporation

Re: [OMPI users] core binding failure on Interlagos (and possibly Magny-Cours)

2012-01-31 Thread Brice Goglin
On 31/01/2012 14:24, Jeff Squyres wrote:
> On Jan 31, 2012, at 6:18 AM, Dave Love wrote:
>
>> Core binding is broken on Interlagos with open-mpi 1.5.4.  I guess it
>> also bites on Magny-Cours, but all our systems are currently busy and I
>> can't check.
>>
>> It does work, at least basically, in 1.5.5rc1, but the release notes for
>> that don't give any indication.  Perhaps someone could mention
>> Interlagos in the notes, and any other hardware that might be affected
>> (presumably Magny-Cours and some Power if it's confusion introduced by
>> the extra NUMA level).
> I think there was some weirdness in how AMD chips were represented to the 
> Linux kernel (they present differently than Intel chips).  I believe the 
> issues have been worked out by hwloc.

Right, AMD "dual-core modules" are reported almost exactly as "a single
hyperthreaded core" by the kernel. We had to tweak hwloc to detect two
different cores. So you get 32 cores and 32 PUs (hwloc >= 1.2.1) instead
of 16 cores and 32 PUs (hwloc <1.2.1).

If you don't have this hwloc change, I guess binding to core breaks
because you have 16 cores for 32 processes. I don't know if there's an
easy way to tell OMPI 1.5.4 to bind to PUs instead of Cores. This should
work as expected.
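
(A quick way to see which behavior an installed hwloc exhibits is to count cores
and PUs; this is an illustrative sketch added here, not part of Open MPI:)

    /* Count cores and PUs as seen by the installed hwloc.  On a 32-core
     * Interlagos node, hwloc >= 1.2.1 should report 32 cores / 32 PUs,
     * while older versions report 16 cores / 32 PUs. */
    #include <stdio.h>
    #include <hwloc.h>

    int main(void)
    {
        hwloc_topology_t topo;

        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);
        printf("cores: %d, PUs: %d\n",
               hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE),
               hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU));
        hwloc_topology_destroy(topo);
        return 0;
    }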

Unless I am mistaken, OMPI 1.5.4 has hwloc 1.2 while 1.5.5 will have
1.2.2 or even 1.3.1. So don't use core binding on Interlagos with
OMPI <= 1.5.4.

Note that Magny-Cours processors are OK; cores are "normal" there.

FWIW, the Linux kernel (at least up to 3.2) still reports wrong L2 and
L1i cache information on AMD Bulldozer. Kernel bug reported at
https://bugzilla.kernel.org/show_bug.cgi?id=42607

Brice



Re: [OMPI users] core binding failure on Interlagos (and possibly Magny-Cours)

2012-01-31 Thread Jeff Squyres
On Jan 31, 2012, at 8:49 AM, Brice Goglin wrote:

> Unless I am mistaken, OMPI 1.5.4 has hwloc 1.2

Correct.

> while 1.5.5 will have
> 1.2.2 or even 1.3.1. So don't use core binding on interlagos with
> OMPI<=1.5.4.

OMPI 1.5.5rc1 has hwloc 1.3.1 + a few SVN commits past it.

Per some off-list discussions, we'll probably do an hwloc 1.3.2 release Real 
Soon Now to clean up/release all post-1.3.1 fixes, and then sync that to Open 
MPI 1.5.5.

> Note that magny-Cours processors are OK, cores are "normal" there.
> 
> FWIW, the Linux kernel (at least up to 3.2) still reports wrong L2 and
> L1i cache information on AMD Bulldozer. Kernel bug reported at
> https://bugzilla.kernel.org/show_bug.cgi?id=42607

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Spawn_multiple with tight integration to SGE grid engine

2012-01-31 Thread Reuti
On 31.01.2012 at 05:33, Tom Bryan wrote:

>> Suppose you want to start 4 additional tasks, you would need 5 in total from
>> SGE.
> 
> OK, thanks.  I'll try other values.

BTW: there is a setting in the PE definition to allow one additional task:

$ qconf -sp openmpi
...
job_is_first_task  FALSE

This is useful in case the master task only collects the results and doesn't 
put any load on the machine. For conventional MPI applications it's set to 
"true", though.


>>> #$ -cwd
>>> #$ -j yes
>>> export LD_LIBRARY_PATH=/${VXR_STATIC}/openmpi-1.5.4/lib
>>> ./mpitest $*
>>> 
>>> The mpitest program is the one that calls Spawn_multiple.  In this case, it
>>> just tries to run multiple copies of itself.  If I restrict my SGE
>> 
>> I never used spawn_multiple, but isn't it necessary to start it with mpiexec
>> too and call MPI_Init?
>> 
>> $ mpiexec ./mpitest -np 1
> 
> I don't think so.

In the book "Using MPI-2" by William Gropp et al., they use it this way in 
chapter 7.2.2/page 235, although it's indeed stated in the MPI-2.2 standard on 
page 329 that a singleton MPI environment is created if the application can 
find the necessary information (i.e. it wasn't started by mpiexec).
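
(For reference, the usual pattern for such a self-spawning program is sketched
below; this is a hypothetical illustration, not the Mpitest code discussed in
this thread:)

    /* A program that spawns copies of itself: spawned children detect
     * their role via MPI_Comm_get_parent. */
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        MPI_Comm parent, children;

        MPI_Init(&argc, &argv);
        MPI_Comm_get_parent(&parent);

        if (parent == MPI_COMM_NULL) {
            /* started directly (by mpiexec or as a singleton): spawn 3 children */
            MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 3, MPI_INFO_NULL,
                           0, MPI_COMM_WORLD, &children, MPI_ERRCODES_IGNORE);
            MPI_Comm_disconnect(&children);
        } else {
            /* we are a spawned child: talk to the parent over "parent" */
            MPI_Comm_disconnect(&parent);
        }

        MPI_Finalize();
        return 0;
    }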

Maybe it's a side effect of a tight integration that it would start on the 
correct nodes (but I face an incorrect allocation of slots and an error message 
at the end if started without mpiexec), as in this case it has no command line 
option for the hostfile. How to get the requested nodes if started from the 
command line?

Maybe someone from the Open MPI team can clarify the intended behavior in this 
case.


>  In any case, when I restrict the SGE grid to run all of
> my orte parallel environment jobs on one machine, the application runs fine.
> I only have problems if one or more of the spawned children gets scheduled
> to another node.  
> 
>> to override the detected slots by the tight integration into SGE. Otherwise 
>> it
>> might be running only as a serial one. The additional 4 spawned processes can
>> then be added inside your application.
>> 
>> The line to initialize MPI:
>> 
>> if( MPI::Init( MPI::THREAD_MULTIPLE ) != MPI::THREAD_MULTIPLE )
>> ...
>> 
>> I replaced the complete if... by a plain MPI::Init(); and get a suitable
>> output (see attached, qsub -pe openmpi 4 and changed _nProc to 3) in a tight
>> integration into SGE.
>> 
>> NB: What is MPI::Init( MPI::THREAD_MULTIPLE ) supposed to do, output a 
>> feature

Okay, typo - the _thread is missing.


>> of MPI?
> 
> Well, I'm new to MPI, so I'm not sure.  The program was actually written by
> a co-worker.  I think that it's supposed to set up a bunch of things and
> also verify that our build has the requested level of thread support.

Threads have nothing to do with comm_spawn. Their support is necessary to 
combine MPI with OpenMP or any other thread library. I couldn't use it 
initially as I hadn't compiled it with --enable-mpi-threads. A plain 
MPI::Init(); is sufficient here (thread support won't hurt, though).


> My co-worker clarified today that he actually had this exact code working
> last year on a test cluster that we set up.  Now we're trying to put
> together a production cluster with the latest version of Open MPI and SGE
> (Son of Grid Engine), but Mpitest is now hanging as described in my first
> e-mail.

For me it's not hanging. Did you try the alternative startup using mpiexec?

Aha - BTW: I use 1.4.4

-- Reuti


>> Is it for an actual application where you need this feature? In the MPI
>> documentation it's discouraged to start it this way for performance reasons.
> 
> For our use, yes, spawn_multiple makes sense.  We won't be spawning lots and
> lots of jobs in quick succession.  We're using MPI as a robust way to get
> IPC as we spawn multiple child processes while using SGE to help us with
> load balancing our compute nodes.
> 
>> Anyway:
>> do you see on the master node of the parallel job in:
> 
> Yes, I should have included that kind of output.  I'll have to run it again
> with the cols option, but I used pstree to see that I have mpitest --child
> processes as children of orted by way of sge_shepherd and sge_execd.
> 
> Thanks,
> ---Tom
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] MPI_AllGather null terminator character

2012-01-31 Thread Gabriele Fatigati
Ok Jeff, thanks very much for your support!

Regards,

2012/1/31 Jeff Squyres 

> On Jan 31, 2012, at 3:59 AM, Gabriele Fatigati wrote:
>
> > I have very interesting news. I recompiled OpenMPI 1.4.4 enabling the
> > memchecker.
> >
> > Now the warning on strcmp has disappeared, even without initializing the
> > buffers with memset!
> >
> > So is the warning a false positive? Is my simple code safe?
>
> If you were using IB as the network transport, yes, it's a false positive.
>
> With memchecker enabled, Open MPI will *always* tell Valgrind that the
> entire contents of the buffer are defined, even when the data is coming
> from an OS-bypass transport (such as an OpenFabrics-based device, like IB).
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



-- 
Ing. Gabriele Fatigati

HPC specialist

SuperComputing Applications and Innovation Department

Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy

www.cineca.it    Tel: +39 051 6171722

g.fatigati [AT] cineca.it


[OMPI users] OpenMPI / SLURM -> Send/Recv blocking

2012-01-31 Thread adrian sabou
Hi All,
 
I'm having this weird problem when running a very simple OpenMPI application. 
The application sends an integer from the rank 0 process to the rank 1 process. 
The sequence of code that I use to accomplish this is the following:
if (rank == 0)
{
    printf("Process %d - Sending...\n", rank);
    MPI_Send(&sent, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);
    printf("Process %d - Sent.\n", rank);
}
if (rank == 1)
{
    printf("Process %d - Receiving...\n", rank);
    MPI_Recv(&received, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &stat);
    printf("Process %d - Received.\n", rank);
}

printf("Process %d - Barrier reached.\n", rank);
MPI_Barrier(MPI_COMM_WORLD);
printf("Process %d - Barrier passed.\n", rank);
 
Like I said, a very simple program.
When launching this application with SLURM (using "salloc -N2 mpirun 
./"), it hangs at the barrier. However, it passes the barrier if I 
launch it without SLURM (using "mpirun -np 2 ./"). I first noticed this 
problem when my application hanged if I tried to send two successive messages 
from a process to another. Only the first MPI_Send would work. The second 
MPI_Send would block indefinitely. I was wondering whether any of you have 
encountered a similar problem, or may have an idea as to what is causing the 
Send/Receive pair to block when using SLURM. The exact output in my console is 
as follows:
 
salloc: Granted job allocation 1138
Process 0 - Sending...
Process 1 - Receiving...
Process 1 - Received.
Process 1 - Barrier reached.
Process 0 - Sent.
Process 0 - Barrier reached.
(it just hangs here)
 
I am new to MPI programming and to OpenMPI and would greatly appreciate any 
help. My OpenMPI version is 1.4.4 (although I have also tried it on 1.5.4), my 
SLURM version is 0.3.3-1 (slurm-llnl 2.1.0-1), the operating system on the 
cluster on which I tried to run my application is Ubuntu 10.04 LTS Server x64. 
If anyone is willing to help me out, I will happily provide any other info 
requested (as long as the request comes with instructions on how to get that 
info).
 
Your answers will be of great help! Thanks!
 
Adrian
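
(For reference, a self-contained version of the reproducer sketched above; only
the variable names sent, received and stat come from the original fragment, the
rest is filled in for completeness:)

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, sent = 42, received = 0;
        MPI_Status stat;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0)
        {
            printf("Process %d - Sending...\n", rank);
            MPI_Send(&sent, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);
            printf("Process %d - Sent.\n", rank);
        }
        if (rank == 1)
        {
            printf("Process %d - Receiving...\n", rank);
            MPI_Recv(&received, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &stat);
            printf("Process %d - Received.\n", rank);
        }

        printf("Process %d - Barrier reached.\n", rank);
        MPI_Barrier(MPI_COMM_WORLD);
        printf("Process %d - Barrier passed.\n", rank);

        MPI_Finalize();
        return 0;
    }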

Re: [OMPI users] Latest Intel Compilers (ICS, version 12.1.0.233 Build 20110811) issues ...

2012-01-31 Thread Richard Walsh
Götz,

Sorry, I was in a rush and missed that.

Here is some further information on the compiler options I used
for the 1.5.5 build:

 [richard.walsh@bob linux]$ pwd
/share/apps/openmpi-intel/1.5.5/build/opal/mca/memory/linux

[richard.walsh@bob linux]$ make -n malloc.o
echo "  CC" malloc.o;depbase=`echo malloc.o | sed 
's|[^/]*$|.deps/&|;s|\.o$||'`;\
icc -DHAVE_CONFIG_H -I. -I../../../../opal/include 
-I../../../../orte/include -I../../../../ompi/include 
-I../../../../opal/mca/hwloc/hwloc122ompi/hwloc/include/private/autogen 
-I../../../../opal/mca/hwloc/hwloc122ompi/hwloc/include/hwloc/autogen  
-DMALLOC_DEBUG=0 -D_GNU_SOURCE=1 -DUSE_TSD_DATA_HACK=1 -DMALLOC_HOOKS=1 
-I./sysdeps/pthread  -I./sysdeps/generic -I../../../..   
-I/share/apps/openmpi-intel/1.5.5/build/opal/mca/hwloc/hwloc122ompi/hwloc/include
   -I/usr/include/infiniband -I/usr/include/infiniband   -DNDEBUG -g -O2 
-finline-functions -fno-strict-aliasing -restrict -pthread 
-I/share/apps/openmpi-intel/1.5.5/build/opal/mca/hwloc/hwloc122ompi/hwloc/include
 -MT malloc.o -MD -MP -MF $depbase.Tpo -c -o malloc.o malloc.c &&\
mv -f $depbase.Tpo $depbase.Po

The entry point your code crashed in:

opal_memory_ptmalloc2_int_malloc

is renamed to:

rename.h:#define _int_malloc opal_memory_ptmalloc2_int_malloc

in the malloc.c routine in 1.5.5.  Perhaps you should lower the optimization
level to zero and see what you get.
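
(One way to do that without touching the global optflags is a per-function
pragma; the following is a sketch under the assumption that icc's
"optimization_level" pragma is available, and "suspect_routine" is a
hypothetical name, not an Open MPI symbol:)

    /* Drop a single suspect routine to -O0 while the rest of the file
     * keeps the global -O2 flags (assumes icc honors this pragma). */
    #pragma intel optimization_level 0
    static void suspect_routine(void)
    {
        /* ... body compiled without optimization ... */
    }

    int main(void)
    {
        suspect_routine();
        return 0;
    }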

Sincerely,

rbw

Richard Walsh
Parallel Applications and Systems Manager
CUNY HPC Center, Staten Island, NY
W: 718-982-3319
M: 612-382-4620

Miracles are delivered to order by great intelligence, or when it is
absent, through the passage of time and a series of mere chance
events. -- Max Headroom


From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] on behalf of Götz 
Waschk [goetz.was...@gmail.com]
Sent: Tuesday, January 31, 2012 3:38 AM
To: Open MPI Users
Subject: Re: [OMPI users] Latest Intel Compilers (ICS, version 12.1.0.233 Build 
20110811) issues ...

On Mon, Jan 30, 2012 at 5:11 PM, Richard Walsh wrote:
> I have not seen this mpirun error with the OpenMPI version I have built
> with Intel 12.1 and the mpicc fix:
> openmpi-1.5.5rc1.tar.bz2

Hi,

I haven't tried that version yet. I was trying to build a
supplementary package to the openmpi 1.5.3 shipped with RHEL6.2, the
same source, just built using the Intel compiler.

> and from the looks of things, I wonder if your problem is related.  The
> solution in the original case was to conditionally dial-down optimization
> when using the 12.1 compiler to prevent the compiler itself from crashing
> during a compile.  What you present is a failure during execution.  Such
> failures might be due to over zealous optimization, but there seems to be
> little reason on the face of it to believe that there is a connection between
> the former and the latter.

Well, the similarity is that it is also a crash in the malloc routine.
I don't know if my optflags are too high; I have derived them from Red
Hat's, replacing the options unknown to icc:
-O2 -g -pipe -Wall -fexceptions -fstack-protector
--param=ssp-buffer-size=4 -m64 -mtune=pentium4

> Does this failure occur with all attempts to use 'mpirun' whatever the source?
> My 'mpicc' problem did.  If this is true and if you believe it is an
> optimization level issue you could try turning it off in the failing routine
> and see if that produces a remedy.  I would also try things with the very
> latest release.

Yes, the mpicc crash happened every time; I could reproduce that.

I have only tested the most basic code, the cpi.c example. The funny
thing is that mpirun -np 8 cpi doesn't always crash; sometimes it
finishes just fine.

Regards, Götz Waschk

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users






Re: [OMPI users] core binding failure on Interlagos (and possibly Magny-Cours)

2012-01-31 Thread Dave Love
Brice Goglin  writes:

> Note that magny-Cours processors are OK, cores are "normal" there.

Apologies for the bad guess about the architecture, and thanks for the
info.

> FWIW, the Linux kernel (at least up to 3.2) still reports wrong L2 and
> L1i cache information on AMD Bulldozer. Kernel bug reported at
> https://bugzilla.kernel.org/show_bug.cgi?id=42607

I assume that isn't relevant for open-mpi, just other things.  Is that
right?

We'll try to get some action out of AMD in the face of a procurement, if
nothing else.



Re: [OMPI users] Spawn_multiple with tight integration to SGE grid engine

2012-01-31 Thread Dave Love
Reuti  writes:

> Maybe it's a side effect of a tight integration that it would start on
> the correct nodes (but I face an incorrect allocation of slots and an
> error message at the end if started without mpiexec), as in this case
> it has no command line option for the hostfile. How to get the
> requested nodes if started from the command line?

Yes, I wouldn't expect it to work without mpirun/mpiexec and, of course,
I basically agree with Reuti about the rest.

If there is an actual SGE problem or need for an enhancement, though,
please file it per https://arc.liv.ac.uk/trac/SGE#mail



Re: [OMPI users] Spawn_multiple with tight integration to SGE grid engine

2012-01-31 Thread Jeff Squyres
I only noticed after the fact that Tom is also here at Cisco (it's a big 
company, after all :-) ).

I've contacted him using our proprietary super-secret Cisco handshake (i.e., 
the internal phone network); I'll see if I can figure out the issues off-list.


On Jan 31, 2012, at 1:08 PM, Dave Love wrote:

> Reuti  writes:
> 
>> Maybe it's a side effect of a tight integration that it would start on
>> the correct nodes (but I face an incorrect allocation of slots and an
>> error message at the end if started without mpiexec), as in this case
>> it has no command line option for the hostfile. How to get the
>> requested nodes if started from the command line?
> 
> Yes, I wouldn't expect it to work without mpirun/mpiexec and, of course,
> I basically agree with Reuti about the rest.
> 
> If there is an actual SGE problem or need for an enhancement, though,
> please file it per https://arc.liv.ac.uk/trac/SGE#mail
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Spawn_multiple with tight integration to SGE grid engine

2012-01-31 Thread Reuti
On 31.01.2012 at 20:12, Jeff Squyres wrote:

> I only noticed after the fact that Tom is also here at Cisco (it's a big 
> company, after all :-) ).
> 
> I've contacted him using our proprietary super-secret Cisco handshake (i.e., 
> the internal phone network); I'll see if I can figure out the issues off-list.

But I would be interested in a statement about a hostlist for singleton 
startups, or whether it's honoring the tight-integration nodes more by accident 
than by design. And as said: I see a wrong allocation, as the initial ./Mpitest 
doesn't count as a process. I get a 3+1 allocation instead of 2+2 (which is 
what SGE granted). If started with "mpiexec -np 1 ./Mpitest" all is fine.

-- Reuti


> On Jan 31, 2012, at 1:08 PM, Dave Love wrote:
> 
>> Reuti  writes:
>> 
>>> Maybe it's a side effect of a tight integration that it would start on
>>> the correct nodes (but I face an incorrect allocation of slots and an
>>> error message at the end if started without mpiexec), as in this case
>>> it has no command line option for the hostfile. How to get the
>>> requested nodes if started from the command line?
>> 
>> Yes, I wouldn't expect it to work without mpirun/mpiexec and, of course,
>> I basically agree with Reuti about the rest.
>> 
>> If there is an actual SGE problem or need for an enhancement, though,
>> please file it per https://arc.liv.ac.uk/trac/SGE#mail
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




[OMPI users] Segfault on mpirun with OpenMPI 1.4.5rc2

2012-01-31 Thread Daniel Milroy
Hello,

I have built OpenMPI 1.4.5rc2 with Intel 12.1 compilers in an HPC
environment.  We are running RHEL 5, kernel 2.6.18-238 with Intel Xeon
X5660 cpus.  You can find my build options below.  In an effort to
test the OpenMPI build, I compiled "Hello world" with an mpi_init call
in C and Fortran.  Mpirun of both versions on a single node results in
a segfault.  I have attached the pertinent portion of gdb's output of
the "Hello world" core dump.  Submitting a parallel "Hello world" job
to torque results in segfaults across the respective nodes.  However,
if I execute mpirun of C or Fortran "Hello world" following a segfault
the program will exit successfully.  Additionally, if I strace mpirun
on either a single node or on multiple nodes in parallel, "Hello world"
runs successfully.  I am unsure how to proceed - any help would be
greatly appreciated.


Thank you in advance,

Dan Milroy


Build options:

source /ics_2012.0.032/composer_xe_2011_sp1.6.233/bin/iccvars.sh intel64
source /ics_2012.0.032/composer_xe_2011_sp1.6.233/bin/ifortvars.sh
intel64
export CC=/ics_2012.0.032/composer_xe_2011_sp1.6.233/bin/intel64/icc
export CXX=/ics_2012.0.032/composer_xe_2011_sp1.6.233/bin/intel64/icpc
export F77=/ics_2012.0.032/composer_xe_2011_sp1.6.233/bin/intel64/ifort
export F90=/ics_2012.0.032/composer_xe_2011_sp1.6.233/bin/intel64/ifort
export FC=/ics_2012.0.032/composer_xe_2011_sp1.6.233/bin/intel64/ifort
./configure --prefix=/openmpi-1.4.5rc2_intel-12.1
--with-tm=/torque-2.5.8/ --enable-shared --enable-static --without-psm


GDB_hello.c_core_dump
Description: Binary data


Re: [OMPI users] Segfault on mpirun with OpenMPI 1.4.5rc2

2012-01-31 Thread Jeff Squyres
We have heard reports of failures with the Intel 12.1 compilers.

Can you try with rc4 (that was literally just released) with the 
--without-memory-manager configure option?


On Jan 31, 2012, at 2:19 PM, Daniel Milroy wrote:

> Hello,
> 
> I have built OpenMPI 1.4.5rc2 with Intel 12.1 compilers in an HPC
> environment.  We are running RHEL 5, kernel 2.6.18-238 with Intel Xeon
> X5660 cpus.  You can find my build options below.  In an effort to
> test the OpenMPI build, I compiled "Hello world" with an mpi_init call
> in C and Fortran.  Mpirun of both versions on a single node results in
> a segfault.  I have attached the pertinent portion of gdb's output of
> the "Hello world" core dump.  Submitting a parallel "Hello world" job
> to torque results in segfaults across the respective nodes.  However,
> if I execute mpirun of C or Fortran "Hello world" following a segfault
> the program will exit successfully.  Additionally, if I strace mpirun
> on either a single node or on multiple nodes in parallel "Hello world"
> runs successfully.  I am unsure how to proceed- any help would be
> greatly appreciated.
> 
> 
> Thank you in advance,
> 
> Dan Milroy
> 
> 
> Build options:
> 
>source /ics_2012.0.032/composer_xe_2011_sp1.6.233/bin/iccvars.sh 
> intel64
>source /ics_2012.0.032/composer_xe_2011_sp1.6.233/bin/ifortvars.sh
> intel64
>export CC=/ics_2012.0.032/composer_xe_2011_sp1.6.233/bin/intel64/icc
>export CXX=/ics_2012.0.032/composer_xe_2011_sp1.6.233/bin/intel64/icpc
>export F77=/ics_2012.0.032/composer_xe_2011_sp1.6.233/bin/intel64/ifort
>export F90=/ics_2012.0.032/composer_xe_2011_sp1.6.233/bin/intel64/ifort
>export FC=/ics_2012.0.032/composer_xe_2011_sp1.6.233/bin/intel64/ifort
>./configure --prefix=/openmpi-1.4.5rc2_intel-12.1
> --with-tm=/torque-2.5.8/ --enable-shared --enable-static --without-psm
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Spawn_multiple with tight integration to SGE grid engine

2012-01-31 Thread Ralph Castain
Not sure I fully grok this thread, but will try to provide an answer.

When you start a singleton, it spawns off a daemon that is the equivalent of 
"mpirun". This daemon is created for the express purpose of allowing the 
singleton to use MPI dynamics like comm_spawn - without it, the singleton would 
be unable to execute those functions.

The first thing the daemon does is read the local allocation, using the same 
methods as used by mpirun. So whatever allocation is present that mpirun would 
have read, the daemon will get. This includes hostfiles and SGE allocations.

The exception to this is when the singleton gets started in an altered 
environment - e.g., if SGE changes the environmental variables when launching 
the singleton process. We see this in some resource managers - you can get an 
allocation of N nodes, but when you launch a job, the envar in that job only 
indicates the number of nodes actually running processes in that job. In such a 
situation, the daemon will see the altered value as its "allocation", 
potentially causing confusion.

For this reason, I generally recommend that you run dynamic applications using 
mpirun when operating in RM-managed environments to avoid confusion. Or at 
least use "printenv" to check that the envars are going to be right before 
trying to start from a singleton.
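
(A minimal sketch of such a check, assuming SGE's usual PE_HOSTFILE and NSLOTS
environment variables; this is an illustrative helper, not Open MPI code:)

    /* Print the environment the singleton's daemon would inherit, so you
     * can verify the allocation before relying on a singleton start. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const char *hostfile = getenv("PE_HOSTFILE");
        const char *nslots   = getenv("NSLOTS");

        printf("PE_HOSTFILE = %s\n", hostfile ? hostfile : "(unset)");
        printf("NSLOTS      = %s\n", nslots   ? nslots   : "(unset)");
        return 0;
    }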

HTH
Ralph

On Jan 31, 2012, at 12:19 PM, Reuti wrote:

> On 31.01.2012 at 20:12, Jeff Squyres wrote:
> 
>> I only noticed after the fact that Tom is also here at Cisco (it's a big 
>> company, after all :-) ).
>> 
>> I've contacted him using our proprietary super-secret Cisco handshake (i.e., 
>> the internal phone network); I'll see if I can figure out the issues 
>> off-list.
> 
> But I would be interested in a statement about a hostlist for singleton 
> startups. Or whether it's honoring the tight integration nodes more by 
> accident than by design. And as said: I see a wrong allocation, as the 
> initial ./Mpitest doesn't count as process. I get a 3+1 allocation instead of 
> 2+2 (what is granted by SGE). If started with "mpiexec -np 1 ./Mpitest" all 
> is fine.
> 
> -- Reuti
> 
> 
>> On Jan 31, 2012, at 1:08 PM, Dave Love wrote:
>> 
>>> Reuti  writes:
>>> 
 Maybe it's a side effect of a tight integration that it would start on
 the correct nodes (but I face an incorrect allocation of slots and an
 error message at the end if started without mpiexec), as in this case
 it has no command line option for the hostfile. How to get the
 requested nodes if started from the command line?
>>> 
>>> Yes, I wouldn't expect it to work without mpirun/mpiexec and, of course,
>>> I basically agree with Reuti about the rest.
>>> 
>>> If there is an actual SGE problem or need for an enhancement, though,
>>> please file it per https://arc.liv.ac.uk/trac/SGE#mail
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Spawn_multiple with tight integration to SGE grid engine

2012-01-31 Thread Reuti

On 31.01.2012 at 20:38, Ralph Castain wrote:

> Not sure I fully grok this thread, but will try to provide an answer.
> 
> When you start a singleton, it spawns off a daemon that is the equivalent of 
> "mpirun". This daemon is created for the express purpose of allowing the 
> singleton to use MPI dynamics like comm_spawn - without it, the singleton 
> would be unable to execute those functions.
> 
> The first thing the daemon does is read the local allocation, using the same 
> methods as used by mpirun. So whatever allocation is present that mpirun 
> would have read, the daemon will get. This includes hostfiles and SGE 
> allocations.

So it should also honor the default hostfile of Open MPI when running outside 
of SGE, i.e. from the command line?


> The exception to this is when the singleton gets started in an altered 
> environment - e.g., if SGE changes the environmental variables when launching 
> the singleton process. We see this in some resource managers - you can get an 
> allocation of N nodes, but when you launch a job, the envar in that job only 
> indicates the number of nodes actually running processes in that job. In such 
> a situation, the daemon will see the altered value as its "allocation", 
> potentially causing confusion.

Not sure whether I get it right. When I launch the same application with:

"mpiexec -np1 ./Mpitest" (and get an allocation of 2+2 on the two machines):

27422 ?Sl 4:12 /usr/sge/bin/lx24-x86/sge_execd
 9504 ?S  0:00  \_ sge_shepherd-3791 -bg
 9506 ?Ss 0:00  \_ /bin/sh 
/var/spool/sge/pc15370/job_scripts/3791
 9507 ?S  0:00  \_ mpiexec -np 1 ./Mpitest
 9508 ?R  0:07  \_ ./Mpitest
 9509 ?Sl 0:00  \_ /usr/sge/bin/lx24-x86/qrsh -inherit 
-nostdin -V pc15381  orted -mca
 9513 ?S  0:00  \_ /home/reuti/mpitest/Mpitest --child

 2861 ?Sl10:47 /usr/sge/bin/lx24-x86/sge_execd
25434 ?Sl 0:00  \_ sge_shepherd-3791 -bg
25436 ?Ss 0:00  \_ /usr/sge/utilbin/lx24-x86/qrsh_starter 
/var/spool/sge/pc15381/active_jobs/3791.1/1.pc15381
25444 ?S  0:00  \_ orted -mca ess env -mca orte_ess_jobid 
821952512 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri 
25447 ?S  0:01  \_ /home/reuti/mpitest/Mpitest --child
25448 ?S  0:01  \_ /home/reuti/mpitest/Mpitest --child

This is what I expect (main + 1 child, other node gets 2 children). Now I 
launch the singleton instead (nothing changed besides this, still 2+2 granted):

"./Mpitest" and get:

27422 ?Sl 4:12 /usr/sge/bin/lx24-x86/sge_execd
 9546 ?S  0:00  \_ sge_shepherd-3793 -bg
 9548 ?Ss 0:00  \_ /bin/sh 
/var/spool/sge/pc15370/job_scripts/3793
 9549 ?R  0:00  \_ ./Mpitest
 9550 ?Ss 0:00  \_ orted --hnp --set-sid --report-uri 6 
--singleton-died-pipe 7
 9551 ?Sl 0:00  \_ /usr/sge/bin/lx24-x86/qrsh 
-inherit -nostdin -V pc15381 orted
 9554 ?S  0:00  \_ /home/reuti/mpitest/Mpitest 
--child
 9555 ?S  0:00  \_ /home/reuti/mpitest/Mpitest 
--child

 2861 ?Sl10:47 /usr/sge/bin/lx24-x86/sge_execd
25494 ?Sl 0:00  \_ sge_shepherd-3793 -bg
25495 ?Ss 0:00  \_ /usr/sge/utilbin/lx24-x86/qrsh_starter 
/var/spool/sge/pc15381/active_jobs/3793.1/1.pc15381
25502 ?S  0:00  \_ orted -mca ess env -mca orte_ess_jobid 
814940160 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri 
25503 ?S  0:00  \_ /home/reuti/mpitest/Mpitest --child

Only one child is going to the other node. The environment is the same in both 
cases. Is this the correct behavior?

-- Reuti


> For this reason, I generally recommend that you run dynamic applications 
>> using mpirun when operating in RM-managed environments to avoid confusion. Or 
> at least use "printenv" to check that the envars are going to be right before 
> trying to start from a singleton.
> 
> HTH
> Ralph
> 
> On Jan 31, 2012, at 12:19 PM, Reuti wrote:
> 
>> On 31.01.2012 at 20:12, Jeff Squyres wrote:
>> 
>>> I only noticed after the fact that Tom is also here at Cisco (it's a big 
>>> company, after all :-) ).
>>> 
>>> I've contacted him using our proprietary super-secret Cisco handshake 
>>> (i.e., the internal phone network); I'll see if I can figure out the issues 
>>> off-list.
>> 
>> But I would be interested in a statement about a hostlist for singleton 
>> startups. Or whether it's honoring the tight integration nodes more by 
>> accident than by design. And as said: I see a wrong allocation, as the 
>> initial ./Mpitest doesn't count as process. I get a 3+1 allocation instead 
>> of 2+2 (what is granted by SGE). If started with "mpiexec -np 1 ./Mpitest" 
>> all is fine.
>> 
>> -- Reuti
>> 
>> 
>>> On Jan 31, 2012

Re: [OMPI users] core binding failure on Interlagos (and possibly Magny-Cours)

2012-01-31 Thread Brice Goglin
On 31/01/2012 19:02, Dave Love wrote:
>> FWIW, the Linux kernel (at least up to 3.2) still reports wrong L2 and
>> L1i cache information on AMD Bulldozer. Kernel bug reported at
>> https://bugzilla.kernel.org/show_bug.cgi?id=42607
> I assume that isn't relevant for open-mpi, just other things.  Is that
> right?

In 1.5.x, cache info doesn't matter as far as I know.

In trunk, the affinity code has been reworked. I think you can bind
process to caches there. Binding to L2 wouldn't work as expected (would
bind to one core instead of 2). hwloc doesn't have instruction cache
support so far, so wrong L1i info doesn't matter.

I don't know if anybody in trunk uses shared cache size yet (for BTL SM
tuning for instance).


> We'll try to get some action out of AMD in the face of a procurement, if
> nothing else.

I just sent a link to the kernel bugreport to my hwloc contact at AMD.

Brice



Re: [OMPI users] Spawn_multiple with tight integration to SGE grid engine

2012-01-31 Thread Ralph Castain

On Jan 31, 2012, at 12:58 PM, Reuti wrote:

> 
> On 31.01.2012 at 20:38, Ralph Castain wrote:
> 
>> Not sure I fully grok this thread, but will try to provide an answer.
>> 
>> When you start a singleton, it spawns off a daemon that is the equivalent of 
>> "mpirun". This daemon is created for the express purpose of allowing the 
>> singleton to use MPI dynamics like comm_spawn - without it, the singleton 
>> would be unable to execute those functions.
>> 
>> The first thing the daemon does is read the local allocation, using the same 
>> methods as used by mpirun. So whatever allocation is present that mpirun 
>> would have read, the daemon will get. This includes hostfiles and SGE 
>> allocations.
> 
> So it should honor also the default hostfile of Open MPI if running outside 
> of SGE, i.e. from the command line?

Yes

> 
> 
>> The exception to this is when the singleton gets started in an altered 
>> environment - e.g., if SGE changes the environmental variables when 
>> launching the singleton process. We see this in some resource managers - you 
>> can get an allocation of N nodes, but when you launch a job, the envar in 
>> that job only indicates the number of nodes actually running processes in 
>> that job. In such a situation, the daemon will see the altered value as its 
>> "allocation", potentially causing confusion.
> 
> Not sure whether I get it right. When I launch the same application with:
> 
> "mpiexec -np1 ./Mpitest" (and get an allocation of 2+2 on the two machines):
> 
> 27422 ?Sl 4:12 /usr/sge/bin/lx24-x86/sge_execd
> 9504 ?S  0:00  \_ sge_shepherd-3791 -bg
> 9506 ?Ss 0:00  \_ /bin/sh 
> /var/spool/sge/pc15370/job_scripts/3791
> 9507 ?S  0:00  \_ mpiexec -np 1 ./Mpitest
> 9508 ?R  0:07  \_ ./Mpitest
> 9509 ?Sl 0:00  \_ /usr/sge/bin/lx24-x86/qrsh -inherit 
> -nostdin -V pc15381  orted -mca
> 9513 ?S  0:00  \_ /home/reuti/mpitest/Mpitest --child
> 
> 2861 ?Sl10:47 /usr/sge/bin/lx24-x86/sge_execd
> 25434 ?Sl 0:00  \_ sge_shepherd-3791 -bg
> 25436 ?Ss 0:00  \_ /usr/sge/utilbin/lx24-x86/qrsh_starter 
> /var/spool/sge/pc15381/active_jobs/3791.1/1.pc15381
> 25444 ?S  0:00  \_ orted -mca ess env -mca orte_ess_jobid 
> 821952512 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri 
> 25447 ?S  0:01  \_ /home/reuti/mpitest/Mpitest --child
> 25448 ?S  0:01  \_ /home/reuti/mpitest/Mpitest --child
> 
> This is what I expect (main + 1 child, other node gets 2 children). Now I 
> launch the singleton instead (nothing changed besides this, still 2+2 
> granted):
> 
> "./Mpitest" and get:
> 
> 27422 ?Sl 4:12 /usr/sge/bin/lx24-x86/sge_execd
> 9546 ?S  0:00  \_ sge_shepherd-3793 -bg
> 9548 ?Ss 0:00  \_ /bin/sh 
> /var/spool/sge/pc15370/job_scripts/3793
> 9549 ?R  0:00  \_ ./Mpitest
> 9550 ?Ss 0:00  \_ orted --hnp --set-sid --report-uri 
> 6 --singleton-died-pipe 7
> 9551 ?Sl 0:00  \_ /usr/sge/bin/lx24-x86/qrsh 
> -inherit -nostdin -V pc15381 orted
> 9554 ?S  0:00  \_ /home/reuti/mpitest/Mpitest 
> --child
> 9555 ?S  0:00  \_ /home/reuti/mpitest/Mpitest 
> --child
> 
> 2861 ?Sl10:47 /usr/sge/bin/lx24-x86/sge_execd
> 25494 ?Sl 0:00  \_ sge_shepherd-3793 -bg
> 25495 ?Ss 0:00  \_ /usr/sge/utilbin/lx24-x86/qrsh_starter 
> /var/spool/sge/pc15381/active_jobs/3793.1/1.pc15381
> 25502 ?S  0:00  \_ orted -mca ess env -mca orte_ess_jobid 
> 814940160 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri 
> 25503 ?S  0:00  \_ /home/reuti/mpitest/Mpitest --child
> 
> Only one child is going to the other node. The environment is the same in 
> both cases. Is this the correct behavior?


We probably aren't correctly marking the original singleton on that node, and 
so the mapper thinks there are still two slots available on the original node.


> 
> -- Reuti
> 
> 
>> For this reason, I generally recommend that you run dynamic applications 
>> using mpirun when operating in RM-managed environments to avoid confusion. 
>> Or at least use "printenv" to check that the envars are going to be right 
>> before trying to start from a singleton.
>> 
>> HTH
>> Ralph
>> 
>> On Jan 31, 2012, at 12:19 PM, Reuti wrote:
>> 
>>> On 31.01.2012 at 20:12, Jeff Squyres wrote:
>>> 
 I only noticed after the fact that Tom is also here at Cisco (it's a big 
 company, after all :-) ).
 
 I've contacted him using our proprietary super-secret Cisco handshake 
 (i.e., the internal phone network); I'll see if I can figure out the 
 issues off-list.
>>> 
>>> But I would be interested in a statement about a hostlist for singleton

Re: [OMPI users] core binding failure on Interlagos (and possibly Magny-Cours)

2012-01-31 Thread Jeff Squyres
On Jan 31, 2012, at 3:20 PM, Brice Goglin wrote:

> In 1.5.x, cache info doesn't matter as far as I know.
> 
> In trunk, the affinity code has been reworked. I think you can bind
> process to caches there. Binding to L2 wouldn't work as expected (would
> bind to one core instead of 2). hwloc doesn't have instruction cache
> support so far, so wrong L1i info doesn't matter.
> 
> I don't know if anybody in trunk uses shared cache size yet (for BTL SM
> tuning for instance).

That information is all correct.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/