Re: [OMPI users] General question on the implementation of a "scheduler" on client side...

2010-05-21 Thread Olivier Riff
Hello Jeff,

thanks for your detailed answer.

2010/5/20 Jeff Squyres 

> You're basically talking about implementing some kind of
> application-specific protocol.  A few tips that may help in your design:
>
> 1. Look into MPI_Isend / MPI_Irecv for non-blocking sends and receives.
>  These may be particularly useful in the server side, so that it can do
> other stuff while sends and receives are progressing.
>

-> You are definitely right, I have to switch to non-blocking sends and
receives.
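
For what it's worth, here is a minimal sketch (in C) of the kind of
non-blocking server round I have in mind; the buffer size, the tag and the
trivial client part are made-up placeholders, not my real protocol:

/* Rank 0 ("server") posts MPI_Isend / MPI_Irecv to every client at once
 * instead of sending and receiving sequentially; BUFSIZE and TAG_WORK are
 * invented for this sketch. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define BUFSIZE  1024
#define TAG_WORK 1

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                          /* the "server" */
        int nclients = size - 1;
        char (*sbuf)[BUFSIZE] = malloc(nclients * BUFSIZE);
        char (*rbuf)[BUFSIZE] = malloc(nclients * BUFSIZE);
        MPI_Request *reqs = malloc(2 * nclients * sizeof(MPI_Request));
        memset(sbuf, 0, nclients * BUFSIZE);

        /* Post all sends and receives up front... */
        for (int i = 0; i < nclients; i++) {
            MPI_Isend(sbuf[i], BUFSIZE, MPI_BYTE, i + 1, TAG_WORK,
                      MPI_COMM_WORLD, &reqs[2 * i]);
            MPI_Irecv(rbuf[i], BUFSIZE, MPI_BYTE, i + 1, TAG_WORK,
                      MPI_COMM_WORLD, &reqs[2 * i + 1]);
        }
        /* ... do other useful work here while communication progresses ... */
        MPI_Waitall(2 * nclients, reqs, MPI_STATUSES_IGNORE);

        free(sbuf); free(rbuf); free(reqs);
    } else {                                  /* a "client" */
        char buf[BUFSIZE];
        MPI_Recv(buf, BUFSIZE, MPI_BYTE, 0, TAG_WORK, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        /* ... process the buffer ... */
        MPI_Send(buf, BUFSIZE, MPI_BYTE, 0, TAG_WORK, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}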


>
> 2. You probably already noticed that collective operations (broadcasts and
> the link) need to be invoked by all members of the communicator.  So if you
> want to do a broadcast, everyone needs to know.  That being said, you can
> send a short message to everyone alerting them that a longer broadcast is
> coming -- then they can execute MPI_BCAST, etc.  That works best if your
> broadcasts are large messages (i.e., you benefit from scalable
> implementations of broadcast) -- otherwise you're individually sending short
> messages followed by a short broadcast.  There might not be much of a "win"
> there.
>

-> That is what I was thinking of implementing. As you mentioned, and
specifically for my case where I mainly send short messages, there might not
be much of a win. By the way, are there any benchmarks comparing sequential
MPI_Isend against MPI_Bcast, for instance? The aim is to determine above which
buffer size an MPI_Bcast is worth using in my case. You can answer that the
test is easy to do and that I can run it by myself :)
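
To make the idea concrete, a rough sketch of the "short alert message followed
by MPI_Bcast" pattern could look like the following (the tag, the command
encoding and the payload size are invented for illustration only):

/* Rank 0 first tells every client, with a tiny point-to-point message,
 * that a broadcast of 'len' bytes follows; only then does everybody enter
 * MPI_Bcast.  TAG_CTRL and the CMD_* encoding are made up for this sketch. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define TAG_CTRL 42
enum { CMD_BCAST = 1, CMD_STOP = 2 };

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* Server: announce a broadcast of 'len' bytes, perform it, then
         * tell everyone to stop. */
        int len = 4096;
        char *payload = malloc(len);
        memset(payload, 'x', len);

        int ctrl[2] = { CMD_BCAST, len };
        for (int r = 1; r < size; r++)
            MPI_Send(ctrl, 2, MPI_INT, r, TAG_CTRL, MPI_COMM_WORLD);
        MPI_Bcast(payload, len, MPI_BYTE, 0, MPI_COMM_WORLD);

        ctrl[0] = CMD_STOP;
        for (int r = 1; r < size; r++)
            MPI_Send(ctrl, 2, MPI_INT, r, TAG_CTRL, MPI_COMM_WORLD);
        free(payload);
    } else {
        /* Client: wait for a control message saying which collective (if
         * any) comes next, then join the matching call. */
        for (;;) {
            int ctrl[2];
            MPI_Recv(ctrl, 2, MPI_INT, 0, TAG_CTRL, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            if (ctrl[0] == CMD_STOP)
                break;
            char *buf = malloc(ctrl[1]);
            MPI_Bcast(buf, ctrl[1], MPI_BYTE, 0, MPI_COMM_WORLD);
            /* ... process buf ... */
            free(buf);
        }
    }

    MPI_Finalize();
    return 0;
}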


>
> 3. FWIW, the MPI Forum has introduced the concept of non-blocking
> collective operations for the upcoming MPI-3 spec.  These may help; google
> for libnbc for a (non-optimized) implementation that may be of help to you.
>  MPI implementations (like Open MPI) will feature non-blocking collectives
> someday in the future.
>
>
-> Interesting to know and to keep in mind. Sometimes the future is really
near.
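
For reference, once a library provides non-blocking collectives, the broadcast
itself no longer has to block. A minimal sketch assuming an MPI-3 style
MPI_Ibcast (not available in Open MPI 1.4.x; libnbc offers a similar,
non-optimized prototype today):

/* Sketch only: start the broadcast, overlap it with other work, then wait.
 * Requires an MPI-3 (or libnbc-like) non-blocking collective implementation. */
#include <mpi.h>

int main(int argc, char *argv[])
{
    int data[1024] = { 0 };
    MPI_Request req;

    MPI_Init(&argc, &argv);

    MPI_Ibcast(data, 1024, MPI_INT, 0, MPI_COMM_WORLD, &req);
    /* ... overlap other computation with the broadcast here ... */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}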

Thanks again for your answer and info.

Olivier



> On May 20, 2010, at 5:30 AM, Olivier Riff wrote:
>
> > Hello,
> >
> > I have a general question about the best way to implement an openmpi
> application, i.e the design of the application.
> >
> > A machine (I call it the "server") should regularly send to a cluster
> containing a lot of processors (the "clients") tasks to do (byte buffers of
> widely varying sizes).
> > The server should send a different buffer to each client, then wait for
> each client's answer (a buffer sent back by each client after some
> processing), and retrieve the result data.
> >
> > First I made something looking like this.
> > On the server side: send buffers sequentially to each client using
> MPI_Send.
> > On each client side: a loop which waits for a buffer using MPI_Recv, then
> processes the buffer and sends the result using MPI_Send.
> > This is really not efficient because a lot of time is lost due to the
> fact that the server sends and receives the buffers sequentially.
> > Its only advantage is that the client side has a pretty simple
> scheduler:
> > Wait for buffer (MPI_Recv) -> Analyse it -> Send result (MPI_Send)
> >
> > My wish is to mix MPI_Send/MPI_Recv and other mpi functions like
> MPI_BCast/MPI_Scatter/MPI_Gather... (like I imagine every mpi application
> does).
> > The problem is that I cannot find an easy solution so that each
> client knows which kind of mpi function is currently being called by the
> server. If the server calls MPI_BCast the client should do the same. Sending
> a first message each time to indicate the function the server will call next
> does not look very nice. Still, I do not see an easier/better way to
> implement an "adaptive" scheduler on the client side.
> >
> > Any tip, advice, help would be appreciate.
> >
> >
> > Thanks,
> >
> > Olivier
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


[OMPI users] An error occurred in MPI_Bcast; MPI_ERR_TYPE: invalid datatype

2010-05-21 Thread Pankatz, Klaus
Hi folks,

openMPI 1.4.1 seems to have another problem with my machine, or something on 
it. 

This little program here (compiled with mpif90), started with mpiexec -np 4
a.out, produces the following output:
Surprisingly, the same thing written in C code (compiled with mpiCC) works
without a problem.
Could it be interference with other MPI distributions, although I think I have
deleted them all?

Note: The error also occurs with my climate model. The error is nearly the
same, only with MPI_ERR_TYPE: invalid root.
I've compiled openMPI not as root, but in my home directory.

Thanks for your advice, 
Klaus

My machine:
> OpenMPI-version 1.4.1 compiled with Lahey Fortran 95 (lf95).
> OpenMPI was compiled "out of the box" only changing to the Lahey compiler 
> with a setenv $FC lf95
>
> The System: Linux marvin 2.6.27.6-1 #1 SMP Sat Nov 15 20:19:04 CET 2008 
> x86_64 GNU/Linux
>
> Compiler: Lahey/Fujitsu Linux64 Fortran Compiler Release L8.10a

***
Output:
[marvin:21997] *** An error occurred in MPI_Bcast
[marvin:21997] *** on communicator MPI_COMM_WORLD
[marvin:21997] *** MPI_ERR_TYPE: invalid datatype
[marvin:21997] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
Process 1 : k= 10 before
--
mpiexec has exited due to process rank 1 with PID 21997 on
node marvin exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpiexec (as reported here).
--
[marvin:21993] 3 more processes have sent help message help-mpi-errors.txt / 
mpi_errors_are_fatal
[marvin:21993] Set MCA parameter "orte_base_help_aggregate" to 0 to see all 
help / error messages
Process 3 : k= 10 before

Program Fortran90:
  include 'mpif.h'

  integer k, rank, size, ierror, tag, p


  call MPI_INIT(ierror)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
  if (rank == 0) then 
 k = 20
  else 
 k = 10
  end if
  do p= 0,size,1

 if (rank == p) then
print*, 'Process', p,': k=', k,  'before'

 end if

  enddo 
  call MPI_Bcast(k, 1, MPI_INT,0,MPI_COMM_WORLD)
  do p =0,size,1
 if (rank == p) then
print*, 'Process', p, ': k=', k, 'after'
  end if   
  enddo
  call MPI_Finalize(ierror)

  end  

Program C-Code:

#include <stdio.h>
#include <mpi.h>
int main (int argc, char *argv[])
{
int k,id,p,size;
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD, &id);
MPI_Comm_size(MPI_COMM_WORLD, &size);
if(id == 0)
k = 20;
else
k = 10;
for(p=0; p<size; p++) {
if(id == p)
printf("Process %d: k= %d before\n",id,k);
}
//note MPI_Bcast must be put where all other processes
//can see it.
MPI_Bcast(&k,1,MPI_INT,0,MPI_COMM_WORLD);
for(p=0; p<size; p++) {
if(id == p)
printf("Process %d: k= %d after\n",id,k);
}
MPI_Finalize();
  return 0 ;
  }
***

Re: [OMPI users] GM + OpenMPI bug ...

2010-05-21 Thread José Ignacio Aliaga Estellés


Hi,

We have used the lspci -vvxxx and we have obtained:

bi00: 04:01.0 Ethernet controller: Intel Corporation 82544EI Gigabit  
Ethernet Controller (Copper) (rev 02)

bi00:   Subsystem: Intel Corporation PRO/1000 XT Server Adapter
bi00:   Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop-  
ParErr- Stepping- SERR+ FastB2B-
bi00:   Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium  
>TAbort- SERR- 
bi00:   Latency: 64 (63750ns min), Cache Line Size: 64 bytes
bi00:   Interrupt: pin A routed to IRQ 185
bi00:   Region 0: Memory at fe9e (64-bit, non-prefetchable)  
[size=128K]
bi00:   Region 2: Memory at fe9d (64-bit, non-prefetchable)  
[size=64K]

bi00:   Region 4: I/O ports at dc80 [size=32]
bi00:   Expansion ROM at fe9c [disabled] [size=64K]
bi00:   Capabilities: [dc] Power Management version 2
bi00: Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0 
+,D1-,D2-,D3hot+,D3cold-)

bi00: Status: D0 PME-Enable- DSel=0 DScale=1 PME-
bi00:   Capabilities: [e4] PCI-X non-bridge device
bi00: Command: DPERE- ERO+ RBC=512 OST=1
bi00: Status: Dev=04:01.0 64bit+ 133MHz+ SCD- USC- DC=simple  
DMMRBC=2048 DMOST=1 DMCRS=16 RSCEM- 266MHz- 533MHz-
bi00:   Capabilities: [f0] Message Signalled Interrupts: 64bit+  
Queue=0/0 Enable-

bi00: Address:   Data: 
bi00: 00: 86 80 08 10 17 01 30 02 02 00 00 02 10 40 00 00
bi00: 10: 04 00 9e fe 00 00 00 00 04 00 9d fe 00 00 00 00
bi00: 20: 81 dc 00 00 00 00 00 00 00 00 00 00 86 80 07 11
bi00: 30: 00 00 9c fe dc 00 00 00 00 00 00 00 05 01 ff 00
bi00: 40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bi00: 50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bi00: 60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bi00: 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bi00: 80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bi00: 90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bi00: a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bi00: b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bi00: c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bi00: d0: 00 00 00 00 00 00 00 00 00 00 00 00 01 e4 22 48
bi00: e0: 00 20 00 40 07 f0 02 00 08 04 43 04 00 00 00 00
bi00: f0: 05 00 80 00 00 00 00 00 00 00 00 00 00 00 00 00

We don't know how to interpret this information. We suppose that SERR
and PERR are not activated, if we have understood the Status line
("... >SERR- ...") correctly. Could you confirm that? If this is the case,
could you indicate how to activate them?


Additionally, if you know of any software tool or methodology to check
the hardware/software, could you please tell us how to do it?


Thanks in advance.

Best regards,

  José I. Aliaga

On 20/05/2010, at 16:29, Patrick Geoffray wrote:


Hi Jose,

On 5/12/2010 10:57 PM, José Ignacio Aliaga Estellés wrote:
I think that I have found a bug in the implementation of the GM
collective routines included in OpenMPI. The version of the GM software
is 2.0.30 for the PCI64 cards.



I obtain the same problems when I use the 1.4.1 or the 1.4.2 version.
Could you help me? Thanks.


We have been running the test you provided on 8 nodes for 4 hours  
and haven't seen any errors. The setup used GM 2.0.30 and openmpi  
1.4.2 on PCI-X cards (M3F-PCIXD-2 aka 'D' cards). We do not have  
PCI64 NICs anymore, and no machines with a PCI 64/66 slot.


One-bit errors are rarely a software problem, they are usually  
linked to hardware corruption. Old PCI has a simple parity check  
but most machines/BIOS of this era ignored reported errors. You may  
want to check the lspci output on your machines and see if SERR or  
PERR is set. You can also try to reset each NIC in its PCI slot, or  
use a different slot if available.


Hope it helps.

Patrick
--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com






[OMPI users] [sge::tight-integration] slot scheduling and resources handling

2010-05-21 Thread Eloi Gaudry
Hi there,

I'm observing something strange on our cluster managed by SGE6.2u4 when 
launching a parallel computation on several nodes, using OpenMPI/SGE tight-
integration mode (OpenMPI-1.3.3). It seems that the SGE allocated slots are 
not used by OpenMPI, as if OpenMPI was doing its own round-robin allocation 
based on the allocated node hostnames.

Here is what I'm doing:
- launch a parallel computation involving 8 processors, using for each of them 
14GB of memory. I'm using a qsub command where I request the memory_free resource 
and use tight integration with openmpi
- 3 servers are available:
. barney with 4 cores (4 slots) and 32GB
. carl with 4 cores (4 slots) and 32GB
. charlie with 8 cores (8 slots) and 64GB

Here is the output of the allocated nodes (OpenMPI output):
==   ALLOCATED NODES   ==

 Data for node: Name: charlie   Launch id: -1 Arch: ffc91200  State: 2
  Daemon: [[44332,0],0] Daemon launched: True
  Num slots: 4  Slots in use: 0
  Num slots allocated: 4  Max slots: 0
  Username on node: NULL
  Num procs: 0  Next node_rank: 0
 Data for node: Name: carl.fft  Launch id: -1 Arch: 0 State: 2
  Daemon: Not defined Daemon launched: False
  Num slots: 2  Slots in use: 0
  Num slots allocated: 2  Max slots: 0
  Username on node: NULL
  Num procs: 0  Next node_rank: 0
 Data for node: Name: barney.fft  Launch id: -1 Arch: 0 State: 2
  Daemon: Not defined Daemon launched: False
  Num slots: 2  Slots in use: 0
  Num slots allocated: 2  Max slots: 0
  Username on node: NULL
  Num procs: 0  Next node_rank: 0

=

Here is what I see when my computation is running on the cluster:
# rank   pid  hostname
 0 28112  charlie
 1 11417  carl
 2 11808  barney
 3 28113  charlie
 4 11418  carl
 5 11809  barney
 6 28114  charlie
 7 11419  carl

Note that the parallel environment used under SGE is defined as:
[eg@moe:~]$ qconf -sp round_robin
pe_name            round_robin
slots              32
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE

I'm wondering why OpenMPI didn't use the allocated nodes as chosen by SGE (cf.
the "ALLOCATED NODES" report) but instead placed each process of the parallel
computation one node at a time, using a round-robin method.

Note that I'm using the '--bynode' option on the orterun command line. If the
behavior I'm observing is simply the consequence of using this option, please
let me know. This would then mean that SGE tight integration has a lower
priority in determining orterun behavior than the command-line options.

Any help would be appreciated,
Thanks,
Eloi


-- 


Eloi Gaudry

Free Field Technologies
Axis Park Louvain-la-Neuve
Rue Emile Francqui, 1
B-1435 Mont-Saint Guibert
BELGIUM

Company Phone: +32 10 487 959
Company Fax:   +32 10 454 626


Re: [OMPI users] Using a rankfile for ompi-restart

2010-05-21 Thread Fernando Lemos
On Tue, May 18, 2010 at 3:53 PM, Josh Hursey  wrote:
>> I've noticed that ompi-restart doesn't support the --rankfile option.
>> It only supports --hostfile/--machinefile. Is there any reason
>> --rankfile isn't supported?
>>
>> Suppose you have a cluster without a shared file system. When one node
>> fails, you transfer its checkpoint to a spare node and invoke
>> ompi-restart. In 1.5, ompi-restart automagically handles this
>> situation (if you supply a hostfile) and is able to restart the
>> process, but I'm afraid it might not always be able to find the
>> checkpoints this way. If you could specify to ompi-restart where the
>> ranks are (and thus where the checkpoints are), then maybe restart
>> would always work (as long as you've specified the location of
>> the checkpoints correctly), or maybe ompi-restart would be faster.
>
> We can easily add the --rankfile option to ompi-restart. I filed a ticket to
> add this option, and assess if there are other options that we should pass
> along (e.g., --npernode, --byhost). I should be able to fix this in the next
> week or so, but the ticket is linked below so you can follow the progress.
>  https://svn.open-mpi.org/trac/ompi/ticket/2413

Nice, thanks!

> Most of the ompi-restart parameters are passed directly to the mpirun
> command. ompi-restart is mostly a wrapper around mpirun that is able to
> parse the metadata and create the appcontext file. I wonder if a more
> general parameter like '--mpirun-args ...' might make sense so users don't
> have to wait on me to expose the interface they need.
>
> Donno. What do you think? Should I create a '--mpirun-args' option or
> duplicate all of the mpirun command line parameters, or some combination of
> the two.

Well, I think an --mpirun-args argument would be even more useful, as
it's hard to foresee how ompi-restart is going to be used. Maybe a
combination of the two would be ideal, since some options are going to
be used very often (e.g. --hostfile, --hosts, etc.).


Regards,



Re: [OMPI users] [sge::tight-integration] slot scheduling and resources handling

2010-05-21 Thread Reuti
Hi,

On 21.05.2010, at 14:11, Eloi Gaudry wrote:

> Hi there,
> 
> I'm observing something strange on our cluster managed by SGE6.2u4 when 
> launching a parallel computation on several nodes, using OpenMPI/SGE tight-
> integration mode (OpenMPI-1.3.3). It seems that the SGE allocated slots are 
> not used by OpenMPI, as if OpenMPI was doing its own round-robin allocation 
> based on the allocated node hostnames.

you compiled Open MPI with --with-sge (and recompiled your applications)? You 
are using the correct mpiexec?

-- Reuti


> Here is what I'm doing:
> - launch a parallel computation involving 8 processors, using for each of 
> them 
> 14GB of memory. I'm using a qsub command where I request the memory_free resource 
> and use tight integration with openmpi
> - 3 servers are available:
> . barney with 4 cores (4 slots) and 32GB
> . carl with 4 cores (4 slots) and 32GB
> . charlie with 8 cores (8 slots) and 64GB
> 
> Here is the output of the allocated nodes (OpenMPI output):
> ==   ALLOCATED NODES   ==
> 
> Data for node: Name: charlie   Launch id: -1 Arch: ffc91200  State: 2
>  Daemon: [[44332,0],0] Daemon launched: True
>  Num slots: 4  Slots in use: 0
>  Num slots allocated: 4  Max slots: 0
>  Username on node: NULL
>  Num procs: 0  Next node_rank: 0
> Data for node: Name: carl.fft  Launch id: -1 Arch: 0 State: 2
>  Daemon: Not defined Daemon launched: False
>  Num slots: 2  Slots in use: 0
>  Num slots allocated: 2  Max slots: 0
>  Username on node: NULL
>  Num procs: 0  Next node_rank: 0
> Data for node: Name: barney.fft  Launch id: -1 Arch: 0 State: 2
>  Daemon: Not defined Daemon launched: False
>  Num slots: 2  Slots in use: 0
>  Num slots allocated: 2  Max slots: 0
>  Username on node: NULL
>  Num procs: 0  Next node_rank: 0
> 
> =
> 
> Here is what I see when my computation is running on the cluster:
> # rank   pid  hostname
> 0 28112  charlie
> 1 11417  carl
> 2 11808  barney
> 3 28113  charlie
> 4 11418  carl
> 5 11809  barney
> 6 28114  charlie
> 7 11419  carl
> 
> Note that the parallel environment used under SGE is defined as:
> [eg@moe:~]$ qconf -sp round_robin
> pe_name            round_robin
> slots              32
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> allocation_rule    $round_robin
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
> accounting_summary FALSE
> 
> I'm wondering why OpenMPI didn't use the allocated nodes as chosen by SGE (cf.
> the "ALLOCATED NODES" report) but instead placed each process of the parallel
> computation one node at a time, using a round-robin method.
> 
> Note that I'm using the '--bynode' option on the orterun command line. If the
> behavior I'm observing is simply the consequence of using this option, please
> let me know. This would then mean that SGE tight integration has a lower
> priority in determining orterun behavior than the command-line options.
> 
> Any help would be appreciated,
> Thanks,
> Eloi
> 
> 
> -- 
> 
> 
> Eloi Gaudry
> 
> Free Field Technologies
> Axis Park Louvain-la-Neuve
> Rue Emile Francqui, 1
> B-1435 Mont-Saint Guibert
> BELGIUM
> 
> Company Phone: +32 10 487 959
> Company Fax:   +32 10 454 626
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] An error occurred in MPI_Bcast; MPI_ERR_TYPE: invalid datatype

2010-05-21 Thread Eugene Loh

Pankatz, Klaus wrote:


Hi folks,

openMPI 1.4.1 seems to have another problem with my machine, or something on it. 


This little program here (compiled with mpif90), started with mpiexec -np 4
a.out, produces the following output:
Surprisingly, the same thing written in C code (compiled with mpiCC) works
without a problem.
 


Not so surprising since it's C code!

For Fortran:

MPI_INT -> MPI_INTEGER
and add an ierror argument to your MPI_Bcast call (the way you have for 
MPI_Comm_rank/size).


Re: [OMPI users] An error occurred in MPI_Bcast; MPI_ERR_TYPE: invalid datatype

2010-05-21 Thread Tom Rosmond
Your fortran call to 'mpi_bcast' needs an ierror parameter at the end of
the argument list.  Also, I don't think 'MPI_INT' is correct for
fortran; it should be 'MPI_INTEGER'.  With these changes the program
works OK.

T. Rosmond

On Fri, 2010-05-21 at 11:40 +0200, Pankatz, Klaus wrote:
> Hi folks,
> 
> openMPI 1.4.1 seems to have another problem with my machine, or something on 
> it. 
> 
> This little program here (compiled with mpif90), started with mpiexec -np 4
> a.out, produces the following output:
> Surprisingly, the same thing written in C code (compiled with mpiCC) works
> without a problem.
> Could it be interference with other MPI distributions, although I think I
> have deleted them all?
> 
> Note: The error also occurs with my climate model. The error is nearly the
> same, only with MPI_ERR_TYPE: invalid root.
> I've compiled openMPI not as root, but in my home directory.
> 
> Thanks for your advice, 
> Klaus
>  
> My machine:
> > OpenMPI-version 1.4.1 compiled with Lahey Fortran 95 (lf95).
> > OpenMPI was compiled "out of the box" only changing to the Lahey compiler 
> > with a setenv $FC lf95
> >
> > The System: Linux marvin 2.6.27.6-1 #1 SMP Sat Nov 15 20:19:04 CET 2008 
> > x86_64 GNU/Linux
> >
> > Compiler: Lahey/Fujitsu Linux64 Fortran Compiler Release L8.10a
> 
> ***
> Output:
> [marvin:21997] *** An error occurred in MPI_Bcast
> [marvin:21997] *** on communicator MPI_COMM_WORLD
> [marvin:21997] *** MPI_ERR_TYPE: invalid datatype
> [marvin:21997] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> Process 1 : k= 10 before
> --
> mpiexec has exited due to process rank 1 with PID 21997 on
> node marvin exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpiexec (as reported here).
> --
> [marvin:21993] 3 more processes have sent help message help-mpi-errors.txt / 
> mpi_errors_are_fatal
> [marvin:21993] Set MCA parameter "orte_base_help_aggregate" to 0 to see all 
> help / error messages
> Process 3 : k= 10 before
> 
> Program Fortran90:
>   include 'mpif.h'
> 
>   integer k, rank, size, ierror, tag, p
> 
> 
>   call MPI_INIT(ierror)
>   call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
>   call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
>   if (rank == 0) then 
>  k = 20
>   else 
>  k = 10
>   end if
>   do p= 0,size,1
>  
>  if (rank == p) then
> print*, 'Process', p,': k=', k,  'before'
>  
>  end if
>  
>   enddo 
>   call MPI_Bcast(k, 1, MPI_INT,0,MPI_COMM_WORLD)
>   do p =0,size,1
>  if (rank == p) then
> print*, 'Process', p, ': k=', k, 'after'
>   end if   
>   enddo
>   call MPI_Finalize(ierror)
>
>   end  
> 
> Program C-Code:
> 
> #include <stdio.h>
> #include <mpi.h>
> int main (int argc, char *argv[])
> {
> int k,id,p,size;
> MPI_Init(&argc,&argv);
> MPI_Comm_rank(MPI_COMM_WORLD, &id);
> MPI_Comm_size(MPI_COMM_WORLD, &size);
> if(id == 0)
> k = 20;
> else
> k = 10;
> for(p=0; p<size; p++) {
> if(id == p)
> printf("Process %d: k= %d before\n",id,k);
> }
> //note MPI_Bcast must be put where all other processes
> //can see it.
> MPI_Bcast(&k,1,MPI_INT,0,MPI_COMM_WORLD);
> for(p=0; p<size; p++) {
> if(id == p)
> printf("Process %d: k= %d after\n",id,k);
> }
> MPI_Finalize();
>   return 0 ;
>   }
> ***
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] [sge::tight-integration] slot scheduling and resources handling

2010-05-21 Thread Eloi Gaudry
Hi Reuti,

Yes, the openmpi binaries used were built after having used the --with-sge
option during configure, and we only use those binaries on our cluster.

[eg@moe:~]$ /opt/openmpi-1.3.3/bin/ompi_info
 Package: Open MPI root@moe Distribution
Open MPI: 1.3.3
   Open MPI SVN revision: r21666
   Open MPI release date: Jul 14, 2009
Open RTE: 1.3.3
   Open RTE SVN revision: r21666
   Open RTE release date: Jul 14, 2009
OPAL: 1.3.3
   OPAL SVN revision: r21666
   OPAL release date: Jul 14, 2009
Ident string: 1.3.3
  Prefix: /opt/openmpi-1.3.3
 Configured architecture: x86_64-unknown-linux-gnu
  Configure host: moe
   Configured by: root
   Configured on: Tue Nov 10 11:19:34 CET 2009
  Configure host: moe
Built by: root
Built on: Tue Nov 10 11:28:14 CET 2009
  Built host: moe
  C bindings: yes
C++ bindings: yes
  Fortran77 bindings: yes (all)
  Fortran90 bindings: yes
 Fortran90 bindings size: small
  C compiler: gcc
 C compiler absolute: /usr/bin/gcc
C++ compiler: g++
   C++ compiler absolute: /usr/bin/g++
  Fortran77 compiler: gfortran
  Fortran77 compiler abs: /usr/bin/gfortran
  Fortran90 compiler: gfortran
  Fortran90 compiler abs: /usr/bin/gfortran
 C profiling: yes
   C++ profiling: yes
 Fortran77 profiling: yes
 Fortran90 profiling: yes
  C++ exceptions: yes
  Thread support: posix (mpi: no, progress: no)
   Sparse Groups: no
  Internal debug support: no
 MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
 libltdl support: yes
   Heterogeneous support: no
 mpirun default --prefix: no
 MPI I/O support: yes
   MPI_WTIME support: gettimeofday
Symbol visibility support: yes
   FT Checkpoint support: no  (checkpoint thread: no)
   MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.3.3)
  MCA memory: ptmalloc2 (MCA v2.0, API v2.0, Component v1.3.3)
   MCA paffinity: linux (MCA v2.0, API v2.0, Component v1.3.3)
   MCA carto: auto_detect (MCA v2.0, API v2.0, Component v1.3.3)
   MCA carto: file (MCA v2.0, API v2.0, Component v1.3.3)
   MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.3.3)
   MCA timer: linux (MCA v2.0, API v2.0, Component v1.3.3)
 MCA installdirs: env (MCA v2.0, API v2.0, Component v1.3.3)
 MCA installdirs: config (MCA v2.0, API v2.0, Component v1.3.3)
 MCA dpm: orte (MCA v2.0, API v2.0, Component v1.3.3)
  MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.3.3)
   MCA allocator: basic (MCA v2.0, API v2.0, Component v1.3.3)
   MCA allocator: bucket (MCA v2.0, API v2.0, Component v1.3.3)
MCA coll: basic (MCA v2.0, API v2.0, Component v1.3.3)
MCA coll: hierarch (MCA v2.0, API v2.0, Component v1.3.3)
MCA coll: inter (MCA v2.0, API v2.0, Component v1.3.3)
MCA coll: self (MCA v2.0, API v2.0, Component v1.3.3)
MCA coll: sm (MCA v2.0, API v2.0, Component v1.3.3)
MCA coll: sync (MCA v2.0, API v2.0, Component v1.3.3)
MCA coll: tuned (MCA v2.0, API v2.0, Component v1.3.3)
  MCA io: romio (MCA v2.0, API v2.0, Component v1.3.3)
   MCA mpool: fake (MCA v2.0, API v2.0, Component v1.3.3)
   MCA mpool: rdma (MCA v2.0, API v2.0, Component v1.3.3)
   MCA mpool: sm (MCA v2.0, API v2.0, Component v1.3.3)
 MCA pml: cm (MCA v2.0, API v2.0, Component v1.3.3)
 MCA pml: csum (MCA v2.0, API v2.0, Component v1.3.3)
 MCA pml: ob1 (MCA v2.0, API v2.0, Component v1.3.3)
 MCA pml: v (MCA v2.0, API v2.0, Component v1.3.3)
 MCA bml: r2 (MCA v2.0, API v2.0, Component v1.3.3)
  MCA rcache: vma (MCA v2.0, API v2.0, Component v1.3.3)
 MCA btl: gm (MCA v2.0, API v2.0, Component v1.3.3)
 MCA btl: self (MCA v2.0, API v2.0, Component v1.3.3)
 MCA btl: sm (MCA v2.0, API v2.0, Component v1.3.3)
 MCA btl: tcp (MCA v2.0, API v2.0, Component v1.3.3)
MCA topo: unity (MCA v2.0, API v2.0, Component v1.3.3)
 MCA osc: pt2pt (MCA v2.0, API v2.0, Component v1.3.3)
 MCA osc: rdma (MCA v2.0, API v2.0, Component v1.3.3)
 MCA iof: hnp (MCA v2.0, API v2.0, Component v1.3.3)
 MCA iof: orted (MCA v2.0, API v2.0, Component v1.3.3)
 MCA iof: tool (MCA v2.0, API v2.0, Component v1.3.3)
 MCA oob: tcp (MCA v2.0, API v2.0, Component v1.3.3)
MCA odls: default (MCA v2.0, API v2.0, Component v1.3.3)
 MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3.3)

Re: [OMPI users] [sge::tight-integration] slot scheduling and resources handling

2010-05-21 Thread Reuti
Hi,

On 21.05.2010, at 17:19, Eloi Gaudry wrote:

> Hi Reuti,
> 
> Yes, the openmpi binaries used were built after having used the --with-sge
> option during configure, and we only use those binaries on our cluster.
> 
> [eg@moe:~]$ /opt/openmpi-1.3.3/bin/ompi_info

> MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3.3)

ok. As you have a Tight Integration as the goal and have set "control_slaves
TRUE" in your PE, SGE wouldn't allow `qrsh -inherit ...` to nodes which are not
in the list of granted nodes. So it looks like your job is running outside of
this Tight Integration with its own `rsh` or `ssh`.

Do you reset $JOB_ID or other environment variables in your jobscript, which 
could trigger Open MPI to assume that it's not running inside SGE?

-- Reuti


> 
> 
> On Friday 21 May 2010 16:01:54 Reuti wrote:
>> Hi,
>> 
>> Am 21.05.2010 um 14:11 schrieb Eloi Gaudry:
>>> Hi there,
>>> 
>>> I'm observing something strange on our cluster managed by SGE6.2u4 when
>>> launching a parallel computation on several nodes, using OpenMPI/SGE
>>> tight- integration mode (OpenMPI-1.3.3). It seems that the SGE allocated
>>> slots are not used by OpenMPI, as if OpenMPI was doing its own
>>> round-robin allocation based on the allocated node hostnames.
>> 
>> you compiled Open MPI with --with-sge (and recompiled your applications)?
>> You are using the correct mpiexec?
>> 
>> -- Reuti
>> 
>>> Here is what I'm doing:
>>> - launch a parallel computation involving 8 processors, using for each of
>>> them 14GB of memory. I'm using a qsub command where I request the
>>> memory_free resource and use tight integration with openmpi
>>> - 3 servers are available:
>>> . barney with 4 cores (4 slots) and 32GB
>>> . carl with 4 cores (4 slots) and 32GB
>>> . charlie with 8 cores (8 slots) and 64GB
>>> 
>>> Here is the output of the allocated nodes (OpenMPI output):
>>> ==   ALLOCATED NODES   ==
>>> 
>>> Data for node: Name: charlie   Launch id: -1 Arch: ffc91200  State: 2
>>> 
>>> Daemon: [[44332,0],0] Daemon launched: True
>>> Num slots: 4  Slots in use: 0
>>> Num slots allocated: 4  Max slots: 0
>>> Username on node: NULL
>>> Num procs: 0  Next node_rank: 0
>>> 
>>> Data for node: Name: carl.fft  Launch id: -1 Arch: 0 State: 2
>>> 
>>> Daemon: Not defined Daemon launched: False
>>> Num slots: 2  Slots in use: 0
>>> Num slots allocated: 2  Max slots: 0
>>> Username on node: NULL
>>> Num procs: 0  Next node_rank: 0
>>> 
>>> Data for node: Name: barney.fft  Launch id: -1 Arch: 0 State: 2
>>> 
>>> Daemon: Not defined Daemon launched: False
>>> Num slots: 2  Slots in use: 0
>>> Num slots allocated: 2  Max slots: 0
>>> Username on node: NULL
>>> Num procs: 0  Next node_rank: 0
>>> 
>>> =
>>> 
>>> Here is what I see when my computation is running on the cluster:
>>> # rank   pid  hostname
>>> 
>>>0 28112  charlie
>>>1 11417  carl
>>>2 11808  barney
>>>3 28113  charlie
>>>4 11418  carl
>>>5 11809  barney
>>>6 28114  charlie
>>>7 11419  carl
>>> 
>>> Note that the parallel environment used under SGE is defined as:
>>> [eg@moe:~]$ qconf -sp round_robin
>>> pe_name            round_robin
>>> slots              32
>>> user_lists         NONE
>>> xuser_lists        NONE
>>> start_proc_args    /bin/true
>>> stop_proc_args     /bin/true
>>> allocation_rule    $round_robin
>>> control_slaves     TRUE
>>> job_is_first_task  FALSE
>>> urgency_slots      min
>>> accounting_summary FALSE
>>> 
>>> I'm wondering why OpenMPI didn't use the allocated nodes as chosen by SGE
>>> (cf. the "ALLOCATED NODES" report) but instead placed each process of the
>>> parallel computation one node at a time, using a round-robin method.
>>> 
>>> Note that I'm using the '--bynode' option on the orterun command line. If
>>> the behavior I'm observing is simply the consequence of using this
>>> option, please let me know. This would then mean that SGE tight
>>> integration has a lower priority in determining orterun behavior than the
>>> command-line options.
>>> 
>>> Any help would be appreciated,
>>> Thanks,
>>> Eloi
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> -- 
> 
> 
> Eloi Gaudry
> 
> Free Field Technologies
> Axis Park Louvain-la-Neuve
> Rue Emile Francqui, 1
> B-1435 Mont-Saint Guibert
> BELGIUM
> 
> Company Phone: +32 10 487 959
> Company Fax:   +32 10 454 626




Re: [OMPI users] GM + OpenMPI bug ...

2010-05-21 Thread Patrick Geoffray

Hi Jose,

On 5/21/2010 6:54 AM, José Ignacio Aliaga Estellés wrote:

We have used the lspci -vvxxx and we have obtained:

bi00: 04:01.0 Ethernet controller: Intel Corporation 82544EI Gigabit
Ethernet Controller (Copper) (rev 02)


This is the output for the Intel GigE NIC; you should look at the one
for the Myricom NIC and the PCI bridge above it (lspci -t to see the tree).



bi00: Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort-
SERR- 

A PERR- status means that no parity error was detected when receiving data.
Looking at the PERR status of the PCI bridge on the other side will show if
there was any corruption on that bus.


As a first step, you can see if you can reproduce errors with a simple 
test involving a single node at a time. You can run "gm_allsize 
--verify" on each machine: it will send packets to itself (loopback in 
the switch) and check for corruption. If you don't see errors after a 
while, that node is probably clean. If you see errors, you can look 
deeper at lspci output to see if it's a PCI problem. If you are using a 
riser card, you can try without.


I am not sure if openMPI has an option to enable debug checksum, but it 
would also be useful to see if it detects anything.



Additionally, if you know of any software tool or methodology to check the
hardware/software, could you please tell us how to do it?


You may want to look at the FAQ on GM troubleshooting:
http://www.myri.com/cgi-bin/fom.pl?file=425

Additionally, you can send email to h...@myri.com to open a ticket.

Patrick


Re: [OMPI users] Some Questions on Building OMPI on Linux Em64t

2010-05-21 Thread Michael E. Thomadakis

Hello,

I am resending this because I am not sure if it was sent out to the OMPI 
list.


Any help would be greatly appreciated.

best 

Michael

On 05/19/10 13:19, Michael E. Thomadakis wrote:

Hello,
I would like to build OMPI V1.4.2 and make it available to our users at the
Supercomputing Center at Texas A&M Univ. Our system is a 2-socket, 4-core 
Nehalem
@2.8GHz, 24GiB DRAM / node, 324 nodes connected to 4xQDR Voltaire fabric,
CentOS/RHEL 5.4.



I have been trying to find the following information :

1) high-resolution timers: how do I specify the HRT linux timers in the
--with-timer=TYPE
  line of ./configure ?

2) I have installed blcr V0.8.2, but when I try to build OMPI and point it to
the full installation, it complains that it cannot find it. Note that I built
BLCR with GCC but I am building OMPI with the Intel compilers (V11.1).


3) Does OMPI by default use SHM for intra-node message IPC but revert to IB
for inter-node messages?

4) How could I select the high-speed transport, say DAPL or OFED IB verbs ? Is
there any preference as to the specific high-speed transport over 
Mellanox/Voltaire QDR IB?

5) When we launch MPI jobs via PBS/TORQUE do we have control on the task and
thread placement on nodes/cores ?

6) Can we suspend/restart cleanly OMPI jobs with the above scheduler ? Any
caveats on suspension / resumption of OMPI jobs ?

7) Do you have any performance data comparing OMPI vs., say, MVAPICHv2 and
IntelMPI? This is not a political issue since I am going to be providing all
these MPI stacks to our users (IntelMPI V4.0 is already installed).




Thank you so much for the great s/w ...

best
Michael



%  \
% Michael E. Thomadakis, Ph.D.  Senior Lead Supercomputer Engineer/Res \
% E-mail: miket AT tamu DOT edu   Texas A&M University \
% web:http://alphamike.tamu.edu   Supercomputing Center \
% Voice:  979-862-3931Teague Research Center, 104B \
% FAX:979-847-8643  College Station, TX 77843, USA \
%  \
   


Re: [OMPI users] General question on the implementation of a "scheduler" on client side...

2010-05-21 Thread Jeff Squyres
On May 21, 2010, at 3:13 AM, Olivier Riff wrote:

> -> That is what I was thinking of implementing. As you mentioned, and
> specifically for my case where I mainly send short messages, there might not
> be much of a win. By the way, are there any benchmarks comparing sequential
> MPI_Isend against MPI_Bcast, for instance? The aim is to determine above which
> buffer size an MPI_Bcast is worth using in my case. You can answer that the
> test is easy to do and that I can run it by myself :)

"It depends".  :-)

You're probably best writing a benchmark yourself that mirrors what your 
application is going to do.
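
If it helps, a rough skeleton for such a benchmark could look something like
the following (iteration counts and message sizes are arbitrary placeholders,
not a tuned measurement):

/* Time a loop of individual sends from rank 0 against a loop of MPI_Bcast
 * calls, for a range of message sizes. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NITER 1000

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int len = 1; len <= 65536; len *= 4) {
        char *buf = calloc(len, 1);

        /* Individual point-to-point sends from rank 0 to everyone else. */
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int it = 0; it < NITER; it++) {
            if (rank == 0) {
                for (int r = 1; r < size; r++)
                    MPI_Send(buf, len, MPI_BYTE, r, 0, MPI_COMM_WORLD);
            } else {
                MPI_Recv(buf, len, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            }
        }
        double t_p2p = MPI_Wtime() - t0;

        /* The same data distributed with a broadcast. */
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (int it = 0; it < NITER; it++)
            MPI_Bcast(buf, len, MPI_BYTE, 0, MPI_COMM_WORLD);
        double t_bcast = MPI_Wtime() - t0;

        if (rank == 0)
            printf("%7d bytes: p2p %.6f s, bcast %.6f s\n",
                   len, t_p2p, t_bcast);
        free(buf);
    }

    MPI_Finalize();
    return 0;
}

But again, the crossover point will depend on your interconnect, the number of
processes, and how closely the benchmark mirrors your application's traffic.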

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/