Re: [OMPI users] very bad parallel scaling of vasp using openmpi

2009-09-23 Thread Rahul Nabar
On Tue, Aug 18, 2009 at 5:28 PM, Gerry Creager  wrote:
> Most of that bandwidth is in marketing...  Sorry, but it's not a high
> performance switch.

Well, how does one figure out what exactly is a "high performance
switch"? I've found this an exceedingly hard task. As the OP posted,
the Dell 6248 is rated to give more than fully subscribed backbone
capacity. I do not know of any good third-party test lab, nor do I
know of any switch load-testing benchmarks that'd take a switch through
its paces.

So, how does one go about selecting a good switch? "The more expensive
the better" is a somewhat unsatisfying option!

-- 
Rahul



Re: [OMPI users] Random hangs using btl sm with OpenMPI 1.3.2/1.3.3 + gcc4.4?

2009-09-23 Thread Eugene Loh

Jonathan Dursi wrote:


Continuing the conversation with myself:

Google pointed me to Trac ticket #1944, which spoke of deadlocks in 
looped collective operations; there is no collective operation 
anywhere in this sample code, but trying one of the suggested 
workarounds/clues (setting btl_sm_num_fifos to at least np-1) seems 
to make things work quite reliably for both OpenMPI 1.3.2 and 1.3.3. 
That is, while this


mpirun -np 6 -mca btl sm,self ./diffusion-mpi

invariably hangs (at random-seeming numbers of iterations) with 
OpenMPI 1.3.2 and sometimes hangs (maybe 10% of the time, again 
seemingly randomly) with 1.3.3,


mpirun -np 6 -mca btl tcp,self ./diffusion-mpi

or

mpirun -np 6 -mca btl_sm_num_fifos 5 -mca btl sm,self ./diffusion-mpi

always succeeds, with (as one might guess) the second being much 
faster...


The btl_sm_num_fifos thing doesn't on the surface make much sense to 
me.  That presumably controls the number of receive FIFOs per process.  
The default became 1, which could threaten to change behavior if 
multiple senders all send to the same FIFO.  But your sample program has 
just one-to-one connections.  Each receiver has only one sender.  So, 
the number of FIFOs shouldn't matter.  Bumping the number up only means 
you allocate some FIFOs that are never used.


Hmm.  Continuing the conversation with myself, maybe that's not entirely 
true.  Whatever fragments are sent by a process must be received back 
from the receiver.  So, a process receives not only messages from its 
left but also return fragments from its right.  Still, why would np-1 
FIFOs be needed?  Why not just 2?


And, as Jeff points out, everyone should be staying in pretty good sync 
with the Sendrecv pattern.  So, how could there be a problem at all?


Like Jeff, my attempts so far to reproduce the problem (with 
hardware/software conveniently accessible to me) have come up empty.
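
For anyone else trying to reproduce it: the pattern being described is just a 
nearest-neighbour ring of MPI_Sendrecv calls. A rough C approximation follows 
(the real diffusion-mpi is Fortran and isn't posted in this thread, so the 
buffer size and iteration count below are made up):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, np, iter;
    double send_val, recv_val = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    /* periodic neighbours: rank 0's left neighbour is rank np-1 */
    int left  = (rank - 1 + np) % np;
    int right = (rank + 1) % np;

    for (iter = 0; iter < 100000; iter++) {
        send_val = (double)(iter + rank);
        /* shift one value around the ring each iteration */
        MPI_Sendrecv(&send_val, 1, MPI_DOUBLE, right, 0,
                     &recv_val, 1, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    if (rank == 0)
        printf("done, last value received: %f\n", recv_val);
    MPI_Finalize();
    return 0;
}

Running that with "-mca btl sm,self" and various btl_sm_num_fifos settings 
should be close in spirit to the commands quoted above.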


Re: [OMPI users] very bad parallel scaling of vasp using openmpi

2009-09-23 Thread Joe Landman

Rahul Nabar wrote:

On Tue, Aug 18, 2009 at 5:28 PM, Gerry Creager  wrote:

Most of that bandwidth is in marketing...  Sorry, but it's not a high
performance switch.


Well, how does one figure out what exactly is a "high performance
switch"? I've found this an exceedingly hard task. As the OP posted,
the Dell 6248 is rated to give more than fully subscribed backbone
capacity. I do not know of any good third-party test lab, nor do I
know of any switch load-testing benchmarks that'd take a switch through
its paces.

So, how does one go about selecting a good switch? "The more expensive
the better" is a somewhat unsatisfying option!


There are several options.

1) research the switches, get the numbers, and then find/interview the 
people who use them.  See if the switches are as advertised.


2) hire a company to do the same for you or, more to the point, to generate 
a reasonable recommendation given your needs.


3) design a benchmark test, and try to run it against the switch.  The 
OSU tests from D. Panda could be used for switch testing as well as for 
HBA testing, with some simple adjustments (Panda's focus is mostly upon 
latency and bandwidth as a function of message size; you could instead 
fix the message size and measure bandwidth/throughput as a function of 
the number of workers).  A rough sketch of such a test appears below.


3 is likely the easiest for you to do.  2 is likely what you should do 
if you are designing a cluster and need an expert (unbiased) opinion.
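
To give an idea of what that option-3 adjustment could look like, here is an 
illustrative C sketch (not the OSU code itself) that uses a fixed message size 
and reports aggregate throughput as the number of simultaneously communicating 
pairs grows; run it with one rank per node, and an even number of ranks, so 
the traffic actually crosses the switch:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MSG_SIZE (1 << 20)   /* fixed 1 MiB messages */
#define REPS     100

int main(int argc, char **argv)
{
    int rank, np;
    char *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    buf = malloc(MSG_SIZE);
    memset(buf, 0, MSG_SIZE);

    /* pair rank i with rank i + np/2; turn pairs on one at a time so the
       switch sees 1, 2, 3, ... simultaneous streams (np assumed even) */
    for (int active = 1; active <= np / 2; active++) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();

        if (rank < active) {                                   /* senders */
            for (int r = 0; r < REPS; r++)
                MPI_Send(buf, MSG_SIZE, MPI_CHAR, rank + np / 2, 0,
                         MPI_COMM_WORLD);
        } else if (rank >= np / 2 && rank < np / 2 + active) { /* receivers */
            for (int r = 0; r < REPS; r++)
                MPI_Recv(buf, MSG_SIZE, MPI_CHAR, rank - np / 2, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        MPI_Barrier(MPI_COMM_WORLD);
        double t1 = MPI_Wtime();
        if (rank == 0) {
            double gb = (double)active * REPS * MSG_SIZE / 1e9;
            printf("%d pair(s): %.2f GB/s aggregate\n", active, gb / (t1 - t0));
        }
    }

    free(buf);
    MPI_Finalize();
    return 0;
}

A switch with a genuinely non-blocking backplane should show roughly flat 
per-pair bandwidth as pairs are added; one that oversubscribes its backplane 
will not.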


Unfortunately, as Gerry indicates, there is a great deal of what I call 
marketing numbers out there.  There isn't enough real data.  Marketing 
numbers seem good on the surface.  It's when you use the product that you 
discover the reality isn't as rosy.


We have found several good gigabit switches for HPC/MPI codes.  A number 
of our customers started out with the least expensive switch possible, 
and ran into backplane problems in the 20s of nodes, never mind the 
hundred-plus they needed to run on.



--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/jackrabbit
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615


Re: [OMPI users] Random hangs using btl sm with OpenMPI 1.3.2/1.3.3 + gcc4.4?

2009-09-23 Thread Jonathan Dursi

Hi, Eugene:

If it continues to be a problem for people to reproduce this, I'll see  
what can be done about having an account made here for someone to poke  
around.  Alternatively, any suggestions for tests that I can do to help  
diagnose/verify the problem, or to figure out what's different about this  
setup, would be greatly appreciated.


As regards the btl_sm_num_fifos thing, it could be a bit of a red herring;  
it's just something I started to use following one of the previous bug  
reports.  However, it changes the behaviour pretty markedly.  With  
the sample program I submitted (i.e., the sendrecvs looping around),  
and with OpenMPI 1.3.2 (the version where I see the most extreme  
problems, i.e. things fail every run), this always works


mpirun -np 6 -mca btl_sm_num_fifos 5 -mca btl sm,self ./diffusion-mpi

and other, larger values for num_fifos also seem to work reliably,  
but 4 or fewer


mpirun -np 6 -mca btl_sm_num_fifos 4 -mca btl sm,self ./diffusion-mpi

always hangs, as before - after some number of iterations, sometimes  
fewer, sometimes more, always somewhere in the MPI_Sendrecv:

(gdb) where
#0  0x2b9b0a661e80 in opal_progress@plt () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libmpi.so.0
#1  0x2b9b0a67e345 in ompi_request_default_wait () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libmpi.so.0
#2  0x2b9b0a6a42c0 in PMPI_Sendrecv () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libmpi.so.0
#3  0x2b9b0a43c540 in pmpi_sendrecv__ () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libmpi_f77.so.0
#4  0x00400eab in MAIN__ ()
#5  0x00400fda in main (argc=1, argv=0x7fffb92cc078) at ../../../gcc-4.4.0/libgfortran/fmain.c:21


On the other hand, if I set the leftmost and rightmost neighbours to  
MPI_PROC_NULL as Jeff requested, the behaviour changes; any number  
greater than two works


mpirun -np 6 -mca btl_sm_num_fifos 3 -mca btl sm,self ./diffusion-mpi

But the btl_sm_num_fifos 2  always hangs, either in the Sendrecv or in  
the Finalize


mpirun -np 6 -mca btl_sm_num_fifos 2 -mca btl sm,self ./diffusion-mpi

And the default always hangs, usually in the Finalize but sometimes in  
the Sendrecv.


mpirun -np 6 -mca btl sm,self ./diffusion-mpi
(gdb) where
#0  0x2ad54846d51f in poll () from /lib64/libc.so.6
#1  0x2ad54717a7c1 in poll_dispatch () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libopen-pal.so.0
#2  0x2ad547179659 in opal_event_base_loop () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libopen-pal.so.0
#3  0x2ad54716e189 in opal_progress () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libopen-pal.so.0
#4  0x2ad54931ef15 in barrier () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/openmpi/mca_grpcomm_bad.so
#5  0x2ad546ca358b in ompi_mpi_finalize () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libmpi.so.0
#6  0x2ad546a5d529 in pmpi_finalize__ () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libmpi_f77.so.0
#7  0x00400f99 in MAIN__ ()


So to summarize:

OpenMPI 1.3.2 + gcc4.4.0

Test problem with periodic (left neighbour of proc 0 is proc N-1) Sendrecv()s:

 Default always hangs in Sendrecv after random number of iterations
 Turning off sm (-mca btl self,tcp) not observed to hang
 Using -mca btl_sm_num_fifos 5 (for a 6 task job) not observed to hang
 Using fewer than 5 fifos hangs in Sendrecv after random number of iterations or Finalize


Test problem with non-periodic (left neighbour of proc 0 is MPI_PROC_NULL) Sendrecv()s:

 Default always hangs, in Sendrecv after random number of iterations or at Finalize
 Turning off sm (-mca btl self,tcp) not observed to hang
 Using -mca btl_sm_num_fifos 5 (for a 6 task job) not observed to hang
 Using fewer than 5 fifos but more than 2 not observed to hang
 Using 2 fifos hangs in Finalize or Sendrecv after random number of iterations


OpenMPI 1.3.3 + gcc4.4.0

Test problem with periodic (left neighbour of proc 0 is proc N-1) Sendrecv()s:

 Default sometimes (~20% of time) hangs in Sendrecv after random number of iterations
 Turning off sm (-mca btl self,tcp) not observed to hang
 Using -mca btl_sm_num_fifos 5 (for a 6 task job) not observed to hang
 Using fewer than 5 fifos but more than 2 not observed to hang
 Using 2 fifos sometimes (~20% of time) hangs in Finalize or Sendrecv after random number of iterations but sometimes completes


Test problem with non-periodic (left neighbour of proc 0 is MPI_PROC_NULL) Sendrecv()s:

 Default usually (~75% of time) hangs, in Finalize or in Sendrecv after random number of iterations
 Turning off sm (-mca btl self,tcp) not observed to hang
 Using -mca btl_sm_num_fifos 5 (for a 6 task job) not observed to hang
 Using fewer than 5 fifos but more than 2 not observed to hang
 Using 2 fifos usually (~75% of time) hangs in Finalize or Sendrecv after random number of iterations but sometimes completes
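
(For clarity, the only difference between the "periodic" and "non-periodic" 
variants above is how the edge neighbours are chosen; roughly, in C terms, 
since the actual test is Fortran:)

#include <mpi.h>

/* neighbour selection for the two variants; MPI_Sendrecv treats a
   MPI_PROC_NULL partner as a no-op, so the end ranks simply sit out
   that half of the exchange */
static void neighbours(int rank, int np, int periodic, int *left, int *right)
{
    if (periodic) {
        *left  = (rank - 1 + np) % np;   /* rank 0 wraps to np-1 */
        *right = (rank + 1) % np;
    } else {
        *left  = (rank == 0)      ? MPI_PROC_NULL : rank - 1;
        *right = (rank == np - 1) ? MPI_PROC_NULL : rank + 1;
    }
}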


Ope

Re: [OMPI users] Changing location where checkpoints are saved

2009-09-23 Thread Josh Hursey

This is described in the C/R User's Guide attached to the webpage below:
  https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR

Additionally this has been addressed on the users mailing list in the  
past, so searching around will likely turn up some examples.
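
If memory serves, the setting involved is the snapc framework's global 
snapshot directory MCA parameter, so something along these lines should do it 
(please double-check the parameter name against the guide above; ./my_app 
stands in for your application):

mpirun -np 4 -am ft-enable-cr -mca snapc_base_global_snapshot_dir /tmp ./my_app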


-- Josh

On Sep 18, 2009, at 11:58 AM, Constantinos Makassikis wrote:


Dear all,

I have installed blcr 0.8.2 and Open MPI (r21973) on my NFS account.
By default, it seems that checkpoints are saved in $HOME. However, I would
prefer them to be saved on a local disk (e.g.: /tmp).

Does anyone know how I can change the location where Open MPI saves  
checkpoints?



Best regards,

--
Constantinos




Re: [OMPI users] fault tolerance in open mpi

2009-09-23 Thread Josh Hursey
Unfortunately I cannot provide a precise time frame for availability  
at this point, but we are targeting the v1.5 release series. There are  
a handful of core developers working on this issue at the moment.  
Pieces of this work have already made it into the Open MPI  
development trunk. If you want to play around with what is available,  
try turning on the resilient mapper:

  -mca rmaps resilient
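
For example (with ./a.out standing in for your application):

mpirun -np 4 -mca rmaps resilient ./a.out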

We will be sure to email the list once this work becomes more stable  
and available.


-- Josh

On Sep 18, 2009, at 2:56 AM, vipin kumar wrote:


Hi Josh,

It is good to hear from you that work is in progress towards  
resiliency in Open-MPI. I was, and am, waiting for this capability  
in Open-MPI. I have almost finished my development work and am waiting  
for this to happen so that I can test my programs. It would be good  
if you could tell how long it will take to make Open-MPI a resilient  
implementation. Here, by resiliency I mean that abnormal termination or  
intentionally killing a process should not cause any (parent or  
sibling) process to be terminated, given that the processes are connected.


thanks.

Regards,

On Mon, Aug 3, 2009 at 8:37 PM, Josh Hursey   
wrote:
Task-farm or manager/worker recovery models typically depend on  
intercommunicators (i.e., from MPI_Comm_spawn) and a resilient MPI  
implementation. William Gropp and Ewing Lusk have a paper entitled  
"Fault Tolerance in MPI Programs" that outlines how an application  
might take advantage of these features in order to recover from  
process failure.
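
Very roughly, the manager side of that pattern has the shape sketched below 
(C, illustrative only: the "worker" executable name, the task counts, and the 
simple round-robin assignment are placeholders, and whether the error-return 
path really survives a dead worker is exactly the implementation-dependent 
part discussed next):

#include <mpi.h>
#include <stdio.h>

#define NWORKERS 4
#define NTASKS   16

int main(int argc, char **argv)
{
    MPI_Comm workers;
    MPI_Status status;
    char worker_cmd[] = "worker";   /* placeholder worker executable */
    int task, result, rc;

    MPI_Init(&argc, &argv);

    /* spawn the workers; the intercommunicator is the only link to them */
    MPI_Comm_spawn(worker_cmd, MPI_ARGV_NULL, NWORKERS, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &workers, MPI_ERRCODES_IGNORE);

    /* ask for error codes instead of job abort on communication failure */
    MPI_Comm_set_errhandler(workers, MPI_ERRORS_RETURN);

    for (task = 0; task < NTASKS; task++) {
        int dest = task % NWORKERS;
        rc = MPI_Send(&task, 1, MPI_INT, dest, 0, workers);
        if (rc != MPI_SUCCESS) {
            fprintf(stderr, "worker %d unreachable; task %d needs reassigning\n",
                    dest, task);
            continue;   /* real code would hand the task to a live worker */
        }
        rc = MPI_Recv(&result, 1, MPI_INT, dest, 0, workers, &status);
        if (rc != MPI_SUCCESS) {
            fprintf(stderr, "worker %d appears to have failed\n", dest);
            continue;
        }
        printf("task %d -> result %d\n", task, result);
    }

    MPI_Finalize();
    return 0;
}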


However, these techniques strongly depend upon resilient MPI  
implementations, and behaviors that, some may argue, are non-standard.  
Unfortunately there are not many MPI implementations that  
are sufficiently resilient in the face of process failure to support  
failure in task-farm scenarios. Though Open MPI supports the current  
MPI 2.1 standard, it is not as resilient to process failure as it  
could be.


There are a number of people working on improving the resiliency of  
Open MPI in the face of network and process failure (including  
myself). We have started to move some of the resiliency work into  
the Open MPI trunk. Resiliency in Open MPI has been improving over  
the past few months, but I would not assess it as ready quite yet.  
Most of the work has focused on the runtime level (ORTE), and there  
are still some MPI level (OMPI) issues that need to be worked out.


With all of that being said, I would try some of the techniques  
presented in the Gropp/Lusk paper in your application. Then test it  
with Open MPI and let us know how it goes.


Best,
Josh


On Aug 3, 2009, at 10:30 AM, Durga Choudhury wrote:

Is that kind of approach possible within an MPI framework? Perhaps a
grid approach would be better. More experienced people, speak up,
please?
(The reason I say that is that I too am interested in the solution of
that kind of problem, where an individual blade of a blade server
fails, and correcting for that failure on the fly is better than taking
checkpoints and restarting the whole process excluding the failed
blade.)

Durga

On Mon, Aug 3, 2009 at 9:21 AM, jody wrote:
Hi

I guess "task-farming" could give you a certain amount of the kind of
fault-tolerance you want.
(i.e. a master process distributes tasks to idle slave processors -
however, this will only work
if the slave processes don't need to communicate with each other)

Jody


On Mon, Aug 3, 2009 at 1:24 PM, vipin kumar  
wrote:

Hi all,

Thanks Durga for your reply.

Jeff, you once wrote code for the Mandelbrot set to demonstrate fault  
tolerance in LAM-MPI, i.e. killing any slave process doesn't affect the  
others. That is exactly the behaviour I am looking for in Open MPI. I  
attempted it, but had no luck. Can you please tell me how to write such  
programs in Open MPI?


Thanks in advance.

Regards,
On Thu, Jul 9, 2009 at 8:30 PM, Durga Choudhury   
wrote:


Although I have perhaps the least experience on the topic in this
list, I will take a shot; more experienced people, please correct me:

MPI standards specify communication mechanisms, not fault tolerance at
any level. You may achieve network tolerance at the IP level by
implementing 'equal cost multipath' routes (which means two equally
capable NIC cards connecting to the same destination and modifying the
kernel routing table to use both cards; the kernel will dynamically
load balance). At the MAC level, you can achieve the same effect by
trunking multiple network cards.

You can achieve process-level fault tolerance with a checkpointing
scheme such as BLCR, which has been tested to work with Open MPI (and
with other software as well).
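
With Open MPI's BLCR support the basic flow is roughly as follows (option and
command names from memory; the C/R documentation is authoritative):

mpirun -np 4 -am ft-enable-cr ./my_app
ompi-checkpoint <PID of mpirun>
ompi-restart <snapshot handle reported by ompi-checkpoint>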

Durga

On Thu, Jul 9, 2009 at 4:57 AM, vipin kumar  
wrote:


Hi all,

I want to know whether Open MPI supports network and process fault
tolerance or not. If there is any example demonstrating these features,
that would be best.

Regards,
--
Vipin K.
Research Engineer,
C-DOTB, India


Re: [OMPI users] error in ompi-checkpoint

2009-09-23 Thread Josh Hursey

How did you configure Open MPI? Is your application using SIGUSR1?

This error message indicates that Open MPI's daemons could not  
communicate with the application processes. The daemons send SIGUSR1  
to the process to initiate the handshake (you can change this signal  
with -mca opal_cr_signal). If your application does not respond to the  
daemon within a time bound (default 20 sec, though you can change it  
with -mca snapc_full_max_wait_time) then this error is printed, and  
the checkpoint is aborted.
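
For instance, if your code needs SIGUSR1 for itself, something along these  
lines moves the checkpoint signal and lengthens the timeout (the signal  
number and timeout here are just example values):

mpirun -np 4 -am ft-enable-cr -mca opal_cr_signal 12 -mca snapc_full_max_wait_time 60 ./my_app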


-- Josh


On Sep 22, 2009, at 1:43 AM, Mallikarjuna Shastry wrote:









[OMPI users] switch and NIC performance (was: very bad parallel scaling of vasp using openmpi)

2009-09-23 Thread Dave Love
Rahul Nabar  writes:

> So, how does one go about selecting a good switch? "The more expensive
> the better" is a somewhat unsatisfying option!

Also it's apparently not always right, if I recall correctly, according
to the figures on MPI switch performance in the reports somewhere
under http://www.cse.dl.ac.uk/disco.¹  The benchmark database there may
also be relevant.  I think they include OMPI figures, to bring this
vaguely on topic.

If you're concerned about latency -- which you may be for VASP? -- then
the NICs and their settings are more important than the switch.
(Obviously use Open-MX, not TCP, too.)  See
 for figures on the NICs I could test.
I haven't seen better GigE ping-pong results than those, but would be
interested to.

---
¹ Not whatever the currently-politically-correct URL is, but it still
  works (except if you turn off Javascript, sigh).



Re: [OMPI users] switch and NIC performance (was: very bad parallel scaling of vasp using openmpi)

2009-09-23 Thread Jeff Squyres

On Sep 23, 2009, at 10:15 AM, Dave Love wrote:

So, how does one go about selecting a good switch? "The more expensive
the better" is a somewhat unsatisfying option!


Also it's apparently not always right


+1 on Dave's and Joe's comments.

For example, not all of Cisco's switches are suitable for "ultra" HPC  
clusters.  Cisco has some very expensive switches whose goals are very  
definitely not the same as what ultra HPC clusters typically need.   
They're great switches (ok, I'm a bit biased ;-) ), but they're not  
what you would need for an ultra HPC cluster.  Buying one of these  
would be kind of like buying an F-350 truck instead of an F1 formula  
race car; both are excellent at their respective tasks, but they're  
very different tasks.


My point: a network switch != a network switch != a network switch.   
Make sure you understand what workloads and tasks the network switch  
was designed for; don't just rely on published spec numbers -- they  
don't tell the full story.  Both an F1 and an F-350 can go 100 mph --  
but they get there in very different ways.


--
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI users] very bad parallel scaling of vasp using openmpi

2009-09-23 Thread Peter Kjellstrom
On Wednesday 23 September 2009, Rahul Nabar wrote:
> On Tue, Aug 18, 2009 at 5:28 PM, Gerry Creager  
wrote:
> > Most of that bandwidth is in marketing...  Sorry, but it's not a high
> > performance switch.
>
> Well, how does one figure out what exactly is a "high performance
> switch"?

IMHO 1G Ethernet won't be enough ("high performance" or not). Get yourself 
some cheap IB HCAs and a switch. The only chance you have with Ethernet is to 
run some sort of kernel-bypass protocol (Open-MX, etc.) and tune your NICs.

/Peter

> I've found this an exceedingly hard task. As the OP posted,
> the Dell 6248 is rated to give more than fully subscribed backbone
> capacity. I do not know of any good third-party test lab, nor do I
> know of any switch load-testing benchmarks that'd take a switch through
> its paces.
>
> So, how does one go about selecting a good switch? "The more expensive
> the better" is a somewhat unsatisfying option!




Re: [OMPI users] Random hangs using btl sm with OpenMPI 1.3.2/1.3.3 + gcc4.4?

2009-09-23 Thread Eugene Loh

Jonathan Dursi wrote:


Continuing the conversation with myself:


Sorry to interrupt...  :^)

Okay, I managed to reproduce the hang.  I'll try to look at this.



Google pointed me to Trac ticket #1944, which spoke of deadlocks in 
looped collective operations; there is no collective operation 
anywhere in this sample code, but trying one of the suggested 
workarounds/clues (setting btl_sm_num_fifos to at least np-1) seems 
to make things work quite reliably for both OpenMPI 1.3.2 and 1.3.3. 
That is, while this


mpirun -np 6 -mca btl sm,self ./diffusion-mpi

invariably hangs (at random-seeming numbers of iterations) with 
OpenMPI 1.3.2 and sometimes hangs (maybe 10% of the time, again 
seemingly randomly) with 1.3.3,


mpirun -np 6 -mca btl tcp,self ./diffusion-mpi

or

mpirun -np 6 -mca btl_sm_num_fifos 5 -mca btl sm,self ./diffusion-mpi

always succeeds, with (as one might guess) the second being much 
faster...


Jonathan





Re: [OMPI users] switch and NIC performance (was: very bad parallel scaling of vasp using openmpi)

2009-09-23 Thread Jeff Squyres
Wow; I should point out an amazing coincidence here.  Doug Eadline  
used [almost] exactly the same analogy that I did (truck vs. F1) in a  
column that was published today in Linux Magazine:


http://www.linux-mag.com/id/7534

I swear I didn't read his column before I posted my answer this morning!

:-)


On Sep 23, 2009, at 10:38 AM, Jeff Squyres (jsquyres) wrote:


On Sep 23, 2009, at 10:15 AM, Dave Love wrote:

>> So, how does one go about selecting a good switch? "The more expensive
>> the better" is a somewhat unsatisfying option!
>
> Also it's apparently not always right

+1 on Dave's and Joe's comments.

For example, not all of Cisco's switches are suitable for "ultra" HPC
clusters.  Cisco has some very expensive switches whose goals are very
definitely not the same as what ultra HPC clusters typically need.
They're great switches (ok, I'm a bit biased ;-) ), but they're not
what you would need for an ultra HPC cluster.  Buying one of these
would be kind of like buying an F-350 truck instead of an F1 formula
race car; both are excellent at their respective tasks, but they're
very different tasks.

My point: a network switch != a network switch != a network switch.
Make sure you understand what workloads and tasks the network switch
was designed for; don't just rely on published spec numbers -- they
don't tell the full story.  Both an F1 and an F-350 can go 100 mph --
but they get there in very different ways.

--
Jeff Squyres
jsquy...@cisco.com





--
Jeff Squyres
jsquy...@cisco.com