On Wed, Aug 25, 2010 at 12:14 PM, Jeff Squyres wrote:
> It would simplify testing if you could get all the eth0's to be of one type
> and on the same subnet, and the same for eth1.
>
> Once you do that, try using just one of the networks by telling OMPI to use
> only one of the devices, somethin
On Wed, Aug 25, 2010 at 12:14 PM, Jeff Squyres wrote:
> Once you do that, try using just one of the networks by telling OMPI to use
> only one of the devices, something like this:
>
> mpirun --mca btl_tcp_if_include eth0 ...
Thanks Jeff! Just tried the exact test that you suggested.
[rpnabar
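For reference, the same restriction can also be expressed as an exclude list; a sketch, with the process count and benchmark binary as placeholders:

mpirun --mca btl_tcp_if_include eth0 -np 256 ./IMB-MPI1 bcast
mpirun --mca btl_tcp_if_exclude lo,eth1 -np 256 ./IMB-MPI1 bcast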
On Wed, Aug 25, 2010 at 6:41 AM, John Hearns wrote:
> You could sort that out with udev rules on each machine.
Sure. I'd always wanted consistent names for the eth interfaces when I
set up the cluster but I couldn't get udev to co-operate. Maybe this
time! Let me try.
> Look in the directory /et
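A minimal sketch of such a rule, keyed on MAC address (the addresses below are placeholders, and whether the key is ATTR{address} or SYSFS{address} depends on the udev version), e.g. in /etc/udev/rules.d/70-persistent-net.rules:

SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="00:15:17:aa:bb:01", NAME="eth0"
SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="00:15:17:aa:bb:02", NAME="eth1"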
On Thu, Aug 19, 2010 at 9:03 PM, Rahul Nabar wrote:
> --
> gather:
> NP256 hangs
> NP128 hangs
> NP64 hangs
> NP32 OK
>
> Note: "gather" always hangs at the followin
On Tue, Aug 24, 2010 at 4:58 PM, Jeff Squyres wrote:
> Are all the eth0's on one subnet and all the eth2's on a different subnet?
>
> Or are all eth0's and eth2's all on the same subnet?
Thanks Jeff! Different subnets. All 10GigE's are on 192.168.x.x and
all 1GigE's are on 10.0.x.x
e.g.
On Mon, Aug 23, 2010 at 9:43 PM, Richard Treumann wrote:
> Bugs are always a possibility but unless there is something very unusual
> about the cluster and interconnect or this is an unstable version of MPI, it
> seems very unlikely this use of MPI_Bcast with so few tasks and only a 1/2
> MB messa
On Mon, Aug 23, 2010 at 9:43 PM, Richard Treumann wrote:
> Bugs are always a possibility but unless there is something very unusual
> about the cluster and interconnect or this is an unstable version of MPI, it
My MPI version is 1.4.1. This isn't the latest, but it's still fairly
recent. So I assume th
On Mon, Aug 23, 2010 at 8:39 PM, Randolph Pullen wrote:
>
> I have had a similar load related problem with Bcast.
Thanks Randolph! That's interesting to know! What was the hardware you
were using? Does your bcast fail at the exact same point too?
>
> I don't know what caused it though. With thi
On Mon, Aug 23, 2010 at 6:39 PM, Richard Treumann wrote:
> It is hard to imagine how a total data load of 41,943,040 bytes could be a
> problem. That is really not much data. By the time the BCAST is done, each
> task (except root) will have received a single half meg message from one
> sender. Th
On Sun, Aug 22, 2010 at 9:57 PM, Randolph Pullen <randolph_pul...@yahoo.com.au> wrote:
> It's a long shot, but could it be related to the total data volume?
> i.e. 524288 * 80 = 41943040 bytes active in the cluster
>
> Can you exceed this 41943040 data volume with a smaller message repeated
> more
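If it helps, IMB can be forced to a smaller fixed message size repeated over many iterations with something like the line below; the flags are per the IMB documentation (availability depends on the IMB version) and the counts are only illustrative:

mpirun -np 80 ./IMB-MPI1 -npmin 80 -msglog 16:16 -iter 1000 bcast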
My Intel IMB-MPI tests stall, but only in very specific cases: larger
packet sizes + large core counts. It only happens for the bcast, gather,
and exchange tests, and only at the larger core counts (~256 cores).
Other tests like pingpong and sendrecv run fine even with larger core
counts.
e.g. This bcast tes
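For the record, the stalling case can be reproduced with just the bcast benchmark pinned to the larger sizes, along these lines (the binary path is a placeholder and the flags are per the IMB docs):

mpirun -np 256 ./IMB-MPI1 -npmin 256 -msglog 19:22 bcast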
I'm not sure if this is a Torque issue or an MPI issue. If I log in to
a compute-node and run the standard MPI broadcast test it returns no
error, but if I run it through PBS/Torque I get an error (see below).
The nodes that return the error are fairly random. Even the same set
of nodes will run a t
I have compute-nodes with twin eth interfaces 1GigE and 10GigE. In the
OpenMPI docs I found an instruction:
" It is therefore very important that if active ports on the same host
are on physically separate fabrics, they must have different subnet
IDs."
Is this the same "subnet" that is set via an
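A quick way to see which subnet each interface actually sits on, assuming the iproute2 tools are installed on the nodes:

ip -o -4 addr show | awk '{print $2, $4}'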
On Sat, May 29, 2010 at 8:19 AM, Ralph Castain wrote:
>
> From your other note, it sounds like #3 might be the problem here. Do you
> have some nodes that are configured with "eth0" pointing to your 10.x
> network, and other nodes with "eth0" pointing to your 192.x network? I have
> found
Each of our servers has twin eth cards: 1GigE and 10GigE. How does
OpenMPI decide which card to use when sending messages? One of the
cards is on a 10.0.x.x IP subnet whereas the other cards are on a
192.168.x.x subnet. Can I select one or the other by specifying the
--host option with
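One related note: the interface choice can also be made explicit instead of relying on --host; a sketch, with eth2 and the binary as placeholders:

ompi_info --param btl tcp | grep if_include
mpirun --mca btl_tcp_if_include eth2 -np 16 ./a.out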
On Fri, May 28, 2010 at 3:53 PM, Ralph Castain wrote:
> What environment are you running on the cluster, and what version of OMPI?
> Not sure that error message is coming from us.
openmpi-1.4.1
The cluster runs PBS/Torque, so I guess that could be the other error source.
--
Rahul
Often when I try to run larger jobs on our cluster I get an error of
this sort from some of the compute-servers:
eu260 - daemon did not report back when launched
It does not happen every time, but pretty often. Any ideas what could
be wrong? The node seems pingable and I could log in suc
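One way to get more detail on why a daemon never reports back is to keep the daemons' output attached and turn up the launcher verbosity; a sketch, with the binary and process count as placeholders:

mpirun --debug-daemons --leave-session-attached --mca plm_base_verbose 5 -np 256 ./a.out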
I have already been using the processor and memory affinity options to
bind the processes to specific cores. Does the presence of the
irqbalance daemon matter? I saw some recommendations to disable it
for a performance boost. Or is this irrelevant?
I am running HPC jobs with no over- nor under-su
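On the irqbalance question, a quick A/B test is simply to stop the daemon on the compute nodes and rerun; this assumes a RHEL/CentOS-style init:

/sbin/service irqbalance stop     # stop it for the current boot
/sbin/chkconfig irqbalance off    # keep it from starting on reboot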
On Wed, Sep 30, 2009 at 1:34 AM, Anthony Chan wrote:
> ./configure CC=icc F77=ifort
> MPI_CC=/usr/local/ompi-ifort/bin/mpicc
> MPI_F77=/usr/local/ompi-ifort/bin/mpif77
> --prefix=..
>
> Let me know how it goes.
>
> A.Chan
Thanks! Your command line worked perfectly! :)
--
Rahul
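Once MPE is built this way, logging a run is roughly as follows; the wrapper and viewer names are from the MPE docs, and the install path and program name are placeholders:

/usr/local/mpe/bin/mpecc -mpilog -o my_prog my_prog.c
mpirun -np 8 ./my_prog     # writes my_prog.clog2 when the run finishes
jumpshot my_prog.clog2     # view the timeline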
On Wed, Sep 30, 2009 at 3:16 PM, Peter Kjellstrom wrote:
> Not MPI-aware, but you could watch network traffic with a tool such as
> collectl in real time.
collectl is a great idea. I am going to try that now.
--
Rahul
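For instance, per-second network rates during a run can be watched with something like:

collectl -sn -oT -i 1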
On Tue, Sep 29, 2009 at 1:33 PM, Anthony Chan wrote:
>
> Rahul,
>
>
> What errors did you see when compiling MPE for OpenMPI ?
> Can you send me the configure and make outputs as seen on
> your terminal ? ALso, what version of MPE are you using
> with OpenMPI ?
Version: mpe2-1.0.6p1
./configur
On Tue, Sep 29, 2009 at 10:40 AM, Eugene Loh wrote:
> to know. It sounds like you want to be able to watch some % utilization of
> a hardware interface as the program is running. I *think* these tools (the
> ones on the FAQ, including MPE, Vampir, and Sun Studio) are not of that
> class.
You ar
I have a code that seems to run about 40% faster when I bond together
twin eth interfaces. The question, of course, arises: is it really
producing enough traffic to keep twin 1 Gig eth interfaces busy? I
don't really believe this, but I need a way to check.
What are good tools to monitor the MPI pe
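Even without an MPI-aware tool, the raw interface rates during a run can be sanity-checked with sysstat, e.g.:

sar -n DEV 1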
On Tue, Aug 18, 2009 at 5:28 PM, Gerry Creager wrote:
> Most of that bandwidth is in marketing... Sorry, but it's not a high
> performance switch.
Well, how does one figure out what exactly is a "high performance
switch"? I've found this an exceedingly hard task. Like the OP posted
the Dell 6248
On Wed, Apr 1, 2009 at 1:13 AM, Ralph Castain wrote:
> So I gather that by "direct" you mean that you don't get an allocation from
> Maui before running the job, but for the other you do? Otherwise, OMPI
> should detect that it is running under Torque and automatically use the
> Torque launche
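A quick way to confirm that a given OMPI build actually has the Torque/TM support compiled in is to look for the tm components:

ompi_info | grep " tm "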
2009/3/31 Ralph Castain :
> I have no idea why your processes are crashing when run via Torque - are you
> sure that the processes themselves crash? Are they segfaulting - if so, can
> you use gdb to find out where?
I have to admit I'm a newbie with gdb. I am trying to recompile my
code as "ifort
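A common recipe here is to rebuild with debugging symbols, allow core dumps in the job environment, and open the resulting core in gdb; the compiler flags are standard ifort options and the file names below are placeholders:

ifort -g -traceback -O0 ...       # rebuild with symbols, optimization off
ulimit -c unlimited               # in the shell/job script before mpirun
gdb ./my_prog core.12345          # then type: bt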
2009/3/31 Ralph Castain :
> It is very hard to debug the problem with so little information. We
> regularly run OMPI jobs on Torque without issue.
Another small thing that I noticed. Not sure if it is relevant.
When the job starts running there is an orte process. The args to this
process are sli
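Those daemon arguments can be captured from a node while the job is running, e.g.:

ps -ef | grep orted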
2009/3/31 Ralph Castain :
>
> Information would be most helpful - the information we really need is
> specified here: http://www.open-mpi.org/community/help/
Output of "ompi_info --all" is attached in a file.
echo $LD_LIBRARY_PATH
/usr/local/ompi-ifort/lib:/opt/intel/fce/10.1.018/lib:/opt/intel
2009/3/31 Ralph Castain :
> It is very hard to debug the problem with so little information. We
Thanks Ralph! I'm sorry my first post lacked enough specifics. I'll
try my best to fill you guys in on as much debug info as I can.
> regularly run OMPI jobs on Torque without issue.
So do we. In fac
I've a strange OpenMPI/Torque problem while trying to run a job on our
Opteron-SC-1435 based cluster:
Each node has 8 CPUs.
If I go to a node and run like this, the job works:
mpirun -np 6 ${EXE_PATH}/${DACAPOEXE_PAR} ${ARGS}
If I submit the same job through PBS/Torque, it starts running but
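For comparison, a minimal submission script of the kind in use here might look like the following; the job name and resource line are placeholders for the real layout, and with TM support mpirun should pick the node list up from Torque itself:

#PBS -l nodes=1:ppn=8
#PBS -N dacapo_test
cd $PBS_O_WORKDIR
mpirun ${EXE_PATH}/${DACAPOEXE_PAR} ${ARGS}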