On Wed, Aug 25, 2010 at 12:14 PM, Jeff Squyres wrote:
> It would simplify testing if you could get all the eth0's to be of one type
> and on the same subnet, and the same for eth1.
>
> Once you do that, try using just one of the networks by telling OMPI to use
> only one of the devices, somethin
On Wed, Aug 25, 2010 at 12:14 PM, Jeff Squyres wrote:
> Once you do that, try using just one of the networks by telling OMPI to use
> only one of the devices, something like this:
>
> mpirun --mca btl_tcp_if_include eth0 ...
Thanks Jeff! Just tried the exact test that you suggested.
[rpnabar
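For reference, the same restriction can also be expressed as an exclude list; a sketch, with the process count and benchmark binary as placeholders:

mpirun --mca btl_tcp_if_include eth0 -np 256 ./IMB-MPI1 bcast
mpirun --mca btl_tcp_if_exclude lo,eth1 -np 256 ./IMB-MPI1 bcast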
On Wed, Aug 25, 2010 at 6:41 AM, John Hearns wrote:
> You could sort that out with udev rules on each machine.
Sure. I'd always wanted consistent names for the eth interfaces when I
set up the cluster but I couldn't get udev to co-operate. Maybe this
time! Let me try.
> Look in the directory /et
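A minimal sketch of such a rule, keyed on MAC address (the addresses below are placeholders, and whether the key is ATTR{address} or SYSFS{address} depends on the udev version), e.g. in /etc/udev/rules.d/70-persistent-net.rules:

SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="00:15:17:aa:bb:01", NAME="eth0"
SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="00:15:17:aa:bb:02", NAME="eth1"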
On Thu, Aug 19, 2010 at 9:03 PM, Rahul Nabar wrote:
> --
> gather:
> NP256 hangs
> NP128 hangs
> NP64 hangs
> NP32 OK
>
> Note: "gather" always hangs at the followin
On Tue, Aug 24, 2010 at 4:58 PM, Jeff Squyres wrote:
> Are all the eth0's on one subnet and all the eth2's on a different subnet?
>
> Or are all eth0's and eth2's all on the same subnet?
Thanks Jeff! Different subnets. All 10GigE's are on 192.168.x.x and
all 1GigE's are on 10.0.x.x
e.g.
On Mon, Aug 23, 2010 at 9:43 PM, Richard Treumann wrote:
> Bugs are always a possibility but unless there is something very unusual
> about the cluster and interconnect or this is an unstable version of MPI, it
> seems very unlikely this use of MPI_Bcast with so few tasks and only a 1/2
> MB messa
On Mon, Aug 23, 2010 at 9:43 PM, Richard Treumann wrote:
> Bugs are always a possibility but unless there is something very unusual
> about the cluster and interconnect or this is an unstable version of MPI, it
My MPI version is 1.4.1. This isn't the latest, but it's still fairly
recent. So I assume th
On Mon, Aug 23, 2010 at 8:39 PM, Randolph Pullen wrote:
>
> I have had a similar load related problem with Bcast.
Thanks Randolph! That's interesting to know! What was the hardware you
were using? Does your bcast fail at the exact same point too?
>
> I don't know what caused it though. With thi
On Mon, Aug 23, 2010 at 6:39 PM, Richard Treumann wrote:
> It is hard to imagine how a total data load of 41,943,040 bytes could be a
> problem. That is really not much data. By the time the BCAST is done, each
> task (except root) will have received a single half meg message from one
> sender. Th
On Sun, Aug 22, 2010 at 9:57 PM, Randolph Pullen <randolph_pul...@yahoo.com.au> wrote:
> It's a long shot, but could it be related to the total data volume?
> i.e. 524288 * 80 = 41943040 bytes active in the cluster
>
> Can you exceed this 41943040 data volume with a smaller message repeated
> more
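If it helps, IMB can be forced to a smaller fixed message size repeated over many iterations with something like the line below; the flags are per the IMB documentation (availability depends on the IMB version) and the counts are only illustrative:

mpirun -np 80 ./IMB-MPI1 -npmin 80 -msglog 16:16 -iter 1000 bcast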
My Intel IMB-MPI tests stall, but only in very specific cases: larger
packet sizes + large core counts. It only happens for the bcast, gather,
and exchange tests, and only at the larger core counts (~256 cores).
Other tests like pingpong and sendrecv run fine even with larger core
counts.
e.g. This bcast tes
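For the record, the stalling case can be reproduced with just the bcast benchmark pinned to the larger sizes, along these lines (the binary path is a placeholder and the flags are per the IMB docs):

mpirun -np 256 ./IMB-MPI1 -npmin 256 -msglog 19:22 bcast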
I'm not sure if this is a Torque issue or an MPI issue. If I log in to
a compute-node and run the standard MPI broadcast test it returns no
error, but if I run it through PBS/Torque I get an error (see below).
The nodes that return the error are fairly random. Even the same set
of nodes will run a t
I have compute-nodes with twin eth interfaces 1GigE and 10GigE. In the
OpenMPI docs I found an instruction:
" It is therefore very important that if active ports on the same host
are on physically separate fabrics, they must have different subnet
IDs."
Is this the same "subnet" that is set via an
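A quick way to see which subnet each interface actually sits on, assuming the iproute2 tools are installed on the nodes:

ip -o -4 addr show | awk '{print $2, $4}'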
On Sat, May 29, 2010 at 8:19 AM, Ralph Castain wrote:
>
> From your other note, it sounds like #3 might be the problem here. Do you
> have some nodes that are configured with "eth0" pointing to your 10.x
> network, and other nodes with "eth0" pointing to your 192.x network? I have
> found
Each of our servers has twin eth cards: 1GigE and 10GigE. How does
OpenMPI decide which card to use when sending messages? One of the
cards is on a 10.0.x.x IP subnet whereas the other cards are on a
192.168.x.x subnet. Can I select one or the other by specifying the
--host option with
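One related note: the interface choice can also be made explicit instead of relying on --host; a sketch, with eth2 and the binary as placeholders:

ompi_info --param btl tcp | grep if_include
mpirun --mca btl_tcp_if_include eth2 -np 16 ./a.out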
On Fri, May 28, 2010 at 3:53 PM, Ralph Castain wrote:
> What environment are you running on the cluster, and what version of OMPI?
> Not sure that error message is coming from us.
openmpi-1.4.1
The cluster runs PBS/Torque, so I guess that could be the other error source.
--
Rahul
Often when I try to run larger jobs on our cluster I get an error of
this sort from some of the compute-servers:
eu260 - daemon did not report back when launched
It does not happen every time, but pretty often. Any ideas what could
be wrong? The node seems pingable and I could log in suc
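One way to get more detail on why a daemon never reports back is to keep the daemons' output attached and turn up the launcher verbosity; a sketch, with the binary and process count as placeholders:

mpirun --debug-daemons --leave-session-attached --mca plm_base_verbose 5 -np 256 ./a.out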
I have already been using the processor and memory affinity options to
bind the processes to specific cores. Does the presence of the
irqbalance daemon matter? I saw some recommendations to disable it
for a performance boost. Or is this irrelevant?
I am running HPC jobs with no over- nor under-su
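On the irqbalance question, a quick A/B test is simply to stop the daemon on the compute nodes and rerun; this assumes a RHEL/CentOS-style init:

/sbin/service irqbalance stop     # stop it for the current boot
/sbin/chkconfig irqbalance off    # keep it from starting on reboot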
On Wed, Sep 30, 2009 at 1:34 AM, Anthony Chan wrote:
> ./configure CC=icc F77=ifort
> MPI_CC=/usr/local/ompi-ifort/bin/mpicc
> MPI_F77=/usr/local/ompi-ifort/bin/mpif77
> --prefix=..
>
> Let me know how it goes.
>
> A.Chan
Thanks! Your command line worked perfectly! :)
--
Rahul
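Once MPE is built this way, logging a run is roughly as follows; the wrapper and viewer names are from the MPE docs, and the install path and program name are placeholders:

/usr/local/mpe/bin/mpecc -mpilog -o my_prog my_prog.c
mpirun -np 8 ./my_prog     # writes my_prog.clog2 when the run finishes
jumpshot my_prog.clog2     # view the timeline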
On Wed, Sep 30, 2009 at 3:16 PM, Peter Kjellstrom wrote:
> Not MPI-aware, but you could watch network traffic with a tool such as
> collectl in real time.
collectl is a great idea. I am going to try that now.
--
Rahul
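For instance, per-second network rates during a run can be watched with something like:

collectl -sn -oT -i 1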
On Tue, Sep 29, 2009 at 1:33 PM, Anthony Chan wrote:
>
> Rahul,
>
>
> What errors did you see when compiling MPE for OpenMPI ?
> Can you send me the configure and make outputs as seen on
> your terminal ? ALso, what version of MPE are you using
> with OpenMPI ?
Version: mpe2-1.0.6p1
./configur
On Tue, Sep 29, 2009 at 10:40 AM, Eugene Loh wrote:
> to know. It sounds like you want to be able to watch some % utilization of
> a hardware interface as the program is running. I *think* these tools (the
> ones on the FAQ, including MPE, Vampir, and Sun Studio) are not of that
> class.
You ar
I have a code that seems to run about 40% faster when I bond together
twin eth interfaces. The question, of course, arises: is it really
producing enough traffic to keep twin 1 Gig eth interfaces busy? I
don't really believe this, but I need a way to check.
What are good tools to monitor the MPI pe
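Even without an MPI-aware tool, the raw interface rates during a run can be sanity-checked with sysstat, e.g.:

sar -n DEV 1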
On Tue, Aug 18, 2009 at 5:28 PM, Gerry Creager wrote:
> Most of that bandwidth is in marketing... Sorry, but it's not a high
> performance switch.
Well, how does one figure out what exactly is a "high performance
switch"? I've found this an exceedingly hard task. Like the OP posted
the Dell 6248
On Wed, Apr 1, 2009 at 1:13 AM, Ralph Castain wrote:
> So I gather that by "direct" you mean that you don't get an allocation from
> Maui before running the job, but for the other you do? Otherwise, OMPI
> should detect that it is running under Torque and automatically use the
> Torque launche
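A quick way to confirm that a given OMPI build actually has the Torque/TM support compiled in is to look for the tm components:

ompi_info | grep " tm "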
2009/3/31 Ralph Castain :
> I have no idea why your processes are crashing when run via Torque - are you
> sure that the processes themselves crash? Are they segfaulting - if so, can
> you use gdb to find out where?
I have to admit I'm a newbie with gdb. I am trying to recompile my
code as "ifort
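A common recipe here is to rebuild with debugging symbols, allow core dumps in the job environment, and open the resulting core in gdb; the compiler flags are standard ifort options and the file names below are placeholders:

ifort -g -traceback -O0 ...       # rebuild with symbols, optimization off
ulimit -c unlimited               # in the shell/job script before mpirun
gdb ./my_prog core.12345          # then type: bt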
2009/3/31 Ralph Castain :
> It is very hard to debug the problem with so little information. We
> regularly run OMPI jobs on Torque without issue.
Another small thing that I noticed. Not sure if it is relevant.
When the job starts running there is an orte process. The args to this
process are sli
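Those daemon arguments can be captured from a node while the job is running, e.g.:

ps -ef | grep orted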
2009/3/31 Ralph Castain :
>
> Information would be most helpful - the information we really need is
> specified here: http://www.open-mpi.org/community/help/
Output of "ompi_info --all" is attached in a file.
echo $LD_LIBRARY_PATH
/usr/local/ompi-ifort/lib:/opt/intel/fce/10.1.018/lib:/opt/intel
2009/3/31 Ralph Castain :
> It is very hard to debug the problem with so little information. We
Thanks Ralph! I'm sorry my first post lacked enough specifics. I'll
try my best to fill you guys in on as much debug info as I can.
> regularly run OMPI jobs on Torque without issue.
So do we. In fac
I've a strange OpenMPI/Torque problem while trying to run a job on our
Opteron-SC-1435 based cluster:
Each node has 8 CPUs.
If I go to a node and run like this, the job works:
mpirun -np 6 ${EXE_PATH}/${DACAPOEXE_PAR} ${ARGS}
If I submit the same job through PBS/Torque, it starts running but
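For comparison, a minimal submission script of the kind in use here might look like the following; the job name and resource line are placeholders for the real layout, and with TM support mpirun should pick the node list up from Torque itself:

#PBS -l nodes=1:ppn=8
#PBS -N dacapo_test
cd $PBS_O_WORKDIR
mpirun ${EXE_PATH}/${DACAPOEXE_PAR} ${ARGS}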