-----Original Message-----
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus
Correa
Sent: Thursday, March 06, 2014 13:16
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI job initializing problem
Hi Beichuan
So, it looks like the program now runs, although only with specific settings
that depend on whether you're using OMPI 1.6.5 or 1.7.4, right?
It looks like the problem now is performance, right?
System load affects performance, but unless the network is overwhelmed, or
perhaps the Lustre file system is hanging or too slow, I would think that a
walltime increase from 1min to 10min is not related to system load, but
something else.
Do you remember the setup that gave you 1min walltime?
Was it the same that you sent below?
Do you happen to know which nodes?
Are you sharing nodes with other jobs, or are you running alone on the nodes?
Sharing with other processes may slow down your job.
If you request all cores in the node, PBS should give you a full node (unless
they tricked PBS to think the nodes have more cores than they actually do).
How do you request the nodes in your #PBS directives?
Do you request nodes and ppn, or do you request procs?
I suggest that you do:
cat $PBS_NODEFILE
in your PBS script, just to document which nodes are actually given to you.
Also helpful to document/troubleshoot is to add -v and -tag-output to your
mpiexec command line.
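Here is a minimal sketch of how the node list could be documented from inside the job script. The helper name summarize_nodefile is made up for illustration; it only assumes that $PBS_NODEFILE holds one host name per line, with a host typically repeated once per slot given to the job.

```shell
#!/bin/sh
# Hypothetical helper (not part of PBS or Open MPI): print each distinct host
# in a node file together with the number of lines (slots) listed for it.
summarize_nodefile() {
    # sort groups identical host names, uniq -c counts them, and awk prints
    # "host count" without the leading padding that uniq adds.
    sort "$1" | uniq -c | awk '{ print $2, $1 }'
}

# Inside a PBS job one could then record the allocation in the job output:
#   cat "$PBS_NODEFILE"
#   summarize_nodefile "$PBS_NODEFILE"
```

Comparing this summary between the 1-minute run and the 10-minute run would show whether the two jobs landed on the same nodes.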
The difference in walltime could be due to some malfunction of IB HCAs on the
nodes, for instance.
Since you are allowing (if I remember right) the use of TCP, OpenMPI will try
to use any interfaces that you did not rule out.
If your mpiexec command line doesn't make any restriction, it will use anything
available, if I remember right.
(Jeff will correct me in the next second.) If your mpiexec command line has
--mca btl_tcp_if_include 10.148.0.0/16, it will use the 10.148.0.0/16 subnet
with TCP transport, I think.
(Jeff will cut my list subscription after that one, for spreading
misinformation.)
In either case my impression is that you may have left a door open to the use
of non-IB (and non-IB-verbs) transport.
Is 10.148.0.0/16 an InfiniBand subnet or an Ethernet subnet?
Did you remember Jeff's suggestion from a while ago to avoid TCP (over Ethernet
or over IB), and stick to IB verbs?
Is 10.148.0.0/16 the IB or the Ethernet subnet?
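One way to answer the subnet question on a compute node might be to look at the link type of each interface that carries an IPv4 address. This is only a sketch: the two helper names are made up, and it assumes a Linux node with the "ip" tool and the usual /sys/class/net/<iface>/type files, where 1 means Ethernet and 32 means InfiniBand.

```shell
#!/bin/sh
# Hypothetical helper: translate the numeric ARPHRD link type that Linux
# exposes in /sys/class/net/<iface>/type into a readable name.
link_type_name() {
    case "$1" in
        1)  echo "Ethernet" ;;
        32) echo "InfiniBand" ;;
        *)  echo "other($1)" ;;
    esac
}

# Hypothetical helper: list every interface that has an IPv4 address,
# together with that address and the link type of the interface.
list_ipv4_links() {
    # "ip -o -4 addr show" prints one line per address:
    #   <index>: <iface> inet <addr>/<prefix> ...
    ip -o -4 addr show | while read -r _ iface _ addr _; do
        type_file="/sys/class/net/$iface/type"
        [ -r "$type_file" ] || continue
        echo "$iface $addr $(link_type_name "$(cat "$type_file")")"
    done
}
```

If the interface holding a 10.148.x.x address is reported as InfiniBand, then that subnet is IP over IB, and TCP traffic on it is the emulated path rather than native verbs.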
On 03/02/2014 02:38 PM, Jeff Squyres (jsquyres) wrote:
Both 1.6.x and 1.7.x/1.8.x will need verbs.h to use the native verbs
network stack.
You can use emulated TCP over IB (e.g., using the OMPI TCP BTL), but
it's nowhere near as fast/efficient as the native verbs network stack.
You could force the use of IB verbs with
-mca btl ^tcp
or with
-mca btl sm,openib,self
on the mpiexec command line.
In this case, if any of the IB HCAs on the nodes is bad, the job will
abort with an error message, instead of running too slow (if it is
using other networks).
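To make it easy to switch between these choices in a job script, something like the sketch below might help. The helper name btl_options and its mode names are invented here; only the two option strings come from the suggestion above.

```shell
#!/bin/sh
# Hypothetical helper: return one of the BTL selections quoted above,
# chosen by a short mode name.
btl_options() {
    case "$1" in
        verbs-only) echo "-mca btl sm,openib,self" ;;  # shared memory, IB verbs, send-to-self
        no-tcp)     echo "-mca btl ^tcp" ;;            # every BTL except TCP
        *)          echo "" ;;                         # no restriction at all
    esac
}

# Example use in the job script (left unquoted on purpose so the options
# split into separate words):
#   mpiexec $(btl_options verbs-only) -np 128 ./paraEllip3d input.txt
```

With either restriction in place, a node with a bad HCA should make the job abort rather than quietly fall back to a slower network.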
There are also ways to tell OMPI to do a more verbose output, that
may perhaps help diagnose the problem.
ompi_info | grep verbose
may give some hints (I confess I don't remember them).
Believe me, this did happen to me, i.e., to run MPI programs in a
cluster that had all sorts of non-homogeneous nodes, some with faulty
IB HCAs, some with incomplete OFED installation, some that were not
mounting shared file systems properly, etc.
[I didn't administer that one!]
Hopefully that is not the problem you are facing, but verbose output
may help anyways.
I hope this helps,
Gus Correa
On 03/06/2014 01:49 PM, Beichuan Yan wrote:
1. For $TMPDIR and $TCP, there are four combinations, obtained by commenting
each of the two lines below in or out (note the system's default
TMPDIR=/work3/yanb):
export TMPDIR=/work1/home/yanb/tmp
TCP="--mca btl_tcp_if_include 10.148.0.0/16"
2. I tested the 4 combinations for OpenMPI 1.6.5 and OpenMPI 1.7.4 respectively
for the pure-MPI mode (no OPENMP threads; 8 nodes, each node runs 16
processes). The results are weird: of all 8 cases, only TWO can run, and they
run very slowly:
OpenMPI 1.6.5:
export TMPDIR=/work1/home/yanb/tmp
TCP="--mca btl_tcp_if_include 10.148.0.0/16"
Warning: shared-memory, /work1/home/yanb/tmp/ Run, take 10 minutes,
slow
OpenMPI 1.7.4:
#export TMPDIR=/work1/home/yanb/tmp
#TCP="--mca btl_tcp_if_include 10.148.0.0/16"
Warning: shared-memory /work3/yanb/605832.SPIRIT/ Run, take 10
minutes, slow
So you see, a) OpenMPI 1.6.5 and 1.7.4 need different settings to run;
b) whether or not I specify TMPDIR, I get the shared-memory warning.
3. But a few days ago, OpenMPI 1.6.5 worked great and took only 1 minute
(now it takes 10 minutes). I am so confused by the results.
Does the system load level, its fluctuation, or PBS Pro affect
OpenMPI performance?
Thanks,
Beichuan
-----Original Message-----
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus
Correa
Sent: Tuesday, March 04, 2014 08:48
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI job initializing problem
Hi Beichuan
So, from "df" it looks like /home is /work1, right?
Also, "mount" shows only /work[1-4], not the other
7 CWFS panfs (Panasas?), which apparently are not available in the compute
nodes/blades.
I presume you have access and are using only some of the /work[1-4]
(lustre) file systems for all your MPI and other software installation, right?
Not the panfs, right?
Awkward that it doesn't work, because lustre is supposed to be a parallel file
system, highly available to all nodes (assuming it is mounted on all nodes).
It also shows a small /tmp with a tmpfs file system, which is volatile, in
memory:
http://en.wikipedia.org/wiki/Tmpfs
I would guess they don't let you write there, so TMPDIR=/tmp may not be a
possible option, but this is just a wild guess.
Or maybe OMPI requires an actual non-volatile file system to write its shared
memory auxiliary files and other stuff that normally goes on /tmp? [Jeff,
Ralph, help!!] I kind of remember some old discussion on this list about this,
but maybe it was in another list.
[You could ask the sys admin about this, and perhaps what he
recommends to use to replace /tmp.]
Just in case they may have some file system mount point mixup, you could try
perhaps TMPDIR=/work1/yanb/tmp (rather than /home). You could also try
TMPDIR=/work3/yanb/tmp, as, if I remember right, this is another file system you
have access to (not sure anymore, it may have been in the previous emails).
Either way, you may need to create the tmp directory beforehand.
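A small sketch of what "create it beforehand" could look like in the job script itself, so that a missing or read-only directory is caught before mpiexec starts. The helper name ensure_tmpdir is hypothetical; it relies only on mkdir -p and mktemp.

```shell
#!/bin/sh
# Hypothetical helper: create a candidate TMPDIR if it does not exist and
# confirm that a file can really be written there.
# Returns non-zero if the directory cannot be created or written to.
ensure_tmpdir() {
    tmp_candidate="$1"
    mkdir -p "$tmp_candidate" || return 1
    # Create and remove a probe file to prove the directory is writable.
    tmp_probe=$(mktemp "$tmp_candidate/probe.XXXXXX") || return 1
    rm -f "$tmp_probe"
}

# In the job script, before mpiexec (path taken from the suggestion above):
#   if ensure_tmpdir /work1/yanb/tmp; then
#       export TMPDIR=/work1/yanb/tmp
#   fi
```

This only checks the node where the job script runs; the other nodes of the job see the same directory only if the file system is mounted there too.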
**
Any chances that this is an environment mixup?
Say, that you may be inadvertently using the SGI-MPI mpiexec? Using a
/full/path/to/mpiexec in your job may clarify this.
"which mpiexec" will tell, but since the environment on the compute nodes may
not be exactly the same as in the login node, it may not be reliable information.
Or perhaps you may not be pointing to the OMPI libraries?
Are you exporting PATH and LD_LIBRARY_PATH on .bashrc/.tcshrc, with the OMPI
items (bin and lib) *PREPENDED* (not appended), so as to take precedence over
other possible/SGI/pre-existent MPI items?
Those are pretty (ugly) common problems.
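For the record, prepending could be done along these lines in .bashrc. This is a sketch under assumptions: prepend_to is a made-up helper, and OMPI_PREFIX stands for wherever your Open MPI is installed, which the thread does not say.

```shell
#!/bin/sh
# Hypothetical helper: put a directory at the FRONT of the colon-separated
# search path stored in the named variable (for example PATH or
# LD_LIBRARY_PATH), so that it takes precedence over existing entries.
prepend_to() {
    pp_var="$1"
    pp_dir="$2"
    # Read the current value of the variable whose name is in pp_var.
    eval "pp_current=\${$pp_var}"
    if [ -n "$pp_current" ]; then
        export "$pp_var=$pp_dir:$pp_current"
    else
        export "$pp_var=$pp_dir"
    fi
}

# OMPI_PREFIX is an assumed install location, not taken from the thread:
#   OMPI_PREFIX=/path/to/openmpi
#   prepend_to PATH "$OMPI_PREFIX/bin"
#   prepend_to LD_LIBRARY_PATH "$OMPI_PREFIX/lib"
```

Running "which mpiexec" from inside the job after this should then point into the Open MPI bin directory rather than the SGI one.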
**
I hope this helps,
Gus Correa
On 03/03/2014 10:13 PM, Beichuan Yan wrote:
1. info from a compute node
-bash-4.1$ hostname
r32i1n1
-bash-4.1$ df -h /home
Filesystem Size Used Avail Use% Mounted on
10.148.18.45@o2ib:10.148.18.46@o2ib:/fs1
1.2P 136T 1.1P 12% /work1
-bash-4.1$ mount
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /tmp type tmpfs (rw,size=150m)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
cpuset on /dev/cpuset type cpuset (rw)
10.148.18.45@o2ib:10.148.18.46@o2ib:/fs1 on /work1 type lustre (rw,flock)
10.148.18.76@o2ib:10.148.18.164@o2ib:/fs2 on /work2 type lustre (rw,flock)
10.148.18.104@o2ib:10.148.18.165@o2ib:/fs3 on /work3 type lustre (rw,flock)
10.148.18.132@o2ib:10.148.18.133@o2ib:/fs4 on /work4 type lustre (rw,flock)
2. For "export TMPDIR=/home/yanb/tmp", I created it beforehand, and I did see
mpi-related temporary files there when the job gets started.
-----Original Message-----
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus
Correa
Sent: Monday, March 03, 2014 18:23
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI job initializing problem
Hi Beichuan
OK, it says "unclassified.html", so I presume it is not a problem.
The web site says the computer is an SGI ICE X.
I am not familiar with it, so what follows are guesses.
The SGI site brochure suggests that the nodes/blades have local disks:
https://www.sgi.com/pdfs/4330.pdf
The file systems prefixed with IP addresses (work[1-4]) and with panfs (cwfs
and CWFS[1-6]) and a colon (:) are shared exports (not local), but not
necessarily NFS (panfs may be Panasas?).
From this output it is hard to tell where /home is, but I would guess it
is also shared (not local).
Maybe "df -h /home" will tell. Or perhaps "mount".
You may be logged in to a login/service node, so although it does have a /tmp
(your ls / shows tmp), this doesn't guarantee that the compute nodes/blades
also do.
Since your jobs failed when you specified TMPDIR=/tmp, I would guess /tmp
doesn't exist on the nodes/blades, or is not writable.
Did you try to submit a job with, say, "mpiexec -np 16 ls -ld /tmp"?
This should tell whether /tmp exists on the nodes and whether it is writable.
A stupid question:
When you tried your job with this:
export TMPDIR=/home/yanb/tmp
Did you create the directory /home/yanb/tmp beforehand?
Anyway, you may need to ask the help of a system administrator of this machine.
Gus Correa
On 03/03/2014 07:43 PM, Beichuan Yan wrote:
Gus,
I am using this system: http://centers.hpc.mil/systems/unclassified.html#Spirit. I don't
know the exact configuration of the file systems. Here is the output of "df -h":
Filesystem Size Used Avail Use% Mounted on
/dev/sda6 919G 16G 857G 2% /
tmpfs 32G 0 32G 0% /dev/shm
/dev/sda5 139M 33M 100M 25% /boot
adfs3v-s:/adfs3/hafs14
6.5T 678G 5.5T 11% /scratch
adfs3v-s:/adfs3/hafs16
6.5T 678G 5.5T 11% /var/spool/mail
10.148.18.45@o2ib:10.148.18.46@o2ib:/fs1
1.2P 136T 1.1P 12% /work1
10.148.18.132@o2ib:10.148.18.133@o2ib:/fs4
1.2P 793T 368T 69% /work4
10.148.18.104@o2ib:10.148.18.165@o2ib:/fs3
1.2P 509T 652T 44% /work3
10.148.18.76@o2ib:10.148.18.164@o2ib:/fs2
1.2P 521T 640T 45% /work2
panfs://172.16.0.10/CWFS
728T 286T 443T 40% /p/cwfs
panfs://172.16.1.61/CWFS1
728T 286T 443T 40% /p/CWFS1
panfs://172.16.0.210/CWFS2
728T 286T 443T 40% /p/CWFS2
panfs://172.16.1.125/CWFS3
728T 286T 443T 40% /p/CWFS3
panfs://172.16.1.224/CWFS4
728T 286T 443T 40% /p/CWFS4
panfs://172.16.1.224/CWFS5
728T 286T 443T 40% /p/CWFS5
panfs://172.16.1.224/CWFS6
728T 286T 443T 40% /p/CWFS6
panfs://172.16.1.224/CWFS7
728T 286T 443T 40% /p/CWFS7
1. My home directory is /home/yanb.
My simulation files are located at /work3/yanb.
The default TMPDIR set by system is just /work3/yanb
2. I did try not to set TMPDIR and let it default, which is just case 1 and
case 2.
Case1: #export TMPDIR=/home/yanb/tmp
TCP="--mca btl_tcp_if_include 10.148.0.0/16"
It fails with no apparent reason.
Case2: #export TMPDIR=/home/yanb/tmp
#TCP="--mca btl_tcp_if_include 10.148.0.0/16"
It gives a warning about the shared-memory file being on a network file system.
3. With "export TMPDIR=/tmp", the job fails the same way, with no apparent reason.
4. FYI, "ls /" gives:
ELT apps cgroup hafs1 hafs12 hafs2 hafs5 hafs8 home
lost+found mnt p root selinux tftpboot var work3
admin bin dev hafs10 hafs13 hafs3 hafs6 hafs9 lib media
net panfs sbin srv tmp work1 work4
app boot etc hafs11 hafs15 hafs4 hafs7 hafs_x86_64 lib64 misc
opt proc scratch sys usr work2 workspace
Beichuan
-----Original Message-----
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus
Correa
Sent: Monday, March 03, 2014 17:24
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI job initializing problem
Hi Beichuan
If you are using the university cluster, chances are that /home is not local,
but on an NFS share, or perhaps Lustre (which you may have mentioned before, I
don't remember).
Maybe "df -h" will show what is local and what is not.
It works for NFS, where it prefixes file systems with the server name, but I
don't know about Lustre.
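Another way to tell, sketched below, is to ask df for the file system type of a given path and compare it against a list of network types. Both helper names are made up, the list of types is illustrative rather than complete, and the -T option assumes GNU df.

```shell
#!/bin/sh
# Hypothetical helper: label a file system type name as "network" or "local".
# The list of network types is illustrative, not exhaustive; anything not
# listed (including in-memory tmpfs) is reported as local.
classify_fstype() {
    case "$1" in
        nfs|nfs4|lustre|panfs|gpfs|cifs) echo "network" ;;
        *)                               echo "local" ;;
    esac
}

# Hypothetical helper: print the type of the file system that holds a path.
# -P keeps each entry on one line, -T (GNU df) adds the Type column.
fstype_of() {
    df -PT "$1" | awk 'NR == 2 { print $2 }'
}

# Example:
#   classify_fstype "$(fstype_of /home/yanb/tmp)"
```

Run on a compute node rather than a login node, since the two may not mount the same file systems.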
Did you try just not to set TMPDIR and let it default?
If the default TMPDIR is on Lustre (did you say this? Anyway, I don't
remember), you could perhaps try to force it to /tmp:
export TMPDIR=/tmp
If the cluster nodes are diskful, /tmp is likely to exist and be local to the
cluster nodes.
[But the cluster nodes may be diskless ... :( ]
I hope this helps,
Gus Correa
On 03/03/2014 07:10 PM, Beichuan Yan wrote:
How do I set TMPDIR to a local filesystem? Is /home/yanb/tmp a local filesystem?
I don't know how to tell whether a directory is on a local file system or a
network file system.
-----Original Message-----
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff
Squyres (jsquyres)
Sent: Monday, March 03, 2014 16:57
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI job initializing problem
How about setting TMPDIR to a local filesystem?
On Mar 3, 2014, at 3:43 PM, Beichuan Yan<beichuan....@colorado.edu> wrote:
I agree there are two cases for pure-MPI mode: 1. The job fails with no apparent reason; 2.
The job complains about a shared-memory file on a network file system, which can be resolved by
"export TMPDIR=/home/yanb/tmp"; /home/yanb/tmp is my local directory. The default
TMPDIR points to a Lustre directory.
There is no other output. I checked my job with "qstat -n" and found that processes
were actually not started on compute nodes even though PBS Pro has "started" my job.
Beichuan
3. Then I test pure-MPI mode: OPENMP is turned off, and each compute node runs 16 processes
(clearly shared-memory of MPI is used). Four combinations of "TMPDIR" and "TCP"
are tested:
case 1:
#export TMPDIR=/home/yanb/tmp
TCP="--mca btl_tcp_if_include 10.148.0.0/16"
mpirun $TCP -np 64 -npernode 16 -hostfile $PBS_NODEFILE
./paraEllip3d input.txt
output:
Start Prologue v2.5 Mon Mar 3 15:47:16 EST 2014
End Prologue v2.5 Mon Mar 3 15:47:16 EST 2014
-bash: line 1: 448597 Terminated
/var/spool/PBS/mom_priv/jobs/602244.service12.SC
Start Epilogue v2.5 Mon Mar 3 15:50:51 EST 2014
Statistics
cpupercent=0,cput=00:00:00,mem=7028kb,ncpus=128,vmem=495768kb,walltime=00:03:24
End Epilogue v2.5 Mon Mar 3 15:50:52 EST 2014
It looks like you have two general cases:
1. The job fails for no apparent reason (like above), or
2. The job complains that your TMPDIR is on a shared filesystem
Right?
I think the real issue, then, is to figure out why your jobs are failing with
no output.
Is there anything in the stderr output?
--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users