Hi Beichuan

So, from "df" it looks like /home is /work1, right?

Also, "mount" shows only /work[1-4], not the panfs (Panasas?) file systems
(cwfs and CWFS[1-7]), which apparently are not available on the compute nodes/blades.

I presume you have access and are using only some of the /work[1-4]
(lustre) file systems for all your MPI and other software
installation, right? Not the panfs, right?

Awkward that it doesn't work, because lustre is supposed to be a
parallel file system, highly available to all nodes (assuming it is mounted on all nodes).

It also shows a small /tmp with a tmpfs file system,
which is volatile, in memory:

http://en.wikipedia.org/wiki/Tmpfs

I would guess they don't let you write there, so TMPDIR=/tmp may not
be a possible option, but this is just a wild guess.
Or maybe OMPI requires an actual non-volatile file system to write its
shared memory auxiliary files and other stuff that normally goes on /tmp? [Jeff, Ralph, help!!]
I kind of remember some old discussion on this list about this,
but maybe it was in another list.

[You could ask the sys admin about this, and perhaps what he recommends
to use to replace /tmp.]

Just in case they have some file system mount point mixup,
you could perhaps try TMPDIR=/work1/yanb/tmp (rather than /home).
You could also try TMPDIR=/work3/yanb/tmp since, if I remember right,
this is another file system you have access to (not sure anymore,
it may have been in the previous emails).
Either way, you may need to create the tmp directory beforehand.
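
A minimal sketch of that last step, assuming a bash-like job script. The
/work1/yanb/tmp path is just the one suggested above, and the helper name
make_tmpdir is made up for illustration:

```shell
# Hypothetical helper: create a scratch directory and point TMPDIR at it.
make_tmpdir() {
    # -p creates missing parent directories and is harmless if the
    # directory already exists.
    mkdir -p "$1" || return 1
    export TMPDIR="$1"
}

# In the job script, before mpirun/mpiexec (path taken from this thread):
# make_tmpdir /work1/yanb/tmp || echo "could not create TMPDIR" >&2
```

Doing this inside the job script, rather than by hand on the login node,
at least guarantees the directory exists where the job actually starts.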

**

Any chance that this is an environment mixup?

Say, that you may be inadvertently using the SGI-MPI mpiexec?
Using a /full/path/to/mpiexec in your job may clarify this.

"which mpiexec" will tell, but since the environment on the compute
nodes may not be exactly the same as in the login node, it may not be
reliable information.
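
One way around that is to ask the question from inside the job script,
so the answer comes from the environment the job actually sees. This is
only a sketch, and the helper name report_mpiexec is made up:

```shell
# Hypothetical helper: report which mpiexec the current environment resolves to.
report_mpiexec() {
    # "command -v" prints the path the shell would run, much like "which".
    if path=$(command -v mpiexec); then
        echo "mpiexec found at: $path on $(uname -n)"
    else
        echo "mpiexec not found in PATH on $(uname -n)"
    fi
}

# Put this near the top of the job script, before the real mpiexec line.
report_mpiexec
```

The line it prints should end up in the job's stdout file.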

Or perhaps you may not be pointing to the OMPI libraries?
Are you exporting PATH and LD_LIBRARY_PATH in .bashrc/.tcshrc,
with the OMPI items (bin and lib) *PREPENDED* (not appended),
so as to take precedence over any SGI or other pre-existing
MPI items?
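
For bash, something along these lines, where /path/to/openmpi is only a
placeholder (I don't know where your OMPI is actually installed):

```shell
# Placeholder prefix: replace with the real Open MPI installation directory.
OMPI_PREFIX=/path/to/openmpi

# Prepend, so the OMPI bin and lib directories are searched first.
export PATH="$OMPI_PREFIX/bin:$PATH"
# The ${VAR:+...} form avoids a trailing ":" when LD_LIBRARY_PATH was empty.
export LD_LIBRARY_PATH="$OMPI_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

The tcsh equivalent would use setenv, but the idea is the same: the OMPI
directories go first.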

Those are pretty (ugly) common problems.

**

I hope this helps,
Gus Correa

On 03/03/2014 10:13 PM, Beichuan Yan wrote:
1. info from a compute node
-bash-4.1$ hostname
r32i1n1
-bash-4.1$ df -h /home
Filesystem            Size  Used Avail Use% Mounted on
10.148.18.45@o2ib:10.148.18.46@o2ib:/fs1
                       1.2P  136T  1.1P  12% /work1
-bash-4.1$ mount
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /tmp type tmpfs (rw,size=150m)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
cpuset on /dev/cpuset type cpuset (rw)
10.148.18.45@o2ib:10.148.18.46@o2ib:/fs1 on /work1 type lustre (rw,flock)
10.148.18.76@o2ib:10.148.18.164@o2ib:/fs2 on /work2 type lustre (rw,flock)
10.148.18.104@o2ib:10.148.18.165@o2ib:/fs3 on /work3 type lustre (rw,flock)
10.148.18.132@o2ib:10.148.18.133@o2ib:/fs4 on /work4 type lustre (rw,flock)


2. For "export TMPDIR=/home/yanb/tmp", I created it beforehand, and I did see
MPI-related temporary files there when the job started.

-----Original Message-----
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa
Sent: Monday, March 03, 2014 18:23
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI job initializing problem

Hi Beichuan

OK, it says "unclassified.html", so I presume it is not a problem.

The web site says the computer is an SGI ICE X.
I am not familiar with it, so what follows are guesses.

The SGI site brochure suggests that the nodes/blades have local disks:
https://www.sgi.com/pdfs/4330.pdf

The file systems prefixed with IP addresses (work[1-4]) or with panfs (cwfs
and CWFS[1-7]) and a colon (:) are shared exports (not local), but not
necessarily NFS (panfs may be Panasas?).
From this output it is hard to tell where /home is, but I would guess it is
also shared (not local).
Maybe "df -h /home" will tell.  Or perhaps "mount".

You may be logged in to a login/service node, so although it does have a /tmp 
(your ls / shows tmp), this doesn't guarantee that the compute nodes/blades 
also do.

Since your jobs failed when you specified TMPDIR=/tmp, I would guess /tmp 
doesn't exist on the nodes/blades, or is not writable.

Did you try to submit a job with, say, "mpiexec -np 16 ls -ld /tmp"?
This should tell if /tmp exists on the nodes, and if it is writable.
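
A slightly more explicit version of that test, as a sketch. The helper
name check_dir is hypothetical; inside a job it would have to be saved as
a small script and launched through mpiexec so that it runs on the
compute nodes:

```shell
# Hypothetical helper: report whether a directory exists and is writable
# on the node where this runs.
check_dir() {
    if [ ! -d "$1" ]; then
        echo "$(uname -n): $1 does not exist"
        return 1
    elif [ ! -w "$1" ]; then
        echo "$(uname -n): $1 exists but is not writable"
        return 2
    fi
    echo "$(uname -n): $1 exists and is writable"
}

# Example: check_dir /tmp
```

Unlike a bare "ls -ld", this checks writability for the user running the
job, not just the permission bits.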

A stupid question:
When you tried your job with this:

export TMPDIR=/home/yanb/tmp

Did you create the directory /home/yanb/tmp beforehand?

Anyway, you may need to ask a system administrator of this machine for help.

Gus Correa

On 03/03/2014 07:43 PM, Beichuan Yan wrote:
Gus,

I am using this system: http://centers.hpc.mil/systems/unclassified.html#Spirit. I don't
know the exact configuration of the file systems. Here is the output of "df -h":
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda6             919G   16G  857G   2% /
tmpfs                  32G     0   32G   0% /dev/shm
/dev/sda5             139M   33M  100M  25% /boot
adfs3v-s:/adfs3/hafs14
                        6.5T  678G  5.5T  11% /scratch
adfs3v-s:/adfs3/hafs16
                        6.5T  678G  5.5T  11% /var/spool/mail
10.148.18.45@o2ib:10.148.18.46@o2ib:/fs1
                        1.2P  136T  1.1P  12% /work1
10.148.18.132@o2ib:10.148.18.133@o2ib:/fs4
                        1.2P  793T  368T  69% /work4
10.148.18.104@o2ib:10.148.18.165@o2ib:/fs3
                        1.2P  509T  652T  44% /work3
10.148.18.76@o2ib:10.148.18.164@o2ib:/fs2
                        1.2P  521T  640T  45% /work2
panfs://172.16.0.10/CWFS
                        728T  286T  443T  40% /p/cwfs
panfs://172.16.1.61/CWFS1
                        728T  286T  443T  40% /p/CWFS1
panfs://172.16.0.210/CWFS2
                        728T  286T  443T  40% /p/CWFS2
panfs://172.16.1.125/CWFS3
                        728T  286T  443T  40% /p/CWFS3
panfs://172.16.1.224/CWFS4
                        728T  286T  443T  40% /p/CWFS4
panfs://172.16.1.224/CWFS5
                        728T  286T  443T  40% /p/CWFS5
panfs://172.16.1.224/CWFS6
                        728T  286T  443T  40% /p/CWFS6
panfs://172.16.1.224/CWFS7
                        728T  286T  443T  40% /p/CWFS7

1. My home directory is /home/yanb.
My simulation files are located at /work3/yanb.
The default TMPDIR set by system is just /work3/yanb

2. I did try not setting TMPDIR and letting it default, which is just case 1 and
case 2.
    Case1: #export TMPDIR=/home/yanb/tmp
              TCP="--mca btl_tcp_if_include 10.148.0.0/16"
          The job fails with no apparent reason.
    Case2: #export TMPDIR=/home/yanb/tmp
              #TCP="--mca btl_tcp_if_include 10.148.0.0/16"
          It gives a warning about the shared memory file being on a network file system.

3. With "export TMPDIR=/tmp", the job fails the same way, with no apparent reason.

4. FYI, "ls /" gives:
ELT    apps  cgroup  hafs1   hafs12  hafs2  hafs5  hafs8        home   
lost+found  mnt  p      root     selinux  tftpboot  var    work3
admin  bin   dev     hafs10  hafs13  hafs3  hafs6  hafs9        lib    media    
   net  panfs  sbin     srv      tmp       work1  work4
app    boot  etc     hafs11  hafs15  hafs4  hafs7  hafs_x86_64  lib64  misc     
   opt  proc   scratch  sys      usr       work2  workspace

Beichuan

-----Original Message-----
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus
Correa
Sent: Monday, March 03, 2014 17:24
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI job initializing problem

Hi Beichuan

If you are using the university cluster, chances are that /home is not local, 
but on an NFS share, or perhaps Lustre (which you may have mentioned before, I 
don't remember).

Maybe "df -h" will show what is local and what is not.
It works for NFS, where it prefixes file systems with the server name, but I don't
know about Lustre.
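
Another way to look at it, as a rough sketch: ask df for the file system
type and compare it against a list of known network types. The helper
names are made up, the list is not exhaustive, and the -T option assumes
GNU df:

```shell
# Hypothetical helper: classify a file system type name.
# Anything not in this (incomplete) list is reported as "local".
classify_fstype() {
    case "$1" in
        nfs|nfs4|lustre|panfs|gpfs|cifs|afs) echo "network" ;;
        *) echo "local" ;;
    esac
}

# Hypothetical helper: print the file system type that holds a path.
# -P keeps each entry on one line, -T adds the type column (GNU df).
fstype_of() {
    df -PT "$1" | awk 'NR==2 {print $2}'
}

# Example: classify_fstype "$(fstype_of /home)"
```

Note that tmpfs would come out as "local" here, which is true, but it
lives in memory rather than on a disk.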

Did you try just not to set TMPDIR and let it default?
If the default TMPDIR is on Lustre (did you say this? Anyway, I don't
remember) you could perhaps try to force it to /tmp:
export TMPDIR=/tmp
If the cluster nodes are diskful, /tmp is likely to exist and be local to the
cluster nodes.
[But the cluster nodes may be diskless ... :( ]

I hope this helps,
Gus Correa

On 03/03/2014 07:10 PM, Beichuan Yan wrote:
How do I set TMPDIR to a local filesystem? Is /home/yanb/tmp a local filesystem?
I don't know how to tell whether a directory is on a local file system or a network
file system.

-----Original Message-----
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff
Squyres (jsquyres)
Sent: Monday, March 03, 2014 16:57
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI job initializing problem

How about setting TMPDIR to a local filesystem?


On Mar 3, 2014, at 3:43 PM, Beichuan Yan <beichuan....@colorado.edu> wrote:

I agree there are two cases for pure-MPI mode: 1. The job fails with no apparent reason; 2. The
job complains about the shared-memory file being on a network file system, which can be resolved by
"export TMPDIR=/home/yanb/tmp", where /home/yanb/tmp is my local directory. The default
TMPDIR points to a Lustre directory.

There is no other output. I checked my job with "qstat -n" and found that processes
were actually not started on compute nodes even though PBS Pro has "started" my job.

Beichuan

3. Then I test pure-MPI mode: OpenMP is turned off, and each compute node runs 16 processes
(so MPI's shared memory is clearly used). Four combinations of "TMPDIR" and "TCP"
are tested:
case 1:
#export TMPDIR=/home/yanb/tmp
TCP="--mca btl_tcp_if_include 10.148.0.0/16"
mpirun $TCP -np 64 -npernode 16 -hostfile $PBS_NODEFILE ./paraEllip3d input.txt
output:
Start Prologue v2.5 Mon Mar  3 15:47:16 EST 2014
End Prologue v2.5 Mon Mar  3 15:47:16 EST 2014
-bash: line 1: 448597 Terminated              /var/spool/PBS/mom_priv/jobs/602244.service12.SC
Start Epilogue v2.5 Mon Mar  3 15:50:51 EST 2014
Statistics cpupercent=0,cput=00:00:00,mem=7028kb,ncpus=128,vmem=495768kb,walltime=00:03:24
End Epilogue v2.5 Mon Mar  3 15:50:52 EST 2014

It looks like you have two general cases:

1. The job fails for no apparent reason (like above), or
2. The job complains that your TMPDIR is on a shared filesystem

Right?

I think the real issue, then, is to figure out why your jobs are failing with 
no output.

Is there anything in the stderr output?

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users