1. Info from a compute node:
-bash-4.1$ hostname
r32i1n1
-bash-4.1$ df -h /home
Filesystem            Size  Used Avail Use% Mounted on
10.148.18.45@o2ib:10.148.18.46@o2ib:/fs1
                      1.2P  136T  1.1P  12% /work1
-bash-4.1$ mount
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /tmp type tmpfs (rw,size=150m)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
cpuset on /dev/cpuset type cpuset (rw)
10.148.18.45@o2ib:10.148.18.46@o2ib:/fs1 on /work1 type lustre (rw,flock)
10.148.18.76@o2ib:10.148.18.164@o2ib:/fs2 on /work2 type lustre (rw,flock)
10.148.18.104@o2ib:10.148.18.165@o2ib:/fs3 on /work3 type lustre (rw,flock)
10.148.18.132@o2ib:10.148.18.133@o2ib:/fs4 on /work4 type lustre (rw,flock)


2. For "export TMPDIR=/home/yanb/tmp", I created it beforehand, and I did see 
mpi-related temporary files there when the job gets started.
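
For reference, condensed from the script lines quoted later in this thread (the mkdir is the step I did beforehand):

mkdir -p /home/yanb/tmp
export TMPDIR=/home/yanb/tmp
TCP="--mca btl_tcp_if_include 10.148.0.0/16"
mpirun $TCP -np 64 -npernode 16 -hostfile $PBS_NODEFILE ./paraEllip3d input.txt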

-----Original Message-----
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa
Sent: Monday, March 03, 2014 18:23
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI job initializing problem

Hi Beichuan

OK, it says "unclassified.html", so I presume it is not a problem.

The web site says the computer is an SGI ICE X.
I am not familiar with it, so what follows are guesses.

The SGI site brochure suggests that the nodes/blades have local disks:
https://www.sgi.com/pdfs/4330.pdf

The file systems prefixed with IP addresses (work[1-4]) or with panfs:// (cwfs 
and CWFS[1-7]) followed by a colon (:) are shared exports (not local), but not 
necessarily NFS (panfs may be Panasas?).
From this output it is hard to tell where /home is, but I would guess it is 
also shared (not local).
Maybe "df -h /home" will tell.  Or perhaps "mount".

You may be logged in to a login/service node, so although it does have a /tmp 
(your ls / shows tmp), this doesn't guarantee that the compute nodes/blades 
also do.

Since your jobs failed when you specified TMPDIR=/tmp, I would guess /tmp 
doesn't exist on the nodes/blades, or is not writable.

Did you try to submit a job with, say, "mpiexec -np 16 ls -ld /tmp"?
This should tell whether /tmp exists on the nodes and whether it is writable.
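
For instance, reusing the launch options from your own script, one process per node (assuming the same 4-node allocation):

mpiexec -np 4 -npernode 1 -hostfile $PBS_NODEFILE ls -ld /tmp

Each node should print a line like "drwxrwxrwt ... /tmp"; a missing directory or missing write permission would explain the silent failures.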

A stupid question:
When you tried your job with this:

export TMPDIR=/home/yanb/tmp

Did you create the directory /home/yanb/tmp beforehand?
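
(A related suggestion, if you would rather not touch TMPDIR at all: Open MPI can be pointed at a different session-directory base on the mpirun command line, which should have the same effect, e.g.

mpirun --mca orte_tmpdir_base /home/yanb/tmp $TCP -np 64 -npernode 16 -hostfile $PBS_NODEFILE ./paraEllip3d input.txt

I believe orte_tmpdir_base is the parameter name in your Open MPI version, but "ompi_info -a | grep tmpdir" will confirm.)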

Anyway, you may need to ask a system administrator of this machine for help.

Gus Correa

On 03/03/2014 07:43 PM, Beichuan Yan wrote:
> Gus,
>
> I am using this system: 
> http://centers.hpc.mil/systems/unclassified.html#Spirit. I don't know the exact 
> configuration of the file systems. Here is the output of "df -h":
> Filesystem            Size  Used Avail Use% Mounted on
> /dev/sda6             919G   16G  857G   2% /
> tmpfs                  32G     0   32G   0% /dev/shm
> /dev/sda5             139M   33M  100M  25% /boot
> adfs3v-s:/adfs3/hafs14
>                        6.5T  678G  5.5T  11% /scratch
> adfs3v-s:/adfs3/hafs16
>                        6.5T  678G  5.5T  11% /var/spool/mail
> 10.148.18.45@o2ib:10.148.18.46@o2ib:/fs1
>                        1.2P  136T  1.1P  12% /work1
> 10.148.18.132@o2ib:10.148.18.133@o2ib:/fs4
>                        1.2P  793T  368T  69% /work4
> 10.148.18.104@o2ib:10.148.18.165@o2ib:/fs3
>                        1.2P  509T  652T  44% /work3
> 10.148.18.76@o2ib:10.148.18.164@o2ib:/fs2
>                        1.2P  521T  640T  45% /work2 
> panfs://172.16.0.10/CWFS
>                        728T  286T  443T  40% /p/cwfs
> panfs://172.16.1.61/CWFS1
>                        728T  286T  443T  40% /p/CWFS1
> panfs://172.16.0.210/CWFS2
>                        728T  286T  443T  40% /p/CWFS2
> panfs://172.16.1.125/CWFS3
>                        728T  286T  443T  40% /p/CWFS3
> panfs://172.16.1.224/CWFS4
>                        728T  286T  443T  40% /p/CWFS4
> panfs://172.16.1.224/CWFS5
>                        728T  286T  443T  40% /p/CWFS5
> panfs://172.16.1.224/CWFS6
>                        728T  286T  443T  40% /p/CWFS6
> panfs://172.16.1.224/CWFS7
>                        728T  286T  443T  40% /p/CWFS7
>
> 1. My home directory is /home/yanb.
> My simulation files are located at /work3/yanb.
> The default TMPDIR set by the system is just /work3/yanb.
>
> 2. I did try not setting TMPDIR and letting it default; that is just cases 1 
> and 2 below.
>    Case 1: #export TMPDIR=/home/yanb/tmp
>            TCP="--mca btl_tcp_if_include 10.148.0.0/16"
>         The job fails with no apparent reason.
>    Case 2: #export TMPDIR=/home/yanb/tmp
>            #TCP="--mca btl_tcp_if_include 10.148.0.0/16"
>         It gives the warning about the shared-memory file being on a network file system.
>
> 3. With "export TMPDIR=/tmp", the job fails the same way, with no apparent reason.
>
> 4. FYI, "ls /" gives:
> ELT    admin  app    apps   bin    boot   cgroup  dev     etc     hafs1
> hafs10 hafs11 hafs12 hafs13 hafs15 hafs2  hafs3   hafs4   hafs5   hafs6
> hafs7  hafs8  hafs9  hafs_x86_64   home   lib     lib64   lost+found
> media  misc   mnt    net    opt    p      panfs   proc    root    sbin
> scratch       selinux       srv    sys    tftpboot        tmp     usr
> var    work1  work2  work3  work4  workspace
>
> Beichuan
>
> -----Original Message-----
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa
> Sent: Monday, March 03, 2014 17:24
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI job initializing problem
>
> Hi Beichuan
>
> If you are using the university cluster, chances are that /home is not local, 
> but on an NFS share, or perhaps Lustre (which you may have mentioned before, 
> I don't remember).
>
> Maybe "df -h" will show what is local what is not.
> It works for NFS, it prefixes file systems with the server name, but I don't 
> know about Lustre.
>
> Did you try just not setting TMPDIR and letting it default?
> If the default TMPDIR is on Lustre (did you say this? I don't remember), you 
> could perhaps try to force it to /tmp:
> export TMPDIR=/tmp
> If the cluster nodes are diskful, /tmp is likely to exist and be local to the 
> cluster nodes.
> [But the cluster nodes may be diskless ... :( ]
>
> I hope this helps,
> Gus Correa
>
> On 03/03/2014 07:10 PM, Beichuan Yan wrote:
>> How do I set TMPDIR to a local filesystem? Is /home/yanb/tmp a local 
>> filesystem? I don't know how to tell whether a directory is on a local or a 
>> network file system.
>>
>> -----Original Message-----
>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff Squyres (jsquyres)
>> Sent: Monday, March 03, 2014 16:57
>> To: Open MPI Users
>> Subject: Re: [OMPI users] OpenMPI job initializing problem
>>
>> How about setting TMPDIR to a local filesystem?
>>
>>
>> On Mar 3, 2014, at 3:43 PM, Beichuan Yan <beichuan....@colorado.edu> wrote:
>>
>>> I agree there are two cases for pure-MPI mode: 1. The job fails with no 
>>> apparent reason; 2. The job complains about the shared-memory file being on 
>>> a network file system, which can be resolved by "export TMPDIR=/home/yanb/tmp", 
>>> where /home/yanb/tmp is my local directory. The default TMPDIR points to a 
>>> Lustre directory.
>>>
>>> There is no other output. I checked my job with "qstat -n" and found that 
>>> the processes were actually not started on the compute nodes even though PBS 
>>> Pro had "started" my job.
>>>
>>> Beichuan
>>>
>>>> 3. Then I tested pure-MPI mode: OpenMP is turned off, and each compute node 
>>>> runs 16 processes (so MPI shared memory is clearly used). Four 
>>>> combinations of "TMPDIR" and "TCP" are tested:
>>>> case 1:
>>>> #export TMPDIR=/home/yanb/tmp
>>>> TCP="--mca btl_tcp_if_include 10.148.0.0/16"
>>>> mpirun $TCP -np 64 -npernode 16 -hostfile $PBS_NODEFILE ./paraEllip3d input.txt
>>>> output:
>>>> Start Prologue v2.5 Mon Mar  3 15:47:16 EST 2014
>>>> End Prologue v2.5 Mon Mar  3 15:47:16 EST 2014
>>>> -bash: line 1: 448597 Terminated              /var/spool/PBS/mom_priv/jobs/602244.service12.SC
>>>> Start Epilogue v2.5 Mon Mar  3 15:50:51 EST 2014
>>>> Statistics cpupercent=0,cput=00:00:00,mem=7028kb,ncpus=128,vmem=495768kb,walltime=00:03:24
>>>> End Epilogue v2.5 Mon Mar  3 15:50:52 EST 2014
>>>
>>> It looks like you have two general cases:
>>>
>>> 1. The job fails for no apparent reason (like above), or
>>> 2. The job complains that your TMPDIR is on a shared filesystem
>>>
>>> Right?
>>>
>>> I think the real issue, then, is to figure out why your jobs are failing 
>>> with no output.
>>>
>>> Is there anything in the stderr output?
>>>
>>> --
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> For corporate legal information go to:
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>
>>
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>

