1. Info from a compute node:

-bash-4.1$ hostname
r32i1n1

-bash-4.1$ df -h /home
Filesystem                                  Size  Used Avail Use% Mounted on
10.148.18.45@o2ib:10.148.18.46@o2ib:/fs1    1.2P  136T  1.1P  12% /work1

-bash-4.1$ mount
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /tmp type tmpfs (rw,size=150m)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
cpuset on /dev/cpuset type cpuset (rw)
10.148.18.45@o2ib:10.148.18.46@o2ib:/fs1 on /work1 type lustre (rw,flock)
10.148.18.76@o2ib:10.148.18.164@o2ib:/fs2 on /work2 type lustre (rw,flock)
10.148.18.104@o2ib:10.148.18.165@o2ib:/fs3 on /work3 type lustre (rw,flock)
10.148.18.132@o2ib:10.148.18.133@o2ib:/fs4 on /work4 type lustre (rw,flock)

2. For "export TMPDIR=/home/yanb/tmp": I created that directory beforehand, and I did see MPI-related temporary files there when the job started; a quick way to double-check where such a directory lives is sketched below.
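A simple way to check whether a directory such as /home/yanb/tmp (or whatever TMPDIR points at) sits on a local disk or on a network file system is to ask for its file system type. The commands below are only a sketch; the paths are examples, and they assume GNU df and stat are available on the nodes:

    # print the file system type backing a directory (e.g. lustre, nfs, panfs, tmpfs, ext3)
    df -hT /home/yanb/tmp
    stat -f -c %T /home/yanb/tmp

    # the same check on the compute nodes, run from inside a batch job
    mpirun -np 4 -npernode 1 -hostfile $PBS_NODEFILE df -hT /tmp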
2. For "export TMPDIR=/home/yanb/tmp", I created it beforehand, and I did see mpi-related temporary files there when the job gets started. -----Original Message----- From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa Sent: Monday, March 03, 2014 18:23 To: Open MPI Users Subject: Re: [OMPI users] OpenMPI job initializing problem Hi Beichuan OK, it says "unclassified.html", so I presume it is not a problem. The web site says the computer is an SGI ICE X. I am not familiar to it, so what follows are guesses. The SGI site brochure suggests that the nodes/blades have local disks: https://www.sgi.com/pdfs/4330.pdf The file systems prefixed with IP addresses (work[1-4]) and with panfs (cwfs and CWFS[1-6]) and a colon (:) are shared exports (not local), but not necessarily NFS (panfs may be Panasas?). From this output it is hard to tell where /home is, but I would guess it is also shared (not local). Maybe "df -h /home" will tell. Or perhaps "mount". You may be logged in to a login/service node, so although it does have a /tmp (your ls / shows tmp), this doesn't guarantee that the compute nodes/blades also do. Since your jobs failed when you specified TMPDIR=/tmp, I would guess /tmp doesn't exist on the nodes/blades, or is not writable. Did you try to submit a job with, say, "mpiexec -np 16 ls -ld /tmp"? This should tell if /tmp exists on the nodes, if it is writable. A stupid question: When you tried your job with this: export TMPDIR=/home/yanb/tmp Did you create the directory /home/yanb/tmp beforehand? Anyway, you may need to ask the help of a system administrator of this machine. Gus Correa On 03/03/2014 07:43 PM, Beichuan Yan wrote: > Gus, > > I am using this system: > http://centers.hpc.mil/systems/unclassified.html#Spirit. I don't know exactly > configurations of the file system. Here is the output of "df -h": > Filesystem Size Used Avail Use% Mounted on > /dev/sda6 919G 16G 857G 2% / > tmpfs 32G 0 32G 0% /dev/shm > /dev/sda5 139M 33M 100M 25% /boot > adfs3v-s:/adfs3/hafs14 > 6.5T 678G 5.5T 11% /scratch > adfs3v-s:/adfs3/hafs16 > 6.5T 678G 5.5T 11% /var/spool/mail > 10.148.18.45@o2ib:10.148.18.46@o2ib:/fs1 > 1.2P 136T 1.1P 12% /work1 > 10.148.18.132@o2ib:10.148.18.133@o2ib:/fs4 > 1.2P 793T 368T 69% /work4 > 10.148.18.104@o2ib:10.148.18.165@o2ib:/fs3 > 1.2P 509T 652T 44% /work3 > 10.148.18.76@o2ib:10.148.18.164@o2ib:/fs2 > 1.2P 521T 640T 45% /work2 > panfs://172.16.0.10/CWFS > 728T 286T 443T 40% /p/cwfs > panfs://172.16.1.61/CWFS1 > 728T 286T 443T 40% /p/CWFS1 > panfs://172.16.0.210/CWFS2 > 728T 286T 443T 40% /p/CWFS2 > panfs://172.16.1.125/CWFS3 > 728T 286T 443T 40% /p/CWFS3 > panfs://172.16.1.224/CWFS4 > 728T 286T 443T 40% /p/CWFS4 > panfs://172.16.1.224/CWFS5 > 728T 286T 443T 40% /p/CWFS5 > panfs://172.16.1.224/CWFS6 > 728T 286T 443T 40% /p/CWFS6 > panfs://172.16.1.224/CWFS7 > 728T 286T 443T 40% /p/CWFS7 > > 1. My home directory is /home/yanb. > My simulation files are located at /work3/yanb. > The default TMPDIR set by system is just /work3/yanb > > 2. I did try not to set TMPDIR and let it default, which is just case 1 and > case 2. > Case1: #export TMPDIR=/home/yanb/tmp > TCP="--mca btl_tcp_if_include 10.148.0.0/16" > It gives no apparent reason. > Case2: #export TMPDIR=/home/yanb/tmp > #TCP="--mca btl_tcp_if_include 10.148.0.0/16" > It gives warning of shared memory file on network file system. > > 3. With "export TMPDIR=/tmp", the job gives the same, no apparent reason. > > 4. 
FYI, "ls /" gives: > ELT apps cgroup hafs1 hafs12 hafs2 hafs5 hafs8 home > lost+found mnt p root selinux tftpboot var work3 > admin bin dev hafs10 hafs13 hafs3 hafs6 hafs9 lib media > net panfs sbin srv tmp work1 work4 > app boot etc hafs11 hafs15 hafs4 hafs7 hafs_x86_64 lib64 misc > opt proc scratch sys usr work2 workspace > > Beichuan > > -----Original Message----- > From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus > Correa > Sent: Monday, March 03, 2014 17:24 > To: Open MPI Users > Subject: Re: [OMPI users] OpenMPI job initializing problem > > Hi Beichuan > > If you are using the university cluster, chances are that /home is not local, > but on an NFS share, or perhaps Lustre (which you may have mentioned before, > I don't remember). > > Maybe "df -h" will show what is local what is not. > It works for NFS, it prefixes file systems with the server name, but I don't > know about Lustre. > > Did you try just not to set TMPDIR and let it default? > If the default TMPDIR is on Lustre (did you say this?, anyway I don't > remember) you could perhaps try to force it to /tmp: > export TMPDIR=/tmp, > If the cluster nodes are diskfull /tmp is likely to exist and be local to the > cluster nodes. > [But the cluster nodes may be diskless ... :( ] > > I hope this helps, > Gus Correa > > On 03/03/2014 07:10 PM, Beichuan Yan wrote: >> How to set TMPDIR to a local filesystem? Is /home/yanb/tmp a local >> filesystem? I don't know how to tell a directory is local file system or >> network file system. >> >> -----Original Message----- >> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff >> Squyres (jsquyres) >> Sent: Monday, March 03, 2014 16:57 >> To: Open MPI Users >> Subject: Re: [OMPI users] OpenMPI job initializing problem >> >> How about setting TMPDIR to a local filesystem? >> >> >> On Mar 3, 2014, at 3:43 PM, Beichuan Yan<beichuan....@colorado.edu> wrote: >> >>> I agree there are two cases for pure-MPI mode: 1. Job fails with no >>> apparent reason; 2 job complains shared-memory file on network file >>> system, which can be resolved by " export TMPDIR=/home/yanb/tmp", >>> /home/yanb/tmp is my local directory. The default TMPDIR points to a Lustre >>> directory. >>> >>> There is no any other output. I checked my job with "qstat -n" and found >>> that processes were actually not started on compute nodes even though PBS >>> Pro has "started" my job. >>> >>> Beichuan >>> >>>> 3. Then I test pure-MPI mode: OPENMP is turned off, and each compute node >>>> runs 16 processes (clearly shared-memory of MPI is used). Four >>>> combinations of "TMPDIR" and "TCP" are tested: >>>> case 1: >>>> #export TMPDIR=/home/yanb/tmp >>>> TCP="--mca btl_tcp_if_include 10.148.0.0/16" >>>> mpirun $TCP -np 64 -npernode 16 -hostfile $PBS_NODEFILE >>>> ./paraEllip3d input.txt >>>> output: >>>> Start Prologue v2.5 Mon Mar 3 15:47:16 EST 2014 End Prologue v2.5 >>>> Mon Mar 3 15:47:16 EST 2014 >>>> -bash: line 1: 448597 Terminated >>>> /var/spool/PBS/mom_priv/jobs/602244.service12.SC >>>> Start Epilogue v2.5 Mon Mar 3 15:50:51 EST 2014 Statistics >>>> cpupercent=0,cput=00:00:00,mem=7028kb,ncpus=128,vmem=495768kb,wallt >>>> i >>>> m >>>> e >>>> =00:03:24 End Epilogue v2.5 Mon Mar 3 15:50:52 EST 2014 >>> >>> It looks like you have two general cases: >>> >>> 1. The job fails for no apparent reason (like above), or 2. The job >>> complains that your TMPDIR is on a shared filesystem >>> >>> Right? >>> >>> I think the real issue, then, is to figure out why your jobs are failing >>> with no output. 
>>>
>>> Is there anything in the stderr output?
>>>
>>> --
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> For corporate legal information go to:
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
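For reference, a minimal sketch of a PBS Pro script that ties the suggestions in this thread together: it first checks that /tmp exists and is writable on every node of the job, then points TMPDIR at a node-local path before launching the solver. The select line, node counts, paths, and the 10.148.0.0/16 network are only example values taken from the messages above, not a verified configuration for this machine; note that the compute-node "mount" output shows /tmp as a tmpfs capped at 150 MB, so a larger node-local directory may be needed if the shared-memory backing files are big.

#!/bin/bash
#PBS -N paraEllip3d
#PBS -l select=4:ncpus=16:mpiprocs=16
#PBS -l walltime=01:00:00
#PBS -j oe

cd $PBS_O_WORKDIR

# 1. Check that /tmp exists and is writable on every node assigned to the job
#    (one process per node is enough for this).
mpirun -np 4 -npernode 1 -hostfile $PBS_NODEFILE ls -ld /tmp

# 2. Keep Open MPI's temporary files (e.g. the shared-memory backing file)
#    off the Lustre /work file systems by pointing TMPDIR at a node-local path.
export TMPDIR=/tmp

# 3. Restrict the TCP BTL to the 10.148.0.0/16 network, as in the cases above.
TCP="--mca btl_tcp_if_include 10.148.0.0/16"

mpirun $TCP -np 64 -npernode 16 -hostfile $PBS_NODEFILE ./paraEllip3d input.txt

The "ls -ld /tmp" step mirrors Gus's earlier suggestion and costs essentially nothing; if it fails or shows a non-writable directory, the choice of TMPDIR is the first thing to revisit.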