No, I did all of these and none worked. I just found that, with exactly the same code, data, and job settings, a job can run one day but not the next. It is NOT repeatable. I don't know what the problem is: hardware? OpenMPI? PBS Pro?
Anyway, I may have to give up using OpenMPI on that system and switch to IntelMPI, which always works.

Thanks,
Beichuan

-----Original Message-----
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa
Sent: Thursday, March 06, 2014 13:51
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI job initializing problem

On 03/06/2014 03:35 PM, Beichuan Yan wrote:
> Gus,
>
> Yes, 10.148.0.0/16 is the IB subnet.
>
> I did try others but none worked:
> #export
> TCP="--mca btl sm,openib"
> No run, no output

If I remember right, and unless this changed in recent OMPI versions,
you also need "self":
-mca btl sm,openib,self
Alternatively, you could rule out tcp:
-mca btl ^tcp

> #export
> TCP="--mca btl sm,openib --mca btl_tcp_if_include 10.148.0.0/16"
> No run, no output
>
> Beichuan

Likewise, "self" is missing here.
Also, I don't know if you can ask for openib and also add
--mca btl_tcp_if_include 10.148.0.0/16.
Note that one turns off tcp (I think), whereas the other requests a tcp
interface (or the IB interface with IPoIB functionality).
That combination sounds weird to me.
The OMPI developers may clarify whether this is a valid combination of syntax.

I would try simply -mca btl sm,openib,self, which is likely to give you the
IB transport with verbs, plus shared memory intra-node, plus the (mandatory?)
self (loopback interface?).
In my experience, this will also help identify any malfunctioning IB HCA in
the nodes (with a failure/error message).

I hope it helps,
Gus Correa

> -----Original Message-----
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa
> Sent: Thursday, March 06, 2014 13:16
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI job initializing problem
>
> Hi Beichuan
>
> So, it looks like the program now runs, though with specific settings
> depending on whether you're using OMPI 1.6.5 or 1.7.4, right?
>
> It looks like the problem now is performance, right?
>
> System load affects performance, but unless the network is overwhelmed, or
> perhaps the Lustre file system is hanging or too slow, I would think that a
> walltime increase from 1 min to 10 min is not related to system load, but to
> something else.
>
> Do you remember the setup that gave you the 1 min walltime?
> Was it the same that you sent below?
> Do you happen to know which nodes?
> Are you sharing nodes with other jobs, or are you running alone on the nodes?
> Sharing with other processes may slow down your job.
> If you request all cores in the node, PBS should give you a full node (unless
> they tricked PBS into thinking the nodes have more cores than they actually
> do).
> How do you request the nodes in your #PBS directives?
> Do you request nodes and ppn, or do you request procs?
>
> I suggest that you do:
> cat $PBS_NODEFILE
> in your PBS script, just to document which nodes are actually given to you.
>
> Also helpful to document/troubleshoot is to add -v and -tag-output to your
> mpiexec command line.
>
> The difference in walltime could be due to some malfunction of the IB HCAs
> on the nodes, for instance.
> Since you are allowing (if I remember right) the use of TCP, OpenMPI will
> try to use any interfaces that you did not rule out.
> If your mpiexec command line doesn't make any restriction, it will use
> anything available, if I remember right.
> (Jeff will correct me in the next second.) If your mpiexec command line has
> --mca btl_tcp_if_include 10.148.0.0/16, it will use the 10.148.0.0/16 subnet
> with TCP transport, I think.
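For reference, a minimal PBS Pro script along the lines suggested above might look like the sketch below. It is only a sketch: it assumes the OMPI 1.6.x/1.7.x-style BTL names (sm,openib,self), and the resource request, install path of mpirun, and executable name are placeholders to adapt.

    #!/bin/bash
    #PBS -N paraEllip3d
    #PBS -l select=4:ncpus=16:mpiprocs=16   # request whole nodes (adjust to the site's PBS Pro setup)
    #PBS -l walltime=01:00:00
    #PBS -j oe                              # merge stderr into stdout so error messages are not lost

    cd $PBS_O_WORKDIR

    # Document which nodes PBS actually handed out.
    cat $PBS_NODEFILE

    # Force IB verbs + intra-node shared memory + self, and ask mpirun for
    # verbose, per-rank-tagged output so a bad HCA fails loudly instead of
    # silently falling back to a slower transport.
    /full/path/to/openmpi-1.6.5/bin/mpirun -v -tag-output \
        --mca btl sm,openib,self \
        -np 64 -npernode 16 -hostfile $PBS_NODEFILE \
        ./paraEllip3d input.txt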
> (Jeff will cut my list subscription after that one, for spreading
> misinformation.)
>
> In either case my impression is that you may have left a door open to the
> use of non-IB (and non-IB-verbs) transport.
>
> Is 10.148.0.0/16 an Infiniband subnet or an Ethernet subnet?
>
> Did you remember Jeff's suggestion from a while ago to avoid TCP (over
> Ethernet or over IB), and stick to IB verbs?
>
> Is 10.148.0.0/16 the IB or the Ethernet subnet?
>
> On 03/02/2014 02:38 PM, Jeff Squyres (jsquyres) wrote:
> > Both 1.6.x and 1.7.x/1.8.x will need verbs.h to use the native verbs
> > network stack.
> >
> > You can use emulated TCP over IB (e.g., using the OMPI TCP BTL), but
> > it's nowhere near as fast/efficient as the native verbs network stack.
>
> You could force the use of IB verbs with
> -mca btl ^tcp
> or with
> -mca btl sm,openib,self
> on the mpiexec command line.
> In this case, if any of the IB HCAs on the nodes is bad,
> the job will abort with an error message, instead of running too slowly
> (if it is using other networks).
>
> There are also ways to tell OMPI to produce more verbose output,
> which may help diagnose the problem.
> ompi_info | grep verbose
> may give some hints (I confess I don't remember them).
>
> Believe me, this did happen to me, i.e., running MPI programs on a
> cluster that had all sorts of non-homogeneous nodes, some with
> faulty IB HCAs, some with incomplete OFED installations, some that
> were not mounting shared file systems properly, etc.
> [I didn't administer that one!]
> Hopefully that is not the problem you are facing, but verbose output
> may help anyway.
>
> I hope this helps,
> Gus Correa
>
> On 03/06/2014 01:49 PM, Beichuan Yan wrote:
>> 1. For $TMPDIR and $TCP, there are four combinations by commenting on/off
>> (note the system's default TMPDIR=/work3/yanb):
>> export TMPDIR=/work1/home/yanb/tmp
>> TCP="--mca btl_tcp_if_include 10.148.0.0/16"
>>
>> 2. I tested the 4 combinations for OpenMPI 1.6.5 and OpenMPI 1.7.4
>> respectively in the pure-MPI mode (no OpenMP threads; 8 nodes, each node
>> runs 16 processes). The results are weird: of all 8 cases, only TWO of them
>> can run, and they run very slowly:
>>
>> OpenMPI 1.6.5:
>> export TMPDIR=/work1/home/yanb/tmp
>> TCP="--mca btl_tcp_if_include 10.148.0.0/16"
>> Warning: shared-memory, /work1/home/yanb/tmp/
>> Runs, takes 10 minutes, slow
>>
>> OpenMPI 1.7.4:
>> #export TMPDIR=/work1/home/yanb/tmp
>> #TCP="--mca btl_tcp_if_include 10.148.0.0/16"
>> Warning: shared-memory /work3/yanb/605832.SPIRIT/
>> Runs, takes 10 minutes, slow
>>
>> So you see, a) OpenMPI 1.6.5 and 1.7.4 need different settings to run;
>> b) whether or not I specify TMPDIR, I get the shared-memory warning.
>>
>> 3. But a few days ago, OpenMPI 1.6.5 worked great and took only 1 minute
>> (now it takes 10 minutes). I am so confused by the results.
>> Does the system load level or fluctuation, or PBS Pro, affect OpenMPI
>> performance?
>>
>> Thanks,
>> Beichuan
>>
>> -----Original Message-----
>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa
>> Sent: Tuesday, March 04, 2014 08:48
>> To: Open MPI Users
>> Subject: Re: [OMPI users] OpenMPI job initializing problem
>>
>> Hi Beichuan
>>
>> So, from "df" it looks like /home is /work1, right?
>>
>> Also, "mount" shows only /work[1-4], not the other
>> 7 CWFS panfs (Panasas?), which apparently are not available on the compute
>> nodes/blades.
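As a concrete version of the ompi_info hint a few paragraphs above, something like the following sketch shows whether the openib (verbs) BTL was built into the installation and turns on BTL-selection diagnostics at run time. This is an assumption-laden example: the verbosity level is arbitrary, and the parameter names should be confirmed with ompi_info against the installed version.

    # Was the openib (verbs) BTL compiled into this Open MPI install?
    ompi_info | grep -i openib

    # List BTL-related MCA parameters, including the verbosity knobs:
    ompi_info --param btl all | grep -i verbose

    # Run with extra BTL diagnostics so the transport actually chosen
    # shows up in the job output:
    mpirun --mca btl sm,openib,self --mca btl_base_verbose 30 \
        -np 64 -npernode 16 -hostfile $PBS_NODEFILE ./paraEllip3d input.txt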
>> I presume you have access to and are using only some of the /work[1-4]
>> (Lustre) file systems for all your MPI and other software installations,
>> right? Not the panfs, right?
>>
>> Awkward that it doesn't work, because Lustre is supposed to be a parallel
>> file system, highly available to all nodes (assuming it is mounted on all
>> nodes).
>>
>> It also shows a small /tmp with a tmpfs file system, which is volatile, in
>> memory:
>>
>> http://en.wikipedia.org/wiki/Tmpfs
>>
>> I would guess they don't let you write there, so TMPDIR=/tmp may not be a
>> possible option, but this is just a wild guess.
>> Or maybe OMPI requires an actual non-volatile file system to write its
>> shared-memory auxiliary files and other stuff that normally goes in /tmp?
>> [Jeff, Ralph, help!!] I kind of remember some old discussion on this list
>> about this, but maybe it was on another list.
>>
>> [You could ask the sys admin about this, and perhaps what he recommends
>> using to replace /tmp.]
>>
>> Just in case they have some file system mount point mixup, you could
>> perhaps try TMPDIR=/work1/yanb/tmp (rather than /home). You could also try
>> TMPDIR=/work3/yanb/tmp, as, if I remember right, this is another file
>> system you have access to (not sure anymore, it may have been in the
>> previous emails).
>> Either way, you may need to create the tmp directory beforehand.
>>
>> **
>>
>> Any chance that this is an environment mixup?
>>
>> Say, that you may be inadvertently using the SGI-MPI mpiexec?
>> Using a /full/path/to/mpiexec in your job may clarify this.
>>
>> "which mpiexec" will tell, but since the environment on the compute nodes
>> may not be exactly the same as on the login node, it may not be reliable
>> information.
>>
>> Or perhaps you may not be pointing to the OMPI libraries?
>> Are you exporting PATH and LD_LIBRARY_PATH in .bashrc/.tcshrc, with the
>> OMPI items (bin and lib) *PREPENDED* (not appended), so as to take
>> precedence over other possible/SGI/pre-existent MPI items?
>>
>> Those are pretty (ugly) common problems.
>>
>> **
>>
>> I hope this helps,
>> Gus Correa
>>
>> On 03/03/2014 10:13 PM, Beichuan Yan wrote:
>>> 1. Info from a compute node:
>>> -bash-4.1$ hostname
>>> r32i1n1
>>> -bash-4.1$ df -h /home
>>> Filesystem            Size  Used Avail Use% Mounted on
>>> 10.148.18.45@o2ib:10.148.18.46@o2ib:/fs1
>>>                       1.2P  136T  1.1P  12% /work1
>>> -bash-4.1$ mount
>>> devpts on /dev/pts type devpts (rw,gid=5,mode=620)
>>> tmpfs on /tmp type tmpfs (rw,size=150m)
>>> none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
>>> cpuset on /dev/cpuset type cpuset (rw)
>>> 10.148.18.45@o2ib:10.148.18.46@o2ib:/fs1 on /work1 type lustre (rw,flock)
>>> 10.148.18.76@o2ib:10.148.18.164@o2ib:/fs2 on /work2 type lustre (rw,flock)
>>> 10.148.18.104@o2ib:10.148.18.165@o2ib:/fs3 on /work3 type lustre (rw,flock)
>>> 10.148.18.132@o2ib:10.148.18.133@o2ib:/fs4 on /work4 type lustre (rw,flock)
>>>
>>> 2. For "export TMPDIR=/home/yanb/tmp", I created it beforehand, and I did
>>> see MPI-related temporary files there when the job started.
>>>
>>> -----Original Message-----
>>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa
>>> Sent: Monday, March 03, 2014 18:23
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] OpenMPI job initializing problem
>>>
>>> Hi Beichuan
>>>
>>> OK, it says "unclassified.html", so I presume it is not a problem.
>>>
>>> The web site says the computer is an SGI ICE X.
>>> I am not familiar with it, so what follows are guesses.
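To rule out the environment mixup described above, a quick check of this sort may help. The Open MPI install prefix shown is a placeholder, not the actual path on that system:

    # Run these both on the login node and inside a batch job, since the
    # environments can differ:
    which mpiexec
    mpiexec --version                  # an Open MPI build prints its Open MPI version string
    ldd ./paraEllip3d | grep -i mpi    # confirm the binary resolves to the intended libmpi

    # In ~/.bashrc (or ~/.tcshrc with setenv), PREPEND the Open MPI you built
    # against, so it takes precedence over any SGI/MPT installation:
    export PATH=/work1/home/yanb/local/openmpi-1.6.5/bin:$PATH
    export LD_LIBRARY_PATH=/work1/home/yanb/local/openmpi-1.6.5/lib:$LD_LIBRARY_PATH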
>>> The SGI site brochure suggests that the nodes/blades have local disks:
>>> https://www.sgi.com/pdfs/4330.pdf
>>>
>>> The file systems prefixed with IP addresses (work[1-4]) and with panfs
>>> (cwfs and CWFS[1-6]) and a colon (:) are shared exports (not local), but
>>> not necessarily NFS (panfs may be Panasas?).
>>> From this output it is hard to tell where /home is, but I would guess
>>> it is also shared (not local).
>>> Maybe "df -h /home" will tell. Or perhaps "mount".
>>>
>>> You may be logged in to a login/service node, so although it does have a
>>> /tmp (your "ls /" shows tmp), this doesn't guarantee that the compute
>>> nodes/blades also do.
>>>
>>> Since your jobs failed when you specified TMPDIR=/tmp, I would guess /tmp
>>> doesn't exist on the nodes/blades, or is not writable.
>>>
>>> Did you try to submit a job with, say, "mpiexec -np 16 ls -ld /tmp"?
>>> This should tell whether /tmp exists on the nodes and whether it is
>>> writable.
>>>
>>> A stupid question:
>>> When you tried your job with this:
>>>
>>> export TMPDIR=/home/yanb/tmp
>>>
>>> did you create the directory /home/yanb/tmp beforehand?
>>>
>>> Anyway, you may need to ask a system administrator of this machine for
>>> help.
>>>
>>> Gus Correa
>>>
>>> On 03/03/2014 07:43 PM, Beichuan Yan wrote:
>>>> Gus,
>>>>
>>>> I am using this system:
>>>> http://centers.hpc.mil/systems/unclassified.html#Spirit. I don't know
>>>> the exact configuration of the file systems. Here is the output of "df -h":
>>>> Filesystem            Size  Used Avail Use% Mounted on
>>>> /dev/sda6             919G   16G  857G   2% /
>>>> tmpfs                  32G     0   32G   0% /dev/shm
>>>> /dev/sda5             139M   33M  100M  25% /boot
>>>> adfs3v-s:/adfs3/hafs14
>>>>                       6.5T  678G  5.5T  11% /scratch
>>>> adfs3v-s:/adfs3/hafs16
>>>>                       6.5T  678G  5.5T  11% /var/spool/mail
>>>> 10.148.18.45@o2ib:10.148.18.46@o2ib:/fs1
>>>>                       1.2P  136T  1.1P  12% /work1
>>>> 10.148.18.132@o2ib:10.148.18.133@o2ib:/fs4
>>>>                       1.2P  793T  368T  69% /work4
>>>> 10.148.18.104@o2ib:10.148.18.165@o2ib:/fs3
>>>>                       1.2P  509T  652T  44% /work3
>>>> 10.148.18.76@o2ib:10.148.18.164@o2ib:/fs2
>>>>                       1.2P  521T  640T  45% /work2
>>>> panfs://172.16.0.10/CWFS
>>>>                       728T  286T  443T  40% /p/cwfs
>>>> panfs://172.16.1.61/CWFS1
>>>>                       728T  286T  443T  40% /p/CWFS1
>>>> panfs://172.16.0.210/CWFS2
>>>>                       728T  286T  443T  40% /p/CWFS2
>>>> panfs://172.16.1.125/CWFS3
>>>>                       728T  286T  443T  40% /p/CWFS3
>>>> panfs://172.16.1.224/CWFS4
>>>>                       728T  286T  443T  40% /p/CWFS4
>>>> panfs://172.16.1.224/CWFS5
>>>>                       728T  286T  443T  40% /p/CWFS5
>>>> panfs://172.16.1.224/CWFS6
>>>>                       728T  286T  443T  40% /p/CWFS6
>>>> panfs://172.16.1.224/CWFS7
>>>>                       728T  286T  443T  40% /p/CWFS7
>>>>
>>>> 1. My home directory is /home/yanb.
>>>> My simulation files are located at /work3/yanb.
>>>> The default TMPDIR set by the system is just /work3/yanb.
>>>>
>>>> 2. I did try not setting TMPDIR and letting it default, which is just
>>>> case 1 and case 2:
>>>> Case 1: #export TMPDIR=/home/yanb/tmp
>>>> TCP="--mca btl_tcp_if_include 10.148.0.0/16"
>>>> The job fails for no apparent reason.
>>>> Case 2: #export TMPDIR=/home/yanb/tmp
>>>> #TCP="--mca btl_tcp_if_include 10.148.0.0/16"
>>>> It gives the warning about a shared-memory file on a network file system.
>>>>
>>>> 3. With "export TMPDIR=/tmp", the job fails the same way, for no apparent
>>>> reason.
>>>> 4. FYI, "ls /" gives:
>>>> ELT    apps  cgroup  hafs1   hafs12  hafs2  hafs5  hafs8        home
>>>> lost+found  mnt  p  root  selinux  tftpboot  var  work3
>>>> admin  bin   dev     hafs10  hafs13  hafs3  hafs6  hafs9        lib
>>>> media  net  panfs  sbin  srv  tmp  work1  work4
>>>> app    boot  etc     hafs11  hafs15  hafs4  hafs7  hafs_x86_64  lib64
>>>> misc  opt  proc  scratch  sys  usr  work2  workspace
>>>>
>>>> Beichuan
>>>>
>>>> -----Original Message-----
>>>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa
>>>> Sent: Monday, March 03, 2014 17:24
>>>> To: Open MPI Users
>>>> Subject: Re: [OMPI users] OpenMPI job initializing problem
>>>>
>>>> Hi Beichuan
>>>>
>>>> If you are using the university cluster, chances are that /home is not
>>>> local, but on an NFS share, or perhaps Lustre (which you may have
>>>> mentioned before, I don't remember).
>>>>
>>>> Maybe "df -h" will show what is local and what is not.
>>>> It works for NFS; it prefixes file systems with the server name, but I
>>>> don't know about Lustre.
>>>>
>>>> Did you try just not setting TMPDIR and letting it default?
>>>> If the default TMPDIR is on Lustre (did you say this? anyway, I don't
>>>> remember), you could perhaps try to force it to /tmp:
>>>> export TMPDIR=/tmp
>>>> If the cluster nodes are diskful, /tmp is likely to exist and be local to
>>>> the cluster nodes.
>>>> [But the cluster nodes may be diskless ... :( ]
>>>>
>>>> I hope this helps,
>>>> Gus Correa
>>>>
>>>> On 03/03/2014 07:10 PM, Beichuan Yan wrote:
>>>>> How do I set TMPDIR to a local filesystem? Is /home/yanb/tmp a local
>>>>> filesystem? I don't know how to tell whether a directory is on a local
>>>>> or a network file system.
>>>>>
>>>>> -----Original Message-----
>>>>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff
>>>>> Squyres (jsquyres)
>>>>> Sent: Monday, March 03, 2014 16:57
>>>>> To: Open MPI Users
>>>>> Subject: Re: [OMPI users] OpenMPI job initializing problem
>>>>>
>>>>> How about setting TMPDIR to a local filesystem?
>>>>>
>>>>> On Mar 3, 2014, at 3:43 PM, Beichuan Yan<beichuan....@colorado.edu>
>>>>> wrote:
>>>>>
>>>>>> I agree there are two cases for pure-MPI mode: 1. The job fails for no
>>>>>> apparent reason; 2. The job complains about a shared-memory file on a
>>>>>> network file system, which can be resolved by "export TMPDIR=/home/yanb/tmp";
>>>>>> /home/yanb/tmp is my local directory. The default TMPDIR points to a
>>>>>> Lustre directory.
>>>>>>
>>>>>> There is no other output. I checked my job with "qstat -n" and found
>>>>>> that processes were actually not started on the compute nodes even
>>>>>> though PBS Pro had "started" my job.
>>>>>>
>>>>>> Beichuan
>>>>>>
>>>>>>> 3. Then I test pure-MPI mode: OpenMP is turned off, and each compute
>>>>>>> node runs 16 processes (so the shared-memory path of MPI is clearly
>>>>>>> used).
>>>>>>> Four combinations of "TMPDIR" and "TCP" are tested:
>>>>>>> case 1:
>>>>>>> #export TMPDIR=/home/yanb/tmp
>>>>>>> TCP="--mca btl_tcp_if_include 10.148.0.0/16"
>>>>>>> mpirun $TCP -np 64 -npernode 16 -hostfile $PBS_NODEFILE ./paraEllip3d input.txt
>>>>>>> output:
>>>>>>> Start Prologue v2.5 Mon Mar 3 15:47:16 EST 2014
>>>>>>> End Prologue v2.5 Mon Mar 3 15:47:16 EST 2014
>>>>>>> -bash: line 1: 448597 Terminated
>>>>>>> /var/spool/PBS/mom_priv/jobs/602244.service12.SC
>>>>>>> Start Epilogue v2.5 Mon Mar 3 15:50:51 EST 2014
>>>>>>> Statistics cpupercent=0,cput=00:00:00,mem=7028kb,ncpus=128,vmem=495768kb,walltime=00:03:24
>>>>>>> End Epilogue v2.5 Mon Mar 3 15:50:52 EST 2014
>>>>>>
>>>>>> It looks like you have two general cases:
>>>>>>
>>>>>> 1. The job fails for no apparent reason (like above), or
>>>>>> 2. The job complains that your TMPDIR is on a shared filesystem
>>>>>>
>>>>>> Right?
>>>>>>
>>>>>> I think the real issue, then, is to figure out why your jobs are
>>>>>> failing with no output.
>>>>>>
>>>>>> Is there anything in the stderr output?
>>>>>>
>>>>>> --
>>>>>> Jeff Squyres
>>>>>> jsquy...@cisco.com
>>>>>> For corporate legal information go to:
>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
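Following up on Jeff's question about stderr and the earlier suggestion to probe /tmp on the compute nodes, a small diagnostic job along the lines below might help pin down where the failure happens. It is a sketch only: the resource request is a placeholder, and the --output-filename option should be checked against the installed Open MPI version.

    #!/bin/bash
    #PBS -l select=2:ncpus=16:mpiprocs=16
    #PBS -l walltime=00:05:00
    #PBS -j oe                      # merge stderr into stdout so nothing is silently dropped

    cd $PBS_O_WORKDIR
    cat $PBS_NODEFILE               # record the nodes actually allocated

    # Probe each node: is /tmp present and writable, and where does TMPDIR
    # point? -tag-output prefixes every line with the rank that printed it;
    # adding "--output-filename diaglog" (if supported) also writes one
    # log file per rank, so even a silent failure leaves something behind.
    mpirun -np 2 -npernode 1 -hostfile $PBS_NODEFILE -tag-output \
        sh -c 'hostname; ls -ld /tmp; df -h ${TMPDIR:-/tmp}'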