No, I did all of these and none worked. I just found that, with exactly the same code, data, and job settings, a job can run one day but not the next. It is NOT repeatable. I don't know what the problem is: hardware? OpenMPI? PBS Pro?
Anyway, I may have to give up using OpenMPI on that system and switch to IntelMPI, which always works.

Thanks,
Beichuan

-----Original Message-----
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa
Sent: Thursday, March 06, 2014 13:51
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI job initializing problem

On 03/06/2014 03:35 PM, Beichuan Yan wrote:
> Gus,
>
> Yes, 10.148.0.0/16 is the IB subnet.
>
> I did try others but none worked:
> #export
> TCP="--mca btl sm,openib"
> No run, no output

If I remember right, and unless this changed in recent OMPI versions,
you also need "self":
-mca btl sm,openib,self
Alternatively, you could rule out tcp:
-mca btl ^tcp

> #export
> TCP="--mca btl sm,openib --mca btl_tcp_if_include 10.148.0.0/16"
> No run, no output
>
> Beichuan

Likewise, "self" is missing here.
Also, I don't know if you can ask for openib and also add
--mca btl_tcp_if_include 10.148.0.0/16.
Note that one turns off tcp (I think), whereas the other requests a tcp
interface (or the IB interface with IPoIB functionality).
That combination sounds weird to me.
The OMPI developers may clarify whether this is a valid combination of syntax.

I would try simply -mca btl sm,openib,self, which is likely to give you the
IB transport with verbs, plus shared memory intra-node, plus the (mandatory?)
self (loopback interface?).
In my experience, this will also help identify any malfunctioning IB HCA in
the nodes (with a failure/error message).

I hope it helps,
Gus Correa

> -----Original Message-----
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa
> Sent: Thursday, March 06, 2014 13:16
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI job initializing problem
>
> Hi Beichuan
>
> So, it looks like the program now runs, though with specific settings
> depending on whether you're using OMPI 1.6.5 or 1.7.4, right?
>
> It looks like the problem now is performance, right?
>
> System load affects performance, but unless the network is overwhelmed, or
> perhaps the Lustre file system is hanging or too slow, I would think that a
> walltime increase from 1 min to 10 min is not related to system load, but to
> something else.
>
> Do you remember the setup that gave you the 1 min walltime?
> Was it the same that you sent below?
> Do you happen to know which nodes?
> Are you sharing nodes with other jobs, or are you running alone on the nodes?
> Sharing with other processes may slow down your job.
> If you request all cores in the node, PBS should give you a full node (unless
> they tricked PBS into thinking the nodes have more cores than they actually
> do).
> How do you request the nodes in your #PBS directives?
> Do you request nodes and ppn, or do you request procs?
>
> I suggest that you do:
> cat $PBS_NODEFILE
> in your PBS script, just to document which nodes are actually given to you.
>
> Also helpful to document/troubleshoot is to add -v and -tag-output to your
> mpiexec command line.
>
> The difference in walltime could be due to some malfunction of the IB HCAs
> on the nodes, for instance.
> Since you are allowing (if I remember right) the use of TCP, OpenMPI will
> try to use any interfaces that you did not rule out.
> If your mpiexec command line doesn't make any restriction, it will use
> anything available, if I remember right.
> (Jeff will correct me in the next second.) If your mpiexec command line has
> --mca btl_tcp_if_include 10.148.0.0/16, it will use the 10.148.0.0/16 subnet
> with TCP transport, I think.
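For reference, a minimal PBS Pro script along the lines suggested above might look like the sketch below. It is only a sketch: it assumes the OMPI 1.6.x/1.7.x-style BTL names (sm,openib,self), and the resource request, install path of mpirun, and executable name are placeholders to adapt.

    #!/bin/bash
    #PBS -N paraEllip3d
    #PBS -l select=4:ncpus=16:mpiprocs=16   # request whole nodes (adjust to the site's PBS Pro setup)
    #PBS -l walltime=01:00:00
    #PBS -j oe                              # merge stderr into stdout so error messages are not lost

    cd $PBS_O_WORKDIR

    # Document which nodes PBS actually handed out.
    cat $PBS_NODEFILE

    # Force IB verbs + intra-node shared memory + self, and ask mpirun for
    # verbose, per-rank-tagged output so a bad HCA fails loudly instead of
    # silently falling back to a slower transport.
    /full/path/to/openmpi-1.6.5/bin/mpirun -v -tag-output \
        --mca btl sm,openib,self \
        -np 64 -npernode 16 -hostfile $PBS_NODEFILE \
        ./paraEllip3d input.txt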
> (Jeff will cut my list subscription after that one, for spreading
> misinformation.)
>
> In either case my impression is that you may have left a door open to the
> use of non-IB (and non-IB-verbs) transport.
>
> Is 10.148.0.0/16 an Infiniband subnet or an Ethernet subnet?
>
> Did you remember Jeff's suggestion from a while ago to avoid TCP (over
> Ethernet or over IB), and stick to IB verbs?
>
> Is 10.148.0.0/16 the IB or the Ethernet subnet?
>
> On 03/02/2014 02:38 PM, Jeff Squyres (jsquyres) wrote:
> > Both 1.6.x and 1.7.x/1.8.x will need verbs.h to use the native verbs
> > network stack.
> >
> > You can use emulated TCP over IB (e.g., using the OMPI TCP BTL), but
> > it's nowhere near as fast/efficient as the native verbs network stack.
>
> You could force the use of IB verbs with
> -mca btl ^tcp
> or with
> -mca btl sm,openib,self
> on the mpiexec command line.
> In this case, if any of the IB HCAs on the nodes is bad,
> the job will abort with an error message, instead of running too slowly
> (if it is using other networks).
>
> There are also ways to tell OMPI to produce more verbose output,
> which may help diagnose the problem.
> ompi_info | grep verbose
> may give some hints (I confess I don't remember them).
>
> Believe me, this did happen to me, i.e., running MPI programs on a
> cluster that had all sorts of non-homogeneous nodes, some with
> faulty IB HCAs, some with incomplete OFED installations, some that
> were not mounting shared file systems properly, etc.
> [I didn't administer that one!]
> Hopefully that is not the problem you are facing, but verbose output
> may help anyway.
>
> I hope this helps,
> Gus Correa
>
> On 03/06/2014 01:49 PM, Beichuan Yan wrote:
>> 1. For $TMPDIR and $TCP, there are four combinations by commenting on/off
>> (note the system's default TMPDIR=/work3/yanb):
>> export TMPDIR=/work1/home/yanb/tmp
>> TCP="--mca btl_tcp_if_include 10.148.0.0/16"
>>
>> 2. I tested the 4 combinations for OpenMPI 1.6.5 and OpenMPI 1.7.4
>> respectively in the pure-MPI mode (no OpenMP threads; 8 nodes, each node
>> runs 16 processes). The results are weird: of all 8 cases, only TWO of them
>> can run, and they run very slowly:
>>
>> OpenMPI 1.6.5:
>> export TMPDIR=/work1/home/yanb/tmp
>> TCP="--mca btl_tcp_if_include 10.148.0.0/16"
>> Warning: shared-memory, /work1/home/yanb/tmp/
>> Runs, takes 10 minutes, slow
>>
>> OpenMPI 1.7.4:
>> #export TMPDIR=/work1/home/yanb/tmp
>> #TCP="--mca btl_tcp_if_include 10.148.0.0/16"
>> Warning: shared-memory /work3/yanb/605832.SPIRIT/
>> Runs, takes 10 minutes, slow
>>
>> So you see, a) OpenMPI 1.6.5 and 1.7.4 need different settings to run;
>> b) whether or not I specify TMPDIR, I get the shared-memory warning.
>>
>> 3. But a few days ago, OpenMPI 1.6.5 worked great and took only 1 minute
>> (now it takes 10 minutes). I am so confused by the results.
>> Does the system load level or fluctuation, or PBS Pro, affect OpenMPI
>> performance?
>>
>> Thanks,
>> Beichuan
>>
>> -----Original Message-----
>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa
>> Sent: Tuesday, March 04, 2014 08:48
>> To: Open MPI Users
>> Subject: Re: [OMPI users] OpenMPI job initializing problem
>>
>> Hi Beichuan
>>
>> So, from "df" it looks like /home is /work1, right?
>>
>> Also, "mount" shows only /work[1-4], not the other
>> 7 CWFS panfs (Panasas?), which apparently are not available on the compute
>> nodes/blades.
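As a concrete version of the ompi_info hint a few paragraphs above, something like the following sketch shows whether the openib (verbs) BTL was built into the installation and turns on BTL-selection diagnostics at run time. This is an assumption-laden example: the verbosity level is arbitrary, and the parameter names should be confirmed with ompi_info against the installed version.

    # Was the openib (verbs) BTL compiled into this Open MPI install?
    ompi_info | grep -i openib

    # List BTL-related MCA parameters, including the verbosity knobs:
    ompi_info --param btl all | grep -i verbose

    # Run with extra BTL diagnostics so the transport actually chosen
    # shows up in the job output:
    mpirun --mca btl sm,openib,self --mca btl_base_verbose 30 \
        -np 64 -npernode 16 -hostfile $PBS_NODEFILE ./paraEllip3d input.txt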
>> I presume you have access to and are using only some of the /work[1-4]
>> (Lustre) file systems for all your MPI and other software installations,
>> right? Not the panfs, right?
>>
>> Awkward that it doesn't work, because Lustre is supposed to be a parallel
>> file system, highly available to all nodes (assuming it is mounted on all
>> nodes).
>>
>> It also shows a small /tmp with a tmpfs file system, which is volatile, in
>> memory:
>>
>> http://en.wikipedia.org/wiki/Tmpfs
>>
>> I would guess they don't let you write there, so TMPDIR=/tmp may not be a
>> possible option, but this is just a wild guess.
>> Or maybe OMPI requires an actual non-volatile file system to write its
>> shared-memory auxiliary files and other stuff that normally goes in /tmp?
>> [Jeff, Ralph, help!!] I kind of remember some old discussion on this list
>> about this, but maybe it was on another list.
>>
>> [You could ask the sys admin about this, and perhaps what he recommends
>> using to replace /tmp.]
>>
>> Just in case they have some file system mount point mixup, you could
>> perhaps try TMPDIR=/work1/yanb/tmp (rather than /home). You could also try
>> TMPDIR=/work3/yanb/tmp, as, if I remember right, this is another file
>> system you have access to (not sure anymore, it may have been in the
>> previous emails).
>> Either way, you may need to create the tmp directory beforehand.
>>
>> **
>>
>> Any chance that this is an environment mixup?
>>
>> Say, that you may be inadvertently using the SGI-MPI mpiexec?
>> Using a /full/path/to/mpiexec in your job may clarify this.
>>
>> "which mpiexec" will tell, but since the environment on the compute nodes
>> may not be exactly the same as on the login node, it may not be reliable
>> information.
>>
>> Or perhaps you may not be pointing to the OMPI libraries?
>> Are you exporting PATH and LD_LIBRARY_PATH in .bashrc/.tcshrc, with the
>> OMPI items (bin and lib) *PREPENDED* (not appended), so as to take
>> precedence over other possible/SGI/pre-existent MPI items?
>>
>> Those are pretty (ugly) common problems.
>>
>> **
>>
>> I hope this helps,
>> Gus Correa
>>
>> On 03/03/2014 10:13 PM, Beichuan Yan wrote:
>>> 1. Info from a compute node:
>>> -bash-4.1$ hostname
>>> r32i1n1
>>> -bash-4.1$ df -h /home
>>> Filesystem            Size  Used Avail Use% Mounted on
>>> 10.148.18.45@o2ib:10.148.18.46@o2ib:/fs1
>>>                       1.2P  136T  1.1P  12% /work1
>>> -bash-4.1$ mount
>>> devpts on /dev/pts type devpts (rw,gid=5,mode=620)
>>> tmpfs on /tmp type tmpfs (rw,size=150m)
>>> none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
>>> cpuset on /dev/cpuset type cpuset (rw)
>>> 10.148.18.45@o2ib:10.148.18.46@o2ib:/fs1 on /work1 type lustre (rw,flock)
>>> 10.148.18.76@o2ib:10.148.18.164@o2ib:/fs2 on /work2 type lustre (rw,flock)
>>> 10.148.18.104@o2ib:10.148.18.165@o2ib:/fs3 on /work3 type lustre (rw,flock)
>>> 10.148.18.132@o2ib:10.148.18.133@o2ib:/fs4 on /work4 type lustre (rw,flock)
>>>
>>> 2. For "export TMPDIR=/home/yanb/tmp", I created it beforehand, and I did
>>> see MPI-related temporary files there when the job started.
>>>
>>> -----Original Message-----
>>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa
>>> Sent: Monday, March 03, 2014 18:23
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] OpenMPI job initializing problem
>>>
>>> Hi Beichuan
>>>
>>> OK, it says "unclassified.html", so I presume it is not a problem.
>>>
>>> The web site says the computer is an SGI ICE X.
>>> I am not familiar with it, so what follows are guesses.
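To rule out the environment mixup described above, a quick check of this sort may help. The Open MPI install prefix shown is a placeholder, not the actual path on that system:

    # Run these both on the login node and inside a batch job, since the
    # environments can differ:
    which mpiexec
    mpiexec --version                  # an Open MPI build prints its Open MPI version string
    ldd ./paraEllip3d | grep -i mpi    # confirm the binary resolves to the intended libmpi

    # In ~/.bashrc (or ~/.tcshrc with setenv), PREPEND the Open MPI you built
    # against, so it takes precedence over any SGI/MPT installation:
    export PATH=/work1/home/yanb/local/openmpi-1.6.5/bin:$PATH
    export LD_LIBRARY_PATH=/work1/home/yanb/local/openmpi-1.6.5/lib:$LD_LIBRARY_PATH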
>>> The SGI site brochure suggests that the nodes/blades have local disks:
>>> https://www.sgi.com/pdfs/4330.pdf
>>>
>>> The file systems prefixed with IP addresses (work[1-4]) and with panfs
>>> (cwfs and CWFS[1-6]) and a colon (:) are shared exports (not local), but
>>> not necessarily NFS (panfs may be Panasas?).
>>> From this output it is hard to tell where /home is, but I would guess
>>> it is also shared (not local).
>>> Maybe "df -h /home" will tell. Or perhaps "mount".
>>>
>>> You may be logged in to a login/service node, so although it does have a
>>> /tmp (your "ls /" shows tmp), this doesn't guarantee that the compute
>>> nodes/blades also do.
>>>
>>> Since your jobs failed when you specified TMPDIR=/tmp, I would guess /tmp
>>> doesn't exist on the nodes/blades, or is not writable.
>>>
>>> Did you try to submit a job with, say, "mpiexec -np 16 ls -ld /tmp"?
>>> This should tell whether /tmp exists on the nodes and whether it is
>>> writable.
>>>
>>> A stupid question:
>>> When you tried your job with this:
>>>
>>> export TMPDIR=/home/yanb/tmp
>>>
>>> did you create the directory /home/yanb/tmp beforehand?
>>>
>>> Anyway, you may need to ask a system administrator of this machine for
>>> help.
>>>
>>> Gus Correa
>>>
>>> On 03/03/2014 07:43 PM, Beichuan Yan wrote:
>>>> Gus,
>>>>
>>>> I am using this system:
>>>> http://centers.hpc.mil/systems/unclassified.html#Spirit. I don't know
>>>> the exact configuration of the file systems. Here is the output of "df -h":
>>>> Filesystem            Size  Used Avail Use% Mounted on
>>>> /dev/sda6             919G   16G  857G   2% /
>>>> tmpfs                  32G     0   32G   0% /dev/shm
>>>> /dev/sda5             139M   33M  100M  25% /boot
>>>> adfs3v-s:/adfs3/hafs14
>>>>                       6.5T  678G  5.5T  11% /scratch
>>>> adfs3v-s:/adfs3/hafs16
>>>>                       6.5T  678G  5.5T  11% /var/spool/mail
>>>> 10.148.18.45@o2ib:10.148.18.46@o2ib:/fs1
>>>>                       1.2P  136T  1.1P  12% /work1
>>>> 10.148.18.132@o2ib:10.148.18.133@o2ib:/fs4
>>>>                       1.2P  793T  368T  69% /work4
>>>> 10.148.18.104@o2ib:10.148.18.165@o2ib:/fs3
>>>>                       1.2P  509T  652T  44% /work3
>>>> 10.148.18.76@o2ib:10.148.18.164@o2ib:/fs2
>>>>                       1.2P  521T  640T  45% /work2
>>>> panfs://172.16.0.10/CWFS
>>>>                       728T  286T  443T  40% /p/cwfs
>>>> panfs://172.16.1.61/CWFS1
>>>>                       728T  286T  443T  40% /p/CWFS1
>>>> panfs://172.16.0.210/CWFS2
>>>>                       728T  286T  443T  40% /p/CWFS2
>>>> panfs://172.16.1.125/CWFS3
>>>>                       728T  286T  443T  40% /p/CWFS3
>>>> panfs://172.16.1.224/CWFS4
>>>>                       728T  286T  443T  40% /p/CWFS4
>>>> panfs://172.16.1.224/CWFS5
>>>>                       728T  286T  443T  40% /p/CWFS5
>>>> panfs://172.16.1.224/CWFS6
>>>>                       728T  286T  443T  40% /p/CWFS6
>>>> panfs://172.16.1.224/CWFS7
>>>>                       728T  286T  443T  40% /p/CWFS7
>>>>
>>>> 1. My home directory is /home/yanb.
>>>> My simulation files are located at /work3/yanb.
>>>> The default TMPDIR set by the system is just /work3/yanb.
>>>>
>>>> 2. I did try not setting TMPDIR and letting it default, which is just
>>>> case 1 and case 2:
>>>> Case 1: #export TMPDIR=/home/yanb/tmp
>>>> TCP="--mca btl_tcp_if_include 10.148.0.0/16"
>>>> The job fails for no apparent reason.
>>>> Case 2: #export TMPDIR=/home/yanb/tmp
>>>> #TCP="--mca btl_tcp_if_include 10.148.0.0/16"
>>>> It gives the warning about a shared-memory file on a network file system.
>>>>
>>>> 3. With "export TMPDIR=/tmp", the job fails the same way, for no apparent
>>>> reason.
>>>> 4. FYI, "ls /" gives:
>>>> ELT    apps  cgroup  hafs1   hafs12  hafs2  hafs5  hafs8        home
>>>> lost+found  mnt  p  root  selinux  tftpboot  var  work3
>>>> admin  bin   dev     hafs10  hafs13  hafs3  hafs6  hafs9        lib
>>>> media  net  panfs  sbin  srv  tmp  work1  work4
>>>> app    boot  etc     hafs11  hafs15  hafs4  hafs7  hafs_x86_64  lib64
>>>> misc  opt  proc  scratch  sys  usr  work2  workspace
>>>>
>>>> Beichuan
>>>>
>>>> -----Original Message-----
>>>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa
>>>> Sent: Monday, March 03, 2014 17:24
>>>> To: Open MPI Users
>>>> Subject: Re: [OMPI users] OpenMPI job initializing problem
>>>>
>>>> Hi Beichuan
>>>>
>>>> If you are using the university cluster, chances are that /home is not
>>>> local, but on an NFS share, or perhaps Lustre (which you may have
>>>> mentioned before, I don't remember).
>>>>
>>>> Maybe "df -h" will show what is local and what is not.
>>>> It works for NFS; it prefixes file systems with the server name, but I
>>>> don't know about Lustre.
>>>>
>>>> Did you try just not setting TMPDIR and letting it default?
>>>> If the default TMPDIR is on Lustre (did you say this? anyway, I don't
>>>> remember), you could perhaps try to force it to /tmp:
>>>> export TMPDIR=/tmp
>>>> If the cluster nodes are diskful, /tmp is likely to exist and be local to
>>>> the cluster nodes.
>>>> [But the cluster nodes may be diskless ... :( ]
>>>>
>>>> I hope this helps,
>>>> Gus Correa
>>>>
>>>> On 03/03/2014 07:10 PM, Beichuan Yan wrote:
>>>>> How do I set TMPDIR to a local filesystem? Is /home/yanb/tmp a local
>>>>> filesystem? I don't know how to tell whether a directory is on a local
>>>>> or a network file system.
>>>>>
>>>>> -----Original Message-----
>>>>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff
>>>>> Squyres (jsquyres)
>>>>> Sent: Monday, March 03, 2014 16:57
>>>>> To: Open MPI Users
>>>>> Subject: Re: [OMPI users] OpenMPI job initializing problem
>>>>>
>>>>> How about setting TMPDIR to a local filesystem?
>>>>>
>>>>> On Mar 3, 2014, at 3:43 PM, Beichuan Yan<beichuan....@colorado.edu>
>>>>> wrote:
>>>>>
>>>>>> I agree there are two cases for pure-MPI mode: 1. The job fails for no
>>>>>> apparent reason; 2. The job complains about a shared-memory file on a
>>>>>> network file system, which can be resolved by "export TMPDIR=/home/yanb/tmp";
>>>>>> /home/yanb/tmp is my local directory. The default TMPDIR points to a
>>>>>> Lustre directory.
>>>>>>
>>>>>> There is no other output. I checked my job with "qstat -n" and found
>>>>>> that processes were actually not started on the compute nodes even
>>>>>> though PBS Pro had "started" my job.
>>>>>>
>>>>>> Beichuan
>>>>>>
>>>>>>> 3. Then I test pure-MPI mode: OpenMP is turned off, and each compute
>>>>>>> node runs 16 processes (so the shared-memory path of MPI is clearly
>>>>>>> used).
>>>>>>> Four combinations of "TMPDIR" and "TCP" are tested:
>>>>>>> case 1:
>>>>>>> #export TMPDIR=/home/yanb/tmp
>>>>>>> TCP="--mca btl_tcp_if_include 10.148.0.0/16"
>>>>>>> mpirun $TCP -np 64 -npernode 16 -hostfile $PBS_NODEFILE ./paraEllip3d input.txt
>>>>>>> output:
>>>>>>> Start Prologue v2.5 Mon Mar 3 15:47:16 EST 2014
>>>>>>> End Prologue v2.5 Mon Mar 3 15:47:16 EST 2014
>>>>>>> -bash: line 1: 448597 Terminated
>>>>>>> /var/spool/PBS/mom_priv/jobs/602244.service12.SC
>>>>>>> Start Epilogue v2.5 Mon Mar 3 15:50:51 EST 2014
>>>>>>> Statistics cpupercent=0,cput=00:00:00,mem=7028kb,ncpus=128,vmem=495768kb,walltime=00:03:24
>>>>>>> End Epilogue v2.5 Mon Mar 3 15:50:52 EST 2014
>>>>>>
>>>>>> It looks like you have two general cases:
>>>>>>
>>>>>> 1. The job fails for no apparent reason (like above), or
>>>>>> 2. The job complains that your TMPDIR is on a shared filesystem
>>>>>>
>>>>>> Right?
>>>>>>
>>>>>> I think the real issue, then, is to figure out why your jobs are
>>>>>> failing with no output.
>>>>>>
>>>>>> Is there anything in the stderr output?
>>>>>>
>>>>>> --
>>>>>> Jeff Squyres
>>>>>> jsquy...@cisco.com
>>>>>> For corporate legal information go to:
>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
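Following up on Jeff's question about stderr and the earlier suggestion to probe /tmp on the compute nodes, a small diagnostic job along the lines below might help pin down where the failure happens. It is a sketch only: the resource request is a placeholder, and the --output-filename option should be checked against the installed Open MPI version.

    #!/bin/bash
    #PBS -l select=2:ncpus=16:mpiprocs=16
    #PBS -l walltime=00:05:00
    #PBS -j oe                      # merge stderr into stdout so nothing is silently dropped

    cd $PBS_O_WORKDIR
    cat $PBS_NODEFILE               # record the nodes actually allocated

    # Probe each node: is /tmp present and writable, and where does TMPDIR
    # point? -tag-output prefixes every line with the rank that printed it;
    # adding "--output-filename diaglog" (if supported) also writes one
    # log file per rank, so even a silent failure leaves something behind.
    mpirun -np 2 -npernode 1 -hostfile $PBS_NODEFILE -tag-output \
        sh -c 'hostname; ls -ld /tmp; df -h ${TMPDIR:-/tmp}'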