Please try it again with the attached patch. The --disable-vt is fine.

Thanks
Ralph

Attachment: user2.diff
Description: Binary data
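Applying it would presumably follow the same steps used for user.diff further down in the thread; a minimal sketch, assuming the patch is applied from the top of the already-configured openmpi-1.7rc8 source tree (directory name is illustrative):

  # sketch only: apply the attached patch, then rebuild and reinstall
  cd openmpi-1.7rc8
  patch -p0 < user2.diff
  make
  make install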


On Mar 20, 2013, at 7:47 PM, tmish...@jcity.maeda.co.jp wrote:

> 
> 
> Hi Ralph,
> 
> I have completed the rebuild of openmpi-1.7rc8.
> To save time, I added --disable-vt. (Is it OK?)
> 
> Well, what shall I do next?
> 
> ./configure \
> --prefix=/home/mishima/opt/mpi/openmpi-1.7rc8-pgi12.9 \
> --with-tm \
> --with-verbs \
> --disable-ipv6 \
> --disable-vt \
> --enable-debug \
> CC=pgcc CFLAGS="-fast -tp k8-64e" \
> CXX=pgCC CXXFLAGS="-fast -tp k8-64e" \
> F77=pgfortran FFLAGS="-fast -tp k8-64e" \
> FC=pgfortran FCFLAGS="-fast -tp k8-64e"
> 
> Note:
> I tried the patch user.diff after rebuilding openmpi-1.7rc8,
> but I got an error and could not go forward.
> 
> $ patch -p0 < user.diff  # this is OK
> $ make                   # I got an error
> 
>  CC       util/hostfile/hostfile.lo
> PGC-S-0037-Syntax error: Recovery attempted by deleting <string>
> (util/hostfile/hostfile.c: 728)
> PGC/x86-64 Linux 12.9-0: compilation completed with severe errors
> 
> Regards,
> Tetsuya Mishima
> 
>> Could you please apply the attached patch and try it again? If you haven't
>> had time to configure with --enable-debug, that is fine - this will output regardless.
>> 
>> Thanks
>> Ralph
>> 
>> - user.diff
>> 
>> 
>> On Mar 20, 2013, at 4:59 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> 
>>> You obviously have some MCA params set somewhere:
>>> 
>>>> 
>>>> --------------------------------------------------------------------------
>>>> A deprecated MCA parameter value was specified in an MCA parameter
>>>> file.  Deprecated MCA parameters should be avoided; they may disappear
>>>> in future releases.
>>>> 
>>>> Deprecated parameter: orte_rsh_agent
>>>> 
>>>> --------------------------------------------------------------------------
>>> 
>>> Check your environment for anything with OMPI_MCA_xxx, and your default
>>> MCA parameter file to see what has been specified.
>>> 
>>> The allocation looks okay - I'll have to look for other debug flags you
>>> can set. Meantime, can you please add --enable-debug to your configure cmd
>>> line and rebuild?
>>> 
>>> Thanks
>>> Ralph
>>> 
>>> 
>>> On Mar 20, 2013, at 4:39 PM, tmish...@jcity.maeda.co.jp wrote:
>>> 
>>>> 
>>>> 
>>>> Hi Ralph,
>>>> 
>>>> Here is a result of rerun with --display-allocation.
>>>> I set OMP_NUM_THREADS=1 to make the problem clear.
>>>> 
>>>> Regards,
>>>> Tetsuya Mishima
>>>> 
>>>> P.S. As far as I checked, these two cases are OK (no problem):
>>>> (1) mpirun -v -np $NPROCS -x OMP_NUM_THREADS --display-allocation
>>>> ~/Ducom/testbed/mPre m02-ld
>>>> (2) mpirun -v -x OMP_NUM_THREADS --display-allocation
>>>> ~/Ducom/testbed/mPre m02-ld
>>>> 
>>>> Script File:
>>>> 
>>>> #!/bin/sh
>>>> #PBS -A tmishima
>>>> #PBS -N Ducom-run
>>>> #PBS -j oe
>>>> #PBS -l nodes=2:ppn=4
>>>> export OMP_NUM_THREADS=1
>>>> cd $PBS_O_WORKDIR
>>>> cp $PBS_NODEFILE pbs_hosts
>>>> NPROCS=`wc -l < pbs_hosts`
>>>> mpirun -v -np $NPROCS -hostfile pbs_hosts -x OMP_NUM_THREADS
>>>> --display-allocation ~/Ducom/testbed/mPre m02-ld
>>>> 
>>>> Output:
>>>> 
>>>> --------------------------------------------------------------------------
>>>> A deprecated MCA parameter value was specified in an MCA parameter
>>>> file.  Deprecated MCA parameters should be avoided; they may disappear
>>>> in future releases.
>>>> 
>>>> Deprecated parameter: orte_rsh_agent
>>>> 
>>>> --------------------------------------------------------------------------
>>>> 
>>>> ======================   ALLOCATED NODES   ======================
>>>> 
>>>> Data for node: node06  Num slots: 4    Max slots: 0
>>>> Data for node: node05  Num slots: 4    Max slots: 0
>>>> 
>>>> =================================================================
>>>> 
>>>> --------------------------------------------------------------------------
>>>> A hostfile was provided that contains at least one node not
>>>> present in the allocation:
>>>> 
>>>> hostfile:  pbs_hosts
>>>> node:      node06
>>>> 
>>>> If you are operating in a resource-managed environment, then only
>>>> nodes that are in the allocation can be used in the hostfile. You
>>>> may find relative node syntax to be a useful alternative to
>>>> specifying absolute node names see the orte_hosts man page for
>>>> further information.
>>>> 
>>>> --------------------------------------------------------------------------
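For reference, the "relative node syntax" suggested in the help message above lets a hostfile refer to nodes by their position in the resource manager's allocation rather than by absolute name. A hypothetical pbs_hosts using it might look like the sketch below (indices and slot counts are illustrative; see the orte_hosts man page for the exact form):

  +n0 slots=4    # first node in the allocation
  +n1 slots=4    # second node in the allocation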
>>>> 
>>>> 
>>>>> I've submitted a patch to fix the Torque launch issue - just some
>>>>> leftover garbage that existed at the time of the 1.7.0 branch and didn't
>>>>> get removed.
>>>>> 
>>>>> For the hostfile issue, I'm stumped as I can't see how the problem would
>>>>> come about. Could you please rerun your original test and add
>>>>> "--display-allocation" to your cmd line? Let's see if it is
>>>>> correctly finding the original allocation.
>>>>> 
>>>>> Thanks
>>>>> Ralph
>>>>> 
>>>>> On Mar 19, 2013, at 5:08 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Hi Gus,
>>>>>> 
>>>>>> Thank you for your comments. I understand your advice.
>>>>>> Our script used to be --npernode type as well.
>>>>>> 
>>>>>> As I told before, our cluster consists of nodes having 4, 8,
>>>>>> and 32 cores, although it used to be homogeneous in the
>>>>>> beginning. Furthermore, since the performance of each core
>>>>>> is almost the same, a mixed use of nodes with different numbers
>>>>>> of cores is possible, just like #PBS -l nodes=1:ppn=32+4:ppn=8.
>>>>>> 
>>>>>> The --npernode approach is not applicable to such mixed use.
>>>>>> That's why I'd like to continue to use a modified hostfile.
>>>>>> 
>>>>>> By the way, the problem I reported to Jeff yesterday
>>>>>> was that something is wrong with openmpi-1.7 under Torque,
>>>>>> because it caused an error even in such a simple case as
>>>>>> the one shown below, which surprised me. So the problem is not
>>>>>> limited to the modified hostfile, I guess.
>>>>>> 
>>>>>> #PBS -l nodes=4:ppn=8
>>>>>> mpirun -np 8 ./my_program
>>>>>> (OMP_NUM_THREADS=4)
>>>>>> 
>>>>>> Regards,
>>>>>> Tetsuya Mishima
>>>>>> 
>>>>>>> Hi Tetsuya
>>>>>>> 
>>>>>>> Your script that edits $PBS_NODEFILE into a separate hostfile
>>>>>>> is very similar to some that I used here for
>>>>>>> hybrid OpenMP+MPI programs on older versions of OMPI.
>>>>>>> I haven't tried this in 1.6.X,
>>>>>>> but it looks like you did and it works also.
>>>>>>> I haven't tried 1.7 either.
>>>>>>> Since we run production machines,
>>>>>>> I try to stick to the stable versions of OMPI (even numbered:
>>>>>>> 1.6.X, 1.4.X, 1.2.X).
>>>>>>> 
>>>>>>> I believe you can get the same effect even if you
>>>>>>> don't edit your $PBS_NODEFILE and let OMPI use it as is.
>>>>>>> Say, if you choose carefully the values in your
>>>>>>> #PBS -l nodes=?:ppn=?
>>>>>>> and of your
>>>>>>> $OMP_NUM_THREADS
>>>>>>> and use an mpiexec with --npernode or --cpus-per-proc.
>>>>>>> 
>>>>>>> For instance, for twelve MPI processes, with two threads each,
>>>>>>> on nodes with eight cores each, I would try
>>>>>>> (but I haven't tried!):
>>>>>>> 
>>>>>>> #PBS -l nodes=3:ppn=8
>>>>>>> 
>>>>>>> export OMP_NUM_THREADS=2
>>>>>>> 
>>>>>>> mpiexec -np 12 -npernode 4
>>>>>>> 
>>>>>>> or perhaps more tightly:
>>>>>>> 
>>>>>>> mpiexec -np 12 --report-bindings --bind-to-core --cpus-per-proc 2
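Put together as a complete Torque job script, the first suggestion above might look like the sketch below (untested, echoing Gus's own "I haven't tried" caveat; program name taken from earlier in the thread):

  #!/bin/sh
  #PBS -l nodes=3:ppn=8
  export OMP_NUM_THREADS=2
  cd $PBS_O_WORKDIR
  # 12 MPI ranks, 4 per 8-core node, leaving 2 cores per rank for its OpenMP threads
  mpiexec -np 12 -npernode 4 ./my_program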
>>>>>>> 
>>>>>>> I hope this helps,
>>>>>>> Gus Correa
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On 03/19/2013 03:12 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Hi Reuti and Gus,
>>>>>>>> 
>>>>>>>> Thank you for your comments.
>>>>>>>> 
>>>>>>>> Our cluster is a little bit heterogeneous; it has nodes with 4, 8,
>>>>>>>> and 32 cores.
>>>>>>>> I used 8-core nodes for "-l nodes=4:ppn=8" and 4-core nodes for
>>>>>>>> "-l nodes=2:ppn=4".
>>>>>>>> (Strictly speaking, Torque picked up the proper nodes.)
>>>>>>>> 
>>>>>>>> As I mentioned before, I usually use openmpi-1.6.x, which has no trouble
>>>>>>>> with that kind of use. I encountered the issue when I was evaluating
>>>>>>>> openmpi-1.7 to check when we could move on to it, although we have no
>>>>>>>> compelling reason to do that at this moment.
>>>>>>>> 
>>>>>>>> As Gus pointed out, I use a script file as shown below for a practical
>>>>>>>> use of openmpi-1.6.x.
>>>>>>>> 
>>>>>>>> #PBS -l nodes=2:ppn=32   # even "-l nodes=1:ppn=32+4:ppn=8" works fine
>>>>>>>> export OMP_NUM_THREADS=4
>>>>>>>> modify $PBS_NODEFILE pbs_hosts   # 64 lines are condensed to 16 lines here
>>>>>>>> mpirun -hostfile pbs_hosts -np 16 -cpus-per-proc 4 -report-bindings \
>>>>>>>>   -x OMP_NUM_THREADS ./my_program   # 32-core node has 8 numanodes, 8-core node has 2 numanodes
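The "modify" step above is the user's own helper and was not posted; a minimal sketch of what such a condensation could look like, assuming one hostfile line should remain per group of OMP_NUM_THREADS consecutive slots (so 64 nodefile lines become 16 when OMP_NUM_THREADS=4):

  # hypothetical condensation of the Torque nodefile for a hybrid MPI+OpenMP run:
  # print every OMP_NUM_THREADS-th occurrence of each node name
  awk -v t="$OMP_NUM_THREADS" 'cnt[$1]++ % t == 0' "$PBS_NODEFILE" > pbs_hosts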
>>>>>>>> 
>>>>>>>> It works well under the combination of openmpi-1.6.x and Torque. The
>>>>>>>> problem is just openmpi-1.7's behavior.
>>>>>>>> openmpi-1.7's behavior.
>>>>>>>> 
>>>>>>>> Regards,
>>>>>>>> Tetsuya Mishima
>>>>>>>> 
>>>>>>>>> Hi Tetsuya Mishima
>>>>>>>>> 
>>>>>>>>> Mpiexec offers you a number of possibilities that you could try:
>>>>>>>>> --bynode,
>>>>>>>>> --pernode,
>>>>>>>>> --npernode,
>>>>>>>>> --bysocket,
>>>>>>>>> --bycore,
>>>>>>>>> --cpus-per-proc,
>>>>>>>>> --cpus-per-rank,
>>>>>>>>> --rankfile
>>>>>>>>> and more.
>>>>>>>>> 
>>>>>>>>> Most likely one or more of them will fit your needs.
>>>>>>>>> 
>>>>>>>>> There are also associated flags to bind processes to cores,
>>>>>>>>> to sockets, etc, to report the bindings, and so on.
>>>>>>>>> 
>>>>>>>>> Check the mpiexec man page for details.
>>>>>>>>> 
>>>>>>>>> Nevertheless, I am surprised that modifying the
>>>>>>>>> $PBS_NODEFILE doesn't work for you in OMPI 1.7.
>>>>>>>>> I have done this many times in older versions of OMPI.
>>>>>>>>> 
>>>>>>>>> Would it work for you to go back to the stable OMPI 1.6.X,
>>>>>>>>> or does it lack any special feature that you need?
>>>>>>>>> 
>>>>>>>>> I hope this helps,
>>>>>>>>> Gus Correa
>>>>>>>>> 
>>>>>>>>> On 03/19/2013 03:00 AM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Hi Jeff,
>>>>>>>>>> 
>>>>>>>>>> I didn't have much time to test this morning, so I checked it again
>>>>>>>>>> now. The trouble seems to depend on the number of nodes used.
>>>>>>>>>> 
>>>>>>>>>> This works (nodes < 4):
>>>>>>>>>> mpiexec -bynode -np 4 ./my_program     #PBS -l nodes=2:ppn=8
>>>>>>>>>> (OMP_NUM_THREADS=4)
>>>>>>>>>> 
>>>>>>>>>> This causes an error (nodes >= 4):
>>>>>>>>>> mpiexec -bynode -np 8 ./my_program     #PBS -l nodes=4:ppn=8
>>>>>>>>>> (OMP_NUM_THREADS=4)
>>>>>>>>>> 
>>>>>>>>>> Regards,
>>>>>>>>>> Tetsuya Mishima
>>>>>>>>>> 
>>>>>>>>>>> Oy; that's weird.
>>>>>>>>>>> 
>>>>>>>>>>> I'm afraid we're going to have to wait for Ralph to answer why that
>>>>>>>>>>> is happening -- sorry!
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Mar 18, 2013, at 4:45 PM, <tmish...@jcity.maeda.co.jp> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi Correa and Jeff,
>>>>>>>>>>>> 
>>>>>>>>>>>> Thank you for your comments. I quickly checked your suggestion.
>>>>>>>>>>>> 
>>>>>>>>>>>> As a result, my simple example case worked well.
>>>>>>>>>>>> export OMP_NUM_THREADS=4
>>>>>>>>>>>> mpiexec -bynode -np 2 ./my_program     #PBS -l nodes=2:ppn=4
>>>>>>>>>>>> 
>>>>>>>>>>>> But a practical case where more than one process was allocated to a
>>>>>>>>>>>> node, like the one below, did not work:
>>>>>>>>>>>> export OMP_NUM_THREADS=4
>>>>>>>>>>>> mpiexec -bynode -np 4 ./my_program     #PBS -l nodes=2:ppn=8
>>>>>>>>>>>> 
>>>>>>>>>>>> The error message is as follows:
>>>>>>>>>>>> [node08.cluster:11946] [[30666,0],3] ORTE_ERROR_LOG: A message is
>>>>>>>>>>>> attempting to be sent to a process whose contact information is
>>>>>>>>>>>> unknown in file rml_oob_send.c at line 316
>>>>>>>>>>>> [node08.cluster:11946] [[30666,0],3] unable to find address for
>>>>>>>>>>>> [[30666,0],1]
>>>>>>>>>>>> [node08.cluster:11946] [[30666,0],3] ORTE_ERROR_LOG: A message is
>>>>>>>>>>>> attempting to be sent to a process whose contact information is
>>>>>>>>>>>> unknown in file base/grpcomm_base_rollup.c at line 123
>>>>>>>>>>>> 
>>>>>>>>>>>> Here is our openmpi configuration:
>>>>>>>>>>>> ./configure \
>>>>>>>>>>>> --prefix=/home/mishima/opt/mpi/openmpi-1.7rc8-pgi12.9 \
>>>>>>>>>>>> --with-tm \
>>>>>>>>>>>> --with-verbs \
>>>>>>>>>>>> --disable-ipv6 \
>>>>>>>>>>>> CC=pgcc CFLAGS="-fast -tp k8-64e" \
>>>>>>>>>>>> CXX=pgCC CXXFLAGS="-fast -tp k8-64e" \
>>>>>>>>>>>> F77=pgfortran FFLAGS="-fast -tp k8-64e" \
>>>>>>>>>>>> FC=pgfortran FCFLAGS="-fast -tp k8-64e"
>>>>>>>>>>>> 
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Tetsuya Mishima
>>>>>>>>>>>> 
>>>>>>>>>>>>> On Mar 17, 2013, at 10:55 PM, Gustavo Correa <g...@ldeo.columbia.edu> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> In your example, have you tried not to modify the node file,
>>>>>>>>>>>>>> launch two mpi processes with mpiexec, and request a "-bynode"
>>>>>>>>>>>>>> distribution of processes:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> mpiexec -bynode -np 2 ./my_program
>>>>>>>>>>>>> 
>>>>>>>>>>>>> This should work in 1.7, too (I use these kinds of options with
>>>>>>>>>>>>> SLURM all the time).
>>>>>>>>>>>>> 
>>>>>>>>>>>>> However, we should probably verify that the hostfile functionality
>>>>>>>>>>>>> in batch jobs hasn't been broken in 1.7, too, because I'm pretty sure
>>>>>>>>>>>>> that what you described should work.  However, Ralph, our
>>>>>>>>>>>>> run-time guy, is on vacation this week.  There might be a delay in
>>>>>>>>>>>>> checking into this.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Jeff Squyres
>>>>>>>>>>>>> jsquy...@cisco.com
>>>>>>>>>>>>> For corporate legal information go to:
>>>>>>>>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 