Re: [OMPI users] modified hostfile does not work with openmpi1.7rc8

2013-03-20 Thread Ralph Castain
I've submitted a patch to fix the Torque launch issue - just some leftover 
garbage that existed at the time of the 1.7.0 branch and didn't get removed.

For the hostfile issue, I'm stumped as I can't see how the problem would come 
about. Could you please rerun your original test and add "--display-allocation" 
to your cmd line? Let's see if it is correctly finding the original allocation.
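
For example, a rough sketch of what I mean (substitute your actual hostfile and program):

mpirun -np 8 -hostfile pbs_hosts --display-allocation ./my_program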

Thanks
Ralph

On Mar 19, 2013, at 5:08 PM, tmish...@jcity.maeda.co.jp wrote:

> 
> 
> Hi Gus,
> 
> Thank you for your comments. I understand your advice.
> Our script used to be of the --npernode type as well.
> 
> As I mentioned before, our cluster consists of nodes having 4, 8,
> and 32 cores, although it was homogeneous at the start. Furthermore,
> since the performance of each core is almost the same, a mixed use of
> nodes with different numbers of cores is possible, for example
> #PBS -l nodes=1:ppn=32+4:ppn=8.
> 
> The --npernode approach is not applicable to such a mixed use.
> That's why I'd like to continue to use a modified hostfile.
> 
> By the way, the problem I reported to Jeff yesterday
> was that something is wrong with openmpi-1.7 under Torque,
> because it caused an error even for a simple case like the
> one shown below, which surprised me. So the problem is not
> limited to the modified hostfile, I guess.
> 
> #PBS -l nodes=4:ppn=8
> mpirun -np 8 ./my_program
> (OMP_NUM_THREADS=4)
> 
> Regards,
> Tetsuya Mishima
> 
>> Hi Tetsuya
>> 
>> Your script that edits $PBS_NODEFILE into a separate hostfile
>> is very similar to some that I used here for
>> hybrid OpenMP+MPI programs on older versions of OMPI.
>> I haven't tried this in 1.6.X,
>> but it looks like you did and it works also.
>> I haven't tried 1.7 either.
>> Since we run production machines,
>> I try to stick to the stable versions of OMPI (even numbered:
>> 1.6.X, 1.4.X, 1.2.X).
>> 
>> I believe you can get the same effect even if you
>> don't edit your $PBS_NODEFILE and let OMPI use it as is.
>> Say, if you carefully choose the values in your
>> #PBS -l nodes=?:ppn=?
>> and your
>> $OMP_NUM_THREADS
>> and use mpiexec with --npernode or --cpus-per-proc.
>> 
>> For instance, for twelve MPI processes, with two threads each,
>> on nodes with eight cores each, I would try
>> (but I haven't tried!):
>> 
>> #PBS -l nodes=3:ppn=8
>> 
>> export OMP_NUM_THREADS=2
>> 
>> mpiexec -np 12 -npernode 4
>> 
>> or perhaps more tightly:
>> 
>> mpiexec -np 12 --report-bindings --bind-to-core --cpus-per-proc 2
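>> 
>> Put together as a full job script, that could look roughly like the
>> following (an untested sketch; my_program is just a placeholder):
>> 
>> #!/bin/sh
>> #PBS -l nodes=3:ppn=8
>> export OMP_NUM_THREADS=2
>> cd $PBS_O_WORKDIR
>> mpiexec -np 12 --report-bindings --bind-to-core --cpus-per-proc 2 \
>>   -x OMP_NUM_THREADS ./my_program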
>> 
>> I hope this helps,
>> Gus Correa
>> 
>> 
>> 
>> On 03/19/2013 03:12 PM, tmish...@jcity.maeda.co.jp wrote:
>>> 
>>> 
>>> Hi Reuti and Gus,
>>> 
>>> Thank you for your comments.
>>> 
>>> Our cluster is a little bit heterogeneous; it has nodes with 4, 8, and
>>> 32 cores.
>>> I used 8-core nodes for "-l nodes=4:ppn=8" and 4-core nodes for "-l
>>> nodes=2:ppn=4".
>>> (Strictly speaking, Torque picked the proper nodes.)
>>> 
>>> As I mentioned before, I usually use openmpi-1.6.x, which has no trouble
>>> with that kind of use. I encountered the issue when I was evaluating
>>> openmpi-1.7 to check when we could move on to it, although we have no
>>> pressing reason to do so at the moment.
>>> 
>>> As Gus pointed out, I use a script file, shown below, for practical use
>>> of openmpi-1.6.x.
>>> 
>>> #PBS -l nodes=2:ppn=32  # even "-l nodes=1:ppn=32+4:ppn=8" works fine
>>> export OMP_NUM_THREADS=4
>>> modify $PBS_NODEFILE pbs_hosts  # 64 lines are condensed to 16 lines here
>>> mpirun -hostfile pbs_hosts -np 16 -cpus-per-proc 4 -report-bindings \
>>>   -x OMP_NUM_THREADS ./my_program  # a 32-core node has 8 NUMA nodes,
>>>                                    # an 8-core node has 2 NUMA nodes
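>>> 
>>> (The "modify" step above is a local helper that is not shown in this
>>> thread. Purely as an illustration, a minimal shell sketch that would do
>>> the same 64-to-16 condensation, assuming the nodefile lists one line per
>>> core and one rank is kept per 4-core block, might be:
>>> 
>>> awk 'NR % 4 == 1' $PBS_NODEFILE > pbs_hosts   # keep every 4th entry
>>> 
>>> The actual helper may of course do something different.)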
>>> 
>>> It works well under the combination of openmpi-1.6.x and Torque. The
>>> problem is just openmpi-1.7's behavior.
>>> 
>>> Regards,
>>> Tetsuya Mishima
>>> 
 Hi Tetsuya Mishima
 
 Mpiexec offers you a number of possibilities that you could try:
 --bynode,
 --pernode,
 --npernode,
 --bysocket,
 --bycore,
 --cpus-per-proc,
 --cpus-per-rank,
 --rankfile
 and more.
 
 Most likely one or more of them will fit your needs.
 
 There are also associated flags to bind processes to cores,
 to sockets, etc, to report the bindings, and so on.
 
 Check the mpiexec man page for details.
 
 Nevertheless, I am surprised that modifying the
 $PBS_NODEFILE doesn't work for you in OMPI 1.7.
 I have done this many times in older versions of OMPI.
 
 Would it work for you to go back to the stable OMPI 1.6.X,
 or does it lack any special feature that you need?
 
 I hope this helps,
 Gus Correa
 
 On 03/19/2013 03:00 AM, tmish...@jcity.maeda.co.jp wrote:
> 
> 
> Hi Jeff,
> 
> I didn't have much time to test this morning. So, I checked it again
> now. Then, the trouble seems to depend on the number of nodes to use.
> 
> This works (nodes < 4):
> mpiexec -bynode -np 4 ./my_program&&   

Re: [OMPI users] Minor bug: invalid values for opal_signal MCA parameter cause internal error

2013-03-20 Thread Ralph Castain
Simple to do - I added a clearer error message to the trunk and marked it for 
inclusion in the eventual v1.7.1 release. I'll have to let someone else do the 
docs as I don't fully grok the rationale behind it.

Thanks

On Mar 18, 2013, at 12:56 PM, Jeremiah Willcock  wrote:

> If a user gives an invalid value for the opal_signal MCA parameter, such as 
> in the command:
> 
> mpirun -mca opal_signal x /bin/ls
> 
> the error produced by Open MPI 1.6.3 is:
> 
> --
> It looks like opal_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during opal_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
> 
>  opal_util_register_stackhandlers failed
>  --> Returned value -5 instead of OPAL_SUCCESS
> --
> 
> which claims to be an internal error, not an invalid argument given by a 
> user.  That parameter also appears to be poorly documented in general 
> (mentioned in ompi_info -a and on the mailing lists), and seems like it would 
> be an incredibly useful debugging tool when running a crashing application 
> under a debugger.
> 
> -- Jeremiah Willcock
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




[OMPI users] mpirun error output

2013-03-20 Thread Bruno Cramer
Hi,

1) Open MPI on PC1

I installed openmpi-1.4.3 using the OpenSuse 32-bit v. 12.1 repository,

as well as the openmpi devel package.

All MPI executables are present, as are the libraries in the lib directory.

I set the environment as follows (.bashrc):


PATH=$PATH:/usr/lib/mpi/gcc/openmpi/bin

PATH=$PATH:/usr/lib/mpi/gcc/openmpi/lib

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/mpi/gcc/openmpi/lib

export PATH

 When I run any of the test examples (e.g. mpirun hello_c.c) or any program
that has the MPI interface included, I get the message

-

mpirun was unable to launch the specified application as it could not find
an executable:

Executable: hello_c.c

Node: linux-curie

while attempting to start process rank 0.

---

Typing echo $LD_LIBRARY_PATH, I should get something like
/usr/lib/mpi/gcc/openmpi/lib. The only output I get is
/usr/local/atlas3.10/lib (which is the BLAS/LAPACK library). The Intel
compilers library is not shown either.



2) Open MPI installation on PC2

On OpenSuse v 12.1 64-bit I installed openmpi-1.4.3, downloading it from the
Open MPI site.

No error occurred during the ./configure, make, make install process.

The environment settings differ a little but are very similar to those
mentioned under PC1.

The same message as above occurs.

In this case, typing echo $LD_LIBRARY_PATH I get the correct output for the
MPI library, /usr/local/lib64, and the executables are in /usr/local/bin.


Any help is welcome.


Regards

Bruno


Re: [OMPI users] mpirun error output

2013-03-20 Thread Ralph Castain
Well, a couple of things come to mind - see below

On Mar 20, 2013, at 9:41 AM, Bruno Cramer  wrote:

> Hi,
> 1) Openmpi in PC1
> I installed openmpi-1.4.3 using the  OpenSuse 32b v. 12.1  repository
> as well as openmpi devel
> All mpi executables are present so are the libraries in lib directory.
> I set the environment as ( .bashrc)
> 
> 
> PATH=$PATH:/usr/lib/mpi/gcc/openmpi/bin
> PATH=$PATH:/usr/lib/mpi/gcc/openmpi/lib
> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/mpi/gcc/openmpi/lib
> export PATH

You should reverse the ordering here - always put the OMPI path element first, 
then the existing one, to ensure that you are getting the intended version. Lots 
of operating systems come with an older version pre-installed in a standard 
location.
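
For instance, a minimal sketch of the reversed ordering, reusing the same 
install prefix from your report (untested):

export PATH=/usr/lib/mpi/gcc/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/usr/lib/mpi/gcc/openmpi/lib:$LD_LIBRARY_PATH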

>  
> When I run any of the test examples (eg. mpirun hello_c.c or any program that 
> has mpi interface included I get the message
> -
> mpirun was unable to launch the specified application as it could not find an 
> executable:
> Executable: hello_c.c
> Node: linux-curie
> while attempting to start process rank 0.

Look at the executable - apparently, you tried to run the ".c" source code 
instead of the compiled executable :-)
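
For reference, a minimal sketch of building and running that example (assuming 
the mpicc wrapper from the same installation is in your PATH):

mpicc hello_c.c -o hello_c
mpirun -np 2 ./hello_c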

> ---
> typing echo $LD_LIBRARY_PATH I should get something like 
> /usr/lib/mpi/gcc/openmpi/lib. The only output I get is   
> /usr/local/atlas3.10/lib (which is the blas/lapack library). Also Intel 
> compilers library is not shown. 

I suspect that your original LD_LIBRARY_PATH was empty, so now the path starts 
with a ":" and makes bash unhappy. Try reversing the order as above and it 
might work.

> 
> 
> 
> 1) Openmpi installation in PC2
> In OpenSuse v 12.1 64b I installed openmpi-1.4.3 downloading it from the 
> openmpi site.
> No error occured during ./configure, make,  make install process.
> The environment settings change a little but are very similar to those 
> mentioned under PC1.
> The same message as above is occuring.
> 
> in this case typing echo $LD_LIBRARY_PATH I get the correct output from the 
> mpi library as /usr/local/lib64 and the executables are in /usr/local/bin.
> 
> 
> Any help is wellcome
> 
> 
> Regards
> Bruno
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] mpirun error output

2013-03-20 Thread Reuti
On 20.03.2013 at 18:58, Ralph Castain wrote:

> Well, a couple of things come to mind - see below
> 
> On Mar 20, 2013, at 9:41 AM, Bruno Cramer  wrote:
> 
>> Hi,
>> 1) Openmpi in PC1
>> I installed openmpi-1.4.3 using the  OpenSuse 32b v. 12.1  repository
>> as well as openmpi devel
>> All mpi executables are present so are the libraries in lib directory.
>> I set the environment as ( .bashrc)
>> 
>> 
>> PATH=$PATH:/usr/lib/mpi/gcc/openmpi/bin
>> PATH=$PATH:/usr/lib/mpi/gcc/openmpi/lib
>> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/mpi/gcc/openmpi/lib
>> export PATH
> 
> You should reverse the ordering here - always put the OMPI path element 
> first, then the existing one, to ensure that you are getting the intended 
> version. Lot of operating systems come with an older version pre-installed in 
> a standard location.
> 
>>  
>> When I run any of the test examples (eg. mpirun hello_c.c or any program 
>> that has mpi interface included I get the message
>> -
>> mpirun was unable to launch the specified application as it could not find 
>> an executable:
>> Executable: hello_c.c
>> Node: linux-curie
>> while attempting to start process rank 0.
> 
> Look at the executable - apparently, you tried to run the ".c" source code 
> instead of the compiled executable :-)
> 
>> ---
>> typing echo $LD_LIBRARY_PATH I should get something like 
>> /usr/lib/mpi/gcc/openmpi/lib. The only output I get is   
>> /usr/local/atlas3.10/lib (which is the blas/lapack library). Also Intel 
>> compilers library is not shown. 
> 
> I suspect that your original LD_LIBRARY_PATH was empty, so now the path 
> starts with a ":" and makes bash unhappy. Try reversing the order as above 
> and it might work.

AFAIK additional colons don't matter, but for cosmetic reasons I nevertheless 
prefer:

$ export LD_LIBRARY_PATH=/usr/lib/mpi/gcc/openmpi/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}

to avoid a superfluous colon.
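
As a quick illustration of that expansion (hypothetical values):

$ unset LD_LIBRARY_PATH
$ export LD_LIBRARY_PATH=/usr/lib/mpi/gcc/openmpi/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
$ echo $LD_LIBRARY_PATH
/usr/lib/mpi/gcc/openmpi/lib
$ export LD_LIBRARY_PATH=/usr/local/atlas3.10/lib
$ export LD_LIBRARY_PATH=/usr/lib/mpi/gcc/openmpi/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
$ echo $LD_LIBRARY_PATH
/usr/lib/mpi/gcc/openmpi/lib:/usr/local/atlas3.10/lib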

-- Reuti


>> 
>> 
>> 
>> 1) Openmpi installation in PC2
>> In OpenSuse v 12.1 64b I installed openmpi-1.4.3 downloading it from the 
>> openmpi site.
>> No error occured during ./configure, make,  make install process.
>> The environment settings change a little but are very similar to those 
>> mentioned under PC1.
>> The same message as above is occuring.
>> 
>> in this case typing echo $LD_LIBRARY_PATH I get the correct output from the 
>> mpi library as /usr/local/lib64 and the executables are in /usr/local/bin.
>> 
>> 
>> 
>> Any help is wellcome
>> 
>> 
>> Regards
>> Bruno
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




[OMPI users] "Error setting file view" NPB BTIO

2013-03-20 Thread kmehta
Hello,

I am running the NAS Parallel Benchmarks' BTIO benchmark (NPB v 3.3) for class
D and 1 process:

`make bt CLASS=D SUBTYPE=full NPROCS=1`

I have provided gcc's `-mcmodel=medium` flag along with -O3 during
compilation. This is on an x86_64 machine.

I have tested with openmpi 1.4.3 and 1.7, but I get "Error setting file view"
when I run the benchmark. It works fine for 4 and 16 processes. Can someone
point out what is going wrong? Thanks in advance.


NAS Parallel Benchmarks 3.3 -- BT Benchmark

 No input file inputbt.data. Using compiled defaults
 Size:  408x 408x 408
 Iterations:  250dt:   0.200
 Number of active processes: 1

 BTIO -- FULL MPI-IO write interval:   5

 Error setting file view
--
mpirun has exited due to process rank 0 with PID 6663 on
node crill-003 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--





Regards,
Kshitij Mehta
PhD student
University of Houston




Re: [OMPI users] modified hostfile does not work with openmpi1.7rc8

2013-03-20 Thread tmishima


Hi Ralph,

Here is the result of a rerun with --display-allocation.
I set OMP_NUM_THREADS=1 to make the problem clear.

Regards,
Tetsuya Mishima

P.S. As far as I checked, these 2 cases are OK (no problem):
(1) mpirun -v -np $NPROCS -x OMP_NUM_THREADS --display-allocation
~/Ducom/testbed/mPre m02-ld
(2) mpirun -v -x OMP_NUM_THREADS --display-allocation ~/Ducom/testbed/mPre
m02-ld

Script File:

#!/bin/sh
#PBS -A tmishima
#PBS -N Ducom-run
#PBS -j oe
#PBS -l nodes=2:ppn=4
export OMP_NUM_THREADS=1
cd $PBS_O_WORKDIR
cp $PBS_NODEFILE pbs_hosts
NPROCS=`wc -l < pbs_hosts`
mpirun -v -np $NPROCS -hostfile pbs_hosts -x OMP_NUM_THREADS
--display-allocation ~/Ducom/testbed/mPre m02-ld

Output:
--
A deprecated MCA parameter value was specified in an MCA parameter
file.  Deprecated MCA parameters should be avoided; they may disappear
in future releases.

  Deprecated parameter: orte_rsh_agent
--

==   ALLOCATED NODES   ==

 Data for node: node06  Num slots: 4    Max slots: 0
 Data for node: node05  Num slots: 4    Max slots: 0

=
--
A hostfile was provided that contains at least one node not
present in the allocation:

  hostfile:  pbs_hosts
  node:  node06

If you are operating in a resource-managed environment, then only
nodes that are in the allocation can be used in the hostfile. You
may find relative node syntax to be a useful alternative to
specifying absolute node names see the orte_hosts man page for
further information.
--


> I've submitted a patch to fix the Torque launch issue - just some
leftover garbage that existed at the time of the 1.7.0 branch and didn't
get removed.
>
> For the hostfile issue, I'm stumped as I can't see how the problem would
come about. Could you please rerun your original test and add
"--display-allocation" to your cmd line? Let's see if it is
> correctly finding the original allocation.
>
> Thanks
> Ralph
>
> On Mar 19, 2013, at 5:08 PM, tmish...@jcity.maeda.co.jp wrote:
>
> >
> >
> > Hi Gus,
> >
> > Thank you for your comments. I understand your advice.
> > Our script used to be --npernode type as well.
> >
> > As I told before, our cluster consists of nodes having 4, 8,
> > and 32 cores, although it used to be homogeneous at the
> > starting time. Furthermore, since performance of each core
> > is almost same, a mixed use of nodes with different number
> > of cores is possible, just like #PBS -l nodes=1:ppn=32+4:ppn=8.
> >
> > --npernode type is not applicable to such a mixed use.
> > That's why I'd like to continue to use modified hostfile.
> >
> > By the way, the problem I reported to Jeff yesterday
> > was that openmpi-1.7 with torque is something wrong,
> > because it caused error against such a simple case as
> > shown below, which surprised me. Now, the problem is not
> > limited to modified hostfile, I guess.
> >
> > #PBS -l nodes=4:ppn=8
> > mpirun -np 8 ./my_program
> > (OMP_NUM_THREADS=4)
> >
> > Regards,
> > Tetsuya Mishima
> >
> >> Hi Tetsuya
> >>
> >> Your script that edits $PBS_NODEFILE into a separate hostfile
> >> is very similar to some that I used here for
> >> hybrid OpenMP+MPI programs on older versions of OMPI.
> >> I haven't tried this in 1.6.X,
> >> but it looks like you did and it works also.
> >> I haven't tried 1.7 either.
> >> Since we run production machines,
> >> I try to stick to the stable versions of OMPI (even numbered:
> >> 1.6.X, 1.4.X, 1.2.X).
> >>
> >> I believe you can get the same effect even if you
> >> don't edit your $PBS_NODEFILE and let OMPI use it as is.
> >> Say, if you choose carefully the values in your
> >> #PBS -l nodes=?:ppn=?
> >> of your
> >> $OMP_NUM_THREADS
> >> and use an mpiexec with --npernode or --cpus-per-proc.
> >>
> >> For instance, for twelve MPI processes, with two threads each,
> >> on nodes with eight cores each, I would try
> >> (but I haven't tried!):
> >>
> >> #PBS -l nodes=3:ppn=8
> >>
> >> export $OMP_NUM_THREADS=2
> >>
> >> mpiexec -np 12 -npernode 4
> >>
> >> or perhaps more tightly:
> >>
> >> mpiexec -np 12 --report-bindings --bind-to-core --cpus-per-proc 2
> >>
> >> I hope this helps,
> >> Gus Correa
> >>
> >>
> >>
> >> On 03/19/2013 03:12 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>
> >>>
> >>> Hi Reuti and Gus,
> >>>
> >>> Thank you for your comments.
> >>>
> >>> Our cluster is a little bit heterogeneous, which has nodes with 4, 8,
> > 32
> >>> cores.
> >>> I used 8-core nodes for "-l nodes=4:ppn=8" and 4-core nodes for "-l
> >>> nodes=2:ppn=4".
> >>> (strictly speaking, Torque picked up proper nodes.)
> >>>
> >>> As I mentioned before, I usually use openmpi-1.6.x, which has no
troble
> >>> against that kind
> >>> of use. 

Re: [OMPI users] modified hostfile does not work with openmpi1.7rc8

2013-03-20 Thread Ralph Castain
You obviously have some MCA params set somewhere:

> --
> A deprecated MCA parameter value was specified in an MCA parameter
> file.  Deprecated MCA parameters should be avoided; they may disappear
> in future releases.
> 
>  Deprecated parameter: orte_rsh_agent
> --

Check your environment for anything with OMPI_MCA_xxx, and your default MCA 
parameter file to see what has been specified.
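
For example, something along these lines (the second path assumes the default 
per-user config location; the third assumes $prefix is your install prefix):

env | grep OMPI_MCA_
cat ~/.openmpi/mca-params.conf
cat $prefix/etc/openmpi-mca-params.conf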

The allocation looks okay - I'll have to look for other debug flags you can 
set. Meantime, can you please add --enable-debug to your configure cmd line and 
rebuild?

Thanks
Ralph


On Mar 20, 2013, at 4:39 PM, tmish...@jcity.maeda.co.jp wrote:

> 
> 
> Hi Ralph,
> 
> Here is a result of rerun with --display-allocation.
> I set OMP_NUM_THREADS=1 to make the problem clear.
> 
> Regards,
> Tetsuya Mishima
> 
> P.S. As far as I checked, these 2 cases are OK(no problem).
> (1)mpirun -v -np $NPROCS-x OMP_NUM_THREADS --display-allocation
> ~/Ducom/testbed/mPre m02-ld
> (2)mpirun -v -x OMP_NUM_THREADS --display-allocation ~/Ducom/testbed/mPre
> m02-ld
> 
> Script File:
> 
> #!/bin/sh
> #PBS -A tmishima
> #PBS -N Ducom-run
> #PBS -j oe
> #PBS -l nodes=2:ppn=4
> export OMP_NUM_THREADS=1
> cd $PBS_O_WORKDIR
> cp $PBS_NODEFILE pbs_hosts
> NPROCS=`wc -l < pbs_hosts`
> mpirun -v -np $NPROCS -hostfile pbs_hosts -x OMP_NUM_THREADS
> --display-allocation ~/Ducom/testbed/mPre m02-ld
> 
> Output:
> --
> A deprecated MCA parameter value was specified in an MCA parameter
> file.  Deprecated MCA parameters should be avoided; they may disappear
> in future releases.
> 
>  Deprecated parameter: orte_rsh_agent
> --
> 
> ==   ALLOCATED NODES   ==
> 
> Data for node: node06  Num slots: 4Max slots: 0
> Data for node: node05  Num slots: 4Max slots: 0
> 
> =
> --
> A hostfile was provided that contains at least one node not
> present in the allocation:
> 
>  hostfile:  pbs_hosts
>  node:  node06
> 
> If you are operating in a resource-managed environment, then only
> nodes that are in the allocation can be used in the hostfile. You
> may find relative node syntax to be a useful alternative to
> specifying absolute node names see the orte_hosts man page for
> further information.
> --
> 
> 
>> I've submitted a patch to fix the Torque launch issue - just some
> leftover garbage that existed at the time of the 1.7.0 branch and didn't
> get removed.
>> 
>> For the hostfile issue, I'm stumped as I can't see how the problem would
> come about. Could you please rerun your original test and add
> "--display-allocation" to your cmd line? Let's see if it is
>> correctly finding the original allocation.
>> 
>> Thanks
>> Ralph
>> 
>> On Mar 19, 2013, at 5:08 PM, tmish...@jcity.maeda.co.jp wrote:
>> 
>>> 
>>> 
>>> Hi Gus,
>>> 
>>> Thank you for your comments. I understand your advice.
>>> Our script used to be --npernode type as well.
>>> 
>>> As I told before, our cluster consists of nodes having 4, 8,
>>> and 32 cores, although it used to be homogeneous at the
>>> starting time. Furthermore, since performance of each core
>>> is almost same, a mixed use of nodes with different number
>>> of cores is possible, just like #PBS -l nodes=1:ppn=32+4:ppn=8.
>>> 
>>> --npernode type is not applicable to such a mixed use.
>>> That's why I'd like to continue to use modified hostfile.
>>> 
>>> By the way, the problem I reported to Jeff yesterday
>>> was that openmpi-1.7 with torque is something wrong,
>>> because it caused error against such a simple case as
>>> shown below, which surprised me. Now, the problem is not
>>> limited to modified hostfile, I guess.
>>> 
>>> #PBS -l nodes=4:ppn=8
>>> mpirun -np 8 ./my_program
>>> (OMP_NUM_THREADS=4)
>>> 
>>> Regards,
>>> Tetsuya Mishima
>>> 
 Hi Tetsuya
 
 Your script that edits $PBS_NODEFILE into a separate hostfile
 is very similar to some that I used here for
 hybrid OpenMP+MPI programs on older versions of OMPI.
 I haven't tried this in 1.6.X,
 but it looks like you did and it works also.
 I haven't tried 1.7 either.
 Since we run production machines,
 I try to stick to the stable versions of OMPI (even numbered:
 1.6.X, 1.4.X, 1.2.X).
 
 I believe you can get the same effect even if you
 don't edit your $PBS_NODEFILE and let OMPI use it as is.
 Say, if you choose carefully the values in your
 #PBS -l nodes=?:ppn=?
 of your
 $OMP_NUM_THREADS
 and use an mpiexec with --npernode or --cpus-per-proc.
 
 F

Re: [OMPI users] modified hostfile does not work with openmpi1.7rc8

2013-03-20 Thread Ralph Castain
Could you please apply the attached patch and try it again? If you haven't had 
time to configure with --enable-debug, that is fine - this will output 
regardless.

Thanks
Ralph



user.diff
Description: Binary data


On Mar 20, 2013, at 4:59 PM, Ralph Castain  wrote:

> You obviously have some MCA params set somewhere:
> 
>> --
>> A deprecated MCA parameter value was specified in an MCA parameter
>> file.  Deprecated MCA parameters should be avoided; they may disappear
>> in future releases.
>> 
>> Deprecated parameter: orte_rsh_agent
>> --
> 
> Check your environment for anything with OMPI_MCA_xxx, and your default MCA 
> parameter file to see what has been specified.
> 
> The allocation looks okay - I'll have to look for other debug flags you can 
> set. Meantime, can you please add --enable-debug to your configure cmd line 
> and rebuild?
> 
> Thanks
> Ralph
> 
> 
> On Mar 20, 2013, at 4:39 PM, tmish...@jcity.maeda.co.jp wrote:
> 
>> 
>> 
>> Hi Ralph,
>> 
>> Here is a result of rerun with --display-allocation.
>> I set OMP_NUM_THREADS=1 to make the problem clear.
>> 
>> Regards,
>> Tetsuya Mishima
>> 
>> P.S. As far as I checked, these 2 cases are OK(no problem).
>> (1)mpirun -v -np $NPROCS-x OMP_NUM_THREADS --display-allocation
>> ~/Ducom/testbed/mPre m02-ld
>> (2)mpirun -v -x OMP_NUM_THREADS --display-allocation ~/Ducom/testbed/mPre
>> m02-ld
>> 
>> Script File:
>> 
>> #!/bin/sh
>> #PBS -A tmishima
>> #PBS -N Ducom-run
>> #PBS -j oe
>> #PBS -l nodes=2:ppn=4
>> export OMP_NUM_THREADS=1
>> cd $PBS_O_WORKDIR
>> cp $PBS_NODEFILE pbs_hosts
>> NPROCS=`wc -l < pbs_hosts`
>> mpirun -v -np $NPROCS -hostfile pbs_hosts -x OMP_NUM_THREADS
>> --display-allocation ~/Ducom/testbed/mPre m02-ld
>> 
>> Output:
>> --
>> A deprecated MCA parameter value was specified in an MCA parameter
>> file.  Deprecated MCA parameters should be avoided; they may disappear
>> in future releases.
>> 
>> Deprecated parameter: orte_rsh_agent
>> --
>> 
>> ==   ALLOCATED NODES   ==
>> 
>> Data for node: node06  Num slots: 4Max slots: 0
>> Data for node: node05  Num slots: 4Max slots: 0
>> 
>> =
>> --
>> A hostfile was provided that contains at least one node not
>> present in the allocation:
>> 
>> hostfile:  pbs_hosts
>> node:  node06
>> 
>> If you are operating in a resource-managed environment, then only
>> nodes that are in the allocation can be used in the hostfile. You
>> may find relative node syntax to be a useful alternative to
>> specifying absolute node names see the orte_hosts man page for
>> further information.
>> --
>> 
>> 
>>> I've submitted a patch to fix the Torque launch issue - just some
>> leftover garbage that existed at the time of the 1.7.0 branch and didn't
>> get removed.
>>> 
>>> For the hostfile issue, I'm stumped as I can't see how the problem would
>> come about. Could you please rerun your original test and add
>> "--display-allocation" to your cmd line? Let's see if it is
>>> correctly finding the original allocation.
>>> 
>>> Thanks
>>> Ralph
>>> 
>>> On Mar 19, 2013, at 5:08 PM, tmish...@jcity.maeda.co.jp wrote:
>>> 
 
 
 Hi Gus,
 
 Thank you for your comments. I understand your advice.
 Our script used to be --npernode type as well.
 
 As I told before, our cluster consists of nodes having 4, 8,
 and 32 cores, although it used to be homogeneous at the
 starting time. Furthermore, since performance of each core
 is almost same, a mixed use of nodes with different number
 of cores is possible, just like #PBS -l nodes=1:ppn=32+4:ppn=8.
 
 --npernode type is not applicable to such a mixed use.
 That's why I'd like to continue to use modified hostfile.
 
 By the way, the problem I reported to Jeff yesterday
 was that openmpi-1.7 with torque is something wrong,
 because it caused error against such a simple case as
 shown below, which surprised me. Now, the problem is not
 limited to modified hostfile, I guess.
 
 #PBS -l nodes=4:ppn=8
 mpirun -np 8 ./my_program
 (OMP_NUM_THREADS=4)
 
 Regards,
 Tetsuya Mishima
 
> Hi Tetsuya
> 
> Your script that edits $PBS_NODEFILE into a separate hostfile
> is very similar to some that I used here for
> hybrid OpenMP+MPI programs on older versions of OMPI.
> I haven't tried this in 1.6.X,
> but it looks like you did and it works also.
> I haven't tried 1.7 either.
> Since we run producti

Re: [OMPI users] modified hostfile does not work with openmpi1.7rc8

2013-03-20 Thread tmishima


Hi Ralph,

I have the line below in ~/.openmpi/mca-params.conf to use rsh:
orte_rsh_agent = /usr/bin/rsh

I changed this line to:
plm_rsh_agent = /usr/bin/rsh  # for openmpi-1.7

Then, the error message disappeared. Thanks.
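
(A quick way to double-check which value is now in effect, as a sketch:

ompi_info -a | grep rsh_agent

should show the current agent setting.)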

Returning to the subject, I can rebuild with --enable-debug.
Please wait until it completes.

Regards,
Tetsuya Mishima

> You obviously have some MCA params set somewhere:
>
> >
--
> > A deprecated MCA parameter value was specified in an MCA parameter
> > file.  Deprecated MCA parameters should be avoided; they may disappear
> > in future releases.
> >
> >  Deprecated parameter: orte_rsh_agent
> >
--
>
> Check your environment for anything with OMPI_MCA_xxx, and your default
MCA parameter file to see what has been specified.
>
> The allocation looks okay - I'll have to look for other debug flags you
can set. Meantime, can you please add --enable-debug to your configure cmd
line and rebuild?
>
> Thanks
> Ralph
>
>
> On Mar 20, 2013, at 4:39 PM, tmish...@jcity.maeda.co.jp wrote:
>
> >
> >
> > Hi Ralph,
> >
> > Here is a result of rerun with --display-allocation.
> > I set OMP_NUM_THREADS=1 to make the problem clear.
> >
> > Regards,
> > Tetsuya Mishima
> >
> > P.S. As far as I checked, these 2 cases are OK(no problem).
> > (1)mpirun -v -np $NPROCS-x OMP_NUM_THREADS --display-allocation
> > ~/Ducom/testbed/mPre m02-ld
> > (2)mpirun -v -x OMP_NUM_THREADS --display-allocation
~/Ducom/testbed/mPre
> > m02-ld
> >
> > Script File:
> >
> > #!/bin/sh
> > #PBS -A tmishima
> > #PBS -N Ducom-run
> > #PBS -j oe
> > #PBS -l nodes=2:ppn=4
> > export OMP_NUM_THREADS=1
> > cd $PBS_O_WORKDIR
> > cp $PBS_NODEFILE pbs_hosts
> > NPROCS=`wc -l < pbs_hosts`
> > mpirun -v -np $NPROCS -hostfile pbs_hosts -x OMP_NUM_THREADS
> > --display-allocation ~/Ducom/testbed/mPre m02-ld
> >
> > Output:
> >
--
> > A deprecated MCA parameter value was specified in an MCA parameter
> > file.  Deprecated MCA parameters should be avoided; they may disappear
> > in future releases.
> >
> >  Deprecated parameter: orte_rsh_agent
> >
--
> >
> > ==   ALLOCATED NODES   ==
> >
> > Data for node: node06  Num slots: 4Max slots: 0
> > Data for node: node05  Num slots: 4Max slots: 0
> >
> > =
> >
--
> > A hostfile was provided that contains at least one node not
> > present in the allocation:
> >
> >  hostfile:  pbs_hosts
> >  node:  node06
> >
> > If you are operating in a resource-managed environment, then only
> > nodes that are in the allocation can be used in the hostfile. You
> > may find relative node syntax to be a useful alternative to
> > specifying absolute node names see the orte_hosts man page for
> > further information.
> >
--
> >
> >
> >> I've submitted a patch to fix the Torque launch issue - just some
> > leftover garbage that existed at the time of the 1.7.0 branch and
didn't
> > get removed.
> >>
> >> For the hostfile issue, I'm stumped as I can't see how the problem
would
> > come about. Could you please rerun your original test and add
> > "--display-allocation" to your cmd line? Let's see if it is
> >> correctly finding the original allocation.
> >>
> >> Thanks
> >> Ralph
> >>
> >> On Mar 19, 2013, at 5:08 PM, tmish...@jcity.maeda.co.jp wrote:
> >>
> >>>
> >>>
> >>> Hi Gus,
> >>>
> >>> Thank you for your comments. I understand your advice.
> >>> Our script used to be --npernode type as well.
> >>>
> >>> As I told before, our cluster consists of nodes having 4, 8,
> >>> and 32 cores, although it used to be homogeneous at the
> >>> starting time. Furthermore, since performance of each core
> >>> is almost same, a mixed use of nodes with different number
> >>> of cores is possible, just like #PBS -l nodes=1:ppn=32+4:ppn=8.
> >>>
> >>> --npernode type is not applicable to such a mixed use.
> >>> That's why I'd like to continue to use modified hostfile.
> >>>
> >>> By the way, the problem I reported to Jeff yesterday
> >>> was that openmpi-1.7 with torque is something wrong,
> >>> because it caused error against such a simple case as
> >>> shown below, which surprised me. Now, the problem is not
> >>> limited to modified hostfile, I guess.
> >>>
> >>> #PBS -l nodes=4:ppn=8
> >>> mpirun -np 8 ./my_program
> >>> (OMP_NUM_THREADS=4)
> >>>
> >>> Regards,
> >>> Tetsuya Mishima
> >>>
>  Hi Tetsuya
> 
>  Your script that edits $PBS_NODEFILE into a separate hostfile
>  is very similar to some that I used here for
>  hybrid OpenMP+MPI programs on older versions of OMPI.

Re: [OMPI users] modified hostfile does not work with openmpi1.7rc8

2013-03-20 Thread tmishima


Hi Ralph,

I have completed the rebuild of openmpi-1.7rc8.
To save time, I added --disable-vt. (Is that OK?)

Well, what shall I do next?

./configure \
--prefix=/home/mishima/opt/mpi/openmpi-1.7rc8-pgi12.9 \
--with-tm \
--with-verbs \
--disable-ipv6 \
--disable-vt \
--enable-debug \
CC=pgcc CFLAGS="-fast -tp k8-64e" \
CXX=pgCC CXXFLAGS="-fast -tp k8-64e" \
F77=pgfortran FFLAGS="-fast -tp k8-64e" \
FC=pgfortran FCFLAGS="-fast -tp k8-64e"

Note:
I tried applying the patch user.diff after rebuilding openmpi-1.7rc8,
but I got an error and could not go forward.

$ patch -p0 < user.diff  # this is OK
$ make   # I got an error

  CC   util/hostfile/hostfile.lo
PGC-S-0037-Syntax error: Recovery attempted by deleting 
(util/hostfile/hostfile.c: 728)
PGC/x86-64 Linux 12.9-0: compilation completed with severe errors

Regards,
Tetsuya Mishima

> Could you please apply the attached patch and try it again? If you
haven't had time to configure with --enable-debug, that is fine - this will
output regardless.
>
> Thanks
> Ralph
>
>  - user.diff
>
>
> On Mar 20, 2013, at 4:59 PM, Ralph Castain  wrote:
>
> > You obviously have some MCA params set somewhere:
> >
> >>
--
> >> A deprecated MCA parameter value was specified in an MCA parameter
> >> file.  Deprecated MCA parameters should be avoided; they may disappear
> >> in future releases.
> >>
> >> Deprecated parameter: orte_rsh_agent
> >>
--
> >
> > Check your environment for anything with OMPI_MCA_xxx, and your default
MCA parameter file to see what has been specified.
> >
> > The allocation looks okay - I'll have to look for other debug flags you
can set. Meantime, can you please add --enable-debug to your configure cmd
line and rebuild?
> >
> > Thanks
> > Ralph
> >
> >
> > On Mar 20, 2013, at 4:39 PM, tmish...@jcity.maeda.co.jp wrote:
> >
> >>
> >>
> >> Hi Ralph,
> >>
> >> Here is a result of rerun with --display-allocation.
> >> I set OMP_NUM_THREADS=1 to make the problem clear.
> >>
> >> Regards,
> >> Tetsuya Mishima
> >>
> >> P.S. As far as I checked, these 2 cases are OK(no problem).
> >> (1)mpirun -v -np $NPROCS-x OMP_NUM_THREADS --display-allocation
> >> ~/Ducom/testbed/mPre m02-ld
> >> (2)mpirun -v -x OMP_NUM_THREADS --display-allocation
~/Ducom/testbed/mPre
> >> m02-ld
> >>
> >> Script File:
> >>
> >> #!/bin/sh
> >> #PBS -A tmishima
> >> #PBS -N Ducom-run
> >> #PBS -j oe
> >> #PBS -l nodes=2:ppn=4
> >> export OMP_NUM_THREADS=1
> >> cd $PBS_O_WORKDIR
> >> cp $PBS_NODEFILE pbs_hosts
> >> NPROCS=`wc -l < pbs_hosts`
> >> mpirun -v -np $NPROCS -hostfile pbs_hosts -x OMP_NUM_THREADS
> >> --display-allocation ~/Ducom/testbed/mPre m02-ld
> >>
> >> Output:
> >>
--
> >> A deprecated MCA parameter value was specified in an MCA parameter
> >> file.  Deprecated MCA parameters should be avoided; they may disappear
> >> in future releases.
> >>
> >> Deprecated parameter: orte_rsh_agent
> >>
--
> >>
> >> ==   ALLOCATED NODES   ==
> >>
> >> Data for node: node06  Num slots: 4Max slots: 0
> >> Data for node: node05  Num slots: 4Max slots: 0
> >>
> >> =
> >>
--
> >> A hostfile was provided that contains at least one node not
> >> present in the allocation:
> >>
> >> hostfile:  pbs_hosts
> >> node:  node06
> >>
> >> If you are operating in a resource-managed environment, then only
> >> nodes that are in the allocation can be used in the hostfile. You
> >> may find relative node syntax to be a useful alternative to
> >> specifying absolute node names see the orte_hosts man page for
> >> further information.
> >>
--
> >>
> >>
> >>> I've submitted a patch to fix the Torque launch issue - just some
> >> leftover garbage that existed at the time of the 1.7.0 branch and
didn't
> >> get removed.
> >>>
> >>> For the hostfile issue, I'm stumped as I can't see how the problem
would
> >> come about. Could you please rerun your original test and add
> >> "--display-allocation" to your cmd line? Let's see if it is
> >>> correctly finding the original allocation.
> >>>
> >>> Thanks
> >>> Ralph
> >>>
> >>> On Mar 19, 2013, at 5:08 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>
> 
> 
>  Hi Gus,
> 
>  Thank you for your comments. I understand your advice.
>  Our script used to be --npernode type as well.
> 
>  As I told before, our cluster consists of nodes having 4, 8,
>  and 32 cores, although it used to be homogeneous at the
>  starting time. Furthermore, since performance of each core
>  is almost same, a mixed use of 

Re: [OMPI users] modified hostfile does not work with openmpi1.7rc8

2013-03-20 Thread tmishima


Hi Ralph,

Here is the output from openmpi-1.6.4, just for your information.
A small difference is observed: 1.6.4 reports the node as node06.cluster,
while 1.7rc8 reports it as plain node06. I hope this helps you.

Regards,
Tetsuya Mishima

openmpi-1.6.4:
==   ALLOCATED NODES   ==

 Data for node: node06.cluster  Num slots: 4    Max slots: 0
 Data for node: node05  Num slots: 4    Max slots: 0

=


openmpi-1.7rc8 with --enable-debug:
==   ALLOCATED NODES   ==

 Data for node: node06  Num slots: 4    Max slots: 0
 Data for node: node05  Num slots: 4    Max slots: 0

=
--
A hostfile was provided that contains at least one node not
present in the allocation:

  hostfile:  pbs_hosts
  node:  node06

If you are operating in a resource-managed environment, then only
nodes that are in the allocation can be used in the hostfile. You
may find relative node syntax to be a useful alternative to
specifying absolute node names see the orte_hosts man page for
further information.
--