Re: [OMPI users] modified hostfile does not work with openmpi1.7rc8
I've submitted a patch to fix the Torque launch issue - just some leftover garbage that existed at the time of the 1.7.0 branch and didn't get removed. For the hostfile issue, I'm stumped as I can't see how the problem would come about. Could you please rerun your original test and add "--display-allocation" to your cmd line? Let's see if it is correctly finding the original allocation. Thanks Ralph On Mar 19, 2013, at 5:08 PM, tmish...@jcity.maeda.co.jp wrote: > > > Hi Gus, > > Thank you for your comments. I understand your advice. > Our script used to be --npernode type as well. > > As I told before, our cluster consists of nodes having 4, 8, > and 32 cores, although it used to be homogeneous at the > starting time. Furthermore, since performance of each core > is almost same, a mixed use of nodes with different number > of cores is possible, just like #PBS -l nodes=1:ppn=32+4:ppn=8. > > --npernode type is not applicable to such a mixed use. > That's why I'd like to continue to use modified hostfile. > > By the way, the problem I reported to Jeff yesterday > was that openmpi-1.7 with torque is something wrong, > because it caused error against such a simple case as > shown below, which surprised me. Now, the problem is not > limited to modified hostfile, I guess. > > #PBS -l nodes=4:ppn=8 > mpirun -np 8 ./my_program > (OMP_NUM_THREADS=4) > > Regards, > Tetsuya Mishima > >> Hi Tetsuya >> >> Your script that edits $PBS_NODEFILE into a separate hostfile >> is very similar to some that I used here for >> hybrid OpenMP+MPI programs on older versions of OMPI. >> I haven't tried this in 1.6.X, >> but it looks like you did and it works also. >> I haven't tried 1.7 either. >> Since we run production machines, >> I try to stick to the stable versions of OMPI (even numbered: >> 1.6.X, 1.4.X, 1.2.X). >> >> I believe you can get the same effect even if you >> don't edit your $PBS_NODEFILE and let OMPI use it as is. >> Say, if you choose carefully the values in your >> #PBS -l nodes=?:ppn=? >> of your >> $OMP_NUM_THREADS >> and use an mpiexec with --npernode or --cpus-per-proc. >> >> For instance, for twelve MPI processes, with two threads each, >> on nodes with eight cores each, I would try >> (but I haven't tried!): >> >> #PBS -l nodes=3:ppn=8 >> >> export $OMP_NUM_THREADS=2 >> >> mpiexec -np 12 -npernode 4 >> >> or perhaps more tightly: >> >> mpiexec -np 12 --report-bindings --bind-to-core --cpus-per-proc 2 >> >> I hope this helps, >> Gus Correa >> >> >> >> On 03/19/2013 03:12 PM, tmish...@jcity.maeda.co.jp wrote: >>> >>> >>> Hi Reuti and Gus, >>> >>> Thank you for your comments. >>> >>> Our cluster is a little bit heterogeneous, which has nodes with 4, 8, > 32 >>> cores. >>> I used 8-core nodes for "-l nodes=4:ppn=8" and 4-core nodes for "-l >>> nodes=2:ppn=4". >>> (strictly speaking, Torque picked up proper nodes.) >>> >>> As I mentioned before, I usually use openmpi-1.6.x, which has no troble >>> against that kind >>> of use. I encountered the issue when I was evaluating openmpi-1.7 to > check >>> when we could >>> move on to it, although we have no positive reason to do that at this >>> moment. >>> >>> As Gus pointed out, I use a script file as shown below for a practical > use >>> of openmpi-1.6.x. 
>>> >>> #PBS -l nodes=2:ppn=32 # even "-l nodes=1:ppn=32+4:ppn=8" works fine >>> export OMP_NUM_THREADS=4 >>> modify $PBS_NODEFILE pbs_hosts # 64 lines are condensed to 16 lines > here >>> mpirun -hostfile pbs_hosts -np 16 -cpus-per-proc 4 -report-bindings \ >>> -x OMP_NUM_THREADS ./my_program # 32-core node has 8 numanodes, 8-core >>> node has 2 numanodes >>> >>> It works well under the combination of openmpi-1.6.x and Torque. The >>> problem is just >>> openmpi-1.7's behavior. >>> >>> Regards, >>> Tetsuya Mishima >>> Hi Tetsuya Mishima Mpiexec offers you a number of possibilities that you could try: --bynode, --pernode, --npernode, --bysocket, --bycore, --cpus-per-proc, --cpus-per-rank, --rankfile and more. Most likely one or more of them will fit your needs. There are also associated flags to bind processes to cores, to sockets, etc, to report the bindings, and so on. Check the mpiexec man page for details. Nevertheless, I am surprised that modifying the $PBS_NODEFILE doesn't work for you in OMPI 1.7. I have done this many times in older versions of OMPI. Would it work for you to go back to the stable OMPI 1.6.X, or does it lack any special feature that you need? I hope this helps, Gus Correa On 03/19/2013 03:00 AM, tmish...@jcity.maeda.co.jp wrote: > > > Hi Jeff, > > I didn't have much time to test this morning. So, I checked it again > now. Then, the trouble seems to depend on the number of nodes to use. > > This works(nodes< 4): > mpiexec -bynode -np 4 ./my_program&&
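A purely illustrative sketch of the "modify $PBS_NODEFILE pbs_hosts" step quoted above (the poster's actual command is not shown in the thread; this assumes ppn is a multiple of OMP_NUM_THREADS, so the per-core slot list collapses cleanly into one hostfile entry per multi-threaded rank):

    #!/bin/sh
    # Illustrative only: $PBS_NODEFILE lists each host once per core slot.
    # Keeping every OMP_NUM_THREADS-th line leaves one entry per 4-thread MPI
    # rank, e.g. 64 slot lines -> 16 hostfile lines when OMP_NUM_THREADS=4.
    export OMP_NUM_THREADS=4
    awk -v n="$OMP_NUM_THREADS" 'NR % n == 1' "$PBS_NODEFILE" > pbs_hosts

The condensed pbs_hosts can then be fed to mpirun together with -cpus-per-proc, as in the script quoted above.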
Re: [OMPI users] Minor bug: invalid values for opal_signal MCA parameter cause internal error
Simple to do - I added a clearer error message to the trunk and marked it for inclusion in the eventual v1.7.1 release. I'll have to let someone else do the docs as I don't fully grok the rationale behind it. Thanks On Mar 18, 2013, at 12:56 PM, Jeremiah Willcock wrote: > If a user gives an invalid value for the opal_signal MCA parameter, such as > in the command: > > mpirun -mca opal_signal x /bin/ls > > the error produced by Open MPI 1.6.3 is: > > -- > It looks like opal_init failed for some reason; your parallel process is > likely to abort. There are many reasons that a parallel process can > fail during opal_init; some of which are due to configuration or > environment problems. This failure appears to be an internal failure; > here's some additional information (which may only be relevant to an > Open MPI developer): > > opal_util_register_stackhandlers failed > --> Returned value -5 instead of OPAL_SUCCESS > -- > > which claims to be an internal error, not an invalid argument given by a > user. That parameter also appears to be poorly documented in general > (mentioned in ompi_info -a and on the mailing lists), and seems like it would > be an incredibly useful debugging tool when running a crashing application > under a debugger. > > -- Jeremiah Willcock > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
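For reference, a hedged example of what a valid setting might look like; the exact syntax and default are assumptions here, and ompi_info -a on your own installation (as mentioned above) is the authoritative source. The parameter appears to take a comma-separated list of signal numbers, with a default reportedly of 6,7,8,11, so dropping 11 (SIGSEGV on Linux) would let a debugger catch segfaults directly:

    # Verify the parameter and its default on your build first:
    ompi_info -a | grep opal_signal
    # Then, for example, stop Open MPI from installing its own SIGSEGV handler:
    mpirun -mca opal_signal 6,7,8 -np 1 ./my_program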
[OMPI users] mpirun error output
Hi, 1) Openmpi in PC1 I installed openmpi-1.4.3 using the OpenSuse 32b v. 12.1 repository as well as openmpi devel. All mpi executables are present, as are the libraries in the lib directory. I set the environment as (.bashrc) PATH=$PATH:/usr/lib/mpi/gcc/openmpi/bin PATH=$PATH:/usr/lib/mpi/gcc/openmpi/lib export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/mpi/gcc/openmpi/lib export PATH When I run any of the test examples (e.g. mpirun hello_c.c) or any program that has an mpi interface included, I get the message - mpirun was unable to launch the specified application as it could not find an executable: Executable: hello_c.c Node: linux-curie while attempting to start process rank 0. --- Typing echo $LD_LIBRARY_PATH I should get something like /usr/lib/mpi/gcc/openmpi/lib. The only output I get is /usr/local/atlas3.10/lib (which is the blas/lapack library). Also the Intel compilers library is not shown. 2) Openmpi installation in PC2 In OpenSuse v 12.1 64b I installed openmpi-1.4.3, downloading it from the openmpi site. No error occurred during the ./configure, make, make install process. The environment settings change a little but are very similar to those mentioned under PC1. The same message as above is occurring. In this case, typing echo $LD_LIBRARY_PATH I get the correct output from the mpi library as /usr/local/lib64, and the executables are in /usr/local/bin. Any help is welcome. Regards Bruno
Re: [OMPI users] mpirun error output
Well, a couple of things come to mind - see below On Mar 20, 2013, at 9:41 AM, Bruno Cramer wrote: > Hi, > 1) Openmpi in PC1 > I installed openmpi-1.4.3 using the OpenSuse 32b v. 12.1 repository > as well as openmpi devel > All mpi executables are present so are the libraries in lib directory. > I set the environment as ( .bashrc) > > > PATH=$PATH:/usr/lib/mpi/gcc/openmpi/bin > PATH=$PATH:/usr/lib/mpi/gcc/openmpi/lib > export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/mpi/gcc/openmpi/lib > export PATH You should reverse the ordering here - always put the OMPI path element first, then the existing one, to ensure that you are getting the intended version. Lot of operating systems come with an older version pre-installed in a standard location. > > When I run any of the test examples (eg. mpirun hello_c.c or any program that > has mpi interface included I get the message > - > mpirun was unable to launch the specified application as it could not find an > executable: > Executable: hello_c.c > Node: linux-curie > while attempting to start process rank 0. Look at the executable - apparently, you tried to run the ".c" source code instead of the compiled executable :-) > --- > typing echo $LD_LIBRARY_PATH I should get something like > /usr/lib/mpi/gcc/openmpi/lib. The only output I get is > /usr/local/atlas3.10/lib (which is the blas/lapack library). Also Intel > compilers library is not shown. I suspect that your original LD_LIBRARY_PATH was empty, so now the path starts with a ":" and makes bash unhappy. Try reversing the order as above and it might work. > > > > 1) Openmpi installation in PC2 > In OpenSuse v 12.1 64b I installed openmpi-1.4.3 downloading it from the > openmpi site. > No error occured during ./configure, make, make install process. > The environment settings change a little but are very similar to those > mentioned under PC1. > The same message as above is occuring. > > in this case typing echo $LD_LIBRARY_PATH I get the correct output from the > mpi library as /usr/local/lib64 and the executables are in /usr/local/bin. > > > Any help is wellcome > > > Regards > Bruno > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
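Putting the two suggestions above together, a minimal sketch (paths taken from the PC1 setup; adjust to your own install) is to prepend the Open MPI directories and hand mpirun the compiled binary rather than the .c file:

    # Prepend the Open MPI directories so this install wins over any system copy
    export PATH=/usr/lib/mpi/gcc/openmpi/bin:$PATH
    export LD_LIBRARY_PATH=/usr/lib/mpi/gcc/openmpi/lib:$LD_LIBRARY_PATH
    # Compile the example first, then pass the resulting executable to mpirun
    mpicc -o hello_c hello_c.c
    mpirun -np 2 ./hello_c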
Re: [OMPI users] mpirun error output
Am 20.03.2013 um 18:58 schrieb Ralph Castain: > Well, a couple of things come to mind - see below > > On Mar 20, 2013, at 9:41 AM, Bruno Cramer wrote: > >> Hi, >> 1) Openmpi in PC1 >> I installed openmpi-1.4.3 using the OpenSuse 32b v. 12.1 repository >> as well as openmpi devel >> All mpi executables are present so are the libraries in lib directory. >> I set the environment as ( .bashrc) >> >> >> PATH=$PATH:/usr/lib/mpi/gcc/openmpi/bin >> PATH=$PATH:/usr/lib/mpi/gcc/openmpi/lib >> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/mpi/gcc/openmpi/lib >> export PATH > > You should reverse the ordering here - always put the OMPI path element > first, then the existing one, to ensure that you are getting the intended > version. Lot of operating systems come with an older version pre-installed in > a standard location. > >> >> When I run any of the test examples (eg. mpirun hello_c.c or any program >> that has mpi interface included I get the message >> - >> mpirun was unable to launch the specified application as it could not find >> an executable: >> Executable: hello_c.c >> Node: linux-curie >> while attempting to start process rank 0. > > Look at the executable - apparently, you tried to run the ".c" source code > instead of the compiled executable :-) > >> --- >> typing echo $LD_LIBRARY_PATH I should get something like >> /usr/lib/mpi/gcc/openmpi/lib. The only output I get is >> /usr/local/atlas3.10/lib (which is the blas/lapack library). Also Intel >> compilers library is not shown. > > I suspect that your original LD_LIBRARY_PATH was empty, so now the path > starts with a ":" and makes bash unhappy. Try reversing the order as above > and it might work. AFAIK additional colons don't matter, but nevertheless I prefer indeed for cosmetic reasons: $ export LD_LIBRARY_PATH=/usr/lib/mpi/gcc/openmpi/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH} to avoid a superfluous colon too. -- Reuti >> >> >> >> 1) Openmpi installation in PC2 >> In OpenSuse v 12.1 64b I installed openmpi-1.4.3 downloading it from the >> openmpi site. >> No error occured during ./configure, make, make install process. >> The environment settings change a little but are very similar to those >> mentioned under PC1. >> The same message as above is occuring. >> >> in this case typing echo $LD_LIBRARY_PATH I get the correct output from the >> mpi library as /usr/local/lib64 and the executables are in /usr/local/bin. >> >> >> >> Any help is wellcome >> >> >> Regards >> Bruno >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
[OMPI users] "Error setting file view" NPB BTIO
Hello, I am running the NAS parallel benchmarks' BTIO benchmark (NPB v 3.3) for class D and 1 process. `make bt CLASS=D SUBTYPE=full NPROCS=1` I have provided gcc's `mcmodel=medium` flag along with -O3 during compilation. This is on an x86_64 machine. I have tested with openmpi 1.4.3 and 1.7, but I get "Error setting file view" when I run the benchmark. It works fine for 4 and 16 processes. Can someone point out what is going wrong? Thanks in advance. NAS Parallel Benchmarks 3.3 -- BT Benchmark No input file inputbt.data. Using compiled defaults Size: 408x 408x 408 Iterations: 250dt: 0.200 Number of active processes: 1 BTIO -- FULL MPI-IO write interval: 5 Error setting file view -- mpirun has exited due to process rank 0 with PID 6663 on node crill-003 exiting without calling "finalize". This may have caused other processes in the application to be terminated by signals sent by mpirun (as reported here). -- Regards, Kshitij Mehta PhD student University of Houston
Re: [OMPI users] modified hostfile does not work with openmpi1.7rc8
Hi Ralph, Here is a result of rerun with --display-allocation. I set OMP_NUM_THREADS=1 to make the problem clear. Regards, Tetsuya Mishima P.S. As far as I checked, these 2 cases are OK(no problem). (1)mpirun -v -np $NPROCS-x OMP_NUM_THREADS --display-allocation ~/Ducom/testbed/mPre m02-ld (2)mpirun -v -x OMP_NUM_THREADS --display-allocation ~/Ducom/testbed/mPre m02-ld Script File: #!/bin/sh #PBS -A tmishima #PBS -N Ducom-run #PBS -j oe #PBS -l nodes=2:ppn=4 export OMP_NUM_THREADS=1 cd $PBS_O_WORKDIR cp $PBS_NODEFILE pbs_hosts NPROCS=`wc -l < pbs_hosts` mpirun -v -np $NPROCS -hostfile pbs_hosts -x OMP_NUM_THREADS --display-allocation ~/Ducom/testbed/mPre m02-ld Output: -- A deprecated MCA parameter value was specified in an MCA parameter file. Deprecated MCA parameters should be avoided; they may disappear in future releases. Deprecated parameter: orte_rsh_agent -- == ALLOCATED NODES == Data for node: node06 Num slots: 4Max slots: 0 Data for node: node05 Num slots: 4Max slots: 0 = -- A hostfile was provided that contains at least one node not present in the allocation: hostfile: pbs_hosts node: node06 If you are operating in a resource-managed environment, then only nodes that are in the allocation can be used in the hostfile. You may find relative node syntax to be a useful alternative to specifying absolute node names see the orte_hosts man page for further information. -- > I've submitted a patch to fix the Torque launch issue - just some leftover garbage that existed at the time of the 1.7.0 branch and didn't get removed. > > For the hostfile issue, I'm stumped as I can't see how the problem would come about. Could you please rerun your original test and add "--display-allocation" to your cmd line? Let's see if it is > correctly finding the original allocation. > > Thanks > Ralph > > On Mar 19, 2013, at 5:08 PM, tmish...@jcity.maeda.co.jp wrote: > > > > > > > Hi Gus, > > > > Thank you for your comments. I understand your advice. > > Our script used to be --npernode type as well. > > > > As I told before, our cluster consists of nodes having 4, 8, > > and 32 cores, although it used to be homogeneous at the > > starting time. Furthermore, since performance of each core > > is almost same, a mixed use of nodes with different number > > of cores is possible, just like #PBS -l nodes=1:ppn=32+4:ppn=8. > > > > --npernode type is not applicable to such a mixed use. > > That's why I'd like to continue to use modified hostfile. > > > > By the way, the problem I reported to Jeff yesterday > > was that openmpi-1.7 with torque is something wrong, > > because it caused error against such a simple case as > > shown below, which surprised me. Now, the problem is not > > limited to modified hostfile, I guess. > > > > #PBS -l nodes=4:ppn=8 > > mpirun -np 8 ./my_program > > (OMP_NUM_THREADS=4) > > > > Regards, > > Tetsuya Mishima > > > >> Hi Tetsuya > >> > >> Your script that edits $PBS_NODEFILE into a separate hostfile > >> is very similar to some that I used here for > >> hybrid OpenMP+MPI programs on older versions of OMPI. > >> I haven't tried this in 1.6.X, > >> but it looks like you did and it works also. > >> I haven't tried 1.7 either. > >> Since we run production machines, > >> I try to stick to the stable versions of OMPI (even numbered: > >> 1.6.X, 1.4.X, 1.2.X). > >> > >> I believe you can get the same effect even if you > >> don't edit your $PBS_NODEFILE and let OMPI use it as is. > >> Say, if you choose carefully the values in your > >> #PBS -l nodes=?:ppn=? 
> >> of your > >> $OMP_NUM_THREADS > >> and use an mpiexec with --npernode or --cpus-per-proc. > >> > >> For instance, for twelve MPI processes, with two threads each, > >> on nodes with eight cores each, I would try > >> (but I haven't tried!): > >> > >> #PBS -l nodes=3:ppn=8 > >> > >> export $OMP_NUM_THREADS=2 > >> > >> mpiexec -np 12 -npernode 4 > >> > >> or perhaps more tightly: > >> > >> mpiexec -np 12 --report-bindings --bind-to-core --cpus-per-proc 2 > >> > >> I hope this helps, > >> Gus Correa > >> > >> > >> > >> On 03/19/2013 03:12 PM, tmish...@jcity.maeda.co.jp wrote: > >>> > >>> > >>> Hi Reuti and Gus, > >>> > >>> Thank you for your comments. > >>> > >>> Our cluster is a little bit heterogeneous, which has nodes with 4, 8, > > 32 > >>> cores. > >>> I used 8-core nodes for "-l nodes=4:ppn=8" and 4-core nodes for "-l > >>> nodes=2:ppn=4". > >>> (strictly speaking, Torque picked up proper nodes.) > >>> > >>> As I mentioned before, I usually use openmpi-1.6.x, which has no troble > >>> against that kind > >>> of use.
Re: [OMPI users] modified hostfile does not work with openmpi1.7rc8
You obviously have some MCA params set somewhere: > -- > A deprecated MCA parameter value was specified in an MCA parameter > file. Deprecated MCA parameters should be avoided; they may disappear > in future releases. > > Deprecated parameter: orte_rsh_agent > -- Check your environment for anything with OMPI_MCA_xxx, and your default MCA parameter file to see what has been specified. The allocation looks okay - I'll have to look for other debug flags you can set. Meantime, can you please add --enable-debug to your configure cmd line and rebuild? Thanks Ralph On Mar 20, 2013, at 4:39 PM, tmish...@jcity.maeda.co.jp wrote: > > > Hi Ralph, > > Here is a result of rerun with --display-allocation. > I set OMP_NUM_THREADS=1 to make the problem clear. > > Regards, > Tetsuya Mishima > > P.S. As far as I checked, these 2 cases are OK(no problem). > (1)mpirun -v -np $NPROCS-x OMP_NUM_THREADS --display-allocation > ~/Ducom/testbed/mPre m02-ld > (2)mpirun -v -x OMP_NUM_THREADS --display-allocation ~/Ducom/testbed/mPre > m02-ld > > Script File: > > #!/bin/sh > #PBS -A tmishima > #PBS -N Ducom-run > #PBS -j oe > #PBS -l nodes=2:ppn=4 > export OMP_NUM_THREADS=1 > cd $PBS_O_WORKDIR > cp $PBS_NODEFILE pbs_hosts > NPROCS=`wc -l < pbs_hosts` > mpirun -v -np $NPROCS -hostfile pbs_hosts -x OMP_NUM_THREADS > --display-allocation ~/Ducom/testbed/mPre m02-ld > > Output: > -- > A deprecated MCA parameter value was specified in an MCA parameter > file. Deprecated MCA parameters should be avoided; they may disappear > in future releases. > > Deprecated parameter: orte_rsh_agent > -- > > == ALLOCATED NODES == > > Data for node: node06 Num slots: 4Max slots: 0 > Data for node: node05 Num slots: 4Max slots: 0 > > = > -- > A hostfile was provided that contains at least one node not > present in the allocation: > > hostfile: pbs_hosts > node: node06 > > If you are operating in a resource-managed environment, then only > nodes that are in the allocation can be used in the hostfile. You > may find relative node syntax to be a useful alternative to > specifying absolute node names see the orte_hosts man page for > further information. > -- > > >> I've submitted a patch to fix the Torque launch issue - just some > leftover garbage that existed at the time of the 1.7.0 branch and didn't > get removed. >> >> For the hostfile issue, I'm stumped as I can't see how the problem would > come about. Could you please rerun your original test and add > "--display-allocation" to your cmd line? Let's see if it is >> correctly finding the original allocation. >> >> Thanks >> Ralph >> >> On Mar 19, 2013, at 5:08 PM, tmish...@jcity.maeda.co.jp wrote: >> >>> >>> >>> Hi Gus, >>> >>> Thank you for your comments. I understand your advice. >>> Our script used to be --npernode type as well. >>> >>> As I told before, our cluster consists of nodes having 4, 8, >>> and 32 cores, although it used to be homogeneous at the >>> starting time. Furthermore, since performance of each core >>> is almost same, a mixed use of nodes with different number >>> of cores is possible, just like #PBS -l nodes=1:ppn=32+4:ppn=8. >>> >>> --npernode type is not applicable to such a mixed use. >>> That's why I'd like to continue to use modified hostfile. >>> >>> By the way, the problem I reported to Jeff yesterday >>> was that openmpi-1.7 with torque is something wrong, >>> because it caused error against such a simple case as >>> shown below, which surprised me. Now, the problem is not >>> limited to modified hostfile, I guess. 
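As a concrete way to carry out that check, something along these lines usually turns up where a parameter is coming from (the file locations are the common defaults and may differ per installation):

    # MCA parameters set through the environment
    env | grep OMPI_MCA_
    # Per-user default parameter file
    cat ~/.openmpi/mca-params.conf
    # plus the system-wide file, typically <prefix>/etc/openmpi-mca-params.conf
    # under the install prefix reported by ompi_info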
>>> >>> #PBS -l nodes=4:ppn=8 >>> mpirun -np 8 ./my_program >>> (OMP_NUM_THREADS=4) >>> >>> Regards, >>> Tetsuya Mishima >>> Hi Tetsuya Your script that edits $PBS_NODEFILE into a separate hostfile is very similar to some that I used here for hybrid OpenMP+MPI programs on older versions of OMPI. I haven't tried this in 1.6.X, but it looks like you did and it works also. I haven't tried 1.7 either. Since we run production machines, I try to stick to the stable versions of OMPI (even numbered: 1.6.X, 1.4.X, 1.2.X). I believe you can get the same effect even if you don't edit your $PBS_NODEFILE and let OMPI use it as is. Say, if you choose carefully the values in your #PBS -l nodes=?:ppn=? of your $OMP_NUM_THREADS and use an mpiexec with --npernode or --cpus-per-proc. F
Re: [OMPI users] modified hostfile does not work with openmpi1.7rc8
Could you please apply the attached patch and try it again? If you haven't had time to configure with --enable-debug, that is fine - this will output regardless. Thanks Ralph user.diff Description: Binary data On Mar 20, 2013, at 4:59 PM, Ralph Castain wrote: > You obviously have some MCA params set somewhere: > >> -- >> A deprecated MCA parameter value was specified in an MCA parameter >> file. Deprecated MCA parameters should be avoided; they may disappear >> in future releases. >> >> Deprecated parameter: orte_rsh_agent >> -- > > Check your environment for anything with OMPI_MCA_xxx, and your default MCA > parameter file to see what has been specified. > > The allocation looks okay - I'll have to look for other debug flags you can > set. Meantime, can you please add --enable-debug to your configure cmd line > and rebuild? > > Thanks > Ralph > > > On Mar 20, 2013, at 4:39 PM, tmish...@jcity.maeda.co.jp wrote: > >> >> >> Hi Ralph, >> >> Here is a result of rerun with --display-allocation. >> I set OMP_NUM_THREADS=1 to make the problem clear. >> >> Regards, >> Tetsuya Mishima >> >> P.S. As far as I checked, these 2 cases are OK(no problem). >> (1)mpirun -v -np $NPROCS-x OMP_NUM_THREADS --display-allocation >> ~/Ducom/testbed/mPre m02-ld >> (2)mpirun -v -x OMP_NUM_THREADS --display-allocation ~/Ducom/testbed/mPre >> m02-ld >> >> Script File: >> >> #!/bin/sh >> #PBS -A tmishima >> #PBS -N Ducom-run >> #PBS -j oe >> #PBS -l nodes=2:ppn=4 >> export OMP_NUM_THREADS=1 >> cd $PBS_O_WORKDIR >> cp $PBS_NODEFILE pbs_hosts >> NPROCS=`wc -l < pbs_hosts` >> mpirun -v -np $NPROCS -hostfile pbs_hosts -x OMP_NUM_THREADS >> --display-allocation ~/Ducom/testbed/mPre m02-ld >> >> Output: >> -- >> A deprecated MCA parameter value was specified in an MCA parameter >> file. Deprecated MCA parameters should be avoided; they may disappear >> in future releases. >> >> Deprecated parameter: orte_rsh_agent >> -- >> >> == ALLOCATED NODES == >> >> Data for node: node06 Num slots: 4Max slots: 0 >> Data for node: node05 Num slots: 4Max slots: 0 >> >> = >> -- >> A hostfile was provided that contains at least one node not >> present in the allocation: >> >> hostfile: pbs_hosts >> node: node06 >> >> If you are operating in a resource-managed environment, then only >> nodes that are in the allocation can be used in the hostfile. You >> may find relative node syntax to be a useful alternative to >> specifying absolute node names see the orte_hosts man page for >> further information. >> -- >> >> >>> I've submitted a patch to fix the Torque launch issue - just some >> leftover garbage that existed at the time of the 1.7.0 branch and didn't >> get removed. >>> >>> For the hostfile issue, I'm stumped as I can't see how the problem would >> come about. Could you please rerun your original test and add >> "--display-allocation" to your cmd line? Let's see if it is >>> correctly finding the original allocation. >>> >>> Thanks >>> Ralph >>> >>> On Mar 19, 2013, at 5:08 PM, tmish...@jcity.maeda.co.jp wrote: >>> Hi Gus, Thank you for your comments. I understand your advice. Our script used to be --npernode type as well. As I told before, our cluster consists of nodes having 4, 8, and 32 cores, although it used to be homogeneous at the starting time. Furthermore, since performance of each core is almost same, a mixed use of nodes with different number of cores is possible, just like #PBS -l nodes=1:ppn=32+4:ppn=8. --npernode type is not applicable to such a mixed use. That's why I'd like to continue to use modified hostfile. 
By the way, the problem I reported to Jeff yesterday was that openmpi-1.7 with torque is something wrong, because it caused error against such a simple case as shown below, which surprised me. Now, the problem is not limited to modified hostfile, I guess. #PBS -l nodes=4:ppn=8 mpirun -np 8 ./my_program (OMP_NUM_THREADS=4) Regards, Tetsuya Mishima > Hi Tetsuya > > Your script that edits $PBS_NODEFILE into a separate hostfile > is very similar to some that I used here for > hybrid OpenMP+MPI programs on older versions of OMPI. > I haven't tried this in 1.6.X, > but it looks like you did and it works also. > I haven't tried 1.7 either. > Since we run producti
Re: [OMPI users] modified hostfile does not work with openmpi1.7rc8
Hi Ralph, I have a line below in ~/.openmpi/mca-params.conf to use rsh. orte_rsh_agent = /usr/bin/rsh I changed this line to: plm_rsh_agent = /usr/bin/rsh # for openmpi-1.7 Then, the error message disappeared. Thanks. Retruning to the subject, I can rebuild with --enable-debug. Just wait until it will complete. Regards, Tetsuya Mishima > You obviously have some MCA params set somewhere: > > > -- > > A deprecated MCA parameter value was specified in an MCA parameter > > file. Deprecated MCA parameters should be avoided; they may disappear > > in future releases. > > > > Deprecated parameter: orte_rsh_agent > > -- > > Check your environment for anything with OMPI_MCA_xxx, and your default MCA parameter file to see what has been specified. > > The allocation looks okay - I'll have to look for other debug flags you can set. Meantime, can you please add --enable-debug to your configure cmd line and rebuild? > > Thanks > Ralph > > > On Mar 20, 2013, at 4:39 PM, tmish...@jcity.maeda.co.jp wrote: > > > > > > > Hi Ralph, > > > > Here is a result of rerun with --display-allocation. > > I set OMP_NUM_THREADS=1 to make the problem clear. > > > > Regards, > > Tetsuya Mishima > > > > P.S. As far as I checked, these 2 cases are OK(no problem). > > (1)mpirun -v -np $NPROCS-x OMP_NUM_THREADS --display-allocation > > ~/Ducom/testbed/mPre m02-ld > > (2)mpirun -v -x OMP_NUM_THREADS --display-allocation ~/Ducom/testbed/mPre > > m02-ld > > > > Script File: > > > > #!/bin/sh > > #PBS -A tmishima > > #PBS -N Ducom-run > > #PBS -j oe > > #PBS -l nodes=2:ppn=4 > > export OMP_NUM_THREADS=1 > > cd $PBS_O_WORKDIR > > cp $PBS_NODEFILE pbs_hosts > > NPROCS=`wc -l < pbs_hosts` > > mpirun -v -np $NPROCS -hostfile pbs_hosts -x OMP_NUM_THREADS > > --display-allocation ~/Ducom/testbed/mPre m02-ld > > > > Output: > > -- > > A deprecated MCA parameter value was specified in an MCA parameter > > file. Deprecated MCA parameters should be avoided; they may disappear > > in future releases. > > > > Deprecated parameter: orte_rsh_agent > > -- > > > > == ALLOCATED NODES == > > > > Data for node: node06 Num slots: 4Max slots: 0 > > Data for node: node05 Num slots: 4Max slots: 0 > > > > = > > -- > > A hostfile was provided that contains at least one node not > > present in the allocation: > > > > hostfile: pbs_hosts > > node: node06 > > > > If you are operating in a resource-managed environment, then only > > nodes that are in the allocation can be used in the hostfile. You > > may find relative node syntax to be a useful alternative to > > specifying absolute node names see the orte_hosts man page for > > further information. > > -- > > > > > >> I've submitted a patch to fix the Torque launch issue - just some > > leftover garbage that existed at the time of the 1.7.0 branch and didn't > > get removed. > >> > >> For the hostfile issue, I'm stumped as I can't see how the problem would > > come about. Could you please rerun your original test and add > > "--display-allocation" to your cmd line? Let's see if it is > >> correctly finding the original allocation. > >> > >> Thanks > >> Ralph > >> > >> On Mar 19, 2013, at 5:08 PM, tmish...@jcity.maeda.co.jp wrote: > >> > >>> > >>> > >>> Hi Gus, > >>> > >>> Thank you for your comments. I understand your advice. > >>> Our script used to be --npernode type as well. > >>> > >>> As I told before, our cluster consists of nodes having 4, 8, > >>> and 32 cores, although it used to be homogeneous at the > >>> starting time. 
Furthermore, since performance of each core > >>> is almost same, a mixed use of nodes with different number > >>> of cores is possible, just like #PBS -l nodes=1:ppn=32+4:ppn=8. > >>> > >>> --npernode type is not applicable to such a mixed use. > >>> That's why I'd like to continue to use modified hostfile. > >>> > >>> By the way, the problem I reported to Jeff yesterday > >>> was that openmpi-1.7 with torque is something wrong, > >>> because it caused error against such a simple case as > >>> shown below, which surprised me. Now, the problem is not > >>> limited to modified hostfile, I guess. > >>> > >>> #PBS -l nodes=4:ppn=8 > >>> mpirun -np 8 ./my_program > >>> (OMP_NUM_THREADS=4) > >>> > >>> Regards, > >>> Tetsuya Mishima > >>> > Hi Tetsuya > > Your script that edits $PBS_NODEFILE into a separate hostfile > is very similar to some that I used here for > hybrid OpenMP+MPI programs on older versions of OMPI.
Re: [OMPI users] modified hostfile does not work with openmpi1.7rc8
Hi Ralph, I have completed rebuild of openmpi1.7rc8. To save time, I added --disable-vt. ( Is it OK? ) Well, what shall I do ? ./configure \ --prefix=/home/mishima/opt/mpi/openmpi-1.7rc8-pgi12.9 \ --with-tm \ --with-verbs \ --disable-ipv6 \ --disable-vt \ --enable-debug \ CC=pgcc CFLAGS="-fast -tp k8-64e" \ CXX=pgCC CXXFLAGS="-fast -tp k8-64e" \ F77=pgfortran FFLAGS="-fast -tp k8-64e" \ FC=pgfortran FCFLAGS="-fast -tp k8-64e" Note: I tried patch user.diff after rebuiding openmpi1.7rc8. But, I got an error and could not go foward. $ patch -p0 < user.diff # this is OK $ make # I got an error CC util/hostfile/hostfile.lo PGC-S-0037-Syntax error: Recovery attempted by deleting (util/hostfile/hostfile.c: 728) PGC/x86-64 Linux 12.9-0: compilation completed with severe errors Regards, Tetsuya Mishima > Could you please apply the attached patch and try it again? If you haven't had time to configure with --enable-debug, that is fine - this will output regardless. > > Thanks > Ralph > > - user.diff > > > On Mar 20, 2013, at 4:59 PM, Ralph Castain wrote: > > > You obviously have some MCA params set somewhere: > > > >> -- > >> A deprecated MCA parameter value was specified in an MCA parameter > >> file. Deprecated MCA parameters should be avoided; they may disappear > >> in future releases. > >> > >> Deprecated parameter: orte_rsh_agent > >> -- > > > > Check your environment for anything with OMPI_MCA_xxx, and your default MCA parameter file to see what has been specified. > > > > The allocation looks okay - I'll have to look for other debug flags you can set. Meantime, can you please add --enable-debug to your configure cmd line and rebuild? > > > > Thanks > > Ralph > > > > > > On Mar 20, 2013, at 4:39 PM, tmish...@jcity.maeda.co.jp wrote: > > > >> > >> > >> Hi Ralph, > >> > >> Here is a result of rerun with --display-allocation. > >> I set OMP_NUM_THREADS=1 to make the problem clear. > >> > >> Regards, > >> Tetsuya Mishima > >> > >> P.S. As far as I checked, these 2 cases are OK(no problem). > >> (1)mpirun -v -np $NPROCS-x OMP_NUM_THREADS --display-allocation > >> ~/Ducom/testbed/mPre m02-ld > >> (2)mpirun -v -x OMP_NUM_THREADS --display-allocation ~/Ducom/testbed/mPre > >> m02-ld > >> > >> Script File: > >> > >> #!/bin/sh > >> #PBS -A tmishima > >> #PBS -N Ducom-run > >> #PBS -j oe > >> #PBS -l nodes=2:ppn=4 > >> export OMP_NUM_THREADS=1 > >> cd $PBS_O_WORKDIR > >> cp $PBS_NODEFILE pbs_hosts > >> NPROCS=`wc -l < pbs_hosts` > >> mpirun -v -np $NPROCS -hostfile pbs_hosts -x OMP_NUM_THREADS > >> --display-allocation ~/Ducom/testbed/mPre m02-ld > >> > >> Output: > >> -- > >> A deprecated MCA parameter value was specified in an MCA parameter > >> file. Deprecated MCA parameters should be avoided; they may disappear > >> in future releases. > >> > >> Deprecated parameter: orte_rsh_agent > >> -- > >> > >> == ALLOCATED NODES == > >> > >> Data for node: node06 Num slots: 4Max slots: 0 > >> Data for node: node05 Num slots: 4Max slots: 0 > >> > >> = > >> -- > >> A hostfile was provided that contains at least one node not > >> present in the allocation: > >> > >> hostfile: pbs_hosts > >> node: node06 > >> > >> If you are operating in a resource-managed environment, then only > >> nodes that are in the allocation can be used in the hostfile. You > >> may find relative node syntax to be a useful alternative to > >> specifying absolute node names see the orte_hosts man page for > >> further information. 
> >> -- > >> > >> > >>> I've submitted a patch to fix the Torque launch issue - just some > >> leftover garbage that existed at the time of the 1.7.0 branch and didn't > >> get removed. > >>> > >>> For the hostfile issue, I'm stumped as I can't see how the problem would > >> come about. Could you please rerun your original test and add > >> "--display-allocation" to your cmd line? Let's see if it is > >>> correctly finding the original allocation. > >>> > >>> Thanks > >>> Ralph > >>> > >>> On Mar 19, 2013, at 5:08 PM, tmish...@jcity.maeda.co.jp wrote: > >>> > > > Hi Gus, > > Thank you for your comments. I understand your advice. > Our script used to be --npernode type as well. > > As I told before, our cluster consists of nodes having 4, 8, > and 32 cores, although it used to be homogeneous at the > starting time. Furthermore, since performance of each core > is almost same, a mixed use of
Re: [OMPI users] modified hostfile does not work with openmpi1.7rc8
Hi Ralph, Here is an output on openmpi-1.6.4, just for your information. A small difference is observed. I hope this helps you. Regards, Tetsuya Mishima openmpi-1.6.4: == ALLOCATED NODES == Data for node: node06.cluster Num slots: 4 Max slots: 0 Data for node: node05 Num slots: 4 Max slots: 0 = openmpi-1.7rc8 with --enable-debug: == ALLOCATED NODES == Data for node: node06 Num slots: 4 Max slots: 0 Data for node: node05 Num slots: 4 Max slots: 0 = -- A hostfile was provided that contains at least one node not present in the allocation: hostfile: pbs_hosts node: node06 If you are operating in a resource-managed environment, then only nodes that are in the allocation can be used in the hostfile. You may find relative node syntax to be a useful alternative to specifying absolute node names see the orte_hosts man page for further information. --
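The only visible difference between the two allocations is that 1.6.4 records the first node as node06.cluster while 1.7rc8 records it as node06, so one illustrative check (run inside the Torque job; commands are a hedged suggestion, not part of the thread) is to compare the names Torque writes into the nodefile with what the nodes call themselves:

    # Names as handed out by Torque (these end up in the copied hostfile)
    sort -u "$PBS_NODEFILE"
    # Name the local node reports for itself (repeat on each node, e.g. via pbsdsh)
    hostname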