[OMPI users] mpirun fails on remote applications
hi all, First of all, I'm new to Open MPI, so I don't know much about MPI setup. That's why I'm following the manual and FAQ suggestions from the beginning. Everything went well until I tried to run an application on a remote node with the 'mpirun -np' command. It just hangs there without doing anything: no error messages, no complaints whatsoever. What confuses me is that I can run the application over ssh with no problem, while mpirun just sits there and does nothing. I'm pretty sure I have everything set up the right way, including passwordless sign-in over ssh and environment variables for both interactive and non-interactive logins. A sample list of the commands used:

[fch6699@anfield05 test]$ mpicc -o hello hello.f
[fch6699@anfield05 test]$ ssh anfield04 ./hello
0 of 1: Hello world!
[fch6699@anfield05 test]$ mpirun -host anfield05 -np 4 ./hello
0 of 4: Hello world!
2 of 4: Hello world!
3 of 4: Hello world!
1 of 4: Hello world!
[fch6699@anfield05 test]$ mpirun -host anfield04 -np 4 ./hello
just hangs there forever!!! need help to fix this!!

Trying it another way:

[fch6699@anfield05 test]$ mpirun -hostfile my_hostfile -np 4 ./hello
still nothing happens: no warnings, no complaints, no error messages!!

All other files related to this issue can be found in my_files.tar.gz in the attachment:
.cshrc
The output of the "ompi_info --all" command.
my_hostfile
hello.c
output of iptables

The only thing I've noticed is that the port of our ssh has been changed from 22 to another number for security reasons. I don't know whether that has anything to do with it or not.

Any help will be highly appreciated!!

thanks in advance!

Kevin

my_files.tar.gz Description: application/gzip-compressed
Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
Now there is another problem :) You can oversubscribe a node, at least by 1 task. If your hostfile and rankfile limit you to N procs, you can ask mpirun for N+1 and it will not be rejected, although in reality there will only be N tasks. So if your hostfile limit is 4, then "mpirun -np 4" and "mpirun -np 5" both work, but in both cases there are only 4 tasks. It isn't crucial, because there is no real oversubscription, but there is still some bug which could affect something in the future. -- Anton Starikov.

On May 12, 2009, at 1:45 AM, Ralph Castain wrote: This is fixed as of r21208. Thanks for reporting it! Ralph

On May 11, 2009, at 12:51 PM, Anton Starikov wrote: Although removing this check solves the problem of having more slots in the rankfile than necessary, there is another problem. If I set rmaps_base_no_oversubscribe=1, then with, for example:

hostfile:
node01
node01
node02
node02

rankfile:
rank 0=node01 slot=1
rank 1=node01 slot=0
rank 2=node02 slot=1
rank 3=node02 slot=0

"mpirun -np 4 ./something" complains with: "There are not enough slots available in the system to satisfy the 4 slots that were requested by the application" but "mpirun -np 3 ./something" works. It works when you ask for 1 CPU less, and the behavior is the same in every case (shared nodes, non-shared nodes, multi-node). If you switch off rmaps_base_no_oversubscribe, then it works and all affinities are set as requested in the rankfile; there is no oversubscription. Anton.

On May 5, 2009, at 3:08 PM, Ralph Castain wrote: Ah - thx for catching that, I'll remove that check. It is no longer required. Thx!

On Tue, May 5, 2009 at 7:04 AM, Lenny Verkhovsky wrote: According to the code it does care.
$ vi orte/mca/rmaps/rank_file/rmaps_rank_file.c +572

ival = orte_rmaps_rank_file_value.ival;
if ( ival > (np-1) ) {
    orte_show_help("help-rmaps_rank_file.txt", "bad-rankfile", true, ival, rankfile);
    rc = ORTE_ERR_BAD_PARAM;
    goto unlock;
}

If I remember correctly, I used an array to map ranks, and since the length of the array is NP, the maximum index must be less than NP; so if you have a rank number greater than NP-1, there is no place to put it inside the array. "Likewise, if you have more procs than the rankfile specifies, we map the additional procs either byslot (default) or bynode (if you specify that option). So the rankfile doesn't need to contain an entry for every proc." - Correct point. Lenny.

On 5/5/09, Ralph Castain wrote: Sorry Lenny, but that isn't correct. The rankfile mapper doesn't care if the rankfile contains additional info - it only maps up to the number of processes, and ignores anything beyond that number. So there is no need to remove the additional info. Likewise, if you have more procs than the rankfile specifies, we map the additional procs either byslot (default) or bynode (if you specify that option). So the rankfile doesn't need to contain an entry for every proc. Just don't want to confuse folks. Ralph

On Tue, May 5, 2009 at 5:59 AM, Lenny Verkhovsky wrote: Hi, the maximum rank number must be less than np. If np=1 then there is only rank 0 in the system, so rank 1 is invalid. Please remove "rank 1=node2 slot=*" from the rankfile. Best regards, Lenny.
On Mon, May 4, 2009 at 11:14 AM, Geoffroy Pignot wrote: Hi, I got the openmpi-1.4a1r21095.tar.gz tarball, but unfortunately my command doesn't work.

cat rankf:
rank 0=node1 slot=*
rank 1=node2 slot=*

cat hostf:
node1 slots=2
node2 slots=2

mpirun --rankfile rankf --hostfile hostf --host node1 -n 1 hostname : --host node2 -n 1 hostname

Error, invalid rank (1) in the rankfile (rankf)
[r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file rmaps_rank_file.c at line 403
[r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file base/rmaps_base_map_job.c at line 86
[r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file base/plm_base_launch_support.c at line 86
[r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file plm_rsh_module.c at line 1016

Ralph, could you tell me whether my command syntax is correct or not? If not, could you give me the expected one? Regards, Geoffroy

2009/4/30 Geoffroy Pignot: Immediately Sir !!! :) Thanks again Ralph. Geoffroy

-- Message: 2 Date: Thu, 30 Apr 2009 06:45:39 -0600 From: Ralph Castain Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ?? To: Open MPI Users Message-ID: <71d2d8cc0904300545v61a42fe1k50086d2704d0f...@mail.gmail.com> Content-Type: text/plain; charset="iso-8859-1"

I believe this is fixed now in our development trunk - you can download any tarball starting from last night and give it a try, if you like. Any feedback would be appreciated. Ralph

On Apr 14, 2009, at 7:57 AM, Ralph Castain wrote: Ah now, I didn't say it -worked-, did I? :-) Clearly a bug exist
[OMPI users] Torque 2.2.1 problem with OpenMPI 1.2.5
Hi, I am using OFED 1.3, in which Open MPI 1.2.5 is included; I have compiled it with Intel and gcc. The problem is that via qsub I am not able to run the jobs, but when I use the mpiexec command directly it works fine without any issue. Here is my script file; please help me diagnose this issue.

#!/bin/sh
#PBS -N Ad2.0
#PBS -l nodes=3:ppn=8
#PBS -l walltime=100:00:00
date
cd ${PBS_O_WORKDIR}
nprocs=`wc -l < ${PBS_NODEFILE}`
echo "--- PBS_NODEFILE CONTENT ---"
cat $PBS_NODEFILE
echo "--- PBS_NODEFILE CONTENT ---"
echo "Submit host: $(hostname)"
mpiexec -machinefile ${PBS_NODEFILE} -n ${nprocs} ./prewet.out
date

Thanks and Regards
ANSUL SRIVASTAVA
MOBILE -- 9900180278
Sr. CSE TSG Group (Infrastructure Availability Services)
Wipro Infotech | 146-147, Metagalli Industrial Area, Metagalli, Mysore 570016
Direct Number: 0821-2419074/3088125

Please do not print this email unless it is absolutely necessary. The information contained in this electronic message and any attachments to this message are intended for the exclusive use of the addressee(s) and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you should not disseminate, distribute or copy this e-mail. Please notify the sender immediately and destroy all copies of this message and any attachments. WARNING: Computer viruses can be transmitted via email. The recipient should check this email and any attachments for the presence of viruses. The company accepts no liability for any damage caused by any virus transmitted by this email. www.wipro.com
[OMPI users] New warning messages in 1.3.2 (not present in 1.2.8)
Hi, I've managed to use 1.3.2 (still not with LSF and InfiniPath; I'm taking one step at a time), but I have additional warnings that didn't show up in 1.2.8:

[host-b:09180] mca: base: component_find: unable to open /home/brucher/lib/openmpi/mca_ras_dash_host: file not found (ignored)
[host-b:09180] mca: base: component_find: unable to open /home/brucher/lib/openmpi/mca_ras_gridengine: file not found (ignored)
[host-b:09180] mca: base: component_find: unable to open /home/brucher/lib/openmpi/mca_ras_localhost: file not found (ignored)
[host-b:09180] mca: base: component_find: unable to open /home/brucher/lib/openmpi/mca_errmgr_hnp: file not found (ignored)
[host-b:09180] mca: base: component_find: unable to open /home/brucher/lib/openmpi/mca_errmgr_orted: file not found (ignored)
[host-b:09180] mca: base: component_find: unable to open /home/brucher/lib/openmpi/mca_errmgr_proxy: file not found (ignored)
[host-b:09180] mca: base: component_find: unable to open /home/brucher/lib/openmpi/mca_iof_proxy: file not found (ignored)
[host-b:09180] mca: base: component_find: unable to open /home/brucher//lib/openmpi/mca_iof_svc: file not found (ignored)

Can this be fixed in some way?

Matthieu -- Information System Engineer, Ph.D. Website: http://matthieu-brucher.developpez.com/ Blogs: http://matt.eifelle.com and http://blog.developpez.com/?blog=92 LinkedIn: http://www.linkedin.com/in/matthieubrucher
Re: [OMPI users] mpirun fails on remote applications
Sounds like firewall problems to or from anfield04.
Lenny

On Tue, May 12, 2009 at 8:18 AM, feng chen wrote:
> [fch6699@anfield05 test]$ mpirun -host anfield04 -np 4 ./hello
> just hangs there forever!!!
> need help to fix this!!

___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] mpirun fails on remote applications
On Tue, 12 May 2009 11:54:57 +0300 Lenny Verkhovsky wrote:
> sounds like firewall problems to or from anfield04.

I'm having a similar problem; not sure if it's related (I gave up for the moment on 1.3+ Open MPI; 1.2.8 works fine, nothing above that).

1. Try taking down the firewall and see if it works.
2. Make sure that passwordless ssh is working (not sure if it's needed for all things, but still ...).
3. Can you test it with Open MPI 1.2.8?
4. Also, does starting the job in the other direction work? (4 -> 5 instead of 5 -> 4)

[fch6699@anfield04 test]$ mpirun -host anfield05 -np 4 ./hello

From what I can see on my cluster, my specific problem is that machines have different addresses depending on which machine you are connecting from (they are connected directly to each other, not through a switch with a central name server), and name lookup seems to happen on the master instead of the client node, so it is getting the wrong address.
Re: [OMPI users] mpirun fails on remote applications
thanks a lot. firewall it is.. It works with the firewall off, but that brings up another question: is there any way to run mpirun while the firewall is on? If yes, how do we set up the firewall or iptables? thank you

From: Micha Feigin
To: us...@open-mpi.org
Sent: Tuesday, May 12, 2009 4:30:30 AM
Subject: Re: [OMPI users] mpirun fails on remote applications

> 1. Try taking down the firewall and see if it works.
> 2. Make sure that passwordless ssh is working (not sure if it's needed for all things, but still ...).

___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
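On the question of running with the firewall on: mpirun's daemons open dynamic TCP ports between nodes (not just ssh's port), so the firewall has to allow general TCP traffic between cluster nodes. A hedged iptables sketch that simply trusts the cluster's private network; the interface name and subnet below are assumptions, substitute your own:

```shell
# Assumption: cluster nodes talk over eth1 on the private subnet 192.168.1.0/24.
# Allow all TCP traffic between cluster nodes so mpirun/orted can connect back:
iptables -A INPUT -i eth1 -s 192.168.1.0/24 -p tcp -j ACCEPT
# Keep established connections flowing:
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
```

This opens everything between the nodes rather than specific ports, which is usually acceptable on a trusted cluster network; locking it down to exact port ranges requires pinning Open MPI's ports via MCA parameters.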
Re: [OMPI users] strange bug
Can you send all the information listed here: http://www.open-mpi.org/community/help/

On May 11, 2009, at 10:03 PM, Anton Starikov wrote: By the way, this is Fortran code, which uses the F77 bindings. -- Anton Starikov.

On May 12, 2009, at 3:06 AM, Anton Starikov wrote:
> Due to rankfile fixes I switched to SVN r21208; now my code dies with this error:
>
> [node037:20519] *** An error occurred in MPI_Comm_dup
> [node037:20519] *** on communicator MPI COMMUNICATOR 32 SPLIT FROM 4
> [node037:20519] *** MPI_ERR_INTERN: internal error
> [node037:20519] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)

-- Jeff Squyres, Cisco Systems
Re: [OMPI users] Torque 2.2.1 problem with OpenMPI 1.2.5
The 1.2.x series has a bug in it when used with Torque. Simply do not include -machinefile on your mpiexec command line and it should work fine; it will automatically pick up the PBS_NODEFILE contents.

On May 12, 2009, at 1:17 AM, wrote:
> mpiexec -machinefile ${PBS_NODEFILE} -n ${nprocs} ./prewet.out
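Applied to the script from the original post, the fix amounts to dropping -machinefile; a sketch of the amended job script (the program name and resource requests are taken from the original post, and keeping -n is optional since a Torque-aware Open MPI sizes the job from the allocation itself):

```shell
#!/bin/sh
#PBS -N Ad2.0
#PBS -l nodes=3:ppn=8
#PBS -l walltime=100:00:00
date
cd ${PBS_O_WORKDIR}
# With Torque support compiled in, Open MPI's mpiexec reads the node
# allocation from the PBS environment; no -machinefile is needed.
mpiexec -n $(wc -l < ${PBS_NODEFILE}) ./prewet.out
date
```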
Re: [OMPI users] New warning messages in 1.3.2 (not present in 1.2.8)
Looking at this output, I would say that the problem is that you didn't recompile your code against 1.3.2. These are warnings about attempts to open components that were present in 1.2.8 but no longer exist in the 1.3.x series.

On May 12, 2009, at 2:30 AM, Matthieu Brucher wrote:
> [host-b:09180] mca: base: component_find: unable to open /home/brucher/lib/openmpi/mca_ras_dash_host: file not found (ignored)
> [host-b:09180] mca: base: component_find: unable to open /home/brucher/lib/openmpi/mca_ras_gridengine: file not found (ignored)
Re: [OMPI users] New warning messages in 1.3.2 (not present in 1.2.8)
Or it could be that you installed 1.3.2 over 1.2.8 -- some of the 1.2.8 components that no longer exist in the 1.3 series are still in the installation tree, but failed to open properly (unfortunately, libltdl gives an incorrect "file not found" error message if it is unable to load a plugin for any reason, such as if a symbol cannot be resolved from that plugin). The best thing to do is to install 1.3 in a clean, fresh tree, or uninstall your 1.2.8 before installing 1.3.

On May 12, 2009, at 7:35 AM, Ralph Castain wrote:
> Looking at this output, I would say that the problem is you didn't recompile your code against 1.3.2. These are warnings about attempts to open components that were present in 1.2.8, but no longer exist in the 1.3.x series.

-- Jeff Squyres, Cisco Systems
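A hedged sketch of the clean-reinstall sequence described above; the directory names are assumptions based on the warnings' /home/brucher/lib/openmpi paths, so adjust them to your own layout:

```shell
# Option 1: uninstall 1.2.8 from its original build tree, if you still have it:
cd openmpi-1.2.8 && make uninstall

# Option 2: install 1.3.2 into a brand-new prefix so stale 1.2.8
# plugins (e.g. mca_ras_dash_host) can never be picked up:
cd openmpi-1.3.2
./configure --prefix=$HOME/openmpi-1.3.2
make all install
```

Either way, make sure PATH and LD_LIBRARY_PATH point at the new prefix afterwards, and recompile your application against the new headers and libraries.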
Re: [OMPI users] New warning messages in 1.3.2 (not present in 1.2.8)
2009/5/12 Jeff Squyres:
> The best thing to do is to install 1.3 in a clean, fresh tree, or uninstall your 1.2.8 before installing 1.3.

OK, this is indeed the case. I'll try to clean the tree (I have several other packages and deleted the original 1.2.8 package) and test again. Thanks for the answers! Matthieu
Re: [OMPI users] strange bug
The hostfile comes from the Torque PBS_NODEFILE (OMPI is compiled with Torque support). It happens with or without a rankfile. Started with: mpirun -np 16 ./somecode

mca parameters:
btl = self,sm,openib
mpi_maffinity_alone = 1
rmaps_base_no_oversubscribe = 1
(rmaps_base_no_oversubscribe = 0 doesn't change it)

I tested with both "btl=self,sm" on 16-core nodes and "btl=self,sm,openib" on 8x dual-core nodes; the result is the same. It looks like it always occurs at exactly the same point in the execution, not at the beginning; it is not the first MPI_Comm_dup in the code. I can't say much about the particular piece of code where it happens, because it is in a 3rd-party library (MUMPS). When the error occurs, MPI_Comm_dup in every task deals with a single-task communicator (MPI_Comm_split of the initial MPI_COMM_WORLD for 16 processes into 16 groups, 1 process per group). And I can guess that before this error, MPI_Comm_dup is called something like 100 times by the same piece of code on the same communicators without any problems. I can say that it used to work correctly with all previous versions of Open MPI we used (1.2.8-1.3.2 and some earlier versions). It also works correctly on other platforms/MPI implementations. All environment variables (PATH, LD_LIBRARY_PATH) are correct. I recompiled the code and the 3rd-party libraries with this version of OMPI.

config.log.gz Description: GNU Zip compressed data
ompi-info.txt.gz Description: GNU Zip compressed data

-- Anton Starikov. Computational Material Science, Faculty of Science and Technology, University of Twente. Phone: +31 (0)53 489 2986 Fax: +31 (0)53 489 2910

On May 12, 2009, at 12:35 PM, Jeff Squyres wrote:
> Can you send all the information listed here: http://www.open-mpi.org/community/help/
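For context, the communicator pattern Anton describes can be sketched as below. This is a minimal reproduction-style sketch under the stated assumptions (split MPI_COMM_WORLD into one-process groups, then repeatedly dup the single-task communicator); it is not MUMPS's actual code:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, i;
    MPI_Comm self_comm, dup_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Split the world into one group per process (color = rank),
     * yielding N single-task communicators under mpirun -np N. */
    MPI_Comm_split(MPI_COMM_WORLD, rank, 0, &self_comm);

    /* Dup the single-task communicator many times, as the 3rd-party
     * library reportedly does before the MPI_Comm_dup failure. */
    for (i = 0; i < 100; i++) {
        MPI_Comm_dup(self_comm, &dup_comm);
        MPI_Comm_free(&dup_comm);
    }

    MPI_Comm_free(&self_comm);
    MPI_Finalize();
    return 0;
}
```

If this loop also dies around the same dup count on r21208, it would point at communicator-ID (CID) allocation rather than at MUMPS.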
Re: [OMPI users] New warning messages in 1.3.2 (not present in 1.2.8)
On May 12, 2009, at 8:17 AM, Matthieu Brucher wrote:
> OK, this is indeed the case. I'll try to clean the tree (I have several other packages and deleted the original 1.2.8 package) and test again.

This misleading libltdl error message continues to bite us over and over again (users and developers alike), so I just put in a workaround with some heuristics to try to print a better error message in cases like yours. Now you'll see something like this:

[foo.example.com:24273] mca: base: component_find: unable to open /home/jsquyres/bogus/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)

Changeset here (Shiqing tells me it'll work on Windows, too -- MTT will tell us for sure :-) ): https://svn.open-mpi.org/trac/ompi/changeset/21214

And I filed a ticket to move it to v1.3.3 here: https://svn.open-mpi.org/trac/ompi/ticket/1917

-- Jeff Squyres, Cisco Systems
Re: [OMPI users] New warning messages in 1.3.2 (not present in 1.2.8)
Thanks a lot for this. I've just checked everything again, recompiled my code as well (I'm using SCons, so it detects that the headers and the libraries changed), and it works without a warning. Matthieu

2009/5/12 Jeff Squyres:
> This misleading libltdl error message continues to bite us over and over again (users and developers alike), so I just put in a workaround with some heuristics to try to print a better error message in cases like yours.
Re: [OMPI users] Bug in return status of MPI_WAIT()?
Greetings Jacob; sorry for the slow reply. This is pretty subtle, but I think that your test is incorrect (I remember arguing about this a long time ago and eventually having another OMPI developer prove me wrong! :-) ).

1. You're setting MPI_ERRORS_RETURN, which, if you're using the C++ bindings, means you won't be able to see if an error occurs because they don't return the int error codes.

2. The MPI_ERROR field in the status is specifically *not* set for MPI_TEST and MPI_WAIT. It *is* set for the multi-test/wait functions (e.g., MPI_TESTANY, MPI_WAITALL). MPI-2.1 p52:44-48 says:

"Error codes belonging to the error class MPI_ERR_IN_STATUS should be returned only by the MPI completion functions that take arrays of MPI_STATUS. For the functions MPI_TEST, MPI_TESTANY, MPI_WAIT, and MPI_WAITANY, which return a single MPI_STATUS value, the normal MPI error return process should be used (not the MPI_ERROR field in the MPI_STATUS argument)."

So I think you need to use MPI::ERRORS_THROW_EXCEPTIONS to catch the error in this case, or look at the return value from the C binding for MPI_WAIT.

On May 10, 2009, at 5:51 AM, Katz, Jacob wrote:
> Hi, While trying error-related functionality of OMPI, I came across a situation where, when I use the MPI_ERRORS_RETURN error handler, the errors do not come out correctly from WAIT calls. The program below correctly terminates with a fatal "message truncated" error, but when the line setting the error handler to MPI_ERRORS_RETURN is uncommented, it silently completes. I expected the print-out that checks the status after the WAIT call to be executed, but it wasn't. The issue didn't happen when using a blocking recv. A bug or my incorrect usage? Thanks!
// mpic++ -o test test.cpp
// mpirun -np 2 ./test
#include "mpi.h"
#include <iostream>
using namespace std;

int main (int argc, char *argv[])
{
    int rank;
    char buf[100] = "h";
    MPI::Status stat;

    MPI::Init(argc, argv);
    rank = MPI::COMM_WORLD.Get_rank();

    //MPI::COMM_WORLD.Set_errhandler(MPI::ERRORS_RETURN);

    if (rank == 0)
    {
        MPI::Request r = MPI::COMM_WORLD.Irecv(buf, 1, MPI_CHAR, MPI::ANY_SOURCE, MPI::ANY_TAG);
        r.Wait(stat);
        if (stat.Get_error() != MPI::SUCCESS)
        {
            cout << "0: Error during recv" << endl;
        }
    }
    else
    {
        MPI::COMM_WORLD.Send(buf, 2, MPI_CHAR, 0, 0);
    }

    MPI::Finalize();
    return (0);
}

Jacob M. Katz | jacob.k...@intel.com | Work: +972-4-865-5726 | iNet: (8)-465-5726 - Intel Israel (74) Limited This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies. -- Jeff Squyres Cisco Systems
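Following Jeff's suggestion, the same test can be expressed through the C bindings, where the error comes back as MPI_Wait's return value rather than in the status. This is a sketch only (untested here; MPI_Comm_set_errhandler is the MPI-2 name for the older MPI_Errhandler_set):

```c
/* Sketch: C-binding version of the test above. With MPI_ERRORS_RETURN
 * installed, the truncation error should surface as MPI_Wait's return
 * code, not in stat.MPI_ERROR (which MPI_Wait does not set). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, rc;
    char buf[100] = "h";
    MPI_Request req;
    MPI_Status stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Return error codes instead of aborting. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    if (rank == 0) {
        MPI_Irecv(buf, 1, MPI_CHAR, MPI_ANY_SOURCE, MPI_ANY_TAG,
                  MPI_COMM_WORLD, &req);
        rc = MPI_Wait(&req, &stat);  /* check rc, not stat.MPI_ERROR */
        if (rc != MPI_SUCCESS)
            printf("0: error during recv (rc=%d)\n", rc);
    } else if (rank == 1) {
        /* Send 2 chars into a 1-char receive: truncation. */
        MPI_Send(buf, 2, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```

Run with `mpirun -np 2 ./test`; in the C++ bindings the equivalent is MPI::ERRORS_THROW_EXCEPTIONS plus a try/catch, as Jeff notes.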
Re: [OMPI users] Bug in return status of MPI_WAIT()?
On May 12, 2009, at 9:37 AM, Jeff Squyres wrote: 2. The MPI_ERROR field in the status is specifically *not* set for MPI_TEST and MPI_WAIT. It *is* set for the multi-test/wait functions (e.g., MPI_TESTANY, MPI_WAITALL). Oops! Typo -- I should have said "(e.g., MPI_TESTALL, MPI_WAITALL)". Just like MPI_TEST, the MPI_ERROR field is not set for MPI_TESTANY because it's a single-completion function. -- Jeff Squyres Cisco Systems
Re: [OMPI users] strange bug
Hey Edgar -- Could this have anything to do with your recent fixes? On May 12, 2009, at 8:30 AM, Anton Starikov wrote: hostfile from torque PBS_NODEFILE (OMPI is compiled with torque support) It happens with or without a rankfile. Started with mpirun -np 16 ./somecode mca parameters: btl = self,sm,openib mpi_maffinity_alone = 1 rmaps_base_no_oversubscribe = 1 (rmaps_base_no_oversubscribe = 0 doesn't change it) I tested with both: "btl=self,sm" on 16-core nodes and "btl=self,sm,openib" on 8x dual-core nodes, and the result is the same. It looks like it always occurs at exactly the same point in the execution, not at the beginning; it is not the first MPI_Comm_dup in the code. I can't say too much about the particular piece of code where it is happening, because it is in a 3rd-party library (MUMPS). When the error occurs, MPI_Comm_dup in every task deals with a single-task communicator (MPI_Comm_split of the initial MPI_COMM_WORLD for 16 processes into 16 groups, 1 process per group). And I can guess that before this error, MPI_Comm_dup is called something like 100 times by the same piece of code on the same communicators without any problems. I can say that it used to work correctly with all previous versions of openmpi we used (1.2.8-1.3.2 and some earlier versions). It also works correctly on other platforms/MPI implementations. All environment variables (PATH, LD_LIBRARY_PATH) are correct. I recompiled the code and 3rd-party libraries with this version of OMPI. -- Jeff Squyres Cisco Systems
Re: [OMPI users] mpirun fails on remote applications
Open MPI requires that each MPI process be able to connect to any other MPI process in the same job with random TCP ports. It is usually easiest to leave the firewall off, or set up trust relationships between your cluster nodes. On May 12, 2009, at 6:04 AM, feng chen wrote: thanks a lot. The firewall it is. It works with the firewall off, but that brings another question from me: is there any way to run mpirun while the firewall is on? If yes, how do we set up the firewall or iptables? thank you From: Micha Feigin To: us...@open-mpi.org Sent: Tuesday, May 12, 2009 4:30:30 AM Subject: Re: [OMPI users] mpirun fails on remote applications On Tue, 12 May 2009 11:54:57 +0300 Lenny Verkhovsky wrote: > sounds like firewall problems to or from anfield04. > Lenny, > > On Tue, May 12, 2009 at 8:18 AM, feng chen wrote: > I'm having a similar problem, not sure if it's related (gave up for the moment on 1.3+ openmpi, 1.2.8 works fine, nothing above that). 1. Try taking down the firewall and see if it works 2. Make sure that passwordless ssh is working (not sure if it's needed for all things but still ...) 3. Can you test it maybe with openmpi 1.2.8? 4. Also, does posting the job in the other direction work? (4 -> 5 instead of 5 -> 4) [fch6699@anfield04 test]$ mpirun -host anfield05 -np 4 ./hello From what I see on my cluster, my specific problem is that machines have different addresses based on which machine you are connecting from (they are connected directly to each other, not through a switch with a central name server), and name lookup seems to happen on the master instead of the client node, so it is getting the wrong address. > > hi all, > > > > First of all,i'm new to openmpi. So i don't know much about mpi setting. > > That's why i'm following manual and FAQ suggestions from the beginning. > > Everything went well untile i try to run a pllication on a remote node by > > using 'mpirun -np' command. 
It just hanging there without doing anything, no > > error messanges, no > > complaining or whatsoever. What confused me is that i can run application > > over ssh with no problem, while it comes to mpirun, just stuck in there does > > nothing. > > I'm pretty sure i got everyting setup in the right way manner, including no > > password signin over ssh, environment variables for bot interactive and > > non-interactive logons. > > A sample list of commands been used list as following: > > > > > > > > > > [fch6699@anfield05 test]$ mpicc -o hello hello.f > > [fch6699@anfield05 test]$ ssh anfield04 ./hello > > 0 of 1: Hello world! > > [fch6699@anfield05 test]$ mpirun -host anfield05 -np 4 ./hello > > 0 of 4: Hello world! > > 2 of 4: Hello world! > > 3 of 4: Hello world! > > 1 of 4: Hello world! > > [fch6699@anfield05 test]$ mpirun -host anfield04 -np 4 ./hello > > just hanging there for years!!! > > need help to fix this !! > > if u try it in another way > > [fch6699@anfield05 test]$ mpirun -hostfile my_hostfile -np 4 ./ hell > > still nothing happened, no warnnings, no complains, no error messages.. !! > > > > All other files related to this issue can be found in my_files.tar.gz in > > attachment. > > > > .cshrc > > The output of the "ompi_info --all" command. > > my_hostfile > > hello.c > > output of iptables > > > > The only thing i've noticed is that the port of our ssh has been changed > > from 22 to other number for security issues. > > Don't know will that have anything to with it or not. > > > > > > Any help will be highly appreciated!! > > > > thanks in advance! > > > > Kevin > > > > > > > > > > ___ > > users mailing list > > us...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users -- Jeff Squyres Cisco Systems
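As a follow-up to the iptables question in this thread: Open MPI's TCP components can usually be pinned to a fixed port range via MCA parameters, which can then be opened in the firewall. A hedged sketch only — verify the parameter names against your build with `ompi_info --param btl tcp`, and the subnet below is made up:

```shell
# Pin the MPI traffic (TCP BTL) to a known port range, e.g. 10000-10099:
mpirun --mca btl_tcp_port_min_v4 10000 --mca btl_tcp_port_range_v4 100 \
       -host anfield04 -np 4 ./hello

# Then allow that range (plus your non-standard ssh port) between nodes:
iptables -A INPUT -p tcp -s 192.168.1.0/24 --dport 10000:10099 -j ACCEPT

# Note: the runtime's out-of-band channel (oob tcp) opens ports of its
# own; check `ompi_info --param oob tcp` for the matching parameters.
```

Leaving the firewall off on the cluster-internal interface, as suggested above, remains the simpler option.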
Re: [OMPI users] strange bug
I would say the probability is large that it is due to the recent 'fix'. I will try to create a testcase similar to what you suggested. Could you give us maybe some hints on which functionality of MUMPS you are using, or even share the code/ a code fragment? Thanks Edgar Jeff Squyres wrote: Hey Edgar -- Could this have anything to do with your recent fixes? On May 12, 2009, at 8:30 AM, Anton Starikov wrote: hostfile from torque PBS_NODEFILE (OMPI is compilled with torque support) It happens with or without rankfile. Started with mpirun -np 16 ./somecode mca parameters: btl = self,sm,openib mpi_maffinity_alone = 1 rmaps_base_no_oversubscribe = 1 (rmaps_base_no_oversubscribe = 0 doesn't change it) I tested with both: "btl=self,sm" on 16c-core nodes and "btl=self,sm,openib" on 8x dual-core nodes , result is the same. It looks like it always occurs exactly at the same point in the execution, not at the beginning, it is not first MPI_Comm_dup in the code. I can't say too much about particular piece of the code, where it is happening, because it is in the 3rd-party library (MUMPS). When error occurs, MPI_Comm_dup in every task deals with single-task communicator (MPI_Comm_split of initial MPI_Comm_world for 16 processes into 16 groups, 1 process per group). And I can guess that before this error, MPI_Comm_dup is called something like 100 of times by the same piece of code on the same communicators without any problems. I can say that it used to work correctly with all previous versions of openmpi we used (1.2.8-1.3.2 and some earlier versions). It also works correctly on other platforms/MPI implementations. All environmental variables (PATH, LD_LIBRARY_PATH) are correct. I recompiled code and 3rd-party libraries with this version of OMPI. -- Edgar Gabriel Assistant Professor Parallel Software Technologies Lab http://pstl.cs.uh.edu Department of Computer Science University of Houston Philip G. 
Hoffman Hall, Room 524, Houston, TX 77204, USA Tel: +1 (713) 743-3857 Fax: +1 (713) 743-3335
Re: [OMPI users] Bug in return status of MPI_WAIT()?
Ah... Thanks, Jeff. If the standard would explicitly mention that MPI::ERRORS_RETURN is useless with C++ binding, life would be a little easier... Jacob M. Katz | jacob.k...@intel.com | Work: +972-4-865-5726 | iNet: (8)-465-5726 -Original Message- From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff Squyres Sent: Tuesday, May 12, 2009 16:37 To: Open MPI Users Subject: Re: [OMPI users] Bug in return status of MPI_WAIT()? Greetings Jacob; sorry for the slow reply. This is pretty subtle, but I think that your test is incorrect (I remember arguing about this a long time ago and eventually having another OMPI developer prove me wrong! :-) ). 1. You're setting MPI_ERRORS_RETURN, which, if you're using the C++ bindings, means you won't be able to see if an error occurs because they don't return the int error codes. 2. The MPI_ERROR field in the status is specifically *not* set for MPI_TEST and MPI_WAIT. It *is* set for the multi-test/wait functions (e.g., MPI_TESTANY, MPI_WAITALL). MPI-2.1 p52:44-48 says: "Error codes belonging to the error class MPI_ERR_IN_STATUS should be returned only by the MPI completion functions that take arrays of MPI_STATUS. For the functions MPI_TEST, MPI_TESTANY, MPI_WAIT, and MPI_WAITANY, which return a single MPI_STATUS value, the normal MPI error return process should be used (not the MPI_ERROR field in the MPI_STATUS argument)." So I think you need to use MPI::ERRORS_THROW_EXCEPTIONS to catch the error in this case, or look at the return value from the C binding for MPI_WAIT. On May 10, 2009, at 5:51 AM, Katz, Jacob wrote: > Hi, > While trying error-related functionality of OMPI, I came across a > situation where when I use MPI_ERRORS_RETURN error handler, the > errors do not come out correctly from WAIT calls. > The program below correctly terminates with a fatal "message > truncated" error, but when the line setting the error handler to > MPI_ERRORS_RETURN is uncommented, it silently completes. 
I expected > the print out that checks the status after WAIT call to be executed, > but it wasn't. > The issue didn't happen when using blocking recv. > > A bug or my incorrect usage? > > Thanks! > > // mpic++ -o test test.cpp > // mpirun -np2 ./test > #include "mpi.h" > #include > using namespace std; > > int main (int argc, char *argv[]) > { > int rank; > char buf[100] = "h"; > MPI::Status stat; > > MPI::Init(argc, argv); > rank = MPI::COMM_WORLD.Get_rank(); > > //MPI::COMM_WORLD.Set_errhandler(MPI::ERRORS_RETURN); > > if (rank == 0) > { > MPI::Request r = MPI::COMM_WORLD.Irecv(buf, 1, MPI_CHAR, > MPI::ANY_SOURCE, MPI::ANY_TAG); > r.Wait(stat); > if (stat.Get_error() != MPI::SUCCESS) > { > cout << "0: Error during recv" << endl; > } > } > else > { > MPI::COMM_WORLD.Send(buf, 2, MPI_CHAR, 0, 0); > } > > MPI::Finalize(); > return (0); > } > > > Jacob M. Katz | jacob.k...@intel.com | Work: +972-4-865-5726 | iNet: > (8)-465-5726 > > - > Intel Israel (74) Limited > > This e-mail and any attachments may contain confidential material for > the sole use of the intended recipient(s). Any review or distribution > by others is strictly prohibited. If you are not the intended > recipient, please contact the sender and delete all copies. > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users -- Jeff Squyres Cisco Systems ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users - Intel Israel (74) Limited This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies.
[OMPI users] Problem installing Dalton with OpenMPI over PelicanHPC
Dear all, I am trying to install the Dalton quantum chemistry program with OpenMPI over PelicanHPC, but it ends with an error. PelicanHPC comes with both LAM and OpenMPI preinstalled. The version of OpenMPI is "OMPI_VERSION "1.2.7rc2"" (from version.h). The wrappers that I use are mpif77.openmpi and mpicc.openmpi. Below, you can see the "link" and "include" of the wrappers:

++
pelican:/# mpicc.openmpi -show
gcc -I/usr/lib/openmpi/include/openmpi -I/usr/lib/openmpi/include -pthread -L/usr/lib/openmpi/lib -lmpi -lopen-rte -lopen-pal -ldl -Wl,--export-dynamic -lnsl -lutil -lm -ldl

pelican:/# mpif77.openmpi -show
gfortran -I/usr/lib/openmpi/include -pthread -L/usr/lib/openmpi/lib -lmpi_f77 -lmpi -lopen-rte -lopen-pal -ldl -Wl,--export-dynamic -lnsl -lutil -lm -ldl

pelican:/# mpif77.openmpi -v
Using built-in specs.
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 4.3.2-1.1' --with-bugurl=file:///usr/share/doc/gcc-4.3/README.Bugs --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --enable-shared --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --enable-nls --with-gxx-include-dir=/usr/include/c++/4.3 --program-suffix=-4.3 --enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc --enable-mpfr --enable-cld --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.3.2 (Debian 4.3.2-1.1)
+

The Makefile.conf of Dalton is:

++
ARCH = linux
#
#
CPPFLAGS = -DVAR_G77 -DSYS_LINUX -DVAR_MFDS -DVAR_SPLITFILES -D'INSTALL_WRKMEM=6000' -D'INSTALL_BASDIR="/root/Fig/dalton-2.0/basis/"' -DVAR_MPI -DIMPLICIT_NONE
F77 = mpif77.openmpi
CC = mpicc.openmpi
RM = rm -f
FFLAGS = -march=x86-64 -O3 -ffast-math -fexpensive-optimizations -funroll-loops -fno-range-check -fsecond-underscore
SAFEFFLAGS = -march=x86-64 -O3 -ffast-math -fexpensive-optimizations -funroll-loops -fno-range-check -fsecond-underscore
CFLAGS = -march=x86-64 -O3
-ffast-math -fexpensive-optimizations -funroll-loops -std=c99 -DRESTRICT=restrict
INCLUDES = -I../include
LIBS = -L/usr/lib -llapack -lblas
INSTALLDIR = /root/Fig/dalton-2.0/bin
PDPACK_EXTRAS = linpack.o eispack.o
GP_EXTRAS =
AR = ar
ARFLAGS = rvs
# flags for ftnchek on Dalton /hjaaj
CHEKFLAGS = -nopure -nopretty -nocommon -nousage -noarray -notruncation -quiet -noargumants -arguments=number -usage=var-unitialized
# -usage=var-unitialized:arg-const-modified:arg-alias
# -usage=var-unitialized:var-set-unused:arg-unused:arg-const-modified:arg-alias
#
default : linuxparallel.x
#
# Parallel initialization
#
MPI_INCLUDE_DIR = -I/usr/lib/openmpi/include
MPI_LIB_PATH = -L/usr/lib/openmpi/lib
MPI_LIB = -lmpi
#
#
# Suffix rules
# hjaaj Oct 04: .g is a "cheat" suffix, for debugging.
# 'make x.g' will create x.o from x.F or x.c with -g debug flag set.
#
.SUFFIXES : .F .o .c .i .g
.F.o:
	$(F77) $(INCLUDES) $(CPPFLAGS) $(FFLAGS) -c $*.F
.F.g:
	$(F77) $(INCLUDES) $(CPPFLAGS) $(FFLAGS) -g -c $*.F
.c.o:
	$(CC) $(INCLUDES) $(CPPFLAGS) $(CFLAGS) -c $*.c
.c.g:
	$(CC) $(INCLUDES) $(CPPFLAGS) $(CFLAGS) -g -c $*.c
.F.i:
	$(F77) $(INCLUDES) $(CPPFLAGS) -E $*.F > $*.i

The "make" command gives me the error:

+++
---> Linking sequential dalton.x ... 
mpif77.openmpi -march=x86-64 -O3 -ffast-math -fexpensive-optimizations -funroll-loops -fno-range-check -fsecond-underscore \ -o /root/Fig/dalton-2.0/bin/dalton.x abacus/dalton.o cc/crayio.o abacus/linux_mem_allo.o \ abacus/herpar.o eri/eri2par.o amfi/amfi.o amfi/symtra..o gp/mpi_dummy.o -Labacus -labacus -Lrsp -lrsp -Lsirius -lsirius -labacus -Leri -leri -Ldensfit -ldensfit -Lcc -lcc -Ldft -ldft -Lgp -lgp -Lpdpack -lpdpack -L/usr/lib -llapack -lblas dft/libdft.a(general.o): In function `mpi_sync_data': general.c:(.text+0x78): undefined reference to `ompi_mpi_comm_world' general.c:(.text+0xc3): undefined reference to `ompi_mpi_comm_world' general.c:(.text+0xdc): undefined reference to `ompi_mpi_comm_world' general.c:(.text+0xff): undefined reference to `ompi_mpi_comm_world' general.c:(.text+0x122): undefined reference to `ompi_mpi_comm_world' dft/libdft.a(general.o):general.c:(.text+0x136): more undefined references to `ompi_mpi_comm_world' follow dft/libdft.a(general.o): In function `dft_cslave__': general.c:(.text+0x44e): undefined reference to `ompi_mpi_int' dft/libdft.a(general.o): In function `dft_wake_slaves': general.c:(.text+0x485): undefined reference to `ompi_mpi_comm_world' general.c:(.text+0x4e7): undefined reference to `ompi_mpi_comm_world' general.c:(.text+0x4ee): undefined reference to `ompi_mpi_int' general.c:(.te
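For what it's worth, a hedged way to narrow down such "undefined reference to `ompi_mpi_comm_world'" link errors (a diagnostic sketch, not a confirmed fix for Dalton) is to check where the symbol is defined and which objects need it:

```shell
# Is the symbol exported by the libmpi being linked against?
nm -D /usr/lib/openmpi/lib/libmpi.so | grep ompi_mpi_comm_world
# Expect a defined ('D'/'B') entry here.

# Which members of the failing archive reference it?
# ('U' means "needed but undefined" -- normal for objects, which must
#  then be resolved by -lmpi at link time.)
nm dft/libdft.a | grep ompi_mpi_comm_world
```

If libdft.a was compiled with -DVAR_MPI (real MPI calls) while the "sequential" link pulls in Dalton's serial stubs (gp/mpi_dummy.o) rather than the real libmpi, that mismatch could produce exactly these errors; rebuilding the dft directory to match the chosen target may help.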
Re: [OMPI users] mpirun fails on remote applications
It is usually best to separate the cluster (MPI) interfaces from the internet interface. Usually on a dedicated cluster it is best to have a master node that is connected to the internet and client nodes that are connected to the master node (tunneling the connection through it to the internet if needed), or via a gateway machine. That way the cluster machines don't need a firewall. In case all machines are connected directly to the internet, it is better to have one (usually cheap) connection to the internet that can be firewalled, and a (high-end) connection inside the cluster that doesn't need a firewall. On Tue, 12 May 2009 10:22:28 -0400 Jeff Squyres wrote: > Open MPI requires that each MPI process be able to connect to any > other MPI process in the same job with random TCP ports. It is > usually easiest to leave the firewall off, or setup trust > relationships between your cluster nodes. > > > On May 12, 2009, at 6:04 AM, feng chen wrote: > > > thanks a lot. firewall it is.. It works with firewall's off, while > > that brings another questions from me. Is there anyway we can run > > mpirun while firwall 's on? If yes, how do we setup firewall or > > iptables? > > > > thank you > > > > From: Micha Feigin > > To: us...@open-mpi.org > > Sent: Tuesday, May 12, 2009 4:30:30 AM > > Subject: Re: [OMPI users] mpirun fails on remote applications > > > > On Tue, 12 May 2009 11:54:57 +0300 > > Lenny Verkhovsky wrote: > > > > > sounds like firewall problems to or from anfield04. > > > Lenny, > > > > > > On Tue, May 12, 2009 at 8:18 AM, feng chen > > wrote: > > > > > > > I'm having a similar problem, not sure if it's related (gave up for > > the moment > > on 1.3+ openmpi, 1.2.8 works fine nothing above that). > > > > 1. Try taking down the firewall and see if it works > > 2. Make sure that passwordless ssh is working (not sure if it's > > needed for all > > things but still ...) > > 3. can you test it maybe with openmpi 1.2.8? > > 4. 
also, does posting the job in the other direction work? (4 -> 5 > > instead of 5 -> 4) > > [fch6699@anfield04 test]$ mpirun -host anfield05 -np 4 ./hello > > > > >From what it seems on my cluster for my specific problem is that > > machines have > > different addresses based on which machine you are connecting from > > (they are > > connected directly to each other, not through a switch with a > > central name > > server), and name lookup seems to happen on the master instead of > > the client > > node so it is getting the wrong address. > > > > > > hi all, > > > > > > > > First of all,i'm new to openmpi. So i don't know much about mpi > > setting. > > > > That's why i'm following manual and FAQ suggestions from the > > beginning. > > > > Everything went well untile i try to run a pllication on a > > remote node by > > > > using 'mpirun -np' command. It just hanging there without doing > > anything, no > > > > error messanges, no > > > > complaining or whatsoever. What confused me is that i can run > > application > > > > over ssh with no problem, while it comes to mpirun, just stuck > > in there does > > > > nothing. > > > > I'm pretty sure i got everyting setup in the right way manner, > > including no > > > > password signin over ssh, environment variables for bot > > interactive and > > > > non-interactive logons. > > > > A sample list of commands been used list as following: > > > > > > > > > > > > > > > > > > > > [fch6699@anfield05 test]$ mpicc -o hello hello.f > > > > [fch6699@anfield05 test]$ ssh anfield04 ./hello > > > > 0 of 1: Hello world! > > > > [fch6699@anfield05 test]$ mpirun -host anfield05 -np 4 ./hello > > > > 0 of 4: Hello world! > > > > 2 of 4: Hello world! > > > > 3 of 4: Hello world! > > > > 1 of 4: Hello world! > > > > [fch6699@anfield05 test]$ mpirun -host anfield04 -np 4 ./hello > > > > just hanging there for years!!! > > > > need help to fix this !! 
> > > > if u try it in another way > > > > [fch6699@anfield05 test]$ mpirun -hostfile my_hostfile -np 4 ./ > > hell > > > > still nothing happened, no warnnings, no complains, no error > > messages.. !! > > > > > > > > All other files related to this issue can be found in > > my_files.tar.gz in > > > > attachment. > > > > > > > > .cshrc > > > > The output of the "ompi_info --all" command. > > > > my_hostfile > > > > hello.c > > > > output of iptables > > > > > > > > The only thing i've noticed is that the port of our ssh has been > > changed > > > > from 22 to other number for security issues. > > > > Don't know will that have anything to with it or not. > > > > > > > > > > > > Any help will be highly appreciated!! > > > > > > > > thanks in advance! > > > > > > > > Kevin > > > > > > > > > > > > > > > > > > > > ___ > > > > users mailing list > > > > us...@open-mpi.org > > > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > > > > > > > ___ > > users mailing
Re: [OMPI users] strange bug
hm, so I am out of ideas. I created multiple variants of test-programs which did what you basically described, and they all passed and did not generate problems. I compiled the MUMPS library and ran the tests that they have in the examples directory, and they all worked. Additionally, I checked in the source code of Open MPI. In comm_dup there is only a single location where we raise the error MPI_ERR_INTERN (which was reported in your email). I am fairly positive, that this can not occur, else we would segfault prior to that (it is a stupid check, don't ask). Furthermore, the code segment that has been modified does not raise anywhere MPI_ERR_INTERN. Of course, it could be a secondary effect and be created somewhere else (PML_ADD or collective module selection) and comm_dup just passes the error code up. One way or the other, I need more hints on what the code does. Any chance of getting a smaller code fragment which replicates the problem? It could use the MUMPS library, I am fine with that since I just compiled and installed it with the current ompi trunk... Thanks Edgar Edgar Gabriel wrote: I would say the probability is large that it is due to the recent 'fix'. I will try to create a testcase similar to what you suggested. Could you give us maybe some hints on which functionality of MUMPS you are using, or even share the code/ a code fragment? Thanks Edgar Jeff Squyres wrote: Hey Edgar -- Could this have anything to do with your recent fixes? On May 12, 2009, at 8:30 AM, Anton Starikov wrote: hostfile from torque PBS_NODEFILE (OMPI is compilled with torque support) It happens with or without rankfile. Started with mpirun -np 16 ./somecode mca parameters: btl = self,sm,openib mpi_maffinity_alone = 1 rmaps_base_no_oversubscribe = 1 (rmaps_base_no_oversubscribe = 0 doesn't change it) I tested with both: "btl=self,sm" on 16c-core nodes and "btl=self,sm,openib" on 8x dual-core nodes , result is the same. 
It looks like it always occurs exactly at the same point in the execution, not at the beginning, it is not first MPI_Comm_dup in the code. I can't say too much about particular piece of the code, where it is happening, because it is in the 3rd-party library (MUMPS). When error occurs, MPI_Comm_dup in every task deals with single-task communicator (MPI_Comm_split of initial MPI_Comm_world for 16 processes into 16 groups, 1 process per group). And I can guess that before this error, MPI_Comm_dup is called something like 100 of times by the same piece of code on the same communicators without any problems. I can say that it used to work correctly with all previous versions of openmpi we used (1.2.8-1.3.2 and some earlier versions). It also works correctly on other platforms/MPI implementations. All environmental variables (PATH, LD_LIBRARY_PATH) are correct. I recompiled code and 3rd-party libraries with this version of OMPI. -- Edgar Gabriel Assistant Professor Parallel Software Technologies Lab http://pstl.cs.uh.edu Department of Computer Science University of Houston Philip G. Hoffman Hall, Room 524Houston, TX-77204, USA Tel: +1 (713) 743-3857 Fax: +1 (713) 743-3335
Re: [OMPI users] strange bug
I will try to prepare test-case. -- Anton Starikov. On May 12, 2009, at 6:57 PM, Edgar Gabriel wrote: hm, so I am out of ideas. I created multiple variants of test- programs which did what you basically described, and they all passed and did not generate problems. I compiled the MUMPS library and ran the tests that they have in the examples directory, and they all worked. Additionally, I checked in the source code of Open MPI. In comm_dup there is only a single location where we raise the error MPI_ERR_INTERN (which was reported in your email). I am fairly positive, that this can not occur, else we would segfault prior to that (it is a stupid check, don't ask). Furthermore, the code segment that has been modified does not raise anywhere MPI_ERR_INTERN. Of course, it could be a secondary effect and be created somewhere else (PML_ADD or collective module selection) and comm_dup just passes the error code up. One way or the other, I need more hints on what the code does. Any chance of getting a smaller code fragment which replicates the problem? It could use the MUMPS library, I am fine with that since I just compiled and installed it with the current ompi trunk... Thanks Edgar Edgar Gabriel wrote: I would say the probability is large that it is due to the recent 'fix'. I will try to create a testcase similar to what you suggested. Could you give us maybe some hints on which functionality of MUMPS you are using, or even share the code/ a code fragment? Thanks Edgar Jeff Squyres wrote: Hey Edgar -- Could this have anything to do with your recent fixes? On May 12, 2009, at 8:30 AM, Anton Starikov wrote: hostfile from torque PBS_NODEFILE (OMPI is compilled with torque support) It happens with or without rankfile. 
Started with mpirun -np 16 ./somecode mca parameters: btl = self,sm,openib mpi_maffinity_alone = 1 rmaps_base_no_oversubscribe = 1 (rmaps_base_no_oversubscribe = 0 doesn't change it) I tested with both: "btl=self,sm" on 16c-core nodes and "btl=self,sm,openib" on 8x dual-core nodes , result is the same. It looks like it always occurs exactly at the same point in the execution, not at the beginning, it is not first MPI_Comm_dup in the code. I can't say too much about particular piece of the code, where it is happening, because it is in the 3rd-party library (MUMPS). When error occurs, MPI_Comm_dup in every task deals with single-task communicator (MPI_Comm_split of initial MPI_Comm_world for 16 processes into 16 groups, 1 process per group). And I can guess that before this error, MPI_Comm_dup is called something like 100 of times by the same piece of code on the same communicators without any problems. I can say that it used to work correctly with all previous versions of openmpi we used (1.2.8-1.3.2 and some earlier versions). It also works correctly on other platforms/MPI implementations. All environmental variables (PATH, LD_LIBRARY_PATH) are correct. I recompiled code and 3rd-party libraries with this version of OMPI. -- Edgar Gabriel Assistant Professor Parallel Software Technologies Lab http://pstl.cs.uh.edu Department of Computer Science University of Houston Philip G. Hoffman Hall, Room 524Houston, TX-77204, USA Tel: +1 (713) 743-3857 Fax: +1 (713) 743-3335 ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
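In the meantime, a minimal stand-alone reproducer along the lines Anton describes (split MPI_COMM_WORLD into one-process groups, then dup the resulting communicator many times) might look like this — a sketch only, not the real MUMPS call sequence:

```c
/* Sketch of a stress test for the reported MPI_Comm_dup failure:
 * one process per group (as in the failing MUMPS case), then repeated
 * dup/free on that single-task communicator. The iteration count is a
 * guess based on "~100 dups succeed before the error appears". */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, rc, i;
    MPI_Comm self_comm, dup_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Split into N groups of 1 process each (color = rank). */
    MPI_Comm_split(MPI_COMM_WORLD, rank, 0, &self_comm);

    for (i = 0; i < 200; ++i) {
        rc = MPI_Comm_dup(self_comm, &dup_comm);
        if (rc != MPI_SUCCESS) {
            fprintf(stderr, "rank %d: MPI_Comm_dup failed at iteration %d\n",
                    rank, i);
            MPI_Abort(MPI_COMM_WORLD, rc);
        }
        MPI_Comm_free(&dup_comm);
    }

    MPI_Comm_free(&self_comm);
    MPI_Finalize();
    return 0;
}
```

Run with `mpirun -np 16 ./a.out`; whether this actually triggers the MPI_ERR_INTERN seen in MUMPS is exactly what the test case Anton is preparing should tell us.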
Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
Okay, I fixed this today too: r21219 On May 11, 2009, at 11:27 PM, Anton Starikov wrote: Now there is another problem :) You can try oversubscribe node. At least by 1 task. If you hostfile and rank file limit you at N procs, you can ask mpirun for N+1 and it wil be not rejected. Although in reality there will be N tasks. So, if your hostfile limit is 4, then "mpirun -np 4" and "mpirun -np 5" both works, but in both cases there are only 4 tasks. It isn't crucial, because there is nor real oversubscription, but there is still some bug which can affect something in future. -- Anton Starikov. On May 12, 2009, at 1:45 AM, Ralph Castain wrote: This is fixed as of r21208. Thanks for reporting it! Ralph On May 11, 2009, at 12:51 PM, Anton Starikov wrote: Although removing this check solves problem of having more slots in rankfile than necessary, there is another problem. If I set rmaps_base_no_oversubscribe=1 then if, for example: hostfile: node01 node01 node02 node02 rankfile: rank 0=node01 slot=1 rank 1=node01 slot=0 rank 2=node02 slot=1 rank 3=node02 slot=0 mpirun -np 4 ./something complains with: "There are not enough slots available in the system to satisfy the 4 slots that were requested by the application" but "mpirun -np 3 ./something" will work though. It works, when you ask for 1 CPU less. And the same behavior in any case (shared nodes, non-shared nodes, multi-node) If you switch off rmaps_base_no_oversubscribe, then it works and all affinities set as it requested in rankfile, there is no oversubscription. Anton. On May 5, 2009, at 3:08 PM, Ralph Castain wrote: Ah - thx for catching that, I'll remove that check. It no longer is required. Thx! On Tue, May 5, 2009 at 7:04 AM, Lenny Verkhovsky > wrote: According to the code it does cares. 
$vi orte/mca/rmaps/rank_file/rmaps_rank_file.c +572 ival = orte_rmaps_rank_file_value.ival; if ( ival > (np-1) ) { orte_show_help("help-rmaps_rank_file.txt", "bad-rankfile", true, ival, rankfile); rc = ORTE_ERR_BAD_PARAM; goto unlock; } If I remember correctly, I used an array to map ranks, and since the length of array is NP, maximum index must be less than np, so if you have the number of rank > NP, you have no place to put it inside array. "Likewise, if you have more procs than the rankfile specifies, we map the additional procs either byslot (default) or bynode (if you specify that option). So the rankfile doesn't need to contain an entry for every proc." - Correct point. Lenny. On 5/5/09, Ralph Castain wrote: Sorry Lenny, but that isn't correct. The rankfile mapper doesn't care if the rankfile contains additional info - it only maps up to the number of processes, and ignores anything beyond that number. So there is no need to remove the additional info. Likewise, if you have more procs than the rankfile specifies, we map the additional procs either byslot (default) or bynode (if you specify that option). So the rankfile doesn't need to contain an entry for every proc. Just don't want to confuse folks. Ralph On Tue, May 5, 2009 at 5:59 AM, Lenny Verkhovsky > wrote: Hi, maximum rank number must be less then np. if np=1 then there is only rank 0 in the system, so rank 1 is invalid. please remove "rank 1=node2 slot=*" from the rankfile Best regards, Lenny. 
On Mon, May 4, 2009 at 11:14 AM, Geoffroy Pignot > wrote: Hi , I got the openmpi-1.4a1r21095.tar.gz tarball, but unfortunately my command doesn't work cat rankf: rank 0=node1 slot=* rank 1=node2 slot=* cat hostf: node1 slots=2 node2 slots=2 mpirun --rankfile rankf --hostfile hostf --host node1 -n 1 hostname : --host node2 -n 1 hostname Error, invalid rank (1) in the rankfile (rankf) -- [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file rmaps_rank_file.c at line 403 [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file base/rmaps_base_map_job.c at line 86 [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file base/plm_base_launch_support.c at line 86 [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file plm_rsh_module.c at line 1016 Ralph, could you tell me if my command syntax is correct or not ? if not, give me the expected one ? Regards Geoffroy 2009/4/30 Geoffroy Pignot Immediately Sir !!! :) Thanks again Ralph Geoffroy -- Message: 2 Date: Thu, 30 Apr 2009 06:45:39 -0600 From: Ralph Castain Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ?? To: Open MPI Users Message-ID: <71d2d8cc0904300545v61a42fe1k50086d2704d0f...@mail.gmail.com> Content-Type: text/plain; charset="iso-8859-1" I believe this is fixed now in our development trunk - you can download any tarball starting from last night and give it a try, if you like. Any feedback would be appreciated. Ralph On Apr 14, 2009, at
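To summarize the constraint discussed in this thread (every rank number in the rankfile must be smaller than the -np value, since ranks are mapped into an array of length np), a combination that should be accepted looks like this — host names taken from Geoffroy's example:

```shell
# hostf:                  rankf:
#   node1 slots=2           rank 0=node1 slot=*
#   node2 slots=2           rank 1=node2 slot=*
# Two ranks requested, and the highest rank in rankf is 1, i.e. < np:
mpirun --rankfile rankf --hostfile hostf -np 2 hostname
```

With `-np 1` the same rankfile is rejected ("invalid rank (1)"), which is the behavior the r21208/r21219 fixes relax.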
[OMPI users] ****---How to configure NIS and MPI on spread NICs?----****
Hello all, I want to configure NIS and MPI on different networks. For example, NIS uses eth0 and MPI uses eth1, something like that. How can I do that? Axida
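On the Open MPI side, traffic can usually be steered onto a specific interface with the TCP components' interface-selection MCA parameters. A sketch only — verify the parameter names with `ompi_info --param btl tcp` and `ompi_info --param oob tcp` on your version; binding NIS to eth0 is a separate, OS-level ypbind configuration:

```shell
# Keep MPI point-to-point traffic and the runtime's out-of-band
# channel on eth1, leaving eth0 free for NIS and other services:
mpirun --mca btl_tcp_if_include eth1 \
       --mca oob_tcp_if_include eth1 \
       -np 4 ./a.out
```

The same parameters accept exclusion lists (`btl_tcp_if_exclude`) if it is easier to name the interfaces MPI should avoid.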