Dear Gus, I would like to thank you and everyone else for your interest and for the wise advice. Unfortunately, the cluster is currently offline (I don't know why; I have just returned to the university), but I can confirm that it is not always the same node that goes south.
In addition, last Friday I was able to run my code for at least 6 hours simply by placing only one process on each node. I also now know that the RAM is the original hardware from 8 years ago. Last but not least, a kernel segfault sometimes appears (I don't have access to the log files, so I didn't know this when I posted the thread). Tomorrow I'll start running the suggested diagnostics (a compilable version of Jeff's memory-test sketch is appended below the quoted digest). Many thanks again! 2012/9/8 <users-requ...@open-mpi.org>: > Send users mailing list submissions to > us...@open-mpi.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://www.open-mpi.org/mailman/listinfo.cgi/users > or, via email, send a message with subject or body 'help' to > users-requ...@open-mpi.org > > You can reach the person managing the list at > users-ow...@open-mpi.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of users digest..." > > > Today's Topics: > > 1. Re: problem with rankfile (Ralph Castain) > 2. Re: some mpi processes "disappear" on a cluster of servers > (Gus Correa) > 3. Re: some mpi processes "disappear" on a cluster of servers > (Gus Correa) > 4. Re: MPI_Allreduce fail (minGW gfortran + OpenMPI 1.6.1) > (Jeff Squyres) > 5. Setting RPATH for Open MPI libraries (Jed Brown) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Fri, 7 Sep 2012 10:33:55 -0700 > From: Ralph Castain <r...@open-mpi.org> > Subject: Re: [OMPI users] problem with rankfile > To: Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> > Cc: us...@open-mpi.org > Message-ID: <8c53f47d-b593-4994-931e-f746ac27b...@open-mpi.org> > Content-Type: text/plain; charset=us-ascii > > > On Sep 7, 2012, at 5:41 AM, Siegmar Gross > <siegmar.gr...@informatik.hs-fulda.de> wrote: > >> Hi, >> >> are the following outputs helpful to find the error with >> a rankfile on Solaris? > > If you can't bind on the new Solaris machine, then the rankfile won't do you > any good. It looks like we are getting the incorrect number of cores on that > machine - is it possible that it has hardware threads, and doesn't report > "cores"? Can you download and run a copy of lstopo to check the output? You > get that from the hwloc folks: > > http://www.open-mpi.org/software/hwloc/v1.5/ > > >> I wrapped long lines so that they >> are easier to read. Have you had time to look at the >> segmentation fault with a rankfile which I reported in my >> last email (see below)? > > I'm afraid not - been too busy lately. I'd suggest first focusing on getting > binding to work. > >> >> "tyr" is a two processor single core machine. >> >> tyr fd1026 116 mpiexec -report-bindings -np 4 \ >> -bind-to-socket -bycore rank_size >> [tyr.informatik.hs-fulda.de:18614] [[27298,0],0] odls:default: >> fork binding child [[27298,1],0] to socket 0 cpus 0001 >> [tyr.informatik.hs-fulda.de:18614] [[27298,0],0] odls:default: >> fork binding child [[27298,1],1] to socket 1 cpus 0002 >> [tyr.informatik.hs-fulda.de:18614] [[27298,0],0] odls:default: >> fork binding child [[27298,1],2] to socket 0 cpus 0001 >> [tyr.informatik.hs-fulda.de:18614] [[27298,0],0] odls:default: >> fork binding child [[27298,1],3] to socket 1 cpus 0002 >> I'm process 0 of 4 ...
>> >> >> tyr fd1026 121 mpiexec -report-bindings -np 4 \ >> -bind-to-socket -bysocket rank_size >> [tyr.informatik.hs-fulda.de:18656] [[27380,0],0] odls:default: >> fork binding child [[27380,1],0] to socket 0 cpus 0001 >> [tyr.informatik.hs-fulda.de:18656] [[27380,0],0] odls:default: >> fork binding child [[27380,1],1] to socket 1 cpus 0002 >> [tyr.informatik.hs-fulda.de:18656] [[27380,0],0] odls:default: >> fork binding child [[27380,1],2] to socket 0 cpus 0001 >> [tyr.informatik.hs-fulda.de:18656] [[27380,0],0] odls:default: >> fork binding child [[27380,1],3] to socket 1 cpus 0002 >> I'm process 0 of 4 ... >> >> >> tyr fd1026 117 mpiexec -report-bindings -np 4 \ >> -bind-to-core -bycore rank_size >> [tyr.informatik.hs-fulda.de:18623] [[27307,0],0] odls:default: >> fork binding child [[27307,1],2] to cpus 0004 >> ------------------------------------------------------------------ >> An attempt to set processor affinity has failed - please check to >> ensure that your system supports such functionality. If so, then >> this is probably something that should be reported to the OMPI >> developers. >> ------------------------------------------------------------------ >> [tyr.informatik.hs-fulda.de:18623] [[27307,0],0] odls:default: >> fork binding child [[27307,1],0] to cpus 0001 >> [tyr.informatik.hs-fulda.de:18623] [[27307,0],0] odls:default: >> fork binding child [[27307,1],1] to cpus 0002 >> ------------------------------------------------------------------ >> mpiexec was unable to start the specified application >> as it encountered an error >> on node tyr.informatik.hs-fulda.de. More information may be >> available above. >> ------------------------------------------------------------------ >> 4 total processes failed to start >> >> >> >> tyr fd1026 118 mpiexec -report-bindings -np 4 \ >> -bind-to-core -bysocket rank_size >> ------------------------------------------------------------------ >> An invalid physical processor ID was returned when attempting to >> bind >> an MPI process to a unique processor. >> >> This usually means that you requested binding to more processors >> than >> >> exist (e.g., trying to bind N MPI processes to M processors, >> where N > >> M). Double check that you have enough unique processors for >> all the >> MPI processes that you are launching on this host. >> >> You job will now abort. >> ------------------------------------------------------------------ >> [tyr.informatik.hs-fulda.de:18631] [[27347,0],0] odls:default: >> fork binding child [[27347,1],0] to socket 0 cpus 0001 >> [tyr.informatik.hs-fulda.de:18631] [[27347,0],0] odls:default: >> fork binding child [[27347,1],1] to socket 1 cpus 0002 >> ------------------------------------------------------------------ >> mpiexec was unable to start the specified application as it >> encountered an error >> on node tyr.informatik.hs-fulda.de. More information may be >> available above. >> ------------------------------------------------------------------ >> 4 total processes failed to start >> tyr fd1026 119 >> >> >> >> "linpc3" and "linpc4" are two processor dual core machines. 
>> >> linpc4 fd1026 102 mpiexec -report-bindings -host linpc3,linpc4 \ >> -np 4 -bind-to-core -bycore rank_size >> [linpc4:16842] [[40914,0],0] odls:default: >> fork binding child [[40914,1],1] to cpus 0001 >> [linpc4:16842] [[40914,0],0] odls:default: >> fork binding child [[40914,1],3] to cpus 0002 >> [linpc3:31384] [[40914,0],1] odls:default: >> fork binding child [[40914,1],0] to cpus 0001 >> [linpc3:31384] [[40914,0],1] odls:default: >> fork binding child [[40914,1],2] to cpus 0002 >> I'm process 1 of 4 ... >> >> >> linpc4 fd1026 102 mpiexec -report-bindings -host linpc3,linpc4 \ >> -np 4 -bind-to-core -bysocket rank_size >> [linpc4:16846] [[40918,0],0] odls:default: >> fork binding child [[40918,1],1] to socket 0 cpus 0001 >> [linpc4:16846] [[40918,0],0] odls:default: >> fork binding child [[40918,1],3] to socket 0 cpus 0002 >> [linpc3:31435] [[40918,0],1] odls:default: >> fork binding child [[40918,1],0] to socket 0 cpus 0001 >> [linpc3:31435] [[40918,0],1] odls:default: >> fork binding child [[40918,1],2] to socket 0 cpus 0002 >> I'm process 1 of 4 ... >> >> >> >> >> linpc4 fd1026 104 mpiexec -report-bindings -host linpc3,linpc4 \ >> -np 4 -bind-to-socket -bycore rank_size >> ------------------------------------------------------------------ >> Unable to bind to socket 0 on node linpc3. >> ------------------------------------------------------------------ >> ------------------------------------------------------------------ >> Unable to bind to socket 0 on node linpc4. >> ------------------------------------------------------------------ >> ------------------------------------------------------------------ >> mpiexec was unable to start the specified application as it >> encountered an error: >> >> Error name: Fatal >> Node: linpc4 >> >> when attempting to start process rank 1. >> ------------------------------------------------------------------ >> 4 total processes failed to start >> linpc4 fd1026 105 >> >> >> linpc4 fd1026 105 mpiexec -report-bindings -host linpc3,linpc4 \ >> -np 4 -bind-to-socket -bysocket rank_size >> ------------------------------------------------------------------ >> Unable to bind to socket 0 on node linpc4. >> ------------------------------------------------------------------ >> ------------------------------------------------------------------ >> Unable to bind to socket 0 on node linpc3. >> ------------------------------------------------------------------ >> ------------------------------------------------------------------ >> mpiexec was unable to start the specified application as it >> encountered an error: >> >> Error name: Fatal >> Node: linpc4 >> >> when attempting to start process rank 1. >> -------------------------------------------------------------------------- >> 4 total processes failed to start >> >> >> It's interesting that commands that work on Solaris fail on Linux >> and vice versa. >> >> >> Kind regards >> >> Siegmar >> >>>> I couldn't really say for certain - I don't see anything obviously >>>> wrong with your syntax, and the code appears to be working or else >>>> it would fail on the other nodes as well. The fact that it fails >>>> solely on that machine seems suspect. >>>> >>>> Set aside the rankfile for the moment and try to just bind to cores >>>> on that machine, something like: >>>> >>>> mpiexec --report-bindings -bind-to-core >>>> -host rs0.informatik.hs-fulda.de -n 2 rank_size >>>> >>>> If that doesn't work, then the problem isn't with rankfile >>> >>> It doesn't work but I found out something else as you can see below. 
>>> I get a segmentation fault for some rankfiles. >>> >>> >>> tyr small_prog 110 mpiexec --report-bindings -bind-to-core >>> -host rs0.informatik.hs-fulda.de -n 2 rank_size >>> -------------------------------------------------------------------------- >>> An attempt to set processor affinity has failed - please check to >>> ensure that your system supports such functionality. If so, then >>> this is probably something that should be reported to the OMPI developers. >>> -------------------------------------------------------------------------- >>> [rs0.informatik.hs-fulda.de:14695] [[30561,0],1] odls:default: >>> fork binding child [[30561,1],0] to cpus 0001 >>> -------------------------------------------------------------------------- >>> mpiexec was unable to start the specified application as it >>> encountered an error: >>> >>> Error name: Resource temporarily unavailable >>> Node: rs0.informatik.hs-fulda.de >>> >>> when attempting to start process rank 0. >>> -------------------------------------------------------------------------- >>> 2 total processes failed to start >>> tyr small_prog 111 >>> >>> >>> >>> >>> Perhaps I have a hint for the error on Solaris Sparc. I use the >>> following rankfile to keep everything simple. >>> >>> rank 0=tyr.informatik.hs-fulda.de slot=0:0 >>> rank 1=linpc0.informatik.hs-fulda.de slot=0:0 >>> rank 2=linpc1.informatik.hs-fulda.de slot=0:0 >>> #rank 3=linpc2.informatik.hs-fulda.de slot=0:0 >>> rank 4=linpc3.informatik.hs-fulda.de slot=0:0 >>> rank 5=linpc4.informatik.hs-fulda.de slot=0:0 >>> rank 6=sunpc0.informatik.hs-fulda.de slot=0:0 >>> rank 7=sunpc1.informatik.hs-fulda.de slot=0:0 >>> rank 8=sunpc2.informatik.hs-fulda.de slot=0:0 >>> rank 9=sunpc3.informatik.hs-fulda.de slot=0:0 >>> rank 10=sunpc4.informatik.hs-fulda.de slot=0:0 >>> >>> When I execute "mpiexec -report-bindings -rf my_rankfile rank_size" >>> on a Linux-x86_64 or Solaris-10-x86_64 machine everything works fine. >>> >>> linpc4 small_prog 104 mpiexec -report-bindings -rf my_rankfile rank_size >>> [linpc4:08018] [[49482,0],0] odls:default:fork binding child >>> [[49482,1],5] to slot_list 0:0 >>> [linpc3:22030] [[49482,0],4] odls:default:fork binding child >>> [[49482,1],4] to slot_list 0:0 >>> [linpc0:12887] [[49482,0],2] odls:default:fork binding child >>> [[49482,1],1] to slot_list 0:0 >>> [linpc1:08323] [[49482,0],3] odls:default:fork binding child >>> [[49482,1],2] to slot_list 0:0 >>> [sunpc1:17786] [[49482,0],6] odls:default:fork binding child >>> [[49482,1],7] to slot_list 0:0 >>> [sunpc3.informatik.hs-fulda.de:08482] [[49482,0],8] odls:default:fork >>> binding child [[49482,1],9] to slot_list 0:0 >>> [sunpc0.informatik.hs-fulda.de:11568] [[49482,0],5] odls:default:fork >>> binding child [[49482,1],6] to slot_list 0:0 >>> [tyr.informatik.hs-fulda.de:21484] [[49482,0],1] odls:default:fork >>> binding child [[49482,1],0] to slot_list 0:0 >>> [sunpc2.informatik.hs-fulda.de:28638] [[49482,0],7] odls:default:fork >>> binding child [[49482,1],8] to slot_list 0:0 >>> ... >>> >>> >>> >>> I get a segmentation fault when I run it on my local machine >>> (Solaris Sparc). 
>>> >>> tyr small_prog 141 mpiexec -report-bindings -rf my_rankfile rank_size >>> [tyr.informatik.hs-fulda.de:21421] [[29113,0],0] ORTE_ERROR_LOG: >>> Data unpack would read past end of buffer in file >>> ../../../../openmpi-1.6/orte/mca/odls/base/odls_base_default_fns.c >>> at line 927 >>> [tyr:21421] *** Process received signal *** >>> [tyr:21421] Signal: Segmentation Fault (11) >>> [tyr:21421] Signal code: Address not mapped (1) >>> [tyr:21421] Failing at address: 5ba >>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0:0x15d3ec >>> /lib/libc.so.1:0xcad04 >>> /lib/libc.so.1:0xbf3b4 >>> /lib/libc.so.1:0xbf59c >>> /lib/libc.so.1:0x58bd0 [ Signal 11 (SEGV)] >>> /lib/libc.so.1:free+0x24 >>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0: >>> orte_odls_base_default_construct_child_list+0x1234 >>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/openmpi/ >>> mca_odls_default.so:0x90b8 >>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0:0x5e8d4 >>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0: >>> orte_daemon_cmd_processor+0x328 >>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0:0x12e324 >>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0: >>> opal_event_base_loop+0x228 >>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0: >>> opal_progress+0xec >>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0: >>> orte_plm_base_report_launched+0x1c4 >>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0: >>> orte_plm_base_launch_apps+0x318 >>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/openmpi/mca_plm_rsh.so: >>> orte_plm_rsh_launch+0xac4 >>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/bin/orterun:orterun+0x16a8 >>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/bin/orterun:main+0x24 >>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/bin/orterun:_start+0xd8 >>> [tyr:21421] *** End of error message *** >>> Segmentation fault >>> tyr small_prog 142 >>> >>> >>> The funny thing is that I get a segmentation fault on the Linux >>> machine as well if I change my rankfile in the following way. 
>>> >>> rank 0=tyr.informatik.hs-fulda.de slot=0:0 >>> rank 1=linpc0.informatik.hs-fulda.de slot=0:0 >>> #rank 2=linpc1.informatik.hs-fulda.de slot=0:0 >>> #rank 3=linpc2.informatik.hs-fulda.de slot=0:0 >>> #rank 4=linpc3.informatik.hs-fulda.de slot=0:0 >>> rank 5=linpc4.informatik.hs-fulda.de slot=0:0 >>> rank 6=sunpc0.informatik.hs-fulda.de slot=0:0 >>> #rank 7=sunpc1.informatik.hs-fulda.de slot=0:0 >>> #rank 8=sunpc2.informatik.hs-fulda.de slot=0:0 >>> #rank 9=sunpc3.informatik.hs-fulda.de slot=0:0 >>> rank 10=sunpc4.informatik.hs-fulda.de slot=0:0 >>> >>> >>> linpc4 small_prog 107 mpiexec -report-bindings -rf my_rankfile rank_size >>> [linpc4:08402] [[65226,0],0] ORTE_ERROR_LOG: Data unpack would >>> read past end of buffer in file >>> ../../../../openmpi-1.6/orte/mca/odls/base/odls_base_default_fns.c >>> at line 927 >>> [linpc4:08402] *** Process received signal *** >>> [linpc4:08402] Signal: Segmentation fault (11) >>> [linpc4:08402] Signal code: Address not mapped (1) >>> [linpc4:08402] Failing at address: 0x5f32fffc >>> [linpc4:08402] [ 0] [0xffffe410] >>> [linpc4:08402] [ 1] /usr/local/openmpi-1.6_32_cc/lib/openmpi/ >>> mca_odls_default.so(+0x4023) [0xf73ec023] >>> [linpc4:08402] [ 2] /usr/local/openmpi-1.6_32_cc/lib/ >>> libopen-rte.so.4(+0x42b91) [0xf7667b91] >>> [linpc4:08402] [ 3] /usr/local/openmpi-1.6_32_cc/lib/ >>> libopen-rte.so.4(orte_daemon_cmd_processor+0x313) [0xf76655c3] >>> [linpc4:08402] [ 4] /usr/local/openmpi-1.6_32_cc/lib/ >>> libopen-rte.so.4(+0x8f366) [0xf76b4366] >>> [linpc4:08402] [ 5] /usr/local/openmpi-1.6_32_cc/lib/ >>> libopen-rte.so.4(opal_event_base_loop+0x18c) [0xf76b46bc] >>> [linpc4:08402] [ 6] /usr/local/openmpi-1.6_32_cc/lib/ >>> libopen-rte.so.4(opal_event_loop+0x26) [0xf76b4526] >>> [linpc4:08402] [ 7] /usr/local/openmpi-1.6_32_cc/lib/ >>> libopen-rte.so.4(opal_progress+0xba) [0xf769303a] >>> [linpc4:08402] [ 8] /usr/local/openmpi-1.6_32_cc/lib/ >>> libopen-rte.so.4(orte_plm_base_report_launched+0x13f) [0xf767d62f] >>> [linpc4:08402] [ 9] /usr/local/openmpi-1.6_32_cc/lib/ >>> libopen-rte.so.4(orte_plm_base_launch_apps+0x1b7) [0xf767bf27] >>> [linpc4:08402] [10] /usr/local/openmpi-1.6_32_cc/lib/openmpi/ >>> mca_plm_rsh.so(orte_plm_rsh_launch+0xb2d) [0xf74228fd] >>> [linpc4:08402] [11] mpiexec(orterun+0x102f) [0x804e7bf] >>> [linpc4:08402] [12] mpiexec(main+0x13) [0x804c273] >>> [linpc4:08402] [13] /lib/libc.so.6(__libc_start_main+0xf3) [0xf745e003] >>> [linpc4:08402] *** End of error message *** >>> Segmentation fault >>> linpc4 small_prog 107 >>> >>> >>> Hopefully this information helps to fix the problem. >>> >>> >>> Kind regards >>> >>> Siegmar >>> >>> >>> >>> >>>> On Sep 5, 2012, at 5:50 AM, Siegmar Gross >> <siegmar.gr...@informatik.hs-fulda.de> wrote: >>>> >>>>> Hi, >>>>> >>>>> I'm new to rankfiles so that I played a little bit with different >>>>> options. I thought that the following entry would be similar to an >>>>> entry in an appfile and that MPI could place the process with rank 0 >>>>> on any core of any processor. >>>>> >>>>> rank 0=tyr.informatik.hs-fulda.de >>>>> >>>>> Unfortunately it's not allowed and I got an error. Can somebody add >>>>> the missing help to the file? >>>>> >>>>> >>>>> tyr small_prog 126 mpiexec -rf my_rankfile -report-bindings rank_size >>>>> -------------------------------------------------------------------------- >>>>> Sorry! You were supposed to get help about: >>>>> no-slot-list >>>>> from the file: >>>>> help-rmaps_rank_file.txt >>>>> But I couldn't find that topic in the file. Sorry! 
>>>>> -------------------------------------------------------------------------- >>>>> >>>>> >>>>> As you can see below I could use a rankfile on my old local machine >>>>> (Sun Ultra 45) but not on our "new" one (Sun Server M4000). Today I >>>>> logged into the machine via ssh and tried the same command once more >>>>> as a local user without success. It's more or less the same error as >>>>> before when I tried to bind the process to a remote machine. >>>>> >>>>> rs0 small_prog 118 mpiexec -rf my_rankfile -report-bindings rank_size >>>>> [rs0.informatik.hs-fulda.de:13745] [[19734,0],0] odls:default:fork >>>>> binding child [[19734,1],0] to slot_list 0:0 >>>>> -------------------------------------------------------------------------- >>>>> We were unable to successfully process/set the requested processor >>>>> affinity settings: >>>>> >>>>> Specified slot list: 0:0 >>>>> Error: Cross-device link >>>>> >>>>> This could mean that a non-existent processor was specified, or >>>>> that the specification had improper syntax. >>>>> -------------------------------------------------------------------------- >>>>> -------------------------------------------------------------------------- >>>>> mpiexec was unable to start the specified application as it encountered an >> error: >>>>> >>>>> Error name: No such file or directory >>>>> Node: rs0.informatik.hs-fulda.de >>>>> >>>>> when attempting to start process rank 0. >>>>> -------------------------------------------------------------------------- >>>>> rs0 small_prog 119 >>>>> >>>>> >>>>> The application is available. >>>>> >>>>> rs0 small_prog 119 which rank_size >>>>> /home/fd1026/SunOS/sparc/bin/rank_size >>>>> >>>>> >>>>> Is it a problem in the Open MPI implementation or in my rankfile? >>>>> How can I request which sockets and cores per socket are >>>>> available so that I can use correct values in my rankfile? >>>>> In lam-mpi I had a command "lamnodes" which I could use to get >>>>> such information. Thank you very much for any help in advance. >>>>> >>>>> >>>>> Kind regards >>>>> >>>>> Siegmar >>>>> >>>>> >>>>> >>>>>>> Are *all* the machines Sparc? Or just the 3rd one (rs0)? >>>>>> >>>>>> Yes, both machines are Sparc. I tried first in a homogeneous >>>>>> environment. >>>>>> >>>>>> tyr fd1026 106 psrinfo -v >>>>>> Status of virtual processor 0 as of: 09/04/2012 07:32:14 >>>>>> on-line since 08/31/2012 15:44:42. >>>>>> The sparcv9 processor operates at 1600 MHz, >>>>>> and has a sparcv9 floating point processor. >>>>>> Status of virtual processor 1 as of: 09/04/2012 07:32:14 >>>>>> on-line since 08/31/2012 15:44:39. >>>>>> The sparcv9 processor operates at 1600 MHz, >>>>>> and has a sparcv9 floating point processor. >>>>>> tyr fd1026 107 >>>>>> >>>>>> My local machine (tyr) is a dual processor machine and the >>>>>> other one is equipped with two quad-core processors each >>>>>> capable of running two hardware threads. >>>>>> >>>>>> >>>>>> Kind regards >>>>>> >>>>>> Siegmar >>>>>> >>>>>> >>>>>>> On Sep 3, 2012, at 12:43 PM, Siegmar Gross >>>>>> <siegmar.gr...@informatik.hs-fulda.de> wrote: >>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> the man page for "mpiexec" shows the following: >>>>>>>> >>>>>>>> cat myrankfile >>>>>>>> rank 0=aa slot=1:0-2 >>>>>>>> rank 1=bb slot=0:0,1 >>>>>>>> rank 2=cc slot=1-2 >>>>>>>> mpirun -H aa,bb,cc,dd -rf myrankfile ./a.out So that >>>>>>>> >>>>>>>> Rank 0 runs on node aa, bound to socket 1, cores 0-2. >>>>>>>> Rank 1 runs on node bb, bound to socket 0, cores 0 and 1. 
>>>>>>>> Rank 2 runs on node cc, bound to cores 1 and 2. >>>>>>>> >>>>>>>> Does it mean that the process with rank 0 should be bound to >>>>>>>> core 0, 1, or 2 of socket 1? >>>>>>>> >>>>>>>> I tried to use a rankfile and have a problem. My rankfile contains >>>>>>>> the following lines. >>>>>>>> >>>>>>>> rank 0=tyr.informatik.hs-fulda.de slot=0:0 >>>>>>>> rank 1=tyr.informatik.hs-fulda.de slot=1:0 >>>>>>>> #rank 2=rs0.informatik.hs-fulda.de slot=0:0 >>>>>>>> >>>>>>>> >>>>>>>> Everything is fine if I use the file with just my local machine >>>>>>>> (the first two lines). >>>>>>>> >>>>>>>> tyr small_prog 115 mpiexec -report-bindings -rf my_rankfile rank_size >>>>>>>> [tyr.informatik.hs-fulda.de:01133] [[9849,0],0] >>>>>>>> odls:default:fork binding child [[9849,1],0] to slot_list 0:0 >>>>>>>> [tyr.informatik.hs-fulda.de:01133] [[9849,0],0] >>>>>>>> odls:default:fork binding child [[9849,1],1] to slot_list 1:0 >>>>>>>> I'm process 0 of 2 available processes running on >>>>>> tyr.informatik.hs-fulda.de. >>>>>>>> MPI standard 2.1 is supported. >>>>>>>> I'm process 1 of 2 available processes running on >>>>>> tyr.informatik.hs-fulda.de. >>>>>>>> MPI standard 2.1 is supported. >>>>>>>> tyr small_prog 116 >>>>>>>> >>>>>>>> >>>>>>>> I can also change the socket number and the processes will be attached >>>>>>>> to the correct cores. Unfortunately it doesn't work if I add one >>>>>>>> other machine (third line). >>>>>>>> >>>>>>>> >>>>>>>> tyr small_prog 112 mpiexec -report-bindings -rf my_rankfile rank_size >>>>>>>> >> -------------------------------------------------------------------------- >>>>>>>> We were unable to successfully process/set the requested processor >>>>>>>> affinity settings: >>>>>>>> >>>>>>>> Specified slot list: 0:0 >>>>>>>> Error: Cross-device link >>>>>>>> >>>>>>>> This could mean that a non-existent processor was specified, or >>>>>>>> that the specification had improper syntax. >>>>>>>> >> -------------------------------------------------------------------------- >>>>>>>> [tyr.informatik.hs-fulda.de:01520] [[10212,0],0] >>>>>>>> odls:default:fork binding child [[10212,1],0] to slot_list 0:0 >>>>>>>> [tyr.informatik.hs-fulda.de:01520] [[10212,0],0] >>>>>>>> odls:default:fork binding child [[10212,1],1] to slot_list 1:0 >>>>>>>> [rs0.informatik.hs-fulda.de:12047] [[10212,0],1] >>>>>>>> odls:default:fork binding child [[10212,1],2] to slot_list 0:0 >>>>>>>> [tyr.informatik.hs-fulda.de:01520] [[10212,0],0] >>>>>>>> ORTE_ERROR_LOG: A message is attempting to be sent to a process >>>>>>>> whose contact information is unknown in file >>>>>>>> ../../../../../openmpi-1.6/orte/mca/rml/oob/rml_oob_send.c at line 145 >>>>>>>> [tyr.informatik.hs-fulda.de:01520] [[10212,0],0] attempted to send >>>>>>>> to [[10212,1],0]: tag 20 >>>>>>>> [tyr.informatik.hs-fulda.de:01520] [[10212,0],0] ORTE_ERROR_LOG: >>>>>>>> A message is attempting to be sent to a process whose contact >>>>>>>> information is unknown in file >>>>>>>> ../../../../openmpi-1.6/orte/mca/odls/base/odls_base_default_fns.c >>>>>>>> at line 2501 >>>>>>>> >> -------------------------------------------------------------------------- >>>>>>>> mpiexec was unable to start the specified application as it >>>>>>>> encountered an error: >>>>>>>> >>>>>>>> Error name: Error 0 >>>>>>>> Node: rs0.informatik.hs-fulda.de >>>>>>>> >>>>>>>> when attempting to start process rank 2. 
>>>>>>>> >> -------------------------------------------------------------------------- >>>>>>>> tyr small_prog 113 >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> The other machine has two 8 core processors. >>>>>>>> >>>>>>>> tyr small_prog 121 ssh rs0 psrinfo -v >>>>>>>> Status of virtual processor 0 as of: 09/03/2012 19:51:15 >>>>>>>> on-line since 07/26/2012 15:03:14. >>>>>>>> The sparcv9 processor operates at 2400 MHz, >>>>>>>> and has a sparcv9 floating point processor. >>>>>>>> Status of virtual processor 1 as of: 09/03/2012 19:51:15 >>>>>>>> ... >>>>>>>> Status of virtual processor 15 as of: 09/03/2012 19:51:15 >>>>>>>> on-line since 07/26/2012 15:03:16. >>>>>>>> The sparcv9 processor operates at 2400 MHz, >>>>>>>> and has a sparcv9 floating point processor. >>>>>>>> tyr small_prog 122 >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Is it necessary to specify another option on the command line or >>>>>>>> is my rankfile faulty? Thank you very much for any suggestions in >>>>>>>> advance. >>>>>>>> >>>>>>>> >>>>>>>> Kind regards >>>>>>>> >>>>>>>> Siegmar >>>>>>>> >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> users mailing list >>>>>>>> us...@open-mpi.org >>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>> >>>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> users mailing list >>>>>> us...@open-mpi.org >>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>> >>>>> _______________________________________________ >>>>> users mailing list >>>>> us...@open-mpi.org >>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>> >>>> >> > > > > > ------------------------------ > > Message: 2 > Date: Fri, 07 Sep 2012 18:01:46 -0400 > From: Gus Correa <g...@ldeo.columbia.edu> > Subject: Re: [OMPI users] some mpi processes "disappear" on a cluster > of servers > To: Open MPI Users <us...@open-mpi.org> > Message-ID: <504a6eca.4040...@ldeo.columbia.edu> > Content-Type: text/plain; charset=ISO-8859-1; format=flowed > > On 09/03/2012 04:39 PM, Andrea Negri wrote: >> max locked memory (kbytes, -l) 32 >>> max memory size (kbytes, -m) unlimited >>> open files (-n) 1024 >>> pipe size (512 bytes, -p) 8 >>> POSIX message queues (bytes, -q) 819200 >>> stack size (kbytes, -s) 10240 >>> > > Hi Andrea > This is besides the possibilities of > running out of physical memory, > or even defective memory chips, which Jeff, Ralph, > John, George have addressed, I still think that the > system limits above may play a role. > In a 8-year old cluster, hardware failures are not unexpected. > > > 1) System limits > > For what it is worth, virtually none of the programs we run here, > mostly atmosphere/ocean/climate codes, > would run with these limits. > On our compute nodes we set > max locked memory and stack size to > unlimited, to avoid problems with symptoms very > similar to those you describe. > Typically there are lots of automatic arrays in subroutines, > etc, which require a large stack. > Your sys admin could add these lines to the bottom > of /etc/security/limits.conf [the last one sets the > max number of open files]: > > * - memlock -1 > * - stack -1 > * - nofile 4096 > > 2) Defective network interface/cable/switch port > > Yet another possibility, following Ralph's suggestion, > is that you may have a failing network interface, or a bad > Ethernet cable or connector, on the node that goes south, > or on the switch port that serves that node. > [I am assuming your network is Ethernet, probably GigE.] 
> > Again, in a 8-year old cluster, hardware failures are not unexpected. > > We had this sort of problems with old clusters, old nodes. > > 3) Quarantine the bad node > > Is it always the same node that fails, or does it vary? > [Please answer, it helps us understand what's going on.] > > If it is always the same node, have you tried to quarantine it, > either temporarily removing it from your job submission system > or just turning it off, and run the job on the remaining > nodes? > > I hope this helps, > Gus Correa > > > ------------------------------ > > Message: 3 > Date: Fri, 07 Sep 2012 18:12:20 -0400 > From: Gus Correa <g...@ldeo.columbia.edu> > Subject: Re: [OMPI users] some mpi processes "disappear" on a cluster > of servers > To: Open MPI Users <us...@open-mpi.org> > Message-ID: <504a7144.1000...@ldeo.columbia.edu> > Content-Type: text/plain; charset=ISO-8859-1; format=flowed > > On 09/07/2012 08:02 AM, Jeff Squyres wrote: >> On Sep 7, 2012, at 5:58 AM, Jeff Squyres wrote: >> >>> Also look for hardware errors. Perhaps you have some bad RAM somewhere. >>> Is it always the same node that crashes? And so on. >> >> >> Another thought on hardware errors... I actually have seen bad RAM cause >> spontaneous reboots with no Linux warnings. >> >> Do you have any hardware diagnostics from your server >> vendor that you can run? >> > > If you don't have a vendor provided diagnostic tool, > you or your sys admin could try Advanced Clustering "breakin": > > http://www.advancedclustering.com/our-software/view-category.html > > Download the ISO version, burn a CD, put in the node CD drive, > assuming it has one, reboot, chose breakin in the menu options. > If there is no CD drive, there is an alternative with network boot, > although more involved. > > I hope it helps, > Gus Correa > >> A simple way to test your RAM (it's not completely comprehensive, but it >> does check for a surprisingly wide array of memory issues) is to do >> something like this (pseudocode): >> >> ----- >> size_t i, size, increment; >> increment = 1GB; >> size = 1GB; >> int *ptr; >> >> // Find the biggest amount of memory that you can malloc >> while (increment>= 1024) { >> ptr = malloc(size); >> if (NULL != ptr) { >> free(ptr); >> size += increment; >> } else { >> size -= increment; >> increment /= 2; >> } >> } >> printf("I can malloc %lu bytes\n", size); >> >> // Malloc that huge chunk of memory >> ptr = malloc(size); >> for (i = 0; i< size / sizeof(int); ++i, ++ptr) { >> *ptr = 37; >> if (*ptr != 37) { >> printf("Readback error!\n"); >> } >> } >> >> printf("All done\n"); >> ----- >> >> Depending on how much memory you have, > that might take a little while to run > (all the memory has to be paged in, etc.). > You might want to add a status output to show progress, > and/or write/read a page at a time for better efficiency, etc. > But you get the idea. >> >> Hope that helps. >> > > > > ------------------------------ > > Message: 4 > Date: Sat, 8 Sep 2012 06:25:22 -0400 > From: Jeff Squyres <jsquy...@cisco.com> > Subject: Re: [OMPI users] MPI_Allreduce fail (minGW gfortran + OpenMPI > 1.6.1) > To: Open MPI Users <us...@open-mpi.org> > Message-ID: <1b815295-1d69-46db-8dcb-b308c228b...@cisco.com> > Content-Type: text/plain; charset=us-ascii > > I am unable to replicate your problem, but admittedly I only have access to > gfortran on Linux. And I am definitely *not* a Fortran expert. :-\ > > The code seems to run fine for me -- can you send another test program that > actually tests the results of the all reduce? 
Fortran allocatable stuff > always confuses me; I wonder if perhaps we're not getting the passed pointer > properly. Checking the results of the all reduce would be a good way to > check this theory. > > > > On Sep 6, 2012, at 12:05 PM, Yonghui wrote: > >> Dear mpi users and developers, >> >> I am having some trouble with MPI_Allreduce. I am using MinGW (gcc 4.6.2) >> with OpenMPI 1.6.1. The MPI_Allreduce in c version works fine, but the >> fortran version failed with error. Here is the simple fortran code to >> reproduce the error: >> >> program main >> implicit none >> include 'mpif.h' >> character * (MPI_MAX_PROCESSOR_NAME) >> processor_name >> integer myid, numprocs, namelen, rc, ierr >> integer, allocatable :: mat1(:, :, :) >> >> call MPI_INIT( ierr ) >> call MPI_COMM_RANK( MPI_COMM_WORLD, myid, >> ierr ) >> call MPI_COMM_SIZE( MPI_COMM_WORLD, >> numprocs, ierr ) >> allocate(mat1(-36:36, -36:36, -36:36)) >> mat1(:,:,:) = 111 >> print *, "Going to call MPI_Allreduce." >> call MPI_Allreduce(MPI_IN_PLACE, mat1(-36, >> -36, -36), 389017, MPI_INTEGER, MPI_BOR, MPI_COMM_WORLD, ierr) >> print *, "MPI_Allreduce done!!!" >> call MPI_FINALIZE(rc) >> endprogram >> >> The command that I used to compile: >> gfortran Allreduce.f90 -IC:\OpenMPI-win32\include >> C:\OpenMPI-win32\lib\libmpi_f77.lib >> >> The MPI_Allreduce fail. [xxxxxxx:02112] [[17193,0],0]-[[17193,1],0] >> mca_oob_tcp_msg_recv: readv failed: Unknown error (108). >> I am not sure why this happens. But I think it is the windows build MPI >> problem. Since the simple code works on a Linux system with gfortran. >> >> Any ideas? I appreciate any response! >> >> Yonghui >> >> _______________________________________________ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > > > > ------------------------------ > > Message: 5 > Date: Sat, 8 Sep 2012 07:46:16 -0500 > From: Jed Brown <j...@59a2.org> > Subject: [OMPI users] Setting RPATH for Open MPI libraries > To: Open MPI Users <us...@open-mpi.org> > Message-ID: > <CAM9tzS=qFkOZW=C4=dFg65W+QcX1J=h4w2-dzxwbpnenowj...@mail.gmail.com> > Content-Type: text/plain; charset="utf-8" > > Is there a way to configure Open MPI to use RPATH without needing to > manually specify --with-wrapper-ldflags=-Wl,-rpath,${prefix}/lib (and > similar for non-GNU-compatible compilers)? > -------------- next part -------------- > HTML attachment scrubbed and removed > > ------------------------------ > > _______________________________________________ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users > > End of users Digest, Vol 2347, Issue 1 > **************************************
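P.S. Below is a compilable version of the memory-test pseudocode that Jeff posted (quoted in message 3 of the digest above), which I plan to run as one of the diagnostics. It is only a sketch that follows his outline: the 1 GiB starting step, the volatile pointer, the progress printout, the NULL check and the error counter are additions of mine, and note that on Linux a successful malloc can be satisfied by overcommit, so it is the write loop that actually forces the pages in.

-----
/* mem_scribble.c - compilable version of the memory-test pseudocode above.
   Build with:  gcc -O2 -o mem_scribble mem_scribble.c                       */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t increment = (size_t)1 << 30;   /* probe in 1 GiB steps to start */
    size_t size      = (size_t)1 << 30;
    size_t i, n, errors = 0;
    volatile int *ptr;   /* volatile so the compiler cannot drop the readback */

    /* Find (roughly) the biggest amount of memory that we can malloc */
    while (increment >= 1024) {
        ptr = malloc(size);
        if (NULL != ptr) {
            free((void *)ptr);
            size += increment;
        } else {
            size -= increment;
            increment /= 2;
        }
    }
    printf("I can malloc %lu bytes\n", (unsigned long)size);

    /* Malloc that huge chunk of memory; back off a little if the last
       probe overshot or the system state changed in the meantime */
    while (NULL == (ptr = malloc(size)) && size > ((size_t)1 << 20))
        size -= (size_t)1 << 20;
    if (NULL == ptr) {
        fprintf(stderr, "final malloc failed\n");
        return 1;
    }

    /* Write a pattern into every int and read it back immediately;
       print some progress roughly every 256 MB */
    n = size / sizeof(int);
    for (i = 0; i < n; ++i) {
        ptr[i] = 37;
        if (ptr[i] != 37)
            ++errors;
        if (0 == i % (((size_t)256 << 20) / sizeof(int)))
            printf("... %lu of %lu bytes checked\n",
                   (unsigned long)(i * sizeof(int)), (unsigned long)size);
    }

    printf("All done, %lu readback errors\n", (unsigned long)errors);
    free((void *)ptr);
    return 0;
}
-----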