Dear Gus, I would like to thank you and everyone else for your interest and for the wise advice. Unfortunately, the cluster is currently offline (I don't know why; I have just returned to the university), but I can confirm that it is not always the same node that goes south.
In addition, last Friday I was able to run my code for at least 6 hours simply by placing only one process on each node. I also now know that the RAM is the original hardware from 8 years ago. Last but not least, a kernel segfault sometimes appears (I don't have access to the log files, so I didn't know this when I posted the thread). Tomorrow I'll start running the suggested diagnostics (a compilable version of Jeff's memory-test sketch is appended below the quoted digest). Many thanks again! 2012/9/8 <users-requ...@open-mpi.org>: > Send users mailing list submissions to > us...@open-mpi.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://www.open-mpi.org/mailman/listinfo.cgi/users > or, via email, send a message with subject or body 'help' to > users-requ...@open-mpi.org > > You can reach the person managing the list at > users-ow...@open-mpi.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of users digest..." > > > Today's Topics: > > 1. Re: problem with rankfile (Ralph Castain) > 2. Re: some mpi processes "disappear" on a cluster of servers > (Gus Correa) > 3. Re: some mpi processes "disappear" on a cluster of servers > (Gus Correa) > 4. Re: MPI_Allreduce fail (minGW gfortran + OpenMPI 1.6.1) > (Jeff Squyres) > 5. Setting RPATH for Open MPI libraries (Jed Brown) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Fri, 7 Sep 2012 10:33:55 -0700 > From: Ralph Castain <r...@open-mpi.org> > Subject: Re: [OMPI users] problem with rankfile > To: Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> > Cc: us...@open-mpi.org > Message-ID: <8c53f47d-b593-4994-931e-f746ac27b...@open-mpi.org> > Content-Type: text/plain; charset=us-ascii > > > On Sep 7, 2012, at 5:41 AM, Siegmar Gross > <siegmar.gr...@informatik.hs-fulda.de> wrote: > >> Hi, >> >> are the following outputs helpful to find the error with >> a rankfile on Solaris? > > If you can't bind on the new Solaris machine, then the rankfile won't do you > any good. It looks like we are getting the incorrect number of cores on that > machine - is it possible that it has hardware threads, and doesn't report > "cores"? Can you download and run a copy of lstopo to check the output? You > get that from the hwloc folks: > > http://www.open-mpi.org/software/hwloc/v1.5/ > > >> I wrapped long lines so that they >> are easier to read. Have you had time to look at the >> segmentation fault with a rankfile which I reported in my >> last email (see below)? > > I'm afraid not - been too busy lately. I'd suggest first focusing on getting > binding to work. > >> >> "tyr" is a two processor single core machine. >> >> tyr fd1026 116 mpiexec -report-bindings -np 4 \ >> -bind-to-socket -bycore rank_size >> [tyr.informatik.hs-fulda.de:18614] [[27298,0],0] odls:default: >> fork binding child [[27298,1],0] to socket 0 cpus 0001 >> [tyr.informatik.hs-fulda.de:18614] [[27298,0],0] odls:default: >> fork binding child [[27298,1],1] to socket 1 cpus 0002 >> [tyr.informatik.hs-fulda.de:18614] [[27298,0],0] odls:default: >> fork binding child [[27298,1],2] to socket 0 cpus 0001 >> [tyr.informatik.hs-fulda.de:18614] [[27298,0],0] odls:default: >> fork binding child [[27298,1],3] to socket 1 cpus 0002 >> I'm process 0 of 4 ...
>> >> >> tyr fd1026 121 mpiexec -report-bindings -np 4 \ >> -bind-to-socket -bysocket rank_size >> [tyr.informatik.hs-fulda.de:18656] [[27380,0],0] odls:default: >> fork binding child [[27380,1],0] to socket 0 cpus 0001 >> [tyr.informatik.hs-fulda.de:18656] [[27380,0],0] odls:default: >> fork binding child [[27380,1],1] to socket 1 cpus 0002 >> [tyr.informatik.hs-fulda.de:18656] [[27380,0],0] odls:default: >> fork binding child [[27380,1],2] to socket 0 cpus 0001 >> [tyr.informatik.hs-fulda.de:18656] [[27380,0],0] odls:default: >> fork binding child [[27380,1],3] to socket 1 cpus 0002 >> I'm process 0 of 4 ... >> >> >> tyr fd1026 117 mpiexec -report-bindings -np 4 \ >> -bind-to-core -bycore rank_size >> [tyr.informatik.hs-fulda.de:18623] [[27307,0],0] odls:default: >> fork binding child [[27307,1],2] to cpus 0004 >> ------------------------------------------------------------------ >> An attempt to set processor affinity has failed - please check to >> ensure that your system supports such functionality. If so, then >> this is probably something that should be reported to the OMPI >> developers. >> ------------------------------------------------------------------ >> [tyr.informatik.hs-fulda.de:18623] [[27307,0],0] odls:default: >> fork binding child [[27307,1],0] to cpus 0001 >> [tyr.informatik.hs-fulda.de:18623] [[27307,0],0] odls:default: >> fork binding child [[27307,1],1] to cpus 0002 >> ------------------------------------------------------------------ >> mpiexec was unable to start the specified application >> as it encountered an error >> on node tyr.informatik.hs-fulda.de. More information may be >> available above. >> ------------------------------------------------------------------ >> 4 total processes failed to start >> >> >> >> tyr fd1026 118 mpiexec -report-bindings -np 4 \ >> -bind-to-core -bysocket rank_size >> ------------------------------------------------------------------ >> An invalid physical processor ID was returned when attempting to >> bind >> an MPI process to a unique processor. >> >> This usually means that you requested binding to more processors >> than >> >> exist (e.g., trying to bind N MPI processes to M processors, >> where N > >> M). Double check that you have enough unique processors for >> all the >> MPI processes that you are launching on this host. >> >> You job will now abort. >> ------------------------------------------------------------------ >> [tyr.informatik.hs-fulda.de:18631] [[27347,0],0] odls:default: >> fork binding child [[27347,1],0] to socket 0 cpus 0001 >> [tyr.informatik.hs-fulda.de:18631] [[27347,0],0] odls:default: >> fork binding child [[27347,1],1] to socket 1 cpus 0002 >> ------------------------------------------------------------------ >> mpiexec was unable to start the specified application as it >> encountered an error >> on node tyr.informatik.hs-fulda.de. More information may be >> available above. >> ------------------------------------------------------------------ >> 4 total processes failed to start >> tyr fd1026 119 >> >> >> >> "linpc3" and "linpc4" are two processor dual core machines. 
>> >> linpc4 fd1026 102 mpiexec -report-bindings -host linpc3,linpc4 \ >> -np 4 -bind-to-core -bycore rank_size >> [linpc4:16842] [[40914,0],0] odls:default: >> fork binding child [[40914,1],1] to cpus 0001 >> [linpc4:16842] [[40914,0],0] odls:default: >> fork binding child [[40914,1],3] to cpus 0002 >> [linpc3:31384] [[40914,0],1] odls:default: >> fork binding child [[40914,1],0] to cpus 0001 >> [linpc3:31384] [[40914,0],1] odls:default: >> fork binding child [[40914,1],2] to cpus 0002 >> I'm process 1 of 4 ... >> >> >> linpc4 fd1026 102 mpiexec -report-bindings -host linpc3,linpc4 \ >> -np 4 -bind-to-core -bysocket rank_size >> [linpc4:16846] [[40918,0],0] odls:default: >> fork binding child [[40918,1],1] to socket 0 cpus 0001 >> [linpc4:16846] [[40918,0],0] odls:default: >> fork binding child [[40918,1],3] to socket 0 cpus 0002 >> [linpc3:31435] [[40918,0],1] odls:default: >> fork binding child [[40918,1],0] to socket 0 cpus 0001 >> [linpc3:31435] [[40918,0],1] odls:default: >> fork binding child [[40918,1],2] to socket 0 cpus 0002 >> I'm process 1 of 4 ... >> >> >> >> >> linpc4 fd1026 104 mpiexec -report-bindings -host linpc3,linpc4 \ >> -np 4 -bind-to-socket -bycore rank_size >> ------------------------------------------------------------------ >> Unable to bind to socket 0 on node linpc3. >> ------------------------------------------------------------------ >> ------------------------------------------------------------------ >> Unable to bind to socket 0 on node linpc4. >> ------------------------------------------------------------------ >> ------------------------------------------------------------------ >> mpiexec was unable to start the specified application as it >> encountered an error: >> >> Error name: Fatal >> Node: linpc4 >> >> when attempting to start process rank 1. >> ------------------------------------------------------------------ >> 4 total processes failed to start >> linpc4 fd1026 105 >> >> >> linpc4 fd1026 105 mpiexec -report-bindings -host linpc3,linpc4 \ >> -np 4 -bind-to-socket -bysocket rank_size >> ------------------------------------------------------------------ >> Unable to bind to socket 0 on node linpc4. >> ------------------------------------------------------------------ >> ------------------------------------------------------------------ >> Unable to bind to socket 0 on node linpc3. >> ------------------------------------------------------------------ >> ------------------------------------------------------------------ >> mpiexec was unable to start the specified application as it >> encountered an error: >> >> Error name: Fatal >> Node: linpc4 >> >> when attempting to start process rank 1. >> -------------------------------------------------------------------------- >> 4 total processes failed to start >> >> >> It's interesting that commands that work on Solaris fail on Linux >> and vice versa. >> >> >> Kind regards >> >> Siegmar >> >>>> I couldn't really say for certain - I don't see anything obviously >>>> wrong with your syntax, and the code appears to be working or else >>>> it would fail on the other nodes as well. The fact that it fails >>>> solely on that machine seems suspect. >>>> >>>> Set aside the rankfile for the moment and try to just bind to cores >>>> on that machine, something like: >>>> >>>> mpiexec --report-bindings -bind-to-core >>>> -host rs0.informatik.hs-fulda.de -n 2 rank_size >>>> >>>> If that doesn't work, then the problem isn't with rankfile >>> >>> It doesn't work but I found out something else as you can see below. 
>>> I get a segmentation fault for some rankfiles. >>> >>> >>> tyr small_prog 110 mpiexec --report-bindings -bind-to-core >>> -host rs0.informatik.hs-fulda.de -n 2 rank_size >>> -------------------------------------------------------------------------- >>> An attempt to set processor affinity has failed - please check to >>> ensure that your system supports such functionality. If so, then >>> this is probably something that should be reported to the OMPI developers. >>> -------------------------------------------------------------------------- >>> [rs0.informatik.hs-fulda.de:14695] [[30561,0],1] odls:default: >>> fork binding child [[30561,1],0] to cpus 0001 >>> -------------------------------------------------------------------------- >>> mpiexec was unable to start the specified application as it >>> encountered an error: >>> >>> Error name: Resource temporarily unavailable >>> Node: rs0.informatik.hs-fulda.de >>> >>> when attempting to start process rank 0. >>> -------------------------------------------------------------------------- >>> 2 total processes failed to start >>> tyr small_prog 111 >>> >>> >>> >>> >>> Perhaps I have a hint for the error on Solaris Sparc. I use the >>> following rankfile to keep everything simple. >>> >>> rank 0=tyr.informatik.hs-fulda.de slot=0:0 >>> rank 1=linpc0.informatik.hs-fulda.de slot=0:0 >>> rank 2=linpc1.informatik.hs-fulda.de slot=0:0 >>> #rank 3=linpc2.informatik.hs-fulda.de slot=0:0 >>> rank 4=linpc3.informatik.hs-fulda.de slot=0:0 >>> rank 5=linpc4.informatik.hs-fulda.de slot=0:0 >>> rank 6=sunpc0.informatik.hs-fulda.de slot=0:0 >>> rank 7=sunpc1.informatik.hs-fulda.de slot=0:0 >>> rank 8=sunpc2.informatik.hs-fulda.de slot=0:0 >>> rank 9=sunpc3.informatik.hs-fulda.de slot=0:0 >>> rank 10=sunpc4.informatik.hs-fulda.de slot=0:0 >>> >>> When I execute "mpiexec -report-bindings -rf my_rankfile rank_size" >>> on a Linux-x86_64 or Solaris-10-x86_64 machine everything works fine. >>> >>> linpc4 small_prog 104 mpiexec -report-bindings -rf my_rankfile rank_size >>> [linpc4:08018] [[49482,0],0] odls:default:fork binding child >>> [[49482,1],5] to slot_list 0:0 >>> [linpc3:22030] [[49482,0],4] odls:default:fork binding child >>> [[49482,1],4] to slot_list 0:0 >>> [linpc0:12887] [[49482,0],2] odls:default:fork binding child >>> [[49482,1],1] to slot_list 0:0 >>> [linpc1:08323] [[49482,0],3] odls:default:fork binding child >>> [[49482,1],2] to slot_list 0:0 >>> [sunpc1:17786] [[49482,0],6] odls:default:fork binding child >>> [[49482,1],7] to slot_list 0:0 >>> [sunpc3.informatik.hs-fulda.de:08482] [[49482,0],8] odls:default:fork >>> binding child [[49482,1],9] to slot_list 0:0 >>> [sunpc0.informatik.hs-fulda.de:11568] [[49482,0],5] odls:default:fork >>> binding child [[49482,1],6] to slot_list 0:0 >>> [tyr.informatik.hs-fulda.de:21484] [[49482,0],1] odls:default:fork >>> binding child [[49482,1],0] to slot_list 0:0 >>> [sunpc2.informatik.hs-fulda.de:28638] [[49482,0],7] odls:default:fork >>> binding child [[49482,1],8] to slot_list 0:0 >>> ... >>> >>> >>> >>> I get a segmentation fault when I run it on my local machine >>> (Solaris Sparc). 
>>> >>> tyr small_prog 141 mpiexec -report-bindings -rf my_rankfile rank_size >>> [tyr.informatik.hs-fulda.de:21421] [[29113,0],0] ORTE_ERROR_LOG: >>> Data unpack would read past end of buffer in file >>> ../../../../openmpi-1.6/orte/mca/odls/base/odls_base_default_fns.c >>> at line 927 >>> [tyr:21421] *** Process received signal *** >>> [tyr:21421] Signal: Segmentation Fault (11) >>> [tyr:21421] Signal code: Address not mapped (1) >>> [tyr:21421] Failing at address: 5ba >>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0:0x15d3ec >>> /lib/libc.so.1:0xcad04 >>> /lib/libc.so.1:0xbf3b4 >>> /lib/libc.so.1:0xbf59c >>> /lib/libc.so.1:0x58bd0 [ Signal 11 (SEGV)] >>> /lib/libc.so.1:free+0x24 >>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0: >>> orte_odls_base_default_construct_child_list+0x1234 >>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/openmpi/ >>> mca_odls_default.so:0x90b8 >>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0:0x5e8d4 >>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0: >>> orte_daemon_cmd_processor+0x328 >>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0:0x12e324 >>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0: >>> opal_event_base_loop+0x228 >>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0: >>> opal_progress+0xec >>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0: >>> orte_plm_base_report_launched+0x1c4 >>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0: >>> orte_plm_base_launch_apps+0x318 >>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/openmpi/mca_plm_rsh.so: >>> orte_plm_rsh_launch+0xac4 >>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/bin/orterun:orterun+0x16a8 >>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/bin/orterun:main+0x24 >>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/bin/orterun:_start+0xd8 >>> [tyr:21421] *** End of error message *** >>> Segmentation fault >>> tyr small_prog 142 >>> >>> >>> The funny thing is that I get a segmentation fault on the Linux >>> machine as well if I change my rankfile in the following way. 
>>> >>> rank 0=tyr.informatik.hs-fulda.de slot=0:0 >>> rank 1=linpc0.informatik.hs-fulda.de slot=0:0 >>> #rank 2=linpc1.informatik.hs-fulda.de slot=0:0 >>> #rank 3=linpc2.informatik.hs-fulda.de slot=0:0 >>> #rank 4=linpc3.informatik.hs-fulda.de slot=0:0 >>> rank 5=linpc4.informatik.hs-fulda.de slot=0:0 >>> rank 6=sunpc0.informatik.hs-fulda.de slot=0:0 >>> #rank 7=sunpc1.informatik.hs-fulda.de slot=0:0 >>> #rank 8=sunpc2.informatik.hs-fulda.de slot=0:0 >>> #rank 9=sunpc3.informatik.hs-fulda.de slot=0:0 >>> rank 10=sunpc4.informatik.hs-fulda.de slot=0:0 >>> >>> >>> linpc4 small_prog 107 mpiexec -report-bindings -rf my_rankfile rank_size >>> [linpc4:08402] [[65226,0],0] ORTE_ERROR_LOG: Data unpack would >>> read past end of buffer in file >>> ../../../../openmpi-1.6/orte/mca/odls/base/odls_base_default_fns.c >>> at line 927 >>> [linpc4:08402] *** Process received signal *** >>> [linpc4:08402] Signal: Segmentation fault (11) >>> [linpc4:08402] Signal code: Address not mapped (1) >>> [linpc4:08402] Failing at address: 0x5f32fffc >>> [linpc4:08402] [ 0] [0xffffe410] >>> [linpc4:08402] [ 1] /usr/local/openmpi-1.6_32_cc/lib/openmpi/ >>> mca_odls_default.so(+0x4023) [0xf73ec023] >>> [linpc4:08402] [ 2] /usr/local/openmpi-1.6_32_cc/lib/ >>> libopen-rte.so.4(+0x42b91) [0xf7667b91] >>> [linpc4:08402] [ 3] /usr/local/openmpi-1.6_32_cc/lib/ >>> libopen-rte.so.4(orte_daemon_cmd_processor+0x313) [0xf76655c3] >>> [linpc4:08402] [ 4] /usr/local/openmpi-1.6_32_cc/lib/ >>> libopen-rte.so.4(+0x8f366) [0xf76b4366] >>> [linpc4:08402] [ 5] /usr/local/openmpi-1.6_32_cc/lib/ >>> libopen-rte.so.4(opal_event_base_loop+0x18c) [0xf76b46bc] >>> [linpc4:08402] [ 6] /usr/local/openmpi-1.6_32_cc/lib/ >>> libopen-rte.so.4(opal_event_loop+0x26) [0xf76b4526] >>> [linpc4:08402] [ 7] /usr/local/openmpi-1.6_32_cc/lib/ >>> libopen-rte.so.4(opal_progress+0xba) [0xf769303a] >>> [linpc4:08402] [ 8] /usr/local/openmpi-1.6_32_cc/lib/ >>> libopen-rte.so.4(orte_plm_base_report_launched+0x13f) [0xf767d62f] >>> [linpc4:08402] [ 9] /usr/local/openmpi-1.6_32_cc/lib/ >>> libopen-rte.so.4(orte_plm_base_launch_apps+0x1b7) [0xf767bf27] >>> [linpc4:08402] [10] /usr/local/openmpi-1.6_32_cc/lib/openmpi/ >>> mca_plm_rsh.so(orte_plm_rsh_launch+0xb2d) [0xf74228fd] >>> [linpc4:08402] [11] mpiexec(orterun+0x102f) [0x804e7bf] >>> [linpc4:08402] [12] mpiexec(main+0x13) [0x804c273] >>> [linpc4:08402] [13] /lib/libc.so.6(__libc_start_main+0xf3) [0xf745e003] >>> [linpc4:08402] *** End of error message *** >>> Segmentation fault >>> linpc4 small_prog 107 >>> >>> >>> Hopefully this information helps to fix the problem. >>> >>> >>> Kind regards >>> >>> Siegmar >>> >>> >>> >>> >>>> On Sep 5, 2012, at 5:50 AM, Siegmar Gross >> <siegmar.gr...@informatik.hs-fulda.de> wrote: >>>> >>>>> Hi, >>>>> >>>>> I'm new to rankfiles so that I played a little bit with different >>>>> options. I thought that the following entry would be similar to an >>>>> entry in an appfile and that MPI could place the process with rank 0 >>>>> on any core of any processor. >>>>> >>>>> rank 0=tyr.informatik.hs-fulda.de >>>>> >>>>> Unfortunately it's not allowed and I got an error. Can somebody add >>>>> the missing help to the file? >>>>> >>>>> >>>>> tyr small_prog 126 mpiexec -rf my_rankfile -report-bindings rank_size >>>>> -------------------------------------------------------------------------- >>>>> Sorry! You were supposed to get help about: >>>>> no-slot-list >>>>> from the file: >>>>> help-rmaps_rank_file.txt >>>>> But I couldn't find that topic in the file. Sorry! 
>>>>> -------------------------------------------------------------------------- >>>>> >>>>> >>>>> As you can see below I could use a rankfile on my old local machine >>>>> (Sun Ultra 45) but not on our "new" one (Sun Server M4000). Today I >>>>> logged into the machine via ssh and tried the same command once more >>>>> as a local user without success. It's more or less the same error as >>>>> before when I tried to bind the process to a remote machine. >>>>> >>>>> rs0 small_prog 118 mpiexec -rf my_rankfile -report-bindings rank_size >>>>> [rs0.informatik.hs-fulda.de:13745] [[19734,0],0] odls:default:fork >>>>> binding child [[19734,1],0] to slot_list 0:0 >>>>> -------------------------------------------------------------------------- >>>>> We were unable to successfully process/set the requested processor >>>>> affinity settings: >>>>> >>>>> Specified slot list: 0:0 >>>>> Error: Cross-device link >>>>> >>>>> This could mean that a non-existent processor was specified, or >>>>> that the specification had improper syntax. >>>>> -------------------------------------------------------------------------- >>>>> -------------------------------------------------------------------------- >>>>> mpiexec was unable to start the specified application as it encountered an >> error: >>>>> >>>>> Error name: No such file or directory >>>>> Node: rs0.informatik.hs-fulda.de >>>>> >>>>> when attempting to start process rank 0. >>>>> -------------------------------------------------------------------------- >>>>> rs0 small_prog 119 >>>>> >>>>> >>>>> The application is available. >>>>> >>>>> rs0 small_prog 119 which rank_size >>>>> /home/fd1026/SunOS/sparc/bin/rank_size >>>>> >>>>> >>>>> Is it a problem in the Open MPI implementation or in my rankfile? >>>>> How can I request which sockets and cores per socket are >>>>> available so that I can use correct values in my rankfile? >>>>> In lam-mpi I had a command "lamnodes" which I could use to get >>>>> such information. Thank you very much for any help in advance. >>>>> >>>>> >>>>> Kind regards >>>>> >>>>> Siegmar >>>>> >>>>> >>>>> >>>>>>> Are *all* the machines Sparc? Or just the 3rd one (rs0)? >>>>>> >>>>>> Yes, both machines are Sparc. I tried first in a homogeneous >>>>>> environment. >>>>>> >>>>>> tyr fd1026 106 psrinfo -v >>>>>> Status of virtual processor 0 as of: 09/04/2012 07:32:14 >>>>>> on-line since 08/31/2012 15:44:42. >>>>>> The sparcv9 processor operates at 1600 MHz, >>>>>> and has a sparcv9 floating point processor. >>>>>> Status of virtual processor 1 as of: 09/04/2012 07:32:14 >>>>>> on-line since 08/31/2012 15:44:39. >>>>>> The sparcv9 processor operates at 1600 MHz, >>>>>> and has a sparcv9 floating point processor. >>>>>> tyr fd1026 107 >>>>>> >>>>>> My local machine (tyr) is a dual processor machine and the >>>>>> other one is equipped with two quad-core processors each >>>>>> capable of running two hardware threads. >>>>>> >>>>>> >>>>>> Kind regards >>>>>> >>>>>> Siegmar >>>>>> >>>>>> >>>>>>> On Sep 3, 2012, at 12:43 PM, Siegmar Gross >>>>>> <siegmar.gr...@informatik.hs-fulda.de> wrote: >>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> the man page for "mpiexec" shows the following: >>>>>>>> >>>>>>>> cat myrankfile >>>>>>>> rank 0=aa slot=1:0-2 >>>>>>>> rank 1=bb slot=0:0,1 >>>>>>>> rank 2=cc slot=1-2 >>>>>>>> mpirun -H aa,bb,cc,dd -rf myrankfile ./a.out So that >>>>>>>> >>>>>>>> Rank 0 runs on node aa, bound to socket 1, cores 0-2. >>>>>>>> Rank 1 runs on node bb, bound to socket 0, cores 0 and 1. 
>>>>>>>> Rank 2 runs on node cc, bound to cores 1 and 2. >>>>>>>> >>>>>>>> Does it mean that the process with rank 0 should be bound to >>>>>>>> core 0, 1, or 2 of socket 1? >>>>>>>> >>>>>>>> I tried to use a rankfile and have a problem. My rankfile contains >>>>>>>> the following lines. >>>>>>>> >>>>>>>> rank 0=tyr.informatik.hs-fulda.de slot=0:0 >>>>>>>> rank 1=tyr.informatik.hs-fulda.de slot=1:0 >>>>>>>> #rank 2=rs0.informatik.hs-fulda.de slot=0:0 >>>>>>>> >>>>>>>> >>>>>>>> Everything is fine if I use the file with just my local machine >>>>>>>> (the first two lines). >>>>>>>> >>>>>>>> tyr small_prog 115 mpiexec -report-bindings -rf my_rankfile rank_size >>>>>>>> [tyr.informatik.hs-fulda.de:01133] [[9849,0],0] >>>>>>>> odls:default:fork binding child [[9849,1],0] to slot_list 0:0 >>>>>>>> [tyr.informatik.hs-fulda.de:01133] [[9849,0],0] >>>>>>>> odls:default:fork binding child [[9849,1],1] to slot_list 1:0 >>>>>>>> I'm process 0 of 2 available processes running on >>>>>> tyr.informatik.hs-fulda.de. >>>>>>>> MPI standard 2.1 is supported. >>>>>>>> I'm process 1 of 2 available processes running on >>>>>> tyr.informatik.hs-fulda.de. >>>>>>>> MPI standard 2.1 is supported. >>>>>>>> tyr small_prog 116 >>>>>>>> >>>>>>>> >>>>>>>> I can also change the socket number and the processes will be attached >>>>>>>> to the correct cores. Unfortunately it doesn't work if I add one >>>>>>>> other machine (third line). >>>>>>>> >>>>>>>> >>>>>>>> tyr small_prog 112 mpiexec -report-bindings -rf my_rankfile rank_size >>>>>>>> >> -------------------------------------------------------------------------- >>>>>>>> We were unable to successfully process/set the requested processor >>>>>>>> affinity settings: >>>>>>>> >>>>>>>> Specified slot list: 0:0 >>>>>>>> Error: Cross-device link >>>>>>>> >>>>>>>> This could mean that a non-existent processor was specified, or >>>>>>>> that the specification had improper syntax. >>>>>>>> >> -------------------------------------------------------------------------- >>>>>>>> [tyr.informatik.hs-fulda.de:01520] [[10212,0],0] >>>>>>>> odls:default:fork binding child [[10212,1],0] to slot_list 0:0 >>>>>>>> [tyr.informatik.hs-fulda.de:01520] [[10212,0],0] >>>>>>>> odls:default:fork binding child [[10212,1],1] to slot_list 1:0 >>>>>>>> [rs0.informatik.hs-fulda.de:12047] [[10212,0],1] >>>>>>>> odls:default:fork binding child [[10212,1],2] to slot_list 0:0 >>>>>>>> [tyr.informatik.hs-fulda.de:01520] [[10212,0],0] >>>>>>>> ORTE_ERROR_LOG: A message is attempting to be sent to a process >>>>>>>> whose contact information is unknown in file >>>>>>>> ../../../../../openmpi-1.6/orte/mca/rml/oob/rml_oob_send.c at line 145 >>>>>>>> [tyr.informatik.hs-fulda.de:01520] [[10212,0],0] attempted to send >>>>>>>> to [[10212,1],0]: tag 20 >>>>>>>> [tyr.informatik.hs-fulda.de:01520] [[10212,0],0] ORTE_ERROR_LOG: >>>>>>>> A message is attempting to be sent to a process whose contact >>>>>>>> information is unknown in file >>>>>>>> ../../../../openmpi-1.6/orte/mca/odls/base/odls_base_default_fns.c >>>>>>>> at line 2501 >>>>>>>> >> -------------------------------------------------------------------------- >>>>>>>> mpiexec was unable to start the specified application as it >>>>>>>> encountered an error: >>>>>>>> >>>>>>>> Error name: Error 0 >>>>>>>> Node: rs0.informatik.hs-fulda.de >>>>>>>> >>>>>>>> when attempting to start process rank 2. 
>>>>>>>> >> -------------------------------------------------------------------------- >>>>>>>> tyr small_prog 113 >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> The other machine has two 8 core processors. >>>>>>>> >>>>>>>> tyr small_prog 121 ssh rs0 psrinfo -v >>>>>>>> Status of virtual processor 0 as of: 09/03/2012 19:51:15 >>>>>>>> on-line since 07/26/2012 15:03:14. >>>>>>>> The sparcv9 processor operates at 2400 MHz, >>>>>>>> and has a sparcv9 floating point processor. >>>>>>>> Status of virtual processor 1 as of: 09/03/2012 19:51:15 >>>>>>>> ... >>>>>>>> Status of virtual processor 15 as of: 09/03/2012 19:51:15 >>>>>>>> on-line since 07/26/2012 15:03:16. >>>>>>>> The sparcv9 processor operates at 2400 MHz, >>>>>>>> and has a sparcv9 floating point processor. >>>>>>>> tyr small_prog 122 >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Is it necessary to specify another option on the command line or >>>>>>>> is my rankfile faulty? Thank you very much for any suggestions in >>>>>>>> advance. >>>>>>>> >>>>>>>> >>>>>>>> Kind regards >>>>>>>> >>>>>>>> Siegmar >>>>>>>> >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> users mailing list >>>>>>>> us...@open-mpi.org >>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>> >>>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> users mailing list >>>>>> us...@open-mpi.org >>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>> >>>>> _______________________________________________ >>>>> users mailing list >>>>> us...@open-mpi.org >>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>> >>>> >> > > > > > ------------------------------ > > Message: 2 > Date: Fri, 07 Sep 2012 18:01:46 -0400 > From: Gus Correa <g...@ldeo.columbia.edu> > Subject: Re: [OMPI users] some mpi processes "disappear" on a cluster > of servers > To: Open MPI Users <us...@open-mpi.org> > Message-ID: <504a6eca.4040...@ldeo.columbia.edu> > Content-Type: text/plain; charset=ISO-8859-1; format=flowed > > On 09/03/2012 04:39 PM, Andrea Negri wrote: >> max locked memory (kbytes, -l) 32 >>> max memory size (kbytes, -m) unlimited >>> open files (-n) 1024 >>> pipe size (512 bytes, -p) 8 >>> POSIX message queues (bytes, -q) 819200 >>> stack size (kbytes, -s) 10240 >>> > > Hi Andrea > This is besides the possibilities of > running out of physical memory, > or even defective memory chips, which Jeff, Ralph, > John, George have addressed, I still think that the > system limits above may play a role. > In a 8-year old cluster, hardware failures are not unexpected. > > > 1) System limits > > For what it is worth, virtually none of the programs we run here, > mostly atmosphere/ocean/climate codes, > would run with these limits. > On our compute nodes we set > max locked memory and stack size to > unlimited, to avoid problems with symptoms very > similar to those you describe. > Typically there are lots of automatic arrays in subroutines, > etc, which require a large stack. > Your sys admin could add these lines to the bottom > of /etc/security/limits.conf [the last one sets the > max number of open files]: > > * - memlock -1 > * - stack -1 > * - nofile 4096 > > 2) Defective network interface/cable/switch port > > Yet another possibility, following Ralph's suggestion, > is that you may have a failing network interface, or a bad > Ethernet cable or connector, on the node that goes south, > or on the switch port that serves that node. > [I am assuming your network is Ethernet, probably GigE.] 
> > Again, in a 8-year old cluster, hardware failures are not unexpected. > > We had this sort of problems with old clusters, old nodes. > > 3) Quarantine the bad node > > Is it always the same node that fails, or does it vary? > [Please answer, it helps us understand what's going on.] > > If it is always the same node, have you tried to quarantine it, > either temporarily removing it from your job submission system > or just turning it off, and run the job on the remaining > nodes? > > I hope this helps, > Gus Correa > > > ------------------------------ > > Message: 3 > Date: Fri, 07 Sep 2012 18:12:20 -0400 > From: Gus Correa <g...@ldeo.columbia.edu> > Subject: Re: [OMPI users] some mpi processes "disappear" on a cluster > of servers > To: Open MPI Users <us...@open-mpi.org> > Message-ID: <504a7144.1000...@ldeo.columbia.edu> > Content-Type: text/plain; charset=ISO-8859-1; format=flowed > > On 09/07/2012 08:02 AM, Jeff Squyres wrote: >> On Sep 7, 2012, at 5:58 AM, Jeff Squyres wrote: >> >>> Also look for hardware errors. Perhaps you have some bad RAM somewhere. >>> Is it always the same node that crashes? And so on. >> >> >> Another thought on hardware errors... I actually have seen bad RAM cause >> spontaneous reboots with no Linux warnings. >> >> Do you have any hardware diagnostics from your server >> vendor that you can run? >> > > If you don't have a vendor provided diagnostic tool, > you or your sys admin could try Advanced Clustering "breakin": > > http://www.advancedclustering.com/our-software/view-category.html > > Download the ISO version, burn a CD, put in the node CD drive, > assuming it has one, reboot, chose breakin in the menu options. > If there is no CD drive, there is an alternative with network boot, > although more involved. > > I hope it helps, > Gus Correa > >> A simple way to test your RAM (it's not completely comprehensive, but it >> does check for a surprisingly wide array of memory issues) is to do >> something like this (pseudocode): >> >> ----- >> size_t i, size, increment; >> increment = 1GB; >> size = 1GB; >> int *ptr; >> >> // Find the biggest amount of memory that you can malloc >> while (increment>= 1024) { >> ptr = malloc(size); >> if (NULL != ptr) { >> free(ptr); >> size += increment; >> } else { >> size -= increment; >> increment /= 2; >> } >> } >> printf("I can malloc %lu bytes\n", size); >> >> // Malloc that huge chunk of memory >> ptr = malloc(size); >> for (i = 0; i< size / sizeof(int); ++i, ++ptr) { >> *ptr = 37; >> if (*ptr != 37) { >> printf("Readback error!\n"); >> } >> } >> >> printf("All done\n"); >> ----- >> >> Depending on how much memory you have, > that might take a little while to run > (all the memory has to be paged in, etc.). > You might want to add a status output to show progress, > and/or write/read a page at a time for better efficiency, etc. > But you get the idea. >> >> Hope that helps. >> > > > > ------------------------------ > > Message: 4 > Date: Sat, 8 Sep 2012 06:25:22 -0400 > From: Jeff Squyres <jsquy...@cisco.com> > Subject: Re: [OMPI users] MPI_Allreduce fail (minGW gfortran + OpenMPI > 1.6.1) > To: Open MPI Users <us...@open-mpi.org> > Message-ID: <1b815295-1d69-46db-8dcb-b308c228b...@cisco.com> > Content-Type: text/plain; charset=us-ascii > > I am unable to replicate your problem, but admittedly I only have access to > gfortran on Linux. And I am definitely *not* a Fortran expert. :-\ > > The code seems to run fine for me -- can you send another test program that > actually tests the results of the all reduce? 
Fortran allocatable stuff > always confuses me; I wonder if perhaps we're not getting the passed pointer > properly. Checking the results of the all reduce would be a good way to > check this theory. > > > > On Sep 6, 2012, at 12:05 PM, Yonghui wrote: > >> Dear mpi users and developers, >> >> I am having some trouble with MPI_Allreduce. I am using MinGW (gcc 4.6.2) >> with OpenMPI 1.6.1. The MPI_Allreduce in c version works fine, but the >> fortran version failed with error. Here is the simple fortran code to >> reproduce the error: >> >> program main >> implicit none >> include 'mpif.h' >> character * (MPI_MAX_PROCESSOR_NAME) >> processor_name >> integer myid, numprocs, namelen, rc, ierr >> integer, allocatable :: mat1(:, :, :) >> >> call MPI_INIT( ierr ) >> call MPI_COMM_RANK( MPI_COMM_WORLD, myid, >> ierr ) >> call MPI_COMM_SIZE( MPI_COMM_WORLD, >> numprocs, ierr ) >> allocate(mat1(-36:36, -36:36, -36:36)) >> mat1(:,:,:) = 111 >> print *, "Going to call MPI_Allreduce." >> call MPI_Allreduce(MPI_IN_PLACE, mat1(-36, >> -36, -36), 389017, MPI_INTEGER, MPI_BOR, MPI_COMM_WORLD, ierr) >> print *, "MPI_Allreduce done!!!" >> call MPI_FINALIZE(rc) >> endprogram >> >> The command that I used to compile: >> gfortran Allreduce.f90 -IC:\OpenMPI-win32\include >> C:\OpenMPI-win32\lib\libmpi_f77.lib >> >> The MPI_Allreduce fail. [xxxxxxx:02112] [[17193,0],0]-[[17193,1],0] >> mca_oob_tcp_msg_recv: readv failed: Unknown error (108). >> I am not sure why this happens. But I think it is the windows build MPI >> problem. Since the simple code works on a Linux system with gfortran. >> >> Any ideas? I appreciate any response! >> >> Yonghui >> >> _______________________________________________ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > > > > ------------------------------ > > Message: 5 > Date: Sat, 8 Sep 2012 07:46:16 -0500 > From: Jed Brown <j...@59a2.org> > Subject: [OMPI users] Setting RPATH for Open MPI libraries > To: Open MPI Users <us...@open-mpi.org> > Message-ID: > <CAM9tzS=qFkOZW=C4=dFg65W+QcX1J=h4w2-dzxwbpnenowj...@mail.gmail.com> > Content-Type: text/plain; charset="utf-8" > > Is there a way to configure Open MPI to use RPATH without needing to > manually specify --with-wrapper-ldflags=-Wl,-rpath,${prefix}/lib (and > similar for non-GNU-compatible compilers)? > -------------- next part -------------- > HTML attachment scrubbed and removed > > ------------------------------ > > _______________________________________________ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users > > End of users Digest, Vol 2347, Issue 1 > **************************************
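P.S. Below is a compilable version of the memory-test pseudocode that Jeff posted (quoted in message 3 of the digest above), which I plan to run as one of the diagnostics. It is only a sketch that follows his outline: the 1 GiB starting step, the volatile pointer, the progress printout, the NULL check and the error counter are additions of mine, and note that on Linux a successful malloc can be satisfied by overcommit, so it is the write loop that actually forces the pages in.

-----
/* mem_scribble.c - compilable version of the memory-test pseudocode above.
   Build with:  gcc -O2 -o mem_scribble mem_scribble.c                       */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t increment = (size_t)1 << 30;   /* probe in 1 GiB steps to start */
    size_t size      = (size_t)1 << 30;
    size_t i, n, errors = 0;
    volatile int *ptr;   /* volatile so the compiler cannot drop the readback */

    /* Find (roughly) the biggest amount of memory that we can malloc */
    while (increment >= 1024) {
        ptr = malloc(size);
        if (NULL != ptr) {
            free((void *)ptr);
            size += increment;
        } else {
            size -= increment;
            increment /= 2;
        }
    }
    printf("I can malloc %lu bytes\n", (unsigned long)size);

    /* Malloc that huge chunk of memory; back off a little if the last
       probe overshot or the system state changed in the meantime */
    while (NULL == (ptr = malloc(size)) && size > ((size_t)1 << 20))
        size -= (size_t)1 << 20;
    if (NULL == ptr) {
        fprintf(stderr, "final malloc failed\n");
        return 1;
    }

    /* Write a pattern into every int and read it back immediately;
       print some progress roughly every 256 MB */
    n = size / sizeof(int);
    for (i = 0; i < n; ++i) {
        ptr[i] = 37;
        if (ptr[i] != 37)
            ++errors;
        if (0 == i % (((size_t)256 << 20) / sizeof(int)))
            printf("... %lu of %lu bytes checked\n",
                   (unsigned long)(i * sizeof(int)), (unsigned long)size);
    }

    printf("All done, %lu readback errors\n", (unsigned long)errors);
    free((void *)ptr);
    return 0;
}
-----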