Same result. (It works up through 102 processes, but not more than that.)

Input: mpirun -np 128 --debug-daemons --map-by ppr:64:socket hostname

Output:

[Gen2Node3:54348] [[18008,0],0] orted_cmd: received add_local_procs
[Gen2Node3:54348] [[18008,0],0] orted_cmd: received exit cmd
[Gen2Node3:54348] [[18008,0],0] orted_cmd: all routes and children gone - exiting
--------------------------------------------------------------------------
mpirun was unable to start the specified application as it encountered an error:

Error code: 63
Error name: (null)
Node: Gen2Node3

when attempting to start process rank 0.
--------------------------------------------------------------------------
128 total processes failed to start

Collin
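Given that launches succeed at 102 processes and fail beyond it, a quick sweep can pin down the exact cutoff and whether it is stable across runs. A minimal sketch, assuming a bash shell and hostname as a harmless test payload:

for n in 100 101 102 103 104 105 106; do
    printf 'np=%-4s ' "$n"
    mpirun -np "$n" hostname > /dev/null 2>&1 && echo OK || echo FAILED
done

If the cutoff drifts between runs, that would point toward a shared resource pool being exhausted rather than a fixed configuration limit.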
From: Joshua Ladd <jladd.m...@gmail.com>
Sent: Tuesday, January 28, 2020 2:24 PM
To: Collin Strassburger <cstrassbur...@bihrle.com>
Cc: Open MPI Users <users@lists.open-mpi.org>; Ralph Castain <r...@open-mpi.org>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

OK. Please try:

mpirun -np 128 --debug-daemons --map-by ppr:64:socket hostname

Josh

On Tue, Jan 28, 2020 at 12:49 PM Collin Strassburger <cstrassbur...@bihrle.com> wrote:

Input: mpirun -np 128 --debug-daemons -mca plm rsh hostname

Output:

[Gen2Node3:54039] [[16643,0],0] orted_cmd: received add_local_procs
[Gen2Node3:54039] [[16643,0],0] orted_cmd: received exit cmd
[Gen2Node3:54039] [[16643,0],0] orted_cmd: all routes and children gone - exiting
--------------------------------------------------------------------------
mpirun was unable to start the specified application as it encountered an error:

Error code: 63
Error name: (null)
Node: Gen2Node3

when attempting to start process rank 0.
--------------------------------------------------------------------------
128 total processes failed to start

Collin

From: Joshua Ladd <jladd.m...@gmail.com>
Sent: Tuesday, January 28, 2020 12:48 PM
To: Collin Strassburger <cstrassbur...@bihrle.com>
Cc: Open MPI Users <users@lists.open-mpi.org>; Ralph Castain <r...@open-mpi.org>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

Sorry, typo, try:

mpirun -np 128 --debug-daemons -mca plm rsh hostname

Josh

On Tue, Jan 28, 2020 at 12:45 PM Joshua Ladd <jladd.m...@gmail.com> wrote:

And if you try:

mpirun -np 128 --debug-daemons -plm rsh hostname

Josh

On Tue, Jan 28, 2020 at 12:34 PM Collin Strassburger <cstrassbur...@bihrle.com> wrote:

Input: mpirun -np 128 --debug-daemons hostname

Output:

[Gen2Node3:54023] [[16659,0],0] orted_cmd: received add_local_procs
[Gen2Node3:54023] [[16659,0],0] orted_cmd: received exit cmd
[Gen2Node3:54023] [[16659,0],0] orted_cmd: all routes and children gone - exiting
--------------------------------------------------------------------------
mpirun was unable to start the specified application as it encountered an error:

Error code: 63
Error name: (null)
Node: Gen2Node3

when attempting to start process rank 0.
--------------------------------------------------------------------------

Collin
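One note for readers following along: --debug-daemons only reports daemon-level commands; the actual fork of the application processes happens in ORTE's odls framework. Raising the launch verbosity may name the failing system call behind error 63. A sketch, assuming Open MPI 4.0.x MCA parameter names:

mpirun -np 128 --mca plm_base_verbose 10 --mca odls_base_verbose 10 hostname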
From: Joshua Ladd <jladd.m...@gmail.com>
Sent: Tuesday, January 28, 2020 12:31 PM
To: Collin Strassburger <cstrassbur...@bihrle.com>
Cc: Open MPI Users <users@lists.open-mpi.org>; Ralph Castain <r...@open-mpi.org>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

Interesting. Can you try:

mpirun -np 128 --debug-daemons hostname

Josh

On Tue, Jan 28, 2020 at 12:14 PM Collin Strassburger <cstrassbur...@bihrle.com> wrote:

In relation to the multi-node attempt, I haven't set that up yet, as the per-node configuration doesn't pass its tests (full node utilization, etc.). Here are the results for the hostname test:

Input: mpirun -np 128 hostname

Output:

--------------------------------------------------------------------------
mpirun was unable to start the specified application as it encountered an error:

Error code: 63
Error name: (null)
Node: Gen2Node3

when attempting to start process rank 0.
--------------------------------------------------------------------------
128 total processes failed to start

Collin

From: users <users-boun...@lists.open-mpi.org> On Behalf Of Ralph Castain via users
Sent: Tuesday, January 28, 2020 12:06 PM
To: Joshua Ladd <jladd.m...@gmail.com>
Cc: Ralph Castain <r...@open-mpi.org>; Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

Josh - if you read through the thread, you will see that disabling the Mellanox/IB drivers allows the program to run. It only fails when they are enabled.

On Jan 28, 2020, at 8:49 AM, Joshua Ladd <jladd.m...@gmail.com> wrote:

I don't see how this can be diagnosed as a "problem with the Mellanox software". This is on a single node. What happens when you try to launch on more than one node?

Josh

On Tue, Jan 28, 2020 at 11:43 AM Collin Strassburger <cstrassbur...@bihrle.com> wrote:

Here's the I/O for these high local core count runs. ("xhpcg" is the standard HPCG benchmark.)

Run command: mpirun -np 128 bin/xhpcg

Output:

--------------------------------------------------------------------------
mpirun was unable to start the specified application as it encountered an error:

Error code: 63
Error name: (null)
Node: Gen2Node4

when attempting to start process rank 0.
--------------------------------------------------------------------------
128 total processes failed to start

Collin

From: Joshua Ladd <jladd.m...@gmail.com>
Sent: Tuesday, January 28, 2020 11:39 AM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Collin Strassburger <cstrassbur...@bihrle.com>; Ralph Castain <r...@open-mpi.org>; Artem Polyakov <art...@mellanox.com>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

Can you send the output of a failed run, including your command line?

Josh
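Ralph's point that the failure tracks the Mellanox/IB drivers suggests a complementary runtime test: steer the job away from the Mellanox components without rebuilding. A sketch, assuming Open MPI 4.0.x component names (ob1 and vader are the non-UCX point-to-point and shared-memory paths):

mpirun -np 128 --mca pml ob1 --mca btl self,vader hostname

One caveat: the error here occurs at spawn time, before these components are exercised, so a failure under this command would point back at the launcher or the environment rather than the transport stack.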
On Tue, Jan 28, 2020 at 11:26 AM Ralph Castain via users <users@lists.open-mpi.org> wrote:

Okay, so this is a problem with the Mellanox software - copying Artem.

On Jan 28, 2020, at 8:15 AM, Collin Strassburger <cstrassbur...@bihrle.com> wrote:

I just tried that, and it does indeed work with pbs and without Mellanox (until a reboot makes it complain about Mellanox/IB-related defaults, as no drivers were installed, etc.). After installing the Mellanox drivers, I used

./configure --prefix=/usr/ --with-tm=/opt/pbs/ --with-slurm=no --with-ucx --with-platform=contrib/platform/mellanox/optimized

With the new compile it fails at the higher core counts.

Collin

From: users <users-boun...@lists.open-mpi.org> On Behalf Of Ralph Castain via users
Sent: Tuesday, January 28, 2020 11:02 AM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Ralph Castain <r...@open-mpi.org>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

Does it work with pbs but not Mellanox? Just trying to isolate the problem.

On Jan 28, 2020, at 6:39 AM, Collin Strassburger via users <users@lists.open-mpi.org> wrote:

Hello,

I have done some additional testing, and I can say that it works correctly with gcc8 and no Mellanox or pbs installed. I have done two runs with Mellanox and pbs installed: one uses the actual run options I will be using, while the other uses a truncated set which still compiles but fails to execute correctly. As the build with the actual run options produces a smaller config log, I am including it here.

Version: 4.0.2

The config log is available at https://gist.github.com/BTemp1282020/fedca1aeed3b57296b8f21688ccae31c and the ompi dump is available at https://pastebin.com/md3HwTUR.

The IB network information (which is not being explicitly operated across):

Packages: MLNX_OFED and Mellanox HPC-X, both current versions (MLNX_OFED_LINUX-4.7-3.2.9.0-rhel8.1-x86_64 and hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat8.1-x86_64)

ulimit -l = unlimited

ibv_devinfo:
hca_id: mlx4_0
    transport:        InfiniBand (0)
    fw_ver:           2.42.5000
    …
    vendor_id:        0x02c9
    vendor_part_id:   4099
    hw_ver:           0x1
    board_id:         MT_1100120019
    phys_port_cnt:    1
    Device ports:
        port: 1
            state:        PORT_ACTIVE (4)
            max_mtu:      4096 (5)
            active_mtu:   4096 (5)
            sm_lid:       1
            port_lid:     12
            port_lmc:     0x00
            link_layer:   InfiniBand

It looks like the rest of the IB information is in the config file.

I hope this helps,

Collin

From: Jeff Squyres (jsquyres) <jsquy...@cisco.com>
Sent: Monday, January 27, 2020 3:40 PM
To: Open MPI User's List <users@lists.open-mpi.org>
Cc: Collin Strassburger <cstrassbur...@bihrle.com>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

Can you please send all the information listed here: https://www.open-mpi.org/community/help/

Thanks!
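Since the gcc8 build without Mellanox or pbs works while the Mellanox platform build fails, the build itself can be bisected. A sketch, assuming Open MPI 4.0.2's standard configure switches; the idea is to keep UCX but drop the platform file (and the MCA defaults it bakes in), then tighten further if needed:

./configure --prefix=/usr/ --with-tm=/opt/pbs/ --with-slurm=no --with-ucx
./configure --prefix=/usr/ --with-tm=/opt/pbs/ --with-slurm=no --without-ucx --without-hcoll

If the first variant runs 128 ranks cleanly, the platform file's defaults are implicated rather than UCX itself.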
On Jan 27, 2020, at 12:00 PM, Collin Strassburger via users <users@lists.open-mpi.org> wrote:

Hello,

I had initially thought the same thing about the streams, but I have 2 sockets with 64 cores each. Additionally, I have not yet turned multithreading off, so lscpu reports a total of 256 logical cores and 128 physical cores. As such, I don't see how it could be running out of streams unless something is being passed incorrectly.

Collin

From: users <users-boun...@lists.open-mpi.org> On Behalf Of Ray Sheppard via users
Sent: Monday, January 27, 2020 11:53 AM
To: users@lists.open-mpi.org
Cc: Ray Sheppard <rshep...@iu.edu>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

Hi All,

Just my two cents: I think error code 63 is saying it is running out of streams to use. I think you have only 64 cores, so at 100 you are overloading most of them. It feels like you are running out of resources trying to swap ranks in and out on physical cores.

Ray

On 1/27/2020 11:29 AM, Collin Strassburger via users wrote:

Hello Howard,

To remove potential interactions, I have found that the issue persists without ucx and hcoll support.

Run command: mpirun -np 128 bin/xhpcg

Output:

--------------------------------------------------------------------------
mpirun was unable to start the specified application as it encountered an error:

Error code: 63
Error name: (null)
Node: Gen2Node4

when attempting to start process rank 0.
--------------------------------------------------------------------------
128 total processes failed to start

It returns this error for any run I initialize with >100 processes per node. I get the same error message for multiple different codes, so the error is MPI-related rather than program-specific.

Collin

From: Howard Pritchard <hpprit...@gmail.com>
Sent: Monday, January 27, 2020 11:20 AM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Collin Strassburger <cstrassbur...@bihrle.com>
Subject: Re: [OMPI users] OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

Hello Collin,

Could you provide more information about the error? Is there any output from either Open MPI or, maybe, UCX, that could provide more information about the problem you are hitting?

Howard

On Mon., Jan. 27, 2020 at 08:38, Collin Strassburger via users <users@lists.open-mpi.org> wrote:

Hello,

I am having difficulty with Open MPI versions 4.0.2 and 3.1.5. Both of these versions produce the same error (error code 63) when utilizing more than 100 cores on a single node. The processors I am utilizing are AMD Epyc "Rome" 7742s. The OS is CentOS 8.1. I have tried compiling with both the default gcc 8 and a locally compiled gcc 9. I have already tried modifying the maximum name field values, with no success.

My compile options are:

./configure --prefix=${HPCX_HOME}/ompi --with-platform=contrib/platform/mellanox/optimized

Any assistance would be appreciated,

Collin

Collin Strassburger
Bihrle Applied Research Inc.

--
Jeff Squyres
jsquy...@cisco.com
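A closing note on the streams hypothesis above: a failure that appears only beyond ~100 local processes is also the classic signature of a per-user resource ceiling (open files, pipes, or processes) being hit by the launch daemon, independent of how many cores exist. Checking the limits from the shell that runs mpirun is cheap; a sketch, assuming bash on the compute node:

ulimit -n   # max open file descriptors (the daemon holds several per local rank for I/O forwarding)
ulimit -u   # max user processes/threads
ulimit -l   # max locked memory (already unlimited here)
lscpu | grep -E 'Socket|Core|Thread'   # confirm the 2 x 64-core topology

If -n or -u sits anywhere near the observed ~102-process cutoff, raising it (e.g., in /etc/security/limits.conf, then re-logging in) would be the first thing to try.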