Interesting. Can you try:

    mpirun -np 128 --debug-daemons hostname

Josh
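[A minimal sketch of how the launch failure might be made more verbose, assuming Open MPI 4.0.x; the MCA verbosity levels and their exact output vary by release, so treat this as a starting point rather than a prescribed recipe:

    # Sketch: raise verbosity on the launch (plm) and local-spawn (odls)
    # frameworks while running the same hostname test.
    mpirun -np 128 --debug-daemons \
           --mca plm_base_verbose 10 \
           --mca odls_base_verbose 10 \
           hostname

The extra output may help show which step of spawning rank 0 is returning error 63.]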
On Tue, Jan 28, 2020 at 12:14 PM Collin Strassburger <cstrassbur...@bihrle.com> wrote:

In relation to the multi-node attempt, I haven't set that up yet, as the per-node configuration doesn't pass its tests (full node utilization, etc.).

Here are the results for the hostname test:

Input: mpirun -np 128 hostname

Output:
--------------------------------------------------------------------------
mpirun was unable to start the specified application as it encountered an
error:

Error code: 63
Error name: (null)
Node: Gen2Node3

when attempting to start process rank 0.
--------------------------------------------------------------------------
128 total processes failed to start

Collin

From: users <users-boun...@lists.open-mpi.org> On Behalf Of Ralph Castain via users
Sent: Tuesday, January 28, 2020 12:06 PM
To: Joshua Ladd <jladd.m...@gmail.com>
Cc: Ralph Castain <r...@open-mpi.org>; Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

Josh - if you read through the thread, you will see that disabling the Mellanox/IB drivers allows the program to run. It only fails when they are enabled.

On Jan 28, 2020, at 8:49 AM, Joshua Ladd <jladd.m...@gmail.com> wrote:

I don't see how this can be diagnosed as a "problem with the Mellanox Software". This is on a single node. What happens when you try to launch on more than one node?

Josh

On Tue, Jan 28, 2020 at 11:43 AM Collin Strassburger <cstrassbur...@bihrle.com> wrote:

Here's the I/O for these high local core count runs. ("xhpcg" is the standard HPCG benchmark.)

Run command: mpirun -np 128 bin/xhpcg

Output:
--------------------------------------------------------------------------
mpirun was unable to start the specified application as it encountered an
error:

Error code: 63
Error name: (null)
Node: Gen2Node4

when attempting to start process rank 0.
--------------------------------------------------------------------------
128 total processes failed to start

Collin

From: Joshua Ladd <jladd.m...@gmail.com>
Sent: Tuesday, January 28, 2020 11:39 AM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Collin Strassburger <cstrassbur...@bihrle.com>; Ralph Castain <r...@open-mpi.org>; Artem Polyakov <art...@mellanox.com>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

Can you send the output of a failed run, including your command line?

Josh

On Tue, Jan 28, 2020 at 11:26 AM Ralph Castain via users <users@lists.open-mpi.org> wrote:

Okay, so this is a problem with the Mellanox software - copying Artem.

On Jan 28, 2020, at 8:15 AM, Collin Strassburger <cstrassbur...@bihrle.com> wrote:

I just tried that, and it does indeed work with PBS and without Mellanox (until a reboot makes it complain about Mellanox/IB-related defaults, as no drivers were installed, etc.).

After installing the Mellanox drivers, I used:

    ./configure --prefix=/usr/ --with-tm=/opt/pbs/ --with-slurm=no --with-ucx --with-platform=contrib/platform/mellanox/optimized

With the new compile it fails on the higher core counts.

Collin
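[Since the rebuilt library only fails once the Mellanox stack is in the picture, one run-time way to narrow it down is to pick the point-to-point layer by hand. A minimal sketch, assuming an Open MPI 4.0.x build with UCX compiled in; the component names (pml ob1/ucx, btl self,vader) are standard, but what is actually available depends on how the library was configured:

    # Confirm which install is on PATH and whether it picked up UCX/HCOLL:
    which mpirun
    ompi_info | grep -i -E 'ucx|hcoll'

    # Force the non-UCX path (shared-memory and self BTLs) on a single node:
    mpirun -np 128 --mca pml ob1 --mca btl self,vader bin/xhpcg

    # Explicitly request the UCX PML for comparison:
    mpirun -np 128 --mca pml ucx bin/xhpcg

If the ob1 run starts all 128 ranks and the UCX run does not, that would point at the UCX/HCOLL layer rather than at the launcher itself.]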
From: users <users-boun...@lists.open-mpi.org> On Behalf Of Ralph Castain via users
Sent: Tuesday, January 28, 2020 11:02 AM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Ralph Castain <r...@open-mpi.org>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

Does it work with PBS but not Mellanox? Just trying to isolate the problem.

On Jan 28, 2020, at 6:39 AM, Collin Strassburger via users <users@lists.open-mpi.org> wrote:

Hello,

I have done some additional testing and I can say that it works correctly with gcc 8 and no Mellanox or PBS installed.

I have done two runs with Mellanox and PBS installed. One run includes the actual run options I will be using, while the other includes a truncated set which still compiles but fails to execute correctly. As the run with the actual options results in a smaller config log, I am including it here.

Version: 4.0.2

The config log is available at https://gist.github.com/BTemp1282020/fedca1aeed3b57296b8f21688ccae31c and the ompi dump is available at https://pastebin.com/md3HwTUR.

The IB network information (which is not being explicitly operated across):

Packages: MLNX_OFED and Mellanox HPC-X, both current versions (MLNX_OFED_LINUX-4.7-3.2.9.0-rhel8.1-x86_64 and hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat8.1-x86_64)

ulimit -l = unlimited

ibv_devinfo:
    hca_id: mlx4_0
        transport:          InfiniBand (0)
        fw_ver:             2.42.5000
        …
        vendor_id:          0x02c9
        vendor_part_id:     4099
        hw_ver:             0x1
        board_id:           MT_1100120019
        phys_port_cnt:      1
        Device ports:
            port: 1
                state:          PORT_ACTIVE (4)
                max_mtu:        4096 (5)
                active_mtu:     4096 (5)
                sm_lid:         1
                port_lid:       12
                port_lmc:       0x00
                link_layer:     InfiniBand

It looks like the rest of the IB information is in the config file.

I hope this helps,
Collin

From: Jeff Squyres (jsquyres) <jsquy...@cisco.com>
Sent: Monday, January 27, 2020 3:40 PM
To: Open MPI User's List <users@lists.open-mpi.org>
Cc: Collin Strassburger <cstrassbur...@bihrle.com>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

Can you please send all the information listed here:

https://www.open-mpi.org/community/help/

Thanks!

On Jan 27, 2020, at 12:00 PM, Collin Strassburger via users <users@lists.open-mpi.org> wrote:

Hello,

I had initially thought the same thing about the streams, but I have 2 sockets with 64 cores each. Additionally, I have not yet turned multithreading off, so lscpu reports a total of 256 logical cores and 128 physical cores. As such, I don't see how it could be running out of streams unless something is being passed incorrectly.

Collin
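[For what it's worth: if the "Error code: 63" that mpirun prints is a plain Linux errno, 63 maps to ENOSR, "Out of streams resources", which is presumably where the "running out of streams" reading below comes from. That mapping is an assumption, since mpirun does not say which error namespace the code belongs to. A minimal sketch for checking it on the failing node:

    # errno(1) ships with the moreutils package on many distributions:
    errno 63
    # or read the constant straight from the kernel headers:
    grep ENOSR /usr/include/asm-generic/errno.h

ENOSR refers to kernel STREAMS-style resources rather than CPU cores, so by itself it would not imply oversubscription.]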
From: users <users-boun...@lists.open-mpi.org> On Behalf Of Ray Sheppard via users
Sent: Monday, January 27, 2020 11:53 AM
To: users@lists.open-mpi.org
Cc: Ray Sheppard <rshep...@iu.edu>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

Hi All,

Just my two cents: I think error code 63 is saying it is running out of streams to use. I think you have only 64 cores, so at 100 you are overloading most of them. It feels like you are running out of resources trying to swap ranks in and out on physical cores.

Ray

On 1/27/2020 11:29 AM, Collin Strassburger via users wrote:

Hello Howard,

To remove potential interactions, I have found that the issue persists without ucx and hcoll support.

Run command: mpirun -np 128 bin/xhpcg

Output:
--------------------------------------------------------------------------
mpirun was unable to start the specified application as it encountered an
error:

Error code: 63
Error name: (null)
Node: Gen2Node4

when attempting to start process rank 0.
--------------------------------------------------------------------------
128 total processes failed to start

It returns this error for any run I initialize with >100 processes per node. I get the same error message for multiple different codes, so the error is MPI-related rather than program-specific.

Collin

From: Howard Pritchard <hpprit...@gmail.com>
Sent: Monday, January 27, 2020 11:20 AM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Collin Strassburger <cstrassbur...@bihrle.com>
Subject: Re: [OMPI users] OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

Hello Collin,

Could you provide more information about the error? Is there any output from either Open MPI or, maybe, UCX, that could provide more information about the problem you are hitting?

Howard

On Mon, Jan 27, 2020 at 08:38, Collin Strassburger via users <users@lists.open-mpi.org> wrote:

Hello,

I am having difficulty with Open MPI versions 4.0.2 and 3.1.5. Both of these versions cause the same error (error code 63) when utilizing more than 100 cores on a single node. The processors I am utilizing are AMD Epyc "Rome" 7742s. The OS is CentOS 8.1. I have tried compiling with both the default gcc 8 and a locally compiled gcc 9. I have already tried modifying the maximum name field values with no success.

My compile options are:

    ./configure
        --prefix=${HPCX_HOME}/ompi
        --with-platform=contrib/platform/mellanox/optimized

Any assistance would be appreciated,
Collin

Collin Strassburger
Bihrle Applied Research Inc.

--
Jeff Squyres
jsquy...@cisco.com
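[Given that the failures only appear above roughly 100 local ranks, one further thing worth ruling out is a per-user or per-node resource ceiling that scales with rank count. A minimal sketch, assuming a bash shell on the compute node; which limit, if any, is actually responsible is an open question here, not a conclusion:

    # Per-user process and file-descriptor limits, plus the system-wide pid cap:
    ulimit -u
    ulimit -n
    cat /proc/sys/kernel/pid_max

    # Quick scaling check with a trivial program, stepping across the ~100-rank
    # boundary where the failures start:
    for n in 64 96 100 104 128; do
        echo "np=$n: $(mpirun -np $n hostname | wc -l) ranks started"
    done]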