Hi Cabral, and thank you.

I started the hpcc benchmark using -x PSM_MEMORY=large, without any error. The
test hasn't finished yet, but I waited about 10 minutes and this time there
were no errors. I even increased the Ns value in hpccinf.txt and started the
test again without any problem.
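
For reference, the command I am running now looks roughly like this (the same
hostfile and binary as before, with the new variable added):

mpirun -np 512 -x PSM_MEMORY=large --mca mtl psm --hostfile hosts32
/shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt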

The cluster is composed of:
- one management node
- 32 compute nodes, each with 16 cores (2 sockets x 8 cores), 32 GB of RAM,
  and an Intel QLE7340 single-port 40 Gb/s InfiniBand card

I used this site to generate the input file for hpcc:
http://www.advancedclustering.com/act-kb/tune-hpl-dat-file/
with some modifications:

1            # of problems sizes (N)
331520         Ns
1            # of NBs
128           NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
16            Ps
32            Qs

The Ns value here represents almost 90% of the total memory of the cluster.
The total number of processes is 512; each node will start 16 processes, one
per core.
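
As a quick sanity check on Ns (my own back-of-envelope arithmetic, not taken
from the generator): HPL stores an N x N matrix of doubles, so it uses about
8*N^2 bytes of memory. For 32 nodes x 32 GB (treating GB as GiB) at 90%, the
ceiling is roughly N = 351,700, so 331520 stays below it:

python -c "mem = 32 * 32 * 2**30; print(int((0.90 * mem / 8) ** 0.5))"  # roughly 351,700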

Before setting PSM_MEMORY, the test exited with the error I mentioned, even
with lower values of Ns.

I find it strange that there is no mention of this variable anywhere on the
net, not even in the Intel True Scale OFED+ documentation!

Thanks again.




2017-02-01 22:12 GMT+01:00 Cabral, Matias A <matias.a.cab...@intel.com>:

> Hi Wodel,
>
>
>
> As you already figured out, mpirun -x <ENV_VAR=value> … is the right way
> to do it so the psm library will read the values when initializing on every
> node.
>
> The default value for "PSM_MEMORY" is “normal” and you may change it to
> “large”. If you want to look inside the code, it is on
> https://github.com/01org/psm . One useful variable to play with is
> PSM_TRACEMASK (only set it on the head node) to see what values are being
> used. I think 0xffff will dump lots of info.
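>
> For example (just a sketch, untested), something like:
>
>   export PSM_TRACEMASK=0xffff
>   mpirun -np 512 -x PSM_MEMORY=large --mca mtl psm --hostfile hosts32 \
>     /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt
>
> run from the head node should show which values PSM ends up using there.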
>
> As I mentioned below, playing with the size of the MQ is tricky since it
> will use system memory. I think this will be a combination of a) the total
> number of ranks and ranks per node, b) the memory on the hosts, and c) the
> HPCC parameters. The larger the number of ranks, the more ranks may be
> transmitting simultaneously to a single node (I would assume during a
> reduction), and a node could be posting receives at a faster rate than it
> completes them, so it will need a bigger MQ and therefore more memory.
> Would you share the number of ranks per node, the number of nodes, and the
> memory per node to give me an idea? A quick test could be to start with a
> very small number of ranks to see if it runs.
>
>
>
> Thanks,
>
> Regards,
>
>
>
> _MAC
>
>
>
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of wodel
> youchi
> Sent: Wednesday, February 01, 2017 3:36 AM
>
> To: Open MPI Users <users@lists.open-mpi.org>
> Subject: Re: [OMPI users] Error using hpcc benchmark
>
>
>
> Hi,
>
> Thank you for your replies, but :-) it didn't work for me.
>
> Using hpcc compiled with OpenMPI 2.0.1:
>
> I tried to use export PSM_MQ_RECVREQS_MAX=10000000 as mentioned by
> Howard, but the job did not take the export into account (I am starting the
> job from a user's home directory, which is shared via NFS with all compute
> nodes).
>
> I tried using .bash_profile to export the variable, but the job did not
> take it into account either; I got the same error:
> Exhausted 1048576 MQ irecv request descriptors, which usually indicates a
> user program error or insufficient request descriptors
> (PSM_MQ_RECVREQS_MAX=1048576)
>
> And as I mentioned before, it happens each time on a different node (or nodes).
>
>
> From the help of the mpirun command, I read that to pass an environment
> variable we have to use -x with the command, i.e.:
> mpirun -np 512 -x PSM_MQ_RECVREQS_MAX=10000000 --mca mtl psm --hostfile
> hosts32 /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt
>
> But when I tested this, I got the following error:
>
> PSM was unable to open an endpoint. Please make sure that the network
> link is active on the node and the hardware is functioning. Error: Ran out
> of memory
>
> I tested with lower values; the only one that worked for me is 2097152,
> which is twice the default value of PSM_MQ_RECVREQS_MAX, but even with this
> value I get the same error (now reporting the new value), and the job exits:
>
> Exhausted 2097152 MQ irecv request descriptors, which usually indicates a
> user program error or insufficient request descriptors
> (PSM_MQ_RECVREQS_MAX=2097152)
>
>
> PS: for Cabral, I did not find any way to know the default value of
> PSM_MEMORY in order to modify it.
>
> Any idea? Could this be a problem with the InfiniBand configuration?
>
>
>
> Does the MTU have anything to do with this problem?
>
> ibv_devinfo
> hca_id: qib0
>         transport:                      InfiniBand (0)
>         fw_ver:                         0.0.0
>         node_guid:                      0011:7500:0070:59a6
>         sys_image_guid:                 0011:7500:0070:59a6
>         vendor_id:                      0x1175
>         vendor_part_id:                 29474
>         hw_ver:                         0x2
>         board_id:                       InfiniPath_QLE7340
>         phys_port_cnt:                  1
>                 port:   1
>                         state:                  PORT_ACTIVE (4)
>
>                         max_mtu:                4096 (5)
>                         active_mtu:             2048 (4)
>                         sm_lid:                 1
>                         port_lid:               1
>                         port_lmc:               0x00
>                         link_layer:             InfiniBand
>
>
>
>
>
>
>
> Regards.
>
>
>
> 2017-01-31 17:55 GMT+01:00 Cabral, Matias A <matias.a.cab...@intel.com>:
>
> Hi Wodel,
>
>
>
> As Howard mentioned, this is probably because many ranks are sending to a
> single one and exhausting the receive requests MQ. You can individually
> enlarge the receive/send request queues with the specific variables
> (PSM_MQ_RECVREQS_MAX / PSM_MQ_SENDREQS_MAX) or increase both with
> PSM_MEMORY=max. Note that the psm library will allocate more system memory
> for the queues.
>
>
>
> Thanks,
>
>
>
> _MAC
>
>
>
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Howard
> Pritchard
> Sent: Tuesday, January 31, 2017 6:38 AM
> To: Open MPI Users <users@lists.open-mpi.org>
> Subject: Re: [OMPI users] Error using hpcc benchmark
>
>
>
> Hi Wodel
>
>
>
> The RandomAccess part of HPCC is probably causing this.
>
>
>
> Perhaps set the PSM env. variable:
>
> export PSM_MQ_RECVREQS_MAX=10000000
>
> or something like that.
>
>
>
> Alternatively launch the job using
>
>
>
> mpirun --mca pml ob1 --host ....
>
>
>
> to avoid use of psm.  Performance will probably suffer with this option
> however.
>
>
>
> Howard
>
> wodel youchi <wodel.you...@gmail.com> schrieb am Di. 31. Jan. 2017 um
> 08:27:
>
> Hi,
>
> I am a newbie in the HPC world.
>
> I am trying to execute the hpcc benchmark on our cluster, but every time I
> start the job I get this error and then the job exits:
>
> compute017.22840 Exhausted 1048576 MQ irecv request descriptors, which
> usually indicates a user program error or insufficient request descriptors
> (PSM_MQ_RECVREQS_MAX=1048576)
> compute024.22840 Exhausted 1048576 MQ irecv request descriptors, which
> usually indicates a user program error or insufficient request descriptors
> (PSM_MQ_RECVREQS_MAX=1048576)
> compute019.22847 Exhausted 1048576 MQ irecv request descriptors, which
> usually indicates a user program error or insufficient request descriptors
> (PSM_MQ_RECVREQS_MAX=1048576)
> -------------------------------------------------------
> Primary job terminated normally, but 1 process returned a non-zero exit
> code. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status,
> thus causing the job to be terminated. The first process to do so was:
>   Process name: [[19601,1],272]
>   Exit code:    255
> --------------------------------------------------------------------------
>
> Platform: IBM PHPC
>
> OS: RHEL 6.5
>
> one management node
>
> 32 compute nodes: 16 cores, 32 GB RAM, Intel QLogic QLE7340 single-port QDR
> InfiniBand 40 Gb/s
>
> I compiled hpcc against IBM MPI, Open MPI 2.0.1 (compiled with gcc 4.4.7),
> and Open MPI 1.8.1 (compiled with gcc 4.4.7).
>
> I get the errors, but each time on different compute nodes.
>
> This is the command I used to start the job:
> mpirun -np 512 --mca mtl psm --hostfile hosts32
> /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt
>
>
>
> Any help will be appreciated, and if you need more details, let me know.
>
> Thanks in advance.
>
>
>
> Regards.
>
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
