Hi Cabral, and thank you.

I started the hpcc benchmark using -x PSM_MEMORY=large and got no errors. I haven't let the test run to completion yet, but I waited about 10 minutes and this time there were no errors. I even increased the Ns value in hpccinf.txt and restarted the test without any problem.
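For reference, the launch command was essentially the one from my first message, only with the variable added:

    # same hostfile and hpcc binary as in my first message
    mpirun -np 512 -x PSM_MEMORY=large --mca mtl psm --hostfile hosts32 /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt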
The cluster is composed of:
- one management node
- 32 compute nodes, each with 16 cores (2 sockets x 8 cores), 32 GB of RAM, and an Intel QLE7340 single-port InfiniBand 40 Gb/s card

I used this site to generate the input file for hpcc: http://www.advancedclustering.com/act-kb/tune-hpl-dat-file/ with some modifications:

    1            # of problems sizes (N)
    331520       Ns
    1            # of NBs
    128          NBs
    0            PMAP process mapping (0=Row-,1=Column-major)
    1            # of process grids (P x Q)
    16           Ps
    32           Qs

The Ns here consumes most of the total memory of the cluster (331520^2 x 8 bytes ~ 879 GB against 32 x 32 GB = 1024 GB of RAM). The total number of processes is 512; each node starts 16 processes, one per core.

Before modifying the PSM_MEMORY value, the test exited with the mentioned error, even with lower values of Ns. I find it strange that there is no mention of this variable anywhere on the net, not even in the Intel True Scale OFED+ documentation!

Thanks again.

2017-02-01 22:12 GMT+01:00 Cabral, Matias A <matias.a.cab...@intel.com>:

> Hi Wodel,
>
> As you already figured out, mpirun -x <ENV_VAR=value> ... is the right way
> to do it, so the psm library will read the values when initializing on
> every node.
>
> The default value for PSM_MEMORY is "normal" and you may change it to
> "large". If you want to look inside the code, it is at
> https://github.com/01org/psm . One useful variable to play with is
> PSM_TRACEMASK (only set it on the head node) to see what values are being
> used. I think 0xffff will dump lots of info.
>
> As I mentioned below, playing with the size of the MQ is tricky since it
> will be using system memory. I think this will be a combination of a) the
> total number of ranks and ranks per node, b) memory on the hosts, and c)
> the HPCC parameters. The bigger the number of ranks, the more ranks may be
> transmitting simultaneously to a single node (I would assume a reduction):
> a node could be posting receives at a faster rate than it is completing
> them, so it will need a bigger MQ, and therefore more memory. Would you
> share the number of ranks per node, nodes, and memory per node to get an
> idea? A quick test could be to start with a very small number of ranks to
> see if it runs.
>
> Thanks,
> Regards,
>
> _MAC
>
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of wodel youchi
> Sent: Wednesday, February 01, 2017 3:36 AM
> To: Open MPI Users <users@lists.open-mpi.org>
> Subject: Re: [OMPI users] Error using hpcc benchmark
>
> Hi,
>
> Thank you for your replies, but :-) it didn't work for me.
>
> Using hpcc compiled with OpenMPI 2.0.1:
>
> I tried to use export PSM_MQ_RECVREQS_MAX=10000000 as mentioned by Howard,
> but the job did not take the export into account (I am starting the job
> from the home directory of a user; the home directory is shared over NFS
> with all compute nodes).
>
> I tried to use .bash_profile to export the variable, but the job didn't
> take it into account either; I got the same error:
>
> Exhausted 1048576 MQ irecv request descriptors, which usually indicates a
> user program error or insufficient request descriptors
> (PSM_MQ_RECVREQS_MAX=1048576)
>
> And as I mentioned before, each time on different node(s).
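> (I suppose a quick way to check whether the exported variable actually
> reaches the remote ranks would be something like:
>
>     mpirun -np 2 --hostfile hosts32 env | grep PSM_MQ_RECVREQS_MAX   # untested sketch
>
> which just prints the environment seen by the launched processes; if the
> variable does not show up there, the export is not propagating.)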
> From the help of the mpirun command, I read that to pass an environment
> variable we have to use -x with the command, i.e.:
>
> mpirun -np 512 -x PSM_MQ_RECVREQS_MAX=10000000 --mca mtl psm --hostfile
> hosts32 /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt
>
> But when tested, I get this error:
>
> PSM was unable to open an endpoint. Please make sure that the network
> link is active on the node and the hardware is functioning.
> Error: Ran out of memory
>
> I tested with lower values; the only one that worked for me is 2097152,
> which is 2 times the default value of PSM_MQ_RECVREQS_MAX, but even with
> this value I get the same error (with the new value), and then it exits:
>
> Exhausted 2097152 MQ irecv request descriptors, which usually indicates a
> user program error or insufficient request descriptors
> (PSM_MQ_RECVREQS_MAX=2097152)
>
> PS: for Cabral, I didn't find any way to know the default value of
> PSM_MEMORY in order to modify it.
>
> Any idea? Could this be a problem with the InfiniBand configuration?
>
> Does the MTU have anything to do with this problem?
>
> ibv_devinfo
> hca_id: qib0
>         transport:        InfiniBand (0)
>         fw_ver:           0.0.0
>         node_guid:        0011:7500:0070:59a6
>         sys_image_guid:   0011:7500:0070:59a6
>         vendor_id:        0x1175
>         vendor_part_id:   29474
>         hw_ver:           0x2
>         board_id:         InfiniPath_QLE7340
>         phys_port_cnt:    1
>                 port:   1
>                         state:         PORT_ACTIVE (4)
>                         max_mtu:       4096 (5)
>                         active_mtu:    2048 (4)
>                         sm_lid:        1
>                         port_lid:      1
>                         port_lmc:      0x00
>                         link_layer:    InfiniBand
>
> Regards.
>
> 2017-01-31 17:55 GMT+01:00 Cabral, Matias A <matias.a.cab...@intel.com>:
>
> Hi Wodel,
>
> As Howard mentioned, this is probably because many ranks are sending to a
> single one and exhausting the receive requests MQ. You can individually
> enlarge the receive/send request queues with the specific variables
> (PSM_MQ_RECVREQS_MAX / PSM_MQ_SENDREQS_MAX) or increase both with
> PSM_MEMORY=max. Note that the psm library will allocate more system memory
> for the queues.
>
> Thanks,
>
> _MAC
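> (Just as an illustration, with example values only, both variables can be
> passed on the mpirun command line the same way as any other environment
> variable:
>
>     mpirun -x PSM_MQ_RECVREQS_MAX=2097152 -x PSM_MQ_SENDREQS_MAX=2097152 ...   # example values, rest of the usual command line follows
> )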
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Howard Pritchard
> Sent: Tuesday, January 31, 2017 6:38 AM
> To: Open MPI Users <users@lists.open-mpi.org>
> Subject: Re: [OMPI users] Error using hpcc benchmark
>
> Hi Wodel,
>
> The RandomAccess part of HPCC is probably causing this.
>
> Perhaps set a PSM env. variable:
>
> export PSM_MQ_RECVREQS_MAX=10000000
>
> or something like that.
>
> Alternatively, launch the job using
>
> mpirun --mca pml ob1 --host ....
>
> to avoid using psm. Performance will probably suffer with this option,
> however.
>
> Howard
>
> wodel youchi <wodel.you...@gmail.com> schrieb am Di. 31. Jan. 2017 um 08:27:
>
> Hi,
>
> I am a newbie in the HPC world.
>
> I am trying to execute the hpcc benchmark on our cluster, but every time I
> start the job, I get this error, then the job exits:
>
> compute017.22840 Exhausted 1048576 MQ irecv request descriptors, which
> usually indicates a user program error or insufficient request descriptors
> (PSM_MQ_RECVREQS_MAX=1048576)
> compute024.22840 Exhausted 1048576 MQ irecv request descriptors, which
> usually indicates a user program error or insufficient request descriptors
> (PSM_MQ_RECVREQS_MAX=1048576)
> compute019.22847 Exhausted 1048576 MQ irecv request descriptors, which
> usually indicates a user program error or insufficient request descriptors
> (PSM_MQ_RECVREQS_MAX=1048576)
> -------------------------------------------------------
> Primary job terminated normally, but 1 process returned a non-zero exit
> code. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status,
> thus causing the job to be terminated. The first process to do so was:
>   Process name: [[19601,1],272]
>   Exit code:    255
> --------------------------------------------------------------------------
>
> Platform: IBM PHPC
> OS: RHEL 6.5
> one management node
> 32 compute nodes: 16 cores, 32 GB RAM, Intel QLogic QLE7340 one-port QDR
> InfiniBand 40 Gb/s
>
> I compiled hpcc against: IBM MPI, OpenMPI 2.0.1 (compiled with gcc 4.4.7)
> and OpenMPI 1.8.1 (compiled with gcc 4.4.7).
>
> I get the errors, but each time on different compute nodes.
>
> This is the command I used to start the job:
>
> mpirun -np 512 --mca mtl psm --hostfile hosts32
> /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt
>
> Any help will be appreciated, and if you need more details, let me know.
> Thanks in advance.
>
> Regards.
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users