Hi Wodel,

As you already figured out, mpirun -x <ENV_VAR=value> ... is the right way to do it, so the psm library will read the values when initializing on every node. The default value for PSM_MEMORY is "normal" and you may change it to "large". If you want to look inside the code, it is at https://github.com/01org/psm .

One useful variable to play with is PSM_TRACEMASK (only set it on the head node) to see what values are being used. I think 0xffff will dump lots of info.

As I mentioned below, playing with the size of the MQ is tricky since it will be using system memory. I think this will be a combination of a) the total number of ranks and ranks per node, b) memory on the hosts, and c) HPCC parameters. The bigger the number of ranks, the more ranks may be transmitting simultaneously to a single node (I would assume a reduction), so a node could be posting receives at a faster rate than it is completing them; it will then need a bigger MQ, and therefore more memory. Would you share the number of ranks per node, the number of nodes, and the memory per node, so we can get an idea? A quick test could be to start with a very small number of ranks to see if it runs.
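Putting that together, a launch along these lines might be a starting point. The rank count, hostfile name, and binary path are taken from the command further down the thread; the PSM values here are examples to experiment with, not recommendations:

```shell
# Enable verbose PSM tracing on the head node only (per the note above):
export PSM_TRACEMASK=0xffff

# Propagate the PSM tuning variables to every rank with -x:
mpirun -np 512 \
       -x PSM_MEMORY=large \
       -x PSM_MQ_RECVREQS_MAX=2097152 \
       --mca mtl psm --hostfile hosts32 \
       /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt
```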
Thanks,
Regards,
_MAC

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of wodel youchi
Sent: Wednesday, February 01, 2017 3:36 AM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] Error using hpcc benchmark

Hi,

Thank you for your replies, but :-) it didn't work for me.

Using hpcc compiled with OpenMPI 2.0.1: I tried to use export PSM_MQ_RECVREQS_MAX=10000000 as mentioned by Howard, but the job didn't take the export into account (I am starting the job from the home directory of a user; the home directory is shared over NFS with all compute nodes). I tried to use .bash_profile to export the variable, but the job didn't take it into account either; I got the same error:

Exhausted 1048576 MQ irecv request descriptors, which usually indicates a user program error or insufficient request descriptors (PSM_MQ_RECVREQS_MAX=1048576)

And as I mentioned before, each time on different node(s).

From the help of the mpirun command, I read that to pass an environment variable we have to use -x with the command, i.e.:

mpirun -np 512 -x PSM_MQ_RECVREQS_MAX=10000000 --mca mtl psm --hostfile hosts32 /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt

But when tested, I get these errors:

PSM was unable to open an endpoint. Please make sure that the network link is active on the node and the hardware is functioning.
Error: Ran out of memory

I tested with lower values; the only one that worked for me is 2097152, which is 2 times the default value of PSM_MQ_RECVREQS_MAX. But even with this value, I get the same error (with the new value), and the job exits:

Exhausted 2097152 MQ irecv request descriptors, which usually indicates a user program error or insufficient request descriptors (PSM_MQ_RECVREQS_MAX=2097152)

PS: for Cabral's suggestion, I didn't find any way to know the default value of PSM_MEMORY to be able to modify it. Any idea?

Could this be a problem in the InfiniBand configuration? Does the MTU have anything to do with this problem?
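A rough back-of-envelope calculation hints at why a very large value can hit "Ran out of memory": the request queue consumes memory per rank, so descriptor count times descriptor size times ranks per node adds up quickly. The 64-byte per-descriptor size below is an assumed figure for illustration only, not taken from the psm sources:

```shell
# Assumed per-descriptor size: 64 bytes (illustrative only; check the
# psm sources for the real request-descriptor struct size).
DESC_BYTES=64
for reqs in 1048576 2097152 10000000; do
  mib=$(( reqs * DESC_BYTES / 1048576 ))
  echo "PSM_MQ_RECVREQS_MAX=$reqs -> ~${mib} MiB per rank"
done
```

Under this assumption, 10,000,000 descriptors is roughly 610 MiB per rank; with 16 ranks per node (as in the original post) that is about 9.5 GiB of a 32 GB node before HPCC's own working set, which could plausibly explain the failure.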
ibv_devinfo:

hca_id: qib0
        transport:      InfiniBand (0)
        fw_ver:         0.0.0
        node_guid:      0011:7500:0070:59a6
        sys_image_guid: 0011:7500:0070:59a6
        vendor_id:      0x1175
        vendor_part_id: 29474
        hw_ver:         0x2
        board_id:       InfiniPath_QLE7340
        phys_port_cnt:  1
        port:   1
                state:          PORT_ACTIVE (4)
                max_mtu:        4096 (5)
                active_mtu:     2048 (4)
                sm_lid:         1
                port_lid:       1
                port_lmc:       0x00
                link_layer:     InfiniBand

Regards.

2017-01-31 17:55 GMT+01:00 Cabral, Matias A <matias.a.cab...@intel.com>:

Hi Wodel,

As Howard mentioned, this is probably because many ranks are sending to a single one and exhausting the receive request MQ. You can individually enlarge the receive/send request queues with the specific variables (PSM_MQ_RECVREQS_MAX / PSM_MQ_SENDREQS_MAX) or increase both with PSM_MEMORY=max. Note that the psm library will allocate more system memory for the queues.

Thanks,
_MAC

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Howard Pritchard
Sent: Tuesday, January 31, 2017 6:38 AM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] Error using hpcc benchmark

Hi Wodel,

The RandomAccess part of HPCC is probably causing this. Perhaps set the PSM env. variable:

export PSM_MQ_RECVREQS_MAX=10000000

or something like that. Alternatively, launch the job using

mpirun --mca pml ob1 --host ...

to avoid use of psm. Performance will probably suffer with this option, however.

Howard

wodel youchi <wodel.you...@gmail.com> wrote on Tue, 31 Jan 2017 at 08:27:

Hi,

I am a newbie in the HPC world. I am trying to execute the hpcc benchmark on our cluster, but every time I start the job, I get this error, then the job exits:

compute017.22840 Exhausted 1048576 MQ irecv request descriptors, which usually indicates a user program error or insufficient request descriptors (PSM_MQ_RECVREQS_MAX=1048576)
compute024.22840 Exhausted 1048576 MQ irecv request descriptors, which usually indicates a user program error or insufficient request descriptors (PSM_MQ_RECVREQS_MAX=1048576)
compute019.22847 Exhausted 1048576 MQ irecv request descriptors, which usually indicates a user program error or insufficient request descriptors (PSM_MQ_RECVREQS_MAX=1048576)
-------------------------------------------------------
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

  Process name: [[19601,1],272]
  Exit code:    255
--------------------------------------------------------------------------

Platform: IBM PHPC
OS: RHEL 6.5
One management node; 32 compute nodes: 16 cores, 32 GB RAM, Intel/QLogic QLE7340 one-port QDR InfiniBand 40 Gb/s.

I compiled hpcc against: IBM MPI, OpenMPI 2.0.1 (compiled with gcc 4.4.7), and OpenMPI 1.8.1 (compiled with gcc 4.4.7). I get the errors, but each time on different compute nodes.

This is the command I used to start the job:

mpirun -np 512 --mca mtl psm --hostfile hosts32 /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt

Any help will be appreciated; if you need more details, let me know.

Thanks in advance. Regards.
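Howard's fallback of bypassing PSM entirely could be tried by keeping the same launch line but selecting the ob1 PML instead of the psm MTL; a sketch, with the rank count, hostfile, and binary path from the command above:

```shell
# Force the ob1 PML so the psm MTL (and its MQ limits) is not used.
# Expect lower bandwidth/latency than the native PSM path.
mpirun -np 512 --mca pml ob1 --hostfile hosts32 \
       /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt
```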
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users