[OMPI users] problem with cancelling Send-Request

2019-10-02 Thread Christian Von Kutzleben via users
Hi,

I’m currently evaluating openmpi (4.0.1) for use in our application.

We are using a construct like this for some cleanup functionality, to cancel 
some Send requests:

if (*req != MPI_REQUEST_NULL) {
    MPI_Cancel(req);
    MPI_Wait(req, MPI_STATUS_IGNORE);
    assert(*req == MPI_REQUEST_NULL);
}

However, the MPI_Wait hangs indefinitely. I debugged into it and came across
this in pml_ob1_sendreq.c, eventually invoked from MPI_Cancel in my scenario:

static int mca_pml_ob1_send_request_cancel(struct ompi_request_t* request,
                                           int complete)
{
    /* we dont cancel send requests by now */
    return OMPI_SUCCESS;
}

The man page for MPI_Cancel does not mention that cancelling Send requests
does not work, so I’m wondering whether this is a current limitation, or
whether we are not supposed to end up in this specific …_request_cancel
implementation?

Thank you in advance!

Christian
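
As an aside on the API involved: MPI_Cancel only requests cancellation, and
the portable way to learn whether the cancellation actually happened is to
keep the completion status and ask MPI_Test_cancelled, rather than passing
MPI_STATUS_IGNORE. A minimal sketch of that check follows; the request
handling mirrors the snippet above, and note that if the send can never be
matched and the implementation does not cancel sends, the MPI_Wait here still
blocks, which is exactly the hang described:

#include <mpi.h>
#include <stdio.h>

/* Sketch: request cancellation of a pending send and check whether it
 * actually took effect.  MPI_Cancel only requests cancellation;
 * MPI_Test_cancelled on the completion status gives the real answer. */
static void try_cancel_send(MPI_Request *req)
{
    if (*req != MPI_REQUEST_NULL) {
        MPI_Status status;
        int cancelled = 0;

        MPI_Cancel(req);
        /* Returns once the send completes or the cancellation succeeds.
         * With Open MPI's ob1 PML, send cancellation is a no-op, so an
         * unmatched send will still block here. */
        MPI_Wait(req, &status);
        MPI_Test_cancelled(&status, &cancelled);

        if (!cancelled) {
            fprintf(stderr, "send completed normally; cancel had no effect\n");
        }
    }
}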


Re: [OMPI users] problem with cancelling Send-Request

2019-10-02 Thread Emyr James via users
Hi Christian,


I would suggest using mvapich2 instead. It is supposedly faster than OpenMpi
on InfiniBand, and it seems to have fewer options under the hood, which means
fewer things you have to tweak to get it working for you.


Regards,


Emyr James
Head of Scientific IT
CRG - Centre for Genomic Regulation
C/ Dr. Aiguader, 88
Edif. PRBB
08003 Barcelona, Spain
Phone Ext: #1098


From: users  on behalf of Christian Von 
Kutzleben via users 
Sent: 02 October 2019 16:14:24
To: users@lists.open-mpi.org
Cc: Christian Von Kutzleben
Subject: [OMPI users] problem with cancelling Send-Request

Hi,

I’m currently evaluating openmpi (4.0.1) for use in our application.

We are using a construct like this for some cleanup functionality, to cancel 
some Send requests:

if (*req != MPI_REQUEST_NULL) {
    MPI_Cancel(req);
    MPI_Wait(req, MPI_STATUS_IGNORE);
    assert(*req == MPI_REQUEST_NULL);
}

However, the MPI_Wait hangs indefinitely. I debugged into it and came across
this in pml_ob1_sendreq.c, eventually invoked from MPI_Cancel in my scenario:

static int mca_pml_ob1_send_request_cancel(struct ompi_request_t* request,
                                           int complete)
{
    /* we dont cancel send requests by now */
    return OMPI_SUCCESS;
}

The man page for MPI_Cancel does not mention that cancelling Send requests
does not work, so I’m wondering whether this is a current limitation, or
whether we are not supposed to end up in this specific …_request_cancel
implementation?

Thank you in advance!

Christian


[OMPI users] openmpi-4.0.1 build error

2019-10-02 Thread Llolsten Kaonga via users
Hello all,

OS: CentOS 7.7

OFED: MLNX_OFED_LINUX-4.7-1.0.0.1

Running the command "make all install" returns:

In file included from btl_uct_device_context.h:16:0,
                 from btl_uct_component.c:40:
btl_uct_rdma.h: In function 'mca_btl_uct_get_rkey':
btl_uct_rdma.h:58:5: error: too few arguments to function 'uct_rkey_unpack'
     uct_status = uct_rkey_unpack ((void *) remote_handle, rkey);
     ^

.

In file included from btl_uct.h:41:0,
                 from btl_uct_device_context.h:15,
                 from btl_uct_component.c:40:
/usr/include/uct/api/uct.h:1377:14: note: expected 'const struct uct_md_config_t *' but argument is of type 'struct uct_md **'
 ucs_status_t uct_md_open(uct_component_h component, const char *md_name,
btl_uct_component.c:348:5: error: too many arguments to function 'uct_md_open'
    uct_md_open (md_desc->md_name, uct_config, &md->uct_md);
    ^

 

I will be happy to send the whole log if that would be more useful/helpful.

 

I thank you.

--

Llolsten

 



[OMPI users] Spawns no local

2019-10-02 Thread Martín Morales via users
Hello all. I would like to ask for a practical example of how to use
MPI_Info_set(&info, …) so that the “info” passed to MPI_Comm_spawn() spawns no
processes locally (say, on the “master” host) but only on a slave host
(“slave”), without using mpirun (just “./o.out”). I’m using OpenMPI 4.0.1.
Thanks!
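
For what it’s worth, here is a minimal sketch of the kind of call being asked
about, using the MPI-standard “host” info key; the hostname “slave”, the
executable name “./worker” and the process count are placeholders rather than
anything specific to the setup above:

#include <mpi.h>

/* Sketch: spawn two workers on a remote host only, using the "host" info
 * key.  "slave" and "./worker" are placeholders; error handling omitted. */
int main(int argc, char *argv[])
{
    MPI_Comm intercomm;
    MPI_Info info;

    MPI_Init(&argc, &argv);

    MPI_Info_create(&info);
    /* "host" is the standard-reserved info key naming where the spawned
     * processes should run; nothing is placed on the calling host. */
    MPI_Info_set(info, "host", "slave");

    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 2, info, 0 /* root */,
                   MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);

    MPI_Info_free(&info);
    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}

Whether a singleton started as plain “./o.out” (without mpirun) can actually
launch processes on another node is a separate question, since the runtime
still has to be able to start its daemon there; the info key only controls
placement.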


Re: [OMPI users] UCX errors after upgrade

2019-10-02 Thread Raymond Muno via users
We are now using OpenMPI 4.0.2RC2 and RC3, compiled (with Intel, PGI and
GCC) against MLNX_OFED 4.7 (released a couple of days ago), which supplies
UCX 1.7. So far, it seems like things are working well.


Any estimate on when OpenMPI 4.0.2 will be released?


On 9/25/19 2:27 PM, Jeff Squyres (jsquyres) wrote:
Thanks Raymond; I have filed an issue for this on Github and tagged 
the relevant Mellanox people:


https://github.com/open-mpi/ompi/issues/7009


On Sep 25, 2019, at 3:09 PM, Raymond Muno via users
<users@lists.open-mpi.org> wrote:


We are running against 4.0.2RC2 now. This is using the current Intel
compilers, version 2019 update 4. Still having issues.


[epyc-compute-1-3.local:17402] common_ucx.c:149  Warning: UCX is 
unable to handle VM_UNMAP event. This may cause performance 
degradation or data corruption.
[epyc-compute-1-3.local:17669] common_ucx.c:149  Warning: UCX is 
unable to handle VM_UNMAP event. This may cause performance 
degradation or data corruption.
[epyc-compute-1-3.local:17683] common_ucx.c:149  Warning: UCX is 
unable to handle VM_UNMAP event. This may cause performance 
degradation or data corruption.
[epyc-compute-1-3.local:16626] pml_ucx.c:385  Error: 
ucp_ep_create(proc=265) failed: Destination is unreachable
[epyc-compute-1-3.local:16626] pml_ucx.c:452  Error: Failed to 
resolve UCX endpoint for rank 265

[epyc-compute-1-3:16626] *** An error occurred in MPI_Allreduce
[epyc-compute-1-3:16626] *** reported by process 
[47001162088449,46999827120425]

[epyc-compute-1-3:16626] *** on communicator MPI_COMM_WORLD
[epyc-compute-1-3:16626] *** MPI_ERR_OTHER: known error not in list
[epyc-compute-1-3:16626] *** MPI_ERRORS_ARE_FATAL (processes in this 
communicator will now abort,

[epyc-compute-1-3:16626] ***    and potentially your MPI job)


On 9/25/19 1:28 PM, Jeff Squyres (jsquyres) via users wrote:
Can you try the latest 4.0.2rc tarball?  We're very, very close to 
releasing v4.0.2...


I don't know if there's a specific UCX fix in there, but there are a 
ton of other good bug fixes in there since v4.0.1.



On Sep 25, 2019, at 2:12 PM, Raymond Muno via users
<users@lists.open-mpi.org> wrote:


We are primarily using OpenMPI 3.1.4 but also have 4.0.1 installed.

On our cluster, we were running CentOS 7.5 with updates, alongside 
MLNX_OFED 4.5.x.   OpenMPI was compiled with GCC, Intel, PGI and 
AOCC compilers. We could run with no issues.


To accommodate updates needed to get our IB gear all running at HDR100
(EDR50 previously), we upgraded to CentOS 7.6.1810 and the current
MLNX_OFED 4.6.x.


We can no longer reliably run on more than two nodes.

We see errors like:

[epyc-compute-3-2.local:42447] pml_ucx.c:380  Error: 
ucp_ep_create(proc=276) failed: Destination is unreachable
[epyc-compute-3-2.local:42447] pml_ucx.c:447  Error: Failed to 
resolve UCX endpoint for rank 276

[epyc-compute-3-2:42447] *** An error occurred in MPI_Allreduce
[epyc-compute-3-2:42447] *** reported by process 
[47894553493505,47893180318004]

[epyc-compute-3-2:42447] *** on communicator MPI_COMM_WORLD
[epyc-compute-3-2:42447] *** MPI_ERR_OTHER: known error not in list
[epyc-compute-3-2:42447] *** MPI_ERRORS_ARE_FATAL (processes in 
this communicator will now abort,

[epyc-compute-3-2:42447] ***    and potentially your MPI job)
[epyc-compute-3-17.local:36637] PMIX ERROR: UNREACHABLE in file 
server/pmix_server.c at line 2079
[epyc-compute-3-17.local:37008] pml_ucx.c:380  Error: 
ucp_ep_create(proc=147) failed: Destination is unreachable
[epyc-compute-3-17.local:37008] pml_ucx.c:447  Error: Failed to 
resolve UCX endpoint for rank 147
[epyc-compute-3-7.local:39776] 1 more process has sent help message 
help-mpi-errors.txt / mpi_errors_are_fatal
[epyc-compute-3-7.local:39776] Set MCA parameter 
"orte_base_help_aggregate" to 0 to see all help / error messages


UCX appears to be part of the MLNX_OFED release, and is version 1.6.0.

OpenMPI is built on the same OS and MLNX_OFED as we are running on the
compute nodes.


I have a case open with Mellanox but it is not clear where this 
error is coming from.

--




--
Jeff Squyres
jsquy...@cisco.com 


--
  
  Ray Muno

  IT Manager
  University of Minnesota
  Aerospace Engineering and Mechanics Mechanical Engineering
  



--
Jeff Squyres
jsquy...@cisco.com 


--
 
 Ray Muno

 IT Manager
 University of Minnesota
 Aerospace Engineering and Mechanics Mechanical Engineering
 



Re: [OMPI users] problem with cancelling Send-Request

2019-10-02 Thread Jeff Hammond via users
Don’t try to cancel sends.

https://github.com/mpi-forum/mpi-issues/issues/27 has some useful info.

Jeff
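
Put differently, a cleanup routine along these lines simply completes
outstanding send requests instead of trying to cancel them. This is only a
sketch of the pattern, not anything Open MPI-specific: reqs and nreqs stand in
for the application’s own request bookkeeping, and it assumes the peers will
eventually post the matching receives (otherwise the waits can block, just
like the MPI_Wait after the no-op cancel above):

#include <mpi.h>
#include <assert.h>

/* Cleanup sketch: instead of MPI_Cancel on send requests, drain them.
 * reqs/nreqs are placeholders for the caller's request bookkeeping. */
static void drain_send_requests(MPI_Request *reqs, int nreqs)
{
    for (int i = 0; i < nreqs; ++i) {
        if (reqs[i] != MPI_REQUEST_NULL) {
            /* Blocks until the send completes; MPI_Wait then frees the
             * request and resets it to MPI_REQUEST_NULL. */
            MPI_Wait(&reqs[i], MPI_STATUS_IGNORE);
            assert(reqs[i] == MPI_REQUEST_NULL);
        }
    }
}

If the receiving side may already be gone by the time cleanup runs, the usual
alternative is to make shutdown part of the protocol (have receivers drain or
acknowledge pending messages before exiting) rather than relying on
cancellation.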

On Wed, Oct 2, 2019 at 7:17 AM Christian Von Kutzleben via users <
users@lists.open-mpi.org> wrote:

> Hi,
>
> I’m currently evaluating openmpi (4.0.1) for use in our application.
>
> We are using a construct like this for some cleanup functionality, to
> cancel some Send requests:
>
> if (*req != MPI_REQUEST_NULL) {
>     MPI_Cancel(req);
>     MPI_Wait(req, MPI_STATUS_IGNORE);
>     assert(*req == MPI_REQUEST_NULL);
> }
>
> However, the MPI_Wait hangs indefinitely. I debugged into it and came
> across this in pml_ob1_sendreq.c, eventually invoked from MPI_Cancel
> in my scenario:
>
> static int mca_pml_ob1_send_request_cancel(struct ompi_request_t*
> request, int complete)
> {
>     /* we dont cancel send requests by now */
>     return OMPI_SUCCESS;
> }
>
> The man page for MPI_Cancel does not mention that cancelling Send requests
> does not work, so I’m wondering whether this is a current limitation, or
> whether we are not supposed to end up in this specific …_request_cancel
> implementation?
>
> Thank you in advance!
>
> Christian
>
-- 
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/


Re: [OMPI users] problem with cancelling Send-Request

2019-10-02 Thread Jeff Hammond via users
“Supposedly faster” isn’t a particularly good reason to change MPI
implementations, but canceling sends is hard for reasons that have nothing
to do with performance.

Also, I’d not be so eager to question the effectiveness of Open-MPI on
InfiniBand. Check the commit logs for Mellanox employees some time.

Jeff

On Wed, Oct 2, 2019 at 7:46 AM Emyr James via users <
users@lists.open-mpi.org> wrote:

> Hi Christian,
>
>
> I would suggest using mvapich2 instead. It is supposedly faster than
> OpenMpi on InfiniBand, and it seems to have fewer options under the hood,
> which means fewer things you have to tweak to get it working for you.
>
>
> Regards,
>
>
> Emyr James
> Head of Scientific IT
> CRG - Centre for Genomic Regulation
> C/ Dr. Aiguader, 88
> 
> Edif. PRBB
> 08003 Barcelona, Spain
> Phone Ext: #1098
>
> --
> *From:* users  on behalf of Christian
> Von Kutzleben via users 
> *Sent:* 02 October 2019 16:14:24
> *To:* users@lists.open-mpi.org
> *Cc:* Christian Von Kutzleben
> *Subject:* [OMPI users] problem with cancelling Send-Request
>
> Hi,
>
> I’m currently evaluating openmpi (4.0.1) for use in our application.
>
> We are using a construct like this for some cleanup functionality, to
> cancel some Send requests:
>
> if (*req != MPI_REQUEST_NULL) {
>     MPI_Cancel(req);
>     MPI_Wait(req, MPI_STATUS_IGNORE);
>     assert(*req == MPI_REQUEST_NULL);
> }
>
> However, the MPI_Wait hangs indefinitely. I debugged into it and came
> across this in pml_ob1_sendreq.c, eventually invoked from MPI_Cancel
> in my scenario:
>
> static int mca_pml_ob1_send_request_cancel(struct ompi_request_t*
> request, int complete)
> {
>     /* we dont cancel send requests by now */
>     return OMPI_SUCCESS;
> }
>
> The man page for MPI_Cancel does not mention that cancelling Send requests
> does not work, so I’m wondering whether this is a current limitation, or
> whether we are not supposed to end up in this specific …_request_cancel
> implementation?
>
> Thank you in advance!
>
> Christian
>
-- 
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/