Hi all,

I am trying to compute a large 3D FFT using fftw3_mpi on our cluster.
It runs fine with 1024 ranks on 8 nodes.

However, when I try to run with 2048 ranks on 16 nodes, I get many of the
following TCP errors:
[btl_tcp_endpoint.c:733:mca_btl_tcp_endpoint_start_connect] bind on local 
address (10.0.20.18:0) failed: Address already in use (98) 

We tried widening the range of local IPv4 ports to the following values:
cat /proc/sys/net/ipv4/ip_local_port_range
1024    65000

but it did not solve the problem.
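
For context, here is a rough back-of-envelope estimate of why the port range
might still be too small. It is only a sketch under assumptions I have not
confirmed: ranks are spread evenly (128 per node in both runs), every rank
ends up with one TCP connection to every off-node rank (as an all-to-all
transpose would tend to cause), each connection is initiated from one of its
two ends, and each outgoing connection consumes one local port on the address
it binds to. The function names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers for a back-of-envelope estimate; not part of Open MPI.

# port_count LOW HIGH: number of ports in the inclusive range LOW..HIGH
port_count() {
    echo $(( $2 - $1 + 1 ))
}

# outgoing_per_node RANKS NODES: estimated outgoing TCP connections per node,
# assuming an even rank distribution, one connection per off-node rank pair,
# and each connection initiated from one of its two ends (hence the / 2).
outgoing_per_node() {
    ranks=$1
    nodes=$2
    per_node=$(( ranks / nodes ))
    off_node=$(( ranks - per_node ))
    echo $(( per_node * off_node / 2 ))
}

echo "ports available:       $(port_count 1024 65000)"
echo "1024 ranks,  8 nodes:  $(outgoing_per_node 1024 8)"
echo "2048 ranks, 16 nodes:  $(outgoing_per_node 2048 16)"
```

Under these assumptions the 8-node run would need about 57344 local ports per
node, just under the 63977 available, while the 16-node run would need about
122880. That would be consistent with the bind failures, but I have not
verified that this is what actually happens.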

The MCA parameters btl_tcp_port_min_v4 and btl_tcp_port_range_v4 are set to
1024 and 64511, respectively.
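
One way to read these values back is sketched below. It assumes ompi_info
from the same Open MPI installation is first in PATH. The helper name
filter_port_params is made up for illustration.

```shell
#!/bin/sh
# filter_port_params: hypothetical helper that keeps only the lines
# mentioning the two IPv4 port parameters of the tcp BTL.
filter_port_params() {
    grep -E 'btl_tcp_port_(min|range)_v4'
}

if command -v ompi_info >/dev/null 2>&1; then
    # --level 9 asks ompi_info to list all parameters of the tcp BTL,
    # not just the basic ones.
    ompi_info --param btl tcp --level 9 | filter_port_params
else
    echo "ompi_info not found in PATH" >&2
fi
```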

This is with Open MPI 4.1.0 on CentOS 8.

Any help is greatly appreciated!

Cheers,

        Simon

PS: let me know if more info is needed.
