Hello Adam,
During the InfiniBand Plugfest 34 event last October, we found that mpirun
hangs on FDR systems if you run with the openib btl option.
Yossi Itigin (@Mellanox) suggested that we run using the following options:
--mca btl self,vader --mca pml ucx -x UCX_RC_PATH_MTU=4096
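For context, a full invocation with those options might look roughly like the
following; the process count, hostfile name, and benchmark binary are
placeholders, not taken from the original report:

  mpirun --mca btl self,vader --mca pml ucx -x UCX_RC_PATH_MTU=4096 \
         -np 6 -hostfile ./ib-hosts ./IMB-MPI1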
Hello Adam,
IMB had a bug related to Reduce_scatter.
https://github.com/intel/mpi-benchmarks/pull/11
I'm not sure whether this bug is the cause, but you can try the patch.
https://github.com/intel/mpi-benchmarks/commit/841446d8cf4ca1f607c0f24b9a424ee39ee1f569
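In case it helps, one way to pull that single fix into an existing IMB source
tree is to fetch the commit as a patch from GitHub and apply it; the directory
name below is just an example:

  cd mpi-benchmarks        # your IMB source tree
  curl -LO https://github.com/intel/mpi-benchmarks/commit/841446d8cf4ca1f607c0f24b9a424ee39ee1f569.patch
  git am 841446d8cf4ca1f607c0f24b9a424ee39ee1f569.patch   # or: patch -p1 < the same file, if the tree is not a git checkout

A recent checkout of the repository may already contain the fix.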
Thanks,
Takahiro Kawashima,
Fujitsu
Ryan,
As Edgar explained, that could be a compiler issue (FWIW, I am unable to
reproduce the bug).
You can rebuild Open MPI and pass --disable-builtin-atomics to the
configure command line.
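For example, taking the configure line quoted later in this thread and simply
adding the flag (adjust the prefix and -j value to your environment):

  ../openmpi-3.1.3/configure --prefix=/opt/sw/packages/gcc-4_8/openmpi/3.1.3 \
      --with-pmi --disable-builtin-atomics && \
  make -j32 && make install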
That being said, the "Alarm clock" message looks a bit suspicious.
Does it always occur at 20+
I was not able to reproduce the issue with openib on the 4.0 branch, but
instead I randomly get a segfault in MPI_Finalize during the grdma cleanup.
I could, however, reproduce the TCP timeout part with both 4.0 and master, on
a pretty sane cluster (only 3 interfaces: lo, eth0 and virbr0). Unsurprisingly,
the
Hello Howard,
Thanks for all of the help and suggestions; I will look into them. I also
realized that my Ansible setup wasn't handling tar files properly, so the
nightly build didn't even install. I will do it by hand and will give you an
update tomorrow afternoon.
Thanks,
Adam
Hello Adam,
This helps some. Could you post the first 20 lines of your config.log? This
will help in trying to reproduce. The content of your host file (you can use
generic names for the nodes if publicizing them is an issue) would also help,
as the number of nodes and number of MPI processes/node im
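A quick way to capture those config.log lines, assuming the build directory is
still around (the path below is only an example), would be:

  head -20 ~/openmpi-4.0.0/config.log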
On the TCP side it doesn't segfault anymore but will time out on some tests;
on the openib side it still segfaults. Here is the output:
[pandora:19256] *** Process received signal ***
[pandora:19256] Signal: Segmentation fault (11)
[pandora:19256] Signal code: Address not mapped (1)
[pandora:1
George,
Thanks for letting us know about this issue; it was a misconfiguration of
the form. I guess we did not realize it, as most of us are automatically
signed in by our browsers. Anyway, thanks for the feedback; access to the
form should now be completely open.
Sorry for the inconveni
On Wed, 2019-02-20 at 13:21 -0500, George Bosilca wrote:
> To obtain representative samples of the MPI community, we have
> prepared a survey
>
> https://docs.google.com/forms/d/e/1FAIpQLSd1bDppVODc8nB0BjIXdqSCO_MuEuNAAbBixl4onTchwSQFwg/viewform
>
To access the survey, I was asked to create a g
Can you try the latest 4.0.x nightly snapshot and see if the problem still
occurs?
https://www.open-mpi.org/nightly/v4.0.x/
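If it helps, fetching and building a nightly is roughly the following; the
exact tarball name changes every night (pick the newest file listed on that
page), and the install prefix is just a placeholder:

  wget https://www.open-mpi.org/nightly/v4.0.x/openmpi-v4.0.x-<date>-<hash>.tar.gz
  tar xzf openmpi-v4.0.x-*.tar.gz && cd openmpi-v4.0.x-*/
  ./configure --prefix=$HOME/opt/openmpi-4.0.x-nightly && make -j 8 && make install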
> On Feb 20, 2019, at 1:40 PM, Adam LeBlanc wrote:
>
> I do, here is the output:
>
> 2 total processes killed (some possibly by mpirun during cleanup)
> [pandora:122
This is what I did for my build — not much going on there:
../openmpi-3.1.3/configure --prefix=/opt/sw/packages/gcc-4_8/openmpi/3.1.3
--with-pmi && \
make -j32
We have a mixture of InfiniBand hardware types, using the RHEL-supplied
InfiniBand packages.
Well, the way you describe it, it sounds to me like maybe an atomics issue
with this compiler version. What was your Open MPI configure line, and what
network interconnect are you using?
An easy way to test this theory would be to force Open MPI to use the tcp
interfaces (everything will be sl
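If it is useful, one way to restrict a run to the TCP, shared-memory, and self
BTLs (and to make sure the ob1 PML, which drives the BTLs, is selected) is
something like the following; the process count, hostfile, and binary here are
placeholders:

  mpirun --mca pml ob1 --mca btl tcp,vader,self -np 6 -hostfile ./hosts ./IMB-MPI1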
I do, here is the output:
2 total processes killed (some possibly by mpirun during cleanup)
[pandora:12238] *** Process received signal ***
[pandora:12238] Signal: Segmentation fault (11)
[pandora:12238] Signal code: Invalid permissions (2)
[pandora:12238] Failing at address: 0x7f5c8e31fff0
[pandor
Dear colleagues,
As part of a wide-ranging effort to understand the current usage of the
Message Passing Interface (MPI) in the development of parallel applications
and to drive future additions to the MPI standard, an international team is
seeking feedback from the largest possible MPI audience (
Hi Adam,
As a sanity check, if you try to use --mca btl self,vader,tcp
do you still see the segmentation fault?
Howard
On Wed, Feb 20, 2019 at 08:50 Adam LeBlanc <alebl...@iol.unh.edu> wrote:
> Hello,
>
> When I do a run with OpenMPI v4.0.0 on Infiniband with this command:
> mpirun --
Does it make any sense that it seems to work fine when OpenMPI and HDF5 are
built with GCC 7.4 and GCC 8.2, but /not/ when they are built with
RHEL-supplied GCC 4.8.5? That appears to be the scenario. For the GCC 4.8.5
build, I did try an XFS filesystem and it didn’t help. GPFS works fine for
e
Hello,
When I do a run with OpenMPI v4.0.0 on Infiniband with this command: mpirun
--mca btl_openib_warn_no_device_params_found 0 --map-by node --mca
orte_base_help_aggregate 0 --mca btl openib,vader,self --mca pml ob1 --mca
btl_openib_allow_ib 1 -np 6
-hostfile /home/aleblanc/ib-mpi-hosts IMB-MP