Hi Åke,

Thanks for the pointer.  However, when I tried

  eb PyTorch-1.10.0-foss-2021a-CUDA-11.3.1.eb --robot --cuda-compute-capabilities=6.1,7.5 --buildpath=/dev/shm --tmpdir=/scratch/eb-build --from-pr 14496

it failed in the same way.

Cheers,

Loris

Åke Sandgren <[email protected]> writes:

> Reg the cuda-enabled openmpi problem, see PR
> https://github.com/easybuilders/easybuild-easyconfigs/pull/14496
>
> On 12/21/21 1:34 PM, Loris Bennett wrote:
>> Hi,
>>
>> I am running
>>
>>   eb PyTorch-1.10.0-foss-2021a-CUDA-11.3.1.eb --robot --cuda-compute-capabilities=6.1,7.5 --buildpath=/dev/shm --tmpdir=/scratch/eb-build
>>
>> on a GPU node.  The build step succeeds but the tests fail with the error
>>
>>   RuntimeError: In operator() at tensorpipe/common/ibv.h:172 "": Invalid argument
>>
>> See below for full extract from the log file.
>>
>> There is a PyTorch issue
>>
>>   https://github.com/pytorch/tensorpipe/issues/413
>>
>> which seems related and we do indeed have an Omnipath fabric.
>>
>> On the other hand, in the EB log file it says at some point:
>>
>>   -- MPI libraries: /trinity/shared/easybuild/software/OpenMPI/4.1.1-GCC-10.3.0/lib/libmpi.so
>>   CMake Warning at cmake/Dependencies.cmake:1081 (message):
>>     OpenMPI found, but it is not built with CUDA support.
>>
>> Could that be related?  Is a CUDA-enabled OpenMPI needed?  Or do we just
>> need to skip the test?
>>
>> Cheers,
>>
>> Loris
>>
>> ============================= test session starts ==============================
>> platform linux -- Python 3.9.5, pytest-6.2.4, py-1.10.0, pluggy-0.13.1 -- /trinity/shared/easybuild/software/Python/3.9.5-GCCcore-10.3.0/bin/python
>> cachedir: .pytest_cache
>> hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/.hypothesis/examples')
>> torch: 1.10.0
>> rootdir: /dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch, configfile: pytest.ini
>> plugins: hypothesis-6.13.1
>> collecting ... collected 13 items
>>
>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-3] ERROR [  7%]
>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-1:2] ERROR [ 15%]
>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-2:1] ERROR [ 23%]
>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-1:1:1] ERROR [ 30%]
>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-3] ERROR [ 38%]
>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-1:2] ERROR [ 46%]
>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-2:1] ERROR [ 53%]
>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-1:1:1] ERROR [ 61%]
>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-3] ERROR [ 69%]
>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-1:2] ERROR [ 76%]
>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-2:1] ERROR [ 84%]
>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-1:1:1] ERROR [ 92%]
>> distributed/pipeline/sync/skip/test_gpipe.py::test_none_skip ERROR [100%]
>>
>> ==================================== ERRORS ====================================
>> _____________________ ERROR at setup of test_1to3[never-3] _____________________
>> Traceback (most recent call last):
>>   File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
>>     dist.rpc.init_rpc(
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
>>     _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 229, in _init_rpc_backend
>>     rpc_agent = backend_registry.init_backend(
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/backend_registry.py", line 106, in init_backend
>>     return backend.value.init_backend_handler(*args, **kwargs)
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/backend_registry.py", line 309, in _tensorpipe_init_backend_handler
>>     api._init_rpc_states(agent)
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/api.py", line 114, in _init_rpc_states
>>     _set_and_start_rpc_agent(agent)
>> RuntimeError: In operator() at tensorpipe/common/ibv.h:172 "": Invalid argument
>> ___________________ ERROR at setup of test_1to3[never-1:2] ____________________
>> Traceback (most recent call last):
>>   File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
>>     dist.rpc.init_rpc(
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
>>     _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
>>     raise RuntimeError("RPC is already initialized")
>> RuntimeError: RPC is already initialized
>> ____________________ ERROR at setup of test_1to3[never-2:1] ____________________
>> Traceback (most recent call last):
>>   File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
>>     dist.rpc.init_rpc(
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
>>     _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
>>     raise RuntimeError("RPC is already initialized")
>> RuntimeError: RPC is already initialized
>> ___________________ ERROR at setup of test_1to3[never-1:1:1] ___________________
>> Traceback (most recent call last):
>>   File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
>>     dist.rpc.init_rpc(
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
>>     _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
>>     raise RuntimeError("RPC is already initialized")
>> RuntimeError: RPC is already initialized
>> ____________________ ERROR at setup of test_1to3[always-3] _____________________
>> Traceback (most recent call last):
>>   File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
>>     dist.rpc.init_rpc(
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
>>     _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
>>     raise RuntimeError("RPC is already initialized")
>> RuntimeError: RPC is already initialized
>> ___________________ ERROR at setup of test_1to3[always-1:2] ____________________
>> Traceback (most recent call last):
>>   File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
>>     dist.rpc.init_rpc(
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
>>     _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
>>     raise RuntimeError("RPC is already initialized")
>> RuntimeError: RPC is already initialized
>> ___________________ ERROR at setup of test_1to3[always-2:1] ____________________
>> Traceback (most recent call last):
>>   File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
>>     dist.rpc.init_rpc(
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
>>     _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
>>     raise RuntimeError("RPC is already initialized")
>> RuntimeError: RPC is already initialized
>> __________________ ERROR at setup of test_1to3[always-1:1:1] ___________________
>> Traceback (most recent call last):
>>   File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
>>     dist.rpc.init_rpc(
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
>>     _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
>>     raise RuntimeError("RPC is already initialized")
>> RuntimeError: RPC is already initialized
>> __________________ ERROR at setup of test_1to3[except_last-3] __________________
>> Traceback (most recent call last):
>>   File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
>>     dist.rpc.init_rpc(
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
>>     _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
>>     raise RuntimeError("RPC is already initialized")
>> RuntimeError: RPC is already initialized
>> _________________ ERROR at setup of test_1to3[except_last-1:2] _________________
>> Traceback (most recent call last):
>>   File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
>>     dist.rpc.init_rpc(
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
>>     _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
>>     raise RuntimeError("RPC is already initialized")
>> RuntimeError: RPC is already initialized
>> _________________ ERROR at setup of test_1to3[except_last-2:1] _________________
>> Traceback (most recent call last):
>>   File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
>>     dist.rpc.init_rpc(
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
>>     _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
>>     raise RuntimeError("RPC is already initialized")
>> RuntimeError: RPC is already initialized
>> ________________ ERROR at setup of test_1to3[except_last-1:1:1] ________________
>> Traceback (most recent call last):
>>   File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
>>     dist.rpc.init_rpc(
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
>>     _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
>>     raise RuntimeError("RPC is already initialized")
>> RuntimeError: RPC is already initialized
>> _______________________ ERROR at setup of test_none_skip _______________________
>> Traceback (most recent call last):
>>   File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
>>     dist.rpc.init_rpc(
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
>>     _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
>>     raise RuntimeError("RPC is already initialized")
>> RuntimeError: RPC is already initialized
>> =========================== short test summary info ============================
>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-3] - Runt...
>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-1:2] - Ru...
>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-2:1] - Ru...
>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-1:1:1] - ...
>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-3] - Run...
>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-1:2] - R...
>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-2:1] - R...
>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-1:1:1]
>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-3]
>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-1:2]
>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-2:1]
>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-1:1:1]
>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_none_skip - RuntimeE...
>> ============================== 13 errors in 0.17s ==============================
>> distributed/pipeline/sync/skip/test_gpipe failed!
>> Running distributed/pipeline/sync/skip/test_inspect_skip_layout ... [2021-12-21 09:34:23.699450]
>> Executing ['/trinity/shared/easybuild/software/Python/3.9.5-GCCcore-10.3.0/bin/python', '-m', 'pytest', 'distributed/pipeline/sync/skip/test_inspect_skip_layout.py', '-v'] ... [2021-12-21 09:34:23.699498]

