Hi Åke,

Thanks for the pointer.  However, when I tried

  eb PyTorch-1.10.0-foss-2021a-CUDA-11.3.1.eb --robot \
    --cuda-compute-capabilities=6.1,7.5 --buildpath=/dev/shm \
    --tmpdir=/scratch/eb-build --from-pr 14496

it failed in the same way.
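
One thing that might be worth checking is whether the OpenMPI that ends up in the build environment really is CUDA-aware. The sketch below assumes that `ompi_info --parsable --all` reports the `mpi_built_with_cuda_support` parameter in the form described in the Open MPI documentation; the helper names `built_with_cuda` and `query_ompi_info` are made up for illustration, and the output format may differ between OpenMPI versions.

```python
import subprocess

# Prefix of the line that `ompi_info --parsable --all` prints for the
# CUDA build flag; the value after it is expected to be "true" or "false".
# (Assumption: format as described in the Open MPI documentation.)
CUDA_KEY = "mca:mpi:base:param:mpi_built_with_cuda_support:value:"


def built_with_cuda(ompi_info_output):
    """Return True/False for the CUDA flag, or None if the flag is absent."""
    for line in ompi_info_output.splitlines():
        line = line.strip()
        if line.startswith(CUDA_KEY):
            return line[len(CUDA_KEY):] == "true"
    return None


def query_ompi_info(executable="ompi_info"):
    """Run ompi_info from the currently loaded OpenMPI module and parse it."""
    result = subprocess.run(
        [executable, "--parsable", "--all"],
        capture_output=True, text=True, check=True,
    )
    return built_with_cuda(result.stdout)
```

With the OpenMPI module loaded, calling `query_ompi_info()` should return `True` only if that installation was configured with CUDA support; `None` would mean the parameter was not reported at all.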

Cheers,

Loris

Åke Sandgren <[email protected]> writes:

> Regarding the CUDA-enabled OpenMPI problem, see PR
> https://github.com/easybuilders/easybuild-easyconfigs/pull/14496
>
> On 12/21/21 1:34 PM, Loris Bennett wrote:
>> Hi,
>> 
>> I am running 
>> 
>>   eb PyTorch-1.10.0-foss-2021a-CUDA-11.3.1.eb --robot \
>>     --cuda-compute-capabilities=6.1,7.5 --buildpath=/dev/shm \
>>     --tmpdir=/scratch/eb-build
>> 
>> on a GPU node.  The build step succeeds but the tests fail with the error
>> 
>>   RuntimeError: In operator() at tensorpipe/common/ibv.h:172 "": Invalid argument
>> 
>> See below for full extract from the log file.
>> 
>> There is a PyTorch issue 
>> 
>>   https://github.com/pytorch/tensorpipe/issues/413
>> 
>> which seems related and we do indeed have an Omnipath fabric.
>> 
>> On the other hand, in the EB log file it says at some point:
>> 
>>   -- MPI libraries: /trinity/shared/easybuild/software/OpenMPI/4.1.1-GCC-10.3.0/lib/libmpi.so
>>   CMake Warning at cmake/Dependencies.cmake:1081 (message):
>>     OpenMPI found, but it is not built with CUDA support.
>> 
>> Could that be related?  Is a CUDA-enabled OpenMPI needed?  Or do we just
>> need to skip the test?
>> 
>> Cheers,
>> 
>> Loris
>> 
>> ============================= test session starts ==============================
>> platform linux -- Python 3.9.5, pytest-6.2.4, py-1.10.0, pluggy-0.13.1 -- /trinity/shared/easybuild/software/Python/3.9.5-GCCcore-10.3.0/bin/python
>> cachedir: .pytest_cache
>> hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/.hypothesis/examples')
>> torch: 1.10.0
>> rootdir: /dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch, configfile: pytest.ini
>> plugins: hypothesis-6.13.1
>> collecting ... collected 13 items
>> 
>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-3] ERROR [  7%]
>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-1:2] ERROR [ 15%]
>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-2:1] ERROR [ 23%]
>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-1:1:1] ERROR [ 30%]
>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-3] ERROR [ 38%]
>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-1:2] ERROR [ 46%]
>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-2:1] ERROR [ 53%]
>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-1:1:1] ERROR [ 61%]
>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-3] ERROR [ 69%]
>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-1:2] ERROR [ 76%]
>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-2:1] ERROR [ 84%]
>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-1:1:1] ERROR [ 92%]
>> distributed/pipeline/sync/skip/test_gpipe.py::test_none_skip ERROR [100%]
>> 
>> ==================================== ERRORS ====================================
>> _____________________ ERROR at setup of test_1to3[never-3] _____________________
>> Traceback (most recent call last):
>>   File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
>>     dist.rpc.init_rpc(
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
>>     _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 229, in _init_rpc_backend
>>     rpc_agent = backend_registry.init_backend(
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/backend_registry.py", line 106, in init_backend
>>     return backend.value.init_backend_handler(*args, **kwargs)
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/backend_registry.py", line 309, in _tensorpipe_init_backend_handler
>>     api._init_rpc_states(agent)
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/api.py", line 114, in _init_rpc_states
>>     _set_and_start_rpc_agent(agent)
>> RuntimeError: In operator() at tensorpipe/common/ibv.h:172 "": Invalid argument
>> ___________________ ERROR at setup of test_1to3[never-1:2] ____________________
>> Traceback (most recent call last):
>>   File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
>>     dist.rpc.init_rpc(
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
>>     _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
>>     raise RuntimeError("RPC is already initialized")
>> RuntimeError: RPC is already initialized
>> ____________________ ERROR at setup of test_1to3[never-2:1] ____________________
>> Traceback (most recent call last):
>>   File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
>>     dist.rpc.init_rpc(
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
>>     _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
>>     raise RuntimeError("RPC is already initialized")
>> RuntimeError: RPC is already initialized
>> ___________________ ERROR at setup of test_1to3[never-1:1:1] ___________________
>> Traceback (most recent call last):
>>   File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
>>     dist.rpc.init_rpc(
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
>>     _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
>>     raise RuntimeError("RPC is already initialized")
>> RuntimeError: RPC is already initialized
>> ____________________ ERROR at setup of test_1to3[always-3] _____________________
>> Traceback (most recent call last):
>>   File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
>>     dist.rpc.init_rpc(
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
>>     _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
>>     raise RuntimeError("RPC is already initialized")
>> RuntimeError: RPC is already initialized
>> ___________________ ERROR at setup of test_1to3[always-1:2] ____________________
>> Traceback (most recent call last):
>>   File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
>>     dist.rpc.init_rpc(
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
>>     _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
>>     raise RuntimeError("RPC is already initialized")
>> RuntimeError: RPC is already initialized
>> ___________________ ERROR at setup of test_1to3[always-2:1] ____________________
>> Traceback (most recent call last):
>>   File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
>>     dist.rpc.init_rpc(
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
>>     _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
>>     raise RuntimeError("RPC is already initialized")
>> RuntimeError: RPC is already initialized
>> __________________ ERROR at setup of test_1to3[always-1:1:1] ___________________
>> Traceback (most recent call last):
>>   File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
>>     dist.rpc.init_rpc(
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
>>     _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
>>     raise RuntimeError("RPC is already initialized")
>> RuntimeError: RPC is already initialized
>> __________________ ERROR at setup of test_1to3[except_last-3] __________________
>> Traceback (most recent call last):
>>   File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
>>     dist.rpc.init_rpc(
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
>>     _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
>>     raise RuntimeError("RPC is already initialized")
>> RuntimeError: RPC is already initialized
>> _________________ ERROR at setup of test_1to3[except_last-1:2] _________________
>> Traceback (most recent call last):
>>   File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
>>     dist.rpc.init_rpc(
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
>>     _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
>>     raise RuntimeError("RPC is already initialized")
>> RuntimeError: RPC is already initialized
>> _________________ ERROR at setup of test_1to3[except_last-2:1] _________________
>> Traceback (most recent call last):
>>   File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
>>     dist.rpc.init_rpc(
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
>>     _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
>>     raise RuntimeError("RPC is already initialized")
>> RuntimeError: RPC is already initialized
>> ________________ ERROR at setup of test_1to3[except_last-1:1:1] ________________
>> Traceback (most recent call last):
>>   File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
>>     dist.rpc.init_rpc(
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
>>     _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
>>     raise RuntimeError("RPC is already initialized")
>> RuntimeError: RPC is already initialized
>> _______________________ ERROR at setup of test_none_skip _______________________
>> Traceback (most recent call last):
>>   File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
>>     dist.rpc.init_rpc(
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
>>     _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
>>     raise RuntimeError("RPC is already initialized")
>> RuntimeError: RPC is already initialized
>> =========================== short test summary info ============================
>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-3] - Runt...
>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-1:2] - Ru...
>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-2:1] - Ru...
>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-1:1:1] - ...
>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-3] - Run...
>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-1:2] - R...
>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-2:1] - R...
>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-1:1:1]
>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-3]
>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-1:2]
>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-2:1]
>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-1:1:1]
>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_none_skip - RuntimeE...
>> ============================== 13 errors in 0.17s ==============================
>> distributed/pipeline/sync/skip/test_gpipe failed!
>> Running distributed/pipeline/sync/skip/test_inspect_skip_layout ... [2021-12-21 09:34:23.699450]
>> Executing ['/trinity/shared/easybuild/software/Python/3.9.5-GCCcore-10.3.0/bin/python', '-m', 'pytest', 'distributed/pipeline/sync/skip/test_inspect_skip_layout.py', '-v'] ... [2021-12-21 09:34:23.699498]
>> 
>> 
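
In the log above only the first setup error mentions tensorpipe/common/ibv.h; the remaining twelve report "RPC is already initialized", which looks like a knock-on effect of the first failure rather than a separate problem. A minimal sketch for grouping such setup errors by their final exception line, assuming the log has been unwrapped and stripped of the ">> " quote prefix (the function name `group_setup_errors` and the shortened `SAMPLE` excerpt are illustrative only):

```python
import re
from collections import OrderedDict

# Matches pytest section headers such as
# "____ ERROR at setup of test_1to3[never-3] ____".
SECTION_RE = re.compile(r"^_+ ERROR at setup of (?P<test>\S+) _+$")
# Matches the last line of a traceback, e.g. "RuntimeError: ...".
EXC_RE = re.compile(r"^(?P<exc>\w+(?:Error|Exception)): (?P<msg>.*)$")


def group_setup_errors(log_text):
    """Map each distinct exception line to the tests that raised it."""
    groups = OrderedDict()
    current_test = None
    for line in log_text.splitlines():
        line = line.strip()
        header = SECTION_RE.match(line)
        if header:
            current_test = header.group("test")
            continue
        exc = EXC_RE.match(line)
        if exc and current_test is not None:
            key = f"{exc.group('exc')}: {exc.group('msg')}"
            groups.setdefault(key, []).append(current_test)
            current_test = None  # record one exception per section
    return groups


# Shortened excerpt in the shape of the log quoted in this thread.
SAMPLE = """\
_____ ERROR at setup of test_1to3[never-3] _____
    _set_and_start_rpc_agent(agent)
RuntimeError: In operator() at tensorpipe/common/ibv.h:172 "": Invalid argument
_____ ERROR at setup of test_1to3[never-1:2] _____
    raise RuntimeError("RPC is already initialized")
RuntimeError: RPC is already initialized
_____ ERROR at setup of test_none_skip _____
    raise RuntimeError("RPC is already initialized")
RuntimeError: RPC is already initialized
"""

if __name__ == "__main__":
    for message, tests in group_setup_errors(SAMPLE).items():
        print(f"{len(tests):2d} x {message}")
```

Grouping this way puts the tensorpipe error first, which suggests that it is the one to chase.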
