[OMPI users] openmpi, stdin and qlogic infiniband
Hello

I have to compile a program (abinit) that reads data from stdin, and it doesn't work. I made a simplified version of the program:

      PROGRAM TESTSTDIN

      use mpi
      integer ( kind = 4 ) error
      integer ( kind = 4 ) id
      integer ( kind = 4 ) p
      real ( kind = 8 ) wtime
      CHARACTER(LEN=255) a
      call MPI_Init ( error )
      call MPI_Comm_size ( MPI_COMM_WORLD, p, error )
      call MPI_Comm_rank ( MPI_COMM_WORLD, id, error )

      if ( id == 0 ) then
        PRINT*, "id0"
        READ(5,'(A)') a
      end if

      write ( *, '(a)' ) ' '
      write ( *, '(a,i8,a)' ) ' Process ', id, ' says "Hello, world!"'

      if ( id == 0 ) then
        write ( *, '(a)' ) 'READ from stdin'
        write ( *, '(a)' ) a
      end if
      call MPI_Finalize ( error )

      stop
      end

I've tried openmpi 1.6.5 and 1.7.2. The Fortran compiler is ifort (tried Version 14.0.0.080 Build 20130728 and Version 11.1 Build 20100806); the C compiler is gcc, on CentOS 6.x, with the InfiniBand stack from QLogic (infinipath-libs-3.1-3420.1122_rhel6_qlc.x86_64).

Trying with and without InfiniBand (QLogic card):

mpirun -np 8 ./teststdin < /tmp/a
forrtl: Bad file descriptor
forrtl: severe (108): cannot stat file, unit 5, file /proc/43811/fd/0
Image              PC        Routine    Line       Source
teststdin          0040BF48  Unknown    Unknown    Unknown

mpirun -mca mtl ^psm -mca btl self,sm -np 8 ./teststdin < /tmp/a
 id0
 Process       0 says "Hello, world!"
READ from stdin
zer
 Process       1 says "Hello, world!"
...

Is it a known problem?

Fabrice BOYRIE
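[For what it's worth, a minimal sketch of how the same component selection as the working command above could be set outside the mpirun line, using standard Open MPI MCA mechanisms; the values are simply copied from the "-mca mtl ^psm -mca btl self,sm" command that worked:]

# environment-variable form of "-mca mtl ^psm -mca btl self,sm"
export OMPI_MCA_mtl=^psm
export OMPI_MCA_btl=self,sm
mpirun -np 8 ./teststdin < /tmp/a

# or persistently, one "name = value" per line in the per-user file
# $HOME/.openmpi/mca-params.conf
mtl = ^psm
btl = self,sm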
Re: [OMPI users] openmpi, stdin and qlogic infiniband
Thanks for your answer. I will forward it to an AbInit developer...

Fabrice BOYRIE

NB: The problem seems specific to my QLogic driver (QLogicIB-Basic.RHEL6-x86_64.7.1.0.0.58.tgz).

On Thu, Sep 19, 2013 at 08:37:18AM -0500, Jeff Hammond wrote:
> See this related post
> http://lists.mpich.org/pipermail/discuss/2013-September/001452.html.
>
> The only text in the MPI standard I could find related to stdin is
> "assuming the MPI implementation supports stdin such that this works",
> which is not what I'd call a ringing endorsement of the practice of using
> it.
>
> Tell the AbInit people that they're wrong for using stdin. There are lots
> of cases where it won't work.
>
> Jeff
>
> --
> Jeff Hammond
> jeff.scie...@gmail.com
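[For readers hitting the same limitation, a minimal sketch of the alternative Jeff suggests; this is not AbInit's actual code, and the program name TESTNOSTDIN, the unit number, and the reuse of /tmp/a are only for illustration. Rank 0 opens the input file by name and broadcasts the line with MPI_Bcast, so stdin is never read on any rank:]

      PROGRAM TESTNOSTDIN
!     Sketch only: same skeleton as TESTSTDIN, but rank 0 reads the
!     input from a named file and MPI_Bcast distributes the line, so
!     no rank depends on stdin being usable.
      use mpi
      integer ( kind = 4 ) error
      integer ( kind = 4 ) id
      integer ( kind = 4 ) p
      CHARACTER(LEN=255) a

      call MPI_Init ( error )
      call MPI_Comm_size ( MPI_COMM_WORLD, p, error )
      call MPI_Comm_rank ( MPI_COMM_WORLD, id, error )

      if ( id == 0 ) then
!       /tmp/a is the file that was redirected to stdin above
        open ( unit = 10, file = '/tmp/a', status = 'old' )
        read ( 10, '(A)' ) a
        close ( 10 )
      end if

!     every rank receives the line read by rank 0
      call MPI_Bcast(a, 255, MPI_CHARACTER, 0, MPI_COMM_WORLD, error)

      write ( *, '(a,i8,2a)' ) ' Process ', id, ' read: ', trim(a)

      call MPI_Finalize ( error )
      stop
      end

[Run as before but without the redirection, e.g. mpirun -np 8 ./testnostdin.]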
Re: [OMPI users] [EXTERNAL] Problem with Mellanox ConnectX3 (FDR) and openmpi 4
Hi Howard

Thanks for your answer. With --disable-verbs, the error message disappears. Now Vasp is working; it seems the problem was that the vasp process on the second node used the default LD_LIBRARY_PATH and not the one used for mpirun.

Fabrice Boyrie

On 19/08/2022 at 18:26, Pritchard Jr., Howard wrote:

Hi Boyrie,

The warning message is coming from the older ibverbs component of the Open MPI 4.0/4.1 releases. You can make this message go away in several ways. One, at configure time, is to add --disable-verbs to the configure options. At runtime you can set

export OMPI_MCA_btl=^openib

The ucx messages are just being chatty about which ucx transport type is being selected. The VASP hang may be something else.

Howard

From: users on behalf of Boyrie Fabrice via users
Reply-To: Open MPI Users
Date: Friday, August 19, 2022 at 9:51 AM
To: "users@lists.open-mpi.org"
Cc: Boyrie Fabrice
Subject: [EXTERNAL] [OMPI users] Problem with Mellanox ConnectX3 (FDR) and openmpi 4

Hi

I had to reinstall a cluster in AlmaLinux 8.6. I am unable to make openmpi 4 work with infiniband. I have the following message in a trivial pingpong test:

mpirun --hostfile hostfile -np 2 pingpong
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   node2
  Local device: mlx4_0
--------------------------------------------------------------------------
[node2:12431] common_ucx.c:107 using OPAL memory hooks as external events
[node2:12431] pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.11.2
[node1:13188] common_ucx.c:174 using OPAL memory hooks as external events
[node1:13188] pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.11.2
[node2:12431] pml_ucx.c:289 mca_pml_ucx_init
[node1:13188] common_ucx.c:333 posix/memory: did not match transport list
[node1:13188] common_ucx.c:333 sysv/memory: did not match transport list
[node1:13188] common_ucx.c:333 self/memory0: did not match transport list
[node1:13188] common_ucx.c:333 tcp/lo: did not match transport list
[node1:13188] common_ucx.c:333 tcp/eno1: did not match transport list
[node1:13188] common_ucx.c:333 tcp/ib0: did not match transport list
[node1:13188] common_ucx.c:228 driver '../../../../bus/pci/drivers/mlx4_core' matched by 'mlx*'
[node1:13188] common_ucx.c:324 rc_verbs/mlx4_0:1: matched both transport and device list
[node1:13188] common_ucx.c:337 support level is transports and devices
[node1:13188] pml_ucx.c:289 mca_pml_ucx_init
[node2:12431] pml_ucx.c:114 Pack remote worker address, size 155
[node2:12431] pml_ucx.c:114 Pack local worker address, size 291
[node2:12431] pml_ucx.c:351 created ucp context 0xf832a0, worker 0x109fc50
[node1:13188] pml_ucx.c:114 Pack remote worker address, size 155
[node1:13188] pml_ucx.c:114 Pack local worker address, size 291
[node1:13188] pml_ucx.c:351 created ucp context 0x1696320, worker 0x16c9ce0
[node1:13188] pml_ucx_component.c:147 returning priority 51
[node2:12431] pml_ucx.c:182 Got proc 0 address, size 291
[node2:12431] pml_ucx.c:411 connecting to proc. 0
[node1:13188] pml_ucx.c:182 Got proc 1 address, size 291
[node1:13188] pml_ucx.c:411 connecting to proc. 1
  length   time/message (usec)   transfer rate (Gbyte/sec)
[node2:12431] pml_ucx.c:182 Got proc 1 address, size 155
[node2:12431] pml_ucx.c:411 connecting to proc. 1
[node1:13188] pml_ucx.c:182 Got proc 0 address, size 155
[node1:13188] pml_ucx.c:411 connecting to proc. 0
       1    45.683729    0.88
    1001     4.286029    0.934198
    2001     5.755391    1.390696
    3001     6.902443    1.739095
    4001     8.485305    1.886084
    5001     9.596994    2.084403
    6001    11.055146    2.171297
    7001    11.977093    2.338130
    8001    13.324408    2.401908
    9001    14.471116    2.487991
   10001    15.806676    2.530829
[node2:12431] common_ucx.c:240 disconnecting from rank 0
[node2:12431] common_ucx.c:240 disconnecting from rank 1
[node2:12431] common_ucx.c:204 waiting for 1 disconnect requests
[node2:12431] common_ucx.c:204 waiting for 0 disconnect requests
[node1:13188] common_ucx.c:466 disconnecting from rank 0
[node1:13188] common_ucx.c:430 waiting for 1 disconnect requests
[node1:13188] common_ucx.c:466 disconnecting from rank 1
[node1:13188] common_ucx.c:430 waiting for 0 disconnect requests
[node2:12431] pml_ucx.c:367 mca_pml_ucx_cleanup
[node1:13188] pml_ucx.c:367 mca_pml_ucx_cleanup
[node2:12431] pml_ucx.c:268 mca_pml_ucx_close
[node1:13188] pml_ucx.c:268 mca_pml_ucx_close

cat hostfile
node1 slots=1
node2 slots=1

And with a real program (Vasp) it stops. Infiniband seems to be working. I can ssh over infiniband and qperf works in rdma mode:

qperf -t 10 ibnode1 ud_lat ud_bw
ud_lat:
    latency = 18.2 us
ud_bw:
    send_bw = 2.81 GB/sec
    recv
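[For reference, a minimal sketch combining the two points that mattered in this thread: disabling the legacy openib BTL at runtime as Howard suggests, and exporting LD_LIBRARY_PATH to the ranks launched on the second node with mpirun's -x option. The vasp_std binary name and the 2-rank layout are only placeholders:]

# runtime form of Howard's suggestion: skip the legacy openib BTL
# (or rebuild Open MPI with --disable-verbs at configure time)
export OMPI_MCA_btl=^openib

# -x re-exports the named environment variable to the processes started
# on the other hosts, so node2 sees the same LD_LIBRARY_PATH as mpirun
mpirun --hostfile hostfile -np 2 -x LD_LIBRARY_PATH ./vasp_std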