[OMPI users] openmpi, stdin and qlogic infiniband

2013-09-19 Thread Fabrice Boyrie
Hello

I have to compile a program (abinit) that reads data from stdin, and it
doesn't work.


  I made a simplified version of the program



PROGRAM TESTSTDIN

  use mpi
  integer ( kind = 4 ) error
  integer ( kind = 4 ) id
  integer ( kind = 4 ) p
  real ( kind = 8 ) wtime
  CHARACTER(LEN=255) a
  call MPI_Init ( error )
  call MPI_Comm_size ( MPI_COMM_WORLD, p, error )
  call MPI_Comm_rank ( MPI_COMM_WORLD, id, error )

  if ( id == 0 ) then
PRINT*, "id0"
READ(5,'(A)') a
  end if

  write ( *, '(a)' ) ' '
  write ( *, '(a,i8,a)' ) '  Process ', id, ' says "Hello, world!"'

  if ( id == 0 ) then
write ( *, '(a)' ) 'READ from stdin'
write ( *, '(a)' ) a
  end if
  call MPI_Finalize ( error )

  stop
end
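
(For reference: the usual portable alternative is to read the input on one
rank and broadcast it, rather than relying on stdin. A minimal sketch, not
part of the original test and assuming the data can instead come from a
file named on the command line, might look like this:)

PROGRAM TESTBCAST

  ! Sketch: rank 0 reads one line from a file given as the first
  ! command-line argument, then broadcasts it to all ranks.
  use mpi
  implicit none
  integer :: error, id, p
  CHARACTER(LEN=255) :: a, fname

  call MPI_Init ( error )
  call MPI_Comm_size ( MPI_COMM_WORLD, p, error )
  call MPI_Comm_rank ( MPI_COMM_WORLD, id, error )

  if ( id == 0 ) then
    call get_command_argument ( 1, fname )   ! input file instead of stdin
    open ( unit=10, file=trim(fname), status='old' )
    read ( 10, '(A)' ) a
    close ( 10 )
  end if

  ! Make the line available on every rank.
  call MPI_Bcast ( a, len(a), MPI_CHARACTER, 0, MPI_COMM_WORLD, error )

  write ( *, '(a,i8,2a)' ) '  Process ', id, ' got: ', trim(a)

  call MPI_Finalize ( error )
end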


I've tried Open MPI 1.6.5 and 1.7.2.
The Fortran compiler is ifort (I tried Version 14.0.0.080 Build 20130728
and Version 11.1 Build 20100806).
The C compiler is gcc, the OS is CentOS 6.x, and the InfiniBand stack is
the QLogic one (infinipath-libs-3.1-3420.1122_rhel6_qlc.x86_64).

I tried with and without InfiniBand (QLogic card):

mpirun -np 8 ./teststdin < /tmp/a
forrtl: Bad file descriptor
forrtl: severe (108): cannot stat file, unit 5, file /proc/43811/fd/0
Image              PC        Routine   Line     Source
teststdin          0040BF48  Unknown   Unknown  Unknown



 mpirun -mca mtl ^psm -mca btl self,sm -np 8 ./teststdin < /tmp/a

 id0
  Process0 says "Hello, world!"
READ from stdin
zer

  Process1 says "Hello, world!"
...



Is this a known problem?

 Fabrice BOYRIE




Re: [OMPI users] openmpi, stdin and qlogic infiniband

2013-09-19 Thread Fabrice Boyrie
Thanks for your answer; I will forward it to an AbInit developer.

Fabrice BOYRIE

NB: The problem seems specific to my QLogic driver
(QLogicIB-Basic.RHEL6-x86_64.7.1.0.0.58.tgz).


On Thu, Sep 19, 2013 at 08:37:18AM -0500, Jeff Hammond wrote:
> See this related post
> http://lists.mpich.org/pipermail/discuss/2013-September/001452.html.
> 
> The only text in the MPI standard I could find related to stdin is
> "assuming the MPI implementation supports stdin such that this works",
> which is not what I'd call a ringing endorsement of the practice of using
> it.
> 
> Tell the AbInit people that they're wrong for using stdin.  There are lots
> of cases where it won't work.
> 
> Jeff
> 
> 
> On Thu, Sep 19, 2013 at 6:42 AM, Fabrice Boyrie  wrote:
> > [original message quoted in full; trimmed here]
> 
> 
> 
> 
> --
> Jeff Hammond
> jeff.scie...@gmail.com




Re: [OMPI users] [EXTERNAL] Problem with Mellanox ConnectX3 (FDR) and openmpi 4

2022-08-20 Thread Fabrice Boyrie via users

Hi Howard

Thanks for your answer.
With --disable-verbs, the error message disappears.
VASP is now working; it seems the problem was that the VASP process on the
second node used the default LD_LIBRARY_PATH instead of the one set for
mpirun.
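
(For others hitting the same thing: if the remote ranks need the same
environment, one common option is to have mpirun forward the variable
explicitly with its -x flag, e.g.

mpirun -x LD_LIBRARY_PATH --hostfile hostfile -np 2 ./your_program

where ./your_program is just a placeholder for the actual binary.)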



Fabrice Boyrie


On 19/08/2022 at 18:26, Pritchard Jr., Howard wrote:


Hi Boyrie,

The warning message is coming from the older ibverbs component of the 
Open MPI 4.0/4.1 releases.


You can make this message go away in several ways.  One, at configure time,
is to add


--disable-verbs

to the configure options.

At runtime you can set

export OMPI_MCA_btl=^openib
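
The same setting can also be passed on the mpirun command line, e.g.

mpirun --mca btl ^openib ...

which is equivalent to exporting the environment variable above.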

The ucx messages are just being chatty about which ucx transport type 
is being selected.


The VASP hang may be something else.

Howard

From: users  on behalf of Boyrie Fabrice via users 
Reply-To: Open MPI Users 
Date: Friday, August 19, 2022 at 9:51 AM
To: "users@lists.open-mpi.org" 
Cc: Boyrie Fabrice 
Subject: [EXTERNAL] [OMPI users] Problem with Mellanox ConnectX3 (FDR) and openmpi 4


Hi

I had to reinstall a cluster with AlmaLinux 8.6.

I am unable to get Open MPI 4 working with InfiniBand. I get the following
message in a trivial ping-pong test:


mpirun --hostfile hostfile -np 2 pingpong

--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

 Local host:   node2
 Local device: mlx4_0
--------------------------------------------------------------------------


[node2:12431] common_ucx.c:107 using OPAL memory hooks as external events
[node2:12431] pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.11.2
[node1:13188] common_ucx.c:174 using OPAL memory hooks as external events
[node1:13188] pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.11.2
[node2:12431] pml_ucx.c:289 mca_pml_ucx_init
[node1:13188] common_ucx.c:333 posix/memory: did not match transport list
[node1:13188] common_ucx.c:333 sysv/memory: did not match transport list
[node1:13188] common_ucx.c:333 self/memory0: did not match transport list
[node1:13188] common_ucx.c:333 tcp/lo: did not match transport list
[node1:13188] common_ucx.c:333 tcp/eno1: did not match transport list
[node1:13188] common_ucx.c:333 tcp/ib0: did not match transport list
[node1:13188] common_ucx.c:228 driver '../../../../bus/pci/drivers/mlx4_core' matched by 'mlx*'
[node1:13188] common_ucx.c:324 rc_verbs/mlx4_0:1: matched both transport and device list

[node1:13188] common_ucx.c:337 support level is transports and devices
[node1:13188] pml_ucx.c:289 mca_pml_ucx_init
[node2:12431] pml_ucx.c:114 Pack remote worker address, size 155
[node2:12431] pml_ucx.c:114 Pack local worker address, size 291
[node2:12431] pml_ucx.c:351 created ucp context 0xf832a0, worker 0x109fc50

[node1:13188] pml_ucx.c:114 Pack remote worker address, size 155
[node1:13188] pml_ucx.c:114 Pack local worker address, size 291
[node1:13188] pml_ucx.c:351 created ucp context 0x1696320, worker 0x16c9ce0

[node1:13188] pml_ucx_component.c:147 returning priority 51
[node2:12431] pml_ucx.c:182 Got proc 0 address, size 291
[node2:12431] pml_ucx.c:411 connecting to proc. 0
[node1:13188] pml_ucx.c:182 Got proc 1 address, size 291
[node1:13188] pml_ucx.c:411 connecting to proc. 1
length   time/message (usec)    transfer rate (Gbyte/sec)
[node2:12431] pml_ucx.c:182 Got proc 1 address, size 155
[node2:12431] pml_ucx.c:411 connecting to proc. 1
[node1:13188] pml_ucx.c:182 Got proc 0 address, size 155
[node1:13188] pml_ucx.c:411 connecting to proc. 0
1 45.683729   0.88
1001  4.286029   0.934198
2001  5.755391   1.390696
3001  6.902443   1.739095
4001  8.485305   1.886084
5001  9.596994   2.084403
6001  11.055146   2.171297
7001  11.977093   2.338130
8001  13.324408   2.401908
9001  14.471116   2.487991
10001 15.806676   2.530829
[node2:12431] common_ucx.c:240 disconnecting from rank 0
[node2:12431] common_ucx.c:240 disconnecting from rank 1
[node2:12431] common_ucx.c:204 waiting for 1 disconnect requests
[node2:12431] common_ucx.c:204 waiting for 0 disconnect requests
[node1:13188] common_ucx.c:466 disconnecting from rank 0
[node1:13188] common_ucx.c:430 waiting for 1 disconnect requests
[node1:13188] common_ucx.c:466 disconnecting from rank 1
[node1:13188] common_ucx.c:430 waiting for 0 disconnect requests
[node2:12431] pml_ucx.c:367 mca_pml_ucx_cleanup
[node1:13188] pml_ucx.c:367 mca_pml_ucx_cleanup
[node2:12431] pml_ucx.c:268 mca_pml_ucx_close
[node1:13188] pml_ucx.c:268 mca_pml_ucx_close

cat hostfile
node1 slots=1
node2 slots=1

And with a real program (VASP) it hangs.

InfiniBand seems to be working. I can ssh over InfiniBand, and qperf works
in RDMA mode:


qperf  -t 10 ibnode1 ud_lat ud_bw
ud_lat:
   latency  =  18.2 us
ud_bw:
   send_bw  =  2.81 GB/sec
   recv