[OMPI users] Calculation stuck in MPI

2009-03-03 Thread Ondrej Marsalek
Dear everyone,

I have a calculation (the CP2K program) running with MPI over InfiniBand,
and it is stuck. All 16 processes (on 4 nodes) are still running and taking
100% CPU. Attaching a debugger reveals this (only the end of the stack is
shown here):

(gdb) backtrace
#0  0x2b3460916dbf in btl_openib_component_progress () from
/home/marsalek/opt/openmpi-1.3-intel/lib/openmpi/mca_btl_openib.so
#1  0x2b345c22c778 in opal_progress () from
/home/marsalek/opt/openmpi-1.3-intel/lib/libopen-pal.so.0
#2  0x2b345bd2d66d in ompi_request_default_wait_any () from
/home/marsalek/opt/openmpi-1.3-intel/lib/libmpi.so.0
#3  0x2b345bd6021a in PMPI_Waitany () from
/home/marsalek/opt/openmpi-1.3-intel/lib/libmpi.so.0
#4  0x2b345bae77f1 in pmpi_waitany__ () from
/home/marsalek/opt/openmpi-1.3-intel/lib/libmpi_f77.so.0

The hung job has even survived a restart of the IB switch, unlike "healthy"
runs. My question is: is it obvious at which level the problem lies? IB, Open
MPI, or the application? I would be glad to provide detailed information if
anyone is willing to help. I want to work on this, but unfortunately I am not
sure where to begin.
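
One thing that might help localize a hang like this at the application level
(a sketch only, not actual CP2K code; the ring exchange below is just a
stand-in for whatever requests the real code posts) is to rebuild the failing
wait as a polling MPI_Testany, so a rank whose request never completes can
report itself:

program waitany_probe
  implicit none
  include "mpif.h"
  integer :: ierr, me, np, left, right, idx
  integer :: requests(2), status(MPI_STATUS_SIZE)
  integer :: sendbuf, recvbuf
  integer (kind=8) :: spins
  logical :: flag

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, me, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, np, ierr)
  left  = mod(me - 1 + np, np)
  right = mod(me + 1, np)
  sendbuf = me

! Post one receive from the left neighbour and one send to the right one.
  call MPI_IRECV(recvbuf, 1, MPI_INTEGER, left,  0, MPI_COMM_WORLD, requests(1), ierr)
  call MPI_ISEND(sendbuf, 1, MPI_INTEGER, right, 0, MPI_COMM_WORLD, requests(2), ierr)

! Poll instead of blocking in MPI_WAITANY, so a request that never
! completes gets reported instead of hanging silently.
  spins = 0
  do
    call MPI_TESTANY(2, requests, idx, flag, status, ierr)
    if (flag) exit
    spins = spins + 1
    if (mod(spins, 100000000_8) .eq. 0) print *, "rank", me, "still waiting"
  end do
  print *, "rank", me, "first completed request:", idx

  call MPI_WAITALL(2, requests, MPI_STATUSES_IGNORE, ierr)
  call MPI_FINALIZE(ierr)
end program waitany_probe

On a healthy fabric this completes immediately; the same pattern patched into
the wait that the backtrace points at would at least show which rank and which
request are stuck.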

Best regards,
Ondrej Marsalek


[OMPI users] MPI-IO Inconsistency over Lustre using OMPI 1.3

2009-03-03 Thread Nathan Baca
Hello,

I am seeing inconsistent MPI-IO behavior when writing to a Lustre file
system using Open MPI 1.3 with ROMIO. What follows is a simple reproducer
and its output. Essentially, one or more of the running processes does not
read or write the correct amount of data to its part of a file residing on
a Lustre (parallel) file system.

Any help figuring out what is happening is greatly appreciated. Thanks, Nate

program gcrm_test_io
  implicit none
  include "mpif.h"

  integer X_SIZE

  integer w_me, w_nprocs
  integer  my_info

  integer i
  integer (kind=4) :: ierr
  integer (kind=4) :: fileID

  integer (kind=MPI_OFFSET_KIND):: mylen
  integer (kind=MPI_OFFSET_KIND):: offset
  integer status(MPI_STATUS_SIZE)
  integer count
  integer ncells
  real (kind=4), allocatable, dimension (:) :: array2
  logical sync

  call mpi_init(ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD,w_nprocs,ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD,w_me,ierr)

  call mpi_info_create(my_info, ierr)
! optional ways to set things in mpi-io
! call mpi_info_set   (my_info, "romio_ds_read" , "enable"   , ierr)
! call mpi_info_set   (my_info, "romio_ds_write", "enable"   , ierr)
! call mpi_info_set   (my_info, "romio_cb_write", "enable", ierr)

  x_size = 410011  ! A 'big' number; with bigger numbers it is more likely to fail
  sync = .true.  ! Extra file synchronization

  ncells = (X_SIZE * w_nprocs)

!  Use node zero to fill it with nines
  if (w_me .eq. 0) then
  call MPI_FILE_OPEN (MPI_COMM_SELF, "output.dat", &
                      MPI_MODE_CREATE+MPI_MODE_WRONLY, my_info, fileID, ierr)
  allocate (array2(ncells))
  array2(:) = 9.0
  mylen = ncells
  offset = 0 * 4
  call MPI_FILE_SET_VIEW(fileID, offset, MPI_REAL, MPI_REAL, &
                         "native", MPI_INFO_NULL, ierr)
  call MPI_File_write(fileID, array2, mylen , MPI_REAL, status,ierr)

  call MPI_Get_count(status,MPI_INTEGER, count, ierr)
  if (count .ne. mylen) print*, "Wrong initial write count:", count, mylen
  deallocate(array2)
  if (sync) call MPI_FILE_SYNC (fileID,ierr)
  call MPI_FILE_CLOSE (fileID,ierr)
  endif

!  All nodes now fill their area with ones
  call MPI_BARRIER(MPI_COMM_WORLD,ierr)
  allocate (array2( X_SIZE))
  array2(:) = 1.0
  offset = (w_me * X_SIZE) * 4 ! multiply by four, since it is real*4
  mylen = X_SIZE
  call MPI_FILE_OPEN (MPI_COMM_WORLD, "output.dat", MPI_MODE_WRONLY, &
                      my_info, fileID, ierr)
  print*,"node",w_me,"starting",(offset/4) + 1,"ending",(offset/4)+mylen

  call MPI_FILE_SET_VIEW(fileID, offset, MPI_REAL, MPI_REAL, &
                         "native", MPI_INFO_NULL, ierr)
  call MPI_File_write(fileID, array2, mylen , MPI_REAL, status,ierr)
  call MPI_Get_count(status,MPI_INTEGER, count, ierr)
  if (count .ne. mylen) print*, "Wrong write count:", count,mylen,w_me
  deallocate(array2)
  if (sync) call MPI_FILE_SYNC (fileID,ierr)
  call MPI_FILE_CLOSE (fileID,ierr)

!  Read it back on node zero to see if it is ok data
  if (w_me .eq. 0) then
  call MPI_FILE_OPEN (MPI_COMM_SELF, "output.dat", MPI_MODE_RDONLY, &
                      my_info, fileID, ierr)
  mylen = ncells
  allocate (array2(ncells))
  call MPI_File_read(fileID, array2, mylen , MPI_REAL, status,ierr)
  call MPI_Get_count(status,MPI_INTEGER, count, ierr)
  if (count .ne. mylen) print*, "Wrong read count:", count,mylen
  do i=1,ncells
   if (array2(i) .ne. 1) then
  print*, "ERROR", i,array2(i), ((i-1)*4),
((i-1)*4)/(1024d0*1024d0) ! Index, value, # of good bytes,MB
  goto 999
   end if
  end do
  print*, "All done with nothing wrong"
 999  deallocate(array2)
  call MPI_FILE_CLOSE (fileID,ierr)
  call MPI_file_delete ("output.dat",MPI_INFO_NULL,ierr)
  endif

  call mpi_finalize(ierr)

end program gcrm_test_io

1.3 Open MPI
 node   0 starting       1 ending  410011
 node   1 starting  410012 ending  820022
 node   2 starting  820023 ending 1230033
 node   3 starting 1230034 ending 1640044
 node   4 starting 1640045 ending 2050055
 node   5 starting 2050056 ending 2460066
 All done with nothing wrong


 node   0 starting       1 ending  410011
 node   1 starting  410012 ending  820022
 node   2 starting  820023 ending 1230033
 node   5 starting 2050056 ending 2460066
 node   4 starting 1640045 ending 2050055
 node   3 starting 1230034 ending 1640044
 Wrong write count:  228554  410011   2
 Wrong read count: 1048576  2460066
 ERROR 1048577  0.000E+00
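
For completeness, two details of the reproducer itself that are most likely not
the cause of the short writes, but worth ruling out: the counts are queried with
MPI_Get_count(status, MPI_INTEGER, ...) although the writes use MPI_REAL, and
the count passed to MPI_File_write is the MPI_OFFSET_KIND variable mylen rather
than the default-kind INTEGER the mpif.h bindings declare. Both are probably
harmless here (MPI_INTEGER and MPI_REAL are 4 bytes each with these compilers,
and on little-endian x86_64 the low half of mylen is what the library reads),
but a version of the check with the kinds and datatypes matched, roughly as
below, takes them out of the picture:

  integer :: nwrite   ! default-kind count, as the mpif.h bindings expect

  nwrite = X_SIZE
  call MPI_File_write(fileID, array2, nwrite, MPI_REAL, status, ierr)
! Query the completion count with the same datatype that was written.
  call MPI_Get_count(status, MPI_REAL, count, ierr)
  if (count .ne. nwrite) print*, "Short write on rank", w_me, ":", count, "of", nwrite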

Re: [OMPI users] MPI-IO Inconsistency over Lustre using OMPI 1.3

2009-03-03 Thread Brian Dobbins
Hi Nathan,

  I just ran your code here and it worked fine - CentOS 5 on dual Xeons w/
IB network, and the kernel is 2.6.18-53.1.14.el5_lustre.1.6.5smp.  I used an
OpenMPI 1.3.0 install compiled with Intel 11.0.081 and, independently, one
with GCC 4.1.2.  I tried a few different times with varying numbers of
processors.

  (Both executables were compiled with -O2)

  I'm sure the main OpenMPI guys will have better ideas, but in the meantime
what kernel, OS and compilers are you using?  And does it happen when you
write to a single OST?  Make a directory and try setting the stripe count to
1 (eg, 'lfs setstripe <dir> 1048576 0 1' will give you, I think, a 1MB stripe
size starting at OST 0 with a stripe count of 1.)  I'm just wondering whether
it's something with your hardware, maybe a particular OST, since it seems to
work for me.

  ... Sorry I can't be of more help, but I imagine the regular experts will
chime in shortly.

  Cheers,
  - Brian



[OMPI users] libnuma under ompi 1.3

2009-03-03 Thread Terry Frankcombe
Having just downloaded and installed Open MPI 1.3 with ifort and gcc, I
merrily went off to compile my application.

In my final link with mpif90 I get the error:

/usr/bin/ld: cannot find -lnuma

Adding --showme reveals that

-I/home/terry/bin/Local/include -pthread -I/home/terry/bin/Local/lib

is added to the compile early in the aggregated ifort command, and 

-L/home/terry/bin/Local/lib -lmpi_f90 -lmpi_f77 -lmpi -lopen-rte
-lopen-pal -lpbs -lnuma -ldl -Wl,--export-dynamic -lnsl -lutil -lm -ldl

is added to the end.

I note that when compiling Open MPI itself, -lnuma was visible in the gcc
arguments, with no added -L.

On this system libnuma.so exists in /usr/lib64.  My (somewhat long!)
configure command was

./configure --enable-static --disable-shared
--prefix=/home/terry/bin/Local --enable-picky --disable-heterogeneous
--without-slurm --without-alps --without-xgrid --without-sge
--without-loadleveler --without-lsf F77=ifort


Should mpif90 have bundled a -L/usr/lib64 in there somewhere?

Regards
Terry


-- 
Dr. Terry Frankcombe
Research School of Chemistry, Australian National University
Ph: (+61) 0417 163 509    Skype: terry.frankcombe



Re: [OMPI users] MPI-IO Inconsistency over Lustre using OMPI 1.3

2009-03-03 Thread Nathan Baca
Thanks for the quick reply and suggestions.

I have tried both isolating the output to a single OST and striping across
multiple OSTs; both produce the same result. I have also tried compiling with
multiple versions of both the PathScale and Intel compilers, all with the
same result.

The odd thing is that this seems to work using HP MPI 2.03 compiled with
PathScale 3.2 and Intel 10.1.018. The operating system is XC 3.2.1, which is
essentially RHEL 4.5. The kernel is 2.6.9-67.9hp.7sp.XCsmp, and the Lustre
version is lustre-1.4.11-2.3_0.6_xc3.2.1_k2.6.9_67.9hp.7sp.XCsmp.

Thanks for the info, Nate
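
Since HP MPI's MPI-IO path works while Open MPI 1.3's ROMIO does not, one more
low-effort experiment (a sketch only, building on the mpi_info_set lines
already commented out in the reproducer; whether these hints change the failing
code path in this ROMIO build is an assumption) is to turn off ROMIO's data
sieving and collective buffering for writes and see whether the short writes
persist:

  call mpi_info_create(my_info, ierr)
! Diagnostic hint settings; "enable"/"disable" are the standard ROMIO values.
  call mpi_info_set(my_info, "romio_ds_write", "disable", ierr)
  call mpi_info_set(my_info, "romio_cb_write", "disable", ierr)
  call MPI_FILE_OPEN(MPI_COMM_WORLD, "output.dat", MPI_MODE_WRONLY, &
                     my_info, fileID, ierr)

If disabling either hint makes the counts come out right, that would point at a
specific ROMIO write path rather than at Lustre or the application.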

