[OMPI users] freezing in mpi_allreduce operation
I am seeing mpi_allreduce operations freeze execution of my code on some moderately-sized problems. The freeze does not manifest itself in every problem. In addition, it is in a portion of the code that is repeated many times. In the problem discussed below, the problem appears in the 60th iteration.

The current test case that I'm looking at is a 64-processor job. This particular mpi_allreduce call applies to all 64 processors, with each communicator in the call containing a total of 4 processors. When I add print statements before and after the offending line, I see that all 64 processors successfully make it to the mpi_allreduce call, but only 32 successfully exit. Stack traces on the other 32 yield something along the lines of the trace listed at the bottom of this message. The call itself looks like:

    call mpi_allreduce(MPI_IN_PLACE, phim(0:(phim_size-1),1:im,1:jm,1:kmloc(coords(2)+1),grp), &
                       phim_size*im*jm*kmloc(coords(2)+1),mpi_real,mpi_sum,ang_com,ierr)

These messages are sized to remain under the 32-bit integer size limitation for the "count" parameter. The intent is to perform the allreduce operation on a contiguous block of the array. Previously, I had been passing an assumed-shape array (i.e. phim(:,:,:,:,grp)), but found some documentation indicating that was potentially dangerous. Making the change from assumed- to explicit-shaped arrays doesn't solve the problem. However, if I declare an additional array and use separate send and receive buffers:

    call mpi_allreduce(phim_local,phim_global,phim_size*im*jm*kmloc(coords(2)+1),mpi_real,mpi_sum,ang_com,ierr)
    phim(:,:,:,:,grp) = phim_global

then the problem goes away and everything works normally. Does anyone have any insight as to what may be happening here? I'm using "include 'mpif.h'" rather than the f90 module; does that potentially explain this?

Thanks,
Greg

Stack trace(s) for thread: 1
  [0] (1 processes)
    main() at ?:?
    solver() at solver.f90:31
    solver_q_down() at solver_q_down.f90:52
    iter() at iter.f90:56
    mcalc() at mcalc.f90:38
    pmpi_allreduce__() at ?:?
    PMPI_Allreduce() at ?:?
    ompi_coll_tuned_allreduce_intra_dec_fixed() at ?:?
    ompi_coll_tuned_allreduce_intra_ring_segmented() at ?:?
    ompi_coll_tuned_sendrecv_actual() at ?:?
    ompi_request_default_wait_all() at ?:?
    opal_progress() at ?:?
Stack trace(s) for thread: 2
  [0] (1 processes)
    start_thread() at ?:?
    btl_openib_async_thread() at ?:?
    poll() at ?:?
Stack trace(s) for thread: 3
  [0] (1 processes)
    start_thread() at ?:?
    service_thread_start() at ?:?
    select() at ?:?
Re: [OMPI users] freezing in mpi_allreduce operation
Note also that coding the mpi_allreduce as:

    call mpi_allreduce(MPI_IN_PLACE,phim(0,1,1,1,grp),phim_size*im*jm*kmloc(coords(2)+1),mpi_real,mpi_sum,ang_com,ierr)

results in the same freezing behavior in the 60th iteration. (I don't recall why the arrays were being passed; possibly just a mistake.)

On Thu, Sep 8, 2011 at 4:17 PM, Greg Fischer wrote:

> I am seeing mpi_allreduce operations freeze execution of my code on some moderately-sized problems. The freeze does not manifest itself in every problem. In addition, it is in a portion of the code that is repeated many times. In the problem discussed below, the problem appears in the 60th iteration.
>
> The current test case that I'm looking at is a 64-processor job. This particular mpi_allreduce call applies to all 64 processors, with each communicator in the call containing a total of 4 processors. When I add print statements before and after the offending line, I see that all 64 processors successfully make it to the mpi_allreduce call, but only 32 successfully exit. Stack traces on the other 32 yield something along the lines of the trace listed at the bottom of this message. The call itself looks like:
>
>     call mpi_allreduce(MPI_IN_PLACE, phim(0:(phim_size-1),1:im,1:jm,1:kmloc(coords(2)+1),grp), &
>                        phim_size*im*jm*kmloc(coords(2)+1),mpi_real,mpi_sum,ang_com,ierr)
>
> These messages are sized to remain under the 32-bit integer size limitation for the "count" parameter. The intent is to perform the allreduce operation on a contiguous block of the array. Previously, I had been passing an assumed-shape array (i.e. phim(:,:,:,:,grp)), but found some documentation indicating that was potentially dangerous. Making the change from assumed- to explicit-shaped arrays doesn't solve the problem. However, if I declare an additional array and use separate send and receive buffers:
>
>     call mpi_allreduce(phim_local,phim_global,phim_size*im*jm*kmloc(coords(2)+1),mpi_real,mpi_sum,ang_com,ierr)
>     phim(:,:,:,:,grp) = phim_global
>
> then the problem goes away and everything works normally. Does anyone have any insight as to what may be happening here? I'm using "include 'mpif.h'" rather than the f90 module; does that potentially explain this?
>
> Thanks,
> Greg
>
> Stack trace(s) for thread: 1
>   [0] (1 processes)
>     main() at ?:?
>     solver() at solver.f90:31
>     solver_q_down() at solver_q_down.f90:52
>     iter() at iter.f90:56
>     mcalc() at mcalc.f90:38
>     pmpi_allreduce__() at ?:?
>     PMPI_Allreduce() at ?:?
>     ompi_coll_tuned_allreduce_intra_dec_fixed() at ?:?
>     ompi_coll_tuned_allreduce_intra_ring_segmented() at ?:?
>     ompi_coll_tuned_sendrecv_actual() at ?:?
>     ompi_request_default_wait_all() at ?:?
>     opal_progress() at ?:?
> Stack trace(s) for thread: 2
>   [0] (1 processes)
>     start_thread() at ?:?
>     btl_openib_async_thread() at ?:?
>     poll() at ?:?
> Stack trace(s) for thread: 3
>   [0] (1 processes)
>     start_thread() at ?:?
>     service_thread_start() at ?:?
>     select() at ?:?
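For anyone wanting to try the two call patterns from this thread in isolation, the sketch below reduces them to a 1-D buffer over MPI_COMM_WORLD. It is illustrative only: the names buf, tmp, and n are not from the poster's code, and "use mpi" is used instead of mpif.h for brevity. It simply exercises the MPI_IN_PLACE form that hangs for the poster and the separate-buffer form reported to work.

    ! Minimal sketch of the two mpi_allreduce call patterns discussed in this
    ! thread, reduced to a 1-D buffer over MPI_COMM_WORLD. Names are illustrative.
    program allreduce_patterns
      use mpi
      implicit none
      integer, parameter :: n = 8
      real    :: buf(n), tmp(n)
      integer :: rank, ierr

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      buf = real(rank + 1)

      ! Pattern 1: in-place reduction on a contiguous buffer (the form that
      ! hangs for the poster in the 60th iteration).
      call MPI_ALLREDUCE(MPI_IN_PLACE, buf, n, MPI_REAL, MPI_SUM, MPI_COMM_WORLD, ierr)

      ! Pattern 2: separate send and receive buffers, then copy back (the
      ! workaround reported to run normally).
      call MPI_ALLREDUCE(buf, tmp, n, MPI_REAL, MPI_SUM, MPI_COMM_WORLD, ierr)
      buf = tmp

      if (rank == 0) print *, 'value after both reductions:', buf(1)
      call MPI_FINALIZE(ierr)
    end program allreduce_patterns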
[OMPI users] OMPI error in MPI_Cart_create (in code that works with MPICH2)
I'm receiving the error posted at the bottom of this message with a code compiled with Intel Fortran/C Version 11.1 against OpenMPI version 1.3.2. The same code works correctly when compiled against MPICH2. (We have re-compiled with OpenMPI to take advantage of newly-installed Infiniband hardware. The "ring" test problem appears to work correctly over Infiniband.)

There are no "fork()" calls in our code, so I can only guess that something weird is going on with MPI_COMM_WORLD. The code in question is a Fortran 90 code. Right now, it is being compiled with "include 'mpif.h'" statements at the beginning of each MPI subroutine, instead of making use of the "mpi" modules. Could this be causing the problem? How else should I go about diagnosing the problem?

Thanks,
Greg

--
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process. Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption. The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

  Local host:          bl316 (PID 26806)
  MPI_COMM_WORLD rank: 0

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--
[bl205:5014] *** An error occurred in MPI_Cart_create
[bl205:5014] *** on communicator MPI_COMM_WORLD
[bl205:5014] *** MPI_ERR_ARG: invalid argument of some other kind
[bl205:5014] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
--
mpirun has exited due to process rank 4 with PID 5010 on node bl205
exiting without calling "finalize". This may have caused other
processes in the application to be terminated by signals sent by
mpirun (as reported here).
--
[bl205:05008] 7 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[bl205:05008] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
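Since the failure is reported inside MPI_Cart_create itself, it may help to see the Fortran argument list that call expects. The sketch below is for reference only (illustrative names, not the poster's code). The MPI standard requires the product of the dims entries not to exceed the size of the parent communicator, so an inconsistent dims array is one thing worth checking, though the thread does not confirm the actual cause here.

    ! Reference-only sketch of a minimal MPI_Cart_create call in Fortran.
    ! All names are illustrative.
    program cart_demo
      use mpi
      implicit none
      integer :: ierr, nprocs, cart_comm
      integer :: dims(2)
      logical :: periods(2), reorder

      call MPI_INIT(ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

      dims    = 0                       ! let MPI choose the factorization
      call MPI_DIMS_CREATE(nprocs, 2, dims, ierr)
      periods = .false.                 ! non-periodic in both dimensions
      reorder = .true.

      call MPI_CART_CREATE(MPI_COMM_WORLD, 2, dims, periods, reorder, cart_comm, ierr)

      call MPI_COMM_FREE(cart_comm, ierr)
      call MPI_FINALIZE(ierr)
    end program cart_demo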
Re: [OMPI users] OMPI error in MPI_Cart_create (in code that works with MPICH2)
Thanks, Jeff. OK, I've found the offending code and gotten rid of the fork() warning. I'm still left with this:

[bl302:26556] *** An error occurred in MPI_Cart_create
[bl302:26556] *** on communicator MPI_COMM_WORLD
[bl302:26556] *** MPI_ERR_ARG: invalid argument of some other kind
[bl302:26556] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
--
mpirun has exited due to process rank 4 with PID 13693 on node bl316
exiting without calling "finalize". This may have caused other
processes in the application to be terminated by signals sent by
mpirun (as reported here).
--
[bl316:13691] 7 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[bl316:13691] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

I'm going to try re-compiling OpenMPI itself with the Intel compilers. Any other ideas?

On Wed, Sep 2, 2009 at 12:01 AM, Jeff Squyres wrote:

> *Something* in your code is calling fork() -- it may be an indirect call such as system() or popen() or somesuch. This particular error message is only printed during a "fork pre-hook" that Open MPI installs during MPI_INIT (registered via pthread_atfork()).
>
> Grep through your code for calls to system and popen -- see if any of these are used.
>
> There is no functional difference between "include 'mpif.h'" and "use mpi" in terms of MPI functionality at run time -- the only difference you get is a "better" level of compile-time protection from the Fortran compiler. Specifically, "use mpi" will introduce strict type checking for many (but not all) of the MPI functions at compile time. Hence, the compiler will complain if you forget an IERR parameter to an MPI function, for example.
>
> "use mpi" is not perfect, though -- there are many well-documented problems because of the design of the MPI-2 Fortran 90 interface (which are currently being addressed in MPI-3, if you care :-) ). More generally: "use mpi" will catch *many* compile errors, but not *all* of them.
>
> But to answer your question succinctly: this problem won't be affected by using "use mpi" or "include 'mpif.h'".
>
> On Sep 1, 2009, at 9:02 PM, Greg Fischer wrote:
>
>> I'm receiving the error posted at the bottom of this message with a code compiled with Intel Fortran/C Version 11.1 against OpenMPI version 1.3.2. The same code works correctly when compiled against MPICH2. (We have re-compiled with OpenMPI to take advantage of newly-installed Infiniband hardware. The "ring" test problem appears to work correctly over Infiniband.)
>>
>> There are no "fork()" calls in our code, so I can only guess that something weird is going on with MPI_COMM_WORLD. The code in question is a Fortran 90 code. Right now, it is being compiled with "include 'mpif.h'" statements at the beginning of each MPI subroutine, instead of making use of the "mpi" modules. Could this be causing the problem? How else should I go about diagnosing the problem?
>>
>> Thanks,
>> Greg
>>
>> --
>> An MPI process has executed an operation involving a call to the
>> "fork()" system call to create a child process. Open MPI is currently
>> operating in a condition that could result in memory corruption or
>> other system errors; your MPI job may hang, crash, or produce silent
>> data corruption. The use of fork() (or system() or other calls that
>> create child processes) is strongly discouraged.
>>
>> The process that invoked fork was:
>>
>>   Local host:          bl316 (PID 26806)
>>   MPI_COMM_WORLD rank: 0
>>
>> If you are *absolutely sure* that your application will successfully
>> and correctly survive a call to fork(), you may disable this warning
>> by setting the mpi_warn_on_fork MCA parameter to 0.
>> --
>> [bl205:5014] *** An error occurred in MPI_Cart_create
>> [bl205:5014] *** on communicator MPI_COMM_WORLD
>> [bl205:5014] *** MPI_ERR_ARG: invalid argument of some other kind
>> [bl205:5014] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>> --
>> mpirun has exited due to process rank 4
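To make Jeff's point about compile-time checking concrete, here is a small hypothetical example (not from the poster's code). With "use mpi" the compiler sees explicit interfaces for most routines, so forgetting the trailing ierr argument is caught at compile time; with "include 'mpif.h'" the same mistake compiles silently and only misbehaves at run time.

    ! Hypothetical illustration of the "use mpi" interface checking Jeff describes.
    ! With "use mpi", a call such as
    !     call MPI_COMM_RANK(MPI_COMM_WORLD, rank)     ! ierr forgotten
    ! is rejected at compile time; with "include 'mpif.h'" it compiles anyway.
    program use_mpi_check
      use mpi
      implicit none
      integer :: rank, ierr

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)   ! full argument list
      if (rank == 0) print *, 'compiled with interface checking via "use mpi"'
      call MPI_FINALIZE(ierr)
    end program use_mpi_check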
[OMPI users] error compiling OpenMPI 1.3.3 with Intel compiler suite 11.1 on Linux
I'm attempting to compile OpenMPI version 1.3.3 with Intel C/C++/Fortran version 11.1.046. Others have reported success using these compilers (http://software.intel.com/en-us/forums/intel-c-compiler/topic/68111/). The line where compilation fails is included at the end of this message. I have also attached complete "./configure" and "make" outputs. Does anyone have any insight as to what I'm doing wrong?

Thanks,
Greg

icpc11.1 -DHAVE_CONFIG_H -I. -I../../../opal/include -I../../../orte/include -I../../../ompi/include -I../../../opal/mca/paffinity/linux/plpa/src/libplpa -DOMPI_CONFIGURE_USER="\"fischega\"" -DOMPI_CONFIGURE_HOST="\"susedev1\"" -DOMPI_CONFIGURE_DATE="\"Fri Sep 4 09:53:03 EDT 2009\"" -DOMPI_BUILD_USER="\"$USER\"" -DOMPI_BUILD_HOST="\"`hostname`\"" -DOMPI_BUILD_DATE="\"`date`\"" -DOMPI_BUILD_CFLAGS="\"-O3 -DNDEBUG -finline-functions -fno-strict-aliasing -restrict -pthread -fvisibility=hidden\"" -DOMPI_BUILD_CPPFLAGS="\"-I../../.. \"" -DOMPI_BUILD_CXXFLAGS="\"-O3 -DNDEBUG -finline-functions -pthread\"" -DOMPI_BUILD_CXXCPPFLAGS="\"-I../../.. \"" -DOMPI_BUILD_FFLAGS="\"\"" -DOMPI_BUILD_FCFLAGS="\"\"" -DOMPI_BUILD_LDFLAGS="\"-export-dynamic \"" -DOMPI_BUILD_LIBS="\"-lnsl -lutil \"" -DOMPI_CC_ABSOLUTE="\"/usr/scripts/icc11.1\"" -DOMPI_CXX_ABSOLUTE="\"/usr/scripts/icpc11.1\"" -DOMPI_F77_ABSOLUTE="\"/usr/scripts/ifort11.1\"" -DOMPI_F90_ABSOLUTE="\"/usr/scripts/ifort11.1\"" -DOMPI_F90_BUILD_SIZE="\"small\"" -I../../..-O3 -DNDEBUG -finline-functions -pthread -MT components.o -MD -MP -MF $depbase.Tpo -c -o components.o components.cc &&\
mv -f $depbase.Tpo $depbase.Po
icpc: error #10236: File not found: 'Sep'
icpc: error #10236: File not found: '4'
icpc: error #10236: File not found: '09:53:03'
icpc: error #10236: File not found: 'EDT'
icpc: error #10236: File not found: '2009"'
icpc: error #10236: File not found: 'Sep'
icpc: error #10236: File not found: '4'
icpc: error #10236: File not found: '10:11:04'
icpc: error #10236: File not found: 'EDT'
icpc: error #10236: File not found: '2009"'
icpc: command line warning #10159: invalid argument for option '-fvisibility'
icpc: error #10236: File not found: '"'
icpc: command line warning #10156: ignoring option '-p'; no argument required
icpc: error #10236: File not found: '"'
icpc: error #10236: File not found: '"'
icpc: error #10236: File not found: '"'
make[2]: *** [components.o] Error 1
make[2]: Leaving directory `/home/fischega/src/openmpi-1.3.3/ompi/tools/ompi_info'

[Attachment: ompi-output.tar.bz2 -- BZip2 compressed data]
Re: [OMPI users] error compiling OpenMPI 1.3.3 with Intel compiler suite 11.1 on Linux
Yep, that was it. The icpc11.1, ifort11.1, and icc11.1 scripts are included in the tar file attached to my original email. They set the PATH, LD_LIBRARY_PATH, and INTEL_LICENSE_FILE correctly. When I set the environment variables manually and use the regular icpc, ifort, and icc commands, it works fine. Good catch!

Thanks,
Greg

On Fri, Sep 4, 2009 at 11:54 PM, Jeff Squyres wrote:

> Can you clarify what icpc11.1 is? Is it a sym link to the icpc 11.1 compiler, or is it a shell script that ends up invoking the icpc v11.1 compiler?
>
> I ask because the compile line in question ends up with a complex quoting scheme that includes a token with spaces in it:
>
>    -DOMPI_CONFIGURE_DATE="\"Fri Sep 4 09:53:03 EDT 2009\""
>
> If icpc11.1 is a shell script that ends up invoking the real icpc compiler underneath, I could see how the quoting might get screwed up and end up passing "Sep" (and following) as individual tokens rather than One Big Token (including quotes).
>
> That's just a first guess -- can you check to see if this is happening?
>
> On Sep 4, 2009, at 5:28 PM, Greg Fischer wrote:
>
>> I'm attempting to compile OpenMPI version 1.3.3 with Intel C/C++/Fortran version 11.1.046. Others have reported success using these compilers (http://software.intel.com/en-us/forums/intel-c-compiler/topic/68111/). The line where compilation fails is included at the end of this message. I have also attached complete "./configure" and "make" outputs. Does anyone have any insight as to what I'm doing wrong?
>>
>> Thanks,
>> Greg
>>
>> icpc11.1 -DHAVE_CONFIG_H -I. -I../../../opal/include -I../../../orte/include -I../../../ompi/include -I../../../opal/mca/paffinity/linux/plpa/src/libplpa -DOMPI_CONFIGURE_USER="\"fischega\"" -DOMPI_CONFIGURE_HOST="\"susedev1\"" -DOMPI_CONFIGURE_DATE="\"Fri Sep 4 09:53:03 EDT 2009\"" -DOMPI_BUILD_USER="\"$USER\"" -DOMPI_BUILD_HOST="\"`hostname`\"" -DOMPI_BUILD_DATE="\"`date`\"" -DOMPI_BUILD_CFLAGS="\"-O3 -DNDEBUG -finline-functions -fno-strict-aliasing -restrict -pthread -fvisibility=hidden\"" -DOMPI_BUILD_CPPFLAGS="\"-I../../.. \"" -DOMPI_BUILD_CXXFLAGS="\"-O3 -DNDEBUG -finline-functions -pthread\"" -DOMPI_BUILD_CXXCPPFLAGS="\"-I../../.. \"" -DOMPI_BUILD_FFLAGS="\"\"" -DOMPI_BUILD_FCFLAGS="\"\"" -DOMPI_BUILD_LDFLAGS="\"-export-dynamic \"" -DOMPI_BUILD_LIBS="\"-lnsl -lutil \"" -DOMPI_CC_ABSOLUTE="\"/usr/scripts/icc11.1\"" -DOMPI_CXX_ABSOLUTE="\"/usr/scripts/icpc11.1\"" -DOMPI_F77_ABSOLUTE="\"/usr/scripts/ifort11.1\"" -DOMPI_F90_ABSOLUTE="\"/usr/scripts/ifort11.1\"" -DOMPI_F90_BUILD_SIZE="\"small\"" -I../../..-O3 -DNDEBUG -finline-functions -pthread -MT components.o -MD -MP -MF $depbase.Tpo -c -o components.o components.cc &&\
>> mv -f $depbase.Tpo $depbase.Po
>> icpc: error #10236: File not found: 'Sep'
>> icpc: error #10236: File not found: '4'
>> icpc: error #10236: File not found: '09:53:03'
>> icpc: error #10236: File not found: 'EDT'
>> icpc: error #10236: File not found: '2009"'
>> icpc: error #10236: File not found: 'Sep'
>> icpc: error #10236: File not found: '4'
>> icpc: error #10236: File not found: '10:11:04'
>> icpc: error #10236: File not found: 'EDT'
>> icpc: error #10236: File not found: '2009"'
>> icpc: command line warning #10159: invalid argument for option '-fvisibility'
>> icpc: error #10236: File not found: '"'
>> icpc: command line warning #10156: ignoring option '-p'; no argument required
>> icpc: error #10236: File not found: '"'
>> icpc: error #10236: File not found: '"'
>> icpc: error #10236: File not found: '"'
>> make[2]: *** [components.o] Error 1
>> make[2]: Leaving directory `/home/fischega/src/openmpi-1.3.3/ompi/tools/ompi_info'
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
[OMPI users] best way to ALLREDUCE multi-dimensional arrays in Fortran?
(I apologize in advance for the simplistic/newbie question.)

I'm performing an ALLREDUCE operation on a multi-dimensional array. This operation is the biggest bottleneck in the code, and I'm wondering if there's a way to do it more efficiently than what I'm doing now. Here's a representative example of what's happening:

    ir=1
    do ikl=1,km
      do ij=1,jm
        do ii=1,im
          albuf(ir)=array(ii,ij,ikl,nl,0,ng)
          ir=ir+1
        enddo
      enddo
    enddo
    agbuf=0.0
    call mpi_allreduce(albuf,agbuf,im*jm*kmloc(coords(2)+1),mpi_real,mpi_sum,ang_com,ierr)
    ir=1
    do ikl=1,km
      do ij=1,jm
        do ii=1,im
          phim(ii,ij,ikl,nl,0,ng)=agbuf(ir)
          ir=ir+1
        enddo
      enddo
    enddo

Is there any way to just do this in one fell swoop, rather than buffering, transmitting, and unbuffering? This operation is looped over many times. Are there savings to be had here?

Thanks,
Greg
Re: [OMPI users] best way to ALLREDUCE multi-dimensional arrays in Fortran?
It looks like the buffering operations consume about 15% as much time as the allreduce operations. Not huge, but not trivial, all the same. Is there any way to avoid the buffering step?

On Thu, Sep 24, 2009 at 6:03 PM, Eugene Loh wrote:

> Greg Fischer wrote:
>
> (I apologize in advance for the simplistic/newbie question.)
>
> I'm performing an ALLREDUCE operation on a multi-dimensional array. This operation is the biggest bottleneck in the code, and I'm wondering if there's a way to do it more efficiently than what I'm doing now. Here's a representative example of what's happening:
>
>     ir=1
>     do ikl=1,km
>       do ij=1,jm
>         do ii=1,im
>           albuf(ir)=array(ii,ij,ikl,nl,0,ng)
>           ir=ir+1
>         enddo
>       enddo
>     enddo
>     agbuf=0.0
>     call mpi_allreduce(albuf,agbuf,im*jm*kmloc(coords(2)+1),mpi_real,mpi_sum,ang_com,ierr)
>     ir=1
>     do ikl=1,km
>       do ij=1,jm
>         do ii=1,im
>           phim(ii,ij,ikl,nl,0,ng)=agbuf(ir)
>           ir=ir+1
>         enddo
>       enddo
>     enddo
>
> Is there any way to just do this in one fell swoop, rather than buffering, transmitting, and unbuffering? This operation is looped over many times. Are there savings to be had here?
>
> There are three steps here: buffering, transmitting, and unbuffering. Any idea how the run time is distributed among those three steps? E.g., if most time is spent in the MPI call, then combining all three steps into one is unlikely to buy you much... and might even hurt. If most of the time is spent in the MPI call, then there may be some tuning of collective algorithms to do. I don't have any experience doing this with OMPI. I'm just saying it makes some sense to isolate the problem a little bit more.
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
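One hedged option is to skip the pack/unpack loops entirely and hand the collective a contiguous block of the array, optionally combined with MPI_IN_PLACE so no second array is needed. This only applies if the loops cover the full declared extents of the leading dimensions, so the block really is one contiguous run of memory; the names and sizes below are illustrative, not the poster's. Note that the first thread in this archive reports a hang with the MPI_IN_PLACE form on one Open MPI build, so the separate-buffer variant may still be the safer fallback in practice.

    ! Hedged sketch: reduce a contiguous block of a multi-dimensional array
    ! without copying it into albuf/agbuf first. Assumes the leading dimensions
    ! are fully covered so the block is contiguous. Shapes are illustrative.
    program allreduce_block
      use mpi
      implicit none
      integer, parameter :: im = 4, jm = 3, km = 2, ng = 2
      real    :: phi(im, jm, km, ng)
      integer :: ierr, g

      call MPI_INIT(ierr)
      phi = 1.0
      g   = 1

      ! One call reduces the whole (im,jm,km) block for group g in place.
      call MPI_ALLREDUCE(MPI_IN_PLACE, phi(1,1,1,g), im*jm*km, MPI_REAL, &
                         MPI_SUM, MPI_COMM_WORLD, ierr)

      call MPI_FINALIZE(ierr)
    end program allreduce_block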
[OMPI users] strange performance fluctuations and problems with mpif90-vt
I'm seeing some sporadic strange behavior in one of our MPI codes. Here are selected portions of the output:

  ----------------------------------------------------------------------------
  |   |   |im |jm |km |  phi0   |          | iter | sync |mcalc |          |
  |grp|itn|loc|loc|loc|Max Error|    NSR   |t(sec)|t(sec)|t(sec)|  sysbal  |
  ----------------------------------------------------------------------------
     1   2    1    1   9 1.000E+00 1.000E+00 16.789 15.923  0.079 1.00E+00
     1   3    1    1   5 1.000E+00 1.000E+00 16.800 15.935  0.078 1.00E+00
     1   4    1    1   1 1.000E+00 1.000E+00 17.500 15.906  0.079 1.00E+00
   ...
    11   7   18  118  84 1.485E-01 1.117E+00 16.600 15.929  0.077 1.00E+00
    11   8   20  124  84 1.516E-01 1.021E+00 16.600 15.929  0.077 1.00E+00
    11   9   21  127  86 1.596E-01 1.053E+00  1.253  0.450  0.083 1.00E+00
    11  10    7  131  88 1.290E-01 8.083E-01  0.808  0.014  0.272 1.00E+00
    11  11    7  131  85 8.267E-02 6.408E-01  1.000  0.002  0.262 1.00E+00
   ...
   101  10   25  111  77 5.690E-02 8.179E-01  0.480  0.023  0.087 1.00E+00
   101  11   32  113  77 4.782E-02 8.404E-01  0.479  0.023  0.087 1.00E+00
   101  12   37  116  79 4.330E-02 9.055E-01  0.479  0.023  0.087 1.00E+00

This is an iterative calculation. The critical quantity of interest is "iter t(sec)", which is the time per iteration. (The other "t(sec)" quantities are subsets of "iter t(sec)".) Between "grp" 1 and 111, the calculation is not becoming appreciably more or less difficult, yet there is a factor of ~30 difference in performance between the beginning and the end.

This problem does not appear all of the time. In many cases, performance is good throughout the entire calculation. ("Good", here, is being defined as what is seen in grp 101 above, which is roughly what I expect to be seeing.) However, when the problem does appear, it seems to mysteriously go away after grinding through the calculation for a while. Has anyone ever seen behavior like this? Any thoughts as to what could be causing it?

I tried to recompile the code with mpif90-vt and mpicc-vt, in hopes that the VampirTrace output might shed some light on the true nature of the problem. After recompiling, the code complains:

[lx102:15254] *** An error occurred in MPI_Cart_create
[lx102:15254] *** on communicator MPI_COMM_WORLD
[lx102:15254] *** MPI_ERR_ARG: invalid argument of some other kind
[lx102:15254] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)

...and then crashes out before doing anything useful. My understanding is that I only need to use the -vt compiler wrappers, and they will automatically "instrument" my code. Is there something else I should be doing?

Thanks,
Greg
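If the "sync t(sec)" column measures time spent waiting at a synchronization point, its dominance in the slow iterations suggests ranks are mostly waiting on each other rather than communicating slowly. As a first diagnostic that does not require the -vt wrappers, one hedged option is to split the measurement with MPI_WTIME and an explicit barrier, as in the illustrative sketch below (names and buffer size are made up, not the poster's instrumentation): a large barrier time points at load imbalance or a straggler rank, while a large time inside the collective itself points at the communication layer.

    ! Illustrative sketch: separate "waiting for other ranks" from "time inside
    ! the collective", in the spirit of the sync/mcalc split in the table above.
    program time_split
      use mpi
      implicit none
      integer :: ierr, rank
      real    :: buf(1024)
      double precision :: t0, t1, t2, t_sync, t_coll

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      buf = 1.0

      t0 = MPI_WTIME()
      call MPI_BARRIER(MPI_COMM_WORLD, ierr)           ! time waiting on stragglers
      t1 = MPI_WTIME()
      call MPI_ALLREDUCE(MPI_IN_PLACE, buf, size(buf), MPI_REAL, MPI_SUM, &
                         MPI_COMM_WORLD, ierr)         ! time in the collective itself
      t2 = MPI_WTIME()

      t_sync = t1 - t0
      t_coll = t2 - t1
      if (rank == 0) print '(a,f10.6,a,f10.6)', 'sync(s)=', t_sync, '  coll(s)=', t_coll

      call MPI_FINALIZE(ierr)
    end program time_split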
[OMPI users] MPI-IO: reading an unformatted binary fortran file
Hello,

I'm attempting to wrap my brain around the MPI I/O mechanisms, and I was hoping to find some guidance. I'm trying to read a file that contains a 117-character string, followed by a series of records that contain integers and reals. The following code would read it in serial:

---
      character(len=117) :: cfx1
      read (nin) cfx1
      do i=1,end_of_file
        read(nin) integer1,integer2,real1,real2,real3,real4,real5,real6,real7
      enddo
---

To simplify the problem, I removed the "cfx1" string from the file I'm reading, and created an MPI_TYPE_STRUCT as follows:

---
      length( 1 ) = 1
      length( 2 ) = 2
      length( 3 ) = 7
      length( 3 ) = 1
      disp( 1 ) = 0
      disp( 2 ) = sizeof( MPI_LB )
      disp( 3 ) = disp( 2 ) + 2*sizeof(MPI_INTEGER)
      disp( 4 ) = disp( 3 ) + 7*sizeof(MPI_REAL)
      type( 1 ) = MPI_LB
      type( 2 ) = MPI_INTEGER
      type( 3 ) = MPI_REAL
      type( 4 ) = MPI_UB
      call MPI_TYPE_STRUCT( 4, length, disp, type, sptype, ierr )
      call MPI_TYPE_COMMIT( sptype, ierr )
---

I then open the file, set the view as follows, and try to do a read:

---
      mode = MPI_MODE_RDONLY
      call MPI_FILE_OPEN( MPI_COMM_WORLD, filename, mode,
     +                    MPI_INFO_NULL, fh, ierr )
      offset = 0
      call MPI_FILE_SET_VIEW( fh, offset, sptype,
     +                        sptype, 'native', MPI_INFO_NULL, ierr )
      call MPI_FILE_READ( fh, sourcepart, 1, sptype,
     +                    status, ierr )
---

where "sourcepart" is:

---
      type source_particle_datatype
        integer :: ipt,idm
        real :: xxx,yyy,zzz,uuu,vvv,www,erg
      end type
---

This almost works. With some fiddling (I can't seem to make it work right now), I'm able to get most of the reals and integers into "sourcepart", but something doesn't line up quite correctly. I've spent a lot of time looking at the documentation and tutorials on the web, but haven't found a resource that helps me work through this problem.

Ultimately, the objective will be to allow an arbitrary number of processes to read this file, with each record being uniquely read by a single process (e.g. process 1 reads record 1, process 2 reads record 2, process 1 reads record 3, process 2 reads record 4, etc.). What's the best way to skin this cat? Any assistance would be greatly appreciated.

Thanks,
Greg
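One detail worth checking when things "don't quite line up": a Fortran unformatted sequential file normally stores a record-length marker before and after every record (commonly 4 bytes, though this is compiler dependent), and a datatype built only from the two integers and seven reals will not account for those markers. Below is a hedged, illustrative sketch of the round-robin distribution described above, using MPI_FILE_READ_AT with explicit byte offsets instead of a file view; the file name, record count, marker size, and variable names are all assumptions, not taken from the post.

    ! Hedged sketch: each rank reads every nprocs-th record of the file with
    ! explicit offsets. Assumes 4-byte record markers and 4-byte default
    ! INTEGER/REAL; verify both against the compiler that wrote the file.
    program read_records
      use mpi
      implicit none
      integer, parameter :: marker   = 4                     ! assumed record-marker size (bytes)
      integer, parameter :: recbytes = 2*4 + 7*4 + 2*marker  ! 2 integers + 7 reals + both markers
      integer, parameter :: hdrbytes = 117 + 2*marker        ! leading character(117) record
      integer, parameter :: nrec     = 1000                  ! assumed number of data records
      integer :: fh, ierr, rank, nprocs, irec, ipt, idm
      integer :: status(MPI_STATUS_SIZE)
      real    :: vals(7)                                     ! xxx,yyy,zzz,uuu,vvv,www,erg
      integer(kind=MPI_OFFSET_KIND) :: offset

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

      ! 'source.bin' is a hypothetical file name.
      call MPI_FILE_OPEN(MPI_COMM_WORLD, 'source.bin', MPI_MODE_RDONLY, &
                         MPI_INFO_NULL, fh, ierr)

      ! Rank r reads records r, r+nprocs, r+2*nprocs, ...
      do irec = rank, nrec - 1, nprocs
         offset = hdrbytes + int(irec, MPI_OFFSET_KIND)*recbytes + marker
         call MPI_FILE_READ_AT(fh, offset,     ipt,  1, MPI_INTEGER, status, ierr)
         call MPI_FILE_READ_AT(fh, offset + 4, idm,  1, MPI_INTEGER, status, ierr)
         call MPI_FILE_READ_AT(fh, offset + 8, vals, 7, MPI_REAL,    status, ierr)
      enddo

      call MPI_FILE_CLOSE(fh, ierr)
      call MPI_FINALIZE(ierr)
    end program read_records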