Dear all,

I am not sure if this is the right forum to ask this question, so I
apologize if it is not. I am using ScaLAPACK and, of course, MPI (Open
MPI) in an electromagnetic solver program running on a cluster. I see
very strange behavior when I use a large number of processors to run my
code on very large problems. In these cases the program finishes its
work successfully, but it then hangs until the wall time limit is
exceeded and the job is terminated by the queue manager (I use qsub to
submit jobs). This happens when, for example, I use more than 80
processors for a problem which needs more than 700 GB of memory. For
smaller problems everything is OK and all output files are generated
correctly, whereas when this happens the output files are empty. I am
almost sure that there is a synchronization problem and some processes
fail to reach the finalization point while others are already done.
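
To test that hypothesis, I am thinking of adding a per-rank trace just
before MPI_Finalize, along the lines of the sketch below (the names
here are only illustrative and not taken from my actual code):

#include <mpi.h>
#include <iostream>

// Sketch: each rank reports when it reaches the finalization point,
// so a rank that never arrives shows up as a missing line, and the
// barrier makes everyone else wait for it.
int main(int argc, char** argv)
{
        MPI_Init(&argc, &argv);

        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        // ... solver work would go here ...

        std::cerr << "Rank " << rank << " reached the finalization point\n"
                  << std::flush;
        MPI_Barrier(MPI_COMM_WORLD);   // hangs here if some rank is stuck

        MPI_Finalize();
        return 0;
}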

My code is written in C++ and in the "main" function I call a routine
named "Solver". My Solver function looks roughly like this:

void Solver()
{
        for (std::vector<double>::iterator ti=times.begin();
             ti!=times.end(); ++ti)
        {
                Stopwatch iwatch, dwatch, twatch;

                // some ScaLAPACK operations

                if (iamroot())
                {
                        // some operations only for the root process
                }
        }

        blacs::gridexit(ictxt);
        blacs::exit(1);
}

and my "main" function which calls "Solver" looks like below:


int main()
{
        // some preparing operations

        Solver();

        if (rank==0)
                std::cout << "Total execution time: " << time.tick()
                          << " s\n" << std::flush;

        err=MPI_Finalize();

        if (MPI_SUCCESS!=err)
        {
                std::cerr << "MPI_Finalize failed: " << err << "\n";
                return err;
        }

        return 0;
}

I did put a "blacs::barrier(ictxt, 'A')" at the end of the "Solver"
routine, before calling "blacs::exit(1)", to make sure that all
processes arrive there before MPI_Finalize, but that did not solve the
problem. Do you have any idea where the problem is?
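
To be concrete about the placement, the end of "Solver" now looks like
this (a sketch; the comment on the exit argument is my understanding of
the BLACS convention, where a non-zero value means MPI is left running
so that MPI_Finalize can still be called in "main"):

        // end of Solver(): wait for every process in the grid
        blacs::barrier(ictxt, 'A');
        blacs::gridexit(ictxt);
        blacs::exit(1);   // non-zero: do not shut down MPI here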

Thanks in advance,


-- 
Danesh Daroui
Ph.D Student
Lulea University of Technology
http://www.ltu.se

danesh.dar...@ltu.se
+46-704-399847
