On Mar 23, 2010, at 12:06 PM, Junwei Huang wrote:

> I am still using LAM/MPI on an old cluster and wonder if I can get
> some help from this mail list.

Please upgrade to Open MPI if possible.  :-)

> Here is the problem. I am using an 18-node
> cluster; each node has 2 CPUs and each CPU supports up to 2
> threads. So I assume I can use 18*4 processors. When running
> the following code, an error message always pops up for np=30 or
> np=60.

Depending on your CPU type and application behavior, using hyperthreads may be
more of a hindrance than a help.

> But it works fine for np=12 and np=1. The error message is always the
> same, something like: one of the processor n15, exit with (0), ip
> 192......,
> 
> Here is the part of the code where n15 exits. All other PEs can
> finish writing their files, except PE15. Then I see the error message
> about n15, and the file written by PE15 is incomplete. A quick
> question here: is PE15 necessarily generated by node 15 on the
> cluster? I would appreciate it if anyone could share experiences in
> debugging errors like this.
> 
> code:
> ....
> // my_rank is the processor ID; each PE opens a different file
> sprintf(p_obsfile,"%s%d",obsfile,my_rank);

If each MPI process is opening a separate file, then it may not be a file issue 
that is causing the problem.  For example, if each process opens /dev/null, do 
you have the same problem?
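
A minimal sketch of that experiment, assuming a hypothetical helper
build_obs_path() and a hypothetical use_devnull switch (neither is in your
code): build the path exactly as before, or substitute /dev/null, and leave
the rest of the program untouched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: fill dst with the per-rank file name, or with
   "/dev/null" when use_devnull is non-zero.
   Returns 0 on success, -1 if the name did not fit into dst. */
int build_obs_path(char *dst, size_t n, const char *obsfile,
                   int my_rank, int use_devnull)
{
    assert(dst != NULL && obsfile != NULL && n > 0);

    int len;
    if (use_devnull)
        len = snprintf(dst, n, "/dev/null");
    else
        len = snprintf(dst, n, "%s%d", obsfile, my_rank);

    /* snprintf reports the length it wanted; anything >= n was truncated. */
    return (len < 0 || (size_t)len >= n) ? -1 : 0;
}
```

If the failure at np=30 or np=60 still shows up with every process writing to
/dev/null, the per-rank files are probably not the cause.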

>         if ((fp=fopen(p_obsfile,"w"))==NULL)
>                 printf("PE_%d: The file %s cannot be opened\n",my_rank,p_obsfile);

I do note that you don't have an escape clause here -- if you fail to open the 
file, you still fall through and try to write to the file.
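
Something along these lines would give you the escape clause.  This is only a
sketch: open_obs_file() is a hypothetical name, and it reuses your p_obsfile
and my_rank variables as parameters.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: open the per-rank output file for writing.
   Returns the stream on success.  On failure it reports the reason on
   stderr and returns NULL, so the caller can stop instead of passing a
   NULL stream to fprintf. */
FILE *open_obs_file(const char *p_obsfile, int my_rank)
{
    assert(p_obsfile != NULL);

    FILE *fp = fopen(p_obsfile, "w");
    if (fp == NULL) {
        fprintf(stderr, "PE_%d: The file %s cannot be opened: %s\n",
                my_rank, p_obsfile, strerror(errno));
    }
    return fp;
}
```

In the MPI program itself, the caller would check for NULL and bail out, for
example with MPI_Abort(MPI_COMM_WORLD, 1), rather than entering the write loop.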

>         // loc=TotalNum/NumofPE
>         for (int id=loc*my_rank;id<loc*(my_rank+1);id++){
>                 // call a function to calculate U; the function will return the
>                 // finishing message
>                 // no communication is needed among processors
>                 for (int j=0;j<NUM;j++)
>                         fprintf (fp, "%f\n",U[j]); //output updated U
>         }

I think you just want to try standard debugging stuff here -- for example, are
you going beyond the end of the U array?  Perhaps try running your app through
valgrind or under a debugger.  Do you get corefiles from the run?
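
As one cheap check before reaching for valgrind, here is a sketch of a write
loop that refuses to run past the end of the array and also checks the result
of every fprintf.  It assumes U holds doubles and that you know how many
elements were allocated; write_values() and u_len are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: write the first `count` values of u, one per line.
   u_len is the number of elements actually allocated for u.
   Returns 0 on success, -1 if count would run past the end of u or if
   any fprintf call fails. */
int write_values(FILE *fp, const double *u, size_t u_len, size_t count)
{
    assert(fp != NULL && u != NULL);

    if (count > u_len) {
        fprintf(stderr, "refusing to write %zu values from an array of %zu\n",
                count, u_len);
        return -1;
    }
    for (size_t j = 0; j < count; j++) {
        if (fprintf(fp, "%f\n", u[j]) < 0)
            return -1;   /* write error, e.g. disk full or bad stream */
    }
    return 0;
}
```

Calling it as write_values(fp, U, u_len, NUM) in place of the inner loop would
make an out-of-bounds NUM show up as an error message rather than as a process
that silently dies.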

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

