On Mar 23, 2010, at 12:06 PM, Junwei Huang wrote:

> I am still using LAM/MPI on an old cluster and wonder if I can get
> some help from this mail list.

Please upgrade to Open MPI if possible. :-)

> Here is the problem. I am using an 18-node cluster; each node has
> 2 CPUs and each CPU supports up to 2 threads. So I assume I can use
> 18*4 processors. When running the following code, an error message
> always pops up for np=30 or np=60.

Depending on your CPU type and application behavior, using hyperthreads may be more of a hindrance than a help.

> But it works fine for np=12, np=1. The error message is always the
> same, something like: one of the processor n15, exit with (0), ip
> 192......,
>
> Here is a part of the code, where n15 exits. All other PEs can
> finish writing the file, except PE15. Then I see the error message
> about n15, and the file written by PE15 is not complete. A quick
> question here: is PE15 necessarily generated by node 15 on the
> cluster? I would appreciate it if anyone could share experiences in
> debugging errors like this.
>
> code:
> ....
> sprintf(p_obsfile,"%s%d",obsfile,my_rank); //my_rank is processor ID, each PE opens a different file

If each MPI process is opening a separate file, then it may not be a file issue that is causing the problem. For example, if each process opens /dev/null, do you have the same problem?

> if ((fp=fopen(p_obsfile,"w"))==NULL)
>     printf("PE_%d: The file %s cannot be opened\n",my_rank,p_obsfile);

I do note that you don't have an escape clause here -- if you fail to open the file, you still fall through and try to write to the file.

> for (int id=loc*my_rank;id<loc*(my_rank+1);id++){ // loc=TotalNum/NumofPE
>     //call a function to calculate U, the function will return the finishing message
>     // no communication is needed among processors
>     for (int j=0;j<NUM;j++)
>         fprintf (fp, "%f\n",U[j]); //output updated U
> }

I think you just want to try standard debugging stuff here -- are you going beyond the end of the U array? And so on. Perhaps try running your app through valgrind, or under a debugger, etc. Do you get core files from the run?
And so on.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
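
For illustration only, here is a minimal sketch of two of the checks suggested in the reply: an escape clause after a failed fopen(), and a guard against reading past the end of the U array. The helper name write_u and its parameters (u_len, num) are hypothetical and do not appear in the original code. The MPI calls are left out so the fragment stands on its own, and it assumes U holds doubles.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (not from the original post): writes the first `num`
 * values of `u`, an array holding `u_len` elements, to `path`, one value per
 * line. Returns 0 on success and -1 on any failure. */
int write_u(const char *path, const double *u, size_t u_len, size_t num)
{
    assert(path != NULL && u != NULL);  /* programmer errors, not runtime ones */

    /* Guard against going beyond the end of the array. */
    if (num > u_len) {
        fprintf(stderr, "refusing to write %zu values from an array of %zu\n",
                num, u_len);
        return -1;
    }

    FILE *fp = fopen(path, "w");
    if (fp == NULL) {
        perror(path);
        return -1;  /* escape clause: never fall through to fprintf(NULL, ...) */
    }

    int rc = 0;
    for (size_t j = 0; j < num; j++) {
        if (fprintf(fp, "%f\n", u[j]) < 0) {  /* e.g. disk full */
            rc = -1;
            break;
        }
    }

    /* fclose() flushes buffered output, so its result matters too. */
    if (fclose(fp) != 0)
        rc = -1;

    return rc;
}
```

In the MPI program, each rank would presumably build its own path as in the quoted sprintf() line and, on a -1 return, report its rank and stop (for example via MPI_Abort(MPI_COMM_WORLD, 1)) rather than continue. Passing "/dev/null" as the path is one way to repeat the /dev/null experiment from the reply.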