Dear Jeff,

Thanks for your help. Unfortunately, after thoroughly examining the entire cluster, I found a bad node with a failed hard drive. That is why the job hung. Also, when a job is launched with one bad node in the machinefile, neither Open MPI nor my program prints any error message, which is why I could not find the cause of the hang.
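In case it helps anyone who hits the same silent hang, here is a rough sketch of a pre-flight check that could be run before mpirun. It is not part of Open MPI. It assumes passwordless ssh to every node, a machinefile with one host per line (optionally followed by something like "slots=4"), and that writing a scratch file is a reasonable test. It only exercises the remote temp directory, so a failed disk mounted elsewhere would not be caught. The function names are my own.

```python
import subprocess
import sys


def parse_machinefile(text):
    """Return the unique host names in machinefile text, in file order."""
    hosts = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        host = line.split()[0]  # ignore trailing fields such as "slots=4"
        if host not in hosts:
            hosts.append(host)
    return hosts


def node_ok(host, timeout=10):
    """True if a scratch file can be created and removed on host via ssh."""
    # Assumption: the remote shell provides mktemp and the temp dir is on
    # the disk we care about.
    remote = 'f=$(mktemp) && echo ok > "$f" && rm -f "$f"'
    cmd = ["ssh", "-o", "BatchMode=yes",
           "-o", f"ConnectTimeout={timeout}", host, remote]
    try:
        result = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL,
                                timeout=timeout + 5)
    except (subprocess.TimeoutExpired, OSError):
        return False  # unreachable node, hung ssh, or ssh not installed
    return result.returncode == 0


def find_bad_nodes(hosts, check=node_ok):
    """Return the hosts for which the check fails."""
    return [h for h in hosts if not check(h)]


if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1]) as fh:
        bad = find_bad_nodes(parse_machinefile(fh.read()))
    for host in bad:
        print("bad node:", host)
    sys.exit(1 if bad else 0)
```

Saved under a name of your choosing and run with the machinefile as its only argument, it prints each node that fails the check and exits non-zero if any do, so it can gate the mpirun command in a job script.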
Best regards

2009/4/22 Jeff Squyres <jsquy...@cisco.com>

> On Apr 21, 2009, at 11:01 AM, Tsung Han Shie wrote:
>
>> I tried to increase speed of a program with openmpi-1.1.3
>
> Did you mean 1.1.3 or 1.3.1?

I mean 1.1.3.

>> by adding following 4 parameters into openmpi-mca-params.conf file.
>>
>> mpi_leave_pinned=1
>> btl_openib_eager_rdma_num=128
>> btl_openib_max_eager_rdma=128
>> btl_openib_eager_limit=1024
>
> If you meant 1.3.1 above, please see the following message about an
> important bug in 1.3 and 1.3.1 with the use of mpi_leave_pinned:
>
> http://www.open-mpi.org/community/lists/announce/2009/03/0029.php
>
>> and then, I ran my program twice (124 processes on 31 nodes), one with
>> "mpi_leave_pinned=1", another with "mpi_leave_pinned=0".
>> All of them were stopped abnormally with "ctrl+c" and "killall -9
>> <program>".
>
> Why -- did they hang?

No. I just ran my program for a few steps to see the speed, and then I killed it.

>> After that, I couldn't start to run that program again.
>
> What exactly was the error?

There were no messages at all.

>> I checked every node with "free -m" and I found that a huge amount of
>> cached memory was used on each node.
>> Could this situation be caused by those 4 parameters? Is there any way to
>> free them?
>
> Probably not.
>
> Can you send all the information listed here:
>
> http://www.open-mpi.org/community/help/
>
> --
> Jeff Squyres
> Cisco Systems