Dear Jeff,

Thanks for your help.
Unfortunately, after I thoroughly examined the entire cluster, I found a bad
node with a busted hard drive. That is why the job hung.
Also, when the job is submitted with one bad node in the machinefile, neither
Open MPI nor my program gives me any error message. That is why I couldn't
find the reason the job hung.
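
In case it helps anyone who hits the same silent hang, below is a minimal
sketch of a pre-flight check that could be run before mpirun. It is not part
of Open MPI. It assumes passwordless ssh to every node, a machinefile with one
host per line (optionally followed by "slots=N"), and that a trivial write
under /tmp is enough to expose a dead node or a failed disk. The function
names and the /tmp/.mpi_precheck file name are made up for illustration.

```python
import subprocess
import sys


def parse_machinefile(text):
    """Return unique host names, in order, skipping blank lines and # comments."""
    hosts = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            hosts.append(line.split()[0])  # drop "slots=N" and similar fields
    return list(dict.fromkeys(hosts))


def ssh_probe(host, timeout=10):
    """Hypothetical probe: True if a small remote write succeeds within the timeout."""
    cmd = [
        "ssh", "-o", "BatchMode=yes", "-o", f"ConnectTimeout={timeout}", host,
        "touch /tmp/.mpi_precheck && rm /tmp/.mpi_precheck",
    ]
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout + 5,
        )
        return result.returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        return False


def find_bad_nodes(hosts, probe=ssh_probe):
    """Return the hosts for which the probe fails."""
    return [host for host in hosts if not probe(host)]


if __name__ == "__main__":
    # Assumed default file name; pass another path as the first argument.
    path = sys.argv[1] if len(sys.argv) > 1 else "machinefile"
    with open(path) as f:
        bad = find_bad_nodes(parse_machinefile(f.read()))
    print("bad nodes:", ", ".join(bad) if bad else "none")
```

The probe is passed in as a parameter so that a different check (for example
a write to the scratch directory the job actually uses) can be swapped in
without touching the rest.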

Best regards

2009/4/22 Jeff Squyres <jsquy...@cisco.com>

> On Apr 21, 2009, at 11:01 AM, Tsung Han Shie wrote:
>
>  I tried to increase the speed of a program with openmpi-1.1.3
>>
>
> Did you mean 1.1.3 or 1.3.1?

    I mean 1.1.3.

>
>  by adding the following 4 parameters to the openmpi-mca-params.conf file.
>>
>> mpi_leave_pinned=1
>> btl_openib_eager_rdma_num=128
>> btl_openib_max_eager_rdma=128
>> btl_openib_eager_limit=1024
>>
>
> If you meant 1.3.1 above, please see the following message about an
> important bug in 1.3 and 1.3.1 with the use of mpi_leave_pinned:
>
>    http://www.open-mpi.org/community/lists/announce/2009/03/0029.php
>
>
>  and then I ran my program twice (124 processes on 31 nodes), once with
>> "mpi_leave_pinned=1" and once with "mpi_leave_pinned=0".
>> Both runs were stopped abnormally with "ctrl+c" and "killall -9
>> <program>".
>>
>
> Why -- did they hang?

    I just ran my program for a few steps to check the speed, and then I killed
it.

>
>
>  After that, I couldn't start that program again.
>>
>
> What exactly was the error?

    There were no messages at all.

>
>
>  I checked every node with "free -m" and found that a huge amount of
>> cached memory was in use on each node.
>> Could this situation be caused by those 4 parameters? Is there any way to
>> free it?
>>
>
>
> Probably not.
>
> Can you send all the information listed here:
>
>    http://www.open-mpi.org/community/help/
>
> --
> Jeff Squyres
> Cisco Systems
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
