On 9/4/2012 7:21 PM, Yong Qin wrote:
> On Tue, Sep 4, 2012 at 5:42 AM, Yevgeny Kliteynik
> <klit...@dev.mellanox.co.il> wrote:
>> On 8/30/2012 10:28 PM, Yong Qin wrote:
>>> On Thu, Aug 30, 2012 at 5:12 AM, Jeff Squyres<jsquy...@cisco.com> wrote:
>>>> On Aug 29, 2012, at 2:25 PM, Yong Qin wrote:
>>>>
>>>>> This issue has been observed on OMPI 1.6 and 1.6.1 with openib btl but
>>>>> not on 1.4.5 (tcp btl is always fine). The application is VASP and
>>>>> only one specific dataset is identified during the testing, and the OS
>>>>> is SL 6.2 with kernel 2.6.32-220.23.1.el6.x86_64. The issue is that
>>>>> when a certain type of load is put on OMPI 1.6.x, khugepaged thread
>>>>> always runs with 100% CPU load, and it looks to me like that OMPI is
>>>>> waiting for some memory to be available thus appears to be hung.
>>>>> Reducing the per node processes would sometimes ease the problem a bit
>>>>> but not always. So I did some further testing by playing around with
>>>>> the kernel transparent hugepage support.
>>>>>
>>>>> 1. Disable transparent hugepage support completely (echo never
>>>>> > /sys/kernel/mm/redhat_transparent_hugepage/enabled). This would allow
>>>>> the program to progress as normal (as in 1.4.5). Total run time for an
>>>>> iteration is 3036.03 s.
>>>>
>>>> I'll admit that we have not tested using transparent hugepages. I wonder
>>>> if there's some kind of bad interaction going on here...
>>>
>>> The transparent hugepage is "transparent", which means it is
>>> automatically applied to all applications unless it is explicitly told
>>> otherwise. I highly suspect that it is not working properly in this
>>> case.
>>
>> Like Jeff said - I don't think we've ever tested OMPI with transparent
>> huge pages.
>>
>
> Thanks. But have you tested OMPI under RHEL 6 or its variants (CentOS
> 6, SL 6)? THP is on by default in RHEL 6 so no matter you want it or
> not it's there.
Interesting. Indeed, THP is on by default in RHEL 6.x. I run OMPI 1.6.x
constantly on RHEL 6.2, and I've never seen this problem.

I'm checking it with the OFED folks, but I doubt that there are any
dedicated tests for THP.

So do you see it only with a specific application and only on a
specific data set? I wonder if I can somehow reproduce it in-house...

-- YK
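
When comparing nodes for a reproduction attempt, it may help to record which THP mode is actually active on each one. The following is a rough sketch only, not a tested procedure. The RHEL 6 sysfs path is the one quoted in the thread. The second path is the mainline kernel location and is included as an assumption for non-RHEL systems. The helper names `thp_active_mode` and `thp_report` are made up for illustration.

```shell
#!/bin/sh
# Sketch: report the active transparent hugepage (THP) mode on this node.
# Helper names are hypothetical; they are not part of any existing tool.

# Print the bracketed word from a THP status string,
# e.g. "[always] madvise never" -> "always".
thp_active_mode() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([a-z]*\)\].*/\1/p'
}

# Print the active mode for each THP control file that exists.
# First path: RHEL 6 location quoted in the thread.
# Second path: mainline kernel location (assumption for other distros).
thp_report() {
    for f in /sys/kernel/mm/redhat_transparent_hugepage/enabled \
             /sys/kernel/mm/transparent_hugepage/enabled; do
        if [ -r "$f" ]; then
            echo "$f: $(thp_active_mode "$(cat "$f")")"
        fi
    done
}

thp_report
```

Running this on a node where the hang occurs and on one where it does not would at least confirm whether both have THP set to the same mode before the job starts.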