On 2017-01-25 Michal Hocko wrote:
> On Wed 25-01-17 04:02:46, Trevor Cordes wrote:
> > OK, I patched & compiled mhocko's git tree from the other day
> > 4.9.0+. (To confirm, weird, but mhocko's git tree I'm using from a
> > couple of weeks ago shows the newest commit (git log) is
> > 69973b830859bc6529a7a0468ba0d80ee5117826 "Linux 4.9"? Let me know
> > if I'm doing something wrong, see below.)
>
> My fault. I should have noted that you should use since-4.9 branch.

OK, I have good news. I compiled your mhocko git tree (properly this
time!) using the since-4.9 branch (last commit
ca63ff9b11f958efafd8c8fa60fda14baec6149c, Jan 25) and the box survived
3 3am's, over 60 hours. I made sure all the usual oom culprits ran, and
I ran extras (finds on the whole tree, extra rdiff-backups) to try to
tax it.

Based on my previous criteria I would say your since-4.9 as of the
above commit solves my bug, at least over a 3 day test span (which it
never survives when the bug is present)!

I tested WITHOUT any cgroup/mem boot options. I do still have my mem=6G
limiter on, though (I've never tested with it off, and won't until I
solve the bug with it on, since I've had it on for many months for
other reasons).

On 2017-01-27 Michal Hocko wrote:
> OK, that matches the theory that these OOMs are caused by the
> incorrect active list aging fixed by b4536f0c829c ("mm, memcg: fix
> the active list aging for lowmem requests when memcg is enabled")

b4536f0c829c isn't in the since-4.9 I tested above though? So
something else you did must have fixed it (also)? I don't think I've
run any tests yet with b4536f0c829c in them. I think the vanillas I was
doing a couple of weeks ago were before b4536f0c829c, but I can't be
sure.

What do I test next? Does the since-4.9 stuff get pushed into vanilla
(4.9 hopefully?) so it can find its way into Fedora's stuck F24 kernel?

I want to also note that the RHBZ
https://bugzilla.redhat.com/show_bug.cgi?id=1401012 is garnering more
interest as more people start me-too'ing. The situation is almost
always the same: large rsyncs or similar tree-scan accesses cause oom
on PAE boxes. However, I wanted to note that many people there reported
that cgroup_disable=memory doesn't fix anything for them, whereas that
always makes the problem go away on my boxes. Strange.

Thanks Michal and Mel, I really appreciate it!