On 01/13/2014 02:26 PM, Mel Gorman wrote:
> Really?
>
> zone_reclaim_mode is often a complete disaster unless the workload is
> partitioned to fit within NUMA nodes. On older kernels enabling it would
> sometimes cause massive stalls. I'm actually very surprised to hear it
> fixes anything and would be interested in hearing more about what sort
> of circumstances would convince you to enable that thing.

So the problem with the default setting is that it pretty much isolates all FS cache for PostgreSQL to whichever socket the postmaster is running on, and makes the other FS cache unavailable. This means that, for example, if you have two memory banks, then only one of them is available for PostgreSQL filesystem caching ... essentially cutting your available cache in half.

And however slow moving cached pages between memory banks is, it's an order of magnitude faster than reading them from disk. But this isn't how the NUMA stuff is configured; it seems to assume that it's less expensive to get pages from disk than to move them between banks, so whatever you've got cached on the other bank, it flushes to disk as fast as possible. I understand the goal was to make memory usage local to the processors stuff was running on, but that includes an implicit assumption that no individual process will ever want more than one memory bank worth of cache.

So disabling all of the NUMA optimizations is the way to go for any workload I personally deal with.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
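
P.S. For anyone who wants to see what a given box is actually set to before changing anything, here is a minimal Python sketch. It assumes a Linux system that exposes the setting at /proc/sys/vm/zone_reclaim_mode, and the bit meanings are taken from the kernel's vm sysctl documentation. The function names are made up for illustration and are not part of any existing tool.

```python
from pathlib import Path
from typing import List, Optional

# Location of the sysctl on Linux (assumption: the file is absent on
# systems that do not expose this setting).
ZONE_RECLAIM_PATH = Path("/proc/sys/vm/zone_reclaim_mode")

# zone_reclaim_mode is a bitmask; meanings per the kernel vm sysctl docs.
FLAGS = {
    1: "zone reclaim on",
    2: "zone reclaim writes dirty pages out",
    4: "zone reclaim swaps pages",
}


def describe_zone_reclaim(value: int) -> List[str]:
    """Return the human-readable flags set in a zone_reclaim_mode value."""
    return [name for bit, name in FLAGS.items() if value & bit]


def read_zone_reclaim(path: Path = ZONE_RECLAIM_PATH) -> Optional[int]:
    """Read the current setting, or return None if the file does not exist."""
    try:
        return int(path.read_text().strip())
    except FileNotFoundError:
        return None


if __name__ == "__main__":
    mode = read_zone_reclaim()
    if mode is None:
        print("zone_reclaim_mode is not available on this system")
    elif mode == 0:
        print("zone_reclaim_mode = 0 (zone reclaim disabled)")
    else:
        flags = ", ".join(describe_zone_reclaim(mode))
        print(f"zone_reclaim_mode = {mode}: {flags}")
```

If it reports anything other than 0, the usual way to turn zone reclaim off is to set vm.zone_reclaim_mode = 0 via sysctl; whether that is the right call for a particular workload is exactly what this thread is arguing about.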