On Mon, Jan 13, 2014 at 03:15:16PM -0500, Robert Haas wrote:
> On Mon, Jan 13, 2014 at 1:51 PM, Kevin Grittner <kgri...@ymail.com> wrote:
> > I notice, Josh, that you didn't mention the problems many people
> > have run into with Transparent Huge Page defrag and with NUMA
> > access.
Ok, there are at least three potential problems there that you may or may not have run into.

First, THP when it was first introduced was a bit of a disaster. In 3.0, it was *very* heavy-handed and would trash the system reclaiming memory to satisfy an allocation. When it did this, it would also write back a bunch of data and block on it to boot. It was not the smartest move of all time but it was improved over time and in some cases the patches were also backported to 3.0.101. This is a problem that should have been alleviated over time. The general symptoms of the problem would be massive stalls, and monitoring the /proc/PID/stack of interesting processes would show it to be somewhere in do_huge_pmd_anonymous_page -> alloc_pages_nodemask -> try_to_free_pages -> migrate_pages or something similar. You may have worked around it in the past by disabling THP with a command line switch or via /sys/kernel/mm/transparent_hugepage/enabled. This is "not meant to happen" any more, or at least it has been a while since a bug was filed against me in this area. There are corner cases though. If the underlying filesystem is NFS, the problem might still be experienced.

That is the simple case. You might have also hit the case where THPages filled with zeros did not use the zero page. That would have looked like a larger footprint than anticipated and led to another range of problems. This has also been addressed since, but maybe not recently enough. It's less likely this is your problem though, as I expect you actually use your buffers, not leave them filled with zeros.

You mention NUMA, but that problem is trickier to figure out without more context. THP can cause unexpected interleaving between NUMA nodes. Memory that would have been local on a 4K page boundary becomes remote accesses when THP is enabled and performance would be hit (maybe 3-5% depending on the machine). It's not the only possibility though.
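For what it's worth, checking which THP mode is currently active is just a matter of reading that sysfs file; the active mode is the bracketed token. A minimal sketch (assuming Linux; the helper names are mine, and the file is absent when THP is not compiled in):

```python
def thp_mode(text: str) -> str:
    """Extract the active THP mode from sysfs output, e.g. 'always madvise [never]'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError("no active mode found in %r" % text)

def read_thp_mode(path: str = "/sys/kernel/mm/transparent_hugepage/enabled"):
    """Return the active THP mode, or None if the kernel has no THP support."""
    try:
        with open(path) as f:
            return thp_mode(f.read())
    except OSError:
        return None
```

Writing "never" (or "madvise") to the same file, as root, is the runtime equivalent of the `transparent_hugepage=never` boot switch mentioned above.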
If memory was being used sparsely and THP was in use then the overall memory footprint may be higher than it should be. This potentially would cause allocations to spill over to remote nodes while kswapd wakes up to reclaim local memory. That would lead to weird buffer aging inversion problems. This is a hell of a lot of guessing though, and we'd need a better handle on the reproduction case to pin it down.

> Amen to that. Actually, I think NUMA can be (mostly?) fixed by
> setting zone_reclaim_mode; is there some other problem besides that?

Really? zone_reclaim_mode is often a complete disaster unless the workload is partitioned to fit within NUMA nodes. On older kernels enabling it would sometimes cause massive stalls. I'm actually very surprised to hear it fixes anything and would be interested in hearing more about what sort of circumstances would convince you to enable that thing.

> The other thing that comes to mind is the kernel's caching behavior.
> We've talked a lot over the years about the difficulties of getting
> the kernel to write data out when we want it to and to not write data
> out when we don't want it to.

Is sync_file_range() broken?

> When it writes data back to disk too
> aggressively, we get lousy throughput because the same page can get
> written more than once when caching it for longer would have allowed
> write-combining.

Do you think that is related to dirty_ratio or dirty_writeback_centisecs? If it's dirty_writeback_centisecs then that would be particularly tricky, because poor interactions there would come down to luck, basically.

> When it doesn't write data to disk aggressively
> enough, we get huge latency spikes at checkpoint time when we call
> fsync() and the kernel says "uh, what? you wanted that data *on the
> disk*? sorry boss!" and then proceeds to destroy the world by starving
> the rest of the system for I/O for many seconds or minutes at a time.

Ok, parts of that are somewhat expected.
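The tunables in question all live under /proc/sys/vm, and zone_reclaim_mode is a bitmask rather than a boolean, which is easy to miss. A quick sketch for inspecting them (the bit meanings follow Documentation/sysctl/vm.txt; the helper names are mine):

```python
def decode_zone_reclaim(mode: int) -> list:
    """Decode the vm.zone_reclaim_mode bitmask."""
    bits = [
        (1, "zone reclaim on"),
        (2, "write dirty pages during zone reclaim"),
        (4, "swap pages during zone reclaim"),
    ]
    return [name for bit, name in bits if mode & bit]

def read_vm_tunable(name: str):
    """Return an integer vm.* tunable, or None if unavailable (e.g. in a container)."""
    try:
        with open("/proc/sys/vm/" + name) as f:
            return int(f.read().split()[0])
    except (OSError, ValueError, IndexError):
        return None

for t in ("zone_reclaim_mode", "dirty_ratio",
          "dirty_background_ratio", "dirty_writeback_centisecs"):
    print(t, "=", read_vm_tunable(t))
```

A workload that is not partitioned to fit within nodes generally wants zone_reclaim_mode decoded as the empty list, i.e. 0.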
It *may* depend on the underlying filesystem. Some of them handle fsync better than others. If you are syncing the whole file when you call fsync then you are potentially burned by having to write back dirty_ratio amounts of memory, which could take a substantial amount of time.

> We've made some desultory attempts to use sync_file_range() to improve
> things here, but I'm not sure that's really the right tool, and if it
> is we don't know how to use it well enough to obtain consistent
> positive results.

That implies that sync_file_range() is broken in some fashion we (or at least I) are not aware of, and that needs kicking.

> On a related note, there's also the problem of double-buffering. When
> we read a page into shared_buffers, we leave a copy behind in the OS
> buffers, and similarly on write-out. It's very unclear what to do
> about this, since the kernel and PostgreSQL don't have intimate
> knowledge of what each other are doing, but it would be nice to solve
> somehow.

If it's mapped and clean and you do not need it any more then you need nothing more than madvise(MADV_DONTNEED). If you are accessing the data via a file handle, then I would expect posix_fadvise(POSIX_FADV_DONTNEED). Offhand, I do not know how it behaved historically, but right now it will usually sync the data and then discard the pages. I say usually because it will not necessarily sync if the storage is congested, and there is no guarantee it will be discarded. In older kernels, there was a bug where small calls to posix_fadvise() would not work at all. This was fixed in 3.9.

The flipside is also meant to hold true. If you know data will be needed in the near future then posix_fadvise(POSIX_FADV_WILLNEED). Glancing at the implementation, it does a forced read-ahead on the range of pages of interest. It doesn't look like it would block.
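The write-side pattern implied above -- get the pages clean first, then tell the kernel you are done with them -- can be sketched like this, assuming Linux. Python's os module exposes posix_fadvise() but not sync_file_range(), so a plain fsync() stands in for the finer-grained call; the helper names are mine:

```python
import os, tempfile

def write_then_drop(path: str, data: bytes) -> None:
    """Write data, force it to stable storage, then hint the kernel to
    drop the now-clean page cache pages (DONTNEED is only a hint)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, data)
        os.fsync(fd)  # pages must be clean, or DONTNEED may sync-and-wait itself
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)  # len 0 == to EOF
    finally:
        os.close(fd)

def prefetch(path: str) -> None:
    """The flipside: a forced readahead hint that should not block."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
    finally:
        os.close(fd)
```

Note that dropping cached pages is purely a cache hint: the data on disk, and what a subsequent read returns, are unaffected.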
The completely different approach for double buffering is direct IO, but there may be reasons why you are avoiding that and are unhappy with the interfaces that are meant to work.

Just from the start, it looks like there are a number of problem areas. Some may be fixed -- in which case we should identify what fixed it and in which kernel version, and see whether it can be verified with a test case, or whether we managed to break something else in the process. Other bugs may still exist because we believe some interface works the way users want when it is in fact unfit for purpose for some reason.

-- 
Mel Gorman
SUSE Labs

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
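Part of the unhappiness with direct IO is visible even in a toy example: O_DIRECT imposes alignment requirements on the buffer address and transfer length, and some filesystems (tmpfs, for one) refuse it outright, so portable code needs a buffered fallback. A sketch, assuming Linux and a logical block size of at most 4096; the function name is mine:

```python
import mmap, os, tempfile

def write_direct(path: str, data: bytes) -> bool:
    """Try an O_DIRECT write that bypasses the page cache, falling back to
    buffered IO.  Returns True if direct IO was actually used."""
    # Anonymous mmap gives a page-aligned buffer, as O_DIRECT requires;
    # the transfer length must also be block-aligned, so we write 4096 bytes.
    buf = mmap.mmap(-1, 4096)
    buf[:len(data)] = data
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    for direct in (True, False):
        extra = os.O_DIRECT if direct else 0
        try:
            fd = os.open(path, flags | extra, 0o600)
            try:
                os.write(fd, buf)
            finally:
                os.close(fd)
            return direct
        except OSError:
            if not direct:  # even buffered IO failed; give up
                raise
    return False
```

The padding to a full block is itself part of the pain: the application, not the kernel, now owns blocking, alignment, and its own cache.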