On Mon, Jan 13, 2014 at 5:26 PM, Mel Gorman <mgor...@suse.de> wrote:
>> Amen to that. Actually, I think NUMA can be (mostly?) fixed by
>> setting zone_reclaim_mode; is there some other problem besides that?
>
> Really?
>
> zone_reclaim_mode is often a complete disaster unless the workload is
> partitioned to fit within NUMA nodes. On older kernels enabling it
> would sometimes cause massive stalls. I'm actually very surprised to
> hear it fixes anything and would be interested in hearing more about
> what sort of circumstances would convince you to enable that thing.

By "set" I mean "set to zero". We've seen multiple instances of people
complaining about large amounts of system memory going unused because
this setting defaulted to 1.
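For anyone who wants to check a system for this, the setting is exposed
at /proc/sys/vm/zone_reclaim_mode. Here's a minimal, untested sketch of
a startup check a program might do; the wording of the warning is mine
and error handling is deliberately thin:

    #include <stdio.h>

    int
    main(void)
    {
        FILE   *f = fopen("/proc/sys/vm/zone_reclaim_mode", "r");
        int     mode;

        if (f == NULL)
        {
            perror("zone_reclaim_mode");    /* may not exist on some kernels */
            return 1;
        }
        if (fscanf(f, "%d", &mode) == 1 && mode != 0)
            fprintf(stderr,
                    "warning: vm.zone_reclaim_mode = %d; page cache may go\n"
                    "unused on this NUMA machine\n", mode);
        fclose(f);
        return 0;
    }

Setting it back to zero (vm.zone_reclaim_mode = 0 via sysctl) is what
fixed things for the users in question.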
By "set" I mean "set to zero". We've seen multiple of instances of people complaining about large amounts of system memory going unused because this setting defaulted to 1. >> The other thing that comes to mind is the kernel's caching behavior. >> We've talked a lot over the years about the difficulties of getting >> the kernel to write data out when we want it to and to not write data >> out when we don't want it to. > > Is sync_file_range() broke? I don't know. I think a few of us have played with it and not been able to achieve a clear win. Whether the problem is with the system call or the programmer is harder to determine. I think the problem is in part that it's not exactly clear when we should call it. So suppose we want to do a checkpoint. What we used to do a long time ago is write everything, and then fsync it all, and then call it good. But that produced horrible I/O storms. So what we do now is do the writes over a period of time, with sleeps in between, and then fsync it all at the end, hoping that the kernel will write some of it before the fsyncs arrive so that we don't get a huge I/O spike. And that sorta works, and it's definitely better than doing it all at full speed, but it's pretty imprecise. If the kernel doesn't write enough of the data out in advance, then there's still a huge I/O storm when we do the fsyncs and everything grinds to a halt. If it writes out more data than needed in advance, it increases the total number of physical writes because we get less write-combining, and that hurts performance, too. I basically feel like the I/O scheduler sucks, though whether it sucks because it's not theoretically possible to do any better or whether it sucks because of some more tractable reason is not clear to me. In an ideal world, when I call fsync() a bunch of times from one process, other processes on the same machine should begin to observe 30+-second (or sometimes 300+-second) times for read or write of an 8kB block. Imagine a hypothetical UNIX-like system where when one process starts running at 100% CPU, every other process on the machine gets timesliced in only once per minute. That's obviously ridiculous, and yet it's pretty much exactly what happens with I/O. >> When it writes data back to disk too >> aggressively, we get lousy throughput because the same page can get >> written more than once when caching it for longer would have allowed >> write-combining. > > Do you think that is related to dirty_ratio or dirty_writeback_centisecs? > If it's dirty_writeback_centisecs then that would be particularly tricky > because poor interactions there would come down to luck basically. See above; I think it's related to fsync. >> When it doesn't write data to disk aggressively >> enough, we get huge latency spikes at checkpoint time when we call >> fsync() and the kernel says "uh, what? you wanted that data *on the >> disk*? sorry boss!" and then proceeds to destroy the world by starving >> the rest of the system for I/O for many seconds or minutes at a time. > > Ok, parts of that are somewhat expected. It *may* depend on the > underlying filesystem. Some of them handle fsync better than others. If > you are syncing the whole file though when you call fsync then you are > potentially burned by having to writeback dirty_ratio amounts of memory > which could take a substantial amount of time. Yeah. 
ext3 apparently fsyncs the whole filesystem, which is terrible for
throughput, but if you happen to have xlog (which is flushed regularly)
on the same filesystem as the data files (which are flushed only
periodically) then at least you don't have the problem of the write
queue getting too large. But I think most of our users are on ext4 at
this point, probably some xfs and other things.

We track the number of un-fsync'd blocks we've written to each file,
and have gotten desperate enough to think of approaches like - ok, well
if the total number of un-fsync'd blocks in the system exceeds some
threshold, then fsync the file with the most such blocks, not because
we really need the data on disk just yet but so that the write queue
won't get too large for the kernel to deal with. And I think there may
even be some test results from such crocks showing some benefit. But
really, I don't understand why we have to baby the kernel like this.
Ensuring scheduling fairness is a basic job of the kernel; if we wanted
to have to control caching behavior manually, we could use direct I/O.
Having accepted the double buffering that comes with NOT using direct
I/O, ideally we could let the kernel handle scheduling and call it
good.

>> We've made some desultory attempts to use sync_file_range() to improve
>> things here, but I'm not sure that's really the right tool, and if it
>> is we don't know how to use it well enough to obtain consistent
>> positive results.
>
> That implies that sync_file_range() is broken in some fashion we (or
> at least I) are not aware of and that needs kicking.

So the problem is - when do you call it? What happens is: before a
checkpoint, we may have already written some blocks to a file. During
the checkpoint, we're going to write some more. At the end of the
checkpoint, we'll need all blocks written before and during the
checkpoint to be on disk. If we call sync_file_range() at the beginning
of the checkpoint, then in theory that should get the ball rolling, but
we may be about to rewrite some of those blocks, or at least throw some
more on the pile. If we call sync_file_range() near the end of the
checkpoint, just before calling fsync, there's not enough time for the
kernel to reorder I/O to a sufficient degree to do any good. What we
want, sorta, is to have the kernel start writing it out just at the
right time to get it on disk by the time we're aiming to complete the
checkpoint, but it's not clear exactly how to do that. We can't just
write all the blocks, sync_file_range(), wait, and then fsync(),
because the "write all the blocks" step can trigger an I/O storm if the
kernel decides there's too much dirty data. (One possible division of
labor is sketched below.)

I suppose what we really want to do during a checkpoint is write data
into the OS cache at a rate that matches what the kernel can physically
get down to the disk, and have the kernel schedule those writes in as
timely a fashion as it can without disrupting overall system throughput
too much. But the feedback mechanisms that exist today are just too
crude for that. You can easily write() to the point where the whole
system freezes up, or equally wait between write()s when the system
could easily have handled more right away. And it's very hard to tell
how much you can fsync() at once before performance falls off a cliff.
A certain number of writes get absorbed by various layers of caching
between us and the physical hardware - and then at some point, they're
all full, and further writes lead to disaster. But I don't know of any
way to assess how close we are to that point at any given time except
to cross it, and at that point, it's too late.
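The division of labor referenced above would be to let
sync_file_range() start writeback during the paced writes while still
relying on the final fsync() for durability - SYNC_FILE_RANGE_WRITE
only initiates I/O and guarantees nothing about metadata or the drive
cache. This is a hypothetical sketch, not what we ship; batch size and
pacing are again invented:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/types.h>
    #include <unistd.h>

    static void
    checkpoint_with_sync_file_range(int fd, char *blocks[],
                                    off_t offsets[], int nblocks)
    {
        for (int i = 0; i < nblocks; i++)
        {
            pwrite(fd, blocks[i], 8192, offsets[i]);
            if (i % 32 == 31)
            {
                /* nbytes == 0 means "from offset to end of file" */
                sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
                usleep(100 * 1000);
            }
        }
        fsync(fd);  /* still needed for durability */
    }

The open question from the paragraph above remains, though: issue the
sync_file_range() calls early and risk rewriting those blocks, or issue
them late and leave the kernel no time to reorder anything.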
>> On a related note, there's also the problem of double-buffering. When
>> we read a page into shared_buffers, we leave a copy behind in the OS
>> buffers, and similarly on write-out. It's very unclear what to do
>> about this, since the kernel and PostgreSQL don't have intimate
>> knowledge of what each other are doing, but it would be nice to solve
>> somehow.
>
> If it's mapped, clean and you do not need it any more, then
> madvise(MADV_DONTNEED). If you are accessing the data via a file
> handle, then I would expect posix_fadvise(POSIX_FADV_DONTNEED).
> Offhand, I do not know how it behaved historically but right now it
> will usually sync the data and then discard the pages. I say usually
> because it will not necessarily sync if the storage is congested and
> there is no guarantee it will be discarded. In older kernels, there
> was a bug where small calls to posix_fadvise() would not work at all.
> This was fixed in 3.9.
>
> The flipside is also meant to hold true. If you know data will be
> needed in the near future then posix_fadvise(POSIX_FADV_WILLNEED).
> Glancing at the implementation it does a forced read-ahead on the
> range of pages of interest. It doesn't look like it would block.
>
> The completely different approach for double buffering is direct IO
> but there may be reasons why you are avoiding that and are unhappy
> with the interfaces that are meant to work.
>
> Just from the start, it looks like there are a number of problem
> areas. Some may be fixed -- in which case we should identify what
> fixed it and in what kernel version, and see whether it can be
> verified with a test case, or whether we managed to break something
> else in the process. Other bugs may still exist because we believe
> some interface works how users want when it is in fact unfit for
> purpose for some reason.

It's all read, not mapped, because we need to prevent pages from being
written back to their backing files until WAL is fsync'd, and there's
no way to map a file and modify the page but not let it be written back
to disk until some other event happens.

We've experimented with don't-need but it's tricky. Here's an example.
Our write-ahead log (WAL) files are all 16MB; older files eventually
cease to be needed for any purpose, but there's a continued demand for
new files driven by database modifications. Experimentation some years
ago revealed that it's faster to rename and overwrite the old files
than to remove them and create new ones, so that's what we do. Ideally
this means that at steady state we're just recycling the files over and
over and never creating or destroying any, though I'm not sure whether
we ever actually achieve that ideal.

However, benchmarking has shown that making the wrong decision about
whether to don't-need those files has a significant effect on
performance. If there's enough cache around to keep all the files in
memory, then we don't want to don't-need them, because then access will
be slow when the old files are recycled. If however there is cache
pressure, then we want to don't-need them as quickly as possible to
make room for other, higher-priority data (there's a sketch of this
below). Now that may not really be the kernel's fault; it's a general
property of ring buffers that you want an LRU policy if they fit in
cache and immediate eviction of everything but the active page if they
don't. But I think it demonstrates the general difficulty of using
posix_fadvise. Similar cases arise for prefetching: gee, we'd like to
prefetch this data because we're going to use it soon, but if the
system is under enough pressure, the data may get evicted again before
"soon" actually arrives.
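Here's roughly what the recycling path looks like if you bolt
don't-need onto it - a sketch only, with hypothetical names, and with
the decision input ("cache_pressure") left as a parameter precisely
because we have no good way to compute it:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #define WAL_SEGMENT_SIZE (16 * 1024 * 1024)

    /*
     * Recycle an old WAL segment by renaming it into place as a future
     * segment (cheaper than unlink + create), then optionally tell the
     * kernel we don't need its cached pages.  Per the above,
     * posix_fadvise() usually syncs and discards the pages, but
     * guarantees neither.
     */
    static int
    recycle_wal_segment(const char *oldpath, const char *newpath,
                        int cache_pressure)
    {
        if (rename(oldpath, newpath) != 0)
            return -1;

        if (cache_pressure)
        {
            int     fd = open(newpath, O_RDWR);

            if (fd >= 0)
            {
                posix_fadvise(fd, 0, WAL_SEGMENT_SIZE,
                              POSIX_FADV_DONTNEED);
                close(fd);
            }
        }
        return 0;
    }

Get cache_pressure wrong in either direction and you pay: pointless
cache misses when the segment is reused, or memory stolen from
higher-priority data.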
Thanks for taking the time to write all of these comments, and to
listen to our concerns. I really appreciate it, whether anything
tangible comes of it or not.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company