On Wed, Jan 05, 2011 at 12:55:22PM +0100, Olaf van der Spek wrote:
> > If you give me a specific approach, I can tell you why it won't work,
> > or why it won't be accepted by the kernel maintainers (for example,
> > because it involves pouring far too much complexity into the kernel).
> 
> Let's consider the temp file workaround, since a lot of existing apps
> use it. A request is to commit the source data before committing the
> rename. Seems quite simple.
Currently ext4 is initiating writeback on the source file at the time of
the rename.  Given performance measurements others have cited (maybe it
was you, I can't remember, and I don't feel like going through the
literally hundreds of messages on this and related threads), it seems
that btrfs is doing something similar.  The problem with doing a full
commit, which means surviving a power failure, is that you have to
request a barrier operation to make sure the data goes all the way down
to the disk platter --- and this is expensive (on the order of at least
20-30ms, more if you've written a lot to the disk).

We have had experience with forcing data writeback (what you call
"commit the source data") before the rename --- ext3 did that.  And it
had some very nasty performance problems which showed up on very busy
systems where people were doing a lot of different things at the same
time: large background writes from bittorrents and/or DVD ripping,
compiles, web browsing, etc.

If you force a large amount of data out when you do a commit, everything
else that tries to write to the file system at that point stops, and if
you have stupid programs (e.g., firefox trying to do database updates on
its UI loop), it can cause programs to apparently lock up, and users get
really upset.  So one of the questions is how much we should be
penalizing programs that are doing things right (i.e., using fsync),
versus programs which are doing things wrong (i.e., using rename and
trusting to luck).  This is a policy question, on which you might have a
different opinion than I do.

We could also simply force a synchronous data writeback at rename time,
instead of merely starting writeback at the point of the rename.  In the
case of a program which has already done an fsync(), the synchronous
data writeback would be a no-op, so that's good in terms of not
penalizing programs which do things right.  But the problem there is
that there could be some renames where forcing data writeback is not
needed, and so we would be forcing the performance hit of the "commit
the source data" even when it might not be needed (or wanted) by the
user.

How often does it happen that someone does a rename on top of an
already-existing file, where the fsync() isn't wanted?  Well, I can
think up scenarios, such as where an existing .iso image is corrupted or
needs to be updated, and so the user creates a new one and then renames
it on top of the old .iso image, but then gets surprised when the rename
ends up taking minutes to complete.  Is that a common occurrence?
Probably not, but the case of the system crashing right after the
rename() is somewhat unusual as well.  Humans in general suck at
reasoning about low-probability events; that's why we are allowing
low-paid TSA workers to grope air-travellers to avoid terrorists blowing
up planes midflight, while not being up in arms over the number of
deaths every year due to automobile accidents.  For this reason, I'm
cautious about going overboard at forcing commits on renames; doing this
has real performance implications, and it is a computer science truism
that optimizing for the uncommon/failure case is a bad thing to do.

OK, what about simply deferring the commit of the rename until the file
writeback has naturally completed?  The problem with that is "entangled
updates".  Suppose there is another file which is written to the same
directory block as the one affected by the rename, and *that* file is
fsync()'ed?  Keeping track of all of the data dependencies is *hard*.
See: http://lwn.net/Articles/339337/
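For concreteness, the "doing things right" pattern referred to above is
roughly the following sketch (the helper name, paths, and flags are
illustrative, and error handling is abbreviated):

    /*
     * Sketch of the write-temp-file-then-rename pattern, with fsync()
     * before the rename so the data is on stable storage before the
     * new name becomes visible.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int replace_file(const char *path, const char *tmppath,
                     const void *buf, size_t len)
    {
            int fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0666);

            if (fd < 0)
                    return -1;
            if (write(fd, buf, len) != (ssize_t) len || fsync(fd) < 0) {
                    close(fd);
                    unlink(tmppath);
                    return -1;
            }
            close(fd);
            return rename(tmppath, path);  /* atomically replace old file */
    }

The "rename and trusting to luck" variant is the same thing minus the
fsync() call, and that is exactly the case where the filesystem has to
guess how much implicit writeback to do at rename time.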
> > But for me to list all possible approaches and tell you why each one
> > is not going to work?  You'll have to pay me before I'm willing to
> > invest that kind of time.
> 
> That's not what I asked.

Actually, it is, although maybe you didn't realize it.  Look above, and
see how I had to present multiple alternatives and then shoot them all
down, one at a time.  There are hundreds of solutions, all of them
wrong.  That's why *my* counter is --- submit patches.  The mere act of
actually trying to code an alternative will allow you to determine why
your approach won't work, or failing that, others can take your patch,
apply it, and then demonstrate use cases where your idea completely
falls apart.  But it means that you do most of the work, which is fair,
since you're the one demanding the feature.

It doesn't scale for me to spend a huge amount of time composing e-mails
like this, which is why it's rare that I do it.  You've tricked me into
it this time, and that's time I've lost and can't get back to put into
useful things, like improving ext4.  Congratulations.  It probably won't
be happening again.

						- Ted