Thanks. Suresh and Kihwal are right: renames are journalled, but not necessarily durable (that is, not necessarily on disk by the time the call returns). I was getting mixed up with HDFS semantics, in which we actually do make the journal durable before returning success to the client.

It might be a good idea for HDFS to fsync the file descriptors of the directories involved in the rename operation before assuming that the operation is durable.

If you're using ext{2,3,4}, a quick fix would be to use mount -o dirsync. I haven't tested it out, but it's supposed to make these operations synchronous.

From the man page:

    dirsync
        All directory updates within the filesystem should be done
        synchronously. This affects the following system calls: creat,
        link, unlink, symlink, mkdir, rmdir, mknod and rename.

Colin

On Wed, Jul 3, 2013 at 10:19 AM, Suresh Srinivas <sur...@hortonworks.com> wrote:
> On Wed, Jul 3, 2013 at 8:12 AM, Colin McCabe <cmcc...@alumni.cmu.edu> wrote:
>
>> On Mon, Jul 1, 2013 at 8:48 PM, Suresh Srinivas <sur...@hortonworks.com>
>> wrote:
>> > Dave,
>> >
>> > Thanks for the detailed email. Sorry I did not read all the details you
>> > had sent earlier completely (on my phone). As you said, this is not
>> > related to data loss related to HBase log and hsync. I think you are
>> > right; the rename operation itself might not have hit the disk. I think
>> > we should either ensure metadata operation is synced on the datanode or
>> > handle it being reported as blockBeingWritten. Let me spend some time to
>> > debug this issue.
>>
>> In theory, ext3 is journaled, so all metadata operations should be
>> durable in the case of a power outage. It is only data operations
>> that should be possible to lose. It is the same for ext4. (Assuming
>> you are not using nonstandard mount options.)
>>
>
> ext3 journal may not hit the disk, right? From what I read, if you do not
> specifically call sync, even the metadata operations do not hit disk.
>
> See - https://www.kernel.org/doc/Documentation/filesystems/ext3.txt
>
> commit=nrsec (*)  Ext3 can be told to sync all its data and metadata
>                   every 'nrsec' seconds. The default value is 5 seconds.
>                   This means that if you lose your power, you will lose
>                   as much as the latest 5 seconds of work (your
>                   filesystem will not be damaged though, thanks to the
>                   journaling). This default value (or any low value)
>                   will hurt performance, but it's good for data-safety.
>                   Setting it to 0 will have the same effect as leaving
>                   it at the default (5 seconds).
>                   Setting it to very large values will improve
>                   performance.
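
For illustration, the directory fsync suggested at the top of this message could look roughly like the sketch below. It assumes a Linux/POSIX filesystem where a directory can be opened read-only and passed to fsync(2). The helper names fsync_dir and durable_rename are hypothetical; this is not HDFS code (HDFS is Java), only a sketch of the idea. It does not sync the file's contents, which would need their own fsync before the rename.

```python
import os
import tempfile


def fsync_dir(path):
    """Flush a directory's entries (creates, unlinks, renames) to disk."""
    fd = os.open(path, os.O_RDONLY)  # assumption: POSIX; fails on Windows
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def durable_rename(src, dst):
    """Rename src to dst, then fsync every directory the rename touched."""
    src_dir = os.path.dirname(os.path.abspath(src))
    dst_dir = os.path.dirname(os.path.abspath(dst))
    os.rename(src, dst)
    fsync_dir(dst_dir)
    if src_dir != dst_dir:
        # A cross-directory rename modifies both parents.
        fsync_dir(src_dir)


if __name__ == "__main__":
    # Hypothetical layout: move a file from tmp/ to current/.
    with tempfile.TemporaryDirectory() as root:
        os.mkdir(os.path.join(root, "tmp"))
        os.mkdir(os.path.join(root, "current"))
        src = os.path.join(root, "tmp", "blk_1")
        dst = os.path.join(root, "current", "blk_1")
        with open(src, "wb") as f:
            f.write(b"data")
            f.flush()
            os.fsync(f.fileno())  # make the file contents durable first
        durable_rename(src, dst)
```

Compared with mounting with dirsync, this only pays the synchronous cost on the renames that actually need to be durable, at the price of having to remember the extra calls.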