[ 
https://issues.apache.org/jira/browse/KUDU-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16329420#comment-16329420
 ] 

Todd Lipcon commented on KUDU-2195:
-----------------------------------

Here's a way to reproduce the issues with our current lack of fsync on metadata 
files:

As root:
{code}
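# Create a 500 MB backing file and attach it to the first free loop device
# (the commands below assume that device is /dev/loop0).
dd if=/dev/zero of=/tmp/file bs=1M count=500
losetup -f /tmp/file
# Wrap the loop device in an md "faulty" array so write failures can be injected later.
mdadm --create /dev/md0 --level=faulty --raid-devices=1 /dev/loop0
mdadm --zero-superblock /dev/md0
# Put XFS on the array and hand the mount point to the test user.
mkfs.xfs /dev/md0
mount /dev/md0 /mnt
chown todd /mnt/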
{code}

As user:
{code}
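# Start from clean directories on the fault-injectable mount.
rm -Rf /mnt/m /mnt/t
kudu-master -fs-wal-dir /mnt/m &
M_PID=$!
kudu-tserver -fs-wal-dir /mnt/t &
TS_PID=$!
sleep 3
# Generate load twice, keeping the auto-created tables around.
kudu perf loadgen --keep-auto-table localhost
kudu perf loadgen --keep-auto-table localhost
# Hard-kill and restart the tablet server, then generate load again.
kill -9 $TS_PID
kudu-tserver -fs-wal-dir /mnt/t &
TS_PID=$!
kudu perf loadgen --keep-auto-table localhost
sleep 1
# Hard-kill both daemons so neither gets a chance to shut down cleanly.
kill -9 $TS_PID
kill -9 $M_PID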
{code}

As root:
{code}
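# Make every write to the device fail, so anything not yet flushed is lost.
mdadm --grow /dev/md0 -l faulty -p write-all
umount /mnt
# Turn fault injection back off and remount to see what actually reached disk.
mdadm --grow /dev/md0 -l faulty -p clear
mount /dev/md0 /mnt
ls -l /mnt/t/consensus-meta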
{code}

Now we can observe 0-length files in /mnt/t/consensus-meta:

{code}
root@todd-laptop:~# ls -l /mnt/t/consensus-meta
total 160
-rw------- 1 todd todd 10132 Jan 17 12:32 05909e52e3a64ddcadd4f882e3ba8b5a
-rw------- 1 todd todd 10132 Jan 17 12:32 082b66a699ce43489aef96fcf3596b2e
-rw------- 1 todd todd     0 Jan 17 12:32 11734256f5e74d56a7ea6b9b35049e0e
-rw------- 1 todd todd 10132 Jan 17 12:32 1ae5fbfaac8e4ec692f44dda98aed814
-rw------- 1 todd todd 10132 Jan 17 12:32 20c9a2ad999744228447229439a2a0b6
-rw------- 1 todd todd 10132 Jan 17 12:32 2f7b9aad94944aef86cf9d0c11999ebf
-rw------- 1 todd todd 10132 Jan 17 12:32 3214795507524703b679b5a8242ef963
-rw------- 1 todd todd 10132 Jan 17 12:32 32499224331a47178a175dd2fae6ce45
-rw------- 1 todd todd 10132 Jan 17 12:32 451681698b034ece95cee49538366eea
-rw------- 1 todd todd 10132 Jan 17 12:32 49ab2dcea2e94b9e941958ada78f1383
-rw------- 1 todd todd     0 Jan 17 12:32 5e6d25398be54db7b7c32aee721572a9
-rw------- 1 todd todd     0 Jan 17 12:32 69026fae548d46e2a8a905aed9c7925c
-rw------- 1 todd todd 10132 Jan 17 12:32 6f25ecdc798243c7a972ec6a84cf1f28
-rw------- 1 todd todd 10132 Jan 17 12:32 7ae4113d6ced400f8e271045dee87b67
-rw------- 1 todd todd 10132 Jan 17 12:32 7e031faa71f64bd5a08ae3f8fb76c85c
-rw------- 1 todd todd 10132 Jan 17 12:32 9b0cef225d5b445c963d7e98d459489b
-rw------- 1 todd todd 10132 Jan 17 12:32 a60edfb6a4a7431abb2fe8a25e24cd0c
-rw------- 1 todd todd     0 Jan 17 12:32 b55b7ee186c0475e9dd710a2ee80f1da
-rw------- 1 todd todd 10132 Jan 17 12:32 be468ae4c4fe489d86b0e79a5fde41e6
-rw------- 1 todd todd     0 Jan 17 12:32 cf2e3c7678794a4daa9ca216e7a97021
-rw------- 1 todd todd     0 Jan 17 12:32 dd6f07102f0f446e96bc21b0fdc60608
-rw------- 1 todd todd 10132 Jan 17 12:32 e2806d8cce5e4e77a4d41d8bb22b67c2
-rw------- 1 todd todd     0 Jan 17 12:32 ed29ad52605e451880c08cf61cc5a16b
{code}
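
To pick out just the truncated files instead of scanning the listing by eye, 
a find(1) size filter is enough. This is only a small sketch; the helper name 
list_empty_files is made up for illustration and is not part of Kudu.

```shell
# Hypothetical helper: print every regular file under directory $1 whose
# size is 0 bytes.
list_empty_files() {
  find "$1" -type f -size 0
}
```

Running "list_empty_files /mnt/t/consensus-meta" against the listing above 
would report the seven 0-byte files.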

Trying to start Kudu then reproduces the error:
{code}
E0117 12:25:39.294999 15670 ts_tablet_manager.cc:940] T 
6e30abcb47c747ec9309b079820ae08c P d6933c7686c8488c9f72bd95431d7685: Failed to 
load consensus metadata: Incomplete: Unable to load consensus metadata for 
tablet 6e30abcb47c747ec9309b079820ae08c: Could not read header for proto 
container file /mnt/t/consensus-meta/6e30abcb47c747ec9309b079820ae08c: File 
size not large enough to be valid: Proto container file 
/mnt/t/consensus-meta/6e30abcb47c747ec9309b079820ae08c: Tried to read 16 bytes 
at offset 0 but file size is only 0 bytes
{code}

If I include the patch for https://gerrit.cloudera.org/#/c/9043/ then the 
metadata files all have non-zero length.
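
For illustration only, and not taken from the Gerrit patch: the general 
pattern for replacing a small metadata file so that a crash leaves either the 
old or the new contents (never an empty file) is to fsync the temporary file 
before renaming it over the target, then fsync the parent directory. A minimal 
shell sketch of that ordering follows. It assumes GNU coreutils 8.24 or later, 
where sync accepts file operands and fsyncs them; the function name 
durable_replace is hypothetical.

```shell
# Hypothetical helper: atomically and durably replace file $1 with the
# contents of stdin. Assumes GNU coreutils >= 8.24 so that `sync FILE`
# performs an fsync on FILE (and on a directory when given one).
durable_replace() {
  dst=$1
  dir=$(dirname -- "$dst")
  # 1. Write the new contents to a temporary file in the same directory,
  #    so the later rename stays within one filesystem.
  tmp=$(mktemp "$dir/.tmp.XXXXXX") || return 1
  cat > "$tmp" || { rm -f -- "$tmp"; return 1; }
  # 2. Flush the file's data to disk BEFORE the rename makes it visible.
  sync -- "$tmp" || { rm -f -- "$tmp"; return 1; }
  # 3. Atomically swap the new file into place.
  mv -f -- "$tmp" "$dst" || { rm -f -- "$tmp"; return 1; }
  # 4. Flush the directory entry so the rename itself is durable.
  sync -- "$dir"
}
```

Usage would look like "printf '%s' "$new_contents" | durable_replace 
/path/to/file". Skipping step 2 is what allows a rename to survive a crash 
while the file's data does not, which matches the 0-length files seen above.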

> Enforce durability happened before relationships on multiple disks
> ------------------------------------------------------------------
>
>                 Key: KUDU-2195
>                 URL: https://issues.apache.org/jira/browse/KUDU-2195
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus, tablet
>            Reporter: David Alves
>            Priority: Major
>
> When using weaker durability semantics (e.g. when log_force_fsync is off) we 
> should still enforce certain happened before relationships which are not 
> currently being enforced when using different disks for the wal and data.
> The two cases that come to mind where this is relevant are:
> 1) cmeta (c) -> wal (w): We flush cmeta before flushing the wal (for 
> instance on term change) with the intention that either {}, {c} or {c, w} 
> were made durable.
> 2) wal (w) -> tablet meta (t): We flush the wal before tablet metadata to 
> make sure that all commit messages that refer to on-disk row sets (and 
> deltas) are on disk before the row sets they point to, i.e. with the 
> intention that either {}, {w} or {w, t} were made durable.
> With strong durability semantics these are always made durable in the right 
> order. With weaker semantics that is not the case, though. If using the same 
> disk for both the wal and data then the invariants are still preserved, as 
> buffers will be flushed in the right order. But if using different disks for 
> the wal and data (and because cmeta is stored with the data) that is not 
> always the case.
> Case 1) is actually safe in ext4, because we perform an fsync (indirect: 
> rename() implies fsync in ext4) when flushing cmeta. But it is not safe in 
> xfs.
> Case 2) is not safe in either filesystem.
> --- Possible solutions --
> For 1): Store cmeta with the wal; actually always fsync cmeta.
> For 2): Store tablet meta with the wal; always fsync the wal before flushing 
> tablet meta.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
