On 19/03/2016 18:38, Heath Albritton wrote:
> If you google "ceph bluestore" you'll be able to find a couple slide
> decks on the topic.  One of them by Sage is easy to follow without the
> benefit of the presentation.  There's also the " Redhat Ceph Storage
> Roadmap 2016" deck.
>
> In any case, bluestore is not intended to address bitrot.  Given that
> ceph is a distributed file system, many of the posix file system
> features are not required for the underlying block storage device.
>  Bluestore is intended to address this and reduce the disk IO required
> to store user data.
>
> Ceph protects against bitrot at a much higher level by validating the
> checksum of the entire placement group during a deep scrub.

My impression is that the only protection against bitrot is whatever the
underlying filesystem provides, which means you get none if you use XFS
or EXT4.

I can't trust Ceph on this alone until its bitrot protection (if any) is
clearly documented; the situation is far from clear right now. The
documentation states that deep scrubs use checksums to validate data,
but that is not good enough, if only because we don't know what these
checksums are supposed to cover (see below for another reason). There is
even this howto by Sebastien Han about manually repairing a PG:
http://www.sebastien-han.fr/blog/2015/04/27/ceph-manually-repair-object/
which concludes that with only 2 replicas you can't reliably find out
which object is corrupted with Ceph alone. If Ceph really stored
checksums for all the objects it holds, we could manually check which
replica is valid.
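
To make this concrete, here is the kind of check I have in mind, as a
rough Python sketch of my own (not anything Ceph ships): hash each
replica's on-disk object file, located by hand under the OSD data
directories as in the howto above, and look for an odd one out. The
paths in the example are hypothetical. With three or more replicas a
majority vote identifies the bad copy; with only two you merely learn
that they disagree, which is exactly the problem.

    import hashlib
    from collections import Counter

    def sha256_of(path):
        # Hash a replica's object file in 1 MiB chunks.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def find_corrupt_replicas(replica_paths):
        digests = {p: sha256_of(p) for p in replica_paths}
        counts = Counter(digests.values())
        if len(counts) == 1:
            return []  # all replicas agree
        if len(replica_paths) < 3:
            # Two copies that disagree: no way to tell which one rotted.
            raise RuntimeError("replicas disagree; cannot tell which is bad")
        majority_digest, _ = counts.most_common(1)[0]
        return [p for p, d in digests.items() if d != majority_digest]

    # Hypothetical usage, with object paths found under each OSD's
    # data directory:
    # find_corrupt_replicas([
    #     "/var/lib/ceph/osd/ceph-0/current/17.1_head/<object>",
    #     "/var/lib/ceph/osd/ceph-1/current/17.1_head/<object>",
    #     "/var/lib/ceph/osd/ceph-2/current/17.1_head/<object>",
    # ])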

Even if deep scrubs used checksums to verify data, that would not be
enough to protect against bitrot: there is a window between a corruption
event and the next deep scrub during which corrupted data on a primary
can be returned to a client. BTRFS solves this problem by returning an
I/O error for any read that doesn't match the stored checksum (or by
automatically rebuilding the data if the allocation group uses
RAID1/10/5/6). I've never seen this kind of behavior documented for Ceph.
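
For the sake of illustration, this is the behaviour I mean, again as a
plain Python sketch of my own rather than anything Ceph implements: the
checksum is validated on every read, and a mismatch either falls back to
another replica or surfaces as an I/O error instead of silently handing
back bad data.

    import zlib

    def write_with_checksum(blob):
        # What a checksumming store would persist alongside the data.
        return blob, zlib.crc32(blob)

    def read_verified(data, stored_crc, other_replica=None):
        # Validate on every read, not only during a periodic scrub.
        if zlib.crc32(data) == stored_crc:
            return data
        if other_replica is not None:
            # Fall back to (and verify) another copy, the way BTRFS does
            # with RAID1/10/5/6 profiles.
            return read_verified(*other_replica)
        raise IOError("checksum mismatch: refusing to return corrupt data")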

Lionel
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
