We’re prototyping a native SMR object store to run alongside Ceph, which has
been our only object store backend for the last three years and falls short on
some metrics. I believe trying to use SMR drives with a file system in the
architecture (as Ceph does) is a non-starter; the solution is really to treat
them like tape drives. A strictly sequential access model across the volume is
everything, so the tape metaphor matters all the way down to how you handle
free-space collection.
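
Not our actual implementation, but as a minimal sketch of what "treat it like
tape" means on a Linux host-managed SMR drive exposed through the zoned block
device interface (the device path and zone geometry below are placeholder
assumptions):

#define _GNU_SOURCE           /* for O_DIRECT */
#include <fcntl.h>
#include <linux/blkzoned.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Placeholder geometry: first zone, 256 MiB zones, 512-byte sectors. */
#define ZONE_START_SECTOR 0ULL
#define ZONE_SECTORS      (256ULL * 1024 * 1024 / 512)

int main(void)
{
    /* Placeholder device node for a host-managed SMR disk. */
    int fd = open("/dev/sdX", O_WRONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    /* Tape-style writing: every write lands at the zone's write pointer,
     * strictly in order, never in place. */
    static char buf[1 << 16] __attribute__((aligned(4096)));
    memset(buf, 0xab, sizeof(buf));
    off_t off = (off_t)(ZONE_START_SECTOR * 512);
    for (int i = 0; i < 16; i++) {
        if (pwrite(fd, buf, sizeof(buf), off) != (ssize_t)sizeof(buf)) {
            perror("pwrite");
            break;
        }
        off += sizeof(buf);   /* the next write continues where this one ended */
    }

    /* "Free-space collection": after any live data has been copied forward,
     * the whole zone is reset in one operation -- the analogue of overwriting
     * a tape from the beginning, not of freeing individual blocks. */
    struct blk_zone_range zr = {
        .sector     = ZONE_START_SECTOR,
        .nr_sectors = ZONE_SECTORS,
    };
    if (ioctl(fd, BLKRESETZONE, &zr) < 0)
        perror("BLKRESETZONE");

    close(fd);
    return 0;
}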

Unfortunately, exposing this new architecture through a libRADOS API isn’t
something we’re interested in doing, since we don’t use Ceph for anything above
the libRADOS layer of the stack, but our code will be open source if someone
else wants to take a crack at it. It wouldn’t be easy: our implementation
reuses the sequential-access (tape) variant of the SCSI protocol, and our host
transport is iSER (iSCSI Extensions for RDMA), so it’s not a natural overlay
for libRADOS, to put it mildly, particularly since we do erasure coding well
above this layer of our application stack.


Steve Cranage
Principal Architect, Co-Founder
DeepSpace Storage

From: Oliver Freyermuth <freyerm...@physik.uni-bonn.de>
Sent: Wednesday, May 6, 2020 5:28 AM
To: Janne Johansson <icepic...@gmail.com>
Cc: ceph-users <ceph-users@ceph.io>
Subject: [ceph-users] Re: State of SMR support in Ceph?

Dear Janne,

On 06.05.20 at 09:18, Janne Johansson wrote:
> On Wed, 6 May 2020 at 00:58, Oliver Freyermuth
> <freyerm...@physik.uni-bonn.de> wrote:
>
>     Dear Cephalopodians,
>     seeing the recent moves of major HDD vendors to sell SMR disks targeted 
> for use in consumer NAS devices (including RAID systems),
>     I got curious and wonder what the current status of SMR support in 
> Bluestore is.
>     Of course, I'd expect disk vendors to give us host-managed SMR disks for
> data center use cases (and to tell us when they actually do so...),
>     but in that case, Bluestore surely needs some new intelligence for best 
> performance in the shingled ages.
>
>
> I've only run filestore on SMRs, and it worked for a while in our normal
> cases, but it broke down horribly as soon as recovery was needed.
> I have no idea whether filestore was the worst possible fit for SMRs, whether
> bluestore will do better, or whether patches will make bluestore useful, but
> all in all, I can't tell people wanting to experiment with SMRs anything
> other than "if you must use SMRs, make sure you test the most evil corner
> cases".

Thanks for the input and especially the hands-on experience! That's very
helpful (and "expensive" to gather), so I appreciate you sharing it!

After my "small-scale" experiences, I would indeed have expected exactly that. 
My sincere hope is that this hardware will become useable by making use of 
Copy-on-Write semantics
to align writes into larger, consecutive batches.
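
As an illustrative sketch of that kind of write aggregation (all names, sizes
and targets below are made up, not tied to any particular Bluestore design):
small incoming writes are staged in memory and only reach the SMR-backed
device as one long sequential flush.

/* Illustrative sketch only: stage many small writes in RAM and emit them as
 * one long sequential write, so the underlying SMR device only ever sees
 * large appends. Batch size and output target are placeholder choices. */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BATCH_BYTES (64UL * 1024 * 1024)   /* assumed 64 MiB flush unit */

struct write_batch {
    int    fd;     /* SMR-backed file or device */
    char  *buf;    /* staging buffer */
    size_t used;   /* bytes currently staged */
};

/* Flush the staged data as a single sequential write. */
static int batch_flush(struct write_batch *b)
{
    if (b->used == 0)
        return 0;
    ssize_t n = write(b->fd, b->buf, b->used);
    if (n < 0 || (size_t)n != b->used)
        return -1;
    b->used = 0;
    return 0;
}

/* Accept a small write; only touch the device when the batch is full. */
static int batch_put(struct write_batch *b, const void *data, size_t len)
{
    if (len > BATCH_BYTES)
        return -1;                 /* oversized writes handled elsewhere */
    if (b->used + len > BATCH_BYTES && batch_flush(b) != 0)
        return -1;
    memcpy(b->buf + b->used, data, len);
    b->used += len;
    return 0;
}

int main(void)
{
    struct write_batch b = { .used = 0 };
    b.fd = open("/dev/null", O_WRONLY);    /* placeholder output */
    b.buf = malloc(BATCH_BYTES);
    if (b.fd < 0 || !b.buf)
        return 1;

    /* Many small "object" writes... */
    char chunk[4096];
    for (int i = 0; i < 1000; i++) {
        memset(chunk, i & 0xff, sizeof(chunk));
        if (batch_put(&b, chunk, sizeof(chunk)) != 0)
            return 1;
    }
    /* ...reach the device as a handful of large sequential ones. */
    if (batch_flush(&b) != 0)
        return 1;

    free(b.buf);
    close(b.fd);
    return 0;
}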

>
> As you noted, one can easily drop below 1 MB/s with SMRs by doing anything
> other than long linear writes, and you don't want to be in a place where
> several hundred TBs of data are recovering at that speed.
>
> To me, SMR is a con; it's a trick to sell cheap crap to people who can't or
> won't test properly. It doesn't matter whether it's Ceph recovery/backfill,
> btrfs deletes or someone's NAS RAID sync job that places the final straw on
> the camel's back and breaks it; the fact is that filesystems do lots more
> than just nice, easy, long linear writes. Whether it is fsck, defrags or Ceph
> PG splits/reshardings, there will be disk meta-operations that need to be
> done and that involve tons of random small writes, and SMR drives will punish
> you for them right when you need the drive the most. 8-(
>
> If I had some very special system which used cheap disks to pretend to be a
> tape device and only did 10 GB-sized reads/writes like a tape would, then I
> could see a use case for SMR.

I agree that in many cases SMR is not the correct hardware to use, and never
will be. Indeed, I also agree that in most cases the "trick to sell cheap crap
to people who can't or won't test properly" applies, even more so with
disk-managed SMR, which in some cases gives you zero control and maximum
frustration.

Still, my hope would be that especially for archiving purposes (think of a pure
Ceph-RGW cluster fed with Restic, Duplicati or similar tools), we can make good
use of the cheaper hardware (but then, this would of course need to be
host-managed SMR, and the file system would have to know about it). I currently
only know of Dropbox actively doing that (and I guess they can do it easily,
since they deduplicate data and probably rarely delete), and they seem to have
developed their own file system essentially to deal with this.

It would be cool to have this with Ceph. You might also think about having a
separate, SMR-backed pool for "colder" objects (likely coupled with SSDs /
NVMes for WAL / BlockDB). In short, we'd never even think about using it with
CephFS in our HPC cluster (unless some admin-controllable write-once-read-many
use cases evolve, which we could imagine for centrally managed high-energy
physics data), or with RBD in our virtualization cluster.
We're more interested in it for our backup cluster, which mostly sees data
ingest and where the chunking into larger batches is even done client-side
(Restic, Duplicati, etc.).
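
As a rough illustration of the client side of such a setup (the pool name
"backup-cold-smr" and the object naming are hypothetical, and nothing here is
SMR-specific beyond assuming that pool's OSDs sit on SMR-backed storage):

/* Illustrative only: write whole backup chunks into a hypothetical
 * SMR-backed pool via librados. The pool name and object naming scheme
 * are made up; the chunking itself is done client-side (Restic/Duplicati). */
#include <rados/librados.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    rados_t cluster;
    rados_ioctx_t io;
    static char chunk[4 * 1024 * 1024];        /* one client-side chunk */
    memset(chunk, 0x42, sizeof(chunk));

    if (rados_create(&cluster, NULL) < 0) return 1;
    if (rados_conf_read_file(cluster, NULL) < 0) return 1;  /* ceph.conf */
    if (rados_connect(cluster) < 0) return 1;

    /* Hypothetical pool whose OSDs live on SMR-backed storage,
     * with WAL/BlockDB on SSD/NVMe. */
    if (rados_ioctx_create(cluster, "backup-cold-smr", &io) < 0) {
        rados_shutdown(cluster);
        return 1;
    }

    /* Whole-object writes, never in-place updates -- the access pattern
     * this kind of archive pool would be built around. */
    if (rados_write_full(io, "restic-pack-0001", chunk, sizeof(chunk)) < 0)
        fprintf(stderr, "write failed\n");

    rados_ioctx_destroy(io);
    rados_shutdown(cluster);
    return 0;
}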

Of course, your point about resharding and PG splits fully applies, so this for 
sure needs careful development (and testing!) to reduce the randomness as far 
as possible
(if we want to make use of this hardware for the use cases it may fit).

Cheers and thanks for your input,
        Oliver

>
> --
> May the most significant bit of your life be positive.


_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
