From: Jan Kara <j...@suse.cz>
Date: Monday, February 29, 2016 at 22:40
To: Damien Le Moal <damien.lem...@hgst.com>
Cc: Jan Kara <j...@suse.cz>, "linux-bl...@vger.kernel.org"
<linux-bl...@vger.kernel.org>, Bart Van Assche <bart.vanass...@sandisk.com>,
Matias Bjorling <m...@bjorling.me>, "linux-scsi@vger.kernel.org"
<linux-scsi@vger.kernel.org>, "lsf...@lists.linuxfoundation.org"
<lsf...@lists.linuxfoundation.org>
Subject: Re: [Lsf-pc] [LSF/MM ATTEND] Online Logical Head Depop and SMR disks
chunked writepages
>On Mon 29-02-16 02:02:16, Damien Le Moal wrote:
>>
>> >On Wed 24-02-16 01:53:24, Damien Le Moal wrote:
>> >>
>> >> >On Tue 23-02-16 05:31:13, Damien Le Moal wrote:
>> >> >>
>> >> >> >On 02/22/16 18:56, Damien Le Moal wrote:
>> >> >> >> 2) Write back of dirty pages to SMR block devices:
>> >> >> >>
>> >> >> >> Dirty pages of a block device inode are currently processed using
>> >> >> >> the generic_writepages function, which can be executed
>> >> >> >> simultaneously by multiple contexts (e.g. sync, fsync, msync,
>> >> >> >> sync_file_range, etc.). Since mutual exclusion of the dirty page
>> >> >> >> processing is achieved only at the page level (page lock & page
>> >> >> >> writeback flag), multiple processes executing a "sync" of
>> >> >> >> overlapping block ranges over the same zone of an SMR disk can
>> >> >> >> cause an out-of-LBA-order sequence of write requests to be sent to
>> >> >> >> the underlying device. On a host-managed SMR disk, where sequential
>> >> >> >> writes to disk zones are mandatory, this results in errors and
>> >> >> >> makes it impossible to guarantee an application using raw
>> >> >> >> sequential disk write accesses successful completion of its write
>> >> >> >> or fsync requests.
>> >> >> >>
>> >> >> >> Using the zone information attached to the SMR block device queue
>> >> >> >> (introduced by Hannes), calls to the generic_writepages function
>> >> >> >> can be made mutually exclusive on a per-zone basis by locking the
>> >> >> >> zones. This guarantees sequential request generation for each zone
>> >> >> >> and avoids write errors without any modification to the generic
>> >> >> >> code implementing generic_writepages.
>> >> >> >>
>> >> >> >> This is but one possible solution for supporting SMR host-managed
>> >> >> >> devices without any major rewrite of page cache management and
>> >> >> >> write-back processing. The opinion of the audience regarding this
>> >> >> >> solution, and a discussion of other potential solutions, would be
>> >> >> >> greatly appreciated.
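
To make the idea above more concrete, here is a minimal sketch of the per-zone
serialization. This is not the actual patch: struct smr_zone, its start/len
fields and the smr_zone_trylock()/smr_zone_unlock() helpers are placeholders
invented for illustration, since their real form depends on Hannes' zone
information patches. Only generic_writepages() and struct writeback_control
are existing kernel interfaces here.

/* Sketch only; the zone type and lock helpers below are placeholders. */
struct smr_zone {
	sector_t	start;		/* first sector of the zone */
	sector_t	len;		/* zone size in sectors */
};

bool smr_zone_trylock(struct smr_zone *zone);	/* placeholder */
void smr_zone_unlock(struct smr_zone *zone);	/* placeholder */

static int smr_zone_writepages(struct address_space *mapping,
			       struct writeback_control *wbc,
			       struct smr_zone *zone)
{
	struct writeback_control zone_wbc = *wbc;
	int ret;

	/* Skip the zone if another context is already writing it back. */
	if (!smr_zone_trylock(zone))
		return 0;

	/* Restrict writeback to this zone so requests stay in LBA order. */
	zone_wbc.range_start = (loff_t)zone->start << 9;
	zone_wbc.range_end = ((loff_t)(zone->start + zone->len) << 9) - 1;

	ret = generic_writepages(mapping, &zone_wbc);

	smr_zone_unlock(zone);
	return ret;
}

The accounting of zone_wbc.nr_to_write back into the caller's wbc is glossed
over here; the real code would have to propagate it.
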
>> >> >> >
>> >> >> >Hello Damien,
>> >> >> >
>> >> >> >Is it sufficient to support filesystems like BTRFS on top of SMR
>> >> >> >drives, or would you also like to see filesystems like ext4 be able
>> >> >> >to use SMR drives? In the latter case: the behavior of SMR drives
>> >> >> >differs so significantly from that of other block devices that I'm
>> >> >> >not sure we should try to support these directly from infrastructure
>> >> >> >like the page cache. If we look e.g. at NAND SSDs, we see that the
>> >> >> >characteristics of NAND do not match what filesystems expect (e.g.
>> >> >> >large erase blocks). That is why every SSD vendor provides an FTL
>> >> >> >(Flash Translation Layer), either inside the SSD or as a separate
>> >> >> >software driver. An FTL implements a so-called LFS (log-structured
>> >> >> >filesystem). From what I know about SMR, this technology also looks
>> >> >> >suitable for the implementation of an LFS. Has implementing an LFS
>> >> >> >driver for SMR drives already been considered? That would make it
>> >> >> >possible for any filesystem to access an SMR drive like any other
>> >> >> >block device. I'm not sure of this, but maybe it would be possible to
>> >> >> >share some infrastructure with the LightNVM driver (directory
>> >> >> >drivers/lightnvm in the Linux kernel tree), which itself implements
>> >> >> >an FTL.
>> >> >>
>> >> >> I totally agree with you that trying to support SMR disks by only
>> >> >> modifying the page cache, so that unmodified standard file systems
>> >> >> like BTRFS or ext4 remain operational, is not realistic at best, and
>> >> >> more likely simply impossible. For this kind of use case, as you said,
>> >> >> an FTL or a device mapper driver is much more suitable.
>> >> >>
>> >> >> The case I am considering for this discussion is raw block device
>> >> >> accesses by an application (writes from user space to /dev/sdxx). This
>> >> >> is a very likely use case for high-capacity SMR disks with
>> >> >> applications like distributed object stores / key-value stores.
>> >> >>
>> >> >> In this case, write-back of dirty pages in the block device file
>> >> >> inode mapping is handled in fs/block_dev.c using the generic helper
>> >> >> function generic_writepages. This does not guarantee the generation of
>> >> >> the per-zone sequential write pattern required for host-managed disks.
>> >> >> As I explained, aligning calls of this function to zone boundaries
>> >> >> while locking the zones under write-back simply solves the problem
>> >> >> (implemented and tested). This is of course only one possible
>> >> >> solution. Pushing modifications deeper in the code, or providing a
>> >> >> "generic_sequential_writepages" helper function, are other potential
>> >> >> solutions that in my opinion are worth discussing, as other types of
>> >> >> devices may also benefit in terms of performance (e.g. regular disk
>> >> >> drives and SSDs also prefer sequential writes), and the overhead on an
>> >> >> underlying FTL or device mapper driver could be lightened.
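
As a rough illustration of what the "generic_sequential_writepages" idea could
look like from the block device side, the fragment below shows the kind of
change I have in mind in fs/block_dev.c. The helper itself does not exist
today; its name and prototype are purely illustrative. blkdev_writepages() and
generic_writepages() are the existing functions.

/*
 * Hypothetical helper: walk the device zones, take each zone lock in turn,
 * clamp wbc->range_start/range_end to the zone, and call generic_writepages()
 * so that each zone is written back by a single context at a time.
 */
int generic_sequential_writepages(struct address_space *mapping,
				  struct writeback_control *wbc);

static int blkdev_writepages(struct address_space *mapping,
			     struct writeback_control *wbc)
{
	/* Currently: return generic_writepages(mapping, wbc); */
	return generic_sequential_writepages(mapping, wbc);
}
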
>> >> >>
>> >> >> For a file system, an SMR-compliant implementation of the file inode
>> >> >> mapping writepages method should be provided by the file system
>> >> >> itself, as the sequentiality of the write pattern also depends on the
>> >> >> block allocation mechanism of the file system.
>> >> >>
>> >> >> Note that the goal here is not to hide the sequential write
>> >> >> constraint of SMR disks from applications. The page cache itself (the
>> >> >> mapping of the block device inode) remains unchanged. But the proposed
>> >> >> modification guarantees that a well-behaved application writing
>> >> >> sequentially to zones through the page cache will see successful sync
>> >> >> operations.
>> >> >
>> >> >So the easiest solution for the OS, when the application is already
>> >> >aware of the storage constraints, would be for the application to use
>> >> >direct IO. When using the page cache and writeback, there are all sorts
>> >> >of unexpected things that can happen (e.g. writeback decides to skip a
>> >> >page because someone else locked it temporarily). So it will work in
>> >> >99.9% of cases, but sometimes things will be out of order for
>> >> >hard-to-track-down reasons. For ordinary drives this is not an issue,
>> >> >because we just slow down writeback a bit, and the rareness of this
>> >> >makes it a non-issue. But for host-managed SMR the IO fails, and that is
>> >> >something the application does not expect.
>> >> >
>> >> >So I would really say just avoid using the page cache when you are
>> >> >using SMR drives directly, without a translation layer. For writes your
>> >> >throughput won't suffer anyway, since you have to do big sequential
>> >> >writes. Using the page cache for reads may still be beneficial, and if
>> >> >you are careful enough not to do direct IO writes to the same range as
>> >> >you do buffered reads, this will work fine.
>> >> >
>> >> >Thinking some more - if you want to make it foolproof, you could
>> >> >implement something like a read-only page cache for block devices. Any
>> >> >write would in fact be a direct IO write, writeable mmaps would be
>> >> >disallowed, and reads would honor the O_DIRECT flag.
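
For reference, the application-side cost of the direct IO approach Jan
describes is indeed small. The fragment below is only an illustration (error
handling trimmed, 4096-byte alignment assumed, and the zone start offset and
length assumed to be known from elsewhere), not code from any of the patches
discussed here:

/* Illustration only: strictly sequential O_DIRECT writes into one zone. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

static int write_zone(const char *dev, off_t zone_start, size_t zone_len)
{
	const size_t chunk = 1024 * 1024;	/* 1 MiB, multiple of 4 KiB */
	size_t done = 0;
	void *buf;
	int fd;

	fd = open(dev, O_WRONLY | O_DIRECT);
	if (fd < 0)
		return -1;
	/* O_DIRECT requires aligned buffers, offsets and lengths. */
	if (posix_memalign(&buf, 4096, chunk)) {
		close(fd);
		return -1;
	}
	memset(buf, 0, chunk);

	/* Writes are issued strictly in order from the start of the zone. */
	while (done < zone_len) {
		ssize_t ret = pwrite(fd, buf, chunk, zone_start + done);
		if (ret < 0)
			break;
		done += ret;
	}
	free(buf);
	close(fd);
	return done >= zone_len ? 0 : -1;
}
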
>> >>
>> >> Hi Jan,
>> >>
>> >> Indeed, using O_DIRECT for raw block device writes is an obvious
>> >> solution to guarantee the application successful sequential writes
>> >> within a zone. However, host-managed SMR disks (and to a lesser extent
>> >> host-aware drives too) already put on applications the constraint of
>> >> ensuring sequential writes. Adding on top of this a further mandatory
>> >> rewrite to support direct I/Os is in my opinion asking a lot, if not too
>> >> much.
>> >
>> >So I don't think adding O_DIRECT to the open flags is such a burden -
>> >sequential writes are IMO much harder to do :). Furthermore, this could
>> >happen magically inside the kernel, in which case the app needn't be aware
>> >of this at all (similarly to how we handle writes to persistent memory).
>> >
>> >> The example you mention above of writeback skipping a locked page and
>> >> resulting in I/O errors is precisely what the proposed patch avoids by
>> >> first locking the zone the page belongs to. In the same spirit as the
>> >> writeback page locking, if the zone is already locked, it is skipped.
>> >> That is, zones are treated in a sense as gigantic pages, ensuring that
>> >> the actual dirty pages within each one are processed in one go,
>> >> sequentially.
>> >
>> >But you cannot rule out the mm subsystem locking a page to do something
>> >(e.g. migrate the page to help with compaction of large-order pages).
>> >These other places accessing and locking pages are what I'm worried about.
>> >Furthermore, kswapd can decide to write back a particular page under
>> >memory pressure, and that will just make the SMR disk freak out.
>> >
>> >> This allows preserving all possible application-level accesses
>> >> (buffered, direct or mmapped). The only constraint is the one the disk
>> >> imposes: writes must be sequential.
>> >>
>> >> Granted, this view may be too simplistic and may be overlooking some
>> >> hard-to-track page locking paths which will compete with this. But I
>> >> think that this can be easily solved by forcing the zone-aligned
>> >> generic_writepages calls to not skip any page (a flag in struct
>> >> writeback_control would do the trick). And no modification is necessary
>> >> on the read side (i.e. page locking alone is enough), since reading an
>> >> SMR disk's blocks beyond a zone's write-pointer position does not make
>> >> sense (in Hannes' code this is possible, but the request does not go to
>> >> the disk and returns garbage data).
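
Just to illustrate the kind of call I have in mind for the zone-aligned
writeback, and where that hypothetical "do not skip" flag would sit: the
.no_page_skip field does not exist in struct writeback_control today, and
zone_start_byte/zone_end_byte stand for the zone byte range taken from the
zone information. The other fields and WB_SYNC_ALL are existing interfaces,
and WB_SYNC_ALL already makes write_cache_pages wait for pages under writeback
instead of skipping them.

	struct writeback_control zone_wbc = {
		.sync_mode	= WB_SYNC_ALL,
		.nr_to_write	= LONG_MAX,
		.range_start	= zone_start_byte,	/* zone byte range from */
		.range_end	= zone_end_byte,	/* the zone information */
		/* .no_page_skip = 1, */	/* proposed flag, not in mainline */
	};

	ret = generic_writepages(mapping, &zone_wbc);
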
>> >>
>> >> Bottom line: no fundamental change to the page caching mechanism is
>> >> needed; only a change in how it is used/controlled for writeback makes
>> >> this work. Considering the benefits on the application side, it is in my
>> >> opinion a valid modification to have.
>> >
>> >See above, there are quite a few places which will break your
>> >assumptions, and I don't think changing them all to handle SMR is worth
>> >it. IMO caching sequential writes to SMR disks has little effect (if any)
>> >anyway, so I would just avoid that. We can talk about how to make this as
>> >seamless to applications as possible. The only thing which I don't think
>> >is reasonably doable without dirtying the page cache is writeable mmaps of
>> >an SMR device, so applications would have to avoid those.
>>
>> Jan,
>>
>> Thank you for your insight.
>> These "few places" breaking sequential write sequences are indeed
>> problematic for SMR drives. At the same time, I wonder how these paths
>> would react to an I/O error generated by the check "write at write
>> pointer" in the request submission path at the SCSI level. Could these be
>> ignored in the case of an "unaligned write error" ? That is, the page is
>> left dirty and hopefully the regular writeback path catches them later in
>> the proper sequence.
>
>You'd hope ;) But in fact what happens is that the page ends up being clean,
>marked as having an error, and the buffers will not be uptodate => you have
>just lost one page worth of data. See what happens in
>end_buffer_async_write(). Now, our behavior in the presence of IO errors has
>needed improvement for a long time, so you are certainly welcome to improve
>on this, but what I described is what happens now.
Jan,
Got it. Thanks for the pointers.
I will work a little more on identifying this. In any case, the first problem
to tackle, I guess, is to get more information than just a -EIO on error.
Without that, there is no chance of ever being able to retry recoverable
errors (unaligned writes).
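
For reference, my rough reading of the failure branch Jan pointed at in
fs/buffer.c's end_buffer_async_write() (paraphrased from memory, not a
verbatim quote) is roughly the following, which is why the data is simply
gone rather than left dirty for a later retry:

	/* Rough paraphrase of the !uptodate branch. */
	if (!uptodate) {
		buffer_io_error(bh, ", lost async page write");
		set_bit(AS_EIO, &page->mapping->flags);	/* fsync() will see -EIO */
		set_buffer_write_io_error(bh);
		clear_buffer_uptodate(bh);	/* buffer contents now treated as invalid */
		SetPageError(page);
	}
	...
	end_page_writeback(page);	/* page ends up clean, it is not redirtied */
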
Thanks!
Best regards.
------------------------
Damien Le Moal, Ph.D.
Sr. Manager, System Software Group, HGST Research,
HGST, a Western Digital company
damien.lem...@hgst.com
(+81) 0466-98-3593 (ext. 513593)
1 kirihara-cho, Fujisawa,
Kanagawa, 252-0888 Japan
www.hgst.com