On Tue 01-03-16 00:43:37, Damien Le Moal wrote:
> From: Jan Kara <j...@suse.cz>
> Date: Monday, February 29, 2016 at 22:40
> To: Damien Le Moal <damien.lem...@hgst.com>
> Cc: Jan Kara <j...@suse.cz>, "linux-bl...@vger.kernel.org" <linux-bl...@vger.kernel.org>, Bart Van Assche <bart.vanass...@sandisk.com>, Matias Bjorling <m...@bjorling.me>, "linux-scsi@vger.kernel.org" <linux-scsi@vger.kernel.org>, "lsf...@lists.linuxfoundation.org" <lsf...@lists.linuxfoundation.org>
> Subject: Re: [Lsf-pc] [LSF/MM ATTEND] Online Logical Head Depop and SMR disks chunked writepages
> 
> >On Mon 29-02-16 02:02:16, Damien Le Moal wrote:
> >> 
> >> >On Wed 24-02-16 01:53:24, Damien Le Moal wrote:
> >> >> 
> >> >> >On Tue 23-02-16 05:31:13, Damien Le Moal wrote:
> >> >> >> 
> >> >> >> >On 02/22/16 18:56, Damien Le Moal wrote:
> >> >> >> >> 2) Write back of dirty pages to SMR block devices:
> >> >> >> >> 
> >> >> >> >> Dirty pages of a block device inode are currently processed using the generic_writepages function, which can be executed simultaneously by multiple contexts (e.g. sync, fsync, msync, sync_file_range, etc.). Since mutual exclusion of the dirty page processing is achieved only at the page level (page lock & page writeback flag), multiple processes executing a "sync" of overlapping block ranges over the same zone of an SMR disk can cause an out-of-LBA-order sequence of write requests to be sent to the underlying device. On a host-managed SMR disk, where sequential write to disk zones is mandatory, this results in errors and makes it impossible to guarantee an application using raw sequential disk write accesses that its write or fsync requests complete successfully.
> >> >> >> >> 
> >> >> >> >> Using the zone information attached to the SMR block device queue (introduced by Hannes), calls to the generic_writepages function can be made mutually exclusive on a per-zone basis by locking the zones. This guarantees sequential request generation for each zone and avoids write errors without any modification to the generic code implementing generic_writepages.
> >> >> >> >> 
> >> >> >> >> This is but one possible solution for supporting SMR host-managed devices without any major rewrite of page cache management and write-back processing. The opinion of the audience regarding this solution, and a discussion of other potential solutions, would be greatly appreciated.
> >> >> >> >
> >> >> >> >Hello Damien,
> >> >> >> >
> >> >> >> >Is it sufficient to support filesystems like BTRFS on top of SMR drives, or would you also like to see filesystems like ext4 be able to use SMR drives? In the latter case: the behavior of SMR drives differs so significantly from that of other block devices that I'm not sure we should try to support them directly from infrastructure like the page cache. If we look e.g. at NAND SSDs, we see that the characteristics of NAND do not match what filesystems expect (e.g. large erase blocks). That is why every SSD vendor provides an FTL (Flash Translation Layer), either inside the SSD or as a separate software driver. An FTL implements a so-called LFS (log-structured filesystem). From what I know about SMR, this technology also looks suitable for implementing an LFS. Has it already been considered to implement an LFS driver for SMR drives? That would make it possible for any filesystem to access an SMR drive like any other block device. I'm not sure about this, but maybe it will be possible to share some infrastructure with the LightNVM driver (directory drivers/lightnvm in the Linux kernel tree); that driver implements an FTL.
> >> >> >> 
> >> >> >> I totally agree with you that trying to support SMR disks by only modifying the page cache, so that unmodified standard file systems like BTRFS or ext4 remain operational, is unrealistic at best and more likely simply impossible. For this kind of use case, as you said, an FTL or a device mapper driver is much more suitable.
> >> >> >> 
> >> >> >> The case I am considering for this discussion is raw block device accesses by an application (writes from user space to /dev/sdxx). This is a very likely use case for high-capacity SMR disks with applications like distributed object stores / key-value stores.
> >> >> >> 
> >> >> >> In this case, write-back of dirty pages in the block device file inode mapping is handled in fs/block_dev.c using the generic helper function generic_writepages. This does not guarantee the generation of the sequential write pattern per zone required by host-managed disks. As I explained, aligning calls of this function to zone boundaries while locking the zones under write-back simply solves the problem (implemented and tested). This is of course only one possible solution. Pushing modifications deeper into the code, or providing a "generic_sequential_writepages" helper function, are other potential solutions that in my opinion are worth discussing, as other types of devices may also benefit in terms of performance (e.g. regular disk drives prefer sequential writes, and SSDs as well) and/or of a lighter overhead on an underlying FTL or device mapper driver.
> >> >> >> 
> >> >> >> For a file system, an SMR-compliant implementation of a file inode mapping writepages method should be provided by the file system itself, as the sequentiality of the write pattern further depends on the block allocation mechanism of the file system.
> >> >> >> 
> >> >> >> Note that the goal here is not to hide the sequential write constraint of SMR disks from applications. The page cache itself (the mapping of the block device inode) remains unchanged. But the proposed modification guarantees that a well-behaved application writing sequentially to zones through the page cache will see successful sync operations.
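To make the proposal concrete, a minimal sketch of such zone-serialized writeback could look roughly as follows. Only generic_writepages() and struct writeback_control are existing kernel interfaces here; the for_each_bdev_zone() iterator and the blkdev_zone_trylock()/blkdev_zone_unlock() helpers are hypothetical stand-ins for the zone information and locking that Hannes' patches attach to the request queue, and the intersection of each zone with the range requested by the caller is omitted for brevity.

#include <linux/fs.h>
#include <linux/blkdev.h>
#include <linux/writeback.h>

/* Hedged sketch only, not existing kernel code. */
static int blkdev_writepages_seq(struct address_space *mapping,
				 struct writeback_control *wbc)
{
	struct block_device *bdev = I_BDEV(mapping->host);
	struct blk_zone *zone;			/* hypothetical zone descriptor */
	int ret = 0;

	for_each_bdev_zone(bdev, zone) {	/* hypothetical iterator */
		/* A zone that is already locked is being flushed by another
		 * context: skip it, as writeback skips locked pages today. */
		if (!blkdev_zone_trylock(zone))
			continue;

		/* Restrict this pass to the LBA range of one zone so that
		 * the requests generated for that zone stay sequential. */
		wbc->range_start = (loff_t)zone->start << 9;
		wbc->range_end = ((loff_t)(zone->start + zone->len) << 9) - 1;

		ret = generic_writepages(mapping, wbc);

		blkdev_zone_unlock(zone);
		if (ret || wbc->nr_to_write <= 0)
			break;
	}
	return ret;
}

Because a zone stays locked for the whole duration of the generic_writepages() call covering it, two contexts syncing overlapping ranges of the same zone cannot interleave their requests, which is the ordering property the proposal relies on.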
> >> >> >
> >> >> >So the easiest solution for the OS, when the application is already aware of the storage constraints, would be for the application to use direct IO, because when using the page cache and writeback all sorts of unexpected things can happen (e.g. writeback decides to skip a page because someone else locked it temporarily). So it will work in 99.9% of cases, but sometimes things will be out of order for hard-to-track-down reasons. For ordinary drives this is not an issue because we just slow down writeback a bit, and the rareness of this makes it a non-issue. But for host-managed SMR the IO fails, and that is something the application does not expect.
> >> >> >
> >> >> >So I would really say: just avoid using the page cache when you are using SMR drives directly without a translation layer. For writes your throughput won't suffer anyway, since you have to do big sequential writes. Using the page cache for reads may still be beneficial, and if you are careful enough not to do direct IO writes to the same range as you do buffered reads, this will work fine.
> >> >> >
> >> >> >Thinking some more - if you want to make it foolproof, you could implement something like a read-only page cache for block devices. Any write would in fact be a direct IO write, writeable mmaps would be disallowed, and reads would honor the O_DIRECT flag.
> >> >> 
> >> >> Hi Jan,
> >> >> 
> >> >> Indeed, using O_DIRECT for raw block device writes is an obvious solution to guarantee the application successful sequential writes within a zone. However, host-managed SMR disks (and to a lesser extent host-aware drives too) already put on applications the constraint of ensuring sequential writes. Adding to this a further mandatory rewrite to support direct I/Os is, in my opinion, asking a lot, if not too much.
> >> >
> >> >So I don't think adding O_DIRECT to the open flags is such a burden - sequential writes are IMO much harder to do :). And furthermore this could happen magically inside the kernel, in which case the app needn't be aware of it at all (similarly to how we handle writes to persistent memory).
> >> >
> >> >> The example you mention above of writeback skipping a locked page and resulting in I/O errors is precisely what the proposed patch avoids by first locking the zone the page belongs to. In the same spirit as the writeback page locking, if the zone is already locked, it is skipped. That is, zones are treated in a sense as gigantic pages, ensuring that the actual dirty pages within each one are processed in one go, sequentially.
> >> >
> >> >But you cannot rule out the mm subsystem locking a page to do something else (e.g. migrating the page to help with compaction of large-order pages). These other places accessing and locking pages are what I'm worried about. Furthermore, kswapd can decide to write back a particular page under memory pressure, and that will just make the SMR disk freak out.
> >> >
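For illustration, the direct IO route discussed above might look, from user space, roughly like the sketch below. The device path, zone offset and I/O size are invented for the example; a real application would discover the zone layout and current write pointer from the device (e.g. via REPORT ZONES) instead of hard-coding them.

#define _GNU_SOURCE		/* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const off_t zone_start = 256ULL << 20;	/* made-up zone offset (bytes) */
	const size_t io_size = 1 << 20;		/* 1 MiB per write */
	off_t pos = zone_start;
	void *buf;
	int fd, i;

	/* O_DIRECT writes bypass the page cache entirely, so nothing can
	 * reorder them behind the application's back. */
	fd = open("/dev/sdb", O_WRONLY | O_DIRECT);
	if (fd < 0)
		return 1;

	/* O_DIRECT requires a suitably aligned buffer. */
	if (posix_memalign(&buf, 4096, io_size)) {
		close(fd);
		return 1;
	}
	memset(buf, 0, io_size);

	/* Issue the writes strictly in ascending offset order within the
	 * zone, starting at the zone's write pointer. */
	for (i = 0; i < 8; i++) {
		if (pwrite(fd, buf, io_size, pos) != (ssize_t)io_size)
			break;
		pos += io_size;
	}

	free(buf);
	close(fd);
	return 0;
}

Whether reads of the same device still go through the page cache is then an independent choice, as noted above.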
> >> >> This allows preserving all possible application-level accesses (buffered, direct or mmapped). The only constraint is the one the disk imposes: writes must be sequential.
> >> >> 
> >> >> Granted, this view may be too simplistic and may be overlooking some hard-to-track page locking paths which will compete with this. But I think that this can be easily solved by forcing the zone-aligned generic_writepages calls to not skip any page (a flag in struct writeback_control would do the trick). And no modification is necessary on the read side (i.e. page locking alone is enough), since reading an SMR disk's blocks past a zone's write-pointer position does not make sense (in Hannes' code this is possible, but the request does not go to the disk and returns garbage data).
> >> >> 
> >> >> Bottom line: no fundamental change to the page caching mechanism is needed, only a change to how it is used/controlled for writeback. Considering the benefits on the application side, it is in my opinion a valid modification to have.
> >> >
> >> >See above, there are quite a few places which will break your assumptions, and I don't think changing them all to handle SMR is worth it. IMO caching sequential writes to SMR disks has little benefit (if any) anyway, so I would just avoid that. We can talk about how to make this as seamless to applications as possible. The only thing which I don't think is reasonably doable without dirtying the page cache is writeable mmaps of an SMR device, so applications would have to avoid those.
> >> 
> >> Jan,
> >> 
> >> Thank you for your insight. These "few places" breaking sequential write sequences are indeed problematic for SMR drives. At the same time, I wonder how these paths would react to an I/O error generated by the "write at write pointer" check in the request submission path at the SCSI level. Could such errors be ignored in the case of an "unaligned write error"? That is, the page is left dirty, and hopefully the regular writeback path catches it later in the proper sequence.
> >
> >You'd hope ;) But in fact what happens is that the page ends up clean, marked as having an error, and its buffers not uptodate => you have just lost one page worth of data. See what happens in end_buffer_async_write(). Now, our behavior in the presence of IO errors has needed improvement for a long time, so you are certainly welcome to improve on this, but what I described is what happens now.
> 
> Jan,
> 
> Got it. Thanks for the pointers. I will work a little more on identifying this. In any case, the first problem to tackle, I guess, is to get more information than just a -EIO on error. Without that, there is no chance of ever being able to retry recoverable errors (unaligned writes).
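Purely as a hypothetical illustration of that last point: if the SCSI layer could report the unaligned-write failure to the completion path as a distinct, retryable error rather than a bare -EIO, a block device writeback completion could keep the data and let a later, zone-ordered flush resubmit it. Nothing like the helper below exists in the kernel today, and -EAGAIN is only an arbitrary stand-in for such an error code.

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/page-flags.h>

/* Hypothetical sketch only, not existing kernel code. */
static void blkdev_end_zone_writeback(struct page *page, int error)
{
	if (error == -EAGAIN) {
		/* Transient ordering failure: keep the page dirty so the
		 * next zone-aligned writeback pass resubmits it in order. */
		set_page_dirty(page);
	} else if (error) {
		/* Anything else is handled as today: record the error and
		 * drop the data. */
		SetPageError(page);
		mapping_set_error(page->mapping, error);
	}
	end_page_writeback(page);
}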
Yes, propagating more information to the fs / writeback code so that it can distinguish permanent errors from transient ones is certainly useful for use cases other than SMR.

								Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR