Writing tests for write back
I am writing tests for the writeback section of the kernel, the code found in fs/fs-writeback.c. Currently I have tests that fsync millions of inodes concurrently, tests involving mounting and unmounting file systems, and other tests that tickle starvation of big files. Has anyone else been bitten by writeback misbehaving, either in the past or now? If so, please send me the situation so I can incorporate it into my tests and also make sure my new attempt at an implementation does not cause damage. These tests will end up in autotest on test.kernel.org, so they will become something the whole community can use. mrubin
Re: Possible fix for lockup in drop_caches
On Dec 22, 2007 2:06 AM, Andrew Morton <[EMAIL PROTECTED]> wrote: > Oh boy. Do we really want to add all this stuff to JBD just for > drop_caches which is a silly root-only broken-in-22-other-ways thing? > > Michael, might your convert-inode-lists-to-tree patches eliminate the need > for taking inode_lock in drop_pagecache_sb()? Probably not, as it uses an > rbtree. It would have been possible if it was using a radix-tree, I > suspect. You are correct. The rbtree will still require the lock. > > > -void __journal_unfile_buffer(struct journal_head *jh) > > +void __journal_unfile_buffer(struct journal_head *jh, > > + struct buffer_head **dirty_bh) > > { > > - __journal_temp_unlink_buffer(jh); > > + __journal_temp_unlink_buffer(jh, dirty_bh); > > jh->b_transaction = NULL; > > } > > I suspect the code would end up simpler if __journal_unfile_buffer() were > to take an additional ref on the bh which it placed at *dirty_bh. > > Callers of __journal_unfile_buffer() could then call > > void handle_dirty_bh(struct buffer_head *bh) > { > if (bh) { > jbd_mark_buffer_dirty(bh); > put_bh(bh); > } > } > > ?
Re: Possible fix for lockup in drop_caches
On Dec 22, 2007 2:06 AM, Andrew Morton <[EMAIL PROTECTED]> wrote: > On Mon, 17 Dec 2007 12:13:22 + richard <[EMAIL PROTECTED]> wrote: > Michael, might your convert-inode-lists-to-tree patches eliminate the need > for taking inode_lock in drop_pagecache_sb()? Probably not, as it uses an > rbtree. It would have been possible if it was using a radix-tree, I > suspect. You are correct: the new patch, based on an rbtree, will still require the lock. mrubin
[patch 1/1] Writeback fix for concurrent large and small file writes.
From: Michael Rubin <[EMAIL PROTECTED]> This fixes a bug where writing to large files while concurrently writing to smaller ones creates a situation where writeback cannot keep up with the traffic and memory balloons until we hit the threshold watermark. This can result in surprising latency spikes when syncing, and on large memory systems that latency can take minutes. Upon request I can provide a test to reproduce this situation. The only concern I have is that this makes wb_kupdate slightly more aggressive. I am not sure that is enough to cause any problems; I think there are enough checks to throttle the background activity. Feng, the one-line change that you recommended here http://marc.info/?l=linux-kernel&m=119629655402153&w=2 had no effect. Signed-off-by: Michael Rubin <[EMAIL PROTECTED]> --- Index: 2624rc3_feng/fs/fs-writeback.c === --- 2624rc3_feng.orig/fs/fs-writeback.c 2007-11-29 14:44:24.0 -0800 +++ 2624rc3_feng/fs/fs-writeback.c 2007-12-10 17:21:45.0 -0800 @@ -408,8 +408,7 @@ sync_sb_inodes(struct super_block *sb, s { const unsigned long start = jiffies; /* livelock avoidance */ - if (!wbc->for_kupdate || list_empty(&sb->s_io)) - queue_io(sb, wbc->older_than_this); + queue_io(sb, wbc->older_than_this); while (!list_empty(&sb->s_io)) { struct inode *inode = list_entry(sb->s_io.prev, Index: 2624rc3_feng/mm/page-writeback.c === --- 2624rc3_feng.orig/mm/page-writeback.c 2007-11-16 21:16:36.0 -0800 +++ 2624rc3_feng/mm/page-writeback.c 2007-12-10 17:37:17.0 -0800 @@ -638,7 +638,7 @@ static void wb_kupdate(unsigned long arg wbc.nr_to_write = MAX_WRITEBACK_PAGES; writeback_inodes(&wbc); if (wbc.nr_to_write > 0) { - if (wbc.encountered_congestion || wbc.more_io) + if (wbc.encountered_congestion) congestion_wait(WRITE, HZ/10); else break; /* All the old data is written */
Re: [patch 1/1] Writeback fix for concurrent large and small file writes.
On Dec 12, 2007 12:55 PM, Peter Zijlstra <[EMAIL PROTECTED]> wrote: > > On Mon, 2007-12-10 at 18:02 -0800, Michael Rubin wrote: > > From: Michael Rubin <[EMAIL PROTECTED]> > The part I miss here is the rationale on _how_ you solve the problem. > > The patch itself is simple enough, but I've been staring at this code > for a while now, and I'm just not getting it. Apologies for the lack of rationale. I have been staring at this code for a while also, and it makes my head hurt. I have a patch coming (hopefully today) that proposes using one data structure with a more consistent priority scheme for 2.6.25. To me it's simpler, but I am biased. The problem we encounter when we append to a large file at a fast rate while also writing to smaller files is that the wb_kupdate thread does not keep up with disk traffic. In this workload the inodes often end up at fs/fs-writeback.c:287 after do_writepages, because do_writepages did not write all the pages. This can be due to congestion, but I think there are other causes as well, based on what I have observed. The first issue is that the inode is put on the s_more_io queue, which ensures that more_io is set at the end of sync_sb_inodes. The result is that the wb_kupdate routine performs a sleep at mm/page-writeback.c:642. This slows us down enough that wb_kupdate cannot keep up with traffic. The other issue is that an inode placed on the s_more_io queue cannot be processed by sync_sb_inodes until the entire s_io list is empty. With lots of small files that are not being dirtied as quickly as the one large inode on the s_more_io queue, the inode with the most pages being dirtied is never given attention, and again wb_kupdate cannot keep up. mrubin
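To make the failure loop concrete, below is a minimal userspace model (plain C, not kernel code) of the wb_kupdate() behavior described above. The loop shape follows the mm/page-writeback.c excerpt quoted in the patch; the CHUNK constant and the workload size are invented for illustration, and the model assumes do_writepages() stops early on the big inode so that it keeps landing back on s_more_io with more_io set.

#include <stdio.h>

#define MAX_WRITEBACK_PAGES 1024
#define CHUNK 64	/* pages actually written per pass (invented) */

struct writeback_control {
	long nr_to_write;
	int more_io;
	int encountered_congestion;
};

static long dirty_pages = 1L << 20;	/* one big dirty file, ~4GB at 4KB pages */

/* Stand-in for writeback_inodes(): writes only CHUNK pages, then the
 * still-dirty inode is re-parked on s_more_io, which sets more_io. */
static void writeback_inodes(struct writeback_control *wbc)
{
	long n = dirty_pages < CHUNK ? dirty_pages : CHUNK;

	dirty_pages -= n;
	wbc->nr_to_write -= n;
	wbc->more_io = dirty_pages > 0;
}

int main(void)
{
	long nr_to_write = dirty_pages, sleeps = 0;

	while (nr_to_write > 0) {
		struct writeback_control wbc = {
			.nr_to_write = MAX_WRITEBACK_PAGES,
		};

		writeback_inodes(&wbc);
		if (wbc.nr_to_write > 0) {
			if (wbc.encountered_congestion || wbc.more_io)
				sleeps++;	/* congestion_wait(WRITE, HZ/10) */
			else
				break;	/* all the old data is written */
		}
		nr_to_write -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
	}
	printf("%ld sleeps of 100ms = ~%ld seconds idle\n", sleeps, sleeps / 10);
	return 0;
}

With these toy numbers the flusher spends roughly 27 minutes asleep to clear the one file, which matches the "latency can take minutes" symptom in the patch description.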
[patch] Converting writeback linked lists to a tree based data structure
From: Michael Rubin <[EMAIL PROTECTED]> This is an attempt to unify the writeback data structures. By adding an rb tree we are able to have one consistent time-ordering mechanism for writeback. This should aid debugging and allow for future work with more sophisticated time-ordering methods. This is a proposal for 2.6.25. The patch below includes the following changes. 1) Adding a data structure to guarantee fairness when writing inodes to disk. The flush_tree is based on an rbtree, with duplicate keys being chained off the same rb_node. 2) Added an FS flag to mark file systems that are not disk backed so we don't have to flush them. I am not sure I marked all of them, but just marking these improves writeback performance. 3) Added an inode flag to allow inodes to be marked so that they are never written back to disk. See get_pipe_inode. Under autotest this patch has passed: fsx, bonnie, and iozone. I am currently writing more writeback-focused tests (which it has passed so far) to add into autotest. Performance-wise I ran a quick test: a) I used sysctl to stop background writeback. b) I ran the "sync" command. c) Then I created 10,000,000 files in directories of 1000 files per directory. d) Finally I timed the "sync" command for all the dirty inodes that had been parked. I ran the perf test 5 times on each kernel. The average for the 2.6.24 kernel was 87.8 seconds; with this patch it was 69.3. Signed-off-by: Michael Rubin <[EMAIL PROTECTED]> --- Index: 2624rc3_wb/fs/block_dev.c === --- 2624rc3_wb.orig/fs/block_dev.c 2007-12-11 13:52:47.0 -0800 +++ 2624rc3_wb/fs/block_dev.c 2007-12-11 13:55:01.0 -0800 @@ -518,6 +518,7 @@ static struct file_system_type bd_type = .name = "bdev", .get_sb = bd_get_sb, .kill_sb = kill_anon_super, + .fs_flags = FS_ANONYMOUS, }; static struct vfsmount *bd_mnt __read_mostly; Index: 2624rc3_wb/fs/fs-writeback.c === --- 2624rc3_wb.orig/fs/fs-writeback.c 2007-12-11 13:52:47.0 -0800 +++ 2624rc3_wb/fs/fs-writeback.c 2007-12-11 14:19:33.0 -0800 @@ -23,8 +23,174 @@ #include #include #include +#include #include "internal.h" +#define rb_to_inode(node) rb_entry((node), struct inode, i_flush_node) + +/* + * When inodes are parked for writeback they are parked in the + * flush_tree. The flush tree is a data structure based on an rb tree. + * + * Duplicate keys are handled by making a list in the tree for each key + * value. The order of how we choose the next inode to flush is decided + * by two fields. First the earliest dirtied_when value. If there are + * duplicate dirtied_when values then the earliest i_flushed_when value + * determines who gets flushed next. + * + * The flush tree organizes the dirtied_when keys with the rb_tree. Any + * inodes with a duplicate dirtied_when value are link listed together. This + * link list is sorted by the inode's i_flushed_when. When both the + * dirtied_when and the i_flushed_when are identical the order in the + * linked list determines the order we flush the inodes. + */ + +/* + * Find a rb_node matching the key in the flush tree. There are no duplicate + * rb_nodes in the tree. Instead they are chained off the first node.
+ */ +static struct inode *flush_tree_search(struct super_block *sb, + unsigned long ts) +{ + struct rb_node *n = sb->s_flush_root.rb_node; + assert_spin_locked(&inode_lock); + while (n) { + struct inode *inode = rb_to_inode(n); + if (time_before(ts, inode->dirtied_when)) { + n = n->rb_left; + } else if (time_after(ts, inode->dirtied_when)) { + n = n->rb_right; + } else { + return inode; + } + } + return NULL; +} + +/* + * Inserting an inode into the flush tree. The tree is keyed by the + * dirtied_when member. + * + * If there is a duplicate key in the tree already the new inode is put + * on the tail of a list of the rb_node. + * All inserted inodes must have one of the I_DIRTY flags set. + */ +static void flush_tree_insert(struct super_block *sb, struct inode *inode) +{ + struct rb_node **new = &(sb->s_flush_root.rb_node); + struct rb_node *parent = NULL; + + assert_spin_locked(&inode_lock); + BUG_ON((inode->i_state & I_DIRTY) == 0); + BUG_ON(inode->i_state & (I_FREEING|I_CLEAR)); + BUG_ON(RB_LINKED_NODE(&inode->i_flush_node)); + + sb->s_flush_count++; + + list_del_init(&inode->i_list); + while (*new) { + struct inode *this = rb_to_inode(*new); + parent = *new
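The patch hunk above is cut off in this archive, so here is a standalone sketch of the duplicate-key scheme it describes: one tree node per distinct dirtied_when, with same-key inodes chained in FIFO order off that node. It uses a plain unbalanced binary search tree in place of the kernel's rbtree, and all names (toy_inode, next_dup) are hypothetical.

#include <stdio.h>

struct toy_inode {
	unsigned long dirtied_when;
	struct toy_inode *next_dup;	/* chain of same-key inodes */
	struct toy_inode *left, *right;	/* tree links, first-of-key only */
	int id;
};

/* One tree node per distinct key; later duplicates go to the chain
 * tail, so the earliest-queued inode flushes first among equal keys. */
static struct toy_inode *flush_tree_insert(struct toy_inode *root,
					   struct toy_inode *in)
{
	if (!root)
		return in;
	if (in->dirtied_when < root->dirtied_when)
		root->left = flush_tree_insert(root->left, in);
	else if (in->dirtied_when > root->dirtied_when)
		root->right = flush_tree_insert(root->right, in);
	else {
		struct toy_inode *p = root;
		while (p->next_dup)
			p = p->next_dup;
		p->next_dup = in;	/* tail of the duplicate list */
	}
	return root;
}

static void flush_in_order(struct toy_inode *n)
{
	if (!n)
		return;
	flush_in_order(n->left);
	for (struct toy_inode *p = n; p; p = p->next_dup)
		printf("flush inode %d (dirtied_when=%lu)\n", p->id,
		       p->dirtied_when);
	flush_in_order(n->right);
}

int main(void)
{
	struct toy_inode pool[5] = {
		{ .dirtied_when = 30, .id = 0 }, { .dirtied_when = 10, .id = 1 },
		{ .dirtied_when = 30, .id = 2 }, { .dirtied_when = 20, .id = 3 },
		{ .dirtied_when = 10, .id = 4 },
	};
	struct toy_inode *root = NULL;

	for (int i = 0; i < 5; i++)
		root = flush_tree_insert(root, &pool[i]);
	flush_in_order(root);	/* prints inodes 1, 4, 3, 0, 2 */
	return 0;
}

An in-order walk plus a walk of each duplicate chain yields exactly the flush order the patch comment specifies: earliest dirtied_when first, and first-queued first among equals.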
[patch 1/1] Writeback fix for concurrent large and small file writes
>From [EMAIL PROTECTED] Wed Nov 28 11:10:06 2007 Message-Id: <[EMAIL PROTECTED]> Date: Wed, 28 Nov 2007 11:01:21 -0800 From: [EMAIL PROTECTED] To: [EMAIL PROTECTED] Subject: [patch 1/1] Writeback fix for concurrent large and small file writes. From: Michael Rubin <[EMAIL PROTECTED]> Fixing a bug where writing to large files while concurrently writing to smaller ones creates a situation where writeback cannot keep up with the traffic and memory balloons until we hit the threshold watermark. This can result in surprising latency spikes when syncing. This latency can take minutes on large memory systems. Upon request I can provide a test to reproduce this situation. The flush tree fixes this issue and fixes several other minor issues with fairness also. 1) Adding a data structure to guarantee fairness when writing inodes to disk. The flush_tree is based on an rbtree. The only difference is how duplicate keys are chained off the same rb_node. 2) Added a FS flag to mark file systems that are not disk backed so we don't have to flush them. Not sure I marked all of them. But just marking these improves writeback performance. 3) Added an inode flag to allow inodes to be marked so that they are never written back to disk. See get_pipe_inode. Under autotest this patch has passed: fsx, bonnie, and iozone. I am currently writing more writeback focused tests (which so far have been passed) to add into autotest. Signed-off-by: Michael Rubin <[EMAIL PROTECTED]> --- Index: 2624rc3/fs/block_dev.c === --- 2624rc3.orig/fs/block_dev.c 2007-11-16 21:16:36.0 -0800 +++ 2624rc3/fs/block_dev.c 2007-11-27 10:51:26.0 -0800 @@ -518,6 +518,7 @@ static struct file_system_type bd_type = .name = "bdev", .get_sb = bd_get_sb, .kill_sb = kill_anon_super, + .fs_flags = FS_ANONYMOUS, }; static struct vfsmount *bd_mnt __read_mostly; Index: 2624rc3/fs/fs-writeback.c === --- 2624rc3.orig/fs/fs-writeback.c 2007-11-16 21:16:36.0 -0800 +++ 2624rc3/fs/fs-writeback.c 2007-11-27 17:40:19.0 -0800 @@ -23,8 +23,174 @@ #include #include #include +#include #include "internal.h" +#define rb_to_inode(node) rb_entry((node), struct inode, i_flush_node) + +/* + * When inodes are parked for writeback they are parked in the + * flush_tree. The flush tree is a data structure based on an rb tree. + * + * Duplicate keys are handled by making a list in the tree for each key + * value. The order of how we choose the next inode to flush is decided + * by two fields. First the earliest dirtied_when value. If there are + * duplicate dirtied_when values then the earliest i_flushed_when value + * determines who gets flushed next. + * + * The flush tree organizes the dirtied_when keys with the rb_tree. Any + * inodes with a duplicate dirtied_when value are link listed together. This + * link list is sorted by the inode's i_flushed_when. When both the + * dirited_when and the i_flushed_when are indentical the order in the + * linked list determines the order we flush the inodes. + */ + +/* + * Find a rb_node matching the key in the flush tree. There are no duplicate + * rb_nodes in the tree. Instead they are chained off the first node.
+ */ +static struct inode *flush_tree_search(struct super_block *sb, + unsigned long ts) +{ + struct rb_node *n = sb->s_flush_root.rb_node; + assert_spin_locked(&inode_lock); + while (n) { + struct inode *inode = rb_to_inode(n); + if (time_before(ts, inode->dirtied_when)) { + n = n->rb_left; + } else if (time_after(ts, inode->dirtied_when)) { + n = n->rb_right; + } else { + return inode; + } + } + return NULL; +} + +/* + * Inserting an inode into the flush tree. The tree is keyed by the + * dirtied_when member. + * + * If there is a duplicate key in the tree already the new inode is put + * on the tail of a list of the rb_node. + * All inserted inodes must have one of the I_DIRTY flags set. + */ +static void flush_tree_insert(struct super_block *sb, struct inode *inode) +{ + struct rb_node **new = &(sb->s_flush_root.rb_node); + struct rb_node *parent = NULL; + + assert_spin_locked(&inode_lock); + BUG_ON((inode->i_state & I_DIRTY) == 0); + BUG_ON(inode->i_state & (I_FREEING|I_CLEAR)); + BUG_ON(RB_LINKED_NODE(&inode->i_flush_node)); + + sb->s_flush_count++; + + list_del_init(&inode->i_list); + while (*new) { + struct inode *this = rb_to_inode(*new); + parent = *new; + if (time_before(ino
Re: [patch 1/1] Writeback fix for concurrent large and small file writes
Thank you. I integrated the fixes into my patch. On Nov 28, 2007 6:13 PM, Frans Pop <[EMAIL PROTECTED]> wrote: > Two typos in comments. > > Cheers, > FJP > > Michael Rubin wrote: > > + * The flush tree organizes the dirtied_when keys with the rb_tree. Any > > + * inodes with a duplicate dirtied_when value are link listed together. > > This + * link list is sorted by the inode's i_flushed_when. When both the > > + * dirited_when and the i_flushed_when are indentical the order in the > > + * linked list determines the order we flush the inodes. > > s/dirited_when/dirtied_when/ > > > + * Here is where we interate to find the next inode to process. The > > + * strategy is to first look for any other inodes with the same > > dirtied_when + * value. If we have already processed that node then we > > need to find + * the next highest dirtied_when value in the tree. > > s/interate/iterate/
Re: [patch 1/1] Writeback fix for concurrent large and small file writes
So Feng's one-line change fixes the problem at hand. I will do some more testing with it and then submit his patch, credited to him, for 2.6.24, if that's cool with Feng. Also I will take the comment changes and re-submit my patch for 2.6.25 as a general-purpose improvement and see what happens. mrubin On Nov 28, 2007 4:34 PM, Fengguang Wu <[EMAIL PROTECTED]> wrote: > On Wed, Nov 28, 2007 at 11:29:57AM -0800, Michael Rubin wrote: > > >From [EMAIL PROTECTED] Wed Nov 28 11:10:06 2007 > > Message-Id: <[EMAIL PROTECTED]> > > Date: Wed, 28 Nov 2007 11:01:21 -0800 > > From: [EMAIL PROTECTED] > > To: [EMAIL PROTECTED] > > Subject: [patch 1/1] Writeback fix for concurrent large and small file > > writes. > > > > From: Michael Rubin <[EMAIL PROTECTED]> > > > > Fixing a bug where writing to large files while concurrently writing to > > smaller ones creates a situation where writeback cannot keep up with the > > Could you demonstrate the situation? Or if I guess it right, could it > be fixed by the following patch? (not a nack: If so, your patch could > also be considered as a general purpose improvement, instead of a bug > fix.) > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c > index 0fca820..62e62e2 100644 > --- a/fs/fs-writeback.c > +++ b/fs/fs-writeback.c > @@ -301,7 +301,7 @@ __sync_single_inode(struct inode *inode, struct > writeback_control *wbc) > * Someone redirtied the inode while were writing back > * the pages. > */ > - redirty_tail(inode); > + requeue_io(inode); > } else if (atomic_read(&inode->i_count)) { > /* > * The inode is clean, inuse > > Thank you, > Fengguang > > > > traffic and memory balloons until we hit the threshold watermark. This > > can result in surprising latency spikes when syncing. This latency > > can take minutes on large memory systems. Upon request I can provide > > a test to reproduce this situation. The flush tree fixes this issue and > > fixes several other minor issues with fairness also. > > > > 1) Adding a data structure to guarantee fairness when writing inodes > > to disk. The flush_tree is based on an rbtree. The only difference is > > how duplicate keys are chained off the same rb_node. > > > > 2) Added a FS flag to mark file systems that are not disk backed so we > > don't have to flush them. Not sure I marked all of them. But just marking > > these improves writeback performance. > > > > 3) Added an inode flag to allow inodes to be marked so that they are > > never written back to disk. See get_pipe_inode. > > > > Under autotest this patch has passed: fsx, bonnie, and iozone. I am > > currently writing more writeback focused tests (which so far have been > > passed) to add into autotest. > > > > Signed-off-by: Michael Rubin <[EMAIL PROTECTED]> > > ---
Re: [patch 1/1] Writeback fix for concurrent large and small file writes
Due to my faux pas of top posting (see http://www.zip.com.au/~akpm/linux/patches/stuff/top-posting.txt) I am resending this email. On Nov 28, 2007 4:34 PM, Fengguang Wu <[EMAIL PROTECTED]> wrote: > Could you demonstrate the situation? Or if I guess it right, could it > be fixed by the following patch? (not a nack: If so, your patch could > also be considered as a general purpose improvement, instead of a bug > fix.) > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c > index 0fca820..62e62e2 100644 > --- a/fs/fs-writeback.c > +++ b/fs/fs-writeback.c > @@ -301,7 +301,7 @@ __sync_single_inode(struct inode *inode, struct > writeback_control *wbc) > * Someone redirtied the inode while were writing back > * the pages. > */ > - redirty_tail(inode); > + requeue_io(inode); > } else if (atomic_read(&inode->i_count)) { > /* > * The inode is clean, inuse > By testing the situation I can confirm that the one-line patch above fixes the problem. I will continue testing some other cases to see if it causes any other issues, but I don't expect it to. I will post this change for 2.6.24 and list Feng as author, if that's OK with Feng. As for the original patch, I will resubmit it for 2.6.25 as a general-purpose improvement. mrubin
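For readers following the thread, the two helpers being swapped differ as follows; this is a from-memory paraphrase of the 2.6.24-era fs/fs-writeback.c, so consult an actual tree for the exact text. The swap matters because redirty_tail() can refresh dirtied_when, which keeps pushing a repeatedly-redirtied big inode's expiry into the future, while requeue_io() leaves the timestamp alone and retries the inode on s_more_io in the same run.

static void redirty_tail(struct inode *inode)
{
	struct super_block *sb = inode->i_sb;

	if (!list_empty(&sb->s_dirty)) {
		struct inode *tail_inode;

		tail_inode = list_entry(sb->s_dirty.next, struct inode, i_list);
		if (!time_after_eq(inode->dirtied_when, tail_inode->dirtied_when))
			inode->dirtied_when = jiffies;	/* keep s_dirty time-ordered */
	}
	list_move(&inode->i_list, &sb->s_dirty);
}

static void requeue_io(struct inode *inode)
{
	/* no timestamp update: the inode keeps its place in the expiry order */
	list_move(&inode->i_list, &inode->i_sb->s_more_io);
}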
Re: [patch 1/1] Writeback fix for concurrent large and small file writes
On Nov 29, 2007 5:32 PM, Fengguang Wu <[EMAIL PROTECTED]> wrote: > > On Nov 28, 2007 4:34 PM, Fengguang Wu <[EMAIL PROTECTED]> wrote: > > > Could you demonstrate the situation? Or if I guess it right, could it > > > be fixed by the following patch? Feng, I am sorry, but I was mistaken: I reran my tests and am now finding that the patch you gave me is NOT fixing the problem. The patch I refer to is the one you posted on this thread that adds a requeue_io in __sync_single_inode. I tarred up my test code. It is still in very rough shape, but it can reproduce the issue. You can find it here: http://neverthere.org/mhr/wb/wb-test.tar.bz2 Just make the test and run it with the args "-duration 0:5:0 -starvation". You must be root so it can set some sysctl values. > One major concern could be whether a continuous writer dirtying pages > at the 'right' pace will generate a steady flow of write I/Os which are > _tiny_hence_inefficient_. > > So it's not a problem in *theory* :-) > > > I will post this change for 2.6.24 and list Feng as author. If that's > > ok with Feng. I am going to try to track down what is up in 2.6.24 and see if I can find a less dramatic fix than my tree patch for the short term. If you get a chance to reproduce the problem with my test on your patch, that would rock. I still would like to see my full patch accepted into 2.6.25. A patch should be arriving shortly that will incorporate my larger patch and Qi Yong's fix for skip-writing-data-pages-when-inode-is-under-i_sync: http://www.gossamer-threads.com/lists/linux/kernel/849493 As always, thanks for your patience, mrubin
[patch] Converting writeback linked lists to a tree based data structure
From: Michael Rubin <[EMAIL PROTECTED]> For those of you who have waited so long: this is the third submission of the first attempt at this patch. It is a trilogy. Two changes are in this patch. They are dependent on each other. In addition we get an unintended performance improvement. Syncing 1,000,000 inodes, each with 4KB of dirty data, takes the original kernel 83 seconds; with the change it takes 77 seconds. 1) Adding a data structure to guarantee fairness when writing inodes to disk and simplify the code base. When inodes are parked for writeback they are parked in the flush_tree. The flush tree is a data structure based on an rb tree. Duplicate keys are handled by making a list in the tree for each key value. The order of how we choose the next inode to flush is decided by two fields. First the earliest dirtied_when value. If there are duplicate dirtied_when values then the earliest i_flush_gen value determines who gets flushed next. The flush tree organizes the dirtied_when keys with the rb_tree. Any inodes with a duplicate dirtied_when value are link listed together. This link list is sorted by the inode's i_flush_gen. When both the dirtied_when and the i_flush_gen are identical the order in the linked list determines the order we flush the inodes. 2) Added an inode flag to allow inodes to be marked so that they are never written back to disk. The motivation behind this change is several fold. The first is to ensure fairness in the writeback algorithm. The second is to deal with a bug where writing to large files concurrently with smaller ones creates a situation where writeback cannot keep up with traffic and memory balloons until we hit the threshold watermark. This can result in surprisingly long latency with respect to disk traffic. This latency can take minutes. The flush tree fixes this issue and fixes several other minor issues with fairness also. Signed-off-by: Michael Rubin <[EMAIL PROTECTED]> --- Index: 2624rc7_wb/fs/anon_inodes.c === --- 2624rc7_wb.orig/fs/anon_inodes.c 2008-01-06 13:45:38.0 -0800 +++ 2624rc7_wb/fs/anon_inodes.c 2008-01-08 13:46:59.0 -0800 @@ -154,13 +154,7 @@ static struct inode *anon_inode_mkinode( inode->i_fop = &anon_inode_fops; - /* - * Mark the inode dirty from the very beginning, - * that way it will never be moved to the dirty - * list because mark_inode_dirty() will think - * that it already _is_ on the dirty list. - */ - inode->i_state = I_DIRTY; + inode->i_state = I_WRITEBACK_NEVER; inode->i_mode = S_IRUSR | S_IWUSR; inode->i_uid = current->fsuid; inode->i_gid = current->fsgid; Index: 2624rc7_wb/fs/fs-writeback.c === --- 2624rc7_wb.orig/fs/fs-writeback.c 2008-01-06 13:45:38.0 -0800 +++ 2624rc7_wb/fs/fs-writeback.c 2008-01-14 18:53:54.0 -0800 @@ -23,8 +23,185 @@ #include #include #include +#include #include "internal.h" +#define rb_to_inode(node) rb_entry((node), struct inode, i_flush_node) + +/* + * When inodes are parked for writeback they are parked in the + * flush_tree. The flush tree is a data structure based on an rb tree. + * + * Duplicate keys are handled by making a list in the tree for each key + * value. The order of how we choose the next inode to flush is decided + * by two fields. First the earliest dirtied_when value. If there are + * duplicate dirtied_when values then the earliest i_flush_gen value + * determines who gets flushed next. + * + * The flush tree organizes the dirtied_when keys with the rb_tree. Any + * inodes with a duplicate dirtied_when value are link listed together.
This + * link list is sorted by the inode's i_flush_gen. When both the + * dirtied_when and the i_flush_gen are identical the order in the + * linked list determines the order we flush the inodes. + */ + +/* + * Find a rb_node matching the key in the flush tree. There are no duplicate + * rb_nodes in the tree. Instead they are chained off the first node. + */ +static struct inode *flush_tree_search(struct super_block *sb, + unsigned long ts) +{ + struct rb_node *n = sb->s_flush_root.rb_node; + assert_spin_locked(&inode_lock); + while (n) { + struct inode *inode = rb_to_inode(n); + if (time_before(ts, inode->dirtied_when)) { + n = n->rb_left; + } else if (time_after(ts, inode->dirtied_when)) { + n = n->rb_right; + } else { + return inode; + } + } + return NULL; +} + +/* + * Inserting an inode into the flush tree. The tree is keyed by the +
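The third-submission patch is also truncated here, but the two-level ordering its header describes can be captured in a few lines. The sketch below is a hypothetical comparator, not code from the patch: earliest dirtied_when wins, and among equal dirtied_when values the smaller i_flush_gen (serviced longest ago) goes first; the time comparison is modeled wrap-safe the way the kernel's time_before() macro is.

#include <stdbool.h>

struct toy_inode {
	unsigned long dirtied_when;	/* jiffies when first dirtied */
	unsigned long i_flush_gen;	/* generation last serviced */
};

/* Wrap-safe "a is earlier than b", modeled on the kernel macro. */
static bool toy_time_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* Returns true if a should be flushed before b. */
static bool flush_before(const struct toy_inode *a, const struct toy_inode *b)
{
	if (a->dirtied_when != b->dirtied_when)
		return toy_time_before(a->dirtied_when, b->dirtied_when);
	return a->i_flush_gen < b->i_flush_gen;
}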
Re: [patch] Converting writeback linked lists to a tree based data structure
On Jan 15, 2008 12:46 AM, Peter Zijlstra <[EMAIL PROTECTED]> wrote: > Just a quick question, how does this interact/depend-uppon etc.. with > Fengguangs patches I still have in my mailbox? (Those from Dec 28th) They don't interact; they apply to a 2.6.24rc7 tree. This is a candidate for 2.6.25, and this work was done before Fengguang's patches. I am trying to test Fengguang's for comparison but am having problems getting mm1 to boot on my systems. mrubin
Re: [PATCH 00/13] writeback bug fixes and simplifications take 2
On Jan 15, 2008 4:36 AM, Fengguang Wu <[EMAIL PROTECTED]> wrote: > Andrew, > > This patchset mainly polishes the writeback queuing policies. Does anyone know which tree this patchset is based on? > The main goals are: > > (1) small files should not be starved by big dirty files > (2) sync as fast as possible for not-blocked inodes/pages > - don't leave them out; no congestion_wait() in between them > (3) avoid busy iowait for blocked inodes > - retry them in the next go of s_io (maybe at the next wakeup of pdflush) > Fengguang, do you have any specific tests for any of these cases? As I posted earlier, I am putting together a writeback test suite for test.kernel.org; if you have one (even if it's an ugly shell script) it would save me some time. Also, if you want any of mine, let me know. :-) mrubin > The role of the queues: > > s_dirty: park for dirtied_when expiration > s_io: park for io submission > s_more_io: for big dirty inodes, they will be retried in this run of pdflush > (it ensures fairness between small/large files) > s_more_io_wait: for blocked inodes, they will be picked up in next run of s_io > > > This patchset is in better shape, but still not ready for merge. > It begins with: > > [PATCH 01/13] writeback: revert > 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b > [PATCH 02/13] writeback: clear PAGECACHE_TAG_DIRTY for truncated page > in block_write_full_page() > > Introduces more_io/more_io_wait based policies: > > [PATCH 03/13] writeback: introduce writeback_control.more_io > [PATCH 04/13] writeback: introduce super_block.s_more_io_wait > [PATCH 05/13] writeback: merge duplicate code into > writeback_some_pages() > [PATCH 06/13] writeback: defer writeback on not-all-pages-written > [PATCH 07/13] writeback: defer writeback on locked inode > [PATCH 08/13] writeback: defer writeback on locked buffers > [PATCH 09/13] writeback: requeue_io() on redirtied inode > > And finishes with some code cleanups: > > [PATCH 10/13] writeback: introduce queue_dirty() > [PATCH 11/13] writeback: queue_dirty() on memory-backed bdi > [PATCH 12/13] writeback: remove redirty_tail() > [PATCH 13/13] writeback: cleanup __sync_single_inode() > > Diffstat: > > fs/buffer.c |2 > fs/fs-writeback.c | 121 +++--- > fs/super.c |1 > include/linux/fs.h |1 > mm/page-writeback.c | 46 +++ > 5 files changed, 72 insertions(+), 99 deletions(-) > > Regards, > Fengguang Wu
Re: [patch] Converting writeback linked lists to a tree based data structure
On Jan 15, 2008 7:01 PM, Fengguang Wu <[EMAIL PROTECTED]> wrote: > Basically I think rbtree is an overkill to do time based ordering. > Sorry, Michael. But s_dirty would be enough for that. Plus, s_more_io > provides fair queuing between small/large files, and s_more_io_wait > provides waiting mechanism for blocked inodes. I think the flush_tree (which is a little more than just an rbtree) provides the same queuing mechanisms that the three or four list heads do, and manages to do it in one structure. The i_flushed_when provides the ability to have blocked inodes wait their turn, so to speak. Another motivation behind the rbtree patch is to unify the data structure that handles the priority and mechanism of how we write out the pages of the inodes. There are ideas about introducing priority schemes for QoS and such in the future. I am not saying this patch is about making that happen, but the idea is, if possible, to unify the four stages of lists into a single structure to facilitate efforts like that. mrubin
Crash with 2.6.24-rc6-mm1 in restore_i387_ia32
When I boot my system with 2.6.24-rc6 everything is great. When I apply the mm1 patch I crash with the output below. Anyone seen this before? Starting sendmail: Unable to handle kernel paging request at 00010013 RIP: [] restore_i387_ia32+0x1f/0x150 PGD 47e134067 PUD 0 Oops: [1] SMP last sysfs file: /sys/devices/system/machinecheck/machinecheck3/tolerant CPU 3 Modules linked in: Pid: 0, comm: make Not tainted 2.6.24-rc6-mm1 #2 RIP: 0010:[] [] restore_i387_ia32+0x1f/0x150 RSP: :81047e58be38 EFLAGS: 00010282 RAX: 81047e58bfd8 RBX: 81047e4cc000 RCX: RDX: RSI: ffdc232c RDI: 81047e4cc000 RBP: 81047e58bec8 R08: R09: ffdc25d8 R10: 81047e58a000 R11: R12: R13: 81047e4cc000 R14: 81047e58bf58 R15: 81047e58bf24 FS: 7ff45f9566e0(0003) GS:81047f015d80(0003) knlGS:f7eebe80 CS: 0010 DS: 002b ES: 002b CR0: 8005003b CR2: 00010013 CR3: 00047e117000 CR4: 06e0 DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400 Process make (pid: 0, threadinfo , task 81047e4cc000) Stack: 81047e58be48 80243202 81047e58bf48 8020ad7d 81047e58be98 0011 0001 0011 810400040001 0008087c Call Trace: Code: 8b 51 14 89 d0 83 e0 01 85 c0 74 11 9b 83 e2 fe 89 51 14 0f RIP [] restore_i387_ia32+0x1f/0x150 RSP CR2: 00010013 divide error: [2] SMP last sysfs file: /sys/devices/system/machinecheck/machinecheck3/tolerant CPU 1 Modules linked in: Pid: 2175, comm: uname Tainted: G D 2.6.24-rc6-mm1 #2 RIP: 0010:[] [] calc_delta_mine+0x64/0x90 RSP: :81027dc9fb78 EFLAGS: 00010046 RAX: 0001 RBX: 81047e4cc038 RCX: RDX: RSI: 0400 RDI: c009aeef285e RBP: 81027dc9fb78 R08: 81047e4cc038 R09: R10: 0001 R11: 0064 R12: 8102800a21c0 R13: 8102800a21c0 R14: 000a6eeea85e R15: 0003 FS: 7f120bb256e0() GS:81027f029000() knlGS:f7ef6e80 CS: 0010 DS: 002b ES: 002b CR0: 8005003b CR2: f7f05aa0 CR3: 00027dc59000 CR4: 06e0 DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400 Process uname (pid: 2175, threadinfo 81027dc9e000, task 81027e542790) Stack: 81027dc9fba8 802313a6 0001 8102800a21c0 81047e552038 0001 81027dc9fbd8 80231491 81027dc9fbd8 81047e552038 8102800a5dc0 8102800a5dc0 Call Trace: [] update_curr+0xa6/0xb0 [] enqueue_entity+0x21/0x70 [] enqueue_task_fair+0x32/0x60 [] enqueue_task+0x13/0x30 [] activate_task+0x30/0x50 [] try_to_wake_up+0x10d/0x130 [] default_wake_function+0xd/0x10 [] autoremove_wake_function+0x11/0x40 [] __wake_up_common+0x4e/0x90 [] __wake_up_sync+0x4a/0x70 [] pipe_write+0x3d2/0x500 [] __do_fault+0x215/0x470 [] do_sync_write+0xf1/0x150 [] autoremove_wake_function+0x0/0x40 [] do_page_fault+0x420/0x7e0 [] __up_write+0x72/0x150 [] vfs_write+0xc7/0x170 [] sys_write+0x50/0x90 [] ia32_sysret+0x0/0xa Code: 48 f7 f1 48 8d 48 01 49 89 48 08 eb 9f 48 8d 82 00 80 00 00 RIP [] calc_delta_mine+0x64/0x90 RSP ---[ end trace ca96c3a87188c958 ]--- BUG: sleeping function called from invalid context at kernel/rwsem.c:21 in_atomic():0, irqs_disabled():1 Pid: 2175, comm: uname Tainted: G D 2.6.24-rc6-mm1 #2 Call Trace: [] __might_sleep+0xb2/0xd0 [] down_read+0x1d/0x30 [] exit_mm+0x30/0x100 [] do_exit+0x1bb/0x850 [] oops_end+0x85/0x90 [] die+0x5e/0x90 [] do_trap+0x154/0x160 [] do_divide_error+0x8a/0xa0 [] calc_delta_mine+0x64/0x90 [] do_lookup+0x80/0x1f0 [] dput+0x65/0x1a0 [] __link_path_walk+0xcf5/0xd50 [] error_exit+0x0/0x51 [] calc_delta_mine+0x64/0x90 [] update_curr+0xa6/0xb0 [] enqueue_entity+0x21/0x70 [] enqueue_task_fair+0x32/0x60 [] enqueue_task+0x13/0x30 [] activate_task+0x30/0x50 [] try_to_wake_up+0x10d/0x130 [] default_wake_function+0xd/0x10 [] autoremove_wake_function+0x11/0x40 [] __wake_up_common+0x4e/0x90 [] __wake_up_sync+0x4a/0x70 [] pipe_write+0x3d2/0x500 [] 
__do_fault+0x215/0x470 [] do_sync_write+0xf1/0x150 [] autoremove_wake_function+0x0/0x40 [] do_page_fault+0x420/0x7e0 [] __up_write+0x72/0x150 [] vfs_write+0xc7/0x170 [] sys_write+0x50/0x90 [] ia32_sysret+0x0/0xa
Re: [patch] Converting writeback linked lists to a tree based data structure
On Jan 17, 2008 1:41 AM, Fengguang Wu <[EMAIL PROTECTED]> wrote: > On Tue, Jan 15, 2008 at 12:09:21AM -0800, Michael Rubin wrote: > The main benefit of rbtree is possibly better support of future policies. > Can you demonstrate an example? These are ill-formed thoughts on my end as of now, but the idea was that keeping one tree sorted via a single scheme might be simpler than multiple list_heads. > Bugs can only be avoided by good understanding of all possible cases.) I take the above statement as a tautology, and I am trying my best to understand all the cases. :-) > The most tricky writeback issues could be starvation prevention > between > - small/large files > - new/old files > - superblocks So I have written tests and believe I have covered these issues. If you are concerned about any in specific and have a test case, please let me know. > Some kind of limit should be applied for each. They used to be: > - requeue to s_more_io whenever MAX_WRITEBACK_PAGES is reached > this preempts big files The patch uses the same limit. > - refill s_io iff it is drained > this prevents promotion of big/old files Once a big file gets its first do_writepages it is moved behind the other smaller files via i_flushed_when. And the same in reverse for big vs old. > - return from sync_sb_inodes() after one go of s_io I am not sure how this limit helps things out. Is this for superblock starvation? Can you elaborate? > Michael, could you sort out and document the new starvation prevention > schemes? The basic idea behind how the writeback algorithm handles starvation is this: we want to preserve the order of writeback based on when an inode was dirtied, and also preserve the dirtied_when contents until the inode has been written back (partially or fully). Every sync_sb_inodes we find the least recently dirtied inodes. To deal with large/small starvation we keep an s_flush_gen for each iteration of sync_sb_inodes; every time we issue a writeback against an inode we mark it so that it cannot be processed again until the next s_flush_gen. This way no one inode can keep us from getting to the rest, since we keep pushing already-serviced inodes into subsequent s_flush_gens. Let me know if you want more detail or structured responses; I have also sketched the scheme in the model below. > Introduce i_flush_gen to help restarting from the last inode? > Well, it's not as simple as list_heads. > > > 2) Added an inode flag to allow inodes to be marked so that they > > are never written back to disk. > > > > The motivation behind this change is several fold. The first is > > to ensure fairness in the writeback algorithm. The second is to > > What do you mean by fairness? This comment was originally written when I was trying to fix a bug in 2.6.23, the one where we were starving large files from being flushed. There was a fairness issue where small files were being flushed but the large ones were just ballooning in memory. > Why cannot I_WRITEBACK_NEVER be in a decoupled standalone patch? The WRITEBACK_NEVER could be in a patch preceding the rbtree one, but not in a subsequent patch, because the rbtree depends on the WRITEBACK_NEVER change; otherwise we run into problems in generic_delete_inode. Now that you point it out, I think I can split this into two patches and put WRITEBACK_NEVER in the first one. > More details about the fixings, please? Again, this comment was written against 2.6.23. The biggest fix is the starving of big files. I remember there were other smaller issues, but there have been so many changes in the patch sets that I need to go back and quantify them.
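Here is the promised sketch: a userspace model of the s_flush_gen scheme, with the mechanics simplified from the prose above (the generation arithmetic and slice size are my guesses, not the patch's exact code). Once an inode is serviced it is stamped with the current generation and cannot be re-picked until sync_sb_inodes starts the next one, so the big inode with the oldest dirtied_when gets one slice per run instead of monopolizing the flusher.

#include <stdio.h>

#define NINODES 3
#define MAX_WRITEBACK_PAGES 4

struct toy_inode {
	unsigned long dirtied_when;	/* older = flushed first */
	unsigned long i_flush_gen;	/* generation last serviced */
	long dirty_pages;
};

static unsigned long s_flush_gen;

static void sync_sb_inodes(struct toy_inode *in, int n)
{
	s_flush_gen++;	/* one generation per run */
	for (;;) {
		int pick = -1;

		/* Linear stand-in for "leftmost tree node not stamped yet". */
		for (int i = 0; i < n; i++) {
			if (!in[i].dirty_pages || in[i].i_flush_gen == s_flush_gen)
				continue;
			if (pick < 0 || in[i].dirtied_when < in[pick].dirtied_when)
				pick = i;
		}
		if (pick < 0)
			break;	/* every dirty inode got one slice this gen */
		long w = in[pick].dirty_pages < MAX_WRITEBACK_PAGES ?
			in[pick].dirty_pages : MAX_WRITEBACK_PAGES;
		in[pick].dirty_pages -= w;
		in[pick].i_flush_gen = s_flush_gen;	/* defer to next gen */
		printf("gen %lu: inode %d wrote %ld pages\n", s_flush_gen, pick, w);
	}
}

int main(void)
{
	/* One big file with the oldest dirtied_when plus two small newer
	 * files: without the stamp the big file would be re-picked until
	 * clean; with it, everyone gets one slice per run. */
	struct toy_inode inodes[NINODES] = {
		{ .dirtied_when = 100, .dirty_pages = 12 },
		{ .dirtied_when = 200, .dirty_pages = 2 },
		{ .dirtied_when = 300, .dirty_pages = 2 },
	};
	for (int run = 0; run < 3; run++)
		sync_sb_inodes(inodes, NINODES);
	return 0;
}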
Re: [patch] Converting writeback linked lists to a tree based data structure
On Jan 17, 2008 9:01 PM, David Chinner <[EMAIL PROTECTED]> wrote: First off, thank you for the very detailed reply. This rocks and gives me much to think about. > On Thu, Jan 17, 2008 at 01:07:05PM -0800, Michael Rubin wrote: > This seems suboptimal for large files. If you keep feeding in > new least recently dirtied files, the large files will never > get an unimpeded go at the disk and hence we'll struggle to > get decent bandwidth under anything but pure large file > write loads. You're right, I understand now. I just changed a dial on my tests, ran them, and found pdflush not keeping up like it should. I need to address this. > Switching inodes during writeback implies a seek to the new write > location, while continuing to write the same inode has no seek > penalty because the writeback is sequential. It follows from this > that allowing large files a disproportionate amount of data > writeback is desirable. > > Also, cycling rapidly through all the large files to write 4MB to each is > going to cause us to spend time seeking rather than writing compared > to cycling slower and writing 40MB from each large file at a time. > > i.e. servicing one large file for 100ms is going to result in higher > writeback throughput than servicing 10 large files for 10ms each > because there's going to be less seeking and more writing done by > the disks. > > That is, think of large file writes like process scheduler batch > jobs - bulk throughput is what matters, so the larger the time slice > you give them the higher the throughput. > > IMO, the sort of result we should be looking at is a > writeback design that results in cycling somewhat like: > > slice 1: iterate over small files > slice 2: flush large file 1 > slice 3: iterate over small files > slice 4: flush large file 2 > .. > slice n-1: flush large file N > slice n: iterate over small files > slice n+1: flush large file N+1 > > So that we keep the disk busy with a relatively fair mix of > small and large I/Os while both are necessary. I see where you are coming from. But if we are going to make changes to optimize for seeks, maybe we need to be more aggressive in how writeback organizes both time and location. Right now, AFAIK, there is no attention paid to location in the writeback path. > The higher the bandwidth of the device, the more frequently > we need to be servicing the inodes with large amounts of > dirty data to be written to maintain write throughput at a > significant percentage of the device capability. > Could you expand that to say it's not the inodes of large files, but the inodes whose data layout lets us exploit locality? Large files are often fragmented. Would it make more sense to pursue cracking the inodes and grouping their blocks' locations? Or is this all overkill that should be handled at a lower level, like the elevator? > BTW, it needs to be recognised that if we are under memory pressure > we can clean much more memory in a short period of time by writing > out all the large files first. This would clearly benefit the system > as a whole as we'd get the most pages available for reclaim as > possible in a short a time as possible. The writeback algorithm > should really have a mode that allows this sort of flush ordering to > occur. I completely agree. mrubin
Re: [patch] Converting writeback linked lists to a tree based data structure
On Jan 17, 2008 8:56 PM, Fengguang Wu <[EMAIL PROTECTED]> wrote: Once again, thanks for the speedy replies. :-) > On Thu, Jan 17, 2008 at 01:07:05PM -0800, Michael Rubin wrote: > Suppose we want to grant longer expiration window for temp files, > adding a new list named s_dirty_tmpfile would be a handy solution. When you say tmp, do you mean files that eventually get written to disk? If not, I would just use the WRITEBACK_NEVER. If so, I am not sure that feature is worth making a special case for. It seems like the location-based ideas may be more useful. > So the question is: should we need more than 3 QoS classes? > > > > The most tricky writeback issues could be starvation prevention > > > between > > > > > > > - small/large files > > > - new/old files > > > - superblocks > > > > So I have written tests and believe I have covered these issues. If > > you are concerned about any in specific and have a test case, please let > > me know. > > OK. > > > > Some kind of limit should be applied for each. They used to be: > > > - requeue to s_more_io whenever MAX_WRITEBACK_PAGES is reached > > > this preempts big files > > > > The patch uses the same limit. > > > > > - refill s_io iff it is drained > > > this prevents promotion of big/old files > > > > Once a big file gets its first do_writepages it is moved behind the > > other smaller files via i_flushed_when. And the same in reverse for > > big vs old. > > You mean i_flush_gen? Yeah, sorry. It was once called i_flush_when. (sheepish) > No, sync_sb_inodes() will abort on every > MAX_WRITEBACK_PAGES, and s_flush_gen will be updated accordingly. > Hence the sync will restart from big/old files. If I understand you correctly, I am not sure I agree. Here is what I think happens in the patch: 1) pull the big inode off the flush tree, 2) sync the big inode, 3) hit MAX_WRITEBACK_PAGES, 4) re-insert the big inode (without modifying its dirtied_when), and 5) update the big inode's i_flush_gen and re-insert it behind the small inodes we have not synced yet. In a subsequent sync_sb_inodes we end up retrieving the small inode we had not serviced yet. > > > - return from sync_sb_inodes() after one go of s_io > > I am not sure how this limit helps things out. Is this for superblock > starvation? Can you elaborate? > We should have a way to go to next superblock even if new dirty inodes > or pages are emerging fast in this superblock. Fill and drain s_io > only once and then abort helps. Got it. > s_io is a stable and bounded working set in one go of superblock. Is this necessary with MAX_WRITEBACK_PAGES? It feels like a double limit. > Basically you make one list_head in each rbtree node. > That list_head is recycled cyclic, and is an analog to the old > fashioned s_dirty. We need to know 'where we are' and 'where it ends'. > So an extra indicator must be introduced - i_flush_gen. It's awkward. > We are simply repeating the aged list_heads' problem. To me they both feel a little awkward. I feel like the original problem in 2.6.23 led to a lot of examination, which is bringing new possibilities to light. BTW, the issue that started me on this whole path (starving large files) was still present in 2.6.23-rc8 but now looks fixed in 2.6.24-rc3. Still no idea about your changes in 2.6.24-rc6-mm1; I have given up trying to get that thing to boot. mrubin
Re: [PATCH 00/13] writeback bug fixes and simplifications take 2
On Jan 15, 2008 4:36 AM, Fengguang Wu <[EMAIL PROTECTED]> wrote: > Andrew, > > This patchset mainly polishes the writeback queuing policies. > The main goals are: > > (1) small files should not be starved by big dirty files > (2) sync as fast as possible for not-blocked inodes/pages > - don't leave them out; no congestion_wait() in between them > (3) avoid busy iowait for blocked inodes > - retry them in the next go of s_io (maybe at the next wakeup of pdflush) > > The role of the queues: > > s_dirty: park for dirtied_when expiration > s_io: park for io submission > s_more_io: for big dirty inodes, they will be retried in this run of pdflush > (it ensures fairness between small/large files) > s_more_io_wait: for blocked inodes, they will be picked up in next run of s_io Quick question to make sure I get this. Each queue is sorted as follows: s_dirty is sorted by the dirtied_when field; s_io has no explicit key, it is simply in the order we want to process inodes in sync_sb_inodes; and s_more_io holds inodes for later, sorted in the same manner as s_io. Is that it? My reading of how the queues interact is sketched in the model below. mrubin
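And here is the model mentioned above: a toy userspace run of the four-queue scheme as I read the description (this is my interpretation, not code from the patchset). Expired s_dirty inodes and last run's blocked inodes are drained into s_io; inodes that hit MAX_WRITEBACK_PAGES go to s_more_io and are retried in the same run; blocked inodes go to s_more_io_wait for the next run.

#include <stdio.h>
#include <string.h>

#define MAXQ 16

struct queue { int n; char items[MAXQ][16]; };

static void push(struct queue *q, const char *item)
{
	strcpy(q->items[q->n++], item);	/* no bounds check: toy code */
}

static void drain(struct queue *from, struct queue *to)
{
	for (int i = 0; i < from->n; i++)
		push(to, from->items[i]);
	from->n = 0;
}

int main(void)
{
	struct queue s_dirty = { 0 }, s_io = { 0 };
	struct queue s_more_io = { 0 }, s_more_io_wait = { 0 };

	push(&s_dirty, "small-a");
	push(&s_dirty, "big-file");
	push(&s_dirty, "locked-inode");

	/* queue_io(): expired s_dirty inodes plus the previous run's
	 * blocked inodes become this run's working set. */
	drain(&s_more_io_wait, &s_io);
	drain(&s_dirty, &s_io);

	/* sync_sb_inodes(): service s_io once, requeueing by outcome. */
	for (int i = 0; i < s_io.n; i++) {
		const char *inode = s_io.items[i];

		if (!strcmp(inode, "big-file"))
			push(&s_more_io, inode);	/* hit MAX_WRITEBACK_PAGES */
		else if (!strcmp(inode, "locked-inode"))
			push(&s_more_io_wait, inode);	/* blocked: next run */
		else
			printf("%s written clean\n", inode);
	}
	s_io.n = 0;

	/* The same pdflush run keeps going with s_more_io, so big files
	 * are not starved; blocked inodes wait without busy iowait. */
	drain(&s_more_io, &s_io);
	printf("retrying this run: %s\n", s_io.items[0]);
	printf("next run picks up: %s\n", s_more_io_wait.items[0]);
	return 0;
}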
Re: [patch] Converting writeback linked lists to a tree based data structure
On Jan 18, 2008 12:54 AM, David Chinner <[EMAIL PROTECTED]> wrote: > At this point, I'd say it is best to leave it to the filesystem and > the elevator to do their jobs properly. Amen. mrubin