Writing tests for write back
I am writing tests for the writeback section of the kernel, the code found in fs/fs-writeback.c. Currently I have tests that fsync millions of inodes concurrently, tests involving mounting and unmounting file systems, and other tests that tickle starvation of big files. Has anyone else been bitten by writeback misbehaving, either in the past or now? If so, please send me the situation so I can incorporate it into my tests and also make sure my new attempt at an implementation does not cause damage. These tests will end up in autotest on test.kernel.org, so they will become something the whole community can use. mrubin
Re: Possible fix for lockup in drop_caches
On Dec 22, 2007 2:06 AM, Andrew Morton <[EMAIL PROTECTED]> wrote: > Oh boy. Do we really want to add all this stuff to JBD just for > drop_caches which is a silly root-only broken-in-22-other-ways thing? > > Michael, might your convert-inode-lists-to-tree patches eliminate the need > for taking inode_lock in drop_pagecache_sb()? Probably not, as it uses an > rbtree. It would have been possible if it was using a radix-tree, I > suspect. You are correct. The rbtree will still require the lock. > > > -void __journal_unfile_buffer(struct journal_head *jh) > > +void __journal_unfile_buffer(struct journal_head *jh, > > + struct buffer_head **dirty_bh) > > { > > - __journal_temp_unlink_buffer(jh); > > + __journal_temp_unlink_buffer(jh, dirty_bh); > > jh->b_transaction = NULL; > > } > > I suspect the code would end up simpler if __journal_unfile_buffer() were > to take an additional ref on the bh which it placed at *dirty_bh. > > Callers of __journal_unfile_buffer() could then call > > void handle_dirty_bh(struct buffer_head *bh) > { > if (bh) { > jbd_mark_buffer_dirty(bh); > put_bh(bh); > } > } > > ?
Re: Possible fix for lockup in drop_caches
On Dec 22, 2007 2:06 AM, Andrew Morton <[EMAIL PROTECTED]> wrote: > On Mon, 17 Dec 2007 12:13:22 + richard <[EMAIL PROTECTED]> wrote: > Michael, might your convert-inode-lists-to-tree patches eliminate the need > for taking inode_lock in drop_pagecache_sb()? Probably not, as it uses an > rbtree. It would have been possible if it was using a radix-tree, I > suspect. You are correct: the new patch, based on an rbtree, will still require the lock. mrubin
[patch 1/1] Writeback fix for concurrent large and small file writes.
From: Michael Rubin <[EMAIL PROTECTED]> This fixes a bug where writing to large files while concurrently writing to smaller ones creates a situation where writeback cannot keep up with the traffic and memory balloons until we hit the threshold watermark. This can result in surprising latency spikes when syncing, and on large memory systems that latency can take minutes. Upon request I can provide a test to reproduce this situation. The only concern I have is that this makes wb_kupdate slightly more aggressive. I am not sure that is enough to cause any problems; I think there are enough checks to throttle the background activity. Feng, the one-line change that you recommended here http://marc.info/?l=linux-kernel&m=119629655402153&w=2 had no effect. Signed-off-by: Michael Rubin <[EMAIL PROTECTED]> --- Index: 2624rc3_feng/fs/fs-writeback.c === --- 2624rc3_feng.orig/fs/fs-writeback.c 2007-11-29 14:44:24.0 -0800 +++ 2624rc3_feng/fs/fs-writeback.c 2007-12-10 17:21:45.0 -0800 @@ -408,8 +408,7 @@ sync_sb_inodes(struct super_block *sb, s { const unsigned long start = jiffies; /* livelock avoidance */ - if (!wbc->for_kupdate || list_empty(&sb->s_io)) - queue_io(sb, wbc->older_than_this); + queue_io(sb, wbc->older_than_this); while (!list_empty(&sb->s_io)) { struct inode *inode = list_entry(sb->s_io.prev, Index: 2624rc3_feng/mm/page-writeback.c === --- 2624rc3_feng.orig/mm/page-writeback.c 2007-11-16 21:16:36.0 -0800 +++ 2624rc3_feng/mm/page-writeback.c 2007-12-10 17:37:17.0 -0800 @@ -638,7 +638,7 @@ static void wb_kupdate(unsigned long arg wbc.nr_to_write = MAX_WRITEBACK_PAGES; writeback_inodes(&wbc); if (wbc.nr_to_write > 0) { - if (wbc.encountered_congestion || wbc.more_io) + if (wbc.encountered_congestion) congestion_wait(WRITE, HZ/10); else break; /* All the old data is written */
Re: [patch 1/1] Writeback fix for concurrent large and small file writes.
On Dec 12, 2007 12:55 PM, Peter Zijlstra <[EMAIL PROTECTED]> wrote: > > On Mon, 2007-12-10 at 18:02 -0800, Michael Rubin wrote: > > From: Michael Rubin <[EMAIL PROTECTED]> > The part I miss here is the rationale on _how_ you solve the problem. > > The patch itself is simple enough, but I've been staring at this code > for a while now, and I'm just not getting it. Apologies for the lack of rationale. I have been staring at this code for a while also, and it makes my head hurt. I have a patch coming (hopefully today) that proposes using one data structure with a more consistent priority scheme for 2.6.25. To me it's simpler, but I am biased. The problem we encounter when we append to a large file at a fast rate while also writing to smaller files is that the wb_kupdate thread does not keep up with disk traffic. In this workload the inodes often end up at fs/fs-writeback.c:287 after do_writepages, because do_writepages did not write all the pages. This can be due to congestion, but I think there are other causes as well, based on what I have observed. The first issue is that the inode is put on the s_more_io queue, which ensures that more_io is set at the end of sync_sb_inodes. The result is that the wb_kupdate routine performs a sleep at mm/page-writeback.c:642. This slows us down enough that wb_kupdate cannot keep up with traffic. The other issue is that an inode placed on the s_more_io queue cannot be processed by sync_sb_inodes until the entire s_io list is empty. With lots of small files that are not being dirtied as quickly as the one large inode on the s_more_io queue, the inode with the most pages being dirtied is never given attention, and again wb_kupdate cannot keep up. mrubin
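To make the failure loop concrete, below is a minimal userspace model (plain C, not kernel code) of the wb_kupdate() behavior described above. The loop shape follows the mm/page-writeback.c excerpt quoted in the patch; the CHUNK constant and the workload size are invented for illustration, and the model assumes do_writepages() stops early on the big inode so that it keeps landing back on s_more_io with more_io set.

#include <stdio.h>

#define MAX_WRITEBACK_PAGES 1024
#define CHUNK 64	/* pages actually written per pass (invented) */

struct writeback_control {
	long nr_to_write;
	int more_io;
	int encountered_congestion;
};

static long dirty_pages = 1L << 20;	/* one big dirty file, ~4GB at 4KB pages */

/* Stand-in for writeback_inodes(): writes only CHUNK pages, then the
 * still-dirty inode is re-parked on s_more_io, which sets more_io. */
static void writeback_inodes(struct writeback_control *wbc)
{
	long n = dirty_pages < CHUNK ? dirty_pages : CHUNK;

	dirty_pages -= n;
	wbc->nr_to_write -= n;
	wbc->more_io = dirty_pages > 0;
}

int main(void)
{
	long nr_to_write = dirty_pages, sleeps = 0;

	while (nr_to_write > 0) {
		struct writeback_control wbc = {
			.nr_to_write = MAX_WRITEBACK_PAGES,
		};

		writeback_inodes(&wbc);
		if (wbc.nr_to_write > 0) {
			if (wbc.encountered_congestion || wbc.more_io)
				sleeps++;	/* congestion_wait(WRITE, HZ/10) */
			else
				break;	/* all the old data is written */
		}
		nr_to_write -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
	}
	printf("%ld sleeps of 100ms = ~%ld seconds idle\n", sleeps, sleeps / 10);
	return 0;
}

With these toy numbers the flusher spends roughly 27 minutes asleep to clear the one file, which matches the "latency can take minutes" symptom in the patch description.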
[patch] Converting writeback linked lists to a tree based data structure
From: Michael Rubin <[EMAIL PROTECTED]> This is an attempt to unify the writeback data structures. By adding an rb tree we are able to have one consistent time-ordering mechanism for writeback. This should aid debugging and allow for future work with more sophisticated time-ordering methods. This is a proposal for 2.6.25. The patch below includes the following changes. 1) Adding a data structure to guarantee fairness when writing inodes to disk. The flush_tree is based on an rbtree, with duplicate keys being chained off the same rb_node. 2) Added an FS flag to mark file systems that are not disk backed so we don't have to flush them. I am not sure I marked all of them, but just marking these improves writeback performance. 3) Added an inode flag to allow inodes to be marked so that they are never written back to disk. See get_pipe_inode. Under autotest this patch has passed: fsx, bonnie, and iozone. I am currently writing more writeback-focused tests (which it has passed so far) to add into autotest. Performance-wise I ran a quick test: a) I used sysctl to stop background writeback. b) I ran the "sync" command. c) Then I created 10,000,000 files in directories of 1000 files per directory. d) Finally I timed the "sync" command for all the dirty inodes that had been parked. I ran the perf test 5 times on each kernel. The average for the 2.6.24 kernel was 87.8 seconds; with this patch it was 69.3. Signed-off-by: Michael Rubin <[EMAIL PROTECTED]> --- Index: 2624rc3_wb/fs/block_dev.c === --- 2624rc3_wb.orig/fs/block_dev.c 2007-12-11 13:52:47.0 -0800 +++ 2624rc3_wb/fs/block_dev.c 2007-12-11 13:55:01.0 -0800 @@ -518,6 +518,7 @@ static struct file_system_type bd_type = .name = "bdev", .get_sb = bd_get_sb, .kill_sb = kill_anon_super, + .fs_flags = FS_ANONYMOUS, }; static struct vfsmount *bd_mnt __read_mostly; Index: 2624rc3_wb/fs/fs-writeback.c === --- 2624rc3_wb.orig/fs/fs-writeback.c 2007-12-11 13:52:47.0 -0800 +++ 2624rc3_wb/fs/fs-writeback.c 2007-12-11 14:19:33.0 -0800 @@ -23,8 +23,174 @@ #include #include #include +#include #include "internal.h" +#define rb_to_inode(node) rb_entry((node), struct inode, i_flush_node) + +/* + * When inodes are parked for writeback they are parked in the + * flush_tree. The flush tree is a data structure based on an rb tree. + * + * Duplicate keys are handled by making a list in the tree for each key + * value. The order of how we choose the next inode to flush is decided + * by two fields. First the earliest dirtied_when value. If there are + * duplicate dirtied_when values then the earliest i_flushed_when value + * determines who gets flushed next. + * + * The flush tree organizes the dirtied_when keys with the rb_tree. Any + * inodes with a duplicate dirtied_when value are link listed together. This + * link list is sorted by the inode's i_flushed_when. When both the + * dirtied_when and the i_flushed_when are identical the order in the + * linked list determines the order we flush the inodes. + */ + +/* + * Find a rb_node matching the key in the flush tree. There are no duplicate + * rb_nodes in the tree. Instead they are chained off the first node.
+ */ +static struct inode *flush_tree_search(struct super_block *sb, + unsigned long ts) +{ + struct rb_node *n = sb->s_flush_root.rb_node; + assert_spin_locked(&inode_lock); + while (n) { + struct inode *inode = rb_to_inode(n); + if (time_before(ts, inode->dirtied_when)) { + n = n->rb_left; + } else if (time_after(ts, inode->dirtied_when)) { + n = n->rb_right; + } else { + return inode; + } + } + return NULL; +} + +/* + * Inserting an inode into the flush tree. The tree is keyed by the + * dirtied_when member. + * + * If there is a duplicate key in the tree already the new inode is put + * on the tail of a list of the rb_node. + * All inserted inodes must have one of the I_DIRTY flags set. + */ +static void flush_tree_insert(struct super_block *sb, struct inode *inode) +{ + struct rb_node **new = &(sb->s_flush_root.rb_node); + struct rb_node *parent = NULL; + + assert_spin_locked(&inode_lock); + BUG_ON((inode->i_state & I_DIRTY) == 0); + BUG_ON(inode->i_state & (I_FREEING|I_CLEAR)); + BUG_ON(RB_LINKED_NODE(&inode->i_flush_node)); + + sb->s_flush_count++; + + list_del_init(&inode->i_list); + while (*new) { + struct inode *this = rb_to_inode(*new); + parent = *new
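The patch hunk above is cut off in this archive, so here is a standalone sketch of the duplicate-key scheme it describes: one tree node per distinct dirtied_when, with same-key inodes chained in FIFO order off that node. It uses a plain unbalanced binary search tree in place of the kernel's rbtree, and all names (toy_inode, next_dup) are hypothetical.

#include <stdio.h>

struct toy_inode {
	unsigned long dirtied_when;
	struct toy_inode *next_dup;	/* chain of same-key inodes */
	struct toy_inode *left, *right;	/* tree links, first-of-key only */
	int id;
};

/* One tree node per distinct key; later duplicates go to the chain
 * tail, so the earliest-queued inode flushes first among equal keys. */
static struct toy_inode *flush_tree_insert(struct toy_inode *root,
					   struct toy_inode *in)
{
	if (!root)
		return in;
	if (in->dirtied_when < root->dirtied_when)
		root->left = flush_tree_insert(root->left, in);
	else if (in->dirtied_when > root->dirtied_when)
		root->right = flush_tree_insert(root->right, in);
	else {
		struct toy_inode *p = root;
		while (p->next_dup)
			p = p->next_dup;
		p->next_dup = in;	/* tail of the duplicate list */
	}
	return root;
}

static void flush_in_order(struct toy_inode *n)
{
	if (!n)
		return;
	flush_in_order(n->left);
	for (struct toy_inode *p = n; p; p = p->next_dup)
		printf("flush inode %d (dirtied_when=%lu)\n", p->id,
		       p->dirtied_when);
	flush_in_order(n->right);
}

int main(void)
{
	struct toy_inode pool[5] = {
		{ .dirtied_when = 30, .id = 0 }, { .dirtied_when = 10, .id = 1 },
		{ .dirtied_when = 30, .id = 2 }, { .dirtied_when = 20, .id = 3 },
		{ .dirtied_when = 10, .id = 4 },
	};
	struct toy_inode *root = NULL;

	for (int i = 0; i < 5; i++)
		root = flush_tree_insert(root, &pool[i]);
	flush_in_order(root);	/* prints inodes 1, 4, 3, 0, 2 */
	return 0;
}

An in-order walk plus a walk of each duplicate chain yields exactly the flush order the patch comment specifies: earliest dirtied_when first, and first-queued first among equals.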
[patch 1/1] Writeback fix for concurrent large and small file writes
>From [EMAIL PROTECTED] Wed Nov 28 11:10:06 2007 Message-Id: <[EMAIL PROTECTED]> Date: Wed, 28 Nov 2007 11:01:21 -0800 From: [EMAIL PROTECTED] To: [EMAIL PROTECTED] Subject: [patch 1/1] Writeback fix for concurrent large and small file writes. From: Michael Rubin <[EMAIL PROTECTED]> Fixing a bug where writing to large files while concurrently writing to smaller ones creates a situation where writeback cannot keep up with the traffic and memory balloons until we hit the threshold watermark. This can result in surprising latency spikes when syncing. This latency can take minutes on large memory systems. Upon request I can provide a test to reproduce this situation. The flush tree fixes this issue and fixes several other minor issues with fairness also. 1) Adding a data structure to guarantee fairness when writing inodes to disk. The flush_tree is based on an rbtree. The only difference is how duplicate keys are chained off the same rb_node. 2) Added a FS flag to mark file systems that are not disk backed so we don't have to flush them. Not sure I marked all of them. But just marking these improves writeback performance. 3) Added an inode flag to allow inodes to be marked so that they are never written back to disk. See get_pipe_inode. Under autotest this patch has passed: fsx, bonnie, and iozone. I am currently writing more writeback focused tests (which so far have been passed) to add into autotest. Signed-off-by: Michael Rubin <[EMAIL PROTECTED]> --- Index: 2624rc3/fs/block_dev.c === --- 2624rc3.orig/fs/block_dev.c 2007-11-16 21:16:36.0 -0800 +++ 2624rc3/fs/block_dev.c 2007-11-27 10:51:26.0 -0800 @@ -518,6 +518,7 @@ static struct file_system_type bd_type = .name = "bdev", .get_sb = bd_get_sb, .kill_sb = kill_anon_super, + .fs_flags = FS_ANONYMOUS, }; static struct vfsmount *bd_mnt __read_mostly; Index: 2624rc3/fs/fs-writeback.c === --- 2624rc3.orig/fs/fs-writeback.c 2007-11-16 21:16:36.0 -0800 +++ 2624rc3/fs/fs-writeback.c 2007-11-27 17:40:19.0 -0800 @@ -23,8 +23,174 @@ #include #include #include +#include #include "internal.h" +#define rb_to_inode(node) rb_entry((node), struct inode, i_flush_node) + +/* + * When inodes are parked for writeback they are parked in the + * flush_tree. The flush tree is a data structure based on an rb tree. + * + * Duplicate keys are handled by making a list in the tree for each key + * value. The order of how we choose the next inode to flush is decided + * by two fields. First the earliest dirtied_when value. If there are + * duplicate dirtied_when values then the earliest i_flushed_when value + * determines who gets flushed next. + * + * The flush tree organizes the dirtied_when keys with the rb_tree. Any + * inodes with a duplicate dirtied_when value are link listed together. This + * link list is sorted by the inode's i_flushed_when. When both the + * dirited_when and the i_flushed_when are indentical the order in the + * linked list determines the order we flush the inodes. + */ + +/* + * Find a rb_node matching the key in the flush tree. There are no duplicate + * rb_nodes in the tree. Instead they are chained off the first node.
+ */ +static struct inode *flush_tree_search(struct super_block *sb, + unsigned long ts) +{ + struct rb_node *n = sb->s_flush_root.rb_node; + assert_spin_locked(&inode_lock); + while (n) { + struct inode *inode = rb_to_inode(n); + if (time_before(ts, inode->dirtied_when)) { + n = n->rb_left; + } else if (time_after(ts, inode->dirtied_when)) { + n = n->rb_right; + } else { + return inode; + } + } + return NULL; +} + +/* + * Inserting an inode into the flush tree. The tree is keyed by the + * dirtied_when member. + * + * If there is a duplicate key in the tree already the new inode is put + * on the tail of a list of the rb_node. + * All inserted inodes must have one of the I_DIRTY flags set. + */ +static void flush_tree_insert(struct super_block *sb, struct inode *inode) +{ + struct rb_node **new = &(sb->s_flush_root.rb_node); + struct rb_node *parent = NULL; + + assert_spin_locked(&inode_lock); + BUG_ON((inode->i_state & I_DIRTY) == 0); + BUG_ON(inode->i_state & (I_FREEING|I_CLEAR)); + BUG_ON(RB_LINKED_NODE(&inode->i_flush_node)); + + sb->s_flush_count++; + + list_del_init(&inode->i_list); + while (*new) { + struct inode *this = rb_to_inode(*new); + parent = *new; + if (time_before(ino
Re: [patch 1/1] Writeback fix for concurrent large and small file writes
Thank you. I integrated the fixes into my patch. On Nov 28, 2007 6:13 PM, Frans Pop <[EMAIL PROTECTED]> wrote: > Two typos in comments. > > Cheers, > FJP > > Michael Rubin wrote: > > + * The flush tree organizes the dirtied_when keys with the rb_tree. Any > > + * inodes with a duplicate dirtied_when value are link listed together. > > This + * link list is sorted by the inode's i_flushed_when. When both the > > + * dirited_when and the i_flushed_when are indentical the order in the > > + * linked list determines the order we flush the inodes. > > s/dirited_when/dirtied_when/ > > > + * Here is where we interate to find the next inode to process. The > > + * strategy is to first look for any other inodes with the same > > dirtied_when + * value. If we have already processed that node then we > > need to find + * the next highest dirtied_when value in the tree. > > s/interate/iterate/
Re: [patch 1/1] Writeback fix for concurrent large and small file writes
So Feng's one-line change fixes the problem at hand. I will do some more testing with it and then submit his patch, credited to him, for 2.6.24, if that's cool with Feng. Also I will take the comment changes and re-submit my patch for 2.6.25 as a general-purpose improvement and see what happens. mrubin On Nov 28, 2007 4:34 PM, Fengguang Wu <[EMAIL PROTECTED]> wrote: > On Wed, Nov 28, 2007 at 11:29:57AM -0800, Michael Rubin wrote: > > >From [EMAIL PROTECTED] Wed Nov 28 11:10:06 2007 > > Message-Id: <[EMAIL PROTECTED]> > > Date: Wed, 28 Nov 2007 11:01:21 -0800 > > From: [EMAIL PROTECTED] > > To: [EMAIL PROTECTED] > > Subject: [patch 1/1] Writeback fix for concurrent large and small file > > writes. > > > > From: Michael Rubin <[EMAIL PROTECTED]> > > > > Fixing a bug where writing to large files while concurrently writing to > > smaller ones creates a situation where writeback cannot keep up with the > > Could you demonstrate the situation? Or if I guess it right, could it > be fixed by the following patch? (not a nack: If so, your patch could > also be considered as a general purpose improvement, instead of a bug > fix.) > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c > index 0fca820..62e62e2 100644 > --- a/fs/fs-writeback.c > +++ b/fs/fs-writeback.c > @@ -301,7 +301,7 @@ __sync_single_inode(struct inode *inode, struct > writeback_control *wbc) > * Someone redirtied the inode while were writing back > * the pages. > */ > - redirty_tail(inode); > + requeue_io(inode); > } else if (atomic_read(&inode->i_count)) { > /* > * The inode is clean, inuse > > Thank you, > Fengguang > > > > traffic and memory balloons until we hit the threshold watermark. This > > can result in surprising latency spikes when syncing. This latency > > can take minutes on large memory systems. Upon request I can provide > > a test to reproduce this situation. The flush tree fixes this issue and > > fixes several other minor issues with fairness also. > > > > 1) Adding a data structure to guarantee fairness when writing inodes > > to disk. The flush_tree is based on an rbtree. The only difference is > > how duplicate keys are chained off the same rb_node. > > > > 2) Added a FS flag to mark file systems that are not disk backed so we > > don't have to flush them. Not sure I marked all of them. But just marking > > these improves writeback performance. > > > > 3) Added an inode flag to allow inodes to be marked so that they are > > never written back to disk. See get_pipe_inode. > > > > Under autotest this patch has passed: fsx, bonnie, and iozone. I am > > currently writing more writeback focused tests (which so far have been > > passed) to add into autotest. > > > > Signed-off-by: Michael Rubin <[EMAIL PROTECTED]> > > ---
Re: [patch 1/1] Writeback fix for concurrent large and small file writes
Due to my faux pas of top posting (see http://www.zip.com.au/~akpm/linux/patches/stuff/top-posting.txt) I am resending this email. On Nov 28, 2007 4:34 PM, Fengguang Wu <[EMAIL PROTECTED]> wrote: > Could you demonstrate the situation? Or if I guess it right, could it > be fixed by the following patch? (not a nack: If so, your patch could > also be considered as a general purpose improvement, instead of a bug > fix.) > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c > index 0fca820..62e62e2 100644 > --- a/fs/fs-writeback.c > +++ b/fs/fs-writeback.c > @@ -301,7 +301,7 @@ __sync_single_inode(struct inode *inode, struct > writeback_control *wbc) > * Someone redirtied the inode while were writing back > * the pages. > */ > - redirty_tail(inode); > + requeue_io(inode); > } else if (atomic_read(&inode->i_count)) { > /* > * The inode is clean, inuse > By testing the situation I can confirm that the one-line patch above fixes the problem. I will continue testing some other cases to see if it causes any other issues, but I don't expect it to. I will post this change for 2.6.24 and list Feng as author, if that's OK with Feng. As for the original patch, I will resubmit it for 2.6.25 as a general-purpose improvement. mrubin
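For readers following the thread, the two helpers being swapped differ as follows; this is a from-memory paraphrase of the 2.6.24-era fs/fs-writeback.c, so consult an actual tree for the exact text. The swap matters because redirty_tail() can refresh dirtied_when, which keeps pushing a repeatedly-redirtied big inode's expiry into the future, while requeue_io() leaves the timestamp alone and retries the inode on s_more_io in the same run.

static void redirty_tail(struct inode *inode)
{
	struct super_block *sb = inode->i_sb;

	if (!list_empty(&sb->s_dirty)) {
		struct inode *tail_inode;

		tail_inode = list_entry(sb->s_dirty.next, struct inode, i_list);
		if (!time_after_eq(inode->dirtied_when, tail_inode->dirtied_when))
			inode->dirtied_when = jiffies;	/* keep s_dirty time-ordered */
	}
	list_move(&inode->i_list, &sb->s_dirty);
}

static void requeue_io(struct inode *inode)
{
	/* no timestamp update: the inode keeps its place in the expiry order */
	list_move(&inode->i_list, &inode->i_sb->s_more_io);
}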
Re: [patch 1/1] Writeback fix for concurrent large and small file writes
On Nov 29, 2007 5:32 PM, Fengguang Wu <[EMAIL PROTECTED]> wrote: > > On Nov 28, 2007 4:34 PM, Fengguang Wu <[EMAIL PROTECTED]> wrote: > > > Could you demonstrate the situation? Or if I guess it right, could it > > > be fixed by the following patch? Feng, I am sorry, but I was mistaken: I reran my tests and am now finding that the patch you gave me is NOT fixing the problem. The patch I refer to is the one you posted on this thread that adds a requeue_io in __sync_single_inode. I tarred up my test code. It is still in very rough shape, but it can reproduce the issue. You can find it here: http://neverthere.org/mhr/wb/wb-test.tar.bz2 Just make the test and run it with the args "-duration 0:5:0 -starvation". You must be root so it can set some sysctl values. > One major concern could be whether a continuous writer dirtying pages > at the 'right' pace will generate a steady flow of write I/Os which are > _tiny_hence_inefficient_. > > So it's not a problem in *theory* :-) > > > I will post this change for 2.6.24 and list Feng as author. If that's > > ok with Feng. I am going to try to track down what is up in 2.6.24 and see if I can find a less dramatic fix than my tree patch for the short term. If you get a chance to reproduce the problem with my test on your patch, that would rock. I still would like to see my full patch accepted into 2.6.25. A patch should be arriving shortly that will incorporate my larger patch and Qi Yong's fix for skip-writing-data-pages-when-inode-is-under-i_sync: http://www.gossamer-threads.com/lists/linux/kernel/849493 As always, thanks for your patience, mrubin
[patch] Converting writeback linked lists to a tree based data structure
From: Michael Rubin <[EMAIL PROTECTED]> For those of you who have waited so long: this is the third submission of the first attempt at this patch. It is a trilogy. Two changes are in this patch. They are dependent on each other. In addition we get an unintended performance improvement. Syncing 1,000,000 inodes, each with 4KB of dirty data, takes the original kernel 83 seconds; with the change it takes 77 seconds. 1) Adding a data structure to guarantee fairness when writing inodes to disk and simplify the code base. When inodes are parked for writeback they are parked in the flush_tree. The flush tree is a data structure based on an rb tree. Duplicate keys are handled by making a list in the tree for each key value. The order of how we choose the next inode to flush is decided by two fields. First the earliest dirtied_when value. If there are duplicate dirtied_when values then the earliest i_flush_gen value determines who gets flushed next. The flush tree organizes the dirtied_when keys with the rb_tree. Any inodes with a duplicate dirtied_when value are link listed together. This link list is sorted by the inode's i_flush_gen. When both the dirtied_when and the i_flush_gen are identical the order in the linked list determines the order we flush the inodes. 2) Added an inode flag to allow inodes to be marked so that they are never written back to disk. The motivation behind this change is several fold. The first is to ensure fairness in the writeback algorithm. The second is to deal with a bug where writing to large files concurrently with smaller ones creates a situation where writeback cannot keep up with traffic and memory balloons until we hit the threshold watermark. This can result in surprisingly long latency with respect to disk traffic. This latency can take minutes. The flush tree fixes this issue and fixes several other minor issues with fairness also. Signed-off-by: Michael Rubin <[EMAIL PROTECTED]> --- Index: 2624rc7_wb/fs/anon_inodes.c === --- 2624rc7_wb.orig/fs/anon_inodes.c 2008-01-06 13:45:38.0 -0800 +++ 2624rc7_wb/fs/anon_inodes.c 2008-01-08 13:46:59.0 -0800 @@ -154,13 +154,7 @@ static struct inode *anon_inode_mkinode( inode->i_fop = &anon_inode_fops; - /* - * Mark the inode dirty from the very beginning, - * that way it will never be moved to the dirty - * list because mark_inode_dirty() will think - * that it already _is_ on the dirty list. - */ - inode->i_state = I_DIRTY; + inode->i_state = I_WRITEBACK_NEVER; inode->i_mode = S_IRUSR | S_IWUSR; inode->i_uid = current->fsuid; inode->i_gid = current->fsgid; Index: 2624rc7_wb/fs/fs-writeback.c === --- 2624rc7_wb.orig/fs/fs-writeback.c 2008-01-06 13:45:38.0 -0800 +++ 2624rc7_wb/fs/fs-writeback.c 2008-01-14 18:53:54.0 -0800 @@ -23,8 +23,185 @@ #include #include #include +#include #include "internal.h" +#define rb_to_inode(node) rb_entry((node), struct inode, i_flush_node) + +/* + * When inodes are parked for writeback they are parked in the + * flush_tree. The flush tree is a data structure based on an rb tree. + * + * Duplicate keys are handled by making a list in the tree for each key + * value. The order of how we choose the next inode to flush is decided + * by two fields. First the earliest dirtied_when value. If there are + * duplicate dirtied_when values then the earliest i_flush_gen value + * determines who gets flushed next. + * + * The flush tree organizes the dirtied_when keys with the rb_tree. Any + * inodes with a duplicate dirtied_when value are link listed together.
This + * link list is sorted by the inode's i_flush_gen. When both the + * dirtied_when and the i_flush_gen are identical the order in the + * linked list determines the order we flush the inodes. + */ + +/* + * Find a rb_node matching the key in the flush tree. There are no duplicate + * rb_nodes in the tree. Instead they are chained off the first node. + */ +static struct inode *flush_tree_search(struct super_block *sb, + unsigned long ts) +{ + struct rb_node *n = sb->s_flush_root.rb_node; + assert_spin_locked(&inode_lock); + while (n) { + struct inode *inode = rb_to_inode(n); + if (time_before(ts, inode->dirtied_when)) { + n = n->rb_left; + } else if (time_after(ts, inode->dirtied_when)) { + n = n->rb_right; + } else { + return inode; + } + } + return NULL; +} + +/* + * Inserting an inode into the flush tree. The tree is keyed by the +
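The third-submission patch is also truncated here, but the two-level ordering its header describes can be captured in a few lines. The sketch below is a hypothetical comparator, not code from the patch: earliest dirtied_when wins, and among equal dirtied_when values the smaller i_flush_gen (serviced longest ago) goes first; the time comparison is modeled wrap-safe the way the kernel's time_before() macro is.

#include <stdbool.h>

struct toy_inode {
	unsigned long dirtied_when;	/* jiffies when first dirtied */
	unsigned long i_flush_gen;	/* generation last serviced */
};

/* Wrap-safe "a is earlier than b", modeled on the kernel macro. */
static bool toy_time_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* Returns true if a should be flushed before b. */
static bool flush_before(const struct toy_inode *a, const struct toy_inode *b)
{
	if (a->dirtied_when != b->dirtied_when)
		return toy_time_before(a->dirtied_when, b->dirtied_when);
	return a->i_flush_gen < b->i_flush_gen;
}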
Re: [patch] Converting writeback linked lists to a tree based data structure
On Jan 15, 2008 12:46 AM, Peter Zijlstra <[EMAIL PROTECTED]> wrote: > Just a quick question, how does this interact/depend-uppon etc.. with > Fengguangs patches I still have in my mailbox? (Those from Dec 28th) They don't interact; they apply to a 2.6.24rc7 tree. This is a candidate for 2.6.25, and this work was done before Fengguang's patches. I am trying to test Fengguang's for comparison but am having problems getting mm1 to boot on my systems. mrubin
Re: [PATCH 00/13] writeback bug fixes and simplifications take 2
On Jan 15, 2008 4:36 AM, Fengguang Wu <[EMAIL PROTECTED]> wrote: > Andrew, > > This patchset mainly polishes the writeback queuing policies. Does anyone know which tree this patchset is based on? > The main goals are: > > (1) small files should not be starved by big dirty files > (2) sync as fast as possible for not-blocked inodes/pages > - don't leave them out; no congestion_wait() in between them > (3) avoid busy iowait for blocked inodes > - retry them in the next go of s_io (maybe at the next wakeup of pdflush) > Fengguang, do you have any specific tests for any of these cases? As I posted earlier, I am putting together a writeback test suite for test.kernel.org; if you have one (even if it's an ugly shell script) it would save me some time. Also, if you want any of mine, let me know. :-) mrubin > The role of the queues: > > s_dirty: park for dirtied_when expiration > s_io: park for io submission > s_more_io: for big dirty inodes, they will be retried in this run of pdflush > (it ensures fairness between small/large files) > s_more_io_wait: for blocked inodes, they will be picked up in next run of s_io > > > This patchset is in better shape, but still not ready for merge. > It begins with: > > [PATCH 01/13] writeback: revert > 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b > [PATCH 02/13] writeback: clear PAGECACHE_TAG_DIRTY for truncated page > in block_write_full_page() > > Introduces more_io/more_io_wait based policies: > > [PATCH 03/13] writeback: introduce writeback_control.more_io > [PATCH 04/13] writeback: introduce super_block.s_more_io_wait > [PATCH 05/13] writeback: merge duplicate code into > writeback_some_pages() > [PATCH 06/13] writeback: defer writeback on not-all-pages-written > [PATCH 07/13] writeback: defer writeback on locked inode > [PATCH 08/13] writeback: defer writeback on locked buffers > [PATCH 09/13] writeback: requeue_io() on redirtied inode > > And finishes with some code cleanups: > > [PATCH 10/13] writeback: introduce queue_dirty() > [PATCH 11/13] writeback: queue_dirty() on memory-backed bdi > [PATCH 12/13] writeback: remove redirty_tail() > [PATCH 13/13] writeback: cleanup __sync_single_inode() > > Diffstat: > > fs/buffer.c |2 > fs/fs-writeback.c | 121 +++--- > fs/super.c |1 > include/linux/fs.h |1 > mm/page-writeback.c | 46 +++ > 5 files changed, 72 insertions(+), 99 deletions(-) > > Regards, > Fengguang Wu
Re: [patch] Converting writeback linked lists to a tree based data structure
On Jan 15, 2008 7:01 PM, Fengguang Wu <[EMAIL PROTECTED]> wrote: > Basically I think rbtree is an overkill to do time based ordering. > Sorry, Michael. But s_dirty would be enough for that. Plus, s_more_io > provides fair queuing between small/large files, and s_more_io_wait > provides waiting mechanism for blocked inodes. I think the flush_tree (which is a little more than just an rbtree) provides the same queuing mechanisms that the three or four list heads do, and manages to do it in one structure. The i_flushed_when provides the ability to have blocked inodes wait their turn, so to speak. Another motivation behind the rbtree patch is to unify the data structure that handles the priority and mechanism of how we write out the pages of the inodes. There are ideas about introducing priority schemes for QoS and such in the future. I am not saying this patch is about making that happen, but the idea is, if possible, to unify the four stages of lists into a single structure to facilitate efforts like that. mrubin
Crash with 2.6.24-rc6-mm1 in restore_i387_ia32
When I boot my system with 2.6.24-rc6 everything is great. When I apply the mm1 patch I crash with the output below. Anyone seen this before? Starting sendmail: Unable to handle kernel paging request at 00010013 RIP: [] restore_i387_ia32+0x1f/0x150 PGD 47e134067 PUD 0 Oops: [1] SMP last sysfs file: /sys/devices/system/machinecheck/machinecheck3/tolerant CPU 3 Modules linked in: Pid: 0, comm: make Not tainted 2.6.24-rc6-mm1 #2 RIP: 0010:[] [] restore_i387_ia32+0x1f/0x150 RSP: :81047e58be38 EFLAGS: 00010282 RAX: 81047e58bfd8 RBX: 81047e4cc000 RCX: RDX: RSI: ffdc232c RDI: 81047e4cc000 RBP: 81047e58bec8 R08: R09: ffdc25d8 R10: 81047e58a000 R11: R12: R13: 81047e4cc000 R14: 81047e58bf58 R15: 81047e58bf24 FS: 7ff45f9566e0(0003) GS:81047f015d80(0003) knlGS:f7eebe80 CS: 0010 DS: 002b ES: 002b CR0: 8005003b CR2: 00010013 CR3: 00047e117000 CR4: 06e0 DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400 Process make (pid: 0, threadinfo , task 81047e4cc000) Stack: 81047e58be48 80243202 81047e58bf48 8020ad7d 81047e58be98 0011 0001 0011 810400040001 0008087c Call Trace: Code: 8b 51 14 89 d0 83 e0 01 85 c0 74 11 9b 83 e2 fe 89 51 14 0f RIP [] restore_i387_ia32+0x1f/0x150 RSP CR2: 00010013 divide error: [2] SMP last sysfs file: /sys/devices/system/machinecheck/machinecheck3/tolerant CPU 1 Modules linked in: Pid: 2175, comm: uname Tainted: G D 2.6.24-rc6-mm1 #2 RIP: 0010:[] [] calc_delta_mine+0x64/0x90 RSP: :81027dc9fb78 EFLAGS: 00010046 RAX: 0001 RBX: 81047e4cc038 RCX: RDX: RSI: 0400 RDI: c009aeef285e RBP: 81027dc9fb78 R08: 81047e4cc038 R09: R10: 0001 R11: 0064 R12: 8102800a21c0 R13: 8102800a21c0 R14: 000a6eeea85e R15: 0003 FS: 7f120bb256e0() GS:81027f029000() knlGS:f7ef6e80 CS: 0010 DS: 002b ES: 002b CR0: 8005003b CR2: f7f05aa0 CR3: 00027dc59000 CR4: 06e0 DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400 Process uname (pid: 2175, threadinfo 81027dc9e000, task 81027e542790) Stack: 81027dc9fba8 802313a6 0001 8102800a21c0 81047e552038 0001 81027dc9fbd8 80231491 81027dc9fbd8 81047e552038 8102800a5dc0 8102800a5dc0 Call Trace: [] update_curr+0xa6/0xb0 [] enqueue_entity+0x21/0x70 [] enqueue_task_fair+0x32/0x60 [] enqueue_task+0x13/0x30 [] activate_task+0x30/0x50 [] try_to_wake_up+0x10d/0x130 [] default_wake_function+0xd/0x10 [] autoremove_wake_function+0x11/0x40 [] __wake_up_common+0x4e/0x90 [] __wake_up_sync+0x4a/0x70 [] pipe_write+0x3d2/0x500 [] __do_fault+0x215/0x470 [] do_sync_write+0xf1/0x150 [] autoremove_wake_function+0x0/0x40 [] do_page_fault+0x420/0x7e0 [] __up_write+0x72/0x150 [] vfs_write+0xc7/0x170 [] sys_write+0x50/0x90 [] ia32_sysret+0x0/0xa Code: 48 f7 f1 48 8d 48 01 49 89 48 08 eb 9f 48 8d 82 00 80 00 00 RIP [] calc_delta_mine+0x64/0x90 RSP ---[ end trace ca96c3a87188c958 ]--- BUG: sleeping function called from invalid context at kernel/rwsem.c:21 in_atomic():0, irqs_disabled():1 Pid: 2175, comm: uname Tainted: G D 2.6.24-rc6-mm1 #2 Call Trace: [] __might_sleep+0xb2/0xd0 [] down_read+0x1d/0x30 [] exit_mm+0x30/0x100 [] do_exit+0x1bb/0x850 [] oops_end+0x85/0x90 [] die+0x5e/0x90 [] do_trap+0x154/0x160 [] do_divide_error+0x8a/0xa0 [] calc_delta_mine+0x64/0x90 [] do_lookup+0x80/0x1f0 [] dput+0x65/0x1a0 [] __link_path_walk+0xcf5/0xd50 [] error_exit+0x0/0x51 [] calc_delta_mine+0x64/0x90 [] update_curr+0xa6/0xb0 [] enqueue_entity+0x21/0x70 [] enqueue_task_fair+0x32/0x60 [] enqueue_task+0x13/0x30 [] activate_task+0x30/0x50 [] try_to_wake_up+0x10d/0x130 [] default_wake_function+0xd/0x10 [] autoremove_wake_function+0x11/0x40 [] __wake_up_common+0x4e/0x90 [] __wake_up_sync+0x4a/0x70 [] pipe_write+0x3d2/0x500 [] 
__do_fault+0x215/0x470 [] do_sync_write+0xf1/0x150 [] autoremove_wake_function+0x0/0x40 [] do_page_fault+0x420/0x7e0 [] __up_write+0x72/0x150 [] vfs_write+0xc7/0x170 [] sys_write+0x50/0x90 [] ia32_sysret+0x0/0xa
Re: [patch] Converting writeback linked lists to a tree based data structure
On Jan 17, 2008 1:41 AM, Fengguang Wu <[EMAIL PROTECTED]> wrote: > On Tue, Jan 15, 2008 at 12:09:21AM -0800, Michael Rubin wrote: > The main benefit of rbtree is possibly better support of future policies. > Can you demonstrate an example? These are ill-formed thoughts on my end as of now, but the idea was that keeping one tree sorted via a single scheme might be simpler than multiple list_heads. > Bugs can only be avoided by good understanding of all possible cases.) I take the above statement as a tautology, and I am trying my best to understand all the cases. :-) > The most tricky writeback issues could be starvation prevention > between > - small/large files > - new/old files > - superblocks So I have written tests and believe I have covered these issues. If you are concerned about any in specific and have a test case, please let me know. > Some kind of limit should be applied for each. They used to be: > - requeue to s_more_io whenever MAX_WRITEBACK_PAGES is reached > this preempts big files The patch uses the same limit. > - refill s_io iff it is drained > this prevents promotion of big/old files Once a big file gets its first do_writepages it is moved behind the other smaller files via i_flushed_when. And the same in reverse for big vs old. > - return from sync_sb_inodes() after one go of s_io I am not sure how this limit helps things out. Is this for superblock starvation? Can you elaborate? > Michael, could you sort out and document the new starvation prevention > schemes? The basic idea behind how the writeback algorithm handles starvation is this: we want to preserve the order of writeback based on when an inode was dirtied, and also preserve the dirtied_when contents until the inode has been written back (partially or fully). Every sync_sb_inodes we find the least recently dirtied inodes. To deal with large/small starvation we keep an s_flush_gen for each iteration of sync_sb_inodes; every time we issue a writeback against an inode we mark it so that it cannot be processed again until the next s_flush_gen. This way no one inode can keep us from getting to the rest, since we keep pushing already-serviced inodes into subsequent s_flush_gens. Let me know if you want more detail or structured responses; I have also sketched the scheme in the model below. > Introduce i_flush_gen to help restarting from the last inode? > Well, it's not as simple as list_heads. > > > 2) Added an inode flag to allow inodes to be marked so that they > > are never written back to disk. > > > > The motivation behind this change is several fold. The first is > > to ensure fairness in the writeback algorithm. The second is to > > What do you mean by fairness? This comment was originally written when I was trying to fix a bug in 2.6.23, the one where we were starving large files from being flushed. There was a fairness issue where small files were being flushed but the large ones were just ballooning in memory. > Why cannot I_WRITEBACK_NEVER be in a decoupled standalone patch? The WRITEBACK_NEVER could be in a patch preceding the rbtree one, but not in a subsequent patch, because the rbtree depends on the WRITEBACK_NEVER change; otherwise we run into problems in generic_delete_inode. Now that you point it out, I think I can split this into two patches and put WRITEBACK_NEVER in the first one. > More details about the fixings, please? Again, this comment was written against 2.6.23. The biggest fix is the starving of big files. I remember there were other smaller issues, but there have been so many changes in the patch sets that I need to go back and quantify them.
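Here is the promised sketch: a userspace model of the s_flush_gen scheme, with the mechanics simplified from the prose above (the generation arithmetic and slice size are my guesses, not the patch's exact code). Once an inode is serviced it is stamped with the current generation and cannot be re-picked until sync_sb_inodes starts the next one, so the big inode with the oldest dirtied_when gets one slice per run instead of monopolizing the flusher.

#include <stdio.h>

#define NINODES 3
#define MAX_WRITEBACK_PAGES 4

struct toy_inode {
	unsigned long dirtied_when;	/* older = flushed first */
	unsigned long i_flush_gen;	/* generation last serviced */
	long dirty_pages;
};

static unsigned long s_flush_gen;

static void sync_sb_inodes(struct toy_inode *in, int n)
{
	s_flush_gen++;	/* one generation per run */
	for (;;) {
		int pick = -1;

		/* Linear stand-in for "leftmost tree node not stamped yet". */
		for (int i = 0; i < n; i++) {
			if (!in[i].dirty_pages || in[i].i_flush_gen == s_flush_gen)
				continue;
			if (pick < 0 || in[i].dirtied_when < in[pick].dirtied_when)
				pick = i;
		}
		if (pick < 0)
			break;	/* every dirty inode got one slice this gen */
		long w = in[pick].dirty_pages < MAX_WRITEBACK_PAGES ?
			in[pick].dirty_pages : MAX_WRITEBACK_PAGES;
		in[pick].dirty_pages -= w;
		in[pick].i_flush_gen = s_flush_gen;	/* defer to next gen */
		printf("gen %lu: inode %d wrote %ld pages\n", s_flush_gen, pick, w);
	}
}

int main(void)
{
	/* One big file with the oldest dirtied_when plus two small newer
	 * files: without the stamp the big file would be re-picked until
	 * clean; with it, everyone gets one slice per run. */
	struct toy_inode inodes[NINODES] = {
		{ .dirtied_when = 100, .dirty_pages = 12 },
		{ .dirtied_when = 200, .dirty_pages = 2 },
		{ .dirtied_when = 300, .dirty_pages = 2 },
	};
	for (int run = 0; run < 3; run++)
		sync_sb_inodes(inodes, NINODES);
	return 0;
}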
Re: [patch] Converting writeback linked lists to a tree based data structure
On Jan 17, 2008 9:01 PM, David Chinner <[EMAIL PROTECTED]> wrote: First off, thank you for the very detailed reply. This rocks and gives me much to think about. > On Thu, Jan 17, 2008 at 01:07:05PM -0800, Michael Rubin wrote: > This seems suboptimal for large files. If you keep feeding in > new least recently dirtied files, the large files will never > get an unimpeded go at the disk and hence we'll struggle to > get decent bandwidth under anything but pure large file > write loads. You're right, I understand now. I just changed a dial on my tests, ran them, and found pdflush not keeping up like it should. I need to address this. > Switching inodes during writeback implies a seek to the new write > location, while continuing to write the same inode has no seek > penalty because the writeback is sequential. It follows from this > that allowing large files a disproportionate amount of data > writeback is desirable. > > Also, cycling rapidly through all the large files to write 4MB to each is > going to cause us to spend time seeking rather than writing compared > to cycling slower and writing 40MB from each large file at a time. > > i.e. servicing one large file for 100ms is going to result in higher > writeback throughput than servicing 10 large files for 10ms each > because there's going to be less seeking and more writing done by > the disks. > > That is, think of large file writes like process scheduler batch > jobs - bulk throughput is what matters, so the larger the time slice > you give them the higher the throughput. > > IMO, the sort of result we should be looking at is a > writeback design that results in cycling somewhat like: > > slice 1: iterate over small files > slice 2: flush large file 1 > slice 3: iterate over small files > slice 4: flush large file 2 > .. > slice n-1: flush large file N > slice n: iterate over small files > slice n+1: flush large file N+1 > > So that we keep the disk busy with a relatively fair mix of > small and large I/Os while both are necessary. I see where you are coming from. But if we are going to make changes to optimize for seeks, maybe we need to be more aggressive in how writeback organizes both time and location. Right now, AFAIK, there is no attention paid to location in the writeback path. > The higher the bandwidth of the device, the more frequently > we need to be servicing the inodes with large amounts of > dirty data to be written to maintain write throughput at a > significant percentage of the device capability. > Could you expand that to say it's not the inodes of large files, but the inodes whose data layout lets us exploit locality? Large files are often fragmented. Would it make more sense to pursue cracking the inodes and grouping their blocks' locations? Or is this all overkill that should be handled at a lower level, like the elevator? > BTW, it needs to be recognised that if we are under memory pressure > we can clean much more memory in a short period of time by writing > out all the large files first. This would clearly benefit the system > as a whole as we'd get the most pages available for reclaim as > possible in a short a time as possible. The writeback algorithm > should really have a mode that allows this sort of flush ordering to > occur. I completely agree. mrubin
Re: [patch] Converting writeback linked lists to a tree based data structure
On Jan 17, 2008 8:56 PM, Fengguang Wu <[EMAIL PROTECTED]> wrote: Once again, thanks for the speedy replies. :-) > On Thu, Jan 17, 2008 at 01:07:05PM -0800, Michael Rubin wrote: > Suppose we want to grant longer expiration window for temp files, > adding a new list named s_dirty_tmpfile would be a handy solution. When you say tmp, do you mean files that eventually get written to disk? If not, I would just use the WRITEBACK_NEVER. If so, I am not sure that feature is worth making a special case for. It seems like the location-based ideas may be more useful. > So the question is: should we need more than 3 QoS classes? > > > > The most tricky writeback issues could be starvation prevention > > > between > > > > > > > - small/large files > > > - new/old files > > > - superblocks > > > > So I have written tests and believe I have covered these issues. If > > you are concerned about any in specific and have a test case, please let > > me know. > > OK. > > > > Some kind of limit should be applied for each. They used to be: > > > - requeue to s_more_io whenever MAX_WRITEBACK_PAGES is reached > > > this preempts big files > > > > The patch uses the same limit. > > > > > - refill s_io iff it is drained > > > this prevents promotion of big/old files > > > > Once a big file gets its first do_writepages it is moved behind the > > other smaller files via i_flushed_when. And the same in reverse for > > big vs old. > > You mean i_flush_gen? Yeah, sorry. It was once called i_flush_when. (sheepish) > No, sync_sb_inodes() will abort on every > MAX_WRITEBACK_PAGES, and s_flush_gen will be updated accordingly. > Hence the sync will restart from big/old files. If I understand you correctly, I am not sure I agree. Here is what I think happens in the patch: 1) pull the big inode off the flush tree, 2) sync the big inode, 3) hit MAX_WRITEBACK_PAGES, 4) re-insert the big inode (without modifying its dirtied_when), and 5) update the big inode's i_flush_gen and re-insert it behind the small inodes we have not synced yet. In a subsequent sync_sb_inodes we end up retrieving the small inode we had not serviced yet. > > > - return from sync_sb_inodes() after one go of s_io > > I am not sure how this limit helps things out. Is this for superblock > starvation? Can you elaborate? > We should have a way to go to next superblock even if new dirty inodes > or pages are emerging fast in this superblock. Fill and drain s_io > only once and then abort helps. Got it. > s_io is a stable and bounded working set in one go of superblock. Is this necessary with MAX_WRITEBACK_PAGES? It feels like a double limit. > Basically you make one list_head in each rbtree node. > That list_head is recycled cyclic, and is an analog to the old > fashioned s_dirty. We need to know 'where we are' and 'where it ends'. > So an extra indicator must be introduced - i_flush_gen. It's awkward. > We are simply repeating the aged list_heads' problem. To me they both feel a little awkward. I feel like the original problem in 2.6.23 led to a lot of examination, which is bringing new possibilities to light. BTW, the issue that started me on this whole path (starving large files) was still present in 2.6.23-rc8 but now looks fixed in 2.6.24-rc3. Still no idea about your changes in 2.6.24-rc6-mm1; I have given up trying to get that thing to boot. mrubin
Re: [PATCH 00/13] writeback bug fixes and simplifications take 2
On Jan 15, 2008 4:36 AM, Fengguang Wu <[EMAIL PROTECTED]> wrote: > Andrew, > > This patchset mainly polishes the writeback queuing policies. > The main goals are: > > (1) small files should not be starved by big dirty files > (2) sync as fast as possible for not-blocked inodes/pages > - don't leave them out; no congestion_wait() in between them > (3) avoid busy iowait for blocked inodes > - retry them in the next go of s_io (maybe at the next wakeup of pdflush) > > The role of the queues: > > s_dirty: park for dirtied_when expiration > s_io: park for io submission > s_more_io: for big dirty inodes, they will be retried in this run of pdflush > (it ensures fairness between small/large files) > s_more_io_wait: for blocked inodes, they will be picked up in next run of s_io Quick question to make sure I get this. Each queue is sorted as follows: s_dirty is sorted by the dirtied_when field; s_io has no explicit key, it is simply in the order we want to process inodes in sync_sb_inodes; and s_more_io holds inodes for later, sorted in the same manner as s_io. Is that it? My reading of how the queues interact is sketched in the model below. mrubin
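And here is the model mentioned above: a toy userspace run of the four-queue scheme as I read the description (this is my interpretation, not code from the patchset). Expired s_dirty inodes and last run's blocked inodes are drained into s_io; inodes that hit MAX_WRITEBACK_PAGES go to s_more_io and are retried in the same run; blocked inodes go to s_more_io_wait for the next run.

#include <stdio.h>
#include <string.h>

#define MAXQ 16

struct queue { int n; char items[MAXQ][16]; };

static void push(struct queue *q, const char *item)
{
	strcpy(q->items[q->n++], item);	/* no bounds check: toy code */
}

static void drain(struct queue *from, struct queue *to)
{
	for (int i = 0; i < from->n; i++)
		push(to, from->items[i]);
	from->n = 0;
}

int main(void)
{
	struct queue s_dirty = { 0 }, s_io = { 0 };
	struct queue s_more_io = { 0 }, s_more_io_wait = { 0 };

	push(&s_dirty, "small-a");
	push(&s_dirty, "big-file");
	push(&s_dirty, "locked-inode");

	/* queue_io(): expired s_dirty inodes plus the previous run's
	 * blocked inodes become this run's working set. */
	drain(&s_more_io_wait, &s_io);
	drain(&s_dirty, &s_io);

	/* sync_sb_inodes(): service s_io once, requeueing by outcome. */
	for (int i = 0; i < s_io.n; i++) {
		const char *inode = s_io.items[i];

		if (!strcmp(inode, "big-file"))
			push(&s_more_io, inode);	/* hit MAX_WRITEBACK_PAGES */
		else if (!strcmp(inode, "locked-inode"))
			push(&s_more_io_wait, inode);	/* blocked: next run */
		else
			printf("%s written clean\n", inode);
	}
	s_io.n = 0;

	/* The same pdflush run keeps going with s_more_io, so big files
	 * are not starved; blocked inodes wait without busy iowait. */
	drain(&s_more_io, &s_io);
	printf("retrying this run: %s\n", s_io.items[0]);
	printf("next run picks up: %s\n", s_more_io_wait.items[0]);
	return 0;
}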
Re: [patch] Converting writeback linked lists to a tree based data structure
On Jan 18, 2008 12:54 AM, David Chinner <[EMAIL PROTECTED]> wrote: > At this point, I'd say it is best to leave it to the filesystem and > the elevator to do their jobs properly. Amen. mrubin