On Thu 27-06-13 19:59:50, Linus Torvalds wrote: > On Thu, Jun 27, 2013 at 5:54 PM, Dave Chinner <da...@fromorbit.com> wrote: > > On Thu, Jun 27, 2013 at 04:54:53PM -1000, Linus Torvalds wrote: > >> > >> So what made it all start happening now? I don't recall us having had > >> these kinds of issues before.. > > > > Not sure - it's a sudden surprise for me, too. Then again, I haven't > > been looking at sync from a performance or lock contention point of > > view any time recently. The algorithm that wait_sb_inodes() is > > effectively unchanged since at least 2009, so it's probably a case > > of it having been protected from contention by some external factor > > we've fixed/removed recently. Perhaps the bdi-flusher thread > > replacement in -rc1 has changed the timing sufficiently that it no > > longer serialises concurrent sync calls as much.... > > > > However, the inode_sb_list_lock is known to be a badly contended > > lock from a create/unlink fastpath for XFS, so it's not like this sort > > of thing is completely unexpected. > > That whole inode_sb_list_lock seems moronic. Why isn't it a per-sb > one? No, that won't fix all problems, but it might at least help a > *bit*. > > Also, looking some more now at that wait_sb_inodes logic, I have to > say that if the problem is primarily the inode->i_lock, then that's > just crazy. We normally shouldn't even *need* that lock, since we > could do a totally unlocked iget() as long as the count is non-zero. > > And no, I don't think really need the i_lock for checking > "mapping->nrpages == 0" or the magical "inode is being freed" bits > either. Or at least we could easily do some of this optimistically for > the common cases. > > So instead of doing > > struct address_space *mapping = inode->i_mapping; > > spin_lock(&inode->i_lock); > if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) || > (mapping->nrpages == 0)) { > spin_unlock(&inode->i_lock); > continue; > } > __iget(inode); > spin_unlock(&inode->i_lock); > > I really think we could do that without getting the inode lock at > *all* in the common case. > > I'm attaching a pretty trivial patch, which may obviously be trivially > totally flawed. I have not tested this in any way, but half the new > lines are comments about why it's doing what it is doing. And I > really think that it should make the "actually take the inode lock" be > something quite rare. > > And quite frankly, I'd much rather get *rid* of crazy i_lock accesses, > than try to be clever and use a whole different list at this point. > Not that I disagree that it wouldn't be much nicer to use a separate > list in the long run, but for a short-term solution I'd much rather > keep the old logic and just tweak it to be much more usable.. > > Hmm? Al? Jan? Comments? Yeah, the patch looks good to me so if it helps Dave with his softlockups I also think it's a safer alternative than Dave's patch for 3.10. BTW, one suggestion for improvement below:
fs/fs-writeback.c | 58 +++++++++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 48 insertions(+), 10 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 3be57189efd5..3dcc8b202a40 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -1206,6 +1206,52 @@ out_unlock_inode: } EXPORT_SYMBOL(__mark_inode_dirty); +/* + * Do we want to get the inode for writeback? + */ +static int get_inode_for_writeback(struct inode *inode) +{ + struct address_space *mapping = inode->i_mapping; + + /* + * It's a data integrity sync, but we don't care about + * racing with new pages - we're about data integrity + * of things in the past, not the future + */ + if (!ACCESS_ONCE(mapping->nrpages)) + return 0; I think we can change the above condition to: if (!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) return 0; That should make us skip most of the inodes in the case Dave Chinner was testing. Honza + + /* Similar logic wrt the I_NEW bit */ + if (ACCESS_ONCE(inode->i_state) & I_NEW) + return 0; + + /* + * When the inode count goes down to zero, the + * I_WILL_FREE and I_FREEING bits might get set. + * But not before. + * + * So if we get this, we know those bits are + * clear, and the inode is still interesting. + */ + if (atomic_inc_not_zero(&inode->i_count)) + return 1; + + /* + * Slow path never happens normally, since any + * active inode will be referenced by a dentry + * and thus caught above + */ + spin_lock(&inode->i_lock); + if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) || + (mapping->nrpages == 0)) { + spin_unlock(&inode->i_lock); + return 0; + } + __iget(inode); + spin_unlock(&inode->i_lock); + return 1; +} + static void wait_sb_inodes(struct super_block *sb) { struct inode *inode, *old_inode = NULL; @@ -1226,16 +1272,8 @@ static void wait_sb_inodes(struct super_block *sb) * we still have to wait for that writeout. */ list_for_each_entry(inode, &sb->s_inodes, i_sb_list) { - struct address_space *mapping = inode->i_mapping; - - spin_lock(&inode->i_lock); - if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) || - (mapping->nrpages == 0)) { - spin_unlock(&inode->i_lock); + if (!get_inode_for_writeback(inode)) continue; - } - __iget(inode); - spin_unlock(&inode->i_lock); spin_unlock(&inode_sb_list_lock); /* @@ -1249,7 +1287,7 @@ static void wait_sb_inodes(struct super_block *sb) iput(old_inode); old_inode = inode; - filemap_fdatawait(mapping); + filemap_fdatawait(inode->i_mapping); cond_resched(); -- Jan Kara <j...@suse.cz> SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/