Andrew Morton wrote:
> On Sat, 28 Apr 2007 03:34:32 +1000 David Chinner <[EMAIL PROTECTED]> wrote:
>
> > Some more information - stripe unit on the dm raid0 is 512k.
> > I have not attempted to increase I/O sizes at all yet - these tests are
> > just demonstrating efficiency improvements in the filesystem.
> >
> > These numbers are for 32GB files.
> >
> >                  READ         WRITE
> > disks  blksz   tput   sys   tput   sys
> > -----  -----  -----  ----  -----  ----
> >     1     4k     89   18s     57   44s
> >     1    16k     46   13s     67   18s
> >     1    64k     75   12s     68   12s
> >     2     4k    179   20s    114   43s
> >     2    16k     55   13s    132   18s
> >     2    64k    126   12s    126   12s
> >     4     4k    350   20s    214   43s
> >     4    16k    350   14s    264   19s
> >     4    64k    176   11s    266   12s
> >     8     4k    415   21s    446   41s
> >     8    16k    655   13s    518   19s
> >     8    64k    664   12s    552   12s
> >    12     4k    413   20s    633   33s
> >    12    16k    736   14s    741   19s
> >    12    64k    836   12s    743   12s
> >
> > Throughput in MB/s.
> >
> > Consistent improvement across the write results, first time
> > I've hit the limits of the PCI-X bus with a single buffered
> > I/O thread doing either reads or writes.
>
> 1-disk and 2-disk read throughput fell by an improbable amount, which makes
> me cautious about the other numbers.
>
> Your annotation says "blocksize". Are you really varying the fs blocksize
> here, or did you mean "pagesize"?
>
> What worries me here is that we have inefficient code, and increasing the
> pagesize amortises that inefficiency without curing it. If so, it would be
> better to fix the inefficiencies, so that 4k pagesize will also benefit.
>
> For example, see __do_page_cache_readahead(). It does a read_lock() and a
> page allocation and a radix-tree lookup for each page. We can vastly
> improve that.
>
> Step 1:
>
> - do a read-lock
> - do a radix-tree walk to work out how many pages are missing
> - read-unlock
> - allocate that many pages
> - read_lock()
> - populate all the pages.
> - read_unlock
> - if any pages are left over, free them
> - if we ended up not having enough pages, redo the whole thing.
>
> that will reduce the number of read_lock()s, read_unlock()s and radix-tree
> descents by a factor of 32 or so in this testcase. That's a lot, and it's
> something we (Nick ;)) should have done ages ago.
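
The batching steps quoted above can be modelled in userspace as a rough sketch. This is not kernel code and not the actual __do_page_cache_readahead() logic: `PageCache`, `CountingLock`, `alloc_page` and the `between_passes` hook are hypothetical names introduced only for illustration, a dict stands in for the radix tree, and a plain mutex that counts acquisitions stands in for the tree lock. The baseline function is likewise a simplified assumption about "one lock round trip per page".

```python
import threading


class CountingLock:
    """Stand-in for the tree lock; counts how often it is taken."""

    def __init__(self):
        self._lock = threading.Lock()
        self.acquisitions = 0

    def __enter__(self):
        self._lock.acquire()
        self.acquisitions += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        self._lock.release()
        return False


class PageCache:
    """Toy page cache: a dict stands in for the radix tree."""

    def __init__(self):
        self.lock = CountingLock()
        self.pages = {}


def alloc_page():
    # Placeholder for a real page allocation.
    return object()


def readahead_per_page(cache, start, nr):
    """Simplified baseline: lock round trips for every page."""
    for index in range(start, start + nr):
        with cache.lock:
            missing = index not in cache.pages
        if missing:
            page = alloc_page()  # allocate with the lock dropped
            with cache.lock:
                cache.pages.setdefault(index, page)


def readahead_batched(cache, start, nr, between_passes=None):
    """Batched variant following the quoted steps.

    Returns the number of surplus pages that were freed.
    """
    wanted = range(start, start + nr)
    freed = 0
    while True:
        # Pass 1: one locked walk to count the missing pages.
        with cache.lock:
            missing = sum(1 for i in wanted if i not in cache.pages)
        # Allocate that many pages with the lock dropped.
        pool = [alloc_page() for _ in range(missing)]
        if between_passes is not None:
            between_passes(cache)  # hypothetical hook: simulates a racing task
        # Pass 2: one locked walk to populate the holes.
        short = False
        with cache.lock:
            for i in wanted:
                if i in cache.pages:
                    continue
                if not pool:
                    short = True  # more holes than pages allocated
                    break
                cache.pages[i] = pool.pop()
        freed += len(pool)  # leftover pages are "freed"
        pool.clear()
        if not short:
            return freed
        # Not enough pages: redo the whole thing.
```

In this toy model, filling a 32-page window in an empty cache takes the lock 64 times with the per-page function and twice with the batched one. The exact ratio depends entirely on the modelling assumptions, so treat it as an illustration of the idea rather than a measurement.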

We can do pretty well with the lockless radix tree (that is already upstream)
there. I split that stuff out of my most recent lockless pagecache patchset,
because it doesn't require the "scary" speculative refcount stuff of the
lockless pagecache proper.

Subject: [patch 5/9] mm: lockless probe.

So that is something we could merge pretty soon.

The other thing is that we can batch up pagecache page insertions for bulk
writes as well (that is, write(2) with buffer size > page size). I should
have a patch somewhere for that as well if anyone is interested.

--
SUSE Labs, Novell Inc.

