Nick Piggin <[EMAIL PROTECTED]> writes:

> On Saturday 08 September 2007 06:01, Goswin von Brederlow wrote:
>> Nick Piggin <[EMAIL PROTECTED]> writes:
>> > So I believe the problem is that for a multi-segment iovec, we currently
>> > prepare_write/commit_write once for each segment, right? We do this
>>
>> It is more complex.
>>
>> Currently a __grab_cache_page, a_ops->prepare_write,
>> filemap_copy_from_user[_iovec] and a_ops->commit_write is done
>> whenever we hit
>>
>> a) a page boundary
>
> This is required by the prepare_write/commit_write API. The write_begin
> / write_end API is also a page-based one, but in future, we are looking
> at having a more general API but we haven't completely decided on the
> form yet. "perform_write" is one proposal you can look for.
>
>> b) a segment boundary
>
> This is done, as I said, because of the deadlock issue. While the issue is
> more completely fixed in -mm, a special case for kernel memory (eg. nfsd)
> is in the latest mainline kernels.
Can you tell me where to get the fix from -mm? If it is completely fixed
there then that could make our patch obsolete.

>> Those two cases don't have to, and from the stats basically never,
>> coincide. For NFSd this means we do this TWICE per segment and TWICE
>> per page.
>
> The page boundary doesn't matter so much (well it does for other reasons,
> but we've never been good at them...). The segment boundary means that
> we aren't able to do block sized writes very well and end up doing a lot of
> read-modify-write operations that could be avoided.

Those are extremely costly for Lustre. We have tested exporting a Lustre
filesystem to NFS. Without the fixes we get 40MB/s and with the fixes it
rises to nearly 200MB/s. That is a factor of 5 in speed.

>> > because there is a nasty deadlock in the VM (copy_from_user being
>> > called with a page locked), and copying multiple segs dramatically
>> > increases the chances that one of these copies will cause a page fault
>> > and thus potentially deadlock.
>>
>> What actually locks the page? Is it __grab_cache_page or
>> a_ops->prepare_write?
>
> prepare_write must be given a locked page.

Then that means __grab_cache_page does return a locked page, because
there is nothing between the two calls that would lock it.

>> Note that the patch does not change the number of copy_from_user calls
>> being made nor does it change their arguments. If we need 2 (or more)
>> segments to fill a page we still do 2 separate calls to
>> filemap_copy_from_user_iovec, both only spanning (part of) one
>> segment.
>>
>> What the patch changes is the number of copy_from_user calls between
>> __grab_cache_page and a_ops->commit_write.
>
> So you're doing all copy_from_user calls within a prepare_write? Then
> you're increasing the chances of deadlock. If not, then you're breaking
> the API contract.

Actually, due to a bug, as you noticed, we do the copy first and then
prepare_write. But fixing that would indeed do multiple copies between
prepare and commit.
>> Copying a full PAGE_SIZE bytes from multiple segments in one go would
>> be a further improvement if that is possible.
>>
>> > The fix you have I don't think can work because a filesystem must be
>> > notified of the modification _before_ it has happened. (If I understand
>> > correctly, you are skipping the prepare_write potentially until after
>> > some data is copied?).
>>
>> Yes. We changed the order of copy_from_user calls and
>> a_ops->prepare_write by mistake. We will rectify that and do the
>> prepare_write for the full page (when possible) before copying the
>> data into the page.
>
> OK, that is what used to be done, but the API is broken due to this
> deadlock. write_begin/write_end fixes it properly.

I'm very interested in that fix.

>> > Anyway, there are fixes for this deadlock in Andrew's -mm tree, but
>> > also a workaround for the NFSD problem in git commit 29dbb3fc. Did
>> > you try a later kernel to see if it is fixed there?
>>
>> Later than 2.6.23-rc5?
>
> No, it would be included earlier. The "segment_eq" check should be
> allowing kernel writes (nfsd) to write multiple segments. If you have a
> patch which changes this significantly, then it would indicate the
> existing logic has a problem (or you've got a userspace application doing
> the writev, which should be fixed by the write_begin patches in -mm).

I've got a userspace application doing the writev. To be exact, 14% of
the commits were saved by combining multiple segments into a single
prepare_write/commit_write pair. Since the kernel segments don't fragment
anymore in 2.6.23-rc5, those savings must come from user space.

From the stats posted earlier you can see that there is a substantial
number of calls with 6 segments, all (a lot) smaller than a page. Those
are calls our patch or write_begin/write_end will save.
MfG
        Goswin
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/