Re: [PATCH] ext4: dir inode reservation V3

2007-11-13 Thread Alex Tomas
hmm. so you trade 265% degradation of creation for 40% improvement of unlink? thanks, Alex
Coly Li wrote:
                 normal ext4           ext4 with dir inode reservation
mount options:   -o data=writeback     -o data=writeback,dir_ireserve=low

Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-19 Thread Alex Tomas
On 9/19/07, David Chinner <[EMAIL PROTECTED]> wrote: > The problem is this: to alter the fundamental block size of the > filesystem we also need to alter the data block size and that is > exactly the piece that linux does not support right now. So while > we have the capability to use large block

Re: [PATCH] ext4:fix unexpected error from ext4_reserve_global

2007-06-18 Thread Alex Tomas
ACK, of course. thanks, Alex Mingming Cao wrote: On Thu, 2007-06-14 at 19:29 +0400, Dmitriy Monakhov wrote: I just cant belive my eyes then i saw this at the first time... simple test: strace dd if=/dev/zero of=/mnt/file Thanks for reporting it. open("/dev/zero", O_RDONLY) = 0

Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-05-04 Thread Alex Tomas
Andrew Morton wrote: I'm still not understanding. The terms you're using are a bit ambiguous. What does "find some dirty unallocated blocks" mean? Find a page which is dirty and which does not have a disk mapping? Normally the above operation would be implemented via ext4_writeback_writepage(

Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-05-03 Thread Alex Tomas
Andrew Morton wrote: On Fri, 04 May 2007 10:18:12 +0400 Alex Tomas <[EMAIL PROTECTED]> wrote: Andrew Morton wrote: Yes, there can be issues with needing to allocate journal space within the context of a commit. But no-no, this isn't required. we only need to mark pages/bl

Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-05-03 Thread Alex Tomas
Andrew Morton wrote: Yes, there can be issues with needing to allocate journal space within the context of a commit. But no-no, this isn't required. we only need to mark pages/blocks within transaction, otherwise race is possible when we allocate blocks in transaction, then transaction starts t

Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-05-03 Thread Alex Tomas
Andrew Morton wrote: We can make great improvements here, and I've (twice) previously decribed how: hoist the entire ordered-mode data handling out of ext3, and out of the buffer_head layer and move it up into the VFS pagecache layer. Basically, do ordered-data with a commit-time inode walk, cal

Re: 2.6.21-ext4-1

2007-04-30 Thread Alex Tomas
Theodore Ts'o wrote: P.S. One bug which I've noted --- if there is a failure due to disk filling up, running e2fsck on the filesystem will show that the i_blocks fields on the inodes where there was a failure to allocate disk blocks are left incorrect. I'm guessing this is a bug in the delayed

Re: O_DIRECT question

2007-01-17 Thread Alex Tomas
I think one problem with mmap/msync is that they can't maintain i_size atomically like regular write does. so one needs to implement one's own i_size management in userspace. thanks, Alex > Side note: the only reason O_DIRECT exists is because database people are > too used to it, because other OS's
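
Alex's point can be illustrated from the userspace side. The sketch below uses hypothetical helpers (`mmap_store` and `demo_store_hello` are illustrative names, not from the thread): since a store through a shared mapping cannot extend a file, the caller must grow the file with `ftruncate()` before copying data in, so size and data change in two separate steps rather than one.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: store `len` bytes at offset `off` of `fd` through a
 * shared mapping. A store through mmap() never extends a file, so the file
 * size is grown first with ftruncate(). Size and data are therefore updated
 * in two separate steps, not as one operation the way write() does it.
 * Returns 0 on success, -1 on error (errno set by the failing call).
 */
static int mmap_store(int fd, off_t off, const void *buf, size_t len)
{
    struct stat st;

    assert(len > 0);                  /* mmap() rejects zero-length mappings */
    if (fstat(fd, &st) < 0)
        return -1;

    /* Step 1: userspace i_size management. */
    if (off + (off_t)len > st.st_size && ftruncate(fd, off + (off_t)len) < 0)
        return -1;

    /* mmap() offsets must be page aligned. */
    off_t pg = (off_t)sysconf(_SC_PAGESIZE);
    off_t map_off = off - off % pg;
    size_t map_len = (size_t)(off - map_off) + len;

    char *p = mmap(NULL, map_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, map_off);
    if (p == MAP_FAILED)
        return -1;

    /* Step 2: the data itself, flushed with msync(). */
    memcpy(p + (off - map_off), buf, len);
    int rc = msync(p, map_len, MS_SYNC);
    munmap(p, map_len);
    return rc;
}

/* Usage: store 5 bytes into an empty scratch file and report its size. */
static long demo_store_hello(void)
{
    char path[] = "/tmp/mmap_store_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                     /* the file lives until close() */

    struct stat st;
    char back[5];
    long size = -1;
    if (mmap_store(fd, 0, "hello", 5) == 0 && fstat(fd, &st) == 0 &&
        pread(fd, back, 5, 0) == 5 && memcmp(back, "hello", 5) == 0)
        size = (long)st.st_size;
    close(fd);
    return size;
}
```

A real application would also need its own record of the logical end of valid data (for example in a file header), because a crash between the two steps can leave a file whose size covers bytes that were never written.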

Re: [PATCH] return ENOENT from ext3_link when racing with unlink

2007-01-16 Thread Alex Tomas
> Peter Staubach (PS) writes: PS> Just out of curosity, what keeps i_nlink from going to 0 immediately PS> after the new test is executed? i_mutex in vfs_link() and vfs_unlink() thanks, Alex
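
A toy userspace model of that answer, assuming a pthread mutex stands in for the inode's i_mutex (the `toy_*` names are hypothetical, not VFS code): because the `i_nlink == 0` check in link and the decrement in unlink run under the same lock, the count cannot reach zero between the check and the increment.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Toy stand-in for an inode; not the kernel's struct inode. */
struct toy_inode {
    pthread_mutex_t i_mutex;
    unsigned int i_nlink;
};

/* unlink: drop one link while holding i_mutex. */
static void toy_unlink(struct toy_inode *inode)
{
    pthread_mutex_lock(&inode->i_mutex);
    assert(inode->i_nlink > 0);
    inode->i_nlink--;
    pthread_mutex_unlock(&inode->i_mutex);
}

/* link: refuse to resurrect an inode whose last link is gone.
 * The test and the increment happen under the same i_mutex that
 * unlink takes, so no unlink can slip in between them. */
static int toy_link(struct toy_inode *inode)
{
    int err = 0;

    pthread_mutex_lock(&inode->i_mutex);
    if (inode->i_nlink == 0)
        err = -ENOENT;
    else
        inode->i_nlink++;
    pthread_mutex_unlock(&inode->i_mutex);
    return err;
}
```

Without the lock, the check and the increment could interleave with an unlink that drops the count to zero, which is exactly the window the question asks about.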

Re: [PATCH] [RFC] remove ext3 inode from orphan list when link and unlink race

2007-01-12 Thread Alex Tomas
> Eric Sandeen (ES) writes: ES> Al says "no" and I'm not arguing. :) ES> Apparently this may be OK with some filesystems, and Al says he doesn't ES> want to know about i_nlink in the vfs in any case. well, generic_drop_inode() uses i_nlink ... ES> But I suppose there may be other files

Re: [PATCH] [RFC] remove ext3 inode from orphan list when link and unlink race

2007-01-12 Thread Alex Tomas
> Eric Sandeen (ES) writes: ES> I tend to agree, chatting w/ Al I think he does too. :) I'll test ES> a patch that kicks out ext3_link() with -ENOENT at the top, and resubmit ES> that if things go well. shouldn't VFS do that? thanks, Alex

Re: [PATCH] [RFC] remove ext3 inode from orphan list when link and unlink race

2007-01-12 Thread Alex Tomas
> Eric Sandeen (ES) writes: ES> so I think it's possible that link can sneak in there & find it after ES> the mutex is dropped...? Is this ok? :) It's certainly -happening- ES> anyway yes, but it shouldn't allow such an inode to be re-linked, IMHO. a filesystem may start some non-revert

Re: [PATCH] [RFC] remove ext3 inode from orphan list when link and unlink race

2007-01-12 Thread Alex Tomas
ck i_nlink. thanks, Alex >>>>> Alex Tomas (AT) writes: AT> interesting .. AT> I thought VFS doesn't allow concurrent operations. AT> if unlink goes first, then link should wait on the AT> parent's i_mutex and then found no source name. AT> thanks, Alex

Re: [PATCH] [RFC] remove ext3 inode from orphan list when link and unlink race

2007-01-12 Thread Alex Tomas
interesting .. I thought VFS doesn't allow concurrent operations. if unlink goes first, then link should wait on the parent's i_mutex and then find no source name. thanks, Alex > Eric Sandeen (ES) writes: ES> ) ES> I've been looking at a case where many threads are opening, unlinking, a

Re: [RFC] delayed allocation for ext4

2006-12-28 Thread Alex Tomas
> David Chinner (DC) writes: DC> So that mean's we'll have 2 separate mechanisms for marking DC> pages as delalloc. XFS uses the BH_delay flag to indicate DC> that a buffer (block) attached to the page is using delalloc. >> >> well, for blocksize=pagesize we can save 56 bytes on every pa

Re: [RFC] ext4-block-reservation.patch

2006-12-23 Thread Alex Tomas
Hi, > Andrew Morton (AM) writes: AM> Should be cacheline_aligned_in_smp. AM> That's assuming it needs to be cacheline aligned at all. It can consume a AM> lot of space. the idea is to make block reservation cheap because it's called for every page. AM> AM> oh, this should be
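
One cheap way to reserve a block per page is a single atomic counter updated with compare-and-swap, kept on its own cache line in the spirit of Andrew's alignment comment. This is a sketch only: `struct block_pool`, `reserve_blocks()` and `release_blocks()` are hypothetical, the 64-byte line size is assumed, and the patch itself may use a different scheme (a spinlock or per-CPU counters, for instance).

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical pool of free blocks that delayed writes reserve from. */
struct block_pool {
    _Alignas(64) atomic_long free_blocks;   /* own cache line, 64 bytes assumed */
};

/* Reserve n blocks or fail with -ENOSPC; lock-free via compare-and-swap. */
static int reserve_blocks(struct block_pool *pool, long n)
{
    assert(n > 0);
    long cur = atomic_load_explicit(&pool->free_blocks, memory_order_relaxed);
    do {
        if (cur < n)
            return -ENOSPC;
        /* on failure, cur is reloaded with the current value and we retry */
    } while (!atomic_compare_exchange_weak(&pool->free_blocks, &cur, cur - n));
    return 0;
}

/* Give back a reservation that will not be turned into real blocks. */
static void release_blocks(struct block_pool *pool, long n)
{
    assert(n > 0);
    atomic_fetch_add(&pool->free_blocks, n);
}
```

A writer would call `reserve_blocks(pool, 1)` when it dirties a new page and fail the write with ENOSPC on error; `release_blocks()` is needed only if the page goes away (for example on truncate) before real blocks are allocated for it.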

Re: [RFC] delayed allocation for ext4

2006-12-23 Thread Alex Tomas
> Christoph Hellwig (CH) writes: CH> Note that recording delayed alloc state at a page granularity in addition CH> to just the buffer heads has a lot of advantages aswell and would help CH> xfs, too. But I think it makes a lot more sense to record it as a radix CH> tree tag to speed up th

Re: [RFC] delayed allocation for ext4

2006-12-23 Thread Alex Tomas
Good day, > David Chinner (DC) writes: DC> So that mean's we'll have 2 separate mechanisms for marking DC> pages as delalloc. XFS uses the BH_delay flag to indicate DC> that a buffer (block) attached to the page is using delalloc. well, for blocksize=pagesize we can save 56 bytes on ever

[RFC] ext4-block-reservation.patch

2006-12-22 Thread Alex Tomas
Index: linux-2.6.20-rc1/include/linux/ext4_fs.h === --- linux-2.6.20-rc1.orig/include/linux/ext4_fs.h 2006-12-14 04:14:23.0 +0300 +++ linux-2.6.20-rc1/include/linux/ext4_fs.h 2006-12-22 20:21:12.0 +0300 @@

[RFC] ext4-delayed-allocation.patch

2006-12-22 Thread Alex Tomas
+0300 +++ linux-2.6.20-rc1/fs/ext4/writeback.c 2006-12-22 22:59:33.00000 +0300 @@ -0,0 +1,1167 @@ +/* + * Copyright (c) 2003-2006, Cluster File Systems, Inc, [EMAIL PROTECTED] + * Written by Alex Tomas <[EMAIL PROTECTED]> + * + * This program is free software; you can redistribut

[RFC] booked-page-flag.patch

2006-12-22 Thread Alex Tomas
Index: linux-2.6.20-rc1/include/linux/page-flags.h === --- linux-2.6.20-rc1.orig/include/linux/page-flags.h 2006-12-14 04:14:23.0 +0300 +++ linux-2.6.20-rc1/include/linux/page-flags.h 2006-12-22 20:05:31.0 +0300

[RFC] delayed allocation for ext4

2006-12-22 Thread Alex Tomas
Good day, probably the previous set of patches (including mballoc/lg) is too large. so, I reworked delayed allocation a bit so that it can be used on top of regular balloc, though it still can be used with extents-enabled files only. this time series contains just 3 patches: - booked-page-flag
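
The core idea of delayed allocation can be shown with a toy model: `write()` only marks a page as booked (space reserved, no block chosen yet), and writeback later allocates each run of booked pages as one contiguous extent. Everything here (`struct toy_file`, a bump allocator standing in for balloc, and the meaning given to "booked") is an assumption for illustration, not the patch's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES   16
#define NO_BLOCK (-1L)

/* Toy file: per-page "booked" flag and page -> physical block mapping. */
struct toy_file {
    bool booked[NPAGES];    /* space reserved, no block allocated yet */
    long block[NPAGES];     /* physical block, or NO_BLOCK */
};

static void toy_file_init(struct toy_file *f)
{
    for (size_t i = 0; i < NPAGES; i++) {
        f->booked[i] = false;
        f->block[i] = NO_BLOCK;
    }
}

/* write(): book the page; the block allocator is not called here. */
static void toy_write(struct toy_file *f, size_t page)
{
    assert(page < NPAGES);
    if (f->block[page] == NO_BLOCK)
        f->booked[page] = true;
}

/* writeback: allocate each run of booked pages as one contiguous extent
 * taken from *next_free. Returns the number of allocator calls made. */
static int toy_writeback(struct toy_file *f, long *next_free)
{
    int calls = 0;
    size_t i = 0;

    while (i < NPAGES) {
        if (!f->booked[i]) {
            i++;
            continue;
        }
        size_t start = i;
        while (i < NPAGES && f->booked[i])
            i++;
        calls++;                           /* one allocation per run */
        for (size_t j = start; j < i; j++) {
            f->block[j] = (*next_free)++;
            f->booked[j] = false;
        }
    }
    return calls;
}
```

Writing pages 0 to 7 one at a time and then running writeback costs a single allocator call and yields eight physically contiguous blocks, whereas allocating at `write()` time would have called the allocator eight times with no guarantee of contiguity.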

Re: Boot failure with ext2 and initrds

2006-11-16 Thread Alex Tomas
> Andrew Morton (AM) writes: AM> What lock protects the fields in struct ext[234]_reserve_window from being AM> concurrently modified by two CPUs? None, it seems. Ditto AM> ext[234]_reserve_window_node. i_mutex will cover it for write(), but not AM> for pageout over a file hole. If we

Re: [RFC] pdirops: vfs patch

2005-02-23 Thread Alex Tomas
> Jan Blunck (JB) writes: JB> Nope, d_alloc() is setting d_flags to DCACHE_UNHASHED. Therefore it is not found JB> by __d_lookup() until it is rehashed which is implicit done by ->lookup(). that means we can have two processes that each allocated a dentry for the same name. they'll call ->lookup() each ag

Re: [RFC] pdirops: vfs patch

2005-02-22 Thread Alex Tomas
> Jan Blunck (JB) writes: JB> i_sem does NOT protect the dcache. Also not in real_lookup(). The lock must be JB> acquired for ->lookup() and because we might sleep on i_sem, we have to get it JB> early and check for repopulation of the dcache. dentry is part of dcache, right? i_sem prote

Re: [RFC] pdirops: vfs patch

2005-02-22 Thread Alex Tomas
> Jan Blunck (JB) writes: >> 1) i_sem protects dcache too JB> Where? i_sem is the per-inode lock, and shouldn't be used else. read comments in fs/namei.c:real_lookup() >> 2) tmpfs has no "own" data, so we can use it this way (see 2nd patch) >> 3) I have pdirops patch for ext3, but it ne

Re: [RFC] pdirops: vfs patch

2005-02-20 Thread Alex Tomas
> Jan Blunck (JB) writes: JB> With luck you have s_pdirops_size (or 1024) different renames altering JB> concurrently one directory inode. Therefore you need a lock protecting JB> your filesystem data. This is basically the job done by i_sem. So in JB> my opinion you only move "The Problem

Re: [RFC] pdirops: tmpfs patch

2005-02-19 Thread Alex Tomas
Index: linux-2.6.10/mm/shmem.c === --- linux-2.6.10.orig/mm/shmem.c 2005-01-28 19:32:16.0 +0300 +++ linux-2.6.10/mm/shmem.c 2005-02-19 20:05:32.642599576 +0300 @@ -1849,7 +1849,7 @@ #endif }; -static int shmem_

Re: [RFC] pdirops: vfs patch

2005-02-19 Thread Alex Tomas
 fs/inode.c         |  1
 fs/namei.c         | 66 ++---
 include/linux/fs.h | 11
 3 files changed, 54 insertions(+), 24 deletions(-)
Index: linux-2.6.10/fs/namei.c ===

[RFC] parallel directory operations

2005-02-19 Thread Alex Tomas
Good day Al and all, could you review a couple of patches that implement $subj for vfs and tmpfs? In short, the idea is that we can protect operations by taking a semaphore associated with a set of names. definitely, protection at the vfs layer isn't enough and filesystems will need to protect their own structures by i
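
The name-hashed locking described here can be sketched in userspace with a fixed table of mutexes per directory, where each operation locks the slot its name hashes to. This is an illustration under assumptions: pthread mutexes replace kernel semaphores, FNV-1a is an arbitrary hash choice, and the table size of 1024 only echoes the default mentioned elsewhere in this thread. As Jan points out, such locks protect names only; a filesystem still needs its own locking for its directory structures.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define PDIROPS_SIZE 1024   /* illustrative table size */

/* Hypothetical per-directory table of name locks. */
struct pdir_locks {
    pthread_mutex_t lock[PDIROPS_SIZE];
};

static void pdir_locks_init(struct pdir_locks *pl)
{
    for (size_t i = 0; i < PDIROPS_SIZE; i++)
        pthread_mutex_init(&pl->lock[i], NULL);
}

/* 32-bit FNV-1a hash of the name selects the lock slot. */
static size_t pdir_slot(const char *name)
{
    uint32_t h = 2166136261u;
    for (; *name; name++) {
        h ^= (unsigned char)*name;
        h *= 16777619u;
    }
    return h % PDIROPS_SIZE;
}

/* Single-name operations (lookup, create, unlink) lock one slot. */
static void pdir_lock(struct pdir_locks *pl, const char *name)
{
    pthread_mutex_lock(&pl->lock[pdir_slot(name)]);
}

static void pdir_unlock(struct pdir_locks *pl, const char *name)
{
    pthread_mutex_unlock(&pl->lock[pdir_slot(name)]);
}

/* Two-name operations (rename within one directory) take both slots in
 * index order to avoid ABBA deadlock; names that collide share a slot. */
static void pdir_lock2(struct pdir_locks *pl, const char *a, const char *b)
{
    size_t sa = pdir_slot(a), sb = pdir_slot(b);
    if (sa > sb) {
        size_t t = sa;
        sa = sb;
        sb = t;
    }
    pthread_mutex_lock(&pl->lock[sa]);
    if (sb != sa)
        pthread_mutex_lock(&pl->lock[sb]);
}

static void pdir_unlock2(struct pdir_locks *pl, const char *a, const char *b)
{
    size_t sa = pdir_slot(a), sb = pdir_slot(b);
    pthread_mutex_unlock(&pl->lock[sa]);
    if (sb != sa)
        pthread_mutex_unlock(&pl->lock[sb]);
}
```

Operations on names that hash to different slots proceed in parallel, while two operations on the same name always serialize; unrelated names that happen to collide merely serialize unnecessarily, which is the price of a fixed-size table.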

Re: [Ext2-devel] Re: Latest ext3 patches (extents, mballoc, delayed allocation)

2005-02-15 Thread Alex Tomas
> Sonny Rao (SR) writes: SR> Alex, small buglet, If the FIBMAP-ioctl get's called on a file with SR> delayed allocation, you need to flush it (or at least allocate) before SR> returning the mappings. This doesn't seem to work properly at SR> present. good catch. thanks.

Re: Latest ext3 patches (extents, mballoc, delayed allocation)

2005-02-11 Thread Alex Tomas
Good day all, I've updated the patchset against 2.6.10. A bunch of bugs have been fixed and mballoc now behaves a bit smarter. The extents and mballoc patches collect some stats, which they print upon umount. NOTE: they must not be used to store important data. A lot of things are to be done. Please rev

Re: [Ext2-devel] [PATCH] JBD: journal_release_buffer()

2005-01-30 Thread Alex Tomas
>>>>> Stephen C Tweedie (SCT) writes: SCT> Hi, SCT> On Tue, 2005-01-25 at 19:30, Alex Tomas wrote: >> >> journal_dirty_metadata(handle, bh) >> >> { >> >> transaction->t_reserved--; >> >> handle->h_buffer_

Re: [Ext2-devel] [PATCH] JBD: journal_release_buffer()

2005-01-25 Thread Alex Tomas
> Stephen C Tweedie (SCT) writes: >> journal_dirty_metadata(handle, bh) >> { >> transaction->t_reserved--; >> handle->h_buffer_credits--; >> if (jh->b_tcount > 0) { >> /* modified, no need to track it any more */ >> transaction->t_outstanding_credits++; >>

Re: [Ext2-devel] [PATCH] JBD: journal_release_buffer()

2005-01-25 Thread Alex Tomas
Hi, could you review the following solution? t_outstanding_credits - number of _modified_ blocks in the transaction; t_reserved - number of blocks all running handles reserved; transaction size = t_outstanding_credits + t_reserved; #define TSIZE(t)((t)->t_outstanding_credits + (t)
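
The accounting proposed here can be modelled with toy structures (`struct toy_transaction`, `toy_start()`, `toy_dirty_metadata()` and `toy_stop()` are hypothetical, not JBD functions). A handle's reservation adds to `t_reserved`; the first modification of a buffer moves one credit over to `t_outstanding_credits`; stopping the handle returns what was not used, so `TSIZE()` ends up counting only blocks that really go to the log.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy versions of the fields discussed above; not the real JBD structs. */
struct toy_transaction {
    int t_outstanding_credits;   /* blocks actually modified */
    int t_reserved;              /* credits held by running handles */
};

#define TSIZE(t) ((t)->t_outstanding_credits + (t)->t_reserved)

struct toy_handle {
    struct toy_transaction *h_transaction;
    int h_buffer_credits;        /* credits this handle still holds */
};

struct toy_buffer {
    bool in_transaction;         /* already counted as modified? */
};

/* Start a handle that may modify up to nblocks buffers. */
static void toy_start(struct toy_handle *h, struct toy_transaction *t, int nblocks)
{
    h->h_transaction = t;
    h->h_buffer_credits = nblocks;
    t->t_reserved += nblocks;
}

/* First modification of a buffer turns one reserved credit into a used
 * one; dirtying the same buffer again costs nothing. TSIZE is unchanged. */
static void toy_dirty_metadata(struct toy_handle *h, struct toy_buffer *bh)
{
    if (bh->in_transaction)
        return;
    assert(h->h_buffer_credits > 0);
    bh->in_transaction = true;
    h->h_buffer_credits--;
    h->h_transaction->t_reserved--;
    h->h_transaction->t_outstanding_credits++;
}

/* Stopping the handle returns unused credits to the journal. */
static void toy_stop(struct toy_handle *h)
{
    h->h_transaction->t_reserved -= h->h_buffer_credits;
    h->h_buffer_credits = 0;
}
```

With 136 credits reserved and 10 distinct buffers modified, as in the 500MB-file removal example from the log space management patch, `TSIZE()` stays at 136 while the handle runs and drops to 10 once it stops.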

Re: [Ext2-devel] [PATCH] JBD: journal_release_buffer()

2005-01-24 Thread Alex Tomas
> Stephen C Tweedie (SCT) writes: >> + /* return credit back to the handle if it was really spent */ >> + if (credits) { >> + handle->h_buffer_credits++; >> + spin_lock(&handle->h_transaction->t_handle_lock); >> + handle->h_transaction->t_outstandi

Re: [Ext2-devel] [PATCH] JBD: log space management optimization

2005-01-24 Thread Alex Tomas
mmit is expensive and correct reservation allows us to avoid needless commits. here is the patch. tested on UP. Signed-off-by: Alex Tomas <[EMAIL PROTECTED]> Index: linux-2.6.7/fs/jbd/transaction.c === --- linux-2.6.7.orig/fs/jbd/t

Re: [Ext2-devel] [PATCH] JBD: journal_release_buffer()

2005-01-24 Thread Alex Tomas
> Stephen C Tweedie (SCT) writes: >> + /* return credit back to the handle if it was really spent */ >> + if (credits) >> + handle->h_buffer_credits++; >> + jh->b_tcount--; >> + if (jh->b_tcount == 0) { >> + /* >> +* this was last reference to

Re: [Ext2-devel] [PATCH] JBD: fix against journal overflow

2005-01-24 Thread Alex Tomas
> Stephen C Tweedie (SCT) writes: SCT> /* SCT>* Be pessimistic here about the number of those free blocks which SCT>* might be required for log descriptor control blocks. SCT>*/ SCT> ... SCT> left -= (left >> 3); oops. i overlooked this line. so, the fix becomes minor
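
The holdback rule quoted above is simple arithmetic, and a few lines of C can check it against a descriptor-block count. In this sketch `per_desc`, the number of data blocks one descriptor block describes, is a parameter; plugging in 64 mirrors the wbuf size discussed in this thread and is an assumption, not a JBD constant.

```c
#include <assert.h>

/* Free log blocks left for data after holding back 1/8 for descriptor
 * and control blocks, as in `left -= (left >> 3)`. */
static unsigned long usable_after_holdback(unsigned long left)
{
    return left - (left >> 3);
}

/* Descriptor blocks needed for `data` blocks if each descriptor covers at
 * most `per_desc` of them (rounded up). per_desc is an assumed parameter. */
static unsigned long descriptors_needed(unsigned long data, unsigned long per_desc)
{
    assert(per_desc > 0);
    return (data + per_desc - 1) / per_desc;
}
```

For 1024 free log blocks, 128 are held back, leaving 896 for data; even at 64 data blocks per descriptor those need only 14 descriptor blocks, well within the holdback, which fits with the fix turning out to be minor.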

Re: [Ext2-devel] [PATCH] JBD: fix against journal overflow

2005-01-24 Thread Alex Tomas
> Stephen C Tweedie (SCT) writes: SCT> I don't see how that "limit" is relevant here. wbuf is nothing but the SCT> size of the IO batches we pass to ll_rw_block() during that commit SCT> phase. j_free affects the total size of space the *entire* commit has SCT> to run into, and (as akpm h

[PATCH] JBD: log space management optimization

2005-01-19 Thread Alex Tomas
ion. for example, removal of 500MB file reserves 136 blocks, but only 10 blocks go to the log. a commit is expensive and correct reservation allows us to avoid needless commits. here is the patch. tested on UP. thanks, Alex Signed-off-by: Alex Tomas <[EMAIL PROTECTED]> Index: linu

[PATCH] JBD: journal_release_buffer()

2005-01-19 Thread Alex Tomas
xattr very much. Signed-off-by: Alex Tomas <[EMAIL PROTECTED]> Index: linux-2.6.7/include/linux/journal-head.h === --- linux-2.6.7.orig/include/linux/journal-head.h 2003-06-24 18:05:26.0 +0400 +++ linux-2.6.7/include

[PATCH] JBD: fix against journal overflow

2005-01-19 Thread Alex Tomas
nerates too many descriptor blocks because the static array wbuf can hold only 64 blocks. The fix is to have a persistent array big enough to hold the maximum possible number of blocks. Signed-off-by: Alex Tomas <[EMAIL PROTECTED]> Index: linux-2.6.7/include/linux/jbd.h