from:"Dmitriy Monakhov"

Re: [PATCH] [RFC] remove ext3 inode from orphan list when link and unlink race

2007-01-14 Thread Dmitriy Monakhov

Eric Sandeen <[EMAIL PROTECTED]> writes:

> I've been looking at a case where many threads are opening, unlinking, and
> hardlinking files on ext3 .
How many concurent threads do you use and how long does it takes to trigger 
this race? I've tried to reproduce this with two threads, but not succeed.
  
fd = create("src")
close(fd)
unlink("src")

link("src", "dst")
unlink("dst")

Original testcase will be the best answer :).
Thanks.
>  At unmount time I see an oops, because the superblock's
> orphan list points to a freed inode.
>
> I did some tracing of the inodes, and it looks like this:
>
>   ext3_unlink():[/src/linux-2.6.18/fs/ext3/namei.c:2123] adding orphan
>   i_state:0x7 cpu:1 i_count:2 i_nlink:0
>
>   ext3_orphan_add():[/src/linux-2.6.18/fs/ext3/namei.c:1890] ext3_orphan_add
>   i_state:0x7 cpu:1 i_count:2 i_nlink:0
>
>   iput():[/src/linux-2.6.18/fs/inode.c:1139] iput enter
>   i_state:0x7 cpu:1 i_count:2 i_nlink:0
>
>   ext3_link():[/src/linux-2.6.18/fs/ext3/namei.c:2202] ext3_link enter
>   i_state:0x7 cpu:3 i_count:1 i_nlink:0
>
>   ext3_inc_count():[/src/linux-2.6.18/fs/ext3/namei.c:1627] done
>   i_state:0x7 cpu:3 i_count:1 i_nlink:1
>
> The unlink gets there first, finds i_count > 0 (in use) but nlink goes to 0, 
> so
> it puts it on the orphan inode list.  Then link comes along, and bumps the 
> link
> back up to 1.  So now we are on the orphan inode list, but we are not 
> unlinked.
>
> Eventually when count goes to 0, and we still have 1 link, again no action is
> taken to remove the inode from the orphan list, because it is still linked 
> (i.e.
> we don't go through ext3_delete())
>
> When this inode is eventually freed, the sb orphan list gets corrupted, 
> because 
> we have freed it without first removing it from the orphan list.
>
> I think the simple solution is to remove the inode from the orphan list
> when we bump the link back up from 0 to 1.  I put that test in there because
> there are other potential reasons that we might be on the list (truncates,
> direct IO).
>
> Comments?
>
> Thanks,
> -Eric
>
> p.s. ext3_inc_count and ext3_dec_count seem misnamed, have an unused
> arg, and are very infrequently called.  I'll probably submit a patch
> to just put the single line of code into the caller, too.
>
> ---
>
> Remove inode from the orphan list in ext3_link() if we might have
> raced with ext3_unlink(), which potentially put it on the list.
> If we're on the list with nlink > 0, we'll never get cleaned up
> properly and eventually may corrupt the list.
>
> Signed-off-by: Eric Sandeen <[EMAIL PROTECTED]>
>
> Index: linux-2.6.19/fs/ext3/namei.c
> ===
> --- linux-2.6.19.orig/fs/ext3/namei.c
> +++ linux-2.6.19/fs/ext3/namei.c
> @@ -2204,6 +2204,9 @@ retry:
>   inode->i_ctime = CURRENT_TIME_SEC;
>   ext3_inc_count(handle, inode);
>   atomic_inc(&inode->i_count);
> + /* did we race w/ unlink? */
> + if (inode->i_nlink == 1)
> + ext3_orphan_del(handle, inode);
>  
>   err = ext3_add_nondir(handle, dentry, inode);
>   ext3_journal_stop(handle);
>
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 6/10] mm: be sure to trim blocks

2007-01-14 Thread Dmitriy Monakhov

Nick Piggin <[EMAIL PROTECTED]> writes:

> If prepare_write fails with AOP_TRUNCATED_PAGE, or if commit_write fails, then
> we may have failed the write operation despite prepare_write having
> instantiated blocks past i_size. Fix this, and consolidate the trimming into
> one place.
>
> Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>
>
> Index: linux-2.6/mm/filemap.c
> ===
> --- linux-2.6.orig/mm/filemap.c
> +++ linux-2.6/mm/filemap.c
> @@ -1911,22 +1911,9 @@ generic_file_buffered_write(struct kiocb
>   }
>  
>   status = a_ops->prepare_write(file, page, offset, offset+bytes);
> - if (unlikely(status)) {
> - loff_t isize = i_size_read(inode);
> + if (unlikely(status))
> + goto fs_write_aop_error;
May be it's stupid question but still..
Why we treat non zero prepare_write() return code as error, it may be positive.
Positive error code may be used as fine grained 'bytes' limiter in case of 
blksize < pgsize as follows:

status = a_ops->prepare_write(file, page, offset, offset+bytes);
if (unlikely(status)) {
if (status > 0) {
bytes = min(bytes, status);
status = 0;
} else {
goto fs_write_aop_error;
}
}
---
This is useful because fs may want to reduce 'bytes' by number of reasons,
for example make it blksize bound. 
Example : filesystem has 1k blksize and only two free blocks. And we try 
write 4k bytes.
Currently  write(fd, buff, 4096) will return -ENOSPC
But after this fix write(fd, buff, 4096) will return as mutch as it can (2048).
>  
> - if (status != AOP_TRUNCATED_PAGE)
> - unlock_page(page);
> - page_cache_release(page);
> - if (status == AOP_TRUNCATED_PAGE)
> - continue;
> - /*
> -  * prepare_write() may have instantiated a few blocks
> -  * outside i_size.  Trim these off again.
> -  */
> - if (pos + bytes > isize)
> - vmtruncate(inode, isize);
> - break;
> - }
>   if (likely(nr_segs == 1))
>   copied = filemap_copy_from_user(page, offset,
>   buf, bytes);
> @@ -1935,10 +1922,9 @@ generic_file_buffered_write(struct kiocb
>   cur_iov, iov_offset, bytes);
>   flush_dcache_page(page);
>   status = a_ops->commit_write(file, page, offset, offset+bytes);
> - if (status == AOP_TRUNCATED_PAGE) {
> - page_cache_release(page);
> - continue;
> - }
> + if (unlikely(status))
> + goto fs_write_aop_error;
> +
>   if (likely(copied > 0)) {
>   if (!status)
>   status = copied;
> @@ -1969,6 +1955,25 @@ generic_file_buffered_write(struct kiocb
>   break;
>   balance_dirty_pages_ratelimited(mapping);
>   cond_resched();
> + continue;
> +
> +fs_write_aop_error:
> + if (status != AOP_TRUNCATED_PAGE)
> + unlock_page(page);
> + page_cache_release(page);
> +
> + /*
> +  * prepare_write() may have instantiated a few blocks
> +  * outside i_size.  Trim these off again. Don't need
> +  * i_size_read because we hold i_mutex.
> +  */
> + if (pos + bytes > inode->i_size)
> + vmtruncate(inode, inode->i_size);
> + if (status == AOP_TRUNCATED_PAGE)
> + continue;
> + else
> + break;
> +
>   } while (count);
>   *ppos = pos;
>  
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC][PATCH] mm: search_binary_handler mem limit fix

2007-01-20 Thread Dmitriy Monakhov


Function change mem limit to USER_DS before possible modprobe, but never 
restore 
it again. Why does this happend, is it just forgotten? As i understand currently
this not cause actual problems, but any one may call access_ok() after 
search_binary_handler() and will be really surprised.

Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>

diff --git a/fs/exec.c b/fs/exec.c
index 11fe93f..c7e017b 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1020,6 +1020,7 @@ int search_binary_handler(struct linux_b
 {
int try,retval;
struct linux_binfmt *fmt;
+   mm_segment_t oldfs;
 #ifdef __alpha__
/* handle /sbin/loader.. */
{
@@ -1061,11 +1062,12 @@ int search_binary_handler(struct linux_b
 
/* kernel module loader fixup */
/* so we don't try to load run modprobe in kernel space. */
+   oldfs = get_fs();
set_fs(USER_DS);
 
retval = audit_bprm(bprm);
if (retval)
-   return retval;
+   goto out;
 
retval = -ENOENT;
for (try=0; try<2; try++) {
@@ -1086,7 +1088,7 @@ int search_binary_handler(struct linux_b
bprm->file = NULL;
current->did_exec = 1;
proc_exec_connector(current);
-   return retval;
+   goto out;
}
read_lock(&binfmt_lock);
put_binfmt(fmt);
@@ -1094,7 +1096,7 @@ int search_binary_handler(struct linux_b
break;
if (!bprm->file) {
read_unlock(&binfmt_lock);
-   return retval;
+   goto out;
}
}
read_unlock(&binfmt_lock);
@@ -1112,6 +1114,8 @@ int search_binary_handler(struct linux_b
 #endif
}
}
+out:
+   set_fs(oldfs);
return retval;
 }

[PATH][RFC] mm: Move common segments checks to separate function

2007-01-24 Thread Dmitriy Monakhov

Move common segments checks from  __generic_file_aio_{read,write}_nolock()
to separate helper function generic_segment_checks(). 

Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
---
diff --git a/mm/filemap.c b/mm/filemap.c
index 8332c77..d01abb6 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1126,6 +1126,38 @@ success:
return size;
 }
 
+/*
+ * Performs necessary segments checks.
+ * Returns bytes count or appropriate error code that caller should return.
+ */
+static ssize_t 
+generic_segment_checks(unsigned long nr_segs, const struct iovec *iov,
+   unsigned long access_flag)
+{
+   size_t bcount;
+   unsigned long   seg;
+   bcount = 0;
+   for (seg = 0; seg < nr_segs; seg++) {
+   const struct iovec *iv = &iov[seg];
+   
+   /*
+* If any segment has a negative length, or the cumulative
+* length ever wraps negative then return -EINVAL.
+*/
+   bcount += iv->iov_len;
+   if (unlikely((ssize_t)(bcount|iv->iov_len) < 0))
+   return -EINVAL;
+   if (access_ok(access_flag, iv->iov_base, iv->iov_len))
+   continue;
+   if (seg == 0)
+   return -EFAULT;
+   nr_segs = seg;
+   bcount -= iv->iov_len;  /* This segment is no good */
+   break;
+   }
+   return bcount;
+}
+
 /**
  * generic_file_aio_read - generic filesystem read routine
  * @iocb:  kernel I/O control block
@@ -1146,25 +1178,7 @@ generic_file_aio_read(struct kiocb *iocb
size_t count;
loff_t *ppos = &iocb->ki_pos;
 
-   count = 0;
-   for (seg = 0; seg < nr_segs; seg++) {
-   const struct iovec *iv = &iov[seg];
-
-   /*
-* If any segment has a negative length, or the cumulative
-* length ever wraps negative then return -EINVAL.
-*/
-   count += iv->iov_len;
-   if (unlikely((ssize_t)(count|iv->iov_len) < 0))
-   return -EINVAL;
-   if (access_ok(VERIFY_WRITE, iv->iov_base, iv->iov_len))
-   continue;
-   if (seg == 0)
-   return -EFAULT;
-   nr_segs = seg;
-   count -= iv->iov_len;   /* This segment is no good */
-   break;
-   }
+   count = generic_segment_checks(nr_segs, iov, VERIFY_WRITE);
 
/* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
if (filp->f_flags & O_DIRECT) {
@@ -2225,31 +2239,11 @@ __generic_file_aio_write_nolock(struct k
size_t ocount;  /* original count */
size_t count;   /* after file limit checks */
struct inode*inode = mapping->host;
-   unsigned long   seg;
loff_t  pos;
ssize_t written;
ssize_t err;
 
-   ocount = 0;
-   for (seg = 0; seg < nr_segs; seg++) {
-   const struct iovec *iv = &iov[seg];
-
-   /*
-* If any segment has a negative length, or the cumulative
-* length ever wraps negative then return -EINVAL.
-*/
-   ocount += iv->iov_len;
-   if (unlikely((ssize_t)(ocount|iv->iov_len) < 0))
-   return -EINVAL;
-   if (access_ok(VERIFY_READ, iv->iov_base, iv->iov_len))
-   continue;
-   if (seg == 0)
-   return -EFAULT;
-   nr_segs = seg;
-   ocount -= iv->iov_len;  /* This segment is no good */
-   break;
-   }
-
+   ocount = generic_segment_checks(nr_segs, iov, VERIFY_READ);
count = ocount;
pos = *ppos;

[PATCH][RFC] incorrect direct io error handling (v3)

2007-01-24 Thread Dmitriy Monakhov

incorrect direct io error handling (v3)
Changes from v2:
 - Remove BUG_ON(!mutex_is_locked(..)) for non blkdev.
 - vmtruncate() called from generic_file_aio_write().
 - depends on patch titled:
   [PATH][RFC] mm: Move common segments checks to separate function

LOG:
If generic_file_direct_write() has fail (ENOSPC condition) inside 
__generic_file_aio_write_nolock() it may have instantiated
a few blocks outside i_size. And fsck will complain about wrong i_size
(ext2, ext3 and reiserfs interpret i_size and biggest block difference as 
error),
after fsck will fix error i_size will be increased to the biggest block,
but this blocks contain gurbage from previous write attempt, this is not 
information leak, but its silence file data corruption. This issue affect 
fs regardless the values of blocksize or pagesize.
We need truncate any block beyond i_size after write have failed , do in simular
generic_file_buffered_write() error path. Initialy i've proposed do it in
__generic_file_aio_write_nolock() with explicit guarantee i_mutex always held,
but not everybody was agree with it. So we may safely call vmtruncate() inside 
generic_file_aio_write(), here i_mutex already locked.
 
TEST_CASE:
open("/mnt/test/BIG_FILE", O_WRONLY|O_CREAT|O_DIRECT, 0666) = 3
write(3, "aaa"..., 104857600) = -1 ENOSPC (No space left on device)

#stat /mnt/test/BIG_FILE
  File: `/mnt/test/BIG_FILE'
  Size: 0   Blocks: 110896 IO Block: 1024   regular empty file
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<file size is less than biggest block idx

Device: fe07h/65031dInode: 14  Links: 1
Access: (0644/-rw-r--r--)  Uid: (0/root)   Gid: (0/root)
Access: 2007-01-24 20:03:38.0 +0300
Modify: 2007-01-24 20:03:38.0 +0300
Change: 2007-01-24 20:03:39.0 +0300

#fsck.ext3 -f /dev/VG/test 
e2fsck 1.39 (29-May-2006)
Pass 1: Checking inodes, blocks, and sizes
Inode 14, i_size is 0, should be 56556544.  Fix? yes
Pass 2: Checking directory structure

Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
---
diff --git a/mm/filemap.c b/mm/filemap.c
index d01abb6..96840e5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2058,8 +2058,9 @@ generic_file_direct_write(struct kiocb *
/*
 * Sync the fs metadata but not the minor inode changes and
 * of course not the data as we did direct DMA for the IO.
-* i_mutex is held, which protects generic_osync_inode() from
-* livelocking.  AIO O_DIRECT ops attempt to sync metadata here.
+* i_mutex may not being held, if so some specific locking
+* ordering must protect generic_osync_inode() from livelocking.
+* AIO O_DIRECT ops attempt to sync metadata here.
 */
if ((written >= 0 || written == -EIOCBQUEUED) &&
((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
@@ -2365,6 +2366,17 @@ ssize_t generic_file_aio_write(struct ki
&iocb->ki_pos);
mutex_unlock(&inode->i_mutex);
 
+   if (unlikely(ret < 0 && (file->f_flags & O_DIRECT))) {
+   ssize_t cnt = generic_segment_checks(nr_segs, iov, VERIFY_READ);
+   loff_t isize = i_size_read(inode);
+   /*
+* generic_file_direct_write() may have instantiated a few 
+* blocks outside i_size.  Trim these off again.
+*/
+   if (cnt > 0 && (pos + cnt > isize))
+   vmtruncate(inode, isize);
+   }
+
if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
ssize_t err;
 
@@ -2377,8 +2389,8 @@ ssize_t generic_file_aio_write(struct ki
 EXPORT_SYMBOL(generic_file_aio_write);
 
 /*
- * Called under i_mutex for writes to S_ISREG files.   Returns -EIO if 
something
- * went wrong during pagecache shootdown.
+ * May be called without i_mutex for writes to S_ISREG files.
+ * Returns -EIO if something went wrong during pagecache shootdown.
  */
 static ssize_t
 generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,

Re: [PATH][RFC] mm: Move common segments checks to separate function

2007-01-24 Thread Dmitriy Monakhov

Dmitriy Monakhov <[EMAIL PROTECTED]> writes:

> Move common segments checks from  __generic_file_aio_{read,write}_nolock()
> to separate helper function generic_segment_checks(). 

WOW my eyes 
I'm realy sorry but this patch is litle bit broken. :(
I've forgot to check generic_segment_checks() return value.
So folowing is the right version:

LOG:
Move common segments checks from  __generic_file_aio_{read,write}_nolock()
to separate helper function generic_segment_checks(). 

Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
---
--- mm/filemap.c.orig   2007-01-25 09:33:03.0 +0300
+++ mm/filemap.c2007-01-25 09:34:25.0 +0300
@@ -1126,6 +1126,38 @@ success:
return size;
 }
 
+/*
+ * Performs necessary segments checks.
+ * Returns bytes count or appropriate error code that caller should return.
+ */
+static ssize_t 
+generic_segment_checks(unsigned long nr_segs, const struct iovec *iov,
+   unsigned long access_flag)
+{
+   size_t bcount;
+   unsigned long   seg;
+   bcount = 0;
+   for (seg = 0; seg < nr_segs; seg++) {
+   const struct iovec *iv = &iov[seg];
+   
+   /*
+* If any segment has a negative length, or the cumulative
+* length ever wraps negative then return -EINVAL.
+*/
+   bcount += iv->iov_len;
+   if (unlikely((ssize_t)(bcount|iv->iov_len) < 0))
+   return -EINVAL;
+   if (access_ok(access_flag, iv->iov_base, iv->iov_len))
+   continue;
+   if (seg == 0)
+   return -EFAULT;
+   nr_segs = seg;
+   bcount -= iv->iov_len;  /* This segment is no good */
+   break;
+   }
+   return bcount;
+}
+
 /**
  * generic_file_aio_read - generic filesystem read routine
  * @iocb:  kernel I/O control block
@@ -1146,25 +1178,9 @@ generic_file_aio_read(struct kiocb *iocb
size_t count;
loff_t *ppos = &iocb->ki_pos;
 
-   count = 0;
-   for (seg = 0; seg < nr_segs; seg++) {
-   const struct iovec *iv = &iov[seg];
-
-   /*
-* If any segment has a negative length, or the cumulative
-* length ever wraps negative then return -EINVAL.
-*/
-   count += iv->iov_len;
-   if (unlikely((ssize_t)(count|iv->iov_len) < 0))
-   return -EINVAL;
-   if (access_ok(VERIFY_WRITE, iv->iov_base, iv->iov_len))
-   continue;
-   if (seg == 0)
-   return -EFAULT;
-   nr_segs = seg;
-   count -= iv->iov_len;   /* This segment is no good */
-   break;
-   }
+   count = generic_segment_checks(nr_segs, iov, VERIFY_WRITE);
+   if (count < 0)
+   return count;
 
/* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
if (filp->f_flags & O_DIRECT) {
@@ -2225,31 +2241,13 @@ __generic_file_aio_write_nolock(struct k
size_t ocount;  /* original count */
size_t count;   /* after file limit checks */
struct inode*inode = mapping->host;
-   unsigned long   seg;
loff_t  pos;
ssize_t written;
ssize_t err;
 
-   ocount = 0;
-   for (seg = 0; seg < nr_segs; seg++) {
-   const struct iovec *iv = &iov[seg];
-
-   /*
-* If any segment has a negative length, or the cumulative
-* length ever wraps negative then return -EINVAL.
-*/
-   ocount += iv->iov_len;
-   if (unlikely((ssize_t)(ocount|iv->iov_len) < 0))
-   return -EINVAL;
-   if (access_ok(VERIFY_READ, iv->iov_base, iv->iov_len))
-   continue;
-   if (seg == 0)
-   return -EFAULT;
-   nr_segs = seg;
-   ocount -= iv->iov_len;  /* This segment is no good */
-   break;
-   }
-
+   ocount = generic_segment_checks(nr_segs, iov, VERIFY_READ);
+   if (ocount < 0)
+   return ocount;
count = ocount;
pos = *ppos;

[PATCH] ecryptfs lower_file largefile issue

2007-02-21 Thread Dmitriy Monakhov

Where is largefile issue in ecryptfs.
Even if we want open file on ia32 with explicit O_LARGEFILE, lower_file will
be opened without O_LARGEFILE flag  this result in various errors in dmesg
and data corruption.
Testcase:
   write 'a' chars beyond 2Gb
  # strace ./writer_test root/file
  open("root/file", O_WRONLY|O_CREAT|O_LARGEFILE, 0666) = 3
  lseek(3, 2147483640, SEEK_SET)  = 2147483640
  write(3, ""..., 10240) = 10240
  ##check content
  # hexdump root/file4 
  000        
  *
  7ff0     6161 6161 6161 6161
  8000 6161 6161 6161 6161 6161 6161 6161 6161
  *
  800027f0 6161 6161 6161 6161
  800027f8
  
  #umount root 
  # mount -tecryptfs pr/dir/ root/  -ocipher=aes
  ## check content after remount
  # hexdump root/file4 
  000        
  *
  7ff0     6161 6161 6161 6161
  8000 6161 6161 6161 6161 6161 6161 6161 6161
  *
  800027f0 6161 6161 6161 6161    
  80002800        
  *
  80003000 ee3d e793 c410 3bd4 c642 5774 ad71 7932
  80003010 17a6 443c 4621 9e4a cf15 babd 19d0 f5e7
  *
  80004000 ce32 49ed 217c f87d 5539 3e96 517c 961c
  80004010 17a6 443c 4621 9e4a cf15 babd 19d0 f5e7
  *
  80005000  
  ### Wow content was changed, kernel complain with folowing messages:
  Error opening lower file for lower_dentry [0xf73a111c], lower_mnt 
[0xf7d8b1c0], and flags [0x0]
  Error opening lower_file to read header region
  Error attempting to read the [user.ecryptfs] xattr from the lower file; 
return value = [4294967201]
  Valid metadata not found in header region or xattr region; treating file as 
unencrypted
  
Lets explicitly add O_LARGEFILE to opened lower file flags as it done in unionfs
and nfsd. Also remove unnecessery #define from ecryptfs_initialize_file().
Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>

diff --git a/fs/ecryptfs/file.c b/fs/ecryptfs/file.c
index bd969ad..7a7d25d 100644
--- a/fs/ecryptfs/file.c
+++ b/fs/ecryptfs/file.c
@@ -205,6 +205,7 @@ int ecryptfs_open_lower_file(struct file **lower_file,
 {
int rc = 0;
 
+   flags |= O_LARGEFILE;
dget(lower_dentry);
mntget(lower_mnt);
*lower_file = dentry_open(lower_dentry, lower_mnt, flags);
diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index 27fd14a..a1fe8b4 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -200,9 +200,6 @@ static int ecryptfs_initialize_file(struct dentry 
*ecryptfs_dentry)
inode = ecryptfs_dentry->d_inode;
crypt_stat = &ecryptfs_inode_to_private(inode)->crypt_stat;
lower_flags = ((O_CREAT | O_TRUNC) & O_ACCMODE) | O_RDWR;
-#if BITS_PER_LONG != 32
-   lower_flags |= O_LARGEFILE;
-#endif
lower_mnt = ecryptfs_dentry_to_lower_mnt(ecryptfs_dentry);
/* Corresponding fput() at end of this function */
if ((rc = ecryptfs_open_lower_file(&lower_file, lower_dentry, lower_mnt,

Re: [Ecryptfs-devel] [PATCH] ecryptfs lower_file largefile issue

2007-02-21 Thread Dmitriy Monakhov

Michael Halcrow <[EMAIL PROTECTED]> writes:

> On Wed, Feb 21, 2007 at 01:07:22PM +0300, Dmitriy Monakhov wrote:
>> Where is largefile issue in ecryptfs.
>
> Thanks for your thorough work on resolving such issues. We will
> integrate your patches and testcases into the next release as soon as
> we get a chance to review and test.
BTW previously i've sent some patches for ecryptfs to lkml but forgot add
[EMAIL PROTECTED] to CC:.
links to patches:
"ecryptfs lower_file handling issues" http://lkml.org/lkml/2007/2/19/147
"ecryptfs_read_super path_lookup" errh fix http://lkml.org/lkml/2007/2/19/171

>
> Mike
> .___.
>  Michael A. Halcrow  
>Security Software Engineer, IBM Linux Technology Center   
> GnuPG Fingerprint: 419C 5B1E 948A FA73 A54C  20F5 DB40 8531 6DCA 8769
>
> "To prohibit sharing software is to cut the bonds of society."   
>  - Richard Stallman

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] ecryptfs remove unnecessary flush_dcache_page

2007-02-21 Thread Dmitriy Monakhov

Dmitriy Monakhov <[EMAIL PROTECTED]> writes:

> 1)Function ecryptfs_do_readpage() calls flush_dcache_page(lower_page),
>   but lower_page was't changed here. So remove this line.
>
> 2)prepare_write ret val was ignored in ecryptfs_write_inode_size_to_header().
> If error happends we can't call commit_write, just do cleanup and fial.
> It is issue easy to reproduce with full lower_fs, in this case prepare_write()
Second issue was fixed by "ecryptfs-resolve-lower-page-unlocking-problem.patch",
but first issue was't.

[LOG]
  Function ecryptfs_do_readpage() calls flush_dcache_page(lower_page),
  but lower_page was't changed here. Even if it was changed by 
  lower_a_ops->readpage() dcache was flushed by readpage() itself.
  So remove this unnecessary line.
Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>

diff --git a/fs/ecryptfs/mmap.c b/fs/ecryptfs/mmap.c
index 1e5d2ba..2e45513 100644
--- a/fs/ecryptfs/mmap.c
+++ b/fs/ecryptfs/mmap.c
@@ -238,7 +238,6 @@ int ecryptfs_do_readpage(struct file *file, struct page 
*page,
lower_page_data = kmap_atomic(lower_page, KM_USER1);
memcpy(page_data, lower_page_data, PAGE_CACHE_SIZE);
kunmap_atomic(lower_page_data, KM_USER1);
-   flush_dcache_page(lower_page);
kunmap_atomic(page_data, KM_USER0);
flush_dcache_page(page);
rc = 0;

[PATCH] MPT fusion: handle mpt_resume() failure while resuming

2007-02-23 Thread Dmitriy Monakhov

Since mpt-fusion-handle-pci-layer-error-on-resume.patch
function mpt_resume() can return error code.

Signed-off-by: Monakhov Dmitriy <[EMAIL PROTECTED]>
---
 drivers/message/fusion/mptscsih.c |5 -
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/drivers/message/fusion/mptscsih.c 
b/drivers/message/fusion/mptscsih.c
index decadbe..96be1d6 100644
--- a/drivers/message/fusion/mptscsih.c
+++ b/drivers/message/fusion/mptscsih.c
@@ -1191,8 +1191,11 @@ mptscsih_resume(struct pci_dev *pdev)
MPT_ADAPTER *ioc = pci_get_drvdata(pdev);
struct Scsi_Host*host = ioc->sh;
MPT_SCSI_HOST   *hd;
+   int ret;
 
-   mpt_resume(pdev);
+   ret = mpt_resume(pdev);
+   if (ret)
+   return  ret;
 
if(!host)
return 0;
-- 
1.4.4.4

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] libata: handle ata_pci_device_do_resume() failure while resuming

2007-02-23 Thread Dmitriy Monakhov

Since commit:553c4aa630af7bc885e056d0436e4eb7f238579b
ata_pci_device_do_resume() can return error code, all callers was updated
except this one.

Signed-off-by: Monakhov Dmitriy <[EMAIL PROTECTED]>
---
 drivers/ata/sata_inic162x.c |5 -
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/drivers/ata/sata_inic162x.c b/drivers/ata/sata_inic162x.c
index 31b636f..7933043 100644
--- a/drivers/ata/sata_inic162x.c
+++ b/drivers/ata/sata_inic162x.c
@@ -639,7 +639,10 @@ static int inic_pci_device_resume(struct pci_dev *pdev)
void __iomem *mmio_base = host->iomap[MMIO_BAR];
int rc;
 
-   ata_pci_device_do_resume(pdev);
+   rc = ata_pci_device_do_resume(pdev);
+   if (rc) {
+   return rc;
+   }
 
if (pdev->dev.power.power_state.event == PM_EVENT_SUSPEND) {
printk("XXX\n");
-- 
1.4.4.4


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] ecryptfs: handles AOP_TRUNCATED_PAGE better

2007-02-23 Thread Dmitriy Monakhov

- In fact we don't have to fail if AOP_TRUNCATED_PAGE was returned from 
  prepare_write or commit_write. It is beter to retry attempt where it
  is possible.

- Rearange ecryptfs_get_lower_page() error handling logic, make it more clean.

Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
---
 fs/ecryptfs/mmap.c |   28 
 1 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/fs/ecryptfs/mmap.c b/fs/ecryptfs/mmap.c
index 842497d..b8fc1ef 100644
--- a/fs/ecryptfs/mmap.c
+++ b/fs/ecryptfs/mmap.c
@@ -447,6 +447,7 @@ static int ecryptfs_write_inode_size_to_header(struct file 
*lower_file,
const struct address_space_operations *lower_a_ops;
u64 file_size;
 
+retry:
header_page = grab_cache_page(lower_inode->i_mapping, 0);
if (!header_page) {
ecryptfs_printk(KERN_ERR, "grab_cache_page for "
@@ -457,9 +458,10 @@ static int ecryptfs_write_inode_size_to_header(struct file 
*lower_file,
lower_a_ops = lower_inode->i_mapping->a_ops;
rc = lower_a_ops->prepare_write(lower_file, header_page, 0, 8);
if (rc) {
-   if (rc == AOP_TRUNCATED_PAGE)
+   if (rc == AOP_TRUNCATED_PAGE) {
ecryptfs_release_lower_page(header_page, 0);
-   else
+   goto retry;
+   } else
ecryptfs_release_lower_page(header_page, 1);
goto out;
}
@@ -474,9 +476,10 @@ static int ecryptfs_write_inode_size_to_header(struct file 
*lower_file,
if (rc < 0)
ecryptfs_printk(KERN_ERR, "Error commiting header page "
"write\n");
-   if (rc == AOP_TRUNCATED_PAGE)
+   if (rc == AOP_TRUNCATED_PAGE) {
ecryptfs_release_lower_page(header_page, 0);
-   else
+   goto retry;
+   } else
ecryptfs_release_lower_page(header_page, 1);
lower_inode->i_mtime = lower_inode->i_ctime = CURRENT_TIME;
mark_inode_dirty_sync(inode);
@@ -565,6 +568,7 @@ int ecryptfs_get_lower_page(struct page **lower_page, 
struct inode *lower_inode,
 {
int rc = 0;
 
+retry:
*lower_page = grab_cache_page(lower_inode->i_mapping, lower_page_index);
if (!(*lower_page)) {
rc = -EINVAL;
@@ -578,18 +582,18 @@ int ecryptfs_get_lower_page(struct page **lower_page, 
struct inode *lower_inode,
  byte_offset,
  region_bytes);
if (rc) {
-   ecryptfs_printk(KERN_ERR, "prepare_write for "
+   if (rc == AOP_TRUNCATED_PAGE) {
+   ecryptfs_release_lower_page(*lower_page, 0);
+   goto retry;
+   } else {
+   ecryptfs_printk(KERN_ERR, "prepare_write for "
"lower_page_index = [0x%.16x] failed; rc = "
"[%d]\n", lower_page_index, rc);
-   }
-out:
-   if (rc && (*lower_page)) {
-   if (rc == AOP_TRUNCATED_PAGE)
-   ecryptfs_release_lower_page(*lower_page, 0);
-   else
ecryptfs_release_lower_page(*lower_page, 1);
-   (*lower_page) = NULL;
+   (*lower_page) = NULL;
+   }
}
+out:
return rc;
 }
 
-- 
1.4.4.4

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] 3c59x: Handle pci_enable_device() failure while resuming

2007-02-23 Thread Dmitriy Monakhov

Handle pci_enable_device() failure while resuming, we can safely exit here.
Signed-off-by: Monakhov Dmitriy <[EMAIL PROTECTED]>
---
 drivers/net/3c59x.c |8 +++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/drivers/net/3c59x.c b/drivers/net/3c59x.c
index 2b750bd..ea4a78f 100644
--- a/drivers/net/3c59x.c
+++ b/drivers/net/3c59x.c
@@ -821,11 +821,17 @@ static int vortex_resume(struct pci_dev *pdev)
 {
struct net_device *dev = pci_get_drvdata(pdev);
struct vortex_private *vp = netdev_priv(dev);
+   int err;
 
if (dev && vp) {
pci_set_power_state(pdev, PCI_D0);
pci_restore_state(pdev);
-   pci_enable_device(pdev);
+   err = pci_enable_device(pdev);
+   if (err) {
+   printk(KERN_WARNING "%s: Could not enable device \n",
+   dev->name);
+   return err;
+   }
pci_set_master(pdev);
if (request_irq(dev->irq, vp->full_bus_master_rx ?
&boomerang_interrupt : &vortex_interrupt, 
IRQF_SHARED, dev->name, dev)) {
-- 
1.4.4.4


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] sk98lin: handle pci_enable_device() return value in skge_resume()

2007-02-23 Thread Dmitriy Monakhov

This thread looks dead but issue was't fixed.

Jiri Kosina <[EMAIL PROTECTED]> writes:
>> > -  pci_enable_device(pdev);
>> > +  ret = pci_enable_device(pdev);
>> > +  if (ret) {
>> > +  printk(KERN_ERR "sk98lin: Cannot enable PCI device %s during 
>> > resume\n", 
>> > +  dev->name);
>> > +  unregister_netdev(dev);
>> This looks rather wrong - skge_exit() will run unregister_netdev() again.
>
> You are of course right (the problem was also spotted by Russell King). 
> This I believe is the correct one for the sk98lin case.
>
> [PATCH] fix sk98lin driver, ignoring return value from pci_enable_device()
>
> add check of return value to _resume() function of sk98lin driver.
>
> Signed-off-by: Jiri Kosina <[EMAIL PROTECTED]>
>
> ---
>
> --- a/drivers/net/sk98lin/skge.c
> +++ b/drivers/net/sk98lin/skge.c
> @@ -5070,7 +5070,12 @@ static int skge_resume(struct pci_dev *p
>  
>   pci_set_power_state(pdev, PCI_D0);
>   pci_restore_state(pdev);
> - pci_enable_device(pdev);
> + ret = pci_enable_device(pdev);
> + if (ret) {
> + printk(KERN_WARNING "sk98lin: unable to enable device %s in 
> resume\n",
> + dev->name);
> + goto out_err;
> + }   
[snip]
> +out_err:
> + pAC->AllocFlag &= ~SK_ALLOC_IRQ;
> + dev->irq = 0;
> + pci_disable_device(pdev);
< Ok what happend if we jump here right after pci_disable_device() has 
  failed, but pci_disable_device() was called anyway, this is wrong and 
  may be fatal because pdev->enable_cnt may becomes negative.  
> +
> + return ret;
> +
>  }
>  #else
>  #define skge_suspend NULL
This is reworked Jiri's patch:

[PATCH] sk98lin: handle pci_enable_device() return value in skge_resume()

Signed-off-by: Monakhov Dmitriy <[EMAIL PROTECTED]>
---
 drivers/net/sk98lin/skge.c |   20 +++-
 1 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/drivers/net/sk98lin/skge.c b/drivers/net/sk98lin/skge.c
index e94ab25..eea753a 100644
--- a/drivers/net/sk98lin/skge.c
+++ b/drivers/net/sk98lin/skge.c
@@ -5125,7 +5125,12 @@ static int skge_resume(struct pci_dev *pdev)
 
pci_set_power_state(pdev, PCI_D0);
pci_restore_state(pdev);
-   pci_enable_device(pdev);
+   ret = pci_enable_device(pdev);
+   if (ret) {
+   printk(KERN_WARNING "sk98lin: unable to enable device %s "
+   "in resume\n", dev->name);
+   goto err_out;
+   }   
pci_set_master(pdev);
if (pAC->GIni.GIMacsFound == 2)
ret = request_irq(dev->irq, SkGeIsr, IRQF_SHARED, "sk98lin", 
dev);
@@ -5133,10 +5138,8 @@ static int skge_resume(struct pci_dev *pdev)
ret = request_irq(dev->irq, SkGeIsrOnePort, IRQF_SHARED, 
"sk98lin", dev);
if (ret) {
printk(KERN_WARNING "sk98lin: unable to acquire IRQ %d\n", 
dev->irq);
-   pAC->AllocFlag &= ~SK_ALLOC_IRQ;
-   dev->irq = 0;
-   pci_disable_device(pdev);
-   return -EBUSY;
+   ret = -EBUSY;
+   goto err_out_disable_pdev;
}
 
netif_device_attach(dev);
@@ -5153,6 +5156,13 @@ static int skge_resume(struct pci_dev *pdev)
}
 
return 0;
+
+err_out_disable_pdev:
+   pci_disable_device(pdev);
+err_out:
+   pAC->AllocFlag &= ~SK_ALLOC_IRQ;
+   dev->irq = 0;
+   return ret;
 }
 #else
 #define skge_suspend NULL
-- 
1.4.4.4



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] sk98lin: handle pci_enable_device() return value in skge_resume()

2007-02-24 Thread Dmitriy Monakhov

Dmitriy Monakhov <[EMAIL PROTECTED]> writes:

> This thread looks dead but issue was't fixed.
>
> Jiri Kosina <[EMAIL PROTECTED]> writes:
>>> > - pci_enable_device(pdev);
>>> > + ret = pci_enable_device(pdev);
>>> > + if (ret) {
>>> > + printk(KERN_ERR "sk98lin: Cannot enable PCI device %s during 
>>> > resume\n", 
>>> > + dev->name);
>>> > + unregister_netdev(dev);
>>> This looks rather wrong - skge_exit() will run unregister_netdev() again.
>>
>> You are of course right (the problem was also spotted by Russell King). 
>> This I believe is the correct one for the sk98lin case.
>>
>> [PATCH] fix sk98lin driver, ignoring return value from pci_enable_device()
>>
>> add check of return value to _resume() function of sk98lin driver.
>>
>> Signed-off-by: Jiri Kosina <[EMAIL PROTECTED]>
>>
>> ---
>>
>> --- a/drivers/net/sk98lin/skge.c
>> +++ b/drivers/net/sk98lin/skge.c
>> @@ -5070,7 +5070,12 @@ static int skge_resume(struct pci_dev *p
>>  
>>  pci_set_power_state(pdev, PCI_D0);
>>  pci_restore_state(pdev);
>> -pci_enable_device(pdev);
>> +ret = pci_enable_device(pdev);
>> +if (ret) {
>> +printk(KERN_WARNING "sk98lin: unable to enable device %s in 
>> resume\n",
>> +dev->name);
>> +goto out_err;
>> +}   
> [snip]
>> +out_err:
>> +pAC->AllocFlag &= ~SK_ALLOC_IRQ;
>> +dev->irq = 0;
>> +pci_disable_device(pdev);
> <<<<< Ok what happend if we jump here right after pci_disable_device() has 
 OOPs.Of course i mean pci_enable_device() here^
 I'm sorry about this. (Option complete_word not always good, some times 
 brain work required too :) )
 So correct comment looks like:
  <<<<< Ok what happend if we jump here right after pci_enable_device() has
failed, but pci_disable_device() was called anyway, this is wrong and 
may be fatal because pdev->enable_cnt may becomes negative.  
>> +
>> +return ret;
>> +
>>  }
>>  #else
>>  #define skge_suspend NULL
> This is reworked Jiri's patch:
>
> [PATCH] sk98lin: handle pci_enable_device() return value in skge_resume()
>
> Signed-off-by: Monakhov Dmitriy <[EMAIL PROTECTED]>
> ---
>  drivers/net/sk98lin/skge.c |   20 +++-
>  1 files changed, 15 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/net/sk98lin/skge.c b/drivers/net/sk98lin/skge.c
> index e94ab25..eea753a 100644
> --- a/drivers/net/sk98lin/skge.c
> +++ b/drivers/net/sk98lin/skge.c
> @@ -5125,7 +5125,12 @@ static int skge_resume(struct pci_dev *pdev)
>  
>   pci_set_power_state(pdev, PCI_D0);
>   pci_restore_state(pdev);
> - pci_enable_device(pdev);
> + ret = pci_enable_device(pdev);
> + if (ret) {
> + printk(KERN_WARNING "sk98lin: unable to enable device %s "
> + "in resume\n", dev->name);
> + goto err_out;
> + }   
>   pci_set_master(pdev);
>   if (pAC->GIni.GIMacsFound == 2)
>   ret = request_irq(dev->irq, SkGeIsr, IRQF_SHARED, "sk98lin", 
> dev);
> @@ -5133,10 +5138,8 @@ static int skge_resume(struct pci_dev *pdev)
>   ret = request_irq(dev->irq, SkGeIsrOnePort, IRQF_SHARED, 
> "sk98lin", dev);
>   if (ret) {
>   printk(KERN_WARNING "sk98lin: unable to acquire IRQ %d\n", 
> dev->irq);
> - pAC->AllocFlag &= ~SK_ALLOC_IRQ;
> - dev->irq = 0;
> - pci_disable_device(pdev);
> - return -EBUSY;
> + ret = -EBUSY;
> + goto err_out_disable_pdev;
>   }
>  
>   netif_device_attach(dev);
> @@ -5153,6 +5156,13 @@ static int skge_resume(struct pci_dev *pdev)
>   }
>  
>   return 0;
> +
> +err_out_disable_pdev:
> + pci_disable_device(pdev);
> +err_out:
> + pAC->AllocFlag &= ~SK_ALLOC_IRQ;
> + dev->irq = 0;
> + return ret;
>  }
>  #else
>  #define skge_suspend NULL
> -- 
> 1.4.4.4
>
>
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] libata: handle ata_pci_device_do_resume() failure while resuming (v2)

2007-02-25 Thread Dmitriy Monakhov

Randy Dunlap <[EMAIL PROTECTED]> writes:

> On Sat, 24 Feb 2007 00:43:18 +0300 Dmitriy Monakhov wrote:
>
>> Since commit:553c4aa630af7bc885e056d0436e4eb7f238579b
>> ata_pci_device_do_resume() can return error code, all callers was updated
>> except this one.
>> 
>> Signed-off-by: Monakhov Dmitriy <[EMAIL PROTECTED]>
>> ---
>>  drivers/ata/sata_inic162x.c |5 -
>>  1 files changed, 4 insertions(+), 1 deletions(-)
>> 
>> diff --git a/drivers/ata/sata_inic162x.c b/drivers/ata/sata_inic162x.c
>> index 31b636f..7933043 100644
>> --- a/drivers/ata/sata_inic162x.c
>> +++ b/drivers/ata/sata_inic162x.c
>> @@ -639,7 +639,10 @@ static int inic_pci_device_resume(struct pci_dev *pdev)
>>  void __iomem *mmio_base = host->iomap[MMIO_BAR];
>>  int rc;
>>  
>> -ata_pci_device_do_resume(pdev);
>> +rc = ata_pci_device_do_resume(pdev);
>> +if (rc) {
>> +return rc;
>
> Either (a) don't use the braces when they are not needed, or
> (b) is it possible to add a meaningful message there, or is that
> done elsewhere?
True. Thanks for correcting me here.
It is't realy necessary print err message because ata_pci_device_do_resume()
already done this. So just remove braces.


 Since commit:553c4aa630af7bc885e056d0436e4eb7f238579b
 ata_pci_device_do_resume() can return error code, all callers was updated
 except this one.

Signed-off-by: Monakhov Dmitriy <[EMAIL PROTECTED]>

diff --git a/drivers/ata/sata_inic162x.c b/drivers/ata/sata_inic162x.c
index 31b636f..18ac665 100644
--- a/drivers/ata/sata_inic162x.c
+++ b/drivers/ata/sata_inic162x.c
@@ -639,7 +639,9 @@ static int inic_pci_device_resume(struct pci_dev *pdev)
void __iomem *mmio_base = host->iomap[MMIO_BAR];
int rc;
 
-   ata_pci_device_do_resume(pdev);
+   rc = ata_pci_device_do_resume(pdev);
+   if (rc)
+   return rc;
 
if (pdev->dev.power.power_state.event == PM_EVENT_SUSPEND) {
printk("XXX\n");

Re: [PATCH 2/3] pcmcia: Handle request_irq() failure while opening device

2007-02-26 Thread Dmitriy Monakhov

Monakhov Dmitriy <[EMAIL PROTECTED]> writes:

> Signed-off-by: Monakhov Dmitriy <[EMAIL PROTECTED]>
> ---
>  drivers/net/pcmcia/axnet_cs.c |8 +++-
>  drivers/net/pcmcia/pcnet_cs.c |8 +++-
>  2 files changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/pcmcia/axnet_cs.c b/drivers/net/pcmcia/axnet_cs.c
> index 6139048..9b57bab 100644
> --- a/drivers/net/pcmcia/axnet_cs.c
> +++ b/drivers/net/pcmcia/axnet_cs.c
> @@ -523,6 +523,7 @@ static int axnet_open(struct net_device *dev)
>  {
>  axnet_dev_t *info = PRIV(dev);
>  struct pcmcia_device *link = info->p_dev;
> +int err = 0;
>  
>  DEBUG(2, "axnet_open('%s')\n", dev->name);
>  
> @@ -531,7 +532,12 @@ static int axnet_open(struct net_device *dev)
>  
>  link->open++;
>  
> -request_irq(dev->irq, ei_irq_wrapper, IRQF_SHARED, "axnet_cs", dev);
> +err = request_irq(dev->irq, ei_irq_wrapper, IRQF_SHARED, "axnet_cs", 
> dev);
> +if (err) {
> + printk("axnet_cs: request_irq() failed  %s\n", dev->name);
> + link->open--;
> + return -EBUSY;
As David Miller noted it is better to return exact error code from 
request_irq(),
truly most net_device->open() functions return error code without changing.  
The updated patch version following:

Signed-off-by: Monakhov Dmitriy <[EMAIL PROTECTED]>
---
 drivers/net/pcmcia/axnet_cs.c |8 +++-
 drivers/net/pcmcia/pcnet_cs.c |8 +++-
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/net/pcmcia/axnet_cs.c b/drivers/net/pcmcia/axnet_cs.c
index 6139048..9b57bab 100644
--- a/drivers/net/pcmcia/axnet_cs.c
+++ b/drivers/net/pcmcia/axnet_cs.c
@@ -523,6 +523,7 @@ static int axnet_open(struct net_device *dev)
 {
 axnet_dev_t *info = PRIV(dev);
 struct pcmcia_device *link = info->p_dev;
+int err = 0;
 
 DEBUG(2, "axnet_open('%s')\n", dev->name);
 
@@ -531,7 +532,12 @@ static int axnet_open(struct net_device *dev)
 
 link->open++;
 
-request_irq(dev->irq, ei_irq_wrapper, IRQF_SHARED, "axnet_cs", dev);
+err = request_irq(dev->irq, ei_irq_wrapper, IRQF_SHARED, "axnet_cs", dev);
+if (err) {
+   printk("axnet_cs: request_irq() failed  %s\n", dev->name);
+   link->open--;
+   return err;
+}
 
 info->link_status = 0x00;
 init_timer(&info->watchdog);
diff --git a/drivers/net/pcmcia/pcnet_cs.c b/drivers/net/pcmcia/pcnet_cs.c
index d88e9b2..13923ea 100644
--- a/drivers/net/pcmcia/pcnet_cs.c
+++ b/drivers/net/pcmcia/pcnet_cs.c
@@ -962,6 +962,7 @@ static int pcnet_open(struct net_device *dev)
 {
 pcnet_dev_t *info = PRIV(dev);
 struct pcmcia_device *link = info->p_dev;
+int err = 0;
 
 DEBUG(2, "pcnet_open('%s')\n", dev->name);
 
@@ -971,7 +972,12 @@ static int pcnet_open(struct net_device *dev)
 link->open++;
 
 set_misc_reg(dev);
-request_irq(dev->irq, ei_irq_wrapper, IRQF_SHARED, dev_info, dev);
+err = request_irq(dev->irq, ei_irq_wrapper, IRQF_SHARED, dev_info, dev);
+if (err) {
+   printk("pcnet_open: request_irq() failed  %s\n", dev->name);
+   link->open--;
+   return err;
+}
 
 info->phy_id = info->eth_phy;
 info->link_status = 0x00;
-- 
1.5.0.1

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] ecryptfs: lower root result must be adirectory

2007-02-27 Thread Dmitriy Monakhov

patch against lastest mm tree.

- Currently after path_lookup succeed we dot't have any guarantie what
  it is DIR. This must be explicitly demanded.
- path_lookup can't return negative dentry, So inode check is useless.

Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>

diff --git a/fs/ecryptfs/main.c b/fs/ecryptfs/main.c
index 812427e..fc4a3a2 100644
--- a/fs/ecryptfs/main.c
+++ b/fs/ecryptfs/main.c
@@ -484,18 +484,12 @@ static int ecryptfs_read_super(struct super_block *sb, 
const char *dev_name)
struct vfsmount *lower_mnt;
 
memset(&nd, 0, sizeof(struct nameidata));
-   rc = path_lookup(dev_name, LOOKUP_FOLLOW, &nd);
+   rc = path_lookup(dev_name, LOOKUP_FOLLOW | LOOKUP_DIRECTORY, &nd);
if (rc) {
ecryptfs_printk(KERN_WARNING, "path_lookup() failed\n");
goto out;
}
lower_root = nd.dentry;
-   if (!lower_root->d_inode) {
-   ecryptfs_printk(KERN_WARNING,
-   "No directory to interpose on\n");
-   rc = -ENOENT;
-   goto out_free;
-   }
lower_mnt = nd.mnt;
ecryptfs_set_superblock_lower(sb, lower_root->d_sb);
sb->s_maxbytes = lower_root->d_sb->s_maxbytes;

[PATCH] sk98lin: handle pci_enable_device() return value in skge_resume() [resend]

2007-02-27 Thread Dmitriy Monakhov


Signed-off-by: Monakhov Dmitriy <[EMAIL PROTECTED]>
---
 drivers/net/sk98lin/skge.c |   20 +++-
 1 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/drivers/net/sk98lin/skge.c b/drivers/net/sk98lin/skge.c
index e94ab25..eea753a 100644
--- a/drivers/net/sk98lin/skge.c
+++ b/drivers/net/sk98lin/skge.c
@@ -5125,7 +5125,12 @@ static int skge_resume(struct pci_dev *pdev)
 
pci_set_power_state(pdev, PCI_D0);
pci_restore_state(pdev);
-   pci_enable_device(pdev);
+   ret = pci_enable_device(pdev);
+   if (ret) {
+   printk(KERN_WARNING "sk98lin: unable to enable device %s "
+   "in resume\n", dev->name);
+   goto err_out;
+   }   
pci_set_master(pdev);
if (pAC->GIni.GIMacsFound == 2)
ret = request_irq(dev->irq, SkGeIsr, IRQF_SHARED, "sk98lin", 
dev);
@@ -5133,10 +5138,8 @@ static int skge_resume(struct pci_dev *pdev)
ret = request_irq(dev->irq, SkGeIsrOnePort, IRQF_SHARED, 
"sk98lin", dev);
if (ret) {
printk(KERN_WARNING "sk98lin: unable to acquire IRQ %d\n", 
dev->irq);
-   pAC->AllocFlag &= ~SK_ALLOC_IRQ;
-   dev->irq = 0;
-   pci_disable_device(pdev);
-   return -EBUSY;
+   ret = -EBUSY;
+   goto err_out_disable_pdev;
}
 
netif_device_attach(dev);
@@ -5153,6 +5156,13 @@ static int skge_resume(struct pci_dev *pdev)
}
 
return 0;
+
+err_out_disable_pdev:
+   pci_disable_device(pdev);
+err_out:
+   pAC->AllocFlag &= ~SK_ALLOC_IRQ;
+   dev->irq = 0;
+   return ret;
 }
 #else
 #define skge_suspend NULL
-- 
1.4.4.4

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] mm: be sure to trim blocks after direct_io has failed

2007-02-28 Thread Dmitriy Monakhov

This is updated version of patch aimed to fix direct_io error handling issue
i've previously sent 2 wheeks ago. If you don't like anything in this patch
plese let me know.

Changes:
  - comments added. I think now it is clearly describe things.
  - patch prepared against 2.6.20-mm2

How this patch tested:
  - LTP test, and other readv/writev op tests.
  - fsstress test.
  - manual direct_io tests.

Log:
 - Move common segment checks to separate helper function.
 - Trim off blocks after generic_file_direct_write() has failed.
 - Update out of date comments about direct_io locking rules.

Signed-off-by: Monakhov Dmitriy <[EMAIL PROTECTED]>
---
 mm/filemap.c |  107 +++--
 1 files changed, 66 insertions(+), 41 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index a9284c2..c81184c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1147,6 +1147,38 @@ success:
return size;
 }
 
+/*
+ * Performs necessary checks before doing a write
+ *
+ * Adjust number of segments and amount of bytes to write.
+ * Returns appropriate error code that caller should return or
+ * zero in case that write should be allowed.
+ */
+int generic_segment_checks(const struct iovec *iov, unsigned long *nr_segs,
+   size_t *count, unsigned long access_flags)
+{
+   unsigned long   seg;
+   for (seg = 0; seg < *nr_segs; seg++) {
+   const struct iovec *iv = &iov[seg];
+
+   /*
+* If any segment has a negative length, or the cumulative
+* length ever wraps negative then return -EINVAL.
+*/
+   *count += iv->iov_len;
+   if (unlikely((ssize_t)(*count|iv->iov_len) < 0))
+   return -EINVAL;
+   if (access_ok(access_flags, iv->iov_base, iv->iov_len))
+   continue;
+   if (seg == 0)
+   return -EFAULT;
+   *nr_segs = seg;
+   *count -= iv->iov_len;  /* This segment is no good */
+   break;
+   }
+   return 0;
+}
+
 /**
  * generic_file_aio_read - generic filesystem read routine
  * @iocb:  kernel I/O control block
@@ -1168,24 +1200,9 @@ generic_file_aio_read(struct kiocb *iocb, const struct 
iovec *iov,
loff_t *ppos = &iocb->ki_pos;
 
count = 0;
-   for (seg = 0; seg < nr_segs; seg++) {
-   const struct iovec *iv = &iov[seg];
-
-   /*
-* If any segment has a negative length, or the cumulative
-* length ever wraps negative then return -EINVAL.
-*/
-   count += iv->iov_len;
-   if (unlikely((ssize_t)(count|iv->iov_len) < 0))
-   return -EINVAL;
-   if (access_ok(VERIFY_WRITE, iv->iov_base, iv->iov_len))
-   continue;
-   if (seg == 0)
-   return -EFAULT;
-   nr_segs = seg;
-   count -= iv->iov_len;   /* This segment is no good */
-   break;
-   }
+   retval = generic_segment_checks(iov, &nr_segs, &count, VERIFY_WRITE);
+   if (retval)
+   return retval;
 
/* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
if (filp->f_flags & O_DIRECT) {
@@ -2080,8 +2097,9 @@ generic_file_direct_write(struct kiocb *iocb, const 
struct iovec *iov,
/*
 * Sync the fs metadata but not the minor inode changes and
 * of course not the data as we did direct DMA for the IO.
-* i_mutex is held, which protects generic_osync_inode() from
-* livelocking.  AIO O_DIRECT ops attempt to sync metadata here.
+* i_mutex may not being held, if so some specific locking
+* ordering must protect generic_osync_inode() from livelocking.
+* AIO O_DIRECT ops attempt to sync metadata here.
 */
if ((written >= 0 || written == -EIOCBQUEUED) &&
((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
@@ -2271,30 +2289,14 @@ __generic_file_aio_write_nolock(struct kiocb *iocb, 
const struct iovec *iov,
size_t ocount;  /* original count */
size_t count;   /* after file limit checks */
struct inode*inode = mapping->host;
-   unsigned long   seg;
loff_t  pos;
ssize_t written;
ssize_t err;
 
ocount = 0;
-   for (seg = 0; seg < nr_segs; seg++) {
-   const struct iovec *iv = &iov[seg];
-
-   /*
-* If any segment has a negative length, or the cumulative
-* length ever wraps negative then return -EINVAL.
-*/
-   ocount += iv->iov_len;
-   if (unlikely((ssize_t)(ocount|iv->iov_len) < 0))
-   return -EINVAL;
-   if (access_ok(VERIFY_READ, iv->iov_base, iv->iov_len))
-   continue;
-

[PATCH] ecryptfs: check xattr operation support fix

2007-02-28 Thread Dmitriy Monakhov


  - ecryptfs_write_inode_size_to_metadata() error code was ignored.
  - i_op->setxattr() must be supported by lower fs because used below.

Signed-off-by: Monakhov Dmitriy <[EMAIL PROTECTED]>
---
 fs/ecryptfs/inode.c |6 +++---
 fs/ecryptfs/mmap.c  |3 ++-
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index 27fd14a..9ccefad 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -168,9 +168,9 @@ static int grow_file(struct dentry *ecryptfs_dentry, struct 
file *lower_file,
goto out;
}
i_size_write(inode, 0);
-   ecryptfs_write_inode_size_to_metadata(lower_file, lower_inode, inode,
- ecryptfs_dentry,
- ECRYPTFS_LOWER_I_MUTEX_NOT_HELD);
+   rc = ecryptfs_write_inode_size_to_metadata(lower_file, lower_inode,
+   inode, ecryptfs_dentry,
+   ECRYPTFS_LOWER_I_MUTEX_NOT_HELD);
ecryptfs_inode_to_private(inode)->crypt_stat.flags |= ECRYPTFS_NEW_FILE;
 out:
return rc;
diff --git a/fs/ecryptfs/mmap.c b/fs/ecryptfs/mmap.c
index 1e5d2ba..416985f 100644
--- a/fs/ecryptfs/mmap.c
+++ b/fs/ecryptfs/mmap.c
@@ -491,7 +491,8 @@ static int ecryptfs_write_inode_size_to_xattr(struct inode 
*lower_inode,
goto out;
}
lower_dentry = ecryptfs_dentry_to_lower(ecryptfs_dentry);
-   if (!lower_dentry->d_inode->i_op->getxattr) {
+   if (!lower_dentry->d_inode->i_op->getxattr ||
+   !lower_dentry->d_inode->i_op->setxattr) {
printk(KERN_WARNING
   "No support for setting xattr in lower filesystem\n");
rc = -ENOSYS;
-- 
1.5.0.1


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH][RFC] ext3: Handle ext[34]_journal_stop() failure

2007-02-28 Thread Dmitriy Monakhov


Where are many places where _journal_stop() return code wasn't
checked. Off cause _journal_stop() failed very rarely (and usually
with fatal consequences), but this does'n meen it should not be checked.
For example most retry loops looks like follows:
ext3_journal_stop(handle);
if (err == -ENOSPC && ext3_should_retry_alloc(dir->i_sb, &retries))
goto retry;
It is realy insane do "retry" if ext3_journal_stop() has failed, because
this may increase damage in case of real problem.

Signed-off-by: Monakhov Dmitriy <[EMAIL PROTECTED]>
---
 fs/ext3/acl.c   |   12 
 fs/ext3/inode.c |   22 +-
 fs/ext3/ioctl.c |   12 
 fs/ext3/namei.c |   48 
 fs/ext3/xattr.c |4 ++--
 fs/ext4/acl.c   |   12 
 fs/ext4/inode.c |   22 +-
 fs/ext4/ioctl.c |   12 
 fs/ext4/namei.c |   48 
 fs/ext4/xattr.c |4 ++--
 10 files changed, 134 insertions(+), 62 deletions(-)

diff --git a/fs/ext3/acl.c b/fs/ext3/acl.c
index 1e5038d..bbf4d8a 100644
--- a/fs/ext3/acl.c
+++ b/fs/ext3/acl.c
@@ -375,7 +375,7 @@ int
 ext3_acl_chmod(struct inode *inode)
 {
struct posix_acl *acl, *clone;
-int error;
+   int error, error2;
 
if (S_ISLNK(inode->i_mode))
return -EOPNOTSUPP;
@@ -402,7 +402,9 @@ ext3_acl_chmod(struct inode *inode)
goto out;
}
error = ext3_set_acl(handle, inode, ACL_TYPE_ACCESS, clone);
-   ext3_journal_stop(handle);
+   error2 = ext3_journal_stop(handle);
+   if (error2 && (!error || error == -ENOSPC))
+   error = error2;
if (error == -ENOSPC &&
ext3_should_retry_alloc(inode->i_sb, &retries))
goto retry;
@@ -485,7 +487,7 @@ ext3_xattr_set_acl(struct inode *inode, int type, const 
void *value,
 {
handle_t *handle;
struct posix_acl *acl;
-   int error, retries = 0;
+   int error, error2, retries = 0;
 
if (!test_opt(inode->i_sb, POSIX_ACL))
return -EOPNOTSUPP;
@@ -509,7 +511,9 @@ retry:
if (IS_ERR(handle))
return PTR_ERR(handle);
error = ext3_set_acl(handle, inode, type, acl);
-   ext3_journal_stop(handle);
+   error2 = ext3_journal_stop(handle);
+   if (error2 && (!error || error == -ENOSPC))
+   error = error2;
if (error == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, &retries))
goto retry;
 
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 816f0c5..68d1c71 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -962,7 +962,9 @@ static int ext3_get_block(struct inode *inode, sector_t 
iblock,
 * Huge direct-io writes can hold off commits for long
 * periods of time.  Let this commit run.
 */
-   ext3_journal_stop(handle);
+   ret = ext3_journal_stop(handle);
+   if (ret)
+   goto get_block;
handle = ext3_journal_start(inode, DIO_CREDITS);
if (IS_ERR(handle))
ret = PTR_ERR(handle);
@@ -2998,7 +3000,13 @@ int ext3_setattr(struct dentry *dentry, struct iattr 
*attr)
if (attr->ia_valid & ATTR_GID)
inode->i_gid = attr->ia_gid;
error = ext3_mark_inode_dirty(handle, inode);
-   ext3_journal_stop(handle);
+   if (error) {
+   ext3_journal_stop(handle);
+   goto err_out;
+   }
+   error = ext3_journal_stop(handle);
+   if (error)
+   goto err_out;
}
 
if (S_ISREG(inode->i_mode) &&
@@ -3016,7 +3024,9 @@ int ext3_setattr(struct dentry *dentry, struct iattr 
*attr)
rc = ext3_mark_inode_dirty(handle, inode);
if (!error)
error = rc;
-   ext3_journal_stop(handle);
+   rc = ext3_journal_stop(handle);
+   if (!error)
+   error = rc;
}
 
rc = inode_setattr(inode, attr);
@@ -3231,7 +3241,7 @@ int ext3_change_inode_journal_flag(struct inode *inode, 
int val)
 {
journal_t *journal;
handle_t *handle;
-   int err;
+   int err, err2;
 
/*
 * We have to be very careful here: changing a data block's
@@ -3274,7 +3284,9 @@ int ext3_change_inode_journal_flag(struct inode *inode, 
int val)
 
err = ext3_mark_inode_dirty(handle, inode);
handle->h_sync = 1;
-   ext3_journal_stop(handle);
+   err2 = ext3_journal_stop(handle);
+   if (!err)
+   err = err2;
ext3_std_error(inode->i_sb, err);
 
return err;
diff --git a/fs/ext3/ioctl.c b/fs/ext3

[PATCH 0/1][RFC] mm: prepare_write positive return value

2007-02-06 Thread Dmitriy Monakhov

Almost all read/write operation handles data with chunks(segments or pages)
and result has integral behaviour for folowing scenario: 
for_each_chunk() {
 res = op();
 if(IS_ERROR(res))
   return progress ? progress : res;
 progress += res;
}
prepare_write may has integral behaviour in case of blksize < pgsize,
for example we successfully allocated/read some blocks, but not all of them,
and than some error happend. Currently we eliminate this progress by doing
vmtrunate() after prepare_has failed.
It is good to have ability to signal about this progress. Interprete positive
prepare_write() ret code as bytes num that fs ready to handle at this moment.
I've ask SAW, he think it is sane. This size always less than PAGE_CACHE_SIZE
so it less than AOP_TRUNCATED_PAGE too.
 
BTH: This approach was used in OpenVZ 2.6.9 kernel in order to make FS with 
delayed allocation more correct, and its works well.
I think not everybody will happy about this,  but let's discuss all advantages
and disadvantages of this change.

Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
-
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 62632f5..b4f6eac 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -239,7 +239,11 @@ static int do_lo_send_aops(struct loop_device *lo, struct 
bio_vec *bvec,
page_cache_release(page);
continue;
}
-   goto unlock;
+   if (ret > 0 && ret < PAGE_CACHE_SIZE)
+   /* prepare_write demands limit size of bytes. */
+   size = min(size, (unsigned)ret);
+   else
+   goto unlock;
}
transfer_result = lo_do_transfer(lo, WRITE, page, offset,
bvec->bv_page, bv_offs, size, IV);
diff --git a/fs/ecryptfs/mmap.c b/fs/ecryptfs/mmap.c
index 1e5d2ba..b5eebe8 100644
--- a/fs/ecryptfs/mmap.c
+++ b/fs/ecryptfs/mmap.c
@@ -454,6 +454,17 @@ static int ecryptfs_write_inode_size_to_header(struct file 
*lower_file,
}
lower_a_ops = lower_inode->i_mapping->a_ops;
rc = lower_a_ops->prepare_write(lower_file, header_page, 0, 8);
+   if (unlikely(rc > 0 && rc < PAGE_CACHE_SIZE)) {
+   /* 
+* prepare_write can handle less bytes whan we need. This is not likely
+* to happend realy because usualy we need only one block. In order to
+* preserve prepare/commit balanced invoke commit end fail.
+*/
+   int ret;
+   ret = lower_a_ops->commit_write(lower_file, header_page, 0, rc);
+   rc = ret ? ret : -ENOSPC;
+   goto unlock;
+   }
file_size = (u64)i_size_read(inode);
ecryptfs_printk(KERN_DEBUG, "Writing size: [0x%.16x]\n", file_size);
file_size = cpu_to_be64(file_size);
@@ -462,6 +473,7 @@ static int ecryptfs_write_inode_size_to_header(struct file 
*lower_file,
kunmap_atomic(header_virt, KM_USER0);
flush_dcache_page(header_page);
rc = lower_a_ops->commit_write(lower_file, header_page, 0, 8);
+unlock:
if (rc < 0)
ecryptfs_printk(KERN_ERR, "Error commiting header page "
"write\n");
diff --git a/fs/namei.c b/fs/namei.c
index b305589..723db81 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2704,6 +2704,16 @@ retry:
page_cache_release(page);
goto retry;
}
+   if (unlikely(err > 0 && err < PAGE_CACHE_SIZE)) {
+   /* 
+* prepare_write can handle less bytes whan we need. This is not likely
+* to happend realy because usualy we need only one block. In order to
+* preserve prepare/commit balanced invoke commit end fail.
+*/
+   int ret;
+   ret = mapping->a_ops->commit_write(NULL, page, 0, err);
+   err = ret ? ret : -ENOSPC;
+   }
if (err)
goto fail_map;
kaddr = kmap_atomic(page, KM_USER0);
diff --git a/fs/splice.c b/fs/splice.c
index 2fca6eb..d2b92bf 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -652,6 +652,17 @@ find_page:
if (unlikely(ret)) {
loff_t isize = i_size_read(mapping->host);
 
+   if (ret > 0 && ret < PAGE_CACHE_SIZE) {
+   /* 
+* prepare_write demands limit size of bytes. In order to
+* preserve prepare/commit balanced invoke commit end fail. 
+* Initial i_size saved, so vmtruncate safely restore it later.
+*/
+   int ret2;
+   ret2 = mapping->a_ops->commit_write(file, page, offset,
+   offset + ret);
+

[PATCH 1/1][RFC] mm: prepare_write positive return value

2007-02-06 Thread Dmitriy Monakhov

This patch solve ext3/4 retry loop issue

Issue description:
What we can do if block_prepare_write fail inside ext3_prepare_write ?
a) Stop transaction and do retry if possible, but what happend if 
   reboot comes after journal_force_commit, but before we exhaust
   all retry attempts and generic_file_buffered_write call trancate? 
   We may get allocated blocks outside i_size. Witch is bad.

b) Commit newly allocated blocks. This approach was used after Andrey's patch.
   But if reboot comes, or error happend, user will be surprised that i_size
   increased but blocks are zero filed. This is internal write operation state
   becames visiable to user. Witch is also bad. 
   
c) Just return as match bytes as we can deal with rigth now back to 
   caller, and let's caller handles it and than call commit. In this scenario
   we never stop journal in the midle of some internal fs operation.
   If reboot comes before commit trunsaction was't closed so it will 
   be droped while journal replay.

Only (c) tend to garantie attomic data operation.

Also fix some issues introduced by 
retries-in-ext3_prepare_write-violate-ordering-requirements:
  i_size may increase after error happend.
  possible data corruption caused commiting non uptodate bh.

Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
-
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index dba6dd2..4c5e9f7 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1154,6 +1154,18 @@ static int do_journal_get_write_access(handle_t *handle,
  * transaction must include the content of the newly allocated blocks.
  * This content is expected to be set to zeroes by block_prepare_write().
  * 2006/10/14  SAW
+ * What we can do if some blocks was allocated?
+ * a) Stop transaction and do retry if possible, but what happend if
+ *reboot comes after journal_force_commit, but before we exhaust
+ *all retry attempts and caller call trancate?
+ *We may get allocated blocks outside i_size. Witch is bad.
+ * b) Commit newly allocated blocks. And if reboot comes, user will be
+ *surprised that i_size increased but blocks are zero filed. Witch is
+ *also bad.
+ * c) Just return as match bytes as we can deal with rigth now back to
+ *caller, and if reboot comes trunsaction was't closed so it will
+ *be droped while journal replay.
+ * Only (c) tend to garantie attomic data operation.
  */
 static int ext3_prepare_failure(struct file *file, struct page *page,
unsigned from, unsigned to)
@@ -1186,7 +1198,7 @@ skip:
block_start = to;
break;
}
-   if (!buffer_mapped(bh))
+   if (!buffer_mapped(bh) || !buffer_uptodate(bh))
/* prepare_write failed on this bh */
break;
if (ext3_should_journal_data(mapping->host)) {
@@ -1204,8 +1216,8 @@ skip:
if (block_start <= from)
goto skip;
 
-   /* commit allocated and zeroed buffers */
-   return mapping->a_ops->commit_write(file, page, from, block_start);
+   /* return number of bytes till last mapped && uptodate block */
+   return block_start - from;
 }
 
 static int ext3_prepare_write(struct file *file, struct page *page,
@@ -1239,7 +1251,8 @@ retry:
 
 failure:
ret2 = ext3_prepare_failure(file, page, from, to);
-   if (ret2 < 0)
+   if (ret2)
+   /* ready to partial write, or fatal error */
return ret2;
if (ret == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, &retries))
goto retry;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 806eee1..da39f80 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1153,6 +1153,18 @@ static int do_journal_get_write_access(handle_t *handle,
  * transaction must include the content of the newly allocated blocks.
  * This content is expected to be set to zeroes by block_prepare_write().
  * 2006/10/14  SAW
+ * What we can do if some blocks was allocated?
+ * a) Stop transaction and do retry if possible, but what happend if
+ *reboot comes after journal_force_commit, but before we exhaust
+ *all retry attempts and caller call trancate?
+ *We may get allocated blocks outside i_size. Witch is bad.
+ * b) Commit newly allocated blocks. And if reboot comes, user will be
+ *surprised that i_size increased but blocks are zero filed. Witch is
+ *also bad.
+ * c) Just return as match bytes as we can deal with rigth now back to
+ *caller, and if reboot comes trunsaction was't closed so it will
+ *be droped while journal replay.
+ * Only (c) tend to garantie attomic data operation.
  */
 static int ext4_prepare_failure(struct file *file, struct page *page,
unsigned from, unsigned to)
@@ -1185,7 +1197,7 @@ skip:
block_start = to;

[PATCH] ecryptfs lower file handling code issues

2007-02-07 Thread Dmitriy Monakhov

eCryptfs lower file handling code has several issues:
  - Retval from prepare_write()/commit_writ() was't checked to equality 
to AOP_TRUNCATED_PAGE.
  - In some places page was't unmapped and unlocked after error.

Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
--   
diff --git a/fs/ecryptfs/mmap.c b/fs/ecryptfs/mmap.c
index 06843d2..27ac9ef 100644
--- a/fs/ecryptfs/mmap.c
+++ b/fs/ecryptfs/mmap.c
@@ -391,12 +391,14 @@ out:
return rc;
 }
 
-static void ecryptfs_unmap_and_release_lower_page(struct page *lower_page)
+static void ecryptfs_unmap_and_release_lower_page(struct page *lower_page,
+   int page_locked)
 {
kunmap(lower_page);
ecryptfs_printk(KERN_DEBUG, "Unlocking lower page with index = "
"[0x%.16x]\n", lower_page->index);
-   unlock_page(lower_page);
+   if (page_locked)
+   unlock_page(lower_page);
page_cache_release(lower_page);
 }
 
@@ -417,6 +419,7 @@ ecryptfs_write_inode_size_to_header(struct file *lower_file,
char *header_virt;
const struct address_space_operations *lower_a_ops;
u64 file_size;
+   int pg_locked = 1;
 
rc = ecryptfs_grab_and_map_lower_page(&header_page, &header_virt,
  lower_inode, 0);
@@ -427,6 +430,12 @@ ecryptfs_write_inode_size_to_header(struct file 
*lower_file,
}
lower_a_ops = lower_inode->i_mapping->a_ops;
rc = lower_a_ops->prepare_write(lower_file, header_page, 0, 8);
+   if (rc) {
+   if (rc == AOP_TRUNCATED_PAGE)
+   pg_locked = 0;
+   ecryptfs_unmap_and_release_lower_page(header_page, pg_locked);
+   goto out;
+   }
file_size = (u64)i_size_read(inode);
ecryptfs_printk(KERN_DEBUG, "Writing size: [0x%.16x]\n", file_size);
file_size = cpu_to_be64(file_size);
@@ -435,7 +444,9 @@ ecryptfs_write_inode_size_to_header(struct file *lower_file,
if (rc < 0)
ecryptfs_printk(KERN_ERR, "Error commiting header page "
"write\n");
-   ecryptfs_unmap_and_release_lower_page(header_page);
+   if (rc == AOP_TRUNCATED_PAGE)
+   pg_locked = 0;
+   ecryptfs_unmap_and_release_lower_page(header_page, pg_locked);
lower_inode->i_mtime = lower_inode->i_ctime = CURRENT_TIME;
mark_inode_dirty_sync(inode);
 out:
@@ -468,7 +479,10 @@ int ecryptfs_get_lower_page(struct page **lower_page, 
struct inode *lower_inode,
}
 out:
if (rc && (*lower_page)) {
-   ecryptfs_unmap_and_release_lower_page(*lower_page);
+   int pg_locked = 1;
+   if (rc == AOP_TRUNCATED_PAGE)
+   pg_locked = 0;
+   ecryptfs_unmap_and_release_lower_page(*lower_page, pg_locked);
(*lower_page) = NULL;
}
return rc;
@@ -484,16 +498,19 @@ ecryptfs_commit_lower_page(struct page *lower_page, 
struct inode *lower_inode,
   struct file *lower_file, int byte_offset,
   int region_size)
 {
+   int pg_locked = 1;
int rc = 0;
 
rc = lower_inode->i_mapping->a_ops->commit_write(
lower_file, lower_page, byte_offset, region_size);
+   if (rc == AOP_TRUNCATED_PAGE)
+   pg_locked = 0;
if (rc < 0) {
ecryptfs_printk(KERN_ERR,
"Error committing write; rc = [%d]\n", rc);
} else
rc = 0;
-   ecryptfs_unmap_and_release_lower_page(lower_page);
+   ecryptfs_unmap_and_release_lower_page(lower_page, pg_locked);
return rc;
 }
 
@@ -541,6 +558,7 @@ process_new_file(struct ecryptfs_crypt_stat *crypt_stat,
int current_header_page = 0;
int header_pages;
int more_header_data_to_be_written = 1;
+   int pg_locked = 1;
 
lower_inode = ecryptfs_inode_to_lower(inode);
lower_file = ecryptfs_file_to_lower(file);
@@ -563,6 +581,10 @@ process_new_file(struct ecryptfs_crypt_stat *crypt_stat,
rc = lower_a_ops->prepare_write(lower_file, header_page, 0,
PAGE_CACHE_SIZE);
if (rc) {
+   if (rc == AOP_TRUNCATED_PAGE)
+   pg_locked = 0;
+   ecryptfs_unmap_and_release_lower_page(header_page,
+   pg_locked);
ecryptfs_printk(KERN_ERR, "Error preparing to write "
"header page out; rc = [%d]\n", rc);
goto out;
@@ -579,7 +601,7 @@ process_new_file(struct ecryptfs_crypt_stat *crypt_stat,
rc =

nobh_truncate_page question

2007-02-07 Thread Dmitriy Monakhov

Could some one please explain me what's happens with nobh_truncate_page?

int nobh_truncate_page(struct address_space *mapping, loff_t from)
{

 if ((offset & (blocksize - 1)) == 0)
goto out;

ret = -ENOMEM;
page = grab_cache_page(mapping, index);
if (!page)
goto out;

to = (offset + blocksize) & ~(blocksize - 1);
ret = a_ops->prepare_write(NULL, page, offset, to);
< This prepare_write() not balanced by appropriate commit_write()
< Currently this not is fatal because this function used only
< by ext2 and jfs witch  unbalanced prepare/commit operations
< not painful.  
if (ret == 0) {
kaddr = kmap_atomic(page, KM_USER0);
memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
flush_dcache_page(page);
kunmap_atomic(kaddr, KM_USER0);
set_page_dirty(page);
}
unlock_page(page);
page_cache_release(page);
out:
return ret;
}

But we can't just call commit_write after all job done because this result in
i_size increasing till block end. Where are two possible solutions:
a) Call ->commit_write() and then restore i_size to old value.
b) Leave code as it is now, but add warning comment about prepare/commit
   balance violation.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 3/3] ext2: use perform_write aop

2007-02-08 Thread Dmitriy Monakhov

Nick Piggin <[EMAIL PROTECTED]> writes:

> Convert ext2 to use ->perform_write. This uses the main loop out of
> generic_perform_write, but when encountering a short usercopy, it
> zeroes out new uninitialised blocks, and passes in a short-length commit
> to __block_commit_write, which does the right thing (in terms of not
> setting things uptodate).
>
>  fs/buffer.c |  143 
> 
>  fs/ext2/inode.c |7 ++
>  include/linux/buffer_head.h |1 
>  include/linux/pagemap.h |2 
>  4 files changed, 153 insertions(+)
>
> Index: linux-2.6/fs/buffer.c
> ===
> --- linux-2.6.orig/fs/buffer.c
> +++ linux-2.6/fs/buffer.c
> @@ -1866,6 +1866,50 @@ next_bh:
>   return err;
>  }
>  
> +void page_zero_new_buffers(struct page *page, unsigned from, unsigned to)
> +{
> + unsigned int block_start, block_end;
> + struct buffer_head *head, *bh;
> +
> + BUG_ON(!PageLocked(page));
> + if (!page_has_buffers(page))
> + return;
> +
> + bh = head = page_buffers(page);
> + block_start = 0;
> + do {
> + block_end = block_start + bh->b_size;
> +
> + if (buffer_new(bh)) {
> + if (block_end > from && block_start < to) {
> + if (!PageUptodate(page)) {
> + unsigned start, end;
> + void *kaddr;
> +
> + start = max(from, block_start);
> + end = min(to, block_end);
> +
> + kaddr = kmap_atomic(page, KM_USER0);
> + memset(kaddr+start, 0, block_end-end);
> + flush_dcache_page(page);
> + kunmap_atomic(kaddr, KM_USER0);
> + set_buffer_uptodate(bh);
> + }
> +
> + /*
> +  * XXX: make buffer_new behaviour more
> +  * consistent.
> +  * clear_buffer_new(bh);
> +  */
> + mark_buffer_dirty(bh);
> + }
> + }
> +
> + block_start = block_end;
> + bh = bh->b_this_page;
> + } while (bh != head);
> +}
> +
>  static int __block_commit_write(struct inode *inode, struct page *page,
>   unsigned from, unsigned to)
>  {
> @@ -1900,6 +1944,105 @@ static int __block_commit_write(struct i
>   return 0;
>  }
>  
> +ssize_t block_perform_write(struct file *file, struct iovec_iterator *i,
> + loff_t pos, get_block_t *get_block)
> +{
> + struct address_space *mapping = file->f_mapping;
> + struct inode *inode = mapping->host;
> + long status = 0;
> + ssize_t written = 0;
> +
> + do {
> + struct page *page;
> + pgoff_t index;  /* Pagecache index for current page */
> + unsigned long offset;   /* Offset into pagecache page */
> + unsigned long bytes;/* Bytes to write to page */
> + size_t copied;  /* Bytes copied from user */
> +
> + offset = (pos & (PAGE_CACHE_SIZE - 1));
> + index = pos >> PAGE_CACHE_SHIFT;
> + bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
> + iovec_iterator_count(i));
> +
> + /*
> +  * Bring in the user page that we will copy from _first_.
> +  * Otherwise there's a nasty deadlock on copying from the
> +  * same page as we're writing to, without it being marked
> +  * up-to-date.
> +  *
> +  * Not only is this an optimisation, but it is also required
> +  * to check that the address is actually valid, when atomic
> +  * usercopies are used, below.
> +  */
> + if (unlikely(iovec_iterator_fault_in_readable(i))) {
> + status = -EFAULT;
> + break;
> + }
> +
> + page = __grab_cache_page(mapping, index);
> + if (!page) {
> + status = -ENOMEM;
> + break;
> + }
> +
> + status = __block_prepare_write(inode, page, offset,
> + offset+bytes, get_block);
> + if (unlikely(status)) {
> + ClearPageUptodate(page);
> +
> + page_cache_release(page);
> +
> + /*
> +  * prepare_write() may have instantiated a few blocks
> +  * outside i_size.  Trim these off again. Don't need
> +

[PATCH 0/1][RFC] prepare_write positive return value V(2)

2007-02-12 Thread Dmitriy Monakhov

Changes from ver(1)
 - __page_symlink(): In order to be on a safe side add explicit zeroing content
 before fail (just in case).
 -do_lo_send_aops(): If prepare_write can't handle total size of bytes requested
 from loop_dev it is safer to fail.
 -  pipe_to_file():  Limit the size of the copy to size wich prepare_write ready
 to handle. instead of failing.
 - ecryptfs: was't changed in this version, because where is no even 
 AOP_TRUNCATED_PAGE handling code where.

Almost all read/write operation handles data with chunks(segments or pages)
and result has integral behaviour for folowing scenario: 
for_each_chunk() {
 res = op();
 if(IS_ERROR(res))
   return progress ? progress : res;
 progress += res;
}
prepare_write may has integral behaviour in case of blksize < pgsize,
for example we successfully allocated/read some blocks, but not all of them,
and than some error happend. Currently we eliminate this progress by doing
vmtrunate() after prepare_has failed.
It is good to have ability to signal about this progress. Interprete positive
prepare_write() ret code as bytes num that fs ready to handle at this moment.
 
BTH: This approach was used in OpenVZ 2.6.9 kernel in order to make FS with 
delayed allocation more correct, and its works well.
I think not everybody will happy about this,  but let's discuss all advantages
and disadvantages of this change.

fixes not dirrectly connected with proposed prepare_write semantic changes:
 __page_symlink : If find_or_create_page has failed on second retry attempt
  function will exit with wrong error code.
 __generic_cont_expand: Add correct AOP_TRUNCATED_PAGE handling.

Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
-
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 6b5b642..dc332dc 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -224,6 +224,7 @@ static int do_lo_send_aops(struct loop_device *lo, struct 
bio_vec *bvec,
sector_t IV;
unsigned size;
int transfer_result;
+   int partial = 0;
 
IV = ((sector_t)index << (PAGE_CACHE_SHIFT - 9))+(offset >> 9);
size = PAGE_CACHE_SIZE - offset;
@@ -239,11 +240,20 @@ static int do_lo_send_aops(struct loop_device *lo, struct 
bio_vec *bvec,
page_cache_release(page);
continue;
}
-   goto unlock;
+   if (ret > 0 && ret < PAGE_CACHE_SIZE) {
+   /* 
+* ->prepare_write() can't handle total size of bytes
+* requested. In fact this is not likely to happen, 
+* because we request one block only.
+*/
+   partial = 1;
+   size = ret;
+   } else
+   goto unlock;
}
transfer_result = lo_do_transfer(lo, WRITE, page, offset,
bvec->bv_page, bv_offs, size, IV);
-   if (unlikely(transfer_result)) {
+   if (unlikely(transfer_result || partial)) {
char *kaddr;
 
/*
@@ -266,7 +276,7 @@ static int do_lo_send_aops(struct loop_device *lo, struct 
bio_vec *bvec,
}
goto unlock;
}
-   if (unlikely(transfer_result))
+   if (unlikely(transfer_result || partial))
goto unlock;
bv_offs += size;
len -= size;
diff --git a/fs/buffer.c b/fs/buffer.c
index 1ad674f..ed06560 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2027,13 +2027,20 @@ static int __generic_cont_expand(struct inode *inode, 
loff_t size,
if (size > inode->i_sb->s_maxbytes)
goto out;
 
+retry:
err = -ENOMEM;
page = grab_cache_page(mapping, index);
if (!page)
goto out;
err = mapping->a_ops->prepare_write(NULL, page, offset, offset);
+   if (err == AOP_TRUNCATED_PAGE) {
+   page_cache_release(page);
+   goto retry;
+   }
if (err) {
/*
+* ->prepare_write() called with from==to and must not 
+* return positive size of bytes in this case.
 * ->prepare_write() may have instantiated a few blocks
 * outside i_size.  Trim these off again.
 */
@@ -2044,6 +2051,10 @@ static int __generic_cont_expand(struct inode *inode, 
loff_t size,
}
 
err = mapping->a_ops->commit_write(NULL, page, offset, offset);
+   if (err == AO

[PATCH 1/1][RFC] EXT34 retry loop issue V(2)

2007-02-12 Thread Dmitriy Monakhov


Patch depends on : "[PATCH 0/1][RFC]  prepare_write positive return value V(2)"

This patch solve ext3/4 retry loop issue.
Issue description:
What we can do if block_prepare_write fail inside ext3_prepare_write ?
a) Stop transaction and do retry if possible, but what happend if 
   reboot comes after journal_force_commit, but before we exhaust
   all retry attempts and generic_file_buffered_write call trancate? 
   We may get allocated blocks outside i_size. Witch is bad.

b) Commit newly allocated blocks. This approach was used after Andrey's patch.
   But if reboot comes, or error happend, user will be surprised that i_size
   increased but blocks are zero filed. This is internal write operation state
   becames visiable to user. Witch is also bad. 
   
c) Just return as match bytes as we can deal with rigth now back to 
   caller, and let's caller handles it and than call commit. In this scenario
   we never stop journal in the midle of some internal fs operation.
   If reboot comes before commit trunsaction was't closed so it will 
   be droped while journal replay.

Only (c) tend to garantie attomic data operation.

Also fix some issues introduced by 
retries-in-ext3_prepare_write-violate-ordering-requirements:
  i_size may increase after error happend.
  possible data corruption caused commiting non uptodate bh.

Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
-
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index dba6dd2..4c5e9f7 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1154,6 +1154,18 @@ static int do_journal_get_write_access(handle_t *handle,
  * transaction must include the content of the newly allocated blocks.
  * This content is expected to be set to zeroes by block_prepare_write().
  * 2006/10/14  SAW
+ * What we can do if some blocks was allocated?
+ * a) Stop transaction and do retry if possible, but what happend if
+ *reboot comes after journal_force_commit, but before we exhaust
+ *all retry attempts and caller call trancate?
+ *We may get allocated blocks outside i_size. Witch is bad.
+ * b) Commit newly allocated blocks. And if reboot comes, user will be
+ *surprised that i_size increased but blocks are zero filed. Witch is
+ *also bad.
+ * c) Just return as match bytes as we can deal with rigth now back to
+ *caller, and if reboot comes trunsaction was't closed so it will
+ *be droped while journal replay.
+ * Only (c) tend to garantie attomic data operation.
  */
 static int ext3_prepare_failure(struct file *file, struct page *page,
unsigned from, unsigned to)
@@ -1186,7 +1198,7 @@ skip:
block_start = to;
break;
}
-   if (!buffer_mapped(bh))
+   if (!buffer_mapped(bh) || !buffer_uptodate(bh))
/* prepare_write failed on this bh */
break;
if (ext3_should_journal_data(mapping->host)) {
@@ -1204,8 +1216,8 @@ skip:
if (block_start <= from)
goto skip;
 
-   /* commit allocated and zeroed buffers */
-   return mapping->a_ops->commit_write(file, page, from, block_start);
+   /* return number of bytes till last mapped && uptodate block */
+   return block_start - from;
 }
 
 static int ext3_prepare_write(struct file *file, struct page *page,
@@ -1239,7 +1251,8 @@ retry:
 
 failure:
ret2 = ext3_prepare_failure(file, page, from, to);
-   if (ret2 < 0)
+   if (ret2)
+   /* ready to partial write, or fatal error */
return ret2;
if (ret == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, &retries))
goto retry;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 806eee1..da39f80 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1153,6 +1153,18 @@ static int do_journal_get_write_access(handle_t *handle,
  * transaction must include the content of the newly allocated blocks.
  * This content is expected to be set to zeroes by block_prepare_write().
  * 2006/10/14  SAW
+ * What we can do if some blocks was allocated?
+ * a) Stop transaction and do retry if possible, but what happend if
+ *reboot comes after journal_force_commit, but before we exhaust
+ *all retry attempts and caller call trancate?
+ *We may get allocated blocks outside i_size. Witch is bad.
+ * b) Commit newly allocated blocks. And if reboot comes, user will be
+ *surprised that i_size increased but blocks are zero filed. Witch is
+ *also bad.
+ * c) Just return as match bytes as we can deal with rigth now back to
+ *caller, and if reboot comes trunsaction was't closed so it will
+ *be droped while journal replay.
+ * Only (c) tend to garantie attomic data operation.
  */
 static int ext4_prepare_failure(struct file *file, struct page *page,
unsigned from, un

Re: [PATCH] ecryptfs lower file handling code issues

2007-02-12 Thread Dmitriy Monakhov

Dmitriy Monakhov <[EMAIL PROTECTED]> writes:

> eCryptfs lower file handling code has several issues:
>   - Retval from prepare_write()/commit_writ() was't checked to equality 
> to AOP_TRUNCATED_PAGE.
>   - In some places page was't unmapped and unlocked after error.
it is easy to retproduce:
##Prepare FS
dd if=/dev/zero of=FS.img bs=1M count=64
mkfs.ext3 -F FS.img 
mkdir crypt_root
mkdir crypt_private
mount FS.img crypt_private/ -oloop
mount -t ecryptfs crypt_private/ crypt_root/ -ocipher=aes
## exhaust all available space, and provoke ENOSPC condition.
for ((i=0;i<100;i++));do dd if=/dev/zero of=crypt_root/file$i bs=1M count=1;done
## after  this we got folowing errors on console:
#process_new_file: Error preparing to write header page out; rc = [-28]
#ecryptfs_commit_write: Error processing new file; rc = [-28]
#write_zeros: Error attempting to write zero's to remainder of page at index 
[0x]
#ecryptfs_fill_zeros: write_zeros(file=[f561bde0], index=[0x], 
old_end_pos_in_page=[d], (PAGE_CACHE_SIZE - new_end_pos_in_page=[-1])=[d]) 
returned [0]
#grow_file: Error attempting to fill zeros in file; rc = [-28]

# After prepare_write() failed in process_new_file() page was't unlocked,
# so leter umount will stuck forever in D state. 
umount crypt_root/
umount crypt_private/ ## stuck forever here.
>
> Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
> --   
> diff --git a/fs/ecryptfs/mmap.c b/fs/ecryptfs/mmap.c
> index 06843d2..27ac9ef 100644
> --- a/fs/ecryptfs/mmap.c
> +++ b/fs/ecryptfs/mmap.c
> @@ -391,12 +391,14 @@ out:
>   return rc;
>  }
>  
> -static void ecryptfs_unmap_and_release_lower_page(struct page *lower_page)
> +static void ecryptfs_unmap_and_release_lower_page(struct page *lower_page,
> + int page_locked)
>  {
>   kunmap(lower_page);
>   ecryptfs_printk(KERN_DEBUG, "Unlocking lower page with index = "
>   "[0x%.16x]\n", lower_page->index);
> - unlock_page(lower_page);
> + if (page_locked)
> + unlock_page(lower_page);
>   page_cache_release(lower_page);
>  }
>  
> @@ -417,6 +419,7 @@ ecryptfs_write_inode_size_to_header(struct file 
> *lower_file,
>   char *header_virt;
>   const struct address_space_operations *lower_a_ops;
>   u64 file_size;
> + int pg_locked = 1;
>  
>   rc = ecryptfs_grab_and_map_lower_page(&header_page, &header_virt,
> lower_inode, 0);
> @@ -427,6 +430,12 @@ ecryptfs_write_inode_size_to_header(struct file 
> *lower_file,
>   }
>   lower_a_ops = lower_inode->i_mapping->a_ops;
>   rc = lower_a_ops->prepare_write(lower_file, header_page, 0, 8);
> + if (rc) {
> + if (rc == AOP_TRUNCATED_PAGE)
> + pg_locked = 0;
> + ecryptfs_unmap_and_release_lower_page(header_page, pg_locked);
> + goto out;
> + }
>   file_size = (u64)i_size_read(inode);
>   ecryptfs_printk(KERN_DEBUG, "Writing size: [0x%.16x]\n", file_size);
>   file_size = cpu_to_be64(file_size);
> @@ -435,7 +444,9 @@ ecryptfs_write_inode_size_to_header(struct file 
> *lower_file,
>   if (rc < 0)
>   ecryptfs_printk(KERN_ERR, "Error commiting header page "
>   "write\n");
> - ecryptfs_unmap_and_release_lower_page(header_page);
> + if (rc == AOP_TRUNCATED_PAGE)
> + pg_locked = 0;
> + ecryptfs_unmap_and_release_lower_page(header_page, pg_locked);
>   lower_inode->i_mtime = lower_inode->i_ctime = CURRENT_TIME;
>   mark_inode_dirty_sync(inode);
>  out:
> @@ -468,7 +479,10 @@ int ecryptfs_get_lower_page(struct page **lower_page, 
> struct inode *lower_inode,
>   }
>  out:
>   if (rc && (*lower_page)) {
> - ecryptfs_unmap_and_release_lower_page(*lower_page);
> + int pg_locked = 1;
> + if (rc == AOP_TRUNCATED_PAGE)
> + pg_locked = 0;
> + ecryptfs_unmap_and_release_lower_page(*lower_page, pg_locked);
>   (*lower_page) = NULL;
>   }
>   return rc;
> @@ -484,16 +498,19 @@ ecryptfs_commit_lower_page(struct page *lower_page, 
> struct inode *lower_inode,
>  struct file *lower_file, int byte_offset,
>  int region_size)
>  {
> + int pg_locked = 1;
>   int rc = 0;
>  
>   rc = lower_inode->i_mapping->a_ops->commit_write(
>   lower_file, lower_page, byte_offset, region_size);
> + if (rc == AOP_TRUNCATED_PAGE)
> + p

[PATCH]mm incorrect direct io error handling (v4)

2007-02-15 Thread Dmitriy Monakhov

Incorrect direct io error handling (v4)
 Changes from v3:
 - Patch preparead gainst 2.6.20-rc6-mm3. It seems now it's not conflict 
   with Nick's stuff.
 - fix generic segment check function.

LOG:
If generic_file_direct_write() has fail (ENOSPC condition) inside 
__generic_file_aio_write_nolock() it may have instantiated
a few blocks outside i_size. And fsck will complain about wrong i_size
(ext2, ext3 and reiserfs interpret i_size and biggest block difference as 
error),
after fsck will fix error i_size will be increased to the biggest block,
but this blocks contain gurbage from previous write attempt, this is not 
information leak, but its silence file data corruption. This issue affect 
fs regardless the values of blocksize or pagesize.
We need truncate any block beyond i_size after write have failed , do in simular
generic_file_buffered_write() error path. Initialy i've proposed do it in
__generic_file_aio_write_nolock() with explicit guarantee i_mutex always held,
but not everybody was agree with it. So we may safely call vmtruncate() inside 
generic_file_aio_write(), here i_mutex already locked.
 
TEST_CASE:
open("/mnt/test/BIG_FILE", O_WRONLY|O_CREAT|O_DIRECT, 0666) = 3
write(3, "aaa"..., 104857600) = -1 ENOSPC (No space left on device)

#stat /mnt/test/BIG_FILE
  File: `/mnt/test/BIG_FILE'
  Size: 0   Blocks: 110896 IO Block: 1024   regular empty file
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<file size is less than biggest block idx

Device: fe07h/65031dInode: 14  Links: 1
Access: (0644/-rw-r--r--)  Uid: (0/root)   Gid: (0/root)
Access: 2007-01-24 20:03:38.0 +0300
Modify: 2007-01-24 20:03:38.0 +0300
Change: 2007-01-24 20:03:39.0 +0300

#fsck.ext3 -f /dev/VG/test 
e2fsck 1.39 (29-May-2006)
Pass 1: Checking inodes, blocks, and sizes
Inode 14, i_size is 0, should be 56556544.  Fix? yes
Pass 2: Checking directory structure

Changes not dirrectly connected with issue:
 - move common segment checks to separate helper function.
Patch pass ltp and fsstres test.

Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
---
diff --git a/mm/filemap.c b/mm/filemap.c
index 529eb9e..5f189e0 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1170,6 +1170,31 @@ success:
return size;
 }
 
+int generic_segment_checks(const struct iovec *iov, unsigned long *nr_segs,
+   size_t *count, unsigned long access_flags)
+{
+   unsigned long   seg;
+   for (seg = 0; seg < *nr_segs; seg++) {
+   const struct iovec *iv = &iov[seg];
+
+   /*
+* If any segment has a negative length, or the cumulative
+* length ever wraps negative then return -EINVAL.
+*/
+   *count += iv->iov_len;
+   if (unlikely((ssize_t)(*count|iv->iov_len) < 0))
+   return -EINVAL;
+   if (access_ok(access_flags, iv->iov_base, iv->iov_len))
+   continue;
+   if (seg == 0)
+   return -EFAULT;
+   *nr_segs = seg;
+   *count -= iv->iov_len;  /* This segment is no good */
+   break;
+   }
+   return 0;
+}
+
 /**
  * generic_file_aio_read - generic filesystem read routine
  * @iocb:  kernel I/O control block
@@ -1191,24 +1216,9 @@ generic_file_aio_read(struct kiocb *iocb, const struct 
iovec *iov,
loff_t *ppos = &iocb->ki_pos;
 
count = 0;
-   for (seg = 0; seg < nr_segs; seg++) {
-   const struct iovec *iv = &iov[seg];
-
-   /*
-* If any segment has a negative length, or the cumulative
-* length ever wraps negative then return -EINVAL.
-*/
-   count += iv->iov_len;
-   if (unlikely((ssize_t)(count|iv->iov_len) < 0))
-   return -EINVAL;
-   if (access_ok(VERIFY_WRITE, iv->iov_base, iv->iov_len))
-   continue;
-   if (seg == 0)
-   return -EFAULT;
-   nr_segs = seg;
-   count -= iv->iov_len;   /* This segment is no good */
-   break;
-   }
+   retval = generic_segment_checks(iov, &nr_segs, &count, VERIFY_WRITE);
+   if (retval)
+   return retval;
 
/* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
if (filp->f_flags & O_DIRECT) {
@@ -2103,8 +2113,9 @@ generic_file_direct_write(struct kiocb *iocb, const 
struct iovec *iov,
/*
 * Sync the fs metadata but not the minor inode changes and
 * of course not the data as we did direct DMA for the IO.
-* i_mutex is held, which protects generic_osync_inode() from
-* livel

[PATCH]mm __page_symlink retry loop error code fix

2007-02-15 Thread Dmitriy Monakhov


If prepare_write or commit_write return AOP_TRUNCATED_PAGE we jump to "retry"
label and than if find_or_create_page() failed function return incorrect error
code. Initilally i've add this fix as a part of prepare_write_retval patch,
but IMHO it is better to split this to separate fix.
Patch against 2.6.20-rc6-mm3.
Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
---
diff --git a/fs/namei.c b/fs/namei.c
index 723db81..6b6fe43 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2692,10 +2692,11 @@ int __page_symlink(struct inode *inode, const char 
*symname, int len,
 {
struct address_space *mapping = inode->i_mapping;
struct page *page;
-   int err = -ENOMEM;
+   int err;
char *kaddr;
 
 retry:
+   err = -ENOMEM;
page = find_or_create_page(mapping, 0, gfp_mask);
if (!page)
goto fail;

[PATCH] ecryptfs lower_file handling issues

2007-02-19 Thread Dmitriy Monakhov


1)Function ecryptfs_do_readpage() calls flush_dcache_page(lower_page),
  but lower_page was't changed here. So remove this line.

2)prepare_write ret val was ignored in ecryptfs_write_inode_size_to_header().
If error happends we can't call commit_write, just do cleanup and fial.
It is issue easy to reproduce with full lower_fs, in this case prepare_write()
will return ENOSPC.
Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
-
diff --git a/fs/ecryptfs/mmap.c b/fs/ecryptfs/mmap.c
index 1e5d2ba..0cebb75 100644
--- a/fs/ecryptfs/mmap.c
+++ b/fs/ecryptfs/mmap.c
@@ -238,7 +238,6 @@ int ecryptfs_do_readpage(struct file *file, struct page 
*page,
lower_page_data = kmap_atomic(lower_page, KM_USER1);
memcpy(page_data, lower_page_data, PAGE_CACHE_SIZE);
kunmap_atomic(lower_page_data, KM_USER1);
-   flush_dcache_page(lower_page);
kunmap_atomic(page_data, KM_USER0);
flush_dcache_page(page);
rc = 0;
@@ -454,6 +453,12 @@ static int ecryptfs_write_inode_size_to_header(struct file 
*lower_file,
}
lower_a_ops = lower_inode->i_mapping->a_ops;
rc = lower_a_ops->prepare_write(lower_file, header_page, 0, 8);
+   if (rc) {
+   ecryptfs_printk(KERN_ERR, "Error preparing header page "
+   "write\n");
+   ecryptfs_release_lower_page(header_page);
+   goto out;
+   }
file_size = (u64)i_size_read(inode);
ecryptfs_printk(KERN_DEBUG, "Writing size: [0x%.16x]\n", file_size);
file_size = cpu_to_be64(file_size);

[PATCH] ecryptfs ecryptfs_read_super path_lookup errh fix

2007-02-19 Thread Dmitriy Monakhov

if path_lookup() return non zero code we don't have to worry about 'nd' 
parameter, but ecryptfs_read_super does path_release(&nd) after path_lookup
has failed, and dentry counter becomes negative :) , this result in folowing BUG

ecryptfs_read_super: path_lookup() failed
BUG: Dentry f3f074c4{i=0,n=enc_dir} still in use (-1) [unmount of ext3 dm-5]
<<<<<<<<<<<<<<<<<<<<]Not tainted VLI
EFLAGS: 00210292   (2.6.20-mm2-g0cb8e157-dirty #10)
EIP is at shrink_dcache_for_umount_subtree+0x159/0x1fb
eax: 004b   ebx: f3f074c4   ecx: c0122e78   edx: f7612070
esi:    edi: f3f0753c   ebp: f4269cac   esp: f4269c7c
ds: 007b   es: 007b   fs: 00d8  gs: 0033  ss: 0068
Process mount.ecryptfs (pid: 4236, ti=f4268000 task=f7612070 task.ti=f4268000)
Stack: c04f9dee f3f074c4  f3f0753c  c04fd617 f5161e30  
    f5161c00 c043b780 f4269d98 f4269cb8 c0183f5f f5161c00 f4269cd0 
   c0176166 f4269cc8 c022cb90 f705fcc0 f5161c00 f4269ce0 c017622c f5161c00 
Call Trace:
 [] show_trace_log_lvl+0x1a/0x2f
 [] show_stack_log_lvl+0x9b/0xaa
 [] show_registers+0x1b8/0x289
 [] die+0x11e/0x233
 [] do_trap+0x8a/0xa3
 [] do_invalid_op+0x97/0xa1
 [] error_code+0x7c/0x84
 [] shrink_dcache_for_umount+0x2d/0x3a
 [] generic_shutdown_super+0x18/0xbe
 [] kill_block_super+0x20/0x32
 [] deactivate_super+0x3f/0x51
 [] mntput_no_expire+0x42/0x6b
 [] path_release+0x20/0x23
 [] ecryptfs_get_sb+0x45a/0x4ad [ecryptfs]
 [] vfs_kern_mount+0x81/0xf1
 [] do_kern_mount+0x30/0x42
 [] do_mount+0x601/0x678
 [] sys_mount+0x6f/0xa9
 [] sysenter_past_esp+0x5d/0x99
 ===
Code: 30 02 00 00 89 44 24 18 8b 45 ec 89 4c 24 14 89 74 24 10 89 7c 24 0c 89 
5c 24 04 89 44 24 08 c7 04 24 ee 9d 4f c0 e8 b7 74 fa ff <0f> 0b eb fe 8b 73 30 
39 de 75 04 31 f6 eb 03 f0 ff 0e 8d 43 48 
EIP: [] shrink_dcache_for_umount_subtree+0x159/0x1fb SS:ESP 
0068:f4269c7c

This is easy to reproduce just try to mount ecryptfs to nonexisting lower path
# mount -tecryptfs private/this_dir_not_exist root -ocipher=aes

Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
-
diff --git a/fs/ecryptfs/main.c b/fs/ecryptfs/main.c
index 80044d1..812427e 100644
--- a/fs/ecryptfs/main.c
+++ b/fs/ecryptfs/main.c
@@ -487,7 +487,7 @@ static int ecryptfs_read_super(struct super_block *sb, 
const char *dev_name)
rc = path_lookup(dev_name, LOOKUP_FOLLOW, &nd);
if (rc) {
ecryptfs_printk(KERN_WARNING, "path_lookup() failed\n");
-   goto out_free;
+   goto out;
}
lower_root = nd.dentry;
if (!lower_root->d_inode) {

Re: [PATCH] ecryptfs ecryptfs_read_super path_lookup errh fix

2007-02-19 Thread Dmitriy Monakhov

Dmitriy Monakhov <[EMAIL PROTECTED]> writes:

> if path_lookup() return non zero code we don't have to worry about 'nd' 
> parameter, but ecryptfs_read_super does path_release(&nd) after path_lookup
> has failed, and dentry counter becomes negative :) , this result in folowing 
> BUG
>
> ecryptfs_read_super: path_lookup() failed
> BUG: Dentry f3f074c4{i=0,n=enc_dir} still in use (-1) [unmount of ext3 dm-5]
> <<<<<<<<<<<<<<<<<<<< [ cut here ]
> kernel BUG at fs/dcache.c:654!
>  [] die+0x11e/0x233
>  [] do_trap+0x8a/0xa3
>  [] do_invalid_op+0x97/0xa1
>  [] error_code+0x7c/0x84
>  [] shrink_dcache_for_umount+0x2d/0x3a
>  [] generic_shutdown_super+0x18/0xbe
>  [] kill_block_super+0x20/0x32
>  [] deactivate_super+0x3f/0x51
>  [] mntput_no_expire+0x42/0x6b
>  [] path_release+0x20/0x23
>  [] ecryptfs_get_sb+0x45a/0x4ad [ecryptfs]
>  [] vfs_kern_mount+0x81/0xf1
>  [] do_kern_mount+0x30/0x42
>  [] do_mount+0x601/0x678
>  [] sys_mount+0x6f/0xa9
>  [] sysenter_past_esp+0x5d/0x99
>  ===
> Code: 30 02 00 00 89 44 24 18 8b 45 ec 89 4c 24 14 89 74 24 10 89 7c 24 0c 89 
> 5c 24 04 89 44 24 08 c7 04 24 ee 9d 4f c0 e8 b7 74 fa ff <0f> 0b eb fe 8b 73 
> 30 39 de 75 04 31 f6 eb 03 f0 ff 0e 8d 43 48 
> EIP: [] shrink_dcache_for_umount_subtree+0x159/0x1fb SS:ESP 
> 0068:f4269c7c
>
> This is easy to reproduce just try to mount ecryptfs to nonexisting lower path
> # mount -tecryptfs private/this_dir_not_exist root -ocipher=aes
This is fun but ecryptfs_read_super() code is more crappy whan it looks at 
first blush.
1) After path_lookup succeed we dot't have any guarantie what it is DIR.
   This must be checked explicitly.
2) path_lookup can't return negative dentry, So inode check is useless.

Following patch is updated version and have to be applied instead of first 
patch 
Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
-
diff --git a/fs/ecryptfs/main.c b/fs/ecryptfs/main.c
index 80044d1..fc4a3a2 100644
--- a/fs/ecryptfs/main.c
+++ b/fs/ecryptfs/main.c
@@ -484,18 +484,12 @@ static int ecryptfs_read_super(struct super_block *sb, 
const char *dev_name)
struct vfsmount *lower_mnt;
 
memset(&nd, 0, sizeof(struct nameidata));
-   rc = path_lookup(dev_name, LOOKUP_FOLLOW, &nd);
+   rc = path_lookup(dev_name, LOOKUP_FOLLOW | LOOKUP_DIRECTORY, &nd);
if (rc) {
ecryptfs_printk(KERN_WARNING, "path_lookup() failed\n");
-   goto out_free;
+   goto out;
}
lower_root = nd.dentry;
-   if (!lower_root->d_inode) {
-   ecryptfs_printk(KERN_WARNING,
-   "No directory to interpose on\n");
-   rc = -ENOENT;
-   goto out_free;
-   }
lower_mnt = nd.mnt;
ecryptfs_set_superblock_lower(sb, lower_root->d_sb);
sb->s_maxbytes = lower_root->d_sb->s_maxbytes;

[patch] freevxfs: possible null pointer dereference fix

2007-02-19 Thread Dmitriy Monakhov

sb_read may return NULL, so let's explicitly check it.
Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
-
diff --git a/fs/freevxfs/vxfs_bmap.c b/fs/freevxfs/vxfs_bmap.c
index 2d71128..f86fd3c 100644
--- a/fs/freevxfs/vxfs_bmap.c
+++ b/fs/freevxfs/vxfs_bmap.c
@@ -137,7 +137,7 @@ vxfs_bmap_indir(struct inode *ip, long indir, int size, 
long block)
 
bp = sb_bread(ip->i_sb,
indir + (i / VXFS_TYPED_PER_BLOCK(ip->i_sb)));
-   if (!buffer_mapped(bp))
+   if (!bp || !buffer_mapped(bp))
return 0;
 
typ = ((struct vxfs_typed *)bp->b_data) +
diff --git a/fs/freevxfs/vxfs_inode.c b/fs/freevxfs/vxfs_inode.c
index 098a915..d1f7c5b 100644
--- a/fs/freevxfs/vxfs_inode.c
+++ b/fs/freevxfs/vxfs_inode.c
@@ -99,7 +99,7 @@ vxfs_blkiget(struct super_block *sbp, u_long extent, ino_t 
ino)
offset = ((ino % (sbp->s_blocksize / VXFS_ISIZE)) * VXFS_ISIZE);
bp = sb_bread(sbp, block);
 
-   if (buffer_mapped(bp)) {
+   if (bp && buffer_mapped(bp)) {
struct vxfs_inode_info  *vip;
struct vxfs_dinode  *dip;

[patch] reiserfs: possible null pointer dereference during resize

2007-02-19 Thread Dmitriy Monakhov

sb_read may return NULL, let's explicitly check it.
If so free new bitmap blocks array, after this we may safely exit as it done
above during bitmap allocation.
Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
-
diff --git a/fs/reiserfs/resize.c b/fs/reiserfs/resize.c
index 3156847..976cc78 100644
--- a/fs/reiserfs/resize.c
+++ b/fs/reiserfs/resize.c
@@ -131,6 +131,10 @@ int reiserfs_resize(struct super_block *s, unsigned long 
block_count_new)
/* don't use read_bitmap_block since it will cache
 * the uninitialized bitmap */
bh = sb_bread(s, i * s->s_blocksize * 8);
+   if (!bh) {
+   vfree(bitmap);
+   return -EIO;
+   }
memset(bh->b_data, 0, sb_blocksize(sb));
reiserfs_test_and_set_le_bit(0, bh->b_data);
reiserfs_cache_bitmap_metadata(s, bh, bitmap + i);

Re: [PATCH][RFC] incorrect direct io error handling (v3)

2007-01-26 Thread Dmitriy Monakhov

Andrew Morton <[EMAIL PROTECTED]> writes:

> On Wed, 24 Jan 2007 22:05:06 +0300
> Dmitriy Monakhov <[EMAIL PROTECTED]> wrote:
>
>> incorrect direct io error handling (v3)
>> Changes from v2:
>>  - Remove BUG_ON(!mutex_is_locked(..)) for non blkdev.
>>  - vmtruncate() called from generic_file_aio_write().
>>  - depends on patch titled:
>>[PATH][RFC] mm: Move common segments checks to separate function
>
> drat, I skipped that patch due to rejects, and because Nick is working on
> things in the same area.
>
[skip]
>>  if ((written >= 0 || written == -EIOCBQUEUED) &&
>>  ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
>> @@ -2365,6 +2366,17 @@ ssize_t generic_file_aio_write(struct ki
>>  &iocb->ki_pos);
>>  mutex_unlock(&inode->i_mutex);
>>  
>> +if (unlikely(ret < 0 && (file->f_flags & O_DIRECT))) {
>> +ssize_t cnt = generic_segment_checks(nr_segs, iov, VERIFY_READ);
>> +loff_t isize = i_size_read(inode);
>> +/*
>> + * generic_file_direct_write() may have instantiated a few 
>> + * blocks outside i_size.  Trim these off again.
>> + */
>> +if (cnt > 0 && (pos + cnt > isize))
>> +vmtruncate(inode, isize);
>> +}
>> +
>
> vmtruncate() really wants i_mutex to be held.  Can't we do that here?
Yepp 110% true, baaahh . it looks like my brain was't clear at the time i
wrote this. We have to do vmtruncate() brfore droping i_mutex , right after
__generic_file_aio_write_nolock() call , i'm sorry to waste your time. 


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] :EXT[3,4] jbd layer function called instead of fs specific one

2007-01-28 Thread Dmitriy Monakhov

jbd function called instead of fs specific one.

Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
--
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index beaf25f..8a824f4 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -947,7 +947,7 @@ out:
 static int ext3_get_block(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create)
 {
-   handle_t *handle = journal_current_handle();
+   handle_t *handle = ext3_journal_current_handle();
int ret = 0;
unsigned max_blocks = bh_result->b_size >> inode->i_blkbits;
 
@@ -1717,7 +1717,7 @@ static ssize_t ext3_direct_IO(int rw, st
/*
 * Reacquire the handle: ext3_get_block() can restart the transaction
 */
-   handle = journal_current_handle();
+   handle = ext3_journal_current_handle();
 
 out_stop:
if (handle) {
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index a127cc0..fbff4b9 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -946,7 +946,7 @@ out:
 static int ext4_get_block(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create)
 {
-   handle_t *handle = journal_current_handle();
+   handle_t *handle = ext4_journal_current_handle();
int ret = 0;
unsigned max_blocks = bh_result->b_size >> inode->i_blkbits;
 
@@ -1716,7 +1716,7 @@ static ssize_t ext4_direct_IO(int rw, st
/*
 * Reacquire the handle: ext4_get_block() can restart the transaction
 */
-   handle = journal_current_handle();
+   handle = ext4_journal_current_handle();
 
 out_stop:
if (handle) {

Re: + mm-search_binary_handler-mem-limit-fix.patch added to -mm tree

2007-01-29 Thread Dmitriy Monakhov

> > > The function changes mem limit to USER_DS before possible modprobe, but
> > > never restored it again.
Truly. The road to hell is paved with good intentions.

Martin Schwidefsky <[EMAIL PROTECTED]> writes:
> On Mon, 2007-01-29 at 09:37 -0800, Andrew Morton wrote:
>> hm, thanks for testing - I'll drop it.
>> 
>> I don't really understand what's wrong with it though.  Maybe it's settng
>> USER_DS on kernel threads?
>
> For architectures with a split address space there has to be a call
> set_fs(USER_DS) that switches from KERNEL_DS to USER_DS for the init
> process. So far this has been done in search_binary_handler and
> traditionally the kernel starts with KERNEL_DS to make the early
> copy_from_user calls work.
> So, what is wrong with always setting USER_DS? We are starting a user
> space process after all.
May be add some comment to prevent future attempts to make this place 
more "correct"?
>
> -- 
> blue skies,
>   Martin.
>
> Martin Schwidefsky
> Linux for zSeries Development & Services
> IBM Deutschland Entwicklung GmbH
>
> "Reality continues to ruin my life." - Calvin.
>
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] incorrect error handling inside generic_file_direct_write

2006-12-11 Thread Dmitriy Monakhov

OpenVZ team has discovered error inside generic_file_direct_write()
If generic_file_direct_IO() has fail (ENOSPC condition) it may have instantiated
a few blocks outside i_size. And fsck will complain about wrong i_size
(ext2, ext3 and reiserfs interpret i_size and biggest block difference as 
error),
after fsck will fix error i_size will be increased to the biggest block,
but this blocks contain gurbage from previous write attempt, this is not 
information leak, but its silence file data corruption. 
We need truncate any block beyond i_size after write have failed , do in simular
generic_file_buffered_write() error path.

Exampe:
open("mnt2/FILE3", O_WRONLY|O_CREAT|O_DIRECT, 0666) = 3
write(3, "aa"..., 4096) = -1 ENOSPC (No space left on device)

stat mnt2/FILE3
File: `mnt2/FILE3'
Size: 0   Blocks: 4  IO Block: 4096   regular empty file
>>>>>>>>>>>>>>>>>>>>>>^^ file size is less than biggest block idx
Device: 700h/1792d  Inode: 14  Links: 1
Access: (0644/-rw-r--r--)  Uid: (0/root)   Gid: (0/root)

fsck.ext2 -f -n  mnt1/fs_img
Pass 1: Checking inodes, blocks, and sizes
Inode 14, i_size is 0, should be 2048.  Fix? no

Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
--
diff --git a/mm/filemap.c b/mm/filemap.c
index 7b84dc8..bf7cf6c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2041,6 +2041,14 @@ generic_file_direct_write(struct kiocb *
mark_inode_dirty(inode);
}
*ppos = end;
+   } else if (written < 0) {
+   loff_t isize = i_size_read(inode);
+   /*
+* generic_file_direct_IO() may have instantiated a few blocks
+* outside i_size.  Trim these off again.
+*/
+   if (pos + count > isize)
+   vmtruncate(inode, isize);
}
 
/*

Re: [PATCH] incorrect error handling inside generic_file_direct_write

2006-12-11 Thread Dmitriy Monakhov

Andrew Morton <[EMAIL PROTECTED]> writes:

> On Mon, 11 Dec 2006 16:34:27 +0300
> Dmitriy Monakhov <[EMAIL PROTECTED]> wrote:
>
>> OpenVZ team has discovered error inside generic_file_direct_write()
>> If generic_file_direct_IO() has fail (ENOSPC condition) it may have 
>> instantiated
>> a few blocks outside i_size. And fsck will complain about wrong i_size
>> (ext2, ext3 and reiserfs interpret i_size and biggest block difference as 
>> error),
>> after fsck will fix error i_size will be increased to the biggest block,
>> but this blocks contain gurbage from previous write attempt, this is not 
>> information leak, but its silence file data corruption. 
>> We need truncate any block beyond i_size after write have failed , do in 
>> simular
>> generic_file_buffered_write() error path.
>> 
>> Exampe:
>> open("mnt2/FILE3", O_WRONLY|O_CREAT|O_DIRECT, 0666) = 3
>> write(3, "aa"..., 4096) = -1 ENOSPC (No space left on device)
>> 
>> stat mnt2/FILE3
>> File: `mnt2/FILE3'
>> Size: 0   Blocks: 4  IO Block: 4096   regular empty file
>> >>>>>>>>>>>>>>>>>>>>>>^^ file size is less than biggest block idx
>> Device: 700h/1792d  Inode: 14  Links: 1
>> Access: (0644/-rw-r--r--)  Uid: (0/root)   Gid: (0/root)
>> 
>> fsck.ext2 -f -n  mnt1/fs_img
>> Pass 1: Checking inodes, blocks, and sizes
>> Inode 14, i_size is 0, should be 2048.  Fix? no
>> 
>> Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
>> --
>>
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 7b84dc8..bf7cf6c 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -2041,6 +2041,14 @@ generic_file_direct_write(struct kiocb *
>>  mark_inode_dirty(inode);
>>  }
>>  *ppos = end;
>> +} else if (written < 0) {
>> +loff_t isize = i_size_read(inode);
>> +/*
>> + * generic_file_direct_IO() may have instantiated a few blocks
>> + * outside i_size.  Trim these off again.
>> + */
>> +if (pos + count > isize)
>> +vmtruncate(inode, isize);
>>  }
>>  
>
> XFS (at least) can call generic_file_direct_write() with i_mutex not held. 
How could it be ?

from mm/filemap.c:2046 generic_file_direct_write() comment right after 
place where i want to add vmtruncate()
/*
 * Sync the fs metadata but not the minor inode changes and
 * of course not the data as we did direct DMA for the IO.
 * i_mutex is held, which protects generic_osync_inode() from
 * livelocking.
 */

> And vmtruncate() expects i_mutex to be held.
generic_file_direct_IO must called under i_mutex too
from mm/filemap.c:2388
  /*
   * Called under i_mutex for writes to S_ISREG files.   Returns -EIO if 
something
   * went wrong during pagecache shootdown.
   */
  static ssize_t
  generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,

This means XFS generic_file_direct_write() call generic_file_direct_IO() without
i_mutex held too?
>
> I guess a suitable solution would be to push this problem back up to the
> callers: let them decide whether to run vmtruncate() and if so, to ensure
> that i_mutex is held.
>
> The existence of generic_file_aio_write_nolock() makes that rather messy
> though.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] incorrect error handling inside generic_file_direct_write

2006-12-12 Thread Dmitriy Monakhov

Andrew Morton <[EMAIL PROTECTED]> writes:

> On Mon, 11 Dec 2006 16:34:27 +0300
> Dmitriy Monakhov <[EMAIL PROTECTED]> wrote:
>
>> OpenVZ team has discovered error inside generic_file_direct_write()
>> If generic_file_direct_IO() has fail (ENOSPC condition) it may have 
>> instantiated
>> a few blocks outside i_size. And fsck will complain about wrong i_size
>> (ext2, ext3 and reiserfs interpret i_size and biggest block difference as 
>> error),
>> after fsck will fix error i_size will be increased to the biggest block,
>> but this blocks contain gurbage from previous write attempt, this is not 
>> information leak, but its silence file data corruption. 
>> We need truncate any block beyond i_size after write have failed , do in 
>> simular
>> generic_file_buffered_write() error path.
>> 
>> Exampe:
>> open("mnt2/FILE3", O_WRONLY|O_CREAT|O_DIRECT, 0666) = 3
>> write(3, "aa"..., 4096) = -1 ENOSPC (No space left on device)
>> 
>> stat mnt2/FILE3
>> File: `mnt2/FILE3'
>> Size: 0   Blocks: 4  IO Block: 4096   regular empty file
>> >>>>>>>>>>>>>>>>>>>>>>^^ file size is less than biggest block idx
>> Device: 700h/1792d  Inode: 14  Links: 1
>> Access: (0644/-rw-r--r--)  Uid: (0/root)   Gid: (0/root)
>> 
>> fsck.ext2 -f -n  mnt1/fs_img
>> Pass 1: Checking inodes, blocks, and sizes
>> Inode 14, i_size is 0, should be 2048.  Fix? no
>> 
>> Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
>> --
>>
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 7b84dc8..bf7cf6c 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -2041,6 +2041,14 @@ generic_file_direct_write(struct kiocb *
>>  mark_inode_dirty(inode);
>>  }
>>  *ppos = end;
>> +} else if (written < 0) {
>> +loff_t isize = i_size_read(inode);
>> +/*
>> + * generic_file_direct_IO() may have instantiated a few blocks
>> + * outside i_size.  Trim these off again.
>> + */
>> +if (pos + count > isize)
>> +vmtruncate(inode, isize);
>>  }
>>  
>
> XFS (at least) can call generic_file_direct_write() with i_mutex not held. 
> And vmtruncate() expects i_mutex to be held.
>
> I guess a suitable solution would be to push this problem back up to the
> callers: let them decide whether to run vmtruncate() and if so, to ensure
> that i_mutex is held.
>
> The existence of generic_file_aio_write_nolock() makes that rather messy
> though.
This means we may call generic_file_aio_write_nolock() without i_mutex, right?
but call trace is :
  generic_file_aio_write_nolock() 
  ->generic_file_buffered_write() /* i_mutex not held here */ 
but according to filemaps locking rules: mm/filemap.c:77
 ..
 *  ->i_mutex   (generic_file_buffered_write)
 *->mmap_sem(fault_in_pages_readable->do_page_fault)
 ..
I'm confused a litle bit, where is the truth? 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] incorrect error handling inside generic_file_direct_write

2006-12-12 Thread Dmitriy Monakhov

Andrew Morton <[EMAIL PROTECTED]> writes:

> On Tue, 12 Dec 2006 15:20:52 +0300
> Dmitriy Monakhov <[EMAIL PROTECTED]> wrote:
>
>> > XFS (at least) can call generic_file_direct_write() with i_mutex not held. 
>> > And vmtruncate() expects i_mutex to be held.
>> >
>> > I guess a suitable solution would be to push this problem back up to the
>> > callers: let them decide whether to run vmtruncate() and if so, to ensure
>> > that i_mutex is held.
>> >
>> > The existence of generic_file_aio_write_nolock() makes that rather messy
>> > though.
>> This means we may call generic_file_aio_write_nolock() without i_mutex, 
>> right?
>> but call trace is :
>>   generic_file_aio_write_nolock() 
>>   ->generic_file_buffered_write() /* i_mutex not held here */ 
>> but according to filemaps locking rules: mm/filemap.c:77
>>  ..
>>  *  ->i_mutex(generic_file_buffered_write)
>>  *->mmap_sem (fault_in_pages_readable->do_page_fault)
>>  ..
>> I'm confused a litle bit, where is the truth? 
>
> xfs_write() calls generic_file_direct_write() without taking i_mutex for
> O_DIRECT writes.
Yes, but my quastion is about __generic_file_aio_write_nolock().
As i understand _nolock sufix means that i_mutex was already locked 
by caller, am i right ?
If yes, than __generic_file_aio_write_nolock() is beter place for vmtrancate() 
acclivity after generic_file_direct_write() has fail.
Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
---
diff --git a/mm/filemap.c b/mm/filemap.c
index 7b84dc8..723d2ca 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2282,6 +2282,15 @@ __generic_file_aio_write_nolock(struct k
 
written = generic_file_direct_write(iocb, iov, &nr_segs, pos,
ppos, count, ocount);
+   if (written < 0) {
+   loff_t isize = i_size_read(inode);
+   /*
+* generic_file_direct_write() may have instantiated 
+* a few blocks outside i_size. Trim these off again.
+*/
+   if (pos + count > isize)
+   vmtruncate(inode, isize);
+   }
if (written < 0 || written == count)
goto out;
/*

Re: [PATCH] incorrect error handling inside generic_file_direct_write

2006-12-12 Thread Dmitriy Monakhov

Andrew Morton <[EMAIL PROTECTED]> writes:

> On Tue, 12 Dec 2006 16:18:32 +0300
> Dmitriy Monakhov <[EMAIL PROTECTED]> wrote:
>
>> >> but according to filemaps locking rules: mm/filemap.c:77
>> >>  ..
>> >>  *  ->i_mutex (generic_file_buffered_write)
>> >>  *->mmap_sem  (fault_in_pages_readable->do_page_fault)
>> >>  ..
>> >> I'm confused a litle bit, where is the truth? 
>> >
>> > xfs_write() calls generic_file_direct_write() without taking i_mutex for
>> > O_DIRECT writes.
>> Yes, but my quastion is about __generic_file_aio_write_nolock().
>> As i understand _nolock sufix means that i_mutex was already locked 
>> by caller, am i right ?
>
> Nope.  It just means that __generic_file_aio_write_nolock() doesn't take
> the lock.  We don't assume or require that the caller took it.  For example
> the raw driver calls generic_file_aio_write_nolock() without taking
> i_mutex.  Raw isn't relevant to the problem (although ocfs2 might be).  But
> we cannot assume that all callers have taken i_mutex, I think.
>
> I guess we can make that a rule (document it, add
> BUG_ON(!mutex_is_locked(..)) if it isn't a blockdev) if needs be.  After
> really checking that this matches reality for all callers.
I've checked generic_file_aio_write_nolock() callers for non blockdev.
Only ocfs2 call it explicitly, and do it under i_mutex.
So we need to do: 
1) Change wrong comments.
2) Add BUG_ON(!mutex_is_locked(..)) for non blkdev.
3) Invoke vmtruncate only for non blkdev.

Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
---
diff --git a/mm/filemap.c b/mm/filemap.c
index 7b84dc8..540ef5e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2046,8 +2046,8 @@ generic_file_direct_write(struct kiocb *
/*
 * Sync the fs metadata but not the minor inode changes and
 * of course not the data as we did direct DMA for the IO.
-* i_mutex is held, which protects generic_osync_inode() from
-* livelocking.
+* i_mutex may not being held, if so some specific locking
+* ordering must protect generic_osync_inode() from livelocking.
 */
if (written >= 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
int err = generic_osync_inode(inode, mapping, OSYNC_METADATA);
@@ -2282,6 +2282,17 @@ __generic_file_aio_write_nolock(struct k
 
written = generic_file_direct_write(iocb, iov, &nr_segs, pos,
ppos, count, ocount);
+   /*
+* If host is not S_ISBLK generic_file_direct_write() may 
+* have instantiated a few blocks outside i_size  files
+* Trim these off again.
+*/
+   if (unlikely(written < 0) && !S_ISBLK(inode->i_mode)) {
+   loff_t isize = i_size_read(inode);
+   if (pos + count > isize)
+   vmtruncate(inode, isize);
+   }
+
if (written < 0 || written == count)
goto out;
/*
@@ -2344,6 +2355,13 @@ ssize_t generic_file_aio_write_nolock(st
ssize_t ret;
 
BUG_ON(iocb->ki_pos != pos);
+   /*
+*  generic_file_buffered_write() may be called inside 
+*  __generic_file_aio_write_nolock() even in case of
+*  O_DIRECT for non S_ISBLK files. So i_mutex must be held.
+*/
+   if (!S_ISBLK(inode->i_mode))
+   BUG_ON(!mutex_is_locked(&inode->i_mutex));
 
ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs,
&iocb->ki_pos);
@@ -2386,8 +2404,8 @@ ssize_t generic_file_aio_write(struct ki
 EXPORT_SYMBOL(generic_file_aio_write);
 
 /*
- * Called under i_mutex for writes to S_ISREG files.   Returns -EIO if 
something
- * went wrong during pagecache shootdown.
+ * May be called without i_mutex for writes to S_ISREG files.
+ * Returns -EIO if something went wrong during pagecache shootdown.
  */
 static ssize_t
 generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,

[PATCH] incorrect direct io error handling

2006-12-18 Thread Dmitriy Monakhov

This patch is result of discussion started week ago here:
http://lkml.org/lkml/2006/12/11/66
changes from original patch:
 - Update wrong comments about i_mutex locking.
 - Add BUG_ON(!mutex_is_locked(..)) for non blkdev. 
 - vmtruncate call only for non blockdev
LOG:
If generic_file_direct_write() has fail (ENOSPC condition) inside 
__generic_file_aio_write_nolock() it may have instantiated
a few blocks outside i_size. And fsck will complain about wrong i_size
(ext2, ext3 and reiserfs interpret i_size and biggest block difference as 
error),
after fsck will fix error i_size will be increased to the biggest block,
but this blocks contain gurbage from previous write attempt, this is not 
information leak, but its silence file data corruption. This issue affect 
fs regardless the values of blocksize or pagesize.
We need truncate any block beyond i_size after write have failed , do in simular
generic_file_buffered_write() error path. If host is !S_ISBLK i_mutex always
held inside generic_file_aio_write_nolock() and we may safely call vmtruncate().
Some fs (XFS at least) may directly call generic_file_direct_write()with 
i_mutex not held. There is no general scenario in this case. This fs have to 
handle generic_file_direct_write() error by its own specific way (place).  

Issue was found during OpenVZ kernel testing.

Exampe:
open("mnt2/FILE3", O_WRONLY|O_CREAT|O_DIRECT, 0666) = 3
write(3, "aa"..., 4096) = -1 ENOSPC (No space left on device)

stat mnt2/FILE3
File: `mnt2/FILE3'
Size: 0   Blocks: 4  IO Block: 4096   regular empty file
>>>>>>>>>>>>>>>>>>>>>>^^ file size is less than biggest block idx
Device: 700h/1792d  Inode: 14  Links: 1
Access: (0644/-rw-r--r--)  Uid: (0/root)   Gid: (0/root)

fsck.ext2 -f -n  mnt1/fs_img
Pass 1: Checking inodes, blocks, and sizes
Inode 14, i_size is 0, should be 2048.  Fix? no

Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
-
diff --git a/mm/filemap.c b/mm/filemap.c
index 8332c77..7c571dd 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2044,8 +2044,9 @@ generic_file_direct_write(struct kiocb *
/*
 * Sync the fs metadata but not the minor inode changes and
 * of course not the data as we did direct DMA for the IO.
-* i_mutex is held, which protects generic_osync_inode() from
-* livelocking.  AIO O_DIRECT ops attempt to sync metadata here.
+* i_mutex may not being held (XFS does this), if so some specific 
locking
+* ordering must protect generic_osync_inode() from livelocking.
+* AIO O_DIRECT ops attempt to sync metadata here.
 */
if ((written >= 0 || written == -EIOCBQUEUED) &&
((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
@@ -2279,6 +2280,17 @@ __generic_file_aio_write_nolock(struct k
 
written = generic_file_direct_write(iocb, iov, &nr_segs, pos,
ppos, count, ocount);
+   /*
+* If host is not S_ISBLK generic_file_direct_write() may 
+* have instantiated a few blocks outside i_size  files
+* Trim these off again.
+*/
+   if (unlikely(written < 0) && !S_ISBLK(inode->i_mode)) {
+   loff_t isize = i_size_read(inode);
+   if (pos + count > isize)
+   vmtruncate(inode, isize);
+   }
+
if (written < 0 || written == count)
goto out;
/*
@@ -2341,6 +2353,13 @@ ssize_t generic_file_aio_write_nolock(st
ssize_t ret;
 
BUG_ON(iocb->ki_pos != pos);
+   /*
+*  generic_file_buffered_write() may be called inside 
+*  __generic_file_aio_write_nolock() even in case of
+*  O_DIRECT for non S_ISBLK files. So i_mutex must be held.
+*/
+   if (!S_ISBLK(inode->i_mode))
+   BUG_ON(!mutex_is_locked(&inode->i_mutex));
 
ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs,
&iocb->ki_pos);
@@ -2383,8 +2402,8 @@ ssize_t generic_file_aio_write(struct ki
 EXPORT_SYMBOL(generic_file_aio_write);
 
 /*
- * Called under i_mutex for writes to S_ISREG files.   Returns -EIO if 
something
- * went wrong during pagecache shootdown.
+ * May be called without i_mutex for writes to S_ISREG files. XFS does this.
+ * Returns -EIO if something went wrong during pagecache shootdown.
  */
 static ssize_t
 generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,

Re: [PATCH] incorrect direct io error handling

2006-12-18 Thread Dmitriy Monakhov

David Chinner <[EMAIL PROTECTED]> writes:

> On Mon, Dec 18, 2006 at 04:22:44PM +0300, Dmitriy Monakhov wrote:
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 8332c77..7c571dd 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -2044,8 +2044,9 @@ generic_file_direct_write(struct kiocb *
>>  /*
>>   * Sync the fs metadata but not the minor inode changes and
>>   * of course not the data as we did direct DMA for the IO.
>> - * i_mutex is held, which protects generic_osync_inode() from
>> - * livelocking.  AIO O_DIRECT ops attempt to sync metadata here.
>> + * i_mutex may not being held (XFS does this), if so some specific 
>> locking
>> + * ordering must protect generic_osync_inode() from livelocking.
>> + * AIO O_DIRECT ops attempt to sync metadata here.
>>   */
>>  if ((written >= 0 || written == -EIOCBQUEUED) &&
>>  ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
>> @@ -2279,6 +2280,17 @@ __generic_file_aio_write_nolock(struct k
>>  
>>  written = generic_file_direct_write(iocb, iov, &nr_segs, pos,
>>  ppos, count, ocount);
>> +/*
>> + * If host is not S_ISBLK generic_file_direct_write() may 
>> + * have instantiated a few blocks outside i_size  files
>> + * Trim these off again.
>> + */
>> +if (unlikely(written < 0) && !S_ISBLK(inode->i_mode)) {
>> +loff_t isize = i_size_read(inode);
>> +if (pos + count > isize)
>> +vmtruncate(inode, isize);
>> +}
>> +
>>  if (written < 0 || written == count)
>>  goto out;
>
> You comment in the first hunk that i_mutex may not be held here,
> but there's no comment in __generic_file_aio_write_nolock() that the
> i_mutex must be held for !S_ISBLK devices.
Any one may call directly call generic_file_direct_write() with i_mutex not 
held. 
>
>> @@ -2341,6 +2353,13 @@ ssize_t generic_file_aio_write_nolock(st
>>  ssize_t ret;
>>  
>>  BUG_ON(iocb->ki_pos != pos);
>> +/*
>> + *  generic_file_buffered_write() may be called inside 
>> + *  __generic_file_aio_write_nolock() even in case of
>> + *  O_DIRECT for non S_ISBLK files. So i_mutex must be held.
>> + */
>> +if (!S_ISBLK(inode->i_mode))
>> +BUG_ON(!mutex_is_locked(&inode->i_mutex));
>>  
>>  ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs,
>>  &iocb->ki_pos);
>
> I note that you comment here in generic_file_aio_write_nolock(),
> but it's not immediately obvious that this is refering to the
> vmtruncate() call in __generic_file_aio_write_nolock().
This is not about vmtruncate(). __generic_file_aio_write_nolock() may 
call generic_file_buffered_write() even in case of O_DIRECT for !S_ISBLK, and 
generic_file_buffered_write() has documented locking rules (i_mutex held).
IMHO it is important to explicitly document this . And after we realize
that i_mutex always held, vmtruncate() may be safely called.
>
> IOWs, wouldn't it be better to put this comment and check in
> __generic_file_aio_write_nolock() directly above the vmtruncate()
> call that cares about this?
>
>> @@ -2383,8 +2402,8 @@ ssize_t generic_file_aio_write(struct ki
>>  EXPORT_SYMBOL(generic_file_aio_write);
>>  
>>  /*
>> - * Called under i_mutex for writes to S_ISREG files.   Returns -EIO if 
>> something
>> - * went wrong during pagecache shootdown.
>> + * May be called without i_mutex for writes to S_ISREG files. XFS does this.
>> + * Returns -EIO if something went wrong during pagecache shootdown.
>>   */
>
> Not sure you need to say "XFS does this" - other filesystems may do this
> in the future.
Yes, but where are multiple comments about "reiserfs does this" in fs/buffer.c

>
> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> Principal Engineer
> SGI Australian Software Group
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] incorrect direct io error handling

2006-12-18 Thread Dmitriy Monakhov

"Chen, Kenneth W" <[EMAIL PROTECTED]> writes:

> Dmitriy Monakhov wrote on Monday, December 18, 2006 5:23 AM
>> This patch is result of discussion started week ago here:
>> http://lkml.org/lkml/2006/12/11/66
>> changes from original patch:
>>  - Update wrong comments about i_mutex locking.
>>  - Add BUG_ON(!mutex_is_locked(..)) for non blkdev. 
>>  - vmtruncate call only for non blockdev
>> LOG:
>> If generic_file_direct_write() has fail (ENOSPC condition) inside 
>> __generic_file_aio_write_nolock() it may have instantiated
>> a few blocks outside i_size. And fsck will complain about wrong i_size
>> (ext2, ext3 and reiserfs interpret i_size and biggest block difference as 
>> error),
>> after fsck will fix error i_size will be increased to the biggest block,
>> but this blocks contain gurbage from previous write attempt, this is not 
>> information leak, but its silence file data corruption. This issue affect 
>> fs regardless the values of blocksize or pagesize.
>> We need truncate any block beyond i_size after write have failed , do in 
>> simular
>> generic_file_buffered_write() error path. If host is !S_ISBLK i_mutex always
>> held inside generic_file_aio_write_nolock() and we may safely call 
>> vmtruncate().
>> Some fs (XFS at least) may directly call generic_file_direct_write()with 
>> i_mutex not held. There is no general scenario in this case. This fs have to 
>> handle generic_file_direct_write() error by its own specific way (place).
>>   
>
>
> I'm puzzled that if ext2 is able to instantiate some blocks, then why does it
> return no space error?  Where is the error coming from?
generic_file_aio_write_nolock()
 ->generic_file_direct_write()
   ->generic_file_direct_IO()
 ->ext2_direct_IO(WRITE,...)
   ->blockdev_direct_IO( ,ext2_get_block,...)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] Broadcom 4400 resume small fix

2007-01-08 Thread Dmitriy Monakhov

Return value  of 'pci_enable_device' was ignored in b44_resume().
We can't ingore it because it can fail.

Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
---
diff --git a/drivers/net/b44.c b/drivers/net/b44.c
index 5eb2ec6..63de31b 100644
--- a/drivers/net/b44.c
+++ b/drivers/net/b44.c
@@ -2315,9 +2315,15 @@ static int b44_resume(struct pci_dev *pd
 {
struct net_device *dev = pci_get_drvdata(pdev);
struct b44 *bp = netdev_priv(dev);
+   int err;
 
pci_restore_state(pdev);
-   pci_enable_device(pdev);
+   err = pci_enable_device(pdev);
+   if (err) {
+   dev_err(&pdev->dev, "Cannot enable PCI device, "
+  "aborting.\n");
+   return err;
+   }
pci_set_master(pdev);
 
if (!netif_running(dev))

pci_driver resume() callback return value

2007-01-08 Thread Dmitriy Monakhov

While investigating pci driver model i've found strange thing.
struct pci_driver has resume() callback and it declarated like follows
int  (*resume) (struct pci_dev *dev);/* Device woken up */
Documentation states about it:
from Documentation/pci-error-recovery.txt
STEP 5: Resume Operations
-
The platform will call the resume() callback on all affected device
drivers if all drivers on the segment have returned
PCI_ERS_RESULT_RECOVERED from one of the 3 previous callbacks.
The goal of this callback is to tell the driver to restart activity,
that everything is back and running.
This callback does not return a result code.

...
But most of pci devices may return correct error code if error happens
during resume(), for example e1000 implement it as follows:
static int e1000_resume(struct pci_dev *pdev)
{
struct net_device *netdev = pci_get_drvdata(pdev);
struct e1000_adapter *adapter = netdev_priv(netdev);
uint32_t err;
.
if ((err = pci_enable_device(pdev))) {
   printk(KERN_ERR "e1000: Cannot enable PCI device from suspend\n");
   return err;
}
.
Even generic pci routines may return nonzero err code and upper(but_type) level
routunes properly handles it:
from drivers/pci/pci-driver.c:321
static int pci_device_resume(struct device * dev)
{
int error;
struct pci_dev * pci_dev = to_pci_dev(dev);
struct pci_driver * drv = pci_dev->driver;

if (drv && drv->resume)
error = drv->resume(pci_dev);
else
error = pci_default_resume(pci_dev);
return error;
}

Now question. Should we,or should we not return error code from resume callback?
Where a two possible ways:
a) Comment in document section is out of date and we have to properly handle 
   and return error code if something goes wrong.
b) Comment in document section is correct and and dont have to worry about any 
error, and return code  from resume() callback.
As i understand (a) is correct answer.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 0/5] fixing errors handling during pci_driver resume stage

2007-01-09 Thread Dmitriy Monakhov

Where are several places where errors ignored during pci_driver resume stage.
In most most cases return value of 'pci_enable_device()' was ignored.
drivers from such subsystems affected:
 - ata
 - fusion
 - ide
 - mmc
 - net
 - parisc
 - parport
 - pci
 - serial

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 1/5] fixing errors handling during pci_driver resume stage [net]

2007-01-09 Thread Dmitriy Monakhov

network pci drivers have to return correct error code during resume stage in
case of errors.
Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
-
diff --git a/Makefile b/Makefile
diff --git a/drivers/net/3c59x.c b/drivers/net/3c59x.c
index 80bdcf8..ed06e48 100644
--- a/drivers/net/3c59x.c
+++ b/drivers/net/3c59x.c
@@ -823,11 +823,18 @@ static int vortex_resume(struct pci_dev
 {
struct net_device *dev = pci_get_drvdata(pdev);
struct vortex_private *vp = netdev_priv(dev);
+   int err;
 
if (dev && vp) {
pci_set_power_state(pdev, PCI_D0);
pci_restore_state(pdev);
-   pci_enable_device(pdev);
+   err = pci_enable_device(pdev);
+   if (err) {
+   dev_err(&pdev->dev, "Cannot enable PCI device, "
+   " aborting.\n");
+   return err;
+   }
+
pci_set_master(pdev);
if (request_irq(dev->irq, vp->full_bus_master_rx ?
&boomerang_interrupt : &vortex_interrupt, 
IRQF_SHARED, dev->name, dev)) {
diff --git a/drivers/net/b44.c b/drivers/net/b44.c
index 5eb2ec6..a88b346 100644
--- a/drivers/net/b44.c
+++ b/drivers/net/b44.c
@@ -2315,16 +2315,25 @@ static int b44_resume(struct pci_dev *pd
 {
struct net_device *dev = pci_get_drvdata(pdev);
struct b44 *bp = netdev_priv(dev);
+   int err;
 
pci_restore_state(pdev);
-   pci_enable_device(pdev);
+   err = pci_enable_device(pdev);
+   if (err) {
+   dev_err(&pdev->dev, "Cannot enable PCI device, aborting.\n");
+   return err;
+   }
+
pci_set_master(pdev);
 
if (!netif_running(dev))
return 0;
 
-   if (request_irq(dev->irq, b44_interrupt, IRQF_SHARED, dev->name, dev))
+   if (request_irq(dev->irq, b44_interrupt, IRQF_SHARED, dev->name, dev)) {
printk(KERN_ERR PFX "%s: request_irq failed\n", dev->name);
+   pci_disable_device(pdev);
+   return -EBUSY;
+   }
 
spin_lock_irq(&bp->lock);
 
diff --git a/drivers/net/eepro100.c b/drivers/net/eepro100.c
index e28bb1e..698f974 100644
--- a/drivers/net/eepro100.c
+++ b/drivers/net/eepro100.c
@@ -2292,10 +2292,16 @@ static int eepro100_resume(struct pci_de
struct net_device *dev = pci_get_drvdata (pdev);
struct speedo_private *sp = netdev_priv(dev);
void __iomem *ioaddr = sp->regs;
+   int err;
 
pci_set_power_state(pdev, PCI_D0);
pci_restore_state(pdev);
-   pci_enable_device(pdev);
+   err = pci_enable_device(pdev);
+   if (err) {
+   dev_err(&pdev->dev, "Cannot enable PCI device, aborting.\n");
+   return err;
+   }
+
pci_set_master(pdev);
 
if (!netif_running(dev))
diff --git a/drivers/net/natsemi.c b/drivers/net/natsemi.c
index ffa0afd..57a8f36 100644
--- a/drivers/net/natsemi.c
+++ b/drivers/net/natsemi.c
@@ -3195,13 +3195,20 @@ static int natsemi_resume (struct pci_de
 {
struct net_device *dev = pci_get_drvdata (pdev);
struct netdev_private *np = netdev_priv(dev);
+   int err;
 
rtnl_lock();
if (netif_device_present(dev))
goto out;
if (netif_running(dev)) {
BUG_ON(!np->hands_off);
-   pci_enable_device(pdev);
+   err = pci_enable_device(pdev);
+   if (err) {
+   dev_err(&pdev->dev, "Cannot enable PCI device, "
+   "aborting.\n");
+   return err;
+   }
+
/*  pci_power_on(pdev); */
 
natsemi_reset(dev);
diff --git a/drivers/net/ne2k-pci.c b/drivers/net/ne2k-pci.c
index 589785d..e8aea03 100644
--- a/drivers/net/ne2k-pci.c
+++ b/drivers/net/ne2k-pci.c
@@ -670,10 +670,15 @@ static int ne2k_pci_suspend (struct pci_
 static int ne2k_pci_resume (struct pci_dev *pdev)
 {
struct net_device *dev = pci_get_drvdata (pdev);
+   int err;
 
pci_set_power_state(pdev, 0);
pci_restore_state(pdev);
-   pci_enable_device(pdev);
+   err = pci_enable_device(pdev);
+   if (err) {
+   dev_err(&pdev->dev, "Cannot enable PCI device, aborting.\n");
+   return err;
+   }
NS8390_init(dev, 1);
netif_device_attach(dev);
 
diff --git a/drivers/net/sk98lin/skge.c b/drivers/net/sk98lin/skge.c
index 92d11b9..c9de11f 100644
--- a/drivers/net/sk98lin/skge.c
+++ b/drivers/net/sk98lin/skge.c
@@ -5125,7 +5125,12 @@ static int skge_resume(struct pci_dev *p
 
pci_set_power_state(pdev, PCI_D0);
pci_restore_state(pdev);
-   pci_enable_device(pdev);
+   ret = pci_enable_device(pdev);
+

[PATCH 2/5] fixing errors handling during pci_driver resume stage [ata]

2007-01-09 Thread Dmitriy Monakhov

ata pci drivers have to return correct error code during resume stage in
case of errors.
Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
-
diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
index b517d24..0656334 100644
--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -1372,7 +1372,9 @@ static int ahci_pci_device_resume(struct
void __iomem *mmio = host->mmio_base;
int rc;
 
-   ata_pci_device_do_resume(pdev);
+   rc = ata_pci_device_do_resume(pdev);
+   if (rc)
+   return rc;
 
if (pdev->dev.power.power_state.event == PM_EVENT_SUSPEND) {
rc = ahci_reset_controller(mmio, pdev);
diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index 0d51d13..3d6729d 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -6221,12 +6221,19 @@ void ata_pci_device_do_suspend(struct pc
}
 }
 
-void ata_pci_device_do_resume(struct pci_dev *pdev)
+int ata_pci_device_do_resume(struct pci_dev *pdev)
 {
+   int ret = 0; 
pci_set_power_state(pdev, PCI_D0);
pci_restore_state(pdev);
-   pci_enable_device(pdev);
+   ret = pci_enable_device(pdev);
+   if (ret) {
+   dev_err(&pdev->dev, "Cannot enable PCI device, aborting.\n");
+   return ret;
+   }
+   
pci_set_master(pdev);
+   return ret;
 }
 
 int ata_pci_device_suspend(struct pci_dev *pdev, pm_message_t mesg)
@@ -6246,8 +6253,10 @@ int ata_pci_device_suspend(struct pci_de
 int ata_pci_device_resume(struct pci_dev *pdev)
 {
struct ata_host *host = dev_get_drvdata(&pdev->dev);
+   int err;
 
-   ata_pci_device_do_resume(pdev);
+   if ((err = ata_pci_device_do_resume(pdev)))
+   return err;
ata_host_resume(host);
return 0;
 }
diff --git a/drivers/ata/sata_sil.c b/drivers/ata/sata_sil.c
index 7808d03..8451143 100644
--- a/drivers/ata/sata_sil.c
+++ b/drivers/ata/sata_sil.c
@@ -710,8 +710,12 @@ err_out:
 static int sil_pci_device_resume(struct pci_dev *pdev)
 {
struct ata_host *host = dev_get_drvdata(&pdev->dev);
+   int err;
+
+   err = ata_pci_device_do_resume(pdev);
+   if (err)
+   return err;
 
-   ata_pci_device_do_resume(pdev);
sil_init_controller(pdev, host->n_ports, host->ports[0]->flags,
host->mmio_base);
ata_host_resume(host);
diff --git a/drivers/ata/sata_sil24.c b/drivers/ata/sata_sil24.c
index 5aa288d..be6971a 100644
--- a/drivers/ata/sata_sil24.c
+++ b/drivers/ata/sata_sil24.c
@@ -1200,8 +1200,11 @@ static int sil24_pci_device_resume(struc
 {
struct ata_host *host = dev_get_drvdata(&pdev->dev);
struct sil24_host_priv *hpriv = host->private_data;
+   int err;
 
-   ata_pci_device_do_resume(pdev);
+   err = ata_pci_device_do_resume(pdev);
+   if (err)
+   return err;
 
if (pdev->dev.power.power_state.event == PM_EVENT_SUSPEND)
writel(HOST_CTRL_GLOBAL_RST, hpriv->host_base + HOST_CTRL);
diff --git a/include/linux/libata.h b/include/linux/libata.h
index ab27548..283ca72 100644
--- a/include/linux/libata.h
+++ b/include/linux/libata.h
@@ -719,7 +719,7 @@ extern int ata_pci_init_one (struct pci_
 unsigned int n_ports);
 extern void ata_pci_remove_one (struct pci_dev *pdev);
 extern void ata_pci_device_do_suspend(struct pci_dev *pdev, pm_message_t mesg);
-extern void ata_pci_device_do_resume(struct pci_dev *pdev);
+extern int ata_pci_device_do_resume(struct pci_dev *pdev);
 extern int ata_pci_device_suspend(struct pci_dev *pdev, pm_message_t mesg);
 extern int ata_pci_device_resume(struct pci_dev *pdev);
 extern int ata_pci_clear_simplex(struct pci_dev *pdev);

[PATCH 3/5] fixing errors handling during pci_driver resume stage [fusion]

2007-01-09 Thread Dmitriy Monakhov

fusion pci drivers have to return correct error code during resume stage in
case of errors.
Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
-
diff --git a/drivers/message/fusion/mptbase.c b/drivers/message/fusion/mptbase.c
index 6e068cf..51a3621 100644
--- a/drivers/message/fusion/mptbase.c
+++ b/drivers/message/fusion/mptbase.c
@@ -1524,7 +1524,7 @@ mpt_resume(struct pci_dev *pdev)
 {
MPT_ADAPTER *ioc = pci_get_drvdata(pdev);
u32 device_state = pdev->current_state;
-   int recovery_state;
+   int recovery_state, err;
 
printk(MYIOC_s_INFO_FMT
"pci-resume: pdev=0x%p, slot=%s, Previous operating state [D%d]\n",
@@ -1532,7 +1532,11 @@ mpt_resume(struct pci_dev *pdev)
 
pci_set_power_state(pdev, 0);
pci_restore_state(pdev);
-   pci_enable_device(pdev);
+   err = pci_enable_device(pdev);
+   if (err) {
+   dev_err(&pdev->dev, "Cannot enable PCI device, aborting.\n");
+   return err;
+   }
 
/* enable interrupts */
CHIPREG_WRITE32(&ioc->chip->IntMask, MPI_HIM_DIM);
diff --git a/drivers/message/fusion/mptscsih.c 
b/drivers/message/fusion/mptscsih.c
index 2c72c36..5324866 100644
--- a/drivers/message/fusion/mptscsih.c
+++ b/drivers/message/fusion/mptscsih.c
@@ -1120,8 +1120,10 @@ mptscsih_resume(struct pci_dev *pdev)
MPT_ADAPTER *ioc = pci_get_drvdata(pdev);
struct Scsi_Host*host = ioc->sh;
MPT_SCSI_HOST   *hd;
+   int err;
 
-   mpt_resume(pdev);
+   if ((err = mpt_resume(pdev)))
+   return err;
 
if(!host)
return 0;

[PATCH 4/5] fixing errors handling during pci_driver resume stage [misc]

2007-01-09 Thread Dmitriy Monakhov

pci drivers have to return correct error code during resume stage in
case of errors.
Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
-
diff --git a/drivers/ide/pci/sc1200.c b/drivers/ide/pci/sc1200.c
index ff80937..a426905 100644
--- a/drivers/ide/pci/sc1200.c
+++ b/drivers/ide/pci/sc1200.c
@@ -394,10 +394,16 @@ static int sc1200_suspend (struct pci_de
 static int sc1200_resume (struct pci_dev *dev)
 {
ide_hwif_t  *hwif = NULL;
+   int err;
 
pci_set_power_state(dev, PCI_D0);   // bring chip back from sleep 
state
dev->current_state = PM_EVENT_ON;
-   pci_enable_device(dev);
+   err = pci_enable_device(dev);
+   if (err) {
+   dev_err(&dev->dev, "Cannot enable PCI device, aborting.\n");
+   return err;
+   }
+
//
// loop over all interfaces that are part of this pci device:
//
diff --git a/drivers/misc/tifm_7xx1.c b/drivers/misc/tifm_7xx1.c
index 2ab7add..d659ad9 100644
--- a/drivers/misc/tifm_7xx1.c
+++ b/drivers/misc/tifm_7xx1.c
@@ -274,10 +274,15 @@ static int tifm_7xx1_suspend(struct pci_
 static int tifm_7xx1_resume(struct pci_dev *dev)
 {
struct tifm_adapter *fm = pci_get_drvdata(dev);
-   unsigned long flags;
+   unsigned long flags, err;
 
pci_restore_state(dev);
-pci_enable_device(dev);
+   err = pci_enable_device(dev);
+   if (err) {
+   dev_err(&dev->dev, "Cannot enable PCI device, aborting.\n");
+   return err;
+   }
+
 pci_set_power_state(dev, PCI_D0);
 pci_set_master(dev);
 
diff --git a/drivers/mmc/sdhci.c b/drivers/mmc/sdhci.c
index c2d13d7..736f74c 100644
--- a/drivers/mmc/sdhci.c
+++ b/drivers/mmc/sdhci.c
@@ -1109,7 +1109,11 @@ static int sdhci_resume (struct pci_dev
 
pci_set_power_state(pdev, PCI_D0);
pci_restore_state(pdev);
-   pci_enable_device(pdev);
+   ret = pci_enable_device(pdev);
+   if (ret) {
+   dev_err(&pdev->dev, "Cannot enable PCI device, aborting.\n");
+   return ret;
+   }
 
for (i = 0;i < chip->num_slots;i++) {
if (!chip->hosts[i])
diff --git a/drivers/parisc/superio.c b/drivers/parisc/superio.c
index 1fd97f7..4428ffa 100644
--- a/drivers/parisc/superio.c
+++ b/drivers/parisc/superio.c
@@ -199,7 +199,8 @@ superio_init(struct pci_dev *pcidev)
pci_write_config_word (pdev, PCI_COMMAND, word);
 
pci_set_master (pdev);
-   pci_enable_device(pdev);
+   if (pci_enable_device(pdev))
+   return;
 
/*
 * Next project is programming the onboard interrupt controllers.
@@ -275,6 +276,7 @@ superio_init(struct pci_dev *pcidev)
SUPERIO, (void *)sio)) {
 
printk(KERN_ERR PFX "could not get irq\n");
+   pci_disable_device(pdev);
BUG();
return;
}

[PATCH 5/5] fixing errors handling during pci_driver resume stage [serial]

2007-01-09 Thread Dmitriy Monakhov

serial pci drivers have to return correct error code during resume stage in
case of errors.
Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
-
diff --git a/drivers/parport/parport_serial.c b/drivers/parport/parport_serial.c
index 78c0a26..1e14906 100644
--- a/drivers/parport/parport_serial.c
+++ b/drivers/parport/parport_serial.c
@@ -392,6 +392,7 @@ static int parport_serial_pci_suspend(st
 static int parport_serial_pci_resume(struct pci_dev *dev)
 {
struct parport_serial_private *priv = pci_get_drvdata(dev);
+   int err;
 
pci_set_power_state(dev, PCI_D0);
pci_restore_state(dev);
@@ -399,7 +400,11 @@ static int parport_serial_pci_resume(str
/*
 * The device may have been disabled.  Re-enable it.
 */
-   pci_enable_device(dev);
+   err = pci_enable_device(dev);
+   if (err) {
+   dev_err(&dev->dev, "Cannot enable PCI device, aborting.\n");
+   return err;
+   }
 
if (priv->serial)
pciserial_resume_ports(priv->serial);
diff --git a/drivers/serial/8250_pci.c b/drivers/serial/8250_pci.c
index 52e2e64..e26e4a6 100644
--- a/drivers/serial/8250_pci.c
+++ b/drivers/serial/8250_pci.c
@@ -1805,6 +1805,7 @@ static int pciserial_suspend_one(struct
 static int pciserial_resume_one(struct pci_dev *dev)
 {
struct serial_private *priv = pci_get_drvdata(dev);
+   int err;
 
pci_set_power_state(dev, PCI_D0);
pci_restore_state(dev);
@@ -1813,7 +1814,12 @@ static int pciserial_resume_one(struct p
/*
 * The device may have been disabled.  Re-enable it.
 */
-   pci_enable_device(dev);
+   err = pci_enable_device(dev);
+   if (err) {
+   dev_err(&dev->dev, "Cannot enable PCI device, "
+   "aborting.\n");
+   return err;
+   }
 
pciserial_resume_ports(priv);
}
diff --git a/drivers/serial/serial_txx9.c b/drivers/serial/serial_txx9.c
index 7186a82..583cdc8 100644
--- a/drivers/serial/serial_txx9.c
+++ b/drivers/serial/serial_txx9.c
@@ -1132,12 +1132,19 @@ static int pciserial_txx9_suspend_one(st
 static int pciserial_txx9_resume_one(struct pci_dev *dev)
 {
int line = (int)(long)pci_get_drvdata(dev);
+   int err;
 
pci_set_power_state(dev, PCI_D0);
pci_restore_state(dev);
 
if (line) {
-   pci_enable_device(dev);
+   err = pci_enable_device(dev);
+   if (err) {
+   dev_err(&dev->dev, "Cannot enable PCI device, "
+   "aborting.\n");
+   return err;
+   }
+
serial_txx9_resume_port(line);
}
return 0;

Re: [PATCH 5/5] fixing errors handling during pci_driver resume stage [serial]

2007-01-09 Thread Dmitriy Monakhov

Russell King <[EMAIL PROTECTED]> writes:

> On Tue, Jan 09, 2007 at 12:01:58PM +0300, Dmitriy Monakhov wrote:
>> serial pci drivers have to return correct error code during resume stage in
>> case of errors.
>
> Sigh.  *hate* *hate* *hate*.
>
>> diff --git a/drivers/serial/8250_pci.c b/drivers/serial/8250_pci.c
>> index 52e2e64..e26e4a6 100644
>> --- a/drivers/serial/8250_pci.c
>> +++ b/drivers/serial/8250_pci.c
>> @@ -1805,6 +1805,7 @@ static int pciserial_suspend_one(struct
>>  static int pciserial_resume_one(struct pci_dev *dev)
>>  {
>>  struct serial_private *priv = pci_get_drvdata(dev);
>> +int err;
>>  
>>  pci_set_power_state(dev, PCI_D0);
>>  pci_restore_state(dev);
>> @@ -1813,7 +1814,12 @@ static int pciserial_resume_one(struct p
>>  /*
>>   * The device may have been disabled.  Re-enable it.
>>   */
>> -pci_enable_device(dev);
>> +err = pci_enable_device(dev);
>> +if (err) {
>> +dev_err(&dev->dev, "Cannot enable PCI device, "
>> +"aborting.\n");
>> +return err;
>> +}
>>  
>>  pciserial_resume_ports(priv);
>>  }
>
> So if pci_enable_device() fails, what do we do with the still suspended
> serial port?  Does it clean up that state?  Probably not.
>
> Look, merely going around bunging this stupid "oh lets propagate the
> error" crap into the kernel doesn't actually fix _anything_.  In fact
> it potentially _hides_ the warnings produced by __must_check which
> give a hint that _something_ needs to be done to _properly_ fix the
> problem.
>
> And by "properly", I mean not just merely propagating the error.
>
> In this particular case, the above may result in resources not being
> freed.
Yep 100% true. But the question is _HOW_? We want shutdown not enabled device.
Is it safe just call pciserial_remove_ports() for this device? 
>
> -- 
> Russell King
>  Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
>  maintainer of:
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] incorrect direct io error handling

2007-01-10 Thread Dmitriy Monakhov


Sorry for long delay (russian holidays are very hard time :) )

David Chinner <[EMAIL PROTECTED]> writes:
> On Tue, Dec 19, 2006 at 09:07:12AM +0300, Dmitriy Monakhov wrote:
>> David Chinner <[EMAIL PROTECTED]> writes:
>> > On Mon, Dec 18, 2006 at 04:22:44PM +0300, Dmitriy Monakhov wrote:
>> >> diff --git a/mm/filemap.c b/mm/filemap.c
>> >> index 8332c77..7c571dd 100644
>> >> --- a/mm/filemap.c
>> >> +++ b/mm/filemap.c
>
> 
>
>> > You comment in the first hunk that i_mutex may not be held here,
>> > but there's no comment in __generic_file_aio_write_nolock() that the
>> > i_mutex must be held for !S_ISBLK devices.
>> Any one may call directly call generic_file_direct_write() with i_mutex not 
>> held. 
>
> Only block devices based on the implementation (i.e. buffered I/O is
> done here).  but one can't call vmtruncate without the i_mutex held,
> so if a filesystem is calling generic_file_direct_write() it won't
> be able to use __generic_file_aio_write_nolock() without the i_mutex
> held (because it can right now if it doesn't need the buffered I/O
> fallback path), then 
>
>> >
>> >> @@ -2341,6 +2353,13 @@ ssize_t generic_file_aio_write_nolock(st
>> >>   ssize_t ret;
>> >>  
>> >>   BUG_ON(iocb->ki_pos != pos);
>> >> + /*
>> >> +  *  generic_file_buffered_write() may be called inside 
>> >> +  *  __generic_file_aio_write_nolock() even in case of
>> >> +  *  O_DIRECT for non S_ISBLK files. So i_mutex must be held.
>> >> +  */
>> >> + if (!S_ISBLK(inode->i_mode))
>> >> + BUG_ON(!mutex_is_locked(&inode->i_mutex));
>> >>  
>> >>   ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs,
>> >>   &iocb->ki_pos);
>> >
>> > I note that you comment here in generic_file_aio_write_nolock(),
>> > but it's not immediately obvious that this is refering to the
>> > vmtruncate() call in __generic_file_aio_write_nolock().
>> This is not about vmtruncate(). __generic_file_aio_write_nolock() may 
>> call generic_file_buffered_write() even in case of O_DIRECT for !S_ISBLK, 
>> and 
>
> No, the need for i_mutex is currently dependent on doing direct I/O
> and the return value from generic_file_buffered_write().
> A filesystem that doesn't fall back to buffered I/O (e.g. XFS) can currently
> use generic_file_aio_write_nolock() without needing to hold i_mutex.
> use generic_file_aio_write_nolock() without needing to hold i_mutex.
But it doesn't use it. XFS implement it's own write method with it's own locking
rules and explicitly call generic_file_direct_write() in case of O_DIRECT.
BTW XFS correctly handling ENOSPC in case of O_DIRECT (fs corruption not happend
after error occur).

>
> Your change prevents that by introducing a vmtruncate() before the
> generic_file_buffered_write() return value check, which means that a
> filesystem now _must_ hold the i_mutex when calling
> generic_file_aio_write_nolock() even when it doesn't do buffered I/O
> through this path.
Yes it's so. But it is just explicitly document the fact that every fs call 
generic_file_aio_write_nolock() with i_mutex held (where is no any fs that
invoke it without i_mutex). As i understand Andrew Morton think so too: 
http://lkml.org/lkml/2006/12/12/67

  I guess we can make that a rule (document it, add
  BUG_ON(!mutex_is_locked(..)) if it isn't a blockdev) if needs be.  After
  really checking that this matches reality for all callers.


>
>> generic_file_buffered_write() has documented locking rules (i_mutex held).
>> IMHO it is important to explicitly document this . And after we realize
>> that i_mutex always held, vmtruncate() may be safely called.
>
> I don't think changing the locking semantics of
> generic_file_aio_write_nolock() to require a lock for all
> filesystem-based users is a good way to fix a filesystem specific
> direct I/O problem which can be easily fixed in filesystem specific
> code - i.e. call vmtruncate() in ext3_file_write() on failure
Where are more than 10 filesystems where we have to fix it then.
And fix is almost the same for all fs, so we have to do many copy/paste work
IMHO fix it inside generic_file_aio_write_nolock is realy straightforward way.
What do you think?
>
> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> Principal Engineer
> SGI Australian Software Group

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] Broadcom 4400 resume small fix (v2)

2007-01-10 Thread Dmitriy Monakhov

Changes from v1:
- Fix according to Francois Romieu comments.

LOG:
Some issues in b44_resume().
- Return value of pci_enable_device() was ignored.
- If request_irq() has failed we have to just disable device and exit.

Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
---
diff --git a/drivers/net/b44.c b/drivers/net/b44.c
index 5eb2ec6..42c57bf 100644
--- a/drivers/net/b44.c
+++ b/drivers/net/b44.c
@@ -2315,16 +2315,27 @@ static int b44_resume(struct pci_dev *pd
 {
struct net_device *dev = pci_get_drvdata(pdev);
struct b44 *bp = netdev_priv(dev);
+   int rc = 0;
 
pci_restore_state(pdev);
-   pci_enable_device(pdev);
+   rc = pci_enable_device(pdev);
+   if (rc) {
+   printk(KERN_ERR PFX "%s: pci_enable_device failed\n",
+   dev->name);
+   return rc;
+   }
+
pci_set_master(pdev);
 
if (!netif_running(dev))
return 0;
 
-   if (request_irq(dev->irq, b44_interrupt, IRQF_SHARED, dev->name, dev))
+   rc = request_irq(dev->irq, b44_interrupt, IRQF_SHARED, dev->name, dev);
+   if (rc) {
printk(KERN_ERR PFX "%s: request_irq failed\n", dev->name);
+   pci_disable_device(pdev);
+   return rc;
+   }
 
spin_lock_irq(&bp->lock);

Re: [PATCH 1/5] fixing errors handling during pci_driver resume stage [net]

2007-01-10 Thread Dmitriy Monakhov

Stephen Hemminger <[EMAIL PROTECTED]> writes:

> On Tue, 09 Jan 2007 12:01:14 +0300
> Dmitriy Monakhov <[EMAIL PROTECTED]> wrote:
>
>> network pci drivers have to return correct error code during resume stage in
>> case of errors.
>> Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
>> -
>
> Please don't introduce one dev_err() call into a device driver if all the 
> other
> error printout's use a different interface.
Ok. It was wrong decision to fix all affected drivers in one patchset.
Especially without respect to concrete driver specific.
I'll try to rewrite it in more slowly but (i hope) more correct way
>
> -- 
> Stephen Hemminger <[EMAIL PROTECTED]>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: + fs-introduce-write_begin-write_end-and-perform_write-aops.patch added to -mm tree

2007-06-13 Thread Dmitriy Monakhov

On 14:19 Втр 29 Май , [EMAIL PROTECTED] wrote:
> 
> The patch titled
>  fs: introduce write_begin, write_end, and perform_write aops
> has been added to the -mm tree.  Its filename is
>  fs-introduce-write_begin-write_end-and-perform_write-aops.patch
> 
> *** Remember to use Documentation/SubmitChecklist when testing your code ***
> 
> See http://www.zip.com.au/~akpm/linux/patches/stuff/added-to-mm.txt to find
> out what to do about this
> 
> --
> Subject: fs: introduce write_begin, write_end, and perform_write aops
> From: Nick Piggin <[EMAIL PROTECTED]>
> 
> These are intended to replace prepare_write and commit_write with more
> flexible alternatives that are also able to avoid the buffered write
> deadlock problems efficiently (which prepare_write is unable to do).
> 
> [EMAIL PROTECTED]: API design contributions, code review and fixes]
> Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>
> Signed-off-by: Mark Fasheh <[EMAIL PROTECTED]>
> Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
I've finaly find time to review Nick's "write_begin/write_end aop" patch set.
And i have some fixes and questions. My be i've missed somthing and it was 
already disscussed, but i cant find in LKML.

1) loop dev:
loop.c code itself is not perfect. In fact before Nick's patch
partial write was't possible. Assumption what write chunks are
always page aligned is realy weird ( see "index++" line).
Fixed by "new aop loop fix" patch

2)block_write_begin:
After we enter to block_write_begin with *pagep == NULL and
some page was grabed we remember this page in *pagep
And if __block_prepare_write() we have to clear *pagep , as 
it was before. Because this may confuse caller.
for example caller may have folowing code:
ret = block_write_begin(..., pagep,...)
if (ret && *pagep != NULL) {
unlock_page(*pagep);
page_cache_release(*pagep);
}
Fixed my "new aop block_write_begin fix" patch

3) __page_symlink:
Nick's patch add folowing code:
+ err = pagecache_write_begin(NULL, mapping, 0,PAGE_CACHE_SIZE,
+ AOP_FLAG_UNINTERRUPTIBLE, &page,&fsdata);
symlink now consume whole page. I have only one question "WHY???".
I don't see any advantages, but where are huge list of
dissdvantages:
a) fs with blksize == 1k and pagesize == 16k after this patch
   waste more than 10x times disk space for nothing.
b) What happends if we want use fs with blksize == 4k on i386
   after it was used by ia64 ??? (before this patch it was
   possible).

I dont prepare patch for this because i dont understand issue
witch Nick aimed to fix.

4) iov_iter_fault_in_readable:
Function prerform check for signgle region, with out respect to
segment nature of iovec, For example writev no longer works :) :
writev(3, [{"\1", 1}, {"\2"..., 4096}], 2) = -1 EFAULT (Bad address)
this hidden bug, and it was invisiable because _fault_in_readable
return value was ignored before. Lets iov_iter_fault_in_readable
perform checks for all segments.
Fixed by :"iov_iter_fault_in_readable fix"

5) ext3_write_end:
Before  write_begin/write_end patch set we have folowing locking
order:
stop_journal(handle);
unlock_page(page);
But now order is oposite:
unlock_page(page);
stop_journal(handle);
Can we got any race condition now? I'm not sure is it actual problem,
may be somebody cant describe this.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch] new aop loop fix

2007-06-13 Thread Dmitriy Monakhov

loop.c code itself is not perfect. In fact before Nick's patch
partial write was't possible. Assumption what write chunks are
always page aligned is realy weird ( see "index++" line).

Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 4bab9b1..8726da5 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -215,7 +215,6 @@ static int do_lo_send_aops(struct loop_device *lo, struct 
bio_vec *bvec,
int len, ret;
 
mutex_lock(&mapping->host->i_mutex);
-   index = pos >> PAGE_CACHE_SHIFT;
offset = pos & ((pgoff_t)PAGE_CACHE_SIZE - 1);
bv_offs = bvec->bv_offset;
len = bvec->bv_len;
@@ -226,6 +225,7 @@ static int do_lo_send_aops(struct loop_device *lo, struct 
bio_vec *bvec,
struct page *page;
void *fsdata;
 
+   index = pos >> PAGE_CACHE_SHIFT;
IV = ((sector_t)index << (PAGE_CACHE_SHIFT - 9))+(offset >> 9);
size = PAGE_CACHE_SIZE - offset;
if (size > len)
@@ -255,7 +255,6 @@ static int do_lo_send_aops(struct loop_device *lo, struct 
bio_vec *bvec,
bv_offs += copied;
len -= copied;
offset = 0;
-   index++;
pos += copied;
}
ret = 0;

> 
> 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch] new aop block_write_begin fix

2007-06-13 Thread Dmitriy Monakhov

After we enter to block_write_begin with *pagep == NULL and
some page was grabed we remember this page in *pagep
And if __block_prepare_write() we have to clear *pagep , as 
it was before. Because this may confuse caller.
for example caller may have folowing code:
ret = block_write_begin(..., pagep,...)
if (ret && *pagep != NULL) {
unlock_page(*pagep);
page_cache_release(*pagep);
}
Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>

diff --git a/fs/buffer.c b/fs/buffer.c
index 07cd457..df933ba 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1982,6 +1982,7 @@ int block_write_begin(struct file *file, struct 
address_space *mapping,
if (ownpage) {
unlock_page(page);
page_cache_release(page);
+   *pagep = NULL;
 
/*
 * prepare_write() may have instantiated a few blocks 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

iov_iter_fault_in_readable fix

2007-06-13 Thread Dmitriy Monakhov

Function prerform check for signgle region, with out respect to
segment nature of iovec, For example writev no longer works :)

/* TESTCASE BEGIN */
#include 
#include 
#include 
#include 
#include 
#include 
#define SIZE  (4096 * 2)
int main(int argc, char* argv[])
{   
char* ptr[4];
struct iovec iov[2];
int fd, ret;
ptr[0] = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
ptr[1] = mmap(NULL, SIZE, PROT_NONE,
MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
ptr[2] = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
ptr[3] = mmap(NULL, SIZE, PROT_NONE, 
MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);

iov[0].iov_base = ptr[0] + (SIZE -1);
iov[0].iov_len = 1;
memset(ptr[0], 1, SIZE);

iov[1].iov_base = ptr[2];
iov[1].iov_len = SIZE;
memset(ptr[2], 2, SIZE);

fd = open(argv[1], O_CREAT|O_RDWR|O_TRUNC, 0666);
ret = writev(fd, iov, sizeof(iov) / sizeof(struct iovec));
return 0;
}   
/* TESTCASE END*/
We will get folowing result:
writev(3, [{"\1", 1}, {"\2"..., 8192}], 2) = -1 EFAULT (Bad 
address)

this is hidden bug, and it was invisiable because _fault_in_readable
return value was ignored before. Lets iov_iter_fault_in_readable
perform checks for all segments.

Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>

diff --git a/include/linux/fs.h b/include/linux/fs.h
index fef19fc..7e025ea 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -433,7 +433,7 @@ size_t iov_iter_copy_from_user_atomic(struct page *page,
 size_t iov_iter_copy_from_user(struct page *page,
struct iov_iter *i, unsigned long offset, size_t bytes);
 void iov_iter_advance(struct iov_iter *i, size_t bytes);
-int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes);
+int iov_iter_fault_in_readable(struct iov_iter *i, size_t *bytes);
 size_t iov_iter_single_seg_count(struct iov_iter *i);
 
 static inline void iov_iter_init(struct iov_iter *i,
diff --git a/mm/filemap.c b/mm/filemap.c
index 8d59ed9..8600c3e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1817,10 +1817,32 @@ void iov_iter_advance(struct iov_iter *i, size_t bytes)
 }
 EXPORT_SYMBOL(iov_iter_advance);
 
-int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes)
+int iov_iter_fault_in_readable(struct iov_iter *i, size_t* bytes)
 {
-   char __user *buf = i->iov->iov_base + i->iov_offset;
-   return fault_in_pages_readable(buf, bytes);
+   size_t len = *bytes;
+   int ret;
+   if (likely(i->nr_segs == 1)) {
+   ret = fault_in_pages_readable(i->iov->iov_base, len);
+   if (ret)
+   *bytes = 0;
+   } else {
+   const struct iovec *iov = i->iov;
+   size_t base = i->iov_offset;
+   *bytes = 0;
+   while (len) {
+   int copy = min(len, iov->iov_len - base);
+   if ((ret = fault_in_pages_readable(iov->iov_base + 
base, copy)))
+   break;
+   *bytes += copy;
+   len -= copy;
+   base += copy;
+   if (iov->iov_len == base) {
+   iov++;
+   base = 0;
+   }
+   }
+   }
+   return ret; 
 }
 EXPORT_SYMBOL(iov_iter_fault_in_readable);
 
@@ -2110,7 +2132,7 @@ static ssize_t generic_perform_write_2copy(struct file 
*file,
 * to check that the address is actually valid, when atomic
 * usercopies are used, below.
 */
-   if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
+   if (unlikely(iov_iter_fault_in_readable(i, &bytes) && !bytes)) {
status = -EFAULT;
break;
}
@@ -2284,7 +2306,7 @@ again:
 * to check that the address is actually valid, when atomic
 * usercopies are used, below.
 */
-   if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
+   if (unlikely(iov_iter_fault_in_readable(i, &bytes)&& !bytes)) {
status = -EFAULT;
break;
}


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch] new aop loop fix

2007-06-13 Thread Dmitriy Monakhov

On 14:36 Срд 13 Июн , Hugh Dickins wrote:
> On Wed, 13 Jun 2007, Dmitriy Monakhov wrote:
> 
> > loop.c code itself is not perfect. In fact before Nick's patch
> > partial write was't possible. Assumption what write chunks are
> > always page aligned is realy weird ( see "index++" line).
> > 
> > Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
> 
> I'm interested, because I'm trying to chase down an -mm bug which
> occasionally leaves me with 1k of zeroes where it shouldn't (in a
> 1k bsize ext2 looped over tmpfs).  The length of time for this to
> happen varies a lot, so bisection has been misleading: maybe the
> problem lies in Nick's patches, maybe it does not.
> 
> But I don't understand your fix below at all.  _If_ you need to
> change the handling of index, then you need to change the handling
> of offset too, don't you?
> 
> But what's wrong with how inded was handled anyway?  Yes, it might
> be being incremented at the bottom of the loop when we haven't
> reached the end of this page, but in that case we're not going
> round the loop again anyway: len will now be 0.  So no problem.
> 
> One of us is missing something: please enlighten me - thanks.
Yepp. You absolutely right, wrong patch was attached :(
Btw: Nick's patches broke  LO_CRYPT_XOR mode, but it is ok because
this feature was absolete and not used by anyone, am i right here?
> 
> Hugh
> 
> > 
> > diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> > index 4bab9b1..8726da5 100644
> > --- a/drivers/block/loop.c
> > +++ b/drivers/block/loop.c
> > @@ -215,7 +215,6 @@ static int do_lo_send_aops(struct loop_device *lo, 
> > struct bio_vec *bvec,
> > int len, ret;
> >  
> > mutex_lock(&mapping->host->i_mutex);
> > -   index = pos >> PAGE_CACHE_SHIFT;
> > offset = pos & ((pgoff_t)PAGE_CACHE_SIZE - 1);
> > bv_offs = bvec->bv_offset;
> > len = bvec->bv_len;
> > @@ -226,6 +225,7 @@ static int do_lo_send_aops(struct loop_device *lo, 
> > struct bio_vec *bvec,
> > struct page *page;
> > void *fsdata;
> >  
> > +   index = pos >> PAGE_CACHE_SHIFT;
> > IV = ((sector_t)index << (PAGE_CACHE_SHIFT - 9))+(offset >> 9);
> > size = PAGE_CACHE_SIZE - offset;
> > if (size > len)
> > @@ -255,7 +255,6 @@ static int do_lo_send_aops(struct loop_device *lo, 
> > struct bio_vec *bvec,
> > bv_offs += copied;
> > len -= copied;
> > offset = 0;
> > -   index++;
> > pos += copied;
> > }
> > ret = 0;
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: + fs-introduce-write_begin-write_end-and-perform_write-aops.patch added to -mm tree

2007-06-13 Thread Dmitriy Monakhov

On 13:43 Срд 13 Июн , Nick Piggin wrote:
> On Wed, Jun 13, 2007 at 05:40:05PM +0400, Dmitriy Monakhov wrote:
> > On 14:19 ?? 29 ?? , [EMAIL PROTECTED] wrote:
> > > 
> > > The patch titled
> > >  fs: introduce write_begin, write_end, and perform_write aops
> > > has been added to the -mm tree.  Its filename is
> > >  fs-introduce-write_begin-write_end-and-perform_write-aops.patch
> > > 
> > > *** Remember to use Documentation/SubmitChecklist when testing your code 
> > > ***
> > > 
> > > See http://www.zip.com.au/~akpm/linux/patches/stuff/added-to-mm.txt to 
> > > find
> > > out what to do about this
> > > 
> > > --
> > > Subject: fs: introduce write_begin, write_end, and perform_write aops
> > > From: Nick Piggin <[EMAIL PROTECTED]>
> > > 
> > > These are intended to replace prepare_writ eand commit_write with more
> > > flexible alternatives that are also able to avoid the buffered write
> > > deadlock problems efficiently (which prepare_write is unable to do).
> > > 
> > > [EMAIL PROTECTED]: API design contributions, code review and fixes]
> > > Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>
> > > Signed-off-by: Mark Fasheh <[EMAIL PROTECTED]>
> > > Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
> > I've finaly find time to review Nick's "write_begin/write_end aop" patch 
> > set.
> > And i have some fixes and questions. My be i've missed somthing and it was 
> > already disscussed, but i cant find in LKML.
> 
> Thanks, that's really helpful.
> 
>  
> > 1) loop dev:
> > loop.c code itself is not perfect. In fact before Nick's patch
> > partial write was't possible. Assumption what write chunks are
> > always page aligned is realy weird ( see "index++" line).
> > Fixed by "new aop loop fix" patch
> 
> I think you're right, fix looks good.
> 
>  
> > 2)block_write_begin:
> > After we enter to block_write_begin with *pagep == NULL and
> > some page was grabed we remember this page in *pagep
> > And if __block_prepare_write() we have to clear *pagep , as 
> > it was before. Because this may confuse caller.
> > for example caller may have folowing code:
> > ret = block_write_begin(..., pagep,...)
> > if (ret && *pagep != NULL) {
> > unlock_page(*pagep);
> > page_cache_release(*pagep);
> > }
> > Fixed my "new aop block_write_begin fix" patch
> 
> I don't think the caller can rely on that if it returns failure.
> But that is more defensive I guess. Maybe setting it to 1 or
> so would catch abusers.
> 
>  
> > 3) __page_symlink:
> > Nick's patch add folowing code:
> > + err = pagecache_write_begin(NULL, mapping, 0,PAGE_CACHE_SIZE,
> > + AOP_FLAG_UNINTERRUPTIBLE, &page,&fsdata);
> > symlink now consume whole page. I have only one question "WHY???".
> > I don't see any advantages, but where are huge list of
> > dissdvantages:
> > a) fs with blksize == 1k and pagesize == 16k after this patch
> >waste more than 10x times disk space for nothing.
> > b) What happends if we want use fs with blksize == 4k on i386
> >after it was used by ia64 ??? (before this patch it was
> >possible).
One more visiable effect caused by wrong symlink size:
fsstress cause folowing error:

EXT3-fs unexpected failure: !buffer_revoked(bh); 
inconsistent data on disk
ext3_forget: aborting transaction: IO failure in __ext3_journal_revoke
ext3_abort called.

EXT3-fs error (device dm-4): ext3_forget: error -5 when attempting
revoke
Remounting filesystem read-only 
Aborting journal on device dm-4.
journal commit I/O error
journal commit I/O error
journal commit I/O error
journal commit I/O error
journal commit I/O error
journal commit I/O error
EXT3-fs error (device dm-4) in ext3_free_blocks_sb: Journal has aborted

journal commit I/O error
journal commit I/O error
EXT3-fs error (device dm-4) in ext3_reserve_inode_write: Journal has
aborted
EXT3-fs error (device dm-4) in ext3_truncate: IO failure
journal commit I/O error
journal commit I/O error
journal commit I/O error
journal commit I/O error
EXT3-fs error (device dm-4) in ext3_reserve_inode_write: Journal has
aborted
EXT3-fs error (device dm-4) in ext3_orphan_del: Journal has aborted
EXT3-fs error (device dm-4) in ext3_rese

[PATCH] ext4:fix unexpected error from ext4_reserve_global

2007-06-14 Thread Dmitriy Monakhov

I just cant belive my eyes then i saw this at the first time...
simple test: strace dd if=/dev/zero of=/mnt/file

open("/dev/zero", O_RDONLY) = 0
close(1)= 0
open("/mnt/test/file", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 1
read(0, "\0\0\0\0\0\0\0\0\0"..., 512) = 512
write(1, "\0\0\0\0\0\0\0\0"..., 512) = 512
read(0, "\0\0\0\0\0\0\0\0\0"..., 512) = 512
write(1, "\0\0\0\0\0\0\0\0"..., 512) = -1 ENOENT (No such fil
e or directory)

This strange error returned from ext4_reserve_global().
It's just typo because:
a) In fact this is 100% ENOSPC situation
b) simular function ext4_reserve_local() returns -ENOSPC

Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
---
 fs/ext4/balloc.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 4d7bfd2..43ae8f8 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -1920,7 +1920,7 @@ int ext4_reserve_global(struct super_block *sb, int 
blocks)
 {
struct ext4_sb_info *sbi = EXT4_SB(sb);
struct ext4_reservation_slot *rs;
-   int i, rc = -ENOENT;
+   int i, rc = -ENOSPC;
__u64 free = 0;
 
rs = sbi->s_reservation_slots;
-- 
1.5.2


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] deny partial write for loop dev fd

2007-06-16 Thread Dmitriy Monakhov

Partial write can be easily supported by LO_CRYPT_NONE mode, but
it is not easy in LO_CRYPT_CRYPTOAPI case, because of its block nature.
I don't know who still used cryptoapi, but theoretically it is possible.
So let's leave things as they are. Loop device doesn't support partial
write before Nick's "write_begin/write_end" patch set, and let's it 
behave the same way after.

Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
---
 drivers/block/loop.c |4 +---
 1 files changed, 1 insertions(+), 3 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 4bab9b1..de122f3 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -244,10 +244,8 @@ static int do_lo_send_aops(struct loop_device *lo, struct 
bio_vec *bvec,
 
ret = pagecache_write_end(file, mapping, pos, size, copied,
page, fsdata);
-   if (ret < 0)
+   if (ret < 0 || ret != copied)
goto fail;
-   if (ret < copied)
-   copied = ret;
 
if (unlikely(transfer_result))
goto fail;
-- 
1.5.2


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: iov_iter_fault_in_readable fix

2007-06-16 Thread Dmitriy Monakhov

On 18:31 Чтв 14 Июн , Christoph Hellwig wrote:
> On Wed, Jun 13, 2007 at 05:57:59PM +0400, Dmitriy Monakhov wrote:
> > Function prerform check for signgle region, with out respect to
> > segment nature of iovec, For example writev no longer works :)
> 
> Btw, could someone please start to collect all sniplets like this in
> a nice simple regression test suite?  If no one wants to start a new
> one we should probably just put it into xfsqa (which should be useable
> for other filesystems aswell despite the name)
I've prepared testcase (testcases/kernel/syscalls/writev/writev06.c) 
and sent it to ltp mailing list.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC][Patch 2/6] integrity: fs hook placement

2007-03-08 Thread Dmitriy Monakhov

Mimi Zohar <[EMAIL PROTECTED]> writes:

> This patch places calls to the new integrity hooks in the appropriate
> places in the fs directory. It is not meant in any way to be viewed
> as a complete set, but used as a basis for an initial discussion.
>
> Index: linux-2.6.21-rc3-mm2/fs/ext3/xattr_security.c
> ===
> --- linux-2.6.21-rc3-mm2.orig/fs/ext3/xattr_security.c
> +++ linux-2.6.21-rc3-mm2/fs/ext3/xattr_security.c
> @@ -10,6 +10,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "xattr.h"
>  
>  static size_t
> @@ -58,12 +59,19 @@ ext3_init_security(handle_t *handle, str
>  
>   err = security_inode_init_security(inode, dir, &name, &value, &len);
>   if (err) {
> + /* Even if creation of the security xattr fails, must
> +  * indicate this is a new inode. */
> + integrity_inode_init_integrity(inode, dir, NULL, NULL, NULL);
>   if (err == -EOPNOTSUPP)
>   return 0;
>   return err;
>   }
 block begin
>   err = ext3_xattr_set_handle(handle, inode, EXT3_XATTR_INDEX_SECURITY,
>   name, value, len, 0);
> +
> + integrity_inode_init_integrity(inode, dir, &name, &value, &len);
> + err = ext3_xattr_set_handle(handle, inode, EXT3_XATTR_INDEX_SECURITY,
> + name, value, len, 0);
 block end
May be i've missed something, but i don't get last block.
Why you call ext3_xattr_set_handle() twise?, 
or you just mistyped and  it has to look like this:
<<<   kfree(name);
>   kfree(value);
>   return err;
> Index: linux-2.6.21-rc3-mm2/include/linux/fs.h
> ===
> --- linux-2.6.21-rc3-mm2.orig/include/linux/fs.h
> +++ linux-2.6.21-rc3-mm2/include/linux/fs.h
> @@ -591,6 +591,9 @@ struct inode {
>  #ifdef CONFIG_SECURITY
>   void*i_security;
>  #endif
> +#ifdef CONFIG_INTEGRITY
> + void*i_integrity;
> +#endif
>   void*i_private; /* fs or device private pointer */
>  };
>  
> Index: linux-2.6.21-rc3-mm2/fs/dcache.c
> ===
> --- linux-2.6.21-rc3-mm2.orig/fs/dcache.c
> +++ linux-2.6.21-rc3-mm2/fs/dcache.c
> @@ -29,6 +29,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -986,6 +987,7 @@ void d_instantiate(struct dentry *entry,
>   entry->d_inode = inode;
>   fsnotify_d_instantiate(entry, inode);
>   spin_unlock(&dcache_lock);
> + integrity_d_instantiate(entry, inode);
>   security_d_instantiate(entry, inode);
>  }
>  
> @@ -1050,6 +1052,7 @@ struct dentry *d_instantiate_unique(stru
>   spin_unlock(&dcache_lock);
>  
>   if (!result) {
> + integrity_d_instantiate(entry, inode);
>   security_d_instantiate(entry, inode);
>   return NULL;
>   }
> @@ -1187,6 +1190,7 @@ struct dentry *d_splice_alias(struct ino
>   BUG_ON(!(new->d_flags & DCACHE_DISCONNECTED));
>   fsnotify_d_instantiate(new, inode);
>   spin_unlock(&dcache_lock);
> + integrity_d_instantiate(new, inode);
>   security_d_instantiate(new, inode);
>   d_rehash(dentry);
>   d_move(new, dentry);
> @@ -1197,6 +1201,7 @@ struct dentry *d_splice_alias(struct ino
>   dentry->d_inode = inode;
>   fsnotify_d_instantiate(dentry, inode);
>   spin_unlock(&dcache_lock);
> + integrity_d_instantiate(dentry, inode);
>   security_d_instantiate(dentry, inode);
>   d_rehash(dentry);
>   }
> @@ -1748,6 +1753,7 @@ found:
>   spin_unlock(&dcache_lock);
>  out_nolock:
>   if (actual == dentry) {
> + integrity_d_instantiate(dentry, inode);
>   security_d_instantiate(dentry, inode);
>   return NULL;
>   }
> Index: linux-2.6.21-rc3-mm2/fs/file_table.c
> ===
> --- linux-2.6.21-rc3-mm2.orig/fs/file_table.c
> +++ linux-2.6.21-rc3-mm2/fs/file_table.c
> @@ -13,6 +13,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -169,6 +170,7 @@ void fastcall __fput(struct file *file)
>   if (file->f_op && file->f_op->release)
>   file->f_op->release(inode, file);
>   security_file_free(file);
> + integrity_file_free(file);
>   if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL))
>   cdev_put(inode->i_cdev);
>   fops_put(file->f_op);
> @@ -240,6 +242,7 @@ void put_filp(struct file *file)
>  {
>   if (atomic_dec_and_test(&file->f_count)) {
>   security

[patch] kobject: kobject_shadow_add cleanup

2007-03-10 Thread Dmitriy Monakhov

 - correct function name in comments
 - parrent assignment does metter only inside "if" block, 
   so move it inside this block.
Signed-off-by: Monakhov Dmitriy <[EMAIL PROTECTED]>

diff --git a/lib/kobject.c b/lib/kobject.c
index b94f208..e4b477d 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -201,7 +201,7 @@ static void unlink(struct kobject * kobj)
 }
 
 /**
- * kobject_add - add an object to the hierarchy.
+ * kobject_shadow_add - add an object to the hierarchy.
  * @kobj:  object.
  * @shadow_parent: sysfs directory to add to.
  */
@@ -234,8 +234,8 @@ int kobject_shadow_add(struct kobject * kobj, struct dentry 
*shadow_parent)
 
list_add_tail(&kobj->entry,&kobj->kset->list);
spin_unlock(&kobj->kset->list_lock);
+   kobj->parent = parent;
}
-   kobj->parent = parent;
 
error = create_dir(kobj, shadow_parent);
if (error) {

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 8/9] Driver core: fix error by cleanup up symlinks properly

2007-03-10 Thread Dmitriy Monakhov

Greg Kroah-Hartman <[EMAIL PROTECTED]> writes:

> From: James Simmons <[EMAIL PROTECTED]>
>
> When a device fails to register the class symlinks where not cleaned up.
> This left a symlink in the /sys/class/"device"/ directory that pointed
> to no where. This caused the sysfs_follow_link Oops I reported earlier.
> This patch cleanups up the symlink. Please apply. Thank you.
>
> Signed-Off: James Simmons <[EMAIL PROTECTED]>
> Signed-off-by: Greg Kroah-Hartman <[EMAIL PROTECTED]>
> ---
>  drivers/base/core.c |   31 ++-
>  1 files changed, 30 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/base/core.c b/drivers/base/core.c
> index d04fd33..cf2a398 100644
> --- a/drivers/base/core.c
> +++ b/drivers/base/core.c
> @@ -637,12 +637,41 @@ int device_add(struct device *dev)
>BUS_NOTIFY_DEL_DEVICE, dev);
>   device_remove_groups(dev);
>   GroupError:
> - device_remove_attrs(dev);
> + device_remove_attrs(dev);
>   AttrsError:
>   if (dev->devt_attr) {
>   device_remove_file(dev, dev->devt_attr);
>   kfree(dev->devt_attr);
>   }
> +
> + if (dev->class) {
> + sysfs_remove_link(&dev->kobj, "subsystem");
> + /* If this is not a "fake" compatible device, remove the
> +  * symlink from the class to the device. */
> + if (dev->kobj.parent != &dev->class->subsys.kset.kobj)
> + sysfs_remove_link(&dev->class->subsys.kset.kobj,
> +   dev->bus_id);
> +#ifdef CONFIG_SYSFS_DEPRECATED
> + if (parent) {
> + char *class_name = make_class_name(dev->class->name,
> +&dev->kobj);
> + if (class_name)
> + sysfs_remove_link(&dev->parent->kobj,
> +   class_name);
> + kfree(class_name);
> + sysfs_remove_link(&dev->kobj, "device");
> + }
> +#endif
> +
< block begin
> + down(&dev->class->sem);
> + /* notify any interfaces that the device is now gone */
> + list_for_each_entry(class_intf, &dev->class->interfaces, node)
> + if (class_intf->remove_dev)
> + class_intf->remove_dev(dev, class_intf);
> + /* remove the device from the class list */
> + list_del_init(&dev->node);
> + up(&dev->class->sem);
<< block end 
May be i've missed something, but i'm confuesd a litle bit.
For example if error happens while device_pm_add() we jump to label "PMError"
and code from block above will be executed (device will be remove from list),
but this device wasn't added to this list yet!
> + }
>   ueventattrError:
>   device_remove_file(dev, &dev->uevent_attr);
>   attrError:
> -- 
> 1.5.0.1
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] driver core: fix device_add error path

2007-03-11 Thread Dmitriy Monakhov

Dmitriy Monakhov <[EMAIL PROTECTED]> writes:

> Greg Kroah-Hartman <[EMAIL PROTECTED]> writes:
>
>> From: James Simmons <[EMAIL PROTECTED]>
>>
>> When a device fails to register the class symlinks where not cleaned up.
>> This left a symlink in the /sys/class/"device"/ directory that pointed
>> to no where. This caused the sysfs_follow_link Oops I reported earlier.
>> This patch cleanups up the symlink. Please apply. Thank you.
>>
>> Signed-Off: James Simmons <[EMAIL PROTECTED]>
>> Signed-off-by: Greg Kroah-Hartman <[EMAIL PROTECTED]>
>> ---
>>  drivers/base/core.c |   31 ++-
>>  1 files changed, 30 insertions(+), 1 deletions(-)
>>
>> diff --git a/drivers/base/core.c b/drivers/base/core.c
>> index d04fd33..cf2a398 100644
>> --- a/drivers/base/core.c
>> +++ b/drivers/base/core.c
>> @@ -637,12 +637,41 @@ int device_add(struct device *dev)
>>   BUS_NOTIFY_DEL_DEVICE, dev);
>>  device_remove_groups(dev);
>>   GroupError:
>> -device_remove_attrs(dev);
>> +device_remove_attrs(dev);
>>   AttrsError:
>>  if (dev->devt_attr) {
>>  device_remove_file(dev, dev->devt_attr);
>>  kfree(dev->devt_attr);
>>  }
>> +
>> +if (dev->class) {
>> +sysfs_remove_link(&dev->kobj, "subsystem");
>> +/* If this is not a "fake" compatible device, remove the
>> + * symlink from the class to the device. */
>> +if (dev->kobj.parent != &dev->class->subsys.kset.kobj)
>> +sysfs_remove_link(&dev->class->subsys.kset.kobj,
>> +  dev->bus_id);
>> +#ifdef CONFIG_SYSFS_DEPRECATED
>> +if (parent) {
>> +char *class_name = make_class_name(dev->class->name,
>> +   &dev->kobj);
>> +if (class_name)
>> +sysfs_remove_link(&dev->parent->kobj,
>> +  class_name);
>> +kfree(class_name);
>> +sysfs_remove_link(&dev->kobj, "device");
>> +}
>> +#endif
>> +
> <<<<< block begin
>> +down(&dev->class->sem);
>> +/* notify any interfaces that the device is now gone */
>> +list_for_each_entry(class_intf, &dev->class->interfaces, node)
>> +if (class_intf->remove_dev)
>> +class_intf->remove_dev(dev, class_intf);
>> +/* remove the device from the class list */
>> +list_del_init(&dev->node);
>> +up(&dev->class->sem);
> <<<<<< block end 
> May be i've missed something, but i'm confuesd a litle bit.
> For example if error happens while device_pm_add() we jump to label "PMError"
> and code from block above will be executed (device will be remove from list),
> but this device wasn't added to this list yet!
I've check it one more time, code it really broken!, and i think i understand 
how
this can happen 
it look like full code chunck was copy-pasted from device_del(), but in case of 
device_add() error path, device was't added to dev->class->devices list yet.
Folowing patch fix this copy-paste error:

 [PATCH] driver core: fix device_add error path

 - At the moment we jump here device was't added to
   dev->class->devices list yet.

Signed-off-by: Monakhov Dmitriy <[EMAIL PROTECTED]>
---
 drivers/base/core.c |9 -
 1 files changed, 0 insertions(+), 9 deletions(-)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index 142c222..7d2459b 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -684,15 +684,6 @@ int device_add(struct device *dev)
 #endif
sysfs_remove_link(&dev->kobj, "device");
}
-
-   down(&dev->class->sem);
-   /* notify any interfaces that the device is now gone */
-   list_for_each_entry(class_intf, &dev->class->interfaces, node)
-   if (class_intf->remove_dev)
-   class_intf->remove_dev(dev, class_intf);
-   /* remove the device from the class list */
-   list_del_init(&dev->node);
-   up(&dev->class->sem);
}
  ueventattrError:
device_remove_file(dev, &dev->uevent_attr);
-- 
1.5.0.1


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] driver core: handles kobject_uevent failure while device_add

2007-03-11 Thread Dmitriy Monakhov

Signed-off-by: Monakhov Dmitriy <[EMAIL PROTECTED]>
---
 drivers/base/core.c |   10 --
 1 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index 142c222..da73012 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -630,8 +630,11 @@ int device_add(struct device *dev)
goto PMError;
if ((error = bus_add_device(dev)))
goto BusError;
-   if (!dev->uevent_suppress)
-   kobject_uevent(&dev->kobj, KOBJ_ADD);
+   if (!dev->uevent_suppress) {
+   error = kobject_uevent(&dev->kobj, KOBJ_ADD);
+   if (error)
+   goto UeventError;
+   }
bus_attach_device(dev);
if (parent)
klist_add_tail(&dev->knode_parent, &parent->klist_children);
@@ -651,6 +654,9 @@ int device_add(struct device *dev)
kfree(class_name);
put_device(dev);
return error;
+
+ UeventError:
+   bus_remove_device(dev);
  BusError:
device_pm_remove(dev);
  PMError:
-- 
1.5.0.1


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] driver core: handle sysfs_op failure while device_add

2007-03-11 Thread Dmitriy Monakhov

depends on: "[PATCH] driver core: fix device_add error path"

 - rearrange error path sequence, in order to make it more correct.
   In fact if initial sequance was doA(); doB(); doC();
   we should undo it with folowing sequance undoC(); undoB(); undoA();
 - handle sysfs_ops failure

This patch kills "ignoring return value of 'sysfs_create_link'"
warning message.

Signed-off-by: Monakhov Dmitriy <[EMAIL PROTECTED]>
---
 drivers/base/core.c |   47 +++
 1 files changed, 31 insertions(+), 16 deletions(-)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index 245d703..bcfcfad 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -602,22 +602,34 @@ int device_add(struct device *dev)
}
 
if (dev->class) {
-   sysfs_create_link(&dev->kobj, &dev->class->subsys.kset.kobj,
- "subsystem");
+   error = sysfs_create_link(&dev->kobj,
+   &dev->class->subsys.kset.kobj, "subsystem");
+   if (error)
+   goto ueventattrError;
/* If this is not a "fake" compatible device, then create the
 * symlink from the class to the device. */
-   if (dev->kobj.parent != &dev->class->subsys.kset.kobj)
-   sysfs_create_link(&dev->class->subsys.kset.kobj,
- &dev->kobj, dev->bus_id);
+   if (dev->kobj.parent != &dev->class->subsys.kset.kobj) {
+   error = sysfs_create_link(
+   &dev->class->subsys.kset.kobj,
+   &dev->kobj, dev->bus_id);
+   if (error)
+   goto busLinkErr;
+   }
if (parent) {
-   sysfs_create_link(&dev->kobj, &dev->parent->kobj,
-   "device");
+   error = sysfs_create_link(&dev->kobj,
+   &dev->parent->kobj, "device");
+   if (error)
+   goto deviceLinkErr;
 #ifdef CONFIG_SYSFS_DEPRECATED
class_name = make_class_name(dev->class->name,
&dev->kobj);
-   if (class_name)
-   sysfs_create_link(&dev->parent->kobj,
- &dev->kobj, class_name);
+   if (class_name) {
+   error = sysfs_create_link(
+   &dev->parent->kobj,
+   &dev->kobj, class_name);
+   if (error)
+   goto classLinkErr;
+   }
 #endif
}
}
@@ -673,12 +685,6 @@ int device_add(struct device *dev)
}
 
if (dev->class) {
-   sysfs_remove_link(&dev->kobj, "subsystem");
-   /* If this is not a "fake" compatible device, remove the
-* symlink from the class to the device. */
-   if (dev->kobj.parent != &dev->class->subsys.kset.kobj)
-   sysfs_remove_link(&dev->class->subsys.kset.kobj,
- dev->bus_id);
if (parent) {
 #ifdef CONFIG_SYSFS_DEPRECATED
char *class_name = make_class_name(dev->class->name,
@@ -688,8 +694,17 @@ int device_add(struct device *dev)
  class_name);
kfree(class_name);
 #endif
+classLinkErr:
sysfs_remove_link(&dev->kobj, "device");
}
+deviceLinkErr:
+   /* If this is not a "fake" compatible device, remove the
+* symlink from the class to the device. */
+   if (dev->kobj.parent != &dev->class->subsys.kset.kobj)
+   sysfs_remove_link(&dev->class->subsys.kset.kobj,
+ dev->bus_id);
+busLinkErr:
+   sysfs_remove_link(&dev->kobj, "subsystem");
}
  ueventattrError:
device_remove_file(dev, &dev->uevent_attr);
-- 
1.5.0.1

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] ext3: dirindex error pointer issues (b)

2007-03-11 Thread Dmitriy Monakhov

Dmitriy Monakhov <[EMAIL PROTECTED]> writes:

>  - ext3_dx_find_entry() exit with out setting proper error pointer
>  - do_split() exit with out setting proper error pointer
>it is realy painful because many callers contain folowing code:
>de = do_split(handle,dir, &bh, frame, &hinfo, &retval);
>if (!(de))
> return retval;
><<< WOW retval wasn't changed by do_split(), so caller failed
><<< but return SUCCESS :)
>  - Rearrange do_split() error path. Current error path is realy ugly, all
>this up and down jump stuff doesn't make code easy to understand.
Ohh my first patch change error message semantics in do_split(). Initially when 
ext3_append() failed we just exit without printing error. In fact ext3_append()
may fail, it is legal and it's happens qite often (ENOSPC for example). This 
cause annoying fake error message. So restore this semantic as it was before
patch.
Andrew please apply incremental patch what fix it:

Signed-off-by: Monakhov Dmitriy <[EMAIL PROTECTED]>

diff --git a/fs/ext3/namei.c b/fs/ext3/namei.c
index 1a52586..7edb617 100644
--- a/fs/ext3/namei.c
+++ b/fs/ext3/namei.c
@@ -1197,8 +1197,8 @@ journal_error:
brelse(*bh);
brelse(bh2);
*bh = NULL;
-errout:
ext3_std_error(dir->i_sb, err);
+errout:
*error = err;
return NULL;
 }
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index f0a6c26..02a75db 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -1195,8 +1195,8 @@ journal_error:
brelse(*bh);
brelse(bh2);
*bh = NULL;
-errout:
ext4_std_error(dir->i_sb, err);
+errout:
*error = err;
return NULL;
 }

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 1/2] mm: move common segment checks to separate helper function (v6)

2007-03-12 Thread Dmitriy Monakhov

I realy don't want to be annoying by sending this patcheset over and over
again. If anyone think this patch is realy cappy, please comment what 
exectly is bad. Thank you.

Changes:
  - patch was split in two patches.
  - comments added. I think now it is clearly describe things.
  - make generic_segment_checks() inline
  - patch prepared against 2.6.20-mm3

How is tested:
 - LTP test, and other readv/writev op tests.

Signed-off-by: Monakhov Dmitriy <[EMAIL PROTECTED]>
---
 mm/filemap.c |   76 +
 1 files changed, 39 insertions(+), 37 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 8e1849a..0aadf5f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1159,6 +1159,39 @@ success:
return size;
 }
 
+/*
+ * Performs necessary checks before doing a write
+ *
+ * Adjust number of segments and amount of bytes to write.
+ * Returns appropriate error code that caller should return or
+ * zero in case that write should be allowed.
+ */
+inline int generic_segment_checks(const struct iovec *iov,
+   unsigned long *nr_segs, size_t *count,
+   unsigned long access_flags)
+{
+   unsigned long   seg;
+   for (seg = 0; seg < *nr_segs; seg++) {
+   const struct iovec *iv = &iov[seg];
+
+   /*
+* If any segment has a negative length, or the cumulative
+* length ever wraps negative then return -EINVAL.
+*/
+   *count += iv->iov_len;
+   if (unlikely((ssize_t)(*count|iv->iov_len) < 0))
+   return -EINVAL;
+   if (access_ok(access_flags, iv->iov_base, iv->iov_len))
+   continue;
+   if (seg == 0)
+   return -EFAULT;
+   *nr_segs = seg;
+   *count -= iv->iov_len;  /* This segment is no good */
+   break;
+   }
+   return 0;
+}
+
 /**
  * generic_file_aio_read - generic filesystem read routine
  * @iocb:  kernel I/O control block
@@ -1180,24 +1213,9 @@ generic_file_aio_read(struct kiocb *iocb, const struct 
iovec *iov,
loff_t *ppos = &iocb->ki_pos;
 
count = 0;
-   for (seg = 0; seg < nr_segs; seg++) {
-   const struct iovec *iv = &iov[seg];
-
-   /*
-* If any segment has a negative length, or the cumulative
-* length ever wraps negative then return -EINVAL.
-*/
-   count += iv->iov_len;
-   if (unlikely((ssize_t)(count|iv->iov_len) < 0))
-   return -EINVAL;
-   if (access_ok(VERIFY_WRITE, iv->iov_base, iv->iov_len))
-   continue;
-   if (seg == 0)
-   return -EFAULT;
-   nr_segs = seg;
-   count -= iv->iov_len;   /* This segment is no good */
-   break;
-   }
+   retval = generic_segment_checks(iov, &nr_segs, &count, VERIFY_WRITE);
+   if (retval)
+   return retval;
 
/* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
if (filp->f_flags & O_DIRECT) {
@@ -2094,30 +2112,14 @@ __generic_file_aio_write_nolock(struct kiocb *iocb, 
const struct iovec *iov,
size_t ocount;  /* original count */
size_t count;   /* after file limit checks */
struct inode*inode = mapping->host;
-   unsigned long   seg;
loff_t  pos;
ssize_t written;
ssize_t err;
 
ocount = 0;
-   for (seg = 0; seg < nr_segs; seg++) {
-   const struct iovec *iv = &iov[seg];
-
-   /*
-* If any segment has a negative length, or the cumulative
-* length ever wraps negative then return -EINVAL.
-*/
-   ocount += iv->iov_len;
-   if (unlikely((ssize_t)(ocount|iv->iov_len) < 0))
-   return -EINVAL;
-   if (access_ok(VERIFY_READ, iv->iov_base, iv->iov_len))
-   continue;
-   if (seg == 0)
-   return -EFAULT;
-   nr_segs = seg;
-   ocount -= iv->iov_len;  /* This segment is no good */
-   break;
-   }
+   err = generic_segment_checks(iov, &nr_segs, &ocount, VERIFY_READ);
+   if (err)
+   return err;
 
count = ocount;
pos = *ppos;
-- 
1.5.0.1


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 2/2] mm: incorrect direct io error handling (v6)

2007-03-12 Thread Dmitriy Monakhov

I realy don't want to be annoying by sending this patcheset over and over
again, i just want the issue to be solved. If anyone think this solution
is realy cappy, please comment what exectly is bad. Thank you.

Changes:
  - patch was split in two patches.
  - comments added. I think now it is clearly describe things.
  - patch prepared against 2.6.20-mm3

How this patch tested:
  - fsstress test.
  - manual direct_io tests.

LOG:
 - Trim off blocks after generic_file_direct_write() has failed.
 - Update out of date comments about direct_io locking rules.

Signed-off-by: Monakhov Dmitriy <[EMAIL PROTECTED]>
---
 mm/filemap.c |   32 
 1 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 0aadf5f..8959ae3 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1925,8 +1925,9 @@ generic_file_direct_write(struct kiocb *iocb, const 
struct iovec *iov,
/*
 * Sync the fs metadata but not the minor inode changes and
 * of course not the data as we did direct DMA for the IO.
-* i_mutex is held, which protects generic_osync_inode() from
-* livelocking.  AIO O_DIRECT ops attempt to sync metadata here.
+* i_mutex may not being held, if so some specific locking
+* ordering must protect generic_osync_inode() from livelocking.
+* AIO O_DIRECT ops attempt to sync metadata here.
 */
if ((written >= 0 || written == -EIOCBQUEUED) &&
((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
@@ -2240,6 +2241,29 @@ ssize_t generic_file_aio_write(struct kiocb *iocb, const 
struct iovec *iov,
mutex_lock(&inode->i_mutex);
ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs,
&iocb->ki_pos);
+   /* 
+* If __generic_file_aio_write_nolock has failed.
+* This may happen because of:
+* 1) Bad segment found (failed before actual write attempt)
+* 2) Segments are good, but actual write operation failed
+*and may have instantiated a few blocks outside i_size.
+*   a) in case of buffered write these blocks was already
+*  trimmed by generic_file_buffered_write()
+*   b) in case of O_DIRECT these blocks weren't trimmed yet.
+*
+* In case of (2b) these blocks have to be trimmed off again.
+*/
+   if (unlikely( ret < 0 && file->f_flags & O_DIRECT)) {
+   unsigned long nr_segs_avail = nr_segs;
+   size_t count = 0;
+   if (!generic_segment_checks(iov, &nr_segs_avail, &count,
+   VERIFY_READ)) {
+   /*It is (2b) case, because segments are good*/
+   loff_t isize = i_size_read(inode);
+   if (pos + count > isize)
+   vmtruncate(inode, isize);
+   }
+   }
mutex_unlock(&inode->i_mutex);
 
if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
@@ -2254,8 +2278,8 @@ ssize_t generic_file_aio_write(struct kiocb *iocb, const 
struct iovec *iov,
 EXPORT_SYMBOL(generic_file_aio_write);
 
 /*
- * Called under i_mutex for writes to S_ISREG files.   Returns -EIO if 
something
- * went wrong during pagecache shootdown.
+ * May be called without i_mutex for writes to S_ISREG files.
+ * Returns -EIO if something went wrong during pagecache shootdown.
  */
 static ssize_t
 generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
-- 
1.5.0.1


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/2] mm: incorrect direct io error handling (v6)

2007-03-12 Thread Dmitriy Monakhov

Nick Piggin <[EMAIL PROTECTED]> writes:

> On Mon, Mar 12, 2007 at 10:58:10AM +0300, Dmitriy Monakhov wrote:
>> I realy don't want to be annoying by sending this patcheset over and over
>> again, i just want the issue to be solved. If anyone think this solution
>> is realy cappy, please comment what exectly is bad. Thank you.
>
> If you don't get feedback, then you have to keep posting. If you still
> don't get feedback, try cc'ing a few more lists (eg. linux-fsdevel).
>
>> Changes:
>>   - patch was split in two patches.
>>   - comments added. I think now it is clearly describe things.
>>   - patch prepared against 2.6.20-mm3
>> 
>> How this patch tested:
>>   - fsstress test.
>>   - manual direct_io tests.
>> 
>> LOG:
>>  - Trim off blocks after generic_file_direct_write() has failed.
>>  - Update out of date comments about direct_io locking rules.
>
> It can be nice to expand on what the problem was, and how you fixed it...
> but I guess you do quite a good job in the C comments.
If generic_file_direct_write() has fail (ENOSPC condition) inside 
__generic_file_aio_write_nolock() it may have instantiated
a few blocks outside i_size. And fsck will complain about wrong i_size
(ext2, ext3 and reiserfs interpret i_size and biggest block difference as 
error),
after fsck will fix error i_size will be increased to the biggest block,
but this blocks contain gurbage from previous write attempt, this is not 
information leak, but its silence file data corruption. This issue affect 
fs regardless the values of blocksize or pagesize.
We need truncate any block beyond i_size after write have failed , do in simular
generic_file_buffered_write() error path.
TEST_CASE:
open("/mnt/test/BIG_FILE", O_WRONLY|O_CREAT|O_DIRECT, 0666) = 3
write(3, "aaa"..., 104857600) = -1 ENOSPC (No space left on device)

#stat /mnt/test/BIG_FILE
  File: `/mnt/test/BIG_FILE'
  Size: 0   Blocks: 110896 IO Block: 1024   regular empty file
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<file size is less than biggest block idx

Device: fe07h/65031dInode: 14  Links: 1
Access: (0644/-rw-r--r--)  Uid: (0/root)   Gid: (0/root)
Access: 2007-01-24 20:03:38.0 +0300
Modify: 2007-01-24 20:03:38.0 +0300
Change: 2007-01-24 20:03:39.0 +0300

#fsck.ext3 -f /dev/VG/test 
e2fsck 1.39 (29-May-2006)
Pass 1: Checking inodes, blocks, and sizes
Inode 14, i_size is 0, should be 56556544.  Fix? yes
Pass 2: Checking directory structure

>
>> 
>> Signed-off-by: Monakhov Dmitriy <[EMAIL PROTECTED]>
>> ---
>>  mm/filemap.c |   32 
>>  1 files changed, 28 insertions(+), 4 deletions(-)
>> 
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 0aadf5f..8959ae3 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -1925,8 +1925,9 @@ generic_file_direct_write(struct kiocb *iocb, const 
>> struct iovec *iov,
>>  /*
>>   * Sync the fs metadata but not the minor inode changes and
>>   * of course not the data as we did direct DMA for the IO.
>> - * i_mutex is held, which protects generic_osync_inode() from
>> - * livelocking.  AIO O_DIRECT ops attempt to sync metadata here.
>> + * i_mutex may not being held, if so some specific locking
>> + * ordering must protect generic_osync_inode() from livelocking.
>> + * AIO O_DIRECT ops attempt to sync metadata here.
>>   */
>
> This wasn't exactly clear to me. Did you mean:
>
>  "may be held, which protects generic_osync_inode() from livelocking. If it
>   is not held, then the filesystem must prevent this livelock"?
Yep..  my english is not realy good :(
>
>>  if ((written >= 0 || written == -EIOCBQUEUED) &&
>>  ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
>> @@ -2240,6 +2241,29 @@ ssize_t generic_file_aio_write(struct kiocb *iocb, 
>> const struct iovec *iov,
>>  mutex_lock(&inode->i_mutex);
>>  ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs,
>>  &iocb->ki_pos);
>> +/* 
>> + * If __generic_file_aio_write_nolock has failed.
>> + * This may happen because of:
>> + * 1) Bad segment found (failed before actual write attempt)
>> + * 2) Segments are good, but actual write operation failed
>> + *and may have instantiated a few blocks outside i_size.
>> + *   a) in case of buffered write these blocks was already
>> + *  trimmed by generic_file_buffered_write()
>>

Re: [PATCH 2/2] mm: incorrect direct io error handling (v6)

2007-03-12 Thread Dmitriy Monakhov

Nick Piggin <[EMAIL PROTECTED]> writes:

> On Mon, Mar 12, 2007 at 11:55:30AM +0300, Dmitriy Monakhov wrote:
>> Nick Piggin <[EMAIL PROTECTED]> writes:
>> 
>> > On Mon, Mar 12, 2007 at 10:58:10AM +0300, Dmitriy Monakhov wrote:
>
>> >> @@ -2240,6 +2241,29 @@ ssize_t generic_file_aio_write(struct kiocb *iocb, 
>> >> const struct iovec *iov,
>> >>   mutex_lock(&inode->i_mutex);
>> >>   ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs,
>> >>   &iocb->ki_pos);
>> >> + /* 
>> >> +  * If __generic_file_aio_write_nolock has failed.
>> >> +  * This may happen because of:
>> >> +  * 1) Bad segment found (failed before actual write attempt)
>> >> +  * 2) Segments are good, but actual write operation failed
>> >> +  *and may have instantiated a few blocks outside i_size.
>> >> +  *   a) in case of buffered write these blocks was already
>> >> +  *  trimmed by generic_file_buffered_write()
>> >> +  *   b) in case of O_DIRECT these blocks weren't trimmed yet.
>> >> +  *
>> >> +  * In case of (2b) these blocks have to be trimmed off again.
>> >> +  */
>> >> + if (unlikely( ret < 0 && file->f_flags & O_DIRECT)) {
>> >> + unsigned long nr_segs_avail = nr_segs;
>> >> + size_t count = 0;
>> >> + if (!generic_segment_checks(iov, &nr_segs_avail, &count,
>> >> + VERIFY_READ)) {
>> >> + /*It is (2b) case, because segments are good*/
>> >> + loff_t isize = i_size_read(inode);
>> >> + if (pos + count > isize)
>> >> + vmtruncate(inode, isize);
>> >> + }
>> >> + }
>> >
>> > OK, but wouldn't this be better to be done in the actual direct IO
>> > functions themselves? Thus you could be sure that you have the 2b case,
>> > and the code would be less fragile to something changing?
>> Ohh, We can't just call vmtruncate() after generic_file_direct_write()
>> failure while __generic_file_aio_write_nolock() becase where is no guarantee
>> what i_mutex held. In fact all existing fs always invoke 
>> __generic_file_aio_write_nolock() with i_mutex held in case of S_ISREG files,
>> but this was't explicitly demanded and documented. I've proposed to do it in
>> previous versions of this patch, because it this just document current state
>> of affairs, but David Chinner wasn't agree with it.
>
> It seemed like it was documented in the comments that you altered in this
> patch...
>
> How would such a filesystem that did not hold i_mutex propose to fix the
> problem?
>
> The burden should be on those filesystems that might not want to hold
> i_mutex here, to solve the problem nicely, rather than generic code to take
> this ugly code.
Ok then what do you think about this version http://lkml.org/lkml/2006/12/18/103
witch was posted almost  month ago :)
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/2] mm: move common segment checks to separate helper function (v6)

2007-03-12 Thread Dmitriy Monakhov

Nick Piggin <[EMAIL PROTECTED]> writes:

> On Mon, Mar 12, 2007 at 10:57:53AM +0300, Dmitriy Monakhov wrote:
>> I realy don't want to be annoying by sending this patcheset over and over
>> again. If anyone think this patch is realy cappy, please comment what 
>> exectly is bad. Thank you.
>
> Doesn't seem like a bad idea.
>
>> 
>> Changes:
>>   - patch was split in two patches.

>> +/*
>> + * Performs necessary checks before doing a write
>> + *
>> + * Adjust number of segments and amount of bytes to write.
>> + * Returns appropriate error code that caller should return or
>> + * zero in case that write should be allowed.
>> + */
>> +inline int generic_segment_checks(const struct iovec *iov,
>> +unsigned long *nr_segs, size_t *count,
>> +unsigned long access_flags)
>
> Make it static and not inline, and the compiler will work it out.
Wow i've just carefully checked and found more functions with duplicating code:
fs/xfs/linux-2.6/xfs_lrw.c:655 xfs_write()
fs/ntfs/file.c:2339 ntfs_file_aio_write_nolock()
So i think nobody will object against exporting generic_segment_checks()
and removing doplicating code.
>
> This function name doesn't really imply that it returns you the
> nr_segs and count, but that's not a big deal I guess.
>
> You also don't say that nr_segs should be initialised to the amount
> you which to write, while count must be initialised to zero.
>
>> +{
>> +unsigned long   seg;
>> +for (seg = 0; seg < *nr_segs; seg++) {
>> +const struct iovec *iv = &iov[seg];
>> +
>> +/*
>> + * If any segment has a negative length, or the cumulative
>> + * length ever wraps negative then return -EINVAL.
>> + */
>> +*count += iv->iov_len;
>> +if (unlikely((ssize_t)(*count|iv->iov_len) < 0))
>> +return -EINVAL;
>> +if (access_ok(access_flags, iv->iov_base, iv->iov_len))
>> +continue;
>
> Why now insert the above test, and put the below statements inside the
> branch? OTOH, that makes it less obviously c&p from the others. Maybe
> a subsequent patch.
>
>> +if (seg == 0)
>> +return -EFAULT;
>> +*nr_segs = seg;
>> +*count -= iv->iov_len;  /* This segment is no good */
>> +break;
>> +}
>
>
> You could assign to *count here, once, and remove the requirement
> that the caller initialised it to zero?
>
>> +return 0;
>> +}
>> +
>>  /**
>>   * generic_file_aio_read - generic filesystem read routine
>>   * @iocb:   kernel I/O control block
>> @@ -1180,24 +1213,9 @@ generic_file_aio_read(struct kiocb *iocb, const 
>> struct iovec *iov,
>>  loff_t *ppos = &iocb->ki_pos;
>>  
>>  count = 0;
>> -for (seg = 0; seg < nr_segs; seg++) {
>> -const struct iovec *iv = &iov[seg];
>> -
>> -/*
>> - * If any segment has a negative length, or the cumulative
>> - * length ever wraps negative then return -EINVAL.
>> - */
>> -count += iv->iov_len;
>> -if (unlikely((ssize_t)(count|iv->iov_len) < 0))
>> -return -EINVAL;
>> -if (access_ok(VERIFY_WRITE, iv->iov_base, iv->iov_len))
>> -continue;
>> -if (seg == 0)
>> -return -EFAULT;
>> -nr_segs = seg;
>> -count -= iv->iov_len;   /* This segment is no good */
>> -break;
>> -}
>> +retval = generic_segment_checks(iov, &nr_segs, &count, VERIFY_WRITE);
>> +if (retval)
>> +return retval;
>>  
>>  /* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
>>  if (filp->f_flags & O_DIRECT) {
>> @@ -2094,30 +2112,14 @@ __generic_file_aio_write_nolock(struct kiocb *iocb, 
>> const struct iovec *iov,
>>  size_t ocount;  /* original count */
>>  size_t count;   /* after file limit checks */
>>  struct inode*inode = mapping->host;
>> -unsigned long   seg;
>>  loff_t  pos;
>>  ssize_t written;
>>  ssize_t err;
>>  
>>  ocount = 0;
>> -for (seg = 0; seg < nr_segs; seg++) {
>> -const struct iovec *iv = &iov[seg];
>> -
>&

[PATCH 2/2] incorrect direct io error handling (v7)

2007-03-12 Thread Dmitriy Monakhov

Changes against v6:
 - Handle direct_io failure inside generic_file_direct_write() as it was
   recommend by Andrew (during discussion v1), and by Nick (during 
   discussion v6).
 - change comments, make it more clear.
 - one more time check what __generic_file_aio_write_nolock() always called
   under i_mutex for non blkdev files.
Tested with: fsstress, manual direct_io tests

Log:
If generic_file_direct_write() has fail (ENOSPC condition) inside 
__generic_file_aio_write_nolock() it may have instantiated
a few blocks outside i_size. And fsck will complain about wrong i_size
(ext2, ext3 and reiserfs interpret i_size and biggest block difference as 
error),
after fsck will fix error i_size will be increased to the biggest block,
but this blocks contain gurbage from previous write attempt, this is not 
information leak, but its silence file data corruption. This issue affect 
fs regardless the values of blocksize or pagesize, and off corse only for non 
blkdev files.
We need truncate any block beyond i_size after write have failed , do in simular
generic_file_buffered_write() error path. We may safely call vmtruncate() here
because i_mutex always held for non blkdev files.
 
TEST_CASE:
open("/mnt/test/BIG_FILE", O_WRONLY|O_CREAT|O_DIRECT, 0666) = 3
write(3, "aaa"..., 104857600) = -1 ENOSPC (No space left on device)

#stat /mnt/test/BIG_FILE
  File: `/mnt/test/BIG_FILE'
  Size: 0   Blocks: 110896 IO Block: 1024   regular empty file
<<
---
 mm/filemap.c |   28 
 1 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 8bd1ea4..95d49fe 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1932,8 +1932,10 @@ generic_file_direct_write(struct kiocb *iocb, const 
struct iovec *iov,
/*
 * Sync the fs metadata but not the minor inode changes and
 * of course not the data as we did direct DMA for the IO.
-* i_mutex is held, which protects generic_osync_inode() from
-* livelocking.  AIO O_DIRECT ops attempt to sync metadata here.
+* i_mutex is held in case of DIO_LOCKING, which protects 
+* generic_osync_inode() from livelocking. If it is not held, then 
+* the filesystem must prevent this livelock. AIO O_DIRECT ops 
+* attempt to sync metadata here.
 */
if ((written >= 0 || written == -EIOCBQUEUED) &&
((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
@@ -2155,8 +2157,26 @@ __generic_file_aio_write_nolock(struct kiocb *iocb, 
const struct iovec *iov,
loff_t endbyte;
ssize_t written_buffered;
 
+   /*
+* In case of non blockdev we may fail to buffered I/O.
+* So i_mutex must be held.
+*/
+   if (!S_ISBLK(inode->i_mode))
+   BUG_ON(!mutex_is_locked(&inode->i_mutex));
+
written = generic_file_direct_write(iocb, iov, &nr_segs, pos,
ppos, count, ocount);
+   /*
+* If host is not S_ISBLK generic_file_direct_write() may 
+* have instantiated a few blocks outside i_size  files
+* Trim these off again.
+*/
+   if (unlikely(written < 0) && !S_ISBLK(inode->i_mode)) {
+   loff_t isize = i_size_read(inode);
+   if (pos + count > isize)
+   vmtruncate(inode, isize);
+   }
+
if (written < 0 || written == count)
goto out;
/*
@@ -2261,8 +2281,8 @@ ssize_t generic_file_aio_write(struct kiocb *iocb, const 
struct iovec *iov,
 EXPORT_SYMBOL(generic_file_aio_write);
 
 /*
- * Called under i_mutex for writes to S_ISREG files.   Returns -EIO if 
something
- * went wrong during pagecache shootdown.
+ * Called under i_mutex for writes to S_ISREG files in case of DIO_LOCKING.
+ * Returns -EIO if something went wrong during pagecache shootdown.
  */
 static ssize_t
 generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
-- 
1.5.0.1


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.t

[PATCH 1/2] mm: move common segment checks to separate helper function (v7)

2007-03-12 Thread Dmitriy Monakhov

Changes against v6
 - remove duplicated code from xfs,ntfs
 - export generic_segment_checks, because it used by xfs,nfs now.
 - change arguments initialization pocily according to Nick's comments.

Tested with: ltp readv/writev tests

Signed-off-by: Monakhov Dmitriy <[EMAIL PROTECTED]>
---
 fs/ntfs/file.c |   21 ++-
 fs/xfs/linux-2.6/xfs_lrw.c |   22 ++--
 include/linux/fs.h |3 ++
 mm/filemap.c   |   83 ---
 4 files changed, 55 insertions(+), 74 deletions(-)

diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c
index dbbac55..621de36 100644
--- a/fs/ntfs/file.c
+++ b/fs/ntfs/file.c
@@ -2129,28 +2129,13 @@ static ssize_t ntfs_file_aio_write_nolock(struct kiocb 
*iocb,
struct address_space *mapping = file->f_mapping;
struct inode *inode = mapping->host;
loff_t pos;
-   unsigned long seg;
size_t count;   /* after file limit checks */
ssize_t written, err;
 
count = 0;
-   for (seg = 0; seg < nr_segs; seg++) {
-   const struct iovec *iv = &iov[seg];
-   /*
-* If any segment has a negative length, or the cumulative
-* length ever wraps negative then return -EINVAL.
-*/
-   count += iv->iov_len;
-   if (unlikely((ssize_t)(count|iv->iov_len) < 0))
-   return -EINVAL;
-   if (access_ok(VERIFY_READ, iv->iov_base, iv->iov_len))
-   continue;
-   if (!seg)
-   return -EFAULT;
-   nr_segs = seg;
-   count -= iv->iov_len;   /* This segment is no good */
-   break;
-   }
+   err = generic_segment_checks(iov, &nr_segs, &count, VERIFY_READ);
+   if (err)
+   return err;
pos = *ppos;
vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
/* We can write back this queue in page reclaim. */
diff --git a/fs/xfs/linux-2.6/xfs_lrw.c b/fs/xfs/linux-2.6/xfs_lrw.c
index ff8d64e..558076d 100644
--- a/fs/xfs/linux-2.6/xfs_lrw.c
+++ b/fs/xfs/linux-2.6/xfs_lrw.c
@@ -639,7 +639,6 @@ xfs_write(
xfs_fsize_t isize, new_size;
xfs_iocore_t*io;
bhv_vnode_t *vp;
-   unsigned long   seg;
int iolock;
int eventsent = 0;
bhv_vrwlock_t   locktype;
@@ -652,24 +651,9 @@ xfs_write(
vp = BHV_TO_VNODE(bdp);
xip = XFS_BHVTOI(bdp);
 
-   for (seg = 0; seg < segs; seg++) {
-   const struct iovec *iv = &iovp[seg];
-
-   /*
-* If any segment has a negative length, or the cumulative
-* length ever wraps negative then return -EINVAL.
-*/
-   ocount += iv->iov_len;
-   if (unlikely((ssize_t)(ocount|iv->iov_len) < 0))
-   return -EINVAL;
-   if (access_ok(VERIFY_READ, iv->iov_base, iv->iov_len))
-   continue;
-   if (seg == 0)
-   return -EFAULT;
-   segs = seg;
-   ocount -= iv->iov_len;  /* This segment is no good */
-   break;
-   }
+   error = generic_segment_checks(iovp, &segs, &ocount, VERIFY_READ);
+   if (error)
+   return error;
 
count = ocount;
pos = *offset;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6a3d22e..3b99450 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1778,6 +1778,9 @@ extern ssize_t generic_file_sendfile(struct file *, 
loff_t *, size_t, read_actor
 extern void do_generic_mapping_read(struct address_space *mapping,
struct file_ra_state *, struct file *,
loff_t *, read_descriptor_t *, 
read_actor_t);
+extern int generic_segment_checks(const struct iovec *iov,
+   unsigned long *nr_segs, size_t *count,
+   unsigned long access_flags);
 
 /* fs/splice.c */
 extern ssize_t generic_file_splice_read(struct file *, loff_t *,
diff --git a/mm/filemap.c b/mm/filemap.c
index 8e1849a..8bd1ea4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1159,6 +1159,46 @@ success:
return size;
 }
 
+/*
+ * Performs necessary checks before doing a write
+ * @iov:   io vector request
+ * @nr_segs:   number of segments in the iovec
+ * @count: number of bytes to write
+ * @access_flags: type of access: %VERIFY_READ or %VERIFY_WRITE
+ *
+ * Adjust number of segments and amount of bytes to write (nr_segs should be
+ * properly initialized first). Returns appropriate error code that caller 
+ * should return or zero in case that write should be allowed.
+ */
+int generic_segment_checks(const struct iovec *iov,
+   unsigned long *nr_segs, size_t *count,
+   unsigned long access_fl

[PATCH] splice: partial write handling fix

2007-03-14 Thread Dmitriy Monakhov

currently if partial write has happened while ->commit_write() then page
wasn't marked as accessed and rebalanced. 

Signed-off-by: Monakhov Dmitriy <[EMAIL PROTECTED]>

diff --git a/fs/splice.c b/fs/splice.c
index 2fca6eb..bb1bf62 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -682,18 +682,25 @@ find_page:
}
 
ret = mapping->a_ops->commit_write(file, page, offset, offset+this_len);
-   if (!ret) {
+   if (ret) {
+   if(ret == AOP_TRUNCATED_PAGE) {
+   page_cache_release(page);
+   goto find_page;
+   }
+   if (ret < 0)
+   goto out;
/*
-* Return the number of bytes written and mark page as
-* accessed, we are now done!
+* Partial write has happened, so 'ret' already initialized by
+* number of bytes written, Where is nothing we have to do here.
 */
+   } else
ret = this_len;
-   mark_page_accessed(page);
-   balance_dirty_pages_ratelimited(mapping);
-   } else if (ret == AOP_TRUNCATED_PAGE) {
-   page_cache_release(page);
-   goto find_page;
-   }
+   /*
+* Return the number of bytes written and mark page as
+* accessed, we are now done!
+*/
+   mark_page_accessed(page);
+   balance_dirty_pages_ratelimited(mapping);
 out:
page_cache_release(page);
unlock_page(page);

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 2/5] fs: introduce new aops and infrastructure

2007-03-14 Thread Dmitriy Monakhov

Nick Piggin <[EMAIL PROTECTED]> writes:

> Introduce write_begin, write_end, and perform_write aops.
>
> These are intended to replace prepare_write and commit_write with more
> flexible alternatives that are also able to avoid the buffered write
> deadlock problems efficiently (which prepare_write is unable to do).
>
>  Documentation/filesystems/Locking |9 +
>  Documentation/filesystems/vfs.txt |   38 ++
>  drivers/block/loop.c  |   69 ---
>  fs/buffer.c   |  144 
>  fs/libfs.c|   40 ++
>  fs/namei.c|   47 ++-
>  fs/splice.c   |  106 +
>  include/linux/buffer_head.h   |7 +
>  include/linux/fs.h|   29 
>  include/linux/pagemap.h   |2 
>  mm/filemap.c  |  228 
> ++
>  11 files changed, 494 insertions(+), 225 deletions(-)
>
> Index: linux-2.6/include/linux/fs.h
> ===
> --- linux-2.6.orig/include/linux/fs.h
> +++ linux-2.6/include/linux/fs.h
> @@ -449,6 +449,17 @@ struct address_space_operations {
>*/
>   int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
>   int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
> +
> + int (*write_begin)(struct file *, struct address_space *mapping,
> + loff_t pos, unsigned len, int intr,
> + struct page **pagep, void **fsdata);
> + int (*write_end)(struct file *, struct address_space *mapping,
> + loff_t pos, unsigned len, unsigned copied,
> + struct page *page, void *fsdata);
> +
> + int (*perform_write)(struct file *, struct address_space *,
> + struct iov_iter *, loff_t);
> +
>   /* Unfortunately this kludge is needed for FIBMAP. Don't use it */
>   sector_t (*bmap)(struct address_space *, sector_t);
>   void (*invalidatepage) (struct page *, unsigned long);
> @@ -463,6 +474,18 @@ struct address_space_operations {
>   int (*launder_page) (struct page *);
>  };
>  
> +/*
> + * pagecache_write_begin/pagecache_write_end must be used by general code
> + * to write into the pagecache.
> + */
> +int pagecache_write_begin(struct file *, struct address_space *mapping,
> + loff_t pos, unsigned len, int intr,
> + struct page **pagep, void **fsdata);
> +
> +int pagecache_write_end(struct file *, struct address_space *mapping,
> + loff_t pos, unsigned len, unsigned copied,
> + struct page *page, void *fsdata);
> +
>  struct backing_dev_info;
>  struct address_space {
>   struct inode*host;  /* owner: inode, block_device */
> @@ -1902,6 +1925,12 @@ extern int simple_prepare_write(struct f
>   unsigned offset, unsigned to);
>  extern int simple_commit_write(struct file *file, struct page *page,
>   unsigned offset, unsigned to);
> +extern int simple_write_begin(struct file *file, struct address_space 
> *mapping,
> + loff_t pos, unsigned len, int intr,
> + struct page **pagep, void **fsdata);
> +extern int simple_write_end(struct file *file, struct address_space *mapping,
> + loff_t pos, unsigned len, unsigned copied,
> + struct page *page, void *fsdata);
>  
>  extern struct dentry *simple_lookup(struct inode *, struct dentry *, struct 
> nameidata *);
>  extern ssize_t generic_read_dir(struct file *, char __user *, size_t, loff_t 
> *);
> Index: linux-2.6/mm/filemap.c
> ===
> --- linux-2.6.orig/mm/filemap.c
> +++ linux-2.6/mm/filemap.c
> @@ -2049,6 +2049,88 @@ inline int generic_write_checks(struct f
>  }
>  EXPORT_SYMBOL(generic_write_checks);
>  
> +int pagecache_write_begin(struct file *file, struct address_space *mapping,
> + loff_t pos, unsigned len, int intr,
> + struct page **pagep, void **fsdata)
> +{
> + const struct address_space_operations *aops = mapping->a_ops;
> +
> + if (aops->write_begin)
> + return aops->write_begin(file, mapping, pos, len, intr, pagep, 
> fsdata);
> + else {
> + int ret;
> + pgoff_t index = pos >> PAGE_CACHE_SHIFT;
> + unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
> + struct inode *inode = mapping->host;
> + struct page *page;
> +again:
> + page = __grab_cache_page(mapping, index);
> + *pagep = page;
> + if (!page)
> + return -ENOMEM;
> +
> + if (intr && !PageUptodate(page)) {
> +

Re: [patch 2/5] fs: introduce new aops and infrastructure

2007-03-15 Thread Dmitriy Monakhov

Nick Piggin <[EMAIL PROTECTED]> writes:

> Index: linux-2.6/fs/splice.c
> ===
> --- linux-2.6.orig/fs/splice.c
> +++ linux-2.6/fs/splice.c
> @@ -559,7 +559,7 @@ static int pipe_to_file(struct pipe_inod
>   struct address_space *mapping = file->f_mapping;
>   unsigned int offset, this_len;
>   struct page *page;
> - pgoff_t index;
> + void *fsdata;
>   int ret;
>  
>   /*
> @@ -569,13 +569,13 @@ static int pipe_to_file(struct pipe_inod
>   if (unlikely(ret))
>   return ret;
>  
> - index = sd->pos >> PAGE_CACHE_SHIFT;
>   offset = sd->pos & ~PAGE_CACHE_MASK;
>  
>   this_len = sd->len;
>   if (this_len + offset > PAGE_CACHE_SIZE)
>   this_len = PAGE_CACHE_SIZE - offset;
>  
> +#if 0
>   /*
>* Reuse buf page, if SPLICE_F_MOVE is set and we are doing a full
>* page.
> @@ -587,86 +587,11 @@ static int pipe_to_file(struct pipe_inod
>* locked on successful return.
>*/
>   if (buf->ops->steal(pipe, buf))
> - goto find_page;
> +#endif
One more note. It's looks like you just disabled all fancy zero copy logic.
Off corse this is just rfc patchset.
But i think where is fundamental problem with it:
 Previous logic was following:
  1)splice code responsible for: stealing(if possible) and loking the page
  2)prepare_write() code responsible for: do fs speciffic stuff

 But with new write_begin() logic  all steps (grubbing, locking, preparing)
 happened internaly inside write_begin() witch doesn't even know about what
 kind of data will be copied between write_begin/write_end.
 So fancy zero copy logic is impossible :(
I think this can be solved somehow, but i dont know yet, how can this be done
without implementing it inside begin_write().

>  
> - page = buf->page;
> - if (add_to_page_cache(page, mapping, index, GFP_KERNEL)) {
> - unlock_page(page);
> - goto find_page;
> - }
> -
> - page_cache_get(page);
> -
> - if (!(buf->flags & PIPE_BUF_FLAG_LRU))
> - lru_cache_add(page);
> - } else {
> -find_page:
> - page = find_lock_page(mapping, index);
> - if (!page) {
> - ret = -ENOMEM;
> - page = page_cache_alloc_cold(mapping);
> - if (unlikely(!page))
> - goto out_ret;
> -
> - /*
> -  * This will also lock the page
> -  */
> - ret = add_to_page_cache_lru(page, mapping, index,
> - GFP_KERNEL);
> - if (unlikely(ret))
> - goto out;
> - }
> -
> - /*
> -  * We get here with the page locked. If the page is also
> -  * uptodate, we don't need to do more. If it isn't, we
> -  * may need to bring it in if we are not going to overwrite
> -  * the full page.
> -  */
> - if (!PageUptodate(page)) {
> - if (this_len < PAGE_CACHE_SIZE) {
> - ret = mapping->a_ops->readpage(file, page);
> - if (unlikely(ret))
> - goto out;
> -
> - lock_page(page);
> -
> - if (!PageUptodate(page)) {
> - /*
> -  * Page got invalidated, repeat.
> -  */
> - if (!page->mapping) {
> - unlock_page(page);
> - page_cache_release(page);
> - goto find_page;
> - }
> - ret = -EIO;
> - goto out;
> - }
> - } else
> - SetPageUptodate(page);
> - }
> - }
> -
> - ret = mapping->a_ops->prepare_write(file, page, offset, 
> offset+this_len);
> - if (unlikely(ret)) {
> - loff_t isize = i_size_read(mapping->host);
> -
> - if (ret != AOP_TRUNCATED_PAGE)
> - unlock_page(page);
> - page_cache_release(page);
> - if (ret == AOP_TRUNCATED_PAGE)
> - goto find_page;
> -
> - /*
> -  * prepare_write() may have instantiated a few blocks
> -  * outside i_size.  Trim these off again.
> -  */
> - if (sd->pos + this_len > isize)
> - vmtruncate(mapping->host, isize);
> -
> -

Re: [PATCH] splice: partial write handling fix

2007-03-15 Thread Dmitriy Monakhov

Jens Axboe <[EMAIL PROTECTED]> writes:

> On Wed, Mar 14 2007, Dmitriy Monakhov wrote:
>> currently if partial write has happened while ->commit_write() then page
>> wasn't marked as accessed and rebalanced. 
>
> The ->commit_write() return values aren't very well designed imho. Is
> your fix correct getting the pipe_to_file() return value correct now?
I think yes, after my patch this path look like follows:

ret = mapping->a_ops->commit_write(file, page, offset, offset+this_len);
if (ret) {
if(ret == AOP_TRUNCATED_PAGE) {
page_cache_release(page);
goto find_page;
}
if (ret < 0)
goto out;
/*
 * Partial write has happened, so 'ret' already initialized by
 * number of bytes written, Where is nothing we have to do here.
 */
} else
ret = this_len;
/*
 * Return the number of bytes written and mark page as
 * accessed, we are now done!
 */
mark_page_accessed(page);
balance_dirty_pages_ratelimited(mapping);
out:
page_cache_release(page);
unlock_page(page);
out_ret:
return ret;
}

But off corse this patch does't fix issues spotted by Nick. 
>
> -- 
> Jens Axboe
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 2/2] fs: incorrect direct io error handling v8

2007-03-18 Thread Dmitriy Monakhov


If generic_file_direct_write() has fail (ENOSPC condition) inside
__generic_file_aio_write_nolock() it may have instantiated a few blocks outside
i_size in case of non blockdev files. At least ext2, ext3 and reiserfs interpret
i_size and biggest block difference as error. Later fsck will complain
about wrong i_size. After fsck will fix error i_size will be increased
to the biggest block. This is bad because this blocks contain gurbage
from previous write attempt. And result in silence file data corruption.
This issue affect fs regardless the values of blocksize or pagesize.
We need truncate any block beyond i_size after generic_file_direct_write()
has failed. In fact all existing fs witch use __generic_file_aio_write_nolock()
always call it under i_mutex for non blockdev files. This patch add 
correcspnding BUG_ON() in order to explicitly demand/document it.

Also fix out of date direct_io locking comments.

TEST_CASE:
TESTCASE_BEGIN
   $touch /mnt/test/BIG_FILE
   ## at this moment /mnt/test/BIG_FILE size and blocks equal to zero
   open("/mnt/test/BIG_FILE", O_WRONLY|O_CREAT|O_DIRECT, 0666) = 3
   write(3, ""..., 104857600) = -1 ENOSPC (No space left on device)
   ## size and block sould't be changed because write op failed.
   $stat /mnt/test/BIG_FILE
 File: `/mnt/test/BIG_FILE'
 Size: 0 Blocks: 110896 IO Block: 1024   regular empty file
   <<<<<<<<<<<<<<<<<<<<<<<<<<<<<^^^file size is less than biggest block idx
   
   Device: fe07h/65031dInode: 14  Links: 1
   Access: (0644/-rw-r--r--)  Uid: (0/root)   Gid: (0/root)
   Access: 2007-01-24 20:03:38.0 +0300
   Modify: 2007-01-24 20:03:38.0 +0300
   Change: 2007-01-24 20:03:39.0 +0300
   
   #fsck.ext3 -f /dev/VG/test 
   e2fsck 1.39 (29-May-2006)
   Pass 1: Checking inodes, blocks, and sizes
   Inode 14, i_size is 0, should be 56556544.  Fix? yes
   Pass 2: Checking directory structure
   
#TESTCASE_END

Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
---
 mm/filemap.c |   24 
 1 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index bbef42f..a08900e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1892,8 +1892,10 @@ generic_file_direct_write(struct kiocb *iocb, const 
struct iovec *iov,
/*
 * Sync the fs metadata but not the minor inode changes and
 * of course not the data as we did direct DMA for the IO.
-* i_mutex is held, which protects generic_osync_inode() from
-* livelocking.  AIO O_DIRECT ops attempt to sync metadata here.
+* i_mutex is held in case of DIO_LOCKING, which protects
+* generic_osync_inode() from livelocking. If it is not held, then
+* the filesystem must prevent this livelock. AIO O_DIRECT ops
+* attempt to sync metadata here.
 */
if ((written >= 0 || written == -EIOCBQUEUED) &&
((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
@@ -2083,6 +2085,9 @@ __generic_file_aio_write_nolock(struct kiocb *iocb, const 
struct iovec *iov,
ssize_t written;
ssize_t err;
 
+   /* Always called under i_mutex for writes to non blockdev files */
+   BUG_ON(!S_ISBLK(inode->i_mode) &&
+   !mutex_is_locked(&inode->i_mutex));
ocount = 0;
err = generic_iovec_checks(iov, &nr_segs, &ocount, VERIFY_READ);
if (err)
@@ -2117,6 +2122,17 @@ __generic_file_aio_write_nolock(struct kiocb *iocb, 
const struct iovec *iov,
 
written = generic_file_direct_write(iocb, iov, &nr_segs, pos,
ppos, count, ocount);
+   /*
+* If host is not blockdev generic_file_direct_write() may
+* have instantiated a few blocks outside i_size  files
+* Trim these off again.
+*/
+   if (unlikely(written < 0) && !S_ISBLK(inode->i_mode)) {
+   loff_t isize = i_size_read(inode);
+   if (pos + count > isize)
+   vmtruncate(inode, isize);
+   }
+
if (written < 0 || written == count)
goto out;
/*
@@ -2221,8 +2237,8 @@ ssize_t generic_file_aio_write(struct kiocb *iocb, const 
struct iovec *iov,
 EXPORT_SYMBOL(generic_file_aio_write);
 
 /*
- * Called under i_mutex for writes to S_ISREG files.   Returns -EIO if 
something
- * went wrong during pagecache shootdown.
+ * Called under i_mutex for writes to S_ISREG files in case of DIO_LOCKING.
+ * Returns -EIO if something went wrong during pagecache shootdown.
  */
 static ssize_t
 generic_file_direct_IO(int rw, struct kio

[PATCH 1/2] fs: remove duplicated iovec checking code v8

2007-03-18 Thread Dmitriy Monakhov


Where are several places where the same code used for iovec checks.
This patch just move this code to separate helper function, and replace
duplicated code with it. IMHO it is better because these are checks that
we want for all filesystems/drivers that use vectored I/O.

Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
---
 fs/ntfs/file.c |   21 +++--
 fs/read_write.c|   40 
 fs/xfs/linux-2.6/xfs_lrw.c |   22 +++---
 include/linux/fs.h |3 +++
 mm/filemap.c   |   43 ++-
 5 files changed, 55 insertions(+), 74 deletions(-)

diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c
index dbbac55..2c672a4 100644
--- a/fs/ntfs/file.c
+++ b/fs/ntfs/file.c
@@ -2129,28 +2129,13 @@ static ssize_t ntfs_file_aio_write_nolock(struct kiocb 
*iocb,
struct address_space *mapping = file->f_mapping;
struct inode *inode = mapping->host;
loff_t pos;
-   unsigned long seg;
size_t count;   /* after file limit checks */
ssize_t written, err;
 
count = 0;
-   for (seg = 0; seg < nr_segs; seg++) {
-   const struct iovec *iv = &iov[seg];
-   /*
-* If any segment has a negative length, or the cumulative
-* length ever wraps negative then return -EINVAL.
-*/
-   count += iv->iov_len;
-   if (unlikely((ssize_t)(count|iv->iov_len) < 0))
-   return -EINVAL;
-   if (access_ok(VERIFY_READ, iv->iov_base, iv->iov_len))
-   continue;
-   if (!seg)
-   return -EFAULT;
-   nr_segs = seg;
-   count -= iv->iov_len;   /* This segment is no good */
-   break;
-   }
+   err = generic_iovec_checks(iov, &nr_segs, &count, VERIFY_READ);
+   if (err)
+   return err;
pos = *ppos;
vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
/* We can write back this queue in page reclaim. */
diff --git a/fs/read_write.c b/fs/read_write.c
index 4d03008..22ec324 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -217,6 +217,46 @@ Einval:
return -EINVAL;
 }
 
+/*
+ * Performs necessary iovec checks before doing a write
+ * @iov:   io vector request
+ * @nr_segs:   number of segments in the iovec
+ * @count: number of bytes to write
+ * @access_flags: type of access: %VERIFY_READ or %VERIFY_WRITE
+ *
+ * Adjust number of segments and amount of bytes to write (nr_segs should be
+ * properly initialized first). Returns appropriate error code that caller
+ * should return or zero in case that write should be allowed.
+ */
+int generic_iovec_checks(const struct iovec *iov,
+   unsigned long *nr_segs, size_t *count,
+   unsigned long access_flags)
+{
+   unsigned long   seg;
+   size_t cnt = 0;
+   for (seg = 0; seg < *nr_segs; seg++) {
+   const struct iovec *iv = &iov[seg];
+
+   /*
+* If any segment has a negative length, or the cumulative
+* length ever wraps negative then return -EINVAL.
+*/
+   cnt += iv->iov_len;
+   if (unlikely((ssize_t)(cnt|iv->iov_len) < 0))
+   return -EINVAL;
+   if (access_ok(access_flags, iv->iov_base, iv->iov_len))
+   continue;
+   if (seg == 0)
+   return -EFAULT;
+   *nr_segs = seg;
+   cnt -= iv->iov_len; /* This segment is no good */
+   break;
+   }
+   *count = cnt;
+   return 0;
+}
+EXPORT_SYMBOL(generic_iovec_checks);
+
 static void wait_on_retry_sync_kiocb(struct kiocb *iocb)
 {
set_current_state(TASK_UNINTERRUPTIBLE);
diff --git a/fs/xfs/linux-2.6/xfs_lrw.c b/fs/xfs/linux-2.6/xfs_lrw.c
index ff8d64e..9a11b00 100644
--- a/fs/xfs/linux-2.6/xfs_lrw.c
+++ b/fs/xfs/linux-2.6/xfs_lrw.c
@@ -639,7 +639,6 @@ xfs_write(
xfs_fsize_t isize, new_size;
xfs_iocore_t*io;
bhv_vnode_t *vp;
-   unsigned long   seg;
int iolock;
int eventsent = 0;
bhv_vrwlock_t   locktype;
@@ -652,24 +651,9 @@ xfs_write(
vp = BHV_TO_VNODE(bdp);
xip = XFS_BHVTOI(bdp);
 
-   for (seg = 0; seg < segs; seg++) {
-   const struct iovec *iv = &iovp[seg];
-
-   /*
-* If any segment has a negative length, or the cumulative
-* length ever wraps negative then return -EINVAL.
-*/
-   ocount += iv->iov_len;
-   if (unlikely((ssize_t)(ocount|iv->iov_len) <

Re: [PATCH][5/5][resend] floppy.c: Fix device_create_file() warning

2007-03-20 Thread Dmitriy Monakhov

"Jesper Juhl" <[EMAIL PROTECTED]> writes:

> On 20/03/07, Mikael Pettersson <[EMAIL PROTECTED]> wrote:
>> On Mon, 19 Mar 2007 18:42:22 +0100, Jesper Juhl wrote:
>> > --- a/drivers/block/floppy.c
>> > +++ b/drivers/block/floppy.c
>> > @@ -4302,7 +4302,12 @@ static int __init floppy_init(void)
>> >   if (err)
>> >   goto out_flush_work;
>> >
>> > - device_create_file(&floppy_device[drive].dev,&dev_attr_cmos);
>> > + err = device_create_file(&floppy_device[drive].dev, 
>> > &dev_attr_cmos);
>> > + if (err)
>> > + goto out_flush_work;
>> > +
>> >   /* to be cleaned up... */
>> >   disks[drive]->private_data = (void *)(long)drive;
>> >   disks[drive]->queue = floppy_queue;
>>
>> The floppy driver's sysfs file just provides some auxiliary
>> information to user-space, none of which matters for most of
>> its users. It is IMO totally inappropriate to fail floppy
>> driver init in this case.
>>
>
> Which is why my original patch just output a warning to let the user
> know that creating the file had failed.
Ohh please... First of all you have to make you patch correct, and only
after this think about tiny shiny error massages.
Lets look at original code:
  err = platform_device_register(&floppy_device[drive]);
  if (err)
goto out_flush_work;

  device_create_file(&floppy_device[drive].dev,&dev_attr_cmos);
< You propose  "goto out_flush_work" here if device_create_file() failed
< ,as it was done if platform_device_register() failed. But at the 
< moment we call device_create_file() platform_device_register()
< succeed, this means we have to unregister it first and only after
< this jumpto out_flush_work.
>
> -- 
> Jesper Juhl <[EMAIL PROTECTED]>
> Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
> Plain text mails only, please  http://www.expita.com/nomime.html
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] floppy: handle device_create_file() failure while init

2007-03-02 Thread Dmitriy Monakhov

This patch kills the "ignoring return value of 'device_create_file'" 
warning message.
Signed-off-by: Monakhov Dmitriy <[EMAIL PROTECTED]>
---
 drivers/block/floppy.c |7 ++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c
index 567c630..dfdabc3 100644
--- a/drivers/block/floppy.c
+++ b/drivers/block/floppy.c
@@ -4333,7 +4333,10 @@ static int __init floppy_init(void)
if (err)
goto out_flush_work;
 
-   device_create_file(&floppy_device[drive].dev,&dev_attr_cmos);
+   err = 
device_create_file(&floppy_device[drive].dev,&dev_attr_cmos);
+   if (err)
+   goto out_unreg_platform_dev;
+
/* to be cleaned up... */
disks[drive]->private_data = (void *)(long)drive;
disks[drive]->queue = floppy_queue;
@@ -4344,6 +4347,8 @@ static int __init floppy_init(void)
 
return 0;
 
+out_unreg_platform_dev:
+   platform_device_unregister(&floppy_device[drive]);
 out_flush_work:
flush_scheduled_work();
if (usage_count)
-- 
1.4.4.2

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch] kobject: new_device->kref wasn't putted in kobject_move()

2007-03-03 Thread Dmitriy Monakhov

Signed-off-by: Monakhov Dmitriy <[EMAIL PROTECTED]>

diff --git a/lib/kobject.c b/lib/kobject.c
index b94f208..b11f7b2 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -432,6 +432,7 @@ int kobject_move(struct kobject *kobj, struct kobject 
*new_parent)
kobject_put(old_parent);
kobject_uevent_env(kobj, KOBJ_MOVE, envp);
 out:
+   kobject_put(new_parent);
kobject_put(kobj);
kfree(devpath_string);
kfree(devpath);

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch] kobject: new_device->kref wasn't putted in kobject_move()

2007-03-03 Thread Dmitriy Monakhov

Dmitriy Monakhov <[EMAIL PROTECTED]> writes:

> Signed-off-by: Monakhov Dmitriy <[EMAIL PROTECTED]>
>
> diff --git a/lib/kobject.c b/lib/kobject.c
> index b94f208..b11f7b2 100644
> --- a/lib/kobject.c
> +++ b/lib/kobject.c
> @@ -432,6 +432,7 @@ int kobject_move(struct kobject *kobj, struct kobject 
> *new_parent)
>   kobject_put(old_parent);
>   kobject_uevent_env(kobj, KOBJ_MOVE, envp);
>  out:
> + kobject_put(new_parent);
>   kobject_put(kobj);
>   kfree(devpath_string);
>   kfree(devpath);
OOps i'm realy sorry, by occasion patch was incomplete :(
The updated patch version following:

  [PATCH] kobject: new_device->kref wasn't putted after error in kobject_move()
  If error happen we jump to "out" label, in this case new_device not yet 
  became the parent but it wasn't putted.
 
  Signed-off-by: Monakhov Dmitriy <[EMAIL PROTECTED]>

diff --git a/lib/kobject.c b/lib/kobject.c
index b94f208..a6c9a06 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -429,9 +429,11 @@ int kobject_move(struct kobject *kobj, struct kobject 
*new_parent)
goto out;
old_parent = kobj->parent;
kobj->parent = new_parent;
+   new_parent = NULL;
kobject_put(old_parent);
kobject_uevent_env(kobj, KOBJ_MOVE, envp);
 out:
+   kobject_put(new_parent);
kobject_put(kobj);
kfree(devpath_string);
kfree(devpath);


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch] udf: possible null pointer dereference while load_partition

2007-03-04 Thread Dmitriy Monakhov


sb_read may return NULL, let's explicitly check it.
Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>

diff --git a/fs/udf/super.c b/fs/udf/super.c
index f4b3265..951b85d 100644
--- a/fs/udf/super.c
+++ b/fs/udf/super.c
@@ -1407,6 +1407,8 @@ udf_load_partition(struct super_block *sb, kernel_lb_addr 
*fileset)
 
pos = udf_block_map(UDF_SB_VAT(sb), 0);
bh = sb_bread(sb, pos);
+   if (!bh)
+   return 1;
UDF_SB_TYPEVIRT(sb,i).s_start_offset =
le16_to_cpu(((struct 
virtualAllocationTable20 *)bh->b_data + 
udf_ext0_offset(UDF_SB_VAT(sb)))->lengthHeader) +

udf_ext0_offset(UDF_SB_VAT(sb));

[PATCH] ext3: dirindex error pointer issues

2007-03-04 Thread Dmitriy Monakhov

 - ext3_dx_find_entry() exit with out setting proper error pointer
 - do_split() exit with out setting proper error pointer
   it is realy painful because many callers contain folowing code:
   de = do_split(handle,dir, &bh, frame, &hinfo, &retval);
   if (!(de))
return retval;
   <<< WOW retval wasn't changed by do_split(), so caller failed
   <<< but return SUCCESS :)
 - Rearrange do_split() error path. Current error path is realy ugly, all
   this up and down jump stuff doesn't make code easy to understand.

Signed-off-by: Monakhov Dmitriy <[EMAIL PROTECTED]>
---
 fs/ext3/namei.c |   26 +++---
 fs/ext4/namei.c |   26 +++---
 2 files changed, 30 insertions(+), 22 deletions(-)

diff --git a/fs/ext3/namei.c b/fs/ext3/namei.c
index 49159f1..1a52586 100644
--- a/fs/ext3/namei.c
+++ b/fs/ext3/namei.c
@@ -969,6 +969,7 @@ static struct buffer_head * ext3_dx_find_entry(struct 
dentry *dentry,
  (blockb_data, *data2;
unsigned split;
struct ext3_dir_entry_2 *de = NULL, *de2;
-   int err;
+   int err = 0;
 
-   bh2 = ext3_append (handle, dir, &newblock, error);
+   bh2 = ext3_append (handle, dir, &newblock, &err);
if (!(bh2)) {
brelse(*bh);
*bh = NULL;
@@ -1145,14 +1146,9 @@ static struct ext3_dir_entry_2 *do_split(handle_t 
*handle, struct inode *dir,
 
BUFFER_TRACE(*bh, "get_write_access");
err = ext3_journal_get_write_access(handle, *bh);
-   if (err) {
-   journal_error:
-   brelse(*bh);
-   brelse(bh2);
-   *bh = NULL;
-   ext3_std_error(dir->i_sb, err);
-   goto errout;
-   }
+   if (err)
+   goto journal_error;
+
BUFFER_TRACE(frame->bh, "get_write_access");
err = ext3_journal_get_write_access(handle, frame->bh);
if (err)
@@ -1195,8 +1191,16 @@ static struct ext3_dir_entry_2 *do_split(handle_t 
*handle, struct inode *dir,
goto journal_error;
brelse (bh2);
dxtrace(dx_show_index ("frame", frame->entries));
-errout:
return de;
+
+journal_error:
+   brelse(*bh);
+   brelse(bh2);
+   *bh = NULL;
+errout:
+   ext3_std_error(dir->i_sb, err);
+   *error = err;
+   return NULL;
 }
 #endif
 
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index e7e1d79..682a3d7 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -967,6 +967,7 @@ static struct buffer_head * ext4_dx_find_entry(struct 
dentry *dentry,
  (blockb_data, *data2;
unsigned split;
struct ext4_dir_entry_2 *de = NULL, *de2;
-   int err;
+   int err = 0;
 
-   bh2 = ext4_append (handle, dir, &newblock, error);
+   bh2 = ext4_append (handle, dir, &newblock, &err);
if (!(bh2)) {
brelse(*bh);
*bh = NULL;
@@ -1143,14 +1144,9 @@ static struct ext4_dir_entry_2 *do_split(handle_t 
*handle, struct inode *dir,
 
BUFFER_TRACE(*bh, "get_write_access");
err = ext4_journal_get_write_access(handle, *bh);
-   if (err) {
-   journal_error:
-   brelse(*bh);
-   brelse(bh2);
-   *bh = NULL;
-   ext4_std_error(dir->i_sb, err);
-   goto errout;
-   }
+   if (err)
+   goto journal_error;
+
BUFFER_TRACE(frame->bh, "get_write_access");
err = ext4_journal_get_write_access(handle, frame->bh);
if (err)
@@ -1193,8 +1189,16 @@ static struct ext4_dir_entry_2 *do_split(handle_t 
*handle, struct inode *dir,
goto journal_error;
brelse (bh2);
dxtrace(dx_show_index ("frame", frame->entries));
-errout:
return de;
+
+journal_error:
+   brelse(*bh);
+   brelse(bh2);
+   *bh = NULL;
+errout:
+   ext4_std_error(dir->i_sb, err);
+   *error = err;
+   return NULL;
 }
 #endif
 
-- 
1.5.0.1


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] ext3: dirindex error pointer issues

2007-03-04 Thread Dmitriy Monakhov

Andreas Dilger <[EMAIL PROTECTED]> writes:

> On Mar 04, 2007  17:18 +0300, Dmitriy Monakhov wrote:
>>  - ext3_dx_find_entry() exit with out setting proper error pointer
>>  - do_split() exit with out setting proper error pointer
>>it is realy painful because many callers contain folowing code:
>>de = do_split(handle,dir, &bh, frame, &hinfo, &retval);
>>if (!(de))
>> return retval;
>><<< WOW retval wasn't changed by do_split(), so caller failed
>><<< but return SUCCESS :)
>>  - Rearrange do_split() error path. Current error path is realy ugly, all
>>this up and down jump stuff doesn't make code easy to understand.
>> 
>> Signed-off-by: Monakhov Dmitriy <[EMAIL PROTECTED]>
>> ---
>>  fs/ext3/namei.c |   26 +++---
>>  fs/ext4/namei.c |   26 +++---
>>  2 files changed, 30 insertions(+), 22 deletions(-)
>> 
>> diff --git a/fs/ext3/namei.c b/fs/ext3/namei.c
>> index 49159f1..1a52586 100644
>> --- a/fs/ext3/namei.c
>> +++ b/fs/ext3/namei.c
>> @@ -969,6 +969,7 @@ static struct buffer_head * ext3_dx_find_entry(struct 
>> dentry *dentry,
>>(block<>+((char *)de - bh->b_data))) {
>>  brelse (bh);
>> +*err = ERR_BAD_DX_DIR;
>>  goto errout;
>>  }
>>  *res_dir = de;
>
> I have no objection to this change (by principle of least surprise) but
> I don't know if it fixes a real problem.  The one caller of this function
> treats ERR_BAD_DX_DIR the same as bh == NULL.
  Execly  ERR_BAD_DX_DIR treats  the same as bh == NULL and let's look at
  callers code: 
  struct buffer_head * ext3_find_entry(..)
  {
  ...
  bh = ext3_dx_find_entry(dentry, res_dir, &err);
/*
 * On success, or if the error was file not found,
 * return.  Otherwise, fall back to doing a search the
 * old fashioned way.
 */
if (bh || (err != ERR_BAD_DX_DIR))
  <<<<< after ext3_dx_find_entry() has failed , but don't set err pointer 
  <<<<< we get  (bh == NULL, err == 0) , and then just return NULL.
return bh;
 ...
 }

>
>> @@ -1134,9 +1135,9 @@ static struct ext3_dir_entry_2 *do_split(handle_t 
>> *handle, struct inode *dir,
>>  char *data1 = (*bh)->b_data, *data2;
>>  unsigned split;
>>  struct ext3_dir_entry_2 *de = NULL, *de2;
>> -int err;
>> +int err = 0;
>>  
>> -bh2 = ext3_append (handle, dir, &newblock, error);
>> +bh2 = ext3_append (handle, dir, &newblock, &err);
>
> Why don't we just remove "err" entirely and use *error to avoid any risk
> of setting err and not returning it in error?  Also reduces stack usage
> that tiny little bit.
I've chosen this approuch becouse of:
1) this approuch was used in block allocation code.
2) this approuch prevent  folowing errors:
  *error = do_some_function(.)
   /* some code*/
  if(error) 
 we do this checking as we do it thousands times before, but forget
 what error it pointer here. And compiler can't warn us here because 
 it is absolutely legal operation. At least it is better to rename 
 error to errorp.

Btw: I've thought about adding assertations to error path witch may check
what errp pointer was properly initialized after error happends.
folowing code is draft only:
--- a/fs/ext3/namei.c
+++ b/fs/ext3/namei.c
@@ -63,6 +63,7 @@ static struct buffer_head *ext3_append(handle_t *handle,
EXT3_I(inode)->i_disksize = inode->i_size;
ext3_journal_get_write_access(handle,bh);
}
+   assert(bh || *err);
return bh;
 }
 
@@ -433,6 +434,7 @@ fail2:
frame--;
}
 fail:
+   assert(*err != 0);
return NULL;
 }

>
>
> In ext3_dx_add_entry() it appears like we should "goto journal_error"
> to report an error after the failed call to do_split(), instead of just
> "goto cleanup" so that we report an error in the filesystem.  Not 100%
> sure of this.
do_split() already invoked ext3_std_error() on it's "journal_error" path.

>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-2.6.21-rc2 compile warnings

2007-03-05 Thread Dmitriy Monakhov

"young dave" <[EMAIL PROTECTED]> writes:

> Hi,
> As I compile the linux-2.6.21-rc2  (gcc version 3.4.6), there's some
> warning  messages:
you should probably check -mm tree may be some issues already fixed where.
>
> drivers/video/Kconfig:1622:warning: 'select' used by config symbol
> 'FB_PS3' refer to undefined symbol 'PS3_PS3AV'
>
> fs/partitions/check.c: In function `add_partition':
> fs/partitions/check.c:388: warning: ignoring return value of
> `kobject_add', declared with attribute warn_unused_result
> fs/partitions/check.c:391: warning: ignoring return value of
> `sysfs_create_link', declared with attribute warn_unused_result
> fs/partitions/check.c:399: warning: ignoring return value of
> `sysfs_create_file', declared with attribute warn_unused_result
>
> drivers/ata/libata-sff.c: In function `ata_pci_init_one':
> drivers/ata/libata-sff.c:820: warning: ignoring return value of
> `pci_request_region', declared with attribute warn_unused_result
> drivers/ata/libata-sff.c:822: warning: ignoring return value of
> `pci_request_region', declared with attribute warn_unused_result
> drivers/ata/libata-sff.c:824: warning: ignoring return value of
> `pci_request_region', declared with attribute warn_unused_result
>
> drivers/base/core.c: In function `device_add':
> drivers/base/core.c:580: warning: ignoring return value of
> `sysfs_create_link', declared with attribute warn_unused_result
> drivers/base/core.c:585: warning: ignoring return value of
> `sysfs_create_link', declared with attribute warn_unused_result
> drivers/base/core.c:589: warning: ignoring return value of
> `sysfs_create_link', declared with attribute warn_unused_result
> drivers/base/core.c:594: warning: ignoring return value of
> `sysfs_create_link', declared with attribute warn_unused_result
> drivers/base/core.c: In function `device_rename':
> drivers/base/core.c:1039: warning: ignoring return value of
> `sysfs_create_link', declared with attribute warn_unused_result
> drivers/base/core.c:1049: warning: ignoring return value of
> `sysfs_create_link', declared with attribute warn_unused_result
>
> drivers/block/floppy.c: In function `floppy_init':
> drivers/block/floppy.c:4337: warning: ignoring return value of
I've already sent patch, but i wasn't added to -mm yet ( at least
it not rejecdted yet :) ) 
> `device_create_file', declared with attribute warn_unused_result
> drivers/cpufreq/cpufreq.c: In function `cpufreq_add_dev':
> drivers/cpufreq/cpufreq.c:826: warning: ignoring return value of
> `sysfs_create_file', declared with attribute warn_unused_result
> drivers/cpufreq/cpufreq.c:830: warning: ignoring return value of
> `sysfs_create_file', declared with attribute warn_unused_result
> drivers/cpufreq/cpufreq.c:832: warning: ignoring return value of
> `sysfs_create_file', declared with attribute warn_unused_result
>
> drivers/ide/pci/cs5530.c: In function `init_chipset_cs5530':
> drivers/ide/pci/cs5530.c:244: warning: ignoring return value of
> `pci_set_mwi', declared with attribute warn_unused_result
>
> drivers/ide/pci/sc1200.c: In function `sc1200_resume':
> drivers/ide/pci/sc1200.c:397: warning: ignoring return value of
> `pci_enable_device', declared with attribute warn_unused_result
>
> drivers/ide/setup-pci.c: In function `ide_scan_pcibus':
> drivers/ide/setup-pci.c:866: warning: ignoring return value of
> `__pci_register_driver', declared with attribute warn_unused_result
>
> drivers/input/mousedev.c: In function `mousedev_connect':
> drivers/input/mousedev.c:664: warning: ignoring return value of
> `sysfs_create_link', declared with attribute warn_unused_result
>
> drivers/input/joydev.c: In function `joydev_connect':
> drivers/input/joydev.c:542: warning: ignoring return value of
> `sysfs_create_link', declared with attribute warn_unused_result
>
>
> drivers/input/evdev.c: In function `evdev_connect':
> drivers/input/evdev.c:653: warning: ignoring return value of
> `sysfs_create_link', declared with attribute warn_unused_result
> drivers/message/i2o/device.c: In function `i2o_device_add':
> drivers/message/i2o/device.c:247: warning: ignoring return value of
> `sysfs_create_link', declared with attribute warn_unused_result
> drivers/message/i2o/device.c:254: warning: ignoring return value of
> `sysfs_create_link', declared with attribute warn_unused_result
> drivers/message/i2o/device.c:260: warning: ignoring return value of
> `sysfs_create_link', declared with attribute warn_unused_result
> drivers/message/i2o/device.c:267: warning: ignoring return value of
> `sysfs_create_link', declared with attribute warn_unused_result
>  CC  drivers/message/i2o/debug.
>
> drivers/net/sk98lin/skge.c: In function `skge_resume':
> drivers/net/sk98lin/skge.c:5128: warning: ignoring return value of
> `pci_enable_device', declared with attribute warn_unused_result
> drivers/net/tulip/xircom_tulip_cb.c: In function `outl_CSR6':
> drivers/net/tulip/xircom_tulip_cb.c:348: warning: `save_flags' is
> deprecated (declared at include/linux/inte

97 matches

Mail list logo