Re: fallocate() man page
Hi Michael,

On Mon, Jul 23, 2007 at 08:09:45AM +0200, Michael Kerrisk wrote:
> Amit,
>
> I've taken the page that you sent and made various minor formatting and
> wording fixes. I've also added various FIXMEs to the page. Some of these
> ("FIXME .") are things that I need to check up later. Some others are
> questions for which I need input from you, David, or someone else with the
> relevant info (I've marked these "FIXME Amit:"). Could you please review,
> and send a new draft of the page back to me.

Thanks for going through the manpage and improving it!

My comments are below in between ... tags.

Thanks!
--
Regards,
Amit Arora

.\" FIXME Amit: I need author and license information for this page.
.\"
.\"David Chinner is the original author, hence he can help with this.
.\"
.TH FALLOCATE 2 2007-07-20 "Linux" "Linux Programmer's Manual"
.SH NAME
fallocate \- manipulate file space
.SH SYNOPSIS
.nf
.\" FIXME . eventually this #include will probably be something
.\" different when support is added in glibc.
.B #include
.PP
.BI "long fallocate(int " fd ", int " mode ", loff_t " offset \
", loff_t " len ");
.\" FIXME . check later what feature test macros are required in
.\" glibc
.SH DESCRIPTION
.BR fallocate ()
allows the caller to directly manipulate the allocated disk space
for the file referred to by
.I fd
for the byte range starting at
.I offset
and continuing for
.I len
bytes.
The
.I mode
argument determines the operation to be performed on the given range.
Currently only one flag is supported for
.IR mode :
.TP
.B FALLOC_FL_KEEP_SIZE
allocates and initializes to zero the disk space within the given range.
.\" FIXME Amit: The next two sentences seem to contradict
.\" each other somewhat. On the one hand, later writes
.\" are guaranteed not to fail for lack of space; on the other
.\" hand, the file size is not changed even if it is currently
.\" smaller than offset+len bytes.
.\" Could you explain this a little further. (E.g., how does
.\" the kernel guarantee space without changing the size
.\" of the file?)
.\"
.\" Well, this is a feature where you can allocate/reserve space for
.\" a file without changing the file size. This is done by allocating blocks
.\" to the file, but still not changing the size. As mentioned below, this
.\" helps applications that use append mode a lot. These can open
.\" a file in append mode and start writing to "preallocated" space.
.\" So, if someone does a stat on a file after fallocate() with this mode (where
.\" file size is not changed), he/she will see that the st_blocks
.\" increased, but st_size did not change.
.\"
After a successful call, subsequent writes are guaranteed not to fail because
of lack of disk space.
Even if the size of the file is less than
.IR offset + len ,
the file size is not changed. This allows allocation of zeroed blocks beyond
the end of file and is useful for optimizing append workloads.
.\" FIXME Amit: Which other flags are likely to appear
.\" for mode, and in which kernel version are they likely?
.\"
.\"There were few more flags which were discussed, but none of
.\" them have been finalized upon. Here are these flags:
.\" FA_FL_DEALLOC, FA_FL_DEL_DATA, FA_FL_ERR_FREE, FA_FL_NO_MTIME, FA_FL_NO_CTIME
.\" All of the above flags were debated upon and we can not say if any/which one
.\" of these flags will make it to the later kernels.
.\"
.PP
If
.B FALLOC_FL_KEEP_SIZE
flag is not specified in
.IR mode ,
the default behavior is almost same as when this flag is specified.
The only difference is that on success,
the file size will be changed if the
.IR offset + len
is greater than the file size.
This default behavior closely resembles the behavior of the
.BR posix_fallocate (3)
library function,
and is intended as a method of optimally implementing that function.
.\" FIXME Amit: is it worth adding a few words to the following
.\" sentence to say why fallocate() may allocate a larger range
.\" than specified?
.\"
.\" The preallocation is done in block size chunks. Thus, if the last
.\" few bytes in the range falls in a new block, this entire block gets
.\" allocated to the file. Hence we may have slightly larger range allocated.
.\" I have tried to add one line to explain this below. Please see if it
.\" makes sense and is understandable. Thanks!
.\"
.PP
.BR fallocate ()
may allocate a larger range than that was specified.
.\"
.\" This is because allocation is done in block size chunks and hence
.\" the allocation will automatically get block aligned.
.\"
.SH RETURN VALUE
.BR fallocate ()
returns zero on success, or an error number on failure.
Note that
.\" FIXME . the library wrapper function will do the right
.\" thing, returning -1 on error and setting errno.
.I errno
is not set.
.SH ERRORS
.TP
.B EBADF
.I fd
is not a valid file descriptor, or is not opened for writing.
.TP
.B EFBIG
.IR offset + len
exceeds the maximum file size.
.TP
.B EINVAL
.I offset
was less than 0, or
.I len
was less than or equal
Re: fallocate() man page - draft 2
On Fri, Aug 03, 2007 at 01:59:53PM +0200, Michael Kerrisk wrote:
> >
> > There is a typo above. We have "file system" repeated twice in above
> > sentence. Second one should be "file".
> >
>
> Thanks for catching that.
>
> Okay -- it seems that this page is pretty much ready for publication,
> right? I'll hold off for a bit, until nearer the end of the 2.6.23 cycle.

I agree. Thanks!

--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: fallocate() man page
Hi Michael,

On Mon, Jul 30, 2007 at 09:43:08PM +0200, Michael Kerrisk wrote:
> Hello Amit.
>
> > On Mon, Jul 23, 2007 at 08:09:45AM +0200, Michael Kerrisk wrote:
> >> Amit,
> >>
> >> I've taken the page that you sent and made various minor formatting and
> >> wording fixes. I've also added various FIXMEs to the page. Some of these
> >> ("FIXME .") are things that I need to check up later. Some others are
> >> questions for which I need input from you, David, or someone else with the
> >> relevant info (I've marked these "FIXME Amit:"). Could you please review,
> >> and send a new draft of the page back to me.
> >
> > Thanks for going through the manpage and improving it!
> >
> > My comments are below in between ... tags.
> >
> > Thanks!
> [...]
>
> > The
> > .I mode
> > argument determines the operation to be performed on the given range.
> > Currently only one flag is supported for
> > .IR mode :
> > .TP
> > .B FALLOC_FL_KEEP_SIZE
> > allocates and initializes to zero the disk space within the given range.
> > .\" FIXME Amit: The next two sentences seem to contradict
> > .\" each other somewhat. On the one hand, later writes
> > .\" are guaranteed not to fail for lack of space; on the other
> > .\" hand, the file size is not changed even if it is currently
> > .\" smaller than offset+len bytes.
> > .\" Could you explain this a little further. (E.g., how does
> > .\" the kernel guarantee space without changing the size
> > .\" of the file?)
> > .\"
> > .\" Well, this is a feature where you can allocate/reserve space for
> > .\" a file without changing the file size. This is done by allocating blocks
> > .\" to the file, but still not changing the size. As mentioned below, this
> > .\" helps applications that use append mode a lot. These can open
> > .\" a file in append mode and start writing to "preallocated" space.
> > .\" So, if someone does a stat on a file after fallocate() with this mode (where
> > .\" file size is not changed), he/she will see that the st_blocks
> > .\" increased, but st_size did not change.
> > .\"
>
> Okay -- I tried rewording the text here a little to make this clearer. Can
> you review the new version to see that it's okay.
>
> [...]

Ok. Will review the draft version soon and will get back to you.

> > .\" FIXME Amit: Which other flags are likely to appear
> > .\" for mode, and in which kernel version are they likely?
> > .\"
> > .\"There were few more flags which were discussed, but none of
> > .\" them have been finalized upon. Here are these flags:
> > .\" FA_FL_DEALLOC, FA_FL_DEL_DATA, FA_FL_ERR_FREE, FA_FL_NO_MTIME, FA_FL_NO_CTIME
> > .\" All of the above flags were debated upon and we can not say if any/which one
> > .\" of these flags will make it to the later kernels.
> > .\"
>
> Thanks for the info.
>
> [...]
>
> > .\" FIXME Amit: is it worth adding a few words to the following
> > .\" sentence to say why fallocate() may allocate a larger range
> > .\" than specified?
> > .\"
> > .\" The preallocation is done in block size chunks. Thus, if the last
> > .\" few bytes in the range falls in a new block, this entire block gets
> > .\" allocated to the file. Hence we may have slightly larger range allocated.
> > .\" I have tried to add one line to explain this below. Please see if it
> > .\" makes sense and is understandable. Thanks!
> > .\"
>
> Thanks.
>
> > .PP
> > .BR fallocate ()
> > may allocate a larger range than that was specified.
> > .\"
> > .\" This is because allocation is done in block size chunks and hence
> > .\" the allocation will automatically get block aligned.
> > .\"
>
> I made the sentence:
>
>     Because allocation is done in block size chunks, fallocate()
>     may allocate a larger range than that which was specified.
>
> okay?
>
> [...]

Ok.

> > .TP
> > .B ENODEV
> > .I fd
> > does not refer to a regular file or a directory.
> > .TP
> > .B ENOSPC
> > There is not enough space left on the device containing the file
> > referred to by
> > .IR fd .
> > .TP
> > .B ESPIPE
> > .I fd
> > refers to a pipe of file descriptor.
> > .\" FIXME Amit: ENODEV says "fd is not a file or a directory";
> > .\" ESPIPE says (I had to fix the text a little) "refers to a pipe".
> > .\" This doesn't make sense: if fd is a pipe, then either one
> > .\" of these errors could occur. Which is it supposed to be?
> > .\"
> > .\"This is inline with posix_fallocate manpage. If it is a pipe,
> > .\" user will get ESPIPE.
> > .\"
>
> Okay -- thanks. I reworded the text for the ENODEV error to make this
> clearer. (Please check the wording in the next draft.)

Sure.

> By the way in fs/open.c I see the comment:
>
>         /*
>          * Let individual file system decide if it supports preallocation
>          * for directories or not.
>          */
>         if (!S_ISREG(inode->i_mode) && !S_ISDIR(inode->i_mode))
>                 goto out_fput;
>
> But that comment doesn't seem to accord with
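To illustrate the block-size remark discussed in this message, here is a minimal sketch of how a requested byte range grows once both ends are rounded out to block boundaries. The helper name `block_align_range`, the struct, and any block size passed in are assumptions made for illustration only; a real file system picks its own block size and may round differently.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not kernel code: expand [offset, offset + len)
 * outward so that both ends fall on block_size boundaries. */
struct byte_range {
    int64_t start;  /* first byte of the aligned range */
    int64_t len;    /* length of the aligned range in bytes */
};

static struct byte_range block_align_range(int64_t offset, int64_t len,
                                           int64_t block_size)
{
    assert(offset >= 0 && len > 0 && block_size > 0);

    /* Round the start down to the beginning of its block. */
    int64_t start = offset - (offset % block_size);
    /* One past the last requested byte. */
    int64_t end = offset + len;
    /* Round the end up to the next block boundary. */
    int64_t aligned_end = ((end + block_size - 1) / block_size) * block_size;

    struct byte_range r = { start, aligned_end - start };
    return r;
}
```

With a 4096-byte block, a request for 5000 bytes at offset 100 ends in the second block, so two whole blocks (8192 bytes) would be covered under these assumptions.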
Re: fallocate() man page - draft 2
Hi Michael,

On Mon, Jul 30, 2007 at 09:44:10PM +0200, Michael Kerrisk wrote:
> Amit, David,
>
> I've edited the previous version of the page, adding David's license, and
> integrating Amit's comments. I've also added a few new FIXMES. ("FIXME
> Amit" again.)

Ok, Thanks!

> Could you please review the changes, and the FIXMEs.

Please find my comments below..

> Cheers,
>
> Michael

--
Regards,
Amit Arora

> .\" Copyright (c) 2007 Silicon Graphics, Inc. All Rights Reserved
> .\" Written by Dave Chinner <[EMAIL PROTECTED]>
> .\" May be distributed as per GNU General Public License version 2.
> .\"
> .TH FALLOCATE 2 2007-07-20 "Linux" "Linux Programmer's Manual"
> .SH NAME
> fallocate \- manipulate file space
> .SH SYNOPSIS
> .nf
> .\" FIXME . eventually this #include will probably be something
> .\" different when support is added in glibc.
> .B #include
> .PP
> .BI "long fallocate(int " fd ", int " mode ", loff_t " offset \
> ", loff_t " len ");
> .\" FIXME . check later what feature test macros are required in
> .\" glibc
> .SH DESCRIPTION
> .BR fallocate ()
> allows the caller to directly manipulate the allocated disk space
> for the file referred to by
> .I fd
> for the byte range starting at
> .I offset
> and continuing for
> .I len
> bytes.
> .\" FIXME Amit: in other words the affected byte range
> .\" is the bytes from (offset) to (offset + len - 1), right?

Yes, you are right.

> The
> .I mode
> argument determines the operation to be performed on the given range.
> Currently only one flag is supported for
> .IR mode :
> .TP
> .B FALLOC_FL_KEEP_SIZE
> This flag allocates and initializes to zero the disk space
> within the range specified by
> .I offset
> and
> .IR len .
> After a successful call, subsequent writes into this range
> are guaranteed not to fail because of lack of disk space.
> Preallocating zeroed blocks beyond the end of the file
> is useful for optimizing append workloads.
> Preallocating blocks does not change
> the file size (as reported by
> .BR stat (2))
> even if it is less than
> .\" FIXME Amit: "offset + len" is written here. But should it be
> .\" "offset + len - 1" ?

Good point. This text was directly taken from the man page of
posix_fallocate and is also there on the posix specifications at:

http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html

The current posix_fallocate() implementation and also the fallocate()
implementation in ext4 are based on above documentation, wherein EOF is
compared with "offset + len" and not with "offset + len - 1".

I am not sure if this is right or wrong. But, this is as per posix
specifications. ;)

> .IR offset + len .
> .\"
> .\" Note from Amit Arora:
> .\" There were few more flags which were discussed, but none of
> .\" them have been finalized upon. Here are these flags:
> .\" FA_FL_DEALLOC, FA_FL_DEL_DATA, FA_FL_ERR_FREE, FA_FL_NO_MTIME,
> .\" FA_FL_NO_CTIME
> .\" All of the above flags were debated upon and we can not say
> .\" if any/which one of these flags will make it to the later kernels.
> .PP
> If
> .B FALLOC_FL_KEEP_SIZE
> flag is not specified in
> .IR mode ,
> the default behavior is almost same as when this flag is specified.
> The only difference is that on success,
> the file size will be changed if
> .\" FIXME Amit: "offset + len" is written here. But should it be
> .\" "offset + len - 1" ?

Please see my previous comment.

> .IR offset + len
> is greater than the file size.
> This default behavior closely resembles the behavior of the
> .BR posix_fallocate (3)
> library function,
> and is intended as a method of optimally implementing that function.
> .PP
> Because allocation is done in block size chunks,
> .BR fallocate ()
> may allocate a larger range than that which was specified.
> .SH RETURN VALUE
> .BR fallocate ()
> returns zero on success, or an error number on failure.
> Note that
> .\" FIXME . the library wrapper function will do the right
> .\" thing, returning -1 on error and setting errno.
> .I errno
> is not set.
> .SH ERRORS
> .TP
> .B EBADF
> .I fd
> is not a valid file descriptor, or is not opened for writing.
> .TP
> .B EFBIG
> .IR offset + len
> exceeds the maximum file size.
> .TP
> .B EINVAL
> .I offset
> was less than 0, or
> .I len
> was less than or equal to 0.
> .TP
> .B ENODEV
> .I fd
> does not refer to a regular file or a directory.
> (If
> .I fd
> is a pipe or FIFO, a different error results.)
> .TP
> .B ENOSPC
> There is not enough space left on the device containing the file
> referred to by
> .IR fd .
> .TP
> .B ESPIPE
> .I fd
> refers to a pipe or FIFO.
> .TP
> .B ENOSYS
> The file system containing the file system referred to by

There is a typo above. We have "file system" repeated twice in above
sentence. Second one should be "file".

> .I fd
> does not support this operation.
> .TP
> .B EINTR
> A signal was caught during execution.
> .TP
> .B EIO
> An I/O error occurred while reading from or writing to a file system.
> .TP
> .B EOPNOTSUPP
>
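The ENODEV and ESPIPE entries in the draft can be read as a small decision on the file type. The following is only a user-space sketch of that reading, assuming a pipe or FIFO yields ESPIPE and any other non-regular, non-directory file yields ENODEV; `fallocate_type_error` is a made-up name and is not part of the kernel or glibc.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* for the S_IF* constants under strict ISO C */
#endif
#include <errno.h>
#include <sys/stat.h>

/* Hypothetical user-space mirror of the file-type rules in the draft.
 * Returns 0 if the type is acceptable, otherwise the errno value the
 * draft man page associates with it. */
static int fallocate_type_error(mode_t mode)
{
    if (S_ISFIFO(mode))
        return ESPIPE;                      /* pipe or FIFO */
    if (!S_ISREG(mode) && !S_ISDIR(mode))
        return ENODEV;                      /* e.g. character device, socket */
    return 0;                               /* regular file or directory */
}
```

A caller could pass `st.st_mode` from fstat(2) to predict which of the two errors applies before issuing the call; whether a given file system then accepts directories is a separate question left to that file system.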
[PATCH 1/7][TAKE5] fallocate() implementation on i386, x86_64 and powerpc
This patch implements sys_fallocate() and adds support on i386, x86_64
and powerpc platforms.

Changelog:
----------
Changes from Take3 to Take4:
 1) Do not update c/mtime. Let each filesystem update ctime (update of
    mtime will not be required for allocation since we touch only
    metadata/inode and not blocks), if required.
Changes from Take2 to Take3:
 1) Patches now based on 2.6.22-rc1 kernel.
Changes from Take1(initial post on 26th April, 2007) to Take2:
 1) Added description before sys_fallocate() definition.
 2) Return EINVAL for len<=0 (With new draft that Ulrich pointed to,
    posix_fallocate should return EINVAL for len <= 0).
 3) Return EOPNOTSUPP if mode is not one of FA_ALLOCATE or FA_DEALLOCATE
 4) Do not return ENODEV for dirs (let individual file systems decide if
    they want to support preallocation to directories or not).
 5) Check for wrap through zero.
 6) Update c/mtime if fallocate() succeeds.
 7) Added mode descriptions in fs.h
 8) Added variable names to function definition (fallocate inode op)

Signed-off-by: Amit Arora <[EMAIL PROTECTED]>

Index: linux-2.6.22-rc4/arch/i386/kernel/syscall_table.S
===================================================================
--- linux-2.6.22-rc4.orig/arch/i386/kernel/syscall_table.S
+++ linux-2.6.22-rc4/arch/i386/kernel/syscall_table.S
@@ -323,3 +323,4 @@ ENTRY(sys_call_table)
 	.long sys_signalfd
 	.long sys_timerfd
 	.long sys_eventfd
+	.long sys_fallocate
Index: linux-2.6.22-rc4/arch/powerpc/kernel/sys_ppc32.c
===================================================================
--- linux-2.6.22-rc4.orig/arch/powerpc/kernel/sys_ppc32.c
+++ linux-2.6.22-rc4/arch/powerpc/kernel/sys_ppc32.c
@@ -773,6 +773,13 @@ asmlinkage int compat_sys_truncate64(con
 	return sys_truncate(path, (high << 32) | low);
 }
 
+asmlinkage long compat_sys_fallocate(int fd, int mode, u32 offhi, u32 offlo,
+				     u32 lenhi, u32 lenlo)
+{
+	return sys_fallocate(fd, mode, ((loff_t)offhi << 32) | offlo,
+			     ((loff_t)lenhi << 32) | lenlo);
+}
+
 asmlinkage int compat_sys_ftruncate64(unsigned int fd, u32 reg4, unsigned long high,
 				 unsigned long low)
 {
Index: linux-2.6.22-rc4/arch/x86_64/ia32/ia32entry.S
===================================================================
--- linux-2.6.22-rc4.orig/arch/x86_64/ia32/ia32entry.S
+++ linux-2.6.22-rc4/arch/x86_64/ia32/ia32entry.S
@@ -719,4 +719,5 @@ ia32_sys_call_table:
 	.quad compat_sys_signalfd
 	.quad compat_sys_timerfd
 	.quad sys_eventfd
+	.quad sys_fallocate
 ia32_syscall_end:
Index: linux-2.6.22-rc4/fs/open.c
===================================================================
--- linux-2.6.22-rc4.orig/fs/open.c
+++ linux-2.6.22-rc4/fs/open.c
@@ -353,6 +353,92 @@ asmlinkage long sys_ftruncate64(unsigned
 #endif
 
 /*
+ * sys_fallocate - preallocate blocks or free preallocated blocks
+ * @fd: the file descriptor
+ * @mode: mode specifies if fallocate should preallocate blocks OR free
+ *	  (unallocate) preallocated blocks. Currently only FA_ALLOCATE and
+ *	  FA_DEALLOCATE modes are supported.
+ * @offset: The offset within file, from where (un)allocation is being
+ *	    requested. It should not have a negative value.
+ * @len: The amount (in bytes) of space to be (un)allocated, from the offset.
+ *
+ * This system call, depending on the mode, preallocates or unallocates blocks
+ * for a file. The range of blocks depends on the value of offset and len
+ * arguments provided by the user/application. For FA_ALLOCATE mode, if this
+ * system call succeeds, subsequent writes to the file in the given range
+ * (specified by offset & len) should not fail - even if the file system
+ * later becomes full. Hence the preallocation done is persistent (valid
+ * even after reopen of the file and remount/reboot).
+ *
+ * It is expected that the ->fallocate() inode operation implemented by the
+ * individual file systems will update the file size and/or ctime/mtime
+ * depending on the mode and also on the success of the operation.
+ *
+ * Note: Incase the file system does not support preallocation,
+ * posix_fallocate() should fall back to the library implementation (i.e.
+ * allocating zero-filled new blocks to the file).
+ *
+ * Return Values
+ *	0	: On SUCCESS a value of zero is returned.
+ *	error	: On Failure, an error code will be returned.
+ * An error code of -ENOSYS or -EOPNOTSUPP should make posix_fallocate()
+ * fall back on library implementation of fallocate.
+ *
+ * Generic fallocate to be added for file systems that do not
+ * support fallocate it.
+ */
+asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
+{
+	struct file *file;
+	struct inode *inode;
+	long ret = -EINVAL;
+
+	if (offset < 0 || len <= 0)
+		goto out;
+
+	/* Return error if mode is not supported */
+	re
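The changelog items "Return EINVAL for len<=0" and "Check for wrap through zero" can be sketched outside the kernel as follows. This is not the code from the patch: the function name `check_fallocate_range` and the `max_bytes` parameter (standing in for the file system's maximum file size) are assumptions, and reporting the wrap as EFBIG is inferred from the EFBIG entry in the draft man page rather than from the truncated diff above.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-alone version of the argument checks described in
 * the changelog. Returns 0 or a negative errno value, kernel style.
 * max_bytes stands in for the file system's maximum file size. */
static int check_fallocate_range(int64_t offset, int64_t len, int64_t max_bytes)
{
    if (offset < 0 || len <= 0)
        return -EINVAL;

    /* "Wrap through zero": offset + len must not overflow. Comparing
     * against the remaining headroom avoids signed overflow in C. */
    if (len > INT64_MAX - offset)
        return -EFBIG;

    /* The end of the range must fit within the maximum file size. */
    if (offset + len > max_bytes)
        return -EFBIG;

    return 0;
}
```

Checking the headroom before adding keeps the test well defined in portable C, where signed overflow is undefined behaviour.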
[PATCH 2/7][TAKE5] fallocate() on s390(x)
This is the patch suggested by Martin Schwidefsky to support
sys_fallocate() on s390(x) platform.

He also suggested a wrapper in glibc to handle this system call on s390.
Posting it here so that we get feedback for this too.

	.globl __fallocate
ENTRY(__fallocate)
	stm	%r6,%r7,28(%r15)	/* save %r6/%r7 on stack */
	cfi_offset (%r7, -68)
	cfi_offset (%r6, -72)
	lm	%r6,%r7,96(%r15)	/* load loff_t len from stack */
	svc	SYS_ify(fallocate)
	lm	%r6,%r7,28(%r15)	/* restore %r6/%r7 from stack */
	br	%r14
PSEUDO_END(__fallocate)

Here are the comments and the patch to linux kernel from him.

-
From: Martin Schwidefsky <[EMAIL PROTECTED]>

This patch implements support of fallocate system call on s390(x)
platform. A wrapper is added to address the issue which s390 ABI has
with the arguments of this system call.

Signed-off-by: Martin Schwidefsky <[EMAIL PROTECTED]>

Index: linux-2.6.22-rc4/arch/s390/kernel/compat_wrapper.S
===================================================================
--- linux-2.6.22-rc4.orig/arch/s390/kernel/compat_wrapper.S 2007-06-11 16:16:01.0 -0700
+++ linux-2.6.22-rc4/arch/s390/kernel/compat_wrapper.S 2007-06-11 16:27:29.0 -0700
@@ -1683,6 +1683,16 @@
 	llgtr	%r3,%r3			# struct compat_timeval *
 	jg	compat_sys_utimes
 
+	.globl	sys_fallocate_wrapper
+sys_fallocate_wrapper:
+	lgfr	%r2,%r2			# int
+	lgfr	%r3,%r3			# int
+	sllg	%r4,%r4,32		# get high word of 64bit loff_t
+	lr	%r4,%r5			# get low word of 64bit loff_t
+	sllg	%r5,%r6,32		# get high word of 64bit loff_t
+	l	%r5,164(%r15)		# get low word of 64bit loff_t
+	jg	sys_fallocate
+
 	.globl	compat_sys_utimensat_wrapper
 compat_sys_utimensat_wrapper:
 	llgfr	%r2,%r2			# unsigned int
Index: linux-2.6.22-rc4/arch/s390/kernel/sys_s390.c
===================================================================
--- linux-2.6.22-rc4.orig/arch/s390/kernel/sys_s390.c 2007-06-11 16:16:01.0 -0700
+++ linux-2.6.22-rc4/arch/s390/kernel/sys_s390.c 2007-06-11 16:27:29.0 -0700
@@ -265,3 +265,32 @@
 		return -EFAULT;
 	return sys_fadvise64_64(a.fd, a.offset, a.len, a.advice);
 }
+
+#ifndef CONFIG_64BIT
+/*
+ * This is a wrapper to call sys_fallocate(). For 31 bit s390 the last
+ * 64 bit argument "len" is split into the upper and lower 32 bits. The
+ * system call wrapper in the user space loads the value to %r6/%r7.
+ * The code in entry.S keeps the values in %r2 - %r6 where they are and
+ * stores %r7 to 96(%r15). But the standard C linkage requires that
+ * the whole 64 bit value for len is stored on the stack and doesn't
+ * use %r6 at all. So s390_fallocate has to convert the arguments from
+ *   %r2: fd, %r3: mode, %r4/%r5: offset, %r6/96(%r15)-99(%r15): len
+ * to
+ *   %r2: fd, %r3: mode, %r4/%r5: offset, 96(%r15)-103(%r15): len
+ */
+asmlinkage long s390_fallocate(int fd, int mode, loff_t offset,
+			       u32 len_high, u32 len_low)
+{
+	union {
+		u64 len;
+		struct {
+			u32 high;
+			u32 low;
+		};
+	} cv;
+	cv.high = len_high;
+	cv.low = len_low;
+	return sys_fallocate(fd, mode, offset, cv.len);
+}
+#endif
Index: linux-2.6.22-rc4/arch/s390/kernel/syscalls.S
===================================================================
--- linux-2.6.22-rc4.orig/arch/s390/kernel/syscalls.S 2007-06-11 16:16:01.0 -0700
+++ linux-2.6.22-rc4/arch/s390/kernel/syscalls.S 2007-06-11 16:27:29.0 -0700
@@ -322,6 +322,7 @@
 SYSCALL(sys_getcpu,sys_getcpu,sys_getcpu_wrapper)
 SYSCALL(sys_epoll_pwait,sys_epoll_pwait,compat_sys_epoll_pwait_wrapper)
 SYSCALL(sys_utimes,sys_utimes,compat_sys_utimes_wrapper)
+SYSCALL(s390_fallocate,sys_fallocate,sys_fallocate_wrapper)
 NI_SYSCALL							/* 314 sys_fallocate */
 SYSCALL(sys_utimensat,sys_utimensat,compat_sys_utimensat_wrapper)	/* 315 */
 SYSCALL(sys_signalfd,sys_signalfd,compat_sys_signalfd_wrapper)
Index: linux-2.6.22-rc4/include/asm-s390/unistd.h
===================================================================
--- linux-2.6.22-rc4.orig/include/asm-s390/unistd.h 2007-06-11 16:16:01.0 -0700
+++ linux-2.6.22-rc4/include/asm-s390/unistd.h 2007-06-11 16:27:29.0 -0700
@@ -256,7 +256,8 @@
 #define __NR_signalfd		316
 #define __NR_timerfd		317
 #define __NR_eventfd		318
-#define NR_syscalls 319
+#define __NR_fallocate	319
+#define NR_syscalls 320
 
 /*
  * There are some system calls that are not present on 64 bit, some
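The s390_fallocate() wrapper above rebuilds the 64-bit len from two 32-bit halves through a union, which depends on the high word being stored first in memory (true on big-endian s390). For comparison, a minimal byte-order-independent sketch using shifts, in the style of the powerpc compat wrapper from patch 1/7; `join_u32` is a hypothetical name used only here.

```c
#include <stdint.h>

/* Hypothetical helper: rebuild a 64-bit value from its high and low
 * 32-bit halves using shifts, which gives the same result on any
 * byte order (the union in s390_fallocate() assumes big-endian). */
static uint64_t join_u32(uint32_t high, uint32_t low)
{
    /* Widen before shifting so the high half is not lost. */
    return ((uint64_t)high << 32) | low;
}
```

The cast before the shift matters: shifting a 32-bit value by 32 bits is undefined in C, so the high half has to be widened first.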
[PATCH 3/7][TAKE5] fallocate() on ia64
fallocate() on ia64

ia64 fallocate syscall support.

Signed-off-by: Dave Chinner <[EMAIL PROTECTED]>

Index: linux-2.6.22-rc4/arch/ia64/kernel/entry.S
===================================================================
--- linux-2.6.22-rc4.orig/arch/ia64/kernel/entry.S 2007-06-11 17:22:15.0 -0700
+++ linux-2.6.22-rc4/arch/ia64/kernel/entry.S 2007-06-11 17:30:37.0 -0700
@@ -1588,5 +1588,6 @@
 	data8 sys_signalfd
 	data8 sys_timerfd
 	data8 sys_eventfd
+	data8 sys_fallocate			// 1310
 
 	.org sys_call_table + 8*NR_syscalls	// guard against failures to increase NR_syscalls
Index: linux-2.6.22-rc4/include/asm-ia64/unistd.h
===================================================================
--- linux-2.6.22-rc4.orig/include/asm-ia64/unistd.h 2007-06-11 17:22:15.0 -0700
+++ linux-2.6.22-rc4/include/asm-ia64/unistd.h 2007-06-11 17:30:37.0 -0700
@@ -299,11 +299,12 @@
 #define __NR_signalfd			1307
 #define __NR_timerfd			1308
 #define __NR_eventfd			1309
+#define __NR_fallocate			1310
 
 #ifdef __KERNEL__
 
 
-#define NR_syscalls			286 /* length of syscall table */
+#define NR_syscalls			287 /* length of syscall table */
 
 /*
  * The following defines stop scripts/checksyscalls.sh from complaining about
[PATCH 4/7][TAKE5] support new modes in fallocate
Implement new flags and values for mode argument.

This patch implements the new flags and values for the "mode" argument
of the fallocate system call. It is based on the discussion between
Andreas Dilger and David Chinner on the man page proposed (by the latter)
on fallocate.

Signed-off-by: Amit Arora <[EMAIL PROTECTED]>

Index: linux-2.6.22-rc4/include/linux/fs.h
===================================================================
--- linux-2.6.22-rc4.orig/include/linux/fs.h
+++ linux-2.6.22-rc4/include/linux/fs.h
@@ -267,15 +267,16 @@ extern int dir_notify_enable;
 #define SYNC_FILE_RANGE_WAIT_AFTER	4
 
 /*
- * sys_fallocate modes
- * Currently sys_fallocate supports two modes:
- * FA_ALLOCATE  : This is the preallocate mode, using which an application/user
- *		  may request (pre)allocation of blocks.
- * FA_DEALLOCATE: This is the deallocate mode, which can be used to free
- *		  the preallocated blocks.
+ * sys_fallocate mode flags and values
  */
-#define FA_ALLOCATE	0x1
-#define FA_DEALLOCATE	0x2
+#define FA_FL_DEALLOC	0x01 /* default is allocate */
+#define FA_FL_KEEP_SIZE	0x02 /* default is extend/shrink size */
+#define FA_FL_DEL_DATA	0x04 /* default is keep written data on DEALLOC */
+
+#define FA_ALLOCATE	0
+#define FA_DEALLOCATE	FA_FL_DEALLOC
+#define FA_RESV_SPACE	FA_FL_KEEP_SIZE
+#define FA_UNRESV_SPACE	(FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA)
 
 #ifdef __KERNEL__
 
Index: linux-2.6.22-rc4/fs/open.c
===================================================================
--- linux-2.6.22-rc4.orig/fs/open.c
+++ linux-2.6.22-rc4/fs/open.c
@@ -356,23 +356,26 @@ asmlinkage long sys_ftruncate64(unsigned
  * sys_fallocate - preallocate blocks or free preallocated blocks
  * @fd: the file descriptor
  * @mode: mode specifies if fallocate should preallocate blocks OR free
- *	  (unallocate) preallocated blocks. Currently only FA_ALLOCATE and
- *	  FA_DEALLOCATE modes are supported.
+ *	  (unallocate) preallocated blocks.
  * @offset: The offset within file, from where (un)allocation is being
  *	    requested. It should not have a negative value.
  * @len: The amount (in bytes) of space to be (un)allocated, from the offset.
  *
  * This system call, depending on the mode, preallocates or unallocates blocks
  * for a file. The range of blocks depends on the value of offset and len
- * arguments provided by the user/application. For FA_ALLOCATE mode, if this
+ * arguments provided by the user/application. For FA_ALLOCATE and
+ * FA_RESV_SPACE modes, if the sys_fallocate()
  * system call succeeds, subsequent writes to the file in the given range
  * (specified by offset & len) should not fail - even if the file system
  * later becomes full. Hence the preallocation done is persistent (valid
- * even after reopen of the file and remount/reboot).
+ * even after reopen of the file and remount/reboot). If FA_RESV_SPACE mode
+ * is passed, the file size will not be changed even if the preallocation
+ * is beyond EOF.
  *
  * It is expected that the ->fallocate() inode operation implemented by the
  * individual file systems will update the file size and/or ctime/mtime
- * depending on the mode and also on the success of the operation.
+ * depending on the mode (change is visible to user or not - say file size)
+ * and obviously, on the success of the operation.
  *
  * Note: Incase the file system does not support preallocation,
  * posix_fallocate() should fall back to the library implementation (i.e.
@@ -398,7 +401,8 @@ asmlinkage long sys_fallocate(int fd, in
 
 	/* Return error if mode is not supported */
 	ret = -EOPNOTSUPP;
-	if (mode != FA_ALLOCATE && mode != FA_DEALLOCATE)
+	if (!(mode == FA_ALLOCATE || mode == FA_DEALLOCATE ||
+		mode == FA_RESV_SPACE || mode == FA_UNRESV_SPACE))
 		goto out;
 
 	ret = -EBADF;
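To make the mode check in this patch easier to follow, here is a small stand-alone sketch. The flag and mode values are copied from the fs.h hunk above; the function `check_fallocate_mode` is a hypothetical name, not something the patch defines. It shows that only the four named combinations pass, so a lone FA_FL_DEL_DATA or FA_FL_DEALLOC | FA_FL_KEEP_SIZE would be rejected.

```c
#include <errno.h>

/* Flag and mode values copied from the fs.h hunk in this patch. */
#define FA_FL_DEALLOC   0x01
#define FA_FL_KEEP_SIZE 0x02
#define FA_FL_DEL_DATA  0x04

#define FA_ALLOCATE     0
#define FA_DEALLOCATE   FA_FL_DEALLOC
#define FA_RESV_SPACE   FA_FL_KEEP_SIZE
#define FA_UNRESV_SPACE (FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA)

/* Hypothetical stand-alone version of the mode check in sys_fallocate():
 * only the four named combinations pass; any other mix of flags is
 * rejected with -EOPNOTSUPP. */
static int check_fallocate_mode(int mode)
{
    switch (mode) {
    case FA_ALLOCATE:
    case FA_DEALLOCATE:
    case FA_RESV_SPACE:
    case FA_UNRESV_SPACE:
        return 0;
    default:
        return -EOPNOTSUPP;
    }
}
```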
[PATCH 7/7][TAKE5] ext4: support new modes
Support new values of mode in ext4.

This patch supports new mode values/flags in ext4. With this patch ext4
will be able to support FA_ALLOCATE and FA_RESV_SPACE modes. Supporting
FA_DEALLOCATE and FA_UNRESV_SPACE fallocate modes in ext4 is a work for
future.

Signed-off-by: Amit Arora <[EMAIL PROTECTED]>

Index: linux-2.6.22-rc4/fs/ext4/extents.c
===================================================================
--- linux-2.6.22-rc4.orig/fs/ext4/extents.c
+++ linux-2.6.22-rc4/fs/ext4/extents.c
@@ -2477,7 +2477,8 @@ int ext4_ext_writepage_trans_blocks(stru
 /*
  * preallocate space for a file. This implements ext4's fallocate inode
  * operation, which gets called from sys_fallocate system call.
- * Currently only FA_ALLOCATE mode is supported on extent based files.
+ * Currently only FA_ALLOCATE and FA_RESV_SPACE modes are supported on
+ * extent based files.
  * We may have more modes supported in future - like FA_DEALLOCATE, which
  * tells fallocate to unallocate previously (pre)allocated blocks.
  * For block-mapped files, posix_fallocate should fall back to the method
@@ -2499,7 +2500,8 @@ long ext4_fallocate(struct inode *inode,
 	 * currently supporting (pre)allocate mode for extent-based
 	 * files _only_
 	 */
-	if (mode != FA_ALLOCATE || !(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
+	if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL) ||
+		!(mode == FA_ALLOCATE || mode == FA_RESV_SPACE))
 		return -EOPNOTSUPP;
 
 	/* preallocation to directories is currently not supported */
@@ -2572,9 +2574,10 @@ retry:
 
 	/*
 	 * Time to update the file size.
-	 * Update only when preallocation was requested beyond the file size.
+	 * Update only when preallocation was requested beyond the file size
+	 * and when FA_FL_KEEP_SIZE mode is not specified!
 	 */
-	if ((offset + len) > i_size_read(inode)) {
+	if (!(mode & FA_FL_KEEP_SIZE) && (offset + len) > i_size_read(inode)) {
 		if (ret > 0) {
 			/*
 			 * if no error, we assume preallocation succeeded
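The size-update condition in the last hunk also bears on the "offset + len" versus "offset + len - 1" question raised in the man page thread: the last byte covered is offset + len - 1, and a file holding that byte has size offset + len. A minimal sketch of the rule, assuming the FA_FL_KEEP_SIZE value from patch 4/7; `size_after_fallocate` is a hypothetical helper, not part of ext4.

```c
#include <stdint.h>

#define FA_FL_KEEP_SIZE 0x02    /* value taken from the fs.h hunk in patch 4/7 */

/* Hypothetical helper mirroring the condition in the ext4 hunk:
 * the preallocated bytes are offset .. offset + len - 1, so the
 * smallest file size that covers them is offset + len. */
static int64_t size_after_fallocate(int64_t old_size, int64_t offset,
                                    int64_t len, int mode)
{
    int64_t end = offset + len;     /* one past the last preallocated byte */

    if (!(mode & FA_FL_KEEP_SIZE) && end > old_size)
        return end;                 /* default: file grows to cover the range */
    return old_size;                /* otherwise the size is left alone */
}
```

For example, preallocating one byte at offset 4095 of a 4096-byte file touches only the file's existing last byte, so the size stays at 4096 under this rule.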
[PATCH 6/7][TAKE5] ext4: write support for preallocated blocks
This patch adds write support to the uninitialized extents that get created when a preallocation is done using fallocate(). It takes care of splitting the extents into multiple (upto three) extents and merging the new split extents with neighbouring ones, if possible. Changelog: - Changes from Take3 to Take4: - no change - Changes from Take2 to Take3: 1) Patch now rebased to 2.6.22-rc1 kernel. Changes from Take1 to Take2: 1) Replaced BUG_ON with WARN_ON & ext4_error. 2) Added variable names to the function declaration of ext4_ext_try_to_merge(). 3) Updated variable declarations to use multiple-definitions-per-line. 4) "if((a=foo())).." was broken into "a=foo(); if(a).." 5) Removed extra spaces. Signed-off-by: Amit Arora <[EMAIL PROTECTED]> Index: linux-2.6.22-rc4/fs/ext4/extents.c === --- linux-2.6.22-rc4.orig/fs/ext4/extents.c +++ linux-2.6.22-rc4/fs/ext4/extents.c @@ -1167,6 +1167,53 @@ ext4_can_extents_be_merged(struct inode } /* + * This function tries to merge the "ex" extent to the next extent in the tree. + * It always tries to merge towards right. If you want to merge towards + * left, pass "ex - 1" as argument instead of "ex". + * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns + * 1 if they got merged. + */ +int ext4_ext_try_to_merge(struct inode *inode, + struct ext4_ext_path *path, + struct ext4_extent *ex) +{ + struct ext4_extent_header *eh; + unsigned int depth, len; + int merge_done = 0; + int uninitialized = 0; + + depth = ext_depth(inode); + BUG_ON(path[depth].p_hdr == NULL); + eh = path[depth].p_hdr; + + while (ex < EXT_LAST_EXTENT(eh)) { + if (!ext4_can_extents_be_merged(inode, ex, ex + 1)) + break; + /* merge with next extent! 
*/ + if (ext4_ext_is_uninitialized(ex)) + uninitialized = 1; + ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex) + + ext4_ext_get_actual_len(ex + 1)); + if (uninitialized) + ext4_ext_mark_uninitialized(ex); + + if (ex + 1 < EXT_LAST_EXTENT(eh)) { + len = (EXT_LAST_EXTENT(eh) - ex - 1) + * sizeof(struct ext4_extent); + memmove(ex + 1, ex + 2, len); + } + eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries) - 1); + merge_done = 1; + WARN_ON(eh->eh_entries == 0); + if (!eh->eh_entries) + ext4_error(inode->i_sb, "ext4_ext_try_to_merge", + "inode#%lu, eh->eh_entries = 0!", inode->i_ino); + } + + return merge_done; +} + +/* * check if a portion of the "newext" extent overlaps with an * existing extent. * @@ -1354,25 +1401,7 @@ has_space: merge: /* try to merge extents to the right */ - while (nearex < EXT_LAST_EXTENT(eh)) { - if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1)) - break; - /* merge with next extent! */ - if (ext4_ext_is_uninitialized(nearex)) - uninitialized = 1; - nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex) - + ext4_ext_get_actual_len(nearex + 1)); - if (uninitialized) - ext4_ext_mark_uninitialized(nearex); - - if (nearex + 1 < EXT_LAST_EXTENT(eh)) { - len = (EXT_LAST_EXTENT(eh) - nearex - 1) - * sizeof(struct ext4_extent); - memmove(nearex + 1, nearex + 2, len); - } - eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1); - BUG_ON(eh->eh_entries == 0); - } + ext4_ext_try_to_merge(inode, path, nearex); /* try to merge extents to the left */ @@ -2035,15 +2064,158 @@ void ext4_ext_release(struct super_block #endif } +/* + * This function is called by ext4_ext_get_blocks() if someone tries to write + * to an uninitialized extent. It may result in splitting the uninitialized + * extent into multiple extents (upto three - one initialized and two + * uninitialized). 
+ * There are three possibilities: + * a> There is no split required: Entire extent should be initialized + * b> Splits in two extents: Write is happening at either end of the extent + * c> Splits in three extents: Somone is writing in middle of the extent + */ +int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode, + struct ext4_ext_path *path, + ext4_fsblk_t iblock, + unsigned long max
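The three cases listed in the comment above (no split, split in two, split in three) can be sketched in isolation. This is only an illustration, not the code from the patch: `struct simple_extent` and `split_for_write()` are hypothetical names, and the on-disk format, journalling and extent-tree updates are ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified extent: not the on-disk ext4 structure. */
struct simple_extent {
	unsigned int start;	/* first logical block */
	unsigned int len;	/* number of blocks */
	int uninitialized;	/* 1 = preallocated but never written */
};

/*
 * Split the uninitialized extent "ex" for a write that covers blocks
 * [wstart, wstart + wlen). The written part becomes initialized; whatever
 * is left before or after it stays uninitialized. Stores up to three
 * extents in out[] and returns how many were stored.
 */
static size_t split_for_write(struct simple_extent ex,
			      unsigned int wstart, unsigned int wlen,
			      struct simple_extent out[3])
{
	unsigned int ex_end = ex.start + ex.len;
	unsigned int wend = wstart + wlen;
	size_t n = 0;

	/* The sketch only handles writes that fall inside the extent. */
	assert(ex.uninitialized);
	assert(wlen > 0 && wstart >= ex.start && wend <= ex_end);

	if (wstart > ex.start)	/* leading remainder stays uninitialized */
		out[n++] = (struct simple_extent){ ex.start, wstart - ex.start, 1 };

	/* the range being written becomes an initialized extent */
	out[n++] = (struct simple_extent){ wstart, wlen, 0 };

	if (wend < ex_end)	/* trailing remainder stays uninitialized */
		out[n++] = (struct simple_extent){ wend, ex_end - wend, 1 };

	return n;
}
```

A write covering the whole extent yields one extent (case a), a write touching either end yields two (case b), and a write in the middle yields three (case c).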
[PATCH 0/6][TAKE5] fallocate system call
NOTE:
-----
1) Only Patches 4/7 and 7/7 are NEW. The rest of them are _already_ part of the ext4 patch queue git tree hosted by Ted.
2) The above new patches (4/7 and 7/7) are based on the discussion between Andreas Dilger and David Chinner on the mode argument, after the latter posted a man page on fallocate.
3) All of these patches are based on the 2.6.22-rc4 kernel and apply to 2.6.22-rc5 too (with some successful hunks, though - since the ext4 patch queue git tree has some other patches as well before the fallocate patches in the patch series).

Changelog:
----------
Changes from Take4 to Take5:
1) New Patch 4/7 implements new flags and values for the mode argument of the fallocate system call.
2) New Patch 7/7 implements 2 (out of 4) modes in ext4. Implementation of the remaining two modes is yet to be done.
3) Updated the interface description below to mention the new modes being supported.
4) Removed the "extent overlap check" bugfix (patch 4/6 in TAKE4), since it is now part of mainline.
5) Corrected the format of a couple of multi-line comments, which got missed in the earlier take.

Changes from Take2 to Take3:
1) Return type is now described in the interface description above.
2) Patches rebased to 2.6.22-rc1 kernel.

** Each post will have an individual changelog for a particular patch.

Description:
------------
fallocate() is a new system call being proposed here which will allow applications to preallocate space to any file(s) in a file system. Each file system implementation that wants to use this feature will need to support an inode operation called fallocate.

Applications can use this feature to avoid fragmentation to a certain level and thus get faster access speed. With preallocation, applications also get a guarantee of space for particular file(s) - even if the system later becomes full.

Currently, glibc provides an interface called posix_fallocate() which can be used for a similar purpose.
Though this has the advantage of working on all file systems, it is quite slow (since it writes zeroes to each block that has to be preallocated). Without a doubt, file systems can do this more efficiently within the kernel, by implementing the proposed fallocate() system call. It is expected that posix_fallocate() will be modified to call this new system call first and, in case the kernel/filesystem does not implement it, fall back to the current implementation of writing zeroes to the new blocks.

Interface:
----------
The system call's layout is:

asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);

fd: The descriptor of the open file.

mode*: This specifies the behavior of the system call. Currently the system call supports four modes - FA_ALLOCATE, FA_DEALLOCATE, FA_RESV_SPACE and FA_UNRESV_SPACE.

FA_ALLOCATE: Applications can use this mode to preallocate blocks to a given file (specified by fd). This mode changes the file size if the preallocation is done beyond the EOF. It also updates the ctime in the inode of the corresponding file, marking a successful allocation.

FA_RESV_SPACE: This mode is much the same as FA_ALLOCATE. The only difference is that the file size will not be changed.

FA_DEALLOCATE: This mode can be used by applications to deallocate the previously preallocated blocks. This may also change the file size and the ctime/mtime. This is the reverse of FA_ALLOCATE mode.

FA_UNRESV_SPACE: This mode is much the same as FA_DEALLOCATE. The difference is that the file size is not changed and the data is also deleted.

* New modes might get added in future.

offset: This is the offset in bytes, from where the preallocation should start.

len: This is the number of bytes requested for preallocation (from offset).

RETURN VALUE: The system call returns 0 on success and an error on failure. This is done to keep the semantics the same as those of posix_fallocate().
sys_fallocate() on s390: --- There is a problem with s390 ABI to implement sys_fallocate() with the proposed order of arguments. Martin Schwidefsky has suggested a patch to solve this problem which makes use of a wrapper in the kernel. This will require special handling of this system call on s390 in glibc as well. But, this seems to be the best solution so far. Known Problem: - mmapped writes into uninitialized extents is a known problem with the current ext4 patches. Like XFS, ext4 may need to implement ->page_mkwrite() to solve this. See: http://lkml.org/lkml/2007/5/8/583 Since there is a talk of ->fault() replacing ->page_mkwrite() and also with a generic block_page_mkwrite() implementation already posted, we can implement this later some time. See: http://lkml.org/lkml/2007/3/7/161 http://lkml.org/lkml/2007/3/18/198 ToDos: - 1> Implementation on other architectures (other than i386, x86_64, ia64, ppc64 and s390(x
[PATCH 5/7][TAKE5] ext4: fallocate support in ext4
This patch implements the ->fallocate() inode operation in ext4. With this patch users of ext4 file systems will be able to use the fallocate() system call for persistent preallocation. The current implementation only supports preallocation for regular files (directories are not supported as of date) with extent maps. This patch does not support block-mapped files currently. Only FA_ALLOCATE mode is supported as of now. Supporting FA_DEALLOCATE mode is a to-do item.

Changelog:
----------
Changes from Take3 to Take4:
1) Changed ext4_fallocate() declaration and definition to return a "long" and not an "int", to match the ->fallocate() inode op.
2) Update ctime if new blocks get allocated.

Changes from Take2 to Take3:
1) Patch rebased to 2.6.22-rc1 kernel version.
2) Removed unnecessary "EXPORT_SYMBOL(ext4_fallocate);".

Changes from Take1 to Take2:
1) Added more description for ext4_fallocate().
2) Now returning EOPNOTSUPP when files are block-mapped (non-extent).
3) Moved journal_start & journal_stop inside the while loop.
4) Replaced BUG_ON with WARN_ON & ext4_error.
5) Make EXT4_BLOCK_ALIGN use ALIGN macro internally.
6) Added variable names in the function declaration of ext4_fallocate().
7) Converted macros that handle uninitialized extents into inline functions.
Signed-off-by: Amit Arora <[EMAIL PROTECTED]> Index: linux-2.6.22-rc4/fs/ext4/extents.c === --- linux-2.6.22-rc4.orig/fs/ext4/extents.c +++ linux-2.6.22-rc4/fs/ext4/extents.c @@ -316,7 +316,7 @@ static void ext4_ext_show_path(struct in } else if (path->p_ext) { ext_debug(" %d:%d:%llu ", le32_to_cpu(path->p_ext->ee_block), - le16_to_cpu(path->p_ext->ee_len), + ext4_ext_get_actual_len(path->p_ext), ext_pblock(path->p_ext)); } else ext_debug(" []"); @@ -339,7 +339,7 @@ static void ext4_ext_show_leaf(struct in for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) { ext_debug("%d:%d:%llu ", le32_to_cpu(ex->ee_block), - le16_to_cpu(ex->ee_len), ext_pblock(ex)); + ext4_ext_get_actual_len(ex), ext_pblock(ex)); } ext_debug("\n"); } @@ -455,7 +455,7 @@ ext4_ext_binsearch(struct inode *inode, ext_debug(" -> %d:%llu:%d ", le32_to_cpu(path->p_ext->ee_block), ext_pblock(path->p_ext), - le16_to_cpu(path->p_ext->ee_len)); + ext4_ext_get_actual_len(path->p_ext)); #ifdef CHECK_BINSEARCH { @@ -713,7 +713,7 @@ static int ext4_ext_split(handle_t *hand ext_debug("move %d:%llu:%d in new leaf %llu\n", le32_to_cpu(path[depth].p_ext->ee_block), ext_pblock(path[depth].p_ext), - le16_to_cpu(path[depth].p_ext->ee_len), + ext4_ext_get_actual_len(path[depth].p_ext), newblock); /*memmove(ex++, path[depth].p_ext++, sizeof(struct ext4_extent)); @@ -1133,7 +1133,19 @@ static int ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1, struct ext4_extent *ex2) { - if (le32_to_cpu(ex1->ee_block) + le16_to_cpu(ex1->ee_len) != + unsigned short ext1_ee_len, ext2_ee_len; + + /* +* Make sure that either both extents are uninitialized, or +* both are _not_. 
+*/ + if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2)) + return 0; + + ext1_ee_len = ext4_ext_get_actual_len(ex1); + ext2_ee_len = ext4_ext_get_actual_len(ex2); + + if (le32_to_cpu(ex1->ee_block) + ext1_ee_len != le32_to_cpu(ex2->ee_block)) return 0; @@ -1142,14 +1154,14 @@ ext4_can_extents_be_merged(struct inode * as an RO_COMPAT feature, refuse to merge to extents if * this can result in the top bit of ee_len being set. */ - if (le16_to_cpu(ex1->ee_len) + le16_to_cpu(ex2->ee_len) > EXT_MAX_LEN) + if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN) return 0; #ifdef AGGRESSIVE_TEST if (le16_to_cpu(ex1->ee_len) >= 4) return 0; #endif - if (ext_pblock(ex1) + le16_to_cpu(ex1->ee_len) == ext_pblock(ex2)) + if (ext_pblock(ex1) + ext1_ee_len == ext_pblock(ex2)) return 1; return 0; } @@ -1171,7 +1183,7 @@ unsigned int ext4_ext_check_overlap(stru unsigned int ret = 0; b1 = le32_to_cpu(newext->ee_block); - len1 = le16_to_cpu(newext->ee_len); + len1 = ext4_ext_get_actual_len(newext); depth = ext_depth(inode);
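The merge test in the hunk above boils down to four conditions, which can be tried out in user space. A minimal sketch follows, assuming that the top bit of the 16-bit length field is what marks an extent as uninitialized and that the maximum length is therefore 0x7fff; the helper bodies are not part of the quoted hunk, so both values and all names here (`mini_extent`, `can_merge()`, etc.) are assumptions made for illustration only. Endianness conversion is left out.

```c
#include <assert.h>
#include <stdint.h>

#define UNINIT_BIT	0x8000u	/* assumed: top bit of the length flags "uninitialized" */
#define MAX_LEN		0x7fffu	/* assumed: largest length that keeps the top bit clear */

/* Hypothetical in-memory extent, host-endian; not struct ext4_extent. */
struct mini_extent {
	uint32_t block;		/* first logical block */
	uint16_t len;		/* length, top bit doubles as the flag */
	uint64_t pblock;	/* first physical block */
};

static int is_uninitialized(const struct mini_extent *ex)
{
	return (ex->len & UNINIT_BIT) != 0;
}

/* Length with the flag bit masked off. */
static unsigned int actual_len(const struct mini_extent *ex)
{
	return ex->len & MAX_LEN;
}

/* Returns 1 if "b" can be folded into "a", 0 otherwise. */
static int can_merge(const struct mini_extent *a, const struct mini_extent *b)
{
	unsigned int la = actual_len(a), lb = actual_len(b);

	/* both extents must be in the same state */
	if (is_uninitialized(a) != is_uninitialized(b))
		return 0;
	/* logically adjacent */
	if (a->block + la != b->block)
		return 0;
	/* the merged length must not spill into the flag bit */
	if (la + lb > MAX_LEN)
		return 0;
	/* physically adjacent */
	return a->pblock + la == b->pblock;
}
```

The point of reading the length through `actual_len()` everywhere is that a raw read of the field would be off by 0x8000 for every preallocated extent.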
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
I have not implemented FA_FL_FREE_ENOSPC and FA_ZERO_SPACE flags yet, as *suggested* by Andreas in http://lkml.org/lkml/2007/6/14/323 post. If it is decided that these flags are also needed, I will update this patch. Thanks! On Mon, Jun 25, 2007 at 07:15:00PM +0530, Amit K. Arora wrote: > Implement new flags and values for mode argument. > > This patch implements the new flags and values for the "mode" argument > of the fallocate system call. It is based on the discussion between > Andreas Dilger and David Chinner on the man page proposed (by the later) > on fallocate. > > Signed-off-by: Amit Arora <[EMAIL PROTECTED]> > > Index: linux-2.6.22-rc4/include/linux/fs.h > === > --- linux-2.6.22-rc4.orig/include/linux/fs.h > +++ linux-2.6.22-rc4/include/linux/fs.h > @@ -267,15 +267,16 @@ extern int dir_notify_enable; > #define SYNC_FILE_RANGE_WAIT_AFTER 4 > > /* > - * sys_fallocate modes > - * Currently sys_fallocate supports two modes: > - * FA_ALLOCATE : This is the preallocate mode, using which an > application/user > - * may request (pre)allocation of blocks. > - * FA_DEALLOCATE: This is the deallocate mode, which can be used to free > - * the preallocated blocks. 
> + * sys_fallocate mode flags and values > */ > -#define FA_ALLOCATE 0x1 > -#define FA_DEALLOCATE0x2 > +#define FA_FL_DEALLOC0x01 /* default is allocate */ > +#define FA_FL_KEEP_SIZE 0x02 /* default is extend/shrink size */ > +#define FA_FL_DEL_DATA 0x04 /* default is keep written data on DEALLOC > */ > + > +#define FA_ALLOCATE 0 > +#define FA_DEALLOCATEFA_FL_DEALLOC > +#define FA_RESV_SPACEFA_FL_KEEP_SIZE > +#define FA_UNRESV_SPACE (FA_FL_DEALLOC | FA_FL_KEEP_SIZE | > FA_FL_DEL_DATA) > > #ifdef __KERNEL__ > > Index: linux-2.6.22-rc4/fs/open.c > === > --- linux-2.6.22-rc4.orig/fs/open.c > +++ linux-2.6.22-rc4/fs/open.c > @@ -356,23 +356,26 @@ asmlinkage long sys_ftruncate64(unsigned > * sys_fallocate - preallocate blocks or free preallocated blocks > * @fd: the file descriptor > * @mode: mode specifies if fallocate should preallocate blocks OR free > - * (unallocate) preallocated blocks. Currently only FA_ALLOCATE and > - * FA_DEALLOCATE modes are supported. > + * (unallocate) preallocated blocks. > * @offset: The offset within file, from where (un)allocation is being > * requested. It should not have a negative value. > * @len: The amount (in bytes) of space to be (un)allocated, from the offset. > * > * This system call, depending on the mode, preallocates or unallocates > blocks > * for a file. The range of blocks depends on the value of offset and len > - * arguments provided by the user/application. For FA_ALLOCATE mode, if this > + * arguments provided by the user/application. For FA_ALLOCATE and > + * FA_RESV_SPACE modes, if the sys_fallocate() > * system call succeeds, subsequent writes to the file in the given range > * (specified by offset & len) should not fail - even if the file system > * later becomes full. Hence the preallocation done is persistent (valid > - * even after reopen of the file and remount/reboot). > + * even after reopen of the file and remount/reboot). 
If FA_RESV_SPACE mode > + * is passed, the file size will not be changed even if the preallocation > + * is beyond EOF. > * > * It is expected that the ->fallocate() inode operation implemented by the > * individual file systems will update the file size and/or ctime/mtime > - * depending on the mode and also on the success of the operation. > + * depending on the mode (change is visible to user or not - say file size) > + * and obviously, on the success of the operation. > * > * Note: Incase the file system does not support preallocation, > * posix_fallocate() should fall back to the library implementation (i.e. > @@ -398,7 +401,8 @@ asmlinkage long sys_fallocate(int fd, in > > /* Return error if mode is not supported */ > ret = -EOPNOTSUPP; > - if (mode != FA_ALLOCATE && mode != FA_DEALLOCATE) > + if (!(mode == FA_ALLOCATE || mode == FA_DEALLOCATE || > + mode == FA_RESV_SPACE || mode == FA_UNRESV_SPACE)) > goto out; > > ret = -EBADF; > - > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to [EMAIL PROTECTED] > More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Mon, Jun 25, 2007 at 03:46:26PM -0600, Andreas Dilger wrote:
> On Jun 25, 2007 20:33 +0530, Amit K. Arora wrote:
> > I have not implemented FA_FL_FREE_ENOSPC and FA_ZERO_SPACE flags yet, as
> > *suggested* by Andreas in http://lkml.org/lkml/2007/6/14/323 post.
> > If it is decided that these flags are also needed, I will update this
> > patch. Thanks!
>
> Can you clarify - what is the current behaviour when ENOSPC (or some other
> error) is hit? Does it keep the current fallocate() or does it free it?

Currently it is left to the file system implementation. In ext4, we do not undo preallocation if some error (say, ENOSPC) is hit. Hence it may end up with partial (pre)allocation. This is in line with dd and posix_fallocate, which also do not free the partially allocated space.

> For FA_ZERO_SPACE - I'd think this would (IMHO) be the default - we
> don't want to expose uninitialized disk blocks to userspace. I'm not
> sure if this makes sense at all.

I don't think we need to make it the default - at least for filesystems which have a mechanism to distinguish preallocated blocks from "regular" ones. In ext4, for example, we will have a way to mark uninitialized extents. All the preallocated blocks will be part of these uninitialized extents. Any read on these extents will treat them as a hole, returning zeroes to user land. Thus any existing data on uninitialized blocks will not be exposed to userspace.

--
Regards,
Amit Arora
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Mon, Jun 25, 2007 at 03:52:39PM -0600, Andreas Dilger wrote:
> On Jun 25, 2007 19:15 +0530, Amit K. Arora wrote:
> > +#define FA_FL_DEALLOC 0x01 /* default is allocate */
> > +#define FA_FL_KEEP_SIZE 0x02 /* default is extend/shrink size */
> > +#define FA_FL_DEL_DATA 0x04 /* default is keep written data on DEALLOC */
>
> In XFS one of the (many) ALLOC modes is to zero existing data on allocate.
> For ext4 all this would mean is calling ext4_ext_mark_uninitialized() on
> each extent. For some workloads this would be much faster than truncate
> and reallocate of all the blocks in a file.

In ext4, we already mark each extent having preallocated blocks as uninitialized. This is done by the following code (which is part of patch 5/7) in ext4_ext_get_blocks():

@@ -2122,6 +2160,8 @@ int ext4_ext_get_blocks(handle_t *handle
 	/* try to insert new extent into found leaf and return */
 	ext4_ext_store_pblock(&newex, newblock);
 	newex.ee_len = cpu_to_le16(allocated);
+	if (create == EXT4_CREATE_UNINITIALIZED_EXT) /* Mark uninitialized */
+		ext4_ext_mark_uninitialized(&newex);
 	err = ext4_ext_insert_extent(handle, inode, path, &newex);
 	if (err) {
 		/* free data blocks we just allocated */

> In that light, please change the comment to /* default is keep existing data */
> so that it doesn't imply this is only for DEALLOC.

Ok. Will update the comment. Thanks!

--
Regards,
Amit Arora
Re: [PATCH 7/7][TAKE5] ext4: support new modes
On Mon, Jun 25, 2007 at 03:56:25PM -0600, Andreas Dilger wrote:
> On Jun 25, 2007 19:20 +0530, Amit K. Arora wrote:
> > @@ -2499,7 +2500,8 @@ long ext4_fallocate(struct inode *inode,
> >  	 * currently supporting (pre)allocate mode for extent-based
> >  	 * files _only_
> >  	 */
> > -	if (mode != FA_ALLOCATE || !(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
> > +	if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL) ||
> > +	    !(mode == FA_ALLOCATE || mode == FA_RESV_SPACE))
> >  		return -EOPNOTSUPP;
>
> This should probably just check for the individual flags it can support
> (e.g. no FA_FL_DEALLOC, no FA_FL_DEL_DATA).

Hmm.. I am thinking of a scenario where the file system supports some individual flags, but does not support a particular combination of them. Just for example's sake, assume we have FA_ZERO_SPACE mode also. Now, suppose a file system supports FA_ZERO_SPACE, FA_ALLOCATE, FA_DEALLOCATE and FA_RESV_SPACE, and no other mode (i.e. FA_UNRESV_SPACE is not supported for some reason). This means that although we support the FA_FL_DEALLOC, FA_FL_KEEP_SIZE and FA_FL_DEL_DATA flags, we do not support the combination of all these flags (which is nothing but FA_UNRESV_SPACE).

> I also thought another proposed flag was to determine whether mtime (and
> maybe ctime) is changed when doing prealloc/dealloc space? Default should
> probably be to change mtime/ctime, and have FA_FL_NO_MTIME. Someone else
> should decide if we want to allow changing the file w/o changing ctime, if
> that is required even though the file is not visibly changing. Maybe the
> ctime update should be implicit if the size or mtime are changing?

Is it really required? I mean, why should we allow users not to update ctime/mtime even if the file metadata/data gets updated? It sounds a bit "unnatural" to me. Is there any application scenario you have in mind when you suggest giving this flexibility to userspace?

I think modifying ctime/mtime should depend on the other flags. E.g., if we do not zero out data blocks on allocation/deallocation, update only ctime. Otherwise, update both ctime and mtime.

--
Regards,
Amit Arora
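The difference between the two checks being discussed (testing individual flags versus accepting only whole mode values) can be shown with the flag values from patch 4/7. This is just a sketch to make the argument concrete: `mode_ok_by_flags()` and `mode_ok_by_list()` are hypothetical helpers, not functions from any of the patches.

```c
#include <assert.h>

/* Flag and mode values as defined in patch 4/7 of this series. */
#define FA_FL_DEALLOC	0x01
#define FA_FL_KEEP_SIZE	0x02
#define FA_FL_DEL_DATA	0x04

#define FA_ALLOCATE	0
#define FA_DEALLOCATE	FA_FL_DEALLOC
#define FA_RESV_SPACE	FA_FL_KEEP_SIZE
#define FA_UNRESV_SPACE	(FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA)

/*
 * Per-flag check (hypothetical): accept any mode built only from flags
 * the file system knows, whatever the combination.
 */
static int mode_ok_by_flags(int mode, int supported_flags)
{
	return (mode & ~supported_flags) == 0;
}

/*
 * Whole-mode check (hypothetical): accept only the exact mode values
 * the file system has listed.
 */
static int mode_ok_by_list(int mode, const int *modes, int nmodes)
{
	int i;

	for (i = 0; i < nmodes; i++)
		if (modes[i] == mode)
			return 1;
	return 0;
}
```

With supported flags FA_FL_DEALLOC | FA_FL_KEEP_SIZE, the per-flag check lets the combined value 0x03 through, while a list of { FA_ALLOCATE, FA_DEALLOCATE, FA_RESV_SPACE } rejects it. That is the case where each flag is supported on its own but the combination is not.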
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Sat, May 12, 2007 at 06:01:57PM +1000, David Chinner wrote:
> On Fri, May 11, 2007 at 04:33:01PM +0530, Suparna Bhattacharya wrote:
> > On Fri, May 11, 2007 at 08:39:50AM +1000, David Chinner wrote:
> > > All I'm really interested in right now is that the fallocate
> > > _interface_ can be used as a *complete replacement* for the
> > > pre-existing XFS-specific ioctls that are already used by
> > > applications. What ext4 can or can't do right now is irrelevant to
> > > this discussion - the interface definition needs to take priority
> > > over implementation
> >
> > Would you like to write up an interface definition description (likely
> > man page) and post it for review, possibly with a mention of apps using
> > it today ?
>
> Yeah, I started doing that yesterday as I figured it was the only way
> to cut the discussion short
>
> > One reason for introducing the mode parameter was to allow the interface to
> > evolve incrementally as more options / semantic questions are proposed, so
> > that we don't have to make all the decisions right now.
> > So it would be good to start with a *minimal* definition, even just one
> > mode.
> > The rest could follow as subsequent patches, each being reviewed and debated
> > separately. Otherwise this discussion can drag on for a long time.
>
> Minimal definition to replace what applications use on XFS and to
> support posix_fallocate are the three that have been mentioned so
> far (FA_ALLOCATE, FA_PREALLOCATE, FA_DEALLOCATE). I'll document them
> all in a man page...

Hi Dave,

Did you get time to write the above man page? It will help to push further patches in time (e.g. for FA_PREALLOCATE mode).

The idea I had was to push the patch with the bare minimum functionality (FA_ALLOCATE and FA_DEALLOCATE modes) and, in parallel, finalize the other new mode(s) based on the man page you planned to provide.

Thanks!
--
Regards,
Amit Arora

> Cheers,
>
> Dave.
> --
> Dave Chinner
> Principal Engineer
> SGI Australian Software Group
Re: -mm merge plans for 2.6.23 -- sys_fallocate
On Tue, Jul 10, 2007 at 08:05:31PM +0200, Heiko Carstens wrote:
> > Alternatively I can push them directly to Linus along with other ext4
> > patches. We can drop the s390 patch if Martin or Heiko wants to wire
> > it up themselves.
>
> Yes, please drop the s390 patch. In general it seems to be better if only
> one architecture gets a syscall wired up initially and let other arches
> follow later.
>
> Just wondering if the x86_64 compat syscall gets ever fixed? I think
> I mentioned already three or four times to Amit that it is broken.
> Or is it that nobody cares?

Dunno.. The last time it was brought up was when TAKE5 of the patchset was posted, and I had planned to fix this in TAKE6 - which didn't happen since there was no final decision on the mode flags.

Anyhow, the x86_64 compat syscall has already been fixed in the ext4 patch queue. I will repost all the patches rebased on 2.6.22 (as they are in the ext4 patch queue), since these have already been dropped from -mm.

> In addition there used to be a somewhat unofficial rule that new syscalls
> have to come with a test program, so people can easily test if they wired
> up the syscall correctly.

Ok. Will work on a small testcase and post it soon.

--
Regards,
Amit Arora
[PATCH 0/7][TAKE6] fallocate system call
This is the latest fallocate patchset and is rebased to 2.6.22.

Following are the changes from TAKE5:
1) Rebased to 2.6.22
2) Added compat wrapper for x86_64
3) Dropped s390 and ia64 patches, since the platform maintainers can add the support for fallocate once it is in mainline.
4) Added a change suggested by Andreas for better extent-to-group alignment in ext4 (Patch 7/7). Please refer to the following post: http://www.mail-archive.com/[EMAIL PROTECTED]/msg02445.html
5) Renamed mode flags and values from "FA_" to "FALLOC_"
6) Added manpage (updated version of the one initially submitted by David Chinner).

Todos:
------
1> Implementation on other architectures (other than i386, x86_64, and ppc64). s390(x) and ia64 patches are ready and will be pushed by the platform maintainers when fallocate is in mainline.
2> A generic file system operation to handle fallocate (generic_fallocate), for filesystems that do _not_ have the fallocate inode operation implemented.
3> Changes to glibc,
   a) to support the fallocate() system call
   b) to make posix_fallocate() and posix_fallocate64() call fallocate()
4> A testcase to test the system call. Will post it soon.

The following patches follow:
Patch 1/7 : manpage for fallocate
Patch 2/7 : fallocate() implementation in i386, x86_64 and powerpc
Patch 3/7 : support new modes in fallocate
Patch 4/7 : ext4: fallocate support in ext4
Patch 5/7 : ext4: write support for preallocated blocks
Patch 6/7 : ext4: support new modes in ext4
Patch 7/7 : ext4: change for better extent-to-group alignment

--
Regards,
Amit Arora
[PATCH 1/7] manpage for fallocate
Following is the modified version of the manpage originally submitted by David Chinner. Please use `nroff -man fallocate.2 | less` to view.

.TH fallocate 2
.SH NAME
fallocate \- allocate or remove file space
.SH SYNOPSIS
.nf
.B #include
.PP
.BI "int syscall(int, int fd, int mode, loff_t offset, loff_t len);
.Op
.SH DESCRIPTION
The
.BR fallocate
syscall allows a user to directly manipulate the allocated
disk space for the file referred to by
.I fd
for the byte range starting at
.IR offset
and continuing for
.IR len
bytes.
The
.I mode
parameter determines the operation to be performed on the given range.
Currently there are four modes:
.TP
.B FALLOC_ALLOCATE
allocates and initialises to zero the disk space
within the given range. After a successful call, subsequent writes are
guaranteed not to fail because of lack of disk space. If the size of the
file is less than
.IR offset + len ,
then the file is increased to this size; otherwise the file size is left
unchanged.
.B FALLOC_ALLOCATE
closely resembles
.B posix_fallocate(3)
and is intended as a method of optimally implementing this function.
.B FALLOC_ALLOCATE
may allocate a larger range than was specified.
.TP
.B FALLOC_RESV_SPACE
provides the same functionality as
.B FALLOC_ALLOCATE
except it does not ever change the file size. This allows allocation
of zero blocks beyond the end of file and is useful for optimising
append workloads.
.TP
.B FALLOC_DEALLOCATE
removes any preallocated space within the given range. The file size
may change if deallocation is towards the end of the file.
.TP
.B FALLOC_UNRESV_SPACE
removes the underlying disk space within the given range. The disk space
shall be removed regardless of its contents, so both allocated space
from
.B FALLOC_ALLOCATE
and
.B FALLOC_RESV_SPACE
as well as from
.B write(3)
will be removed.
.B FALLOC_UNRESV_SPACE
shall never remove disk blocks outside the range specified.
.B FALLOC_UNRESV_SPACE
shall never change the file size. If the file size must change when
deallocating blocks from an offset to end of file (or beyond end of file),
.B ftruncate64(3)
or
.B FALLOC_DEALLOCATE
should be used.
.SH "RETURN VALUE"
.BR fallocate ()
returns zero on success, or an error number on failure.
Note that
.IR errno
is not set.
.SH "ERRORS"
.TP
.B EBADF
.I fd
is not a valid file descriptor, or is not opened for writing.
.TP
.B EFBIG
.I offset+len
exceeds the maximum file size.
.TP
.B EINVAL
.I offset
or
.I len
was less than 0.
.TP
.B ENODEV
.I fd
does not refer to a regular file or a directory.
.TP
.B ENOSPC
There is not enough space left on the device containing the file
referred to by
.IR fd .
.TP
.B ESPIPE
.I fd
refers to a pipe or FIFO.
.TP
.B ENOSYS
The filesystem underlying the file descriptor does not support this
operation.
.SH AVAILABILITY
The
.BR fallocate ()
system call is available since 2.6.XX
.SH "SEE ALSO"
.BR syscall (2),
.BR posix_fadvise (3),
.BR ftruncate (3)
[PATCH 2/7] fallocate() implementation in i386, x86_64 and powerpc
From: Amit Arora <[EMAIL PROTECTED]>

sys_fallocate() implementation on i386, x86_64 and powerpc

fallocate() is a new system call being proposed here which will allow applications to preallocate space to any file(s) in a file system. Each file system implementation that wants to use this feature will need to support an inode operation called ->fallocate().

Applications can use this feature to avoid fragmentation to a certain level and thus get faster access speed. With preallocation, applications also get a guarantee of space for particular file(s) - even if the system later becomes full.

Currently, glibc provides an interface called posix_fallocate() which can be used for a similar purpose. Though this has the advantage of working on all file systems, it is quite slow (since it writes zeroes to each block that has to be preallocated). Without a doubt, file systems can do this more efficiently within the kernel, by implementing the proposed fallocate() system call. It is expected that posix_fallocate() will be modified to call this new system call first and, in case the kernel/filesystem does not implement it, fall back to the current implementation of writing zeroes to the new blocks.
Signed-off-by: Amit Arora <[EMAIL PROTECTED]> Index: linux-2.6.22/arch/i386/kernel/syscall_table.S === --- linux-2.6.22.orig/arch/i386/kernel/syscall_table.S +++ linux-2.6.22/arch/i386/kernel/syscall_table.S @@ -323,3 +323,4 @@ ENTRY(sys_call_table) .long sys_signalfd .long sys_timerfd .long sys_eventfd + .long sys_fallocate Index: linux-2.6.22/arch/powerpc/kernel/sys_ppc32.c === --- linux-2.6.22.orig/arch/powerpc/kernel/sys_ppc32.c +++ linux-2.6.22/arch/powerpc/kernel/sys_ppc32.c @@ -773,6 +773,13 @@ asmlinkage int compat_sys_truncate64(con return sys_truncate(path, (high << 32) | low); } +asmlinkage long compat_sys_fallocate(int fd, int mode, u32 offhi, u32 offlo, +u32 lenhi, u32 lenlo) +{ + return sys_fallocate(fd, mode, ((loff_t)offhi << 32) | offlo, +((loff_t)lenhi << 32) | lenlo); +} + asmlinkage int compat_sys_ftruncate64(unsigned int fd, u32 reg4, unsigned long high, unsigned long low) { Index: linux-2.6.22/arch/x86_64/ia32/ia32entry.S === --- linux-2.6.22.orig/arch/x86_64/ia32/ia32entry.S +++ linux-2.6.22/arch/x86_64/ia32/ia32entry.S @@ -719,4 +719,5 @@ ia32_sys_call_table: .quad compat_sys_signalfd .quad compat_sys_timerfd .quad sys_eventfd + .quad sys32_fallocate ia32_syscall_end: Index: linux-2.6.22/fs/open.c === --- linux-2.6.22.orig/fs/open.c +++ linux-2.6.22/fs/open.c @@ -353,6 +353,92 @@ asmlinkage long sys_ftruncate64(unsigned #endif /* + * sys_fallocate - preallocate blocks or free preallocated blocks + * @fd: the file descriptor + * @mode: mode specifies if fallocate should preallocate blocks OR free + * (unallocate) preallocated blocks. Currently only FA_ALLOCATE and + * FA_DEALLOCATE modes are supported. + * @offset: The offset within file, from where (un)allocation is being + * requested. It should not have a negative value. + * @len: The amount (in bytes) of space to be (un)allocated, from the offset. + * + * This system call, depending on the mode, preallocates or unallocates blocks + * for a file. 
The range of blocks depends on the value of offset and len + * arguments provided by the user/application. For FA_ALLOCATE mode, if this + * system call succeeds, subsequent writes to the file in the given range + * (specified by offset & len) should not fail - even if the file system + * later becomes full. Hence the preallocation done is persistent (valid + * even after reopen of the file and remount/reboot). + * + * It is expected that the ->fallocate() inode operation implemented by the + * individual file systems will update the file size and/or ctime/mtime + * depending on the mode and also on the success of the operation. + * + * Note: Incase the file system does not support preallocation, + * posix_fallocate() should fall back to the library implementation (i.e. + * allocating zero-filled new blocks to the file). + * + * Return Values + * 0 : On SUCCESS a value of zero is returned. + * error : On Failure, an error code will be returned. + * An error code of -ENOSYS or -EOPNOTSUPP should make posix_fallocate() + * fall back on library implementation of fallocate. + * + * Generic fallocate to be added for file systems that do not + * support fallocate it. + */ +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) +{ + struct file *file; + struct inode *inode; + long ret = -EIN
[PATCH 3/7] support new modes in fallocate
From: Amit Arora <[EMAIL PROTECTED]>

Implement new flags and values for mode argument.

This patch implements the new flags and values for the "mode" argument of
the fallocate system call. It is based on the discussion between Andreas
Dilger and David Chinner on the man page proposed (by the latter) on
fallocate.

Signed-off-by: Amit Arora <[EMAIL PROTECTED]>

Index: linux-2.6.22/include/linux/fs.h
===
--- linux-2.6.22.orig/include/linux/fs.h
+++ linux-2.6.22/include/linux/fs.h
@@ -267,15 +267,17 @@ extern int dir_notify_enable;
 #define SYNC_FILE_RANGE_WAIT_AFTER	4
 
 /*
- * sys_fallocate modes
- * Currently sys_fallocate supports two modes:
- * FA_ALLOCATE : This is the preallocate mode, using which an application/user
- *		  may request (pre)allocation of blocks.
- * FA_DEALLOCATE: This is the deallocate mode, which can be used to free
- *		  the preallocated blocks.
+ * sys_fallocate mode flags and values
  */
-#define FA_ALLOCATE	0x1
-#define FA_DEALLOCATE	0x2
+#define FALLOC_FL_DEALLOC	0x01 /* default is allocate */
+#define FALLOC_FL_KEEP_SIZE	0x02 /* default is extend/shrink size */
+#define FALLOC_FL_DEL_DATA	0x04 /* default is keep written data on DEALLOC */
+
+#define FALLOC_ALLOCATE		0
+#define FALLOC_DEALLOCATE	FALLOC_FL_DEALLOC
+#define FALLOC_RESV_SPACE	FALLOC_FL_KEEP_SIZE
+#define FALLOC_UNRESV_SPACE	(FALLOC_FL_DEALLOC | FALLOC_FL_KEEP_SIZE | \
+				FALLOC_FL_DEL_DATA)
 
 #ifdef __KERNEL__
 
Index: linux-2.6.22/fs/open.c
===
--- linux-2.6.22.orig/fs/open.c
+++ linux-2.6.22/fs/open.c
@@ -356,23 +356,26 @@ asmlinkage long sys_ftruncate64(unsigned
  * sys_fallocate - preallocate blocks or free preallocated blocks
  * @fd: the file descriptor
  * @mode: mode specifies if fallocate should preallocate blocks OR free
- *	  (unallocate) preallocated blocks. Currently only FA_ALLOCATE and
- *	  FA_DEALLOCATE modes are supported.
+ *	  (unallocate) preallocated blocks.
  * @offset: The offset within file, from where (un)allocation is being
  *	    requested. It should not have a negative value.
  * @len: The amount (in bytes) of space to be (un)allocated, from the offset.
  *
  * This system call, depending on the mode, preallocates or unallocates blocks
  * for a file. The range of blocks depends on the value of offset and len
- * arguments provided by the user/application. For FA_ALLOCATE mode, if this
+ * arguments provided by the user/application. For FALLOC_ALLOCATE and
+ * FALLOC_RESV_SPACE modes, if the sys_fallocate()
  * system call succeeds, subsequent writes to the file in the given range
  * (specified by offset & len) should not fail - even if the file system
  * later becomes full. Hence the preallocation done is persistent (valid
- * even after reopen of the file and remount/reboot).
+ * even after reopen of the file and remount/reboot). If FALLOC_RESV_SPACE mode
+ * is passed, the file size will not be changed even if the preallocation
+ * is beyond EOF.
  *
  * It is expected that the ->fallocate() inode operation implemented by the
  * individual file systems will update the file size and/or ctime/mtime
- * depending on the mode and also on the success of the operation.
+ * depending on the mode (change is visible to user or not - say file size)
+ * and obviously, on the success of the operation.
  *
  * Note: Incase the file system does not support preallocation,
  * posix_fallocate() should fall back to the library implementation (i.e.
@@ -398,7 +401,8 @@ asmlinkage long sys_fallocate(int fd, in
 
 	/* Return error if mode is not supported */
 	ret = -EOPNOTSUPP;
-	if (mode != FA_ALLOCATE && mode != FA_DEALLOCATE)
+	if (!(mode == FALLOC_ALLOCATE || mode == FALLOC_DEALLOCATE ||
+	      mode == FALLOC_RESV_SPACE || mode == FALLOC_UNRESV_SPACE))
 		goto out;
 
 	ret = -EBADF;
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
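The mode check in the fs/open.c hunk above accepts exactly four combined values and rejects any other bit pattern, including a lone FALLOC_FL_DEL_DATA. A minimal user-space sketch of that check follows. The flag and mode values are copied from the fs.h hunk; the helper name falloc_check_mode() is hypothetical and is not part of the patch.

```c
#include <errno.h>

/* Flag and mode values as defined in the fs.h hunk of this patch. */
#define FALLOC_FL_DEALLOC	0x01
#define FALLOC_FL_KEEP_SIZE	0x02
#define FALLOC_FL_DEL_DATA	0x04

#define FALLOC_ALLOCATE		0
#define FALLOC_DEALLOCATE	FALLOC_FL_DEALLOC
#define FALLOC_RESV_SPACE	FALLOC_FL_KEEP_SIZE
#define FALLOC_UNRESV_SPACE	(FALLOC_FL_DEALLOC | FALLOC_FL_KEEP_SIZE | \
				 FALLOC_FL_DEL_DATA)

/*
 * Hypothetical helper (not in the patch): returns 0 when mode is one of
 * the four values accepted by the check in sys_fallocate(), and
 * -EOPNOTSUPP for every other value.
 */
int falloc_check_mode(int mode)
{
	switch (mode) {
	case FALLOC_ALLOCATE:
	case FALLOC_DEALLOCATE:
	case FALLOC_RESV_SPACE:
	case FALLOC_UNRESV_SPACE:
		return 0;
	default:
		return -EOPNOTSUPP;
	}
}
```

Note that a value such as (FALLOC_FL_DEALLOC | FALLOC_FL_KEEP_SIZE) is rejected by this check even though both bits are defined, since only the four named combinations are treated as modes.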
[PATCH 4/7] ext4: fallocate support in ext4
From: Amit Arora <[EMAIL PROTECTED]> fallocate support in ext4 This patch implements ->fallocate() inode operation in ext4. With this patch users of ext4 file systems will be able to use fallocate() system call for persistent preallocation. Current implementation only supports preallocation for regular files (directories not supported as of date) with extent maps. This patch does not support block-mapped files currently. Only FA_ALLOCATE mode is being supported as of now. Supporting FA_DEALLOCATE mode is a item. Signed-off-by: Amit Arora <[EMAIL PROTECTED]> Index: linux-2.6.22/fs/ext4/extents.c === --- linux-2.6.22.orig/fs/ext4/extents.c 2007-07-09 15:24:33.0 -0700 +++ linux-2.6.22/fs/ext4/extents.c 2007-07-09 15:24:39.0 -0700 @@ -282,7 +282,7 @@ static void ext4_ext_show_path(struct in } else if (path->p_ext) { ext_debug(" %d:%d:%llu ", le32_to_cpu(path->p_ext->ee_block), - le16_to_cpu(path->p_ext->ee_len), + ext4_ext_get_actual_len(path->p_ext), ext_pblock(path->p_ext)); } else ext_debug(" []"); @@ -305,7 +305,7 @@ static void ext4_ext_show_leaf(struct in for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) { ext_debug("%d:%d:%llu ", le32_to_cpu(ex->ee_block), - le16_to_cpu(ex->ee_len), ext_pblock(ex)); + ext4_ext_get_actual_len(ex), ext_pblock(ex)); } ext_debug("\n"); } @@ -425,7 +425,7 @@ ext4_ext_binsearch(struct inode *inode, ext_debug(" -> %d:%llu:%d ", le32_to_cpu(path->p_ext->ee_block), ext_pblock(path->p_ext), - le16_to_cpu(path->p_ext->ee_len)); + ext4_ext_get_actual_len(path->p_ext)); #ifdef CHECK_BINSEARCH { @@ -686,7 +686,7 @@ static int ext4_ext_split(handle_t *hand ext_debug("move %d:%llu:%d in new leaf %llu\n", le32_to_cpu(path[depth].p_ext->ee_block), ext_pblock(path[depth].p_ext), - le16_to_cpu(path[depth].p_ext->ee_len), + ext4_ext_get_actual_len(path[depth].p_ext), newblock); /*memmove(ex++, path[depth].p_ext++, sizeof(struct ext4_extent)); @@ -1106,7 +1106,19 @@ static int ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent 
*ex1, struct ext4_extent *ex2) { - if (le32_to_cpu(ex1->ee_block) + le16_to_cpu(ex1->ee_len) != + unsigned short ext1_ee_len, ext2_ee_len; + + /* +* Make sure that either both extents are uninitialized, or +* both are _not_. +*/ + if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2)) + return 0; + + ext1_ee_len = ext4_ext_get_actual_len(ex1); + ext2_ee_len = ext4_ext_get_actual_len(ex2); + + if (le32_to_cpu(ex1->ee_block) + ext1_ee_len != le32_to_cpu(ex2->ee_block)) return 0; @@ -1115,14 +1127,14 @@ ext4_can_extents_be_merged(struct inode * as an RO_COMPAT feature, refuse to merge to extents if * this can result in the top bit of ee_len being set. */ - if (le16_to_cpu(ex1->ee_len) + le16_to_cpu(ex2->ee_len) > EXT_MAX_LEN) + if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN) return 0; #ifdef AGGRESSIVE_TEST if (le16_to_cpu(ex1->ee_len) >= 4) return 0; #endif - if (ext_pblock(ex1) + le16_to_cpu(ex1->ee_len) == ext_pblock(ex2)) + if (ext_pblock(ex1) + ext1_ee_len == ext_pblock(ex2)) return 1; return 0; } @@ -1144,7 +1156,7 @@ unsigned int ext4_ext_check_overlap(stru unsigned int ret = 0; b1 = le32_to_cpu(newext->ee_block); - len1 = le16_to_cpu(newext->ee_len); + len1 = ext4_ext_get_actual_len(newext); depth = ext_depth(inode); if (!path[depth].p_ext) goto out; @@ -1191,8 +1203,9 @@ int ext4_ext_insert_extent(handle_t *han struct ext4_extent *nearex; /* nearest extent */ struct ext4_ext_path *npath = NULL; int depth, len, err, next; + unsigned uninitialized = 0; - BUG_ON(newext->ee_len == 0); + BUG_ON(ext4_ext_get_actual_len(newext) == 0); depth = ext_depth(inode); ex = path[depth].p_ext; BUG_ON(path[depth].p_hdr == NULL); @@ -1200,14 +1213,24 @@ int ext4_ext_insert_extent(handle_t *han /* try to insert block into found extent and return */ if (ex && ext4_can_extents_be_merged(inode, ex, newext)) {
[PATCH 5/7] ext4: write support for preallocated blocks
From: Amit Arora <[EMAIL PROTECTED]> write support for preallocated blocks This patch adds write support to the uninitialized extents that get created when a preallocation is done using fallocate(). It takes care of splitting the extents into multiple (upto three) extents and merging the new split extents with neighbouring ones, if possible. Signed-off-by: Amit Arora <[EMAIL PROTECTED]> Index: linux-2.6.22/fs/ext4/extents.c === --- linux-2.6.22.orig/fs/ext4/extents.c 2007-07-09 15:24:39.0 -0700 +++ linux-2.6.22/fs/ext4/extents.c 2007-07-09 15:24:48.0 -0700 @@ -1140,6 +1140,53 @@ ext4_can_extents_be_merged(struct inode } /* + * This function tries to merge the "ex" extent to the next extent in the tree. + * It always tries to merge towards right. If you want to merge towards + * left, pass "ex - 1" as argument instead of "ex". + * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns + * 1 if they got merged. + */ +int ext4_ext_try_to_merge(struct inode *inode, + struct ext4_ext_path *path, + struct ext4_extent *ex) +{ + struct ext4_extent_header *eh; + unsigned int depth, len; + int merge_done = 0; + int uninitialized = 0; + + depth = ext_depth(inode); + BUG_ON(path[depth].p_hdr == NULL); + eh = path[depth].p_hdr; + + while (ex < EXT_LAST_EXTENT(eh)) { + if (!ext4_can_extents_be_merged(inode, ex, ex + 1)) + break; + /* merge with next extent! 
*/ + if (ext4_ext_is_uninitialized(ex)) + uninitialized = 1; + ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex) + + ext4_ext_get_actual_len(ex + 1)); + if (uninitialized) + ext4_ext_mark_uninitialized(ex); + + if (ex + 1 < EXT_LAST_EXTENT(eh)) { + len = (EXT_LAST_EXTENT(eh) - ex - 1) + * sizeof(struct ext4_extent); + memmove(ex + 1, ex + 2, len); + } + eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries) - 1); + merge_done = 1; + WARN_ON(eh->eh_entries == 0); + if (!eh->eh_entries) + ext4_error(inode->i_sb, "ext4_ext_try_to_merge", + "inode#%lu, eh->eh_entries = 0!", inode->i_ino); + } + + return merge_done; +} + +/* * check if a portion of the "newext" extent overlaps with an * existing extent. * @@ -1327,25 +1374,7 @@ has_space: merge: /* try to merge extents to the right */ - while (nearex < EXT_LAST_EXTENT(eh)) { - if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1)) - break; - /* merge with next extent! */ - if (ext4_ext_is_uninitialized(nearex)) - uninitialized = 1; - nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex) - + ext4_ext_get_actual_len(nearex + 1)); - if (uninitialized) - ext4_ext_mark_uninitialized(nearex); - - if (nearex + 1 < EXT_LAST_EXTENT(eh)) { - len = (EXT_LAST_EXTENT(eh) - nearex - 1) - * sizeof(struct ext4_extent); - memmove(nearex + 1, nearex + 2, len); - } - eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1); - BUG_ON(eh->eh_entries == 0); - } + ext4_ext_try_to_merge(inode, path, nearex); /* try to merge extents to the left */ @@ -2011,15 +2040,158 @@ void ext4_ext_release(struct super_block #endif } +/* + * This function is called by ext4_ext_get_blocks() if someone tries to write + * to an uninitialized extent. It may result in splitting the uninitialized + * extent into multiple extents (upto three - one initialized and two + * uninitialized). 
+ * There are three possibilities: + * a> There is no split required: Entire extent should be initialized + * b> Splits in two extents: Write is happening at either end of the extent + * c> Splits in three extents: Somone is writing in middle of the extent + */ +int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode, + struct ext4_ext_path *path, + ext4_fsblk_t iblock, + unsigned long max_blocks) +{ + struct ext4_extent *ex, newex; + struct ext4_extent *ex1 = NULL; + struct ext4_extent *ex2 = NULL; + struct ext4_extent *ex3 = NULL; + struct ext4_extent_header *eh; + unsigned int allocated, ee_block, ee_len, depth; + ext4_fsblk_t newblock; + int err
[PATCH 6/7] ext4: support new modes in ext4
From: Amit Arora <[EMAIL PROTECTED]>

Support new values of mode in ext4.

This patch supports new mode values/flags in ext4. With this patch ext4
will be able to support FALLOC_ALLOCATE and FALLOC_RESV_SPACE modes. Supporting
FALLOC_DEALLOCATE and FALLOC_UNRESV_SPACE fallocate modes in ext4 is left
for future work.

Signed-off-by: Amit Arora <[EMAIL PROTECTED]>

Index: linux-2.6.22/fs/ext4/extents.c
===
--- linux-2.6.22.orig/fs/ext4/extents.c
+++ linux-2.6.22/fs/ext4/extents.c
@@ -2453,8 +2453,9 @@ int ext4_ext_writepage_trans_blocks(stru
 /*
  * preallocate space for a file. This implements ext4's fallocate inode
  * operation, which gets called from sys_fallocate system call.
- * Currently only FA_ALLOCATE mode is supported on extent based files.
- * We may have more modes supported in future - like FA_DEALLOCATE, which
+ * Currently only FALLOC_ALLOCATE and FALLOC_RESV_SPACE modes are supported on
+ * extent based files.
+ * We may have more modes supported in future - like FALLOC_DEALLOCATE, which
  * tells fallocate to unallocate previously (pre)allocated blocks.
  * For block-mapped files, posix_fallocate should fall back to the method
  * of writing zeroes to the required new blocks (the same behavior which is
@@ -2475,7 +2476,8 @@ long ext4_fallocate(struct inode *inode,
 	 * currently supporting (pre)allocate mode for extent-based
 	 * files _only_
 	 */
-	if (mode != FA_ALLOCATE || !(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
+	if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL) ||
+	    !(mode == FALLOC_ALLOCATE || mode == FALLOC_RESV_SPACE))
 		return -EOPNOTSUPP;
 
 	/* preallocation to directories is currently not supported */
@@ -2548,9 +2550,11 @@ retry:
 
 	/*
 	 * Time to update the file size.
-	 * Update only when preallocation was requested beyond the file size.
+	 * Update only when preallocation was requested beyond the file size
+	 * and when FALLOC_FL_KEEP_SIZE mode is not specified!
 	 */
-	if ((offset + len) > i_size_read(inode)) {
+	if (!(mode & FALLOC_FL_KEEP_SIZE) &&
+	    (offset + len) > i_size_read(inode)) {
 		if (ret > 0) {
 			/*
 			 * if no error, we assume preallocation succeeded
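The file size rule in the last hunk can be restated outside the kernel as a small pure function. This is only a sketch of the condition shown in the patch, assuming the FALLOC_FL_KEEP_SIZE value from patch 3/7; the function name size_after_prealloc() is made up for illustration, and the sketch ignores the partial-success handling that follows in ext4_fallocate().

```c
#include <stdint.h>

#define FALLOC_FL_KEEP_SIZE	0x02	/* value from patch 3/7 */

/*
 * Hypothetical helper: the size a file would report after a fully
 * successful preallocation of [offset, offset + len). The size grows
 * only when KEEP_SIZE is not set and the range ends beyond the
 * current size.
 */
int64_t size_after_prealloc(int mode, int64_t cur_size,
			    int64_t offset, int64_t len)
{
	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > cur_size)
		return offset + len;
	return cur_size;
}
```

With FALLOC_RESV_SPACE the blocks are still allocated beyond EOF, so st_blocks may grow while st_size stays the same.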
[PATCH 7/7] ext4: change for better extent-to-group alignment
From: Amit Arora <[EMAIL PROTECTED]> Change on-disk format for extent to represent uninitialized/initialized extents This change was suggested by Andreas Dilger as part of the following post: http://www.mail-archive.com/[EMAIL PROTECTED]/msg02445.html This patch changes the EXT_MAX_LEN value and extent code which marks/checks uninitialized extents. With this change it will be possible to have initialized extents with 2^15 blocks (earlier the max blocks we could have was 2^15 - 1). This way we can have better extent-to-block alignment. Now, maximum number of blocks we can have in an initialized extent is 2^15 and in an uninitialized extent is 2^15 - 1. Signed-off-by: Amit Arora <[EMAIL PROTECTED]> Index: linux-2.6.22/fs/ext4/extents.c === --- linux-2.6.22.orig/fs/ext4/extents.c +++ linux-2.6.22/fs/ext4/extents.c @@ -1106,7 +1106,7 @@ static int ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1, struct ext4_extent *ex2) { - unsigned short ext1_ee_len, ext2_ee_len; + unsigned short ext1_ee_len, ext2_ee_len, max_len; /* * Make sure that either both extents are uninitialized, or @@ -1115,6 +1115,11 @@ ext4_can_extents_be_merged(struct inode if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2)) return 0; + if (ext4_ext_is_uninitialized(ex1)) + max_len = EXT_UNINIT_MAX_LEN; + else + max_len = EXT_INIT_MAX_LEN; + ext1_ee_len = ext4_ext_get_actual_len(ex1); ext2_ee_len = ext4_ext_get_actual_len(ex2); @@ -1127,7 +1132,7 @@ ext4_can_extents_be_merged(struct inode * as an RO_COMPAT feature, refuse to merge to extents if * this can result in the top bit of ee_len being set. 
*/ - if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN) + if (ext1_ee_len + ext2_ee_len > max_len) return 0; #ifdef AGGRESSIVE_TEST if (le16_to_cpu(ex1->ee_len) >= 4) @@ -1814,7 +1819,11 @@ ext4_ext_rm_leaf(handle_t *handle, struc ex->ee_block = cpu_to_le32(block); ex->ee_len = cpu_to_le16(num); - if (uninitialized) + /* +* Do not mark uninitialized if all the blocks in the +* extent have been removed. +*/ + if (uninitialized && num) ext4_ext_mark_uninitialized(ex); err = ext4_ext_dirty(handle, inode, path + depth); @@ -2307,6 +2316,18 @@ int ext4_ext_get_blocks(handle_t *handle /* allocate new block */ goal = ext4_ext_find_goal(inode, path, iblock); + /* +* See if request is beyond maximum number of blocks we can have in +* a single extent. For an initialized extent this limit is +* EXT_INIT_MAX_LEN and for an uninitialized extent this limit is +* EXT_UNINIT_MAX_LEN. +*/ + if (max_blocks > EXT_INIT_MAX_LEN && create != EXT4_CREATE_UNINITIALIZED_EXT) + max_blocks = EXT_INIT_MAX_LEN; + else if (max_blocks > EXT_UNINIT_MAX_LEN && +create == EXT4_CREATE_UNINITIALIZED_EXT) + max_blocks = EXT_UNINIT_MAX_LEN; + /* Check if we can really insert (iblock)::(iblock+max_blocks) extent */ newex.ee_block = cpu_to_le32(iblock); newex.ee_len = cpu_to_le16(max_blocks); Index: linux-2.6.22/include/linux/ext4_fs_extents.h === --- linux-2.6.22.orig/include/linux/ext4_fs_extents.h +++ linux-2.6.22/include/linux/ext4_fs_extents.h @@ -141,7 +141,25 @@ typedef int (*ext_prepare_callback)(stru #define EXT_MAX_BLOCK 0x -#define EXT_MAX_LEN((1UL << 15) - 1) +/* + * EXT_INIT_MAX_LEN is the maximum number of blocks we can have in an + * initialized extent. This is 2^15 and not (2^16 - 1), since we use the + * MSB of ee_len field in the extent datastructure to signify if this + * particular extent is an initialized extent or an uninitialized (i.e. + * preallocated). + * EXT_UNINIT_MAX_LEN is the maximum number of blocks we can have in an + * uninitialized extent. 
+ * If ee_len is <= 0x8000, it is an initialized extent. Otherwise, it is an + * uninitialized one. In other words, if MSB of ee_len is set, it is an + * uninitialized extent with only one special scenario when ee_len = 0x8000. + * In this case we can not have an uninitialized extent of zero length and + * thus we make it as a special case of initialized extent with 0x8000 length. + * This way we get better extent-to-group alignment for initialized extents. + * Hence, the maximum number of blocks we can have in an *initialized* + * extent is 2^15 (32768) and in an *uninitialized* extent is 2^15-1 (32767). + */ +#define EXT_INIT_MAX_LEN (1UL << 15) +#define EXT_UNINIT_MAX_LEN (EXT_INIT_MAX_LEN - 1) #define EXT_F
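The ee_len encoding described in the comment above can be sketched in user space as follows. This is an illustration under stated assumptions, not the patch's code: ee_len is taken as a host-endian 16-bit value (the kernel converts from little-endian first), and the helper names ext_is_uninitialized(), ext_actual_len() and ext_mark_uninitialized() are hypothetical stand-ins for the ext4_ext_* helpers used elsewhere in the series.

```c
#include <stdint.h>

#define EXT_INIT_MAX_LEN	(1UL << 15)
#define EXT_UNINIT_MAX_LEN	(EXT_INIT_MAX_LEN - 1)

/* An extent is uninitialized only when ee_len is strictly above 0x8000. */
int ext_is_uninitialized(uint16_t ee_len)
{
	return ee_len > EXT_INIT_MAX_LEN;
}

/* Number of blocks covered, with the uninitialized marker removed. */
unsigned int ext_actual_len(uint16_t ee_len)
{
	return ee_len <= EXT_INIT_MAX_LEN ?
		ee_len : (unsigned int)(ee_len - EXT_INIT_MAX_LEN);
}

/*
 * Mark a length in the range 1..EXT_UNINIT_MAX_LEN as uninitialized by
 * setting the MSB. A length of 0 must not be marked, because 0x8000 is
 * reserved for an initialized extent of 32768 blocks.
 */
uint16_t ext_mark_uninitialized(uint16_t len)
{
	return (uint16_t)(len | EXT_INIT_MAX_LEN);
}
```

This also shows why the ext4_ext_rm_leaf() hunk skips the marking when num is 0: setting the MSB on a zero length would produce 0x8000, which reads back as an initialized extent of maximum length.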
Re: -mm merge plans for 2.6.23 -- sys_fallocate
On Tue, Jul 10, 2007 at 11:20:47AM -0700, Mark Fasheh wrote:
> On Tue, Jul 10, 2007 at 11:45:03AM -0400, Theodore Tso wrote:
> > On Tue, Jul 10, 2007 at 02:22:13AM -0700, Andrew Morton wrote:
> > > On Tue, 10 Jul 2007 11:07:37 +0200 Heiko Carstens <[EMAIL PROTECTED]> wrote:
> > > > We reserved a different syscall number than the one that is used right now
> > > > in the patch. Please drop this patch... Martin or I will wire up the syscall
> > > > as soon as the x86 variant is merged. Everything else just causes trouble and
> > > > confusion.
> > >
> > > OK, I dropped all the fallocate patches.
> >
> > Andrew, I want to clarify who is going to push the fallocate patches.
> > I can either push them to Linus as part of the ext4 patch set, or we
> > can wait for you to push them. I thought since you had them in -mm
> > and we were going to wait you to push them (and presume that this was
> > going to happen soon).
> >
> > Alternatively I can push them directly to Linus along with other ext4
> > patches. We can drop the s390 patch if Martin or Heiko wants to wire
> > it up themselves.
> >
> > As far as I know there hasn't been any real contention on the actual
> > syscall patches, other than the numbering issues, so it seems that
> > pushing them to Linus sooner rather than later is the right thing to
> > do.
>
> Where is the latest and greatest version of those patches? Is it still the
> patch set distributed in 2.6.22-rc6-mm1? I'd mostly like to see the final
> set of flags we're planning on supporting. But yeah, I second the "sooner
> rather than later" :)

I have posted the latest fallocate patches as part of TAKE6. These
patches are exactly the same as they currently look in the ext4 patch
queue being maintained by Ted.

--
Regards,
Amit Arora
Re: [PATCH 2/7] fallocate() implementation in i386, x86_64 and powerpc
On Wed, Jul 11, 2007 at 12:10:34PM +1000, Stephen Rothwell wrote:
> On Wed, 11 Jul 2007 01:50:00 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote:
> >
> > --- linux-2.6.22.orig/arch/x86_64/ia32/sys_ia32.c
> > +++ linux-2.6.22/arch/x86_64/ia32/sys_ia32.c
> > @@ -879,3 +879,11 @@ asmlinkage long sys32_fadvise64(int fd,
> > 	return sys_fadvise64_64(fd, ((u64)offset_hi << 32) | offset_lo,
> > 				len, advice);
> > }
> > +
> > +asmlinkage long sys32_fallocate(int fd, int mode, unsigned offset_lo,
> > +				unsigned offset_hi, unsigned len_lo,
> > +				unsigned len_hi)
>
> Please call this compat_sys_fallocate in line with the powerpc version -
> it gives us a hint that maybe we should think about how to consolidate
> them. I know other stuff in that file is called sys32_ ... but it is time
> for a change :-)

I think this can be handled as a separate patch once this patchset is
in mainline, since we will need to do this anyhow for the other sys32_
calls which are already there...

--
Regards,
Amit Arora
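For readers unfamiliar with the compat wrappers discussed here: both the powerpc compat_sys_fallocate() and the x86_64 sys32_fallocate() receive each 64-bit argument as two 32-bit halves and rebuild it before calling sys_fallocate(). The sketch below shows only that rebuilding step. The name join_halves() is hypothetical. Note that the two architectures pass the halves in a different order (powerpc: high then low; x86_64 ia32: low then high), which the sketch does not model.

```c
#include <stdint.h>

/*
 * Hypothetical helper: rebuild a 64-bit offset or length from its
 * 32-bit halves, as the compat wrappers do with
 * ((loff_t)hi << 32) | lo. The shift is done on an unsigned 64-bit
 * value so that the high half never shifts into a sign bit.
 */
int64_t join_halves(uint32_t hi, uint32_t lo)
{
	return (int64_t)(((uint64_t)hi << 32) | lo);
}
```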
Re: [PATCH 1/7] manpage for fallocate
On Wed, Jul 11, 2007 at 12:37:01AM +0300, Heikki Orsila wrote:
> On Wed, Jul 11, 2007 at 01:48:20AM +0530, Amit K. Arora wrote:
> > .BI "int syscall(int, int fd, int mode, loff_t offset, loff_t len);
>
> Correction: "int syscall(int fd, int mode, ...)",

Here, we have syscall() with the first argument being the system call
number, so what you suggested is not correct. But, yes, the synopsis
should change at some time. Maybe to something like:

#include
long fallocate(int fd, int mode, loff_t offset, loff_t len);

> > .TP
> > .B ENOSPC
> > There is not enough space left on the device containing the file
> > referred to by
> > .IR fd.
> > .TP
> > .B ESPIPE
> > .I fd
> > refers to a pipe of file descriptor.
> > .B ENOSYS
> > The filesystem underlying the file descriptor does not support this
> > operation.
>
> EINTR?

Will add the following errors:

EINTR       A signal was caught during execution.
EIO         An I/O error occurred while reading from or writing to a
            file system.
EOPNOTSUPP  The mode is not supported on the file descriptor.

and will update the following:

EINVAL      offset was less than 0, or len was less than or equal to 0.

--
Regards,
Amit Arora
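The updated EINVAL wording can be read as the following argument check. This is a sketch of the documented condition only; the helper name falloc_check_range() is hypothetical and the actual check in sys_fallocate() is not shown in this thread.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the proposed man page text:
 * EINVAL when offset is less than 0, or len is less than or equal to 0.
 */
int falloc_check_range(int64_t offset, int64_t len)
{
	if (offset < 0 || len <= 0)
		return -EINVAL;
	return 0;
}
```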
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Thu, Jul 12, 2007 at 12:58:13PM +0530, Suparna Bhattacharya wrote:
> On Wed, Jul 11, 2007 at 10:03:12AM +0100, Christoph Hellwig wrote:
> > On Tue, Jul 03, 2007 at 05:16:50PM +0530, Amit K. Arora wrote:
> > > Well, if you see the modes proposed using above flags :
> > >
> > > #define FA_ALLOCATE 0
> > > #define FA_DEALLOCATE FA_FL_DEALLOC
> > > #define FA_RESV_SPACE FA_FL_KEEP_SIZE
> > > #define FA_UNRESV_SPACE (FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA)
> > >
> > > FA_FL_DEL_DATA is _not_ being used for preallocation. We have two modes
> > > for preallocation FA_ALLOCATE and FA_RESV_SPACE, which do not use this
> > > flag. Hence prealloction will never delete data.
> > > This mode is required only for FA_UNRESV_SPACE, which is a deallocation
> > > mode, to support any existing XFS aware applications/usage-scenarios.
> >
> > Sorry, but this doesn't make any sense. There is no need to put every
> > feature in the XFS ioctls in the syscalls. The XFS ioctls will need to
> > be supported forever anyway - as I suggested before they really should
> > be moved to generic code.
> >
> > What needs to be supported is what makes sense as an interface.
> > A punch a hole interface does make sense, but trying to hack this into
> > a preallocation system call is just madness. We're not IRIX or windows
> > that fit things into random subcall just because there was some space
> > left to squeeze them in.
> >
> > > > > > FA_FL_NO_MTIME 0x10 /* keep same mtime (default change on size, data change) */
> > > > > > FA_FL_NO_CTIME 0x20 /* keep same ctime (default change on size, data change) */
> > > >
> > > > NACK to these aswell. If i_size changes c/mtime need updates, if the size
> > > > doesn't chamge they don't. No need to add more flags for this.
> > >
> > > This requirement was from the point of view of HSM applications. Hope
> > > you saw Andreas previous post and are keeping that in mind.
> >
> > HSMs needs this basically for every system call, which screams for an
> > open flag like O_INVISIBLE anyway. Adding this in a generic way is
> > a good idea, but hacking bits and pieces that won't fit into the global
> > design is completely wrong.
>
> Why don't we just merge the interface for preallocation (essentially
> enough to satisfy posix_fallocate() and the simple XFS requirement for
> space reservation without changing file size), which there is clear agreement
> on (I hope :)). After all, this was all that we set out to do when we
> started.

As you suggest, let us just have two modes for the time being:

#define FALLOC_ALLOCATE 0x1
#define FALLOC_ALLOCATE_KEEP_SIZE 0x2

As the name suggests, when FALLOC_ALLOCATE_KEEP_SIZE mode is passed it
will result in file size not being changed even if the preallocation is
beyond EOF.

> And leave all the dealloc/punch/hsm type features for separate future patches/
> debates, those really shouldn't hold up the basic fallocate interface.

I agree.

> I agree with Christoph that we are just diverging too much in trying to
> club those decisions here.
>
> Dave, Andreas, Ted ?
>
> Regards
> Suparna

--
Regards,
Amit Arora
Re: [PATCH 2/7] fallocate() implementation in i386, x86_64 and powerpc
On Thu, Jul 12, 2007 at 08:56:30AM -0400, David Patrick Quigley wrote:
> From: David P. Quigley <[EMAIL PROTECTED]>
>
> Revalidate the write permissions for fallocate(2), in case security policy has
> changed since the files were opened.

Thanks for your patch! Will include it in the patchset.

--
Regards,
Amit Arora

>
> Signed-off-by: David P. Quigley <[EMAIL PROTECTED]>
>
> fs/open.c |3 +++
> 1 file changed, 3 insertions(+)
>
> diff -uprN -X linux-2.6.22/Documentation/dontdiff linux-2.6.22-fallocate/fs/open.c linux-2.6.22-fallocate-selinux/fs/open.c
> --- linux-2.6.22-fallocate/fs/open.c 2007-07-11 15:51:10.0 -0400
> +++ linux-2.6.22-fallocate-selinux/fs/open.c 2007-07-11 16:10:43.0 -0400
> @@ -411,6 +411,9 @@ asmlinkage long sys_fallocate(int fd, in
> 		goto out;
> 	if (!(file->f_mode & FMODE_WRITE))
> 		goto out_fput;
> +	ret = security_file_permission(file, MAY_WRITE);
> +	if (ret)
> +		goto out_fput;
>
> 	inode = file->f_path.dentry->d_inode;
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Thu, Jul 12, 2007 at 11:13:34PM +1000, David Chinner wrote:
> On Thu, Jul 12, 2007 at 12:58:13PM +0530, Suparna Bhattacharya wrote:
> >
> > Why don't we just merge the interface for preallocation (essentially
> > enough to satisfy posix_fallocate() and the simple XFS requirement for
> > space reservation without changing file size), which there is clear agreement
> > on (I hope :)). After all, this was all that we set out to do when we
> > started.
> >
> > And leave all the dealloc/punch/hsm type features for separate future patches/
> > debates, those really shouldn't hold up the basic fallocate interface.
> > I agree with Christoph that we are just diverging too much in trying to
> > club those decisions here.
> >
> > Dave, Andreas, Ted ?
>
> Sure. I'll just make XFS work with whatever it is that gets merged.

Great. I will post the new patches soon.

--
Regards,
Amit Arora
[PATCH 6/6][TAKE7] ext4: change for better extent-to-group alignment
From: Amit Arora <[EMAIL PROTECTED]> Change on-disk format for extent to represent uninitialized/initialized extents This change was suggested by Andreas Dilger. This patch changes the EXT_MAX_LEN value and extent code which marks/checks uninitialized extents. With this change it will be possible to have initialized extents with 2^15 blocks (earlier the max blocks we could have was 2^15 - 1). This way we can have better extent-to-block alignment. Now, maximum number of blocks we can have in an initialized extent is 2^15 and in an uninitialized extent is 2^15 - 1. This patch takes care of Andreas's suggestion of using EXT_INIT_MAX_LEN instead of 0x8000 at some places. Signed-off-by: Amit Arora <[EMAIL PROTECTED]> Index: linux-2.6.22/fs/ext4/extents.c === --- linux-2.6.22.orig/fs/ext4/extents.c +++ linux-2.6.22/fs/ext4/extents.c @@ -1106,7 +1106,7 @@ static int ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1, struct ext4_extent *ex2) { - unsigned short ext1_ee_len, ext2_ee_len; + unsigned short ext1_ee_len, ext2_ee_len, max_len; /* * Make sure that either both extents are uninitialized, or @@ -1115,6 +1115,11 @@ ext4_can_extents_be_merged(struct inode if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2)) return 0; + if (ext4_ext_is_uninitialized(ex1)) + max_len = EXT_UNINIT_MAX_LEN; + else + max_len = EXT_INIT_MAX_LEN; + ext1_ee_len = ext4_ext_get_actual_len(ex1); ext2_ee_len = ext4_ext_get_actual_len(ex2); @@ -1127,7 +1132,7 @@ ext4_can_extents_be_merged(struct inode * as an RO_COMPAT feature, refuse to merge to extents if * this can result in the top bit of ee_len being set. 
*/ - if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN) + if (ext1_ee_len + ext2_ee_len > max_len) return 0; #ifdef AGGRESSIVE_TEST if (le16_to_cpu(ex1->ee_len) >= 4) @@ -1814,7 +1819,11 @@ ext4_ext_rm_leaf(handle_t *handle, struc ex->ee_block = cpu_to_le32(block); ex->ee_len = cpu_to_le16(num); - if (uninitialized) + /* +* Do not mark uninitialized if all the blocks in the +* extent have been removed. +*/ + if (uninitialized && num) ext4_ext_mark_uninitialized(ex); err = ext4_ext_dirty(handle, inode, path + depth); @@ -2307,6 +2316,19 @@ int ext4_ext_get_blocks(handle_t *handle /* allocate new block */ goal = ext4_ext_find_goal(inode, path, iblock); + /* +* See if request is beyond maximum number of blocks we can have in +* a single extent. For an initialized extent this limit is +* EXT_INIT_MAX_LEN and for an uninitialized extent this limit is +* EXT_UNINIT_MAX_LEN. +*/ + if (max_blocks > EXT_INIT_MAX_LEN && + create != EXT4_CREATE_UNINITIALIZED_EXT) + max_blocks = EXT_INIT_MAX_LEN; + else if (max_blocks > EXT_UNINIT_MAX_LEN && +create == EXT4_CREATE_UNINITIALIZED_EXT) + max_blocks = EXT_UNINIT_MAX_LEN; + /* Check if we can really insert (iblock)::(iblock+max_blocks) extent */ newex.ee_block = cpu_to_le32(iblock); newex.ee_len = cpu_to_le16(max_blocks); Index: linux-2.6.22/include/linux/ext4_fs_extents.h === --- linux-2.6.22.orig/include/linux/ext4_fs_extents.h +++ linux-2.6.22/include/linux/ext4_fs_extents.h @@ -141,7 +141,25 @@ typedef int (*ext_prepare_callback)(stru #define EXT_MAX_BLOCK 0x -#define EXT_MAX_LEN((1UL << 15) - 1) +/* + * EXT_INIT_MAX_LEN is the maximum number of blocks we can have in an + * initialized extent. This is 2^15 and not (2^16 - 1), since we use the + * MSB of ee_len field in the extent datastructure to signify if this + * particular extent is an initialized extent or an uninitialized (i.e. + * preallocated). + * EXT_UNINIT_MAX_LEN is the maximum number of blocks we can have in an + * uninitialized extent. 
+ * If ee_len is <= 0x8000, it is an initialized extent. Otherwise, it is an + * uninitialized one. In other words, if MSB of ee_len is set, it is an + * uninitialized extent with only one special scenario when ee_len = 0x8000. + * In this case we can not have an uninitialized extent of zero length and + * thus we make it as a special case of initialized extent with 0x8000 length. + * This way we get better extent-to-group alignment for initialized extents. + * Hence, the maximum number of blocks we can have in an *initialized* + * extent is 2^15 (32768) and in an *uninitialized* extent is 2^15-1 (32767). + */ +#define EXT_INIT_MAX_LEN (1UL << 15) +#define EXT_UNINIT_MAX_LEN (EXT_INIT_MA
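The ee_len encoding described in the comment above can be sketched in userspace C. This is only an illustration of the rule as the comment states it, not the kernel code: the helper names `ext_is_uninitialized`, `ext_actual_len` and `ext_encode_uninitialized` are hypothetical, and the value of EXT_UNINIT_MAX_LEN is taken from the comment (2^15 - 1) because its definition is cut off in this posting.

```c
#include <assert.h>
#include <stdint.h>

/* Limits as stated in the patch comment. */
#define EXT_INIT_MAX_LEN	(1UL << 15)		/* 32768 blocks */
#define EXT_UNINIT_MAX_LEN	(EXT_INIT_MAX_LEN - 1)	/* 32767 blocks (assumed) */

/*
 * Hypothetical helper: an extent is uninitialized when ee_len is
 * strictly greater than 0x8000. ee_len == 0x8000 is the special case
 * of an initialized extent of maximum length.
 */
static int ext_is_uninitialized(uint16_t ee_len)
{
	return ee_len > EXT_INIT_MAX_LEN;
}

/* Hypothetical helper: number of blocks the extent actually covers. */
static unsigned int ext_actual_len(uint16_t ee_len)
{
	if (ee_len <= EXT_INIT_MAX_LEN)
		return ee_len;
	return (unsigned int)(ee_len - EXT_INIT_MAX_LEN);
}

/* Hypothetical helper: encode an uninitialized extent of len blocks. */
static uint16_t ext_encode_uninitialized(unsigned int len)
{
	/* A zero-length uninitialized extent cannot be represented. */
	assert(len >= 1 && len <= EXT_UNINIT_MAX_LEN);
	return (uint16_t)(len | EXT_INIT_MAX_LEN);
}
```

Under these assumptions an ee_len of 0x8000 decodes as 32768 initialized blocks, while 0x8001 decodes as one uninitialized block.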
[PATCH 5/6][TAKE7] ext4: write support for preallocated blocks
From: Amit Arora <[EMAIL PROTECTED]> write support for preallocated blocks This patch adds write support to the uninitialized extents that get created when a preallocation is done using fallocate(). It takes care of splitting the extents into multiple (upto three) extents and merging the new split extents with neighbouring ones, if possible. Signed-off-by: Amit Arora <[EMAIL PROTECTED]> Index: linux-2.6.22/fs/ext4/extents.c === --- linux-2.6.22.orig/fs/ext4/extents.c +++ linux-2.6.22/fs/ext4/extents.c @@ -1140,6 +1140,53 @@ ext4_can_extents_be_merged(struct inode } /* + * This function tries to merge the "ex" extent to the next extent in the tree. + * It always tries to merge towards right. If you want to merge towards + * left, pass "ex - 1" as argument instead of "ex". + * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns + * 1 if they got merged. + */ +int ext4_ext_try_to_merge(struct inode *inode, + struct ext4_ext_path *path, + struct ext4_extent *ex) +{ + struct ext4_extent_header *eh; + unsigned int depth, len; + int merge_done = 0; + int uninitialized = 0; + + depth = ext_depth(inode); + BUG_ON(path[depth].p_hdr == NULL); + eh = path[depth].p_hdr; + + while (ex < EXT_LAST_EXTENT(eh)) { + if (!ext4_can_extents_be_merged(inode, ex, ex + 1)) + break; + /* merge with next extent! 
*/ + if (ext4_ext_is_uninitialized(ex)) + uninitialized = 1; + ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex) + + ext4_ext_get_actual_len(ex + 1)); + if (uninitialized) + ext4_ext_mark_uninitialized(ex); + + if (ex + 1 < EXT_LAST_EXTENT(eh)) { + len = (EXT_LAST_EXTENT(eh) - ex - 1) + * sizeof(struct ext4_extent); + memmove(ex + 1, ex + 2, len); + } + eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries) - 1); + merge_done = 1; + WARN_ON(eh->eh_entries == 0); + if (!eh->eh_entries) + ext4_error(inode->i_sb, "ext4_ext_try_to_merge", + "inode#%lu, eh->eh_entries = 0!", inode->i_ino); + } + + return merge_done; +} + +/* * check if a portion of the "newext" extent overlaps with an * existing extent. * @@ -1327,25 +1374,7 @@ has_space: merge: /* try to merge extents to the right */ - while (nearex < EXT_LAST_EXTENT(eh)) { - if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1)) - break; - /* merge with next extent! */ - if (ext4_ext_is_uninitialized(nearex)) - uninitialized = 1; - nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex) - + ext4_ext_get_actual_len(nearex + 1)); - if (uninitialized) - ext4_ext_mark_uninitialized(nearex); - - if (nearex + 1 < EXT_LAST_EXTENT(eh)) { - len = (EXT_LAST_EXTENT(eh) - nearex - 1) - * sizeof(struct ext4_extent); - memmove(nearex + 1, nearex + 2, len); - } - eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1); - BUG_ON(eh->eh_entries == 0); - } + ext4_ext_try_to_merge(inode, path, nearex); /* try to merge extents to the left */ @@ -2011,15 +2040,158 @@ void ext4_ext_release(struct super_block #endif } +/* + * This function is called by ext4_ext_get_blocks() if someone tries to write + * to an uninitialized extent. It may result in splitting the uninitialized + * extent into multiple extents (upto three - one initialized and two + * uninitialized). 
+ * There are three possibilities: + * a> There is no split required: Entire extent should be initialized + * b> Splits in two extents: Write is happening at either end of the extent + * c> Splits in three extents: Somone is writing in middle of the extent + */ +int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode, + struct ext4_ext_path *path, + ext4_fsblk_t iblock, + unsigned long max_blocks) +{ + struct ext4_extent *ex, newex; + struct ext4_extent *ex1 = NULL; + struct ext4_extent *ex2 = NULL; + struct ext4_extent *ex3 = NULL; + struct ext4_extent_header *eh; + unsigned int allocated, ee_block, ee_len, depth; + ext4_fsblk_t newblock; + int err = 0; + int ret = 0; + + depth = ext_depth(inode); + eh = p
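The three cases listed in the comment above (no split, two extents, three extents) come down to simple range arithmetic. The sketch below is a hypothetical userspace illustration of that arithmetic only; `struct split_result` and `split_uninit_extent` are invented names, and the code does not reproduce ext4_ext_convert_to_initialized(), whose body is truncated in this posting.

```c
#include <assert.h>

/* Hypothetical result type: lengths (in blocks) of the resulting pieces. */
struct split_result {
	unsigned int left_uninit;	/* uninitialized blocks before the write */
	unsigned int initialized;	/* blocks converted to initialized */
	unsigned int right_uninit;	/* uninitialized blocks after the write */
	int pieces;			/* number of non-empty pieces: 1, 2 or 3 */
};

/*
 * An uninitialized extent covers logical blocks
 * [ee_block, ee_block + ee_len). A write asks for max_blocks blocks
 * starting at iblock; anything past the end of the extent is clipped.
 */
static struct split_result split_uninit_extent(unsigned int ee_block,
					       unsigned int ee_len,
					       unsigned int iblock,
					       unsigned int max_blocks)
{
	struct split_result r;
	unsigned int end = ee_block + ee_len;

	/* The write must be non-empty and start inside the extent. */
	assert(ee_len > 0 && max_blocks > 0);
	assert(iblock >= ee_block && iblock < end);

	r.left_uninit = iblock - ee_block;
	r.initialized = max_blocks < end - iblock ? max_blocks : end - iblock;
	r.right_uninit = ee_len - r.left_uninit - r.initialized;
	r.pieces = 1 + (r.left_uninit > 0) + (r.right_uninit > 0);
	return r;
}
```

A write that covers the whole extent yields one piece, a write at either end yields two, and a write in the middle yields three.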
[PATCH 4/6][TAKE7] ext4: fallocate support in ext4
From: Amit Arora <[EMAIL PROTECTED]> fallocate support in ext4 This patch implements ->fallocate() inode operation in ext4. With this patch users of ext4 file systems will be able to use fallocate() system call for persistent preallocation. Current implementation only supports preallocation for regular files (directories not supported as of date) with extent maps. This patch does not support block-mapped files currently. Only FALLOC_ALLOCATE and FALLOC_RESV_SPACE modes are being supported as of now. Signed-off-by: Amit Arora <[EMAIL PROTECTED]> Index: linux-2.6.22/fs/ext4/extents.c === --- linux-2.6.22.orig/fs/ext4/extents.c +++ linux-2.6.22/fs/ext4/extents.c @@ -282,7 +282,7 @@ static void ext4_ext_show_path(struct in } else if (path->p_ext) { ext_debug(" %d:%d:%llu ", le32_to_cpu(path->p_ext->ee_block), - le16_to_cpu(path->p_ext->ee_len), + ext4_ext_get_actual_len(path->p_ext), ext_pblock(path->p_ext)); } else ext_debug(" []"); @@ -305,7 +305,7 @@ static void ext4_ext_show_leaf(struct in for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) { ext_debug("%d:%d:%llu ", le32_to_cpu(ex->ee_block), - le16_to_cpu(ex->ee_len), ext_pblock(ex)); + ext4_ext_get_actual_len(ex), ext_pblock(ex)); } ext_debug("\n"); } @@ -425,7 +425,7 @@ ext4_ext_binsearch(struct inode *inode, ext_debug(" -> %d:%llu:%d ", le32_to_cpu(path->p_ext->ee_block), ext_pblock(path->p_ext), - le16_to_cpu(path->p_ext->ee_len)); + ext4_ext_get_actual_len(path->p_ext)); #ifdef CHECK_BINSEARCH { @@ -686,7 +686,7 @@ static int ext4_ext_split(handle_t *hand ext_debug("move %d:%llu:%d in new leaf %llu\n", le32_to_cpu(path[depth].p_ext->ee_block), ext_pblock(path[depth].p_ext), - le16_to_cpu(path[depth].p_ext->ee_len), + ext4_ext_get_actual_len(path[depth].p_ext), newblock); /*memmove(ex++, path[depth].p_ext++, sizeof(struct ext4_extent)); @@ -1106,7 +1106,19 @@ static int ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1, struct ext4_extent *ex2) { - if (le32_to_cpu(ex1->ee_block) + 
le16_to_cpu(ex1->ee_len) != + unsigned short ext1_ee_len, ext2_ee_len; + + /* +* Make sure that either both extents are uninitialized, or +* both are _not_. +*/ + if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2)) + return 0; + + ext1_ee_len = ext4_ext_get_actual_len(ex1); + ext2_ee_len = ext4_ext_get_actual_len(ex2); + + if (le32_to_cpu(ex1->ee_block) + ext1_ee_len != le32_to_cpu(ex2->ee_block)) return 0; @@ -1115,14 +1127,14 @@ ext4_can_extents_be_merged(struct inode * as an RO_COMPAT feature, refuse to merge to extents if * this can result in the top bit of ee_len being set. */ - if (le16_to_cpu(ex1->ee_len) + le16_to_cpu(ex2->ee_len) > EXT_MAX_LEN) + if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN) return 0; #ifdef AGGRESSIVE_TEST if (le16_to_cpu(ex1->ee_len) >= 4) return 0; #endif - if (ext_pblock(ex1) + le16_to_cpu(ex1->ee_len) == ext_pblock(ex2)) + if (ext_pblock(ex1) + ext1_ee_len == ext_pblock(ex2)) return 1; return 0; } @@ -1144,7 +1156,7 @@ unsigned int ext4_ext_check_overlap(stru unsigned int ret = 0; b1 = le32_to_cpu(newext->ee_block); - len1 = le16_to_cpu(newext->ee_len); + len1 = ext4_ext_get_actual_len(newext); depth = ext_depth(inode); if (!path[depth].p_ext) goto out; @@ -1191,8 +1203,9 @@ int ext4_ext_insert_extent(handle_t *han struct ext4_extent *nearex; /* nearest extent */ struct ext4_ext_path *npath = NULL; int depth, len, err, next; + unsigned uninitialized = 0; - BUG_ON(newext->ee_len == 0); + BUG_ON(ext4_ext_get_actual_len(newext) == 0); depth = ext_depth(inode); ex = path[depth].p_ext; BUG_ON(path[depth].p_hdr == NULL); @@ -1200,14 +1213,24 @@ int ext4_ext_insert_extent(handle_t *han /* try to insert block into found extent and return */ if (ex && ext4_can_extents_be_merged(inode, ex, newext)) { ext_debug("append %d block to %d:%d (from %llu)\n", - le16_
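The merge test that this patch adds to ext4_can_extents_be_merged() can be paraphrased in a small standalone sketch. It assumes a simplified, already-decoded extent (`struct simple_extent` and `can_merge` are hypothetical names, not kernel types) and uses the EXT_MAX_LEN value that applies at this point in the series.

```c
#include <assert.h>
#include <stdint.h>

#define EXT_MAX_LEN	((1UL << 15) - 1)

/* Hypothetical, already-decoded view of an extent. */
struct simple_extent {
	uint32_t lblock;	/* first logical block */
	uint64_t pblock;	/* first physical block */
	uint16_t len;		/* actual length in blocks */
	int uninit;		/* non-zero if uninitialized (preallocated) */
};

/* Returns 1 if b can be merged onto the end of a, else 0. */
static int can_merge(const struct simple_extent *a,
		     const struct simple_extent *b)
{
	assert(a && b);

	/* Both extents must be in the same state. */
	if (!a->uninit != !b->uninit)
		return 0;
	/* b must start at the logical block right after a. */
	if (a->lblock + a->len != b->lblock)
		return 0;
	/* The merged length must still fit in a single extent. */
	if ((unsigned long)a->len + b->len > EXT_MAX_LEN)
		return 0;
	/* b must also follow a on disk. */
	return a->pblock + a->len == b->pblock;
}
```

The first check is the new one: an initialized extent and an uninitialized extent are never merged, even when they are adjacent both logically and physically.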
[PATCH 3/6][TAKE7] revalidate write permissions for fallocate
From: David P. Quigley <[EMAIL PROTECTED]>

Revalidate the write permissions for fallocate(2), in case security policy
has changed since the files were opened.

Acked-by: James Morris <[EMAIL PROTECTED]>
Signed-off-by: David P. Quigley <[EMAIL PROTECTED]>
---
 fs/open.c |    3 +++
 1 files changed, 3 insertions(+)

Index: linux-2.6.22/fs/open.c
===
--- linux-2.6.22.orig/fs/open.c
+++ linux-2.6.22/fs/open.c
@@ -407,6 +407,9 @@ asmlinkage long sys_fallocate(int fd, in
 		goto out;
 	if (!(file->f_mode & FMODE_WRITE))
 		goto out_fput;
+	ret = security_file_permission(file, MAY_WRITE);
+	if (ret)
+		goto out_fput;
 
 	inode = file->f_path.dentry->d_inode;
[PATCH 1/6][TAKE7] manpage for fallocate
Following is the modified version of the manpage originally submitted by
David Chinner. Please use `nroff -man fallocate.2 | less` to view.

This includes changes suggested by Heikki Orsila and Barry Naujok.

.TH fallocate 2
.SH NAME
fallocate \- allocate or remove file space
.SH SYNOPSIS
.nf
.B #include
.PP
.BI "long fallocate(int " fd ", int " mode ", loff_t " offset ", loff_t " len ");
.SH DESCRIPTION
The
.B fallocate
syscall allows a user to directly manipulate the allocated disk space
for the file referred to by
.I fd
for the byte range starting at
.I offset
and continuing for
.I len
bytes.
The
.I mode
parameter determines the operation to be performed on the given range.
Currently there are two modes:
.TP
.B FALLOC_ALLOCATE
allocates and initialises to zero the disk space within the given range.
After a successful call, subsequent writes are guaranteed not to fail because
of lack of disk space. If the size of the file is less than
.IR offset + len ,
then the file is increased to this size; otherwise the file size is left
unchanged.
.B FALLOC_ALLOCATE
closely resembles
.BR posix_fallocate (3)
and is intended as a method of optimally implementing this function.
.B FALLOC_ALLOCATE
may allocate a larger range than was specified.
.TP
.B FALLOC_RESV_SPACE
provides the same functionality as
.B FALLOC_ALLOCATE
except it does not ever change the file size. This allows allocation of
zero blocks beyond the end of file and is useful for optimising append
workloads.
.SH RETURN VALUE
.B fallocate
returns zero on success, or an error number on failure.
Note that
.I errno
is not set.
.SH ERRORS
.TP
.B EBADF
.I fd
is not a valid file descriptor, or is not opened for writing.
.TP
.B EFBIG
.IR offset + len
exceeds the maximum file size.
.TP
.B EINVAL
.I offset
was less than 0, or
.I len
was less than or equal to 0.
.TP
.B ENODEV
.I fd
does not refer to a regular file or a directory.
.TP
.B ENOSPC
There is not enough space left on the device containing the file
referred to by
.IR fd .
.TP
.B ESPIPE
.I fd
refers to a pipe or FIFO.
.TP
.B ENOSYS
The filesystem underlying the file descriptor does not support this
operation.
.TP
.B EINTR
A signal was caught during execution.
.TP
.B EIO
An I/O error occurred while reading from or writing to a file system.
.TP
.B EOPNOTSUPP
The mode is not supported on the file descriptor.
.SH AVAILABILITY
The
.B fallocate
system call is available since 2.6.XX.
.SH SEE ALSO
.BR syscall (2),
.BR posix_fadvise (3),
.BR ftruncate (2).
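The only difference between the two modes described in this man page is their effect on the file size. As a rough illustration, the rule can be written as a pure function; `expected_size_after_fallocate` is a hypothetical helper, the mode values are those used in the testcase posted with this series, and nothing here calls the real system call.

```c
#include <assert.h>
#include <stdint.h>

/* Mode values as defined in the testcase posted with this series. */
#define FALLOC_ALLOCATE		0x0
#define FALLOC_RESV_SPACE	0x01

/*
 * Hypothetical helper: the st_size a caller should expect after a
 * successful fallocate(fd, mode, offset, len) on a file whose size
 * was old_size.
 */
static int64_t expected_size_after_fallocate(int64_t old_size, int mode,
					     int64_t offset, int64_t len)
{
	/* The man page lists EINVAL for offset < 0 or len <= 0. */
	assert(offset >= 0 && len > 0);

	/* FALLOC_RESV_SPACE never changes the file size. */
	if (mode == FALLOC_RESV_SPACE)
		return old_size;

	/* FALLOC_ALLOCATE grows the file to offset + len if needed. */
	return offset + len > old_size ? offset + len : old_size;
}
```

In both modes st_blocks is expected to grow; only FALLOC_ALLOCATE may also change st_size.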
[PATCH 2/6][TAKE7] fallocate() implementation in i386, x86_64 and powerpc
From: Amit Arora <[EMAIL PROTECTED]>

sys_fallocate() implementation on i386, x86_64 and powerpc

fallocate() is a new system call being proposed here which will allow
applications to preallocate space to any file(s) in a file system.
Each file system implementation that wants to use this feature will need
to support an inode operation called ->fallocate().

Applications can use this feature to avoid fragmentation to a certain level
and thus get faster access speed. With preallocation, applications also get
a guarantee of space for particular file(s) - even if later the system
becomes full.

Currently, glibc provides an interface called posix_fallocate() which
can be used for a similar cause. Though this has the advantage of working on
all file systems, it is quite slow (since it writes zeroes to each
block that has to be preallocated). Without a doubt, file systems can do
this more efficiently within the kernel, by implementing the proposed
fallocate() system call. It is expected that posix_fallocate() will be
modified to call this new system call first and, in case the
kernel/filesystem does not implement it, fall back to the current
implementation of writing zeroes to the new blocks.

ToDos:
1. Implementation on other architectures (other than i386, x86_64,
   and ppc). Patches for s390(x) and ia64 are already available from
   previous posts, but it was decided that they should be added later
   once fallocate is in the mainline. Hence not including those patches
   in this take.
2. A generic file system operation to handle fallocate
   (generic_fallocate), for filesystems that do _not_ have the fallocate
   inode operation implemented.
3.
Changes to glibc, a) to support fallocate() system call b) to make posix_fallocate() and posix_fallocate64() call fallocate() Signed-off-by: Amit Arora <[EMAIL PROTECTED]> Index: linux-2.6.22/arch/i386/kernel/syscall_table.S === --- linux-2.6.22.orig/arch/i386/kernel/syscall_table.S +++ linux-2.6.22/arch/i386/kernel/syscall_table.S @@ -323,3 +323,4 @@ ENTRY(sys_call_table) .long sys_signalfd .long sys_timerfd .long sys_eventfd + .long sys_fallocate Index: linux-2.6.22/arch/powerpc/kernel/sys_ppc32.c === --- linux-2.6.22.orig/arch/powerpc/kernel/sys_ppc32.c +++ linux-2.6.22/arch/powerpc/kernel/sys_ppc32.c @@ -773,6 +773,13 @@ asmlinkage int compat_sys_truncate64(con return sys_truncate(path, (high << 32) | low); } +asmlinkage long compat_sys_fallocate(int fd, int mode, u32 offhi, u32 offlo, +u32 lenhi, u32 lenlo) +{ + return sys_fallocate(fd, mode, ((loff_t)offhi << 32) | offlo, +((loff_t)lenhi << 32) | lenlo); +} + asmlinkage int compat_sys_ftruncate64(unsigned int fd, u32 reg4, unsigned long high, unsigned long low) { Index: linux-2.6.22/arch/x86_64/ia32/ia32entry.S === --- linux-2.6.22.orig/arch/x86_64/ia32/ia32entry.S +++ linux-2.6.22/arch/x86_64/ia32/ia32entry.S @@ -719,4 +719,5 @@ ia32_sys_call_table: .quad compat_sys_signalfd .quad compat_sys_timerfd .quad sys_eventfd + .quad sys32_fallocate ia32_syscall_end: Index: linux-2.6.22/fs/open.c === --- linux-2.6.22.orig/fs/open.c +++ linux-2.6.22/fs/open.c @@ -353,6 +353,92 @@ asmlinkage long sys_ftruncate64(unsigned #endif /* + * sys_fallocate - preallocate blocks or free preallocated blocks + * @fd: the file descriptor + * @mode: mode specifies the behavior of allocation. + * @offset: The offset within file, from where allocation is being + * requested. It should not have a negative value. + * @len: The amount of space in bytes to be allocated, from the offset. + * This can not be zero or a negative value. + * + * This system call preallocates space for a file. 
The range of blocks + * allocated depends on the value of offset and len arguments provided + * by the user/application. With FALLOC_ALLOCATE or FALLOC_RESV_SPACE + * modes, if the system call succeeds, subsequent writes to the file in + * the given range (specified by offset & len) should not fail - even if + * the file system later becomes full. Hence the preallocation done is + * persistent (valid even after reopen of the file and remount/reboot). + * + * It is expected that the ->fallocate() inode operation implemented by + * the individual file systems will update the file size and/or + * ctime/mtime depending on the mode and also on the success of the + * operation. + * + * Note: Incase the file system does not support preallocation, + * posix_fallocate() should fall back to the library implementation (i.e. + * allocating zero-filled new blocks to the file). + * + * Return Values + * 0
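The compat_sys_fallocate() wrapper in this patch rebuilds each 64-bit argument from two 32-bit halves, because a 32-bit caller cannot pass a 64-bit value in a single register. The sketch below shows that arithmetic in userspace; `join_halves` and `split_halves` are hypothetical helper names and int64_t stands in for loff_t.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: rebuild a 64-bit offset from its two halves. */
static int64_t join_halves(uint32_t hi, uint32_t lo)
{
	/* Widen before shifting so the high half is not lost. */
	return (int64_t)(((uint64_t)hi << 32) | lo);
}

/* Hypothetical helper: what a 32-bit caller does before the call. */
static void split_halves(int64_t value, uint32_t *hi, uint32_t *lo)
{
	assert(hi && lo);
	*hi = (uint32_t)((uint64_t)value >> 32);
	*lo = (uint32_t)((uint64_t)value & 0xffffffffu);
}
```

The cast before the shift matters: shifting a 32-bit value left by 32 without widening it first is undefined in C, which is why the patch casts offhi and lenhi to loff_t.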
Re: [PATCH 1/6][TAKE7] manpage for fallocate
On Sat, Jul 14, 2007 at 12:06:51AM +1000, David Chinner wrote:
> On Fri, Jul 13, 2007 at 06:16:01PM +0530, Amit K. Arora wrote:
> > Following is the modified version of the manpage originally submitted by
> > David Chinner. Please use `nroff -man fallocate.2 | less` to view.
> >
> > This includes changes suggested by Heikki Orsila and Barry Naujok.
>
> Can we get itemised change logs for all these patches from now on?

Sure.

> > .TH fallocate 2
> > .SH NAME
> > fallocate \- allocate or remove file space
>
> If fallocate is just being used for allocating space this is wrong.
> maybe - "manipulate file space" instead?

Yes, it needs to be changed.

> > .TP
> > .B FALLOC_RESV_SPACE
> > provides the same functionality as
> > .B FALLOC_ALLOCATE
> > except it does not ever change the file size. This allows allocation
> > of zero blocks beyond the end of file and is useful for optimising
>
> "of zeroed blocks"

Ok.

--
Regards,
Amit Arora

> --
> Dave Chinner
> Principal Engineer
> SGI Australian Software Group
[PATCH 0/6][TAKE7] fallocate system call
This is the latest fallocate patchset and is based on 2.6.22.

* Following are the changes from TAKE6:
1) We now just have two modes (and no deallocation modes).
2) Updated the man page.
3) Added a new patch submitted by David P. Quigley (Patch 3/6).
4) Used EXT_INIT_MAX_LEN instead of 0x8000 in Patch 6/6.
5) Included below in the end is a small testcase to test fallocate.

* Following are the changes from TAKE5 to TAKE6:
1) Rebased to 2.6.22
2) Added compat wrapper for x86_64
3) Dropped s390 and ia64 patches, since the platform maintainers can
   add the support for fallocate once it is in mainline.
4) Added a change suggested by Andreas for better extent-to-group
   alignment in ext4 (Patch 6/6). Please refer to the following post:
   http://www.mail-archive.com/[EMAIL PROTECTED]/msg02445.html
5) Renamed mode flags and values from "FA_" to "FALLOC_"
6) Added manpage (updated version of the one initially submitted by
   David Chinner).

Todos:
------
1> Implementation on other architectures (other than i386, x86_64,
   and ppc64). s390(x) and ia64 patches are ready and will be pushed by
   platform maintainers when the fallocate is in mainline.
2> A generic file system operation to handle fallocate
   (generic_fallocate), for filesystems that do _not_ have the fallocate
   inode operation implemented.
3> Changes to glibc,
   a) to support fallocate() system call
   b) to make posix_fallocate() and posix_fallocate64() call fallocate()
4> Patch to e2fsprogs to recognize and display uninitialized extents.

The following patches follow:
Patch 1/6 : manpage for fallocate
Patch 2/6 : fallocate() implementation in i386, x86_64 and powerpc
Patch 3/6 : revalidate write permissions for fallocate
Patch 4/6 : ext4: fallocate support in ext4
Patch 5/6 : ext4: write support for preallocated blocks
Patch 6/6 : ext4: change for better extent-to-group alignment

Note: Attached below is a small testcase to test fallocate.
The __NR_fallocate will need to be changed depending on the system call number in the kernel (it may get changed due to merge) and also depending on the architecture. -- Regards, Amit Arora #include #include #include #include #include #include #include #define VERBOSE 0 #define __NR_fallocate324 #define FALLOC_FL_KEEP_SIZE 0x01 #define FALLOC_ALLOCATE 0x0 #define FALLOC_RESV_SPACE FALLOC_FL_KEEP_SIZE int do_fallocate(int fd, int mode, loff_t offset, loff_t len) { int ret; if (VERBOSE) printf("Trying to preallocate blocks (offset=%llu, len=%llu)\n", offset, len); ret = syscall(__NR_fallocate, fd, mode, offset, len); if (ret <0) { printf("SYSCALL: received error %d, ret=%d\n", errno, ret); close(fd); return(1); } if (VERBOSE) printf("fallocate system call succedded ! ret=%d\n", ret); return ret; } int test_fallocate(int fd, int mode, loff_t offset, loff_t len) { int ret, blocks; struct stat statbuf1, statbuf2; fstat(fd, &statbuf1); ret = do_fallocate(fd, mode, offset, len); fstat(fd, &statbuf2); /* check file size after preallocation */ if (mode == FALLOC_ALLOCATE) { if (!ret && statbuf1.st_size < (offset + len) && statbuf2.st_size != (offset + len)) { printf("Error: fallocate succeeded, but the file size did not " "change, where it should have!\n"); ret = 1; } } else if (statbuf1.st_size != statbuf2.st_size) { printf("Error : File size changed, when it should not have!\n"); ret = 1; } blocks = ((statbuf2.st_blocks - statbuf1.st_blocks) * 512)/ statbuf2.st_blksize; /* Print report */ printf("# FALLOCATE TEST REPORT #\n"); printf("\tNew blocks preallocated = %d.\n", blocks); printf("\tNumber of bytes preallocated = %d\n", blocks * statbuf2.st_blksize); printf("\tOld file size = %d, New file size %d.\n", statbuf1.st_size, statbuf2.st_size); printf("\tOld num blocks = %d, New num blocks %d.\n", (statbuf1.st_blocks * 512)/1024, (statbuf2.st_blocks * 512)/1024); return ret; } int do_write(int fd, loff_t offset, loff_t len) { int ret; char *buf; buf = (char *)malloc(len); 
if (!buf) { printf("error: malloc failed.\n"); return(-1); } if (VERBOSE) printf("Trying to write to file (offset=%llu, len=%llu)\n", offset, len); ret = lseek(fd, offset, SEEK_SET); if (ret != offset) { printf("lseek() failed error=%d, ret=%d\n", errno, ret); close(fd); return(-1); } ret = write(fd, buf, len); if (ret != len) { printf("write() failed error=%d, ret=%d\n", errno, ret); close(fd); return(-1); } if (VERBOSE) printf("Write succedded ! Written %llu bytes ret=%d\n", len, ret); return ret; } int test_write(int fd, loff_t offset, loff_t len) { int ret; ret = do_write(fd, offset, len); printf("# WRITE TEST REPO
Re: [PATCH 2/6][TAKE7] fallocate() implementation in i386, x86_64 and powerpc
On Fri, Jul 13, 2007 at 02:21:19PM +0100, Christoph Hellwig wrote:
> On Fri, Jul 13, 2007 at 06:17:55PM +0530, Amit K. Arora wrote:
> > /*
> > + * sys_fallocate - preallocate blocks or free preallocated blocks
> > + * @fd: the file descriptor
> > + * @mode: mode specifies the behavior of allocation.
> > + * @offset: The offset within file, from where allocation is being
> > + * requested. It should not have a negative value.
> > + * @len: The amount of space in bytes to be allocated, from the offset.
> > + * This can not be zero or a negative value.
>
> kerneldoc comments are for in-kernel APIs which syscalls aren't. I'd say
> just remove this comment, the manpage is a much better documentation anyway.

Ok. I will remove this entire comment.

> > + * Generic fallocate to be added for file systems that do not
> > + * support fallocate.
>
> Please remove the comment, adding a generic fallback in kernelspace is a
> very dumb idea as we already discussed long time ago.
>
> > --- linux-2.6.22.orig/include/linux/fs.h
> > +++ linux-2.6.22/include/linux/fs.h
> > @@ -266,6 +266,21 @@ extern int dir_notify_enable;
> > #define SYNC_FILE_RANGE_WRITE 2
> > #define SYNC_FILE_RANGE_WAIT_AFTER 4
> >
> > +/*
> > + * sys_fallocate modes
> > + * Currently sys_fallocate supports two modes:
> > + * FALLOC_ALLOCATE : This is the preallocate mode, using which an application
> > + * may request reservation of space for a particular file.
> > + * The file size will be changed if the allocation is
> > + * beyond EOF.
> > + * FALLOC_RESV_SPACE : This is same as the above mode, with only one difference
> > + * that the file size will not be modified.
> > + */
> > +#define FALLOC_FL_KEEP_SIZE 0x01 /* default is extend/shrink size */
> > +
> > +#define FALLOC_ALLOCATE 0
> > +#define FALLOC_RESV_SPACE FALLOC_FL_KEEP_SIZE
>
> Just remove FALLOC_ALLOCATE, 0 flags should be the default. I'm also
> not sure there is any point in having two namespace now that we have a flags-
> based ABI.

Ok. Since we have only one flag (FALLOC_FL_KEEP_SIZE) and we do not want
to declare the default mode (FALLOC_ALLOCATE), we can _just_ have this
flag and remove the other mode too (FALLOC_RESV_SPACE).
Is this what you are suggesting ?

> Also please don't add this to fs.h. fs.h is a complete mess and the
> falloc flags are a new user ABI. Add a linux/falloc.h instead which can
> be added to headers-y so the ABI constant can be exported to userspace.

Do we need a header file just to declare one flag - i.e.
FALLOC_FL_KEEP_SIZE (since now there is no point in declaring the two
modes) ? If "linux/fs.h" is not a good place, will "asm-generic/fcntl.h"
be a sane place for this flag ?

Thanks!
--
Regards,
Amit Arora
Re: [PATCH 3/6][TAKE7] revalidate write permissions for fallocate
On Fri, Jul 13, 2007 at 02:21:37PM +0100, Christoph Hellwig wrote:
> On Fri, Jul 13, 2007 at 06:18:47PM +0530, Amit K. Arora wrote:
> > From: David P. Quigley <[EMAIL PROTECTED]>
> >
> > Revalidate the write permissions for fallocate(2), in case security policy
> > has changed since the files were opened.
> >
> > Acked-by: James Morris <[EMAIL PROTECTED]>
> > Signed-off-by: David P. Quigley <[EMAIL PROTECTED]>
>
> This should be merged into the main falloc patch.

Ok. Will merge it...

--
Regards,
Amit Arora
[PATCH 0/5][TAKE8] fallocate system call
This is the latest fallocate patchset and is based on 2.6.22.

* Following are the changes from TAKE7:
1) Updated the man page.
2) Merged "revalidate write permissions" patch with the main falloc patch.
3) Added linux/falloc.h and moved FALLOC_FL_KEEP_SIZE flag to it.
   Also removed the two modes (FALLOC_ALLOCATE and FALLOC_RESV_SPACE).
4) Removed comment above sys_fallocate definition.
5) Updated the testcase below to use FALLOC_FL_KEEP_SIZE flag instead
   of previous two modes.

* Following are the changes from TAKE6:
1) We now just have two modes (and no deallocation modes).
2) Updated the man page.
3) Added a new patch submitted by David P. Quigley (Patch 3/6).
4) Used EXT_INIT_MAX_LEN instead of 0x8000 in Patch 6/6.
5) Included below in the end is a small testcase to test fallocate.

* Following are the changes from TAKE5 to TAKE6:
1) Rebased to 2.6.22
2) Added compat wrapper for x86_64
3) Dropped s390 and ia64 patches, since the platform maintainers can
   add the support for fallocate once it is in mainline.
4) Added a change suggested by Andreas for better extent-to-group
   alignment in ext4 (Patch 6/6). Please refer to the following post:
   http://www.mail-archive.com/[EMAIL PROTECTED]/msg02445.html
5) Renamed mode flags and values from "FA_" to "FALLOC_"
6) Added manpage (updated version of the one initially submitted by
   David Chinner).

Todos:
------
1> Implementation on other architectures (other than i386, x86_64,
   and ppc64). s390(x) and ia64 patches are ready and will be pushed by
   platform maintainers when the fallocate is in mainline.
2> A generic file system operation to handle fallocate
   (generic_fallocate), for filesystems that do _not_ have the fallocate
   inode operation implemented.
3> Changes to glibc,
   a) to support fallocate() system call
   b) to make posix_fallocate() and posix_fallocate64() call fallocate()
4> Patch to e2fsprogs to recognize and display uninitialized extents.
Following patches follow: Patch 1/5 : manpage for fallocate Patch 2/5 : fallocate() implementation in i386, x86_64 and powerpc Patch 3/5 : ext4: fallocate support in ext4 Patch 4/5 : ext4: write support for preallocated blocks Patch 5/5 : ext4: change for better extent-to-group alignment ** Attached below is a small testcase to test fallocate. The __NR_fallocate will need to be changed depending on the system call number in the kernel (it may get changed due to merge) and also depending on the architecture. -- Regards, Amit Arora #include #include #include #include #include #include #include #define VERBOSE 0 #define __NR_fallocate324 #define FALLOC_FL_KEEP_SIZE 0x01 int do_fallocate(int fd, int mode, loff_t offset, loff_t len) { int ret; if (VERBOSE) printf("Trying to preallocate blocks (offset=%llu, len=%llu)\n", offset, len); ret = syscall(__NR_fallocate, fd, mode, offset, len); if (ret <0) { printf("SYSCALL: received error %d, ret=%d\n", errno, ret); close(fd); return(1); } if (VERBOSE) printf("fallocate system call succedded ! 
ret=%d\n", ret); return ret; } int test_fallocate(int fd, int mode, loff_t offset, loff_t len) { int ret, blocks; struct stat statbuf1, statbuf2; fstat(fd, &statbuf1); ret = do_fallocate(fd, mode, offset, len); fstat(fd, &statbuf2); /* check file size after preallocation */ if (!mode) { if (!ret && statbuf1.st_size < (offset + len) && statbuf2.st_size != (offset + len)) { printf("Error: fallocate succeeded, but the file size did not " "change, where it should have!\n"); ret = 1; } } else if (statbuf1.st_size != statbuf2.st_size) { printf("Error : File size changed, when it should not have!\n"); ret = 1; } blocks = ((statbuf2.st_blocks - statbuf1.st_blocks) * 512)/ statbuf2.st_blksize; /* Print report */ printf("# FALLOCATE TEST REPORT #\n"); printf("\tNew blocks preallocated = %d.\n", blocks); printf("\tNumber of bytes preallocated = %d\n", blocks * statbuf2.st_blksize); printf("\tOld file size = %d, New file size %d.\n", statbuf1.st_size, statbuf2.st_size); printf("\tOld num blocks = %d, New num blocks %d.\n", (statbuf1.st_blocks * 512)/1024, (statbuf2.st_blocks * 512)/1024); return ret; } int do_write(int fd, loff_t offset, loff_t len) { int ret; char *buf; buf = (char *)malloc(len); if (!buf) { printf("error: malloc failed.\n"); return(-1); } if (VERBOSE) printf("Trying to write to file (offset=%llu, len=%llu)\n", offset, len); ret = lseek(fd, offset, SEEK_SET); if (ret != offset) { printf("lseek() failed error=%d, ret=%d\n", errno, ret); close(fd); return(-1); } ret = write(fd, buf, len); if (ret != len) { printf("write() failed error=%d, ret=%d\n", errno, ret); close(fd);
[PATCH 1/5][TAKE8] manpage for fallocate
Following is the modified version of the manpage originally submitted by
David Chinner. Please use `nroff -man fallocate.2 | less` to view.

Following changed from TAKE7:
* Removed FALLOC_ALLOCATE and FALLOC_RESV_SPACE modes.
* Described only single flag for mode, i.e. FALLOC_FL_KEEP_SIZE.
* s/zero blocks/zeroed blocks/ as suggested by Dave.
* Included instead of .

Following changed from TAKE6 to TAKE7:
Included changes suggested by Heikki Orsila and Barry Naujok.


.TH fallocate 2
.SH NAME
fallocate \- manipulate file space
.SH SYNOPSIS
.nf
.B #include
.PP
.BI "long fallocate(int " fd ", int " mode ", loff_t " offset ", loff_t " len ");
.SH DESCRIPTION
The
.B fallocate
syscall allows a user to directly manipulate the allocated disk space
for the file referred to by
.I fd
for the byte range starting at
.I offset
and continuing for
.I len
bytes.
The
.I mode
parameter determines the operation to be performed on the given range.
Currently there is only one flag supported for the mode argument.
.TP
.B FALLOC_FL_KEEP_SIZE
allocates and initialises to zero the disk space within the given range.
After a successful call, subsequent writes are guaranteed not to fail because
of lack of disk space. Even if the size of the file is less than
.IR offset + len ,
the file size is not changed. This allows allocation of zeroed blocks beyond
the end of file and is useful for optimising append workloads.
.PP
If the
.B FALLOC_FL_KEEP_SIZE
flag is not specified in the mode argument, the default behavior of this system
call is almost the same as when this flag is passed. The only difference is
that on success, the file size will be changed if
.IR offset + len
is greater than the file size. This default behavior closely resembles
.BR posix_fallocate (3)
and is intended as a method of optimally implementing this function.
.PP
.B fallocate
may allocate a larger range than was specified.
.SH RETURN VALUE
.B fallocate
returns zero on success, or an error number on failure.
Note that
.I errno
is not set.
.SH ERRORS
.TP
.B EBADF
.I fd
is not a valid file descriptor, or is not opened for writing.
.TP
.B EFBIG
.IR offset + len
exceeds the maximum file size.
.TP
.B EINVAL
.I offset
was less than 0, or
.I len
was less than or equal to 0.
.TP
.B ENODEV
.I fd
does not refer to a regular file or a directory.
.TP
.B ENOSPC
There is not enough space left on the device containing the file
referred to by
.IR fd .
.TP
.B ESPIPE
.I fd
refers to a pipe or FIFO.
.TP
.B ENOSYS
The filesystem underlying the file descriptor does not support this
operation.
.TP
.B EINTR
A signal was caught during execution.
.TP
.B EIO
An I/O error occurred while reading from or writing to a file system.
.TP
.B EOPNOTSUPP
The mode is not supported on the file descriptor.
.SH AVAILABILITY
The
.B fallocate
system call is available since 2.6.XX
.SH SEE ALSO
.BR posix_fallocate (3),
.BR posix_fadvise (3),
.BR ftruncate (3).
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/5][TAKE8] fallocate() implementation in i386, x86_64 and powerpc
From: Amit Arora <[EMAIL PROTECTED]>

sys_fallocate() implementation on i386, x86_64 and powerpc

fallocate() is a new system call being proposed here which will allow
applications to preallocate space to any file(s) in a file system.
Each file system implementation that wants to use this feature will need
to support an inode operation called ->fallocate().
Applications can use this feature to avoid fragmentation to a certain level
and thus get faster access speed. With preallocation, applications also get
a guarantee of space for particular file(s) - even if the system later
becomes full.

Currently, glibc provides an interface called posix_fallocate() which
can be used for a similar purpose. Though this has the advantage of working
on all file systems, it is quite slow (since it writes zeroes to each
block that has to be preallocated). Without a doubt, file systems can do
this more efficiently within the kernel, by implementing the proposed
fallocate() system call. It is expected that posix_fallocate() will be
modified to call this new system call first and, in case the
kernel/filesystem does not implement it, fall back to the current
implementation of writing zeroes to the new blocks.

ToDos:
1. Implementation on other architectures (other than i386, x86_64,
   and ppc). Patches for s390(x) and ia64 are already available from
   previous posts, but it was decided that they should be added later
   once fallocate is in the mainline. Hence not including those patches
   in this take.
2. Changes to glibc,
   a) to support fallocate() system call
   b) to make posix_fallocate() and posix_fallocate64() call fallocate()

CHANGELOG:
- Following changed from TAKE7:
1. Added linux/falloc.h and moved FALLOC_FL_KEEP_SIZE flag to it.
2. Removed the two modes (FALLOC_ALLOCATE and FALLOC_RESV_SPACE).
3. Merged "revalidate write permissions" patch from David P. Quigley
   to this patch.
4. Deleted comment above sys_fallocate definition, as suggested by
   Christoph.

Signed-off-by: Amit Arora <[EMAIL PROTECTED]> Index: linux-2.6.22/arch/i386/kernel/syscall_table.S === --- linux-2.6.22.orig/arch/i386/kernel/syscall_table.S +++ linux-2.6.22/arch/i386/kernel/syscall_table.S @@ -323,3 +323,4 @@ ENTRY(sys_call_table) .long sys_signalfd .long sys_timerfd .long sys_eventfd + .long sys_fallocate Index: linux-2.6.22/arch/powerpc/kernel/sys_ppc32.c === --- linux-2.6.22.orig/arch/powerpc/kernel/sys_ppc32.c +++ linux-2.6.22/arch/powerpc/kernel/sys_ppc32.c @@ -773,6 +773,13 @@ asmlinkage int compat_sys_truncate64(con return sys_truncate(path, (high << 32) | low); } +asmlinkage long compat_sys_fallocate(int fd, int mode, u32 offhi, u32 offlo, +u32 lenhi, u32 lenlo) +{ + return sys_fallocate(fd, mode, ((loff_t)offhi << 32) | offlo, +((loff_t)lenhi << 32) | lenlo); +} + asmlinkage int compat_sys_ftruncate64(unsigned int fd, u32 reg4, unsigned long high, unsigned long low) { Index: linux-2.6.22/arch/x86_64/ia32/ia32entry.S === --- linux-2.6.22.orig/arch/x86_64/ia32/ia32entry.S +++ linux-2.6.22/arch/x86_64/ia32/ia32entry.S @@ -719,4 +719,5 @@ ia32_sys_call_table: .quad compat_sys_signalfd .quad compat_sys_timerfd .quad sys_eventfd + .quad sys32_fallocate ia32_syscall_end: Index: linux-2.6.22/fs/open.c === --- linux-2.6.22.orig/fs/open.c +++ linux-2.6.22/fs/open.c @@ -26,6 +26,7 @@ #include #include #include +#include int vfs_statfs(struct dentry *dentry, struct kstatfs *buf) { @@ -352,6 +353,64 @@ asmlinkage long sys_ftruncate64(unsigned } #endif +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) +{ + struct file *file; + struct inode *inode; + long ret = -EINVAL; + + if (offset < 0 || len <= 0) + goto out; + + /* Return error if mode is not supported */ + ret = -EOPNOTSUPP; + if (mode && !(mode & FALLOC_FL_KEEP_SIZE)) + goto out; + + ret = -EBADF; + file = fget(fd); + if (!file) + goto out; + if (!(file->f_mode & FMODE_WRITE)) + goto out_fput; + /* +* Revalidate the write permissions, in case security policy has +* 
changed since the files were opened. +*/ + ret = security_file_permission(file, MAY_WRITE); + if (ret) + goto out_fput; + + inode = file->f_path.dentry->d_inode; + + ret = -ESPIPE; + if (S_ISFIFO(inode->i_mode)) + goto out_fput; + + ret = -ENODEV; + /* +* Let individual file system decide if it supports preallocation
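
For readers following the entry checks of sys_fallocate() in this take, here is a minimal user-space sketch of the argument validation only. It is an illustration, not the kernel code: check_fallocate_args() and its max_bytes parameter are hypothetical names, the FALLOC_FL_KEEP_SIZE value is the one used in the test case posted with this series, and the EFBIG test is written from the man page description because that part of the function is not visible in the excerpt above.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Flag value as defined in the test case posted with this series. */
#define FALLOC_FL_KEEP_SIZE 0x01

/*
 * Hypothetical user-space mirror of the argument checks at the top of
 * sys_fallocate(). max_bytes stands in for the file system's maximum
 * file size and is an assumption of this sketch.
 * Returns 0 if the arguments pass, or a negative errno value.
 */
static int check_fallocate_args(int mode, int64_t offset, int64_t len,
				int64_t max_bytes)
{
	assert(max_bytes >= 0);

	if (offset < 0 || len <= 0)
		return -EINVAL;

	/* Same test as the patch: a non-zero mode must contain KEEP_SIZE. */
	if (mode && !(mode & FALLOC_FL_KEEP_SIZE))
		return -EOPNOTSUPP;

	/* Written as a subtraction so that offset + len cannot overflow. */
	if (len > max_bytes - offset)
		return -EFBIG;

	return 0;
}
```

One thing the sketch makes visible: as the mode test is written in the patch, a mode such as 0x03 (an unknown bit combined with FALLOC_FL_KEEP_SIZE) is not rejected.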
[PATCH 3/5][TAKE8] ext4: fallocate support in ext4
From: Amit Arora <[EMAIL PROTECTED]> fallocate support in ext4 This patch implements ->fallocate() inode operation in ext4. With this patch users of ext4 file systems will be able to use fallocate() system call for persistent preallocation. Current implementation only supports preallocation for regular files (directories not supported as of date) with extent maps. This patch does not support block-mapped files currently. Only FALLOC_ALLOCATE and FALLOC_RESV_SPACE modes are being supported as of now. CHANGELOG: - Following changed from TAKE7: 1. Removed usage of FALLOC_ALLOCATE and FALLOC_RESV_SPACE modes and used FALLOC_FL_KEEP_SIZE mode flag instead. 2. Included new header file, which defines above flag. Signed-off-by: Amit Arora <[EMAIL PROTECTED]> Index: linux-2.6.22/fs/ext4/extents.c === --- linux-2.6.22.orig/fs/ext4/extents.c +++ linux-2.6.22/fs/ext4/extents.c @@ -39,6 +39,7 @@ #include #include #include +#include #include #include @@ -282,7 +283,7 @@ static void ext4_ext_show_path(struct in } else if (path->p_ext) { ext_debug(" %d:%d:%llu ", le32_to_cpu(path->p_ext->ee_block), - le16_to_cpu(path->p_ext->ee_len), + ext4_ext_get_actual_len(path->p_ext), ext_pblock(path->p_ext)); } else ext_debug(" []"); @@ -305,7 +306,7 @@ static void ext4_ext_show_leaf(struct in for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) { ext_debug("%d:%d:%llu ", le32_to_cpu(ex->ee_block), - le16_to_cpu(ex->ee_len), ext_pblock(ex)); + ext4_ext_get_actual_len(ex), ext_pblock(ex)); } ext_debug("\n"); } @@ -425,7 +426,7 @@ ext4_ext_binsearch(struct inode *inode, ext_debug(" -> %d:%llu:%d ", le32_to_cpu(path->p_ext->ee_block), ext_pblock(path->p_ext), - le16_to_cpu(path->p_ext->ee_len)); + ext4_ext_get_actual_len(path->p_ext)); #ifdef CHECK_BINSEARCH { @@ -686,7 +687,7 @@ static int ext4_ext_split(handle_t *hand ext_debug("move %d:%llu:%d in new leaf %llu\n", le32_to_cpu(path[depth].p_ext->ee_block), ext_pblock(path[depth].p_ext), - le16_to_cpu(path[depth].p_ext->ee_len), + 
ext4_ext_get_actual_len(path[depth].p_ext), newblock); /*memmove(ex++, path[depth].p_ext++, sizeof(struct ext4_extent)); @@ -1106,7 +1107,19 @@ static int ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1, struct ext4_extent *ex2) { - if (le32_to_cpu(ex1->ee_block) + le16_to_cpu(ex1->ee_len) != + unsigned short ext1_ee_len, ext2_ee_len; + + /* +* Make sure that either both extents are uninitialized, or +* both are _not_. +*/ + if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2)) + return 0; + + ext1_ee_len = ext4_ext_get_actual_len(ex1); + ext2_ee_len = ext4_ext_get_actual_len(ex2); + + if (le32_to_cpu(ex1->ee_block) + ext1_ee_len != le32_to_cpu(ex2->ee_block)) return 0; @@ -1115,14 +1128,14 @@ ext4_can_extents_be_merged(struct inode * as an RO_COMPAT feature, refuse to merge to extents if * this can result in the top bit of ee_len being set. */ - if (le16_to_cpu(ex1->ee_len) + le16_to_cpu(ex2->ee_len) > EXT_MAX_LEN) + if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN) return 0; #ifdef AGGRESSIVE_TEST if (le16_to_cpu(ex1->ee_len) >= 4) return 0; #endif - if (ext_pblock(ex1) + le16_to_cpu(ex1->ee_len) == ext_pblock(ex2)) + if (ext_pblock(ex1) + ext1_ee_len == ext_pblock(ex2)) return 1; return 0; } @@ -1144,7 +1157,7 @@ unsigned int ext4_ext_check_overlap(stru unsigned int ret = 0; b1 = le32_to_cpu(newext->ee_block); - len1 = le16_to_cpu(newext->ee_len); + len1 = ext4_ext_get_actual_len(newext); depth = ext_depth(inode); if (!path[depth].p_ext) goto out; @@ -1191,8 +1204,9 @@ int ext4_ext_insert_extent(handle_t *han struct ext4_extent *nearex; /* nearest extent */ struct ext4_ext_path *npath = NULL; int depth, len, err, next; + unsigned uninitialized = 0; - BUG_ON(newext->ee_len == 0); + BUG_ON(ext4_ext_get_actual_len(newext) == 0); depth = ext_depth(inode); ex = path[depth].p_ext; BUG_ON(path[depth].p_hdr
[PATCH 4/5][TAKE8] ext4: write support for preallocated blocks
From: Amit Arora <[EMAIL PROTECTED]> write support for preallocated blocks This patch adds write support to the uninitialized extents that get created when a preallocation is done using fallocate(). It takes care of splitting the extents into multiple (upto three) extents and merging the new split extents with neighbouring ones, if possible. CHANGELOG: - This patch did not change from TAKE7 (besides offsets ;). Signed-off-by: Amit Arora <[EMAIL PROTECTED]> Index: linux-2.6.22/fs/ext4/extents.c === --- linux-2.6.22.orig/fs/ext4/extents.c +++ linux-2.6.22/fs/ext4/extents.c @@ -1141,6 +1141,53 @@ ext4_can_extents_be_merged(struct inode } /* + * This function tries to merge the "ex" extent to the next extent in the tree. + * It always tries to merge towards right. If you want to merge towards + * left, pass "ex - 1" as argument instead of "ex". + * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns + * 1 if they got merged. + */ +int ext4_ext_try_to_merge(struct inode *inode, + struct ext4_ext_path *path, + struct ext4_extent *ex) +{ + struct ext4_extent_header *eh; + unsigned int depth, len; + int merge_done = 0; + int uninitialized = 0; + + depth = ext_depth(inode); + BUG_ON(path[depth].p_hdr == NULL); + eh = path[depth].p_hdr; + + while (ex < EXT_LAST_EXTENT(eh)) { + if (!ext4_can_extents_be_merged(inode, ex, ex + 1)) + break; + /* merge with next extent! 
*/ + if (ext4_ext_is_uninitialized(ex)) + uninitialized = 1; + ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex) + + ext4_ext_get_actual_len(ex + 1)); + if (uninitialized) + ext4_ext_mark_uninitialized(ex); + + if (ex + 1 < EXT_LAST_EXTENT(eh)) { + len = (EXT_LAST_EXTENT(eh) - ex - 1) + * sizeof(struct ext4_extent); + memmove(ex + 1, ex + 2, len); + } + eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries) - 1); + merge_done = 1; + WARN_ON(eh->eh_entries == 0); + if (!eh->eh_entries) + ext4_error(inode->i_sb, "ext4_ext_try_to_merge", + "inode#%lu, eh->eh_entries = 0!", inode->i_ino); + } + + return merge_done; +} + +/* * check if a portion of the "newext" extent overlaps with an * existing extent. * @@ -1328,25 +1375,7 @@ has_space: merge: /* try to merge extents to the right */ - while (nearex < EXT_LAST_EXTENT(eh)) { - if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1)) - break; - /* merge with next extent! */ - if (ext4_ext_is_uninitialized(nearex)) - uninitialized = 1; - nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex) - + ext4_ext_get_actual_len(nearex + 1)); - if (uninitialized) - ext4_ext_mark_uninitialized(nearex); - - if (nearex + 1 < EXT_LAST_EXTENT(eh)) { - len = (EXT_LAST_EXTENT(eh) - nearex - 1) - * sizeof(struct ext4_extent); - memmove(nearex + 1, nearex + 2, len); - } - eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1); - BUG_ON(eh->eh_entries == 0); - } + ext4_ext_try_to_merge(inode, path, nearex); /* try to merge extents to the left */ @@ -2012,15 +2041,158 @@ void ext4_ext_release(struct super_block #endif } +/* + * This function is called by ext4_ext_get_blocks() if someone tries to write + * to an uninitialized extent. It may result in splitting the uninitialized + * extent into multiple extents (upto three - one initialized and two + * uninitialized). 
+ * There are three possibilities: + * a> There is no split required: Entire extent should be initialized + * b> Splits in two extents: Write is happening at either end of the extent + * c> Splits in three extents: Somone is writing in middle of the extent + */ +int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode, + struct ext4_ext_path *path, + ext4_fsblk_t iblock, + unsigned long max_blocks) +{ + struct ext4_extent *ex, newex; + struct ext4_extent *ex1 = NULL; + struct ext4_extent *ex2 = NULL; + struct ext4_extent *ex3 = NULL; + struct ext4_extent_header *eh; + unsigned int allocated, ee_block, ee_len, depth; + ext4_fsblk_t newblock; + int
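
The three cases (a, b, c) listed in the comment on ext4_ext_convert_to_initialized() can be illustrated with a small user-space sketch. It only counts how many extents result from a write into an uninitialized extent; it does not model the merging with neighbouring extents that the patch also attempts. The function name extents_after_write() and its parameters are hypothetical and do not appear in the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: an uninitialized extent covers logical blocks
 * [ee_block, ee_block + ee_len) and a write covers
 * [iblock, iblock + nblocks), entirely inside that extent.
 * Returns the number of extents after the written part has been
 * converted to initialized: 1, 2 or 3.
 */
static int extents_after_write(uint32_t ee_block, uint32_t ee_len,
			       uint32_t iblock, uint32_t nblocks)
{
	uint32_t head, tail;
	int pieces = 1;		/* the part that becomes initialized */

	assert(ee_len > 0 && nblocks > 0);
	assert(iblock >= ee_block);
	head = iblock - ee_block;	/* uninitialized blocks before the write */
	assert(head < ee_len && nblocks <= ee_len - head);
	tail = ee_len - head - nblocks;	/* uninitialized blocks after the write */

	if (head > 0)
		pieces++;
	if (tail > 0)
		pieces++;
	return pieces;
}
```

For an extent of 10 blocks starting at block 100, a write of all 10 blocks gives one extent (case a), a write of blocks 100-103 or 106-109 gives two (case b), and a write of blocks 103-106 gives three (case c).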
[PATCH 5/5][TAKE8] ext4: change for better extent-to-group alignment
From: Amit Arora <[EMAIL PROTECTED]> Change on-disk format for extent to represent uninitialized/initialized extents This change was suggested by Andreas Dilger. This patch changes the EXT_MAX_LEN value and extent code which marks/checks uninitialized extents. With this change it will be possible to have initialized extents with 2^15 blocks (earlier the max blocks we could have was 2^15 - 1). This way we can have better extent-to-block alignment. Now, maximum number of blocks we can have in an initialized extent is 2^15 and in an uninitialized extent is 2^15 - 1. CHANGELOG: - This patch did not change from TAKE7 (besides offsets ;). Following changed from TAKE6 to TAKE7: 1. Taken care of Andreas's suggestion of using EXT_INIT_MAX_LEN instead of 0x8000 at some places. Signed-off-by: Amit Arora <[EMAIL PROTECTED]> Index: linux-2.6.22/fs/ext4/extents.c === --- linux-2.6.22.orig/fs/ext4/extents.c +++ linux-2.6.22/fs/ext4/extents.c @@ -1107,7 +1107,7 @@ static int ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1, struct ext4_extent *ex2) { - unsigned short ext1_ee_len, ext2_ee_len; + unsigned short ext1_ee_len, ext2_ee_len, max_len; /* * Make sure that either both extents are uninitialized, or @@ -1116,6 +1116,11 @@ ext4_can_extents_be_merged(struct inode if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2)) return 0; + if (ext4_ext_is_uninitialized(ex1)) + max_len = EXT_UNINIT_MAX_LEN; + else + max_len = EXT_INIT_MAX_LEN; + ext1_ee_len = ext4_ext_get_actual_len(ex1); ext2_ee_len = ext4_ext_get_actual_len(ex2); @@ -1128,7 +1133,7 @@ ext4_can_extents_be_merged(struct inode * as an RO_COMPAT feature, refuse to merge to extents if * this can result in the top bit of ee_len being set. 
*/ - if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN) + if (ext1_ee_len + ext2_ee_len > max_len) return 0; #ifdef AGGRESSIVE_TEST if (le16_to_cpu(ex1->ee_len) >= 4) @@ -1815,7 +1820,11 @@ ext4_ext_rm_leaf(handle_t *handle, struc ex->ee_block = cpu_to_le32(block); ex->ee_len = cpu_to_le16(num); - if (uninitialized) + /* +* Do not mark uninitialized if all the blocks in the +* extent have been removed. +*/ + if (uninitialized && num) ext4_ext_mark_uninitialized(ex); err = ext4_ext_dirty(handle, inode, path + depth); @@ -2308,6 +2317,19 @@ int ext4_ext_get_blocks(handle_t *handle /* allocate new block */ goal = ext4_ext_find_goal(inode, path, iblock); + /* +* See if request is beyond maximum number of blocks we can have in +* a single extent. For an initialized extent this limit is +* EXT_INIT_MAX_LEN and for an uninitialized extent this limit is +* EXT_UNINIT_MAX_LEN. +*/ + if (max_blocks > EXT_INIT_MAX_LEN && + create != EXT4_CREATE_UNINITIALIZED_EXT) + max_blocks = EXT_INIT_MAX_LEN; + else if (max_blocks > EXT_UNINIT_MAX_LEN && +create == EXT4_CREATE_UNINITIALIZED_EXT) + max_blocks = EXT_UNINIT_MAX_LEN; + /* Check if we can really insert (iblock)::(iblock+max_blocks) extent */ newex.ee_block = cpu_to_le32(iblock); newex.ee_len = cpu_to_le16(max_blocks); Index: linux-2.6.22/include/linux/ext4_fs_extents.h === --- linux-2.6.22.orig/include/linux/ext4_fs_extents.h +++ linux-2.6.22/include/linux/ext4_fs_extents.h @@ -141,7 +141,25 @@ typedef int (*ext_prepare_callback)(stru #define EXT_MAX_BLOCK 0x -#define EXT_MAX_LEN((1UL << 15) - 1) +/* + * EXT_INIT_MAX_LEN is the maximum number of blocks we can have in an + * initialized extent. This is 2^15 and not (2^16 - 1), since we use the + * MSB of ee_len field in the extent datastructure to signify if this + * particular extent is an initialized extent or an uninitialized (i.e. + * preallocated). + * EXT_UNINIT_MAX_LEN is the maximum number of blocks we can have in an + * uninitialized extent. 
+ * If ee_len is <= 0x8000, it is an initialized extent. Otherwise, it is an + * uninitialized one. In other words, if MSB of ee_len is set, it is an + * uninitialized extent with only one special scenario when ee_len = 0x8000. + * In this case we can not have an uninitialized extent of zero length and + * thus we make it as a special case of initialized extent with 0x8000 length. + * This way we get better extent-to-group alignment for initialized extents. + * Hence, the maximum number of blocks we can have in an *initialized* + * extent is 2^15 (32768) and in an *uninitialized* exte
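
The ee_len encoding described in the comment above can be tried out with a short user-space sketch. It is written from that comment alone and works on host-order 16-bit values, whereas the patch operates on little-endian fields of struct ext4_extent; the sketch_* function names are hypothetical and are not the kernel helpers.

```c
#include <assert.h>
#include <stdint.h>

#define EXT_INIT_MAX_LEN	(1U << 15)		/* 32768 blocks */
#define EXT_UNINIT_MAX_LEN	(EXT_INIT_MAX_LEN - 1)	/* 32767 blocks */

/* An extent is uninitialized only if ee_len is strictly above 0x8000. */
static int sketch_is_uninitialized(uint16_t ee_len)
{
	return ee_len > EXT_INIT_MAX_LEN;
}

/* Number of blocks the extent really covers, with the marker removed. */
static uint16_t sketch_actual_len(uint16_t ee_len)
{
	return ee_len <= EXT_INIT_MAX_LEN ?
		ee_len : (uint16_t)(ee_len - EXT_INIT_MAX_LEN);
}

/* Encode an uninitialized extent of len blocks (1 .. 32767). */
static uint16_t sketch_mark_uninitialized(uint16_t len)
{
	assert(len > 0 && len <= EXT_UNINIT_MAX_LEN);
	return (uint16_t)(len | EXT_INIT_MAX_LEN);
}
```

The special case is visible here: the value 0x8000 would decode as an uninitialized extent of zero length, which cannot exist, so it is read as an initialized extent of 32768 blocks instead.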
Re: [PATCH 1/6][TAKE7] manpage for fallocate
On Sat, Jul 14, 2007 at 10:23:42AM +0200, Michael Kerrisk wrote:
> [CC += [EMAIL PROTECTED]
>
> Amit,

Hi Michael,

> Thanks for this page. I will endeavour to review it in
> the coming days. In the meantime, the better address to CC
> me on for man pages stuff is [EMAIL PROTECTED]

Sure. BTW, this man page has changed a bit and the one in TAKE8 of
fallocate patches is the latest one. You are copied on that too.
I will forward that mail to "[EMAIL PROTECTED]" id also, so that you do
not miss it.

Thanks!
--
Regards,
Amit Arora

>
> Cheers,
>
> Michael
>
> > Following is the modified version of the manpage originally submitted by
> > David Chinner. Please use `nroff -man fallocate.2 | less` to view.
> >
> > This includes changes suggested by Heikki Orsila and Barry Naujok.
> >
> >
> > .TH fallocate 2
> > .SH NAME
> > fallocate \- allocate or remove file space
> > .SH SYNOPSIS
> > .nf
> > .B #include
> > .PP
> > .BI "long fallocate(int " fd ", int " mode ", loff_t " offset ", loff_t " len);
> > .SH DESCRIPTION
> > The
> > .B fallocate
> > syscall allows a user to directly manipulate the allocated disk space
> > for the file referred to by
> > .I fd
> > for the byte range starting at
> > .I offset
> > and continuing for
> > .I len
> > bytes.
> > The
> > .I mode
> > parameter determines the operation to be performed on the given range.
> > Currently there are two modes:
> > .TP
> > .B FALLOC_ALLOCATE
> > allocates and initialises to zero the disk space within the given range.
> > After a successful call, subsequent writes are guaranteed not to fail because
> > of lack of disk space. If the size of the file is less than
> > .IR offset + len ,
> > then the file is increased to this size; otherwise the file size is left
> > unchanged.
> > .B FALLOC_ALLOCATE
> > closely resembles
> > .BR posix_fallocate (3)
> > and is intended as a method of optimally implementing this function.
> > .B FALLOC_ALLOCATE
> > may allocate a larger range than that was specified.
> > .TP
> > .B FALLOC_RESV_SPACE
> > provides the same functionality as
> > .B FALLOC_ALLOCATE
> > except it does not ever change the file size. This allows allocation
> > of zero blocks beyond the end of file and is useful for optimising
> > append workloads.
> > .SH RETURN VALUE
> > .B fallocate
> > returns zero on success, or an error number on failure.
> > Note that
> > .I errno
> > is not set.
> > .SH ERRORS
> > .TP
> > .B EBADF
> > .I fd
> > is not a valid file descriptor, or is not opened for writing.
> > .TP
> > .B EFBIG
> > .IR offset + len
> > exceeds the maximum file size.
> > .TP
> > .B EINVAL
> > .I offset
> > was less than 0, or
> > .I len
> > was less than or equal to 0.
> > .TP
> > .B ENODEV
> > .I fd
> > does not refer to a regular file or a directory.
> > .TP
> > .B ENOSPC
> > There is not enough space left on the device containing the file
> > referred to by
> > .IR fd .
> > .TP
> > .B ESPIPE
> > .I fd
> > refers to a pipe of file descriptor.
> > .TP
> > .B ENOSYS
> > The filesystem underlying the file descriptor does not support this
> > operation.
> > .TP
> > .B EINTR
> > A signal was caught during execution
> > .TP
> > .B EIO
> > An I/O error occurred while reading from or writing to a file system.
> > .TP
> > .B EOPNOTSUPP
> > The mode is not supported on the file descriptor.
> > .SH AVAILABILITY
> > The
> > .B fallocate
> > system call is available since 2.6.XX
> > .SH SEE ALSO
> > .BR syscall (2),
> > .BR posix_fadvise (3),
> > .BR ftruncate (3).
[PATCH 0/5][TAKE2] fallocate system call
This is the new set of patches which take care of the review comments
received from the community (mainly from Andrew).

Description:
-----------
fallocate() is a new system call being proposed here which will allow
applications to preallocate space to any file(s) in a file system.
Each file system implementation that wants to use this feature will need
to support an inode operation called fallocate.

Applications can use this feature to avoid fragmentation to a certain level
and thus get faster access speed. With preallocation, applications also get
a guarantee of space for particular file(s) - even if the system later
becomes full.

Currently, glibc provides an interface called posix_fallocate() which
can be used for a similar purpose. Though this has the advantage of working
on all file systems, it is quite slow (since it writes zeroes to each
block that has to be preallocated). Without a doubt, file systems can do
this more efficiently within the kernel, by implementing the proposed
fallocate() system call. It is expected that posix_fallocate() will be
modified to call this new system call first and, in case the
kernel/filesystem does not implement it, fall back to the current
implementation of writing zeroes to the new blocks.

Interface:
---------
The proposed system call's layout is:

 asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)

fd: The descriptor of the open file.

mode*: This specifies the behavior of the system call. Currently the
  system call supports two modes - FA_ALLOCATE and FA_DEALLOCATE.
  FA_ALLOCATE: Applications can use this mode to preallocate blocks to
    a given file (specified by fd). This mode changes the file size if
    the preallocation is done beyond the EOF. It also updates the
    ctime/mtime in the inode of the corresponding file, marking a
    successful allocation.
  FA_DEALLOCATE: This mode can be used by applications to deallocate the
    previously preallocated blocks. This also may change the file size
    and the ctime/mtime.
* New modes might get added in the future. One such new mode which is
  already under discussion is FA_PREALLOCATE, which when used will
  preallocate space but will not change the filesize and [cm]time.
  Since the semantics of this new mode are not clear and agreed upon yet,
  this patchset does not implement it currently.

offset: This is the offset in bytes, from where the preallocation should
  start.

len: This is the number of bytes requested for preallocation (from
  offset).

sys_fallocate() on s390:
-----------------------
There is a problem with the s390 ABI in implementing sys_fallocate() with
the proposed order of arguments. Martin Schwidefsky has suggested a patch
to solve this problem which makes use of a wrapper in the kernel. This will
require special handling of this system call on s390 in glibc as well.
But, this seems to be the best solution so far.

Known Problem:
-------------
mmapped writes into uninitialized extents is a known problem with the
current ext4 patches. Like XFS, ext4 may need to implement
->page_mkwrite() to solve this. See:
http://lkml.org/lkml/2007/5/8/583

Since there is talk of ->fault() replacing ->page_mkwrite() and also
with a generic block_page_mkwrite() implementation already posted, we can
implement this some time later. See:
http://lkml.org/lkml/2007/3/7/161
http://lkml.org/lkml/2007/3/18/198

ToDos:
-----
1> Implementation on other architectures (other than i386, x86_64,
   ppc64 and s390(x)). David Chinner has already posted a patch for ia64.
2> A generic file system operation to handle fallocate
   (generic_fallocate), for filesystems that do _not_ have the fallocate
   inode operation implemented.
3> Changes to glibc,
   a) to support fallocate() system call
   b) to make posix_fallocate() and posix_fallocate64() call fallocate()

Changelog:
---------
Each post will have an individual changelog for the particular patch.

The following posts with patches follow:
Patch 1/5 : fallocate() implementation on i386, x86_64 and powerpc
Patch 2/5 : fallocate() on s390
Patch 3/5 : ext4: Extent overlap bugfix
Patch 4/5 : ext4: fallocate support in ext4
Patch 5/5 : ext4: write support for preallocated blocks

--
Regards,
Amit Arora
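
The fallback that this cover letter expects from posix_fallocate() (try the kernel first, and do the work in the library if the kernel or file system answers ENOSYS or EOPNOTSUPP) could look roughly like the sketch below. Everything here is an assumption made for illustration: preallocate(), prealloc_by_writing(), touch_byte() and the kernel_prealloc_fn hook are hypothetical names, and the slow path is a simplification that touches one byte per block and preserves existing data rather than reproducing the glibc implementation. It is not safe against concurrent writers and assumes offset + len does not overflow.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical hook for the kernel path: returns 0 or a positive errno. */
typedef int (*kernel_prealloc_fn)(int fd, off_t offset, off_t len);

/* Read one byte (zero past end-of-file) and write it back in place. */
static int touch_byte(int fd, off_t pos)
{
	char c = 0;

	if (pread(fd, &c, 1, pos) < 0)
		return errno;
	if (pwrite(fd, &c, 1, pos) != 1)
		return errno ? errno : EIO;
	return 0;
}

/* Slow path: touch one byte in every block of [offset, offset + len). */
static int prealloc_by_writing(int fd, off_t offset, off_t len)
{
	struct stat st;
	off_t blk, end, pos;
	int err;

	assert(offset >= 0 && len > 0);
	if (fstat(fd, &st) != 0)
		return errno;
	blk = st.st_blksize > 0 ? st.st_blksize : 512;
	end = offset + len;

	for (pos = offset; pos < end; pos = (pos / blk + 1) * blk) {
		err = touch_byte(fd, pos);
		if (err)
			return err;
	}
	/* Touch the last byte too, so the file size reaches offset + len. */
	return touch_byte(fd, end - 1);
}

/* Try the kernel hook first; fall back only on ENOSYS or EOPNOTSUPP. */
static int preallocate(int fd, off_t offset, off_t len,
		       kernel_prealloc_fn try_kernel)
{
	if (offset < 0 || len <= 0)
		return EINVAL;
	if (try_kernel) {
		int err = try_kernel(fd, offset, len);

		if (err != ENOSYS && err != EOPNOTSUPP)
			return err;	/* success, or a real error */
	}
	return prealloc_by_writing(fd, offset, len);
}

/* Stub hook that behaves like a kernel without fallocate support. */
static int no_kernel_support(int fd, off_t offset, off_t len)
{
	(void)fd; (void)offset; (void)len;
	return ENOSYS;
}

/* Usage: returns the size of a fresh temp file after the fallback ran. */
static long demo_fallback_size(void)
{
	char path[] = "/tmp/prealloc_sketch_XXXXXX";
	struct stat st;
	long size = -1;
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	if (preallocate(fd, 0, 10000, no_kernel_support) == 0 &&
	    fstat(fd, &st) == 0)
		size = (long)st.st_size;
	close(fd);
	unlink(path);
	return size;
}
```

The kernel path is passed in as a function pointer so that the sketch does not depend on any particular fallocate() wrapper being available; a caller would supply a small function that invokes the system call and returns its error code.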
[PATCH 1/5][TAKE2] fallocate() implementation on i386, x86_64 and powerpc
This patch implements sys_fallocate() and adds support on i386, x86_64 and powerpc platforms. Changelog: - Following changes were made to the previous version: 1) Added description before sys_fallocate() definition. 2) Return EINVAL for len<=0 (With new draft that Ulrich pointed to, posix_fallocate should return EINVAL for len <= 0. 3) Return EOPNOTSUPP if mode is not one of FA_ALLOCATE or FA_DEALLOCATE 4) Do not return ENODEV for dirs (let individual file systems decide if they want to support preallocation to directories or not. 5) Check for wrap through zero. 6) Update c/mtime if fallocate() succeeds. 7) Added mode descriptions in fs.h 8) Added variable names to function definition (fallocate inode op) Here is the new patch: Signed-off-by: Amit Arora <[EMAIL PROTECTED]> --- arch/i386/kernel/syscall_table.S |1 arch/powerpc/kernel/sys_ppc32.c |7 +++ arch/x86_64/kernel/functionlist |1 fs/open.c| 89 +++ include/asm-i386/unistd.h|3 - include/asm-powerpc/systbl.h |1 include/asm-powerpc/unistd.h |3 - include/asm-x86_64/unistd.h |4 + include/linux/fs.h | 13 + include/linux/syscalls.h |1 10 files changed, 120 insertions(+), 3 deletions(-) Index: linux-2.6.21/arch/i386/kernel/syscall_table.S === --- linux-2.6.21.orig/arch/i386/kernel/syscall_table.S +++ linux-2.6.21/arch/i386/kernel/syscall_table.S @@ -319,3 +319,4 @@ ENTRY(sys_call_table) .long sys_move_pages .long sys_getcpu .long sys_epoll_pwait + .long sys_fallocate /* 320 */ Index: linux-2.6.21/arch/x86_64/kernel/functionlist === --- linux-2.6.21.orig/arch/x86_64/kernel/functionlist +++ linux-2.6.21/arch/x86_64/kernel/functionlist @@ -931,6 +931,7 @@ *(.text.sys_getitimer) *(.text.sys_getgroups) *(.text.sys_ftruncate) +*(.text.sys_fallocate) *(.text.sysfs_lookup) *(.text.sys_exit_group) *(.text.stub_fork) Index: linux-2.6.21/fs/open.c === --- linux-2.6.21.orig/fs/open.c +++ linux-2.6.21/fs/open.c @@ -351,6 +351,95 @@ asmlinkage long sys_ftruncate64(unsigned #endif /* + * sys_fallocate - preallocate blocks or free 
preallocated blocks + * @fd: the file descriptor + * @mode: mode specifies if fallocate should preallocate blocks OR free + * (unallocate) preallocated blocks. Currently only FA_ALLOCATE and + * FA_DEALLOCATE modes are supported. + * @offset: The offset within file, from where (un)allocation is being + * requested. It should not have a negative value. + * @len: The amount (in bytes) of space to be (un)allocated, from the offset. + * + * This system call, depending on the mode, preallocates or unallocates blocks + * for a file. The range of blocks depends on the value of offset and len + * arguments provided by the user/application. For FA_ALLOCATE mode, if this + * system call succeeds, subsequent writes to the file in the given range + * (specified by offset & len) should not fail - even if the file system + * later becomes full. Hence the preallocation done is persistent (valid + * even after reopen of the file and remount/reboot). + * + * Note: Incase the file system does not support preallocation, + * posix_fallocate() should fall back to the library implementation (i.e. + * allocating zero-filled new blocks to the file). + * + * Return Values + * 0 : On SUCCESS a value of zero is returned. + * error : On Failure, an error code will be returned. + * An error code of -ENOSYS or -EOPNOTSUPP should make posix_fallocate() + * fall back on library implementation of fallocate. + * + * Generic fallocate to be added for file systems that do not + * support fallocate it. 
+ */ +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) +{ + struct file *file; + struct inode *inode; + long ret = -EINVAL; + + if (offset < 0 || len <= 0) + goto out; + + /* Return error if mode is not supported */ + ret = -EOPNOTSUPP; + if (mode != FA_ALLOCATE && mode !=FA_DEALLOCATE) + goto out; + + ret = -EBADF; + file = fget(fd); + if (!file) + goto out; + if (!(file->f_mode & FMODE_WRITE)) + goto out_fput; + + inode = file->f_path.dentry->d_inode; + + ret = -ESPIPE; + if (S_ISFIFO(inode->i_mode)) + goto out_fput; + + ret = -ENODEV; + /* +* Let individual file system decide if it supports preallocation +* for directories or not. +*/ + if (!S_ISREG(inode->i_mode) && !S_ISDIR(inode->i_mode)) + goto out_fput; + + ret = -EFBIG; + /* Check for wrap through zero too */ + if
[PATCH 2/5][TAKE2] fallocate() on s390
This is the patch suggested by Martin Schwidefsky. Here are the comments
and patch from him.

-------------------------
From: Martin Schwidefsky <[EMAIL PROTECTED]>

This patch implements support of fallocate system call on s390(x)
platform. A wrapper is added to address the issue which s390 ABI has
with the arguments of this system call.

Signed-off-by: Martin Schwidefsky <[EMAIL PROTECTED]>
---

 arch/s390/kernel/compat_wrapper.S |   10 ++
 arch/s390/kernel/sys_s390.c       |   29 +
 arch/s390/kernel/syscalls.S       |    1 +
 include/asm-s390/unistd.h         |    3 ++-
 4 files changed, 42 insertions(+), 1 deletion(-)

Index: linux-2.6.21/arch/s390/kernel/compat_wrapper.S
===
--- linux-2.6.21.orig/arch/s390/kernel/compat_wrapper.S
+++ linux-2.6.21/arch/s390/kernel/compat_wrapper.S
@@ -1682,3 +1682,13 @@ compat_sys_utimes_wrapper:
         llgtr   %r2,%r2                 # char *
         llgtr   %r3,%r3                 # struct compat_timeval *
         jg      compat_sys_utimes
+
+        .globl  sys_fallocate_wrapper
+sys_fallocate_wrapper:
+        lgfr    %r2,%r2                 # int
+        lgfr    %r3,%r3                 # int
+        sllg    %r4,%r4,32              # get high word of 64bit loff_t
+        lr      %r4,%r5                 # get low word of 64bit loff_t
+        sllg    %r5,%r6,32              # get high word of 64bit loff_t
+        l       %r5,164(%r15)           # get low word of 64bit loff_t
+        jg      sys_fallocate
Index: linux-2.6.21/arch/s390/kernel/syscalls.S
===
--- linux-2.6.21.orig/arch/s390/kernel/syscalls.S
+++ linux-2.6.21/arch/s390/kernel/syscalls.S
@@ -322,3 +322,4 @@ NI_SYSCALL /* 310 sys_move_pages *
 SYSCALL(sys_getcpu,sys_getcpu,sys_getcpu_wrapper)
 SYSCALL(sys_epoll_pwait,sys_epoll_pwait,compat_sys_epoll_pwait_wrapper)
 SYSCALL(sys_utimes,sys_utimes,compat_sys_utimes_wrapper)
+SYSCALL(s390_fallocate,sys_fallocate,sys_fallocate_wrapper)
Index: linux-2.6.21/arch/s390/kernel/sys_s390.c
===
--- linux-2.6.21.orig/arch/s390/kernel/sys_s390.c
+++ linux-2.6.21/arch/s390/kernel/sys_s390.c
@@ -286,3 +286,32 @@ int kernel_execve(const char *filename,
                   "d" (__arg3) : "memory");
         return __svcres;
 }
+
+#ifndef CONFIG_64BIT
+/*
+ * This is a wrapper to call sys_fallocate(). For 31 bit s390 the last
+ * 64 bit argument "len" is split into the upper and lower 32 bits. The
+ * system call wrapper in the user space loads the value to %r6/%r7.
+ * The code in entry.S keeps the values in %r2 - %r6 where they are and
+ * stores %r7 to 96(%r15). But the standard C linkage requires that
+ * the whole 64 bit value for len is stored on the stack and doesn't
+ * use %r6 at all. So s390_fallocate has to convert the arguments from
+ *   %r2: fd, %r3: mode, %r4/%r5: offset, %r6/96(%r15)-99(%r15): len
+ * to
+ *   %r2: fd, %r3: mode, %r4/%r5: offset, 96(%r15)-103(%r15): len
+ */
+asmlinkage long s390_fallocate(int fd, int mode, loff_t offset,
+                               u32 len_high, u32 len_low)
+{
+        union {
+                u64 len;
+                struct {
+                        u32 high;
+                        u32 low;
+                };
+        } cv;
+        cv.high = len_high;
+        cv.low = len_low;
+        return sys_fallocate(fd, mode, offset, cv.len);
+}
+#endif
Index: linux-2.6.21/include/asm-s390/unistd.h
===
--- linux-2.6.21.orig/include/asm-s390/unistd.h
+++ linux-2.6.21/include/asm-s390/unistd.h
@@ -251,8 +251,9 @@
 #define __NR_getcpu             311
 #define __NR_epoll_pwait        312
 #define __NR_utimes             313
+#define __NR_fallocate          314
 
-#define NR_syscalls 314
+#define NR_syscalls 315
 
 /*
  * There are some system calls that are not present on 64 bit, some
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
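
A note on the union in s390_fallocate(): it works because s390 is big-endian, so "high" lands in the most significant half of "len". Below is a minimal user-space sketch of the same conversion written with shifts, which does not depend on byte order. The name combine_len() is hypothetical and is not part of the patch; the powerpc compat_sys_fallocate() in patch 1/5 uses the same shift form.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not part of the patch): join the two 32-bit
 * halves of a split 64-bit argument. The shift form gives the same
 * result on big- and little-endian machines, unlike the union.
 */
static uint64_t combine_len(uint32_t len_high, uint32_t len_low)
{
    return ((uint64_t)len_high << 32) | len_low;
}
```

The cast before the shift matters: shifting a 32-bit value by 32 without widening it first is undefined in C.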
[PATCH 3/5][TAKE2] ext4: Extent overlap bugfix
This patch adds a check for overlap of extents and cuts short the new
extent to be inserted, if there is a chance of overlap.

Changelog:
---------
As suggested by Andrew, a check for wrap through zero has been added.

Here is the new patch:

Signed-off-by: Amit Arora <[EMAIL PROTECTED]>
---
 fs/ext4/extents.c               |   60 ++--
 include/linux/ext4_fs_extents.h |    1
 2 files changed, 59 insertions(+), 2 deletions(-)

Index: linux-2.6.21/fs/ext4/extents.c
===
--- linux-2.6.21.orig/fs/ext4/extents.c
+++ linux-2.6.21/fs/ext4/extents.c
@@ -1129,6 +1129,55 @@ ext4_can_extents_be_merged(struct inode
 }
 
 /*
+ * check if a portion of the "newext" extent overlaps with an
+ * existing extent.
+ *
+ * If there is an overlap discovered, it updates the length of the newext
+ * such that there will be no overlap, and then returns 1.
+ * If there is no overlap found, it returns 0.
+ */
+unsigned int ext4_ext_check_overlap(struct inode *inode,
+                                    struct ext4_extent *newext,
+                                    struct ext4_ext_path *path)
+{
+        unsigned long b1, b2;
+        unsigned int depth, len1;
+        unsigned int ret = 0;
+
+        b1 = le32_to_cpu(newext->ee_block);
+        len1 = le16_to_cpu(newext->ee_len);
+        depth = ext_depth(inode);
+        if (!path[depth].p_ext)
+                goto out;
+        b2 = le32_to_cpu(path[depth].p_ext->ee_block);
+
+        /*
+         * get the next allocated block if the extent in the path
+         * is before the requested block(s)
+         */
+        if (b2 < b1) {
+                b2 = ext4_ext_next_allocated_block(path);
+                if (b2 == EXT_MAX_BLOCK)
+                        goto out;
+        }
+
+        /* check for wrap through zero */
+        if (b1 + len1 < b1) {
+                len1 = EXT_MAX_BLOCK - b1;
+                newext->ee_len = cpu_to_le16(len1);
+                ret = 1;
+        }
+
+        /* check for overlap */
+        if (b1 + len1 > b2) {
+                newext->ee_len = cpu_to_le16(b2 - b1);
+                ret = 1;
+        }
+out:
+        return ret;
+}
+
+/*
  * ext4_ext_insert_extent:
  * tries to merge requsted extent into the existing extent or
  * inserts requested extent as new one into the tree,
@@ -2032,7 +2081,15 @@ int ext4_ext_get_blocks(handle_t *handle
 
         /* allocate new block */
         goal = ext4_ext_find_goal(inode, path, iblock);
-        allocated = max_blocks;
+
+        /* Check if we can really insert (iblock)::(iblock+max_blocks) extent */
+        newex.ee_block = cpu_to_le32(iblock);
+        newex.ee_len = cpu_to_le16(max_blocks);
+        err = ext4_ext_check_overlap(inode, &newex, path);
+        if (err)
+                allocated = le16_to_cpu(newex.ee_len);
+        else
+                allocated = max_blocks;
         newblock = ext4_new_blocks(handle, inode, goal, &allocated, &err);
         if (!newblock)
                 goto out2;
@@ -2040,7 +2097,6 @@ int ext4_ext_get_blocks(handle_t *handle
                         goal, newblock, allocated);
 
         /* try to insert new extent into found leaf and return */
-        newex.ee_block = cpu_to_le32(iblock);
         ext4_ext_store_pblock(&newex, newblock);
         newex.ee_len = cpu_to_le16(allocated);
         err = ext4_ext_insert_extent(handle, inode, path, &newex);
Index: linux-2.6.21/include/linux/ext4_fs_extents.h
===
--- linux-2.6.21.orig/include/linux/ext4_fs_extents.h
+++ linux-2.6.21/include/linux/ext4_fs_extents.h
@@ -190,6 +190,7 @@ ext4_ext_invalidate_cache(struct inode *
 
 extern int ext4_extent_tree_init(handle_t *, struct inode *);
 extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *);
+extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *);
 extern int ext4_ext_insert_extent(handle_t *, struct inode *, struct ext4_ext_path *, struct ext4_extent *);
 extern int ext4_ext_walk_space(struct inode *, unsigned long, unsigned long, ext_prepare_callback, void *);
 extern struct ext4_ext_path * ext4_ext_find_extent(struct inode *, int, struct ext4_ext_path *);
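
For readers who want to try the trimming arithmetic outside the kernel, here is a rough user-space sketch of the two checks above (wrap through zero, then overlap). It is not the kernel code: trim_new_extent() and SKETCH_MAX_BLOCK are hypothetical names, plain 32-bit integers stand in for the on-disk little-endian fields, and the caller is assumed to have already looked up where the next allocated extent begins.

```c
#include <stdint.h>

/* Hypothetical stand-in for EXT_MAX_BLOCK; the real value lives in the ext4 headers. */
#define SKETCH_MAX_BLOCK 0xFFFFFFFFu

/*
 * Shorten *len so that [start, start + *len) neither wraps past the
 * largest block number nor runs into the next allocated extent, which
 * begins at next_start (assumed to be >= start).
 * Returns 1 if *len was shortened, 0 if it was left alone.
 */
static int trim_new_extent(uint32_t start, uint32_t *len, uint32_t next_start)
{
    int trimmed = 0;

    /* check for wrap through zero */
    if ((uint32_t)(start + *len) < start) {
        *len = SKETCH_MAX_BLOCK - start;
        trimmed = 1;
    }

    /* check for overlap with the next allocated extent */
    if ((uint32_t)(start + *len) > next_start) {
        *len = next_start - start;
        trimmed = 1;
    }

    return trimmed;
}
```

For example, a request for 50 blocks starting at block 100, with the next allocated extent at block 120, would be cut to 20 blocks and the function would report that it trimmed the request.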
[PATCH 4/5][TAKE2] ext4: fallocate support in ext4
This patch implements ->fallocate() inode operation in ext4. With this patch users of ext4 file systems will be able to use fallocate() system call for persistent preallocation. Current implementation only supports preallocation for regular files (directories not supported as of date) with extent maps. This patch does not support block-mapped files currently. Only FA_ALLOCATE mode is being supported as of now. Supporting FA_DEALLOCATE mode is a "To Do" item. Changelog: - Here are the changes from the previous post: 1) Added more description for ext4_fallocate(). 2) Now returning EOPNOTSUPP when files are block-mapped (non-extent). 3) Moved journal_start & journal_stop inside the while loop. 4) Replaced BUG_ON with WARN_ON & ext4_error. 5) Make EXT4_BLOCK_ALIGN use ALIGN macro internally. 6) Added variable names in the function declaration of ext4_fallocate() 7) Converted macros that handle uninitialized extents into inline functions. Here is the updated patch: Signed-off-by: Amit Arora <[EMAIL PROTECTED]> --- fs/ext4/extents.c | 241 +--- fs/ext4/file.c |1 include/linux/ext4_fs.h |8 + include/linux/ext4_fs_extents.h | 12 + 4 files changed, 221 insertions(+), 41 deletions(-) Index: linux-2.6.21/fs/ext4/extents.c === --- linux-2.6.21.orig/fs/ext4/extents.c +++ linux-2.6.21/fs/ext4/extents.c @@ -283,7 +283,7 @@ static void ext4_ext_show_path(struct in } else if (path->p_ext) { ext_debug(" %d:%d:%llu ", le32_to_cpu(path->p_ext->ee_block), - le16_to_cpu(path->p_ext->ee_len), + ext4_ext_get_actual_len(path->p_ext), ext_pblock(path->p_ext)); } else ext_debug(" []"); @@ -306,7 +306,7 @@ static void ext4_ext_show_leaf(struct in for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) { ext_debug("%d:%d:%llu ", le32_to_cpu(ex->ee_block), - le16_to_cpu(ex->ee_len), ext_pblock(ex)); + ext4_ext_get_actual_len(ex), ext_pblock(ex)); } ext_debug("\n"); } @@ -426,7 +426,7 @@ ext4_ext_binsearch(struct inode *inode, ext_debug(" -> %d:%llu:%d ", le32_to_cpu(path->p_ext->ee_block), 
ext_pblock(path->p_ext), - le16_to_cpu(path->p_ext->ee_len)); + ext4_ext_get_actual_len(path->p_ext)); #ifdef CHECK_BINSEARCH { @@ -687,7 +687,7 @@ static int ext4_ext_split(handle_t *hand ext_debug("move %d:%llu:%d in new leaf %llu\n", le32_to_cpu(path[depth].p_ext->ee_block), ext_pblock(path[depth].p_ext), - le16_to_cpu(path[depth].p_ext->ee_len), + ext4_ext_get_actual_len(path[depth].p_ext), newblock); /*memmove(ex++, path[depth].p_ext++, sizeof(struct ext4_extent)); @@ -1107,7 +1107,19 @@ static int ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1, struct ext4_extent *ex2) { - if (le32_to_cpu(ex1->ee_block) + le16_to_cpu(ex1->ee_len) != + unsigned short ext1_ee_len, ext2_ee_len; + + /* +* Make sure that either both extents are uninitialized, or +* both are _not_. +*/ + if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2)) + return 0; + + ext1_ee_len = ext4_ext_get_actual_len(ex1); + ext2_ee_len = ext4_ext_get_actual_len(ex2); + + if (le32_to_cpu(ex1->ee_block) + ext1_ee_len != le32_to_cpu(ex2->ee_block)) return 0; @@ -1116,14 +1128,14 @@ ext4_can_extents_be_merged(struct inode * as an RO_COMPAT feature, refuse to merge to extents if * this can result in the top bit of ee_len being set. */ - if (le16_to_cpu(ex1->ee_len) + le16_to_cpu(ex2->ee_len) > EXT_MAX_LEN) + if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN) return 0; #ifdef AGGRESSIVE_TEST if (le16_to_cpu(ex1->ee_len) >= 4) return 0; #endif - if (ext_pblock(ex1) + le16_to_cpu(ex1->ee_len) == ext_pblock(ex2)) + if (ext_pblock(ex1) + ext1_ee_len == ext_pblock(ex2)) return 1; return 0; } @@ -1145,7 +1157,7 @@ unsigned int ext4_ext_check_overlap(stru unsigned int ret = 0; b1 = le32_to_cpu(newext->ee_block); - len1 = le16_to_cpu(newext->ee_len); + len1 = ext4_ext_get_actual_len(newext); depth = ext_depth(inode); if (!path[depth].p_ext)
[PATCH 5/5][TAKE2] ext4: write support for preallocated blocks
This patch adds write support to the uninitialized extents that get created when a preallocation is done using fallocate(). It takes care of splitting the extents into multiple (upto three) extents and merging the new split extents with neighbouring ones, if possible. Changelog: - 1) Replaced BUG_ON with WARN_ON & ext4_error. 2) Added variable names to the function declaration of ext4_ext_try_to_merge(). 3) Updated variable declarations to use multiple-definitions-per-line. 4) "if((a=foo())).." was broken into "a=foo(); if(a).." 5) Removed extra spaces. Here is the updated patch: Signed-off-by: Amit Arora <[EMAIL PROTECTED]> --- fs/ext4/extents.c | 234 +++- include/linux/ext4_fs_extents.h |3 2 files changed, 210 insertions(+), 27 deletions(-) Index: linux-2.6.21/fs/ext4/extents.c === --- linux-2.6.21.orig/fs/ext4/extents.c +++ linux-2.6.21/fs/ext4/extents.c @@ -1141,6 +1141,54 @@ ext4_can_extents_be_merged(struct inode } /* + * This function tries to merge the "ex" extent to the next extent in the tree. + * It always tries to merge towards right. If you want to merge towards + * left, pass "ex - 1" as argument instead of "ex". + * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns + * 1 if they got merged. + */ +int ext4_ext_try_to_merge(struct inode *inode, + struct ext4_ext_path *path, + struct ext4_extent *ex) +{ + struct ext4_extent_header *eh; + unsigned int depth, len; + int merge_done = 0; + int uninitialized = 0; + + depth = ext_depth(inode); + BUG_ON(path[depth].p_hdr == NULL); + eh = path[depth].p_hdr; + + while (ex < EXT_LAST_EXTENT(eh)) + { + if (!ext4_can_extents_be_merged(inode, ex, ex + 1)) + break; + /* merge with next extent! 
*/ + if (ext4_ext_is_uninitialized(ex)) + uninitialized = 1; + ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex) + + ext4_ext_get_actual_len(ex + 1)); + if (uninitialized) + ext4_ext_mark_uninitialized(ex); + + if (ex + 1 < EXT_LAST_EXTENT(eh)) { + len = (EXT_LAST_EXTENT(eh) - ex - 1) + * sizeof(struct ext4_extent); + memmove(ex + 1, ex + 2, len); + } + eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries) - 1); + merge_done = 1; + WARN_ON(eh->eh_entries == 0); + if (!eh->eh_entries) + ext4_error(inode->i_sb, "ext4_ext_try_to_merge", + "inode#%lu, eh->eh_entries = 0!", inode->i_ino); + } + + return merge_done; +} + +/* * check if a portion of the "newext" extent overlaps with an * existing extent. * @@ -1328,25 +1376,7 @@ has_space: merge: /* try to merge extents to the right */ - while (nearex < EXT_LAST_EXTENT(eh)) { - if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1)) - break; - /* merge with next extent! */ - if (ext4_ext_is_uninitialized(nearex)) - uninitialized = 1; - nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex) - + ext4_ext_get_actual_len(nearex + 1)); - if (uninitialized) - ext4_ext_mark_uninitialized(nearex); - - if (nearex + 1 < EXT_LAST_EXTENT(eh)) { - len = (EXT_LAST_EXTENT(eh) - nearex - 1) - * sizeof(struct ext4_extent); - memmove(nearex + 1, nearex + 2, len); - } - eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1); - BUG_ON(eh->eh_entries == 0); - } + ext4_ext_try_to_merge(inode, path, nearex); /* try to merge extents to the left */ @@ -2012,15 +2042,152 @@ void ext4_ext_release(struct super_block #endif } +/* + * This function is called by ext4_ext_get_blocks() if someone tries to write + * to an uninitialized extent. It may result in splitting the uninitialized + * extent into multiple extents (upto three - one initialized and two + * uninitialized). 
+ * There are three possibilities: + * a> There is no split required: Entire extent should be initialized + * b> Splits in two extents: Write is happening at either end of the extent + * c> Splits in three extents: Somone is writing in middle of the extent + */ +int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode, + struct ext4_ext_path *path, + ext4_fsblk_t iblock, +
Re: [PATCH 2/5][TAKE2] fallocate() on s390 - glibc wrapper
On Mon, May 14, 2007 at 08:18:34PM +0530, Amit K. Arora wrote: > This is the patch suggested by Martin Schwidefsky. Here are the comments > and patch from him. Martin also suggested a wrapper in glibc to handle this system call on s390. Posting it here so that we get feedback for this too. Here it is: .globl __fallocate ENTRY(__fallocate) stm %r6,%r7,28(%r15)/* save %r6/%r7 on stack */ cfi_offset (%r7, -68) cfi_offset (%r6, -72) lm %r6,%r7,96(%r15)/* load loff_t len from stack */ svc SYS_ify(fallocate) lm %r6,%r7,28(%r15)/* restore %r6/%r7 from stack */ br %r14 PSEUDO_END(__fallocate) -- Regards, Amit Arora > - > From: Martin Schwidefsky <[EMAIL PROTECTED]> > > This patch implements support of fallocate system call on s390(x) > platform. A wrapper is added to address the issue which s390 ABI has > with the arguments of this system call. > > Signed-off-by: Martin Schwidefsky <[EMAIL PROTECTED]> > --- > > arch/s390/kernel/compat_wrapper.S | 10 ++ > arch/s390/kernel/sys_s390.c | 29 + > arch/s390/kernel/syscalls.S |1 + > include/asm-s390/unistd.h |3 ++- > 4 files changed, 42 insertions(+), 1 deletion(-) > > Index: linux-2.6.21/arch/s390/kernel/compat_wrapper.S > === > --- linux-2.6.21.orig/arch/s390/kernel/compat_wrapper.S > +++ linux-2.6.21/arch/s390/kernel/compat_wrapper.S > @@ -1682,3 +1682,13 @@ compat_sys_utimes_wrapper: > llgtr %r2,%r2 # char * > llgtr %r3,%r3 # struct compat_timeval * > jg compat_sys_utimes > + > + .globl sys_fallocate_wrapper > +sys_fallocate_wrapper: > + lgfr%r2,%r2 # int > + lgfr%r3,%r3 # int > + sllg%r4,%r4,32 # get high word of 64bit loff_t > + lr %r4,%r5 # get low word of 64bit loff_t > + sllg%r5,%r6,32 # get high word of 64bit loff_t > + l %r5,164(%r15) # get low word of 64bit loff_t > + jg sys_fallocate > Index: linux-2.6.21/arch/s390/kernel/syscalls.S > === > --- linux-2.6.21.orig/arch/s390/kernel/syscalls.S > +++ linux-2.6.21/arch/s390/kernel/syscalls.S > @@ -322,3 +322,4 @@ NI_SYSCALL > /* 310 sys_move_pages * > 
SYSCALL(sys_getcpu,sys_getcpu,sys_getcpu_wrapper) > SYSCALL(sys_epoll_pwait,sys_epoll_pwait,compat_sys_epoll_pwait_wrapper) > SYSCALL(sys_utimes,sys_utimes,compat_sys_utimes_wrapper) > +SYSCALL(s390_fallocate,sys_fallocate,sys_fallocate_wrapper) > Index: linux-2.6.21/arch/s390/kernel/sys_s390.c > === > --- linux-2.6.21.orig/arch/s390/kernel/sys_s390.c > +++ linux-2.6.21/arch/s390/kernel/sys_s390.c > @@ -286,3 +286,32 @@ int kernel_execve(const char *filename, > "d" (__arg3) : "memory"); > return __svcres; > } > + > +#ifndef CONFIG_64BIT > +/* > + * This is a wrapper to call sys_fallocate(). For 31 bit s390 the last > + * 64 bit argument "len" is split into the upper and lower 32 bits. The > + * system call wrapper in the user space loads the value to %r6/%r7. > + * The code in entry.S keeps the values in %r2 - %r6 where they are and > + * stores %r7 to 96(%r15). But the standard C linkage requires that > + * the whole 64 bit value for len is stored on the stack and doesn't > + * use %r6 at all. So s390_fallocate has to convert the arguments from > + * %r2: fd, %r3: mode, %r4/%r5: offset, %r6/96(%r15)-99(%r15): len > + * to > + * %r2: fd, %r3: mode, %r4/%r5: offset, 96(%r15)-103(%r15): len > + */ > +asmlinkage long s390_fallocate(int fd, int mode, loff_t offset, > +u32 len_high, u32 len_low) > +{ > + union { > + u64 len; > + struct { > + u32 high; > + u32 low; > + }; > + } cv; > + cv.high = len_high; > + cv.low = len_low; > + return sys_fallocate(fd, mode, offset, cv.len); > +} > +#endif > Index: linux-2.6.21/include/asm-s390/unistd.h > === > --- linux-2.6.21.orig/include/asm-s390/unistd.h > +++ linux-2.6.21/include/asm-s390/unistd.h > @@ -251,8 +251,9 @@ > #define __NR_getcpu 311 > #define __NR_epoll_pwait 312 > #define __NR_utimes 313 > +#define __NR_fallocate
Re: [PATCH 0/5][TAKE2] fallocate system call
On Tue, May 15, 2007 at 12:31:21AM -0600, Andreas Dilger wrote:
> On May 14, 2007 18:59 +0530, Amit K. Arora wrote:
> > asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
> >
> > fd: The descriptor of the open file.
> >
> > mode*: This specifies the behavior of the system call. Currently the
> > system call supports two modes - FA_ALLOCATE and FA_DEALLOCATE.
> > FA_ALLOCATE: Applications can use this mode to preallocate blocks to
> > a given file (specified by fd). This mode changes the file size if
> > the preallocation is done beyond the EOF. It also updates the
> > ctime/mtime in the inode of the corresponding file, marking a
> > successful allocation.
> > FA_DEALLOCATE: This mode can be used by applications to deallocate the
> > previously preallocated blocks. This also may change the file size
> > and the ctime/mtime.
> > * New modes might get added in future. One such new mode which is
> > already under discussion is FA_PREALLOCATE, which when used will
> > preallocate space but will not change the filesize and [cm]time.
> > Since the semantics of this new mode is not clear and agreed upon yet,
> > this patchset does not implement it currently.
> >
> > offset: This is the offset in bytes, from where the preallocation should
> > start.
> >
> > len: This is the number of bytes requested for preallocation (from
> > offset).
>
> What is the return value? I'd hope it is the number of bytes preallocated,
> in case of interrupted preallocation for whatever reason (interrupt, out of
> space, etc) like a regular write(2) call. In this case the return type needs
> to also be an loff_t to match @len.

The return value in the current implementation has been kept as "long":
zero is returned on success and an error code on failure. This is done to
keep it in line with posix_fallocate() behavior.

This point was brought up some time back by Badari. At that time it was
decided to keep it the way posix_fallocate() is designed. Here are the
posts related to this:
http://lkml.org/lkml/2007/3/2/18
http://lkml.org/lkml/2007/3/2/162
http://lkml.org/lkml/2007/3/2/208

Still, if you feel that we should be returning the number of bytes
preallocated, we can ask for opinions here again.

Thanks!
--
Regards,
Amit Arora
Re: [PATCH 1/5][TAKE2] fallocate() implementation on i86, x86_64 and powerpc
On Tue, May 15, 2007 at 09:44:36AM +1000, Stephen Rothwell wrote:
> On Mon, 14 May 2007 20:15:24 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote:
> >
> > This patch implements sys_fallocate() and adds support on i386, x86_64
> > and powerpc platforms.
>
> This patch no longer applies to Linus' tree - for a start there is no file
> arch/x86_64/kernel/functionlist any more.
>
> Can you rebase it, please?

I will rebase it to 2.6.22-rc1 and repost the patches soon.

Thanks!
--
Regards,
Amit Arora
[PATCH 0/5][TAKE3] fallocate system call
-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                     P L E A S E   N O T E :
                              ***
1. The patches have now been rebased to the 2.6.22-rc1 kernel. Earlier
   they were based on 2.6.21.
2. An unnecessary export of a symbol has been removed from the ext4
   preallocate patch. Details are in the corresponding post (PATCH 4/5).
3. The return type is now described in the interface description below.
4. Besides the above points, everything is exactly the same as TAKE2.
-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

This is the new set of patches which take care of the review comments
received from the community (mainly from Andrew).

Description:
-----------
fallocate() is a new system call being proposed here which will allow
applications to preallocate space to any file(s) in a file system.
Each file system implementation that wants to use this feature will need
to support an inode operation called fallocate.

Applications can use this feature to avoid fragmentation to a certain
level and thus get faster access speed. With preallocation, applications
also get a guarantee of space for particular file(s) - even if the file
system later becomes full.

Currently, glibc provides an interface called posix_fallocate() which
can be used for a similar purpose. Though this has the advantage of
working on all file systems, it is quite slow (since it writes zeroes
to each block that has to be preallocated). File systems can do this
more efficiently within the kernel by implementing the proposed
fallocate() system call. It is expected that posix_fallocate() will be
modified to call this new system call first and, in case the
kernel/filesystem does not implement it, fall back to the current
implementation of writing zeroes to the new blocks.

Interface:
---------
The proposed system call's layout is:

 asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)

fd: The descriptor of the open file.

mode*: This specifies the behavior of the system call. Currently the
  system call supports two modes - FA_ALLOCATE and FA_DEALLOCATE.
  FA_ALLOCATE: Applications can use this mode to preallocate blocks to
    a given file (specified by fd). This mode changes the file size if
    the preallocation is done beyond the EOF. It also updates the
    ctime/mtime in the inode of the corresponding file, marking a
    successful allocation.
  FA_DEALLOCATE: This mode can be used by applications to deallocate the
    previously preallocated blocks. This also may change the file size
    and the ctime/mtime.
* New modes might get added in future. One such new mode which is
  already under discussion is FA_PREALLOCATE, which when used will
  preallocate space but will not change the filesize and [cm]time.
  Since the semantics of this new mode are not clear and agreed upon yet,
  this patchset does not implement it currently.

offset: This is the offset in bytes, from where the preallocation should
  start.

len: This is the number of bytes requested for preallocation (from
  offset).

RETURN VALUE: The system call returns 0 on success and an error on
failure. This is done to keep the semantics the same as those of
posix_fallocate().

sys_fallocate() on s390:
-----------------------
There is a problem with the s390 ABI in implementing sys_fallocate() with
the proposed order of arguments. Martin Schwidefsky has suggested a
patch to solve this problem which makes use of a wrapper in the kernel.
This will require special handling of this system call on s390 in glibc
as well. But this seems to be the best solution so far.

Known Problem:
-------------
mmapped writes into uninitialized extents are a known problem with the
current ext4 patches. Like XFS, ext4 may need to implement
->page_mkwrite() to solve this. See:
http://lkml.org/lkml/2007/5/8/583

Since there is talk of ->fault() replacing ->page_mkwrite(), and a
generic block_page_mkwrite() implementation has already been posted, we
can implement this some time later. See:
http://lkml.org/lkml/2007/3/7/161
http://lkml.org/lkml/2007/3/18/198

ToDos:
-----
1> Implementation on other architectures (other than i386, x86_64,
   ppc64 and s390(x)). David Chinner has already posted a patch for ia64.
2> A generic file system operation to handle fallocate
   (generic_fallocate), for filesystems that do _not_ have the fallocate
   inode operation implemented.
3> Changes to glibc,
   a) to support fallocate() system call
   b) to make posix_fallocate() and posix_fallocate64() call fallocate()

Changelog:
---------
Each post will have an individual changelog for a particular patch.

The following patches follow:
Patch 1/5 : fallocate() implementation on i86, x86_64 and powerpc
Patch 2/5 : fallocate() on s390
Patch 3/5 : ext4: Extent overlap bugfix
Patch 4/5 : ext4: fallocate support in ext4
Patch 5/5 : ext4: write support for preallocated blocks

--
Regards,
Amit Arora
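
To make the expected posix_fallocate() behavior described above concrete, here is a rough sketch of the dispatch logic only: try the native preallocation first and fall back to zero-filling when the kernel or file system reports ENOSYS or EOPNOTSUPP. This is not glibc code. The names prealloc_fn, prealloc_with_fallback() and the stub_* functions are hypothetical, and the callbacks stand in for the real system call and the real zero-writing loop; like posix_fallocate(), the sketch returns 0 or a positive error number.

```c
#include <errno.h>

/* Hypothetical callback type: returns 0 on success or a positive errno value. */
typedef int (*prealloc_fn)(int fd, long long offset, long long len);

/*
 * Try the native (system call) path first. Only "not implemented" and
 * "not supported" trigger the slow zero-filling path; any other error,
 * such as ENOSPC, is passed straight back to the caller.
 */
static int prealloc_with_fallback(int fd, long long offset, long long len,
                                  prealloc_fn native, prealloc_fn zero_fill)
{
    int err;

    if (offset < 0 || len <= 0)
        return EINVAL;

    err = native(fd, offset, len);
    if (err == ENOSYS || err == EOPNOTSUPP)
        return zero_fill(fd, offset, len);
    return err;
}

/* Stubs used only to exercise the dispatch logic. */
static int stub_native_unsupported(int fd, long long offset, long long len)
{
    (void)fd; (void)offset; (void)len;
    return ENOSYS;
}

static int stub_native_no_space(int fd, long long offset, long long len)
{
    (void)fd; (void)offset; (void)len;
    return ENOSPC;
}

static int stub_zero_fill_ok(int fd, long long offset, long long len)
{
    (void)fd; (void)offset; (void)len;
    return 0;
}
```

The point of keeping the fallback narrow is that a genuine failure of the native path (for example, the disk really is full) should not be hidden by retrying with the slower method.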
[PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc
This patch implements sys_fallocate() and adds support on i386, x86_64 and powerpc platforms. Changelog: - Note: The changes below are from the initial post (dated 26th April, 2007) and _not_ from TAKE2. The only difference from TAKE2 is the kernel version on which this patch is based. TAKE2 was based on 2.6.21 and this is based on 2.6.22-rc1. Following changes were made to the previous version: 1) Added description before sys_fallocate() definition. 2) Return EINVAL for len<=0 (With new draft that Ulrich pointed to, posix_fallocate should return EINVAL for len <= 0. 3) Return EOPNOTSUPP if mode is not one of FA_ALLOCATE or FA_DEALLOCATE 4) Do not return ENODEV for dirs (let individual file systems decide if they want to support preallocation to directories or not. 5) Check for wrap through zero. 6) Update c/mtime if fallocate() succeeds. 7) Added mode descriptions in fs.h 8) Added variable names to function definition (fallocate inode op) Here is the new patch: Signed-off-by: Amit Arora <[EMAIL PROTECTED]> --- arch/i386/kernel/syscall_table.S |1 arch/powerpc/kernel/sys_ppc32.c |7 +++ arch/x86_64/ia32/ia32entry.S |1 fs/open.c| 89 +++ include/asm-i386/unistd.h|3 - include/asm-powerpc/systbl.h |1 include/asm-powerpc/unistd.h |3 - include/asm-x86_64/unistd.h |2 include/linux/fs.h | 13 + include/linux/syscalls.h |1 10 files changed, 119 insertions(+), 2 deletions(-) Index: linux-2.6.22-rc1/arch/i386/kernel/syscall_table.S === --- linux-2.6.22-rc1.orig/arch/i386/kernel/syscall_table.S +++ linux-2.6.22-rc1/arch/i386/kernel/syscall_table.S @@ -323,3 +323,4 @@ ENTRY(sys_call_table) .long sys_signalfd .long sys_timerfd .long sys_eventfd + .long sys_fallocate Index: linux-2.6.22-rc1/arch/powerpc/kernel/sys_ppc32.c === --- linux-2.6.22-rc1.orig/arch/powerpc/kernel/sys_ppc32.c +++ linux-2.6.22-rc1/arch/powerpc/kernel/sys_ppc32.c @@ -773,6 +773,13 @@ asmlinkage int compat_sys_truncate64(con return sys_truncate(path, (high << 32) | low); } +asmlinkage long 
compat_sys_fallocate(int fd, int mode, u32 offhi, u32 offlo, +u32 lenhi, u32 lenlo) +{ + return sys_fallocate(fd, mode, ((loff_t)offhi << 32) | offlo, +((loff_t)lenhi << 32) | lenlo); +} + asmlinkage int compat_sys_ftruncate64(unsigned int fd, u32 reg4, unsigned long high, unsigned long low) { Index: linux-2.6.22-rc1/fs/open.c === --- linux-2.6.22-rc1.orig/fs/open.c +++ linux-2.6.22-rc1/fs/open.c @@ -353,6 +353,95 @@ asmlinkage long sys_ftruncate64(unsigned #endif /* + * sys_fallocate - preallocate blocks or free preallocated blocks + * @fd: the file descriptor + * @mode: mode specifies if fallocate should preallocate blocks OR free + * (unallocate) preallocated blocks. Currently only FA_ALLOCATE and + * FA_DEALLOCATE modes are supported. + * @offset: The offset within file, from where (un)allocation is being + * requested. It should not have a negative value. + * @len: The amount (in bytes) of space to be (un)allocated, from the offset. + * + * This system call, depending on the mode, preallocates or unallocates blocks + * for a file. The range of blocks depends on the value of offset and len + * arguments provided by the user/application. For FA_ALLOCATE mode, if this + * system call succeeds, subsequent writes to the file in the given range + * (specified by offset & len) should not fail - even if the file system + * later becomes full. Hence the preallocation done is persistent (valid + * even after reopen of the file and remount/reboot). + * + * Note: Incase the file system does not support preallocation, + * posix_fallocate() should fall back to the library implementation (i.e. + * allocating zero-filled new blocks to the file). + * + * Return Values + * 0 : On SUCCESS a value of zero is returned. + * error : On Failure, an error code will be returned. + * An error code of -ENOSYS or -EOPNOTSUPP should make posix_fallocate() + * fall back on library implementation of fallocate. 
+ * + * Generic fallocate to be added for file systems that do not + * support fallocate it. + */ +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) +{ + struct file *file; + struct inode *inode; + long ret = -EINVAL; + + if (offset < 0 || len <= 0) + goto out; + + /* Return error if mode is not supported */ + ret = -EOPNOTSUPP; + if (mode != FA_ALLOCATE && mode !=FA_DEALLOCATE) + goto out; + + ret = -EBADF; + file =
[PATCH 2/5][TAKE3] fallocate() on s390
This is the patch suggested by Martin Schwidefsky to support sys_fallocate() on s390(x) platform. He also suggested a wrapper in glibc to handle this system call on s390. Posting it here so that we get feedback for this too. .globl __fallocate ENTRY(__fallocate) stm %r6,%r7,28(%r15)/* save %r6/%r7 on stack */ cfi_offset (%r7, -68) cfi_offset (%r6, -72) lm %r6,%r7,96(%r15)/* load loff_t len from stack */ svc SYS_ify(fallocate) lm %r6,%r7,28(%r15)/* restore %r6/%r7 from stack */ br %r14 PSEUDO_END(__fallocate) Here are the comments and the patch to linux kernel from him. - From: Martin Schwidefsky <[EMAIL PROTECTED]> This patch implements support of fallocate system call on s390(x) platform. A wrapper is added to address the issue which s390 ABI has with the arguments of this system call. Signed-off-by: Martin Schwidefsky <[EMAIL PROTECTED]> --- arch/s390/kernel/compat_wrapper.S | 10 ++ arch/s390/kernel/sys_s390.c | 29 + arch/s390/kernel/syscalls.S |1 + include/asm-s390/unistd.h |3 ++- 4 files changed, 42 insertions(+), 1 deletion(-) Index: linux-2.6.22-rc1/arch/s390/kernel/compat_wrapper.S === --- linux-2.6.22-rc1.orig/arch/s390/kernel/compat_wrapper.S +++ linux-2.6.22-rc1/arch/s390/kernel/compat_wrapper.S @@ -1682,3 +1682,13 @@ compat_sys_utimes_wrapper: llgtr %r2,%r2 # char * llgtr %r3,%r3 # struct compat_timeval * jg compat_sys_utimes + + .globl sys_fallocate_wrapper +sys_fallocate_wrapper: + lgfr%r2,%r2 # int + lgfr%r3,%r3 # int + sllg%r4,%r4,32 # get high word of 64bit loff_t + lr %r4,%r5 # get low word of 64bit loff_t + sllg%r5,%r6,32 # get high word of 64bit loff_t + l %r5,164(%r15) # get low word of 64bit loff_t + jg sys_fallocate Index: linux-2.6.22-rc1/arch/s390/kernel/sys_s390.c === --- linux-2.6.22-rc1.orig/arch/s390/kernel/sys_s390.c +++ linux-2.6.22-rc1/arch/s390/kernel/sys_s390.c @@ -265,3 +265,32 @@ s390_fadvise64_64(struct fadvise64_64_ar return -EFAULT; return sys_fadvise64_64(a.fd, a.offset, a.len, a.advice); } + +#ifndef CONFIG_64BIT +/* + * This 
is a wrapper to call sys_fallocate(). For 31 bit s390 the last + * 64 bit argument "len" is split into the upper and lower 32 bits. The + * system call wrapper in the user space loads the value to %r6/%r7. + * The code in entry.S keeps the values in %r2 - %r6 where they are and + * stores %r7 to 96(%r15). But the standard C linkage requires that + * the whole 64 bit value for len is stored on the stack and doesn't + * use %r6 at all. So s390_fallocate has to convert the arguments from + * %r2: fd, %r3: mode, %r4/%r5: offset, %r6/96(%r15)-99(%r15): len + * to + * %r2: fd, %r3: mode, %r4/%r5: offset, 96(%r15)-103(%r15): len + */ +asmlinkage long s390_fallocate(int fd, int mode, loff_t offset, + u32 len_high, u32 len_low) +{ + union { + u64 len; + struct { + u32 high; + u32 low; + }; + } cv; + cv.high = len_high; + cv.low = len_low; + return sys_fallocate(fd, mode, offset, cv.len); +} +#endif Index: linux-2.6.22-rc1/arch/s390/kernel/syscalls.S === --- linux-2.6.22-rc1.orig/arch/s390/kernel/syscalls.S +++ linux-2.6.22-rc1/arch/s390/kernel/syscalls.S @@ -322,3 +322,4 @@ NI_SYSCALL /* 310 sys_move_pages * SYSCALL(sys_getcpu,sys_getcpu,sys_getcpu_wrapper) SYSCALL(sys_epoll_pwait,sys_epoll_pwait,compat_sys_epoll_pwait_wrapper) SYSCALL(sys_utimes,sys_utimes,compat_sys_utimes_wrapper) +SYSCALL(s390_fallocate,sys_fallocate,sys_fallocate_wrapper) Index: linux-2.6.22-rc1/include/asm-s390/unistd.h === --- linux-2.6.22-rc1.orig/include/asm-s390/unistd.h +++ linux-2.6.22-rc1/include/asm-s390/unistd.h @@ -251,8 +251,9 @@ #define __NR_getcpu311 #define __NR_epoll_pwait 312 #define __NR_utimes313 +#define __NR_fallocate 314 -#define NR_syscalls 314 +#define NR_syscalls 315 /* * There are some system calls that are not present on 64 bit, some - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
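A note for readers following the 31-bit wrapper above: the anonymous union in s390_fallocate() places the "high" member first, which yields the right 64-bit value because s390 is big-endian. The same recombination can be sketched in user space with a shift, which does not depend on byte order. This is only an illustration; the helper name join_len is hypothetical and not part of Martin's patch.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch: rebuild a 64-bit length
 * from its high and low 32-bit halves without relying on byte order.
 */
static uint64_t join_len(uint32_t len_high, uint32_t len_low)
{
	return ((uint64_t)len_high << 32) | len_low;
}
```

This mirrors what compat_sys_fallocate() does on powerpc in patch 1 of the series, where the halves are joined with a shift and an OR.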
[PATCH 3/5][TAKE3] ext4: Extent overlap bugfix
This patch adds a check for overlap of extents and cuts short the new extent to be inserted, if there is a chance of overlap. Changelog: - Note: The changes below are from the initial post (dated 26th April, 2007) and _not_ from TAKE2. The only difference from TAKE2 is the kernel version on which this patch is based. TAKE2 was based on 2.6.21 and this is based on 2.6.22-rc1. As suggested by Andrew, a check for wrap though zero has been added. Here is the new patch: Signed-off-by: Amit Arora <[EMAIL PROTECTED]> --- fs/ext4/extents.c | 60 ++-- include/linux/ext4_fs_extents.h |1 2 files changed, 59 insertions(+), 2 deletions(-) Index: linux-2.6.22-rc1/fs/ext4/extents.c === --- linux-2.6.22-rc1.orig/fs/ext4/extents.c +++ linux-2.6.22-rc1/fs/ext4/extents.c @@ -1128,6 +1128,55 @@ ext4_can_extents_be_merged(struct inode } /* + * check if a portion of the "newext" extent overlaps with an + * existing extent. + * + * If there is an overlap discovered, it updates the length of the newext + * such that there will be no overlap, and then returns 1. + * If there is no overlap found, it returns 0. 
+ */ +unsigned int ext4_ext_check_overlap(struct inode *inode, + struct ext4_extent *newext, + struct ext4_ext_path *path) +{ + unsigned long b1, b2; + unsigned int depth, len1; + unsigned int ret = 0; + + b1 = le32_to_cpu(newext->ee_block); + len1 = le16_to_cpu(newext->ee_len); + depth = ext_depth(inode); + if (!path[depth].p_ext) + goto out; + b2 = le32_to_cpu(path[depth].p_ext->ee_block); + + /* +* get the next allocated block if the extent in the path +* is before the requested block(s) +*/ + if (b2 < b1) { + b2 = ext4_ext_next_allocated_block(path); + if (b2 == EXT_MAX_BLOCK) + goto out; + } + + /* check for wrap through zero */ + if (b1 + len1 < b1) { + len1 = EXT_MAX_BLOCK - b1; + newext->ee_len = cpu_to_le16(len1); + ret = 1; + } + + /* check for overlap */ + if (b1 + len1 > b2) { + newext->ee_len = cpu_to_le16(b2 - b1); + ret = 1; + } +out: + return ret; +} + +/* * ext4_ext_insert_extent: * tries to merge requsted extent into the existing extent or * inserts requested extent as new one into the tree, @@ -2031,7 +2080,15 @@ int ext4_ext_get_blocks(handle_t *handle /* allocate new block */ goal = ext4_ext_find_goal(inode, path, iblock); - allocated = max_blocks; + + /* Check if we can really insert (iblock)::(iblock+max_blocks) extent */ + newex.ee_block = cpu_to_le32(iblock); + newex.ee_len = cpu_to_le16(max_blocks); + err = ext4_ext_check_overlap(inode, &newex, path); + if (err) + allocated = le16_to_cpu(newex.ee_len); + else + allocated = max_blocks; newblock = ext4_new_blocks(handle, inode, goal, &allocated, &err); if (!newblock) goto out2; @@ -2039,7 +2096,6 @@ int ext4_ext_get_blocks(handle_t *handle goal, newblock, allocated); /* try to insert new extent into found leaf and return */ - newex.ee_block = cpu_to_le32(iblock); ext4_ext_store_pblock(&newex, newblock); newex.ee_len = cpu_to_le16(allocated); err = ext4_ext_insert_extent(handle, inode, path, &newex); Index: linux-2.6.22-rc1/include/linux/ext4_fs_extents.h === --- 
linux-2.6.22-rc1.orig/include/linux/ext4_fs_extents.h +++ linux-2.6.22-rc1/include/linux/ext4_fs_extents.h @@ -190,6 +190,7 @@ ext4_ext_invalidate_cache(struct inode * extern int ext4_extent_tree_init(handle_t *, struct inode *); extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *); +extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *); extern int ext4_ext_insert_extent(handle_t *, struct inode *, struct ext4_ext_path *, struct ext4_extent *); extern int ext4_ext_walk_space(struct inode *, unsigned long, unsigned long, ext_prepare_callback, void *); extern struct ext4_ext_path * ext4_ext_find_extent(struct inode *, int, struct ext4_ext_path *);
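The trimming rule in ext4_ext_check_overlap() can be modelled outside the kernel as a small pure function. The sketch below is a simplified assumption-laden model, not the kernel code: it uses plain 32-bit integers instead of the on-disk little-endian fields, MAX_BLOCK is a stand-in for EXT_MAX_BLOCK, and the name trim_new_extent is hypothetical. It assumes b2 (the start of the next allocated extent) is not smaller than b1.

```c
#include <stdint.h>

#define MAX_BLOCK 0xffffffffu	/* stand-in for EXT_MAX_BLOCK (assumption) */

/*
 * Hypothetical user-space model of the trimming rule: given a new extent
 * starting at block b1 with length len, and the start b2 (b2 >= b1) of the
 * next allocated extent, return the length that can be inserted safely.
 */
static uint32_t trim_new_extent(uint32_t b1, uint32_t len, uint32_t b2)
{
	/* wrap through zero: clamp so b1 + len stays inside the block space */
	if ((uint32_t)(b1 + len) < b1)
		len = MAX_BLOCK - b1;

	/* overlap: cut the new extent short at the start of the next one */
	if ((uint64_t)b1 + len > b2)
		len = b2 - b1;

	return len;
}
```

For instance, a request for 50 blocks starting at block 10 with the next extent at block 30 is cut down to 20 blocks, which corresponds to the case where the patch returns 1.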
[PATCH 4/5][TAKE3] ext4: fallocate support in ext4
This patch implements ->fallocate() inode operation in ext4. With this patch users of ext4 file systems will be able to use fallocate() system call for persistent preallocation. Current implementation only supports preallocation for regular files (directories not supported as of date) with extent maps. This patch does not support block-mapped files currently. Only FA_ALLOCATE mode is being supported as of now. Supporting FA_DEALLOCATE mode is a item. Changelog: - Note: The changes below are from the initial post (dated 26th April, 2007) and _not_ from TAKE2. The only difference from TAKE2 is the kernel version on which this patch is based and point "8)" below. TAKE2 was based on 2.6.21 and this is based on 2.6.22-rc1. Here are the changes from the previous post: 1) Added more description for ext4_fallocate(). 2) Now returning EOPNOTSUPP when files are block-mapped (non-extent). 3) Moved journal_start & journal_stop inside the while loop. 4) Replaced BUG_ON with WARN_ON & ext4_error. 5) Make EXT4_BLOCK_ALIGN use ALIGN macro internally. 6) Added variable names in the function declaration of ext4_fallocate() 7) Converted macros that handle uninitialized extents into inline functions. 8) Removed unnecessary "EXPORT_SYMBOL(ext4_fallocate);". 
Here is the updated patch: Signed-off-by: Amit Arora <[EMAIL PROTECTED]> --- fs/ext4/extents.c | 240 +--- fs/ext4/file.c |1 include/linux/ext4_fs.h |8 + include/linux/ext4_fs_extents.h | 12 ++ 4 files changed, 220 insertions(+), 41 deletions(-) Index: linux-2.6.22-rc1/fs/ext4/extents.c === --- linux-2.6.22-rc1.orig/fs/ext4/extents.c +++ linux-2.6.22-rc1/fs/ext4/extents.c @@ -282,7 +282,7 @@ static void ext4_ext_show_path(struct in } else if (path->p_ext) { ext_debug(" %d:%d:%llu ", le32_to_cpu(path->p_ext->ee_block), - le16_to_cpu(path->p_ext->ee_len), + ext4_ext_get_actual_len(path->p_ext), ext_pblock(path->p_ext)); } else ext_debug(" []"); @@ -305,7 +305,7 @@ static void ext4_ext_show_leaf(struct in for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) { ext_debug("%d:%d:%llu ", le32_to_cpu(ex->ee_block), - le16_to_cpu(ex->ee_len), ext_pblock(ex)); + ext4_ext_get_actual_len(ex), ext_pblock(ex)); } ext_debug("\n"); } @@ -425,7 +425,7 @@ ext4_ext_binsearch(struct inode *inode, ext_debug(" -> %d:%llu:%d ", le32_to_cpu(path->p_ext->ee_block), ext_pblock(path->p_ext), - le16_to_cpu(path->p_ext->ee_len)); + ext4_ext_get_actual_len(path->p_ext)); #ifdef CHECK_BINSEARCH { @@ -686,7 +686,7 @@ static int ext4_ext_split(handle_t *hand ext_debug("move %d:%llu:%d in new leaf %llu\n", le32_to_cpu(path[depth].p_ext->ee_block), ext_pblock(path[depth].p_ext), - le16_to_cpu(path[depth].p_ext->ee_len), + ext4_ext_get_actual_len(path[depth].p_ext), newblock); /*memmove(ex++, path[depth].p_ext++, sizeof(struct ext4_extent)); @@ -1106,7 +1106,19 @@ static int ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1, struct ext4_extent *ex2) { - if (le32_to_cpu(ex1->ee_block) + le16_to_cpu(ex1->ee_len) != + unsigned short ext1_ee_len, ext2_ee_len; + + /* +* Make sure that either both extents are uninitialized, or +* both are _not_. 
+*/ + if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2)) + return 0; + + ext1_ee_len = ext4_ext_get_actual_len(ex1); + ext2_ee_len = ext4_ext_get_actual_len(ex2); + + if (le32_to_cpu(ex1->ee_block) + ext1_ee_len != le32_to_cpu(ex2->ee_block)) return 0; @@ -1115,14 +1127,14 @@ ext4_can_extents_be_merged(struct inode * as an RO_COMPAT feature, refuse to merge to extents if * this can result in the top bit of ee_len being set. */ - if (le16_to_cpu(ex1->ee_len) + le16_to_cpu(ex2->ee_len) > EXT_MAX_LEN) + if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN) return 0; #ifdef AGGRESSIVE_TEST if (le16_to_cpu(ex1->ee_len) >= 4) return 0; #endif - if (ext_pblock(ex1) + le16_to_cpu(ex1->ee_len) == ext_pblock(ex2)) + if (ext_pblock(ex1) + ext1_ee_len == ext_pblock(ex2)) return 1;
[PATCH 5/5][TAKE3] ext4: write support for preallocated blocks
This patch adds write support to the uninitialized extents that get created when a preallocation is done using fallocate(). It takes care of splitting the extents into multiple (upto three) extents and merging the new split extents with neighbouring ones, if possible. Changelog: - Note: The changes below are from the initial post (dated 26th April, 2007) and _not_ from TAKE2. The only difference from TAKE2 is the kernel version on which this patch is based. TAKE2 was based on 2.6.21 and this is based on 2.6.22-rc1. 1) Replaced BUG_ON with WARN_ON & ext4_error. 2) Added variable names to the function declaration of ext4_ext_try_to_merge(). 3) Updated variable declarations to use multiple-definitions-per-line. 4) "if((a=foo())).." was broken into "a=foo(); if(a).." 5) Removed extra spaces. Here is the updated patch: Signed-off-by: Amit Arora <[EMAIL PROTECTED]> --- fs/ext4/extents.c | 234 +++- include/linux/ext4_fs_extents.h |3 2 files changed, 210 insertions(+), 27 deletions(-) Index: linux-2.6.22-rc1/fs/ext4/extents.c === --- linux-2.6.22-rc1.orig/fs/ext4/extents.c +++ linux-2.6.22-rc1/fs/ext4/extents.c @@ -1140,6 +1140,54 @@ ext4_can_extents_be_merged(struct inode } /* + * This function tries to merge the "ex" extent to the next extent in the tree. + * It always tries to merge towards right. If you want to merge towards + * left, pass "ex - 1" as argument instead of "ex". + * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns + * 1 if they got merged. + */ +int ext4_ext_try_to_merge(struct inode *inode, + struct ext4_ext_path *path, + struct ext4_extent *ex) +{ + struct ext4_extent_header *eh; + unsigned int depth, len; + int merge_done = 0; + int uninitialized = 0; + + depth = ext_depth(inode); + BUG_ON(path[depth].p_hdr == NULL); + eh = path[depth].p_hdr; + + while (ex < EXT_LAST_EXTENT(eh)) + { + if (!ext4_can_extents_be_merged(inode, ex, ex + 1)) + break; + /* merge with next extent! 
*/ + if (ext4_ext_is_uninitialized(ex)) + uninitialized = 1; + ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex) + + ext4_ext_get_actual_len(ex + 1)); + if (uninitialized) + ext4_ext_mark_uninitialized(ex); + + if (ex + 1 < EXT_LAST_EXTENT(eh)) { + len = (EXT_LAST_EXTENT(eh) - ex - 1) + * sizeof(struct ext4_extent); + memmove(ex + 1, ex + 2, len); + } + eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries) - 1); + merge_done = 1; + WARN_ON(eh->eh_entries == 0); + if (!eh->eh_entries) + ext4_error(inode->i_sb, "ext4_ext_try_to_merge", + "inode#%lu, eh->eh_entries = 0!", inode->i_ino); + } + + return merge_done; +} + +/* * check if a portion of the "newext" extent overlaps with an * existing extent. * @@ -1327,25 +1375,7 @@ has_space: merge: /* try to merge extents to the right */ - while (nearex < EXT_LAST_EXTENT(eh)) { - if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1)) - break; - /* merge with next extent! */ - if (ext4_ext_is_uninitialized(nearex)) - uninitialized = 1; - nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex) - + ext4_ext_get_actual_len(nearex + 1)); - if (uninitialized) - ext4_ext_mark_uninitialized(nearex); - - if (nearex + 1 < EXT_LAST_EXTENT(eh)) { - len = (EXT_LAST_EXTENT(eh) - nearex - 1) - * sizeof(struct ext4_extent); - memmove(nearex + 1, nearex + 2, len); - } - eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1); - BUG_ON(eh->eh_entries == 0); - } + ext4_ext_try_to_merge(inode, path, nearex); /* try to merge extents to the left */ @@ -2011,15 +2041,152 @@ void ext4_ext_release(struct super_block #endif } +/* + * This function is called by ext4_ext_get_blocks() if someone tries to write + * to an uninitialized extent. It may result in splitting the uninitialized + * extent into multiple extents (upto three - one initialized and two + * uninitialized). 
+ * There are three possibilities: + * a> There is no split required: Entire extent should be initialized + * b> Splits in two extents: Write is happening at either end of the extent + * c> Splits in three extents: S
Re: [PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc
On Tue, May 15, 2007 at 05:42:46PM -0700, Mingming Cao wrote:
> On Wed, 2007-05-16 at 01:33 +0530, Amit K. Arora wrote:
> > This patch implements sys_fallocate() and adds support on i386, x86_64
> > and powerpc platforms.
>
> > @@ -1137,6 +1148,8 @@ struct inode_operations {
> > 	ssize_t (*listxattr) (struct dentry *, char *, size_t);
> > 	int (*removexattr) (struct dentry *, const char *);
> > 	void (*truncate_range)(struct inode *, loff_t, loff_t);
> > +	long (*fallocate)(struct inode *inode, int mode, loff_t offset,
> > +				loff_t len);
> > };
>
> Does the return value from fallocate inode operation has to be *long*?
> It's not consistent with the ext4_fallocate() define in patch 4/5,

I think ->fallocate() should return a "long", since sys_fallocate() has
to return what ->fallocate() returns and hence their return type should
ideally match.

> +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t
> len)

I will change the ext4_fallocate() to return a "long" (in patch 4/5) in
the next post.

Agree ?

Thanks!
--
Regards,
Amit Arora

>
> thus cause compile warnings.
>
>
>
> Mingming
Re: [PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc
On Wed, May 16, 2007 at 07:21:16AM -0500, Dave Kleikamp wrote:
> On Wed, 2007-05-16 at 13:16 +1000, David Chinner wrote:
> > On Wed, May 16, 2007 at 01:33:59AM +0530, Amit K. Arora wrote:
>
> > > Following changes were made to the previous version:
> > > 1) Added description before sys_fallocate() definition.
> > > 2) Return EINVAL for len<=0 (With new draft that Ulrich pointed to,
> > >    posix_fallocate should return EINVAL for len <= 0.
> > > 3) Return EOPNOTSUPP if mode is not one of FA_ALLOCATE or FA_DEALLOCATE
> > > 4) Do not return ENODEV for dirs (let individual file systems decide if
> > >    they want to support preallocation to directories or not.
> > > 5) Check for wrap through zero.
> > > 6) Update c/mtime if fallocate() succeeds.
> >
> > Please don't make this always happen. c/mtime updates should be dependent
> > on the mode being used and whether there is visible change to the file. If no
> > userspace visible changes to the file occurred, then timestamps should not
> > be changed.
>
> i_blocks will be updated, so it seems reasonable to update ctime. mtime
> shouldn't be changed, though, since the contents of the file will be
> unchanged.

I agree. Thus the ctime should change for FA_PREALLOCATE mode also
(which does not change the file size) - if we end up having this
additional mode in near future.

--
Regards,
Amit Arora

> > e.g. FA_ALLOCATE that changes file size requires same semantics of ftruncate()
> > extending the file, otherwise no change in timestamps should occur.
> >
> > Cheers,
> >
> > Dave.
> --
> David Kleikamp
> IBM Linux Technology Center
Re: [PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc
On Thu, May 17, 2007 at 09:40:36AM +1000, David Chinner wrote:
> On Wed, May 16, 2007 at 07:21:16AM -0500, Dave Kleikamp wrote:
> > On Wed, 2007-05-16 at 13:16 +1000, David Chinner wrote:
> > > On Wed, May 16, 2007 at 01:33:59AM +0530, Amit K. Arora wrote:
> >
> > > > Following changes were made to the previous version:
> > > > 1) Added description before sys_fallocate() definition.
> > > > 2) Return EINVAL for len<=0 (With new draft that Ulrich pointed to,
> > > >    posix_fallocate should return EINVAL for len <= 0.
> > > > 3) Return EOPNOTSUPP if mode is not one of FA_ALLOCATE or FA_DEALLOCATE
> > > > 4) Do not return ENODEV for dirs (let individual file systems decide if
> > > >    they want to support preallocation to directories or not.
> > > > 5) Check for wrap through zero.
> > > > 6) Update c/mtime if fallocate() succeeds.
> > >
> > > Please don't make this always happen. c/mtime updates should be dependent
> > > on the mode being used and whether there is visible change to the file. If no
> > > userspace visible changes to the file occurred, then timestamps should not
> > > be changed.
> >
> > i_blocks will be updated, so it seems reasonable to update ctime. mtime
> > shouldn't be changed, though, since the contents of the file will be
> > unchanged.
>
> That's assuming blocks were actually allocated - if the prealloc range already
> has underlying blocks there is no change and so we should not be changing
> mtime either. Only the filesystem will know if it has changed the file, so I
> think that timestamp updates need to be driven down to that level, not done
> blindly at the highest layer

Ok. Will make this change in the next post.

--
Regards,
Amit Arora

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> Principal Engineer
> SGI Australian Software Group
[PATCH 0/6][TAKE4] fallocate system call
Description:
---
fallocate() is a new system call being proposed here which will allow
applications to preallocate space to any file(s) in a file system.
Each file system implementation that wants to use this feature will need
to support an inode operation called fallocate.

Applications can use this feature to avoid fragmentation to a certain
extent and thus get faster access speed. With preallocation, applications
also get a guarantee of space for particular file(s) - even if the
system later becomes full.

Currently, glibc provides an interface called posix_fallocate() which
can be used for a similar purpose. Although this has the advantage of
working on all file systems, it is quite slow (since it writes zeroes to
each block that has to be preallocated). Without a doubt, file systems
can do this more efficiently within the kernel, by implementing the
proposed fallocate() system call. It is expected that posix_fallocate()
will be modified to call this new system call first and, in case the
kernel/filesystem does not implement it, to fall back to the current
implementation of writing zeroes to the new blocks.

Interface:
-
The proposed system call's layout is:

 asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)

fd: The descriptor of the open file.

mode*: This specifies the behavior of the system call. Currently the
  system call supports two modes - FA_ALLOCATE and FA_DEALLOCATE.
  FA_ALLOCATE: Applications can use this mode to preallocate blocks to
    a given file (specified by fd). This mode changes the file size if
    the preallocation is done beyond the EOF. It also updates the ctime
    in the inode of the corresponding file, marking a successful
    allocation.
  FA_DEALLOCATE: This mode can be used by applications to deallocate the
    previously preallocated blocks. This also may change the file size
    and the ctime/mtime.
* New modes might get added in future. One such new mode which is
  already under discussion is FA_PREALLOCATE, which when used will
  preallocate space but will not change the filesize and [cm]time.
  Since the semantics of this new mode are not clear and agreed upon yet,
  this patchset does not implement it currently.

offset: This is the offset in bytes, from where the preallocation should
  start.

len: This is the number of bytes requested for preallocation (from
  offset).

RETURN VALUE: The system call returns 0 on success and an error on
failure. This is done to keep the semantics the same as those of
posix_fallocate().

sys_fallocate() on s390:
---
There is a problem with the s390 ABI in implementing sys_fallocate() with
the proposed order of arguments. Martin Schwidefsky has suggested a patch
to solve this problem which makes use of a wrapper in the kernel. This
will require special handling of this system call on s390 in glibc as
well. But, this seems to be the best solution so far.

Known Problem:
-
mmapped writes into uninitialized extents are a known problem with the
current ext4 patches. Like XFS, ext4 may need to implement
->page_mkwrite() to solve this. See:
http://lkml.org/lkml/2007/5/8/583

Since there is talk of ->fault() replacing ->page_mkwrite() and also
with a generic block_page_mkwrite() implementation already posted, we
can implement this some time later. See:
http://lkml.org/lkml/2007/3/7/161
http://lkml.org/lkml/2007/3/18/198

ToDos:
-
1> Implementation on other architectures (other than i386, x86_64,
   ppc64 and s390(x)). David Chinner has already posted a patch for ia64.
2> A generic file system operation to handle fallocate
   (generic_fallocate), for filesystems that do _not_ have the fallocate
   inode operation implemented.
3> Changes to glibc,
   a) to support fallocate() system call
   b) to make posix_fallocate() and posix_fallocate64() call fallocate()

Changelog:
-
Changes from Take2 to Take3:
 1) Return type is now described in the interface description above.
 2) Patches rebased to 2.6.22-rc1 kernel.
 ** Each post will have an individual changelog for a particular patch.

The following patches follow:
Patch 1/6 : fallocate() implementation on i86, x86_64 and powerpc
Patch 2/6 : fallocate() on s390
Patch 3/6 : fallocate() on ia64
Patch 4/6 : ext4: Extent overlap bugfix
Patch 5/6 : ext4: fallocate support in ext4
Patch 6/6 : ext4: write support for preallocated blocks

--
Regards,
Amit Arora
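The "try the system call first, fall back otherwise" order described in the cover letter can be sketched from user space. This is a minimal sketch under stated assumptions, not glibc's implementation: it assumes a C library that exposes a fallocate() wrapper through <fcntl.h> under _GNU_SOURCE (as later glibc releases do), it passes mode 0 rather than the FA_* constants proposed in this patchset, and the helper name prealloc_range is hypothetical.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>

/*
 * Hypothetical helper: try the fallocate() system call first and fall
 * back to posix_fallocate() when the kernel or file system lacks support.
 * Returns 0 on success or a positive errno value, like posix_fallocate().
 */
static int prealloc_range(int fd, off_t offset, off_t len)
{
	if (fallocate(fd, 0, offset, len) == 0)
		return 0;

	/* Any error other than "not supported" is reported to the caller. */
	if (errno != ENOSYS && errno != EOPNOTSUPP)
		return errno;

	/* The library routine may emulate preallocation by writing blocks. */
	return posix_fallocate(fd, offset, len);
}
```

Returning a positive errno value keeps the helper's convention the same as posix_fallocate(), so a caller does not need to care which path was taken.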
[PATCH 1/6][TAKE4] fallocate() implementation on i86, x86_64 and powerpc
This patch implements sys_fallocate() and adds support on i386, x86_64 and powerpc platforms. Changelog: - Changes from Take3 to Take4: 1) Do not update c/mtime. Let each filesystem update ctime (update of mtime will not be required for allocation since we touch only metadata/inode and not blocks), if required. Changes from Take2 to Take3: 1) Patches now based on 2.6.22-rc1 kernel. Changes from Take1(initial post on 26th April, 2007) to Take2: 1) Added description before sys_fallocate() definition. 2) Return EINVAL for len<=0 (With new draft that Ulrich pointed to, posix_fallocate should return EINVAL for len <= 0. 3) Return EOPNOTSUPP if mode is not one of FA_ALLOCATE or FA_DEALLOCATE 4) Do not return ENODEV for dirs (let individual file systems decide if they want to support preallocation to directories or not. 5) Check for wrap through zero. 6) Update c/mtime if fallocate() succeeds. 7) Added mode descriptions in fs.h 8) Added variable names to function definition (fallocate inode op) Here is the new patch: Signed-off-by: Amit Arora <[EMAIL PROTECTED]> --- arch/i386/kernel/syscall_table.S |1 arch/powerpc/kernel/sys_ppc32.c |7 +++ arch/x86_64/ia32/ia32entry.S |1 fs/open.c| 86 +++ include/asm-i386/unistd.h|3 - include/asm-powerpc/systbl.h |1 include/asm-powerpc/unistd.h |3 - include/asm-x86_64/unistd.h |2 include/linux/fs.h | 13 + include/linux/syscalls.h |1 10 files changed, 116 insertions(+), 2 deletions(-) Index: linux-2.6.22-rc1/arch/i386/kernel/syscall_table.S === --- linux-2.6.22-rc1.orig/arch/i386/kernel/syscall_table.S +++ linux-2.6.22-rc1/arch/i386/kernel/syscall_table.S @@ -323,3 +323,4 @@ ENTRY(sys_call_table) .long sys_signalfd .long sys_timerfd .long sys_eventfd + .long sys_fallocate Index: linux-2.6.22-rc1/arch/powerpc/kernel/sys_ppc32.c === --- linux-2.6.22-rc1.orig/arch/powerpc/kernel/sys_ppc32.c +++ linux-2.6.22-rc1/arch/powerpc/kernel/sys_ppc32.c @@ -773,6 +773,13 @@ asmlinkage int compat_sys_truncate64(con return sys_truncate(path, (high << 32) 
| low); } +asmlinkage long compat_sys_fallocate(int fd, int mode, u32 offhi, u32 offlo, +u32 lenhi, u32 lenlo) +{ + return sys_fallocate(fd, mode, ((loff_t)offhi << 32) | offlo, +((loff_t)lenhi << 32) | lenlo); +} + asmlinkage int compat_sys_ftruncate64(unsigned int fd, u32 reg4, unsigned long high, unsigned long low) { Index: linux-2.6.22-rc1/fs/open.c === --- linux-2.6.22-rc1.orig/fs/open.c +++ linux-2.6.22-rc1/fs/open.c @@ -353,6 +353,92 @@ asmlinkage long sys_ftruncate64(unsigned #endif /* + * sys_fallocate - preallocate blocks or free preallocated blocks + * @fd: the file descriptor + * @mode: mode specifies if fallocate should preallocate blocks OR free + * (unallocate) preallocated blocks. Currently only FA_ALLOCATE and + * FA_DEALLOCATE modes are supported. + * @offset: The offset within file, from where (un)allocation is being + * requested. It should not have a negative value. + * @len: The amount (in bytes) of space to be (un)allocated, from the offset. + * + * This system call, depending on the mode, preallocates or unallocates blocks + * for a file. The range of blocks depends on the value of offset and len + * arguments provided by the user/application. For FA_ALLOCATE mode, if this + * system call succeeds, subsequent writes to the file in the given range + * (specified by offset & len) should not fail - even if the file system + * later becomes full. Hence the preallocation done is persistent (valid + * even after reopen of the file and remount/reboot). + * + * It is expected that the ->fallocate() inode operation implemented by the + * individual file systems will update the file size and/or ctime/mtime + * depending on the mode and also on the success of the operation. + * + * Note: Incase the file system does not support preallocation, + * posix_fallocate() should fall back to the library implementation (i.e. + * allocating zero-filled new blocks to the file). + * + * Return Values + * 0 : On SUCCESS a value of zero is returned. 
+ * error : On Failure, an error code will be returned. + * An error code of -ENOSYS or -EOPNOTSUPP should make posix_fallocate() + * fall back on library implementation of fallocate. + * + * Generic fallocate to be added for file systems that do not + * support fallocate it. + */ +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) +{ + struct file *file; + struct inode *inode; + long r
[PATCH 2/6][TAKE4] fallocate() on s390
This is the patch suggested by Martin Schwidefsky to support sys_fallocate() on s390(x) platform. He also suggested a wrapper in glibc to handle this system call on s390. Posting it here so that we get feedback for this too. .globl __fallocate ENTRY(__fallocate) stm %r6,%r7,28(%r15)/* save %r6/%r7 on stack */ cfi_offset (%r7, -68) cfi_offset (%r6, -72) lm %r6,%r7,96(%r15)/* load loff_t len from stack */ svc SYS_ify(fallocate) lm %r6,%r7,28(%r15)/* restore %r6/%r7 from stack */ br %r14 PSEUDO_END(__fallocate) Here are the comments and the patch to linux kernel from him. - From: Martin Schwidefsky <[EMAIL PROTECTED]> This patch implements support of fallocate system call on s390(x) platform. A wrapper is added to address the issue which s390 ABI has with the arguments of this system call. Signed-off-by: Martin Schwidefsky <[EMAIL PROTECTED]> --- arch/s390/kernel/compat_wrapper.S | 10 ++ arch/s390/kernel/sys_s390.c | 29 + arch/s390/kernel/syscalls.S |1 + include/asm-s390/unistd.h |3 ++- 4 files changed, 42 insertions(+), 1 deletion(-) Index: linux-2.6.22-rc1/arch/s390/kernel/compat_wrapper.S === --- linux-2.6.22-rc1.orig/arch/s390/kernel/compat_wrapper.S +++ linux-2.6.22-rc1/arch/s390/kernel/compat_wrapper.S @@ -1682,3 +1682,13 @@ compat_sys_utimes_wrapper: llgtr %r2,%r2 # char * llgtr %r3,%r3 # struct compat_timeval * jg compat_sys_utimes + + .globl sys_fallocate_wrapper +sys_fallocate_wrapper: + lgfr%r2,%r2 # int + lgfr%r3,%r3 # int + sllg%r4,%r4,32 # get high word of 64bit loff_t + lr %r4,%r5 # get low word of 64bit loff_t + sllg%r5,%r6,32 # get high word of 64bit loff_t + l %r5,164(%r15) # get low word of 64bit loff_t + jg sys_fallocate Index: linux-2.6.22-rc1/arch/s390/kernel/sys_s390.c === --- linux-2.6.22-rc1.orig/arch/s390/kernel/sys_s390.c +++ linux-2.6.22-rc1/arch/s390/kernel/sys_s390.c @@ -265,3 +265,32 @@ s390_fadvise64_64(struct fadvise64_64_ar return -EFAULT; return sys_fadvise64_64(a.fd, a.offset, a.len, a.advice); } + +#ifndef CONFIG_64BIT +/* + * This 
is a wrapper to call sys_fallocate(). For 31 bit s390 the last + * 64 bit argument "len" is split into the upper and lower 32 bits. The + * system call wrapper in the user space loads the value to %r6/%r7. + * The code in entry.S keeps the values in %r2 - %r6 where they are and + * stores %r7 to 96(%r15). But the standard C linkage requires that + * the whole 64 bit value for len is stored on the stack and doesn't + * use %r6 at all. So s390_fallocate has to convert the arguments from + * %r2: fd, %r3: mode, %r4/%r5: offset, %r6/96(%r15)-99(%r15): len + * to + * %r2: fd, %r3: mode, %r4/%r5: offset, 96(%r15)-103(%r15): len + */ +asmlinkage long s390_fallocate(int fd, int mode, loff_t offset, + u32 len_high, u32 len_low) +{ + union { + u64 len; + struct { + u32 high; + u32 low; + }; + } cv; + cv.high = len_high; + cv.low = len_low; + return sys_fallocate(fd, mode, offset, cv.len); +} +#endif Index: linux-2.6.22-rc1/arch/s390/kernel/syscalls.S === --- linux-2.6.22-rc1.orig/arch/s390/kernel/syscalls.S +++ linux-2.6.22-rc1/arch/s390/kernel/syscalls.S @@ -322,3 +322,4 @@ NI_SYSCALL /* 310 sys_move_pages * SYSCALL(sys_getcpu,sys_getcpu,sys_getcpu_wrapper) SYSCALL(sys_epoll_pwait,sys_epoll_pwait,compat_sys_epoll_pwait_wrapper) SYSCALL(sys_utimes,sys_utimes,compat_sys_utimes_wrapper) +SYSCALL(s390_fallocate,sys_fallocate,sys_fallocate_wrapper) Index: linux-2.6.22-rc1/include/asm-s390/unistd.h === --- linux-2.6.22-rc1.orig/include/asm-s390/unistd.h +++ linux-2.6.22-rc1/include/asm-s390/unistd.h @@ -251,8 +251,9 @@ #define __NR_getcpu311 #define __NR_epoll_pwait 312 #define __NR_utimes313 +#define __NR_fallocate 314 -#define NR_syscalls 314 +#define NR_syscalls 315 /* * There are some system calls that are not present on 64 bit, some
[PATCH 3/6][TAKE4] fallocate() on ia64
Here is the 2.6.22-rc1 version of David's patch: add fallocate() on ia64 From: David Chinner <[EMAIL PROTECTED]> Subject: [PATCH] ia64 fallocate syscall Cc: "Amit K. Arora" <[EMAIL PROTECTED]>, [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED] ia64 fallocate syscall support. Signed-Off-By: Dave Chinner <[EMAIL PROTECTED]> --- arch/ia64/kernel/entry.S |1 + include/asm-ia64/unistd.h |3 ++- 2 files changed, 3 insertions(+), 1 deletion(-) Index: linux-2.6.22-rc1/arch/ia64/kernel/entry.S === --- linux-2.6.22-rc1.orig/arch/ia64/kernel/entry.S 2007-05-12 18:45:56.0 -0700 +++ linux-2.6.22-rc1/arch/ia64/kernel/entry.S 2007-05-15 15:36:48.0 -0700 @@ -1585,5 +1585,6 @@ data8 sys_getcpu data8 sys_epoll_pwait // 1305 data8 sys_utimensat + data8 sys_fallocate .org sys_call_table + 8*NR_syscalls // guard against failures to increase NR_syscalls Index: linux-2.6.22-rc1/include/asm-ia64/unistd.h === --- linux-2.6.22-rc1.orig/include/asm-ia64/unistd.h 2007-05-12 18:45:56.0 -0700 +++ linux-2.6.22-rc1/include/asm-ia64/unistd.h 2007-05-15 15:37:51.0 -0700 @@ -296,6 +296,7 @@ #define __NR_getcpu1304 #define __NR_epoll_pwait 1305 #define __NR_utimensat 1306 +#define __NR_fallocate 1307 #ifdef __KERNEL__ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 4/6][TAKE4] ext4: Extent overlap bugfix
This patch adds a check for overlap of extents and cuts short the new extent to be inserted, if there is a chance of overlap. Changelog: - Changes from Take3 to Take4: - no change - Changes from Take2 to Take3: 1) Patch rebased to 2.6.22-rc1 kernel. Changes from Take1 to Take2: 1) As suggested by Andrew, a check for wrap though zero has been added. Here is the new patch: Signed-off-by: Amit Arora <[EMAIL PROTECTED]> --- fs/ext4/extents.c | 60 ++-- include/linux/ext4_fs_extents.h |1 2 files changed, 59 insertions(+), 2 deletions(-) Index: linux-2.6.22-rc1/fs/ext4/extents.c === --- linux-2.6.22-rc1.orig/fs/ext4/extents.c +++ linux-2.6.22-rc1/fs/ext4/extents.c @@ -1128,6 +1128,55 @@ ext4_can_extents_be_merged(struct inode } /* + * check if a portion of the "newext" extent overlaps with an + * existing extent. + * + * If there is an overlap discovered, it updates the length of the newext + * such that there will be no overlap, and then returns 1. + * If there is no overlap found, it returns 0. 
+ */ +unsigned int ext4_ext_check_overlap(struct inode *inode, + struct ext4_extent *newext, + struct ext4_ext_path *path) +{ + unsigned long b1, b2; + unsigned int depth, len1; + unsigned int ret = 0; + + b1 = le32_to_cpu(newext->ee_block); + len1 = le16_to_cpu(newext->ee_len); + depth = ext_depth(inode); + if (!path[depth].p_ext) + goto out; + b2 = le32_to_cpu(path[depth].p_ext->ee_block); + + /* +* get the next allocated block if the extent in the path +* is before the requested block(s) +*/ + if (b2 < b1) { + b2 = ext4_ext_next_allocated_block(path); + if (b2 == EXT_MAX_BLOCK) + goto out; + } + + /* check for wrap through zero */ + if (b1 + len1 < b1) { + len1 = EXT_MAX_BLOCK - b1; + newext->ee_len = cpu_to_le16(len1); + ret = 1; + } + + /* check for overlap */ + if (b1 + len1 > b2) { + newext->ee_len = cpu_to_le16(b2 - b1); + ret = 1; + } +out: + return ret; +} + +/* * ext4_ext_insert_extent: * tries to merge requsted extent into the existing extent or * inserts requested extent as new one into the tree, @@ -2031,7 +2080,15 @@ int ext4_ext_get_blocks(handle_t *handle /* allocate new block */ goal = ext4_ext_find_goal(inode, path, iblock); - allocated = max_blocks; + + /* Check if we can really insert (iblock)::(iblock+max_blocks) extent */ + newex.ee_block = cpu_to_le32(iblock); + newex.ee_len = cpu_to_le16(max_blocks); + err = ext4_ext_check_overlap(inode, &newex, path); + if (err) + allocated = le16_to_cpu(newex.ee_len); + else + allocated = max_blocks; newblock = ext4_new_blocks(handle, inode, goal, &allocated, &err); if (!newblock) goto out2; @@ -2039,7 +2096,6 @@ int ext4_ext_get_blocks(handle_t *handle goal, newblock, allocated); /* try to insert new extent into found leaf and return */ - newex.ee_block = cpu_to_le32(iblock); ext4_ext_store_pblock(&newex, newblock); newex.ee_len = cpu_to_le16(allocated); err = ext4_ext_insert_extent(handle, inode, path, &newex); Index: linux-2.6.22-rc1/include/linux/ext4_fs_extents.h === --- 
linux-2.6.22-rc1.orig/include/linux/ext4_fs_extents.h +++ linux-2.6.22-rc1/include/linux/ext4_fs_extents.h @@ -190,6 +190,7 @@ ext4_ext_invalidate_cache(struct inode * extern int ext4_extent_tree_init(handle_t *, struct inode *); extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *); +extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *); extern int ext4_ext_insert_extent(handle_t *, struct inode *, struct ext4_ext_path *, struct ext4_extent *); extern int ext4_ext_walk_space(struct inode *, unsigned long, unsigned long, ext_prepare_callback, void *); extern struct ext4_ext_path * ext4_ext_find_extent(struct inode *, int, struct ext4_ext_path *); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
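Editorial sketch: the trimming rule in ext4_ext_check_overlap() can be tried outside the kernel on plain integers. The code below is a simplified illustration, not the patch itself: MAX_BLOCK stands in for EXT_MAX_BLOCK, next_alloc is assumed to be the first already-allocated block at or after start (or MAX_BLOCK if there is none), and the extent-tree lookup is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_BLOCK 0xFFFFFFFFu  /* assumption: stand-in for EXT_MAX_BLOCK */

/*
 * Shorten *len so that [start, start + *len) neither wraps through
 * zero nor runs into the block range that begins at next_alloc.
 * Returns 1 if *len was shortened, 0 if it was left alone.
 */
static int trim_new_extent(uint32_t start, uint32_t *len, uint32_t next_alloc)
{
    int trimmed = 0;

    assert(len != NULL);
    assert(next_alloc >= start);

    /* check for wrap through zero (32-bit modular arithmetic) */
    if ((uint32_t)(start + *len) < start) {
        *len = MAX_BLOCK - start;
        trimmed = 1;
    }

    /* check for overlap with the next allocated block */
    if (start + *len > next_alloc) {
        *len = next_alloc - start;
        trimmed = 1;
    }
    return trimmed;
}
```

A request for 50 blocks at block 100, with block 120 already allocated, is cut to 20 blocks. A request that ends exactly at block 120 is left untouched.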
[PATCH 5/6][TAKE4] ext4: fallocate support in ext4
This patch implements the ->fallocate() inode operation in ext4. With this patch, users of ext4 file systems will be able to use the fallocate() system call for persistent preallocation. The current implementation only supports preallocation for regular files with extent maps (directories are not supported as of now). Block-mapped files are not supported currently. Only the FA_ALLOCATE mode is supported as of now; supporting the FA_DEALLOCATE mode is a to-do item.

Changelog:
Changes from Take3 to Take4:
 1) Changed the ext4_fallocate() declaration and definition to return a "long" and not an "int", to match the ->fallocate() inode op.
 2) Update ctime if new blocks get allocated.
Changes from Take2 to Take3:
 1) Patch rebased to the 2.6.22-rc1 kernel version.
 2) Removed unnecessary "EXPORT_SYMBOL(ext4_fallocate);".
Changes from Take1 to Take2:
 1) Added more description for ext4_fallocate().
 2) Now returning EOPNOTSUPP when files are block-mapped (non-extent).
 3) Moved journal_start & journal_stop inside the while loop.
 4) Replaced BUG_ON with WARN_ON & ext4_error.
 5) Made EXT4_BLOCK_ALIGN use the ALIGN macro internally.
 6) Added variable names in the function declaration of ext4_fallocate().
 7) Converted macros that handle uninitialized extents into inline functions.
Here is the updated patch: Signed-off-by: Amit Arora <[EMAIL PROTECTED]> --- fs/ext4/extents.c | 249 +--- fs/ext4/file.c |1 include/linux/ext4_fs.h |8 + include/linux/ext4_fs_extents.h | 12 + 4 files changed, 229 insertions(+), 41 deletions(-) Index: linux-2.6.22-rc1/fs/ext4/extents.c === --- linux-2.6.22-rc1.orig/fs/ext4/extents.c +++ linux-2.6.22-rc1/fs/ext4/extents.c @@ -282,7 +282,7 @@ static void ext4_ext_show_path(struct in } else if (path->p_ext) { ext_debug(" %d:%d:%llu ", le32_to_cpu(path->p_ext->ee_block), - le16_to_cpu(path->p_ext->ee_len), + ext4_ext_get_actual_len(path->p_ext), ext_pblock(path->p_ext)); } else ext_debug(" []"); @@ -305,7 +305,7 @@ static void ext4_ext_show_leaf(struct in for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) { ext_debug("%d:%d:%llu ", le32_to_cpu(ex->ee_block), - le16_to_cpu(ex->ee_len), ext_pblock(ex)); + ext4_ext_get_actual_len(ex), ext_pblock(ex)); } ext_debug("\n"); } @@ -425,7 +425,7 @@ ext4_ext_binsearch(struct inode *inode, ext_debug(" -> %d:%llu:%d ", le32_to_cpu(path->p_ext->ee_block), ext_pblock(path->p_ext), - le16_to_cpu(path->p_ext->ee_len)); + ext4_ext_get_actual_len(path->p_ext)); #ifdef CHECK_BINSEARCH { @@ -686,7 +686,7 @@ static int ext4_ext_split(handle_t *hand ext_debug("move %d:%llu:%d in new leaf %llu\n", le32_to_cpu(path[depth].p_ext->ee_block), ext_pblock(path[depth].p_ext), - le16_to_cpu(path[depth].p_ext->ee_len), + ext4_ext_get_actual_len(path[depth].p_ext), newblock); /*memmove(ex++, path[depth].p_ext++, sizeof(struct ext4_extent)); @@ -1106,7 +1106,19 @@ static int ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1, struct ext4_extent *ex2) { - if (le32_to_cpu(ex1->ee_block) + le16_to_cpu(ex1->ee_len) != + unsigned short ext1_ee_len, ext2_ee_len; + + /* +* Make sure that either both extents are uninitialized, or +* both are _not_. 
+*/ + if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2)) + return 0; + + ext1_ee_len = ext4_ext_get_actual_len(ex1); + ext2_ee_len = ext4_ext_get_actual_len(ex2); + + if (le32_to_cpu(ex1->ee_block) + ext1_ee_len != le32_to_cpu(ex2->ee_block)) return 0; @@ -1115,14 +1127,14 @@ ext4_can_extents_be_merged(struct inode * as an RO_COMPAT feature, refuse to merge to extents if * this can result in the top bit of ee_len being set. */ - if (le16_to_cpu(ex1->ee_len) + le16_to_cpu(ex2->ee_len) > EXT_MAX_LEN) + if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN) return 0; #ifdef AGGRESSIVE_TEST if (le16_to_cpu(ex1->ee_len) >= 4) return 0; #endif - if (ext_pblock(ex1) + le16_to_cpu(ex1->ee_len) == ext_pblock(ex2)) + if (ext_pblock(ex1) + ext1_ee_len == ext_pblock(ex2)) retur
[PATCH 6/6][TAKE4] ext4: write support for preallocated blocks
This patch adds write support to the uninitialized extents that get created when a preallocation is done using fallocate(). It takes care of splitting the extents into multiple (upto three) extents and merging the new split extents with neighbouring ones, if possible. Changelog: - Changes from Take3 to Take4: - no change - Changes from Take2 to Take3: 1) Patch now rebased to 2.6.22-rc1 kernel. Changes from Take1 to Take2: 1) Replaced BUG_ON with WARN_ON & ext4_error. 2) Added variable names to the function declaration of ext4_ext_try_to_merge(). 3) Updated variable declarations to use multiple-definitions-per-line. 4) "if((a=foo())).." was broken into "a=foo(); if(a).." 5) Removed extra spaces. Here is the updated patch: Signed-off-by: Amit Arora <[EMAIL PROTECTED]> --- fs/ext4/extents.c | 234 +++- include/linux/ext4_fs_extents.h |3 2 files changed, 210 insertions(+), 27 deletions(-) Index: linux-2.6.22-rc1/fs/ext4/extents.c === --- linux-2.6.22-rc1.orig/fs/ext4/extents.c +++ linux-2.6.22-rc1/fs/ext4/extents.c @@ -1140,6 +1140,54 @@ ext4_can_extents_be_merged(struct inode } /* + * This function tries to merge the "ex" extent to the next extent in the tree. + * It always tries to merge towards right. If you want to merge towards + * left, pass "ex - 1" as argument instead of "ex". + * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns + * 1 if they got merged. + */ +int ext4_ext_try_to_merge(struct inode *inode, + struct ext4_ext_path *path, + struct ext4_extent *ex) +{ + struct ext4_extent_header *eh; + unsigned int depth, len; + int merge_done = 0; + int uninitialized = 0; + + depth = ext_depth(inode); + BUG_ON(path[depth].p_hdr == NULL); + eh = path[depth].p_hdr; + + while (ex < EXT_LAST_EXTENT(eh)) + { + if (!ext4_can_extents_be_merged(inode, ex, ex + 1)) + break; + /* merge with next extent! 
*/ + if (ext4_ext_is_uninitialized(ex)) + uninitialized = 1; + ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex) + + ext4_ext_get_actual_len(ex + 1)); + if (uninitialized) + ext4_ext_mark_uninitialized(ex); + + if (ex + 1 < EXT_LAST_EXTENT(eh)) { + len = (EXT_LAST_EXTENT(eh) - ex - 1) + * sizeof(struct ext4_extent); + memmove(ex + 1, ex + 2, len); + } + eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries) - 1); + merge_done = 1; + WARN_ON(eh->eh_entries == 0); + if (!eh->eh_entries) + ext4_error(inode->i_sb, "ext4_ext_try_to_merge", + "inode#%lu, eh->eh_entries = 0!", inode->i_ino); + } + + return merge_done; +} + +/* * check if a portion of the "newext" extent overlaps with an * existing extent. * @@ -1327,25 +1375,7 @@ has_space: merge: /* try to merge extents to the right */ - while (nearex < EXT_LAST_EXTENT(eh)) { - if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1)) - break; - /* merge with next extent! */ - if (ext4_ext_is_uninitialized(nearex)) - uninitialized = 1; - nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex) - + ext4_ext_get_actual_len(nearex + 1)); - if (uninitialized) - ext4_ext_mark_uninitialized(nearex); - - if (nearex + 1 < EXT_LAST_EXTENT(eh)) { - len = (EXT_LAST_EXTENT(eh) - nearex - 1) - * sizeof(struct ext4_extent); - memmove(nearex + 1, nearex + 2, len); - } - eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1); - BUG_ON(eh->eh_entries == 0); - } + ext4_ext_try_to_merge(inode, path, nearex); /* try to merge extents to the left */ @@ -2011,15 +2041,152 @@ void ext4_ext_release(struct super_block #endif } +/* + * This function is called by ext4_ext_get_blocks() if someone tries to write + * to an uninitialized extent. It may result in splitting the uninitialized + * extent into multiple extents (upto three - one initialized and two + * uninitialized). 
+ * There are three possibilities: + * a> There is no split required: Entire extent should be initialized + * b> Splits in two extents: Write is happening at either end of the extent + * c> Splits in three extents: Somone is writing in middle of the extent + */ +int ext4_ext_convert_to_initialized(handle_t *h
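Editorial sketch: the three cases listed in the comment above (no split, split in two, split in three) can be illustrated on a plain range structure. This is not the ext4 code, which is cut off in this archive; struct blk_range and split_for_write() are hypothetical names, and the write is assumed to lie entirely inside one uninitialized extent.

```c
#include <assert.h>
#include <stdint.h>

struct blk_range {
    uint32_t start;   /* first logical block */
    uint32_t len;     /* number of blocks, must be > 0 */
    int initialized;  /* 0 = preallocated (unwritten), 1 = holds data */
};

/*
 * Split the uninitialized extent "ex" around a write covering
 * [w_start, w_start + w_len). The pieces are stored in out[] in
 * ascending block order and their count (1, 2 or 3) is returned.
 * Only the written piece is marked initialized.
 */
static int split_for_write(struct blk_range ex, uint32_t w_start,
                           uint32_t w_len, struct blk_range out[3])
{
    uint32_t ex_end = ex.start + ex.len;
    uint32_t w_end = w_start + w_len;
    int n = 0;

    assert(!ex.initialized && w_len > 0);
    assert(w_start >= ex.start && w_end <= ex_end);

    /* uninitialized remainder to the left of the write, if any */
    if (w_start > ex.start)
        out[n++] = (struct blk_range){ ex.start, w_start - ex.start, 0 };

    /* the written blocks become an initialized extent */
    out[n++] = (struct blk_range){ w_start, w_len, 1 };

    /* uninitialized remainder to the right of the write, if any */
    if (w_end < ex_end)
        out[n++] = (struct blk_range){ w_end, ex_end - w_end, 0 };

    return n;
}
```

Writing the whole extent yields one piece, writing at either end yields two, and writing in the middle yields three.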
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Tue, Jun 26, 2007 at 11:34:13AM -0400, Andreas Dilger wrote:
> On Jun 26, 2007 16:02 +0530, Amit K. Arora wrote:
> > On Mon, Jun 25, 2007 at 03:46:26PM -0600, Andreas Dilger wrote:
> > > Can you clarify - what is the current behaviour when ENOSPC (or some other
> > > error) is hit? Does it keep the current fallocate() or does it free it?
> >
> > Currently it is left on the file system implementation. In ext4, we do
> > not undo preallocation if some error (say, ENOSPC) is hit. Hence it may
> > end up with partial (pre)allocation. This is inline with dd and
> > posix_fallocate, which also do not free the partially allocated space.
>
> Since I believe the XFS allocation ioctls do it the opposite way (free
> preallocated space on error) this should be encoded into the flags.
> Having it "filesystem dependent" just means that nobody will be happy.

Ok, got your point. Maybe we can have a flag for this, as you suggested.
But the default behavior IMHO should be _not_ to undo a partial allocation.
That way the file system has the option of supporting this flag or not,
and the default stays in line with posix_fallocate. XFS will obviously
want to support this flag, in line with its existing behavior.

> > > For FA_ZERO_SPACE - I'd think this would (IMHO) be the default - we
> > > don't want to expose uninitialized disk blocks to userspace. I'm not
> > > sure if this makes sense at all.
> >
> > I don't think we need to make it default - atleast for filesystems which
> > have a mechanism to distinguish preallocated blocks from "regular" ones.
>
> What I mean is that any data read from the file should have the "appearance"
> of being zeroed (whether zeroes are actually written to disk or not). What
> I _think_ David is proposing is to allow fallocate() to return without
> marking the blocks even "uninitialized" and subsequent reads would return
> the old data from the disk.

I can't think of a good reason for this (i.e. returning stale data from
preallocated blocks). In fact, it looks like a security issue to me. That
said, it may be beneficial for file systems which have noticeable overhead
in marking the blocks "uninitialized/preallocated".

Can you or David please throw some light on how this option might really
be helpful?

Thanks!
--
Regards,
Amit Arora
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Tue, Jun 26, 2007 at 11:42:50AM -0400, Andreas Dilger wrote:
> On Jun 26, 2007 16:15 +0530, Amit K. Arora wrote:
> > On Mon, Jun 25, 2007 at 03:52:39PM -0600, Andreas Dilger wrote:
> > > In XFS one of the (many) ALLOC modes is to zero existing data on allocate.
> > > For ext4 all this would mean is calling ext4_ext_mark_uninitialized() on
> > > each extent. For some workloads this would be much faster than truncate
> > > and reallocate of all the blocks in a file.
> >
> > In ext4, we already mark each extent having preallocated blocks as
> > uninitialized. This is done as part of following code (which is part of
> > patch 5/7) in ext4_ext_get_blocks() :
>
> What I meant is that with XFS_IOC_ALLOCSP the previously-written data
> is ZEROED OUT, unlike with fallocate() which leaves previously-written
> data alone and only allocates in holes.
>
> In order to specify this for allocation, FA_FL_DEL_DATA would need to make
> sense for allocations (as well as the deallocation). This is fairly easy
> to do - just mark all of the existing extents as unallocated, and their
> data disappears.

Ok, agreed. Will add the FA_ZERO_SPACE mode too.

Thanks!
--
Regards,
Amit Arora
Re: [PATCH 7/7][TAKE5] ext4: support new modes
On Tue, Jun 26, 2007 at 12:14:00PM -0400, Andreas Dilger wrote:
> On Jun 26, 2007 17:37 +0530, Amit K. Arora wrote:
> > Hmm.. I am thinking of a scenario when the file system supports some
> > individual flags, but does not support a particular combination of them.
> > Just for example sake, assume we have FA_ZERO_SPACE mode also. Now, if a
> > file system supports FA_ZERO_SPACE, FA_ALLOCATE, FA_DEALLOCATE and
> > FA_RESV_SPACE; and no other mode (i.e. FA_UNRESV_SPACE is not supported
> > for some reason). This means that although we support FA_FL_DEALLOC,
> > FA_FL_KEEP_SIZE and FA_FL_DEL_DATA flags, but we do not support the
> > combination of all these flags (which is nothing but FA_UNRESV_SPACE).
>
> That is up to the filesystem to determine then. I just thought it should
> be clear to return an error for flags (or as you say combinations thereof)
> that the filesystem doesn't understand.
>
> That said, I'd think in most cases the flags are orthogonal, so if you
> support some combination of the flags (e.g. FA_FL_DEL_DATA, FA_FL_DEALLOC)
> then you will also support other combinations of those flags just from
> the way it is coded.

Ok.

> > > I also thought another proposed flag was to determine whether mtime (and
> > > maybe ctime) is changed when doing prealloc/dealloc space? Default should
> > > probably be to change mtime/ctime, and have FA_FL_NO_MTIME. Someone else
> > > should decide if we want to allow changing the file w/o changing ctime, if
> > > that is required even though the file is not visibly changing. Maybe the
> > > ctime update should be implicit if the size or mtime are changing?
> >
> > Is it really required ? I mean, why should we allow users not to update
> > ctime/mtime even if the file metadata/data gets updated ? It sounds
> > a bit "unnatural" to me.
> > Is there any application scenario in your mind, when you suggest of
> > giving this flexibility to userspace ?
>
> One reason is that XFS does NOT update the mtime/ctime when doing the
> XFS_IOC_* allocation ioctls.

Hmm.. I would personally call that a bug in the XFS code then. :)

> > I think, modifying ctime/mtime should be dependent on the other flags.
> > E.g., if we do not zero out data blocks on allocation/deallocation,
> > update only ctime. Otherwise, update ctime and mtime both.
>
> I'm only being the advocate for requirements David Chinner has put
> forward due to existing behaviour in XFS. This is one of the reasons
> why I think the "flags" mechanism we now have - we can encode the
> various different behaviours in any way we want and leave it to the
> caller.

I understand. Maybe we can confirm once more with David Chinner whether
this is really required. Will it really be a compatibility issue if new
XFS preallocations (i.e. via fallocate) update mtime/ctime? Will old
applications really be affected? If yes, then it might be worth
implementing, even though I personally don't like it.

David, can you please confirm?

Thanks!
--
Regards,
Amit Arora
Re: [PATCH 0/6][TAKE5] fallocate system call
On Thu, Jun 28, 2007 at 02:55:43AM -0700, Andrew Morton wrote:
> On Mon, 25 Jun 2007 18:58:10 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote:
>
> > N O T E:
> > ---
> > 1) Only Patches 4/7 and 7/7 are NEW. Rest of them are _already_ part
> >    of ext4 patch queue git tree hosted by Ted.
>
> Why the heck are replacements for these things being sent out again when
> they're already in -mm and they're already in Ted's queue (from which I
> need to diligently drop them each time I remerge)?
>
> Are we all supposed to re-review the entire patchset (or at least #4 and
> #7) again?

As I mentioned in the note above, only patches #4 and #7 were new, and so
only these needed to be reviewed. The other patches are _not_ replacements
for any of the patches which are already part of -mm and/or in Ted's patch
queue. They were posted again just as "placeholders" so that the two new
patches (#4 & #7) could be reviewed in context. Sorry for any confusion.

> Please drop the non-ext4 patches from the ext4 tree and send incremental
> patches against the (non-ext4) fallocate patches in -mm.

Please let us know what you think of Mingming's suggestion of posting all
the fallocate patches, including the ext4 ones, as incremental patches
against -mm.

--
Regards,
Amit Arora
Re: [PATCH 7/7][TAKE5] ext4: support new modes
On Wed, Jun 27, 2007 at 10:04:56AM +1000, David Chinner wrote:
> On Wed, Jun 27, 2007 at 12:59:08AM +0530, Amit K. Arora wrote:
> > On Tue, Jun 26, 2007 at 12:14:00PM -0400, Andreas Dilger wrote:
> > > On Jun 26, 2007 17:37 +0530, Amit K. Arora wrote:
> > > > I think, modifying ctime/mtime should be dependent on the other flags.
> > > > E.g., if we do not zero out data blocks on allocation/deallocation,
> > > > update only ctime. Otherwise, update ctime and mtime both.
> > >
> > > I'm only being the advocate for requirements David Chinner has put
> > > forward due to existing behaviour in XFS. This is one of the reasons
> > > why I think the "flags" mechanism we now have - we can encode the
> > > various different behaviours in any way we want and leave it to the
> > > caller.
> >
> > I understand. May be we can confirm once more with David Chinner if this
> > is really required. Will it really be a compatibility issue if new XFS
> > preallocations (ie. via fallocate) update mtime/ctime?
>
> It should be left up to the filesystem to decide. Only the
> filesystem knows whether something changed and the timestamp should
> or should not be updated.

Andreas suggested the FA_FL_NO_MTIME flag thinking it was a requirement
from XFS, whereas XFS does not need this flag. So I don't think we need
to add it.

Please let me know if someone still feels the FA_FL_NO_MTIME flag can be
useful.

--
Regards,
Amit Arora
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Wed, Jun 27, 2007 at 09:18:04AM +1000, David Chinner wrote:
> On Tue, Jun 26, 2007 at 11:34:13AM -0400, Andreas Dilger wrote:
> > On Jun 26, 2007 16:02 +0530, Amit K. Arora wrote:
> > > On Mon, Jun 25, 2007 at 03:46:26PM -0600, Andreas Dilger wrote:
> > > > Can you clarify - what is the current behaviour when ENOSPC (or some other
> > > > error) is hit? Does it keep the current fallocate() or does it free it?
> > >
> > > Currently it is left on the file system implementation. In ext4, we do
> > > not undo preallocation if some error (say, ENOSPC) is hit. Hence it may
> > > end up with partial (pre)allocation. This is inline with dd and
> > > posix_fallocate, which also do not free the partially allocated space.
> >
> > Since I believe the XFS allocation ioctls do it the opposite way (free
> > preallocated space on error) this should be encoded into the flags.
> > Having it "filesystem dependent" just means that nobody will be happy.
>
> No, XFS does not free preallocated space on error. it is up to the
> application to clean up.

Since XFS also does not free preallocated space on error, and this
behavior is in line with dd, posix_fallocate() and the current ext4
implementation, do we still need the FA_FL_FREE_ENOSPC flag?

> > What I mean is that any data read from the file should have the "appearance"
> > of being zeroed (whether zeroes are actually written to disk or not). What
> > I _think_ David is proposing is to allow fallocate() to return without
> > marking the blocks even "uninitialized" and subsequent reads would return
> > the old data from the disk.
>
> Correct, but for swap files that's not an issue - no user should be able
> to read them, and FA_MKSWAP would really need root privileges to execute.

Will the FA_MKSWAP mode still be required once your suggested change of
teaching do_mpage_readpage() about unwritten extents is in place? Or
would you still like to have the FA_MKSWAP mode?

--
Regards,
Amit Arora
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Mon, Jul 02, 2007 at 08:55:43AM +1000, David Chinner wrote:
> On Sat, Jun 30, 2007 at 11:21:11AM +0100, Christoph Hellwig wrote:
> > On Tue, Jun 26, 2007 at 04:02:47PM +0530, Amit K. Arora wrote:
> > > > Can you clarify - what is the current behaviour when ENOSPC (or some other
> > > > error) is hit? Does it keep the current fallocate() or does it free it?
> > >
> > > Currently it is left on the file system implementation. In ext4, we do
> > > not undo preallocation if some error (say, ENOSPC) is hit. Hence it may
> > > end up with partial (pre)allocation. This is inline with dd and
> > > posix_fallocate, which also do not free the partially allocated space.
> >
> > I can't find anything in the specification of posix_fallocate
> > (http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html)
> > that tells what should happen to allocate blocks on error.
>
> Yeah, and AFAICT glibc leaves them behind ATM.

Yes, it does.

> > But common sense would be to not leak disk space on failure of this
> > syscall, and this definitively should not be left up to the filesystem,
> > either we always leak it or always free it, and I'd strongly favour
> > the latter variant.

I would not call it a "leak", since the blocks which got allocated as
part of the partial success of the fallocate syscall can be strictly
accounted for (i.e. they are assigned to a particular inode). And they
can be freed by the application, using a suitable @mode of fallocate.

> We can't simply walk the range an remove unwritten extents, as some
> of them may have been present before the fallocate() call. That
> makes it extremely difficult to undo a failed call and not remove
> more pre-existing pre-allocations.

The same is true for ext4. It is very difficult to keep track of which
uninitialized (unwritten) extents got allocated as part of the current
syscall. This is because, as David mentions, some of them might already
be present, and also because some of the older ones may have been merged
with the *new* uninitialized/unwritten extents during the current syscall.

> Given the current behaviour for posix_fallocate() in glibc, I think
> that retaining the same error semantic and punting the cleanup to
> userspace (where the app will fail with ENOSPC anyway) is the only
> sane thing we can do here. Trying to undo this in the kernel leads
> to lots of extra rarely used code in error handling paths...

Right. This gives applications a free hand: they can either use the
partially preallocated space or free it, without introducing additional
complexity in the kernel.

--
Regards,
Amit Arora
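Editorial sketch: leaving cleanup to userspace could look roughly like the helper below. It is built on posix_fallocate() and ftruncate(), both standard POSIX calls. The helper name prealloc_or_rollback() and its rollback policy are assumptions for illustration only. Truncating back to the old size only releases space past the old end of file, and it would also drop any earlier preallocation beyond that point, which is the ambiguity discussed above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to preallocate [offset, offset + len) and,
 * if that fails (for example with ENOSPC), truncate the file back to
 * the size it had before the call. Returns 0 on success or an errno
 * value on failure, following the posix_fallocate() convention.
 */
static int prealloc_or_rollback(int fd, off_t offset, off_t len)
{
    struct stat before;
    int err;

    assert(len > 0);

    if (fstat(fd, &before) != 0)
        return errno;

    err = posix_fallocate(fd, offset, len);
    if (err != 0) {
        /* best-effort cleanup; the original error is what we report */
        if (ftruncate(fd, before.st_size) != 0) {
            /* nothing more can be done here */
        }
    }
    return err;
}
```

An application that would rather keep a partial allocation can simply call posix_fallocate() directly and skip the truncate step.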
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Sat, Jun 30, 2007 at 12:52:46PM -0400, Andreas Dilger wrote:
> The @mode flags that are currently under consideration are (AFAIK):
>
> FA_FL_DEALLOC   0x01 /* deallocate unwritten extent (default allocate) */
> FA_FL_KEEP_SIZE 0x02 /* keep size for EOF {pre,de}alloc (default change size) */
> FA_FL_DEL_DATA  0x04 /* delete existing data in alloc range (default keep) */

We now have two sets of flags:
1) the above three, with which I think no one has any issues, and
2) the ones below, which need some discussion before we finalize them.

I would prefer fallocate to go into mainline with the above three flags;
the rest can be debated and discussed in parallel, and each new mode/flag
can be pushed as a separate patch. That way the fallocate feature will
not be held up indefinitely.

Please confirm if you find this approach ok. Otherwise, please object.
Thanks!

> FA_FL_ERR_FREE  0x08 /* free preallocation on error (default keep prealloc) */
> FA_FL_NO_MTIME  0x10 /* keep same mtime (default change on size, data change) */
> FA_FL_NO_CTIME  0x20 /* keep same ctime (default change on size, data change) */

--
Regards,
Amit Arora
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Tue, Jul 03, 2007 at 11:31:07AM +0100, Christoph Hellwig wrote:
> On Tue, Jul 03, 2007 at 03:38:48PM +0530, Amit K. Arora wrote:
> > > FA_FL_DEALLOC   0x01 /* deallocate unwritten extent (default allocate) */
> > > FA_FL_KEEP_SIZE 0x02 /* keep size for EOF {pre,de}alloc (default change size) */
> > > FA_FL_DEL_DATA  0x04 /* delete existing data in alloc range (default keep) */
> >
> > We now have two sets of flags -
> > 1) the above three with which I think no one has any issues with, and
>
> Yes, I do.  FA_FL_DEL_DATA is plain stupid, a preallocation call should
> never delete data.  FA_FL_DEALLOC should probably be a separate syscall
> because it's very different functionality.

Well, if you see the modes proposed using the above flags:

#define FA_ALLOCATE	0
#define FA_DEALLOCATE	FA_FL_DEALLOC
#define FA_RESV_SPACE	FA_FL_KEEP_SIZE
#define FA_UNRESV_SPACE	(FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA)

FA_FL_DEL_DATA is _not_ being used for preallocation. We have two modes
for preallocation, FA_ALLOCATE and FA_RESV_SPACE, which do not use this
flag. Hence preallocation will never delete data.
This flag is required only for FA_UNRESV_SPACE, which is a deallocation
mode, to support any existing XFS-aware applications/usage scenarios.

And, regarding FA_FL_DEALLOC being a separate syscall - I think then the
very purpose of the @mode argument is not justified. We have this mode so
that we can provide more features like this. That said, I don't say that
we should make things very complicated; but, at least we should provide
some basic features which we expect most of the applications wanting
preallocation to use. To start with, we need to cater to the already
existing applications/user base who use the XFS preallocation feature.
And further advanced features, like goal based preallocation, can be
implemented as a separate syscall.

> While we're at it I also dislike the FA_ prefix because it doesn't say
> anything and is far too generic.  FALLOC_ is much better.

Ok. This can be changed in the next take.

> > > FA_FL_ERR_FREE  0x08 /* free preallocation on error (default keep prealloc) */
>
> NACK on this one.  We should have just one behaviour, and from the thread
> that is not freeing the allocation on error.

I agree on this one.

> > > FA_FL_NO_MTIME  0x10 /* keep same mtime (default change on size, data change) */
> > > FA_FL_NO_CTIME  0x20 /* keep same ctime (default change on size, data change) */
>
> NACK to these as well.  If i_size changes c/mtime need updates, if the size
> doesn't change they don't.  No need to add more flags for this.

This requirement was from the point of view of HSM applications. Hope
you saw Andreas' previous post and are keeping that in mind.

--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
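A minimal sketch of how the proposed modes decompose into flags, to make the "preallocation never deletes data" argument concrete. The flag and mode values are the ones proposed in this thread (none of them were final at this point); mode_is_prealloc(), mode_may_delete_data() and fa_modes_self_check() are hypothetical helpers introduced only for illustration.

```c
#include <assert.h>

/* Flag values as proposed in this thread (not a final kernel ABI). */
#define FA_FL_DEALLOC   0x01 /* deallocate unwritten extent (default allocate) */
#define FA_FL_KEEP_SIZE 0x02 /* keep size for EOF {pre,de}alloc */
#define FA_FL_DEL_DATA  0x04 /* delete existing data in alloc range */

#define FA_ALLOCATE     0
#define FA_DEALLOCATE   FA_FL_DEALLOC
#define FA_RESV_SPACE   FA_FL_KEEP_SIZE
#define FA_UNRESV_SPACE (FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA)

/* Hypothetical helper: nonzero if the mode allocates rather than frees. */
int mode_is_prealloc(int mode)
{
	return (mode & FA_FL_DEALLOC) == 0;
}

/* Hypothetical helper: nonzero if the mode may discard existing data. */
int mode_may_delete_data(int mode)
{
	return (mode & FA_FL_DEL_DATA) != 0;
}

/* The claim made above: neither preallocation mode carries FA_FL_DEL_DATA. */
void fa_modes_self_check(void)
{
	assert(mode_is_prealloc(FA_ALLOCATE) && !mode_may_delete_data(FA_ALLOCATE));
	assert(mode_is_prealloc(FA_RESV_SPACE) && !mode_may_delete_data(FA_RESV_SPACE));
}
```

With these definitions only FA_UNRESV_SPACE sets FA_FL_DEL_DATA, and it is a deallocation mode.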
Re: Interface for the new fallocate() system call
On Fri, Mar 30, 2007 at 02:14:17AM -0500, Jakub Jelinek wrote:
> Wouldn't
> int fallocate(loff_t offset, loff_t len, int fd, int mode)
> work on both s390 and ppc/arm?  glibc will certainly wrap it and
> reorder the arguments as needed, so there is no need to keep fd first.

I think more people are comfortable with this approach. Since glibc
will wrap the system call and export the "conventional" interface
(with fd first) to applications, we need not worry about keeping fd first
in kernel code. I am personally fine with this approach.

Still, if people have major concerns, we can think of getting rid of the
"mode" argument itself. Anyhow we may, in future, need to have a policy
based system call (say, for providing the goal block by applications for
performance reasons). "mode" can then be made part of it.

Comments ?

--
Regards,
Amit Arora
Re: Interface for the new fallocate() system call
On Wed, Apr 18, 2007 at 07:06:00AM -0600, Andreas Dilger wrote:
> On Apr 17, 2007 18:25 +0530, Amit K. Arora wrote:
> > On Fri, Mar 30, 2007 at 02:14:17AM -0500, Jakub Jelinek wrote:
> > > Wouldn't
> > > int fallocate(loff_t offset, loff_t len, int fd, int mode)
> > > work on both s390 and ppc/arm?  glibc will certainly wrap it and
> > > reorder the arguments as needed, so there is no need to keep fd first.
> >
> > I think more people are comfortable with this approach.
>
> Really?  I thought from the last postings that "fd first, wrap on s390"
> was better.
>
> > Since glibc
> > will wrap the system call and export the "conventional" interface
> > (with fd first) to applications, we may not worry about keeping fd first
> > in kernel code. I am personally fine with this approach.
>
> It would seem to make more sense to wrap the syscall on those architectures
> that can't handle the "conventional" interface (fd first).

Ok. In this case we may have to consider the following things:

1) Obviously, for this glibc will have to call the fallocate() syscall with
different arguments on s390 than on other archs. I think this should be
doable and should not be an issue with glibc folks (right?).

2) We also need to see how strace behaves in this case. With the little
knowledge that I have of strace, I don't think it should depend on the
argument ordering of a system call on different archs (since it uses
ptrace internally and that should take care of it). But, it will be nice
if someone can confirm this.

Thanks!

--
Regards,
Amit Arora
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
Andrew,

Thanks for the review comments!

On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote:
> On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote:
>
> > This patch implements the fallocate() system call and adds support for
> > i386, x86_64 and powerpc.
> >
> > ...
> >
> > +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
>
> Please add a comment over this function which specifies its behaviour.
> Really it should be enough material from which a full manpage can be
> written.
>
> If that's all too much, this material should at least be spelled out in the
> changelog.  Because there's no way in which this change can be fully
> reviewed unless someone (ie: you) tells us what it is setting out to
> achieve.
>
> If we 100% implement some standard then a URL for what we claim to
> implement would suffice.  Given that we're at least using different types from
> posix I doubt if such a thing would be sufficient.
>
> And given the complexity and potential variability within the filesystem
> implementations of this, I'd expect that _something_ additional needs to be
> said?

Ok. I will add a detailed comment here.

> > +{
> > +	struct file *file;
> > +	struct inode *inode;
> > +	long ret = -EINVAL;
> > +
> > +	if (len == 0 || offset < 0)
> > +		goto out;
>
> The posix spec implies that negative `len' is permitted - presumably "allocate
> ahead of `offset'".  How peculiar.

I think we should go ahead with the current glibc implementation (which
Jakub pointed at) of not allowing a negative 'len', since posix also
doesn't explicitly say anything about allowing negative 'len'.

> > +	ret = -EBADF;
> > +	file = fget(fd);
> > +	if (!file)
> > +		goto out;
> > +	if (!(file->f_mode & FMODE_WRITE))
> > +		goto out_fput;
> > +
> > +	inode = file->f_path.dentry->d_inode;
> > +
> > +	ret = -ESPIPE;
> > +	if (S_ISFIFO(inode->i_mode))
> > +		goto out_fput;
> > +
> > +	ret = -ENODEV;
> > +	if (!S_ISREG(inode->i_mode))
> > +		goto out_fput;
>
> So we return ENODEV against an S_ISBLK fd, as per the posix spec.  That
> seems a bit silly of them.

True.

> > +	ret = -EFBIG;
> > +	if (offset + len > inode->i_sb->s_maxbytes)
> > +		goto out_fput;
>
> This code does handle offset+len going negative, but only by accident, I
> suspect.  It happens that s_maxbytes has unsigned type.  Perhaps a comment
> here would settle the reader's mind.

Ok. I will add a check here for wrap through zero.

> > +	if (inode->i_op && inode->i_op->fallocate)
> > +		ret = inode->i_op->fallocate(inode, mode, offset, len);
> > +	else
> > +		ret = -ENOSYS;
>
> If we _are_ going to support negative `len', as posix suggests, I think we
> should perform the appropriate sanity conversions to `offset' and `len'
> right here, rather than expecting each filesystem to do it.
>
> If we're not going to handle negative `len' then we should check for it.

Will add a check for negative 'len' and return -EINVAL. This will be
done where we currently check for negative offset (i.e. at the start of
the function).

> > +out_fput:
> > +	fput(file);
> > +out:
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL(sys_fallocate);
>
> I don't believe this needs to be exported to modules?

Ok. Will remove it.

> > +/*
> > + * fallocate() modes
> > + */
> > +#define FA_ALLOCATE	0x1
> > +#define FA_DEALLOCATE	0x2
>
> Now those aren't in posix.  They should be documented, along with their
> expected semantics.

Will add a comment describing the role of these modes.

> > #ifdef __KERNEL__
> >
> > #include
> > @@ -1125,6 +1131,7 @@ struct inode_operations {
> > 	ssize_t (*listxattr) (struct dentry *, char *, size_t);
> > 	int (*removexattr) (struct dentry *, const char *);
> > 	void (*truncate_range)(struct inode *, loff_t, loff_t);
> > +	long (*fallocate)(struct inode *, int, loff_t, loff_t);
>
> I really do think it's better to put the variable names in definitions such
> as this.  Especially when we have two identically-typed variables next to
> each other like that.  Quick: which one is the offset and which is the
> length?

Ok. Will add the variable names here.

--
Regards,
Amit Arora
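The argument checks agreed on in this review (reject a negative offset, reject a zero or negative length, and catch offset + len wrapping) can be sketched as follows. This is an illustrative userspace model under my own assumptions, not the code that was later merged: check_falloc_range() is a hypothetical name, and maxbytes stands in for the filesystem's s_maxbytes.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Illustrative sketch, not the kernel code: validate an (offset, len)
 * pair against a maximum file size without ever computing an
 * offset + len sum that could overflow.
 */
int check_falloc_range(int64_t offset, int64_t len, int64_t maxbytes)
{
	assert(maxbytes >= 0);

	/* Reject negative offsets and zero or negative lengths up front. */
	if (offset < 0 || len <= 0)
		return -EINVAL;

	/* "offset + len > maxbytes", rewritten so that nothing can wrap. */
	if (offset > maxbytes || len > maxbytes - offset)
		return -EFBIG;

	return 0;
}
```

Comparing len against maxbytes - offset keeps every intermediate value in range, so the result no longer depends on the signedness of the limit.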
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Thu, May 03, 2007 at 11:28:15PM -0700, Andrew Morton wrote:
> The above opengroup page only permits S_ISREG.  Preallocating directories
> sounds quite useful to me, although it's something which would be pretty
> hard to emulate if the FS doesn't support it.  And there's a decent case to
> be made for emulating it - run-anywhere reasons.  Does glibc emulation support
> directories?  Quite unlikely.
>
> But yes, sounds like a desirable thing.  Would XFS support it easily if the
> above check was relaxed?

I think we may relax the check here and let the individual file system
decide whether it supports preallocation for directories or not. What do
you think ?

One thing to be thought about in this case is the error code which
should be returned by the file system implementation, in case it doesn't
support preallocation for directories. Should it be -ENODEV (to match
what posix says), or something else (which might make more sense in this
case) ?

--
Regards,
Amit Arora
Re: [PATCH 3/5] ext4: Extent overlap bugfix
On Thu, May 03, 2007 at 09:30:02PM -0700, Andrew Morton wrote:
> On Thu, 26 Apr 2007 23:41:01 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote:
>
> > +unsigned int ext4_ext_check_overlap(struct inode *inode,
> > +				    struct ext4_extent *newext,
> > +				    struct ext4_ext_path *path)
> > +{
> > +	unsigned long b1, b2;
> > +	unsigned int depth, len1;
> > +
> > +	b1 = le32_to_cpu(newext->ee_block);
> > +	len1 = le16_to_cpu(newext->ee_len);
> > +	depth = ext_depth(inode);
> > +	if (!path[depth].p_ext)
> > +		goto out;
> > +	b2 = le32_to_cpu(path[depth].p_ext->ee_block);
> > +
> > +	/* get the next allocated block if the extent in the path
> > +	 * is before the requested block(s) */
> > +	if (b2 < b1) {
> > +		b2 = ext4_ext_next_allocated_block(path);
> > +		if (b2 == EXT_MAX_BLOCK)
> > +			goto out;
> > +	}
> > +
> > +	if (b1 + len1 > b2) {
>
> Are we sure that b1+len cannot wrap through zero here?

No. Will add a check here for this. Thanks!

> > +		newext->ee_len = cpu_to_le16(b2 - b1);
> > +		return 1;
> > +	}

--
Regards,
Amit Arora
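One possible shape for the promised wrap check, as a standalone sketch rather than the eventual ext4 change. clamp_extent_len() and MAX_LBLK are names invented for this example; MAX_LBLK stands in for the largest logical block number, and the function assumes the caller has already advanced b2 to a block at or after b1.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption of this sketch: logical block numbers are 32 bits wide. */
#define MAX_LBLK UINT32_MAX

/*
 * Given a new extent [b1, b1 + len1) and the start b2 of the next
 * allocated extent, return the length the new extent may keep.
 */
uint32_t clamp_extent_len(uint32_t b1, uint32_t len1, uint32_t b2)
{
	assert(b2 >= b1);

	/* b1 + len1 would wrap through zero: clamp to the end of the range. */
	if (len1 > MAX_LBLK - b1)
		len1 = MAX_LBLK - b1;

	/* Overlap with the next extent: stop right before b2. */
	if (len1 > b2 - b1)
		len1 = b2 - b1;

	return len1;
}
```

Both comparisons are written against a difference instead of the sum b1 + len1, which is what makes them safe when the sum would not fit.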
Re: [PATCH 4/5] ext4: fallocate support in ext4
On Thu, May 03, 2007 at 09:31:33PM -0700, Andrew Morton wrote:
> On Thu, 26 Apr 2007 23:43:32 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote:
>
> > This patch has the ext4 implementation of fallocate system call.
> >
> > ...
> >
> > +	/* ext4_can_extents_be_merged should have checked that either
> > +	 * both extents are uninitialized, or both aren't. Thus we
> > +	 * need to check only one of them here.
> > +	 */
>
> Please always format multiline comments like this:
>
> 	/*
> 	 * ext4_can_extents_be_merged should have checked that either
> 	 * both extents are uninitialized, or both aren't.  Thus we
> 	 * need to check only one of them here.
> 	 */

Ok.

> > ...
> >
> > +/*
> > + * ext4_fallocate:
> > + * preallocate space for a file
> > + * mode is for future use, e.g. for unallocating preallocated blocks etc.
> > + */
>
> This description is rather thin.  What is the filesystem's actual behaviour
> here?  If the file is using extents then the implementation will do
> .  If the file is using bitmaps then we will do .
>
> But what?  Here is where it should be described.

Ok. Will expand the description.

> > +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)
> > +{
> > +	handle_t *handle;
> > +	ext4_fsblk_t block, max_blocks;
> > +	int ret, ret2, nblocks = 0, retries = 0;
> > +	struct buffer_head map_bh;
> > +	unsigned int credits, blkbits = inode->i_blkbits;
> > +
> > +	/* Currently supporting (pre)allocate mode _only_ */
> > +	if (mode != FA_ALLOCATE)
> > +		return -EOPNOTSUPP;
> > +
> > +	if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
> > +		return -ENOTTY;
>
> So we don't implement fallocate on bitmap-based files!  Well that's huge
> news.  The changelog would be an appropriate place to communicate this,
> along with reasons why, or a description of the plan to fix it.

Ok. Will add this in the function description as well.

> Also, posix says nothing about fallocate() returning ENOTTY.

Right. I can't find any suitable error in the posix description.
Can you please suggest an error code which might make more sense here ?
Will -ENOTSUPP be ok ? Since we want to say here that we don't support
non-extent files.

> > +	block = offset >> blkbits;
> > +	max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
> > +			- block;
> > +	mutex_lock(&EXT4_I(inode)->truncate_mutex);
> > +	credits = ext4_ext_calc_credits_for_insert(inode, NULL);
> > +	mutex_unlock(&EXT4_I(inode)->truncate_mutex);
>
> Now I'm mystified.  Given that we're allocating an arbitrary amount of disk
> space, and that this disk space will require an arbitrary amount of
> metadata, how can we work out how much journal space we'll be needing
> without at least looking at `len'?

You are right to say that the credits can not be fixed here. But 'len'
will not directly tell us how many extents might need to be inserted and
how many block groups (if any - think about the "segment range" already
being allocated case) the allocation request might touch.
One solution I have thought of is to check the buffer credits after a
call to ext4_ext_get_blocks (in the while loop) and do a journal_extend
if the credits are falling short. In case journal_extend fails, we call
journal_restart. This will automatically take care of how much journal
space we might need for any value of "len".

> > +	handle=ext4_journal_start(inode, credits +
>
> Please always put spaces around "="

Ok.

> > +			EXT4_DATA_TRANS_BLOCKS(inode->i_sb)+1);
>
> And around "+"

Ok.

> > +	if (IS_ERR(handle))
> > +		return PTR_ERR(handle);
> > +retry:
> > +	ret = 0;
> > +	while (ret >= 0 && ret < max_blocks) {
> > +		block = block + ret;
> > +		max_blocks = max_blocks - ret;
> > +		ret = ext4_ext_get_blocks(handle, inode, block,
> > +					max_blocks, &map_bh,
> > +					EXT4_CREATE_UNINITIALIZED_EXT, 0);
> > +		BUG_ON(!ret);
>
> BUG_ON is vicious.  Is it really justified here?  Possibly a WARN_ON and
> ext4_error() would be safer and more useful here.

Ok. Will do that.

> > +		if (ret > 0 && test_bit(BH_New, &map_bh
Re: [PATCH 5/5] ext4: write support for preallocated blocks/extents
On Thu, May 03, 2007 at 09:32:38PM -0700, Andrew Morton wrote:
> On Thu, 26 Apr 2007 23:46:23 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote:
> > + */
> > +int ext4_ext_try_to_merge(struct inode *inode,
> > +				struct ext4_ext_path *path,
> > +				struct ext4_extent *ex)
> > +{
> > +	struct ext4_extent_header *eh;
> > +	unsigned int depth, len;
> > +	int merge_done=0, uninitialized = 0;
>
> space around "=", please.
>
> Many people prefer not to do the multiple-definitions-per-line, btw:
>
> 	int merge_done = 0;
> 	int uninitialized = 0;

Ok. Will make the change.

> reasons:
>
> - It gives you some space for a nice comment
>
> - It makes patches much more readable, and it makes rejects easier to fix
>
> - standardisation.
>
> > +	depth = ext_depth(inode);
> > +	BUG_ON(path[depth].p_hdr == NULL);
> > +	eh = path[depth].p_hdr;
> > +
> > +	while (ex < EXT_LAST_EXTENT(eh)) {
> > +		if (!ext4_can_extents_be_merged(inode, ex, ex + 1))
> > +			break;
> > +		/* merge with next extent! */
> > +		if (ext4_ext_is_uninitialized(ex))
> > +			uninitialized = 1;
> > +		ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
> > +				+ ext4_ext_get_actual_len(ex + 1));
> > +		if (uninitialized)
> > +			ext4_ext_mark_uninitialized(ex);
> > +
> > +		if (ex + 1 < EXT_LAST_EXTENT(eh)) {
> > +			len = (EXT_LAST_EXTENT(eh) - ex - 1)
> > +				* sizeof(struct ext4_extent);
> > +			memmove(ex + 1, ex + 2, len);
> > +		}
> > +		eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1);
>
> Kernel convention is to put spaces around "-"

Will fix this.

> > +		merge_done = 1;
> > +		BUG_ON(eh->eh_entries == 0);
>
> eek, scary BUG_ON.  Do we really need to be that severe?  Would it be
> better to warn and run ext4_error() here?

Ok.

> > +	}
> > +
> > +	return merge_done;
> > +}
> > +
> > +
> >
> > ...
> >
> > +/*
> > + * ext4_ext_convert_to_initialized:
> > + * this function is called by ext4_ext_get_blocks() if someone tries to write
> > + * to an uninitialized extent. It may result in splitting the uninitialized
> > + * extent into multiple extents (upto three). Atleast one initialized extent
> > + * and atmost two uninitialized extents can result.
>
> There are some typos here
>
> > + * There are three possibilities:
> > + *   a> No split required: Entire extent should be initialized.
> > + *   b> Split into two extents: Only one end of the extent is being written to.
> > + *   c> Split into three extents: Somone is writing in middle of the extent.
>
> and here

Ok. Will fix them.

> > + */
> > +int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode,
> > +					struct ext4_ext_path *path,
> > +					ext4_fsblk_t iblock,
> > +					unsigned long max_blocks)
> > +{
> > +	struct ext4_extent *ex, *ex1 = NULL, *ex2 = NULL, *ex3 = NULL, newex;
> > +	struct ext4_extent_header *eh;
> > +	unsigned int allocated, ee_block, ee_len, depth;
> > +	ext4_fsblk_t newblock;
> > +	int err = 0, ret = 0;
> > +
> > +	depth = ext_depth(inode);
> > +	eh = path[depth].p_hdr;
> > +	ex = path[depth].p_ext;
> > +	ee_block = le32_to_cpu(ex->ee_block);
> > +	ee_len = ext4_ext_get_actual_len(ex);
> > +	allocated = ee_len - (iblock - ee_block);
> > +	newblock = iblock - ee_block + ext_pblock(ex);
> > +	ex2 = ex;
> > +
> > +	/* ex1: ee_block to iblock - 1 : uninitialized */
> > +	if (iblock > ee_block) {
> > +		ex1 = ex;
> > +		ex1->ee_len = cpu_to_le16(iblock - ee_block);
> > +		ext4_ext_mark_uninitialized(ex1);
> > +		ex2 = &newex;
> > +	}
> > +	/* for sanity, update the length of the ex2 extent before
> > +	 * we insert ex3, if ex1 is NULL. This is to avoid temporary
> > +	 * overlap of blocks.
> > +	 */
> > +	if (!ex1 && allocated > max_blocks)
> > +		ex2->ee_len = cpu_to_le16(max_blocks);
> > +	/* ex3: to ee_block + ee
Re: [PATCH 5/5] ext4: write support for preallocated blocks/extents
On Mon, May 07, 2007 at 03:40:26PM +0300, Pekka Enberg wrote:
> On 4/26/07, Amit K. Arora <[EMAIL PROTECTED]> wrote:
> > /*
> >+ * ext4_ext_try_to_merge:
> >+ * tries to merge the "ex" extent to the next extent in the tree.
> >+ * It always tries to merge towards right. If you want to merge towards
> >+ * left, pass "ex - 1" as argument instead of "ex".
> >+ * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns
> >+ * 1 if they got merged.
> >+ */
> >+int ext4_ext_try_to_merge(struct inode *inode,
> >+				struct ext4_ext_path *path,
> >+				struct ext4_extent *ex)
> >+{
>
> Please either use proper kerneldoc format or drop
> "ext4_ext_try_to_merge" from the comment.

Ok, Thanks.

--
Regards,
Amit Arora
Re: [PATCH 4/5] ext4: fallocate support in ext4
On Mon, May 07, 2007 at 10:24:37AM -0500, Dave Kleikamp wrote:
> On Mon, 2007-05-07 at 17:37 +0530, Amit K. Arora wrote:
> > On Thu, May 03, 2007 at 09:31:33PM -0700, Andrew Morton wrote:
> > > On Thu, 26 Apr 2007 23:43:32 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote:
>
> > > > +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)
> > > > +{
> > > > +	handle_t *handle;
> > > > +	ext4_fsblk_t block, max_blocks;
> > > > +	int ret, ret2, nblocks = 0, retries = 0;
> > > > +	struct buffer_head map_bh;
> > > > +	unsigned int credits, blkbits = inode->i_blkbits;
> > > > +
> > > > +	/* Currently supporting (pre)allocate mode _only_ */
> > > > +	if (mode != FA_ALLOCATE)
> > > > +		return -EOPNOTSUPP;
> > > > +
> > > > +	if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
> > > > +		return -ENOTTY;
> > >
> > > So we don't implement fallocate on bitmap-based files!  Well that's huge
> > > news.  The changelog would be an appropriate place to communicate this,
> > > along with reasons why, or a description of the plan to fix it.
> >
> > Ok. Will add this in the function description as well.
> >
> > > Also, posix says nothing about fallocate() returning ENOTTY.
> >
> > Right. I don't seem to find any suitable error from posix description.
> > Can you please suggest an error code which might make more sense here ?
> > Will -ENOTSUPP be ok ? Since we want to say here that we don't support
> > non-extent files.
>
> Isn't the idea that libc will interpret -ENOTTY, or whatever is returned
> here, and fall back to the current library code to do preallocation?
> This way, the caller of fallocate() will never see this return code, so
> it won't violate posix.

You are right. But we still need to "standardize" (and limit) the
error codes which we return from the kernel when we want to fall
back on the library implementation. The posix_fallocate() library
function will have to look for a set of errors from the fallocate()
system call, upon receiving which it will do preallocation at user
level; otherwise, it will return to the user the success/error code
returned by the system call.

I think we can make it fall back to the library implementation
whenever posix_fallocate() receives any of the following errors from
the fallocate() system call:

1. ENOSYS
2. EOPNOTSUPP
3. ENOTTY (?)

Now the question is - should we limit the set of errors for this purpose
to just 1 & 2 above ? In that case I will need to change the error being
returned here to -EOPNOTSUPP (from the current -ENOTTY).

--
Regards,
Amit Arora
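The fallback decision discussed above can be sketched as a small predicate. This is not glibc code: should_fall_back() is a hypothetical helper, and it assumes the narrower option raised in the message, where only ENOSYS and EOPNOTSUPP trigger the userspace emulation and ENOTTY does not.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical helper: decide whether a wrapper such as
 * posix_fallocate() should emulate preallocation in userspace after the
 * fallocate() system call failed with errno value `err`.
 */
int should_fall_back(int err)
{
	assert(err >= 0);

	switch (err) {
	case ENOSYS:		/* kernel has no fallocate() at all */
	case EOPNOTSUPP:	/* this filesystem or file cannot do it */
		return 1;
	default:
		return 0;	/* success or a real error: report it as-is */
	}
}
```

Keeping the set this small means an error such as ENOSPC is always passed through to the caller instead of being retried by a slower emulation that would fail the same way.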
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Wed, May 09, 2007 at 09:37:22PM +1000, Paul Mackerras wrote:
> Suparna Bhattacharya writes:
>
> > > Of course the interface used by an application program would have the
> > > fd first.  Glibc can do the translation.
> >
> > I think that was understood.
>
> OK, then what does it matter what the glibc/kernel interface is, as
> long as it works?
>
> It's only a minor point; the order of arguments can vary between
> architectures if necessary, but it's nicer if they don't have to.
> 32-bit powerpc will need to have the two int arguments adjacent in
> order to avoid using more than 6 argument registers at the user/kernel
> boundary, and s390 will need to avoid having a 64-bit argument last
> (if I understand it correctly).

You are right to say that. But, it may not be _that_ minor a point,
especially for the arch which is getting affected. It has other
implications like what Heiko noticed in his post below:

http://lkml.org/lkml/2007/4/27/377

- implications like modifying glibc and *trace utilities for a particular
arch.

--
Regards,
Amit Arora
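The register pressure argument can be illustrated with a toy model. The sketch assumes (my assumption, not an ABI reference) that each 64-bit argument must start on an even-numbered 32-bit argument register, so a lone int placed before it wastes a slot. arg_regs_used() is a hypothetical helper; sizes are given in bytes.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model: count the 32-bit argument registers consumed by a call
 * whose arguments have the given sizes (4 or 8 bytes), assuming 64-bit
 * arguments occupy an aligned register pair.
 */
int arg_regs_used(const int *sizes, size_t n)
{
	int reg = 0;

	for (size_t i = 0; i < n; i++) {
		assert(sizes[i] == 4 || sizes[i] == 8);
		if (sizes[i] == 8) {
			reg += reg & 1;	/* skip one register to align the pair */
			reg += 2;
		} else {
			reg += 1;
		}
	}
	return reg;
}
```

Under this model (int fd, int mode, loff_t offset, loff_t len) fits in six registers, while an ordering such as (int fd, loff_t offset, loff_t len, int mode) needs seven, which matches the constraint described in the quoted text.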
Re: 2.6.21-ext4-1
On Wed, May 09, 2007 at 09:24:49AM +1000, David Chinner wrote:
> On Tue, May 08, 2007 at 03:05:56PM -0700, Mingming Cao wrote:
> > On Tue, 2007-05-08 at 12:50 +1000, David Chinner wrote:
> > > BTW, have you guys tested mmap writes into unwritten extents? ;)
> > >
> > I am not sure, Amit, have you done some mmap write test into
> > uninitialized extents?
> >
> > Sorry, I still not quite clear what's the mapped problem you are worry
> > about. Could you explain to me a bit more? thanks!
>
> XFS needs a ->page_mkwrite() callout to correctly map pages that
> have been dirtied by mmap that span unwritten extents. mmap reads
> (i.e. when the fault first occurred) treat unwritten extents like
> holes and so we need to remap them when they are dirtied to set all
> the unwritten state in the bufferheads correctly for writeback.
>
> See test 166 here:
>
> http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfstests/
> http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfstests/src/unwritten_mmap.c

Hi David,

I updated the above testcase to use fallocate() and ran it on ext4
(with the fallocate patches applied). It threw the following message on
the console:

# ./a.out 2000 /home/test/mnt/testfile
BUG: at fs/buffer.c:1640 __block_write_full_page()
08c6fcd0:  [<080572a7>] dump_stack+0x1b/0x1d
08c6fce8:  [<080c0d07>] __block_write_full_page+0xc4/0x24a
08c6fd10:  [<080c21e0>] block_write_full_page+0xb0/0xb8
08c6fd40:  [<0810c6a3>] ext4_ordered_writepage+0xcb/0x139
08c6fd80:  [<08091ba8>] generic_writepages+0x178/0x2a0
08c6fdfc:  [<08091cfd>] do_writepages+0x2d/0x38
08c6fe10:  [<0808cea2>] __filemap_fdatawrite_range+0x62/0x6d
08c6fe88:  [<0808cec5>] filemap_fdatawrite+0x18/0x1d
08c6fea8:  [<080bf4bf>] do_fsync+0x26/0x67
08c6fec0:  [<080bf521>] __do_fsync+0x21/0x35
08c6fed8:  [<080bf542>] sys_fsync+0xd/0xf
08c6fee8:  [<08058cac>] handle_syscall+0x8c/0xa4
08c6ff64:  [<0806728a>] handle_trap+0xc1/0xc9
08c6ff80:  [<08067683>] userspace+0x123/0x166
08c6ffd8:  [<080589db>] fork_handler+0xa0/0xa2
08c6fffc:  [] 0xa55a5a5a

This is coming from:

fs/buffer.c
   1628		if (block > last_block) {
   1629		.. ... .
   1639		} else if (!buffer_mapped(bh) && buffer_dirty(bh)) {
=> 1640			WARN_ON(bh->b_size != blocksize);
   1641			err = get_block(inode, block, bh, 1);
   1642		.. ... .
   1649		}

Thus, I think in ext4 also we may need to have ->page_mkwrite
implemented. I came across a patch you had submitted a couple of months
back which implemented a generic block_page_mkwrite() function, to which
any file system could hook easily. Here is the link:
http://lkml.org/lkml/2007/3/18/198

Any idea when it is going to be in the mainline ? Not sure if it is
already part of some -mm kernel, but I did not find it in 2.6.21.
Or, since there was talk of ->fault() replacing ->page_mkwrite(), is the
patch not in the pipeline now ?

And, how does XFS behave now if we write to mmapped preallocated blocks,
since XFS also doesn't have ->page_mkwrite() implemented as of today ?

Thanks!
--
Regards,
Amit Arora

>
> The same behaviour is needed for delalloc extents to prevent ENOSPC
> errors on writeback - the mmap write needs to do the freespace
> accounting at the time the page is dirtied and that can only be done
> through the ->page_mkwrite callout. Otherwise ENOSPC will occur in
> the writeback path and that is a major pain
>
> This may not be a problem for ext4, but I thought I better point
> out a couple of the more subtle problems mmap can introduce
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> Principal Engineer
> SGI Australian Software Group
>
>
> ---
>  xfstests/src/Makefile |7
>  xfstests/src/falloc.c | 376 ++
>  2 files changed, 381 insertions(+), 2 deletions(-)
>
> Index: xfs-cmds/xfstests/src/falloc.c
> ===
> --- /dev/null 1970-01-01 00:00:00.0 +
> +++ xfs-cmds/xfstests/src/falloc.c 2007-04-30 12:41:13.862302450 +1000
> @@ -0,0 +1,376 @@
> +/*
> + * Copyright (c) 2000-2003,2007 Silicon Graphics, Inc.
> + * All Rights Reserved.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it would be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write the Free Software Foundation,
> + * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> + */
> +
> +#include "global.h"
> +
> +/* should end up
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
I have the updated patches ready which take care of Andrew's comments.
Will run some tests and post them soon.

But, before submitting these patches, I think it will be better to finalize
certain things which might be worth some discussion here:

1) Should the file size change when preallocation is done beyond EOF ?
   - Andreas and Chris Wedgwood are in favor of not changing the
     file size in this case. I also tend to agree with them. Does anyone
     have an argument in favor of changing the filesize ?
     If not, I will remove the code which changes the filesize, before I
     resubmit the concerned ext4 patch.

2) For FA_UNALLOCATE mode, should the file system allow unallocation
   of normal (non-preallocated) blocks (blocks allocated via
   regular write/truncate operations) also (i.e. work as punch()) ?
   - Though FA_UNALLOCATE mode is yet to be implemented on ext4, we
     still need to finalize the convention here as a general guideline
     to all the filesystems that implement fallocate.

3) If the above is true, the file size will need to be changed
   for "unallocation" when the block holding the EOF gets unallocated.
   - If we do not "unallocate" normal (non-preallocated) blocks and we
     do not change the file size on preallocation, then this is a
     non-issue.

4) Should we update mtime & ctime on a successful allocation/
   unallocation ?
   - David Chinner raised this question in the following post:
     http://lkml.org/lkml/2007/4/29/407
     I think it makes sense to update the [mc]time for a successful
     preallocation/unallocation. Does anyone feel otherwise ?
     It will be interesting to know how XFS behaves currently. Does XFS
     update [mc]time for preallocation ?

--
Regards,
Amit Arora
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Thu, May 10, 2007 at 10:59:26AM +1000, David Chinner wrote:
> On Wed, May 09, 2007 at 09:31:02PM +0530, Amit K. Arora wrote:
> > I have the updated patches ready which take care of Andrew's comments.
> > Will run some tests and post them soon.
> >
> > But, before submitting these patches, I think it will be better to finalize
> > on certain things which might be worth some discussion here:
> >
> > 1) Should the file size change when preallocation is done beyond EOF ?
> >    - Andreas and Chris Wedgwood are in favor of not changing the
> >      file size in this case. I also tend to agree with them. Does anyone
> >      have an argument in favor of changing the filesize ?
> >      If not, I will remove the code which changes the filesize, before I
> >      resubmit the concerned ext4 patch.
>
> I think there needs to be both. If we don't have a mechanism to
> atomically change the file size with the preallocation, then
> applications that use stat() to work out if they need to preallocate
> more space will end up racing.

By "both" above, do you mean we should give the user the flexibility to
choose whether the filesize is changed or not ? It can be done by having
*two* modes for preallocation in the system call - say FA_PREALLOCATE and
FA_ALLOCATE. If we use FA_PREALLOCATE mode, fallocate() will allocate
blocks, but will not change the filesize and [cm]time. If FA_ALLOCATE mode
is used, fallocate() will change the filesize if required (i.e. when
allocation is beyond EOF) and also update [cm]time. This way, the
application can decide what it wants.

This will be helpful for the partial allocation scenario also. Think of
the case when we do not change the filesize in fallocate() and expect
applications/posix_fallocate() to do ftruncate() after fallocate() for
this. Now if fallocate() results in a partial allocation with -ENOSPC
error returned, applications/posix_fallocate() will not know for what
length ftruncate() has to be called. :(

Hence it may be a good idea to give the user the flexibility to choose
whether to atomically change the file size with preallocation or not.
But with more flexibility comes inconsistency in behavior, which is
worth considering.

> > 2) For FA_UNALLOCATE mode, should the file system allow unallocation
> >    of normal (non-preallocated) blocks (blocks allocated via
> >    regular write/truncate operations) also (i.e. work as punch()) ?
>
> Yes. That is the current XFS implementation for XFS_IOC_UNRESVSP, and
> what i did for FA_UNALLOCATE as well.

Ok. But some people may not expect/like this. I think we can keep it on
the backburner for a while, till other issues are sorted out.

> >    - Though FA_UNALLOCATE mode is yet to be implemented on ext4, still
> >      we need to finalize on the convention here as a general guideline
> >      to all the filesystems that implement fallocate.
> >
> > 3) If above is true, the file size will need to be changed
> >    for "unallocation" when block holding the EOF gets unallocated.
>
> No - we punch a hole. If you want the filesize to change, then
> you use ftruncate() to remove the blocks at EOF and change the
> file size atomically.

Ok.

> > 4) Should we update mtime & ctime on a successful allocation/
> >    unallocation ?
> >    - David Chinner raised this question in following post:
> >      http://lkml.org/lkml/2007/4/29/407
> >      I think it makes sense to update the [mc]time for a successful
> >      preallocation/unallocation. Does anyone feel otherwise ?
> >      It will be interesting to know how XFS behaves currently. Does XFS
> >      update [mc]time for preallocation ?
>
> No, XFS does *not* update a/m/ctime on prealloc/punch unless the file size
> changes. If the filesize changes, it behaves exactly the same way that
> ftruncate() behaves.

Having an additional mode (FA_PREALLOCATE) might help here too. Please
see above.
--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
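The two-mode proposal discussed in this message can be modelled with a few lines of C. This is only a sketch of the intended semantics, not kernel code: the mode names FA_PREALLOCATE and FA_ALLOCATE come from the thread, while their numeric values and the helper name size_after_fallocate() are assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical values: the thread names these modes but assigns no numbers. */
enum fa_mode {
	FA_PREALLOCATE = 1,	/* allocate blocks, leave file size alone */
	FA_ALLOCATE    = 2	/* allocate blocks, extend file size if needed */
};

/*
 * Model of the file size after a *successful* call under the proposal.
 * FA_PREALLOCATE never moves EOF; FA_ALLOCATE moves it only when the
 * allocated range ends beyond the current size.
 */
static int64_t size_after_fallocate(int64_t size, enum fa_mode mode,
				    int64_t offset, int64_t len)
{
	int64_t end = offset + len;

	if (mode == FA_ALLOCATE && end > size)
		return end;
	return size;
}
```

Under this model a caller that uses FA_PREALLOCATE and still wants a larger size has to follow up with ftruncate() itself, which is where the partial-allocation ambiguity described above comes from.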
[RFC][PATCH] sys_fallocate() system call
First of all, thanks for the overwhelming response!

Based on the suggestions received, I have added a new parameter to the
sys_fallocate() system call - an integer called "mode", just after the
"fd". Now the system call looks like this:

asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)

Currently we have two modes FA_ALLOCATE and FA_DEALLOCATE, for
preallocation and deallocation of preallocated blocks respectively. More
modes can be added, when required. And these modes can be renamed, since
I am sure these are in no way the best ones ! :)

Attached below is the patch which implements this system call. It has
been currently implemented and tested on i386, ppc64 and x86_64
architectures. I am facing some problems while trying to implement this
on s390, and thus the delay. While I try to get it right on s390(x), we
thought of posting this patch, so that we can save some time. In parallel
we will work on getting the patch to work on s390, and probably it will
come as a separate patch.

ToDos:
======
The following is pending:
1> Implementation on other architectures (other than i386, x86_64 and
   ppc64) like s390(x)
2> A generic file system operation to handle fallocate
   (generic_fallocate), for filesystems that do _not_ have the fallocate
   inode operation implemented.
3> ext4 patches that support fallocate inode operation are ready. I plan
   to submit those separately to just ext4 mailing list.
4> Changes to glibc, so that posix_fallocate() and posix_fallocate64()
   call fallocate() system call
5> Changes to XFS to implement the fallocate inode operation

Signed-off-by: Amit K Arora <[EMAIL PROTECTED]>
---
 arch/i386/kernel/syscall_table.S |    1
 arch/x86_64/kernel/functionlist  |    1
 fs/open.c                        |   41 +++
 include/asm-i386/unistd.h        |    3 +-
 include/asm-powerpc/systbl.h     |    1
 include/asm-powerpc/unistd.h     |    3 +-
 include/asm-x86_64/unistd.h      |    4 ++-
 include/linux/fs.h               |    7 ++
 include/linux/syscalls.h         |    1
 9 files changed, 59 insertions(+), 3 deletions(-)

Index: linux-2.6.20.1/arch/i386/kernel/syscall_table.S
===================================================================
--- linux-2.6.20.1.orig/arch/i386/kernel/syscall_table.S
+++ linux-2.6.20.1/arch/i386/kernel/syscall_table.S
@@ -319,3 +319,4 @@ ENTRY(sys_call_table)
 	.long sys_move_pages
 	.long sys_getcpu
 	.long sys_epoll_pwait
+	.long sys_fallocate		/* 320 */
Index: linux-2.6.20.1/fs/open.c
===================================================================
--- linux-2.6.20.1.orig/fs/open.c
+++ linux-2.6.20.1/fs/open.c
@@ -350,6 +350,47 @@ asmlinkage long sys_ftruncate64(unsigned
 }
 #endif
 
+asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
+{
+	struct file *file;
+	struct inode *inode;
+	long ret = -EINVAL;
+
+	if (len == 0 || offset < 0)
+		goto out;
+
+	ret = -EBADF;
+	file = fget(fd);
+	if (!file)
+		goto out;
+	if (!(file->f_mode & FMODE_WRITE))
+		goto out_fput;
+
+	inode = file->f_path.dentry->d_inode;
+
+	ret = -ESPIPE;
+	if (S_ISFIFO(inode->i_mode))
+		goto out_fput;
+
+	ret = -ENODEV;
+	if (!S_ISREG(inode->i_mode))
+		goto out_fput;
+
+	ret = -EFBIG;
+	if (offset + len > inode->i_sb->s_maxbytes)
+		goto out_fput;
+
+	if (inode->i_op && inode->i_op->fallocate)
+		ret = inode->i_op->fallocate(inode, mode, offset, len);
+	else
+		ret = -ENOSYS;
+out_fput:
+	fput(file);
+out:
+	return ret;
+}
+EXPORT_SYMBOL(sys_fallocate);
+
 /*
  * access() needs to use the real uid/gid, not the effective uid/gid.
  * We do this by temporarily clearing all FS-related capabilities and
Index: linux-2.6.20.1/include/asm-i386/unistd.h
===================================================================
--- linux-2.6.20.1.orig/include/asm-i386/unistd.h
+++ linux-2.6.20.1/include/asm-i386/unistd.h
@@ -325,10 +325,11 @@
 #define __NR_move_pages		317
 #define __NR_getcpu		318
 #define __NR_epoll_pwait	319
+#define __NR_fallocate		320
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 320
+#define NR_syscalls 321
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.20.1/include/linux/fs.h
===================================================================
--- linux-2.6.20.1.orig/include/linux/fs.h
+++ linux-2.6.20.1/include/linux/fs.h
@@ -263,6 +263,12 @@ extern int dir_notify_enable;
 #define SYNC_FILE_RANGE_WRITE		2
 #define SYNC_FILE_RANGE_WAIT_AFTER	4
 
+/*
+ * fallocate() modes
+ */
+#define FA_ALLOCATE	0x1
+#define FA_DEALLOCATE	0x2
+
 #ifdef __KERNEL__
 
 #inclu
Re: [RFC][PATCH] sys_fallocate() system call
On Fri, Mar 16, 2007 at 04:21:03PM +0100, Heiko Carstens wrote:
> On Fri, Mar 16, 2007 at 08:01:01PM +0530, Amit K. Arora wrote:
> > First of all, thanks for the overwhelming response!
> >
> > Based on the suggestions received, I have added a new parameter to the
> > sys_fallocate() system call - an integer called "mode", just after the
> > "fd". Now the system call looks like this:
> >
> > asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
> >
> > Currently we have two modes FA_ALLOCATE and FA_DEALLOCATE, for
> > preallocation and deallocation of preallocated blocks respectively. More
> > modes can be added, when required. And these modes can be renamed, since
> > I am sure these are in no way the best ones ! :)
> >
> > Attached below is the patch which implements this system call. It has
> > been currently implemented and tested on i386, ppc64 and x86_64
> > architectures. I am facing some problems while trying to implement this
> > on s390, and thus the delay. While I try to get it right on s390(x), we
> > thought of posting this patch, so that we can save some time. In parallel
> > we will work on getting the patch to work on s390, and probably it will
> > come as a separate patch.
>
> What's the problem you face on s390? If it's just the compat wrapper, you
> may look at sys_sync_file_range_wrapper. Or I will send a patch if needed.

Hi Heiko,

Yes, the problem was adding the compat wrapper for this. I would
appreciate your help in writing it. The only thing is that we might have
to wait until the order of the arguments is decided upon.

Thanks!
--
Regards,
Amit Arora
Re: [RFC][PATCH] sys_fallocate() system call
On Sat, Mar 17, 2007 at 04:33:50PM +1100, Stephen Rothwell wrote:
> On Fri, 16 Mar 2007 20:01:01 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote:
> >
> > +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
> >
> > --- linux-2.6.20.1.orig/include/asm-powerpc/systbl.h
> > +++ linux-2.6.20.1/include/asm-powerpc/systbl.h
> > @@ -305,3 +305,4 @@ SYSCALL_SPU(faccessat)
> >  COMPAT_SYS_SPU(get_robust_list)
> >  COMPAT_SYS_SPU(set_robust_list)
> >  COMPAT_SYS(move_pages)
> > +SYSCALL(fallocate)
>
> It is going to need to be a COMPAT_SYS call in powerpc because 32 bit
> powerpc will pass the two loff_t's in pairs of registers while
> 64bit passes them in one register each.

Ok. Will make that change, unless it is decided to pass each loff_t
argument as two "u32"s.

Thanks!
--
Regards,
Amit Arora
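If each loff_t ends up being passed as two 32-bit halves, the halves have to be rejoined on the receiving side. A minimal userspace sketch of that round trip follows; the helper names join_halves() and split_halves() are hypothetical and not part of any patch in this thread. The point it illustrates is that the high half must be widened to 64 bits before it is shifted, because shifting a 32-bit value by 32 is undefined in C.

```c
#include <stdint.h>

/*
 * Rejoin the low and high 32-bit halves of a 64-bit offset or length.
 * The cast widens the high half first, so the shift is done in 64 bits.
 */
static int64_t join_halves(uint32_t low, uint32_t high)
{
	return (int64_t)(((uint64_t)high << 32) | low);
}

/* Split a 64-bit value into halves, as a 32-bit caller would. */
static void split_halves(int64_t value, uint32_t *low, uint32_t *high)
{
	*low = (uint32_t)((uint64_t)value & 0xffffffffu);
	*high = (uint32_t)((uint64_t)value >> 32);
}
```

A 4 GiB offset, for instance, splits into low = 0 and high = 1, and joining those halves gives the original value back.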
Re: [RFC][PATCH] sys_fallocate() system call
On Sat, Mar 17, 2007 at 05:10:37AM -0600, Matthew Wilcox wrote:
> How about:
>
> asmlinkage long sys_fallocate(int fd, int mode, u32 off_low, u32 off_high,
> 				u32 len_low, u32 len_high);
>
> That way we all suffer equally ...

As suggested by you and Russel, I have made this change to the patch.
Here is how it looks now. Please let me know if anyone has concerns
about passing arguments this way (breaking each "loff_t" into two "u32"s).

Signed-off-by: Amit K Arora <[EMAIL PROTECTED]>
---
 arch/i386/kernel/syscall_table.S |    1
 arch/x86_64/kernel/functionlist  |    1
 fs/open.c                        |   46 +++
 include/asm-i386/unistd.h        |    3 +-
 include/asm-powerpc/systbl.h     |    1
 include/asm-powerpc/unistd.h     |    3 +-
 include/asm-x86_64/unistd.h      |    4 ++-
 include/linux/fs.h               |    7 +
 include/linux/syscalls.h         |    2 +
 9 files changed, 65 insertions(+), 3 deletions(-)

Index: linux-2.6.20.1/arch/i386/kernel/syscall_table.S
===================================================================
--- linux-2.6.20.1.orig/arch/i386/kernel/syscall_table.S
+++ linux-2.6.20.1/arch/i386/kernel/syscall_table.S
@@ -319,3 +319,4 @@ ENTRY(sys_call_table)
 	.long sys_move_pages
 	.long sys_getcpu
 	.long sys_epoll_pwait
+	.long sys_fallocate		/* 320 */
Index: linux-2.6.20.1/fs/open.c
===================================================================
--- linux-2.6.20.1.orig/fs/open.c
+++ linux-2.6.20.1/fs/open.c
@@ -350,6 +350,52 @@ asmlinkage long sys_ftruncate64(unsigned
 }
 #endif
 
+asmlinkage long sys_fallocate(int fd, int mode, u32 off_low, u32 off_high,
+				u32 len_low, u32 len_high)
+{
+	struct file *file;
+	struct inode *inode;
+	loff_t offset, len;
+	long ret = -EINVAL;
+
+	offset = ((loff_t)off_high << 32) + off_low;
+	len = ((loff_t)len_high << 32) + len_low;
+
+	if (len == 0 || offset < 0)
+		goto out;
+
+	ret = -EBADF;
+	file = fget(fd);
+	if (!file)
+		goto out;
+	if (!(file->f_mode & FMODE_WRITE))
+		goto out_fput;
+
+	inode = file->f_path.dentry->d_inode;
+
+	ret = -ESPIPE;
+	if (S_ISFIFO(inode->i_mode))
+		goto out_fput;
+
+	ret = -ENODEV;
+	if (!S_ISREG(inode->i_mode))
+		goto out_fput;
+
+	ret = -EFBIG;
+	if (offset + len > inode->i_sb->s_maxbytes)
+		goto out_fput;
+
+	if (inode->i_op && inode->i_op->fallocate)
+		ret = inode->i_op->fallocate(inode, mode, offset, len);
+	else
+		ret = -ENOSYS;
+out_fput:
+	fput(file);
+out:
+	return ret;
+}
+EXPORT_SYMBOL(sys_fallocate);
+
 /*
  * access() needs to use the real uid/gid, not the effective uid/gid.
  * We do this by temporarily clearing all FS-related capabilities and
Index: linux-2.6.20.1/include/asm-i386/unistd.h
===================================================================
--- linux-2.6.20.1.orig/include/asm-i386/unistd.h
+++ linux-2.6.20.1/include/asm-i386/unistd.h
@@ -325,10 +325,11 @@
 #define __NR_move_pages		317
 #define __NR_getcpu		318
 #define __NR_epoll_pwait	319
+#define __NR_fallocate		320
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 320
+#define NR_syscalls 321
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.20.1/include/linux/fs.h
===================================================================
--- linux-2.6.20.1.orig/include/linux/fs.h
+++ linux-2.6.20.1/include/linux/fs.h
@@ -263,6 +263,12 @@ extern int dir_notify_enable;
 #define SYNC_FILE_RANGE_WRITE		2
 #define SYNC_FILE_RANGE_WAIT_AFTER	4
 
+/*
+ * fallocate() modes
+ */
+#define FA_ALLOCATE	0x1
+#define FA_DEALLOCATE	0x2
+
 #ifdef __KERNEL__
 
 #include
@@ -1124,6 +1130,7 @@ struct inode_operations {
 	ssize_t (*listxattr) (struct dentry *, char *, size_t);
 	int (*removexattr) (struct dentry *, const char *);
 	void (*truncate_range)(struct inode *, loff_t, loff_t);
+	int (*fallocate)(struct inode *, int, loff_t, loff_t);
 };
 
 struct seq_file;
Index: linux-2.6.20.1/include/linux/syscalls.h
===================================================================
--- linux-2.6.20.1.orig/include/linux/syscalls.h
+++ linux-2.6.20.1/include/linux/syscalls.h
@@ -602,6 +602,8 @@ asmlinkage long sys_get_robust_list(int
 asmlinkage long sys_set_robust_list(struct robust_list_head __user *head,
 				    size_t len);
 asmlinkage long sys_getcpu(unsigned __user *cpu, unsigned __user *node, struct getcpu_cache __user *cache);
+asmlinkage long sys_fallocate(int fd, int mode, u32 off_low, u32 off_high,
+