Two questions regarding Opening files within Kernel!
Hi everyone,

I have two questions about opening files within the Linux kernel. If
somebody can help me sort out this problem, I will be very thankful.

1) I have only an absolute file path, with no dentry, no inode, and no
   vfsmount object. Which function can I call to get a "file" object
   associated with the absolute file path? I have looked around the
   source code, especially fs/open.c and some other files, but each
   function requires "mode" and "fd" parameters besides the file path.
   I am confused about the "mode" parameter (and how it differs from
   "flag") and what to pass for it. As for "fd", I am not sure what
   value to pass, since no file is open yet and only the file path
   exists. Any idea?

2) Is there any functionality within the Linux kernel source to read a
   file one line at a time, or some indirect way to set the buffer size
   for a single read? That is, is there any existing header file for
   doing text I/O rather than binary I/O within the kernel source?

Thanks,
JG

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
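A hedged pointer for both questions: as far as I know, the kernel's
filp_open(path, flags, mode) takes the same flags/mode pair as userspace
open(2) and returns a struct file * directly, so no fd is involved. The
flags say how to open the file (O_RDONLY, O_CREAT, ...); the mode only
supplies permission bits when O_CREAT actually creates the file. I am
not aware of a line-oriented read helper in the kernel, so the usual
approach is to read into a fixed buffer and split on '\n' yourself. The
userspace sketch below only illustrates those two points; read_line()
and demo_first_line() are hypothetical names, not kernel APIs.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read one line (without the '\n') from fd into
 * buf, which holds cap bytes. Returns the line length, or -1 on error
 * or when EOF is hit before any byte was read.
 */
static ssize_t read_line(int fd, char *buf, size_t cap)
{
	size_t n = 0;
	char c;

	assert(cap > 0);
	while (n + 1 < cap) {
		ssize_t r = read(fd, &c, 1);

		if (r < 0)
			return -1;
		if (r == 0) {
			if (n == 0)
				return -1;
			break;
		}
		if (c == '\n')
			break;
		buf[n++] = c;
	}
	buf[n] = '\0';
	return (ssize_t)n;
}

/* Hypothetical demo: create a two-line file, then read back line one. */
static ssize_t demo_first_line(const char *path, char *out, size_t cap)
{
	static const char text[] = "first\nsecond\n";
	ssize_t n;
	int fd;

	/* flags: how to open; mode: permissions for the created file. */
	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;
	if (write(fd, text, sizeof(text) - 1) != (ssize_t)(sizeof(text) - 1)) {
		close(fd);
		return -1;
	}
	close(fd);

	/* No O_CREAT here, so no mode argument is needed. */
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	n = read_line(fd, out, cap);
	close(fd);
	unlink(path);
	return n;
}
```

Reading one byte per call keeps the sketch short; a real reader would
fill a larger buffer and scan it for the newline.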
Re: Reiser4. BEST FILESYSTEM EVER.
On Fri, Apr 06, 2007 at 10:58:45PM -0700, [EMAIL PROTECTED] wrote:
> You know,... you cut out this bit:
>
> -
> > > The following benchmarks are from
> > >
> > > http://linuxhelp.150m.com/resources/fs-benchmarks.htm or,
> > > http://m.domaindlx.com/LinuxHelp/resources/fs-benchmarks.htm
...

Hey John, please change your disk, it's scratched and you're repeating
yourself again and again. At first I thought "Oh cool, some good news
about reiser4", now when I see "reiserfs" in a thread, I think "oh no,
not this boring guy who escaped from the asylum again !".

I hope this thread will be cut shortly so that you stop doing bad
publicity for reiserfs and its developers, because when a product is
indicated as good by stupid people, it's really doing harm.

Also, about this part :

[Jan]
> > But in the end everything is a tradeoff. You can save diskspace, but
> > increase the cost of corruption.

I don't 100% agree with Jan, because for some usages (temporary space),
light compression can increase speed. For instance, when processing
logs, I get better speed by compressing intermediate files with LZO on
the fly.

[John]
> You deliberately ignored the fact that bad blocks are NOT dealt with by
> the filesystem,... but by the operating system. Like I said: If your
> filesystem is writing to bad blocks, then throw away your operating
> system.

But what you write here is complete crap. The filesystem relies on a
linear block device. The operating system is responsible for doing read
retries or reporting errors on bad blocks, but the FS and only the FS
can decide how not to use some known defective areas, for instance not
putting any metadata on them nor any useful data.

Now if you want to stop writing stupid things again and again, take
your bag, don't miss the bus to school, and listen to the teachers
instead of playing games on your calculator.

Willy

PS: no need to reply either, I'll kill this thread and your address here.
Re: [PATCH 4/5] Char: cyclades, remove volatiles
Andrew Morton wrote:
> On Wed, 4 Apr 2007 23:45:38 +0200 (CEST) Jiri Slaby <[EMAIL PROTECTED]>
> wrote:
>
>> cyclades, remove volatiles
>
> The other changes seem uncontroversial, but this one has the potential
> to change runtime behaviour. And cyclades.c is a driver which some people
> actually use ;)

Well, don't you know anybody with a Z card, please?

But all volatiles were
- used locally (loop variables, nonptr count read once from HW), so that
  nothing has a chance to change them in the memory.
- accessed by readX/writeX (pointers to mapped HW) and that should be OK

> Have these changes been runtime-tested?

Yes this time :), I have at least a PCI Y cyclades card within reach to
test.

thanks,
-- 
http://www.fi.muni.cz/~xslaby/            Jiri Slaby
faculty of informatics, masaryk university, brno, cz
e-mail: jirislaby gmail com, gpg pubkey fingerprint:
B674 9967 0407 CE62 ACC8 22A0 32CC 55C3 39D4 7A7E
[BUG] scheduler: first timeslice of the exiting thread
Hi Ingo and all,

When I was examining the following program ...

 1. There is a large number of small jobs, each taking several msecs,
    and the number of jobs increases constantly.
 2. The process creates a thread or a process per job (I examined both
    the thread model and the process model).
 3. Each child process/thread does the assigned job and exits
    immediately.

... I found that the thread model's latency is longer than the process
model's, against my expectation. This is because of the current
sched_fork()/sched_exit() implementation, as follows:

 a) On sched_fork, the creator shares its timeslice with the new
    process.
 b) On sched_exit, if the exiting process hasn't exhausted its first
    timeslice yet, it gives its timeslice to the parent.

This causes no problem in the process model since the creator is the
parent. However, in the thread model, the creator is not the parent;
the new thread's parent is the same as the creator's parent. Hence, in
this kind of program, the creator can't get the shared timeslice back
and exhausts its timeslice at a rate of knots. In addition, somehow,
the parent (typically the shell?) gets extra timeslice.

I believe it's a bug and the exiting process should give its timeslice
to the creator. I have two candidate plans for a patch to fix this
problem:

 a) Add a field for the creator to task_struct. It needs extra memory.
 b) Don't add an extra field, and make the creator the thread's parent,
    the same as for process creation. However, this has many side
    effects; for example, we would also need to change the
    sys_getppid() implementation.

What do you think? Any comments are welcome.

BTW, we can easily confirm the problem with systemtap, a convenient
diagnostic program.

Test programs (attached in the mail):

 - satprocess.c: Process model. It creates a child process and waits
   for it, several times. Each child process exits immediately.
 - satthread.c: Thread model. It creates a child thread and joins it,
   several times. Each child thread exits immediately.
 - fork_exit.stp: systemtap script to monitor satprocess/satthread

My kernel: 2.6.21-rc6 (i386).

How to confirm:

1) Execute the systemtap script.

# stap fork_exit.stp -v
Pass 1: parsed user script and 54 library script(s) in 680usr/40sys/728real ms.
Pass 2: analyzed script: 2 probe(s), 8 function(s), 0 embed(s), 0 global(s) in 670usr/40sys/699real ms.
Pass 3: using cached /root/.systemtap/cache/02/stap_02d3975cd5dedf7b6697a2d5f92f966a_3811.c
Pass 4: using cached /root/.systemtap/cache/02/stap_02d3975cd5dedf7b6697a2d5f92f966a_3811.ko
Pass 5: starting run.

2) Execute the process model program on another terminal.

$ ./satprocess

Then systemtap monitors satprocess's fork/exit and prints the following:

fork: pid = 11635, tgid = 11635, ppid = 5969, time_slice = 11
exit: pid = 11635, tgid = 11635, ppid = 11634, time_slice = 6
fork: pid = 11636, tgid = 11636, ppid = 5969, time_slice = 11
exit: pid = 11636, tgid = 11636, ppid = 11634, time_slice = 6
fork: pid = 11637, tgid = 11637, ppid = 5969, time_slice = 11
exit: pid = 11637, tgid = 11637, ppid = 11634, time_slice = 6
fork: pid = 11638, tgid = 11638, ppid = 5969, time_slice = 11
exit: pid = 11638, tgid = 11638, ppid = 11634, time_slice = 6
fork: pid = 11639, tgid = 11639, ppid = 5969, time_slice = 11
exit: pid = 11639, tgid = 11639, ppid = 11634, time_slice = 6
fork: pid = 11640, tgid = 11640, ppid = 5969, time_slice = 11
exit: pid = 11640, tgid = 11640, ppid = 11634, time_slice = 6
fork: pid = 11641, tgid = 11641, ppid = 5969, time_slice = 11
exit: pid = 11641, tgid = 11641, ppid = 11634, time_slice = 6
fork: pid = 11642, tgid = 11642, ppid = 5969, time_slice = 11
exit: pid = 11642, tgid = 11642, ppid = 11634, time_slice = 6
fork: pid = 11643, tgid = 11643, ppid = 5969, time_slice = 11
exit: pid = 11643, tgid = 11643, ppid = 11634, time_slice = 6
fork: pid = 11644, tgid = 11644, ppid = 5969, time_slice = 11
exit: pid = 11644, tgid = 11644, ppid = 11634, time_slice = 6
exit: pid = 11634, tgid = 11634, ppid = 5969, time_slice = 10

It looks good.

3) Execute the thread model program on another terminal.

$ ./satthread

Then systemtap monitors satthread's fork/exit and prints the following:

fork: pid = 11646, tgid = 11645, ppid = 5969, time_slice = 10
exit: pid = 11646, tgid = 11645, ppid = 5969, time_slice = 5
fork: pid = 11647, tgid = 11645, ppid = 5969, time_slice = 5
fork: pid = 11648, tgid = 11645, ppid = 5969, time_slice = 2
exit: pid = 11647, tgid = 11645, ppid = 5969, time_slice = 3
fork: pid = 11649, tgid = 11645, ppid = 5969, time_slice = 1
exit: pid = 11648, tgid = 11645, ppid = 5969, time_slice = 1
fork: pid = 11650, tgid = 11645, ppid = 5969, time_slice = 25
exit: pid = 11649, tgid = 11645, ppid = 5969, time_slice = 1
fork: pid = 11651, tgid = 11645, ppid = 5969, time_slice = 12
exit: pid = 11650, tgid = 11645, ppid = 5969, time_slice = 13
fork: pid = 11652, tgid = 11645, ppid = 5969, time_slice = 6
exit: pid = 11651, tgid = 11645, ppid = 5969, time_slice = 6
for
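The accounting described in this report (the creator shares its
timeslice on fork, and an exiting child hands what is left to its
parent) can be modeled in a few lines of userspace C. This is a toy
sketch, not the scheduler code: the toy_* names, the split-in-half
rule, the cap at a full slice, and the starting value of 20 are all
assumptions made for illustration.

```c
#include <assert.h>

/* Toy task: only the fields the report talks about. */
struct toy_task {
	unsigned int time_slice;
	struct toy_task *parent;
};

/* Assumed rule: the creator gives half of its slice to the child. */
static void toy_sched_fork(struct toy_task *creator, struct toy_task *child,
			   struct toy_task *parent)
{
	child->time_slice = (creator->time_slice + 1) >> 1;
	creator->time_slice >>= 1;
	child->parent = parent;
}

/* The exiting child returns its unused slice to its parent (capped). */
static void toy_sched_exit(struct toy_task *child, unsigned int max_slice)
{
	struct toy_task *p = child->parent;

	p->time_slice += child->time_slice;
	if (p->time_slice > max_slice)
		p->time_slice = max_slice;
	child->time_slice = 0;
}

/*
 * Run `jobs` fork/exit cycles in which the child exits at once, and
 * return the creator's remaining slice. In the process model the
 * child's parent is the creator; in the thread model it is the
 * creator's own parent (the "shell" here).
 */
static unsigned int creator_slice_after(int thread_model, int jobs)
{
	const unsigned int full = 20;
	struct toy_task shell = { full, 0 };
	struct toy_task creator = { full, &shell };
	int i;

	assert(jobs >= 0);
	for (i = 0; i < jobs; i++) {
		struct toy_task child;

		toy_sched_fork(&creator, &child,
			       thread_model ? &shell : &creator);
		toy_sched_exit(&child, full);
	}
	return creator.time_slice;
}
```

In this model creator_slice_after(0, n) stays at the full slice for any
n, while creator_slice_after(1, n) halves on every job, which is the
drain the report describes for the thread model.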
Re: COMPILING AND CONFIGURING A NEW KERNEL.
Just correcting some errors and typos. Wouldn't want you to say that the
linux kernel mailing list gave you incorrect info.


COMPILING AND CONFIGURING A NEW KERNEL.

Download a recent kernel from http://www.kernel.org/
I will use the kernel linux-2.6.20.tar.bz2
You will have to change details of the following to suit your purposes.

Save it in /usr/src/ and change to that directory
# mv linux-2.6.20.tar.bz2 /usr/src/
# cd /usr/src/

Unzip the kernel package
# tar -jxf linux-2.6.20.tar.bz2

Copy the original kernel configuration file (that came with your distro)
to .config
# cp /boot/config-2.6.13-15-default /usr/src/linux-2.6.20/.config

Change to the new kernel source directory
# cd /usr/src/linux-2.6.20/

Look at the available kernel building options
# make help

Run oldconfig to update the original kernel configuration to a current
configuration
# make oldconfig

Use menuconfig (or xconfig or gconfig) to make any further changes
# make menuconfig

YOU SHOULD compile all the drivers necessary to boot your system, into
the kernel (ie, such drivers should not be built as modules). This way
you will NOT need an initrd file.

Use rpm-pkg to create a Red Hat RPM kernel package.
# make rpm-pkg

When built, the RPM package is put in
/usr/src/packages/RPMS/*your*architecture*
# cd /usr/src/packages/RPMS/x86_64

Install the package (you may have to un-install previous installs)
# rpm -i kernel-2.6.20-1.x86_64.rpm

Use deb-pkg to create a Debian .deb kernel package.
# make deb-pkg

When built, the .deb package is put in /usr/src/
# cd /usr/src/

Install the package (you may have to un-install previous installs)
# dpkg --install linux-2.6.20_2.6.20_amd64.deb

If you were unable to determine which drivers you need (to boot), then
you will need an initrd file. To build it use the command
# mkinitrd -o /boot/initrd-2.6.20

IF YOU ARE CUSTOMIZING YOUR KERNEL, YOU SHOULD PUT IN THE EFFORT TO
BUILD A KERNEL THAT DOES NOT NEED AN INITRD FILE.

It is possible that deb-pkg and rpm-pkg take care of creating the initrd
automatically. I have always compiled in the important drivers, so I do
not know. Does any caring person here know the answer to this question?

--

Now you need to configure your kernel. Using GRUB you need to change the
menu.lst file.
# emacs /boot/grub/menu.lst &

The grub entry that you presently boot with, will look something like:

###Don't change this comment - YaST2 identifier: Original name: linux###
title SUSE LINUX 10.0
    root (hd0,2)
    kernel /boot/vmlinuz-2.6.13-15-default root=/dev/hda3 resume=/dev/hda5 vga=0x317 video=vesafb:nomtrr splash=silent
    initrd /boot/initrd

Do NOT delete the old boot entry, so you can boot it, if things go wrong
with the new kernel. Cut a copy of it and paste it above the original.
Then adjust the copy for the new kernel.

###Don't change this comment - YaST2 identifier: Original name: linux###
title MY NEW KERNEL
    root (hd0,2)
    kernel /boot/linux-2.6.20 root=/dev/hda3 resume=/dev/hda5 vga=0x317 video=vesafb:nomtrr splash=silent

Of course, you don't need an initrd entry as you have compiled in all
the vital drivers,... right?

If you could not determine the vital drivers and needed to build an
initrd file, then you need an entry, like
    initrd /boot/initrd-2.6.20

--

If your new kernel is destined to have the same name as the old one, you
need to do something about it (unless you do not mind the old one being
overwritten).

Use your favorite text editor to change the top level Makefile
# emacs /usr/src/linux-2.6.20/Makefile &

change the line
EXTRAVERSION
to
EXTRAVERSION = something

This will change the name of the new kernel to linux-2.6.20-something

Your /boot/grub/menu.lst entry will now look something like:

###Don't change this comment - YaST2 identifier: Original name: linux###
title MY NEW KERNEL
    root (hd0,2)
    kernel /boot/linux-2.6.20-something root=/dev/hda3 resume=/dev/hda5 vga=0x317 video=vesafb:nomtrr splash=silent

and perhaps an entry
    initrd /boot/initrd-2.6.20-something

--

Now reboot and choose the "MY NEW KERNEL" entry from the GRUB boot menu,
and see how you went.

-- 
[EMAIL PROTECTED]
Re: [BUG] scheduler: first timeslice of the exiting thread
At Sat, 07 Apr 2007 16:31:39 +0900,
Satoru Takeuchi wrote:
> Test programs(attached in the mail):
>
>  - satprocess.c: Process model. It creates a child process and wait for it
>    several times. Each child process exits immediately.
>  - satthread.c: Thread model. It creates a child thread and join it several
>    times. Each child thread exits immediately.
>  - fork_exit.stp: systemtap script to overlook satprocess/satthread

Oh, I sent an old systemtap script. The correct one is here.

Satoru

/// fork_exit.stp /
/*
 * fork_exit.stp - Monitors sched_fork()/sched_exit() for satprocess/satthread
 *                 and prints some information
 *
 * Copyright (C) 2007 Satoru Takeuchi <[EMAIL PROTECTED]>
 *
 * This software may be used and distributed according to the terms
 * of the GNU General Public License, incorporated herein by reference.
 */

function is_my_testpro(comm)
{
	if (comm == "satthread" || comm == "satprocess")
		return 1
	else
		return 0
}

function print_log(name, pid, tgid, ppid, time_slice)
{
	printf("%s: pid = %d, tgid = %d, ppid = %d, time_slice = %u\n",
	       name, pid, tgid, ppid, time_slice);
}

probe kernel.function("sched_fork")
{
	if (is_my_testpro(kernel_string($p->comm)))
		print_log("fork", $p->pid, $p->tgid, $p->parent->pid,
			  $p->time_slice);
}

probe kernel.function("sched_exit")
{
	if (is_my_testpro(kernel_string($p->comm)))
		print_log("exit", $p->pid, $p->tgid, $p->parent->pid,
			  $p->time_slice);
}
Re: coding style for long conditions
David Brownell <[EMAIL PROTECTED]> writes:

> So in practical terms "\n \t" and "\n\t" are identical;
> although the former "should not" be used, it doesn't
> actually affect what CodingStyle is primarily trying to
> control (i.e. what the code looks like).

That's not what CodingStyle is trying to control. Not "what the code
looks like" at all. Think why this line is here at the end of Chapter 1.

  Get a decent editor and don't leave whitespace at the end of lines.

By the way, "git show --color 0aa599c -- drivers/usb/net/usbnet.h" would
catch this kind of breakage if you have git.
[patch] high-res timers: UP resume fix
* Linus Torvalds <[EMAIL PROTECTED]> wrote:

> In fact, I have a theory.. Your backtrace is:
>
>  [] smp_apic_timer_interrupt+0x57/0x90
>  [] retrigger_next_event+0x0/0xb0
>  [] apic_timer_interrupt+0x28/0x30
>  [] retrigger_next_event+0x0/0xb0
>  [] __kfifo_put+0x8/0x90
>  [] on_each_cpu+0x35/0x60
>  [] clock_was_set+0x18/0x20
>  [] timekeeping_resume+0x7c/0xa0
>  [] __sysdev_resume+0x11/0x80
>  [] sysdev_resume+0x47/0x80
>  [] device_power_up+0x5/0x10
>
> and the thing is, I don't think we should have interrupt enabled at
> this point in time! I susect that the timer resume enables interrupts
> too early! We should be doing the whole "device_power_up()" sequence
> with irq's off, I think..

yeah, i think you are right. timekeeping_resume() itself does not
re-enable interrupts, it's clock_was_set() that does it implicitly:

void clock_was_set(void)
{
	/* Retrigger the CPU local events everywhere */
	on_each_cpu(retrigger_next_event, NULL, 0, 1);
}

on_each_cpu() is safe on SMP during resume 'bootup', because we only
have a single CPU at that point, and smp_call_function() does:

	spin_lock(&call_lock);
	cpus = num_online_cpus() - 1;
	if (!cpus) {
		spin_unlock(&call_lock);

so we just return. Note that the built-in warning of
smp_call_function() does not trigger because it's done too late:

	/* Can deadlock when called with interrupts disabled */
	WARN_ON(irqs_disabled());

we should move this up to the head of the function.

But for this bug in question to trigger we'd have to use an UP kernel,
which has this code for on_each_cpu():

#define on_each_cpu(func,info,retry,wait)	\
	({					\
		local_irq_disable();		\
		func(info);			\
		local_irq_enable();		\

ouch!

the solution is this: what we want to call here in timekeeping_resume
is not clock_was_set() but retrigger_next_event() for the current CPU.
The patch below should fix it.

Soeren, can you confirm that you are using a !CONFIG_SMP kernel, and if
yes, does the patch below fix the resume problem for you?

	Ingo

------------------->
Subject: [patch] high-res timers: UP resume fix
From: Ingo Molnar <[EMAIL PROTECTED]>

Soeren Sonnenburg reported that upon resume he is getting
this backtrace:

 [] smp_apic_timer_interrupt+0x57/0x90
 [] retrigger_next_event+0x0/0xb0
 [] apic_timer_interrupt+0x28/0x30
 [] retrigger_next_event+0x0/0xb0
 [] __kfifo_put+0x8/0x90
 [] on_each_cpu+0x35/0x60
 [] clock_was_set+0x18/0x20
 [] timekeeping_resume+0x7c/0xa0
 [] __sysdev_resume+0x11/0x80
 [] sysdev_resume+0x47/0x80
 [] device_power_up+0x5/0x10

it turns out that on UP we mistakenly re-enable interrupts,
so do the timer retrigger only on the current CPU.

Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]>
---
 include/linux/hrtimer.h |    3 +++
 kernel/hrtimer.c        |   12 ++++++++++++
 2 files changed, 15 insertions(+)

Index: linux/include/linux/hrtimer.h
===================================================================
--- linux.orig/include/linux/hrtimer.h
+++ linux/include/linux/hrtimer.h
@@ -206,6 +206,7 @@ struct hrtimer_cpu_base {
 struct clock_event_device;
 
 extern void clock_was_set(void);
+extern void hres_timers_resume(void);
 extern void hrtimer_interrupt(struct clock_event_device *dev);
 
 /*
@@ -236,6 +237,8 @@ static inline ktime_t hrtimer_cb_get_tim
  */
 static inline void clock_was_set(void) { }
 
+static inline void hres_timers_resume(void) { }
+
 /*
  * In non high resolution mode the time reference is taken from
  * the base softirq time variable.
Index: linux/kernel/hrtimer.c
===================================================================
--- linux.orig/kernel/hrtimer.c
+++ linux/kernel/hrtimer.c
@@ -459,6 +459,18 @@ void clock_was_set(void)
 }
 
 /*
+ * During resume we might have to reprogram the high resolution timer
+ * interrupt (on the local CPU):
+ */
+void hres_timers_resume(void)
+{
+	WARN_ON_ONCE(num_online_cpus() > 1);
+
+	/* Retrigger the CPU local events: */
+	retrigger_next_event(NULL);
+}
+
+/*
  * Check, whether the timer is on the callback pending list
  */
 static inline int hrtimer_cb_pending(const struct hrtimer *timer)
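The failure mode discussed in this message (a helper that disables and
then unconditionally re-enables interrupts, called from a path that
must keep them off) can be shown with a toy userspace model. This is a
sketch under assumptions: the toy_* names and the boolean standing in
for the CPU interrupt flag are invented here and are not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

static bool toy_irqs_enabled = true;	/* stands in for the CPU irq flag */
static int toy_retrigger_calls;

static void toy_local_irq_disable(void) { toy_irqs_enabled = false; }
static void toy_local_irq_enable(void)  { toy_irqs_enabled = true; }

static void toy_retrigger_next_event(void *arg)
{
	(void)arg;
	toy_retrigger_calls++;
}

/*
 * Shaped like the UP on_each_cpu() macro quoted above: it enables
 * interrupts on the way out no matter what state it was entered in.
 */
static void toy_on_each_cpu_up(void (*func)(void *), void *info)
{
	assert(func);
	toy_local_irq_disable();
	func(info);
	toy_local_irq_enable();
}

/*
 * Toy resume path, entered with interrupts off. Returns whether
 * interrupts are enabled afterwards; they should not be.
 */
static bool toy_resume(bool use_fix)
{
	toy_local_irq_disable();
	if (use_fix)
		toy_retrigger_next_event(0);	/* local CPU only */
	else
		toy_on_each_cpu_up(toy_retrigger_next_event, 0);
	return toy_irqs_enabled;
}
```

With use_fix false the model leaves interrupts enabled after resume;
with use_fix true they stay off, which mirrors the direct local call
that hres_timers_resume() makes in the patch.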
[PATCH 01/14] sysfs: fix i_ino handling in sysfs
Inode number handling was incorrect in two ways.

1. sysfs uses the inode number allocated by new_inode() and never
   hashes it.  When reporting the inode number, it uses iunique() if
   the inode is inaccessible.  This is incorrect because iunique()
   assumes the inodes are hashed.  This can cause duplicate inode
   numbers, and the condition is likely to happen because new_inode()
   and iunique() use separate increasing static counters to scan for
   an empty slot.

2. sysfs_dirent->s_dentry can go away anytime and can't be referenced
   unless the caller knows the dentry is not deleted and is not going
   to be deleted.

This patch makes sysfs report the pointer to sysfs_dirent as ino.
ino_t is always as big as or larger than unsigned long, and the
sysfs_dirent hierarchy is the internal representation of the sysfs
tree, so this makes sense and is simple to implement.

Signed-off-by: Tejun Heo <[EMAIL PROTECTED]>
---
 fs/sysfs/dir.c   |   11 ++++-------
 fs/sysfs/inode.c |    1 +
 2 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 85a6686..5112f88 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -504,19 +504,19 @@ static int sysfs_readdir(struct file * filp, void * dirent, filldir_t filldir)
 	struct sysfs_dirent * parent_sd = dentry->d_fsdata;
 	struct sysfs_dirent *cursor = filp->private_data;
 	struct list_head *p, *q = &cursor->s_sibling;
-	ino_t ino;
+	unsigned long ino;
 	int i = filp->f_pos;
 
 	switch (i) {
 		case 0:
-			ino = dentry->d_inode->i_ino;
+			ino = (unsigned long)parent_sd;
 			if (filldir(dirent, ".", 1, i, ino, DT_DIR) < 0)
 				break;
 			filp->f_pos++;
 			i++;
 			/* fallthrough */
 		case 1:
-			ino = parent_ino(dentry);
+			ino = (unsigned long)dentry->d_parent->d_fsdata;
 			if (filldir(dirent, "..", 2, i, ino, DT_DIR) < 0)
 				break;
 			filp->f_pos++;
@@ -538,10 +538,7 @@ static int sysfs_readdir(struct file * filp, void * dirent, filldir_t filldir)
 
 				name = sysfs_get_name(next);
 				len = strlen(name);
-				if (next->s_dentry)
-					ino = next->s_dentry->d_inode->i_ino;
-				else
-					ino = iunique(sysfs_sb, 2);
+				ino = (unsigned long)next;
 
 				if (filldir(dirent, name, len, filp->f_pos, ino,
 						 dt_type(next)) < 0)
diff --git a/fs/sysfs/inode.c b/fs/sysfs/inode.c
index 4de5c6b..b8b010c 100644
--- a/fs/sysfs/inode.c
+++ b/fs/sysfs/inode.c
@@ -140,6 +140,7 @@ struct inode * sysfs_new_inode(mode_t mode, struct sysfs_dirent * sd)
 		inode->i_mapping->a_ops = &sysfs_aops;
 		inode->i_mapping->backing_dev_info = &sysfs_backing_dev_info;
 		inode->i_op = &sysfs_inode_operations;
+		inode->i_ino = (unsigned long)sd;
 		lockdep_set_class(&inode->i_mutex, &sysfs_inode_imutex_key);
 
 		if (sd->s_iattr) {
-- 
1.5.0.3
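Point 1 of the changelog above (two independent counters, with the
uniqueness check only looking at hashed inodes) can be illustrated with
a toy model. This is a sketch under assumptions, not the VFS code: the
toy_* names are invented here, and the lookup is hard-wired to find
nothing because sysfs never hashed its inodes.

```c
#include <assert.h>
#include <stdbool.h>

static unsigned long toy_new_inode_counter;	/* stands in for new_inode()'s counter */
static unsigned long toy_iunique_counter;	/* stands in for iunique()'s counter */

/* Unhashed inodes are invisible to a hash lookup, so nothing is found. */
static bool toy_find_hashed(unsigned long ino)
{
	(void)ino;
	return false;
}

/* Hands out the next number from the first counter. */
static unsigned long toy_new_inode(void)
{
	return ++toy_new_inode_counter;
}

/* Hands out the next number above max_reserved that is not hashed. */
static unsigned long toy_iunique(unsigned long max_reserved)
{
	for (;;) {
		if (toy_iunique_counter > max_reserved) {
			unsigned long res = toy_iunique_counter++;

			if (!toy_find_hashed(res))
				return res;
		} else {
			toy_iunique_counter = max_reserved + 1;
		}
	}
}

/* True if the "unique" number repeats one already handed out. */
static bool toy_duplicate_ino(void)
{
	unsigned long a = toy_new_inode();
	unsigned long b = toy_new_inode();
	unsigned long c = toy_new_inode();
	unsigned long reported = toy_iunique(2);

	assert(a != b && b != c && a != c);
	return reported == a || reported == b || reported == c;
}
```

Using the sysfs_dirent pointer as the inode number avoids both counters
entirely, since two live objects never share an address.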
[PATCH 02/14] sysfs: fix error handling in binattr write()
Error handling in fs/sysfs/bin.c:write() was wrong because size_t
count is used to receive return value from flush_write() which is
negative on failure.  This patch updates write() such that int
variable is used instead.  read() is updated the same way for
consistency.

Signed-off-by: Tejun Heo <[EMAIL PROTECTED]>
---
 fs/sysfs/bin.c |   21 ++++++++-------------
 1 files changed, 8 insertions(+), 13 deletions(-)

diff --git a/fs/sysfs/bin.c b/fs/sysfs/bin.c
index d3b9f5f..8273dd6 100644
--- a/fs/sysfs/bin.c
+++ b/fs/sysfs/bin.c
@@ -33,16 +33,13 @@ fill_read(struct dentry *dentry, char *buffer, loff_t off, size_t count)
 }
 
 static ssize_t
-read(struct file * file, char __user * userbuf, size_t count, loff_t * off)
+read(struct file *file, char __user *userbuf, size_t bytes, loff_t *off)
 {
 	char *buffer = file->private_data;
 	struct dentry *dentry = file->f_path.dentry;
 	int size = dentry->d_inode->i_size;
 	loff_t offs = *off;
-	int ret;
-
-	if (count > PAGE_SIZE)
-		count = PAGE_SIZE;
+	int count = min_t(size_t, bytes, PAGE_SIZE);
 
 	if (size) {
 		if (offs > size)
@@ -51,10 +48,9 @@ read(struct file * file, char __user * userbuf, size_t count, loff_t * off)
 			count = size - offs;
 	}
 
-	ret = fill_read(dentry, buffer, offs, count);
-	if (ret < 0)
-		return ret;
-	count = ret;
+	count = fill_read(dentry, buffer, offs, count);
+	if (count < 0)
+		return count;
 
 	if (copy_to_user(userbuf, buffer, count))
 		return -EFAULT;
@@ -78,16 +74,15 @@ flush_write(struct dentry *dentry, char *buffer, loff_t offset, size_t count)
 	return attr->write(kobj, buffer, offset, count);
 }
 
-static ssize_t write(struct file * file, const char __user * userbuf,
-		     size_t count, loff_t * off)
+static ssize_t write(struct file *file, const char __user *userbuf,
+		     size_t bytes, loff_t *off)
 {
 	char *buffer = file->private_data;
 	struct dentry *dentry = file->f_path.dentry;
 	int size = dentry->d_inode->i_size;
 	loff_t offs = *off;
+	int count = min_t(size_t, bytes, PAGE_SIZE);
 
-	if (count > PAGE_SIZE)
-		count = PAGE_SIZE;
 	if (size) {
 		if (offs > size)
 			return 0;
-- 
1.5.0.3
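The bug class this changelog describes, a negative error code stored in
a size_t, is easy to reproduce in userspace. The sketch below is only
an illustration: toy_flush_write() is a hypothetical stand-in that
always fails, and the "count > 0" test is an assumed example of the
kind of success check that misfires.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a callback that fails with -EIO. */
static int toy_flush_write(void)
{
	return -EIO;
}

/* Pre-patch pattern: the result lands in an unsigned size_t. */
static int looks_like_success_size_t(void)
{
	/* -EIO wraps around to a huge positive value here. */
	size_t count = (size_t)toy_flush_write();

	return count > 0;
}

/* Post-patch pattern: a signed int keeps the error negative. */
static int looks_like_success_int(void)
{
	int count = toy_flush_write();

	return count > 0;
}
```

With the unsigned variable the failure is taken for a large successful
write; with the signed one the error is seen as an error.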
[PATCH 14/14] sysfs: kill unnecessary attribute->owner
sysfs is now completely out of driver/module lifetime game.  After
deletion, a sysfs node doesn't access anything outside sysfs proper,
so there's no reason to hold onto the attribute owners.  Note that
often the wrong modules were accounted for as owners leading to
accessing removed modules.

This patch kills now unnecessary attribute->owner.  Note that with
this change, userland holding a sysfs node does not prevent the
backing module from being unloaded.

For more info regarding lifetime rule cleanup, please read the
following message.

  http://article.gmane.org/gmane.linux.kernel/510293

Signed-off-by: Tejun Heo <[EMAIL PROTECTED]>
---
 drivers/base/class.c                        |    2 --
 drivers/base/core.c                         |    4
 drivers/base/firmware_class.c               |    2 +-
 drivers/block/pktcdvd.c                     |    3 +--
 drivers/char/ipmi/ipmi_msghandler.c         |   10 --
 drivers/cpufreq/cpufreq_stats.c             |    3 +--
 drivers/cpufreq/cpufreq_userspace.c         |    2 +-
 drivers/cpufreq/freq_table.c                |    1 -
 drivers/firmware/dcdbas.h                   |    3 +--
 drivers/firmware/dell_rbu.c                 |    6 +++---
 drivers/firmware/edd.c                      |    2 +-
 drivers/firmware/efivars.c                  |    6 +++---
 drivers/i2c/chips/eeprom.c                  |    1 -
 drivers/i2c/chips/max6875.c                 |    1 -
 drivers/infiniband/core/sysfs.c             |    1 -
 drivers/input/mouse/psmouse.h               |    1 -
 drivers/media/video/pvrusb2/pvrusb2-sysfs.c |   13 -
 drivers/misc/asus-laptop.c                  |    3 +--
 drivers/pci/hotplug/acpiphp_ibm.c           |    1 -
 drivers/pci/pci-sysfs.c                     |    4
 drivers/pcmcia/socket_sysfs.c               |    2 +-
 drivers/rtc/rtc-ds1553.c                    |    1 -
 drivers/rtc/rtc-ds1742.c                    |    1 -
 drivers/scsi/arcmsr/arcmsr_attr.c           |    3 ---
 drivers/scsi/lpfc/lpfc_attr.c               |    2 --
 drivers/scsi/qla2xxx/qla_attr.c             |    6 --
 drivers/spi/at25.c                          |    1 -
 drivers/video/aty/radeon_base.c             |    2 --
 drivers/video/backlight/backlight.c         |    2 +-
 drivers/video/backlight/lcd.c               |    2 +-
 drivers/w1/slaves/w1_ds2433.c               |    1 -
 drivers/w1/slaves/w1_therm.c                |    1 -
 drivers/w1/w1.c                             |    2 --
 fs/ecryptfs/main.c                          |    2 --
 fs/ocfs2/cluster/masklog.c                  |    1 -
 fs/partitions/check.c                       |    1 -
 fs/sysfs/bin.c                              |   19 +--
 fs/sysfs/file.c                             |   24 +---
 include/linux/sysdev.h                      |    3 +--
 include/linux/sysfs.h                       |    7 +++
 kernel/module.c                             |    9 +++--
 kernel/params.c                             |    1 -
 net/bridge/br_sysfs_br.c                    |    3 +--
 net/bridge/br_sysfs_if.c                    |    3 +--
 44 files changed, 35 insertions(+), 133 deletions(-)

diff --git a/drivers/base/class.c b/drivers/base/class.c
index d596812..064c1de 100644
--- a/drivers/base/class.c
+++ b/drivers/base/class.c
@@ -624,7 +624,6 @@ int class_device_add(struct class_device *class_dev)
 		goto out3;
 	class_dev->uevent_attr.attr.name = "uevent";
 	class_dev->uevent_attr.attr.mode = S_IWUSR;
-	class_dev->uevent_attr.attr.owner = parent_class->owner;
 	class_dev->uevent_attr.store = store_uevent;
 	error = class_device_create_file(class_dev, &class_dev->uevent_attr);
 	if (error)
@@ -639,7 +638,6 @@ int class_device_add(struct class_device *class_dev)
 		}
 		attr->attr.name = "dev";
 		attr->attr.mode = S_IRUGO;
-		attr->attr.owner = parent_class->owner;
 		attr->show = show_dev;
 		error = class_device_create_file(class_dev, attr);
 		if (error) {
diff --git a/drivers/base/core.c b/drivers/base/core.c
index d7fcf82..37930d0 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -563,8 +563,6 @@ int device_add(struct device *dev)
 
 	dev->uevent_attr.attr.name = "uevent";
 	dev->uevent_attr.attr.mode = S_IWUSR;
-	if (dev->driver)
-		dev->uevent_attr.attr.owner = dev->driver->owner;
 	dev->uevent_attr.store = store_uevent;
 	error = device_create_file(dev, &dev->uevent_attr);
 	if (error)
@@ -579,8 +577,6 @@ int device_add(struct device *dev)
 		}
 		attr->attr.name = "dev";
 		attr->attr.mode = S_IRUGO;
-		if (dev->driver)
-			attr->attr.owner = dev->driver->owner;
 		attr->show = show_dev;
 		error = device_create_file(dev, attr);
 		i
[PATCH 08/14] sysfs: make sysfs_dirent->s_element a union
Make sd->s_element a union of sysfs_elem_{dir|symlink|attr|bin_attr}
and rename it to s_elem. This is to achieve...

* some level of type checking : changing symlink to point to
  sysfs_dirent instead of kobject is much safer and less painful now.
* easier / standardized dereferencing
* allow sysfs_elem_* to contain more than one entry

Where possible, the pointer is obtained by dereferencing sd directly
instead of going through other entities. This reduces dependencies on
dentry, inode and kobject. to_attr() and to_bin_attr() are unused now
and removed.

This is in preparation of object reference simplification.

Signed-off-by: Tejun Heo <[EMAIL PROTECTED]>
---
 fs/sysfs/bin.c     | 18 +++---
 fs/sysfs/dir.c     | 40 ++--
 fs/sysfs/file.c    | 19 ++-
 fs/sysfs/inode.c   |  2 +-
 fs/sysfs/mount.c   |  1 -
 fs/sysfs/symlink.c | 23 ---
 fs/sysfs/sysfs.h   | 47 +++
 7 files changed, 71 insertions(+), 79 deletions(-)

diff --git a/fs/sysfs/bin.c b/fs/sysfs/bin.c
index 8273dd6..0f0027b 100644
--- a/fs/sysfs/bin.c
+++ b/fs/sysfs/bin.c
@@ -23,7 +23,8 @@
 static int
 fill_read(struct dentry *dentry, char *buffer, loff_t off, size_t count)
 {
-	struct bin_attribute * attr = to_bin_attr(dentry);
+	struct sysfs_dirent *attr_sd = dentry->d_fsdata;
+	struct bin_attribute *attr = attr_sd->s_elem.bin_attr.bin_attr;
 	struct kobject * kobj = to_kobj(dentry->d_parent);

 	if (!attr->read)
@@ -65,7 +66,8 @@ read(struct file *file, char __user *userbuf, size_t bytes, loff_t *off)
 static int
 flush_write(struct dentry *dentry, char *buffer, loff_t offset, size_t count)
 {
-	struct bin_attribute *attr = to_bin_attr(dentry);
+	struct sysfs_dirent *attr_sd = dentry->d_fsdata;
+	struct bin_attribute *attr = attr_sd->s_elem.bin_attr.bin_attr;
 	struct kobject *kobj = to_kobj(dentry->d_parent);

 	if (!attr->write)
@@ -101,9 +103,9 @@ static ssize_t write(struct file *file, const char __user *userbuf,

 static int mmap(struct file *file, struct vm_area_struct *vma)
 {
-	struct dentry *dentry = file->f_path.dentry;
-	struct bin_attribute *attr = to_bin_attr(dentry);
-	struct kobject *kobj = to_kobj(dentry->d_parent);
+	struct sysfs_dirent *attr_sd = file->f_path.dentry->d_fsdata;
+	struct bin_attribute *attr = attr_sd->s_elem.bin_attr.bin_attr;
+	struct kobject *kobj = to_kobj(file->f_path.dentry->d_parent);

 	if (!attr->mmap)
 		return -EINVAL;
@@ -114,7 +116,8 @@ static int mmap(struct file *file, struct vm_area_struct *vma)
 static int open(struct inode * inode, struct file * file)
 {
 	struct kobject *kobj = sysfs_get_kobject(file->f_path.dentry->d_parent);
-	struct bin_attribute * attr = to_bin_attr(file->f_path.dentry);
+	struct sysfs_dirent *attr_sd = file->f_path.dentry->d_fsdata;
+	struct bin_attribute *attr = attr_sd->s_elem.bin_attr.bin_attr;
 	int error = -EINVAL;

 	if (!kobj || !attr)
@@ -150,7 +153,8 @@ static int open(struct inode * inode, struct file * file)
 static int release(struct inode * inode, struct file * file)
 {
 	struct kobject * kobj = to_kobj(file->f_path.dentry->d_parent);
-	struct bin_attribute * attr = to_bin_attr(file->f_path.dentry);
+	struct sysfs_dirent *attr_sd = file->f_path.dentry->d_fsdata;
+	struct bin_attribute *attr = attr_sd->s_elem.bin_attr.bin_attr;
 	u8 * buffer = file->private_data;

 	kobject_put(kobj);
diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 26c3088..d70ead5 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -21,11 +21,10 @@ struct kobject *sysfs_get_kobject(struct dentry *dentry)
 	spin_lock(&dcache_lock);
 	if (!d_unhashed(dentry)) {
 		struct sysfs_dirent * sd = dentry->d_fsdata;
-		if (sd->s_type & SYSFS_KOBJ_LINK) {
-			struct sysfs_symlink * sl = sd->s_element;
-			kobj = kobject_get(sl->target_kobj);
-		} else
-			kobj = kobject_get(sd->s_element);
+		if (sd->s_type & SYSFS_KOBJ_LINK)
+			kobj = kobject_get(sd->s_elem.symlink.target_kobj);
+		else
+			kobj = kobject_get(sd->s_elem.dir.kobj);
 	}
 	spin_unlock(&dcache_lock);

@@ -39,11 +38,8 @@ void release_sysfs_dirent(struct sysfs_dirent * sd)
  repeat:
 	parent_sd = sd->s_parent;

-	if (sd->s_type & SYSFS_KOBJ_LINK) {
-		struct sysfs_symlink * sl = sd->s_element;
-		kobject_put(sl->target_kobj);
-		kfree(sl);
-	}
+	if (sd->s_type & SYSFS_KOBJ_LINK)
+		kobject_put(sd->s_elem.symlink.target_kobj);
 	if (sd->s_type & SYSFS_COPY_NAME)
 		kfree(sd->s_name);
 	kfree(sd->s_iattr);
@@ -70,8 +66,7 @
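The union idea in this patch can be sketched outside the kernel. The following is a minimal user-space illustration, not the sysfs code itself: struct dirent_sketch, enum elem_type and dirent_kobj() are hypothetical names, and the kobject and attribute structs are empty stand-ins. It only shows why a union of small typed structs gives some type checking that a bare void pointer does not.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; only the shape matters. */
struct kobject   { const char *name; };
struct attribute { const char *name; };

enum elem_type { ELEM_DIR, ELEM_SYMLINK, ELEM_ATTR };

struct elem_dir     { struct kobject *kobj; };
struct elem_symlink { struct kobject *target_kobj; };
struct elem_attr    { struct attribute *attr; };

struct dirent_sketch {
	enum elem_type type;		/* says which union member is valid */
	union {
		struct elem_dir		dir;
		struct elem_symlink	symlink;
		struct elem_attr	attr;
	} s_elem;
};

/* Return the kobject a node refers to, or NULL for attribute nodes. */
struct kobject *dirent_kobj(const struct dirent_sketch *sd)
{
	switch (sd->type) {
	case ELEM_DIR:
		return sd->s_elem.dir.kobj;
	case ELEM_SYMLINK:
		return sd->s_elem.symlink.target_kobj;
	default:
		return NULL;
	}
}
```

With a void pointer, assigning an attribute where a kobject is expected compiles silently; here each member has its own pointer type, so the compiler flags the mismatch.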
[PATCH 05/14] sysfs: consolidate sysfs_dirent creation functions
Currently there are four functions to create sysfs_dirent -
__sysfs_new_dirent(), sysfs_new_dirent(), __sysfs_make_dirent() and
sysfs_make_dirent(). Other than sysfs_make_dirent(), no function has
two users if calls to implement other functions are excluded.

This patch consolidates sysfs_dirent creation functions into the
following two.

* sysfs_new_dirent() : allocate and initialize
* sysfs_attach_dirent() : attach to sysfs_dirent hierarchy and/or
  associate with dentry

This simplifies interface and gives callers more flexibility. This is
in preparation of object reference simplification.

Signed-off-by: Tejun Heo <[EMAIL PROTECTED]>
---
 fs/sysfs/dir.c     | 82
 fs/sysfs/file.c    | 21 ++---
 fs/sysfs/symlink.c |  7 ++--
 fs/sysfs/sysfs.h   |  7 +++-
 4 files changed, 50 insertions(+), 67 deletions(-)

diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 99a0ba1..653c23c 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -60,10 +60,7 @@ static struct dentry_operations sysfs_dentry_ops = {
 	.d_iput		= sysfs_d_iput,
 };

-/*
- * Allocates a new sysfs_dirent and links it to the parent sysfs_dirent
- */
-static struct sysfs_dirent * __sysfs_new_dirent(void * element)
+struct sysfs_dirent *sysfs_new_dirent(void *element, umode_t mode, int type)
 {
 	struct sysfs_dirent * sd;

@@ -75,25 +72,25 @@ static struct sysfs_dirent * __sysfs_new_dirent(void * element)
 	atomic_set(&sd->s_event, 1);
 	INIT_LIST_HEAD(&sd->s_children);
 	INIT_LIST_HEAD(&sd->s_sibling);
+	sd->s_element = element;
+	sd->s_mode = mode;
+	sd->s_type = type;

 	return sd;
 }

-static void __sysfs_list_dirent(struct sysfs_dirent *parent_sd,
-			      struct sysfs_dirent *sd)
+void sysfs_attach_dirent(struct sysfs_dirent *sd,
+			 struct sysfs_dirent *parent_sd, struct dentry *dentry)
 {
-	if (sd)
-		list_add(&sd->s_sibling, &parent_sd->s_children);
-}
+	if (dentry) {
+		sd->s_dentry = dentry;
+		dentry->d_fsdata = sysfs_get(sd);
+		dentry->d_op = &sysfs_dentry_ops;
+	}

-static struct sysfs_dirent * sysfs_new_dirent(struct sysfs_dirent *parent_sd,
-						void * element)
-{
-	struct sysfs_dirent *sd;
-	sd = __sysfs_new_dirent(element);
-	__sysfs_list_dirent(parent_sd, sd);
-	return sd;
+	if (parent_sd)
+		list_add(&sd->s_sibling, &parent_sd->s_children);
 }

 /*
@@ -121,39 +118,6 @@ int sysfs_dirent_exist(struct sysfs_dirent *parent_sd,
 	return 0;
 }

-
-static struct sysfs_dirent *
-__sysfs_make_dirent(struct dentry *dentry, void *element, mode_t mode, int type)
-{
-	struct sysfs_dirent * sd;
-
-	sd = __sysfs_new_dirent(element);
-	if (!sd)
-		goto out;
-
-	sd->s_mode = mode;
-	sd->s_type = type;
-	sd->s_dentry = dentry;
-	if (dentry) {
-		dentry->d_fsdata = sysfs_get(sd);
-		dentry->d_op = &sysfs_dentry_ops;
-	}
-
-out:
-	return sd;
-}
-
-int sysfs_make_dirent(struct sysfs_dirent * parent_sd, struct dentry * dentry,
-			void * element, umode_t mode, int type)
-{
-	struct sysfs_dirent *sd;
-
-	sd = __sysfs_make_dirent(dentry, element, mode, type);
-	__sysfs_list_dirent(parent_sd, sd);
-
-	return sd ? 0 : -ENOMEM;
-}
-
 static int init_dir(struct inode * inode)
 {
 	inode->i_op = &sysfs_dir_inode_operations;
@@ -197,10 +161,11 @@ static int create_dir(struct kobject *kobj, struct dentry *parent,
 	if (sysfs_dirent_exist(parent->d_fsdata, name))
 		goto out_dput;

-	error = sysfs_make_dirent(parent->d_fsdata, dentry, kobj, mode,
-				  SYSFS_DIR);
-	if (error)
+	error = -ENOMEM;
+	sd = sysfs_new_dirent(kobj, mode, SYSFS_DIR);
+	if (!sd)
 		goto out_drop;
+	sysfs_attach_dirent(sd, parent->d_fsdata, dentry);

 	error = sysfs_create(dentry, mode, init_dir);
 	if (error)
@@ -215,7 +180,6 @@ static int create_dir(struct kobject *kobj, struct dentry *parent,
 	goto out_dput;

  out_sput:
-	sd = dentry->d_fsdata;
 	list_del_init(&sd->s_sibling);
 	sysfs_put(sd);
  out_drop:
@@ -512,13 +476,16 @@ static int sysfs_dir_open(struct inode *inode, struct file *file)
 {
 	struct dentry * dentry = file->f_path.dentry;
 	struct sysfs_dirent * parent_sd = dentry->d_fsdata;
+	struct sysfs_dirent * sd;

 	mutex_lock(&dentry->d_inode->i_mutex);
-	file->private_data = sysfs_new_dirent(parent_sd, NULL);
+	sd = sysfs_new_dirent(NULL, 0, 0);
+	if (sd)
+		sysfs_attach_dirent(sd, parent_sd, NULL);
 	mutex_unlock(&dentry->d_inode->i_mutex);

-	return file->private_da
[PATCH 11/14] sysfs: implement bin_buffer
Implement bin_buffer which contains a mutex and pointer to PAGE_SIZE
buffer to properly synchronize accesses to per-openfile buffer and
prepare for immediate-kobj-disconnect.

Signed-off-by: Tejun Heo <[EMAIL PROTECTED]>
---
 fs/sysfs/bin.c | 64 ++-
 1 files changed, 49 insertions(+), 15 deletions(-)

diff --git a/fs/sysfs/bin.c b/fs/sysfs/bin.c
index 0f0027b..1dd1bf1 100644
--- a/fs/sysfs/bin.c
+++ b/fs/sysfs/bin.c
@@ -20,6 +20,11 @@

 #include "sysfs.h"

+struct bin_buffer {
+	struct mutex	mutex;
+	void		*buffer;
+};
+
 static int
 fill_read(struct dentry *dentry, char *buffer, loff_t off, size_t count)
 {
@@ -36,7 +41,7 @@ fill_read(struct dentry *dentry, char *buffer, loff_t off, size_t count)
 static ssize_t
 read(struct file *file, char __user *userbuf, size_t bytes, loff_t *off)
 {
-	char *buffer = file->private_data;
+	struct bin_buffer *bb = file->private_data;
 	struct dentry *dentry = file->f_path.dentry;
 	int size = dentry->d_inode->i_size;
 	loff_t offs = *off;
@@ -49,17 +54,23 @@ read(struct file *file, char __user *userbuf, size_t bytes, loff_t *off)
 			count = size - offs;
 	}

-	count = fill_read(dentry, buffer, offs, count);
+	mutex_lock(&bb->mutex);
+
+	count = fill_read(dentry, bb->buffer, offs, count);
 	if (count < 0)
-		return count;
+		goto out_unlock;

-	if (copy_to_user(userbuf, buffer, count))
-		return -EFAULT;
+	if (copy_to_user(userbuf, bb->buffer, count)) {
+		count = -EFAULT;
+		goto out_unlock;
+	}

 	pr_debug("offs = %lld, *off = %lld, count = %zd\n", offs, *off, count);

 	*off = offs + count;

+ out_unlock:
+	mutex_unlock(&bb->mutex);
 	return count;
 }

@@ -79,7 +90,7 @@ flush_write(struct dentry *dentry, char *buffer, loff_t offset, size_t count)
 static ssize_t write(struct file *file, const char __user *userbuf,
 		     size_t bytes, loff_t *off)
 {
-	char *buffer = file->private_data;
+	struct bin_buffer *bb = file->private_data;
 	struct dentry *dentry = file->f_path.dentry;
 	int size = dentry->d_inode->i_size;
 	loff_t offs = *off;
@@ -92,25 +103,38 @@ static ssize_t write(struct file *file, const char __user *userbuf,
 			count = size - offs;
 	}

-	if (copy_from_user(buffer, userbuf, count))
-		return -EFAULT;
+	mutex_lock(&bb->mutex);
+
+	if (copy_from_user(bb->buffer, userbuf, count)) {
+		count = -EFAULT;
+		goto out_unlock;
+	}

-	count = flush_write(dentry, buffer, offs, count);
+	count = flush_write(dentry, bb->buffer, offs, count);
 	if (count > 0)
 		*off = offs + count;
+
+ out_unlock:
+	mutex_unlock(&bb->mutex);
 	return count;
 }

 static int mmap(struct file *file, struct vm_area_struct *vma)
 {
+	struct bin_buffer *bb = file->private_data;
 	struct sysfs_dirent *attr_sd = file->f_path.dentry->d_fsdata;
 	struct bin_attribute *attr = attr_sd->s_elem.bin_attr.bin_attr;
 	struct kobject *kobj = to_kobj(file->f_path.dentry->d_parent);
+	int rc;

 	if (!attr->mmap)
 		return -EINVAL;

-	return attr->mmap(kobj, attr, vma);
+	mutex_lock(&bb->mutex);
+	rc = attr->mmap(kobj, attr, vma);
+	mutex_unlock(&bb->mutex);
+
+	return rc;
 }

 static int open(struct inode * inode, struct file * file)
@@ -118,6 +142,7 @@ static int open(struct inode * inode, struct file * file)
 	struct kobject *kobj = sysfs_get_kobject(file->f_path.dentry->d_parent);
 	struct sysfs_dirent *attr_sd = file->f_path.dentry->d_fsdata;
 	struct bin_attribute *attr = attr_sd->s_elem.bin_attr.bin_attr;
+	struct bin_buffer *bb = NULL;
 	int error = -EINVAL;

 	if (!kobj || !attr)
@@ -135,14 +160,22 @@ static int open(struct inode * inode, struct file * file)
 		goto Error;

 	error = -ENOMEM;
-	file->private_data = kmalloc(PAGE_SIZE, GFP_KERNEL);
-	if (!file->private_data)
+	bb = kzalloc(sizeof(*bb), GFP_KERNEL);
+	if (!bb)
 		goto Error;

+	bb->buffer = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!bb->buffer)
+		goto Error;
+
+	mutex_init(&bb->mutex);
+	file->private_data = bb;
+
 	error = 0;
-goto Done;
+	goto Done;

  Error:
+	kfree(bb);
 	module_put(attr->attr.owner);
  Done:
 	if (error)
@@ -155,11 +188,12 @@ static int release(struct inode * inode, struct file * file)
 	struct kobject * kobj = to_kobj(file->f_path.dentry->d_parent);
 	struct sysfs_dirent *attr_sd = file->f_path.dentry->d_fsdata;
 	struct bin_attribute *attr = attr_sd->s_elem.bin_attr.bin_attr;
-	u8 * buffer
[PATCH 06/14] sysfs: add sysfs_dirent->s_parent
Add sysfs_dirent->s_parent. With this patch, each sd points to and
holds a reference to its parent. This allows walking the sysfs tree
without referencing sd->s_dentry, which can go away anytime if the
user doesn't control when it's deleted.

sd->s_parent is initialized and the parent is referenced in
sysfs_attach_dirent(). The reference to the parent is released when
the sd is released, so as long as a reference to an sd is held,
s_parent can be followed.

The dentry walk in sysfs_readdir() is converted to an s_parent walk.

This will be used to reimplement symlink such that it uses only the
sysfs_dirent tree.

Signed-off-by: Tejun Heo <[EMAIL PROTECTED]>
---
 fs/sysfs/dir.c   | 27 ---
 fs/sysfs/mount.c |  1 +
 fs/sysfs/sysfs.h |  1 +
 3 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 653c23c..ef45c3e 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -34,6 +34,11 @@ struct kobject *sysfs_get_kobject(struct dentry *dentry)

 void release_sysfs_dirent(struct sysfs_dirent * sd)
 {
+	struct sysfs_dirent *parent_sd;
+
+ repeat:
+	parent_sd = sd->s_parent;
+
 	if (sd->s_type & SYSFS_KOBJ_LINK) {
 		struct sysfs_symlink * sl = sd->s_element;
 		kfree(sl->link_name);
@@ -42,6 +47,10 @@ void release_sysfs_dirent(struct sysfs_dirent * sd)
 	}
 	kfree(sd->s_iattr);
 	kmem_cache_free(sysfs_dir_cachep, sd);
+
+	sd = parent_sd;
+	if (sd && atomic_dec_and_test(&sd->s_count))
+		goto repeat;
 }

 static void sysfs_d_iput(struct dentry * dentry, struct inode * inode)
@@ -89,8 +98,10 @@ void sysfs_attach_dirent(struct sysfs_dirent *sd,
 		dentry->d_op = &sysfs_dentry_ops;
 	}

-	if (parent_sd)
+	if (parent_sd) {
+		sd->s_parent = sysfs_get(parent_sd);
 		list_add(&sd->s_sibling, &parent_sd->s_children);
+	}
 }

 /*
@@ -526,7 +537,7 @@ static int sysfs_readdir(struct file * filp, void * dirent, filldir_t filldir)
 			i++;
 			/* fallthrough */
 		case 1:
-			ino = (unsigned long)dentry->d_parent->d_fsdata;
+			ino = (unsigned long)parent_sd->s_parent;
 			if (filldir(dirent, "..", 2, i, ino, DT_DIR) < 0)
 				break;
 			filp->f_pos++;
@@ -643,13 +654,13 @@ int sysfs_make_shadowed_dir(struct kobject *kobj,

 struct dentry *sysfs_create_shadow_dir(struct kobject *kobj)
 {
+	struct dentry *dir = kobj->dentry;
+	struct inode *inode = dir->d_inode;
+	struct dentry *parent = dir->d_parent;
+	struct sysfs_dirent *parent_sd = parent->d_fsdata;
+	struct dentry *shadow;
 	struct sysfs_dirent *sd;
-	struct dentry *parent, *dir, *shadow;
-	struct inode *inode;

-	dir = kobj->dentry;
-	inode = dir->d_inode;
-	parent = dir->d_parent;
 	shadow = ERR_PTR(-EINVAL);
 	if (!sysfs_is_shadowed_inode(inode))
 		goto out;
@@ -661,6 +672,8 @@ struct dentry *sysfs_create_shadow_dir(struct kobject *kobj)
 	sd = sysfs_new_dirent(kobj, inode->i_mode, SYSFS_DIR);
 	if (!sd)
 		goto nomem;
+	/* point to parent_sd but don't attach to it */
+	sd->s_parent = sysfs_get(parent_sd);
 	sysfs_attach_dirent(sd, NULL, shadow);

 	d_instantiate(shadow, igrab(inode));
diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index 23a48a3..141f7b1 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -28,6 +28,7 @@ static const struct super_operations sysfs_ops = {
 };

 static struct sysfs_dirent sysfs_root = {
+	.s_count	= ATOMIC_INIT(1),
 	.s_sibling	= LIST_HEAD_INIT(sysfs_root.s_sibling),
 	.s_children	= LIST_HEAD_INIT(sysfs_root.s_children),
 	.s_element	= NULL,
diff --git a/fs/sysfs/sysfs.h b/fs/sysfs/sysfs.h
index 104649c..b4876de 100644
--- a/fs/sysfs/sysfs.h
+++ b/fs/sysfs/sysfs.h
@@ -1,5 +1,6 @@
 struct sysfs_dirent {
 	atomic_t		s_count;
+	struct sysfs_dirent	* s_parent;
 	struct list_head	s_sibling;
 	struct list_head	s_children;
 	void 			* s_element;
-- 
1.5.0.3
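The release loop above (free the node, then drop the reference it held on its parent, repeating while counts reach zero) can be tried in user space. This is a rough sketch under simplifying assumptions: struct node, node_new(), node_attach(), node_put() and the nodes_freed counter are made-up names, and a plain int stands in for the kernel's atomic_t, so it is not safe for concurrent use.

```c
#include <assert.h>
#include <stdlib.h>

struct node {
	int refcnt;		/* plain int here; the kernel uses atomic_t */
	struct node *parent;	/* counted reference to the parent, may be NULL */
};

int nodes_freed;		/* bookkeeping for the demo only */

/* Allocate a node holding the creator's initial reference. */
struct node *node_new(void)
{
	struct node *n = calloc(1, sizeof(*n));

	if (n)
		n->refcnt = 1;
	return n;
}

/* Point n at parent and take a reference on parent for it. */
void node_attach(struct node *n, struct node *parent)
{
	if (parent) {
		parent->refcnt++;
		n->parent = parent;
	}
}

/*
 * Drop one reference. When a node dies it releases the reference it
 * held on its parent, so the walk continues upwards iteratively
 * instead of recursing.
 */
void node_put(struct node *n)
{
	while (n && --n->refcnt == 0) {
		struct node *parent = n->parent;

		free(n);
		nodes_freed++;
		n = parent;
	}
}
```

Dropping the creator's reference on a parent that still has children leaves it alive; it is freed only when the last child goes away, which is what lets s_parent be followed safely while a reference to the child is held.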
Re: [patch] high-res timers: UP resume fix
* Ingo Molnar <[EMAIL PROTECTED]> wrote:

> [...] Soeren, can you confirm that you are using a !CONFIG_SMP kernel,
> and if yes, does the patch below fix the resume problem for you?

Hm, you seem to have a CONFIG_SMP=y kernel. I don't immediately see
where we re-enable interrupts in the SMP case, but could you try my
patch nevertheless?

	Ingo
[PATCH 12/14] sysfs: implement immediate kobject disconnect
Opening a sysfs node references its associated kobject, so userland
can arbitrarily prolong lifetime of a kobject which complicates
lifetime rules in drivers. This patch makes the association between
kobject and sysfs immediately breakable.

Each sysfs_dirent representing a kobject has a rwsem. Any file
operation which has to access the associated kobject should read lock
the rwsem and check whether the pointer is still valid. The read lock
should be held until access to the kobj is not necessary. On
sysfs_dirent deletion, the rwsem is write locked and the pointer is
cleared. This ensures that there is no user to the kobj through the
sysfs_dirent. This way, sysfs_dirent doesn't have to hold reference to
its associated kobj and can disconnect from it immediately on
deletion.

sysfs_get_dir_kobj() and sysfs_put_dir_kobj() read lock and unlock
kobj access, respectively. As write locking is used only once during
deletion, blocking on down_read() indicates that the kobj will have
been disassociated when down_read() succeeds, so down_read_trylock()
is used in sysfs_get_dir_kobj() making the function non-blocking.

Unlike other operations, mmapped area lingers on after mmap() is
finished and the kobj needs to stay referenced till all the mapped
pages are gone. This is accomplished by holding one reference to kobj
if there have been any mmap during lifetime of an openfile. The
reference is dropped when the openfile is released.

Note that read locking kobject not only protects the kobject itself
but also ensures that the backing module doesn't go away while the
lock is held. IOW, any access to code or data out of sysfs core
shouldn't be made without grabbing kobject. So, access to attr should
be done while holding its parent's kobject.

This change makes sysfs lifetime rules independent from both kobject's
and module's. It not only fixes several race conditions caused by
sysfs not holding onto the proper module when referencing kobject, but
also helps fixing and simplifying lifetime management in driver model
and drivers by taking sysfs out of the equation.

Please read the following message for more info.

  http://article.gmane.org/gmane.linux.kernel/510293

Signed-off-by: Tejun Heo <[EMAIL PROTECTED]>
---
 fs/sysfs/bin.c   |  97 ++-
 fs/sysfs/dir.c   |  39 +++-
 fs/sysfs/file.c  | 133 --
 fs/sysfs/sysfs.h |  10 +---
 4 files changed, 172 insertions(+), 107 deletions(-)

diff --git a/fs/sysfs/bin.c b/fs/sysfs/bin.c
index 1dd1bf1..a2180c5 100644
--- a/fs/sysfs/bin.c
+++ b/fs/sysfs/bin.c
@@ -23,6 +23,7 @@
 struct bin_buffer {
 	struct mutex	mutex;
 	void		*buffer;
+	int		mmapped;
 };

 static int
@@ -30,12 +31,20 @@ fill_read(struct dentry *dentry, char *buffer, loff_t off, size_t count)
 {
 	struct sysfs_dirent *attr_sd = dentry->d_fsdata;
 	struct bin_attribute *attr = attr_sd->s_elem.bin_attr.bin_attr;
-	struct kobject * kobj = to_kobj(dentry->d_parent);
+	struct kobject *kobj;
+	int rc;
+
+	kobj = sysfs_get_dir_kobj(attr_sd->s_parent);
+	if (!kobj)
+		return -ENODEV;

-	if (!attr->read)
-		return -EIO;
+	rc = -EIO;
+	if (attr->read)
+		rc = attr->read(kobj, buffer, off, count);

-	return attr->read(kobj, buffer, off, count);
+	sysfs_put_dir_kobj(attr_sd->s_parent);
+
+	return rc;
 }

 static ssize_t
@@ -79,12 +88,20 @@ flush_write(struct dentry *dentry, char *buffer, loff_t offset, size_t count)
 {
 	struct sysfs_dirent *attr_sd = dentry->d_fsdata;
 	struct bin_attribute *attr = attr_sd->s_elem.bin_attr.bin_attr;
-	struct kobject *kobj = to_kobj(dentry->d_parent);
+	struct kobject *kobj;
+	int rc;
+
+	kobj = sysfs_get_dir_kobj(attr_sd->s_parent);
+	if (!kobj)
+		return -ENODEV;

-	if (!attr->write)
-		return -EIO;
+	rc = -EIO;
+	if (attr->write)
+		rc = attr->write(kobj, buffer, offset, count);

-	return attr->write(kobj, buffer, offset, count);
+	sysfs_put_dir_kobj(attr_sd->s_parent);
+
+	return rc;
 }

 static ssize_t write(struct file *file, const char __user *userbuf,
@@ -124,14 +141,24 @@ static int mmap(struct file *file, struct vm_area_struct *vma)
 	struct bin_buffer *bb = file->private_data;
 	struct sysfs_dirent *attr_sd = file->f_path.dentry->d_fsdata;
 	struct bin_attribute *attr = attr_sd->s_elem.bin_attr.bin_attr;
-	struct kobject *kobj = to_kobj(file->f_path.dentry->d_parent);
+	struct kobject *kobj;
 	int rc;

-	if (!attr->mmap)
-		return -EINVAL;
-
 	mutex_lock(&bb->mutex);
-	rc = attr->mmap(kobj, attr, vma);
+
+	kobj = sysfs_get_dir_kobj(attr_sd->s_parent);
+	if (!kobj)
+		return -ENODEV;
+
+	rc = -EI
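The get/put/disconnect protocol described in this patch can be approximated in user space. The sketch below is only an illustration: it swaps the kernel rwsem for a C11 atomic reader counter so that it stays self-contained, and struct assoc, assoc_get(), assoc_put() and assoc_try_disconnect() are hypothetical names. Unlike the write lock in the patch, the disconnect here does not wait for readers; it simply fails while any are active.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct assoc {
	atomic_int state;	/* >= 0: number of active readers, -1: disconnected */
	void *obj;		/* only valid while a read hold is active */
};

void assoc_init(struct assoc *a, void *obj)
{
	atomic_init(&a->state, 0);
	a->obj = obj;
}

/* Non-blocking read hold: returns the object, or NULL once disconnected. */
void *assoc_get(struct assoc *a)
{
	int cur = atomic_load(&a->state);

	while (cur >= 0) {
		/* on failure cur is reloaded and the loop re-checks it */
		if (atomic_compare_exchange_weak(&a->state, &cur, cur + 1))
			return a->obj;
	}
	return NULL;
}

/* Release a hold obtained from a successful assoc_get(). */
void assoc_put(struct assoc *a)
{
	atomic_fetch_sub(&a->state, 1);
}

/* One-shot disconnect: succeeds only when no reader holds the object. */
bool assoc_try_disconnect(struct assoc *a)
{
	int expected = 0;

	if (!atomic_compare_exchange_strong(&a->state, &expected, -1))
		return false;
	a->obj = NULL;
	return true;
}
```

As in the patch description, a failed get means the object is already gone (or about to be), so callers return an error such as -ENODEV instead of blocking.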
[PATCH 09/14] sysfs: implement kobj_sysfs_assoc_lock
kobj->dentry can go away anytime unless the user controls when the
associated sysfs node is deleted. This patch implements
kobj_sysfs_assoc_lock which protects kobj->dentry. This will be used
to maintain kobj based API when converting sysfs to use sysfs_dirent
tree instead of dentry/kobject.

Note that this lock belongs to kobject/driver-model not sysfs. Once
sysfs is converted to not use kobject in its interface, this can be
removed from sysfs.

This is in preparation of object reference simplification.

Signed-off-by: Tejun Heo <[EMAIL PROTECTED]>
---
 fs/sysfs/dir.c   | 8 +++-
 fs/sysfs/sysfs.h | 1 +
 2 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index d70ead5..8372c0c 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -13,6 +13,7 @@
 #include "sysfs.h"

 DECLARE_RWSEM(sysfs_rename_sem);
+spinlock_t kobj_sysfs_assoc_lock = SPIN_LOCK_UNLOCKED;

 struct kobject *sysfs_get_kobject(struct dentry *dentry)
 {
@@ -388,8 +389,13 @@ static void __sysfs_remove_dir(struct dentry *dentry)

 void sysfs_remove_dir(struct kobject * kobj)
 {
-	__sysfs_remove_dir(kobj->dentry);
+	struct dentry *d = kobj->dentry;
+
+	spin_lock(&kobj_sysfs_assoc_lock);
 	kobj->dentry = NULL;
+	spin_unlock(&kobj_sysfs_assoc_lock);
+
+	__sysfs_remove_dir(d);
 }

 int sysfs_rename_dir(struct kobject * kobj, struct dentry *new_parent,
diff --git a/fs/sysfs/sysfs.h b/fs/sysfs/sysfs.h
index c1965b9..9dcd0b0 100644
--- a/fs/sysfs/sysfs.h
+++ b/fs/sysfs/sysfs.h
@@ -61,6 +61,7 @@ extern void sysfs_remove_subdir(struct dentry *);
 extern void sysfs_drop_dentry(struct sysfs_dirent *sd, struct dentry *parent);
 extern int sysfs_setattr(struct dentry *dentry, struct iattr *iattr);

+extern spinlock_t kobj_sysfs_assoc_lock;
 extern struct rw_semaphore sysfs_rename_sem;
 extern struct super_block * sysfs_sb;
 extern const struct file_operations sysfs_dir_operations;
-- 
1.5.0.3
[PATCH 07/14] sysfs: add sysfs_dirent->s_name
Add s_name to sysfs_dirent. This is to further reduce dependency to
the associated dentry. Name is copied for directories and symlinks
but not for attributes.

Where possible, name dereferences are converted to use sd->s_name.
sysfs_symlink->link_name and sysfs_get_name() are unused now and
removed.

This change allows symlink to be implemented using sysfs_dirent tree
proper, which is the last remaining dentry-dependent sysfs walk.

Signed-off-by: Tejun Heo <[EMAIL PROTECTED]>
---
 fs/sysfs/dir.c        | 33 +
 fs/sysfs/file.c       |  2 +-
 fs/sysfs/inode.c      | 33 +
 fs/sysfs/symlink.c    |  8 +---
 fs/sysfs/sysfs.h      |  7 +++
 include/linux/sysfs.h |  1 +
 6 files changed, 28 insertions(+), 56 deletions(-)

diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index ef45c3e..26c3088 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -41,10 +41,11 @@ void release_sysfs_dirent(struct sysfs_dirent * sd)

 	if (sd->s_type & SYSFS_KOBJ_LINK) {
 		struct sysfs_symlink * sl = sd->s_element;
-		kfree(sl->link_name);
 		kobject_put(sl->target_kobj);
 		kfree(sl);
 	}
+	if (sd->s_type & SYSFS_COPY_NAME)
+		kfree(sd->s_name);
 	kfree(sd->s_iattr);
 	kmem_cache_free(sysfs_dir_cachep, sd);

@@ -69,19 +70,30 @@ static struct dentry_operations sysfs_dentry_ops = {
 	.d_iput		= sysfs_d_iput,
 };

-struct sysfs_dirent *sysfs_new_dirent(void *element, umode_t mode, int type)
+struct sysfs_dirent *sysfs_new_dirent(const char *name, void *element,
+				      umode_t mode, int type)
 {
+	char *dup_name = NULL;
 	struct sysfs_dirent * sd;

+	if (type & SYSFS_COPY_NAME) {
+		name = dup_name = kstrdup(name, GFP_KERNEL);
+		if (!name)
+			return NULL;
+	}
+
 	sd = kmem_cache_zalloc(sysfs_dir_cachep, GFP_KERNEL);
-	if (!sd)
+	if (!sd) {
+		kfree(dup_name);
 		return NULL;
+	}

 	atomic_set(&sd->s_count, 1);
 	atomic_set(&sd->s_event, 1);
 	INIT_LIST_HEAD(&sd->s_children);
 	INIT_LIST_HEAD(&sd->s_sibling);
+	sd->s_name = name;
 	sd->s_element = element;
 	sd->s_mode = mode;
 	sd->s_type = type;
@@ -118,8 +130,7 @@ int sysfs_dirent_exist(struct sysfs_dirent *parent_sd,

 	list_for_each_entry(sd, &parent_sd->s_children, s_sibling) {
 		if (sd->s_element) {
-			const unsigned char *existing = sysfs_get_name(sd);
-			if (strcmp(existing, new))
+			if (strcmp(sd->s_name, new))
 				continue;
 			else
 				return -EEXIST;
@@ -173,7 +184,7 @@ static int create_dir(struct kobject *kobj, struct dentry *parent,
 		goto out_dput;

 	error = -ENOMEM;
-	sd = sysfs_new_dirent(kobj, mode, SYSFS_DIR);
+	sd = sysfs_new_dirent(name, kobj, mode, SYSFS_DIR);
 	if (!sd)
 		goto out_drop;
 	sysfs_attach_dirent(sd, parent->d_fsdata, dentry);
@@ -298,9 +309,7 @@ static struct dentry * sysfs_lookup(struct inode *dir, struct dentry *dentry,

 	list_for_each_entry(sd, &parent_sd->s_children, s_sibling) {
 		if (sd->s_type & SYSFS_NOT_PINNED) {
-			const unsigned char * name = sysfs_get_name(sd);
-
-			if (strcmp(name, dentry->d_name.name))
+			if (strcmp(sd->s_name, dentry->d_name.name))
 				continue;

 			if (sd->s_type & SYSFS_KOBJ_LINK)
@@ -490,7 +499,7 @@ static int sysfs_dir_open(struct inode *inode, struct file *file)
 	struct sysfs_dirent * sd;

 	mutex_lock(&dentry->d_inode->i_mutex);
-	sd = sysfs_new_dirent(NULL, 0, 0);
+	sd = sysfs_new_dirent("_DIR_", NULL, 0, 0);
 	if (sd)
 		sysfs_attach_dirent(sd, parent_sd, NULL);
 	mutex_unlock(&dentry->d_inode->i_mutex);
@@ -557,7 +566,7 @@ static int sysfs_readdir(struct file * filp, void * dirent, filldir_t filldir)
 			if (!next->s_element)
 				continue;

-			name = sysfs_get_name(next);
+			name = next->s_name;
 			len = strlen(name);
 			ino = (unsigned long)next;

@@ -669,7 +678,7 @@ struct dentry *sysfs_create_shadow_dir(struct kobject *kobj)
 	if (!shadow)
 		goto nomem;

-	sd = sysfs_new_dirent(kobj, inode->i_mode, SYSFS_DIR);
+	sd = sysfs_new_dirent("_SHADOW_", kobj, inode->i_mode, SYSFS_DIR);
 	if (!sd)
 		goto nomem;
 	/* point to parent_sd but don't attach to it */
diff --git a/fs/sysfs/file.
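The copy-or-borrow naming rule from this patch (duplicate the name when a flag asks for it, otherwise keep the caller's pointer, and free only what was duplicated) might look like this in a user-space sketch. The names are assumptions made for illustration: ENT_COPY_NAME plays the role of SYSFS_COPY_NAME, and struct entry, entry_new() and entry_free() do not exist in the kernel.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define ENT_COPY_NAME 0x1	/* hypothetical flag, in the role of SYSFS_COPY_NAME */

struct entry {
	const char *name;	/* owned only when ENT_COPY_NAME is set */
	int flags;
};

/* Create an entry; the name is duplicated only when the flag asks for it. */
struct entry *entry_new(const char *name, int flags)
{
	char *dup = NULL;
	struct entry *e;

	if (flags & ENT_COPY_NAME) {
		size_t len = strlen(name) + 1;

		dup = malloc(len);
		if (!dup)
			return NULL;
		memcpy(dup, name, len);
		name = dup;
	}

	e = calloc(1, sizeof(*e));
	if (!e) {
		free(dup);	/* free(NULL) is a no-op for borrowed names */
		return NULL;
	}
	e->name = name;
	e->flags = flags;
	return e;
}

/* Free an entry, releasing the name only if this entry owns it. */
void entry_free(struct entry *e)
{
	if (!e)
		return;
	if (e->flags & ENT_COPY_NAME)
		free((char *)e->name);
	free(e);
}
```

A borrowed name must outlive the entry. That is presumably why the patch copies names for directories and symlinks but not for attributes, whose names come from attribute structures the caller keeps around; the posting itself does not spell out the reason.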
[PATCH 04/14] sysfs: flatten cleanup paths in sysfs_add_link() and create_dir()
Flatten cleanup paths in sysfs_add_link() and create_dir() to improve
readability and ease further changes to these functions. This is in
preparation of object reference simplification.

Signed-off-by: Tejun Heo <[EMAIL PROTECTED]>
---
 fs/sysfs/dir.c     | 73 ++-
 fs/sysfs/symlink.c | 27 ++
 2 files changed, 58 insertions(+), 42 deletions(-)

diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 8b1a00a..99a0ba1 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -177,40 +177,53 @@ static int init_symlink(struct inode * inode)
 	return 0;
 }

-static int create_dir(struct kobject * k, struct dentry * p,
-		      const char * n, struct dentry ** d)
+static int create_dir(struct kobject *kobj, struct dentry *parent,
+		      const char *name, struct dentry **p_dentry)
 {
 	int error;
 	umode_t mode = S_IFDIR| S_IRWXU | S_IRUGO | S_IXUGO;
+	struct dentry *dentry;
+	struct sysfs_dirent *sd;

-	mutex_lock(&p->d_inode->i_mutex);
-	*d = lookup_one_len(n, p, strlen(n));
-	if (!IS_ERR(*d)) {
-		if (sysfs_dirent_exist(p->d_fsdata, n))
-			error = -EEXIST;
-		else
-			error = sysfs_make_dirent(p->d_fsdata, *d, k, mode,
-						  SYSFS_DIR);
-		if (!error) {
-			error = sysfs_create(*d, mode, init_dir);
-			if (!error) {
-				inc_nlink(p->d_inode);
-				(*d)->d_op = &sysfs_dentry_ops;
-				d_rehash(*d);
-			}
-		}
-		if (error && (error != -EEXIST)) {
-			struct sysfs_dirent *sd = (*d)->d_fsdata;
-			if (sd) {
-				list_del_init(&sd->s_sibling);
-				sysfs_put(sd);
-			}
-			d_drop(*d);
-		}
-		dput(*d);
-	} else
-		error = PTR_ERR(*d);
-	mutex_unlock(&p->d_inode->i_mutex);
+	mutex_lock(&parent->d_inode->i_mutex);
+
+	dentry = lookup_one_len(name, parent, strlen(name));
+	if (IS_ERR(dentry)) {
+		error = PTR_ERR(dentry);
+		goto out_unlock;
+	}
+
+	error = -EEXIST;
+	if (sysfs_dirent_exist(parent->d_fsdata, name))
+		goto out_dput;
+
+	error = sysfs_make_dirent(parent->d_fsdata, dentry, kobj, mode,
+				  SYSFS_DIR);
+	if (error)
+		goto out_drop;
+
+	error = sysfs_create(dentry, mode, init_dir);
+	if (error)
+		goto out_sput;
+
+	inc_nlink(parent->d_inode);
+	dentry->d_op = &sysfs_dentry_ops;
+	d_rehash(dentry);
+
+	*p_dentry = dentry;
+	error = 0;
+	goto out_dput;
+
+ out_sput:
+	sd = dentry->d_fsdata;
+	list_del_init(&sd->s_sibling);
+	sysfs_put(sd);
+ out_drop:
+	d_drop(dentry);
+ out_dput:
+	dput(dentry);
+ out_unlock:
+	mutex_unlock(&parent->d_inode->i_mutex);
 	return error;
 }

diff --git a/fs/sysfs/symlink.c b/fs/sysfs/symlink.c
index 7b9c5bf..b463f17 100644
--- a/fs/sysfs/symlink.c
+++ b/fs/sysfs/symlink.c
@@ -49,30 +49,33 @@ static int sysfs_add_link(struct dentry * parent, const char * name, struct kobj
 {
 	struct sysfs_dirent * parent_sd = parent->d_fsdata;
 	struct sysfs_symlink * sl;
-	int error = 0;
+	int error;

 	error = -ENOMEM;
-	sl = kmalloc(sizeof(*sl), GFP_KERNEL);
+	sl = kzalloc(sizeof(*sl), GFP_KERNEL);
 	if (!sl)
-		goto exit1;
+		goto err_out;

 	sl->link_name = kmalloc(strlen(name) + 1, GFP_KERNEL);
 	if (!sl->link_name)
-		goto exit2;
+		goto err_out;

 	strcpy(sl->link_name, name);
 	sl->target_kobj = kobject_get(target);

 	error = sysfs_make_dirent(parent_sd, NULL, sl, S_IFLNK|S_IRWXUGO,
 				SYSFS_KOBJ_LINK);
-	if (!error)
-		return 0;
-
-	kobject_put(target);
-	kfree(sl->link_name);
-exit2:
-	kfree(sl);
-exit1:
+	if (error)
+		goto err_out;
+
+	return 0;
+
+ err_out:
+	if (sl) {
+		kobject_put(sl->target_kobj);
+		kfree(sl->link_name);
+		kfree(sl);
+	}
 	return error;
 }
-- 
1.5.0.3
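The single err_out label used in sysfs_add_link() relies on the object being zero-allocated, so that every cleanup call is safe no matter how far construction got. A small user-space sketch of the same shape follows; struct link, link_create() and link_destroy() are hypothetical names, and calloc()/free() stand in for kzalloc()/kfree().

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

struct link {
	char *name;
	char *target;
};

/*
 * Build a link object. All failures after the allocation funnel through
 * one label; because the object is zero-allocated, fields that were never
 * set are NULL and free(NULL) is harmless.
 */
int link_create(const char *name, const char *target, struct link **out)
{
	struct link *l;
	int error;

	if (!name || !target)
		return -EINVAL;

	error = -ENOMEM;
	l = calloc(1, sizeof(*l));
	if (!l)
		goto err_out;

	l->name = malloc(strlen(name) + 1);
	if (!l->name)
		goto err_out;
	strcpy(l->name, name);

	l->target = malloc(strlen(target) + 1);
	if (!l->target)
		goto err_out;
	strcpy(l->target, target);

	*out = l;
	return 0;

 err_out:
	if (l) {
		free(l->target);
		free(l->name);
		free(l);
	}
	return error;
}

/* Release a link returned by a successful link_create(). */
void link_destroy(struct link *l)
{
	if (l) {
		free(l->target);
		free(l->name);
		free(l);
	}
}
```

Compared with a ladder of exit1/exit2 labels, adding another allocation only needs one more goto err_out and one more line in the cleanup block.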
[PATCH 03/14] sysfs: move sysfs_get_kobject() and release_sysfs_dirent() to dir.c
There is no reason these functions should be inlined and soon to
follow sysfs object reference simplification will make these functions
heavier. Move them to dir.c.

Signed-off-by: Tejun Heo <[EMAIL PROTECTED]>
---
 fs/sysfs/dir.c   | 30 ++
 fs/sysfs/sysfs.h | 32 ++--
 2 files changed, 32 insertions(+), 30 deletions(-)

diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 5112f88..8b1a00a 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -14,6 +14,36 @@

 DECLARE_RWSEM(sysfs_rename_sem);

+struct kobject *sysfs_get_kobject(struct dentry *dentry)
+{
+	struct kobject * kobj = NULL;
+
+	spin_lock(&dcache_lock);
+	if (!d_unhashed(dentry)) {
+		struct sysfs_dirent * sd = dentry->d_fsdata;
+		if (sd->s_type & SYSFS_KOBJ_LINK) {
+			struct sysfs_symlink * sl = sd->s_element;
+			kobj = kobject_get(sl->target_kobj);
+		} else
+			kobj = kobject_get(sd->s_element);
+	}
+	spin_unlock(&dcache_lock);
+
+	return kobj;
+}
+
+void release_sysfs_dirent(struct sysfs_dirent * sd)
+{
+	if (sd->s_type & SYSFS_KOBJ_LINK) {
+		struct sysfs_symlink * sl = sd->s_element;
+		kfree(sl->link_name);
+		kobject_put(sl->target_kobj);
+		kfree(sl);
+	}
+	kfree(sd->s_iattr);
+	kmem_cache_free(sysfs_dir_cachep, sd);
+}
+
 static void sysfs_d_iput(struct dentry * dentry, struct inode * inode)
 {
 	struct sysfs_dirent * sd = dentry->d_fsdata;
diff --git a/fs/sysfs/sysfs.h b/fs/sysfs/sysfs.h
index a77c57e..812c8c3 100644
--- a/fs/sysfs/sysfs.h
+++ b/fs/sysfs/sysfs.h
@@ -17,6 +17,8 @@ extern void sysfs_delete_inode(struct inode *inode);
 extern struct inode * sysfs_new_inode(mode_t mode, struct sysfs_dirent *);
 extern int sysfs_create(struct dentry *, int mode, int (*init)(struct inode *));

+extern struct kobject *sysfs_get_kobject(struct dentry *dentry);
+extern void release_sysfs_dirent(struct sysfs_dirent * sd);
 extern int sysfs_dirent_exist(struct sysfs_dirent *, const unsigned char *);
 extern int sysfs_make_dirent(struct sysfs_dirent *, struct dentry *, void *,
 				umode_t, int);
@@ -79,36 +81,6 @@ static inline struct bin_attribute * to_bin_attr(struct dentry * dentry)
 	return ((struct bin_attribute *) sd->s_element);
 }

-static inline struct kobject *sysfs_get_kobject(struct dentry *dentry)
-{
-	struct kobject * kobj = NULL;
-
-	spin_lock(&dcache_lock);
-	if (!d_unhashed(dentry)) {
-		struct sysfs_dirent * sd = dentry->d_fsdata;
-		if (sd->s_type & SYSFS_KOBJ_LINK) {
-			struct sysfs_symlink * sl = sd->s_element;
-			kobj = kobject_get(sl->target_kobj);
-		} else
-			kobj = kobject_get(sd->s_element);
-	}
-	spin_unlock(&dcache_lock);
-
-	return kobj;
-}
-
-static inline void release_sysfs_dirent(struct sysfs_dirent * sd)
-{
-	if (sd->s_type & SYSFS_KOBJ_LINK) {
-		struct sysfs_symlink * sl = sd->s_element;
-		kfree(sl->link_name);
-		kobject_put(sl->target_kobj);
-		kfree(sl);
-	}
-	kfree(sd->s_iattr);
-	kmem_cache_free(sysfs_dir_cachep, sd);
-}
-
 static inline struct sysfs_dirent * sysfs_get(struct sysfs_dirent * sd)
 {
 	if (sd) {
-- 
1.5.0.3
[PATCHSET #master] sysfs: make sysfs disconnect immediately from kobject on deletion
Hello, all.

This patchset is the result of the following thread.

  http://thread.gmane.org/gmane.linux.kernel/510293

This patchset takes sysfs out of the device driver lifetime equation,
which not only fixes several race conditions caused by sysfs not
holding onto the proper module when referencing kobject, but also
helps fix and simplify lifetime management in the driver model.

sysfs is peculiar in how it's intertwined with the driver model via
kobject and with the fs layer by using dentry to record some of its
hierarchy. This not only complicates lifetime management outside of
sysfs but also inside sysfs proper. We end up with several different
yet inter-dependent lifetime rules.

For example, dentry depends on sysfs_dirent for file accesses, as any
dentry would depend on inode and its backing fs private data to do so,
while sysfs_dirent depends on dentry to walk the sysfs_dirent tree for
internal purposes (symlink walk), and the initial access to the dentry
happens by going through the kobject pointer.

This interdependency is okay while all the objects are in memory, but
hell breaks loose when it's time to kill those objects. dentry and
sysfs_dirent depend on each other. Unless they go away at the same
time or use some way to safely break the loop, one side ends up with a
dangling pointer to the other.

This patchset solves this by making sysfs_dirent behave more like the
fs internal inodes in other filesystems, which don't depend on dentry
or any other external entity to manage themselves. Most information is
already there. Only sd->s_parent and s_name are added. These do
increase the size of sysfs_dirent a bit but make the logic look like
it was designed on Earth rather than on Mars, and with further changes
dentry and inode for kobject can be made reclaimable, which can
probably compensate for the added space overhead.

Sysfs lifetime rules are much simpler now. sd denotes sysfs_dirent.

* sd has a default reference of 1 on creation which is dropped on
  deletion.

* dentry holds a reference to sd.
dentry->d_fsdata can be safely dereferenced while referecne to dentry is held. * sd->s_parent points to the parent sd and each child holds a reference to its parent which is released when the child is released (reference reaches zero), so sd->s_parent can be dereferenced recursively if reference to the sd is held. * sd->s_name can always be read while sd is valid. * sd->s_elem.dir.kobj should only be accessed while sd->s_elem.dir.rwsem is read locked, which can be done by calling sysfs_get_dir_kobj() on the sd. * sd->s_elem.[bin_]attr.[bin_attr] should only be accessed while its parent's sd->s_elem.dir.rwsem is read locked. If sysfs_get_dir_kobj() returns NULL, attr pointer might point to released area. * sysfs doesn't reference foreign objects for internal purpose. Foreign objets are accessed from the callbacks or interface functions where the caller is responsible for guaranteeing accessbility - symlink interface function is currently an exception. sysfs should export sysfs_dirent based interface and kobject code should do the locking. * Directory dentries are still pinned as they are used in interface function - this should change in the future. This patchset is consisted of the following 14 patches. #01 : fix i_no handling bug and reduce dependency to inode #02 : fix error handling in binattr write #03-05 : prep for symlink reimplementation #06-07 : add s_parent and s_name #08 : make s_elem a union so that chaning what it points to doesn't cause chaos and s_elem can contain more than one pointer #09 : implement kobj_sysfs_assoc_lock to protect kobj->s_dentry which will be used to keep symlink interface compatible #10 : reimplement symlink using only sysfs_dirent tree #11 : implement bin_buffer for immediate-kobject-disconnect #12 : implement immediate-kobject-disconnect #13-14 : kill now obsolete stuff The first 11 are some fixes and preparation for immediate-kobject-disconnect. 
Depencies to external objects are gradually removed such that only accesses during file ops remain which are converted by patch #12. The last two patches remove now unnecessary attribute orphaning and attribute->owner. I've run the following test on a UP machine for several hours without any oops or memory leak and the test is currently running on a dual processor (hyperthreading) machine for about half an hour. I'll keep it running for at least 5 hours. # (cd kernel-build-dir; while true; do echo [Loading...]; insmod drivers/scsi/scsi_mod.ko; insmod drivers/scsi/sd_mod.ko; insmod drivers/ata/libata.ko; insmod drivers/ata/ahci.ko; sleep 1; echo [Unloading...]; while lsmod | grep -q sd_mod; do rmmod ahci; rmmod libata; rmmod sd_mod; rmmod scsi_mod; sleep .1; done; done) & # (cd /sys; while true; do ls -liR > /dev/null; done) & # (cd /sys; while true; do find . | xargs cat > /dev/null 2>&1; done) & # (cd /sys; while true; do find . | sort | xargs cat > /dev/null 2>&1; done) & T
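The parent-reference rule in the lifetime list above (each child holds a reference to its parent, released when the child's own count reaches zero) can be modelled in a few lines of user-space C. This is only a sketch under stated assumptions: `toy_sd`, `toy_new()`, `toy_get()`, `toy_put()` and `toy_live` are hypothetical names made up for illustration, the counter is a plain int with no locking, and none of this is the patchset's actual code.

```c
#include <stdlib.h>

/* Hypothetical stand-in for sysfs_dirent; not the kernel structure. */
struct toy_sd {
	int		count;		/* reference count */
	struct toy_sd	*parent;	/* the child holds one reference on this */
	const char	*name;
};

static int toy_live;	/* number of nodes currently allocated */

static struct toy_sd *toy_get(struct toy_sd *sd)
{
	if (sd)
		sd->count++;
	return sd;
}

/*
 * Dropping the last reference frees the node and then releases the
 * reference it held on its parent, walking up as far as needed.
 */
static void toy_put(struct toy_sd *sd)
{
	while (sd && --sd->count == 0) {
		struct toy_sd *parent = sd->parent;

		free(sd);
		toy_live--;
		sd = parent;
	}
}

/* A new node starts with one reference, which is dropped on deletion. */
static struct toy_sd *toy_new(const char *name, struct toy_sd *parent)
{
	struct toy_sd *sd = calloc(1, sizeof(*sd));

	if (!sd)
		return NULL;
	sd->count = 1;
	sd->name = name;
	sd->parent = toy_get(parent);
	toy_live++;
	return sd;
}
```

In this model, a holder that keeps a reference to a child (the way a dentry would) can still follow `parent` pointers after both nodes have been "deleted", because the child pins its parent until the child itself is freed.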
[PATCH 13/14] sysfs: kill attribute file orphaning
Now that sysfs_dirent can be disconnected from kobject on deletion, there is no need to orphan each attribute file. All [bin_]attribute nodes are automatically orphaned when the parent node is deleted. Kill attribute file orphaning.

Signed-off-by: Tejun Heo <[EMAIL PROTECTED]>
---
 fs/sysfs/file.c  | 65 ++---
 fs/sysfs/inode.c | 25
 fs/sysfs/mount.c |  8 --
 fs/sysfs/sysfs.h | 16 -
 4 files changed, 13 insertions(+), 101 deletions(-)

diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
index 133a108..cd80b20 100644
--- a/fs/sysfs/file.c
+++ b/fs/sysfs/file.c
@@ -51,29 +51,15 @@ static struct sysfs_ops subsys_sysfs_ops = {
 	.store = subsys_attr_store,
 };
 
-/**
- * add_to_collection - add buffer to a collection
- * @buffer:	buffer to be added
- * @node:	inode of set to add to
- */
-
-static inline void
-add_to_collection(struct sysfs_buffer *buffer, struct inode *node)
-{
-	struct sysfs_buffer_collection *set = node->i_private;
-
-	mutex_lock(&node->i_mutex);
-	list_add(&buffer->associates, &set->associates);
-	mutex_unlock(&node->i_mutex);
-}
-
-static inline void
-remove_from_collection(struct sysfs_buffer *buffer, struct inode *node)
-{
-	mutex_lock(&node->i_mutex);
-	list_del(&buffer->associates);
-	mutex_unlock(&node->i_mutex);
-}
+struct sysfs_buffer {
+	size_t			count;
+	loff_t			pos;
+	char			* page;
+	struct sysfs_ops	* ops;
+	struct semaphore	sem;
+	int			needs_read_fill;
+	int			event;
+};
 
 /**
  * fill_read_buffer - allocate and fill buffer from object.
@@ -175,10 +161,7 @@ sysfs_read_file(struct file *file, char __user *buf, size_t count, loff_t *ppos)
 
 	down(&buffer->sem);
 	if (buffer->needs_read_fill) {
-		if (buffer->orphaned)
-			retval = -ENODEV;
-		else
-			retval = fill_read_buffer(file->f_path.dentry,buffer);
+		retval = fill_read_buffer(file->f_path.dentry,buffer);
 		if (retval)
 			goto out;
 	}
@@ -276,16 +259,11 @@ sysfs_write_file(struct file *file, const char __user *buf, size_t count, loff_t
 	ssize_t len;
 
 	down(&buffer->sem);
-	if (buffer->orphaned) {
-		len = -ENODEV;
-		goto out;
-	}
 	len = fill_write_buffer(buffer, buf, count);
 	if (len > 0)
 		len = flush_write_buffer(file->f_path.dentry, buffer, len);
 	if (len > 0)
 		*ppos += len;
-out:
 	up(&buffer->sem);
 	return len;
 }
@@ -294,7 +272,6 @@ static int sysfs_open_file(struct inode *inode, struct file *file)
 {
 	struct sysfs_dirent *attr_sd = file->f_path.dentry->d_fsdata;
 	struct attribute *attr = attr_sd->s_elem.attr.attr;
-	struct sysfs_buffer_collection *set;
 	struct sysfs_buffer * buffer;
 	struct sysfs_ops * ops = NULL;
 	struct kobject *kobj;
@@ -322,26 +299,14 @@ static int sysfs_open_file(struct inode *inode, struct file *file)
 	else
 		ops = &subsys_sysfs_ops;
 
+	error = -EACCES;
+
 	/* No sysfs operations, either from having no subsystem,
 	 * or the subsystem have no operations.
 	 */
-	error = -EACCES;
 	if (!ops)
 		goto err_mput;
 
-	/* make sure we have a collection to add our buffers to */
-	mutex_lock(&inode->i_mutex);
-	if (!(set = inode->i_private)) {
-		error = -ENOMEM;
-		if (!(set = inode->i_private = kmalloc(sizeof(struct sysfs_buffer_collection), GFP_KERNEL)))
-			goto err_mput;
-		else
-			INIT_LIST_HEAD(&set->associates);
-	}
-	mutex_unlock(&inode->i_mutex);
-
-	error = -EACCES;
-
 	/* File needs write support.
 	 * The inode's perms must say it's ok,
 	 * and we must have a store method.
@@ -368,11 +333,9 @@ static int sysfs_open_file(struct inode *inode, struct file *file)
 	if (!buffer)
 		goto err_mput;
 
-	INIT_LIST_HEAD(&buffer->associates);
 	init_MUTEX(&buffer->sem);
 	buffer->needs_read_fill = 1;
 	buffer->ops = ops;
-	add_to_collection(buffer, inode);
 	file->private_data = buffer;
 
 	/* open succeeded, put kobj and pin attr_sd */
@@ -391,10 +354,8 @@ static int sysfs_release(struct inode * inode, struct file * filp)
 {
 	struct sysfs_dirent *attr_sd = filp->f_path.dentry->d_fsdata;
 	struct attribute *attr = attr_sd->s_elem.attr.attr;
-	struct sysfs_buffer * buffer = filp->private_data;
+	struct sysfs_buffer *buffer = filp->private_data;
 
-	if (buffer)
-		remove_from_collection(buffer,
[PATCH 10/14] sysfs: reimplement symlink using sysfs_dirent tree
sysfs symlink is implemented by referencing dentry and kobject from sysfs_dirent - the symlink entry references kobject, and dentry is used to walk the tree. This complicates object lifetime rules and is dangerous - for example, there is no way to tell to which module the target of a symlink belongs, and referencing that kobject can make it linger after the module is gone.

This patch reimplements symlink using only the sysfs_dirent tree. sd for a symlink points to and holds a reference to the target sysfs_dirent, and all walking is done using the sysfs_dirent tree. Simpler and safer.

Please read the following message for more info.

  http://article.gmane.org/gmane.linux.kernel/510293

Signed-off-by: Tejun Heo <[EMAIL PROTECTED]>
---
 fs/sysfs/dir.c     |  9 +++--
 fs/sysfs/symlink.c | 88 +++
 fs/sysfs/sysfs.h   |  2 +-
 3 files changed, 53 insertions(+), 46 deletions(-)

diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 8372c0c..b7d0e0e 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -22,10 +22,11 @@ struct kobject *sysfs_get_kobject(struct dentry *dentry)
 	spin_lock(&dcache_lock);
 	if (!d_unhashed(dentry)) {
 		struct sysfs_dirent * sd = dentry->d_fsdata;
+
 		if (sd->s_type & SYSFS_KOBJ_LINK)
-			kobj = kobject_get(sd->s_elem.symlink.target_kobj);
-		else
-			kobj = kobject_get(sd->s_elem.dir.kobj);
+			sd = sd->s_elem.symlink.target_sd;
+
+		kobj = kobject_get(sd->s_elem.dir.kobj);
 	}
 	spin_unlock(&dcache_lock);
 
@@ -40,7 +41,7 @@ void release_sysfs_dirent(struct sysfs_dirent * sd)
 	parent_sd = sd->s_parent;
 
 	if (sd->s_type & SYSFS_KOBJ_LINK)
-		kobject_put(sd->s_elem.symlink.target_kobj);
+		sysfs_put(sd->s_elem.symlink.target_sd);
 	if (sd->s_type & SYSFS_COPY_NAME)
 		kfree(sd->s_name);
 	kfree(sd->s_iattr);
diff --git a/fs/sysfs/symlink.c b/fs/sysfs/symlink.c
index 27df635..ff605d3 100644
--- a/fs/sysfs/symlink.c
+++ b/fs/sysfs/symlink.c
@@ -11,50 +11,49 @@
 
 #include "sysfs.h"
 
-static int object_depth(struct kobject * kobj)
+static int object_depth(struct sysfs_dirent *sd)
 {
-	struct kobject * p = kobj;
 	int depth = 0;
-	do { depth++; } while ((p = p->parent));
+
+	for (; sd->s_parent; sd = sd->s_parent)
+		depth++;
+
 	return depth;
 }
 
-static int object_path_length(struct kobject * kobj)
+static int object_path_length(struct sysfs_dirent * sd)
 {
-	struct kobject * p = kobj;
 	int length = 1;
-	do {
-		length += strlen(kobject_name(p)) + 1;
-		p = p->parent;
-	} while (p);
+
+	for (; sd->s_parent; sd = sd->s_parent)
+		length += strlen(sd->s_name) + 1;
+
 	return length;
 }
 
-static void fill_object_path(struct kobject * kobj, char * buffer, int length)
+static void fill_object_path(struct sysfs_dirent *sd, char *buffer, int length)
 {
-	struct kobject * p;
-
 	--length;
-	for (p = kobj; p; p = p->parent) {
-		int cur = strlen(kobject_name(p));
+	for (; sd->s_parent; sd = sd->s_parent) {
+		int cur = strlen(sd->s_name);
 
 		/* back up enough to print this bus id with '/' */
 		length -= cur;
-		strncpy(buffer + length,kobject_name(p),cur);
+		strncpy(buffer + length, sd->s_name, cur);
 		*(buffer + --length) = '/';
 	}
 }
 
-static int sysfs_add_link(struct dentry * parent, const char * name, struct kobject * target)
+static int sysfs_add_link(struct sysfs_dirent * parent_sd, const char * name,
+			  struct sysfs_dirent * target_sd)
 {
-	struct sysfs_dirent * parent_sd = parent->d_fsdata;
 	struct sysfs_dirent * sd;
 
 	sd = sysfs_new_dirent(name, S_IFLNK|S_IRWXUGO, SYSFS_KOBJ_LINK);
 	if (!sd)
 		return -ENOMEM;
 
-	sd->s_elem.symlink.target_kobj = kobject_get(target);
+	sd->s_elem.symlink.target_sd = target_sd;
 	sysfs_attach_dirent(sd, parent_sd, NULL);
 	return 0;
 }
@@ -68,6 +67,8 @@ static int sysfs_add_link(struct dentry * parent, const char * name, struct kobj
 int sysfs_create_link(struct kobject * kobj, struct kobject * target, const char * name)
 {
 	struct dentry *dentry = NULL;
+	struct sysfs_dirent *parent_sd = NULL;
+	struct sysfs_dirent *target_sd = NULL;
 	int error = -EEXIST;
 
 	BUG_ON(!name);
@@ -80,11 +81,27 @@ int sysfs_create_link(struct kobject * kobj, struct kobject * target, const char
 	if (!dentry)
 		return -EFAULT;
 
+	parent_sd = dentry->d_fsdata;
+
+	/* target->dentry can go away beneath us but is protected with
+	 * kobj_sysfs_assoc_lock. Fetch target_sd from it.
+	 */
+	spin_lock(&kobj_sysfs_assoc_lock);
+
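The path helpers in the patch above walk `s_parent` links and read `s_name` instead of going through kobjects. A user-space model of that walk may make the arithmetic easier to follow. This is a sketch under assumptions: `toy_sd` and the `toy_*` functions are hypothetical stand-ins that only mirror the loop shape of `object_depth()`, `object_path_length()` and `fill_object_path()`; the sketch also writes the terminating NUL itself, which the kernel helper leaves to its caller.

```c
#include <string.h>

/* Hypothetical stand-in carrying only the two fields the walk needs. */
struct toy_sd {
	const char	*s_name;
	struct toy_sd	*s_parent;	/* NULL for the root node */
};

/* Number of non-root nodes between sd and the root, sd included. */
static int toy_depth(const struct toy_sd *sd)
{
	int depth = 0;

	for (; sd->s_parent; sd = sd->s_parent)
		depth++;
	return depth;
}

/* Bytes needed for "/a/b/c" plus the terminating NUL. */
static int toy_path_length(const struct toy_sd *sd)
{
	int length = 1;

	for (; sd->s_parent; sd = sd->s_parent)
		length += strlen(sd->s_name) + 1;
	return length;
}

/* Fill the buffer back to front, one "/name" component per node. */
static void toy_fill_path(const struct toy_sd *sd, char *buffer, int length)
{
	--length;
	buffer[length] = '\0';
	for (; sd->s_parent; sd = sd->s_parent) {
		int cur = strlen(sd->s_name);

		length -= cur;
		memcpy(buffer + length, sd->s_name, cur);
		buffer[--length] = '/';
	}
}
```

Because each loop stops at the node whose `s_parent` is NULL, the root contributes neither a depth step nor a path component in this model.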
Re: [PATCH] fix sysfs_readdir oops (was Re: sysfs reclaim crash)
Hello, Maneesh.

Maneesh Soni wrote:
> o sysfs_d_iput() is invoked in dentry reclaim path under memory pressure. This
>   happens without i_mutex. It also nullifies s_dentry to indicate that
>   the associated dentry is evicted. sysfs_readdir() accesses the s_dentry,
>   and gets the inode number from the associated dentry->d_inode, if
>   there is one, else it invokes iunique(). This can create a race situation,
>   and crash while accessing the d_inode in sysfs_readdir().
>
> o The race happens when the dentry is getting reclaimed and detached from
>   the corresponding sysfs_dirent though sysfs_dirent is still a valid
>   node. Accessing dentry fields is ok as it is under RCU but the inode is
>   not, hence we may see an oops accessing dentry->d_inode->i_ino.
>
> o The following patch always uses iunique() to get the inode number in
>   sysfs_readdir. This is ok as sysfs does not have permanent inode numbering.
>   It could be slower but avoids the oops.

This isn't correct as iunique() assumes the inode is in the inode hash table, which isn't true for sysfs. This can result in duplicate inode numbers. Please take a look at the following alternative fix.

  http://article.gmane.org/gmane.linux.kernel/513325

Thanks.

-- 
tejun
Re: [PATCH] PPC4xx UART0 (8250) problem
Benjamin Herrenschmidt wrote:
> This is an old problem. The proper fix is already implemented for
> arch/powerpc and consists of having virtual irq numbers (which helps for
> many other things anyway).
>
> Support for 4xx platforms in arch/powerpc is starting to get in, pop on
> [EMAIL PROTECTED] where the patches are being posted and you are
> welcome to give a hand porting more platforms over :-)

First, thanks for the response.

OK, I see that mapping IRQ0 to a virtual number is a good solution, but I have known this problem since the 2.4 kernel days. It was very surprising when moving to the 2.6 kernel: the serial driver was changed, but we still have the same problem there.

Maybe it's a good idea to create an unofficial ppc linux branch repository (or does it already exist?) where we can put patches that work (as above) but not in the clean way the linux community expects.
[PATCH, take4] FUTEX : new PRIVATE futexes
Hi all

Updates on this take4 :

- All remarks from Nick were addressed, I hope.
- Current mm code has a problem with 64bit futexes, as spotted by Nick : get_futex_key() does a check against sizeof(u32) regardless of the futex being 64 bits or not. So it is possible for a 64bit futex to span two pages of memory... I had to change the get_futex_key() prototype to be able to do a correct test.

History :

I'm pleased to present this patch which improves linux futex performance and scalability, merely by avoiding taking the mmap_sem rwlock.

Ulrich agreed with the API and said glibc work could start as soon as he gets a Fedora kernel with it :)

Andrew, could we get this in mm as well ? This version is against 2.6.21-rc5-mm4 (so supports 64bit futexes).

In this third version I dropped the NUMA optims and the process private hash table, to let the new API come in and be tested.

Thank you

[PATCH] FUTEX : new PRIVATE futexes

Analysis of current linux futex code :
--------------------------------------

A central hash table futex_queues[] holds all contexts (futex_q) of waiting threads. Each futex_wait()/futex_wake() has to obtain a spinlock on a hash slot to perform lookups or insert/deletion of a futex_q.

When a futex_wait() is done, the calling thread has to :

1) Obtain a read lock on mmap_sem to be able to validate the user pointer (calling find_vma()). This validation tells us if the futex uses an inode based store (mapped file), or an mm based store (anonymous mem)
2) Compute a hash key
3) Atomically increment the reference counter on an inode or a mm_struct
4) Lock part of the futex_queues[] hash table
5) Perform the test on the value of the futex. (rollback if value != expected_value, returns EWOULDBLOCK) (various loops if the test triggers mm faults)
6) Queue the context into the hash table, release the lock got in 4)
7) Release the read lock on mmap_sem
8) Eventually unqueue the context (but rarely, as this part may be done by the futex_wake())

Futexes were designed to improve scalability but the current implementation has various problems :

- Central hashtable : This means scalability problems if many processes/threads want to use futexes at the same time. This means NUMA imbalance because this hashtable is located on one node.

- Using mmap_sem on every futex() syscall : Even if mmap_sem is a rw_semaphore, up_read()/down_read() are doing atomic ops on mmap_sem, dirtying its cache line :
    - lots of cache line ping pongs on SMP configurations.
  mmap_sem is also extensively used by mm code (page faults, mmap()/munmap()). Highly threaded processes might suffer from mmap_sem contention. mmap_sem is also used by oprofile code. Enabling oprofile hurts threaded programs because of contention on the mmap_sem cache line.

- Using an atomic_inc()/atomic_dec() on the inode ref counter or mm ref counter : It's also a cache line ping pong on SMP. It also increases mmap_sem hold time because of cache misses.

Most of these scalability problems come from the fact that futexes are in one global namespace. As we use a central hash table, we must make sure they are all using the same reference (given by the mm subsystem). We chose to force all futexes to be 'shared'. This has a cost.

But the fact is POSIX defined PRIVATE and SHARED, allowing clear separation, and optimal performance if carefully implemented. Time has come for linux to have better threading performance.

The goal is to permit new futex commands to avoid :
- Taking the mmap_sem semaphore, conflicting with other subsystems.
- Modifying a ref_count on mm or an inode, still conflicting with mm or fs.

This is possible because, for one process using PTHREAD_PROCESS_PRIVATE futexes, we only need to distinguish futexes by their virtual address, no matter what the underlying mm storage is.

If glibc wants to exploit this new infrastructure, it should use the new _PRIVATE futex subcommands for PTHREAD_PROCESS_PRIVATE futexes. And be prepared to fall back on the old subcommands for old kernels. Using one global variable with the FUTEX_PRIVATE_FLAG or 0 value should be OK.

PTHREAD_PROCESS_SHARED futexes should still use the old subcommands.

Compatibility with old applications is preserved, they still hit the scalability problems, but new applications can fly :)

Note : the same SHARED futex (mapped on a file) can be used by old binaries *and* new binaries, because both binaries will use the old subcommands.

Note : The vast majority of futexes should be using PROCESS_PRIVATE semantics, as this is the default semantic. Almost all applications should benefit from these changes (new kernel and updated libc).

Some bench results on a Pentium M 1.6 GHz (SMP kernel on a UP machine)

/* calling futex_wait(addr, value) with value != *addr */
434 cycles per futex(FUTEX_WAIT) call (mixing 2 futexes)
427 cycles per futex(FUTEX_WAIT) call (using one futex)
345 cycles per futex(FUTEX_WAIT_PRIVATE) call (mixing 2 futexes)
345 cycles per futex(FUTEX_WAIT_PRIVATE) call (using one futex)
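The fallback scheme described above (one global variable holding either FUTEX_PRIVATE_FLAG or 0) could look roughly like this on the user-space side. This is a minimal sketch, not glibc's implementation: `futex_private_flag` and `futex_wait_private()` are hypothetical names, the value 128 is the one found in today's <linux/futex.h>, and the sketch assumes a kernel without the private commands rejects them with ENOSYS.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

#ifndef FUTEX_PRIVATE_FLAG
#define FUTEX_PRIVATE_FLAG 128	/* value used by current kernel headers */
#endif

/* Holds FUTEX_PRIVATE_FLAG, or 0 once a kernel has rejected it. */
static int futex_private_flag = FUTEX_PRIVATE_FLAG;

/*
 * Wait on a process-private futex word. Returns the raw syscall result:
 * 0 when woken, -1 with errno set otherwise (EAGAIN/EWOULDBLOCK when
 * *addr != expected at the time of the call).
 */
static long futex_wait_private(uint32_t *addr, uint32_t expected)
{
	long ret = syscall(SYS_futex, addr, FUTEX_WAIT | futex_private_flag,
			   expected, NULL, NULL, 0);

	if (ret == -1 && errno == ENOSYS && futex_private_flag) {
		/* Assumed old-kernel behaviour: remember it and retry shared. */
		futex_private_flag = 0;
		ret = syscall(SYS_futex, addr, FUTEX_WAIT,
			      expected, NULL, NULL, 0);
	}
	return ret;
}
```

The value-mismatch path is the same one the benchmark above exercises: calling `futex_wait_private(&word, 0)` while `word` is 1 returns immediately with EWOULDBLOCK instead of sleeping.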
Re: [patch] high-res timers: UP resume fix
On Sat, 2007-04-07 at 10:25 +0200, Ingo Molnar wrote:
> * Ingo Molnar <[EMAIL PROTECTED]> wrote:
>
> > [...] Soeren, can you confirm that you are using a !CONFIG_SMP kernel,
> > and if yes, does the patch below fix the resume problem for you?
>
> hm, you seem to have a CONFIG_SMP=y kernel. I don't immediately see where
> we re-enable interrupts in the SMP case, but could you try my patch
> nevertheless

We do in on_each_cpu() unconditionally. I missed that.

	tglx
Re: Linux 2.6.21-rc6
Hi all,

This looks like a lockdep problem.

2.6.21-rc6
+ hrtimers_debug.patch (from Ingo)
- skge_wol_support (commit a504e64ab42bcc27074ea37405d06833ed6e0820) dropped due to swsusp problems

[14016.726946] BUG: at /mnt/md0/devel/linux-git/kernel/lockdep.c:2427 check_flags()
[14016.734331] [] show_trace_log_lvl+0x1a/0x2f
[14016.739507] [] show_trace+0x12/0x14
[14016.743982] [] dump_stack+0x16/0x18
[14016.748460] [] check_flags+0x95/0x143
[14016.753106] [] lock_acquire+0x29/0x82
[14016.757741] [] down_write+0x3a/0x54
[14016.762203] [] sys_munmap+0x23/0x3f
[14016.71] [] syscall_call+0x7/0xb
[14016.771134] ===
[14016.774712] irq event stamp: 43076
[14016.778111] hardirqs last enabled at (43075): [] syscall_exit_work+0x11/0x26
[14016.786166] hardirqs last disabled at (43076): [] ret_from_exception+0x9/0xc
[14016.794118] softirqs last enabled at (42608): [] __do_softirq+0xe4/0xea
[14016.801706] softirqs last disabled at (42599): [] do_softirq+0x64/0xd1

http://www.stardust.webpages.pl/files/tbf/bitis-gabonica/2.6.21-rc6/git-console.log
http://www.stardust.webpages.pl/files/tbf/bitis-gabonica/2.6.21-rc6/git-config

BTW. I noticed some strange fio (1.15) behavior

Starting 16 processes
file:io_u.c:65, assert idx < f->num_maps failed[ 1605/ 36442 kb/s] [eta 00m:32s]
fio: pid=13734, got signal=11
file:io_u.c:65, assert idx < f->num_maps failed[ 10452/ 0 kb/s] [eta 00m:23s]
fio: pid=13731, got signal=11

Regards,
Michal

-- 
Michal K. K. Piotrowski
LTG - Linux Testers Group (PL) (http://www.stardust.webpages.pl/ltg/)
LTG - Linux Testers Group (EN) (http://www.stardust.webpages.pl/linux_testers_group_en/)
Re: [patch] high-res timers: UP resume fix
* Thomas Gleixner <[EMAIL PROTECTED]> wrote:

> On Sat, 2007-04-07 at 10:25 +0200, Ingo Molnar wrote:
> > * Ingo Molnar <[EMAIL PROTECTED]> wrote:
> >
> > > [...] Soeren, can you confirm that you are using a !CONFIG_SMP kernel,
> > > and if yes, does the patch below fix the resume problem for you?
> >
> > hm, you seem to have a CONFIG_SMP=y kernel. I don't immediately see
> > where we re-enable interrupts in the SMP case, but could you try my
> > patch nevertheless
>
> We do in on_each_cpu() unconditionally. I missed that.

doh, indeed!

	Ingo
Re: [patch] high-res timers: UP resume fix
On Sat, 2007-04-07 at 10:12 +0200, Ingo Molnar wrote:
> Subject: [patch] high-res timers: UP resume fix
> From: Ingo Molnar <[EMAIL PROTECTED]>
>
> Soeren Sonnenburg reported that upon resume he is getting
> this backtrace:
>
> [] smp_apic_timer_interrupt+0x57/0x90
> [] retrigger_next_event+0x0/0xb0
> [] apic_timer_interrupt+0x28/0x30
> [] retrigger_next_event+0x0/0xb0
> [] __kfifo_put+0x8/0x90
> [] on_each_cpu+0x35/0x60
> [] clock_was_set+0x18/0x20
> [] timekeeping_resume+0x7c/0xa0
> [] __sysdev_resume+0x11/0x80
> [] sysdev_resume+0x47/0x80
> [] device_power_up+0x5/0x10
>
> it turns out that on UP we mistakenly re-enable interrupts,
> so do the timer retrigger only on the current CPU.
>
> Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]>

Acked-by: Thomas Gleixner <[EMAIL PROTECTED]>
Re: [patch 2/4] clean up identify_cpu
On Fri, 06 Apr 2007 15:41:54 -0700 Jeremy Fitzhardinge <[EMAIL PROTECTED]> wrote:

> identify_cpu() is used to identify both the boot CPU and secondary
> CPUs, but it performs some actions which only apply to the boot CPU.
> Those functions are therefore really __init functions, but because
> they're called by identify_cpu(), they must be marked __cpuinit.
>
> This patch splits identify_cpu() into identify_boot_cpu() and
> identify_secondary_cpu(), and calls the appropriate init functions
> from each. Also, identify_boot_cpu() and all the functions it
> dominates are marked __init.

x86_64 uses this too.

WARNING: arch/x86_64/kernel/built-in.o - Section mismatch: reference to .init.text:mtrr_bp_init from .text.identify_cpu after 'identify_cpu' (at offset 0x655)
Re: [PATCH] [sched] redundant reschedule when set_user_nice() boosts a prio of a task from the "expired" array
On 07/04/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> On Wed, 4 Apr 2007 22:05:40 +0200 "Dmitry Adamushko" wrote:
> > [...]
> >
> > o Make TASK_PREEMPTS_CURR(task, rq) return "true" only if the task's
> > prio is higher than the current's one and the task is in the "active"
> > array.
> > This ensures we don't make redundant resched_task() calls when the
> > task is in the "expired" array (as may happen now in set_user_nice(),
> > rt_mutex_setprio() and pull_task() ) ;
> >
> > o generalise conditions for a call to resched_task() in
> > set_user_nice(), rt_mutex_setprio() and sched_setscheduler()
>
> grief. This patch conflicts seriously with the staircase scheduler in -mm.
> So to merge it I need to
>
> - apply it
> - then apply a revert-it-again patch
> - then apply staircase
> - then ask Con to cook up a staircase-based equivalent of your change.

I'll make a SD-based version and send it to Con.

> so
>
> - your code only gets publicly tested in its against-staircase version
> - the against-mainline version will get merged without having been
>   publicly tested outside of staircase
>
> which is probably all OK for a 2.6.22-rc1 thing, provided Ingo can give a
> confident ack.

Ok, thanks.

btw, just out of curiosity. The very first approach I was thinking of was to move a task from the "expired" to the "active" array when its priority is boosted (like rt_mutex_setprio() does for rt tasks).

Reasoning: getting a higher static_prio means getting an additional quota of timeslice which could still be used during this rotation.

  delta = task_timeslice(p->static_prio) - task_timeslice(old_static_prio)

Aha.. /here I'm looking at the mainline now/ another funny thing is that a time_slice is not immediately affected by the change of static_prio in set_user_nice(). If a task is in the expired array, it will run the next rotation with the *old* time_slice (i.e. calculated in task_running_tick() before putting the task in the expired array and based on the *old* static_prio).

In theory, set_user_nice() could adjust p->time_slice by "delta", calculated as shown above.. But ok, it's no more than a minor inconsistency (unless, of course, I'm missing something).

> > --- linux-2.6.21-rc5/kernel/sched-orig.c2007-04-04 18:26:19.0 +0200
> > +++ linux-2.6.21-rc5/kernel/sched.c 2007-04-04 18:26:43.0 +0200
> > @@ -168,7 +168,7 @@ unsigned long long __attribute__((weak))
> > (MAX_BONUS / 2 + DELTA((p)) + 1) / MAX_BONUS - 1))
> >
> > #define TASK_PREEMPTS_CURR(p, rq) \
> > - ((p)->prio < (rq)->curr->prio)
> > + (((p)->prio < (rq)->curr->prio) && ((p)->array == (rq)->active))
>
> Your patch was wordwrapped and had its tabs replaced with spaces. Please
> fix your email client.

I apologize for this. Will fix.

-- 
Best regards,
Dmitry Adamushko
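The effect of the TASK_PREEMPTS_CURR() change quoted above can be checked with a small user-space model. This is only a sketch: `toy_array`, `toy_task`, `toy_rq` and both helper functions are hypothetical stand-ins that keep just the fields the macro touches, and the usual convention that a numerically lower prio means a higher priority is assumed.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for prio_array, task_struct and the runqueue. */
struct toy_array {
	int nr_active;
};

struct toy_task {
	int prio;			/* lower value = higher priority */
	struct toy_array *array;	/* array the task is queued on */
};

struct toy_rq {
	struct toy_task *curr;
	struct toy_array *active;
	struct toy_array *expired;
};

/* Mainline test: priority alone decides. */
static bool preempts_curr_old(const struct toy_task *p, const struct toy_rq *rq)
{
	return p->prio < rq->curr->prio;
}

/* Patched test: the task must also sit in the active array. */
static bool preempts_curr_new(const struct toy_task *p, const struct toy_rq *rq)
{
	return p->prio < rq->curr->prio && p->array == rq->active;
}
```

For a boosted task that is still queued on the expired array, the old test reports a preemption (and so triggers the redundant resched_task() call), while the new test does not; once the task is on the active array both tests agree.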
Re: [PATCH] [sched] redundant reschedule when set_user_nice() boosts a prio of a task from the "expired" array
* Dmitry Adamushko <[EMAIL PROTECTED]> wrote:

> following the conversation on "a redundant reschedule call in
> set_user_prio()", here is a possible approach.
>
> The patch is somewhat intrusive as it even dares to adapt
> TASK_PREEMPTS_CURR().

looks good to me, but the patch seems seriously whitespace-damaged: all tabs were converted to spaces.

	Ingo
Re: Two questions regarding Opening files within Kernel!
Hi!

On 7 Apr 2007, at 08:58, JanuGerman wrote:

> 1) I have just a file path with me, an absolute path, but no dentry, no
> inode, no vfsmount object, which function i can call to get a "file"
> object associated with the absolute file path. I have surfed around the
> source code especially fs/open.c and some other files, but each function
> requires a parameter "mode" and "fd" beside file path. Actually, i was
> confused about the "mode" parameter (and its difference with "flag"),
> like what to send, and secondly for "fd", i am not sure, what value to
> send as there is no file in fact and only file path exists. Any idea?

No, but I'm no guru either.

> 2) Any functionality within linux kernel source code, to read one line
> per file? or some indirect way to set buffer size for one read?. That
> is, any existing header file for doing text I/O rather than binary
> within the kernel source code?

Do you have a compelling reason for not letting userspace feed the file to your driver? That would be the natural and much easier way, I suppose...

Ciao,
Roland

--
TU Muenchen, Physik-Department E18, James-Franck-Str., 85748 Garching
Telefon 089/289-12575; Telefax 089/289-12570
--
CERN office: 892-1-D23 phone: +41 22 7676540 mobile: +41 76 487 4482
--
Any society that would give up a little liberty to gain a little security will deserve neither and lose both. - Benjamin Franklin
-BEGIN GEEK CODE BLOCK-
Version: 3.12
GS/CS/M/MU d-(++) s:+ a-> C+++ UL P+++ L+++ E(+) W+ !N K- w--- M + !V Y+ PGP++ t+(++) 5 R+ tv-- b+ DI++ e+++> h y+++
--END GEEK CODE BLOCK--
Re: [PATCH] [sched] redundant reschedule when set_user_nice() boosts a prio of a task from the "expired" array
* Andrew Morton <[EMAIL PROTECTED]> wrote:

> so
>
> - your code only gets publicly tested in its against-staircase
>   version
>
> - the against-mainline version will get merged without having been
>   publicly tested outside of staircase
>
> which is probably all OK for a 2.6.22-rc1 thing, provided Ingo can
> give a confident ack.

it looks good to me - and once i get a non-whitespace-damaged patch i'll put it into -rt so we'll have testing. (this patch should have at most a latency impact, if we forget to preempt somewhere, and -rt users are quite touchy about latencies.)

> Where are we at with staircase anyway? Is it looking like a 2.6.22
> thing? I don't personally think we've yet seen enough serious
> performance testing to permit a merge, apart from other issues...

yes, that's my thinking too at the moment. I'd also like to see a summary of 'open design questions' list from Mike (if Mike has time/energy for that?) - many questions were raised, a good number of them were answered, various changes were done to SD, but there's no good summary of the current state of affairs.

	Ingo
Re: [PATCH] console UTF-8 fixes
On Fri, Apr 06, 2007 at 12:43:03PM -0700, H. Peter Anvin wrote: Hi, > I strongly disagree. First of all, you're changing the semantics of a > 13-year-old API. The semantics of the Linux console is that by > specifying U+FFFD SUBSTITUTION GLYPH in your unicode table, you have > specified the fallback glyph. OK, I'm not against using U+FFFD for missing glyphs. In the mean time I think it's still a good idea to clearly separate the two cases in the code (that is, the case of invalid sequence from the case of missing glyph), but we can still use the same replacement character in these two cases. I'll send an updated patch after Easter if it sounds good for you. > What's worse, you've hard-coded the uses of specific visual > representations. That is completely unacceptable. Now that we've dropped the idea of "dot" for missing glyphs, the other thing that remains is the hard-coded '?' if and only if U+FFFD is not present in the font. It is even hardcoded in the current code and I have no better idea, there must be a last-resort hardcoded fallback. The only thing I changed is that I inverted the color attributes for this question mark. Do you think that the old behavior, a normal question mark would be better? No problem, I'll adjust the code in this case. Just please tell me what the expected behavior is, I'm not sure I clearly understand your thoughts. > > - Another possible thing the current code may do (for latin1-compatible > >characters) is to simply display the glyph loaded in that position. > >Suppose I have loaded a latin2 font. In latin2, 0xFB is an "u with > >double accent". An applications prints U+00FB, which is an "u with > >circumflex". Since this glyph is not present in latin2, it cannot be > >printed with the current font. Still, the current code falls back to > >printing the glyph from the 0xFB position of the glyph table. Hence my > >app asked to print "u with circumflex" but an "u with double accent" > >appears on the screen. 
This is totally contrary to the goals of Unicode > >and shouldn't ever happen. > > When does that happen? That is clearly a bug. I think I've (mostly) described it above. Set everything to UTF-8, load a latin2 font (containing 256 glyphs, e.g. "setfont lat2-16"), make an application print U+00FB (alt + numpad 251 is one trivial way), you'll see a "u with double accent", though the symbol to be displayed is "u with circumflex". This isn't present in the current font, so the replacement character should appear, not a different letter. > >- The replacement character for invalid UTF-8 sequences is U+FFFD, falling > > back to a question mark. I've changed the fallback version to an inverted > > question mark. This way it's more similar to the common glyph of U+FFFD, > > and it's more trivial to the user that it's not a literal question mark > > but rather some erroneous situation. > > Brilliant. You've picked a fallback glyph which is unlikely to exist in > all fonts. The whole point of falling back to ? is that it's an ASCII > character, which means that if the font designer failed to designate a > fallback glyph -- which is an error!!! -- there is at least some hope of > conveying the error back to the user. Sorry, I wasn't clear enough and I think you misunderstood me. The symbol I chose for fallback is still '?' (the ASCII question mark), I just invert the color attributes of the cell where this is printed. This way it becomes visually distinguishable from the literal question mark. Using the current kernel you just cannot know whether the character printed is a real question mark or a replacement glyph. Still, should you strongly disagree with this decision, the color inverting part can easily be removed. > >- There's no concept of double-width characters. 
It's way beyond the scope > > of my patch to try to display them, but at least I think it's important > > for the cursor to jump two positions when printing such characters, since > > this is what applications (such as text editors) expect. If the cursor > > didn't jump two positions, applications would suffer from displaying and > > refreshing problems, and editing some English letters that are preceded > > by > > some CJK characters in the same line became a nightmare. With my patch an > > inverted dot followed by an inverted space is displayed for double-width > > characters so it's quite easy to see that they are tied together. > > To be able to do CJK you need something like Kon anyway. This feels > like bloat. I don't want CJK support. All that I want is to be able to edit English words within a file that contains mixture of English and CJK, with a text editor like vim or joe. Just try it with the current kernel, and with my patch. Suppose that within a line some CJK text is followed by an English word, and you want to edit the latter one. It's going to be a huge headache with the current kernel. Where you see the cursor is not wh
Re: [PATCH 1/8] Enhance process freezer interface for usage beyond software suspend
On Saturday, 7 April 2007 00:20, Nigel Cunningham wrote: > Hi. > > On Fri, 2007-04-06 at 16:34 +0200, Rafael J. Wysocki wrote: > > On Monday, 2 April 2007 22:51, Pavel Machek wrote: > > > Hi! > > > > > > > > > +/* Per process freezer specific flags */ > > > > > > +#define PF_FE_SUSPEND 0x8000 /* This thread should > > > > > > not be frozen > > > > > > +* for suspend > > > > > > +*/ > > > > > > + > > > > > > +#define PF_FE_KPROBES 0x0010 /* This thread should > > > > > > not be frozen > > > > > > +* for Kprobes > > > > > > +*/ > > > > > > > > > > Just put the comment before the define for long comments? > > > > > > > > Agreed. > > > > > > (Actually it would be nice to say > > > > > > /* This thread should not be frozen for suspend, becuase it is needed > > >for getting image saved to disk */ > > > > > > > > > -#ifdef CONFIG_PM > > > > > > +#if defined(CONFIG_PM) || defined(CONFIG_HOTPLUG_CPU) || \ > > > > > > + defined(CONFIG_KPROBES) > > > > > > > > > > Should we create CONFIG_FREEZER? > > > > > > > > Why do you think so? I think the freezer should be compiled > > > > automatically > > > > if any of the above is set, which is what this directive really means. > > > > > > Kconfig can do that. ("select statement"). If we have one such ifdef, > > > it is okay, but if it would be more of them. > > > > > > > > Hmmm, I do not really like softlockup watchdog running during suspend. > > > > > Can we make this freezeable and make watchdog shut itself off while > > > > > suspending? > > > > > > > > Generally, I agree, but this patch only replaces the existing instances > > > > of PF_NOFREEZE with the new mechanism. The changes you're talking about > > > > require a separate patch series (or at least one separate patch), I > > > > think, and > > > > they need not be so simple to make. > > > > > > Agreed about separate patch series. 
> > > > > > > > > - current->flags |= PF_NOFREEZE; > > > > > > + freezer_exempt(FE_ALL); > > > > > > pid = kernel_thread(do_linuxrc, "/linuxrc", SIGCHLD); > > > > > > if (pid > 0) { > > > > > > while (pid != sys_wait4(-1, NULL, 0, NULL)) > > > > > > > > > > Does this mean we have userland /linuxrc running with PF_NOFREEZE? > > > > > That would be very bad... > > > > > > > > No, actually it is _required_ for the userland resume to work. Well, > > > > perhaps > > > > I should place a comment in there so that I don't have to explain this > > > > again > > > > and again. :-) > > > > > > Can you put big bold comment there? > > > > > > Why is it needed? Freezer never freezes _current_ task. > > > > No, it doesn't, but this task spawns linuxrc and then calls sys_wait4() in a > > loop. > > > > Well, actually, I'll try to plant try_to_freeze() in this loop and see if > > that > > works. If it doesn't, I'll add a comment. > > It works. I've had: > > while (pid != sys_wait4(-1, NULL, 0, NULL)) { > yield(); > try_to_freeze(); > } > > there for ages for Suspend2. OK, thanks. Is there any particular reason to place try_to_freeze() after yield()? Greetings, Rafael - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH, take4] FUTEX : new PRIVATE futexes
Eric Dumazet wrote: > Hi all > > Updates on this take4 : > > - All remarks from Nick were addressed I hope Yeah, looks very nice. Thanks for doing that. > - Current mm code has a problem with 64bit futexes, as spotted by Nick : > > get_futex_key() does a check against sizeof(u32) regardless of futex being 64bits or not. > So it is possible a 64bit futex spans two pages of memory... > I had to change get_futex_key() prototype to be able to do a correct test. I wonder if it should be enforcing alignment to keep it on 1 page? Otherwise, Acked-by: Nick Piggin <[EMAIL PROTECTED]> I'll be away for a couple of days, but I'll look at running some performance tests when I get back. Thanks, Nick -- SUSE Labs, Novell Inc.
possible NULL pointer usage
Hi, from the function fs/udf/inode.c:udf_fill_inode - ... UDF_I_DATA(inode) = kmalloc(inode->i_sb->s_blocksize - sizeof(struct extendedFileEntry), GFP_KERNEL); memcpy(UDF_I_DATA(inode), bh->b_data + sizeof(struct extendedFileEntry), inode->i_sb->s_blocksize - sizeof(struct extendedFileEntry)); ... The return value of kmalloc() is not checked before the memcpy(), so this can lead to a NULL pointer dereference. udf_fill_inode() is declared as 'void', so the question I have is: what is the best way to handle the case where kmalloc() fails? Maybe just mark the inode as bad by calling make_bad_inode() and return from the function? Cyrill
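A rough user-space sketch of the pattern suggested above: check the allocation, mark the inode bad, and return before the memcpy(). This is only a model and not the UDF code. struct demo_inode, demo_fill_inode() and demo_failing_alloc() are invented for the illustration; malloc() stands in for kmalloc() and the `bad` flag stands in for whatever make_bad_inode() would record.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the inode state touched by udf_fill_inode(). */
struct demo_inode {
	void *data;	/* models UDF_I_DATA(inode) */
	bool bad;	/* models the effect of make_bad_inode() */
};

/*
 * Copy len bytes from src into freshly allocated inode data.
 * The allocator is passed in so the failure path can be exercised.
 * On allocation failure the inode is marked bad and nothing is copied.
 */
static void demo_fill_inode(struct demo_inode *inode, const void *src,
			    size_t len, void *(*alloc)(size_t))
{
	assert(inode != NULL && src != NULL);

	inode->data = alloc(len);
	if (!inode->data) {
		inode->bad = true;	/* void function: signal via inode state */
		return;
	}
	memcpy(inode->data, src, len);
}

/* Allocator that always fails, to simulate kmalloc() returning NULL. */
static void *demo_failing_alloc(size_t len)
{
	(void)len;
	return NULL;
}
```

Since the function returns void, the caller has to look at the inode state afterwards to notice the failure, which is the role make_bad_inode() would play in the proposal.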
Re: [patch] high-res timers: UP resume fix
On Saturday, 7 April 2007 10:48, Thomas Gleixner wrote: > On Sat, 2007-04-07 at 10:25 +0200, Ingo Molnar wrote: > > * Ingo Molnar <[EMAIL PROTECTED]> wrote: > > > > > [...] Soeren, can you confirm that you are using a !CONFIG_SMP kernel, > > > and if yes, does the patch below fix the resume problem for you? > > > > hm, you seem to have a CONFIG_SMP=y kernel. I dont immediately see where > > we re-enable interrupts in the SMP case, but could you try my patch > > nevertheless > > We do in on_each_cpu() unconditionally. I missed that. BTW, the on_each_cpu() in clock_was_set() is unnecessary, because timekeeping_resume() is always run on one CPU. Greetings, Rafael
Re: [patch] high-res timers: UP resume fix
* Rafael J. Wysocki <[EMAIL PROTECTED]> wrote: > > We do in on_each_cpu() unconditionally. I missed that. > > BTW, the on_each_cpu() in clock_was_set() is unnecessary, because > timekeeping_resume() is always run on one CPU. yes - but that's not the only place where we do clock_was_set(), and the on_each_cpu() is necessary in every other case. So i think the right solution was the patch i did: to split the resume functionality from the clock_was_set() functionality. Ingo
Re: [PATCH 1/8] Enhance process freezer interface for usage beyond software suspend
Hi. On Sat, 2007-04-07 at 11:33 +0200, Rafael J. Wysocki wrote: > On Saturday, 7 April 2007 00:20, Nigel Cunningham wrote: > > > > > > > - current->flags |= PF_NOFREEZE; > > > > > > > + freezer_exempt(FE_ALL); > > > > > > > pid = kernel_thread(do_linuxrc, "/linuxrc", SIGCHLD); > > > > > > > if (pid > 0) { > > > > > > > while (pid != sys_wait4(-1, NULL, 0, NULL)) > > > > > > > > > > > > Does this mean we have userland /linuxrc running with PF_NOFREEZE? > > > > > > That would be very bad... > > > > > > > > > > No, actually it is _required_ for the userland resume to work. Well, > > > > > perhaps > > > > > I should place a comment in there so that I don't have to explain > > > > > this again > > > > > and again. :-) > > > > > > > > Can you put big bold comment there? > > > > > > > > Why is it needed? Freezer never freezes _current_ task. > > > > > > No, it doesn't, but this task spawns linuxrc and then calls sys_wait4() > > > in a > > > loop. > > > > > > Well, actually, I'll try to plant try_to_freeze() in this loop and see if > > > that > > > works. If it doesn't, I'll add a comment. > > > > It works. I've had: > > > > while (pid != sys_wait4(-1, NULL, 0, NULL)) { > > yield(); > > try_to_freeze(); > > } > > > > there for ages for Suspend2. > > OK, thanks. Is there any particular reason to place try_to_freeze() after > yield()? Not that I remember. I haven't touched that for years :) Nigel - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch] high-res timers: resume fix
find updated patch below - only the patch description changed: i removed the 'UP' thing (patch has relevance on SMP too), and added Thomas' ack. Ingo > Subject: [patch] high-res timers: resume fix From: Ingo Molnar <[EMAIL PROTECTED]> Soeren Sonnenburg reported that upon resume he is getting this backtrace: [] smp_apic_timer_interrupt+0x57/0x90 [] retrigger_next_event+0x0/0xb0 [] apic_timer_interrupt+0x28/0x30 [] retrigger_next_event+0x0/0xb0 [] __kfifo_put+0x8/0x90 [] on_each_cpu+0x35/0x60 [] clock_was_set+0x18/0x20 [] timekeeping_resume+0x7c/0xa0 [] __sysdev_resume+0x11/0x80 [] sysdev_resume+0x47/0x80 [] device_power_up+0x5/0x10 it turns out that on resume we mistakenly re-enable interrupts. Do the timer retrigger only on the current CPU. Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]> Acked-by: Thomas Gleixner <[EMAIL PROTECTED]> --- include/linux/hrtimer.h |3 +++ kernel/hrtimer.c| 12 2 files changed, 15 insertions(+) Index: linux/include/linux/hrtimer.h === --- linux.orig/include/linux/hrtimer.h +++ linux/include/linux/hrtimer.h @@ -206,6 +206,7 @@ struct hrtimer_cpu_base { struct clock_event_device; extern void clock_was_set(void); +extern void hres_timers_resume(void); extern void hrtimer_interrupt(struct clock_event_device *dev); /* @@ -236,6 +237,8 @@ static inline ktime_t hrtimer_cb_get_tim */ static inline void clock_was_set(void) { } +static inline void hres_timers_resume(void) { } + /* * In non high resolution mode the time reference is taken from * the base softirq time variable. 
Index: linux/kernel/hrtimer.c === --- linux.orig/kernel/hrtimer.c +++ linux/kernel/hrtimer.c @@ -459,6 +459,18 @@ void clock_was_set(void) } /* + * During resume we might have to reprogram the high resolution timer + * interrupt (on the local CPU): + */ +void hres_timers_resume(void) +{ + WARN_ON_ONCE(num_online_cpus() > 1); + + /* Retrigger the CPU local events: */ + retrigger_next_event(NULL); +} + +/* * Check, whether the timer is on the callback pending list */ static inline int hrtimer_cb_pending(const struct hrtimer *timer) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] high-res timers: UP resume fix
On Saturday, 7 April 2007 11:47, Ingo Molnar wrote: > > * Rafael J. Wysocki <[EMAIL PROTECTED]> wrote: > > > > We do in on_each_cpu() unconditionally. I missed that. > > > > BTW, the on_each_cpu() in clock_was_set() is unnecessary, because > > timekeeping_resume() is always run on one CPU. > > yes - but that's not the only place where we do clock_was_set(), and the > on_each_cpu() is necessary in every other case. So i think the right > solution was the patch i did: to split the resume functionality from the > clock_was_set() functionality. Agreed. Rafael
Re: [patch] high-res timers: UP resume fix
On Sat, 2007-04-07 at 11:47 +0200, Ingo Molnar wrote: > * Rafael J. Wysocki <[EMAIL PROTECTED]> wrote: > > > > We do in on_each_cpu() unconditionally. I missed that. > > > > BTW, the on_each_cpu() in clock_was_set() is unnecessary, because > > timekeeping_resume() is always run on one CPU. > > yes - but that's not the only place where we do clock_was_set(), and the > on_each_cpu() is necessary in every other case. So i think the right > solution was the patch i did: to split the resume functionality from the > clock_was_set() functionality. Right, I reused it and just did not notice, that interrupts are enabled unconditionally in on_each_cpu(). tglx
Re: [patch] high-res timers: resume fix
On Saturday, 7 April 2007 11:49, Ingo Molnar wrote: > > find updated patch below - only the patch description changed: i removed > the 'UP' thing (patch has relevance on SMP too), and added Thomas' ack. > > Ingo > > > > Subject: [patch] high-res timers: resume fix > From: Ingo Molnar <[EMAIL PROTECTED]> > > Soeren Sonnenburg reported that upon resume he is getting > this backtrace: > > [] smp_apic_timer_interrupt+0x57/0x90 > [] retrigger_next_event+0x0/0xb0 > [] apic_timer_interrupt+0x28/0x30 > [] retrigger_next_event+0x0/0xb0 > [] __kfifo_put+0x8/0x90 > [] on_each_cpu+0x35/0x60 > [] clock_was_set+0x18/0x20 > [] timekeeping_resume+0x7c/0xa0 > [] __sysdev_resume+0x11/0x80 > [] sysdev_resume+0x47/0x80 > [] device_power_up+0x5/0x10 > > it turns out that on resume we mistakenly re-enable interrupts. > Do the timer retrigger only on the current CPU. > > Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]> > Acked-by: Thomas Gleixner <[EMAIL PROTECTED]> > --- > include/linux/hrtimer.h |3 +++ > kernel/hrtimer.c| 12 > 2 files changed, 15 insertions(+) > > Index: linux/include/linux/hrtimer.h > === > --- linux.orig/include/linux/hrtimer.h > +++ linux/include/linux/hrtimer.h > @@ -206,6 +206,7 @@ struct hrtimer_cpu_base { > struct clock_event_device; > > extern void clock_was_set(void); > +extern void hres_timers_resume(void); > extern void hrtimer_interrupt(struct clock_event_device *dev); > > /* > @@ -236,6 +237,8 @@ static inline ktime_t hrtimer_cb_get_tim > */ > static inline void clock_was_set(void) { } > > +static inline void hres_timers_resume(void) { } > + > /* > * In non high resolution mode the time reference is taken from > * the base softirq time variable. 
> Index: linux/kernel/hrtimer.c > === > --- linux.orig/kernel/hrtimer.c > +++ linux/kernel/hrtimer.c > @@ -459,6 +459,18 @@ void clock_was_set(void) > } > > /* > + * During resume we might have to reprogram the high resolution timer > + * interrupt (on the local CPU): > + */ > +void hres_timers_resume(void) > +{ > + WARN_ON_ONCE(num_online_cpus() > 1); > + > + /* Retrigger the CPU local events: */ > + retrigger_next_event(NULL); > +} > + > +/* > * Check, whether the timer is on the callback pending list > */ > static inline int hrtimer_cb_pending(const struct hrtimer *timer) > - Hm, I'm probably missing something obvious, but where is it going to be called from? Rafael - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] dtlk: fix error checks in module_init()
This patch fixes two things in module_init.

- fix register_chrdev() error check

  Currently dtlk doesn't check register_chrdev() failure correctly.
  register_chrdev() returns an errno on failure.

- check probe failure

  dtlk ignores probe failure and allows the module to load without such a
  device. I got a "Trying to free nonexistent resource" message from
  release_region() when unloading the module without the device.

Signed-off-by: Akinobu Mita <[EMAIL PROTECTED]>
Cc: Chris Pallotta <[EMAIL PROTECTED]>
Cc: Jim Van Zandt <[EMAIL PROTECTED]>
---
 drivers/char/dtlk.c |7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

Index: 2.6-mm/drivers/char/dtlk.c
===
--- 2.6-mm.orig/drivers/char/dtlk.c
+++ 2.6-mm/drivers/char/dtlk.c
@@ -324,16 +324,22 @@ static int dtlk_release(struct inode *in
 
 static int __init dtlk_init(void)
 {
+	int err;
+
 	dtlk_port_lpc = 0;
 	dtlk_port_tts = 0;
 	dtlk_busy = 0;
 	dtlk_major = register_chrdev(0, "dtlk", &dtlk_fops);
-	if (dtlk_major == 0) {
+	if (dtlk_major < 0) {
 		printk(KERN_ERR "DoubleTalk PC - cannot register device\n");
-		return 0;
+		return -EBUSY;
+	}
+	err = dtlk_dev_probe();
+	if (err) {
+		unregister_chrdev(dtlk_major, "dtlk");
+		return err;
 	}
-	if (dtlk_dev_probe() == 0)
-		printk(", MAJOR %d\n", dtlk_major);
+	printk(", MAJOR %d\n", dtlk_major);
 
 	init_waitqueue_head(&dtlk_process_list);
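The control flow the patch introduces can be modelled in user space roughly as follows. This is only a sketch of the error-unwinding idea and not driver code. All demo_* names are invented, the fake major number 42 is arbitrary, and unlike the patch (which returns -EBUSY) the sketch simply passes the registration error through.

```c
#include <assert.h>
#include <errno.h>

static int demo_registered;	/* 1 while the fake device is registered */

/* Hypothetical stand-in for register_chrdev(0, ...): major or -errno. */
static int demo_register(int fail)
{
	if (fail)
		return -EBUSY;
	demo_registered = 1;
	return 42;		/* arbitrary fake major number */
}

/* Hypothetical stand-in for unregister_chrdev(). */
static void demo_unregister(void)
{
	assert(demo_registered);	/* never release what was not acquired */
	demo_registered = 0;
}

/* Hypothetical stand-in for dtlk_dev_probe(): 0 or -errno. */
static int demo_probe(int fail)
{
	return fail ? -ENODEV : 0;
}

/* Model of the patched init: test for < 0, undo registration if probe fails. */
static int demo_init(int register_fails, int probe_fails)
{
	int major, err;

	major = demo_register(register_fails);
	if (major < 0)
		return major;

	err = demo_probe(probe_fails);
	if (err) {
		demo_unregister();
		return err;
	}
	return 0;
}
```

The point of the second check is that a failed init leaves nothing registered, so module unload never has to release a resource that was not acquired.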
Re: [PATCH, take4] FUTEX : new PRIVATE futexes
On Sat, 07 Apr 2007 19:30:14 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote: > Eric Dumazet wrote: > > > > - Current mm code has a problem with 64bit futexes, as spotted by Nick : > > > > get_futex_key() does a check against sizeof(u32) regardless of futex being > > 64bits or not. > > So it is possible a 64bit futex spans two pages of memory... > > I had to change get_futex_key() prototype to be able to do a correct test. > > I wonder if it should be enforcing alignment to keep it on 1 page? I believe I just did that :) Before the patch : Alignment was only 4 bytes for all futexes, but some user app could trigger a kernel bug (since one 64bit futex could sit on two different pages, and so possibly in separate vmas, the inode refcounting was wrong, and access_ok did not do a correct check) After the patch : Alignment is 8 bytes for 64 bit futexes, 4 bytes for 32bit futexes. All futexes are constrained to be in one single page.
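To make the before/after concrete, here is a small user-space sketch of the alignment rule described above. It is not the get_futex_key() code. DEMO_PAGE_SIZE, spans_two_pages() and futex_addr_ok() are names invented for the illustration, and a 4096-byte page is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096u	/* assumed page size for the demo */

/* True if the object [addr, addr + size) crosses a page boundary. */
static bool spans_two_pages(uintptr_t addr, unsigned int size)
{
	return (addr / DEMO_PAGE_SIZE) != ((addr + size - 1) / DEMO_PAGE_SIZE);
}

/*
 * Natural-alignment rule as described after the patch:
 * 4-byte futexes on 4-byte boundaries, 8-byte futexes on 8-byte boundaries.
 */
static bool futex_addr_ok(uintptr_t addr, unsigned int size)
{
	assert(size == 4 || size == 8);
	return (addr % size) == 0;
}
```

With only 4-byte alignment enforced, a 64bit futex at offset 4092 of a page passes the check yet straddles the page boundary. With natural alignment that address is rejected, and since the page size is a multiple of 8, every accepted address keeps the futex inside one page.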
[patch, take #3] high-res timers: resume fix
* Rafael J. Wysocki <[EMAIL PROTECTED]> wrote: > Hm, I'm probably missing something obvious, but where is it going to > be called from? doh! :) Find new patch below :-/ Soeren, please test this one. Ingo > Subject: [patch] high-res timers: resume fix From: Ingo Molnar <[EMAIL PROTECTED]> Soeren Sonnenburg reported that upon resume he is getting this backtrace: [] smp_apic_timer_interrupt+0x57/0x90 [] retrigger_next_event+0x0/0xb0 [] apic_timer_interrupt+0x28/0x30 [] retrigger_next_event+0x0/0xb0 [] __kfifo_put+0x8/0x90 [] on_each_cpu+0x35/0x60 [] clock_was_set+0x18/0x20 [] timekeeping_resume+0x7c/0xa0 [] __sysdev_resume+0x11/0x80 [] sysdev_resume+0x47/0x80 [] device_power_up+0x5/0x10 it turns out that on resume we mistakenly re-enable interrupts. Do the timer retrigger only on the current CPU. Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]> Acked-by: Thomas Gleixner <[EMAIL PROTECTED]> --- include/linux/hrtimer.h |3 +++ kernel/hrtimer.c| 12 kernel/timer.c |2 +- 3 files changed, 16 insertions(+), 1 deletion(-) Index: linux/include/linux/hrtimer.h === --- linux.orig/include/linux/hrtimer.h +++ linux/include/linux/hrtimer.h @@ -206,6 +206,7 @@ struct hrtimer_cpu_base { struct clock_event_device; extern void clock_was_set(void); +extern void hres_timers_resume(void); extern void hrtimer_interrupt(struct clock_event_device *dev); /* @@ -236,6 +237,8 @@ static inline ktime_t hrtimer_cb_get_tim */ static inline void clock_was_set(void) { } +static inline void hres_timers_resume(void) { } + /* * In non high resolution mode the time reference is taken from * the base softirq time variable. 
Index: linux/kernel/hrtimer.c === --- linux.orig/kernel/hrtimer.c +++ linux/kernel/hrtimer.c @@ -459,6 +459,18 @@ void clock_was_set(void) } /* + * During resume we might have to reprogram the high resolution timer + * interrupt (on the local CPU): + */ +void hres_timers_resume(void) +{ + WARN_ON_ONCE(num_online_cpus() > 1); + + /* Retrigger the CPU local events: */ + retrigger_next_event(NULL); +} + +/* * Check, whether the timer is on the callback pending list */ static inline int hrtimer_cb_pending(const struct hrtimer *timer) Index: linux/kernel/timer.c === --- linux.orig/kernel/timer.c +++ linux/kernel/timer.c @@ -1016,7 +1016,7 @@ static int timekeeping_resume(struct sys clockevents_notify(CLOCK_EVT_NOTIFY_RESUME, NULL); /* Resume hrtimers */ - clock_was_set(); + hres_timers_resume(); return 0; } - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: kernel OOPSes when changing DVB-T adapter in 2.6.21-rc3
Hi, I can confirm this fixes the problem. Please include it in 2.6.21. Michal On Friday, 6 April 2007 21:58, Markus Rechberger wrote: > I committed a patch which fixes this issue: > > http://mcentral.de/hg/~mrec/v4l-dvb-stable/ > > That problem got caused by releasing data structures which are still > in use when the device gets unplugged. These patches delay the > deallocation of the data till the last user releases its reference to > the dvb nodes. > > This patch got tested with devices which use the dvb-usb framework as > well as em28xx based dvb devices. > > Markus > > On 3/16/07, Oliver Neukum <[EMAIL PROTECTED]> wrote: > > On Friday, 16 March 2007 10:13, CIJOML wrote: > > > Hi, > > > > > > looks like more general problem with 2.6.21-rc3. This happens when I > > > remove my PCMCIA USB2.0/IEEE1394 adapter from slot: > > > > Yes, the more important it is to know whether -rc2 works. > > And please report this as a generic sysfs failure to lkml. > > > > Regards > > Oliver
Re: [PATCH] timekeeping: drop irq-context clocksource polling
On Thu, 05 Apr 2007 14:03:16 -0700 Daniel Walker <[EMAIL PROTECTED]> wrote: > Before this change the timekeeping code would poll the clocksource > list every interrupt. This changes that so the clocksource list is > only checked when there has been an update, and no longer checks > in interrupt context. I get a complete lockup on i386 SMP - before the kernel has printed anything. I suspect a recursive taking of xtime_lock.
Re: [patch, take #3] high-res timers: resume fix
On Sat, 2007-04-07 at 12:05 +0200, Ingo Molnar wrote: > * Rafael J. Wysocki <[EMAIL PROTECTED]> wrote: > > > Hm, I'm probably missing something obvious, but where is it going to > > be called from? > > doh! :) Find new patch below :-/ Soeren, please test this one. OK, I did about 5 suspend/resume cycles with CONFIG_HPET_TIMER=y CONFIG_HPET_EMULATE_RTC=y CONFIG_HPET=y CONFIG_HPET_MMAP=y and no oops / no problem ... So I guess the fix take #3 is good :-) One not directly related to this patch (but probably all the timer stuff) I noticed with -rc6 is that it takes 10 seconds to suspend (it was ~2 seconds before) Soeren > Ingo > > > > Subject: [patch] high-res timers: resume fix > From: Ingo Molnar <[EMAIL PROTECTED]> > > Soeren Sonnenburg reported that upon resume he is getting > this backtrace: > > [] smp_apic_timer_interrupt+0x57/0x90 > [] retrigger_next_event+0x0/0xb0 > [] apic_timer_interrupt+0x28/0x30 > [] retrigger_next_event+0x0/0xb0 > [] __kfifo_put+0x8/0x90 > [] on_each_cpu+0x35/0x60 > [] clock_was_set+0x18/0x20 > [] timekeeping_resume+0x7c/0xa0 > [] __sysdev_resume+0x11/0x80 > [] sysdev_resume+0x47/0x80 > [] device_power_up+0x5/0x10 > > it turns out that on resume we mistakenly re-enable interrupts. > Do the timer retrigger only on the current CPU. 
> > Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]> > Acked-by: Thomas Gleixner <[EMAIL PROTECTED]> > --- > include/linux/hrtimer.h |3 +++ > kernel/hrtimer.c| 12 > kernel/timer.c |2 +- > 3 files changed, 16 insertions(+), 1 deletion(-) > > Index: linux/include/linux/hrtimer.h > === > --- linux.orig/include/linux/hrtimer.h > +++ linux/include/linux/hrtimer.h > @@ -206,6 +206,7 @@ struct hrtimer_cpu_base { > struct clock_event_device; > > extern void clock_was_set(void); > +extern void hres_timers_resume(void); > extern void hrtimer_interrupt(struct clock_event_device *dev); > > /* > @@ -236,6 +237,8 @@ static inline ktime_t hrtimer_cb_get_tim > */ > static inline void clock_was_set(void) { } > > +static inline void hres_timers_resume(void) { } > + > /* > * In non high resolution mode the time reference is taken from > * the base softirq time variable. > Index: linux/kernel/hrtimer.c > === > --- linux.orig/kernel/hrtimer.c > +++ linux/kernel/hrtimer.c > @@ -459,6 +459,18 @@ void clock_was_set(void) > } > > /* > + * During resume we might have to reprogram the high resolution timer > + * interrupt (on the local CPU): > + */ > +void hres_timers_resume(void) > +{ > + WARN_ON_ONCE(num_online_cpus() > 1); > + > + /* Retrigger the CPU local events: */ > + retrigger_next_event(NULL); > +} > + > +/* > * Check, whether the timer is on the callback pending list > */ > static inline int hrtimer_cb_pending(const struct hrtimer *timer) > Index: linux/kernel/timer.c > === > --- linux.orig/kernel/timer.c > +++ linux/kernel/timer.c > @@ -1016,7 +1016,7 @@ static int timekeeping_resume(struct sys > clockevents_notify(CLOCK_EVT_NOTIFY_RESUME, NULL); > > /* Resume hrtimers */ > - clock_was_set(); > + hres_timers_resume(); > > return 0; > } > -- Sometimes, there's a moment as you're waking, when you become aware of the real world around you, but you're still dreaming. 
Re: [PATCH] console UTF-8 fixes
Hi, I just wanted to give my opinion on things... (and enable utf8 to read this properly) On Apr 7 2007 11:24, Egmont Koblinger wrote: > >> I strongly disagree. First of all, you're changing the semantics of a >> 13-year-old API. The semantics of the Linux console is that by >> specifying U+FFFD SUBSTITUTION GLYPH in your unicode table, you have >> specified the fallback glyph. > >OK, I'm not against using U+FFFD for missing glyphs. In the mean time I >think it's still a good idea to clearly separate the two cases in the code >(that is, the case of invalid sequence from the case of missing glyph), but >we can still use the same replacement character in these two cases. I'll >send an updated patch after Easter if it sounds good for you. I am quite ok with the way things are right now. - vc displays for illegal sequences - vc displays e.g. "U" (latin capital U) in place when Û (latin capital U with accent circumflex) is not available in this font (determined by the unicodemap) (I do use an unicode map, because I use a 4096-byte cp437 "DOS" font which requires one) - vc displays for sequences it does not know how to print - xterm displays for illegal sequences - xterm seems to display on undefined glyphs (U+DFFF for ex., using the "Unicode Best" font from the xterm menu) - xterm seems to display nothing on undefined glyphs (U+E000 for ex., "Unicode Best" again) >> What's worse, you've hard-coded the uses of specific visual >> representations. That is completely unacceptable. > >Now that we've dropped the idea of "dot" for missing glyphs, the other thing > >[...] > >Sorry, I wasn't clear enough and I think you misunderstood me. The symbol I >choose for fallback is still '?' (the ASCII question mark), I just invert >the color attributes of the cell where this is printed. This way it becomes >visually distinguisable from the literal question mark. 
Using the current >kernel you just cannot know whether the character printed is a real question >mark, or a replacement glyph. Still, should you strongly disagree with this >decision, the color inverting part can easily be removed. Please, no dot, and no inverse color. Imagine someone had the following bitmap for : #### #### #### #### #### #### Then inverting that again would be susceptible to confusion with the regular '?' at 0x3F. (cp437 for example maps unknown/illegal to 0xFD which happens to be the block graphic '■', but YMMV depending on font.) >I think I've (mostly) described it above. Set everything to UTF-8, load a >latin2 font (containing 256 glyphs, e.g. "setfont lat2-16"), make an >application print U+00FB (alt + numpad 251 is one trivial way), you'll see >an "u with double accent", though the symbol to be displayed is "u with >circumflex". This isn't present in the current font, so the replacement >character should appear, not a different letter. I blame your latin2 unicode map. (See above about 'Û'.) It should perhaps display a regular 'u' if it cannot display 'û', but definitely not 'ü' (which is not called a double accent, btw). >> To be able to do CJK you need something like Kon anyway. This feels >> like bloat. > >I don't want CJK support. All that I want is to be able to edit English >words within a file that contains mixture of English and CJK, with a text >editor like vim or joe. +1 for this one :) xterm## echo "韓国と日本にようこそ!" >/tmp/foobar.txt vc## cat foobar.txt currently gets things not so right, because multibyte characters are not displayed with as many as they are wide. Jan
Re: Two questions regarding Opening files within Kernel!
On Apr 7 2007 06:58, JanuGerman wrote:

>Hi Every one,
>
> I have got two questions regarding opening files within the Linux
> kernel. If some body can help me, in sorting out this problem, i will
> be very thankful.
>
>1) I have just a file path with me, an absolute path, but no dentry,
> no inode, no vfsmount object, which function i can call to get a
> "file" object associated with the absolute file path. I have surfed
> around the source code especially fs/open.c and some other files,
> but each function requires a parameter "mode" and "fd" beside file
> path. Actually, i was confused about the "mode" parameter (and its
> difference with "flag"), like what to send, and secondly for "fd", i
> am not sure, what value to send as there is no file in fact and only
> file path exists. Any idea?

Not sure if this is the right function, but it should get you started...

 struct dentry *fbar = lookup_one_len("/foo/bar", current->fs->root);

>2) Any functionality within linux kernel source code, to read one line
> per file? or some indirect way to set buffer size for one read?.
> That is, any existing header file for doing text I/O rather than
> binary within the kernel source code?

http://kernelnewbies.org/FAQ/WhyWritingFilesFromKernelIsBad
(same goes for reading)

Jan
Re: [PATCH, take4] FUTEX : new PRIVATE futexes
On Sat, Apr 07, 2007 at 10:43:39AM +0200, Eric Dumazet wrote:
> get_futex_key() does a check against sizeof(u32) regardless of futex being
> 64bits or not.
> So it is possible a 64bit futex spans two pages of memory...

That would be a user bug. 32-bit futexes have to be 32-bit aligned,
64-bit futexes have to be 64-bit aligned.

 Jakub
Re: broken device locking, sg vs. sg_io on block devices
#include

First, we (me and Thomas Schmidt) are working on a draft for a mandatory
locking scheme which will take care of the most racy situations even
without having a proper in-kernel solution.

But you need to explain some things, otherwise we cannot rely on your words.

> (open has side effects relocking doesnt)

What exactly does that mean in our scope? Can we do the following without
having side effects:

 open("/dev/sr0",O_EXCL|O_RDWR); /* no matter what it returns */
 fcntl(..., F_SETLK); /* no matter what it returns */
 ioctl(f, SCSI_IOCTL_GET_IDLUN, &x);
 ioctl(f, SCSI_IOCTL_GET_BUS_NUMBER, &jo);

Can you guarantee us that bit? Or shall we really implement ugly
workarounds to avoid every open call? Note that "just do like UUCP guys"
is not as easy or reliable as people may pretend.

Eduard.

--
Well, it's a garbage collector after all. It even fetches the garbage from
the sky. (Heise troll forum on Java in aircraft control)
compressing intermediate files with LZO on the fly
Willy Tarreau wrote:
>
> ... for some usages (temporary space),
> light compression can increase speed. For instance, when processing logs,
> I get better speed by compressing intermediate files with LZO on the fly.

How can you do that on ext3?

Also, can you do that on a partition block-io level?

Thanks!

--
Al
[PATCH nf-2.6.22] [netfilter] early_drop improvement
When the number of conntracks reaches the nf_conntrack_max limit,
early_drop() is called and tries to free one of the already used conntracks
in one of the hash buckets. If it does not find any conntrack that may be
freed, this leads to transmission errors. However, this is not fair: the
current hash bucket may be empty while neighbouring buckets hold conntracks
that could be freed. On the other hand, the number of checked conntracks is
not limited, which can cause a long delay.

The following patch limits the number of checked conntracks to the average
number of conntracks in one hash bucket and allows searching for conntracks
in other hash buckets.

Signed-off-by: Vasily Averin <[EMAIL PROTECTED]>

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index e132c8a..d0b5794 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -525,7 +525,7 @@ EXPORT_SYMBOL_GPL(nf_conntrack_tuple_taken);
 
 /* There's a small race here where we may free a just-assured
    connection. Too bad: we're in trouble anyway. */
-static int early_drop(struct list_head *chain)
+static int __early_drop(struct list_head *chain, unsigned int *cnt)
 {
 	/* Traverse backwards: gives us oldest, which is roughly LRU */
 	struct nf_conntrack_tuple_hash *h;
@@ -540,6 +540,10 @@ static int early_drop(struct list_head *chain)
 			atomic_inc(&ct->ct_general.use);
 			break;
 		}
+		if (!--(*cnt)) {
+			dropped = 1;
+			break;
+		}
 	}
 	read_unlock_bh(&nf_conntrack_lock);
 
@@ -555,6 +559,21 @@ static int early_drop(struct list_head *chain)
 	return dropped;
 }
 
+static int early_drop(const struct nf_conntrack_tuple *orig)
+{
+	unsigned int i, hash, cnt;
+	int ret = 0;
+
+	hash = hash_conntrack(orig);
+	cnt = (nf_conntrack_max/nf_conntrack_htable_size) + 1;
+
+	for (i = 0;
+	     !ret && i < nf_conntrack_htable_size;
+	     ++i, hash = ++hash % nf_conntrack_htable_size)
+		ret = __early_drop(&nf_conntrack_hash[hash], &cnt);
+	return ret;
+}
+
 static struct nf_conn *
 __nf_conntrack_alloc(const struct nf_conntrack_tuple *orig,
 		     const struct nf_conntrack_tuple *repl,
@@ -574,9 +593,7 @@ __nf_conntrack_alloc(const struct nf_conntrack_tuple *orig,
 
 	if (nf_conntrack_max
 	    && atomic_read(&nf_conntrack_count) > nf_conntrack_max) {
-		unsigned int hash = hash_conntrack(orig);
-		/* Try dropping from this hash chain. */
-		if (!early_drop(&nf_conntrack_hash[hash])) {
+		if (!early_drop(orig)) {
 			atomic_dec(&nf_conntrack_count);
 			if (net_ratelimit())
 				printk(KERN_WARNING
Re: [PATCH, take4] FUTEX : new PRIVATE futexes
Jakub Jelinek wrote:
> On Sat, Apr 07, 2007 at 10:43:39AM +0200, Eric Dumazet wrote:
>> get_futex_key() does a check against sizeof(u32) regardless of futex being
>> 64bits or not.
>> So it is possible a 64bit futex spans two pages of memory...
>
> That would be a user bug. 32-bit futexes have to be 32-bit aligned,
> 64-bit futexes have to be 64-bit aligned.

I am not sure what you want to say.

A user doing sys_futex64(0x..FFC, FUTEX_WAKE_OP, ...) and crashing the
kernel or corrupting data is ok because it's a user bug?

The user is allowed to do anything; the kernel must check and protect
innocents.
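The two positions in this exchange can be made concrete with a small userspace sketch. It is not the kernel's get_futex_key(); the helper names, the example address 0x1FFC and the 4096-byte page size are assumptions chosen only for illustration. It shows that a naturally aligned futex word can never straddle a page, while a 64-bit word at an address ending in FFC both fails the 64-bit alignment test and crosses a page boundary, which is the case a kernel-side check would have to reject.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function): true when a futex word of
 * `size` bytes at user address `uaddr` is naturally aligned.
 */
static bool futex_word_aligned(uintptr_t uaddr, size_t size)
{
    /* size must be a non-zero power of two (4 or 8 in this discussion) */
    assert(size != 0 && (size & (size - 1)) == 0);
    return (uaddr & (size - 1)) == 0;
}

/*
 * Hypothetical helper: true when the bytes [uaddr, uaddr + size) straddle
 * a page boundary for the given (power-of-two) page size.
 */
static bool futex_word_crosses_page(uintptr_t uaddr, size_t size,
                                    size_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return (uaddr & (page_size - 1)) + size > page_size;
}
```

A caller would reject the request when futex_word_aligned(uaddr, 8) is false for a 64-bit futex, rather than testing against sizeof(u32) for both widths.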
Re: compressing intermediate files with LZO on the fly
Hi Al,

On Sat, Apr 07, 2007 at 02:32:34PM +0300, Al Boldi wrote:
> Willy Tarreau wrote:
> >
> > ... for some usages (temporary space),
> > light compression can increase speed. For instance, when processing logs,
> > I get better speed by compressing intermediate files with LZO on the fly.
>
> How can you do that on ext3?
>
> Also, can you do that on a partition block-io level?

No, sorry for the confusion. My scripts simply do :

 $ lzop -cd file1.lzo | process | lzop -c3 > file2.lzo

With decent CPU, you can reach higher read/write data rates than what a
single off-the-shelf disk can achieve. For this reason, I think that
reiser4 would be worth trying for this particular usage. And in this case,
I'm not interested at all in reliability. It's just temporary storage. If
the disk fails, I throw it away and buy a new one.

Cheers,
Willy
Re: [PATCH nf-2.6.22] [netfilter] early_drop improvement
Vasily Averin wrote:
> When the number of conntracks reaches the nf_conntrack_max limit,
> early_drop() is called and tries to free one of the already used conntracks
> in one of the hash buckets. If it does not find any conntrack that may be
> freed, this leads to transmission errors. However, this is not fair: the
> current hash bucket may be empty while neighbouring buckets hold conntracks
> that could be freed. On the other hand, the number of checked conntracks is
> not limited, which can cause a long delay.
>
> The following patch limits the number of checked conntracks to the average
> number of conntracks in one hash bucket and allows searching for conntracks
> in other hash buckets.

Hi Vasily

> 			atomic_inc(&ct->ct_general.use);
> 			break;
> 		}
> +		if (!--(*cnt)) {
> +			dropped = 1;
> +			break;
> +		}

> +	cnt = (nf_conntrack_max/nf_conntrack_htable_size) + 1;

I am sorry but this won't help in the case you mentioned in an earlier mail:

If nf_conntrack_max < nf_conntrack_htable_size, cnt will be set to 1.

Then in __early_drop() you end up breaking the
list_for_each_entry_reverse() loop after the first element was tested!

Not what you intended I'm afraid, because you won't even scan the whole
chain as before your patch :(

I believe you should not test --cnt in __early_drop() but in the caller.
(That is, not counting the number of found cells, but the number of hash
chains you tried.)
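The suggestion above, to bound the number of hash chains tried rather than the number of entries inspected, can be sketched as a userspace model. This is only an illustration under stated assumptions: struct chain_model, drop_from_chain() and early_drop_model() are hypothetical names, each chain is reduced to two counters instead of a real list of conntracks, and no locking is modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one hash chain: only counters, no real list. */
struct chain_model {
    unsigned int entries;   /* conntracks currently on this chain */
    unsigned int droppable; /* how many of them are not yet assured */
};

/* Scan one chain completely; drop one droppable entry if there is any. */
static int drop_from_chain(struct chain_model *c)
{
    if (c->droppable == 0)
        return 0;
    c->droppable--;
    c->entries--;
    return 1;
}

/*
 * Try at most `max_chains` chains, starting at bucket `start` and wrapping
 * around the table. The budget counts chains, so every visited chain is
 * still scanned in full, as it was before the patch.
 */
static int early_drop_model(struct chain_model *table, unsigned int size,
                            unsigned int start, unsigned int max_chains)
{
    unsigned int hash, tried;

    assert(size > 0);
    hash = start % size;
    for (tried = 0; tried < max_chains && tried < size; tried++) {
        if (drop_from_chain(&table[hash]))
            return 1;
        hash = (hash + 1) % size; /* advance without modifying hash twice */
    }
    return 0;
}
```

With a table whose first two buckets hold nothing droppable, a budget of two chains fails and a budget of three succeeds on the third bucket, regardless of how nf_conntrack_max compares to the table size.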
[dm-devel] bio too big device md1 (16 > 8)
Hi,

i'm using 2.6.21-rc5-git9 +
http://www.kernel.org/pub/linux/kernel/people/agk/patches/2.6/editing/dm-merge-max_hw_sector.patch
( i've been testing with and without it, and first encountered it on
2.6.18-debian )

I've setup a raid1 array md1 (it was created in a degraded mode using the
debian installer)
(md0 is also a small raid1 array created in degraded mode, but i did not
have any issue with it)
md1 hold a lvm physical volume holding a vg and several lvs

mdadm -D /dev/md1:
/dev/md1:
        Version : 00.90.03
  Creation Time : Sun Mar 25 16:34:42 2007
     Raid Level : raid1
     Array Size : 290607744 (277.15 GiB 297.58 GB)
    Device Size : 290607744 (277.15 GiB 297.58 GB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Tue Apr 3 01:37:23 2007
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           UUID : af8d2807:e573935d:04be1e12:bc7defbb
         Events : 0.422096

    Number   Major   Minor   RaidDevice State
       0       3       3        0      active sync   /dev/hda3
       1       0       0        1      removed

the problem i'm encountering is when i add /dev/md2 to /dev/md1.

mdadm -D /dev/md2
/dev/md2:
        Version : 00.90.03
  Creation Time : Sun Apr 1 15:06:43 2007
     Raid Level : linear
     Array Size : 290607808 (277.15 GiB 297.58 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 2
    Persistence : Superblock is persistent

    Update Time : Sun Apr 1 15:06:43 2007
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

       Rounding : 64K

           UUID : 887ecdeb:5f205eb6:4cd470d6:4cbda83c (local to host odo)
         Events : 0.1

    Number   Major   Minor   RaidDevice State
       0      34       4        0      active sync   /dev/hdg4
       1      57       2        1      active sync   /dev/hdk2
       2      91       3        2      active sync   /dev/hds3
       3      89       2        3      active sync   /dev/hdo2

I use mdadm --manage --add /dev/md1 /dev/md2

when I do so here is what happen:

md: bind
RAID1 conf printout:
 --- wd:1 rd:2
 disk 0, wo:0, o:1, dev:hda3
 disk 1, wo:1, o:1, dev:md2
md: syncing RAID array md1
md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
md: using maximum available idle IO bandwidth (but not more than 20 KB/sec) for reconstruction.
md: using 128k window, over a total of 290607744 blocks.
bio too big device md1 (16 > 8)
Device dm-7, XFS metadata write error block 0x243ec0 in dm-7
bio too big device md1 (16 > 8)
I/O error in filesystem ("dm-8") meta-data dev dm-8 block 0x1b5b6550 ("xfs_trans_read_buf") error 5 buf count 8192
bio too big device md1 (16 > 8)
I/O error in filesystem ("dm-8") meta-data dev dm-8 block 0x1fb3b00 ("xfs_trans_read_buf") error 5 buf count 8192

every filesystems on md1 get corrupted.
I manually fail md2 then reboot and so i can boot the fs again.
(but md1 is still degraded)

Any idea ?

I can provide more information if needed. (the only weird thing is
/dev/hdo that doesn't seem to be lba48-ready, but i guess that shouldn't be
a geometry issue.)
Re: mm snapshot broken-out-2007-04-07-03-27.tar.gz uploaded
[EMAIL PROTECTED] napisał(a): > The mm snapshot broken-out-2007-04-07-03-27.tar.gz has been uploaded to > > > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/mm/broken-out-2007-04-07-03-27.tar.gz > > It contains the following patches against 2.6.21-rc6: > LTP triggered a ptrace problem. [ cut here ] kernel BUG at kernel/ptrace.c:1281! invalid opcode: [#1] PREEMPT SMP last sysfs file: devices/platform/w83627hf.656/temp2_input Modules linked in: ipt_MASQUERADE iptable_nat nf_nat nfsd exportfs lockd nfs_acl autofs4 sunrpc af_packet nf_conntrack_netbios_ns ipt_REJECT nf_conntrack_ipv4 xt_state nf_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 binfmt_misc thermal processor fan container nvram snd_intel8x0 snd_ac97_codec ac97_bus snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss intel_agp snd_pcm agpgart evdev snd_timer snd soundcore i2c_i801 snd_page_alloc ide_cd cdrom rtc unix CPU:1 EIP:0060:[]Not tainted VLI EFLAGS: 00010202 (2.6.21-rc6-mm1 #1) EIP is at ptrace_do_wait+0x1eb/0x510 l *0xc0163566 0xc0163566 is in ptrace_do_wait (kernel/ptrace.c:1281). 1276 1277pr_debug("%d ptrace_do_wait (%d) found %d code %x (%u/%d)\n", 1278 current->pid, tsk->pid, p->pid, exit_code, 1279 p->exit_state, p->exit_signal); 1280 1281NO_LOCKS; 1282 1283/* 1284 * If there was a group exit in progress, all threads report that 1285 * status. Most will have SIGKILL in their own exit_code. eax: 0001 ebx: cff43550 ecx: c04b043c edx: 0001 esi: fff6 edi: 000c ebp: c96c5f10 esp: c96c5ee8 ds: 007b es: 007b fs: 00d8 gs: 0033 ss: 0068 Process ptrace01 (pid: 8610, ti=c96c4000 task=c9554b00 task.ti=c96c4000) Stack: 0002 0004 21a3 c9554b00 cc8aa808 c96c5f10 cff43550 0001 fff6 c96c5f80 c0127e10 bf833c78 0134 0004 21a3 c9554b00 0001 0001 c9554bbc Call Trace: [] do_wait+0x9d6/0xbad [] sys_wait4+0x30/0x32 [] sys_waitpid+0x27/0x29 [] syscall_call+0x7/0xb [] 0xb7f36410 === INFO: lockdep is turned off. 
Code: a9 9d 0b 00 85 c0 74 05 e8 87 b8 1e 00 89 e0 25 00 e0 ff ff 31 d2 83 78 14 00 0f 95 c2 b8 3c 04 4b c0 e8 86 9d 0b 00 85 c0 74 04 <0f> 0b eb fe 8b 83 5c 04 00 00 f6 40 54 08 74 03 8b 78 44 83 bb EIP: [] ptrace_do_wait+0x1eb/0x510 SS:ESP 0068:c96c5ee8 [ cut here ] kernel BUG at kernel/ptrace.c:494! invalid opcode: [#2] PREEMPT SMP last sysfs file: devices/platform/w83627hf.656/temp2_input Modules linked in: ipt_MASQUERADE iptable_nat nf_nat nfsd exportfs lockd nfs_acl autofs4 sunrpc af_packet nf_conntrack_netbios_ns ipt_REJECT nf_conntrack_ipv4 xt_state nf_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 binfmt_misc thermal processor fan container nvram snd_intel8x0 snd_ac97_codec ac97_bus snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss intel_agp snd_pcm agpgart evdev snd_timer snd soundcore i2c_i801 snd_page_alloc ide_cd cdrom rtc unix CPU:1 EIP:0060:[]Not tainted VLI EFLAGS: 00010202 (2.6.21-rc6-mm1 #1) EIP is at ptrace_exit+0x29/0x21d l *0xc0163f22 0xc0163f22 is in ptrace_exit (kernel/ptrace.c:494). 
489 ptrace_exit(struct task_struct *tsk) 490 { 491 struct list_head *pos, *n; 492 int restart; 493 494 NO_LOCKS; 495 496 /* 497 * Taking the task_lock after PF_EXITING is set ensures that a 498 * child in ptrace_traceme will not put itself on our list when eax: 0001 ebx: ecx: c04b1044 edx: 0001 esi: c96c5ee8 edi: c9554b00 ebp: c96c5d78 esp: c96c5d60 ds: 007b es: 007b fs: 00d8 gs: 0033 ss: 0068 Process ptrace01 (pid: 8610, ti=c96c4000 task=c9554b00 task.ti=c96c4000) Stack: c96c5d78 c96c5ee8 c9554b00 c96c5db8 c012820e 0001 0286 c96c5da8 c011fec8 0001 c04a007b c96c007b c02100d8 ff10 c96c5eb0 000b c96c5eb0 c96c5ee8 c03f0068 c96c5de8 c0105a77 Call Trace: [] do_exit+0x16b/0x86c [] die+0x206/0x22c [] do_trap+0x8a/0xa4 [] do_invalid_op+0x88/0x92 [] error_code+0x79/0x80 [] ptrace_do_wait+0x1eb/0x510 [] do_wait+0x9d6/0xbad [] sys_wait4+0x30/0x32 [] sys_waitpid+0x27/0x29 [] syscall_call+0x7/0xb [] 0xb7f36410 === INFO: lockdep is turned off. Code: 5d c3 55 89 e5 57 56 53 83 ec 0c 89 c7 89 e0 25 00 e0 ff ff 31 d2 83 78 14 00 0f 95 c2 b8 44 10 4b c0 e8 ca 93 0b 00 85 c0 74 04 <0f> 0b eb fe 8d 9f b8 04 00 00 89 d8 e8 e7 d5 1e 00 8d 87 40 0a EIP: [] ptrace_exit+0x29/0x21d SS:ESP 0068:c96c5d60 Fixing recursive fault but reboot is needed! BUG: scheduling while atomic: ptrace
Re: REISER4: fix for reiser4_write_extent
On 06.04.2007 00:42, Ignatich wrote:
> While trying to find the cause of problems with reiser4 in recent
> kernels I came across this. Incomplete write handling seem to be missing
> from reiser4_write_extent() thanks to reiser4-temp-fix.patch. Strangely,
> there is a patch by Edward Shishkin that should address that issue, but
> it is missing from -mm tree. Please check.
>
> Max

This patch was added to the -mm tree on 14 Dec 2006 (see
http://www.mail-archive.com/mm-commits@vger.kernel.org/msg05338.html).

It was then dropped from the -mm tree on 05 Mar 2007 (see
http://www.mail-archive.com/mm-commits@vger.kernel.org/msg10818.html),
with this comment: "This patch was dropped because it is obsolete"

No idea why it was obsolete. Does somebody know?

~~
laurent
Re: Reiser4. BEST FILESYSTEM EVER.
Hi Willy,...

> With decent CPU, you can reach higher read/write data rates than what a
> single off-the-shelf disk can achieve. For this reason, I think that
> reiser4 would be worth trying for this particular usage.

Glad to see you are willing to give Reiser4 a go. Good man.

--

On Sat, 7 Apr 2007 09:15:35 +0200, "Willy Tarreau" <[EMAIL PROTECTED]> said:
> On Fri, Apr 06, 2007 at 10:58:45PM -0700, [EMAIL PROTECTED]
> wrote:
> > You know,... you cut out this bit:
> >
> > -
> > > > The following benchmarks are from
> > > >
> > > > http://linuxhelp.150m.com/resources/fs-benchmarks.htm or,
> > > > http://m.domaindlx.com/LinuxHelp/resources/fs-benchmarks.htm
> > ...
>
> Hey John, please change your disk, it's scratched and you're repeating
> yourself again and again. At first I thought "Oh cool, some good news
> about reiser4", now when I see "reiserfs" in a thread, I think "oh no,
> not this boring guy who escaped from the asylum again !". I hope this
> thread will be cut shortly so that you stop doing bad publicity to
> reiserfs and its developers, because when a product is indicated as
> good by stupid people, it's really doing harm.
>
> Also, about this part :
> [Jan]
> > > But in the end everything is a tradeoff. You can save diskspace, but
> > > increase the cost of corruption.
>
> I don't 100% agree with Jan, because for some usages (temporary space),
> light compression can increase speed. For instance, when processing logs,
> I get better speed by compressing intermediate files with LZO on the fly.
>
> [John]
> > You deliberately ignored the fact that bad blocks are NOT dealt with by
> > the filesystem,... but by the operating system. Like I said: If your
> > filesystem is writing to bad blocks, then throw away your operating
> > system.
>
> But what you write here is complete crap. The filesystem relies on a
> linear block device. The operating system is responsible for doing
> read retries or reporting errors on bad blocks, but the FS and only
> the FS can decide how not to use some known defective areas, for
> instance not putting any metadata on them nor any useful data.
>
> Now if you want to stop writing stupid things again and again, take
> your bag, don't miss the bus to school, and listen to the teachers
> instead of playing games on your calculator.
>
> Willy
> PS: non need to reply either, I'll kill this thread and your address
> here.
>
Re: Reiser4. BEST FILESYSTEM EVER.
Jan does have a point about bad blocks. A couple years ago
I had a relatively new disk start to go bad on random blocks.
I detected it fairly quickly but did have some data loss.

All the compressed archives which were hit were near
total losses; most other files were at least partially
recoverable.

It is not a matter of your operating system writing
to bad blocks. It is a matter of what happens when the
blocks on which your data sit go bad underneath you.

This issue has also been discussed by people working
with revision control systems. If you are archiving
data, how do you know if your data is still good
unless you actually need it? If you do not know it
is bad, you may well get rid of good copies thinking
you do not need the extras... it does happen.

I would be quite hesitant to go with on disk compression
unless damage was limited to only the bad bits or blocks
and did not propagate through the rest of the file.

Perhaps if everyone used hardware RAID and the RAID
automatically detected a difference due to trashed
data on one disk and flagged the admin with a warning...

BTW: I'm a CMU Alum, so who are you working with Jan?
Re: Reiser4. BEST FILESYSTEM EVER.
[EMAIL PROTECTED] writes:

> Why do you think I hate reiserfs developers? That is an insane claim.
> Why would I hate reiser3 developers?
> Why would I hate reiser4 developers?
> Why would I even dislike them?
>
> I think Hans Reiser is a genius. Is that what you mean by hate?

I think they could hire a person with a bit better marketing skills,
though. People on a technical mailing list don't buy things just
because something on TV told them they have to.

> Answer this question. Why do YOU think I am antagonizing reiserfs
> developers?

That might be just a side effect.

> Think about it,... read speeds that are some FOUR times the physical
> disk read rate,... impossible without the use of compression (or
> something similar).

It's really impossible with compression only unless you're writing
only zeros or stuff alike. I don't know what bonnie uses for testing
but real life data doesn't compress 4 times. Two times, sometimes,
but then it will be typically slower than disk access (I mean read,
as write will be much slower).

You can get faster I/O (both linear speed and access times) using
multiple disks (mirrors etc). Perhaps some ZFS ideas would do us
some good?

Gzip - 3 files (zeros only, raw DV data from video camera, x86_64
kernel rpm file), 10 MB of data (10*1024*1024), done on tmpfs so no
real disk speed factor.

The CPU is AMD64 with 1 MB cache per core, 2600 MHz clock (clock
scaling disabled). That's my typical usage pattern (well, not counting
these zeros).

$ l -Ggh zeros dv bin
-rw-r--r-- 1 10M Apr 7 15:30 bin
-rw-r--r-- 1 10M Apr 7 15:31 dv
-rw-r--r-- 1 10M Apr 7 15:31 zeros
$ for f in zeros dv bin; do time gzip $f; done
real 0m0.112s
real 0m0.686s
real 0m0.559s

Dealing with pure zeros gzip can get almost 90 MB/s compressing,
but with DV and rpm it only does 14.5 and almost 18 MB/s respectively...

$ l -Ggh zeros.gz dv.gz bin.gz
-rw-r--r-- 1 10K Apr 7 15:31 zeros.gz
-rw-r--r-- 1 9.1M Apr 7 15:31 dv.gz
-rw-r--r-- 1 9.3M Apr 7 15:30 bin.gz

... and though the numbers may still sound impressive, space savings
are less than 10%.

$ for f in zeros dv bin; do time gunzip $f.gz; done
real 0m0.067s
real 0m0.131s
real 0m0.120s

Decompression gives 150 MB/s for zeros and ~ 80 MB/s for DV and rpm.

$ for f in zeros dv bin; do time gzip -1 $f; done
real 0m0.079s
real 0m0.572s
real 0m0.530s

Supposed to be "fastest gzip". 126 MB/s for zeros but still less than
19 MB/s for DV and rpm.

$ l -Ggh zeros.gz dv.gz bin.gz
-rw-r--r-- 1 45K Apr 7 15:31 zeros.gz
-rw-r--r-- 1 9.2M Apr 7 15:31 dv.gz
-rw-r--r-- 1 9.3M Apr 7 15:30 bin.gz
$ for f in zeros dv bin; do time gunzip $f.gz; done
real 0m0.044s
real 0m0.135s
real 0m0.120s

It seems gzip can decompress zeros with 227 MB/s rate. I assume the
"4x read speed" claim comes from something like this.

$ /sbin/hdparm -t /dev/sda

/dev/sda:
 Timing buffered disk reads: 210 MB in 3.02 seconds = 69.59 MB/sec

$ echo "69.59*4" | bc
278.36

Seems you'd need a faster algorithm, faster machine or slower disk -
slower than this cheap SATA with disabled NCQ (NV SATA) at least:

$ cat /sys/block/sda/device/model
Maxtor 6V250F0

Please note that application-level compression usually gives way better
results - the application knows much more.
--
Krzysztof Halasa
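The throughput figures quoted in the message above follow from dividing the 10 MB test size by each "real" time, and the 278.36 MB/s target from multiplying the measured disk rate by four. A minimal sketch of that arithmetic follows; the function names are made up for illustration, and the inputs are the numbers from the message, not new measurements.

```c
#include <assert.h>

/* Hypothetical helper: data rate in MB/s for `megabytes` handled in `seconds`. */
static double throughput_mb_per_s(double megabytes, double seconds)
{
    assert(seconds > 0.0);
    return megabytes / seconds;
}

/*
 * Hypothetical helper: rate a decompressor must sustain so that reads
 * appear `factor` times faster than the raw disk rate.
 */
static double required_rate(double disk_mb_per_s, double factor)
{
    return disk_mb_per_s * factor;
}
```

For example, throughput_mb_per_s(10.0, 0.112) gives roughly 89 MB/s for compressing zeros and throughput_mb_per_s(10.0, 0.686) roughly 14.6 MB/s for DV data, both well below required_rate(69.59, 4.0), which is about 278 MB/s.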
Re: Reiser4. BEST FILESYSTEM EVER.
Krzysztof -- Aren't you missing the point?

Twice the speed would be great,... even a 50% increase,... even a 0%
increase.

I checked what bonnie++ actually writes to its test files, for you. It is
about 98-99% zeros.

Still, the results record sequential reads, of 232,729 K/sec, nearly
four times the physical disk read rate, 63,160 K/sec, of the hard drive.

The sequential writes are about three times the physical disk write rate.

Even if the speed increase was zero, the more efficient use of disk space
means that Reiser4 is worth investigating.

People use RAID arrays to achieve speed increases. The people who
developed RAID clearly thought that increases in speed were worth
investigating.

>
> > Why do you think I hate reiserfs developers? That is an insane claim.
> > Why would I hate reiser3 developers?
> > Why would I hate reiser4 developers?
> > Why would I even dislike them?
> >
> > I think Hans Reiser is a genius. Is that what you mean by hate?
>
> I think they could hire a person with a bit better marketing skills,
> though. People on a technical mailing list don't buy things just
> because something on TV told them they have to.

I don't work for Reiser if that is what you are suggesting.

And people buy all sorts of lies because someone on TV told them it was
true. Did you believe Iraq had WMD (weapons of mass destruction) because
a bunch of American liars told you this on TV? Millions of Americans did.

> > Answer this question. Why do YOU think I am antagonizing reiserfs
> > developers?
>
> That might be just a side effect.
>
[PATCH] kernel-doc: handle arrays with arithmetic expressions as initializers
On Fri, Apr 06, 2007 at 05:53:25PM -0700, Randy Dunlap wrote:
> From: Jan Engelhardt <[EMAIL PROTECTED]>
>
> Unfortunately, kernel-doc has problems with a struct field like this:
> 	uint8_t databuf[NAND_MAX_PAGESIZE + NAND_MAX_OOBSIZE];
>
> simply due to the spaces around the "+" sign, so drop all spaces inside
> [...] so that parsing is done correctly (in some sense).
>
> Warning(linux-2.6.20-git15/include/linux/mtd/nand.h:304): No description found for parameter 'NAND_MAX_OOBSIZE]'
>
> This needs to sit in -mm for awhile to see if it has any adverse effects.
>
> And yes, this is just a hack until kernel-doc learns to do better
> parsing.
>
> Signed-off-by: Jan Engelhardt <[EMAIL PROTECTED]>
> Signed-off-by: Randy Dunlap <[EMAIL PROTECTED]>
> ---
>  scripts/kernel-doc |    5 +
>  1 file changed, 5 insertions(+)
>
> --- linux-2.6.21-rc6.orig/scripts/kernel-doc
> +++ linux-2.6.21-rc6/scripts/kernel-doc
> @@ -1452,6 +1452,11 @@ sub create_parameterlist($$$) {
>  	    $arg =~ s/\s*:\s*/:/g;
>  	    $arg =~ s/\s*\[/\[/g;
>
> +	    # no spaces inside [array size expression];
> +	    # messes up split/pop/shift/unshift below;
> +	    while ($arg =~ s/\[(.*)\s+(.*)\]/[$1$2]/) {
> +	    }
> +
>  	    my @args = split('\s*,\s*', $arg);
>  	    if ($args[0] =~ m/\*/) {
>  		$args[0] =~ s/(\*+)\s*/ $1/;
> -

As a different approach, here's a patch that handles the special case of
composite arithmetic expressions in array size initializers. With it, prior
to pushing the split strings on the @first_arg array, I split the keywords
before the array name as before and then keep the array name along with the
subscript expression as a single whole element which gets pushed last. In
this manner, kernel-doc produces correct output without removing the
whitespace, the removal of which makes the array subscripts unreadable in
the docs.

Signed-off-by: Borislav Petkov <[EMAIL PROTECTED]>

---
--- 21-rc6/scripts/kernel-doc.orig	2007-04-07 16:48:51.0 +0200
+++ 21-rc6/scripts/kernel-doc	2007-04-07 16:51:17.0 +0200
@@ -1456,7 +1456,16 @@ sub create_parameterlist($$$) {
 	    if ($args[0] =~ m/\*/) {
 		$args[0] =~ s/(\*+)\s*/ $1/;
 	    }
-	    my @first_arg = split('\s+', shift @args);
+
+	    my @first_arg;
+	    if ($args[0] =~ /^(.*\s+)(.*?\[.*\].*)$/) {
+		    shift @args;
+		    push(@first_arg, split('\s+', $1));
+		    push(@first_arg, $2);
+	    } else {
+		    @first_arg = split('\s+', shift @args);
+	    }
+
 	    unshift(@args, pop @first_arg);
 	    $type = join " ", @first_arg;

--
Regards/Gruß,
 Boris.
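The idea behind the second patch, keeping the array name and its whole subscript expression together as one element instead of splitting on every space, can be restated outside Perl. The sketch below is in C rather than the script's own language and is not a port of kernel-doc: split_declaration() is a hypothetical helper, it handles a single already-trimmed declaration, and it simply treats everything from the last whitespace before the first '[' as the name.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: split a trimmed declaration such as
 * "uint8_t databuf[A + B]" into its type ("uint8_t") and its name
 * including the untouched subscript ("databuf[A + B]").
 * Returns 0 on success, -1 if an output buffer is too small.
 */
static int split_declaration(const char *decl,
                             char *type, size_t type_len,
                             char *name, size_t name_len)
{
    const char *bracket = strchr(decl, '[');
    /* The name ends at the first '[' or, without a subscript, at the end. */
    const char *name_start = bracket ? bracket : decl + strlen(decl);
    size_t tlen;

    assert(type_len > 0 && name_len > 0);

    /* Walk back to the whitespace that precedes the identifier. */
    while (name_start > decl && !isspace((unsigned char)name_start[-1]))
        name_start--;

    /* Everything before the name is the type; drop trailing whitespace. */
    tlen = (size_t)(name_start - decl);
    while (tlen > 0 && isspace((unsigned char)decl[tlen - 1]))
        tlen--;

    if (tlen >= type_len || strlen(name_start) >= name_len)
        return -1;

    memcpy(type, decl, tlen);
    type[tlen] = '\0';
    strcpy(name, name_start); /* spaces inside [...] are preserved */
    return 0;
}
```

Because the split point is found relative to the bracket, the spaces around "+" inside the subscript never take part in the split, which is the property the patch is after.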
[patch] nfs statfs error-handling fix
Hi,

The nfs statfs function returns a success code on error, and fills the
output buffer with invalid values. The attached patch makes it return a
correct error code instead.

Thanks,
Amnon

Signed-off-by: Amnon Aaronsohn <[EMAIL PROTECTED]>

--
--- linux-source-2.6.20-2.6.20/fs/nfs/super.c.orig	2007-04-07 15:19:14.0 +0300
+++ linux-source-2.6.20-2.6.20/fs/nfs/super.c	2007-04-07 15:24:35.0 +0300
@@ -203,9 +203,9 @@ static int nfs_statfs(struct dentry *den
 	lock_kernel();
 
 	error = server->nfs_client->rpc_ops->statfs(server, fh, &res);
-	buf->f_type = NFS_SUPER_MAGIC;
 	if (error < 0)
-		goto out_err;
+		goto out;
+	buf->f_type = NFS_SUPER_MAGIC;
 
 	/*
 	 * Current versions of glibc do not correctly handle the
@@ -234,13 +234,7 @@ static int nfs_statfs(struct dentry *den
 	buf->f_namelen = server->namelen;
  out:
 	unlock_kernel();
-	return 0;
-
- out_err:
-	dprintk("%s: statfs error = %d\n", __FUNCTION__, -error);
-	buf->f_bsize = buf->f_blocks = buf->f_bfree = buf->f_bavail = -1;
-	goto out;
-
+	return error;
 }
 
 /*
Re: Reiser4. BEST FILESYSTEM EVER.
On Sat, 7 Apr 2007 13:59:14 +0100, "Dale Amon" <[EMAIL PROTECTED]> said:
> Jan does have a point about bad blocks. A couple years ago
> I had a relatively new disk start to go bad on random blocks.
> I detected it fairly quickly but did have some data loss.
>
> All the compressed archives which were hit were near
> total losses; most other files were at least partially
> recoverable.

As you know, there is no substitute for backups. What if the disk had totally crashed and scratched GBs of your data? And did you ever trust those (non-compressed) executables that you saved after recovering them from corruption? Of course not. No one would. The fact that they were not compressed did not save them. You are really arguing for backups, not for one filesystem or another.

Besides, Jan claimed that corruption due to bad blocks propagates to MULTIPLE files because of the compression in the file system. You are arguing something different.

> It is not a matter of your operating system writing
> to bad blocks. It is a matter of what happens when the
> blocks on which your data sit go bad underneath you.
>
> This issue has also been discussed by people working
> with revision control systems. If you are archiving
> data, how do you know if your data is still good
> unless you actually need it? If you do not know it
> is bad, you may well get rid of good copies thinking
> you do not need the extras... it does happen.
>
> I would be quite hesitant to go with on disk compression
> unless damage was limited to only the bad bits or blocks
> and did not propagate through the rest of the file.

You don't really mean that. Most backups use compression (which propagates errors through the rest of the file).

> Perhaps if everyone used hardware RAID and the RAID
> automatically detected a difference due to trashed
> data on one disk and flagged the admin with a warning...
>
> BTW: I'm a CMU Alum, so who are you working with John?

I retired quite young.
O_APPEND, lseek() and pwrite()
If I open a file with O_APPEND and write() to it, it looks like the file offset is updated and I can get it with lseek(SEEK_CUR). Can I trust that this behavior won't change in future Linux versions? Apparently this isn't standard, because at least OS X and Solaris don't do this.

pwrite() ignores its offset argument if the fd has O_APPEND set (with 2.6.20). http://www.opengroup.org/austin/mailarchives/ag/msg09453.html suggests that it shouldn't ignore it. Could this be changed? For now I can of course just change the flag with fcntl().

I guess there aren't any limits to how large blocks write() accepts without the data being mixed with another process's writes (both with O_APPEND)? And I guess there aren't any horrible performance problems with this, so that this is actually a good idea compared to file lock + write() + unlock? :)
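The two behaviours asked about can be probed with a small userspace sketch. It assumes a Linux system; append_and_tell() and clear_append() are hypothetical helper names introduced here, and the code only reports what the running kernel does rather than promising anything about future versions.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Append `len` bytes of `data` to `path` through an O_APPEND descriptor
 * and report what lseek(fd, 0, SEEK_CUR) returns afterwards.
 * Returns 0 on success, -1 on any failure.
 */
static int append_and_tell(const char *path, const char *data, size_t len,
			   off_t *offset_after)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);
	int ret = -1;

	if (fd < 0)
		return -1;
	if (write(fd, data, len) == (ssize_t)len) {
		*offset_after = lseek(fd, 0, SEEK_CUR);
		if (*offset_after != (off_t)-1)
			ret = 0;
	}
	close(fd);
	return ret;
}

/*
 * The fcntl() workaround mentioned above: clear O_APPEND on an open
 * descriptor so that a later pwrite() lands at the requested offset.
 */
static int clear_append(int fd)
{
	int flags = fcntl(fd, F_GETFL);

	if (flags < 0)
		return -1;
	return fcntl(fd, F_SETFL, flags & ~O_APPEND);
}
```

With the behaviour described in the message, appending to a file that already holds N bytes should leave the reported offset at N plus the number of bytes written. Whether other systems agree is exactly the open question.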
Re: [RFD driver-core] Lifetime problems of the current driver model
On March 30, 2007, Tejun Heo wrote: > Hello, all. > > This document tries to describe lifetime problems of the current > device driver model primarily from the point view of device drivers > and establish consensus, or at least, start discussion about how to > solve these problems. This is primarily based on my experience with > IDE and SCSI layers and my knowledge on other drivers is limited, so I > might have generalized too aggressively. Feel free to point out. ... > Example 1. sysfs_schedule_callback() not grabbing the owning module > > This function is recently added to be used by suicidal sysfs nodes > such that they don't deadlock when trying to unregister themselves. > > +#include > static void sysfs_schedule_callback_work(struct work_struct *work) > { > struct sysfs_schedule_callback_struct *ss = container_of(work, > struct sysfs_schedule_callback_struct, work); > > +msleep(100); > (ss->func)(ss->data); > kobject_put(ss->kobj); > kfree(ss); > } > > int sysfs_schedule_callback(struct kobject *kobj, void (*func)(void *), > void *data) > { > struct sysfs_schedule_callback_struct *ss; > > ss = kmalloc(sizeof(*ss), GFP_KERNEL); > if (!ss) > return -ENOMEM; > kobject_get(kobj); > ss->kobj = kobj; > ss->func = func; > ss->data = data; > INIT_WORK(&ss->work, sysfs_schedule_callback_work); > schedule_work(&ss->work); > return 0; > } > > Two lines starting with '+' are inserted to make the problem > reliably reproducible. With the above changes, > > # insmod drivers/scsi/scsi_mod.ko; insmod drivers/scsi/sd_mod.ko; insmod > drivers/ata/libata.ko; > insmod drivers/ata/ahci.ko > # echo 1 > /sys/block/sda/device/delete; rmmod ahci; rmmod libata; rmmod > sd_mod; rmmod scsi_mod > > It's assumed that ahci detects /dev/sda. The above command sequence > causes the following oops. 
> > BUG: unable to handle kernel paging request at virtual address e0984020 > [--snip--] > EIP is at 0xe0984020 > [--snip--] >[] run_workqueue+0x92/0x140 >[] worker_thread+0x137/0x160 >[] kthread+0xa3/0xd0 >[] kernel_thread_helper+0x7/0x10 > > The problem here is that kobjec_get() in sysfs_schedule_callback() > doesn't grab the module backing the kobject it's grabbing. By the > time (ss->func)(ss->kobj) runs, scsi_mod is already gone. As the author of this routine, I wish you had included my name in your CC: list. :-( The problem here isn't exactly as you described. scsi_mod needs to be pinned (1) because it is the owner of the kobject and hence will be called when the kobject is released, and (2) because it is the owner of the callback routine. However this is just a detail; clearly the bug needs to be fixed. One possibility would be to have scsi_mod's exit_scsi() routine call flush_scheduled_work(). Another would be to add such a call in sys_delete_module(). Neither of these is attractive. They would add overhead when it's not needed, and they would deadlock if a workqueue routine tried to unload a module. On balance, the patch below seems better. Do you agree? With regard to your analysis of lifetime issues, there is a whole aspect you did not mention. A basic assumption of the refcounting approach is that once X has a reference to Y, X can freely access and use Y as much as it wants until it drops the reference. However this is not true when X is a device driver and Y is a device structure. Drivers can be unbound from devices. If X has been unbound from Y then it must not access Y again, no matter how many references it possesses. After all, some other driver may have bound to Y in the meantime; this other driver would not appreciate the interference. Just as bad, if Y represents a hot-pluggable device then some other device may have been plugged in and may be using Y's old address. We don't want X sending commands to a new device, thinking that it is Y! 
The complications caused by this requirement affect both the subsystem code and device drivers. Drivers must synchronize their release() methods with every action they take -- and refcounts cannot provide synchronization. A similar problem afflicts the char-device subsystem, and here even less care has been taken to address the issues. The race between open() and unregister() is resolved in many places by relying on the BKL! We should be able to make things better and easier than they are. Orphaning open sysfs files was a move in this direction. But I doubt they will ever become truly simple and clear. Alan Stern Index: usb-2.6/drivers/base/core.c === --- usb-2.6.orig/drivers/base/core.c +++ usb-2.6/drivers/base/core.c @@ -431,9 +431,10 @@ void device_remove_bin_file(struct devic EXPORT_SYMBOL_GPL(device_remove_bin_file); /** - * device_schedule_callback - helper to schedule
Re: Reiser4. BEST FILESYSTEM EVER.
On 4/7/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> I checked what bonnie++ actually writes to its test files, for you. It is
> about 98-99% zeros.
>
> Still, the results record sequential reads, of 232,729 K/sec, nearly four
> times the physical disk read rate, 63,160 K/sec, of the hard drive.

Excellent! You've established the undeniable hard cold fact that reiser4 beats the crap out of all other filesystems, when the files are 98-99% filled with zeros. You've proven your point, so can we stop this thread now?
[PATCH -mm] freezer: Remove PF_NOFREEZE from handle_initrd
From: Rafael J. Wysocki <[EMAIL PROTECTED]>

Make handle_initrd() call try_to_freeze() in a suitable place instead of setting PF_NOFREEZE for the current task.

Signed-off-by: Rafael J. Wysocki <[EMAIL PROTECTED]>
---
 init/do_mounts_initrd.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Index: linux-2.6.21-rc6/init/do_mounts_initrd.c
===================================================================
--- linux-2.6.21-rc6.orig/init/do_mounts_initrd.c
+++ linux-2.6.21-rc6/init/do_mounts_initrd.c
@@ -55,11 +55,12 @@ static void __init handle_initrd(void)
 	sys_mount(".", "/", NULL, MS_MOVE, NULL);
 	sys_chroot(".");
 
-	current->flags |= PF_NOFREEZE;
 	pid = kernel_thread(do_linuxrc, "/linuxrc", SIGCHLD);
 	if (pid > 0) {
-		while (pid != sys_wait4(-1, NULL, 0, NULL))
+		while (pid != sys_wait4(-1, NULL, 0, NULL)) {
+			try_to_freeze();
 			yield();
+		}
 	}
 
 	/* move initrd to rootfs' /old */
Re: Ten percent test
On Saturday 07 April 2007, Con Kolivas wrote: >On Friday 06 April 2007 20:03, Ingo Molnar wrote: >> * Con Kolivas <[EMAIL PROTECTED]> wrote: >[...] >> >> firstly, testing on various workloads Mike's tweaks work pretty well, >> while SD still doesnt handle the high-load case all that well. Note >> that it was you who raised this whole issue to begin with: everything >> was pretty quiet in scheduling interactivity land. Con was scratching an itch, one we desktop users all have in a place we can't quite reach to scratch because we aren't quite the coding gods we should be. Con at least has the coding knowledge to walk in and start shoveling, which is more than I can say of the efforts to derail the SD scheduler have demonstrated to this user. >I'm terribly sorry but you have completely missed my intentions then. I > was _not_ trying to improve mainline's interactivity at all. My desire > was to fix the unfairness that mainline has, across the board without > compromising fairness. You said yourself that an approach that fixed a > lot and had a small number of regressions would be worth it. In a > surprisingly ironic turnaround two bizarre things happened. People > found SD fixed a lot of their interactivity corner cases which were > showstoppers. That didn't surprise me because any unfair design will by > its nature get it wrong sometimes. The even _more_ surprising thing is > that you're now using interactivity as the argument against SD. I did > not set out to create better interactivity, I set out to create > widespread fairness without too much compromise to interactivity. As I > said from the _very first email_, there would be cases of interactivity > in mainline that performed better. > >> (There was one person who >> reported wide-scale interactivity regressions against mainline but he >> didnt answer my followup posts to trace/debug the scenario.) > >That was one user. 
As I mentioned in an earlier thread, the problem with > email threads on drawn out issues on lkml is that all that people > remember is the last one creating noise, and that has only been the > noise from Mike for 2 weeks now. Has everyone forgotten the many many > users who reported the advantages first up which generated the interest > in the first place? Why have they stopped reporting? Well the answer is > obvious; all the signs suggest that SD is slated for mainline. It is on > the path, Linus has suggested it and now akpm is asking if it's ready > for 2.6.22. So they figure there is no point testing and replying any > further. SD is ready for prime time, finalised and does everything I > intended it to. This is where I have to reveal to them the horrible > truth. This is no guarantee it will go in. In fact, this one point that > you (Ingo) go on and on about is not only a quibble, but you will call > it an absolute showstopper. As maintainer of the cpu scheduler, in its > current form you will flatly refuse it goes to mainline citing the 5% > of cases where interactivity has regressed. So people will tell me to > fix it, right?... Read on for this to unfold. Sorry, this user got quiet to watch the cat fight. Obviously I should have been throwing messages wrapped around rocks (or something). >> SD has a built-in "interactivity estimator" as well, but hardcoded >> into its design. SD has its own set of ugly-looking tweaks as well - >> for example the prio_matrix. > >I'm sorry but this is a mis-representation to me, as I suggested on an > earlier thread where I disagree about what an interactivity estimator > is. The idea of fence posts in a clock that are passed as a way of > metering out earliest-deadline-first in a design is well established. > The matrix is simply an array designed for O(1) lookups of the fence > posts. That is not the same as "oh how much have we slept in the last > $magic_number period and how much extra time should we get for that". 
> >> So it all comes down on 'what interactivity >> heuristics is enough', and which one is more tweakable. So far i've >> yet to see SD address the hackbench and make -j interactivity >> problems/regression for example, while Mike has been busy addressing >> the 'exploits' reported against mainline. Who gives a s*** about hackbench or a make -j 200?! Those are NOT, and NEVER WILL BE, REAL WORLD LOADS for the vast majority of us. For us SD Just Worked(TM). >And BANG there is the bullet you will use against SD from here to > eternity. SD obeys fairness at all costs. Your interactivity regression > is that SD causes progressive slowdown with load which by definition is > fairness. You repeatedly ask me to address it and there is on unfailing > truth; the only way to address it is to add unfairness to the design. > So why don't I? Because the simple fact is that any unfairness no > matter how carefully administered or metered will always have cases > where it's wrong. Look at the title of this email for example - it's > yet another exploit for the mainline sleep/run
Re: [patch] remove artificial software max_loop limit
On Fri, 06 Apr 2007 16:33:32 EDT, Bill Davidsen said:
> Jan Engelhardt wrote:
> > Who cares if the user specifies max_loop=8 but still is able to open up
> > /dev/loop8, loop9, etc.? max_loop=X basically meant (at least to me)
> > "have at least X" loops ready.
> >
> You have just come up with a really good reason not to do unlimited
> loops.

That, and I'd expect the intuitive name for "have at least N ready" to be 'min_loop=N'. 'max_loop=N' means (to me, at least) "If I ask for N+1, something has obviously gone very wrong, so please shoot my process before it gets worse".

Maybe what's needed is *both* a max_ and min_ parameter?
SD scheduler testing hitch
On Sat, 2007-04-07 at 11:24 +0200, Ingo Molnar wrote: > * Andrew Morton <[EMAIL PROTECTED]> wrote: > > Where are we at with staircase anyway? Is it looking like a 2.6.22 > > thing? I don't personally think we've yet seen enough serious > > performance testing to permit a merge, apart from other issues... > > yes, that's my thinking too at the moment. I'd also like to see a > summary of 'open design questions' list from Mike (if Mike has > time/energy for that?) - many questions were raised, a good number of > them were answered, various changes done to SD but there's no good > summary of the current state of affairs. I'm working on it. I started testing fairness, but ran into a snag. What I was testing was my theory that SD can't possibly be fair to sleeping tasks because the differential between long burn short sleep tasks and long sleep short burn tasks is tossed at the end of every rotation. That theory seems to be true, but here's the snag... 2.6.21-rc6-sd-0.39, box is 3GHz P4/HT tenpercent: tenpercent.c compiled to run 1 10% duty cycle task. 100ms and friends: tenpercent.c hard coded for N ms burn + 1 usec sleep. 
taskset -c 1 ./tenpercent
taskset -c 1 ./100ms (or ilk)

top - 10:47:57 up 3:11, 13 users, load average: 1.65, 1.63, 2.50
  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+ P COMMAND
 7357 root   9   0  1568  440  360 R   92  0.0 10:55.01 1 100ms
 7356 root   1   0  1568  444  360 S    8  0.0  1:00.01 1 tenpercent
 5557 root   1   0  164m  21m 4876 S    0  2.1  1:58.90 0 Xorg
 6343 root   3   0  2376 1068  768 R    0  0.1  2:51.19 0 top

top - 11:05:16 up 3:29, 13 users, load average: 1.52, 1.50, 1.81
  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+ P COMMAND
 7395 root   5   0  1568  444  360 R   90  0.0  8:54.25 1 100ms
 7394 root   0 -10  1568  440  360 S   10  0.0  1:00.21 1 tenpercent
 6343 root   3   0  2376 1068  768 R    0  0.1  3:04.16 0 top
    1 root   1   0   736  288  240 S    0  0.0  0:00.90 0 init

top - 11:20:58 up 3:44, 13 users, load average: 1.89, 1.87, 1.78
  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+ P COMMAND
 7429 root   2 -10  1568  444  360 R   92  0.0 12:03.81 1 100ms
 7428 root   0 -10  1568  444  360 R    8  0.0  1:00.08 1 tenpercent
 6343 root   3   0  2376 1068  768 R    1  0.1  3:19.36 0 top
    1 root   1   0   736  288  240 S    0  0.0  0:00.90 0 init

top - 12:22:27 up 4:46, 13 users, load average: 1.90, 1.92, 1.94
  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+ P COMMAND
 8235 root   1 -20  1568  444  360 R   95  0.0 19:31.20 1 100ms
 8234 root   0 -20  1568  444  360 S    5  0.0  1:00.01 1 tenpercent
 6343 root   3   0  2376 1068  768 R    1  0.1  4:24.24 0 top
 4926 root   1   0  1820  632  544 S    0  0.1  0:02.34 0 hald-addon-stor

top - 13:38:22 up 6:02, 13 users, load average: 1.53, 1.51, 1.51
  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+ P COMMAND
 8643 root   5   0  1564  444  360 R   93  0.0 12:15.49 1 50ms
 8642 root   1   0  1564  444  360 S    7  0.0  1:00.28 1 tenpercent
 6343 root   3   0  2376 1080  768 R    0  0.1  5:27.22 0 top
    1 root   1   0   736  288  240 S    0  0.0  0:00.91 0 init

top - 14:02:39 up 6:26, 13 users, load average: 1.75, 1.71, 1.56
  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+ P COMMAND
 8726 root   5   0  1564  444  360 R   94  0.0 15:19.07 1 8ms
 8727 root   1   0  1564  444  360 R    6  0.0  1:00.11 1 tenpercent
 5557 root   1   0  164m  21m 4632 S    0  2.1  3:20.92 0 Xorg
 6079 root   1   0 31584  17m  12m S    0  1.7  0:04.35 0 konsole

top - 16:22:01 up 8:45, 13 users, load average: 1.73, 1.81, 1.60
  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+ P COMMAND
10622 root   1   0  1428  264  212 R   98  0.0 10:00.43 1 xx
10621 root   1   0  1564  440  360 S    1  0.0  0:06.49 1 tenpercent
10423 root   3   0  2248 1052  764 R    0  0.1  0:27.45 0 top
    1 root   1   0   736  288  240 S    0  0.0  0:00.91 0 init

xx.c just tries to terminate the rotation if it gets preempted, and seems to succeed. It usually isn't this bad, but every few starts it gets this bad. I thought it might be screwing up the calibration of tenpercent if xx started first, but I plugged it into tenp.c (attached) after the calibration, and still see this every few starts. It always gets more cpu than it should, but sometimes it's extreme. I have yet to see tenpercent start at 1 percent usage in many many tries, but I just repeated it with the attached in seven tries.

xx.c

#include <sys/time.h>
#include <time.h>

#define max(a,b) ((a) > (b) ? (a) : (b))
#define min(a,b) ((a) < (b) ? (a) : (b))

int main(void)
{
	struct timeval then, now;
	struct timespec t = {0, 1000}, r;

	for(;;) {
		int t1, t2;
		short i;

		if (gettimeofday(&then, 0))
			break;
[PATCH] block layer: Add bdev capacity helper function get_sect_count
From: John Anthony Kazos Jr. <[EMAIL PROTECTED]>

Add static inline function get_sect_count to include/linux/genhd.h to complement get_start_sect. Returns sector_t capacity of block device whether it is whole or a partition.

Signed-off-by: John Anthony Kazos Jr. <[EMAIL PROTECTED]>
---
This will be useful for fill_super functions of filesystems with online resizing for checks against recorded and actual device size. get_start_sect and get_sect_count are helper functions useful to keep things from breaking in case the block_device structure decides to change. Applied against Linux v2.6.20.6.

--- linux-2.6.20.6-orig/include/linux/genhd.h 2007-04-06 16:02:48.0 -0400
+++ linux-2.6.20.6-mod/include/linux/genhd.h 2007-04-07 11:58:55.0 -0400
@@ -244,6 +244,10 @@ static inline sector_t get_start_sect(st
 {
 	return bdev->bd_contains == bdev ? 0 : bdev->bd_part->start_sect;
 }
+static inline sector_t get_sect_count(struct block_device *bdev)
+{
+	return bdev->bd_contains == bdev ? bdev->bd_disk->capacity : bdev->bd_part->nr_sects;
+}
 static inline sector_t get_capacity(struct gendisk *disk)
 {
 	return disk->capacity;
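To see what the helper computes without a kernel tree, here is a userspace sketch. The mock_* structures are hypothetical, pared-down stand-ins that carry only the fields the patch touches; they are not the real kernel definitions.

```c
#include <stdint.h>

typedef uint64_t sector_t;	/* stand-in for the kernel's sector_t */

/* Hypothetical mocks holding only the fields used by the helpers. */
struct mock_gendisk { sector_t capacity; };
struct mock_hd_struct { sector_t start_sect, nr_sects; };
struct mock_block_device {
	struct mock_block_device *bd_contains;	/* the whole-disk device */
	struct mock_hd_struct *bd_part;		/* set for partitions */
	struct mock_gendisk *bd_disk;
};

/* A whole disk starts at sector 0; a partition at its start_sect. */
static sector_t mock_get_start_sect(const struct mock_block_device *bdev)
{
	return bdev->bd_contains == bdev ? 0 : bdev->bd_part->start_sect;
}

/* A whole disk reports the disk capacity; a partition its nr_sects. */
static sector_t mock_get_sect_count(const struct mock_block_device *bdev)
{
	return bdev->bd_contains == bdev ? bdev->bd_disk->capacity
					 : bdev->bd_part->nr_sects;
}
```

The point of the pair is that a caller such as a fill_super routine never has to care whether it was handed a whole device or a partition.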
Re: Ten percent test
On Sat, 2007-04-07 at 16:50 +1000, Con Kolivas wrote:
> On Friday 06 April 2007 20:03, Ingo Molnar wrote:
> > (There was one person who
> > reported wide-scale interactivity regressions against mainline but he
> > didnt answer my followup posts to trace/debug the scenario.)
>
> That was one user. As I mentioned in an earlier thread, the problem with email
> threads on drawn out issues on lkml is that all that people remember is the
> last one creating noise, and that has only been the noise from Mike for 2
> weeks now.

This doesn't even deserve a reply, so I'll just say "get well soon".

	-Mike
Re: [patch] remove artificial software max_loop limit
[EMAIL PROTECTED] wrote:
> On Fri, 06 Apr 2007 16:33:32 EDT, Bill Davidsen said:
> > Jan Engelhardt wrote:
> > > Who cares if the user specifies max_loop=8 but still is able to open up
> > > /dev/loop8, loop9, etc.? max_loop=X basically meant (at least to me)
> > > "have at least X" loops ready.
> >
> > You have just come up with a really good reason not to do unlimited
> > loops.
>
> That, and I'd expect the intuitive name for "have at least N ready" to
> be 'min_loop=N'. 'max_loop=N' means (to me, at least) "If I ask for N+1,
> something has obviously gone very wrong, so please shoot my process before
> it gets worse".
>
> Maybe what's needed is *both* a max_ and min_ parameter?

I think that max_loop is a sufficient statement of the highest number of devices needed, and can reasonably be interpreted as both "I may need this many" and "I won't legitimately want more."

As I recall memory is allocated as the device is set up, so unless you want to use the max memory at boot, "just in case," the minimum won't be guaranteed anyway. Something else could eat memory.

In practice I think asking for way too many is more common than not being able to get to the max. It may happen but it's a corner case, and status is returned.

-- 
bill davidsen <[EMAIL PROTECTED]>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979
Re: [PATCH, take4] FUTEX : new PRIVATE futexes
On 4/7/07, Eric Dumazet <[EMAIL PROTECTED]> wrote:
> I am not sure what you want to say.

What Jakub meant is that it is OK for the kernel to reject using unaligned 64-bit futexes. Just return an error in all cases (not just in some).
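A minimal sketch of "return an error in all cases", written as ordinary C rather than as code from the patch under discussion. check_futex64_addr() is a hypothetical name, and the choice of -EINVAL is an assumption made for illustration.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical check, not taken from the patch: refuse every 64-bit
 * futex address that is not naturally aligned, with no special cases.
 */
static int check_futex64_addr(const void *uaddr)
{
	if ((uintptr_t)uaddr % sizeof(uint64_t) != 0)
		return -EINVAL;
	return 0;
}
```

Doing the test once, up front, on the raw address means the result does not depend on which code path the futex later takes.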
Re: COMPILING AND CONFIGURING A NEW KERNEL.
On Fri, 06 Apr 2007 18:26:45 PDT, [EMAIL PROTECTED] said:
>
> YOU SHOULD compile all the drivers necessary to boot your system, into
> the kernel (ie, such drivers should not be built as modules).
>
> This way you will NOT need an initrd file.

It is quite possible to build a kernel that has all the drivers built-in, but still requires an initrd file. For instance, if you have a recent RedHat or Fedora system, '/' may very well be on an LVM partition, which means you need an initrd to do a 'lvm varyonvg' before mounting your real root filesystem will work.
Re: [PATCH 12/13] maps#2: Add /proc/pid/pagemap interface
On Fri, Apr 06, 2007 at 11:55:10PM -0700, Andrew Morton wrote:
> On Fri, 06 Apr 2007 17:03:13 -0500 Matt Mackall <[EMAIL PROTECTED]> wrote:
>
> > Add /proc/pid/pagemap interface
> >
> > This interface provides a mapping for each page in an address space to
> > its physical page frame number, allowing precise determination of what
> > pages are mapped and what pages are shared between processes.
>
> Could we please have a simple read-proc-pid-pagemap.c placed under
> Documentation/ somewhere? Also some sample output for the changelog
> so we can see what all this does.

Working on that. The userspace portion of my tools is very rough at the moment. And in Python.

> Also for kpagemap, please.
>
> Should /proc/pid/pagemap and kpagemap be versioned?

They've both got a variable-sized header, so we can add things there.

-- 
Mathematics is the supreme nostalgia of our time.
Re: Two questions regarding Opening files within Kernel!
JanuGerman wrote:
> Hi Every one, I have got two questions regarding opening files within the
> Linux kernel. If some body can help me, in sorting out this problem, i will
> be very thankful.

First off, likely not something you should be doing:

http://kernelnewbies.org/FAQ/WhyWritingFilesFromKernelIsBad

-- 
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/
Re: COMPILING AND CONFIGURING A NEW KERNEL.
On Sat, 07 Apr 2007 00:45:32 PDT, [EMAIL PROTECTED] said:
> Use rpm-pkg to create a Red Hat RPM kernel package.
> # make rpm-pkg
>
> When built, the RPM package is put in
> /usr/src/packages/RPMS/*your*architecture*
>
> # cd /usr/src/packages/RPMS/x86_64
>
> Install the package (you may have to un-install previous installs)
> # rpm -i kernel-2.6.20-1.x86_64.rpm

It is *highly* recommended that you change the kernel identifier at least slightly, so that you can install '2.6.20-1.local' without overlaying the vendor-supplied 2.6.20-1 kernel. Among other things, this lets you boot back to the equivalent code level in the vendor kernel, so you can figure out if it's your .config file that's broken, or if you hit a bug upgrading from 2.6.19-10 to 2.6.20-1.
Re: Two questions regarding Opening files within Kernel!
Thanks Jan for the response.

> struct dentry *fbar = lookup_one_len("/foo/bar", current->fs->root);

But that gives me a dentry, whereas the file object is still not reachable.

Question: I am currently using a function called dentry_open (declared in fs.h), which takes a "dentry", a "vfsmount" object and a flag (usually RW, i.e. 2), and gives me the file object. With your suggested method, the vfsmount is still not available. In this regard, any idea about a function which directly gives the file object instead of a dentry will be highly appreciated.

Or (kindly see the code below), I need something for the "missing vfsmount":

struct dentry *fbar = lookup_one_len("/foo/bar", current->fs->root);
struct file *file1 = dentry_open(fbar, "missing vfsmount here", 2);

Thanks,
JG
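On the earlier "mode" versus "flag" confusion: the pair means the same thing it does for open(2) in userspace, and as far as I know filp_open() in fs/open.c takes a path plus that same flags/mode pair and returns a struct file pointer with no fd involved. The sketch below is userspace only and the helper names are hypothetical. It shows that flags say how to open the file, while mode only supplies permission bits when O_CREAT actually creates it.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Open an existing file read-only. There is no O_CREAT in flags, so the
 * mode argument is irrelevant and 0 is passed.
 */
static int open_existing(const char *path)
{
	return open(path, O_RDONLY, 0);
}

/*
 * Create the file if needed. Here mode matters: it supplies the
 * permission bits (filtered by the umask) for a newly created file.
 */
static int open_or_create(const char *path, mode_t mode)
{
	return open(path, O_WRONLY | O_CREAT, mode);
}

/* Return the permission bits of `path`, or -1 on error. */
static int perm_bits(const char *path)
{
	struct stat st;

	if (stat(path, &st) < 0)
		return -1;
	return st.st_mode & 0777;
}
```

So for simply reading an existing file, a read-only flag and a mode of 0 are enough.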
If not readdir() then what?
In their closed chambers (well, workshops, http://lwn.net/Articles/226351/), the filesystem developers complain about readdir. I fully appreciate the difficulties. But what I fail to see so far is any proposal for an alternative interface.

The phase to get new functionality included in the next revision of POSIX is over. But that does not mean we should not try to get some sensible new implementation in place. There is, for example, the "High End Computing Extensions Working Group" (the guys who showed up here with their statlite and readdirplus proposals). This is an official working group at the OpenGroup which can produce a document which can be the basis of inclusion in the next revision and become an OpenGroup specification earlier than that.

So, if anybody has a proposal for better interfaces let's hear them. "Now" is a very good time to start working on this.
Re: Reiser4. BEST FILESYSTEM EVER.
On Sat, 07 Apr 2007 16:11:46 +0200, Krzysztof Halasa said:
> > Think about it,... read speeds that are some FOUR times the physical
> > disk read rate,... impossible without the use of compression (or
> > something similar).
>
> It's really impossible with compression only unless you're writing
> only zeros or stuff alike. I don't know what bonnie uses for testing
> but real life data doesn't compress 4 times. Two times, sometimes,

All depends on your data. From a recent "compress the old logs" job on our syslog server:

/logs/lennier.cc.vt.edu/2007/03/maillog-2007-0308: 85.4% -- replaced with /logs/lennier.cc.vt.edu/2007/03/maillog-2007-0308.gz

And it wasn't a tiny file either - it's a busy mailserver, the logs run to several hundred megabytes a day. Syslogs *often* compress 90% or more, meaning a 10X compression.

> but then it will be typically slower than disk access (I mean read,
> as write will be much slower).

Actually, as far back as 1998 or so, I was able to document 20% *speedups* on an AIX system that supported compressed file systems - and that was from when a 133MHz PowerPC 604e was a *fast* machine. Since then, CPUs have gotten faster at a faster rate than disks have, even increasing the speedup.

The basic theory is that unless you're sitting close to 100% CPU, it is *faster* to burn some CPU to compress/decompress a 4K chunk of data down to 2K, and then move 2K to the disk drive, than it is to move 4K. It's particularly noticeable for larger files - if you can apply the compression to remove the need to move 2M of data faster than you can move 2M of data, you win.
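The arithmetic behind both claims can be sketched as follows. The function names are hypothetical, and the cost model (disk transfer plus decompression, with no overlap between the two) is a deliberate simplification assumed here, not something measured in the thread.

```c
/*
 * Compression ratio implied by a "percent saved" figure: saving 90%
 * leaves 10% of the data behind, which is a 10x ratio.
 */
static double ratio_from_percent_saved(double percent_saved)
{
	return 100.0 / (100.0 - percent_saved);
}

/* Seconds to read `bytes` from a disk sustaining `disk_bps` bytes/s. */
static double read_time_plain(double bytes, double disk_bps)
{
	return bytes / disk_bps;
}

/*
 * Seconds to read the same logical data when it is stored compressed by
 * `ratio` and the CPU decompresses at `cpu_bps` bytes/s of output.
 * Assumes, simplistically, that I/O and decompression do not overlap.
 */
static double read_time_compressed(double bytes, double disk_bps,
				   double ratio, double cpu_bps)
{
	return bytes / ratio / disk_bps + bytes / cpu_bps;
}
```

By this formula the 85.4% figure above works out to a ratio of a little under 7x. Under this model compression wins whenever the decompression time is smaller than the disk time saved, which is the "burn some CPU to move 2K instead of 4K" argument in numbers; any throughput figures plugged in are the caller's assumptions.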
Re: SD scheduler testing hitch
On Sat, 2007-04-07 at 18:20 +0200, Mike Galbraith wrote: > xx.c > > #include > #include > > #define max(a,b) ((a) > (b) ? (a) : (b)) > #define min(a,b) ((a) < (b) ? (a) : (b)) > > int main(void) > { > struct timeval then, now; > struct timespec t = {0, 1000}, r; > > for(;;) { > int t1, t2; > short i; > > if (gettimeofday(&then, 0)) > break; > for (i = 1; i > 0; i++); > if (gettimeofday(&now, 0)) > break; > t2 = max(then.tv_usec, now.tv_usec); > t1 = min(then.tv_usec, now.tv_usec); > if (t2 - t1 >= 1000 && nanosleep(&t, &r)) > break; > } > return 0; > } I lowered the time to 500us, and ran at nice -10.. it starves tenpercent here every time. (ran as taskset -c 1 nice -n -10 ./fairtest) The starving 10% duty cycle task has trouble getting 1% CPU. -Mike // gcc -O2 -o tenp tenp.c -lrt // code from interbench.c #include #include #include #include #include #include /* * Start $forks processes that run for 10% cpu time each. Set this to * 15 * number of cpus for best effect. */ int forks = 1; unsigned long run_us = 10, sleep_us; unsigned long loops_per_ms; void terminal_error(const char *name) { fprintf(stderr, "\n"); perror(name); exit (1); } unsigned long long get_nsecs(struct timespec *myts) { if (clock_gettime(CLOCK_REALTIME, myts)) terminal_error("clock_gettime"); return (myts->tv_sec * 10 + myts->tv_nsec ); } void burn_loops(unsigned long loops) { unsigned long i; /* * We need some magic here to prevent the compiler from optimising * this loop away. Otherwise trying to emulate a fixed cpu load * with this loop will not work. 
	 */
	for (i = 0 ; i < loops ; i++)
		asm volatile("" : : : "memory");
}

/* Use this many usecs of cpu time */
void burn_usecs(unsigned long usecs)
{
	unsigned long ms_loops;

	ms_loops = loops_per_ms / 1000 * usecs;
	burn_loops(ms_loops);
}

void microsleep(unsigned long long usecs)
{
	struct timespec req, rem;

	rem.tv_sec = rem.tv_nsec = 0;

	/* Split microseconds into whole seconds plus leftover nanoseconds */
	req.tv_sec = usecs / 1000000;
	req.tv_nsec = (usecs - (req.tv_sec * 1000000)) * 1000;
continue_sleep:
	if ((nanosleep(&req, &rem)) == -1) {
		if (errno == EINTR) {
			if (rem.tv_sec || rem.tv_nsec) {
				req.tv_sec = rem.tv_sec;
				req.tv_nsec = rem.tv_nsec;
				goto continue_sleep;
			}
			goto out;
		}
		terminal_error("nanosleep");
	}
out:
	return;
}

/*
 * In an unoptimised loop we try to benchmark how many meaningless loops
 * per second we can perform on this hardware to fairly accurately
 * reproduce certain percentage cpu usage
 */
void calibrate_loop(void)
{
	unsigned long long start_time, loops_per_msec, run_time = 0,
		min_run_us = run_us;
	unsigned long loops;
	struct timespec myts;
	int i;

	printf("Calibrating loop\n");
	loops_per_msec = 1000000;
redo:
	/* Calibrate to within 1% accuracy (run_time is in ns, target 1ms) */
	while (run_time > 1010000 || run_time < 990000) {
		loops = loops_per_msec;
		start_time = get_nsecs(&myts);
		burn_loops(loops);
		run_time = get_nsecs(&myts) - start_time;
		loops_per_msec = (1000000 * loops_per_msec / run_time ?
			: loops_per_msec);
	}

	/* Rechecking after a pause increases reproducibility */
	microsleep(1);
	loops = loops_per_msec;
	start_time = get_nsecs(&myts);
	burn_loops(loops);
	run_time = get_nsecs(&myts) - start_time;

	/* Tolerate 5% difference on checking */
	if (run_time > 1050000 || run_time < 950000)
		goto redo;
	loops_per_ms=loops_per_msec;

	printf("Calibrating sleep interval\n");
	microsleep(1);
	/* Find the smallest time interval close to 1ms that we can sleep */
	for (i = 0; i < 100; i++) {
		start_time=get_nsecs(&myts);
		microsleep(1000);
		run_time=get_nsecs(&myts)-start_time;
		run_time /= 1000;
		if (run_time < run_us && run_us > 1000)
			run_us = run_time;
	}
	/* Then set run_us to that duration and sleep_us to 9 x that */
	sleep_us = run_us * 9;

	printf("Calibrating run interval\n");
	microsleep(1);
	/* Do a few runs to see what really gets us run_us runtime */
	for (i = 0; i < 100; i++) {
		start_time=get_nsecs(&myts);
		burn_usecs(run_us);
		run_time=get_nsecs(&myts)-start_time;
		run_time /= 1000;
		if (run_time < min_run_us && run_time > run_us)
			min_run_us = run_time;
	}
	if (min_run_us < run_us)
		run_us = run_us * run_us / min_run_us;
	printf("Each fork will run for %lu usecs and sleep for %lu usecs\n",
		run_us, sleep_us);
}

#define max(a,b) ((a) > (b) ? (a) : (b))
#define min(a,b) ((a) < (b) ? (a) : (b))

void steal(void)
{
	struct timeval then, now;
	struct timespec t = {0, 500}, r;

	for(;;) {
		int t1, t2;
		short i;

		if (gettimeofday(&then, 0))
			break;
		for (i = 1; i > 0; i++);
		if (gettimeofday(&now, 0))
			break;
		t2 = max(then.tv_usec, now.tv_usec);
		t1 = min(then.tv_usec, now.tv_usec);
		if (t2 - t1 >= 500 && nanosleep(&t, &r))
			break;
	}
}

int main(void){
	int i, child;

	calibrate_loop();
	pr
Re: [patch 2/4] clean up identify_cpu
Andrew Morton wrote:
> x86_64 uses this too.
>
> WARNING: arch/x86_64/kernel/built-in.o - Section mismatch: reference to
> .init.text:mtrr_bp_init from .text.identify_cpu after 'identify_cpu' (at
> offset 0x655)
>

OK, two patches to follow: an x86-64 variant of the bugs.h cleanup, and a replacement for this patch.

I don't have an x86-64 compile environment on hand, so the 64-bit parts are completely untested, but they *look* like they should work.

    J
Re: Reiser4. BEST FILESYSTEM EVER.
On Sat, 07 Apr 2007 16:11:46 +0200, Krzysztof Halasa said:
>
> Gzip - 3 files (zeros only, raw DV data from video camera, x86_64
> kernel rpm file), 10 MB of data (10*1024*1024),
> $ l -Ggh zeros dv bin
> -rw-r--r-- 1 10M Apr 7 15:30 bin
> -rw-r--r-- 1 10M Apr 7 15:31 dv
> -rw-r--r-- 1 10M Apr 7 15:31 zeros
> $ l -Ggh zeros.gz dv.gz bin.gz
> -rw-r--r-- 1 10K Apr 7 15:31 zeros.gz
> -rw-r--r-- 1 9.1M Apr 7 15:31 dv.gz
> -rw-r--r-- 1 9.3M Apr 7 15:30 bin.gz
>
> ... and though the numbers may still sound impressive, space savings
> are less than 10%.

I am quite sure that the kernel RPM file is *already* compressed, at least somewhat. Otherwise, it's hard to explain this:

-rw-r--r--  1 529 263 17835757 Apr 5 00:19 kernel-2.6.20-1.3045.fc7.x86_64.rpm

% du -s /lib/modules/2.6.20-1.3038.fc7/
76436	/lib/modules/2.6.20-1.3038.fc7/

and it can't all be slack space at ends of files:

% find /lib/modules/2.6.20-1.3038.fc7/ -type f | wc -l
1482

Even on a 4K filesystem, the *max* wasted slack would be just under 6M (1482 files, each wasting less than one 4K block).

And what do you know - if you tar.gz that /lib/modules:

% tar czf /tmp/kern.tar.gz /lib/modules/2.6.20-1.3038.fc7/
tar: Removing leading `/' from member names
% ls -l /tmp/kern.tar.gz
-rw-r--r-- 1 valdis valdis 15506359 2007-04-07 13:19 /tmp/kern.tar.gz

The *compressed* tar is about 15M (remember the .rpm contained a 2M vmlinuz as well - that's compressed too). So we're right up to the 17M of the .rpm, which indicates that the RPM is compressed at a factor close to tar.gz.

I'd not be surprised to find out that your digital-video also contains at least some light compression - if it's mpeg or similar, that's already had some *heavy* compression done to it.
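The slack-space argument can be checked with a few lines of arithmetic. This is a sketch, not anything from the original mail: the helper names are invented here, `du -s` is assumed to report 1 KiB units (the GNU default), and directory blocks are ignored.

```c
#include <assert.h>

/*
 * Upper bound on slack: every file wastes at most (block_size - 1) bytes
 * in its final block.
 */
unsigned long long max_slack_bytes(unsigned long long nfiles,
				   unsigned long long block_size)
{
	assert(block_size > 0);
	return nfiles * (block_size - 1);
}

/*
 * Lower bound on real file data, given the `du -s` figure in KiB
 * (assumed unit) and the worst-case slack computed above.
 */
unsigned long long min_payload_bytes(unsigned long long du_kib,
				     unsigned long long nfiles,
				     unsigned long long block_size)
{
	unsigned long long on_disk = du_kib * 1024ULL;
	unsigned long long slack = max_slack_bytes(nfiles, block_size);

	return on_disk > slack ? on_disk - slack : 0;
}
```

With the figures quoted above, max_slack_bytes(1482, 4096) is 6068790 bytes, so min_payload_bytes(76436, 1482, 4096) leaves at least 72201674 bytes of module data. That is still more than four times the 17835757-byte .rpm, so slack alone cannot account for the difference.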
[PATCH 1/2] Clean up asm-x86_64/bugs.h
Most of asm-x86_64/bugs.h is code which should be in a C file, so put it there. Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]> Cc: Andi Kleen <[EMAIL PROTECTED]> Cc: Linus Torvalds <[EMAIL PROTECTED]> --- arch/x86_64/kernel/Makefile |3 ++- arch/x86_64/kernel/bugs.c| 28 include/asm-x86_64/alternative.h |1 + include/asm-x86_64/bugs.h| 30 -- 4 files changed, 35 insertions(+), 27 deletions(-) === --- a/arch/x86_64/kernel/Makefile +++ b/arch/x86_64/kernel/Makefile @@ -8,7 +8,8 @@ obj-y := process.o signal.o entry.o trap ptrace.o time.o ioport.o ldt.o setup.o i8259.o sys_x86_64.o \ x8664_ksyms.o i387.o syscall.o vsyscall.o \ setup64.o bootflag.o e820.o reboot.o quirks.o i8237.o \ - pci-dma.o pci-nommu.o alternative.o hpet.o tsc.o sched-clock.o + pci-dma.o pci-nommu.o alternative.o hpet.o tsc.o sched-clock.o \ + bugs.o obj-$(CONFIG_STACKTRACE) += stacktrace.o obj-$(CONFIG_X86_MCE) += mce.o therm_throt.o === --- /dev/null +++ b/arch/x86_64/kernel/bugs.c @@ -0,0 +1,28 @@ +/* + * arch/x86_64/kernel/bugs.c + * + * Copyright (C) 1994 Linus Torvalds + * Copyright (C) 2000 SuSE + * + * This is included by init/main.c to check for architecture-dependent bugs. 
+ * + * Needs: + * void check_bugs(void); + */ + +#include +#include +#include +#include +#include +#include + +void __init check_bugs(void) +{ + identify_cpu(&boot_cpu_data); +#if !defined(CONFIG_SMP) + printk("CPU: "); + print_cpu_info(&boot_cpu_data); +#endif + alternative_instructions(); +} === --- a/include/asm-x86_64/alternative.h +++ b/include/asm-x86_64/alternative.h @@ -16,6 +16,7 @@ struct alt_instr { u8 pad[5]; }; +extern void alternative_instructions(void); extern void apply_alternatives(struct alt_instr *start, struct alt_instr *end); struct module; === --- a/include/asm-x86_64/bugs.h +++ b/include/asm-x86_64/bugs.h @@ -1,28 +1,6 @@ -/* - * include/asm-x86_64/bugs.h - * - * Copyright (C) 1994 Linus Torvalds - * Copyright (C) 2000 SuSE - * - * This is included by init/main.c to check for architecture-dependent bugs. - * - * Needs: - * void check_bugs(void); - */ +#ifndef _ASM_X86_64_BUGS_H +#define _ASM_X86_64_BUGS_H -#include -#include -#include -#include +void check_bugs(void); -extern void alternative_instructions(void); - -static void __init check_bugs(void) -{ - identify_cpu(&boot_cpu_data); -#if !defined(CONFIG_SMP) - printk("CPU: "); - print_cpu_info(&boot_cpu_data); -#endif - alternative_instructions(); -} +#endif /* _ASM_X86_64_BUGS_H */
[PATCH 2/2] x86: clean up identify_cpu
identify_cpu() is used to identify both the boot CPU and secondary CPUs, but it performs some actions which only apply to the boot CPU. Those functions are therefore really __init functions, but because they're called by identify_cpu(), they must be marked __cpuinit. This patch splits identify_cpu() into identify_boot_cpu() and identify_secondary_cpu(), and calls the appropriate init functions from each. Also, identify_boot_cpu() and all the functions it dominates are marked __init. The same change applies to both i386 and x86_64, and both have to be changed together because they share the mtrr setup code. Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]> Cc: Andi Kleen <[EMAIL PROTECTED]> --- arch/i386/kernel/cpu/common.c| 41 + arch/i386/kernel/cpu/mtrr/main.c |4 +-- arch/i386/kernel/smpboot.c |2 - arch/i386/kernel/sysenter.c |2 - arch/x86_64/kernel/bugs.c|2 - arch/x86_64/kernel/setup.c | 47 ++ arch/x86_64/kernel/smpboot.c |2 - include/asm-i386/processor.h |3 +- include/asm-x86_64/processor.h |3 +- 9 files changed, 70 insertions(+), 36 deletions(-) === --- a/arch/i386/kernel/cpu/common.c +++ b/arch/i386/kernel/cpu/common.c @@ -390,7 +390,7 @@ __setup("serialnumber", x86_serial_nr_se /* * This does the hard work of actually picking apart the CPU stuff... */ -void __cpuinit identify_cpu(struct cpuinfo_x86 *c) +static void __cpuinit identify_cpu(struct cpuinfo_x86 *c) { int i; @@ -486,30 +486,43 @@ void __cpuinit identify_cpu(struct cpuin for (i = 0; i < NCAPINTS; i++) printk(" %08lx", c->x86_capability[i]); printk("\n"); - +} + +void __init identify_boot_cpu(void) +{ + identify_cpu(&boot_cpu_data); + + /* Init Machine Check Exception if available. 
*/ + mcheck_init(&boot_cpu_data); + + sysenter_setup(); + enable_sep_cpu(); + + mtrr_bp_init(); +} + +void __cpuinit identify_secondary_cpu(struct cpuinfo_x86 *c) +{ + int i; + + BUG_ON(c == &boot_cpu_data); + identify_cpu(c); /* * On SMP, boot_cpu_data holds the common feature set between * all CPUs; so make sure that we indicate which features are * common between the CPUs. The first time this routine gets * executed, c == &boot_cpu_data. -*/ - if ( c != &boot_cpu_data ) { - /* AND the already accumulated flags with these */ - for ( i = 0 ; i < NCAPINTS ; i++ ) - boot_cpu_data.x86_capability[i] &= c->x86_capability[i]; - } +* AND the already accumulated flags with these +*/ + for ( i = 0 ; i < NCAPINTS ; i++ ) + boot_cpu_data.x86_capability[i] &= c->x86_capability[i]; /* Init Machine Check Exception if available. */ mcheck_init(c); - if (c == &boot_cpu_data) - sysenter_setup(); enable_sep_cpu(); - if (c == &boot_cpu_data) - mtrr_bp_init(); - else - mtrr_ap_init(); + mtrr_ap_init(); } #ifdef CONFIG_X86_HT === --- a/arch/i386/kernel/cpu/mtrr/main.c +++ b/arch/i386/kernel/cpu/mtrr/main.c @@ -571,7 +571,7 @@ extern void cyrix_init_mtrr(void); extern void cyrix_init_mtrr(void); extern void centaur_init_mtrr(void); -static void __cpuinit init_ifs(void) +static void __init init_ifs(void) { #ifndef CONFIG_X86_64 amd_init_mtrr(); @@ -639,7 +639,7 @@ static struct sysdev_driver mtrr_sysdev_ * initialized (i.e. before smp_init()). 
* */ -void __cpuinit mtrr_bp_init(void) +void __init mtrr_bp_init(void) { init_ifs(); === --- a/arch/i386/kernel/smpboot.c +++ b/arch/i386/kernel/smpboot.c @@ -157,7 +157,7 @@ static void __cpuinit smp_store_cpu_info *c = boot_cpu_data; if (id!=0) - identify_cpu(c); + identify_secondary_cpu(c); /* * Mask B, Pentium, but not Pentium MMX */ === --- a/arch/i386/kernel/sysenter.c +++ b/arch/i386/kernel/sysenter.c @@ -68,7 +68,7 @@ extern const char vsyscall_sysenter_star extern const char vsyscall_sysenter_start, vsyscall_sysenter_end; static struct page *syscall_pages[1]; -int __cpuinit sysenter_setup(void) +int __init sysenter_setup(void) { void *syscall_page = (void *)get_zeroed_page(GFP_ATOMIC); syscall_pages[0] = virt_to_page(syscall_page); === --- a/arch/x86_64/kernel/bugs.c +++ b/arch/x86_64/kernel/bugs.c @@ -19,7 +19,7 @@ void __init check_bugs(void) { - identify_cpu(&boot_cpu_data); + i
Re: [PATCH] console UTF-8 fixes
On Sat, Apr 07, 2007 at 01:00:48PM +0200, Jan Engelhardt wrote:

Hi,

> Please, no dot, and no inverse color.
> Imagine someone had the following bitmap for :

No dot, I'm already convinced.

To clarify the inverse thingy:

This is what the current kernel does:
1) tries to display the desired symbol
2) if it fails, tries to display U+FFFD (which usually looks similar to an inverted question mark)
3) if this fails again then displays a normal '?' (or a different symbol due to a bug discussed below)

Here's my proposal. This only alters the 3rd step, not the first two:
1) tries to display the desired symbol
2) if it fails, tries to display U+FFFD, still with _normal_ attributes
3) if this fails then display an ascii '?' with inverted attributes

So you won't get "double" inversion. If you do have U+FFFD in your font then this will introduce no change. If you don't have U+FFFD, you'll see inverse question marks instead of normal ones.

> I blame your latin2 unicode map. (See above about 'Û'.)

There's nothing wrong with my latin2 unicode map, and I've located and changed the part _in the kernel_ that displays a false glyph using the algorithm I've outlined. It just uses "the glyph at that code position within the glyph table" as a fallback, which might be okay in 8-bit mode (and I haven't modified the behavior in that case), but I got rid of this behavior in UTF-8 mode since it's definitely a fault in the world of Unicode.

> It should perhaps display a regular 'u' if it cannot display 'û',

I rather think it should display U+FFFD but YMMV.

> but definitely not 'ü' (which is not called a double accent, btw).

This is not the character I've been talking about, I actually _did_ talk about u with double acute accent (ű - you might not have seen this character so far, AFAIK it's only used in Hungarian, no other languages). But we agree that the kernel definitely shouldn't display a character with a different accent on it. This is one of the bugs my patch addresses.

bye,

Egmont
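The proposed three-step fallback can be sketched in plain user-space C. This is only an illustration of the decision order described in the mail, not the code from the patch: `pick_glyph`, `struct glyph_choice` and the two toy font predicates are hypothetical names invented here, and a real console would consult its Unicode map instead of these callbacks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT_CHAR 0xFFFDu

/* What to draw, and whether to draw it with inverted attributes. */
struct glyph_choice {
	uint32_t codepoint;
	bool inverse;
};

/* Hypothetical font query: does the loaded font have a glyph for cp? */
typedef bool (*has_glyph_fn)(uint32_t cp);

struct glyph_choice pick_glyph(uint32_t wanted, has_glyph_fn has_glyph)
{
	assert(has_glyph != NULL);

	/* 1) the desired symbol itself */
	if (has_glyph(wanted))
		return (struct glyph_choice){ wanted, false };
	/* 2) U+FFFD, still with normal attributes */
	if (has_glyph(REPLACEMENT_CHAR))
		return (struct glyph_choice){ REPLACEMENT_CHAR, false };
	/* 3) an ASCII '?', inverted so it stands out from a real '?' */
	return (struct glyph_choice){ '?', true };
}

/* Toy fonts for trying the function out. */
bool ascii_only_font(uint32_t cp)
{
	return cp < 0x80;
}

bool ascii_plus_replacement_font(uint32_t cp)
{
	return cp < 0x80 || cp == REPLACEMENT_CHAR;
}
```

Asking for U+0171 (u with double acute) from `ascii_only_font` yields an inverted '?', while `ascii_plus_replacement_font` yields U+FFFD with normal attributes. In neither case is a glyph with a different accent substituted.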
[PATCH] partitions: Enhance Kconfig help text for EESOX and MSDOS formats
From: John Anthony Kazos Jr. <[EMAIL PROTECTED]>

Adds help text for ACORN_PARTITION_EESOX and improves help text for MSDOS_PARTITION in fs/partitions/Kconfig.

Signed-off-by: John Anthony Kazos Jr. <[EMAIL PROTECTED]>

---
Applied against Linux v2.6.20.6.

--- linux-2.6.20.6-orig/fs/partitions/Kconfig	2007-04-06 16:02:48.000000000 -0400
+++ linux-2.6.20.6-mod/fs/partitions/Kconfig	2007-04-07 13:22:17.000000000 -0400
@@ -32,6 +32,10 @@ config ACORN_PARTITION_EESOX
 	bool "EESOX partition support" if PARTITION_ADVANCED
 	default y if ARCH_ACORN
 	depends on ACORN_PARTITION
+	help
+	  EESOX SCSI card on-disk partition format support for Acorn
+	  systems. If you have one of these cards, or want to use a disk
+	  written by one, say Y.
 
 config ACORN_PARTITION_ICS
 	bool "ICS partition support" if PARTITION_ADVANCED
@@ -108,7 +112,11 @@ config MSDOS_PARTITION
 	bool "PC BIOS (MSDOS partition tables) support" if PARTITION_ADVANCED
 	default y
 	help
-	  Say Y here.
+	  Standard PC-compatible partition table support for Linux. Used by
+	  i386 systems, Linux/Windows dual-boot systems, and many others.
+	  Unless you are certain your system does not use this partition
+	  table format, and you're not using any disks from a system that
+	  does, say Y.
 
 config BSD_DISKLABEL
 	bool "BSD disklabel (FreeBSD partition tables) support"
Re: [PATCH 3/7] Containers (V8): Add generic multi-subsystem API to containers
On 4/6/07, Srivatsa Vaddagiri <[EMAIL PROTECTED]> wrote:
> On Fri, Apr 06, 2007 at 04:32:24PM -0700, [EMAIL PROTECTED] wrote:
> > +static int attach_task(struct container *cont, struct task_struct *tsk)
> > {
>
> [snip]
>
> > +	task_lock(tsk);
>
> You need to check here if task state is PF_EXITING and fail with -ESRCH
> if so? Otherwise we risk breaking refcount on init_container_group.

Yes, I think you're right; I've now changed it to this in my tree:

	task_lock(tsk);
	if (tsk->flags & PF_EXITING) {
		task_unlock(tsk);
		put_container_group(newcg);
		return -ESRCH;
	}
	rcu_assign_pointer(tsk->containers, newcg);
	task_unlock(tsk);

Thanks,

Paul
Re: Reiser4. BEST FILESYSTEM EVER.
On Fri, 06 Apr 2007 19:47:36 PDT, [EMAIL PROTECTED] said:
> On Fri, 6 Apr 2007 11:21:19 -0400, "Jan Harkes" <[EMAIL PROTECTED]>
> > With compression there is a pretty high probability that one corrupted
> > byte or disk block will result in loss of a considerably larger amount
> > of data.
>
> Bad blocks are NOT dealt with by the filesystem,... so your comment is
> irrelevant, or just plain wrong.
>
> If your filesystem is writing to bad blocks, then throw away your
> operating system.

You know... occasionally, blocks go bad *after* you write to them.

If you have an uncompressed filesystem, it's often possible to recover most of the file, and just have a few 512-byte blocks of zeros, simply by doing something like 'dd if=bad.file of=recovered.file bs=512 conv=noerror,sync' or careful applications of 'skip=N'. If it's compressed, you usually can't recover the rest of a compression group if a previous block is lost.

(And for those who talk about backups - yes, taking backups is good. However, it's the rare laptop or desktop machine that can afford the luxury of RAID disks, and backups usually happen once a night, if that often. This means that if you've been working hard on something important all day, and the disk blows chunks at 4:30PM, you *will* be suddenly very concerned over exactly how much you can recover off the failing drive.

And yes, I'd *love* to have all my users connected to nice SAN systems that do snapshotting and remote replication to DR sites and all that - but have you ever *priced* a petabyte of SAN storage, the NAS gateways to serve it to users, and upgrading several tens of thousands of network ports to Gig-E? Hint - US$1M would get us through a pilot, and probably $5M and up to *start* deployment. Anybody wanna buy us an EMC DMX-3? :)

http://www.emc.com/products/systems/symmetrix/DMX_series/DMX3.jsp
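For readers unfamiliar with the dd trick, the block-by-block salvage it performs looks roughly like the following. This is a minimal sketch under assumptions, not a replacement for dd: `salvage_copy` is a name made up here, a failed read is assumed to show up as EIO, and only regular files (where fstat gives a size) are handled.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define SALVAGE_BLOCK 512

/*
 * Copy in_fd to out_fd in 512-byte blocks. A block that cannot be read
 * (EIO) is written out as zeros, so all later data keeps its original
 * offset. Returns the number of zero-filled blocks, or -1 on any other
 * failure.
 */
long salvage_copy(int in_fd, int out_fd)
{
	unsigned char buf[SALVAGE_BLOCK];
	struct stat st;
	off_t offset = 0;
	long bad_blocks = 0;

	assert(in_fd >= 0 && out_fd >= 0);
	if (fstat(in_fd, &st) != 0)
		return -1;

	while (offset < st.st_size) {
		size_t want = sizeof buf;
		ssize_t got;

		/* The last block of the file may be shorter than 512 bytes. */
		if (st.st_size - offset < (off_t)want)
			want = (size_t)(st.st_size - offset);

		got = pread(in_fd, buf, want, offset);
		if (got < 0 && errno == EINTR)
			continue;
		if (got < 0 && errno != EIO)
			return -1;
		if (got == 0)
			break;	/* file shrank underneath us */
		if (got < 0) {
			fprintf(stderr, "unreadable block at offset %lld, zero-filling\n",
				(long long)offset);
			memset(buf, 0, want);
			got = (ssize_t)want;
			bad_blocks++;
		}
		if (pwrite(out_fd, buf, (size_t)got, offset) != got)
			return -1;
		offset += got;
	}
	return bad_blocks;
}
```

On an uncompressed file each lost block costs exactly that block. On a compressed file the same zero-filled hole sits in the middle of a compression group, and everything after it in that group typically cannot be decoded, which is the point made above.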
[PATCH] ip_tables.h
Hi lads,

I had some problems compiling the external netfilter modules due to missing definitions. I googled a lot and saw a lot of people having the same problems, but no real answer on how to fix it.

So.. I made a little patch which makes things work for me, at least. Modules that work after applying the patch are the geoip module, the connlimit module, and probably more, but I didn't test them.

Please note, I am not a coder, not a maintainer, and I am happy that I didn't break anything, so please don't consider this a proposal for inclusion in the kernel or something. I am just in favor of sharing what helped me get things to work; if I can help others with it, or if it is interesting material for inclusion, even better :)

So, the patch is attached in this email and can also be found on http://www.patrickale.eu/documents/archives/patches/ip_tables.h.diff

I hope this helps a person or two.

Patrick

--- include/linux/netfilter_ipv4/ip_tables.h.orig	2007-04-07 20:30:25.344365707 +0200
+++ include/linux/netfilter_ipv4/ip_tables.h	2007-04-07 20:34:05.076887550 +0200
@@ -34,6 +34,12 @@
 #define ipt_table xt_table
 #define ipt_get_revision xt_get_revision
 
+#define ipt_register_match(mtch)	\
+({	(mtch)->family = AF_INET;	\
+	xt_register_match(mtch); })
+#define ipt_unregister_match(mtch) xt_unregister_match(mtch)
+
+
 /* Yes, Virginia, you have to zero the padding. */
 struct ipt_ip {
 	/* Source and destination IP addr */