Two questions regarding Opening files within Kernel!

2007-04-07 Thread JanuGerman
Hi everyone,

  I have two questions about opening files within the Linux kernel. If
somebody can help me sort this out, I will be very thankful.

1)   I have only an absolute file path: no dentry, no inode, and no
vfsmount object. Which function can I call to get a "file" object
associated with that path? I have looked around the source code,
especially fs/open.c and some other files, but each function requires a
"mode" and an "fd" parameter besides the file path. I am confused about the
"mode" parameter (and how it differs from "flag"), i.e. what to pass, and
for "fd" I am not sure what value to pass, since there is no open file yet,
only a file path. Any ideas?

2) Is there any functionality in the Linux kernel source for reading a file
one line at a time, or some indirect way to set the buffer size for one
read? That is, is there an existing header for doing text I/O rather than
binary I/O within the kernel source?

Thanks,
JG



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Reiser4. BEST FILESYSTEM EVER.

2007-04-07 Thread Willy Tarreau
On Fri, Apr 06, 2007 at 10:58:45PM -0700, [EMAIL PROTECTED] wrote:
> You know,... you cut out this bit:
> 
> -
> 
> > The following benchmarks are from
> > 
> > http://linuxhelp.150m.com/resources/fs-benchmarks.htm or,
> > http://m.domaindlx.com/LinuxHelp/resources/fs-benchmarks.htm

...

Hey John, please change your disk, it's scratched and you're repeating
yourself again and again. At first I thought "Oh cool, some good news
about reiser4", now when I see "reiserfs" in a thread, I think "oh no,
not this boring guy who escaped from the asylum again !". I hope this
thread will be cut shortly so that you stop doing bad publicity to
reiserfs and its developers, because when a product is indicated as
good by stupid people, it's really doing harm.

Also, about this part :
[Jan]
> > But in the end everything is a tradeoff. You can save diskspace, but
> > increase the cost of corruption. 

I don't 100% agree with Jan, because for some usages (temporary space),
light compression can increase speed. For instance, when processing logs,
I get better speed by compressing intermediate files with LZO on the fly.

[John]
> You deliberately ignored the fact that bad blocks are NOT dealt with by
> the filesystem,... but by the operating system. Like I said: If your
> filesystem is writing to bad blocks, then throw away your operating
> system.

But what you write here is complete crap. The filesystem relies on a
linear block device. The operating system is responsible for doing
read retries or reporting errors on bad blocks, but the FS and only
the FS can decide how not to use some known defective areas, for
instance not putting any metadata on them nor any useful data.

Now if you want to stop writing stupid things again and again, take
your bag, don't miss the bus to school, and listen to the teachers
instead of playing games on your calculator.

Willy
PS: no need to reply either, I'll kill this thread and your address here.



Re: [PATCH 4/5] Char: cyclades, remove volatiles

2007-04-07 Thread Jiri Slaby
Andrew Morton napsal(a):
> On Wed,  4 Apr 2007 23:45:38 +0200 (CEST) Jiri Slaby <[EMAIL PROTECTED]> 
> wrote:
> 
>> cyclades, remove volatiles
> 
> The other changes seem uncontroversial, but this one has the potential
> to change runtime behaviour.  And cyclades.c is a driver which some people
> actually use ;)

Well, you don't happen to know anybody with a Z card, do you?

But all the volatiles were either
- used locally (loop variables, a non-pointer count read once from HW), so that
  nothing has a chance to change them in memory, or
- accessed by readX/writeX (pointers to mapped HW), and that should be OK.

> Have these changes been runtime-tested?

Yes, this time :). I have at least a PCI Cyclades Y card within reach to test.

thanks,
-- 
http://www.fi.muni.cz/~xslaby/            Jiri Slaby
faculty of informatics, masaryk university, brno, cz
e-mail: jirislaby gmail com, gpg pubkey fingerprint:
B674 9967 0407 CE62 ACC8  22A0 32CC 55C3 39D4 7A7E


[BUG] scheduler: first timeslice of the exiting thread

2007-04-07 Thread Satoru Takeuchi
Hi Ingo and all,

When I was examining the following program ...

  1. There is a large number of small jobs, each taking several msecs,
     and the number of jobs increases constantly.
  2. The process creates a thread or a process per job (I examined both
     the thread model and the process model).
  3. Each child process/thread does its assigned job and exits immediately.

... I found that, contrary to my expectation, the thread model's latency is
longer than the process model's. This is caused by the current
sched_fork()/sched_exit() implementation, which works as follows:

  a) On sched_fork, the creator shares its timeslice with the new process.
  b) On sched_exit, if the exiting process has not yet exhausted its first
     timeslice, it gives its timeslice to the parent.

This causes no problem in the process model, since the creator is the parent.
In the thread model, however, the new thread's parent is not the creator but
the creator's parent. Hence, in this kind of program, the creator can't
get the shared timeslice back and exhausts its timeslice at a rate of
knots. In addition, the parent (typically the shell?) somehow gets extra
timeslice.
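
The effect of rules a) and b) can be illustrated with a small user-space
simulation. This is only a sketch, not kernel code: struct sim_task and the
sim_* helpers are hypothetical names, the split is assumed to be an even
halving, the child is assumed to use none of its slice before exiting, and
timeslice refills and clamping are ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for task_struct: only the fields this sketch needs. */
struct sim_task {
	struct sim_task *parent;	/* receives the leftover slice on exit */
	int time_slice;			/* remaining ticks */
};

/* Rule a): the creator splits its remaining timeslice with the new task. */
static void sim_fork(struct sim_task *creator, struct sim_task *child,
		     struct sim_task *parent)
{
	child->parent = parent;
	child->time_slice = (creator->time_slice + 1) / 2;
	creator->time_slice /= 2;
}

/* Rule b): an exiting task hands its unused first timeslice to its parent. */
static void sim_exit(struct sim_task *child)
{
	child->parent->time_slice += child->time_slice;
	child->time_slice = 0;
}

/*
 * Run `jobs` create/exit cycles and return the creator's remaining
 * timeslice. In the thread model the child's parent is the creator's
 * parent (e.g. the shell); in the process model it is the creator itself.
 */
int creator_slice_after(int jobs, int thread_model, int initial_slice)
{
	struct sim_task shell = { NULL, initial_slice };
	struct sim_task creator = { &shell, initial_slice };
	int i;

	assert(jobs >= 0 && initial_slice >= 0);
	for (i = 0; i < jobs; i++) {
		struct sim_task child;

		sim_fork(&creator, &child,
			 thread_model ? creator.parent : &creator);
		sim_exit(&child);
	}
	return creator.time_slice;
}
```

Under these assumptions, creator_slice_after(5, 0, 10) stays at 10 in the
process model, while creator_slice_after(4, 1, 10) drops to 0 in the thread
model, because every leftover slice goes to the shell instead of the creator.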

I believe this is a bug and the exiting process should give its timeslice
to the creator. I have two candidate plans for a patch to fix this problem:

 a) Add a field for the creator to task_struct. This needs extra memory.
 b) Don't add an extra field; instead make the creator the thread's parent,
    as in process creation. However, this has many side effects; for example,
    we would also need to change the sys_getppid() implementation.

What do you think? Any comments are welcome.

BTW, we can easily confirm the problem with systemtap, a convenient diagnostic
program.

Test programs (attached to the mail):

  - satprocess.c:  Process model. It creates a child process and waits for it
                   several times. Each child process exits immediately.
  - satthread.c:   Thread model. It creates a child thread and joins it several
                   times. Each child thread exits immediately.
  - fork_exit.stp: systemtap script to monitor satprocess/satthread

My kernel:

2.6.21-rc6(i386).

How to confirm:

1) Execute the systemtap script.

# stap fork_exit.stp -v
Pass 1: parsed user script and 54 library script(s) in 680usr/40sys/728real ms.
Pass 2: analyzed script: 2 probe(s), 8 function(s), 0 embed(s), 0 global(s) in 670usr/40sys/699real ms.
Pass 3: using cached /root/.systemtap/cache/02/stap_02d3975cd5dedf7b6697a2d5f92f966a_3811.c
Pass 4: using cached /root/.systemtap/cache/02/stap_02d3975cd5dedf7b6697a2d5f92f966a_3811.ko
Pass 5: starting run.

2) Execute the process model program on another terminal.

$ ./satprocess

systemtap then traces satprocess's fork/exit and prints the following:

fork: pid = 11635, tgid = 11635, ppid = 5969, time_slice = 11
exit: pid = 11635, tgid = 11635, ppid = 11634, time_slice = 6
fork: pid = 11636, tgid = 11636, ppid = 5969, time_slice = 11
exit: pid = 11636, tgid = 11636, ppid = 11634, time_slice = 6
fork: pid = 11637, tgid = 11637, ppid = 5969, time_slice = 11
exit: pid = 11637, tgid = 11637, ppid = 11634, time_slice = 6
fork: pid = 11638, tgid = 11638, ppid = 5969, time_slice = 11
exit: pid = 11638, tgid = 11638, ppid = 11634, time_slice = 6
fork: pid = 11639, tgid = 11639, ppid = 5969, time_slice = 11
exit: pid = 11639, tgid = 11639, ppid = 11634, time_slice = 6
fork: pid = 11640, tgid = 11640, ppid = 5969, time_slice = 11
exit: pid = 11640, tgid = 11640, ppid = 11634, time_slice = 6
fork: pid = 11641, tgid = 11641, ppid = 5969, time_slice = 11
exit: pid = 11641, tgid = 11641, ppid = 11634, time_slice = 6
fork: pid = 11642, tgid = 11642, ppid = 5969, time_slice = 11
exit: pid = 11642, tgid = 11642, ppid = 11634, time_slice = 6
fork: pid = 11643, tgid = 11643, ppid = 5969, time_slice = 11
exit: pid = 11643, tgid = 11643, ppid = 11634, time_slice = 6
fork: pid = 11644, tgid = 11644, ppid = 5969, time_slice = 11
exit: pid = 11644, tgid = 11644, ppid = 11634, time_slice = 6
exit: pid = 11634, tgid = 11634, ppid = 5969, time_slice = 10

It looks good.

3) Execute the thread model program on another terminal.

$ ./satthread

systemtap then traces satthread's fork/exit and prints the following:

fork: pid = 11646, tgid = 11645, ppid = 5969, time_slice = 10
exit: pid = 11646, tgid = 11645, ppid = 5969, time_slice = 5
fork: pid = 11647, tgid = 11645, ppid = 5969, time_slice = 5
fork: pid = 11648, tgid = 11645, ppid = 5969, time_slice = 2
exit: pid = 11647, tgid = 11645, ppid = 5969, time_slice = 3
fork: pid = 11649, tgid = 11645, ppid = 5969, time_slice = 1
exit: pid = 11648, tgid = 11645, ppid = 5969, time_slice = 1
fork: pid = 11650, tgid = 11645, ppid = 5969, time_slice = 25
exit: pid = 11649, tgid = 11645, ppid = 5969, time_slice = 1
fork: pid = 11651, tgid = 11645, ppid = 5969, time_slice = 12
exit: pid = 11650, tgid = 11645, ppid = 5969, time_slice = 13
fork: pid = 11652, tgid = 11645, ppid = 5969, time_slice = 6
exit: pid = 11651, tgid = 11645, ppid = 5969, time_slice = 6
for

Re: COMPILING AND CONFIGURING A NEW KERNEL.

2007-04-07 Thread johnrobertbanks

Just correcting some errors and typos. 

Wouldn't want you to say that the linux kernel mailing list gave you
incorrect info.

COMPILING AND CONFIGURING A NEW KERNEL.

Download a recent kernel from http://www.kernel.org/
I will use the kernel linux-2.6.20.tar.bz2

You will have to change details of the following to suit your purposes.

Save it in /usr/src/
# mv linux-2.6.20.tar.bz2 /usr/src/

Unpack the kernel package
# cd /usr/src/
# tar -jxf linux-2.6.20.tar.bz2

Copy the original kernel configuration file (that came with your distro)
to .config
# cp /boot/config-2.6.13-15-default /usr/src/linux-2.6.20/.config

Change to the new kernel source directory
# cd /usr/src/linux-2.6.20/

Look at the available kernel building options
# make help

Run oldconfig to update the original kernel configuration to a current
configuration
# make oldconfig

Use menuconfig (or xconfig or gconfig) to make any further changes
# make menuconfig

YOU SHOULD compile all the drivers necessary to boot your system, into
the kernel (ie, such drivers should not be built as modules).

This way you will NOT need an initrd file.

Use rpm-pkg to create a Red Hat RPM kernel package.
# make rpm-pkg

When built, the RPM package is put in
/usr/src/packages/RPMS/*your*architecture*

# cd /usr/src/packages/RPMS/x86_64

Install the package (you may have to un-install previous installs)
# rpm -i kernel-2.6.20-1.x86_64.rpm

Use deb-pkg to create a Debian .deb kernel package.
# make deb-pkg

When built, the .deb package is put in /usr/src/
# cd /usr/src/

Install the package (you may have to un-install previous installs)
# dpkg --install linux-2.6.20_2.6.20_amd64.deb

If you were unable to determine which drivers you need (to boot), then
you will need an initrd file. To build it use the command
# mkinitrd -o /boot/initrd-2.6.20

IF YOU ARE CUSTOMIZING YOUR KERNEL, YOU SHOULD PUT IN THE EFFORT TO
BUILD A KERNEL THAT DOES NOT NEED AN INITRD FILE.

It is possible that deb-pkg and rpm-pkg take care of creating the initrd
automatically.

I have always compiled in the important drivers, so I do not know.

Does any caring person here know the answer to this question?

--

Now you need to configure your boot loader. With GRUB, you need to change
the menu.lst file.

# emacs /boot/grub/menu.lst &

The grub entry that you presently boot with, will look something like:

###Don't change this comment - YaST2 identifier: Original name: linux###
title SUSE LINUX 10.0
root (hd0,2)
kernel /boot/vmlinuz-2.6.13-15-default root=/dev/hda3
resume=/dev/hda5 vga=0x317 video=vesafb:nomtrr splash=silent
initrd /boot/initrd

Do NOT delete the old boot entry, so you can boot it, if things go wrong
with the new kernel.

Make a copy of it and paste it above the original. Then adjust the copy
for the new kernel.

###Don't change this comment - YaST2 identifier: Original name: linux###
title MY NEW KERNEL
root (hd0,2)
kernel /boot/linux-2.6.20 root=/dev/hda3 resume=/dev/hda5 vga=0x317
video=vesafb:nomtrr splash=silent

Of course, you don't need a initrd entry as you have compiled in all the
vital drivers,... right?
If you could not determine the vital drivers and needed to build an
initrd file, then you need an entry, like

initrd /boot/initrd-2.6.20

--

If your new kernel is destined to have the same name as the old one, you
need to do something about it (unless you do not mind the old one being
overwritten).

Use your favorite text editor to change the top level Makefile
# emacs /usr/src/linux-2.6.20/Makefile &

change the line 
EXTRAVERSION =
to 
EXTRAVERSION = -something

This will change the name of the new kernel to linux-2.6.20-something

Your /boot/grub/menu.lst entry will now look something like:

###Don't change this comment - YaST2 identifier: Original name: linux###
title MY NEW KERNEL
root (hd0,2)
kernel /boot/linux-2.6.20-something root=/dev/hda3 resume=/dev/hda5
vga=0x317 video=vesafb:nomtrr splash=silent

and perhaps an entry

initrd /boot/initrd-2.6.20-something

--

Now reboot and choose the "MY NEW KERNEL" entry from the GRUB boot menu,
and see how you went.

-- 
  
  [EMAIL PROTECTED]



Re: [BUG] scheduler: first timeslice of the exiting thread

2007-04-07 Thread Satoru Takeuchi
At Sat, 07 Apr 2007 16:31:39 +0900,
Satoru Takeuchi wrote:
> Test programs(attached in the mail):
> 
>   - satprocess.c:  Process model. It creates a child process and wait for it
>several times. Each child process exits immediately.
>   - satthread.c:   Thread model. It creates a child thread and join it several
>times. Each child thread exits immediately.
>   - fork_exit.stp: systemtap script to overlook satprocess/satthread

Oh, I sent an old systemtap script. The correct one is here.

Satoru

/// fork_exit.stp /
/*
 * fork_exit.stp - Monitors sched_fork()/sched_exit() for satprocess/satthread
 * and prints some information
 *
 * Copyright (C) 2007 Satoru Takeuchi <[EMAIL PROTECTED]>
 *
 * This software may be used and distributed according to the terms
 * of the GNU General Public License, incorporated herein by reference.
 */

function is_my_testpro(comm)
{
	if (comm == "satthread" || comm == "satprocess")
		return 1
	else
		return 0
}

function print_log(name, pid, tgid, ppid, time_slice)
{
	printf("%s: pid = %d, tgid = %d, ppid = %d, time_slice = %u\n",
	       name, pid, tgid, ppid, time_slice);
}

probe kernel.function("sched_fork")
{
	if (is_my_testpro(kernel_string($p->comm)))
		print_log("fork",
			  $p->pid, $p->tgid, $p->parent->pid, $p->time_slice);
}

probe kernel.function("sched_exit")
{
	if (is_my_testpro(kernel_string($p->comm)))
		print_log("exit",
			  $p->pid, $p->tgid, $p->parent->pid, $p->time_slice);
}


Re: coding style for long conditions

2007-04-07 Thread Junio C Hamano
David Brownell <[EMAIL PROTECTED]> writes:

> So in practical terms "\n \t" and "\n\t" are identical;
> although the former "should not" be used, it doesn't
> actually affect what CodingStyle is primarily trying to
> control (i.e. what the code looks like).

That's not what CodingStyle is trying to control.  Not "what the
code looks like" at all.

Think about why this line is at the end of Chapter 1:

    Get a decent editor and don't leave whitespace at the end of lines.

By the way, "git show --color 0aa599c -- drivers/usb/net/usbnet.h"
would catch this kind of breakage if you have git.
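
As a rough illustration of the kind of whitespace breakage discussed here,
the following sketch flags trailing whitespace and a space followed by a tab
in a line's indentation. It is not git's implementation; the function name
check_line_whitespace() and the WS_* flags are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define WS_TRAILING		1	/* space or tab at end of line */
#define WS_SPACE_BEFORE_TAB	2	/* " \t" inside the indentation */

/*
 * Check one line (without its trailing newline) and return a bitmask of
 * the WS_* problems found, or 0 if the line is clean.
 */
int check_line_whitespace(const char *line)
{
	size_t len, i;
	int problems = 0;

	assert(line != NULL);
	len = strlen(line);

	if (len > 0 && (line[len - 1] == ' ' || line[len - 1] == '\t'))
		problems |= WS_TRAILING;

	/* Walk the leading whitespace only. */
	for (i = 0; line[i] == ' ' || line[i] == '\t'; i++) {
		if (line[i] == ' ' && line[i + 1] == '\t') {
			problems |= WS_SPACE_BEFORE_TAB;
			break;
		}
	}
	return problems;
}
```

With this sketch, "\tfoo();" is clean, while " \tfoo();" is reported as
WS_SPACE_BEFORE_TAB even though both render identically in most editors.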



[patch] high-res timers: UP resume fix

2007-04-07 Thread Ingo Molnar

* Linus Torvalds <[EMAIL PROTECTED]> wrote:

> In fact,  I have a theory.. Your backtrace is:
> 
>  [] smp_apic_timer_interrupt+0x57/0x90
>  [] retrigger_next_event+0x0/0xb0
>  [] apic_timer_interrupt+0x28/0x30
>  [] retrigger_next_event+0x0/0xb0
>  [] __kfifo_put+0x8/0x90
>  [] on_each_cpu+0x35/0x60
>  [] clock_was_set+0x18/0x20
>  [] timekeeping_resume+0x7c/0xa0
>  [] __sysdev_resume+0x11/0x80
>  [] sysdev_resume+0x47/0x80
>  [] device_power_up+0x5/0x10
> 
> and the thing is, I don't think we should have interrupt enabled at 
> this point in time! I susect that the timer resume enables interrupts 
> too early! We should be doing the whole "device_power_up()" sequence 
> with irq's off, I think..

yeah, i think you are right. timekeeping_resume() itself does not 
re-enable interrupts, it's clock_was_set() that does it implicitly:

void clock_was_set(void)
{
	/* Retrigger the CPU local events everywhere */
	on_each_cpu(retrigger_next_event, NULL, 0, 1);
}

on_each_cpu() is safe on SMP during resume 'bootup', because we only 
have a single CPU at that point, and smp_call_function() does:

	spin_lock(&call_lock);
	cpus = num_online_cpus() - 1;
	if (!cpus) {
		spin_unlock(&call_lock);

so we just return. Note that the built-in warning of smp_call_function() 
does not trigger because it's done too late:

	/* Can deadlock when called with interrupts disabled */
	WARN_ON(irqs_disabled());

we should move this up to the head of the function. But for this bug in 
question to trigger we'd have to use an UP kernel, which has this code 
for on_each_cpu():

#define on_each_cpu(func,info,retry,wait)	\
	({					\
		local_irq_disable();		\
		func(info);			\
		local_irq_enable();		\

ouch!
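
to illustrate why the unconditional enable is a problem, here is a small
user-space sketch. it is not kernel code: the sim_* helpers and the
sim_irqs_on flag are hypothetical stand-ins for the CPU interrupt state, and
the save/restore variant is shown only for contrast, it is not the fix
proposed in this mail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

static bool sim_irqs_on = true;	/* stand-in for the CPU interrupt flag */

static void sim_irq_disable(void) { sim_irqs_on = false; }
static void sim_irq_enable(void)  { sim_irqs_on = true; }

/* Remember the previous state, then disable. */
static bool sim_irq_save(void)
{
	bool was_on = sim_irqs_on;

	sim_irqs_on = false;
	return was_on;
}

static void sim_irq_restore(bool was_on) { sim_irqs_on = was_on; }

static void sim_noop(void *info) { (void)info; }

/* Same shape as the UP macro quoted above: always enables on the way out. */
void call_enable_style(void (*func)(void *), void *info)
{
	assert(func != NULL);
	sim_irq_disable();
	func(info);
	sim_irq_enable();
}

/* Contrast: puts the interrupt state back the way the caller had it. */
void call_restore_style(void (*func)(void *), void *info)
{
	bool was_on;

	assert(func != NULL);
	was_on = sim_irq_save();
	func(info);
	sim_irq_restore(was_on);
}

/* Call from an irqs-off context (like resume) and report the state after. */
bool irqs_on_after_call(bool use_restore)
{
	sim_irq_disable();
	if (use_restore)
		call_restore_style(sim_noop, NULL);
	else
		call_enable_style(sim_noop, NULL);
	return sim_irqs_on;
}
```

irqs_on_after_call(false) returns true: the caller entered with interrupts
off and comes back with them on, which is the situation in the backtrace.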

the solution is this: what we want to call here in timekeeping_resume is 
not clock_was_set() but retrigger_next_event() for the current CPU. The 
patch below should fix it. Soeren, can you confirm that you are using a 
!CONFIG_SMP kernel, and if yes, does the patch below fix the resume 
problem for you?

Ingo

>
Subject: [patch] high-res timers: UP resume fix
From: Ingo Molnar <[EMAIL PROTECTED]>

Soeren Sonnenburg reported that upon resume he is getting
this backtrace:

 [] smp_apic_timer_interrupt+0x57/0x90
 [] retrigger_next_event+0x0/0xb0
 [] apic_timer_interrupt+0x28/0x30
 [] retrigger_next_event+0x0/0xb0
 [] __kfifo_put+0x8/0x90
 [] on_each_cpu+0x35/0x60
 [] clock_was_set+0x18/0x20
 [] timekeeping_resume+0x7c/0xa0
 [] __sysdev_resume+0x11/0x80
 [] sysdev_resume+0x47/0x80
 [] device_power_up+0x5/0x10

it turns out that on UP we mistakenly re-enable interrupts,
so do the timer retrigger only on the current CPU.

Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]>
---
 include/linux/hrtimer.h |    3 +++
 kernel/hrtimer.c        |   12 ++++++++++++
 2 files changed, 15 insertions(+)

Index: linux/include/linux/hrtimer.h
===
--- linux.orig/include/linux/hrtimer.h
+++ linux/include/linux/hrtimer.h
@@ -206,6 +206,7 @@ struct hrtimer_cpu_base {
 struct clock_event_device;
 
 extern void clock_was_set(void);
+extern void hres_timers_resume(void);
 extern void hrtimer_interrupt(struct clock_event_device *dev);
 
 /*
@@ -236,6 +237,8 @@ static inline ktime_t hrtimer_cb_get_tim
  */
 static inline void clock_was_set(void) { }
 
+static inline void hres_timers_resume(void) { }
+
 /*
  * In non high resolution mode the time reference is taken from
  * the base softirq time variable.
Index: linux/kernel/hrtimer.c
===
--- linux.orig/kernel/hrtimer.c
+++ linux/kernel/hrtimer.c
@@ -459,6 +459,18 @@ void clock_was_set(void)
 }
 
 /*
+ * During resume we might have to reprogram the high resolution timer
+ * interrupt (on the local CPU):
+ */
+void hres_timers_resume(void)
+{
+   WARN_ON_ONCE(num_online_cpus() > 1);
+
+   /* Retrigger the CPU local events: */
+   retrigger_next_event(NULL);
+}
+
+/*
  * Check, whether the timer is on the callback pending list
  */
 static inline int hrtimer_cb_pending(const struct hrtimer *timer)


[PATCH 01/14] sysfs: fix i_ino handling in sysfs

2007-04-07 Thread Tejun Heo
Inode number handling was incorrect in two ways.

1. sysfs uses the inode number allocated by new_inode() and never
   hashes it.  When reporting the inode number, it uses iunique() if
   inode is inaccessible.  This is incorrect because iunique() assumes
   the inodes are hashed.  This can cause duplicate inode numbers and
   the condition is likely to happen because new_inode() and iunique()
   use separate increasing static counters to scan for empty slot.

2. sysfs_dirent->s_dentry can go away at any time and can't be referenced
   unless the caller knows the dentry has not been deleted and is not
   going to be deleted.

This patch makes sysfs report the pointer to sysfs_dirent as ino.
ino_t is always at least as large as unsigned long, and the sysfs_dirent
hierarchy is the internal representation of the sysfs tree, so this
makes sense and is simple to implement.

Signed-off-by: Tejun Heo <[EMAIL PROTECTED]>
---
 fs/sysfs/dir.c   |   11 ++++-------
 fs/sysfs/inode.c |    1 +
 2 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 85a6686..5112f88 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -504,19 +504,19 @@ static int sysfs_readdir(struct file * filp, void * 
dirent, filldir_t filldir)
struct sysfs_dirent * parent_sd = dentry->d_fsdata;
struct sysfs_dirent *cursor = filp->private_data;
struct list_head *p, *q = &cursor->s_sibling;
-   ino_t ino;
+   unsigned long ino;
int i = filp->f_pos;
 
switch (i) {
case 0:
-   ino = dentry->d_inode->i_ino;
+   ino = (unsigned long)parent_sd;
if (filldir(dirent, ".", 1, i, ino, DT_DIR) < 0)
break;
filp->f_pos++;
i++;
/* fallthrough */
case 1:
-   ino = parent_ino(dentry);
+   ino = (unsigned long)dentry->d_parent->d_fsdata;
if (filldir(dirent, "..", 2, i, ino, DT_DIR) < 0)
break;
filp->f_pos++;
@@ -538,10 +538,7 @@ static int sysfs_readdir(struct file * filp, void * 
dirent, filldir_t filldir)
 
name = sysfs_get_name(next);
len = strlen(name);
-   if (next->s_dentry)
-   ino = next->s_dentry->d_inode->i_ino;
-   else
-   ino = iunique(sysfs_sb, 2);
+   ino = (unsigned long)next;
 
if (filldir(dirent, name, len, filp->f_pos, ino,
 dt_type(next)) < 0)
diff --git a/fs/sysfs/inode.c b/fs/sysfs/inode.c
index 4de5c6b..b8b010c 100644
--- a/fs/sysfs/inode.c
+++ b/fs/sysfs/inode.c
@@ -140,6 +140,7 @@ struct inode * sysfs_new_inode(mode_t mode, struct 
sysfs_dirent * sd)
inode->i_mapping->a_ops = &sysfs_aops;
inode->i_mapping->backing_dev_info = &sysfs_backing_dev_info;
inode->i_op = &sysfs_inode_operations;
+   inode->i_ino = (unsigned long)sd;
lockdep_set_class(&inode->i_mutex, &sysfs_inode_imutex_key);
 
if (sd->s_iattr) {
-- 
1.5.0.3




[PATCH 02/14] sysfs: fix error handling in binattr write()

2007-04-07 Thread Tejun Heo
Error handling in fs/sysfs/bin.c:write() was wrong because size_t
count is used to receive return value from flush_write() which is
negative on failure.

This patch updates write() so that an int variable is used instead.
read() is updated the same way for consistency.
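
The underlying C pitfall can be shown with a small user-space sketch. It is
not the sysfs code: sim_flush_write(), SIM_EIO and the looks_successful_*()
helpers are hypothetical names, and the "count > 0" check is only an assumed
example of how a caller might test the result.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SIM_EIO 5	/* hypothetical errno value for this sketch */

/* Stand-in for flush_write(): returns bytes written or a negative error. */
static int sim_flush_write(int len, bool fail)
{
	assert(len >= 0);
	return fail ? -SIM_EIO : len;
}

/* Old pattern: the result lands in an unsigned size_t. */
bool looks_successful_size_t(int len, bool fail)
{
	size_t count = (size_t)sim_flush_write(len, fail);

	/*
	 * On failure, -SIM_EIO has wrapped to a huge positive value,
	 * so this check is still true.
	 */
	return count > 0;
}

/* Patched pattern: the result lands in a signed int. */
bool looks_successful_int(int len, bool fail)
{
	int count = sim_flush_write(len, fail);

	return count > 0;
}
```

For a failing call, looks_successful_size_t(16, true) is true while
looks_successful_int(16, true) is false, so only the signed variable lets
the caller see the error.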

Signed-off-by: Tejun Heo <[EMAIL PROTECTED]>
---
 fs/sysfs/bin.c |   21 ++++++++-------------
 1 files changed, 8 insertions(+), 13 deletions(-)

diff --git a/fs/sysfs/bin.c b/fs/sysfs/bin.c
index d3b9f5f..8273dd6 100644
--- a/fs/sysfs/bin.c
+++ b/fs/sysfs/bin.c
@@ -33,16 +33,13 @@ fill_read(struct dentry *dentry, char *buffer, loff_t off, 
size_t count)
 }
 
 static ssize_t
-read(struct file * file, char __user * userbuf, size_t count, loff_t * off)
+read(struct file *file, char __user *userbuf, size_t bytes, loff_t *off)
 {
char *buffer = file->private_data;
struct dentry *dentry = file->f_path.dentry;
int size = dentry->d_inode->i_size;
loff_t offs = *off;
-   int ret;
-
-   if (count > PAGE_SIZE)
-   count = PAGE_SIZE;
+   int count = min_t(size_t, bytes, PAGE_SIZE);
 
if (size) {
if (offs > size)
@@ -51,10 +48,9 @@ read(struct file * file, char __user * userbuf, size_t 
count, loff_t * off)
count = size - offs;
}
 
-   ret = fill_read(dentry, buffer, offs, count);
-   if (ret < 0) 
-   return ret;
-   count = ret;
+   count = fill_read(dentry, buffer, offs, count);
+   if (count < 0)
+   return count;
 
if (copy_to_user(userbuf, buffer, count))
return -EFAULT;
@@ -78,16 +74,15 @@ flush_write(struct dentry *dentry, char *buffer, loff_t 
offset, size_t count)
return attr->write(kobj, buffer, offset, count);
 }
 
-static ssize_t write(struct file * file, const char __user * userbuf,
-size_t count, loff_t * off)
+static ssize_t write(struct file *file, const char __user *userbuf,
+size_t bytes, loff_t *off)
 {
char *buffer = file->private_data;
struct dentry *dentry = file->f_path.dentry;
int size = dentry->d_inode->i_size;
loff_t offs = *off;
+   int count = min_t(size_t, bytes, PAGE_SIZE);
 
-   if (count > PAGE_SIZE)
-   count = PAGE_SIZE;
if (size) {
if (offs > size)
return 0;
-- 
1.5.0.3




[PATCH 14/14] sysfs: kill unnecessary attribute->owner

2007-04-07 Thread Tejun Heo
sysfs is now completely out of driver/module lifetime game.  After
deletion, a sysfs node doesn't access anything outside sysfs proper,
so there's no reason to hold onto the attribute owners.  Note that
often the wrong modules were accounted for as owners leading to
accessing removed modules.

This patch kills now unnecessary attribute->owner.  Note that with
this change, userland holding a sysfs node does not prevent the
backing module from being unloaded.

For more info regarding lifetime rule cleanup, please read the
following message.

  http://article.gmane.org/gmane.linux.kernel/510293

Signed-off-by: Tejun Heo <[EMAIL PROTECTED]>
---
 drivers/base/class.c|2 --
 drivers/base/core.c |4 
 drivers/base/firmware_class.c   |2 +-
 drivers/block/pktcdvd.c |3 +--
 drivers/char/ipmi/ipmi_msghandler.c |   10 --
 drivers/cpufreq/cpufreq_stats.c |3 +--
 drivers/cpufreq/cpufreq_userspace.c |2 +-
 drivers/cpufreq/freq_table.c|1 -
 drivers/firmware/dcdbas.h   |3 +--
 drivers/firmware/dell_rbu.c |6 +++---
 drivers/firmware/edd.c  |2 +-
 drivers/firmware/efivars.c  |6 +++---
 drivers/i2c/chips/eeprom.c  |1 -
 drivers/i2c/chips/max6875.c |1 -
 drivers/infiniband/core/sysfs.c |1 -
 drivers/input/mouse/psmouse.h   |1 -
 drivers/media/video/pvrusb2/pvrusb2-sysfs.c |   13 -
 drivers/misc/asus-laptop.c  |3 +--
 drivers/pci/hotplug/acpiphp_ibm.c   |1 -
 drivers/pci/pci-sysfs.c |4 
 drivers/pcmcia/socket_sysfs.c   |2 +-
 drivers/rtc/rtc-ds1553.c|1 -
 drivers/rtc/rtc-ds1742.c|1 -
 drivers/scsi/arcmsr/arcmsr_attr.c   |3 ---
 drivers/scsi/lpfc/lpfc_attr.c   |2 --
 drivers/scsi/qla2xxx/qla_attr.c |6 --
 drivers/spi/at25.c  |1 -
 drivers/video/aty/radeon_base.c |2 --
 drivers/video/backlight/backlight.c |2 +-
 drivers/video/backlight/lcd.c   |2 +-
 drivers/w1/slaves/w1_ds2433.c   |1 -
 drivers/w1/slaves/w1_therm.c|1 -
 drivers/w1/w1.c |2 --
 fs/ecryptfs/main.c  |2 --
 fs/ocfs2/cluster/masklog.c  |1 -
 fs/partitions/check.c   |1 -
 fs/sysfs/bin.c  |   19 +--
 fs/sysfs/file.c |   24 +---
 include/linux/sysdev.h  |3 +--
 include/linux/sysfs.h   |7 +++
 kernel/module.c |9 +++--
 kernel/params.c |1 -
 net/bridge/br_sysfs_br.c|3 +--
 net/bridge/br_sysfs_if.c|3 +--
 44 files changed, 35 insertions(+), 133 deletions(-)

diff --git a/drivers/base/class.c b/drivers/base/class.c
index d596812..064c1de 100644
--- a/drivers/base/class.c
+++ b/drivers/base/class.c
@@ -624,7 +624,6 @@ int class_device_add(struct class_device *class_dev)
goto out3;
class_dev->uevent_attr.attr.name = "uevent";
class_dev->uevent_attr.attr.mode = S_IWUSR;
-   class_dev->uevent_attr.attr.owner = parent_class->owner;
class_dev->uevent_attr.store = store_uevent;
error = class_device_create_file(class_dev, &class_dev->uevent_attr);
if (error)
@@ -639,7 +638,6 @@ int class_device_add(struct class_device *class_dev)
}
attr->attr.name = "dev";
attr->attr.mode = S_IRUGO;
-   attr->attr.owner = parent_class->owner;
attr->show = show_dev;
error = class_device_create_file(class_dev, attr);
if (error) {
diff --git a/drivers/base/core.c b/drivers/base/core.c
index d7fcf82..37930d0 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -563,8 +563,6 @@ int device_add(struct device *dev)
 
dev->uevent_attr.attr.name = "uevent";
dev->uevent_attr.attr.mode = S_IWUSR;
-   if (dev->driver)
-   dev->uevent_attr.attr.owner = dev->driver->owner;
dev->uevent_attr.store = store_uevent;
error = device_create_file(dev, &dev->uevent_attr);
if (error)
@@ -579,8 +577,6 @@ int device_add(struct device *dev)
}
attr->attr.name = "dev";
attr->attr.mode = S_IRUGO;
-   if (dev->driver)
-   attr->attr.owner = dev->driver->owner;
attr->show = show_dev;
error = device_create_file(dev, attr);
i

[PATCH 08/14] sysfs: make sysfs_dirent->s_element a union

2007-04-07 Thread Tejun Heo
Make sd->s_element a union of sysfs_elem_{dir|symlink|attr|bin_attr}
and rename it to s_elem.  This is to achieve...

* some level of type checking : changing symlink to point to
  sysfs_dirent instead of kobject is much safer and less painful now.
* easier / standardized dereferencing
* allow sysfs_elem_* to contain more than one entry

Where possible, pointers are obtained by dereferencing directly from sd
instead of going through other entities.  This reduces dependencies on
dentry, inode and kobject.  to_attr() and to_bin_attr() are now unused
and are removed.
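
The general idea of a per-type union can be sketched in plain C as follows.
This is a simplified user-space illustration, not the actual sysfs
definitions: all demo_* names, the ELEM_* tags and the field layout are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct demo_dirent;

enum demo_elem_type { ELEM_DIR, ELEM_SYMLINK, ELEM_ATTR };

/* One small struct per entry type; each can grow more fields later. */
struct demo_elem_dir     { const char *kobj_name; };
struct demo_elem_symlink { struct demo_dirent *target; };
struct demo_elem_attr    { const char *attr_name; };

struct demo_dirent {
	enum demo_elem_type type;
	union {
		struct demo_elem_dir     dir;
		struct demo_elem_symlink symlink;
		struct demo_elem_attr    attr;
	} elem;
};

/*
 * Typed access: the compiler knows that a symlink's target is a
 * demo_dirent, which an untyped void pointer could not express.
 * Returns NULL for entries that are not symlinks.
 */
struct demo_dirent *demo_symlink_target(struct demo_dirent *sd)
{
	assert(sd != NULL);
	return sd->type == ELEM_SYMLINK ? sd->elem.symlink.target : NULL;
}

/* Returns the attribute name, or NULL for non-attribute entries. */
const char *demo_attr_name(const struct demo_dirent *sd)
{
	assert(sd != NULL);
	return sd->type == ELEM_ATTR ? sd->elem.attr.attr_name : NULL;
}
```

Assigning a pointer of the wrong type to elem.symlink.target now draws a
compiler diagnostic, which is the type checking benefit mentioned above.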

This is in preparation of object reference simplification.

Signed-off-by: Tejun Heo <[EMAIL PROTECTED]>
---
 fs/sysfs/bin.c |   18 +++---
 fs/sysfs/dir.c |   40 ++--
 fs/sysfs/file.c|   19 ++-
 fs/sysfs/inode.c   |2 +-
 fs/sysfs/mount.c   |1 -
 fs/sysfs/symlink.c |   23 ---
 fs/sysfs/sysfs.h   |   47 +++
 7 files changed, 71 insertions(+), 79 deletions(-)

diff --git a/fs/sysfs/bin.c b/fs/sysfs/bin.c
index 8273dd6..0f0027b 100644
--- a/fs/sysfs/bin.c
+++ b/fs/sysfs/bin.c
@@ -23,7 +23,8 @@
 static int
 fill_read(struct dentry *dentry, char *buffer, loff_t off, size_t count)
 {
-   struct bin_attribute * attr = to_bin_attr(dentry);
+   struct sysfs_dirent *attr_sd = dentry->d_fsdata;
+   struct bin_attribute *attr = attr_sd->s_elem.bin_attr.bin_attr;
struct kobject * kobj = to_kobj(dentry->d_parent);
 
if (!attr->read)
@@ -65,7 +66,8 @@ read(struct file *file, char __user *userbuf, size_t bytes, loff_t *off)
 static int
 flush_write(struct dentry *dentry, char *buffer, loff_t offset, size_t count)
 {
-   struct bin_attribute *attr = to_bin_attr(dentry);
+   struct sysfs_dirent *attr_sd = dentry->d_fsdata;
+   struct bin_attribute *attr = attr_sd->s_elem.bin_attr.bin_attr;
struct kobject *kobj = to_kobj(dentry->d_parent);
 
if (!attr->write)
@@ -101,9 +103,9 @@ static ssize_t write(struct file *file, const char __user *userbuf,
 
 static int mmap(struct file *file, struct vm_area_struct *vma)
 {
-   struct dentry *dentry = file->f_path.dentry;
-   struct bin_attribute *attr = to_bin_attr(dentry);
-   struct kobject *kobj = to_kobj(dentry->d_parent);
+   struct sysfs_dirent *attr_sd = file->f_path.dentry->d_fsdata;
+   struct bin_attribute *attr = attr_sd->s_elem.bin_attr.bin_attr;
+   struct kobject *kobj = to_kobj(file->f_path.dentry->d_parent);
 
if (!attr->mmap)
return -EINVAL;
@@ -114,7 +116,8 @@ static int mmap(struct file *file, struct vm_area_struct *vma)
 static int open(struct inode * inode, struct file * file)
 {
struct kobject *kobj = sysfs_get_kobject(file->f_path.dentry->d_parent);
-   struct bin_attribute * attr = to_bin_attr(file->f_path.dentry);
+   struct sysfs_dirent *attr_sd = file->f_path.dentry->d_fsdata;
+   struct bin_attribute *attr = attr_sd->s_elem.bin_attr.bin_attr;
int error = -EINVAL;
 
if (!kobj || !attr)
@@ -150,7 +153,8 @@ static int open(struct inode * inode, struct file * file)
 static int release(struct inode * inode, struct file * file)
 {
struct kobject * kobj = to_kobj(file->f_path.dentry->d_parent);
-   struct bin_attribute * attr = to_bin_attr(file->f_path.dentry);
+   struct sysfs_dirent *attr_sd = file->f_path.dentry->d_fsdata;
+   struct bin_attribute *attr = attr_sd->s_elem.bin_attr.bin_attr;
u8 * buffer = file->private_data;
 
kobject_put(kobj);
diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 26c3088..d70ead5 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -21,11 +21,10 @@ struct kobject *sysfs_get_kobject(struct dentry *dentry)
spin_lock(&dcache_lock);
if (!d_unhashed(dentry)) {
struct sysfs_dirent * sd = dentry->d_fsdata;
-   if (sd->s_type & SYSFS_KOBJ_LINK) {
-   struct sysfs_symlink * sl = sd->s_element;
-   kobj = kobject_get(sl->target_kobj);
-   } else
-   kobj = kobject_get(sd->s_element);
+   if (sd->s_type & SYSFS_KOBJ_LINK)
+   kobj = kobject_get(sd->s_elem.symlink.target_kobj);
+   else
+   kobj = kobject_get(sd->s_elem.dir.kobj);
}
spin_unlock(&dcache_lock);
 
@@ -39,11 +38,8 @@ void release_sysfs_dirent(struct sysfs_dirent * sd)
  repeat:
parent_sd = sd->s_parent;
 
-   if (sd->s_type & SYSFS_KOBJ_LINK) {
-   struct sysfs_symlink * sl = sd->s_element;
-   kobject_put(sl->target_kobj);
-   kfree(sl);
-   }
+   if (sd->s_type & SYSFS_KOBJ_LINK)
+   kobject_put(sd->s_elem.symlink.target_kobj);
if (sd->s_type & SYSFS_COPY_NAME)
kfree(sd->s_name);
kfree(sd->s_iattr);
@@ -70,8 +66,7 @

[PATCH 05/14] sysfs: consolidate sysfs_dirent creation functions

2007-04-07 Thread Tejun Heo
Currently there are four functions to create sysfs_dirent -
__sysfs_new_dirent(), sysfs_new_dirent(), __sysfs_make_dirent() and
sysfs_make_dirent().  Other than sysfs_make_dirent(), none of them has
more than one user once calls made to implement the other creation
functions are excluded.

This patch consolidates sysfs_dirent creation functions into the
following two.

* sysfs_new_dirent() : allocate and initialize
* sysfs_attach_dirent() : attach to sysfs_dirent hierarchy and/or
  associate with dentry

This simplifies the interface and gives callers more flexibility.  This is
in preparation of object reference simplification.
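
The two-step split described above can be sketched in userspace roughly
as follows.  The demo_* names and the singly linked child list are
assumptions made for illustration; the kernel version uses list_head and
a slab cache.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for sysfs_dirent. */
struct demo_dirent {
	struct demo_dirent *next_sibling;	/* stand-in for s_sibling */
	struct demo_dirent *children;		/* stand-in for s_children */
	void *element;
	unsigned int mode;
	int type;
};

/* Step 1: allocate and initialize.  Returns NULL if allocation fails. */
struct demo_dirent *demo_new_dirent(void *element, unsigned int mode, int type)
{
	struct demo_dirent *sd = calloc(1, sizeof(*sd));

	if (!sd)
		return NULL;
	sd->element = element;
	sd->mode = mode;
	sd->type = type;
	return sd;
}

/* Step 2: link into the parent's child list.  A NULL parent leaves sd detached. */
void demo_attach_dirent(struct demo_dirent *sd, struct demo_dirent *parent)
{
	assert(sd != NULL);
	if (parent) {
		sd->next_sibling = parent->children;
		parent->children = sd;
	}
}
```

Keeping allocation separate from attachment lets a caller handle the
allocation failure first and decide afterwards whether and where to
attach.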

Signed-off-by: Tejun Heo <[EMAIL PROTECTED]>
---
 fs/sysfs/dir.c |   82 
 fs/sysfs/file.c|   21 ++---
 fs/sysfs/symlink.c |7 ++--
 fs/sysfs/sysfs.h   |7 +++-
 4 files changed, 50 insertions(+), 67 deletions(-)

diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 99a0ba1..653c23c 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -60,10 +60,7 @@ static struct dentry_operations sysfs_dentry_ops = {
.d_iput = sysfs_d_iput,
 };
 
-/*
- * Allocates a new sysfs_dirent and links it to the parent sysfs_dirent
- */
-static struct sysfs_dirent * __sysfs_new_dirent(void * element)
+struct sysfs_dirent *sysfs_new_dirent(void *element, umode_t mode, int type)
 {
struct sysfs_dirent * sd;
 
@@ -75,25 +72,25 @@ static struct sysfs_dirent * __sysfs_new_dirent(void * element)
atomic_set(&sd->s_event, 1);
INIT_LIST_HEAD(&sd->s_children);
INIT_LIST_HEAD(&sd->s_sibling);
+
sd->s_element = element;
+   sd->s_mode = mode;
+   sd->s_type = type;
 
return sd;
 }
 
-static void __sysfs_list_dirent(struct sysfs_dirent *parent_sd,
- struct sysfs_dirent *sd)
+void sysfs_attach_dirent(struct sysfs_dirent *sd,
+struct sysfs_dirent *parent_sd, struct dentry *dentry)
 {
-   if (sd)
-   list_add(&sd->s_sibling, &parent_sd->s_children);
-}
+   if (dentry) {
+   sd->s_dentry = dentry;
+   dentry->d_fsdata = sysfs_get(sd);
+   dentry->d_op = &sysfs_dentry_ops;
+   }
 
-static struct sysfs_dirent * sysfs_new_dirent(struct sysfs_dirent *parent_sd,
-   void * element)
-{
-   struct sysfs_dirent *sd;
-   sd = __sysfs_new_dirent(element);
-   __sysfs_list_dirent(parent_sd, sd);
-   return sd;
+   if (parent_sd)
+   list_add(&sd->s_sibling, &parent_sd->s_children);
 }
 
 /*
@@ -121,39 +118,6 @@ int sysfs_dirent_exist(struct sysfs_dirent *parent_sd,
return 0;
 }
 
-
-static struct sysfs_dirent *
-__sysfs_make_dirent(struct dentry *dentry, void *element, mode_t mode, int type)
-{
-   struct sysfs_dirent * sd;
-
-   sd = __sysfs_new_dirent(element);
-   if (!sd)
-   goto out;
-
-   sd->s_mode = mode;
-   sd->s_type = type;
-   sd->s_dentry = dentry;
-   if (dentry) {
-   dentry->d_fsdata = sysfs_get(sd);
-   dentry->d_op = &sysfs_dentry_ops;
-   }
-
-out:
-   return sd;
-}
-
-int sysfs_make_dirent(struct sysfs_dirent * parent_sd, struct dentry * dentry,
-   void * element, umode_t mode, int type)
-{
-   struct sysfs_dirent *sd;
-
-   sd = __sysfs_make_dirent(dentry, element, mode, type);
-   __sysfs_list_dirent(parent_sd, sd);
-
-   return sd ? 0 : -ENOMEM;
-}
-
 static int init_dir(struct inode * inode)
 {
inode->i_op = &sysfs_dir_inode_operations;
@@ -197,10 +161,11 @@ static int create_dir(struct kobject *kobj, struct dentry *parent,
if (sysfs_dirent_exist(parent->d_fsdata, name))
goto out_dput;
 
-   error = sysfs_make_dirent(parent->d_fsdata, dentry, kobj, mode,
- SYSFS_DIR);
-   if (error)
+   error = -ENOMEM;
+   sd = sysfs_new_dirent(kobj, mode, SYSFS_DIR);
+   if (!sd)
goto out_drop;
+   sysfs_attach_dirent(sd, parent->d_fsdata, dentry);
 
error = sysfs_create(dentry, mode, init_dir);
if (error)
@@ -215,7 +180,6 @@ static int create_dir(struct kobject *kobj, struct dentry *parent,
goto out_dput;
 
  out_sput:
-   sd = dentry->d_fsdata;
list_del_init(&sd->s_sibling);
sysfs_put(sd);
  out_drop:
@@ -512,13 +476,16 @@ static int sysfs_dir_open(struct inode *inode, struct file *file)
 {
struct dentry * dentry = file->f_path.dentry;
struct sysfs_dirent * parent_sd = dentry->d_fsdata;
+   struct sysfs_dirent * sd;
 
mutex_lock(&dentry->d_inode->i_mutex);
-   file->private_data = sysfs_new_dirent(parent_sd, NULL);
+   sd = sysfs_new_dirent(NULL, 0, 0);
+   if (sd)
+   sysfs_attach_dirent(sd, parent_sd, NULL);
mutex_unlock(&dentry->d_inode->i_mutex);
 
-   return file->private_da

[PATCH 11/14] sysfs: implement bin_buffer

2007-04-07 Thread Tejun Heo
Implement bin_buffer, which contains a mutex and a pointer to a PAGE_SIZE
buffer, to properly synchronize accesses to the per-openfile buffer and
prepare for immediate-kobj-disconnect.
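
A minimal userspace sketch of the same idea, assuming a pthread mutex in
place of the kernel mutex and memcpy() in place of attr->read() and
copy_to_user().  All demo_* names and DEMO_BUF_SIZE are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_BUF_SIZE 4096	/* stand-in for PAGE_SIZE */

/* Per-open-file state: a bounce buffer plus the mutex that guards it. */
struct demo_bin_buffer {
	pthread_mutex_t mutex;
	void *buffer;
};

struct demo_bin_buffer *demo_bb_open(void)
{
	struct demo_bin_buffer *bb = calloc(1, sizeof(*bb));

	if (!bb)
		return NULL;
	bb->buffer = malloc(DEMO_BUF_SIZE);
	if (!bb->buffer || pthread_mutex_init(&bb->mutex, NULL) != 0) {
		free(bb->buffer);	/* free(NULL) is a no-op */
		free(bb);
		return NULL;
	}
	return bb;
}

/* Copy up to count bytes from src to dst via the buffer, under the mutex. */
long demo_bb_read(struct demo_bin_buffer *bb, const char *src, size_t src_len,
		  char *dst, size_t count)
{
	assert(bb != NULL);
	if (count > DEMO_BUF_SIZE)
		count = DEMO_BUF_SIZE;
	if (count > src_len)
		count = src_len;

	pthread_mutex_lock(&bb->mutex);
	memcpy(bb->buffer, src, count);	/* stands in for attr->read() */
	memcpy(dst, bb->buffer, count);	/* stands in for copy_to_user() */
	pthread_mutex_unlock(&bb->mutex);

	return (long)count;
}

void demo_bb_release(struct demo_bin_buffer *bb)
{
	if (!bb)
		return;
	pthread_mutex_destroy(&bb->mutex);
	free(bb->buffer);
	free(bb);
}
```

Without the mutex, two threads sharing one open file could interleave
the fill and copy steps and hand back each other's data.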

Signed-off-by: Tejun Heo <[EMAIL PROTECTED]>
---
 fs/sysfs/bin.c |   64 ++-
 1 files changed, 49 insertions(+), 15 deletions(-)

diff --git a/fs/sysfs/bin.c b/fs/sysfs/bin.c
index 0f0027b..1dd1bf1 100644
--- a/fs/sysfs/bin.c
+++ b/fs/sysfs/bin.c
@@ -20,6 +20,11 @@
 
 #include "sysfs.h"
 
+struct bin_buffer {
+   struct mutex    mutex;
+   void            *buffer;
+};
+
 static int
 fill_read(struct dentry *dentry, char *buffer, loff_t off, size_t count)
 {
@@ -36,7 +41,7 @@ fill_read(struct dentry *dentry, char *buffer, loff_t off, size_t count)
 static ssize_t
 read(struct file *file, char __user *userbuf, size_t bytes, loff_t *off)
 {
-   char *buffer = file->private_data;
+   struct bin_buffer *bb = file->private_data;
struct dentry *dentry = file->f_path.dentry;
int size = dentry->d_inode->i_size;
loff_t offs = *off;
@@ -49,17 +54,23 @@ read(struct file *file, char __user *userbuf, size_t bytes, loff_t *off)
count = size - offs;
}
 
-   count = fill_read(dentry, buffer, offs, count);
+   mutex_lock(&bb->mutex);
+
+   count = fill_read(dentry, bb->buffer, offs, count);
if (count < 0)
-   return count;
+   goto out_unlock;
 
-   if (copy_to_user(userbuf, buffer, count))
-   return -EFAULT;
+   if (copy_to_user(userbuf, bb->buffer, count)) {
+   count = -EFAULT;
+   goto out_unlock;
+   }
 
pr_debug("offs = %lld, *off = %lld, count = %zd\n", offs, *off, count);
 
*off = offs + count;
 
+ out_unlock:
+   mutex_unlock(&bb->mutex);
return count;
 }
 
@@ -79,7 +90,7 @@ flush_write(struct dentry *dentry, char *buffer, loff_t offset, size_t count)
 static ssize_t write(struct file *file, const char __user *userbuf,
 size_t bytes, loff_t *off)
 {
-   char *buffer = file->private_data;
+   struct bin_buffer *bb = file->private_data;
struct dentry *dentry = file->f_path.dentry;
int size = dentry->d_inode->i_size;
loff_t offs = *off;
@@ -92,25 +103,38 @@ static ssize_t write(struct file *file, const char __user *userbuf,
count = size - offs;
}
 
-   if (copy_from_user(buffer, userbuf, count))
-   return -EFAULT;
+   mutex_lock(&bb->mutex);
+
+   if (copy_from_user(bb->buffer, userbuf, count)) {
+   count = -EFAULT;
+   goto out_unlock;
+   }
 
-   count = flush_write(dentry, buffer, offs, count);
+   count = flush_write(dentry, bb->buffer, offs, count);
if (count > 0)
*off = offs + count;
+
+ out_unlock:
+   mutex_unlock(&bb->mutex);
return count;
 }
 
 static int mmap(struct file *file, struct vm_area_struct *vma)
 {
+   struct bin_buffer *bb = file->private_data;
struct sysfs_dirent *attr_sd = file->f_path.dentry->d_fsdata;
struct bin_attribute *attr = attr_sd->s_elem.bin_attr.bin_attr;
struct kobject *kobj = to_kobj(file->f_path.dentry->d_parent);
+   int rc;
 
if (!attr->mmap)
return -EINVAL;
 
-   return attr->mmap(kobj, attr, vma);
+   mutex_lock(&bb->mutex);
+   rc = attr->mmap(kobj, attr, vma);
+   mutex_unlock(&bb->mutex);
+
+   return rc;
 }
 
 static int open(struct inode * inode, struct file * file)
@@ -118,6 +142,7 @@ static int open(struct inode * inode, struct file * file)
struct kobject *kobj = sysfs_get_kobject(file->f_path.dentry->d_parent);
struct sysfs_dirent *attr_sd = file->f_path.dentry->d_fsdata;
struct bin_attribute *attr = attr_sd->s_elem.bin_attr.bin_attr;
+   struct bin_buffer *bb = NULL;
int error = -EINVAL;
 
if (!kobj || !attr)
@@ -135,14 +160,22 @@ static int open(struct inode * inode, struct file * file)
goto Error;
 
error = -ENOMEM;
-   file->private_data = kmalloc(PAGE_SIZE, GFP_KERNEL);
-   if (!file->private_data)
+   bb = kzalloc(sizeof(*bb), GFP_KERNEL);
+   if (!bb)
goto Error;
 
+   bb->buffer = kmalloc(PAGE_SIZE, GFP_KERNEL);
+   if (!bb->buffer)
+   goto Error;
+
+   mutex_init(&bb->mutex);
+   file->private_data = bb;
+
error = 0;
-goto Done;
+   goto Done;
 
  Error:
+   kfree(bb);
module_put(attr->attr.owner);
  Done:
if (error)
@@ -155,11 +188,12 @@ static int release(struct inode * inode, struct file * file)
struct kobject * kobj = to_kobj(file->f_path.dentry->d_parent);
struct sysfs_dirent *attr_sd = file->f_path.dentry->d_fsdata;
struct bin_attribute *attr = attr_sd->s_elem.bin_attr.bin_attr;
-   u8 * buffer 

[PATCH 06/14] sysfs: add sysfs_dirent->s_parent

2007-04-07 Thread Tejun Heo
Add sysfs_dirent->s_parent.  With this patch, each sd points to and
holds a reference to its parent.  This allows walking the sysfs tree
without referencing sd->s_dentry, which can go away at any time if the
user doesn't control when it's deleted.

sd->s_parent is initialized and parent is referenced in
sysfs_attach_dirent().  Reference to parent is released when the sd is
released, so as long as reference to a sd is held, s_parent can be
followed.
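
The release rule above (dropping the last reference to an sd also drops
that sd's reference on its parent, possibly cascading upward) can be
sketched in userspace as below.  This is an illustration only: demo_*
names are hypothetical, and a plain int replaces the atomic s_count, so
the sketch is not thread-safe.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for sysfs_dirent with a parent reference. */
struct demo_dirent {
	int count;			/* stand-in for atomic s_count */
	struct demo_dirent *parent;	/* stand-in for s_parent */
};

int demo_freed;	/* number of entries released so far, for illustration */

struct demo_dirent *demo_get(struct demo_dirent *sd)
{
	if (sd)
		sd->count++;
	return sd;
}

/* Create an entry holding one reference on its parent (which may be NULL). */
struct demo_dirent *demo_new(struct demo_dirent *parent)
{
	struct demo_dirent *sd = calloc(1, sizeof(*sd));

	if (!sd)
		return NULL;
	sd->count = 1;
	sd->parent = demo_get(parent);
	return sd;
}

/* Drop one reference; walk up iteratively while counts reach zero. */
void demo_put(struct demo_dirent *sd)
{
	while (sd && --sd->count == 0) {
		struct demo_dirent *parent = sd->parent;

		free(sd);
		demo_freed++;
		sd = parent;
	}
}
```

The loop plays the role of the "goto repeat" in the patch: it avoids
recursion, so a deep hierarchy cannot grow the stack during release.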

The dentry walk in sysfs_readdir() is converted to an s_parent walk.

This will be used to reimplement symlink such that it uses only
sysfs_dirent tree.

Signed-off-by: Tejun Heo <[EMAIL PROTECTED]>
---
 fs/sysfs/dir.c   |   27 ---
 fs/sysfs/mount.c |1 +
 fs/sysfs/sysfs.h |1 +
 3 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 653c23c..ef45c3e 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -34,6 +34,11 @@ struct kobject *sysfs_get_kobject(struct dentry *dentry)
 
 void release_sysfs_dirent(struct sysfs_dirent * sd)
 {
+   struct sysfs_dirent *parent_sd;
+
+ repeat:
+   parent_sd = sd->s_parent;
+
if (sd->s_type & SYSFS_KOBJ_LINK) {
struct sysfs_symlink * sl = sd->s_element;
kfree(sl->link_name);
@@ -42,6 +47,10 @@ void release_sysfs_dirent(struct sysfs_dirent * sd)
}
kfree(sd->s_iattr);
kmem_cache_free(sysfs_dir_cachep, sd);
+
+   sd = parent_sd;
+   if (sd && atomic_dec_and_test(&sd->s_count))
+   goto repeat;
 }
 
 static void sysfs_d_iput(struct dentry * dentry, struct inode * inode)
@@ -89,8 +98,10 @@ void sysfs_attach_dirent(struct sysfs_dirent *sd,
dentry->d_op = &sysfs_dentry_ops;
}
 
-   if (parent_sd)
+   if (parent_sd) {
+   sd->s_parent = sysfs_get(parent_sd);
list_add(&sd->s_sibling, &parent_sd->s_children);
+   }
 }
 
 /*
@@ -526,7 +537,7 @@ static int sysfs_readdir(struct file * filp, void * dirent, filldir_t filldir)
i++;
/* fallthrough */
case 1:
-   ino = (unsigned long)dentry->d_parent->d_fsdata;
+   ino = (unsigned long)parent_sd->s_parent;
if (filldir(dirent, "..", 2, i, ino, DT_DIR) < 0)
break;
filp->f_pos++;
@@ -643,13 +654,13 @@ int sysfs_make_shadowed_dir(struct kobject *kobj,
 
 struct dentry *sysfs_create_shadow_dir(struct kobject *kobj)
 {
+   struct dentry *dir = kobj->dentry;
+   struct inode *inode = dir->d_inode;
+   struct dentry *parent = dir->d_parent;
+   struct sysfs_dirent *parent_sd = parent->d_fsdata;
+   struct dentry *shadow;
struct sysfs_dirent *sd;
-   struct dentry *parent, *dir, *shadow;
-   struct inode *inode;
 
-   dir = kobj->dentry;
-   inode = dir->d_inode;
-   parent = dir->d_parent;
shadow = ERR_PTR(-EINVAL);
if (!sysfs_is_shadowed_inode(inode))
goto out;
@@ -661,6 +672,8 @@ struct dentry *sysfs_create_shadow_dir(struct kobject *kobj)
sd = sysfs_new_dirent(kobj, inode->i_mode, SYSFS_DIR);
if (!sd)
goto nomem;
+   /* point to parent_sd but don't attach to it */
+   sd->s_parent = sysfs_get(parent_sd);
sysfs_attach_dirent(sd, NULL, shadow);
 
d_instantiate(shadow, igrab(inode));
diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index 23a48a3..141f7b1 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -28,6 +28,7 @@ static const struct super_operations sysfs_ops = {
 };
 
 static struct sysfs_dirent sysfs_root = {
+   .s_count= ATOMIC_INIT(1),
.s_sibling  = LIST_HEAD_INIT(sysfs_root.s_sibling),
.s_children = LIST_HEAD_INIT(sysfs_root.s_children),
.s_element  = NULL,
diff --git a/fs/sysfs/sysfs.h b/fs/sysfs/sysfs.h
index 104649c..b4876de 100644
--- a/fs/sysfs/sysfs.h
+++ b/fs/sysfs/sysfs.h
@@ -1,5 +1,6 @@
 struct sysfs_dirent {
atomic_t s_count;
+   struct sysfs_dirent * s_parent;
struct list_head s_sibling;
struct list_head s_children;
void * s_element;
-- 
1.5.0.3




Re: [patch] high-res timers: UP resume fix

2007-04-07 Thread Ingo Molnar

* Ingo Molnar <[EMAIL PROTECTED]> wrote:

> [...] Soeren, can you confirm that you are using a !CONFIG_SMP kernel, 
> and if yes, does the patch below fix the resume problem for you?

hm, you seem to have a CONFIG_SMP=y kernel. I don't immediately see where 
we re-enable interrupts in the SMP case, but could you try my patch 
nevertheless?

Ingo


[PATCH 12/14] sysfs: implement immediate kobject disconnect

2007-04-07 Thread Tejun Heo
Opening a sysfs node references its associated kobject, so userland
can arbitrarily prolong the lifetime of a kobject, which complicates
lifetime rules in drivers.  This patch makes the association between
kobject and sysfs immediately breakable.

Each sysfs_dirent representing a kobject has a rwsem.  Any file
operation which has to access the associated kobject should read lock
the rwsem and check whether the pointer is still valid.  The read lock
should be held until access to the kobj is no longer necessary.

On sysfs_dirent deletion, the rwsem is write locked and the pointer is
cleared.  This ensures that no user of the kobj remains through the
sysfs_dirent.  This way, sysfs_dirent doesn't have to hold a reference
to its associated kobj and can disconnect from it immediately on
deletion.

sysfs_get_dir_kobj() and sysfs_put_dir_kobj() read lock and unlock
kobj access, respectively.  As write locking is used only once during
deletion, blocking on down_read() indicates that the kobj will have
been disassociated when down_read() succeeds, so down_read_trylock()
is used in sysfs_get_dir_kobj() making the function non-blocking.

Unlike other operations, an mmapped area lingers on after mmap() has
finished, and the kobj needs to stay referenced until all the mapped
pages are gone.  This is accomplished by holding one reference to kobj
if any mmap has been made during the lifetime of an openfile.  The
reference is dropped when the openfile is released.

Note that read locking kobject not only protects the kobject itself
but also ensures that the backing module doesn't go away while the
lock is held.  IOW, any access to code or data out of sysfs core
shouldn't be made without grabbing kobject.  So, access to attr should
be done while holding its parent's kobject.

This change makes sysfs lifetime rules independent from both kobject's
and module's.  It not only fixes several race conditions caused by
sysfs not holding onto the proper module when referencing kobject, but
also helps fix and simplify lifetime management in the driver model
and drivers by taking sysfs out of the equation.

Please read the following message for more info.

  http://article.gmane.org/gmane.linux.kernel/510293

Signed-off-by: Tejun Heo <[EMAIL PROTECTED]>
---
 fs/sysfs/bin.c   |   97 ++-
 fs/sysfs/dir.c   |   39 +++-
 fs/sysfs/file.c  |  133 --
 fs/sysfs/sysfs.h |   10 +---
 4 files changed, 172 insertions(+), 107 deletions(-)

diff --git a/fs/sysfs/bin.c b/fs/sysfs/bin.c
index 1dd1bf1..a2180c5 100644
--- a/fs/sysfs/bin.c
+++ b/fs/sysfs/bin.c
@@ -23,6 +23,7 @@
 struct bin_buffer {
struct mutex    mutex;
void            *buffer;
+   int mmapped;
 };
 
 static int
@@ -30,12 +31,20 @@ fill_read(struct dentry *dentry, char *buffer, loff_t off, size_t count)
 {
struct sysfs_dirent *attr_sd = dentry->d_fsdata;
struct bin_attribute *attr = attr_sd->s_elem.bin_attr.bin_attr;
-   struct kobject * kobj = to_kobj(dentry->d_parent);
+   struct kobject *kobj;
+   int rc;
+
+   kobj = sysfs_get_dir_kobj(attr_sd->s_parent);
+   if (!kobj)
+   return -ENODEV;
 
-   if (!attr->read)
-   return -EIO;
+   rc = -EIO;
+   if (attr->read)
+   rc = attr->read(kobj, buffer, off, count);
 
-   return attr->read(kobj, buffer, off, count);
+   sysfs_put_dir_kobj(attr_sd->s_parent);
+
+   return rc;
 }
 
 static ssize_t
@@ -79,12 +88,20 @@ flush_write(struct dentry *dentry, char *buffer, loff_t offset, size_t count)
 {
struct sysfs_dirent *attr_sd = dentry->d_fsdata;
struct bin_attribute *attr = attr_sd->s_elem.bin_attr.bin_attr;
-   struct kobject *kobj = to_kobj(dentry->d_parent);
+   struct kobject *kobj;
+   int rc;
+
+   kobj = sysfs_get_dir_kobj(attr_sd->s_parent);
+   if (!kobj)
+   return -ENODEV;
 
-   if (!attr->write)
-   return -EIO;
+   rc = -EIO;
+   if (attr->write)
+   rc = attr->write(kobj, buffer, offset, count);
 
-   return attr->write(kobj, buffer, offset, count);
+   sysfs_put_dir_kobj(attr_sd->s_parent);
+
+   return rc;
 }
 
 static ssize_t write(struct file *file, const char __user *userbuf,
@@ -124,14 +141,24 @@ static int mmap(struct file *file, struct vm_area_struct *vma)
struct bin_buffer *bb = file->private_data;
struct sysfs_dirent *attr_sd = file->f_path.dentry->d_fsdata;
struct bin_attribute *attr = attr_sd->s_elem.bin_attr.bin_attr;
-   struct kobject *kobj = to_kobj(file->f_path.dentry->d_parent);
+   struct kobject *kobj;
int rc;
 
-   if (!attr->mmap)
-   return -EINVAL;
-
mutex_lock(&bb->mutex);
-   rc = attr->mmap(kobj, attr, vma);
+
+   kobj = sysfs_get_dir_kobj(attr_sd->s_parent);
+   if (!kobj)
+   return -ENODEV;
+
+   rc = -EI

[PATCH 09/14] sysfs: implement kobj_sysfs_assoc_lock

2007-04-07 Thread Tejun Heo
kobj->dentry can go away anytime unless the user controls when the
associated sysfs node is deleted.  This patch implements
kobj_sysfs_assoc_lock which protects kobj->dentry.  This will be used
to maintain kobj based API when converting sysfs to use sysfs_dirent
tree instead of dentry/kobject.

Note that this lock belongs to kobject/driver-model not sysfs.  Once
sysfs is converted to not use kobject in its interface, this can be
removed from sysfs.
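
The detach-then-tear-down ordering can be sketched in userspace as
follows, assuming a pthread mutex in place of the spinlock.  The demo_*
names are hypothetical, and unlike the patch the sketch also reads the
pointer while holding the lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for dentry and kobject. */
struct demo_dentry { int removed; };
struct demo_kobject { struct demo_dentry *dentry; };

/* Plays the role of kobj_sysfs_assoc_lock: guards kobj->dentry only. */
pthread_mutex_t demo_assoc_lock = PTHREAD_MUTEX_INITIALIZER;

/* Report whether the kobject is still associated with a directory. */
int demo_has_dir(struct demo_kobject *kobj)
{
	int ret;

	assert(kobj != NULL);
	pthread_mutex_lock(&demo_assoc_lock);
	ret = kobj->dentry != NULL;
	pthread_mutex_unlock(&demo_assoc_lock);
	return ret;
}

/* Break the association under the lock, then tear down outside it. */
void demo_remove_dir(struct demo_kobject *kobj)
{
	struct demo_dentry *d;

	assert(kobj != NULL);
	pthread_mutex_lock(&demo_assoc_lock);
	d = kobj->dentry;
	kobj->dentry = NULL;
	pthread_mutex_unlock(&demo_assoc_lock);

	if (d)
		d->removed = 1;	/* stands in for __sysfs_remove_dir(d) */
}
```

Clearing the pointer first means the potentially slow removal runs
without the lock held, and later lookups see NULL rather than a
half-removed directory.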

This is in preparation of object reference simplification.

Signed-off-by: Tejun Heo <[EMAIL PROTECTED]>
---
 fs/sysfs/dir.c   |8 +++-
 fs/sysfs/sysfs.h |1 +
 2 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index d70ead5..8372c0c 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -13,6 +13,7 @@
 #include "sysfs.h"
 
 DECLARE_RWSEM(sysfs_rename_sem);
+spinlock_t kobj_sysfs_assoc_lock = SPIN_LOCK_UNLOCKED;
 
 struct kobject *sysfs_get_kobject(struct dentry *dentry)
 {
@@ -388,8 +389,13 @@ static void __sysfs_remove_dir(struct dentry *dentry)
 
 void sysfs_remove_dir(struct kobject * kobj)
 {
-   __sysfs_remove_dir(kobj->dentry);
+   struct dentry *d = kobj->dentry;
+
+   spin_lock(&kobj_sysfs_assoc_lock);
kobj->dentry = NULL;
+   spin_unlock(&kobj_sysfs_assoc_lock);
+
+   __sysfs_remove_dir(d);
 }
 
 int sysfs_rename_dir(struct kobject * kobj, struct dentry *new_parent,
diff --git a/fs/sysfs/sysfs.h b/fs/sysfs/sysfs.h
index c1965b9..9dcd0b0 100644
--- a/fs/sysfs/sysfs.h
+++ b/fs/sysfs/sysfs.h
@@ -61,6 +61,7 @@ extern void sysfs_remove_subdir(struct dentry *);
 extern void sysfs_drop_dentry(struct sysfs_dirent *sd, struct dentry *parent);
 extern int sysfs_setattr(struct dentry *dentry, struct iattr *iattr);
 
+extern spinlock_t kobj_sysfs_assoc_lock;
 extern struct rw_semaphore sysfs_rename_sem;
 extern struct super_block * sysfs_sb;
 extern const struct file_operations sysfs_dir_operations;
-- 
1.5.0.3




[PATCH 07/14] sysfs: add sysfs_dirent->s_name

2007-04-07 Thread Tejun Heo
Add s_name to sysfs_dirent.  This further reduces dependency on
the associated dentry.  The name is copied for directories and symlinks
but not for attributes.

Where possible, name dereferences are converted to use sd->s_name.
sysfs_symlink->link_name and sysfs_get_name() are unused now and
removed.

This change allows symlink to be implemented using sysfs_dirent tree
proper, which is the last remaining dentry-dependent sysfs walk.
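
The conditional name copy can be sketched in userspace as below.  This
is a simplified illustration: DEMO_COPY_NAME and the demo_* functions
are hypothetical, and malloc()/memcpy() stand in for kstrdup().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_COPY_NAME 0x1	/* hypothetical flag mirroring SYSFS_COPY_NAME */

struct demo_dirent {
	const char *name;
	int type;
};

/* Duplicate the name only when the type asks for it; otherwise borrow it. */
struct demo_dirent *demo_new_dirent(const char *name, int type)
{
	char *dup_name = NULL;
	struct demo_dirent *sd;

	assert(name != NULL);
	if (type & DEMO_COPY_NAME) {
		size_t len = strlen(name) + 1;

		dup_name = malloc(len);
		if (!dup_name)
			return NULL;
		memcpy(dup_name, name, len);
		name = dup_name;
	}

	sd = calloc(1, sizeof(*sd));
	if (!sd) {
		free(dup_name);	/* no-op when the name was borrowed */
		return NULL;
	}
	sd->name = name;
	sd->type = type;
	return sd;
}

/* Free the name only if this entry owns a private copy of it. */
void demo_release_dirent(struct demo_dirent *sd)
{
	if (!sd)
		return;
	if (sd->type & DEMO_COPY_NAME)
		free((char *)sd->name);
	free(sd);
}
```

Borrowed names stay valid only as long as their owner does, which is
presumably why the patch copies names for directories and symlinks but
leaves attribute names pointing at the attribute itself.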

Signed-off-by: Tejun Heo <[EMAIL PROTECTED]>
---
 fs/sysfs/dir.c|   33 +
 fs/sysfs/file.c   |2 +-
 fs/sysfs/inode.c  |   33 +
 fs/sysfs/symlink.c|8 +---
 fs/sysfs/sysfs.h  |7 +++
 include/linux/sysfs.h |1 +
 6 files changed, 28 insertions(+), 56 deletions(-)

diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index ef45c3e..26c3088 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -41,10 +41,11 @@ void release_sysfs_dirent(struct sysfs_dirent * sd)
 
if (sd->s_type & SYSFS_KOBJ_LINK) {
struct sysfs_symlink * sl = sd->s_element;
-   kfree(sl->link_name);
kobject_put(sl->target_kobj);
kfree(sl);
}
+   if (sd->s_type & SYSFS_COPY_NAME)
+   kfree(sd->s_name);
kfree(sd->s_iattr);
kmem_cache_free(sysfs_dir_cachep, sd);
 
@@ -69,19 +70,30 @@ static struct dentry_operations sysfs_dentry_ops = {
.d_iput = sysfs_d_iput,
 };
 
-struct sysfs_dirent *sysfs_new_dirent(void *element, umode_t mode, int type)
+struct sysfs_dirent *sysfs_new_dirent(const char *name, void *element,
+ umode_t mode, int type)
 {
+   char *dup_name = NULL;
struct sysfs_dirent * sd;
 
+   if (type & SYSFS_COPY_NAME) {
+   name = dup_name = kstrdup(name, GFP_KERNEL);
+   if (!name)
+   return NULL;
+   }
+
sd = kmem_cache_zalloc(sysfs_dir_cachep, GFP_KERNEL);
-   if (!sd)
+   if (!sd) {
+   kfree(dup_name);
return NULL;
+   }
 
atomic_set(&sd->s_count, 1);
atomic_set(&sd->s_event, 1);
INIT_LIST_HEAD(&sd->s_children);
INIT_LIST_HEAD(&sd->s_sibling);
 
+   sd->s_name = name;
sd->s_element = element;
sd->s_mode = mode;
sd->s_type = type;
@@ -118,8 +130,7 @@ int sysfs_dirent_exist(struct sysfs_dirent *parent_sd,
 
list_for_each_entry(sd, &parent_sd->s_children, s_sibling) {
if (sd->s_element) {
-   const unsigned char *existing = sysfs_get_name(sd);
-   if (strcmp(existing, new))
+   if (strcmp(sd->s_name, new))
continue;
else
return -EEXIST;
@@ -173,7 +184,7 @@ static int create_dir(struct kobject *kobj, struct dentry *parent,
goto out_dput;
 
error = -ENOMEM;
-   sd = sysfs_new_dirent(kobj, mode, SYSFS_DIR);
+   sd = sysfs_new_dirent(name, kobj, mode, SYSFS_DIR);
if (!sd)
goto out_drop;
sysfs_attach_dirent(sd, parent->d_fsdata, dentry);
@@ -298,9 +309,7 @@ static struct dentry * sysfs_lookup(struct inode *dir, struct dentry *dentry,
 
list_for_each_entry(sd, &parent_sd->s_children, s_sibling) {
if (sd->s_type & SYSFS_NOT_PINNED) {
-   const unsigned char * name = sysfs_get_name(sd);
-
-   if (strcmp(name, dentry->d_name.name))
+   if (strcmp(sd->s_name, dentry->d_name.name))
continue;
 
if (sd->s_type & SYSFS_KOBJ_LINK)
@@ -490,7 +499,7 @@ static int sysfs_dir_open(struct inode *inode, struct file *file)
struct sysfs_dirent * sd;
 
mutex_lock(&dentry->d_inode->i_mutex);
-   sd = sysfs_new_dirent(NULL, 0, 0);
+   sd = sysfs_new_dirent("_DIR_", NULL, 0, 0);
if (sd)
sysfs_attach_dirent(sd, parent_sd, NULL);
mutex_unlock(&dentry->d_inode->i_mutex);
@@ -557,7 +566,7 @@ static int sysfs_readdir(struct file * filp, void * dirent, filldir_t filldir)
if (!next->s_element)
continue;
 
-   name = sysfs_get_name(next);
+   name = next->s_name;
len = strlen(name);
ino = (unsigned long)next;
 
@@ -669,7 +678,7 @@ struct dentry *sysfs_create_shadow_dir(struct kobject *kobj)
if (!shadow)
goto nomem;
 
-   sd = sysfs_new_dirent(kobj, inode->i_mode, SYSFS_DIR);
+   sd = sysfs_new_dirent("_SHADOW_", kobj, inode->i_mode, SYSFS_DIR);
if (!sd)
goto nomem;
/* point to parent_sd but don't attach to it */
diff --git a/fs/sysfs/file.

[PATCH 04/14] sysfs: flatten cleanup paths in sysfs_add_link() and create_dir()

2007-04-07 Thread Tejun Heo
Flatten cleanup paths in sysfs_add_link() and create_dir() to improve
readability and ease further changes to these functions.  This is in
preparation of object reference simplification.
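
For readers unfamiliar with the pattern, here is a small userspace
sketch of a flattened cleanup path with a single error label.  The
demo_* names and the fail_at knob are hypothetical and exist only to
exercise the error path; calloc() plays the role of kzalloc() so that
unset pointers are NULL and safe to free.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

struct demo_link {
	char *name;
	char *target;
};

/* Minimal strdup replacement so the sketch needs only ISO C. */
char *demo_strdup(const char *s)
{
	size_t len = strlen(s) + 1;
	char *p = malloc(len);

	if (p)
		memcpy(p, s, len);
	return p;
}

/*
 * Build a link object.  fail_at (1 or 2) simulates an allocation failure
 * at that step; 0 means no simulated failure.
 */
int demo_add_link(const char *name, const char *target, int fail_at,
		  struct demo_link **out)
{
	struct demo_link *sl;
	int error;

	error = -ENOMEM;
	sl = calloc(1, sizeof(*sl));
	if (!sl)
		goto err_out;

	sl->name = (fail_at == 1) ? NULL : demo_strdup(name);
	if (!sl->name)
		goto err_out;

	sl->target = (fail_at == 2) ? NULL : demo_strdup(target);
	if (!sl->target)
		goto err_out;

	*out = sl;
	return 0;

 err_out:
	if (sl) {
		free(sl->target);	/* free(NULL) is a no-op */
		free(sl->name);
		free(sl);
	}
	return error;
}
```

Because the object is zero-initialized, one label can undo any partial
state, and adding another setup step does not require another label.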

Signed-off-by: Tejun Heo <[EMAIL PROTECTED]>
---
 fs/sysfs/dir.c |   73 ++-
 fs/sysfs/symlink.c |   27 ++
 2 files changed, 58 insertions(+), 42 deletions(-)

diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 8b1a00a..99a0ba1 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -177,40 +177,53 @@ static int init_symlink(struct inode * inode)
return 0;
 }
 
-static int create_dir(struct kobject * k, struct dentry * p,
- const char * n, struct dentry ** d)
+static int create_dir(struct kobject *kobj, struct dentry *parent,
+ const char *name, struct dentry **p_dentry)
 {
int error;
umode_t mode = S_IFDIR| S_IRWXU | S_IRUGO | S_IXUGO;
+   struct dentry *dentry;
+   struct sysfs_dirent *sd;
 
-   mutex_lock(&p->d_inode->i_mutex);
-   *d = lookup_one_len(n, p, strlen(n));
-   if (!IS_ERR(*d)) {
-   if (sysfs_dirent_exist(p->d_fsdata, n))
-   error = -EEXIST;
-   else
-   error = sysfs_make_dirent(p->d_fsdata, *d, k, mode,
-   SYSFS_DIR);
-   if (!error) {
-   error = sysfs_create(*d, mode, init_dir);
-   if (!error) {
-   inc_nlink(p->d_inode);
-   (*d)->d_op = &sysfs_dentry_ops;
-   d_rehash(*d);
-   }
-   }
-   if (error && (error != -EEXIST)) {
-   struct sysfs_dirent *sd = (*d)->d_fsdata;
-   if (sd) {
-   list_del_init(&sd->s_sibling);
-   sysfs_put(sd);
-   }
-   d_drop(*d);
-   }
-   dput(*d);
-   } else
-   error = PTR_ERR(*d);
-   mutex_unlock(&p->d_inode->i_mutex);
+   mutex_lock(&parent->d_inode->i_mutex);
+
+   dentry = lookup_one_len(name, parent, strlen(name));
+   if (IS_ERR(dentry)) {
+   error = PTR_ERR(dentry);
+   goto out_unlock;
+   }
+
+   error = -EEXIST;
+   if (sysfs_dirent_exist(parent->d_fsdata, name))
+   goto out_dput;
+
+   error = sysfs_make_dirent(parent->d_fsdata, dentry, kobj, mode,
+ SYSFS_DIR);
+   if (error)
+   goto out_drop;
+
+   error = sysfs_create(dentry, mode, init_dir);
+   if (error)
+   goto out_sput;
+
+   inc_nlink(parent->d_inode);
+   dentry->d_op = &sysfs_dentry_ops;
+   d_rehash(dentry);
+
+   *p_dentry = dentry;
+   error = 0;
+   goto out_dput;
+
+ out_sput:
+   sd = dentry->d_fsdata;
+   list_del_init(&sd->s_sibling);
+   sysfs_put(sd);
+ out_drop:
+   d_drop(dentry);
+ out_dput:
+   dput(dentry);
+ out_unlock:
+   mutex_unlock(&parent->d_inode->i_mutex);
return error;
 }
 
diff --git a/fs/sysfs/symlink.c b/fs/sysfs/symlink.c
index 7b9c5bf..b463f17 100644
--- a/fs/sysfs/symlink.c
+++ b/fs/sysfs/symlink.c
@@ -49,30 +49,33 @@ static int sysfs_add_link(struct dentry * parent, const char * name, struct kobj
 {
struct sysfs_dirent * parent_sd = parent->d_fsdata;
struct sysfs_symlink * sl;
-   int error = 0;
+   int error;
 
error = -ENOMEM;
-   sl = kmalloc(sizeof(*sl), GFP_KERNEL);
+   sl = kzalloc(sizeof(*sl), GFP_KERNEL);
if (!sl)
-   goto exit1;
+   goto err_out;
 
sl->link_name = kmalloc(strlen(name) + 1, GFP_KERNEL);
if (!sl->link_name)
-   goto exit2;
+   goto err_out;
 
strcpy(sl->link_name, name);
sl->target_kobj = kobject_get(target);
 
error = sysfs_make_dirent(parent_sd, NULL, sl, S_IFLNK|S_IRWXUGO,
SYSFS_KOBJ_LINK);
-   if (!error)
-   return 0;
-
-   kobject_put(target);
-   kfree(sl->link_name);
-exit2:
-   kfree(sl);
-exit1:
+   if (error)
+   goto err_out;
+
+   return 0;
+
+ err_out:
+   if (sl) {
+   kobject_put(sl->target_kobj);
+   kfree(sl->link_name);
+   kfree(sl);
+   }
return error;
 }
 
-- 
1.5.0.3




[PATCH 03/14] sysfs: move sysfs_get_kobject() and release_sysfs_dirent() to dir.c

2007-04-07 Thread Tejun Heo
There is no reason these functions should be inlined, and the
soon-to-follow sysfs object reference simplification will make these
functions heavier.  Move them to dir.c.

Signed-off-by: Tejun Heo <[EMAIL PROTECTED]>
---
 fs/sysfs/dir.c   |   30 ++
 fs/sysfs/sysfs.h |   32 ++--
 2 files changed, 32 insertions(+), 30 deletions(-)

diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 5112f88..8b1a00a 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -14,6 +14,36 @@
 
 DECLARE_RWSEM(sysfs_rename_sem);
 
+struct kobject *sysfs_get_kobject(struct dentry *dentry)
+{
+   struct kobject * kobj = NULL;
+
+   spin_lock(&dcache_lock);
+   if (!d_unhashed(dentry)) {
+   struct sysfs_dirent * sd = dentry->d_fsdata;
+   if (sd->s_type & SYSFS_KOBJ_LINK) {
+   struct sysfs_symlink * sl = sd->s_element;
+   kobj = kobject_get(sl->target_kobj);
+   } else
+   kobj = kobject_get(sd->s_element);
+   }
+   spin_unlock(&dcache_lock);
+
+   return kobj;
+}
+
+void release_sysfs_dirent(struct sysfs_dirent * sd)
+{
+   if (sd->s_type & SYSFS_KOBJ_LINK) {
+   struct sysfs_symlink * sl = sd->s_element;
+   kfree(sl->link_name);
+   kobject_put(sl->target_kobj);
+   kfree(sl);
+   }
+   kfree(sd->s_iattr);
+   kmem_cache_free(sysfs_dir_cachep, sd);
+}
+
 static void sysfs_d_iput(struct dentry * dentry, struct inode * inode)
 {
struct sysfs_dirent * sd = dentry->d_fsdata;
diff --git a/fs/sysfs/sysfs.h b/fs/sysfs/sysfs.h
index a77c57e..812c8c3 100644
--- a/fs/sysfs/sysfs.h
+++ b/fs/sysfs/sysfs.h
@@ -17,6 +17,8 @@ extern void sysfs_delete_inode(struct inode *inode);
 extern struct inode * sysfs_new_inode(mode_t mode, struct sysfs_dirent *);
 extern int sysfs_create(struct dentry *, int mode, int (*init)(struct inode 
*));
 
+extern struct kobject *sysfs_get_kobject(struct dentry *dentry);
+extern void release_sysfs_dirent(struct sysfs_dirent * sd);
 extern int sysfs_dirent_exist(struct sysfs_dirent *, const unsigned char *);
 extern int sysfs_make_dirent(struct sysfs_dirent *, struct dentry *, void *,
umode_t, int);
@@ -79,36 +81,6 @@ static inline struct bin_attribute * to_bin_attr(struct 
dentry * dentry)
return ((struct bin_attribute *) sd->s_element);
 }
 
-static inline struct kobject *sysfs_get_kobject(struct dentry *dentry)
-{
-   struct kobject * kobj = NULL;
-
-   spin_lock(&dcache_lock);
-   if (!d_unhashed(dentry)) {
-   struct sysfs_dirent * sd = dentry->d_fsdata;
-   if (sd->s_type & SYSFS_KOBJ_LINK) {
-   struct sysfs_symlink * sl = sd->s_element;
-   kobj = kobject_get(sl->target_kobj);
-   } else
-   kobj = kobject_get(sd->s_element);
-   }
-   spin_unlock(&dcache_lock);
-
-   return kobj;
-}
-
-static inline void release_sysfs_dirent(struct sysfs_dirent * sd)
-{
-   if (sd->s_type & SYSFS_KOBJ_LINK) {
-   struct sysfs_symlink * sl = sd->s_element;
-   kfree(sl->link_name);
-   kobject_put(sl->target_kobj);
-   kfree(sl);
-   }
-   kfree(sd->s_iattr);
-   kmem_cache_free(sysfs_dir_cachep, sd);
-}
-
 static inline struct sysfs_dirent * sysfs_get(struct sysfs_dirent * sd)
 {
if (sd) {
-- 
1.5.0.3




[PATCHSET #master] sysfs: make sysfs disconnect immediately from kobject on deletion

2007-04-07 Thread Tejun Heo
Hello, all.

This patchset is the result of the following thread.

  http://thread.gmane.org/gmane.linux.kernel/510293

This patchset takes sysfs out of the device driver lifetime equation, which
not only fixes several race conditions caused by sysfs not holding
onto the proper module when referencing a kobject, but also helps fix
and simplify lifetime management in the driver model.

sysfs is peculiar in how it's intertwined with driver model via
kobject and fs layer by using dentry to record some of its hierarchy.
This not only complicates lifetime management outside of sysfs but
also inside sysfs proper.  We end up with several different yet
inter-dependent lifetime rules.

For example, dentry depends on sysfs_dirent for file accesses as any
dentry would depend on inode and its backing fs private data to do so,
while sysfs_dirent depends on dentry to walk the sysfs_dirent tree for
internal purposes (symlink walk) and the initial access to the dentry
happens by going through the kobject pointer.  This interdependency is okay
while all the objects are in memory but hell breaks loose when it's
time to kill those objects.  dentry and sysfs_dirent depend on each
other.  Unless they go away at the same time or use some way to safely
break the loop, one side ends up with a dangling pointer to the other.

This patchset solves this by making sysfs_dirent behave more like fs
internal inodes in other filesystems, which don't depend on dentry or
any other external entity to manage themselves.  Most information is already
there.  Only sd->s_parent and s_name are added.  These do increase the
size of sysfs_dirent a bit but make the logic look designed more on
Earth than on Mars, and with further changes, dentry and inode for
kobject can be made reclaimable, which can probably compensate for the
added space overhead.

Sysfs lifetime rules are much simpler now.  sd denotes sysfs_dirent.

* sd has default reference of 1 on creation which is dropped on
  deletion.

* dentry holds reference to sd.  dentry->d_fsdata can be safely
  dereferenced while a reference to the dentry is held.

* sd->s_parent points to the parent sd and each child holds a
  reference to its parent which is released when the child is released
  (reference reaches zero), so sd->s_parent can be dereferenced
  recursively if reference to the sd is held.

* sd->s_name can always be read while sd is valid.

* sd->s_elem.dir.kobj should only be accessed while
  sd->s_elem.dir.rwsem is read locked, which can be done by calling
  sysfs_get_dir_kobj() on the sd.

* sd->s_elem.[bin_]attr.[bin_attr] should only be accessed while its
  parent's sd->s_elem.dir.rwsem is read locked.  If
  sysfs_get_dir_kobj() returns NULL, attr pointer might point to
  released area.

* sysfs doesn't reference foreign objects for internal purposes.
  Foreign objects are accessed from the callbacks or interface
  functions where the caller is responsible for guaranteeing
  accessibility - symlink interface function is currently an exception.
  sysfs should export sysfs_dirent based interface and kobject code
  should do the locking.

* Directory dentries are still pinned as they are used in interface
  function - this should change in the future.
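
The default-reference and parent-reference rules above can be modelled in
user space.  This is only a minimal sketch, not code from the patchset:
struct sd, sd_new(), sd_get(), sd_put() and sd_live are hypothetical names,
and a plain int stands in for the atomic counter the kernel would use.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for sysfs_dirent; not the kernel structure. */
struct sd {
	int refcnt;		/* the kernel would use an atomic counter */
	struct sd *parent;	/* models sd->s_parent */
	char *name;		/* models sd->s_name */
};

static int sd_live;		/* nodes currently allocated, for checking */

static struct sd *sd_get(struct sd *sd)
{
	if (sd)
		sd->refcnt++;
	return sd;
}

/* Create a node with the default reference of 1; the child pins its parent. */
static struct sd *sd_new(const char *name, struct sd *parent)
{
	struct sd *sd = calloc(1, sizeof(*sd));

	if (!sd)
		return NULL;
	sd->name = malloc(strlen(name) + 1);
	if (!sd->name) {
		free(sd);
		return NULL;
	}
	strcpy(sd->name, name);
	sd->refcnt = 1;
	sd->parent = sd_get(parent);
	sd_live++;
	return sd;
}

/* Drop one reference; releasing a node also drops its reference on the parent. */
static void sd_put(struct sd *sd)
{
	while (sd && --sd->refcnt == 0) {
		struct sd *parent = sd->parent;

		free(sd->name);
		free(sd);
		sd_live--;
		sd = parent;
	}
}
```

In this model, "deleting" a parent is a single sd_put() that drops the
default reference; the node itself survives until its last child is
released, which is why s_parent can be followed safely while a reference to
the child is held.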

This patchset consists of the following 14 patches.

#01 : fix i_no handling bug and reduce dependency to inode
#02 : fix error handling in binattr write
#03-05  : prep for symlink reimplementation
#06-07  : add s_parent and s_name
#08 : make s_elem a union so that changing what it points to doesn't
  cause chaos and s_elem can contain more than one pointer
#09 : implement kobj_sysfs_assoc_lock to protect kobj->s_dentry
  which will be used to keep symlink interface compatible
#10 : reimplement symlink using only sysfs_dirent tree
#11 : implement bin_buffer for immediate-kobject-disconnect
#12 : implement immediate-kobject-disconnect
#13-14  : kill now obsolete stuff

The first 11 are some fixes and preparation for
immediate-kobject-disconnect.  Dependencies on external objects are
gradually removed such that only accesses during file ops remain, which
are converted by patch #12.  The last two patches remove the now
unnecessary attribute orphaning and attribute->owner.

I've run the following test on a UP machine for several hours without
any oops or memory leak and the test is currently running on a dual
processor (hyperthreading) machine for about half an hour.  I'll keep
it running for at least 5 hours.

 # (cd kernel-build-dir; while true; do echo [Loading...]; insmod 
drivers/scsi/scsi_mod.ko; insmod drivers/scsi/sd_mod.ko; insmod 
drivers/ata/libata.ko; insmod drivers/ata/ahci.ko; sleep 1; echo 
[Unloading...]; while lsmod | grep -q sd_mod; do rmmod ahci; rmmod libata; 
rmmod sd_mod; rmmod scsi_mod; sleep .1; done; done) &
 # (cd /sys; while true; do ls -liR > /dev/null; done) &
 # (cd /sys; while true; do find . | xargs cat > /dev/null 2>&1; done) &
 # (cd /sys; while true; do find . | sort | xargs cat > /dev/null 2>&1; done) &

T

[PATCH 13/14] sysfs: kill attribute file orphaning

2007-04-07 Thread Tejun Heo
Now that sysfs_dirent can be disconnected from kobject on deletion,
there is no need to orphan each attribute file.  All [bin_]attribute
nodes are automatically orphaned when the parent node is deleted.
Kill attribute file orphaning.

Signed-off-by: Tejun Heo <[EMAIL PROTECTED]>
---
 fs/sysfs/file.c  |   65 ++---
 fs/sysfs/inode.c |   25 
 fs/sysfs/mount.c |8 --
 fs/sysfs/sysfs.h |   16 -
 4 files changed, 13 insertions(+), 101 deletions(-)

diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
index 133a108..cd80b20 100644
--- a/fs/sysfs/file.c
+++ b/fs/sysfs/file.c
@@ -51,29 +51,15 @@ static struct sysfs_ops subsys_sysfs_ops = {
.store  = subsys_attr_store,
 };
 
-/**
- * add_to_collection - add buffer to a collection
- * @buffer:buffer to be added
- * @node:  inode of set to add to
- */
-
-static inline void
-add_to_collection(struct sysfs_buffer *buffer, struct inode *node)
-{
-   struct sysfs_buffer_collection *set = node->i_private;
-
-   mutex_lock(&node->i_mutex);
-   list_add(&buffer->associates, &set->associates);
-   mutex_unlock(&node->i_mutex);
-}
-
-static inline void
-remove_from_collection(struct sysfs_buffer *buffer, struct inode *node)
-{
-   mutex_lock(&node->i_mutex);
-   list_del(&buffer->associates);
-   mutex_unlock(&node->i_mutex);
-}
+struct sysfs_buffer {
+   size_t  count;
+   loff_t  pos;
+   char* page;
+   struct sysfs_ops* ops;
+   struct semaphoresem;
+   int needs_read_fill;
+   int event;
+};
 
 /**
  * fill_read_buffer - allocate and fill buffer from object.
@@ -175,10 +161,7 @@ sysfs_read_file(struct file *file, char __user *buf, 
size_t count, loff_t *ppos)
 
down(&buffer->sem);
if (buffer->needs_read_fill) {
-   if (buffer->orphaned)
-   retval = -ENODEV;
-   else
-   retval = fill_read_buffer(file->f_path.dentry,buffer);
+   retval = fill_read_buffer(file->f_path.dentry,buffer);
if (retval)
goto out;
}
@@ -276,16 +259,11 @@ sysfs_write_file(struct file *file, const char __user 
*buf, size_t count, loff_t
ssize_t len;
 
down(&buffer->sem);
-   if (buffer->orphaned) {
-   len = -ENODEV;
-   goto out;
-   }
len = fill_write_buffer(buffer, buf, count);
if (len > 0)
len = flush_write_buffer(file->f_path.dentry, buffer, len);
if (len > 0)
*ppos += len;
-out:
up(&buffer->sem);
return len;
 }
@@ -294,7 +272,6 @@ static int sysfs_open_file(struct inode *inode, struct file 
*file)
 {
struct sysfs_dirent *attr_sd = file->f_path.dentry->d_fsdata;
struct attribute *attr = attr_sd->s_elem.attr.attr;
-   struct sysfs_buffer_collection *set;
struct sysfs_buffer * buffer;
struct sysfs_ops * ops = NULL;
struct kobject *kobj;
@@ -322,26 +299,14 @@ static int sysfs_open_file(struct inode *inode, struct 
file *file)
else
ops = &subsys_sysfs_ops;
 
+   error = -EACCES;
+
/* No sysfs operations, either from having no subsystem,
 * or the subsystem have no operations.
 */
-   error = -EACCES;
if (!ops)
goto err_mput;
 
-   /* make sure we have a collection to add our buffers to */
-   mutex_lock(&inode->i_mutex);
-   if (!(set = inode->i_private)) {
-   error = -ENOMEM;
-   if (!(set = inode->i_private = kmalloc(sizeof(struct 
sysfs_buffer_collection), GFP_KERNEL)))
-   goto err_mput;
-   else
-   INIT_LIST_HEAD(&set->associates);
-   }
-   mutex_unlock(&inode->i_mutex);
-
-   error = -EACCES;
-
/* File needs write support.
 * The inode's perms must say it's ok, 
 * and we must have a store method.
@@ -368,11 +333,9 @@ static int sysfs_open_file(struct inode *inode, struct 
file *file)
if (!buffer)
goto err_mput;
 
-   INIT_LIST_HEAD(&buffer->associates);
init_MUTEX(&buffer->sem);
buffer->needs_read_fill = 1;
buffer->ops = ops;
-   add_to_collection(buffer, inode);
file->private_data = buffer;
 
/* open succeeded, put kobj and pin attr_sd */
@@ -391,10 +354,8 @@ static int sysfs_release(struct inode * inode, struct file 
* filp)
 {
struct sysfs_dirent *attr_sd = filp->f_path.dentry->d_fsdata;
struct attribute *attr = attr_sd->s_elem.attr.attr;
-   struct sysfs_buffer * buffer = filp->private_data;
+   struct sysfs_buffer *buffer = filp->private_data;
 
-   if (buffer)
-   remove_from_collection(buffer, 

[PATCH 10/14] sysfs: reimplement symlink using sysfs_dirent tree

2007-04-07 Thread Tejun Heo
sysfs symlink is implemented by referencing dentry and kobject from
sysfs_dirent - symlink entry references kobject, dentry is used to
walk the tree.  This complicates object lifetime rules and is
dangerous - for example, there is no way to tell which module the
target of a symlink belongs to, and referencing that kobject can make it
linger after the module is gone.

This patch reimplements symlink using only the sysfs_dirent tree.  sd for
a symlink points to and holds a reference to the target sysfs_dirent and
all walking is done using the sysfs_dirent tree.  Simpler and safer.
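
As an illustration of walking such a tree through parent pointers only,
here is a small user-space sketch.  It is not the kernel code: struct node,
node_path_length(), node_fill_path() and node_path() are hypothetical
names, and the sketch builds a plain absolute path instead of the relative
link target that sysfs produces.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for sysfs_dirent; not the kernel structure. */
struct node {
	struct node *parent;	/* models sd->s_parent; NULL at the root */
	const char *name;	/* models sd->s_name */
};

/* Bytes needed for "/a/b/c" plus the terminating NUL; the root adds nothing. */
static size_t node_path_length(const struct node *n)
{
	size_t length = 1;

	for (; n->parent; n = n->parent)
		length += strlen(n->name) + 1;
	return length;
}

/* Fill a buffer of node_path_length() bytes from the end towards the start. */
static void node_fill_path(const struct node *n, char *buffer, size_t length)
{
	buffer[--length] = '\0';
	for (; n->parent; n = n->parent) {
		size_t cur = strlen(n->name);

		/* back up enough to place this name with its leading '/' */
		length -= cur;
		memcpy(buffer + length, n->name, cur);
		buffer[--length] = '/';
	}
}

/* Convenience wrapper: returns a malloc()ed path, or NULL on failure. */
static char *node_path(const struct node *n)
{
	size_t length = node_path_length(n);
	char *buffer = malloc(length);

	if (buffer)
		node_fill_path(n, buffer, length);
	return buffer;
}
```

No dentry or kobject is consulted anywhere in the walk; the names and the
parent links stored in the nodes are enough.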

Please read the following message for more info.

  http://article.gmane.org/gmane.linux.kernel/510293

Signed-off-by: Tejun Heo <[EMAIL PROTECTED]>
---
 fs/sysfs/dir.c |9 +++--
 fs/sysfs/symlink.c |   88 +++
 fs/sysfs/sysfs.h   |2 +-
 3 files changed, 53 insertions(+), 46 deletions(-)

diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 8372c0c..b7d0e0e 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -22,10 +22,11 @@ struct kobject *sysfs_get_kobject(struct dentry *dentry)
spin_lock(&dcache_lock);
if (!d_unhashed(dentry)) {
struct sysfs_dirent * sd = dentry->d_fsdata;
+
if (sd->s_type & SYSFS_KOBJ_LINK)
-   kobj = kobject_get(sd->s_elem.symlink.target_kobj);
-   else
-   kobj = kobject_get(sd->s_elem.dir.kobj);
+   sd = sd->s_elem.symlink.target_sd;
+
+   kobj = kobject_get(sd->s_elem.dir.kobj);
}
spin_unlock(&dcache_lock);
 
@@ -40,7 +41,7 @@ void release_sysfs_dirent(struct sysfs_dirent * sd)
parent_sd = sd->s_parent;
 
if (sd->s_type & SYSFS_KOBJ_LINK)
-   kobject_put(sd->s_elem.symlink.target_kobj);
+   sysfs_put(sd->s_elem.symlink.target_sd);
if (sd->s_type & SYSFS_COPY_NAME)
kfree(sd->s_name);
kfree(sd->s_iattr);
diff --git a/fs/sysfs/symlink.c b/fs/sysfs/symlink.c
index 27df635..ff605d3 100644
--- a/fs/sysfs/symlink.c
+++ b/fs/sysfs/symlink.c
@@ -11,50 +11,49 @@
 
 #include "sysfs.h"
 
-static int object_depth(struct kobject * kobj)
+static int object_depth(struct sysfs_dirent *sd)
 {
-   struct kobject * p = kobj;
int depth = 0;
-   do { depth++; } while ((p = p->parent));
+
+   for (; sd->s_parent; sd = sd->s_parent)
+   depth++;
+
return depth;
 }
 
-static int object_path_length(struct kobject * kobj)
+static int object_path_length(struct sysfs_dirent * sd)
 {
-   struct kobject * p = kobj;
int length = 1;
-   do {
-   length += strlen(kobject_name(p)) + 1;
-   p = p->parent;
-   } while (p);
+
+   for (; sd->s_parent; sd = sd->s_parent)
+   length += strlen(sd->s_name) + 1;
+
return length;
 }
 
-static void fill_object_path(struct kobject * kobj, char * buffer, int length)
+static void fill_object_path(struct sysfs_dirent *sd, char *buffer, int length)
 {
-   struct kobject * p;
-
--length;
-   for (p = kobj; p; p = p->parent) {
-   int cur = strlen(kobject_name(p));
+   for (; sd->s_parent; sd = sd->s_parent) {
+   int cur = strlen(sd->s_name);
 
/* back up enough to print this bus id with '/' */
length -= cur;
-   strncpy(buffer + length,kobject_name(p),cur);
+   strncpy(buffer + length, sd->s_name, cur);
*(buffer + --length) = '/';
}
 }
 
-static int sysfs_add_link(struct dentry * parent, const char * name, struct 
kobject * target)
+static int sysfs_add_link(struct sysfs_dirent * parent_sd, const char * name,
+ struct sysfs_dirent * target_sd)
 {
-   struct sysfs_dirent * parent_sd = parent->d_fsdata;
struct sysfs_dirent * sd;
 
sd = sysfs_new_dirent(name, S_IFLNK|S_IRWXUGO, SYSFS_KOBJ_LINK);
if (!sd)
return -ENOMEM;
 
-   sd->s_elem.symlink.target_kobj = kobject_get(target);
+   sd->s_elem.symlink.target_sd = target_sd;
sysfs_attach_dirent(sd, parent_sd, NULL);
return 0;
 }
@@ -68,6 +67,8 @@ static int sysfs_add_link(struct dentry * parent, const char 
* name, struct kobj
 int sysfs_create_link(struct kobject * kobj, struct kobject * target, const 
char * name)
 {
struct dentry *dentry = NULL;
+   struct sysfs_dirent *parent_sd = NULL;
+   struct sysfs_dirent *target_sd = NULL;
int error = -EEXIST;
 
BUG_ON(!name);
@@ -80,11 +81,27 @@ int sysfs_create_link(struct kobject * kobj, struct kobject 
* target, const char
 
if (!dentry)
return -EFAULT;
+   parent_sd = dentry->d_fsdata;
+
+   /* target->dentry can go away beneath us but is protected with
+* kobj_sysfs_assoc_lock.  Fetch target_sd from it.
+*/
+   spin_lock(&kobj_sysfs_assoc_lock);
+  

Re: [PATCH] fix sysfs_readdir oops (was Re: sysfs reclaim crash)

2007-04-07 Thread Tejun Heo
Hello, Maneesh.

Maneesh Soni wrote:
> o sysfs_d_iput() is invoked in dentry reclaim path under memory pressure. This
>   happens without i_mutex. It also nullifies s_dentry to indicate that
>   the associated dentry is evicted. sysfs_readdir() accesses the s_dentry,
>   and gets the inode number from the associated dentry->d_inode, if 
>   there is one, else it invokes iunique(). This can create a race situation,
>   and crash while accessing the d_inode in sysfs_readdir(). 
> 
> o The race happens when the dentry is getting reclaimed and detached from
>   the corresponding sysfs_dirent though sysfs_dirent is still a valid
>   node. Accessing dentry fields are ok as it is under RCU but the inode is
>   not hence we may see oops accessing dentry->d_inode->i_no.
> 
> o The following patch always use i_unique() to get the inode number in
>   sysfs_readdir. This is ok as sysfs doesnot have permanent inode numbering.
>   It could be slower but avoids the oops. 

This isn't correct as iunique() assumes the inode is in the inode hash
table, which isn't true for sysfs.  This can result in duplicate inode
numbers.  Please take a look at the following alternative fix.

  http://article.gmane.org/gmane.linux.kernel/513325

Thanks.

-- 
tejun


Re: [PATCH] PPC4xx UART0 (8250) problem

2007-04-07 Thread Mikhail Zolotaryov
Benjamin Herrenschmidt wrote: 
> This is an old problem. The proper fix is already implemented for
> arch/powerpc and consist of having virtual irq numbers (which helps for
> many other things anyway).
> 
> Support for 4xx platforms in arch/powerpc is starting to get in, pop on
> [EMAIL PROTECTED] where the patches are being posted and you are
> welcome to give a hand porting more platforms over :-)

First, thanks for the response.

OK, I see that mapping IRQ0 to a virtual number is a good solution, but I have
known this problem since the 2.4 kernel. It was very surprising when moving to
the 2.6 kernel: the serial driver was changed, but the same problem is still there.

Maybe it's a good idea to create an unofficial ppc linux branch repository
(or does it already exist?) where we can put patches that work (as above) but
not in the clean way the linux community expects.


[PATCH, take4] FUTEX : new PRIVATE futexes

2007-04-07 Thread Eric Dumazet
Hi all

Updates in this take4 :

- All remarks from Nick were addressed, I hope

- Current mm code has a problem with 64bit futexes, as spotted by Nick :

get_futex_key() does a check against sizeof(u32) regardless of the futex being 
64bit or not.
So it is possible that a 64bit futex spans two pages of memory...
I had to change the get_futex_key() prototype to be able to do a correct test.
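
To make the problem concrete, here is a small user-space sketch of the two
checks involved.  It is an illustration only, not the kernel's
get_futex_key(): futex_addr_aligned(), futex_spans_pages() and
SKETCH_PAGE_SIZE are hypothetical names, and a 4096 byte page is assumed.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size for the illustration */

/* Nonzero if an object of `size` bytes (a power of two) is naturally aligned. */
static int futex_addr_aligned(uintptr_t addr, unsigned int size)
{
	return (addr & (size - 1)) == 0;
}

/* Nonzero if the object's first and last byte live in different pages. */
static int futex_spans_pages(uintptr_t addr, unsigned int size)
{
	return (addr / SKETCH_PAGE_SIZE) !=
	       ((addr + size - 1) / SKETCH_PAGE_SIZE);
}
```

An address at page offset 4092 passes a check made with a size of 4, yet an
8 byte futex placed there crosses into the next page.  Checking against the
real futex size rejects that address.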

History :

I'm pleased to present this patch which improves linux futexes performance and 
scalability, merely avoiding taking mmap_sem rwlock.

Ulrich agreed with the API and said glibc work could start as soon
as he gets a Fedora kernel with it :)

Andrew, could we get this in mm as well ? This version is against 2.6.21-rc5-mm4
(so supports 64bit futexes)

In this third version I dropped the NUMA optims and process private hash table,
to let new API come in and be tested.

Thank you

[PATCH] FUTEX : new PRIVATE futexes

Analysis of current linux futex code :
--------------------------------------

A central hash table futex_queues[] holds all contexts (futex_q) of waiting 
threads.
Each futex_wait()/futex_wake() has to obtain a spinlock on a hash slot to 
perform lookups or insert/deletion of a futex_q.

When a futex_wait() is done, the calling thread has to :

1) - Obtain a read lock on mmap_sem to be able to validate the user pointer
     (calling find_vma()). This validation tells us if the futex uses
     an inode based store (mapped file), or mm based store (anonymous mem)

2) - Compute a hash key

3) - Atomic increment of reference counter on an inode or a mm_struct

4) - Lock part of futex_queues[] hash table

5) - Perform the test on value of futex.
     (rollback if value != expected_value, returns EWOULDBLOCK)
     (various loops if test triggers mm faults)

6) - Queue the context into hash table, release the lock got in 4)

7) - Release the read_lock on mmap_sem

8) - Eventually unqueue the context (but rarely, as this part
     may be done by the futex_wake())

Futexes were designed to improve scalability but current implementation
has various problems :

- Central hashtable :
 This means scalability problems if many processes/threads want to use
 futexes at the same time.
 This means NUMA unbalance because this hashtable is located on one node.

- Using mmap_sem on every futex() syscall :

 Even if mmap_sem is a rw_semaphore, up_read()/down_read() are doing atomic
 ops on mmap_sem, dirtying cache line :
        - lot of cache line ping pongs on SMP configurations.

 mmap_sem is also extensively used by mm code (page faults, mmap()/munmap())
 Highly threaded processes might suffer from mmap_sem contention.

 mmap_sem is also used by oprofile code. Enabling oprofile hurts threaded
programs because of contention on the mmap_sem cache line.

- Using an atomic_inc()/atomic_dec() on inode ref counter or mm ref counter:
 It's also a cache line ping pong on SMP. It also increases mmap_sem hold time
 because of cache misses.

Most of these scalability problems come from the fact that futexes are in
one global namespace. As we use a central hash table, we must make sure
they are all using the same reference (given by the mm subsystem).
We chose to force all futexes be 'shared'. This has a cost.

But the fact is that POSIX defined PRIVATE and SHARED, allowing clear separation
and optimal performance if carefully implemented. The time has come for linux to
have better threading performance.

The goal is to permit new futex commands to avoid :
 - Taking the mmap_sem semaphore, conflicting with other subsystems.
 - Modifying a ref_count on mm or an inode, still conflicting with mm or fs.

This is possible because, for one process using PTHREAD_PROCESS_PRIVATE
futexes, we only need to distinguish futexes by their virtual address, no
matter what the underlying mm storage is.



If glibc wants to exploit this new infrastructure, it should use the new
_PRIVATE futex subcommands for PTHREAD_PROCESS_PRIVATE futexes, and
be prepared to fall back on the old subcommands for old kernels. Using one
global variable with the FUTEX_PRIVATE_FLAG or 0 value should be OK.
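
A user-space sketch of the "one global variable" idea might look like the
code here.  It is not glibc code: sys_futex(), futex_probe_private(),
futex_wake_private(), futex_wait_private() and futex_private_flag are
hypothetical names, the fallback value 128 for FUTEX_PRIVATE_FLAG is the one
found in current linux/futex.h, and the sketch assumes that a kernel without
the private subcommands answers an unknown futex op with ENOSYS.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

#ifndef FUTEX_PRIVATE_FLAG
#define FUTEX_PRIVATE_FLAG 128	/* value used by current linux/futex.h */
#endif

/* Either FUTEX_PRIVATE_FLAG or 0, decided once by futex_probe_private(). */
static int futex_private_flag = FUTEX_PRIVATE_FLAG;

/* Thin wrapper around the raw system call; no timeout, no second address. */
static long sys_futex(uint32_t *uaddr, int op, uint32_t val)
{
	return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/* Probe with a harmless wake; clear the flag if the kernel rejects the op. */
static void futex_probe_private(void)
{
	uint32_t word = 0;

	if (sys_futex(&word, FUTEX_WAKE | FUTEX_PRIVATE_FLAG, 1) < 0 &&
	    errno == ENOSYS)
		futex_private_flag = 0;
}

/* For PTHREAD_PROCESS_PRIVATE objects only; shared ones keep the old ops. */
static long futex_wake_private(uint32_t *uaddr, int nr)
{
	return sys_futex(uaddr, FUTEX_WAKE | futex_private_flag, nr);
}

static long futex_wait_private(uint32_t *uaddr, uint32_t expected)
{
	return sys_futex(uaddr, FUTEX_WAIT | futex_private_flag, expected);
}
```

After the one-time probe, every private operation costs a single OR with
the global flag, and old kernels transparently get the old subcommands.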

PTHREAD_PROCESS_SHARED futexes should still use the old subcommands.

Compatibility with old applications is preserved, they still hit the
scalability problems, but new applications can fly :)

Note : the same SHARED futex (mapped on a file) can be used by old binaries 
*and* new binaries, because both binaries will use the old subcommands.

Note : The vast majority of futexes should be using PROCESS_PRIVATE semantics,
as this is the default. Almost all applications should benefit
from these changes (new kernel and updated libc)

Some bench results on a Pentium M 1.6 GHz (SMP kernel on a UP machine)

/* calling futex_wait(addr, value) with value != *addr */
434 cycles per futex(FUTEX_WAIT) call (mixing 2 futexes)
427 cycles per futex(FUTEX_WAIT) call (using one futex)
345 cycles per futex(FUTEX_WAIT_PRIVATE) call (mixing 2 futexes)
345 cycles per futex(FUTEX_WAIT_PRIVATE) call (using one futex)

Re: [patch] high-res timers: UP resume fix

2007-04-07 Thread Thomas Gleixner
On Sat, 2007-04-07 at 10:25 +0200, Ingo Molnar wrote:
> * Ingo Molnar <[EMAIL PROTECTED]> wrote:
> 
> > [...] Soeren, can you confirm that you are using a !CONFIG_SMP kernel, 
> > and if yes, does the patch below fix the resume problem for you?
> 
> hm, you seem to have a CONFIG_SMP=y kernel. I dont immediately see where 
> we re-enable interrupts in the SMP case, but could you try my patch 
> nevertheless

We do in on_each_cpu() unconditionally. I missed that.

tglx




Re: Linux 2.6.21-rc6

2007-04-07 Thread Michal Piotrowski
Hi all,

This looks like a lockdep problem.
2.6.21-rc6
+ hrtimers_debug.patch (from Ingo)
- skge_wol_support (commit a504e64ab42bcc27074ea37405d06833ed6e0820) dropped
  due to swsusp problems

[14016.726946] BUG: at /mnt/md0/devel/linux-git/kernel/lockdep.c:2427 
check_flags()
[14016.734331]  [] show_trace_log_lvl+0x1a/0x2f
[14016.739507]  [] show_trace+0x12/0x14
[14016.743982]  [] dump_stack+0x16/0x18
[14016.748460]  [] check_flags+0x95/0x143
[14016.753106]  [] lock_acquire+0x29/0x82
[14016.757741]  [] down_write+0x3a/0x54
[14016.762203]  [] sys_munmap+0x23/0x3f
[14016.71]  [] syscall_call+0x7/0xb
[14016.771134]  ===
[14016.774712] irq event stamp: 43076
[14016.778111] hardirqs last  enabled at (43075): [] 
syscall_exit_work+0x11/0x26
[14016.786166] hardirqs last disabled at (43076): [] 
ret_from_exception+0x9/0xc
[14016.794118] softirqs last  enabled at (42608): [] 
__do_softirq+0xe4/0xea
[14016.801706] softirqs last disabled at (42599): [] 
do_softirq+0x64/0xd1

http://www.stardust.webpages.pl/files/tbf/bitis-gabonica/2.6.21-rc6/git-console.log
http://www.stardust.webpages.pl/files/tbf/bitis-gabonica/2.6.21-rc6/git-config

BTW. I noticed some strange fio (1.15) behavior
Starting 16 processes
file:io_u.c:65, assert idx < f->num_maps failed[  1605/ 36442 kb/s] [eta 
00m:32s]
fio: pid=13734, got signal=11
file:io_u.c:65, assert idx < f->num_maps failed[ 10452/ 0 kb/s] [eta 
00m:23s]
fio: pid=13731, got signal=11

Regards,
Michal

-- 
Michal K. K. Piotrowski
LTG - Linux Testers Group (PL)
(http://www.stardust.webpages.pl/ltg/)
LTG - Linux Testers Group (EN)
(http://www.stardust.webpages.pl/linux_testers_group_en/)


Re: [patch] high-res timers: UP resume fix

2007-04-07 Thread Ingo Molnar

* Thomas Gleixner <[EMAIL PROTECTED]> wrote:

> On Sat, 2007-04-07 at 10:25 +0200, Ingo Molnar wrote:
> > * Ingo Molnar <[EMAIL PROTECTED]> wrote:
> > 
> > > [...] Soeren, can you confirm that you are using a !CONFIG_SMP kernel, 
> > > and if yes, does the patch below fix the resume problem for you?
> > 
> > hm, you seem to have a CONFIG_SMP=y kernel. I dont immediately see 
> > where we re-enable interrupts in the SMP case, but could you try my 
> > patch nevertheless
> 
> We do in on_each_cpu() unconditionally. I missed that.

doh, indeed!

Ingo


Re: [patch] high-res timers: UP resume fix

2007-04-07 Thread Thomas Gleixner
On Sat, 2007-04-07 at 10:12 +0200, Ingo Molnar wrote:
> Subject: [patch] high-res timers: UP resume fix
> From: Ingo Molnar <[EMAIL PROTECTED]>
> 
> Soeren Sonnenburg reported that upon resume he is getting
> this backtrace:
> 
>  [] smp_apic_timer_interrupt+0x57/0x90
>  [] retrigger_next_event+0x0/0xb0
>  [] apic_timer_interrupt+0x28/0x30
>  [] retrigger_next_event+0x0/0xb0
>  [] __kfifo_put+0x8/0x90
>  [] on_each_cpu+0x35/0x60
>  [] clock_was_set+0x18/0x20
>  [] timekeeping_resume+0x7c/0xa0
>  [] __sysdev_resume+0x11/0x80
>  [] sysdev_resume+0x47/0x80
>  [] device_power_up+0x5/0x10
> 
> it turns out that on UP we mistakenly re-enable interrupts,
> so do the timer retrigger only on the current CPU.
> 
> Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]>

Acked-by: Thomas Gleixner <[EMAIL PROTECTED]>




Re: [patch 2/4] clean up identify_cpu

2007-04-07 Thread Andrew Morton
On Fri, 06 Apr 2007 15:41:54 -0700 Jeremy Fitzhardinge <[EMAIL PROTECTED]> 
wrote:

> identify_cpu() is used to identify both the boot CPU and secondary
> CPUs, but it performs some actions which only apply to the boot CPU.
> Those functions are therefore really __init functions, but because
> they're called by identify_cpu(), they must be marked __cpuinit.
> 
> This patch splits identify_cpu() into identify_boot_cpu() and
> identify_secondary_cpu(), and calls the appropriate init functions
> from each.  Also, identify_boot_cpu() and all the functions it
> dominates are marked __init.

x86_64 uses this too.

WARNING: arch/x86_64/kernel/built-in.o - Section mismatch: reference to 
.init.text:mtrr_bp_init from .text.identify_cpu after 'identify_cpu' (at offset 
0x655)


Re: [PATCH] [sched] redundant reschedule when set_user_nice() boosts a prio of a task from the "expired" array

2007-04-07 Thread Dmitry Adamushko

On 07/04/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> On Wed, 4 Apr 2007 22:05:40 +0200 "Dmitry Adamushko"
> > [...]
> >
> > o  Make TASK_PREEMPTS_CURR(task, rq) return "true" only if the task's
> > prio is higher than the current's one and the task is in the "active"
> > array.
> > This ensures we don't make redundant resched_task() calls when the
> > task is in the "expired" array (as may happen now in set_user_prio(),
> > rt_mutex_setprio() and pull_task() ) ;
> >
> > o  generilise conditions for a call to resched_task() in
> > set_user_nice(), rt_mutex_setprio() and sched_setscheduler()
> >
>
> grief.  This patch conflicts seriously with the staircase scheduler in -mm.
> So to merge it I need to
>
> - apply it
> - then apply a revert-it-again patch
> - then apply staircase
> - then ask Con to cook up a staircase-based equivalent of your change.


I'll make a SD-based version and send it to Con.



> so
>
> - your code only gets publically tested in its against-staircase version
>
> - the against-mainline version will get merged without having been
>   publically tested outside of staircase
>
> which is probably all OK for a 2.6.22-rc1 thing, provided Ingo can give a
> confident ack.


Ok, thanks.

btw, just out of curiosity: the very first approach I was thinking of
was to move a task from the "expired" to the "active" array when its
priority is boosted (like rt_mutex_setprio() does for rt tasks).

Reasoning: getting a higher static_prio means getting an additional
quota of timeslice which could still be used during this rotation.

delta = task_timeslice(p->static_prio) - task_timeslice(old_static_prio)

Aha.. /here I'm looking at the mainline now/ another funny thing is
that time_slice is not immediately affected by the change of
static_prio in set_user_nice(). If a task is in the expired array, it
will run the next rotation with the *old* time_slice (i.e. calculated
in task_running_tick() before putting the task in the expired array
and based on the *old* static_prio).
In theory, set_user_nice() could adjust p->time_slice by "delta",
calculated as shown above.. But ok, it's no more than a minor
inconsistency (unless I'm missing something, of course).
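
For a feeling of the size of that "delta", here is a user-space sketch of
the computation.  It is an approximation, not kernel code: the constants
and the scaling formula are recalled from the mainline O(1) scheduler and
should be treated as assumptions, times are in milliseconds rather than
jiffies, and scale_prio(), static_prio_timeslice() and timeslice_delta()
are hypothetical helper names.

```c
#include <assert.h>

/* Assumed constants, modelled on the mainline O(1) scheduler; times in ms. */
#define MAX_RT_PRIO	100
#define MAX_PRIO	(MAX_RT_PRIO + 40)
#define MAX_USER_PRIO	40
#define NICE_TO_PRIO(n)	(MAX_RT_PRIO + (n) + 20)
#define MIN_TIMESLICE	5
#define DEF_TIMESLICE	100

/* Scale a base slice by how far prio is from the lowest user priority. */
static int scale_prio(int base, int prio)
{
	int slice = base * (MAX_PRIO - prio) / (MAX_USER_PRIO / 2);

	return slice > MIN_TIMESLICE ? slice : MIN_TIMESLICE;
}

/* Negative nice levels scale from a base four times larger. */
static int static_prio_timeslice(int static_prio)
{
	if (static_prio < NICE_TO_PRIO(0))
		return scale_prio(DEF_TIMESLICE * 4, static_prio);
	return scale_prio(DEF_TIMESLICE, static_prio);
}

/* Timeslice gained (positive) or lost (negative) by a static_prio change. */
static int timeslice_delta(int old_static_prio, int new_static_prio)
{
	return static_prio_timeslice(new_static_prio) -
	       static_prio_timeslice(old_static_prio);
}
```

Under these assumptions a task reniced from 0 to -5 would be owed an extra
400 ms for the current rotation, which a task sitting in the expired array
does not get until its time_slice is recalculated.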




> > --- linux-2.6.21-rc5/kernel/sched-orig.c2007-04-04
> > 18:26:19.0 +0200
> > +++ linux-2.6.21-rc5/kernel/sched.c 2007-04-04 18:26:43.0 +0200
> > @@ -168,7 +168,7 @@ unsigned long long __attribute__((weak))
> > (MAX_BONUS / 2 + DELTA((p)) + 1) / MAX_BONUS - 1))
> >
> >  #define TASK_PREEMPTS_CURR(p, rq) \
> > -   ((p)->prio < (rq)->curr->prio)
> > +   (((p)->prio < (rq)->curr->prio) && ((p)->array == (rq)->active))
>
> Your patch was wordwrapped and had its tabs replaced with spaces.  Please
> fix your email client.


I apologize for this. Will fix.
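
A minimal userspace sketch of what the quoted TASK_PREEMPTS_CURR() change does. The struct layouts and function names below are simplified assumptions for illustration only, not the kernel's definitions; only the two conditions are taken from the quoted hunk.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the scheduler structures. */
struct prio_array {
	int nr_active;
};

struct task {
	int prio;                 /* lower value means higher priority */
	struct prio_array *array; /* array the task is currently queued on */
};

struct rq {
	struct task *curr;          /* task currently running on this runqueue */
	struct prio_array *active;
	struct prio_array *expired;
};

/* Condition before the patch: priority comparison only. */
bool task_preempts_curr_old(const struct task *p, const struct rq *rq)
{
	return p->prio < rq->curr->prio;
}

/* Condition after the patch: p must also be queued on the active array. */
bool task_preempts_curr_new(const struct task *p, const struct rq *rq)
{
	return p->prio < rq->curr->prio && p->array == rq->active;
}
```

With the old condition, a task boosted while sitting in the "expired" array still triggers a reschedule even though it cannot run before the arrays are switched; the new condition filters that case out.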


--
Best regards,
Dmitry Adamushko


Re: [PATCH] [sched] redundant reschedule when set_user_nice() boosts a prio of a task from the "expired" array

2007-04-07 Thread Ingo Molnar

* Dmitry Adamushko <[EMAIL PROTECTED]> wrote:

> following the conversation on "a redundant reschedule call in 
> set_user_prio()", here is a possible approach.
> 
> The patch is somewhat intrusive as it even dares to adapt 
> TASK_PREEMPTS_CURR().

looks good to me, but the patch seems seriously whitespace-damaged: all 
tabs were converted to spaces.

Ingo


Re: Two questions regarding Opening files within Kernel!

2007-04-07 Thread Roland Kuhn

Hi!

On 7 Apr 2007, at 08:58, JanuGerman wrote:

> 1)   I have just a file path with me, an absolute path, but no
> dentry, no inode, no vfsmount object, which function i can call to
> get a "file" object associated with the absolute file path. I have
> surfed around the source code especially fs/open.c and some other
> files, but each function requires a parameter "mode" and "fd"
> beside file path. Actually, i was confused about the "mode"
> parameter (and its difference with "flag"), like what to send, and
> secondly for "fd", i am not sure, what value to send as there is no
> file in fact and only file path exists. Any idea?



No, but I'm no guru either.

> 2) Any functionality within linux kernel source code, to read one
> line per file? or some indirect way to set buffer size for one
> read?. That is, any existing header file for doing text I/O rather
> than binary within the kernel source code?


Do you have a compelling reason for not letting userspace feed the  
file to your driver? That would be the natural and much easier way, I  
suppose...
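
A minimal userspace sketch of the flags/mode distinction asked about in (1), and of reading one line by hand as asked in (2). It uses only the POSIX open(2)/read(2)/write(2) calls; the helper names write_text() and read_first_line() are made up for this example. As far as I know, the in-kernel filp_open() takes the same (path, flags, mode) triple, but treat that as an assumption and check your tree.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create (or truncate) a text file.
 * "flags" says how to open the file; "mode" (0600 here) gives the
 * permission bits and is only consulted because O_CREAT is set.
 * The fd is the return value of open(), not something you pass in.
 */
int write_text(const char *path, const char *text)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	size_t len = strlen(text);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = write(fd, text, len);
	close(fd);
	return (n == (ssize_t)len) ? 0 : -1;
}

/*
 * Read the first line (without the trailing '\n') into buf.
 * Returns the line length, or -1 on error.
 */
int read_first_line(const char *path, char *buf, size_t len)
{
	int fd = open(path, O_RDONLY);	/* no O_CREAT, so no mode argument */
	size_t used = 0;
	ssize_t n = 0;
	char c;

	if (fd < 0)
		return -1;
	if (len == 0) {
		close(fd);
		return -1;
	}
	/* Byte-at-a-time for clarity; a real reader would buffer. */
	while (used + 1 < len && (n = read(fd, &c, 1)) == 1 && c != '\n')
		buf[used++] = c;
	buf[used] = '\0';
	close(fd);
	return (n < 0) ? -1 : (int)used;
}
```

Parsing the text in userspace like this and handing the result to the driver keeps the line-splitting logic out of the kernel, which is the approach suggested above.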


Ciao,
Roland

--
TU Muenchen, Physik-Department E18, James-Franck-Str., 85748 Garching
Telefon 089/289-12575; Telefax 089/289-12570
--
CERN office: 892-1-D23 phone: +41 22 7676540 mobile: +41 76 487 4482
--
Any society that would give up a little liberty to gain a little
security will deserve neither and lose both.  - Benjamin Franklin
-BEGIN GEEK CODE BLOCK-
Version: 3.12
GS/CS/M/MU d-(++) s:+ a-> C+++ UL P+++ L+++ E(+) W+ !N K- w--- M 
+ !V Y+

PGP++ t+(++) 5 R+ tv-- b+ DI++ e+++> h y+++
--END GEEK CODE BLOCK--






Re: [PATCH] [sched] redundant reschedule when set_user_nice() boosts a prio of a task from the "expired" array

2007-04-07 Thread Ingo Molnar

* Andrew Morton <[EMAIL PROTECTED]> wrote:

> so
> 
> - your code only gets publically tested in its against-staircase 
>   version
> 
> - the against-mainline version will get merged without having been
>   publically tested outside of staircase
> 
> which is probably all OK for a 2.6.22-rc1 thing, provided Ingo can 
> give a confident ack.

it looks good to me - and once i get a non-whitespace-damaged patch i'll 
put it into -rt so we'll have testing. (this patch should have at most a 
latency impact, if we forget to preempt somewhere, and -rt users are 
quite touchy about latencies.)

> Where are we at with staircase anyway?  Is it looking like a 2.6.22 
> thing? I don't personally think we've yet seen enough serious 
> performance testing to permit a merge, apart from other issues...

yes, that's my thinking too at the moment. I'd also like to see a 
summary of 'open design questions' list from Mike (if Mike has 
time/energy for that?) - many questions were raised, a good number of 
them were answered, various changes done to SD but there's no good 
summary of the current state of affairs.

Ingo


Re: [PATCH] console UTF-8 fixes

2007-04-07 Thread Egmont Koblinger
On Fri, Apr 06, 2007 at 12:43:03PM -0700, H. Peter Anvin wrote:

Hi,

> I strongly disagree.  First of all, you're changing the semantics of a 
> 13-year-old API.  The semantics of the Linux console is that by 
> specifying U+FFFD SUBSTITUTION GLYPH in your unicode table, you have 
> specified the fallback glyph.

OK, I'm not against using U+FFFD for missing glyphs. In the mean time I
think it's still a good idea to clearly separate the two cases in the code
(that is, the case of invalid sequence from the case of missing glyph), but
we can still use the same replacement character in these two cases. I'll
send an updated patch after Easter if it sounds good for you.


> What's worse, you've hard-coded the uses of specific visual 
> representations.  That is completely unacceptable.

Now that we've dropped the idea of "dot" for missing glyphs, the other thing
that remains is the hard-coded '?' if and only if U+FFFD is not present in
the font. It is even hardcoded in the current code and I have no better
idea, there must be a last-resort hardcoded fallback. The only thing I
changed is that I inverted the color attributes for this question mark. Do
you think that the old behavior, a normal question mark would be better? No
problem, I'll adjust the code in this case. Just please tell me what the
expected behavior is, I'm not sure I clearly understand your thoughts.


> >  - Another possible thing the current code may do (for latin1-compatible
> >characters) is to simply display the glyph loaded in that position.
> >Suppose I have loaded a latin2 font. In latin2, 0xFB is an "u with
> >double accent". An applications prints U+00FB, which is an "u with
> >circumflex". Since this glyph is not present in latin2, it cannot be
> >printed with the current font. Still, the current code falls back to
> >printing the glyph from the 0xFB position of the glyph table. Hence my
> >app asked to print "u with circumflex" but an "u with double accent"
> >appears on the screen. This is totally contrary to the goals of Unicode
> >and shouldn't ever happen.
> 
> When does that happen?  That is clearly a bug.

I think I've (mostly) described it above. Set everything to UTF-8, load a
latin2 font (containing 256 glyphs, e.g. "setfont lat2-16"), make an
application print U+00FB (alt + numpad 251 is one trivial way), you'll see
an "u with double accent", though the symbol to be displayed is "u with
circumflex". This isn't present in the current font, so the replacement
character should appear, not a different letter.


> >- The replacement character for invalid UTF-8 sequences is U+FFFD, falling
> >  back to a question mark. I've changed the fallback version to an inverted
> >  question mark. This way it's more similar to the common glyph of U+FFFD,
> >  and it's more trivial to the user that it's not a literal question mark
> >  but rather some erroneous situation.
> 
> Brilliant.  You've picked a fallback glyph which is unlikely to exist in 
> all fonts.  The whole point of falling back to ? is that it's an ASCII 
> character, which means that if the font designer failed to designate a 
> fallback glyph -- which is an error!!! -- there is at least some hope of 
> conveying the error back to the user.

Sorry, I wasn't clear enough and I think you misunderstood me. The symbol I
chose for fallback is still '?' (the ASCII question mark), I just invert
the color attributes of the cell where this is printed. This way it becomes
visually distinguishable from the literal question mark. Using the current
kernel you just cannot know whether the character printed is a real question
mark, or a replacement glyph. Still, should you strongly disagree with this
decision, the color inverting part can easily be removed.
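
A minimal sketch of the fallback chain discussed above: the requested glyph, then U+FFFD, then an inverted ASCII '?'. The table layout, the type names and the assumption that '?' lives in slot 0x3F when the font does not map it are all hypothetical and only serve the illustration; this is not the console code.

```c
#include <stddef.h>
#include <stdint.h>

#define UCS_REPLACEMENT 0xFFFDu

/* Hypothetical font map entry: one Unicode code point -> glyph slot. */
struct glyph_map_entry {
	uint32_t ucs;
	int glyph;
};

struct glyph_choice {
	int glyph;    /* glyph slot to draw */
	int inverted; /* non-zero: draw with inverted colour attributes */
};

/* Return the glyph slot for ucs, or -1 when the font does not map it. */
int lookup_glyph(const struct glyph_map_entry *map, size_t n, uint32_t ucs)
{
	for (size_t i = 0; i < n; i++)
		if (map[i].ucs == ucs)
			return map[i].glyph;
	return -1;
}

/* Never fall back to "whatever sits in slot ucs" of the glyph table. */
struct glyph_choice choose_glyph(const struct glyph_map_entry *map, size_t n,
				 uint32_t ucs)
{
	struct glyph_choice c = { 0, 0 };
	int g = lookup_glyph(map, n, ucs);

	if (g < 0)
		g = lookup_glyph(map, n, UCS_REPLACEMENT);
	if (g >= 0) {
		c.glyph = g;
		return c;
	}
	/* Last resort: ASCII '?', inverted so it differs from a literal '?'. */
	g = lookup_glyph(map, n, '?');
	c.glyph = (g >= 0) ? g : 0x3F; /* assumed slot of '?' */
	c.inverted = 1;
	return c;
}
```

With a latin2-style map that has U+0171 in slot 0xFB but no entry for U+00FB, choose_glyph() returns the replacement (or the inverted '?'), never slot 0xFB.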

> >- There's no concept of double-width characters. It's way beyond the scope
> >  of my patch to try to display them, but at least I think it's important
> >  for the cursor to jump two positions when printing such characters, since
> >  this is what applications (such as text editors) expect. If the cursor
> >  didn't jump two positions, applications would suffer from displaying and
> >  refreshing problems, and editing some English letters that are preceded 
> >  by
> >  some CJK characters in the same line became a nightmare. With my patch an
> >  inverted dot followed by an inverted space is displayed for double-width
> >  characters so it's quite easy to see that they are tied together.
> 
> To be able to do CJK you need something like Kon anyway.  This feels 
> like bloat.

I don't want CJK support. All that I want is to be able to edit English
words within a file that contains mixture of English and CJK, with a text
editor like vim or joe. Just try it with the current kernel, and with my
patch. Suppose that within a line some CJK text is followed by an English
word, and you want to edit the latter one. It's going to be a huge headache
with the current kernel. Where you see the cursor is not wh

Re: [PATCH 1/8] Enhance process freezer interface for usage beyond software suspend

2007-04-07 Thread Rafael J. Wysocki
On Saturday, 7 April 2007 00:20, Nigel Cunningham wrote:
> Hi.
> 
> On Fri, 2007-04-06 at 16:34 +0200, Rafael J. Wysocki wrote:
> > On Monday, 2 April 2007 22:51, Pavel Machek wrote:
> > > Hi!
> > > 
> > > > > > +/* Per process freezer specific flags */
> > > > > > +#define PF_FE_SUSPEND  0x8000  /* This thread should 
> > > > > > not be frozen
> > > > > > +* for suspend
> > > > > > +*/
> > > > > > +
> > > > > > +#define PF_FE_KPROBES  0x0010  /* This thread should 
> > > > > > not be frozen
> > > > > > +* for Kprobes
> > > > > > +*/
> > > > > 
> > > > > Just put the comment before the define for long comments?
> > > > 
> > > > Agreed.
> > > 
> > > (Actually it would be nice to say
> > > 
> > > > /* This thread should not be frozen for suspend, because it is needed
> > >for getting image saved to disk */
> > > 
> > > > > > -#ifdef CONFIG_PM
> > > > > > +#if defined(CONFIG_PM) || defined(CONFIG_HOTPLUG_CPU) || \
> > > > > > +   defined(CONFIG_KPROBES)
> > > > > 
> > > > > Should we create CONFIG_FREEZER?
> > > > 
> > > > Why do you think so?  I think the freezer should be compiled 
> > > > automatically
> > > > if any of the above is set, which is what this directive really means.
> > > 
> > > Kconfig can do that. ("select statement"). If we have one such ifdef,
> > > it is okay, but if it would be more of them.
> > > 
> > > > > Hmmm, I do not really like softlockup watchdog running during suspend.
> > > > > Can we make this freezeable and make watchdog shut itself off while
> > > > > suspending?
> > > > 
> > > > Generally, I agree, but this patch only replaces the existing instances
> > > > of PF_NOFREEZE with the new mechanism.  The changes you're talking about
> > > > require a separate patch series (or at least one separate patch), I 
> > > > think, and
> > > > they need not be so simple to make.
> > > 
> > > Agreed about separate patch series.
> > > 
> > > > > > -   current->flags |= PF_NOFREEZE;
> > > > > > +   freezer_exempt(FE_ALL);
> > > > > > pid = kernel_thread(do_linuxrc, "/linuxrc", SIGCHLD);
> > > > > > if (pid > 0) {
> > > > > > while (pid != sys_wait4(-1, NULL, 0, NULL))
> > > > > 
> > > > > Does this mean we have userland /linuxrc running with PF_NOFREEZE?
> > > > > That would be very bad...
> > > > 
> > > > No, actually it is _required_ for the userland resume to work.  Well, 
> > > > perhaps
> > > > I should place a comment in there so that I don't have to explain this 
> > > > again
> > > > and again. :-)
> > > 
> > > Can you put big bold comment there?
> > >
> > > Why is it needed? Freezer never freezes _current_ task.
> > 
> > No, it doesn't, but this task spawns linuxrc and then calls sys_wait4() in a
> > loop.
> > 
> > Well, actually, I'll try to plant try_to_freeze() in this loop and see if 
> > that
> > works.  If it doesn't, I'll add a comment.
> 
> It works. I've had:
> 
> while (pid != sys_wait4(-1, NULL, 0, NULL)) {
> yield();
> try_to_freeze();
> }
> 
> there for ages for Suspend2.

OK, thanks.  Is there any particular reason to place try_to_freeze() after
yield()?

Greetings,
Rafael


Re: [PATCH, take4] FUTEX : new PRIVATE futexes

2007-04-07 Thread Nick Piggin

Eric Dumazet wrote:

> Hi all
>
> Updates on this take4 :
>
> - All remarks from Nick were addressed I hope

Yeah looks very nice. Thanks for doing that.

> - Current mm code have a problem with 64bit futexes, as spotted by Nick :
>
> get_futex_key() does a check against sizeof(u32) regardless of futex being
> 64bits or not.
> So it is possible a 64bit futex spans two pages of memory...
> I had to change get_futex_key() prototype to be able to do a correct test.

I wonder if it should be enforcing alignment to keep it on 1 page?

Otherwise,
Acked-by: Nick Piggin <[EMAIL PROTECTED]>

I'll be away for a couple of days, but I'll look at running some performance
tests when I get back.

Thanks,
Nick

--
SUSE Labs, Novell Inc.


possible NULL pointer usage

2007-04-07 Thread Cyrill Gorcunov
Hi,

from the function fs/udf/inode.c:udf_fill_inode -

...

UDF_I_DATA(inode) = kmalloc(inode->i_sb->s_blocksize -
        sizeof(struct extendedFileEntry), GFP_KERNEL);
memcpy(UDF_I_DATA(inode), bh->b_data + sizeof(struct extendedFileEntry),
        inode->i_sb->s_blocksize - sizeof(struct extendedFileEntry));

...

so that can lead to NULL pointer usage. udf_fill_inode() is
declared as 'void', and the question I have is: what is the
best way to handle a kmalloc() failure there?
Maybe just mark the inode as bad by calling make_bad_inode()
and return from the function?
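
A minimal userspace model of the proposed handling: check the allocation, flag the inode as bad and return early. Everything prefixed with demo_ is a hypothetical stand-in (demo_make_bad_inode() only mimics the role of make_bad_inode()), and the allocator is passed in purely so that the failure path can be exercised.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the in-memory inode. */
struct demo_inode {
	void *data;
	bool bad;
};

/* Allocator hook, so a test can inject an allocation failure. */
typedef void *(*demo_alloc_fn)(size_t);

/* Stand-in for make_bad_inode(): just records the failure. */
void demo_make_bad_inode(struct demo_inode *inode)
{
	inode->bad = true;
}

/* Allocator that always fails, to model kmalloc() returning NULL. */
void *demo_failing_alloc(size_t len)
{
	(void)len;
	return NULL;
}

/* Returns void like udf_fill_inode(), so failure is reported via the inode. */
void demo_fill_inode(struct demo_inode *inode, const void *src, size_t len,
		     demo_alloc_fn alloc)
{
	inode->data = alloc(len);
	if (!inode->data) {
		/* Never reach memcpy() with a NULL destination. */
		demo_make_bad_inode(inode);
		return;
	}
	memcpy(inode->data, src, len);
}
```

Callers then have to test the "bad" state after the fill function returns, which is what a void-returning function forces on them.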


Cyrill



Re: [patch] high-res timers: UP resume fix

2007-04-07 Thread Rafael J. Wysocki
On Saturday, 7 April 2007 10:48, Thomas Gleixner wrote:
> On Sat, 2007-04-07 at 10:25 +0200, Ingo Molnar wrote:
> > * Ingo Molnar <[EMAIL PROTECTED]> wrote:
> > 
> > > [...] Soeren, can you confirm that you are using a !CONFIG_SMP kernel, 
> > > and if yes, does the patch below fix the resume problem for you?
> > 
> > hm, you seem to have a CONFIG_SMP=y kernel. I dont immediately see where 
> > we re-enable interrupts in the SMP case, but could you try my patch 
> > nevertheless
> 
> We do in on_each_cpu() unconditionally. I missed that.

BTW, the on_each_cpu() in clock_was_set() is unnecessary, because
timekeeping_resume() is always run on one CPU.

Greetings,
Rafael


Re: [patch] high-res timers: UP resume fix

2007-04-07 Thread Ingo Molnar

* Rafael J. Wysocki <[EMAIL PROTECTED]> wrote:

> > We do in on_each_cpu() unconditionally. I missed that.
> 
> BTW, the on_each_cpu() in clock_was_set() is unnecessary, because 
> timekeeping_resume() is always run on one CPU.

yes - but that's not the only place where we do clock_was_set(), and the 
on_each_cpu() is necessary in every other case. So i think the right 
solution was the patch i did: to split the resume functionality from the 
clock_was_set() functionality.

Ingo


Re: [PATCH 1/8] Enhance process freezer interface for usage beyond software suspend

2007-04-07 Thread Nigel Cunningham
Hi.

On Sat, 2007-04-07 at 11:33 +0200, Rafael J. Wysocki wrote:
> On Saturday, 7 April 2007 00:20, Nigel Cunningham wrote:
> > > > > > > - current->flags |= PF_NOFREEZE;
> > > > > > > + freezer_exempt(FE_ALL);
> > > > > > >   pid = kernel_thread(do_linuxrc, "/linuxrc", SIGCHLD);
> > > > > > >   if (pid > 0) {
> > > > > > >   while (pid != sys_wait4(-1, NULL, 0, NULL))
> > > > > > 
> > > > > > Does this mean we have userland /linuxrc running with PF_NOFREEZE?
> > > > > > That would be very bad...
> > > > > 
> > > > > No, actually it is _required_ for the userland resume to work.  Well, 
> > > > > perhaps
> > > > > I should place a comment in there so that I don't have to explain 
> > > > > this again
> > > > > and again. :-)
> > > > 
> > > > Can you put big bold comment there?
> > > >
> > > > Why is it needed? Freezer never freezes _current_ task.
> > > 
> > > No, it doesn't, but this task spawns linuxrc and then calls sys_wait4() 
> > > in a
> > > loop.
> > > 
> > > Well, actually, I'll try to plant try_to_freeze() in this loop and see if 
> > > that
> > > works.  If it doesn't, I'll add a comment.
> > 
> > It works. I've had:
> > 
> > while (pid != sys_wait4(-1, NULL, 0, NULL)) {
> > yield();
> > try_to_freeze();
> > }
> > 
> > there for ages for Suspend2.
> 
> OK, thanks.  Is there any particular reason to place try_to_freeze() after
> yield()?

Not that I remember. I haven't touched that for years :)

Nigel



[patch] high-res timers: resume fix

2007-04-07 Thread Ingo Molnar

find updated patch below - only the patch description changed: i removed 
the 'UP' thing (patch has relevance on SMP too), and added Thomas' ack.

Ingo

>
Subject: [patch] high-res timers: resume fix
From: Ingo Molnar <[EMAIL PROTECTED]>

Soeren Sonnenburg reported that upon resume he is getting
this backtrace:

 [] smp_apic_timer_interrupt+0x57/0x90
 [] retrigger_next_event+0x0/0xb0
 [] apic_timer_interrupt+0x28/0x30
 [] retrigger_next_event+0x0/0xb0
 [] __kfifo_put+0x8/0x90
 [] on_each_cpu+0x35/0x60
 [] clock_was_set+0x18/0x20
 [] timekeeping_resume+0x7c/0xa0
 [] __sysdev_resume+0x11/0x80
 [] sysdev_resume+0x47/0x80
 [] device_power_up+0x5/0x10

it turns out that on resume we mistakenly re-enable interrupts.
Do the timer retrigger only on the current CPU.

Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]>
Acked-by: Thomas Gleixner <[EMAIL PROTECTED]>
---
 include/linux/hrtimer.h |3 +++
 kernel/hrtimer.c|   12 
 2 files changed, 15 insertions(+)

Index: linux/include/linux/hrtimer.h
===
--- linux.orig/include/linux/hrtimer.h
+++ linux/include/linux/hrtimer.h
@@ -206,6 +206,7 @@ struct hrtimer_cpu_base {
 struct clock_event_device;
 
 extern void clock_was_set(void);
+extern void hres_timers_resume(void);
 extern void hrtimer_interrupt(struct clock_event_device *dev);
 
 /*
@@ -236,6 +237,8 @@ static inline ktime_t hrtimer_cb_get_tim
  */
 static inline void clock_was_set(void) { }
 
+static inline void hres_timers_resume(void) { }
+
 /*
  * In non high resolution mode the time reference is taken from
  * the base softirq time variable.
Index: linux/kernel/hrtimer.c
===
--- linux.orig/kernel/hrtimer.c
+++ linux/kernel/hrtimer.c
@@ -459,6 +459,18 @@ void clock_was_set(void)
 }
 
 /*
+ * During resume we might have to reprogram the high resolution timer
+ * interrupt (on the local CPU):
+ */
+void hres_timers_resume(void)
+{
+   WARN_ON_ONCE(num_online_cpus() > 1);
+
+   /* Retrigger the CPU local events: */
+   retrigger_next_event(NULL);
+}
+
+/*
  * Check, whether the timer is on the callback pending list
  */
 static inline int hrtimer_cb_pending(const struct hrtimer *timer)


Re: [patch] high-res timers: UP resume fix

2007-04-07 Thread Rafael J. Wysocki
On Saturday, 7 April 2007 11:47, Ingo Molnar wrote:
> 
> * Rafael J. Wysocki <[EMAIL PROTECTED]> wrote:
> 
> > > We do in on_each_cpu() unconditionally. I missed that.
> > 
> > BTW, the on_each_cpu() in clock_was_set() is unnecessary, because 
> > timekeeping_resume() is always run on one CPU.
> 
> yes - but that's not the only place where we do clock_was_set(), and the 
> on_each_cpu() is necessary in every other case. So i think the right 
> solution was the patch i did: to split the resume functionality from the 
> clock_was_set() functionality.

Agreed.

Rafael


Re: [patch] high-res timers: UP resume fix

2007-04-07 Thread Thomas Gleixner
On Sat, 2007-04-07 at 11:47 +0200, Ingo Molnar wrote:
> * Rafael J. Wysocki <[EMAIL PROTECTED]> wrote:
> 
> > > We do in on_each_cpu() unconditionally. I missed that.
> > 
> > BTW, the on_each_cpu() in clock_was_set() is unnecessary, because 
> > timekeeping_resume() is always run on one CPU.
> 
> yes - but that's not the only place where we do clock_was_set(), and the 
> on_each_cpu() is necessary in every other case. So i think the right 
> solution was the patch i did: to split the resume functionality from the 
> clock_was_set() functionality.

Right, I reused it and just did not notice, that interrupts are enabled
unconditionally in on_each_cpu().

tglx




Re: [patch] high-res timers: resume fix

2007-04-07 Thread Rafael J. Wysocki
On Saturday, 7 April 2007 11:49, Ingo Molnar wrote:
> 
> find updated patch below - only the patch description changed: i removed 
> the 'UP' thing (patch has relevance on SMP too), and added Thomas' ack.
> 
>   Ingo
> 
> >
> Subject: [patch] high-res timers: resume fix
> From: Ingo Molnar <[EMAIL PROTECTED]>
> 
> Soeren Sonnenburg reported that upon resume he is getting
> this backtrace:
> 
>  [] smp_apic_timer_interrupt+0x57/0x90
>  [] retrigger_next_event+0x0/0xb0
>  [] apic_timer_interrupt+0x28/0x30
>  [] retrigger_next_event+0x0/0xb0
>  [] __kfifo_put+0x8/0x90
>  [] on_each_cpu+0x35/0x60
>  [] clock_was_set+0x18/0x20
>  [] timekeeping_resume+0x7c/0xa0
>  [] __sysdev_resume+0x11/0x80
>  [] sysdev_resume+0x47/0x80
>  [] device_power_up+0x5/0x10
> 
> it turns out that on resume we mistakenly re-enable interrupts.
> Do the timer retrigger only on the current CPU.
> 
> Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]>
> Acked-by: Thomas Gleixner <[EMAIL PROTECTED]>
> ---
>  include/linux/hrtimer.h |3 +++
>  kernel/hrtimer.c|   12 
>  2 files changed, 15 insertions(+)
> 
> Index: linux/include/linux/hrtimer.h
> ===
> --- linux.orig/include/linux/hrtimer.h
> +++ linux/include/linux/hrtimer.h
> @@ -206,6 +206,7 @@ struct hrtimer_cpu_base {
>  struct clock_event_device;
>  
>  extern void clock_was_set(void);
> +extern void hres_timers_resume(void);
>  extern void hrtimer_interrupt(struct clock_event_device *dev);
>  
>  /*
> @@ -236,6 +237,8 @@ static inline ktime_t hrtimer_cb_get_tim
>   */
>  static inline void clock_was_set(void) { }
>  
> +static inline void hres_timers_resume(void) { }
> +
>  /*
>   * In non high resolution mode the time reference is taken from
>   * the base softirq time variable.
> Index: linux/kernel/hrtimer.c
> ===
> --- linux.orig/kernel/hrtimer.c
> +++ linux/kernel/hrtimer.c
> @@ -459,6 +459,18 @@ void clock_was_set(void)
>  }
>  
>  /*
> + * During resume we might have to reprogram the high resolution timer
> + * interrupt (on the local CPU):
> + */
> +void hres_timers_resume(void)
> +{
> + WARN_ON_ONCE(num_online_cpus() > 1);
> +
> + /* Retrigger the CPU local events: */
> + retrigger_next_event(NULL);
> +}
> +
> +/*
>   * Check, whether the timer is on the callback pending list
>   */
>  static inline int hrtimer_cb_pending(const struct hrtimer *timer)
> -

Hm, I'm probably missing something obvious, but where is it going to be called
from?

Rafael


[PATCH] dtlk: fix error checks in module_init()

2007-04-07 Thread Akinobu Mita
This patch fixes two things in module_init.

- fix register_chrdev() error check

  Currently dtlk doesn't check register_chrdev() failure correctly.
  register_chrdev() returns a negative errno on failure.

- check probe failure

  dtlk ignores probe failure and allows the module loading without
  such device. I got "Trying to free nonexistent resource" message
  by release_region() when unloading module without device.

Signed-off-by: Akinobu Mita <[EMAIL PROTECTED]>
Cc: Chris Pallotta <[EMAIL PROTECTED]>
Cc: Jim Van Zandt <[EMAIL PROTECTED]>

---
 drivers/char/dtlk.c |7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

Index: 2.6-mm/drivers/char/dtlk.c
===
--- 2.6-mm.orig/drivers/char/dtlk.c
+++ 2.6-mm/drivers/char/dtlk.c
@@ -324,16 +324,22 @@ static int dtlk_release(struct inode *in
 
 static int __init dtlk_init(void)
 {
+   int err;
+
dtlk_port_lpc = 0;
dtlk_port_tts = 0;
dtlk_busy = 0;
dtlk_major = register_chrdev(0, "dtlk", &dtlk_fops);
-   if (dtlk_major == 0) {
+   if (dtlk_major < 0) {
printk(KERN_ERR "DoubleTalk PC - cannot register device\n");
-   return 0;
+   return -EBUSY;
+   }
+   err = dtlk_dev_probe();
+   if (err) {
+   unregister_chrdev(dtlk_major, "dtlk");
+   return err;
}
-   if (dtlk_dev_probe() == 0)
-   printk(", MAJOR %d\n", dtlk_major);
+   printk(", MAJOR %d\n", dtlk_major);
 
init_waitqueue_head(&dtlk_process_list);
 


Re: [PATCH, take4] FUTEX : new PRIVATE futexes

2007-04-07 Thread Eric Dumazet
On Sat, 07 Apr 2007 19:30:14 +1000
Nick Piggin <[EMAIL PROTECTED]> wrote:

> Eric Dumazet wrote:
 
> > 
> > - Current mm code have a problem with 64bit futexes, as spoted by Nick :
> > 
> > get_futex_key() does a check against sizeof(u32) regardless of futex being 
> > 64bits or not.
> > So it is possible a 64bit futex spans two pages of memory...
> > I had to change get_futex_key() prototype to be able to do a correct test.
> 
> I wonder if it should be enforcing alignment to keep it on 1 page?

I believe I just did that :)

Before the patch :

Alignment was only 4 bytes for all futexes, but some user app could trigger a 
kernel bug (since one 64bit futex could sit on two different pages, so possibly 
separate vmas, so the inode refcounting was wrong, and access_ok did not do a 
correct check)

After the patch :

Alignment is 8 bytes for 64 bit futexes, 4 bytes for 32bit futexes.
All futexes are constrained to be in one single page.
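
A small sketch of why natural alignment is enough to keep a futex on one page. The 4096-byte page size and both function names are assumptions for the example; this is not the get_futex_key() code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096u /* assumption: 4 KiB pages */

/* True if the object [addr, addr + size) lies within a single page. */
bool fits_in_one_page(uintptr_t addr, size_t size)
{
	return addr / DEMO_PAGE_SIZE == (addr + size - 1) / DEMO_PAGE_SIZE;
}

/*
 * Natural-alignment check: size must be a power of two (4 or 8 here).
 * A naturally aligned object whose size divides the page size can
 * never straddle a page boundary.
 */
bool futex_addr_aligned(uintptr_t addr, size_t size)
{
	return (addr & (size - 1)) == 0;
}
```

For example, address 4092 passes a 4-byte alignment check, but an 8-byte futex placed there would cover bytes 4092..4099 and cross into the next page; the 8-byte check rejects it.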



[patch, take #3] high-res timers: resume fix

2007-04-07 Thread Ingo Molnar

* Rafael J. Wysocki <[EMAIL PROTECTED]> wrote:

> Hm, I'm probably missing something obvious, but where is it going to 
> be called from?

doh! :) Find new patch below :-/ Soeren, please test this one.

Ingo

>
Subject: [patch] high-res timers: resume fix
From: Ingo Molnar <[EMAIL PROTECTED]>

Soeren Sonnenburg reported that upon resume he is getting
this backtrace:

 [] smp_apic_timer_interrupt+0x57/0x90
 [] retrigger_next_event+0x0/0xb0
 [] apic_timer_interrupt+0x28/0x30
 [] retrigger_next_event+0x0/0xb0
 [] __kfifo_put+0x8/0x90
 [] on_each_cpu+0x35/0x60
 [] clock_was_set+0x18/0x20
 [] timekeeping_resume+0x7c/0xa0
 [] __sysdev_resume+0x11/0x80
 [] sysdev_resume+0x47/0x80
 [] device_power_up+0x5/0x10

it turns out that on resume we mistakenly re-enable interrupts.
Do the timer retrigger only on the current CPU.

Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]>
Acked-by: Thomas Gleixner <[EMAIL PROTECTED]>
---
 include/linux/hrtimer.h |3 +++
 kernel/hrtimer.c|   12 
 kernel/timer.c  |2 +-
 3 files changed, 16 insertions(+), 1 deletion(-)

Index: linux/include/linux/hrtimer.h
===
--- linux.orig/include/linux/hrtimer.h
+++ linux/include/linux/hrtimer.h
@@ -206,6 +206,7 @@ struct hrtimer_cpu_base {
 struct clock_event_device;
 
 extern void clock_was_set(void);
+extern void hres_timers_resume(void);
 extern void hrtimer_interrupt(struct clock_event_device *dev);
 
 /*
@@ -236,6 +237,8 @@ static inline ktime_t hrtimer_cb_get_tim
  */
 static inline void clock_was_set(void) { }
 
+static inline void hres_timers_resume(void) { }
+
 /*
  * In non high resolution mode the time reference is taken from
  * the base softirq time variable.
Index: linux/kernel/hrtimer.c
===
--- linux.orig/kernel/hrtimer.c
+++ linux/kernel/hrtimer.c
@@ -459,6 +459,18 @@ void clock_was_set(void)
 }
 
 /*
+ * During resume we might have to reprogram the high resolution timer
+ * interrupt (on the local CPU):
+ */
+void hres_timers_resume(void)
+{
+   WARN_ON_ONCE(num_online_cpus() > 1);
+
+   /* Retrigger the CPU local events: */
+   retrigger_next_event(NULL);
+}
+
+/*
  * Check, whether the timer is on the callback pending list
  */
 static inline int hrtimer_cb_pending(const struct hrtimer *timer)
Index: linux/kernel/timer.c
===
--- linux.orig/kernel/timer.c
+++ linux/kernel/timer.c
@@ -1016,7 +1016,7 @@ static int timekeeping_resume(struct sys
clockevents_notify(CLOCK_EVT_NOTIFY_RESUME, NULL);
 
/* Resume hrtimers */
-   clock_was_set();
+   hres_timers_resume();
 
return 0;
 }



Re: kernel OOPSes when changing DVB-T adapter in 2.6.21-rc3

2007-04-07 Thread CIJOML
Hi,

I can confirm this fixes the problem.
Please include it in 2.6.21.

Michal

On Friday 06 April 2007 21:58, Markus Rechberger wrote:
> I committed a patch which fixes this issue:
>
> http://mcentral.de/hg/~mrec/v4l-dvb-stable/
>
> That problem got caused by releasing data structures which are still
> in use when the device gets unplugged. These patches delay the
> deallocation of the data till the last user releases its reference to
> the dvb nodes.
>
> This patch got tested with devices which use the dvb-usb framework as
> well as em28xx based dvb devices.
>
> Markus
>
> On 3/16/07, Oliver Neukum <[EMAIL PROTECTED]> wrote:
> > On Friday, 16 March 2007 10:13, CIJOML wrote:
> > > Hi,
> > >
> > > looks like a more general problem with 2.6.21-rc3. This happens when I
> > > remove my PCMCIA USB2.0/IEEE1394 adapter from the slot:
> >
> > Yes, the more important it is to know whether -rc2 works.
> > And please report this as a generic sysfs failure to lkml.
> >
> > Regards
> > Oliver


Re: [PATCH] timekeeping: drop irq-context clocksource polling

2007-04-07 Thread Andrew Morton
On Thu, 05 Apr 2007 14:03:16 -0700 Daniel Walker <[EMAIL PROTECTED]> wrote:

> Before this change the timekeeping code would poll the clocksource
> list every interrupt. This changes that so the clocksource list is
> only checked when there has been an update, and no longer checks
> in interrupt context.

I get a complete lockup on i386 SMP - before the kernel has printed anything.

I'm suspecting a recursive taking of xtime_lock.


Re: [patch, take #3] high-res timers: resume fix

2007-04-07 Thread Soeren Sonnenburg
On Sat, 2007-04-07 at 12:05 +0200, Ingo Molnar wrote:
> * Rafael J. Wysocki <[EMAIL PROTECTED]> wrote:
> 
> > Hm, I'm probably missing something obvious, but where is it going to 
> > be called from?
> 
> doh! :) Find new patch below :-/ Soeren, please test this one.

OK, I did about 5 suspend/resume cycles with

CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_HPET=y
CONFIG_HPET_MMAP=y

and no oops / no problem ...

So I guess the fix take #3 is good :-)

One thing not directly related to this patch (but probably to all the timer
stuff) that I noticed with -rc6 is that it takes 10 seconds to suspend (it
was ~2 seconds before).

Soeren

>   Ingo
> 
> >
> Subject: [patch] high-res timers: resume fix
> From: Ingo Molnar <[EMAIL PROTECTED]>
> 
> Soeren Sonnenburg reported that upon resume he is getting
> this backtrace:
> 
>  [] smp_apic_timer_interrupt+0x57/0x90
>  [] retrigger_next_event+0x0/0xb0
>  [] apic_timer_interrupt+0x28/0x30
>  [] retrigger_next_event+0x0/0xb0
>  [] __kfifo_put+0x8/0x90
>  [] on_each_cpu+0x35/0x60
>  [] clock_was_set+0x18/0x20
>  [] timekeeping_resume+0x7c/0xa0
>  [] __sysdev_resume+0x11/0x80
>  [] sysdev_resume+0x47/0x80
>  [] device_power_up+0x5/0x10
> 
> it turns out that on resume we mistakenly re-enable interrupts.
> Do the timer retrigger only on the current CPU.
> 
> Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]>
> Acked-by: Thomas Gleixner <[EMAIL PROTECTED]>
> ---
>  include/linux/hrtimer.h |3 +++
>  kernel/hrtimer.c|   12 
>  kernel/timer.c  |2 +-
>  3 files changed, 16 insertions(+), 1 deletion(-)
> 
> Index: linux/include/linux/hrtimer.h
> ===
> --- linux.orig/include/linux/hrtimer.h
> +++ linux/include/linux/hrtimer.h
> @@ -206,6 +206,7 @@ struct hrtimer_cpu_base {
>  struct clock_event_device;
>  
>  extern void clock_was_set(void);
> +extern void hres_timers_resume(void);
>  extern void hrtimer_interrupt(struct clock_event_device *dev);
>  
>  /*
> @@ -236,6 +237,8 @@ static inline ktime_t hrtimer_cb_get_tim
>   */
>  static inline void clock_was_set(void) { }
>  
> +static inline void hres_timers_resume(void) { }
> +
>  /*
>   * In non high resolution mode the time reference is taken from
>   * the base softirq time variable.
> Index: linux/kernel/hrtimer.c
> ===
> --- linux.orig/kernel/hrtimer.c
> +++ linux/kernel/hrtimer.c
> @@ -459,6 +459,18 @@ void clock_was_set(void)
>  }
>  
>  /*
> + * During resume we might have to reprogram the high resolution timer
> + * interrupt (on the local CPU):
> + */
> +void hres_timers_resume(void)
> +{
> + WARN_ON_ONCE(num_online_cpus() > 1);
> +
> + /* Retrigger the CPU local events: */
> + retrigger_next_event(NULL);
> +}
> +
> +/*
>   * Check, whether the timer is on the callback pending list
>   */
>  static inline int hrtimer_cb_pending(const struct hrtimer *timer)
> Index: linux/kernel/timer.c
> ===
> --- linux.orig/kernel/timer.c
> +++ linux/kernel/timer.c
> @@ -1016,7 +1016,7 @@ static int timekeeping_resume(struct sys
>   clockevents_notify(CLOCK_EVT_NOTIFY_RESUME, NULL);
>  
>   /* Resume hrtimers */
> - clock_was_set();
> + hres_timers_resume();
>  
>   return 0;
>  }
> 
-- 
Sometimes, there's a moment as you're waking, when you become aware of
the real world around you, but you're still dreaming.


Re: [PATCH] console UTF-8 fixes

2007-04-07 Thread Jan Engelhardt
Hi,


I just wanted to give my opinion on things...

(and enable utf8 to read this properly)

On Apr 7 2007 11:24, Egmont Koblinger wrote:
>
>> I strongly disagree.  First of all, you're changing the semantics of a 
>> 13-year-old API.  The semantics of the Linux console is that by 
>> specifying U+FFFD SUBSTITUTION GLYPH in your unicode table, you have 
>> specified the fallback glyph.
>
>OK, I'm not against using U+FFFD for missing glyphs. In the mean time I
>think it's still a good idea to clearly separate the two cases in the code
>(that is, the case of invalid sequence from the case of missing glyph), but
>we can still use the same replacement character in these two cases. I'll
>send an updated patch after Easter if it sounds good for you.

I am quite ok with the way things are right now.

 - vc displays  for illegal sequences

 - vc displays e.g. "U" (latin capital U) in place when Û (latin capital
   U with accent circumflex) is not available in this font 
   (determined by the unicodemap) (I do use an unicode map, because I
   use a 4096-byte cp437 "DOS" font which requires one)

 - vc displays  for sequences it does not know how to print

 - xterm displays  for illegal sequences

 - xterm seems to display  on undefined glyphs (U+DFFF for ex.,
   using the "Unicode Best" font from the xterm menu)

 - xterm seems to display nothing on undefined glyphs (U+E000 for ex.,
   "Unicode Best" again)

>> What's worse, you've hard-coded the uses of specific visual 
>> representations.  That is completely unacceptable.
>
>Now that we've dropped the idea of "dot" for missing glyphs, the other thing
>
>[...]
>
>Sorry, I wasn't clear enough and I think you misunderstood me. The symbol I
>choose for fallback is still '?' (the ASCII question mark), I just invert
>the color attributes of the cell where this is printed. This way it becomes
>visually distinguishable from the literal question mark. Using the current
>kernel you just cannot know whether the character printed is a real question
>mark, or a replacement glyph. Still, should you strongly disagree with this
>decision, the color inverting part can easily be removed.

Please, no dot, and no inverse color.
Imagine someone had the following bitmap for :





####
####

####
####

####
####





Then inverting that again would be susceptible to confusion with
the regular '?' at 0x3F. 

(cp437 for example maps unknown/illegal to 0xFD which happens to be the
block graphic '■', but YMMV depending on font.)

>I think I've (mostly) described it above. Set everything to UTF-8, load a
>latin2 font (containing 256 glyphs, e.g. "setfont lat2-16"), make an
>application print U+00FB (alt + numpad 251 is one trivial way), you'll see
>an "u with double accent", though the symbol to be displayed is "u with
>circumflex". This isn't present in the current font, so the replacement
>character should appear, not a different letter.

I blame your latin2 unicode map. (See above about 'Û'.)
It should perhaps display a regular 'u' if it cannot display 'û',
but definitely not 'ü' (which is not called a double accent, btw).

>> To be able to do CJK you need something like Kon anyway.  This feels 
>> like bloat.
>
>I don't want CJK support. All that I want is to be able to edit English
>words within a file that contains mixture of English and CJK, with a text
>editor like vim or joe.

+1 for this one :)

xterm## echo "韓国と日本にようこそ!" >/tmp/foobar.txt
vc## cat foobar.txt

currently gets things not so right, because multibyte characters are not
displayed with as many  as they are wide.
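
To make the two cases in this thread concrete (an invalid byte sequence versus a valid code point that the font simply lacks), here is a minimal userspace sketch of a UTF-8 decoder. It is not the console code from the kernel; the function name `utf8_decode` and its return convention are made up for illustration. An invalid sequence yields U+FFFD at decode time, while a missing glyph would only be detected later, in a separate font-table lookup on the decoded code point. It also shows that a multibyte sequence is one code point, which the terminal has to know before it can decide how many cells to fill.

```c
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT 0xFFFDu	/* U+FFFD, used here for malformed input */

/*
 * Decode one code point from s[0..n).
 * Stores the result in *cp and returns the number of bytes consumed
 * (0 only when n == 0). Malformed input (stray continuation byte,
 * truncated or overlong sequence, surrogate, value above U+10FFFF)
 * stores U+FFFD and consumes exactly one byte.
 */
static size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
	uint32_t c, min;
	size_t len, i;

	*cp = REPLACEMENT;
	if (n == 0)
		return 0;

	if (s[0] < 0x80) {			/* plain ASCII */
		*cp = s[0];
		return 1;
	} else if ((s[0] & 0xE0) == 0xC0) {	/* 2-byte lead */
		len = 2; c = s[0] & 0x1F; min = 0x80;
	} else if ((s[0] & 0xF0) == 0xE0) {	/* 3-byte lead */
		len = 3; c = s[0] & 0x0F; min = 0x800;
	} else if ((s[0] & 0xF8) == 0xF0) {	/* 4-byte lead */
		len = 4; c = s[0] & 0x07; min = 0x10000;
	} else {
		return 1;			/* invalid lead byte */
	}

	if (n < len)
		return 1;			/* truncated sequence */

	for (i = 1; i < len; i++) {
		if ((s[i] & 0xC0) != 0x80)
			return 1;		/* not a continuation byte */
		c = (c << 6) | (s[i] & 0x3F);
	}

	/* Reject overlong forms, surrogates and out-of-range values. */
	if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
		return 1;

	*cp = c;
	return len;
}
```

With this split, the bytes C3 BB decode to U+00FB ('û') regardless of the font; whether the font then shows 'û', a fallback letter, or the replacement glyph is a decision for the unicode map, not for the decoder.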


Jan
-- 


Re: Two questions regarding Opening files within Kernel!

2007-04-07 Thread Jan Engelhardt

On Apr 7 2007 06:58, JanuGerman wrote:
>Hi Every one,
>
>  I have got two questions regarding opening files within the Linux
>  kernel. If somebody can help me sort out this problem, I will
>  be very thankful.
>
>1)   I have just a file path with me, an absolute path, but no dentry,
>   no inode, no vfsmount object. Which function can I call to get a
>   "file" object associated with the absolute file path? I have surfed
>   around the source code, especially fs/open.c and some other files,
>   but each function requires a parameter "mode" and "fd" besides the file
>   path. Actually, I was confused about the "mode" parameter (and its
>   difference from "flag"), like what to send, and secondly for "fd", I
>   am not sure what value to send, as there is in fact no file and only
>   a file path exists. Any idea?

Not sure if this is the right function, but it should get you started...

struct dentry *fbar = lookup_one_len("/foo/bar", current->fs->root);

>2) Any functionality within linux kernel source code, to read one line 
>   per file? or some indirect way to set buffer size for one read?. 
>   That is, any existing header file for doing text I/O rather than 
>   binary within the kernel source code?

http://kernelnewbies.org/FAQ/WhyWritingFilesFromKernelIsBad
(same goes for reading)
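
On the "mode" versus "flag" confusion in the quoted question: the distinction is the same one userspace sees in open(2), so a small userspace sketch may help. The flags say how the file is opened (read, write, create, truncate); the mode is only the permission bits given to a file that gets created, and it is ignored when O_CREAT is absent. The helper name `create_then_reopen` is made up for this example. As far as I recall, kernels of this era also have filp_open(filename, flags, mode), which takes the same pair and hands back a struct file pointer without any fd; treat that as an assumption and check it against your tree (and the FAQ link above still applies).

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create (or truncate) path for writing, then reopen it read-only.
 * Returns a read-only file descriptor, or -1 on error.
 */
static int create_then_reopen(const char *path)
{
	/* flags: HOW to open; mode 0644: permissions IF the file is created */
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0)
		return -1;
	close(fd);

	/* No O_CREAT here, so no mode argument is needed. */
	return open(path, O_RDONLY);
}
```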


Jan
-- 


Re: [PATCH, take4] FUTEX : new PRIVATE futexes

2007-04-07 Thread Jakub Jelinek
On Sat, Apr 07, 2007 at 10:43:39AM +0200, Eric Dumazet wrote:
> get_futex_key() does a check against sizeof(u32) regardless of futex being 
> 64bits or not.
> So it is possible a 64bit futex spans two pages of memory...

That would be a user bug.  32-bit futexes have to be 32-bit aligned, 64-bit
futexes have to be 64-bit aligned.

Jakub


Re: broken device locking, sg vs. sg_io on block devices

2007-04-07 Thread Eduard Bloch
#include 

First, we (me and Thomas Schmidt) are working on a draft for a mandatory
locking scheme which will take care of the most racy situations even
without having a proper in-kernel solution. But you need to explain some
things, otherwise we cannot rely on your words.

> (open has side effects relocking doesnt)

What exactly does that mean in our scope?

Can we do the following without having side effects:

open("/dev/sr0",O_EXCL|O_RDWR); /* no matter what it returns */
fcntl(..., F_SETLK); /* no matter what it returns */
ioctl(f, SCSI_IOCTL_GET_IDLUN, &x);
ioctl(f, SCSI_IOCTL_GET_BUS_NUMBER, &jo);

Can you guarantee us that bit? 

Or shall we really implement ugly workarounds to avoid every open call?
Note that "just do like UUCP guys" is not as easy or reliable as people
may pretend.
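
For reference, the fcntl() step in the sequence above takes a struct flock as its third argument. A minimal sketch of a non-blocking, whole-file advisory write lock follows; the helper name `try_lock_whole_file` is made up, and the sketch says nothing about whether open() on /dev/sr0 has side effects, which is the actual question here.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Try to take an exclusive advisory lock on the whole file without
 * blocking. Returns 0 on success, -1 with errno set (EACCES or EAGAIN
 * when another process already holds a conflicting lock).
 * The descriptor must be open for writing for F_WRLCK to be accepted.
 */
static int try_lock_whole_file(int fd)
{
	struct flock fl;

	memset(&fl, 0, sizeof(fl));
	fl.l_type = F_WRLCK;	/* exclusive (write) lock */
	fl.l_whence = SEEK_SET;	/* l_start is relative to start of file */
	fl.l_start = 0;
	fl.l_len = 0;		/* 0 means "until end of file" */

	return fcntl(fd, F_SETLK, &fl);
}
```

Note that such locks are advisory: they only help if every program touching the device takes the same lock.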

Eduard.

-- 
Well, it is a garbage collector after all. It even fetches the garbage from the sky.
   (Heise troll forum on Java in aircraft control)


compressing intermediate files with LZO on the fly

2007-04-07 Thread Al Boldi
Willy Tarreau wrote:
>
> ... for some usages (temporary space),
> light compression can increase speed. For instance, when processing logs,
> I get better speed by compressing intermediate files with LZO on the fly.

How can you do that on ext3?

Also, can you do that on a partition block-io level?


Thanks!

--
Al



[PATCH nf-2.6.22] [netfilter] early_drop imrovement

2007-04-07 Thread Vasily Averin
When the number of conntracks reaches the nf_conntrack_max limit, early_drop()
is called and tries to free one of the conntracks already in use in one of the
hash buckets. If it does not find any conntrack that may be freed, this leads
to transmission errors.
However, this is not fair, because the current hash bucket may be empty while
the neighbouring ones hold conntracks that can be freed. On the other hand,
the number of checked conntracks is not limited, which can cause a long delay.
The following patch limits the number of checked conntracks to the average
number of conntracks in one hash bucket and allows searching for conntracks in
other hash buckets.

Signed-off-by:  Vasily Averin <[EMAIL PROTECTED]>

diff --git a/net/netfilter/nf_conntrack_core.c 
b/net/netfilter/nf_conntrack_core.c
index e132c8a..d0b5794 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -525,7 +525,7 @@ EXPORT_SYMBOL_GPL(nf_conntrack_tuple_taken);

 /* There's a small race here where we may free a just-assured
connection.  Too bad: we're in trouble anyway. */
-static int early_drop(struct list_head *chain)
+static int __early_drop(struct list_head *chain, unsigned int *cnt)
 {
/* Traverse backwards: gives us oldest, which is roughly LRU */
struct nf_conntrack_tuple_hash *h;
@@ -540,6 +540,10 @@ static int early_drop(struct list_head *chain)
atomic_inc(&ct->ct_general.use);
break;
}
+   if (!--(*cnt)) {
+   dropped = 1;
+   break;
+   }
}
read_unlock_bh(&nf_conntrack_lock);

@@ -555,6 +559,21 @@ static int early_drop(struct list_head *chain)
return dropped;
 }

+static int early_drop(const struct nf_conntrack_tuple *orig)
+{
+   unsigned int i, hash, cnt;
+   int ret = 0;
+
+   hash = hash_conntrack(orig);
+   cnt = (nf_conntrack_max/nf_conntrack_htable_size) + 1;
+
+   for (i = 0;
+   !ret && i < nf_conntrack_htable_size;
+   ++i, hash = ++hash % nf_conntrack_htable_size)
+   ret = __early_drop(&nf_conntrack_hash[hash], &cnt);
+   return ret;
+}
+
 static struct nf_conn *
 __nf_conntrack_alloc(const struct nf_conntrack_tuple *orig,
 const struct nf_conntrack_tuple *repl,
@@ -574,9 +593,7 @@ __nf_conntrack_alloc(const struct nf_conntrack_tuple *orig,

if (nf_conntrack_max
&& atomic_read(&nf_conntrack_count) > nf_conntrack_max) {
-   unsigned int hash = hash_conntrack(orig);
-   /* Try dropping from this hash chain. */
-   if (!early_drop(&nf_conntrack_hash[hash])) {
+   if (!early_drop(orig)) {
atomic_dec(&nf_conntrack_count);
if (net_ratelimit())
printk(KERN_WARNING


Re: [PATCH, take4] FUTEX : new PRIVATE futexes

2007-04-07 Thread Eric Dumazet

Jakub Jelinek wrote:
> On Sat, Apr 07, 2007 at 10:43:39AM +0200, Eric Dumazet wrote:
> > get_futex_key() does a check against sizeof(u32) regardless of futex being
> > 64bits or not.
> > So it is possible a 64bit futex spans two pages of memory...
>
> That would be a user bug.  32-bit futexes have to be 32-bit aligned, 64-bit
> futexes have to be 64-bit aligned.


I am not sure what you mean.

A user doing sys_futex64(0x..FFC, FUTEX_WAKE_OP, ...) and crashing the kernel or
corrupting data is OK because it's a user bug?



The user is allowed to do anything; the kernel must check and protect the innocent.
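
To illustrate the 0x..FFC case, here is a small userspace sketch; it is not the get_futex_key() code. It assumes 4096-byte pages, and all names are made up. The point is that a check against sizeof(u32) accepts an address ending in FFC, although an 8-byte object placed there runs into the next page. Checking alignment against the real operand size rules this out, because a naturally aligned power-of-two-sized object no larger than a page never crosses a page boundary.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed page size for this sketch; a kernel would use PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096u

/* Nonzero if addr is a multiple of size (size must be a power of two). */
static int naturally_aligned(uintptr_t addr, size_t size)
{
	return (addr & (size - 1)) == 0;
}

/* Nonzero if the object [addr, addr + size) touches more than one page. */
static int spans_two_pages(uintptr_t addr, size_t size)
{
	return (addr / SKETCH_PAGE_SIZE) !=
	       ((addr + size - 1) / SKETCH_PAGE_SIZE);
}

/*
 * Hypothetical entry check: accept the futex address only if it is
 * aligned for the operand size actually in use (4 or 8 bytes).
 */
static int futex_addr_ok(uintptr_t addr, size_t size)
{
	return naturally_aligned(addr, size);
}
```

For an address such as 0x1FFC, the 4-byte check passes and the word stays inside one page, while the 8-byte check fails, which is exactly the case that would otherwise span two pages.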


Re: compressing intermediate files with LZO on the fly

2007-04-07 Thread Willy Tarreau
Hi Al,

On Sat, Apr 07, 2007 at 02:32:34PM +0300, Al Boldi wrote:
> Willy Tarreau wrote:
> >
> > ... for some usages (temporary space),
> > light compression can increase speed. For instance, when processing logs,
> > I get better speed by compressing intermediate files with LZO on the fly.
> 
> How can you do that on ext3?
> 
> Also, can you do that on a partition block-io level?

No, sorry for the confusion. My scripts simply do :

 $ lzop -cd file1.lzo | process | lzop -c3 > file2.lzo

With a decent CPU, you can reach higher read/write data rates than what a
single off-the-shelf disk can achieve. For this reason, I think that
reiser4 would be worth trying for this particular usage. And in this case,
I'm not interested at all in reliability. It's just temporary storage. If
the disk fails, I throw it away and buy a new one.

Cheers,
Willy



Re: [PATCH nf-2.6.22] [netfilter] early_drop imrovement

2007-04-07 Thread Eric Dumazet

Vasily Averin wrote:
> When the number of conntracks reaches the nf_conntrack_max limit, early_drop()
> is called and tries to free one of the conntracks already in use in one of the
> hash buckets. If it does not find any conntrack that may be freed, this leads
> to transmission errors.
> However, this is not fair, because the current hash bucket may be empty while
> the neighbouring ones hold conntracks that can be freed. On the other hand,
> the number of checked conntracks is not limited, which can cause a long delay.
> The following patch limits the number of checked conntracks to the average
> number of conntracks in one hash bucket and allows searching for conntracks in
> other hash buckets.

Hi Vasily

>                 atomic_inc(&ct->ct_general.use);
>                 break;
>         }
> +       if (!--(*cnt)) {
> +               dropped = 1;
> +               break;
> +       }

> +   cnt = (nf_conntrack_max/nf_conntrack_htable_size) + 1;


I am sorry, but this won't help in the case you mentioned in an earlier mail:

If nf_conntrack_max < nf_conntrack_htable_size, cnt will be set to 1.

Then in __early_drop() you end up breaking out of the
list_for_each_entry_reverse() loop after the first element has been tested!
Not what you intended, I'm afraid, because you won't even scan the whole chain
as was done before your patch :(


I believe you should not test --cnt in __early_drop() but in the caller.

(That is, count not the number of cells found, but the number of hash chains
you tried.)
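
A userspace model of that suggestion, to make the difference visible. This is only a sketch: the table layout, `HTABLE_SIZE`, `CHAIN_MAX` and the function names are stand-ins, not the nf_conntrack structures. Each chain is always scanned completely from its tail, and the budget counts chains tried, so a budget of 1 (which is what max/htable_size + 1 gives by integer division whenever max < htable_size) still covers the whole home chain. The wrap-around uses hash = (hash + 1) % size rather than the patch's hash = ++hash % size, since the latter modifies hash twice in one expression, which as far as I know is undefined behaviour in C.

```c
#include <stddef.h>

#define HTABLE_SIZE 8u	/* stand-in for nf_conntrack_htable_size */
#define CHAIN_MAX   4u	/* fixed chain capacity, for the model only */

struct chain {
	size_t len;
	int assured[CHAIN_MAX];	/* 1 = must be kept, 0 = may be dropped */
};

/* Scan one whole chain, oldest (tail) first; index of a victim or -1. */
static int scan_chain(const struct chain *c)
{
	size_t i;

	for (i = c->len; i-- > 0; )
		if (!c->assured[i])
			return (int)i;
	return -1;
}

/*
 * Try at most max_chains chains, starting at the home bucket and
 * wrapping around. Returns 1 and sets *found_hash if a droppable
 * entry exists in one of them, 0 otherwise.
 */
static int early_drop_model(const struct chain *table, unsigned int start,
			    unsigned int max_chains, unsigned int *found_hash)
{
	unsigned int hash = start % HTABLE_SIZE;
	unsigned int tried;

	for (tried = 0; tried < max_chains && tried < HTABLE_SIZE; tried++) {
		if (scan_chain(&table[hash]) >= 0) {
			*found_hash = hash;
			return 1;
		}
		hash = (hash + 1) % HTABLE_SIZE;
	}
	return 0;
}
```

With the budget counted per entry instead, a home chain whose newest entries are all assured would be abandoned after the first one, even though an older droppable entry sits further along the same chain.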





[dm-devel] bio too big device md1 (16 > 8)

2007-04-07 Thread syrius . ml

Hi,

I'm using 2.6.21-rc5-git9 +
http://www.kernel.org/pub/linux/kernel/people/agk/patches/2.6/editing/dm-merge-max_hw_sector.patch

(I've been testing with and without it, and first encountered the problem on
2.6.18-debian.)

I've set up a RAID1 array md1 (it was created in degraded mode using
the Debian installer).
(md0 is also a small RAID1 array created in degraded mode, but I did
not have any issue with it.)

md1 holds an LVM physical volume, which holds a VG and several LVs.

mdadm -D /dev/md1:
/dev/md1:
Version : 00.90.03
  Creation Time : Sun Mar 25 16:34:42 2007
 Raid Level : raid1
 Array Size : 290607744 (277.15 GiB 297.58 GB)
Device Size : 290607744 (277.15 GiB 297.58 GB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 1
Persistence : Superblock is persistent

Update Time : Tue Apr  3 01:37:23 2007
  State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

   UUID : af8d2807:e573935d:04be1e12:bc7defbb
 Events : 0.422096

    Number   Major   Minor   RaidDevice State
       0       3        3        0      active sync   /dev/hda3
       1       0        0        1      removed


The problem I'm encountering occurs when I add /dev/md2 to /dev/md1.

mdadm -D /dev/md2
/dev/md2:
Version : 00.90.03
  Creation Time : Sun Apr  1 15:06:43 2007
 Raid Level : linear
 Array Size : 290607808 (277.15 GiB 297.58 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 2
Persistence : Superblock is persistent

Update Time : Sun Apr  1 15:06:43 2007
  State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

   Rounding : 64K

   UUID : 887ecdeb:5f205eb6:4cd470d6:4cbda83c (local to host odo)
 Events : 0.1

    Number   Major   Minor   RaidDevice State
       0      34        4        0      active sync   /dev/hdg4
       1      57        2        1      active sync   /dev/hdk2
       2      91        3        2      active sync   /dev/hds3
       3      89        2        3      active sync   /dev/hdo2

I use mdadm --manage --add /dev/md1 /dev/md2.
When I do so, here is what happens:
md: bind
RAID1 conf printout:
 --- wd:1 rd:2
 disk 0, wo:0, o:1, dev:hda3
 disk 1, wo:1, o:1, dev:md2
md: syncing RAID array md1
md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
md: using maximum available idle IO bandwidth (but not more than 20 KB/sec) 
for reconstruction.
md: using 128k window, over a total of 290607744 blocks.
bio too big device md1 (16 > 8)
Device dm-7, XFS metadata write error block 0x243ec0 in dm-7
bio too big device md1 (16 > 8)
I/O error in filesystem ("dm-8") meta-data dev dm-8 block 0x1b5b6550   
("xfs_trans_read_buf") error 5 buf count 8192
bio too big device md1 (16 > 8)
I/O error in filesystem ("dm-8") meta-data dev dm-8 block 0x1fb3b00   
("xfs_trans_read_buf") error 5 buf count 8192

Every filesystem on md1 gets corrupted.
I manually fail md2 and then reboot, after which I can use the filesystems again
(but md1 is still degraded).

Any idea?
I can provide more information if needed. (The only weird thing is
/dev/hdo, which doesn't seem to be lba48-ready, but I guess that
shouldn't be a geometry issue.)
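
A possible reading of the numbers, as a sketch only: assuming the message prints the bio size against the queue's sector limit, both in 512-byte sectors, "16 > 8" means an 8192-byte request (which matches "buf count 8192" in the XFS errors) hitting a 4096-byte limit. The model below shows how a stacked device would end up with the smallest limit of its members; the function names and the member limit of 255 are hypothetical, and this is not the block layer code.

```c
#include <stddef.h>

#define SECTOR_BYTES 512u

/* A stacked device can only accept what its most restrictive member can. */
static unsigned int stacked_max_sectors(const unsigned int *limits, size_t n)
{
	unsigned int m;
	size_t i;

	if (n == 0)
		return 0;
	m = limits[0];
	for (i = 1; i < n; i++)
		if (limits[i] < m)
			m = limits[i];
	return m;
}

/* Nonzero if a request of bio_sectors fits under the queue limit. */
static int bio_fits(unsigned int bio_sectors, unsigned int max_sectors)
{
	return bio_sectors <= max_sectors;
}
```

If that reading is right, the layers above md1 were sized for the old limit and keep sending 16-sector requests after the new member lowered it to 8.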

-- 


Re: mm snapshot broken-out-2007-04-07-03-27.tar.gz uploaded

2007-04-07 Thread Michal Piotrowski
[EMAIL PROTECTED] wrote:
> The mm snapshot broken-out-2007-04-07-03-27.tar.gz has been uploaded to
> 
>
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/mm/broken-out-2007-04-07-03-27.tar.gz
> 
> It contains the following patches against 2.6.21-rc6:
> 

LTP triggered a ptrace problem.

[ cut here ]
kernel BUG at kernel/ptrace.c:1281!
invalid opcode:  [#1]
PREEMPT SMP 
last sysfs file: devices/platform/w83627hf.656/temp2_input
Modules linked in: ipt_MASQUERADE iptable_nat nf_nat nfsd exportfs lockd 
nfs_acl autofs4 sunrpc af_packet nf_conntrack_netbios_ns ipt_REJECT 
nf_conntrack_ipv4 xt_state nf_conntrack nfnetlink iptable_filter ip_tables 
ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 binfmt_misc 
thermal processor fan container nvram snd_intel8x0 snd_ac97_codec ac97_bus 
snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss 
snd_mixer_oss intel_agp snd_pcm agpgart evdev snd_timer snd soundcore i2c_i801 
snd_page_alloc ide_cd cdrom rtc unix
CPU:1
EIP:0060:[]Not tainted VLI
EFLAGS: 00010202   (2.6.21-rc6-mm1 #1)
EIP is at ptrace_do_wait+0x1eb/0x510

 l *0xc0163566
0xc0163566 is in ptrace_do_wait (kernel/ptrace.c:1281).
1276
1277            pr_debug("%d ptrace_do_wait (%d) found %d code %x (%u/%d)\n",
1278                     current->pid, tsk->pid, p->pid, exit_code,
1279                     p->exit_state, p->exit_signal);
1280
1281            NO_LOCKS;
1282
1283            /*
1284             * If there was a group exit in progress, all threads report that
1285             * status.  Most will have SIGKILL in their own exit_code.


eax: 0001   ebx: cff43550   ecx: c04b043c   edx: 0001
esi: fff6   edi: 000c   ebp: c96c5f10   esp: c96c5ee8
ds: 007b   es: 007b   fs: 00d8  gs: 0033  ss: 0068
Process ptrace01 (pid: 8610, ti=c96c4000 task=c9554b00 task.ti=c96c4000)
Stack: 0002 0004 21a3 c9554b00 cc8aa808  c96c5f10 cff43550 
   0001 fff6 c96c5f80 c0127e10  bf833c78  0134 
    0004 21a3 c9554b00 0001 0001  c9554bbc 
Call Trace:
 [] do_wait+0x9d6/0xbad
 [] sys_wait4+0x30/0x32
 [] sys_waitpid+0x27/0x29
 [] syscall_call+0x7/0xb
 [] 0xb7f36410
 ===
INFO: lockdep is turned off.
Code: a9 9d 0b 00 85 c0 74 05 e8 87 b8 1e 00 89 e0 25 00 e0 ff ff 31 d2 83 78 
14 00 0f 95 c2 b8 3c 04 4b c0 e8 86 9d 0b 00 85 c0 74 04 <0f> 0b eb fe 8b 83 5c 
04 00 00 f6 40 54 08 74 03 8b 78 44 83 bb 
EIP: [] ptrace_do_wait+0x1eb/0x510 SS:ESP 0068:c96c5ee8
[ cut here ]
kernel BUG at kernel/ptrace.c:494!
invalid opcode:  [#2]
PREEMPT SMP 
last sysfs file: devices/platform/w83627hf.656/temp2_input
Modules linked in: ipt_MASQUERADE iptable_nat nf_nat nfsd exportfs lockd 
nfs_acl autofs4 sunrpc af_packet nf_conntrack_netbios_ns ipt_REJECT 
nf_conntrack_ipv4 xt_state nf_conntrack nfnetlink iptable_filter ip_tables 
ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 binfmt_misc 
thermal processor fan container nvram snd_intel8x0 snd_ac97_codec ac97_bus 
snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss 
snd_mixer_oss intel_agp snd_pcm agpgart evdev snd_timer snd soundcore i2c_i801 
snd_page_alloc ide_cd cdrom rtc unix
CPU:1
EIP:0060:[]Not tainted VLI
EFLAGS: 00010202   (2.6.21-rc6-mm1 #1)
EIP is at ptrace_exit+0x29/0x21d

l *0xc0163f22
0xc0163f22 is in ptrace_exit (kernel/ptrace.c:494).
489     ptrace_exit(struct task_struct *tsk)
490     {
491             struct list_head *pos, *n;
492             int restart;
493
494             NO_LOCKS;
495
496             /*
497              * Taking the task_lock after PF_EXITING is set ensures that a
498              * child in ptrace_traceme will not put itself on our list when

eax: 0001   ebx:    ecx: c04b1044   edx: 0001
esi: c96c5ee8   edi: c9554b00   ebp: c96c5d78   esp: c96c5d60
ds: 007b   es: 007b   fs: 00d8  gs: 0033  ss: 0068
Process ptrace01 (pid: 8610, ti=c96c4000 task=c9554b00 task.ti=c96c4000)
Stack:   c96c5d78  c96c5ee8 c9554b00 c96c5db8 c012820e 
   0001 0286 c96c5da8 c011fec8 0001 c04a007b c96c007b c02100d8 
   ff10 c96c5eb0 000b c96c5eb0 c96c5ee8 c03f0068 c96c5de8 c0105a77 
Call Trace:
 [] do_exit+0x16b/0x86c
 [] die+0x206/0x22c
 [] do_trap+0x8a/0xa4
 [] do_invalid_op+0x88/0x92
 [] error_code+0x79/0x80
 [] ptrace_do_wait+0x1eb/0x510
 [] do_wait+0x9d6/0xbad
 [] sys_wait4+0x30/0x32
 [] sys_waitpid+0x27/0x29
 [] syscall_call+0x7/0xb
 [] 0xb7f36410
 ===
INFO: lockdep is turned off.
Code: 5d c3 55 89 e5 57 56 53 83 ec 0c 89 c7 89 e0 25 00 e0 ff ff 31 d2 83 78 
14 00 0f 95 c2 b8 44 10 4b c0 e8 ca 93 0b 00 85 c0 74 04 <0f> 0b eb fe 8d 9f b8 
04 00 00 89 d8 e8 e7 d5 1e 00 8d 87 40 0a 
EIP: [] ptrace_exit+0x29/0x21d SS:ESP 0068:c96c5d60
Fixing recursive fault but reboot is needed!
BUG: scheduling while atomic: ptrace

Re: REISER4: fix for reiser4_write_extent

2007-04-07 Thread Laurent Riffard

On 06.04.2007 00:42, Ignatich wrote:
> While trying to find the cause of problems with reiser4 in recent
> kernels I came across this.
>
> Incomplete write handling seems to be missing from reiser4_write_extent()
> thanks to reiser4-temp-fix.patch. Strangely, there is a patch by Edward
> Shishkin that should address that issue, but it is missing from -mm
> tree. Please check.
>
>    Max

This patch was added to the -mm tree on 14 Dec 2006 (see
http://www.mail-archive.com/mm-commits@vger.kernel.org/msg05338.html).

It was then dropped from the -mm tree on 05 Mar 2007 (see
http://www.mail-archive.com/mm-commits@vger.kernel.org/msg10818.html),
with this comment:

"This patch was dropped because it is obsolete"

No idea why it was obsolete. Does anybody know?

~~
laurent



Re: Reiser4. BEST FILESYSTEM EVER.

2007-04-07 Thread johnrobertbanks
Hi Willy,...

> With a decent CPU, you can reach higher read/write data rates than what a
> single off-the-shelf disk can achieve. For this reason, I think that
> reiser4 would be worth trying for this particular usage.

Glad to see you are willing to give Reiser4 a go.

Good man.

--


On Sat, 7 Apr 2007 09:15:35 +0200, "Willy Tarreau" <[EMAIL PROTECTED]> said:
> On Fri, Apr 06, 2007 at 10:58:45PM -0700, [EMAIL PROTECTED]
> wrote:
> > You know,... you cut out this bit:
> > 
> > -
> > 
> > > The following benchmarks are from
> > > 
> > > http://linuxhelp.150m.com/resources/fs-benchmarks.htm or,
> > > http://m.domaindlx.com/LinuxHelp/resources/fs-benchmarks.htm
> 
> ...
> 
> Hey John, please change your disk, it's scratched and you're repeating
> yourself again and again. At first I thought "Oh cool, some good news
> about reiser4", now when I see "reiserfs" in a thread, I think "oh no,
> not this boring guy who escaped from the asylum again !". I hope this
> thread will be cut shortly so that you stop doing bad publicity to
> reiserfs and its developers, because when a product is indicated as
> good by stupid people, it's really doing harm.
> 
> Also, about this part :
> [Jan]
> > > But in the end everything is a tradeoff. You can save diskspace, but
> > > increase the cost of corruption. 
> 
> I don't 100% agree with Jan, because for some usages (temporary space),
> light compression can increase speed. For instance, when processing logs,
> I get better speed by compressing intermediate files with LZO on the fly.
> 
> [John]
> > You deliberately ignored the fact that bad blocks are NOT dealt with by
> > the filesystem,... but by the operating system. Like I said: If your
> > filesystem is writing to bad blocks, then throw away your operating
> > system.
> 
> But what you write here is complete crap. The filesystem relies on a
> linear block device. The operating system is responsible for doing
> read retries or reporting errors on bad blocks, but the FS and only
> the FS can decide how not to use some known defective areas, for
> instance not putting any metadata on them nor any useful data.
> 
> Now if you want to stop writing stupid things again and again, take
> your bag, don't miss the bus to school, and listen to the teachers
> instead of playing games on your calculator.
> 
> Willy
> PS: no need to reply either, I'll kill this thread and your address
> here.
> 
-- 
  
  [EMAIL PROTECTED]

-- 
http://www.fastmail.fm - One of many happy users:
  http://www.fastmail.fm/docs/quotes.html



Re: Reiser4. BEST FILESYSTEM EVER.

2007-04-07 Thread Dale Amon
Jan does have a point about bad blocks. A couple years ago
I had a relatively new disk start to go bad on random blocks.
I detected it fairly quickly but did have some data loss.

All the compressed archives which were hit were near
total losses; most other files were at least partially
recoverable.
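
The difference described above can be illustrated with a small sketch. It assumes zlib as the compressor (the message does not say which archive format was hit) and uses a made-up sample payload; whole-buffer decompression is all-or-nothing here, whereas a dedicated recovery tool might salvage more.

```python
import zlib


def intact_bytes_raw(data: bytes, pos: int) -> int:
    """Bytes that still match after one byte of the raw data is trashed."""
    damaged = bytearray(data)
    damaged[pos] ^= 0xFF
    return sum(a == b for a, b in zip(data, damaged))


def intact_bytes_compressed(data: bytes, pos: int) -> int:
    """Bytes recovered after one byte of the zlib stream is trashed.

    zlib.decompress() either returns the whole buffer or raises, so a
    failed decode or checksum counts as zero recovered bytes here.
    """
    packed = bytearray(zlib.compress(data))
    packed[pos] ^= 0xFF
    try:
        restored = zlib.decompress(bytes(packed))
    except zlib.error:
        return 0
    return sum(a == b for a, b in zip(data, restored))


if __name__ == "__main__":
    # Hypothetical payload standing in for an archived file.
    sample = b"".join(b"record %d: archived payload\n" % i for i in range(2000))
    middle = len(zlib.compress(sample)) // 2
    print("raw file, bytes intact:       ", intact_bytes_raw(sample, middle))
    print("compressed file, bytes intact:", intact_bytes_compressed(sample, middle))
```

One damaged byte costs the raw file one byte, while the same damage in the compressed stream typically makes the rest of the stream undecodable.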

It is not a matter of your operating system writing
to bad blocks. It is a matter of what happens when the
blocks on which your data sit go bad underneath you.

This issue has also been discussed by people working
with revision control systems. If you are archiving
data, how do you know if your data is still good
unless you actually need it? If you do not know it
is bad, you may well get rid of good copies thinking
you do not need the extras... it does happen.

I would be quite hesitant to go with on disk compression
unless damage was limited to only the bad bits or blocks
and did not propagate through the rest of the file.

Perhaps if everyone used hardware RAID and the RAID
automatically detected a difference due to trashed
data on one disk and flagged the admin with a warning...

BTW: I'm a CMU Alum, so who are you working with Jan?






Re: Reiser4. BEST FILESYSTEM EVER.

2007-04-07 Thread Krzysztof Halasa
[EMAIL PROTECTED] writes:

> Why do you think I hate reiserfs developers? That is an insane claim.
> Why would I hate reiser3 developers?
> Why would I hate reiser4 developers? 
> Why would I even dislike them?
>
> I think Hans Reiser is a genius. Is that what you mean by hate?

I think they could hire a person with a bit better marketing skills,
though. People on a technical mailing list don't buy things just
because something on TV told them they have to.

> Answer this question. Why do YOU think I am antagonizing reiserfs
> developers?

That might be just a side effect.

> Think about it,... read speeds that are some FOUR times the physical
> disk read rate,... impossible without the use of compression (or
> something similar).

It's really impossible with compression alone unless you're writing
only zeros or the like. I don't know what bonnie uses for testing,
but real-life data doesn't compress 4 times. Two times, sometimes,
but then it will typically be slower than disk access (I mean reads,
as writes will be much slower).

You can get faster I/O (both linear speed and access times) using
multiple disks (mirrors etc). Perhaps some ZFS ideas would do us
some good?

Gzip - 3 files (zeros only, raw DV data from video camera, x86_64
kernel rpm file), 10 MB of data (10*1024*1024), done on tmpfs so no
real disk speed factor. The CPU is AMD64 with 1 MB cache per core,
2600 MHz clock (clock scaling disabled). That's my typical usage
pattern (well, not counting these zeros).

$ l -Ggh zeros dv bin
-rw-r--r-- 1 10M Apr  7 15:30 bin
-rw-r--r-- 1 10M Apr  7 15:31 dv
-rw-r--r-- 1 10M Apr  7 15:31 zeros

$ for f in zeros dv bin; do time gzip $f; done
real    0m0.112s
real    0m0.686s
real    0m0.559s

Dealing with pure zeros gzip can get almost 90 MB/s compressing, but
with DV and rpm it only does 14.5 and almost 18 MB/s respectively...
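
The rates quoted above follow directly from the file size and the `time` output. A minimal sketch of the arithmetic, assuming each file is exactly 10 MB as stated and using the wall-clock times from the transcript (the function and dictionary names are only illustrative):

```python
SIZE_MB = 10  # each test file is 10*1024*1024 bytes


def throughput_mb_s(seconds: float, size_mb: float = SIZE_MB) -> float:
    """Uncompressed megabytes processed per second of wall-clock time."""
    return size_mb / seconds


# "real" times reported for plain gzip on the three files
GZIP_TIMES = {"zeros": 0.112, "dv": 0.686, "bin": 0.559}

if __name__ == "__main__":
    for name, secs in GZIP_TIMES.items():
        print(f"{name:5s} {throughput_mb_s(secs):6.1f} MB/s")
```

The same function applied to the gunzip and `gzip -1` timings gives the other figures quoted in this message.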

$ l -Ggh zeros.gz dv.gz bin.gz
-rw-r--r-- 1  10K Apr  7 15:31 zeros.gz
-rw-r--r-- 1 9.1M Apr  7 15:31 dv.gz
-rw-r--r-- 1 9.3M Apr  7 15:30 bin.gz

... and though the numbers may still sound impressive, space savings
are less than 10%.

$ for f in zeros dv bin; do time gunzip $f.gz; done
real    0m0.067s
real    0m0.131s
real    0m0.120s

Decompression gives 150 MB/s for zeros and ~ 80 MB/s for DV and rpm.

$ for f in zeros dv bin; do time gzip -1 $f; done
real    0m0.079s
real    0m0.572s
real    0m0.530s

Supposed to be "fastest gzip". 126 MB/s for zeros but still less than
19 MB/s for DV and rpm.

$ l -Ggh zeros.gz dv.gz bin.gz
-rw-r--r-- 1  45K Apr  7 15:31 zeros.gz
-rw-r--r-- 1 9.2M Apr  7 15:31 dv.gz
-rw-r--r-- 1 9.3M Apr  7 15:30 bin.gz

$ for f in zeros dv bin; do time gunzip $f.gz; done
real    0m0.044s
real    0m0.135s
real    0m0.120s

It seems gzip can decompress zeros at a rate of 227 MB/s.
I assume the "4x read speed" claim comes from something like this.

$ /sbin/hdparm -t /dev/sda

/dev/sda:
 Timing buffered disk reads:  210 MB in  3.02 seconds =  69.59 MB/sec

$ echo "69.59*4" | bc
278.36

Seems you'd need a faster algorithm, faster machine or slower disk
- slower than this cheap SATA with disabled NCQ (NV SATA) at least:

$ cat /sys/block/sda/device/model
Maxtor 6V250F0
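
The argument can be put as a simple back-of-the-envelope model. This is only a sketch under an idealized assumption (disk reads and decompression overlap perfectly, so the slower stage sets the pace); the compression ratios are derived from the rounded `ls -h` sizes above and the decompression rates are the approximate figures quoted in this message.

```python
def effective_read_mb_s(disk_mb_s: float, compression_ratio: float,
                        decompress_mb_s: float) -> float:
    """Upper bound on uncompressed MB/s delivered to the reader.

    The disk supplies disk_mb_s of compressed data, which expands by
    compression_ratio; the decompressor caps the result at its own rate.
    """
    return min(disk_mb_s * compression_ratio, decompress_mb_s)


DISK_MB_S = 69.59  # hdparm -t result for /dev/sda

# (compression ratio, approximate gunzip rate in MB/s)
CASES = {
    "zeros": (10 * 1024 / 10, 150.0),  # 10M -> 10K
    "dv": (10 / 9.1, 80.0),
    "bin": (10 / 9.3, 80.0),
}

if __name__ == "__main__":
    target = 4 * DISK_MB_S
    for name, (ratio, gunzip_rate) in CASES.items():
        rate = effective_read_mb_s(DISK_MB_S, ratio, gunzip_rate)
        print(f"{name:5s} {rate:7.2f} MB/s (4x target: {target:.2f} MB/s)")
```

Under these assumptions none of the three files reaches four times the raw disk rate on this machine: the zeros case is limited by the decompressor and the other two by the poor compression ratio.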

Please note that application-level compression usually gives way
better results - the application knows much more.
-- 
Krzysztof Halasa


Re: Reiser4. BEST FILESYSTEM EVER.

2007-04-07 Thread johnrobertbanks

Krzysztof -- Aren't you missing the point? Twice the speed would be
great,... even a 50% increase,... even a 0% increase.

I checked what bonnie++ actually writes to its test files, for you. It
is about 98-99% zeros.

Still, the results record sequential reads, of 232,729 K/sec, nearly
four times the physical disk read rate, 63,160 K/sec, of the hard drive.

The sequential writes are about three times the physical disk write
rate.

Even if the speed increase was zero, the more efficient use of disk
space means that Reiser4 is worth investigating.

People use RAID arrays to achieve speed increases. 

The people who developed RAID clearly thought that increases in speed
were worth investigating.

> 
> > Why do you think I hate reiserfs developers? That is an insane claim.
> > Why would I hate reiser3 developers?
> > Why would I hate reiser4 developers? 
> > Why would I even dislike them?
> >
> > I think Hans Reiser is a genius. Is that what you mean by hate?
> 
> I think they could hire a person with a bit better marketing skills,
> though. People on a technical mailing list don't buy things just
> because something on TV told them they have to.

I don't work for Reiser if that is what you are suggesting.

And people buy all sorts of lies because someone on TV told them it was
true.

Did you believe Iraq had WMD (weapons of mass destruction) because a
bunch of American liars told you this on TV? Millions of Americans did.

> > Answer this question. Why do YOU think I am antagonizing reiserfs
> > developers?
> 
> That might be just a side effect.
> 
-- 
  
  [EMAIL PROTECTED]




[PATCH] kernel-doc: handle arrays with arithmetic expressions as initializers

2007-04-07 Thread Borislav Petkov
On Fri, Apr 06, 2007 at 05:53:25PM -0700, Randy Dunlap wrote:
> From: Jan Engelhardt <[EMAIL PROTECTED]>
> 
> Unfortunately, kernel-doc has problems with a struct field like this:
>   uint8_t databuf[NAND_MAX_PAGESIZE + NAND_MAX_OOBSIZE];
> 
> simply due to the spaces around the "+" sign, so drop all spaces inside
> [...] so that parsing is done correctly (in some sense).
> 
> Warning(linux-2.6.20-git15/include/linux/mtd/nand.h:304): No description 
> found for parameter 'NAND_MAX_OOBSIZE]'
> 
> This needs to sit in -mm for awhile to see if it has any adverse effects.
> 
> And yes, this is just a hack until kernel-doc learns to do better
> parsing.
> 
> Signed-off-by: Jan Engelhardt <[EMAIL PROTECTED]>
> Signed-off-by: Randy Dunlap <[EMAIL PROTECTED]>
> ---
>  scripts/kernel-doc |5 +
>  1 file changed, 5 insertions(+)
> 
> --- linux-2.6.21-rc6.orig/scripts/kernel-doc
> +++ linux-2.6.21-rc6/scripts/kernel-doc
> @@ -1452,6 +1452,11 @@ sub create_parameterlist($$$) {
>   $arg =~ s/\s*:\s*/:/g;
>   $arg =~ s/\s*\[/\[/g;
>  
> + # no spaces inside [array size expression];
> + # messes up split/pop/shift/unshift below;
> + while ($arg =~ s/\[(.*)\s+(.*)\]/[$1$2]/) {
> + }
> +
>   my @args = split('\s*,\s*', $arg);
>   if ($args[0] =~ m/\*/) {
>   $args[0] =~ s/(\*+)\s*/ $1/;
> -

As a different approach, here's a patch that handles the special case of
composite arithmetic expressions in array size initializers. Before
pushing the split strings onto the @first_arg array, I split the
keywords preceding the array name as before, then keep the array name
together with its subscript expression as a single element, which gets
pushed last. This way kernel-doc produces correct output without
stripping whitespace, which would make the array subscripts unreadable in the docs.

Signed-off-by: Borislav Petkov <[EMAIL PROTECTED]>

--- 21-rc6/scripts/kernel-doc.orig  2007-04-07 16:48:51.0 +0200
+++ 21-rc6/scripts/kernel-doc   2007-04-07 16:51:17.0 +0200
@@ -1456,7 +1456,16 @@ sub create_parameterlist($$$) {
if ($args[0] =~ m/\*/) {
$args[0] =~ s/(\*+)\s*/ $1/;
}
-   my @first_arg = split('\s+', shift @args);
+
+   my @first_arg;
+   if ($args[0] =~ /^(.*\s+)(.*?\[.*\].*)$/) {
+   shift @args;
+   push(@first_arg, split('\s+', $1));
+   push(@first_arg, $2);
+   } else {
+   @first_arg = split('\s+', shift @args);
+   }
+
unshift(@args, pop @first_arg);
$type = join " ", @first_arg;
 

-- 
Regards/Gruß,
Boris.
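
For readers who want to try the splitting logic outside of kernel-doc, here is a rough Python transliteration of the hunk above. It is a sketch, not the script itself: the function name and the `keep_array_whole` switch are made up for illustration, and Python's `str.split()` drops empty leading fields, which Perl's `split('\s+', ...)` does not.

```python
import re

# Same pattern as in the patch: everything up to the last whitespace
# that still leaves a "name[...]" tail, then that tail.
ARRAY_DECL = re.compile(r'^(.*\s+)(.*?\[.*\].*)$')


def split_first_arg(decl: str, keep_array_whole: bool = True):
    """Return (type, name) for a struct member declaration."""
    match = ARRAY_DECL.match(decl) if keep_array_whole else None
    if match:
        # Split only the keywords before the array name; keep the
        # name and its subscript expression as one element.
        parts = match.group(1).split() + [match.group(2)]
    else:
        parts = decl.split()
    name = parts.pop()
    return " ".join(parts), name


if __name__ == "__main__":
    member = "uint8_t databuf[NAND_MAX_PAGESIZE + NAND_MAX_OOBSIZE]"
    print(split_first_arg(member, keep_array_whole=False))
    print(split_first_arg(member))
```

With `keep_array_whole=False` the name comes out as `NAND_MAX_OOBSIZE]`, which is the bogus parameter from the warning quoted earlier; with the array handling enabled the subscript expression stays intact, spaces included.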


[patch] nfs statfs error-handling fix

2007-04-07 Thread amnonaar

Hi,

The nfs statfs function returns a success code on error, and fills the 
output buffer with invalid values. The attached patch makes it return a 
correct error code instead.


Thanks,
Amnon

Signed-off-by: Amnon Aaronsohn <[EMAIL PROTECTED]>
--

--- linux-source-2.6.20-2.6.20/fs/nfs/super.c.orig  2007-04-07 15:19:14.0 +0300
+++ linux-source-2.6.20-2.6.20/fs/nfs/super.c   2007-04-07 15:24:35.0 +0300
@@ -203,9 +203,9 @@ static int nfs_statfs(struct dentry *den
lock_kernel();

error = server->nfs_client->rpc_ops->statfs(server, fh, &res);
-   buf->f_type = NFS_SUPER_MAGIC;
if (error < 0)
-   goto out_err;
+   goto out;
+   buf->f_type = NFS_SUPER_MAGIC;

/*
 * Current versions of glibc do not correctly handle the
@@ -234,13 +234,7 @@ static int nfs_statfs(struct dentry *den
buf->f_namelen = server->namelen;
  out:
unlock_kernel();
-   return 0;
-
- out_err:
-   dprintk("%s: statfs error = %d\n", __FUNCTION__, -error);
-   buf->f_bsize = buf->f_blocks = buf->f_bfree = buf->f_bavail = -1;
-   goto out;
-
+   return error;
 }

 /*


Re: Reiser4. BEST FILESYSTEM EVER.

2007-04-07 Thread johnrobertbanks
On Sat, 7 Apr 2007 13:59:14 +0100, "Dale Amon" <[EMAIL PROTECTED]> said:
> Jan does have a point about bad blocks. A couple years ago
> I had a relatively new disk start to go bad on random blocks.
> I detected it fairly quickly but did have some data loss.
> 
> All the compressed archives which were hit were near
> total losses; most other files were at least partially
> recoverable.

As you know, there is no substitute for backups. What if the disk had
totally crashed and scratched GBs of your data?

And did you ever trust those (non-compressed) executables that you saved
after recovering them from corruption?

Of course not. No one would. The fact that they were not compressed did
not save them.

You are really arguing for backups, not for one filesystem or another.

Besides, Jan claimed that corruption due to bad blocks propagates to
MULTIPLE files because of the compression in the file system. You are
arguing something different.

> It is not a matter of your operating system writing
> to bad blocks. It is a matter of what happens when the
> blocks on which your data sit go bad underneath you.
>
> This issue has also been discussed by people working
> with revision control systems. If you are archiving
> data, how do you know if your data is still good
> unless you actually need it? If you do not know it
> is bad, you may well get rid of good copies thinking
> you do not need the extras... it does happen.
> 
> I would be quite hesitant to go with on disk compression
> unless damage was limited to only the bad bits or blocks
> and did not propagate through the rest of the file.

You don't really mean that. Most backups use compression (which
propagates errors through the rest of the file).

> Perhaps if everyone used hardware RAID and the RAID
> automatically detected a difference due to trashed
> data on one disk and flagged the admin with a warning...
> 
> BTW: I'm a CMU Alum, so who are you working with John?

I retired quite young.
-- 
  
  [EMAIL PROTECTED]




O_APPEND, lseek() and pwrite()

2007-04-07 Thread Timo Sirainen
If I open a file with O_APPEND and write() to it, it looks like the  
file offset is updated and I can get it with lseek(SEEK_CUR). Can I  
trust that this behavior won't change in future Linux versions?  
Apparently this isn't standard, because at least OS X and Solaris  
don't do this.


pwrite() ignores the file offset if the fd has O_APPEND set (with
2.6.20). http://www.opengroup.org/austin/mailarchives/ag/msg09453.html
suggests that it shouldn't ignore it. Could this be changed? For now I
can of course just change the flag with fcntl().


I guess there aren't any limits to how large blocks write() accepts  
without the data being mixed with another process's writes (both with  
O_APPEND)? And I guess there aren't any horrible performance problems  
with this, so that this is actually a good idea compared to file lock  
+ write() + unlock? :)
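
The first two behaviours asked about can be probed from user space. The sketch below uses Python's thin wrappers around the same system calls (`os.write`, `os.lseek`, `os.pwrite`); the function name and the test strings are only illustrative, and the result is platform-dependent, which is the point of the question.

```python
import os
import tempfile


def probe_append_semantics():
    """Return (offset seen by lseek after an O_APPEND write, final file contents)."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        fd = os.open(path, os.O_WRONLY | os.O_APPEND)
        try:
            os.write(fd, b"hello")
            # Does the file offset follow the appended data?
            offset = os.lseek(fd, 0, os.SEEK_CUR)
            # Ask for offset 0 explicitly; O_APPEND may override it.
            os.pwrite(fd, b"WORLD", 0)
        finally:
            os.close(fd)
        with open(path, "rb") as handle:
            contents = handle.read()
    finally:
        os.unlink(path)
    return offset, contents


if __name__ == "__main__":
    print(probe_append_semantics())
```

On a Linux kernel behaving as described in this message one would expect an offset of 5 and contents of `helloWORLD` (pwrite() appended); a system that honours the pwrite() offset would instead leave just `WORLD` in the file.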






Re: [RFD driver-core] Lifetime problems of the current driver model

2007-04-07 Thread Alan Stern
On March 30, 2007, Tejun Heo wrote:

> Hello, all.
> 
> This document tries to describe lifetime problems of the current
> device driver model primarily from the point of view of device drivers
> and establish consensus, or at least, start discussion about how to
> solve these problems.  This is primarily based on my experience with
> IDE and SCSI layers and my knowledge on other drivers is limited, so I
> might have generalized too aggressively.  Feel free to point out.

...

> Example 1. sysfs_schedule_callback() not grabbing the owning module
> 
>   This function is recently added to be used by suicidal sysfs nodes
>   such that they don't deadlock when trying to unregister themselves.
> 
>  +#include <linux/delay.h>
>   static void sysfs_schedule_callback_work(struct work_struct *work)
>   {
>   struct sysfs_schedule_callback_struct *ss = container_of(work,
>   struct sysfs_schedule_callback_struct, work);
> 
>  +msleep(100);
>   (ss->func)(ss->data);
>   kobject_put(ss->kobj);
>   kfree(ss);
>   }
> 
>   int sysfs_schedule_callback(struct kobject *kobj, void (*func)(void *),
>   void *data)
>   {
>   struct sysfs_schedule_callback_struct *ss;
> 
>   ss = kmalloc(sizeof(*ss), GFP_KERNEL);
>   if (!ss)
>   return -ENOMEM;
>   kobject_get(kobj);
>   ss->kobj = kobj;
>   ss->func = func;
>   ss->data = data;
>   INIT_WORK(&ss->work, sysfs_schedule_callback_work);
>   schedule_work(&ss->work);
>   return 0;
>   }
> 
>   Two lines starting with '+' are inserted to make the problem
>   reliably reproducible.  With the above changes,
> 
>   # insmod drivers/scsi/scsi_mod.ko; insmod drivers/scsi/sd_mod.ko; insmod drivers/ata/libata.ko; insmod drivers/ata/ahci.ko
>   # echo 1 > /sys/block/sda/device/delete; rmmod ahci; rmmod libata; rmmod sd_mod; rmmod scsi_mod
> 
>   It's assumed that ahci detects /dev/sda.  The above command sequence
>   causes the following oops.
> 
>   BUG: unable to handle kernel paging request at virtual address e0984020
>   [--snip--]
>   EIP is at 0xe0984020
>   [--snip--]
>[] run_workqueue+0x92/0x140
>[] worker_thread+0x137/0x160
>[] kthread+0xa3/0xd0
>[] kernel_thread_helper+0x7/0x10
> 
>   The problem here is that kobject_get() in sysfs_schedule_callback()
>   doesn't grab the module backing the kobject it's grabbing.  By the
>   time (ss->func)(ss->data) runs, scsi_mod is already gone.

As the author of this routine, I wish you had included my name in your
CC: list.  :-(

The problem here isn't exactly as you described.  scsi_mod needs to be
pinned (1) because it is the owner of the kobject and hence will be
called when the kobject is released, and (2) because it is the owner
of the callback routine.  However this is just a detail; clearly the
bug needs to be fixed.

One possibility would be to have scsi_mod's exit_scsi() routine call
flush_scheduled_work().  Another would be to add such a call in
sys_delete_module().  Neither of these is attractive.  They would add
overhead when it's not needed, and they would deadlock if a workqueue
routine tried to unload a module.

On balance, the patch below seems better.  Do you agree?


With regard to your analysis of lifetime issues, there is a whole
aspect you did not mention.  A basic assumption of the refcounting
approach is that once X has a reference to Y, X can freely access and
use Y as much as it wants until it drops the reference.

However this is not true when X is a device driver and Y is a device
structure.  Drivers can be unbound from devices.  If X has been
unbound from Y then it must not access Y again, no matter how many
references it possesses.  After all, some other driver may have bound
to Y in the meantime; this other driver would not appreciate the
interference.

Just as bad, if Y represents a hot-pluggable device then some other
device may have been plugged in and may be using Y's old address.  We
don't want X sending commands to a new device, thinking that it is Y!

The complications caused by this requirement affect both the subsystem
code and device drivers.  Drivers must synchronize their release()
methods with every action they take -- and refcounts cannot provide
synchronization.

A similar problem afflicts the char-device subsystem, and here even
less care has been taken to address the issues.  The race between
open() and unregister() is resolved in many places by relying on the
BKL!

We should be able to make things better and easier than they are.
Orphaning open sysfs files was a move in this direction.  But I doubt
they will ever become truly simple and clear.

Alan Stern



Index: usb-2.6/drivers/base/core.c
===
--- usb-2.6.orig/drivers/base/core.c
+++ usb-2.6/drivers/base/core.c
@@ -431,9 +431,10 @@ void device_remove_bin_file(struct devic
 EXPORT_SYMBOL_GPL(device_remove_bin_file);
 
 /**
- * device_schedule_callback - helper to schedule

Re: Reiser4. BEST FILESYSTEM EVER.

2007-04-07 Thread Pekka Enberg

On 4/7/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:

I checked what bonnie++ actually writes to its test files, for you. It
is about 98-99% zeros.

Still, the results record sequential reads, of 232,729 K/sec, nearly
four times the physical disk read rate, 63,160 K/sec, of the hard drive.


Excellent! You've established the undeniable hard cold fact that
reiser4 beats the crap out of all other filesystems, when the files
are 98-99% filled with zeros. You've proven your point, so can we stop
this thread now?


[PATCH -mm] freezer: Remove PF_NOFREEZE from handle_initrd

2007-04-07 Thread Rafael J. Wysocki
From: Rafael J. Wysocki <[EMAIL PROTECTED]>

Make handle_initrd() call try_to_freeze() in a suitable place instead of setting
PF_NOFREEZE for the current task.

Signed-off-by: Rafael J. Wysocki <[EMAIL PROTECTED]>
---
 init/do_mounts_initrd.c |5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Index: linux-2.6.21-rc6/init/do_mounts_initrd.c
===
--- linux-2.6.21-rc6.orig/init/do_mounts_initrd.c
+++ linux-2.6.21-rc6/init/do_mounts_initrd.c
@@ -55,11 +55,12 @@ static void __init handle_initrd(void)
sys_mount(".", "/", NULL, MS_MOVE, NULL);
sys_chroot(".");
 
-   current->flags |= PF_NOFREEZE;
pid = kernel_thread(do_linuxrc, "/linuxrc", SIGCHLD);
if (pid > 0) {
-   while (pid != sys_wait4(-1, NULL, 0, NULL))
+   while (pid != sys_wait4(-1, NULL, 0, NULL)) {
+   try_to_freeze();
yield();
+   }
}
 
/* move initrd to rootfs' /old */


Re: Ten percent test

2007-04-07 Thread Gene Heskett
On Saturday 07 April 2007, Con Kolivas wrote:
>On Friday 06 April 2007 20:03, Ingo Molnar wrote:
>> * Con Kolivas <[EMAIL PROTECTED]> wrote:
>[...]
>> 
>> firstly, testing on various workloads Mike's tweaks work pretty well,
>> while SD still doesn't handle the high-load case all that well. Note
>> that it was you who raised this whole issue to begin with: everything
>> was pretty quiet in scheduling interactivity land.

Con was scratching an itch, one we desktop users all have in a place we 
can't quite reach to scratch because we aren't quite the coding gods we 
should be.  Con at least has the coding knowledge to walk in and start 
shoveling, which is more than I can say for what the efforts to derail the SD 
scheduler have demonstrated to this user.

>I'm terribly sorry but you have completely missed my intentions then. I
> was _not_ trying to improve mainline's interactivity at all. My desire
> was to fix the unfairness that mainline has, across the board without
> compromising fairness. You said yourself that an approach that fixed a
> lot and had a small number of regressions would be worth it. In a
> surprisingly ironic turnaround two bizarre things happened. People
> found SD fixed a lot of their interactivity corner cases which were
> showstoppers. That didn't surprise me because any unfair design will by
> its nature get it wrong sometimes. The even _more_ surprising thing is
> that you're now using interactivity as the argument against SD. I did
> not set out to create better interactivity, I set out to create
> widespread fairness without too much compromise to interactivity. As I
> said from the _very first email_, there would be cases of interactivity
> in mainline that performed better.
>
>> (There was one person who
>> reported wide-scale interactivity regressions against mainline but he
>> didn't answer my followup posts to trace/debug the scenario.)
>
>That was one user. As I mentioned in an earlier thread, the problem with
> email threads on drawn out issues on lkml is that all that people
> remember is the last one creating noise, and that has only been the
> noise from Mike for 2 weeks now. Has everyone forgotten the many many
> users who reported the advantages first up which generated the interest
> in the first place? Why have they stopped reporting? Well the answer is
> obvious; all the signs suggest that SD is slated for mainline. It is on
> the path, Linus has suggested it and now akpm is asking if it's ready
> for 2.6.22. So they figure there is no point testing and replying any
> further. SD is ready for prime time, finalised and does everything I
> intended it to. This is where I have to reveal to them the horrible
> truth. This is no guarantee it will go in. In fact, this one point that
> you (Ingo) go on and on about is not only a quibble, but you will call
> it an absolute showstopper. As maintainer of the cpu scheduler, in its
> current form you will flatly refuse it goes to mainline citing the 5%
> of cases where interactivity has regressed. So people will tell me to
> fix it, right?... Read on for this to unfold.

Sorry, this user got quiet to watch the cat fight.  Obviously I should 
have been throwing messages wrapped around rocks (or something).

>> SD has a built-in "interactivity estimator" as well, but hardcoded
>> into its design. SD has its own set of ugly-looking tweaks as well -
>> for example the prio_matrix.
>
>I'm sorry but this is a mis-representation to me, as I suggested on an
> earlier thread where I disagree about what an interactivity estimator
> is. The idea of fence posts in a clock that are passed as a way of
> metering out earliest-deadline-first in a design is well established.
> The matrix is simply an array designed for O(1) lookups of the fence
> posts. That is not the same as "oh how much have we slept in the last
> $magic_number period and how much extra time should we get for that".
>
>> So it all comes down on 'what interactivity
>> heuristics is enough', and which one is more tweakable. So far i've
>> yet to see SD address the hackbench and make -j interactivity
>> problems/regression for example, while Mike has been busy addressing
>> the 'exploits' reported against mainline.

Who gives a s*** about hackbench or a make -j 200?!  Those are NOT, and 
NEVER WILL BE, REAL WORLD LOADS for the vast majority of us.  For us SD 
Just Worked(TM).

>And BANG there is the bullet you will use against SD from here to
> eternity. SD obeys fairness at all costs. Your interactivity regression
> is that SD causes progressive slowdown with load which by definition is
> fairness. You repeatedly ask me to address it and there is one unfailing
> truth; the only way to address it is to add unfairness to the design.
> So why don't I? Because the simple fact is that any unfairness no
> matter how carefully administered or metered will always have cases
> where it's wrong. Look at the title of this email for example - it's
> yet another exploit for the mainline sleep/run 

Re: [patch] remove artificial software max_loop limit

2007-04-07 Thread Valdis . Kletnieks
On Fri, 06 Apr 2007 16:33:32 EDT, Bill Davidsen said:
> Jan Engelhardt wrote:

> > Who cares if the user specifies max_loop=8 but still is able to open up 
> > /dev/loop8, loop9, etc.? max_loop=X basically meant (at least to me) 
> > "have at least X" loops ready.
> > 
> You have just come up with a really good reason not to do unlimited 
> loops.

That, and I'd expect the intuitive name for "have at least N ready" to
be 'min_loop=N'.  'max_loop=N' means (to me, at least) "If I ask for N+1,
something has obviously gone very wrong, so please shoot my process before
it gets worse".

Maybe what's needed is *both* a max_ and min_ parameter?




SD scheduler testing hitch

2007-04-07 Thread Mike Galbraith
On Sat, 2007-04-07 at 11:24 +0200, Ingo Molnar wrote:
> * Andrew Morton <[EMAIL PROTECTED]> wrote:

> > Where are we at with staircase anyway?  Is it looking like a 2.6.22 
> > thing? I don't personally think we've yet seen enough serious 
> > performance testing to permit a merge, apart from other issues...
> 
> yes, that's my thinking too at the moment. I'd also like to see a 
> summary of 'open design questions' list from Mike (if Mike has 
> time/energy for that?) - many questions were raised, a good number of 
> them were answered, various changes done to SD but there's no good 
> summary of the current state of affairs.

I'm working on it. I started testing fairness, but ran into a snag.

What I was testing was my theory that SD can't possibly be fair to
sleeping tasks because the differential between long burn short sleep
tasks and long sleep short burn tasks is tossed at the end of every
rotation.  That theory seems to be true, but here's the snag...

2.6.21-rc6-sd-0.39, box is 3GHz P4/HT

tenpercent: tenpercent.c compiled to run 1 10% duty cycle task.
100ms and friends:  tenpercent.c hard coded for N ms burn + 1 usec sleep.

taskset -c 1 ./tenpercent
taskset -c 1 ./100ms (or ilk)
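
tenpercent.c itself is not reproduced in this message, so the following is only a guess at the general shape of such a duty-cycle task, written as a Python sketch with made-up function names: burn a fixed amount of CPU time, sleep, repeat. Measuring the burn in CPU time rather than wall-clock time means that, when the task has to share a CPU, each burn simply takes longer to complete.

```python
import time


def burn_cpu(cpu_seconds: float) -> float:
    """Spin until this process has consumed cpu_seconds of CPU time."""
    start = time.process_time()
    while time.process_time() - start < cpu_seconds:
        pass
    return time.process_time() - start


def duty_cycle(burn_s: float, sleep_s: float, cycles: int) -> float:
    """Alternate burning and sleeping; return the CPU share achieved."""
    wall_start = time.monotonic()
    cpu_used = 0.0
    for _ in range(cycles):
        cpu_used += burn_cpu(burn_s)
        time.sleep(sleep_s)
    return cpu_used / (time.monotonic() - wall_start)


if __name__ == "__main__":
    # 10 ms of CPU followed by 90 ms asleep: roughly a 10% duty cycle
    # on an otherwise idle CPU.
    print(f"achieved share: {duty_cycle(0.010, 0.090, 20):.2f}")
```

Pinning two such tasks with different burn/sleep settings to the same CPU, as with the taskset commands above, is what produces the %CPU splits shown in the top output.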

top - 10:47:57 up  3:11, 13 users,  load average: 1.65, 1.63, 2.50

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ P COMMAND
 7357 root       9   0  1568  440  360 R   92  0.0  10:55.01 1 100ms
 7356 root       1   0  1568  444  360 S    8  0.0   1:00.01 1 tenpercent
 5557 root       1   0  164m  21m 4876 S    0  2.1   1:58.90 0 Xorg
 6343 root       3   0  2376 1068  768 R    0  0.1   2:51.19 0 top

top - 11:05:16 up  3:29, 13 users,  load average: 1.52, 1.50, 1.81

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ P COMMAND
 7395 root       5   0  1568  444  360 R   90  0.0   8:54.25 1 100ms
 7394 root       0 -10  1568  440  360 S   10  0.0   1:00.21 1 tenpercent
 6343 root       3   0  2376 1068  768 R    0  0.1   3:04.16 0 top
    1 root       1   0   736  288  240 S    0  0.0   0:00.90 0 init

top - 11:20:58 up  3:44, 13 users,  load average: 1.89, 1.87, 1.78

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ P COMMAND
 7429 root       2 -10  1568  444  360 R   92  0.0  12:03.81 1 100ms
 7428 root       0 -10  1568  444  360 R    8  0.0   1:00.08 1 tenpercent
 6343 root       3   0  2376 1068  768 R    1  0.1   3:19.36 0 top
    1 root       1   0   736  288  240 S    0  0.0   0:00.90 0 init

top - 12:22:27 up  4:46, 13 users,  load average: 1.90, 1.92, 1.94

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ P COMMAND
 8235 root       1 -20  1568  444  360 R   95  0.0  19:31.20 1 100ms
 8234 root       0 -20  1568  444  360 S    5  0.0   1:00.01 1 tenpercent
 6343 root       3   0  2376 1068  768 R    1  0.1   4:24.24 0 top
 4926 root       1   0  1820  632  544 S    0  0.1   0:02.34 0 hald-addon-stor

top - 13:38:22 up  6:02, 13 users,  load average: 1.53, 1.51, 1.51

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ P COMMAND
 8643 root       5   0  1564  444  360 R   93  0.0  12:15.49 1 50ms
 8642 root       1   0  1564  444  360 S    7  0.0   1:00.28 1 tenpercent
 6343 root       3   0  2376 1080  768 R    0  0.1   5:27.22 0 top
    1 root       1   0   736  288  240 S    0  0.0   0:00.91 0 init

top - 14:02:39 up  6:26, 13 users,  load average: 1.75, 1.71, 1.56

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ P COMMAND
 8726 root       5   0  1564  444  360 R   94  0.0  15:19.07 1 8ms
 8727 root       1   0  1564  444  360 R    6  0.0   1:00.11 1 tenpercent
 5557 root       1   0  164m  21m 4632 S    0  2.1   3:20.92 0 Xorg
 6079 root       1   0 31584  17m  12m S    0  1.7   0:04.35 0 konsole

top - 16:22:01 up  8:45, 13 users,  load average: 1.73, 1.81, 1.60

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ P COMMAND
10622 root       1   0  1428  264  212 R   98  0.0  10:00.43 1 xx
10621 root       1   0  1564  440  360 S    1  0.0   0:06.49 1 tenpercent
10423 root       3   0  2248 1052  764 R    0  0.1   0:27.45 0 top
    1 root       1   0   736  288  240 S    0  0.0   0:00.91 0 init

xx.c just tries to terminate the rotation if it gets preempted, and
seems to succeed.  It usually isn't this bad, but every few starts it
gets this bad.  I thought it might be screwing up the calibration of
tenpercent if xx started first, but I plugged it into tenp.c (attached)
after the calibration, and still see this every few starts.  It always
gets more cpu than it should, but sometimes it's extreme.

I have yet to see tenpercent start at 1 percent usage in many many
tries, but I just repeated it with the attached in seven tries.

xx.c

#include <sys/time.h>
#include <time.h>

#define max(a,b) ((a) > (b) ? (a) : (b))
#define min(a,b) ((a) < (b) ? (a) : (b))

int main(void)
{
struct timeval then, now;
struct timespec t = {0, 1000}, r;

for(;;) {
int t1, t2;
short i;

if (gettimeofday(&then, 0))
break;
   

[PATCH] block layer: Add bdev capacity helper function get_sect_count

2007-04-07 Thread John Anthony Kazos Jr.
From: John Anthony Kazos Jr. <[EMAIL PROTECTED]>

Add static inline function get_sect_count to include/linux/genhd.h to 
complement get_start_sect. Returns sector_t capacity of block device 
whether it is whole or a partition.

Signed-off-by: John Anthony Kazos Jr. <[EMAIL PROTECTED]>

---

This will be useful for fill_super functions of filesystems with online 
resizing for checks against recorded and actual device size. 
get_start_sect and get_sect_count are helper functions useful to keep 
things from breaking in case the block_device structure decides to change.

Applied against Linux v2.6.20.6.

--- linux-2.6.20.6-orig/include/linux/genhd.h	2007-04-06 16:02:48.000000000 -0400
+++ linux-2.6.20.6-mod/include/linux/genhd.h	2007-04-07 11:58:55.000000000 -0400
@@ -244,6 +244,10 @@ static inline sector_t get_start_sect(st
 {
 	return bdev->bd_contains == bdev ? 0 : bdev->bd_part->start_sect;
 }
+static inline sector_t get_sect_count(struct block_device *bdev)
+{
+	return bdev->bd_contains == bdev ? bdev->bd_disk->capacity : bdev->bd_part->nr_sects;
+}
 static inline sector_t get_capacity(struct gendisk *disk)
 {
 	return disk->capacity;


Re: Ten percent test

2007-04-07 Thread Mike Galbraith
On Sat, 2007-04-07 at 16:50 +1000, Con Kolivas wrote:
> On Friday 06 April 2007 20:03, Ingo Molnar wrote:

> > (There was one person who 
> > reported wide-scale interactivity regressions against mainline but he
> > didnt answer my followup posts to trace/debug the scenario.)
> 
> That was one user. As I mentioned in an earlier thread, the problem with 
> email 
> threads on drawn out issues on lkml is that all that people remember is the 
> last one creating noise, and that has only been the noise from Mike for 2 
> weeks now.

This doesn't even deserve a reply, so I'll just say "get well soon".

-Mike



Re: [patch] remove artificial software max_loop limit

2007-04-07 Thread Bill Davidsen

[EMAIL PROTECTED] wrote:
> On Fri, 06 Apr 2007 16:33:32 EDT, Bill Davidsen said:
>> Jan Engelhardt wrote:
>>> Who cares if the user specifies max_loop=8 but still is able to open up
>>> /dev/loop8, loop9, etc.? max_loop=X basically meant (at least to me)
>>> "have at least X" loops ready.
>>
>> You have just come up with a really good reason not to do unlimited
>> loops.
>
> That, and I'd expect the intuitive name for "have at least N ready" to
> be 'min_loop=N'.  'max_loop=N' means (to me, at least) "If I ask for N+1,
> something has obviously gone very wrong, so please shoot my process before
> it gets worse".
>
> Maybe what's needed is *both* a max_ and min_ parameter?

I think that max_loop is a sufficient statement of the highest number of
devices needed, and can reasonably be interpreted as both "I may need this
many" and "I won't legitimately want more."

As I recall, memory is allocated as the device is set up, so unless you
want to use the max memory at boot, "just in case," the minimum won't be
guaranteed anyway. Something else could eat memory.

In practice I think asking for way too many is more common than not
being able to get to the max. It may happen but it's a corner case, and
status is returned.


--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: [PATCH, take4] FUTEX : new PRIVATE futexes

2007-04-07 Thread Ulrich Drepper

On 4/7/07, Eric Dumazet <[EMAIL PROTECTED]> wrote:
> I am not sure what you want to say.

What Jakub meant is that it is OK for the kernel to reject using
unaligned 64-bit futexes.  Just return an error in all cases (not just
in some).
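
A minimal sketch of the check being discussed, written as plain userspace C rather than kernel code. The helper name futex64_check_alignment and the choice of -EINVAL are assumptions made for illustration; the thread does not say which error the kernel should return.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper: reject any 64-bit futex address that is not
 * naturally aligned, no matter which futex operation was requested.
 * Returns 0 if the address is acceptable, -EINVAL otherwise.
 */
static int futex64_check_alignment(uintptr_t uaddr)
{
	if (uaddr % sizeof(uint64_t) != 0)
		return -EINVAL;
	return 0;
}
```

The point of doing the test up front is that the answer depends only on the address, so every operation fails the same way.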


Re: COMPILING AND CONFIGURING A NEW KERNEL.

2007-04-07 Thread Valdis . Kletnieks
On Fri, 06 Apr 2007 18:26:45 PDT, [EMAIL PROTECTED] said:
>
> YOU SHOULD compile all the drivers necessary to boot your system, into
> the kernel (ie, such drivers should not be built as modules).
> 
> This way you will NOT need an initrd file.

It is quite possible to build a kernel that has all the drivers built-in,
but still requires an initrd file.  For instance, if you have a recent
RedHat or Fedora system, '/' may very well be on an LVM partition, which
means you need an initrd to activate the volume group (an 'lvm vgchange -ay',
the Linux counterpart of AIX's 'varyonvg') before mounting your real
root filesystem will work.




Re: [PATCH 12/13] maps#2: Add /proc/pid/pagemap interface

2007-04-07 Thread Matt Mackall
On Fri, Apr 06, 2007 at 11:55:10PM -0700, Andrew Morton wrote:
> On Fri, 06 Apr 2007 17:03:13 -0500 Matt Mackall <[EMAIL PROTECTED]> wrote:
> 
> > Add /proc/pid/pagemap interface
> > 
> > This interface provides a mapping for each page in an address space to
> > its physical page frame number, allowing precise determination of what
> > pages are mapped and what pages are shared between processes.
> 
> Could we please have a simple read-proc-pid-pagemap.c placed under
> Documentation/ somewhere?  Also some sample output for the changelog
> so we can see what all this does.

Working on that. The userspace portion of my tools are very rough at
the moment. And in Python.
 
> Also for kpagemap, please.
> 
> Should /proc/pid/pagemap and kpagemap be versioned?

They've both got a variable-sized header, so we can add things there. 

-- 
Mathematics is the supreme nostalgia of our time.


Re: Two questions regarding Opening files within Kernel!

2007-04-07 Thread Robert Hancock

JanuGerman wrote:
> Hi Every one,
>
>   I have got two questions regarding opening files within the Linux kernel. If
> some body can help me, in sorting out this problem, i will be very thankful.


First off, likely not something you should be doing:

http://kernelnewbies.org/FAQ/WhyWritingFilesFromKernelIsBad

--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/



Re: COMPILING AND CONFIGURING A NEW KERNEL.

2007-04-07 Thread Valdis . Kletnieks
On Sat, 07 Apr 2007 00:45:32 PDT, [EMAIL PROTECTED] said:

> Use rpm-pkg to create a Red Hat RPM kernel package.
> # make rpm-pkg
> 
> When built, the RPM package is put in
> /usr/src/packages/RPMS/*your*architecture*
> 
> # cd /usr/src/packages/RPMS/x86_64
> 
> Install the package (you may have to un-install previous installs)
> # rpm -i kernel-2.6.20-1.x86_64.rpm

It is *highly* recommended that you change the kernel identifier at
least slightly, so that you can install '2.6.20-1.local' without overlaying
the vendor-supplied 2.6.20-1 kernel.  Among other things, this lets you
boot back to the equivalent code level in the vendor kernel, so you can figure
out if it's your .config file that's broken, or if you hit a bug upgrading
from 2.6.19-10 to 2.6.20-1.




Re: Two questions regarding Opening files within Kernel!

2007-04-07 Thread JanuGerman
Thanks Jan for the response.


>struct dentry *fbar = lookup_one_len("/foo/bar", current->fs->root);


But that gives me a dentry, whereas the file object is still not reachable.

Question: I am currently using dentry_open() (declared in fs.h), which takes a
"dentry", a "vfsmount" object and a flag (usually RW, i.e. 2), and gives me the
file object. With your suggested method, the vfsmount is still not available. In
this regard, any idea about a function which directly gives the file object
instead of a dentry would be highly appreciated.


Or (kindly see the code below), I need something for the "missing vfsmount".


struct dentry *fbar = lookup_one_len("/foo/bar", current->fs->root);
struct file *file1 = dentry_open(fbar, "missing vfsmount here",2)



Thanks,
JG







If not readdir() then what?

2007-04-07 Thread Ulrich Drepper

In their closed chambers (well, workshops,
http://lwn.net/Articles/226351/), the filesystem developers complain
about readdir.  I fully appreciate the difficulties.  But what I fail
to see so far is any proposal for an alternative interface.

The phase to get new functionality included in the next revision of
POSIX is over.  But that does not mean we should not try to get some
sensible new implementation in place.  There is, for example, the
"High End Computing Extensions Working Group" (the guys who showed up
here with their statlite and readdirplus proposals).  This is an
official working group at the OpenGroup which can produce a document
which can be the basis of inclusion in the next revision and become an
OpenGroup specification earlier than that.

So, if anybody has a proposal for better interfaces let's hear them.
"Now" is a very good time to start working on this.


Re: Reiser4. BEST FILESYSTEM EVER.

2007-04-07 Thread Valdis . Kletnieks
On Sat, 07 Apr 2007 16:11:46 +0200, Krzysztof Halasa said:

> > Think about it,... read speeds that are some FOUR times the physical
> > disk read rate,... impossible without the use of compression (or
> > something similar).
> 
> It's really impossible with compression only unless you're writing
> only zeros or stuff alike. I don't know what bonnie uses for testing
> but real life data doesn't compress 4 times. Two times, sometimes,

All depends on your data.  From a recent "compress the old logs" job on
our syslog server:

/logs/lennier.cc.vt.edu/2007/03/maillog-2007-0308:   85.4% -- replaced with 
/logs/lennier.cc.vt.edu/2007/03/maillog-2007-0308.gz

And it wasn't a tiny file either - it's a busy mailserver, the logs run to
several hundred megabytes a day.  Syslogs *often* compress 90% or more,
meaning a 10X compression.
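
The conversion from a "percent saved" figure to a compression factor can be sketched as below. The function name is made up for illustration, and it assumes the percentage is the size reduction that gzip reports, strictly less than 100.

```c
/*
 * Convert a "percent saved" figure into a compression factor,
 * i.e. original size divided by compressed size.
 * Assumes 0 <= percent_saved < 100.
 */
static double compression_factor(double percent_saved)
{
	return 100.0 / (100.0 - percent_saved);
}
```

With this, 90% saved is a factor of 10, and the 85.4% reported for the maillog above works out to roughly 6.8X.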

> but then it will be typically slower than disk access (I mean read,
> as write will be much slower).

Actually, as far back as 1998 or so, I was able to document 20% *speedups*
on an AIX system that supported compressed file systems - and that was from
when a 133MHz PowerPC 604e was a *fast* machine.  Since then, CPUs have gotten
faster at a faster rate than disks have, even increasing the speedup.

The basic theory is that unless you're sitting close to 100% CPU, it is *faster*
to burn some CPU to compress/decompress a 4K chunk of data down to 2K, and then
move 2K to the disk drive, than it is to move 4K.  It's particularly noticeable
for larger files - if you can apply the compression to remove the need to move
2M of data faster than you can move 2M of data, you win.
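
A rough back-of-the-envelope model of that trade-off, as a sketch only. The function names and any throughput numbers plugged in are assumptions, not measurements from the thread, and the model pessimistically assumes CPU work and disk transfer do not overlap.

```c
/* Seconds to move `bytes` straight to or from disk at `disk_bps` bytes/s. */
static double raw_seconds(double bytes, double disk_bps)
{
	return bytes / disk_bps;
}

/*
 * Seconds to handle the same data when it is stored compressed:
 * CPU time to (de)compress all of it at `codec_bps` bytes/s, plus disk
 * time for the smaller on-disk size.  `factor` is original size divided
 * by compressed size (2.0 for the 4K -> 2K case).
 */
static double compressed_seconds(double bytes, double disk_bps,
				 double codec_bps, double factor)
{
	return bytes / codec_bps + (bytes / factor) / disk_bps;
}
```

For a 4K chunk, an assumed 50 MB/s disk and an assumed 200 MB/s codec at a factor of 2, the raw path costs about 82 microseconds and the compressed path about 61. In this model compression wins whenever codec_bps exceeds disk_bps * factor / (factor - 1), which for a factor of 2 means a codec at least twice as fast as the disk.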





Re: SD scheduler testing hitch

2007-04-07 Thread Mike Galbraith
On Sat, 2007-04-07 at 18:20 +0200, Mike Galbraith wrote:

> xx.c
> 
> #include <sys/time.h>
> #include <time.h>
> 
> #define max(a,b) ((a) > (b) ? (a) : (b))
> #define min(a,b) ((a) < (b) ? (a) : (b))
> 
> int main(void)
> {
> 	struct timeval then, now;
> 	struct timespec t = {0, 1000}, r;
> 
> 	for(;;) {
> 		int t1, t2;
> 		short i;
> 
> 		if (gettimeofday(&then, 0))
> 			break;
> 		for (i = 1; i > 0; i++);
> 		if (gettimeofday(&now, 0))
> 			break;
> 		t2 = max(then.tv_usec, now.tv_usec);
> 		t1 = min(then.tv_usec, now.tv_usec);
> 		if (t2 - t1 >= 1000 && nanosleep(&t, &r))
> 			break;
> 	}
> 	return 0;
> }
> }

I lowered the time to 500us, and ran at nice -10.. it starves tenpercent
here every time.  (ran as taskset -c 1 nice -n -10 ./fairtest)  The
starving 10% duty cycle task has trouble getting 1% CPU.

-Mike
// gcc -O2 -o tenp tenp.c -lrt
// code from interbench.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <time.h>
#include <sys/time.h>
/*
 * Start $forks processes that run for 10% cpu time each. Set this to
 * 15 * number of cpus for best effect.
 */
int forks = 1;

unsigned long run_us = 10, sleep_us;
unsigned long loops_per_ms;

void terminal_error(const char *name)
{
	fprintf(stderr, "\n");
	perror(name);
	exit (1);
}

unsigned long long get_nsecs(struct timespec *myts)
{
	if (clock_gettime(CLOCK_REALTIME, myts))
		terminal_error("clock_gettime");
	return (myts->tv_sec * 1000000000 + myts->tv_nsec );
}

void burn_loops(unsigned long loops)
{
	unsigned long i;

	/*
	 * We need some magic here to prevent the compiler from optimising
	 * this loop away. Otherwise trying to emulate a fixed cpu load
	 * with this loop will not work.
	 */
	for (i = 0 ; i < loops ; i++)
	 asm volatile("" : : : "memory");
}

/* Use this many usecs of cpu time */
void burn_usecs(unsigned long usecs)
{
	unsigned long ms_loops;

	ms_loops = loops_per_ms / 1000 * usecs;
	burn_loops(ms_loops);
}

void microsleep(unsigned long long usecs)
{
	struct timespec req, rem;

	rem.tv_sec = rem.tv_nsec = 0;

	req.tv_sec = usecs / 1000000;
	req.tv_nsec = (usecs - (req.tv_sec * 1000000)) * 1000;
continue_sleep:
	if ((nanosleep(&req, &rem)) == -1) {
		if (errno == EINTR) {
			if (rem.tv_sec || rem.tv_nsec) {
				req.tv_sec = rem.tv_sec;
				req.tv_nsec = rem.tv_nsec;
				goto continue_sleep;
			}
			goto out;
		}
		terminal_error("nanosleep");
	}
out:
	return;
}

/*
 * In an unoptimised loop we try to benchmark how many meaningless loops
 * per second we can perform on this hardware to fairly accurately
 * reproduce certain percentage cpu usage
 */
void calibrate_loop(void)
{
	unsigned long long start_time, loops_per_msec, run_time = 0,
		min_run_us = run_us;
	unsigned long loops;
	struct timespec myts;
	int i;

	printf("Calibrating loop\n");
	loops_per_msec = 100;
redo:
	/* Calibrate to within 1% accuracy (run_time is in nanoseconds) */
	while (run_time > 1010000 || run_time < 990000) {
		loops = loops_per_msec;
		start_time = get_nsecs(&myts);
		burn_loops(loops);
		run_time = get_nsecs(&myts) - start_time;
		loops_per_msec = (1000000 * loops_per_msec / run_time ? :
			loops_per_msec);
	}

	/* Rechecking after a pause increases reproducibility */
	microsleep(1);
	loops = loops_per_msec;
	start_time = get_nsecs(&myts);
	burn_loops(loops);
	run_time = get_nsecs(&myts) - start_time;

	/* Tolerate 5% difference on checking */
	if (run_time > 1050000 || run_time < 950000)
		goto redo;
	loops_per_ms=loops_per_msec;
	printf("Calibrating sleep interval\n");
	microsleep(1);
	/* Find the smallest time interval close to 1ms that we can sleep */
	for (i = 0; i < 100; i++) {
		start_time=get_nsecs(&myts);
		microsleep(1000);
		run_time=get_nsecs(&myts)-start_time;
		run_time /= 1000;
		if (run_time < run_us && run_us > 1000)
			run_us = run_time;
	}
	/* Then set run_us to that duration and sleep_us to 9 x that */
	sleep_us = run_us * 9;
	printf("Calibrating run interval\n");
	microsleep(1);
	/* Do a few runs to see what really gets us run_us runtime */
	for (i = 0; i < 100; i++) {
		start_time=get_nsecs(&myts);
		burn_usecs(run_us);
		run_time=get_nsecs(&myts)-start_time;
		run_time /= 1000;
		if (run_time < min_run_us && run_time > run_us)
			min_run_us = run_time;
	}
	if (min_run_us < run_us)
		run_us = run_us * run_us / min_run_us;
	printf("Each fork will run for %lu usecs and sleep for %lu usecs\n",
		run_us, sleep_us);
}


#define max(a,b) ((a) > (b) ? (a) : (b))
#define min(a,b) ((a) < (b) ? (a) : (b))

void steal(void)
{
	struct timeval then, now;
	struct timespec t = {0, 500}, r;

	for(;;) {
		int t1, t2;
		short i;

		if (gettimeofday(&then, 0))
			break;
		for (i = 1; i > 0; i++);
		if (gettimeofday(&now, 0))
			break;
		t2 = max(then.tv_usec, now.tv_usec);
		t1 = min(then.tv_usec, now.tv_usec);
		if (t2 - t1 >= 500 && nanosleep(&t, &r))
			break;
	}
}

int main(void){
	int i, child;

	calibrate_loop();
	pr

Re: [patch 2/4] clean up identify_cpu

2007-04-07 Thread Jeremy Fitzhardinge
Andrew Morton wrote:
> x86_64 uses this too.
>
> WARNING: arch/x86_64/kernel/built-in.o - Section mismatch: reference to 
> .init.text:mtrr_bp_init from .text.identify_cpu after 'identify_cpu' (at 
> offset 0x655)
>   

OK, two patches to follow: an x86-64 variant of the bugs.h cleanup, and a
replacement for this patch. I don't have an x86-64 compile environment on
hand, so the 64 bits are completely untested, but they *look* like they
should work.

J


Re: Reiser4. BEST FILESYSTEM EVER.

2007-04-07 Thread Valdis . Kletnieks
On Sat, 07 Apr 2007 16:11:46 +0200, Krzysztof Halasa said:
>
> Gzip - 3 files (zeros only, raw DV data from video camera, x86_64
> kernel rpm file), 10 MB of data (10*1024*1024),
> $ l -Ggh zeros dv bin
> -rw-r--r-- 1 10M Apr  7 15:30 bin
> -rw-r--r-- 1 10M Apr  7 15:31 dv
> -rw-r--r-- 1 10M Apr  7 15:31 zeros

> $ l -Ggh zeros.gz dv.gz bin.gz
> -rw-r--r-- 1  10K Apr  7 15:31 zeros.gz
> -rw-r--r-- 1 9.1M Apr  7 15:31 dv.gz
> -rw-r--r-- 1 9.3M Apr  7 15:30 bin.gz
> 
> ... and though the numbers may still sound impressive, space savings
> are less than 10%.

I am quite sure that the kernel RPM file is *already* compressed, at least
somewhat.  Otherwise, it's hard to explain this:

-rw-r--r--1 529  263 17835757   Apr  5 00:19   
kernel-2.6.20-1.3045.fc7.x86_64.rpm

% du -s /lib/modules/2.6.20-1.3038.fc7/
76436   /lib/modules/2.6.20-1.3038.fc7/

and it can't all be slack space at ends of files:

% find /lib/modules/2.6.20-1.3038.fc7/ -type f | wc -l
1482

Even on a 4K filesystem, the *max* wasted slack would be about 6M (1482 files
times just under 4K each).
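
The slack bound is simple enough to check mechanically. The helper below is an illustrative sketch (its name is not from the thread): each file can waste at most one block minus one byte.

```c
/*
 * Upper bound on slack space: every file wastes at most
 * (block_size - 1) bytes in its final block.
 */
static unsigned long long max_slack_bytes(unsigned long long nfiles,
					  unsigned long long block_size)
{
	return nfiles * (block_size - 1);
}
```

For 1482 files on 4096-byte blocks this gives 6068790 bytes, a little under 6M. That is the same order of magnitude as the estimate above and nowhere near enough to explain the gap between 76436K on disk and a 17M RPM.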

And what do you know - if you tar.gz that /lib/modules:

% tar czf /tmp/kern.tar.gz /lib/modules/2.6.20-1.3038.fc7/
tar: Removing leading `/' from member names
% ls -l /tmp/kern.tar.gz 
-rw-r--r-- 1 valdis valdis 15506359 2007-04-07 13:19 /tmp/kern.tar.gz

The *compressed* tar is about 15M (remember the .rpm contained a 2M vmlinuz
as well - that's compressed too).  So we're right up to the 17M of the .rpm,
which indicates that the RPM is compressed at a factor close to tar.gz.

I'd not be surprised to find out that your digital-video also contains
at least some light compression - if it's mpeg or similar, that's already
had some *heavy* compression done to it.




[PATCH 1/2] Clean up asm-x86_64/bugs.h

2007-04-07 Thread Jeremy Fitzhardinge
Most of asm-x86_64/bugs.h is code which should be in a C file, so put it there.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: Andi Kleen <[EMAIL PROTECTED]>
Cc: Linus Torvalds <[EMAIL PROTECTED]>

---
 arch/x86_64/kernel/Makefile  |3 ++-
 arch/x86_64/kernel/bugs.c|   28 
 include/asm-x86_64/alternative.h |1 +
 include/asm-x86_64/bugs.h|   30 --
 4 files changed, 35 insertions(+), 27 deletions(-)

===
--- a/arch/x86_64/kernel/Makefile
+++ b/arch/x86_64/kernel/Makefile
@@ -8,7 +8,8 @@ obj-y   := process.o signal.o entry.o trap
ptrace.o time.o ioport.o ldt.o setup.o i8259.o sys_x86_64.o \
x8664_ksyms.o i387.o syscall.o vsyscall.o \
setup64.o bootflag.o e820.o reboot.o quirks.o i8237.o \
-   pci-dma.o pci-nommu.o alternative.o hpet.o tsc.o sched-clock.o
+   pci-dma.o pci-nommu.o alternative.o hpet.o tsc.o sched-clock.o \
+   bugs.o
 
 obj-$(CONFIG_STACKTRACE)   += stacktrace.o
 obj-$(CONFIG_X86_MCE)  += mce.o therm_throt.o
===
--- /dev/null
+++ b/arch/x86_64/kernel/bugs.c
@@ -0,0 +1,28 @@
+/*
+ *  arch/x86_64/kernel/bugs.c
+ *
+ *  Copyright (C) 1994  Linus Torvalds
+ *  Copyright (C) 2000  SuSE
+ *
+ * This is included by init/main.c to check for architecture-dependent bugs.
+ *
+ * Needs:
+ * void check_bugs(void);
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+void __init check_bugs(void)
+{
+   identify_cpu(&boot_cpu_data);
+#if !defined(CONFIG_SMP)
+   printk("CPU: ");
+   print_cpu_info(&boot_cpu_data);
+#endif
+   alternative_instructions();
+}
===
--- a/include/asm-x86_64/alternative.h
+++ b/include/asm-x86_64/alternative.h
@@ -16,6 +16,7 @@ struct alt_instr {
u8  pad[5];
 };
 
+extern void alternative_instructions(void);
 extern void apply_alternatives(struct alt_instr *start, struct alt_instr *end);
 
 struct module;
===
--- a/include/asm-x86_64/bugs.h
+++ b/include/asm-x86_64/bugs.h
@@ -1,28 +1,6 @@
-/*
- *  include/asm-x86_64/bugs.h
- *
- *  Copyright (C) 1994  Linus Torvalds
- *  Copyright (C) 2000  SuSE
- *
- * This is included by init/main.c to check for architecture-dependent bugs.
- *
- * Needs:
- * void check_bugs(void);
- */
+#ifndef _ASM_X86_64_BUGS_H
+#define _ASM_X86_64_BUGS_H
 
-#include 
-#include 
-#include 
-#include 
+void check_bugs(void);
 
-extern void alternative_instructions(void);
-
-static void __init check_bugs(void)
-{
-   identify_cpu(&boot_cpu_data);
-#if !defined(CONFIG_SMP)
-   printk("CPU: ");
-   print_cpu_info(&boot_cpu_data);
-#endif
-   alternative_instructions(); 
-}
+#endif /* _ASM_X86_64_BUGS_H */



[PATCH 2/2] x86: clean up identify_cpu

2007-04-07 Thread Jeremy Fitzhardinge
identify_cpu() is used to identify both the boot CPU and secondary
CPUs, but it performs some actions which only apply to the boot CPU.
Those functions are therefore really __init functions, but because
they're called by identify_cpu(), they must be marked __cpuinit.

This patch splits identify_cpu() into identify_boot_cpu() and
identify_secondary_cpu(), and calls the appropriate init functions
from each.  Also, identify_boot_cpu() and all the functions it
dominates are marked __init.

The same change applies to both i386 and x86_64, and both have to be
changed together because they share the mtrr setup code.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: Andi Kleen <[EMAIL PROTECTED]>

---
 arch/i386/kernel/cpu/common.c|   41 +
 arch/i386/kernel/cpu/mtrr/main.c |4 +--
 arch/i386/kernel/smpboot.c   |2 -
 arch/i386/kernel/sysenter.c  |2 -
 arch/x86_64/kernel/bugs.c|2 -
 arch/x86_64/kernel/setup.c   |   47 ++
 arch/x86_64/kernel/smpboot.c |2 -
 include/asm-i386/processor.h |3 +-
 include/asm-x86_64/processor.h   |3 +-
 9 files changed, 70 insertions(+), 36 deletions(-)

===
--- a/arch/i386/kernel/cpu/common.c
+++ b/arch/i386/kernel/cpu/common.c
@@ -390,7 +390,7 @@ __setup("serialnumber", x86_serial_nr_se
 /*
  * This does the hard work of actually picking apart the CPU stuff...
  */
-void __cpuinit identify_cpu(struct cpuinfo_x86 *c)
+static void __cpuinit identify_cpu(struct cpuinfo_x86 *c)
 {
int i;
 
@@ -486,30 +486,43 @@ void __cpuinit identify_cpu(struct cpuin
for (i = 0; i < NCAPINTS; i++)
printk(" %08lx", c->x86_capability[i]);
printk("\n");
-
+}
+
+void __init identify_boot_cpu(void)
+{
+   identify_cpu(&boot_cpu_data);
+
+   /* Init Machine Check Exception if available. */
+   mcheck_init(&boot_cpu_data);
+
+   sysenter_setup();
+   enable_sep_cpu();
+
+   mtrr_bp_init();
+}
+
+void __cpuinit identify_secondary_cpu(struct cpuinfo_x86 *c)
+{
+   int i;
+
+   BUG_ON(c == &boot_cpu_data);
+   identify_cpu(c);
/*
 * On SMP, boot_cpu_data holds the common feature set between
 * all CPUs; so make sure that we indicate which features are
 * common between the CPUs.  The first time this routine gets
 * executed, c == &boot_cpu_data.
-*/
-   if ( c != &boot_cpu_data ) {
-   /* AND the already accumulated flags with these */
-   for ( i = 0 ; i < NCAPINTS ; i++ )
-   boot_cpu_data.x86_capability[i] &= c->x86_capability[i];
-   }
+* AND the already accumulated flags with these
+*/
+   for ( i = 0 ; i < NCAPINTS ; i++ )
+   boot_cpu_data.x86_capability[i] &= c->x86_capability[i];
 
/* Init Machine Check Exception if available. */
mcheck_init(c);
 
-   if (c == &boot_cpu_data)
-   sysenter_setup();
enable_sep_cpu();
 
-   if (c == &boot_cpu_data)
-   mtrr_bp_init();
-   else
-   mtrr_ap_init();
+   mtrr_ap_init();
 }
 
 #ifdef CONFIG_X86_HT
===
--- a/arch/i386/kernel/cpu/mtrr/main.c
+++ b/arch/i386/kernel/cpu/mtrr/main.c
@@ -571,7 +571,7 @@ extern void cyrix_init_mtrr(void);
 extern void cyrix_init_mtrr(void);
 extern void centaur_init_mtrr(void);
 
-static void __cpuinit init_ifs(void)
+static void __init init_ifs(void)
 {
 #ifndef CONFIG_X86_64
amd_init_mtrr();
@@ -639,7 +639,7 @@ static struct sysdev_driver mtrr_sysdev_
  * initialized (i.e. before smp_init()).
  * 
  */
-void __cpuinit mtrr_bp_init(void)
+void __init mtrr_bp_init(void)
 {
init_ifs();
 
===
--- a/arch/i386/kernel/smpboot.c
+++ b/arch/i386/kernel/smpboot.c
@@ -157,7 +157,7 @@ static void __cpuinit smp_store_cpu_info
 
*c = boot_cpu_data;
if (id!=0)
-   identify_cpu(c);
+   identify_secondary_cpu(c);
/*
 * Mask B, Pentium, but not Pentium MMX
 */
===
--- a/arch/i386/kernel/sysenter.c
+++ b/arch/i386/kernel/sysenter.c
@@ -68,7 +68,7 @@ extern const char vsyscall_sysenter_star
 extern const char vsyscall_sysenter_start, vsyscall_sysenter_end;
 static struct page *syscall_pages[1];
 
-int __cpuinit sysenter_setup(void)
+int __init sysenter_setup(void)
 {
void *syscall_page = (void *)get_zeroed_page(GFP_ATOMIC);
syscall_pages[0] = virt_to_page(syscall_page);
===
--- a/arch/x86_64/kernel/bugs.c
+++ b/arch/x86_64/kernel/bugs.c
@@ -19,7 +19,7 @@
 
 void __init check_bugs(void)
 {
-   identify_cpu(&boot_cpu_data);
+   i

Re: [PATCH] console UTF-8 fixes

2007-04-07 Thread Egmont Koblinger
On Sat, Apr 07, 2007 at 01:00:48PM +0200, Jan Engelhardt wrote:

Hi,

> Please, no dot, and no inverse color.
> Imagine someone had the following bitmap for :

No dot, I'm already convinced. To clarify the inverse thingy:

This is what the current kernel does:
  1) tries to display the desired symbol
  2) if it fails, tries to display U+FFFD (which usually looks similar to an
 inverted question mark)
  3) if this fails again then displays a normal '?'
 (or a different symbol due to a bug discussed below)

Here's my proposal. This only alters the 3rd step, not the first two:
  1) tries to display the desired symbol
  2) if it fails, tries to display U+FFFD, still with _normal_ attributes
  3) if this fails then display an ascii '?' with inverted attributes

So you won't get "double" inversion. If you do have U+FFFD in your font then
this will introduce no change. If you don't have U+FFFD, you'll see inverse
question marks instead of normal ones.
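
The proposed three-step fallback can be sketched in userspace C as follows. This is not the patch itself: the struct, the function names and the two toy "fonts" are invented for illustration, and the real console code looks glyphs up through its own unicode map.

```c
#include <stdbool.h>
#include <stdint.h>

#define REPLACEMENT_CHAR 0xFFFDu

struct glyph_choice {
	uint32_t codepoint;	/* code point that actually gets drawn */
	bool inverse;		/* draw it with inverted attributes */
};

/* Font query callback: true if the font has a glyph for cp. */
typedef bool (*has_glyph_fn)(uint32_t cp);

/* Two toy fonts for trying the fallback out. */
static bool ascii_only_font(uint32_t cp)
{
	return cp < 0x80;
}

static bool font_with_fffd(uint32_t cp)
{
	return cp < 0x80 || cp == REPLACEMENT_CHAR;
}

static struct glyph_choice pick_glyph(uint32_t wanted, has_glyph_fn has_glyph)
{
	struct glyph_choice c = { wanted, false };

	if (has_glyph(wanted))			/* 1) the desired symbol */
		return c;

	if (has_glyph(REPLACEMENT_CHAR)) {	/* 2) U+FFFD, normal attributes */
		c.codepoint = REPLACEMENT_CHAR;
		return c;
	}

	c.codepoint = '?';			/* 3) ASCII '?', inverted */
	c.inverse = true;
	return c;
}
```

Asking for U+0171 (u with double acute) from font_with_fffd yields U+FFFD with normal attributes, while ascii_only_font falls through to an inverted '?'. In neither case is a glyph with a different accent substituted.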


> I blame your latin2 unicode map. (See above about 'Û'.)

There's nothing wrong with my latin2 unicode map, and I've located and
changed the part _in the kernel_ that displays a false glyph using the
algorithm I've outlined. It just uses "the glyph at that code position
within the glyph table" as a fallback, which might be okay in 8-bit mode
(and I haven't modified the behavior in that case), but I got rid of this
behavior in UTF-8 mode since it's definitely a fault in the world of
Unicode.

> It should perhaps display a regular 'u' if it cannot display 'û',

I rather think it should display U+FFFD but YMMV.

> but definitely not 'ü' (which is not called a double accent, btw).

This is not the character I've been talking about, I actually _did_ talk
about u with double acute accent (ű - you might not have seen this character
so far, AFAIK it's only used in Hungarian, no other languages). But we agree
that the kernel definitely shouldn't display a character with a different
accent on it. This is one of the bugs my patch addresses.


bye,

Egmont


[PATCH] partitions: Enhance Kconfig help text for EESOX and MSDOS formats

2007-04-07 Thread John Anthony Kazos Jr.
From: John Anthony Kazos Jr. <[EMAIL PROTECTED]>

Adds help text for ACORN_PARTITION_EESOX and improves help text for 
MSDOS_PARTITION in fs/partitions/Kconfig.

Signed-off-by: John Anthony Kazos Jr. <[EMAIL PROTECTED]>

---

Applied against Linux v2.6.20.6.

--- linux-2.6.20.6-orig/fs/partitions/Kconfig	2007-04-06 16:02:48.000000000 -0400
+++ linux-2.6.20.6-mod/fs/partitions/Kconfig	2007-04-07 13:22:17.000000000 -0400
@@ -32,6 +32,10 @@ config ACORN_PARTITION_EESOX
bool "EESOX partition support" if PARTITION_ADVANCED
default y if ARCH_ACORN
depends on ACORN_PARTITION
+   help
+ EESOX SCSI card on-disk partition format support for Acorn
+ systems. If you have one of these cards, or want to use a disk
+ written by one, say Y.
 
 config ACORN_PARTITION_ICS
bool "ICS partition support" if PARTITION_ADVANCED
@@ -108,7 +112,11 @@ config MSDOS_PARTITION
bool "PC BIOS (MSDOS partition tables) support" if PARTITION_ADVANCED
default y
help
- Say Y here.
+ Standard PC-compatible partition table support for Linux. Used by
+ i386 systems, Linux/Windows dual-boot systems, and many others.
+ Unless you are certain your system does not use this partition
+ table format, and you're not using any disks from a system that
+ does, say Y.
 
 config BSD_DISKLABEL
bool "BSD disklabel (FreeBSD partition tables) support"


Re: [PATCH 3/7] Containers (V8): Add generic multi-subsystem API to containers

2007-04-07 Thread Paul Menage

On 4/6/07, Srivatsa Vaddagiri <[EMAIL PROTECTED]> wrote:
> On Fri, Apr 06, 2007 at 04:32:24PM -0700, [EMAIL PROTECTED] wrote:
> > +static int attach_task(struct container *cont, struct task_struct *tsk)
> >  {
>
> [snip]
>
> > + task_lock(tsk);
>
> You need to check here if task state is PF_EXITING and fail with
> -ESRCH if so? Otherwise we risk breaking refcount on
> init_container_group.


Yes, I think you're right; I've now changed it to this in my tree:

	task_lock(tsk);
	if (tsk->flags & PF_EXITING) {
		task_unlock(tsk);
		put_container_group(newcg);
		return -ESRCH;
	}
	rcu_assign_pointer(tsk->containers, newcg);
	task_unlock(tsk);

Thanks,

Paul


Re: Reiser4. BEST FILESYSTEM EVER.

2007-04-07 Thread Valdis . Kletnieks
On Fri, 06 Apr 2007 19:47:36 PDT, [EMAIL PROTECTED] said:
> On Fri, 6 Apr 2007 11:21:19 -0400, "Jan Harkes" <[EMAIL PROTECTED]>

> > With compression there is a pretty high probability that one corrupted
> > byte or disk block will result in loss of a considerably larger amount
> > of data. 
> 
> Bad blocks are NOT dealt with by the filesystem,... so your comment is
> irrelevant, or just plain wrong.
> 
> If your filesystem is writing to bad blocks, then throw away your
> operating system.

You know... occasionally, blocks go bad *after* you write to them.  If
you have an uncompressed filesystem, it's often possible to recover most
of the file, and just have a few 512-byte blocks of zeros, simply by
doing something like 'dd if=bad.file of=recovered.file bs=512 conv=noerror,sync'
or careful applications of 'skip=N'.  If it's compressed, you usually
can't recover the rest of a compression group if a previous block is lost.
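
The block-by-block salvage described above can also be sketched in a few lines of user-space C. This is only an illustration under assumptions: salvage_blocks() is a hypothetical helper name, it relies on POSIX pread()/pwrite(), and it simply zero-fills any 512-byte block whose read fails, roughly what dd does when told to continue past errors and pad short blocks.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE 512

/*
 * Copy total_bytes from in_fd to out_fd one block at a time.
 * A block that cannot be read is written out as zeros instead.
 * Returns the number of zero-filled blocks, or -1 if a write fails.
 */
static long salvage_blocks(int in_fd, int out_fd, off_t total_bytes)
{
    unsigned char buf[BLOCK_SIZE];
    long bad = 0;

    assert(total_bytes >= 0);

    for (off_t off = 0; off < total_bytes; off += BLOCK_SIZE) {
        size_t want = (total_bytes - off < BLOCK_SIZE)
                          ? (size_t)(total_bytes - off)
                          : BLOCK_SIZE;
        ssize_t got = pread(in_fd, buf, want, off);

        if (got < 0) {
            /* Unreadable block: substitute zeros and keep going. */
            memset(buf, 0, want);
            bad++;
        } else if ((size_t)got < want) {
            /* Short read: pad the remainder of the block with zeros. */
            memset(buf + got, 0, want - (size_t)got);
        }

        if (pwrite(out_fd, buf, want, off) != (ssize_t)want)
            return -1;
    }
    return bad;
}
```

Because every block is read and written at its own offset, one bad block costs at most 512 bytes of the output. With a compressed file, the blocks after the damaged one generally cannot be decoded even though they were read successfully.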

(And for those who talk about backups - yes, taking backups is good.
However, it's the rare laptop or desktop machine that can afford the
luxury of RAID disks, and backups usually happen once a night, if that
often.  This means that if you've been working hard on something important
all day, and the disk blows chunks at 4:30PM, you *will* be suddenly very
concerned over exactly how much you can recover off the failing drive.)

And yes, I'd *love* to have all my users connected to nice SAN systems that do
snapshotting and remote replication to DR sites and all that - but have you
ever *priced* a petabyte of SAN storage, the NAS gateways to serve it to users,
and upgrading several tens of thousands of network ports to Gig-E? Hint -
US$1M would get us through a pilot, and probably $5M and up to *start*
deployment. Anybody wanna buy us an EMC DMX-3? :)

http://www.emc.com/products/systems/symmetrix/DMX_series/DMX3.jsp





[PATCH] ip_tables.h

2007-04-07 Thread Patrick Ale

Hi lads,

I had some problems compiling the external netfilter modules due to
missing definitions.
I googled a lot, saw a lot of people having the same problems but no
real answer to how to fix it.

So I made a little patch that makes things work for me, at least.

Modules that work after applying the patch are the geoip module, the
connlimit module, and probably more, but I didn't test them.

Please note: I am not a coder or a maintainer, and I am just happy that I
didn't break anything, so please don't consider this a proposal for
inclusion in the kernel. I am simply in favor of sharing what helped me
get things to work. If I can help others with it, or if it is interesting
material for inclusion, even better :)

So, the patch is attached in this email and can also be found on
http://www.patrickale.eu/documents/archives/patches/ip_tables.h.diff


I hope this helps a person or two.


Patrick
--- include/linux/netfilter_ipv4/ip_tables.h.orig   2007-04-07 20:30:25.344365707 +0200
+++ include/linux/netfilter_ipv4/ip_tables.h   2007-04-07 20:34:05.076887550 +0200
@@ -34,6 +34,12 @@
 #define ipt_table xt_table
 #define ipt_get_revision xt_get_revision

+#define ipt_register_match(mtch)\
+({  (mtch)->family = AF_INET;   \
+xt_register_match(mtch); })
+#define ipt_unregister_match(mtch) xt_unregister_match(mtch)
+
+
 /* Yes, Virginia, you have to zero the padding. */
 struct ipt_ip {
/* Source and destination IP addr */
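
The macro added by the patch above relies on a GNU C statement expression: the `({ ... })` block runs its statements in order and evaluates to the last one. A stand-alone sketch of the same shape follows. The stub struct, the stub register function and the `_STUB` names are all hypothetical so that it builds outside the kernel tree (with gcc or clang); it is not the netfilter API itself.

```c
#include <assert.h>

#define AF_INET_STUB 2          /* illustrative stand-in for AF_INET */

struct match_stub {
    const char *name;
    int family;
};

/* Remembers the family of the last match that was registered. */
static int registered_family = -1;

/* Stand-in for a generic, family-agnostic registration function. */
static int register_match_stub(struct match_stub *m)
{
    assert(m != 0);
    registered_family = m->family;
    return 0;
}

/*
 * Compatibility wrapper in the style of the patch: fill in the address
 * family first, then forward to the generic function. The value of the
 * whole ({ ... }) expression is the return value of the last statement.
 */
#define ipv4_register_match_stub(m)     \
({  (m)->family = AF_INET_STUB;         \
    register_match_stub(m); })
```

Note that the macro argument is evaluated twice, so it should not have side effects. A static inline function would avoid that and would not depend on a compiler extension.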


