Re: [PATCH 1/1] fs: add 4th case to do_path_lookup

2007-05-04 Thread Andrew Morton
On Sun, 29 Apr 2007 23:30:12 -0400 Josef Sipek <[EMAIL PROTECTED]> wrote:

> Signed-off-by: Josef 'Jeff' Sipek <[EMAIL PROTECTED]>
> 
> diff --git a/fs/namei.c b/fs/namei.c
> index 2995fba..1516a9b 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -1125,6 +1125,10 @@ static int fastcall do_path_lookup(int dfd, const char *name,
>   nd->mnt = mntget(fs->rootmnt);
>   nd->dentry = dget(fs->root);
>   read_unlock(&fs->lock);
> + } else if (flags & LOOKUP_ONE) {
> + /* nd->mnt and nd->dentry already set, just grab references */
> + mntget(nd->mnt);
> + dget(nd->dentry);
>   } else if (dfd == AT_FDCWD) {
>   read_lock(&fs->lock);
>   nd->mnt = mntget(fs->pwdmnt);
> diff --git a/include/linux/namei.h b/include/linux/namei.h
> index 92b422b..aa89d97 100644
> --- a/include/linux/namei.h
> +++ b/include/linux/namei.h
> @@ -48,6 +48,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
>   *  - internal "there are more path components" flag
>   *  - locked when lookup done with dcache_lock held
>   *  - dentry cache is untrusted; force a real lookup
> + *  - lookup path from given dentry/vfsmount pair
>   */
>  #define LOOKUP_FOLLOW 1
>  #define LOOKUP_DIRECTORY  2
> @@ -55,6 +56,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
>  #define LOOKUP_PARENT 16
>  #define LOOKUP_NOALT 32
>  #define LOOKUP_REVAL 64
> +#define LOOKUP_ONE  128

Well the patch passes my too-small-to-care-about test ;)

Unless someone objects I'd suggest that you add it to the unionfs tree.
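
For illustration only, a rough sketch (not part of the patch) of how a caller that
already holds a dentry/vfsmount pair -- a stacking filesystem such as unionfs --
might use the new flag.  The function name here is made up and error handling is
omitted:

#include <linux/namei.h>
#include <linux/dcache.h>
#include <linux/mount.h>

static int lookup_in_lower(struct dentry *lower_dentry,
			   struct vfsmount *lower_mnt,
			   const char *name, struct nameidata *nd)
{
	/* Pre-seed the nameidata so do_path_lookup() takes the new branch. */
	nd->dentry = lower_dentry;
	nd->mnt = lower_mnt;

	/* path_lookup() hands this straight to do_path_lookup(). */
	return path_lookup(name, LOOKUP_ONE | LOOKUP_FOLLOW, nd);
}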


Re: [PATCH 1/5] Power Management: use mutexes instead of semaphores

2007-05-04 Thread Matthias Kaehlcke
El Thu, May 03, 2007 at 10:54:32PM -0700 Andrew Morton ha dit:

> On Fri, 27 Apr 2007 10:43:22 +0200 Matthias Kaehlcke <[EMAIL PROTECTED]> 
> wrote:
> 
> > the Power Management code uses semaphores as mutexes. use the mutex
> > API instead of the (binary) semaphores
> 
> I know it's a little thing, but given a choice between
> 
> a) changelogs which use capital letters and fullstops and
> 
> b) changelogs which do not,
> 
> I think a) gives a better result.

thanks for your suggestion, i'll take it into account for future patches
 
> I note that none of these patches added a #include .  Each C
> file which uses mutexes should do that, rather than relying upon accidental
> nested includes.  I hope you're checking for that.

initially i added the include line (i think at least one patch still
contains it), but then i realized that in most cases the original code
doesn't include semaphore.h and i (mis-)interpreted that it should be
handled the same way (relying upon nested includes) for mutexes. 

do you want me to send you a version of the patches containing the
include?

regards

-- 
Matthias Kaehlcke
Linux Application Developer
Barcelona

  The assumption that what currently exists must necessarily
  exist is the acid that corrodes all visionary thinking

using free software / Debian GNU/Linux | http://debian.org
gpg --keyserver pgp.mit.edu --recv-keys 47D8E5D4


Re: Routing 600+ vlan's via linux problems (looks like arp problems)

2007-05-04 Thread Miquel van Smoorenburg
In article <[EMAIL PROTECTED]> you write:
>Its a Juniper M7i
>It comes default with a 5400 rpm laptop 2.5" harddrive but now we
>bought a more robust "server" 2.5" harddrive. It still barfs on the OS
>install, so the linux is doing all the job now. Will get a juniper guy
>to come and fix :)
>
>As a side note, i'm starting to wonder if it was worth the $20k when i
>could just have a linux machine to do the job with a clone for backup
>;)

Well, the features and esp. the JunOS cli are worth a lot. And
if you need to route more than say 3 gbit/s, PC hardware just
won't cut it.

Then again, if you like the CLI, don't need to route more
than 1 gbit/s, and don't need to do fancy stuff like MPLS,
QoS or shaping, there's always solutions like http://www.vyatta.com/

Mike.


Re: [PATCH] UBI: dereference after kfree in create_vtbl

2007-05-04 Thread Artem Bityutskiy
On Thu, 2007-05-03 at 11:49 -0400, Florin Malita wrote:
> Coverity (CID 1614) spotted new_seb being dereferenced after kfree() in 
> create_vtbl's write_error path.

Applied with minor trailing white-space cleanup, thanks.

-- 
Best regards,
Artem Bityutskiy (Битюцкий Артём)



Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-05-04 Thread Andrew Morton
On Fri, 04 May 2007 10:57:12 +0400 Alex Tomas <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> > On Fri, 04 May 2007 10:18:12 +0400 Alex Tomas <[EMAIL PROTECTED]> wrote:
> > 
> >> Andrew Morton wrote:
> >>> Yes, there can be issues with needing to allocate journal space within the
> >>> context of a commit.  But
> >> no-no, this isn't required. we only need to mark pages/blocks within
> >> transaction, otherwise race is possible when we allocate blocks in 
> >> transaction,
> >> then transacton starts to commit, then we mark pages/blocks to be flushed
> >> before commit.
> > 
> > I don't understand.  Can you please describe the race in more detail?
> 
> if I understood your idea right, then in data=ordered mode, commit thread 
> writes
> all dirty mapped blocks before real commit.
> 
> say, we have two threads: t1 is a thread doing flushing and t2 is a commit 
> thread
> 
> t1                                      t2
> find dirty inode I
> find some dirty unallocated blocks
> journal_start()
> allocate blocks
> attach them to I
> journal_stop()

I'm still not understanding.  The terms you're using are a bit ambiguous.

What does "find some dirty unallocated blocks" mean?  Find a page which is
dirty and which does not have a disk mapping?

Normally the above operation would be implemented via
ext4_writeback_writepage(), and it runs under lock_page().


>                                         going to commit
>                                         find inode I dirty
>                                         do NOT find these blocks because they're
>                                           allocated only, but pages/bhs aren't
>                                           mapped to them
>                                         start commit

I think you're assuming here that commit would be using ->t_sync_datalist
to locate dirty buffer_heads.

But under this proposal, t_sync_datalist just gets removed: the new
ordered-data mode _only_ need to do the sb->inode->page walk.  So if I'm
understanding you, the way in which we'd handle any such race is to make
kjournald's writeback of the dirty pages block in lock_page().  Once it
gets the page lock it can look to see if some other thread has mapped the
page to disk.



It may turn out that kjournald needs a private way of getting at the
I_DIRTY_PAGES inodes to do this properly, but I don't _think_ so.  If we
had the radix-tree-of-dirty-inodes thing then that's easy enough to do
anyway, with a tagged search.  But I expect that a single pass through the
superblock's dirty inodes would suffice for ordered-data.  Files which
have chattr +j would screw things up, as usual.
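
For what it's worth, a rough sketch (not from any patch, with the locking details
glossed over) of the sb->inode->page walk being described here -- the function
name and structure are illustrative only:

#include <linux/fs.h>
#include <linux/kernel.h>
#include <linux/writeback.h>

static void commit_write_ordered_data(struct super_block *sb)
{
	struct inode *inode;

	/* Walk the superblock's dirty inodes... */
	list_for_each_entry(inode, &sb->s_dirty, i_list) {
		struct writeback_control wbc = {
			.sync_mode	= WB_SYNC_ALL,
			.nr_to_write	= LONG_MAX,
		};

		/*
		 * ...and push out their dirty pages.  The writeback path
		 * takes lock_page() per page, so this blocks until any
		 * concurrent mapping of the page to disk has finished.
		 */
		do_writepages(inode->i_mapping, &wbc);
	}
}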

I assume (hope) that your delayed allocation code implements
->writepages()?  Doing the allocation one-page-at-a-time sounds painful...

> 
> map pages/bhs to just allocate blocks
> 
> 
> so, either we mark pages/bhs someway within journal_start()--journal_stop() or
> commit thread should do lookup for all dirty pages. the latter doesn't sound 
> nice, IMHO.
> 

I don't think I'm understanding you fully yet.


Re: [PATCH 1/5] Power Management: use mutexes instead of semaphores

2007-05-04 Thread Andrew Morton
On Fri, 4 May 2007 09:08:40 +0200 Matthias Kaehlcke <[EMAIL PROTECTED]> wrote:

> > I note that none of these patches added a #include .  Each C
> > file which uses mutexes should do that, rather than relying upon accidental
> > nested includes.  I hope you're checking for that.
> 
> initially i added the include line (i think at least one patch still
> contains it), but then i realized that in most cases the original code
> doesn't include semaphore.h and i (mis-)interpreted that it should be
> handled the same way (relying upon nested includes) for mutexes. 
> 
> do you want me to send you a version of the patches containing the
> include?

erm, is OK, I'll make the changes.


Re: [PATCH] Rewrite the MAJOR() macro as a call to imajor().

2007-05-04 Thread Robert P. J. Day
On Thu, 3 May 2007, Andrew Morton wrote:

> On Sat, 28 Apr 2007 06:23:54 -0400 (EDT) "Robert P. J. Day" <[EMAIL 
> PROTECTED]> wrote:
>
> > Replace the MAJOR() macro invocation with a call to the inline
> > imajor() routine.
> >
> > Signed-off-by: Robert P. J. Day <[EMAIL PROTECTED]>
> >
> > ---
> >
> > diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> > index 6b5b642..08da15b 100644
> > --- a/drivers/block/loop.c
> > +++ b/drivers/block/loop.c
> > @@ -710,7 +710,7 @@ static inline int is_loop_device(struct file *file)
> >  {
> > struct inode *i = file->f_mapping->host;
> >
> > -   return i && S_ISBLK(i->i_mode) && MAJOR(i->i_rdev) == LOOP_MAJOR;
> > +   return i && S_ISBLK(i->i_mode) && imajor(i) == LOOP_MAJOR;
> >  }
>
> there's no runtime change, and I count a couple hundred MAJORs in
> the tree.
>
> I don't want to receive 200 one-line patches please.  If you're
> going to do this then please do decent-sized per-subsystem patches
> and see if you can persuade the subsystem maintainers to take them
> directly.

  you misunderstand the point of that patch.  it's not to replace all
instances of MAJOR(), only those that are being used in specifically
that context -- to extract the major (or minor) number from an inode,
and there's a *very* small number of those:

$ grep -Er "(MINOR|MAJOR).*i_rdev" *
arch/sh/boards/landisk/landisk_pwb.c:   minor = MINOR(inode->i_rdev);
arch/sh/boards/landisk/landisk_pwb.c:   minor = MINOR(inode->i_rdev);
drivers/block/loop.c:   return i && S_ISBLK(i->i_mode) && MAJOR(i->i_rdev) == LOOP_MAJOR;
drivers/media/video/ivtv/ivtv-fileops.c:int minor = MINOR(inode->i_rdev);
include/linux/fs.h: return MINOR(inode->i_rdev);
include/linux/fs.h: return MAJOR(inode->i_rdev);
sound/oss/au1550_ac97.c:int minor = MINOR(inode->i_rdev);

  it's just standardizing on using the imajor() and iminor() inlines
defined in include/linux/fs.h.
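
for reference, the include/linux/fs.h helpers in question are just thin
wrappers (quoted from memory of the 2.6 tree, so treat as illustrative):

static inline unsigned iminor(const struct inode *inode)
{
	return MINOR(inode->i_rdev);
}

static inline unsigned imajor(const struct inode *inode)
{
	return MAJOR(inode->i_rdev);
}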

rday

-- 

Robert P. J. Day
Linux Consulting, Training and Annoying Kernel Pedantry
Waterloo, Ontario, CANADA

http://fsdev.net/wiki/index.php?title=Main_Page



[patch]clarification of coding style regarding conditional statements with two branches

2007-05-04 Thread Oliver Neukum
Hi,

I suggest that the coding style should state that if either branch of
an 'if' statement needs braces, both branches should use them.

Regards
Oliver
Signed-off-by: Oliver Neukum <[EMAIL PROTECTED]>


--- a/Documentation/CodingStyle 2007-04-20 13:08:17.0 +0200
+++ b/Documentation/CodingStyle 2007-04-20 13:16:14.0 +0200
@@ -160,6 +160,21 @@
 25-line terminal screens here), you have more empty lines to put
 comments on.
 
+Do not unnecessarily use braces where a single statement will do.
+
+if (condition)
+   action();
+
+This does not apply if one branch of a conditional statement is a single
+statement. Use braces in both branches.
+
+if (condition) {
+   do_this();
+   do_that();
+} else {
+   otherwise();
+}
+
3.1:  Spaces
 
 Linux kernel style for use of spaces depends (mostly) on


Re: [PATCH 1/1] fs: add 4th case to do_path_lookup

2007-05-04 Thread Christoph Hellwig
sorry, I promised Jeff a reply long ago but haven't done it yet.

On Fri, May 04, 2007 at 12:02:00AM -0700, Andrew Morton wrote:
> > @@ -1125,6 +1125,10 @@ static int fastcall do_path_lookup(int dfd, const char *name,
> > nd->mnt = mntget(fs->rootmnt);
> > nd->dentry = dget(fs->root);
> > read_unlock(&fs->lock);
> > +   } else if (flags & LOOKUP_ONE) {
> > +   /* nd->mnt and nd->dentry already set, just grab references */
> > +   mntget(nd->mnt);
> > +   dget(nd->dentry);
> > } else if (dfd == AT_FDCWD) {
> > read_lock(&fs->lock);
> > nd->mnt = mntget(fs->pwdmnt);
> 
> Well the patch passes my too-small-to-care-about test ;)
> 
> Unless someone objects I'd suggest that you add it to the unionfs tree.

The code is obviously correct.  There is one little thing that bothers
me, and that's that nd was purely an output parameter to path_lookup and
do_path_lookup, and now it's an input parameter for the least used case.
It might make sense to just add a simple helper a la:



static int path_component_lookup(struct dentry *dentry, struct vfsmount *mnt,
		const char *name, unsigned int flags, struct nameidata *nd)
{
	int retval;

	nd->last_type = LAST_ROOT;
	nd->flags = flags;
	nd->mnt = mntget(mnt);
	nd->dentry = dget(dentry);
	nd->depth = 0;

	retval = path_walk(name, nd);
	if (unlikely(!retval && !audit_dummy_context() &&
		     nd->dentry && nd->dentry->d_inode))
		audit_inode(name, nd->dentry->d_inode);

	return retval;
}

instead.
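
Illustration only: a stacked-fs caller would then do something like the
following instead of pre-seeding nd and passing LOOKUP_ONE (the caller
name is made up):

static int lookup_in_lower(struct dentry *lower_dentry,
			   struct vfsmount *lower_mnt,
			   const char *name, struct nameidata *nd)
{
	/* nd is purely an output parameter again. */
	return path_component_lookup(lower_dentry, lower_mnt, name,
				     LOOKUP_FOLLOW, nd);
}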


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-04 Thread David Chinner
On Thu, May 03, 2007 at 11:28:15PM -0700, Andrew Morton wrote:
> On Fri, 4 May 2007 16:07:31 +1000 David Chinner <[EMAIL PROTECTED]> wrote:
> > On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote:
> > > On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> 
> > > wrote:
> > > 
> > > > This patch implements the fallocate() system call and adds support for
> > > > i386, x86_64 and powerpc.
> > > > 
> > > > ...
> > > > +{
> > > > +   struct file *file;
> > > > +   struct inode *inode;
> > > > +   long ret = -EINVAL;
> > > > +
> > > > +   if (len == 0 || offset < 0)
> > > > +   goto out;
> > > 
> > > The posix spec implies that negative `len' is permitted - presumably
> > > "allocate ahead of `offset'".  How peculiar.
> > 
> > I just checked the man page for posix_fallocate() and it says:
> > 
> >   EINVAL  offset or len was less than zero.
> > 
> > We should probably follow this lead.
> 
> Yes, I think so.  I'm suspecting that
> http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html
> is just buggy.  Or I can't read.
> 
> I mean, if we're going to support negative `len' then is the byte at
> `offset' inside or outside the segment?  Head spins.

I don't think we should care. If we provide a syscall with the
semantics of "allocate from offset to offset+len" then glibc's
implementation can turn negative length into two separate
fallocate syscalls.
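
Illustration only (not glibc code, and a single normalised call rather than
the two calls mentioned above): one way a library wrapper could fold a
negative length back into "allocate from offset to offset+len" semantics.
raw_fallocate() is a hypothetical stand-in for whatever the eventual
syscall stub ends up being called:

#include <errno.h>
#include <sys/types.h>

/* Hypothetical stub for the raw system call. */
extern int raw_fallocate(int fd, int mode, off_t offset, off_t len);

int fallocate_wrapper(int fd, int mode, off_t offset, off_t len)
{
	if (len < 0) {
		/* Negative len means "allocate ahead of offset":
		 * shift the start back so the range ends at offset. */
		offset += len;
		len = -len;
	}
	if (len == 0 || offset < 0) {
		errno = EINVAL;
		return -1;
	}
	return raw_fallocate(fd, mode, offset, len);
}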

> > > > +   ret = -ENODEV;
> > > > +   if (!S_ISREG(inode->i_mode))
> > > > +   goto out_fput;
> > > 
> > > So we return ENODEV against an S_ISBLK fd, as per the posix spec.  That
> > > seems a bit silly of them.
> > 
> > Hmmm - I thought that the intention of sys_fallocate() was to
> > be generic enough to eventually allow preallocation on directories.
> > If that is the case, then this check will prevent that.
> 
> The above opengroup page only permits S_ISREG.  Preallocating directories
> sounds quite useful to me, although it's something which would be pretty
> hard to emulate if the FS doesn't support it.  And there's a decent case to
> be made for emulating it - run-anywhere reasons.  Does glibc emulation support
> directories?  Quite unlikely.
> 
> But yes, sounds like a desirable thing.  Would XFS support it easily if the
> above check was relaxed?

No - right now empty blocks are pruned from the directory immediately so I
don't think we really have a concept of empty blocks in the btree structure.
dir2 is bloody complex, so adding preallocation is probably not going to
be simple to do.

It's not high on my list to add, either, because we can typically avoid the
worst case directory fragmentation by using larger directory block sizes
(e.g. 16k instead of the default 4k on a 4k block size fs).

IIRC directory preallocation has been talked about more for ext3/4

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: Routing 600+ vlan's via linux problems (looks like arp problems)

2007-05-04 Thread Andi Kleen
"Miquel van Smoorenburg" <[EMAIL PROTECTED]> writes:

> And
> if you need to route more than say 3 gbit/s, PC hardware just
> won't cut it.

Each new x86 hardware generation normally can route more than the previous
generation. If you give out such a dubious number you would always need to give
it a (short) expiry date.

-Andi


Re: swap-prefetch: 2.6.22 -mm merge plans

2007-05-04 Thread Nick Piggin

Ingo Molnar wrote:

* Andrew Morton <[EMAIL PROTECTED]> wrote:


- If replying, please be sure to cc the appropriate individuals.  
 Please also consider rewriting the Subject: to something 
 appropriate.



i'm wondering about swap-prefetch:


Well I had some issues with it that I don't think were fully discussed,
and Andrew prompted me to say something, but it went off list for a
couple of posts (my fault, sorry). Posting it below with Andrew's
permission...



  mm-implement-swap-prefetching.patch
  swap-prefetch-avoid-repeating-entry.patch
  
add-__gfp_movable-for-callers-to-flag-allocations-from-high-memory-that-may-be-migrated-swap-prefetch.patch

The swap-prefetch feature is relatively compact:

   10 files changed, 745 insertions(+), 1 deletion(-)

it is contained mostly to itself:

   mm/swap_prefetch.c|  581 

i've reviewed it once again and in the !CONFIG_SWAP_PREFETCH case it's a 
clear NOP, while in the CONFIG_SWAP_PREFETCH=y case all the feedback 
i've seen so far was positive. Time to have this upstream and time for a 
desktop-oriented distro to pick it up.


I think this has been held back way too long. It's .config selectable 
and it is as ready for integration as it ever is going to be. So it's a 
win/win scenario.


Being able to config all these core heuristics changes is really not that
much of a positive. The fact that we might _need_ to config something out,
and double the configuration range isn't too pleasing.

Here were some of my concerns, and where our discussion got up to.

Andrew Morton wrote:
> On Fri, 04 May 2007 14:34:45 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote:
>
>
>>Andrew Morton wrote:
>>
>>>istr you had issues with swap-prefetch?
>>>
>>>If so, now's a good time to reiterate them ;)
>>
>>1) It is said to help with the updatedb overnight problem, however it
>>actually _doesn't_ prefetch swap when there are low free pages, which
>>is how updatedb will leave the system. So this really puzzles me how
>>it would work. However if updatedb is causing excessive swapout, I
>>think we should try improving use-once algorithms first, for example.
>
>
> Yes.  Perhaps it just doesn't help with the updatedb thing.  Or maybe with
> normal system activity we get enough free pages to kick the thing off and
> running.  Perhaps updatedb itself has a lot of rss, for example.

Could be, but I don't know. I'd think it unlikely to allow _much_ swapin,
if huge amounts of the desktop have been swapped out. But maybe... as I
said, nobody seems to have a recipe for these things.


> Would be useful to see this claim substantiated with a real testcase,
> description of results and an explanation of how and why it worked.

Yes... and then try to first improve regular page reclaim and use-once
handling.


>>2) It is a _highly_ speculative operation, and in workloads where periods
>>of low and high page usage with genuinely unused anonymous / tmpfs
>>pages, it could waste power, memory bandwidth, bus bandwidth, disk
>>bandwidth...
>
>
> Yes.  I suspect that's a matter of waiting for the corner-case reporters to
> complain, then add more heuristics.

Ugh. Well it is a pretty fundamental problem. Basically swap-prefetch is
happy to do a _lot_ of work for these things which we have already decided
are least likely to be used again.


>>3) I haven't seen a single set of numbers out of it. Feedback seems to
>>have mostly come from people who
>
>
> Yup.  But can we come up with a testcase?  It's hard.

I guess it is hard firstly because swapping is quite random to start with.
But I haven't even seen basic things like "make -jhuge swapstorm has no
regressions".


>>4) If this is helpful, wouldn't it be equally important for things like
>>mapped file pages? Seems like half a solution.
>
>
> True.
>
> Without thinking about it, I almost wonder if one could do a userspace
> implementation with something which pokes around in /proc/pid/pagemap and
> /proc/pid/kpagemap, perhaps with some additional interfaces added to
> do a swapcache read.  (Give userspace the ability to get at swapcache
> via a regular fd?)
>
> (otoh the akpm usersapce implementation is swapoff -a;swapon -a)

Perhaps. You may need a few indicators to see whether the system is idle...
but OTOH, we've already got a lot of indicators for memory, disk usage,
etc. So, maybe :)


>>5) New one: it is possibly going to interact badly with MADV_FREE lazy
>>freeing. The more complex we make page reclaim, the worse IMO.
>
>
> That's a bit vague.  What sort of problems do you envisage?

Well MADV_FREE pages aren't technically free, are they? So it might be
possible for a significant number of them to build up and prevent
swap prefetch from working. Maybe.


>>...) I had a few issues with implementation, like interaction with
>>cpusets. Don't know if these are all fixed or not. I sort of gave
>>up looking at it.
>
>
> Ah yes, I remember some mention of cpusets.  I

Re: [PATCH 1/1] fs: add 4th case to do_path_lookup

2007-05-04 Thread Christoph Hellwig

Oh and btw, net/sunrpc/rpc_pipe.c:rpc_lookup_parent() and
fs/nfsctl.c:do_open() should  be switched to the new code, at which
point the path_walk() export can go.


Re: [-mm Patch]nbd: check the return value of sysfs_create_file

2007-05-04 Thread WANG Cong
On Thu, May 03, 2007 at 11:14:50PM -0700, Andrew Morton wrote:
>On Sat, 28 Apr 2007 13:30:23 +0800 WANG Cong <[EMAIL PROTECTED]> wrote:
>
>> Since 'sysfs_create_file' is declared with attribute warn_unused_result, we 
>> must always check its return value carefully.
>> 
>
>Well that's not really the reason for your patch.
>
>warn_unused_result is there to tell us that there are deeper problems in
>the code which need addressing: the failure to check the
>sysfs_create_file() return value means that bugs in the kernel can remain
>undetected, or can be harder to find.


Oh, thanks very much for pointing that out.

>
>> 
>> ---
>> 
>> --- linux-2.6.21-rc7-mm2/drivers/block/nbd.c.orig  2007-04-27 17:27:47.0 +0800
>> +++ linux-2.6.21-rc7-mm2/drivers/block/nbd.c  2007-04-27 17:47:32.0 +0800
>> @@ -373,7 +373,10 @@ static void nbd_do_it(struct nbd_device 
>>  BUG_ON(lo->magic != LO_MAGIC);
>>  
>>  lo->pid = current->pid;
>> -sysfs_create_file(&lo->disk->kobj, &pid_attr.attr);
>> +if (sysfs_create_file(&lo->disk->kobj, &pid_attr.attr)) {
>> +printk(KERN_ERR "nbd: sysfs_create_file failed!");
>> +return;
>> +}
>>  
>>  while ((req = nbd_read_stat(lo)) != NULL)
>>  nbd_end_request(req);
>
>It would be saner to propagate this error back through callers:
>
>--- a/drivers/block/nbd.c~nbd-check-the-return-value-of-sysfs_create_file-fix
>+++ a/drivers/block/nbd.c
>@@ -366,23 +366,25 @@ static struct disk_attribute pid_attr = 
>   .show = pid_show,
> };
> 
>-static void nbd_do_it(struct nbd_device *lo)
>+static int nbd_do_it(struct nbd_device *lo)
> {
>   struct request *req;
>+  int ret;
> 
>   BUG_ON(lo->magic != LO_MAGIC);
> 
>   lo->pid = current->pid;
>-  if (sysfs_create_file(&lo->disk->kobj, &pid_attr.attr)) {
>+  ret = sysfs_create_file(&lo->disk->kobj, &pid_attr.attr);
>+  if (ret) {
>   printk(KERN_ERR "nbd: sysfs_create_file failed!");
>-  return;
>+  return ret;
>   }
> 
>   while ((req = nbd_read_stat(lo)) != NULL)
>   nbd_end_request(req);
> 
>   sysfs_remove_file(&lo->disk->kobj, &pid_attr.attr);
>-  return;
>+  return 0;
> }
> 
> static void nbd_clear_que(struct nbd_device *lo)
>@@ -572,7 +574,9 @@ static int nbd_ioctl(struct inode *inode
>   case NBD_DO_IT:
>   if (!lo->file)
>   return -EINVAL;
>-  nbd_do_it(lo);
>+  error = nbd_do_it(lo);
>+  if (error)
>+  return error;
>   /* on return tidy up in case we have a signal */
>   /* Forcibly shutdown the socket causing all listeners
>* to error
>_


Well, better code. ;) I didn't consider changing the type of nbd_do_it().
Thanks again!



Re: + per-cpuset-hugetlb-accounting-and-administration.patch added to -mm tree

2007-05-04 Thread Paul Jackson
> I will let this idea RIP.

Agreed.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-05-04 Thread Alex Tomas

Andrew Morton wrote:

I'm still not understanding.  The terms you're using are a bit ambiguous.

What does "find some dirty unallocated blocks" mean?  Find a page which is
dirty and which does not have a disk mapping?

Normally the above operation would be implemented via
ext4_writeback_writepage(), and it runs under lock_page().


I'm mostly worried about the delayed allocation case. My impression was that
holding a number of pages locked isn't a good idea, even if they're locked
in index order. So I was going to mark a number of pages writeback, then
allocate blocks for all of them at once, then put the proper blocknrs into
the bh's (or PG_mappedtodisk?).





	going to commit
	find inode I dirty
	do NOT find these blocks because they're allocated only,
	  but pages/bhs aren't mapped to them
	start commit


I think you're assuming here that commit would be using ->t_sync_datalist
to locate dirty buffer_heads.


nope, I mean sb->inode->page walk.


But under this proposal, t_sync_datalist just gets removed: the new
ordered-data mode _only_ need to do the sb->inode->page walk.  So if I'm
understanding you, the way in which we'd handle any such race is to make
kjournald's writeback of the dirty pages block in lock_page().  Once it
gets the page lock it can look to see if some other thread has mapped the
page to disk.


if I'm right holding number of pages locked, then they won't be locked, but
writeback. of course kjournald can block on writeback as well, but how does
it find pages with *newly allocated* blocks only?


It may turn out that kjournald needs a private way of getting at the
I_DIRTY_PAGES inodes to do this properly, but I don't _think_ so.  If we
had the radix-tree-of-dirty-inodes thing then that's easy enough to do
anyway, with a tagged search.  But I expect that a single pass through the
superblock's dirty inodes would suffice for ordered-data.  Files which
have chattr +j would screw things up, as usual.


not dirty inodes only, but rather some fast way to find pages with newly
allocated pages.


I assume (hope) that your delayed allocation code implements
->writepages()?  Doing the allocation one-page-at-a-time sounds painful...


indeed. this is a root cause of all this complexity.

thanks, Alex




Re: Fw: [BUG 2.6.21-rc7] acpi_pm clocksource loses time on x86-64

2007-05-04 Thread Mikael Pettersson
On Thu, 03 May 2007 19:38:50 -0700, john stultz wrote:
> > So that slow acpi_pm on x86_64 seems to be connected w/ the idle loop.
> > I'm guessing the chipset halts the ACPI PM in lower C states. Do you
> > have any guesses as to what might differ between x86_64 and i386 ACPI
> > idle loops? Or might this be something different in what the BIOS
> > exports in x86_64 mode or i386 mode?
> 
> Mikael,
>   Just trying to dig a bit more through the acpi_processor_idle code.
> Could you run "cat /proc/acpi/processor/CPU1/power" and reply w/ the
> output?

Here's that file with the x86-64 kernel:

active state:C2
max_cstate:  C8
bus master activity: 
maximum allowed latency: 2 usec
states:
C1:  type[C1] promotion[C2] demotion[--] latency[000] 
usage[00107840] duration[]
   *C2:  type[C2] promotion[--] demotion[C1] latency[010] 
usage[-1987043693] duration[003044809185]

/Mikael


Re: Correct location for ADC/DAC drivers

2007-05-04 Thread Russell King
On Fri, May 04, 2007 at 08:11:42AM +0200, Stefan Roese wrote:
> On Wednesday 02 May 2007 21:11, Russell King wrote:
> > > > Is there a maintainer for this "drivers/mfd" directory?
> > >
> > > rmk
> >
> > I wouldn't go that far.  There's no real infrastructure there
> > to maintain, so I'd actually say that the directory was
> > maintainerless.  However, I'll own up to the UCB/MCP drivers
> > in there.
> 
> So perhaps you could answer if you feel that these ADC & DAC chrdev device 
> drivers would fit into this drivers/mfd directory, or are better suited for 
> the drivers/char directory?

No idea; firstly I've long since deleted the email, secondly I've not much
interest in the directory itself, and thirdly I've enough patches to review
already.  Finally, I'm out all day today, including the evening.  I doubt
I'll read any further email until the weekend.

-- 
Russell King
 Linux kernel 2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:


Re: [2/6] add config option to vmalloc stacks (was: Re: [-mm patch] i386: enable 4k stacks by default)

2007-05-04 Thread Bill Irwin
On Mon, Apr 30, 2007 at 10:43:10AM -0700, William Lee Irwin III wrote:
>> +  Allocates the stack physically discontiguously and from high
>> +  memory. Furthermore an unmapped guard page follows the stack.
>> +  This is not for end-users. It's intended to trigger fatal
>> +  system errors under various forms of stack abuse.

On Fri, May 04, 2007 at 01:35:30AM -0400, Joseph Fannin wrote:
> Why is this not for end-users?  Will it not trigger anything
> useful unless set up properly, or is a big performace hit -- and how,
> or what?
> All the kernel debug options are underdocumented this way -- I'd
> like to have as many of them on as I can without absolutely killing
> performance, (or rather, *you* would) -- but I can never tell without
> grovelling all over for the info, which... well, I haven't done it
> yet, anyway.

There aren't many effective sideband methods to document things. If I
knew of an "expanded help" thing people could look at in Kconfig, I'd
write storybook documentation and put it there as I'm wont to do.


On Fri, May 04, 2007 at 01:35:30AM -0400, Joseph Fannin wrote:
> "End-user" is just insufficently defined for anyone compiling
> their own kernel.  Could you add a bit more text here describing what
> the effect of physically discontiguous high-memory stacks is?  An
> additional frobnitz dereference on every badda-bing badda-bang, likely
> to double the time it takes to dance the hokey pokey?
>*shrug*  Some of those debug options probably don't get set very
> often on kernels that are run for more than to see if it boots.

It's short for "whoever doesn't understand the terse jargon" as I'm
using it. The assumption here is that it's essentially for kernel
hackers who know all about kernel internals up-front anyway.

The option really is not to be trifled with. Maybe I could work with the
kerneldoc developers to arrange for outlets for more verbose
documentation for options (actually I'd like similar for API functions
as well; I'd like to write IRIX-style roman fleuve manpages for things).
There is a slight danger, though, that the documentation may get out of
synch. For now, I just have nowhere appropriate to put it.


-- wli


Re: [v4l-dvb-maintainer] [PATCH 35/36] Use menuconfig objects II - DVB

2007-05-04 Thread Jan Engelhardt

On May 3 2007 18:40, Trent Piepho wrote:
>
>How about these examples:
>
>menuconfig FOO
>if FOO
>config A
>   depends on FOO
>endif
>config B
>if FOO
>config C
>   depends on FOO
>endif

This does not work as expected in ncurses-menuconfig either.
It does not even need a "menuconfig" object for that, something
as simple as

config A
config B
config C
  depends on A

prints it

A
B
C

rather than

A
\_ C
B

which is why some of my menuconfig patches _move_ C to not come after B.

>How does it show the first one, keeping the config entries in the correct
>order and put them into the menu at the same time?
>
>And which of what should the second be show?
>
>foo
>\-bar
>  \-baz
>
>or
>
>foo
>|-bar
>\-baz
>
>There is no question with menus, as the menu tree is clearly lexically
>defined by the matching menu / endmenu pairs.  But menuconfig doesn't work
>that way, and it seems like it would make more sense if it did.
>

Jan
-- 


Re: Back to the future.

2007-05-04 Thread David Greaves
Kyle Moffett wrote:
> On May 03, 2007, at 11:10:47, Pavel Machek wrote:
>> How mature is freezing filesystems -- will it work on at least ext2/3
>> and vfat?
> 
> I'm pretty sure it works on ext2/3 and xfs and possibly others, I don't
> know either way about VFAT though.  Essentially the "freeze" part
> involves telling the filesystem to sync all data, flush the journal, and
> mark the filesystem clean.  The intent under dm/LVM was to allow you to
> make snapshots without having to fsck the just-created snapshot before
> you mounted it.
> 
>> What happens if you try to boot and filesystems are frozen from
>> previous run?
> 
> If you're just doing a fresh boot then the filesystem is already clean
> due to the dm freeze and so it mounts up normally.  All you need to do
> then is have a little startup script which purges the saved image before
> you fsck or remount things read-write since either case means the image
> is no longer safe to resume.

Wouldn't it be better if freeze wrote a freeze-ID to the fs and returned it?
This would naturally be kept in the image and a UUID mismatch would be
detectable - seems safer and more flexible than 'a script'.

"This isn't the freeze you're looking for, move along"

David


Re: Routing 600+ vlan's via linux problems (looks like arp problems)

2007-05-04 Thread Jan Engelhardt

On May 4 2007 02:11, Willy Tarreau wrote:
>> 
>> Haha. Would you be happy if it ran on a CF card instead? :>
>
>Yes, because at least when you design a system to run on a CF card, you
>ensure never to write on it because you know that would kill it. Then
>since you never write on it, it does not wear out and has no problem
>running for years (unless you bought cheap end-user CF of course).

Funny, I just installed a 'full' Linux distro on a CF, like with a
regular harddisk, and it runs in full rw mode. Packing it up in a
squashfs and running the thing with aufs did not seem worth
the hassle of setting up a specialized initrd. And then, when you
need to make one change (firewall), it's faster than recreating the
sqfs image.
Will see how long that lasts.


Jan
-- 


[PATCH] powerpc: Remove obsolete prototype

2007-05-04 Thread Julio M. Merino Vidal

Hello,

The include/asm-powerpc/paca.h file has a prototype for a function that 
does not exist any more; its name is setup_boot_paca.  This function was 
removed in commit 4ba99b97dadd35b9ce1438b2bc7c992a4a14a8b1, so its 
prototype should have been removed at that time too.  Patch below.


Kind regards,


Signed-off-by: Julio M. Merino Vidal <[EMAIL PROTECTED]>

diff --git a/include/asm-powerpc/paca.h b/include/asm-powerpc/paca.h
index cf95274..00a70e5 100644
--- a/include/asm-powerpc/paca.h
+++ b/include/asm-powerpc/paca.h
@@ -107,7 +107,5 @@ struct paca_struct {

extern struct paca_struct paca[];

-void setup_boot_paca(void);
-
#endif /* __KERNEL__ */
#endif /* _ASM_POWERPC_PACA_H */

--
Julio M. Merino Vidal <[EMAIL PROTECTED]>



Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-05-04 Thread Andrew Morton
On Fri, 04 May 2007 11:39:22 +0400 Alex Tomas <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> > I'm still not understanding.  The terms you're using are a bit ambiguous.
> > 
> > What does "find some dirty unallocated blocks" mean?  Find a page which is
> > dirty and which does not have a disk mapping?
> > 
> > Normally the above operation would be implemented via
> > ext4_writeback_writepage(), and it runs under lock_page().
> 
> I'm mostly worried about the delayed allocation case. My impression was that
> holding a number of pages locked isn't a good idea, even if they're locked
> in index order. So I was going to mark a number of pages writeback, then
> allocate blocks for all of them at once, then put the proper blocknrs into
> the bh's (or PG_mappedtodisk?).

ooh, that sounds hacky and quite worrisome.  If someone comes in and does
an fsync() we've lost our synchronisation point.  Yes, all callers happen
to do

lock_page();
wait_on_page_writeback();

(I think) but we've never considered a bare PageWriteback() as something
which protects page internals.  We're OK wrt page reclaim and we're OK wrt
truncate and invalidate.  As long as the page is uptodate we _should_ be OK
wrt readpage().  But still, it'd be better to use the standard locking
rather than inventing new rules, if poss.


I'd be 100% OK with locking multiple pages in ascending pgoff_t order. 
Locking the page is the standard way of doing this synchronisation and the
only problem I can think of is that having a tremendous number of pages
locked could cause the wake_up_page() waitqueue hashes to get overloaded
and go slow.  But it's also possible to lock many, many pages with
readahead and nobody has reported problems in there.
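
A rough sketch (not from any patch; the helper name is made up) of the pattern
under discussion -- taking multiple page locks in ascending index order, with
the usual lock_page()/wait_on_page_writeback() pairing:

#include <linux/mm.h>
#include <linux/pagemap.h>

static int lock_page_range(struct address_space *mapping,
			   pgoff_t start, pgoff_t end,
			   struct page **pages)
{
	pgoff_t index;
	int nr = 0;

	for (index = start; index <= end; index++) {
		struct page *page = find_get_page(mapping, index);

		if (!page)
			continue;
		/* Ascending index order keeps the lock ordering consistent. */
		lock_page(page);
		wait_on_page_writeback(page);
		pages[nr++] = page;
	}
	return nr;
}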


> > 
> > 
> >> going to commit
> >> find inode I dirty
> >> do NOT find these blocks because they're allocated only,
> >>   but pages/bhs aren't mapped to them
> >> start commit
> > 
> > I think you're assuming here that commit would be using ->t_sync_datalist
> > to locate dirty buffer_heads.
> 
> nope, I mean sb->inode->page walk.
> 
> > But under this proposal, t_sync_datalist just gets removed: the new
> > ordered-data mode _only_ need to do the sb->inode->page walk.  So if I'm
> > understanding you, the way in which we'd handle any such race is to make
> > kjournald's writeback of the dirty pages block in lock_page().  Once it
> > gets the page lock it can look to see if some other thread has mapped the
> > page to disk.
> 
> if I'm right holding number of pages locked, then they won't be locked, but
> writeback. of course kjournald can block on writeback as well, but how does
> it find pages with *newly allocated* blocks only?

I don't think we'd want kjournald to do that.  Even if a page was dirtied
by an overwrite, we'd want to write it back during commit, just from a
quality-of-implementation point of view.  If we were to leave these pages
unwritten during commit then a post-recovery file could have a mix of
up-to-five-second-old data and up-to-30-seconds-old data.

> > It may turn out that kjournald needs a private way of getting at the
> > I_DIRTY_PAGES inodes to do this properly, but I don't _think_ so.  If we
> > had the radix-tree-of-dirty-inodes thing then that's easy enough to do
> > anyway, with a tagged search.  But I expect that a single pass through the
> > superblock's dirty inodes would suffice for ordered-data.  Files which
> > have chattr +j would screw things up, as usual.
> 
> not dirty inodes only, but rather some fast way to find pages with newly
> allocated pages.

Newly allocated blocks, you mean?

Just write out the overwritten blocks as well as the new ones, I reckon. 
It's what we do now.



Re: Routing 600+ vlan's via linux problems (looks like arp problems)

2007-05-04 Thread Jan Engelhardt

On May 4 2007 05:48, Øyvind Vågen Jægtnes wrote:
>
> As a side note, i'm starting to wonder if it was worth the $20k when i
> could just have a linux machine to do the job with a clone for backup
> ;) 

Most often not. The big bosses (who make most of the decisions yet are not
always the most clueful wrt. tech) look for "certification", the
"enterprise" sticker, and the correct blame policy (if it breaks, you
can kill ; if your linux box breaks, you have to fix it
yourself). And here's an example case showing that it's not always optimal.

A $2k core router once had its HD die (sounds similar, eh?); it took
the vendor 27h to replace it (and their office is just 500m away),
while if we could have swapped the faulty disk ourselves, it would have
only taken as long as the disk copy takes (38 minutes for 80 GB at a
transfer rate of 35MB/s).


Jan
-- 


Re: [PATCH 5/5] crypto: Add LZO compression support to the crypto interface

2007-05-04 Thread Satyam Sharma

On 5/1/07, Richard Purdie <[EMAIL PROTECTED]> wrote:

+static int lzo_init(struct crypto_tfm *tfm)
+{
+   struct lzo_ctx *ctx = crypto_tfm_ctx(tfm);
+
+   ctx->lzo_mem = vmalloc(LZO1X_MEM_COMPRESS);
+
+   if (!ctx->lzo_mem) {
+   vfree(ctx->lzo_mem);


Heh. What's (why's) this? You _can_ {k, v}free NULL but doing so after
explicitly checking for it is ... ... insane!


+   return -ENOMEM;


Yeah. Just return -ENOMEM; and be done with it.
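
I.e. the cleaned-up init might look like this (a sketch based on the quoted
patch, not the actual submitted code):

static int lzo_init(struct crypto_tfm *tfm)
{
	struct lzo_ctx *ctx = crypto_tfm_ctx(tfm);

	ctx->lzo_mem = vmalloc(LZO1X_MEM_COMPRESS);
	if (!ctx->lzo_mem)
		return -ENOMEM;	/* nothing was allocated, nothing to free */

	return 0;
}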


Re: [PATCH] Rewrite the MAJOR() macro as a call to imajor().

2007-05-04 Thread Jan Engelhardt

On May 3 2007 23:18, Andrew Morton wrote:
>>  struct inode *i = file->f_mapping->host;
>> 
>> -return i && S_ISBLK(i->i_mode) && MAJOR(i->i_rdev) == LOOP_MAJOR;
>> +return i && S_ISBLK(i->i_mode) && imajor(i) == LOOP_MAJOR;
>>  }
>
>there's no runtime change, and I count a couple hundred MAJORs in the tree.

Why do we even have imajor() if all it does is call the MAJOR()
macro?


Jan
-- 


Re: [PATCH] Move sig_kernel_* et al macros to linux/signal.h

2007-05-04 Thread Andrew Morton
On Sun, 29 Apr 2007 21:02:38 -0700 (PDT) Roland McGrath <[EMAIL PROTECTED]> 
wrote:

> This patch moves the sig_kernel_* and related macros from kernel/signal.c
> to linux/signal.h, and cleans them up slightly.  I need the sig_kernel_*
> macros for default signal behavior in the utrace code, and want to avoid
> duplication or overhead to share the knowledge.
> 
> ...
>
> +#ifdef SIGEMT
> +#define SIGEMT_MASK  rt_sigmask(SIGEMT)
> +#else
> +#define SIGEMT_MASK  0
> +#endif
> +
> +#if SIGRTMIN > BITS_PER_LONG
> +#define rt_sigmask(sig)  (1ULL << ((sig)-1))
> +#else
> +#define rt_sigmask(sig)  sigmask(sig)
> +#endif
> +#define siginmask(sig, mask) (rt_sigmask(sig) & (mask))

Should we undef rt_sigmask and siginmask after using them here?



Re: [PATCH] Rewrite the MAJOR() macro as a call to imajor().

2007-05-04 Thread Robert P. J. Day
On Fri, 4 May 2007, Jan Engelhardt wrote:

>
> On May 3 2007 23:18, Andrew Morton wrote:
> >>struct inode *i = file->f_mapping->host;
> >>
> >> -  return i && S_ISBLK(i->i_mode) && MAJOR(i->i_rdev) == LOOP_MAJOR;
> >> +  return i && S_ISBLK(i->i_mode) && imajor(i) == LOOP_MAJOR;
> >>  }
> >
> >there's no runtime change, and I count a couple hundred MAJORs in the tree.
>
> Why do we even have imajor() if all it does is calling the MAJOR()
> macro?

  i'm guessing it's to hide the underlying implementation of
extracting the major/minor numbers from an inode, in case that
implementation ever changes, which strikes me as perfectly reasonable.
and i don't think you'd have any luck arguing that it should be
removed at this point:

$ grep -Erw "(imajor|iminor)" * | wc -l
350

  all i was doing was standardizing the small handful of holdouts.

rday
-- 

Robert P. J. Day
Linux Consulting, Training and Annoying Kernel Pedantry
Waterloo, Ontario, CANADA

http://fsdev.net/wiki/index.php?title=Main_Page



Re: Ext3 vs NTFS performance

2007-05-04 Thread Anton Altaparmakov

On 3 May 2007, at 23:40, Bernd Eckenfels wrote:

In article <[EMAIL PROTECTED]> you wrote:

For this particular case, Ted is probably right and the only place
we'll ever see this insane poor man's pre-allocate pattern is from the
Windows CIFS client, in which case fixing this in Samba makes sense -
although I'm a bit horrified by the idea of writing 128K of zeroes to
pre-allocate... oh well, it's temporary, and what we care about here
is the read performance, more than the write performance.


What about an ioctl or advice to avoid holes? Which could be issued by
samba? Is that related to SetFileValidData and SetEndOfFile win32
functions? What is the windows client calling, and what command is
transmitted by smb?


Nothing to do with win32 functions.  Windows does NOT create sparse
files therefore it never can have an issue like ext3 does in this
scenario.  Windows will cause nice allocations to happen because of
this and the 1-byte writes are perfectly sensible in this regard.
(Although a little odd as Windows has a proper API for doing
preallocation so I don't get why it is not using that instead...)


As far as I know the only time Windows will create sparse files is if
you specifically mark a file as sparse using the FSCTL_SET_SPARSE
ioctl and then create a sparse region using the FSCTL_SET_ZERO_DATA
ioctl.
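
For the curious, an illustrative Win32 sketch of that sequence (error
handling omitted; not from any kernel or Samba code):

#include <windows.h>
#include <winioctl.h>

static void make_sparse_region(HANDLE h, LONGLONG start, LONGLONG end)
{
	DWORD bytes;
	FILE_ZERO_DATA_INFORMATION zero;

	/* 1. Mark the file as sparse. */
	DeviceIoControl(h, FSCTL_SET_SPARSE, NULL, 0, NULL, 0, &bytes, NULL);

	/* 2. Declare [start, end) as a zeroed (hole) region. */
	zero.FileOffset.QuadPart = start;
	zero.BeyondFinalZero.QuadPart = end;
	DeviceIoControl(h, FSCTL_SET_ZERO_DATA, &zero, sizeof(zero),
			NULL, 0, &bytes, NULL);
}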


Best regards,

Anton
--
Anton Altaparmakov  (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/




Re: [PATCH] SCSI: Remove redundant GFP_KERNEL type flag in kmalloc().

2007-05-04 Thread Andrew Morton

Please be careful to add the appropriate cc's.

On Mon, 30 Apr 2007 04:37:22 -0400 (EDT) "Robert P. J. Day" <[EMAIL PROTECTED]> 
wrote:

> 
> Remove the apparently redundant GFP_KERNEL type flag in the call to
> kmalloc().
> 
> Signed-off-by: Robert P. J. Day <[EMAIL PROTECTED]>
> 
> ---
> 
> diff --git a/drivers/scsi/aic7xxx_old.c b/drivers/scsi/aic7xxx_old.c
> index a988d5a..765ded0 100644
> --- a/drivers/scsi/aic7xxx_old.c
> +++ b/drivers/scsi/aic7xxx_old.c
> @@ -6581,7 +6581,7 @@ aic7xxx_slave_alloc(struct scsi_device *SDptr)
>struct aic7xxx_host *p = (struct aic7xxx_host *)SDptr->host->hostdata;
>struct aic_dev_data *aic_dev;
> 
> -  aic_dev = kmalloc(sizeof(struct aic_dev_data), GFP_ATOMIC | GFP_KERNEL);
> +  aic_dev = kmalloc(sizeof(struct aic_dev_data), GFP_ATOMIC);

No, this converts the allocation from a robust one which can sleep into a
flakey one which cannot.

If we want to just clean this code up, we should switch to

GFP_KERNEL|__GFP_HIGH

and add a comment explaining why we're turning on __GFP_HIGH (pointlessly,
I suspect).
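
In concrete terms (illustrative only):

	/* __GFP_HIGH preserved from the old GFP_ATOMIC|GFP_KERNEL combination
	 * (probably pointless), but the allocation may now sleep. */
	aic_dev = kmalloc(sizeof(struct aic_dev_data), GFP_KERNEL | __GFP_HIGH);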

However I suspect what the code really meant to do was to use just
GFP_KERNEL.  It's been that way since

commit 5c9342ceb292ac5c619db6eef4ef427a64bcd436
Author: torvalds 
Date:   Thu Nov 7 04:54:32 2002 +

Merge bk://linux-scsi.bkbits.net/scsi-dledford
into home.transmeta.com:/home/torvalds/v2.5/linux

2002/11/06 16:40:20-05:00 dledford
aic7xxx_old: multiple updates and fixes, driver ported to scsi
mid-layer new error handling scheme


Re: libata /dev/scd0 problem: mount after burn fails without eject

2007-05-04 Thread Tejun Heo
Michal Piotrowski wrote:
> On 01/05/07, Mark Lord <[EMAIL PROTECTED]> wrote:
>> Forwarding to linux-scsi and linux-ide mailing lists.
>>
>> Frank van Maarseveen wrote:
>> > Tested on 2.6.20.6 and 2.6.21.1
>> >
>> > I decided to swich from the old IDE drivers to libata and now there
>> > seems to be a little but annoying problem: cannot mount an ISO image
>> > after burning it.
>> >
>> > May  1 14:32:55 kernel: attempt to access beyond end of device
>> > May  1 14:32:55 kernel: sr0: rw=0, want=68, limit=4
>> > May  1 14:32:55 kernel: isofs_fill_super: bread failed, dev=sr0,
>> iso_blknum=16, block=16
>> >
>> > an "eject" command seems to fix the state of the PATA DVD writer
>> > or driver. The problem occurs for burning a CD and for DVD too with
>> > identical error messages.

Right after burning, if you run 'fuser -v /dev/sr0', what does it say?

-- 
tejun


Re: Kernel oops with 2.6.21 while using cdda2wav & cooked_ioctl (x64-64) [untainted]

2007-05-04 Thread Ross Alexander

Alexander,

Rather than try to hunt down the oops via the interrupt code I'll try to 
fix the original problem, which is that cooked_ioctl incorrectly determines 
the length of the track when an audio CD also contains a data track.  I 
have noticed this bug for a long time (over a year) but only recently 
has it been causing the kernel to crash.  I know this is avoidable by 
using cdda2wav in raw mode but it would be good to fix.


If anybody has any pointers about how I should go about this it would be 
most appreciated.


Many thanks,

Ross

Ross Alexander
Phone: +44 20 8752 3394
SAP Basis, Technical Support Services, Corporate IT Centre, NEC Europe 
Limited





[PATCH] i386: always clear bss

2007-05-04 Thread Jeremy Fitzhardinge
When the paravirt dispatcher gets run immediately on entry to
startup_32, the bss isn't cleared.  This happens to work if the
hypervisor's domain builder loaded the complete kernel image and
cleared the bss for us, but this may not always be true (for example,
if we're running out of a decompressed bzImage).

Change head.S so that it unconditionally clears the bss before doing
the paravirt dispatch or continuing on to normal native boot.

There are a couple of points to note:
 - We can't, in general, load the segment registers before paravirt
   dispatch, because we could be running with a non-standard gdt and
   segment selectors.  In practice though, all code which ends up
   jumping into startup_32 will have already set the segment registers
   up to sane values, so we don't need to do it again.
 - Paging may or may not be enabled, and if enabled we may or may not
   be mapped to the proper kernel virtual address.  To deal with this,
   we compare the kernel's linked address with where we're actually
   running, and use that to offset the bss pointer.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: Rusty Russell <[EMAIL PROTECTED]>
Cc: Eric W. Biederman <[EMAIL PROTECTED]>
Cc: "H. Peter Anvin" <[EMAIL PROTECTED]>

---
 arch/i386/kernel/head.S |   48 ++-
 1 file changed, 27 insertions(+), 21 deletions(-)

===
--- a/arch/i386/kernel/head.S
+++ b/arch/i386/kernel/head.S
@@ -70,6 +70,33 @@ INIT_MAP_BEYOND_END = BOOTBITMAP_SIZE + 
  */
 .section .text.head,"ax",@progbits
 ENTRY(startup_32)
+/*
+ * Clear BSS first so that there are no surprises...
+ * This relies on the segment registers to be set
+ * to something sensible, which will have already happened.
+ */
+   cld
+   xorl %eax,%eax
+   movl $__bss_start,%edi
+   movl $__bss_stop,%ecx
+   subl %edi,%ecx
+   shrl $2,%ecx
+   /*
+* Work out whether we're running mapped or not:
+* - call a local label
+* - pop the return address to get the actual eip
+* - subtract local label from %edi (= bss pointer)
+* - add in actual eip
+*
+* This will result in %edi being a virtual pointer if
+* we're currently mapped, or a physical pointer if we're
+* not (either no paging or 1:1 mapping).
+*/
+   call 1f
+1: popl %ebx
+   subl $1b, %edi
+   addl %ebx, %edi
+   rep ; stosl
 
 #ifdef CONFIG_PARAVIRT
 movl %cs, %eax
@@ -77,27 +104,6 @@ ENTRY(startup_32)
 jnz startup_paravirt
 #endif
 
-/*
- * Set segments to known values.
- */
-   cld
-   lgdt boot_gdt_descr - __PAGE_OFFSET
-   movl $(__BOOT_DS),%eax
-   movl %eax,%ds
-   movl %eax,%es
-   movl %eax,%fs
-   movl %eax,%gs
-
-/*
- * Clear BSS first so that there are no surprises...
- * No need to cld as DF is already clear from cld above...
- */
-   xorl %eax,%eax
-   movl $__bss_start - __PAGE_OFFSET,%edi
-   movl $__bss_stop - __PAGE_OFFSET,%ecx
-   subl %edi,%ecx
-   shrl $2,%ecx
-   rep ; stosl
 /*
  * Copy bootup parameters out of the way.
  * Note: %esi still has the pointer to the real-mode data.



"KERNEL/_KERNEL/INKERNEL" macro cruft

2007-05-04 Thread Robert P. J. Day

  can any of this be removed?  for a start, what's the point of
"#define KERNEL"?

$ grep -r "^#define KERNEL$" *
drivers/net/skfp/srf.c:#define KERNEL
drivers/net/skfp/pmf.c:#define KERNEL
drivers/net/skfp/smt.c:#define KERNEL
drivers/net/skfp/ecm.c:#define KERNEL
drivers/net/skfp/pcmplc.c:#define KERNEL
drivers/net/skfp/cfm.c:#define KERNEL
drivers/net/skfp/rmt.c:#define KERNEL
drivers/char/dtlk.c:#define KERNEL

  which seems unrelated to this:

$ grep -r "#if.* KERNEL$" *
include/linux/coda.h:#ifdef KERNEL

  and there's these lonely examples of "INKERNEL" and "_KERNEL":

$ grep -rw _KERNEL *
include/linux/coda.h:/* Catch new _KERNEL defn for NetBSD and DJGPP/__CYGWIN32__ */
include/linux/soundcard.h:#if (!defined(__KERNEL__) && !defined(KERNEL) && !defined(INKERNEL) && !defined(_KERNEL)) || defined(USE_SEQ_MACROS)

rday
-- 

Robert P. J. Day
Linux Consulting, Training and Annoying Kernel Pedantry
Waterloo, Ontario, CANADA

http://fsdev.net/wiki/index.php?title=Main_Page



Re: [RFC] [PATCH] DRM TTM Memory Manager patch

2007-05-04 Thread Thomas Hellström

Keith Packard wrote:

On Thu, 2007-05-03 at 01:01 +0200, Thomas Hellström wrote:

  
It might be possible to find schemes that work around this. One way 
could possibly be to have a buffer mapping -and validate order for 
shared buffers.



If mapping never blocks on anything other than the fence, then there
isn't any dead lock possibility. What this says is that ordering of
rendering between clients is *not DRMs problem*. I think that's a good
solution though; I want to let multiple apps work on DRM-able memory
with their own CPU without contention.

I don't recall if Eric layed out the proposed rules, but:

 1) Map never blocks on map. Clients interested in dealing with this 
are on their own.


 2) Submit blocks on map. You must unmap all buffers before submitting
them. Doing the relocations in the kernel makes this all possible.

 3) Map blocks on the fence from submit. We can play with pending the
flush until the app asks for the buffer back, or we can play with
figuring out when flushes are useful automatically. Doesn't matter
if the policy is in the kernel.

I'm interested in making deadlock avoidence trivial and eliminating any
map-map contention.

  
It's rare to have two clients access the same buffer at the same time. 
In what situation will this occur?


If we think of map / unmap and validation / fence  as taking a buffer 
mutex either for the CPU or for the GPU, that's the way implementation 
is done today. The CPU side of the mutex should IIRC be per-client 
recursive. OTOH, the TTM implementation won't stop the CPU from 
accessing the buffer when it is unmapped, but then you're on your own. 
"Mutexes" need to be taken in the correct order, otherwise a deadlock 
will occur, and GL will, as outlined in Eric's illustration, more or 
less encourage us to take buffers in the "incorrect" order.


In essence what you propose is to eliminate the deadlock problem by just 
avoiding taking the buffer mutex unless we know the GPU has it. I see 
two problems with this:


   * It will encourage different DRI clients to simultaneously access
 the same buffer.
   * Inter-client and GPU data coherence can be guaranteed if we issue
 a mb() / write-combining flush with the unmap operation (which,
 BTW, I'm not sure is done today). Otherwise it is up to the
 clients, and very easy to forget.

I'm a bit afraid we might in the future regret taking the easy way out?

OTOH, letting DRM resolve the deadlock by unmapping and remapping shared 
buffers in the correct order might not be the best one either. It will 
certainly mean some CPU overhead and what if we have to do the same with 
buffer validation? (Yes for some operations with thousands and thousands 
of relocations, the user space validation might need to stay).


Personally, I'm slightly biased towards having DRM resolve the deadlock, 
but I think any solution will do as long as the implications and why we 
choose a certain solution are totally clear.


For item 3) above the kernel must have a way to issue a flush when 
needed for buffer eviction.
The current implementation also requires the buffer to be completely 
flushed before mapping.
Other than that the flushing policy is currently completely up to the 
DRM drivers.


/Thomas













Re: Correct location for ADC/DAC drivers

2007-05-04 Thread Robert Schwebel
On Tue, May 01, 2007 at 02:35:44PM +0200, Stefan Roese wrote:
> I'm in the stage of integrating some ADC and DAC drivers for the AMCC
> 405EZ PPC and looking for the correct location to place these drivers
> in the Linux source tree. The drivers are basically character-drivers,
> so my first thought is to put them in "drivers/char/adc/foo.c" or
> "drivers/char/adc_foo.c". Is this a good solution?
> 
> Any suggestions welcome (could be that I missed an already existing
> example).
> 
> BTW: I am aware of the hwmon subsystem, but I don't think it fits my
> needs in this case.

Could you elaborate the requirements a bit more? ADC is not ADC, because
slow i2c ADCs which measure a temperature every five minutes have other
requirements than multi-megabyte-per-second-dma-driven ADCs.

Robert
-- 
 Dipl.-Ing. Robert Schwebel | http://www.pengutronix.de
 Pengutronix - Linux Solutions for Science and Industry
   Handelsregister:  Amtsgericht Hildesheim, HRA 2686
 Hannoversche Str. 2, 31134 Hildesheim, Germany
   Phone: +49-5121-206917-0 |  Fax: +49-5121-206917-9



Re: [PATCH] SCSI: Remove redundant GFP_KERNEL type flag in kmalloc().

2007-05-04 Thread Robert P. J. Day
On Fri, 4 May 2007, Andrew Morton wrote:

>
> Please be careful to add the appropriate cc's.
>
> On Mon, 30 Apr 2007 04:37:22 -0400 (EDT) "Robert P. J. Day" <[EMAIL 
> PROTECTED]> wrote:
>
> >
> > Remove the apparently redundant GFP_KERNEL type flag in the call to
> > kmalloc().
> >
> > Signed-off-by: Robert P. J. Day <[EMAIL PROTECTED]>
> >
> > ---
> >
> > diff --git a/drivers/scsi/aic7xxx_old.c b/drivers/scsi/aic7xxx_old.c
> > index a988d5a..765ded0 100644
> > --- a/drivers/scsi/aic7xxx_old.c
> > +++ b/drivers/scsi/aic7xxx_old.c
> > @@ -6581,7 +6581,7 @@ aic7xxx_slave_alloc(struct scsi_device *SDptr)
> >struct aic7xxx_host *p = (struct aic7xxx_host *)SDptr->host->hostdata;
> >struct aic_dev_data *aic_dev;
> >
> > -  aic_dev = kmalloc(sizeof(struct aic_dev_data), GFP_ATOMIC | GFP_KERNEL);
> > +  aic_dev = kmalloc(sizeof(struct aic_dev_data), GFP_ATOMIC);
>
> No, this converts the allocation from a robust one which can sleep into a
> flakey one which cannot.

... snip ...

at this point, i'd be happier to leave the appropriate patches in the
hands of those who have a better handle on this.  as i posted earlier,
there's only two examples of this in the entire tree:

drivers/scsi/aic7xxx_old.c:  aic_dev = kmalloc(sizeof(struct aic_dev_data), GFP_ATOMIC | GFP_KERNEL);
drivers/message/i2o/device.c:   resblk = kmalloc(buflen + 8, GFP_KERNEL | GFP_ATOMIC);

it's all yours.

rday
-- 

Robert P. J. Day
Linux Consulting, Training and Annoying Kernel Pedantry
Waterloo, Ontario, CANADA

http://fsdev.net/wiki/index.php?title=Main_Page



c 's OOP in VFS vs c++'s OOP

2007-05-04 Thread la deng

With reference to this interview with the father of C++:

http://www.artima.com/intv/abstreffi.html

Fortran and C++ can achieve good performance because they can abstract at
a higher level, and their compilers can reason at that higher level to
avoid cache misses (as in the array vs. vector case) or to achieve the
"no code" optimization.

The VFS uses C to do OOP, but a C compiler does not have the high-level
intelligence that a C++ or Fortran compiler can apply to high-level
optimization.
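
For illustration, here is a tiny stand-alone sketch of the function-pointer
style the VFS uses (made-up names, not actual VFS types):

#include <stdio.h>

struct thing;

struct thing_ops {
	void (*describe)(const struct thing *t);
};

struct thing {
	const char *name;
	const struct thing_ops *ops;	/* "vtable" chosen at runtime */
};

static void describe_loudly(const struct thing *t)
{
	printf("THING: %s\n", t->name);
}

static const struct thing_ops loud_ops = { .describe = describe_loudly };

int main(void)
{
	struct thing t = { .name = "example", .ops = &loud_ops };

	t.ops->describe(&t);		/* dynamic dispatch, C style */
	return 0;
}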


One benchmark showed insertion into a C array running at roughly 1/17th of
the speed.


So, what is the power of OOP in C versus C++ with its compiler
intelligence for OOP? Or has the C compiler simply become outdated?

Any input will be appreciated.


Re: Per-CPU data as a structure

2007-05-04 Thread Julio M. Merino Vidal

Andi Kleen wrote:

As far as I can tell, the advantage of percpu is that you can define
new "fields" anywhere in the code and independently from the rest of
the system. 



- Independent maintenance as you noted
- Fast access and relatively compact code
- Avoids false sharing by keeping cache lines of different CPUs separate
- Doesn't waste a lot of memory in padding like NR_CPUs arrays usually
need to to avoid the previous point.

Any replacement that doesn't have these properties too will probably
be not useful.
  
Thank you for the details.  I'll try to stick to per-cpu wherever 
possible for now.


Anyway, what do you think about adding the above text to the code (percpu.h
maybe) as documentation?  See the patch below.  (Dunno if the Signed-off-by
line is appropriate as most of the text is yours.)

Signed-off-by: Julio M. Merino Vidal <[EMAIL PROTECTED]>

diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 600e3d3..b8e8b8c 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -1,6 +1,21 @@
#ifndef __LINUX_PERCPU_H
#define __LINUX_PERCPU_H

+/*
+ * percpu provides a mechanism to define variables that are specific to each
+ * CPU in the system.
+ *
+ * Each variable is defined as an independent array of NR_CPUS elements.
+ * This approach is used instead of a per-CPU structure because it has the
+ * following advantages:
+ * - Independent maintenance: a source file can define new per-CPU
+ *   variables without distorting others.
+ * - Fast access and relatively compact code.
+ * - Avoids false sharing by keeping cache lines of different CPUs separate.
+ * - Doesn't waste a lot of memory in padding like NR_CPUs arrays usually
+ *   need to avoid the previous point.
+ */
+
#include  /* For preempt_disable() */
#include  /* For kmalloc() */
#include 


Kind regards.


Re: libata /dev/scd0 problem: mount after burn fails without eject

2007-05-04 Thread Frank van Maarseveen
On Fri, May 04, 2007 at 10:16:32AM +0200, Tejun Heo wrote:
> Michal Piotrowski wrote:
> > On 01/05/07, Mark Lord <[EMAIL PROTECTED]> wrote:
> >> Forwarding to linux-scsi and linux-ide mailing lists.
> >>
> >> Frank van Maarseveen wrote:
> >> > Tested on 2.6.20.6 and 2.6.21.1
> >> >
> >> > I decided to swich from the old IDE drivers to libata and now there
> >> > seems to be a little but annoying problem: cannot mount an ISO image
> >> > after burning it.
> >> >
> >> > May  1 14:32:55 kernel: attempt to access beyond end of device
> >> > May  1 14:32:55 kernel: sr0: rw=0, want=68, limit=4
> >> > May  1 14:32:55 kernel: isofs_fill_super: bread failed, dev=sr0,
> >> iso_blknum=16, block=16
> >> >
> >> > an "eject" command seems to fix the state of the PATA DVD writer
> >> > or driver. The problem occurs for burning a CD and for DVD too with
> >> > identical error messages.
> 
> Right after burning, if you run 'fuser -v /dev/sr0', what does it say?

Tried the fuser as root to be sure but it didn't show anything.

-- 
Frank


[SOLVED] Serial buffer corruption [was Re: FTDI usb-serial possible bug]

2007-05-04 Thread Antonino Ingargiola

Hi,

On 4/14/07, Antonino Ingargiola <[EMAIL PROTECTED]> wrote:

Hi to the list,

I report a problem found with a device that use the FTDI chip to
communicate data to pc.



The scenario is: a serial device streams data continuously toward the
PC. The application only reads the data once in a while (so not all data
is read). The data read back is corrupted (old data mixed with new).

The problem is absent on Windows (using the same HW device and Python code).


I report for reference how I solved the problem.

The problem arises because, while nothing is being read, the input stream
fills the input buffer. Once it is full, the buffer does not accept new
data, so what gets lost is the new data, not the old.

When the application needs to read the data it flushes the input buffer,
but this is not sufficient. There is a second (hardware?) buffer in the
chain that, immediately after the flush, pushes a chunk of old data into
the input buffer. Only after that is the new data queued. This results in
corruption at the boundary between the new and the old data in the
buffer.

The problem has also been reproduced with a usb-serial device using
the cdc-acm driver and with two serial ports linked by a null-modem
cable (so it's not FTDI specific).

To solve the problem we must completely flush the whole buffer chain. I
do this by flushing the input multiple times with a small pause between
flushes. In my case 10 flushes separated by 10ms pauses always empty the
whole buffer chain, so I get no corruption anymore. It's not an elegant
solution but it works (10 flushes are overkill, but I
want to be _really_ sure to read the correct data).
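
For reference, a minimal C sketch of this workaround using termios (the
flush count and delay are just the values described above; fd is assumed
to be an already-opened serial port):

#include <termios.h>
#include <unistd.h>

/* Flush the input queue several times with a short pause in between, so
 * that data an intermediate buffer pushes in right after a flush gets
 * discarded as well. */
static int deep_flush_input(int fd)
{
	int i;

	for (i = 0; i < 10; i++) {
		if (tcflush(fd, TCIFLUSH) < 0)
			return -1;
		usleep(10 * 1000);	/* 10 ms */
	}
	return 0;
}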

If someone knows a better way to solve the problem, suggestions are more
than welcome. Even just a confirmation of my analysis of the serial
buffer behavior would be appreciated.

What puzzled me at first was that on Windows the "flush" flushes all
the buffers, so the problem does not arise. Maybe the Linux serial stack
works this way because it is dictated by the POSIX standard... I don't
know, just guessing.

Thanks,

 ~ Antonio

PS: For reference, the original report was:
http://marc.info/?l=linux-kernel&m=117683012421931&w=2


Re: [PATCH 5/5] crypto: Add LZO compression support to the crypto interface

2007-05-04 Thread Richard Purdie
On Fri, 2007-05-04 at 13:39 +0530, Satyam Sharma wrote:
> On 5/1/07, Richard Purdie <[EMAIL PROTECTED]> wrote:
> > +static int lzo_init(struct crypto_tfm *tfm)
> > +{
> > +   struct lzo_ctx *ctx = crypto_tfm_ctx(tfm);
> > +
> > +   ctx->lzo_mem = vmalloc(LZO1X_MEM_COMPRESS);
> > +
> > +   if (!ctx->lzo_mem) {
> > +   vfree(ctx->lzo_mem);
> 
> Heh. What's (why's) this? You _can_ {k, v}free NULL but doing so after
> explicitly checking for it is ... ... insane!

True, there used to be two buffers allocated there and I've missed a
sensible cleanup when I removed one. I'll fix it, thanks.
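
For reference, the cleaned-up init would presumably reduce to something
like this (a sketch against the code quoted above, nothing more):

static int lzo_init(struct crypto_tfm *tfm)
{
	struct lzo_ctx *ctx = crypto_tfm_ctx(tfm);

	ctx->lzo_mem = vmalloc(LZO1X_MEM_COMPRESS);
	if (!ctx->lzo_mem)
		return -ENOMEM;

	return 0;
}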

Cheers,

Richard




Re: libata /dev/scd0 problem: mount after burn fails without eject

2007-05-04 Thread Tejun Heo
Frank van Maarseveen wrote:
> On Fri, May 04, 2007 at 10:16:32AM +0200, Tejun Heo wrote:
>> Michal Piotrowski wrote:
>>> On 01/05/07, Mark Lord <[EMAIL PROTECTED]> wrote:
 Forwarding to linux-scsi and linux-ide mailing lists.

 Frank van Maarseveen wrote:
> Tested on 2.6.20.6 and 2.6.21.1
>
> I decided to swich from the old IDE drivers to libata and now there
> seems to be a little but annoying problem: cannot mount an ISO image
> after burning it.
>
> May  1 14:32:55 kernel: attempt to access beyond end of device
> May  1 14:32:55 kernel: sr0: rw=0, want=68, limit=4
> May  1 14:32:55 kernel: isofs_fill_super: bread failed, dev=sr0,
 iso_blknum=16, block=16
> an "eject" command seems to fix the state of the PATA DVD writer
> or driver. The problem occurs for burning a CD and for DVD too with
> identical error messages.
>> Right after burning, if you run 'fuser -v /dev/sr0', what does it say?
> 
> Tried the fuser as root to be sure but it didn't show anything.

I guess sr is forgetting to set media changed flag somewhere.  Don't
really know where tho.  CC'd Bartlomiej, linux-scsi and linux-ide.  Any
ideas?
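
In case it helps with debugging, here is a small user-space probe of the
media-changed state (illustrative only; the device path is assumed to be
/dev/sr0 as above, and note the changed flag is typically reported only
once per event):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/cdrom.h>

int main(void)
{
	int changed;
	int fd = open("/dev/sr0", O_RDONLY | O_NONBLOCK);

	if (fd < 0) {
		perror("open /dev/sr0");
		return 1;
	}
	/* 1 = media changed since last check, 0 = unchanged, -1 = error */
	changed = ioctl(fd, CDROM_MEDIA_CHANGED, CDSL_CURRENT);
	printf("CDROM_MEDIA_CHANGED: %d\n", changed);
	return 0;
}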

-- 
tejun


[Patch driver tree] s390: fix subsystem removal fallout

2007-05-04 Thread Cornelia Huck
Hi Greg!

This patch fixes compilation on s390 in the current driver tree. It
should probably be merged into
remove-struct-subsystem-as-it-is-no-longer-needed.patch.

Signed-off-by: Cornelia Huck <[EMAIL PROTECTED]>

---
 arch/s390/hypfs/inode.c |2 +-
 arch/s390/kernel/ipl.c  |   26 +-
 2 files changed, 14 insertions(+), 14 deletions(-)

--- linux.orig/arch/s390/hypfs/inode.c
+++ linux/arch/s390/hypfs/inode.c
@@ -477,7 +477,7 @@ static int __init hypfs_init(void)
goto fail_diag;
}
}
-   kset_set_kset_s(&s390_subsys, hypervisor_subsys);
+   kobj_set_kset_s(&s390_subsys, hypervisor_subsys);
rc = subsystem_register(&s390_subsys);
if (rc)
goto fail_sysfs;
--- linux.orig/arch/s390/kernel/ipl.c
+++ linux/arch/s390/kernel/ipl.c
@@ -816,23 +816,23 @@ static int __init ipl_register_fcp_files
 {
int rc;
 
-   rc = sysfs_create_group(&ipl_subsys.kset.kobj,
+   rc = sysfs_create_group(&ipl_subsys.kobj,
&ipl_fcp_attr_group);
if (rc)
goto out;
-   rc = sysfs_create_bin_file(&ipl_subsys.kset.kobj,
+   rc = sysfs_create_bin_file(&ipl_subsys.kobj,
   &ipl_parameter_attr);
if (rc)
goto out_ipl_parm;
-   rc = sysfs_create_bin_file(&ipl_subsys.kset.kobj,
+   rc = sysfs_create_bin_file(&ipl_subsys.kobj,
   &ipl_scp_data_attr);
if (!rc)
goto out;
 
-   sysfs_remove_bin_file(&ipl_subsys.kset.kobj, &ipl_parameter_attr);
+   sysfs_remove_bin_file(&ipl_subsys.kobj, &ipl_parameter_attr);
 
 out_ipl_parm:
-   sysfs_remove_group(&ipl_subsys.kset.kobj, &ipl_fcp_attr_group);
+   sysfs_remove_group(&ipl_subsys.kobj, &ipl_fcp_attr_group);
 out:
return rc;
 }
@@ -846,7 +846,7 @@ static int __init ipl_init(void)
return rc;
switch (ipl_info.type) {
case IPL_TYPE_CCW:
-   rc = sysfs_create_group(&ipl_subsys.kset.kobj,
+   rc = sysfs_create_group(&ipl_subsys.kobj,
&ipl_ccw_attr_group);
break;
case IPL_TYPE_FCP:
@@ -854,11 +854,11 @@ static int __init ipl_init(void)
rc = ipl_register_fcp_files();
break;
case IPL_TYPE_NSS:
-   rc = sysfs_create_group(&ipl_subsys.kset.kobj,
+   rc = sysfs_create_group(&ipl_subsys.kobj,
&ipl_nss_attr_group);
break;
default:
-   rc = sysfs_create_group(&ipl_subsys.kset.kobj,
+   rc = sysfs_create_group(&ipl_subsys.kobj,
&ipl_unknown_attr_group);
break;
}
@@ -885,7 +885,7 @@ static int __init reipl_nss_init(void)
 
if (!MACHINE_IS_VM)
return 0;
-   rc = sysfs_create_group(&reipl_subsys.kset.kobj, &reipl_nss_attr_group);
+   rc = sysfs_create_group(&reipl_subsys.kobj, &reipl_nss_attr_group);
if (rc)
return rc;
strncpy(reipl_nss_name, kernel_nss_name, NSS_NAME_SIZE + 1);
@@ -900,7 +900,7 @@ static int __init reipl_ccw_init(void)
reipl_block_ccw = (void *) get_zeroed_page(GFP_KERNEL);
if (!reipl_block_ccw)
return -ENOMEM;
-   rc = sysfs_create_group(&reipl_subsys.kset.kobj, &reipl_ccw_attr_group);
+   rc = sysfs_create_group(&reipl_subsys.kobj, &reipl_ccw_attr_group);
if (rc) {
free_page((unsigned long)reipl_block_ccw);
return rc;
@@ -938,7 +938,7 @@ static int __init reipl_fcp_init(void)
reipl_block_fcp = (void *) get_zeroed_page(GFP_KERNEL);
if (!reipl_block_fcp)
return -ENOMEM;
-   rc = sysfs_create_group(&reipl_subsys.kset.kobj, &reipl_fcp_attr_group);
+   rc = sysfs_create_group(&reipl_subsys.kobj, &reipl_fcp_attr_group);
if (rc) {
free_page((unsigned long)reipl_block_fcp);
return rc;
@@ -990,7 +990,7 @@ static int __init dump_ccw_init(void)
dump_block_ccw = (void *) get_zeroed_page(GFP_KERNEL);
if (!dump_block_ccw)
return -ENOMEM;
-   rc = sysfs_create_group(&dump_subsys.kset.kobj, &dump_ccw_attr_group);
+   rc = sysfs_create_group(&dump_subsys.kobj, &dump_ccw_attr_group);
if (rc) {
free_page((unsigned long)dump_block_ccw);
return rc;
@@ -1014,7 +1014,7 @@ static int __init dump_fcp_init(void)
dump_block_fcp = (void *) get_zeroed_page(GFP_KERNEL);
if (!dump_block_fcp)
return -ENOMEM;
-   rc = sysfs_create_group(&dump_subsys.kset.kobj, &dump_fcp_attr_group);
+   rc = sysfs_create_group(&dump_subsys.kobj, &dump_fcp_attr_group);
if (rc) {
free_page((unsigned long)dump_block_fcp);

Re: libata /dev/scd0 problem: mount after burn fails without eject

2007-05-04 Thread Frank van Maarseveen
On Fri, May 04, 2007 at 10:41:41AM +0200, Tejun Heo wrote:
> Frank van Maarseveen wrote:
> > On Fri, May 04, 2007 at 10:16:32AM +0200, Tejun Heo wrote:
> >> Michal Piotrowski wrote:
> >>> On 01/05/07, Mark Lord <[EMAIL PROTECTED]> wrote:
>  Forwarding to linux-scsi and linux-ide mailing lists.
> 
>  Frank van Maarseveen wrote:
> > Tested on 2.6.20.6 and 2.6.21.1
> >
> > I decided to swich from the old IDE drivers to libata and now there
> > seems to be a little but annoying problem: cannot mount an ISO image
> > after burning it.
> >
> > May  1 14:32:55 kernel: attempt to access beyond end of device
> > May  1 14:32:55 kernel: sr0: rw=0, want=68, limit=4
> > May  1 14:32:55 kernel: isofs_fill_super: bread failed, dev=sr0,
>  iso_blknum=16, block=16
> > an "eject" command seems to fix the state of the PATA DVD writer
> > or driver. The problem occurs for burning a CD and for DVD too with
> > identical error messages.
> >> Right after burning, if you run 'fuser -v /dev/sr0', what does it say?
> > 
> > Tried the fuser as root to be sure but it didn't show anything.
> 
> I guess sr is forgetting to set media changed flag somewhere.

Plausible. I get the same kernel messages when I try to mount the CD
before burning.

-- 
Frank


Re: [RFC] [PATCH] DRM TTM Memory Manager patch

2007-05-04 Thread Jerome Glisse

On 5/4/07, Thomas Hellström <[EMAIL PROTECTED]> wrote:

Keith Packard wrote:
> On Thu, 2007-05-03 at 01:01 +0200, Thomas Hellström wrote:
>
>
>> It might be possible to find schemes that work around this. One way
>> could possibly be to have a buffer mapping -and validate order for
>> shared buffers.
>>
>
> If mapping never blocks on anything other than the fence, then there
> isn't any dead lock possibility. What this says is that ordering of
> rendering between clients is *not DRMs problem*. I think that's a good
> solution though; I want to let multiple apps work on DRM-able memory
> with their own CPU without contention.
>
> I don't recall if Eric layed out the proposed rules, but:
>
>  1) Map never blocks on map. Clients interested in dealing with this
> are on their own.
>
>  2) Submit blocks on map. You must unmap all buffers before submitting
> them. Doing the relocations in the kernel makes this all possible.
>
>  3) Map blocks on the fence from submit. We can play with pending the
> flush until the app asks for the buffer back, or we can play with
> figuring out when flushes are useful automatically. Doesn't matter
> if the policy is in the kernel.
>
> I'm interested in making deadlock avoidence trivial and eliminating any
> map-map contention.
>
>
It's rare to have two clients access the same buffer at the same time.
In what situation will this occur?

If we think of map / unmap and validation / fence  as taking a buffer
mutex either for the CPU or for the GPU, that's the way implementation
is done today. The CPU side of the mutex should IIRC be per-client
recursive. OTOH, the TTM implementation won't stop the CPU from
accessing the buffer when it is unmapped, but then you're on your own.
"Mutexes" need to be taken in the correct order, otherwise a deadlock
will occur, and GL will, as outlined in Eric's illustration, more or
less encourage us to take buffers in the "incorrect" order.

In essence what you propose is to eliminate the deadlock problem by just
avoiding taking the buffer mutex unless we know the GPU has it. I see
two problems with this:

* It will encourage different DRI clients to simultaneously access
  the same buffer.
* Inter-client and GPU data coherence can be guaranteed if we issue
  a mb() / write-combining flush with the unmap operation (which,
  BTW, I'm not sure is done today). Otherwise it is up to the
  clients, and very easy to forget.

I'm a bit afraid we might in the future regret taking the easy way out?

OTOH, letting DRM resolve the deadlock by unmapping and remapping shared
buffers in the correct order might not be the best one either. It will
certainly mean some CPU overhead and what if we have to do the same with
buffer validation? (Yes for some operations with thousands and thousands
of relocations, the user space validation might need to stay).

Personally, I'm slightly biased towards having DRM resolve the deadlock,
but I think any solution will do as long as the implications and why we
choose a certain solution are totally clear.

For item 3) above the kernel must have a way to issue a flush when
needed for buffer eviction.
The current implementation also requires the buffer to be completely
flushed before mapping.
Other than that the flushing policy is currently completely up to the
DRM drivers.

/Thomas


I might be saying stupid things, as I don't think I fully understand all
the input to this problem. Anyway, here are my thoughts on all this:

1) A client's map never blocks (as in Keith's layout), except on a
   fence from the DRM side (point 3 in Keith's layout)
2) A client should always unmap buffers before submitting (as in Keith's
   layout)
3) On the DRM side you always acquire buffer locks in a given order; for
   instance each buffer gets an id and you lock from the smaller id to
   the bigger one (with a clever implementation the cost of that will be
   small; a sketch of this ordering is given right after this list)
4) We keep two GPU queues:
- a pending queue, in which we do all the work necessary before
  submission (locking buffers, validation, ...); for instance we might
  wait here for buffers that are still mapped by some other apps in
  user space
- a run queue, to which we add each request that is now ready to be
  submitted to the GPU
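
An illustrative sketch of the ordering rule in point 3 (plain pthreads and
made-up types rather than anything DRM-specific, just to show the idea):

#include <pthread.h>
#include <stdlib.h>

struct buffer {
	unsigned int id;
	pthread_mutex_t lock;
};

static int by_id(const void *a, const void *b)
{
	const struct buffer *x = *(struct buffer * const *)a;
	const struct buffer *y = *(struct buffer * const *)b;

	return (x->id > y->id) - (x->id < y->id);
}

/* Always take the locks in ascending id order, so that two submissions
 * holding overlapping buffer sets can never deadlock on each other. */
static void lock_all(struct buffer **bufs, size_t n)
{
	size_t i;

	qsort(bufs, n, sizeof(*bufs), by_id);
	for (i = 0; i < n; i++)
		pthread_mutex_lock(&bufs[i]->lock);
}

static void unlock_all(struct buffer **bufs, size_t n)
{
	while (n--)
		pthread_mutex_unlock(&bufs[n]->lock);	/* reverse order */
}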

Of course in this scheme we keep the fencing machinery, so user space can
know when it is safe to reuse a previously submitted buffer. The outcome
of having two separate queues in DRM is that if two apps lock up each
other, other apps can still use the GPU, so only the apps fighting over a
buffer will suffer.

And for user-space synchronization, I believe it is a user-space problem,
i.e. it's up to user space to add proper synchronization. For instance, as
map doesn't block for any client, two apps can mess with the same buffer
in user space; it's up to the users to have a policy to exclude each other
(I believe this will be a DRI or X.org problem, to synchronize between
consumers).

I believe in this scheme you can only have

Re: [SOLVED] Serial buffer corruption [was Re: FTDI usb-serial possible bug]

2007-05-04 Thread Oliver Neukum
Am Freitag, 4. Mai 2007 10:38 schrieb Antonino Ingargiola:
> To solve the problem we must do a complete flush of all the buffer
> chain. I do this flushing the input multiple times with a small pause
> between them. In my case 10 flushes separated by a 10ms pause always
> empties the whole buffer chain, so I get no corruption anymore. I'ts
> not an elegant solution but it works (10 flushes are an overkill but I
> want to be _really_ sure to read the correct data).

How do you flush the buffers? Simply reading them out?

Regards
Oliver


Re: swap-prefetch: 2.6.22 -mm merge plans

2007-05-04 Thread Ingo Molnar

* Nick Piggin <[EMAIL PROTECTED]> wrote:

> > i'm wondering about swap-prefetch:

> Being able to config all these core heuristics changes is really not 
> that much of a positive. The fact that we might _need_ to config 
> something out, and double the configuration range isn't too pleasing.

Well, to the desktop user this is a speculative performance feature that 
he is willing to potentially waste CPU and IO capacity, in expectation 
of better performance.

On the conceptual level it is _precisely the same thing as regular file 
readahead_. (with the difference that to me swapahead seems to be quite 
a bit more intelligent than our current file readahead logic.)

This feature has no API or ABI impact at all, it's a pure performance 
feature. (besides the trivial sysctl to turn it runtime on/off).

> Here were some of my concerns, and where our discussion got up to.

> > Yes.  Perhaps it just doesn't help with the updatedb thing.  Or 
> > maybe with normal system activity we get enough free pages to kick 
> > the thing off and running.  Perhaps updatedb itself has a lot of 
> > rss, for example.
> 
> Could be, but I don't know. I'd think it unlikely to allow _much_ 
> swapin, if huge amounts of the desktop have been swapped out. But 
> maybe... as I said, nobody seems to have a recipe for these things.

can i take this one as a "no fundamental objection"? There are really 
only 2 maintainance options left:

  1) either you can do it better or at least have a _very_ clearly
 described idea outlined about how to do it differently

  2) or you should let others try it

#1 you've not done for 2-3 years since swap-prefetch was waiting for
integration so it's not an option at this stage anymore. Then you are 
pretty much obliged to do #2. ;-)

And let me be really blunt about this, there is no option #3 to say: "I 
have no real better idea, I have no code, I have no time, but hey, lets 
not merge this because it 'could in theory' be possible to do it better" 
=B-)

really, we are likely be better off by risking the merge of _bad_ code 
(which in the swap-prefetch case is the exact opposite of the truth), 
than to let code stagnate. People are clearly unhappy about certain 
desktop aspects of swapping, and the only way out of that is to let more 
people hack that code. Merging code involves more people. It will cause 
'noise' and could cause regressions, but at least in this case the only 
impact is 'performance' and the feature is trivial to disable.

The maintainance drag outside of swap_prefetch.c is essentially _zero_. 
If the feature doesnt work it ends up on Con's desk. If it turns out to 
not work at all (despite years of testing and happy users) it still only 
ends up on Con's desk. A clear win/win scenario for you i think :-)

> > Would be useful to see this claim substantiated with a real 
> > testcase, description of results and an explanation of how and why 
> > it worked.
> 
> Yes... and then try to first improve regular page reclaim and use-once 
> handling.

agreed. Con, IIRC you wrote a testcase for this, right? Could you please 
send us the results of that testing?

> >>2) It is a _highly_ speculative operation, and in workloads where periods
> >>of low and high page usage with genuinely unused anonymous / tmpfs
> >>pages, it could waste power, memory bandwidth, bus bandwidth, disk
> >>bandwidth...
> > 
> > Yes.  I suspect that's a matter of waiting for the corner-case 
> > reporters to complain, then add more heuristics.
> 
> Ugh. Well it is a pretty fundamental problem. Basically swap-prefetch 
> is happy to do a _lot_ of work for these things which we have already 
> decided are least likely to be used again.

i see no real problem here. We've had heuristics for a _long_ time in 
various areas of the code. Sometimes they work, sometimes they suck.

the flow of this is really easy: distro looking for a feature edge turns 
it on and announces it, if the feature does not work out for users then 
user turns it off and complains to distro, if enough users complain then 
distro turns it off for next release, upstream forgets about this 
performance feature and eventually removes it once someone notices that 
it wouldnt even compile in the past 2 main releases. I see no problem 
here, we did that in the past too with performance features. The 
networking stack has literally dozens of such small tunable things which 
get experimented with, and whose defaults do get tuned carefully. Some 
of the knobs help bandwidth, some help latency.

I do not even see any risk of "splitup of mindshare" - swap-prefetch is 
so clearly speculative that it's not really a different view about how 
to do swapping (which would split the tester base, etc.), it's simply a 
"do you want your system to speculate about the future or not" add-on 
decision. Every system has a pretty clear idea about that: desktops 
generally want to do it, clusters generally dont want to do it.

> >>3) I haven't seen a single s

Re: Correct location for ADC/DAC drivers

2007-05-04 Thread Stefan Roese
On Friday 04 May 2007 10:24, Robert Schwebel wrote:
> On Tue, May 01, 2007 at 02:35:44PM +0200, Stefan Roese wrote:
> > I'm in the stage of integrating some ADC and DAC drivers for the AMCC
> > 405EZ PPC and looking for the correct location to place these drivers
> > in the Linux source tree. The drivers are basically character-drivers,
> > so my first thought is to put them in "drivers/char/adc/foo.c" or
> > "drivers/char/adc_foo.c". Is this a good solution?
> >
> > Any suggestions welcome (could be that I missed an already existing
> > example).
> >
> > BTW: I am aware of the hwmon subsystem, but I don't think it fits my
> > needs in this case.
>
> Could you elaborate the requirements a bit more? ADC is not ADC, because
> slow i2c ADCs which measure a temperature every five minutes have other
> requirements than multi-megabyte-per-second-dma-driven ADCs.

The hardware (PPC405EZ) actually implements a high-speed, DMA-capable ADC 
controller with 10-bit resolution and up to a 4MHz sample rate. The current 
driver doesn't support all these features though (DMA is not supported right 
now, for example); that may be added in future releases. It would be good, 
though, to have the driver located at the "correct" place in the kernel 
tree right away.

Best regards,
Stefan


Re: Per-CPU data as a structure

2007-05-04 Thread Eric Dumazet
On Fri, 04 May 2007 10:36:37 +0200
"Julio M. Merino Vidal" <[EMAIL PROTECTED]> wrote:


> 
> Anyway, what do you think about adding the above text to the code (percpu.h
> maybe) as documentation?  See the patch below.  (Dunno if the Signed-off-by
> line is appropriate as most of the text is yours.)
> 
> Signed-off-by: Julio M. Merino Vidal <[EMAIL PROTECTED]>
> 
> diff --git a/include/linux/percpu.h b/include/linux/percpu.h
> index 600e3d3..b8e8b8c 100644
> --- a/include/linux/percpu.h
> +++ b/include/linux/percpu.h
> @@ -1,6 +1,21 @@
>  #ifndef __LINUX_PERCPU_H
>  #define __LINUX_PERCPU_H
>  
> +/*
> + * percpu provides a mechanism to define variables that are specific to each
> + * CPU in the system.
> + *
> + * Each variable is defined as an independent array of NR_CPUS elements.
> + * This approach is used instead of a per-CPU structure because it has the
> + * following advantages:
> + * - Independent maintenance: a source file can define new per-CPU
> + *   variables without distorting others.
> + * - Fast access and relatively compact code.
> + * - Avoids false sharing by keeping cache lines of different CPUs separate.
> + * - Doesn't waste a lot of memory in padding like NR_CPUs arrays usually
> + *   need to avoid the previous point.
> + */
> +
>  #include  /* For preempt_disable() */
>  #include  /* For kmalloc() */
>  #include 

Documentation is good, and percpu is probably missing some, but please add it 
in a Documentation/percpu.txt file, because that's the right place.

You can then have really extensive documentation, and you won't slow down 
kernel compiles...

I suggest you document all the variants (get_cpu_var(), __get_cpu_var(), ...) 
with examples of use.
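
For instance, something along these lines (variable and function names made
up purely for illustration) could show the common accessors:

#include <linux/percpu.h>

static DEFINE_PER_CPU(unsigned long, my_counter);

static void bump_counter(void)
{
	/* get_cpu_var() disables preemption and yields this CPU's copy */
	get_cpu_var(my_counter)++;
	put_cpu_var(my_counter);
}

static unsigned long peek_counter(int cpu)
{
	/* per_cpu() reads a given CPU's copy; no locking is implied */
	return per_cpu(my_counter, cpu);
}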

Also, please note that per-cpu data is not allocated NR_CPUS times; it depends 
on the possible cpus. So if you boot an SMP kernel on a one-CPU desktop, the 
kernel allocates only the needed space.

So per_cpu data also has a space-saving argument against structures declared 
as [NR_CPUS] arrays.

Thank you



Re: build system: no module target ending with slash?

2007-05-04 Thread Christian Hesse
On Thursday 03 May 2007, Sam Ravnborg wrote:
> On Thu, May 03, 2007 at 09:17:15AM +0200, Christian Hesse wrote:
> > On Thursday 03 May 2007, Sam Ravnborg wrote:
> > > On Thu, May 03, 2007 at 06:25:11AM +0200, Sam Ravnborg wrote:
> > > > On Thu, May 03, 2007 at 12:43:43AM +0200, Christian Hesse wrote:
> > > > > Hi James, hi everybody,
> > > > >
> > > > > playing with iwlwifi I try to patch it into the kernel and to build
> > > > > it from there. But I have a problem with the build system.
> > > > >
> > > > > The file drivers/net/wireless/mac80211/Makefile contains one single
> > > > > line:
> > > > >
> > > > > obj-$(CONFIG_IWLWIFI)   += iwlwifi/
> > > > >
> > > > > When CONFIG_IWLWIFI=m in scripts/Makefile.lib line 29 the target is
> > > > > filtered as it ends with a slash. That results in
> > > > > drivers/net/wireless/mac80211/built-in.o not being built and the
> > > > > build process breaks with an error. What is the correct way to
> > > > > handle this? Why are targets ending with a slash filtered?
> > > >
> > > > Looks buggy. I will take a look tonight.
> > >
> > > After some coffee...
> > >
> > > Line 29 in Kbuild.include find all modules and a directory is not a
> > > module. In line 26 in same file the directory iwlwifi is included in
> > > the list of directories to visit.
> > > So there is something else going on.
> >
> > In scripts/Kbuild.include line 26 is empty and line 29 is a comment... Do
> > I look at the wrong place?
>
> I looked at lxr.linux.no - so probarly an outdated version.
>
> > I still believe in my version: built-in.o is built if any of $(obj-y)
> > $(obj-m) $(obj-n) $(obj-) $(lib-target) contains anything in
> > scripts/Makefile.build line 77. As scripts/Makefile.lib line 29 filters
> > the only target the object file is not built.
>
> I have applied your patch and tried it out.
> The reason for the problem is the placeholder directory mac80211.
> kbuild will not waste time building built-in.o for a directory where
> it is not necessary. So for mac80211 no built-in.o is created since there
> is no need. The only reference is to a module.

Agreed that it is not really needed. But if you don't build it you should not 
try to link it later...

> The quick-and-dirty workaround is to add a single
> obj-n := xx
> in mac80211/Makefile and kbuild is happy again.
>
> I could teach kbuild to create built-in.o also in the case
> where we refer to a subdirectory only. But then we would end up with a
> built-in.o in all directories where we have a kbuild MAkefile (almost) and
> that is not desireable.

I would prefer to teach it not to link object files that are not built.

> So I recommend the proposed workaround for now with a proper comment.

Ok, thanks for your help.
-- 
Regards,
Chris




Re: [PATCH 2/5] jffs2: Add LZO compression support to jffs2

2007-05-04 Thread Satyam Sharma

Hi Richard,

On 5/1/07, Richard Purdie <[EMAIL PROTECTED]> wrote:

Add LZO1X compression/decompression support to jffs2.

LZO's interface doesn't entirely match that required by jffs2 so a
buffer and memcpy is unavoidable.

Signed-off-by: Richard Purdie <[EMAIL PROTECTED]>
---
[...]
+++ b/fs/jffs2/compr_lzo.c
[...]
+static void *lzo_mem;
+static void *lzo_compress_buf;
+static DEFINE_MUTEX(deflate_mutex);
+
+static void free_workspace(void)
+{
+   vfree(lzo_mem);
+   vfree(lzo_compress_buf);
+}
+
+static int __init alloc_workspace(void)
+{
+   lzo_mem = vmalloc(LZO1X_MEM_COMPRESS);
+   lzo_compress_buf = vmalloc(lzo1x_worst_compress(PAGE_SIZE));
+
+   if (!lzo_mem || !lzo_compress_buf) {
+   printk(KERN_WARNING "Failed to allocate lzo deflate workspace\n");
+   free_workspace();
+   return -ENOMEM;
+   }
+
+   return 0;
+}
+
+static int jffs2_lzo_compress(unsigned char *data_in, unsigned char *cpage_out,
+ uint32_t *sourcelen, uint32_t *dstlen, void *model)
+{
+   unsigned long compress_size;
+   int ret;
+
+   mutex_lock(&deflate_mutex);
+   ret = lzo1x_1_compress(data_in, *sourcelen, lzo_compress_buf, &compress_size, lzo_mem);
+   mutex_unlock(&deflate_mutex);


Considering we do have to memcpy() the entire compressed result to the
destination output buffer later anyway (note that
fs/jffs2/compr_zlib.c doesn't need to do that), do we really gain much
by avoiding vmalloc() and vfree() in jffs2_lzo_compress() itself and
keeping the workspace buffers pre-allocated? I ask because I always
found these global static workspace buffers ugly, and all the
associated code + mutex could go away if we make them local to
jffs2_lzo_compress() -- as long as it doesn't hurt performance
terribly, of course.
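
A rough sketch of that local-allocation variant, against the interface of
the quoted patch (error handling simplified, same headers as the original
file assumed):

static int jffs2_lzo_compress(unsigned char *data_in, unsigned char *cpage_out,
			      uint32_t *sourcelen, uint32_t *dstlen, void *model)
{
	void *wrkmem;
	unsigned char *buf;
	unsigned long compress_size;
	int ret = -1;

	wrkmem = vmalloc(LZO1X_MEM_COMPRESS);
	buf = vmalloc(lzo1x_worst_compress(PAGE_SIZE));
	if (!wrkmem || !buf)
		goto out;

	if (lzo1x_1_compress(data_in, *sourcelen, buf, &compress_size, wrkmem))
		goto out;
	if (compress_size > *dstlen)
		goto out;

	memcpy(cpage_out, buf, compress_size);
	*dstlen = compress_size;
	ret = 0;
out:
	vfree(buf);	/* vfree(NULL) is safe */
	vfree(wrkmem);
	return ret;
}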

Thanks,
Satyam


Re: swap-prefetch: 2.6.22 -mm merge plans

2007-05-04 Thread Nick Piggin

Ingo Molnar wrote:

* Nick Piggin <[EMAIL PROTECTED]> wrote:



Here were some of my concerns, and where our discussion got up to.



Yes.  Perhaps it just doesn't help with the updatedb thing.  Or 
maybe with normal system activity we get enough free pages to kick 
the thing off and running.  Perhaps updatedb itself has a lot of 
rss, for example.


Could be, but I don't know. I'd think it unlikely to allow _much_ 
swapin, if huge amounts of the desktop have been swapped out. But 
maybe... as I said, nobody seems to have a recipe for these things.



can i take this one as a "no fundamental objection"? There are really 
only 2 maintainance options left:


  1) either you can do it better or at least have a _very_ clearly
 described idea outlined about how to do it differently

  2) or you should let others try it

#1 you've not done for 2-3 years since swap-prefetch was waiting for
integration so it's not an option at this stage anymore. Then you are 
pretty much obliged to do #2. ;-)


The burden is not on me to get someone else's feature merged. If it
can be shown to work well and people's concerns addressed, then anything
will get merged. The reason Linux is so good is because of what we don't
merge, figuratively speaking.

I wanted to see some basic regression tests to show that it hasn't caused
obvious problems, and some basic scenarios where it helps, so that we can
analyse them. It is really simple, but I haven't got any since first
asking.

And note that I don't think I ever explicitly "nacked" anything, just
voiced my concerns. If my concerns had been addressed, then I couldn't
have stopped anybody from merging anything.



2) It is a _highly_ speculative operation, and in workloads where periods
  of low and high page usage with genuinely unused anonymous / tmpfs
  pages, it could waste power, memory bandwidth, bus bandwidth, disk
  bandwidth...


Yes.  I suspect that's a matter of waiting for the corner-case 
reporters to complain, then add more heuristics.


Ugh. Well it is a pretty fundamental problem. Basically swap-prefetch 
is happy to do a _lot_ of work for these things which we have already 
decided are least likely to be used again.



i see no real problem here. We've had heuristics for a _long_ time in 
various areas of the code. Sometimes they work, sometimes they suck.


So that's one of my issues with the code. If all you have to support a
merge is anecdotal evidence, then I find it interesting that you would
easily discount something like this.



4) If this is helpful, wouldn't it be equally important for things like
  mapped file pages? Seems like half a solution.


[...]


(otoh the akpm usersapce implementation is swapoff -a;swapon -a)


Perhaps. You may need a few indicators to see whether the system is 
idle... but OTOH, we've already got a lot of indicators for memory, 
disk usage, etc. So, maybe :)



The time has passed for this. Let others play too. Please :-)


Play with what? Prefetching mmaped file pages as well? Sure.


I could be wrong, but IIRC there is no good way to know which cpuset 
to bring the page back into, (and I guess similarly it would be hard 
to know what container to account it to, if doing 
account-on-allocate).



(i think cpusets are totally uninteresting in this context: nobody in 
their right mind is going to use swap-prefetch on a big NUMA box. Nor 
can i see any fundamental impediment to making this more cpuset-aware, 
just like other subsystems were made cpuset-aware, once the requests 
from actual users came in and people started getting interested in it.)


OK, so make it more cpuset aware. This isn't a new issue, I raised it
a long time ago. And trust me, it is a nightmare to just assume that
nobody will use cpusets on a small box for example (AFAIK the resource
control guys are looking at doing just that).

All core VM features should play nicely with each other without *really*
good reason.


I think the "lack of testcase and numbers" is the only valid technical 
objection i've seen so far.


Well you're entitled to your opinion too.

--
SUSE Labs, Novell Inc.


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-04 Thread Nick Piggin

Nick Piggin wrote:

Christoph Hellwig wrote:



Is that every fork/exec or just under certain cicumstances?
A 5% regression on every fork/exec is not acceptable.



Well after patch2, G5 fork is 3% and exec is 1%, I'd say the P4
numbers will be improved as well with that patch. Then if we have
specific lock/unlock bitops, I hope it should reduce that further.


OK, with the races and missing barriers fixed from the previous patch,
plus the attached one added (+patch3), numbers are better again (I'm not
sure if I have the ppc barriers correct though).

These ops could also be put to use in bit spinlocks, buffer lock, and
probably a few other places too.
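
As a rough sketch, a bit spinlock could then look something like this
(wrapper names made up; the two primitives are the ones added in the
patch below):

static inline void example_bit_lock(int nr, unsigned long *word)
{
	while (test_and_set_bit_lock(nr, word))	/* acquire semantics */
		cpu_relax();
}

static inline void example_bit_unlock(int nr, unsigned long *word)
{
	clear_bit_unlock(nr, word);		/* release semantics */
}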

2.6.21   1.49-1.51   164.6-170.8   741.8-760.3
+patch   1.71-1.73   175.2-180.8   780.5-794.2
+patch2  1.61-1.63   169.8-175.0   748.6-757.0
+patch3  1.54-1.57   165.6-170.9   748.5-757.5

So fault performance goes to under 5%, fork is in the noise, exec is
still up 1%, but maybe that's noise or cache effects again.

--
SUSE Labs, Novell Inc.
Index: linux-2.6/include/asm-powerpc/bitops.h
===
--- linux-2.6.orig/include/asm-powerpc/bitops.h 2007-05-04 16:08:20.0 +1000
+++ linux-2.6/include/asm-powerpc/bitops.h  2007-05-04 16:14:39.0 +1000
@@ -87,6 +87,24 @@
: "cc" );
 }
 
+static __inline__ void clear_bit_unlock(int nr, volatile unsigned long *addr)
+{
+   unsigned long old;
+   unsigned long mask = BITOP_MASK(nr);
+   unsigned long *p = ((unsigned long *)addr) + BITOP_WORD(nr);
+
+   __asm__ __volatile__(
+   LWSYNC_ON_SMP
+"1:"   PPC_LLARX "%0,0,%3  # clear_bit_unlock\n"
+   "andc   %0,%0,%2\n"
+   PPC405_ERR77(0,%3)
+   PPC_STLCX "%0,0,%3\n"
+   "bne-   1b"
+   : "=&r" (old), "+m" (*p)
+   : "r" (mask), "r" (p)
+   : "cc" );
+}
+
 static __inline__ void change_bit(int nr, volatile unsigned long *addr)
 {
unsigned long old;
@@ -126,6 +144,27 @@
return (old & mask) != 0;
 }
 
+static __inline__ int test_and_set_bit_lock(unsigned long nr,
+  volatile unsigned long *addr)
+{
+   unsigned long old, t;
+   unsigned long mask = BITOP_MASK(nr);
+   unsigned long *p = ((unsigned long *)addr) + BITOP_WORD(nr);
+
+   __asm__ __volatile__(
+"1:"   PPC_LLARX "%0,0,%3  # test_and_set_bit_lock\n"
+   "or %1,%0,%2 \n"
+   PPC405_ERR77(0,%3)
+   PPC_STLCX "%1,0,%3 \n"
+   "bne-   1b"
+   ISYNC_ON_SMP
+   : "=&r" (old), "=&r" (t)
+   : "r" (mask), "r" (p)
+   : "cc", "memory");
+
+   return (old & mask) != 0;
+}
+
 static __inline__ int test_and_clear_bit(unsigned long nr,
 volatile unsigned long *addr)
 {
Index: linux-2.6/include/linux/pagemap.h
===
--- linux-2.6.orig/include/linux/pagemap.h  2007-05-04 16:14:36.0 +1000
+++ linux-2.6/include/linux/pagemap.h   2007-05-04 16:17:34.0 +1000
@@ -136,13 +136,18 @@
 extern void FASTCALL(__wait_on_page_locked(struct page *page));
 extern void FASTCALL(unlock_page(struct page *page));
 
+static inline int trylock_page(struct page *page)
+{
+   return (likely(!TestSetPageLocked_Lock(page)));
+}
+
 /*
  * lock_page may only be called if we have the page's inode pinned.
  */
 static inline void lock_page(struct page *page)
 {
might_sleep();
-   if (unlikely(TestSetPageLocked(page)))
+   if (!trylock_page(page))
__lock_page(page);
 }
 
@@ -153,7 +158,7 @@
 static inline void lock_page_nosync(struct page *page)
 {
might_sleep();
-   if (unlikely(TestSetPageLocked(page)))
+   if (!trylock_page(page))
__lock_page_nosync(page);
 }

Index: linux-2.6/drivers/scsi/sg.c
===
--- linux-2.6.orig/drivers/scsi/sg.c2007-04-12 14:35:08.0 +1000
+++ linux-2.6/drivers/scsi/sg.c 2007-05-04 16:23:27.0 +1000
@@ -1734,7 +1734,7 @@
  */
flush_dcache_page(pages[i]);
/* ?? Is locking needed? I don't think so */
-   /* if (TestSetPageLocked(pages[i]))
+   /* if (!trylock_page(pages[i]))
   goto out_unlock; */
 }
 
Index: linux-2.6/fs/cifs/file.c
===
--- linux-2.6.orig/fs/cifs/file.c   2007-04-12 14:35:09.0 +1000
+++ linux-2.6/fs/cifs/file.c2007-05-04 16:23:36.0 +1000
@@ -1229,7 +1229,7 @@
 
if (first < 0)
lock_page(page);
-   else if (TestSetPageLocked(page))
+   else if (!trylock_page(page))
break;
 
if (unlikely(page->mapping != mapping)) {
Index: linux-2.6/fs/jbd/commit.c
===

Re: [linux-dvb] Re: DST/BT878 module customization (.. was: Critical points about ...)

2007-05-04 Thread Mauro Carvalho Chehab
> > It would be nice, however, to have a patch making dvb_attach more
> > generic, by e.g. having a variant that allows passing another message.
> 
> Only this message is from dvb_attach():
> > DVB: Unable to find symbol dst_attach()
> 
> Is it saying that it cannot load the module that dst_attach() is in (it
> doesn't know what module that is, modprobe knows that).  If you enabled dst
> support and deleted the module, it would be the same.
> 
> If you turn off dvb_attach() and also disable dst, you should instead get
> this message:
> dst_attach: driver disabled by Kconfig
> 
> Maybe that would look nicer with a "DVB:  " prefix?  That would easier if it
> wasn't necessary to update the printk in each boilerplate stub function.  What
> if one macro created these stubs
> 
> > frontend_init: Could not find a Twinhan DST.
> > dvb-bt8xx: A frontend driver was not found for device 109e/0878 subsystem 
> > fbfb/f800
> 
> These two messages are printed by the dvb-bt8xx driver, not by dvb_attach().
> It would be trivial to change of course, but I'm not sure what would be
> pedantically correct for both dst and non-dst based hardware.

Sorry, this is what I meant: to fix the above message. The dvb_attach is
generic enough.

Maybe a nicer text would be something like:
"Couldn't initialize frontend helper modules for device ..."

since dvb_attach will also print what modules were not loaded.
> 
> > There's an argument against the prototype changes on dst_attach and
> > dst_ca_attach since they aren't frontend.
> 
> The reason I changed that, is the dst_attach() already did return a
> dvb_frontend pointer, it was just inside an enclosing structure.  i.e. what
> existed before:
> 
> {
>   struct dst_state *state;
>   state = dst_attach(...);
>   card->fe = &state->frontend;
> } /* state goes out of scope */
> 
> The frontend is inside the state struct and the state pointer isn't saved
> anywhere.  dvb-bt8xx just saves a frontend pointer from inside the dst state
> and tosses the state pointer away.  So I changed that to:
> 
>   card->fe = dst_attach(...);

IMO, this made the code cleaner.

-- 
Cheers,
Mauro



[PATCH -mm] PM: Separate hibernation code from suspend code

2007-05-04 Thread Rafael J. Wysocki
From: Rafael J. Wysocki <[EMAIL PROTECTED]>, Johannes Berg <[EMAIL PROTECTED]>

Separate the hibernation (aka suspend to disk code) from the other suspend code.
In particular:
 * Remove the definitions related to hibernation from include/linux/pm.h
 * Introduce struct hibernation_ops and a new hibernate() function to hibernate
   the system, defined in include/linux/suspend.h
 * Separate suspend code in kernel/power/main.c from hibernation-related code
   in kernel/power/disk.c and kernel/power/user.c (with the help of
   hibernation_ops)
 * Switch ACPI (the only user of pm_ops.pm_disk_mode) to hibernation_ops
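
The registration sketch mentioned above: the callback and helper names here are
assumptions inferred from the changelog (mirroring the old prepare/enter/finish
trio), not text copied from the patch:

/* Sketch only: names and signatures assumed from the changelog. */
static int example_hibernation_prepare(void)  { return 0; } /* pre-snapshot setup */
static int example_hibernation_enter(void)    { return 0; } /* power off after image write */
static void example_hibernation_finish(void)  { }           /* cleanup after resume/abort */

static struct hibernation_ops example_hibernation_ops = {
        .prepare = example_hibernation_prepare,
        .enter   = example_hibernation_enter,
        .finish  = example_hibernation_finish,
};

static int __init example_pm_init(void)
{
        /* replaces setting pm_ops.pm_disk_mode for the 'platform' mode */
        hibernation_set_ops(&example_hibernation_ops);
        return 0;
}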

Signed-off-by: Rafael J. Wysocki <[EMAIL PROTECTED]>
---
 Documentation/power/userland-swsusp.txt |   26 ++--
 drivers/acpi/sleep/main.c   |   67 +--
 drivers/acpi/sleep/proc.c   |2 
 drivers/i2c/chips/tps65010.c|2 
 include/linux/pm.h  |   31 -
 include/linux/suspend.h |   32 +
 kernel/power/disk.c |  185 
 kernel/power/main.c |   42 ++-
 kernel/power/power.h|7 -
 kernel/power/user.c |   13 +-
 kernel/sys.c|2 
 11 files changed, 224 insertions(+), 185 deletions(-)

Index: linux-2.6.21/include/linux/pm.h
===
--- linux-2.6.21.orig/include/linux/pm.h2007-05-02 22:08:58.0 
+0200
+++ linux-2.6.21/include/linux/pm.h 2007-05-02 22:09:55.0 +0200
@@ -107,26 +107,11 @@ typedef int __bitwise suspend_state_t;
 #define PM_SUSPEND_ON  ((__force suspend_state_t) 0)
 #define PM_SUSPEND_STANDBY ((__force suspend_state_t) 1)
 #define PM_SUSPEND_MEM ((__force suspend_state_t) 3)
-#define PM_SUSPEND_DISK((__force suspend_state_t) 4)
-#define PM_SUSPEND_MAX ((__force suspend_state_t) 5)
-
-typedef int __bitwise suspend_disk_method_t;
-
-/* invalid must be 0 so struct pm_ops initialisers can leave it out */
-#define PM_DISK_INVALID((__force suspend_disk_method_t) 0)
-#definePM_DISK_PLATFORM((__force suspend_disk_method_t) 1)
-#definePM_DISK_SHUTDOWN((__force suspend_disk_method_t) 2)
-#definePM_DISK_REBOOT  ((__force suspend_disk_method_t) 3)
-#definePM_DISK_TEST((__force suspend_disk_method_t) 4)
-#definePM_DISK_TESTPROC((__force suspend_disk_method_t) 5)
-#definePM_DISK_MAX ((__force suspend_disk_method_t) 6)
+#define PM_SUSPEND_MAX ((__force suspend_state_t) 4)
 
 /**
  * struct pm_ops - Callbacks for managing platform dependent suspend states.
  * @valid: Callback to determine whether the given state can be entered.
- * If %CONFIG_SOFTWARE_SUSPEND is set then %PM_SUSPEND_DISK is
- * always valid and never passed to this call. If not assigned,
- * no suspend states are valid.
  * Valid states are advertised in /sys/power/state but can still
  * be rejected by prepare or enter if the conditions aren't right.
  * There is a %pm_valid_only_mem function available that can be assigned
@@ -140,24 +125,12 @@ typedef int __bitwise suspend_disk_metho
  *
  * @finish: Called when the system has left the given state and all devices
  * are resumed. The return value is ignored.
- *
- * @pm_disk_mode: The generic code always allows one of the shutdown methods
- * %PM_DISK_SHUTDOWN, %PM_DISK_REBOOT, %PM_DISK_TEST and
- * %PM_DISK_TESTPROC. If this variable is set, the mode it is set
- * to is allowed in addition to those modes and is also made default.
- * When this mode is sent selected, the @prepare call will be called
- * before suspending to disk (if present), the @enter call should be
- * present and will be called after all state has been saved and the
- * machine is ready to be powered off; the @finish callback is called
- * after state has been restored. All these calls are called with
- * %PM_SUSPEND_DISK as the state.
  */
 struct pm_ops {
int (*valid)(suspend_state_t state);
int (*prepare)(suspend_state_t state);
int (*enter)(suspend_state_t state);
int (*finish)(suspend_state_t state);
-   suspend_disk_method_t pm_disk_mode;
 };
 
 /**
@@ -276,8 +249,6 @@ extern void device_power_up(void);
 extern void device_resume(void);
 
 #ifdef CONFIG_PM
-extern suspend_disk_method_t pm_disk_mode;
-
 extern int device_suspend(pm_message_t state);
 extern int device_prepare_suspend(pm_message_t state);
 
Index: linux-2.6.21/kernel/power/main.c
===
--- linux-2.6.21.orig/kernel/power/main.c   2007-05-02 22:08:58.0 
+0200
+++ linux-2.6.21/kernel/power/main.c2007-05-03 12:17:51.0 +0200
@@ -30,7 +30,6 @@
 DEFINE_MUTEX(pm_mutex);
 
 struct pm_ops *pm_

[SOLVED] Serial buffer corruption [was Re: FTDI usb-serial possible bug]

2007-05-04 Thread Antonino Ingargiola

Accidentally I've replied privately, sorry. Forwarding to ML...

-- Forwarded message --
From: Antonino Ingargiola <[EMAIL PROTECTED]>
Date: May 4, 2007 11:29 AM
Subject: Re: [SOLVED] Serial buffer corruption [was Re: FTDI
usb-serial possible bug]
To: Oliver Neukum <[EMAIL PROTECTED]>


On 5/4/07, Oliver Neukum <[EMAIL PROTECTED]> wrote:

On Friday, 4 May 2007 10:38, Antonino Ingargiola wrote:
> To solve the problem we must do a complete flush of all the buffer
> chain. I do this flushing the input multiple times with a small pause
> between them. In my case 10 flushes separated by a 10ms pause always
> empties the whole buffer chain, so I get no corruption anymore. It's
> not an elegant solution but it works (10 flushes are an overkill but I
> want to be _really_ sure to read the correct data).

How do you flush the buffers? Simply reading them out?


Nope. In Python I use the flushInput() method of the serial object
defined by the pyserial library[0]. The method just does this system
call:

   termios.tcflush(self.fd, TERMIOS.TCIFLUSH)

that I think is correct.
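
For reference, the same workaround as a minimal C sketch (the 10 x 10 ms
values are just the ones from my description above; tcflush() is the call
that flushInput() boils down to):

#include <termios.h>
#include <unistd.h>

/* Flush the input queue several times with a short pause so buffers
 * further down the chain (FTDI/USB, tty, line discipline) drain too. */
static void deep_flush_input(int fd)
{
        int i;

        for (i = 0; i < 10; i++) {
                tcflush(fd, TCIFLUSH);  /* what pyserial's flushInput() does */
                usleep(10 * 1000);      /* 10 ms pause between flushes */
        }
}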

Cheers,

 ~ Antonio

[0]: http://pyserial.sourceforge.net/ (or python-serial debian package)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3] UIO: Documentation

2007-05-04 Thread Hans-Jürgen Koch
On Thursday 03 May 2007 08:39, Hans-Jürgen Koch wrote:
> > 
> > Hm, I have about 3 different patches here now, all dependent on each
> > other, yet I can't tell which goes first :(
> > 
> > Can someone just send me 1, or 3 with the correct order in which to
> > apply them?
> > 
> > thanks,
> > 
> > greg k-h
> >
> 
> Hi Greg,
> I attached all the UIO patches I collected so far. This is my series file:
> 
> uio.patch
> fix-early-irq-problem-in-uio.patch
> uio-documentation.patch
> fix-uio_read-type-problem.patch
> uio-dummy.patch
> uio-hilscher-cif-card-driver.patch
> ioremap-in-uio-hilscher-cif.patch
> add-userspace-howto-to-uio-doc.patch
> fix-uio-doc-build-problems.patch
> 
> It should also work without uio-dummy.patch.
> I added Randy's last one-line patch to fix-uio-doc-build-problems.patch.
> 
> All patches are
> Signed-off-by: Hans J. Koch <[EMAIL PROTECTED]>
> 
> fix-uio-doc-build-problems.patch is also
> Signed-off-by: Randy Dunlap <[EMAIL PROTECTED]>
> 

I also updated the original patch set with all these changes. They can be 
found here:

http://www.osadl.org/projects/downloads/UIO/kernel/

Thanks,
Hans
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFT][PATCH] swsusp: Change code ordering related to ACPI

2007-05-04 Thread Rafael J. Wysocki
Hi,

The change of the hibernation/suspend code ordering made before 2.6.21 has
caused some systems to have problems related to ACPI.  In particular, the
'platform' hibernation mode doesn't work any more on some systems.

It has been confirmed that the appended patch fixes the problem, but it's not
certain that these changes don't break some other systems.  For this reason, all
users of hibernation (swsusp, uswsusp) are gently requested to verify that this
patch doesn't break their systems.

Greetings,
Rafael

---
From: Rafael J. Wysocki <[EMAIL PROTECTED]>

The current code ordering in the hibernation code paths causes some systems to
have problems.  It has been confirmed by testers that these problems do not
appear if the acpi_pm_finish() function (called via platform_finish()) is
executed after device_suspend().

Signed-off-by: Rafael J. Wysocki <[EMAIL PROTECTED]>
---
 kernel/power/disk.c |4 ++--
 kernel/power/user.c |8 
 2 files changed, 6 insertions(+), 6 deletions(-)

Index: linux-2.6.21/kernel/power/disk.c
===
--- linux-2.6.21.orig/kernel/power/disk.c   2007-05-04 10:31:51.0 
+0200
+++ linux-2.6.21/kernel/power/disk.c2007-05-04 10:32:12.0 +0200
@@ -195,9 +195,9 @@ int hibernate(void)
 
if (in_suspend) {
enable_nonboot_cpus();
-   platform_finish();
device_resume();
resume_console();
+   platform_finish();
pr_debug("PM: writing image.\n");
error = swsusp_write();
if (!error)
@@ -214,9 +214,9 @@ int hibernate(void)
  Enable_cpus:
enable_nonboot_cpus();
  Resume_devices:
-   platform_finish();
device_resume();
resume_console();
+   platform_finish();
  Thaw:
mutex_unlock(&pm_mutex);
unprepare_processes();
Index: linux-2.6.21/kernel/power/user.c
===
--- linux-2.6.21.orig/kernel/power/user.c   2007-05-04 10:31:51.0 
+0200
+++ linux-2.6.21/kernel/power/user.c2007-05-04 10:32:12.0 +0200
@@ -169,11 +169,11 @@ static inline int snapshot_suspend(int p
}
enable_nonboot_cpus();
  Resume_devices:
+   device_resume();
+   resume_console();
if (platform_suspend)
platform_finish();
 
-   device_resume();
-   resume_console();
  Finish:
mutex_unlock(&pm_mutex);
return error;
@@ -201,11 +201,11 @@ static inline int snapshot_restore(int p
 
enable_nonboot_cpus();
  Resume_devices:
+   device_resume();
+   resume_console();
if (platform_suspend)
platform_finish();
 
-   device_resume();
-   resume_console();
  Finish:
pm_restore_console();
mutex_unlock(&pm_mutex);
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] lib/hexdump

2007-05-04 Thread Pekka Enberg

Hi Randy,

On 5/4/07, Randy Dunlap <[EMAIL PROTECTED]> wrote:

+extern void hex_dumper(void *buf, size_t len, char *linebuf, size_t 
linebuflen);


Please do s/hex_dumper/hex_dump_to_buffer/ for consistency with print_hex_dump.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] [PATCH] DRM TTM Memory Manager patch

2007-05-04 Thread Jerome Glisse

On 5/4/07, Jerome Glisse <[EMAIL PROTECTED]> wrote:

On 5/4/07, Thomas Hellström <[EMAIL PROTECTED]> wrote:
> Keith Packard wrote:
> > On Thu, 2007-05-03 at 01:01 +0200, Thomas Hellström wrote:
> >
> >
> >> It might be possible to find schemes that work around this. One way
> >> could possibly be to have a buffer mapping -and validate order for
> >> shared buffers.
> >>
> >
> > If mapping never blocks on anything other than the fence, then there
> > isn't any dead lock possibility. What this says is that ordering of
> > rendering between clients is *not DRMs problem*. I think that's a good
> > solution though; I want to let multiple apps work on DRM-able memory
> > with their own CPU without contention.
> >
> > I don't recall if Eric layed out the proposed rules, but:
> >
> >  1) Map never blocks on map. Clients interested in dealing with this
> > are on their own.
> >
> >  2) Submit blocks on map. You must unmap all buffers before submitting
> > them. Doing the relocations in the kernel makes this all possible.
> >
> >  3) Map blocks on the fence from submit. We can play with pending the
> > flush until the app asks for the buffer back, or we can play with
> > figuring out when flushes are useful automatically. Doesn't matter
> > if the policy is in the kernel.
> >
> > I'm interested in making deadlock avoidence trivial and eliminating any
> > map-map contention.
> >
> >
> It's rare to have two clients access the same buffer at the same time.
> In what situation will this occur?
>
> If we think of map / unmap and validation / fence  as taking a buffer
> mutex either for the CPU or for the GPU, that's the way implementation
> is done today. The CPU side of the mutex should IIRC be per-client
> recursive. OTOH, the TTM implementation won't stop the CPU from
> accessing the buffer when it is unmapped, but then you're on your own.
> "Mutexes" need to be taken in the correct order, otherwise a deadlock
> will occur, and GL will, as outlined in Eric's illustration, more or
> less encourage us to take buffers in the "incorrect" order.
>
> In essence what you propose is to eliminate the deadlock problem by just
> avoiding taking the buffer mutex unless we know the GPU has it. I see
> two problems with this:
>
> * It will encourage different DRI clients to simultaneously access
>   the same buffer.
> * Inter-client and GPU data coherence can be guaranteed if we issue
>   a mb() / write-combining flush with the unmap operation (which,
>   BTW, I'm not sure is done today). Otherwise it is up to the
>   clients, and very easy to forget.
>
> I'm a bit afraid we might in the future regret taking the easy way out?
>
> OTOH, letting DRM resolve the deadlock by unmapping and remapping shared
> buffers in the correct order might not be the best one either. It will
> certainly mean some CPU overhead and what if we have to do the same with
> buffer validation? (Yes for some operations with thousands and thousands
> of relocations, the user space validation might need to stay).
>
> Personally, I'm slightly biased towards having DRM resolve the deadlock,
> but I think any solution will do as long as the implications and why we
> choose a certain solution are totally clear.
>
> For item 3) above the kernel must have a way to issue a flush when
> needed for buffer eviction.
> The current implementation also requires the buffer to be completely
> flushed before mapping.
> Other than that the flushing policy is currently completely up to the
> DRM drivers.
>
> /Thomas

I might say stupid things as I don't think I fully understand all
the input to this problem. Anyway, here are my thoughts on all this:

1) First, a client map never blocks (as in Keith's layout) except on a
fence from the DRM side (point 3 in Keith's layout)
2) Clients should always unmap buffers before submitting (as in Keith's layout)
3) On the DRM side you always acquire buffer locks in a given order; for
instance each buffer gets an id and you lock from the smaller id to the
bigger one (with a clever implementation the cost of that will be small --
see the sketch below)
4) We have two GPU queues:
 - a pending queue of app requests, in which we do all the work that needs
   to be done before submitting (locking buffers, validation, ...);
   for instance we might wait here for each buffer that is still
   mapped by some other app in user space
 - a run queue, to which we add each app request that is now
   ready to be submitted to the GPU

Of course in this scheme we keep the fencing stuff so user space can
know when it is safe to use a previously submitted buffer again. The outcome
of having two separate queues in DRM is that if two apps lock up each other,
other apps can still use the GPU, so only the apps fighting for a buffer will
suffer.
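
A minimal sketch of the id-ordered locking from point 3 (the structure and
helpers below are made up for illustration, this is not TTM code):

#include <linux/mutex.h>
#include <linux/sort.h>

struct bo_sketch {
        unsigned int id;        /* globally unique, assigned at creation */
        struct mutex lock;
};

static int bo_id_cmp(const void *a, const void *b)
{
        const struct bo_sketch *x = *(const struct bo_sketch **)a;
        const struct bo_sketch *y = *(const struct bo_sketch **)b;

        return x->id < y->id ? -1 : (x->id > y->id ? 1 : 0);
}

static void lock_buffers_in_order(struct bo_sketch **bos, int n)
{
        int i;

        /* sort by id, then lock smallest to largest: one global lock
         * order means two submissions can never deadlock on each other */
        sort(bos, n, sizeof(*bos), bo_id_cmp, NULL);
        for (i = 0; i < n; i++)
                mutex_lock(&bos[i]->lock);
}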

And for user-space synchronization I believe it's a user-space problem, i.e.
it's up to user space to add proper synchronization. For instance, as map doesn't
block for any client in user space t

Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-04 Thread Nick Piggin

Nick Piggin wrote:

Nick Piggin wrote:


Christoph Hellwig wrote:




Is that every fork/exec or just under certain cicumstances?
A 5% regression on every fork/exec is not acceptable.




Well after patch2, G5 fork is 3% and exec is 1%, I'd say the P4
numbers will be improved as well with that patch. Then if we have
specific lock/unlock bitops, I hope it should reduce that further.



OK, with the races and missing barriers fixed from the previous patch,
plus the attached one added (+patch3), numbers are better again (I'm not
sure if I have the ppc barriers correct though).

These ops could also be put to use in bit spinlocks, buffer lock, and
probably a few other places too.

         fault       fork          exec
2.6.21   1.49-1.51   164.6-170.8   741.8-760.3
+patch   1.71-1.73   175.2-180.8   780.5-794.2
+patch2  1.61-1.63   169.8-175.0   748.6-757.0
+patch3  1.54-1.57   165.6-170.9   748.5-757.5

So fault performance goes to under 5%, fork is in the noise, exec is
still up 1%, but maybe that's noise or cache effects again.


OK, with my new lock/unlock_page, dd if=large (bigger than RAM) sparse
file of=/dev/null with an experimentally optimal block size (32K) goes
from 626MB/s to 683MB/s on 2 CPU G5 booted with maxcpus=1.

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ext3 vs NTFS performance

2007-05-04 Thread Christoph Hellwig
On Fri, May 04, 2007 at 09:12:31AM +0100, Anton Altaparmakov wrote:
> Nothing to do with win32 functions.  Windows does NOT create sparse  
> files therefore it never can have an issue like ext3 does in this  
> scenario.  Windows will cause nice allocations to happen because of  
> this and the 1-byte writes are perfectly sensible in this regard.   
> (Although a little odd as Windows has a proper API for doing  
> preallocation so I don't get why it is not using that instead...)

Which means the right place to fix this is samba.  Samba just needs
to intercept lseek and pread/pwrite to never allocate sparse files
but do the right thing instead.  The right thing would probably
be a preallocation instead of writing zeroes, and we need to provide the
infrastructure for them to do it, which is in progress currently.
(And in fact samba already does the right thing for XFS if you use
the prealloc samba vfs module, which AFAIK is not the default)
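
For illustration only, the userspace side could look roughly like the sketch
below, with posix_fallocate() standing in for whatever interface samba ends up
using (the actual samba hook and the in-progress kernel infrastructure may
well differ):

#include <fcntl.h>
#include <errno.h>

/* Extend a file to new_size without creating a sparse hole, instead of
 * seeking to the end and writing a single zero byte. */
static int extend_without_hole(int fd, off_t new_size)
{
        int err = posix_fallocate(fd, 0, new_size);

        if (err) {              /* returns an errno value, not -1 */
                errno = err;
                return -1;
        }
        return 0;
}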

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: build system: no module target ending with slash?

2007-05-04 Thread Sam Ravnborg
On Fri, May 04, 2007 at 10:56:14AM +0200, Christian Hesse wrote:
> Agreed that it is not really needed. But if you don't build it you should not 
> try to link it later...
> 
> > The quick-and-dirty workaround is to add a single
> > obj-n := xx
> > in mac80211/Makefile and kbuild is happy again.
> >
> > I could teach kbuild to create built-in.o also in the case
> > where we refer to a subdirectory only. But then we would end up with a
> built-in.o in all directories where we have a kbuild Makefile (almost) and
> that is not desirable.
> 
> I would prefer to teach it not to link object files that are not built.

But you already _told_ kbuild that mac80211/ would contain a built-in.o
using following statement in drivers/net/wireless/Makefile:
obj-y += mac80211/

Changing this to obj-$(CONFIG_IWLWIFI) += mac80211/ would
give kbuild the _correct_ info.

obj-m += dir/
tells kbuild this directory contains a module, so it will descend and build.
obj-y += dir/
tells kbuild that directory contains stuff to be built-in and _maybe_ a module, so
it will descend and build.

Note that when kbuild has entered a subdirectory it has lost knowledge of _how_
it came there, so if you have lied to kbuild it will not detect it.
So in your case you told kbuild that there is stuff to be built-in
in the mac80211/ dir, which was incorrect.

Sam
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/6] firewire: char device interface

2007-05-04 Thread Christoph Hellwig
On Wed, May 02, 2007 at 05:11:45PM -0400, Kristian Høgsberg wrote:
> The firewire-cdev.h file is meant to be a self-contained userspace header 
> file and shouldn't include other kernel header files.  All duplicated 
> values are standardized ieee1394 values and won't ever change.  I should 
> put a #ifndef __FW_COMMON_DEFINES protection around the duplicate values, I 
> guess, but I'm just wondering why I never saw a "symbol redefined" 
> warning...

No, defining things in two places is not okay.  Just add a new header
that defines these protocol constants, which needs to be included by
userspace that wants to use them.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 5/6] firewire: SBP-2 highlevel driver

2007-05-04 Thread Christoph Hellwig
On Wed, May 02, 2007 at 11:53:30PM +0200, Stefan Richter wrote:
> > Please set the max_sectors value in your host template so that the
> > block layer doesn't build sg entries too big for you.
> 
> Hmm, what about this:
> 
> James Bottomley wrote on 2007-01-15:
> | Actually, there's one unfortunate case where Linux won't respect this:
> | an IOMMU that can do virtual merging.  This parameter is a block queue
> | parameter, so block will happily make sure the request segments obey it.
> | However, when you get to dma_map_rq() it doesn't see the segment limits,
> | so, if the iommu merges, you can end up with SG elements the other side
> | that violate this.  I've been meaning to do something about this for
> | ages (IDE is the other subsystem that has an absolute requirement for a
> | fixed maximum segment size) but never found an excuse to fix it.
> 
> http://marc.info/?l=linux-scsi&m=116889641203397

Hmm, okay.  Probably wants a comment in there explaining it, and we
should poke jejb to fix it for real :)

> >> +static int add_scsi_devices(struct fw_unit *unit)
> >> +{
> >> +  struct sbp2_device *sd = unit->device.driver_data;
> >> +  int retval, lun;
> >> +
> >> +  if (sd->scsi_host != NULL)
> >> +  return 0;
> >> +
> >> +  sd->scsi_host = scsi_host_alloc(&scsi_driver_template,
> >> +  sizeof(unsigned long));
> >> +  if (sd->scsi_host == NULL) {
> >> +  fw_error("failed to register scsi host\n");
> >> +  return -1;
> >> +  }
> >> +
> >> +  sd->scsi_host->hostdata[0] = (unsigned long)unit;
> > 
> > Please take a look at the other scsi drivers to see how this is supposed
> > to be used.
> 
> Do you mean the one Scsi_Host per LU?  If it is that, then it was just
> taken over from drivers/ieee1394/sbp2.c.  Sbp2 is doing this still today
> mostly for historical reasons; I just didn't find the time yet to try to
> get to a leaner scheme.

No, sorry.  I should have written a better explanation.  scsi_host_alloc
is designed to allocate space for your private data as well.  So you
should call it early on and allocate the sbp2_device as part of the Scsi_Host
instead of just stuffing in a pointer.
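
I.e. something along these lines (just a sketch -- the sbp2_device member
names and error handling are placeholders, not taken from the patch):

static int add_scsi_host_sketch(struct fw_unit *unit)
{
        struct Scsi_Host *shost;
        struct sbp2_device *sd;

        /* let scsi_host_alloc() carve out the driver-private area ... */
        shost = scsi_host_alloc(&scsi_driver_template, sizeof(*sd));
        if (!shost)
                return -ENOMEM;

        /* ... and use it directly instead of stashing a pointer */
        sd = (struct sbp2_device *)shost->hostdata;
        sd->unit = unit;                /* placeholder member */

        /* ... further setup, then scsi_add_host(shost, parent_device) ... */
        return 0;
}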

> > Do we really need another scanning algorithm?
> 
> Yes.
> 
> > Can't you use scsi_scan_target instead and let the core scsi code
> > handle the scanning?
> 
> No.  The discovery of LUs of SBP-2 targets happens on the IEEE 1212
> level of things.  The initiator has to parse the configuration ROM of
> the target FireWire node; the ROM has entries for each LU.  (After that,
> SBP-2 login protocol commences for each LU, and only after that can SCSI
> requests be issued.  There is nothing SCSIish going on before that.)
> 
> What's missing as a /* FIXME */ here is actually implemented in the
> mainline sbp2.c and needs to be brought over here; converted to the new
> FireWire core APIs.

Okay, so sbp2 decided to be non-standard here, what a pity.  It's probably
better to use scsi_scan_target with a specific lun, though, as scsi_add_device
is a rather awkward API.

> > This function seems rather oddly named.  And the checking and
> > setting of scsi_host looks like you have some lifetime rule
> > problems.
> > 
> 
> The NULL probably has to do with the ability to call remove_scsi_devices
> in different paths.  (These paths are not concurrent.)

Needs documentation at least.  And at least my preference would be
to have a deleted flag instead of the null setting because the latter
can easily paper over bugs.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ACPI problem revealed

2007-05-04 Thread Pavel Machek
Hi!

> Bill Gates once said:
> 
> http://antitrust.slated.org/www.iowaconsumercase.org/011607/3000/PX03020.pdf

Hmm, that's really pretty nasty. I guess they succeeded even w/o patents.
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/7] [RFC] Battery monitoring class

2007-05-04 Thread Pavel Machek
Hi!

> > I'll convert mXh to uXh a bit later, if there will no further objections
> > against uXh. Also I'd like to hear if there any objections on
> > mA/mV -> uA/uV conversion. I think we'd better keep all units at the
> > same order/precision.
> 
> Okay, would it make sense to use "long" instead of "int" after "milli" to
> "micro" conversion? On 32 bit machines int gives +-2147483648 limit. So
> 2147 volts/amperes/...

long == int on 32bit machines.

> Though 2147 amperes is unrealistic for batteries, but if used in
> calculations it could be dangerous.

Let the one doing calculations handle that ;-).
Pavel
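
For illustration, the overflow such calculations have to avoid (a sketch, not
code from the patch set): 12 V * 2 A expressed in micro-units already blows
past a 32-bit int, so the math has to be widened first.

#include <stdint.h>

static int64_t power_uW(int32_t voltage_uV, int32_t current_uA)
{
        /* 12000000 * 2000000 = 2.4e13, far beyond INT32_MAX (~2.1e9):
         * widen before multiplying, then scale back to micro-units. */
        return (int64_t)voltage_uV * current_uA / 1000000;
}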

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ExpressCard hotswap support?

2007-05-04 Thread Daniel J Blueman

On 4 May, 01:20, Chris Adams <[EMAIL PROTECTED]> wrote:

I've got a Thinkpad Z60m with an ExpressCard slot, and I got a Belkin
F5U250 GigE ExpressCard (Marvell 88E8053 chip using sky2 driver).  It
appears that Linux only recognizes it if I insert the card with the
system powered off.  If I hot-insert the card, nothing happens (no
messages logged, no PCI device shows up, nothing).


The BIOS initialises and powers up the downstream PCI express port
when it detects a card is present.

When Linux boots, it enumerates the bus and sees the card, but I believe it
does not do the prior configuration to enable, configure and trigger link
negotiation on all PCI Express ports; this requires chipset-specific (and
sometimes revision-specific) code, which wouldn't be as robust as the BIOS
doing the footwork.

I have the same problem on my Sony VGN-SZ240. The problem would be
addressed if the BIOS powered up all PCI Express ports, but then this
would draw more power. Perhaps it just needs to do the hotplug
configuration correctly, but then older OSes may have problems?


Does Linux support hotswapping ExpressCards?

This is with Fedora Core 6 with all updates, kernel 2.6.20-1.2948.fc6.

--
Daniel J Blueman
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] [Patch 2/3] readahead statistics slimmed down, adapt zfcp

2007-05-04 Thread Swen Schillig
ACK 

On Thursday 03 May 2007 19:56, Martin Peschke wrote:
> This patch adapts zfcp to the counter changes in lib/statistics.c.
> 
> Signed-off-by: Martin Peschke <[EMAIL PROTECTED]>
> ---
> 
>  zfcp_ccw.c |4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> Index: linux/drivers/s390/scsi/zfcp_ccw.c
> ===
> --- linux.orig/drivers/s390/scsi/zfcp_ccw.c
> +++ linux/drivers/s390/scsi/zfcp_ccw.c
> @@ -137,7 +137,7 @@ static struct statistic_info zfcp_statin
>   .name = "qdio_outb_full",
>   .x_unit   = "sbals_left",
>   .y_unit   = "",
> - .defaults = "type=counter_inc"
> + .defaults = "type=counter"
>   },
>   [ZFCP_STAT_A_QO] = {
>   .name = "qdio_outb",
> @@ -155,7 +155,7 @@ static struct statistic_info zfcp_statin
>   .name = "erp",
>   .x_unit   = "",
>   .y_unit   = "",
> - .defaults = "type=counter_inc"
> + .defaults = "type=counter"
>   }
>  };
> 
> 
> 
> 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC/PATCH] ext3: remove inode constructor

2007-05-04 Thread Pekka J Enberg
From: Pekka Enberg <[EMAIL PROTECTED]>

As explained by Christoph Lameter, ext3_alloc_inode() touches the same
cache line as init_once() so we gain nothing from using slab
constructors.  The SLUB allocator will be more effective without it
(free pointer can be placed inside the free'd object), so move inode
initialization to ext3_alloc_inode completely.

[postmark: numbers = 1, transactions = 1, 2 GHz Pentium M]

2.6.21 vanilla:

  real  0m19.006s
  user  0m0.144s
  sys   0m7.424s

  real  0m9.040s
  user  0m0.156s
  sys   0m5.164s

  real  0m8.939s
  user  0m0.128s
  sys   0m5.180s

2.6.21 + ext3-remove-inode-constructor:

  real  0m19.176s
  user  0m0.176s
  sys   0m7.436s

  real  0m9.030s
  user  0m0.172s
  sys   0m5.120s

  real  0m8.966s
  user  0m0.168s
  sys   0m5.132s

Cc: Stephen C. Tweedie <[EMAIL PROTECTED]>
Cc: Andrew Morton <[EMAIL PROTECTED]>
Cc: Andreas Dilger <[EMAIL PROTECTED]>
Cc: Christoph Lameter <[EMAIL PROTECTED]>
Signed-off-by: Pekka Enberg <[EMAIL PROTECTED]>
---
 fs/ext3/super.c |   30 --
 1 file changed, 12 insertions(+), 18 deletions(-)

Index: 2.6/fs/ext3/super.c
===
--- 2.6.orig/fs/ext3/super.c2007-05-04 12:57:09.0 +0300
+++ 2.6/fs/ext3/super.c 2007-05-04 13:01:27.0 +0300
@@ -444,17 +444,26 @@ static struct kmem_cache *ext3_inode_cac
 static struct inode *ext3_alloc_inode(struct super_block *sb)
 {
struct ext3_inode_info *ei;
+   struct inode *inode;
 
ei = kmem_cache_alloc(ext3_inode_cachep, GFP_NOFS);
if (!ei)
return NULL;
+INIT_LIST_HEAD(&ei->i_orphan);
+#ifdef CONFIG_EXT3_FS_XATTR
+init_rwsem(&ei->xattr_sem);
+#endif
+mutex_init(&ei->truncate_mutex);
 #ifdef CONFIG_EXT3_FS_POSIX_ACL
ei->i_acl = EXT3_ACL_NOT_CACHED;
ei->i_default_acl = EXT3_ACL_NOT_CACHED;
 #endif
ei->i_block_alloc_info = NULL;
-   ei->vfs_inode.i_version = 1;
-   return &ei->vfs_inode;
+
+   inode = &ei->vfs_inode;
+   inode_init_once(inode);
+   inode->i_version = 1;
+   return inode;
 }
 
 static void ext3_destroy_inode(struct inode *inode)
@@ -462,28 +471,13 @@ static void ext3_destroy_inode(struct in
kmem_cache_free(ext3_inode_cachep, EXT3_I(inode));
 }
 
-static void init_once(void * foo, struct kmem_cache * cachep, unsigned long 
flags)
-{
-   struct ext3_inode_info *ei = (struct ext3_inode_info *) foo;
-
-   if ((flags & (SLAB_CTOR_VERIFY|SLAB_CTOR_CONSTRUCTOR)) ==
-   SLAB_CTOR_CONSTRUCTOR) {
-   INIT_LIST_HEAD(&ei->i_orphan);
-#ifdef CONFIG_EXT3_FS_XATTR
-   init_rwsem(&ei->xattr_sem);
-#endif
-   mutex_init(&ei->truncate_mutex);
-   inode_init_once(&ei->vfs_inode);
-   }
-}
-
 static int init_inodecache(void)
 {
ext3_inode_cachep = kmem_cache_create("ext3_inode_cache",
 sizeof(struct ext3_inode_info),
 0, (SLAB_RECLAIM_ACCOUNT|
SLAB_MEM_SPREAD),
-init_once, NULL);
+NULL, NULL);
if (ext3_inode_cachep == NULL)
return -ENOMEM;
return 0;
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


cpufreq longhaul locks up

2007-05-04 Thread Jan Engelhardt
Hi,


I found that setting the cpufreq governor to ondemand makes the box 
lock up solid in 2.6.20.2 and 2.6.21 after a few seconds. Sysrq 
does not work anymore, and the last messages are:

May  3 19:16:58 cn kernel: longhaul: VIA C3 'Nehemiah C' [C5P] CPU 
detected.  Powersaver supported.
May  3 19:16:58 cn kernel: longhaul: Using northbridge support.
May  3 19:17:22 cn kernel: Time: acpi_pm clocksource has been installed.
May  3 19:17:22 cn kernel: Clocksource tsc unstable (delta = -136422685 
ns)

I have also tried 2.6.20.2 in single-user mode (so that I can have 
the disk read-only), and it takes a little longer (magnitude: minutes) 
to lock up; I'm not sure if it's 20.2 or the single-user mode, but I suspect 
the latter since nothing is running then that could potentially 
contribute to quickly changing workloads/frequencies.

If you need more info, please let me know.


Thanks,
Jan
-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 04/33] m68k: Atari keyboard and mouse support.

2007-05-04 Thread Michael Schmitz
> > > > +   // need to init core driver if not already done so
> > > > +   if (atari_keyb_init())
> > >
> > > Memory leak
> >
> > How so? If the core has been initialized already this will just return ...
> >
>
> You just allocated atakbd_dev. If atari_keyb_init() fails you leak it.

I see. It probably won't happen, but there's no reason to take the risk.
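
Something like the sketch below would plug it (assuming atakbd_dev comes from
input_allocate_device(), as is usual for input drivers):

#include <linux/input.h>

static struct input_dev *atakbd_dev;

static int __init atakbd_init_sketch(void)
{
        atakbd_dev = input_allocate_device();
        if (!atakbd_dev)
                return -ENOMEM;

        if (atari_keyb_init()) {                /* core driver init failed */
                input_free_device(atakbd_dev);  /* don't leak the device */
                return -ENODEV;
        }

        /* ... set up atakbd_dev and input_register_device() it ... */
        return 0;
}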

> > > It looks like this driver is not using standard input event codes. If
> > > Roman does not want to adjust keymaps on Amiga and Atari that should
> > > be handled in legacy keyboard driver (drivers/char/keyboard.c). As it
> > > is programs using /dev/input/eventX have no chance of working.
> >
> > The translation map should not have been overwritten like above, is that
> > what you mean?
> > My original patch didn't have that bit; scancodes were translated to
> > input keycodes using atakbd_keycode[scancode] instead. I'll have that
> > reverted...
> >
>
> Does KEY_1 actually map to scancode 2 on atari?

It sure does. For most of the keys, the identity mapping is in fact OK.
There are a lot of keys where this breaks, though. Plus the notorious
scancode 0 :-)

Michael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 02/40] mm: slab allocation fairness

2007-05-04 Thread Peter Zijlstra
The slab allocator has some unfairness wrt gfp flags; when the slab cache is
grown, the gfp flags are used to allocate more memory. However, when there is
slab cache available (in partial or free slabs, per-cpu caches or otherwise),
the gfp flags are ignored.

Thus it is possible for less critical slab allocations to succeed and gobble
up precious memory when under memory pressure.

This patch solves that by using the newly introduced page allocation rank.

Page allocation rank is a scalar quantity connecting ALLOC_ and gfp flags which
represents how deep we had to reach into our reserves when allocating a page. 
Rank 0 is the deepest we can reach (ALLOC_NO_WATERMARK) and 16 is the most 
shallow allocation possible (ALLOC_WMARK_HIGH).

When the slab space is grown the rank of the page allocation is stored. For
each slab allocation we test the given gfp flags against this rank. Thereby
asking the question: would these flags have allowed the slab to grow.

If not so, we need to test the current situation. This is done by forcing the
growth of the slab space. (Just testing the free page limits will not work due
to direct reclaim) Failing this we need to fail the slab allocation.

Thus if we grew the slab under great duress while PF_MEMALLOC was set and we 
really did access the memalloc reserve the rank would be set to 0. If the next
allocation to that slab would be GFP_NOFS|__GFP_NOMEMALLOC (which ordinarily
maps to rank 4 and always > 0) we'd want to make sure that memory pressure has
decreased enough to allow an allocation with the given gfp flags.

So in this case we try to force grow the slab cache and on failure we fail the
slab allocation. Thus preserving the available slab cache for more pressing
allocations.

If this newly allocated slab will be trimmed on the next kmem_cache_free
(not unlikely) this is no problem, since 1) it will free memory and 2) the
sole purpose of the allocation was to probe the allocation rank, we didn't
need the space itself.
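
For illustration, the per-allocation check boils down to something like the
sketch below (the helper names are placeholders, not the functions the patch
actually adds to mm/slab.c):

/* Sketch only: gfp_to_rank(), force_cache_grow() and fast_path_alloc()
 * stand in for the real helpers. */
static void *fair_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
{
        int request_rank = gfp_to_rank(flags);  /* 0 = deepest reserves */

        /*
         * The cached objects were obtained at cachep->rank.  If this
         * request would not have been allowed to reach that deep, re-probe
         * the page allocator instead of handing out cached objects.
         */
        if (request_rank > cachep->rank) {
                if (!force_cache_grow(cachep, flags))
                        return NULL;            /* fail the allocation */
                /* success updates cachep->rank for the new situation */
        }
        return fast_path_alloc(cachep, flags);
}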

[AIM9 results go here]

 AIM9 test         2.6.21-rc5            2.6.21-rc5-slab1       diff
                                         CONFIG_SLAB_FAIR=y

54 tcp_test         2124.48 +/-  10.85    2137.43 +/-   9.22     12.95
55 udp_test         5204.43 +/-  45.13    5231.59 +/-  56.66     27.16
56 fifo_test       20991.42 +/-  46.71   19675.97 +/-  56.35   1315.44
57 stream_pipe     10024.16 +/- 119.88    9912.53 +/-  75.52    111.63
58 dgram_pipe       9460.18 +/- 119.50    9502.75 +/-  89.06     42.57
59 pipe_cpy        30719.81 +/- 117.01   27885.52 +/-  46.81   2834.28

                                         2.6.21-rc5-slab2       diff
                                         CONFIG_SLAB_FAIR=n

54 tcp_test         2124.48 +/-  10.85    2137.97 +/-  12.85     13.50
55 udp_test         5204.43 +/-  45.13    5268.21 +/-  83.38     63.78
56 fifo_test       20991.42 +/-  46.71   19394.42 +/-  65.15   1596.99
57 stream_pipe     10024.16 +/- 119.88   10042.49 +/- 132.13     18.33
58 dgram_pipe       9460.18 +/- 119.50    9575.97 +/- 111.86    115.80
59 pipe_cpy        30719.81 +/- 117.01   27226.52 +/- 120.15   3493.28

Given that the CONFIG_SLAB_FAIR=n numbers are worse than =y, I'm not sure
how to interpret these numbers.

Will work on getting =n equal. Also, will work on a SLUB version of
these patches.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 mm/Kconfig |3 ++
 mm/slab.c  |   81 -
 2 files changed, 57 insertions(+), 27 deletions(-)

Index: linux-2.6-git/mm/slab.c
===
--- linux-2.6-git.orig/mm/slab.c2007-03-26 13:34:55.0 +0200
+++ linux-2.6-git/mm/slab.c 2007-03-26 14:18:59.0 +0200
@@ -114,6 +114,7 @@
 #include   
 #include   
 #include   
+#include   "internal.h"
 
 /*
  * DEBUG   - 1 for kmem_cache_create() to honour; SLAB_DEBUG_INITIAL,
@@ -380,6 +381,7 @@ static void kmem_list3_init(struct kmem_
 
 struct kmem_cache {
 /* 1) per-cpu data, touched during every alloc/free */
+   int rank;
struct array_cache *array[NR_CPUS];
 /* 2) Cache tunables. Protected by cache_chain_mutex */
unsigned int batchcount;
@@ -1023,21 +1025,21 @@ static inline int cache_free_alien(struc
 }
 
 static inline void *alternate_node_alloc(struct kmem_cache *cachep,
-   gfp_t flags)
+   gfp_t flags, int rank)
 {
return NULL;
 }
 
 static inline void *cache_alloc_node(struct kmem_cache *cachep,
-gfp_t flags, int nodeid)
+gfp_t flags, int nodeid, int rank)
 {
return NULL;
 }
 
 #else  /* CONFIG_NUMA */
 
-static void *cache_alloc_node(struct kmem_cache *, gfp_t, int);
-static void *alternate_node_alloc(struct kmem_cache *, gfp_t);
+static void *cache_alloc_node(struct kmem_cache *, gfp_t, int, int);
+static void *alternate_node_alloc(struct kmem_ca

[PATCH 07/40] mm: allow mempool to fall back to memalloc reserves

2007-05-04 Thread Peter Zijlstra
Allow the mempool to use the memalloc reserves when all else fails and
the allocation context would otherwise allow it.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 mm/mempool.c |   10 ++
 1 file changed, 10 insertions(+)

Index: linux-2.6-git/mm/mempool.c
===
--- linux-2.6-git.orig/mm/mempool.c 2007-01-12 08:03:44.0 +0100
+++ linux-2.6-git/mm/mempool.c  2007-01-12 10:38:57.0 +0100
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include "internal.h"
 
 static void add_element(mempool_t *pool, void *element)
 {
@@ -229,6 +230,15 @@ repeat_alloc:
}
spin_unlock_irqrestore(&pool->lock, flags);
 
+   /* if we really had right to the emergency reserves try those */
+   if (gfp_to_alloc_flags(gfp_mask) & ALLOC_NO_WATERMARKS) {
+   if (gfp_temp & __GFP_NOMEMALLOC) {
+   gfp_temp &= ~(__GFP_NOMEMALLOC|__GFP_NOWARN);
+   goto repeat_alloc;
+   } else
+   gfp_temp |= __GFP_NOMEMALLOC|__GFP_NOWARN;
+   }
+
/* We must not sleep in the GFP_ATOMIC case */
if (!(gfp_mask & __GFP_WAIT))
return NULL;

--

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 15/40] netvm: INET reserves.

2007-05-04 Thread Peter Zijlstra
Add reserves for INET.

The two big users seem to be the route cache and ip-fragment cache.

Account the route cache to the auxillary reserve.
Account the fragments to the skb reserve so that one can at least
overflow the fragment cache (avoids fragment deadlocks).

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 net/ipv4/ip_fragment.c |1 +
 net/ipv4/route.c   |   19 ++-
 net/ipv4/sysctl_net_ipv4.c |   14 +-
 net/ipv6/reassembly.c  |1 +
 net/ipv6/route.c   |   19 ++-
 net/ipv6/sysctl_net_ipv6.c |   13 -
 6 files changed, 63 insertions(+), 4 deletions(-)

Index: linux-2.6-git/net/ipv4/sysctl_net_ipv4.c
===
--- linux-2.6-git.orig/net/ipv4/sysctl_net_ipv4.c   2007-03-26 
12:01:01.0 +0200
+++ linux-2.6-git/net/ipv4/sysctl_net_ipv4.c2007-03-26 12:37:19.0 
+0200
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /* From af_inet.c */
 extern int sysctl_ip_nonlocal_bind;
@@ -186,6 +187,17 @@ static int strategy_allowed_congestion_c
 
 }
 
+static int proc_dointvec_fragment(ctl_table *table, int write, struct file 
*filp,
+void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+   int ret;
+   int old_thresh = *(int *)table->data;
+   ret = proc_dointvec(table,write,filp,buffer,lenp,ppos);
+   if (write)
+   skb_reserve_memory(*(int *)table->data - old_thresh);
+   return ret;
+}
+
 ctl_table ipv4_table[] = {
{
.ctl_name   = NET_IPV4_TCP_TIMESTAMPS,
@@ -291,7 +303,7 @@ ctl_table ipv4_table[] = {
.data   = &sysctl_ipfrag_high_thresh,
.maxlen = sizeof(int),
.mode   = 0644,
-   .proc_handler   = &proc_dointvec
+   .proc_handler   = &proc_dointvec_fragment
},
{
.ctl_name   = NET_IPV4_IPFRAG_LOW_THRESH,
Index: linux-2.6-git/net/ipv6/sysctl_net_ipv6.c
===
--- linux-2.6-git.orig/net/ipv6/sysctl_net_ipv6.c   2007-03-26 
12:01:01.0 +0200
+++ linux-2.6-git/net/ipv6/sysctl_net_ipv6.c2007-03-26 12:37:52.0 
+0200
@@ -15,6 +15,17 @@
 
 #ifdef CONFIG_SYSCTL
 
+static int proc_dointvec_fragment(ctl_table *table, int write, struct file 
*filp,
+void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+   int ret;
+   int old_thresh = *(int *)table->data;
+   ret = proc_dointvec(table,write,filp,buffer,lenp,ppos);
+   if (write)
+   skb_reserve_memory(*(int *)table->data - old_thresh);
+   return ret;
+}
+
 static ctl_table ipv6_table[] = {
{
.ctl_name   = NET_IPV6_ROUTE,
@@ -44,7 +55,7 @@ static ctl_table ipv6_table[] = {
.data   = &sysctl_ip6frag_high_thresh,
.maxlen = sizeof(int),
.mode   = 0644,
-   .proc_handler   = &proc_dointvec
+   .proc_handler   = &proc_dointvec_fragment
},
{
.ctl_name   = NET_IPV6_IP6FRAG_LOW_THRESH,
Index: linux-2.6-git/net/ipv4/ip_fragment.c
===
--- linux-2.6-git.orig/net/ipv4/ip_fragment.c   2007-03-26 12:01:01.0 
+0200
+++ linux-2.6-git/net/ipv4/ip_fragment.c2007-03-26 12:03:07.0 
+0200
@@ -743,6 +743,7 @@ void ipfrag_init(void)
ipfrag_secret_timer.function = ipfrag_secret_rebuild;
ipfrag_secret_timer.expires = jiffies + sysctl_ipfrag_secret_interval;
add_timer(&ipfrag_secret_timer);
+   skb_reserve_memory(sysctl_ipfrag_high_thresh);
 }
 
 EXPORT_SYMBOL(ip_defrag);
Index: linux-2.6-git/net/ipv6/reassembly.c
===
--- linux-2.6-git.orig/net/ipv6/reassembly.c2007-03-26 12:01:01.0 
+0200
+++ linux-2.6-git/net/ipv6/reassembly.c 2007-03-26 12:03:07.0 +0200
@@ -772,4 +772,5 @@ void __init ipv6_frag_init(void)
ip6_frag_secret_timer.function = ip6_frag_secret_rebuild;
ip6_frag_secret_timer.expires = jiffies + 
sysctl_ip6frag_secret_interval;
add_timer(&ip6_frag_secret_timer);
+   skb_reserve_memory(sysctl_ip6frag_high_thresh);
 }
Index: linux-2.6-git/net/ipv4/route.c
===
--- linux-2.6-git.orig/net/ipv4/route.c 2007-03-26 12:01:01.0 +0200
+++ linux-2.6-git/net/ipv4/route.c  2007-03-26 12:31:43.0 +0200
@@ -2884,6 +2884,21 @@ static int ipv4_sysctl_rtcache_flush_str
return 0;
 }
 
+static int proc_dointvec_rt_size(ctl_table *table, int write, struct file 
*filp,
+void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+   int ret;
+   int new_pages;
+   int old_pages = guess_kmem_cache

[PATCH 30/40] nfs: fixup missing error code

2007-05-04 Thread Peter Zijlstra
Commit 0b67130149b006628389ff3e8f46be9957af98aa lost the setting of tk_status
to -EIO when there was no progress with short reads.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 fs/nfs/read.c |4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Index: linux-2.6-git/fs/nfs/read.c
===
--- linux-2.6-git.orig/fs/nfs/read.c2007-03-13 14:35:53.0 +0100
+++ linux-2.6-git/fs/nfs/read.c 2007-03-13 14:36:05.0 +0100
@@ -384,8 +384,10 @@ static int nfs_readpage_retry(struct rpc
/* This is a short read! */
nfs_inc_stats(data->inode, NFSIOS_SHORTREAD);
/* Has the server at least made some progress? */
-   if (resp->count == 0)
+   if (resp->count == 0) {
+   task->tk_status = -EIO;
return 0;
+   }
 
/* Yes, so retry the read at the end of the data */
argp->offset += resp->count;

--

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 12/40] net: packet split receive api

2007-05-04 Thread Peter Zijlstra
Add some packet-split receive hooks.

For one, this allows doing NUMA-node-affine page allocs.  Later on these hooks
will be extended to do emergency reserve allocations for fragments.
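
For illustration, a driver's RX completion path would use the new hooks
roughly like the sketch below (the surrounding function is made up; only
netdev_alloc_page() and skb_add_rx_frag() come from this patch):

#include <linux/netdevice.h>
#include <linux/skbuff.h>

static int rx_attach_frag_sketch(struct net_device *dev, struct sk_buff *skb,
                                 unsigned int len)
{
        /* page is allocated node-local to the device */
        struct page *page = netdev_alloc_page(dev);

        if (!page)
                return -ENOMEM;

        /* attach the fragment; len, data_len and truesize are updated */
        skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, page, 0, len);
        return 0;
}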

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 drivers/net/e1000/e1000_main.c |8 ++--
 drivers/net/sky2.c |   16 ++--
 include/linux/skbuff.h |   23 +++
 net/core/skbuff.c  |   20 
 4 files changed, 51 insertions(+), 16 deletions(-)

Index: linux-2.6-git/drivers/net/e1000/e1000_main.c
===
--- linux-2.6-git.orig/drivers/net/e1000/e1000_main.c   2007-02-14 
08:31:12.0 +0100
+++ linux-2.6-git/drivers/net/e1000/e1000_main.c2007-02-14 
11:42:07.0 +0100
@@ -4412,12 +4412,8 @@ e1000_clean_rx_irq_ps(struct e1000_adapt
pci_unmap_page(pdev, ps_page_dma->ps_page_dma[j],
PAGE_SIZE, PCI_DMA_FROMDEVICE);
ps_page_dma->ps_page_dma[j] = 0;
-   skb_fill_page_desc(skb, j, ps_page->ps_page[j], 0,
-  length);
+   skb_add_rx_frag(skb, j, ps_page->ps_page[j], 0, length);
ps_page->ps_page[j] = NULL;
-   skb->len += length;
-   skb->data_len += length;
-   skb->truesize += length;
}
 
/* strip the ethernet crc, problem is we're using pages now so
@@ -4623,7 +4619,7 @@ e1000_alloc_rx_buffers_ps(struct e1000_a
if (j < adapter->rx_ps_pages) {
if (likely(!ps_page->ps_page[j])) {
ps_page->ps_page[j] =
-   alloc_page(GFP_ATOMIC);
+   netdev_alloc_page(netdev);
if (unlikely(!ps_page->ps_page[j])) {
adapter->alloc_rx_buff_failed++;
goto no_buffers;
Index: linux-2.6-git/include/linux/skbuff.h
===
--- linux-2.6-git.orig/include/linux/skbuff.h   2007-02-14 11:29:54.0 
+0100
+++ linux-2.6-git/include/linux/skbuff.h2007-02-14 11:59:04.0 
+0100
@@ -813,6 +813,9 @@ static inline void skb_fill_page_desc(st
skb_shinfo(skb)->nr_frags = i + 1;
 }
 
+extern void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page,
+   int off, int size);
+
 #define SKB_PAGE_ASSERT(skb)   BUG_ON(skb_shinfo(skb)->nr_frags)
 #define SKB_FRAG_ASSERT(skb)   BUG_ON(skb_shinfo(skb)->frag_list)
 #define SKB_LINEAR_ASSERT(skb)  BUG_ON(skb_is_nonlinear(skb))
@@ -1148,6 +1151,26 @@ static inline struct sk_buff *netdev_all
return __netdev_alloc_skb(dev, length, GFP_ATOMIC);
 }
 
+extern struct page *__netdev_alloc_page(struct net_device *dev, gfp_t 
gfp_mask);
+
+/**
+ * netdev_alloc_page - allocate a page for ps-rx on a specific device
+ * @dev: network device to receive on
+ *
+ * Allocate a new page node local to the specified device.
+ *
+ * %NULL is returned if there is no free memory.
+ */
+static inline struct page *netdev_alloc_page(struct net_device *dev)
+{
+   return __netdev_alloc_page(dev, GFP_ATOMIC);
+}
+
+static inline void netdev_free_page(struct net_device *dev, struct page *page)
+{
+   __free_page(page);
+}
+
 /**
  * skb_cow - copy header of skb when it is required
  * @skb: buffer to cow
Index: linux-2.6-git/net/core/skbuff.c
===
--- linux-2.6-git.orig/net/core/skbuff.c2007-02-14 11:29:54.0 
+0100
+++ linux-2.6-git/net/core/skbuff.c 2007-02-14 12:01:40.0 +0100
@@ -279,6 +279,24 @@ struct sk_buff *__netdev_alloc_skb(struc
return skb;
 }
 
+struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
+{
+   int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
+   struct page *page;
+
+   page = alloc_pages_node(node, gfp_mask, 0);
+   return page;
+}
+
+void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
+   int size)
+{
+   skb_fill_page_desc(skb, i, page, off, size);
+   skb->len += size;
+   skb->data_len += size;
+   skb->truesize += size;
+}
+
 static void skb_drop_list(struct sk_buff **listp)
 {
struct sk_buff *list = *listp;
@@ -2066,6 +2084,8 @@ EXPORT_SYMBOL(kfree_skb);
 EXPORT_SYMBOL(__pskb_pull_tail);
 EXPORT_SYMBOL(__alloc_skb);
 EXPORT_SYMBOL(__netdev_alloc_skb);
+EXPORT_SYMBOL(__netdev_alloc_page);
+EXPORT_SYMBOL(skb_add_rx_frag);
 EXPORT_SYMBOL(pskb_copy);
 EXPORT_SYMBOL(pskb_expand_head);
 EXPORT_SYMBOL(skb_checksum);
Inde

[PATCH 38/40] netlink: add SOCK_VMIO support to AF_NETLINK

2007-05-04 Thread Peter Zijlstra
Modify the netlink code so that SOCK_VMIO has the desired effect on the
user-space side of the connection.

Modify sys_{send,recv}msg to use sk->sk_allocation instead of GFP_KERNEL;
this should not change existing behaviour because the default of
sk->sk_allocation is GFP_KERNEL, and no user-space-exposed socket would
have it any different at this time.

This change allows the system calls to succeed for SOCK_VMIO sockets
(which have sk->sk_allocation |= GFP_EMERGENCY) even under extreme memory
pressure.

Since netlink_sendmsg is used to transfer messages from user space to kernel
space, treat the skb allocation there as a receive allocation.

Also export netlink_lookup; this is needed to locate the kernel-side struct
sock object associated with the user-space netlink socket.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
Cc: David Miller <[EMAIL PROTECTED]>
Cc: Mike Christie <[EMAIL PROTECTED]>
---
 include/linux/netlink.h  |1 +
 net/compat.c |2 +-
 net/netlink/af_netlink.c |   12 +---
 net/socket.c |6 +++---
 4 files changed, 14 insertions(+), 7 deletions(-)

Index: linux-2.6-git/net/netlink/af_netlink.c
===
--- linux-2.6-git.orig/net/netlink/af_netlink.c
+++ linux-2.6-git/net/netlink/af_netlink.c
@@ -203,7 +203,7 @@ netlink_unlock_table(void)
wake_up(&nl_table_wait);
 }
 
-static __inline__ struct sock *netlink_lookup(int protocol, u32 pid)
+struct sock *netlink_lookup(int protocol, u32 pid)
 {
struct nl_pid_hash *hash = &nl_table[protocol].hash;
struct hlist_head *head;
@@ -1157,7 +1157,7 @@ static int netlink_sendmsg(struct kiocb 
if (len > sk->sk_sndbuf - 32)
goto out;
err = -ENOBUFS;
-   skb = alloc_skb(len, GFP_KERNEL);
+   skb = __alloc_skb(len, GFP_KERNEL, SKB_ALLOC_RX, -1);
if (skb==NULL)
goto out;
 
@@ -1186,8 +1186,13 @@ static int netlink_sendmsg(struct kiocb 
}
 
if (dst_group) {
+   gfp_t gfp_mask = sk->sk_allocation;
+
+   if (skb_emergency(skb))
+   gfp_mask |= __GFP_EMERGENCY;
+
atomic_inc(&skb->users);
-   netlink_broadcast(sk, skb, dst_pid, dst_group, GFP_KERNEL);
+   netlink_broadcast(sk, skb, dst_pid, dst_group, gfp_mask);
}
err = netlink_unicast(sk, skb, dst_pid, msg->msg_flags&MSG_DONTWAIT);
 
@@ -1850,6 +1855,7 @@ panic:
 
 core_initcall(netlink_proto_init);
 
+EXPORT_SYMBOL(netlink_lookup);
 EXPORT_SYMBOL(netlink_ack);
 EXPORT_SYMBOL(netlink_run_queue);
 EXPORT_SYMBOL(netlink_broadcast);
Index: linux-2.6-git/net/socket.c
===
--- linux-2.6-git.orig/net/socket.c
+++ linux-2.6-git/net/socket.c
@@ -1817,7 +1817,7 @@ asmlinkage long sys_sendmsg(int fd, stru
err = -ENOMEM;
iov_size = msg_sys.msg_iovlen * sizeof(struct iovec);
if (msg_sys.msg_iovlen > UIO_FASTIOV) {
-   iov = sock_kmalloc(sock->sk, iov_size, GFP_KERNEL);
+   iov = sock_kmalloc(sock->sk, iov_size, sock->sk->sk_allocation);
if (!iov)
goto out_put;
}
@@ -1846,7 +1846,7 @@ asmlinkage long sys_sendmsg(int fd, stru
ctl_len = msg_sys.msg_controllen;
} else if (ctl_len) {
if (ctl_len > sizeof(ctl)) {
-   ctl_buf = sock_kmalloc(sock->sk, ctl_len, GFP_KERNEL);
+   ctl_buf = sock_kmalloc(sock->sk, ctl_len, 
sock->sk->sk_allocation);
if (ctl_buf == NULL)
goto out_freeiov;
}
@@ -1922,7 +1922,7 @@ asmlinkage long sys_recvmsg(int fd, stru
err = -ENOMEM;
iov_size = msg_sys.msg_iovlen * sizeof(struct iovec);
if (msg_sys.msg_iovlen > UIO_FASTIOV) {
-   iov = sock_kmalloc(sock->sk, iov_size, GFP_KERNEL);
+   iov = sock_kmalloc(sock->sk, iov_size, sock->sk->sk_allocation);
if (!iov)
goto out_put;
}
Index: linux-2.6-git/include/linux/netlink.h
===
--- linux-2.6-git.orig/include/linux/netlink.h
+++ linux-2.6-git/include/linux/netlink.h
@@ -157,6 +157,7 @@ struct netlink_skb_parms
 #define NETLINK_CREDS(skb) (&NETLINK_CB((skb)).creds)
 
 
+extern struct sock *netlink_lookup(int protocol, __u32 pid);
 extern struct sock *netlink_kernel_create(int unit, unsigned int groups,
  void (*input)(struct sock *sk, int 
len),
  struct mutex *cb_mutex,
Index: linux-2.6-git/net/compat.c
===
--- linux-2.6-git.orig/net/compat.c
+++ linux-2.6-git/net/compat.c
@@ -169,7 +169,7 @@ int cmsghdr_from_user_compat_to_kern(str
 * from the user

[PATCH 34/40] sock: safely expose kernel sockets to userspace

2007-05-04 Thread Peter Zijlstra
SOCK_KERNEL - prevents user-space from actually using this socket for anything.
This enables sticking kernel sockets into the files_table for identification and
reference-counting purposes.

(iSCSI wants to do this)
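
For illustration, the intended use (e.g. by iSCSI) would look roughly like the
sketch below -- error handling trimmed, the call site is made up, and only the
SOCK_KERNEL flag itself comes from this patch:

#include <linux/net.h>
#include <net/sock.h>

static int expose_kernel_socket_sketch(void)
{
        struct socket *sock;
        int fd, err;

        err = sock_create_kern(PF_INET, SOCK_STREAM, IPPROTO_TCP, &sock);
        if (err < 0)
                return err;

        sock_set_flag(sock->sk, SOCK_KERNEL);   /* userspace gets an fd only */

        fd = sock_map_fd(sock);                 /* stick it in the files table */
        if (fd < 0)
                sock_release(sock);
        return fd;
}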

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
Cc: Mike Christie <[EMAIL PROTECTED]>
---
 include/net/sock.h |1 +
 net/socket.c   |   10 +-
 2 files changed, 10 insertions(+), 1 deletion(-)

Index: linux-2.6-git/include/net/sock.h
===
--- linux-2.6-git.orig/include/net/sock.h   2007-03-22 11:29:07.0 
+0100
+++ linux-2.6-git/include/net/sock.h2007-03-22 11:29:08.0 +0100
@@ -394,6 +394,7 @@ enum sock_flags {
SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
SOCK_VMIO, /* the VM depends on us - make sure we're serviced */
+   SOCK_KERNEL, /* userspace cannot touch this socket */
 };
 
 static inline void sock_copy_flags(struct sock *nsk, struct sock *osk)
Index: linux-2.6-git/net/socket.c
===
--- linux-2.6-git.orig/net/socket.c 2007-03-22 11:28:58.0 +0100
+++ linux-2.6-git/net/socket.c  2007-03-26 12:00:36.0 +0200
@@ -353,7 +353,7 @@ static int sock_alloc_fd(struct file **f
return fd;
 }
 
-static int sock_attach_fd(struct socket *sock, struct file *file)
+static noinline int sock_attach_fd(struct socket *sock, struct file *file)
 {
struct qstr this;
char name[32];
@@ -381,6 +381,10 @@ static int sock_attach_fd(struct socket 
file->f_op = SOCK_INODE(sock)->i_fop = &socket_file_ops;
file->f_mode = FMODE_READ | FMODE_WRITE;
file->f_flags = O_RDWR;
+   if (unlikely(sock->sk && sock_flag(sock->sk, SOCK_KERNEL))) {
+   file->f_mode = 0;
+   file->f_flags = 0;
+   }
file->f_pos = 0;
file->private_data = sock;
 
@@ -806,6 +810,10 @@ static long sock_ioctl(struct file *file
int pid, err;
 
sock = file->private_data;
+
+   if (unlikely(sock_flag(sock->sk, SOCK_KERNEL)))
+   return -EBADF;
+
if (cmd >= SIOCDEVPRIVATE && cmd <= (SIOCDEVPRIVATE + 15)) {
err = dev_ioctl(cmd, argp);
} else

--

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 16/40] netvm: hook skb allocation to reserves

2007-05-04 Thread Peter Zijlstra
Change the skb allocation api to indicate RX usage and use this to fall back to
the reserve when needed. Skbs allocated from the reserve are tagged in
skb->emergency.

Teach all other skb ops about emergency skbs and the reserve accounting.

Use the (new) packet split API to allocate and track fragment pages from the
emergency reserve. Do this using an atomic counter in page->index. This is
needed because the fragments have a different sharing semantic than that
indicated by skb_shinfo()->dataref. 

(NOTE the extra atomic overhead is only for those pages allocated from the
reserves - it does not affect the normal fast path.)

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 include/linux/skbuff.h |   22 +-
 net/core/skbuff.c  |  161 ++---
 2 files changed, 157 insertions(+), 26 deletions(-)

Index: linux-2.6-git/include/linux/skbuff.h
===
--- linux-2.6-git.orig/include/linux/skbuff.h
+++ linux-2.6-git/include/linux/skbuff.h
@@ -277,7 +277,8 @@ struct sk_buff {
nfctinfo:3;
__u8pkt_type:3,
fclone:2,
-   ipvs_property:1;
+   ipvs_property:1,
+   emergency:1;
__be16  protocol;
 
void(*destructor)(struct sk_buff *skb);
@@ -323,10 +324,19 @@ struct sk_buff {
 
 #include 
 
+#define SKB_ALLOC_FCLONE   0x01
+#define SKB_ALLOC_RX   0x02
+
+#ifdef CONFIG_NETVM
+#define skb_emergency(skb) unlikely((skb)->emergency)
+#else
+#define skb_emergency(skb) false
+#endif
+
 extern void kfree_skb(struct sk_buff *skb);
 extern void   __kfree_skb(struct sk_buff *skb);
 extern struct sk_buff *__alloc_skb(unsigned int size,
-  gfp_t priority, int fclone, int node);
+  gfp_t priority, int flags, int node);
 static inline struct sk_buff *alloc_skb(unsigned int size,
gfp_t priority)
 {
@@ -336,7 +346,7 @@ static inline struct sk_buff *alloc_skb(
 static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
   gfp_t priority)
 {
-   return __alloc_skb(size, priority, 1, -1);
+   return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, -1);
 }
 
 extern void   kfree_skbmem(struct sk_buff *skb);
@@ -1279,7 +1289,8 @@ static inline void __skb_queue_purge(str
 static inline struct sk_buff *__dev_alloc_skb(unsigned int length,
  gfp_t gfp_mask)
 {
-   struct sk_buff *skb = alloc_skb(length + NET_SKB_PAD, gfp_mask);
+   struct sk_buff *skb =
+   __alloc_skb(length + NET_SKB_PAD, gfp_mask, SKB_ALLOC_RX, -1);
if (likely(skb))
skb_reserve(skb, NET_SKB_PAD);
return skb;
@@ -1325,6 +1336,7 @@ static inline struct sk_buff *netdev_all
 }
 
 extern struct page *__netdev_alloc_page(struct net_device *dev, gfp_t 
gfp_mask);
+extern void __netdev_free_page(struct net_device *dev, struct page *page);
 
 /**
  * netdev_alloc_page - allocate a page for ps-rx on a specific device
@@ -1341,7 +1353,7 @@ static inline struct page *netdev_alloc_
 
 static inline void netdev_free_page(struct net_device *dev, struct page *page)
 {
-   __free_page(page);
+   __netdev_free_page(dev, page);
 }
 
 /**
Index: linux-2.6-git/net/core/skbuff.c
===
--- linux-2.6-git.orig/net/core/skbuff.c
+++ linux-2.6-git/net/core/skbuff.c
@@ -144,21 +144,28 @@ EXPORT_SYMBOL(skb_truesize_bug);
  * %GFP_ATOMIC.
  */
 struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
-   int fclone, int node)
+   int flags, int node)
 {
struct kmem_cache *cache;
struct skb_shared_info *shinfo;
struct sk_buff *skb;
u8 *data;
+   int emergency = 0;
 
-   cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;
+   size = SKB_DATA_ALIGN(size);
+   cache = (flags & SKB_ALLOC_FCLONE)
+   ? skbuff_fclone_cache : skbuff_head_cache;
+#ifdef CONFIG_NETVM
+   if (flags & SKB_ALLOC_RX)
+   gfp_mask |= __GFP_NOMEMALLOC|__GFP_NOWARN;
+#endif
 
+retry_alloc:
/* Get the HEAD */
skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node);
if (!skb)
-   goto out;
+   goto noskb;
 
-   size = SKB_DATA_ALIGN(size);
data = kmalloc_node_track_caller(size + sizeof(struct skb_shared_info),
gfp_mask, node);
if (!data)
@@ -168,6 +175,7 @@ struct sk_buff *__alloc_skb(unsigned int
 * See comment in sk_buff definition, just before the 'tail' member
 */
memset(skb, 0, offse

[PATCH 03/40] mm: allow PF_MEMALLOC from softirq context

2007-05-04 Thread Peter Zijlstra
Allow PF_MEMALLOC to be set in softirq context. When running softirqs from
a borrowed context, save and restore current->flags; ksoftirqd will have its
own task_struct.
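
tsk_restore_flags() is also used later in this series to temporarily *set* a
flag; the generic pattern (do_work() is just a placeholder) is:

        unsigned long pflags = current->flags;

        current->flags |= PF_MEMALLOC;  /* may dip into the reserves */
        do_work();                      /* placeholder */
        tsk_restore_flags(current, pflags, PF_MEMALLOC);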

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 include/linux/sched.h |4 
 kernel/softirq.c  |3 +++
 mm/internal.h |7 ---
 3 files changed, 11 insertions(+), 3 deletions(-)

Index: linux-2.6-git/mm/internal.h
===
--- linux-2.6-git.orig/mm/internal.h2007-02-22 14:44:37.0 +0100
+++ linux-2.6-git/mm/internal.h 2007-02-22 15:16:58.0 +0100
@@ -75,9 +75,10 @@ static int inline gfp_to_alloc_flags(gfp
alloc_flags |= ALLOC_HARDER;
 
if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
-   if (!in_interrupt() &&
-   ((p->flags & PF_MEMALLOC) ||
-unlikely(test_thread_flag(TIF_MEMDIE
+   if (!in_irq() && (p->flags & PF_MEMALLOC))
+   alloc_flags |= ALLOC_NO_WATERMARKS;
+   else if (!in_interrupt() &&
+   unlikely(test_thread_flag(TIF_MEMDIE)))
alloc_flags |= ALLOC_NO_WATERMARKS;
}
 
Index: linux-2.6-git/kernel/softirq.c
===
--- linux-2.6-git.orig/kernel/softirq.c 2007-02-22 14:44:35.0 +0100
+++ linux-2.6-git/kernel/softirq.c  2007-02-22 15:29:38.0 +0100
@@ -210,6 +210,8 @@ asmlinkage void __do_softirq(void)
__u32 pending;
int max_restart = MAX_SOFTIRQ_RESTART;
int cpu;
+   unsigned long pflags = current->flags;
+   current->flags &= ~PF_MEMALLOC;
 
pending = local_softirq_pending();
account_system_vtime(current);
@@ -248,6 +250,7 @@ restart:
 
account_system_vtime(current);
_local_bh_enable();
+   tsk_restore_flags(current, pflags, PF_MEMALLOC);
 }
 
 #ifndef __ARCH_HAS_DO_SOFTIRQ
Index: linux-2.6-git/include/linux/sched.h
===
--- linux-2.6-git.orig/include/linux/sched.h2007-02-22 15:17:39.0 
+0100
+++ linux-2.6-git/include/linux/sched.h 2007-02-22 15:29:05.0 +0100
@@ -1185,6 +1185,10 @@ static inline void put_task_struct(struc
 #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
 #define used_math() tsk_used_math(current)
 
+#define tsk_restore_flags(p, pflags, mask) \
+   do {(p)->flags &= ~(mask); \
+   (p)->flags |= ((pflags) & (mask)); } while (0)
+
 #ifdef CONFIG_SMP
 extern int set_cpus_allowed(struct task_struct *p, cpumask_t new_mask);
 #else



[PATCH 24/40] mm: methods for teaching filesystems about PG_swapcache pages

2007-05-04 Thread Peter Zijlstra
In order to teach filesystems to handle swap cache pages, two new page
functions are introduced:

  pgoff_t page_file_index(struct page *);
  struct address_space *page_file_mapping(struct page *);

page_file_index - gives the offset of this page in the file, in PAGE_CACHE_SIZE
blocks. Where page->index does this for mapped pages, this function also gives
the correct index for PG_swapcache pages.

page_file_mapping - gives the mapping backing the actual page; that is, for
swap cache pages it will give swap_file->f_mapping.

page_offset() is modified to use page_file_index(), so that it will give the
expected result, even for PG_swapcache pages.
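
As a sketch (hypothetical filesystem code, not part of this patch) of how a
writeout path that must cope with swapcache pages would use these:

        struct address_space *mapping = page_file_mapping(page);
        struct inode *inode = mapping->host;
        loff_t pos = page_offset(page);         /* correct for PG_swapcache too */
        pgoff_t idx = page_file_index(page);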

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
CC: Trond Myklebust <[EMAIL PROTECTED]>
---
 include/linux/mm.h  |   25 +
 include/linux/pagemap.h |2 +-
 2 files changed, 26 insertions(+), 1 deletion(-)

Index: linux-2.6-git/include/linux/mm.h
===
--- linux-2.6-git.orig/include/linux/mm.h   2007-02-21 12:15:01.0 
+0100
+++ linux-2.6-git/include/linux/mm.h2007-02-21 12:15:07.0 +0100
@@ -594,6 +594,16 @@ static inline struct swap_info_struct *p
return get_swap_info_struct(swp_type(swap));
 }
 
+static inline
+struct address_space *page_file_mapping(struct page *page)
+{
+#ifdef CONFIG_SWAP_FILE
+   if (unlikely(PageSwapCache(page)))
+   return page_swap_info(page)->swap_file->f_mapping;
+#endif
+   return page->mapping;
+}
+
 static inline int PageAnon(struct page *page)
 {
return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
@@ -611,6 +621,21 @@ static inline pgoff_t page_index(struct 
 }
 
 /*
+ * Return the file index of the page. Regular pagecache pages use ->index
+ * whereas swapcache pages use swp_offset(->private)
+ */
+static inline pgoff_t page_file_index(struct page *page)
+{
+#ifdef CONFIG_SWAP_FILE
+   if (unlikely(PageSwapCache(page))) {
+   swp_entry_t swap = { .val = page_private(page) };
+   return swp_offset(swap);
+   }
+#endif
+   return page->index;
+}
+
+/*
  * The atomic page->_mapcount, like _count, starts from -1:
  * so that transitions both from it and to it can be tracked,
  * using atomic_inc_and_test and atomic_add_negative(-1).
Index: linux-2.6-git/include/linux/pagemap.h
===
--- linux-2.6-git.orig/include/linux/pagemap.h  2007-02-21 12:14:54.0 
+0100
+++ linux-2.6-git/include/linux/pagemap.h   2007-02-21 12:15:07.0 
+0100
@@ -120,7 +120,7 @@ extern void __remove_from_page_cache(str
  */
 static inline loff_t page_offset(struct page *page)
 {
-   return ((loff_t)page->index) << PAGE_CACHE_SHIFT;
+   return ((loff_t)page_file_index(page)) << PAGE_CACHE_SHIFT;
 }
 
 static inline pgoff_t linear_page_index(struct vm_area_struct *vma,



[PATCH 13/40] net: sk_allocation() - concentrate socket related allocations

2007-05-04 Thread Peter Zijlstra
Introduce sk_allocation(); this function allows sock-specific flags to be
injected into each sock-related allocation.
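
The call-site conversion pattern is simply (illustrative; the hunks below do
exactly this):

        skb = alloc_skb(MAX_TCP_HEADER, sk_allocation(sk, GFP_ATOMIC));

For now sk_allocation() is a plain pass-through; a later patch in this series
makes it merge emergency bits from sk->sk_allocation, so call sites need not
be touched again.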

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 include/net/sock.h|7 ++-
 net/ipv4/tcp_output.c |   11 ++-
 net/ipv6/tcp_ipv6.c   |   14 +-
 3 files changed, 21 insertions(+), 11 deletions(-)

Index: linux-2.6-git/net/ipv4/tcp_output.c
===
--- linux-2.6-git.orig/net/ipv4/tcp_output.c
+++ linux-2.6-git/net/ipv4/tcp_output.c
@@ -2011,7 +2011,7 @@ void tcp_send_fin(struct sock *sk)
} else {
/* Socket is locked, keep trying until memory is available. */
for (;;) {
-   skb = alloc_skb_fclone(MAX_TCP_HEADER, GFP_KERNEL);
+   skb = alloc_skb_fclone(MAX_TCP_HEADER, 
sk->sk_allocation);
if (skb)
break;
yield();
@@ -2044,7 +2044,7 @@ void tcp_send_active_reset(struct sock *
struct sk_buff *skb;
 
/* NOTE: No TCP options attached and we never retransmit this. */
-   skb = alloc_skb(MAX_TCP_HEADER, priority);
+   skb = alloc_skb(MAX_TCP_HEADER, sk_allocation(sk, priority));
if (!skb) {
NET_INC_STATS(LINUX_MIB_TCPABORTFAILED);
return;
@@ -2117,7 +2117,8 @@ struct sk_buff * tcp_make_synack(struct 
__u8 *md5_hash_location;
 #endif
 
-   skb = sock_wmalloc(sk, MAX_TCP_HEADER + 15, 1, GFP_ATOMIC);
+   skb = sock_wmalloc(sk, MAX_TCP_HEADER + 15, 1,
+   sk_allocation(sk, GFP_ATOMIC));
if (skb == NULL)
return NULL;
 
@@ -2376,7 +2377,7 @@ void tcp_send_ack(struct sock *sk)
 * tcp_transmit_skb() will set the ownership to this
 * sock.
 */
-   buff = alloc_skb(MAX_TCP_HEADER, GFP_ATOMIC);
+   buff = alloc_skb(MAX_TCP_HEADER, sk_allocation(sk, GFP_ATOMIC));
if (buff == NULL) {
inet_csk_schedule_ack(sk);
inet_csk(sk)->icsk_ack.ato = TCP_ATO_MIN;
@@ -2418,7 +2419,7 @@ static int tcp_xmit_probe_skb(struct soc
struct sk_buff *skb;
 
/* We don't queue it, tcp_transmit_skb() sets ownership. */
-   skb = alloc_skb(MAX_TCP_HEADER, GFP_ATOMIC);
+   skb = alloc_skb(MAX_TCP_HEADER, sk_allocation(sk, GFP_ATOMIC));
if (skb == NULL)
return -1;
 
Index: linux-2.6-git/include/net/sock.h
===
--- linux-2.6-git.orig/include/net/sock.h
+++ linux-2.6-git/include/net/sock.h
@@ -415,6 +415,11 @@ static inline int sock_flag(struct sock 
return test_bit(flag, &sk->sk_flags);
 }
 
+static inline gfp_t sk_allocation(struct sock *sk, gfp_t gfp_mask)
+{
+   return gfp_mask;
+}
+
 static inline void sk_acceptq_removed(struct sock *sk)
 {
sk->sk_ack_backlog--;
@@ -1207,7 +1212,7 @@ static inline struct sk_buff *sk_stream_
int hdr_len;
 
hdr_len = SKB_DATA_ALIGN(sk->sk_prot->max_header);
-   skb = alloc_skb_fclone(size + hdr_len, gfp);
+   skb = alloc_skb_fclone(size + hdr_len, sk_allocation(sk, gfp));
if (skb) {
skb->truesize += mem;
if (sk_stream_wmem_schedule(sk, skb->truesize)) {
Index: linux-2.6-git/net/ipv6/tcp_ipv6.c
===
--- linux-2.6-git.orig/net/ipv6/tcp_ipv6.c
+++ linux-2.6-git/net/ipv6/tcp_ipv6.c
@@ -581,7 +581,8 @@ static int tcp_v6_md5_do_add(struct sock
} else {
/* reallocate new list if current one is full. */
if (!tp->md5sig_info) {
-   tp->md5sig_info = kzalloc(sizeof(*tp->md5sig_info), 
GFP_ATOMIC);
+   tp->md5sig_info = kzalloc(sizeof(*tp->md5sig_info),
+   sk_allocation(sk, GFP_ATOMIC));
if (!tp->md5sig_info) {
kfree(newkey);
return -ENOMEM;
@@ -590,7 +591,8 @@ static int tcp_v6_md5_do_add(struct sock
tcp_alloc_md5sig_pool();
if (tp->md5sig_info->alloced6 == tp->md5sig_info->entries6) {
keys = kmalloc((sizeof (tp->md5sig_info->keys6[0]) *
-  (tp->md5sig_info->entries6 + 1)), 
GFP_ATOMIC);
+  (tp->md5sig_info->entries6 + 1)),
+  sk_allocation(sk, GFP_ATOMIC));
 
if (!keys) {
tcp_free_md5sig_pool();
@@ -715,7 +717,7 @@ static int tcp_v6_parse_md5_keys (struct
struct tcp_sock *tp = tcp_sk(sk);
struct tcp_md5sig_info *p;
 
-   p = kzalloc(sizeof(struct tcp_md5sig_info), GFP_KERNEL);
+   p = kzall

[PATCH 19/40] netfilter: notify about NF_QUEUE vs emergency skbs

2007-05-04 Thread Peter Zijlstra
Avoid memory getting stuck waiting for userspace: drop all emergency packets.
This of course requires the regular storage route to not include an NF_QUEUE
target ;-)

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
Cc: Patrick McHardy <[EMAIL PROTECTED]>
---
 net/netfilter/core.c |3 +++
 1 file changed, 3 insertions(+)

Index: linux-2.6-git/net/netfilter/core.c
===
--- linux-2.6-git.orig/net/netfilter/core.c 2007-02-22 15:48:28.0 
+0100
+++ linux-2.6-git/net/netfilter/core.c  2007-02-26 14:23:25.0 +0100
@@ -184,9 +184,12 @@ next_hook:
ret = 1;
goto unlock;
} else if (verdict == NF_DROP) {
+drop:
kfree_skb(*pskb);
ret = -EPERM;
} else if ((verdict & NF_VERDICT_MASK)  == NF_QUEUE) {
+   if (skb_emergency(*pskb))
+   goto drop;
NFDEBUG("nf_hook: Verdict = QUEUE.\n");
if (!nf_queue(*pskb, elem, pf, hook, indev, outdev, okfn,
  verdict >> NF_VERDICT_BITS))



[PATCH 00/40] Swap over Networked storage -v12

2007-05-04 Thread Peter Zijlstra
There is a fundamental deadlock associated with paging: writing out a page to
free memory itself requires free memory to complete. The usual solution is to
keep a small amount of memory available at all times so we can overcome this
problem. This however assumes the amount of memory needed for writeout is
(constant and) smaller than the provided reserve.

It is this latter assumption that breaks when doing writeout over the network.
The network can take up an unspecified amount of memory while waiting for a
reply to our write request. This re-introduces the deadlock; we might never
complete the writeout, because we might not have enough memory to receive the
completion message.

The proposed solution is simple: only allow traffic servicing the VM to make
use of the reserves.

This however implies you know which packets are for whom, which generally
speaking you don't. Hence we need to receive all packets, but discard them as
soon as we encounter a non-VM-bound packet allocated from the reserves.

Knowing that a packet is headed towards the VM also needs a little help, hence
we introduce the socket flag SOCK_VMIO to mark such sockets.

Of course, since we are paging, all this has to happen in kernel-space;
user-space might just not be there.

Since packet processing might also require memory, this all also implies that
those auxiliary allocations may use the reserves when an emergency packet is
processed. This is accomplished by using PF_MEMALLOC.
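
Concretely, the consumer-side recipe ends up looking like this sketch (the
helpers come from the netvm patches later in this series):

        /* swapon: grow the reserve and mark the transport socket */
        sk_adjust_memalloc(1, TX_RESERVE_PAGES);
        sk_set_vmio(sock->sk);

        /* swapoff: undo both */
        sk_clear_vmio(sock->sk);
        sk_adjust_memalloc(-1, -TX_RESERVE_PAGES);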

How much memory is to be reserved is also an issue; we need enough memory to
saturate both the route cache and IP fragment reassembly, along with various
constants.

This patch-set comes in 6 parts:

1) introduce the memory reserve and make the SLAB allocator play nice with it.
   patches 01-10

2) add some needed infrastructure to the network code
   patches 11-13

3) implement the idea outlined above
   patches 14-20

4) teach the swap machinery to use generic address_spaces
   patches 21-24

5) implement swap over NFS using all the new stuff
   patches 25-31

6) implement swap over iSCSI
   patches 32-40

Patches can also be found here:
  http://programming.kicks-ass.net/kernel-patches/vm_deadlock/v12/

If I receive no feedback, I will assume the various maintainers do not object
and I will respin the series against -mm and submit for inclusion.

There is interest in this feature from the stateless Linux world; that is,
both the virtualization world and the cluster world.

I have been contacted by various groups; some have just expressed their
interest, others have been testing this work in their environments.

Various hardware vendors have also expressed interest, and, of course, my
employer finds it important enough to have me work on it.

Also, while it doesn't present a full-fledged reserve-based allocator API yet,
it does lay most of the groundwork for it. There is a GFP_NOFAIL elimination
project wanting to use this as a foundation. Elimination of GFP_NOFAIL will
greatly improve the basic soundness and stability of the code that currently
uses that construct - most disk based filesystems.



[PATCH 39/40] mm: a process flags to avoid blocking allocations

2007-05-04 Thread Peter Zijlstra
PF_MEM_NOWAIT - will make allocations fail instead of blocking. This is useful
for converting a process's allocation behaviour to non-blocking.
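
The usage pattern (handle_request() is a placeholder; the caller must be
prepared to see allocation failures) is:

        unsigned long pflags = current->flags;

        current->flags |= PF_MEM_NOWAIT;        /* fail rather than block */
        err = handle_request();                 /* placeholder */
        tsk_restore_flags(current, pflags, PF_MEM_NOWAIT);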

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
Cc: Mike Christie <[EMAIL PROTECTED]>
---
 include/linux/sched.h |1 +
 kernel/softirq.c  |4 ++--
 mm/internal.h |   11 ++-
 mm/page_alloc.c   |4 ++--
 4 files changed, 15 insertions(+), 5 deletions(-)

Index: linux-2.6-git/include/linux/sched.h
===
--- linux-2.6-git.orig/include/linux/sched.h2007-03-26 12:03:07.0 
+0200
+++ linux-2.6-git/include/linux/sched.h 2007-03-26 12:03:09.0 +0200
@@ -1158,6 +1158,7 @@ static inline void put_task_struct(struc
 #define PF_SPREAD_SLAB 0x0200  /* Spread some slab caches over cpuset 
*/
 #define PF_MEMPOLICY   0x1000  /* Non-default NUMA mempolicy */
 #define PF_MUTEX_TESTER0x2000  /* Thread belongs to the rt 
mutex tester */
+#define PF_MEM_NOWAIT  0x4000  /* Make allocations fail instead of 
block */
 
 /*
  * Only the _current_ task can read/write to tsk->flags, but other
Index: linux-2.6-git/mm/page_alloc.c
===
--- linux-2.6-git.orig/mm/page_alloc.c  2007-03-26 12:03:07.0 +0200
+++ linux-2.6-git/mm/page_alloc.c   2007-03-26 12:03:09.0 +0200
@@ -1234,11 +1234,11 @@ struct page * fastcall
 __alloc_pages(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist)
 {
-   const gfp_t wait = gfp_mask & __GFP_WAIT;
+   struct task_struct *p = current;
+   const bool wait = gfp_wait(gfp_mask);
struct zone **z;
struct page *page;
struct reclaim_state reclaim_state;
-   struct task_struct *p = current;
int do_retry;
int alloc_flags;
int did_some_progress;
Index: linux-2.6-git/mm/internal.h
===
--- linux-2.6-git.orig/mm/internal.h2007-03-26 12:03:07.0 +0200
+++ linux-2.6-git/mm/internal.h 2007-03-26 12:03:09.0 +0200
@@ -46,6 +46,15 @@ extern void fastcall __init __free_pages
 #define ALLOC_NO_WATERMARKS0x20 /* don't check watermarks at all */
 #define ALLOC_CPUSET   0x40 /* check for correct cpuset */
 
+static bool inline gfp_wait(gfp_t gfp_mask)
+{
+   bool wait = gfp_mask & __GFP_WAIT;
+   if (wait && !in_irq() && (current->flags & PF_MEM_NOWAIT))
+   wait = false;
+
+   return wait;
+}
+
 /*
  * get the deepest reaching allocation flags for the given gfp_mask
  */
@@ -53,7 +62,7 @@ static int inline gfp_to_alloc_flags(gfp
 {
struct task_struct *p = current;
int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
-   const gfp_t wait = gfp_mask & __GFP_WAIT;
+   const bool wait = gfp_wait(gfp_mask);
 
/*
 * The caller may dip into page reserves a bit more if the caller
Index: linux-2.6-git/kernel/softirq.c
===
--- linux-2.6-git.orig/kernel/softirq.c 2007-03-26 12:03:07.0 +0200
+++ linux-2.6-git/kernel/softirq.c  2007-03-26 12:12:58.0 +0200
@@ -211,7 +211,7 @@ asmlinkage void __do_softirq(void)
int max_restart = MAX_SOFTIRQ_RESTART;
int cpu;
unsigned long pflags = current->flags;
-   current->flags &= ~PF_MEMALLOC;
+   current->flags &= ~(PF_MEMALLOC|PF_MEM_NOWAIT);
 
pending = local_softirq_pending();
account_system_vtime(current);
@@ -250,7 +250,7 @@ restart:
 
account_system_vtime(current);
_local_bh_enable();
-   tsk_restore_flags(current, pflags, PF_MEMALLOC);
+   tsk_restore_flags(current, pflags, (PF_MEMALLOC|PF_MEM_NOWAIT));
 }
 
 #ifndef __ARCH_HAS_DO_SOFTIRQ



[PATCH 21/40] uml: rename arch/um remove_mapping()

2007-05-04 Thread Peter Zijlstra
When 'include/linux/mm.h' includes 'include/linux/swap.h', the global
remove_mapping() definition clashes with the arch/um one.

Rename the arch/um one.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
Acked-by: Jeff Dike <[EMAIL PROTECTED]>
---
 arch/um/kernel/physmem.c |6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Index: linux-2.6-git/arch/um/kernel/physmem.c
===
--- linux-2.6-git.orig/arch/um/kernel/physmem.c 2007-02-12 09:40:47.0 
+0100
+++ linux-2.6-git/arch/um/kernel/physmem.c  2007-02-12 11:17:47.0 
+0100
@@ -160,7 +160,7 @@ int physmem_subst_mapping(void *virt, in
 
 static int physmem_fd = -1;
 
-static void remove_mapping(struct phys_desc *desc)
+static void um_remove_mapping(struct phys_desc *desc)
 {
void *virt = desc->virt;
int err;
@@ -184,7 +184,7 @@ int physmem_remove_mapping(void *virt)
if(desc == NULL)
return 0;
 
-   remove_mapping(desc);
+   um_remove_mapping(desc);
return 1;
 }
 
@@ -205,7 +205,7 @@ void physmem_forget_descriptor(int fd)
page = list_entry(ele, struct phys_desc, list);
offset = page->offset;
addr = page->virt;
-   remove_mapping(page);
+   um_remove_mapping(page);
err = os_seek_file(fd, offset);
if(err)
panic("physmem_forget_descriptor - failed to seek "



[PATCH 20/40] netvm: skb processing

2007-05-04 Thread Peter Zijlstra
In order to make sure emergency packets receive all the memory needed to
proceed, ensure processing of emergency skbs happens under PF_MEMALLOC.

Use the (new) sk_backlog_rcv() wrapper to ensure this for backlog processing.

Skip taps, since those are user-space again.
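
In sketch form (not the exact patch body), the CONFIG_NETVM variant of
sk_backlog_rcv() wraps the per-protocol callback like so:

        int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
        {
                int ret;
                unsigned long pflags = current->flags;

                if (skb_emergency(skb))
                        current->flags |= PF_MEMALLOC;
                ret = sk->sk_backlog_rcv(sk, skb);
                tsk_restore_flags(current, pflags, PF_MEMALLOC);
                return ret;
        }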

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 include/net/sock.h |4 
 net/core/dev.c |   42 +-
 net/core/sock.c|   19 +++
 3 files changed, 60 insertions(+), 5 deletions(-)

Index: linux-2.6-git/net/core/dev.c
===
--- linux-2.6-git.orig/net/core/dev.c
+++ linux-2.6-git/net/core/dev.c
@@ -1756,10 +1756,23 @@ int netif_receive_skb(struct sk_buff *sk
struct net_device *orig_dev;
int ret = NET_RX_DROP;
__be16 type;
+   unsigned long pflags = current->flags;
+
+   /* Emergency skb are special, they should
+*  - be delivered to SOCK_VMIO sockets only
+*  - stay away from userspace
+*  - have bounded memory usage
+*
+* Use PF_MEMALLOC as a poor mans memory pool - the grouping kind.
+* This saves us from propagating the allocation context down to all
+* allocation sites.
+*/
+   if (skb_emergency(skb))
+   current->flags |= PF_MEMALLOC;
 
/* if we've gotten here through NAPI, check netpoll */
if (skb->dev->poll && netpoll_rx(skb))
-   return NET_RX_DROP;
+   goto out;
 
if (!skb->tstamp.tv64)
net_timestamp(skb);
@@ -1770,7 +1783,7 @@ int netif_receive_skb(struct sk_buff *sk
orig_dev = skb_bond(skb);
 
if (!orig_dev)
-   return NET_RX_DROP;
+   goto out;
 
__get_cpu_var(netdev_rx_stat).total++;
 
@@ -1789,6 +1802,9 @@ int netif_receive_skb(struct sk_buff *sk
}
 #endif
 
+   if (skb_emergency(skb))
+   goto skip_taps;
+
list_for_each_entry_rcu(ptype, &ptype_all, list) {
if (!ptype->dev || ptype->dev == skb->dev) {
if (pt_prev)
@@ -1797,6 +1813,7 @@ int netif_receive_skb(struct sk_buff *sk
}
}
 
+skip_taps:
 #ifdef CONFIG_NET_CLS_ACT
if (pt_prev) {
ret = deliver_skb(skb, pt_prev, orig_dev);
@@ -1809,16 +1826,28 @@ int netif_receive_skb(struct sk_buff *sk
 
if (ret == TC_ACT_SHOT || (ret == TC_ACT_STOLEN)) {
kfree_skb(skb);
-   goto out;
+   goto unlock;
}
 
skb->tc_verd = 0;
 ncls:
 #endif
 
+   if (skb_emergency(skb))
+   switch(skb->protocol) {
+   case __constant_htons(ETH_P_ARP):
+   case __constant_htons(ETH_P_IP):
+   case __constant_htons(ETH_P_IPV6):
+   case __constant_htons(ETH_P_8021Q):
+   break;
+
+   default:
+   goto drop;
+   }
+
skb = handle_bridge(skb, &pt_prev, &ret, orig_dev);
if (!skb)
-   goto out;
+   goto unlock;
 
type = skb->protocol;
list_for_each_entry_rcu(ptype, &ptype_base[ntohs(type)&15], list) {
@@ -1833,6 +1862,7 @@ ncls:
if (pt_prev) {
ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
} else {
+drop:
kfree_skb(skb);
/* Jamal, now you will not able to escape explaining
 * me how you were going to use this. :-)
@@ -1840,8 +1870,10 @@ ncls:
ret = NET_RX_DROP;
}
 
-out:
+unlock:
rcu_read_unlock();
+out:
+   tsk_restore_flags(current, pflags, PF_MEMALLOC);
return ret;
 }
 
Index: linux-2.6-git/include/net/sock.h
===
--- linux-2.6-git.orig/include/net/sock.h
+++ linux-2.6-git/include/net/sock.h
@@ -527,10 +527,14 @@ static inline void sk_add_backlog(struct
skb->next = NULL;
 }
 
+#ifndef CONFIG_NETVM
 static inline int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
 {
return sk->sk_backlog_rcv(sk, skb);
 }
+#else
+extern int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb);
+#endif
 
 #define sk_wait_event(__sk, __timeo, __condition)  \
 ({ int rc; \
Index: linux-2.6-git/net/core/sock.c
===
--- linux-2.6-git.orig/net/core/sock.c
+++ linux-2.6-git/net/core/sock.c
@@ -332,6 +332,25 @@ int sk_clear_vmio(struct sock *sk)
 }
 EXPORT_SYMBOL_GPL(sk_clear_vmio);
 
+#ifdef CONFIG_NETVM
+int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
+{
+   if (skb_emergency(skb)) {
+   int ret;
+   unsigned long pflags = current->flags;
+   /* these should have been dropped before queuei

[PATCH 40/40] iscsi: support for swapping over iSCSI.

2007-05-04 Thread Peter Zijlstra
Set blk_queue_swapdev for iSCSI. This method takes care of reserving the
extra memory needed and marking all relevant sockets with SOCK_VMIO.

When used for swapping, TCP socket creation is done under PF_MEMALLOC and
the TCP connect is done with SOCK_VMIO set, to ensure their success.

The netlink userspace interface is also marked SOCK_VMIO; this ensures that
even under pressure we can still communicate with the daemon (which runs
mlockall()'d and needs no additional memory to operate).

Netlink requests are handled under the new PF_MEM_NOWAIT when a swapper is
present. This ensures that the netlink socket will not block. User-space
will need to retry failed requests.

The TCP receive path is handled under PF_MEMALLOC for SOCK_VMIO sockets.
This makes sure we do not block the critical socket, and that we do not
fail to process incoming data.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
CC: Mike Christie <[EMAIL PROTECTED]>
---
 drivers/scsi/Kconfig|   17 
 drivers/scsi/iscsi_tcp.c|   70 ---
 drivers/scsi/libiscsi.c |   18 ++---
 drivers/scsi/qla4xxx/ql4_os.c   |2 -
 drivers/scsi/scsi_transport_iscsi.c |   72 
 include/scsi/scsi_transport_iscsi.h |   12 +-
 6 files changed, 170 insertions(+), 21 deletions(-)

Index: linux-2.6-git/drivers/scsi/iscsi_tcp.c
===
--- linux-2.6-git.orig/drivers/scsi/iscsi_tcp.c 2007-03-26 12:59:39.0 
+0200
+++ linux-2.6-git/drivers/scsi/iscsi_tcp.c  2007-03-26 13:07:54.0 
+0200
@@ -42,6 +42,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "iscsi_tcp.h"
 
@@ -1740,15 +1741,19 @@ iscsi_tcp_ep_connect(struct sockaddr *ds
 {
struct socket *sock;
int rc, size;
+   int swapper = sk_vmio_socks();
+   unsigned long pflags = current->flags;
+
+   if (swapper)
+   pflags |= PF_MEMALLOC;
 
rc = sock_create_kern(dst_addr->sa_family, SOCK_STREAM, IPPROTO_TCP,
  &sock);
if (rc < 0) {
printk(KERN_ERR "Could not create socket %d.\n", rc);
-   return rc;
+   goto out;
}
-   /* TODO: test this with GFP_NOIO */
-   sock->sk->sk_allocation = GFP_ATOMIC;
+   sock->sk->sk_allocation = GFP_NOIO;
 
if (dst_addr->sa_family == PF_INET)
size = sizeof(struct sockaddr_in);
@@ -1765,6 +1770,8 @@ iscsi_tcp_ep_connect(struct sockaddr *ds
 * we don't want it used by user-space at all.
 */
sock_set_flag(sock->sk, SOCK_KERNEL);
+   if (swapper)
+   sk_set_vmio(sock->sk);
 
rc = sock->ops->connect(sock, (struct sockaddr *)dst_addr, size,
O_NONBLOCK);
@@ -1779,11 +1786,14 @@ iscsi_tcp_ep_connect(struct sockaddr *ds
if (rc < 0)
goto release_sock;
*ep_handle = (uint64_t)rc;
-   return 0;
+   rc = 0;
+out:
+   current->flags = pflags;
+   return rc;
 
 release_sock:
sock_release(sock);
-   return rc;
+   goto out;
 }
 
 static struct iscsi_cls_conn *
@@ -1908,8 +1918,13 @@ iscsi_tcp_conn_bind(struct iscsi_cls_ses
sk->sk_reuse = 1;
sk->sk_sndtimeo = 15 * HZ; /* FIXME: make it configurable */
 
+   if (!cls_session->swapper && sk_has_vmio(sk))
+   sk_clear_vmio(sk);
+
/* FIXME: disable Nagle's algorithm */
 
+   BUG_ON(!sk_has_vmio(sk) && cls_session->swapper);
+
/*
 * Intercept TCP callbacks for sendfile like receive
 * processing.
@@ -2167,6 +2182,50 @@ static void iscsi_tcp_session_destroy(st
iscsi_session_teardown(cls_session);
 }
 
+#ifdef CONFIG_ISCSI_TCP_SWAP
+
+#define ISCSI_TCP_RESERVE_PAGES(TX_RESERVE_PAGES)
+
+static int iscsi_tcp_swapdev(void *objp, int enable)
+{
+   int error = 0;
+   struct scsi_device *sdev = objp;
+   struct Scsi_Host *shost = sdev->host;
+   struct iscsi_session *session = iscsi_hostdata(shost->hostdata);
+
+   if (enable) {
+   iscsi_swapdev(session->tt, session_to_cls(session), 1);
+   sk_adjust_memalloc(1, ISCSI_TCP_RESERVE_PAGES);
+   }
+
+   spin_lock(&session->lock);
+   if (session->leadconn) {
+   struct iscsi_tcp_conn *tcp_conn = session->leadconn->dd_data;
+   if (enable)
+   sk_set_vmio(tcp_conn->sock->sk);
+   else
+   sk_clear_vmio(tcp_conn->sock->sk);
+   }
+   spin_unlock(&session->lock);
+
+   if (!enable) {
+   sk_adjust_memalloc(-1, -ISCSI_TCP_RESERVE_PAGES);
+   iscsi_swapdev(session->tt, session_to_cls(session), 0);
+   }
+
+   return error;
+}
+#endif
+
+static int iscsi_tcp_slave_configure(struct scsi_device *sdev)
+{
+#ifdef CONFIG_ISCSI_TCP_SWAP
+   if (sdev->type =

[PATCH 25/40] nfs: remove mempools

2007-05-04 Thread Peter Zijlstra
With the introduction of shared dirty page accounting in 2.6.19, NFS should
not be able to surprise the VM with all dirty pages. Thus it should always be
able to free some memory. Hence there is no more need for mempools.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
Cc: Trond Myklebust <[EMAIL PROTECTED]>
---
 fs/nfs/read.c  |   15 +++
 fs/nfs/write.c |   27 +--
 2 files changed, 8 insertions(+), 34 deletions(-)

Index: linux-2.6-git/fs/nfs/read.c
===
--- linux-2.6-git.orig/fs/nfs/read.c
+++ linux-2.6-git/fs/nfs/read.c
@@ -33,13 +33,10 @@ static const struct rpc_call_ops nfs_rea
 static const struct rpc_call_ops nfs_read_full_ops;
 
 static struct kmem_cache *nfs_rdata_cachep;
-static mempool_t *nfs_rdata_mempool;
-
-#define MIN_POOL_READ  (32)
 
 struct nfs_read_data *nfs_readdata_alloc(unsigned int pagecount)
 {
-   struct nfs_read_data *p = mempool_alloc(nfs_rdata_mempool, GFP_NOFS);
+   struct nfs_read_data *p = kmem_cache_alloc(nfs_rdata_cachep, GFP_NOFS);
 
if (p) {
memset(p, 0, sizeof(*p));
@@ -50,7 +47,7 @@ struct nfs_read_data *nfs_readdata_alloc
else {
p->pagevec = kcalloc(pagecount, sizeof(struct page *), 
GFP_NOFS);
if (!p->pagevec) {
-   mempool_free(p, nfs_rdata_mempool);
+   kmem_cache_free(nfs_rdata_cachep, p);
p = NULL;
}
}
@@ -63,7 +60,7 @@ static void nfs_readdata_rcu_free(struct
struct nfs_read_data *p = container_of(head, struct nfs_read_data, 
task.u.tk_rcu);
if (p && (p->pagevec != &p->page_array[0]))
kfree(p->pagevec);
-   mempool_free(p, nfs_rdata_mempool);
+   kmem_cache_free(nfs_rdata_cachep, p);
 }
 
 static void nfs_readdata_free(struct nfs_read_data *rdata)
@@ -590,16 +587,10 @@ int __init nfs_init_readpagecache(void)
if (nfs_rdata_cachep == NULL)
return -ENOMEM;
 
-   nfs_rdata_mempool = mempool_create_slab_pool(MIN_POOL_READ,
-nfs_rdata_cachep);
-   if (nfs_rdata_mempool == NULL)
-   return -ENOMEM;
-
return 0;
 }
 
 void nfs_destroy_readpagecache(void)
 {
-   mempool_destroy(nfs_rdata_mempool);
kmem_cache_destroy(nfs_rdata_cachep);
 }
Index: linux-2.6-git/fs/nfs/write.c
===
--- linux-2.6-git.orig/fs/nfs/write.c
+++ linux-2.6-git/fs/nfs/write.c
@@ -29,9 +29,6 @@
 
 #define NFSDBG_FACILITYNFSDBG_PAGECACHE
 
-#define MIN_POOL_WRITE (32)
-#define MIN_POOL_COMMIT(4)
-
 /*
  * Local function declarations
  */
@@ -45,12 +42,10 @@ static const struct rpc_call_ops nfs_wri
 static const struct rpc_call_ops nfs_commit_ops;
 
 static struct kmem_cache *nfs_wdata_cachep;
-static mempool_t *nfs_wdata_mempool;
-static mempool_t *nfs_commit_mempool;
 
 struct nfs_write_data *nfs_commit_alloc(void)
 {
-   struct nfs_write_data *p = mempool_alloc(nfs_commit_mempool, GFP_NOFS);
+   struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOFS);
 
if (p) {
memset(p, 0, sizeof(*p));
@@ -64,7 +59,7 @@ void nfs_commit_rcu_free(struct rcu_head
struct nfs_write_data *p = container_of(head, struct nfs_write_data, 
task.u.tk_rcu);
if (p && (p->pagevec != &p->page_array[0]))
kfree(p->pagevec);
-   mempool_free(p, nfs_commit_mempool);
+   kmem_cache_free(nfs_wdata_cachep, p);
 }
 
 void nfs_commit_free(struct nfs_write_data *wdata)
@@ -74,7 +69,7 @@ void nfs_commit_free(struct nfs_write_da
 
 struct nfs_write_data *nfs_writedata_alloc(unsigned int pagecount)
 {
-   struct nfs_write_data *p = mempool_alloc(nfs_wdata_mempool, GFP_NOFS);
+   struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOFS);
 
if (p) {
memset(p, 0, sizeof(*p));
@@ -85,7 +80,7 @@ struct nfs_write_data *nfs_writedata_all
else {
p->pagevec = kcalloc(pagecount, sizeof(struct page *), 
GFP_NOFS);
if (!p->pagevec) {
-   mempool_free(p, nfs_wdata_mempool);
+   kmem_cache_free(nfs_wdata_cachep, p);
p = NULL;
}
}
@@ -98,7 +93,7 @@ static void nfs_writedata_rcu_free(struc
struct nfs_write_data *p = container_of(head, struct nfs_write_data, 
task.u.tk_rcu);
if (p && (p->pagevec != &p->page_array[0]))
kfree(p->pagevec);
-   mempool_free(p, nfs_wdata_mempool);
+   kmem_cache_free(nfs_wdata_cachep, p);
 }
 
 static void nfs_writedata_free(struct nfs_write_data *wdata)
@@ -1465,16 +1460,6 @@ int __init nfs_init_writepagecac

[PATCH 01/40] mm: page allocation rank

2007-05-04 Thread Peter Zijlstra
Introduce page allocation rank.

This allocation rank is a measure of the 'hardness' of the page allocation,
where hardness refers to how deep we have to reach (and thereby whether
reclaim was activated) to obtain the page.

It basically is a mapping from the ALLOC_/gfp flags into a scalar quantity,
which allows for comparisons of the kind: 
  'would this allocation have succeeded using these gfp flags'.

For the gfp -> alloc_flags mapping we use the 'hardest' possible, those
used by __alloc_pages() right before going into direct reclaim.

The alloc_flags -> rank mapping is given by: 2*2^wmark - harder - 2*high
where wmark = { min = 1, low, high } and harder, high are booleans.
This gives:
  0 is the hardest possible allocation - ALLOC_NO_WATERMARKS,
  1 is ALLOC_WMARK_MIN|ALLOC_HARDER|ALLOC_HIGH,
  ...
  15 is ALLOC_WMARK_HIGH|ALLOC_HARDER,
  16 is the softest allocation - ALLOC_WMARK_HIGH.
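
For example, GFP_ATOMIC (__GFP_HIGH set, no __GFP_WAIT) maps to
ALLOC_WMARK_MIN|ALLOC_HARDER|ALLOC_HIGH, i.e. rank 2*2^1 - 1 - 2 = 1, while a
plain GFP_KERNEL allocation, right before entering direct reclaim, maps to
ALLOC_WMARK_MIN alone, i.e. rank 2*2^1 = 4.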

Rank <= 4 will have woken up kswapd and, when also > 0, might have run into
direct reclaim.

Rank > 8 rarely happens and means lots of memory is free (due to a parallel
oom kill).

The allocation rank is stored in page->index for successful allocations.

'offline' testing of the rank is made impossible by direct reclaim and
fragmentation issues. That is, it is impossible to tell if a given allocation
will succeed without actually doing it.

The purpose of this measure is to introduce some fairness into the slab
allocator.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 mm/internal.h   |   70 
 mm/page_alloc.c |   58 +-
 2 files changed, 87 insertions(+), 41 deletions(-)

Index: linux-2.6-git/mm/internal.h
===
--- linux-2.6-git.orig/mm/internal.h2007-02-22 13:56:00.0 +0100
+++ linux-2.6-git/mm/internal.h 2007-02-22 14:08:41.0 +0100
@@ -12,6 +12,7 @@
 #define __MM_INTERNAL_H
 
 #include 
+#include 
 
 static inline void set_page_count(struct page *page, int v)
 {
@@ -37,4 +38,73 @@ static inline void __put_page(struct pag
 extern void fastcall __init __free_pages_bootmem(struct page *page,
unsigned int order);
 
+#define ALLOC_HARDER   0x01 /* try to alloc harder */
+#define ALLOC_HIGH 0x02 /* __GFP_HIGH set */
+#define ALLOC_WMARK_MIN0x04 /* use pages_min watermark */
+#define ALLOC_WMARK_LOW0x08 /* use pages_low watermark */
+#define ALLOC_WMARK_HIGH   0x10 /* use pages_high watermark */
+#define ALLOC_NO_WATERMARKS0x20 /* don't check watermarks at all */
+#define ALLOC_CPUSET   0x40 /* check for correct cpuset */
+
+/*
+ * get the deepest reaching allocation flags for the given gfp_mask
+ */
+static int inline gfp_to_alloc_flags(gfp_t gfp_mask)
+{
+   struct task_struct *p = current;
+   int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
+   const gfp_t wait = gfp_mask & __GFP_WAIT;
+
+   /*
+* The caller may dip into page reserves a bit more if the caller
+* cannot run direct reclaim, or if the caller has realtime scheduling
+* policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
+* set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
+*/
+   if (gfp_mask & __GFP_HIGH)
+   alloc_flags |= ALLOC_HIGH;
+
+   if (!wait) {
+   alloc_flags |= ALLOC_HARDER;
+   /*
+* Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
+* See also cpuset_zone_allowed() comment in kernel/cpuset.c.
+*/
+   alloc_flags &= ~ALLOC_CPUSET;
+   } else if (unlikely(rt_task(p)) && !in_interrupt())
+   alloc_flags |= ALLOC_HARDER;
+
+   if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
+   if (!in_interrupt() &&
+   ((p->flags & PF_MEMALLOC) ||
+unlikely(test_thread_flag(TIF_MEMDIE
+   alloc_flags |= ALLOC_NO_WATERMARKS;
+   }
+
+   return alloc_flags;
+}
+
+#define MAX_ALLOC_RANK 16
+
+/*
+ * classify the allocation: 0 is hardest, 16 is easiest.
+ */
+static inline int alloc_flags_to_rank(int alloc_flags)
+{
+   int rank;
+
+   if (alloc_flags & ALLOC_NO_WATERMARKS)
+   return 0;
+
+   rank = alloc_flags & (ALLOC_WMARK_MIN|ALLOC_WMARK_LOW|ALLOC_WMARK_HIGH);
+   rank -= alloc_flags & (ALLOC_HARDER|ALLOC_HIGH);
+
+   return rank;
+}
+
+static inline int gfp_to_rank(gfp_t gfp_mask)
+{
+   return alloc_flags_to_rank(gfp_to_alloc_flags(gfp_mask));
+}
+
 #endif
Index: linux-2.6-git/mm/page_alloc.c
===
--- linux-2.6-git.orig/mm/page_alloc.c  2007-02-22 13:56:00.0 +0100
+++ linux-2.6-git/mm/page_alloc.c   2007-02-22 14:08:41.0 +0100
@@ -892,14 +892,6 @@ failed:
   

[PATCH 32/40] block: add a swapdev callback to the request_queue

2007-05-04 Thread Peter Zijlstra
Networked storage devices need a swap-on/off callback in order to set up
some state and reserve memory. Place the block device callback in the
request_queue as suggested by James Bottomley.
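
A driver-side sketch (the driver, callback and object names are hypothetical;
sk_adjust_memalloc() and TX_RESERVE_PAGES come from the netvm patches in this
series):

        static int mydrv_swapdev(void *data, int enable)
        {
                /* 'data' is whatever object was registered below */
                if (enable)
                        sk_adjust_memalloc(1, TX_RESERVE_PAGES);
                else
                        sk_adjust_memalloc(-1, -TX_RESERVE_PAGES);
                return 0;
        }

        /* at queue setup time */
        blk_queue_swapdev(disk->queue, mydrv_swapdev, mydrv);

sys_swapon() then invokes the callback with enable=1 and sys_swapoff() with
enable=0, as the hunks below show.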

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
Cc: Jens Axboe <[EMAIL PROTECTED]>
Cc: James Bottomley <[EMAIL PROTECTED]>
---
 include/linux/blkdev.h |   19 +++
 mm/swapfile.c  |4 
 2 files changed, 23 insertions(+)

Index: linux-2.6-git/include/linux/blkdev.h
===
--- linux-2.6-git.orig/include/linux/blkdev.h   2007-01-08 11:53:13.0 
+0100
+++ linux-2.6-git/include/linux/blkdev.h2007-01-16 14:14:50.0 
+0100
@@ -341,6 +341,7 @@ typedef int (merge_bvec_fn) (request_que
 typedef int (issue_flush_fn) (request_queue_t *, struct gendisk *, sector_t *);
 typedef void (prepare_flush_fn) (request_queue_t *, struct request *);
 typedef void (softirq_done_fn)(struct request *);
+typedef int (swapdev_fn)(void*, int);
 
 enum blk_queue_state {
Queue_down,
@@ -379,6 +380,8 @@ struct request_queue
issue_flush_fn  *issue_flush_fn;
prepare_flush_fn*prepare_flush_fn;
softirq_done_fn *softirq_done_fn;
+   swapdev_fn  *swapdev_fn;
+   void*swapdev_obj;
 
/*
 * Dispatch queue sorting
@@ -766,6 +769,22 @@ request_queue_t *blk_alloc_queue(gfp_t);
 request_queue_t *blk_alloc_queue_node(gfp_t, int);
 extern void blk_put_queue(request_queue_t *);
 
+static inline
+void blk_queue_swapdev(struct request_queue *rq,
+  swapdev_fn *swapdev_fn, void *swapdev_obj)
+{
+   rq->swapdev_fn = swapdev_fn;
+   rq->swapdev_obj = swapdev_obj;
+}
+
+static inline
+int blk_queue_swapdev_fn(struct request_queue *rq, int enable)
+{
+   if (rq->swapdev_fn)
+   return rq->swapdev_fn(rq->swapdev_obj, enable);
+   return 0;
+}
+
 /*
  * tag stuff
  */
Index: linux-2.6-git/mm/swapfile.c
===
--- linux-2.6-git.orig/mm/swapfile.c2007-01-15 09:59:02.0 +0100
+++ linux-2.6-git/mm/swapfile.c 2007-01-16 14:14:50.0 +0100
@@ -1305,6 +1305,7 @@ asmlinkage long sys_swapoff(const char _
inode = mapping->host;
if (S_ISBLK(inode->i_mode)) {
struct block_device *bdev = I_BDEV(inode);
+   blk_queue_swapdev_fn(bdev->bd_disk->queue, 0);
set_blocksize(bdev, p->old_block_size);
bd_release(bdev);
} else {
@@ -1524,6 +1525,9 @@ asmlinkage long sys_swapon(const char __
error = set_blocksize(bdev, PAGE_SIZE);
if (error < 0)
goto bad_swap;
+   error = blk_queue_swapdev_fn(bdev->bd_disk->queue, 1);
+   if (error < 0)
+   goto bad_swap;
p->bdev = bdev;
} else if (S_ISREG(inode->i_mode)) {
p->bdev = inode->i_sb->s_bdev;



[PATCH 10/40] selinux: tag avc cache alloc as non-critical

2007-05-04 Thread Peter Zijlstra
Failing to allocate a cache entry will only harm performance, not correctness.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
Acked-by: James Morris <[EMAIL PROTECTED]>
---
 security/selinux/avc.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6-git/security/selinux/avc.c
===
--- linux-2.6-git.orig/security/selinux/avc.c   2007-02-14 08:31:13.0 
+0100
+++ linux-2.6-git/security/selinux/avc.c2007-02-14 10:10:47.0 
+0100
@@ -332,7 +332,7 @@ static struct avc_node *avc_alloc_node(v
 {
struct avc_node *node;
 
-   node = kmem_cache_zalloc(avc_node_cachep, GFP_ATOMIC);
+   node = kmem_cache_zalloc(avc_node_cachep, GFP_ATOMIC|__GFP_NOMEMALLOC);
if (!node)
goto out;
 



[PATCH 04/40] mm: serialize access to min_free_kbytes

2007-05-04 Thread Peter Zijlstra
There is a small race between the procfs caller and the memory hotplug caller
of setup_per_zone_pages_min(). Not a big deal, but the next patch will add yet
another caller. Time to close the gap.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 mm/page_alloc.c |   16 +---
 1 file changed, 13 insertions(+), 3 deletions(-)

Index: linux-2.6-git/mm/page_alloc.c
===
--- linux-2.6-git.orig/mm/page_alloc.c  2007-01-15 09:58:49.0 +0100
+++ linux-2.6-git/mm/page_alloc.c   2007-01-15 09:58:51.0 +0100
@@ -95,6 +95,7 @@ static char * const zone_names[MAX_NR_ZO
 #endif
 };
 
+static DEFINE_SPINLOCK(min_free_lock);
 int min_free_kbytes = 1024;
 
 unsigned long __meminitdata nr_kernel_pages;
@@ -3074,12 +3075,12 @@ static void setup_per_zone_lowmem_reserv
 }
 
 /**
- * setup_per_zone_pages_min - called when min_free_kbytes changes.
+ * __setup_per_zone_pages_min - called when min_free_kbytes changes.
  *
  * Ensures that the pages_{min,low,high} values for each zone are set correctly
  * with respect to min_free_kbytes.
  */
-void setup_per_zone_pages_min(void)
+static void __setup_per_zone_pages_min(void)
 {
unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
unsigned long lowmem_pages = 0;
@@ -3133,6 +3134,15 @@ void setup_per_zone_pages_min(void)
calculate_totalreserve_pages();
 }
 
+void setup_per_zone_pages_min(void)
+{
+   unsigned long flags;
+
+   spin_lock_irqsave(&min_free_lock, flags);
+   __setup_per_zone_pages_min();
+   spin_unlock_irqrestore(&min_free_lock, flags);
+}
+
 /*
  * Initialise min_free_kbytes.
  *
@@ -3168,7 +3178,7 @@ static int __init init_per_zone_pages_mi
min_free_kbytes = 128;
if (min_free_kbytes > 65536)
min_free_kbytes = 65536;
-   setup_per_zone_pages_min();
+   __setup_per_zone_pages_min();
setup_per_zone_lowmem_reserve();
return 0;
 }



[PATCH 14/40] netvm: link network to vm layer

2007-05-04 Thread Peter Zijlstra
Hook up networking to the memory reserve.

There are two kinds of reserves: skb and aux. 
 - skb reserves are used for incoming packets,
 - aux reserves are used for processing these packets.

The consumers for these reserves are sockets marked with:
  SOCK_VMIO

Such sockets are to be used to service the VM (iow. to swap over). They
must be handled kernel-side; exposing such a socket to user-space is a BUG.
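
As a rough sketch of how the RX side is expected to account against the skb
reserve (this is not a hunk from this patch; __GFP_EMERGENCY comes from
earlier patches in the series):

        if (!rx_emergency_get(size))
                return NULL;                    /* reserve exhausted */
        data = kmalloc(size, gfp_mask | __GFP_EMERGENCY);
        if (!data)
                rx_emergency_put(size);         /* hand the budget back */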

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 include/net/sock.h |   43 
 net/Kconfig|3 +
 net/core/sock.c|  135 +
 3 files changed, 180 insertions(+), 1 deletion(-)

Index: linux-2.6-git/include/net/sock.h
===
--- linux-2.6-git.orig/include/net/sock.h
+++ linux-2.6-git/include/net/sock.h
@@ -49,6 +49,7 @@
 #include   /* struct sk_buff */
 #include 
 #include 
+#include 
 
 #include 
 
@@ -393,6 +394,7 @@ enum sock_flags {
SOCK_RCVTSTAMPNS, /* %SO_TIMESTAMPNS setting */
SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
+   SOCK_VMIO, /* the VM depends on us - make sure we're serviced */
 };
 
 static inline void sock_copy_flags(struct sock *nsk, struct sock *osk)
@@ -415,9 +417,48 @@ static inline int sock_flag(struct sock 
return test_bit(flag, &sk->sk_flags);
 }
 
+static inline int sk_has_vmio(struct sock *sk)
+{
+   return sock_flag(sk, SOCK_VMIO);
+}
+
+/*
+ * Guestimate the per request queue TX upper bound.
+ *
+ * Max packet size is 64k, and we need to reserve that much since the data
+ * might need to bounce it. Double it to be on the safe side.
+ */
+#define TX_RESERVE_PAGES DIV_ROUND_UP(2*65536, PAGE_SIZE)
+
+extern atomic_t vmio_socks;
+
+static inline int sk_vmio_socks(void)
+{
+   return atomic_read(&vmio_socks);
+}
+
+extern int rx_emergency_get(int bytes);
+extern int rx_emergency_get_overcommit(int bytes);
+extern void rx_emergency_put(int bytes);
+
+static inline
+int guess_kmem_cache_pages(struct kmem_cache *cachep, int nr_objs)
+{
+   int guess = DIV_ROUND_UP((kmem_cache_objsize(cachep) * nr_objs),
+   PAGE_SIZE);
+   guess += ilog2(guess);
+   return guess;
+}
+
+extern void sk_adjust_memalloc(int socks, int tx_reserve_pages);
+extern void skb_reserve_memory(int skb_reserve_bytes);
+extern void aux_reserve_memory(int aux_reserve_pages);
+extern int sk_set_vmio(struct sock *sk);
+extern int sk_clear_vmio(struct sock *sk);
+
 static inline gfp_t sk_allocation(struct sock *sk, gfp_t gfp_mask)
 {
-   return gfp_mask;
+   return gfp_mask | (sk->sk_allocation & __GFP_EMERGENCY);
 }
 
 static inline void sk_acceptq_removed(struct sock *sk)
Index: linux-2.6-git/net/core/sock.c
===
--- linux-2.6-git.orig/net/core/sock.c
+++ linux-2.6-git/net/core/sock.c
@@ -112,6 +112,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -198,6 +199,139 @@ __u32 sysctl_rmem_default __read_mostly 
 /* Maximal space eaten by iovec or ancilliary data plus some space */
 int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
 
+static atomic_t rx_emergency_bytes;
+
+static int skb_reserve_bytes;
+static int aux_reserve_pages;
+
+static DEFINE_SPINLOCK(memalloc_lock);
+static int rx_net_reserve;
+atomic_t vmio_socks;
+EXPORT_SYMBOL_GPL(vmio_socks);
+
+/*
+ * is there room for another emergency packet?
+ * we account in power of two units to approx the slab allocator.
+ */
+static int __rx_emergency_get(int bytes, bool overcommit)
+{
+   int size = roundup_pow_of_two(bytes);
+   int nr = atomic_add_return(size, &rx_emergency_bytes);
+   int thresh = 2 * skb_reserve_bytes;
+   if (nr < thresh || overcommit)
+   return 1;
+
+   atomic_dec(&rx_emergency_bytes);
+   return 0;
+}
+
+int rx_emergency_get(int bytes)
+{
+   return __rx_emergency_get(bytes, false);
+}
+
+int rx_emergency_get_overcommit(int bytes)
+{
+   return __rx_emergency_get(bytes, true);
+}
+
+void rx_emergency_put(int bytes)
+{
+   int size = roundup_pow_of_two(bytes);
+   return atomic_sub(size, &rx_emergency_bytes);
+}
+
+/**
+ * sk_adjust_memalloc - adjust the global memalloc reserve for critical RX
+ * @socks: number of new %SOCK_VMIO sockets
+ * @tx_resserve_pages: number of pages to (un)reserve for TX
+ *
+ * This function adjusts the memalloc reserve based on system demand.
+ * The RX reserve is a limit, and only added once, not for each socket.
+ *
+ * NOTE:
+ *@tx_reserve_pages is an upper-bound of memory used for TX hence
+ *we need not account the pages like we do for RX pages.
+ */
+void sk_adjust_memalloc(int socks, int tx_reserve_pages)
+{
+   unsigned long flags;
+   int reserve = tx_reserve_pages;
+   int 

[PATCH 22/40] mm: prepare swap entry methods for use in page methods

2007-05-04 Thread Peter Zijlstra
Move around the swap entry methods in preparation for use from
page methods.

Also provide a function to obtain the swap_info_struct backing
a swap cache page.
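
For reference, the encode/decode roundtrip these helpers provide (values are
arbitrary):

        swp_entry_t entry = swp_entry(2, 1234);         /* type 2, offset 1234 */

        BUG_ON(swp_type(entry) != 2);
        BUG_ON(swp_offset(entry) != 1234);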

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
CC: Trond Myklebust <[EMAIL PROTECTED]>
---
 include/linux/mm.h  |8 
 include/linux/swap.h|   48 
 include/linux/swapops.h |   44 
 mm/swapfile.c   |1 +
 4 files changed, 57 insertions(+), 44 deletions(-)

Index: linux-2.6-git/include/linux/mm.h
===
--- linux-2.6-git.orig/include/linux/mm.h   2007-02-21 12:15:00.0 
+0100
+++ linux-2.6-git/include/linux/mm.h2007-02-21 12:15:01.0 +0100
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct mempolicy;
 struct anon_vma;
@@ -586,6 +587,13 @@ static inline struct address_space *page
return mapping;
 }
 
+static inline struct swap_info_struct *page_swap_info(struct page *page)
+{
+   swp_entry_t swap = { .val = page_private(page) };
+   BUG_ON(!PageSwapCache(page));
+   return get_swap_info_struct(swp_type(swap));
+}
+
 static inline int PageAnon(struct page *page)
 {
return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
Index: linux-2.6-git/include/linux/swap.h
===
--- linux-2.6-git.orig/include/linux/swap.h 2007-02-21 12:15:00.0 
+0100
+++ linux-2.6-git/include/linux/swap.h  2007-02-21 12:15:01.0 +0100
@@ -79,6 +79,50 @@ typedef struct {
 } swp_entry_t;
 
 /*
+ * swapcache pages are stored in the swapper_space radix tree.  We want to
+ * get good packing density in that tree, so the index should be dense in
+ * the low-order bits.
+ *
+ * We arrange the `type' and `offset' fields so that `type' is at the five
+ * high-order bits of the swp_entry_t and `offset' is right-aligned in the
+ * remaining bits.
+ *
+ * swp_entry_t's are *never* stored anywhere in their arch-dependent format.
+ */
+#define SWP_TYPE_SHIFT(e)  (sizeof(e.val) * 8 - MAX_SWAPFILES_SHIFT)
+#define SWP_OFFSET_MASK(e) ((1UL << SWP_TYPE_SHIFT(e)) - 1)
+
+/*
+ * Store a type+offset into a swp_entry_t in an arch-independent format
+ */
+static inline swp_entry_t swp_entry(unsigned long type, pgoff_t offset)
+{
+   swp_entry_t ret;
+
+   ret.val = (type << SWP_TYPE_SHIFT(ret)) |
+   (offset & SWP_OFFSET_MASK(ret));
+   return ret;
+}
+
+/*
+ * Extract the `type' field from a swp_entry_t.  The swp_entry_t is in
+ * arch-independent format
+ */
+static inline unsigned swp_type(swp_entry_t entry)
+{
+   return (entry.val >> SWP_TYPE_SHIFT(entry));
+}
+
+/*
+ * Extract the `offset' field from a swp_entry_t.  The swp_entry_t is in
+ * arch-independent format
+ */
+static inline pgoff_t swp_offset(swp_entry_t entry)
+{
+   return entry.val & SWP_OFFSET_MASK(entry);
+}
+
+/*
  * current->reclaim_state points to one of these when a task is running
  * memory reclaim
  */
@@ -326,6 +370,10 @@ static inline int valid_swaphandles(swp_
return 0;
 }
 
+static inline struct swap_info_struct *get_swap_info_struct(unsigned type)
+{
+   return NULL;
+}
 #define can_share_swap_page(p) (page_mapcount(p) == 1)
 
 static inline int move_to_swap_cache(struct page *page, swp_entry_t entry)
Index: linux-2.6-git/include/linux/swapops.h
===
--- linux-2.6-git.orig/include/linux/swapops.h  2007-02-21 12:15:00.0 
+0100
+++ linux-2.6-git/include/linux/swapops.h   2007-02-21 12:15:01.0 
+0100
@@ -1,48 +1,4 @@
 /*
- * swapcache pages are stored in the swapper_space radix tree.  We want to
- * get good packing density in that tree, so the index should be dense in
- * the low-order bits.
- *
- * We arrange the `type' and `offset' fields so that `type' is at the five
- * high-order bits of the swp_entry_t and `offset' is right-aligned in the
- * remaining bits.
- *
- * swp_entry_t's are *never* stored anywhere in their arch-dependent format.
- */
-#define SWP_TYPE_SHIFT(e)  (sizeof(e.val) * 8 - MAX_SWAPFILES_SHIFT)
-#define SWP_OFFSET_MASK(e) ((1UL << SWP_TYPE_SHIFT(e)) - 1)
-
-/*
- * Store a type+offset into a swp_entry_t in an arch-independent format
- */
-static inline swp_entry_t swp_entry(unsigned long type, pgoff_t offset)
-{
-   swp_entry_t ret;
-
-   ret.val = (type << SWP_TYPE_SHIFT(ret)) |
-   (offset & SWP_OFFSET_MASK(ret));
-   return ret;
-}
-
-/*
- * Extract the `type' field from a swp_entry_t.  The swp_entry_t is in
- * arch-independent format
- */
-static inline unsigned swp_type(swp_entry_t entry)
-{
-   return (entry.val >> SWP_TYPE_SHIFT(entry));
-}
-
-/*
- * Extract the `offset' field from a swp_entry_t.  The swp_entry_t is in
- * arch-independent format
- */
-s

[PATCH 36/40] iscsi: fixup of the ep_connect patch

2007-05-04 Thread Peter Zijlstra
Make sure a malicious user-space program cannot crash the kernel module
by prematurely closing the filedesc.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
Acked-by: Mike Christie <[EMAIL PROTECTED]>
---
 drivers/scsi/iscsi_tcp.c |   23 +++
 1 file changed, 19 insertions(+), 4 deletions(-)

Index: linux-2.6-git/drivers/scsi/iscsi_tcp.c
===
--- linux-2.6-git.orig/drivers/scsi/iscsi_tcp.c 2007-01-16 14:15:50.0 
+0100
+++ linux-2.6-git/drivers/scsi/iscsi_tcp.c  2007-01-16 14:24:05.0 
+0100
@@ -1830,11 +1830,25 @@ tcp_conn_alloc_fail:
 }
 
 static void
+iscsi_tcp_release_conn(struct iscsi_conn *conn)
+{
+   struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
+
+   if (!tcp_conn->sock)
+   return;
+
+   sockfd_put(tcp_conn->sock);
+   tcp_conn->sock = NULL;
+   conn->recv_lock = NULL;
+}
+
+static void
 iscsi_tcp_conn_destroy(struct iscsi_cls_conn *cls_conn)
 {
struct iscsi_conn *conn = cls_conn->dd_data;
struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
 
+   iscsi_tcp_release_conn(conn);
iscsi_conn_teardown(cls_conn);
if (tcp_conn->tx_hash.tfm)
crypto_free_hash(tcp_conn->tx_hash.tfm);
@@ -1851,6 +1865,7 @@ iscsi_tcp_conn_stop(struct iscsi_cls_con
struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
 
iscsi_conn_stop(cls_conn, flag);
+   iscsi_tcp_release_conn(conn);
tcp_conn->hdr_size = sizeof(struct iscsi_hdr);
 }
 
@@ -1873,8 +1888,10 @@ iscsi_tcp_conn_bind(struct iscsi_cls_ses
}
 
err = iscsi_conn_bind(cls_session, cls_conn, is_leading, transport_eph);
-   if (err)
-   goto done;
+   if (err) {
+   sockfd_put(sock);
+   return err;
+   }
 
/* bind iSCSI connection and socket */
tcp_conn->sock = sock;
@@ -1898,8 +1915,6 @@ iscsi_tcp_conn_bind(struct iscsi_cls_ses
 */
tcp_conn->in_progress = IN_PROGRESS_WAIT_HEADER;
 
-done:
-   sockfd_put(sock);
return err;
 }
 



[PATCH 09/40] mm: optimize gfp_to_rank()

2007-05-04 Thread Peter Zijlstra
The gfp_to_rank() call in the slab allocator severely impacts performance.
Hence reduce it to the bone, keeping only what is needed to make the reserve
work.

[more AIM9 results go here]

 AIM9 test          2.6.21-rc5            2.6.21-rc5-slab1        diff
                                          CONFIG_SLAB_FAIR=y

 54 tcp_test      2124.48 +/-  10.85     2137.43 +/-   9.22       12.95
 55 udp_test      5204.43 +/-  45.13     5231.59 +/-  56.66       27.16
 56 fifo_test    20991.42 +/-  46.71    19675.97 +/-  56.35     1315.44
 57 stream_pipe  10024.16 +/- 119.88     9912.53 +/-  75.52      111.63
 58 dgram_pipe    9460.18 +/- 119.50     9502.75 +/-  89.06       42.57
 59 pipe_cpy     30719.81 +/- 117.01    27885.52 +/-  46.81     2834.28

 AIM9 test          2.6.21-rc5            2.6.21-rc5-slab2        diff
                                          CONFIG_SLAB_FAIR=y

 54 tcp_test      2124.48 +/-  10.85     2122.80 +/-   4.70        1.68
 55 udp_test      5204.43 +/-  45.13     5136.98 +/-  62.31       67.45
 56 fifo_test    20991.42 +/-  46.71    19646.81 +/-  53.61     1344.60
 57 stream_pipe  10024.16 +/- 119.88     9940.87 +/- 280.73       83.29
 58 dgram_pipe    9460.18 +/- 119.50     9432.69 +/- 250.27       27.49
 59 pipe_cpy     30719.81 +/- 117.01    27870.70 +/-  65.50     2849.10

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 mm/internal.h |   33 +++--
 1 file changed, 31 insertions(+), 2 deletions(-)

Index: linux-2.6-git/mm/internal.h
===
--- linux-2.6-git.orig/mm/internal.h2007-02-22 14:09:39.0 +0100
+++ linux-2.6-git/mm/internal.h 2007-02-22 14:24:34.0 +0100
@@ -105,9 +105,38 @@ static inline int alloc_flags_to_rank(in
return rank;
 }
 
-static inline int gfp_to_rank(gfp_t gfp_mask)
+static __always_inline int gfp_to_rank(gfp_t gfp_mask)
 {
-   return alloc_flags_to_rank(gfp_to_alloc_flags(gfp_mask));
+   /*
+* Although correct this full version takes a ~3% performance hit
+* on the network test in aim9.
+*
+* return alloc_flags_to_rank(gfp_to_alloc_flags(gfp_mask));
+*
+* So we go cheat a little. We'll only focus on the correctness of
+* rank 0.
+*/
+
+   if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
+   if (gfp_mask & __GFP_EMERGENCY)
+   return 0;
+   else if (!in_irq() && (current->flags & PF_MEMALLOC))
+   return 0;
+   /*
+* We skip the TIF_MEMDIE test:
+*
+* if (!in_interrupt() && unlikely(test_thread_flag(TIF_MEMDIE)))
+*  return 0;
+*
+* this will force an alloc but since we are allowed the memory
+* that will succeed. This will make this very rare occurence
+* very expensive when under severe memory pressure, but it
+* seems a valid tradeoff.
+*/
+   }
+
+   /* Cheat by lumping everybody else in rank 1. */
+   return 1;
 }
 
 #endif


