Re: Larger dev_t

2001-04-03 Thread Alexander Viro



On Tue, 3 Apr 2001, Richard Gooch wrote:

> However, a large number of people run devfs on small to large systems,
> and these "races" aren't causing problems. People tell me it's quite
> stable. I run devfs on my systems, and not once have I had a problem
> due to devfs "races". So I feel it's quite unfair to paint such a dire
> picture (I'm referring to Martin's comments here, not Alan's).

And _that_ approach is the reason why I absolutely refuse to run your code
on any of my boxen.  Sorry.  If devfs (without serious cleanup) will become
mandatory I'll fork the tree - better backporting patches to Linus' one than
depending on current devfs.  You've been sitting on known (and easily fixable)
bugs and asking to leave fixing them to you for what, 10 months already?
Furrfu...  You are maintainer of that code.  You keep insisting on having
everything and a kitchen sink in the devfs and refuse to split the
functionality into reasonable pieces.  Essentially you are saying that it's
all or nothing deal.  Fine with me - out of these options I certainly
prefer the latter.
Al

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Linux 2.4.3-ac2

2001-04-03 Thread Alexander Viro


Alan, please, replace the unmap_buffer() in fs/buffer.c with

static void unmap_buffer(struct buffer_head * bh)
{
        if (buffer_mapped(bh)) {
                mark_buffer_clean(bh);
                lock_buffer(bh);
                clear_bit(BH_Uptodate, &bh->b_state);
                clear_bit(BH_Mapped, &bh->b_state);
                clear_bit(BH_Req, &bh->b_state);
                clear_bit(BH_New, &bh->b_state);
                unlock_buffer(bh);
        }
}

Current tree has wait_on_buffer() instead of lock/unlock, which is racy on
SMP.




Re: [QUESTION] MOD_INC/MOD_DEC: useful to check for correct usage?

2001-04-04 Thread Alexander Viro



On Wed, 4 Apr 2001, Dawson Engler wrote:

> Hi,
> 
> in the old days you couldn't call a sleeping function in a module
> before doing a MOD_INC or after doing a MOD_DEC.  Then some safety nets
> were added that made these obsolete (in some number of places).  I was
> told that people had decided to potentially get rid of all safety
> nets.  Is this true?  Is it worthwhile to have a checker for these two
> rules?

It's worth removing the MOD_{INC,DEC}_USE_COUNT. Which had been done
in quite a few places. Let the caller handle the refcount on callee -
_that_ is definitely safe.
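
The idea of letting the caller pin the callee can be sketched in userspace. This is only a model under stated assumptions: every name below (model_module, model_ops, model_try_get and so on) is hypothetical and not a kernel API, and the counter is a plain int, so the sketch shows where the refcount is taken, not how to make it SMP-safe.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct model_module {
        int use_count;  /* number of users currently pinning the module */
        int unloading;  /* set once unload has begun; new users are refused */
};

struct model_ops {
        struct model_module *owner;  /* module providing open(), or NULL */
        int (*open)(void);           /* callee: contains no refcount code */
};

/* Pin the module on behalf of the caller. Returns 1 on success, 0 if refused. */
int model_try_get(struct model_module *m)
{
        if (m == NULL)
                return 1;  /* built-in code, nothing to pin */
        if (m->unloading)
                return 0;
        m->use_count++;
        return 1;
}

/* Drop a pin taken by model_try_get(). */
void model_put(struct model_module *m)
{
        if (m == NULL)
                return;
        assert(m->use_count > 0);
        m->use_count--;
}

/* The caller takes the reference before entering the module and drops it
 * again if open() fails. Returns 0 on success, -1 on failure. */
int model_call_open(struct model_ops *ops)
{
        int err;

        if (!model_try_get(ops->owner))
                return -1;
        err = ops->open();
        if (err)
                model_put(ops->owner);
        return err;
}

/* Counterpart of a successful model_call_open(). */
void model_call_release(struct model_ops *ops)
{
        model_put(ops->owner);
}

/* Two demo callees: neither touches the use count. */
int demo_open_ok(void)   { return 0; }
int demo_open_fail(void) { return -1; }
```

Since the count changes only in the caller, the callee is pinned for the whole of the call, including any point at which it might sleep.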




Re: which gcc version?

2001-04-04 Thread Alexander Viro



On Thu, 5 Apr 2001, Manoj Sontakke wrote:

> Hi
>   I am getting linker error "undefined reference to __divdi3".
> This is because c = a/b; where a,b,c are of type "long long"
> I understand this is gcc problem.
>   I am doing this on a pentium with gcc -v = egcs-2.91.66

Don't do it in the kernel. It has nothing to do with the gcc version.
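
For context: on a 32-bit target gcc turns a "long long" division into a call to the libgcc helper __divdi3, and on i386 the kernel is not linked against libgcc, hence the undefined reference. When the divisor fits in 32 bits, kernel code usually reaches for do_div() from <asm/div64.h>, which divides a 64-bit value in place and returns the remainder. The sketch below is a userspace model of that calling convention, not the kernel macro; model_do_div is a hypothetical name, and it handles unsigned values only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the do_div() calling convention: divide *n by a
 * 32-bit base in place and return the remainder. Written as bitwise long
 * division, so it uses only shifts, compares and subtraction and contains
 * no 64-bit '/' or '%' operator. */
uint32_t model_do_div(uint64_t *n, uint32_t base)
{
        uint64_t quotient = 0;
        uint64_t rem = 0;
        int bit;

        assert(base != 0);
        for (bit = 63; bit >= 0; bit--) {
                /* Bring down the next bit of the dividend. */
                rem = (rem << 1) | ((*n >> bit) & 1);
                if (rem >= base) {
                        rem -= base;
                        quotient |= (uint64_t)1 << bit;
                }
        }
        *n = quotient;
        return (uint32_t)rem;
}
```

A caller would write "rem = model_do_div(&a, b);" and then read the quotient back from a, which mirrors how do_div() is used.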




Re: race condition on looking up inodes

2001-04-08 Thread Alexander Viro



On Mon, 9 Apr 2001, warren wrote:

> 
> 
> Hi,
> I posted a similar message before.
> Thanks for the reply from Albert D. Cahalan, but I found some results
> confusing.
> For example, process 1 and process 2 run concurrently and execute the
> following system calls.
>
> rename("/usr/hybrid/cfg/data","/usr/mytemp/data1"); /* for process 1 */
>
> rename("/usr/mytemp/data1","/usr/test"); /* for process 2 */
>
> It is possible that a context switch happens while process 1 is looking up
> the inode for "/usr/mytemp/data1" or the inode for "/usr/hybrid/cfg/data".
> This will result in different behaviour for process 2 and confuse the
> application.
>
> If so, how does Linux solve this?

Solves what, precisely? Result depends on the order of these calls. If
you don't provide any serialization - you get timing-dependent results
you were asking for. What's the problem and what behaviour do you expect?

Besides, what's the difference caused by the moment of context switch?




Re: [RFC] exec_via_sudo

2001-04-10 Thread Alexander Viro



On Tue, 10 Apr 2001, kees wrote:

> Hi
> 
> Unix/Linux have a lot of daemons that have to run as root because they
> need to access some specific data or run special programs. They are
> vulnerable as we learn.
> Is there any way to have something like an exec call that is
> subject to a sudo like permission system? That would run the daemons
> as a normal user but allow only for specific functions i.e. NOT A SHELL.
> comments?

Thou shalt not put policy into the kernel.




Re: Fwd: Re: memory usage - dentry_cacheg

2001-04-11 Thread Alexander Viro



On Wed, 11 Apr 2001, Andreas Dilger wrote:

> I just discovered a similar problem when testing Daniel Philip's new ext2
> directory indexing code with bonnie++.  I was running bonnie under single
> user mode (basically nothing else running) to create 100k files with 1 data
> block each (in a single directory).  This would create a directory about
> 8MB in size, 32MB of dirty inode tables, and about 400M of dirty buffers.
> I have 128MB RAM, no swap for the testing.
> 
> In short order, my single user shell was OOM killed, and in another test
> bonnie was OOM-killed (even though the process itself is only 8MB in size).
> There were 80k entries each of icache and dcache (38MB and 10MB respectively)
> and only dirty buffers otherwise.  Clearly we need some VM pressure on the
> icache and dcache in this case.  Probably also need more aggressive flushing
> of dirty buffers before invoking OOM.

We _have_ VM pressure there. However, such loads had never been used, so
there's no wonder that system gets unbalanced under them.

I suspect that simple replacement of goto next; with continue; in the
fs/dcache.c::prune_dcache() may make situation seriously better.

Al




[CFT][PATCH] Re: Fwd: Re: memory usage - dentry_cache

2001-04-11 Thread Alexander Viro



On Thu, 12 Apr 2001, Jeff Garzik wrote:

> Alexander Viro wrote:
> > We _have_ VM pressure there. However, such loads had never been used, so
> > there's no wonder that system gets unbalanced under them.
> > 
> > I suspect that simple replacement of goto next; with continue; in the
> > fs/dcache.c::prune_dcache() may make situation seriously better.
> 
> Awesome.  With the obvious patch attached, some local ramfs problems
> disappeared, and my browser and e-mail program are no longer swapped out
> when doing a kernel build.
> 
> Thanks :)

OK, how about wider testing? Theory: prune_dcache() goes through the
list of immediately killable dentries and tries to free a given amount.
It has a "one warning" policy - it kills a dentry if it sees it twice without
a lookup finding that dentry in the interval. Unfortunately, as implemented
it stops once it has freed _or_ warned the given amount. As a result, memory
pressure on the dcache is less than expected.

Patch being:
--- fs/dcache.c Sun Apr  1 23:57:19 2001
+++ /tmp/dcache.c   Thu Apr 12 03:07:39 2001
@@ -340,7 +340,7 @@
                 if (dentry->d_flags & DCACHE_REFERENCED) {
                         dentry->d_flags &= ~DCACHE_REFERENCED;
                         list_add(&dentry->d_lru, &dentry_unused);
-                        goto next;
+                        continue;
                 }
                 dentry_stat.nr_unused--;
 
@@ -349,7 +349,6 @@
                         BUG();
 
                 prune_one_dentry(dentry);
-        next:
                 if (!--count)
                         break;
         }
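
The effect of the one-line change can be seen in a small userspace model. This is a sketch under stated assumptions: model_lru and model_prune are hypothetical names, the list is a plain array with the tail at the highest index, and count is assumed to be positive. With count_warnings set the loop behaves like the old "goto next" code, otherwise like the patched "continue" code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_MAX 64

/* Hypothetical model of the unused-dentry list. Index 0 is the head and
 * index len - 1 is the tail; each slot only records whether that entry
 * carries the "referenced" mark. */
struct model_lru {
        int referenced[MODEL_MAX];
        size_t len;
};

/* Try to free 'count' entries, scanning from the tail.
 * count_warnings != 0: a warned entry also uses up budget (old behaviour).
 * count_warnings == 0: only freed entries use up budget (patched behaviour).
 * Returns the number of entries actually freed. */
int model_prune(struct model_lru *lru, int count, int count_warnings)
{
        int freed = 0;

        assert(count > 0);
        while (lru->len > 0 && count > 0) {
                int was_referenced = lru->referenced[lru->len - 1];

                lru->len--;  /* unlink the tail entry */
                if (was_referenced) {
                        /* First sighting: clear the mark and move the entry
                         * to the head of the list instead of freeing it. */
                        memmove(&lru->referenced[1], &lru->referenced[0],
                                lru->len * sizeof(lru->referenced[0]));
                        lru->referenced[0] = 0;
                        lru->len++;
                        if (count_warnings)
                                count--;
                        continue;
                }
                freed++;
                count--;
        }
        return freed;
}
```

With eight entries that are all marked referenced and a request to free four, the old behaviour spends its whole budget on warnings and frees nothing, while the patched behaviour warns all eight and then frees four.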





[race][RFC] d_flags use

2001-04-12 Thread Alexander Viro



On Thu, 12 Apr 2001, Andreas Dilger wrote:

> Al writes:
> > We _have_ VM pressure there. However, such loads had never been used, so
> > there's no wonder that system gets unbalanced under them.
> > 
> > I suspect that simple replacement of goto next; with continue; in the
> > fs/dcache.c::prune_dcache() may make situation seriously better.
> 
> Yes, it appears that this would be a bug.  We were only _checking_
> "count" dentries, rather than pruning "count" dentries.
> 
> Testing continues.

Uh-oh... After looking at prune_dcache for a minute... Folks, what
protects ->d_flags? That may very well be the reason for some NFS
and autofs problems.

If nobody objects I'll go for test_bit/set_bit/clear_bit here.
Al




Re: [race][RFC] d_flags use

2001-04-12 Thread Alexander Viro



On Thu, 12 Apr 2001, David S. Miller wrote:

> 
> Alexander Viro writes:
>  > If nobody objects I'll go for test_bit/set_bit/clear_bit here.
> 
> Be sure to make d_flags an unsigned long when you do this! :-)

Oh, fsck... Thanks for reminder - I've completely forgotten about
that.
Al
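
Some background on the exchange above: a statement such as "dentry->d_flags &= ~DCACHE_REFERENCED" is a separate load, modify and store, so two CPUs updating different flags in the same word can lose one of the updates. The kernel's set_bit/clear_bit/test_bit are atomic and operate on unsigned long words, which is why d_flags has to become an unsigned long. The sketch below shows the same idea in userspace with C11 atomics; the model_* names and the bit numbers are hypothetical, and this is not the kernel implementation.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a dentry: only the flags word is modelled. */
struct model_dentry {
        atomic_ulong flags;  /* one unsigned long worth of flag bits */
};

/* Atomically set bit 'nr' in *word. */
void model_set_bit(int nr, atomic_ulong *word)
{
        assert(nr >= 0 && (size_t)nr < sizeof(unsigned long) * CHAR_BIT);
        atomic_fetch_or(word, 1UL << nr);
}

/* Atomically clear bit 'nr' in *word. */
void model_clear_bit(int nr, atomic_ulong *word)
{
        assert(nr >= 0 && (size_t)nr < sizeof(unsigned long) * CHAR_BIT);
        atomic_fetch_and(word, ~(1UL << nr));
}

/* Return 1 if bit 'nr' is set in *word, 0 otherwise. */
int model_test_bit(int nr, atomic_ulong *word)
{
        assert(nr >= 0 && (size_t)nr < sizeof(unsigned long) * CHAR_BIT);
        return (int)((atomic_load(word) >> nr) & 1UL);
}
```

Each update is a single atomic read-modify-write, so concurrent updates to different bits of the same word cannot overwrite one another.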




Re: [CFT][PATCH] Re: Fwd: Re: memory usage - dentry_cache

2001-04-12 Thread Alexander Viro



On Thu, 12 Apr 2001, Marcin Kowalski wrote:

> Hi
> 
> Regarding the patch 
> 
> I don't have experience with the linux kernel internals but could this patch 
> not lead to a run-loop condition as the only thing that can break out of the 
> for(;;) loop is the tmp==&dentry_unused statement. So if the required number 
> of dentries does not exist and this condition is not satisfied we would have 
> an infinite loop... sorry if this is a silly question.

Nope. Notice that "warned" dentries are not killed, but they are returned
to the list. If we meet them again - they are goners.

More formally, on each iteration you either decrement count or you
decrement the number of dentries that have DCACHE_REFERENCED. count
can't grow at all.  Number of dentries with DCACHE_REFERENCED can't grow
unless you release dcache_lock, which happens only in the branch that
decrements count. I.e. loop does terminate.
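
The argument above can be restated as a lexicographic termination measure. This is only a restatement under an assumption, namely that count starts out positive; the symbols $c$ (the value of count) and $r$ (the number of dentries on the unused list carrying DCACHE_REFERENCED) are introduced here and do not appear in the source.

```latex
\[
\text{warn branch: } (c,\, r) \mapsto (c,\; r - 1),
\qquad
\text{free branch: } (c,\, r) \mapsto (c - 1,\; r'), \quad r' \ge 0 .
\]
\[
\text{Either way } (c,\, r) \text{ strictly decreases in the lexicographic order on }
\mathbb{N} \times \mathbb{N},
\]
\[
\text{which is well-founded, so after finitely many iterations either } c = 0
\text{ or the list is empty.}
\]
```

The value $r'$ may be larger than $r$, because dcache_lock is dropped in the free branch, but that does not matter: the first component has already decreased.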

> Also the comment >/* If the dentry was recently referenced, don't free it. 
> */<, the code inside is executed if the DCACHE_REFERENCED flags are set and in 
> the code it is reversing the DCACHE_REFERENCED flag on the dentry and adding 
> it to the dentry_unused list??? So a Referenced entry is set Not Referenced 
> and placed in the unused list?? I am unclear about that... is the comment 
> correct or is my understanding lacking (which is very probable :-))..

"referenced" as in "had been found by d_lookup, don't shoot me at sight".
When prune_dcache() picks it up it moves the thing to the other end of the list
and removes the mark. Caught twice - too bad, it will be freed.
Al




Re: [PATCH] Re: Fwd: Re: memory usage - dentry_cacheg

2001-04-12 Thread Alexander Viro



On Thu, 12 Apr 2001, Jan Harkes wrote:

> But the VM pressure on the dcache and icache only comes into play once
> the system still has a free_shortage _after_ other attempts of freeing
> up memory in do_try_to_free_pages.

I don't think that it's necessarily bad.

> sync_all_inodes, which is called from shrink_icache_memory is
> counterproductive at this point. Writing dirty inodes to disk,
> especially when there is a lot of them, requires additional page
> allocations.

Agreed, but that's
a) a separate story
b) not the case in situation mentioned above (all inodes are
busy).

> I have a patch that unconditionally puts pressure on the dcache
> and icache, and avoids sync_all_inodes in shrink_icache_memory. An
> additional wakeup for the kupdate thread makes sure that inodes are more
> frequently written when there is no more free shortage. Maybe kupdated
> should always get woken up.

Maybe, but I really doubt that constant pressure on dcache/icache is a
good idea. I'd rather see what will change from fixing that bug in
prune_dcache() before deciding what to do next.

> btw. Alexander, is the following a valid optimization to improve
> write-coalescing when calling sync_one for several inodes?
> 
> inode.c:sync_one
> 
> -filemap_fdatawait(inode->i_mapping);
> +if (sync) filemap_fdatawait(inode->i_mapping);

Umm... Probably.




Re: [PATCH] Re: memory usage - dentry_cacheg

2001-04-12 Thread Alexander Viro



On Thu, 12 Apr 2001, Rik van Riel wrote:

> On Thu, 12 Apr 2001, Ed Tomlinson wrote:
> 
> > I have been playing around with patches that fix this problem.  What
> > seems to happen is that the VM code is pretty efficient at avoiding the
> > calls to shrink the caches.  When they do get called it's a case of too
> > little too late.  This is especially bad in lightly loaded systems.  
> > The following patch helps here.  I also have a more complex version
> > that uses autotuning, but would rather push the simple code, _if_ it
> > does the job.
> 
> I like this patch. The thing I like most is that it tries to free
> from this cache if there is little activity, not when we are low
> on memory and it is physically impossible to get rid of the cache.
> 
> Remember that evicting early from the inode and dentry cache doesn't
> matter since we can easily rebuild this data from the buffer and page
> cache.

Ahem. Yes, for local block-based filesystems, provided that directories are
small and that indirect blocks will not flush the inode table buffers out of
buffer cache, etc., etc.

Keeping inodes clean when pressure is low is a nice idea. That way you can
easily evict when needed. Evicting early... Not really.




Re: [PATCH] Re: Fwd: Re: memory usage - dentry_cacheg

2001-04-12 Thread Alexander Viro



On Thu, 12 Apr 2001, Rik van Riel wrote:

> Please take a look at Ed Tomlinson's patch. It also puts pressure
> on the dcache and icache independent of VM pressure, but it does
> so based on the (lack of) pressure inside the dcache and icache
> themselves.
>
> The patch looks simple, sane and it might save us quite a bit of
> trouble in making the prune_{icache,dcache} functions both able
> to avoid low-memory deadlocks *AND* at the same time able to run
> fast under low-memory situations ... we'd just prune from the
> icache and dcache as soon as a "large portion" of the cache isn't
> in use.

Bad idea. If you do loops over directory contents you will almost
permanently have almost all dentries freeable. Doesn't make freeing
them a good thing - think of the effects it would have.

Simple question: how many of dentries in /usr/src/linux/include/linux
are busy at any given moment during the compile? At most 10, I suspect.
I.e. ~4%.

I would rather go for actively keeping the amount of dirty inodes low,
so that freeing would be cheap. Doing massive write_inode when we
get low on memory is, indeed, a bad thing, but you don't have to
tie that to freeing stuff. Heck, IIRC you are using quite similar
logic for pagecache...




Re: [PATCH] Re: Fwd: Re: memory usage - dentry_cacheg

2001-04-12 Thread Alexander Viro



On Thu, 12 Apr 2001, Alexander Viro wrote:

> Bad idea. If you do loops over directory contents you will almost
> permanently have almost all dentries freeable. Doesn't make freeing
> them a good thing - think of the effects it would have.
> 
> Simple question: how many of dentries in /usr/src/linux/include/linux
> are busy at any given moment during the compile? At most 10, I suspect.
> I.e. ~4%.
> 
> I would rather go for actively keeping the amount of dirty inodes low,
> so that freeing would be cheap. Doing massive write_inode when we
> get low on memory is, indeed, a bad thing, but you don't have to
> tie that to freeing stuff. Heck, IIRC you are using quite similar
> logic for pagecache...

PS: with your approach negative entries are dead meat - they won't be
caught used unless you look at them exactly at the moment of d_lookup().

Welcome to massive lookups in /bin due to /usr/bin stuff (and no, the shell's
own cache doesn't help - it's not shared; think of scripts).

IOW, keeping dcache/icache size low is not a good thing, unless you
have memory pressure that requires it. More aggressive kupdate _is_
a good thing, though - possibly kupdate sans flushing buffers, so that
it would just keep the icache clean and let bdflush do the actual IO.




[PATCH][CFT] ext2 directories in pagecache

2001-04-12 Thread Alexander Viro

Folks, IMO ext2-dir-patch got to the stable stage. Currently
it's against 2.4.4-pre2, but it should apply to anything starting with
2.4.2 or so.

Ted, could you review it for potential inclusion into 2.4 once
it gets enough testing? It's ext2-only (the only change outside of
ext2 is exporting waitfor_one_page()), it doesn't change fs layout,
it seriously simplifies ext2/dir.c and ext2/namei.c and it gives better
VM behaviour.

Patch is on ftp.math.psu.edu/pub/viro/ext2-dir-patch.gz

Folks, please give it a good beating - it works here, but I'd
really like it to get wide testing. Help would be very welcome.
Al




Re: [PATCH] Re: memory usage - dentry_cacheg

2001-04-12 Thread Alexander Viro



On Thu, 12 Apr 2001, Ed Tomlinson wrote:

> On Thursday 12 April 2001 11:12, Alexander Viro wrote:
> What prompted my patch was observing situations where the icache (and dcache 
> too) got so big that they were applying artificial pressure to the page and 
> buffer caches. I say artificial since checking the stats these caches showed 
> over 95% of the entries unused.  At this point there is usually another 10% 
> or so of objects allocated by the slab caches but not accounted for in the 
> stats (not a problem they are accounted if the cache starts using them).

"Unused" as in "->d_count==0"? That _is_ OK. Basically, you will have
positive ->d_count only on directories and currently opened files.
E.g. during compile in /usr/include/* you will have 3-5 file dentries
with ->d_count > 0 - ones that are opened _now_. It doesn't mean that
everything else is unused in any meaningful sense. Can be freed - yes,
but that's a different story.

If you are talking about "unused" from the slab POV - _ouch_. Looks like
extremely bad fragmentation ;-/ It's surprising, and if that's the case
I'd like to see more details.
Al




Re: EXPORT_SYMBOL for chrdev_open 2.4.3

2001-04-13 Thread Alexander Viro



On Fri, 13 Apr 2001, Jeff V. Merkey wrote:

> It would be nice if chrdev_open were added to ksyms.c along with
> blkdev_open since tape devices seem to always be registered as character
> rather than block devices.  
> 
> I am finding that kernel modules that need to open and close a tape 
> drive have to export chrdev_open manually on 2.4.3.  Can this get 
> exported as well?  Closing is not a problem since the method of 
> calling (->release) seems to work OK with SCSI tape devices.

They don't need it. Moreover, blkdev_open shouldn't be exported either -
the only potentially modular piece of code that refers to it is
drivers/block/rd.c and it's in initrd loading, so it isn't even
compiled when we do rd as a module.

BTW, Linus, could we remove blkdev_open() from the export list?
I don't see any legitimate reason to export it - certainly not in
the official tree.

BTW, fs/partitions/ibm.c also doesn't need blkdev_open() - it should
use ioctl_by_bdev() and be done with that.
Al




Re: EXPORT_SYMBOL for chrdev_open 2.4.3

2001-04-13 Thread Alexander Viro


On Fri, 13 Apr 2001, Jeff V. Merkey wrote:

> How are folks supposed to open disk and tape devices from kernel modules
> without these?  Not everything should be done in user space Al.  If you 

Normally - filp_open(). If all you want is ioctl on block device -
blkdev_get() + ioctl_by_bdev() + blkdev_put(). If you want it by
device _number_ - use bdget().

> remove blkdev_open I will not be able to properly increment the use 
> count on a disk device I may be reading or writing to.  

Yes, you will. And I would _really_ advise you to do that by
name instead of device number - that way you will avoid a lot of pain
a couple of years down the road.




Re: EXPORT_SYMBOL for chrdev_open 2.4.3

2001-04-13 Thread Alexander Viro



On Fri, 13 Apr 2001, Jeff V. Merkey wrote:

> Not meaning to offend, but how could you know what everyone 
> who uses Linux needs in every instance?  NT, NetWare, etc. all
> expose these types of APIs for Backup and anti-virus software,
> etc.  The APIs in question are the very calls user space apps
> call through the syscall to indicate who is using a device. 

Backup and AV software is not in the kernel, so they would
be unable to use the thing, exported or not. Please, don't
bring the strawmen.

Novell's model (aka. "we don't need no stinkin' userland, everything
is NLM and security be damned") is better left to rot in hell with Novell.

> Sure, I can send blind I/O requests to a device and I guess 
> someone running fdisk in user space can blow the device away from beneath 
> me since I have no way of locking those partitions I exclusively
> own and stopping this if these APIs are removed and modules 
> cannot call them.  

Use filp_open() - it's that simple.




Re: EXPORT_SYMBOL for chrdev_open 2.4.3

2001-04-13 Thread Alexander Viro



On Fri, 13 Apr 2001, Jeff V. Merkey wrote:

> > Backup and AV software is not in the kernel, so they would
> > be unable to use the thing, exported or not. Please, don't
> > bring the strawmen.
> 
> Some NT anti-virus stuff is in-kernel, and it's there to catch people 
> writing viruses that act like device drivers.  One day, if and 

If an attacker can trick the kernel into loading _any_ untrusted code,
no matter what its contents, in ring 0 - you've lost anyway.

> when a Linux virus shows its ugly head disguised as a kernel module, you 
> will be backpedaling on this statement, and wishing we had in 

No, I will be busy fixing the hole that allows untrusted code to get loaded.
I don't give a fsck whether it's a virus or not - if the admin authorized it
it's his responsibility, if not - the ability to get it into kernel space is
a gaping hole that should be closed.

> > Use filp_open() - it's that simple.
> 
> Thanks.  This is what I needed to know.  I saw filp_open() in the 
> EXPORTS file, but was uncertain if this would be an unchanging API.  

Yes, it is. It's a kernel counterpart of open() - the only difference
is that instead of installing a reference to file into descriptor
table and returning the descriptor it returns the reference itself.
Arguments are the same as in case of open() and it's certainly there
to stay.

Al
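
The relationship between open() and filp_open() described above can be pictured with a small userspace model. Everything here is hypothetical (model_file, model_filp_open, model_open and the two tables are made up for illustration, and "/dev/st0" in the usage note is just an example name). One deliberate simplification: the model signals failure with NULL, whereas the real filp_open() reports errors through an ERR_PTR-encoded pointer.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NFILES 4
#define MODEL_NFDS   4

/* Hypothetical stand-in for an open file object. */
struct model_file {
        int in_use;
        int flags;
        const char *name;
};

static struct model_file model_files[MODEL_NFILES];
static struct model_file *model_fd_table[MODEL_NFDS];

/* filp_open()-like: hand the reference itself back to the caller.
 * Returns NULL when no file object is free (model-only convention). */
struct model_file *model_filp_open(const char *name, int flags)
{
        size_t i;

        for (i = 0; i < MODEL_NFILES; i++) {
                if (!model_files[i].in_use) {
                        model_files[i].in_use = 1;
                        model_files[i].flags = flags;
                        model_files[i].name = name;
                        return &model_files[i];
                }
        }
        return NULL;
}

/* open()-like: same arguments, but the reference is installed into the
 * descriptor table and only the descriptor number is returned.
 * Returns -1 on failure. */
int model_open(const char *name, int flags)
{
        struct model_file *f;
        int fd;

        for (fd = 0; fd < MODEL_NFDS; fd++)
                if (model_fd_table[fd] == NULL)
                        break;
        if (fd == MODEL_NFDS)
                return -1;

        f = model_filp_open(name, flags);
        if (f == NULL)
                return -1;
        model_fd_table[fd] = f;
        return fd;
}
```

In this model, model_filp_open("/dev/st0", 0) leaves the descriptor table untouched and returns a pointer, while model_open("/dev/st0", 0) returns a small integer that indexes the table.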




Re: PATCH(?): linux-2.4.4-pre2: fork should run child first

2001-04-13 Thread Alexander Viro



On Fri, 13 Apr 2001, Linus Torvalds wrote:

> 
> 
> On 14 Apr 2001, John Fremlin wrote:
> >
> > . In fact, if you think
> > fork+exec is such a big performance hit why not go for spawn(2) and
> > have Linus and Al jump on you? ;-)
> 
> spawn() is trivial to implement if you want to. I don't think it's all
> that much more interesting than vfork()+execve(), though.

Or faster, for that matter...




Re: small bug/oversight found in 2.4.3

2001-04-15 Thread Alexander Viro



On Sun, 15 Apr 2001, Jeff Garzik wrote:

> Swivel wrote:
> > 
> > drivers/char/char.c, line 247
> > create_proc_read_entry() is called regardless of the definition of
> > CONFIG_PROC_FS, simply wrap call with #ifdef CONFIG_PROC_FS and #endif.
> 
> create_proc_read_entry exists, as a static inline no-op, without
> CONFIG_PROC_FS.

... while drivers/char/char.c doesn't exist at all.




Re: loop problems continue in 2.4.3

2001-04-16 Thread Alexander Viro



On Mon, 16 Apr 2001, Jens Axboe wrote:

> > I can mount the same file on the same mountpoint more than once. If I
> > mount a file twice (same file on the same mount point)
> 
> This is a 2.4 feature

Ability to losetup different loop devices to the same underlying
file is a bug, though. Not that it was new, though...




[PATCH][crapectomy] death of filesystem_setup()

2001-04-17 Thread Alexander Viro

Patch below switches the last 3 filesystems that are initialized from
filesystem_setup() to module_init/module_exit. Result: filesystem_setup() is
no more.

Linus, could you apply it?
Al

diff -urN S4-pre3/fs/devfs/base.c S4-pre3-init/fs/devfs/base.c
--- S4-pre3/fs/devfs/base.c Mon Feb 26 14:18:32 2001
+++ S4-pre3-init/fs/devfs/base.c Tue Apr 17 09:03:51 2001
@@ -3339,7 +3339,7 @@
 }   /*  End Function devfsd_close  */
 
 
-int __init init_devfs_fs (void)
+static int __init init_devfs_fs (void)
 {
 int err;
 
@@ -3369,3 +3369,5 @@
 if (err == 0) printk ("Mounted devfs on /dev\n");
 else printk ("Warning: unable to mount devfs, err: %d\n", err);
 }   /*  End Function mount_devfs_fs  */
+
+module_init(init_devfs_fs)
diff -urN S4-pre3/fs/devpts/inode.c S4-pre3-init/fs/devpts/inode.c
--- S4-pre3/fs/devpts/inode.c   Mon Feb 26 14:18:32 2001
+++ S4-pre3-init/fs/devpts/inode.c  Tue Apr 17 09:10:52 2001
@@ -228,28 +228,25 @@
err = PTR_ERR(devpts_mnt);
if (!IS_ERR(devpts_mnt))
err = 0;
-   }
-   return err;
-}
-
 #ifdef MODULE
-
-int init_module(void)
-{
-   int err = init_devpts_fs();
-   if ( !err ) {
-   devpts_upcall_new  = devpts_pty_new;
-   devpts_upcall_kill = devpts_pty_kill;
+   if ( !err ) {
+   devpts_upcall_new  = devpts_pty_new;
+   devpts_upcall_kill = devpts_pty_kill;
+   }
+#endif
}
return err;
 }
 
-void cleanup_module(void)
+void __exit exit_devpts_fs(void)
 {
+#ifdef MODULE
devpts_upcall_new  = NULL;
devpts_upcall_kill = NULL;
+#endif
unregister_filesystem(&devpts_fs_type);
kern_umount(devpts_mnt);
 }
 
-#endif
+module_init(init_devpts_fs)
+module_exit(exit_devpts_fs)
diff -urN S4-pre3/fs/filesystems.c S4-pre3-init/fs/filesystems.c
--- S4-pre3/fs/filesystems.c Mon Sep 25 19:05:01 2000
+++ S4-pre3-init/fs/filesystems.c   Tue Apr 17 09:53:31 2001
@@ -7,36 +7,10 @@
  */
 
 #include 
-#include 
-
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
+#include 
 #include 
 #include 
-#include 
-#include 
 #include 
-
-#ifdef CONFIG_DEVPTS_FS
-extern int init_devpts_fs(void);
-#endif
-
-void __init filesystem_setup(void)
-{
-   init_devfs_fs();  /*  Header file may make this empty  */
-
-#ifdef CONFIG_NFS_FS
-   init_nfs_fs();
-#endif
-
-#ifdef CONFIG_DEVPTS_FS
-   init_devpts_fs();
-#endif
-}
 
 #if defined(CONFIG_NFSD_MODULE)
 struct nfsd_linkage *nfsd_linkage = NULL;
diff -urN S4-pre3/fs/nfs/inode.c S4-pre3-init/fs/nfs/inode.c
--- S4-pre3/fs/nfs/inode.c  Mon Apr  2 16:51:04 2001
+++ S4-pre3-init/fs/nfs/inode.c Tue Apr 17 09:49:20 2001
@@ -15,6 +15,7 @@
 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1060,8 +1061,7 @@
 /*
  * Initialize NFS
  */
-int
-init_nfs_fs(void)
+static int __init init_nfs_fs(void)
 {
int err;
 
@@ -1079,23 +1079,7 @@
 return register_filesystem(&nfs_fs_type);
 }
 
-/*
- * Every kernel module contains stuff like this.
- */
-#ifdef MODULE
-
-EXPORT_NO_SYMBOLS;
-/* Not quite true; I just maintain it */
-MODULE_AUTHOR("Olaf Kirch <[EMAIL PROTECTED]>");
-
-int
-init_module(void)
-{
-   return init_nfs_fs();
-}
-
-void
-cleanup_module(void)
+static void __exit exit_nfs_fs(void)
 {
nfs_destroy_readpagecache();
nfs_destroy_nfspagecache();
@@ -1104,4 +1088,10 @@
 #endif
unregister_filesystem(&nfs_fs_type);
 }
-#endif
+
+EXPORT_NO_SYMBOLS;
+/* Not quite true; I just maintain it */
+MODULE_AUTHOR("Olaf Kirch <[EMAIL PROTECTED]>");
+
+module_init(init_nfs_fs)
+module_exit(exit_nfs_fs)
diff -urN S4-pre3/include/linux/devfs_fs_kernel.h S4-pre3-init/include/linux/devfs_fs_kernel.h
--- S4-pre3/include/linux/devfs_fs_kernel.h Fri Mar 23 16:09:44 2001
+++ S4-pre3-init/include/linux/devfs_fs_kernel.h Tue Apr 17 09:43:03 2001
@@ -96,7 +96,6 @@
   unsigned int minor_start,
   umode_t mode, void *ops, void *info);
 
-extern int init_devfs_fs (void);
 extern void mount_devfs_fs (void);
 extern void devfs_make_root (const char *name);
 #else  /*  CONFIG_DEVFS_FS  */
@@ -233,10 +232,6 @@
 return;
 }
 
-static inline int init_devfs_fs (void)
-{
-return 0;
-}
 static inline void mount_devfs_fs (void)
 {
 return;
diff -urN S4-pre3/include/linux/nfs_fs.h S4-pre3-init/include/linux/nfs_fs.h
--- S4-pre3/include/linux/nfs_fs.h  Wed Mar 28 21:12:47 2001
+++ S4-pre3-init/include/linux/nfs_fs.h Tue Apr 17 09:42:48 2001
@@ -137,7 +137,6 @@
  * linux/fs/nfs/inode.c
  */
 extern struct super_block *nfs_read_super(struct super_block *, void *, int);
-extern int init_nfs_fs(void);
 extern void nfs_zap_caches(struct inode *);
 extern int nfs_inode_is_stale(struct inode *, struct nfs_fh *,
st

Re: [PATCH] proc_lookup not exported

2001-04-17 Thread Alexander Viro



On Tue, 17 Apr 2001, Jeff Golds wrote:

> Hi folks.
> 
> I noticed that proc_lookup is not exported in fs/proc/procfs_syms.c but
> that the function is an external in include/linux/proc_fs.h.

Not every public function needs to be exported. proc_lookup() is
shared between different files in fs/proc/, so it can't be made
static. However, it has no business being used outside of
fs/proc and it certainly shouldn't be used in modules.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



[PATCH] Re: 2.4.4-pre4 nfsd.o unresolved symbol

2001-04-17 Thread Alexander Viro



On Wed, 18 Apr 2001, Jeff Chua wrote:

> depmod: *** Unresolved symbols in /lib/modules/2.4.4-pre4/kernel/fs/nfsd/nfsd.o
> depmod: nfsd_linkage_Rb56858ea

Grrr...

Add #include  to fs/filesystems.c. My apologies.

--- fs/filesystems.c        Tue Apr 17 23:40:32 2001
+++ /tmp/filesystems.c  Wed Apr 18 00:41:01 2001
@@ -7,6 +7,7 @@
  */
 
 #include 
+#include 
 #include 
 #include 
 #include 





[PATCH] d_flags races

2001-04-18 Thread Alexander Viro

Linus, we need to make access to bits of ->d_flags atomic.
That used to be protected by BKL, but since we've got DCACHE_REFERENCED
it's no longer true - prune_dcache() and d_lookup() are not under BKL
and while they are safe wrt each other (both are under dcache_lock)
they race with every place in filesystems that touches ->d_flags.
Solution: use set_bit/clear_bit/test_bit.
Patch follows. Please, apply.
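
To illustrate the difference outside the kernel, here is a minimal userspace
sketch, assuming C11 atomics as a stand-in for the kernel's
set_bit/clear_bit/test_bit. The bit numbers and the flag_* helper names are
hypothetical, not taken from the patch. A plain "flags &= ~mask" is a separate
load, modify and store, so a concurrent update of another bit in the same word
can be lost; an atomic read-modify-write cannot lose it.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit numbers, for illustration only. */
enum { D_REFERENCED_BIT = 0, D_AUTOFS_PENDING_BIT = 1 };

/* Atomically set bit nr; other bits in the word are never clobbered. */
static inline void flag_set(atomic_ulong *flags, int nr)
{
    atomic_fetch_or(flags, 1UL << nr);
}

/* Atomically clear bit nr. */
static inline void flag_clear(atomic_ulong *flags, int nr)
{
    atomic_fetch_and(flags, ~(1UL << nr));
}

/* Read bit nr from an atomic snapshot of the word. */
static inline bool flag_test(atomic_ulong *flags, int nr)
{
    return (atomic_load(flags) >> nr) & 1UL;
}
```

With helpers like these, one thread clearing the "pending" bit and another
setting the "referenced" bit on the same word do not need a common lock.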
A:

diff -urN S4-pre4/fs/autofs/root.c S4-pre4-d_flags/fs/autofs/root.c
--- S4-pre4/fs/autofs/root.c        Fri Feb 16 17:28:15 2001
+++ S4-pre4-d_flags/fs/autofs/root.c        Wed Apr 18 06:17:59 2001
@@ -94,7 +94,7 @@
/* Turn this into a real negative dentry? */
if (status == -ENOENT) {
dentry->d_time = jiffies + AUTOFS_NEGATIVE_TIMEOUT;
-   dentry->d_flags &= ~DCACHE_AUTOFS_PENDING;
+   clear_bit(D_Autofs_Pending, &dentry->d_flags);
return 1;
} else if (status) {
/* Return a negative dentry, but leave it "pending" */
@@ -129,7 +129,7 @@
autofs_update_usage(&sbi->dirhash,ent);
}
 
-   dentry->d_flags &= ~DCACHE_AUTOFS_PENDING;
+   clear_bit(D_Autofs_Pending, &dentry->d_flags);
return 1;
 }
 
@@ -152,7 +152,7 @@
sbi = autofs_sbi(dir->i_sb);
 
/* Pending dentry */
-   if ( dentry->d_flags & DCACHE_AUTOFS_PENDING ) {
+   if (test_bit(D_Autofs_Pending, &dentry->d_flags)) {
if (autofs_oz_mode(sbi))
res = 1;
else
@@ -219,7 +219,7 @@
 * We need to do this before we release the directory semaphore.
 */
dentry->d_op = &autofs_dentry_operations;
-   dentry->d_flags |= DCACHE_AUTOFS_PENDING;
+   set_bit(D_Autofs_Pending, &dentry->d_flags);
d_add(dentry, NULL);
 
up(&dir->i_sem);
@@ -230,7 +230,7 @@
 * If we are still pending, check if we had to handle
 * a signal. If so we can force a restart..
 */
-   if (dentry->d_flags & DCACHE_AUTOFS_PENDING) {
+   if (test_bit(D_Autofs_Pending, &dentry->d_flags)) {
if (signal_pending(current))
return ERR_PTR(-ERESTARTNOINTR);
}
diff -urN S4-pre4/fs/autofs4/autofs_i.h S4-pre4-d_flags/fs/autofs4/autofs_i.h
--- S4-pre4/fs/autofs4/autofs_i.h   Fri Feb 16 22:52:15 2001
+++ S4-pre4-d_flags/fs/autofs4/autofs_i.h   Wed Apr 18 06:17:59 2001
@@ -119,7 +119,7 @@
 {
struct autofs_info *inf = autofs4_dentry_ino(dentry);
 
-   return (dentry->d_flags & DCACHE_AUTOFS_PENDING) ||
+   return (test_bit(D_Autofs_Pending, &dentry->d_flags)) ||
(inf != NULL && inf->flags & AUTOFS_INF_EXPIRING);
 }
 
diff -urN S4-pre4/fs/autofs4/expire.c S4-pre4-d_flags/fs/autofs4/expire.c
--- S4-pre4/fs/autofs4/expire.c Fri Feb 16 19:36:08 2001
+++ S4-pre4-d_flags/fs/autofs4/expire.c Wed Apr 18 06:17:59 2001
@@ -194,7 +194,7 @@
}
 
/* No point expiring a pending mount */
-   if (dentry->d_flags & DCACHE_AUTOFS_PENDING)
+   if (test_bit(D_Autofs_Pending, &dentry->d_flags))
continue;
 
if (!do_now) {
diff -urN S4-pre4/fs/autofs4/root.c S4-pre4-d_flags/fs/autofs4/root.c
--- S4-pre4/fs/autofs4/root.c   Fri Feb 16 19:36:08 2001
+++ S4-pre4-d_flags/fs/autofs4/root.c   Wed Apr 18 06:17:59 2001
@@ -82,7 +82,7 @@
if (de_info && (de_info->flags & AUTOFS_INF_EXPIRING)) {
DPRINTK(("try_to_fill_entry: waiting for expire %p name=%.*s, flags&PENDING=%s de_info=%p de_info->flags=%x\n",
 dentry, dentry->d_name.len, dentry->d_name.name, 
-dentry->d_flags & DCACHE_AUTOFS_PENDING?"t":"f",
+test_bit(D_Autofs_Pending, &dentry->d_flags)?"t":"f",
 de_info, de_info?de_info->flags:0));
status = autofs4_wait(sbi, &dentry->d_name, NFY_NONE);

@@ -109,7 +109,7 @@
/* Turn this into a real negative dentry? */
if (status == -ENOENT) {
dentry->d_time = jiffies + AUTOFS_NEGATIVE_TIMEOUT;
-   dentry->d_flags &= ~DCACHE_AUTOFS_PENDING;
+   clear_bit(D_Autofs_Pending, &dentry->d_flags);
return 1;
} else if (status) {
/* Return a negative dentry, but leave it "pending" */
@@ -134,7 +134,7 @@
if (!autofs4_oz_mode(sbi))
autofs4_update_usage(dentry);
 
-   dentry->d_flags &= ~DCACHE_AUTOFS_PENDING;
+   clear_bit(D_Autofs_Pending, &dentry->d_flags);
return 1;
 }
 
@@ -277,7 +277,7 @@
dentry->d_op = &autofs4_root_dentry_operations;
 
if (!oz_mod

Re: [PATCH][CFT] ext2 directories in pagecache

2001-04-18 Thread Alexander Viro



On Wed, 18 Apr 2001, James Lewis Nance wrote:

> On Thu, Apr 12, 2001 at 12:33:42PM -0400, Alexander Viro wrote:
> > Folks, IMO ext2-dir-patch got to the stable stage. Currently
> > it's against 2.4.4-pre2, but it should apply to anything starting with
> > 2.4.2 or so.
> 
> Have you had any feedback about this patch?  I applied it last night to
> 2.4.3.  It seemed to work.  When I booted my computer this morning fsck
> complained about problems with the directory on one of my ext2 file systems.

Anything prior to 2.4.4-pre2 has known metadata-corrupting bugs on ext2.
Whether they show up or not depends on the load, phase of moon, etc. but
they are there.

> Since fsck does not run on every boot I dont really have a way of knowing if
> this has anything to do with your patch or not.  I'm running the patched
> kernel again right now.  Ill shutdown and force an fsck later today to see
> if anything shows up.

Please, upgrade to 2.4.4-pre2 or later. Or, at least, replace the bforget()
call in ext2_get_block() with brelse() - that was the worst one
(and the last to be fixed).
Al




Re: [PATCH] proc_lookup not exported

2001-04-18 Thread Alexander Viro



On Wed, 18 Apr 2001, Jeff Golds wrote:

> I don't see why not. I created my own mkdir and rmdir handlers in my
> module.  I'd like to use the lookup function that proc supplies instead
> of supplying my own, why shouldn't I be allowed to do that?  It's not as
> if I am doing something other than what normally happens:  I am
> assigning inode_operations::lookup to be proc_lookup.

Use ramfs as a model; procfs is not well-suited for that sort of work.




Re: [PATCH] proc_mknod() should check the mode parameter

2001-04-18 Thread Alexander Viro



On Wed, 18 Apr 2001, Erik Mouw wrote:

> Hi all,
> 
> While documenting the procfs interface (more of that later), I came
> across proc_mknod() which is supposed to be used to create devices in
> the procfs. IMHO it should therefore check if the mode parameter
> contains S_IFBLK or S_IFCHR.

Why? All callers of proc_mknod() are in the kernel and they should
know better. I could understand
if ()
BUG();
but silently doing nothing is really odd.




Re: Ext2 Directory Index - Delete Performance

2001-04-18 Thread Alexander Viro



On Thu, 19 Apr 2001, Rik van Riel wrote:

> Hmmm, considering this, it may be wise to limit the amount of
> inodes in the inode cache to, say, 10% of RAM ... because we
> can cache MORE inodes if we store them in the buffer cache
> instead!

Rik, I'd rather check the effect of prune_icache() patch before
deciding what to do. It doesn't make much sense to limit icache
size when we leave unused inodes for a long time.




Re: Ext2 Directory Index - Delete Performance

2001-04-18 Thread Alexander Viro



On Wed, 18 Apr 2001, Rik van Riel wrote:

> On Thu, 19 Apr 2001, Daniel Phillips wrote:
> 
> > OK, now I know what's happening, the next question is, what should be
> > done about it.  If anything.
> 
> [ discovered by alexey on #kernelnewbies ]
> 
> One thing we should do is make sure the buffer cache code sets
> the referenced bit on pages, so we don't recycle buffer cache
> pages early.
> 
> This should leave more space for the buffercache and lead to us
> reclaiming the (now unused) space in the dentry cache instead...

Sorry, but that's just plain wrong. We shouldn't keep inode table in
buffer-cache at all. And we should be more aggressive on icache -
dcache looks sane now (recent 2.4.4-pre), but icache holds unused
inodes for too long. And freeing them is very slow _and_ random -
recipe for kmem_cache fragmentation.

/me sits down to port inode-table-in-pagecache to 2.4.4-pre4...




Re: light weight user level semaphores

2001-04-19 Thread Alexander Viro



On Thu, 19 Apr 2001, Abramo Bagnara wrote:

> Alon Ziv wrote:
> > 
> > Hmm...
> > I already started (long ago, and abandoned since due to lack of time :-( )
> > down another path; I'd like to resurrect it...
> > 
> > My lightweight-semaphores were actually even simpler in userspace:
> > * the userspace struct was just a signed count and a file handle.
> > * Uncontended case is exactly like Linus' version (i.e., down() is decl +
> > js, up() is incl()).
> > * The contention syscall was (in my implementation) an ioctl on the FH; the
> > FH was a special one, from a private syscall (although with the new VFS I'd
> > have written it as just another specialized FS, or even referred into the
> > SysVsem FS).
> > 
> > So, there is no chance for user corruption of kernel data (as it just ain't
> > there...); and the contended-case cost is probably equivalent (VFS cost vs.
> > validation).
> 
> This would also permit:
> - to have poll()
> - to use mmap() to obtain the userspace area
> 
> It would become something very near to sacred Unix dogmas ;-)

I suspect that a simple pipe would be sufficient to handle the contention
case - nothing fancy needed (read when you need to block, write upon up()
when you have contenders).

Would something along the lines of (inline as needed, etc.)

down:
lock decl count
js __down_failed
down_done:
ret

up:
lock incl count
jle __up_waking
up_done:
ret

__down_failed:
call down_failed
jmp down_done
__up_waking:
call up_waking
jmp up_done

down_failed()
{
read(pipe_fd, &dummy, 1);
}

up_waking()
{
write(pipe_fd, &dummy, 1);
}

be enough?
Al
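
For reference, the same idea can be written in portable C. This is only a
sketch under assumptions: C11 atomics replace the "lock decl"/"lock incl"
pair, the names pipe_sem, pipe_sem_init, pipe_sem_down and pipe_sem_up are
made up for illustration, and error handling is limited to retrying
interrupted reads and writes.

```c
#include <errno.h>
#include <stdatomic.h>
#include <unistd.h>

struct pipe_sem {
    atomic_int count;   /* > 0: free slots, < 0: number of sleepers */
    int fd[2];          /* fd[0] is read by sleepers, fd[1] written by up() */
};

/* Returns 0 on success, -1 if pipe() failed. */
static int pipe_sem_init(struct pipe_sem *s, int initial)
{
    atomic_init(&s->count, initial);
    return pipe(s->fd);
}

static void pipe_sem_down(struct pipe_sem *s)
{
    char dummy;
    ssize_t r;

    /* The new value is negative exactly when the old one was <= 0. */
    if (atomic_fetch_sub(&s->count, 1) > 0)
        return;                     /* uncontended: no syscall at all */
    do {                            /* contended: sleep in read() */
        r = read(s->fd[0], &dummy, 1);
    } while (r < 0 && errno == EINTR);
}

static void pipe_sem_up(struct pipe_sem *s)
{
    char dummy = 0;
    ssize_t r;

    /* The new value is <= 0 exactly when the old one was negative. */
    if (atomic_fetch_add(&s->count, 1) >= 0)
        return;                     /* nobody is waiting */
    do {                            /* wake one sleeper with one byte */
        r = write(s->fd[1], &dummy, 1);
    } while (r < 0 && errno == EINTR);
}
```

Since the pipe buffers the byte, a wakeup written before the sleeper gets
to read() is not lost.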




ext2 inode size (on-disk)

2001-04-19 Thread Alexander Viro

Erm... Folks, can ->s_inode_size be not a power of 2? Both
libext2fs and kernel break in that case. Example:

dd if=/dev/zero of=foo bs=1024 count=20480
mkfs -I 192 foo

corrupts memory and segfaults. Reason: ext2_read_inode() (same problem
is present in the kernel version of said beast) finds inode offset within
cylinder group piece of inode table, splits it into block*block_size+offset,
reads the block and works with the structure at given offset.

I.e. it does
group = (ino-1) / inodes_per_group;
number_in_group = (ino-1) % inodes_per_group;
offset_in_group = number_in_group * inode_size;
block_number = inode_table_base[group] + offset_in_group/block_size;
offset_in_block = offset_in_group % block_size

Guess what happens if inode crosses block boundary? Exactly.

AFAICS we have two sane solutions:

a) require inode size to be a power of 2

b) switch to

group = (ino-1) / inodes_per_group;
number_in_group = (ino-1) % inodes_per_group;
block_in_group = number_in_group / inodes_per_block;
number_in_block = number_in_group % inodes_per_block;
block = inode_table_fragments[group] + block_in_group;
offset_in_block = number_in_block * inode_size;

i.e. instead of current "pack inodes into piece of inode table and
pad it in the end" do "pack inodes into blocks padding the end of every block".
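
A small sketch of both layouts, to make the boundary problem concrete. The
struct and function names are hypothetical, and block numbers are relative
to the start of the group's piece of the inode table (the real code adds the
table base for the group).

```c
struct ext2_geom {
    unsigned long inodes_per_group;
    unsigned long inode_size;
    unsigned long block_size;
};

struct inode_loc {
    unsigned long block_in_table;   /* relative to the group's inode table */
    unsigned long offset_in_block;
};

/* Current scheme: inodes packed back to back across the whole table piece. */
static struct inode_loc locate_packed(const struct ext2_geom *g, unsigned long ino)
{
    unsigned long number_in_group = (ino - 1) % g->inodes_per_group;
    unsigned long offset_in_group = number_in_group * g->inode_size;
    struct inode_loc loc = {
        offset_in_group / g->block_size,
        offset_in_group % g->block_size
    };
    return loc;
}

/* Variant (b): inodes packed per block, padding at the end of every block. */
static struct inode_loc locate_per_block(const struct ext2_geom *g, unsigned long ino)
{
    unsigned long inodes_per_block = g->block_size / g->inode_size;
    unsigned long number_in_group = (ino - 1) % g->inodes_per_group;
    struct inode_loc loc = {
        number_in_group / inodes_per_block,
        (number_in_group % inodes_per_block) * g->inode_size
    };
    return loc;
}

/* Nonzero if the inode at loc would spill into the next block. */
static int crosses_block(const struct ext2_geom *g, struct inode_loc loc)
{
    return loc.offset_in_block + g->inode_size > g->block_size;
}
```

With 192-byte inodes and 1Kb blocks, inode 6 of a group lands at offset 960
of the first block under the current scheme and runs past its end; under (b)
it starts at offset 0 of the second block.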

Something has to be done - right now mke2fs effectively mandates "inode size
is a power of 2" and as far as I'm concerned it's OK, but segfaulting is
a bit too drastic way of telling user "don't do it"...
Al

PS: can we assume that inodes_per_group is a multiple of inodes_per_block
or it isn't guaranteed?




Re: light weight user level semaphores

2001-04-19 Thread Alexander Viro



On Thu, 19 Apr 2001, Linus Torvalds wrote:

> 
> 
> On Thu, 19 Apr 2001, Alexander Viro wrote:
> >
> > I certainly agree that introducing ioctl() in _any_ API is a shootable
> > offense. However, I wonder whether we really need any kernel changes
> > at all.
> 
> I'd certainly be interested in seeing the pipe-based approach. Especially
> if you make the pipe allocation lazy. That isn't trivial (it needs to be
> done right with both up_failed() and down_failed() trying to allocate the
> pipe on contention and using an atomic cmpxchg-style setting if none
> existed before). It has the BIG advantage of working on old kernels, so
> that you don't need to have backwards compatibility cruft in the
> libraries.

Ehh... Non-lazy variant is just read() and write() as down_failed() and
up_wakeup(). Lazy... How about

if (Lock <= 1)
goto must_open;
opened:
/* as in non-lazy case */


must_open:
pipe(fd);
lock decl Lock
jg lost_it  /* Already seriously positive - clean up and go */
jl spin_and_lose
/* Lock went from 1 to 0 - go ahead */
reader = fd[0];
writer = fd[1];
Lock = MAX_INT;
goto opened;
spin_and_lose:
/* Won't take long - another guy got to do 3 memory writes */
while (Lock <= 0)
;
lost_it:
lock incl Lock
close(fd[0]);
close(fd[1]);
goto opened;

Al
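
A different way to get lazy allocation is the cmpxchg-style publication
mentioned in the quoted text, rather than the counter trick above. The
following is a sketch only, assuming C11 atomics; struct sem_pipe and
get_pipe() are hypothetical names. Every contender creates its own pipe, one
of them publishes it with a compare-and-swap, and the losers close theirs
and use the winner's.

```c
#include <stdatomic.h>
#include <stdlib.h>
#include <unistd.h>

struct sem_pipe {
    int reader;
    int writer;
};

/*
 * Return the pipe stored in *slot, creating and publishing one if the
 * slot is still empty. Returns NULL if allocation or pipe() fails.
 */
static struct sem_pipe *get_pipe(_Atomic(struct sem_pipe *) *slot)
{
    struct sem_pipe *cur = atomic_load(slot);
    struct sem_pipe *mine, *expected = NULL;
    int fd[2];

    if (cur)
        return cur;                 /* somebody already set it up */

    mine = malloc(sizeof(*mine));
    if (!mine)
        return NULL;
    if (pipe(fd) < 0) {
        free(mine);
        return NULL;
    }
    mine->reader = fd[0];
    mine->writer = fd[1];

    /* Publish only if the slot is still empty. */
    if (atomic_compare_exchange_strong(slot, &expected, mine))
        return mine;

    /* Lost the race: drop our pipe, use the one that got there first. */
    close(fd[0]);
    close(fd[1]);
    free(mine);
    return expected;
}
```

The price is that a loser does a pipe() and two close() calls for nothing,
but that only happens on the first contention.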




Re: ext2 inode size (on-disk)

2001-04-19 Thread Alexander Viro



On Thu, 19 Apr 2001, Andreas Dilger wrote:

> Al, you write:
> > Erm... Folks, can ->s_inode_size be not a power of 2? Both
> > libext2fs and kernel break in that case. Example:
> > 
> > dd if=/dev/zero of=foo bs=1024 count=20480
> > mkfs -I 192 foo
> 
> I had always assumed that it would be a power-of-two size, but since it
> is an undocumented option to mke2fs, I suppose it was never really
> intended to be used.  It appears, however, that the mke2fs code
> doesn't do ANY checking on the parameter, so you could conceivably make
> the inode size SMALLER than the current size, and this would DEFINITELY
> be bad as well.

In some sense it does - it dies if you've passed it not a power of two ;-)
I don't think that segfault is a good way to report the problem, though...

Problem with mkfs is obvious, but kernel side is also shady - we could
have cleaner code if we assumed that inode size is a power of 2. As it
is, we have code in read_super() that checks for size == 128 _and_
code that was apparently written on the assumption that it can be not a
power of 2. However, if that was really the goal, we fail - code
in ext2_read_inode() actually would break with such sizes.

In other words, the real question is what the hell are we trying to
do there. If we want code that deals with sizes that are not powers of 2
we need to change ext2_read_inode() and friends. It wouldn't be
hard. OTOH, if we guarantee that inode size will always remain a power of
2 we can simplify the thing. In any case current situation doesn't
make much sense. The only question is direction of fix.

Could those who introduced ->s_inode_size tell what use had been intended?

> mke2fs will always set up the filesystem this way, and there is no real
> reason NOT to do that.  If you are using a filesystem block for the inode
> table, it is pointless to leave part of it empty, because you can't use
> it for anything else anyways.

It's not that simple - if you need 160 bytes per inode rounding it up
to the next power of two will lose a lot. On 4Kb fs it will be
16 inodes per block instead of 25 - 36% loss...
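
The arithmetic behind that figure, as a small sketch (helper names are
hypothetical): 4096/160 gives 25 inodes per block, rounding 160 up to 256
gives 4096/256 = 16, and (25 - 16)/25 is 36%.

```c
/* Smallest power of two that is >= n (for n > 0). */
static unsigned int round_up_pow2(unsigned int n)
{
    unsigned int p = 1;

    while (p < n)
        p <<= 1;
    return p;
}

/* Whole inodes that fit in one block; the tail of the block is padding. */
static unsigned int inodes_per_block(unsigned int block_size,
                                     unsigned int inode_size)
{
    return block_size / inode_size;
}

/* Percentage of inode slots lost by rounding inode_size up to a power of 2. */
static unsigned int pow2_loss_percent(unsigned int block_size,
                                      unsigned int inode_size)
{
    unsigned int exact = inodes_per_block(block_size, inode_size);
    unsigned int padded = inodes_per_block(block_size,
                                           round_up_pow2(inode_size));

    return (exact - padded) * 100 / exact;
}
```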




Re: light weight user level semaphores

2001-04-19 Thread Alexander Viro



On Thu, 19 Apr 2001, Linus Torvalds wrote:

> 
> 
> On Thu, 19 Apr 2001, Alexander Viro wrote:
> >
> > Ehh... Non-lazy variant is just read() and write() as down_failed() and
> > up_wakeup(). Lazy... How about
> 
> Looks good to me. Anybody want to try this out and test some benchmarks?

Ugh. It doesn't look good for me. s/MAX_INT/MAX_INT>>1/ or we will
get into trouble on anything that goes into spin_and_lose. Window is
pretty narrow (notice that lost_it is OK - we only need to worry
about somebody coming in after winner drives Lock from 1 to 0
and before it gets it from 0 to MAX_INT), but we can get into serious
trouble if schedule() will hit that window.

MAX_INT/2 should be enough to deal with that, AFAICS.

However, I would _really_ like to get that code reviewed from the memory
access ordering POV. Warning: right now I'm half-asleep, so the thing can
very well be completely bogus in that area. Extra eyes would be certainly
welcome.

Al

PS: ->Lock should be set to 1 when we initialize semaphore. Destroying
semaphore should do
if (sem->Lock > 1) {
close(sem->writer);
close(sem->reader);
}




Re: active after unmount?

2001-04-19 Thread Alexander Viro



On Thu, 19 Apr 2001, Matthew Jacob wrote:

> 
> 'kay, great, thanks.. I'll put it in and see if things show up again

Guys, it's a known bug, fixed in 2.4.4-pre3. See the change to fs/super.c
between 2.4.4-pre2 and 2.4.4-pre3 - it's quite small.




Re: light weight user level semaphores

2001-04-19 Thread Alexander Viro



On 19 Apr 2001, Ulrich Drepper wrote:

> Linus Torvalds <[EMAIL PROTECTED]> writes:
> 
> > I'm not interested in re-creating the idiocies of Sys IPC.
> 
> I'm not talking about sysv semaphores (couldn't care less).  And you
> haven't read any of the mails with examples I sent.
> 
> If the new interface can be useful for anything it must allow to
> implement process-shared POSIX mutexes.

Pardon me the bluntness, but... Why?
* on _any_ UNIX we can implement semaphore (object that has Dijkstra's
P and V operations, whatever) shared by processes that have access to pipe.
In a portable way. That's the part of pipe semantics that had been there
since way before v6. Pre-sysv, pre-POSIX, etc. When named pipes appeared
the same semantics had been carried to them. Agreed so far?
* if we have shared memory _and_ some implementation of semaphores
we can (on architectures that allow atomic_dec() and atomic_inc()) produce
semaphores that work via memory access in uncontended case and use slow
semaphores to handle contention side of the business. Nothing UNIX-specific
here.
* such objects _are_ useful. They are reasonably portable and
if they fit the task at hand and are cheaper than POSIX mutexes - that's
all rationale one could need for using them.

Sure, the variant I've posted was intra-process only, simply because it
uses normal pipes. Implementation with named pipes is also trivial -
when you map the shared area, allocate private one of the corresponding
size and keep descriptors there. End of story.

AFAICS mechanism is portable enough (and even on the architectures that
do not allow atomic userland operations we can survive - just fall back
to "slow" ones via read()/write() on pipes).  And excuse me, but when
one writes an application code the question is not "how to make it use
POSIX semaphores", it's "how to get the serialization I need in a
portable way".




Re: ext2 inode size (on-disk)

2001-04-19 Thread Alexander Viro



On Thu, 19 Apr 2001, Andreas Dilger wrote:

> Strange, I run "mke2fs -I 192 /dev/hdc2" and do not have a segfault or any
> problems with e2fsck or debugfs on the resulting filesystem.  I'm running
> 1.20-WIP, but I don't think anything was changed in this area for some time.
 
May depend on the libc version/size of device/phase of the moon. I've
got segfaults with 1.18, 1.19 and 1.20-WIP on a Debian box with glibc
2.1.3-18 and 20Mb image. What really happens is memory corruption in
libext2fs (ext2_write_inode()); the segfault comes later (usually in free()).

> Basically, packing inodes across block boundaries is TOTALLY broken.
> This can lead to all sorts of data corruption issues if one block is
> written to disk, and the other is not.  For that matter, the same would

Yup.

> PS - is this a code cleanup issue, or do you have some reason that you want
>  to increase the inode size?

Code cleanup
Al




Re: [Ext2-devel] ext2 inode size (on-disk)

2001-04-19 Thread Alexander Viro



On Thu, 19 Apr 2001 [EMAIL PROTECTED] wrote:

> This was a project that was never completed.  I thought at one point
> of allowing the inode size to be not a power of 2, but if you do that,
> you really want to avoid letting an inode cross a block boundary ---
> for reliability and performance reasons if nothing else.   

Agreed.

> In the long run, it probably makes sense to adjust the algorithms to
> allow for non-power-of-two inode sizes, but require an incompatible
> filesystem feature flag (so that older kernels and filesystem
> utilities won't choke when mounting filesystems with non-standard
> sized inodes).

I don't think that it's needed - old kernels (up to -CURRENT ;-) will
simply refuse to mount if ->s_inode_size != 128. Old utilities may be
trickier, though...

I'm somewhat concerned about the following: last block of inode table
fragment may have less inodes than the rest. Reason: number of inodes
per group should be a multiple of 8 and with inodes bigger than 128
bytes it may give such effect. Comments?

I would really, really like to end up with accurate description of
inode table layout somewhere in Documentation/filesystems. Heck, I
volunteer to write it down and submit into the tree ;-)
Al




Re: Fix for SMP deadlock in autofs4

2001-04-20 Thread Alexander Viro



On Fri, 20 Apr 2001, Linus Torvalds wrote:

> Why are we doing the mntget/dget at all? We hold the spinlock, so we know
> they are not going away. Not doing the mntget/dget means that we (a) run
> faster and (b) don't have the bug, because we don't need to put the damn
> things.
> 
> Comments?

It looks like you are right, but I wonder how the hell did that code
happen at all. Looks like somewhere around 2.4.0-test10-pre* dcache_lock
was moved out of is_tree_busy() and covered dget/dput. Hmm... Might be
my fault - I don't remember doing that, but...
 
Anyway, it looks like in that case we can forget about games with
->d_count/->mnt_count. Other cases when we do "safe" dput() under
spinlocks are done under _different_ spinlocks, so they are not
a problem.
 
Removing that will require an obvious change in is_tree_busy() (shift
count by 1). However, the real question is WTF are we trying to 
get in autofs4_expire() - it returns dentry without grabbing a
reference to it. The only thing that saves us is that we have a
ramfs-style situation (dentries are pinned until we rmdir) and
everything up to the point where we silently forget about dentry
is covered by BKL. Since ->rmdir() is under BKL too it's enough,
but... Eww... 

Jeremy, what are you really trying to do there? is_tree_busy()
seems to be written in assumption that mnt/dentry is not a
mountpoint but root of a subtree with something mounted on its
leaves. And autofs4_expire() traverses the list of root's
subdirectories, picks one that has nothing busy mounted in
_its_ subdirectories and essentially pass the name to caller.
Which sends that name (of first-level subdirectory) to
userland.

Is that what you really want there? It looks very odd - why don't we pass
the names of actual mountpoints? What's wrong with the case when foo/bar
is busy, but foo/baz is not?
Al





Re: Fix for SMP deadlock in autofs4

2001-04-20 Thread Alexander Viro



On Fri, 20 Apr 2001, Jeremy Fitzhardinge wrote:

> This is a fix for a potential deadlock in autofs4's expire routine.
> It tries to use dput() while holding the dcache_lock.  This isn't a
> problem in principle since dput() should only try to take the dcache_lock
> when the counter makes a transition to zero, which can't happen in
> this case.  Unfortunately the generic (and only) implementation of
> atomic_dec_and_lock always takes the lock, so deadlocks.

Frankly, I'd rather add dput_locked() in dcache.c. The bug is real and
since autofs4 is not the only place like that... I'll look into that
stuff.
Al
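
The deadlock in the quoted text comes from a dec-and-lock helper that takes
the lock even when the count cannot reach zero. Below is a userspace sketch
of a variant with a lock-free fast path, assuming C11 atomics and a toy
spinlock. All names are hypothetical and this is not the kernel
implementation.

```c
#include <stdatomic.h>

typedef struct {
    atomic_flag held;
} toy_spinlock;

static void toy_lock(toy_spinlock *l)
{
    while (atomic_flag_test_and_set(&l->held))
        ;                           /* spin until the holder clears it */
}

static void toy_unlock(toy_spinlock *l)
{
    atomic_flag_clear(&l->held);
}

/*
 * Decrement *count. Returns 1 with the lock held if the count dropped
 * to zero, 0 (lock not held) otherwise.
 */
static int dec_and_lock(atomic_int *count, toy_spinlock *l)
{
    int old = atomic_load(count);

    /* Fast path: the result stays positive, so the lock is never touched. */
    while (old > 1) {
        if (atomic_compare_exchange_weak(count, &old, old - 1))
            return 0;
        /* on failure old was reloaded; re-check the loop condition */
    }

    /* Slow path: we may be the one dropping the count to zero. */
    toy_lock(l);
    if (atomic_fetch_sub(count, 1) == 1)
        return 1;
    toy_unlock(l);
    return 0;
}
```

With a fast path like this, a caller that already holds the lock and knows
the count is above 1 does not deadlock, which is the situation described
for autofs4.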




Re: Fix for SMP deadlock in autofs4

2001-04-20 Thread Alexander Viro



On Fri, 20 Apr 2001, Linus Torvalds wrote:

> 
> 
> On Fri, 20 Apr 2001, Jeremy Fitzhardinge wrote:
> >
> > I kept the dget/put out of caution and ignorance, but they're clearly
> > problematic.  I'm happy to drop them if holding dcache_lock is enough
> > to keep the tree stable while I traverse it.
> 
> How does this patch look to you people?
> 
> It's untested, but looks fairly obvious. It removes the increment, and
> changes autofs4_expire() to properly bump the count of the returned dentry
> (and callers will dput() it when done). This may be unnecessarily careful,
> but it's the RightThing(tm) to do.

Looks sane for me. However, I would add check for dentry being hashed and
would skip the unhashed ones. Otherwise you can get a directory that
had been removed but is still busy - doesn't look like a right thing to
do. Jeremy?
Al




Races in affs_unlink(), affs_rmdir() and affs_rename()

2001-04-21 Thread Alexander Viro

mkdir /A
mkdir /B
mkdir /C
touch /A/a
ln /A/a /B/b
ln /A/a /C/c
rm /A/a &
rm /B/b

can corrupt filesystem. Scenario:

unlink("/B/b") locks /B, removes "b" and unlocks /B. Then it calls
affs_remove_link(), which blocks.

unlink("/A/a") locks /A, removes "a" and unlocks /A. Then it calls
affs_remove_link(). Which locks /B, renames removed entry into "b",
removes old "b" and inserts renamed "a" into /B.

The rest is irrelevant - we're already in it.

Similar race exists between unlink() and rename():

mkdir /A
mkdir /B
mkdir /C
touch /A/a
touch /B/a
ln /A/a /B/b
ln /A/a /C/c
rm /A/a &
mv /B/a /B/b
- similar scenario, different source of affs_remove_header().

Another one: unlink() and rmdir():
mkdir /A
mkdir /B
touch /A/a
ln /A/a /B/a
rm /A/a &
rmdir /B

Since you don't lock /B for affs_empty_dir(), you can hit the
window between removing old /B/a and inserting renamed /A/a into /B.
Notice that VFS _does_ lock /B (->i_zombie), but affs_remove_link()
for /A/a doesn't even look at it.

Same thing for rename()/rmdir() (rmdir victim contains a link to rename
target, apply the previous scenario).
Al




Re: Request for comment -- a better attribution system

2001-04-21 Thread Alexander Viro



On Sat, 21 Apr 2001, Albert D. Cahalan wrote:

> Eric S. Raymond writes:
> 
> > This is a proposal for an attribution metadata system in the Linux
> > kernel sources.  The goal of the system is to make it easy for
> > people reading any given piece of code to identify the responsible
> > maintainer.  The motivation for this proposal is that the present
> > system, a single top-level MAINTAINERS file, doesn't seem to be
> > scaling well.
> 
> It is nice to have a single file for grep. With the proposed
> changes one would sometimes need to grep every file.

The real problem is that a large part of the kernel has no permanent
maintainers. Which makes the whole (overdesigned) idea completely moot.




Re: light weight user level semaphores

2001-04-22 Thread Alexander Viro



On Sun, 22 Apr 2001, Alon Ziv wrote:

> Well, that's the reason for my small-negative-integer semaphore-FD idea...
> (It won't support select() easily, but poll() is prob'ly good enough)
> Still, there is the problem of read()/write()/etc. semantics; sure, we can
> declare that 'negative FDs' have their own semantics which just happen to
> include poll(), but it sure looks like a kludge...

You _still_ don't get it. The question is not "how to add magic kernel
objects that would look like descriptors and support a bunch of
ioctls, allowing us to do semaphores", it's "do we need semaphores
to be kernel-level objects". An implementation with pipes avoids
the magic crap - they are real, normal pipes - nothing special from
the kernel POV. read(), write(), etc. are nothing but reading and writing
for pipes.
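
To make the pipe idea concrete, here is a minimal userland sketch of a counting semaphore built on an ordinary pipe. It is not code from this thread: the names (struct pipe_sem, pipe_sem_init() and friends) are made up for illustration, and it assumes the initial count fits within the pipe's capacity so that preloading tokens never blocks.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical semaphore: each byte sitting in the pipe is one token. */
struct pipe_sem {
    int rfd;    /* read end: down() takes a token from here */
    int wfd;    /* write end: up() puts a token here */
};

/* Release: put one token into the pipe. Returns 0 on success. */
int pipe_sem_up(struct pipe_sem *s)
{
    char token = 0;
    ssize_t n;

    do {
        n = write(s->wfd, &token, 1);
    } while (n < 0 && errno == EINTR);
    return n == 1 ? 0 : -1;
}

/* Acquire: take one token, blocking in read() while the pipe is empty. */
int pipe_sem_down(struct pipe_sem *s)
{
    char token;
    ssize_t n;

    do {
        n = read(s->rfd, &token, 1);
    } while (n < 0 && errno == EINTR);
    return n == 1 ? 0 : -1;
}

/* Create the pipe and preload `count` tokens (assumed < pipe capacity). */
int pipe_sem_init(struct pipe_sem *s, unsigned count)
{
    int fds[2];

    assert(s != NULL);
    if (pipe(fds) < 0)
        return -1;
    s->rfd = fds[0];
    s->wfd = fds[1];
    while (count--) {
        if (pipe_sem_up(s) < 0) {
            close(s->rfd);
            close(s->wfd);
            return -1;
        }
    }
    return 0;
}

void pipe_sem_destroy(struct pipe_sem *s)
{
    close(s->rfd);
    close(s->wfd);
}
```

Since both ends are plain descriptors, poll() and select() on the read end work with no special support, and the kernel sees nothing but a pipe.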




Re: hundreds of mount --bind mountpoints?

2001-04-22 Thread Alexander Viro



On Sun, 22 Apr 2001, David L. Parsley wrote:

> Hi,
> 
> I'm still working on a packaging system for diskless (quasi-embedded)
> devices.  The root filesystem is all tmpfs, and I attach packages inside
> it.  Since symlinks in a tmpfs filesystem cost 4k each (ouch!), I'm
> considering using mount --bind for everything.  This appears to use very
> little memory, but I'm wondering if I'll run into problems when I start
> having many hundreds of bind mountings.  Any feel for this?

Memory use is sizeof(struct vfsmount) per binding. In principle, you can get
in trouble when the size of /proc/mounts gets past 4Kb - you'll get only
the first 4 (actually 3, IIRC) kilobytes, so stuff that relies on the contents
of said file may get unhappy. It's fixable, though.




Re: Request for comment -- a better attribution system

2001-04-22 Thread Alexander Viro



On Sun, 22 Apr 2001, Eric S. Raymond wrote:

> Alexander Viro <[EMAIL PROTECTED]>:
> > Eric, it would save everyone a lot of time if you actually cared to
> > pull your head out of your... theoretical constructions and spent
> > some efforts figuring out how the things really work.
> 
> I've had my nose rubbed in how things really work.  That's why I want to
> fix the things that are broken about how things really work.

Sigh... Would these broken things, by any chance, be "my grand ideas are
not met with applause"?

Take it from a guy who has done quite a few global changes: they are pretty
much doable, but spamming maintainers with requests to support your k3wl
ideas is not the way to go. All you are getting that way is a bunch of procmail
rules.

Everyone who has been on l-k for more than a couple of months has seen
$BIGNUM of "visionary" lusers with grand schemes of Changing The World(tm)
and monumental lack of desire to learn. Until you demonstrate that you
understand what you are "fixing" - don't expect special treatment.




Re: Request for comment -- a better attribution system

2001-04-22 Thread Alexander Viro



On Sun, 22 Apr 2001, Eric S. Raymond wrote:

> Alexander Viro <[EMAIL PROTECTED]>:
> > Sigh... Would these broken things, by any chance, be "my grand ideas are
> > not met with applause"?
> 
> Nope.  Not at all.  Stay tuned, because I'll explain.
> 
> And before you write me off as one of the $BIGNUM clueless
> visionaries, you might do well to remember that I actually *have*
> radically changed the world lkml operates in.  At least twice.

So has a certain wa.us-based company. If you refer to your "Cathedral
and Bazaar" - pardon the bluntness, but it doesn't speak well of your
clue level.  L-k is not a place for detailed analysis of that text, so let
me just point to the fact that
* you've ignored the robustness of design behind the UNIX kernel.
These beasts keep going without falling apart even after serious injuries.
* you've ignored another factor - maintainer with a taste and ability
to say "no".
* you've made a completely unwarranted assumption - that widely-used
and available code actually gets reviewed by many people.  It's demonstrably
false.

Ability to do PR != having a shred of clue in other areas.
I'm sure that you can come up with relevant examples yourself.




Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread Alexander Viro



On Mon, 23 Apr 2001, David L. Parsley wrote:

> What I'm not sure of is which solution is actually 'better' - I'm
> guessing that performance-wise, neither will make a noticable
> difference, so I guess memory usage would be the deciding factor.  If I

Bindings are faster on lookup. For obvious reasons - in case of symlinks
you do name resolution every time you traverse the link; in case of
bindings it is done when you create them.




Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread Alexander Viro



On Mon, 23 Apr 2001, Ingo Oeser wrote:

> Hi Chris,
> 
> On Mon, Apr 23, 2001 at 04:54:02PM +0200, Christoph Rohland wrote:
> > > The question is: How? If you do it like ramfs, you cannot swap
> > > these symlinks and this is effectively a mlock(symlink) operation
> > > allowed for normal users. -> BAD!
> > 
> > How about storing it into the inode structure if it fits into the
> > fs-private union? If it is too big we allocate the page as we do it
> > now. The union has 192 bytes. This should be sufficient for most
> > cases.
> 
> Great idea. We allocate this space anyway. And we don't have to
> care about the internals of this union, because never have to use
> it outside the kernel ;-)
> 
> I like it. ext2fs does the same, so there should be no VFS
> hassles involved. Al?

We should get ext2 and friends to move the sucker _out_ of struct inode.
As it is, sizeof(struct inode) is way too large. This is 2.5 stuff, but
it really has to be done. More filesystems adding stuff into the union
is a Bad Thing(tm). If you want to allocate space - allocate it yourself;
->clear_inode() is the right place for freeing it.




Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread Alexander Viro



On Tue, 24 Apr 2001, Ingo Oeser wrote:

> We have this kind of stuff all over the place. If we allocate
> some small amount of memory and and need some small amount
> associated with this memory, there is no problem with a little
> waste.

Little? How about a quarter of a kilobyte per inode? sizeof(struct inode)
is nearly half a kilobyte. And icache can easily get to ~10^5 elements.

> Waste is better than fragmentation. This is the lesson people
> learned from segments in the ia32.
> 
> Objects are easier to manage, if they are the same size.

So don't keep them in the same cache. Notice that quite a few systems
keep vnode separately from fs-specific data. For a very good reason.

Al




[PATCH][CFT] (updated) ext2 directories in pagecache

2001-04-23 Thread Alexander Viro



On Thu, 12 Apr 2001, Alexander Viro wrote:

>   Folks, IMO ext2-dir-patch got to the stable stage. Currently
> it's against 2.4.4-pre2, but it should apply to anything starting with
> 2.4.2 or so.
> 
>   Ted, could you review it for potential inclusion into 2.4 once
> it gets enough testing? It's ext2-only (the only change outside of
> ext2 is exporting waitfor_one_page()), it doesn't change fs layout,
> it seriously simplifies ext2/dir.c and ext2/namei.c and it gives better
> VM behaviour.

Previous variant left junk in ->d_type of directory entries
on "old" filesystems (i.e. ones where it should be zeroed). Harmless
(on these filesystems readdir() returned DT_UNKNOWN anyway), but
it PO'd fsck and was the wrong thing anyway.

Fixed and rediffed against current tree (2.4.4-pre6). Folks,
please help with testing.

Patch is on ftp.math.psu.edu/pub/viro/ext2-dir-patch-S4-pre6.gz

Cheers,
Al




Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread Alexander Viro



On Mon, 23 Apr 2001, Richard Gooch wrote:

> - keep a separate VFSinode and FSinode slab cache

Yup.

> - allocate an enlarged VFSinode that contains the FSinode at the end,
>   with the generic pointer in the VFSinode part pointing to FSinode
>   part.

Please, don't. It would help with bloat only if you allocated these
beasts separately for each fs and then you end up with _many_ allocators
that can generate a pointer to struct inode.

"One type - one allocator" is a good rule - violating it turns into a major
PITA a couple of years down the road 9 times out of 10.




[CFT][PATCH] namespaces patch (2.4.4-pre6)

2001-04-23 Thread Alexander Viro



Folks, updated namespace patch is on
ftp.math.psu.edu/pub/viro/namespaces-c-S4-pre6.gz
 
News:
* ported to 2.4.4-pre6
* fixes for d_flags races (already in -ac, hopefully will go into
the main tree soon)
* fixes for sync_inodes()/kill_super() races (submitted to Linus
and Alan, hopefully will go into the tree soon)
* killed low-memory deadlocks between {u,re,}mount and kswapd.
* further cleanup of fs/super.c

It works here. Please, help with testing. The patch has grown somewhat, but
the new pieces are fixes for bugs present in the main tree, and these
fixes have been submitted for inclusion in 2.4, so hopefully it will
shrink again.
Cheers,
Al




Re: light weight user level semaphores

2001-04-23 Thread Alexander Viro



On 24 Apr 2001, David Wagner wrote:

> Linus Torvalds  wrote:
> >Ehh.. I will bet you $10 USD that if libc allocates the next file
> >descriptor on the first "malloc()" in user space (in order to use the
> >semaphores for mm protection), programs _will_ break.
> >
> >You want to take the bet?
> 
> Good point.  Speaking of which:
>   ioctl(fd, UIOCATTACHSEMA, ...);
> seems to act like dup(fd) if fd was opened on "/dev/usemaclone"
> (see drivers/sgi/char/usema.c).  According to usema(7), this is
> intended to help libraries implement semaphores.
> 
> Is this a bad coding?

Yes. Not to mention side effects, it's just plain ugly. Anyone who invents
identifiers of _that_ level of ugliness should be forced to read them
aloud for a week or so, until somebody shoots him out of mercy.
Out of curiosity: who was the author? It looks unusually nasty, even for
SGI.




Re: hundreds of mount --bind mountpoints?

2001-04-23 Thread Alexander Viro



On Mon, 23 Apr 2001, Jan Harkes wrote:

> On Mon, Apr 23, 2001 at 10:45:05PM +0200, Ingo Oeser wrote:

> > BTW: Is it still less than one page? Then it doesn't make me
> >nervous. Why? Guess what granularity we allocate at, if we
> >just store pointers instead of the inode.u. Or do you like
> >every FS creating his own slab cache?

Oh, for crying out loud. All it takes is half an hour per filesystem.
Here - completely untested patch that does it for NFS. Took about that
long. Absolutely straightforward, very easy to verify correctness.

Some stuff may need tweaking, but not much (e.g. some functions
should take nfs_inode_info instead of inodes, etc.). From the look
of flushd cache it seems that we would be better off with cyclic
lists instead of single-linked ones for the hash, but I didn't look
deep enough.

So consider the patch below as proof-of-concept. Enjoy:

diff -urN S4-pre6/fs/nfs/flushd.c S4-pre6-nfs/fs/nfs/flushd.c
--- S4-pre6/fs/nfs/flushd.c Sat Apr 21 14:35:21 2001
+++ S4-pre6-nfs/fs/nfs/flushd.c Mon Apr 23 22:23:11 2001
@@ -162,11 +162,11 @@
 
if (NFS_FLAGS(inode) & NFS_INO_FLUSH)
goto out;
-   inode->u.nfs_i.hash_next = NULL;
+   NFS_I(inode)->hash_next = NULL;
 
q = &cache->inodes;
while (*q)
-   q = &(*q)->u.nfs_i.hash_next;
+   q = &NFS_I(*q)->hash_next;
*q = inode;
 
/* Note: we increase the inode i_count in order to prevent
@@ -188,9 +188,9 @@
 
q = &cache->inodes;
while (*q && *q != inode)
-   q = &(*q)->u.nfs_i.hash_next;
+   q = &NFS_I(*q)->hash_next;
if (*q) {
-   *q = inode->u.nfs_i.hash_next;
+   *q = NFS_I(inode)->hash_next;
NFS_FLAGS(inode) &= ~NFS_INO_FLUSH;
iput(inode);
}
@@ -238,8 +238,8 @@
cache->inodes = NULL;
 
while ((inode = next) != NULL) {
-   next = next->u.nfs_i.hash_next;
-   inode->u.nfs_i.hash_next = NULL;
+   next = NFS_I(next)->hash_next;
+   NFS_I(inode)->hash_next = NULL;
NFS_FLAGS(inode) &= ~NFS_INO_FLUSH;
 
if (flush) {
diff -urN S4-pre6/fs/nfs/inode.c S4-pre6-nfs/fs/nfs/inode.c
--- S4-pre6/fs/nfs/inode.c  Sat Apr 21 14:35:21 2001
+++ S4-pre6-nfs/fs/nfs/inode.c  Mon Apr 23 22:43:45 2001
@@ -40,11 +40,14 @@
 #define NFSDBG_FACILITY		NFSDBG_VFS
 #define NFS_PARANOIA 1
 
+static kmem_cache_t *nfs_inode_cachep;
+
 static struct inode * __nfs_fhget(struct super_block *, struct nfs_fh *, struct nfs_fattr *);
 void nfs_zap_caches(struct inode *);
 static void nfs_invalidate_inode(struct inode *);
 
 static void nfs_read_inode(struct inode *);
+static void nfs_clear_inode(struct inode *);
 static void nfs_delete_inode(struct inode *);
 static void nfs_put_super(struct super_block *);
 static void nfs_umount_begin(struct super_block *);
@@ -52,6 +55,7 @@
 
 static struct super_operations nfs_sops = { 
read_inode: nfs_read_inode,
+   clear_inode:nfs_clear_inode,
put_inode:  force_delete,
delete_inode:   nfs_delete_inode,
put_super:  nfs_put_super,
@@ -96,23 +100,44 @@
 static void
 nfs_read_inode(struct inode * inode)
 {
+   struct nfs_inode_info *nfsi;
+
+   nfsi = kmem_cache_alloc(nfs_inode_cachep, GFP_KERNEL);
+   if (!nfsi)
+   goto Enomem;
+
inode->i_blksize = inode->i_sb->s_blocksize;
inode->i_mode = 0;
inode->i_rdev = 0;
+   inode->u.generic_ip = nfsi;
NFS_FILEID(inode) = 0;
NFS_FSID(inode) = 0;
NFS_FLAGS(inode) = 0;
-   INIT_LIST_HEAD(&inode->u.nfs_i.read);
-   INIT_LIST_HEAD(&inode->u.nfs_i.dirty);
-   INIT_LIST_HEAD(&inode->u.nfs_i.commit);
-   INIT_LIST_HEAD(&inode->u.nfs_i.writeback);
-   inode->u.nfs_i.nread = 0;
-   inode->u.nfs_i.ndirty = 0;
-   inode->u.nfs_i.ncommit = 0;
-   inode->u.nfs_i.npages = 0;
+   INIT_LIST_HEAD(&nfsi->read);
+   INIT_LIST_HEAD(&nfsi->dirty);
+   INIT_LIST_HEAD(&nfsi->commit);
+   INIT_LIST_HEAD(&nfsi->writeback);
+   nfsi->nread = 0;
+   nfsi->ndirty = 0;
+   nfsi->ncommit = 0;
+   nfsi->npages = 0;
NFS_CACHEINV(inode);
NFS_ATTRTIMEO(inode) = NFS_MINATTRTIMEO(inode);
NFS_ATTRTIMEO_UPDATE(inode) = jiffies;
+   return;
+
+Enomem:
+   make_bad_inode(inode);
+   return;
+}
+
+static void
+nfs_clear_inode(struct inode * inode)
+{
+   struct nfs_inode_info *p = NFS_I(inode);
+   inode->u.generic_ip = NULL;
+   if (p)
+   kmem_cache_free(nfs_inode_cachep, p);
 }
 
 static void
@@ -594,7 +619,7 @@
NFS_CACHE_ISIZE(inode) = fattr->size;
NFS_ATTRTIMEO(inode) = NFS_MINATTRTIMEO(inode);
NFS_ATTRTIMEO_UPDATE(inode) = jiffies;
-   memcpy(&inode->u.nfs_i.fh, fh, sizeof(inode->u.nfs_i.fh));
+   me

Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread Alexander Viro



On Mon, 23 Apr 2001, Andreas Dilger wrote:

> Al posted a patch to the NFS code which removes nfs_inode_info from the
> inode union.  Since it is (AFAIK) the largest member of the union, we
> have just saved 24 bytes per inode (hfs_inode_info is also rather large).
> If we removed hfs_inode_info as well, we would save 108 bytes per inode,
> about 22% ({ext2,affs,ufs}_inode_info are all about the same size).

For fsck sake! HFS patch. Time: 14 minutes, including checking that sucker
builds (it had most of the accesses to ->u.hfs_i already encapsulated).

What I really don't understand is why the hell people keep coming up
with the grand and convoluted plans of removing the inode bloat and
nobleedinone of them actually cared to sit down and do the simplest variant
possible.

I can certainly go through the rest of the filesystems and even do the testing
for most of them, but WTF? Could the rest of you please join the show?
It's not fscking rocket science - encapsulate accesses to ->u.foofs_i
into an inlined function, find ->read_inode, find places that do get_empty_inode()
or new_inode(), add allocation there, add freeing to ->clear_inode()
(defining one if needed), change that inlined function so that it would
return ->u.generic_ip and you are done. Clean the results up and test
them. Furrfu...
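
The recipe above can be modelled in plain userland C. The sketch below is only an illustration under stated assumptions: struct inode is a toy stand-in with just the fields the recipe touches, foofs and struct foofs_inode_info are hypothetical, and malloc()/free() stand in for the slab cache calls a real filesystem would use.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for the kernel's struct inode. */
struct inode {
    unsigned long i_ino;
    union {
        void *generic_ip;   /* all a converted filesystem needs in ->u */
    } u;
};

/* Hypothetical fs-private part that used to live inside the union. */
struct foofs_inode_info {
    unsigned long mmu_private;
    int flags;
};

/* Step 1: every access to the private part goes through one accessor. */
static inline struct foofs_inode_info *FOOFS_I(struct inode *inode)
{
    return inode->u.generic_ip;
}

/* Step 2: allocate wherever the fs sets an inode up (->read_inode etc.). */
int foofs_read_inode(struct inode *inode)
{
    struct foofs_inode_info *fi;

    assert(inode != NULL);
    fi = malloc(sizeof(*fi));
    if (!fi)
        return -1;              /* kernel code would mark the inode bad */
    memset(fi, 0, sizeof(*fi)); /* the allocator does not zero it for us */
    inode->u.generic_ip = fi;
    return 0;
}

/* Step 3: free the private part in ->clear_inode(). */
void foofs_clear_inode(struct inode *inode)
{
    struct foofs_inode_info *fi = FOOFS_I(inode);

    inode->u.generic_ip = NULL;
    free(fi);                   /* free(NULL) is a no-op */
}
```

Because callers only ever see FOOFS_I(), switching the storage from the union member to a separate allocation changes one function and the two hooks, nothing else.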

It's not like it was a global change that affected the whole kernel -
at every step changes are local to one filesystem and changes for
different filesystems are independent from each other. If at some point
in 2.5 .generic_ip is the only member of union - fine, we just do
%s/u.generic_ip/fs_inode/g
or something like that. Moreover, if maintainer of filesystem foo is
OK with change it _can_ be done in 2.4 - it doesn't affect anything
outside of foofs.

Guys, doing all these patches is ~20 man-hours. And that's a bloody generous
estimate. Looking through the results and doing the necessary tweaking
(as in "hmm... we keep passing a pointer to inode through the long chain
of functions and all of them need only the fs-specific part", etc.) - about
the same. Verifying that the thing wasn't fucked up - maybe an hour or two of
audit per filesystem (split the patch into the encapsulation part - trivial
to verify - and the rest - pretty small). Grrr...

Oh, well... Initial HFS patch follows:

diff -urN S4-pre6/fs/hfs/inode.c S4-pre6-hfs/fs/hfs/inode.c
--- S4-pre6/fs/hfs/inode.c  Fri Feb 16 22:55:36 2001
+++ S4-pre6-hfs/fs/hfs/inode.c  Tue Apr 24 05:10:21 2001
@@ -231,7 +231,7 @@
 static int hfs_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to)
 {
return cont_prepare_write(page,from,to,hfs_get_block,
-   &page->mapping->host->u.hfs_i.mmu_private);
+   &HFS_I(page->mapping->host)->mmu_private);
 }
 static int hfs_bmap(struct address_space *mapping, long block)
 {
@@ -309,7 +309,7 @@
return NULL;
}
 
-   if (inode->i_dev != sb->s_dev) {
+   if (inode->i_dev != sb->s_dev || !HFS_I(inode)) {
iput(inode); /* automatically does an hfs_cat_put */
inode = NULL;
} else if (!inode->i_mode || (*sys_entry == NULL)) {
@@ -373,7 +373,7 @@
inode->i_op = &hfs_file_inode_operations;
inode->i_fop = &hfs_file_operations;
inode->i_mapping->a_ops = &hfs_aops;
-   inode->u.hfs_i.mmu_private = inode->i_size;
+   HFS_I(inode)->mmu_private = inode->i_size;
} else { /* Directory */
struct hfs_dir *hdir = &entry->u.dir;
 
@@ -433,7 +433,7 @@
inode->i_op = &hfs_file_inode_operations;
inode->i_fop = &hfs_file_operations;
inode->i_mapping->a_ops = &hfs_aops;
-   inode->u.hfs_i.mmu_private = inode->i_size;
+   HFS_I(inode)->mmu_private = inode->i_size;
} else { /* Directory */
struct hfs_dir *hdir = &entry->u.dir;
 
@@ -479,7 +479,7 @@
inode->i_op = &hfs_file_inode_operations;
inode->i_fop = &hfs_file_operations;
inode->i_mapping->a_ops = &hfs_aops;
-   inode->u.hfs_i.mmu_private = inode->i_size;
+   HFS_I(inode)->mmu_private = inode->i_size;
} else { /* Directory */
struct hfs_dir *hdir = &entry->u.dir;
 
diff -urN S4-pre6/fs/hfs/super.c S4-pre6-hfs/fs/hfs/super.c
--- S4-pre6/fs/hfs/super.c  Sat Apr 21 14:35:20 2001
+++ S4-pre6-hfs/fs/hfs/super.c  Tue Apr 24 05:26:04 2001
@@ -35,6 +35,7 @@
 /* Forward declarations */
 
 static void hfs_read_inode(struct inode *);
+static void hfs_clear_inode(struct inode *);
 static void hfs_put_super(struct super_block *);
 static int hfs_statfs(struct super_block *, struct statfs *);
 static void hfs_write_super(struct super_block *);
@@ -43,6 +44,7 @@
 
 static struct super_operations hfs_super_operations = { 
read_inode: hfs_read_inode,
+   clear_inode:hfs_clear_inode

Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread Alexander Viro



On Tue, 24 Apr 2001, David Woodhouse wrote:

> 
> [EMAIL PROTECTED] said:
> >  Oh, for crying out loud. All it takes is half an hour per filesystem.
> 
> Half an hour? If it takes more than about 5 minutes for JFFS2 I'd be very
> surprised.

 What's stopping you? 
You _are_ JFFS maintainer, aren't you?




Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread Alexander Viro



On 24 Apr 2001, Christoph Rohland wrote:

> Hi Al,
> 
> On Tue, 24 Apr 2001, Alexander Viro wrote:
> >> Half an hour? If it takes more than about 5 minutes for JFFS2 I'd
> >> be very surprised.
> > 
> >  What's stopping you? 
> > You _are_ JFFS maintainer, aren't you?
> 
> So is this the start to change all filesystems in 2.4? I am not sure
> we should do that. 

Encapsulation part is definitely worth doing - it cleans the code up
and doesn't change the result of compile. Adding allocation/freeing/
cache initialization/cache removal and changing FOOFS_I() definition -
well, it's probably worth keeping such patches around, but whether
to switch any individual filesystem during 2.4 is a policy decision.
Up to the maintainer, indeed. Notice that these patches (separate allocation
per se) are going to be within 3-4Kb per filesystem _and_ completely
straightforward.

What I would like to avoid is scenario like

Maintainers of filesystems with large private inodes: Why would we separate
them? We would only waste memory, since the other filesystems stay in ->u
and keep it large.

Maintainers of the rest of filesystems: Since there's no patches that would
take large stuff out of ->u, why would we bother?

So yes, IMO having such patches available _is_ a good thing. And in 2.5
we definitely want them in the tree. If the encapsulation part gets there
during 2.4 and separate allocation is available for all of them it will
be easier to do without a PITA in the process.
Al





Re: [PATCH] Single user linux

2001-04-24 Thread Alexander Viro



On Tue, 24 Apr 2001 [EMAIL PROTECTED] wrote:

> a friend of my asked me on how to make linux easier to use
> for personal/casual win user.
> 
> i found out that one of the big problem with linux and most
> other operating system is the multi-user thing.

What, makes it hard to write viruses for it? Awww, poor skr1pt k1dd13z...

> i think, no personal computer user should know about what's
> an operating system idea of a user. they just want to use
> the computer, that's it.

And would that "use" by any chance include access to network?

> by a personal computer i mean home pc, notebook, tablet,
> pda, and communicator. only one user will use those devices,
> or maybe his/her friend/family. do you think that user want
> to know about user account?

So let him log in as root, do everything as root and be cracked
like a bloody moron he is. Next?

> from that, i also found out that it is very awkward to type
> username and password every time i use my computer.

So break your /sbin/login.

> so here's a patch. i also have removed the user_struct from
> my kernel, but i don't think you'd like #ifdef's.
> may be it'll be good for midori too.

[snip the patch that makes all user ids equivalent to root, but
doesn't remove networking support]

What for? If they want root - give them root and be done with that.
No need to change the kernel.

You know, if you really do not understand the implications of
running everything with permissions equivalent to root - get
the hell out of any UNIX-related programming until you learn.

If you want CP/M or MacOS - you know where to find them.




Re: [PATCH] Single user linux

2001-04-24 Thread Alexander Viro



On Tue, 24 Apr 2001 [EMAIL PROTECTED] wrote:

[snip long wankage]

Equivalent of your "patch" can be achieved by making login(1) and
friends let everyone in as root without asking password. End of
story. If you don't understand even _that_ - you don't understand
the bloody basics of the system and I certainly don't want to
deal with your code anywhere near the kernel.




Re: [PATCH] Single user linux

2001-04-24 Thread Alexander Viro



On Tue, 24 Apr 2001, Mohammad A. Haque wrote:

> [EMAIL PROTECTED] wrote:

[snip]
 
> Sounds to me like you really don't get the whole concept of permissions
> and that it's how Unix works.
> 
> Besides, why should the kernel do anythign different for you when there
> are userland tools that you can use to have the system auto-login as a
> specified user?

With apologies to Tom Lehrer...

Hooray for the Folk Song Army,
We will show you the way.
'Cause we all hate poverty, war, and injustice,
And chords that are too hard to play.




Re: [OFFTOPIC] Re: [PATCH] Single user linux

2001-04-24 Thread Alexander Viro



On Tue, 24 Apr 2001, Tomas Telensky wrote:

> of linux distributions the standard daemons (httpd, sendmail) are run as
> root! Having multi-user system or not! Why? For only listening to a port
> <1024? Is there any elegant solution?

Sendmail is old. Consider it as a remnant of times when the network was
more... friendly. Security considerations were mostly ignored - and
not only by sendmail. It used to be chock-full of holes. They were
essentially debugged out of it in the late 90s. It seems to be more or
less OK these days, but it's full of old cruft. And splitting the
thing into reasonable parts and leaving them with the minimal privileges
they need is a large and painful job.

There are alternatives (e.g. exim, or two unmentionable ones) that are
cleaner. Besides, there are some, erm, half-promises that next major
release of sendmail may be a big cleanup. Hell knows what will come out
of that.




Re: [OFFTOPIC] Re: [PATCH] Single user linux

2001-04-24 Thread Alexander Viro



On Tue, 24 Apr 2001, Mohammad A. Haque wrote:

> Correct. <1024 requires root to bind to the port.

... And nothing says that it should be done by daemon itself.
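
One way to read that: whoever starts the daemon can bind the port while still privileged, and the code that actually talks to the network runs without root. A rough sketch, assuming a launcher that calls the two functions below in order; listen_on() and drop_privileges() are hypothetical helper names, and binding to loopback is just to keep the example harmless.

```c
#define _DEFAULT_SOURCE         /* for setgroups() under strict -std modes */
#include <assert.h>
#include <grp.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Bind and listen on a TCP port; root is needed only when port < 1024. */
int listen_on(unsigned short port)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 16) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Give up root for good; the already-open socket stays usable. */
int drop_privileges(uid_t uid, gid_t gid)
{
    /* Supplementary groups can only be reset while we are still root. */
    if (geteuid() == 0 && setgroups(1, &gid) < 0)
        return -1;
    /* Group first: after setuid() we may no longer be allowed to. */
    if (setgid(gid) < 0 || setuid(uid) < 0)
        return -1;
    /* If we dropped to a non-root uid, getting root back must fail. */
    if (uid != 0 && setuid(0) == 0)
        return -1;
    assert(getuid() == uid && geteuid() == uid);
    return 0;
}
```

A launcher would call listen_on(25) as root, then drop_privileges() with the target uid and gid, and only then start accepting connections. Having inetd hand over an already-bound descriptor is the same idea with the binding done in a separate process.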




Re: [OFFTOPIC] Re: [PATCH] Single user linux

2001-04-24 Thread Alexander Viro



On Tue, 24 Apr 2001, Tomas Telensky wrote:

> Thanks for the comment. And why not just let it listen to 25 and then
> being run as uid=nobody, gid=mail?

Handling of .forward, for one thing. Or pipe aliases, or...

None of this stuff is unsolvable (e.g. handling of .forward belongs to
MDA, not MTA), but changing that will break existing setups.




Re: [OFFTOPIC] Re: [PATCH] Single user linux

2001-04-24 Thread Alexander Viro



On Tue, 24 Apr 2001, Alan Cox wrote:

> > On Tue, 24 Apr 2001, Mohammad A. Haque wrote:
> > > Correct. <1024 requires root to bind to the port.
> > ... And nothing says that it should be done by daemon itself.
> 
> Or that you shouldnt let inetd do it for you
> And that you shouldn't drop the capabilities except that bind
> 
> It is possible to implement the entire mail system without anything running
> as root but xinetd.

You want an MDA with elevated privileges, though...




Re: [OFFTOPIC] Re: [PATCH] Single user linux

2001-04-24 Thread Alexander Viro



On Tue, 24 Apr 2001, Alan Cox wrote:

> > > It is possible to implement the entire mail system without anything running
> > > as root but xinetd.
> > 
> > You want an MDA with elevated privileges, though...
                  ^^^
> What role requires priviledge once the port is open ?

.forward handling may, depending on how much you want to put into it.




Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread Alexander Viro



On Tue, 24 Apr 2001, Andreas Dilger wrote:

> One thing to watch out for is that the current code zeros the u. struct
> for us (as you pointed out to me previously), but allocating from the
> slab cache will not...  This could be an interesting source of bugs for
> some filesystems that assume zero'd inode_info structs.

True, but easy to catch.
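
A minimal userland sketch of the point above, assuming the private part of
the inode is allocated separately from the inode itself. The structure and
function names are hypothetical, and malloc() merely stands in for a slab
cache that has no constructor: memory that used to arrive zeroed as part of
the union now has to be cleared explicitly.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-filesystem private data; the field names are made up. */
struct demo_inode_info {
	unsigned long flags;
	void *private_data;
};

/*
 * Allocate the private part on its own and zero it by hand. A general
 * purpose allocator returns memory with arbitrary contents, so code that
 * used to rely on a pre-zeroed union must now clear the fields itself.
 */
struct demo_inode_info *demo_alloc_info(void)
{
	struct demo_inode_info *info = malloc(sizeof(*info));

	if (!info)
		return NULL;
	memset(info, 0, sizeof(*info));
	return info;
}
```

A filesystem that forgets the memset() step would see stale values in these
fields, which is the class of bug being warned about.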
 
> Well, if we get rid of NFS (50 x __u32) and HFS (44 * __u32) (sizes are
> approximate for 32-bit arches - I was just counting by hand and not
> strictly checking alignment), then almost all other filesystems are below
> 25 * __u32 (i.e. half of the previous size).

Yeah, but NFS suddenly takes 25+50 words... That's the type of complaints
I'm thinking about.
 
> Maybe the size of the union can depend on CONFIG_*_FS?  There should be
> an absolute minimum size (16 * __u32 or so), but then people who want
> reiserfs as their primary fs do not need to pay the memory penalty of ext2.
> For ext2 (the next largest and most common fs), we could make it part of
> the union if it is compiled in, and on a slab cache if it is a module?

NO. Sorry about shouting, but that's the way to madness. I can understand
code depending on SMP vs. UP and similar beasts, but presence of specific
filesystems

> Should uncommon-but-widely-used things like socket and shmem have their
> own slab cache, or should they just allocate from the generic size-32 slab?

That's pretty interesting - especially for sockets. I wonder whether
we would get problems with separate allocation of these - we don't
go from inode to socket all that often, but...




Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread Alexander Viro



On Tue, 24 Apr 2001, Andreas Dilger wrote:

> While I applaud your initiative, you made an unfortunate choice of
> filesystems to convert.  The iso_inode_info is only 4*__u32, as is
> proc_inode_info.  Given that we still need to keep a pointer to the
> external info structs, and the overhead of the slab cache itself
> (both CPU usage and memory overhead, however small), I don't think
> it is worthwhile to have isofs and procfs in separate slabs.
> 
> On the other hand, sockets and shmem are both relatively large...
> Watch out that the *_inode_info structs have all of the fields
> initialized, because the union field is zeroed for us, but slab is not.

Frankly, I'd rather start with encapsulation part. It's easy to
verify, it can go in right now and it makes separate allocation
part uncluttered. Besides, it simply makes code cleaner, so it
makes sense even if we don't want to go for separate allocation for
that particular fs.




Re: hundreds of mount --bind mountpoints?

2001-04-24 Thread Alexander Viro



On 24 Apr 2001, Trond Myklebust wrote:

> Hi Al,
> 
>   I believe your patch introduces a race for the NFS case. The problem
> lies in the fact that nfs_find_actor() needs to read several of the
> fields from nfs_inode_info. By adding an allocation after the inode
> has been hashed, you are creating a window during which the inode can
> be found by find_inode(), but during which you aren't even guaranteed
> that the nfs_inode_info exists let alone that it's been initialized
> by nfs_fill_inode().

_Ouch_. So what are you going to do if another iget4() comes between
the moment when you hash the inode and set these fields? You are
filling them only after you drop inode_lock, so AFAICS the current
code has the same problem.




[PATCH][CFT] per-process namespaces for Linux

2001-02-24 Thread Alexander Viro

He's back. And this time he's got a chainsaw.

Yes, folks. We got per-process namespaces. Working. With proper
behaviour on exit(), yodda, yodda. Enjoy. Help with testing would be more
than welcome.

Current patch is on ftp.math.psu.edu/pub/viro/namespaces-S2.gz
It's against 2.4.2.

Contents:
* proper refcounting of struct super_block
* GC for vfsmounts (finally)
* fix for races between get_super() and umount()
* SMP-safe lock_super()
* general cleanup of fs/super.c
* "lazy" option for umount() (detach from mountpoint now, do the
rest when it will cease to be busy - use MNT_DETACH in 'flags' argument
to get that behaviour).
* Plan 9 per-process namespaces (sans unions so far)
* large cleanup of boot process (ramdisk handling, etc.)
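
As an illustration of the "lazy" umount item in the list above, here is a
minimal sketch. It assumes the flag is passed the way umount2(2) takes it on
present-day Linux with glibc; the wrapper name lazy_umount() is hypothetical
and is not part of the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/*
 * Detach `target` from the mount tree right away; the kernel finishes the
 * unmount once the filesystem stops being busy.
 * Returns 0 on success, -1 on failure (errno is left set by umount2()).
 */
int lazy_umount(const char *target)
{
	if (umount2(target, MNT_DETACH) != 0) {
		fprintf(stderr, "umount2(%s): %s\n", target, strerror(errno));
		return -1;
	}
	return 0;
}
```

The call needs the usual mount privileges, so an unprivileged caller or a
path that does not exist simply gets -1 back.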

Variant without namespaces (they were the last part) is in the same
directory, called s_lock-S2.gz.

rfork.c (in the same place) will copy a namespace and start shell in it.
Use for testing... It's an equivalent of rfork(RFNAMEG) on Plan 9.

One detail - patch requires ramfs built into the kernel (boot process cleanup
part needs that).

It works here (ran for about 12 hours with no problems). It's _NOT_ for
inclusion into 2.4. Some pieces might go (get_super() races have to be
fixed, after all), but most of this stuff is 2.5 fodder. However, it
seems to be working. No doubt there are bugs and it's far from being
a final version. I would call it _very_ early beta. Please, help with
testing.

Comments on the code/design/amount of dope it took to write the thing (zero,
actually) are welcome. I _will_ document it, but it's still not in the
final form. Pretty close to it, hopefully, but...

I'm more than willing to answer questions on the design of the thing - just
ask. So far that's the best I can do - all documentation is a pile of notes
+ CVS log.

Cheers,
Al
PS: hopefully - back for good.




Re: [PATCH][CFT] per-process namespaces for Linux

2001-02-24 Thread Alexander Viro



On Sun, 25 Feb 2001, Rick Hohensee wrote:

[I wrote]

> >ask. So far that's the best I can do - all documentation is a pile of
> >notes
> >+ CVS log.

[snip]

> That sounds like an especially fascinating pile of notes. Perhaps you
> could pile it next to the patch on the ftp site?

You know, CDA is dead and gone, but I really doubt that putting this
pile as-is in any vicinity of this account would be a good idea.
Besides, half of them will need a translation - I doubt that 80Kb of
grep output intermixed with comments in English and Russian, some of
them printable, would be useful. Fascinating - maybe, but... IOW, turning
that into documentation will take some effort.
Cheers,
Al




Re: OK to mount multiple FS in one dir?

2001-02-24 Thread Alexander Viro



On Tue, 6 Feb 2001, Wakko Warner wrote:

> > > > I found I could mount three partitions on /mnt
> > > 
> > > Yes.  New feature, appeared in the 2.4.0test series, or shortly before.
> 
> I have a question, why was this idea even considered?

Direct request from HPA. Autofs can win from having that (mounting
atop of mountpoint). I'd rather live without that stuff, but back then it
looked like an OK idea - we could do that. There is a better solution for
original problem, but...




Re: OK to mount multiple FS in one dir?

2001-02-24 Thread Alexander Viro



On Wed, 7 Feb 2001, John R Lenton wrote:

> On Wed, Feb 07, 2001 at 12:25:10AM -0600, Peter Samuelson wrote:
> > 
> > [Wakko Warner]
> > > I have a question, why was this idea even considered?
> > 
> > Al Viro likes Plan9 process-local namespaces.  He seems to be trying to
> > move Linux in that direction.  In the past year he has been hacking the
> > semantics of filesystems and mounting, probably with namespaces as an
> > eventual goal, and this is one of the things that has fallen out of the
> > implementation.
> 
> Aren't "translucid" mounts the idea behind this?

Nope. Completely different beast - bindings have nothing to do with layered
filesystems. I.e. if we bind /foo to /bar then /foo/barf and /bar/barf
are the same object. Translucent-type would have one of them redirecting
all requests to another.
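
One way to observe the "same object" property from userland is to compare
device and inode numbers. This is only a sketch: the helper name
same_object() is hypothetical, and it assumes that two names for one object
report identical st_dev/st_ino pairs, as they do for a bind mount.

```c
#include <assert.h>
#include <sys/stat.h>

/*
 * Returns 1 if both paths resolve to the same object (same device and
 * inode number), 0 if they resolve to different objects, and -1 if either
 * path cannot be stat()ed.
 */
int same_object(const char *a, const char *b)
{
	struct stat sa, sb;

	if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
		return -1;
	return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

After binding /foo to /bar, same_object("/foo/barf", "/bar/barf") would be
expected to return 1, provided the file exists.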




Re: OK to mount multiple FS in one dir?

2001-02-24 Thread Alexander Viro



On Wed, 7 Feb 2001, David L. Nicol wrote:

> Peter Samuelson wrote:
>  
> > A more useful thing to fall out of the same hacking is loopback
> > mounting -- i.e. the same filesystem mounted multiple places.  In
> > Linux-land I guess we call it 'mount --bind'.
> > 
> > Peter
> 
> Does this kind of thing play nice with nfs and coda, in terms of
> change notifications and write-backs? In distributed FS we've got
> the same thing mounted multiple places, of course, but not on the
> same machine

There are no cache coherency problems since we have no copies to keep
in sync ;-) Dentry tree is shared by all instances.




Re: [PATCH][CFT] per-process namespaces for Linux

2001-02-25 Thread Alexander Viro



On Sun, 25 Feb 2001, Manfred Spraul wrote:

> 
> >  * large cleanup of boot process (ramdisk handling, etc.)
> 
> Have you thought about supporting .tar.gz into ramfs? Creating custom
> boot images would be simpler.

*uh*. It's definitely easier to do than it used to be, but I'm seriously
sceptical about adding more cruft into the thing. Let's sort it out
and then see what can be added to the sequences. At least now it's in
one place and doesn't have to pull the tricks it used to need for dealing
with IO...

(I presume that you mean "unpacking tar.gz into initrd/floppy-loaded ramdisk"
and not "adding into ramfs a loader of tarballs" - the latter is out of the
question, as far as I'm concerned; such code belongs to do_mounts.c if
it belongs anywhere at all)

IOW, look into init/do_mounts.c - that's the right place to do that
stuff.
Cheers,
Al




Re: [PATCH][CFT] per-process namespaces for Linux

2001-02-25 Thread Alexander Viro



On Sun, 25 Feb 2001, Sandy Harris wrote:

> One is just mount a ramdisk and extract a tarball into its root. Yes, this has
> some problems -- how do you load tar when you haven't set up your root? -- but
> I suspect they can be solved. At worst, this would involve some strictly limited
> kluge to do that.

No kludges actually needed. "Simplified boot sequence" _is_ simplified -
we overmount the "final" root over ramfs. Initially empty. So you have
the normal environment when you load ramdisk, etc.

IOW, with the namespaces patch you can have root (empty, writable)
as soon as you've registered ramfs driver. I.e. _very_ early - before
device initialization, for one thing. Actual mounting of the "final"
root happens very late, along with all initrd games, etc. That stuff
(in do_mounts.c) could be executed as userland process, actually -
see the comments in init/do_mounts.c and actual code there.
Cheers,
Al




Re: [PATCH][CFT] per-process namespaces for Linux

2001-02-25 Thread Alexander Viro



On Sun, 25 Feb 2001, Werner Almesberger wrote:

> Alexander Viro wrote:
> > No kludges actually needed. "Simplified boot sequence" _is_ simplified -
> > we overmount the "final" root over ramfs. Initially empty. So you have
> > the normal environment when you load ramdisk, etc.
> 
> So is this the Holy Grail, err, union mount we've discussed about one year
> ago ? I.e.

No. Just an overmount. Final root ends up mounted atop of absolute root -
see comments in fs/super.c:mount_root() and in init/do_mounts.c

> stat foo  # output A
> mount /dev/whatever /
> stat foo  # output B
> 
> with A != B ?

We end with forced chroot to covering one. Due to details of path_walk()
it's unbreakable even for root (well, barring the direct access to
kernel data structures via /dev/kmem ;-)

So yes, you'll see the covering fs. Check do_chroot() in init/do_mounts.c
 
> If yes, is there also a way to destroy/empty ramfs after this ?

At the end of boot process we are left with (at most) 8 dentries, 8 inodes and
no data pages on ramfs. Is it worth emptying? I can do that (reduce to
1 dentry/1 inode), just add sys_rmdir() and sys_unlink() calls in
the end of init/do_mounts.c:prepare_namespace(), but I don't really
see the point of it.

Fs _is_ covered - you don't get its objects after mounting the final root.

BTW, Werner - could you take a look at the prepare_namespace()/handle_initrd()?
That's our late boot process taken into one place. I'm really not happy
about the following:
a) initrd with /linuxrc exec'ing init leaves init with PID > 1.
Is it a good idea? I've reproduced the behaviour we have in the main tree,
but I have a bad feeling about it. For one thing, init is killable that
way. Not good...
b) can we _please_ kill the real_root_dev sysctl?
c) you had plans for mandating non-exiting /linuxrc. What's the status
of these plans? I'd be glad if we could pull that one off... More than
half of handle_initrd() implements the behaviour for the case when /linuxrc
does exit and I would be only happy to remove that cruft. AFAICS both
RH and Debian have /linuxrc that _does_ exit, though...

Again, current patch reproduces the behaviour of the main tree. Every
boot setup that used to work should stay working - that was the design
goal. I want to, erm, concentrate the existing logic in one place
and make it readable before even thinking of changing behaviour.

I've tested it with all combinations that end up with root on local
fs (initrd or not, ramdisks from floppies, devfs mounted or not and
their combinations, with different variants of /linuxrc in cases that
did initrd).  I didn't do exhaustive testing for NFS-root. If someone
can find a setup that works with official tree and doesn't work with 
the patched one - yell. I consider that as a bug.

BTW, people with rootfs=... patches may find that in this variant
their patches would become _much_ simpler - they can actually call sys_mount()
to mount the final root.

Don't get me wrong - I would be glad to see both rootfs=... and tar patches
done atop of that namespace/s_lock (and archive them, keep up-to-date, put
on FTP, etc).  Just don't expect me to _merge_ them until their counterparts
get merged into Linus' tree (or this patch ends up there, but that won't
happen until 2.5).

IOW, I consider the boot-process part of the patch as cleanup of
existing code. If it makes some experiments easier - great, but
in _that_ respect namespace patch is in permanent feature freeze.
Unless behaviour is accepted by Linus - it won't get merged.

Cheers,
Al




Re: 2.4.2-ac3: loop threads in D state

2001-02-25 Thread Alexander Viro



On Sun, 25 Feb 2001, Jens Axboe wrote:

> On Sun, Feb 25 2001, Nate Eldredge wrote:
> > Nate Eldredge writes:
> >  > Kernel 2.4.2-ac3.
> >  > 
> >  >  FLAGS   UID   PID  PPID PRI  NI   SIZE   RSS WCHAN  STA TTY TIME COMMAND
> >  >     40     0   425     1  -1 -20      0     0 down   DW< ?   0:00 (loop0)
> > 
> > It looks like this has been addressed in the thread "242-ac3 loop
> > bug".  Jens Axboe posted a patch, but the list archive I'm reading
> > mangled it.  Jens, could you make this patch available somewhere, or
> > at least email me a copy?  (If it's going in an upcoming -ac patch,
> > then don't bother; I can wait until then.)
> 
> Patch is here, I haven't checked whether Alan put it in ac4 yet (I
> did cc him, but no one knows for sure :-).

Jens, you have a race in lo_clr_fd() (loop-6). I've put the fixed
variant on ftp.math.psu.edu/pub/viro/loop-S2.gz. Diff and you'll
see - it's in the very beginning of the lo_clr_fd().
Cheers,
Al




Re: [PATCH][CFT] per-process namespaces for Linux

2001-02-25 Thread Alexander Viro



On Mon, 26 Feb 2001, Werner Almesberger wrote:

> Alexander Viro wrote:
> > No. Just an overmount.
> 
> Ah, too bad. Union mounts would have been really elegant (allowing the
> operation to be repeated without residues, and also allowing umounting
> of the covered FS as a sanity check). But I guess there's no way to
> implement them without performance penalty ...

There is no way to implement them without credentials' cache. Which needs
to be done for many other reasons, but that's a separate patch and
separate story. If it's done - no serious penalty involved. However,
I doubt that we want a union on / itself. /dev - sure, /bin and /lib -
maybe, but /... What for?
 
> > Is it worth emptying?
> 
> Probably not ... the only interesting case would be if you could completely
> umount it.

What's the point in unmounting it? Let the root of the mount tree be fixed -
it actually simplifies the things big way. Not that we had any performance
penalty for having the thing in place - after this forced chroot we never
touch it in lookups. BTW, pivot_root() is simpler that way.

BTW, we probably want to add mount --move old new - atomically moving
a subtree from one place to another. Code is there, we just need to
decide on API. Andries?

> So with some luck, distributors will switch to pivot_root sometime soon,
> when deploying 2.4. So if we drop all the old junk in 2.5, the amount of
> letter bombs should be small ;-)

Tomorrow I'll try to catch Erik and talk with him about that. I'm not sure
that I know anyone in Debian Install System Team (oh, boy... somebody sure
loved capital letters). And I've absolutely no idea who is doing that stuff
in other distributions...
Cheers,
Al




Re: 2.4.2-ac3: loop threads in D state

2001-02-25 Thread Alexander Viro



On Mon, 26 Feb 2001, Jens Axboe wrote:

> On Sun, Feb 25 2001, Alexander Viro wrote:
> > Jens, you have a race in lo_clr_fd() (loop-6). I've put the fixed
> > variant on ftp.math.psu.edu/pub/viro/loop-S2.gz. Diff and you'll
> > see - it's in the very beginning of the lo_clr_fd().
> 
> Oops yeah you are right. Here's a diff of my current loop stuff
> against -ac4, Alan could you apply? Andrea suggested removing
> the loop private slab cache for buffer heads and just using the
> bh_cachep pool, and it seems like a good idea to me.

Erm... Jens, it really should be
	if (atomic_dec_and_test(...))
		up(...);
not just
	atomic_dec(...);
	up(...);

Otherwise you can end up with too early exit of loop_thread. Normally
it would not matter, but in pathological cases...
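
The difference between the two forms can be shown with a toy model. This is
not the loop driver's code: the counter, the "semaphore" that only counts
up() calls, and all names below are hypothetical, and C11 atomics stand in
for the kernel's atomic_t.

```c
#include <assert.h>
#include <stdatomic.h>

/* Toy stand-in for a kernel semaphore: it only counts up() calls. */
struct toy_sem {
	int ups;
};

static void toy_up(struct toy_sem *sem)
{
	sem->ups++;
}

/* Unconditional variant: every release signals the semaphore. */
void release_always(atomic_int *pending, struct toy_sem *sem)
{
	atomic_fetch_sub(pending, 1);
	toy_up(sem);
}

/* dec-and-test variant: only the release that reaches zero signals. */
void release_last(atomic_int *pending, struct toy_sem *sem)
{
	/* atomic_fetch_sub() returns the old value; old == 1 means we hit 0 */
	if (atomic_fetch_sub(pending, 1) == 1)
		toy_up(sem);
}

/* Drop `refs` references with the given release function; count up()s. */
int count_ups(void (*release)(atomic_int *, struct toy_sem *), int refs)
{
	atomic_int pending;
	struct toy_sem sem = { 0 };

	atomic_init(&pending, refs);
	while (refs-- > 0)
		release(&pending, &sem);
	return sem.ups;
}
```

With two references outstanding (say, teardown plus one pending request),
count_ups(release_always, 2) signals twice while
count_ups(release_last, 2) signals once, which is the extra up() at issue.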




Re: 2.4.2-ac3: loop threads in D state

2001-02-25 Thread Alexander Viro



On Mon, 26 Feb 2001, Jens Axboe wrote:

> On Sun, Feb 25 2001, Alexander Viro wrote:
> > Erm... Jens, it really should be
> > 	if (atomic_dec_and_test(...))
> > 		up(...);
> > not just
> > 	atomic_dec(...);
> > 	up(...);
> > 
> > Otherwise you can end up with too early exit of loop_thread. Normally
> > it would not matter, but in pathological cases...
> 
> How so? We dec it and up the semaphore, loop_thread runs until it's
> done and ups lo_sem.

You are risking an extra up() here. Think what happens if you already had a
pending request.





Re: 2.4.2-ac3: loop threads in D state

2001-02-25 Thread Alexander Viro



On Sun, 25 Feb 2001, Alexander Viro wrote:

> 
> 
> On Mon, 26 Feb 2001, Jens Axboe wrote:
> 
> > On Sun, Feb 25 2001, Alexander Viro wrote:
> > > Erm... Jens, it really should be
> > > 	if (atomic_dec_and_test(...))
> > > 		up(...);
> > > not just
> > > 	atomic_dec(...);
> > > 	up(...);
> > > 
> > > Otherwise you can end up with too early exit of loop_thread. Normally
> > > it would not matter, but in pathological cases...
> > 
> > How so? We dec it and up the semaphore, loop_thread runs until it's
> > done and ups lo_sem.
> 
> You are risking an extra up() here. Think what happens if you already had a
> pending request.

Let me elaborate: the race is very narrow and takes deliberate efforts to
hit. It _can_ be triggered, unfortunately. This extra up() will mess your
life later on.




Re: 2.4.2-ac3: loop threads in D state

2001-02-25 Thread Alexander Viro



On Mon, 26 Feb 2001, Jens Axboe wrote:

> On Sun, Feb 25 2001, Alexander Viro wrote:
> > Let me elaborate: the race is very narrow and takes deliberate efforts to
> > hit. It _can_ be triggered, unfortunately. This extra up() will mess your
> > life later on.
> 
> What's the worst that can happen? We do an extra up, but loop_thread
> will still quit once we hit zero lo_pending. And loop_clr_fd
> is still protected by lo_ctl_mutex.

Well, for one thing you'll get some surprises next time you losetup
the same device ;-) There are more subtle scenarios, but that one
is pretty unpleasant in itself.
Cheers,
Al




Re: [PATCH][CFT] per-process namespaces for Linux

2001-02-25 Thread Alexander Viro



On Mon, 26 Feb 2001 [EMAIL PROTECTED] wrote:

> > BTW, we probably want to add mount --move old new - atomically moving
> > a subtree from one place to another. Code is there, we just need to
> > decide on API. Andries?
> 
> Since we already have "mount --bind olddir newdir" this is not
> an unreasonable extension of the mount(8) syntax.
> And since the kernel is no longer so interested in coeds as
> some former mount author, we have lots of free bits.

/me scratches head and tries to figure out what "coed" means...

C|N>K


> There are even old bits.
> 
> #define MS_MOVE   0x2000

Works for me...
Cheers,
Al




Re: 2.4.2-ac3: loop threads in D state

2001-02-25 Thread Alexander Viro



On Mon, 26 Feb 2001, Jens Axboe wrote:

> Ah ok, I see what you mean. Updated patch attached.

Corresponding patch against 2.4.2 is on ftp.math.psu.edu/pub/viro/loop-S2.gz

Cheers,
Al




Re: [PATCH][CFT] per-process namespaces for Linux

2001-02-26 Thread Alexander Viro



On Mon, 26 Feb 2001, Marco d'Itri wrote:

> On Feb 26, Alexander Viro <[EMAIL PROTECTED]> wrote:
> 
>  >There is no way to implement them without credentials' cache. Which needs
>  >to be done for many other reasons, but that's a separate patch and
>  >separate story. If it's done - no serious penalty involved. However,
>  >I doubt that we want a union on / itself. /dev - sure, /bin and /lib -
>  >maybe, but /... What for?
> What I'd really like to do is remount / somewhere with mount --bind,
> mount over it another skeleton file system which hides setuid programs
> and some directories and then run a chrooted sshd in the new root.
> If I'm not missing something, this would make creation of secure chroot
> environments very easy.

I'm making NOSUID per-mountpoint. So
	pid = clone(CLONE_NEWNS,0);
	if (!pid) {
		...
		remount everything with nosuid
		exec sshd
	}
should be OK.
As for hiding the directories - also easy, mount --bind an empty
immutable directory over each of them.
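
For comparison, roughly the same idea can be sketched against the
interfaces found on present-day Linux, not against the 2001 patch. The
assumptions are loud ones: unshare(2) replaces the clone() call in the
pseudocode, only a single mount is remounted rather than "everything", the
function name and exit codes are hypothetical, and the whole thing needs
CAP_SYS_ADMIN to get past the first step.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <sys/mount.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child, give it a private mount namespace and remount `mnt`
 * nosuid inside it. Returns the child's exit code:
 *   0 = all steps worked, 1 = unshare failed, 2 = making the tree private
 *   failed, 3 = nosuid remount failed; -1 if fork/wait itself failed.
 * Note: this remount replaces the per-mount flags with nosuid alone; a
 * real program would first read and preserve the existing flags.
 */
int try_nosuid_namespace(const char *mnt)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		if (unshare(CLONE_NEWNS) != 0)
			_exit(1);
		/* keep our changes from propagating back to the parent */
		if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) != 0)
			_exit(2);
		if (mount(NULL, mnt, NULL,
			  MS_REMOUNT | MS_BIND | MS_NOSUID, NULL) != 0)
			_exit(3);
		/* a real program would exec the daemon (e.g. sshd) here */
		_exit(0);
	}
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Running the steps in a child keeps the caller's own namespace untouched, so
the result code can be inspected without side effects on the parent.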

NODEV is also easy to make per-mountpoint, but readonly may be trickier;
we need permission() to take vfsmount+dentry instead of inode for that.
Doable, but will touch quite a few places.




Re: [PATCH][CFT] per-process namespaces for Linux

2001-02-26 Thread Alexander Viro

New version uploaded on ftp.math.psu.edu/pub/viro/namespaces-a-S2.gz
Changes:
* nosuid, nodev and noexec are per-mountpoint now.
* new flag for mount() - MS_MOVE (move a subtree, probable syntax
for mount(8) - mount --move old new; old must be a mountpoint)
* Fixes for "lazy" umount.
* CLONE_NEWNS is made root-only (CAP_SYS_ADMIN, actually)
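
A sketch of how the MS_MOVE flag from the list above might be used from a
program, assuming the mount(2) calling convention of present-day Linux and
glibc. The wrapper name move_subtree() is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/*
 * Move the subtree mounted at `old_mnt` to `new_dir`. `old_mnt` must be a
 * mountpoint. Returns 0 on success, -1 on failure.
 */
int move_subtree(const char *old_mnt, const char *new_dir)
{
	/* With MS_MOVE the filesystem type and data arguments are unused. */
	if (mount(old_mnt, new_dir, NULL, MS_MOVE, NULL) != 0) {
		fprintf(stderr, "mount --move %s %s: %s\n",
			old_mnt, new_dir, strerror(errno));
		return -1;
	}
	return 0;
}
```

This corresponds to the proposed "mount --move old new" syntax; it needs
mount privileges and fails with -1 otherwise.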

Folks, please help with testing. Again, It Works Here(tm).
Cheers,
Al




Re: [PATCH][CFT] per-process namespaces for Linux

2001-02-27 Thread Alexander Viro



New version uploaded on ftp.math.psu.edu/pub/viro/namespaces-d-S2.gz

Changes:
* fixed an idiotic bug in get_filesystem_info() that didn't
(unfortunately) show up on UP.
* nosuid/nodev/noexec work in any combinations (had been b0rken in
previous version).
* fixed multiple-mount (had been b0rken; --bind worked, but attempt
to mount the device you've already had mounted did bad things).
* sanity checks for mount --move were missing. Fixed.
* Assorted cleanups.

Folks, please help with testing.
Cheers,
Al




Re: Will Mosix go into the standard kernel?

2001-02-27 Thread Alexander Viro



On Tue, 27 Feb 2001, David L. Nicol wrote:

> /proc/cluster/this would be standard root point for clustering stuff
> 
> /proc/mosix would go away, become proc/cluster/mosix
> 
> and the same with whatever bproc puts into /proc; that stuff would move to
> /proc/cluster/bproc

#include 

Guys, if you want a large subtree in /proc - whack yourself over the head
until you realize that you want an fs of your own. I'll be more than
happy to help with both parts.




Re: [PATCH][CFT] per-process namespaces for Linux

2001-02-27 Thread Alexander Viro



On Wed, 28 Feb 2001, Albert D. Cahalan wrote:

> Alexander Viro writes:
> 
> > * CLONE_NEWNS is made root-only (CAP_SYS_ADMIN, actually)
> 
> Would an unprivileged version that killed setuid be OK to have?

Not until we get decent resource accounting here.

> Evil idea of the day: non-directory (even non-existent) mount points and
> non-directory mounts. So then "mount --bind /etc/foo /dev/bar" works.

Try it. It _does_ work.




Re: [PATCH][CFT] per-process namespaces for Linux

2001-02-27 Thread Alexander Viro



On Wed, 28 Feb 2001, Albert D. Cahalan wrote:

> Alexander Viro writes:
> 
> > * CLONE_NEWNS is made root-only (CAP_SYS_ADMIN, actually)
> 
> Would an unprivileged version that killed setuid be OK to have?
> 
> Evil idea of the day: non-directory (even non-existent) mount points and
> non-directory mounts. So then "mount --bind /etc/foo /dev/bar" works.

BTW, out of curiosity: what's that evil about non-directory mounts?
You obviously shouldn't mix directories with non-directories in that
context (userland will not take that lightly, same as with rename(),
etc.), but binding a non-directory over non-directory... Why not?
Me, I'm playing with
% mount -t devloop /tmp/image /dev/loop0 -o offset=4096
Yes, in that order. /dev/loop0 is the mountpoint here. ioctls? We don't
need no stinkin' ioctls. Now, _that_ I would call evil... Pretty simple,
actually - filesystem with ->read_super() making ->s_root not a directory
but a block device. And setting it up (lo_set_fd() with small modifications).
Still alpha, requires namespace patch (or at least s_lock one), but seems
to be working. Simpler than loop.c in official tree, BTW - no ioctls, no
handling pending requests since we unset device only upon umount, when
we have nobody keeping it open. losetup? What losetup? Shell script, if
somebody would bother to write it (going through losetup options and turning
them into mount ones).




Re: PROBLEM: Kernel bug in inode.c:885 when floppy disk removed

2001-02-28 Thread Alexander Viro



On Wed, 28 Feb 2001, Ph. Marek wrote:

> Hi everybody!
> 
> Hope I didn't forget something necessary.
> 
> 
> 
> 1:
> Kernel bug/Segmentation fault when floppy disk removed 2nd time
> 
> 
> 2: 
> Segmentation fault in a program, 
> hanging processes in "D"-state,
> Kernel bug in inode.c:885!
> 
> when removing floppy disk before unmounting and then using again

- Doctor, it hurts when I do it!
- Don't do it, then.




Re: PROBLEM: Kernel bug in inode.c:885 when floppy disk removed

2001-02-28 Thread Alexander Viro



On Wed, 28 Feb 2001, Manfred Spraul wrote:

> Alexander Viro wrote:
> > 
> > - Doctor, it hurts when I do it! 
> > - Don't do it, then. 
> >
> Interesting bugfix:
> have you checked which BUG was triggered?

[snip]

Fsck. I plead guilty to replying to postings before the first cup of
coffee.




Re: [PATCH][CFT] per-process namespaces for Linux

2001-02-28 Thread Alexander Viro



On Wed, 28 Feb 2001, David L. Parsley wrote:

> Alexander Viro wrote:
> > > Evil idea of the day: non-directory (even non-existent) mount points and
> > > non-directory mounts. So then "mount --bind /etc/foo /dev/bar" works.
> > 
> > Try it. It _does_ work.
> 
> Yeah, mount --bind is cool, I've been using it on one of my projects
> today.  But - maybe I'm just not thinking creatively enough - what are
> the advantages of mount --bind versus just symlinking?

1) Correctly working ".." (obviously relevant only for directories)
2) Try to create symlinks on read-only NFS mount. For bonus points, try
to do that on one client without disturbing everybody else.
3) Try to make it different for different users, for that matter.

> Also, I tried mount --bind fileone filetwo, and it fails if filetwo
> doesn't exist. ('mount point filetwo doesn't exist').  Is that supposed
> to work?  (using mount from latest redhat beta)

Nope. It does exactly what it should - changing that is too large a
can of worms, and I simply don't want to touch it.

> BTW, pivot_root is nifty, too. ;-)

Thank Werner for that ;-)




[PATCH] Re: fat problem in 2.4.2

2001-03-01 Thread Alexander Viro



On Thu, 1 Mar 2001, Peter Daum wrote:

> In that case, why was it changed for FAT only? Ext2 will still
> happily enlarge a file by truncating it.

Basically, the program depends on behaviour that was never guaranteed to
be there.
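For illustration, a program that does not want to depend on truncate growing a file could fall back to an ordinary write. This is only a sketch: grow_path() is a made-up name, and the fallback assumes that writing one byte past the old end is acceptable to the application. On a filesystem without holes the gap presumably has to be backed by real blocks, so either path can still fail for lack of space.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: make sure the file at `path` is at least
 * `new_size` bytes long, creating it if needed.  Never shrinks.
 * Returns 0 on success, -1 on error (errno is left from the failing call).
 */
int grow_path(const char *path, off_t new_size)
{
	struct stat st;
	char zero = 0;
	int fd, ret = -1;

	assert(path != NULL && new_size >= 0);

	fd = open(path, O_RDWR | O_CREAT, 0644);
	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0)
		goto out;
	if (st.st_size >= new_size) {
		ret = 0;		/* already long enough */
		goto out;
	}
	/* Convenient way first; it may or may not be supported */
	if (ftruncate(fd, new_size) == 0) {
		ret = 0;
		goto out;
	}
	/* Fallback: write a single zero byte at the last new offset */
	if (lseek(fd, new_size - 1, SEEK_SET) == (off_t)-1)
		goto out;
	if (write(fd, &zero, 1) == 1)
		ret = 0;
out:
	close(fd);
	return ret;
}
```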

> Staroffice (the binary-only version; the new "open source"
> version is not yet ready for real-world use) for example
> currently doesn't write to FAT filesystems anymore - which is
> pretty annoying for people who need it.

Staroffice is non-portable and badly written, film at 11...

BTW, _some_ subset is doable on FAT. You can't always do it (bloody
thing doesn't support holes), but you can try the following (warning -
untested patch):

diff -urN S2/fs/fat/inode.c S2-fat/fs/fat/inode.c
--- S2/fs/fat/inode.c   Fri Feb 16 22:52:07 2001
+++ S2-fat/fs/fat/inode.c   Thu Mar  1 10:02:45 2001
@@ -897,6 +897,36 @@
 	unlock_kernel();
 }
 
+int generic_cont_expand(struct inode *inode, loff_t size)
+{
+	struct address_space *mapping = inode->i_mapping;
+	struct page *page;
+	unsigned long index, offset, limit;
+	int err;
+
+	limit = current->rlim[RLIMIT_FSIZE].rlim_cur;
+	if (limit != RLIM_INFINITY) {
+		if (size > limit) {
+			send_sig(SIGXFSZ, current, 0);
+			size = limit;
+		}
+	}
+	offset = (size & (PAGE_CACHE_SIZE-1)); /* Within page */
+	index = size >> PAGE_CACHE_SHIFT;
+	err = -ENOMEM;
+	page = grab_cache_page(mapping, index);
+	if (!page)
+		goto out;
+	err = mapping->a_ops->prepare_write(NULL, page, offset, offset);
+	if (!err)
+		err = mapping->a_ops->commit_write(NULL, page, offset, offset);
+	UnlockPage(page);
+	page_cache_release(page);
+	if (err > 0)
+		err = 0;
+out:
+	return err;
+}
 
 int fat_notify_change(struct dentry * dentry, struct iattr * attr)
 {
@@ -904,11 +934,17 @@
 	struct inode *inode = dentry->d_inode;
-	int error;
+	int error = 0;
 
-	/* FAT cannot truncate to a longer file */
+	/*
+	 * On FAT truncate to a longer file may fail with -ENOSPC. No
+	 * way to report it from fat_truncate(), so...
+	 */
 	if (attr->ia_valid & ATTR_SIZE) {
 		if (attr->ia_size > inode->i_size)
-			return -EPERM;
+			error = generic_cont_expand(inode, attr->ia_size);
 	}
+
+	if (error)
+		return error;
 
 	error = inode_change_ok(inode, attr);
 	if (error)
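
The arithmetic at the top of generic_cont_expand() can be tried in userland. The sketch below splits a requested size into the index of the page that will hold the new end of file and the offset inside that page, which is where the patch then issues a zero-length prepare_write/commit_write pair. The names are invented for the demo, and the shift of 12 (4K pages, as on i386) is an assumption. The kernel uses PAGE_CACHE_SHIFT for the running architecture.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed page geometry for the demo: 4096-byte pages */
#define DEMO_PAGE_SHIFT	12
#define DEMO_PAGE_SIZE	((uint64_t)1 << DEMO_PAGE_SHIFT)

struct page_pos {
	uint64_t index;		/* which page holds the new end of file */
	uint64_t offset;	/* byte offset of the end within that page */
};

/* Same split as in the patch: low bits are the offset, the rest the index */
struct page_pos split_size(uint64_t size)
{
	struct page_pos pos;

	pos.offset = size & (DEMO_PAGE_SIZE - 1);
	pos.index = size >> DEMO_PAGE_SHIFT;

	/* The two halves must add back up to the original size */
	assert(pos.index * DEMO_PAGE_SIZE + pos.offset == size);
	return pos;
}
```

For a size that is an exact multiple of the page size the offset is 0 and the index points at the page just past the last full one.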

That said, if your only problem is Staroffice...  I would rather
rm the crapware and forget about it, but YMMV.
Cheers,
Al




Re: [PATCH][CFT] per-process namespaces for Linux

2001-02-28 Thread Alexander Viro



On Wed, 28 Feb 2001, Ion Badulescu wrote:

> On Wed, 28 Feb 2001 13:07:29 -0500 (EST), Alexander Viro <[EMAIL PROTECTED]> wrote:
> 
> > On Wed, 28 Feb 2001, David L. Parsley wrote:
> 
> >> Yeah, mount --bind is cool, I've been using it on one of my projects
> >> today.  But - maybe I'm just not thinking creatively enough - what are
> >> the advantages of mount --bind versus just symlinking?
> > 
> > 1) Correctly working ".." (obviously relevant only for directories)
> > 2) Try to create symlinks on read-only NFS mount. For bonus points, try
> > to do that on one client without disturbing everybody else.
> > 3) Try to make it different for different users, for that matter.
> 
> And disadvantages: you can't have broken symlinks.
> 
> This actually turns out to be quite a bit of a problem when one tries
> to use bind mounts with autofs. For one thing, it's perfectly legal
> to have /autofs/foo as a symlink to /autofs/bar/foo, where /autofs/bar
> is not yet mounted -- but a bind mount can't handle that...

First of all, you still have symlinks. What's more, the right solution
is to use local objects at the mountpoints. And forget about having a
small tree full of links to real mountpoints. Think of autofs-with-one-node.



