[bug] set_pgdir can skip mm's

2000-11-24 Thread V Ganesh

set_pgdir() needs to modify all active mm's to include the new entry.
what it really does is 
        for_each_task(p) {
                if (!p->mm)
                        continue;
                *pgd_offset(p->mm, address) = entry;
        }

however, there could be a lazy-tlb thread on another cpu whose active_mm
belongs to a process which is dead and gone, so that mm won't be covered by
the loop above. if this thread then accesses an address covered by the new
entry, it will fault.
ideally, we ought to loop over a list of all mm's rather than over processes.
but since we don't have such a list, an easier solution is to use p->active_mm
rather than p->mm. this can cause multiple updates of the same pgd, but the
number of such unnecessary extra updates is bounded by the number of CPUs.
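i.e. something like this (just a sketch; everything else in set_pgdir() stays
as it is today):

        for_each_task(p) {
                if (!p->active_mm)
                        continue;
                *pgd_offset(p->active_mm, address) = entry;
        }

that way a lazy-tlb thread's borrowed page tables get updated along with
everyone else's.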

ganesh



Re: [bug] set_pgdir can skip mm's

2000-11-24 Thread V Ganesh

> From ganesh Fri Nov 24 18:08:15 2000

[ set_pgdir() blah blah blah ]

damn. I was looking at test9 and as usual after shooting my mouth off on l-k
I go look at test11 and find it's fixed there, at least in i386, thanks to
the vmalloc_fault: stuff in do_page_fault. but a lot of other architectures
still use the old method.
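for reference, the test11 i386 approach goes roughly like this (paraphrased
from memory from the vmalloc_fault: case in arch/i386/mm/fault.c, so don't
hold me to the exact code): instead of updating every pgd up front, the fault
handler lazily copies the missing kernel pgd entry from the master page table
the first time a task touches the new mapping:

        /* sketch of the vmalloc_fault: case in do_page_fault() */
        int offset = __pgd_offset(address);
        pgd_t *pgd = tsk->active_mm->pgd + offset;      /* this task's pgd */
        pgd_t *pgd_k = init_mm.pgd + offset;            /* master kernel pgd */

        if (!pgd_present(*pgd_k))
                goto no_context;        /* genuinely bad kernel address */
        set_pgd(pgd, *pgd_k);           /* sync the missing entry */
        /* ditto for the pmd level, then return and retry the access */

since the copy goes through active_mm, lazy-tlb threads are covered too.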

ganesh



[bug] infinite loop in generic_make_request()

2000-11-29 Thread V Ganesh

[cc'ed to maintainers of md and lvm]

hi,
in generic_make_request(), the following code handles stacking:

        do {
                q = blk_get_queue(bh->b_rdev);
                if (!q) {
                        printk(...);
                        buffer_IO_error(bh);
                        break;
                }
        } while (q->make_request_fn(q, rw, bh));

however, make_request_fn may return -1 in case of errors; one such fn is
raid0_make_request(). this causes generic_make_request() to loop endlessly.
lvm returns 1 unconditionally, so it would also loop if an error occurred in
lvm_map(). other bdevs might have the same problem.
I think a better mechanism would be to mandate that make_request_fn call
bh->b_end_io() in case of errors and return 0.
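with that convention each make_request_fn would look something like this
(just a sketch of the convention, not the actual md/lvm code;
foo_make_request() and the remap_failed/target_* names are placeholders):

        static int foo_make_request(request_queue_t *q, int rw,
                                    struct buffer_head *bh)
        {
                if (remap_failed) {
                        /* complete the bh with an error instead of
                         * returning -1 and making the caller loop */
                        buffer_IO_error(bh);
                        return 0;       /* generic_make_request() stops here */
                }
                bh->b_rdev = target_dev;
                bh->b_rsector = target_sector;
                return 1;               /* resubmit to the new device's queue */
        }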

ganesh



beware of add_waitqueue/waitqueue_active

2000-11-30 Thread V Ganesh

there's a very subtle race in using add_wait_queue() as a barrier, in the
__find_lock_page() which used to exist in test9. it seems to be fixed in
test11, but I thought I should mention it in case the same pattern is ever
used elsewhere.

the race manifests itself as a lost wakeup. the process is stuck in schedule()
in __find_lock_page(), with TASK_UNINTERRUPTIBLE, but on examining the struct
page * it was waiting for, page->flags is 72, indicating that it's unlocked.
page->wait shows this process in the waitqueue.

looking at the code for __find_lock_page...

        __set_task_state(tsk, TASK_UNINTERRUPTIBLE);
        add_wait_queue(&page->wait, &wait);

        if (PageLocked(page)) {
                schedule();
        }
        __set_task_state(tsk, TASK_RUNNING);
        remove_wait_queue(...);

and that of UnlockPage(),
#define UnlockPage(page)        do { \
        smp_mb__before_clear_bit(); \
        clear_bit(PG_locked, &(page)->flags); \
        smp_mb__after_clear_bit(); \
        if (waitqueue_active(&page->wait)) \
                wake_up(&page->wait); \
} while (0)

now assuming that everyone in the system uses UnlockPage() and no one
directly does a clear_bit(PG_locked) (which I'm absolutely certain of),
a possible sequence of events which could lead to this is:
1. __set_task_state(tsk, TASK_UNINT...)
2. add_wait_queue is called. takes spinlock. this is serializing.
3. add_wait_queue adds this process to the waitqueue. but all the writes
   are in write-buffers and have not gone down to cache/memory yet.
4. PageLocked() finds that the page is locked.
5. UnlockPage() is called on another CPU, from interrupt context. it clears
   the bit; waitqueue_active() decides that there's no waiting process
   (because the add_wait_queue() changes haven't reached cache yet) and
   returns. note that waitqueue_active() doesn't take the waitqueue spinlock.
   if it did, it would have been obliged to wait until the spin_unlock() (and
   therefore the preceding writes) hit cache, and would have found a nonempty
   list.
6. schedule() is called, never to return.

another thing which could increase the race window is speculative execution
of PageLocked() even before add_wait_queue returns, since the return
address might have been predicted by the RSB.

I've attached a simple module which reproduces the problem (at least on my
4 * 550 MHz PIII (Katmai), model 7, stepping 3). compile with

        gcc -O2 -D__SMP__ -fno-strength-reduce -fno-omit-frame-pointer \
            -fno-strict-aliasing -g -c race.c
        insmod race.o
it will printk appropriately on termination.
basically it consists of two threads moving in lockstep. one locks a page-like
structure, the other unlocks. if the unlocker detects that the locker hasn't
locked in 5 seconds it concludes that a wakeup is lost.

one solution is not to presume that add_wait_queue() acts as a barrier when we
have a conditional schedule() and the wakeup path uses waitqueue_active().
instead we can use:

        add_wait_queue(...);
        set_current_state(...); /* serializes */
        if (condition)
                schedule();
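spelt out, the wait side would then look like this (same structure as the old
__find_lock_page(), just with the state change after the queue insertion;
set_current_state() is a set_mb(), which is what provides the barrier):

        add_wait_queue(&page->wait, &wait);
        for (;;) {
                set_current_state(TASK_UNINTERRUPTIBLE);        /* serializes */
                if (!PageLocked(page))
                        break;
                schedule();
        }
        __set_task_state(tsk, TASK_RUNNING);
        remove_wait_queue(&page->wait, &wait);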

__find_lock_page() no longer uses the racy mechanism in test11 and a cursory
examination of the rest of the kernel doesn't show any prominent offenders.
so this is just an interesting postmortem of an obsolete corpse...

ganesh

#define __KERNEL__
#define MODULE
/*
 * the header names were eaten by the archive; these are my guess at the
 * usual set a 2.4 test module like this would need.
 */
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/wait.h>
#include <linux/timer.h>
#include <asm/bitops.h>


struct foo {
        int state;
        wait_queue_head_t wait;
} foo;

#define ITER            (1 << 30)
#define LockFoo(foop)   set_bit(0, &(foop)->state)
#define FooLocked(foop) test_bit(0, &(foop)->state)
#define UnlockFoo(foop) {                               \
        smp_mb__before_clear_bit();                     \
        clear_bit(0, &(foop)->state);                   \
        smp_mb__after_clear_bit();                      \
        if (waitqueue_active(&(foop)->wait)) {          \
                wake_up(&(foop)->wait);                 \
        }                                               \
}

static int race_exit = 0;

int locker(void *tmp)
{
        struct task_struct *tsk = current;
        unsigned int n = 0;
        DECLARE_WAITQUEUE(wait, tsk);

        exit_files(current);
        daemonize();
        while (1) {
                LockFoo(&foo);

                __set_task_state(tsk, TASK_UNINTERRUPTIBLE);
                add_wait_queue(&foo.wait, &wait);

                if (FooLocked(&foo)) {
                        schedule();
                }
                __set_task_state(tsk, TASK_RUNNING);
                remove_wait_queue(&foo.wait, &wait);
                if (race_exit) {
                        printk("locker looped %u times\n", n);
                        return 1;
                }
                n++;
        }
}

int unlocker(void *tmp)
{
        int counter = 0;
        unsigned int n = 0;

        exit_files(current);
        daemonize();
        while (n < 

inode->i_dirty_buffers redundant ?

2001-01-24 Thread V Ganesh

now that we have inode->i_mapping->dirty_pages, what do we need
inode->i_dirty_buffers for ? I understand the latter was added for the O_SYNC
changes before dirty_pages came into the picture. but now both seem to be
doing more or less the same thing.

ganesh



Re: inode->i_dirty_buffers redundant ?

2001-01-25 Thread V Ganesh

Stephen C. Tweedie <[EMAIL PROTECTED]> wrote:
: Hi,

: On Wed, Jan 24, 2001 at 03:25:16PM +0530, V Ganesh wrote:
:> now that we have inode->i_mapping->dirty_pages, what do we need
:> inode->i_dirty_buffers for ?

: Metadata.  Specifically, directory contents and indirection blocks.

: --Stephen

ah, mark_buffer_dirty_inode(). thanks.
so if I understand it,
1. read() and mmap read faults will put the page in i_mapping->clean_pages
2. mmaped writes will (eventually, from msync or unmap or swapout) put the
   page in i_mapping->dirty_pages
3. write() will put pages into i_dirty_buffers (__block_commit_write() calls
   buffer_insert_inode_queue()).

so i_dirty_buffers contains buffer_heads of pages coming from write() as
well as metadata buffers from mark_buffer_dirty_inode(). a dirty MAP_SHARED
page which has been write()n to will potentially exist in both lists.
wouldn't doing a set_dirty_page() instead of buffer_insert_inode_queue() in
__block_commit_write() make things much simpler ? then i_dirty_buffers would
hold _only_ metadata, and all data pages would live in the i_mapping->*_pages
lists.
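for concreteness, the change I'm imagining in __block_commit_write() is
roughly this (paraphrased, not the real fs/buffer.c; set_dirty_page() is the
hypothetical helper from above, which would just move the page onto
page->mapping->dirty_pages under the pagecache lock):

        /* for each buffer the write() dirtied: */
        if (!atomic_set_buffer_dirty(bh)) {
                __mark_dirty(bh);       /* bdflush's lists, as before */
                set_dirty_page(page);   /* data page -> i_mapping->dirty_pages */
                /* instead of buffer_insert_inode_queue(bh, inode) */
                need_balance_dirty = 1;
        }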

ganesh



Re: inode->i_dirty_buffers redundant ?

2001-01-26 Thread V Ganesh

Stephen C. Tweedie <[EMAIL PROTECTED]> wrote:

: That would only complicate things: it would mean we'd have to scan
: both lists on fsync instead of just the one, for example.  There are a

we already do; filemap_fdatasync() is called first in sys_fsync(), though it
usually doesn't have much work to do, I guess.

: number of places where we need buffer lists for dirty data anyway,
: such as for bdflush's background sync to disk.  We also maintain the
: per-page buffer lists as caches of the virtual-to-physical mapping to
: avoid redundant bmap()ping.  So, removing the buffer_heads which alias
: the page cache data isn't an option.  Given that, it's as well to keep
: all the inode's dirty buffers in the one place.

keeping dirty pages in the address space list doesn't preclude any of the
above. the pages could still have buffer_heads attached to them, and
these would cache the block location and be a part of the dirty buffer
list used by bdflush.
I guess both approaches would be roughly the same from a performance
point of view. I feel that keeping all data pages in the address space is more
elegant from a design point of view, but that's quite subjective, of course.

ganesh
