Re: [Lse-tech] Re: [PATCH for 2.5] preemptible kernel

2001-04-09 Thread bsuparna


>One question:
>isn't it the case that the alternative to using synchronize_kernel()
>is to protect the read side with explicit locks, which will themselves
>suppress preemption?  If so, why not just suppress preemption on the read
>side in preemptible kernels, and thus gain the simpler implementation
>of synchronize_kernel()?  You are not losing any preemption latency
>compared to a kernel that uses traditional locks, in fact, you should
>improve latency a bit since the lock operations are more expensive than
>are simple increments and decrements.  As usual, what am I missing
>here?  ;-)
>...
>...
>I still prefer suppressing preemption on the read side, though I
>suppose one could claim that this is only because I am -really-
>used to it.  ;-)

Since this point has come up, I just wanted to mention that it may still
be nice to be able to do without explicit locks on the read side. This is
not so much for performance reasons (I agree with your assessment on that
point) as for convenience/flexibility in the kinds of situations where
this concept (i.e. synchronize_kernel or read-copy-update) could be used.
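
As a rough illustration of the two read-side styles being compared (just a
sketch; preempt_disable()/preempt_enable() stand in for whatever read-side
preemption-suppression primitives the preemptible kernel provides, and
synchronize_kernel() is the primitive under discussion, not an existing
interface):

/* Read side, with preemption suppressed (no explicit lock) */
preempt_disable();
p = shared_ptr;                  /* pick up the shared data */
do_something_with(p);            /* hypothetical reader work */
preempt_enable();

/* Update side */
new = make_new_copy(shared_ptr); /* hypothetical: build the updated copy */
old = shared_ptr;
shared_ptr = new;                /* publish the new version */
synchronize_kernel();            /* wait for all readers to pass through */
kfree(old);                      /* now safe to free the old copy */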

For example, consider situations where it is an executable code block that
is being protected. The read side is essentially the execution of that code
block - i.e. every entry/exit into the code block.

This is perhaps the case with module unload races. Having to acquire a
read-lock explicitly before every entry point seems to reduce the
simplicity of the solution, doesn't it ?

This is also the case with kernel code patching. I agree this may appear
to be a rather unlikely application of the concept, but it could handle
races in multi-byte code patching on a running kernel, a rather difficult
problem otherwise. In this case, the read side is totally unaware of the
possibility of an updater modifying the code, so it isn't even possible for
a read-lock to be acquired explicitly (if we wish to have the flexibility
of being able to patch any portion of the code).

I had been discussing this with Dipankar last week, so I realize that the
above situations were perhaps not what these locking mechanisms were
intended for, but I just thought I'd bring up this perspective.

As you've observed, with the approach of waiting for all pre-empted tasks
to synchronize, the possibility of a task staying pre-empted for a long
time could affect the latency of an update/synchronize (though it's hard
for me to judge how likely that is). Besides, as Andi pointed out, there
probably are a lot of situations where the readers are not pre-emptible
anyway, so waiting for all pre-empted tasks may be superfluous.

Given these possibilities, does it make sense to simply let the
updater/synchronize_kernel caller specify an option indicating whether it
should wait for pre-empted tasks or not ?
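
For instance, something along these lines (purely hypothetical naming, just
to illustrate the idea):

synchronize_kernel(SK_WAIT_PREEMPTED);  /* wait for pre-empted readers too */
synchronize_kernel(0);                  /* readers known to be non-preemptible */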

Regards
Suparna


  Suparna Bhattacharya
  IBM Software Lab, India
  E-mail : [EMAIL PROTECTED]
  Phone : 91-80-5267117, Extn : 2525





Common Abstraction of Notification & Completion Handling Mechanisms - observations and potential RFC

2001-01-17 Thread bsuparna


I have been looking at various notification and operation completion
processing mechanisms that we currently have in the kernel. (The
"operation" is typically I/O, but could be something else too).

This comes about as a result of observing similar patterns in async i/o
handling aspects in filter filesystems and then in layered block drivers
like lvm and evms with kiobuf, and recalling a suggestion that Ben LaHaise
had made about extending the wait queue interface to support callbacks.
Come to think of it, we might observe similar patterns elsewhere, wherever
there is a need for some processing to be done on the completion of an
operation or, in a more general sense, on the triggering of an event.

The pattern is something like this:

1. Post-process data: invoke callbacks for the layers above (in reverse order,
   where layer 1 is the highest level and layer n the lowest)
   i.e.  callback_n(arg_n) -> ... -> callback_2(arg_2) -> callback_1(arg_1)
   - the sequence may get aborted temporarily at any level if required
     (e.g. for error correction)
2. Mark data as ready for use (e.g. unlock buffer/page, mark as up-to-date etc.)
   (We could perhaps think of this as a level-0 callback)
3. Notify registered consumers
   - wake up synchronous waiters (typically wait_queue based)
   - signal async consumers (SIGIO)
   (hereafter any further processing happens in the context of the consumer)

We have all the separate mechanisms that are needed to achieve this (I
wonder if we have too many, and if we have some duplication of logic/data
structure patterns in certain cases, just to handle slight distinctions in
flavour).

Here are some of them:
1. io completion callback routines + private data embedded in specific i/o
   structures -- in bh, kiobuf for example (the sock structure too ?)
2. task queues that can be used for triggering a list of callbacks, perhaps ?
3. wait queues for registering synchronous waiters
4. fasync helper routines for registering async waiters to be signalled (SIGIO)

Other places where we have a callback, arg pattern:
- timer callback + arg (specially for timer events)
- softirq handlers ?

So, if we wanted to have a generic abstraction for the mentioned pattern,
it could be done using a collection of the following (a rough sketch follows
this list):
 - something like a task queue for queueing up multiple callbacks to be
   invoked in LIFO order, with some extra functionality to break out in case
   a callback returns a failure
 - a wait queue for synchronous waiters
 - an fasync pointer for asynchronous notification requesters
 - a status field (to check on completion status)
 - a private data pointer (to help store persistent state; such state may
   also be required for operation cancellation)
 - a zeroth-level callback registered in the queue during initialization to
   mark the status as completed and then notify synchronous and asynchronous
   waiters
 - Now, if there are multiple related event structures - like compound
   events (compound i/os, e.g. multiple bh's compounded to a page or kiobuf,
   sub-kiobufs compounded to a compound kiobuf etc.) - then there is a
   requirement of triggering a similar sequence on that compound event. I
   have still not decided at what stage this should happen and how.
 - Another item to think about is the operation cancellation path
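
A rough sketch of such a collection (hypothetical names, just to make the
shape concrete; the level-0 callback registered at init time would mark the
status, wake up 'wait' and kill_fasync() the 'fasync' list):

struct completion_event {
        struct list_head        callbacks;  /* LIFO chain of completion callbacks */
        wait_queue_head_t       wait;       /* synchronous waiters */
        struct fasync_struct    *fasync;    /* async (SIGIO) notification requesters */
        int                     status;     /* completion / error status */
        void                    *private;   /* persistent state, e.g. for cancellation */
};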

One question is whether an extension to the wait queue is indeed
appropriate for the above. Or should it be a different abstraction in
itself ?

I know this needs further thinking through, and definitely some more
detailing, but I'd like to hear some feedback on how it sounds. Besides, I
don't know if anyone is already working on something like this. Does it
even make sense to attempt this ?

Regards
Suparna

  Suparna Bhattacharya
  Systems Software Group, IBM Global Services, India
  E-mail : [EMAIL PROTECTED]
  Phone : 91-80-5267117, Extn : 2525





[ANNOUNCE] Dynamic Probes 1.3 released

2001-01-17 Thread bsuparna


We've just released version 1.3 of the Dynamic Probes facility. This has
2.4.0 and 2.2.18 support and some bug fixes, including Andi Kleen's
suggestions for fixing the races in handling of swapped out COW pages.

For more information on DProbes see the dprobes homepage at
http://oss.software.ibm.com/developer/opensource/linux/projects/dprobes/

Changes in this version:

- DProbes for kernel version 2.2.18 and 2.4.0.
- Fix for race condition in COW page handling and some other
  corrections in the detection of correct offsets in COW pages.
- Removed the mmap_sem trylock from trap3 handling. The correct
  thing to do is to use the page_table_lock.
- Probe on BOUND instruction now logs data.
- Probes can now be inserted into modules that are getting initialized.
- kernel/time.c and arch/i386/kernel/time.c treated as exclude regions.
- Architecture specific FPU code moved to arch specific dprobes file.
- Minor bug fix in merge code which merges probes that are excluded.
- Exit to crashdump supported for 2.2.x kernels also.

We are no longer updating the patches for 2.2.12, 2.2.16 on the site.
If you require patches for these kernel versions, contact us at
[EMAIL PROTECTED]



Race condition fixes in handling COW pages:


Some updates on the race condition fixes for COW page handling, following
the discussions last time, for those who might have been following the
thread:

We eventually decided to drop the idea of trying to achieve on-demand probe
insertion for swapped out COW pages by having a vm_ops->swapin() routine
that we could take over.

The reason was that we realized that the vm_ops->swapin() replacement
approach, while good for insert, is not suitable for remove, since we might
then need to keep the probe records for a removed module around until the
probes eventually get cleaned out of the swapped-out pages. (This may have
been feasible without too much work, because we already have the logic for
quietly removing probes, but it didn't seem like a good idea to have ghost
records around in this way - it makes the code harder to maintain/debug.)

So, we ended up implementing Andi Kleen's original suggestion of a two-pass
approach instead. In the first pass, which takes place with the i_shared
lock held, we build a list of swapped out page locators (mm, addr), while
taking care to increment the mm reference count, and in the second pass,
which happens with the spinlocks released, we actually bring in the page
and cross-check that it is the same one we'd meant to bring in. In the
second pass, we hold the mmap_sem while bringing the page in. With David
Miller's lock hierarchy fixes, the i_shared_lock is now always higher in
the hierarchy than the page_table_lock, so we are OK there.
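
Schematically, the two passes look something like this (a simplified sketch
with hypothetical helper names, not the actual dprobes code):

/* Pass 1: under the i_shared lock, only collect (mm, addr) locators */
spin_lock(&mapping->i_shared_lock);
for (vma = mapping->i_mmap; vma; vma = vma->vm_next_share) {
        addr = probe_vaddr_in_vma(vma);                  /* hypothetical */
        if (page_swapped_out(vma->vm_mm, addr)) {        /* hypothetical */
                atomic_inc(&vma->vm_mm->mm_users);       /* pin the mm */
                locator_list_add(&locs, vma->vm_mm, addr);  /* hypothetical */
        }
}
spin_unlock(&mapping->i_shared_lock);

/* Pass 2: with the spinlock released, fault each page in under mmap_sem,
 * cross-check that it is still the page we meant, then drop the mm ref */
while ((loc = locator_list_pop(&locs)) != NULL) {        /* hypothetical */
        down(&loc->mm->mmap_sem);
        vma = find_vma(loc->mm, loc->addr);
        if (vma)
                handle_mm_fault(loc->mm, vma, loc->addr, 0);
        /* ... re-verify the page and update the probe records here ... */
        up(&loc->mm->mmap_sem);
        mmput(loc->mm);
}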

I hope we haven't missed anything. If anyone spots any gotchas or slips, or
some other races that still exist, do let us know. We've done some simple
testing to try to exercise a few scenarios that we could simulate easily,
but that's about it.

Regards
Suparna


  Suparna Bhattacharya
  Systems Software Group, IBM Global Services, India
  E-mail : [EMAIL PROTECTED]
  Phone : 91-80-5267117, Extn : 2525





Re: [Kiobuf-io-devel] Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-18 Thread bsuparna


>Ok. Then we need an additional more or less generic object that is used for
>passing in a rw_kiovec file operation (and we really want that for many kinds
>of IO). It should mostly be used for communicating to the high-level driver.
>
>/*
> * the name is just plain stupid, but that shouldn't matter
> */
>struct vfs_kiovec {
>struct kiovec * iov;
>
>/* private data, mostly for the callback */
>void * private;
>
>/* completion callback */
>void (*end_io) (struct vfs_kiovec *);
>wait_queue_head_t wait_queue;
>};
>
>Christoph

Shouldn't we have an error / status field too ?
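
For instance (just a sketch extending the structure quoted above with a
completion status field):

struct vfs_kiovec {
        struct kiovec *iov;
        int status;                     /* completion status / error code */

        /* private data, mostly for the callback */
        void *private;

        /* completion callback */
        void (*end_io) (struct vfs_kiovec *);
        wait_queue_head_t wait_queue;
};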


  Suparna Bhattacharya
  Systems Software Group, IBM Global Services, India
  E-mail : [EMAIL PROTECTED]
  Phone : 91-80-5267117, Extn : 2525





Re: [ANNOUNCE] DProbes 1.1

2000-10-24 Thread bsuparna


Hello Andi,

Thanks for taking the trouble to go through our code in such detail and
thinking through the race conditions in dp_vaddr_to_page, which I had sort
of shut my eyes to and postponed for a while because it didn't seem very
easy to close all the loopholes in an elegant way. I need to understand the
mm locking hierarchies and their appropriate usage more thoroughly to do a
complete job.

I have labelled the key points you have brought up in your note below as
(a), (b), (c) for ease of reference.

For (a), your suggestion of a two pass approach is I guess feasible, but I
wish there were a simpler way to do it.
Actually I don't even really like the idea of forcing the swapped out page
back in, which we are having to do right now -   it would have been nicer
if there were a swapin() routine in the vma ops that we could have used for
on-demand probe insertion, just the way we use inode address space
readpage() right now for discardable pages, but maybe that's asking for too
much :-)  [Could vma type based swapin() logic be a useful abstraction in
general, aside from dprobes ?].
Anyway, let me think over this for a while ...

We don't quite understand (b), though. There is indeed a race due to our
not holding the page given to us by handle_mm_fault, while we try to access
it, and we need to fix that of course, but that doesn't sound exactly like
what you mention here. We do have handle_mm_fault being called under the mm
semaphore. Could you explain the deadlock situation that you have in mind ?

We've taken (c) as very reasonable usability feedback. Now, we will be
displaying the opcodes as part of the dprobes query results. Hope that
helps.

It's good to hear that you've found dprobes useful and that you ported it
to 2.4 yourself. Do send us any comments, suggestions, improvements etc
that come to mind as you use it or go through the code.

Regards
Suparna



  Suparna Bhattacharya
  Systems Software Group, IBM Global Services, India
  E-mail : [EMAIL PROTECTED]
  Phone : 91-80-5267117, Extn : 2525


Richard J Moore@IBMGB
10/19/2000 01:27 AM

To:   DProbes Team India
cc:
From: Richard J Moore/UK/IBM@IBMGB
Subject:  Re: [ANNOUNCE] DProbes 1.1





Richard Moore -  RAS Project Lead - Linux Technology Centre (PISC).

http://oss.software.ibm.com/developerworks/opensource/linux
Office: (+44) (0)1962-817072, Mobile: (+44) (0)7768-298183
IBM UK Ltd,  MP135 Galileo Centre, Hursley Park, Winchester, SO21 2JN, UK
-- Forwarded by Richard J Moore/UK/IBM on 18/10/2000
20:57 ---



Andi Kleen <[EMAIL PROTECTED]> on 18/10/2000 18:38:13

Please respond to Andi Kleen <[EMAIL PROTECTED]>

To:   Richard J Moore/UK/IBM@IBMGB
cc:   [EMAIL PROTECTED]

Subject:  Re: [ANNOUNCE] DProbes 1.1





Hallo Richard,

On Wed, Oct 18, 2000 at 10:44:11AM +0100, [EMAIL PROTECTED] wrote:
>
>
> We've released v1.1 of DProbes - details and code are on the DProbes web
> page.
>
> the enhancements include:
>
> - DProbes for kernel version 2.4.0-test7 is now available.

First thanks for this nice work.

I ported the older 1.0 dprobes to 2.4 a few weeks ago for my own use.
It is very useful for kernel work. Unfortunately the user space support
still had one ugly race which I didn't fix, because it required too
extensive changes for my simple port (and it didn't concern me because
I only use kernel level breakpoints).

I see the problems are still in 1.1.

(a) The problem is the vma loop in process_recs_in_cow_pages over the vmas
of an address_space. In 2.4 the only way to do that safely is to hold the
address_space spinlock. Unfortunately you cannot take the semaphore or
execute handle_mm_fault while holding the spinlock, because they could
sleep. The only way I can think of to do it relatively race free, without
adding locks to the core VM, is to do it in two passes (first collect all
the mms with mmget() and their addresses in a separate list under the
spinlock, and then process the list with the spinlock released).

(b) Then dp_vaddr_to_page has another race. It cannot hold the mm semaphore
because that would deadlock with handle_mm_fault. Not holding it means,
though, that the page could be swapped out again after you faulted it in,
before you have a chance to access it. It probably can be done with a
loop that checks and locks the page atomically (e.g. using cmpxchg)
and retries the handle_mm_fault as needed.

There may be more races I missed, the 2.4 SMP MM locking hierarchy is
unfortunately not very flexible and makes things like what dprobes wants
to do relatively hard.

(c) Another change I added and which I found useful is a printk to show
the opcode of mismatched probes (this way wrong offsets in the probe
definitions are easier to fix)

-Andi







Re: [Dprobes] Re: [ANNOUNCE] DProbes 1.1: proposing a vm_ops->swapin() abstraction

2000-10-25 Thread bsuparna

 Andi,

   Thanks.  Then, I'll work it out in more detail and propose it on
linux-mm as you've suggested.
Maybe I should also try to think of another example where it might be
useful.
Anything that comes to  mind ?

   Regards
   Suparna


  Suparna Bhattacharya
  Systems Software Group, IBM Global Services, India
  E-mail : [EMAIL PROTECTED]
  Phone : 91-80-5267117, Extn : 2525


Andi Kleen <[EMAIL PROTECTED]> on 10/24/2000 07:51:39 PM

Please respond to Andi Kleen <[EMAIL PROTECTED]>

To:   Suparna Bhattacharya/India/IBM@IBMIN
cc:   [EMAIL PROTECTED], Richard J Moore/UK/IBM@IBMGB,
  [EMAIL PROTECTED], [EMAIL PROTECTED]
Subject:  [Dprobes] Re: [ANNOUNCE] DProbes 1.1




Hallo,

On Tue, Oct 24, 2000 at 07:37:08PM +0530, [EMAIL PROTECTED] wrote:
> For (a), your suggestion of a two pass approach is I guess feasible, but I
> wish there were a simpler way to do it.
> Actually I don't even really like the idea of forcing the swapped out page
> back in, which we are having to do right now - it would have been nicer
> if there were a swapin() routine in the vma ops that we could have used for
> on-demand probe insertion, just the way we use inode address space
> readpage() right now for discardable pages, but maybe that's asking for too
> much :-)  [Could vma type based swapin() logic be a useful abstraction in
> general, aside from dprobes ?]

I think it would be. You could propose it on the linux-mm mailing list and
ask Linus what he thinks.

I agree that it would be much nicer to do it this way.

> We don't quite understand (b), though. There is indeed a race due to our
> not holding the page given to us by handle_mm_fault, while we try to access
> it, and we need to fix that of course, but that doesn't sound exactly like
> what you mention here. We do have handle_mm_fault being called under the mm
> semaphore. Could you explain the deadlock situation that you have in mind ?

It does not exist, sorry. I was misremembering the lock hierarchy at that
place when I wrote the mail and should have double-checked it.


-Andi







Oddness in i_shared_lock and page_table_lock nesting hierarchies ?

2000-11-04 Thread bsuparna

The vma list lock can nest with i_shared_lock, as per Kanoj Sarcar's
   note on mem-mgmt locks (Documentation/vm/locking), and currently
   vma list lock == mm->page_table_lock.
   But there appears to be some inconsistency in the hierarchy of these
   two locks. (By vma list lock I mean the vmlist_access/modify_lock(s).)

   Looking at mmap code, it appears that the vma list lock
   i.e. page_table_lock right now,  is to be acquired first
   (e.g insert_vm_struct which acquires i_shared_lock internally,
   is called under the page_table_lock/vma list lock).
   Elsewhere in madvise too, I see a similar hierarchy.
   In the unmap code, care has been taken not to have these locks
   acquired simultaneously.

   However, in the vmtruncate code, it looks like the hierarchy is
   reversed. There, the i_shared_lock is acquired in order to traverse the
   i_mmap list, and inside the loop it calls zap_page_range, which acquires
   the page_table_lock.

   This is odd. Isn't there a possibility of a deadlock if mmap and
   truncation for the same file happen simultaneously (on an SMP) ?
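
   To summarize the two orderings as they appear in the code:

      mmap/madvise path : page_table_lock (vma list lock) --> i_shared_lock
      vmtruncate path   : i_shared_lock --> page_table_lock (via zap_page_range)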

  I'm wondering if this could be a side effect of the doubling up of the
  page_table_lock as a vma list lock ?

  Or have I missed something ?

  [I have checked up to 2.4-test10-pre5]

  I had put this query up on linux-mm, as part of a much larger mail, but
  didn't get any response yet, so I thought of putting up a more focussed
  query this time.

  Regards
  Suparna

  Suparna Bhattacharya
  Systems Software Group, IBM Global Services, India
  E-mail : [EMAIL PROTECTED]
  Phone : 91-80-5267117, Extn : 2525


>List: linux-mm
>Subject:  Re: Updated Linux 2.4 Status/TODO List (from the ALS show)
>From: Kanoj Sarcar <[EMAIL PROTECTED]>
>Date: 2000-10-13 18:19:06
>[Download message RAW]

>>
>>
>> On Thu, 12 Oct 2000, David S. Miller wrote:
>> >
>> >page_table_lock is supposed to protect normal page table activity (like
>> >what's done in page fault handler) from swapping out.
>> >However, grabbing this lock in swap-out code is completely missing!
>> >
>> > Audrey, vmlist_access_{un,}lock == unlocking/locking page_table_lock.
>>
>> Yeah, it's an easy mistake to make.
>>
>> I've made it myself - grepping for page_table_lock and coming up empty in
>> places where I expected it to be.
>>
>> In fact, if somebody sends me patches to remove the "vmlist_access_lock()"
>> stuff completely, and replace them with explicit page_table_lock things,
>> I'll apply it pretty much immediately. I don't like information hiding,
>> and right now that's the only thing that the vmlist_access_lock() stuff is
>> doing.

>Linus,
>
>I came up with the vmlist_access_lock/vmlist_modify_lock names early in
>2.3. The reasoning behind that was that in most places where the "vmlist
>lock" was being taken was to protect the vmlist chain, vma_t fields or
>mm_struct fields. The fact that implementation wise this lock could be
>the same as page_table_lock was a good idea that you suggested.
>
>Nevertheless, the name was chosen to indicate what type of things it was
>guarding. For example, in the future, you might actually have a different
>(possibly sleeping) lock to guard the vmachain etc, but still have a
>spin lock for the page_table_lock (No, I don't want to be drawn into a
>discussion of why this might be needed right now). Some of this is
>mentioned in Documentation/vm/locking.
>
>Just thought I would mention, in case you don't recollect some of this
>history. Of course, I understand the "information hiding" part.
>
>Kanoj
>





[ANNOUNCE] Dynamic Probes 2.0 released

2001-03-09 Thread bsuparna


Version 2.0 of the Dynamic Probes facility is now available at
http://oss.software.ibm.com/developerworks/opensource/linux/projects/dprobes

This release includes a new feature called "watchpoint probes" which
exploits hardware watchpoint capabilities of the underlying hardware
architecture to allow probing specific types of storage accesses.
Watchpoint probes are specified by the location (virtual address) and the
type of access (rw/w/x) on which the probe is fired.

This capability enables fine-grained storage profiling when Dprobes is used
in conjunction with LTT from Opersys as it permits tracing of memory read
and write access at specific locations.

Changes in this version:
--
- New watchpoint probes feature allows probes to be fired on specific types
of memory access (execute | write | read-or-write | io), implemented using
the debug registers available on Intel x86 processors.
- New RPN instructions: divide/remainder and propagate bit/zero
instructions.
- Kernel data can now be referenced symbolically in the probe program
files.
- Memory logging instructions like "log mrf" now write the fault location
in the log buffer in case a page fault occurs when accessing the concerned
memory area.
- Log can now be optionally saved using the new keyword logonfault=yes even
 if the probed instruction faults.
- Bug fixes in the interpreter
 - validity check for the selector in segmented to flat address
conversion in case where the selector references the GDT.
 - log memory and log ascii functions now don't log if the logmax was
set to zero.


About Dprobes
---
Dprobes is a generic and pervasive non-interactive system debugging
facility designed to operate under the most extreme software conditions
such as debugging a deep rooted operating system problem in a live
environment. It allows the insertion of fully automated breakpoints or
probepoints, anywhere in the system and user space along with a  set of
probe instructions that are interpreted when a specific probe fires. These
instructions allow memory and CPU registers to be examined and altered
using conditional logic and a log to be generated.

Dprobes is currently available only on the IA32 platform.

An interesting aspect of Dprobes is that it allows the probe program to
conditionally trigger any external debugging facility that registers for
this purpose, e.g. crash dump, kernel debugger. Dprobes interfaces with
Opersys's LTT to provide a Universal Dynamic Trace facility.

For more information on DProbes please visit the dprobes homepage at
http://oss.software.ibm.com/developerworks/opensource/linux/projects/dprobes/



  Suparna Bhattacharya
  Linux Technology Center
  IBM Software Lab, India
  E-mail : [EMAIL PROTECTED]
  Phone : 91-80-5267117, Extn : 2525










Core dumps for threads: Request for your patch (hack)

2001-03-15 Thread bsuparna


>
>(I have a complimentary hack that will dump the stacks of all the
>rest of the threads as well (though its a good trick to get gdb to
>interpret this). Available upon request.)
>

Hello Adam,

Could I take a look at your patch ?

Regards
Suparna

  Suparna Bhattacharya
  IBM Software Lab, India
  E-mail : [EMAIL PROTECTED]
  Phone : 91-80-5267117, Extn : 2525





Re: (struct dentry *)->vfsmnt;

2001-03-15 Thread bsuparna



>Actually, I'm pretty sure you _never_ need to exportvg in order to have
>it work on another system.  That's one of the great things about AIX LVM,
>because it means you can move a VG to another system after a hardware
>problem, and not have any problems importing it (journaled fs also helps).
>AFAIK, the only think exportvg does is remove VG information from the
>ODM and /etc/filesystems.
>

Yes that's correct as far as I know too. The VGDA and LVCB contain all the
information required for import even without an exportvg.

>I suppose it is possible that because AIX is so tied into the ODM and
>SMIT, that it updates the VGDA mountpoint info whenever a filesystem
>mountpoint is changed, but this will _never_ work on Linux because of
>different tools versions, distributions, etc.  Also, it would mean on
>AIX that anyone editing /etc/filesystems might have a broken system at
>vgimport time (wouldn't be the first time that not using ODM/SMIT caused
>such a problem).

Yes, you can think of crfs (or chfs) as a composite command that handles
this (writing to the LVCB). These are more like
administrative/setup/configuration commands -- one-time, or occasional,
system configuration changes.

On the other hand a mount doesn't cause a persistent configuration
information change. You can issue a mount even if an entry doesn't exist in
/etc/filesystems.

>
>> ... I do think that the LVM is a reasonable place to store this kind of
>> information.
>
>Yes, even though it would tie the user into using a specific version of
>mount(), I suppose it is a better solution than storing it inside the
>filesystem.  It will work with non-ext2 filesystems, and it also allows
>you to store more information than simply the mountpoint (e.g. mount
>options, dump + fsck info, etc).  In the end, I will probably just
>save the whole /etc/fstab line into the LV header somewhere, and extract
>it at importvg time (possibly with modifications for vgname and
mountpoint).
>
>Cheers, Andreas

Is mount the right time to do this ? A mount happens on every boot of the
system.
And then, one can issue a mount by explicitly specifying all the parameters
without having an entry in fstab. [Doesn't that also mean that you have a
possibility of inconsistency even here ?]






Re: (struct dentry *)->vfsmnt;

2001-03-15 Thread bsuparna


>Because this is totally filesystem specific - why put extra knowledge
>of filesystem internals into mount?  I personally don't want it writing
>into the ext2 or ext3 superblock.  How can it possibly know what to do,
>without embedding a lot of knowledge there?  Yes, mount(8) can _read_
>the UUID and LABEL for ext2 filesystems, but I would rather not have it
>_write_ into the superblock.  Also, InterMezzo and SnapFS have the same
>on-disk format as ext2, but would mount(8) know that?
>
>There are other filesystems (at least IBM JFS) that could also take
>advantage of this feature, should we make mount(8) have code for each
>and every filesystem?  Yuck.  Sort of ruins the whole modularity thing.
>Yes, I know mount(8) does funny stuff for SMB and NFS, but that is a
>reason to _not_ put more filesystem-specific information into mount(8).
>

Since you've brought up this point:
I have wondered why Linux doesn't yet seem to have the option of a generic,
filesystem-type-specific, user space mount helper command. I recall having
seen code in the mount(8) implementation to call mount.<fstype>, but it's
still under an ifdef, isn't it, except for smb or ncp perhaps ? (Hope I'm
not out-of-date on this.)
Having something like that lets one streamline userland filesystem-specific
stuff like this, without having the generic part of mount(8) know about it.

For example, in AIX, the association between type and the program for mount
helpers (and also for filesystem helpers for things like mkfs, fsck etc) is
configured in /etc/vfs, while SUN and HP look for them under particular
directory locations (by fstype name).

Actually, it'd be good to have this in such a way that if a specific helper
doesn't exist, default mount processing continues. This avoids the extra
work of writing such helpers for every new filesystem, unless we need
specialized behaviour there.



 Suparna Bhattacharya
  IBM Software Lab, India
  E-mail : [EMAIL PROTECTED]
  Phone : 91-80-5267117, Extn : 2525





RFC: Kernel mechanism: Compound event wait/notify + callback chains

2001-01-29 Thread bsuparna


Comments, suggestions, advice, feedback solicited !

If this seems like something that might (after some refinements) be a
useful abstraction to have, then I need some help in straightening out the
design. I am not very satisfied with it in its current form.


A Kernel Mechanism for Compound Event Wait/Notify with Callback Chaining
-

Objective:
-

This is a proposal for a kernel mechanism for event notification that can
provide for:
 (i)   triggering of multiple actions via callback chains (the way
notifier lists do) - some of these actions could involve waking up
synchronous waiters and asynchronous signalling
 (ii)  compound events: a compound event is an event that is
essentially an aggregation of several sub-events (each of which in turn may
have its own sub-events)  [this should extend to more than 2 levels, unlike
semaphore groups or poll wait tables]
 (iii) a single abstraction that can serve as an IO completion
indicator

The need for such a structure originally came about in the context of
layered drivers and layered filesystem implementations, where an io
structure may need to pass through several layers of post-processing as
part of actual i/o completion, which must be asynchronous.

Ben LaHaise had some thoughts about extending the wait queue interface to
allow callbacks, thus building a generic way to address this requirement
for any kind of io structure.

Here is an attempt to take that idea further, keeping in mind certain
additional requirements (including buffer splitting, io cancellation) and
various potential usage scenarios and also considering  existing
synchronization/ notification/ callback primitives that exist in the linux
kernel today.

I am calling this abstraction a "compound_event", or "cev" for short, right
now (for want of a better name).


How/where would/could this be used ?
--

A cev would typically be associated with an object -- e.g. an io structure
or descriptor. Either have the cev embedded in the io structure, or just a
private pointer field in the cev used to link the cev to the object.

1. In io/mem structures -- buffer/kiobuf/page_cache - especially places
where layered callbacks and compound IOs may occur (e.g encryption filter
filesystems, encryption drivers, lvm, raid, evms type situations).
[Need to explore the possibility of using it in n/w i/o structures, as
well]

2. As a basic synchronization primitive to trigger action when multiple
pre-conditions are involved.  Could potentially be used as an alternative
way to implement select/poll and even semaphore groups. The key advantage
is that multiple actions/checks can be invoked without involving a context
switch, and that sub-events can be aborted top down through multiple levels
of aggregation  (using the cancellation feature).

3. A primitive to trigger a sequence of actions on an event, and to wake up
waiters on completion of the sequence. (e.g a timer could be associated
with a cev to trigger a chain of actions and a notification of completion
of all of these to multiple interested parties)

Some reasons for using such a primitive rather than the current approach of
adding end_io/private/wait fields directly within buffer heads or
vfs_kiovecs:

1. Uniformity of structure at different levels; may also be helpful in
terms of statistics collection/metering.  Keeps the completion linkage
mechanism external to the specific io structures.
2. Callback chaining makes layered systems/interceptors easier to add on,
as no sub-structures need to be created in cases where splitting/merging
isn't necessary.
3. The ability to abort the callback chain for the time being, based on the
return value of a callback, has some interesting consequences: it makes it
possible to support error recovery/retries of failed sub-ios, and also a
networking-stack kind of layering architecture involving queues at each
level with delayed callbacks rather than direct callback invocation.


---
Proposed Design
--

"Perfection in design is achieved not when there is nothing more to add,
but when there is nothing left to take away"

So, not there yet !
Anyway, after a few iterations to try to bring this down, and to keep it
efficient in simplest usage scenarios, while at the same time supporting
infrastructure for all the intended functionality, this is how it stands
(Some fields are there for search efficiency reasons when sub-events are
involved, though not actually essential):

I'm not too happy with this design in its current form. Need some
suggestions to improve this.

struct compound_event {
 /* 1. Basic infrastructure for simplest cases */
 void (*done)();
 unsigned i

Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

2001-01-30 Thread bsuparna


Ben,

This indeed looks neat and simple !
I'd avoided touching the wait queue structure as I suspected that you might
already have something like this in place :-)
and was hoping that you'd see this posting and comment.
I agree entirely that it makes sense to have chaining of events built over
simple minimalist primitives. That's what was making me uncomfortable with
the cev design I had.

So now I'm thinking how to do this using the wait queues extension you
have. Some things to consider:
 - Since non-exclusive waiters are always added to the head of the
queue (unless we use a tq in a wtd kind of structure as ), ordering of
layered callbacks might still be a problem. (e.g. with an encryption filter
fs, we want the decrypt callback to run  before any waiter gets woken up;
irrespective of whether the wait was issued before or after the decrypt
callback was added by the filter layer)
 - The wait_queue_func gets only a pointer to the wait structure as an
argument, with no other means to pass any state about the sub-event that
caused it (could that be a problem with event chaining ... ? every
encapsulating structure will have to maintain a pointer to the related
sub-event ...  ? )

Regards
Suparna


  Suparna Bhattacharya
  Systems Software Group, IBM Global Services, India
  E-mail : [EMAIL PROTECTED]
  Phone : 91-80-5267117, Extn : 2525


Ben LaHaise <[EMAIL PROTECTED]> on 01/30/2001 10:59:46 AM

Please respond to Ben LaHaise <[EMAIL PROTECTED]>

To:   Suparna Bhattacharya/India/IBM@IBMIN
cc:   [EMAIL PROTECTED], [EMAIL PROTECTED]
Subject:  Re: [Kiobuf-io-devel] RFC:  Kernel mechanism: Compound event
  wait/notify + callback chains




On Tue, 30 Jan 2001 [EMAIL PROTECTED] wrote:

>
> Comments, suggestions, advise, feedback solicited !
>
> If this seems like something that might (after some refinements) be a
> useful abstraction to have, then I need some help in straightening out
the
> design. I am not very satisfied with it in its current form.

Here's my first bit of feedback from the point of "this is what my code
currently does and why".

The waitqueue extension below is a minimalist approach for providing
kernel support for fully asynchronous io.  The basic idea is that a
function pointer is added to the wait queue structure that is called
during wake_up on a wait queue head.  (The patch below also includes
support for exclusive lifo wakeups, which isn't crucial/perfect, but just
happened to be part of the code.)  No function pointer or other data is
added to the wait queue structure.  Rather, users are expected to make use
of it by embedding the wait queue structure within their own data
structure that contains all needed info for running the state machine.

Here's a snippet of code which demonstrates a non blocking lock of a page
cache page:

struct worktodo {
        wait_queue_t        wait;
        struct tq_struct    tq;
 void *data;
};

static void __wtd_lock_page_waiter(wait_queue_t *wait)
{
struct worktodo *wtd = (struct worktodo *)wait;
struct page *page = (struct page *)wtd->data;

if (!TryLockPage(page)) {
__remove_wait_queue(&page->wait, &wtd->wait);
wtd_queue(wtd);
} else {
schedule_task(&run_disk_tq);
}
}

void wtd_lock_page(struct worktodo *wtd, struct page *page)
{
        if (TryLockPage(page)) {
                int raced = 0;
                wtd->data = page;
                init_waitqueue_func_entry(&wtd->wait, __wtd_lock_page_waiter);
                add_wait_queue_cond(&page->wait, &wtd->wait,
                                    TryLockPage(page), raced = 1);

                if (!raced) {
                        run_task_queue(&tq_disk);
                        return;
                }
        }

        wtd->tq.routine(wtd->tq.data);
}


The use of wakeup functions is also useful for waking a specific reader or
writer in the rw_sems, making semaphore avoid spurious wakeups, etc.

I suspect that chaining of events should be built on top of the
primatives, which should be kept as simple as possible.  Comments?

  -ben


diff -urN v2.4.1pre10/include/linux/mm.h work/include/linux/mm.h
--- v2.4.1pre10/include/linux/mm.h Fri Jan 26 19:03:05 2001
+++ work/include/linux/mm.h   Fri Jan 26 19:14:07 2001
@@ -198,10 +198,11 @@
  */
 #define UnlockPage(page) do { \
 smp_mb__before_clear_bit(); \
+if (!test_bit(PG_locked, &(page)->flags)) { printk("last: %p\n", (page)->last_unlock); BUG(); } \
+(page)->last_unlock = current_text_addr(); \
 if (!test_and_clear_bit(PG_locked, &(page)->flags)) BUG(); \
 smp_mb__after_clear_bit(); \
-if (waitqueue_active(&page->wait)) \
- wake_up(&page->wait); \
+wake_up(&page->wait); \
} while (0)
 #define PageError(page)  test_bit(PG_error, &(page)->flags)
 #def

Re: [Kiobuf-io-devel] Re: RFC: Kernel mechanism: Compound event wait/notify + callback chains

2001-01-31 Thread bsuparna

Matthew,

   Thanks for mentioning this. I didn't know about it earlier. I've been
going through the 4/00 kqueue patch on freebsd ...

   Regards
   Suparna


  Suparna Bhattacharya
  Systems Software Group, IBM Global Services, India
  E-mail : [EMAIL PROTECTED]
  Phone : 91-80-5267117, Extn : 2525


Matthew Jacob <[EMAIL PROTECTED]> on 01/30/2001 12:08:48 PM

Please respond to [EMAIL PROTECTED]

To:   Suparna Bhattacharya/India/IBM@IBMIN
cc:   [EMAIL PROTECTED], [EMAIL PROTECTED]
Subject:  [Kiobuf-io-devel] Re: RFC:  Kernel mechanism: Compound event
  wait/notify + callback chains





Why don't you look at Kqueues (from FreeBSD)?










Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

2001-01-31 Thread bsuparna


>The waitqueue extension below is a minimalist approach for providing
>kernel support for fully asynchronous io.  The basic idea is that a
>function pointer is added to the wait queue structure that is called
>during wake_up on a wait queue head.  (The patch below also includes
>support for exclusive lifo wakeups, which isn't crucial/perfect, but just
>happened to be part of the code.)  No function pointer or other data is
>added to the wait queue structure.  Rather, users are expected to make use
>of it by embedding the wait queue structure within their own data
>structure that contains all needed info for running the state machine.

>I suspect that chaining of events should be built on top of the
>primatives, which should be kept as simple as possible.  Comments?

Do the following modifications to your wait queue extension sound
reasonable ? (A small sketch of what (2) and (3) would mean follows the list.)

1. Change add_wait_queue to add elements to the end of queue (fifo, by
default) and instead have an add_wait_queue_lifo() routine that adds to the
head of the queue ?
  [This will help avoid the problem of waiters getting woken up before LIFO
wakeup functions have run, just because the wait happened to have been
issued after the LIFO callbacks were registered, for example, while an IO
is going on]
   Or is there a reason why add_wait_queue adds elements to the head by
default ?

2. Pass the wait_queue_head pointer as a parameter to the wakeup function
(in addition to wait queue entry pointer).
[This will make it easier for the wakeup function to access the  structure
in which the wait queue is embedded, i.e. the object which the wait queue
is associated with. Without this, we might have to store a pointer to this
object in each element linked in the wait queue. This never was a problem
with sleeping waiters because the a reference to the object being waited
for would have been on the waiter's stack/context, but with wakeup
functions there is no such context]

3. Have __wake_up_common break out of the loop if the wakeup function
returns 1 (or some other value) ?
[This makes it possible to abort the loop based on conditional logic in the
wakeup function]
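
For illustration, (2) and (3) together would change the wakeup callback
roughly like this (a sketch against your extension, typedef name illustrative,
not an existing interface):

/* current extension: called with just the wait queue entry */
typedef void (*wait_queue_func_t)(wait_queue_t *wait);

/* proposed: also pass the head, so the callback can locate the object the
 * wait queue is embedded in without a private back-pointer; a non-zero
 * return could tell __wake_up_common to stop walking the queue */
typedef int (*wait_queue_func_t)(wait_queue_head_t *head, wait_queue_t *wait);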


Regards
Suparna


  Suparna Bhattacharya
  Systems Software Group, IBM Global Services, India
  E-mail : [EMAIL PROTECTED]
  Phone : 91-80-5267117, Extn : 2525







Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-01-31 Thread bsuparna



>Hi,
>
>On Wed, Jan 31, 2001 at 07:28:01PM +0530, [EMAIL PROTECTED] wrote:
>>
>> Do the following modifications to your wait queue extension sound
>> reasonable ?
>>
>> 1. Change add_wait_queue to add elements to the end of queue (fifo, by
>> default) and instead have an add_wait_queue_lifo() routine that adds to
the
>> head of the queue ?
>
>Cache efficiency: you wake up the task whose data set is most likely
>to be in L1 cache by waking it before its triggering event is flushed
>from cache.
>
>--Stephen

Valid point.








Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

2001-01-31 Thread bsuparna



>My first comment is that this looks very heavyweight indeed.  Isn't it
>just over-engineered?

Yes, I know it is, in its current form (sigh !).

But at the same time, I do not want to give up (not yet, at least) on
trying to arrive at something that can serve the objectives, and yet be
simple in principle and lightweight too. I feel the need may  grow as we
have more filter layers coming in, and as async i/o and even i/o
cancellation usage increases. And it may not be just with kiobufs ...

I took a second pass attempt at it last night based on Ben's wait queue
extensions. Will write that up in a separate note after this. Do let me
know if it seems like any improvement at all.

>We _do_ need the ability to stack completion events, but as far as the
>kiobuf work goes, my current thoughts are to do that by stacking
>lightweight "clone" kiobufs.

Would that work with stackable filesystems ?

>
>The idea is that completion needs to pass upwards (a)
>bytes-transferred, and (b) errno, to satisfy the caller: everything
>else, including any private data, can be hooked by the caller off the
>kiobuf private data (or in fact the caller's private data can embed
>the clone kiobuf).
>
>A clone kiobuf is a simple header, nothing more, nothing less: it
>shares the same page vector as its parent kiobuf.  It has private
>length/offset fields, so (for example) a LVM driver can carve the
>parent kiobuf into multiple non-overlapping children, all sharing the
>same page list but each one actually referencing only a small region
>of the whole.
>
>That ought to clean up a great deal of the problems of passing kiobufs
>through soft raid, LVM or loop drivers.
>
>I am tempted to add fields to allow the children of a kiobuf to be
>tracked and identified, but I'm really not sure it's necessary so I'll
>hold off for now.  We already have the "io-count" field which
>enumerates sub-ios, so we can define each child to count as one such
>sub-io; and adding a parent kiobuf reference to each kiobuf makes a
>lot of sense if we want to make it easy to pass callbacks up the
>stack.  More than that seems unnecessary for now.

Being able to track the children of a kiobuf would help with I/O
cancellation (e.g. to pull sub-ios off their request queues if I/O
cancellation for the parent kiobuf was issued). Not essential, I guess, in
general, but useful in some situations.
With a clone kiobuf there is no direct way to reach a clone kiobuf given
the original kiobuf (without adding some indexing scheme ).

>
>--Stephen






Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread bsuparna


Here's a second pass attempt, based on Ben's wait queue extensions:
Does this sound any better ?

[This doesn't require any changes to the existing wait_queue_head based i/o
structures or to existing drivers, and the constructs mentioned come into
the picture only when compound events are actually required]

The key aspects are:
1.  Just using an extended wait queue now instead of the callbackq for
completion (this can take care of layered callbacks, and aggregation via
wakeup functions)
2. The io structures don't need to change - they already have a
wait_queue_head embedded anyway (e.g. b_wait); in fact io completion happens
simply by waking up the waiters in the wait queue, just as it happens now.
3. Instead, all co-relation information is maintained in the wait_queue
entries that involve compound events
4. No cancel callback queue any more.

(a) For simple layered callbacks (as in encryption filesystems/drivers):
 Intermediate layers simply use add_wait_queue(_lifo) to add their
callbacks to the object's wait queue as wakeup functions. The wakeup
function can access fields in the object associated with the wait queue,
using the wait_queue_head address since the wait_queue_head is embedded in
the object.
 If the wakeup function has to be associated with any other private
data, then an embedding structure is required, e.g:
/* Layered event structure */
 struct lev {
        wait_queue_t    wait;
        void            *data;
 };

or maybe something like the work_todo structure that Ben had given as an
example (if callback actions have to be delayed to task context). Actually,
in that case, we might like to have the wakeup function return 1 if it
needs to do some work later, and that work needs to be completed before the
remaining waiters are woken up.

(b) For compound events:

/* Compound event structure */
 struct cev_wait {
        wait_queue_t        wait;
        wait_queue_head_t   *parent;
        unsigned int        flags;      /* optional */
        struct list_head    cev_list;   /* links to sibling or child
                                           cev_waits, as applicable */
        wait_queue_head_t   *head;      /* head of wait queue on which this
                                           is/was queued - optional ? */
 };

In this case, for each child:
 wait.func() is set to a routine that performs any necessary
transfer/status/count updates from the child to the parent object, and issues
a wakeup on the parent's wait queue (it also removes itself from the child's
wait queue, and optionally from the parent's cev_list too).
It is this update step that will be situation/subsystem specific, and it also
has a return value to indicate whether to detach from the parent or not.

And for the parent queue, a cev_wait would be registered at the beginning,
with its wait.func() set up to collate ios and let completion proceed if
the relevant criteria is met. It can reach all the child cev_waits through
the cev_list links, useful for aggregating data from all children.
During i/o cancellation, the status of the parent object is set to indicate
cancellation and wakeup issued on its wait queue. The parent cev_wait's
wakeup function, if it recognizes the cancel, would then cancel all the
sub-events.
(Is there a nice way to access the object's status from the wakeup function
that doesn't involve subsystem specific code ? )

So, it is the step of collating ios and deciding whether to proceed  which
is situation/subsystem specific. Similarly, the actual operation
cancellation logic (e.g cancelling the underlying io request) is also
non-generic.

For this reason, I was toying with the option of introducing two function
pointers - complete() and cancel() - in the cev_wait structure, so that the
rest of the logic in the wakeup function can be kept common. Does that make
sense ?
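
With those two hooks, the child-side wakeup function could look roughly like
this (a sketch only, assuming the modified wakeup-function signature proposed
in my earlier mail, and that complete()/cancel() have been added to cev_wait
as the subsystem-specific hooks):

static int cev_child_wake(wait_queue_head_t *head, wait_queue_t *wait)
{
        struct cev_wait *cw = (struct cev_wait *) wait;

        /* subsystem-specific collation/update of the parent's state;
         * the return value says whether to detach from the parent */
        if (cw->complete(cw)) {
                __remove_wait_queue(head, &cw->wait);
                list_del(&cw->cev_list);
        }
        wake_up(cw->parent);    /* let the parent's cev_wait collate/decide */
        return 0;
}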

Need to define routines for initializing and setting up parent-child
cev_waits.

Right now this assumes that the changes suggested in my last posting can be
made. So still need to think if there is a way to address the cache
efficiency issue (that's a little hard).

Regards
Suparna

  Suparna Bhattacharya
  Systems Software Group, IBM Global Services, India
  E-mail : [EMAIL PROTECTED]
  Phone : 91-80-5267117, Extn : 2525





Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread bsuparna


sct wrote:
>> >
>> > Thanks for mentioning this. I didn't know about it earlier. I've been
>> > going through the 4/00 kqueue patch on freebsd ...
>>
>> Linus has already denounced them as massively over-engineered...
>
>That shouldn't stop anyone from looking at them and learning, though.
>There might be a good idea or two hiding in there somewhere.
>- Dan
>

There is always scope to learn from a different approach to tackling a
problem of a similar nature - both from good ideas as well as from
over-engineered ones - sometimes more from the latter :-)

As far as I have understood so far from looking at the original kevent
patch and notes (which perhaps isn't enough, and maybe out of date as well),
the concept of knotes and filter ops, and the event queuing mechanism in
itself, is interesting and generic, but most of it seems to have been
designed with linkage to user-mode issuable event waits in mind - like
poll/select/aio/signal etc., at least as it appears from the way it's been
used in the kernel. A little different from what I had in mind, though it's
perhaps possible to use it otherwise. But maybe I've just not thought about
it enough or understood it.

Regards
Suparna

  Suparna Bhattacharya
  Systems Software Group, IBM Global Services, India
  E-mail : [EMAIL PROTECTED]
  Phone : 91-80-5267117, Extn : 2525





Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread bsuparna


>Hi,
>
>On Thu, Feb 01, 2001 at 10:25:22AM +0530, [EMAIL PROTECTED] wrote:
>>
>> >We _do_ need the ability to stack completion events, but as far as the
>> >kiobuf work goes, my current thoughts are to do that by stacking
>> >lightweight "clone" kiobufs.
>>
>> Would that work with stackable filesystems ?
>
>Only if the filesystems were using VFS interfaces which used kiobufs.
>Right now, the only filesystem using kiobufs is XFS, and it only
>passes them down to the block device layer, not to other filesystems.

That would require the vfs interfaces themselves (the address space
readpage/writepage ops) to take kiobufs as arguments, instead of struct
page *. That's not the case right now, is it ?
A filter filesystem would be layered over XFS, to take this example.
So right now a filter filesystem only sees the struct page * and passes
this along. Any completion event stacking has to be applied with reference
to this.


>> Being able to track the children of a kiobuf would help with I/O
>> cancellation (e.g. to pull sub-ios off their request queues if I/O
>> cancellation for the parent kiobuf was issued). Not essential, I guess,
in
>> general, but useful in some situations.
>
>What exactly is the justification for IO cancellation?  It really
>upsets the normal flow of control through the IO stack to have
>voluntary cancellation semantics.

One reason that I saw is that if the results of an i/o are no longer
required due to some condition (e.g. aio cancellation situations, or if the
process that issued the i/o gets killed), then this avoids the unnecessary
disk i/o, if the request hasn't been scheduled as yet.

Too remote a requirement ? If the capability/support doesn't exist at the
driver level, I guess it's difficult.

--Stephen







Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-02 Thread bsuparna


>Hi,
>
>On Thu, Feb 01, 2001 at 01:28:33PM +0530, [EMAIL PROTECTED] wrote:
>>
>> Here's a second pass attempt, based on Ben's wait queue extensions:
> Does this sound any better ?
>
>It's a mechanism, all right, but you haven't described what problems
>it is trying to solve, and where it is likely to be used, so it's hard
>to judge it. :)

Hmm .. I thought I had done that in my first posting, but obviously, I
mustn't have done a good job at expressing it, so let me take another stab
at trying to convey why I started on this.

There are certain specific situations that I have in mind right now, but to
me it looks like the very nature of the abstraction is such that it is
quite likely that there would be uses in some other situations which I may
not have thought of yet, or just do not understand well enough to vouch for
at this point. What those situations could be, and the associated issues
involved (especially performance related) is something that I hope other
people on this forum would be able to help pinpoint, based on their
experiences and areas of expertise.

I do realize that something generic and yet simple and performance-optimal
in all kinds of situations is a really difficult (if not impossible :-) )
thing to achieve, but even then, wouldn't it be nice to at least abstract
out the uniformity in patterns across situations, in a way that can be
tweaked/tuned for each specific class of situations ?

And the nice thing which I see about Ben's wait queue extensions is that it
gives us a route to try to do that ...

Some needs considered (and associated problems):

a. Stacking of completion events - asynchronously, through multiple layers
 - layered drivers  (encryption, conversion)
 - filter filesystems
Key aspects:
 1. It should be possible to pass the same (original) i/o container
structure all the way down (no copies/clones should need to happen, unless
actual i/o splitting, or extra buffer space or multiple sub-ios are
involved)
 2. Transparency: Neither the upper layer nor the layer below it should
need to have any specific knowledge about the existence/absence of an
intermediate filter layer (the mechanism should hide all that)
 3. LIFO ordering of completion actions
 4. The i/o structure should be marked as up-to-date only after all the
completion actions are done.
 5. Preferably have waiters on the i/o structure woken up only after
all completion actions are through (to avoid spurious/redundant wakeups
since the data won't be ready for use)
 6. Possible to have completion actions execute later in task context

b. Co-relation between multiple completion events and their associated
operations and data structures
 -  (bottom up aspect) merging results of split i/o requests, and
marking the completion of the compound i/o through multiple such layers
(tree), e.g
  - lvm
  - md / raid
  - evms aggregator features
 - (top down aspect) cascading down i/o cancellation requests /
sub-event waits , monitoring sub-io status etc
  Some aspects:
 1. Result of collation of sub-i/os may be driver specific  (In some
situations like lvm  - each sub i/o maps to a particular portion of a
buffer; with software raid or some other kind of scheme the collation may
involve actually interpreting the data read)
 2. Re-start/retries of sub-ios (in case of errors) can be handled.
 3. Transparency : Neither the upper layer nor the layer below it
should need to have any specific knowledge about the existence/absence of
an intermediate layer (that sends out multiple sub i/os)
 4. The system should be devised to avoid extra logic/fields in the
generic i/o structures being passed around, in situations where no compound
i/o is involved (i.e. in the simple i/o cases and most common situations).
As far as possible it is desirable to keep the linkage information outside
of the i/o structure for this reason.
 5. Possible to have collation/completion actions execute later in task
context


Ben LaHaise's wait queue extensions take care of most of the aspects of
(a), if used with a little care to ensure a(4).
[This just means that the function that marks the i/o structure as
up-to-date should be put in the completion queue first.]
With this, we don't even need an explicit end_io() in bh/kiobufs etc. Just
the wait queue would do.
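
To illustrate (this is not Ben's actual interface - just a rough sketch
assuming a completion queue whose entries carry a callback, built on the
standard list macros):

struct cb_wait {
 struct list_head list;
 void (*func)(struct cb_wait *w);  /* one completion action */
 void *data;                       /* e.g. the kiobuf it operates on */
};

/* Each layer pushes its action at the head, so actions run LIFO: the
   filter/decryption step added last runs first, while the entry that marks
   the structure up-to-date (queued first) runs last of all - a(3) and a(4). */
static void cb_push(struct list_head *q, struct cb_wait *w)
{
 list_add(&w->list, q);
}

/* Called once, when the i/o really completes */
static void cb_run_completions(struct list_head *q)
{
 while (!list_empty(q)) {
  struct cb_wait *w = list_entry(q->next, struct cb_wait, list);

  list_del(&w->list);
  w->func(w);
 }
}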

Only a(5) needs some thought since cache efficiency is upset by changing
the ordering of waits.

But, (b) needs a little more work as a higher level construct/mechanism
that latches on to the wait queue extensions. That is what the cev_wait
structure was designed for.
It keeps the chaining information outside of the i/o structures by default
(They can be allocated together, if desired anyway)

Is this still too much in the air ? Maybe I should describe the flow in a
specific scenario to illustrate ?

Regards
Suparna


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]

Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-04 Thread bsuparna


>Hi,
>
>On Fri, Feb 02, 2001 at 12:51:35PM +0100, Christoph Hellwig wrote:
>> >
>> > If I have a page vector with a single offset/length pair, I can build
>> > a new header with the same vector and modified offset/length to split
>> > the vector in two without copying it.
>>
>> You just say in the higher-level structure ignore from x to y even if
>> they have an offset in their own vector.
>
>Exactly --- and so you end up with something _much_ uglier, because
>you end up with all sorts of combinations of length/offset fields all
>over the place.
>
>This is _precisely_ the mess I want to avoid.
>
>Cheers,
> Stephen

It appears that we are coming across 2 kinds of requirements for kiobuf
vectors - and quite a bit of debate centering around that.

1. In the block device i/o world, where large i/os may be involved, we'd
like to be able to describe chunks/fragments that contain multiple pages;
which is why it makes sense to have a single <offset, len> pair for the
entire set of pages in a kiobuf, rather than having to deal with per page
offset/len fields.

2. In the networking world, we deal with smaller fragments (for protocol
headers and stuff, and small packets) ideally chained together, typically
not page aligned, with the ability to extend the list at least at the head
and tail (and maybe some reshuffling in case of ip fragmentation?); so I
guess that's why it seems good to have an <offset, len> pair per
page/fragment. (If there can be multiple fragments in a page, even this
might not be frugal enough ... )

Looks like there are 2 kinds of entities that we are looking for in the kio
descriptor:
 - A collection of physical memory pages (call it say, a page_list)
 - A collection of fragments of memory described as <offset, len> tuples
w.r.t this collection (offset in turn could be <page index, offset within
the page> if it helps) (call this collection a frag_list)

Can't we define a kiobuf structure as just this ? A combination of a
frag_list and a page_list ? (Clone kiobufs might share the original
kiobuf's page_list, but just split parts of the frag_list)
How hard is it to maintain and to manipulate such a structure ?

BTW, we could have a higher level io container that includes a <status>
field and a <wait_queue_head> to take care of i/o completion (If we have a
wait queue head, then I don't think we need a separate callback function if
we have Ben's wakeup functions in place).

Or, is this going in the direction of a cross between an elephant and a
bicycle :-) ?

Regards
Suparna


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-05 Thread bsuparna



>Hi,
>
>On Sun, Feb 04, 2001 at 06:54:58PM +0530, [EMAIL PROTECTED] wrote:
>>
>> Can't we define a kiobuf structure as just this ? A combination of a
>> frag_list and a page_list ?
>

>Then all code which needs to accept an arbitrary kiobuf needs to be
>able to parse both --- ugh.
>

Making this a little more explicit to help analyse tradeoffs:

/* Memory descriptor portion of a kiobuf - this is something that may get
passed around between layers and subsystems */
struct kio_mdesc {
 int nr_frags;
 struct frag *frag_list;
 int nr_pages;
 struct page **page_list;
 /* list follows */
};

For block i/o requiring #1 type descriptors, the list could have allocated
extra space for:
struct kio_type1_ext {
 struct frag frag;
 struct page *pages[NUM_STATIC_PAGES];
}

For n/w i/o or cases requiring  #2 type descriptors, the list could have
allocated extra space for:

struct kio_type2_ext {
 struct frag frags[NUM_STATIC_FRAGS];
 struct page *page[NUM_STATIC_FRAGS];
}


struct kiobuf {
 int status;
 wait_queue_head_t waitq;
 struct kio_mdesc mdesc;
 /* list follows - leaves room for allocation for mem descs, completion
sub structs etc */
};

Code that accepts an arbitrary kiobuf needs to do the following (a rough
sketch of this walk follows the list):
 process the fragments one by one
  - type #1 case, only one fragment would typically be there, but
processing it would involve crossing all pages in the page list
    So extra processing vs a kiobuf with a single <offset, len>
pair involves the following:
     dereferencing the frag_list pointer
     checking the nr_frags field
  - type #2 case, the number of fragments would be equal to or
greater than the number of pages, so processing will typically go over each
fragment and thus cross each page in the list one by one
    So extra processing vs a kiobuf with per-page <offset, len>
pairs involves:
     dereferencing the page list entry (involves computing the
page index in the page_list from the offset value)
     checking that offset+len doesn't fall outside the page
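
To spell that walk out (struct frag wasn't defined above, so this assumes
it is simply an offset/len pair, with offset measured from the start of the
memory covered by the page_list):

struct frag {
 unsigned long offset;   /* byte offset from the start of the page_list's memory */
 unsigned int len;
};

/* Walk every <page, offset, len> piece of a kio_mdesc, whichever style
   (#1 or #2) it was built in.  fn() is whatever the caller wants done per
   piece (map it, build a scatter-gather entry, ...). */
static int kio_for_each_piece(struct kio_mdesc *md,
       int (*fn)(struct page *pg, unsigned int off, unsigned int len))
{
 int f;

 for (f = 0; f < md->nr_frags; f++) {    /* extra dereference + nr_frags check */
  unsigned long pos = md->frag_list[f].offset;
  unsigned int left = md->frag_list[f].len;

  while (left) {
   int idx = pos >> PAGE_SHIFT;          /* page index computed from the offset */
   unsigned int off = pos & ~PAGE_MASK;
   unsigned int chunk = PAGE_SIZE - off;

   if (chunk > left)
    chunk = left;
   if (idx >= md->nr_pages)              /* offset+len must stay inside the page list */
    return -EINVAL;
   if (fn(md->page_list[idx], off, chunk))
    return -EIO;
   pos += chunk;
   left -= chunk;
  }
 }
 return 0;
}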


Boils down to approx one extra dereference and one comparison  per kiobuf
for the common cases (have I missed something critical ?)  vs the most
optimized choices of descriptors for those cases.

In terms of resource consumption (extra bytes taken up), two fields extra
per kiobuf chain (e.g. nr_frags and frag_list pointer when it comes to #1),
i.e. a total of 8 bytes, for the common cases vs the most optimized choice
of structures for those cases.

This seems to be more uniformly balanced across the #1 and #2 cases, than
an <offset, len> for every page, as well as an overall <offset, len>. But,
then, come to think of it, since the need for lightweight structures is
greater in the case of #2, should the point of balance (if at all we want
to find one) be tilted towards #2 ?

On the other hand, since having a common structure does involve extra bytes
and cycles, if there are very few situations where we need both #1 and #2,
then conversion only at subsystem boundaries (the way i2o does it) may turn
out to be better.

Oh well ...


>> BTW, we could have a higher level io container that includes a <status>
>> field and a <wait_queue_head> to take care of i/o completion
>
>IO completion requirements are much more complex.  Think of disk
>readahead: we can create a single request struct for an IO of a
>hundred buffer heads, and as the device driver satisfies that request,
>it wakes up the buffer heads as it goes.  There is a separete
>completion notification for every single buffer head in the chain.
>
I understand the requirement of independent completion notifiers for higher
level buffers/other structures, since they are indeed independently usable
structures. That was one aspect that I thought I was able to address
in the cev_wait design based on wait_queue wakeup functions.
The way it would work is that there would be multiple wakeup functions
registered on the container for the big request, each wakeup function being
responsible for waking up a higher level buffer. This way, the linkage
information is actually external to the buffer structures (which seems
reasonable, since it is only required while the i/o is happening, unless
there is another reason to keep a more lasting association)
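
Roughly, and with invented names (this is not the actual cev_wait code, and
it assumes 2.4-style bh->b_end_io(bh, uptodate) completion for the higher
level buffers), the linkage could look like:

struct buf_wakeup {
 struct list_head list;        /* on the big request's completion list */
 struct buffer_head *bh;       /* the higher level buffer this node completes */
};

/* Run by the container when the request (or the chunk covering these
   buffers) finishes; each buffer gets its own independent completion. */
static void wake_buffers(struct list_head *done, int uptodate)
{
 while (!list_empty(done)) {
  struct buf_wakeup *bw = list_entry(done->next, struct buf_wakeup, list);

  list_del(&bw->list);
  bw->bh->b_end_io(bw->bh, uptodate);
  kfree(bw);
 }
}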

>It's the very essence of readahead that we wake up the earlier buffers
>as soon as they become available, without waiting for the later ones
>to complete, so we _need_ this multiple completion concept.
>

I can understand this in principle, but when we have a single request going
down to the device that actually fills in multiple buffers, do we get
notified (interrupted) by the device before all the data in that request
got transferred ? I mean, how do we know that some buffers have become
available until the overall device request has completed (unless of course
the request actually gets broken up at this level and completed bit by
bit).


>Which is exactly why we have one kiobuf per higher

Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-06 Thread bsuparna


>Hi,
>
>On Mon, Feb 05, 2001 at 08:01:45PM +0530, [EMAIL PROTECTED] wrote:
>>
>> >It's the very essence of readahead that we wake up the earlier buffers
>> >as soon as they become available, without waiting for the later ones
>> >to complete, so we _need_ this multiple completion concept.
>>
>> I can understand this in principle, but when we have a single request
going
>> down to the device that actually fills in multiple buffers, do we get
>> notified (interrupted) by the device before all the data in that request
>> got transferred ?
>
>It depends on the device driver.  Different controllers will have
>different maximum transfer size.  For IDE, for example, we get wakeups
>all over the place.  For SCSI, it depends on how many scatter-gather
>entries the driver can push into a single on-the-wire request.  Exceed
>that limit and the driver is forced to open a new scsi mailbox, and
>you get independent completion signals for each such chunk.

I see. I remember Jens Axboe mentioning something like this with IDE.
So, in this case, you want every such chunk to check if it has completed
filling up a buffer and then trigger a wakeup on that ?
But, does this also mean that in such a case combining requests beyond this
limit doesn't really help ? (Reordering requests to get contiguity would
help of course in terms of seek times, I guess, but not merging beyond this
limit)

>> >Which is exactly why we have one kiobuf per higher-level buffer, and
>> >we chain together kiobufs when we need to for a long request, but we
>> >still get the independent completion notifiers.
>>
>> As I mentioned above, the alternative is to have the i/o completion
related
>> linkage information within the wakeup structures instead. That way, it
>> doesn't matter to the lower level driver what higher level structure we
>> have above (maybe buffer heads, may be page cache structures, may be
>> kiobufs). We only chain together memory descriptors for the buffers
during
>> the io.
>
>You forgot IO failures: it is essential, once the IO completes, to
>know exactly which higher-level structures completed successfully and
>which did not.  The low-level drivers have to have access to the
>independent completion notifications for this to work.
>
No, I didn't forget IO failures; just that I expect the wait structure
containing the wakeup function to be embedded in a cev structure that
contains a pointer to the wait_queue_head field in the higher level
structure. The rest is for the wakeup function to interpret (it can always
access the other fields in the higher level structure - just like
list_entry() does)

Later I realized that instead of having multiple wakeup functions queued on
the low level structure's wait queue, it's perhaps better to just sort of
turn the cev_wait structure upside down (the entry on the lower level
structure's queue should link to the parent entries instead).
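
Just to sketch the shape of that (again purely illustrative, all names
invented):

struct cev_parent {
 atomic_t pending;             /* sub-ios still outstanding */
 wait_queue_head_t *waitq;     /* wait queue head inside the higher level structure */
};

struct cev_child {
 wait_queue_t wait;            /* sits on the lower level structure's wait queue */
 struct cev_parent *parent;    /* links back up to the compound request */
 int status;                   /* per-sub-io result, so failures stay identifiable */
};

/* Wakeup path for one sub-io: record its outcome, and complete the parent
   only when the last child has finished. */
static void cev_child_done(struct cev_child *c, int status)
{
 c->status = status;
 if (atomic_dec_and_test(&c->parent->pending))
  wake_up(c->parent->waitq);
}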




-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

2001-02-12 Thread bsuparna


Going through all the discussions once again, and trying to look at this
just from the point of view of the basic requirements for data structures
and mechanisms that they imply.

1. Should have a data structure that represents a memory chain, which may
not be contiguous in physical memory, and which can be passed down as a
single unit all the way through to the lowest level drivers
 - e.g. for direct i/o to/from a contiguous virtual address range in
user space (without any intermediate copies)

(Networking and block i/o may require different optimizations in the design
of such a data structure, due to differences in the kind of patterns
expected, as is apparent from the zero-copy networking fragments vs raw i/o
kiobuf/kiovec patches. There are situations when such a data structure may
be passed between subsystems, as in the i2o example)

This data structure could be part of an I/O container.

2.  I/O containers may get split or merged as they pass through various
layers --- so any completion mechanism and i/o container design should be
able to account for both cases. At any point, a request could be
 - a collection of several higher level requests,
  or
 - one among several sub-requests of a single higher level request.
(Just as appropriate "clustering"  could happen at each level, appropriate
"splitting" may also take place depending on the situation. It may make
sense to delay splitting as far down the chain as possible in many
situations, where the higher level is only interested in the i/o in its
entirety and not in partial completion )
When caching/buffers are involved, sometimes the sub-requests of a single
higher level request may have individual completion requirements (even when
no merges were involved), because the sub-request buffers may be used to
service other requests alongside. With raw i/o that might not be the case.

3. It is desirable that layers which process the requests along the way
without splitting/merging, be able to pass along the same I/O container
without any duplication or cloning, and intercept async i/o completions for
post processing.

4. (Optional) It would be nice if different kinds of I/O containers or
buffer structures could be used at different levels, without having
explicit linkage fields (like bh --> page, for example) , and in a way that
intermediate drivers or layers can work transparently.

3 & 4 are more layering related items, which get a little specific, but
do 1 and 2 cover the general things we are looking for ?

Regards
Suparna

  Suparna Bhattacharya
  Systems Software Group, IBM Global Services, India
  E-mail : [EMAIL PROTECTED]
  Phone : 91-80-5267117, Extn : 2525


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://vger.kernel.org/lkml/