Re: [Lse-tech] Re: [PATCH for 2.5] preemptible kernel
>One question:
>isn't it the case that the alternative to using synchronize_kernel()
>is to protect the read side with explicit locks, which will themselves
>suppress preemption? If so, why not just suppress preemption on the read
>side in preemptible kernels, and thus gain the simpler implementation
>of synchronize_kernel()? You are not losing any preemption latency
>compared to a kernel that uses traditional locks, in fact, you should
>improve latency a bit since the lock operations are more expensive than
>are simple increments and decrements. As usual, what am I missing
>here? ;-)
>...
>...
>I still prefer suppressing preemption on the read side, though I
>suppose one could claim that this is only because I am -really-
>used to it. ;-)

Since this point has come up, I just wanted to mention that it may still be nice to be able to do without explicit locks on the read side. This is not so much for performance reasons (I agree with your assessment on that point) as for convenience/flexibility in the kinds of situations where this concept (i.e. synchronize_kernel or read-copy-update) could be used.

For example, consider situations where it is an executable code block that is being protected. The read side is essentially the execution of that code block, i.e. every entry/exit into the code block. This is perhaps the case with module unload races. Having to acquire a read lock explicitly before every entry point seems to reduce the simplicity of the solution, doesn't it?

This is also the case with kernel code patching which, I agree, may appear to be a rather unlikely application of this concept, but it could help handle races in multi-byte code patching on a running kernel, which is otherwise a rather difficult problem. In this case, the read side is totally unaware of the possibility of an updater modifying the code, so it isn't even possible for a read lock to be acquired explicitly (if we wish to have the flexibility of being able to patch any portion of the code).

I have been discussing this with Dipankar over the last week, so I realize that the above situations were perhaps not what these locking mechanisms were intended for, but I just thought I'd bring up this perspective.

As you've observed, with the approach of waiting for all preempted tasks to synchronize, the possibility of a task staying preempted for a long time could affect the latency of an update/synchronize (though it's hard for me to judge how likely that is). Besides, as Andi pointed out, there probably are a lot of situations where the readers are not preemptible anyway, so waiting for all preempted tasks may be superfluous. Given these possibilities, does it make sense to simply let the updater / synchronize_kernel specify an option indicating whether it should wait for preempted tasks or not?

Regards
Suparna

Suparna Bhattacharya
IBM Software Lab, India
E-mail : [EMAIL PROTECTED]
Phone : 91-80-5267117, Extn : 2525
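P.S. To make the comparison concrete, here is a minimal sketch of what I mean by doing without explicit read-side locks. This is illustrative only: preempt_disable()/preempt_enable() are the preemption-control primitives from the preemptible kernel patches, synchronize_kernel() is the primitive under discussion rather than an existing interface, and real code would also need the appropriate memory barriers.

    struct config {
            int value;
    };
    static struct config *cur_config;

    /* Read side: suppress preemption instead of taking a lock. */
    int read_config_value(void)
    {
            int val;

            preempt_disable();        /* a simple increment, no lock acquisition */
            val = cur_config->value;
            preempt_enable();
            return val;
    }

    /* Update side: publish a new copy, then wait until every task/CPU has
     * passed through a quiescent point before freeing the old copy. */
    void update_config(struct config *new_cfg)
    {
            struct config *old = cur_config;

            cur_config = new_cfg;     /* readers now see the new version */
            synchronize_kernel();     /* no reader can still hold 'old' */
            kfree(old);
    }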
Common Abstraction of Notification & Completion Handling Mechanisms - observations and potential RFC
I have been looking at various notification and operation completion processing mechanisms that we currently have in the kernel. (The "operation" is typically I/O, but could be something else too.) This comes about as a result of observing similar patterns in async i/o handling in filter filesystems, and then in layered block drivers like lvm and evms with kiobuf, and recalling a suggestion that Ben LaHaise had made about extending the wait queue interface to support callbacks. Come to think of it, we might observe similar patterns elsewhere, wherever some processing needs to be done on the completion of an operation or, in a more general sense, on the triggering of an event.

The pattern is something like this:

1. Post-process data: invoke callbacks for the layers above, in reverse order (layer 1 is the highest level, layer n the lowest), i.e. callbackn(argn) -> ... -> callback2(arg2) -> callback1(arg1). The sequence may get aborted temporarily at any level if required (e.g. for error correction).
2. Mark data as ready for use (e.g. unlock buffer/page, mark as up-to-date, etc). We could perhaps think of this as a level-0 callback.
3. Notify registered consumers:
   - wake up synchronous waiters (typically wait_queue based)
   - signal async consumers (SIGIO)
   (hereafter any further processing happens in the context of the consumer)

We have all the separate mechanisms that are needed to achieve this (I wonder if we have too many, and if we have some duplication of logic / data structure patterns in certain cases, just to handle slight distinctions in flavour). Here are some of them:

1. io completion callback routines + private data embedded in specific i/o structures -- in bh, kiobuf (for example) (the sock structure too?)
2. task queues that can be used for triggering a list of callbacks, perhaps
3. wait queues for registering synchronous waiters
4. fasync helper routines for registering async waiters to be signalled (SIGIO)

Other places where we have a callback + arg pattern:
- timer callback + arg (especially for timer events)
- softirq handlers?

So, if we wanted to have a generic abstraction for the mentioned pattern, it could be done using a collection of the following (a rough sketch follows below):

- something like a task queue for queueing up multiple callbacks to be invoked in LIFO order, with some extra functionality to break out in case a callback returns a failure
- a wait queue for synchronous waiters
- an fasync pointer for asynchronous notification requesters
- a status field (to check on completion status)
- a private data pointer (to help store persistent state; such state may also be required for operation cancellation)
- a zeroth-level callback registered in the queue during initialization, to mark the status as completed and then notify synchronous and asynchronous waiters

Now, if there are multiple related event structures -- like compound events (compound i/os, e.g. multiple bh's compounded to a page or kiobuf, sub-kiobufs compounded to a compound kiobuf, etc) -- then there is a requirement of triggering a similar sequence on that compound event. I have still not decided at what stage this should happen and how. Another item to think about is the operation cancellation path.

One question is whether an extension to the wait queue is indeed appropriate for the above, or whether it should be a different abstraction in itself. I know this needs further thinking through, and definitely some more detailing, but I'd like to hear some feedback on how it sounds.
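As a strawman, the collection above might look something like the sketch below. All of the names here are made up purely for illustration; this is not an existing interface, just the shape of the thing:

    /* One callback layer; a nonzero return from func aborts the chain,
     * e.g. so a layer can retry or schedule error correction first. */
    struct completion_callback {
            struct completion_callback *next;     /* LIFO: last registered runs first */
            int (*func)(void *arg);
            void *arg;
    };

    /* The aggregated completion/notification object described above. */
    struct event_notify {
            struct completion_callback *chain;    /* layered post-processing callbacks */
            wait_queue_head_t wait;               /* synchronous waiters */
            struct fasync_struct *fasync;         /* async (SIGIO) consumers */
            int status;                           /* completion / error status */
            void *private;                        /* persistent state, cancellation info */
    };

    /* Completion path: run the callback chain, then do the "level 0" work
     * of marking the operation done and notifying consumers. */
    static void event_complete(struct event_notify *ev, int status)
    {
            struct completion_callback *cb;

            for (cb = ev->chain; cb; cb = cb->next)
                    if (cb->func(cb->arg))
                            return;               /* chain aborted at this layer */

            ev->status = status;                  /* mark data ready for use */
            wake_up(&ev->wait);                   /* synchronous waiters */
            if (ev->fasync)
                    kill_fasync(&ev->fasync, SIGIO, POLL_IN);   /* async consumers */
    }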
Besides, I don't know if anyone is already working on something like this. Does it even make sense to attempt this?

Regards
Suparna

Suparna Bhattacharya
Systems Software Group, IBM Global Services, India
E-mail : [EMAIL PROTECTED]
Phone : 91-80-5267117, Extn : 2525
[ANNOUNCE] Dynamic Probes 1.3 released
We've just released version 1.3 of the Dynamic Probes facility. This has 2.4.0 and 2.2.18 support and some bug fixes, including Andi Kleen's suggestions for fixing the races in handling of swapped-out COW pages. For more information on DProbes see the DProbes homepage at http://oss.software.ibm.com/developer/opensource/linux/projects/dprobes/

Changes in this version:
- DProbes for kernel versions 2.2.18 and 2.4.0.
- Fix for race condition in COW page handling, and some other corrections in the detection of correct offsets in COW pages.
- Removed the mmap_sem trylock from trap3 handling. The correct thing to do is to use the page_table_lock.
- Probe on BOUND instruction now logs data.
- Probes can now be inserted into modules that are getting initialized.
- kernel/time.c and arch/i386/kernel/time.c treated as exclude regions.
- Architecture-specific FPU code moved to the arch-specific dprobes file.
- Minor bug fix in the merge code which merges probes that are excluded.
- Exit to crashdump supported for 2.2.x kernels also.

We are no longer updating the patches for 2.2.12 and 2.2.16 on the site. If you require patches for these kernel versions, contact us at [EMAIL PROTECTED]

Race condition fixes in handling COW pages:

Some updates on the race condition fixes for COW page handling, since the discussions that happened last, for those who might have been following the thread:

We eventually decided to drop the idea of trying to achieve on-demand probe insertion for swapped-out COW pages by having a vm_ops->swapin() routine that we could take over. The reason was that we realized that the vm_ops->swapin() replacement approach, while good for insert, is not suitable for remove, since we might then need to keep the probe records for a removed module around until the swapped-out pages eventually get cleaned of their probes. (This may have been feasible without too much work, because we already have the logic for quietly removing probes, but it didn't seem like a good idea to have ghost records around in this way - it makes things harder to maintain/debug.)

So we ended up implementing Andi Kleen's original suggestion of a two-pass approach instead. In the first pass, which takes place with the i_shared lock held, we build a list of swapped-out page locators (mm, addr), taking care to increment the mm reference count, and in the second pass, which happens with the spinlocks released, we actually bring in the page and cross-check that it is the same one we meant to bring in. In the second pass, we hold the mmap_sem while bringing the page in. With David Miller's lock hierarchy fixes, the i_shared_lock is now always higher in the hierarchy than the page_table_lock, so we are OK there.

I hope we haven't missed anything. If anyone spots any gotchas or slips, or some other races that still exist, do let us know. We've done some simple testing to try to exercise a few scenarios that we could simulate easily, but that's about it.

Regards
Suparna

Suparna Bhattacharya
Systems Software Group, IBM Global Services, India
E-mail : [EMAIL PROTECTED]
Phone : 91-80-5267117, Extn : 2525
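P.S. For anyone following the COW race discussion, the two-pass structure is roughly as sketched below. This is only an outline of the shape of the code with hypothetical helper names (insert_probe_at() stands in for the real probe insertion/verification logic); it is not the actual DProbes source.

    /* Pass 1 (under the i_shared spinlock): just record (mm, addr) pairs
     * for vmas whose target page is currently swapped out, pinning each mm
     * so it cannot go away once the spinlock is dropped. */
    struct cow_target {
            struct list_head list;
            struct mm_struct *mm;
            unsigned long addr;
    };

    /* Pass 2 (all spinlocks released): fault each page in under mmap_sem
     * and re-check that it is still the page we meant to patch before
     * inserting the probe. */
    static void process_cow_targets(struct list_head *targets)
    {
            while (!list_empty(targets)) {
                    struct cow_target *t =
                            list_entry(targets->next, struct cow_target, list);

                    down(&t->mm->mmap_sem);
                    insert_probe_at(t->mm, t->addr);   /* fault in + verify + patch */
                    up(&t->mm->mmap_sem);

                    mmput(t->mm);                      /* drop the pass-1 reference */
                    list_del(&t->list);
                    kfree(t);
            }
    }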
Re: [Kiobuf-io-devel] Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
>Ok. Then we need an additional more or less generic object that is used for
>passing in a rw_kiovec file operation (and we really want that for many kinds
>of IO). It should mostly be used for communicating to the high-level driver.
>
>/*
> * the name is just plain stupid, but that shouldn't matter
> */
>struct vfs_kiovec {
>        struct kiovec *iov;
>
>        /* private data, mostly for the callback */
>        void *private;
>
>        /* completion callback */
>        void (*end_io) (struct vfs_kiovec *);
>        wait_queue_head_t wait_queue;
>};
>
>Christoph

Shouldn't we have an error / status field too?

Suparna Bhattacharya
Systems Software Group, IBM Global Services, India
E-mail : [EMAIL PROTECTED]
Phone : 91-80-5267117, Extn : 2525
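P.S. In other words, something like the sketch below - just Christoph's structure as quoted above with the field in question added, nothing more:

    struct vfs_kiovec {
            struct kiovec *iov;

            /* completion / error status of the operation */
            int status;

            /* private data, mostly for the callback */
            void *private;

            /* completion callback */
            void (*end_io) (struct vfs_kiovec *);
            wait_queue_head_t wait_queue;
    };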
Re: [ANNOUNCE] DProbes 1.1
Hello Andi,

Thanks for taking the trouble to go through our code in such detail and thinking through the race conditions in dp_vaddr_to_page, which I had sort of shut my eyes to and postponed for a while, because it didn't seem very easy to close all the loopholes in an elegant way. I need to understand the mm locking hierarchies and appropriate usage more thoroughly to do a complete job. I have labelled the key points you have brought up in your note below as (a), (b), (c) for ease of reference.

For (a), your suggestion of a two-pass approach is, I guess, feasible, but I wish there were a simpler way to do it. Actually, I don't even really like the idea of forcing the swapped-out page back in, which we are having to do right now - it would have been nicer if there were a swapin() routine in the vma ops that we could have used for on-demand probe insertion, just the way we use the inode address space readpage() right now for discardable pages, but maybe that's asking for too much :-) [Could vma type based swapin() logic be a useful abstraction in general, aside from dprobes?]. Anyway, let me think over this for a while ...

We don't quite understand (b), though. There is indeed a race due to our not holding the page given to us by handle_mm_fault while we try to access it, and we need to fix that of course, but that doesn't sound exactly like what you mention here. We do have handle_mm_fault being called under the mm semaphore. Could you explain the deadlock situation that you have in mind?

We've taken (c) as very reasonable usability feedback. Now we would be displaying the opcodes as part of the dprobes query results. Hope that would help.

It's good to hear that you've found dprobes useful and that you ported it to 2.4 yourself. Do send us any comments, suggestions, improvements etc. that come to mind as you use it or go through the code.

Regards
Suparna

Suparna Bhattacharya
Systems Software Group, IBM Global Services, India
E-mail : [EMAIL PROTECTED]
Phone : 91-80-5267117, Extn : 2525

Richard J Moore@IBMGB 10/19/2000 01:27 AM
To: DProbes Team India
cc:
From: Richard J Moore/UK/IBM@IBMGB
Subject: Re: [ANNOUNCE] DProbes 1.1

Richard Moore - RAS Project Lead - Linux Technology Centre (PISC).
http://oss.software.ibm.com/developerworks/opensource/linux
Office: (+44) (0)1962-817072, Mobile: (+44) (0)7768-298183
IBM UK Ltd, MP135 Galileo Centre, Hursley Park, Winchester, SO21 2JN, UK

-- Forwarded by Richard J Moore/UK/IBM on 18/10/2000 20:57 ---

Andi Kleen <[EMAIL PROTECTED]> on 18/10/2000 18:38:13
Please respond to Andi Kleen <[EMAIL PROTECTED]>
To: Richard J Moore/UK/IBM@IBMGB
cc: [EMAIL PROTECTED]
Subject: Re: [ANNOUNCE] DProbes 1.1

Hallo Richard,

On Wed, Oct 18, 2000 at 10:44:11AM +0100, [EMAIL PROTECTED] wrote:
>
> We've released v1.1 of DProbes - details and code is on the DProbes web
> page.
>
> the enhancements include:
>
> - DProbes for kernel version 2.4.0-test7 is now available.

First, thanks for this nice work. I ported the older 1.0 dprobes to 2.4 a few weeks ago for my own use. It is very useful for kernel work. Unfortunately the user space support still had one ugly race which I didn't fix because it required too extensive changes for my simple port (and it didn't concern me because I only use kernel level breakpoints). I see the problems are still in 1.1.

(a) The problem is the vma loop in process_recs_in_cow_pages over the vmas of an address_space. In 2.4 the only way to do that safely is to hold the address_space spinlock.
Unfortunately you cannot take the semaphore or execute handle_mm_fault while holding the spinlock, because they could sleep. The only way I think to do it relatively race free, without adding locks to the core VM, is to do it in two passes (first collect all the mms with mmget() and their addresses in a separate list with the spinlock held, and then process the list with the spinlock released).

(b) Then dp_vaddr_to_page has another race. It cannot hold the mm semaphore because that would deadlock with handle_mm_fault. Not holding it means, though, that the page could be swapped out again after you faulted it in, before you have a chance to access it. It probably can be done with a loop that checks and locks the page atomically (e.g. using cmpxchg) and retries the handle_mm_fault as needed.

There may be more races I missed; the 2.4 SMP MM locking hierarchy is unfortunately not very flexible and makes things like what dprobes wants to do relatively hard.

(c) Another change I added and which I found useful is a printk to show the opcode of mismatched probes (this way wrong offsets in the probe definitions are easier to fix).

-Andi
Re: [Dprobes] Re: [ANNOUNCE] DProbes 1.1: proposing a vm_ops->swapin() abstraction
Andi,

Thanks. Then I'll work it out in more detail and propose it on linux-mm as you've suggested. Maybe I should also try to think of another example where it might be useful. Anything that comes to mind?

Regards
Suparna

Suparna Bhattacharya
Systems Software Group, IBM Global Services, India
E-mail : [EMAIL PROTECTED]
Phone : 91-80-5267117, Extn : 2525

Andi Kleen <[EMAIL PROTECTED]> on 10/24/2000 07:51:39 PM
Please respond to Andi Kleen <[EMAIL PROTECTED]>
To: Suparna Bhattacharya/India/IBM@IBMIN
cc: [EMAIL PROTECTED], Richard J Moore/UK/IBM@IBMGB, [EMAIL PROTECTED], [EMAIL PROTECTED]
Subject: [Dprobes] Re: [ANNOUNCE] DProbes 1.1

Hallo,

On Tue, Oct 24, 2000 at 07:37:08PM +0530, [EMAIL PROTECTED] wrote:
> For (a), your suggestion of a two pass approach is I guess feasible, but I
> wish there were a simpler way to do it.
> Actually I don't even really like the idea of forcing the swapped out page
> back in, which we are having to do right now - it would have been nicer
> if there were a swapin() routine in the vma ops that we could have used for
> on-demand probe insertion, just the way we use inode address space
> readpage() right now for discardable pages, but maybe that's asking for too
> much :-) [Could vma type based swapin() logic be a useful abstraction in
> general, aside from dprobes ?]

I think it would be. You could propose it on the linux-mm mailing list and ask Linus what he thinks. I agree that it would be much nicer to do it this way.

> We don't quite understand (b), though. There is indeed a race due to our
> not holding the page given to us by handle_mm_fault, while we try to access
> it, and we need to fix that of course, but that doesn't sound exactly like
> what you mention here. We do have handle_mm_fault being called under the mm
> semaphore. Could you explain the deadlock situation that you have in mind ?

It does not exist, sorry. I was misremembering the lock hierarchy at that place when I wrote the mail and should have double checked it.

-Andi
Oddness in i_shared_lock and page_table_lock nesting hierarchies?
The vma list lock can nest with i_shared_lock, as per Kanoj Sarcar's note on mem-mgmt locks (Documentation/vm/locking), and currently vma list lock == mm->page_table_lock. But there appears to be some inconsistency in the hierarchy of these two locks. (By vma list lock I mean the vmlist_access/modify_lock(s).)

Looking at the mmap code, it appears that the vma list lock, i.e. the page_table_lock right now, is to be acquired first (e.g. insert_vm_struct, which acquires the i_shared_lock internally, is called under the page_table_lock/vma list lock). Elsewhere, in madvise too, I see a similar hierarchy. In the unmap code, care has been taken not to have these locks acquired simultaneously.

However, in the vmtruncate code, it looks like the hierarchy is reversed. There, the i_shared_lock is acquired in order to traverse the i_mmap list, and inside the loop it calls zap_page_range, which acquires the page_table_lock. This is odd. Isn't there a possibility of a deadlock if mmap and truncation for the same file happen simultaneously (on an SMP)? I'm wondering if this could be a side effect of the doubling up of the page_table_lock as a vma list lock? Or have I missed something? [I have checked up to 2.4-test10-pre5.]

I had put this query up on linux-mm as part of a much larger mail, but didn't get any response yet, so I thought of putting up a more focused query this time.

Regards
Suparna

Suparna Bhattacharya
Systems Software Group, IBM Global Services, India
E-mail : [EMAIL PROTECTED]
Phone : 91-80-5267117, Extn : 2525

>List: linux-mm
>Subject: Re: Updated Linux 2.4 Status/TODO List (from the ALS show)
>From: Kanoj Sarcar <[EMAIL PROTECTED]>
>Date: 2000-10-13 18:19:06
>
>> On Thu, 12 Oct 2000, David S. Miller wrote:
>> >
>> > page_table_lock is supposed to protect normal page table activity (like
>> > what's done in page fault handler) from swapping out.
>> > However, grabbing this lock in swap-out code is completely missing!
>> >
>> > Audrey, vmlist_access_{un,}lock == unlocking/locking page_table_lock.
>>
>> Yeah, it's an easy mistake to make.
>>
>> I've made it myself - grepping for page_table_lock and coming up empty in
>> places where I expected it to be.
>>
>> In fact, if somebody sends me patches to remove the "vmlist_access_lock()"
>> stuff completely, and replace them with explicit page_table_lock things,
>> I'll apply it pretty much immediately. I don't like information hiding,
>> and right now that's the only thing that the vmlist_access_lock() stuff is
>> doing.
>
>Linus,
>
>I came up with the vmlist_access_lock/vmlist_modify_lock names early in
>2.3. The reasoning behind that was that in most places the "vmlist
>lock" was being taken to protect the vmlist chain, vma_t fields or
>mm_struct fields. The fact that implementation wise this lock could be
>the same as page_table_lock was a good idea that you suggested.
>
>Nevertheless, the name was chosen to indicate what type of things it was
>guarding. For example, in the future, you might actually have a different
>(possibly sleeping) lock to guard the vma chain etc, but still have a
>spin lock for the page_table_lock (No, I don't want to be drawn into a
>discussion of why this might be needed right now). Some of this is
>mentioned in Documentation/vm/locking.
>
>Just thought I would mention, in case you don't recollect some of this
>history. Of course, I understand the "information hiding" part.
>
>Kanoj
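P.S. To spell out the inversion I am worried about (not a patch, just a summary of the two code paths as I read them):

    /*
     * mmap() path:                         vmtruncate() path:
     *
     *   spin_lock(page_table_lock)           spin_lock(i_shared_lock)
     *     insert_vm_struct()                   loop over the i_mmap list
     *       spin_lock(i_shared_lock)             zap_page_range()
     *                                              spin_lock(page_table_lock)
     *
     * If one CPU is in the mmap path holding the page_table_lock and
     * waiting for the i_shared_lock, while another CPU is in vmtruncate
     * holding the i_shared_lock and waiting for the page_table_lock,
     * neither can proceed.
     */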
[ANNOUNCE] Dynamic Probes 2.0 released
Version 2.0 of the Dynamic Probes facility is now available at http://oss.software.ibm.com/developerworks/opensource/linux/projects/dprobes

This release includes a new feature called "watchpoint probes", which exploits the hardware watchpoint capabilities of the underlying hardware architecture to allow probing specific types of storage accesses. Watchpoint probes are specified by the location (virtual address) and the type of access (rw/w/x) on which the probe is fired. This capability enables fine-grained storage profiling when DProbes is used in conjunction with LTT from Opersys, as it permits tracing of memory read and write accesses at specific locations.

Changes in this version:
- New watchpoint probes feature allows probes to be fired on specific types of memory accesses (execute | write | read-or-write | io), implemented using the debug registers available on Intel x86 processors.
- New RPN instructions: divide/remainder and propagate bit/zero instructions.
- Kernel data can now be referenced symbolically in the probe program files.
- Memory logging instructions like "log mrf" now write the fault location in the log buffer in case a page fault occurs when accessing the concerned memory area.
- The log can now optionally be saved, using the new keyword logonfault=yes, even if the probed instruction faults.
- Bug fixes in the interpreter - validity check for the selector in segmented-to-flat address conversion in the case where the selector references the GDT.
- The log memory and log ascii functions now don't log if logmax was set to zero.

About DProbes:

DProbes is a generic and pervasive non-interactive system debugging facility designed to operate under the most extreme software conditions, such as debugging a deep-rooted operating system problem in a live environment. It allows the insertion of fully automated breakpoints, or probepoints, anywhere in system and user space, along with a set of probe instructions that are interpreted when a specific probe fires. These instructions allow memory and CPU registers to be examined and altered using conditional logic, and a log to be generated. DProbes is currently available only on the IA32 platform.

An interesting aspect of DProbes is that it allows the probe program to conditionally trigger any external debugging facility that registers for this purpose, e.g. crash dump or kernel debugger. DProbes interfaces with Opersys's LTT to provide a Universal Dynamic Trace facility.

For more information on DProbes please visit the DProbes homepage at http://oss.software.ibm.com/developerworks/opensource/linux/projects/dprobes/

Suparna Bhattacharya
Linux Technology Center
IBM Software Lab, India
E-mail : [EMAIL PROTECTED]
Phone : 91-80-5267117, Extn : 2525
Core dumps for threads: Request for your patch (hack)
>(I have a complementary hack that will dump the stacks of all the
>rest of the threads as well (though it's a good trick to get gdb to
>interpret this). Available upon request.)

Hello Adam,

Could I take a look at your patch?

Regards
Suparna

Suparna Bhattacharya
IBM Software Lab, India
E-mail : [EMAIL PROTECTED]
Phone : 91-80-5267117, Extn : 2525
Re: (struct dentry *)->vfsmnt;
>Actually, I'm pretty sure you _never_ need to exportvg in order to have
>it work on another system. That's one of the great things about AIX LVM,
>because it means you can move a VG to another system after a hardware
>problem, and not have any problems importing it (journaled fs also helps).
>AFAIK, the only thing exportvg does is remove VG information from the
>ODM and /etc/filesystems.

Yes, that's correct as far as I know too. The VGDA and LVCB contain all the information required for import even without an exportvg.

>I suppose it is possible that because AIX is so tied into the ODM and
>SMIT, that it updates the VGDA mountpoint info whenever a filesystem
>mountpoint is changed, but this will _never_ work on Linux because of
>different tools versions, distributions, etc. Also, it would mean on
>AIX that anyone editing /etc/filesystems might have a broken system at
>vgimport time (wouldn't be the first time that not using ODM/SMIT caused
>such a problem).

Yes, you can think of crfs (or chfs) as a composite command that handles this (writing to the LVCB). These are more like administrative/setup/configuration commands - one-time, or occasional, system configuration changes. On the other hand, a mount doesn't cause a persistent configuration change. You can issue a mount even if an entry doesn't exist in /etc/filesystems.

>> ... I do think that the LVM is a reasonable place to store this kind of
>> information.
>
>Yes, even though it would tie the user into using a specific version of
>mount(), I suppose it is a better solution than storing it inside the
>filesystem. It will work with non-ext2 filesystems, and it also allows
>you to store more information than simply the mountpoint (e.g. mount
>options, dump + fsck info, etc). In the end, I will probably just
>save the whole /etc/fstab line into the LV header somewhere, and extract
>it at importvg time (possibly with modifications for vgname and mountpoint).
>
>Cheers, Andreas

Is mount the right time to do this? A mount happens on every boot of the system. And then, one can issue a mount by explicitly specifying all the parameters, without having an entry in fstab. [Doesn't that also mean that you have a possibility of inconsistency even here?]
Re: (struct dentry *)->vfsmnt;
>Because this is totally filesystem specific - why put extra knowledge
>of filesystem internals into mount? I personally don't want it writing
>into the ext2 or ext3 superblock. How can it possibly know what to do,
>without embedding a lot of knowledge there? Yes, mount(8) can _read_
>the UUID and LABEL for ext2 filesystems, but I would rather not have it
>_write_ into the superblock. Also, InterMezzo and SnapFS have the same
>on-disk format as ext2, but would mount(8) know that?
>
>There are other filesystems (at least IBM JFS) that could also take
>advantage of this feature, should we make mount(8) have code for each
>and every filesystem? Yuck. Sort of ruins the whole modularity thing.
>Yes, I know mount(8) does funny stuff for SMB and NFS, but that is a
>reason to _not_ put more filesystem-specific information into mount(8).

Since you've brought up this point: I have wondered why Linux doesn't seem to yet have the option of a generic user-space, filesystem-type-specific mount helper command. I recall having seen code in the mount(8) implementation to call a mount.<fstype> helper, but it's still under an ifdef, isn't it, except for smb or ncp perhaps? (Hope I'm not out of date on this.) Having something like that lets one streamline userland filesystem-specific stuff like this, without having the generic part of mount(8) know about it.

For example, in AIX the association between the fstype and the program for mount helpers (and also for filesystem helpers for things like mkfs, fsck, etc) is configured in /etc/vfs, while SUN and HP look for them under particular directory locations (by fstype name). Actually, it'd be good to have this in such a way that if a specific helper doesn't exist, default mount processing continues. This avoids the extra work of writing such helpers for every new filesystem, unless we need specialized behaviour there.

Suparna Bhattacharya
IBM Software Lab, India
E-mail : [EMAIL PROTECTED]
Phone : 91-80-5267117, Extn : 2525
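P.S. Just to illustrate the kind of dispatch I have in mind, here is a toy user-space sketch; the /sbin/mount.<fstype> helper naming is only an example convention, not something the generic code would have to hard-wire per filesystem:

    #include <stdio.h>
    #include <unistd.h>

    /* Try an fstype-specific helper; return 0 if none is installed, so the
     * caller can fall back to the generic mount(2) processing. */
    static int try_mount_helper(const char *fstype, char *const argv[])
    {
            char helper[64];

            snprintf(helper, sizeof(helper), "/sbin/mount.%s", fstype);
            if (access(helper, X_OK) != 0)
                    return 0;            /* no helper: continue default path */

            /* A real mount(8) would fork and wait; execv keeps the sketch short. */
            execv(helper, argv);
            perror("execv");             /* only reached if exec fails */
            return -1;
    }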
RFC: Kernel mechanism: Compound event wait/notify + callback chains
Comments, suggestions, advice, feedback solicited!

If this seems like something that might (after some refinements) be a useful abstraction to have, then I need some help in straightening out the design. I am not very satisfied with it in its current form.

A Kernel Mechanism for Compound Event Wait/Notify with Callback Chaining

Objective:

This is a proposal for a kernel mechanism for event notification that can provide for:
(i) triggering of multiple actions via callback chains (the way notifier lists do) - some of these actions could involve waking up synchronous waiters and asynchronous signalling
(ii) compound events: a compound event is an event that is essentially an aggregation of several sub-events, each of which in turn may have its own sub-events [this should extend to more than 2 levels, unlike semaphore groups or poll wait tables]
(iii) a single abstraction that can serve as an IO completion indicator

The need for such a structure originally came about in the context of layered drivers and layered filesystem implementations, where an io structure may need to pass through several layers of post-processing as part of actual i/o completion, which must be asynchronous. Ben LaHaise had some thoughts about extending the wait queue interface to allow callbacks, thus building a generic way to address this requirement for any kind of io structure. Here is an attempt to take that idea further, keeping in mind certain additional requirements (including buffer splitting and io cancellation) and various potential usage scenarios, and also considering the existing synchronization/notification/callback primitives in the Linux kernel today. I'm calling this abstraction a "compound_event", or "cev" for short, right now (for want of a better name).

How/where would/could this be used?

A cev would typically be associated with an object, e.g. an io structure or descriptor. Either have the cev embedded in the io structure, or just use a private pointer field in the cev to link the cev to the object.

1. In io/mem structures - buffer/kiobuf/page_cache - especially places where layered callbacks and compound IOs may occur (e.g. encryption filter filesystems, encryption drivers, lvm, raid, evms type situations). [Need to explore the possibility of using it in n/w i/o structures as well.]

2. As a basic synchronization primitive to trigger action when multiple pre-conditions are involved. Could potentially be used as an alternative way to implement select/poll and even semaphore groups. The key advantage is that multiple actions/checks can be invoked without involving a context switch, and that sub-events can be aborted top-down through multiple levels of aggregation (using the cancellation feature).

3. A primitive to trigger a sequence of actions on an event, and to wake up waiters on completion of the sequence. (e.g. a timer could be associated with a cev to trigger a chain of actions and a notification of completion of all of these to multiple interested parties)

Some reasons for using such a primitive rather than the current approach of adding end_io/private/wait fields directly within buffer heads or vfs_kiovecs:

1. Uniformity of structure at different levels; may also be helpful in terms of statistics collection/metering. Keeps the completion linkage mechanism external to the specific io structures.

2. Callback chaining makes layered systems/interceptors easier to add on, as no sub-structures need to be created in cases where splitting/merging isn't necessary.
3. The ability to abort the callback chain for the time being, based on the return value of a callback, has some interesting consequences: it makes it possible to support error recovery/retries of failed sub-ios, and also a networking-stack kind of layering architecture involving queues at each level with delayed callbacks, rather than a direct callback invocation.

Proposed Design

"Perfection in design is achieved not when there is nothing more to add, but when there is nothing left to take away."

So, not there yet! Anyway, after a few iterations to try to bring this down, and to keep it efficient in the simplest usage scenarios while at the same time supporting infrastructure for all the intended functionality, this is how it stands (some fields are there for search efficiency reasons when sub-events are involved, though not actually essential). I'm not too happy with this design in its current form. Need some suggestions to improve this.

struct compound_event {
        /* 1. Basic infrastructure for simplest cases */
        void (*done)();
        unsigned i
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains
Ben,

This indeed looks neat and simple! I'd avoided touching the wait queue structure as I suspected that you might already have something like this in place :-) and was hoping that you'd see this posting and comment.

I agree entirely that it makes sense to have chaining of events built over simple minimalist primitives. That's what was making me uncomfortable with the cev design I had. So now I'm thinking how to do this using the wait queue extension you have. Some things to consider:

- Since non-exclusive waiters are always added to the head of the queue (unless we use a tq in a wtd kind of structure), ordering of layered callbacks might still be a problem. (e.g. with an encryption filter fs, we want the decrypt callback to run before any waiter gets woken up, irrespective of whether the wait was issued before or after the decrypt callback was added by the filter layer)

- The wait_queue_func gets only a pointer to the wait structure as an argument, with no other means to pass any state about the sub-event that caused it (could that be a problem with event chaining ...? every encapsulating structure will have to maintain a pointer to the related sub-event ...?)

Regards
Suparna

Suparna Bhattacharya
Systems Software Group, IBM Global Services, India
E-mail : [EMAIL PROTECTED]
Phone : 91-80-5267117, Extn : 2525

Ben LaHaise <[EMAIL PROTECTED]> on 01/30/2001 10:59:46 AM
Please respond to Ben LaHaise <[EMAIL PROTECTED]>
To: Suparna Bhattacharya/India/IBM@IBMIN
cc: [EMAIL PROTECTED], [EMAIL PROTECTED]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

On Tue, 30 Jan 2001 [EMAIL PROTECTED] wrote:
>
> Comments, suggestions, advise, feedback solicited !
>
> If this seems like something that might (after some refinements) be a
> useful abstraction to have, then I need some help in straightening out the
> design. I am not very satisfied with it in its current form.

Here's my first bit of feedback from the point of "this is what my code currently does and why".

The waitqueue extension below is a minimalist approach for providing kernel support for fully asynchronous io. The basic idea is that a function pointer is added to the wait queue structure that is called during wake_up on a wait queue head. (The patch below also includes support for exclusive lifo wakeups, which isn't crucial/perfect, but just happened to be part of the code.) No other data is added to the wait queue structure. Rather, users are expected to make use of it by embedding the wait queue structure within their own data structure that contains all needed info for running the state machine.
Here's a snippet of code which demonstrates a non-blocking lock of a page cache page:

struct worktodo {
        wait_queue_t            wait;
        struct tq_struct        tq;
        void                    *data;
};

static void __wtd_lock_page_waiter(wait_queue_t *wait)
{
        struct worktodo *wtd = (struct worktodo *)wait;
        struct page *page = (struct page *)wtd->data;

        if (!TryLockPage(page)) {
                __remove_wait_queue(&page->wait, &wtd->wait);
                wtd_queue(wtd);
        } else {
                schedule_task(&run_disk_tq);
        }
}

void wtd_lock_page(struct worktodo *wtd, struct page *page)
{
        if (TryLockPage(page)) {
                int raced = 0;
                wtd->data = page;
                init_waitqueue_func_entry(&wtd->wait, __wtd_lock_page_waiter);
                add_wait_queue_cond(&page->wait, &wtd->wait,
                                    TryLockPage(page), raced = 1);
                if (!raced) {
                        run_task_queue(&tq_disk);
                        return;
                }
        }
        wtd->tq.routine(wtd->tq.data);
}

The use of wakeup functions is also useful for waking a specific reader or writer in the rw_sems, making semaphores avoid spurious wakeups, etc.

I suspect that chaining of events should be built on top of the primitives, which should be kept as simple as possible. Comments?

-ben

diff -urN v2.4.1pre10/include/linux/mm.h work/include/linux/mm.h
--- v2.4.1pre10/include/linux/mm.h      Fri Jan 26 19:03:05 2001
+++ work/include/linux/mm.h     Fri Jan 26 19:14:07 2001
@@ -198,10 +198,11 @@
  */
 #define UnlockPage(page)       do { \
                smp_mb__before_clear_bit(); \
+               if (!test_bit(PG_locked, &(page)->flags)) { printk("last: %p\n", (page)->last_unlock); BUG(); } \
+               (page)->last_unlock = current_text_addr(); \
                if (!test_and_clear_bit(PG_locked, &(page)->flags)) BUG(); \
                smp_mb__after_clear_bit(); \
-               if (waitqueue_active(&page->wait)) \
-                       wake_up(&page->wait); \
+               wake_up(&page->wait); \
        } while (0)
 #define PageError(page)        test_bit(PG_error, &(page)->flags)
 #def
Re: [Kiobuf-io-devel] Re: RFC: Kernel mechanism: Compound event wait/notify + callback chains
Matthew,

Thanks for mentioning this. I didn't know about it earlier. I've been going through the 4/00 kqueue patch on FreeBSD ...

Regards
Suparna

Suparna Bhattacharya
Systems Software Group, IBM Global Services, India
E-mail : [EMAIL PROTECTED]
Phone : 91-80-5267117, Extn : 2525

Matthew Jacob <[EMAIL PROTECTED]> on 01/30/2001 12:08:48 PM
Please respond to [EMAIL PROTECTED]
To: Suparna Bhattacharya/India/IBM@IBMIN
cc: [EMAIL PROTECTED], [EMAIL PROTECTED]
Subject: [Kiobuf-io-devel] Re: RFC: Kernel mechanism: Compound event wait/notify + callback chains

Why don't you look at Kqueues (from FreeBSD)?
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains
>The waitqueue extension below is a minimalist approach for providing
>kernel support for fully asynchronous io. The basic idea is that a
>function pointer is added to the wait queue structure that is called
>during wake_up on a wait queue head. (The patch below also includes
>support for exclusive lifo wakeups, which isn't crucial/perfect, but just
>happened to be part of the code.) No other data is added to the wait
>queue structure. Rather, users are expected to make use of it by
>embedding the wait queue structure within their own data structure that
>contains all needed info for running the state machine.
>
>I suspect that chaining of events should be built on top of the
>primitives, which should be kept as simple as possible. Comments?

Do the following modifications to your wait queue extension sound reasonable?

1. Change add_wait_queue to add elements to the end of the queue (fifo, by default) and instead have an add_wait_queue_lifo() routine that adds to the head of the queue. [This will help avoid the problem of waiters getting woken up before LIFO wakeup functions have run, just because the wait happened to have been issued after the LIFO callbacks were registered, for example while an IO is going on.] Or is there a reason why add_wait_queue adds elements to the head by default?

2. Pass the wait_queue_head pointer as a parameter to the wakeup function (in addition to the wait queue entry pointer). [This will make it easier for the wakeup function to access the structure in which the wait queue is embedded, i.e. the object which the wait queue is associated with. Without this, we might have to store a pointer to this object in each element linked in the wait queue. This never was a problem with sleeping waiters, because a reference to the object being waited for would have been on the waiter's stack/context, but with wakeup functions there is no such context.]

3. Have __wake_up_common break out of the loop if the wakeup function returns 1 (or some other value). [This makes it possible to abort the loop based on conditional logic in the wakeup function.]

Regards
Suparna

Suparna Bhattacharya
Systems Software Group, IBM Global Services, India
E-mail : [EMAIL PROTECTED]
Phone : 91-80-5267117, Extn : 2525
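P.S. In interface terms, the three changes would look roughly like this (a sketch of the proposal only, not existing code):

    /* 1. FIFO by default; an explicit LIFO variant for layered callbacks */
    void add_wait_queue(wait_queue_head_t *head, wait_queue_t *wait);      /* add at tail */
    void add_wait_queue_lifo(wait_queue_head_t *head, wait_queue_t *wait); /* add at head */

    /* 2. The wakeup function also receives the wait queue head, so it can
     *    locate the object the queue is embedded in without storing an
     *    extra pointer in every wait queue entry.
     * 3. A nonzero return value asks __wake_up_common to stop walking the
     *    rest of the queue. */
    typedef int (*wait_queue_func_t)(wait_queue_head_t *head, wait_queue_t *wait);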
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains
>Hi,
>
>On Wed, Jan 31, 2001 at 07:28:01PM +0530, [EMAIL PROTECTED] wrote:
>>
>> Do the following modifications to your wait queue extension sound
>> reasonable ?
>>
>> 1. Change add_wait_queue to add elements to the end of queue (fifo, by
>> default) and instead have an add_wait_queue_lifo() routine that adds to the
>> head of the queue ?
>
>Cache efficiency: you wake up the task whose data set is most likely
>to be in L1 cache by waking it before its triggering event is flushed
>from cache.
>
>--Stephen

Valid point.
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains
>My first comment is that this looks very heavyweight indeed. Isn't it
>just over-engineered?

Yes, I know it is, in its current form (sigh!). But at the same time, I do not want to give up (not yet, at least) on trying to arrive at something that can serve the objectives, and yet be simple in principle and lightweight too. I feel the need may grow as we have more filter layers coming in, and as async i/o and even i/o cancellation usage increases. And it may not be just with kiobufs ...

I took a second pass at it last night based on Ben's wait queue extensions. I will write that up in a separate note after this. Do let me know if it seems like any improvement at all.

>We _do_ need the ability to stack completion events, but as far as the
>kiobuf work goes, my current thoughts are to do that by stacking
>lightweight "clone" kiobufs.

Would that work with stackable filesystems?

>The idea is that completion needs to pass upwards (a)
>bytes-transferred, and (b) errno, to satisfy the caller: everything
>else, including any private data, can be hooked by the caller off the
>kiobuf private data (or in fact the caller's private data can embed
>the clone kiobuf).
>
>A clone kiobuf is a simple header, nothing more, nothing less: it
>shares the same page vector as its parent kiobuf. It has private
>length/offset fields, so (for example) a LVM driver can carve the
>parent kiobuf into multiple non-overlapping children, all sharing the
>same page list but each one actually referencing only a small region
>of the whole.
>
>That ought to clean up a great deal of the problems of passing kiobufs
>through soft raid, LVM or loop drivers.
>
>I am tempted to add fields to allow the children of a kiobuf to be
>tracked and identified, but I'm really not sure it's necessary so I'll
>hold off for now. We already have the "io-count" field which
>enumerates sub-ios, so we can define each child to count as one such
>sub-io; and adding a parent kiobuf reference to each kiobuf makes a
>lot of sense if we want to make it easy to pass callbacks up the
>stack. More than that seems unnecessary for now.

Being able to track the children of a kiobuf would help with I/O cancellation (e.g. to pull sub-ios off their request queues if I/O cancellation for the parent kiobuf was issued). Not essential, I guess, in general, but useful in some situations. With a clone kiobuf there is no direct way to reach a clone kiobuf given the original kiobuf (without adding some indexing scheme).

>--Stephen
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains
Here's a second pass attempt, based on Ben's wait queue extensions. Does this sound any better?

[This doesn't require any changes to the existing wait_queue_head based i/o structures or to existing drivers, and the constructs mentioned come into the picture only when compound events are actually required.]

The key aspects are:
1. Just using an extended wait queue now instead of the callback queue for completion (this can take care of layered callbacks, and of aggregation via wakeup functions)
2. The io structures don't need to change - they already have a wait_queue_head embedded anyway (e.g. b_wait); in fact io completion happens simply by waking up the waiters in the wait queue, just as it happens now.
3. Instead, all co-relation information is maintained in the wait_queue entries that involve compound events
4. No cancel callback queue any more.

(a) For simple layered callbacks (as in encryption filesystems/drivers):

Intermediate layers simply use add_wait_queue(_lifo) to add their callbacks to the object's wait queue as wakeup functions. The wakeup function can access fields in the object associated with the wait queue using the wait_queue_head address, since the wait_queue_head is embedded in the object. If the wakeup function has to be associated with any other private data, then an embedding structure is required, e.g.:

/* Layered event structure */
struct lev {
        wait_queue_t    wait;
        void            *data;
};

or maybe something like the work_todo structure that Ben had stated as an example (if callback actions have to be delayed to task context). Actually, in that case we might like to have the wakeup function return 1 if it needs to do some work later, and that work needs to be completed before the remaining waiters are woken up.

(b) For compound events:

/* Compound event structure */
struct cev_wait {
        wait_queue_t            wait;
        wait_queue_head_t       *parent;
        unsigned int            flags;          /* optional */
        struct list_head        cev_list;       /* links to siblings or child cev_waits as applicable */
        wait_queue_head_t       *head;          /* head of wait queue on which this is/was queued - optional ? */
};

In this case, for each child, wait.func() is set to a routine that performs any necessary transfer/status/count updates from the child to the parent object, and issues a wakeup on the parent's wait queue (it also removes itself from the child's wait queue, and optionally from the parent's cev_list too). It is this update step that will be situation/subsystem specific, and it also has a return value to indicate whether to detach from the parent or not.

And for the parent queue, a cev_wait would be registered at the beginning, with its wait.func() set up to collate ios and let completion proceed if the relevant criteria are met. It can reach all the child cev_waits through the cev_list links, which is useful for aggregating data from all children.

During i/o cancellation, the status of the parent object is set to indicate cancellation and a wakeup is issued on its wait queue. The parent cev_wait's wakeup function, if it recognizes the cancel, would then cancel all the sub-events. (Is there a nice way to access the object's status from the wakeup function that doesn't involve subsystem-specific code?)

So, it is the step of collating ios and deciding whether to proceed which is situation/subsystem specific. Similarly, the actual operation cancellation logic (e.g. cancelling the underlying io request) is also non-generic.
For this reason, I was toying with the option of introducing two function pointers - complete() and cancel() - in the cev_wait structure, so that the rest of the logic in the wakeup function can be kept common. Does that make sense?

We still need to define routines for initializing and setting up parent-child cev_waits.

Right now this assumes that the changes suggested in my last posting can be made. So I still need to think about whether there is a way to address the cache efficiency issue (that's a little hard).

Regards
Suparna

Suparna Bhattacharya
Systems Software Group, IBM Global Services, India
E-mail : [EMAIL PROTECTED]
Phone : 91-80-5267117, Extn : 2525
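P.S. With those two hooks added, a child's entry might look something like the sketch below. This is illustrative only; the two-argument wakeup-function signature assumes the wait queue changes proposed in my earlier note, and the hooks themselves are the suggestion being made here, not existing code.

    /* Per-child entry queued on the child object's wait queue. */
    struct cev_wait {
            wait_queue_t            wait;       /* wait.func = cev_child_wake() */
            wait_queue_head_t       *parent;    /* wait queue of the parent object */
            unsigned int            flags;
            struct list_head        cev_list;   /* siblings under the same parent */
            wait_queue_head_t       *head;      /* queue this entry is queued on */

            /* situation/subsystem specific parts, kept out of the generic logic */
            int  (*complete)(struct cev_wait *cev);  /* collate child status into parent */
            void (*cancel)(struct cev_wait *cev);    /* cancel the underlying sub-operation */
    };

    /* Generic child wakeup function: let the subsystem-specific complete()
     * hook decide whether to detach, then propagate the wakeup upwards
     * (the parent's own cev_wait does the final collation). */
    static int cev_child_wake(wait_queue_head_t *head, wait_queue_t *wait)
    {
            struct cev_wait *cev = (struct cev_wait *)wait;  /* wait is the first member */

            if (cev->complete(cev))
                    __remove_wait_queue(head, &cev->wait);
            wake_up(cev->parent);
            return 0;
    }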
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains
sct wrote:
>> >
>> > Thanks for mentioning this. I didn't know about it earlier. I've been
>> > going through the 4/00 kqueue patch on freebsd ...
>>
>> Linus has already denounced them as massively over-engineered...
>
>That shouldn't stop anyone from looking at them and learning, though.
>There might be a good idea or two hiding in there somewhere.
>- Dan

There is always scope to learn from a different approach to tackling a problem of a similar nature - both from good ideas and from over-engineered ones - sometimes more from the latter :-)

As far as I have understood so far from looking at the original kevent patch and notes (which perhaps isn't enough, and maybe out of date as well), the concept of knotes and filter ops, and the event queuing mechanism in itself, is interesting and generic, but most of it seems to have been designed with linkage to user-mode issuable event waits in mind - like poll/select/aio/signal etc - at least as it appears from the way it's been used in the kernel. That's a little different from what I had in mind, though it's perhaps possible to use it otherwise. But maybe I've just not thought about it enough or understood it.

Regards
Suparna

Suparna Bhattacharya
Systems Software Group, IBM Global Services, India
E-mail : [EMAIL PROTECTED]
Phone : 91-80-5267117, Extn : 2525
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains
>Hi,
>
>On Thu, Feb 01, 2001 at 10:25:22AM +0530, [EMAIL PROTECTED] wrote:
>>
>> >We _do_ need the ability to stack completion events, but as far as the
>> >kiobuf work goes, my current thoughts are to do that by stacking
>> >lightweight "clone" kiobufs.
>>
>> Would that work with stackable filesystems ?
>
>Only if the filesystems were using VFS interfaces which used kiobufs.
>Right now, the only filesystem using kiobufs is XFS, and it only
>passes them down to the block device layer, not to other filesystems.

That would require the vfs interfaces themselves (the address space readpage/writepage ops) to take kiobufs as arguments, instead of struct page *. That's not the case right now, is it? A filter filesystem would be layered over XFS, to take this example. So right now a filter filesystem only sees the struct page * and passes that along. Any completion event stacking has to be applied with reference to this.

>> Being able to track the children of a kiobuf would help with I/O
>> cancellation (e.g. to pull sub-ios off their request queues if I/O
>> cancellation for the parent kiobuf was issued). Not essential, I guess, in
>> general, but useful in some situations.
>
>What exactly is the justification for IO cancellation? It really
>upsets the normal flow of control through the IO stack to have
>voluntary cancellation semantics.

One reason that I saw is that if the results of an i/o are no longer required due to some condition (e.g. aio cancellation situations, or if the process that issued the i/o gets killed), then this avoids the unnecessary disk i/o, if the request hasn't been scheduled yet. Too remote a requirement? If the capability/support doesn't exist at the driver level, I guess it's difficult.

>--Stephen
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains
>Hi,
>
>On Thu, Feb 01, 2001 at 01:28:33PM +0530, [EMAIL PROTECTED] wrote:
>>
>> Here's a second pass attempt, based on Ben's wait queue extensions:
>> Does this sound any better ?
>
>It's a mechanism, all right, but you haven't described what problems
>it is trying to solve, and where it is likely to be used, so it's hard
>to judge it. :)

Hmm ... I thought I had done that in my first posting, but obviously I mustn't have done a good job at expressing it, so let me take another stab at trying to convey why I started on this.

There are certain specific situations that I have in mind right now, but to me it looks like the very nature of the abstraction is such that it is quite likely there would be uses in some other situations which I may not have thought of yet, or just do not understand well enough to vouch for at this point. What those situations could be, and the associated issues involved (especially performance related), is something that I hope other people on this forum would be able to help pinpoint, based on their experiences and areas of expertise.

I do realize that generic, and yet simple and performance-optimal in all kinds of situations, is a really difficult (if not impossible :-)) thing to achieve, but even then, won't it be nice to at least abstract out uniformity in patterns across situations, in a way which can be tweaked/tuned for each specific class of situations? And the nice thing which I see about Ben's wait queue extensions is that they give us a route to try to do that ...

Some needs considered (and associated problems):

a. Stacking of completion events - asynchronously, through multiple layers
   - layered drivers (encryption, conversion)
   - filter filesystems

   Key aspects:
   1. It should be possible to pass the same (original) i/o container structure all the way down (no copies/clones should need to happen, unless actual i/o splitting, extra buffer space, or multiple sub-ios are involved)
   2. Transparency: neither the upper layer nor the layer below it should need to have any specific knowledge about the existence/absence of an intermediate filter layer (the mechanism should hide all that)
   3. LIFO ordering of completion actions
   4. The i/o structure should be marked as up-to-date only after all the completion actions are done.
   5. Preferably have waiters on the i/o structure woken up only after all completion actions are through (to avoid spurious/redundant wakeups, since the data won't be ready for use)
   6. Possible to have completion actions execute later in task context

b. Co-relation between multiple completion events and their associated operations and data structures
   - (bottom-up aspect) merging results of split i/o requests, and marking the completion of the compound i/o through multiple such layers (tree), e.g.
     - lvm
     - md / raid
     - evms aggregator features
   - (top-down aspect) cascading down i/o cancellation requests / sub-event waits, monitoring sub-io status, etc.

   Some aspects:
   1. The result of collation of sub-i/os may be driver specific (in some situations, like lvm, each sub-i/o maps to a particular portion of a buffer; with software raid or some other kind of scheme, the collation may involve actually interpreting the data read)
   2. Restart/retries of sub-ios (in case of errors) can be handled.
   3. Transparency: neither the upper layer nor the layer below it should need to have any specific knowledge about the existence/absence of an intermediate layer (that sends out multiple sub-ios)
The system should be devised to avoid extra logic/fields in the generic i/o structures being passed around, in situations where no compound i/o is involved (i.e. in the simple i/o cases and most common situations). As far as possible it is desirable to keep the linkage information outside of the i/o structure for this reason. 5. Possible to have collation/completion actions execute later in task context Ben LaHaise's wait queue extensions takes care of most of the aspects of (a), if used with a little care to ensure a(4). [This just means that function that marks the i/o structure as up-to-date should be put in the completion queue first] With this, we don't even need and explicit end_io() in bh/kiobufs etc. Just the wait queue would do. Only a(5) needs some thought since cache efficiency is upset by changing the ordering of waits. But, (b) needs a little more work as a higher level construct/mechanism that latches on to the wait queue extensions. That is what the cev_wait structure was designed for. It keeps the chaining information outside of the i/o structures by default (They can be allocated together, if desired anyway) Is this still too much in the air ? Maybe I should describe the flow in a specific scenario to illustrate ? Regards Suparna - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
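To make the stacking in (a) a bit more concrete, here is a rough sketch of the kind of callback chaining being described. The names (cev_entry, cev_queue, cev_add, cev_complete) are invented purely for illustration - they are not Ben's actual interfaces nor anything in the kernel - and locking is omitted. The point is only that entries are pushed LIFO and the entry that marks the structure up-to-date is pushed first, so it runs last:

        /* One completion action, queued by a layer on its way down. */
        struct cev_entry {
                struct cev_entry *next;
                int (*func)(struct cev_entry *entry, int status);
                void *data;                     /* layer-private state */
        };

        /* Completion queue embedded in (or attached to) the i/o structure. */
        struct cev_queue {
                struct cev_entry *head;         /* newest entry first (LIFO) */
        };

        static inline void cev_add(struct cev_queue *q, struct cev_entry *entry)
        {
                entry->next = q->head;          /* push at the head => LIFO order */
                q->head = entry;
        }

        /* Invoked once, from the lowest layer, when the operation finishes. */
        static inline void cev_complete(struct cev_queue *q, int status)
        {
                struct cev_entry *entry;

                for (entry = q->head; entry; entry = entry->next) {
                        /* a callback may abort the chain, e.g. to retry on error */
                        if (entry->func(entry, status))
                                break;
                }
        }

Under this (hypothetical) scheme, a filter filesystem or layered driver would cev_add() its post-processing entry when it first sees the request on the way down. Since the creator of the i/o structure added the "mark up-to-date and wake waiters" entry before anyone else, that entry executes only after every filter callback has run, which is what a(4) and a(5) ask for.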
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains
>Hi,
>
>On Fri, Feb 02, 2001 at 12:51:35PM +0100, Christoph Hellwig wrote:
>> >
>> > If I have a page vector with a single offset/length pair, I can build
>> > a new header with the same vector and modified offset/length to split
>> > the vector in two without copying it.
>>
>> You just say in the higher-level structure ignore from x to y even if
>> they have an offset in their own vector.
>
>Exactly --- and so you end up with something _much_ uglier, because
>you end up with all sorts of combinations of length/offset fields all
>over the place.
>
>This is _precisely_ the mess I want to avoid.
>
>Cheers,
> Stephen

It appears that we are coming across two kinds of requirements for kiobuf vectors, and quite a bit of debate centering around that.

1. In the block device i/o world, where large i/os may be involved, we'd like to be able to describe chunks/fragments that contain multiple pages; which is why it makes sense to have a single <offset, len> pair for the entire set of pages in a kiobuf, rather than having to deal with per-page offset/len fields.

2. In the networking world, we deal with smaller fragments (for protocol headers and stuff, and small packets), ideally chained together, typically not page aligned, with the ability to extend the list at least at the head and tail (and maybe some reshuffling in case of ip fragmentation?); so I guess that's why it seems good to have an <offset, len> pair per page/fragment. (If there can be multiple fragments in a page, even this might not be frugal enough ...)

Looks like there are two kinds of entities that we are looking for in the kio descriptor:
- a collection of physical memory pages (call it, say, a page_list)
- a collection of fragments of memory described as <offset, len> tuples w.r.t. this collection (the offset in turn could be a <page index, offset within page> pair if that helps) (call this collection a frag_list)

Can't we define a kiobuf structure as just this ? A combination of a frag_list and a page_list ? (Clone kiobufs might share the original kiobuf's page_list, but just split parts of the frag_list.) How hard is it to maintain and to manipulate such a structure ? (A rough sketch appears at the end of this note.)

BTW, we could have a higher level io container that includes a <status> field and a <wait_queue_head> to take care of i/o completion. (If we have a wait queue head, then I don't think we need a separate callback function, given Ben's wakeup functions are in place.) Or is this going in the direction of a cross between an elephant and a bicycle :-) ?

Regards
Suparna
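To make the frag_list/page_list idea and the clone sharing a little more concrete, here is a rough sketch. All the names (kio_frag, kio_desc, kio_clone_range) are invented for illustration only and do not correspond to the actual kiobuf/kiovec patches; struct page is left opaque:

        struct page;                            /* stands in for the kernel's page descriptor */

        struct kio_frag {
                unsigned int offset;            /* byte offset into the page_list area */
                unsigned int len;               /* length of this fragment in bytes */
        };

        struct kio_desc {
                int               nr_frags;
                struct kio_frag  *frag_list;
                int               nr_pages;
                struct page     **page_list;    /* physical pages backing the fragments */
        };

        /*
         * Carve a byte sub-range out of a single-fragment parent without
         * copying the page_list; the caller provides a one-element
         * frag_list for the clone.  (Single-fragment case only, to keep
         * the sketch short.)
         */
        static int kio_clone_range(struct kio_desc *clone, struct kio_desc *parent,
                                   unsigned int offset, unsigned int len)
        {
                if (parent->nr_frags != 1 || offset + len > parent->frag_list[0].len)
                        return -1;                      /* out of range / not handled here */

                clone->nr_pages  = parent->nr_pages;
                clone->page_list = parent->page_list;   /* shared, not copied */
                clone->nr_frags  = 1;
                clone->frag_list[0].offset = parent->frag_list[0].offset + offset;
                clone->frag_list[0].len    = len;
                return 0;
        }

The split-in-two example quoted above would then just be two such clones over the same page_list, each with its own fragment.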
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains
>Hi,
>
>On Sun, Feb 04, 2001 at 06:54:58PM +0530, [EMAIL PROTECTED] wrote:
>>
>> Can't we define a kiobuf structure as just this ? A combination of a
>> frag_list and a page_list ?
>
>Then all code which needs to accept an arbitrary kiobuf needs to be
>able to parse both --- ugh.

Making this a little more explicit to help analyse the tradeoffs:

/* Memory descriptor portion of a kiobuf - this is something that may get
   passed around between layers and subsystems */
struct kio_mdesc {
        int           nr_frags;
        struct frag  *frag_list;
        int           nr_pages;
        struct page **page_list;
        /* list follows */
};

For block i/o requiring #1 type descriptors, the list could have allocated extra space for:

struct kio_type1_ext {
        struct frag  frag;
        struct page *pages[NUM_STATIC_PAGES];
};

For n/w i/o or cases requiring #2 type descriptors, the list could have allocated extra space for:

struct kio_type2_ext {
        struct frag  frags[NUM_STATIC_FRAGS];
        struct page *page[NUM_STATIC_FRAGS];
};

struct kiobuf {
        int                status;
        wait_queue_head_t  waitq;
        struct kio_mdesc   mdesc;
        /* list follows - leaves room for allocation for mem descs,
           completion sub structs etc */
};

Code that accepts an arbitrary kiobuf needs to process the fragments one by one:
- In the type #1 case, only one fragment would typically be there, but processing it would involve crossing all pages in the page list. The extra processing vs a kiobuf with a single <offset, len> pair involves:
  - dereferencing the frag_list pointer
  - checking the nr_frags field
- In the type #2 case, the number of fragments would be equal to or greater than the number of pages, so processing will typically go over each fragment and thus cross each page in the list one by one. The extra processing vs a kiobuf with per-page <offset, len> pairs involves:
  - dereferencing the page_list entry (which involves computing the page index in the page_list from the offset value)
  - checking that offset+len doesn't fall outside the page

This boils down to approximately one extra dereference and one comparison per kiobuf for the common cases (have I missed something critical ?) vs the most optimized choice of descriptors for those cases (see the traversal sketch below). In terms of resource consumption (extra bytes taken up), it is two extra fields per kiobuf chain (e.g. nr_frags and the frag_list pointer when it comes to #1), i.e. a total of 8 bytes, for the common cases vs the most optimized choice of structures for those cases.

This seems to be more uniformly balanced across the #1 and #2 cases than an <offset, len> for every page plus an overall <offset, len>. But then, come to think of it, since the need for lightweight structures is greater in the #2 case, should the point of balance (if at all we want to find one) be tilted towards #2 ? On the other hand, since having a common structure does involve extra bytes and cycles, if there are very few situations where we need both #1 and #2, doing the conversion only at subsystem boundaries (like i2o does) may turn out to be better. Oh well ...

>> BTW, We could have a higher level io container that includes a <status>
>> field and a <wait_queue_head> to take care of i/o completion
>
>IO completion requirements are much more complex. Think of disk
>readahead: we can create a single request struct for an IO of a
>hundred buffer heads, and as the device driver satisfies that request,
>it wakes up the buffer heads as it goes. There is a separate
>completion notification for every single buffer head in the chain.

I understand the requirement of independent completion notifiers for higher level buffers/other structures, since they are indeed independently usable structures.
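To illustrate the per-fragment processing cost being weighed above (this is not code from any patch): using the kio_mdesc layout sketched in this message, and assuming struct frag carries an <offset, len> pair relative to the page_list and that PAGE_SIZE/PAGE_SHIFT have their usual meanings, a consumer that has to accept an arbitrary descriptor might look roughly like this, with process_chunk() standing in for whatever the layer actually does with each page range:

        #include <asm/page.h>           /* PAGE_SIZE, PAGE_SHIFT */

        static int kio_for_each_chunk(struct kio_mdesc *md,
                                      int (*process_chunk)(struct page *page,
                                                           unsigned int off,
                                                           unsigned int len,
                                                           void *data),
                                      void *data)
        {
                int f;

                for (f = 0; f < md->nr_frags; f++) {    /* extra deref + nr_frags check */
                        unsigned int off  = md->frag_list[f].offset;
                        unsigned int left = md->frag_list[f].len;

                        while (left) {
                                int pg = off >> PAGE_SHIFT;             /* page index from offset */
                                unsigned int pg_off = off & (PAGE_SIZE - 1);
                                unsigned int chunk  = PAGE_SIZE - pg_off;
                                int err;

                                if (chunk > left)
                                        chunk = left;   /* stay within the fragment */
                                err = process_chunk(md->page_list[pg], pg_off, chunk, data);
                                if (err)
                                        return err;
                                off  += chunk;
                                left -= chunk;
                        }
                }
                return 0;
        }

For the #1-style case (single fragment, many pages) the only additions over a descriptor with a built-in single <offset, len> are the frag_list dereference and the nr_frags check, which is where the "one extra dereference and one comparison" estimate comes from.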
The requirement of independent completion notifiers was one aspect that I thought I was able to address in the cev_wait design, based on wait_queue wakeup functions. The way it would work is that there would be multiple wakeup functions registered on the container for the big request, each wakeup function being responsible for waking up a higher level buffer. This way, the linkage information is actually external to the buffer structures (which seems reasonable, since it is only required while the i/o is happening - unless there is another reason to keep a more lasting association).

>It's the very essence of readahead that we wake up the earlier buffers
>as soon as they become available, without waiting for the later ones
>to complete, so we _need_ this multiple completion concept.

I can understand this in principle, but when we have a single request going down to the device that actually fills in multiple buffers, do we get notified (interrupted) by the device before all the data in that request has been transferred ? I mean, how do we know that some buffers have become available until the overall device request has completed (unless of course the request actually gets broken up at this level and completed bit by bit) ?

>Which is exactly why we have one kiobuf per higher
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains
>Hi,
>
>On Mon, Feb 05, 2001 at 08:01:45PM +0530, [EMAIL PROTECTED] wrote:
>>
>> >It's the very essence of readahead that we wake up the earlier buffers
>> >as soon as they become available, without waiting for the later ones
>> >to complete, so we _need_ this multiple completion concept.
>>
>> I can understand this in principle, but when we have a single request going
>> down to the device that actually fills in multiple buffers, do we get
>> notified (interrupted) by the device before all the data in that request
>> got transferred ?
>
>It depends on the device driver. Different controllers will have
>different maximum transfer size. For IDE, for example, we get wakeups
>all over the place. For SCSI, it depends on how many scatter-gather
>entries the driver can push into a single on-the-wire request. Exceed
>that limit and the driver is forced to open a new scsi mailbox, and
>you get independent completion signals for each such chunk.

I see. I remember Jens Axboe mentioning something like this with IDE. So, in this case, you want every such chunk to check whether it has completed filling up a buffer and then trigger a wakeup on that ? But does this also mean that combining requests beyond this limit doesn't really help ? (Reordering requests to get contiguity would of course still help in terms of seek times, I guess, but not merging beyond this limit.)

>> >Which is exactly why we have one kiobuf per higher-level buffer, and
>> >we chain together kiobufs when we need to for a long request, but we
>> >still get the independent completion notifiers.
>>
>> As I mentioned above, the alternative is to have the i/o completion related
>> linkage information within the wakeup structures instead. That way, it
>> doesn't matter to the lower level driver what higher level structure we
>> have above (maybe buffer heads, maybe page cache structures, maybe
>> kiobufs). We only chain together memory descriptors for the buffers during
>> the io.
>
>You forgot IO failures: it is essential, once the IO completes, to
>know exactly which higher-level structures completed successfully and
>which did not. The low-level drivers have to have access to the
>independent completion notifications for this to work.

No, I didn't forget IO failures; it is just that I expect the wait structure containing the wakeup function to be embedded in a cev structure that contains a pointer to the wait_queue_head field in the higher level structure. The rest is for the wakeup function to interpret (it can always access the other fields in the higher level structure, just as list_entry() does).

Later I realized that instead of having multiple wakeup functions queued on the low level structure's wait queue, it is perhaps better to just turn the cev_wait structure upside down: an entry on the lower level structure's queue should link to the parent entries instead. (A rough sketch of that idea follows below.)
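To make the "upside down" arrangement a little more concrete, here is a rough guess at what it could look like. None of these names (cev_link, cev_wait, cev_wakeup) come from an actual patch; wait_queue_head_t and wake_up() are just the ordinary kernel wait queue primitives, and locking is omitted:

        #include <linux/wait.h>

        /* One link per higher level buffer (parent) that this lower level
           request contributes to. */
        struct cev_link {
                struct cev_link   *next;
                wait_queue_head_t *parent_waitq;        /* wait queue inside the parent */
                void              *parent;              /* e.g. a buffer_head or a page */
                void (*complete)(void *parent, int status);
        };

        /* Sits with the lower level structure; when this chunk of i/o
           finishes, its wakeup path walks the links and notifies each
           parent individually, passing this chunk's status so that i/o
           failures can be attributed to the right higher level structures. */
        struct cev_wait {
                struct cev_link *links;
        };

        static void cev_wakeup(struct cev_wait *cw, int status)
        {
                struct cev_link *link;

                for (link = cw->links; link; link = link->next) {
                        link->complete(link->parent, status);
                        wake_up(link->parent_waitq);
                }
        }

Since every chunk carries its own cev_wait and every link its own parent pointer and completion callback, a partial failure is reported only to the higher level structures served by the failing chunk, without the low level driver having to know what those structures are.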
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Going through all the discussions once again, and trying to look at this purely from the point of view of the basic requirements for data structures and mechanisms that they imply:

1. We should have a data structure that represents a memory chain, which may not be contiguous in physical memory, and which can be passed down as a single unit all the way through to the lowest level drivers - e.g. for direct i/o to/from a contiguous virtual address range in user space (without any intermediate copies).

(Networking and block i/o may require different optimizations in the design of such a data structure, due to differences in the kind of patterns expected, as is apparent from the zero-copy networking fragments vs the raw i/o kiobuf/kiovec patches. There are situations when such a data structure may be passed between subsystems, as in the i2o example.)

This data structure could be part of an I/O container.

2. I/O containers may get split or merged as they pass through various layers, so any completion mechanism and i/o container design should be able to account for both cases. At any point, a request could be
- a collection of several higher level requests, or
- one among several sub-requests of a single higher level request.

(Just as appropriate "clustering" could happen at each level, appropriate "splitting" may also take place depending on the situation. It may make sense to delay splitting as far down the chain as possible in many situations, where the higher level is only interested in the i/o in its entirety and not in partial completion.)

When caching/buffers are involved, sometimes the sub-requests of a single higher level request may have individual completion requirements (even when no merges were involved), because the sub-request buffers may be used to service other requests alongside. With raw i/o that might not be the case. (A small sketch of the completion accounting this implies appears after this note.)

3. It is desirable that layers which process the requests along the way without splitting/merging be able to pass along the same I/O container without any duplication or cloning, and intercept async i/o completions for post-processing.

4. (Optional) It would be nice if different kinds of I/O containers or buffer structures could be used at different levels, without explicit linkage fields (like bh --> page, for example), and in a way that lets intermediate drivers or layers work transparently.

3 & 4 are more layering related items, which get a little specific, but do 1 and 2 cover the general things we are looking for ?

Regards
Suparna

Suparna Bhattacharya
Systems Software Group, IBM Global Services, India
E-mail : [EMAIL PROTECTED]
Phone : 91-80-5267117, Extn : 2525
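As an aside, the completion accounting implied by requirement 2 (for the split case) can be as simple as a pending count plus a sticky error status, along the lines of the sketch below. The names are invented for illustration (this is roughly the kind of thing the 2.4 kiobuf already does with its io_count and end_io fields, if memory serves), and locking is omitted:

        #include <asm/atomic.h>

        struct io_container {
                atomic_t pending;       /* outstanding sub-requests */
                int      errors;        /* sticky error status (simplified) */
                void   (*complete)(struct io_container *ioc);   /* runs once, when the last sub-request finishes */
                struct io_container *parent;    /* set on sub-requests */
        };

        /* Called by a splitting layer before it fires off nr sub-requests. */
        static void ioc_expect(struct io_container *ioc, int nr)
        {
                atomic_add(nr, &ioc->pending);
        }

        /* Called from each sub-request's completion path. */
        static void ioc_child_done(struct io_container *child, int status)
        {
                struct io_container *parent = child->parent;

                if (status)
                        parent->errors = status;        /* remember that something failed */
                if (atomic_dec_and_test(&parent->pending))
                        parent->complete(parent);       /* last one out completes the whole i/o */
        }

A splitting layer would call ioc_expect(parent, n) before submitting its n sub-requests, point each child's parent field at the compound container, and let ioc_child_done() run from each sub-request's completion path; the parent's complete() then fires exactly once, whatever order the sub-requests finish in.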