On Saturday 20 January 2007 21:55, Michael Tokarev wrote:
> Denis Vlasenko wrote:
> > On Thursday 11 January 2007 18:13, Michael Tokarev wrote:
> >> example, which isn't quite possible now from userspace.  But as long as
> >> O_DIRECT actually writes data before returning from write() call (as it
> >> seems to be the case at least with a normal filesystem on a real block
> >> device - I don't touch corner cases like nfs here), it's pretty much
> >> THE ideal solution, at least from the application (developer) standpoint.
> >
> > Why do you want to wait while 100 megs of data are being written?
> > You _have to_ have threaded db code in order to not waste
> > gobs of CPU time on UP + even with that you eat context switch
> > penalty anyway.
>
> Usually it's done using aio ;)
>
> It's not that simple really.
>
> For reads, you have to wait for the data anyway before doing something
> with it.  Omitting reads for now.

Really? All 100 megs _at once_? Linus described a fairly simple
(conceptually) idea here:

http://lkml.org/lkml/2002/5/11/58

In short, a page-aligned read buffer can simply be unmapped, with the
page fault handler catching accesses to not-yet-read data. As data comes
in from disk, it gets mapped back into the process' address space.

This way read() returns almost immediately and the CPU is free to do
something useful.

> For writes, it's not that problematic - even 10-15 threads is nothing
> compared with the I/O (O in this case) itself -- that context switch
> penalty.

Well, if you have some CPU intensive thing to do (e.g. sort),
why not benefit from the lack of an extra context switch?
Assume that we have "clever writes" like Linus described.

	/* something like "caching i/o over this fd is mostly useless" */
	/* (looks like this API is easier to transition to
	 * than fadvise etc. - it "looks like" O_DIRECT) */
	fd = open(..., flags|O_STREAM);
	...
	/* Starts writeout immediately due to O_STREAM,
	 * marks buf100meg's pages R/O to catch modifications,
	 * but doesn't block! */
	write(fd, buf100meg, 100*1024*1024);
	/* We are free to do something useful in parallel */
	sort();

> > I hope you agree that threaded code is not ideal performance-wise
> > - async IO is better. O_DIRECT is strictly sync IO.
>
> Hmm.. Now I'm confused.
>
> For example, oracle uses aio + O_DIRECT.  It seems to be working... ;)
> As an alternative, there are multiple single-threaded db_writer processes.
> Why do you say O_DIRECT is strictly sync?

I mean that an O_DIRECT write() blocks until the I/O really is done.
A normal write can block for much less, or not at all.

> In either case - I provided some real numbers in this thread before.
> Yes, O_DIRECT has its problems, even security problems.  But the thing
> is - it is working, and working WAY better - from the performance point
> of view - than "indirect" I/O, and currently there's no alternative that
> works as good as O_DIRECT.

Why did we bother to write Linux at all?
There were other Unixes which worked ok.
--
vda
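
For comparison, here is a minimal sketch of how the "write, then sort
in parallel" pattern from the O_STREAM example can be approximated with
the POSIX AIO calls that exist today (the "aio" mentioned in the quoted
text). It is not the proposed O_STREAM behaviour: nothing marks the
buffer read-only, so the caller must simply leave it alone until the
write completes. The helper name write_while_sorting() and the
comparator cmp_int() are made up for this illustration.

```c
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* qsort() comparator for ints. */
static int cmp_int(const void *a, const void *b)
{
	int x = *(const int *)a, y = *(const int *)b;
	return (x > y) - (x < y);
}

/*
 * Hypothetical helper (an assumption, not from the original mail):
 * queue one write at file offset 0 with POSIX AIO, sort an unrelated
 * array while the write is in flight, then wait for it to complete.
 * Returns the number of bytes written, or -1 on error.
 * Unlike the O_STREAM idea, buf is NOT protected against modification;
 * the caller must not touch it until this function returns.
 */
static ssize_t write_while_sorting(int fd, const void *buf, size_t len,
				   int *keys, size_t nkeys)
{
	struct aiocb cb;
	const struct aiocb *const list[1] = { &cb };

	assert(buf != NULL && keys != NULL);

	memset(&cb, 0, sizeof cb);
	cb.aio_fildes = fd;
	cb.aio_buf = (void *)buf;
	cb.aio_nbytes = len;
	cb.aio_offset = 0;

	/* Queues the request and returns without waiting for the I/O. */
	if (aio_write(&cb) != 0)
		return -1;

	/* The "something useful in parallel". */
	qsort(keys, nkeys, sizeof keys[0], cmp_int);

	/* Sleep until the request has finished; retry if interrupted. */
	while (aio_error(&cb) == EINPROGRESS)
		aio_suspend(list, 1, NULL);

	if (aio_error(&cb) != 0)
		return -1;
	return aio_return(&cb);
}
```

A caller would open the file, pass the descriptor and the 100 meg
buffer, and get the byte count back once both the sort and the write
are done. With glibc older than 2.34 this needs -lrt at link time.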