Hi,

Thank you so much for your constructive notes. We feel confident about
merging some parts into the mainline code; other parts may require
further discussion.

Bosko Milekic wrote:
> 1) User-visible DeBoxInfo structure has the magic number "5"
> PerSleepInfo structs and the magic number "200" CallTrace
> structs. It seems that it would be somewhat less crude to turn
> the struct arrays in DeBoxInfo into pointers in which case you
> have several options. You could provide a library to link
> applications compiled for DeBox use with that would take care of
> allocating the space in which to store maxSleeps and
> maxTrace-worth of memory and hooking the data into resultBuf or
> providing the addresses as separate arguments to the
> DeBoxControl() system call. For what concerns the kernel, you
> could take a similar approach and dynamically pre-allocate the
> PerSleepInfo and CallTrace structures, based on the requirements
> given by the DeBoxControl system call.

This would be a better solution. We admit that the magic numbers were
chosen purely for experimental purposes, and we agree that a better
approach should be taken if DeBox is to be adopted.
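For concreteness, here is a rough sketch of the pointer-based layout we
have in mind. This is only a sketch: apart from PerSleepInfo, CallTrace,
maxSleeps, and maxTrace, the member names are made up for illustration
and are not taken from the current code.

    /* Forward declarations of the existing per-event records. */
    struct PerSleepInfo;
    struct CallTrace;

    /*
     * Hypothetical pointer-based DeBoxInfo.  The application (or a
     * small support library) allocates the two arrays and passes
     * their capacities through DeBoxControl(), so the kernel no
     * longer bakes in the magic "5" and "200".
     */
    struct DeBoxInfo {
            /* ... existing summary fields ... */
            int                   maxSleeps;   /* capacity from caller */
            int                   numSleeps;   /* entries filled in */
            struct PerSleepInfo  *psi;         /* caller-allocated */
            int                   maxTrace;    /* capacity from caller */
            int                   numTrace;    /* entries filled in */
            struct CallTrace     *ct;          /* caller-allocated */
    };

On the kernel side, DeBoxControl() would validate maxSleeps and
maxTrace and pre-allocate matching buffers, as you suggest.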
> 2) The problem of modifying entry-exit paths in function calls.
> Admittedly, this is hard, but crudely modifying a select number
> of functions to Do The Right Thing for what concerns call tracing
> is hard to justify from a general perspective. I don't mean to
> spread FUD here; the change you made is totally OK from a
> measurement perspective and serves great for the paper, it's just
> tougher to integrate this stuff into the mainline code.

You are right about the problems of manual modification. We opted for
manual modification only as a short-term solution while we investigate
other approaches. We started by trying to modify mcount, but did not
succeed in controlling it, namely in making it profile only the
functions we are interested in. We then switched to gcc's entry/exit
hooks, enabled with the -finstrument-functions option, and ran into
unacceptable overhead (the hook interface is sketched below). A problem
common to both approaches is how to exclude the bottom-half invocations
that occur during a system call, since these interrupt-handling
functions do not belong to any system call path. Automating the
modification might be possible with some compiler assistance, or
perhaps with mcount after all, but we have not yet found the right way.
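For reference, these are the two hooks that gcc emits calls to around
every function in a file compiled with -finstrument-functions; the two
extra calls per function are where the overhead we measured comes from.
The recorder below is hypothetical, just to show the shape of a tracer:

    /* Hypothetical recorder; stands in for whatever fills in the
     * CallTrace entries. */
    void debox_record_call(void *fn, void *site, int is_entry);

    /* The hooks themselves must carry no_instrument_function, or
     * gcc would instrument them too and they would recurse. */
    void __cyg_profile_func_enter(void *fn, void *site)
        __attribute__((no_instrument_function));
    void __cyg_profile_func_exit(void *fn, void *site)
        __attribute__((no_instrument_function));

    void
    __cyg_profile_func_enter(void *fn, void *site)
    {
            debox_record_call(fn, site, 1);
    }

    void
    __cyg_profile_func_exit(void *fn, void *site)
    {
            debox_record_call(fn, site, 0);
    }

Note that at this level there is no cheap way to tell a bottom-half
invocation apart from a function on the system call path, which is
exactly the filtering problem mentioned above.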
> - On the Case Study. I was most interested in the sendfile
> modifications you talk about and would be interested in seeing
> patches. I know that some of the modifications you mention have
> already been done in 5.x; Notably, if you have not already, you'll
> want to glance at:
>
> http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/uipc_syscalls.c? \
> rev=1.144&content-type=text/x-cvsweb-markup
>
> (regarding your mapping caching in sf_bufs)
>
> and this [gigantic] thread:
>
> http://www.freebsd.org/cgi/getmsg.cgi?fetch=12432+15802+ \
> /usr/local/www/db/text/2003/freebsd-arch/20030601.freebsd-arch
>
> (subject: sendfile(2) SF_NOPUSH flag proposal on freebsd-arch@, at
> least).
>
> You may want to contact Igor Sysoev or other concerned parties in
> that thread to show them that you actually have performance results
> resulting from such a change.

Regarding the sendfile optimizations, we started on them last October
and were aware that some of these issues were discussed on this list
later. We also went a few steps further, specifically:

1. Cache the mappings between VM pages and the physical map, and do
not free the cached entries until their number reaches "nsfbufs". This
aggressive caching does cause more wired memory, but according to our
measurements the reduction in mapping/releasing overhead and address
space consumption outweighs that drawback. It may be necessary to free
these pages on a timer if they are no longer being used.

2. We made a variant of sendfile that avoids disk I/O by returning an
error if the file is set non-blocking and its data is not in memory.
This optimization is very powerful for applications such as
event-driven servers, where any blocking I/O hurts performance
seriously. We used to maintain an mmap cache and call mincore() to
avoid issuing any disk I/O request; that machinery is no longer
needed, which saves a lot of overhead. With this change sendfile can
be made non-blocking on both the socket write and the disk read if the
caller asks for it, but by default it keeps the traditional semantics.

3. Pack the header and trailer into the body packets using mbuf
cluster space. The current implementation, which in effect does a
separate writev for the header and the body, generates more packets
and really hurts the performance of small transfers. The consequence
is more of an issue for fast services over WANs because of the
needless latency. Compared to writev, sendfile used to show a
performance loss on a portion of our workload: the loss on small files
was larger than the gain on large files, leading to a net loss for our
web server. It is possible to use writev for small files while leaving
large files to sendfile, but as Terry Lambert pointed out in the
discussion at
http://www.freebsd.org/cgi/getmsg.cgi?fetch=24340+0+/usr/local/www/db/text/2003/freebsd-arch/20030601.freebsd-arch
this makes applications too complicated. We found that building the
mbuf chain and passing it along gives a significant benefit and is
more straightforward than the TCP_NOPUSH option proposed by Igor
Sysoev (a userland sketch follows at the end of this mail).

Regards

- Yaoping
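P.S. For concreteness, a sketch of the userland side of item 3. The
call is the standard FreeBSD sendfile(2) with its sf_hdtr argument;
our change only affects how the kernel lays the header out in mbufs,
so the application code does not change. The function name, descriptor
names, and header buffer here are placeholders:

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /*
     * Send "hdr" in front of the first "nbytes" of file_fd on
     * socket_fd.  With the header-packing change, the header bytes
     * are copied into mbuf cluster space at the front of the body
     * chain instead of going out as a separate small packet.
     */
    int
    send_response(int socket_fd, int file_fd, void *hdr, size_t hdrlen,
        size_t nbytes)
    {
            struct iovec iov;
            struct sf_hdtr hdtr;
            off_t sent;

            iov.iov_base = hdr;
            iov.iov_len = hdrlen;
            hdtr.headers = &iov;
            hdtr.hdr_cnt = 1;
            hdtr.trailers = NULL;
            hdtr.trl_cnt = 0;

            /*
             * With the variant in item 2, a request on a
             * non-blocking file whose data is not resident would
             * fail here instead of blocking on disk I/O.
             */
            if (sendfile(file_fd, socket_fd, 0, nbytes, &hdtr,
                &sent, 0) == -1)
                    return (-1);
            return (0);
    }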