On Wed, Mar 21, 2012 at 5:55 AM, Adrian Chadd <adr...@freebsd.org> wrote:
> Hi,
>
> I'm interested in this, primarily because I'm tinkering with file
> storage stuff on my little (most wifi targetted) embedded MIPS
> platforms.
>
> So what's the story here? How can I reproduce your issue and do some
> of my own profiling/investigation?
>
>
> Adrian
Hi, your interest prompted me to do a more solid and comparable investigation on my embedded ELAN486 platform. Besides gathering more test results, I did full tracing of the related VFS, filesystem, and disk function calls. It took some time to understand what the issue really is about.

My test case: a single file copy (no O_FSYNC). It means that no other filesystem operation is served at the same time. The file size must be big enough with respect to the hidirtybuffers value. Other processes on the test machine were almost inactive. The real copy time was profiled. In all tests the machine was booted, the file was copied, the file was removed, and the machine was rebooted. Thus, the file was always copied into the same disk layout.

The motivation is that my embedded machines mostly do not write to the disk at all. Only during a software update does a single process write to the disk (file by file). That does not have to be a problem at all, but an update must succeed even under full CPU load. So the writing should be tuned so that it does not affect other processes too much and finishes in a reasonable time.

On my embedded ELAN486 machines, a flash memory is used as the disk. That means reading is very fast, but writing is slow. Further, the flash memory is divided into sectors and only a complete sector can be erased at once. A sector erase is a very expensive operation in terms of time.

When I tried to tune the VFS by changing various parameters, I found out that the real copy time depends on two things. Both of them are a matter of bufdaemon, namely its feature of trying to work harder when its buffer-flushing mission is failing. It is no surprise that the best copy times were achieved when bufdaemon was excluded from buffer flushing entirely (by setting the VFS parameters accordingly). With respect to the real copy time, this bufdaemon feature brings along:

1. the bufdaemon runtime itself,
2. very frequent flushing of filesystem buffers.

What really happens in the test case on my machine: the copy program uses a buffer for copying. The default buffer size is 128 KiB in my case. The simplified sys_write() implementation for DTYPE_VNODE and the VREG type is the following:

  sys_write() {
    dofilewrite() {
      bwillwrite()
      fo_write() => vn_write() {
        bwillwrite()
        vn_lock()
        VOP_WRITE()
        VOP_UNLOCK()
      }
    }
  }

So, all 128 KiB is written under the vnode lock. When I take the machine defaults:

  hidirtybuffers:  134
  lodirtybuffers:   67
  dirtybufthresh:  120
  buffer size (filesystem block size): 512 bytes

and do some simple calculations:

  134 * 512 = 68608 -> high water byte count
  120 * 512 = 61440 -> dirtybufthresh byte count
   67 * 512 = 34304 -> low water byte count

then it's obvious that bufdaemon has something to do during each sys_write(). However, almost all dirty buffers belong to the new file's vnode, and that vnode is locked. What remains are the filesystem buffers only, i.e., the superblock buffer and the free-block bitmap buffers. So bufdaemon iterates over the whole dirty buffer queue, which takes a SIGNIFICANT time on my machine, and most of the time does not find any buffer it is able to flush. If bufdaemon flushes one or two buffers, kern_yield() is called and a new iteration is started, until no buffer is flushed. So, very often TWO full iterations over the dirty buffer queue are done just to flush one or two filesystem buffers and still fail to reach the lodirtybuffers threshold. The bufdaemon runtime keeps growing. Moreover, the frequent flushing of filesystem buffers brings along a higher CPU load (scheduling of the GEOM down thread, the GEOM up thread, and the disk thread) and a re-ordering of disk block writes. The correct disk block write order is important for the flash disk.
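To make the "two full iterations" point more concrete, here is a tiny userland model of the situation (this is not kernel code; the numbers and the two flushable filesystem buffers are just the defaults from my machine described above):

  #include <stdio.h>

  /*
   * Toy model: the dirty buffer queue is full of buffers belonging to the
   * locked file vnode, so a full pass over the queue can flush only the one
   * or two filesystem buffers (superblock, free-block bitmap).  bufdaemon
   * keeps scanning until a pass flushes nothing, then sleeps hz/10 (that is
   * what the qsleep counter below counts).
   */
  int
  main(void)
  {
          int numdirtybuffers = 134;      /* hidirtybuffers on my machine */
          int lodirtybuffers = 67;
          int fs_buffers = 2;             /* the only flushable buffers */
          int passes = 0, visited = 0;

          while (numdirtybuffers > lodirtybuffers) {
                  int flushed = 0;

                  /* one full pass over the dirty buffer queue */
                  for (int i = 0; i < numdirtybuffers; i++) {
                          visited++;
                          if (flushed < fs_buffers)
                                  flushed++;      /* filesystem buffer flushed */
                  }
                  passes++;
                  numdirtybuffers -= flushed;
                  fs_buffers = 0;                 /* nothing flushable is left */
                  if (flushed == 0)
                          break;          /* real bufdaemon sleeps hz/10 here */
                  /* real bufdaemon calls kern_yield() and scans again */
          }
          printf("passes=%d, buffers visited=%d, still dirty=%d\n",
              passes, visited, numdirtybuffers);
          return (0);
  }

It prints two passes and more than 260 visited queue entries, all done just to get two 512-byte filesystem buffers on their way, and this repeats for every chunk the cp process writes.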
Further, while the file data buffers only age and are not flushed, the filesystem buffers are flushed repeatedly. Of course, I use a sector cache in the flash disk driver, but I can't cache too many sectors because of the total memory size. So filesystem disk blocks are written often and that provokes more disk sector flushes. A sector flush really takes a long time, so the real copy time grows beyond control. Last but not least, the flash memory wears out uselessly.

Well, this is my old story. Just to be honest, I had quite forgotten that my kernel was compiled with the FULL_PREEMPTION option. Things are very much worse in that case. However, the option only makes the issue worse; the issue doesn't disappear without it.

In this old story, I played with and focused on the dirtybufthresh value. However, dirtybufthresh changes the way how, and by whom, buffers are flushed too much.

I recorded the disk sector flush count and the total count of disk_strategy() calls with the BIO_WRITE command (and the total byte count to write). I used a file of 2235517 bytes. When caching 4 disk sectors, without the FULL_PREEMPTION option, the results are the following:

dirtybufthresh:          120        85        40
real copy time:     1:16.909    38.031    23.117
cp runtime:            8.876     9.419     9.922
bufdaemon runtime:      4.95      0.26      0.01
fldisk runtime:        39.64     18.23      8.50
idle runtime:          18.99      7.79      2.76
qsleep:                  358        32         0
flush_count:              49        39        25
write_count:             452       529       931
write_size:          2428928   2461184   2649600

The idle runtime is, in fact, the disk sector erase runtime. qsleep is a counter of how often bufdaemon slept with the hz/10 timeout, so in some sense it shows how often bufdaemon failed to reach the lodirtybuffers threshold. flush_count is the number of disk sector flushes and, in fact, many of the runtimes depend on it. write_count is the number of disk_strategy() calls with the BIO_WRITE command. write_size is the total number of bytes passed to BIO_WRITE commands.

It can be seen that for dirtybufthresh=40, even though write_count is the biggest, flush_count is the lowest. Thus, the disk block write order was the best with respect to the disk sector cache. However, this showed that there are other things involved in the issue which I hadn't figured out before: VFS clusters and disk fragmentation.

As a matter of interest, the same tests with the FULL_PREEMPTION option:

dirtybufthresh:          120        85        40
real copy time:     2:00.389    27.385    21.975
cp runtime:           10.873     9.538     9.988
bufdaemon runtime:     48.36      0.11      0.00
fldisk runtime:        35.96     11.67      7.75
idle runtime:          16.59      4.23      2.20
qsleep:                 4180        10         0
flush_count:              44        30        24
write_count:            1112       532       931
write_size:          2767360   2462720    264960

It's interesting that the real copy times are better for dirtybufthresh=40 and 85. However, the times depend on flush_count again, i.e., it is the disk block write order that is better.

As I said, the FULL_PREEMPTION option just makes things worse. The issue starts when the cp process is rescheduled in favour of bufdaemon (while numdirtybuffers is greater than lodirtybuffers) during VOP_WRITE(), which is called with the vnode locked. At the beginning (without the FULL_PREEMPTION option), that happens if:

1. bufdaemon's sleep times out, or
2. a thread with realtime priority is scheduled.

Threads with realtime priority always preempt fully (with or without the FULL_PREEMPTION option). When a realtime thread finishes its job, the thread with the highest priority is scheduled, i.e., bufdaemon interrupts the cp process that way too.
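For reference, the flush_count, write_count, and write_size numbers in the tables above come from simple counters in my flash disk driver. A minimal sketch of that kind of instrumentation could look like the following (fldisk_strategy() and the counter names are only illustrative here, they are not existing kernel symbols):

  #include <sys/param.h>
  #include <sys/bio.h>

  /* illustrative driver-private counters, e.g. exported through sysctl */
  static unsigned long fldisk_write_count;    /* BIO_WRITE requests            */
  static unsigned long fldisk_write_size;     /* bytes carried by those writes */
  static unsigned long fldisk_flush_count;    /* flash sectors erased+rewritten */

  static void
  fldisk_strategy(struct bio *bp)
  {
          if (bp->bio_cmd == BIO_WRITE) {
                  fldisk_write_count++;
                  fldisk_write_size += bp->bio_length;
          }
          /* ... hand the request over to the flash worker thread as usual ... */
  }

  /* called by the sector cache whenever a whole flash sector is written back */
  static void
  fldisk_sector_flushed(void)
  {
          fldisk_flush_count++;
  }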
Once that happens, filesystem buffers can be bawrite'd by bufdaemon. That sometimes causes the cp process to sleep in getblk() (td_wmesg = "getblk") and bufdaemon to be scheduled again. Moreover, once this starts, bufdaemon sleeps with the hz/10 timeout, so it's more likely that its sleep times out again. I'm not sure, but once this starts, it looks like the cp process can sometimes be contested too (and bufdaemon scheduled).

One more word about the FULL_PREEMPTION option. I know that the option is not supported (unfortunately). The option also makes things worse because of the synchronization between bd_wakeup() and bufdaemon through the bdlock mutex. With the FULL_PREEMPTION option, bufdaemon is woken up, scheduled, and contested immediately, so the process which called bd_wakeup() must be scheduled again just to release bdlock, and only then is bufdaemon scheduled again. The following would help:

  critical_enter();
  wakeup(&bd_request);
  mtx_unlock(&bdlock);
  critical_exit();

In the UP case it is always OK. In the SMP case it's better than nothing.

Now, a new story. I compiled the kernel without the FULL_PREEMPTION option and made a ramdisk out of the flash disk, so no disk sector flush happened during the copy. Only the VFS and the filesystem remained under investigation. I already mentioned VFS clusters and disk fragmentation. My flash disk is almost not fragmented, so even 64 KiB clusters are written at once. The new finding is that clusters are bawrite'd. I did not investigate it fully, but (struct mount *)->mnt_iosize_max limits the maximal cluster size. This bawriting of clusters keeps numdirtybuffers low and makes things better. It is still about how to exclude bufdaemon from dirty buffer flushing.

I also mentioned the buffer size the copy program uses. A smaller buffer size lowers the probability that bufdaemon gets involved. However, once the described issue starts ... I left dirtybufthresh tuning and started tuning mnt_iosize_max and the copy program buffer size instead. On my machine, the default mnt_iosize_max is MAXPHYS = 128 KiB. Just as a reminder, 68608 bytes is my default high-water dirty buffer byte count. Results without the FULL_PREEMPTION option, with dirtybufthresh=120, a file size of 2235517 bytes, and the ramdisk:

mnt_iosize_max = 128 KiB
------------------------
cp buffer size:          128       64       32       16   (KiB)
real copy time:       19.234   15.708   15.700   16.054
cp runtime:           10.978    9.837    9.895    9.995
bufdaemon runtime:      2.93     1.14     0.97     1.12
ramdisk runtime:        3.08     2.98     2.99     3.00
qsleep:                  172       46       37       47
write_count:             473      284      283      292
write_size:          2439680  2336768  2337280  2341888
dirtybufferflushes:      134        5       25       25
altbufferflushes:          1        0        0        0

dirtybufferflushes and altbufferflushes are kernel globals. As I said, the cp buffer size just lowers the probability. The bufdaemon runtime is significant with respect to the ramdisk runtime. Here mnt_iosize_max is bigger than the high-water byte count mentioned above.

mnt_iosize_max = 64 KiB
------------------------
cp buffer size:          128       64       32       16   (KiB)
real copy time:       17.566   15.669   15.708   15.543
cp runtime:            9.934    9.632    9.730    9.797
bufdaemon runtime:      2.40     1.18     1.10     0.90
ramdisk runtime:        3.09     2.99     3.00     2.98
qsleep:                  137       53       43       30
write_count:             451      291      280      269
write_size:          2421760  2339840  2334720  2329088
dirtybufferflushes:      134       12        5        6
altbufferflushes:          1        0        0        0

bufdaemon is excluded a little, but the dirtybufthresh=120 path is not: 120 * 512 = 61440 bytes is still less than mnt_iosize_max, so it remains in the game (see dirtybufferflushes).
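The 32 KiB case below then takes the dirtybufthresh path out of the game as well. Just to restate the arithmetic already used above as a quick back-of-the-envelope check (nothing kernel specific, only the byte counts):

  #include <stdio.h>

  int
  main(void)
  {
          long bsize = 512;               /* filesystem block size */
          long dirtybufthresh = 120;      /* in buffers */
          long thresh_bytes = dirtybufthresh * bsize;     /* 61440 */
          long iosize_max[] = { 128 * 1024, 64 * 1024, 32 * 1024 };

          for (int i = 0; i < 3; i++)
                  printf("mnt_iosize_max=%6ld: a full cluster %s the "
                      "%ld-byte dirtybufthresh\n", iosize_max[i],
                      iosize_max[i] < thresh_bytes ?
                      "stays under" : "can exceed", thresh_bytes);
          return (0);
  }

Only with mnt_iosize_max = 32 KiB does a full cluster stay under the dirtybufthresh byte count, so the cluster layer bawrites the data before either bufdaemon or the dirtybufthresh flushing can kick in.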
mnt_iosize_max = 32 KiB
------------------------
cp buffer size:          128       64       32       16   (KiB)
real copy time:       14.580   14.327   14.491   14.609
cp runtime:            9.864    9.680    9.732    9.841
bufdaemon runtime:      0.04     0.01     0.01     0.03
ramdisk runtime:        2.98     2.98     2.97     2.99
qsleep:                    2        0        0        2
write_count:             213      206      194      191
write_size:          2266112  2262528  2256384  2254848
dirtybufferflushes:        0        0        0        0
altbufferflushes:          0        0        0        0

Both bufdaemon and the dirtybufthresh path are excluded now; dirty buffers are bawrite'd by the VFS cluster layer. All times and results are typical ones (not statistical averages).

The issue is not so visible when a ramdisk is used. If the cp process sleeps because a buffer is being bawrite'd, which can be slow on my flash disk, the bufdaemon runtime can grow very much. On a high-speed CPU, the issue can be hidden totally in the system workload noise. However, on my machine the issue exists without any doubt.

I think it's clear how to try to reproduce my issue. In the FreeBSD kernel, MAXPHYS = 128 KiB is the maximal mnt_iosize_max value. hidirtybuffers, dirtybufthresh, and lodirtybuffers, together with the VFS buffer size, must be set with respect to mnt_iosize_max: expressed in bytes, they must be lower than mnt_iosize_max.

I know that my machine configuration and utilization is a special one, so I'm not pushing for any change. Once again, I'm just sharing my experience. However, for embedded platforms this can be more common than one might think.

Svata