On 2015-01-27 09:24:13, Zhang Haoyu wrote:
>
> On 2015-01-26 22:11:59, Max Reitz wrote:
> >On 2015-01-26 at 08:20, Zhang Haoyu wrote:
> > > Hi, all
> > >
> > > Regarding too large qcow2 image, e.g., 2TB,
> > > so long disruption happened when performing snapshot,
> > > which was caused by cache update and IO wait.
> > > perf top data shown as below,
> > >    PerfTop:    2554 irqs/sec  kernel: 0.4%  exact:  0.0% [4000Hz cycles],  (target_pid: 34294)
> > > ------------------------------------------------------------------------------------------------------------------------
> > >
> > >     33.80%  qemu-system-x86_64  [.] qcow2_cache_do_get
> > >     27.59%  qemu-system-x86_64  [.] qcow2_cache_put
> > >     15.19%  qemu-system-x86_64  [.] qcow2_cache_entry_mark_dirty
> > >      5.49%  qemu-system-x86_64  [.] update_refcount
> > >      3.02%  libpthread-2.13.so  [.] pthread_getspecific
> > >      2.26%  qemu-system-x86_64  [.] get_refcount
> > >      1.95%  qemu-system-x86_64  [.] coroutine_get_thread_state
> > >      1.32%  qemu-system-x86_64  [.] qcow2_update_snapshot_refcount
> > >      1.20%  qemu-system-x86_64  [.] qemu_coroutine_self
> > >      1.16%  libz.so.1.2.7       [.] 0x0000000000003018
> > >      0.95%  qemu-system-x86_64  [.] qcow2_update_cluster_refcount
> > >      0.91%  qemu-system-x86_64  [.] qcow2_cache_get
> > >      0.76%  libc-2.13.so        [.] 0x0000000000134e49
> > >      0.73%  qemu-system-x86_64  [.] bdrv_debug_event
> > >      0.16%  qemu-system-x86_64  [.] pthread_getspecific@plt
> > >      0.12%  [kernel]            [k] _raw_spin_unlock_irqrestore
> > >      0.10%  qemu-system-x86_64  [.] vga_draw_line24_32
> > >      0.09%  [vdso]              [.] 0x000000000000060c
> > >      0.09%  qemu-system-x86_64  [.] qcow2_check_metadata_overlap
> > >      0.08%  [kernel]            [k] do_blockdev_direct_IO
> > >
> > > If expand the cache table size, the IO will be decreased,
> > > but the calculation time will be grown.
> > > so it's worthy to optimize qcow2 cache get and put algorithm.
> > >
> > > My proposal:
> > > get:
> > > using ((use offset >> cluster_bits) % c->size) to locate the cache entry,
> > > raw implementation,
> > > index = (use offset >> cluster_bits) % c->size;
> > > if (c->entries[index].offset == offset) {
> > >     goto found;
> > > }
> > >
> > > replace:
> > > c->entries[(use offset >> cluster_bits) % c->size].offset = offset;
> >
> > Well, direct-mapped caches do have their benefits, but remember that
> > they do have disadvantages, too. Regarding CPU caches, set associative
> > caches seem to be largely favored, so that may be a better idea.
> >
> Thanks, Max,
> I think if direct-mapped caches were used, we can expand the cache table size
> to decrease IOs, and cache location is not time-expensive even cpu cache miss
> happened.
> Of course set associative caches is preferred regarding cpu caches,
> but sequential traverse algorithm only provides more probability
> for association, but after running some time, the probability
> of association maybe reduced, I guess.
> I will test the direct-mapped cache, and test result will be posted soon.
>
I've tested the direct-mapped cache: conflicts in cache location caused
about 4000 IOs while taking a snapshot of a 2TB thin-provisioned qcow2 image.
But the overhead of qcow2_cache_do_get() decreased significantly,
from 33.80% to 10.43%.
I'll try a two-dimensional cache to eliminate most of those IOs, possibly all
of them, with 4 as the default size of the second dimension.
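
A minimal sketch of the lookup discussed above, not the actual qcow2 code: the set is chosen with the same (offset >> cluster_bits) % size hash as the direct-mapped proposal, and each set holds 4 ways (the "second dimension"). All names (SketchCache, SketchEntry, sketch_cache_get, SKETCH_WAYS) are hypothetical, the per-set LRU replacement is an assumption, and reading or flushing tables is left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the qcow2 cache structures. */
#define SKETCH_WAYS 4               /* size of the "second dimension" */

typedef struct SketchEntry {
    uint64_t offset;                /* cluster-aligned offset of the cached table */
    bool     valid;
    uint64_t last_use;              /* pseudo-timestamp for LRU within one set */
} SketchEntry;

typedef struct SketchCache {
    SketchEntry *entries;           /* num_sets * SKETCH_WAYS entries */
    size_t       num_sets;
    unsigned     cluster_bits;
    uint64_t     clock;             /* bumped on every access */
} SketchCache;

/* Same hash as the direct-mapped proposal, but it selects a set, not one slot. */
static size_t sketch_set_index(const SketchCache *c, uint64_t offset)
{
    return (size_t)((offset >> c->cluster_bits) % c->num_sets);
}

/*
 * Look up offset. On a hit, *hit is true. On a miss, *hit is false and an
 * empty way (or, failing that, the least recently used way of the set) is
 * claimed for offset. A real cache would flush a dirty victim and read the
 * table from disk at that point; this sketch only tracks the bookkeeping.
 */
static SketchEntry *sketch_cache_get(SketchCache *c, uint64_t offset, bool *hit)
{
    SketchEntry *set = &c->entries[sketch_set_index(c, offset) * SKETCH_WAYS];
    SketchEntry *victim = &set[0];

    for (int i = 0; i < SKETCH_WAYS; i++) {
        if (set[i].valid && set[i].offset == offset) {
            set[i].last_use = ++c->clock;
            *hit = true;
            return &set[i];
        }
        if (victim->valid &&
            (!set[i].valid || set[i].last_use < victim->last_use)) {
            victim = &set[i];       /* prefer an empty way, else the LRU one */
        }
    }

    victim->offset = offset;
    victim->valid = true;
    victim->last_use = ++c->clock;
    *hit = false;
    return victim;
}
```

With SKETCH_WAYS set to 1 this degenerates to the direct-mapped scheme, where two offsets that hash to the same index always evict each other; with 4 ways, up to four such offsets can stay cached at the cost of comparing at most four entries per lookup.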
Any ideas?

> > CC'ing Kevin, because it's his code.
> >
> > Max
> >
> > > ...
> > >
> > > put:
> > > using 64-entries cache table to cache
> > > the recently got c->entries, i.e., cache for cache,
> > > then during put process, firstly search the 64-entries cache,
> > > if not found, then the c->entries.
I've tried a c->last_used_cache pointer for the most recently got cache entry;
the overhead of qcow2_cache_put() decreased significantly, from 27.59% to 5.38%.
I've also traced the c->last_used_cache miss rate: absolutely zero.
I'll test again.

> > >
> > > Any idea?
> > >
> > > Thanks,
> > > Zhang Haoyu
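
The c->last_used_cache idea described above can be sketched as follows. This is an illustration under assumptions, not the real qcow2_cache_put(): HintCache, HintEntry, hint_cache_take and hint_cache_put are hypothetical names, and the sketch assumes that put identifies an entry by its table pointer and otherwise has to scan all entries. The two counters are only there so the hint miss rate can be traced.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the qcow2 cache structures. */
typedef struct HintEntry {
    void *table;                    /* cached table buffer handed to the caller */
    int   ref;                      /* number of users currently holding it */
} HintEntry;

typedef struct HintCache {
    HintEntry    *entries;
    size_t        size;
    HintEntry    *last_used;        /* analogue of c->last_used_cache */
    unsigned long hint_hits;
    unsigned long hint_misses;
} HintCache;

/* Called by the get path once it has located the entry: remember it. */
static void *hint_cache_take(HintCache *c, HintEntry *e)
{
    e->ref++;
    c->last_used = e;
    return e->table;
}

/*
 * Release a table. The most recently taken entry is checked first; only if
 * that hint does not match do we fall back to scanning every entry.
 * Returns 0 on success, -1 if the table does not belong to the cache.
 */
static int hint_cache_put(HintCache *c, void *table)
{
    HintEntry *e = c->last_used;

    if (e != NULL && e->table == table) {
        c->hint_hits++;
    } else {
        c->hint_misses++;
        e = NULL;
        for (size_t i = 0; i < c->size; i++) {
            if (c->entries[i].table == table) {
                e = &c->entries[i];
                break;
            }
        }
        if (e == NULL) {
            return -1;
        }
    }

    e->ref--;
    return 0;
}
```

When every get is followed directly by the matching put, the hint always matches and the linear scan is never reached; comparing hint_misses against hint_hits is one way to check whether that holds for a given workload.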