Hi Stephen, hackers,
>> > With all those 'readahead' calls it certainly makes one wonder if the
>> > Linux kernel is reading more than just the block we're looking for
>> > because it thinks we're doing a sequential read and will therefore want
>> > the next few blocks when, in reality, we're going to skip past them,
>> > meaning that any readahead the kernel is doing is likely just wasted
>> > I/O.
>> I've done some quick&dirty tests with blockdev --setra/setfra 0 after
>> spending time looking at the smgr/md/fd API changes required to find
>> shortcut, but I'm getting actually a little bit worse timings at least on
>> "laptop DB tests". One thing that I've noticed is that needs to be only for
>> automatic-analyze, but not for automatic-vacuum where apparently there is
>> some boost due to readahead.
>Interesting that you weren't seeing any benefit to disabling readahead.
I've got some free minutes and I have repeated the exercise in more realistic
and strict environment that previous one to conclude that the current situation
is preferable:
Analyzed table was having 171GB (as reported by \dt+) + indexes: 35GB, 147GB,
35GB, 65GB (as reported by \di+)
Linux kernel 4.14.x, 2x NVME under dm-0 (it might matter as /dev/dm-0 might is
different layer and might have different storage settings), VG on top of dm-0,
LV with stripe-size 8kB, ext4.
s_b=128MB, RAM=128GB (- ~30GB which were reserved for HugePages), typical
output of PgSQL12:
INFO: "x": scanned 1500000 of 22395442 pages, containing 112410444 live rows
and 0 dead rows; 1500000 rows in sample, 1678321053 estimated total rows
Hot VFS cache:
Run0: Defaults, default RA on dm-1=256 (*512=128kB), most of the time is spent
heapam_scan_analyze_next_block() -> .. -> pread() which causes ~70..80MB/s as
reported by pidstat, maximum 22-25% CPU, ~8k IOPS in iostat with average
request size per IO=25 sectors(*512/1024 = ~12kB), readahead on, hot caches,
total elapsed ~3m
Run1: Defaults, similar as above (hot VFS cache), total elapsed 2m:50s
Run2: Defaults, similar as above (hot VFS cache), total elapsed 2m:42s
Run3: Defaults, miliaria as above (hot VFS cache), total elapsed 2m:40s
No VFS cache:
Run4: echo 3 > drop_caches, still with s_b=128MB: maximum 18-23% CPU, ~70MB/s
read, ondemand_readahead visible in perf, total elapsed 3m30s
Run5: echo 3 > drop_caches, still with s_b=128MB: same as above, total elapsed
3m29s
Run6: echo 3 > drop_caches, still with s_b=128MB: same as above, total elapsed
3m28s
No VFS cache, readahead off:
Run7: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 /dev/dm-0:
reads at 33MB/s, ~13% CPU, 8.7k read IOPS @ avgrq-sz = 11 sectors (*512=5.5kB),
total elapsed 5m59s
Run8: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 /dev/dm-0:
as above, double-confirmed no readaheads [
pread()->generic_file_read_iter()->ext4_mpage_readpages()-> bio.. ], total
elapsed 5m56s
Run9: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 /dev/dm-0:
as above, total elapsed 5m55s
One thing not clear here is maybe in future worth measuring how striped LVs are
being
affected by readaheads.
>Were you able to see where the time in the kernel was going when
>readahead was turned off for the ANALYZE?
Yes, my interpretation is that the time spent goes into directly block I/O
layer waiting.
54.67% 1.33% postgres postgres [.] FileRead
---FileRead
--53.33%--__pread_nocancel
--50.67%--entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_pread64
|--49.33%--vfs_read
| --48.00%--__vfs_read
|
|--45.33%--generic_file_read_iter
| |
|--42.67%--ondemand_readahead
| | |
__do_page_cache_readahead
| | |
|--25.33%--ext4_mpage_readpages
| | |
| |--10.67%--submit_bio
| | |
| | generic_make_request
| | |
| | |--8.00%--blk_mq_make_request
| | |
| | | |--4.00%--blk_mq_get_request
| | |
| | | | |--1.33%--blk_mq_get_tag
| | |
| | | | --1.33%--sched_clock
| | |
| | | | xen_sched_clock
| | |
| | | | pvclock_clocksource_read
| | |
| | | |--1.33%--bio_integrity_prep
| | |
| | | --1.33%--blk_account_io_start
| | |
| | | part_round_stats
| | |
| | | blk_mq_in_flight
| | |
| | | blk_mq_queue_tag_busy_iter
| | |
| | --2.67%--dm_make_request
| | |
| | __split_and_process_bio
| | |
| | __split_and_process_non_flush
| | |
| | |--1.33%--__map_bio
| | |
| | | generic_make_request
| | |
| | | generic_make_request_checks
| | |
| | | percpu_counter_add_batch
| | |
| | --1.33%--bio_alloc_bioset
| | |
| | mempool_alloc
| | |
| | kmem_cache_alloc
| | |
| |--6.67%--ext4_map_blocks
| | |
| | |--4.00%--ext4_es_lookup_extent
| | |
| | | --2.67%--_raw_read_lock
| | |
| | --2.67%--__check_block_validity.constprop.81
| | |
| | ext4_data_block_valid
| | |
| --6.67%--add_to_page_cache_lru
| | |
| |--4.00%--__add_to_page_cache_locked
| | |
| | --1.33%--mem_cgroup_try_charge
| | |
| | get_mem_cgroup_from_mm
| | |
| --2.67%--__lru_cache_add
| | |
| pagevec_lru_move_fn
| | |
| __lock_text_start
| | |
|--12.00%--blk_finish_plug
| | |
| blk_flush_plug_list
| | |
| blk_mq_flush_plug_list
| | |
| |--10.67%--__blk_mq_delay_run_hw_queue
| | |
| | __blk_mq_run_hw_queue
| | |
| | blk_mq_sched_dispatch_requests
| | |
| | --9.33%--blk_mq_dispatch_rq_list
| | |
| | nvme_queue_rq
| | |
| | --1.33%--blk_mq_start_request
>The VACUUM case is going to be complicated by what's in the visibility
>map. (..)
After observing the ANALYZE readahead behavior benefit I've abandoned
the case of testing much more advanced VACUUM processing, clearly Linux
read-ahead is beneficial in even simple cases.
>> My only idea would be that a lot of those blocks could be read
>> asynchronously in batches (AIO) with POSIX_FADV_RANDOM issued on block-range
>> before, so maybe the the optimization is possible, but not until we'll have
>> AIO ;)
>
> (..)AIO is a whole other animal that's been discussed off and on
>around here but it's a much larger and more invasive change than just
>calling posix_fadvise().
Yes, I'm aware and I'm keeping my fingers crossed that maybe some day....
The ANALYZE just seem fit to be natural candidate to use it. The only easy
chance
of acceleration of stats gathering - at least to me and enduser point of view -
is to have more parallel autoanalyze workers running to drive more I/O
concurrency
(by e.g. partitioning the data), both in readahead and non-readahead
scenarios.
Which is a pity because 70-80% of such process sits idle. The readahead might
read
10x more unnecessary data, but pread() doesn't have to wait. <speculation>Once
AIO
would be it could throw thousands of requests without readahead and achieve
much
better efficiency probably</speculation>
I hope the previous simple patch goes into master and helps other people
understand
the picture more easily.
-J.