Greetings,

* Jakub Wartak (jakub.war...@tomtom.com) wrote:
> I have what I hope is an interesting observation (and a nano patch
> proposal) from a system where statistics freshness is a critical factor.
> Autovacuum/statistics auto-gathering was tuned to be very aggressive:
> autovacuum_vacuum_cost_delay=0 (makes autovacuum_vacuum_cost_limit irrelevant)
> autovacuum_naptime=1s
> autovacuum_max_workers=4
> 
> Some critical table partitions are configured with
> autovacuum_analyze_scale_factor=0.001 and autovacuum_analyze_threshold=50000
> to force auto-analyze to run pretty often. The interesting log entries
> look like this:
> automatic analyze of table "t1" system usage: CPU: user: 37.52 s, system: 23.01 s, elapsed: 252.14 s
> automatic analyze of table "t2" system usage: CPU: user: 38.70 s, system: 22.63 s, elapsed: 317.33 s
> automatic analyze of table "t2" system usage: CPU: user: 39.38 s, system: 21.43 s, elapsed: 213.58 s
> automatic analyze of table "t1" system usage: CPU: user: 37.91 s, system: 24.49 s, elapsed: 229.45 s

That's certainly pretty aggressive. :)

> and this is the root cause of my question. As you can see, there is a
> huge 3x-4x discrepancy between "elapsed" and user+system, which is
> strange, at least to me, as there should be no waiting
> (autovacuum_vacuum_cost_delay=0). Various tools confirm it: the time
> went somewhere else, but not into the PostgreSQL analyze itself. ps(1)
> and pidstat(1) report the same for the worker:

The user/system time there is time-on-CPU (hence the 'CPU: ' prefix).

> 06:56:12 AM       PID    %usr %system  %guest    %CPU   CPU  Command
> 06:56:13 AM    114774    8.00   10.00    0.00   18.00    18  postgres
> 06:56:14 AM    114774    8.00   11.00    0.00   19.00    15  postgres
> 06:56:15 AM    114774    5.00   13.00    0.00   18.00    18  postgres
> 
> 06:56:17 AM       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
> 06:56:18 AM    114774  63746.53      0.00      0.00  postgres
> 06:56:19 AM    114774  62896.00      0.00      0.00  postgres
> 06:56:20 AM    114774  62920.00      0.00      0.00  postgres
> 
> One could argue that such an autoanalyze worker could perform 5x more
> work here (%CPU -> 100%). The I/O system is not doing much (total =
> 242MB/s reads @ 22k IOPS, 7MB/s writes @ 7k IOPS, with 0.04ms service
> time); although it reports high utilization, I'm pretty sure it could
> push much more. There can be up to 3x-4x such 70-80MB/s analyzes running
> in parallel (let's say 300MB/s for statistics collection alone).

The analyze is doing more-or-less random i/o since it's skipping through
the table picking out select blocks, not doing regular sequential i/o.
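
Just to illustrate the access pattern, here's a rough standalone sketch
of the selection-sampling idea (my own illustration, not the actual
BlockSampler code in analyze.c): with relpages=26763525 and
targrows=1500000 as in your backtrace, roughly every ~18th block ends up
being read, in ascending but irregularly spaced order, which is exactly
the kind of pattern that defeats sequential readahead.

#include <stdio.h>
#include <stdlib.h>

/*
 * Standalone sketch of selection sampling (Knuth's Algorithm S): pick
 * nsample block numbers out of nblocks in ascending order.  The chosen
 * blocks come out sorted but irregularly spaced, which is roughly the
 * read pattern acquire_sample_rows() generates.
 */
static void
sample_blocks(long nblocks, long nsample)
{
	long		selected = 0;

	for (long blk = 0; blk < nblocks && selected < nsample; blk++)
	{
		/* keep this block with probability (nsample - selected) / (nblocks - blk) */
		if ((double) rand() / ((double) RAND_MAX + 1) * (nblocks - blk) <
			(double) (nsample - selected))
		{
			printf("would read block %ld\n", blk);
			selected++;
		}
	}
}

int
main(void)
{
	/* small numbers just to show the skip pattern */
	sample_blocks(1000, 50);
	return 0;
}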

> According to various gdb backtraces, a lot of time is spent here:
> #0  0x00007f98cdfc9f73 in __pread_nocancel () from /lib64/libpthread.so.0
> #1  0x0000000000741a16 in pread (__offset=811253760, __nbytes=8192, __buf=0x7f9413ab7280, __fd=<optimized out>) at /usr/include/bits/unistd.h:84
> #2  FileRead (file=<optimized out>, buffer=0x7f9413ab7280 "\037\005", amount=8192, offset=811253760, wait_event_info=167772173) at fd.c:1883
> #3  0x0000000000764b8f in mdread (reln=<optimized out>, forknum=<optimized out>, blocknum=19890902, buffer=0x7f9413ab7280 "\037\005") at md.c:596
> #4  0x000000000073d69b in ReadBuffer_common (smgr=<optimized out>, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=19890902, mode=RBM_NORMAL, strategy=0x1102278, hit=0x7fffba7e2d4f) at bufmgr.c:897
> #5  0x000000000073e27e in ReadBufferExtended (reln=0x7f98c0c9ded0, forkNum=MAIN_FORKNUM, blockNum=19890902, mode=<optimized out>, strategy=<optimized out>) at bufmgr.c:665
> #6  0x00000000004c2e2f in heapam_scan_analyze_next_block (scan=<optimized out>, blockno=<optimized out>, bstrategy=<optimized out>) at heapam_handler.c:998
> #7  0x0000000000597de1 in table_scan_analyze_next_block (bstrategy=<optimized out>, blockno=<optimized out>, scan=0x10c8098) at ../../../src/include/access/tableam.h:1462
> #8  acquire_sample_rows (onerel=0x7f98c0c9ded0, elevel=13, rows=0x1342e08, targrows=1500000, totalrows=0x7fffba7e3160, totaldeadrows=0x7fffba7e3158) at analyze.c:1048
> #9  0x0000000000596a50 in do_analyze_rel (onerel=0x7f98c0c9ded0, params=0x10744e4, va_cols=0x0, acquirefunc=0x597ca0 <acquire_sample_rows>, relpages=26763525, inh=false, in_outer_xact=false, elevel=13) at analyze.c:502
> [..]
> #12 0x00000000006e76b4 in autovacuum_do_vac_analyze (bstrategy=0x1102278, tab=<optimized out>) at autovacuum.c:3118
> [..]

Sure, we're blocked on a read call trying to get the next block.

> The interesting thing is that targrows=1.5M and that blocks are accessed
> (as expected) in sorted order:
> 
> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890910, bstrategy=0x1102278) at heapam_handler.c:984
> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890912, bstrategy=0x1102278) at heapam_handler.c:984
> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890922, bstrategy=0x1102278) at heapam_handler.c:984
> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890935, bstrategy=0x1102278) at heapam_handler.c:984
> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890996, bstrategy=0x1102278) at heapam_handler.c:984

Not really sure what's interesting here, but it does look like we're
skipping through the table as expected.

> And this probably explains the discrepancy: perf with CPU usage
> reporting shows a lot of frames waiting on I/O in readahead done by
> ext4, trimmed for clarity:
> 
> # Children      Self       sys       usr  Command   Shared Object       Symbol
>     63.64%     0.00%     0.00%     0.00%  postgres  [kernel.kallsyms]   [k] entry_SYSCALL_64_after_hwframe
>             ---entry_SYSCALL_64_after_hwframe
>                do_syscall_64
>                |--59.66%--sys_pread64
>                |          vfs_read
>                |           --59.09%--__vfs_read
>                |                      --58.24%--generic_file_read_iter
>                |                                |--47.44%--ondemand_readahead
>                |                                |           --46.88%--__do_page_cache_readahead
>                |                                |                     |--32.67%--ext4_mpage_readpages
>                |                                |                     |          |--16.76%--submit_bio
>                |                                |                     |--10.23%--blk_finish_plug
> [..]
>     63.64%     1.99%     1.99%     0.00%  postgres  [kernel.kallsyms]   [k] do_syscall_64
>             |--61.65%--do_syscall_64
>             |          |--59.66%--sys_pread64
>             |          |          vfs_read
>             |          |           --59.09%--__vfs_read
>             |          |                      --58.24%--generic_file_read_iter
>             |          |                                |--47.44%--ondemand_readahead
>             |          |                                |           --46.88%--__do_page_cache_readahead
> 
>     61.36%     0.00%     0.00%     0.00%  postgres  postgres            [.] FileRead
>             ---FileRead
>                __pread_nocancel
>                 --60.51%--entry_SYSCALL_64_after_hwframe
>                           do_syscall_64
>                            --59.66%--sys_pread64
>                                      vfs_read
>                                       --59.09%--__vfs_read
>                                                  --58.24%--generic_file_read_iter
>                                                            |--47.44%--ondemand_readahead
>                                                            |           --46.88%--__do_page_cache_readahead
> 
>     61.36%     0.85%     0.00%     0.85%  postgres  libpthread-2.17.so  [.] __pread_nocancel
>             |--60.51%--__pread_nocancel
>             |          entry_SYSCALL_64_after_hwframe
>             |          do_syscall_64
>             |           --59.66%--sys_pread64
>             |                     vfs_read
>             |                      --59.09%--__vfs_read
>             |                                 --58.24%--generic_file_read_iter
>             |                                           |--47.44%--ondemand_readahead
>             |                                           |           --46.88%--__do_page_cache_readahead
> 
>     59.66%     0.00%     0.00%     0.00%  postgres  [kernel.kallsyms]   [k] sys_pread64
>             ---sys_pread64
>                vfs_read
>                 --59.09%--__vfs_read
>                            --58.24%--generic_file_read_iter
>                                      |--47.44%--ondemand_readahead
>                                      |           --46.88%--__do_page_cache_readahead
>                                      |                     |--32.67%--ext4_mpage_readpages
> 

With all those 'readahead' calls, it certainly makes one wonder if the
Linux kernel is reading more than just the block we're looking for
because it thinks we're doing a sequential read and will therefore want
the next few blocks.  In reality, we're going to skip past them, meaning
that any readahead the kernel is doing is likely just wasted I/O.
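
If you want to test that theory outside of PostgreSQL, something along
these lines might do (entirely my own sketch; the filename and block
spacing are placeholders): POSIX_FADV_RANDOM tells Linux not to do
readahead on the descriptor, so comparing pidstat's kB_rd/s for a run
with and without that one fadvise call should show how much of the
~60MB/s is speculative readahead rather than the 8kB blocks actually
requested.

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	char		buf[8192];
	int			fd = open("/path/to/relation/segment", O_RDONLY);	/* placeholder path */

	if (fd < 0)
	{
		perror("open");
		return 1;
	}

	/* comment this out to compare: advise the kernel not to read ahead */
	if (posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM) != 0)
		fprintf(stderr, "posix_fadvise failed\n");

	/* read every 64th 8kB block, a skip pattern similar to analyze's */
	for (off_t blk = 0; blk < 100000; blk += 64)
		if (pread(fd, buf, sizeof(buf), blk * 8192) < 0)
			perror("pread");

	close(fd);
	return 0;
}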

> [..] 
> perf --no-children also confirms (for the third time) that no single
> function is burning a lot of CPU inside the worker:
> 
> # Overhead       sys       usr  Command   Shared Object       Symbol
>      5.40%     0.00%     5.40%  postgres  [vdso]              [.] __vdso_clock_gettime
>      5.11%     0.00%     5.11%  postgres  postgres            [.] acquire_sample_rows
>             ---acquire_sample_rows
>      3.98%     0.00%     3.98%  postgres  postgres            [.] heapam_scan_analyze_next_tuple
>             ---heapam_scan_analyze_next_tuple
>      3.69%     3.69%     0.00%  postgres  [kernel.kallsyms]   [k] pvclock_clocksource_read

Sure, makes sense.

> My questions are:
> a) Does anybody know whether it is expected that getrusage() doesn't
> include readahead time in the current process's system time? (I don't
> know; maybe the readahead is performed by other kernel threads?)
> ru_stime is defined as "This is the total amount of time spent executing
> in kernel mode". Maybe "executing" is the keyword here? (waiting !=
> executing?)

getrusage()'s user/system CPU times report time-on-CPU and don't count
time spent blocked on i/o.  Waiting isn't the same as executing, no.
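
A trivial standalone way to see that (nothing PostgreSQL-specific, just
libc): block for two seconds without using the CPU and then look at what
getrusage() reports.  The user/system times stay near zero even though
two seconds of wall-clock time elapsed, and blocking in pread() while
the kernel performs the I/O is accounted the same way.

#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>

int
main(void)
{
	struct rusage ru;

	sleep(2);					/* two seconds of "elapsed", no CPU used */

	getrusage(RUSAGE_SELF, &ru);
	printf("user: %ld.%06ld s, system: %ld.%06ld s\n",
		   (long) ru.ru_utime.tv_sec, (long) ru.ru_utime.tv_usec,
		   (long) ru.ru_stime.tv_sec, (long) ru.ru_stime.tv_usec);
	return 0;
}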

> b) Initially I wanted to add a new pg_rusage_show_verbose() that would
> also report ru_inblock, but that wouldn't add much value for the end
> user. Adding another timing directly around
> table_scan_analyze_next_block() also seems like a bad idea, as it
> involves locking underneath. So I've tried the easiest approach: simply
> log pgStatBlockReadTime as strictly the I/O time spent in pread()
> (ReadBuffer_common() already measures it). The attached patch for
> PgSQL14-devel, under heavy I/O conditions (with track_io_timing=on),
> logs the following:
> "LOG:  automatic analyze of table "test.public.t1_default" system usage: IO read time 0.69 s, CPU: user: 0.18 s, system: 0.13 s, elapsed: 0.92 s"

That definitely seems like a useful thing to include and thanks for the
patch!  Please be sure to register it in the commitfest app:
https://commitfest.postgresql.org

> My interpretation would be that I/O read time was the most limiting
> factor (0.69/0.92 = 75%), while *CPU* time on the kernel side was just
> 0.13 s. It could give the end user/DBA the information needed to see
> where the bottleneck is, given autovacuum_vacuum_cost_delay=0. With
> autovacuum_vacuum_cost_delay>0, maybe it would also make sense to
> include the time spent sleeping?

Yeah, that would certainly be useful.

> c) I'm curious whether anybody has any I/O-related insights into
> analyze.c processing, especially related to readahead. E.g. maybe
> disabling readahead would help the PostgreSQL analyze.c use case on
> NVMe? Is it worth it, given that only x% of blocks are needed? The only
> option I'm aware of would be to e.g. hash-partition the table (to
> introduce parallelism across autovacuum workers and spread the work
> evenly). Any hints or comments?

I would think that, ideally, we'd teach analyze.c to work the same way
that bitmap heap scans do: use posix_fadvise to let the kernel know what
pages we're going to want next instead of the kernel guessing
(incorrectly) or not doing any prefetching at all.  I didn't spend a lot
of time poking, but it doesn't look like analyze.c tries to do any
prefetching today.  In a similar vein, I wonder if VACUUM should also be
doing prefetching, at least when it's skipping through the heap based on
the visibility map and jumping over all-frozen pages.
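
In userspace terms the idea looks roughly like the sketch below (my own
illustration of the concept, not the actual PrefetchBuffer /
effective_io_concurrency machinery; the filename, block list and
prefetch distance are made up): issue POSIX_FADV_WILLNEED for blocks we
already know we'll want a little ahead of pread()ing them, so the kernel
can have them in the page cache by the time we get there instead of
guessing at a sequential pattern.

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BLCKSZ			8192
#define PREFETCH_DIST	32		/* how far ahead to advise */

static void
read_sampled_blocks(int fd, const long *blocks, int nblocks)
{
	char		buf[BLCKSZ];

	for (int i = 0; i < nblocks; i++)
	{
		/* hint the block we'll want PREFETCH_DIST iterations from now */
		if (i + PREFETCH_DIST < nblocks)
			(void) posix_fadvise(fd, (off_t) blocks[i + PREFETCH_DIST] * BLCKSZ,
								 BLCKSZ, POSIX_FADV_WILLNEED);

		if (pread(fd, buf, BLCKSZ, (off_t) blocks[i] * BLCKSZ) < 0)
			perror("pread");
	}
}

int
main(void)
{
	long		blocks[1000];
	int			fd = open("/path/to/relation/segment", O_RDONLY);	/* placeholder path */

	if (fd < 0)
	{
		perror("open");
		return 1;
	}

	/* roughly every 18th block, similar to the sampling density above */
	for (int i = 0; i < 1000; i++)
		blocks[i] = (long) i * 18;

	read_sampled_blocks(fd, blocks, 1000);
	close(fd);
	return 0;
}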

> All of the above observations are from PostgreSQL 12.4 on Linux kernel
> 4.14 with ext4 on a striped dm across 3-4 NVMe drives.
> 
> -Jakub Wartak.

> diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
> index 8af12b5c6b..fea1bd6f44 100644
> --- a/src/backend/commands/analyze.c
> +++ b/src/backend/commands/analyze.c
> @@ -312,6 +312,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
>  	Oid			save_userid;
>  	int			save_sec_context;
>  	int			save_nestlevel;
> +	PgStat_Counter startblockreadtime = 0;
>  
>  	if (inh)
>  		ereport(elevel,
> @@ -347,6 +348,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
>  	if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
>  	{
>  		pg_rusage_init(&ru0);
> +		startblockreadtime = pgStatBlockReadTime;
>  		if (params->log_min_duration > 0)
>  			starttime = GetCurrentTimestamp();
>  	}
> @@ -686,10 +688,11 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
>  			TimestampDifferenceExceeds(starttime, GetCurrentTimestamp(),
>  									   params->log_min_duration))
>  			ereport(LOG,
> -					(errmsg("automatic analyze of table \"%s.%s.%s\" system usage: %s",
> +					(errmsg("automatic analyze of table \"%s.%s.%s\" system usage: IO read time %.2f s, %s",
>  							get_database_name(MyDatabaseId),
>  							get_namespace_name(RelationGetNamespace(onerel)),
>  							RelationGetRelationName(onerel),
> +							(double) (pgStatBlockReadTime - startblockreadtime)/1000000,
>  							pg_rusage_show(&ru0))));
>  	}
> 

Haven't looked too closely at this but in general +1 on the idea and
this approach looks pretty reasonable to me.  Only thing I can think of
off-hand is to check how it compares to other places where we report IO
read time and make sure that it looks similar.

Thanks,

Stephen
