Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

Nitin Jadhav Tue, 08 Mar 2022 07:01:46 -0800

> > [local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
> > -[ RECORD 1 ]-----+-------------------------------------
> > pid               | 22043
> > type              | checkpoint
> > kind              | immediate force wait requested time
> >
> > I think the output in the kind column can be displayed as {immediate,
> > force, wait, requested, time}. By the way these are all checkpoint
> > flags so it is better to display it as checkpoint flags instead of
> > checkpoint kind as mentioned in one of my previous comments.
>
> I will update in the next patch.


The current format matches with the server log message for the
checkpoint start in LogCheckpointStart(). Just to be consistent, I
have not changed the code.

I have taken care of the rest of the comments in v5 patch for which
there was clarity.

Thanks & Regards,
Nitin Jadhav

On Mon, Mar 7, 2022 at 8:15 PM Nitin Jadhav
<nitinjadhavpostg...@gmail.com> wrote:
>
> > +     <row>
> > +      <entry role="catalog_table_entry"><para role="column_definition">
> > +       <structfield>type</structfield> <type>text</type>
> > +      </para>
> > +      <para>
> > +       Type of checkpoint. See <xref linkend="checkpoint-types"/>.
> > +      </para></entry>
> > +     </row>
> > +
> > +     <row>
> > +      <entry role="catalog_table_entry"><para role="column_definition">
> > +       <structfield>kind</structfield> <type>text</type>
> > +      </para>
> > +      <para>
> > +       Kind of checkpoint. See <xref linkend="checkpoint-kinds"/>.
> > +      </para></entry>
> > +     </row>
> >
> > This looks a bit confusing. Two columns, one with the name "checkpoint
> > types" and another "checkpoint kinds". You can probably rename
> > checkpoint-kinds to checkpoint-flags and let the checkpoint-types be
> > as-it-is.
>
> Makes sense. I will change in the next patch.
> ---
>
> > +      
> > <entry><structname>pg_stat_progress_checkpoint</structname><indexterm><primary>pg_stat_progress_checkpoint</primary></indexterm></entry>
> > +      <entry>One row only, showing the progress of the checkpoint.
> >
> > Let's make this message consistent with the already existing message
> > for pg_stat_wal_receiver. See description for pg_stat_wal_receiver
> > view in "Dynamic Statistics Views" table.
>
> You want me to change "One row only" to "Only one row" ? If that is
> the case then for other views in the "Collected Statistics Views"
> table, it is referred as "One row only". Let me know if you are
> pointing out something else.
> ---
>
> > [local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
> > -[ RECORD 1 ]-----+-------------------------------------
> > pid               | 22043
> > type              | checkpoint
> > kind              | immediate force wait requested time
> >
> > I think the output in the kind column can be displayed as {immediate,
> > force, wait, requested, time}. By the way these are all checkpoint
> > flags so it is better to display it as checkpoint flags instead of
> > checkpoint kind as mentioned in one of my previous comments.
>
> I will update in the next patch.
> ---
>
> > [local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
> > -[ RECORD 1 ]-----+-------------------------------------
> > pid               | 22043
> > type              | checkpoint
> > kind              | immediate force wait requested time
> > start_lsn         | 0/14C60F8
> > start_time        | 2022-03-03 18:59:56.018662+05:30
> > phase             | performing two phase checkpoint
> >
> > This is the output I see when the checkpointer process has come out of
> > the two phase checkpoint and is currently writing checkpoint xlog
> > records and doing other stuff like updating control files etc. Is this
> > okay?
>
> The idea behind choosing the phases is based on the functionality
> which takes longer time to execute. Since after two phase checkpoint
> till post checkpoint cleanup won't take much time to execute, I have
> not added any additional phase for that. But I also agree that this
> gives wrong information to the user. How about mentioning the phase
> information at the end of each phase like "Initializing",
> "Initialization done", ..., "two phase checkpoint done", "post
> checkpoint cleanup done", .., "finalizing". Except for the first phase
> ("initializing") and last phase ("finalizing"), all the other phases
> describe the end of a certain operation. I feel this gives correct
> information even though the phase name/description does not represent
> the entire code block between two phases. For example if the current
> phase is ''two phase checkpoint done". Then the user can infer that
> the checkpointer has done till two phase checkpoint and it is doing
> other stuff that are after that. Thoughts?
> ---
>
> > The output of log_checkpoint shows the number of buffers written is 3
> > whereas the output of pg_stat_progress_checkpoint shows it as 0. See
> > below:
> >
> > 2022-03-03 20:04:45.643 IST [22043] LOG:  checkpoint complete: wrote 3
> > buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled;
> > write=24.652 s, sync=104.256 s, total=3889.625 s; sync files=2,
> > longest=0.011 s, average=0.008 s; distance=0 kB, estimate=0 kB
> >
> > --
> >
> > [local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
> > -[ RECORD 1 ]-----+-------------------------------------
> > pid               | 22043
> > type              | checkpoint
> > kind              | immediate force wait requested time
> > start_lsn         | 0/14C60F8
> > start_time        | 2022-03-03 18:59:56.018662+05:30
> > phase             | finalizing
> > buffers_total     | 0
> > buffers_processed | 0
> > buffers_written   | 0
> >
> > Any idea why this mismatch?
>
> Good catch. In BufferSync() we have 'num_to_scan' (buffers_total)
> which indicates the total number of buffers to be processed. Based on
> that, the 'buffers_processed' and 'buffers_written' counter gets
> incremented. I meant these values may reach upto 'buffers_total'. The
> current pg_stat_progress_view support above information. There is
> another place when 'ckpt_bufs_written' gets incremented (In
> SlruInternalWritePage()). This increment is above the 'buffers_total'
> value and it is included in the server log message (checkpoint end)
> and not included in the view. I am a bit confused here. If we include
> this increment in the view then we cannot calculate the exact
> 'buffers_total' beforehand. Can we increment the 'buffers_toal' also
> when 'ckpt_bufs_written' gets incremented so that we can match the
> behaviour with checkpoint end message?  Please share your thoughts.
> ---
>
> > I think we can add a couple of more information to this view -
> > start_time for buffer write operation and start_time for buffer sync
> > operation. These are two very time consuming tasks in a checkpoint and
> > people would find it useful to know how much time is being taken by
> > the checkpoint in I/O operation phase. thoughts?
>
> The detailed progress is getting shown for these 2 phases of the
> checkpoint like 'buffers_processed', 'buffers_written' and
> 'files_synced'. Hence I did not think about adding start time and If
> it is really required, then I can add.
>
> Thanks & Regards,
> Nitin Jadhav
>
> On Thu, Mar 3, 2022 at 8:30 PM Ashutosh Sharma <ashu.coe...@gmail.com> wrote:
> >
> > Here are some of my review comments on the latest patch:
> >
> > +     <row>
> > +      <entry role="catalog_table_entry"><para role="column_definition">
> > +       <structfield>type</structfield> <type>text</type>
> > +      </para>
> > +      <para>
> > +       Type of checkpoint. See <xref linkend="checkpoint-types"/>.
> > +      </para></entry>
> > +     </row>
> > +
> > +     <row>
> > +      <entry role="catalog_table_entry"><para role="column_definition">
> > +       <structfield>kind</structfield> <type>text</type>
> > +      </para>
> > +      <para>
> > +       Kind of checkpoint. See <xref linkend="checkpoint-kinds"/>.
> > +      </para></entry>
> > +     </row>
> >
> > This looks a bit confusing. Two columns, one with the name "checkpoint
> > types" and another "checkpoint kinds". You can probably rename
> > checkpoint-kinds to checkpoint-flags and let the checkpoint-types be
> > as-it-is.
> >
> > ==
> >
> > +      
> > <entry><structname>pg_stat_progress_checkpoint</structname><indexterm><primary>pg_stat_progress_checkpoint</primary></indexterm></entry>
> > +      <entry>One row only, showing the progress of the checkpoint.
> >
> > Let's make this message consistent with the already existing message
> > for pg_stat_wal_receiver. See description for pg_stat_wal_receiver
> > view in "Dynamic Statistics Views" table.
> >
> > ==
> >
> > [local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
> > -[ RECORD 1 ]-----+-------------------------------------
> > pid               | 22043
> > type              | checkpoint
> > kind              | immediate force wait requested time
> >
> > I think the output in the kind column can be displayed as {immediate,
> > force, wait, requested, time}. By the way these are all checkpoint
> > flags so it is better to display it as checkpoint flags instead of
> > checkpoint kind as mentioned in one of my previous comments.
> >
> > ==
> >
> > [local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
> > -[ RECORD 1 ]-----+-------------------------------------
> > pid               | 22043
> > type              | checkpoint
> > kind              | immediate force wait requested time
> > start_lsn         | 0/14C60F8
> > start_time        | 2022-03-03 18:59:56.018662+05:30
> > phase             | performing two phase checkpoint
> >
> >
> > This is the output I see when the checkpointer process has come out of
> > the two phase checkpoint and is currently writing checkpoint xlog
> > records and doing other stuff like updating control files etc. Is this
> > okay?
> >
> > ==
> >
> > The output of log_checkpoint shows the number of buffers written is 3
> > whereas the output of pg_stat_progress_checkpoint shows it as 0. See
> > below:
> >
> > 2022-03-03 20:04:45.643 IST [22043] LOG:  checkpoint complete: wrote 3
> > buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled;
> > write=24.652 s, sync=104.256 s, total=3889.625 s; sync files=2,
> > longest=0.011 s, average=0.008 s; distance=0 kB, estimate=0 kB
> >
> > --
> >
> > [local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
> > -[ RECORD 1 ]-----+-------------------------------------
> > pid               | 22043
> > type              | checkpoint
> > kind              | immediate force wait requested time
> > start_lsn         | 0/14C60F8
> > start_time        | 2022-03-03 18:59:56.018662+05:30
> > phase             | finalizing
> > buffers_total     | 0
> > buffers_processed | 0
> > buffers_written   | 0
> >
> > Any idea why this mismatch?
> >
> > ==
> >
> > I think we can add a couple of more information to this view -
> > start_time for buffer write operation and start_time for buffer sync
> > operation. These are two very time consuming tasks in a checkpoint and
> > people would find it useful to know how much time is being taken by
> > the checkpoint in I/O operation phase. thoughts?
> >
> > --
> > With Regards,
> > Ashutosh Sharma.
> >
> > On Wed, Mar 2, 2022 at 4:45 PM Nitin Jadhav
> > <nitinjadhavpostg...@gmail.com> wrote:
> > >
> > > Thanks for reviewing.
> > >
> > > > > > I suggested upthread to store the starting timeline instead.  This 
> > > > > > way you can
> > > > > > deduce whether it's a restartpoint or a checkpoint, but you can 
> > > > > > also deduce
> > > > > > other information, like what was the starting WAL.
> > > > >
> > > > > I don't understand why we need the timeline here to just determine
> > > > > whether it's a restartpoint or checkpoint.
> > > >
> > > > I'm not saying it's necessary, I'm saying that for the same space usage 
> > > > we can
> > > > store something a bit more useful.  If no one cares about having the 
> > > > starting
> > > > timeline available for no extra cost then sure, let's just store the 
> > > > kind
> > > > directly.
> > >
> > > Fixed.
> > >
> > > > 2) Can't we just have these checks inside CASE-WHEN-THEN-ELSE blocks
> > > > directly instead of new function pg_stat_get_progress_checkpoint_kind?
> > > > + snprintf(ckpt_kind, MAXPGPATH, "%s%s%s%s%s%s%s%s%s",
> > > > + (flags == 0) ? "unknown" : "",
> > > > + (flags & CHECKPOINT_IS_SHUTDOWN) ? "shutdown " : "",
> > > > + (flags & CHECKPOINT_END_OF_RECOVERY) ? "end-of-recovery " : "",
> > > > + (flags & CHECKPOINT_IMMEDIATE) ? "immediate " : "",
> > > > + (flags & CHECKPOINT_FORCE) ? "force " : "",
> > > > + (flags & CHECKPOINT_WAIT) ? "wait " : "",
> > > > + (flags & CHECKPOINT_CAUSE_XLOG) ? "wal " : "",
> > > > + (flags & CHECKPOINT_CAUSE_TIME) ? "time " : "",
> > > > + (flags & CHECKPOINT_FLUSH_ALL) ? "flush-all" : "");
> > >
> > > Fixed.
> > > ---
> > >
> > > > 5) Do we need a special phase for this checkpoint operation? I'm not
> > > > sure in which cases it will take a long time, but it looks like
> > > > there's a wait loop here.
> > > > vxids = GetVirtualXIDsDelayingChkpt(&nvxids);
> > > > if (nvxids > 0)
> > > > {
> > > > do
> > > > {
> > > > pg_usleep(10000L); /* wait for 10 msec */
> > > > } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids));
> > > > }
> > >
> > > Yes. It is better to add a separate phase here.
> > > ---
> > >
> > > > Also, how about special phases for SyncPostCheckpoint(),
> > > > SyncPreCheckpoint(), InvalidateObsoleteReplicationSlots(),
> > > > PreallocXlogFiles() (it currently pre-allocates only 1 WAL file, but
> > > > it might be increase in future (?)), TruncateSUBTRANS()?
> > >
> > > SyncPreCheckpoint() is just incrementing a counter and
> > > PreallocXlogFiles() currently pre-allocates only 1 WAL file. I feel
> > > there is no need to add any phases for these as of now. We can add in
> > > the future if necessary. Added phases for SyncPostCheckpoint(),
> > > InvalidateObsoleteReplicationSlots() and TruncateSUBTRANS().
> > > ---
> > >
> > > > 6) SLRU (Simple LRU) isn't a phase here, you can just say
> > > > PROGRESS_CHECKPOINT_PHASE_PREDICATE_LOCK_PAGES.
> > > > +
> > > > + pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
> > > > + PROGRESS_CHECKPOINT_PHASE_SLRU_PAGES);
> > > >  CheckPointPredicate();
> > > >
> > > > And :s/checkpointing SLRU pages/checkpointing predicate lock pages
> > > >+                      WHEN 9 THEN 'checkpointing SLRU pages'
> > >
> > > Fixed.
> > > ---
> > >
> > > > 7) 
> > > > :s/PROGRESS_CHECKPOINT_PHASE_FILE_SYNC/PROGRESS_CHECKPOINT_PHASE_PROCESS_FILE_SYNC_REQUESTS
> > >
> > > I feel PROGRESS_CHECKPOINT_PHASE_FILE_SYNC is a better option here as
> > > it describes the purpose in less words.
> > >
> > > > And :s/WHEN 11 THEN 'performing sync requests'/WHEN 11 THEN
> > > > 'processing file sync requests'
> > >
> > > Fixed.
> > > ---
> > >
> > > > 8) :s/Finalizing/finalizing
> > > > +                      WHEN 14 THEN 'Finalizing'
> > >
> > > Fixed.
> > > ---
> > >
> > > > 9) :s/checkpointing snapshots/checkpointing logical replication 
> > > > snapshot files
> > > > +                      WHEN 3 THEN 'checkpointing snapshots'
> > > > :s/checkpointing logical rewrite mappings/checkpointing logical
> > > > replication rewrite mapping files
> > > > +                      WHEN 4 THEN 'checkpointing logical rewrite 
> > > > mappings'
> > >
> > > Fixed.
> > > ---
> > >
> > > > 10) I'm not sure if it's discussed, how about adding the number of
> > > > snapshot/mapping files so far the checkpoint has processed in file
> > > > processing while loops of
> > > > CheckPointSnapBuild/CheckPointLogicalRewriteHeap? Sometimes, there can
> > > > be many logical snapshot or mapping files and users may be interested
> > > > in knowing the so-far-processed-file-count.
> > >
> > > I had thought about this while sharing the v1 patch and mentioned my
> > > views upthread. I feel it won't give meaningful progress information
> > > (It can be treated as statistics). Hence not included. Thoughts?
> > >
> > > > > > As mentioned upthread, there can be multiple backends that request a
> > > > > > checkpoint, so unless we want to store an array of pid we should 
> > > > > > store a number
> > > > > > of backend that are waiting for a new checkpoint.
> > > > >
> > > > > Yeah, you are right. Let's not go that path and store an array of
> > > > > pids. I don't see a strong use-case with the pid of the process
> > > > > requesting checkpoint. If required, we can add it later once the
> > > > > pg_stat_progress_checkpoint view gets in.
> > > >
> > > > I don't think that's really necessary to give the pid list.
> > > >
> > > > If you requested a new checkpoint, it doesn't matter if it's only your 
> > > > backend
> > > > that triggered it, another backend or a few other dozen, the result 
> > > > will be the
> > > > same and you have the information that the request has been seen.  We 
> > > > could
> > > > store just a bool for that but having a number instead also gives a bit 
> > > > more
> > > > information and may allow you to detect some broken logic on your 
> > > > client code
> > > > if it keeps increasing.
> > >
> > > It's a good metric to show in the view but the information is not
> > > readily available. Additional code is required to calculate the number
> > > of requests. Is it worth doing that? I feel this can be added later if
> > > required.
> > >
> > > Please find the v4 patch attached and share your thoughts.
> > >
> > > Thanks & Regards,
> > > Nitin Jadhav
> > >
> > > On Tue, Mar 1, 2022 at 2:27 PM Nitin Jadhav
> > > <nitinjadhavpostg...@gmail.com> wrote:
> > > >
> > > > > > 3) Why do we need this extra calculation for start_lsn?
> > > > > > Do you ever see a negative LSN or something here?
> > > > > > +    ('0/0'::pg_lsn + (
> > > > > > +        CASE
> > > > > > +            WHEN (s.param3 < 0) THEN pow((2)::numeric, 
> > > > > > (64)::numeric)
> > > > > > +            ELSE (0)::numeric
> > > > > > +        END + (s.param3)::numeric)) AS start_lsn,
> > > > >
> > > > > Yes: LSN can take up all of an uint64; whereas the pgstat column is a
> > > > > bigint type; thus the signed int64. This cast is OK as it wraps
> > > > > around, but that means we have to take care to correctly display the
> > > > > LSN when it is > 0x7FFF_FFFF_FFFF_FFFF; which is what we do here using
> > > > > the special-casing for negative values.
> > > >
> > > > Yes. The extra calculation is required here as we are storing unit64
> > > > value in the variable of type int64. When we convert uint64 to int64
> > > > then the bit pattern is preserved (so no data is lost). The high-order
> > > > bit becomes the sign bit and if the sign bit is set, both the sign and
> > > > magnitude of the value changes. To safely get the actual uint64 value
> > > > whatever was assigned, we need the above calculations.
> > > >
> > > > > > 4) Can't you use timestamptz_in(to_char(s.param4))  instead of
> > > > > > pg_stat_get_progress_checkpoint_start_time? I don't quite understand
> > > > > > the reasoning for having this function and it's named as 
> > > > > > *checkpoint*
> > > > > > when it doesn't do anything specific to the checkpoint at all?
> > > > >
> > > > > I hadn't thought of using the types' inout functions, but it looks
> > > > > like timestamp IO functions use a formatted timestring, which won't
> > > > > work with the epoch-based timestamp stored in the view.
> > > >
> > > > There is a variation of to_timestamp() which takes UNIX epoch (float8)
> > > > as an argument and converts it to timestamptz but we cannot directly
> > > > call this function with S.param4.
> > > >
> > > > TimestampTz
> > > > GetCurrentTimestamp(void)
> > > > {
> > > >     TimestampTz result;
> > > >     struct timeval tp;
> > > >
> > > >     gettimeofday(&tp, NULL);
> > > >
> > > >     result = (TimestampTz) tp.tv_sec -
> > > >         ((POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) * SECS_PER_DAY);
> > > >     result = (result * USECS_PER_SEC) + tp.tv_usec;
> > > >
> > > >     return result;
> > > > }
> > > >
> > > > S.param4 contains the output of the above function
> > > > (GetCurrentTimestamp()) which returns Postgres epoch but the
> > > > to_timestamp() expects UNIX epoch as input. So some calculation is
> > > > required here. I feel the SQL 'to_timestamp(946684800 +
> > > > (S.param4::float / 1000000)) AS start_time' works fine. The value
> > > > '946684800' is equal to ((POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) *
> > > > SECS_PER_DAY). I am not sure whether it is good practice to use this
> > > > way. Kindly share your thoughts.
> > > >
> > > > Thanks & Regards,
> > > > Nitin Jadhav
> > > >
> > > > On Mon, Feb 28, 2022 at 6:40 PM Matthias van de Meent
> > > > <boekewurm+postg...@gmail.com> wrote:
> > > > >
> > > > > On Sun, 27 Feb 2022 at 16:14, Bharath Rupireddy
> > > > > <bharath.rupireddyforpostg...@gmail.com> wrote:
> > > > > > 3) Why do we need this extra calculation for start_lsn?
> > > > > > Do you ever see a negative LSN or something here?
> > > > > > +    ('0/0'::pg_lsn + (
> > > > > > +        CASE
> > > > > > +            WHEN (s.param3 < 0) THEN pow((2)::numeric, 
> > > > > > (64)::numeric)
> > > > > > +            ELSE (0)::numeric
> > > > > > +        END + (s.param3)::numeric)) AS start_lsn,
> > > > >
> > > > > Yes: LSN can take up all of an uint64; whereas the pgstat column is a
> > > > > bigint type; thus the signed int64. This cast is OK as it wraps
> > > > > around, but that means we have to take care to correctly display the
> > > > > LSN when it is > 0x7FFF_FFFF_FFFF_FFFF; which is what we do here using
> > > > > the special-casing for negative values.
> > > > >
> > > > > As to whether it is reasonable: Generating 16GB of wal every second
> > > > > (2^34 bytes /sec) is probably not impossible (cpu <> memory bandwidth
> > > > > has been > 20GB/sec for a while); and that leaves you 2^29 seconds of
> > > > > database runtime; or about 17 years. Seeing that a cluster can be
> > > > > `pg_upgrade`d (which doesn't reset cluster LSN) since PG 9.0 from at
> > > > > least version PG 8.4.0 (2009) (and through pg_migrator, from 8.3.0)),
> > > > > we can assume that clusters hitting LSN=2^63 will be a reasonable
> > > > > possibility within the next few years. As the lifespan of a PG release
> > > > > is about 5 years, it doesn't seem impossible that there will be actual
> > > > > clusters that are going to hit this naturally in the lifespan of PG15.
> > > > >
> > > > > It is also possible that someone fat-fingers pg_resetwal; and creates
> > > > > a cluster with LSN >= 2^63; resulting in negative values in the
> > > > > s.param3 field. Not likely, but we can force such situations; and as
> > > > > such we should handle that gracefully.
> > > > >
> > > > > > 4) Can't you use timestamptz_in(to_char(s.param4))  instead of
> > > > > > pg_stat_get_progress_checkpoint_start_time? I don't quite understand
> > > > > > the reasoning for having this function and it's named as 
> > > > > > *checkpoint*
> > > > > > when it doesn't do anything specific to the checkpoint at all?
> > > > >
> > > > > I hadn't thought of using the types' inout functions, but it looks
> > > > > like timestamp IO functions use a formatted timestring, which won't
> > > > > work with the epoch-based timestamp stored in the view.
> > > > >
> > > > > > Having 3 unnecessary functions that aren't useful to the users at 
> > > > > > all
> > > > > > in proc.dat will simply eatup the function oids IMO. Hence, I 
> > > > > > suggest
> > > > > > let's try to do without extra functions.
> > > > >
> > > > > I agree that (1) could be simplified, or at least fully expressed in
> > > > > SQL without exposing too many internals. If we're fine with exposing
> > > > > internals like flags and type layouts, then (2), and arguably (4), can
> > > > > be expressed in SQL as well.
> > > > >
> > > > > -Matthias

Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

Reply via email to