Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)
+/* Kinds of checkpoint (as advertised via PROGRESS_CHECKPOINT_KIND) */
+#define PROGRESS_CHECKPOINT_KIND_WAL     0
+#define PROGRESS_CHECKPOINT_KIND_TIME    1
+#define PROGRESS_CHECKPOINT_KIND_FORCE   2
+#define PROGRESS_CHECKPOINT_KIND_UNKNOWN 3

On what basis have you classified the above into the various types of checkpoints? AFAIK, the first two types are based on what triggered the checkpoint (whether it was the checkpoint_timeout or max_wal_size settings), while the third type indicates the forced checkpoint that can happen when the checkpoint is triggered for various reasons, e.g. during createdb or dropdb etc. It is quite possible that both the PROGRESS_CHECKPOINT_KIND_TIME and PROGRESS_CHECKPOINT_KIND_FORCE flags are set for the checkpoint because multiple checkpoint requests are processed in one go, so what type of checkpoint would that be?

+*/
+ if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY)) == 0)
+ {
+     pgstat_progress_start_command(PROGRESS_COMMAND_CHECKPOINT, InvalidOid);
+     checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_PHASE,
+                                      PROGRESS_CHECKPOINT_PHASE_INIT);
+     if (flags & CHECKPOINT_CAUSE_XLOG)
+         checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_KIND,
+                                          PROGRESS_CHECKPOINT_KIND_WAL);
+     else if (flags & CHECKPOINT_CAUSE_TIME)
+         checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_KIND,
+                                          PROGRESS_CHECKPOINT_KIND_TIME);
+     else if (flags & CHECKPOINT_FORCE)
+         checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_KIND,
+                                          PROGRESS_CHECKPOINT_KIND_FORCE);
+     else
+         checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_KIND,
+                                          PROGRESS_CHECKPOINT_KIND_UNKNOWN);
+ }
+}

-- With Regards, Ashutosh Sharma.

On Thu, Feb 10, 2022 at 12:23 PM Nitin Jadhav wrote: > > > > We need at least a trace of the number of buffers to sync (num_to_scan) > > > before the checkpoint start, instead of just emitting the stats at the > > > end. > > > > > > Bharat, it would be good to show the buffers synced counter and the total > > > buffers to sync, checkpointer pid, substep it is running, whether it is > > > on target for completion, checkpoint_Reason > > > (manual/times/forced). BufferSync has several variables tracking the sync > > > progress locally, and we may need some refactoring here. > > > > I agree to provide the above mentioned information as part of showing the > > progress of the current checkpoint operation. I am currently looking into > > the code to know if any other information can be added. > > Here is the initial patch to show the progress of a checkpoint through > the pg_stat_progress_checkpoint view. Please find the attachment. > > The information added to this view is: pid - process ID of the > CHECKPOINTER process, kind - kind of checkpoint, indicating the reason > for the checkpoint (values can be wal, time or force), phase - indicates > the current phase of the checkpoint operation, total_buffer_writes - total > number of buffers to be written, buffers_processed - number of buffers > processed, buffers_written - number of buffers written, > total_file_syncs - total number of files to be synced, files_synced - > number of files synced. > > There are many operations that happen as part of a checkpoint. For each of > these operations I am updating the phase field of the > pg_stat_progress_checkpoint view. The values supported for this field > are initializing, checkpointing replication slots, checkpointing > snapshots, checkpointing logical rewrite mappings, checkpointing CLOG > pages, checkpointing CommitTs pages, checkpointing SUBTRANS pages, > checkpointing MULTIXACT pages, checkpointing SLRU pages, checkpointing > buffers, performing sync requests, performing two phase checkpoint, > recycling old XLOG files and finalizing. In the checkpointing > buffers phase, the fields total_buffer_writes, buffers_processed and > buffers_written show the detailed progress of writing buffers. In the > performing sync requests phase, the fields total_file_syncs > and files_synced show the detailed progress of syncing files. In the > other phases, only the phase field is updated, and it is > difficult to show the progress because we do not get the total > file count without traversing the directory. It is not worth > calculating that as it affects the performance of the checkpoint. I also > gave a thought to just mentioning the number of files processed, but this > won't give meaningful progress information (it can be treated as > stati
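To make the buffer and file counters described above concrete, a rough example of how the "checkpointing buffers" phase could be watched from another session is shown below. This is only an illustrative sketch: the column names follow this initial patch version (they are renamed later in the thread, e.g. to buffers_total and files_total), so the query would need to match whatever the committed view finally exposes.

-- Illustrative only: column names as proposed in the initial patch above
SELECT pid,
       phase,
       buffers_processed,
       total_buffer_writes,
       round(100.0 * buffers_processed / NULLIF(total_buffer_writes, 0), 2) AS buffer_write_pct,
       files_synced,
       total_file_syncs
FROM pg_stat_progress_checkpoint;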
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
I'm not sure about the current status, but I found this while playing around with the latest changes a bit, so thought of sharing it here.

+
+ strategy
+
+
+ This is used for copying the database directory. Currently, we have
+ two strategies the WAL_LOG_BLOCK and the

Isn't it wal_log instead of wal_log_block? Also, I think when users input a wrong strategy with the createdb command, we should provide a hint message showing the allowed values for the strategy along with the error message. This will be helpful for the users.

On Tue, Feb 15, 2022 at 5:19 PM Dilip Kumar wrote: > > On Tue, Feb 15, 2022 at 2:01 AM Maciek Sakrejda wrote: > > > Here is the updated version of the patch, the changes are 1) Fixed > review comments given by Robert and one open comment from Ashutosh. > 2) Preserved the old createdb method. 3) As agreed upthread, for now > we are using the new strategy only for createdb, not for movedb, so I > have removed the changes in ForgetDatabaseSyncRequests() and > DropDatabaseBuffers(). 4) Provided a database creation strategy > option; as of now I have kept it as below. > > CREATE DATABASE ... WITH (STRATEGY = WAL_LOG); -- default if > option is omitted > CREATE DATABASE ... WITH (STRATEGY = FILE_COPY); > > I have updated the document but I was not sure how much internal > information should be exposed to the user, so I will work on that based on > feedback from others. > > -- > Regards, > Dilip Kumar > EnterpriseDB: http://www.enterprisedb.com
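For reference, the usage shape proposed in the quoted mail, together with the kind of failure where the suggested hint would help, could look roughly as follows. The exact error and hint wording here only mirrors what a later patch version in this thread proposes, so treat it as illustrative rather than final.

CREATE DATABASE walloggeddb WITH (STRATEGY = WAL_LOG);    -- default when STRATEGY is omitted
CREATE DATABASE filecopieddb WITH (STRATEGY = FILE_COPY);

-- A bad value is where the hint pays off, e.g. something along the lines of:
--   ERROR:  invalid createdb strategy foo
--   HINT:  Valid strategies are "wal_log" and "file_copy".
CREATE DATABASE baddb WITH (STRATEGY = foo);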
Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)
+ if ((ckpt_flags &
+(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY)) == 0)
+ {

This code (present at multiple places) looks a little ugly to me; what we can do instead is add a macro, probably named IsShutdownCheckpoint(), which does the above check, and use it in all the functions that have this check. See below:

#define IsShutdownCheckpoint(flags) \
  (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY) != 0)

And then you may use this macro like:

if (IsBootstrapProcessingMode() || IsShutdownCheckpoint(flags))
    return;

This change can be done in all these functions:

+void
+checkpoint_progress_start(int flags)
--
+ */
+void
+checkpoint_progress_update_param(int index, int64 val)
--
+ * Stop reporting progress of the checkpoint.
+ */
+void
+checkpoint_progress_end(void)

==

+ pgstat_progress_start_command(PROGRESS_COMMAND_CHECKPOINT, InvalidOid);
+
+ val[0] = XLogCtl->InsertTimeLineID;
+ val[1] = flags;
+ val[2] = PROGRESS_CHECKPOINT_PHASE_INIT;
+ val[3] = CheckpointStats.ckpt_start_t;
+
+ pgstat_progress_update_multi_param(4, index, val);
+ }

Any specific reason for recording the timelineID in the checkpoint stats table? Will this ever change in our case?

-- With Regards, Ashutosh Sharma.

On Wed, Feb 23, 2022 at 6:59 PM Nitin Jadhav wrote: > > > I will make use of pgstat_progress_update_multi_param() in the next > > patch to replace multiple calls to checkpoint_progress_update_param(). > > Fixed. > --- > > > > The other progress tables use [type]_total as column names for counter > > > targets (e.g. backup_total for backup_streamed, heap_blks_total for > > > heap_blks_scanned, etc.). I think that `buffers_total` and > > > `files_total` would be better column names. > > > > I agree and I will update this in the next patch. > > Fixed. > --- > > > How about this "The checkpoint is started because max_wal_size is reached". > > > > "The checkpoint is started because checkpoint_timeout expired". > > > > "The checkpoint is started because some operation forced a checkpoint". > > I have used the above description. Kindly let me know if any changes > are required. > --- > > > > > + checkpointing CommitTs pages > > > > > > CommitTs -> Commit time stamp > > > > I will handle this in the next patch. > > Fixed. > --- > > > There are more scenarios where you can have a backend requesting a > > checkpoint > > and waiting for its completion, and there may be more than one backend > > concerned, so I don't think that storing only one / the first backend pid is > > ok. > > Thanks for this information. I am not considering backend_pid. > --- > > > I think all the information should be exposed. Only knowing why the current > > checkpoint has been triggered without any further information seems a bit > > useless. Think for instance for cases like [1]. > > I have supported all possible checkpoint kinds. Added > pg_stat_get_progress_checkpoint_kind() to convert the flags (int) to a > string representing a combination of flags, and also checking for the > flag update in ImmediateCheckpointRequested() which checks whether > the CHECKPOINT_IMMEDIATE flag is set or not. I did not find any other > cases where the flags get changed (which changes the current > checkpoint behaviour) during the checkpoint. Kindly let me know if I > am missing something. > --- > > > > I feel 'processes_wiating' aligns more with the naming conventions of > > > the fields of the existing progres views. > > > > There's at least pg_stat_progress_vacuum.num_dead_tuples. Anyway I don't > > have > > a strong opinion on it, just make sure to correct the typo. > > More analysis is required to support this. I am planning to take care of this > in the next patch. > --- > > > If pg_is_in_recovery() is true, then it's a restartpoint, otherwise it's a > > restartpoint if the checkpoint's timeline is different from the current > > timeline? > > Fixed. > > Sharing the v2 patch. Kindly have a look and share your comments. > > Thanks & Regards, > Nitin Jadhav > > > > > On Tue, Feb 22, 2022 at 12:08 PM Nitin Jadhav > wrote: > > > > > > Thank you for sharing the information. 'triggering backend PID' (int) > > > > - can be stored without any problem. 'checkpoint or restartpoint?' > > > > (boolean) - can be stored as an integer value like > > > > PROGRESS_CHECKPOINT_TYPE_CHECKPOINT(0) and > &g
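One note on the macro suggested above: as written, "flags & X != 0" would bind as "flags & (X != 0)" because != has higher precedence than & in C. A possible spelling with the extra parentheses, assuming the CHECKPOINT_* flag bits from access/xlog.h, is shown below. This is only an illustrative sketch, not code taken from the patch.

#define IsShutdownCheckpoint(flags) \
	(((flags) & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY)) != 0)

	/* example usage, as in the review comment above */
	if (IsBootstrapProcessingMode() || IsShutdownCheckpoint(flags))
		return;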
Re: pg_walinspect - a new extension to get raw WAL data and WAL stats
Some review comments on v5 patch (v5-0001-pg_walinspect.patch) +-- +-- pg_get_wal_records_info() +-- +CREATE FUNCTION pg_get_wal_records_info(IN start_lsn pg_lsn, +IN end_lsn pg_lsn, +IN wait_for_wal boolean DEFAULT false, +OUT lsn pg_lsn, What does the wait_for_wal flag mean here when one has already specified the start and end lsn? AFAIU, If a user has specified a start and stop LSN, it means that the user knows the extent to which he/she wants to display the WAL records in which case we need to stop once the end lsn has reached . So what is the meaning of wait_for_wal flag? Does it look sensible to have the wait_for_wal flag here? To me it doesn't. == +-- +-- pg_get_wal_records_info_till_end_of_wal() +-- +CREATE FUNCTION pg_get_wal_records_info_till_end_of_wal(IN start_lsn pg_lsn, +OUT lsn pg_lsn, +OUT prev_lsn pg_lsn, +OUT xid xid, Why is this function required? Is pg_get_wal_records_info() alone not enough? I think it is. See if we can make end_lsn optional in pg_get_wal_records_info() and lets just have it alone. I think it can do the job of pg_get_wal_records_info_till_end_of_wal function. == +-- +-- pg_get_wal_stats_till_end_of_wal() +-- +CREATE FUNCTION pg_get_wal_stats_till_end_of_wal(IN start_lsn pg_lsn, +OUT resource_manager text, +OUT count int8, Above comment applies to this function as well. Isn't pg_get_wal_stats() enough? == + if (loc <= read_upto) + break; + + /* Let's not wait for WAL to be available if indicated */ + if (loc > read_upto && + state->private_data != NULL) + { Why loc > read_upto? The first if condition is (loc <= read_upto) followed by the second if condition - (loc > read_upto). Is the first if condition (loc <= read_upto) not enough to indicate that loc > read_upto? == +#define IsEndOfWALReached(state) \ + (state->private_data != NULL && \ + (((ReadLocalXLOGPage2Private *) xlogreader->private_data)->no_wait == true) && \ + (((ReadLocalXLOGPage2Private *) xlogreader->private_data)->reached_end_of_wal == true)) I think we should either use state or xlogreader. First line says state->private_data and second line xlogreader->private_data. == + (((ReadLocalXLOGPage2Private *) xlogreader->private_data)->reached_end_of_wal == true)) + There is a new patch coming to make the end of WAL messages less scary. It introduces the EOW flag in xlogreaderstate maybe we can use that instead of introducing new flags in private area to represent the end of WAL. == +/* + * XLogReaderRoutine->page_read callback for reading local xlog files + * + * This function is same as read_local_xlog_page except that it works in both + * wait and no wait mode. The callers can specify about waiting in private_data + * of XLogReaderState. + */ +int +read_local_xlog_page_2(XLogReaderState *state, XLogRecPtr targetPagePtr, + int reqLen, XLogRecPtr targetRecPtr, char *cur_page) +{ + XLogRecPtr read_upto, Do we really need this function? Can't we make use of an existing WAL reader function - read_local_xlog_page()? -- With Regards, Ashutosh Sharma. On Fri, Feb 25, 2022 at 4:33 PM Bharath Rupireddy wrote: > > On Wed, Feb 16, 2022 at 9:04 AM Ashutosh Sharma wrote: > > I don't think that's the use case of this patch. Unless there is some > > other valid reason, I would suggest you remove it. > > Removed the function pg_verify_raw_wal_record. Robert and Greg also > voted for removal upthread. > > > > > Should we add a function that returns the pointer to the first and > > > > probably the last WAL record in the WAL segment? 
This would help users > > > > to inspect the wal records in the entire wal segment if they wish to > > > > do so. > > > > > > Good point. One can do this already with pg_get_wal_records_info and > > > pg_walfile_name_offset. Usually, the LSN format itself can give an > > > idea about the WAL file it is in. > > > > > > postgres=# select lsn, pg_walfile_name_offset(lsn) from > > > pg_get_wal_records_info('0/500', '0/5FF') order by lsn asc > > > limit 1; > > > lsn|pg_walfile_name_offset > > > ---+--- > > > 0/538 | (00010005,56) > > > (1 row) > > > > > > postgres=# select lsn, pg_walfile_name_offset(lsn) from > > > pg_get_wal_records_info('0/500', '0/5FF') order by lsn desc > > > limit 1; > > > lsn| pg_walfile_name_
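For what it's worth, if end_lsn were made optional (or simply capped by the caller), the separate "till end of WAL" functions would not be needed; something along these lines, where pg_current_wal_flush_lsn() stands in for "end of WAL". The function signatures here are the ones proposed in the patch under review and are still being discussed, so treat this as an illustration only.

-- Explicit range
SELECT * FROM pg_get_wal_records_info('0/1000000', '0/2000000');

-- "Till end of WAL" without a separate function: cap the end LSN at the
-- current flush position
SELECT * FROM pg_get_wal_records_info('0/1000000', pg_current_wal_flush_lsn());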
Re: pg_walinspect - a new extension to get raw WAL data and WAL stats
On Wed, Mar 2, 2022 at 10:37 PM Bharath Rupireddy wrote: > > On Wed, Mar 2, 2022 at 8:12 PM Ashutosh Sharma wrote: > > > > Some review comments on v5 patch (v5-0001-pg_walinspect.patch) > > Thanks for reviewing. > > > +-- > > +-- pg_get_wal_records_info() > > +-- > > +CREATE FUNCTION pg_get_wal_records_info(IN start_lsn pg_lsn, > > +IN end_lsn pg_lsn, > > +IN wait_for_wal boolean DEFAULT false, > > +OUT lsn pg_lsn, > > > > What does the wait_for_wal flag mean here when one has already > > specified the start and end lsn? AFAIU, If a user has specified a > > start and stop LSN, it means that the user knows the extent to which > > he/she wants to display the WAL records in which case we need to stop > > once the end lsn has reached . So what is the meaning of wait_for_wal > > flag? Does it look sensible to have the wait_for_wal flag here? To me > > it doesn't. > > Users can always specify a future end_lsn and set wait_for_wal to > true, then the pg_get_wal_records_info/pg_get_wal_stats functions can > wait for the WAL. IMO, this is useful. If you remember you were okay > with wait/nowait versions for some of the functions upthread [1]. I'm > not going to retain this behaviour for both > pg_get_wal_records_info/pg_get_wal_stats as it is similar to > pg_waldump's --follow option. > It is not at all similar to pg_waldumps behaviour. Please check the behaviour of pg_waldump properly. Does it wait for any wal records when a user has specified a stop pointer? It doesn't and it shouldn't. I mean does it even make sense to wait for the WAL when a stop pointer is specified? And it's quite understandable that if a user has asked pg_walinspect to stop at a certain point, it must. Also, What if there are already WAL records after the stop pointer, in that case does it even make sense to have a wait flag. WHat would be the meaning of the wait flag in that case? Further, have you checked wait_for_wal flag behaviour, is it even working? > > > > +-- > > +-- pg_get_wal_records_info_till_end_of_wal() > > +-- > > +CREATE FUNCTION pg_get_wal_records_info_till_end_of_wal(IN start_lsn > > pg_lsn, > > +OUT lsn pg_lsn, > > +OUT prev_lsn pg_lsn, > > +OUT xid xid, > > > > Why is this function required? Is pg_get_wal_records_info() alone not > > enough? I think it is. See if we can make end_lsn optional in > > pg_get_wal_records_info() and lets just have it alone. I think it can > > do the job of pg_get_wal_records_info_till_end_of_wal function. > > > > == > > > > +-- > > +-- pg_get_wal_stats_till_end_of_wal() > > +-- > > +CREATE FUNCTION pg_get_wal_stats_till_end_of_wal(IN start_lsn pg_lsn, > > +OUT resource_manager text, > > +OUT count int8, > > > > Above comment applies to this function as well. Isn't pg_get_wal_stats() > > enough? > > I'm doing the following input validations for these functions to not > cause any issues with invalid LSN. If I were to have the default value > for end_lsn as 0/0, I can't perform input validations right? That is > the reason I'm having separate functions {pg_get_wal_records_info, > pg_get_wal_stats}_till_end_of_wal() versions. > You can do it. Please check pg_waldump to understand how it is done there. You cannot have multiple functions doing different things when one single function can do all the job. > > == > > > > > > + if (loc <= read_upto) > > + break; > > + > > + /* Let's not wait for WAL to be available if > > indicated */ > > + if (loc > read_upto && > > + state->private_data != NULL) > > + { > > > > Why loc > read_upto? 
The first if condition is (loc <= read_upto) > > followed by the second if condition - (loc > read_upto). Is the first > > if condition (loc <= read_upto) not enough to indicate that loc > > > read_upto? > > Yeah, that's unnecessary, I improved the comment there and removed loc > > read_upto. > > > == > > > > +#define IsEndOfWALReached(state) \ > > + (state->private_data != NULL && \ > > + (((ReadLocalXLOGPage2Private *) > > xlogreader->private_data)->no_wait == true) && \ > > + (((ReadLocalXLOGPage2Private *) > > xlogreader->private_data)->reached_end_of_wal == true)) > > > > I think we should either use state or xlogreader. First
Re: Make mesage at end-of-recovery less scary.
On Wed, Mar 2, 2022 at 7:47 AM Kyotaro Horiguchi wrote: > > At Sat, 19 Feb 2022 09:31:33 +0530, Ashutosh Sharma > wrote in > > The changes look good. Thanks! > > Thanks! > > Some recent core change changed WAL insertion speed during the TAP > test and revealed one forgotten case of EndOfWAL. When a record > header flows into the next page, XLogReadRecord does a separate check > from ValidXLogRecordHeader by itself. >

The new changes made in the patch look good. Thanks to the recent changes to WAL insertion speed, which have helped us catch this bug. One small comment:

record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
- total_len = record->xl_tot_len;

Do you think we need to change the position of the comment written for the above code that says: /* * Read the record length. * ... ...

-- With Regards, Ashutosh Sharma.
Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)
Here are some of my review comments on the latest patch:

+
+
+ type text
+
+
+ Type of checkpoint. See .
+
+
+
+
+
+ kind text
+
+
+ Kind of checkpoint. See .
+
+

This looks a bit confusing. Two columns, one with the name "checkpoint types" and another "checkpoint kinds". You can probably rename checkpoint-kinds to checkpoint-flags and let the checkpoint-types be as it is.

==

+ pg_stat_progress_checkpointpg_stat_progress_checkpoint
+ One row only, showing the progress of the checkpoint.

Let's make this message consistent with the already existing message for pg_stat_wal_receiver. See the description of the pg_stat_wal_receiver view in the "Dynamic Statistics Views" table.

==

[local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
-[ RECORD 1 ]-+-
pid   | 22043
type  | checkpoint
kind  | immediate force wait requested time

I think the output in the kind column can be displayed as {immediate, force, wait, requested, time}. By the way, these are all checkpoint flags, so it is better to display them as checkpoint flags instead of checkpoint kind, as mentioned in one of my previous comments.

==

[local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
-[ RECORD 1 ]-+-
pid        | 22043
type       | checkpoint
kind       | immediate force wait requested time
start_lsn  | 0/14C60F8
start_time | 2022-03-03 18:59:56.018662+05:30
phase      | performing two phase checkpoint

This is the output I see when the checkpointer process has come out of the two phase checkpoint and is currently writing checkpoint xlog records and doing other stuff like updating control files etc. Is this okay?

==

The output of log_checkpoint shows the number of buffers written is 3 whereas the output of pg_stat_progress_checkpoint shows it as 0. See below:

2022-03-03 20:04:45.643 IST [22043] LOG: checkpoint complete: wrote 3 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=24.652 s, sync=104.256 s, total=3889.625 s; sync files=2, longest=0.011 s, average=0.008 s; distance=0 kB, estimate=0 kB

--

[local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
-[ RECORD 1 ]-+-
pid               | 22043
type              | checkpoint
kind              | immediate force wait requested time
start_lsn         | 0/14C60F8
start_time        | 2022-03-03 18:59:56.018662+05:30
phase             | finalizing
buffers_total     | 0
buffers_processed | 0
buffers_written   | 0

Any idea why this mismatch?

==

I think we can add a couple more pieces of information to this view - start_time for the buffer write operation and start_time for the buffer sync operation. These are two very time consuming tasks in a checkpoint and people would find it useful to know how much time is being taken by the checkpoint in the I/O operation phase. Thoughts?

-- With Regards, Ashutosh Sharma.

On Wed, Mar 2, 2022 at 4:45 PM Nitin Jadhav wrote: > > Thanks for reviewing. > > > > > I suggested upthread to store the starting timeline instead. This way > > > > you can > > > > deduce whether it's a restartpoint or a checkpoint, but you can also > > > > deduce > > > > other information, like what was the starting WAL. > > > > > > I don't understand why we need the timeline here to just determine > > > whether it's a restartpoint or checkpoint. > > > > I'm not saying it's necessary, I'm saying that for the same space usage we > > can > > store something a bit more useful. If no one cares about having the > > starting > > timeline available for no extra cost then sure, let's just store the kind > > directly. > > Fixed.
> > > 2) Can't we just have these checks inside CASE-WHEN-THEN-ELSE blocks > > directly instead of new function pg_stat_get_progress_checkpoint_kind? > > + snprintf(ckpt_kind, MAXPGPATH, "%s%s%s%s%s%s%s%s%s", > > + (flags == 0) ? "unknown" : "", > > + (flags & CHECKPOINT_IS_SHUTDOWN) ? "shutdown " : "", > > + (flags & CHECKPOINT_END_OF_RECOVERY) ? "end-of-recovery " : "", > > + (flags & CHECKPOINT_IMMEDIATE) ? "immediate " : "", > > + (flags & CHECKPOINT_FORCE) ? "force " : "", > > + (flags & CHECKPOINT_WAIT) ? "wait " : "", > > + (flags & CHECKPOINT_CAUSE_XLOG) ? "wal " : "", > > + (flags & CHECKPOINT_CAUSE_TIME) ? "time " : "", > > +
Re: pg_walinspect - a new extension to get raw WAL data and WAL stats
I think we should also see if we can allow end users to input timeline information with the pg_walinspect functions. This will help the end users to get information about WAL records from previous timeline which can be helpful in case of restored servers. -- With Regards, Ashutosh Sharma. On Thu, Mar 3, 2022 at 8:20 AM Kyotaro Horiguchi wrote: > > Hi. > > +#ifdef FRONTEND > +/* > + * Functions that are currently not needed in the backend, but are better > + * implemented inside xlogreader.c because of the internal facilities > available > + * here. > + */ > + > #endif /* FRONTEND */ > > Why didn't you remove the emptied #ifdef section? > > +int > +read_local_xlog_page_2(XLogReaderState *state, XLogRecPtr targetPagePtr, > + int reqLen, XLogRecPtr > targetRecPtr, char *cur_page) > > The difference with the original function is this function has one > additional if-block amid. I think we can insert the code directly in > the original function. > > + /* > +* We are trying to read future WAL. Let's not wait > for WAL to be > +* available if indicated. > +*/ > + if (state->private_data != NULL) > > However, in the first place it seems to me there's not need for the > function to take care of no_wait affairs. > > If, for expample, pg_get_wal_record_info() with no_wait = true, it is > enough that the function identifies the bleeding edge of WAL then loop > until the LSN. So I think no need for the new function, nor for any > modification on the origical function. > > The changes will reduce the footprint of the patch largely, I think. > > At Wed, 2 Mar 2022 22:37:43 +0530, Bharath Rupireddy > wrote in > > On Wed, Mar 2, 2022 at 8:12 PM Ashutosh Sharma > > wrote: > > > > > > Some review comments on v5 patch (v5-0001-pg_walinspect.patch) > > > > Thanks for reviewing. > > > > > +-- > > > +-- pg_get_wal_records_info() > > > +-- > > > +CREATE FUNCTION pg_get_wal_records_info(IN start_lsn pg_lsn, > > > +IN end_lsn pg_lsn, > > > +IN wait_for_wal boolean DEFAULT false, > > > +OUT lsn pg_lsn, > > > > > > What does the wait_for_wal flag mean here when one has already > > > specified the start and end lsn? AFAIU, If a user has specified a > > > start and stop LSN, it means that the user knows the extent to which > > > he/she wants to display the WAL records in which case we need to stop > > > once the end lsn has reached . So what is the meaning of wait_for_wal > > > flag? Does it look sensible to have the wait_for_wal flag here? To me > > > it doesn't. > > > > Users can always specify a future end_lsn and set wait_for_wal to > > true, then the pg_get_wal_records_info/pg_get_wal_stats functions can > > wait for the WAL. IMO, this is useful. If you remember you were okay > > with wait/nowait versions for some of the functions upthread [1]. I'm > > not going to retain this behaviour for both > > pg_get_wal_records_info/pg_get_wal_stats as it is similar to > > pg_waldump's --follow option. > > I agree to this for now. However, I prefer that NULL or invalid > end_lsn is equivalent to pg_current_wal_lsn(). > > > > == > > > > > > +-- > > > +-- pg_get_wal_records_info_till_end_of_wal() > > > +-- > > > +CREATE FUNCTION pg_get_wal_records_info_till_end_of_wal(IN start_lsn > > > pg_lsn, > > > +OUT lsn pg_lsn, > > > +OUT prev_lsn pg_lsn, > > > +OUT xid xid, > > > > > > Why is this function required? Is pg_get_wal_records_info() alone not > > > enough? I think it is. See if we can make end_lsn optional in > > > pg_get_wal_records_info() and lets just have it alone. 
I think it can > > > do the job of pg_get_wal_records_info_till_end_of_wal function. > > I rather agree to Ashutosh. This feature can be covered by > pg_get_wal_records(start_lsn, NULL, false). > > > > == > > > > > > +-- > > > +-- pg_get_wal_stats_till_end_of_wal() > > > +-- > > > +CREATE FUNCTION pg_get_wal_stats_till_end_of_wal(IN start_lsn pg_lsn, > > > +OUT resource_manager text, > > > +OUT count int8, > > > > > > Above comment applies to this function as well. Isn't pg_get_wal_stats() > > > enough? > > > > I'm doing the following input
Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)
Please don't mix comments from multiple reviewers into one thread. It's hard to understand which comments are mine or Julien's or from others. Can you please respond to the email from each of us separately with an inline response. That will be helpful to understand your thoughts on our review comments. -- With Regards, Ashutosh Sharma. On Fri, Mar 4, 2022 at 4:59 PM Nitin Jadhav wrote: > > Thanks for reviewing. > > > 6) How about shutdown and end-of-recovery checkpoint? Are you planning > > to have an ereport_startup_progress mechanism as 0002? > > I thought of including it earlier then I felt lets first make the > current patch stable. Once all the fields are properly decided and the > patch gets in then we can easily extend the functionality to shutdown > and end-of-recovery cases. I have also observed that the timer > functionality wont work properly in case of shutdown as we are doing > an immediate checkpoint. So this needs a lot of discussion and I would > like to handle this on a separate thread. > --- > > > 7) I think you don't need to call checkpoint_progress_start and > > pgstat_progress_update_param, any other progress reporting function > > for shutdown and end-of-recovery checkpoint right? > > I had included the guards earlier and then removed later based on the > discussion upthread. > --- > > > [local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint; > > -[ RECORD 1 ]-+- > > pid | 22043 > > type | checkpoint > > kind | immediate force wait requested time > > start_lsn | 0/14C60F8 > > start_time| 2022-03-03 18:59:56.018662+05:30 > > phase | performing two phase checkpoint > > > > > > This is the output I see when the checkpointer process has come out of > > the two phase checkpoint and is currently writing checkpoint xlog > > records and doing other stuff like updating control files etc. Is this > > okay? > > The idea behind choosing the phases is based on the functionality > which takes longer time to execute. Since after two phase checkpoint > till post checkpoint cleanup won't take much time to execute, I have > not added any additional phase for that. But I also agree that this > gives wrong information to the user. How about mentioning the phase > information at the end of each phase like "Initializing", > "Initialization done", ..., "two phase checkpoint done", "post > checkpoint cleanup done", .., "finalizing". Except for the first phase > ("initializing") and last phase ("finalizing"), all the other phases > describe the end of a certain operation. I feel this gives correct > information even though the phase name/description does not represent > the entire code block between two phases. For example if the current > phase is ''two phase checkpoint done". Then the user can infer that > the checkpointer has done till two phase checkpoint and it is doing > other stuff that are after that. Thoughts? > > > The output of log_checkpoint shows the number of buffers written is 3 > > whereas the output of pg_stat_progress_checkpoint shows it as 0. 
See > > below: > > > > 2022-03-03 20:04:45.643 IST [22043] LOG: checkpoint complete: wrote 3 > > buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; > > write=24.652 s, sync=104.256 s, total=3889.625 s; sync files=2, > > longest=0.011 s, average=0.008 s; distance=0 kB, estimate=0 kB > > > > -- > > > > [local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint; > > -[ RECORD 1 ]-+- > > pid | 22043 > > type | checkpoint > > kind | immediate force wait requested time > > start_lsn | 0/14C60F8 > > start_time| 2022-03-03 18:59:56.018662+05:30 > > phase | finalizing > > buffers_total | 0 > > buffers_processed | 0 > > buffers_written | 0 > > > > Any idea why this mismatch? > > Good catch. In BufferSync() we have 'num_to_scan' (buffers_total) > which indicates the total number of buffers to be processed. Based on > that, the 'buffers_processed' and 'buffers_written' counter gets > incremented. I meant these values may reach upto 'buffers_total'. The > current pg_stat_progress_view support above information. There is > another place when 'ckpt_bufs_written' gets incremented (In > SlruInternalWritePage()). This increment is above the 'buffers_total' > v
Re: pg_walinspect - a new extension to get raw WAL data and WAL stats
Thanks Bharath for working on all my review comments. I took a quick look at the new version of the patch (v7-pg_walinspect.patch) and this version looks a lot better. I'll do some detailed review later (maybe next week or so) and let you know my further comments, if any. -- With Regards, Ashutosh Sharma. On Fri, Mar 4, 2022 at 3:54 PM Bharath Rupireddy wrote: > > On Thu, Mar 3, 2022 at 10:05 PM Robert Haas wrote: > > > > On Fri, Feb 25, 2022 at 6:03 AM Bharath Rupireddy > > wrote: > > > Added a new function that returns the first and last valid WAL record > > > LSN of a given WAL file. > > > > Sounds like fuzzy thinking to me. WAL records can cross file > > boundaries, and forgetting about that leads to all sorts of problems. > > Just give people one function that decodes a range of LSNs and call it > > good. Why do you need anything else? If people want to get the first > > record that begins in a segment or the first record any portion of > > which is in a particular segment or the last record that begins in a > > segment or the last record that ends in a segment or any other such > > thing, they can use a WHERE clause for that... and if you think they > > can't, then that should be good cause to rethink the return value of > > the one-and-only SRF that I think you need here. > > Thanks Robert. > > Thanks to others for your review comments. > > Here's the v7 patch set. These patches are based on the motive that > "keep it simple and short yet effective and useful". With that in > mind, I have not implemented the wait mode for any of the functions > (as it doesn't look an effective use-case and requires adding a new > page_read callback, instead throw error if future LSN is specified), > also these functions will give WAL information on the current server's > timeline. Having said that, I'm open to adding new functions in future > once this initial version gets in, if there's a requirement and users > ask for the new functions. > > Please review the v7 patch set and provide your thoughts. > > Regards, > Bharath Rupireddy.
Re: Synchronizing slots from primary to standby
Hi, I have spent little time trying to understand the concern raised by Andres and while doing so I could think of a couple of issues which I would like to share here. Although I'm not quite sure how inline these are with the problems seen by Andres. 1) Firstly, what if we come across a situation where the failover occurs when the confirmed flush lsn has been updated on primary, but is yet to be updated on the standby? I believe this may very well be the case especially considering that standby sends sql queries to the primary to synchronize the replication slots at regular intervals and if the primary dies just after updating the confirmed flush lsn of its logical subscribers then the standby may not be able to get this information/update from the primary which means we'll probably end up having a broken logical replication slot on the new primary. 2) Secondly, if the standby goes down, the logical subscribers will stop receiving new changes from the primary as per the design of this patch OR if standby lags behind the primary for whatever reason, it will have a direct impact on logical subscribers as well. -- With Regards, Ashutosh Sharma. On Sat, Feb 19, 2022 at 3:53 AM Andres Freund wrote: > > Hi, > > On 2022-02-11 15:28:19 +0100, Peter Eisentraut wrote: > > On 05.02.22 20:59, Andres Freund wrote: > > > On 2022-01-03 14:46:52 +0100, Peter Eisentraut wrote: > > > > From ec00dc6ab8bafefc00e9b1c78ac9348b643b8a87 Mon Sep 17 00:00:00 2001 > > > > From: Peter Eisentraut > > > > Date: Mon, 3 Jan 2022 14:43:36 +0100 > > > > Subject: [PATCH v3] Synchronize logical replication slots from primary > > > > to > > > > standby > > > I've just skimmed the patch and the related threads. As far as I can tell > > > this > > > cannot be safely used without the conflict handling in [1], is that > > > correct? > > > > This or similar questions have been asked a few times about this or similar > > patches, but they always come with some doubt. > > I'm certain it's a problem - the only reason I couched it was that there could > have been something clever in the patch preventing problems that I missed > because I just skimmed it. > > > > If we think so, it would be > > useful perhaps if we could come up with test cases that would demonstrate > > why that other patch/feature is necessary. (I'm not questioning it > > personally, I'm just throwing out ideas here.) > > The patch as-is just breaks one of the fundamental guarantees necessary for > logical decoding, that no rows versions can be removed that are still required > for logical decoding (signalled via catalog_xmin). So there needs to be an > explicit mechanism upholding that guarantee, but there is not right now from > what I can see. > > One piece of the referenced patchset is that it adds information about removed > catalog rows to a few WAL records, and then verifies during replay that no > record can be replayed that removes resources that are still needed. If such a > conflict exists it's dealt with as a recovery conflict. > > That itself doesn't provide prevention against removal of required, but it > provides detection. The prevention against removal can then be done using a > physical replication slot with hot standby feedback or some other mechanism > (e.g. slot syncing mechanism could maintain a "placeholder" slot on the > primary for all sync targets or something like that). 
> > Even if that infrastructure existed / was merged, the slot sync stuff would > still need some very careful logic to protect against problems due to > concurrent WAL replay and "synchronized slot" creation. But that's doable. > > Greetings, > > Andres Freund > >
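As a way to observe the first concern raised above (a confirmed_flush_lsn that has advanced on the primary but not yet propagated to the standby at failover time), one could compare the slot positions on both nodes. This is only a diagnostic sketch, not part of the patch.

-- Run on the primary and on the standby and compare the results; a standby
-- value that lags the primary's at failover time is exactly the window
-- described above.
SELECT slot_name, slot_type, confirmed_flush_lsn, catalog_xmin
FROM pg_replication_slots
WHERE slot_type = 'logical';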
Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)
On Tue, Mar 8, 2022 at 8:31 PM Nitin Jadhav wrote: > > > > [local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint; > > > -[ RECORD 1 ]-+- > > > pid | 22043 > > > type | checkpoint > > > kind | immediate force wait requested time > > > > > > I think the output in the kind column can be displayed as {immediate, > > > force, wait, requested, time}. By the way these are all checkpoint > > > flags so it is better to display it as checkpoint flags instead of > > > checkpoint kind as mentioned in one of my previous comments. > > > > I will update in the next patch. > > The current format matches with the server log message for the > checkpoint start in LogCheckpointStart(). Just to be consistent, I > have not changed the code. >

See below how flags are shown in other SQL functions:

ashu@postgres=# select * from heap_tuple_infomask_flags(2304, 1);
                raw_flags                | combined_flags
-----------------------------------------+----------------
 {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
(1 row)

This looks more readable and it's easy to understand for the end users. Further, comparing the way log messages are displayed with the way SQL functions display their output doesn't look like the right comparison to me. Obviously both should show matching data, but the way it is shown doesn't need to be the same. In fact, it is not in most cases.

> I have taken care of the rest of the comments in v5 patch for which > there was clarity. >

Thank you very much. Will take a look at it later.

-- With Regards, Ashutosh Sharma.
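If the view ends up keeping the space-separated string, the array-style presentation of heap_tuple_infomask_flags() can still be approximated on top of it. A small sketch, assuming the column stays a space-separated text value named kind as in the patch version shown above:

SELECT pid,
       string_to_array(btrim(kind), ' ') AS checkpoint_flags
FROM pg_stat_progress_checkpoint;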
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
Here are some review comments on the latest patch (v11-0004-WAL-logged-CREATE-DATABASE.patch). I actually did the review of the v10 patch but that applies for this latest version as well. + /* Now errors are fatal ... */ + START_CRIT_SECTION(); Did you mean PANIC instead of FATAL? == + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), +errmsg("invalid create strategy %s", strategy), +errhint("Valid strategies are \"wal_log\", and \"file_copy\"."))); + } Should this be - "invalid createdb strategy" instead of "invalid create strategy"? == + /* +* In case of ALTER DATABASE SET TABLESPACE we don't need to do +* anything for the object which are not in the source db's default +* tablespace. The source and destination dboid will be same in +* case of ALTER DATABASE SET TABLESPACE. +*/ + else if (src_dboid == dst_dboid) + continue; + else + dstrnode.spcNode = srcrnode.spcNode; Is this change still required? Do we support the WAL_COPY strategy for ALTER DATABASE? == + /* Open the source and the destination relation at smgr level. */ + src_smgr = smgropen(srcrnode, InvalidBackendId); + dst_smgr = smgropen(dstrnode, InvalidBackendId); + + /* Copy relation storage from source to the destination. */ + CreateAndCopyRelationData(src_smgr, dst_smgr, relinfo->relpersistence); Do we need to do smgropen for destination relfilenode here? Aren't we already doing that inside RelationCreateStorage? == + use_wal = XLogIsNeeded() && + (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork); + + /* Get number of blocks in the source relation. */ + nblocks = smgrnblocks(src, forkNum); What if number of blocks in a source relation is ZERO? Should we check for that and return immediately. We have already done smgrcreate. == + /* We don't need to copy the shared objects to the target. */ + if (classForm->reltablespace == GLOBALTABLESPACE_OID) + return NULL; + + /* +* If the object doesn't have the storage then nothing to be +* done for that object so just ignore it. +*/ + if (!RELKIND_HAS_STORAGE(classForm->relkind)) + return NULL; We can probably club together above two if-checks. == + + strategy + + + This is used for copying the database directory. Currently, we have + two strategies the WAL_LOG and the + FILE_COPY. If WAL_LOG strategy + is used then the database will be copied block by block and it will + also WAL log each copied block. Otherwise, if FILE_COPY I think we need to mention the default strategy in the documentation page. -- With Regards, Ashutosh Sharma. On Thu, Mar 10, 2022 at 4:32 PM Dilip Kumar wrote: > > On Wed, Mar 9, 2022 at 6:44 PM Robert Haas wrote: > > > > > Right, infact now also if you see the logic, the > > > write_relmap_file_internal() is taking care of the actual update and > > > the write_relmap_file() is doing update + setting the global > > > variables. So yeah we can rename as you suggested in 0001 and here > > > also we can change the logic as you suggested that the recovery and > > > createdb will only call the first function which is just doing the > > > update. > > > > But I think we want the path construction to be managed by the > > function rather than the caller, too. > > I have completely changed the logic for this refactoring. Basically, > write_relmap_file(), is already having parameters to control whether > to write wal, send inval and we are already passing the dbpath. 
> Instead of making a new function I just pass one additional parameter > to this function itself about whether we are creating a new map or not > and I think with that changes are very less and this looks cleaner to > me. Similarly for load_relmap_file() also I just had to pass the > dbpath and memory for destination map. Please have a look and let me > know your thoughts. > > > Sure, I guess. It's just not obvious why the argument would actually > > need to be a function that copies storage, or why there's more than > > one way to copy storage. I'd rather keep all the code paths unified, > > if we can, and set behavior via flags or something, maybe. I'm not > > sure whether that's realistic, though. > > I try considering that, I think it doesn't look good to make it flag >
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
Thanks Dilip for working on the review comments. I'll take a look at the new version of patch and let you know my comments, if any. -- With Regards, Ashutosh Sharma. On Thu, Mar 10, 2022 at 8:38 PM Dilip Kumar wrote: > > On Thu, Mar 10, 2022 at 7:22 PM Ashutosh Sharma wrote: > > > > Here are some review comments on the latest patch > > (v11-0004-WAL-logged-CREATE-DATABASE.patch). I actually did the review > > of the v10 patch but that applies for this latest version as well. > > > > + /* Now errors are fatal ... */ > > + START_CRIT_SECTION(); > > > > Did you mean PANIC instead of FATAL? > > I think here fatal didn't really mean the error level FATAL, that > means critical and I have seen it is used in other places also. But I > really don't think we need this comments to removed to avoid any > confusion. > > > == > > > > + > > (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > > +errmsg("invalid create > > strategy %s", strategy), > > +errhint("Valid strategies are > > \"wal_log\", and \"file_copy\"."))); > > + } > > > > > > Should this be - "invalid createdb strategy" instead of "invalid > > create strategy"? > > Changed > > > == > > > > + /* > > +* In case of ALTER DATABASE SET TABLESPACE we don't need > > to do > > +* anything for the object which are not in the source > > db's default > > +* tablespace. The source and destination dboid will be > > same in > > +* case of ALTER DATABASE SET TABLESPACE. > > +*/ > > + else if (src_dboid == dst_dboid) > > + continue; > > + else > > + dstrnode.spcNode = srcrnode.spcNode; > > > > > > Is this change still required? Do we support the WAL_COPY strategy for > > ALTER DATABASE? > > Yeah now it is unreachable code so removed. > > > == > > > > + /* Open the source and the destination relation at > > smgr level. */ > > + src_smgr = smgropen(srcrnode, InvalidBackendId); > > + dst_smgr = smgropen(dstrnode, InvalidBackendId); > > + > > + /* Copy relation storage from source to the destination. */ > > + CreateAndCopyRelationData(src_smgr, dst_smgr, > > relinfo->relpersistence); > > > > Do we need to do smgropen for destination relfilenode here? Aren't we > > already doing that inside RelationCreateStorage? > > Yeah I have changed the complete logic and removed the smgr_open for > both source and destination and moved inside > CreateAndCopyRelationData, please check the updated code. > > > == > > > > + use_wal = XLogIsNeeded() && > > + (relpersistence == RELPERSISTENCE_PERMANENT || > > copying_initfork); > > + > > + /* Get number of blocks in the source relation. */ > > + nblocks = smgrnblocks(src, forkNum); > > > > What if number of blocks in a source relation is ZERO? Should we check > > for that and return immediately. We have already done smgrcreate. > > Yeah make sense to optimize, with that we will not have to get the > buffer strategy so done. > > > == > > > > + /* We don't need to copy the shared objects to the target. */ > > + if (classForm->reltablespace == GLOBALTABLESPACE_OID) > > + return NULL; > > + > > + /* > > +* If the object doesn't have the storage then nothing to be > > +* done for that object so just ignore it. > > +*/ > > + if (!RELKIND_HAS_STORAGE(classForm->relkind)) > > + return NULL; > > > > We can probably club together above two if-checks. > > Done > > > == > > > > + > > + strategy > > + > > + > > + This is used for copying the database directory. Currently, we > > have > > + two strategies the WAL_LOG and the > > + FILE_COPY. 
If WAL_LOG > > strategy > > + is used then the database will be copied block by block and it > > will > > + also WAL log each copied block. Otherwise, if FILE_COPY > > > > I think we need to mention the default strategy in the documentation page. > > Done > > -- > Regards, > Dilip Kumar > EnterpriseDB: http://www.enterprisedb.com
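To illustrate the empty-relation point from the review exchange above (bail out before setting up a buffer access strategy when the source fork has no blocks), a minimal sketch of that shape is below. This is not the patch's actual buffered, WAL-logged copy path; it only shows where the early return sits, using plain smgr calls to keep the example short.

#include "postgres.h"
#include "storage/smgr.h"

/* Illustrative only: copy one fork block by block, skipping empty forks. */
static void
copy_fork_sketch(SMgrRelation src, SMgrRelation dst, ForkNumber forknum)
{
	PGAlignedBlock buf;
	BlockNumber nblocks = smgrnblocks(src, forknum);
	BlockNumber blkno;

	/* Nothing to copy: no need to pick a buffer strategy or loop at all. */
	if (nblocks == 0)
		return;

	for (blkno = 0; blkno < nblocks; blkno++)
	{
		smgrread(src, forknum, blkno, buf.data);
		smgrextend(dst, forknum, blkno, buf.data, false);
	}
}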
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
On Thu, Mar 10, 2022 at 10:18 PM Robert Haas wrote: > > On Thu, Mar 10, 2022 at 6:02 AM Dilip Kumar wrote: > > I have completely changed the logic for this refactoring. Basically, > > write_relmap_file(), is already having parameters to control whether > > to write wal, send inval and we are already passing the dbpath. > > Instead of making a new function I just pass one additional parameter > > to this function itself about whether we are creating a new map or not > > and I think with that changes are very less and this looks cleaner to > > me. Similarly for load_relmap_file() also I just had to pass the > > dbpath and memory for destination map. Please have a look and let me > > know your thoughts. > > It's not terrible, but how about something like the attached instead? > I think this has the effect of reducing the number of cases that the > low-level code needs to know about from 2 to 1, instead of making it > go up from 2 to 3. > Looks better, but why do you want to pass elevel to the load_relmap_file()? Are we calling this function from somewhere other than read_relmap_file()? If not, do we have any plans to call this function directly bypassing read_relmap_file for any upcoming patch? -- With Regards, Ashutosh Sharma.
Re: pg_walinspect - a new extension to get raw WAL data and WAL stats
On Fri, Mar 11, 2022 at 8:22 AM Kyotaro Horiguchi wrote: > > Sorry, some minor non-syntactical corrections. > > At Fri, 11 Mar 2022 11:38:22 +0900 (JST), Kyotaro Horiguchi > wrote in > > I played with this a bit, and would like to share some thoughts on it. > > > > It seems to me too rigorous that pg_get_wal_records_info/stats() > > reject future LSNs as end-LSN and I think WARNING or INFO and stop at > > the real end-of-WAL is more kind to users. I think the same with the > > restriction that start and end LSN are required to be different. > > > > The definition of end-lsn is fuzzy here. If I fed a future LSN to the > > functions, they tell me the beginning of the current insertion point > > in error message. On the other hand they don't accept the same > > value as end-LSN. I think it is right that they tell the current > > insertion point and they should take the end-LSN as the LSN to stop > > reading. > > > > I think pg_get_wal_stats() is worth to have but I think it should be > > implemented in SQL. Currently pg_get_wal_records_info() doesn't tell > > about FPI since pg_waldump doesn't but it is internally collected (of > > course!) and easily revealed. If we do that, the > > pg_get_wal_records_stats() would be reduced to the following SQL > > statement > > > > SELECT resource_manager resmgr, > >count(*) AS N, > > (count(*) * 100 / sum(count(*)) OVER tot)::numeric(5,2) AS "%N", > > sum(total_length) AS "combined size", > > (sum(total_length) * 100 / sum(sum(total_length)) OVER > > tot)::numeric(5,2) AS "%combined size", > > sum(fpi_len) AS fpilen, > > (sum(fpi_len) * 100 / sum(sum(fpi_len)) OVER tot)::numeric(5,2) AS > > "%fpilen" > > FROM pg_get_wal_records_info('0/100', '0/175DD7f') > > GROUP by resource_manager > > WINDOW tot AS () > > ORDER BY "combined size" desc; > > > > The only difference with pg_waldump is the statement above doesn't > > show lines for the resource managers that don't contained in the > > result of pg_get_wal_records_info(). But I don't think that matters. > > > > > > Sometimes the field description has very long (28kb long) content. It > > makes the result output almost unreadable and I had a bit hard time > > struggling with the output full of '-'s. I would like have a default > > limit on the length of such fields that can be long but I'm not sure > > we want that. > > > > > - The difference between pg_get_wal_record_info and _records_ other than > - the number of argument is the former accepts incorrect LSNs. > > The discussion is somewhat confused after some twists and turns.. It > should be something like the following. > > pg_get_wal_record_info and pg_get_wal_records_info are almost same > since the latter can show a single record. However it is a bit > annoying to do that. Since, other than it doens't accept same LSNs for > start and end, it doesn't show a record when there' no record in the > specfied LSN range. But I don't think there's no usefulness of the > behavior. > > The following works, > pg_get_wal_record_info('0/100'); This does work but it doesn't show any WARNING message for the start pointer adjustment. I think it should. > pg_get_wal_records_info('0/100'); > I think this is fine. It should be working because the user hasn't specified the end pointer so we assume the default end pointer is end-of-WAL. > but this doesn't > pg_get_wal_records_info('0/1000000', '0/100'); > > ERROR: WAL start LSN must be less than end LSN > I think this behaviour is fine. We cannot have the same start and end lsn pointers. 
> And the following shows no records.
> pg_get_wal_records_info('0/100', '0/101');
> pg_get_wal_records_info('0/100', '0/128');
>

I think we should be erroring out here, saying something like "couldn't find any valid WAL record between the given start and end LSN", because there exist no valid WAL records between the specified start and end LSN pointers.

-- With Regards, Ashutosh Sharma.
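Summarizing the cases discussed above in one place (behaviour as proposed and observed for the patch under review, not necessarily the final API):

-- Works: a single record at (or just after) the given LSN
SELECT * FROM pg_get_wal_record_info('0/1000000');

-- Works: no end LSN given, read up to the end of WAL
SELECT * FROM pg_get_wal_records_info('0/1000000');

-- Rejected today: the start LSN must be strictly less than the end LSN
--   ERROR:  WAL start LSN must be less than end LSN
SELECT * FROM pg_get_wal_records_info('0/2000000', '0/1000000');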
Re: pg_walinspect - a new extension to get raw WAL data and WAL stats
On Fri, Mar 11, 2022 at 8:22 AM Kyotaro Horiguchi wrote: > > - The difference between pg_get_wal_record_info and _records_ other than > - the number of argument is the former accepts incorrect LSNs. > > The discussion is somewhat confused after some twists and turns.. It > should be something like the following. > > pg_get_wal_record_info and pg_get_wal_records_info are almost same > since the latter can show a single record. However it is a bit > annoying to do that. Since, other than it doens't accept same LSNs for > start and end, it doesn't show a record when there' no record in the > specfied LSN range. But I don't think there's no usefulness of the > behavior. > So, do you want the pg_get_wal_record_info function to be removed as we can use pg_get_wal_records_info() to achieve what it does? -- With Regards, Ashutosh Sharma.
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
You may also need to add documentation to app-createdb.sgml. Currently you have just added to create_database.sgml. Also, I had a quick look at the new changes done in v13-0005-WAL-logged-CREATE-DATABASE.patch and they seemed fine to me although I haven't put much emphasis on the comments and other cosmetic stuff. -- With Regards, Ashutosh Sharma. On Fri, Mar 11, 2022 at 3:51 PM Dilip Kumar wrote: > > On Fri, Mar 11, 2022 at 11:52 AM Dilip Kumar wrote: > > > > On Thu, Mar 10, 2022 at 10:18 PM Robert Haas wrote: > > > > > > On Thu, Mar 10, 2022 at 6:02 AM Dilip Kumar wrote: > > > > I have completely changed the logic for this refactoring. Basically, > > > > write_relmap_file(), is already having parameters to control whether > > > > to write wal, send inval and we are already passing the dbpath. > > > > Instead of making a new function I just pass one additional parameter > > > > to this function itself about whether we are creating a new map or not > > > > and I think with that changes are very less and this looks cleaner to > > > > me. Similarly for load_relmap_file() also I just had to pass the > > > > dbpath and memory for destination map. Please have a look and let me > > > > know your thoughts. > > > > > > It's not terrible, but how about something like the attached instead? > > > I think this has the effect of reducing the number of cases that the > > > low-level code needs to know about from 2 to 1, instead of making it > > > go up from 2 to 3. > > > > Yeah this looks cleaner, I will rebase the remaining patch. > > Here is the updated version of the patch set. > > Changes, 1) it take Robert's patch as first refactoring patch 2) > Rebase other new relmapper apis on top of that in 0002 3) Some code > refactoring in main patch 0005 and also one problem fix, earlier in > wal log method I have removed ForceSyncCommit(), but IMHO that is > equally valid whether we use file_copy or wal_log because in both > cases we are creating the disk files. 4) Support strategy in createdb > tool and add test case as part of 0006. > > -- > Regards, > Dilip Kumar > EnterpriseDB: http://www.enterprisedb.com
Re: pg_walinspect - a new extension to get raw WAL data and WAL stats
Some comments on pg_walinspect-docc.patch this time: + + + pg_get_wal_record_info(in_lsn pg_lsn, lsn OUT pg_lsn, prev_lsn OUT pg_lsn, xid OUT xid, resource_manager OUT text, length OUT int4, total_length OUT int4, description OUT text, block_ref OUT text, data OUT bytea, data_len OUT int4) + You may shorten this by mentioning just the function input parameters and specify "returns record" like shown below. So no need to specify all the OUT params. pg_get_wal_record_info(in_lsn pg_lsn) returns record. Please check the documentation for other functions for reference. == + + pg_get_wal_records_info(start_lsn pg_lsn, end_lsn pg_lsn DEFAULT NULL, lsn OUT pg_lsn, prev_lsn OUT pg_lsn, xid OUT xid, resource_manager OUT text, length OUT int4, total_length OUT int4, description OUT text, block_ref OUT text, data OUT bytea, data_len OUT int4) + Same comment applies here as well. In the return type you can just mention - "returns setof record" like shown below: pg_get_wal_records_info(start_lsn pg_lsn, end_lsn pg_lsn) returns setof records. You may also check for such optimizations at other places. I might have missed some. == + +postgres=# select prev_lsn, xid, resource_manager, length, total_length, block_ref from pg_get_wal_records_info('0/158A7F0', '0/1591400'); + prev_lsn | xid | resource_manager | length | total_length | block_ref +---+-+--++--+-- + 0/158A7B8 | 735 | Heap | 54 | 7838 | blkref #0: rel 1663/5/2619 blk 18 (FPW); hole: offset: 88, length: 408 + 0/158A7F0 | 735 | Btree| 53 | 8133 | blkref #0: rel 1663/5/2696 blk 1 (FPW); hole: offset: 1632, length: 112 + 0/158C6A8 | 735 | Heap | 53 | 873 | blkref #0: rel 1663/5/1259 blk 0 (FPW); hole: offset: 212, length: 7372 Instead of specifying column names in the targetlist I think it's better to use "*" so that it will display all the output columns. Also you may shorten the gap between start and end lsn to reduce the output size. == Any reason for not specifying author name in the .sgml file. Do you want me to add my name to the author? :) Ashutosh Sharma ashu.coe...@gmail.com -- With Regards, Ashutosh Sharma. On Thu, Mar 10, 2022 at 10:15 PM Bharath Rupireddy wrote: > > On Thu, Mar 10, 2022 at 1:52 PM Jeff Davis wrote: > > > > On Wed, 2022-03-02 at 22:37 +0530, Bharath Rupireddy wrote: > > > > > > Attaching v6 patch set with above review comments addressed. Please > > > review it further. > > Thanks Jeff for reviewing it. I've posted the latest v7 patch-set > upthread [1] which is having more simple-yet-useful-and-effective > functions. > > > * Don't issue WARNINGs or other messages for ordinary situations, like > > when pg_get_wal_records_info() hits the end of WAL. > > v7 patch-set [1] has no warnings, but the functions will error out if > future LSN is specified. > > > * It feels like the APIs that allow waiting for the end of WAL are > > slightly off. Can't you just do pg_get_wal_records_info(start_lsn, > > least(pg_current_wal_flush_lsn(), end_lsn)) if you want the non-waiting > > behavior? Try to make the API more orthogonal, where a few basic > > functions can be combined to give you everything you need, rather than > > specifying extra parameters and issuing WARNINGs. I > > v7 patch-set [1] onwards waiting mode has been removed for all of the > functions, again to keep things simple-yet-useful-and-effective. > However, we can always add new pg_walinspect functions that wait for > future WAL in the next versions once basic stuff gets committed and if > many users ask for it. > > > * In the docs, include some example output. 
I don't see any output in > > the tests, which makes sense because it's mostly non-deterministic, but > > it would be helpful to see sample output of at least > > pg_get_wal_records_info(). > > +1. Added for pg_get_wal_records_info and pg_get_wal_stats. > > > * Is pg_get_wal_stats() even necessary, or can you get the same > > information with a query over pg_get_wal_records_info()? For instance, > > if you want to group by transaction ID rather than rmgr, then > > pg_get_wal_stats() is useless. > > Yes, you are right pg_get_wal_stats provides WAL stats per resource > manager which is similar to pg_waldump with --start, --end and --stats > option. It provides more information than pg_get_wal_records_info and > is a good way of getting stats than adding more columns to > pg_get_wal_records_info, calculating percentage in sql and having > group by clause. IMO, pg_ge
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
On Fri, Mar 11, 2022 at 8:21 PM Robert Haas wrote: > > On Fri, Mar 11, 2022 at 12:15 AM Ashutosh Sharma > wrote: > > Looks better, but why do you want to pass elevel to the > > load_relmap_file()? Are we calling this function from somewhere other > > than read_relmap_file()? If not, do we have any plans to call this > > function directly bypassing read_relmap_file for any upcoming patch? > > If it fails during CREATE DATABASE, it should be ERROR, not FATAL. In > that case, we only need to stop trying to create a database; we don't > need to terminate the session. On the other hand if we can't read our > own database's relmap files, that's an unrecoverable error, because we > will not be able to run any queries at all, so FATAL is appropriate. > OK. I can see it being used in the v13 patch. In the previous patches it was hard-coded with FATAL. Also, we simply error out when doing file copy as I can see in the copy_file function. So yes FATAL is not the right option to use when creating a database. Thanks. -- With Regards, Ashutosh Sharma.
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
Few comments on the latest patch: - /* We need to construct the pathname for this database */ - dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid); + if (xlrec->dbid != InvalidOid) + dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid); + else + dbpath = pstrdup("global"); Do we really need this change? Is GetDatabasePath() alone not capable of handling it? == +static CreateDBRelInfo *ScanSourceDatabasePgClassTuple(HeapTupleData *tuple, + Oid tbid, Oid dbid, + char *srcpath); +static List *ScanSourceDatabasePgClassPage(Page page, Buffer buf, Oid tbid, + Oid dbid, char *srcpath, + List *rnodelist, Snapshot snapshot); +static List *ScanSourceDatabasePgClass(Oid srctbid, Oid srcdbid, char *srcpath); I think we can shorten these function names to probably ScanSourceDBPgClassRel(), ScanSourceDBPgClassTuple() and likewise? -- With Regards, Ashutosh Sharma. On Tue, Mar 15, 2022 at 3:24 PM Dilip Kumar wrote: > > On Mon, Mar 14, 2022 at 10:04 PM Robert Haas wrote: > > > I think it would make sense to have two different WAL records e.g. > > XLOG_DBASE_CREATE_WAL_LOG and XLOG_DBASE_CREATE_FILE_COPY. Then it's > > easy to see how this could be generalized to other strategies in the > > future. > > Done that way. In dbase_desc(), for XLOG_DBASE_CREATE_FILE_COPY I > have kept the older description i.e. "copy dir" and for > XLOG_DBASE_CREATE_WAL_LOG it is "create dir", because logically the > first one is actually copying and the second one is just creating the > directory. Do you think we should be using "copy dir file_copy" and > "copy dir wal_log" in the description as well? > > > On Mon, Mar 14, 2022 at 12:04 PM Robert Haas wrote: > > > I was looking at 0001 and 0002 again and realized that I swapped the > > > names load_relmap_file() and read_relmap_file() from what I should > > > have done. Here's a revised version. With this, read_relmap_file() and > > > write_relmap_file() become functions that just read and write the file > > > without touching any global variables, and load_relmap_file() is the > > > function that reads data from the file and puts it into a global > > > variable, which seems more sensible than the way I had it before. > > Okay, I have included this patch and rebased other patches on top of that. > > > Regarding 0003 and 0005, I'm not a fan of 'bool isunlogged'. I think > > 'bool permanent' would be better (note BM_PERMANENT). This would > > involve reversing true and false. > > Okay changed. > > > Regarding 0005: > > > > + CREATEDB_WAL_LOG = 0, > > + CREATEDB_FILE_COPY = 1 > > > > I still think you don't need = 0 and = 1 here. > > Done > > > I'll probably go through and do a pass over the comments once you post > > the next version of this. There seems to be work needed in a bunch of > > places, but it probably makes more sense for me to go through and > > adjust the things that seem to need it rather than listing a bunch of > > changes for you to make. > > Sure, thanks. > > -- > Regards, > Dilip Kumar > EnterpriseDB: http://www.enterprisedb.com
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
On Tue, Mar 15, 2022 at 10:17 PM Robert Haas wrote: > > On Tue, Mar 15, 2022 at 12:30 PM Ashutosh Sharma > wrote: > > > > Few comments on the latest patch: > > > > - /* We need to construct the pathname for this database */ > > - dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid); > > + if (xlrec->dbid != InvalidOid) > > + dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid); > > + else > > + dbpath = pstrdup("global"); > > > > Do we really need this change? Is GetDatabasePath() alone not capable > > of handling it? > > Well, I mean, that function has a special case for > GLOBALTABLESPACE_OID, but GLOBALTABLESPACE_OID is 1664, and InvalidOid > is 0. > Wouldn't this be true only in case of a shared map file (when dbOid is Invalid and tblspcOid is globaltablespace_oid) or am I missing something? -- With Regards, Ashutosh Sharma.
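For reference, GetDatabasePath() picks the path roughly as follows (an approximation from memory, not the exact source, but it shows why GLOBALTABLESPACE_OID is the only special case and InvalidOid is not):

    /* approximate behaviour of GetDatabasePath(dbNode, spcNode) */
    if (spcNode == GLOBALTABLESPACE_OID)        /* 1664, shared catalogs */
        return pstrdup("global");               /* dbNode is expected to be 0 here */
    else if (spcNode == DEFAULTTABLESPACE_OID)
        return psprintf("base/%u", dbNode);
    else
        return psprintf("pg_tblspc/%u/%s/%u",
                        spcNode, TABLESPACE_VERSION_DIRECTORY, dbNode);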
Re: pg_walinspect - a new extension to get raw WAL data and WAL stats
I can see that the pg_get_wal_records_info function shows the details of the WAL record whose existence is beyond the user specified stop/end lsn pointer. See below: ashu@postgres=# select * from pg_get_wal_records_info('0/0128', '0/0129'); -[ RECORD 1 ]+-- start_lsn| 0/128 end_lsn | 0/19F prev_lsn | 0/0 xid | 0 resource_manager | XLOG record_length| 114 fpi_length | 0 description | CHECKPOINT_SHUTDOWN redo 0/128; tli 1; prev tli 1; fpw true; xid 0:3; oid 1; multi 1; offset 0; oldest xid 3 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown block_ref| data_length | 88 data | \x280101000100010003001027010003000100010001007255a5c43162ff7f In this case, the end lsn pointer specified by the user is '0/0129'. There is only one WAL record which starts before this specified end lsn pointer whose start pointer is at 0128, but that WAL record ends at 0/19F which is way beyond the specified end lsn. So, how come we are able to display the complete WAL record info? AFAIU, end lsn is the lsn pointer where you need to stop reading the WAL data. If that is true, then there exists no valid WAL record between the start and end lsn in this particular case. -- With Regards, Ashutosh Sharma. On Wed, Mar 16, 2022 at 7:56 PM Stephen Frost wrote: > > Greetings, > > * Bharath Rupireddy (bharath.rupireddyforpostg...@gmail.com) wrote: > > On Tue, Mar 15, 2022 at 7:21 AM Bharath Rupireddy > > wrote: > > > > > > On Mon, Mar 14, 2022 at 8:25 PM Stephen Frost wrote: > > > > > > > > > As this patch is currently written, pg_monitor has access these > > > > > functions, though I don't think that's the right privilege level at > > > > > least for pg_get_raw_wal_record(). > > > > > > > > Yeah, I agree that pg_monitor isn't the right thing for such a function > > > > to be checking. > > > > > > On Thu, Mar 10, 2022 at 1:52 PM Jeff Davis wrote: > > > > > > > > * pg_get_raw_wal_record() seems too powerful for pg_monitor. Maybe that > > > > function should require pg_read_server_files? Or at least > > > > pg_read_all_data? > > > > > > The v9 patch set posted at [1] grants execution on > > > pg_get_raw_wal_record() to the pg_monitor role. > > > > > > pg_read_all_data may not be the right choice, but pg_read_server_files > > > is as these functions do read the WAL files on the server. If okay, > > > I'm happy to grant execution on pg_get_raw_wal_record() to the > > > pg_read_server_files role. > > > > > > Thoughts? > > > > > > [1] > > > https://www.postgresql.org/message-id/CALj2ACVRH-z8mZLyFkpLvY4qRhxQCqU_BLkFTtwt%2BTPZNhfEVg%40mail.gmail.com > > > > Attaching v10 patch set which allows pg_get_raw_wal_record to be > > executed by either superuser or users with pg_read_server_files role, > > no other change from v9 patch set. > > In a quick look, that seems reasonable to me. If folks want to give out > access to this function individually they're also able to do so, which > is good. Doesn't seem worthwhile to introduce a new predefined role for > this one function. > > Thanks, > > Stephen
MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8
Hi All,

Today while working on some other task related to database encoding, I noticed that the MINUS SIGN (with byte sequence a1-dd) in EUC-JP is mapped to FULLWIDTH HYPHEN-MINUS (with byte sequence ef-bc-8d) in UTF-8. See below:

postgres=# select convert('\xa1dd', 'euc_jp', 'utf8');
 convert
----------
 \xefbc8d
(1 row)

Isn't this a bug? Shouldn't this have been converted to the MINUS SIGN (with byte sequence e2-88-92) in UTF-8 instead of FULLWIDTH HYPHEN-MINUS SIGN?

When the MINUS SIGN (with byte sequence e2-88-92) in UTF-8 is converted to EUC-JP, the convert function fails with an error saying: "character with byte sequence 0xe2 0x88 0x92 in encoding UTF8 has no equivalent in encoding EUC_JP". See below:

postgres=# select convert('\xe28892', 'utf-8', 'euc_jp');
ERROR:  character with byte sequence 0xe2 0x88 0x92 in encoding "UTF8" has no equivalent in encoding "EUC_JP"

However, when the same MINUS SIGN in UTF-8 is converted to SJIS encoding, the convert function returns the correct result. See below:

postgres=# select convert('\xe28892', 'utf-8', 'sjis');
 convert
---------
 \x817c
(1 row)

Please note that the byte sequence (81-7c) in SJIS represents the MINUS SIGN in SJIS, which means the MINUS SIGN in UTF8 got converted to the MINUS SIGN in SJIS, and that is what we expect. Isn't it?

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8
On Fri, Oct 30, 2020 at 8:49 AM Kyotaro Horiguchi wrote: > > Hello. > > At Fri, 30 Oct 2020 06:13:53 +0530, Ashutosh Sharma > wrote in > > Hi All, > > > > Today while working on some other task related to database encoding, I > > noticed that the MINUS SIGN (with byte sequence a1-dd) in EUC-JP is > > mapped to FULLWIDTH HYPHEN-MINUS (with byte sequence ef-bc-8d) in > > UTF-8. See below: > > > > postgres=# select convert('\xa1dd', 'euc_jp', 'utf8'); > > convert > > -- > > \xefbc8d > > (1 row) > > > > Isn't this a bug? Shouldn't this have been converted to the MINUS SIGN > > (with byte sequence e2-88-92) in UTF-8 instead of FULLWIDTH > > HYPHEN-MINUS SIGN. > > No it's not a bug, but a well-known "design":( > > The mapping is generated from CP932.TXT and JIS0212.TXT by > UCS_to_UEC_JP.pl. > > CP932.TXT used here is here. > > https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT > > CP932.TXT maps 0x817C(SJIS) = 0xA1DD(EUC-JP) as follows. > > 0x817C 0xFF0D #FULLWIDTH HYPHEN-MINUS > We do have MINUS SIGN (U+2212) defined in both UTF-8 and EUC-JP encoding. So, not sure why converting MINUS SIGN from UTF-8 to EUC-JP should throw an error saying: "... in encoding UTF8 has *no* equivalent in EUC_JP". I mean this information looks misleading and that's I reason I feel its a bug. > > When the MINUS SIGN (with byte sequence e2-88-92) in UTF-8 is > > converted to EUC-JP, the convert functions fails with an error saying: > > "character with byte sequence 0xe2 0x88 0x92 in encoding UTF8 has no > > equivalent in encoding EUC_JP". See below: > > > > postgres=# select convert('\xe28892', 'utf-8', 'euc_jp'); > > ERROR: character with byte sequence 0xe2 0x88 0x92 in encoding "UTF8" > > has no equivalent in encoding "EUC_JP" > > U+FF0D(ef bc 8d) is mapped to 0xa1dd@euc-jp > U+2212(e2 88 92) doesn't have a mapping between euc-jp. > > > However, when the same MINUS SIGN in UTF-8 is converted to SJIS > > encoding, the convert function returns the correct result. See below: > > > > postgres=# select convert('\xe28892', 'utf-8', 'sjis'); > > convert > > - > > \x817c > > (1 row) > > It is manually added by UCS_to_SJIS.pl. I'm not sure about the reason > but maybe because it was used widely. > > So ping-pong between Unicode and SJIS behaves like this: > > U+2212 => 0x817c@sjis => U+ff0d => 0x817c@sjis ... > > > Please note that the byte sequence (81-7c) in SJIS represents MINUS > > SIGN in SJIS which means the MINUS SIGN in UTF8 got converted to the > > MINUS SIGN in SJIS and that is what we expect. Isn't it? > > I think we don't change authoritative mappings, but maybe can add some > one-way conversions for the convenience. > > regards. > > -- > Kyotaro Horiguchi > NTT Open Source Software Center
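To make the ping-pong concrete, a small sketch (the expected results are inferred from the mappings described above, so treat them as approximate rather than verified output):

    select convert('\xe28892', 'utf8', 'sjis');   -- expected: \x817c  (U+2212 -> SJIS minus sign)
    select convert('\x817c', 'sjis', 'utf8');     -- expected: \xefbc8d (comes back as U+FF0D, not U+2212)
    select convert('\xefbc8d', 'utf8', 'sjis');   -- expected: \x817c  (U+FF0D -> SJIS minus sign again)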
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
I see that this patch is reducing the database creation time by almost 3-4 times provided that the template database has some user data in it. However, there are couple of points to be noted: 1) It makes the crash recovery a bit slower than before if the crash has occurred after the execution of a create database statement. Moreover, if the template database size is big, it might even generate a lot of WAL files which the user needs to be aware of. 2) This will put a lot of load on the first checkpoint that will occur after creating the database statement. I will experiment around this to see if this has any side effects. -- Further, the code changes in the patch looks good. I just have few comments: +void +LockRelationId(LockRelId *relid, LOCKMODE lockmode) +{ + LOCKTAG tag; + LOCALLOCK *locallock; + LockAcquireResult res; + + SET_LOCKTAG_RELATION(tag, relid->dbId, relid->relId); Should there be an assertion statement here to ensure that relid->dbid and relid->relid is valid? -- if (info == XLOG_DBASE_CREATE) { xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record); - char *src_path; - char *dst_path; - struct stat st; - - src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id); - dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id); + char *dbpath; - /* -* Our theory for replaying a CREATE is to forcibly drop the target -* subdirectory if present, then re-copy the source data. This may be -* more work than needed, but it is simple to implement. -*/ - if (stat(dst_path, &st) == 0 && S_ISDIR(st.st_mode)) - { - if (!rmtree(dst_path, true)) - /* If this failed, copydir() below is going to error. */ - ereport(WARNING, - (errmsg("some useless files may be left behind in old database directory \"%s\"", - dst_path))); - } I think this is a significant change and probably needs some kind of explanation/comments as-in why we are just creating a dir and copying the version file when replaying create database operation. Earlier, this meant replaying the complete create database operation, that doesn't seem to be the case now. -- Have you intentionally skipped pg_internal.init file from being copied to the target database? -- With Regards, Ashutosh Sharma. On Thu, Dec 2, 2021 at 7:20 PM Dilip Kumar wrote: > On Wed, Dec 1, 2021 at 6:04 PM Dilip Kumar wrote: > > > Thanks a lot for testing this. From the error, it seems like some of > > the old buffer w.r.t. the previous tablespace is not dropped after the > > movedb. Actually, we are calling DropDatabaseBuffers() after copying > > to a new tablespace and dropping all the buffers of this database > > w.r.t the old tablespace. But seems something is missing, I will > > reproduce this and try to fix it by tomorrow. I will also fix the > > other review comments raised by you in the previous mail. > > Okay, I got the issue, basically we are dropping the database buffers > but not unregistering the existing sync request for database buffers > w.r.t old tablespace. Attached patch fixes that. I also had to extend > ForgetDatabaseSyncRequests so that we can delete the sync request of > the database for the particular tablespace so added another patch for > the same (0006). > > I will test the performance scenario next week, which is suggested by John. > > > -- > Regards, > Dilip Kumar > EnterpriseDB: http://www.enterprisedb.com >
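To spell the suggested assertion out, the check could be something as simple as the following (a sketch only; if this function is ever meant to be called for shared relations, where dbId is legitimately InvalidOid, the dbId part of the assertion would have to be dropped):

    Assert(OidIsValid(relid->dbId) && OidIsValid(relid->relId));
    SET_LOCKTAG_RELATION(tag, relid->dbId, relid->relId);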
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
On Fri, Dec 3, 2021 at 8:28 PM Dilip Kumar wrote: > On Fri, Dec 3, 2021 at 7:38 PM Ashutosh Sharma > wrote: > > > > I see that this patch is reducing the database creation time by almost > 3-4 times provided that the template database has some user data in it. > However, there are couple of points to be noted: > > Thanks a lot for looking into the patches. > > > > 1) It makes the crash recovery a bit slower than before if the crash has > occurred after the execution of a create database statement. Moreover, if > the template database size is big, it might even generate a lot of WAL > files which the user needs to be aware of. > > Yes it will but actually that is the only correct way to do it, in > current we are just logging the WAL as copying the source directory to > destination directory without really noting down exactly what we > wanted to copy, so we are force to do the checkpoint right after > create database because in crash recovery we can not actually replay > that WAL. Because WAL just say copy the source to destination so it > is very much possible that at the DO time source directory had some > different content than the REDO time so this would have created the > inconsistencies in the crash recovery so to avoid this bug they force > the checkpoint so now also if you do force checkpoint then again crash > recovery will be equally fast. So I would not say that we have made > crash recovery slow but we have removed some bugs and with that now we > don't need to force the checkpoint. Also note that in current code > even with force checkpoint the bug is not completely avoided in all > the cases, check below comments from the code[1]. > > > 2) This will put a lot of load on the first checkpoint that will occur > after creating the database statement. I will experiment around this to see > if this has any side effects. > > But now a checkpoint can happen at its own need and there is no need > to force a checkpoint like it was before patch. > > So the major goal of this patch is 1) Correctly WAL log the create > database which is hack in the current system, 2) Avoid force > checkpoints, 3) We copy page by page so it will support TDE because if > the source and destination database has different encryption then we > can reencrypt the page before copying to destination database, which > is not possible in current system as we are copying directory 4) Now > the new database pages will get the latest LSN which is the correct > things earlier new database pages were getting copied directly with > old LSN only. > OK. Understood, thanks.! -- With Regards, Ashutosh Sharma.
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
Here are few more review comments: 1) It seems that we are not freeing the memory allocated for buf.data in CreateDirAndVersionFile(). -- + */ +static void +CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid, bool isRedo) +{ 2) Do we need to pass dbpath here? I mean why not reconstruct it from dbid and tsid. -- 3) Not sure if this point has already been discussed, Will we be able to recover the data when wal_level is set to minimal because the following condition would be false with this wal level. + use_wal = XLogIsNeeded() && + (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork); -- With Regards, Ashutosh Sharma. On Mon, Dec 6, 2021 at 9:12 AM Ashutosh Sharma wrote: > On Fri, Dec 3, 2021 at 8:28 PM Dilip Kumar wrote: > >> On Fri, Dec 3, 2021 at 7:38 PM Ashutosh Sharma >> wrote: >> > >> > I see that this patch is reducing the database creation time by almost >> 3-4 times provided that the template database has some user data in it. >> However, there are couple of points to be noted: >> >> Thanks a lot for looking into the patches. >> > >> > 1) It makes the crash recovery a bit slower than before if the crash >> has occurred after the execution of a create database statement. Moreover, >> if the template database size is big, it might even generate a lot of WAL >> files which the user needs to be aware of. >> >> Yes it will but actually that is the only correct way to do it, in >> current we are just logging the WAL as copying the source directory to >> destination directory without really noting down exactly what we >> wanted to copy, so we are force to do the checkpoint right after >> create database because in crash recovery we can not actually replay >> that WAL. Because WAL just say copy the source to destination so it >> is very much possible that at the DO time source directory had some >> different content than the REDO time so this would have created the >> inconsistencies in the crash recovery so to avoid this bug they force >> the checkpoint so now also if you do force checkpoint then again crash >> recovery will be equally fast. So I would not say that we have made >> crash recovery slow but we have removed some bugs and with that now we >> don't need to force the checkpoint. Also note that in current code >> even with force checkpoint the bug is not completely avoided in all >> the cases, check below comments from the code[1]. >> >> > 2) This will put a lot of load on the first checkpoint that will occur >> after creating the database statement. I will experiment around this to see >> if this has any side effects. >> >> But now a checkpoint can happen at its own need and there is no need >> to force a checkpoint like it was before patch. >> >> So the major goal of this patch is 1) Correctly WAL log the create >> database which is hack in the current system, 2) Avoid force >> checkpoints, 3) We copy page by page so it will support TDE because if >> the source and destination database has different encryption then we >> can reencrypt the page before copying to destination database, which >> is not possible in current system as we are copying directory 4) Now >> the new database pages will get the latest LSN which is the correct >> things earlier new database pages were getting copied directly with >> old LSN only. >> > > OK. Understood, thanks.! > > -- > With Regards, > Ashutosh Sharma. >
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
Thank you, Dilip for the quick response. I am okay with the changes done in the v7 patch. One last point - If we try to clone a huge database, as expected CREATE DATABASE emits a lot of WALs, causing a lot of intermediate checkpoints which seems to be affecting the performance slightly. -- With Regards, Ashutosh Sharma. On Mon, Dec 6, 2021 at 9:59 AM Dilip Kumar wrote: > On Mon, Dec 6, 2021 at 9:17 AM Ashutosh Sharma > wrote: > > > > Here are few more review comments: > > Thanks for reviewing it. > > > 1) It seems that we are not freeing the memory allocated for buf.data in > CreateDirAndVersionFile(). > > Yeah this was a problem in v6 but I have fixed in v7, can you check that. > > > > + */ > > +static void > > +CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid, bool isRedo) > > +{ > > > > 2) Do we need to pass dbpath here? I mean why not reconstruct it from > dbid and tsid. > > Yeah we can do that but I thought computing dbpath has some cost and > since the caller already has it why not to pass it. > > > > > 3) Not sure if this point has already been discussed, Will we be able to > recover the data when wal_level is set to minimal because the following > condition would be false with this wal level. > > > > + use_wal = XLogIsNeeded() && > > + (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork); > > > > Since we are creating new relfilenode this is fine, refer "Skipping > WAL for New RelFileNode" in src/backend/access/transam/README > > -- > Regards, > Dilip Kumar > EnterpriseDB: http://www.enterprisedb.com >
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
Thanks Robert for sharing your thoughts. On Mon, Dec 6, 2021 at 11:16 PM Robert Haas wrote: > On Mon, Dec 6, 2021 at 9:23 AM Ashutosh Sharma > wrote: > > One last point - If we try to clone a huge database, as expected CREATE > DATABASE emits a lot of WALs, causing a lot of intermediate checkpoints > which seems to be affecting the performance slightly. > > Yes, I think this needs to be characterized better. If you have a big > shared buffers setting and a lot of those buffers are dirty and the > template database is small, all of which is fairly normal, then this > new approach should be much quicker. On the other hand, what if the > situation is reversed? Perhaps you have a small shared buffers and not > much of it is dirty and the template database is gigantic. Then maybe > this new approach will be slower. But right now I think we don't know > where the crossover point is, and I think we should try to figure that > out. > Yes I think so too. > > So for example, imagine tests with 1GB of shard_buffers, 8GB, and > 64GB. And template databases with sizes of whatever the default is, > 1GB, 10GB, 100GB. Repeatedly make 75% of the pages dirty and then > create a new database from one of the templates. And then just measure > the performance. Maybe for large databases this approach is just > really the pits -- and if your max_wal_size is too small, it > definitely will be. But, I don't know, maybe with reasonable settings > it's not that bad. Writing everything to disk twice - once to WAL and > once to the target directory - has to be more expensive than doing it > once. But on the other hand, it's all sequential I/O and the data > pages don't need to be fsync'd, so perhaps the overhead is relatively > mild. I don't know. > So far, I haven't found much performance overhead with a few gb of data in the template database. It's just a bit with the default settings, perhaps setting a higher value of max_wal_size would reduce this overhead. -- With Regards, Ashutosh Sharma.
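Concretely, the tuning that "a higher value of max_wal_size" refers to is something like the following before cloning a large template database (illustrative values only, not a measured recommendation):

    ALTER SYSTEM SET max_wal_size = '10GB';
    SELECT pg_reload_conf();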
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
Hi, The issue here is that we are trying to create a table that exists inside a non-default tablespace when doing ALTER DATABASE. I think this should be skipped otherwise we will come across the error like shown below: ashu@postgres=# alter database test set tablespace pg_default; ERROR: 58P02: could not create file "pg_tblspc/16385/PG_15_202111301/16386/16390": File exists I have taken the above from Neha's test-case. -- Attached patch fixes this. I am passing a new boolean flag named *movedb* to CopyDatabase() so that it could skip the creation of tables existing in non-default tablespace when doing alter database. Alternatively, we can also rename the boolean flag movedb to createdb and pass its value accordingly from movedb() or createdb(). Either way looks fine to me. Kindly check the attached patch for the changes. Dilip, Could you please check the attached patch and let me know if it looks fine or not? Neha, can you please re-run the test-cases with the attached patch. Thanks, -- With Regards, Ashutosh Sharma. On Thu, Dec 9, 2021 at 8:43 AM Neha Sharma wrote: > > > > On Thu, Dec 9, 2021 at 4:26 AM Greg Nancarrow wrote: > >> On Thu, Dec 9, 2021 at 6:57 AM Neha Sharma >> wrote: >> > >> > While testing the v7 patches, I am observing a crash with the below >> test case. >> > >> > Test case: >> > create tablespace tab location '/test_dir'; >> > create tablespace tab1 location '/test_dir1'; >> > create database test tablespace tab; >> > \c test >> > create table t( a int PRIMARY KEY,b text); >> > CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS >> 'select array_agg(md5(g::text))::text from generate_series(1, 256) g'; >> > insert into t values (generate_series(1,200), large_val()); >> > alter table t set tablespace tab1 ; >> > \c postgres >> > create database test1 template test; >> > alter database test set tablespace pg_default; >> > alter database test set tablespace tab; >> > \c test1 >> > alter table t set tablespace tab; >> > >> > Logfile says: >> > 2021-12-08 23:31:58.855 +04 [134252] PANIC: could not fsync file >> "base/16386/4152": No such file or directory >> > 2021-12-08 23:31:59.398 +04 [134251] LOG: checkpointer process (PID >> 134252) was terminated by signal 6: Aborted >> > >> >> I tried to reproduce the issue using your test scenario, but I needed >> to reduce the amount of inserted data (so reduced 200 to 2) >> due to disk space. >> I then consistently get an error like the following: >> >> postgres=# alter database test set tablespace pg_default; >> ERROR: could not create file >> "pg_tblspc/16385/PG_15_202111301/16386/36395": File exists >> >> (this only happens when the patch is used) >> >> > Yes, I was also getting this, and moving further we get a crash when we > alter the table of database test1. > Below is the output of the test at my end. > > postgres=# create tablespace tab1 location > '/home/edb/PGsources/postgresql/inst/bin/rep_test1'; > CREATE TABLESPACE > postgres=# create tablespace tab location > '/home/edb/PGsources/postgresql/inst/bin/rep_test'; > CREATE TABLESPACE > postgres=# create database test tablespace tab; > CREATE DATABASE > postgres=# \c test > You are now connected to database "test" as user "edb". 
> test=# create table t( a int PRIMARY KEY,b text); > CREATE TABLE > test=# CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS > 'select array_agg(md5(g::text))::text from generate_series(1, 256) g'; > CREATE FUNCTION > test=# insert into t values (generate_series(1,200), large_val()); > INSERT 0 200 > test=# alter table t set tablespace tab1 ; > ALTER TABLE > test=# \c postgres > You are now connected to database "postgres" as user "edb". > postgres=# create database test1 template test; > CREATE DATABASE > postgres=# alter database test set tablespace pg_default; > ERROR: could not create file > "pg_tblspc/16384/PG_15_202111301/16386/2016395": File exists > postgres=# alter database test set tablespace tab; > ALTER DATABASE > postgres=# \c test1 > You are now connected to database "test1" as user "edb". > test1=# alter table t set tablespace tab; > WARNING: terminating connection because of crash of another server process > DETAIL: The postmaster has commanded this server process to roll back the > current transaction and exit, because another server process exited > abnormally and possibly corrupted shared memory. > HINT: In a moment you should be able to reconnect to the database and > repeat your command. > server closed the connection unexpectedly > This probably means the server terminated abnormally > before or while processing the request. > The connection to the server was lost. Attempting reset: Failed. > !?> > >> >> Regards, >> Greg Nancarrow >> Fujitsu Australia >> > skip-table-creation-for-alter-database.patch Description: Binary data
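The gist of the attached patch, using hypothetical names since the final form may differ, is a skip like the following while walking the source database's pg_class entries:

    /* hypothetical sketch of the skip added for ALTER DATABASE ... SET TABLESPACE */
    if (movedb && OidIsValid(classForm->reltablespace))
        continue;   /* relation lives outside the db's default tablespace; movedb() must not copy it */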
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
On Thu, Dec 9, 2021 at 7:23 PM Neha Sharma wrote: > On Thu, Dec 9, 2021 at 11:12 AM Ashutosh Sharma > wrote: > >> Hi, >> >> The issue here is that we are trying to create a table that exists inside >> a non-default tablespace when doing ALTER DATABASE. I think this should be >> skipped otherwise we will come across the error like shown below: >> >> ashu@postgres=# alter database test set tablespace pg_default; >> ERROR: 58P02: could not create file >> "pg_tblspc/16385/PG_15_202111301/16386/16390": File exists >> > > Thanks Ashutosh for the patch, the mentioned issue has been resolved with > the patch. > > But I am still able to reproduce the crash consistently on top of this > patch + v7 patches,just the test case has been modified. > > create tablespace tab1 location '/test1'; > create tablespace tab location '/test'; > create database test tablespace tab; > \c test > create table t( a int PRIMARY KEY,b text); > CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS > 'select array_agg(md5(g::text))::text from generate_series(1, 256) g'; > insert into t values (generate_series(1,10), large_val()); > alter table t set tablespace tab1 ; > \c postgres > create database test1 template test; > \c test1 > alter table t set tablespace tab; > \c postgres > alter database test1 set tablespace tab1; > > --Cancel the below command after few seconds > alter database test1 set tablespace pg_default; > > \c test1 > alter table t set tablespace tab1; > > > Logfile Snippet: > 2021-12-09 17:49:18.110 +04 [18151] PANIC: could not fsync file > "base/116398/116400": No such file or directory > 2021-12-09 17:49:19.105 +04 [18150] LOG: checkpointer process (PID 18151) > was terminated by signal 6: Aborted > 2021-12-09 17:49:19.105 +04 [18150] LOG: terminating any other active > server processes > This is different from the issue you raised earlier. As Dilip said, we need to unregister sync requests for files that got successfully copied to the target database, but the overall alter database statement failed. We are doing this when the database is created successfully, but not when it fails. Probably doing the same inside the cleanup function movedb_failure_callback() should fix the problem. -- With Regards, Ashutosh Sharma.
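In other words, roughly this kind of cleanup in movedb_failure_callback() (both functions exist today; whether a tablespace-aware variant is needed, as was done earlier for the createdb path, is a separate question):

    /* sketch: mirror the success-path cleanup in the movedb failure callback */
    DropDatabaseBuffers(db_id);           /* drop buffers of the partially copied files */
    ForgetDatabaseSyncRequests(db_id);    /* and cancel their pending fsync requests */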
Re: WIP: WAL prefetch (another approach)
Hi Thomas, I am unable to apply these new set of patches on HEAD. Can you please share the rebased patch or if you have any work branch can you please point it out, I will refer to it for the changes. -- With Regards, Ashutosh sharma. On Tue, Nov 23, 2021 at 3:44 PM Thomas Munro wrote: > On Mon, Nov 15, 2021 at 11:31 PM Daniel Gustafsson > wrote: > > Could you post an updated version of the patch which is for review? > > Sorry for taking so long to come back; I learned some new things that > made me want to restructure this code a bit (see below). Here is an > updated pair of patches that I'm currently testing. > > Old problems: > > 1. Last time around, an infinite loop was reported in pg_waldump. I > believe Horiguchi-san has fixed that[1], but I'm no longer depending > on that patch. I thought his patch set was a good idea, but it's > complicated and there's enough going on here already... let's consider > that independently. > > This version goes back to what I had earlier, though (I hope) it is > better about how "nonblocking" states are communicated. In this > version, XLogPageRead() has a way to give up part way through a record > if it doesn't have enough data and there are queued up records that > could be replayed right now. In that case, we'll go back to the > beginning of the record (and occasionally, back a WAL page) next time > we try. That's the cost of not maintaining intra-record decoding > state. > > 2. Last time around, we could try to allocate a crazy amount of > memory when reading garbage past the end of the WAL. Fixed, by > validating first, like in master. > > New work: > > Since last time, I went away and worked on a "real" AIO version of > this feature. That's ongoing experimental work for a future proposal, > but I have a working prototype and I aim to share that soon, when that > branch is rebased to catch up with recent changes. In that version, > the prefetcher starts actual reads into the buffer pool, and recovery > receives already pinned buffers attached to the stream of records it's > replaying. > > That inspired a couple of refactoring changes to this non-AIO version, > to minimise the difference and anticipate the future work better: > > 1. The logic for deciding which block to start prefetching next is > moved into a new callback function in a sort of standard form (this is > approximately how all/most prefetching code looks in the AIO project, > ie sequential scans, bitmap heap scan, etc). > > 2. The logic for controlling how many IOs are running and deciding > when to call the above is in a separate component. In this non-AIO > version, it works using a simple ring buffer of LSNs to estimate the > number of in flight I/Os, just like before. This part would be thrown > away and replaced with the AIO branch's centralised "streaming read" > mechanism which tracks I/O completions based on a stream of completion > events from the kernel (or I/O worker processes). > > 3. In this version, the prefetcher still doesn't pin buffers, for > simplicity. That work did force me to study places where WAL streams > need prefetching "barriers", though, so in this patch you can > see that it's now a little more careful than it probably needs to be. > (It doesn't really matter much if you call posix_fadvise() on a > non-existent file region, or the wrong file after OID wraparound and > reuse, but it would matter if you actually read it into a buffer, and > if an intervening record might be trying to drop something you have > pinned). > > Some other changes: > > 1. 
I dropped the GUC recovery_prefetch_fpw. I think it was a > possibly useful idea but it's a niche concern and not worth worrying > about for now. > > 2. I simplified the stats. Coming up with a good running average > system seemed like a problem for another day (the numbers before were > hard to interpret). The new stats are super simple counters and > instantaneous values: > > postgres=# select * from pg_stat_prefetch_recovery ; > -[ RECORD 1 ]--+-- > stats_reset| 2021-11-10 09:02:08.590217+13 > prefetch | 13605674 <- times we called posix_fadvise() > hit| 24185289 <- times we found pages already cached > skip_init | 217215 <- times we did nothing because init, not read > skip_new | 192347 <- times we skipped because relation too small > skip_fpw | 27429<- times we skipped because fpw, not read > wal_distance | 10648<- how far ahead in WAL bytes > block_distance | 134 <- how far ahead in block references > io_depth |
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
+ /* +* If the relation is from the default tablespace then we need to +* create it in the destinations db's default tablespace. Otherwise, +* we need to create in the same tablespace as it is in the source +* database. +*/ This comment looks a bit confusing to me especially because when we say destination db's default tablespace people may think of pg_default tablespace (at least I think so). Basically what you are trying to say here - "If the relation exists in the same tablespace as the src database, then in the destination db also it should be the same or something like that.. " So, why not put it that way instead of referring to it as the default tablespace. It's just my view. If you disagree you can ignore it. -- + else if (src_dboid == dst_dboid) + continue; + else + dstrnode.spcNode = srcrnode.spcNode;; There is an extra semicolon here. -- With Regards, Ashutosh Sharma. On Sun, Dec 12, 2021 at 1:39 PM Dilip Kumar wrote: > On Fri, Dec 10, 2021 at 7:39 AM Ashutosh Sharma > wrote: > >> > >> Logfile Snippet: > >> 2021-12-09 17:49:18.110 +04 [18151] PANIC: could not fsync file > "base/116398/116400": No such file or directory > >> 2021-12-09 17:49:19.105 +04 [18150] LOG: checkpointer process (PID > 18151) was terminated by signal 6: Aborted > >> 2021-12-09 17:49:19.105 +04 [18150] LOG: terminating any other active > server processes > > > > > > This is different from the issue you raised earlier. As Dilip said, we > need to unregister sync requests for files that got successfully copied to > the target database, but the overall alter database statement failed. We > are doing this when the database is created successfully, but not when it > fails. > > Probably doing the same inside the cleanup function > movedb_failure_callback() should fix the problem. > > Correct, I have done this cleanup, apart from this we have dropped the > fsyc request in create database failure case as well and also need to > drop buffer in error case of creatdb as well as movedb. I have also > fixed the other issue for which you gave the patch (a bit differently) > basically, in case of movedb the source and destination dboid are same > so we don't need an additional parameter and also readjusted the > conditions to avoid nested if. > > -- > Regards, > Dilip Kumar > EnterpriseDB: http://www.enterprisedb.com >
Re: Multi-Column List Partitioning
Hi, Is this okay? postgres=# CREATE TABLE t1 (a int, b int) PARTITION BY LIST ( a, a, a ); CREATE TABLE postgres=# CREATE TABLE t1_1 PARTITION OF t1 FOR VALUES IN ((1, 2, 3), (4, 5, 6)); CREATE TABLE postgres=# \d t1 Partitioned table "public.t1" Column | Type | Collation | Nullable | Default +-+---+--+- a | integer | | | b | integer | | | Partition key: LIST (a, a, a) Number of partitions: 1 (Use \d+ to list them.) -- Also, getting some compiler warnings when building the source. please check. -- With Regards, Ashutosh Sharma. On Mon, Dec 6, 2021 at 7:27 PM Nitin Jadhav wrote: > Thank you for reviewing the patch. > > > partbounds.c: In function ‘get_qual_for_list.isra.18’: > > partbounds.c:4284:29: warning: ‘boundinfo’ may be used uninitialized > > in this function [-Wmaybe-uninitialized] > > datumCopy(bound_info->datums[i][j], > > ~~^~~~ > > partbounds.c:4335:21: note: ‘boundinfo’ was declared here > > PartitionBoundInfo boundinfo; > > ^ > > partbounds.c: In function ‘partition_bounds_merge’: > > partbounds.c:1305:12: warning: ‘inner_isnull’ may be used > > uninitialized in this function [-Wmaybe-uninitialized] > > bool*inner_isnull; > >^~~~ > > partbounds.c:1304:12: warning: ‘outer_isnull’ may be used > > uninitialized in this function [-Wmaybe-uninitialized] > > bool*outer_isnull; > > ^~~~ > > Fixed. > > > This function is unnecessarily complicated, I think you can avoid > > inner for loops; simply replace for-loop-block with "if > > (equal(lfirst(cell), new_bound)) return true". > > Thank you for the suggestion. Fixed. > > > + char **colname = (char **) palloc0(partnatts * sizeof(char *)); > > + Oid*coltype = palloc0(partnatts * sizeof(Oid)); > > + int32*coltypmod = palloc0(partnatts * sizeof(int)); > > + Oid*partcollation = palloc0(partnatts * sizeof(Oid)); > > + > > This allocation seems to be worthless, read ahead. > > > > I think there is no need for this separate loop inside > > transformPartitionListBounds, you can do that same in the next loop as > > well. And instead of get_partition_col_* calling and storing, simply > > use that directly as an argument to transformPartitionBoundValue(). > > Yes. The loop can be avoided and content of the above loop can be > included in the next loop but the next loop iterates over a list of > multi column datums. For each iteration, we need the information of > all the columns. The above data (colname, coltype, coltypmod and > partcollation) remains same for each iteration of the loop, If we > modify as suggested, then the function to fetch these information has > to be called every-time. To avoid this situation I have made a > separate loop outside which only runs as many number of columns and > stores in a variable which can be reused later. Please let me correct > if I am wrong. > > > I think this should be inside the "else" block after "!IsA(rowexpr, > > RowExpr)" error and you can avoid IsA() check too. > > This is required to handle the situation when one partition key is > mentioned and multiple values are provided in the partition bound > specification. > > > Looks difficult to understand at first glance, how about the following: > > > > if (b1->isnulls != b2->isnulls) > >return false; > > > > if (b1->isnulls) > > { > >if (b1->isnulls[i][j] != b2->isnulls[i][j]) > >return false; > >if (b1->isnulls[i][j]) > >continue; > > } > > > > See how range partitioning infinite values are handled. Also, place > > this before the comment block that was added for the "!datumIsEqual()" > > case. > > Fixed. 
I feel the 'continue' block is not required and hence removed it. > > > Nothing wrong with this but if we could have checked "dest->isnulls" > > instead of "src->isnulls" would be much better. > > Here we are copying the data from 'src' to 'dest'. If there is no data > in 'src', it is unnecessary to copy. Hence checking 'src'. > > > Condition "key->strategy != PARTITION_STRATEGY_LIST" seems to be > unnecessary. > > Fixed. > > > Can't be a single loop? > > Yes. Fixed. > > > > On Fri, Dec 3, 2021 at 7:26 PM Amul Sul wrote: > > > > Hi, > > > > Few comments for v7 patch,
Re: Add sub-transaction overflow status in pg_stat_activity
Hi, I have looked into the v2 patch and here are my comments: + PG_RETURN_INT32(local_beentry->subxact_overflowed); +} Should this be PG_RETURN_BOOL instead of PG_RETURN_INT32?? -- +{ oid => '6107', descr => 'statistics: cached subtransaction count of backend', + proname => 'pg_stat_get_backend_subxact_count', provolatile => 's', proparallel => 'r', + prorettype => 'int4', proargtypes => 'int4', + prosrc => 'pg_stat_get_backend_subxact_count' }, +{ oid => '6108', descr => 'statistics: subtransaction cache of backend overflowed', + proname => 'pg_stat_get_backend_subxact_overflow', provolatile => 's', proparallel => 'r', + prorettype => 'bool', proargtypes => 'int4', + prosrc => 'pg_stat_get_backend_subxact_overflow' }, The description says that the two new functions show the statistics for "cached subtransaction count of backend" and "subtransaction cache of backend overflowed". But, when these functions are called it shows these stats for the non-backend processes like checkpointer, walwriter etc as well. Should that happen? -- typedef struct LocalPgBackendStatus * not. */ TransactionId backend_xmin; + + /* +* Number of cached subtransactions in the current session. +*/ + int subxact_count; + + /* +* The number of subtransactions in the current session exceeded the cached +* subtransaction limit. +*/ + bool subxact_overflowed; All the variables inside this LocalPgBackendStatus structure are prefixed with "backend" word. Can we do the same for the newly added variables as well? -- + * Get the xid and xmin, nsubxid and overflow status of the backend. The Should this be put as - "xid, xmin, nsubxid and overflow" instead of "xid and xmin, nsubxid and overflow"? -- With Regards, Ashutosh Sharma. On Mon, Dec 13, 2021 at 7:58 PM Dilip Kumar wrote: > On Tue, Dec 7, 2021 at 11:11 AM Justin Pryzby > wrote: > > > > > You added this to pg_stat_activity, which already has a lot of fields. > > We talked a few months ago about not adding more fields that weren't > commonly > > used. > > > https://www.postgresql.org/message-id/flat/20210426191811.sp3o77doinphyjhu%40alap3.anarazel.de#d96d0a116f0344301eead2676ea65b2e > > > > Since I think this field is usually not interesting to most users of > > pg_stat_activity, maybe this should instead be implemented as a function > like > > pg_backend_get_subxact_status(pid). > > > > People who want to could use it like: > > SELECT * FROM pg_stat_activity psa, pg_backend_get_subxact_status(pid) > sub; > > I have provided two function, one for subtransaction counts and other > whether subtransaction cache is overflowed or not, we can use like > this, if we think this is better way to do it then we can also add > another function for the lastOverflowedXid > > postgres[43994]=# select id, pg_stat_get_backend_pid(id) as pid, > pg_stat_get_backend_subxact_count(id) as nsubxact, > pg_stat_get_backend_subxact_overflow(id) as overflowed from > pg_stat_get_backend_idset() as id; > id | pid | nsubxact | overflowed > +---+--+ > 1 | 43806 |0 | f > 2 | 43983 | 64 | t > 3 | 43994 |0 | f > 4 | 44323 | 22 | f > 5 | 43802 |0 | f > 6 | 43801 |0 | f > 7 | 43804 |0 | f > (7 rows) > > -- > Regards, > Dilip Kumar > EnterpriseDB: http://www.enterprisedb.com >
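To make the first comment concrete, the function body one would expect is roughly the following (a sketch; pgstat_fetch_stat_local_beentry() is the existing accessor and the field name is as used in the v2 patch):

    Datum
    pg_stat_get_backend_subxact_overflow(PG_FUNCTION_ARGS)
    {
        int32       beid = PG_GETARG_INT32(0);
        LocalPgBackendStatus *local_beentry;

        if ((local_beentry = pgstat_fetch_stat_local_beentry(beid)) == NULL)
            PG_RETURN_NULL();

        /* the column is declared bool, so return it as a bool */
        PG_RETURN_BOOL(local_beentry->subxact_overflowed);
    }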
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
I am getting the below error when running the same test-case that Neha shared in her previous email. ERROR: 55000: some relations of database "test1" are already in tablespace "tab1" HINT: You must move them back to the database's default tablespace before using this command. LOCATION: movedb, dbcommands.c:1555 test-case: create tablespace tab1 location '/home/ashu/test1'; create tablespace tab location '/home/ashu/test'; create database test tablespace tab; \c test create table t(a int primary key, b text); create or replace function large_val() returns text language sql as 'select array_agg(md5(g::text))::text from generate_series(1, 256) g'; insert into t values (generate_series(1,10), large_val()); alter table t set tablespace tab1 ; \c postgres create database test1 template test; \c test1 alter table t set tablespace tab; \c postgres alter database test1 set tablespace tab1; -- this fails with the given error. Observations: === Please note that before running above alter database statement, the table 't' is moved to tablespace 'tab' from 'tab1' so not sure why ReadDir() is returning true when searching for table 't' in tablespace 'tab1'. It should have returned NULL here: while ((xlde = ReadDir(dstdir, dst_dbpath)) != NULL) { if (strcmp(xlde->d_name, ".") == 0 || strcmp(xlde->d_name, "..") == 0) continue; ereport(ERROR, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("some relations of database \"%s\" are already in tablespace \"%s\"", dbname, tblspcname), errhint("You must move them back to the database's default tablespace before using this command."))); } Also, if I run the checkpoint explicitly before executing the above alter database statement, this error doesn't appear which means it only happens with the new changes because earlier we were doing the force checkpoint at the end of createdb statement. -- With Regards, Ashutosh Sharma. On Thu, Dec 16, 2021 at 9:26 PM Neha Sharma wrote: > Hi, > > While testing the v8 patches in a hot-standby setup, it was observed the > master is crashing with the below error; > > 2021-12-16 19:32:47.757 +04 [101483] PANIC: could not fsync file > "pg_tblspc/16385/PG_15_202112111/16386/16391": No such file or directory > 2021-12-16 19:32:48.917 +04 [101482] LOG: checkpointer process (PID > 101483) was terminated by signal 6: Aborted > > Parameters configured at master: > wal_level = hot_standby > max_wal_senders = 3 > hot_standby = on > max_standby_streaming_delay= -1 > wal_consistency_checking='all' > max_wal_size= 10GB > checkpoint_timeout= 1d > log_min_messages=debug1 > > Test Case: > create tablespace tab1 location > '/home/edb/PGsources/postgresql/inst/bin/test1'; > create tablespace tab location > '/home/edb/PGsources/postgresql/inst/bin/test'; > create database test tablespace tab; > \c test > create table t( a int PRIMARY KEY,b text); > CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS > 'select array_agg(md5(g::text))::text from generate_series(1, 256) g'; > insert into t values (generate_series(1,10), large_val()); > alter table t set tablespace tab1 ; > \c postgres > create database test1 template test; > \c test1 > alter table t set tablespace tab; > \c postgres > alter database test1 set tablespace tab1; > > --cancel the below command > alter database test1 set tablespace pg_default; --press ctrl+c > \c test1 > alter table t set tablespace tab1; > > > Log file attached for reference. > > Thanks. 
> -- > Regards, > Neha Sharma > > > On Thu, Dec 16, 2021 at 4:17 PM Dilip Kumar wrote: > >> On Thu, Dec 16, 2021 at 12:15 AM Bruce Momjian wrote: >> > >> > On Thu, Dec 2, 2021 at 07:19:50PM +0530, Dilip Kumar wrote: >> > From the patch: >> > >> > > Currently, CREATE DATABASE forces a checkpoint, then copies all the >> files, >> > > then forces another checkpoint. The comments in the createdb() >> function >> > > explain the reasons for this. The attached patch fixes this problem >> by making >> > > create database completely WAL logged so that we can avoid the >> checkpoints. >> > > >> > > This can also be useful for supporting the TDE. For example, if we >> need different >> > > encryption for the source and the target database then we can not >> re-encrypt the >>
Re: Multi-Column List Partitioning
On Mon, Dec 20, 2021 at 7:04 PM Amit Langote wrote: > Hi, > > On Mon, Dec 13, 2021 at 11:37 PM Ashutosh Sharma > wrote: > > > > Hi, > > > > Is this okay? > > > > postgres=# CREATE TABLE t1 (a int, b int) PARTITION BY LIST ( a, a, a ); > > CREATE TABLE > > > > postgres=# CREATE TABLE t1_1 PARTITION OF t1 FOR VALUES IN ((1, 2, 3), > (4, 5, 6)); > > CREATE TABLE > > > > postgres=# \d t1 > >Partitioned table "public.t1" > > Column | Type | Collation | Nullable | Default > > +-+---+--+- > > a | integer | | | > > b | integer | | | > > Partition key: LIST (a, a, a) > > Number of partitions: 1 (Use \d+ to list them.) > > I'd say it's not okay for a user to expect this to work sensibly, and > I don't think it would be worthwhile to write code to point that out > to the user if that is what you were implying. > OK. As you wish. -- With Regards, Ashutosh Sharma.
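For reference, the non-degenerate multi-column case that the patch is actually targeting, with distinct key columns, looks like this (a sketch using the same syntax exercised earlier in this thread):

    CREATE TABLE t2 (a int, b int) PARTITION BY LIST (a, b);
    CREATE TABLE t2_1 PARTITION OF t2 FOR VALUES IN ((1, 2), (3, 4));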
Re: Checkpointer crashes with "PANIC: could not fsync file "pg_tblspc/.."
This is happening because in the case of the primary server, we let the checkpointer process to unlink the first segment of the rel file but that's not the case with the standby server. In case of standby, the startup/recovery process unlinks the first segment of the rel file immediately during WAL replay. Now, in this case as the tablespace path is shared between the primary and standby servers, when the startup process unlinks the first segment of the rel file, it gets unlinked/deleted for the primary server as well. So, when we run the checkpoint on the primary server the subsequent fsync fails with the error "No such file.." which causes the database system to PANIC. Please have a look at the code below in mdunlinkfork(). if (isRedo || forkNum != MAIN_FORKNUM || RelFileNodeBackendIsTemp(rnode)) { if (!RelFileNodeBackendIsTemp(rnode)) { /* Prevent other backends' fds from holding on to the disk space */ ret = do_truncate(path); /* Forget any pending sync requests for the first segment */ register_forget_request(rnode, forkNum, 0 /* first seg */ ); } else ret = 0; /* Next unlink the file, unless it was already found to be missing */ if (ret == 0 || errno != ENOENT) { ret = unlink(path); if (ret < 0 && errno != ENOENT) ereport(WARNING, (errcode_for_file_access(), errmsg("could not remove file \"%s\": %m", path))); } } else { /* Prevent other backends' fds from holding on to the disk space */ ret = do_truncate(path); /* Register request to unlink first segment later */ register_unlink_segment(rnode, forkNum, 0 /* first seg */ ); } == Is it okay to share the same tablespace (infact relfile) between the primary and standby server? Perhaps NO. -- With Regards, Ashutosh Sharma. On Tue, Dec 21, 2021 at 4:47 PM Dilip Kumar wrote: > While testing the below case with the hot standby setup (with the > latest code), I have noticed that the checkpointer process crashed > with the $subject error. As per my observation, we have registered the > SYNC_REQUEST when inserting some tuple into the table, and later on > ALTER SET TABLESPACE we have registered the SYNC_UNLINK_REQUEST, which > looks fine so far, then I have noticed that only when the standby is > connected the underlying table file w.r.t the old tablespace is > already deleted. Now, in AbsorbFsyncRequests we don't do anything for > the SYNC_REQUEST even though we have SYNC_UNLINK_REQUEST for the same > file, but since the underlying file is already deleted the > checkpointer cashed while processing the SYNC_REQUEST. > > I have spent some time on this but could not figure out how the > relfilenodenode file w.r.t. to the old tablespace is getting deleted > and if I disconnect the standby then it is not getting deleted, not > sure how walsender is playing a role in deleting the file even before > checkpointer process the unlink request. > > postgres[8905]=# create tablespace tab location > '/home/dilipkumar/work/PG/install/bin/test'; > CREATE TABLESPACE > postgres[8905]=# create tablespace tab1 location > '/home/dilipkumar/work/PG/install/bin/test1'; > CREATE TABLESPACE > postgres[8905]=# create database test tablespace tab; > CREATE DATABASE > postgres[8905]=# \c test > You are now connected to database "test" as user "dilipkumar". 
> test[8912]=# create table t( a int PRIMARY KEY,b text); > CREATE TABLE > test[8912]=# insert into t values (generate_series(1,10), 'aaa'); > INSERT 0 10 > test[8912]=# alter table t set tablespace tab1 ; > ALTER TABLE > test[8912]=# CHECKPOINT ; > WARNING: 57P02: terminating connection because of crash of another > server process > > log shows: > PANIC: could not fsync file > "pg_tblspc/16384/PG_15_202112131/16386/16387": No such file or > directory > > backtrace: > #0 0x7f2f865ff387 in raise () from /lib64/libc.so.6 > #1 0x7f2f86600a78 in abort () from /lib64/libc.so.6 > #2 0x00b13da3 in errfinish (filename=0xcf283f "sync.c", .. > #3 0x00978dc7 in ProcessSyncRequests () at sync.c:439 > #4 0x005949d2 in CheckPointGuts (checkPointRedo=67653624, > flags=108) at xlog.c:9590 > #5 0x005942fe in CreateCheckPoint (flags=108) at xlog.c:9318 > #6 0x008a80b7 in CheckpointerMain () at checkpointer.c:444 > > Note: This smaller test case is derived from one of the bigger > scenarios raised by Neha Sharma [1] > > [1] > https://www.postgresql.org/message-id/CANiYTQs0E8TcB11eU0C4eNN0tUd%3DSQqsqEtL1AVZP1%3DEnD-49A%40mail.gmail.com > > -- > Regards, > Dilip Kumar > EnterpriseDB: http://www.enterprisedb.com > > >
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
Hi Dilip, On Tue, Dec 21, 2021 at 11:10 AM Ashutosh Sharma wrote: > I am getting the below error when running the same test-case that Neha > shared in her previous email. > > ERROR: 55000: some relations of database "test1" are already in > tablespace "tab1" > HINT: You must move them back to the database's default tablespace before > using this command. > LOCATION: movedb, dbcommands.c:1555 > > test-case: > > create tablespace tab1 location '/home/ashu/test1'; > create tablespace tab location '/home/ashu/test'; > > create database test tablespace tab; > \c test > > create table t(a int primary key, b text); > > create or replace function large_val() returns text language sql as > 'select array_agg(md5(g::text))::text from generate_series(1, 256) g'; > > insert into t values (generate_series(1,10), large_val()); > > alter table t set tablespace tab1 ; > > \c postgres > create database test1 template test; > > \c test1 > alter table t set tablespace tab; > > \c postgres > alter database test1 set tablespace tab1; -- this fails with the given > error. > > Observations: > === > Please note that before running above alter database statement, the table > 't' is moved to tablespace 'tab' from 'tab1' so not sure why ReadDir() is > returning true when searching for table 't' in tablespace 'tab1'. It should > have returned NULL here: > > while ((xlde = ReadDir(dstdir, dst_dbpath)) != NULL) > { > if (strcmp(xlde->d_name, ".") == 0 || > strcmp(xlde->d_name, "..") == 0) > continue; > > ereport(ERROR, > (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > errmsg("some relations of database \"%s\" are already > in tablespace \"%s\"", > dbname, tblspcname), > errhint("You must move them back to the database's > default tablespace before using this command."))); > } > > Also, if I run the checkpoint explicitly before executing the above alter > database statement, this error doesn't appear which means it only happens > with the new changes because earlier we were doing the force checkpoint at > the end of createdb statement. > Is this expected? I think it is not. -- With Regards, Ashutosh Sharma.
Re: Checkpointer crashes with "PANIC: could not fsync file "pg_tblspc/.."
On Wed, Dec 22, 2021 at 7:20 AM Dilip Kumar wrote: > On Wed, 22 Dec 2021 at 12:28 AM, Ashutosh Sharma > wrote: > >> >> Is it okay to share the same tablespace (in fact, the relfile) between the >> primary and standby server? Perhaps NO. >> > >> Oops, yeah absolutely they can never share the tablespace path. So we > can ignore this issue. > That's right. I agree. Thanks! -- With Regards, Ashutosh Sharma.
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
On Wed, Dec 22, 2021 at 2:44 PM Dilip Kumar wrote: > On Tue, Dec 21, 2021 at 11:10 AM Ashutosh Sharma > wrote: > > > > I am getting the below error when running the same test-case that Neha > shared in her previous email. > > > > ERROR: 55000: some relations of database "test1" are already in > tablespace "tab1" > > HINT: You must move them back to the database's default tablespace > before using this command. > > LOCATION: movedb, dbcommands.c:1555 > > > > test-case: > > > > create tablespace tab1 location '/home/ashu/test1'; > > create tablespace tab location '/home/ashu/test'; > > > > create database test tablespace tab; > > \c test > > > > create table t(a int primary key, b text); > > > > create or replace function large_val() returns text language sql as > 'select array_agg(md5(g::text))::text from generate_series(1, 256) g'; > > > > insert into t values (generate_series(1,10), large_val()); > > > > alter table t set tablespace tab1 ; > > > > \c postgres > > create database test1 template test; > > > > \c test1 > > alter table t set tablespace tab; > > > > \c postgres > > alter database test1 set tablespace tab1; -- this fails with the given > error. > > > > Observations: > > === > > Please note that before running above alter database statement, the > table 't' is moved to tablespace 'tab' from 'tab1' so not sure why > ReadDir() is returning true when searching for table 't' in tablespace > 'tab1'. It should have returned NULL here: > > > > while ((xlde = ReadDir(dstdir, dst_dbpath)) != NULL) > > { > > if (strcmp(xlde->d_name, ".") == 0 || > > strcmp(xlde->d_name, "..") == 0) > > continue; > > > > ereport(ERROR, > > (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > > errmsg("some relations of database \"%s\" are > already in tablespace \"%s\"", > > dbname, tblspcname), > > errhint("You must move them back to the database's > default tablespace before using this command."))); > > } > > > > Also, if I run the checkpoint explicitly before executing the above > alter database statement, this error doesn't appear which means it only > happens with the new changes because earlier we were doing the force > checkpoint at the end of createdb statement. > > > > Basically, ALTER TABLE SET TABLESPACE, will register the > SYNC_UNLINK_REQUEST for the table files w.r.t the old tablespace, but > those will get unlinked during the next checkpoint. Although the > files must be truncated during commit itself but unlink might not have > been processed until the next checkpoint. This is the explanation for > the behavior you found during your investigation, but I haven't looked > into the issue so I will do it latest by tomorrow and send my > analysis. > > Thanks for working on this. > Yeah the problem here is that the old rel file that needs to be unlinked still exists in the old tablespace. Earlier, without your changes we were doing force checkpoint before starting with the actual work for the alter database which unlinked/deleted the rel file from the old tablespace, but that is not the case here. Now we have removed the force checkpoint from movedb() which means until the auto checkpoint happens the old rel file will remain in the old tablespace thereby creating this problem. -- With Regards, Ashutosh Sharma.
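Until the pending-unlink handling is reworked, the observation above suggests a workaround sketch: issue an explicit CHECKPOINT before ALTER DATABASE ... SET TABLESPACE so that the pending unlink of the old rel file is processed first, which is what the force checkpoint in movedb() used to guarantee. The object names below simply reuse the test case quoted above.

\c test1
ALTER TABLE t SET TABLESPACE tab;
CHECKPOINT;   -- processes the pending SYNC_UNLINK_REQUEST for the old rel file

\c postgres
ALTER DATABASE test1 SET TABLESPACE tab1;   -- no longer finds stale files in the target tablespace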
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
On Wed, Dec 22, 2021 at 5:06 PM Dilip Kumar wrote: > > On Wed, Dec 22, 2021 at 4:26 PM Ashutosh Sharma wrote: > > >> Basically, ALTER TABLE SET TABLESPACE, will register the > >> SYNC_UNLINK_REQUEST for the table files w.r.t the old tablespace, but > >> those will get unlinked during the next checkpoint. Although the > >> files must be truncated during commit itself but unlink might not have > >> been processed until the next checkpoint. This is the explanation for > >> the behavior you found during your investigation, but I haven't looked > >> into the issue so I will do it latest by tomorrow and send my > >> analysis. > >> > >> Thanks for working on this. > > > > > > Yeah the problem here is that the old rel file that needs to be unlinked > > still exists in the old tablespace. Earlier, without your changes we were > > doing force checkpoint before starting with the actual work for the alter > > database which unlinked/deleted the rel file from the old tablespace, but > > that is not the case here. Now we have removed the force checkpoint from > > movedb() which means until the auto checkpoint happens the old rel file > > will remain in the old tablespace thereby creating this problem. > > One solution to this problem could be that, similar to mdpostckpt(), > we invent one more function which takes dboid and dsttblspc oid as > input and it will unlink all the requests which are w.r.t. the dboid > and tablespaceoid, and before doing it we should also do > ForgetDatabaseSyncRequests(), so that next checkpoint does not flush > some old request. I couldn't find the mdpostckpt() function. Are you talking about SyncPostCheckpoint()? Anyway, as you have rightly said, we need to unlink all the files available inside the dst_tablespaceoid/dst_dboid/ directory by scanning the pendingUnlinks list. And finally, we don't want the next checkpoint to try to unlink this file again and PANIC, so we have to update the entry for this unlinked rel file in the hash table, i.e. cancel the sync request for it. -- With Regards, Ashutosh Sharma.
Re: Parallel copy
I think, when doing the performance testing for a partitioned table, it would be good to also mention the distribution of data in the input file. One possible data distribution could be that we have, let's say, 100 tuples in the input file, and every consecutive tuple belongs to a different partition. On Thu, Jul 23, 2020 at 8:51 AM Bharath Rupireddy < bharath.rupireddyforpostg...@gmail.com> wrote: > On Wed, Jul 22, 2020 at 7:56 PM vignesh C wrote: > > > > Thanks for reviewing and providing the comments Ashutosh. > > Please find my thoughts below: > > > > On Fri, Jul 17, 2020 at 7:18 PM Ashutosh Sharma > wrote: > > > > > > Some review comments (mostly) from the leader side code changes: > > > > > > 3) Should we allow Parallel Copy when the insert method is > CIM_MULTI_CONDITIONAL? > > > > > > + /* Check if the insertion mode is single. */ > > > + if (FindInsertMethod(cstate) == CIM_SINGLE) > > > + return false; > > > > > > I know we have added checks in CopyFrom() to ensure that if any > trigger (before row or instead of) is found on any of partition being > loaded with data, then COPY FROM operation would fail, but does it mean > that we are okay to perform parallel copy on partitioned table. Have we > done some performance testing with the partitioned table where the data in > the input file needs to be routed to the different partitions? > > > > > > > Partition data is handled like what Amit had told in one of earlier > mails [1]. My colleague Bharath has run performance test with partition > table, he will be sharing the results. > > > > I ran tests for partitioned use cases - results are similar to that of non partitioned cases[1].
>
> Test case 1 (exec time in sec): copy from csv file, 5.1GB, 10 million tuples, 4 range partitions, 3 indexes on integer columns, unique data
> Test case 2 (exec time in sec): copy from csv file, 5.1GB, 10 million tuples, 4 range partitions, unique data
>
> parallel workers | test case 1     | test case 2
> 0                | 205.403 (1X)    | 135 (1X)
> 2                | 114.724 (1.79X) | 59.388 (2.27X)
> 4                | 99.017 (2.07X)  | 56.742 (2.34X)
> 8                | 99.722 (2.06X)  | 66.323 (2.03X)
> 16               | 98.147 (2.09X)  | 66.054 (2.04X)
> 20               | 97.723 (2.1X)   | 66.389 (2.03X)
> 30               | 97.048 (2.11X)  | 70.568 (1.91X)
>
> With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
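To make the suggested distribution concrete, here is a sketch (table and file names are hypothetical) of generating an input file in which every consecutive tuple is routed to a different partition of a 4-way range-partitioned table, so tuple routing is exercised on every row rather than in long per-partition runs:

-- Sketch only: build a 100-tuple CSV whose consecutive rows land in different partitions.
CREATE TABLE orders (id int, payload text) PARTITION BY RANGE (id);
CREATE TABLE orders_p0 PARTITION OF orders FOR VALUES FROM (0)  TO (25);
CREATE TABLE orders_p1 PARTITION OF orders FOR VALUES FROM (25) TO (50);
CREATE TABLE orders_p2 PARTITION OF orders FOR VALUES FROM (50) TO (75);
CREATE TABLE orders_p3 PARTITION OF orders FOR VALUES FROM (75) TO (100);

-- Row g gets key (g % 4) * 25 + (g / 4) % 25, so each consecutive tuple switches partition.
COPY (
    SELECT (g % 4) * 25 + (g / 4) % 25 AS id,
           md5(g::text)                AS payload
    FROM generate_series(0, 99) AS g
    ORDER BY g
) TO '/tmp/interleaved.csv' WITH (FORMAT csv);

-- The actual test would then load the file back, e.g.:
-- COPY orders FROM '/tmp/interleaved.csv' WITH (FORMAT csv);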
Re: recovering from "found xmin ... from before relfrozenxid ..."
Hi All, Attached is the patch that adds heap_force_kill(regclass, tid[]) and heap_force_freeze(regclass, tid[]) functions which Robert mentioned in the first email in this thread. The patch basically adds an extension named pg_surgery that contains these functions. Please have a look and let me know your feedback. Thank you. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com On Thu, Jul 16, 2020 at 9:44 PM Robert Haas wrote: > On Thu, Jul 16, 2020 at 10:00 AM Robert Haas > wrote: > > I see your point, though: the tuple has to be able to survive > > HOT-pruning in order to cause a problem when we check whether it needs > > freezing. > > Here's an example where the new sanity checks fail on an invisible > tuple without any concurrent transactions: > > $ initdb > $ pg_ctl start -l ~/logfile > $ createdb > $ psql > > create table simpsons (a int, b text); > vacuum freeze; > > $ cat > txid.sql > select txid_current(); > $ pgbench -t 131072 -c 8 -j 8 -n -f txid.sql > $ psql > > insert into simpsons values (1, 'homer'); > > $ pg_ctl stop > $ pg_resetwal -x 1000 $PGDATA > $ pg_ctl start -l ~/logfile > $ psql > > update pg_class set relfrozenxid = (relfrozenxid::text::integer + > 200)::text::xid where relname = 'simpsons'; > > rhaas=# select * from simpsons; > a | b > ---+--- > (0 rows) > > rhaas=# vacuum simpsons; > ERROR: found xmin 1049082 from before relfrozenxid 2000506 > CONTEXT: while scanning block 0 of relation "public.simpsons" > > This is a fairly insane situation, because we should have relfrozenxid > < tuple xid < xid counter, but instead we have xid counter < tuple xid > < relfrozenxid, but it demonstrates that it's possible to have a > database which is sufficiently corrupt that you can't escape from the > new sanity checks using only INSERT, UPDATE, and DELETE. > > Now, an even easier way to create a table with a tuple that prevents > vacuuming and also can't just be deleted is to simply remove a > required pg_clog file (and maybe restart the server to clear out any > cached data in the SLRUs). What we typically do with customers who > need to recover from that situation today is give them a script to > fabricate a bogus CLOG file that shows all transactions as committed > (or, perhaps, aborted). But I think that the tools proposed on this > thread might be a better approach in certain cases. If the problem is > that a pg_clog file vanished, then recreating it with whatever content > you think is closest to what was probably there before is likely the > best you can do. But if you've got some individual tuples with crazy > xmin values, you don't really want to drop matching files in pg_clog; > it's better to fix the tuples. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company > > > From 75a6ba0e86b6d199201477c5ebd60258ef16f181 Mon Sep 17 00:00:00 2001 From: ashu Date: Fri, 24 Jul 2020 11:26:45 +0530 Subject: [PATCH] Add contrib/pg_surgery to perform surgery on the damaged heap tables. This contrib module basically adds a couple of functions named heap_force_kill and heap_force_freeze that can be used in the scripts to perform surgery on the damaged heap tables. Ashutosh Sharma. 
--- contrib/Makefile | 1 + contrib/pg_surgery/Makefile| 23 ++ contrib/pg_surgery/expected/pg_surgery.out | 109 + contrib/pg_surgery/heap_surgery_funcs.c| 356 + contrib/pg_surgery/pg_surgery--1.0.sql | 14 ++ contrib/pg_surgery/pg_surgery.control | 5 + contrib/pg_surgery/sql/pg_surgery.sql | 72 ++ 7 files changed, 580 insertions(+) create mode 100644 contrib/pg_surgery/Makefile create mode 100644 contrib/pg_surgery/expected/pg_surgery.out create mode 100644 contrib/pg_surgery/heap_surgery_funcs.c create mode 100644 contrib/pg_surgery/pg_surgery--1.0.sql create mode 100644 contrib/pg_surgery/pg_surgery.control create mode 100644 contrib/pg_surgery/sql/pg_surgery.sql diff --git a/contrib/Makefile b/contrib/Makefile index 1846d41..07d5734 100644 --- a/contrib/Makefile +++ b/contrib/Makefile @@ -35,6 +35,7 @@ SUBDIRS = \ pg_standby \ pg_stat_statements \ pg_trgm \ + pg_surgery \ pgcrypto \ pgrowlocks \ pgstattuple \ diff --git a/contrib/pg_surgery/Makefile b/contrib/pg_surgery/Makefile new file mode 100644 index 000..baf2a88 --- /dev/null +++ b/contrib/pg_surgery/Makefile @@ -0,0 +1,23 @@ +# contrib/pg_surgery/Makefile + +MODULE_big = pg_surgery +OBJS = \ + $(WIN32RES) \ + heap_surgery_funcs.o + +EXTENSION = pg_surgery +DATA = pg_surgery--1.0.sql +PGFILEDESC = "pg_surgery - perform surgery on the damaged he
Re: recovering from "found xmin ... from before relfrozenxid ..."
Hi, Thanks for sharing your thoughts. Please find my comments inline below: > > I think here we should report that we haven't done what was asked. > + /* Nothing to do if the itemid is unused or > already dead. */ > + if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid)) > + continue; > > Okay. Will add a log message saying "skipping tid ... because ..." > Also, should we try to fix VM along the way? > I think we should let VACUUM do that. > Are there any caveats with concurrent VACUUM? (I do not see any, just > asking) > As of now, I don't see any. > It would be good to have some checks for interrupts in safe places. > I think I have already added those wherever I felt it was required. If you feel it's missing somewhere, it could be good if you could point it out. > I think we should not trust user entierly here. I'd prefer validation and > graceful exit, not a core dump. > + Assert(noffs <= PageGetMaxOffsetNumber(page)); > > Yeah, sounds reasonable. Will do that. > For some reason we had unlogged versions of these functions. But I do not > recall exact rationale.. > Also, I'd be happy if we had something like "Restore this tuple iff this > does not break unique constraint". To do so we need to sort tids by > xmin\xmax, to revive most recent data first. > I didn't get this point. Could you please elaborate. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
Re: recovering from "found xmin ... from before relfrozenxid ..."
> > I think we should let VACUUM do that. > Sometimes VACUUM will not get to these pages, because they are marked All > Frozen. > An possibly some tuples will get stale on this page again > Hmm, okay, will have a look into this. Thanks. > > > > Are there any caveats with concurrent VACUUM? (I do not see any, just > asking) > > > > As of now, I don't see any. > VACUUM has collection of dead item pointers. We will not resurrect any of > them, right? > We won't be doing any such things. > > > It would be good to have some checks for interrupts in safe places. > > > > I think I have already added those wherever I felt it was required. If > you feel it's missing somewhere, it could be good if you could point it out. > Sorry, I just overlooked that call, everything is fine here. > Okay, thanks for confirming. > > > Also, I'd be happy if we had something like "Restore this tuple iff > this does not break unique constraint". To do so we need to sort tids by > xmin\xmax, to revive most recent data first. > > > > I didn't get this point. Could you please elaborate. > You may have 10 corrupted tuples for the same record, that was updated 9 > times. And if you have unique constraint on the table you may want to have > only latest version of the row. So you want to kill 9 tuples and freeze 1. > Okay, in that case, users need to pass the tids of the 9 tuples that they want to kill to the heap_force_kill function, and the tid of the tuple that they want marked as frozen to the heap_force_freeze function. Just to inform you that this tool is not meant to detect data corruption; it is only meant to remove/clean up the corrupted data in a table so that operations like VACUUM and pg_dump/restore succeed. It is the user's responsibility to identify the corrupted data and pass its tid to either of these functions as per the requirement. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
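Expressed with the functions added by this patch, that scenario might look like the following sketch (the table name and tids are made up; in practice the tids would come from whatever inspection identified the corrupted row versions):

-- Hypothetical example: of ten versions of the same logical row, force-kill the
-- nine older ones and freeze only the most recent one, preserving the unique key.
SELECT heap_force_kill('mytab'::regclass,
                       ARRAY['(0,1)','(0,2)','(0,3)','(0,4)','(0,5)',
                             '(0,6)','(0,7)','(0,8)','(0,9)']::tid[]);

SELECT heap_force_freeze('mytab'::regclass, ARRAY['(0,10)']::tid[]);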
Re: recovering from "found xmin ... from before relfrozenxid ..."
Hi Beena, Thanks for the review. 1. We would be marking buffer dirty and writing wal even if we have > not done any changes( ex if we pass invalid/dead tids). Maybe we can > handle this better? > Yeah, we can skip this when nothing has changed. Will take care of it in the next version of patch. > cosmetic changes > 1. Maybe "HTupleSurgicalOption" instead of "HTupleForceOption" and > also the variable names could use surgery instead. > I think that looks fine. I would rather prefer the word "Force" just because all the enum options contain the word "Force" in it. > 2. extension comment pg_surgery.control "extension to perform surgery > the damaged heap table" -> "extension to perform surgery on the > damaged heap table" > Okay, will fix that typo. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
Re: recovering from "found xmin ... from before relfrozenxid ..."
On Wed, Jul 29, 2020 at 9:58 AM Ashutosh Sharma wrote: > > > I think we should let VACUUM do that. >> Sometimes VACUUM will not get to these pages, because they are marked All >> Frozen. >> An possibly some tuples will get stale on this page again >> > > Hmm, okay, will have a look into this. Thanks. > I had a look over this and found that one can use the DISABLE_PAGE_SKIPPING option with VACUUM to disable all its page-skipping behavior. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
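For reference, the post-surgery vacuum would then be run like this (table name is hypothetical):

-- DISABLE_PAGE_SKIPPING makes VACUUM ignore the visibility map, so even
-- all-visible/all-frozen pages touched by the surgery functions get rechecked.
VACUUM (DISABLE_PAGE_SKIPPING, VERBOSE) mytab;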
Re: recovering from "found xmin ... from before relfrozenxid ..."
Attached is the new version of patch that addresses the comments from Andrey and Beena. On Wed, Jul 29, 2020 at 10:07 AM Ashutosh Sharma wrote: > Hi Beena, > > Thanks for the review. > > 1. We would be marking buffer dirty and writing wal even if we have >> not done any changes( ex if we pass invalid/dead tids). Maybe we can >> handle this better? >> > > Yeah, we can skip this when nothing has changed. Will take care of it in > the next version of patch. > > >> cosmetic changes >> 1. Maybe "HTupleSurgicalOption" instead of "HTupleForceOption" and >> also the variable names could use surgery instead. >> > > I think that looks fine. I would rather prefer the word "Force" just > because all the enum options contain the word "Force" in it. > > >> 2. extension comment pg_surgery.control "extension to perform surgery >> the damaged heap table" -> "extension to perform surgery on the >> damaged heap table" >> > > Okay, will fix that typo. > > -- > With Regards, > Ashutosh Sharma > EnterpriseDB:http://www.enterprisedb.com > From e78c6fc6a99bba7f38482864a36dc812b0798464 Mon Sep 17 00:00:00 2001 From: ashu Date: Fri, 31 Jul 2020 17:24:50 +0530 Subject: [PATCH] Add contrib/pg_surgery to perform surgery on the damaged heap tables. This contrib module basically adds a couple of functions named heap_force_kill and heap_force_freeze that can be used in the scripts to perform surgery on the damaged heap tables. Ashutosh Sharma. --- contrib/Makefile | 1 + contrib/pg_surgery/Makefile| 23 ++ contrib/pg_surgery/expected/pg_surgery.out | 109 + contrib/pg_surgery/heap_surgery_funcs.c| 378 + contrib/pg_surgery/pg_surgery--1.0.sql | 14 ++ contrib/pg_surgery/pg_surgery.control | 5 + contrib/pg_surgery/sql/pg_surgery.sql | 72 ++ 7 files changed, 602 insertions(+) create mode 100644 contrib/pg_surgery/Makefile create mode 100644 contrib/pg_surgery/expected/pg_surgery.out create mode 100644 contrib/pg_surgery/heap_surgery_funcs.c create mode 100644 contrib/pg_surgery/pg_surgery--1.0.sql create mode 100644 contrib/pg_surgery/pg_surgery.control create mode 100644 contrib/pg_surgery/sql/pg_surgery.sql diff --git a/contrib/Makefile b/contrib/Makefile index 1846d41..07d5734 100644 --- a/contrib/Makefile +++ b/contrib/Makefile @@ -35,6 +35,7 @@ SUBDIRS = \ pg_standby \ pg_stat_statements \ pg_trgm \ + pg_surgery \ pgcrypto \ pgrowlocks \ pgstattuple \ diff --git a/contrib/pg_surgery/Makefile b/contrib/pg_surgery/Makefile new file mode 100644 index 000..baf2a88 --- /dev/null +++ b/contrib/pg_surgery/Makefile @@ -0,0 +1,23 @@ +# contrib/pg_surgery/Makefile + +MODULE_big = pg_surgery +OBJS = \ + $(WIN32RES) \ + heap_surgery_funcs.o + +EXTENSION = pg_surgery +DATA = pg_surgery--1.0.sql +PGFILEDESC = "pg_surgery - perform surgery on the damaged heap table" + +REGRESS = pg_surgery + +ifdef USE_PGXS +PG_CONFIG = pg_config +PGXS := $(shell $(PG_CONFIG) --pgxs) +include $(PGXS) +else +subdir = contrib/pg_surgery +top_builddir = ../.. +include $(top_builddir)/src/Makefile.global +include $(top_srcdir)/contrib/contrib-global.mk +endif diff --git a/contrib/pg_surgery/expected/pg_surgery.out b/contrib/pg_surgery/expected/pg_surgery.out new file mode 100644 index 000..98419b0 --- /dev/null +++ b/contrib/pg_surgery/expected/pg_surgery.out @@ -0,0 +1,109 @@ +create extension pg_surgery; +-- +-- test pg_surgery functions with all the supported relations. Should pass. +-- +-- normal heap table. 
+begin; +create table htab(a int); +insert into htab values (100), (200), (300), (400), (500); +select * from htab where xmin = 2; + a +--- +(0 rows) + +select heap_force_freeze('htab'::regclass, ARRAY['(0, 4)']::tid[]); + heap_force_freeze +--- + +(1 row) + +select ctid, xmax from htab where xmin = 2; + ctid | xmax +---+-- + (0,4) |0 +(1 row) + +select heap_force_kill('htab'::regclass, ARRAY['(0, 4)']::tid[]); + heap_force_kill +- + +(1 row) + +select * from htab where ctid = '(0, 4)'; + a +--- +(0 rows) + +rollback; +-- toast table. +begin; +create table ttab(a text); +insert into ttab select string_agg(chr(floor(random() * 26)::int + 65), '') from generate_series(1,1); +select * from ttab where xmin = 2; + a +--- +(0 rows) + +select heap_force_freeze('ttab'::regclass, ARRAY['(0, 1)']::tid[]); + heap_force_freeze +--- + +(1 row) + +select ctid, xmax from ttab where xmin = 2; + ctid | xmax +---+-- + (0,1) |0 +(1 row) + +select heap_force_kill('ttab'::regclass, ARRAY['(0, 1)']::tid[]); + heap_force_kill +-
Re: recovering from "found xmin ... from before relfrozenxid ..."
Hi Robert, Thanks for the review. I've gone through all your review comments and understood all of them except this one: You really cannot > modify the buffer like this and then decide, oops, never mind, I think > I won't mark it dirty or write WAL for the changes. If you do that, > the buffer is still in memory, but it's now been modified. A > subsequent operation that modifies it will start with the altered > state you created here, quite possibly leading to WAL that cannot be > correctly replayed on the standby. In other words, you've got to > decide for certain whether you want to proceed with the operation > *before* you enter the critical section. > Could you please explain this point once more in detail? I am not quite able to understand under what circumstances a buffer would be modified, but won't be marked as dirty or a WAL won't be written for it. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
Re: recovering from "found xmin ... from before relfrozenxid ..."
On Mon, Aug 3, 2020 at 7:06 PM Robert Haas wrote: > > On Mon, Aug 3, 2020 at 5:05 AM Ashutosh Sharma wrote: > > Could you please explain this point once more in detail? I am not quite > > able to understand under what circumstances a buffer would be modified, but > > won't be marked as dirty or a WAL won't be written for it. > > Whenever this branch is taken: > > + if (nskippedItems == noffs) > + goto skip_wal; > If the above path is taken, that means none of the items on the page were changed. As per the following if-check, whenever an item in the offnos[] array is found dead or unused, it is skipped (via the continue statement), which means the item is neither marked dead nor marked frozen. Now, if this happens for all the items on a page, the above condition (nskippedItems == noffs) becomes true and the buffer remains unchanged, so we neither mark such a buffer dirty nor do any WAL logging for it. This is my understanding; please let me know if I am missing something here. Thank you. if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid)) { nskippedItems++; ereport(NOTICE, (errmsg("skipping tid (%u, %u) because it is already marked %s", blkno, offnos[i], ItemIdIsDead(itemid) ? "dead" : "unused"))); continue; } > At this point you have already modified the page, using ItemIdSetDead, > HeapTupleHeaderSet*, and/or directly adjusting htup->infomask. If this > branch is taken, then MarkBufferDirty() and log_newpage_buffer() are > skipped. > -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
Re: recovering from "found xmin ... from before relfrozenxid ..."
Hi Robert, Thanks for the review. Please find my comments inline: On Sat, Aug 1, 2020 at 12:18 AM Robert Haas wrote: > > On Fri, Jul 31, 2020 at 8:52 AM Ashutosh Sharma wrote: > > Attached is the new version of patch that addresses the comments from > > Andrey and Beena. > > +PGFILEDESC = "pg_surgery - perform surgery on the damaged heap table" > > the -> a > > I also suggest: heap table -> relation, because we might want to > extend this to other cases later. > Corrected. > +-- toast table. > +begin; > +create table ttab(a text); > +insert into ttab select string_agg(chr(floor(random() * 26)::int + > 65), '') from generate_series(1,1); > +select * from ttab where xmin = 2; > + a > +--- > +(0 rows) > + > +select heap_force_freeze('ttab'::regclass, ARRAY['(0, 1)']::tid[]); > + heap_force_freeze > +--- > + > +(1 row) > + > > I don't understand the point of this. You're not testing the function > on the TOAST table; you're testing it on the main table when there > happens to be a TOAST table that is probably getting used for > something. But that's not really relevant to what is being tested > here, so as written this seems redundant with the previous cases. > Yeah, it's being tested on the main table, not on a toast table. I've removed this test-case and also restricted direct access to the toast table using heap_force_kill/freeze functions. I think we shouldn't be using these functions to do any changes in the toast table. We will only use these functions with the main table and let VACUUM remove the corresponding data chunks (pointed by the tuple that got removed from the main table). Another option would be to identify all the data chunks corresponding to the tuple (ctid) being killed from the main table and remove them one by one. We will only do this if the tuple from the main table that has been marked as killed has an external storage. We will have to add a bunch of code for this otherwise we can let VACUUM do this for us. Let me know your thoughts on this. > +-- test pg_surgery functions with the unsupported relations. Should fail. > > Please name the specific functions being tested here in case we add > more in the future that are tested separately. > Done. > +++ b/contrib/pg_surgery/heap_surgery_funcs.c > > I think we could drop _funcs from the file name. > Done. > +#ifdef PG_MODULE_MAGIC > +PG_MODULE_MAGIC; > +#endif > > The #ifdef here is not required, and if you look at other contrib > modules you'll see that they don't have it. > Okay, done. > I don't like all the macros at the top of the file much. CHECKARRVALID > is only used in one place, so it seems to me that you might as well > just inline it and lose the macro. Likewise for SORT and ARRISEMPTY. > Done. > Once you do that, heap_force_common() can just compute the number of > array elements once, instead of doing it once inside ARRISEMPTY, then > again inside SORT, and then a third time to initialize ntids. You can > just have a local variable in that function that is set once and then > used as needed. Then you are only doing ARRNELEMS once, so you can get > rid of that macro too. The same technique can be used to get rid of > ARRPTR. So then all the macros go away without introducing any code > duplication. > Done. > +/* Options to forcefully change the state of a heap tuple. */ > +typedef enum HTupleForceOption > +{ > + FORCE_KILL, > + FORCE_FREEZE > +} HTupleForceOption; > > I suggest un-abbreviating HTuple to HeapTuple and un-abbreviating the > enum members to HEAP_FORCE_KILL and HEAP_FORCE_FREE. Done. 
Also, how about > option -> operation? > I think both look okay to me. > + return heap_force_common(fcinfo, FORCE_KILL); > > I think it might be more idiomatic to use PG_RETURN_DATUM here. I > know it's the same thing, though, and perhaps I'm even wrong about the > prevailing style. > Done. > + Assert(force_opt == FORCE_KILL || force_opt == FORCE_FREEZE); > > I think this is unnecessary. It's an enum with 2 values. > Removed. > + if (ARRISEMPTY(ta)) > + ereport(ERROR, > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + errmsg("empty tid array"))); > > I don't see why this should be an error. Why can't we just continue > normally and if it does nothing, it does nothing? You'd need to change > the do..while loop to a while loop so that the end condition is tested > at the top, but that seems fine. > I think it's okay to have this check. So, just left it as-is. We do have such checks in other contrib modules as well wherever the array is being passed as an inpu
Re: recovering from "found xmin ... from before relfrozenxid ..."
On Thu, Aug 6, 2020 at 1:04 AM Robert Haas wrote: > > On Mon, Aug 3, 2020 at 12:13 PM Ashutosh Sharma wrote: > > If the above path is taken that means none of the items in the page > > got changed. > > Oops. I didn't realize that, sorry. Maybe it would be a little more > clear if instead of "int nSkippedItems" you had "bool > did_modify_page"? Then you could initialize it to false and set it to > true just before doing the page modifications. > Okay, np, in that case, as you suggested, I will replace "int nSkippedItems" with "did_modify_page" to increase the clarity. I will do this change in the next version of patch. Thanks. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
Re: recovering from "found xmin ... from before relfrozenxid ..."
On Thu, Aug 6, 2020 at 1:29 AM Robert Haas wrote: > > On Wed, Aug 5, 2020 at 9:42 AM Ashutosh Sharma wrote: > > Yeah, it's being tested on the main table, not on a toast table. I've > > removed this test-case and also restricted direct access to the toast > > table using heap_force_kill/freeze functions. I think we shouldn't be > > using these functions to do any changes in the toast table. We will > > only use these functions with the main table and let VACUUM remove the > > corresponding data chunks (pointed by the tuple that got removed from > > the main table). > > I agree with removing the test case, but I disagree with restricting > this from being used on the TOAST table. These are tools for experts, > who may use them as they see fit. It's unlikely that it would be > useful to use this on a TOAST table, I think, but not impossible. > Okay, If you want I can remove the restriction on a toast table, but, then that means a user can directly remove the data chunks from the toast table without changing anything in the main table. This means we won't be able to query the main table as it will fail with an error like "ERROR: unexpected chunk number ...". So, we will have to find some way to identify the pointer to the chunks that got deleted from the toast table and remove that pointer from the main table. We also need to make sure that before we remove a tuple (pointer) from the main table, we identify all the remaining data chunks pointed by this tuple and remove them completely only then that table would be considered to be in a good state. Now, I am not sure if we can currently do all these things. > > Another option would be to identify all the data chunks corresponding > > to the tuple (ctid) being killed from the main table and remove them > > one by one. We will only do this if the tuple from the main table that > > has been marked as killed has an external storage. We will have to add > > a bunch of code for this otherwise we can let VACUUM do this for us. > > Let me know your thoughts on this. > > I don't think VACUUM will do anything for us automatically -- it isn't > going to know that we force-killed the tuple in the main table. > Normally, a tuple delete would have to set xmax on the TOAST tuples > and then VACUUM would do its thing, but in this case that won't > happen. So if you force-kill a tuple in the main table you would end > up with a space leak in the TOAST table. > > The problem here is that one reason you might force-killing a tuple in > the main table is because it's full of garbage. If so, trying to > decode the tuple so that you can find the TOAST pointers might crash > or error out, or maybe that part will work but then you'll error out > trying to look up the corresponding TOAST tuples, either because the > values are not valid or because the TOAST table itself is generally > hosed in some way. So I think it is probably best if we keep this tool > as simple as possible, with as few dependencies as we can, and > document the possible negative outcomes of using it. I completely agree with you. It's not > impossible to recover from a space-leak like this; you can always move > the data into a new table with CTAS and then drop the old one. Not > sure whether CLUSTER or VACUUM FULL would also be sufficient. > Yeah, I think, we can either use CTAS or VACUUM FULL, both look fine. 
> Separately, we might want to add a TOAST-checker to amcheck, or > enhance the heap-checker Mark is working on, and one of the things it > could do is check for TOAST entries to which nothing points. Then if > you force-kill tuples in the main table you could also use that tool > to look for things in the TOAST table that ought to also be > force-killed. > Okay, good to know that. Thanks for sharing this info. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
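To make the space-leak recovery discussed above concrete, here is a sketch (table name is hypothetical; as noted, it is not fully settled in this thread whether CLUSTER or VACUUM FULL alone is sufficient, so treat both options as candidates rather than a recommendation):

-- Option 1: rewrite the table (and its TOAST table) in place; only live tuples
-- and the TOAST chunks they reference are copied into the new files.
VACUUM FULL mytab;

-- Option 2: copy the surviving rows into a new table and swap it in.
-- (Indexes, constraints and privileges would need to be recreated.)
CREATE TABLE mytab_new AS SELECT * FROM mytab;
DROP TABLE mytab;
ALTER TABLE mytab_new RENAME TO mytab;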
Re: recovering from "found xmin ... from before relfrozenxid ..."
Hello Masahiko-san, Thanks for looking into the patch. Please find my comments inline below: On Thu, Aug 6, 2020 at 1:42 PM Masahiko Sawada wrote: > > On Wed, 5 Aug 2020 at 22:42, Ashutosh Sharma wrote: > > Please check the attached patch for the changes. > > I also looked at this version patch and have some small comments: > > + Oid relid = PG_GETARG_OID(0); > + ArrayType *ta = PG_GETARG_ARRAYTYPE_P_COPY(1); > + ItemPointer tids; > + int ntids; > + Relationrel; > + Buffer buf; > + Pagepage; > + ItemId itemid; > + BlockNumber blkno; > + OffsetNumber *offnos; > + OffsetNumberoffno, > + noffs, > + curr_start_ptr, > + next_start_ptr, > + maxoffset; > + int i, > + nskippedItems, > + nblocks; > > You declare all variables at the top of heap_force_common() function > but I think we can declare some variables such as buf, page inside of > the do loop. > Sure, I will do this in the next version of patch. > --- > + if (offnos[i] > maxoffset) > + { > + ereport(NOTICE, > +errmsg("skipping tid (%u, %u) because it > contains an invalid offset", > + blkno, offnos[i])); > + continue; > + } > > If all tids on a page take the above path, we will end up logging FPI > in spite of modifying nothing on the page. > Yeah, that's right. I've taken care of this in the new version of patch which I am yet to share. > --- > + /* XLOG stuff */ > + if (RelationNeedsWAL(rel)) > + log_newpage_buffer(buf, true); > > I think we need to set the returned LSN by log_newpage_buffer() to the page > lsn. > I think we are already setting the page lsn in the log_newpage() which is being called from log_newpage_buffer(). So, AFAIU, no change would be required here. Please let me know if I am missing something. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
Re: recovering from "found xmin ... from before relfrozenxid ..."
Attached v4 patch fixes the latest comments from Robert and Masahiko-san. Changes: 1) Let heap_force_kill and freeze functions to be used with toast tables. 2) Replace "int nskippedItems" with "bool did_modify_page" flag to know if any modification happened in the page or not. 3) Declare some of the variables such as buf, page inside of the do loop instead of declaring them at the top of heap_force_common function. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com On Thu, Aug 6, 2020 at 2:49 PM Masahiko Sawada wrote: > > On Thu, 6 Aug 2020 at 18:05, Ashutosh Sharma wrote: > > > > Hello Masahiko-san, > > > > Thanks for looking into the patch. Please find my comments inline below: > > > > On Thu, Aug 6, 2020 at 1:42 PM Masahiko Sawada > > wrote: > > > > > > On Wed, 5 Aug 2020 at 22:42, Ashutosh Sharma > > > wrote: > > > > Please check the attached patch for the changes. > > > > > > I also looked at this version patch and have some small comments: > > > > > > + Oid relid = PG_GETARG_OID(0); > > > + ArrayType *ta = PG_GETARG_ARRAYTYPE_P_COPY(1); > > > + ItemPointer tids; > > > + int ntids; > > > + Relationrel; > > > + Buffer buf; > > > + Pagepage; > > > + ItemId itemid; > > > + BlockNumber blkno; > > > + OffsetNumber *offnos; > > > + OffsetNumberoffno, > > > + noffs, > > > + curr_start_ptr, > > > + next_start_ptr, > > > + maxoffset; > > > + int i, > > > + nskippedItems, > > > + nblocks; > > > > > > You declare all variables at the top of heap_force_common() function > > > but I think we can declare some variables such as buf, page inside of > > > the do loop. > > > > > > > Sure, I will do this in the next version of patch. > > > > > --- > > > + if (offnos[i] > maxoffset) > > > + { > > > + ereport(NOTICE, > > > +errmsg("skipping tid (%u, %u) because it > > > contains an invalid offset", > > > + blkno, offnos[i])); > > > + continue; > > > + } > > > > > > If all tids on a page take the above path, we will end up logging FPI > > > in spite of modifying nothing on the page. > > > > > > > Yeah, that's right. I've taken care of this in the new version of > > patch which I am yet to share. > > > > > --- > > > + /* XLOG stuff */ > > > + if (RelationNeedsWAL(rel)) > > > + log_newpage_buffer(buf, true); > > > > > > I think we need to set the returned LSN by log_newpage_buffer() to the > > > page lsn. > > > > > > > I think we are already setting the page lsn in the log_newpage() which > > is being called from log_newpage_buffer(). So, AFAIU, no change would > > be required here. Please let me know if I am missing something. > > You're right. I'd missed the comment of log_newpage_buffer(). Thanks! > > Regards, > > -- > Masahiko Sawadahttp://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services From 6d616af51a8f9e6352145f799f635ec907ed83c2 Mon Sep 17 00:00:00 2001 From: ashu Date: Thu, 6 Aug 2020 18:02:29 +0530 Subject: [PATCH] Add contrib/pg_surgery to perform surgery on a damaged heap table. This contrib module basically adds a couple of functions named heap_force_kill and heap_force_freeze that can be used in the scripts to perform surgery on the damaged heap tables. Ashutosh Sharma. 
--- contrib/Makefile | 1 + contrib/pg_surgery/Makefile| 23 ++ contrib/pg_surgery/expected/pg_surgery.out | 78 +++ contrib/pg_surgery/heap_surgery.c | 362 + contrib/pg_surgery/pg_surgery--1.0.sql | 18 ++ contrib/pg_surgery/pg_surgery.control | 5 + contrib/pg_surgery/sql/pg_surgery.sql | 61 + 7 files changed, 548 insertions(+) create mode 100644 contrib/pg_surgery/Makefile create mode 100644 contrib/pg_surgery/expected/pg_surgery.out create mode 100644 contrib/pg_surgery/heap_surgery.c create mode 100644 contrib/pg_surgery/pg_surgery--1.0.sql create mode 100644 contrib/pg_surgery/pg_surgery.control create mode 100644 contrib/pg_sur
Re: recovering from "found xmin ... from before relfrozenxid ..."
On Thu, Aug 6, 2020 at 9:19 PM Robert Haas wrote: > > On Thu, Aug 6, 2020 at 2:11 AM Ashutosh Sharma wrote: > > Okay, If you want I can remove the restriction on a toast table, but, > > then that means a user can directly remove the data chunks from the > > toast table without changing anything in the main table. This means we > > won't be able to query the main table as it will fail with an error > > like "ERROR: unexpected chunk number ...". So, we will have to find > > some way to identify the pointer to the chunks that got deleted from > > the toast table and remove that pointer from the main table. We also > > need to make sure that before we remove a tuple (pointer) from the > > main table, we identify all the remaining data chunks pointed by this > > tuple and remove them completely only then that table would be > > considered to be in a good state. Now, I am not sure if we can > > currently do all these things. > > That's the user's problem. If they don't have a plan for that, they > shouldn't use this tool. I don't think we can or should try to handle > it in this code. > Okay, thanks. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
Re: recovering from "found xmin ... from before relfrozenxid ..."
Thanks Rajkumar for testing the patch. Here are some of the additional test-cases that I would suggest you to execute, if possible: 1) You may try running the test-cases that you have executed so far with SR setup and see if the changes are getting reflected on the standby. 2) You may also try running some concurrent test-cases for e.g. try running these functions with VACUUM or some other sql commands (preferable DML commands) in parallel. 3) See what happens when you pass some invalid tids (containing invalid block or offset number) to these functions. You may also try running these functions on the same tuple repeatedly and see the behaviour. ... -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com On Thu, Aug 6, 2020 at 2:25 PM Rajkumar Raghuwanshi wrote: > > I have been doing some user-level testing of this feature, apart from sanity > test for extension and it's functions > > I have tried to corrupt tuples and then able to fix it using > heap_force_freeze/kill functions. > > > --corrupt relfrozenxid, cause vacuum failed. > > update pg_class set relfrozenxid = (relfrozenxid::text::integer + > 10)::text::xid where relname = 'test_tbl'; > > UPDATE 1 > > insert into test_tbl values (2, 'BBB'); > > > postgres=# vacuum test_tbl; > > ERROR: found xmin 507 from before relfrozenxid 516 > > CONTEXT: while scanning block 0 of relation "public.test_tbl" > > > postgres=# select *, ctid, xmin, xmax from test_tbl; > > a | b | ctid | xmin | xmax > > ---+-+---+--+-- > > 1 | AAA | (0,1) | 505 |0 > > 2 | BBB | (0,2) | 507 |0 > > (2 rows) > > > --fixed using heap_force_freeze > > postgres=# select heap_force_freeze('test_tbl'::regclass, > ARRAY['(0,2)']::tid[]); > > heap_force_freeze > > --- > > > postgres=# vacuum test_tbl; > > VACUUM > > postgres=# select *, ctid, xmin, xmax from test_tbl; > > a | b | ctid | xmin | xmax > > ---+-+---+--+-- > > 1 | AAA | (0,1) | 505 |0 > > 2 | BBB | (0,2) |2 |0 > > (2 rows) > > > --corrupt table headers in base/oid. file, cause table access failed. > > postgres=# select ctid, * from test_tbl; > > ERROR: could not access status of transaction 4294967295 > > DETAIL: Could not open file "pg_xact/0FFF": No such file or directory. > > > --removed corrupted tuple using heap_force_kill > > postgres=# select heap_force_kill('test_tbl'::regclass, > ARRAY['(0,2)']::tid[]); > > heap_force_kill > > - > > > > (1 row) > > > postgres=# select ctid, * from test_tbl; > > ctid | a | b > > ---+---+- > > (0,1) | 1 | AAA > > (1 row) > > > I will be continuing with my testing with the latest patch and update here if > found anything. > > > Thanks & Regards, > Rajkumar Raghuwanshi > > > On Thu, Aug 6, 2020 at 1:42 PM Masahiko Sawada > wrote: >> >> On Wed, 5 Aug 2020 at 22:42, Ashutosh Sharma wrote: >> > >> > Hi Robert, >> > >> > Thanks for the review. Please find my comments inline: >> > >> > On Sat, Aug 1, 2020 at 12:18 AM Robert Haas wrote: >> > > >> > > On Fri, Jul 31, 2020 at 8:52 AM Ashutosh Sharma >> > > wrote: >> > > > Attached is the new version of patch that addresses the comments from >> > > > Andrey and Beena. >> > > >> > > +PGFILEDESC = "pg_surgery - perform surgery on the damaged heap table" >> > > >> > > the -> a >> > > >> > > I also suggest: heap table -> relation, because we might want to >> > > extend this to other cases later. >> > > >> > >> > Corrected. >> > >> > > +-- toast table. 
>> > > +begin; >> > > +create table ttab(a text); >> > > +insert into ttab select string_agg(chr(floor(random() * 26)::int + >> > > 65), '') from generate_series(1,1); >> > > +select * from ttab where xmin = 2; >> > > + a >> > > +--- >> > > +(0 rows) >> > > + >> > > +select heap_force_freeze('ttab'::regclass, ARRAY['(0, 1)']::tid[]); >> > > + heap_force_freeze >> > > +--- >> > > + >> > > +(1 row) >> > > + >> > > >> > > I don't understand the point of this. You're not testing the function >> > > on t
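A sketch of what the invalid-tid tests suggested in the message above might look like (table name and tids are made up; the exact errors or notices emitted depend on the patch version under test):

-- Pass a tid whose block number is beyond the end of the relation.
SELECT heap_force_kill('mytab'::regclass, ARRAY['(9999,1)']::tid[]);

-- Pass a tid whose offset number is out of range for the block.
SELECT heap_force_kill('mytab'::regclass, ARRAY['(0,200)']::tid[]);

-- Run the function repeatedly on the same tuple; the second call should find
-- the item already dead/unused and skip it.
SELECT heap_force_kill('mytab'::regclass, ARRAY['(0,1)']::tid[]);
SELECT heap_force_kill('mytab'::regclass, ARRAY['(0,1)']::tid[]);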
Re: recovering from "found xmin ... from before relfrozenxid ..."
On Fri, Aug 7, 2020 at 9:21 PM Robert Haas wrote: > > On Thu, Aug 6, 2020 at 9:23 AM Ashutosh Sharma wrote: > There's a certain inconsistency to these messages: > > rhaas=# create table foo (a int); > CREATE TABLE > rhaas=# insert into foo values (1); > INSERT 0 1 > rhaas=# select heap_force_kill('foo'::regclass, array['(0,2)'::tid]); > NOTICE: skipping tid (0, 2) because it contains an invalid offset > heap_force_kill > - > > (1 row) > > rhaas=# select heap_force_kill('foo'::regclass, array['(1,0)'::tid]); > ERROR: invalid item pointer > LOCATION: tids_same_page_fetch_offnums, heap_surgery.c:347 > rhaas=# select heap_force_kill('foo'::regclass, array['(1,1)'::tid]); > ERROR: block number 1 is out of range for relation "foo" > > From a user perspective it seems like I've made three very similar > mistakes: in the first case, the offset is too high, in the second > case it's too low, and in the third case the block number is out of > range. But in one case I get a NOTICE and in the other two cases I get > an ERROR. In one case I get the relation name and in the other two > cases I don't. The two complaints about an invalid offset are phrased > completely differently from each other. For example, suppose you do > this: > > ERROR: tid (%u, %u) is invalid for relation "%s" because the block > number is out of range (%u..%u) > ERROR: tid (%u, %u) is invalid for relation "%s" because the item > number is out of range for this block (%u..%u) > ERROR: tid (%u, %u) is invalid for relation "%s" because the item is unused > ERROR: tid (%u, %u) is invalid for relation "%s" because the item is dead > Thank you for your suggestions. To make this consistent, I am planning to do the following changes: Remove the error message to report "invalid item pointer" from tids_same_page_fetch_offnums() and expand the if-check to detect any invalid offset number in the CRITICAL section such that it not just checks if the offset number is > maxoffset, but also checks if the offset number is equal to 0. That way it would also do the job that "if (!ItemPointerIsValid)" was doing for us. Further, if any invalid block number is detected, then I am planning to skip all the tids associated with this block and move to the next block. Hence, instead of reporting the error I would report the NOTICE message to the user. The other two messages for reporting unused items and dead items remain the same. Hence, with above change, we would be reporting the following 4 messages: NOTICE: skipping all the tids in block %u for relation "%s" because the block number is out of range NOTICE: skipping tid (%u, %u) for relation "%s" because the item number is out of range for this block NOTICE: skipping tid (%u, %u) for relation "%s" because it is marked dead NOTICE: skipping tid (%u, %u) for relation "%s" because it is marked unused Please let me know if you are okay with the above changes or not? -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
Re: recovering from "found xmin ... from before relfrozenxid ..."
On Tue, Aug 11, 2020 at 7:33 PM Robert Haas wrote: > > On Tue, Aug 11, 2020 at 3:39 AM Ashutosh Sharma wrote: > > The other two messages for reporting unused items and dead items > > remain the same. Hence, with above change, we would be reporting the > > following 4 messages: > > > > NOTICE: skipping all the tids in block %u for relation "%s" because > > the block number is out of range > > > > NOTICE: skipping tid (%u, %u) for relation "%s" because the item > > number is out of range for this block > > > > NOTICE: skipping tid (%u, %u) for relation "%s" because it is marked dead > > > > NOTICE: skipping tid (%u, %u) for relation "%s" because it is marked unused > > > > Please let me know if you are okay with the above changes or not? > > That seems broadly reasonable, but I would suggest phrasing the first > message like this: > > skipping block %u for relation "%s" because the block number is out of range > Okay, thanks for the confirmation. I'll do that. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
Re: recovering from "found xmin ... from before relfrozenxid ..."
Thanks Robert for the review. Please find my comments inline below: On Fri, Aug 7, 2020 at 9:21 PM Robert Haas wrote: > > On Thu, Aug 6, 2020 at 9:23 AM Ashutosh Sharma wrote: > > Attached v4 patch fixes the latest comments from Robert and Masahiko-san. > > Compiler warning: > > heap_surgery.c:136:13: error: comparison of unsigned expression < 0 is > always false [-Werror,-Wtautological-compare] > if (blkno < 0 || blkno >= nblocks) > ~ ^ ~ > Fixed. > There's a certain inconsistency to these messages: > > rhaas=# create table foo (a int); > CREATE TABLE > rhaas=# insert into foo values (1); > INSERT 0 1 > rhaas=# select heap_force_kill('foo'::regclass, array['(0,2)'::tid]); > NOTICE: skipping tid (0, 2) because it contains an invalid offset > heap_force_kill > - > > (1 row) > > rhaas=# select heap_force_kill('foo'::regclass, array['(1,0)'::tid]); > ERROR: invalid item pointer > LOCATION: tids_same_page_fetch_offnums, heap_surgery.c:347 > rhaas=# select heap_force_kill('foo'::regclass, array['(1,1)'::tid]); > ERROR: block number 1 is out of range for relation "foo" > > From a user perspective it seems like I've made three very similar > mistakes: in the first case, the offset is too high, in the second > case it's too low, and in the third case the block number is out of > range. But in one case I get a NOTICE and in the other two cases I get > an ERROR. In one case I get the relation name and in the other two > cases I don't. The two complaints about an invalid offset are phrased > completely differently from each other. For example, suppose you do > this: > > ERROR: tid (%u, %u) is invalid for relation "%s" because the block > number is out of range (%u..%u) > ERROR: tid (%u, %u) is invalid for relation "%s" because the item > number is out of range for this block (%u..%u) > ERROR: tid (%u, %u) is invalid for relation "%s" because the item is unused > ERROR: tid (%u, %u) is invalid for relation "%s" because the item is dead > Corrected. > I think I misled you when I said to use pg_class_aclcheck. I think it > should actually be pg_class_ownercheck. > okay, I've changed it to pg_class_ownercheck. > I think the relkind sanity check should permit RELKIND_MATVIEW also. > Yeah, actually we should allow MATVIEW, don't know why I thought of blocking it earlier. > It's unclear to me why the freeze logic here shouldn't do this part > what heap_prepare_freeze_tuple() does when freezing xmax: > > frz->t_infomask2 &= ~HEAP_HOT_UPDATED; > frz->t_infomask2 &= ~HEAP_KEYS_UPDATED; > Yeah, we should have these changes when freezing the xmax. > Likewise, why should we not freeze or invalidate xvac in the case > where tuple->t_infomask & HEAP_MOVED, as heap_prepare_freeze_tuple() > would do? > Again, we should have this as well. Apart from above, this time I've also added the documentation on pg_surgery module and added a few more test-cases. Attached patch with above changes. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com From 2aa56b9632cf1d291f4433afd972dc647d354dcb Mon Sep 17 00:00:00 2001 From: ashu Date: Wed, 12 Aug 2020 14:38:14 +0530 Subject: [PATCH] Add contrib/pg_surgery to perform surgery on a damaged heap table. This contrib module basically adds a couple of functions named heap_force_kill and heap_force_freeze that can be used in the scripts to perform surgery on the damaged heap tables. Ashutosh Sharma. 
--- contrib/Makefile | 1 + contrib/pg_surgery/Makefile| 23 ++ contrib/pg_surgery/expected/pg_surgery.out | 161 + contrib/pg_surgery/heap_surgery.c | 375 + contrib/pg_surgery/pg_surgery--1.0.sql | 18 ++ contrib/pg_surgery/pg_surgery.control | 5 + contrib/pg_surgery/sql/pg_surgery.sql | 89 +++ doc/src/sgml/contrib.sgml | 1 + doc/src/sgml/filelist.sgml | 1 + doc/src/sgml/pgsurgery.sgml| 94 10 files changed, 768 insertions(+) create mode 100644 contrib/pg_surgery/Makefile create mode 100644 contrib/pg_surgery/expected/pg_surgery.out create mode 100644 contrib/pg_surgery/heap_surgery.c create mode 100644 contrib/pg_surgery/pg_surgery--1.0.sql create mode 100644 contrib/pg_surgery/pg_surgery.control create mode 100644 contrib/pg_surgery/sql/pg_surgery.sql create mode 100644 doc/src/sgml/pgsurgery.sgml diff --git a/contrib/Makefile b/contrib/Makefile index 1846d41..07d5734 100644 --- a/contrib/Makefile +++ b/contrib/Makefile @@ -35,6 +35,
Re: recovering from "found xmin ... from before relfrozenxid ..."
Hi Asim, Thanks for having a look into the patch and for sharing your feedback. Please find my comments inline below: On Thu, Aug 13, 2020 at 12:36 PM Asim Praveen wrote: > > Hi Ashutosh > > I stumbled upon this thread today, went through your patch and it looks good. > A minor suggestion in sanity_check_relation(): > > if (rel->rd_rel->relam != HEAP_TABLE_AM_OID) > ereport(ERROR, > (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > errmsg("only heap AM is supported"))); > > Instead of checking the access method OID, it seems better to check the > handler OID like so: > > if (rel->rd_amhandler != HEAP_TABLE_AM_HANDLER_OID) > > The reason is current version of sanity_check_relation() would emit error for > the following case even when the table structure is actually heap. > > create access method myam type table handler heap_tableam_handler; > create table mytable (…) using myam; > This looks like a very good suggestion to me. I will do this change in the next version. Just wondering if we should be doing similar changes in other contrib modules (like pgrowlocks, pageinspect and pgstattuple) as well? -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
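A sketch of the case the rd_amhandler check is meant to allow (the access method names come from Asim's example above; the column list is made up since the original elides it):

-- A user-defined table access method that is just an alias for the heap handler.
CREATE ACCESS METHOD myam TYPE TABLE HANDLER heap_tableam_handler;
CREATE TABLE mytable (a int, b text) USING myam;
INSERT INTO mytable VALUES (1, 'one');

-- The storage is still plain heap, so the surgery functions should be usable here;
-- a check on relam would reject this table, while a check on rd_amhandler would not.
SELECT heap_force_kill('mytable'::regclass, ARRAY['(0,1)']::tid[]);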
Re: recovering from "found xmin ... from before relfrozenxid ..."
Hello Masahiko-san, Thanks for the review. Please check the comments inline below: On Fri, Aug 14, 2020 at 10:07 AM Masahiko Sawada wrote: > Thank you for updating the patch! Here are my comments on v5 patch: > > --- a/contrib/Makefile > +++ b/contrib/Makefile > @@ -35,6 +35,7 @@ SUBDIRS = \ > pg_standby \ > pg_stat_statements \ > pg_trgm \ > + pg_surgery \ > pgcrypto\ > > I guess we use alphabetical order here. So pg_surgery should be placed > before pg_trgm. > Okay, will take care of this in the next version of patch. > --- > + if (heap_force_opt == HEAP_FORCE_KILL) > + ItemIdSetDead(itemid); > > I think that if the page is an all-visible page, we should clear an > all-visible bit on the visibility map corresponding to the page and > PD_ALL_VISIBLE on the page header. Otherwise, index only scan would > return the wrong results. > I think we should let VACUUM do that. Please note that this module is intended to be used only on a damaged relation and should only be operated on damaged tuples of such relations. And the execution of any of the functions provided by this module on a damaged relation must be followed by VACUUM with DISABLE_PAGE_SKIPPING option on that relation. This is necessary to bring back a damaged relation to the sane state once a surgery is performed on it. I will try to add this note in the documentation for this module. > --- > + /* > +* We do not mark the buffer dirty or do WAL logging for unmodifed > +* pages. > +*/ > + if (!did_modify_page) > + goto skip_wal; > + > + /* Mark buffer dirty before we write WAL. */ > + MarkBufferDirty(buf); > + > + /* XLOG stuff */ > + if (RelationNeedsWAL(rel)) > + log_newpage_buffer(buf, true); > + > +skip_wal: > + END_CRIT_SECTION(); > + > > s/unmodifed/unmodified/ > okay, will fix this typo. > Do we really need to use goto? I think we can modify it like follows: > > if (did_modity_page) > { >/* Mark buffer dirty before we write WAL. */ >MarkBufferDirty(buf); > >/* XLOG stuff */ >if (RelationNeedsWAL(rel)) >log_newpage_buffer(buf, true); > } > > END_CRIT_SECTION(); > No, we don't need it. We can achieve the same by checking the status of did_modify_page flag as you suggested. I will do this change in the next version. > --- > pg_force_freeze() can revival a tuple that is already deleted but not > vacuumed yet. Therefore, the user might need to reindex indexes after > using that function. For instance, with the following script, the last > two queries: index scan and seq scan, will return different results. > > set enable_seqscan to off; > set enable_bitmapscan to off; > set enable_indexonlyscan to off; > create table tbl (a int primary key); > insert into tbl values (1); > > update tbl set a = a + 100 where a = 1; > > explain analyze select * from tbl where a < 200; > > -- revive deleted tuple on heap > select heap_force_freeze('tbl', array['(0,1)'::tid]); > > -- index scan returns 2 tuples > explain analyze select * from tbl where a < 200; > > -- seq scan returns 1 tuple > set enable_seqscan to on; > explain analyze select * from tbl; > I am not sure if this is the right use-case of pg_force_freeze function. I think we should only be running pg_force_freeze function on a tuple for which VACUUM reports "found xmin ABC from before relfrozenxid PQR" sort of error otherwise it might worsen the things instead of making it better. Now, the question is - can VACUUM report this type of error for a deleted tuple or it would only report it for a live tuple? 
AFAIU this won't be reported for the deleted tuples because VACUUM wouldn't consider freezing a tuple that has been deleted. > Also, if a tuple updated and moved to another partition is revived by > heap_force_freeze(), its ctid still has special values: > MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I don't > see a problem yet caused by a visible tuple having the special ctid > value, but it might be worth considering either to reset ctid value as > well or to not freezing already-deleted tuple. > For this as well, the answer remains the same as above. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
Re: recovering from "found xmin ... from before relfrozenxid ..."
Hello Masahiko-san, I've spent some more time trying to understand the code in lazy_scan_heap function to know under what all circumstances a VACUUM can fail with "found xmin ... before relfrozenxid ..." error for a tuple whose xmin is behind relfrozenxid. Here are my observations: 1) It can fail with this error for a live tuple OR, 2) It can also fail with this error if a tuple (that went through update) is marked as HEAP_HOT_UPDATED or HEAP_ONLY_TUPLE. OR, 3) If there are any concurrent transactions, then the tuple might be marked as HEAPTUPLE_INSERT_IN_PROGRESS or HEAPTUPLE_DELETE_IN_PROGRESS or HEAPTUPLE_RECENTLY_DEAD in which case also VACUUM can fail with this error. Now, AFAIU, as we will be dealing with a damaged table, the chances of point #3 being the cause of this error looks impossible in our case because I don't think we will be doing anything in parallel when performing surgery on a damaged table, in fact we shouldn't be doing any such things. However, it is quite possible that reason #2 could cause VACUUM to fail with this sort of error, but, as we are already skipping redirected item pointers in heap_force_common(), I think, we would never be marking HEAP_HOT_UPDATED tuple as frozen and I don't see any problem in marking HEAP_ONLY_TUPLE as frozen. So, probably, we may not need to handle point #2 as well. Further, I also don't see VACUUM reporting this error for a tuple that has been moved from one partition to another. So, I think we might not need to do any special handling for a tuple that got updated and its new version was moved to another partition. If you feel I am missing something here, please correct me. Thank you. Moreover, while I was exploring on above, I noticed that in lazy_scan_heap(), before we call HeapTupleSatisfiesVacuum() we check for a redirected item pointers and if any redirected item pointer is detected we do not call HeapTupleSatisfiesVacuum(). So, not sure how HeapTupleSatisfiesVacuum would ever return a dead tuple that is marked with HEAP_HOT_UPDATED. I am referring to the following code in lazy_scan_heap(). for (offnum = FirstOffsetNumber; offnum <= maxoff; offnum = OffsetNumberNext(offnum)) { ItemId itemid; itemid = PageGetItemId(page, offnum); . . /* Redirect items mustn't be touched */ <-- this check would bypass the redirected item pointers from being checked for HeapTupleSatisfiesVacuum. if (ItemIdIsRedirected(itemid)) { hastup = true; /* this page won't be truncatable */ continue; } .. .. switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf)) { case HEAPTUPLE_DEAD: if (HeapTupleIsHotUpdated(&tuple) || HeapTupleIsHeapOnly(&tuple) || params->index_cleanup == VACOPT_TERNARY_DISABLED) nkeep += 1; else tupgone = true; /* we can delete the tuple */ .. .. } So, the point is, would HeapTupleIsHotUpdated(&tuple) ever be true? -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com On Mon, Aug 17, 2020 at 11:35 AM Ashutosh Sharma wrote: > > Hello Masahiko-san, > > Thanks for the review. Please check the comments inline below: > > On Fri, Aug 14, 2020 at 10:07 AM Masahiko Sawada > wrote: > > > Thank you for updating the patch! Here are my comments on v5 patch: > > > > --- a/contrib/Makefile > > +++ b/contrib/Makefile > > @@ -35,6 +35,7 @@ SUBDIRS = \ > > pg_standby \ > > pg_stat_statements \ > > pg_trgm \ > > + pg_surgery \ > > pgcrypto\ > > > > I guess we use alphabetical order here. So pg_surgery should be placed > > before pg_trgm. > > > > Okay, will take care of this in the next version of patch. 
> > > --- > > + if (heap_force_opt == HEAP_FORCE_KILL) > > + ItemIdSetDead(itemid); > > > > I think that if the page is an all-visible page, we should clear an > > all-visible bit on the visibility map corresponding to the page and > > PD_ALL_VISIBLE on the page header. Otherwise, index only scan would > > return the wrong results. > > > > I think we should let VACUUM do that. Please note that this module is > intended to be used only on a damaged relation and should only be > operated on damaged tuples of such relations. And the execution of any > of the functions provided by this module on a damaged relation must be > followed by VACUUM with DISABLE_PAGE_SKIPPING option on that relation. > T
Re: recovering from "found xmin ... from before relfrozenxid ..."
Attached is the new version of patch that addresses the comments from Asim Praveen and Masahiko-san. It also improves the documentation to some extent. On Tue, Aug 18, 2020 at 1:46 PM Ashutosh Sharma wrote: > > Hello Masahiko-san, > > I've spent some more time trying to understand the code in > lazy_scan_heap function to know under what all circumstances a VACUUM > can fail with "found xmin ... before relfrozenxid ..." error for a > tuple whose xmin is behind relfrozenxid. Here are my observations: > > 1) It can fail with this error for a live tuple > > OR, > > 2) It can also fail with this error if a tuple (that went through > update) is marked as HEAP_HOT_UPDATED or HEAP_ONLY_TUPLE. > > OR, > > 3) If there are any concurrent transactions, then the tuple might be > marked as HEAPTUPLE_INSERT_IN_PROGRESS or HEAPTUPLE_DELETE_IN_PROGRESS > or HEAPTUPLE_RECENTLY_DEAD in which case also VACUUM can fail with > this error. > > Now, AFAIU, as we will be dealing with a damaged table, the chances of > point #3 being the cause of this error looks impossible in our case > because I don't think we will be doing anything in parallel when > performing surgery on a damaged table, in fact we shouldn't be doing > any such things. However, it is quite possible that reason #2 could > cause VACUUM to fail with this sort of error, but, as we are already > skipping redirected item pointers in heap_force_common(), I think, we > would never be marking HEAP_HOT_UPDATED tuple as frozen and I don't > see any problem in marking HEAP_ONLY_TUPLE as frozen. So, probably, we > may not need to handle point #2 as well. > > Further, I also don't see VACUUM reporting this error for a tuple that > has been moved from one partition to another. So, I think we might not > need to do any special handling for a tuple that got updated and its > new version was moved to another partition. > > If you feel I am missing something here, please correct me. Thank you. > > Moreover, while I was exploring on above, I noticed that in > lazy_scan_heap(), before we call HeapTupleSatisfiesVacuum() we check > for a redirected item pointers and if any redirected item pointer is > detected we do not call HeapTupleSatisfiesVacuum(). So, not sure how > HeapTupleSatisfiesVacuum would ever return a dead tuple that is marked > with HEAP_HOT_UPDATED. I am referring to the following code in > lazy_scan_heap(). > > for (offnum = FirstOffsetNumber; > offnum <= maxoff; > offnum = OffsetNumberNext(offnum)) > { > ItemId itemid; > > itemid = PageGetItemId(page, offnum); > > . > . > > > /* Redirect items mustn't be touched */ <-- this check > would bypass the redirected item pointers from being checked for > HeapTupleSatisfiesVacuum. > if (ItemIdIsRedirected(itemid)) > { > hastup = true; /* this page won't be truncatable */ > continue; > } > > .. > .. > > switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf)) > { > case HEAPTUPLE_DEAD: > > if (HeapTupleIsHotUpdated(&tuple) || > HeapTupleIsHeapOnly(&tuple) || > params->index_cleanup == VACOPT_TERNARY_DISABLED) > nkeep += 1; > else > tupgone = true; /* we can delete the tuple */ > .. > .. > } > > > So, the point is, would HeapTupleIsHotUpdated(&tuple) ever be true? > > -- > With Regards, > Ashutosh Sharma > EnterpriseDB:http://www.enterprisedb.com > > On Mon, Aug 17, 2020 at 11:35 AM Ashutosh Sharma > wrote: > > > > Hello Masahiko-san, > > > > Thanks for the review. 
Please check the comments inline below: > > > > On Fri, Aug 14, 2020 at 10:07 AM Masahiko Sawada > > wrote: > > > > > Thank you for updating the patch! Here are my comments on v5 patch: > > > > > > --- a/contrib/Makefile > > > +++ b/contrib/Makefile > > > @@ -35,6 +35,7 @@ SUBDIRS = \ > > > pg_standby \ > > > pg_stat_statements \ > > > pg_trgm \ > > > + pg_surgery \ > > > pgcrypto\ > > > > > > I guess we use alphabetical order here. So pg_surgery should be placed > > > before pg_trgm. > > > > > > > Okay, will take care of this in the next version of patch. > >
Re: recovering from "found xmin ... from before relfrozenxid ..."
On Tue, Aug 18, 2020 at 9:44 PM Alvaro Herrera wrote: > > On 2020-Aug-17, Ashutosh Sharma wrote: > > > > + if (heap_force_opt == HEAP_FORCE_KILL) > > > + ItemIdSetDead(itemid); > > > > > > I think that if the page is an all-visible page, we should clear an > > > all-visible bit on the visibility map corresponding to the page and > > > PD_ALL_VISIBLE on the page header. Otherwise, index only scan would > > > return the wrong results. > > > > I think we should let VACUUM do that. Please note that this module is > > intended to be used only on a damaged relation and should only be > > operated on damaged tuples of such relations. And the execution of any > > of the functions provided by this module on a damaged relation must be > > followed by VACUUM with DISABLE_PAGE_SKIPPING option on that relation. > > This is necessary to bring back a damaged relation to the sane state > > once a surgery is performed on it. I will try to add this note in the > > documentation for this module. > > It makes sense to recommend VACUUM after fixing the page, but I agree > with Sawada-san that it would be sensible to reset the VM bit while > doing surgery, since that's the state that the page would be in. Sure, I will try to do that change but I would still recommend to always run VACUUM with DISABLE_PAGE_SKIPPING option on the relation that underwent surgery. We > should certainly *strongly recommend* to do VACUUM DISABLE_PAGE_SKIPPING, > but if users fail to do so, then leaving the VM bit set just means that > we know *for certain* that there will be further corruption as soon as > the XID counter advances sufficiently. > Yeah, I've already added a note for this in the documentation: Note: "After a surgery is performed on a damaged relation using this module, we must run VACUUM with DISABLE_PAGE_SKIPPING option on that relation to bring it back into a sane state." -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
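To spell out the recommended sequence concretely (the relation name and TID below are only illustrative, not taken from the thread), a surgery call followed by the suggested vacuum would look like this:

  SELECT heap_force_kill('mytable'::regclass, ARRAY['(0,1)'::tid]);
  VACUUM (DISABLE_PAGE_SKIPPING) mytable;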
Re: recovering from "found xmin ... from before relfrozenxid ..."
On Wed, Aug 19, 2020 at 9:27 AM Masahiko Sawada wrote: > > On Mon, 17 Aug 2020 at 15:05, Ashutosh Sharma wrote: > > > > > pg_force_freeze() can revival a tuple that is already deleted but not > > > vacuumed yet. Therefore, the user might need to reindex indexes after > > > using that function. For instance, with the following script, the last > > > two queries: index scan and seq scan, will return different results. > > > > > > set enable_seqscan to off; > > > set enable_bitmapscan to off; > > > set enable_indexonlyscan to off; > > > create table tbl (a int primary key); > > > insert into tbl values (1); > > > > > > update tbl set a = a + 100 where a = 1; > > > > > > explain analyze select * from tbl where a < 200; > > > > > > -- revive deleted tuple on heap > > > select heap_force_freeze('tbl', array['(0,1)'::tid]); > > > > > > -- index scan returns 2 tuples > > > explain analyze select * from tbl where a < 200; > > > > > > -- seq scan returns 1 tuple > > > set enable_seqscan to on; > > > explain analyze select * from tbl; > > > > > > > I am not sure if this is the right use-case of pg_force_freeze > > function. I think we should only be running pg_force_freeze function > > on a tuple for which VACUUM reports "found xmin ABC from before > > relfrozenxid PQR" sort of error otherwise it might worsen the things > > instead of making it better. > > Should this also be documented? I think that it's hard to force the > user to always use this module in the right situation but we need to > show at least when to use. > I've already added some examples in the documentation explaining the use-case of force_freeze function. If required, I will also add a note about it. > > > Also, if a tuple updated and moved to another partition is revived by > > > heap_force_freeze(), its ctid still has special values: > > > MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I don't > > > see a problem yet caused by a visible tuple having the special ctid > > > value, but it might be worth considering either to reset ctid value as > > > well or to not freezing already-deleted tuple. > > > > > > > For this as well, the answer remains the same as above. > > Perhaps the same is true when a tuple header is corrupted including > xmin and ctid for some reason and the user wants to fix it? I'm > concerned that a live tuple having the wrong ctid will cause SEGV or > PANIC error in the future. > If a tuple header itself is corrupted, then I think we must kill that tuple. If only xmin and t_ctid fields are corrupted, then probably we can think of resetting the ctid value of that tuple. However, it won't be always possible to detect the corrupted ctid value. It's quite possible that the corrupted ctid field has valid values for block number and offset number in it, but it's actually corrupted and it would be difficult to consider such ctid as corrupted. Hence, we can't do anything about such types of corruption. Probably in such cases we need to run VACUUM FULL on such tables so that new ctid gets generated for each tuple in the table. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
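For clarity, the last suggestion amounts to a plain table rewrite; assuming a hypothetical table named mytable, it is simply:

  VACUUM FULL mytable;  -- rewrites the table, assigning a fresh ctid to every surviving tuple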
Re: recovering from "found xmin ... from before relfrozenxid ..."
On Wed, Aug 19, 2020 at 3:55 PM Masahiko Sawada wrote: > > On Wed, 19 Aug 2020 at 15:09, Ashutosh Sharma wrote: > > > > On Wed, Aug 19, 2020 at 9:27 AM Masahiko Sawada > > wrote: > > > > > > On Mon, 17 Aug 2020 at 15:05, Ashutosh Sharma > > > wrote: > > > > > > > > > pg_force_freeze() can revival a tuple that is already deleted but not > > > > > vacuumed yet. Therefore, the user might need to reindex indexes after > > > > > using that function. For instance, with the following script, the last > > > > > two queries: index scan and seq scan, will return different results. > > > > > > > > > > set enable_seqscan to off; > > > > > set enable_bitmapscan to off; > > > > > set enable_indexonlyscan to off; > > > > > create table tbl (a int primary key); > > > > > insert into tbl values (1); > > > > > > > > > > update tbl set a = a + 100 where a = 1; > > > > > > > > > > explain analyze select * from tbl where a < 200; > > > > > > > > > > -- revive deleted tuple on heap > > > > > select heap_force_freeze('tbl', array['(0,1)'::tid]); > > > > > > > > > > -- index scan returns 2 tuples > > > > > explain analyze select * from tbl where a < 200; > > > > > > > > > > -- seq scan returns 1 tuple > > > > > set enable_seqscan to on; > > > > > explain analyze select * from tbl; > > > > > > > > > > > > > I am not sure if this is the right use-case of pg_force_freeze > > > > function. I think we should only be running pg_force_freeze function > > > > on a tuple for which VACUUM reports "found xmin ABC from before > > > > relfrozenxid PQR" sort of error otherwise it might worsen the things > > > > instead of making it better. > > > > > > Should this also be documented? I think that it's hard to force the > > > user to always use this module in the right situation but we need to > > > show at least when to use. > > > > > > > I've already added some examples in the documentation explaining the > > use-case of force_freeze function. If required, I will also add a note > > about it. > > > > > > > Also, if a tuple updated and moved to another partition is revived by > > > > > heap_force_freeze(), its ctid still has special values: > > > > > MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I don't > > > > > see a problem yet caused by a visible tuple having the special ctid > > > > > value, but it might be worth considering either to reset ctid value as > > > > > well or to not freezing already-deleted tuple. > > > > > > > > > > > > > For this as well, the answer remains the same as above. > > > > > > Perhaps the same is true when a tuple header is corrupted including > > > xmin and ctid for some reason and the user wants to fix it? I'm > > > concerned that a live tuple having the wrong ctid will cause SEGV or > > > PANIC error in the future. > > > > > > > If a tuple header itself is corrupted, then I think we must kill that > > tuple. If only xmin and t_ctid fields are corrupted, then probably we > > can think of resetting the ctid value of that tuple. However, it won't > > be always possible to detect the corrupted ctid value. It's quite > > possible that the corrupted ctid field has valid values for block > > number and offset number in it, but it's actually corrupted and it > > would be difficult to consider such ctid as corrupted. Hence, we can't > > do anything about such types of corruption. Probably in such cases we > > need to run VACUUM FULL on such tables so that new ctid gets generated > > for each tuple in the table. > > Understood. 
> > Perhaps such corruption will be able to be detected by new heapam > check functions discussed on another thread. My point was that it > might be better to attempt making the tuple header sane state as much > as possible when fixing a live tuple in order to prevent further > problems such as databases crash by malicious attackers. > Agreed. So, to handle the ctid related concern that you raised, I'm planning to do the following changes to ensure that the tuple being marked as frozen contains the correct item pointer
Re: recovering from "found xmin ... from before relfrozenxid ..."
On Thu, Aug 20, 2020 at 11:04 AM Masahiko Sawada wrote: > > On Wed, 19 Aug 2020 at 20:45, Ashutosh Sharma wrote: > > > > On Wed, Aug 19, 2020 at 3:55 PM Masahiko Sawada > > wrote: > > > > > > On Wed, 19 Aug 2020 at 15:09, Ashutosh Sharma > > > wrote: > > > > > > > > On Wed, Aug 19, 2020 at 9:27 AM Masahiko Sawada > > > > wrote: > > > > > > > > > > On Mon, 17 Aug 2020 at 15:05, Ashutosh Sharma > > > > > wrote: > > > > > > > > > > > > > pg_force_freeze() can revival a tuple that is already deleted but > > > > > > > not > > > > > > > vacuumed yet. Therefore, the user might need to reindex indexes > > > > > > > after > > > > > > > using that function. For instance, with the following script, the > > > > > > > last > > > > > > > two queries: index scan and seq scan, will return different > > > > > > > results. > > > > > > > > > > > > > > set enable_seqscan to off; > > > > > > > set enable_bitmapscan to off; > > > > > > > set enable_indexonlyscan to off; > > > > > > > create table tbl (a int primary key); > > > > > > > insert into tbl values (1); > > > > > > > > > > > > > > update tbl set a = a + 100 where a = 1; > > > > > > > > > > > > > > explain analyze select * from tbl where a < 200; > > > > > > > > > > > > > > -- revive deleted tuple on heap > > > > > > > select heap_force_freeze('tbl', array['(0,1)'::tid]); > > > > > > > > > > > > > > -- index scan returns 2 tuples > > > > > > > explain analyze select * from tbl where a < 200; > > > > > > > > > > > > > > -- seq scan returns 1 tuple > > > > > > > set enable_seqscan to on; > > > > > > > explain analyze select * from tbl; > > > > > > > > > > > > > > > > > > > I am not sure if this is the right use-case of pg_force_freeze > > > > > > function. I think we should only be running pg_force_freeze function > > > > > > on a tuple for which VACUUM reports "found xmin ABC from before > > > > > > relfrozenxid PQR" sort of error otherwise it might worsen the things > > > > > > instead of making it better. > > > > > > > > > > Should this also be documented? I think that it's hard to force the > > > > > user to always use this module in the right situation but we need to > > > > > show at least when to use. > > > > > > > > > > > > > I've already added some examples in the documentation explaining the > > > > use-case of force_freeze function. If required, I will also add a note > > > > about it. > > > > > > > > > > > Also, if a tuple updated and moved to another partition is > > > > > > > revived by > > > > > > > heap_force_freeze(), its ctid still has special values: > > > > > > > MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I > > > > > > > don't > > > > > > > see a problem yet caused by a visible tuple having the special > > > > > > > ctid > > > > > > > value, but it might be worth considering either to reset ctid > > > > > > > value as > > > > > > > well or to not freezing already-deleted tuple. > > > > > > > > > > > > > > > > > > > For this as well, the answer remains the same as above. > > > > > > > > > > Perhaps the same is true when a tuple header is corrupted including > > > > > xmin and ctid for some reason and the user wants to fix it? I'm > > > > > concerned that a live tuple having the wrong ctid will cause SEGV or > > > > > PANIC error in the future. > > > > > > > > > > > > > If a tuple header itself is corrupted, then I think we must kill that > > > > tuple. If only xmin and t_ctid fields are corrupted, then probably we > > > > can think of resetting the ctid value of that tuple. However, it
Re: recovering from "found xmin ... from before relfrozenxid ..."
Hi Masahiko-san, Please find the updated patch with the following new changes: 1) It adds the code changes in heap_force_kill function to clear an all-visible bit on the visibility map corresponding to the page that is marked all-visible. Along the way it also clears PD_ALL_VISIBLE flag on the page header. 2) It adds the code changes in heap_force_freeze function to reset the ctid value in a tuple header if it is corrupted. 3) It adds several notes and examples in the documentation stating when and how we need to use the functions provided by this module. Please have a look and let me know for any other concern. Thanks, -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com On Thu, Aug 20, 2020 at 11:43 AM Ashutosh Sharma wrote: > > On Thu, Aug 20, 2020 at 11:04 AM Masahiko Sawada > wrote: > > > > On Wed, 19 Aug 2020 at 20:45, Ashutosh Sharma wrote: > > > > > > On Wed, Aug 19, 2020 at 3:55 PM Masahiko Sawada > > > wrote: > > > > > > > > On Wed, 19 Aug 2020 at 15:09, Ashutosh Sharma > > > > wrote: > > > > > > > > > > On Wed, Aug 19, 2020 at 9:27 AM Masahiko Sawada > > > > > wrote: > > > > > > > > > > > > On Mon, 17 Aug 2020 at 15:05, Ashutosh Sharma > > > > > > wrote: > > > > > > > > > > > > > > > pg_force_freeze() can revival a tuple that is already deleted > > > > > > > > but not > > > > > > > > vacuumed yet. Therefore, the user might need to reindex indexes > > > > > > > > after > > > > > > > > using that function. For instance, with the following script, > > > > > > > > the last > > > > > > > > two queries: index scan and seq scan, will return different > > > > > > > > results. > > > > > > > > > > > > > > > > set enable_seqscan to off; > > > > > > > > set enable_bitmapscan to off; > > > > > > > > set enable_indexonlyscan to off; > > > > > > > > create table tbl (a int primary key); > > > > > > > > insert into tbl values (1); > > > > > > > > > > > > > > > > update tbl set a = a + 100 where a = 1; > > > > > > > > > > > > > > > > explain analyze select * from tbl where a < 200; > > > > > > > > > > > > > > > > -- revive deleted tuple on heap > > > > > > > > select heap_force_freeze('tbl', array['(0,1)'::tid]); > > > > > > > > > > > > > > > > -- index scan returns 2 tuples > > > > > > > > explain analyze select * from tbl where a < 200; > > > > > > > > > > > > > > > > -- seq scan returns 1 tuple > > > > > > > > set enable_seqscan to on; > > > > > > > > explain analyze select * from tbl; > > > > > > > > > > > > > > > > > > > > > > I am not sure if this is the right use-case of pg_force_freeze > > > > > > > function. I think we should only be running pg_force_freeze > > > > > > > function > > > > > > > on a tuple for which VACUUM reports "found xmin ABC from before > > > > > > > relfrozenxid PQR" sort of error otherwise it might worsen the > > > > > > > things > > > > > > > instead of making it better. > > > > > > > > > > > > Should this also be documented? I think that it's hard to force the > > > > > > user to always use this module in the right situation but we need to > > > > > > show at least when to use. > > > > > > > > > > > > > > > > I've already added some examples in the documentation explaining the > > > > > use-case of force_freeze function. If required, I will also add a note > > > > > about it. > > > > > > > > > > > > > Also, if a tuple updated and moved to another partition is > > > > > > > > revived by > > > > > > > > heap_force_freeze(), its ctid still has special values: > > > > > > > > MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. 
I > > > > > > > > don't
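A rough sketch of the visibility-map handling described in change (1) above (not the exact patch code; buffer locking, WAL logging and MarkBufferDirty are omitted here):

  Buffer  vmbuffer = InvalidBuffer;

  if (PageIsAllVisible(page))
  {
      visibilitymap_pin(rel, blkno, &vmbuffer);
      PageClearAllVisible(page);
      visibilitymap_clear(rel, blkno, vmbuffer, VISIBILITYMAP_VALID_BITS);
  }

  if (BufferIsValid(vmbuffer))
      ReleaseBuffer(vmbuffer);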
Re: recovering from "found xmin ... from before relfrozenxid ..."
On Mon, Aug 24, 2020 at 7:15 PM Robert Haas wrote: > > On Tue, Aug 18, 2020 at 12:14 PM Alvaro Herrera > wrote: > > It makes sense to recommend VACUUM after fixing the page, but I agree > > with Sawada-san that it would be sensible to reset the VM bit while > > doing surgery, since that's the state that the page would be in. We > > should certainly *strongly recommend* to do VACUUM DISABLE_PAGE_SKIPPING, > > but if users fail to do so, then leaving the VM bit set just means that > > we know *for certain* that there will be further corruption as soon as > > the XID counter advances sufficiently. > > +1. > This has been taken care of in the latest v7 patch. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
Re: recovering from "found xmin ... from before relfrozenxid ..."
On Mon, Aug 24, 2020 at 11:00 PM Andrey M. Borodin wrote: > > Hi! > > > On 21 Aug 2020, at 18:24, Ashutosh Sharma > > wrote: > > > > Please find the updated patch with the following new changes: > > Do you have plans to support pg_surgery as an external extension? For example > for earlier versions of Postgres and for new features, like amcheck_next is > maintained. > ISTM that I'll have to use something like that tomorrow and I'm in doubt - > should I resurrect our pg_dirty_hands or try your new pg_surgery... > AFAICS, we don't have any plans to support pg_surgery as an external extension as of now. Based on the discussion that has happened earlier in this thread, I think we might also back-patch this contrib module. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com

Re: recovering from "found xmin ... from before relfrozenxid ..."
Hi Masahiko-san, Thank you for the review. Please check my comments inline below: On Tue, Aug 25, 2020 at 1:39 PM Masahiko Sawada wrote: > > On Fri, 21 Aug 2020 at 22:25, Ashutosh Sharma wrote: > > > > Hi Masahiko-san, > > > > Please find the updated patch with the following new changes: > > > > Thank you for updating the patch! > > > 1) It adds the code changes in heap_force_kill function to clear an > > all-visible bit on the visibility map corresponding to the page that > > is marked all-visible. Along the way it also clears PD_ALL_VISIBLE > > flag on the page header. > > I think we need to clear all visibility map bits by using > VISIBILITYMAP_VALID_BITS. Otherwise, the page has all-frozen bit but > not all-visible bit, which is not a valid state. > Yeah, makes sense, I will do that change in the next version of patch. > > > > 2) It adds the code changes in heap_force_freeze function to reset the > > ctid value in a tuple header if it is corrupted. > > > > 3) It adds several notes and examples in the documentation stating > > when and how we need to use the functions provided by this module. > > > > Please have a look and let me know for any other concern. > > > > And here are small comments on the heap_surgery.c: > > + /* > +* Get the offset numbers from the tids belonging to one particular > page > +* and process them one by one. > +*/ > + blkno = tids_same_page_fetch_offnums(tids, ntids, &next_start_ptr, > +offnos); > + > + /* Calculate the number of offsets stored in offnos array. */ > + noffs = next_start_ptr - curr_start_ptr; > + > + /* > +* Update the current start pointer so that next time when > +* tids_same_page_fetch_offnums() is called, we can calculate the > number > +* of offsets present in the offnos array. > +*/ > + curr_start_ptr = next_start_ptr; > + > + /* Check whether the block number is valid. */ > + if (blkno >= nblocks) > + { > + ereport(NOTICE, > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > +errmsg("skipping block %u for relation \"%s\" > because the block number is out of range", > + blkno, RelationGetRelationName(rel; > + continue; > + } > + > + CHECK_FOR_INTERRUPTS(); > > I guess it would be better to call CHECK_FOR_INTERRUPTS() at the top > of the do loop for safety. I think it's unlikely to happen but the > user might mistakenly specify a lot of wrong block numbers. > Okay, np, will shift it to top of the do loop. > > + offnos = (OffsetNumber *) palloc(ntids * sizeof(OffsetNumber)); > + noffs = curr_start_ptr = next_start_ptr = 0; > + nblocks = RelationGetNumberOfBlocks(rel); > + > + do > + { > > (snip) > > + > + /* > +* Get the offset numbers from the tids belonging to one particular > page > +* and process them one by one. > +*/ > + blkno = tids_same_page_fetch_offnums(tids, ntids, &next_start_ptr, > +offnos); > + > + /* Calculate the number of offsets stored in offnos array. */ > + noffs = next_start_ptr - curr_start_ptr; > + > > (snip) > > + /* No ereport(ERROR) from here until all the changes are logged. */ > + START_CRIT_SECTION(); > + > + for (i = 0; i < noffs; i++) > > You copy all offset numbers belonging to the same page to palloc'd > array, offnos, and iterate it while processing the tuples. I might be > missing something but I think we can do that without allocating the > space for offset numbers. Is there any reason for this? I guess we can > do that by just iterating the sorted tids array. > Hmmm.. okay, I see your point. 
I think probably what you are trying to suggest here is to make use of the current and next start pointers to get the tids belonging to the same page and process them one by one instead of fetching the offset numbers of all tids belonging to one page into the offnos array and then iterate through the offnos array. I think that is probably possible and I will try to do that in the next version of patch. If there is something else that you have in your mind, please let me know. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
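A minimal sketch of the allocation-free approach being discussed (variable names are illustrative and not from the patch): since the TID array is sorted, each block's TIDs form a contiguous run that can be processed directly, without copying offsets into a separate offnos array:

  int     i = 0;

  /* tids[] is assumed to be sorted by block number (and offset) */
  while (i < ntids)
  {
      BlockNumber blkno = ItemPointerGetBlockNumber(&tids[i]);
      int         j = i;

      while (j < ntids && ItemPointerGetBlockNumber(&tids[j]) == blkno)
          j++;

      /* process tids[i .. j-1], all of which fall on block blkno */
      i = j;
  }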
Re: recovering from "found xmin ... from before relfrozenxid ..."
On Tue, Aug 25, 2020 at 11:51 PM Robert Haas wrote: > > On Tue, Aug 25, 2020 at 8:17 AM Masahiko Sawada > wrote: > > + > > + > > + While performing surgery on a damaged relation, we must not be doing > > anything > > + else on that relation in parallel. This is to ensure that when we are > > + operating on a damaged tuple there is no other transaction trying to > > modify > > + that tuple. > > + > > + > > > > If we prefer to avoid concurrent operations on the target relation why > > don't we use AccessExclusiveLock? > > I disagree with the content of the note. It's up to the user whether > to perform any concurrent operations on the target relation, and in > many cases it would be fine to do so. Users who can afford to take the > table off-line to repair the problem don't really need this tool in > the first place. > The only reason I added this note was to ensure that we do not revive the tuple that is deleted but not yet vacuumed. There is one corner-case scenario as reported by you in - [1] where you have explained a scenario under which vacuum can report "found xmin ... from before relfrozenxid ..." sort of error for the deleted tuples. And as per the explanation provided there, it can happen when there are multiple transactions operating on the same tuple. However, I think we can take care of this scenario by doing some code changes in heap_force_freeze to identify the deleted tuples and maybe skip such tuples. So, yeah, I will do the code changes for handling this and remove the note added in the documentation. Thank you Robert and Masahiko-san for pointing this out. [1] - https://www.postgresql.org/message-id/CA%2BTgmobfJ8CkabKJZ-1FGfvbSz%2Bb8bBX807Y6hHEtVfzVe%2Bg6A%40mail.gmail.com -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
Re: recovering from "found xmin ... from before relfrozenxid ..."
Thanks for the review. Please find my comments inline below: On Tue, Aug 25, 2020 at 5:47 PM Masahiko Sawada wrote: > > Let me share other comments on the latest version patch: > > Some words need to be tagged. For instance, I found the following words: > > VACUUM > DISABLE_PAGE_SKIPPING > HEAP_XMIN_FROZEN > HEAP_XMAX_INVALID > Okay, done. > --- > +test=# select ctid from t1 where xmin = 507; > + ctid > +--- > + (0,3) > +(1 row) > + > +test=# select heap_force_freeze('t1'::regclass, ARRAY['(0, 3)']::tid[]); > +-[ RECORD 1 ]-+- > +heap_force_freeze | > > I think it's better to use a consistent output format. The former uses > the normal format whereas the latter uses the expanded format. > Yep, makes sense, done. > --- > + > + > + While performing surgery on a damaged relation, we must not be doing > anything > + else on that relation in parallel. This is to ensure that when we are > + operating on a damaged tuple there is no other transaction trying to modify > + that tuple. > + > + > > If we prefer to avoid concurrent operations on the target relation why > don't we use AccessExclusiveLock? > Removed this note from the documentation and added a note saying: "The user needs to ensure that they do not operate pg_force_freeze function on a deleted tuple because it may revive the deleted tuple." > --- > +CREATE FUNCTION heap_force_kill(reloid regclass, tids tid[]) > +RETURNS VOID > +AS 'MODULE_PATHNAME', 'heap_force_kill' > +LANGUAGE C STRICT; > > +CREATE FUNCTION heap_force_freeze(reloid regclass, tids tid[]) > +RETURNS VOID > +AS 'MODULE_PATHNAME', 'heap_force_freeze' > +LANGUAGE C STRICT; > > I think these functions should be PARALLEL UNSAFE. > By default the functions are marked PARALLEL UNSAFE, so I think there is nothing to do here. Attached patch with above changes. This patch also takes care of all the other review comments from - [1]. [1] - https://www.postgresql.org/message-id/CA%2Bfd4k6%2BJWq2MfQt5b7fSJ2wMvCes9TRfbDhVO_fQP9B8JJRAA%40mail.gmail.com -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com From a8067f03cf5ea900790a96ea0059ae080a164ed6 Mon Sep 17 00:00:00 2001 From: ashu Date: Wed, 26 Aug 2020 15:57:59 +0530 Subject: [PATCH] Add contrib/pg_surgery to perform surgery on a damaged heap table. This contrib module basically adds a couple of functions named heap_force_kill and heap_force_freeze that can be used in the scripts to perform surgery on the damaged heap tables. Ashutosh Sharma. 
--- contrib/Makefile | 1 + contrib/pg_surgery/Makefile| 23 ++ contrib/pg_surgery/expected/pg_surgery.out | 161 contrib/pg_surgery/heap_surgery.c | 404 + contrib/pg_surgery/pg_surgery--1.0.sql | 18 ++ contrib/pg_surgery/pg_surgery.control | 5 + contrib/pg_surgery/sql/pg_surgery.sql | 89 +++ doc/src/sgml/contrib.sgml | 1 + doc/src/sgml/filelist.sgml | 1 + doc/src/sgml/pgsurgery.sgml| 144 ++ 10 files changed, 847 insertions(+) create mode 100644 contrib/pg_surgery/Makefile create mode 100644 contrib/pg_surgery/expected/pg_surgery.out create mode 100644 contrib/pg_surgery/heap_surgery.c create mode 100644 contrib/pg_surgery/pg_surgery--1.0.sql create mode 100644 contrib/pg_surgery/pg_surgery.control create mode 100644 contrib/pg_surgery/sql/pg_surgery.sql create mode 100644 doc/src/sgml/pgsurgery.sgml diff --git a/contrib/Makefile b/contrib/Makefile index 1846d41..c8d2a16 100644 --- a/contrib/Makefile +++ b/contrib/Makefile @@ -34,6 +34,7 @@ SUBDIRS = \ pg_prewarm \ pg_standby \ pg_stat_statements \ + pg_surgery \ pg_trgm \ pgcrypto \ pgrowlocks \ diff --git a/contrib/pg_surgery/Makefile b/contrib/pg_surgery/Makefile new file mode 100644 index 000..ecf2e20 --- /dev/null +++ b/contrib/pg_surgery/Makefile @@ -0,0 +1,23 @@ +# contrib/pg_surgery/Makefile + +MODULE_big = pg_surgery +OBJS = \ + $(WIN32RES) \ + heap_surgery.o + +EXTENSION = pg_surgery +DATA = pg_surgery--1.0.sql +PGFILEDESC = "pg_surgery - perform surgery on a damaged relation" + +REGRESS = pg_surgery + +ifdef USE_PGXS +PG_CONFIG = pg_config +PGXS := $(shell $(PG_CONFIG) --pgxs) +include $(PGXS) +else +subdir = contrib/pg_surgery +top_builddir = ../.. +include $(top_builddir)/src/Makefile.global +include $(top_srcdir)/contrib/contrib-global.mk +endif diff --git a/contrib/pg_surgery/expected/pg_surgery.out b/contrib/pg_surgery/expected/pg_surgery.out new file mode 100644 index 000..9858de2 --- /dev/null +++ b/contrib/pg_surgery/expected/pg_surgery.out @@ -0,0 +1,161 @@ +create extension pg_surgery; +-- +-- check that using heap_fo
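On the PARALLEL UNSAFE point above: CREATE FUNCTION already defaults to PARALLEL UNSAFE, so the existing definitions behave as requested; spelling it out explicitly (shown only for illustration) would look like:

  CREATE FUNCTION heap_force_kill(reloid regclass, tids tid[])
  RETURNS VOID
  AS 'MODULE_PATHNAME', 'heap_force_kill'
  LANGUAGE C STRICT PARALLEL UNSAFE;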
Re: recovering from "found xmin ... from before relfrozenxid ..."
On Wed, Aug 26, 2020 at 9:19 PM Robert Haas wrote: > > On Wed, Aug 26, 2020 at 7:36 AM Ashutosh Sharma wrote: > > Removed this note from the documentation and added a note saying: "The > > user needs to ensure that they do not operate pg_force_freeze function > > on a deleted tuple because it may revive the deleted tuple." > > I do not agree with that note, either. I believe that trying to tell > people what things specifically they should do or avoid doing with the > tool is the wrong approach. Instead, the thrust of the message should > be to tell people that if you use this, it may corrupt your database, > and that's your problem. The difficulty with telling people what > specifically they ought to avoid doing is that experts will be annoyed > to be told that something is not safe when they know that it is fine, > and non-experts will think that some uses are safer than they really > are. > Okay, point noted. Thanks, -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
Re: Help needed configuring postgreSQL with xml support
In addition to what Thomas said, I would also recommend referring to the description of the --with-libxml command line option provided on the postgres installation-procedure page - [1]. [1] - https://www.postgresql.org/docs/12/install-procedure.html -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com On Thu, Aug 27, 2020 at 1:56 PM Thomas Munro wrote: > > On Thu, Aug 27, 2020 at 8:17 PM Khanna, Sachin 000 > wrote: > > I am getting the following error in configuration.log of the installation. Please > > help > > You didn't mention what operating system this is, but, for example, if > it's Debian, Ubuntu or similar you might need to install libxml2-dev > and pkg-config for --with-libxml to work. > >
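For example, on a Debian or Ubuntu system the packages Thomas mentions would typically be installed with:

  sudo apt-get install libxml2-dev pkg-config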
Should we replace the checks for access method OID with handler OID?
Hi All, While reviewing the patch for the pg_surgery contrib module - [1], Asim Praveen suggested that it would be better to replace the check for the access method OID with a check for the handler OID. Otherwise, if someone creates a new AM using an AM handler that already exists, e.g. "heap_tableam_handler", and then uses this new AM to create a table, one cannot perform surgery on such tables, because the check on the access method OID rejects the new AM (we only allow the heap AM). For example, if we do this:

create access method myam type table handler heap_tableam_handler;
create table mytable (…) using myam;

and keep the access method OID check, we won't be able to perform surgery on mytable created above, because its access method is not heap even though its on-disk structure actually is heap. This problem goes away if the check on the access method OID is replaced with a check on the handler OID. I liked this suggestion from Asim and did the changes accordingly. However, while browsing the code of other contrib modules, I found similar checks in some of them, such as pgstattuple, pageinspect and pgrowlocks. So, just wondering if we should be doing similar changes in these contrib modules as well. Thoughts? [1] - https://www.postgresql.org/message-id/1D56CEFD-E195-4E6B-B870-3383E3E8C65E%40vmware.com -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
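For reference, the two flavours of the check being compared are roughly the following (a simplified sketch based on the snippets quoted earlier in the thread):

  /* access method OID check: rejects "myam" created above */
  if (rel->rd_rel->relam != HEAP_TABLE_AM_OID)
      ereport(ERROR,
              (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
               errmsg("only heap AM is supported")));

  /* handler OID check: accepts any AM backed by heap_tableam_handler */
  if (rel->rd_amhandler != HEAP_TABLE_AM_HANDLER_OID)
      ereport(ERROR,
              (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
               errmsg("only the relation using heap_tableam_handler is supported")));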
Re: Please help for error ( file is required for XML support )
On Thu, Aug 27, 2020 at 6:30 PM Sachin Khanna wrote: > > Hi > > > > I am new to postgreSQL , I am trying to install the same with XML support but it is giving below error on configuration. > > > > ./configure --prefix=/opt/postgresql-12.3/pqsql --with-libxml --datadir=/home/postgres/ --with-includes=/usr/lib64/ > It seems like your include path is pointing to "/usr/lib64/", which contains the libraries rather than the header files. To add library directories you should be using the --with-libraries option instead. Also, I feel that these types of questions are not for the hackers mailing list. These are configuration-related issues and should probably be raised on the pgsql-general mailing list (pgsql-general.postgresql.org). -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
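To make that concrete, a corrected invocation might look like the following (paths are only illustrative; whether --with-includes is needed at all depends on where the libxml2 headers live on the system):

  ./configure --prefix=/opt/postgresql-12.3/pqsql --with-libxml \
      --datadir=/home/postgres/ --with-libraries=/usr/lib64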
Re: recovering from "found xmin ... from before relfrozenxid ..."
On Fri, Aug 28, 2020 at 1:44 AM Robert Haas wrote: > > On Wed, Aug 26, 2020 at 10:26 PM Ashutosh Sharma > wrote: > > Okay, point noted. > > I spent some time today working on this patch. I'm fairly happy with > it now and intend to commit it if nobody sees a big problem with that. > Per discussion, I do not intend to back-patch at this time. The two > most significant changes I made to your version are: > > 1. I changed things around to avoid using any form of ereport() in a > critical section. I'm not actually sure whether it is project policy > to avoid ereport(NOTICE, ...) or similar in a critical section, but it > seems prudent, because if anything fails in a critical section, we > will PANIC, so doing fewer things there seems prudent. > > 2. I changed the code so that it does not try to follow redirected > line pointers; instead, it skips them with an appropriate message, as > we were already doing for dead and unused line pointers. I think the > way you had it coded might've been my suggestion originally, but the > more I looked into it the less I liked it. One problem is that it > didn't match the docs. A second is that following a corrupted line > pointer might index off the end of the line pointer array, and while > that probably shouldn't happen, we are talking about corruption > recovery here. Then I realized that, as you coded it, if the line > pointer was redirected to a line pointer that is in turn dead (or > unused, if there's corruption) the user would get a NOTICE complaining > about a TID they hadn't specified, which seems like it would be very > confusing. I thought about trying to fix all that stuff, but it just > didn't seem worth it, because I can't think of a good reason to pass > this function the TID of a redirected line pointer in the first place. > If you're doing surgery, you should probably specify the exact thing > upon which you want to operate, not some other thing that points to > it. > > Here is a list of other changes I made: > > * Added a .gitignore file. > * Renamed the regression test file from pg_surgery to heap_surgery to > match the name of the single C source file we currently have. > * Capitalized TID in a few places. > * Ran pgindent. > * Adjusted various comments. > * Removed the check for an empty TID array. I don't see any reason why > this should be an error case and I don't see much precedent for having > such a check. > * Fixed the code to work properly with that change and added a test case. > * Added a check that the array is not multi-dimensional. > * Put the AM type check before the relkind check, following existing > precedent. > * Adjusted the check to use the AM OID rather than the handler OID, > following existing precedent. Fixed the message wording accordingly. > * Changed the documentation wording to say less about specific > recovery procedures and focus more on the general idea that this is > dangerous. > * Removed all but one of the test cases that checked what happens if > you use this on a non-heap; three tests for basically the same thing > seemed excessive. > * Added some additional tests to improve code coverage. There are now > only a handful of lines not covered. > * Reorganized the test cases somewhat. > > New patch attached. > Thank you Robert for the patch. I've looked into the changes you've made to the v8 patch and they all look good to me. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
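As an illustration of the multi-dimensional-array check mentioned in the list above, the usual pattern for such a check (a sketch with illustrative variable names, not the committed code) is:

  ArrayType  *ta = PG_GETARG_ARRAYTYPE_P(1);

  if (ARR_NDIM(ta) > 1)
      ereport(ERROR,
              (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
               errmsg("argument must be empty or one-dimensional array")));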
Re: recovering from "found xmin ... from before relfrozenxid ..."
Hi Mark, Thanks for the review. Please find my comments inline below: > HeapTupleForceOption should be added to src/tools/pgindent/typedefs.list. > This has been fixed in the v9 patch. > > The tidcmp function can be removed, and ItemPointerCompare used directly by > qsort as: > > - qsort((void*) tids, ntids, sizeof(ItemPointerData), tidcmp); > + qsort((void*) tids, ntids, sizeof(ItemPointerData), > + (int (*) (const void *, const void *)) > ItemPointerCompare); > Will have a look into this. > > sanity_check_tid_array() has two error messages: > > "array must not contain nulls" > "empty tid array" > > I would change the first to say "tid array must not contain nulls", as "tid" > is the name of the parameter being checked. It is also more consistent with > the second error message, but that doesn't matter to me so much, as I'd argue > for removing the second check. I don't see why an empty array should draw an > error. It seems more reasonable to just return early since there is no work > to do. Consider if somebody uses a function that returns the tids for all > corrupt tuples in a table, aggregates that into an array, and hands that to > this function. It doesn't seem like an error for that aggregated array to > have zero elements in it. I suppose you could emit a NOTICE in this case? > This comment is no more valid as per the changes done in the v9 patch. > > Upthread: > > On Aug 13, 2020, at 12:03 PM, Robert Haas wrote: > > > >> This looks like a very good suggestion to me. I will do this change in > >> the next version. Just wondering if we should be doing similar changes > >> in other contrib modules (like pgrowlocks, pageinspect and > >> pgstattuple) as well? > > > > It seems like it should be consistent, but I'm not sure the proposed > > change is really an improvement. > > You have used Asim's proposed check: > > if (rel->rd_amhandler != HEAP_TABLE_AM_HANDLER_OID) > ereport(ERROR, > (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > errmsg("only the relation using heap_tableam_handler is > supported"))); > > which Robert seems unenthusiastic about, but if you are going that direction, > I think at least the language of the error message should be changed. I > recommend something like: > > if (rel->rd_amhandler != HEAP_TABLE_AM_HANDLER_OID) > ereport(ERROR, > (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > -errmsg("only the relation using > heap_tableam_handler is supported"))); > +errmsg("\"%s\" does not use a heap access > method", > + > RelationGetRelationName(rel; > > where "a heap access method" could also be written as "a heap table type > access method", "a heap table compatible access method", and so forth. There > doesn't seem to be enough precedent to dictate exactly how to phrase this, or > perhaps I'm just not looking in the right place. > Same here. This also looks invalid as per the changes done in v9 patch. > > The header comment for function find_tids_one_page should state the > requirement that the tids array must be sorted. > Okay, will add a comment for this. > > The heap_force_common function contains multiple ereport(NOTICE,...) within a > critical section. I don't think that is normal practice. Can those reports > be buffered until after the critical section is exited? You also have a > CHECK_FOR_INTERRUPTS within the critical section. > This has been fixed in the v9 patch. > [1] https://commitfest.postgresql.org/29/2700/ > — Thanks for adding a commitfest entry for this. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
Re: Should we replace the checks for access method OID with handler OID?
On Thu, Aug 27, 2020 at 9:21 PM Robert Haas wrote: > > On Thu, Aug 27, 2020 at 5:37 AM Ashutosh Sharma wrote: > > While reviewing the patch for pg_surgery contrib module - [1], Asim > > Praveen suggested that it would be better to replace the check for > > access method OID with handler OID. Otherwise, if someone creates a > > new AM using the AM handler that is originally supported for e.g. > > "heap_tableam_handler" and if this new AM is used to create a table, > > then one cannot perform surgery on such tables because we have checks > > for access method OID which would reject this new AM as we only allow > > heap AM. For e.g. if we do this: > > > > create access method myam type table handler heap_tableam_handler; > > create table mytable (…) using myam; > > > > And use an access method OID check, we won't be able to perform > > surgery on mytable created above because it isn't the heap table > > although its table structure is actually heap. > > The only reason I can see why it would make sense to do this sort of > thing is if you wanted to create a new AM for testing purposes which > behaves like some existing AM but is technically a different AM. And > if you did that, then I guess the change you are proposing would make > it behave more like it's the same thing after all, which seems like it > might be missing the point. > Okay, understood. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
Re: recovering from "found xmin ... from before relfrozenxid ..."
> > The tidcmp function can be removed, and ItemPointerCompare used directly by qsort as: > > > > - qsort((void*) tids, ntids, sizeof(ItemPointerData), tidcmp); > > + qsort((void*) tids, ntids, sizeof(ItemPointerData), > > + (int (*) (const void *, const void *)) ItemPointerCompare); > > > > Will have a look into this. > We can certainly do this way, but I would still prefer having a comparator function (tidcmp) here for the reasons that it makes the code look a bit cleaner, it also makes us more consistent with the way the comparator function argument is being passed to qsort at several other places in postgres which kinda of increases the code readability and simplicity. For e.g. there is a comparator function for gin that does the same thing as tidcmp is doing here. See below: static int qsortCompareItemPointers(const void *a, const void *b) { int res = ginCompareItemPointers((ItemPointer) a, (ItemPointer) b); /* Assert that there are no equal item pointers being sorted */ Assert(res != 0); return res; } In this case as well, it could have been done the way you are suggesting, but it seems like writing a small comparator function with the prototype that qsort accepts looked like a better option. Considering this, I am just leaving this as-it-is. Please let me know if you feel the other way round. > > The header comment for function find_tids_one_page should state the requirement that the tids array must be sorted. > > > > Okay, will add a comment for this. > Added a comment for this in the attached patch. Please have a look into the attached patch for the changes and let me know for any other concerns. Thank you. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com From ee49cfbfc153bbf6212d90ac81a2f9d13fa0eef7 Mon Sep 17 00:00:00 2001 From: ashu Date: Fri, 28 Aug 2020 15:00:41 +0530 Subject: [PATCH] Add contrib/pg_surgery to perform surgery on a damaged relation. This contrib module provides couple of functions named heap_force_kill and heap_force_freeze that can be used to perform surgery on a damaged heap table. Author: Ashutosh Sharma and Robert Haas. 
Reviewed by: Sawada Masahiko, Andrey Borodin, Beena Emerson, Mark Dilger and Robert Haas Tested by: Rajkumar Raghuwanshi Discussion: https://www.postgresql.org/message-id/CA%2BTgmoZW1fsU-QUNCRUQMGUygBDPVeOTLCqRdQZch%3DEYZnctSA%40mail.gmail.com --- contrib/Makefile | 1 + contrib/pg_surgery/.gitignore| 4 + contrib/pg_surgery/Makefile | 23 ++ contrib/pg_surgery/expected/heap_surgery.out | 180 +++ contrib/pg_surgery/heap_surgery.c| 428 +++ contrib/pg_surgery/pg_surgery--1.0.sql | 18 ++ contrib/pg_surgery/pg_surgery.control| 5 + contrib/pg_surgery/sql/heap_surgery.sql | 91 ++ doc/src/sgml/contrib.sgml| 1 + doc/src/sgml/filelist.sgml | 1 + doc/src/sgml/pgsurgery.sgml | 107 +++ src/tools/pgindent/typedefs.list | 1 + 12 files changed, 860 insertions(+) create mode 100644 contrib/pg_surgery/.gitignore create mode 100644 contrib/pg_surgery/Makefile create mode 100644 contrib/pg_surgery/expected/heap_surgery.out create mode 100644 contrib/pg_surgery/heap_surgery.c create mode 100644 contrib/pg_surgery/pg_surgery--1.0.sql create mode 100644 contrib/pg_surgery/pg_surgery.control create mode 100644 contrib/pg_surgery/sql/heap_surgery.sql create mode 100644 doc/src/sgml/pgsurgery.sgml diff --git a/contrib/Makefile b/contrib/Makefile index 1846d41..c8d2a16 100644 --- a/contrib/Makefile +++ b/contrib/Makefile @@ -34,6 +34,7 @@ SUBDIRS = \ pg_prewarm \ pg_standby \ pg_stat_statements \ + pg_surgery \ pg_trgm \ pgcrypto \ pgrowlocks \ diff --git a/contrib/pg_surgery/.gitignore b/contrib/pg_surgery/.gitignore new file mode 100644 index 000..5dcb3ff --- /dev/null +++ b/contrib/pg_surgery/.gitignore @@ -0,0 +1,4 @@ +# Generated subdirectories +/log/ +/results/ +/tmp_check/ diff --git a/contrib/pg_surgery/Makefile b/contrib/pg_surgery/Makefile new file mode 100644 index 000..a66776c --- /dev/null +++ b/contrib/pg_surgery/Makefile @@ -0,0 +1,23 @@ +# contrib/pg_surgery/Makefile + +MODULE_big = pg_surgery +OBJS = \ + $(WIN32RES) \ + heap_surgery.o + +EXTENSION = pg_surgery +DATA = pg_surgery--1.0.sql +PGFILEDESC = "pg_surgery - perform surgery on a damaged relation" + +REGRESS = heap_surgery + +ifdef USE_PGXS +PG_CONFIG = pg_config +PGXS := $(shell $(PG_CONFIG) --pgxs) +include $(PGXS) +else +subdir = contrib/pg_surgery +top_builddir = ../.. +include $(top_builddir)/src/Makefile.global +include $(top_srcdir)/contrib/contrib-global.mk +endif diff --git a/contrib/pg_surgery/expected/heap_surgery.out b/contrib/pg_surgery/expected/heap
Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display
On Thu, Nov 11, 2021 at 6:07 PM Alvaro Herrera wrote: > On 2021-Nov-10, Bossart, Nathan wrote: > > > On 11/10/21, 9:43 AM, "Bharath Rupireddy" < > bharath.rupireddyforpostg...@gmail.com> wrote: > > > As discussed in [1], isn't it a better idea to add some of activity > > > messages [2] such as recovery, archive, backup, streaming etc. to > > > server logs at LOG level? > > > I think this would make the logs far too noisy for many servers. For > > archiving alone, this could cause tens of thousands more log messages > > per hour on a busy system. I think you can already see such > > information at a debug level, anyway. > > Yeah. If we had some sort of ring buffer in which to store these > messages, the user could examine them through a view; they would still > be accessible in a running server, but they would not be written to the > server log. > That's a good idea. How about also adding some GUC(s) to control the logging of archive- and recovery-related messages, just like we have for checkpoints, autovacuum etc.? Maybe something like log_archive, log_recovery etc. -- With Regards, Ashutosh Sharma.
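[Editorial sketch only: the structure below does not exist in any patch or in core, and every name in it (ActivityMsgRing, activity_ring_append, the sizes) is invented. It just illustrates the kind of fixed-size ring buffer Alvaro describes.]

#include "postgres.h"

#define ACTIVITY_MSG_LEN	256
#define ACTIVITY_RING_SLOTS	1024

typedef struct ActivityMsgRing
{
	uint64		appended;	/* total messages appended so far */
	char		msgs[ACTIVITY_RING_SLOTS][ACTIVITY_MSG_LEN];
} ActivityMsgRing;

/* append a message, overwriting the oldest slot once the buffer wraps */
static void
activity_ring_append(ActivityMsgRing *ring, const char *msg)
{
	int			slot = ring->appended % ACTIVITY_RING_SLOTS;

	strlcpy(ring->msgs[slot], msg, ACTIVITY_MSG_LEN);
	ring->appended++;
}

A real implementation would place the structure in shared memory, protect appends with a lock or atomics, and expose the stored messages through a set-returning function backing the view.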
Re: Zedstore - compressed in-core columnar storage
On Thu, Aug 29, 2019 at 5:39 PM Heikki Linnakangas wrote: > > On 29/08/2019 14:30, Ashutosh Sharma wrote: > > > > On Wed, Aug 28, 2019 at 5:30 AM Alexandra Wang > <mailto:lew...@pivotal.io>> wrote: > > > > You are correct that we currently go through each item in the leaf > > page that > > contains the given tid, specifically, the logic to retrieve all the > > attribute > > items inside a ZSAttStream is now moved to decode_attstream() in the > > latest > > code, and then in zsbt_attr_fetch() we again loop through each item we > > previously retrieved from decode_attstream() and look for the given > > tid. > > > > > > Okay. Any idea why this new way of storing attribute data as streams > > (lowerstream and upperstream) has been chosen just for the attributes > > but not for tids. Are only attribute blocks compressed but not the tids > > blocks? > > Right, only attribute blocks are currently compressed. Tid blocks need > to be modified when there are UPDATEs or DELETE, so I think having to > decompress and recompress them would be more costly. Also, there is no > user data on the TID tree, and the Simple-8b encoded codewords used to > represent the TIDs are already pretty compact. I'm not sure how much > gain you would get from passing it through a general purpose compressor. > > I could be wrong though. We could certainly try it out, and see how it > performs. > > > One > > optimization we can to is to tell decode_attstream() to stop > > decoding at the > > tid we are interested in. We can also apply other tricks to speed up the > > lookups in the page, for fixed length attribute, it is easy to do > > binary search > > instead of linear search, and for variable length attribute, we can > > probably > > try something that we didn't think of yet. > > > > > > I think we can probably ask decode_attstream() to stop once it has found > > the tid that we are searching for but then we only need to do that for > > Index Scans. > > I've been thinking that we should add a few "bookmarks" on long streams, > so that you could skip e.g. to the midpoint in a stream. It's a tradeoff > though; when you add more information for random access, it makes the > representation less compact. > > > Zedstore currently implement update as delete+insert, hence the old > > tid is not > > reused. We don't store the tuple in our UNDO log, and we only store the > > transaction information in the UNDO log. Reusing the tid of the old > > tuple means > > putting the old tuple in the UNDO log, which we have not implemented > > yet. > > > > OKay, so that means performing update on a non-key attribute would also > > require changes in the index table. In short, HOT update is currently > > not possible with zedstore table. Am I right? > > That's right. There's a lot of potential gain for doing HOT updates. For > example, if you UPDATE one column on every row on a table, ideally you > would only modify the attribute tree containing that column. But that > hasn't been implemented. Thanks Heikki for your reply. After quite some time today I got chance to look back into the code. I could see that you have changed the tuple insertion and update mechanism a bit. As per the latest changes all the tuples being inserted/updated in a transaction are spooled into a hash table and then flushed at the time of transaction commit and probably due to this change, I could see that the server crashes when trying to perform UPDATE operation on a zedstore table having 10 lacs record. 
See below example, create table t1(a int, b int) using zedstore; insert into t1 select i, i+10 from generate_series(1, 100) i; postgres=# update t1 set b = 200; server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Failed. Above update statement crashed due to some extensive memory leak. Further, the UPDATE operation on zedstore table is very slow. I think that's because in case of zedstore table we have to update all the btree data structures even if one column is updated and that really sucks. Please let me know if there is some other reason for it. I also found some typos when going through the writeup in zedstore_internal.h and thought of correcting those. Attached is the patch with the changes. Thanks, -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com diff --git a/src/include/access/zedstore_internal.h b/src/inc
Re: Support for CALL statement in ecpg
On Tue, Sep 17, 2019 at 1:06 PM Peter Eisentraut <peter.eisentr...@2ndquadrant.com> wrote: > I don't find this patch in any commit fest. Seems like a good addition. > Thanks for the consideration. Will add an entry for it in the commit fest. -- With Regards, Ashutosh Sharma EnterpriseDB: http://www.enterprisedb.com
Re: Zedstore - compressed in-core columnar storage
On Thu, Sep 19, 2019 at 8:10 AM Alexandra Wang wrote: > > On Tue, Sep 17, 2019 at 4:15 AM Ashutosh Sharma wrote: >> >> create table t1(a int, b int) using zedstore; >> insert into t1 select i, i+10 from generate_series(1, 100) i; >> postgres=# update t1 set b = 200; >> server closed the connection unexpectedly >> This probably means the server terminated abnormally >> before or while processing the request. >> The connection to the server was lost. Attempting reset: Failed. >> >> Above update statement crashed due to some extensive memory leak. > > > Thank you for reporting! We have located the memory leak and also > noticed some other memory related bugs. We are working on the fixes > please stay tuned! > Cool. As I suspected earlier, it's basically "ZedstoreAMTupleBuffers" context that is completely exhausting the memory and it is being used to spool the tuples. >> >> I also found some typos when going through the writeup in >> zedstore_internal.h and thought of correcting those. Attached is the >> patch with the changes. > > > Applied. Thank you! Thanks for that. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
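[Editorial aside, not part of the thread: the usual way a suspect memory context such as "ZedstoreAMTupleBuffers" is confirmed is by dumping memory context statistics while the UPDATE is running, either with a temporary call like the sketch below or by calling the same function from gdb on the running backend.]

#include "postgres.h"
#include "utils/memutils.h"

/* print per-context memory usage (including child contexts) to stderr/log */
static void
dump_memory_context_stats(void)
{
	MemoryContextStats(TopMemoryContext);
}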
Re: Zedstore - compressed in-core columnar storage
On Thu, Sep 19, 2019 at 11:35 AM Ashutosh Sharma wrote: > > On Thu, Sep 19, 2019 at 8:10 AM Alexandra Wang wrote: > > > > On Tue, Sep 17, 2019 at 4:15 AM Ashutosh Sharma > > wrote: > >> > >> create table t1(a int, b int) using zedstore; > >> insert into t1 select i, i+10 from generate_series(1, 100) i; > >> postgres=# update t1 set b = 200; > >> server closed the connection unexpectedly > >> This probably means the server terminated abnormally > >> before or while processing the request. > >> The connection to the server was lost. Attempting reset: Failed. > >> > >> Above update statement crashed due to some extensive memory leak. > > > > > > Thank you for reporting! We have located the memory leak and also > > noticed some other memory related bugs. We are working on the fixes > > please stay tuned! > > > > Cool. As I suspected earlier, it's basically "ZedstoreAMTupleBuffers" > context that is completely exhausting the memory and it is being used > to spool the tuples. > Some more updates on top of this: When doing an update operation, for each tuple being modified, *tuplebuffers_insert()* says that there is no entry for the relation being modified in the hash table, although it was already added when the first tuple in the table was updated. Why is that so? I mean, if I have added an entry in the hash table *tuplebuffers* for, let's say, table t1, then why should a subsequent call to tuplebuffers_insert() say that there is no entry for table t1 in *tuplebuffers*? Shouldn't that only happen once you have flushed all the tuples in tupbuffer->attbuffers? Because of this, for each tuple, tupbuffer->attbuffers is allocated, resulting in a lot of memory consumption. OTOH, if an insert is performed on the same table, tuplebuffers_insert() says that there is no entry for table t1 in the hash only for the first tuple, but from the second tuple onwards that doesn't happen. I think this is why the memory leak is happening in the case of the update operation. Please let me know if I'm missing something here; I didn't get a chance to spend much time on this. Thank you. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
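[To make the expectation concrete -- an illustrative sketch only, not the actual zedstore code; TupleBufferEntry and AttrBuffer are invented names -- the usual dynahash lookup-or-create pattern allocates the per-attribute buffers only when the entry is first created, so the second and later calls for the same relation should find the existing entry.]

/* uses the dynahash API from utils/hsearch.h */
static TupleBufferEntry *
tuplebuffers_lookup(HTAB *tuplebuffers, Oid relid, int natts)
{
	TupleBufferEntry *entry;
	bool		found;

	entry = (TupleBufferEntry *) hash_search(tuplebuffers, &relid,
											 HASH_ENTER, &found);
	if (!found)
	{
		/* first tuple for this relation: set up per-attribute buffers once */
		entry->attbuffers = palloc0(natts * sizeof(AttrBuffer));
	}
	return entry;
}

If the entry (or the whole hash table) is thrown away after every tuple, the !found branch runs for every row, which would match the per-tuple allocation pattern described above.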
Re: Zedstore - compressed in-core columnar storage
On Fri, Sep 20, 2019 at 5:48 AM Taylor Vesely wrote: > > > When doing update operation, for each tuple being modified, > > *tuplebuffers_insert()* says that there is no entry for the relation > > being modified in the hash table although it was already added when > > the first tuple in the table was updated. Why is it so? > > Currently, when doing an update, it will actually flush the tuple > buffers every time we update a tuple. As a result, we only ever spool > up one tuple at a time. This is a good place to put in an optimization > like was implemented for insert, but I haven't gotten around to > looking into that yet. > Okay. So, that's the root cause: spooling just one tuple at a time where at least 60 tuples could be spooled, and then never freeing it, is the reason for this extensive memory leak. > The memory leak is actually happening because it isn't freeing the > attbuffers after flushing. Alexandra Wang and I have a working > branch[1] where we tried to plug the leak by freeing the attbuffers, > but it has exposed an issue with triggers that I need to understand > before I push the fix into the main zedstore branch. > > I don't like our solution of freeing the buffers either, because they > could easily be reused. I'm going to take a stab at making that better > before merging in the fix. > That's right; why do we need to free the memory after flushing the data in attbuffers? We can simply reuse it for the next set of data to be updated. > [1] https://github.com/l-wang/postgres-1/tree/zedstore-fix-memory-issues -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
Re: Usage of the system truststore for SSL certificate validation
This certainly looks like a good addition to me, and it could be implemented on both the client and server side. It is always good to have a common location where the certificates from the various CAs can be placed for validation. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com On Thu, Sep 19, 2019 at 8:24 PM Thomas Berger wrote: > > Hi, > > currently, libpq does SSL cerificate validation only against the defined > `PGSSLROOTCERT` file. > > Is there any specific reason, why the system truststore ( at least under > unixoid systems) is not considered for the validation? > > We would like to contribute a patch to allow certificate validation against > the system truststore. Are there any opinions against it? > > > A little bit background for this: > > Internally we sign the certificates for our systems with our own CA. The CA > root certificates and revocation lists are distributed via puppet and/or > packages on all of our internal systems. > > Validating the certificate against this CA requires to either override the > PGSSLROOTCERT location via the environment or provide a copy of the file for > each user that connects with libpq or libpq-like connectors. > > We would like to simplify this. > > > -- > Thomas Berger > > PostgreSQL DBA > Database Operations > > 1&1 Telecommunication SE | Ernst-Frey-Straße 10 | 76135 Karlsruhe | Germany > >
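[By way of illustration only -- this is not from any proposed patch, and init_cert_verification is an invented name -- the client-side behaviour being asked for could fall back to OpenSSL's default system CA locations when no root-certificate file is configured.]

#include <stdbool.h>
#include <openssl/ssl.h>

static bool
init_cert_verification(SSL_CTX *ctx, const char *sslrootcert)
{
	if (sslrootcert != NULL && sslrootcert[0] != '\0')
	{
		/* explicit root certificate file given: use it, as today */
		return SSL_CTX_load_verify_locations(ctx, sslrootcert, NULL) == 1;
	}

	/* otherwise trust the platform's default CA store */
	return SSL_CTX_set_default_verify_paths(ctx) == 1;
}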
Re: Zedstore - compressed in-core columnar storage
Hi Alexandra, On Tue, Sep 17, 2019 at 4:45 PM Ashutosh Sharma wrote: > > On Thu, Aug 29, 2019 at 5:39 PM Heikki Linnakangas wrote: > > > > On 29/08/2019 14:30, Ashutosh Sharma wrote: > > > > > > On Wed, Aug 28, 2019 at 5:30 AM Alexandra Wang wrote: > > Further, the UPDATE operation on zedstore table is very slow. I think > that's because in case of zedstore table we have to update all the > btree data structures even if one column is updated and that really > sucks. Please let me know if there is some other reason for it. > There was no answer for this in your previous reply. It seems like you missed it. As I said earlier, I tried performing the UPDATE operation with an optimised build and found that updating around 10 lakh (1 million) records in a zedstore table takes around 24k ms, whereas for a normal heap table it takes 2k ms. Is that because, in the case of a zedstore table, we have to update all the Btree data structures even if only one column is updated, or is there some other reason for it? If so, could you please let us know? FYI, I'm trying to update the table with just two columns. Further, in the latest code I'm getting this warning message when it is compiled using the -O2 optimisation flag. zedstore_tidpage.c: In function ‘zsbt_collect_dead_tids’: zedstore_tidpage.c:978:10: warning: ‘page’ may be used uninitialized in this function [-Wmaybe-uninitialized] opaque = ZSBtreePageGetOpaque(page); ^ Attached is the patch that fixes it. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com diff --git a/src/backend/access/zedstore/zedstore_tidpage.c b/src/backend/access/zedstore/zedstore_tidpage.c index 7730ef3..f590f79 100644 --- a/src/backend/access/zedstore/zedstore_tidpage.c +++ b/src/backend/access/zedstore/zedstore_tidpage.c @@ -956,9 +956,10 @@ zsbt_collect_dead_tids(Relation rel, zstid starttid, zstid *endtid, uint64 *num_ buf = zsbt_descend(rel, ZS_META_ATTRIBUTE_NUM, nexttid, 0, true); if (!BufferIsValid(buf)) return result; - page = BufferGetPage(buf); } + page = BufferGetPage(buf); + maxoff = PageGetMaxOffsetNumber(page); for (off = FirstOffsetNumber; off <= maxoff; off++) {
Re: Skip recovery/standby signal files in pg_basebackup
Hi David, On Sat, Sep 28, 2019 at 12:23 AM David Steele wrote: > > Hackers, > > Restoring these files could cause surprising behaviors so it seems best > to let the restore process create them when needed. > Could you please let us know what surprising behaviour you are talking about here when including the recovery/standby signal files in the pg_basebackup output? If including recovery.conf in the pg_basebackup output wasn't a problem earlier, then why should including recovery.signal/standby.signal be a problem? Your patch is just trying to skip the standby.signal or recovery.signal files when the base backup is taken either on a standby server or on a server where PITR is still in progress or paused. What would be the behaviour with your patch when the *-R* option is used with pg_basebackup to take a backup from a standby server? Won't it create a standby.signal file? > Patch is attached. > -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
Re: Zedstore - compressed in-core columnar storage
On Fri, Sep 27, 2019 at 3:09 PM Alexandra Wang wrote: > > Hi Ashutosh, > > Sorry I indeed missed your question, thanks for the reminder! > > On Wed, Sep 25, 2019 at 4:10 AM Ashutosh Sharma wrote: >> >> > Further, the UPDATE operation on zedstore table is very slow. I think >> > that's because in case of zedstore table we have to update all the >> > btree data structures even if one column is updated and that really >> > sucks. Please let me know if there is some other reason for it. >> > >> >> There was no answer for this in your previous reply. It seems like you >> missed it. As I said earlier, I tried performing UPDATE operation with >> optimised build and found that to update around 10 lacs record in >> zedstore table it takes around 24k ms whereas for normal heap table it >> takes 2k ms. Is that because in case of zedstore table we have to >> update all the Btree data structures even if one column is updated or >> there is some other reason for it. If yes, could you please let us >> know. FYI, I'm trying to update the table with just two columns. > > > Zedstore UPDATE operation currently fetches the old rows, updates the > undo pointers stored in the tid btree, and insert new rows into all > the attribute btrees with the new tids. So performance of updating one > column makes no difference from updating all the columns. That said, > the wider the table is, the longer it takes to update, regardless > updating one column or all the columns. > > However, since your test table only has two columns, and we also > tested the same on a one-column table and got similar results as > yours, there is definitely room for optimizations. Attached file > zedstore_update_flames_lz4_first_update.svg is the profiling results > for the update query on a one-column table with 1M records. It spent > most of the time in zedstoream_fetch_row() and zsbt_tid_update(). For > zedstoream_fetch_row(), Taylor and I had some interesting findings > which I'm going to talk about next, I haven't dived into > zsbt_tid_update() yet and need to think about it more. > > To understand what slows down zedstore UDPATE, Taylor and I did the > following test and profiling on a zedstore table with only one column. > > postgres=# create table onecol(a int) using zedstore; > postgres=# insert into onecol select i from generate_series(1, 100) i; > > -- Create view to count zedstore pages group by page types > postgres=# CREATE VIEW pg_zs_page_counts AS > SELECT > c.relnamespace::regnamespace, > c.oid, > c.relname, > pg_zs_page_type(c.oid, generate_series(0, c.relpages - 1)), > count(*) > FROM pg_am am > JOIN pg_class c ON (c.relam = am.oid) > WHERE am.amname='zedstore' > GROUP BY 1,2,3,4; > > postgres=# select * from pg_zs_page_counts; > relnamespace | oid | relname | pg_zs_page_type | count > --+---+-+-+--- > public | 32768 | onecol | BTREE | 640 > public | 32768 | onecol | FREE|90 > public | 32768 | onecol | META| 1 > (3 rows) > > -- Run update query the first time > postgres=# update onecol set a = 200; -- profiling attached in > zedstore_update_flames_lz4_first_update.svg > Time: 28760.199 ms (00:28.760) > > postgres=# select * from pg_zs_page_counts; > relnamespace | oid | relname | pg_zs_page_type | count > --+---+-+-+--- > public | 32768 | onecol | BTREE | 6254 > public | 32768 | onecol | FREE| 26915 > public | 32768 | onecol | META| 1 > (6 rows) > Oops, the first UPDATE created a lot of free pages. 
Just FYI, when the second update was run, it took around 5 mins (which is almost 10-12 times more than what the 1st UPDATE took), but this time no more free pages were added; instead the already available free pages were used. Here are the stats observed before and after the second update:

before:
postgres=# select * from pg_zs_page_counts;
 relnamespace |  oid  | relname | pg_zs_page_type | count
--------------+-------+---------+-----------------+-------
 public       | 16390 | t1      | FREE            | 26915
 public       | 16390 | t1      | BTREE           |  7277
 public       | 16390 | t1      | META            |     1
(3 rows)

after:
postgres=# select * from pg_zs_page_counts;
 relnamespace |  oid  | relname | pg_zs_page_type | count
--------------+-------+---------+-----------------+-------
 public       | 16390 | t1      | FREE            | 26370
 public       | 16390 | t1      | BTREE           |  7822
 publ
Re: pgbench - allow to create partitioned tables
Hi Fabien, Amit, I could see that when an invalid number of partitions is specified, sometimes pgbench fails with the error "invalid number of partitions: ...", whereas many times it doesn't and instead creates a number of partitions that the user never specified. As partitions is an integer variable, the maximum value it can hold is "2147483647". But if I specify partitions as "3147483647", the atoi function returns a value less than zero and pgbench terminates with an error. However, if the value specified for the number of partitions is something like "5147483647", atoi returns a non-negative number and pgbench creates as many partitions as the value returned by atoi. Have a look at the examples below: [ashu@localhost bin]$ ./pgbench -i -s 10 --partitions=2147483647 postgres dropping old tables... creating tables... creating 2147483647 partitions... ^C [ashu@localhost bin]$ ./pgbench -i -s 10 --partitions=3147483647 postgres invalid number of partitions: "3147483647" [ashu@localhost bin]$ ./pgbench -i -s 10 --partitions=5147483647 postgres dropping old tables... creating tables... creating 852516351 partitions... ^C This seems like a problem with the atoi function, doesn't it? The atoi function has been used in several places in pgbench and I can see similar behaviour for all of them. For example, it has been used for the scale factor, and the above observation is true for that as well. So, is this a bug, or do you guys feel that it isn't and can be ignored? Please let me know your thoughts on this. Thank you. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com On Thu, Oct 3, 2019 at 10:30 AM Fabien COELHO wrote: > > >> Thanks, attached is a patch with minor modifications which I am > >> planning to push after one more round of review on Thursday morning > >> IST unless there are more comments by anyone else. > > > > Pushed. > > Thanks! > > -- > Fabien. > >
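[Stand-alone editorial illustration, not pgbench code: the difference between atoi and a range-checked strtol can be seen with a tiny test program. Note that the 852516351 partitions seen above is simply 5147483647 truncated to 32 bits (5147483647 - 2^32 = 852516351).]

#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
	const char *input = "5147483647";	/* larger than INT_MAX */
	char	   *end;
	long		val;

	/*
	 * atoi gives no error indication; on overflow its behaviour is
	 * undefined, and in practice it often yields a silently wrapped value.
	 */
	printf("atoi:   %d\n", atoi(input));

	/* strtol reports overflow via errno and lets us check the range */
	errno = 0;
	val = strtol(input, &end, 10);
	if (errno == ERANGE || end == input || *end != '\0' ||
		val < 1 || val > INT_MAX)	/* accepted range is the caller's choice */
		printf("strtol: invalid number of partitions: \"%s\"\n", input);
	else
		printf("strtol: %ld\n", val);

	return 0;
}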
Re: pgbench - allow to create partitioned tables
On Thu, Oct 3, 2019 at 1:53 PM Fabien COELHO wrote: > > > Hello, > > > As partitions is an integer type variable, the maximum value it can > > hold is "2147483647". But if I specify partitions as "3147483647", > > atoi function returns a value lesser than zero and pgbench terminates > > with an error. However, if the value for number of partitions > > specified is something like "5147483647", atoi returns a non-negative > > number and pgbench creates as many number of partitions as the value > > returned by atoi function. > > > > This seems like a problem with atoi function, isn't it? > > Yes. > > > atoi functions has been used at several places in pgbench script and I > > can see similar behaviour for all. For e.g. it has been used with > > scale factor and above observation is true for that as well. So, is > > this a bug or you guys feel that it isn't and can be ignored? Please > > let me know your thoughts on this. Thank you. > > I think that it is a known bug (as you noted atoi is used more or less > everywhere in pgbench and other commands) which shoud be addressed > separately: all integer user inputs should be validated for syntax and > overflow, everywhere, really. This is not currently the case, so I simply > replicated the current bad practice when developing this feature. > Okay, I think we should possibly replace atoi with strtol function call for better error handling. It handles the erroneous inputs better than atoi. > There is/was a current patch/discussion to improve integer parsing, which > could address this. > It seems like you are trying to point out the following discussion on hackers, https://www.postgresql.org/message-id/flat/20190724040237.GB64205%40begriffs.com#5677c361d3863518b0db5d5baae72bbe -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
Re: refactoring - share str2*int64 functions
Is there any specific reason for hard-coding the *base* of the number represented by the string in strtouint64()? I understand that currently strtouint64() is used just to convert an input string to an unsigned decimal value, but what if we want to use it for hexadecimal or some other base? In that case it can't be used. Further, the function name is strtouint64() but the comment atop its definition says it's pg_strtouint64(). That needs to be corrected. In a few places, I could see that the call to pg_strtoint32_check() is followed by error handling. Isn't that already being done in the pg_strtoint32_check function itself? For example, in refint.c the call to pg_strtoint32_check is followed by an if condition that checks for an error, which I assume shouldn't be there as it is already being done by pg_strtoint32_check. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com On Wed, Sep 18, 2019 at 6:43 AM Michael Paquier wrote: > > On Tue, Sep 17, 2019 at 11:29:13AM +0900, Michael Paquier wrote: > > In order to unify the parsing interface and put all the conversion > > routines in a single place, I still think that the patch has value so > > I would still keep it (with a fix for the queryId parsing of course), > > but there is much more to it. > > As of now, here is an updated patch which takes the path to not > complicate the refactored APIs and fixes the issue with queryID in > readfuncs.c. Thoughts? > -- > Michael
Re: refactoring - share str2*int64 functions
On Fri, Oct 4, 2019 at 8:58 PM Andres Freund wrote: > > Hi, > > On 2019-10-04 14:27:44 +0530, Ashutosh Sharma wrote: > > Is there any specific reason for hard coding the *base* of a number > > representing the string in strtouint64(). I understand that currently > > strtouint64() is being used just to convert an input string to decimal > > unsigned value but what if we want it to be used for hexadecimal > > values or may be some other values, in that case it can't be used. > > It's a lot slower if the base is variable, because the compiler cannot > replace the division by shifts. > Thanks Andres for the reply. I didn't know that the compiler wouldn't be able to replace the division with shift operations if the base is variable; it's true that that would make things a lot slower. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
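[Editorial sketch to illustrate Andres' point, not the actual pg_strtouint64 code: with a compile-time constant base the compiler can strength-reduce the per-digit multiply (and any overflow-check division) into shifts and adds, whereas a runtime base argument forces real multiply/divide instructions. Overflow checks are omitted, and the variable-base version handles only bases up to 10, for brevity.]

#include <stdint.h>

/* base is the constant 10: "val * 10" compiles down to shifts and adds */
uint64_t
convert_base10(const char *s)
{
	uint64_t	val = 0;

	while (*s >= '0' && *s <= '9')
		val = val * 10 + (uint64_t) (*s++ - '0');
	return val;
}

/* base is a runtime value: the compiler has to emit a real multiply */
uint64_t
convert_any_base(const char *s, unsigned int base)
{
	uint64_t	val = 0;

	while (*s >= '0' && *s < (char) ('0' + base))
		val = val * base + (uint64_t) (*s++ - '0');
	return val;
}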
Re: Change atoi to strtol in same place
Hi Joe, On a quick look, the patch seems to be going in a good direction, although there is quite a bit of pending work to be done. One suggestion: the start value for the port number is set to 1; however, port numbers in the range 1-1023 are reserved for privileged services and normally can't be used. So, is it possible to have the start value be 1024 instead of 1? Further, I encountered one compilation error (INT_MAX undeclared) because the header file "limits.h" has not been included in postgres_fe.h or option.h. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com On Fri, Oct 4, 2019 at 9:04 PM Alvaro Herrera wrote: > > On 2019-Oct-03, Joe Nelson wrote: > > > Kyotaro Horiguchi wrote: > > > > pg_standby: -k keepfiles could not parse 'hoge' as integer > > > > > > I didn't checked closely, but -k of pg_standby's message looks > > > somewhat strange. Needs a separator? > > > > Good point, how about this: > > > > pg_standby: -k keepfiles: > > The wording is a bit strange. How about something like > pg_standy: invalid argument to -k: %s > > where the %s is the error message produced like you propose: > > > I could have pg_strtoint64_range() wrap its error messages in _() so > > that translators could customize the messages prior to concatenation. > > > > *error = psprintf(_("could not parse '%s' as integer"), str); > > ... except that they would rather be more explicit about what the > problem is. "insufficient digits" or "extraneous character", etc. > > > Would this suffice? > > I hope that no callers would like to have the messages not translated, > because that seems like it would become a mess. > > -- > Álvaro Herrera https://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services > >
Re: dropping column prevented due to inherited index
I think we could have first deleted all the dependencies of the child object on the parent and then deleted the child itself using performDeletion(). As an example, consider the case of a toast table, where we first delete the dependency of the toast relation on the main relation and then delete the toast table itself; otherwise the toast table deletion would fail. But the problem I see here is that currently we do not have any entry in the pg_attribute table that would tell us about the dependency of a child column on its parent. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com On Mon, Oct 7, 2019 at 7:31 AM Amit Langote wrote: > > Hello, > > On Fri, Oct 4, 2019 at 5:57 PM Michael Paquier wrote: > > > > On Thu, Oct 03, 2019 at 09:18:12AM -0300, Alvaro Herrera wrote: > > > Hmm. I wonder if we shouldn't adopt the coding pattern we've used > > > elsewhere of collecting all columns to be dropped first into an > > > ObjectAddresses array, then use performMultipleDeletions. > > > > +1. That's the common pattern these days, because that's more > > performant. > > Actually I don't see the peformMultipleDeletions() pattern being used > for the situations where there are multiple objects to drop due to > inheritance. I only see it where there are multiple objects related > to one table. Maybe it's possible to apply to the inheritance > situation though, but in this particular case, it seems a bit hard to > do, because ATExecDropColumn steps through an inheritance tree level > at a time. > > But maybe I misunderstood Alvaro's suggestion? > > > I think that the patch should have regression tests. > > I have added one in the attached updated patch. > > Thanks, > Amit
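[For reference, a rough editorial sketch of the collect-then-drop pattern Alvaro refers to. It is only illustrative: the columns_to_drop list and the ColumnToDrop struct are invented names, and the real ATExecDropColumn logic would gather the child columns differently. The headers involved would be catalog/dependency.h for the ObjectAddresses API and catalog/pg_class_d.h for RelationRelationId.]

ObjectAddresses *objects = new_object_addresses();
ListCell   *lc;

foreach(lc, columns_to_drop)
{
	ColumnToDrop *col = (ColumnToDrop *) lfirst(lc);
	ObjectAddress addr;

	/* a column is addressed as (pg_class, relation OID, attribute number) */
	addr.classId = RelationRelationId;
	addr.objectId = col->relid;
	addr.objectSubId = col->attnum;
	add_exact_object_address(&addr, objects);
}

/* drop everything in one go, letting dependency.c order the deletions */
performMultipleDeletions(objects, DROP_RESTRICT, 0);
free_object_addresses(objects);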