Re: Synchronizing slots from primary to standby
On Sat, Mar 2, 2024 at 9:21 AM Zhijie Hou (Fujitsu) wrote: > > Apart from the comments, the code in WalSndWaitForWal was refactored > a bit to make it neater. Thanks to Shveta for helping write the code and doc. > A few more comments: == 1. +# Wait until the primary server logs a warning indicating that it is waiting +# for the sb1_slot to catch up. +$primary->wait_for_log( + qr/replication slot \"sb1_slot\" specified in parameter standby_slot_names does not have active_pid/, + $offset); Shouldn't we wait for such a LOG in the first test as well, which involves two standbys and two logical subscribers? 2. +## +# Test that logical replication will wait for the user-created inactive +# physical slot to catch up until we remove the slot from standby_slot_names. +## I don't see anything different tested in this test from what we already tested in the first test involving two standbys and two logical subscribers. Can you please clarify if I am missing something? 3. Note that after receiving the shutdown signal, an ERROR + * is reported if any slots are dropped, invalidated, or inactive. This + * measure is taken to prevent the walsender from waiting indefinitely. + */ + if (NeedToWaitForStandby(target_lsn, flushed_lsn, wait_event)) Shouldn't this part of the comment be moved inside NeedToWaitForStandby()? 4. + /* + * Update our idea of the currently flushed position only if we are + * not waiting for standbys to catch up, otherwise the standby would + * have to catch up to a newer WAL location in each cycle. + */ + if (wait_event != WAIT_EVENT_WAIT_FOR_STANDBY_CONFIRMATION) + { This functionality (in function WalSndWaitForWal()) seems to ensure that we first wait for the required WAL to be flushed and then wait for standbys. If true, we should cover that point in the comments here or somewhere in the function WalSndWaitForWal(). Apart from this, I have made a few modifications in the comments. -- With Regards, Amit Kapila. diff --git a/src/test/recovery/t/040_standby_failover_slots_sync.pl b/src/test/recovery/t/040_standby_failover_slots_sync.pl index 5e4dd6c0ab..16483e52a9 100644 --- a/src/test/recovery/t/040_standby_failover_slots_sync.pl +++ b/src/test/recovery/t/040_standby_failover_slots_sync.pl @@ -505,8 +505,9 @@ ok( $standby1->poll_query_until( 'restart_lsn and confirmed_flush_lsn of slot lsub1_slot synced to standby'); ## -# Test primary disallowing specified logical replication slots getting ahead of -# specified physical replication slots. It uses the following set up: +# Test that logical failover replication slots wait for the specified +# physical replication slots to receive the changes first. It uses the +# following set up: # # (physical standbys) # | > standby1 (primary_slot_name = sb1_slot) @@ -518,9 +519,9 @@ ok( $standby1->poll_query_until( # # standby_slot_names = 'sb1_slot' # -# Set up is configured in such a way that the logical slot of subscriber1 is -# enabled for failover, thus it will wait for the physical slot of -# standby1(sb1_slot) to catch up before sending decoded changes to subscriber1. +# The setup is configured in such a way that the logical slot of subscriber1 is +# enabled for failover, and thus the subscriber1 will wait for the physical +# slot of standby1(sb1_slot) to catch up before receiving the decoded changes.
## $backup_name = 'backup3'; @@ -543,8 +544,8 @@ primary_slot_name = 'sb2_slot' $standby2->start; $primary->wait_for_replay_catchup($standby2); -# Configure primary to disallow any logical slots that enabled failover from -# getting ahead of specified physical replication slot (sb1_slot). +# Configure primary to disallow any logical slots that have enabled failover +# from getting ahead of the specified physical replication slot (sb1_slot). $primary->append_conf( 'postgresql.conf', qq( standby_slot_names = 'sb1_slot'
Re: Regarding RecentFlushPtr in WalSndWaitForWal()
On Fri, Mar 1, 2024 at 4:40 PM Matthias van de Meent wrote: > > On Mon, 26 Feb 2024 at 12:46, shveta malik wrote: > > > > Hi hackers, > > > > I would like to understand why we have code [1] that retrieves > > RecentFlushPtr in WalSndWaitForWal() outside of the loop. We utilize > > RecentFlushPtr later within the loop, but prior to that, we already > > have [2]. Wouldn't [2] alone be sufficient? > > > > Just to check the impact, I ran 'make check-world' after removing [1], > > and did not see any issue exposed by the test at least. > > Yeah, that seems accurate. > > > Any thoughts? > [...] > > [2]: > > /* Update our idea of the currently flushed position. */ > > else if (!RecoveryInProgress()) > > I can't find where this "else" of this "else if" clause came from, as > this piece of code hasn't changed in years. > Right, I think the quoted code has the check "if (!RecoveryInProgress())". > But apart from that, your > observation seems accurate, yes. > I also find the observation correct, and the code has been like that since commit 5a991ef8 [1]. So, let's wait to see if Robert or Andres remembers the reason; otherwise, we should probably nuke this code. [1] commit 5a991ef8692ed0d170b44958a81a6bd70e90585c Author: Robert Haas Date: Mon Mar 10 13:50:28 2014 -0400 Allow logical decoding via the walsender interface. -- With Regards, Amit Kapila.
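For readers without the tree at hand, the shape of the code being discussed is roughly the following (paraphrased from walsender.c, not an exact copy):

    /* [1]: fetched once, before entering the loop */
    if (!RecoveryInProgress())
        RecentFlushPtr = GetFlushRecPtr(NULL);
    else
        RecentFlushPtr = GetXLogReplayRecPtr(NULL);

    for (;;)
    {
        ...

        /* [2]: the equivalent update, inside the loop */
        if (!RecoveryInProgress())
            RecentFlushPtr = GetFlushRecPtr(NULL);
        else
            RecentFlushPtr = GetXLogReplayRecPtr(NULL);

        ...
    }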
Re: Seeking Clarification on Logical Replication Start LSN
On Tue, Feb 27, 2024 at 5:56 PM Pradeep Kumar wrote: > > Dear Postgres Community, > > I hope this email finds you well. I am reaching out to seek clarification on > an issue I am encountering with logical replication in PostgreSQL. > > My specific question pertains to determining the appropriate LSN (Log > Sequence Number) from which to start logical replication. Allow me to provide > detailed context for better understanding: > > During the process of performing a parallel pg_basebackup, I concurrently > execute DML queries. As part of the pg_basebackup command, I utilize the > option create-slot to create a replication slot. Subsequently, upon > completion of the base backup, I initiate logical replication using the > restart_lsn obtained during the execution of the pg_basebackup command. My > intention is to ensure consistency between two database clusters. > > However, I am encountering errors during this process. Specifically, I > receive the following error message on the source side: > > """ > 2024-02-27 16:20:09.271 IST [2838457] ERROR: duplicate key value violates > unique constraint "table_15_36_pkey" > 2024-02-27 16:20:09.271 IST [2838457] DETAIL: Key (col_1, col_2)=(23, > 2024-02-27 15:14:24.332557) already exists. > 2024-02-27 16:20:09.272 IST [2834967] LOG: background worker "logical > replication worker" (PID 2838457) exited with exit code 1 > Upon analysis, it appears that the errors stem from starting the logical > replication with an incorrect LSN, one that has already been applied to the > target side, leading to duplicate key conflicts. > """ > > In light of this issue, I seek guidance on determining the appropriate LSN > from which to commence logical replication. > > To further clarify my problem: > > 1) I have a source machine and a target machine. > 2) I perform a backup from the source to the target using pg_basebackup. > 3) Prior to initiating the base backup, I create logical replication slots on > the source machine. > 4) During the execution of pg_basebackup, DML queries are executed, and I aim > to replicate this data on the target machine. > 5) My dilemma lies in determining the correct LSN to begin the logical > replication process. > I think the reason for the problem you are seeing is that pg_basebackup also includes the WAL generated during the backup if you specify the -X option. See [1]. Now, as you have created a logical slot before starting the backup, data duplication is possible. I don't see a very straightforward way, but you might be able to achieve your desired purpose if you can somehow identify the last WAL location copied in the backup and use that as your starting point for logical replication. [1] - https://www.postgresql.org/docs/devel/app-pgbasebackup.html -- With Regards, Amit Kapila.
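To sketch that idea concretely (the slot name and LSN are illustrative only, and whether this meets your consistency requirements needs verification):

    -- 1. Note where the base backup's WAL ends. pg_basebackup -v reports
    --    this as the "write-ahead log end point"; with the low-level API
    --    it is the LSN returned by pg_backup_stop().
    -- 2. Fast-forward the logical slot past everything the backup already
    --    contains, so that decoding starts strictly after that point:
    SELECT pg_replication_slot_advance('my_logical_slot', '0/3000060');
    -- 3. Start logical replication from the slot as usual; changes up to
    --    the advanced position will be skipped instead of replayed twice.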
Re: Shared detoast Datum proposal
Hi, Andy! Sorry for the delay, I have had long flights this week. I've reviewed the patch set, thank you for your efforts. I have several notes about the patch set code, but first of all I'm not sure the overall approach is the best one for the task. As Tomas wrote above, the approach is very invasive and spreads code related to detoasting among many parts of the code. Have you considered another one - altering pg_detoast_datum (actually, it would be the detoast_attr function) and saving detoasted datums in a detoast context derived from the query context? We have just enough information at this step to identify the datum - the toast relation id and value id - and could keep links to these detoasted values in, say, a linked list or hash table. That way we would avoid altering the executor code, and all detoast-related code would reside within the detoast source files. I'll check this approach over the next several days and report on the results here. I also have comments on the code itself; I'll write them up a bit later. -- Regards, Nikita Malakhov Postgres Professional The Russian Postgres Company https://postgrespro.ru/
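To make this more concrete, here is a minimal sketch of the structure I have in mind (the names are invented, just to illustrate):

    /*
     * Key: identifies a TOASTed value by the fields available in the
     * on-disk TOAST pointer (struct varatt_external).
     */
    typedef struct ToastCacheKey
    {
        Oid         toastrelid;     /* va_toastrelid */
        Oid         valueid;        /* va_valueid */
    } ToastCacheKey;

    typedef struct ToastCacheEntry
    {
        ToastCacheKey   key;
        struct varlena *value;      /* detoasted copy, allocated in the
                                     * detoast context derived from the
                                     * query context */
    } ToastCacheEntry;

detoast_attr() would then probe a hash table with this key before fetching from the TOAST table, and insert the detoasted copy on a miss - all without touching the executor.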
Re: RFC: Logging plan of the running query
On Wed, Feb 28, 2024 at 1:18 AM Robert Haas wrote: > > On Mon, Feb 26, 2024 at 5:31 PM torikoshia wrote: > > It would be nice if there was a place accessed once every few seconds or > > so... > > I think this comment earlier from Andres deserves close attention: > > # If we went with something like that approach, I think we'd have to do > something > # like redirecting node->ExecProcNode to a wrapper, presumably from within a > # CFI. That wrapper could then implement the explain support, without slowing > # down the normal execution path. > > If this is correctly implemented, the overhead in the case where the > feature isn't used should be essentially zero, I believe. If I can rephrase this idea: it's basically "delay this interrupt until inline to the next ExecProcNode execution". That seems pretty promising to me as well. Regards, James Coleman
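A minimal sketch of that wrapper, just to show the shape (names invented; a real patch would also need to avoid clobbering the existing ExecProcNodeFirst machinery, handle re-arming, etc.):

    /* Installed on each PlanState by the CFI handler when a plan dump is
     * requested: */
    static TupleTableSlot *
    ExecProcNodeWithExplain(PlanState *node)
    {
        /* One-shot: emit the plan, then restore the fast path. */
        LogRunningQueryPlan(node);      /* hypothetical */
        node->ExecProcNode = node->ExecProcNodeReal;
        return node->ExecProcNode(node);
    }

Until the wrapper is installed, execution goes through the usual ExecProcNode pointer, so the cost when the feature is unused really is zero.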
Re: Commitfest Manager for March
> On 1 Mar 2024, at 14:57, Andrey M. Borodin wrote: > >> On 1 Mar 2024, at 17:29, Daniel Gustafsson wrote: >> >> The call for a CFM volunteer is still open. > > I always wanted to try. And most of the stuff I'm interested in is already > committed. > > But given the importance of the last commitfest before feature freeze, we might be > interested in a more experienced CFM. > If I can do something useful - I'm up for it. I'm absolutely convinced you have more than enough experience with postgres hacking to do an excellent job. I'm happy to give a hand as well. Thanks for volunteering! (someone from pginfra will give you the required admin permissions on the CF app) -- Daniel Gustafsson
Re: Documentation: warn about two_phase when altering a subscription
> On 26 Feb 2024, at 14:14, Bertrand Drouvot > wrote: > > As the patch as it is now looks good to Amit (see [1]), I prefer to let him > decide which wording he prefers to push. Added Amit to cc, just to be sure. Thanks! Best regards, Andrey Borodin, learning how to be a CFM.
CF entries for 17 to be reviewed
Hi hackers! In this thread, I want to promote entries from CommitFest that require review. I have scanned through the bugs, clients, and documentation sections, and here is my take on the current situation. All of these threads are currently in the "Needs review" state and were marked by the patch author as targeting version 17. Bugs: * LockAcquireExtended improvement Not really a bug, but rather a limitation. The thread might be of interest to a reviewer who wants to dive into heavy locks. Some input from Robert. I doubt this improvement will land in 17. * Dump-restore loosing 'attnotnull' bit for DEFERRABLE PRIMARY KEY column(s) 3-liner fix, with input from Alvaro. * Avoid deadlock and concurrency during orphan temp table removal A real deadlock from production. 0 prior external review, hurry up to be the first reviewer :) The patch is small, yet it needs a deep understanding of the locking protocols for different combinations of objects. * Explicitly show dependent types as extension members Some issues on the tip of the FDW and types subsystems. Fix by Tom, not backpatchable. If you want a better understanding of extensibility and type dependencies - take this for review. IMO it will most probably be in 17; it just needs some extra eyes. * initdb's -c option behaves wrong way Fundamental debate, might seem much like tabs vs spaces (strncmp vs strncasecmp). Patch by Tom, perfect as usual, needs agreement from someone who's already involved. Clients: * vacuumdb/clusterdb/reindexdb: allow specifying objects to process in all databases Nice feature, some review from Kyotaro Horiguchi, but it needs more. * Support for named parsed statement in psql Some review by Jelte was done, but it seems to require more attention. * Extend pgbench partitioning to pgbench_history Tomas Vondra expressed some backward compatibility concerns, Melanie questioned implementation details. I doubt this will land in 17, but eventually it might be part of a famous "sort of TPC-B". * psql: Allow editing query results with \gedit There's a possible nice new psql feature, but as of January the architectural discussion was still going on. It does not seem like it will be in 17, unless some agreement emerges. Documentation: * SET ROLE documentation improvement Thin matters of superuser documentation. Nathan signed up as a committer, but the thread has had no updates for some months. * Add detail regarding resource consumption wrt max_connections Cary Huang reviewed the patch, but the result is inconclusive. * Quick Start Guide to PL/pgSQL and PL/Python Documentation Some comments from Pavel and Li seem not to have been addressed. * Simplify documentation related to Windows builds Andres had a couple of notes that were addressed by the author. * PG DOCS - protocol specifying boolean parameters without values. A small leftover in a relatively old documentation thread. Needs a bump from someone in the thread. * Documentation: warn about two_phase when altering a subscription Amit LGTMed; I've added him to the cc of the follow-up. Stay tuned for other CF sections. Best regards, Andrey Borodin, learning how to be a CFM.
Re: CF entries for 17 to be reviewed
On Sat, Mar 2, 2024 at 1:32 PM Andrey M. Borodin wrote: > > Hi hackers! > > In this thread, I want to promote entries from CommitFest that require > review. I have scanned through the bugs, clients, and documentation sections, > and here is my take on the current situation. All of these threads are > currently in the "Needs review" state and were marked by the patch author as > targeting version 17. Hi Andrey, thanks for volunteering. I at least had forgotten to update the target version for all of my registered patches. I suspect others may be in the same situation. Anyway, I've done that now. But, I'm not sure if the ones that do have a target version = 17 are actually all the patches targeting 17. - Melanie
Failures in constraints regression test, "read only 0 of 8192 bytes"
These two animals seem to have got mixed up about the size of this relation in the same place: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=avocet&dt=2024-02-28%2017%3A34%3A30 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=trilobite&dt=2024-03-01%2006%3A47%3A53 +++ /home/buildfarm/trilobite/buildroot/HEAD/pgsql.build/src/test/regress/results/constraints.out 2024-03-01 08:22:11.624897033 +0100 @@ -573,42 +573,38 @@ UNIQUE (i) DEFERRABLE INITIALLY DEFERRED; BEGIN; INSERT INTO unique_tbl VALUES (1, 'five'); +ERROR: could not read blocks 0..0 in file "base/16384/21437": read only 0 of 8192 bytes That error message changed slightly in my smgrreadv() commit a couple of months ago (it would have been "block 0" and now it's "blocks 0..0" because now we can read more than one block at a time), but I don't immediately see how anything at that low level could be responsible for this.
Re: BitmapHeapScan streaming read user and prelim refactoring
On Fri, Mar 1, 2024 at 2:31 PM Melanie Plageman wrote: > > On Thu, Feb 29, 2024 at 7:29 PM Melanie Plageman > wrote: > > > > On Thu, Feb 29, 2024 at 5:44 PM Tomas Vondra > > wrote: > > > > > > > > > > > > On 2/29/24 22:19, Melanie Plageman wrote: > > > > On Thu, Feb 29, 2024 at 7:54 AM Tomas Vondra > > > > wrote: > > > >> > > > >> > > > >> > > > >> On 2/29/24 00:40, Melanie Plageman wrote: > > > >>> On Wed, Feb 28, 2024 at 6:17 PM Tomas Vondra > > > >>> wrote: > > > > > > > > > > > > On 2/28/24 21:06, Melanie Plageman wrote: > > > > On Wed, Feb 28, 2024 at 2:23 PM Tomas Vondra > > > > wrote: > > > >> > > > >> On 2/28/24 15:56, Tomas Vondra wrote: > > > ... > > > >>> > > > >>> Sure, I can do that. It'll take a couple hours to get the > > > >>> results, I'll > > > >>> share them when I have them. > > > >>> > > > >> > > > >> Here are the results with only patches 0001 - 0012 applied (i.e. > > > >> without > > > >> the patch introducing the streaming read API, and the patch > > > >> switching > > > >> the bitmap heap scan to use it). > > > >> > > > >> The changes in performance don't disappear entirely, but the scale > > > >> is > > > >> certainly much smaller - both in the complete results for all > > > >> runs, and > > > >> for the "optimal" runs that would actually pick bitmapscan. > > > > > > > > Hmm. I'm trying to think how my refactor could have had this impact. > > > > It seems like all the most notable regressions are with 4 parallel > > > > workers. What do the numeric column labels mean across the top > > > > (2,4,8,16...) -- are they related to "matches"? And if so, what does > > > > that mean? > > > > > > > > > > That's the number of distinct values matched by the query, which > > > should > > > be an approximation of the number of matching rows. The number of > > > distinct values in the data set differs by data set, but for 1M rows > > > it's roughly like this: > > > > > > uniform: 10k > > > linear: 10k > > > cyclic: 100 > > > > > > So for example matches=128 means ~1% of rows for uniform/linear, and > > > 100% for cyclic data sets. > > > >>> > > > >>> Ah, thank you for the explanation. I also looked at your script after > > > >>> having sent this email and saw that it is clear in your script what > > > >>> "matches" is. > > > >>> > > > As for the possible cause, I think it's clear most of the difference > > > comes from the last patch that actually switches bitmap heap scan to > > > the > > > streaming read API. That's mostly expected/understandable, although > > > we > > > probably need to look into the regressions or cases with e_i_c=0. > > > >>> > > > >>> Right, I'm mostly surprised about the regressions for patches > > > >>> 0001-0012. > > > >>> > > > >>> Re eic 0: Thomas Munro and I chatted off-list, and you bring up a > > > >>> great point about eic 0. In old bitmapheapscan code eic 0 basically > > > >>> disabled prefetching but with the streaming read API, it will still > > > >>> issue fadvises when eic is 0. That is an easy one line fix. Thomas > > > >>> prefers to fix it by always avoiding an fadvise for the last buffer in > > > >>> a range before issuing a read (since we are about to read it anyway, > > > >>> best not fadvise it too). This will fix eic 0 and also cut one system > > > >>> call from each invocation of the streaming read machinery. > > > >>> > > > To analyze the 0001-0012 patches, maybe it'd be helpful to run tests > > > for > > > individual patches. I can try doing that tomorrow. 
It'll have to be a > > > limited set of tests, to reduce the time, but might tell us whether > > > it's > > > due to a single patch or multiple patches. > > > >>> > > > >>> Yes, tomorrow I planned to start trying to repro some of the "red" > > > >>> cases myself. Any one of the commits could cause a slight regression > > > >>> but a 3.5x regression is quite surprising, so I might focus on trying > > > >>> to repro that locally and then narrow down which patch causes it. > > > >>> > > > >>> For the non-cached regressions, perhaps the commit to use the correct > > > >>> recheck flag (0004) when prefetching could be the culprit. And for the > > > >>> cached regressions, my money is on the commit which changes the whole > > > >>> control flow of BitmapHeapNext() and the next_block() and next_tuple() > > > >>> functions (0010). > > > >>> > > > >> > > > >> I do have some partial results, comparing the patches. I only ran one > > > >> of > > > >> the more affected workloads (cyclic) on the xeon, attached is a PDF > > > >> comparing master and the 0001-0014 patches. The percentages are timing > > > >> vs. the preceding patch (green - faster, red - slower). > > > > > > > > Just confirming: the results are for uncached? > > >
Re: BitmapHeapScan streaming read user and prelim refactoring
On Sat, Mar 2, 2024 at 10:05 AM Tomas Vondra wrote: > > Here's a PDF with charts for a dataset where the row selectivity is more > correlated to selectivity of pages. I'm attaching the updated script, > with the SQL generating the data set. But the short story is all rows on > a single page have the same random value, so the selectivity of rows and > pages should be the same. > > The first page has results for the original "uniform", the second page > is the new "uniform-pages" data set. There are 4 charts, for > master/patched and 0/4 parallel workers. Overall the behavior is the > same, but for the "uniform-pages" it's much more gradual (with respect > to row selectivity). I think that's expected. Cool! Thanks for doing this. I have convinced myself that Thomas' forthcoming patch which will eliminate prefetching with eic = 0 will fix the eic 0 blue line regressions. The eic = 1 with four parallel workers is more confusing. And it seems more noticeably bad with your randomized-pages dataset. Regarding your earlier question: > Just to be sure we're on the same page regarding what eic=1 means, > consider a simple sequence of pages: A, B, C, D, E, ... > > With the current "master" code, eic=1 means we'll issue a prefetch for B > and then read+process A. And then issue prefetch for C and read+process > B, and so on. It's always one page ahead. Yes, that is what I mean for eic = 1 > As for how this is related to eic=1 - I think my point was that these > are "adversary" data sets, most likely to show regressions. This applies > especially to the "uniform" data set, because as the row selectivity > grows, it's more and more likely it's right after to the current one, > and so a read-ahead would likely do the trick. No, I think you're right that eic=1 should prefetch. As you say, with high selectivity, a bitmap plan is likely not the best one anyway, so not prefetching in order to preserve the performance of those cases seems silly. - Melanie
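To spell out the behavior we're agreeing on, in pseudocode (invented helper names, not actual source):

    /* eic = 1 on master: the prefetch stays exactly one block ahead. */
    cur   = next_block();               /* block A */
    ahead = next_block();               /* block B */
    while (block_is_valid(cur))
    {
        if (block_is_valid(ahead))
            prefetch(ahead);            /* fadvise, one page ahead */
        read_and_process(cur);          /* synchronous read */
        cur = ahead;
        ahead = next_block();
    }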
Re: Failures in constraints regression test, "read only 0 of 8192 bytes"
These are "my" animals (running at a local university). There's a couple interesting details: 1) the animals run on the same machine (one with gcc, one with clang) 2) I did upgrade the OS and restarted the machine on 2024/02/26, i.e. right before the failures started These might be just coincidences, but maybe something got broken by the upgrade ... OTOH it's weird it'd affect just HEAD and none of the other branches, and on two difference compilers. Just to be sure I removed the buildroot, in case there's something wrong with ccache. It's a wild guess, but I don't have any other idea. regards On 3/2/24 22:39, Thomas Munro wrote: > These two animals seem to have got mixed up about about the size of > this relation in the same place: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=avocet&dt=2024-02-28%2017%3A34%3A30 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=trilobite&dt=2024-03-01%2006%3A47%3A53 > > +++ > /home/buildfarm/trilobite/buildroot/HEAD/pgsql.build/src/test/regress/results/constraints.out > 2024-03-01 08:22:11.624897033 +0100 > @@ -573,42 +573,38 @@ > UNIQUE (i) DEFERRABLE INITIALLY DEFERRED; > BEGIN; > INSERT INTO unique_tbl VALUES (1, 'five'); > +ERROR: could not read blocks 0..0 in file "base/16384/21437": read > only 0 of 8192 bytes > > That error message changed slightly in my smgrreadv() commit a couple > of months ago (it would have been "block 0" and now it's "blocks 0..0" > because now we can read more than one block at a time) but I don't > immediately see how anything at that low level could be responsible > for this. > > -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: BitmapHeapScan streaming read user and prelim refactoring
On 3/2/24 23:28, Melanie Plageman wrote: > On Sat, Mar 2, 2024 at 10:05 AM Tomas Vondra > wrote: >> >> Here's a PDF with charts for a dataset where the row selectivity is more >> correlated to selectivity of pages. I'm attaching the updated script, >> with the SQL generating the data set. But the short story is all rows on >> a single page have the same random value, so the selectivity of rows and >> pages should be the same. >> >> The first page has results for the original "uniform", the second page >> is the new "uniform-pages" data set. There are 4 charts, for >> master/patched and 0/4 parallel workers. Overall the behavior is the >> same, but for the "uniform-pages" it's much more gradual (with respect >> to row selectivity). I think that's expected. > > Cool! Thanks for doing this. I have convinced myself that Thomas' > forthcoming patch which will eliminate prefetching with eic = 0 will > fix the eic 0 blue line regressions. The eic = 1 with four parallel > workers is more confusing. And it seems more noticeably bad with your > randomized-pages dataset. > > Regarding your earlier question: > >> Just to be sure we're on the same page regarding what eic=1 means, >> consider a simple sequence of pages: A, B, C, D, E, ... >> >> With the current "master" code, eic=1 means we'll issue a prefetch for B >> and then read+process A. And then issue prefetch for C and read+process >> B, and so on. It's always one page ahead. > > Yes, that is what I mean for eic = 1 > >> As for how this is related to eic=1 - I think my point was that these >> are "adversary" data sets, most likely to show regressions. This applies >> especially to the "uniform" data set, because as the row selectivity >> grows, it's more and more likely it's right after to the current one, >> and so a read-ahead would likely do the trick. > > No, I think you're right that eic=1 should prefetch. As you say, with > high selectivity, a bitmap plan is likely not the best one anyway, so > not prefetching in order to preserve the performance of those cases > seems silly. > I was just trying to respond to this from an earlier message: > Yes, I would like to see results from a data set where selectivity is > more correlated to pages/heap fetches. But, I'm not sure I see how > that is related to prefetching when eic = 1. And in that same message you also said "Not doing prefetching with eic 1 actually seems like the right behavior". Hence my argument we should not stop prefetching for eic=1. But maybe I'm confused - it seems we agree eic=1 should prefetch, and that the uniform data set may not be a good argument against that. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: BitmapHeapScan streaming read user and prelim refactoring
On 3/2/24 23:11, Melanie Plageman wrote: > On Fri, Mar 1, 2024 at 2:31 PM Melanie Plageman > wrote: >> >> ... >> >> Hold the phone on this one. I realized why I moved >> BitmapAdjustPrefetchIterator after table_scan_bitmap_next_block() in >> the first place -- master calls BitmapAdjustPrefetchIterator after the >> tbm_iterate() for the current block -- otherwise with eic = 1, it >> considers the prefetch iterator behind the current block iterator. I'm >> going to go through and figure out what order this must be done in and >> fix it. > > So, I investigated this further, and, as far as I can tell, for > parallel bitmapheapscan the timing around when workers decrement > prefetch_pages causes the performance differences with patch 0010 > applied. It makes very little sense to me, but some of the queries I > borrowed from your regression examples are up to 30% slower when this > code from BitmapAdjustPrefetchIterator() is after > table_scan_bitmap_next_block() instead of before it. > > SpinLockAcquire(&pstate->mutex); > if (pstate->prefetch_pages > 0) > pstate->prefetch_pages--; > SpinLockRelease(&pstate->mutex); > > I did some stracing and did see much more time spent in futex/wait > with this code after the call to table_scan_bitmap_next_block() vs > before it. (table_scan_bitmap_next_block()) calls ReadBuffer()). > > In my branch, I've now moved only the parallel prefetch_pages-- code > to before table_scan_bitmap_next_block(). > https://github.com/melanieplageman/postgres/tree/bhs_pgsr > I'd be interested to know if you see the regressions go away with 0010 > applied (commit message "Make table_scan_bitmap_next_block() async > friendly" and sha bfdcbfee7be8e2c461). > I'll give this a try once the runs with MAX_BUFFERS_PER_TRANSFER=1 complete. But it seems really bizarre that simply moving this code a little bit would cause such a regression ... regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: BitmapHeapScan streaming read user and prelim refactoring
On Sat, Mar 2, 2024 at 5:41 PM Tomas Vondra wrote: > > > > On 3/2/24 23:28, Melanie Plageman wrote: > > On Sat, Mar 2, 2024 at 10:05 AM Tomas Vondra > > wrote: > >> > >> Here's a PDF with charts for a dataset where the row selectivity is more > >> correlated to selectivity of pages. I'm attaching the updated script, > >> with the SQL generating the data set. But the short story is all rows on > >> a single page have the same random value, so the selectivity of rows and > >> pages should be the same. > >> > >> The first page has results for the original "uniform", the second page > >> is the new "uniform-pages" data set. There are 4 charts, for > >> master/patched and 0/4 parallel workers. Overall the behavior is the > >> same, but for the "uniform-pages" it's much more gradual (with respect > >> to row selectivity). I think that's expected. > > > > Cool! Thanks for doing this. I have convinced myself that Thomas' > > forthcoming patch which will eliminate prefetching with eic = 0 will > > fix the eic 0 blue line regressions. The eic = 1 with four parallel > > workers is more confusing. And it seems more noticeably bad with your > > randomized-pages dataset. > > > > Regarding your earlier question: > > > >> Just to be sure we're on the same page regarding what eic=1 means, > >> consider a simple sequence of pages: A, B, C, D, E, ... > >> > >> With the current "master" code, eic=1 means we'll issue a prefetch for B > >> and then read+process A. And then issue prefetch for C and read+process > >> B, and so on. It's always one page ahead. > > > > Yes, that is what I mean for eic = 1 > > > >> As for how this is related to eic=1 - I think my point was that these > >> are "adversary" data sets, most likely to show regressions. This applies > >> especially to the "uniform" data set, because as the row selectivity > >> grows, it's more and more likely it's right after to the current one, > >> and so a read-ahead would likely do the trick. > > > > No, I think you're right that eic=1 should prefetch. As you say, with > > high selectivity, a bitmap plan is likely not the best one anyway, so > > not prefetching in order to preserve the performance of those cases > > seems silly. > > > > I was just trying to respond do this from an earlier message: > > > Yes, I would like to see results from a data set where selectivity is > > more correlated to pages/heap fetches. But, I'm not sure I see how > > that is related to prefetching when eic = 1. > > And in that same message you also said "Not doing prefetching with eic 1 > actually seems like the right behavior". Hence my argument we should not > stop prefetching for eic=1. > > But maybe I'm confused - it seems agree eic=1 should prefetch, and that > uniform data set may not be a good argument against that. Yep, we agree. I was being confusing and wrong :) I just wanted to make sure the thread had a clear consensus that, yes, it is the right thing to do to prefetch blocks for bitmap heap scans when effective_io_concurrency = 1. - Melanie
Re: Streaming read-ready sequential scan code
On Wed, Feb 28, 2024 at 12:30 PM Melanie Plageman wrote: > > On Mon, Feb 26, 2024 at 03:56:57PM -0500, Melanie Plageman wrote: > > On Mon, Feb 19, 2024 at 6:05 PM Melanie Plageman > > wrote: > > > > > > On Mon, Jan 29, 2024 at 4:17 PM Melanie Plageman > > > wrote: > > > > > > > > There is an outstanding question about where to allocate the > > > > PgStreamingRead object for sequential scans > > > > > > I've written three alternative implementations of the actual streaming > > > read user for sequential scan which handle the question of where to > > > allocate the streaming read object and how to handle changing scan > > > direction in different ways. > > > > > > Option A) > > > https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_initscan_allocation > > > - Allocates the streaming read object in initscan(). Since we do not > > > know the scan direction at this time, if the scan ends up not being a > > > forwards scan, the streaming read object must later be freed -- so > > > this will sometimes allocate a streaming read object it never uses. > > > - Only supports ForwardScanDirection and once the scan direction > > > changes, streaming is never supported again -- even if we return to > > > ForwardScanDirection > > > - Must maintain a "fallback" codepath that does not use the streaming > > > read API > > > > Attached is a version of this patch which implements a "reset" > > function for the streaming read API which should be cheaper than the > > full pg_streaming_read_free() on rescan. This can easily be ported to > > work on any of my proposed implementations (A/B/C). I implemented it > > on A as an example. > > Attached is the latest version of this patchset -- rebased in light of > Thomas' updatees to the streaming read API [1]. I have chosen the > approach I think we should go with. It is a hybrid of my previously > proposed approaches. While investigating some performance concerns, Andres pointed out that the members I add to HeapScanDescData in this patch push rs_cindex and rs_ntuples to another cacheline and introduce a 4-byte hole. Attached v4's HeapScanDescData is as well-packed as on master and its members are reordered so that rs_cindex and rs_ntuples are back on the second cacheline of the struct's data. - Melanie From 614195777b3ee675d74d98953b086e0f8a4a494d Mon Sep 17 00:00:00 2001 From: Melanie Plageman Date: Mon, 26 Feb 2024 15:33:39 -0500 Subject: [PATCH v4 4/5] Add pg_streaming_read_reset For rescan, we want to reuse the streaming read object and simply release the buffers that were pinned by the streaming read infrastructure. --- src/backend/storage/aio/streaming_read.c | 18 ++ src/include/storage/streaming_read.h | 1 + 2 files changed, 19 insertions(+) diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c index 71f2c4a70b6..70f3ef051f8 100644 --- a/src/backend/storage/aio/streaming_read.c +++ b/src/backend/storage/aio/streaming_read.c @@ -610,3 +610,21 @@ pg_streaming_read_free(PgStreamingRead *pgsr) pfree(pgsr); } + + +/* + * Reset a streaming read object by releasing all of the buffers. Note that + * max_ios is not recalculated, so any changes to maintenance_io_concurrency and + * effective_io_concurrency will have no effect. + */ +void +pg_streaming_read_reset(PgStreamingRead *pgsr) +{ + Buffer buffer; + + /* Stop looking ahead, and unpin anything that wasn't consumed. 
*/ + pgsr->finished = true; + while ((buffer = pg_streaming_read_buffer_get_next(pgsr, NULL)) != InvalidBuffer) + ReleaseBuffer(buffer); + pgsr->finished = false; +} diff --git a/src/include/storage/streaming_read.h b/src/include/storage/streaming_read.h index c4d3892bb26..63cef719e42 100644 --- a/src/include/storage/streaming_read.h +++ b/src/include/storage/streaming_read.h @@ -48,5 +48,6 @@ extern PgStreamingRead *pg_streaming_read_buffer_alloc(int flags, extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr); extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_private); extern void pg_streaming_read_free(PgStreamingRead *pgsr); +extern void pg_streaming_read_reset(PgStreamingRead *pgsr); #endif -- 2.40.1 From 4b6d9059aa32625e91d38f1d414ec514d4073197 Mon Sep 17 00:00:00 2001 From: Melanie Plageman Date: Mon, 29 Jan 2024 11:50:01 -0500 Subject: [PATCH v4 2/5] Replace blocks with buffers in heapgettup control flow Future commits will introduce the streaming read API and the sequential scan streaming read API user. Streaming read API users implement a callback which returns the next block to read. Sequential scans previously looped through the blocks in the relation, synchronously reading in a block and then processing it. An InvalidBlockNumber returned by heapgettup_advance_block() meant that the relation was exhausted and all blocks had been processed. The streaming read API may exhaust the blocks in a relation (having read all of them into buffers) before they have all be
Re: BitmapHeapScan streaming read user and prelim refactoring
On Sat, Mar 2, 2024 at 5:51 PM Tomas Vondra wrote: > > On 3/2/24 23:11, Melanie Plageman wrote: > > On Fri, Mar 1, 2024 at 2:31 PM Melanie Plageman > > wrote: > >> > >> ... > >> > >> Hold the phone on this one. I realized why I moved > >> BitmapAdjustPrefetchIterator after table_scan_bitmap_next_block() in > >> the first place -- master calls BitmapAdjustPrefetchIterator after the > >> tbm_iterate() for the current block -- otherwise with eic = 1, it > >> considers the prefetch iterator behind the current block iterator. I'm > >> going to go through and figure out what order this must be done in and > >> fix it. > > > > So, I investigated this further, and, as far as I can tell, for > > parallel bitmapheapscan the timing around when workers decrement > > prefetch_pages causes the performance differences with patch 0010 > > applied. It makes very little sense to me, but some of the queries I > > borrowed from your regression examples are up to 30% slower when this > > code from BitmapAdjustPrefetchIterator() is after > > table_scan_bitmap_next_block() instead of before it. > > > > SpinLockAcquire(&pstate->mutex); > > if (pstate->prefetch_pages > 0) > > pstate->prefetch_pages--; > > SpinLockRelease(&pstate->mutex); > > > > I did some stracing and did see much more time spent in futex/wait > > with this code after the call to table_scan_bitmap_next_block() vs > > before it. (table_scan_bitmap_next_block()) calls ReadBuffer()). > > > > In my branch, I've now moved only the parallel prefetch_pages-- code > > to before table_scan_bitmap_next_block(). > > https://github.com/melanieplageman/postgres/tree/bhs_pgsr > > I'd be interested to know if you see the regressions go away with 0010 > > applied (commit message "Make table_scan_bitmap_next_block() async > > friendly" and sha bfdcbfee7be8e2c461). > > > > I'll give this a try once the runs with MAX_BUFFERS_PER_TRANSFER=1 > complete. But it seems really bizarre that simply moving this code a > little bit would cause such a regression ... Yes, it is bizarre. It also might not be a reproducible performance difference on the cases besides the one I was testing (cyclic dataset, uncached, eic=8, matches 16+, distinct=100, rows=1, 4 parallel workers). But even if it only affects that one case, it still had a major, reproducible performance impact to move those 5 lines before and after table_scan_bitmap_next_block(). The same number of reads and fadvises are being issued overall. However, I did notice that the pread calls are skewed when the those lines of code are after table_scan_bitmap_next_block() -- fewer of the workers are doing more of the reads. Perhaps this explains what is taking longer. Why those workers would end up doing more of the reads, I don't quite know. - Melanie
Re: Synchronizing slots from primary to standby
On Sat, Mar 2, 2024 at 2:51 PM Zhijie Hou (Fujitsu) wrote: > > On Friday, March 1, 2024 12:23 PM Peter Smith wrote: > > ... > > == > > src/backend/replication/slot.c > > > > 2. validate_standby_slots > > > > + else if (!ReplicationSlotCtl) > > + { > > + /* > > + * We cannot validate the replication slot if the replication slots' > > + * data has not been initialized. This is ok as we will validate the > > + * specified slot when waiting for them to catch up. See > > + * StandbySlotsHaveCaughtup for details. > > + */ > > + } > > + else > > + { > > + /* > > + * If the replication slots' data have been initialized, verify if the > > + * specified slots exist and are logical slots. > > + */ > > + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED); > > > > IMO that 2nd comment does not need to say "If the replication slots' > > data have been initialized," because that is implicit from the if/else. > > Removed. I only meant to suggest removing the 1st part, not the entire comment. I thought it is still useful to have a comment like: /* Check that the specified slots exist and are logical slots.*/ == And, here are some review comments for v103-0001. == Commit message 1. Additionally, The SQL functions pg_logical_slot_get_changes, pg_logical_slot_peek_changes and pg_replication_slot_advance are modified to wait for the replication slots mentioned in 'standby_slot_names' to catch up before returning. ~ (use the same wording as previous in this message) /mentioned in/specified in/ == doc/src/sgml/config.sgml 2. +Additionally, when using the replication management functions + +pg_replication_slot_advance, + +pg_logical_slot_get_changes, and + +pg_logical_slot_peek_changes, +with failover enabled logical slots, the functions will wait for the +physical slots specified in standby_slot_names to +confirm WAL receipt before proceeding. + It says "the ... functions" twice so maybe reword it slightly. BEFORE Additionally, when using the replication management functions pg_replication_slot_advance, pg_logical_slot_get_changes, and pg_logical_slot_peek_changes, with failover enabled logical slots, the functions will wait for the physical slots specified in standby_slot_names to confirm WAL receipt before proceeding. SUGGESTION Additionally, the replication management functions pg_replication_slot_advance, pg_logical_slot_get_changes, and pg_logical_slot_peek_changes, when used with failover enabled logical slots, will wait for the physical slots specified in standby_slot_names to confirm WAL receipt before proceeding. (Actually the "will wait ... before proceeding" is also a bit tricky, so below is another possible rewording) SUGGESTION #2 Additionally, the replication management functions pg_replication_slot_advance, pg_logical_slot_get_changes, and pg_logical_slot_peek_changes, when used with failover enabled logical slots, will block until all physical slots specified in standby_slot_names have confirmed WAL receipt. ~~~ 3. + + + Value * is not accepted as it is inappropriate to + block logical replication for physical slots that either lack + associated standbys or have standbys associated that are not enabled + for replication slot synchronization. (see + ). + + Why does the document need to provide an excuse/reason for the rule? You could just say something like: SUGGESTION The slots must be named explicitly. For example, specifying wildcard values like * is not permitted. == doc/src/sgml/func.sgml 4. 
@@ -28150,7 +28150,7 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset - + pg_logical_slot_get_changes @@ -28177,7 +28177,7 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset - + pg_logical_slot_peek_changes @@ -28232,7 +28232,7 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset - + pg_replication_slot_advance Should these 3 functions say something about how their behaviour is affected by 'standby_slot_names' and give a link back to the GUC 'standby_slot_names' docs? == src/backend/replication/slot.c 5. StandbySlotsHaveCaughtup + if (!slot) + { + /* + * If the provided slot does not exist, report a message and exit + * the loop. It is possible for a user to specify a slot name in + * standby_slot_names that does not exist just before the server + * startup. The GUC check_hook(validate_standby_slots) cannot + * validate such a slot during startup as the ReplicationSlotCtl + * shared memory is not initialized at that time. It is also + * possible for a user to drop the slot in standby_slot_names + * afterwards. + */ 5a. Minor rewording to ma
Re: pub/sub - specifying optional parameters without values.
On Fri, Jan 12, 2024 at 4:07 PM Peter Smith wrote: > > On Mon, Jan 30, 2023 at 8:36 AM Tom Lane wrote: > > > > Zheng Li writes: > > > The behavior is due to the following code > > > https://github.com/postgres/postgres/blob/master/src/backend/commands/define.c#L113 > > > > Yeah, so you can grep for places that have this behavior by looking > > for defGetBoolean calls ... and there are quite a few. That leads > > me to the conclusion that we'd better invent a fairly stylized > > documentation solution that we can plug into a lot of places, > > rather than thinking of slightly different ways to say it and > > places to say it. I'm not necessarily opposed to Peter's desire > > to fix replication-related commands first, but we have more to do > > later. > > > > I'm also not that thrilled with putting the addition up at the top > > of the relevant text. This behavior is at least two decades old, > > so if we've escaped documenting it at all up to now, it can't be > > that important to most people. > > > > I also notice that ALTER SUBSCRIPTION has fully three different > > sub-sections with about equal claims on this note, if we're going > > to stick it directly into the affected option lists. > > > > That all leads me to propose that we add the new text at the end of > > the Parameters in the affected man pages. So about > > like the attached. (I left out alter_publication.sgml, as I'm not > > sure it needs its own copy of this text --- it doesn't describe > > individual parameters at all, just refer to CREATE PUBLICATION.) > > > > regards, tom lane > > > > Hi, > > Here is a similar update for another page: "55.4 Streaming Replication > Protocol" [0]. This patch was prompted by a review comment reply at > [1] (#2). > > I've used text almost the same as the boilerplate text added by the > previous commit [2] > > ~ > > PSA patch v4. > > == > [0] https://www.postgresql.org/docs/devel/protocol-replication.html > [1] > https://www.postgresql.org/message-id/OS0PR01MB571663BCE8B28597D462FADE946A2%40OS0PR01MB5716.jpnprd01.prod.outlook.com > [2] > https://github.com/postgres/postgres/commit/ec7e053a98f39a9e3c7e6d35f0d2e83933882399 > > Kind Regards, > Peter Smith. > Fujitsu Australia Bump. Kind Regards, Peter Smith. Fujitsu Australia
Re: BitmapHeapScan streaming read user and prelim refactoring
On Sat, Mar 2, 2024 at 6:59 PM Tomas Vondra wrote: > > > > On 3/1/24 17:51, Melanie Plageman wrote: > > On Fri, Mar 1, 2024 at 9:05 AM Tomas Vondra > > wrote: > >> > >> On 3/1/24 02:18, Melanie Plageman wrote: > >>> On Thu, Feb 29, 2024 at 6:44 PM Tomas Vondra > >>> wrote: > > On 2/29/24 23:44, Tomas Vondra wrote: > 1) On master there's clear difference between eic=0 and eic=1 cases, but > on the patched build there's literally no difference - for example the > "uniform" distribution is clearly not great for prefetching, but eic=0 > regresses to eic=1 poor behavior). > >>> > >>> Yes, so eic=0 and eic=1 are identical with the streaming read API. > >>> That is, eic 0 does not disable prefetching. Thomas is going to update > >>> the streaming read API to avoid issuing an fadvise for the last block > >>> in a range before issuing a read -- which would mean no prefetching > >>> with eic 0 and eic 1. Not doing prefetching with eic 1 actually seems > >>> like the right behavior -- which would be different than what master > >>> is doing, right? > >> > >> I don't think we should stop doing prefetching for eic=1, or at least > >> not based just on these charts. I suspect these "uniform" charts are not > >> a great example for the prefetching, because it's about distribution of > >> individual rows, and even a small fraction of rows may match most of the > >> pages. It's great for finding strange behaviors / corner cases, but > >> probably not a sufficient reason to change the default. > > > > Yes, I would like to see results from a data set where selectivity is > > more correlated to pages/heap fetches. But, I'm not sure I see how > > that is related to prefetching when eic = 1. > > > >> I think it makes sense to issue a prefetch one page ahead, before > >> reading/processing the preceding one, and it's fairly conservative > >> setting, and I assume the default was chosen for a reason / after > >> discussion. > > > > Yes, I suppose the overhead of an fadvise does not compare to the IO > > latency of synchronously reading that block. Actually, I bet the > > regression I saw by accidentally moving BitmapAdjustPrefetchIterator() > > after table_scan_bitmap_next_block() would be similar to the > > regression introduced by making eic = 1 not prefetch. > > > > When you think about IO concurrency = 1, it doesn't imply prefetching > > to me. But, I think we want to do the right thing and have parity with > > master. > > > >> My suggestion would be to keep the master behavior unless not practical, > >> and then maybe discuss changing the details later. The patch is already > >> complicated enough, better to leave that discussion for later. > > > > Agreed. Speaking of which, we need to add back use of tablespace IO > > concurrency for the streaming read API (which is used by > > BitmapHeapScan in master). > > > >>> With very low selectivity, you are less likely to get readahead > >>> (right?) and similarly less likely to be able to build up > 8kB IOs -- > >>> which is one of the main value propositions of the streaming read > >>> code. I imagine that this larger read benefit is part of why the > >>> performance is better at higher selectivities with the patch. This > >>> might be a silly experiment, but we could try decreasing > >>> MAX_BUFFERS_PER_TRANSFER on the patched version and see if the > >>> performance gains go away. > >> > >> Sure, I can do that. Do you have any particular suggestion what value to > >> use for MAX_BUFFERS_PER_TRANSFER? 
> > > > I think setting it to 1 would be the same as always master -- doing > > only 8kB reads. The only thing about that is that I imagine the other > > streaming read code has some overhead which might end up being a > > regression on balance even with the prefetching if we aren't actually > > using the ranges/vectored capabilities of the streaming read > > interface. Maybe if you just run it for one of the very obvious > > performance improvement cases? I can also try this locally. > > > > Here's some results from a build with > > #define MAX_BUFFERS_PER_TRANSFER 1 > > There are three columns: > > - master > - patched (original patches, with MAX_BUFFERS_PER_TRANSFER=128kB) > - patched-single (MAX_BUFFERS_PER_TRANSFER=8kB) > > The color scales are always branch compared to master. > > I think the expectation was that setting the transfer to 1 would make it > closer to master, reducing some of the regressions. But in practice the > effect is the opposite. > > - In "cached" runs, this eliminates the small improvements (light > green), but leaves the regressions behind. For cached runs, I actually would expect that MAX_BUFFERS_PER_TRANSFER would eliminate the regressions. Pinning more buffers will only hurt us for cached workloads. This is evidence that we may need to control the number of pinned buffers differently when there has been a run of fully cached blocks. > - In "uncached" runs, this exacerbates the regressions, particularly for > low se
psql: fix variable existence tab completion
Hello hackers, psql has the :{?name} syntax for testing a psql variable existence. But currently doing \echo :{?VERB doesn't trigger tab completion. This patch fixes it. I've also included a TAP test. Best regards, Steve Chavez From adb1f997b67d8ef603017ab34e1b9061e4e229ea Mon Sep 17 00:00:00 2001 From: steve-chavez Date: Sat, 2 Mar 2024 19:06:13 -0500 Subject: [PATCH 1/1] psql: fix variable existence tab completion --- src/bin/psql/t/010_tab_completion.pl | 8 src/bin/psql/tab-complete.c | 4 +++- 2 files changed, 11 insertions(+), 1 deletion(-) diff --git a/src/bin/psql/t/010_tab_completion.pl b/src/bin/psql/t/010_tab_completion.pl index b6575b075e..b45c39f0f5 100644 --- a/src/bin/psql/t/010_tab_completion.pl +++ b/src/bin/psql/t/010_tab_completion.pl @@ -413,6 +413,14 @@ check_completion( clear_query(); +# check completion for psql variable test +check_completion( + "\\echo :{?VERB\t", + qr/:\{\?VERBOSITY} /, + "complete a psql variable test"); + +clear_query(); + # check no-completions code path check_completion("blarg \t\t", qr//, "check completion failure path"); diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c index ada711d02f..a16dac9e73 100644 --- a/src/bin/psql/tab-complete.c +++ b/src/bin/psql/tab-complete.c @@ -76,7 +76,7 @@ #endif /* word break characters */ -#define WORD_BREAKS "\t\n@><=;|&{() " +#define WORD_BREAKS "\t\n@><=;|&() " /* * Since readline doesn't let us pass any state through to the tab completion @@ -1785,6 +1785,8 @@ psql_completion(const char *text, int start, int end) matches = complete_from_variables(text, ":'", "'", true); else if (text[1] == '"') matches = complete_from_variables(text, ":\"", "\"", true); + else if (text[1] == '{' && text[2] == '?') + matches = complete_from_variables(text, ":{?", "}", true); else matches = complete_from_variables(text, ":", "", true); } -- 2.40.1
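For reference, here is the construct being completed; :{?name} expands to TRUE or FALSE depending on whether the variable is defined:

    \set myvar 42
    \echo :{?myvar}
    TRUE
    \echo :{?does_not_exist}
    FALSE
    \if :{?VERBOSITY}
      \echo VERBOSITY is a built-in variable, so this branch always runs
    \endif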
Re: Shared detoast Datum proposal
Hi Nikita, > > Have you considered another one - altering pg_detoast_datum > (actually, it would be the detoast_attr function) and saving > detoasted datums in a detoast context derived > from the query context? > > We have just enough information at this step to identify > the datum - the toast relation id and value id - and could > keep links to these detoasted values in, say, a linked list > or hash table. That way we would avoid altering the executor > code, and all detoast-related code would reside within > the detoast source files. I think you are talking about the approach Tomas suggested. I am really afraid of coming across as too opinionated, but I do have some concerns about this approach, as I stated here [1]. It looks like my concerns are still not addressed - or the concerns themselves are unfounded, which is really possible, I think. [1] https://www.postgresql.org/message-id/875xyb1a6q.fsf%40163.com -- Best Regards Andy Fan
Re: Synchronizing slots from primary to standby
On Sun, Mar 3, 2024 at 5:17 AM Peter Smith wrote: > > ~~~ > > 3. > + > + > + Value * is not accepted as it is inappropriate to > + block logical replication for physical slots that either lack > + associated standbys or have standbys associated that are not enabled > + for replication slot synchronization. (see > + linkend="logicaldecoding-replication-slots-synchronization"/>). > + > + > > Why does the document need to provide an excuse/reason for the rule? > You could just say something like: > > SUGGESTION > The slots must be named explicitly. For example, specifying wildcard > values like * is not permitted. > I would like to document the reason somewhere, if not in docs, then let's write a comment for the same in code. > == > > ~~~ > > 9. NeedToWaitForWal > > + /* > + * Check if the standby slots have caught up to the flushed position. It > + * is good to wait up to flushed position and then let it send the changes > + * to logical subscribers one by one which are already covered in flushed > + * position without needing to wait on every change for standby > + * confirmation. Note that after receiving the shutdown signal, an ERROR > + * is reported if any slots are dropped, invalidated, or inactive. This > + * measure is taken to prevent the walsender from waiting indefinitely. > + */ > + if (NeedToWaitForStandby(target_lsn, flushed_lsn, wait_event)) > + return true; > > I felt it was confusing things for this function to also call to the > other one -- it seems an overlapping/muddling of the purpose of these. > I think it will be easier to understand if the calling code just calls > to one or both of these functions as required. > I felt otherwise because the caller has to call these functions at more than one place which makes the caller's code difficult to follow. It is better to encapsulate the computation of wait_event. -- With Regards, Amit Kapila.
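To illustrate what I mean, the encapsulated form looks roughly like this (a simplified sketch, not the exact patch; the WAL wait-event name here is illustrative):

    static bool
    NeedToWaitForWal(XLogRecPtr target_lsn, XLogRecPtr flushed_lsn,
                     uint32 *wait_event)
    {
        /*
         * First, do the standbys listed in standby_slot_names still have
         * to confirm receipt up to the flushed position?
         */
        if (NeedToWaitForStandby(target_lsn, flushed_lsn, wait_event))
            return true;

        /* If not, do we need more WAL to be flushed? */
        if (target_lsn > flushed_lsn)
        {
            *wait_event = WAIT_EVENT_WAL_SENDER_WAIT_FOR_WAL; /* illustrative */
            return true;
        }

        *wait_event = 0;
        return false;
    }

This way each caller gets wait_event computed in one place instead of repeating both checks at every call site.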
Re: Documentation: warn about two_phase when altering a subscription
On Sat, Mar 2, 2024 at 11:44 PM Andrey M. Borodin wrote: > > > On 26 Feb 2024, at 14:14, Bertrand Drouvot > > wrote: > > > > As the patch as it is now looks good to Amit (see [1]), I prefer to let him > > decide which wording he pefers to push. > > Added Amit to cc, just to be sure. Thanks! > I'll look into it during this CF. -- With Regards, Amit Kapila.
[docs] revise ORDER BY documentation
I recently encountered some odd behavior with a query both selecting and sorting by `random()`. When I posted about it on pgsql-bugs ^1, David Johnston and Tom Lane provided some very detailed explanations as to what was happening, but weren't sure whether or where information about it could live comfortably in the docs. I think it's a useful addition; it's not an everyday occurrence, but I'm very much not the first person to run into it.

After a bit of looking, I think I've found a reasonable location. This patch revises https://www.postgresql.org/docs/current/queries-order.html to discuss sort expressions and options separately, and fits a caveat based on Tom's suggested language (with an example) into the former section. There are a few other minor tweaks included here:

- note that `*` is not an expression
- consolidate output column examples
- mention non-column sort expressions

I did write a query demonstrating the `group by` case Tom mentioned, but expect that one's a lot less common.

1: https://www.postgresql.org/message-id/CZHAF947QQQO.27MAUK2SVMBXW%40nmfay.com

From cb336d5f5e5e23704e14f42eb09d4bf3f1c7e10e Mon Sep 17 00:00:00 2001
From: Dian M Fay
Date: Sat, 2 Mar 2024 22:43:05 -0500
Subject: [PATCH] Revise ORDER BY documentation

Discuss sort expressions and options separately, and add clarifications
around `*` not being an expression, the use of non-column sort
expressions, and the potentially unexpected behavior of volatile
expressions appearing both in ORDER BY and in the select list.
---
 doc/src/sgml/queries.sgml | 161 +++---
 1 file changed, 99 insertions(+), 62 deletions(-)

diff --git a/doc/src/sgml/queries.sgml b/doc/src/sgml/queries.sgml
index 648b283b06..066d0a17ab 100644
--- a/doc/src/sgml/queries.sgml
+++ b/doc/src/sgml/queries.sgml
@@ -1822,75 +1822,112 @@
 SELECT select_list
     ORDER BY sort_expression1 ASC | DESC NULLS { FIRST | LAST }
              , sort_expression2 ASC | DESC NULLS { FIRST | LAST } ...

-   The sort expression(s) can be any expression that would be valid in the
-   query's select list. An example is:
+   When more than one expression is specified, the later values are used to
+   sort rows that are equal according to the earlier values.
+
+
+
+  Types of Sort Expression
+
+
+A sort_expression can be any expression (see
+; * is not an
+expression) that would be valid in the query's select list. This is most
+often a column or an operation on a column belonging to a relation
+referenced in the query, such as:

 SELECT a, b FROM table1 ORDER BY a + b, c;

-   When more than one expression is specified,
-   the later values are used to sort rows that are equal according to the
-   earlier values. Each expression can be followed by an optional
-   ASC or DESC keyword to set the sort direction to
-   ascending or descending. ASC order is the default.
-   Ascending order puts smaller values first, where
-   smaller is defined in terms of the
-   < operator. Similarly, descending order is
-   determined with the > operator.
-
-
-   Actually, PostgreSQL uses the default B-tree
-   operator class for the expression's data type to determine the sort
-   ordering for ASC and DESC. Conventionally,
-   data types will be set up so that the < and
-   > operators correspond to this sort ordering,
-   but a user-defined data type's designer could choose to do something
-   different.
-
-
+Columns do not have to be projected in the select list to be used in
+ORDER BY. Non-column expressions may also be used;
+ORDER BY random() is a common technique to shuffle
+output records.
+
-
-   The NULLS FIRST and NULLS LAST options can be
-   used to determine whether nulls appear before or after non-null values
-   in the sort ordering. By default, null values sort as if larger than any
-   non-null value; that is, NULLS FIRST is the default for
-   DESC order, and NULLS LAST otherwise.
-
+
+A sort_expression can also be the column label
+or number of an output column. The output column name must stand alone,
+that is, it cannot be used in an expression. For example:
+
+SELECT a + b AS sum, c FROM table1 ORDER BY sum;     -- sorts by a + b
+SELECT a, max(b) FROM table1 GROUP BY a ORDER BY 1;  -- sorts by a
+SELECT a + b AS sum, c FROM table1 ORDER BY sum + c; -- error
+
+This restriction on expressions involving output columns is made to reduce
+ambiguity. There is still ambiguity if an ORDER BY item
+is a simple name that could match either an output column name or a column
+from the table expression. The output column is used in such cases. This
+would only cause confusion if you use AS to rename an
+output column to match some other table column's name.
+
-
-   Note that the ordering options are considered independently for each
-   sort column. For example ORDER BY x, y DESC means
-   ORDER B
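If I've understood Tom's explanation correctly, the caveat the patch documents can be demonstrated directly (a sketch; the actual values vary per run):

    -- The ORDER BY expression is textually identical to the select-list
    -- item, so both refer to a single evaluation of random(): the rows
    -- come out ordered by the displayed value rather than shuffled
    -- independently of it.
    SELECT random() FROM generate_series(1, 5) ORDER BY random();

    -- A differently spelled sort expression is evaluated separately,
    -- so the displayed values are no longer sorted.
    SELECT random() FROM generate_series(1, 5) ORDER BY random() + 0;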
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Mar 1, 2024 at 3:01 PM Masahiko Sawada wrote:
>
> On Thu, Feb 29, 2024 at 8:43 PM John Naylor wrote:
> > + ts->rt_context = AllocSetContextCreate(CurrentMemoryContext,
> > +"tidstore storage",
> >
> > "tidstore storage" sounds a bit strange -- maybe look at some other
> > context names for ideas.
>
> Agreed. How about "tidstore's radix tree"?

That might be okay. I'm now thinking "TID storage". On that note, one improvement needed when we polish tidstore.c is to make sure it's spelled "TID" in comments, like other files do already.

> > - leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx, allocsize);
> > + leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx != NULL
> > +? tree->leaf_ctx
> > +: tree->context, allocsize);
> >
> > Instead of branching here, can we copy "context" to "leaf_ctx" when
> > necessary (those names should look more like each other, btw)? I think
> > that means anything not covered by this case:
> >
> > +#ifndef RT_VARLEN_VALUE_SIZE
> > + if (sizeof(RT_VALUE_TYPE) > sizeof(RT_PTR_ALLOC))
> > + tree->leaf_ctx = SlabContextCreate(ctx,
> > +RT_STR(RT_PREFIX) "radix_tree leaf contex",
> > +RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)),
> > +sizeof(RT_VALUE_TYPE));
> > +#endif /* !RT_VARLEN_VALUE_SIZE */
> >
> > ...also, we should document why we're using slab here. On that, I
> > don't recall why we are? We've never had a fixed-length type test case
> > on 64-bit, so it wasn't because it won through benchmarking. It seems
> > a hold-over from the days of "multi-value leaves". Is it to avoid the
> > possibility of space wastage with non-power-of-two size types?
>
> Yes, it matches my understanding.

There are two issues quoted here, so I'm not sure if you mean both or only the last one... For the latter, I'm not sure it makes sense to have code and #ifdef's to force slab for large-enough fixed-length values just because we can. There may never be such a use-case anyway. I'm not against it either, but it seems like a premature optimization.

> > For this stanza that remains unchanged:
> >
> > for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
> > {
> > MemoryContextDelete(tree->node_slabs[i]);
> > }
> >
> > if (tree->leaf_ctx)
> > {
> > MemoryContextDelete(tree->leaf_ctx);
> > }
> >
> > ...is there a reason we can't just delete tree->ctx, and let that
> > recursively delete child contexts?
>
> I thought that considering the RT_CREATE doesn't create its own memory
> context but just uses the passed context, it might be a bit unusable
> to delete the passed context in the radix tree code. For example, if a
> caller creates a radix tree (or tidstore) on a memory context and
> wants to recreate it again and again, he also needs to re-create the
> memory context together. It might be okay if we leave comments on
> RT_CREATE as a side effect, though. This is the same reason why we
> don't destroy tree->dsa in RT_FREE(). And, as for RT_FREE_RECURSE(),

Right, I should have said "reset". Resetting a context will delete its children as well, so it should work to reset the tree context, and we don't have to know whether that context actually contains leaves at all. That should allow copying "tree context" to "leaf context" in the case where we have no special context for leaves.
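To be explicit about the reset semantics, in miniature (a minimal sketch; the value type in the slab sizing is a placeholder):

    MemoryContext tree_ctx;
    MemoryContext leaf_ctx;

    tree_ctx = AllocSetContextCreate(CurrentMemoryContext,
    								 "radix tree",
    								 ALLOCSET_DEFAULT_SIZES);
    leaf_ctx = SlabContextCreate(tree_ctx,
    							 "radix tree leaves",
    							 SLAB_DEFAULT_BLOCK_SIZE,
    							 sizeof(SomeFixedLenValue)); /* placeholder type */

    /* ... allocate nodes and leaves ... */

    /*
     * MemoryContextReset() releases everything allocated in tree_ctx and
     * deletes its child contexts, leaf_ctx included, so this one call
     * works whether or not a separate leaf context was ever created.
     */
    MemoryContextReset(tree_ctx);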
Re: Shared detoast Datum proposal
Hi,

Here is an updated version. The main changes are:

1. a shared_detoast_datum.org file which shows the latest design and the pending items under discussion.
2. I removed the slot->pre_detoast_attrs entirely.
3. handled some pg_detoast_datum_slice use cases.
4. some implementation improvements.

commit 66c64c197a5dab97a563be5a291127e4c5d6841d (HEAD -> shared_detoast_value)
Author: yizhi.fzh
Date: Sun Mar 3 13:48:25 2024 +0800

    shared detoast datum

    See the overall design & alternative design & testing in
    shared_detoast_datum.org

In shared_detoast_datum.org, I added an alternative-design section for the TOAST cache idea.

--
Best Regards
Andy Fan

>From 66c64c197a5dab97a563be5a291127e4c5d6841d Mon Sep 17 00:00:00 2001
From: "yizhi.fzh"
Date: Sun, 3 Mar 2024 13:48:25 +0800
Subject: [PATCH v9 1/1] shared detoast datum

See the overall design & alternative design & testing in
shared_detoast_datum.org
---
 src/backend/access/common/detoast.c           |  68 +-
 src/backend/access/common/toast_compression.c |  10 +-
 src/backend/executor/execExpr.c               |  60 +-
 src/backend/executor/execExprInterp.c         | 179 ++
 src/backend/executor/execTuples.c             | 127
 src/backend/executor/execUtils.c              |   2 +
 src/backend/executor/nodeHashjoin.c           |   2 +
 src/backend/executor/nodeMergejoin.c          |   2 +
 src/backend/executor/nodeNestloop.c           |   1 +
 src/backend/executor/shared_detoast_datum.org | 203 ++
 src/backend/jit/llvm/llvmjit_expr.c           |  26 +-
 src/backend/jit/llvm/llvmjit_types.c          |   1 +
 src/backend/optimizer/plan/createplan.c       | 107 +++-
 src/backend/optimizer/plan/setrefs.c          | 590 +++---
 src/include/access/detoast.h                  |   3 +
 src/include/access/toast_compression.h        |   4 +-
 src/include/executor/execExpr.h               |  12 +
 src/include/executor/tuptable.h               |  14 +
 src/include/nodes/execnodes.h                 |  14 +
 src/include/nodes/plannodes.h                 |  53 ++
 src/tools/pgindent/typedefs.list              |   2 +
 21 files changed, 1342 insertions(+), 138 deletions(-)
 create mode 100644 src/backend/executor/shared_detoast_datum.org

diff --git a/src/backend/access/common/detoast.c b/src/backend/access/common/detoast.c
index 3547cdba56..acc9644689 100644
--- a/src/backend/access/common/detoast.c
+++ b/src/backend/access/common/detoast.c
@@ -22,11 +22,11 @@
 #include "utils/expandeddatum.h"
 #include "utils/rel.h"

-static struct varlena *toast_fetch_datum(struct varlena *attr);
+static struct varlena *toast_fetch_datum(struct varlena *attr, MemoryContext ctx);
 static struct varlena *toast_fetch_datum_slice(struct varlena *attr,
 											   int32 sliceoffset,
 											   int32 slicelength);
-static struct varlena *toast_decompress_datum(struct varlena *attr);
+static struct varlena *toast_decompress_datum(struct varlena *attr, MemoryContext ctx);
 static struct varlena *toast_decompress_datum_slice(struct varlena *attr, int32 slicelength);

 /* --
@@ -42,7 +42,7 @@ static struct varlena *toast_decompress_datum_slice(struct varlena *attr, int32
  * --
  */
 struct varlena *
-detoast_external_attr(struct varlena *attr)
+detoast_external_attr_ext(struct varlena *attr, MemoryContext ctx)
 {
 	struct varlena *result;

@@ -51,7 +51,7 @@ detoast_external_attr(struct varlena *attr)
 		/*
 		 * This is an external stored plain value
 		 */
-		result = toast_fetch_datum(attr);
+		result = toast_fetch_datum(attr, ctx);
 	}
 	else if (VARATT_IS_EXTERNAL_INDIRECT(attr))
 	{
@@ -68,13 +68,13 @@ detoast_external_attr(struct varlena *attr)

 		/* recurse if value is still external in some other way */
 		if (VARATT_IS_EXTERNAL(attr))
-			return detoast_external_attr(attr);
+			return detoast_external_attr_ext(attr, ctx);

 		/*
 		 * Copy into the caller's memory context, in case caller tries to
 		 * pfree the result.
 		 */
-		result = (struct varlena *) palloc(VARSIZE_ANY(attr));
+		result = (struct varlena *) MemoryContextAlloc(ctx, VARSIZE_ANY(attr));
 		memcpy(result, attr, VARSIZE_ANY(attr));
 	}
 	else if (VARATT_IS_EXTERNAL_EXPANDED(attr))
 	{
@@ -87,7 +87,7 @@ detoast_external_attr(struct varlena *attr)

 		eoh = DatumGetEOHP(PointerGetDatum(attr));
 		resultsize = EOH_get_flat_size(eoh);
-		result = (struct varlena *) palloc(resultsize);
+		result = (struct varlena *) MemoryContextAlloc(ctx, resultsize);
 		EOH_flatten_into(eoh, (void *) result, resultsize);
 	}
 	else
@@ -101,32 +101,45 @@ detoast_external_attr(struct varlena *attr)
 	return result;
 }

+struct varlena *
+detoast_external_attr(struct varlena *attr)
+{
+	return detoast_external_attr_ext(attr, CurrentMemoryContext);
+}
+
 /* --
- * detoast_attr -
+ * detoast_attr_ext -
  *
  * Public entry point to get back a toasted value from compression
  * or external storage. The result is always non-extended varlena form.
  *
+ * ctx: The
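The net effect of the new *_ext entry points, as far as the diff shows, is that a caller can pin the detoasted copy to a context of its choosing instead of CurrentMemoryContext. Schematically (the calling code here is hypothetical):

    /*
     * Detoast into the long-lived per-query context, so the flattened
     * copy can be shared by later expression evaluations instead of
     * dying with the per-tuple context.
     */
    struct varlena *flat =
    	detoast_external_attr_ext(attr, estate->es_query_cxt);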
Re: Make query cancellation keys longer
On Fri, 1 Mar 2024 at 15:19, Peter Eisentraut wrote: > > One complication with this was that because we no longer know how long > > the key should be, 4-bytes or something longer, until the backend has > > performed the protocol negotiation, we cannot generate the key in the > > postmaster before forking the process anymore. > > Maybe this would be easier if it's a protocol version number change, > since that is sent earlier than protocol extensions? Protocol version and protocol extensions are both sent in the StartupMessage, so the same complication applies. (But I do agree that a protocol version bump is more appropriate for this type of change)
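To spell out why: both pieces of information travel in the same packet, so neither lets the postmaster learn the negotiated key length any earlier. Schematically, per the protocol docs:

    StartupMessage:
        Int32   length
        Int32   protocol version number (e.g. 196608 for 3.0)
        String  parameter name/value pairs, where names beginning
                with _pq_. request protocol extensions

Either way, the backend only knows which key length was negotiated after parsing this message, i.e. after the fork has already happened.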
a wrong index choose when statistics is out of date
The issue can be reproduced with the following steps:

create table x_events (.., created_at timestamp, a int, b int);

create index idx_1 on x_events(created_at, a);
create index idx_2 on x_events(created_at, b);

query:
select * from x_events where created_at = current_timestamp and b = 1;

Index (created_at, a) rather than (created_at, b) may be chosen for the above query if the statistics suggest that "created_at = current_timestamp" matches no rows; in that case both indexes look equally good, and that belief holds only because the statistics are out of date.

I just ran into this again recently and have two new ideas this time; I'd like to gather some feedback on them.

1. We could let the user declare that the column's value increases over time. The syntax might be:

ALTER TABLE x_events ALTER COLUMN created_at ALWAYS_INCREASED;

Then, for a qual like 'created_at op const', the statistics module could treat it as 'created_at = $1', so the missing statistics would make no difference, and I think the above issue could be avoided. This is different from having the user use a prepared statement directly, because it is possible that we always choose a custom plan; the easiest way to make that happen is planning-time partition pruning.

2. Use some AI approach to forecast data that has not been gathered yet. The training stage could happen during ANALYZE; taking the above case as an example, it might produce a model like "there are 100 rows per second between 9:00 and 18:00, and 2 rows per second in other time ranges".

For now, I think option 1 would be easier to achieve.

--
Best Regards
Andy Fan
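To make the misestimate easier to reproduce, something like the following should show it (a sketch; the id column stands in for the elided ones, and it assumes a burst of new rows arrived after the last ANALYZE):

    CREATE TABLE x_events (id int, created_at timestamp, a int, b int);
    CREATE INDEX idx_1 ON x_events (created_at, a);
    CREATE INDEX idx_2 ON x_events (created_at, b);

    -- populate with older data, then gather statistics
    ANALYZE x_events;

    -- ... insert new rows with created_at near now(), then query before
    -- the next ANALYZE: the planner estimates zero matching rows, so
    -- idx_1 and idx_2 look equally cheap and the "wrong" one may win.
    EXPLAIN
    SELECT * FROM x_events
    WHERE created_at = current_timestamp AND b = 1;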
RE: Synchronizing slots from primary to standby
On Sunday, March 3, 2024 7:47 AM Peter Smith wrote: > > On Sat, Mar 2, 2024 at 2:51 PM Zhijie Hou (Fujitsu) > wrote: > > > > On Friday, March 1, 2024 12:23 PM Peter Smith > wrote: > > > > ... > > > == > > > src/backend/replication/slot.c > > > > > > 2. validate_standby_slots > > > > > > + else if (!ReplicationSlotCtl) > > > + { > > > + /* > > > + * We cannot validate the replication slot if the replication slots' > > > + * data has not been initialized. This is ok as we will validate > > > + the > > > + * specified slot when waiting for them to catch up. See > > > + * StandbySlotsHaveCaughtup for details. > > > + */ > > > + } > > > + else > > > + { > > > + /* > > > + * If the replication slots' data have been initialized, verify if > > > + the > > > + * specified slots exist and are logical slots. > > > + */ > > > + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED); > > > > > > IMO that 2nd comment does not need to say "If the replication slots' > > > data have been initialized," because that is implicit from the if/else. > > > > Removed. > > I only meant to suggest removing the 1st part, not the entire comment. > I thought it is still useful to have a comment like: > > /* Check that the specified slots exist and are logical slots.*/ OK, I misunderstood. Fixed. > > == > > And, here are some review comments for v103-0001. Thanks for the comments. > > == > Commit message > > 1. > Additionally, The SQL functions pg_logical_slot_get_changes, > pg_logical_slot_peek_changes and pg_replication_slot_advance are modified > to wait for the replication slots mentioned in 'standby_slot_names' to catch > up > before returning. > > ~ > > (use the same wording as previous in this message) > > /mentioned in/specified in/ Changed. > > == > doc/src/sgml/config.sgml > > 2. > +Additionally, when using the replication management functions > + > +pg_replication_slot_advance, > + > +pg_logical_slot_get_changes, and > + > +pg_logical_slot_peek_changes, > +with failover enabled logical slots, the functions will wait for the > +physical slots specified in > standby_slot_names to > +confirm WAL receipt before proceeding. > + > > It says "the ... functions" twice so maybe reword it slightly. > > BEFORE > Additionally, when using the replication management functions > pg_replication_slot_advance, pg_logical_slot_get_changes, and > pg_logical_slot_peek_changes, with failover enabled logical slots, the > functions > will wait for the physical slots specified in standby_slot_names to confirm > WAL > receipt before proceeding. > > SUGGESTION > Additionally, the replication management functions > pg_replication_slot_advance, pg_logical_slot_get_changes, and > pg_logical_slot_peek_changes, when used with failover enabled logical slots, > will wait for the physical slots specified in standby_slot_names to confirm > WAL > receipt before proceeding. > > (Actually the "will wait ... before proceeding" is also a bit tricky, so > below is > another possible rewording) > > SUGGESTION #2 > Additionally, the replication management functions > pg_replication_slot_advance, pg_logical_slot_get_changes, and > pg_logical_slot_peek_changes, when used with failover enabled logical slots, > will block until all physical slots specified in standby_slot_names have > confirmed WAL receipt. I prefer the #2 version. > > ~~~ > > 3. 
> + > + > + Value * is not accepted as it is inappropriate to > + block logical replication for physical slots that either lack > + associated standbys or have standbys associated that are not > enabled > + for replication slot synchronization. (see > + linkend="logicaldecoding-replication-slots-synchronization"/>). > + > + > > Why does the document need to provide an excuse/reason for the rule? > You could just say something like: > > SUGGESTION > The slots must be named explicitly. For example, specifying wildcard values > like > * is not permitted. As suggested by Amit, I moved this to code comments. > > == > doc/src/sgml/func.sgml > > 4. > @@ -28150,7 +28150,7 @@ postgres=# SELECT '0/0'::pg_lsn + > pd.segment_number * ps.setting::int + :offset > > > > - > +role="func_table_entry"> > > pg_logical_slot_get_changes > > @@ -28177,7 +28177,7 @@ postgres=# SELECT '0/0'::pg_lsn + > pd.segment_number * ps.setting::int + :offset > > > > - > +role="func_table_entry"> > > pg_logical_slot_peek_changes > > @@ -28232,7 +28232,7 @@ postgres=# SELECT '0/0'::pg_lsn + > pd.segment_number * ps.setting::int + :offset > > > > - > +role="func_table_entry"> > > pg_replication_slot_advance > > > Should these 3 functions say something about how their beh
RE: Synchronizing slots from primary to standby
On Saturday, March 2, 2024 6:55 PM Amit Kapila wrote:
>
> On Sat, Mar 2, 2024 at 9:21 AM Zhijie Hou (Fujitsu) wrote:
> >
> > Apart from the comments, the code in WalSndWaitForWal was refactored a
> > bit to make it neater. Thanks Shveta for helping writing the code and doc.
> >
>
> A few more comments:

Thanks for the comments.

> ==
> 1.
> +# Wait until the primary server logs a warning indicating that it is
> +waiting # for the sb1_slot to catch up.
> +$primary->wait_for_log(
> + qr/replication slot \"sb1_slot\" specified in parameter
> standby_slot_names does not have active_pid/,
> + $offset);
>
> Shouldn't we wait for such a LOG even in the first test as well which
> involves two standbys and two logical subscribers?

Yes, we should. Added.

>
> 2.
> +##
> +# Test that logical replication will wait for the user-created inactive
> +# physical slot to catch up until we remove the slot from standby_slot_names.
> +##
>
> I don't see anything different tested in this test from what we already
> tested in the first test involving two standbys and two logical
> subscribers. Can you please clarify if I am missing something?

I think the intention was to test that the wait loop ends due to a GUC reload, while the first test covers the case where the loop ends due to restart_lsn movement. But it seems we already test the config reload via xx_get_changes() as well, so I can remove it if you agree.

>
> 3.
> Note that after receiving the shutdown signal, an ERROR
> + * is reported if any slots are dropped, invalidated, or inactive. This
> + * measure is taken to prevent the walsender from waiting indefinitely.
> + */
> + if (NeedToWaitForStandby(target_lsn, flushed_lsn, wait_event))
>
> Isn't this part of the comment should be moved inside
> NeedToWaitForStandby()?

Moved.

>
> 4.
> + /*
> + * Update our idea of the currently flushed position only if we are
> + * not waiting for standbys to catch up, otherwise the standby would
> + * have to catch up to a newer WAL location in each cycle.
> + */
> + if (wait_event != WAIT_EVENT_WAIT_FOR_STANDBY_CONFIRMATION)
> + {
>
> This functionality (in function WalSndWaitForWal()) seems to ensure that
> we first wait for the required WAL to be flushed and then wait for
> standbys. If true, we should cover that point in the comments here or
> somewhere in the function WalSndWaitForWal().
>
> Apart from this, I have made a few modifications in the comments.

Thanks. I have reviewed and merged them.

Here is the V104 patch, which addresses the above and Peter's comments.

Best Regards,
Hou zj

v104-0001-Allow-logical-walsenders-to-wait-for-the-physic.patch
Description: v104-0001-Allow-logical-walsenders-to-wait-for-the-physic.patch

v104-0002-Document-the-steps-to-check-if-the-standby-is-r.patch
Description: v104-0002-Document-the-steps-to-check-if-the-standby-is-r.patch
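For anyone trying the patch, the user-visible behavior the tests exercise is roughly the following (a sketch, using the slot names from the TAP test):

    -- on the primary: restrict failover-enabled logical slots to the
    -- progress of the physical slot sb1_slot
    ALTER SYSTEM SET standby_slot_names = 'sb1_slot';
    SELECT pg_reload_conf();

    -- with sb1_slot inactive or lagging, this call blocks (and, per the
    -- new test, the primary logs a warning about sb1_slot) until the
    -- standby catches up or the slot is removed from the list:
    SELECT * FROM pg_logical_slot_get_changes('lsub1_slot', NULL, NULL);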