Re: Ignore 2PC transaction GIDs in query jumbling
On Sun, Aug 13, 2023 at 02:48:22PM +0800, Julien Rouhaud wrote: > On Sun, Aug 13, 2023 at 03:25:33PM +0900, Michael Paquier wrote: >> Perhaps not as much, actually, because I was just reminded that >> DEALLOCATE is something that pg_stat_statements ignores. So this >> makes harder the introduction of tests. > > Maybe it's time to revisit that? According to [1] the reason why > pg_stat_statements currently ignores DEALLOCATE is because it indeed bloated > the entries but also because at that time the suggestion for jumbling only > this > one was really hackish. Good point. The argument of the other thread does not really hold much these days now that query jumbling can happen for all the utility nodes. > Now that we do have a sensible approach to jumble utility statements, I think > it would be beneficial to be able to track those, for instance to be easily > diagnose a driver that doesn't rely on the extended protocol. Fine by me. Would you like to write a patch? I've begun typing an embryon of patch a few days ago, and did not look yet at the rest. Please see the attached. >> Anyway, I guess that your own >> extension modules have a need for a query ID compiled with these >> fields ignored? > > My module has a need to track statements and still work efficiently after > that. > So anything that can lead to virtually an infinite number of > pg_stat_statements > entries is a problem for me, and I guess to all the other modules / tools that > track pg_stat_statements. I guess the reason why my module is still ignoring > DEALLOCATE is because it existed before pg 9.4 (with a homemade queryid as it > wasn't exposed before that), and it just stayed there since with the rest of > still problematic statements. Yeah, probably. -- Michael diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h index 2565348303..b7aeebe3b4 100644 --- a/src/include/nodes/parsenodes.h +++ b/src/include/nodes/parsenodes.h @@ -3925,8 +3925,10 @@ typedef struct ExecuteStmt typedef struct DeallocateStmt { NodeTag type; - char *name; /* The name of the plan to remove */ - /* NULL means DEALLOCATE ALL */ + /* The name of the plan to remove. NULL means DEALLOCATE ALL. 
*/ + char *name pg_node_attr(query_jumble_ignore); + /* token location, or -1 if unknown */ + int location pg_node_attr(query_jumble_location); } DeallocateStmt; /* diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y index b3bdf947b6..680c7f3683 100644 --- a/src/backend/parser/gram.y +++ b/src/backend/parser/gram.y @@ -11936,6 +11936,7 @@ DeallocateStmt: DEALLOCATE name DeallocateStmt *n = makeNode(DeallocateStmt); n->name = $2; + n->location = @2; $$ = (Node *) n; } | DEALLOCATE PREPARE name @@ -11943,6 +11944,7 @@ DeallocateStmt: DEALLOCATE name DeallocateStmt *n = makeNode(DeallocateStmt); n->name = $3; + n->location = @3; $$ = (Node *) n; } | DEALLOCATE ALL @@ -11950,6 +11952,7 @@ DeallocateStmt: DEALLOCATE name DeallocateStmt *n = makeNode(DeallocateStmt); n->name = NULL; + n->location = -1; $$ = (Node *) n; } | DEALLOCATE PREPARE ALL @@ -11957,6 +11960,7 @@ DeallocateStmt: DEALLOCATE name DeallocateStmt *n = makeNode(DeallocateStmt); n->name = NULL; + n->location = -1; $$ = (Node *) n; } ; diff --git a/contrib/pg_stat_statements/expected/utility.out b/contrib/pg_stat_statements/expected/utility.out index 93735d5d85..823d78802b 100644 --- a/contrib/pg_stat_statements/expected/utility.out +++ b/contrib/pg_stat_statements/expected/utility.out @@ -472,6 +472,45 @@ SELECT pg_stat_statements_reset(); (1 row) +-- Execution statements +SELECT 1 as a; + a +--- + 1 +(1 row) + +PREPARE stat_select AS SELECT $1 AS a; +EXECUTE stat_select (1); + a +--- + 1 +(1 row) + +DEALLOCATE stat_select; +PREPARE stat_select AS SELECT $1 AS a; +EXECUTE stat_select (2); + a +--- + 2 +(1 row) + +DEALLOCATE PREPARE stat_select; +DEALLOCATE ALL; +DEALLOCATE PREPARE ALL; +SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C"; + calls | rows | query +---+--+--- + 2 |2 | PREPARE stat_select AS SELECT $1 AS a + 1 |1 | SELECT $1 as a + 1 |1 | SELECT pg_stat_statements_reset() +(3 rows) + +SELECT pg_stat_statements_reset(); + pg_stat_statements_reset +-- + +(1 row) + -- SET statements. -- These use two different strings, still they count as one entry. SET work_mem = '1MB'; diff --git a/contrib/pg_stat_statements/sql/utility.sql b/contrib/pg_stat_statements/sql/utility.sql index 87666d9135..5f7d4a467f 100644 --- a/contrib/pg_stat_statements/sql/utility.sql +++ b/contrib/pg_stat_statements/sql/utility.sql @@ -237,6 +237,19 @@ DROP DOMAIN domain_stats; SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
Re: Make psql's query canceling test simple by using signal() routine of IPC::Run
Hello Yugo-san, Currently, the psql test of query cancelling (src/bin/psql/t/020_cancel.pl) gets the PPID of a running psql by using the "\!" meta command, and sends SIGINT to the process by using "kill". However, IPC::Run provides a signal() routine that sends a signal to a running process, so I think we can rewrite the test in a simpler fashion, as in the attached patch. What do you think about it? I'm the one who pointed out signal(), so I'm a little biased; nevertheless, regarding the patch: Patch applies with "patch". Test code is definitely much simpler. Test run is ok on my Ubuntu laptop. Let's drop 25 lines of perl! -- Fabien.
Re: pgbench: allow to exit immediately when any client is aborted
Hello Yugo-san, I attached the updated patch v3 including the changes above, a test, and a fix for the typo you pointed out. I'm sorry, but the test in the previous patch was incorrect. I attached the correct one. About pgbench exit on abort v3: Patch does not "git apply", but is ok with "patch", although there are some minor warnings. Compiles. Code is ok. Tests are ok. About the test: the code is simple and triggers an error quickly, after just a few transactions, which is good. I'd go with a minimal "-c 2 -j 2 -t 10" instead of "-c 4 -j 4 -T 10". -- Fabien.
pgbench with libevent?
Hello devs,

Pgbench is managing client I/Os manually with select or poll. Much of this could be managed by libevent.

Pros:

1. libevent is portable, stable, and widely used (eg Chromium, Memcached, PgBouncer).

2. libevent implements more I/O wait methods, which may be more efficient on some platforms (eg FreeBSD kqueue, Windows wepoll in libevent 2.2 alpha), and hides portability issues.

3. it would remove significant portions of unattractive pgbench code, esp. in threadRun, and around the socket/poll abstraction and portability layer.

4. depending on the number of event loops, the client load could be shared more evenly. Currently a thread only manages its own clients, so some client I/Os may be waiting to be processed while other threads could be available to process them.

Cons:

1. it adds a libevent dependency to postgres. This may be a no-go from the start.

2. this is a significant refactoring project which implies a new internal architecture and adds new code to process and generate appropriate events.

3. libevent's ability to function efficiently in a highly multithreaded environment is unclear. Should there be one event queue which generates a shared work queue? Or several event queues, one per thread (which would remove the sharing pro)? Or something in between? Some experiments and configurability may be desirable. This may also have an impact on the pgbench user interface and output depending on the result, eg there may be specialized event and worker threads, some statistics may be slightly different, new options may be needed…

4. libevent development seems sluggish: the last bugfix was published 3 years ago and version 2.2 has been baking for years, but the development seems lively (+100 contributors).

Neutral?

1. BSD 3-clause license.

Is pros > cons, or not? Other thoughts, pros, cons?

-- Fabien.
Re: run pgindent on a regular basis / scripted manner
On 2023-08-12 Sa 20:53, Peter Geoghegan wrote: On Sat, Aug 12, 2023 at 5:20 PM Tom Lane wrote: Hm. I was envisioning that we should expect committers to deal with this, not original patch submitters. So that's an argument against including it in the CI tests. But I'm in favor of anything we can do to make it more painless for committers to fix up patch indentation. I agree with this. Making it a special responsibility for committers comes with the same set of problems that we see with catversion bumps. People are much more likely to forget to do something that must happen last. After I'd been caught by this once or twice I implemented a git hook test for that too - in fact it was the first hook I did. It's not perfect but it's saved me a couple of times:

check_catalog_version () {
    # only do this on master
    test "$branch" = "master" || return 0
    case "$files" in
        *src/include/catalog/catversion.h*)
            return 0;
            ;;
        *src/include/catalog/*)
            ;;
        *)
            return 0;
            ;;
    esac
    # changes include catalog but not catversion.h, so warn about it
    {
        echo 'Commit on master alters catalog but catversion not bumped'
        echo 'It can be forced with git commit --no-verify'
    } >&2
    exit 1
}

cheers

andrew

-- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Support to define custom wait events for extensions
On Thu, Aug 10, 2023 at 05:37:55PM +0900, Michael Paquier wrote: > This looks correct, but perhaps we need to think harder about the > custom event names and define a convention when more of this stuff is > added to the core modules. Okay, I have put my hands on that, fixing a couple of typos, polishing a couple of comments, clarifying the docs and applying an indentation. And here is a v4. Any thoughts or comments? I'd like to apply that soon, so as we are able to move on with the wait event catalog and assigning custom wait events to the other in-core modules. -- Michael From 83e40dace75d0df91e8b7f617bd03d02bd8450d4 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Mon, 14 Aug 2023 07:48:15 +0900 Subject: [PATCH v4] Change custom wait events to use shared hash tables Currently, names of the custom wait event must be registered for each backends, requiring all these to link to the shared memory area of an extention, even if not loaded with shared_preload_libraries. This patch relaxes the constraints by storing the wait events and their names in hash tables in shared memory. First, this has the advantage to simplify the registration of custom wait events to a single routine call that returns an event ID, either existing or incremented: uint32 WaitEventExtensionNew(const char *wait_event_name); Any other backend looking at the wait event information, for instance via pg_stat_activity, is then able to automatically know about the new event name. The code changes done in worker_spi show how things are simplified: - worker_spi_init() is gone. - No more shmem hooks. - The custom wait event ID is cached in the process that needs to set it, with one single call to WaitEventExtensionNew() to retrieve it. Per suggestion from Andres Freund. Author: Masahiro Ikeda Discussion: https://postgr.es/m/20230801032349.aaiuvhtrcvvcw...@awork3.anarazel.de --- src/include/utils/wait_event.h| 18 +- src/backend/storage/lmgr/lwlocknames.txt | 1 + src/backend/utils/activity/wait_event.c | 217 +++--- .../utils/activity/wait_event_names.txt | 1 + .../modules/worker_spi/t/001_worker_spi.pl| 18 +- .../modules/worker_spi/worker_spi--1.0.sql| 5 - src/test/modules/worker_spi/worker_spi.c | 109 + doc/src/sgml/monitoring.sgml | 5 +- doc/src/sgml/xfunc.sgml | 26 +-- src/tools/pgindent/typedefs.list | 2 + 10 files changed, 164 insertions(+), 238 deletions(-) diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h index aad8bc08fa..3cffae23e1 100644 --- a/src/include/utils/wait_event.h +++ b/src/include/utils/wait_event.h @@ -44,12 +44,14 @@ extern PGDLLIMPORT uint32 *my_wait_event_info; * Use this category when the server process is waiting for some condition * defined by an extension module. * - * Extensions can define their own wait events in this category. First, - * they should call WaitEventExtensionNew() to get one or more wait event - * IDs that are allocated from a shared counter. These can be used directly - * with pgstat_report_wait_start() or equivalent. Next, each individual - * process should call WaitEventExtensionRegisterName() to associate a wait - * event string to the number allocated previously. + * Extensions can define their own wait events in this category. They should + * call WaitEventExtensionNew() with a wait event string. If the wait event + * associated the string is already allocated, it returns the wait event info. 
+ * If not, it will get one wait event ID that is allocated from a shared + + counter, associate the string to the number in the shared dynamic hash and + * return the wait event info. + * + * The ID retrieved can be used with pgstat_report_wait_start() or equivalent. */ typedef enum { @@ -60,9 +62,7 @@ typedef enum extern void WaitEventExtensionShmemInit(void); extern Size WaitEventExtensionShmemSize(void); -extern uint32 WaitEventExtensionNew(void); -extern void WaitEventExtensionRegisterName(uint32 wait_event_info, - const char *wait_event_name); +extern uint32 WaitEventExtensionNew(const char *wait_event_name); /* -- * pgstat_report_wait_start() - diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt index b34b6afecd..7af3e0ba1a 100644 --- a/src/backend/storage/lmgr/lwlocknames.txt +++ b/src/backend/storage/lmgr/lwlocknames.txt @@ -53,3 +53,4 @@ XactTruncationLock 44 # 45 was XactTruncationLock until removal of BackendRandomLock WrapLimitsVacuumLock46 NotifyQueueTailLock 47 +WaitEventExtensionLock 48 diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c index b3596ece80..14750233d4 100644 --- a/src/backend/utils/activity/wait_event.c +++ b/src/backend/utils/activity/wait_event.c @@ -45,6 +45,41 @@ uint32 *my_wait_event_info = &local_my_wait_event_info; #define WAIT_E
Re: Make psql's query canceling test simple by using signal() routine of IPC::Run
On Sun, Aug 13, 2023 at 11:22:33AM +0200, Fabien COELHO wrote: > Test run is ok on my Ubuntu laptop. I have a few comments about this patch. On HEAD and even after this patch, we still have the following: SKIP: { skip "cancel test requires a Unix shell", 2 if $windows_os; Could the SKIP be removed for $windows_os? If not, this had better be documented because the reason for the skip becomes incorrect. The comment at the top of the SKIP block still states the following: # There is, as of this writing, no documented way to get the PID of # the process from IPC::Run. As a workaround, we have psql print its # own PID (which is the parent of the shell launched by psql) to a # file. This is also incorrect. -- Michael signature.asc Description: PGP signature
Re: run pgindent on a regular basis / scripted manner
On Sun, Aug 13, 2023 at 10:33:21AM -0400, Andrew Dunstan wrote: > After I'd been caught by this once or twice I implemented a git hook test > for that too - in fact it was the first hook I did. It's not perfect but > it's saved me a couple of times: > > check_catalog_version () { I find that pretty cool. Thanks for sharing. -- Michael signature.asc Description: PGP signature
Re: Report planning memory in EXPLAIN ANALYZE
On Thu, 10 Aug 2023 at 20:33, Ashutosh Bapat wrote: > My point is what's relevant here is how much net memory planner asked > for. But that's not what your patch is reporting. All you're reporting is the difference in memory that's *currently* palloc'd from before and after the planner ran. If we palloc'd 600 exabytes then pfree'd it again, your metric won't change. I'm struggling a bit to understand your goals here. If your goal is to make a series of changes that reduces the amount of memory that's palloc'd at the end of planning, then your patch seems to suit that goal, but per the quote above, it seems you care about how many bytes are palloc'd during planning and your patch does not seem to track that. With your patch as it is, to improve the metric you're reporting we could go off and do things like pfree Paths once createplan.c is done, but really, why would we do that? Just making the "Planning Memory" metric look better doesn't seem like a worthy goal. Instead, if we reported the context's mem_allocated, then it would give us the flexibility to make changes to the memory context code to have the metric look better. It might also alert us to planner inefficiencies and problems with new code that may cause a large spike in the amount of memory that gets allocated. Now, I'm not saying we should add a patch that shows mem_allocated. I'm just questioning if your proposed patch meets the goals you're trying to achieve. I just suggested that you might want to consider something else as a metric for your memory usage reduction work. David
Re: pgbench with libevent?
Pgbench is managing client I/Os manually with select or poll. Much of this could be managed by libevent. Or maybe libuv (used by nodejs?). From preliminary testing, libevent seems not too good at the fine-grained time management that is used for throttling, whereas libuv advertises that it is good at it, although what it actually does is yet to be seen. Note: libev had no updates in 8 years. -- Fabien.
Re: Make psql's query canceling test simple by using signal() routine of IPC::Run
Bonjour Michaël, On Sun, Aug 13, 2023 at 11:22:33AM +0200, Fabien COELHO wrote: Test run is ok on my Ubuntu laptop. I have a few comments about this patch. Argh, sorry! I looked at what was removed (a lot) from the previous version, not at what remained and should also have been removed. -- Fabien.
Naive handling of inequalities by nbtree initial positioning code
Suppose that I create the following index on the tenk1 table from the regression tests: create index on tenk1 (two, four, hundred, thousand, tenthous); Now the following query will be able to use index quals for each column that appear in my composite index: select * from tenk1 where two = 1 and four = 3 and hundred = 91 and thousand = 891 and tenthous = 1891; The query returns one row, and touches 3 buffers/pages (according to EXPLAIN ANALYZE with buffers). The overheads here make perfect sense: there's one root page access, one leaf page access, and a single heap page access. Clearly the nbtree initial positioning code is able to descend to the exact leaf page (and page offset) where the first possible match could be found. Pretty standard stuff. But if I rewrite this query to use an inequality, the picture changes. If I replace "four = 3" with "four > 2", I get a query that is very similar to the original (that returns the same single row): select * from tenk1 where two = 1 and four > 2 and hundred = 91 and thousand = 891 and tenthous = 1891; This time our query touches 16 buffers/pages. That's a total of 15 index pages accessed, of which 14 are leaf pages. We'll needlessly plow through an extra 13 leaf pages, before finally arriving at the first leaf page that might *actually* have a real match for us. We can and should find a way for the second query to descend to the same leaf page directly, so that the physical access patterns match those that we saw with the first query. Only the first query can use an insertion scan key with all 4 attributes filled in to find its initial scan position. The second query uses an insertion scan key with values set for the first 2 index columns (on two and four) only. EXPLAIN offers no hint that this is what happens -- the "Index Cond:" shown for each query is practically identical. It seems to me that Markus Winand had a very good point when he complained that we don't expose this difference directly (e.g., by identifying which columns appear in "Access predicates" and which columns are merely "Index filter predicates") [1]. That would make these kinds of issues a lot more obvious. The nbtree code is already capable of tricks that are close enough to what I'm thinking of here. Currently, nbtree's initial positioning code will set BTScanInsertData.nextkey=false for the first query (where BTScanInsertData.keysz=4), and BTScanInsertData.nextkey=true for the second query (where BTScanInsertData.keysz=2 right now). So the second query I came up with does at least manage to locate the leaf page where "four = 3" tuples begin, even today -- its "four > 2" inequality is at least "handled efficiently". The inefficiencies come from how nbtree handles the remaining two index columns when building an insertion scan key for our initial descent. nbtree will treat the inequality as making it unsafe to include further values for the remaining two attributes, which is the real source of the extra leaf page scans (though of course the later attributes are still usable as search-type scan keys). But it's *not* unsafe to position ourselves on the right leaf page from the start. Not really. All that it would take to fix the problem is per-attribute BTScanInsertData.nextkey values. There is no reason why "nextkey" semantics should only work for the last attribute in the insertion scan key. 
Under this scheme, _bt_first() would be taught to set up the insertion scan key with (say) nextkey=true for the "four > 2" attribute, and nextkey=false for the other 3 attributes (since we do that whenever >= or = are used). It would probably also make sense to generalize this approach to handle (say) a third query that had a "four < 2" inequality, but otherwise matched the first two queries. So we wouldn't literally use multiple "nextkey" fields to do this. The most general approach seems to require that we teach insertion scan key routines like _bt_compare() about "goback" semantics, which must also work at the attribute granularity. So we'd probably replace both "nextkey" and "goback" with something new and more general. I already wrote a patch (still in the CF queue) to teach nbtree insertion scan keys about "goback" semantics [2] (whose use would still be limited to backwards scans), so that we'd avoid needlessly accessing extra pages in so-called boundary cases (which seems like a much less important problem than the one I've highlighted here). That existing patch already removed code in _bt_first that handled "stepping back" once we're on the leaf level. ISTM that the right place to do stuff like that is in routines like _bt_search, _bt_binsrch, and _bt_compare -- not in _bt_first. The code around _bt_compare seems like it would benefit from having more of this sort of context. Having the full picture matters both when searching internal pages and leaf pages. Thoughts? Was this issue discussed at some point in the past? [1] https://use-the-index-luke.com/sql/expl
Re: Naive handling of inequalities by nbtree initial positioning code
On Sun, Aug 13, 2023 at 5:50 PM Peter Geoghegan wrote: > select * from tenk1 > where > two = 1 > and four = 3 > and hundred = 91 > and thousand = 891 > and tenthous = 1891; > > The query returns one row, and touches 3 buffers/pages (according to > EXPLAIN ANALYZE with buffers). The overheads here make perfect sense: > there's one root page access, one leaf page access, and a single heap > page access. Clearly the nbtree initial positioning code is able to > descend to the exact leaf page (and page offset) where the first > possible match could be found. Pretty standard stuff. I probably should have made this first query use "four >= 3" instead of using "four = 3" (while still using "four > 2" for the second, "bad" query). The example works a bit better that way because now the queries are logically equivalent, and yet still have this big disparity. (We get 4 buffer hits for the "good" >= query, but 16 buffer hits for the equivalent "bad" > query.) -- Peter Geoghegan
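For anyone wanting to reproduce the buffer-count difference described above, a minimal sketch against the regression test database; the index definition and queries are taken from the messages above, and the buffer counts quoted came from EXPLAIN ANALYZE with the BUFFERS option, so exact numbers may vary slightly between builds:

create index on tenk1 (two, four, hundred, thousand, tenthous);

-- "good" query: an equality (or >=) condition on every index column,
-- roughly 4 buffer hits
explain (analyze, buffers)
select * from tenk1
where two = 1 and four >= 3 and hundred = 91
  and thousand = 891 and tenthous = 1891;

-- logically equivalent "bad" query: the > inequality stops the insertion
-- scan key at the second column, so many extra leaf pages are scanned,
-- roughly 16 buffer hits
explain (analyze, buffers)
select * from tenk1
where two = 1 and four > 2 and hundred = 91
  and thousand = 891 and tenthous = 1891;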
Re: pg_waldump vs. all-zeros WAL files; server creation of such files
On Sat, Aug 12, 2023 at 08:15:31PM -0700, Noah Misch wrote: > The attached 010_zero.pl, when run as part of the pg_waldump test suite, fails > at today's master (c36b636) and v15 (1bc19df). It passes at v14 (5a32af3). > Command "pg_waldump --start 0/0100 --end 0/01000100" fails as follows: > > pg_waldump: error: WAL segment size must be a power of two between > 1 MB and 1 GB, but the WAL file "00010002" header > specifies 0 bytes So this depends on the ordering of the entries retrieved by readdir() as much as the segments produced by the backend. > Where it fails, the server has created an all-zeros WAL file under that name. > Where it succeeds, that file doesn't exist at all. Two decisions to make: > > - Should a clean server shutdown ever leave an all-zeros WAL file? I think > yes, it's okay to let that happen. It doesn't hurt to leave that around. On the contrary, it makes any follow-up startup cheaper the bigger the segment size. > - Should "pg_waldump --start $X --end $Y" open files not needed for the > requested range? I think no. So this is a case where identify_target_directory() is called with a fname of NULL. Agreed that search_directory could be smarter with the files it should scan, especially if we have start and/or end LSNs at hand to filter out what we'd like to be in the data folder. -- Michael signature.asc Description: PGP signature
Re: [PoC] pg_upgrade: allow to upgrade publisher node
On Sat, Aug 12, 2023, 15:20 Amit Kapila wrote: > On Fri, Aug 11, 2023 at 11:38 PM Bruce Momjian wrote: > > > > On Fri, Aug 11, 2023 at 10:46:31AM +0530, Amit Kapila wrote: > > > On Thu, Aug 10, 2023 at 7:07 PM Masahiko Sawada > wrote: > > > > What I imagined is that we do this check before > > > > check_and_dump_old_cluster() while the server is 'off'. Reading the > > > > slot state file would be simple and I guess we would not need a tool > > > > or cli program for that. We need to expose ReplicationSlotOnDisk, > > > > though. > > > > > > Won't that require a lot of version-specific checks as across versions > > > the file format could be different? For the case of the control file, > > > we use version-specific pg_controldata (for the old cluster, the > > > corresponding version's pg_controldata) utility to read the old > > > version control file. I thought we need something similar here if we > > > want to do what you are suggesting. > > > > You mean the slot file format? > > Yes. > > > > > We will need that complexity somewhere, > > so why not in pg_upgrade? > > > > I don't think we need the complexity of version-specific checks if we > do what we do in get_control_data(). Basically, invoke > version-specific pg_replslotdata to get version-specific slot > information. There has been a proposal for a tool like that [1]. Do > you have something better in mind? If so, can you please explain the > same a bit more? > Yeah, we need something like pg_replslotdata. If there are other useful use cases for this tool, it would be good to have it. But I'm not sure about use cases other than pg_upgrade. Another idea is (which might have already been discussed, though) that we check if the latest shutdown checkpoint LSN in the control file matches the confirmed_flush_lsn in the pg_replication_slots view. That way, we can ensure that the slot has consumed all WAL records before the last shutdown. We don't need to worry about WAL records generated after starting the old cluster during the upgrade, at least for logical replication slots. Regards,
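To make that last idea a bit more concrete, a rough sketch of the proposed check (purely illustrative; the real comparison would be against the "Latest checkpoint location" that the old cluster's pg_controldata reports, for which '0/1570F58' below is only a made-up placeholder):

-- any logical slot that has not flushed WAL up to the shutdown checkpoint
-- would make upgrading that slot unsafe
select slot_name, confirmed_flush_lsn
from pg_replication_slots
where slot_type = 'logical'
  and confirmed_flush_lsn <> '0/1570F58'::pg_lsn;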
Extending SMgrRelation lifetimes
Hi, SMgrRelationData objects don't currently have a defined lifetime, so it's hard to know when the result of smgropen() might become a dangling pointer. This has caused a few bugs in the past, and the usual fix is to just call smgropen() more often and not hold onto pointers. If you're doing that frequently enough, the hash table lookups can show up in profiles. I'm interested in this topic for more than just micro-optimisations, though: in order to be able to batch/merge smgr operations, I'd like to be able to collect them in data structures that survive more than just a few lines of code. (Examples to follow in later emails). The simplest idea seems to be to tie object lifetime to transactions using the existing AtEOXact_SMgr() mechanism. In recovery, the obvious corresponding time would be the commit/abort record that destroys the storage. This could be achieved by extending smgrrelease(). That was a solution to the same problem in a narrower context: we didn't want CFIs to randomly free SMgrRelations, but we needed to be able to force-close fds in other backends, to fix various edge cases. The new idea is to overload smgrrelease(it) so that it also clears the owner, which means that AtEOXact_SMgr() will eventually smgrclose(it), unless it is re-owned by a relation before then. That choice stems from the complete lack of information available via sinval in the case of an overflow. We must (1) close all descriptors because any file might have been unlinked, (2) keep all pointers valid and yet (3) not leak dropped smgr objects forever. In this patch, smgrreleaseall() achieves those goals. Proof-of-concept patch attached. Are there holes in this scheme? Better ideas welcome. In terms of spelling, another option would be to change the behaviour of smgrclose() to work as described, ie it would call smgrrelease() and then also disown, so we don't have to change most of those callers, and then add a new function smgrdestroy() for the few places that truly need it. Or something like that. Other completely different ideas I've bounced around with various hackers and decided against: references counts, "holder" objects that can be an "owner" (like Relation, but when you don't have a Relation) but can re-open on demand. Seemed needlessly complicated. While studying this I noticed a minor thinko in smgrrelease() in 15+16, so here's a fix for that also. I haven't figured out a sequence that makes anything bad happen, but we should really invalidate smgr_targblock when a relfilenode is reused, since it might point past the end. This becomes more obvious once smgrrelease() is used for truncation, as proposed here. From 844e56e9ea9a56e04813886a6f7ded19a3af7f90 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Mon, 14 Aug 2023 11:01:28 +1200 Subject: [PATCH 1/2] Invalidate smgr_targblock on release. In rare circumstances involving relfilenode reuse, it may be possible for smgr_targblock to finish up pointing past the end. Oversight in b74e94dc. Back-patch to 15. 
--- src/backend/storage/smgr/smgr.c | 1 + 1 file changed, 1 insertion(+) diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c index f76c4605db..65e7436306 100644 --- a/src/backend/storage/smgr/smgr.c +++ b/src/backend/storage/smgr/smgr.c @@ -295,6 +295,7 @@ smgrrelease(SMgrRelation reln) { smgrsw[reln->smgr_which].smgr_close(reln, forknum); reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber; + reln->smgr_targblock = InvalidBlockNumber; } } -- 2.39.2 From b18fea5f263c46f87b84e242ff228cf1ab97b3e8 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Thu, 10 Aug 2023 18:16:31 +1200 Subject: [PATCH 2/2] Give SMgrRelation pointers a well-defined lifetime. After calling smgropen(), it was not clear how long you could continue to use the result, because various code paths including cache invalidation could call smgrclose(). Guarantee that the object won't be freed until the end of the current transaction, or in recovery, the commit/abort record that destroys the storage. This is achieved by making most places that previously called smgrclose() use smgrrelease() instead, and giving smgrrelease() the additional task of 'disowning' the relation so that it is closed at end-of-transaction. That allows all pointers to remain valid, even in the case of sinval overflow where we have no information about which relations have been dropped or truncated. --- src/backend/access/heap/heapam_handler.c | 6 +++--- src/backend/access/heap/visibilitymap.c | 2 +- src/backend/access/transam/xlogutils.c | 6 +++--- src/backend/catalog/storage.c| 2 +- src/backend/commands/cluster.c | 4 ++-- src/backend/commands/dbcommands.c| 2 +- src/backend/commands/sequence.c | 2 +- src/backend/commands/tablecmds.c | 4 ++-- src/backend/storage/buffer/bufmgr.c | 4 ++-- src/backend/storage/smgr/smgr.c | 21 ++-- src/backend/utils/cache/
Re: Report planning memory in EXPLAIN ANALYZE
On 14/8/2023 06:53, David Rowley wrote: On Thu, 10 Aug 2023 at 20:33, Ashutosh Bapat wrote: My point is what's relevant here is how much net memory planner asked for. But that's not what your patch is reporting. All you're reporting is the difference in memory that's *currently* palloc'd from before and after the planner ran. If we palloc'd 600 exabytes then pfree'd it again, your metric won't change. I'm struggling a bit to understand your goals here. If your goal is to make a series of changes that reduces the amount of memory that's palloc'd at the end of planning, then your patch seems to suit that goal, but per the quote above, it seems you care about how many bytes are palloc'd during planning and your patch does not seem track that. With your patch as it is, to improve the metric you're reporting we could go off and do things like pfree Paths once createplan.c is done, but really, why would we do that? Just to make the "Planning Memory" metric looks better doesn't seem like a worthy goal. Instead, if we reported the context's mem_allocated, then it would give us the flexibility to make changes to the memory context code to have the metric look better. It might also alert us to planner inefficiencies and problems with new code that may cause a large spike in the amount of memory that gets allocated. Now, I'm not saying we should add a patch that shows mem_allocated. I'm just questioning if your proposed patch meets the goals you're trying to achieve. I just suggested that you might want to consider something else as a metric for your memory usage reduction work. Really, the current approach with the final value of consumed memory smooths peaks of memory consumption. I recall examples likewise massive million-sized arrays or reparameterization with many partitions where the optimizer consumes much additional memory during planning. Ideally, to dive into the planner issues, we should have something like a report-in-progress in the vacuum, reporting on memory consumption at each subquery and join level. But it looks too much for typical queries. -- regards, Andrey Lepikhov Postgres Professional
Re: pgbench with libevent?
On Mon, Aug 14, 2023 at 12:35 PM Fabien COELHO wrote: > > Pgbench is managing clients I/Os manually with select or poll. Much of this > > could be managed by libevent. > > Or maybe libuv (used by nodejs?). > > From preliminary testing libevent seems not too good at fine grain time > management which are used for throttling, whereas libuv advertised that it > is good at it, although what it does is yet to be seen. Do you think our WaitEventSet stuff could be good here, if made frontend-friendly?
Re: Support to define custom wait events for extensions
On 2023-08-14 08:06, Michael Paquier wrote: On Thu, Aug 10, 2023 at 05:37:55PM +0900, Michael Paquier wrote: This looks correct, but perhaps we need to think harder about the custom event names and define a convention when more of this stuff is added to the core modules. Okay, I have put my hands on that, fixing a couple of typos, polishing a couple of comments, clarifying the docs and applying an indentation. And here is a v4. Any thoughts or comments? I'd like to apply that soon, so as we are able to move on with the wait event catalog and assigning custom wait events to the other in-core modules. Thanks! I confirmed the changes, and all tests passed. Regards, -- Masahiro Ikeda NTT DATA CORPORATION
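As a usage note, once an extension has called WaitEventExtensionNew("name") and reports the returned event ID, the event becomes visible under the "Extension" wait event type. A sketch of the kind of monitoring query the worker_spi TAP test in the v4 patch effectively polls (assuming a backend is currently waiting inside the extension):

select pid, wait_event_type, wait_event
from pg_stat_activity
where wait_event_type = 'Extension';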
proposal: jsonb_populate_array
Hi, Right now there is no native functionality for converting a json(b) value to an array. https://stackoverflow.com/questions/76894960/unable-to-assign-text-value-to-variable-in-pgsql/76896112#76896112 It should not be too hard to implement a native function jsonb_populate_array: jsonb_populate_array(anyarray, jsonb) returns anyarray Usage: select jsonb_populate_array(null::text[], '["cust_full_name","cust_email"]') Comments, notes? Regards Pavel
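For comparison, the closest workaround that works today (a sketch assuming the jsonb value is a flat array of scalar elements) is to unnest the array and re-aggregate it, which is roughly what the proposed function would wrap up with an explicit target element type:

select array(
  select jsonb_array_elements_text('["cust_full_name","cust_email"]'::jsonb)
)::text[];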
Re: Naive handling of inequalities by nbtree initial positioning code
On Sun, Aug 13, 2023 at 5:50 PM Peter Geoghegan wrote: > All that it would take to fix the problem is per-attribute > BTScanInsertData.nextkey values. There is no reason why "nextkey" > semantics should only work for the last attribute in the insertion > scan key. Under this scheme, _bt_first() would be taught to set up the > insertion scan key with (say) nextkey=true for the "four > 2" > attribute, and nextkey=false for the other 3 attributes (since we do > that whenever >= or = are used). It would probably also make sense to > generalize this approach to handle (say) a third query that had a > "four < 2" inequality, but otherwise matched the first two queries. So > we wouldn't literally use multiple "nextkey" fields to do this. Actually, that can't work when there are a huge number of index tuples with the same values for "four" (enough to span many internal pages). So we'd need specialized knowledge of the data type (probably from an opclass support function) to transform "four > 2" into "four >= 3" up front. Alternatively, we could do roughly the same thing via an initial index probe to do the same thing. The latter approach would be needed for continuous data types, where the transformation isn't possible at all. The probing approach could work by finding an initial position in the same way as we currently locate an initial leaf page -- the way that I complained about earlier on, but with an important twist. Once we'd established that the first "four" value in the index > 2 really was 3 (or whatever it turned out to be), we could fill that value into a new insertion scan key. It would then be possible to do another descent of the index, skipping over most of the leaf pages that we'll access needlessly right now. (Actually, we'd only do all that when it seemed likely to allow us to skip a significant number of intermediate leaf pages -- which is what we saw in my test case.) This is directly related to skip scan. The former approach is more or less what the MDAM paper calls "dense" access (which is naturally limited to discrete data types like integer), while the latter probing approach is what it calls "sparse" access. Skip scan performs this process repeatedly, most of the time, but we'd only skip once here. In fact, if my example had used (say) "four > 1" instead, then it would have made sense to skip multiple times -- not just once, after an initial descent. Because then we'd have had to consider matches for both "two=1 and four=2" and "two=1 and four=3" (there aren't any "two=1 and four=4" matches so we'd then be done). In fact, had there been no mention of the "four" column in the query whatsoever (which is how we tend to think of skip scan), then a decent implementation of skip scan would effectively behave as if the query had been written "two=1 and four > -inf and ...", while following the same general approach. (Or "two=1 and four < +inf and ...", if this was a similar looking backwards scan.) -- Peter Geoghegan
Re: WIP: new system catalog pg_wait_event
On Thu, Aug 10, 2023 at 08:09:34PM +0200, Drouvot, Bertrand wrote: > Agree that's worth it given the fact that iterating one more time is not that > costly here. I have reviewed v4, and finished by putting my hands on it to see what I could do. +printf $wc "\telement = (wait_event_element *) palloc(sizeof(wait_event_element));\n"; + +printf $wc "\telement->wait_event_type = \"%s\";\n", $last; +printf $wc "\telement->wait_event_name = \"%s\";\n", $wev->[1]; +printf $wc "\telement->wait_event_description = \"%s\";\n\n", $new_desc; + +printf $wc "\twait_event = lappend(wait_event, element);\n\n"; +} This is simpler than the previous versions, still I am not much a fan of implying the use of a list and these pallocs. There are two things that we could do: - Hide that behind a macro defined in wait_event_funcs.c. - Feed the data generated here into a static structure, like: +static const struct +{ + const char *type; + const char *name; + const char *description; +} After experimenting with both, I've found the latter a tad cleaner, so the attached version does that. + description texte This one was difficult to see.. I am not sure that "pg_wait_event" is a good idea for the name if the new view. How about "pg_wait_events" instead, in plural form? There is more than one wait event listed. One log entry in Solution.pm has missed the addition of a reference to wait_event_funcs_data.c. And.. src/backend/Makefile missed two updates for maintainer-clean & co, no? One thing that the patch is still missing is the handling of custom wait events for extensions. So this still requires more code. I was thinking about listed these events as: - Type: "Extension" - Name: What a module has registered. - Description: "Custom wait event \"%Name%\" defined by extension". For now I am attaching a v5. -- Michael From 4f0172e216bfa7b929b9ca3465d66088b2ac1566 Mon Sep 17 00:00:00 2001 From: bdrouvotAWS Date: Sat, 5 Aug 2023 12:39:42 + Subject: [PATCH v5] Add catalog pg_wait_events Adding a new system view, namely pg_wait_event, that describes the wait events. 
--- src/include/catalog/pg_proc.dat | 6 ++ src/include/utils/meson.build | 4 +- src/backend/Makefile | 3 +- src/backend/catalog/system_views.sql | 3 + src/backend/utils/activity/.gitignore | 1 + src/backend/utils/activity/Makefile | 8 ++- .../activity/generate-wait_event_types.pl | 46 +++-- src/backend/utils/activity/meson.build| 1 + src/backend/utils/activity/wait_event_funcs.c | 69 +++ src/test/regress/expected/rules.out | 4 ++ src/test/regress/expected/sysviews.out| 16 + src/test/regress/sql/sysviews.sql | 4 ++ doc/src/sgml/system-views.sgml| 64 + src/tools/msvc/Solution.pm| 3 +- src/tools/msvc/clean.bat | 1 + 15 files changed, 223 insertions(+), 10 deletions(-) create mode 100644 src/backend/utils/activity/wait_event_funcs.c diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index 6996073989..1a942c678c 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -5417,6 +5417,12 @@ proargmodes => '{i,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}', proargnames => '{pid,datid,pid,usesysid,application_name,state,query,wait_event_type,wait_event,xact_start,query_start,backend_start,state_change,client_addr,client_hostname,client_port,backend_xid,backend_xmin,backend_type,ssl,sslversion,sslcipher,sslbits,ssl_client_dn,ssl_client_serial,ssl_issuer_dn,gss_auth,gss_princ,gss_enc,gss_delegation,leader_pid,query_id}', prosrc => 'pg_stat_get_activity' }, +{ oid => '8403', descr => 'describe wait events', + proname => 'pg_get_wait_events', procost => '10', prorows => '100', + proretset => 't', provolatile => 's', prorettype => 'record', + proargtypes => '', proallargtypes => '{text,text,text}', + proargmodes => '{o,o,o}', proargnames => '{type,name,description}', + prosrc => 'pg_get_wait_events' }, { oid => '3318', descr => 'statistics: information about progress of backends running maintenance command', proname => 'pg_stat_get_progress_info', prorows => '100', proretset => 't', diff --git a/src/include/utils/meson.build b/src/include/utils/meson.build index 6de5d93799..c179478611 100644 --- a/src/include/utils/meson.build +++ b/src/include/utils/meson.build @@ -1,6 +1,6 @@ # Copyright (c) 2022-2023, PostgreSQL Global Development Group -wait_event_output = ['wait_event_types.h', 'pgstat_wait_event.c'] +wait_event_output = ['wait_event_types.h', 'pgstat_wait_event.c', 'wait_event_funcs_data.c'] wait_event_target = custom_target('wait_event_names', input: files('../../backend/utils/activity/wait_event_names.txt'), output: wait_event_outp
Re: [PoC] pg_upgrade: allow to upgrade publisher node
On Mon, Aug 14, 2023 at 7:57 AM Masahiko Sawada wrote: > > On Sat, Aug 12, 2023, 15:20 Amit Kapila wrote: >> >> I don't think we need the complexity of version-specific checks if we >> do what we do in get_control_data(). Basically, invoke >> version-specific pg_replslotdata to get version-specific slot >> information. There has been a proposal for a tool like that [1]. Do >> you have something better in mind? If so, can you please explain the >> same a bit more? > > > Yeah, we need something like pg_replslotdata. If there are other useful > usecases for this tool, it would be good to have it. But I'm not sure other > than pg_upgrade usecase. > > Another idea is (which might have already discussed thoguh) that we check if > the latest shutdown checkpoint LSN in the control file matches the > confirmed_flush_lsn in pg_replication_slots view. That way, we can ensure > that the slot has consumed all WAL records before the last shutdown. We don't > need to worry about WAL records generated after starting the old cluster > during the upgrade, at least for logical replication slots. > Right, this is somewhat closer to what Patch is already doing. But remember in this case we need to remember and use the latest checkpoint from the control file before the old cluster is started because otherwise the latest checkpoint location could be even updated during the upgrade. So, instead of reading from WAL, we need to change so that we rely on the control file's latest LSN. I would prefer this idea than to invent a new API/tool like pg_replslotdata. The other point you and Bruce seem to be favoring is that instead of dumping/restoring slots via pg_dump, we remember the required information of slots retrieved during their validation in pg_upgrade itself and use that to create the slots in the new cluster. Though I am not aware of doing similar treatment for other objects we restore in this case it seems reasonable especially because slots are not stored in the catalog and we anyway already need to retrieve the required information to validate them, so trying to again retrieve it via pg_dump doesn't seem useful unless I am missing something. Does this match your understanding? Yet another thing I am trying to consider is whether we can allow to upgrade slots from 16 or 15 to later versions. As of now, the patch has the following check: getLogicalReplicationSlots() { ... + /* Check whether we should dump or not */ + if (fout->remoteVersion < 17) + return; ... } If we decide to use the existing view pg_replication_slots then can we consider upgrading slots from the prior version to 17? Now, if we want to invent any new API similar to pg_replslotdata then we can't do this because it won't exist in prior versions but OTOH using existing view pg_replication_slots can allow us to fetch slot info from older versions as well. So, I think it is worth considering. Thoughts? -- With Regards, Amit Kapila.
Re: [PoC] pg_upgrade: allow to upgrade publisher node
On Thu, Aug 10, 2023 at 8:32 PM Hayato Kuroda (Fujitsu) wrote: > > Based on recent discussions, I updated the patch set. I did not reply one by > one > because there are many posts, but thank you for giving many suggestion! > > Followings shows what I changed. > > 1. > This feature is now enabled by default. Instead > "--exclude-logical-replication-slots" > was added. (Per suggestions like [1]) > AFAICS, we don't have any concrete agreement on such an option but my vote is to not have such an option as we don't have any similar option for any other object. I understand that it could be convenient for some use cases where some of the logical slots are not yet caught up w.r.t WAL and users want to upgrade without the slots but not sure if that is really the case. Does anyone else have an opinion on this point? -- With Regards, Amit Kapila.
Re: pgbench with libevent?
> On Mon, Aug 14, 2023 at 12:35 PM Fabien COELHO wrote: >> > Pgbench is managing clients I/Os manually with select or poll. Much of this >> > could be managed by libevent. >> >> Or maybe libuv (used by nodejs?). >> >> From preliminary testing libevent seems not too good at fine grain time >> management which are used for throttling, whereas libuv advertised that it >> is good at it, although what it does is yet to be seen. > > Do you think our WaitEventSet stuff could be good here, if made > frontend-friendly? Interesting. In my understanding, this would also require making Latch frontend-friendly? Best regards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese: http://www.sraoss.co.jp
Re: Support to define custom wait events for extensions
On Mon, Aug 14, 2023 at 12:31:05PM +0900, Masahiro Ikeda wrote: > Thanks! I confirmed the changes, and all tests passed. Okay, cool. I got some extra time today and applied that, with a few more tweaks. -- Michael signature.asc Description: PGP signature
Re: Adding a LogicalRepWorker type field
The main patch for adding the worker type enum has been pushed [1]. Here is the remaining (rebased) patch for changing some previous cascading if/else to switch on the LogicalRepWorkerType enum instead. PSA v8. -- [1] https://github.com/postgres/postgres/commit/2a8b40e3681921943a2989fd4ec6cdbf8766566c Kind Regards, Peter Smith. Fujitsu Australia v8-0001-Switch-on-worker-type.patch Description: Binary data